# Final Project Phase 2 Summary
This Jupyter Notebook (.ipynb) will serve as the skeleton file for your submission for Phase 2 of the Final Project. Answer all statements addressed below as specified in the instructions for the project, covering all necessary details. Please be clear and concise in your answers. Each response should be at most 3 sentences. Good luck! <br><br>

Note: To edit a Markdown cell, double-click on its text.

## Jupyter Notebook Quick Tips
Here are some quick formatting tips to get you started with Jupyter Notebooks. This is by no means exhaustive, and there are plenty of articles to highlight other things that can be done. We recommend using HTML syntax for Markdown but there is also Markdown syntax that is more streamlined and might be preferable.
<a href = "https://towardsdatascience.com/markdown-cells-jupyter-notebook-d3bea8416671">Here's an article</a> that goes into more detail. (Double-click on cell to see syntax)

# Heading 1
## Heading 2
### Heading 3
#### Heading 4
<br>
<b>BoldText</b> or <i>ItalicText</i>
<br> <br>
Math Formulas: $x^2 + y^2 = 1$
<br> <br>
Line Breaks are done using br enclosed in < >.
<br><br>
Hyperlinks are done with: <a> https://www.google.com </a> or
<a href="http://www.google.com">Google</a><br>

# Data Collection and Cleaning
You are required to provide data collection and cleaning for the three (3) minimum datasets. Create a function for each of the following sections that reads or scrapes data from a file or website, manipulates and cleans the parsed data, and writes the cleaned data into a new file.

Make sure your data cleaning and manipulation process is not too simple. Performing complex manipulation and using modules not taught in class shows effort, which will increase the chance of receiving full credit.


## Data Sources
Include sources (as links) to your datasets. Add any additional data sources if needed. Clearly indicate if a data source is different from one submitted in your Phase I, as we will check that it satisfies the requirements.

* *All of our data sets are different.*

*   Downloaded Dataset Source: https://ourworldindata.org/water-access  
*   Web Collection #1 Source: https://pkgstore.datahub.io/world-bank/en.atm.pm25.mc.zs/data_json/data/29de7393f85d3518de927c2a2e54bc6e/data_json.json
*   Web Collection #2 Source: https://www.indexmundi.com/facts/indicators/SH.STA.MALN.MA.ZS/rankings

    https://www.indexmundi.com/facts/indicators/SH.STA.MALN.FE.ZS/rankings

*   Additional Downloaded Data Set : https://data.worldbank.org/indicator/SP.DYN.IMRT.IN

## Downloaded Dataset Requirement

Fill in the predefined functions with your data scraping/parsing code. You may modify/rename each function as you see fit, but you must provide at least 3 separate functions that clean each of your required datasets.


In [None]:
import csv, json, requests, re
from pprint import pprint
from bs4 import BeautifulSoup
import pandas as pd

def water_parser(filename):
  # Read every row of the downloaded CSV into memory.
  reader_list = []
  with open(filename, encoding = "utf8") as csvfile:
      reader = csv.reader(csvfile)
      reader_list = list(reader)

  # Separate the header row from the data rows.
  header = reader_list[0]
  reader_list = reader_list[1:]

  final = []
  for i in reader_list:
      # Keep only 2017 rows. The deaths value is converted after the year
      # check so that rows from other years cannot break float().
      if i[2] == "2017":
          country = i[0] + " 2017"
          deaths = round(float(i[3]),2)
          final += [[country, deaths]]
  final_pd = pd.DataFrame(data = final, columns = ["Country", "Deaths"])
  final_pd.to_csv("waterdata.csv")
  return final_pd
############ Function Call ############
water_parser("NumberofDeathsbyWater.csv")

| | Country | Deaths |
|---|---|---|
| 0 | Afghanistan 2017 | 5256.65 |
| 1 | Albania 2017 | 4.09 |
| 2 | Algeria 2017 | 189.94 |
| 3 | American Samoa 2017 | 0.72 |
| 4 | Andean Latin America 2017 | 1257.99 |
| ... | ... | ... |
| 226 | Western Sub-Saharan Africa 2017 | 232494.40 |
| 227 | World 2017 | 1232368.28 |
| 228 | Yemen 2017 | 6435.59 |
| 229 | Zambia 2017 | 6288.67 |

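The year filter and rounding in `water_parser` can also be written directly in pandas. This is a minimal sketch, assuming the CSV columns are ordered entity, code, year, deaths, as the indexing in `water_parser` implies. The function name `filter_year` and the sample rows are hypothetical and not taken from the real dataset.

```python
import pandas as pd

def filter_year(rows, year="2017"):
    """Keep rows for one year and round the deaths column to 2 decimals.

    `rows` is a list of [entity, code, year, deaths] string lists,
    mirroring the column order water_parser indexes into (an assumption).
    """
    df = pd.DataFrame(rows, columns=["Entity", "Code", "Year", "Deaths"])
    df = df[df["Year"] == year].copy()
    # Convert only the rows that survived the filter, so an empty
    # deaths cell in another year cannot raise a ValueError.
    df["Deaths"] = df["Deaths"].astype(float).round(2)
    df["Country"] = df["Entity"] + " " + year
    return df[["Country", "Deaths"]].reset_index(drop=True)

# Hypothetical sample rows (made-up names and values).
sample = [
    ["Atlantis", "ATL", "2016", ""],
    ["Atlantis", "ATL", "2017", "12.3456"],
    ["Lemuria", "LEM", "2017", "0.719"],
]
result = filter_year(sample)
print(result)
```

Filtering before converting means a blank value in a year that is discarded anyway never reaches `float`.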

## Web Collection Requirement \#1


In [None]:
def pollution_parser(api):
  res = requests.get(api)
  json_data = res.json()

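  # Drop the leading region rows: keep everything from the first
  # "Afghanistan" entry onward, then stop searching.
  for i in range(len(json_data)):
      if json_data[i]["Country Name"] == "Afghanistan":
        json_data = json_data[i:]
        break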

  data_pd = pd.DataFrame(data = json_data)
  data_pd.set_index("Country Name", inplace = True)

  data_pd = data_pd.drop(["Country Code"], axis = 1)
  data_pd = data_pd[data_pd['Year'] == 2015]
  data_pd.to_csv("pollutiondata.csv")
  return data_pd
############ Function Call ############
pollution_parser("https://pkgstore.datahub.io/world-bank/en.atm.pm25.mc.zs/data_json/data/29de7393f85d3518de927c2a2e54bc6e/data_json.json")

| Country Name | Value | Year |
|---|---|---|
| Arab World | 100.000000 | 2015 |
| Caribbean small states | 99.986694 | 2015 |
| Central Europe and the Baltics | 98.727719 | 2015 |
| Early-demographic dividend | 99.890981 | 2015 |
| East Asia & Pacific | 98.592512 | 2015 |
| ... | ... | ... |
| Virgin Islands (U.S.) | 100.000000 | 2015 |
| West Bank and Gaza | 100.000000 | 2015 |
| Yemen, Rep. | 100.000000 | 2015 |
| Zambia | 100.000000 | 2015 |

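The idea behind the "Afghanistan" check in `pollution_parser` is to discard the regional rows that come before the first country. A minimal sketch of that slicing step on in-memory records follows. The helper name `drop_leading_aggregates` and the sample values are hypothetical, and the sketch assumes every aggregate row comes before the first country row.

```python
def drop_leading_aggregates(records, first_country="Afghanistan"):
    """Return the records starting at the first entry for `first_country`.

    Assumes that all aggregate rows (regions, income groups) come before
    the first country row. If the marker is never found, the records are
    returned unchanged.
    """
    for i, record in enumerate(records):
        if record["Country Name"] == first_country:
            return records[i:]
    return records

# Hypothetical sample records shaped like the JSON feed (made-up values).
sample = [
    {"Country Name": "Arab World", "Year": 2015, "Value": 100.0},
    {"Country Name": "World", "Year": 2015, "Value": 91.0},
    {"Country Name": "Afghanistan", "Year": 2015, "Value": 100.0},
    {"Country Name": "Albania", "Year": 2015, "Value": 99.5},
]
countries = drop_leading_aggregates(sample)
print([r["Country Name"] for r in countries])
```

Returning from inside the `if` ensures the search stops only once the marker row has actually been found.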

## Web Collection Requirement \#2

In [None]:
import csv, json, requests, re
from pprint import pprint
from bs4 import BeautifulSoup
import pandas as pd

def rankings_parser(url):
    # Scrape one rankings table into {country: (percent text, year text)}.
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    data = {}
    for row in soup.find_all("tr"):
        name = row.find("a")
        # Only rows that link to a country hold data.
        if name is not None:
            percent = row.find("td", {"class": "r"})
            year = row.find_all("td")[3]
            data[name.text] = (percent.text, year.text)
    return data

def nd_parser():
    male_data = rankings_parser("https://www.indexmundi.com/facts/indicators/SH.STA.MALN.MA.ZS/rankings")
    female_data = rankings_parser("https://www.indexmundi.com/facts/indicators/SH.STA.MALN.FE.ZS/rankings")

    # Keep only the countries that appear in both tables.
    all_names = list(set(male_data) & set(female_data))
    combined_data = []
    for i in all_names:
        # The year column is taken from the male table.
        combined_data += [[i, float(male_data[i][0]), float(female_data[i][0]), int(male_data[i][1])]]

    final_pd = pd.DataFrame(data = combined_data, columns = ["Country", "Male%", "Female%", "Year"])
    final_pd.to_csv("nddata.csv")
    return final_pd

############ Function Call ############
nd_parser()

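| | Country | Male% | Female% | Year |
|---|---|---|---|---|
| 0 | Belarus | 1.5 | 1.0 | 2005 |
| 1 | Japan | 3.5 | 3.3 | 2010 |
| 2 | Jamaica | 2.8 | 2.1 | 2012 |
| 3 | Canada | 1.8 | 1.7 | 1971 |
| 4 | Montenegro | 1.1 | 0.9 | 2013 |
| ... | ... | ... | ... | ... |
| 131 | Netherlands | 2.2 | 1.1 | 1980 |
| 132 | Sri Lanka | 25.7 | 26.8 | 2012 |
| 133 | Chile | 0.6 | 0.4 | 2014 |
| 134 | Argentina | 2.4 | 2.2 | 2005 |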

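Combining the male and female tables on the countries they share is an inner join, which pandas can express with `DataFrame.merge`. The sketch below uses two small hypothetical frames (made-up countries and values) in place of the scraped tables. It is one possible alternative, not the approach used in `nd_parser`.

```python
import pandas as pd

# Hypothetical stand-ins for the two scraped tables.
male = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Male%": [1.5, 3.5, 2.8],
    "Year": [2005, 2010, 2012],
})
female = pd.DataFrame({
    "Country": ["B", "C", "D"],
    "Female%": [3.3, 2.1, 0.9],
    "Year": [2010, 2013, 2013],
})

# Inner join: only countries present in both tables survive.
combined = male.merge(female, on="Country", how="inner",
                      suffixes=("_male", "_female"))

# Flag rows where the two surveys come from different years.
combined["SameYear"] = combined["Year_male"] == combined["Year_female"]
print(combined)
```

`nd_parser` stores only the year from the male table. Keeping both year columns, as here, makes any mismatch between the two surveys visible.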

## Additional Dataset Parsing/Cleaning Functions

Write any supplemental (optional) functions here.

In [None]:
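def imr_parser(filename):
  # Read every row of the downloaded CSV into memory.
  reader_list = []
  with open(filename, encoding = "utf8") as csvfile:
      reader = csv.reader(csvfile)
      reader_list = list(reader)

  # Rows before index 8 are skipped: index 8 is the header row
  # and the data rows start at index 9.
  header = reader_list[8]
  reader_list = reader_list[9:]

  final = []
  for i in reader_list:
      temp = 0
      count = 0
      # Skip blank rows.
      if len(i) != 0:
          country = i[0] + " mean"
          # Average every non-empty cell from column 4 onward.
          for x in i[4:]:
              if x != "":
                  temp += float(x)
                  count += 1
          # Rows with no values at all are left out.
          if count > 0:
              avg = round(temp/count,2)
              final+=[[country, avg]]
  final_pd = pd.DataFrame(data = final, columns =["Country", "IMR"])
  final_pd.to_csv("imrdata.csv")
  return final_pd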

############ Function Call ############
imr_parser("imr.csv")

| | Country | IMR |
|---|---|---|
| 0 | Africa Eastern and Southern mean | 95.11 |
| 1 | Afghanistan mean | 127.42 |
| 2 | Africa Western and Central mean | 111.44 |
| 3 | Angola mean | 106.78 |
| 4 | Albania mean | 29.13 |
| ... | ... | ... |
| 236 | Samoa mean | 32.47 |
| 237 | Yemen, Rep. mean | 117.07 |
| 238 | South Africa mean | 48.24 |
| 239 | Zambia mean | 88.83 |


In [None]:
# Define further extra source functions as necessary

# Inconsistencies
For each inconsistency (NaN, null, duplicate values, empty strings, etc.) you discover in your datasets, write at least 2 sentences stating the significance, how you identified it, and how you handled it.

1. In our downloaded CSV data set, we noticed that the deaths column held extremely long floats. Long floats make the data set unnecessarily messy and harder to read. We fixed this by rounding the column to two decimal places with the `round` function.

2. In our Web Collection #1 (API) data set, the pollution data mixed continents and regions with individual countries. This matters because our other data sets are organized by country, so only the country rows can be compared against them. To fix this, we dropped the continent and region rows by keeping only the data from the first “Afghanistan” entry onward, since Afghanistan is the first country in the data set.

3. In our Web Collection #2 (HTML) data set, the data was split across two separate tables: prevalence of underweight male children and prevalence of underweight female children. Combining them gives a fuller picture of how widespread underweight children are in each country. To clean this, we stored each table in a dictionary and combined the two into a new list containing only the countries that appear in both tables.

4. In the CSV file for the extra downloaded data set, we found numerous empty cells and empty strings. Identifying them was important because skipping them keeps our output clean and avoids errors in the later visualization stages of the project. To clean this, we excluded empty cells and strings when averaging the columns we wanted to use.

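Inconsistencies such as those listed above can also be counted programmatically before cleaning. The following is a minimal sketch, assuming the data has already been loaded into a pandas DataFrame. The helper name `inconsistency_report` and the sample frame are hypothetical.

```python
import pandas as pd

def inconsistency_report(df):
    """Count missing values, empty strings, and duplicate rows in `df`."""
    return {
        "missing": int(df.isna().sum().sum()),
        "empty_strings": int(df.eq("").sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Hypothetical sample frame with one of each problem mixed in.
sample = pd.DataFrame({
    "Country": ["A", "B", "B", ""],
    "Value": [1.0, None, None, 3.0],
})
report = inconsistency_report(sample)
print(report)
```

A report like this only locates the problems. How each one is handled (rounding, dropping rows, skipping empty cells) remains a separate decision for each dataset.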