#### Project 5: Predicted Pollution Mortality
#### Corey J Sinnott
# Data Import and Cleaning, EDA

## Executive Summary

This report was commissioned to explore mortality influenced by pollution. Data was obtained from several sources listed below. The problem statement is: can we predict pollution mortality? Conclusions and recommendations are presented after an in-depth analysis.


## Contents:
- [Data Import and Cleaning](#Data-Import-and-Cleaning)
- [EDA](#EDA)

# Data Import and Cleaning

#### Importing Libraries

In [147]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

#### Reading in Initial Dataframe
 - The dataframe was built from more than 20 individual datasets; that work is available in the extra code notebook.

In [2]:
df = pd.read_csv('./data/df_final.csv')

In [6]:
df.sample(2)

Unnamed: 0.1,Unnamed: 0,Entity,Year,Consumption of Ozone-Depleting Substances - Hydrochlorofluorocarbons (HCFCs),Consumption of Ozone-Depleting Substances - Carbon Tetrachloride,Consumption of Ozone-Depleting Substances - Chlorofluorocarbons (CFCs),Consumption of Ozone-Depleting Substances - Halons,Consumption of Ozone-Depleting Substances - Methyl Bromide,Consumption of Ozone-Depleting Substances - Methyl Chloroform,"PM2.5 air pollution, population exposed to levels exceeding WHO guideline value (% of total)",...,"Health expenditure per capita, PPP (constant 2011 international $)","Life expectancy at birth, total (years)",Consumption of Ozone-Depleting Substances - All,Annual CO2 emissions,Ozone depleting emissions (Index 1986 = 100),Minimum daily concentration (NASA),Mean daily concentration (NASA),Deaths - Household air pollution from solid fuels - Sex: Both - Age: All Ages (Number),Deaths - Air pollution - Sex: Both - Age: All Ages (Number),Deaths – Outdoor air pollution (all ages) (IHME)
47674,4106,Southern Latin America,2007,,,,,,,,...,,,,,4.73,108.0,116.2,3383.416129,24506.160527,21289.71
11077,12006,Congo,1811,,,,,,,,...,,,,,,,,,,


#### Cleaning Data
 - Filtering for year
 - Filtering for population
 - Filtering for country
     - Comparing values to an established country list to remove world and continent values

In [19]:
df = df[df.Year > 1950]

In [21]:
df_2 = df.set_index(keys = 'Year')

In [22]:
df_2 = df_2.drop(columns = 'Unnamed: 0')

In [23]:
df_2.sample(5)

Year,Entity,Consumption of Ozone-Depleting Substances - Hydrochlorofluorocarbons (HCFCs),Consumption of Ozone-Depleting Substances - Carbon Tetrachloride,Consumption of Ozone-Depleting Substances - Chlorofluorocarbons (CFCs),Consumption of Ozone-Depleting Substances - Halons,Consumption of Ozone-Depleting Substances - Methyl Bromide,Consumption of Ozone-Depleting Substances - Methyl Chloroform,"PM2.5 air pollution, population exposed to levels exceeding WHO guideline value (% of total)","PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)","Total population (Gapminder, HYDE & UN)",...,"Health expenditure per capita, PPP (constant 2011 international $)","Life expectancy at birth, total (years)",Consumption of Ozone-Depleting Substances - All,Annual CO2 emissions,Ozone depleting emissions (Index 1986 = 100),Minimum daily concentration (NASA),Mean daily concentration (NASA),Deaths - Household air pollution from solid fuels - Sex: Both - Age: All Ages (Number),Deaths - Air pollution - Sex: Both - Age: All Ages (Number),Deaths – Outdoor air pollution (all ages) (IHME)
1968,Samoa,,,,,,,,,137000.0,...,,53.774,,0.029312,,,,,,
1966,Nauru,,,,,,,,,6000.0,...,,,,0.032976,,,,,,
1961,Hungary,,,,,,,,,10030000.0,...,,68.936098,,48.929906,,,,,,
1998,Sub-Saharan Africa (excluding high income),,,,,,,,,,...,108.417024,50.140852,,,25.97,86.0,98.8,,,
2001,Asia,,,,,,,,,3789285000.0,...,,,,9153.74462,15.97,91.0,100.9,,,


In [17]:
df_2.shape

(56601, 21)

In [24]:
df_2['population'] = df_2['Total population (Gapminder, HYDE & UN)']

In [25]:
df_2.shape

(21365, 22)

In [26]:
df_2 = df_2[df_2.population > 800_000]

In [27]:
df_2.shape

(10958, 22)

#### Filtering against an established country-list

In [34]:
country_list = ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 
                'Angola', 'Argentina', 'Armenia', 'Australia', 
                'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 
                'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 
                'Belize', 'Benin', 'Bhutan', 'Bolivia', 'Bosnia Herzegovina', 
                'Botswana', 'Brazil', 'Brunei', 'Bulgaria', 'Burkina', 
                'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 
                'Central African Rep', 'Chad', 'Chile', 'China', 'Colombia', 
                'Comoros', 'Congo', 'Costa Rica', 'Croatia', 'Cuba', 'Cyprus', 
                'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 
                'Dominican Republic', 'East Timor', 'Ecuador', 'Egypt', 
                'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 
                'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon', 'Gambia', 
                'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada', 'Guatemala', 
                'Guinea',  'Guyana', 'Haiti', 'Honduras', 'Hungary', 'Iceland', 
                'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 
                'Italy', 'Ivory Coast', 'Jamaica', 'Japan', 'Jordan', 
                'Kazakhstan', 'Kenya', 'Kiribati', 'South Korea', 'Korea South', 
                'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon', 
                'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 
                'Luxembourg', 'Macedonia', 'Madagascar', 'Malawi', 'Malaysia', 
                'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Mauritania', 
                'Mauritius', 'Mexico', 'Micronesia', 'Moldova', 'Monaco', 
                'Mongolia', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 
                'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Zealand', 
                'Nicaragua', 'Niger', 'Nigeria', 'Norway', 'Oman', 'Pakistan', 
                'Palau', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 
                'Philippines', 'Poland', 'Portugal', 'Qatar', 'Romania', 'Russia', 
                'Russian Federation', 'Rwanda', 'Samoa','San Marino', 'Saudi Arabia', 
                'Senegal', 'Serbia', 'Sierra Leone', 'Singapore', 'Slovakia', 
                'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Sudan', 
                'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Swaziland', 'Sweden', 
                'Switzerland', 'Syria', 'Taiwan', 'Tajikistan', 'Tanzania', 'Thailand', 
                'Togo', 'Tonga', 'Trinidad & Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 
                'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 
                'United States', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Vatican City', 
                'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe']

In [43]:
df_trim = df_2[df_2['Entity'].isin(country_list)]
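
Because `isin` matches names exactly, any country whose spelling in the data differs from the list is dropped silently. The following is a minimal sketch of how that could be audited; `toy` and `toy_countries` are hypothetical stand-ins for `df_2` and `country_list`, and the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for df_2 and country_list (values are made up).
toy = pd.DataFrame({
    'Entity': ['Chile', 'Czechia', 'Asia', 'World', 'Chile'],
    'population': [1.0e7, 1.0e7, 3.7e9, 6.0e9, 1.1e7],
})
toy_countries = ['Chile', 'Czech Republic']

# Entities present in the data but absent from the list (removed by isin).
dropped = sorted(set(toy['Entity']) - set(toy_countries))

# List entries that never matched a row (possible spelling mismatches).
unmatched = sorted(set(toy_countries) - set(toy['Entity']))

print(dropped)
print(unmatched)
```

Reviewing both lists can show whether a real country was removed only because its name is spelled differently in the source data.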

In [55]:
df_x = df_trim[['Annual CO2 emissions', 'Total population (Gapminder, HYDE & UN)',
                'Health expenditure per capita, PPP (constant 2011 international $)',
                'Life expectancy at birth, total (years)',
                'Ozone depleting emissions (Index 1986 = 100)',
                'Minimum daily concentration (NASA)',
                'Mean daily concentration (NASA)',
                'Deaths – Outdoor air pollution (all ages) (IHME)',
                'population']].copy()  # explicit copy avoids the SettingWithCopyWarning on the next assignment

In [68]:
df_x['pollution_deaths'] = df_x['Deaths – Outdoor air pollution (all ages) (IHME)']


 - Dropping null values for target variable

In [76]:
df_x = df_x.dropna(subset = ['pollution_deaths'])

In [143]:
df_x = df_x.drop(columns = ['Deaths – Outdoor air pollution (all ages) (IHME)', 'Total population (Gapminder, HYDE & UN)'])

#### Defining Features
  - The columns below will most likely serve as model features

In [145]:
df_x = df_x.rename(columns = {
    'Annual CO2 emissions' : 'annual_co2_emmissions',
    'Health expenditure per capita, PPP (constant 2011 international $)' : 'health_spend_per_capita',
    'Life expectancy at birth, total (years)' : 'life_expectancy',
    'Ozone depleting emissions (Index 1986 = 100)' : 'ozone_depleting_emissions',
    'Minimum daily concentration (NASA)' : 'min_daily_ozone',
    'Mean daily concentration (NASA)' : 'mean_daily_ozone'
})

In [182]:
#source: https://www.open.edu/openlearn/health-sports-psychology/health/epidemiology-introduction/content-section-2.1.1
df_x['crude_death_per_1000'] = (df_x['pollution_deaths'] / df_x['population'] * 1000)
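
The crude rate above is deaths divided by population, scaled to a population of 1,000. A small worked sketch with made-up numbers follows; the frame `toy` and its values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Illustrative values only; not from the project data.
toy = pd.DataFrame({
    'pollution_deaths': [500.0, 12_000.0],
    'population': [1_000_000.0, 48_000_000.0],
})

# Same formula as the notebook: deaths / population * 1000
toy['crude_death_per_1000'] = toy['pollution_deaths'] / toy['population'] * 1000

print(toy)
```

The second row has many more deaths but a lower rate (about 0.25 versus 0.5 per 1,000), which is why a rate is easier to compare across countries of different sizes than a raw count.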

# EDA

#### Test Model
 - Fitting an initial test model to check for a linear relationship before proceeding.

In [257]:
X = df_x.drop(['health_spend_per_capita','crude_death_per_1000', 'pollution_deaths'], axis = 1)
y = df_x['pollution_deaths']

In [258]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [259]:
impute = SimpleImputer(missing_values = np.nan)

In [260]:
X_train_fill = impute.fit_transform(X_train)
X_test_fill = impute.transform(X_test)

In [261]:
ss = StandardScaler()

In [262]:
X_train_fill_scaled = ss.fit_transform(X_train_fill)
X_test_fill_scaled = ss.transform(X_test_fill)

In [263]:
lr = LinearRegression()

In [264]:
lr.fit(X_train_fill_scaled, y_train)

LinearRegression()

In [265]:
y_pred = lr.predict(X_test_fill_scaled)
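
The impute, scale, and fit steps above could also be chained with scikit-learn's `make_pipeline`, so that each transformer is fit on the training split only. This is a sketch on synthetic data, not the notebook's method; `X_demo`, `y_demo`, and the `random_state` values are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X and y (assumption, not the project data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Inject some missing feature values after the target has been computed.
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Mean imputation -> standardization -> linear regression, fit on train only.
pipe = make_pipeline(SimpleImputer(), StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
y_hat = pipe.predict(X_te)

print(y_hat.shape)
```

Calling `pipe.predict` applies the imputer and scaler fitted on the training data, which mirrors the separate `fit_transform` and `transform` calls in the cells above.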

#### Test Model Results
 - Including population introduces too much collinearity
 - This and other issues will be explored during model testing
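
One way to examine the collinearity noted above is to look at pairwise correlations between features. The sketch below uses synthetic columns; the names mirror the notebook's columns, but the values are generated here, and the relationship between population and emissions is built in by construction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
population = rng.uniform(1e6, 1e8, size=300)

demo = pd.DataFrame({
    'population': population,
    # Constructed to scale with population, plus noise (assumption).
    'annual_co2_emmissions': population * 5e-6 + rng.normal(scale=10, size=300),
    # Independent of population by construction.
    'life_expectancy': rng.normal(70, 5, size=300),
})

# Pearson correlation matrix; values near +/-1 suggest collinear features.
corr = demo.corr()
print(corr.round(2))
```

A correlation matrix only captures pairwise linear relationships, so it is a first check rather than a full diagnosis.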

In [266]:
def regression_eval(y_test, y_pred):
    # Print MSE, RMSE, MAE, and R^2 for a set of predictions.
    mse = mean_squared_error(y_test, y_pred)
    print(f'MSE = {np.round(mse, 3)}')
    # np.sqrt(MSE) replaces squared = False, which is deprecated in recent scikit-learn releases
    print(f'RMSE = {np.round(np.sqrt(mse), 3)}')
    print(f'MAE = {np.round(mean_absolute_error(y_test, y_pred), 3)}')
    print(f'r^2  = {np.round(r2_score(y_test, y_pred), 3)}')

regression_eval(y_test, y_pred)

MSE = 287728366.896
RMSE = 16962.558
MAE = 8744.158
r^2  = 0.927
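
RMSE and MAE are in the units of the target (deaths), while MSE is in squared units. To make the relationship between the four metrics concrete, here is a small hand computation in NumPy; the arrays `y_true` and `y_hat` are made-up values, not model output.

```python
import numpy as np

# Made-up values for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_hat

mse = np.mean(residuals ** 2)    # mean squared error
rmse = np.sqrt(mse)              # root of MSE, back in target units
mae = np.mean(np.abs(residuals)) # mean absolute error
# R^2 = 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, rmse, mae, r2)
```

An RMSE noticeably larger than the MAE, as in the results above, typically indicates that a few large errors dominate the squared-error terms.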


 - Exporting the final dataframe for analysis

In [160]:
df_x.to_csv('model_df.csv')