#### Project 5: Predicted Pollution Mortality
#### Corey J Sinnott
# Model Testing

## Executive Summary

This report was commissioned to explore mortality influenced by pollution. Data was obtained from the several sources listed below. The problem statement is: can pollution mortality be predicted? Conclusions and recommendations are presented after the in-depth analysis.


## Contents:
- [Model Testing](#Model-Testing)



#### Importing Libraries

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import scipy as sp
from scipy import stats
import statsmodels.api as sm
from category_encoders import OneHotEncoder
from sklearn import set_config
from sklearn.compose import make_column_transformer, make_column_selector, TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.feature_selection import RFE, RFECV, VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression, Ridge, Lasso, SGDClassifier, SGDRegressor, ElasticNet, LassoLars
from sklearn.metrics import (mean_squared_error, mean_absolute_error, r2_score, confusion_matrix,
                             plot_confusion_matrix, classification_report, plot_roc_curve)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder as OHE
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

#### Reading in Data

In [3]:
df = pd.read_csv('./data/model_df.csv')

In [4]:
df.sample(3)

|      | Year | annual_co2_emmissions | health_spend_per_capita | life_expectancy | ozone_depleting_emissions | min_daily_ozone | mean_daily_ozone | population | pollution_deaths |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1338 | 2014 | 78.659796 | 2098.052256 | 81.385366 | 2.37 | 114.0 | 128.6 | 10701000.0 | 6721.26 |
| 833 | 1998 | 6.614649 | 955.466362 | 77.738 | 25.97 | 86.0 | 98.8 | 909000.0 | 484.27 |
| 2122 | 2014 | 1.021415 | 93.482087 | 61.932 | 2.37 | 114.0 | 128.6 | 16290000.0 | 1190.87 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4015 entries, 0 to 4014
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Year                       4015 non-null   int64  
 1   annual_co2_emmissions      4010 non-null   float64
 2   health_spend_per_capita    2793 non-null   float64
 3   life_expectancy            3806 non-null   float64
 4   ozone_depleting_emissions  3868 non-null   float64
 5   min_daily_ozone            3874 non-null   float64
 6   mean_daily_ozone           3874 non-null   float64
 7   population                 4015 non-null   float64
 8   pollution_deaths           4015 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 282.4 KB


In [90]:
df['crude_death_per_1_000_000'] = (df['pollution_deaths'] / df['population'] * 1_000_000)
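
As a quick sanity check on the rate calculation above, the sketch below applies the same formula to one row of the `df.sample(3)` output shown earlier. The helper name `crude_rate_per_million` is hypothetical and is not part of the original notebook.

```python
import pandas as pd


def crude_rate_per_million(deaths, population):
    """Deaths per 1,000,000 people (hypothetical helper mirroring the cell above)."""
    return deaths / population * 1_000_000


# Values copied from the row with index 1338 in the df.sample(3) output.
row = pd.DataFrame({"pollution_deaths": [6721.26], "population": [10_701_000.0]})
row["crude_death_per_1_000_000"] = crude_rate_per_million(
    row["pollution_deaths"], row["population"]
)

# Roughly 628 pollution deaths per million people for this country-year.
print(row["crude_death_per_1_000_000"].round(1).iloc[0])
```

Expressing deaths per million keeps the target comparable across countries of very different sizes, which is the motivation for the crude-rate model tested below.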

# Model Testing

#### Modeling with population and crude death rate
 - Unsuccessful (the test r^2 is negative). All of the variables would have to be standardized to the same scale, which may introduce too much error.
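
The note above points to standardization as the obstacle. One way to guarantee that every feature is imputed and scaled before the polynomial expansion is to chain the steps in a single scikit-learn pipeline. This is a minimal sketch, not the notebook's method: the synthetic data, the column names `small_scale` and `large_scale`, and the `alpha` value are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoLars
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for model_df.csv: two features on very different scales.
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    "small_scale": rng.normal(5, 1, n),
    "large_scale": rng.normal(1e7, 3e6, n),
})
y_demo = 3 * X_demo["small_scale"] + X_demo["large_scale"] / 1e6 + rng.normal(0, 0.5, n)
X_demo.loc[::25, "small_scale"] = np.nan  # inject a few missing values

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Each step is fit on the training split only, then reused on the test split.
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LassoLars(alpha=0.01),  # alpha chosen arbitrarily for this sketch
)
pipe.fit(X_tr, y_tr)
score = pipe.score(X_te, y_te)
print(f"test r^2 = {score:.3f}")
```

Because the scaler sits inside the pipeline, the polynomial terms are always built from standardized inputs, and there is no way to pass the unscaled arrays to the model by accident.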

In [224]:
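# errors = 'ignore' skips any listed column that is absent, such as 'crude_death_per_1000',
# which is not created in the cells shown in this notebook
X = df.drop(['health_spend_per_capita', 'pollution_deaths', 'crude_death_per_1000', 
             'crude_death_per_1_000_000', 'life_expectancy'], axis = 1, errors = 'ignore')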
y = df['crude_death_per_1_000_000']

In [225]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [226]:
impute = SimpleImputer(missing_values = np.nan)

In [227]:
X_train_fill = impute.fit_transform(X_train)
X_test_fill = impute.transform(X_test)

In [228]:
ss = StandardScaler()

In [229]:
X_train_fill_scaled = ss.fit_transform(X_train_fill)
X_test_fill_scaled = ss.transform(X_test_fill)

In [230]:
pf = PolynomialFeatures()

In [231]:
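# Note: the polynomial features are built from the imputed, unscaled arrays;
# the scaled arrays from the previous cell are not used downstream
X_train_use = pf.fit_transform(X_train_fill)
X_test_use = pf.transform(X_test_fill)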

In [218]:
LL = LassoLars()

In [219]:
LL.fit(X_train_use, y_train)

LassoLars()

In [234]:
y_pred = LL.predict(X_test_use)

In [235]:
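def regression_eval(y_test, y_pred):
    """Print MSE, RMSE, MAE, and r^2 for a set of regression predictions."""
    mse = mean_squared_error(y_test, y_pred)
    print(f'MSE = {np.round(mse, 3)}')
    # np.sqrt(MSE) equals the RMSE and avoids the `squared` argument, which newer scikit-learn releases no longer accept
    print(f'RMSE = {np.round(np.sqrt(mse), 3)}')
    print(f'MAE = {np.round(mean_absolute_error(y_test, y_pred), 3)}')
    print(f'r^2  = {np.round(r2_score(y_test, y_pred), 3)}')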

regression_eval(y_test, y_pred)

MSE = 46537.689
RMSE = 215.726
MAE = 156.192
r^2  = -0.046
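
An r^2 below zero, as reported above, means the model's squared error on the test set is larger than that of a baseline that always predicts the test-set mean. The sketch below works through the arithmetic with made-up numbers; none of the values come from the project data.

```python
import numpy as np

# Made-up targets and deliberately poor predictions, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_bad = np.array([400.0, 100.0, 350.0, 150.0])

ss_res = np.sum((y_true - y_bad) ** 2)          # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean baseline

r2 = 1 - ss_res / ss_tot
print(f"ss_res = {ss_res:.0f}, ss_tot = {ss_tot:.0f}, r^2 = {r2:.2f}")
```

Here the model's squared error (165000) is more than three times the baseline's (50000), so r^2 comes out at -2.3. A value such as -0.046 indicates a model only slightly worse than the mean baseline.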


#### Modeling with population and total deaths
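 - Population introduces a lot of collinearity. Total deaths likely scale with population, so the high r^2 below may largely reflect country size.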
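
One common way to quantify the collinearity noted above is the variance inflation factor (VIF): each feature is regressed on the others, and VIF = 1 / (1 - R^2). The sketch below is an illustration under assumptions: the helper `vif` is hypothetical, and the synthetic columns only mimic a population-driven feature; they are not the project data.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of a 2-D array (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Ordinary least squares of column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


rng = np.random.default_rng(1)
population = rng.lognormal(mean=15, sigma=1.0, size=500)
co2 = population * 5e-6 * rng.lognormal(0, 0.1, 500)  # tracks population closely
ozone = rng.normal(100, 10, 500)                      # unrelated to population

v = vif(np.column_stack([population, co2, ozone]))
print(dict(zip(["population", "co2", "ozone"], v.round(1))))
```

A VIF well above roughly 5 to 10 is a common rule of thumb for problematic collinearity; in this synthetic setup the two population-driven columns land far above that range while the independent column stays near 1.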

In [164]:
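# errors = 'ignore' skips any listed column that is absent, such as 'crude_death_per_1000'
X = df.drop(['health_spend_per_capita', 'pollution_deaths', 'crude_death_per_1000', 'crude_death_per_1_000_000'],
            axis = 1, errors = 'ignore')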
y = df['pollution_deaths']

In [165]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [166]:
impute = SimpleImputer(missing_values = np.nan)

In [167]:
X_train_fill = impute.fit_transform(X_train)
X_test_fill = impute.transform(X_test)

In [168]:
ss = StandardScaler()

In [169]:
X_train_fill_scaled = ss.fit_transform(X_train_fill)
X_test_fill_scaled = ss.transform(X_test_fill)

In [170]:
pf = PolynomialFeatures()

In [171]:
# Note: built from the imputed, unscaled arrays; the scaled arrays are not used downstream
X_train_use = pf.fit_transform(X_train_fill)
X_test_use = pf.transform(X_test_fill)

In [172]:
LL = LassoLars()

In [173]:
LL.fit(X_train_use, y_train)

LassoLars()

In [174]:
y_pred = LL.predict(X_test_use)

In [175]:
# regression_eval is defined in the first modeling section above
regression_eval(y_test, y_pred)

MSE = 61950968.585
RMSE = 7870.894
MAE = 3473.279
r^2  = 0.991


#### Modeling total deaths without population

In [3]:
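# errors = 'ignore' skips any listed column that is absent, such as 'crude_death_per_1000'
X = df.drop(['health_spend_per_capita', 'pollution_deaths', 
             'crude_death_per_1000', 'crude_death_per_1_000_000', 'population'], axis = 1, errors = 'ignore')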
y = df['pollution_deaths']

In [261]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [262]:
impute = SimpleImputer(missing_values = np.nan)

In [263]:
X_train_fill = impute.fit_transform(X_train)
X_test_fill = impute.transform(X_test)

In [264]:
ss = StandardScaler()

In [265]:
X_train_fill_scaled = ss.fit_transform(X_train_fill)
X_test_fill_scaled = ss.transform(X_test_fill)

In [266]:
pf = PolynomialFeatures()

In [267]:
# Note: built from the imputed, unscaled arrays; the scaled arrays are not used downstream
X_train_use = pf.fit_transform(X_train_fill)
X_test_use = pf.transform(X_test_fill)

In [268]:
LL = LassoLars()

In [269]:
LL.fit(X_train_use, y_train)

LassoLars()

In [270]:
y_pred = LL.predict(X_test_use)

In [271]:
# regression_eval is defined in the first modeling section above
regression_eval(y_test, y_pred)

MSE = 2174983028.574
RMSE = 46636.713
MAE = 14265.438
r^2  = 0.747


In [None]:
#XGBoost

In [None]:
#RandomForest

LassoLars (Lasso fit with least-angle regression) shows the most promise for predicting mortality, given how strongly the variables covary. The next steps are a GridSearch to tune the model and further exploration of the features.
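
The tuning step proposed above could look something like the following. This is a hedged sketch rather than the project's actual search: the synthetic data, the parameter grid, and the choice of `r2` as the scoring metric are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoLars
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with one linear effect and one interaction effect.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 4))
y_demo = 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] * X_demo[:, 2] + rng.normal(0, 0.3, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

pipe = make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    PolynomialFeatures(include_bias=False),
    LassoLars(),
)

# make_pipeline names each step after its lowercased class name.
param_grid = {
    "polynomialfeatures__degree": [1, 2],
    "lassolars__alpha": [0.001, 0.01, 0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"held-out r^2 = {search.score(X_te, y_te):.3f}")
```

Searching over the whole pipeline, rather than over the estimator alone, refits the imputer and scaler inside each cross-validation fold, so no information from a validation fold leaks into preprocessing.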