## World Happiness Report Analysis

In [2]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
df = pd.read_csv(WHRDataSet_filename)
df.head()

Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year"
0,Afghanistan,2008,3.72359,7.16869,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.92969,-1.655084,1.774662,0.4766,,,
1,Afghanistan,2009,4.401778,7.33379,0.552308,49.624432,0.678896,0.203614,0.850035,0.583926,0.237092,0.611545,-2.044093,-1.635025,1.722688,0.391362,,,0.441906
2,Afghanistan,2010,4.758381,7.386629,0.539075,50.008961,0.600127,0.13763,0.706766,0.618265,0.275324,0.299357,-1.99181,-1.617176,1.878622,0.394803,,,0.327318
3,Afghanistan,2011,3.831719,7.415019,0.521104,50.367298,0.495901,0.175329,0.731109,0.611387,0.267175,0.307386,-1.919018,-1.616221,1.78536,0.465942,,,0.336764
4,Afghanistan,2012,3.782938,7.517126,0.520637,50.709263,0.530935,0.247159,0.77562,0.710385,0.267919,0.43544,-1.842996,-1.404078,1.798283,0.475367,,,0.34454


In [4]:
df.columns

Index(['country', 'year', 'Life Ladder', 'Log GDP per capita',
       'Social support', 'Healthy life expectancy at birth',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Positive affect', 'Negative affect',
       'Confidence in national government', 'Democratic Quality',
       'Delivery Quality', 'Standard deviation of ladder by country-year',
       'Standard deviation/Mean of ladder by country-year',
       'GINI index (World Bank estimate)',
       'GINI index (World Bank estimate), average 2000-15',
       'gini of household income reported in Gallup, by wp5-year'],
      dtype='object')

In [5]:
df.shape

(1562, 19)

In [6]:
nan_count = np.sum(df.isnull())
nan_count

country                                                       0
year                                                          0
Life Ladder                                                   0
Log GDP per capita                                           27
Social support                                               13
Healthy life expectancy at birth                              9
Freedom to make life choices                                 29
Generosity                                                   80
Perceptions of corruption                                    90
Positive affect                                              18
Negative affect                                              12
Confidence in national government                           161
Democratic Quality                                          171
Delivery Quality                                            171
Standard deviation of ladder by country-year                  0
Standard deviation/Mean of ladder by cou

In [7]:
df['country'].value_counts()

Zimbabwe        12
Israel          12
India           12
Russia          12
Saudi Arabia    12
                ..
Cuba             1
Swaziland        1
Suriname         1
Guyana           1
Oman             1
Name: country, Length: 164, dtype: int64

In [8]:
print(f"{df['year'].min()} - {df['year'].max()}")

2005 - 2017


In [9]:
print(f"{df['Log GDP per capita'].min()} - {df['Log GDP per capita'].max()}")
print(df['Log GDP per capita'].mean())

6.37739563 - 11.77027607
9.220822267882737


1. After inspecting the data, I plan to start with all of the available features except those I intend to drop. I will drop 'GINI index (World Bank estimate)' because 979 of the 1562 examples are missing a value; with over half the data missing, the column is neither a good feature to use nor a good candidate for imputation. I will also remove 'gini of household income reported in Gallup, by wp5-year' because of its large amount of missing data, and the 'country' and 'year' columns because they do not directly affect the GDP of a country. I will remove examples that have no value for 'Log GDP per capita', since they cannot be used to predict the label. Lastly, I will remove all mean and standard deviation columns for simplicity.

2. I will replace missing data with the column mean where applicable. I will also rename columns to improve readability for the average reader.

3. I will be using a multiple linear regression model for this data set. I think that is the best option because there are many features that should be taken into account when predicting the Log GDP per capita.

4. I will fit a multiple linear regression model on the features described and then examine how the R^2 and mean squared error change with different features. Since this is a regression problem rather than a binary classification problem, I will not attempt to use the decision tree, logistic regression, or k-nearest neighbor classifiers.
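
The column-dropping rule in item 1 can be sketched as a missing-fraction threshold. This is a minimal illustration on a made-up toy frame, not the WHR data; the 0.5 cutoff and the names `toy` and `MAX_MISSING` are assumptions chosen to mirror the "over half missing" reasoning above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; all values are invented.
toy = pd.DataFrame({
    "label": [1.0, 2.0, np.nan, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 0.3],
    "mostly_present": [0.5, np.nan, 0.7, 0.9],
})

# Fraction of missing values in each column.
missing_fraction = toy.isnull().mean()

# Hypothetical cutoff: drop any column that is more than half empty.
MAX_MISSING = 0.5
cols_to_drop = missing_fraction[missing_fraction > MAX_MISSING].index.tolist()

# Drop the sparse columns, then drop rows that lack the label.
cleaned = toy.drop(columns=cols_to_drop).dropna(subset=["label"])

print(missing_fraction)
print(cols_to_drop)
print(cleaned.shape)
```

Columns that survive the cutoff but still contain a few gaps (like `mostly_present` here) are the ones that item 2 would fill with the column mean.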

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [10]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [11]:
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
df = pd.read_csv(WHRDataSet_filename)

In [12]:
# drop columns that are irrelevant or redundant for determining the label
df = df.drop(columns= ['country', 'year','GINI index (World Bank estimate)', 'gini of household income reported in Gallup, by wp5-year', 
                       'Standard deviation of ladder by country-year', 'Standard deviation/Mean of ladder by country-year','GINI index (World Bank estimate), average 2000-15'])

# drop examples that do not have a value for Log GDP per capita
df = df.dropna(subset=['Log GDP per capita'])

# select the columns that will be used for the model ('Generosity' is not included)
# and rename them to improve readability and simplicity
cols_to_include = ['Life Ladder', 'Log GDP per capita',
       'Social support', 'Healthy life expectancy at birth',
       'Freedom to make life choices',
       'Perceptions of corruption', 'Positive affect', 'Negative affect',
       'Confidence in national government', 'Democratic Quality',
       'Delivery Quality']

renaming = {'Life Ladder': 'Happiness', 
            'Log GDP per capita': 'Log GDP per capita',  
            'Social support': 'Support', 
            'Healthy life expectancy at birth': 'Life Expectancy', 
            'Freedom to make life choices': 'Freedom', 
            'Perceptions of corruption': 'Corruption', 
            'Positive affect': 'Positive', 
            'Negative affect': 'Negative',
           'Confidence in national government': 'Confidence in Gov',
            'Democratic Quality': 'Democratic Quality',
            'Delivery Quality': 'Delivery Quality',
           }

df = df[cols_to_include].rename(renaming, axis=1)

df.head()

Unnamed: 0,Happiness,Log GDP per capita,Support,Life Expectancy,Freedom,Corruption,Positive,Negative,Confidence in Gov,Democratic Quality,Delivery Quality
0,3.72359,7.16869,0.450662,49.209663,0.718114,0.881686,0.517637,0.258195,0.612072,-1.92969,-1.655084
1,4.401778,7.33379,0.552308,49.624432,0.678896,0.850035,0.583926,0.237092,0.611545,-2.044093,-1.635025
2,4.758381,7.386629,0.539075,50.008961,0.600127,0.706766,0.618265,0.275324,0.299357,-1.99181,-1.617176
3,3.831719,7.415019,0.521104,50.367298,0.495901,0.731109,0.611387,0.267175,0.307386,-1.919018,-1.616221
4,3.782938,7.517126,0.520637,50.709263,0.530935,0.77562,0.710385,0.267919,0.43544,-1.842996,-1.404078


In [13]:
# determine the number of NaN values per column
nan_count = np.sum(df.isnull())
nan_count

Happiness               0
Log GDP per capita      0
Support                13
Life Expectancy         0
Freedom                29
Corruption             89
Positive               18
Negative               12
Confidence in Gov     158
Democratic Quality    155
Delivery Quality      155
dtype: int64

In [14]:
# remove rows with 3 or more NaN values among the desired features,
# since they will be less reliable for predicting the label
nan_per_row = df.isna().sum(axis=1)
rows_to_keep = nan_per_row <= 2

df = df[rows_to_keep]

nan_count = np.sum(df.isnull())
print(nan_count)

df.shape

Happiness               0
Log GDP per capita      0
Support                 2
Life Expectancy         0
Freedom                10
Corruption             59
Positive                5
Negative                2
Confidence in Gov     120
Democratic Quality    136
Delivery Quality      136
dtype: int64


(1493, 11)
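
The row filter above keeps rows with at most two missing values. A minimal sketch on a toy frame (the data and the name `MAX_NANS` are invented for illustration) shows that the same filter can also be written with the `thresh` argument of `DataFrame.dropna`, which counts non-missing values instead:

```python
import numpy as np
import pandas as pd

# Toy frame: row 2 has three NaNs, the other rows have at most one.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, 4.0],
    "d": [1.0, 2.0, 3.0, 4.0],
})

MAX_NANS = 2  # same cutoff as in the cell above

# Boolean-mask approach, as used in the notebook.
kept_mask = toy[toy.isna().sum(axis=1) <= MAX_NANS]

# Equivalent: require at least (n_columns - MAX_NANS) non-missing values per row.
kept_thresh = toy.dropna(thresh=toy.shape[1] - MAX_NANS)

print(kept_mask.equals(kept_thresh))
print(kept_mask.index.tolist())
```

Either form works; the mask version makes the "number of NaNs" rule explicit, while `thresh` is shorter.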

In [15]:
# drop rows with missing values in the columns that have only a few NaNs
df = df.dropna(subset=['Support', 'Positive', 'Negative'])

In [16]:
# fill the remaining NaNs with the column mean; assigning the result back
# avoids relying on an in-place fillna on a column selection
mean_corruption = df['Corruption'].mean()
df['Corruption'] = df['Corruption'].fillna(mean_corruption)

In [17]:
mean_freedom = df['Freedom'].mean()
df['Freedom'] = df['Freedom'].fillna(mean_freedom)

In [18]:
mean_confidence = df['Confidence in Gov'].mean()
df['Confidence in Gov'] = df['Confidence in Gov'].fillna(mean_confidence)

In [19]:
mean_dem_qual = df['Democratic Quality'].mean()
df['Democratic Quality'] = df['Democratic Quality'].fillna(mean_dem_qual)

In [20]:
mean_del_qual = df['Delivery Quality'].mean()
df['Delivery Quality'] = df['Delivery Quality'].fillna(mean_del_qual)

In [21]:
nan_count = np.sum(df.isnull())
nan_count

Happiness             0
Log GDP per capita    0
Support               0
Life Expectancy       0
Freedom               0
Corruption            0
Positive              0
Negative              0
Confidence in Gov     0
Democratic Quality    0
Delivery Quality      0
dtype: int64
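
The cells above compute each mean on the full data set before the train/test split. As an alternative sketch (not what this notebook does), the imputation can be placed inside a scikit-learn pipeline so that the means are learned from the training rows only. The data below is synthetic, and the names `X_demo`, `y_demo`, and `pipe` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: two features and a linear label with small noise.
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
y_demo = 2.0 * X_demo["f1"] - 1.0 * X_demo["f2"] + rng.normal(scale=0.1, size=n)

# Remove 10% of the f2 values to mimic missing data.
missing_rows = rng.choice(n, size=20, replace=False)
X_demo.loc[missing_rows, "f2"] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42
)

# The imputer learns column means from X_tr only, then reuses them on X_te.
pipe = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
pipe.fit(X_tr, y_tr)

learned_means = pipe.named_steps["simpleimputer"].statistics_
print(learned_means)
print(pipe.score(X_te, y_te))  # R^2 on the held-out rows
```

The design choice here is only about where the mean is computed; the filled-in value itself is still the column mean, as in the plan.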

In [33]:
y = df['Log GDP per capita']
X = df.drop(columns= ['Log GDP per capita'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

In [34]:
model = LinearRegression()

model.fit(X_train, y_train)

prediction = model.predict(X_test)

In [35]:
print('Model Summary:\n')

# Print intercept (alpha)
print('Intercept:')
print('alpha = ' , model.intercept_)

# Print weights, labeled with the feature columns of X
# (df.columns also contains the label column, which would shift the names)
print('\nWeights:')
for i, (name, w) in enumerate(zip(X.columns, model.coef_), start=1):
    print('w_', i, '= ', w, ' [ weight of ', name, ']')

Model Summary:

Intercept:
alpha =  2.727484975150805

Weights:
w_ 1 =  0.2830145349158518  [ weight of  Happiness ]
w_ 2 =  1.8610267092147097  [ weight of  Support ]
w_ 3 =  0.06213592659732381  [ weight of  Life Expectancy ]
w_ 4 =  -0.2862976417359598  [ weight of  Freedom ]
w_ 5 =  0.31079317308122417  [ weight of  Corruption ]
w_ 6 =  -0.9671108353537844  [ weight of  Positive ]
w_ 7 =  0.581970140669067  [ weight of  Negative ]
w_ 8 =  0.0985201440282306  [ weight of  Confidence in Gov ]
w_ 9 =  -0.15231110533430617  [ weight of  Democratic Quality ]
w_ 10 =  0.43486059870533034  [ weight of  Delivery Quality ]


With each weight matched to its feature, Support, Positive, and Negative have the largest weights in absolute value.
Because the features are on different scales (Life Expectancy is measured in years, while several of the others lie between 0 and 1), the size of a raw weight is only a rough indication of how strongly a feature affects the Log GDP per capita.
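
Raw regression weights depend on each feature's units, so comparing them across features can be misleading. A minimal sketch on synthetic data (the feature names, scales, and coefficients below are invented) shows how standardizing the features with `StandardScaler` can change which weight looks largest:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500

# Two synthetic features on very different scales.
years = rng.normal(loc=60.0, scale=10.0, size=n)  # roughly 30..90
share = rng.uniform(0.0, 1.0, size=n)             # between 0 and 1
X_demo = np.column_stack([years, share])
y_demo = 0.05 * years + 1.0 * share + rng.normal(scale=0.05, size=n)

# Weights in the original units.
raw = LinearRegression().fit(X_demo, y_demo)

# Weights per one standard deviation of each feature.
X_scaled = StandardScaler().fit_transform(X_demo)
scaled = LinearRegression().fit(X_scaled, y_demo)

print("raw weights:   ", raw.coef_)
print("scaled weights:", scaled.coef_)
```

In this toy setup the 0..1 feature has the larger raw weight, while the wide-scale feature has the larger standardized weight, which is why standardized weights are the safer basis for ranking features.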

In [36]:
# Print root mean squared error (RMSE)
print('\nModel Performance\n\nRMSE =   %.2f'
      % np.sqrt(mean_squared_error(y_test, prediction)))
# The coefficient of determination: 1 is perfect prediction
print(' R^2 =   %.2f'
      % r2_score(y_test, prediction))


Model Performance

RMSE =   0.52
 R^2 =   0.82
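
For reference, both metrics can be computed by hand. The sketch below uses a small made-up pair of arrays (not the notebook's predictions) and checks the manual formulas against `mean_squared_error` and `r2_score`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented true values and predictions, for illustration only.
y_true = np.array([7.0, 8.5, 9.0, 10.5, 11.0])
y_pred = np.array([7.5, 8.0, 9.5, 10.0, 11.5])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred)))
print(r2_manual, r2_score(y_true, y_pred))
```

RMSE is in the same units as the label (here, log GDP per capita), while R^2 is unitless and compares the model against always predicting the mean.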


I will build another multiple linear regression model without the 'Support' and 'Negative'
features to see whether removing them negatively or positively affects the R^2 value.

In [39]:
y2 = df['Log GDP per capita']
X2 = df.drop(columns= ['Log GDP per capita', 'Support', 'Negative'])
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.15, random_state=42)

model2 = LinearRegression()

model2.fit(X_train2, y_train2)

prediction2 = model2.predict(X_test2)

print('Model Summary:\n')

# Print intercept (alpha)
print('Intercept:')
print('alpha = ' , model2.intercept_)

# Print weights of model2, labeled with the feature columns of X2
print('\nWeights:')
for i, (name, w) in enumerate(zip(X2.columns, model2.coef_), start=1):
    print('w_', i, '= ', w, ' [ weight of ', name, ']')

# Print root mean squared error (RMSE)
print('\nModel Performance\n\nRMSE =   %.2f'
      % np.sqrt(mean_squared_error(y_test2, prediction2)))
# The coefficient of determination: 1 is perfect prediction
print(' R^2 =   %.2f'
      % r2_score(y_test2, prediction2))

Model Summary:

Intercept:
alpha =  3.3547266879795163

Weights:
...

Model Performance

RMSE =   0.53
 R^2 =   0.81


By removing the features 'Support' and 'Negative', the RMSE increased by 0.01 (from 0.52 to 0.53) and the R^2 decreased by 0.01 (from 0.82 to 0.81). The difference is small, but it suggests that it is slightly better to keep these features in the model. Next I want to try different test sizes and random states to see whether they improve the scores.
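
A difference of 0.01 can come from the particular split as much as from the features. The sketch below, on synthetic data with invented coefficients, repeats the same fit over several `random_state` values to show how much R^2 moves from split to split; the names `X_demo`, `y_demo`, and `scores` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: three features, linear signal plus noise.
rng = np.random.default_rng(2)
n = 300
X_demo = rng.normal(size=(n, 3))
y_demo = X_demo @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.5, size=n)

# Refit the same model on ten different train/test splits.
scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.15, random_state=seed
    )
    model_demo = LinearRegression().fit(X_tr, y_tr)
    scores.append(r2_score(y_te, model_demo.predict(X_te)))

scores = np.array(scores)
print(f"R^2 mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

If the spread across seeds is larger than the gap between two models, a single split is not enough to say which model is better.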

In [47]:
y3 = df['Log GDP per capita']
X3 = df.drop(columns= ['Log GDP per capita'])
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y3, test_size=0.33, random_state=1234)

model3 = LinearRegression()

model3.fit(X_train3, y_train3)

prediction3 = model3.predict(X_test3)

print('Model Summary:\n')

# Print intercept (alpha)
print('Intercept:')
print('alpha = ' , model3.intercept_)

# Print weights of model3, labeled with the feature columns of X3
print('\nWeights:')
for i, (name, w) in enumerate(zip(X3.columns, model3.coef_), start=1):
    print('w_', i, '= ', w, ' [ weight of ', name, ']')

# Print root mean squared error (RMSE)
print('\nModel Performance\n\nRMSE =   %.2f'
      % np.sqrt(mean_squared_error(y_test3, prediction3)))
# The coefficient of determination: 1 is perfect prediction
print(' R^2 =   %.2f'
      % r2_score(y_test3, prediction3))

Model Summary:

Intercept:
alpha =  2.8836683722431813

Weights:
...

Model Performance

RMSE =   0.50
 R^2 =   0.83
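
Since the score changes with the split, feature sets can be compared more fairly with k-fold cross-validation, where every row is used for testing exactly once. This is a sketch on synthetic data (the coefficients and the choice of which column to drop are invented), using `KFold` and `cross_val_score` from scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: columns 0 and 1 carry signal, columns 2 and 3 are noise.
rng = np.random.default_rng(3)
n = 300
X_demo = rng.normal(size=(n, 4))
y_demo = 1.5 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=n)

# Use the same five folds for both feature sets so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

full = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")
reduced = cross_val_score(
    LinearRegression(), X_demo[:, [0, 2, 3]], y_demo, cv=cv, scoring="r2"
)

print(f"all features:     R^2 = {full.mean():.3f} +/- {full.std():.3f}")
print(f"without column 1: R^2 = {reduced.mean():.3f} +/- {reduced.std():.3f}")
```

Applied to this notebook, the same pattern would compare the full feature set against the set without 'Support' and 'Negative' on identical folds.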


In [None]:
With a random state of 