Life Expectancy in the US

Goal

To inspect which variables have a statistically significant effect on life expectancy in the United States.

ETL

To gather the data, we looked at several different county-level data sources from 2010 in the United States. 2010 was a census year, which allowed us to find several datasets from government and peer-reviewed sources for feature selection, including food-stamp (SNAP) participation and minimum-wage data.

Finally, we took the average life expectancy by county from the Global Health Data Exchange as our target variable.

Since the data came from five different sources, we added state and county ID columns to each dataset so the data could be joined consistently (e.g., there were 31 “Washington” counties). The individual cleaned data frames, as well as the final data frame, can be found in the data_files folder as .csv files.
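
The joins themselves are a one-line pandas merge once every table carries the key columns. A minimal sketch, with hypothetical frame and file names standing in for the five cleaned sources:

```python
import pandas as pd
from functools import reduce

# Hypothetical names for the five cleaned source frames; each has already
# been given state_id and county_id columns during cleaning.
frames = [source_1, source_2, source_3, source_4, life_expectancy]

# Merging on the composite key keeps same-named counties in different
# states (e.g. the 31 "Washington" counties) from colliding.
merged = reduce(
    lambda left, right: left.merge(right, on=["state_id", "county_id"]),
    frames,
)
merged.to_csv("data_files/final_dataframe.csv", index=False)  # hypothetical name
```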

EDA & FEATURE ENGINEERING

Confirming Distributions

In order to run a linear regression analysis, our data needed to meet three criteria:

  • They needed to be normally distributed
  • They needed to be linear
  • They needed to be homoskedastic

*(Figure: distributions of the candidate features)*

We found that the majority of our data was normally distributed, with the exception of food stamps (SNAP) data and minimum-wage data.

  • The skewed distribution of minimum wage arose because it was measured at the state level while the rest of our data was at the county level. To make this feature more workable, we turned it into a categorical variable indicating whether each state’s minimum wage was above or below the mean minimum wage of the dataset as a whole.
  • To normalize the SNAP data we applied a log transform, which reduced its skew and variability. Both transformations are sketched below.
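
A minimal sketch of both transformations with pandas and NumPy, assuming the merged frame is `df` and using hypothetical column names (`np.log1p` is our choice here so zero values are handled safely):

```python
import numpy as np

# Log-transform SNAP participation to reduce right skew (log1p handles zeros).
df["snap_log"] = np.log1p(df["snap"])

# Binarize minimum wage: 1 if a state's minimum wage is above the overall
# mean, 0 otherwise (minimum wage is only available at the state level).
df["min_wage_above_mean"] = (df["min_wage"] > df["min_wage"].mean()).astype(int)
```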

*(Figure: feature distributions after the transformations)*

Linearity & Homoskedasticity

To confirm linearity and homoskedasticity, we inspected scatter plots of each of our independent variables against our dependent variable.
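
A sketch of how such a grid of plots can be generated with matplotlib; the frame and target column names are assumptions:

```python
import matplotlib.pyplot as plt

# One scatter plot per independent variable against the target.
features = [col for col in df.columns if col != "life_expectancy"]
fig, axes = plt.subplots(len(features), 1, figsize=(6, 4 * len(features)))
for ax, col in zip(axes, features):
    ax.scatter(df[col], df["life_expectancy"], s=8, alpha=0.5)
    ax.set(xlabel=col, ylabel="life expectancy")
fig.tight_layout()
plt.show()
```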

*(Figure: scatter plots of the independent variables against life expectancy)*

Notes on interpreting the scatter plots:

  • Homoskedasticity essentially means that all of the data points within a feature have roughly the same variance. If the data is heteroskedastic, the model’s standard errors and p-values become unreliable. If our features are homoskedastic, they will be scattered across a scatter plot in a consistently dense way (i.e., the dots will not form a cone shape in either direction).
  • If our features have linear relationships with our dependent variable (life expectancy), the points will cluster around a line with either a positive or negative slope.

We found that most of our data was both linear and homoskedastic.

Categorical variables

To account for states in the model, and to use them as possible features, we one-hot encoded state into a set of dummy variables, one per state.
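
In pandas this is a single call to `get_dummies`; a sketch assuming the state column is named `state` (dropping the first level is our choice, to avoid perfect collinearity with the intercept):

```python
import pandas as pd

# One 0/1 indicator column per state; drop_first avoids the dummy-variable trap.
df = pd.get_dummies(df, columns=["state"], drop_first=True)
```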

Scaling

To perform analysis and fit models on the data, we also scaled our continuous variables using the StandardScaler from the scikit-learn library.
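
A sketch of the scaling step; the list of continuous columns is hypothetical:

```python
from sklearn.preprocessing import StandardScaler

continuous_cols = ["snap_log", "obesity_rate"]  # hypothetical column names
scaler = StandardScaler()
# Fit on the training split only, then reuse the fitted parameters on the
# test split, so no information leaks from test data into the scaling.
X_train[continuous_cols] = scaler.fit_transform(X_train[continuous_cols])
X_test[continuous_cols] = scaler.transform(X_test[continuous_cols])
```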

Correlations

Finally, we inspected each feature’s correlations with the others.

First with the initial features.

*(Figure: correlation matrix of the initial features)*

Then with the state dummy variables included.

*(Figure: correlation matrix including the state variables)*
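
Heatmaps like these come straight from the DataFrame’s correlation matrix; a sketch with pandas and seaborn:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr()  # pairwise Pearson correlations between all columns
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Feature correlations")
plt.show()
```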

BUILDING BASE MODELS

Statsmodels

We ran both an OLS regression using the Statsmodels library and a linear regression with the scikit-learn library, wanting to compare the performance of the two.
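
A minimal sketch of the Statsmodels fit, assuming the data has already been split into train and test sets:

```python
import statsmodels.api as sm

# Statsmodels does not add an intercept automatically.
X_train_const = sm.add_constant(X_train)
ols_model = sm.OLS(y_train, X_train_const).fit()

# The summary reports R-squared, skew, kurtosis, and per-feature p-values.
print(ols_model.summary())
```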

*(Figure: Statsmodels OLS regression summary)*

From our Statsmodels OLS we got an R-squared value of approximately 0.80, meaning the model explains roughly 80% of the variance in life expectancy.

We observed that our residuals were slightly negatively skewed (-0.285) and leptokurtic (5.619), the latter meaning that we had some outlier data points, i.e., occasional values exceeding (in terms of standard deviations from the mean) what the normal distribution would predict. None of these were extreme enough for us to tweak the model further at this point.

From this model, we also observed that our SNAP data, as well as the categorical variables we created for AZ, CO, CT, FL, GA, ME, MA, MS, RI, TX, and UT, all had p-values > 0.05, meaning we could not establish a statistically significant relationship between these variables and our target variable (life expectancy).

We visualized the model’s predictions against the true y-values from our test dataset to evaluate how accurate the model was, and got an MSE of 0.174 and an RMSE of 0.417.
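
Continuing from the OLS sketch above, the test-set errors can be computed with scikit-learn’s metrics:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error

y_pred = ols_model.predict(sm.add_constant(X_test))
mse = mean_squared_error(y_test, y_pred)
print(f"MSE:  {mse:.3f}")           # ~0.174 in our run
print(f"RMSE: {np.sqrt(mse):.3f}")  # ~0.417
```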

*(Figure: predicted vs. true life expectancy on the test set)*

Scikit-Learn

After running a linear regression with sklearn, we got the same R-squared value (0.799) and found no predictive difference between the two libraries.
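
The scikit-learn equivalent, for comparison:

```python
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
print(f"R-squared: {lr.score(X_train, y_train):.3f}")  # ~0.799, matching Statsmodels
```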

FEATURE SELECTION & FURTHER ENGINEERING

Feature Selection

After running the initial linear regression model (using scikit-learn), we did some feature selection and engineering, starting with a wrapper method to select the top features of our model. Then we used two filter methods, both sketched just after this list:

  • Variance Threshold
  • Univariate Feature Selection
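
A sketch of the two filter methods with scikit-learn; the threshold and k are hypothetical choices:

```python
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_regression

# Variance threshold: drop features whose variance falls below a cutoff.
vt = VarianceThreshold(threshold=0.1)
X_high_variance = vt.fit_transform(X_train)

# Univariate selection: keep the k features with the strongest individual
# linear relationship to the target, scored with an F-test.
kbest = SelectKBest(score_func=f_regression, k=10)
X_selected = kbest.fit_transform(X_train, y_train)
```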

Further Feature Engineering

Using polynomial terms to add features to the model, we ended up with 1,539 features. This raised the R-squared value but decreased the adjusted R-squared value, suggesting the extra features did not justify the added complexity.
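
A sketch of how such an expansion can be generated (the degree is an assumption):

```python
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion: all original features plus their squares and
# pairwise products, which is how the feature count balloons.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_train)
print(X_poly.shape)
```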

LASSO Method

Using the "Least Absolute Shrinkage and Selection Operator" or LASSO to fit a model. LASSO is similar to Ordinary Least Squares although it performs both L1 regularization and selects features. We tried several alpha values (which is the constant that multiplies the L1 term) to optimize for the best R-squared value. An interesting note, if alpha is equal to 0, then the model is functionally a Ordinary Least Squares model, which was one of our better performing models.

*(Figure: LASSO feature weights)*

This graph shows the features LASSO kept and their corresponding weights in the model.

FINAL THOUGHTS

The initial linear regression using sklearn was the best-performing model, with an R-squared value of 0.799 and a test mean squared error of 0.174.

*(Figure: results of the final linear regression model)*

We would like to add more features to increase the predictive power of our model (including Medicare, Educational Attainment, and Small Area Income), but we were not able to find county-level datasets for these for 2010. We could also extend the feature engineering by incorporating polynomial features into other regression models.
