Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

In [94]:
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.feature_selection import f_regression, SelectKBest
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import classification_report
from scipy.stats import randint, uniform
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error
import warnings
import pandas as pd
import plotly_express as px
import eli5
from eli5.sklearn import PermutationImportance

In [50]:
df=pd.read_csv('https://raw.githubusercontent.com/VeraMendes/Project---Train-a-predictive-model/master/led.csv')
print(df.shape)
df.head()

(2938, 22)


Unnamed: 0,Country,Year,Status,Lifeexpectancy,AdultMortality,infantdeaths,Alcohol,percentageexpenditure,HepatitisB,Measles,...,Polio,Totalexpenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness1-19years,thinness5-9years,Incomecompositionofresources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


In [40]:
df.isnull().sum()

Country                           0
Year                              0
Status                            0
Lifeexpectancy                   10
AdultMortality                   10
infantdeaths                      0
Alcohol                         194
percentageexpenditure             0
HepatitisB                      553
Measles                           0
BMI                              34
under-fivedeaths                  0
Polio                            19
Totalexpenditure                226
Diphtheria                       19
HIV/AIDS                          0
GDP                             448
Population                      652
thinness1-19years                34
thinness5-9years                 34
Incomecompositionofresources    167
Schooling                       163
dtype: int64

In [41]:
df[pd.isnull(df['Lifeexpectancy'])]

Unnamed: 0,Country,Year,Status,Lifeexpectancy,AdultMortality,infantdeaths,Alcohol,percentageexpenditure,HepatitisB,Measles,...,Polio,Totalexpenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness1-19years,thinness5-9years,Incomecompositionofresources,Schooling
624,CookIslands,2013,Developing,,,0,0.01,0.0,98.0,0,...,98.0,3.58,98.0,0.1,,,0.1,0.1,,
769,Dominica,2013,Developing,,,0,0.01,11.419555,96.0,0,...,96.0,5.58,96.0,0.1,722.75665,,2.7,2.6,0.721,12.7
1650,MarshallIslands,2013,Developing,,,0,0.01,871.878317,8.0,0,...,79.0,17.24,79.0,0.1,3617.752354,,0.1,0.1,,0.0
1715,Monaco,2013,Developing,,,0,0.01,0.0,99.0,0,...,99.0,4.3,99.0,0.1,,,,,,
1812,Nauru,2013,Developing,,,0,0.01,15.606596,87.0,0,...,87.0,4.65,87.0,0.1,136.18321,,0.1,0.1,,9.6
1909,Niue,2013,Developing,,,0,0.01,0.0,99.0,0,...,99.0,7.2,99.0,0.1,,,0.1,0.1,,
1958,Palau,2013,Developing,,,0,,344.690631,99.0,0,...,99.0,9.27,99.0,0.1,1932.12237,292.0,0.1,0.1,0.779,14.2
2167,SaintKittsandNevis,2013,Developing,,,0,8.54,0.0,97.0,0,...,96.0,6.14,96.0,0.1,,,3.7,3.6,0.749,13.4
2216,SanMarino,2013,Developing,,,0,0.01,0.0,69.0,0,...,69.0,6.5,69.0,0.1,,,,,,15.1
2713,Tuvalu,2013,Developing,,,0,0.01,78.281203,9.0,0,...,9.0,16.61,9.0,0.1,3542.13589,1819.0,0.2,0.1,,0.0


In [28]:
df[pd.isnull(df['GDP'])]

Unnamed: 0,Country,Year,Status,Lifeexpectancy,AdultMortality,infantdeaths,Alcohol,percentageexpenditure,HepatitisB,Measles,...,Polio,Totalexpenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness1-19years,thinness5-9years,Incomecompositionofresources,Schooling
160,Bahamas,2015,Developing,76.1,147.0,0,,0.0,95.0,0,...,95.0,,95.0,0.1,,,2.5,2.5,0.790,12.6
161,Bahamas,2014,Developing,75.4,16.0,0,9.45,0.0,96.0,0,...,96.0,7.74,96.0,0.1,,,2.5,2.5,0.789,12.6
162,Bahamas,2013,Developing,74.8,172.0,0,9.42,0.0,97.0,0,...,97.0,7.50,97.0,0.1,,,2.5,2.5,0.790,12.6
163,Bahamas,2012,Developing,74.9,167.0,0,9.50,0.0,96.0,0,...,99.0,7.43,98.0,0.2,,,2.5,2.5,0.789,12.6
164,Bahamas,2011,Developing,75.0,162.0,0,9.34,0.0,95.0,0,...,97.0,7.63,98.0,0.1,,,2.5,2.5,0.788,12.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2901,Yemen,2004,Developing,62.2,247.0,42,0.06,0.0,43.0,12708,...,72.0,4.90,72.0,0.1,,,13.9,13.9,0.464,8.4
2902,Yemen,2003,Developing,61.9,249.0,43,0.04,0.0,38.0,8536,...,61.0,5.00,61.0,0.1,,,14.0,13.9,0.457,8.2
2903,Yemen,2002,Developing,61.5,25.0,45,0.07,0.0,31.0,890,...,64.0,4.22,65.0,0.1,,,14.0,14.0,0.450,8.0
2904,Yemen,2001,Developing,61.1,251.0,46,0.08,0.0,19.0,485,...,73.0,4.34,73.0,0.1,,,14.0,14.0,0.444,7.9


In [29]:
df[pd.isnull(df['Polio'])]

Unnamed: 0,Country,Year,Status,Lifeexpectancy,AdultMortality,infantdeaths,Alcohol,percentageexpenditure,HepatitisB,Measles,...,Polio,Totalexpenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness1-19years,thinness5-9years,Incomecompositionofresources,Schooling
1742,Montenegro,2005,Developing,73.6,133.0,0,,527.307672,,0,...,,8.46,,0.1,3674.617924,614261.0,2.3,2.3,0.746,12.8
1743,Montenegro,2004,Developing,73.5,134.0,0,0.01,57.121901,,0,...,,8.45,,0.1,338.199535,613353.0,2.3,2.4,0.74,12.6
1744,Montenegro,2003,Developing,73.5,134.0,0,0.01,495.078296,,0,...,,8.91,,0.1,2789.1735,612267.0,2.4,2.4,0.0,0.0
1745,Montenegro,2002,Developing,73.4,136.0,0,0.01,36.48024,,0,...,,8.33,,0.1,216.243274,69828.0,2.5,2.5,0.0,0.0
1746,Montenegro,2001,Developing,73.3,136.0,0,0.01,33.669814,,0,...,,8.23,,0.1,199.583957,67389.0,2.5,2.6,0.0,0.0
1747,Montenegro,2000,Developing,73.0,144.0,0,0.01,274.54726,,0,...,,7.32,,0.1,1627.42893,6495.0,2.6,2.7,0.0,0.0
2414,SouthSudan,2010,Developing,55.0,359.0,27,,0.0,,0,...,,,,4.0,1562.239346,167192.0,,,0.0,0.0
2415,SouthSudan,2009,Developing,54.3,369.0,27,,0.0,,0,...,,,,4.2,1264.78998,967667.0,,,0.0,0.0
2416,SouthSudan,2008,Developing,53.6,377.0,27,,0.0,,0,...,,,,4.2,1678.711862,9263136.0,,,0.0,0.0
2417,SouthSudan,2007,Developing,53.1,381.0,27,,0.0,,0,...,,,,4.2,,88568.0,,,0.0,0.0


In [30]:
df[pd.isnull(df['thinness5-9years'])]

Unnamed: 0,Country,Year,Status,Lifeexpectancy,AdultMortality,infantdeaths,Alcohol,percentageexpenditure,HepatitisB,Measles,...,Polio,Totalexpenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness1-19years,thinness5-9years,Incomecompositionofresources,Schooling
1715,Monaco,2013,Developing,,,0,0.01,0.0,99.0,0,...,99.0,4.3,99.0,0.1,,,,,,
2216,SanMarino,2013,Developing,,,0,0.01,0.0,69.0,0,...,69.0,6.5,69.0,0.1,,,,,,15.1
2409,SouthSudan,2015,Developing,57.3,332.0,26,,0.0,31.0,878,...,41.0,,31.0,3.4,758.725782,11882136.0,,,0.421,4.9
2410,SouthSudan,2014,Developing,56.6,343.0,26,,46.074469,,441,...,44.0,2.74,39.0,3.5,1151.861715,1153971.0,,,0.421,4.9
2411,SouthSudan,2013,Developing,56.4,345.0,26,,47.44453,,525,...,5.0,2.62,45.0,3.6,1186.11325,1117749.0,,,0.417,4.9
2412,SouthSudan,2012,Developing,56.0,347.0,26,,38.338232,,1952,...,64.0,2.77,59.0,3.8,958.45581,1818258.0,,,0.419,4.9
2413,SouthSudan,2011,Developing,55.4,355.0,27,,0.0,,1256,...,66.0,,61.0,3.9,176.9713,1448857.0,,,0.429,4.9
2414,SouthSudan,2010,Developing,55.0,359.0,27,,0.0,,0,...,,,,4.0,1562.239346,167192.0,,,0.0,0.0
2415,SouthSudan,2009,Developing,54.3,369.0,27,,0.0,,0,...,,,,4.2,1264.78998,967667.0,,,0.0,0.0
2416,SouthSudan,2008,Developing,53.6,377.0,27,,0.0,,0,...,,,,4.2,1678.711862,9263136.0,,,0.0,0.0


In [42]:
train.columns

Index(['Country', 'Year', 'Status', 'Lifeexpectancy', 'AdultMortality',
       'infantdeaths', 'Alcohol', 'percentageexpenditure', 'HepatitisB',
       'Measles', 'BMI', 'under-fivedeaths', 'Polio', 'Totalexpenditure',
       'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'thinness1-19years',
       'thinness5-9years', 'Incomecompositionofresources', 'Schooling'],
      dtype='object')

In [51]:
df = df.rename(columns = {
    'Country':'country','Year':'year', 'Status':'development','Lifeexpectancy':'lifespan',
    'AdultMortality':'adult_mortality', 'infantdeaths':'infant_deaths',
    'Alcohol':'alcohol_conpsumption', 'percentageexpenditure':'percentage_expenditure',
    'HepatitisB':'hepatitisb','Measles':'measles','BMI':'BMI','under-fivedeaths':'baby_deaths',
    'Polio':'polio', 'Totalexpenditure':'total_expenditure','Diphtheria':'diphtheria','HIV/AIDS':'HIV/AIDS',
    'GDP':'GDP','Population':'population','thinness1-19years':'thinness_teenager',
    'thinness5-9years':'thinness_children','Incomecompositionofresources':'ICR','Schooling':'education'
})

df.head()

Unnamed: 0,country,year,development,lifespan,adult_mortality,infant_deaths,alcohol_conpsumption,percentage_expenditure,hepatitisb,measles,...,polio,total_expenditure,diphtheria,HIV/AIDS,GDP,population,thinness_teenager,thinness_children,ICR,education
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


In [52]:
df = df.dropna(axis=0, subset=['lifespan'])

In [53]:
df.shape

(2928, 22)

In [54]:
df['next_year_lifespan']= df['lifespan'].shift(1)
df.head()

Unnamed: 0,country,year,development,lifespan,adult_mortality,infant_deaths,alcohol_conpsumption,percentage_expenditure,hepatitisb,measles,...,total_expenditure,diphtheria,HIV/AIDS,GDP,population,thinness_teenager,thinness_children,ICR,education,next_year_lifespan
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1,
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0,65.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9,59.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8,59.9
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5,59.5


In [55]:
target = 'next_year_lifespan'

In [56]:
df.head(200)

Unnamed: 0,country,year,development,lifespan,adult_mortality,infant_deaths,alcohol_conpsumption,percentage_expenditure,hepatitisb,measles,...,total_expenditure,diphtheria,HIV/AIDS,GDP,population,thinness_teenager,thinness_children,ICR,education,next_year_lifespan
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,8.16,65.0,0.1,584.259210,33736494.0,17.2,17.3,0.479,10.1,
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0,65.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.470,9.9,59.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,8.52,67.0,0.1,669.959000,3696958.0,17.9,18.0,0.463,9.8,59.9
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5,59.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,Bangladesh,2012,Developing,77.0,137.0,111,0.01,59.258926,94.0,1986,...,3.80,94.0,0.1,856.342857,15572753.0,18.5,19.0,0.557,9.9,71.0
196,Bangladesh,2011,Developing,73.0,14.0,118,0.01,62.349885,96.0,5625,...,3.16,96.0,0.1,835.789341,153911916.0,18.7,19.2,0.545,9.4,77.0
197,Bangladesh,2010,Developing,69.9,142.0,126,0.01,62.659454,94.0,788,...,3.60,94.0,0.1,757.671757,15214912.0,18.9,19.4,0.535,8.9,73.0
198,Bangladesh,2009,Developing,69.5,144.0,135,0.01,53.264004,97.0,718,...,2.91,97.0,0.1,681.125368,1545478.0,19.1,19.7,0.523,8.4,69.9


In [88]:
df['year'].value_counts()

2015    183
2013    183
2011    183
2009    183
2007    183
2005    183
2003    183
2001    183
2014    183
2012    183
2010    183
2008    183
2006    183
2004    183
2002    183
2000    183
Name: year, dtype: int64

In [90]:
year_2015 = df[df.year == 2015]
year_2015.head()

Unnamed: 0,country,year,development,lifespan,adult_mortality,infant_deaths,alcohol_conpsumption,percentage_expenditure,hepatitisb,measles,...,total_expenditure,diphtheria,HIV/AIDS,GDP,population,thinness_teenager,thinness_children,ICR,education,next_year_lifespan
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1,
16,Albania,2015,Developing,77.8,74.0,0,4.6,364.975229,99.0,0,...,6.0,99.0,0.1,3954.22783,28873.0,1.2,1.3,0.762,14.2,54.8
32,Algeria,2015,Developing,75.6,19.0,21,,0.0,95.0,63,...,,95.0,0.1,4132.76292,39871528.0,6.0,5.8,0.743,14.4,72.6
48,Angola,2015,Developing,52.4,335.0,66,,0.0,64.0,118,...,,64.0,1.9,3695.793748,2785935.0,8.3,8.2,0.531,11.4,71.3
64,AntiguaandBarbuda,2015,Developing,76.4,13.0,0,,0.0,99.0,0,...,,99.0,0.2,13566.9541,,3.3,3.3,0.784,13.9,45.3


In [None]:
# Looking into Time Series I am:
# using 2013 & 2014 as test
# using 2011 & 2012 as val
# I cannot use 2015 values as I don't have values for the next year lifespan (2016)

In [91]:
df = df[df.year != 2015]
df.head()

Unnamed: 0,country,year,development,lifespan,adult_mortality,infant_deaths,alcohol_conpsumption,percentage_expenditure,hepatitisb,measles,...,total_expenditure,diphtheria,HIV/AIDS,GDP,population,thinness_teenager,thinness_children,ICR,education,next_year_lifespan
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0,65.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9,59.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8,59.9
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5,59.5
5,Afghanistan,2010,Developing,58.8,279.0,74,0.01,79.679367,66.0,1989,...,9.2,66.0,0.1,553.32894,2883167.0,18.4,18.4,0.448,9.2,59.2


In [95]:
df.isnull().sum()

country                     0
year                        0
development                 0
lifespan                    0
adult_mortality             0
infant_deaths               0
alcohol_conpsumption       16
percentage_expenditure      0
hepatitisb                544
measles                     0
BMI                        30
baby_deaths                 0
polio                      19
total_expenditure          45
diphtheria                 19
HIV/AIDS                    0
GDP                       414
population                603
thinness_teenager          30
thinness_children          30
ICR                       150
education                 150
next_year_lifespan          0
dtype: int64

In [104]:
train = df[df['year']<2011]
val = df[(df.year == 2011) | (df.year == 2012)]
test = df[(df.year == 2013) | (df.year == 2014)]
train.shape, val.shape, test.shape

((2013, 23), (366, 23), (366, 23))

In [101]:
y_baseline = pd.Series(train['lifespan'], index=train.index)
print(y_baseline)

5       58.8
6       58.6
7       58.1
8       57.5
9       57.3
        ... 
2933    44.3
2934    44.5
2935    44.8
2936    45.3
2937    46.0
Name: lifespan, Length: 2013, dtype: float64


In [110]:
# Arrange data into X features matrix and y target vector
target = 'next_year_lifespan'
baseline_values = 'lifespan'
cols_to_drop = ['next_year_lifespan', 'lifespan']
X_train = train[train.columns.drop(cols_to_drop)]
y_train = train['next_year_lifespan']
X_val = val[val.columns.drop(cols_to_drop)]
y_val = val['next_year_lifespan']
X_test = test[test.columns.drop(cols_to_drop)]
y_test = test['next_year_lifespan']

In [112]:
mean_baseline = train['next_year_lifespan'].mean()
y_pred = test['lifespan']
baseline_mae = mean_absolute_error(y_val, y_pred)
print('mean baseline:', train['next_year_lifespan'].mean())
print(f'Validation Error (predict lifespan for years 2011 & 2012): {baseline_mae}')

mean baseline: 68.68564331843021
Validation Error (predict lifespan for years 2011 & 2012): 0.860655737704918


dtype('float64')