# Assignment 4 - regression
**The assignment deadline is listed on the [course pages](https://courses.fit.cvut.cz/BI-VZD/homeworks/index.html).**

  * The goal of this assignment is to practice solving a regression problem on real data.
  
> **The most important part of the assignment is getting the process right: a correct dataset split, hyperparameter tuning, evaluation of results, etc.**

## Dataset

  * The data source is the file `LifeExpectancyData.csv` on the course pages (original here: https://www.kaggle.com/kumarajarshi/life-expectancy-who).
  * A description of the dataset can be found on the page with the original dataset.
  * The target (explained) variable is named `Life expectancy `.
  

## Instructions
Assignment items whose (honest) completion earns you 12 points:

  1. Remove from the data the points for which the target variable is unknown.
  1. Split the data into training and test sets.
  1. Perform a basic exploration of the data. Based on it, respond appropriately to problematic aspects of the data (missing values, etc.).
  1. Apply linear and ridge regression and evaluate the results properly:
    * Use `mean_absolute_error` to measure the error.
    * Experiment with creating new features (based on the available ones).
    * Experiment with standardization/normalization of the data.
    * Choose model hyperparameters to tune and find their best values.
  1. Also use a model other than linear and ridge regression.


## Submission notes

  * Follow the instructions on the page https://courses.fit.cvut.cz/BI-VZD/homeworks/index.html.
  * Submit this Jupyter Notebook.
  * The grader may allow you to finish or fix the assignment and thereby earn additional points. The first version matters, though, and if it is sloppy, you will be penalized for it.

# Solution

In [183]:
import pandas as pd
import math
import numpy as np
import sys

from sklearn import metrics, datasets
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.impute import KNNImputer

from sklearn.linear_model import LinearRegression, Ridge
from typing import Callable, Tuple
from scipy import optimize

In [83]:
data = pd.read_csv('./LifeExpectancyData.csv')
display(data)

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.259210,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.470,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959000,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2933,Zimbabwe,2004,Developing,44.3,723.0,27,4.36,0.000000,68.0,31,...,67.0,7.13,65.0,33.6,454.366654,12777511.0,9.4,9.4,0.407,9.2
2934,Zimbabwe,2003,Developing,44.5,715.0,26,4.06,0.000000,7.0,998,...,7.0,6.52,68.0,36.7,453.351155,12633897.0,9.8,9.9,0.418,9.5
2935,Zimbabwe,2002,Developing,44.8,73.0,25,4.43,0.000000,73.0,304,...,73.0,6.53,71.0,39.8,57.348340,125525.0,1.2,1.3,0.427,10.0
2936,Zimbabwe,2001,Developing,45.3,686.0,25,1.72,0.000000,76.0,529,...,76.0,6.16,75.0,42.1,548.587312,12366165.0,1.6,1.7,0.427,9.8


In [84]:
def null_counts():
    display(data.isnull().sum()[data.isnull().sum()>0])
null_counts()

Life expectancy                     10
Adult Mortality                     10
Alcohol                            194
Hepatitis B                        553
 BMI                                34
Polio                               19
Total expenditure                  226
Diphtheria                          19
GDP                                448
Population                         652
 thinness  1-19 years               34
 thinness 5-9 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64

The output of the following cell shows that some of the data does not make sense. How could Georgia have had a population of 43 in 2008?

In [85]:
data[data['Population']<100][['Country','Year','Population']]

Unnamed: 0,Country,Year,Population
985,Georgia,2008,43.0
1603,Maldives,2014,41.0
1608,Maldives,2009,36.0
1614,Maldives,2003,34.0


In [86]:
data.describe()

Unnamed: 0,Year,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
count,2938.0,2928.0,2928.0,2938.0,2744.0,2938.0,2385.0,2938.0,2904.0,2938.0,2919.0,2712.0,2919.0,2938.0,2490.0,2286.0,2904.0,2904.0,2771.0,2775.0
mean,2007.51872,69.224932,164.796448,30.303948,4.602861,738.251295,80.940461,2419.59224,38.321247,42.035739,82.550188,5.93819,82.324084,1.742103,7483.158469,12753380.0,4.839704,4.870317,0.627551,11.992793
std,4.613841,9.523867,124.292079,117.926501,4.052413,1987.914858,25.070016,11467.272489,20.044034,160.445548,23.428046,2.49832,23.716912,5.077785,14270.169342,61012100.0,4.420195,4.508882,0.210904,3.35892
min,2000.0,36.3,1.0,0.0,0.01,0.0,1.0,0.0,1.0,0.0,3.0,0.37,2.0,0.1,1.68135,34.0,0.1,0.1,0.0,0.0
25%,2004.0,63.1,74.0,0.0,0.8775,4.685343,77.0,0.0,19.3,0.0,78.0,4.26,78.0,0.1,463.935626,195793.2,1.6,1.5,0.493,10.1
50%,2008.0,72.1,144.0,3.0,3.755,64.912906,92.0,17.0,43.5,4.0,93.0,5.755,93.0,0.1,1766.947595,1386542.0,3.3,3.3,0.677,12.3
75%,2012.0,75.7,228.0,22.0,7.7025,441.534144,97.0,360.25,56.2,28.0,97.0,7.4925,97.0,0.8,5910.806335,7420359.0,7.2,7.2,0.779,14.3
max,2015.0,89.0,723.0,1800.0,17.87,19479.91161,99.0,212183.0,87.3,2500.0,99.0,17.6,99.0,50.6,119172.7418,1293859000.0,27.7,28.6,0.948,20.7


### GDP filling
The minimum GDP value (about 1.68) seemed unreasonable, so I decided to compare the range of GDP per capita values with [Wikipedia](https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(PPP)_per_capita#IMF_estimates_between_2000_and_2009).

In [87]:
gdp_web = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(PPP)_per_capita#IMF_estimates_between_2000_and_2009')
true_gdp = pd.DataFrame(data={'Country':gdp_web[0][gdp_web[0].columns[0]]});
for i in range(len(gdp_web) - 1):
    gdp_web[i].rename(columns = {'Country (or dependent territory)':'Country'}, inplace=True)
    true_gdp = true_gdp.join(other=gdp_web[i].set_index('Country'), on='Country', how='outer')
true_gdp = pd.concat([true_gdp['Country'],true_gdp.iloc[:,21:37]],axis=1).set_index('Country')
display(true_gdp)

Unnamed: 0_level_0,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Afghanistan,,,886.0,942.0,938.0,1045.0,1100.0,1233.0,1273.0,1507.0,1608.0,1694.0,1953.0,2010.0,2120.0,2136.0
Albania,4331.0,4838.0,5153.0,5559.0,6049.0,6616.0,7264.0,7965.0,8797.0,9223.0,9724.0,10208.0,10526.0,10571.0,11259.0,11662.0
Algeria,8605.0,8926.0,9435.0,10150.0,10710.0,11521.0,11890.0,12432.0,12698.0,12754.0,13105.0,13480.0,13264.0,13003.0,12940.0,11945.0
Angola,3136.0,3242.0,3634.0,3702.0,4095.0,4715.0,5261.0,5980.0,6580.0,6492.0,6686.0,6857.0,7624.0,7948.0,8508.0,7669.0
Antigua and Barbuda,17012.0,16247.0,16495.0,17629.0,18938.0,20558.0,23622.0,26219.0,26436.0,23162.0,21426.0,21224.0,20393.0,19499.0,19674.0,19456.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,11652.0,12109.0,11029.0,10197.0,12191.0,13651.0,15215.0,16733.0,17689.0,16997.0,16691.0,17494.0,18562.0,18871.0,18215.0,17011.0
Vietnam,2492.0,2686.0,2884.0,3123.0,3426.0,3764.0,4110.0,4479.0,4778.0,5025.0,5356.0,5760.0,6329.0,6689.0,7203.0,7556.0
Yemen,2873.0,2954.0,3023.0,3096.0,3204.0,3381.0,3486.0,3588.0,3678.0,3734.0,3950.0,3417.0,3301.0,3479.0,3434.0,2423.0
Zambia,1619.0,1698.0,1757.0,1866.0,1999.0,2152.0,2327.0,2516.0,2686.0,2870.0,3108.0,3250.0,3348.0,3504.0,3467.0,3360.0


In [88]:
true_gdp.describe()

Unnamed: 0,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
count,187.0,188.0,190.0,191.0,192.0,192.0,192.0,193.0,193.0,193.0,193.0,193.0,193.0,193.0,193.0,193.0
mean,11810.919786,12314.101064,12564.078947,13089.089005,13962.505208,14743.270833,15729.46875,16933.238342,17219.310881,16774.797927,17445.53886,18215.65285,18786.0,19283.989637,19547.994819,19038.911917
std,15516.465819,15819.642318,16102.996856,16643.950522,17822.942082,18279.79023,19207.517827,20109.754935,19521.135194,19097.327716,20316.114831,21873.783468,22624.78675,22914.176524,22437.026825,20128.54505
min,438.0,426.0,419.0,422.0,434.0,445.0,468.0,494.0,518.0,521.0,547.0,578.0,581.0,656.0,720.0,787.0
25%,2435.0,2433.75,2536.5,2530.5,2690.75,2963.75,3033.75,3347.0,3424.0,3543.0,3696.0,3417.0,3654.0,3935.0,4237.0,4330.0
50%,6310.0,6533.0,6652.0,6938.0,7425.0,7781.0,8318.0,9077.0,9664.0,9803.0,10029.0,10243.0,10555.0,11106.0,11411.0,11945.0
75%,14169.5,15733.0,16413.75,16803.5,18149.0,20103.5,21914.25,25750.0,25037.0,23162.0,22520.0,22862.0,24646.0,26045.0,26754.0,26810.0
max,102898.0,100721.0,105668.0,106752.0,119789.0,116805.0,130807.0,133290.0,118421.0,127921.0,145583.0,163740.0,169698.0,161194.0,143222.0,109014.0


I can't think of anything better than completely replacing the original GDP values with the data downloaded from Wikipedia.

In [89]:
data.loc[data.Country.str.startswith('Venezue'),'Country'] = 'Venezuela'
data.loc[data.Country.str.startswith('Bolivia'),'Country'] = 'Bolivia'
data.loc[data.Country.str.startswith('Viet'),'Country'] = 'Vietnam'
data.loc[data.Country.str.startswith('Syria'),'Country'] = 'Syria'
data.loc[data.Country.str.startswith('United Kingdom'),'Country'] = 'United Kingdom'
data.loc[data.Country.str.startswith('United States'),'Country'] = 'United States'
data.loc[data.Country.str.contains('Tanzania'),'Country'] = 'Tanzania'
data.loc[data.Country.str.startswith('Russia'),'Country'] = 'Russia'
data.loc[data.Country.str.contains('Moldova'),'Country'] = 'Moldova'
data.loc[data.Country.str.startswith('Micron'),'Country'] = 'Federated States of Micronesia'
data.loc[data.Country.str.startswith('Iran'),'Country'] = 'Iran'
data.loc[data.Country.str.startswith('Czech'),'Country'] = 'Czech Republic'
data.loc[data.Country.str.startswith('Congo'),'Country'] = 'Republic of the Congo'
data.loc[data.Country.str.startswith('Swaziland'),'Country'] = 'Eswatini'
data.loc[data.Country.str.contains('Yugoslav'),'Country'] = 'North Macedonia'
data.loc[data.Country.str.startswith('Republic of Korea'),'Country'] = 'South Korea'
data.loc[data.Country.str.startswith('Cabo Verde'),'Country'] = 'Cape Verde'
data.loc[data.Country.str.startswith('Brunei'),'Country'] = 'Brunei'
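
The renaming rules above can also be sketched as a lookup table that is applied in one pass. The helper name `normalize_country`, the `PREFIX_RULES` list and the sample names are illustrative assumptions, and only a few of the rules are reproduced here.

```python
import pandas as pd

# Hypothetical subset of the rules above: (prefix, canonical name).
PREFIX_RULES = [
    ("Venezue", "Venezuela"),
    ("Viet", "Vietnam"),
    ("Czech", "Czech Republic"),
    ("Republic of Korea", "South Korea"),
]

def normalize_country(name: str) -> str:
    """Return the canonical name for the first matching prefix, else the name unchanged."""
    for prefix, canonical in PREFIX_RULES:
        if name.startswith(prefix):
            return canonical
    return name

# Illustrative input names; the real column is `data['Country']`.
sample = pd.Series([
    "Venezuela (Bolivarian Republic of)",
    "Viet Nam",
    "Czechia",
    "Republic of Korea",
    "France",
])
normalized = sample.map(normalize_country)
print(normalized.tolist())
```

Keeping the rules in one list makes it easier to see which names were touched and to add the `str.contains` cases as a second list if needed.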

In [90]:
def get_true_gdp(country, year, original):
    if country not in true_gdp.index:
        print(country, 'doesn\'t exist')
        return original
    return true_gdp.at[country, str(year)]
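
A row-by-row lookup like this can also be written as a reshape plus a merge. The sketch below uses toy stand-ins (`wide`, `long_rows`) rather than the real `true_gdp` and `data`; the reference figures come from the table above, while the original `GDP` values in `long_rows` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `true_gdp`: one row per country, one column per year.
wide = pd.DataFrame(
    {"2014": [2120.0, 11259.0], "2015": [2136.0, 11662.0]},
    index=pd.Index(["Afghanistan", "Albania"], name="Country"),
)
# Toy stand-in for `data`; the GDP values here are made up.
long_rows = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Cuba"],
    "Year": [2015, 2014, 2015],
    "GDP": [584.26, 4575.76, 1234.5],
})

# Reshape the wide table to (Country, Year, GDP_ref) so it can be merged.
ref = wide.reset_index().melt(id_vars="Country", var_name="Year", value_name="GDP_ref")
ref["Year"] = ref["Year"].astype(int)

merged = long_rows.merge(ref, on=["Country", "Year"], how="left")
# Countries missing from the reference table keep their original value.
known_country = merged["Country"].isin(wide.index)
merged["GDP"] = merged["GDP_ref"].where(known_country, merged["GDP"])
merged = merged.drop(columns="GDP_ref")
print(merged)
```

As in `get_true_gdp`, a country that is present in the reference table but has no value for a given year ends up as NaN.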

In [91]:
data.GDP = data.apply(lambda row: get_true_gdp(row.Country, row.Year, row.GDP), axis=1)

Côte d'Ivoire doesn't exist
Côte d'Ivoire doesn't exist
Côte d'Ivoire doesn't exist
Côte d'Ivoire doesn't exist
Côte d'Ivoire doesn't exist
Côte d'Ivoire doesn't exist
Côte d'Ivoire doesn't exist
Côte d'Ivoire doesn't exist
Côte d'Ivoire doesn't exist
Côte d'Ivoire doesn't exist
Côte d'Ivoire doesn't exist
Côte d'Ivoire doesn't exist
Côte d'Ivoire doesn't exist
Côte d'Ivoire doesn't exist
Côte d'Ivoire doesn't exist
Côte d'Ivoire doesn't exist
Cook Islands doesn't exist
Cuba doesn't exist
Cuba doesn't exist
Cuba doesn't exist
Cuba doesn't exist
Cuba doesn't exist
Cuba doesn't exist
Cuba doesn't exist
Cuba doesn't exist
Cuba doesn't exist
Cuba doesn't exist
Cuba doesn't exist
Cuba doesn't exist
Cuba doesn't exist
Cuba doesn't exist
Cuba doesn't exist
Cuba doesn't exist
Democratic People's Republic of Korea doesn't exist
Democratic People's Republic of Korea doesn't exist
Democratic People's Republic of Korea doesn't exist
Democratic People's Republic of Korea doesn't exist
Democratic Pe

In [92]:
null_counts()

Life expectancy                     10
Adult Mortality                     10
Alcohol                            194
Hepatitis B                        553
 BMI                                34
Polio                               19
Total expenditure                  226
Diphtheria                          19
GDP                                 89
Population                         652
 thinness  1-19 years               34
 thinness 5-9 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64

### Population filling
The population values are unrealistic as well (as shown earlier for Georgia; most likely the decimal point is misplaced), so I'll also take them from Wikipedia.

In [93]:
pop_2005 = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_population_in_2005')[1]
pop_2005.rename(columns={'Country / Territory':'Country', 'PopulationJuly 2005UN estimate': '2005'}, inplace=True)
# raw strings keep the backslashes in the patterns from being read as string escapes
pop_2005.Country = pop_2005['Country'].str.replace(r'\[[0-9]*\]','',regex=True)
pop_2005.Country = pop_2005['Country'].str.replace(r' \(.*\).*$','',regex=True)
pop_2005.drop(columns=['Change from 2000*', 'Rank'], inplace=True)

In [94]:
pop_2010 = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_population_in_2010')[1]
pop_2010.rename(columns={'Country / territory':'Country', 'Population2010(OECD estimate)': '2010'}, inplace=True)
# raw strings keep the backslashes in the patterns from being read as string escapes
pop_2010.Country = pop_2010['Country'].str.replace(r'\[[0-9]*\]','',regex=True)
pop_2010.Country = pop_2010['Country'].str.replace(r' \(.*\).*$','',regex=True)
pop_2010 = pop_2010.iloc[:-1,[1,2]]

In [95]:
pop_2015 = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_population_in_2015')[1]
pop_2015.rename(columns={'Country / territory':'Country', 'Population2015(UN estimate)': '2015'}, inplace=True)
# raw strings keep the backslashes in the patterns from being read as string escapes
pop_2015.Country = pop_2015['Country'].str.replace(r'\[[0-9]*\]','',regex=True)
pop_2015.Country = pop_2015['Country'].str.replace(r' \(.*\).*$','',regex=True)
pop_2015 = pop_2015.iloc[:-1,[1,2]]
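
The two substitutions applied to the `Country` column in the three cells above can be illustrated on plain strings. This is a minimal sketch; the helper name `clean_country` and the example names are assumptions made for illustration, not values taken from the scraped tables.

```python
import re

FOOTNOTE = re.compile(r"\[[0-9]*\]")       # footnote markers such as "[5]"
PARENTHETICAL = re.compile(r" \(.*\).*$")  # " (…)" and everything after it

def clean_country(name: str) -> str:
    """Apply the two substitutions in the same order as the cells above."""
    name = FOOTNOTE.sub("", name)
    return PARENTHETICAL.sub("", name)

examples = ["France[5]", "Somalia (including Somaliland)", "Japan"]
cleaned = [clean_country(n) for n in examples]
print(cleaned)
```

Since the same cleanup is repeated for 2005, 2010 and 2015, a single helper of this kind could be applied to each table.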

In [96]:
true_population = pd.DataFrame(data = pop_2005.Country)
true_population = true_population.join(pop_2005.set_index('Country'), on = 'Country', how='outer')
true_population = true_population.join(pop_2010.set_index('Country'), on = 'Country', how='outer')
true_population = true_population.join(pop_2015.set_index('Country'), on = 'Country', how='outer').set_index('Country')

In [97]:
true_population['2005'] = pd.to_numeric(true_population['2005'])
true_population['2010'] = pd.to_numeric(true_population['2010'])
true_population['2015'] = pd.to_numeric(true_population['2015'])

In [98]:
true_population

Unnamed: 0_level_0,2005,2010,2015
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
World,6.464750e+09,6.843523e+09,7.379797e+09
China,1.307593e+09,1.339725e+09,1.376049e+09
India,1.103371e+09,1.182106e+09,1.311051e+09
United States,2.955200e+08,3.093497e+08,3.214188e+08
Indonesia,2.227810e+08,2.376413e+08,2.575638e+08
...,...,...,...
Timor-Leste,,1.149028e+06,1.149028e+06
Bahamas,,3.536580e+05,3.536580e+05
Mayotte,,2.020000e+05,2.020000e+05
Curacao,,1.421800e+05,1.421800e+05


Time to replace the values in our dataframe with approximately real ones.

In [99]:
def set_population(country, year, current):
    col = ''
    if year <= 2005:
        col = '2005'
    elif year <= 2010:
        col = '2010'
    else:
        col = '2015'
    
    if country not in true_population.index:
        print (country, 'isn\'t in list')
        return current
    else:
        return true_population.at[country, col]
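
`set_population` assigns every year to the next available reference year, so the population is constant within each five-year bucket. A possible refinement, shown here only as a sketch and not as what the notebook does, is to interpolate linearly between the three reference years. The helper name `interpolate_population` is hypothetical; the United States figures are the rounded values displayed in the table above.

```python
import numpy as np

REF_YEARS = [2005, 2010, 2015]

def interpolate_population(year: int, ref_values: list) -> float:
    """Linearly interpolate between the reference years; years outside
    2005-2015 are clamped to the nearest reference value."""
    return float(np.interp(year, REF_YEARS, ref_values))

# Rounded reference values for the United States from the table above.
usa = [2.955200e+08, 3.093497e+08, 3.214188e+08]

print(interpolate_population(2005, usa))  # exactly the 2005 reference value
print(interpolate_population(2008, usa))  # between the 2005 and 2010 values
print(interpolate_population(2000, usa))  # clamped to the 2005 value
```

Countries with a missing reference value (NaN in the table) would need separate handling before interpolating.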

In [100]:
data.apply(lambda row: set_population(row['Country'], row['Year'], row['Population']), axis = 1)

Democratic People's Republic of Korea isn't in list
Democratic People's Republic of Korea isn't in list
Democratic People's Republic of Korea isn't in list
Democratic People's Republic of Korea isn't in list
Democratic People's Republic of Korea isn't in list
Democratic People's Republic of Korea isn't in list
Democratic People's Republic of Korea isn't in list
Democratic People's Republic of Korea isn't in list
Democratic People's Republic of Korea isn't in list
Democratic People's Republic of Korea isn't in list
Democratic People's Republic of Korea isn't in list
Democratic People's Republic of Korea isn't in list
Democratic People's Republic of Korea isn't in list
Democratic People's Republic of Korea isn't in list
Democratic People's Republic of Korea isn't in list
Democratic People's Republic of Korea isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Rep

0       24485600.0
1       24485600.0
2       24485600.0
3       24485600.0
4       24485600.0
           ...    
2933    13010000.0
2934    13010000.0
2935    13010000.0
2936    13010000.0
2937    13010000.0
Length: 2938, dtype: float64

In [101]:
tmp = true_population.index.to_series()
tmp[tmp.str.contains('Korea')]

Country
South Korea    South Korea
North Korea    North Korea
Name: Country, dtype: object

In [102]:
def replace_index_value(a, b):
    as_list = true_population.index.tolist()
    idx = as_list.index(a)
    as_list[idx] = b
    true_population.index = as_list

In [103]:
replace_index_value('Swaziland', 'Eswatini')
replace_index_value('Macedonia', 'North Macedonia')
replace_index_value('North Korea', 'Democratic People\'s Republic of Korea')

In [104]:
data['Population'] = data.apply(lambda row: set_population(row['Country'], row['Year'], row['Population']), axis = 1)

Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Lao People's Democratic Republic isn't in list
Sao Tome and Principe isn't in list
Sao Tome and Principe isn't in list
Sao Tome and Principe isn't in list
Sao Tome and Principe isn't in list
Sao Tome and Principe isn't in list
Sao Tome and Principe isn't in list
Sao Tome and Principe isn't in l

In [105]:
null_counts()

Life expectancy                     10
Adult Mortality                     10
Alcohol                            194
Hepatitis B                        553
 BMI                                34
Polio                               19
Total expenditure                  226
Diphtheria                          19
GDP                                 89
Population                          89
 thinness  1-19 years               34
 thinness 5-9 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64

In [106]:
data.Status.unique()

array(['Developing', 'Developed'], dtype=object)

Convert the Status column to a numeric value.

In [107]:
data['Status'] = data.Status.replace({'Developing':0, 'Developed':1})

### Filling columns which don't miss much data
There are some columns where only a few values are missing. I'll fill them with the mean value for the given country. If the country's mean in that column is NaN, I'll leave the value as it is and let the KNN imputer handle it later.

In [108]:
def fill_by_country_mean(column):
    # replace a single missing value with the mean of its country
    def set_mean(idx, country, col):
        # the mean is NaN when the country has no known values in this column,
        # in which case the entry simply stays NaN
        data.loc[idx, col] = data[data['Country']==country][col].mean()
        
    # rows whose value in the column is NaN
    null_column = data[data[column].isnull()]
    # determine the value for each row with a missing entry
    null_column.apply(lambda row: set_mean(row.name, row['Country'], column), axis=1)
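
The same per-country fill can be sketched with `groupby(...).transform('mean')`, which avoids the row-wise `apply`. The toy frame below is made up for illustration; country `B` has no known values, so its entries stay NaN, as in the function above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Polio": [90.0, np.nan, 94.0, np.nan, np.nan],
})

# Per-country mean broadcast back to every row; NaN where a country has no data.
country_mean = toy.groupby("Country")["Polio"].transform("mean")
toy["Polio"] = toy["Polio"].fillna(country_mean)
print(toy)
```

Note that in either form the country mean is computed over all rows, including rows that later end up in the validation and test sets.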

Try to fill the columns that are missing less than 3% of their values.

In [138]:
for col in data.columns[(data.isnull().sum() < (3 * len(data) / 100)) & (data.isnull().sum() > 0)]:
    if col != 'Life expectancy ':
        print (col)
        fill_by_country_mean(col)

Adult Mortality
 BMI 
Polio
Diphtheria 
 thinness  1-19 years
 thinness 5-9 years


In [139]:
null_counts()

Life expectancy                     10
Adult Mortality                     10
Alcohol                            194
Hepatitis B                        553
 BMI                                34
Total expenditure                  226
GDP                                 89
Population                          89
 thinness  1-19 years               34
 thinness 5-9 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64

### Normalize and impute missing values
At this point I was not sure which path would be more correct:
* normalize all data **->** KNN-impute missing values (all but Life expectancy) **->** remove rows with unknown Life expectancy **->** split the data into train, validation and test sets
* remove rows with unknown Life expectancy **->** split the remaining data into train, validation and test sets **->** normalize each dataframe separately **->** KNN-impute missing values separately

I decided to take the second path, because this way the dataframes do not influence each other, which gives more realistic error estimates.

First, let's separate the rows with unknown Life expectancy from the rest.

In [140]:
unknown = data[data['Life expectancy '].isnull()].drop(columns=['Life expectancy '])
known = data[data['Life expectancy '].notnull()]

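Split the data into **train**, **validation** and **test** sets. The first split keeps 60% for training; the second gives 40% of the remainder to the test set, so the result is roughly 60% train, 24% validation and 16% test.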

In [141]:
rnd = 12345
Xtrain, Xrest, ytrain, yrest = train_test_split(
    known.drop(columns=['Life expectancy ', 'Country']),
    known['Life expectancy '],
    test_size = 0.4,
    random_state = rnd
)

Xval, Xtest, yval, ytest = train_test_split(Xrest, yrest, test_size = 0.4, random_state = rnd)
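
Chaining two calls with `test_size = 0.4` does not produce equal validation and test sets: the second call takes 40% of the remaining 40%. A quick check on dummy arrays of the same length (2928 rows with a known target, per the `describe()` count above) makes the proportions explicit. The arrays are placeholders, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 2928  # rows with a known target
X = np.arange(n_rows).reshape(-1, 1)  # placeholder features
y = np.zeros(n_rows)                  # placeholder target

X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=12345)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.4, random_state=12345)

for name, part in [("train", X_train), ("validation", X_val), ("test", X_test)]:
    print(f"{name}: {len(part)} rows ({len(part) / n_rows:.0%})")
```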

Normalize all sets separately so they don't influence one another.

In [144]:
scaler = MinMaxScaler()
Xtrain = pd.DataFrame(scaler.fit_transform(Xtrain), index=Xtrain.index, columns=Xtrain.columns)
Xval = pd.DataFrame(scaler.fit_transform(Xval), index=Xval.index, columns=Xval.columns)
Xtest = pd.DataFrame(scaler.fit_transform(Xtest), index=Xtest.index, columns=Xtest.columns)

unknown_normalized = unknown.drop(columns=['Country'])
unknown_normalized = pd.DataFrame(
    scaler.fit_transform(unknown_normalized),
    index=unknown_normalized.index,
    columns=unknown_normalized.columns
)

  data_min = np.nanmin(X, axis=0)
  data_max = np.nanmax(X, axis=0)


In [145]:
unknown

Unnamed: 0,Country,Year,Status,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
624,Cook Islands,2013,0,,0,0.01,0.0,98.0,0,82.8,...,98.0,3.58,98.0,0.1,,,0.1,0.1,,
769,Dominica,2013,0,,0,0.01,11.419555,96.0,0,58.4,...,96.0,5.58,96.0,0.1,10261.0,,2.7,2.6,0.721,12.7
1650,Marshall Islands,2013,0,,0,0.01,871.878317,8.0,0,81.6,...,79.0,17.24,79.0,0.1,3155.0,,0.1,0.1,,0.0
1715,Monaco,2013,0,,0,0.01,0.0,99.0,0,,...,99.0,4.3,99.0,0.1,,,,,,
1812,Nauru,2013,0,,0,0.01,15.606596,87.0,0,87.3,...,87.0,4.65,87.0,0.1,7646.0,,0.1,0.1,,9.6
1909,Niue,2013,0,,0,0.01,0.0,99.0,0,77.3,...,99.0,7.2,99.0,0.1,,,0.1,0.1,,
1958,Palau,2013,0,,0,,344.690631,99.0,0,83.3,...,99.0,9.27,99.0,0.1,12820.0,,0.1,0.1,0.779,14.2
2167,Saint Kitts and Nevis,2013,0,,0,8.54,0.0,97.0,0,5.2,...,96.0,6.14,96.0,0.1,20970.0,,3.7,3.6,0.749,13.4
2216,San Marino,2013,0,,0,0.01,0.0,69.0,0,,...,69.0,6.5,69.0,0.1,53898.0,,,,,15.1
2713,Tuvalu,2013,0,,0,0.01,78.281203,9.0,0,79.3,...,9.0,16.61,9.0,0.1,3075.0,,0.2,0.1,,0.0


It seems that some columns in the unknown dataset consist entirely of NaN values. Because of that, I have to run the KNN imputer on the data before it is split. Once the data is imputed, I'll split everything again and re-normalize the pieces separately.

In [149]:
normalized = data.drop(columns=['Country', 'Life expectancy '])
normalized = pd.DataFrame(scaler.fit_transform(normalized), index=normalized.index, columns=normalized.columns)

#### Impute missing values with KNNImputer

In [150]:
imputer = KNNImputer(n_neighbors=5, weights='distance')
filled_data = pd.DataFrame(
    imputer.fit_transform(normalized),
    index=normalized.index,
    columns=normalized.columns)
filled_data['Life expectancy '] = data['Life expectancy ']
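
To see what `KNNImputer` with `weights='distance'` does, here is a toy sketch on a made-up 4x2 array with `n_neighbors=2` (the notebook uses 5). The missing entry is replaced by a weighted average of that column over the nearest rows, with closer rows weighted more heavily.

```python
import numpy as np
from sklearn.impute import KNNImputer

toy = np.array([
    [0.0, 0.0],
    [0.1, 0.2],
    [0.9, np.nan],  # value to impute
    [1.0, 1.0],
])

imputer = KNNImputer(n_neighbors=2, weights="distance")
filled = imputer.fit_transform(toy)
print(filled)
```

The row `[0.9, nan]` is closest to `[1.0, 1.0]`, so the imputed value lands much nearer to 1.0 than to the second neighbour's 0.2.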

In [151]:
filled_data.isnull().sum()

Year                                0
Status                              0
Adult Mortality                     0
infant deaths                       0
Alcohol                             0
percentage expenditure              0
Hepatitis B                         0
Measles                             0
 BMI                                0
under-five deaths                   0
Polio                               0
Total expenditure                   0
Diphtheria                          0
 HIV/AIDS                           0
GDP                                 0
Population                          0
 thinness  1-19 years               0
 thinness 5-9 years                 0
Income composition of resources     0
Schooling                           0
Life expectancy                    10
dtype: int64

Re-split into unknown, train, validation and test sets, then normalize all of them.

In [153]:
unknown = filled_data[filled_data['Life expectancy '].isnull()].drop(columns=['Life expectancy '])
known = filled_data[filled_data['Life expectancy '].notnull()]

Xtrain, Xrest, ytrain, yrest = train_test_split(
    known.drop(columns=['Life expectancy ']),
    known['Life expectancy '],
    test_size = 0.4,
    random_state = rnd
)

Xval, Xtest, yval, ytest = train_test_split(Xrest, yrest, test_size = 0.4, random_state = rnd)

Renormalize.

In [158]:
Xtrain = pd.DataFrame(scaler.fit_transform(Xtrain), index=Xtrain.index, columns=Xtrain.columns)
Xval = pd.DataFrame(scaler.fit_transform(Xval), index=Xval.index, columns=Xval.columns)
Xtest = pd.DataFrame(scaler.fit_transform(Xtest), index=Xtest.index, columns=Xtest.columns)
unknown_normalized = pd.DataFrame(scaler.fit_transform(unknown), index=unknown.index, columns=unknown.columns)
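
One thing to keep in mind with per-set `fit_transform` is that each set is stretched to its own [0, 1] range, so the same raw value can map to different scaled values in different sets. A commonly used alternative, sketched below on made-up numbers and not what the cells above do, is to fit the scaler on the training set only and reuse it for the other sets.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [50.0], [100.0]])
test = np.array([[40.0], [60.0]])

# Separate scaler: the test set is stretched to its own [0, 1] range.
separate = MinMaxScaler().fit_transform(test)

# Shared scaler: the range is learned from the training set only.
shared_scaler = MinMaxScaler().fit(train)
shared = shared_scaler.transform(test)

print(separate.ravel())  # 40 and 60 become the extremes of the range
print(shared.ravel())    # 40 and 60 keep their position relative to train
```

With the shared scaler no information flows from the validation or test data into the transformation, and a model trained on `train` sees test features on the scale it was fitted on.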

## Regression

Fit linear regression (there is nothing to tune here) and evaluate it on the test set.

In [210]:
lin_reg = LinearRegression().fit(Xtrain,ytrain)
display(pd.DataFrame(data=[lin_reg.coef_.tolist()], columns=Xtrain.columns).transpose().rename(columns={0:'weight'}))
print('test set MAE being:', metrics.mean_absolute_error(ytest, lin_reg.predict(Xtest)))

Unnamed: 0,weight
Year,-0.526166
Status,0.823916
Adult Mortality,-11.903116
infant deaths,152.75971
Alcohol,-0.214295
percentage expenditure,4.555501
Hepatitis B,0.214281
Measles,-2.553694
BMI,2.278471
under-five deaths,-159.677489


test set MAE being: 2.9139658115621936


Tune ridge regression: find the `alpha` that minimizes MAE on the validation set.

In [206]:
def get_model(Xtrain, ytrain, Xval, yval)->Callable[[float],float]:
    def inner(alpha: float)->float:
        model = Ridge(alpha=alpha).fit(Xtrain,ytrain)
        # I'll try to minimize MAE on validation dataframe
        return metrics.mean_absolute_error(yval, model.predict(Xval))
    
    return inner

def get_ridge_and_lambda(Xtrain, ytrain, Xval, yval) -> Tuple[Ridge, float]:
    opt_function = get_model(Xtrain, ytrain, Xval, yval)
    opt_alpha = optimize.minimize_scalar(
        opt_function,
        options={'maxiter':30},
        method='bounded',
        bounds=(0,400)
    )
    
    best_model = Ridge(alpha=opt_alpha.x).fit(Xtrain, ytrain)
    return best_model, opt_alpha.x
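
`minimize_scalar` with `method='bounded'` performs a local search on the interval, and validation MAE as a function of `alpha` is not guaranteed to have a single minimum. A simple cross-check, shown here as a sketch on synthetic data rather than the notebook's dataset, is to scan a log-spaced grid of `alpha` values and keep the best one. The grid bounds are an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# Synthetic regression problem (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=200)
X_train, y_train, X_val, y_val = X[:120], y[:120], X[120:], y[120:]

alphas = np.logspace(-4, 3, 15)  # 1e-4 ... 1e3, evenly spaced on a log scale
val_mae = [
    mean_absolute_error(y_val, Ridge(alpha=a).fit(X_train, y_train).predict(X_val))
    for a in alphas
]
best_alpha = alphas[int(np.argmin(val_mae))]
print("best alpha:", best_alpha, "validation MAE:", min(val_mae))
```

The grid result can then serve as a sanity check, or as a starting bracket, for the bounded optimizer.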

In [209]:
ridge, alpha = get_ridge_and_lambda(Xtrain, ytrain, Xval, yval)
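print('with alpha being', alpha, 'validation MSE is', metrics.mean_squared_error(yval, ridge.predict(Xval)))
print('test MSE:', metrics.mean_squared_error(ytest, ridge.predict(Xtest)))

with alpha being 0.008500692529006997 validation MSE is 14.63411455733296
test MSE: 15.335183123651491


These values look much worse than the linear regression result, but the two are not comparable: this cell reports `mean_squared_error`, whereas the linear regression cell reports `mean_absolute_error`. With alpha this close to zero, ridge regression should behave almost the same as plain linear regression, so the MAE (the metric the assignment asks for) should be computed with `metrics.mean_absolute_error` before comparing the two models.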
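
MAE and MSE live on different scales, which matters when the ridge numbers above are compared with the linear regression result. A toy illustration with made-up values, computed by hand so that the definitions are visible (`metrics.mean_absolute_error` and `metrics.mean_squared_error` compute the same two quantities):

```python
y_true = [70.0, 65.0, 80.0, 75.0]
y_pred = [72.0, 61.0, 79.0, 80.0]

errors = [p - t for p, t in zip(y_pred, y_true)]  # 2, -4, -1, 5
mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e * e for e in errors) / len(errors)    # mean squared error
rmse = mse ** 0.5                                 # back in the units of the target

print(mae, mse, rmse)
```

Here the MAE is 3.0 while the MSE is 11.5, so an MSE in the mid-teens is not, by itself, evidence that a model is worse than one with an MAE near 3.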

Try a classical decision tree regressor.

In [164]:
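param_grid = {
    'max_depth': range(1,40),
    # note: scikit-learn 1.0 renamed 'mse' and 'mae' to 'squared_error' and
    # 'absolute_error'; the old names were removed in 1.2
    'criterion': ['mse', 'friedman_mse', 'mae'],
    'min_samples_leaf': range(1,10),
    'min_samples_split': range(2, 8)
}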
param_comb = ParameterGrid(param_grid)

errors = []

counter = 0
for params in param_comb:
    regressor = DecisionTreeRegressor(**params)
    regressor.fit(Xtrain, ytrain)
    errors.append(metrics.mean_absolute_error(yval, regressor.predict(Xval)))
    counter+=1
    sys.stdout.write('\r{:.3f}% done'.format(counter/len(param_comb)*100))


best_tree_params = param_comb[np.argmin(errors)]
print('best params are\n', best_tree_params, '\nwith MAE score of', min(errors))

regressor = DecisionTreeRegressor(**best_tree_params).fit(Xtrain, ytrain)
print('test set MAE being:', metrics.mean_absolute_error(ytest, regressor.predict(Xtest)))

100.000% donebest params are
 {'min_samples_split': 5, 'min_samples_leaf': 6, 'max_depth': 17, 'criterion': 'mae'} 
with MAE score of 1.7543385490753909
test set MAE being: 1.9655650319829423
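
The manual loop over `ParameterGrid` selects hyperparameters on a single validation split. An alternative, sketched here on synthetic data with a deliberately small grid, is `GridSearchCV`, which scores every combination by cross-validation. The data and grid values are made up, and this is not the procedure used above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (made up for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))
y = 10 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(scale=0.1, size=300)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]},
    scoring="neg_mean_absolute_error",  # scores are maximized, hence the sign
    cv=5,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Cross-validation costs more fits per combination, so a grid as large as the one above would take correspondingly longer.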


## Predict unknown
Even though the decision tree regressor yielded the best results, this task is mostly about linear regression, so I'll predict the unknown Life expectancy with plain linear regression.

In [226]:
final = unknown.copy()
final['Life expectancy '] = lin_reg.predict(final)
final['Country'] = data.loc[final.index, 'Country']  # look up by index label, not by position
display(final[['Country', 'Life expectancy ']])

Unnamed: 0,Country,Life expectancy
624,Cook Islands,71.6207
769,Dominica,73.220641
1650,Marshall Islands,57.701541
1715,Monaco,71.880824
1812,Nauru,69.561908
1909,Niue,69.946867
1958,Palau,75.793836
2167,Saint Kitts and Nevis,72.40047
2216,San Marino,74.648956
2713,Tuvalu,54.114498
