### Classification (Supervised Learning)

The **Human Development Index (HDI)** takes values between 0.0 and 1.0 inclusive, so training a model directly on those values to predict the 2020 HDI of the 9 countries would be a regression problem. To treat it as a classification problem instead, we mask the HDI into 4 levels, following https://worldpopulationreview.com/country-rankings/hdi-by-country:
* very high (0.8-1.0), masked as 0
* high (0.7-0.79), masked as 1
* medium (0.55-0.69), masked as 2
* low (< 0.55), masked as 3
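As a quick check of the thresholds listed above, the mapping can be sketched with `pd.cut` using left-closed bins. The helper name `hdi_to_level` and the constants `HDI_BINS` and `HDI_LEVELS` are hypothetical and not part of the notebook; the sketch also assumes every input is a valid HDI score (no missing values, nothing above 1.0).

```python
import pandas as pd

# Left-closed bins: [-inf, 0.55), [0.55, 0.7), [0.7, 0.8), [0.8, inf)
HDI_BINS = [float("-inf"), 0.55, 0.7, 0.8, float("inf")]
HDI_LEVELS = [3, 2, 1, 0]  # low, medium, high, very high


def hdi_to_level(hdi: pd.Series) -> pd.Series:
    """Map raw HDI scores to class indices (hypothetical helper)."""
    # right=False makes each bin include its lower edge, so 0.8 lands in "very high".
    return pd.cut(hdi, bins=HDI_BINS, labels=HDI_LEVELS, right=False).astype(int)


sample = pd.Series([0.95, 0.8, 0.79, 0.7, 0.55, 0.4])
levels = hdi_to_level(sample)
print(levels.tolist())  # [0, 0, 1, 1, 2, 3]
```

Boundary values such as 0.8, 0.7 and 0.55 fall into the higher of the two adjacent levels, which is the behaviour the list describes.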

In [32]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [33]:
LABEL = "human_development_index"
static_fields = [LABEL, "year_num", "country_name"]

In [34]:
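def mask_hdi(label: pd.DataFrame) -> None:
    """Replace the raw HDI values in `label` with class indices 0-3, in place."""
    # Assigning to rows yielded by iterrows() is not guaranteed to write back
    # to the DataFrame, so build the levels with vectorized masks instead.
    hdi = label[LABEL]
    levels = pd.Series(3, index=label.index)  # default: low (< 0.55)
    levels[hdi.between(0.55, 0.7, inclusive="left")] = 2  # medium
    levels[hdi.between(0.7, 0.8, inclusive="left")] = 1  # high
    levels[hdi.between(0.8, 1.0)] = 0  # very high, both ends inclusive
    label[LABEL] = levels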

In [36]:
train_set = pd.read_csv("../data_imputed/train.csv")
train_data = train_set.drop(columns=static_fields)
train_label = train_set[[LABEL]].copy()  # HDI; copied so masking never touches train_set
mask_hdi(train_label)
train_label = train_label.astype(int)
train_data

Unnamed: 0,secondary_school_enrollment_percent_of_gross,life_expectancy_female,life_expectancy_male,tertiary_school_enrollment_percent_of_gross,government_health_expenditure_percent_of_gdp,birth_rate,hospital_beds_per_1000,prevalence_of_overweight_adult,fertility_rate,growth_rate,current_health_expenditure_percent_of_gdp,basic_drinking_water_rate_rural,basic_drinking_water_rate,prevalence_of_undernourishment,open_defecation_rate,open_defecation_rate_rural,open_defecation_rate_urban,prevalence_of_hiv,capital_health_expenditure_percent_of_gdp
0,1.127325,1.159917,1.165150,1.004027,1.390093,-1.305304,1.925774,0.704147,-1.046742,-0.955096,0.620098,0.856394,0.775509,-0.915383,-0.764583,-0.841343,-0.727362,-0.324478,0.636980
1,0.821231,0.885589,0.816180,1.687073,1.401717,-0.932006,2.043557,0.874690,-0.639155,-0.990006,2.006198,0.717585,0.750988,-0.915383,-0.764583,-0.841343,-0.727362,-0.299879,1.446801
2,0.205614,0.655043,0.524369,-0.512046,-0.193247,-0.082093,-0.488784,0.719651,-0.291563,-0.231936,-0.176099,0.011330,0.380427,-0.647064,-0.235156,0.310192,-0.131227,-0.410573,-0.828543
3,-1.162674,-0.732180,-0.544924,-1.039692,-0.845085,0.175483,-1.242598,-1.590422,0.086978,0.019513,-0.685022,-0.127010,-0.260867,1.781928,3.743687,3.476385,3.493500,-0.459770,-0.656766
4,-0.863992,-0.320688,-0.317491,-0.784512,-0.856922,-0.079458,-1.077701,-1.301017,-0.276485,-0.353926,-0.998413,-0.567814,-0.461140,1.442999,1.171635,1.303083,1.626866,-0.472069,-0.524060
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
130,0.415835,0.212607,0.163004,-0.201701,-0.592866,-0.520719,-0.514697,0.485026,-0.455836,-0.719654,-0.905575,0.157152,0.287133,-0.350501,-0.207915,-0.097242,-0.134797,-0.349076,-0.526395
131,0.357700,0.642534,0.887538,0.898119,0.134995,-0.446718,0.102488,0.756860,-0.568525,-0.329402,0.041534,0.613724,0.654499,-0.491722,-0.737956,-0.769504,-0.727362,-0.533565,-0.310573
132,0.552170,0.254634,0.183461,-0.135710,-0.632924,0.353458,0.481750,0.766163,0.331404,0.627945,-0.448894,0.909792,0.785073,-0.505844,-0.764583,-0.841343,-0.727362,-0.533565,-0.711165
133,-1.258836,-0.327052,-0.469835,-1.064687,-0.328713,0.637824,-0.646614,-1.067425,0.444887,1.082792,-0.485763,-1.571499,-1.688471,2.233833,-0.079150,-0.159386,-0.436158,2.110772,0.866546


In [None]:
model_dt = DecisionTreeClassifier()
model_gb = GradientBoostingClassifier()
model_rf = RandomForestClassifier()
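
The three classifiers above still have to be fitted and scored. A minimal sketch of that step follows, assuming a held-out split and using synthetic data from `make_classification` as a stand-in for the real features and masked HDI labels. The variable names `X_train`, `X_test`, `y_train`, `y_test`, `models` and `reports`, the 75/25 split and the fixed `random_state` are all assumptions for illustration, not the notebook's actual evaluation setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): 19 features and 4 classes, like the masked HDI task.
X, y = make_classification(
    n_samples=200, n_features=19, n_informative=8, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # Per-class precision, recall and F1 on the held-out split.
    reports[name] = classification_report(y_test, predictions, zero_division=0)
    print(name)
    print(reports[name])
```

With the real data, `train_data` and `train_label[LABEL]` would take the place of the synthetic `X` and `y`; fixing `random_state` only makes repeated runs comparable.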