### Classification (Supervised Learning)

The **Human Development Index (HDI)** takes values between 0.0 and 1.0 inclusive. Training a model directly on those values to predict the 2020 HDI of the 9 countries would make this a regression problem. To frame it as a classification problem instead, we map the HDI to 4 levels according to https://worldpopulationreview.com/country-rankings/hdi-by-country:
* very high (0.80 to 1.00), masked as 0
* high (0.70 to below 0.80), masked as 1
* medium (0.55 to below 0.70), masked as 2
* low (< 0.55), masked as 3
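As a quick check on the thresholds listed above, the mapping can be sketched with `pd.cut`. This is an illustrative alternative, not the notebook's own `mask_hdi` function; the names `HDI_BINS`, `HDI_CODES`, and `hdi_to_level` are hypothetical, and the intervals are assumed to be closed on the left so that boundary values such as 0.70 fall into the higher level.

```python
import numpy as np
import pandas as pd

# Bin edges follow the four levels listed above; each interval is closed on the left.
HDI_BINS = [-np.inf, 0.55, 0.7, 0.8, np.inf]
HDI_CODES = [3, 2, 1, 0]  # low, medium, high, very high


def hdi_to_level(hdi: pd.Series) -> pd.Series:
    """Map continuous HDI values to the integer level codes 0-3."""
    levels = pd.cut(hdi, bins=HDI_BINS, labels=HDI_CODES, right=False)
    return levels.astype(int)


example = pd.Series([0.95, 0.80, 0.79, 0.70, 0.55, 0.54])
print(hdi_to_level(example).tolist())  # [0, 0, 1, 1, 2, 3]
```

The boundary values 0.80, 0.70, and 0.55 land in levels 0, 1, and 2 respectively, which is the behavior the comparisons in the masking code are written to produce.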

In [143]:
import pandas as pd
import numpy as np
import multiprocessing
import time

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report

np.random.seed(42)
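# Use roughly half of the cores, but never fewer than 1:
# n_jobs=0 is rejected by GridSearchCV, and -1 would mean "use all cores".
JOB_NUM = max(1, multiprocessing.cpu_count() // 2 - 1)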

#### Train and test the model
Use the data from 2005 to 2019

In [144]:
LABEL = "human_development_index"
static_fields = [LABEL, "year_num", "country_name"]

In [145]:
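def mask_hdi(label: pd.DataFrame) -> None:
    """Replace the HDI values in label[LABEL] with level codes 0-3, in place."""
    hdi = label[LABEL]
    conditions = [
        (hdi >= 0.8) & (hdi <= 1.0),  # very high
        (hdi >= 0.7) & (hdi < 0.8),   # high
        (hdi >= 0.55) & (hdi < 0.7),  # medium
    ]
    # Vectorized assignment: writing to the rows yielded by iterrows() is not
    # guaranteed to modify the DataFrame. Everything else becomes level 3 (low).
    label[LABEL] = np.select(conditions, [0, 1, 2], default=3)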

In [146]:
dataset = pd.read_csv("../data_imputed/dataset.csv")
train_set, test_set = train_test_split(dataset, test_size=0.2)

In [147]:
# get the train data and train labels
train_data = train_set.drop(columns=static_fields)
train_label = train_set[[LABEL]].copy()  # HDI; copy so masking does not touch train_set
mask_hdi(train_label)
train_label = train_label.astype(int)
train_label = train_label[LABEL]

In [148]:
# get the test data and test labels
test_data = test_set.drop(columns=static_fields)
test_label = test_set[[LABEL]].copy()  # HDI; copy so masking does not touch test_set
mask_hdi(test_label)
test_label = test_label.astype(int)
test_label = test_label[LABEL]

In [149]:
# create the models
model_dt = DecisionTreeClassifier()
model_gb = GradientBoostingClassifier()
model_rf = RandomForestClassifier()

In [150]:
# params for each model to be used in grid search
params_dt = [
    {
        'min_samples_split': [5, 10],
        'min_samples_leaf': [10, 20]
    }
]
params_gb = [
    {
        'n_estimators': [100, 200, 250, 300],
        'learning_rate': [1e-2, 1e-3],
        'min_samples_split': [5, 10]
    }
]
params_rf = [
    {
        'n_estimators': [100, 200, 250, 300],
        'min_samples_split': [5, 10],
        'random_state': [42, 50],
        'bootstrap': [False, True]
    }
]

In [151]:
scoring = {"accuracy": "accuracy", "bal_accuracy": "balanced_accuracy", "F1_macro": "f1_macro"}

# grid search for each model
grid_search_dt = GridSearchCV(
    model_dt,
    params_dt,
    cv=3,
    scoring=scoring,
    return_train_score=True,
    refit='bal_accuracy',
    n_jobs=JOB_NUM
)
grid_search_gb = GridSearchCV(
    model_gb,
    params_gb,
    cv=3,
    scoring=scoring,
    return_train_score=True,
    refit='bal_accuracy',
    n_jobs=JOB_NUM
)
grid_search_rf = GridSearchCV(
    model_rf,
    params_rf,
    cv=3,
    scoring=scoring,
    return_train_score=True,
    refit='bal_accuracy',
    n_jobs=JOB_NUM
)

grid_search_list = [
    {
        'model_name': 'DecisionTreeClassifier',
        'grid_search': grid_search_dt
    },
    {
        'model_name': 'GradientBoostingClassifier',
        'grid_search': grid_search_gb
    },
    {
        'model_name': 'RandomForestClassifier',
        'grid_search': grid_search_rf
    }
]
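To get a feel for the cost of each search, the number of candidate parameter combinations and cross-validation fits can be counted from the grids and `cv=3` used above. This is a minimal sketch using only the standard library; `param_grids`, `CV_FOLDS`, and `count_fits` are illustrative names, and scikit-learn's `ParameterGrid` reports the same candidate count through `len()`.

```python
from math import prod

# Grids copied from the parameter cell above.
param_grids = {
    "DecisionTreeClassifier": {
        "min_samples_split": [5, 10],
        "min_samples_leaf": [10, 20],
    },
    "GradientBoostingClassifier": {
        "n_estimators": [100, 200, 250, 300],
        "learning_rate": [1e-2, 1e-3],
        "min_samples_split": [5, 10],
    },
    "RandomForestClassifier": {
        "n_estimators": [100, 200, 250, 300],
        "min_samples_split": [5, 10],
        "random_state": [42, 50],
        "bootstrap": [False, True],
    },
}
CV_FOLDS = 3  # matches cv=3 in the GridSearchCV calls


def count_fits(grid: dict, folds: int) -> tuple[int, int]:
    """Return (number of candidates, number of cross-validation fits)."""
    candidates = prod(len(values) for values in grid.values())
    return candidates, candidates * folds


for name, grid in param_grids.items():
    candidates, fits = count_fits(grid, CV_FOLDS)
    print(f"{name}: {candidates} candidates, {fits} fits")
```

Because `refit` is set, each search also trains the selected candidate once more on the full training set after cross-validation.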

In [152]:
def print_grid_search_res(grid_search_list, train_data, train_labels, test_data, test_labels):
    for gs in grid_search_list:
        start_time_second = time.perf_counter()
        gs['grid_search'].fit(train_data, train_labels)
        
        print(gs['model_name'])
        print(f"best score: {gs['grid_search'].best_score_}")
        print(f"best parameters: {gs['grid_search'].best_params_}")
        
        # validation data
        cv_results = gs['grid_search'].cv_results_
        print(f"best accuracy (validation): {max(cv_results['mean_test_accuracy'])}")
        print(f"best f1_macro (validation): {max(cv_results['mean_test_F1_macro'])}")
        for mean_accuracy, mean_F1_macro, params in zip(cv_results['mean_test_accuracy'], cv_results['mean_test_F1_macro'], cv_results['params']):
            if params == gs['grid_search'].best_params_:
                if mean_accuracy == max(cv_results['mean_test_accuracy']) \
                    and mean_F1_macro == max(cv_results['mean_test_F1_macro']):
                    print("Verified")
        
        # test data
        test_predictions = gs['grid_search'].best_estimator_.predict(test_data)
        classification_report_test = classification_report(test_labels, test_predictions)
        print(f"Classification Report (Test):\n {classification_report_test}")
        end_time_second = time.perf_counter()
        print(f"Elapsed time: {end_time_second - start_time_second} seconds")
        print()
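The "Verified" check in the function above compares the selected candidate against the maximum of each metric. A more direct way to see what the selected candidate scored is to index `cv_results_` with `best_index_`, which GridSearchCV exposes when `refit` names a metric. The sketch below runs on synthetic stand-in data from `make_classification`, not on the notebook's dataset, and the names `search` and `selected_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook's dataset.csv is not used here.
X, y = make_classification(
    n_samples=120, n_features=8, n_informative=4, n_classes=3, random_state=0
)

scoring = {"accuracy": "accuracy", "bal_accuracy": "balanced_accuracy", "F1_macro": "f1_macro"}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"min_samples_split": [5, 10], "min_samples_leaf": [10, 20]},
    cv=3,
    scoring=scoring,
    refit="bal_accuracy",
)
search.fit(X, y)

best = search.best_index_  # row of cv_results_ chosen by the refit metric
selected_scores = {
    name: float(search.cv_results_[f"mean_test_{name}"][best]) for name in scoring
}
print(search.best_params_)
print(selected_scores)
```

Here `selected_scores["bal_accuracy"]` equals `search.best_score_`, while the other two entries show how the same candidate fares on accuracy and macro F1, whether or not it is the maximum for those metrics.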

In [153]:
print_grid_search_res(grid_search_list, train_data, train_label, test_data, test_label)

##### Result Summary

**DecisionTreeClassifier**

best score: 0.9127768065268066

best parameters: {'min_samples_leaf': 10, 'min_samples_split': 5}

best accuracy (validation): 0.9166666666666666

best f1_macro (validation): 0.9170676545676546

Verified

Classification Report (Test):
```
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       0.75      1.00      0.86         3
           2       0.90      0.90      0.90        10
           3       1.00      0.86      0.92         7

    accuracy                           0.93        27
   macro avg       0.91      0.94      0.92        27
weighted avg       0.94      0.93      0.93        27
```

Elapsed time: 0.05609659999754513 seconds

**GradientBoostingClassifier**

best score: 0.937937062937063

best parameters: {'learning_rate': 0.01, 'min_samples_split': 10, 'n_estimators': 100}

best accuracy (validation): 0.9444444444444443

best f1_macro (validation): 0.9442390419564332

Verified

Classification Report (Test):
```
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       1.00      0.67      0.80         3
           2       0.83      1.00      0.91        10
           3       1.00      0.86      0.92         7

    accuracy                           0.93        27
   macro avg       0.96      0.88      0.91        27
weighted avg       0.94      0.93      0.92        27
```

Elapsed time: 4.151155199997447 seconds

**RandomForestClassifier**

best score: 0.9455128205128206

best parameters: {'bootstrap': True, 'min_samples_split': 5, 'n_estimators': 200, 'random_state': 42}

best accuracy (validation): 0.9537037037037036

best f1_macro (validation): 0.9515406292580205

Verified

Classification Report (Test):
```
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       1.00      1.00      1.00         3
           2       0.91      1.00      0.95        10
           3       1.00      0.86      0.92         7

    accuracy                           0.96        27
   macro avg       0.98      0.96      0.97        27
weighted avg       0.97      0.96      0.96        27
```

Elapsed time: 3.6763307999972312 seconds
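For a side-by-side comparison, the headline figures from the summary above can be collected into one table. This is only a convenience sketch: the numbers are copied from the reported results and rounded, "best score" is taken to be the mean cross-validated balanced accuracy (the `refit` metric), and the column names are illustrative.

```python
import pandas as pd

# Figures copied from the result summary above and rounded.
summary = pd.DataFrame(
    {
        "cv_balanced_accuracy": [0.913, 0.938, 0.946],
        "test_accuracy": [0.93, 0.93, 0.96],
        "test_macro_f1": [0.92, 0.91, 0.97],
        "elapsed_seconds": [0.06, 4.15, 3.68],
    },
    index=[
        "DecisionTreeClassifier",
        "GradientBoostingClassifier",
        "RandomForestClassifier",
    ],
)

# Rank the models by the metric used for model selection.
ranked = summary.sort_values("cv_balanced_accuracy", ascending=False)
print(ranked)
```

RandomForestClassifier ranks first on the cross-validated balanced accuracy and also has the highest test accuracy and macro F1 of the three.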


#### Predict HDI in 2020

In [162]:
# Use the best model from RandomForestClassifier to predict the HDI levels
# of the 9 countries
to_predict = pd.read_csv("../data_imputed/to_predict.csv")
to_predict_data = to_predict.drop(columns=static_fields)
hdi_level_predictions = grid_search_list[2]['grid_search'].best_estimator_.predict(to_predict_data)
to_predict[LABEL] = hdi_level_predictions

# show the predicted HDI levels (0 = very high, 1 = high, 2 = medium, 3 = low)
results = to_predict[["year_num", "country_name", LABEL]]
results

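| | year_num | country_name | human_development_index |
|---|---|---|---|
| 0 | 2020 | Canada | 0 |
| 1 | 2020 | United States | 0 |
| 2 | 2020 | Mexico | 1 |
| 3 | 2020 | India | 2 |
| 4 | 2020 | Indonesia | 2 |
| 5 | 2020 | Iran | 1 |
| 6 | 2020 | Egypt | 2 |
| 7 | 2020 | Kenya | 2 |
| 8 | 2020 | Nigeria | 3 |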
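The predicted values are level codes rather than HDI scores, so they are easier to read once mapped back to the level names defined at the top of this section. A minimal sketch, with the predictions copied from the output above; `LEVEL_NAMES` and the `hdi_level` column are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Code-to-name mapping taken from the masking scheme at the top of this section.
LEVEL_NAMES = {0: "very high", 1: "high", 2: "medium", 3: "low"}

# Predictions copied from the output above.
results = pd.DataFrame(
    {
        "year_num": [2020] * 9,
        "country_name": [
            "Canada", "United States", "Mexico", "India", "Indonesia",
            "Iran", "Egypt", "Kenya", "Nigeria",
        ],
        "human_development_index": [0, 0, 1, 2, 2, 1, 2, 2, 3],
    }
)

# Add a human-readable column next to the numeric level code.
results["hdi_level"] = results["human_development_index"].map(LEVEL_NAMES)
print(results[["country_name", "hdi_level"]].to_string(index=False))
```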
