## I. Introduction
### Background

This study explores the correlation between lung cancer survival rates and societal factors on a global scale. While existing research has linked lung cancer to environmental pollutants and resource accessiblity in the U.S., little is known about how broader societal influences contribute to varied health outcomes worldwide. Through a comprehensive analysis of diverse variables, including healthcare accessibility, socio-economic conditions, and cultural factors, this research aims to uncover patterns and relationships that can deepen our understanding of lung cancer survival dynamics.

By investigating the interplay between lung cancer outcomes and societal factors, this research seeks to inform public health policies and resource allocation strategies. The study's findings may offer valuable insights into addressing the complexities of lung cancer on a global level, ultimately contributing to more effective interventions and equitable health outcomes.

### Hypothesis (Research Questions)

The effects of the factors on lung cancer survival rates will vary depending on the country. For example, developed countries will have a higher cancer survival rate if they have a greater forest area, while developing countries will have a lower cancer survival rate for a greater forest area.

## II. Methods
### 2.1 Data Description

Individually, we selected 7 factors we thought are likely to impact the lung cancer survival rate for a total of 7 data sets. These datasets are all sourced from the WorldBank site. The 8th data set is the data set of lung cancer survival rates by country and by year. 

**What are the observations (rows) and the attributes (columns)?:** 
Attributes of the X DataFrame are indexed by the 'country' (the name of the relevant country) and further contextualized by the attribute 'year' (the relevant year of the data in between 2000-2019). Most country and year combination has an observation correlating to the following attributes: 
- c_dollar2_poverty : Proportion of Population Pushed Below 3 dollar and 65 cents Poverty Line by Out-of-Pocket Health Care Expenditure
- c_forest_area : The Percentage of Land Area covered by Forest
- c_health_expenditure : The Percentage of a Country's GDP that goes towards Health Expenditures
- c_out_of_pocket : Out-of-Pocket Expenditure per Capita (Current US Dollars)
- c_physician : Physicians (Per 1000 People)
- c_tuberculosis : Incidence of Tuberculosis (Per 100,000 People)
- c_urban_pop : The Percentage of the Total Population living in Urban Areas 

The observation of the y dataframe are by countries and attributed by the year of observation. The dataframe contained data on the age-standardized rate of mortality for lung cancer.

### 2.2 Variables and DataFrames

In [75]:
import csv
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import duckdb
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white
import statsmodels.api as sm
from pathlib import Path
from datetime import datetime

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score, mean_squared_error

import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset
from xgboost import XGBClassifier, plot_tree
from sklearn.model_selection import cross_val_score

In [8]:
X = pd.read_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/X.txt')
X_df = X.set_index('country')
y_mortality = pd.read_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/y.txt')

years = ['2000','2001','2002','2003','2004','2005','2006','2007','2008','2009',
         '2010','2011','2012','2013','2014','2015','2016','2017','2017','2018','2019']

countries = X['country'].unique()

dict_of_year_dfs = {}
for yr in years:
    dict_of_year_dfs[yr] = X[X["year"] == str(yr)]
    
dict_of_country_dfs = {}
for c in countries:
    dict_of_country_dfs[c] = X[X['country'] == c]

y_mortality_stat = pd.read_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/y_stat.txt')

country_means_df = pd.read_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/country_means.txt')
country_medians_df = pd.read_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/country_medians.txt')
country_variances_df = pd.read_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/country_variances.txt')

### 2.3 Data Analysis

#### Summary Statistics

Correlation matrix to check for confounding variables.

In [10]:
# Melt y_mortality to concatenate with X_df
y_mortality_melt = pd.melt(y_mortality,id_vars='country',value_vars=years, \
                           var_name='year',value_name='mortality')
y_mortality_melt['year'] = y_mortality_melt['year'].astype(int)

# Create corr_Xy with X_df and y_mortality_melt merged for correlation matrix
Xy_df = X_df.merge(y_mortality_melt,on=['country','year'])
Xy_df = Xy_df.dropna()

# Display the correlation matrix
Xy_df.drop(columns='country').corr()

Unnamed: 0,year,c_dollar2_poverty,c_forest_area,c_health_expenditure,c_out_of_pocket,c_physician,c_tuberculosis,c_urban_pop,mortality
year,1.0,0.025665,0.166164,-0.000186,0.141739,0.1407,0.138147,-0.057464,-0.130958
c_dollar2_poverty,0.025665,1.0,0.551714,0.015412,0.522908,0.481998,0.516071,-0.27998,-0.348154
c_forest_area,0.166164,0.551714,1.0,0.037501,0.945444,0.806301,0.854961,-0.30448,-0.315205
c_health_expenditure,-0.000186,0.015412,0.037501,1.0,0.035244,0.001168,-0.009891,0.00572,-0.05739
c_out_of_pocket,0.141739,0.522908,0.945444,0.035244,1.0,0.624724,0.655668,-0.298269,-0.290178
c_physician,0.1407,0.481998,0.806301,0.001168,0.624724,1.0,0.943421,-0.390521,-0.360016
c_tuberculosis,0.138147,0.516071,0.854961,-0.009891,0.655668,0.943421,1.0,-0.287769,-0.356403
c_urban_pop,-0.057464,-0.27998,-0.30448,0.00572,-0.298269,-0.390521,-0.287769,1.0,0.419949
mortality,-0.130958,-0.348154,-0.315205,-0.05739,-0.290178,-0.360016,-0.356403,0.419949,1.0


#### Grouping Countries

Grouped countries into four different groups based on their Human Development Index (HDI) as following: 1 being "Low", 2 being "Medium", 3 being "High", and 4 being "Very High".

In [14]:
# Get dataset of countries' Human Development Index
hdi_countries = pd.read_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/hdi.txt')

# Merge HDI dataframe with X_df on country
X_hdi = X_df.merge(hdi_countries,on='country')

# Group into countries with different hdi values
X_hdi_1 = X_hdi[X_hdi['hdicode']==1.0] 
X_hdi_2 = X_hdi[X_hdi['hdicode']==2.0]
X_hdi_3 = X_hdi[X_hdi['hdicode']==3.0]
X_hdi_4 = X_hdi[X_hdi['hdicode']==4.0]

# Merge dataframe y_mortality with each X_hdi dataframes to create separate 
# dataframes for countries with different hdi values
Xy_hdi_1 = X_hdi_1.merge(y_mortality_melt,on=['country','year']).dropna()
Xy_hdi_2 = X_hdi_2.merge(y_mortality_melt,on=['country','year']).dropna()
Xy_hdi_3 = X_hdi_3.merge(y_mortality_melt,on=['country','year']).dropna()
Xy_hdi_4 = X_hdi_4.merge(y_mortality_melt,on=['country','year']).dropna()

# Save each Xy_hdi DataFrame to separate CSV files
Xy_hdi_1.to_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/Xy_hdi_1.csv', index=False)
Xy_hdi_2.to_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/Xy_hdi_2.csv', index=False)
Xy_hdi_3.to_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/Xy_hdi_3.csv', index=False)
Xy_hdi_4.to_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/Xy_hdi_4.csv', index=False)

Xy_hdi_1.head()

Unnamed: 0.1,country,year,c_dollar2_poverty,c_forest_area,c_health_expenditure,c_out_of_pocket,c_physician,c_tuberculosis,c_urban_pop,Unnamed: 0,hdicode,mortality
0,Burundi,2000,8.246,8.52,7.552181,2.021629,3.56,6.44,290.0,11,1.0,36.8
1,Burundi,2001,8.461,8.68,7.552181,2.294789,3.5,4.66,270.0,11,1.0,35.1
2,Burundi,2002,8.682,8.03,7.552181,1.905938,3.31,4.87,250.0,11,1.0,34.0
3,Burundi,2003,8.908,7.22,7.552181,1.549582,3.03,4.37,230.0,11,1.0,33.0
4,Burundi,2004,9.139,9.9,7.552181,3.043692,3.52,4.47,211.0,11,1.0,32.3


#### Model Training

Trained separate models for countries with different HDI values to compare the weight of each input. First we split the test set and train set.

In [15]:
# Split train test for each hdi categories
X_hdi1_train, X_hdi1_test, y_hdi1_train, y_hdi1_test = train_test_split \
(Xy_hdi_1.drop(columns=['mortality','country','hdicode','year','Unnamed: 0']), \
 Xy_hdi_1['mortality'], test_size=0.2,random_state=2950)
X_hdi2_train, X_hdi2_test, y_hdi2_train, y_hdi2_test = train_test_split \
(Xy_hdi_2.drop(columns=['mortality','country','hdicode','year','Unnamed: 0']),\
 Xy_hdi_2['mortality'], test_size=0.2,random_state=2950)
X_hdi3_train, X_hdi3_test, y_hdi3_train, y_hdi3_test = train_test_split \
(Xy_hdi_3.drop(columns=['mortality','country','hdicode','year','Unnamed: 0']), \
 Xy_hdi_3['mortality'], test_size=0.2,random_state=2950)
X_hdi4_train, X_hdi4_test, y_hdi4_train, y_hdi4_test = train_test_split \
(Xy_hdi_4.drop(columns=['mortality','country','hdicode','year','Unnamed: 0']), \
 Xy_hdi_4['mortality'], test_size=0.2,random_state=2950)

In [50]:
# Further split the training set into train and validation sets
X_hdi1_train, X_hdi1_valid, y_hdi1_train, y_hdi1_valid = train_test_split(
    X_hdi1_train,
    y_hdi1_train,
    test_size=0.25,
    random_state=2950
)
X_hdi2_train, X_hdi2_valid, y_hdi2_train, y_hdi2_valid = train_test_split(
    X_hdi2_train,
    y_hdi2_train,
    test_size=0.25,
    random_state=2950
)
X_hdi3_train, X_hdi3_valid, y_hdi3_train, y_hdi3_valid = train_test_split(
    X_hdi3_train,
    y_hdi3_train,
    test_size=0.25,
    random_state=2950
)
X_hdi4_train, X_hdi4_valid, y_hdi4_train, y_hdi4_valid = train_test_split(
    X_hdi4_train,
    y_hdi4_train,
    test_size=0.25,
    random_state=2950
)

#### Model 1: Linear Regression

In [16]:
# Fit each multilinear model
model_hdi1 = LinearRegression()
model_hdi2 = LinearRegression()
model_hdi3 = LinearRegression()
model_hdi4 = LinearRegression()

model_hdi1.fit(X_hdi1_train,y_hdi1_train)
model_hdi2.fit(X_hdi2_train,y_hdi2_train)
model_hdi3.fit(X_hdi3_train,y_hdi3_train)
model_hdi4.fit(X_hdi4_train,y_hdi4_train)

# Print MSE of each model
y_hdi1_pred = model_hdi1.predict(X_hdi1_test)
mse_hdi1 = mean_squared_error(y_hdi1_test, y_hdi1_pred)
print(f'HDI 1 Mean Squared Error: {mse_hdi1}')

y_hdi2_pred = model_hdi2.predict(X_hdi2_test)
mse_hdi2 = mean_squared_error(y_hdi2_test, y_hdi2_pred)
print(f'HDI 2 Mean Squared Error: {mse_hdi2}')

y_hdi3_pred = model_hdi3.predict(X_hdi3_test)
mse_hdi3 = mean_squared_error(y_hdi3_test, y_hdi3_pred)
print(f'HDI 3 Mean Squared Error: {mse_hdi3}')

y_hdi4_pred = model_hdi4.predict(X_hdi4_test)
mse_hdi4 = mean_squared_error(y_hdi4_test, y_hdi4_pred)
print(f'HDI 4 Mean Squared Error: {mse_hdi4}')

HDI 1 Mean Squared Error: 16.949771786522657
HDI 2 Mean Squared Error: 46.14491654415083
HDI 3 Mean Squared Error: 47.453855043340965
HDI 4 Mean Squared Error: 39.07412663779996


In [60]:
# Linear Regression for HDI 1
model_hdi1 = LinearRegression()
model_hdi1.fit(X_hdi1_train, y_hdi1_train)

# Predictions on the validation set
preds_valid_hdi1 = model_hdi1.predict(X_hdi1_valid)
preds_valid_hdi1 = np.round(preds_valid_hdi1).astype(int)  # Assuming you want integer predictions

# Convert to integers (for classification)
preds_valid_hdi1 = preds_valid_hdi1.astype(int)

# Calculate metrics for HDI 1
print("Recall hdi1:", recall_score(y_hdi1_valid, preds_valid_hdi1, average='macro'))
print("Precision hdi1:", precision_score(y_hdi1_valid, preds_valid_hdi1, average='macro'))
print("F1 Score hdi1:", f1_score(y_hdi1_valid, preds_valid_hdi1, average='macro'))

# You can also calculate F1 score for the training set if needed
preds_train_hdi1 = model_hdi1.predict(X_hdi1_train)
preds_train_hdi1 = np.round(preds_train_hdi1).astype(int)
preds_train_hdi1 = preds_train_hdi1.astype(int)

print("F1 Score (train) hdi1:", f1_score(y_hdi1_train, preds_train_hdi1, average='macro'))

# Repeat the process for other HDI categories (model_hdi2, model_hdi3, model_hdi4)


Recall hdi1: 0.026785714285714284
Precision hdi1: 0.016271674753817612
F1 Score hdi1: 0.020047277284119393
F1 Score (train) hdi1: 0.061957041690649854


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [17]:
# Print coefficients of each variable for each model
coeff_hdi1 = model_hdi1.coef_
coeff_hdi2 = model_hdi2.coef_
coeff_hdi3 = model_hdi3.coef_
coeff_hdi4 = model_hdi4.coef_
print(f'HDI 1 Coefficient: \n{coeff_hdi1}')
print(f'HDI 2 Coefficient: \n{coeff_hdi2}')
print(f'HDI 3 Coefficient: \n{coeff_hdi3}')
print(f'HDI 4 Coefficient: \n{coeff_hdi4}')

HDI 1 Coefficient: 
[-0.07392172  0.12771702 -0.05161373 -0.18896928  0.94565316 -1.07900793
  0.01607057]
HDI 2 Coefficient: 
[-0.08988428  0.08733333 -0.00322938 -0.04053926  0.03084154 -0.17943786
  0.00972751]
HDI 3 Coefficient: 
[-0.08400009  0.08655574  0.00451196 -0.10076958  0.02652856 -0.11258985
  0.01490172]
HDI 4 Coefficient: 
[-0.0068523  -0.01690743 -0.11933517  0.01250459  0.0575888  -0.04337106
  0.01381115]


#### Model 2: XGBoost

In XGBoost, we determine the hyperparameters of xgboost by grid vs randomized search and F1 scores by each search.

The features (X_hdi1_train, X_hdi2_train, X_hdi3_train, X_hdi4_train) are continuous numerical values. But in case of XGBoost, a tree-based model, typically works well with categorical variables and requires special treatment for continuous features.

In [43]:
# Convert the target variables to integers
y_hdi1_train = y_hdi1_train.astype(int)
y_hdi2_train = y_hdi2_train.astype(int)
y_hdi3_train = y_hdi3_train.astype(int)
y_hdi4_train = y_hdi4_train.astype(int)

In [44]:
print(np.unique(y_hdi1_train))
print(np.unique(y_hdi2_train))
print(np.unique(y_hdi3_train))
print(np.unique(y_hdi4_train))

[17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 45 46 47 48]
[10 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
 37 38 39 40 50 51 52 53 54 55 56]
[ 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
 33 34 35 36 37 38 39 40 41 42 43 44 45 46]
[ 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
 33 34 35 36 37 39 40 41]


In [45]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_hdi1_train_scaled = scaler.fit_transform(X_hdi1_train)
X_hdi2_train_scaled = scaler.fit_transform(X_hdi2_train)
X_hdi3_train_scaled = scaler.fit_transform(X_hdi3_train)
X_hdi4_train_scaled = scaler.fit_transform(X_hdi4_train)

In [46]:
# Adjust class labels to start from 0
y_hdi1_train_adjusted = y_hdi1_train - 17
y_hdi2_train_adjusted = y_hdi2_train - 10
y_hdi3_train_adjusted = y_hdi3_train - 9
y_hdi4_train_adjusted = y_hdi4_train - 9

# Verify unique values after adjustment
print(np.unique(y_hdi1_train_adjusted))
print(np.unique(y_hdi2_train_adjusted))
print(np.unique(y_hdi3_train_adjusted))
print(np.unique(y_hdi4_train_adjusted))

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 28 29 30 31]
[ 0  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
 27 28 29 30 40 41 42 43 44 45 46]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 30 31 32]


In [47]:
# Adjust class labels to start from 0
y_hdi1_train_adjusted = y_hdi1_train - 17
y_hdi2_train_adjusted = y_hdi1_train - 10
y_hdi3_train_adjusted = y_hdi1_train - 9
y_hdi4_train_adjusted = y_hdi1_train - 9

# Verify unique values after adjustment
print(np.unique(y_hdi1_train_adjusted))
print(np.unique(y_hdi2_train_adjusted))
print(np.unique(y_hdi3_train_adjusted))
print(np.unique(y_hdi4_train_adjusted))

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 28 29 30 31]
[ 7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
 31 32 33 35 36 37 38]
[ 8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
 32 33 34 36 37 38 39]
[ 8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
 32 33 34 36 37 38 39]


In [48]:
print(np.unique(X_hdi1_train))
print(np.unique(X_hdi2_train))
print(np.unique(X_hdi3_train))
print(np.unique(X_hdi4_train))

[1.51609670e-01 2.41587575e-01 2.43313201e-01 ... 1.20000000e+03
 1.23000000e+03 1.24000000e+03]
[9.00e-02 1.00e-01 1.30e-01 ... 1.24e+03 1.25e+03 1.26e+03]
[0.00000000e+00 5.40000000e-01 6.62960693e-01 ... 1.23000000e+03
 1.26000000e+03 1.27000000e+03]
[0.00000000e+00 8.07754443e-03 8.40064620e-03 ... 9.90630000e+02
 9.94670000e+02 9.95340000e+02]


In [49]:
# Grid Search
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.2, 0.3],
    'subsample': [0.7, 0.8, 0.9],
    'reg_lambda': [0.01, 0.1, 1.0],
    'reg_alpha': [0.01, 0.1, 1.0]
}

clf_grid_hdi1 = XGBClassifier()
clf_grid_hdi2 = XGBClassifier()
clf_grid_hdi3 = XGBClassifier()
clf_grid_hdi4 = XGBClassifier()

grid_search_hdi1 = GridSearchCV(clf_grid_hdi1, param_grid, cv=5, scoring='f1', error_score='raise')
grid_search_hdi1.fit(X_hdi1_train_scaled,y_hdi1_train_adjusted)
grid_search_hdi2 = GridSearchCV(clf_grid_hdi2, param_grid, cv=5, scoring='f1', error_score='raise')
grid_search_hdi2.fit(X_hdi2_train_scaled,y_hdi2_train_adjusted)
grid_search_hdi3 = GridSearchCV(clf_grid_hdi3, param_grid, cv=5, scoring='f1', error_score='raise')
grid_search_hdi3.fit(X_hdi3_train_scaled,y_hdi3_train_adjusted)
grid_search_hdi4 = GridSearchCV(clf_grid_hdi4, param_grid, cv=5, scoring='f1', error_score='raise')
grid_search_hdi4.fit(X_hdi4_train_scaled,y_hdi4_train_adjusted)

best_params_grid_hdi1 = grid_search_hdi1.best_params_
best_score_grid_hdi1 = grid_search_hdi1.best_score_

print("Grid Search Best Hyperparameters (hdi1):", best_params_grid_hdi1)
print("Grid Search Best F1 Score (hdi1):", best_score_grid_hdi1)

# Randomized Search
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier
import numpy as np

param_dist = {
    'max_depth': np.arange(3, 7),  
    'learning_rate': np.linspace(0.01, 0.3, num=10),  
    'subsample': np.linspace(0.5, 0.9, num=5),  
    'reg_lambda': np.logspace(-2, 2, num=5),  
    'reg_alpha': np.logspace(-2, 2, num=5)  
}

clf_randomized_hdi1 = XGBClassifier()
clf_randomized_hdi2 = XGBClassifier()
clf_randomized_hdi3 = XGBClassifier()
clf_randomized_hdi4 = XGBClassifier()

random_search_hdi1 = RandomizedSearchCV(
    clf_randomized_hdi1, 
    param_distributions=param_dist, 
    n_iter=50,  # Number of random combinations to try
    scoring='f1',  # Scoring metric
    cv=5  # Number of cross-validation folds
)
random_search_hdi1.fit(X_hdi1_train_scaled,y_hdi1_adjusted)
random_search_hdi2 = RandomizedSearchCV(
    clf_randomized_hdi2, 
    param_distributions=param_dist, 
    n_iter=50,
    scoring='f1',
    cv=5 
)
random_search_hdi2.fit(X_hdi2_train_scaled,y_hdi2_train_adjusted)
random_search_hdi3 = RandomizedSearchCV(
    clf_randomized_hdi3, 
    param_distributions=param_dist, 
    n_iter=50, 
    scoring='f1',
    cv=5  
)
random_search_hdi3.fit(X_hdi3_train_scaled,y_hdi3_train_adjusted)
random_search_hdi4 = RandomizedSearchCV(
    clf_randomized_hdi4, 
    param_distributions=param_dist, 
    n_iter=50,  
    scoring='f1',  
    cv=5  
)
random_search_hdi4.fit(X_hdi4_train_scaled,y_hdi4_train_adjusted)

best_params_randomized_hdi1 = random_search_hdi1.best_params_
best_score_randomized_hdi1 = random_search_hdi1.best_score_

print("Randomized Search Best Hyperparameters (hdi1):", best_params_randomized_hdi1)
print("Randomized Search Best F1 Score (hdi1):", best_score_randomized_hdi1)



ValueError: Invalid classes inferred from unique values of `y`.  Expected: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30], got [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 28 29 30 31]

### Model 3: Adaboost

***Result***

Grid Search Best Hyperparameters: {'base_estimator__max_depth': 7, 'learning_rate': 0.5, 'n_estimators': 150}
Grid Search Best F1 macro Score: 0.4804543009188122

Randomized Search Best Hyperparameters: {'n_estimators': 150, 'learning_rate': 0.5, 'base_estimator__max_depth': 10}
Randomized Search Best Accuracy: 0.5903846153846154

***We use hyperparameters given by Randomized Search***

In [77]:
# Ensemble Methods: AdaBoost
ada_clf = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=5), n_estimators=100,
        algorithm="SAMME.R", learning_rate=1
    )

# Parameter grid for Grid Search
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.1, 0.5, 1.0],
    'base_estimator__max_depth': [3, 5, 7],
}

# Grid Search
grid_search = GridSearchCV(ada_clf, param_grid, cv=5, scoring='f1_macro', verbose=2)
grid_search.fit(X_hdi1_train, y_hdi1_train)

# Best parameters and best F1 score from Grid Search
best_params_grid = grid_search.best_params_
best_score_grid = grid_search.best_score_

print("Grid Search Best Hyperparameters:", best_params_grid)
print("Grid Search Best F1 macro Score:", best_score_grid)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=50; total time=   0.1s
[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=150; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=150; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=150; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=150; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=150; total time=   0.4s
[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=50; total time=   0.1s
[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=50; total time=   0.1s
[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=150; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=150; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=150; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=150; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=150; total time=   0.4s
[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s
[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s
[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=150; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=150; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=150; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=150; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=150; total time=   0.4s
[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=150; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=150; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=150; total time=   0.5s
[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=150; total time=   0.5s
[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=150; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=150; total time=   0.5s
[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=150; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=150; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=150; total time=   0.6s
[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=100; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=100; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=100; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   1.0s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   1.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.9s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=50; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=50; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=50; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=100; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=150; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=150; total time=   0.9s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=150; total time=   0.9s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=150; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=150; total time=   0.6s


  self.best_estimator_ = clone(base_estimator).set_params(


Grid Search Best Hyperparameters: {'base_estimator__max_depth': 7, 'learning_rate': 0.5, 'n_estimators': 50}
Grid Search Best F1 Score: 0.467585944593102


In [79]:
# Parameter distribution for Randomized Search
param_dist = {
    'n_estimators': [50, 100, 150, 200],
    'learning_rate': [0.01, 0.1, 0.2, 0.5, 1.0],
    'base_estimator__max_depth': [3, 5, 7, 10],
}

# Randomized Search
random_search = RandomizedSearchCV(ada_clf, param_distributions=param_dist, n_iter=10, cv=5, scoring='accuracy', verbose=2)
random_search.fit(X_hdi1_train, y_hdi1_train)

# Best parameters and best F1 score from Randomized Search
best_params_randomized = random_search.best_params_
best_score_randomized = random_search.best_score_

print("Randomized Search Best Hyperparameters:", best_params_randomized)
print("Randomized Search Best Accuracy:", best_score_randomized)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.5, n_estimators=150; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.5, n_estimators=150; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.5, n_estimators=150; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.5, n_estimators=150; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.5, n_estimators=150; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=200; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=200; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=200; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=200; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=200; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=200; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=200; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=200; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=200; total time=   0.5s
[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s
[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s
[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.2, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.2, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.2, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.2, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.2, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=200; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=200; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.1, n_estimators=200; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.1, n_estimators=200; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.1, n_estimators=200; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.1, n_estimators=200; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.1, n_estimators=200; total time=   0.9s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=200; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=200; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=200; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=200; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=200; total time=   0.6s


  self.best_estimator_ = clone(base_estimator).set_params(


Randomized Search Best Hyperparameters: {'n_estimators': 150, 'learning_rate': 0.5, 'base_estimator__max_depth': 10}
Randomized Search Best F1 Score: 0.5903846153846154


In [80]:
#Ensemble Methods: AdaBoost
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=10), n_estimators=150,
        algorithm="SAMME.R", learning_rate=0.5
    )


# HDI 1
ada_clf.fit(X_hdi1_train, y_hdi1_train)
preds_train_hdi1 = ada_clf.predict(X_hdi1_train)
preds_valid_hdi1 = ada_clf.predict(X_hdi1_valid)

print("Recall hdi1:", recall_score(y_hdi1_valid, preds_valid_hdi1, average='macro'))
print("Precision hdi1:", precision_score(y_hdi1_valid, preds_valid_hdi1, average='macro'))
print("F1 Score hdi1:", f1_score(y_hdi1_valid, preds_valid_hdi1, average='macro'))
print("F1 Score (train) hdi1:", f1_score(y_hdi1_train, preds_train_hdi1, average='macro'))

# HDI 2
ada_clf.fit(X_hdi2_train, y_hdi2_train)
preds_train_hdi2 = ada_clf.predict(X_hdi2_train)
preds_valid_hdi2 = ada_clf.predict(X_hdi2_valid)

print("Recall hdi2:", recall_score(y_hdi2_valid, preds_valid_hdi2, average='macro'))
print("Precision hdi2:", precision_score(y_hdi2_valid, preds_valid_hdi2, average='macro'))
print("F1 Score hdi2:", f1_score(y_hdi2_valid, preds_valid_hdi2, average='macro'))
print("F1 Score (train) hdi2:", f1_score(y_hdi2_train, preds_train_hdi2, average='macro'))

# HDI 3
ada_clf.fit(X_hdi3_train, y_hdi3_train)
preds_train_hdi3 = ada_clf.predict(X_hdi3_train)
preds_valid_hdi3 = ada_clf.predict(X_hdi3_valid)

print("Recall hdi3:", recall_score(y_hdi3_valid, preds_valid_hdi3, average='macro'))
print("Precision hdi3:", precision_score(y_hdi3_valid, preds_valid_hdi3, average='macro'))
print("F1 Score hdi3:", f1_score(y_hdi3_valid, preds_valid_hdi3, average='macro'))
print("F1 Score (train) hdi3:", f1_score(y_hdi3_train, preds_train_hdi3, average='macro'))

# HDI 4
ada_clf.fit(X_hdi4_train, y_hdi4_train)
preds_train_hdi4 = ada_clf.predict(X_hdi4_train)
preds_valid_hdi4 = ada_clf.predict(X_hdi4_valid)

print("Recall hdi4:", recall_score(y_hdi4_valid, preds_valid_hdi4, average='macro'))
print("Precision hdi4:", precision_score(y_hdi4_valid, preds_valid_hdi4, average='macro'))
print("F1 Score hdi4:", f1_score(y_hdi4_valid, preds_valid_hdi4, average='macro'))
print("F1 Score (train) hdi4:", f1_score(y_hdi4_train, preds_train_hdi4, average='macro'))

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Recall hdi1: 0.4436933106575963
Precision hdi1: 0.41143707482993197
F1 Score hdi1: 0.398989485324889
F1 Score (train) hdi1: 1.0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Recall hdi2: 0.5536025802154834
Precision hdi2: 0.5703876627051501
F1 Score hdi2: 0.5116277987245729
F1 Score (train) hdi2: 1.0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Recall hdi3: 0.4465834818775995
Precision hdi3: 0.41821311858076565
F1 Score hdi3: 0.40450106596160845
F1 Score (train) hdi3: 1.0
Recall hdi4: 0.3386054421768708
Precision hdi4: 0.31940321583178727
F1 Score hdi4: 0.32312132312132313
F1 Score (train) hdi4: 1.0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### Model 4: GBRT

In case of regression problem (predicting a continuous variable), you use regression metrics instead.

--> shows overfitting

In [93]:
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=4, n_estimators=100)
gbrt.fit(X_hdi1_train, y_hdi1_train)

# Calculate errors for different numbers of estimators
errors = [mean_squared_error(y_hdi1_valid, y_pred)
          for y_pred in gbrt.staged_predict(X_hdi1_valid)]

# Find the index of the smallest error
bst_n_estimators = np.argmin(errors)

# Use the index to get the best number of estimators
best_n_estimators = bst_n_estimators + 1  # Add 1 to the index

# Create the best GBRT model with the found number of estimators
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_n_estimators)
gbrt_best.fit(X_hdi1_train, y_hdi1_train)

print(gbrt_best)

GradientBoostingRegressor(max_depth=2)


In [103]:
from sklearn.metrics import mean_squared_error

# Assuming you have already trained and validated your gbrt_best model
preds_train = gbrt_best.predict(X_hdi1_train)
preds = gbrt_best.predict(X_hdi1_valid)

mse_train = mean_squared_error(y_hdi1_train, preds_train)
mse_valid = mean_squared_error(y_hdi1_valid, preds)

print("Mean Squared Error on Train:", mse_train)
print("Mean Squared Error on Valid:", mse_valid)

Mean Squared Error on Train: 2.3792314465148947
Mean Squared Error on Valid: 8.792632568585587


### Model 5: Random Forest

In [106]:
from sklearn.ensemble import RandomForestClassifier

clf_randomforest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
clf_randomforest.fit(X_hdi1_train, y_hdi1_train)
preds_train = clf_randomforest.predict(X_hdi1_train)
preds = clf_randomforest.predict(X_hdi1_valid)

print("Recall:", recall_score(y_hdi1_valid, preds, average='macro', zero_division=0))
print("Precision:", precision_score(y_hdi1_valid, preds, average='macro', zero_division=0))
print("F1 Score:", f1_score(y_hdi1_valid, preds, average='macro', zero_division=0))
print("F1 Score (train):", f1_score(y_hdi1_train, preds_train, average='macro', zero_division=0))

Recall: 0.32375073486184597
Precision: 0.39883891828336265
F1 Score: 0.31291211209511866
F1 Score (train): 0.5916701070707013


### Model 6: Stacking

In [122]:
#Ensemble Methods: Stacking
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

#Define base models
base_models = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    GradientBoostingClassifier(n_estimators=100, random_state=42)
]

#Train base models and generate base model predictions
base_preds_train = []
base_preds_val = []
for model in base_models:
    model.fit(X_hdi1_train, y_hdi1_train)
    base_pred_train = model.predict(X_hdi1_train)
    base_pred_val = model.predict(X_hdi1_valid)
    base_preds_train.append(base_pred_train)
    base_preds_val.append(base_pred_val)

#Generate base model predictions arrays
base_preds_train = np.array(base_preds_train).T
base_preds_val = np.array(base_preds_val).T

#Define meta model
meta_model = LogisticRegression(max_iter=100000)

#Train meta model
meta_model.fit(base_preds_train, y_hdi1_train)

#Generate final predictions
final_preds = meta_model.predict(base_preds_val)

#Calculate F1 scores on training and validation sets using meta model
train_preds = meta_model.predict(base_preds_train)  # Predictions on the training set
val_preds = meta_model.predict(base_preds_val)      # Predictions on the validation set

train_f1 = f1_score(y_hdi1_train, train_preds, average='macro')
val_f1 = f1_score(y_hdi1_valid, val_preds, average='macro')

#Compare the F1 scores
print("F1 Score on Training Set:", train_f1)
print("F1 Score on Validation Set:", val_f1)

F1 Score on Training Set: 0.3337791870655029
F1 Score on Validation Set: 0.21279337946004612


### Model 7: Decision Tree

In [124]:
clf_DecisionTree = DecisionTreeClassifier(random_state=42, max_depth=5)

clf_DecisionTree.fit(X_hdi1_train, y_hdi1_train)

score = clf_DecisionTree.score(X_hdi1_valid, y_hdi1_valid)
y_predict1 = clf_DecisionTree.predict(X_hdi1_valid)
print(score)
# score depends strongly on train-test-split

0.14814814814814814


In [129]:
confusion_matrix(y_hdi1_valid, y_predict1)
print(f1_score(y_hdi1_valid, y_predict1, average='micro'))

0.14814814814814814


In [130]:
y_predict2 = clf_DecisionTree.predict(X_hdi1_train)
confusion_matrix(y_hdi1_train, y_predict2)
print(f1_score(y_hdi1_train, y_predict2, average='micro'))

0.32608695652173914
