## I. Introduction
### Background

This study explores the correlation between lung cancer survival rates and societal factors on a global scale. While existing research has linked lung cancer to environmental pollutants and resource accessiblity in the U.S., little is known about how broader societal influences contribute to varied health outcomes worldwide. Through a comprehensive analysis of diverse variables, including healthcare accessibility, socio-economic conditions, and cultural factors, this research aims to uncover patterns and relationships that can deepen our understanding of lung cancer survival dynamics.

By investigating the interplay between lung cancer outcomes and societal factors, this research seeks to inform public health policies and resource allocation strategies. The study's findings may offer valuable insights into addressing the complexities of lung cancer on a global level, ultimately contributing to more effective interventions and equitable health outcomes.

### Hypothesis (Research Questions)

The effects of the factors on lung cancer survival rates will vary depending on the country. For example, developed countries will have a higher cancer survival rate if they have a greater forest area, while developing countries will have a lower cancer survival rate for a greater forest area.

## II. Methods
### 2.1 Data Description

Individually, we selected 7 factors we thought are likely to impact the lung cancer survival rate for a total of 7 data sets. These datasets are all sourced from the WorldBank site. The 8th data set is the data set of lung cancer survival rates by country and by year. 

**What are the observations (rows) and the attributes (columns)?:** 
Attributes of the X DataFrame are indexed by the 'country' (the name of the relevant country) and further contextualized by the attribute 'year' (the relevant year of the data in between 2000-2019). Most country and year combination has an observation correlating to the following attributes: 
- c_dollar2_poverty : Proportion of Population Pushed Below 3 dollar and 65 cents Poverty Line by Out-of-Pocket Health Care Expenditure
- c_forest_area : The Percentage of Land Area covered by Forest
- c_health_expenditure : The Percentage of a Country's GDP that goes towards Health Expenditures
- c_out_of_pocket : Out-of-Pocket Expenditure per Capita (Current US Dollars)
- c_physician : Physicians (Per 1000 People)
- c_tuberculosis : Incidence of Tuberculosis (Per 100,000 People)
- c_urban_pop : The Percentage of the Total Population living in Urban Areas 

The observation of the y dataframe are by countries and attributed by the year of observation. The dataframe contained data on the age-standardized rate of mortality for lung cancer.

### 2.2 Variables and DataFrames

In [33]:
import csv
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import duckdb
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white
import statsmodels.api as sm
from pathlib import Path
from datetime import datetime

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score, mean_squared_error

import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset
from xgboost import XGBClassifier, plot_tree
from sklearn.model_selection import cross_val_score

In [46]:
X = pd.read_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/X.txt')
X_df = X.set_index('country')
y_mortality = pd.read_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/y.txt')

years = ['2000','2001','2002','2003','2004','2005','2006','2007','2008','2009',
         '2010','2011','2012','2013','2014','2015','2016','2017','2017','2018','2019']

countries = X['country'].unique()

dict_of_year_dfs = {}
for yr in years:
    dict_of_year_dfs[yr] = X[X["year"] == str(yr)]
    
dict_of_country_dfs = {}
for c in countries:
    dict_of_country_dfs[c] = X[X['country'] == c]

y_mortality_stat = pd.read_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/y_stat.txt')

country_means_df = pd.read_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/country_means.txt')
country_medians_df = pd.read_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/country_medians.txt')
country_variances_df = pd.read_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/country_variances.txt')

### 2.3 Data Analysis

#### Summary Statistics

Correlation matrix to check for confounding variables.

In [47]:
# Melt y_mortality to concatenate with X_df
y_mortality_melt = pd.melt(y_mortality,id_vars='country',value_vars=years, \
                           var_name='year',value_name='mortality')
y_mortality_melt['year'] = y_mortality_melt['year'].astype(int)

# Create corr_Xy with X_df and y_mortality_melt merged for correlation matrix
Xy_df = X_df.merge(y_mortality_melt,on=['country','year'])
Xy_df = Xy_df.dropna()

# Display the correlation matrix
Xy_df.drop(columns='country').corr()

Unnamed: 0,year,c_dollar2_poverty,c_forest_area,c_health_expenditure,c_out_of_pocket,c_physician,c_tuberculosis,c_urban_pop,mortality
year,1.0,0.025665,0.166164,-0.000186,0.141739,0.1407,0.138147,-0.057464,-0.130958
c_dollar2_poverty,0.025665,1.0,0.551714,0.015412,0.522908,0.481998,0.516071,-0.27998,-0.348154
c_forest_area,0.166164,0.551714,1.0,0.037501,0.945444,0.806301,0.854961,-0.30448,-0.315205
c_health_expenditure,-0.000186,0.015412,0.037501,1.0,0.035244,0.001168,-0.009891,0.00572,-0.05739
c_out_of_pocket,0.141739,0.522908,0.945444,0.035244,1.0,0.624724,0.655668,-0.298269,-0.290178
c_physician,0.1407,0.481998,0.806301,0.001168,0.624724,1.0,0.943421,-0.390521,-0.360016
c_tuberculosis,0.138147,0.516071,0.854961,-0.009891,0.655668,0.943421,1.0,-0.287769,-0.356403
c_urban_pop,-0.057464,-0.27998,-0.30448,0.00572,-0.298269,-0.390521,-0.287769,1.0,0.419949
mortality,-0.130958,-0.348154,-0.315205,-0.05739,-0.290178,-0.360016,-0.356403,0.419949,1.0


#### Grouping Countries

Grouped countries into four different groups based on their Human Development Index (HDI) as following: 1 being "Low", 2 being "Medium", 3 being "High", and 4 being "Very High".

In [48]:
# Get dataset of countries' Human Development Index
hdi_countries = pd.read_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/hdi.txt')

# Merge HDI dataframe with X_df on country
X_hdi = X_df.merge(hdi_countries,on='country')

# Group into countries with different hdi values
X_hdi_1 = X_hdi[X_hdi['hdicode']==1.0] 
X_hdi_2 = X_hdi[X_hdi['hdicode']==2.0]
X_hdi_3 = X_hdi[X_hdi['hdicode']==3.0]
X_hdi_4 = X_hdi[X_hdi['hdicode']==4.0]

# Merge dataframe y_mortality with each X_hdi dataframes to create separate 
# dataframes for countries with different hdi values
Xy_hdi_1 = X_hdi_1.merge(y_mortality_melt,on=['country','year']).dropna()
Xy_hdi_2 = X_hdi_2.merge(y_mortality_melt,on=['country','year']).dropna()
Xy_hdi_3 = X_hdi_3.merge(y_mortality_melt,on=['country','year']).dropna()
Xy_hdi_4 = X_hdi_4.merge(y_mortality_melt,on=['country','year']).dropna()

# Save each Xy_hdi DataFrame to separate CSV files
Xy_hdi_1.to_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/Xy_hdi_1.csv', index=False)
Xy_hdi_2.to_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/Xy_hdi_2.csv', index=False)
Xy_hdi_3.to_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/Xy_hdi_3.csv', index=False)
Xy_hdi_4.to_csv('/Users/ganghwayeon/Documents/Hack-for-Health/data/dataframes/Xy_hdi_4.csv', index=False)

Xy_hdi_1.head()

Unnamed: 0.1,country,year,c_dollar2_poverty,c_forest_area,c_health_expenditure,c_out_of_pocket,c_physician,c_tuberculosis,c_urban_pop,Unnamed: 0,hdicode,mortality
0,Burundi,2000,8.246,8.52,7.552181,2.021629,3.56,6.44,290.0,11,1.0,36.8
1,Burundi,2001,8.461,8.68,7.552181,2.294789,3.5,4.66,270.0,11,1.0,35.1
2,Burundi,2002,8.682,8.03,7.552181,1.905938,3.31,4.87,250.0,11,1.0,34.0
3,Burundi,2003,8.908,7.22,7.552181,1.549582,3.03,4.37,230.0,11,1.0,33.0
4,Burundi,2004,9.139,9.9,7.552181,3.043692,3.52,4.47,211.0,11,1.0,32.3


## III. Model Training

Trained separate models for countries with different HDI values to compare the weight of each input. First we split the test set and train set.

In [49]:
# Split train test for each hdi categories
X_hdi1_train, X_hdi1_test, y_hdi1_train, y_hdi1_test = train_test_split \
(Xy_hdi_1.drop(columns=['mortality','country','hdicode','year','Unnamed: 0']), \
 Xy_hdi_1['mortality'], test_size=0.2,random_state=2950)
X_hdi2_train, X_hdi2_test, y_hdi2_train, y_hdi2_test = train_test_split \
(Xy_hdi_2.drop(columns=['mortality','country','hdicode','year','Unnamed: 0']),\
 Xy_hdi_2['mortality'], test_size=0.2,random_state=2950)
X_hdi3_train, X_hdi3_test, y_hdi3_train, y_hdi3_test = train_test_split \
(Xy_hdi_3.drop(columns=['mortality','country','hdicode','year','Unnamed: 0']), \
 Xy_hdi_3['mortality'], test_size=0.2,random_state=2950)
X_hdi4_train, X_hdi4_test, y_hdi4_train, y_hdi4_test = train_test_split \
(Xy_hdi_4.drop(columns=['mortality','country','hdicode','year','Unnamed: 0']), \
 Xy_hdi_4['mortality'], test_size=0.2,random_state=2950)

In [50]:
# Further split the training set into train and validation sets
X_hdi1_train, X_hdi1_valid, y_hdi1_train, y_hdi1_valid = train_test_split(
    X_hdi1_train,
    y_hdi1_train,
    test_size=0.25,
    random_state=2950
)
X_hdi2_train, X_hdi2_valid, y_hdi2_train, y_hdi2_valid = train_test_split(
    X_hdi2_train,
    y_hdi2_train,
    test_size=0.25,
    random_state=2950
)
X_hdi3_train, X_hdi3_valid, y_hdi3_train, y_hdi3_valid = train_test_split(
    X_hdi3_train,
    y_hdi3_train,
    test_size=0.25,
    random_state=2950
)
X_hdi4_train, X_hdi4_valid, y_hdi4_train, y_hdi4_valid = train_test_split(
    X_hdi4_train,
    y_hdi4_train,
    test_size=0.25,
    random_state=2950
)

### Model 1: Linear Regression

In [40]:
# Fit each multilinear model
model_hdi1 = LinearRegression()
model_hdi2 = LinearRegression()
model_hdi3 = LinearRegression()
model_hdi4 = LinearRegression()

model_hdi1.fit(X_hdi1_train,y_hdi1_train)
model_hdi2.fit(X_hdi2_train,y_hdi2_train)
model_hdi3.fit(X_hdi3_train,y_hdi3_train)
model_hdi4.fit(X_hdi4_train,y_hdi4_train)

# Print MSE of each model
y_hdi1_pred = model_hdi1.predict(X_hdi1_test)
mse_hdi1 = mean_squared_error(y_hdi1_test, y_hdi1_pred)
print(f'HDI 1 Mean Squared Error: {mse_hdi1}')

y_hdi2_pred = model_hdi2.predict(X_hdi2_test)
mse_hdi2 = mean_squared_error(y_hdi2_test, y_hdi2_pred)
print(f'HDI 2 Mean Squared Error: {mse_hdi2}')

y_hdi3_pred = model_hdi3.predict(X_hdi3_test)
mse_hdi3 = mean_squared_error(y_hdi3_test, y_hdi3_pred)
print(f'HDI 3 Mean Squared Error: {mse_hdi3}')

y_hdi4_pred = model_hdi4.predict(X_hdi4_test)
mse_hdi4 = mean_squared_error(y_hdi4_test, y_hdi4_pred)
print(f'HDI 4 Mean Squared Error: {mse_hdi4}')

HDI 1 Mean Squared Error: 17.455789102786028
HDI 2 Mean Squared Error: 47.206868567392355
HDI 3 Mean Squared Error: 47.64044850105149
HDI 4 Mean Squared Error: 38.72336569639016


In [60]:
# Linear Regression for HDI 1
model_hdi1 = LinearRegression()
model_hdi1.fit(X_hdi1_train, y_hdi1_train)

# Predictions on the validation set
preds_valid_hdi1 = model_hdi1.predict(X_hdi1_valid)
preds_valid_hdi1 = np.round(preds_valid_hdi1).astype(int)  # Assuming you want integer predictions

# Convert to integers (for classification)
preds_valid_hdi1 = preds_valid_hdi1.astype(int)

# Calculate metrics for HDI 1
print("Recall hdi1:", recall_score(y_hdi1_valid, preds_valid_hdi1, average='macro'))
print("Precision hdi1:", precision_score(y_hdi1_valid, preds_valid_hdi1, average='macro'))
print("F1 Score hdi1:", f1_score(y_hdi1_valid, preds_valid_hdi1, average='macro'))

# You can also calculate F1 score for the training set if needed
preds_train_hdi1 = model_hdi1.predict(X_hdi1_train)
preds_train_hdi1 = np.round(preds_train_hdi1).astype(int)
preds_train_hdi1 = preds_train_hdi1.astype(int)

print("F1 Score (train) hdi1:", f1_score(y_hdi1_train, preds_train_hdi1, average='macro'))

# Repeat the process for other HDI categories (model_hdi2, model_hdi3, model_hdi4)


Recall hdi1: 0.026785714285714284
Precision hdi1: 0.016271674753817612
F1 Score hdi1: 0.020047277284119393
F1 Score (train) hdi1: 0.061957041690649854


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [41]:
# Print coefficients of each variable for each model
coeff_hdi1 = model_hdi1.coef_
coeff_hdi2 = model_hdi2.coef_
coeff_hdi3 = model_hdi3.coef_
coeff_hdi4 = model_hdi4.coef_
print(f'HDI 1 Coefficient: \n{coeff_hdi1}')
print(f'HDI 2 Coefficient: \n{coeff_hdi2}')
print(f'HDI 3 Coefficient: \n{coeff_hdi3}')
print(f'HDI 4 Coefficient: \n{coeff_hdi4}')

HDI 1 Coefficient: 
[-0.0864262   0.10424894 -0.050019   -0.16559496  0.92279766 -1.02265593
  0.01747262]
HDI 2 Coefficient: 
[-0.0837448   0.1069697  -0.00117724 -0.0718396   0.0168877  -0.1877496
  0.00929865]
HDI 3 Coefficient: 
[-0.10360563  0.07285097  0.0088263  -0.08679533  0.02848084 -0.09881974
  0.01506031]
HDI 4 Coefficient: 
[ 0.00732892  0.0105862  -0.12841764 -0.01578873  0.06007885 -0.07489197
  0.0143597 ]


### Model 2: Adaboost

***Result***

Grid Search Best Hyperparameters: {'base_estimator__max_depth': 7, 'learning_rate': 0.5, 'n_estimators': 150}
Grid Search Best F1 macro Score: 0.4804543009188122

Randomized Search Best Hyperparameters: {'n_estimators': 150, 'learning_rate': 0.5, 'base_estimator__max_depth': 10}
Randomized Search Best Accuracy: 0.5903846153846154

***We use hyperparameters given by Randomized Search***

In [64]:
y_hdi1_train = y_hdi1_train.astype(int)
y_hdi2_train = y_hdi2_train.astype(int)
y_hdi3_train = y_hdi3_train.astype(int)
y_hdi4_train = y_hdi4_train.astype(int)

y_hdi1_valid = y_hdi1_valid.astype(int)
y_hdi2_valid = y_hdi2_valid.astype(int)
y_hdi3_valid = y_hdi3_valid.astype(int)
y_hdi4_valid = y_hdi4_valid.astype(int)

In [70]:
# Ensemble Methods: AdaBoost
ada_clf = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=5), n_estimators=100,
        algorithm="SAMME.R", learning_rate=1
    )

# Parameter grid for Grid Search
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.1, 0.5, 1.0],
    'base_estimator__max_depth': [3, 5, 7],
}

# Grid Search
grid_search = GridSearchCV(ada_clf, param_grid, cv=5, scoring='f1_macro', verbose=2)
grid_search.fit(X_hdi1_train, y_hdi1_train)

# Best parameters and best F1 score from Grid Search
best_params_grid = grid_search.best_params_
best_score_grid = grid_search.best_score_

print("Grid Search Best Hyperparameters:", best_params_grid)
print("Grid Search Best F1 macro Score:", best_score_grid)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=100; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=100; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=150; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=150; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=150; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=150; total time=   1.0s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=150; total time=   1.0s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=50; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=50; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=50; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=50; total time=   0.3s
[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=150; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=150; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=150; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=150; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.5, n_estimators=150; total time=   0.7s
[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s
[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s
[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=100; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=100; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=100; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=150; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=150; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=150; total time=   0.4s
[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=50; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=50; total time=   0.3s
[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=100; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=100; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=100; total time=   0.9s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=100; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=150; total time=   0.9s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=150; total time=   1.0s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=150; total time=   1.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=150; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=150; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=50; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=50; total time=   0.4s
[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=100; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=150; total time=   0.9s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=150; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=150; total time=   1.0s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=150; total time=   0.9s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.5, n_estimators=150; total time=   0.5s
[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=100; total time=   0.3s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=150; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=150; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=150; total time=   1.0s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=150; total time=   1.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=1.0, n_estimators=150; total time=   0.5s
[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=150; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=150; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=150; total time=   0.8s
[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.6s
[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=50; total time=   0.2s
[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=100; total time=   0.4s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=150; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=150; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=150; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=1.0, n_estimators=150; total time=   0.6s


  self.best_estimator_ = clone(base_estimator).set_params(


Grid Search Best Hyperparameters: {'base_estimator__max_depth': 7, 'learning_rate': 0.5, 'n_estimators': 150}
Grid Search Best F1 macro Score: 0.47150074978146445


In [79]:
# Parameter distribution for Randomized Search
param_dist = {
    'n_estimators': [50, 100, 150, 200],
    'learning_rate': [0.01, 0.1, 0.2, 0.5, 1.0],
    'base_estimator__max_depth': [3, 5, 7, 10],
}

# Randomized Search
random_search = RandomizedSearchCV(ada_clf, param_distributions=param_dist, n_iter=10, cv=5, scoring='accuracy', verbose=2)
random_search.fit(X_hdi1_train, y_hdi1_train)

# Best parameters and best F1 score from Randomized Search
best_params_randomized = random_search.best_params_
best_score_randomized = random_search.best_score_

print("Randomized Search Best Hyperparameters:", best_params_randomized)
print("Randomized Search Best Accuracy:", best_score_randomized)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.5, n_estimators=150; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.5, n_estimators=150; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.5, n_estimators=150; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.5, n_estimators=150; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.5, n_estimators=150; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.1, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=200; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=200; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=200; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=200; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=0.1, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=200; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=200; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=200; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=200; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=200; total time=   0.5s
[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s
[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s
[CV] END base_estimator__max_depth=3, learning_rate=1.0, n_estimators=50; total time=   0.1s


  estimator = estimator.set_params(**clone(parameters, safe=False))
  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.5s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=150; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.2, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.2, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.2, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.2, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.2, n_estimators=50; total time=   0.2s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=200; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=200; total time=   0.7s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=7, learning_rate=0.5, n_estimators=200; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.1, n_estimators=200; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.1, n_estimators=200; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.1, n_estimators=200; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.1, n_estimators=200; total time=   0.8s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=10, learning_rate=0.1, n_estimators=200; total time=   0.9s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=200; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=200; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=200; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=200; total time=   0.6s


  estimator = estimator.set_params(**clone(parameters, safe=False))


[CV] END base_estimator__max_depth=5, learning_rate=0.1, n_estimators=200; total time=   0.6s


  self.best_estimator_ = clone(base_estimator).set_params(


Randomized Search Best Hyperparameters: {'n_estimators': 150, 'learning_rate': 0.5, 'base_estimator__max_depth': 10}
Randomized Search Best F1 Score: 0.5903846153846154


In [65]:
#Ensemble Methods: AdaBoost
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=10), n_estimators=150,
        algorithm="SAMME.R", learning_rate=0.5
    )


# HDI 1
ada_clf.fit(X_hdi1_train, y_hdi1_train)
preds_train_hdi1 = ada_clf.predict(X_hdi1_train)
preds_valid_hdi1 = ada_clf.predict(X_hdi1_valid)

print("Recall hdi1:", recall_score(y_hdi1_valid, preds_valid_hdi1, average='macro'))
print("Precision hdi1:", precision_score(y_hdi1_valid, preds_valid_hdi1, average='macro'))
print("F1 Score hdi1:", f1_score(y_hdi1_valid, preds_valid_hdi1, average='macro'))
print("F1 Score (train) hdi1:", f1_score(y_hdi1_train, preds_train_hdi1, average='macro'))

# HDI 2
ada_clf.fit(X_hdi2_train, y_hdi2_train)
preds_train_hdi2 = ada_clf.predict(X_hdi2_train)
preds_valid_hdi2 = ada_clf.predict(X_hdi2_valid)

print("Recall hdi2:", recall_score(y_hdi2_valid, preds_valid_hdi2, average='macro'))
print("Precision hdi2:", precision_score(y_hdi2_valid, preds_valid_hdi2, average='macro'))
print("F1 Score hdi2:", f1_score(y_hdi2_valid, preds_valid_hdi2, average='macro'))
print("F1 Score (train) hdi2:", f1_score(y_hdi2_train, preds_train_hdi2, average='macro'))

# HDI 3
ada_clf.fit(X_hdi3_train, y_hdi3_train)
preds_train_hdi3 = ada_clf.predict(X_hdi3_train)
preds_valid_hdi3 = ada_clf.predict(X_hdi3_valid)

print("Recall hdi3:", recall_score(y_hdi3_valid, preds_valid_hdi3, average='macro'))
print("Precision hdi3:", precision_score(y_hdi3_valid, preds_valid_hdi3, average='macro'))
print("F1 Score hdi3:", f1_score(y_hdi3_valid, preds_valid_hdi3, average='macro'))
print("F1 Score (train) hdi3:", f1_score(y_hdi3_train, preds_train_hdi3, average='macro'))

# HDI 4
ada_clf.fit(X_hdi4_train, y_hdi4_train)
preds_train_hdi4 = ada_clf.predict(X_hdi4_train)
preds_valid_hdi4 = ada_clf.predict(X_hdi4_valid)

print("Recall hdi4:", recall_score(y_hdi4_valid, preds_valid_hdi4, average='macro'))
print("Precision hdi4:", precision_score(y_hdi4_valid, preds_valid_hdi4, average='macro'))
print("F1 Score hdi4:", f1_score(y_hdi4_valid, preds_valid_hdi4, average='macro'))
print("F1 Score (train) hdi4:", f1_score(y_hdi4_train, preds_train_hdi4, average='macro'))

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Recall hdi1: 0.4567885487528344
Precision hdi1: 0.4324557387057387
F1 Score hdi1: 0.41504149749491365
F1 Score (train) hdi1: 1.0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Recall hdi2: 0.5403537322892161
Precision hdi2: 0.5637213145277661
F1 Score hdi2: 0.5219737885514024
F1 Score (train) hdi2: 1.0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Recall hdi3: 0.44004753416518116
Precision hdi3: 0.4220368011457976
F1 Score hdi3: 0.3974976505366155
F1 Score (train) hdi3: 1.0
Recall hdi4: 0.335671768707483
Precision hdi4: 0.32329931972789117
F1 Score hdi4: 0.3231224706788617
F1 Score (train) hdi4: 1.0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### Model 3: GBRT

In case of regression problem (predicting a continuous variable), you use regression metrics instead.

--> shows overfitting

In [43]:
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=4, n_estimators=100)
gbrt.fit(X_hdi1_train, y_hdi1_train)

# Calculate errors for different numbers of estimators
errors = [mean_squared_error(y_hdi1_valid, y_pred)
          for y_pred in gbrt.staged_predict(X_hdi1_valid)]

# Find the index of the smallest error
bst_n_estimators = np.argmin(errors)

# Use the index to get the best number of estimators
best_n_estimators = bst_n_estimators + 1  # Add 1 to the index

# Create the best GBRT model with the found number of estimators
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_n_estimators)
gbrt_best.fit(X_hdi1_train, y_hdi1_train)

print(gbrt_best)

GradientBoostingRegressor(max_depth=2, n_estimators=99)


In [44]:
from sklearn.metrics import mean_squared_error

# Assuming you have already trained and validated your gbrt_best model
preds_train = gbrt_best.predict(X_hdi1_train)
preds = gbrt_best.predict(X_hdi1_valid)

mse_train = mean_squared_error(y_hdi1_train, preds_train)
mse_valid = mean_squared_error(y_hdi1_valid, preds)

print("Mean Squared Error on Train:", mse_train)
print("Mean Squared Error on Valid:", mse_valid)

Mean Squared Error on Train: 2.3338623216384953
Mean Squared Error on Valid: 8.770940690471539


### Model 4: Random Forest

In [66]:
from sklearn.ensemble import RandomForestClassifier

clf_randomforest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
clf_randomforest.fit(X_hdi1_train, y_hdi1_train)
preds_train = clf_randomforest.predict(X_hdi1_train)
preds = clf_randomforest.predict(X_hdi1_valid)

print("Recall:", recall_score(y_hdi1_valid, preds, average='macro', zero_division=0))
print("Precision:", precision_score(y_hdi1_valid, preds, average='macro', zero_division=0))
print("F1 Score:", f1_score(y_hdi1_valid, preds, average='macro', zero_division=0))
print("F1 Score (train):", f1_score(y_hdi1_train, preds_train, average='macro', zero_division=0))

Recall: 0.32375073486184597
Precision: 0.39883891828336265
F1 Score: 0.31291211209511866
F1 Score (train): 0.5916701070707013


### Model 5: Stacking

In [67]:
#Ensemble Methods: Stacking
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

#Define base models
base_models = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    GradientBoostingClassifier(n_estimators=100, random_state=42)
]

#Train base models and generate base model predictions
base_preds_train = []
base_preds_val = []
for model in base_models:
    model.fit(X_hdi1_train, y_hdi1_train)
    base_pred_train = model.predict(X_hdi1_train)
    base_pred_val = model.predict(X_hdi1_valid)
    base_preds_train.append(base_pred_train)
    base_preds_val.append(base_pred_val)

#Generate base model predictions arrays
base_preds_train = np.array(base_preds_train).T
base_preds_val = np.array(base_preds_val).T

#Define meta model
meta_model = LogisticRegression(max_iter=100000)

#Train meta model
meta_model.fit(base_preds_train, y_hdi1_train)

#Generate final predictions
final_preds = meta_model.predict(base_preds_val)

#Calculate F1 scores on training and validation sets using meta model
train_preds = meta_model.predict(base_preds_train)  # Predictions on the training set
val_preds = meta_model.predict(base_preds_val)      # Predictions on the validation set

train_f1 = f1_score(y_hdi1_train, train_preds, average='macro')
val_f1 = f1_score(y_hdi1_valid, val_preds, average='macro')

#Compare the F1 scores
print("F1 Score on Training Set:", train_f1)
print("F1 Score on Validation Set:", val_f1)

F1 Score on Training Set: 0.3337791870655029
F1 Score on Validation Set: 0.21279337946004612


### Model 6: Decision Tree

In [124]:
clf_DecisionTree = DecisionTreeClassifier(random_state=42, max_depth=5)

clf_DecisionTree.fit(X_hdi1_train, y_hdi1_train)

score = clf_DecisionTree.score(X_hdi1_valid, y_hdi1_valid)
y_predict1 = clf_DecisionTree.predict(X_hdi1_valid)
print(score)
# score depends strongly on train-test-split

0.14814814814814814


In [129]:
confusion_matrix(y_hdi1_valid, y_predict1)
print(f1_score(y_hdi1_valid, y_predict1, average='micro'))

0.14814814814814814


In [130]:
y_predict2 = clf_DecisionTree.predict(X_hdi1_train)
confusion_matrix(y_hdi1_train, y_predict2)
print(f1_score(y_hdi1_train, y_predict2, average='micro'))

0.32608695652173914


### Model 7: LSTM

In [18]:
# sequence binary classification using LSTM
# define network

hidden_size = 20
num_layers = 5
fact_lin_feat = 0.5

#define architecture of LSTM-based neural network
class lstm_net(torch.nn.Module):
    def __init__(self):
        super(lstm_net, self).__init__()
        self.lstm = torch.nn.LSTM(input_size=int(X_hdi1_train.shape[1]), hidden_size=hidden_size, num_layers=num_layers, batch_first=True,
                                    bidirectional=True, dropout=0.2)
        self.fc1 = torch.nn.Linear(in_features=hidden_size * 2, out_features=int(fact_lin_feat * hidden_size * 2))
        self.fc2 = torch.nn.Linear(in_features=int(fact_lin_feat * hidden_size * 2), out_features=1)

    def forward(self, x):
        output, _status = self.lstm(x)
        output = output[:, -1, :]
        output = self.fc1(torch.relu(output))
        output = self.fc2(torch.relu(output))
        output = torch.nn.Sigmoid()(output)
        return output

In [28]:
#assign the number of feature we input // different number of columns 
class LSTMModel(pl.LightningModule):
    def __init__(self, learning_rate=1.e-4, hidden_size=20, num_layers=5, dropout=0.2):
        super().__init__()
        self.save_hyperparameters()
        self.model = lstm_net()
        self.loss_fn = torch.nn.BCELoss()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.dropout = dropout
        self.validation_outputs = []  # Store validation outputs as an instance attribute
        
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(
        self.parameters(),
        lr=self.hparams.learning_rate,
        )
        return torch.optim.Adam(
            self.parameters(),
            lr=self.hparams.learning_rate,
            betas=self.hparams.betas,
            eps=self.hparams.eps,
            weight_decay=self.hparams.weight_decay,
            amsgrad=self.hparams.amsgrad), optimizer

    def _shared_step(self, batch, batch_idx):
        spectrum, label = batch

        predicted_label = self.model(spectrum)
        loss = self.loss_fn(predicted_label, label)

        return {'loss': loss, 'predictions': predicted_label.detach(), 'labels': label.detach()}

    def training_step(self, batch, batch_idx):
        step_dict = self._shared_step(batch, batch_idx)

        self.log("train_loss", step_dict['loss'], on_step=False, on_epoch=True, prog_bar=True, logger=True)
        return step_dict
    
    def validation_step(self, batch, batch_idx):
        step_dict = self._shared_step(batch, batch_idx)
        self.log("val_loss", step_dict['loss'], on_step=False, on_epoch=True, prog_bar=True, logger=True)

        return step_dict

    def forward(self, batch):
        # NOTE: this is what happens during inference!
        spectrum, label = batch
        predicted_label = self.model(spectrum)
        return {'prediction': predicted_label, 'label': label}
    
    def on_validation_epoch_end(self):
        # Compute validation metrics and log them
        labels = np.concatenate([x['labels'].numpy() for x in self.validation_outputs])
        predictions = np.concatenate([x['predictions'].numpy() for x in self.validation_outputs])

        print(labels.shape, predictions.shape)
        print(len(self.validation_outputs))
    
        for step in self.validation_outputs:
            print(step['labels'].shape, step['predictions'].shape)
        
        metrics = {
            "f1_score": f1_score(labels, np.round(predictions)),
            "Precision": precision_score(labels, np.round(predictions)),
            "Recall": recall_score(labels, np.round(predictions))
        }

        # Log the metrics using self.log
        self.log("val_f1", metrics["f1_score"], prog_bar=True)
        self.log("val_precision", metrics["Precision"], prog_bar=True)
        self.log("val_recall", metrics["Recall"], prog_bar=True)

In [20]:
# Convert DataFrame to NumPy array
X_hdi1_train_np = np.array(X_hdi1_train)
y_hdi1_train_np = np.array(y_hdi1_train)

X_hdi1_valid_np = np.array(X_hdi1_valid)
y_hdi1_valid_np = np.array(y_hdi1_valid)

In [21]:
batch_size = 10
train_loader = DataLoader(
        TensorDataset(
            torch.Tensor(X_hdi1_train_np).unsqueeze(dim=1), 
            torch.Tensor(y_hdi1_train_np).unsqueeze(dim=1)
        ),
        shuffle=False, batch_size=batch_size
)

validation_loader = DataLoader(
    TensorDataset(
        torch.Tensor(X_hdi1_valid_np).unsqueeze(dim=1), 
        torch.Tensor(y_hdi1_valid_np).unsqueeze(dim=1)
    ),
    shuffle=False, batch_size=batch_size
)

In [29]:
model = LSTMModel(
    learning_rate=0.001,
    hidden_size=128,
    num_layers=6,
    dropout=0.3,
)
print(model)

LSTMModel(
  (model): lstm_net(
    (lstm): LSTM(7, 20, num_layers=5, batch_first=True, dropout=0.2, bidirectional=True)
    (fc1): Linear(in_features=40, out_features=20, bias=True)
    (fc2): Linear(in_features=20, out_features=1, bias=True)
  )
  (loss_fn): BCELoss()
)


In [12]:
trainer = pl.Trainer(
        max_epochs=800,
        num_sanity_val_steps=0,
        default_root_dir='/Users/ganghwayeon/Hack-for-Health/lightning_logs/version_0',
        accelerator="auto",
        gradient_accumulation_steps=2
        )

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [13]:
print(trainer.logger.log_dir)

/Users/ganghwayeon/Hack-for-Health/lightning_logs/version_0/lightning_logs/version_0


In [None]:
trainer.fit(
        model,
        train_dataloaders=train_loader,
        val_dataloaders=validation_loader)

In [None]:
y_pred = trainer.predict(model, validation_loader)
y_pred = np.concatenate([x['prediction'].numpy()>0.2 for x in y_pred])

In [None]:
print(f1_score(y_hdi1_valid, y_pred), recall_score(y_hdi1_valid, y_pred), precision_score(y_hdi1_valid, y_pred))
# goal is the recall to be 1 while precision should be high enough (>= 0.7)

In [None]:
y_pred_train = trainer.predict(model, train_loader)
y_pred_train = np.round(np.concatenate([x['prediction'].numpy() for x in y_pred_train]))
f1_score(y_hdi1_train, y_pred_train)

## IV. Result

**Linear Regression**
- Recall: 0.0268
- Precision: 0.0163
- F1 Score: 0.0200
- F1 Score (train): 0.06196

**Adaboost**
- Recall: 0.4437
- Precision: 0.4114
- F1 Score: 0.3990
- F1 Score (train): 1.0

**GBRT**
- Mean Squared Error on Train: 2.3792
- Mean Squared Error on Valid: 8.7926

**Random Forest**
- Recall: 0.3238
- Precision: 0.3988
- F1 Score: 0.3129
- F1 Score (train): 0.5917

**Stacking**
- F1 Score on Training Set: 0.3338
- F1 Score on Validation Set: 0.2128

**Decision Tree**
- F1 Score on Training Set: 0.3261
- F1 Score on Validation Set: 0.1481

Considering the results, Adaboost stands out with a balanced F1 score, leading to its selection for further analysis of variable coefficients (HD1, HD2, HD3, HD4).

### Coefficient

In [69]:
# HDI 1
ada_clf_hdi1 = AdaBoostClassifier(DecisionTreeClassifier(max_depth=10), n_estimators=150, algorithm="SAMME.R", learning_rate=0.5)
ada_clf_hdi1.fit(X_hdi1_train, y_hdi1_train)
feature_importances_hdi1 = ada_clf_hdi1.feature_importances_
print(f'HDI 1 Feature Importances: \n{feature_importances_hdi1}')

# HDI 2
ada_clf_hdi2 = AdaBoostClassifier(DecisionTreeClassifier(max_depth=10), n_estimators=150, algorithm="SAMME.R", learning_rate=0.5)
ada_clf_hdi2.fit(X_hdi2_train, y_hdi2_train)
feature_importances_hdi2 = ada_clf_hdi2.feature_importances_
print(f'HDI 2 Feature Importances: \n{feature_importances_hdi2}')

# HDI 3
ada_clf_hdi3 = AdaBoostClassifier(DecisionTreeClassifier(max_depth=10), n_estimators=150, algorithm="SAMME.R", learning_rate=0.5)
ada_clf_hdi3.fit(X_hdi3_train, y_hdi3_train)
feature_importances_hdi3 = ada_clf_hdi3.feature_importances_
print(f'HDI 3 Feature Importances: \n{feature_importances_hdi3}')

# HDI 4
ada_clf_hdi4 = AdaBoostClassifier(DecisionTreeClassifier(max_depth=10), n_estimators=150, algorithm="SAMME.R", learning_rate=0.5)
ada_clf_hdi4.fit(X_hdi4_train, y_hdi4_train)
feature_importances_hdi4 = ada_clf_hdi4.feature_importances_
print(f'HDI 4 Feature Importances: \n{feature_importances_hdi4}')

HDI 1 Feature Importances: 
[0.20756339 0.10574759 0.17214779 0.17831023 0.09481277 0.08677082
 0.15464741]
HDI 2 Feature Importances: 
[0.17911793 0.10100164 0.18218712 0.13652324 0.11264578 0.09305338
 0.19547092]
HDI 3 Feature Importances: 
[0.19072331 0.09443842 0.20278797 0.14293174 0.11811113 0.09945197
 0.15155547]
HDI 4 Feature Importances: 
[0.19601371 0.10405289 0.19913598 0.12486381 0.12056033 0.10213644
 0.15323684]


## Conclusion: Feature Importance Analysis for HDI Prediction using AdaBoost: A Scientific Report

### Abstract:
This report presents an analysis of feature importances derived from AdaBoost models trained to predict Human Development Index (HDI) values for four distinct indices (HDI 1, HDI 2, HDI 3, and HDI 4). The corresponding columns in the dataset are associated with key socioeconomic and health indicators: `c_dollar2_poverty`, `c_forest_area`, `c_health_expenditure`, `c_out_of_pocket`, `c_physician`, `c_tuberculosis`, and `c_urban_pop`. Feature importances provide insights into the relative contribution of each feature to the predictive capabilities of the model.

### Methodology:
AdaBoost classifiers, each comprising decision trees with a maximum depth of 10, were trained for each HDI index using a dataset split into training, validation, and testing sets. Feature importances were obtained from the trained models, reflecting the proportion of importance each feature holds in making predictions.

### Results:

### HDI 1 Feature Importances:
- `c_dollar2_poverty`: 20.76%
- `c_forest_area`: 10.57%
- `c_health_expenditure`: 17.21%
- `c_out_of_pocket`: 17.83%
- `c_physician`: 9.48%
- `c_tuberculosis`: 8.68%
- `c_urban_pop`: 15.46%

### HDI 2 Feature Importances:
- `c_dollar2_poverty`: 17.91%
- `c_forest_area`: 10.10%
- `c_health_expenditure`: 18.22%
- `c_out_of_pocket`: 13.65%
- `c_physician`: 11.26%
- `c_tuberculosis`: 9.31%
- `c_urban_pop`: 19.55%

### HDI 3 Feature Importances:
- `c_dollar2_poverty`: 19.07%
- `c_forest_area`: 9.44%
- `c_health_expenditure`: 20.28%
- `c_out_of_pocket`: 14.29%
- `c_physician`: 11.81%
- `c_tuberculosis`: 9.95%
- `c_urban_pop`: 15.16%

### HDI 4 Feature Importances:
- `c_dollar2_poverty`: 19.60%
- `c_forest_area`: 10.41%
- `c_health_expenditure`: 19.91%
- `c_out_of_pocket`: 12.49%
- `c_physician`: 12.06%
- `c_tuberculosis`: 10.21%
- `c_urban_pop`: 15.32%

### Discussion:
1. **Consistent Feature Importance:**
   - The `c_health_expenditure` feature consistently holds high importance across all HDI indices, indicating its significant role in predicting HDI values.

2. **Variable Importance:**
   - While `c_health_expenditure` is crucial, other features also contribute, and their importance varies across HDI indices.

3. **Interpreting Feature Indices:**
   - Feature indices (0 to 6) correspond to specific columns in the dataset, namely `c_dollar2_poverty`, `c_forest_area`, `c_health_expenditure`, `c_out_of_pocket`, `c_physician`, `c_tuberculosis`, and `c_urban_pop`.

4. **Importance Scale:**
   - Feature importances are presented as percentages, offering a relative scale of each feature's contribution within a specific AdaBoost model. The sum of importances for each model is 100%.

### Conclusion:
This analysis provides valuable insights into the features influencing HDI predictions using AdaBoost models. Researchers and policymakers can leverage this information to prioritize and understand the socioeconomic and health indicators crucial for accurate HDI estimation in diverse contexts. Further investigations and model refinements may enhance the interpretability and generalizability of these findings.