# Regression Model to Predict Number of People Showing up at a Vaccination Clinic

<b> Modelling Approach: </b>

Regression model for UK, USA, Japan & Brazil to predict # of people showing up at a vaccination clinic at a monthly level

- Data Cleaning (in SQL)
- Imputing missing data using MICE
- Feature Engineering (Demographic features + covid-related features)
- Regression using KNN & Decision Tree
- Hyper parameter tuning & model stacking -> to increase prediction accuracy


<b> Assumptions: </b>
- Choice of countries based on data availability and to have a representative model for each continent
- Model can be used in the event of another pandemic / epidemic to predict the number of people to be vaccinated at a given point in time and make operational decisions accordingly
- Individual models built for each country as the population of each country / continent respond differently and have different sets of rules / regulations

<b> Model Benefits: </b>
- Gives us a forecast / estimate of how many incoming patients to expect by time of year and by location
- Decision Tree shows the most important features impacting # of people looking to get a vaccine, from the feature importance chart - these features should be monitored. Should any of these change drastically, the # of people looking for a vaccine will also change drastically.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, plot_tree
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from dmba import regressionSummary, stepwise_selection
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from stepwise_regression import step_reg

import statsmodels.imputation.mice as mice
import statsmodels.api as sm
# import statsmodels.regression.linear_model as sm

pd.set_option('display.max_rows', 500)

no display found. Using non-interactive Agg backend


In [None]:
df = pd.read_csv('/Users/mohammadananjaved/Downloads/1_location_key_level_all.csv', index_col = 0)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 646952 entries, 0 to 646951
Data columns (total 34 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   location_key                           646917 non-null  object 
 1   YEAR                                   646952 non-null  int64  
 2   MONTH                                  646952 non-null  int64  
 3   country_code                           646917 non-null  object 
 4   school_closing                         9934 non-null    float64
 5   restrictions_on_gatherings             9915 non-null    float64
 6   country_name                           646952 non-null  object 
 7   population_mean                        610305 non-null  float64
 8   population_male_mean                   496796 non-null  float64
 9   population_age_00_09_mean              486768 non-null  float64
 10  population_age_10_19_mean              486744 non-null  

### Imputing Missing Data - Using MICE

In [None]:
df_categorical = df.select_dtypes('object')
categorical_columns = df_categorical.columns.to_list()

df_numerical = df.drop(columns = categorical_columns)

In [None]:
imp = mice.MICEData(df_numerical)
imputed_data = imp.next_sample()

df_numerical_2 = imputed_data

  self.params[vname] = np.random.multivariate_normal(mean=mu, cov=cov)


In [None]:
df_numerical_2.isna().sum()

YEAR                                     0
MONTH                                    0
school_closing                           0
restrictions_on_gatherings               0
population_mean                          0
population_male_mean                     0
population_age_00_09_mean                0
population_age_10_19_mean                0
area_sq_km_mean                          0
gdp_usd_mean                             0
new_persons_vaccinated_mean              0
cumulative_persons_vaccinatedmean        0
new_confirmed_mean                       0
cumulative_tested_mean                   0
population_rural_mean                    0
population_urban_mean                    0
population_density_mean                  0
human_development_index_mean             0
population_age_20_29_mean                0
population_age_30_39_mean                0
population_age_40_49_mean                0
population_age_50_59_mean                0
population_age_60_69_mean                0
population_

In [None]:
df_3 = df_numerical_2.merge(df_categorical, how = 'inner', left_index = True, right_index = True)
df_3.to_csv('Imputed_Data_All_Countries.csv')

In [None]:
country_df = pd.DataFrame(df['country_name'].value_counts())
country_df.reset_index(inplace = True)
country_df.rename(columns = {'index': 'country_name',
                              'country_name': 'data_points'},
                   inplace = True)
country_df.sort_values(by = 'data_points', ascending = False)

Unnamed: 0,country_name,data_points
0,Brazil,158340
1,United States of America,92471
2,Mexico,70691
3,Peru,53414
4,Israel,41673
5,Spain,39042
6,Colombia,32542
7,India,21229
8,Indonesia,15348
9,Argentina,15244


In [None]:
df_4 = df_3[df_3['country_name'].isin(['United States of America', 'United Kingdom', 'Japan', 'Brazil'])]

In [None]:
df_4.head()

Unnamed: 0,YEAR,MONTH,school_closing,restrictions_on_gatherings,population_mean,population_male_mean,population_age_00_09_mean,population_age_10_19_mean,area_sq_km_mean,gdp_usd_mean,...,population_age_80_and_older_mean,gdp_per_capita_usd_mean,nurses_per_1000_mean,physicians_per_1000_mean,health_expenditure_usd_mean,new_hospitalized_patients_mean,cumulative_hospitalized_patients_mean,location_key,country_code,country_name
0,2020,11,0.0,0.0,18567.0,9334.0,2682.0,3796.0,1061.0,17906830000.0,...,474.0,502.0,5.403,0.5269,204.492249,1.0,22.0,BR_CE_230330,BR,Brazil
1,2020,6,1.0,1.0,24091.0,12139.0,4100.0,5364.0,267.0,15460030000.0,...,364.0,31978.0,3.8938,0.0791,2585.563965,1.6,24.666667,BR_CE_230495,BR,Brazil
2,2021,5,1.0,0.0,18392.0,9414.0,2926.0,3294.0,423.0,302571300000.0,...,250.0,3020.0,1.9262,0.5809,105.768456,1.26087,50.565217,BR_CE_230535,BR,Brazil
3,2020,12,1.0,1.0,44240.0,22091.0,7622.0,9829.0,1111.0,31580.0,...,798.0,7463.0,2.0426,1.1189,222.015488,0.5,81.333333,BR_CE_230810,BR,Brazil
4,2021,3,1.0,0.0,80604.0,39769.0,13513.0,17225.0,2019.0,55361200000.0,...,1540.0,16190.0,1.9412,5.0794,98.824577,0.818182,27.909091,BR_CE_231130,BR,Brazil


#### Filtering down to 4 countries - one from each continent for representation

In [None]:
df_4['country_name'].value_counts()

Brazil                      158340
United States of America     92471
United Kingdom                5555
Japan                         1339
Name: country_name, dtype: int64

In [None]:
country_list = list(df_4['country_name'].unique())

In [None]:
country_list

['Brazil', 'United States of America', 'United Kingdom', 'Japan']

In [None]:
df_4.columns

Index(['YEAR', 'MONTH', 'school_closing', 'restrictions_on_gatherings',
       'population_mean', 'population_male_mean', 'population_age_00_09_mean',
       'population_age_10_19_mean', 'area_sq_km_mean', 'gdp_usd_mean',
       'new_persons_vaccinated_mean', 'cumulative_persons_vaccinatedmean',
       'new_confirmed_mean', 'cumulative_tested_mean', 'population_rural_mean',
       'population_urban_mean', 'population_density_mean',
       'human_development_index_mean', 'population_age_20_29_mean',
       'population_age_30_39_mean', 'population_age_40_49_mean',
       'population_age_50_59_mean', 'population_age_60_69_mean',
       'population_age_70_79_mean', 'population_age_80_and_older_mean',
       'gdp_per_capita_usd_mean', 'nurses_per_1000_mean',
       'physicians_per_1000_mean', 'health_expenditure_usd_mean',
       'new_hospitalized_patients_mean',
       'cumulative_hospitalized_patients_mean', 'location_key', 'country_code',
       'country_name'],
      dtype='object')

### Feature Engineering

In [None]:
df_4['Seasons'] = np.where(df_4['MONTH'] <= 2, 'winter',
                                              np.where(df_4['MONTH'] <= 5, 'spring',
                                                      np.where(df_4['MONTH'] <= 8, 'summer',
                                                              np.where(df_4['MONTH'] <= 11, 'fall', 'winter'))))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_4['Seasons'] = np.where(df_4['MONTH'] <= 2, 'winter',


In [None]:
df_4[['MONTH', 'Seasons']].value_counts()

MONTH  Seasons
5      spring     23528
3      spring     23513
7      summer     23501
8      summer     23422
6      summer     23421
4      spring     23388
1      winter     23342
9      fall       23342
2      winter     23327
12     winter     15669
10     fall       15646
11     fall       15606
dtype: int64

### One-Hot Encoding Categorical Variables

In [None]:
seasons = pd.get_dummies(df_4['Seasons'], drop_first = True)
df_5 = df_4.merge(seasons, how = 'left', left_index = True, right_index = True)

In [None]:
df_5.columns

Index(['YEAR', 'MONTH', 'school_closing', 'restrictions_on_gatherings',
       'population_mean', 'population_male_mean', 'population_age_00_09_mean',
       'population_age_10_19_mean', 'area_sq_km_mean', 'gdp_usd_mean',
       'new_persons_vaccinated_mean', 'cumulative_persons_vaccinatedmean',
       'new_confirmed_mean', 'cumulative_tested_mean', 'population_rural_mean',
       'population_urban_mean', 'population_density_mean',
       'human_development_index_mean', 'population_age_20_29_mean',
       'population_age_30_39_mean', 'population_age_40_49_mean',
       'population_age_50_59_mean', 'population_age_60_69_mean',
       'population_age_70_79_mean', 'population_age_80_and_older_mean',
       'gdp_per_capita_usd_mean', 'nurses_per_1000_mean',
       'physicians_per_1000_mean', 'health_expenditure_usd_mean',
       'new_hospitalized_patients_mean',
       'cumulative_hospitalized_patients_mean', 'location_key', 'country_code',
       'country_name', 'Seasons', 'spring', 'summ

#### Regression - United States

In [None]:
df_6 = df_5[df_5['country_name'] == 'United States of America']

outcome = 'new_persons_vaccinated_mean'

columns_to_drop_from_feature_set = ['YEAR', 'MONTH', 'country_name', 'country_code', 'location_key', 'Seasons',
                    #outcome variables must also be dropped
                      'cumulative_persons_vaccinatedmean', 'new_persons_vaccinated_mean']

X = df_6.drop(columns = columns_to_drop_from_feature_set)
y = df_6[outcome]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

bckwd_select_predictors = step_reg.backward_regression(X_train, y_train, 0.05, verbose = False)

## Decision Tree Regression

param_grid = {
        'max_depth': [10, 25],
        'min_samples_split': [100, 250, 500],
        'min_impurity_decrease': [0, 0.00001, 0.0001]}

gridSearch_tree = GridSearchCV(DecisionTreeRegressor(random_state=1), param_grid, cv=5, n_jobs=-1) # n_jobs=-1 will utilize all available CPUs
gridSearch_tree.fit(X_train, y_train)

bestRegTree = gridSearch_tree.best_estimator_

bestRegTree.fit(X_train, y_train)
us_predict_train = bestRegTree.predict(X_train)
us_predict_test = bestRegTree.predict(X_test)

mean_absolute_error_train_tree = mean_absolute_error(y_train, bestRegTree.predict(X_train))
mean_absolute_error_test_tree = mean_absolute_error(y_test, bestRegTree.predict(X_test))
mean_squared_error_train_tree = mean_squared_error(y_train, bestRegTree.predict(X_train), squared = False)
mean_squared_error_test_tree = mean_squared_error(y_test, bestRegTree.predict(X_test), squared = False)
r2_score_train_tree = r2_score(y_train, bestRegTree.predict(X_train))
r2_score_test_tree = r2_score(y_test, bestRegTree.predict(X_test))

## KNN Regression

sc = StandardScaler()
X_train = sc.fit_transform(X_train[bckwd_select_predictors])
X_test = sc.fit_transform(X_test[bckwd_select_predictors])
knn = KNeighborsRegressor()

param_grid = {'n_neighbors': [50, 75, 100]}

gridSearch_knn = GridSearchCV(knn, param_grid, cv=5, n_jobs=-1, scoring = 'r2') # n_jobs=-1 will utilize all available CPUs
gridSearch_knn.fit(X_train, y_train)

bestKNN = gridSearch_knn.best_estimator_
bestKNN.fit(X_train, y_train)
us_predict_knn_train = bestKNN.predict(X_train)
us_predict_knn_test = bestKNN.predict(X_test)

mean_absolute_error_train_knn = mean_absolute_error(y_train, bestKNN.predict(X_train))
mean_absolute_error_test_knn = mean_absolute_error(y_test, bestKNN.predict(X_test))
mean_squared_error_train_knn = mean_squared_error(y_train, bestKNN.predict(X_train), squared = False)
mean_squared_error_test_knn = mean_squared_error(y_test, bestKNN.predict(X_test), squared = False)
r2_score_train_knn = r2_score(y_train, bestKNN.predict(X_train))
r2_score_test_knn = r2_score(y_test, bestKNN.predict(X_test))


## Stacking the models to get the best results

y_train = pd.DataFrame(y_train).reset_index().drop(columns = 'index')
y_test = pd.DataFrame(y_test).reset_index().drop(columns = 'index')

us_results_train = (y_train.merge(pd.DataFrame(us_predict_knn_train), how = 'left',
                                  left_index = True, right_index = True)).merge(pd.DataFrame(us_predict_train),
                                                                                how = 'left', left_index = True,
                                                                                    right_index = True)

us_results_test = (y_test.merge(pd.DataFrame(us_predict_knn_test), how = 'left',
                                  left_index = True, right_index = True)).merge(pd.DataFrame(us_predict_test),
                                                                                how = 'left', left_index = True,
                                                                                    right_index = True)

us_results_train['averaged_prediction'] = us_results_train[['0_x', '0_y']].mean(axis = 1)
us_results_test['averaged_prediction'] = us_results_test[['0_x', '0_y']].mean(axis = 1)

mean_absolute_error_train_stacked = mean_absolute_error(us_results_train['new_persons_vaccinated_mean'],
                                                     us_results_train['averaged_prediction'])

mean_absolute_error_test_stacked = mean_absolute_error(us_results_test['new_persons_vaccinated_mean'],
                                                     us_results_test['averaged_prediction'])

mean_squared_error_train_stacked = mean_squared_error(us_results_train['new_persons_vaccinated_mean'],
                                                     us_results_train['averaged_prediction'],
                                                     squared = False)

mean_squared_error_test_stacked = mean_squared_error(us_results_test['new_persons_vaccinated_mean'],
                                                     us_results_test['averaged_prediction'],
                                                     squared = False)

r2_score_train_stacked = r2_score(us_results_train['new_persons_vaccinated_mean'], us_results_train['averaged_prediction'])
r2_score_test_stacked = r2_score(us_results_test['new_persons_vaccinated_mean'], us_results_test['averaged_prediction'])

us_evaluation_results = pd.DataFrame({

        'Country': ['USA', 'USA', 'USA'],

        'Model' : ['Decision_Tree', 'KNN', 'Stacked'],

        'MEA (Train)' : [mean_absolute_error_train_tree,  mean_absolute_error_train_knn,
                         mean_absolute_error_train_stacked],

        'MEA (Test)' :  [mean_absolute_error_test_tree, mean_absolute_error_test_knn,
                         mean_absolute_error_test_stacked],

        'RMSE (Train)' : [mean_squared_error_train_tree, mean_squared_error_train_knn,
                          mean_squared_error_train_stacked],

        'RMSE (Test)' :  [mean_squared_error_test_tree, mean_squared_error_test_knn,
                         mean_squared_error_test_stacked],

        'R2 (Train)' :   [r2_score_train_tree, r2_score_train_knn, r2_score_train_stacked],

        'R2 (Test)'  :   [r2_score_test_tree, r2_score_test_knn, r2_score_test_stacked]})


us_evaluation_results

Unnamed: 0,Country,Model,MEA (Train),MEA (Test),RMSE (Train),RMSE (Test),R2 (Train),R2 (Test)
0,USA,Decision_Tree,452.511807,499.732525,5925.808988,8645.050474,0.44745,0.381864
1,USA,KNN,470.978171,510.710209,6440.446048,9276.702343,0.347308,0.288235
2,USA,Stacked,453.051548,495.678466,6037.69375,8807.434382,0.426387,0.358424


In [None]:
gridSearch_tree.best_params_

{'max_depth': 10, 'min_impurity_decrease': 0, 'min_samples_split': 100}

In [None]:
gridSearch_knn.best_params_

{'n_neighbors': 50}

#### Final Results From US - Regression

In [None]:
us_best_tree = gridSearch_tree.best_estimator_
us_best_knn = gridSearch_knn.best_estimator_

us_results_tree = us_best_tree.predict(X)
us_results_knn = us_best_knn.predict(sc.fit_transform(X[bckwd_select_predictors]))

us_results_tree = pd.DataFrame(us_results_tree)
us_results_knn = pd.DataFrame(us_results_knn)
us_results = (X.merge(us_results_knn, how = 'inner', left_index = True, right_index = True)).merge(us_results_tree,
                                                                                                  how = 'inner',
                                                                                                  left_index = True,
                                                                                                  right_index = True)

In [None]:
us_results.rename(columns = {'0_x': 'Decision_Tree_Result',
                            '0_y': 'KNN_Result'}, inplace = True)

us_results['Stacking_Results'] = us_results[['Decision_Tree_Result', 'KNN_Result']].mean(axis = 1)

In [None]:
us_results.head()

Unnamed: 0,school_closing,restrictions_on_gatherings,population_mean,population_male_mean,population_age_00_09_mean,population_age_10_19_mean,area_sq_km_mean,gdp_usd_mean,new_confirmed_mean,cumulative_tested_mean,...,physicians_per_1000_mean,health_expenditure_usd_mean,new_hospitalized_patients_mean,cumulative_hospitalized_patients_mean,spring,summer,winter,Decision_Tree_Result,KNN_Result,Stacking_Results
154,0.0,1.0,8182.0,4127.0,1222.0,1225.0,891.0,34272270000.0,2.064516,309.0,...,3.1905,38.426441,0.0,67.0,0,1,0,92.294952,132.806763,112.550858
155,1.0,1.0,24067.0,12759.0,2776.0,2766.0,492.0,29455970000.0,0.741935,115.578947,...,4.4833,1300.481689,20.709677,9.0,0,1,0,98.094289,132.806763,115.450526
156,0.0,0.0,10205.0,6677.0,1488.0,1270.0,651.0,11412550000.0,2.7,782.380952,...,0.4017,301.150055,0.4,18.8,1,0,0,104.680875,132.806763,118.743819
157,1.0,1.0,16254.0,7934.0,1666.0,2088.0,451.0,63237320000.0,2.774194,0.0,...,5.1905,1112.303223,233.607143,4.645161,0,0,0,58.423253,220.04925,139.236251
158,0.0,1.0,143376.0,68189.0,19132.0,22113.0,519.0,33874610000.0,5.4,2572.210526,...,3.1905,1300.481689,0.4,9.833333,1,0,0,58.229844,132.806763,95.518304


#### Feature Importance Chart for US Results

In [None]:
import matplotlib
matplotlib.use('TkAgg')

importances = bestRegTree.feature_importances_
feature_names = X.columns

# Sort the feature importances in descending order
indices = np.argsort(importances)[::-1]
sorted_feature_names = [feature_names[i] for i in indices]
sorted_importances = importances[indices]

# Create the plot
plt.figure()
plt.title("Feature Importance")
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), sorted_feature_names, rotation=90)
plt.show()

#### Regression - Brazil

In [None]:
df_7 = df_5[df_5['country_name'] == 'Brazil']

outcome = 'new_persons_vaccinated_mean'

columns_to_drop_from_feature_set = ['YEAR', 'MONTH', 'country_name', 'country_code', 'location_key', 'Seasons',
                    #outcome variables must also be dropped
                      'cumulative_persons_vaccinatedmean', 'new_persons_vaccinated_mean']

X = df_7.drop(columns = columns_to_drop_from_feature_set)
y = df_7[outcome]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

bckwd_select_predictors = step_reg.backward_regression(X_train, y_train, 0.05, verbose = False)

## Decision Tree Regression

param_grid = {
        'max_depth': [5, 10, 15, 25, 50],
        'min_samples_split': [100, 500, 1000],
        'min_impurity_decrease': [0, 0.0001, 0.0025, 0.0005]}

gridSearch_tree = GridSearchCV(DecisionTreeRegressor(random_state=1), param_grid, cv=5, n_jobs=-1, scoring =
                              'r2') # n_jobs=-1 will utilize all available CPUs
gridSearch_tree.fit(X_train, y_train)

bestRegTree = gridSearch_tree.best_estimator_

bestRegTree.fit(X_train, y_train)
brazil_predict_train = bestRegTree.predict(X_train)
brazil_predict_test = bestRegTree.predict(X_test)

mean_absolute_error_train_tree = mean_absolute_error(y_train, bestRegTree.predict(X_train))
mean_absolute_error_test_tree = mean_absolute_error(y_test, bestRegTree.predict(X_test))
mean_squared_error_train_tree = mean_squared_error(y_train, bestRegTree.predict(X_train), squared = False)
mean_squared_error_test_tree = mean_squared_error(y_test, bestRegTree.predict(X_test), squared = False)
r2_score_train_tree = r2_score(y_train, bestRegTree.predict(X_train))
r2_score_test_tree = r2_score(y_test, bestRegTree.predict(X_test))

##KNN Regression

sc = StandardScaler()
X_train = sc.fit_transform(X_train[bckwd_select_predictors])
X_test = sc.fit_transform(X_test[bckwd_select_predictors])
knn = KNeighborsRegressor()

param_grid = {'n_neighbors': [25, 50, 100, 250, 500]}

gridSearch_knn = GridSearchCV(knn, param_grid, cv=5, n_jobs=-1, scoring = 'r2') # n_jobs=-1 will utilize all available CPUs
gridSearch_knn.fit(X_train, y_train)

bestKNN = gridSearch_knn.best_estimator_
bestKNN.fit(X_train, y_train)
brazil_predict_knn_train = bestKNN.predict(X_train)
brazil_predict_knn_test = bestKNN.predict(X_test)

mean_absolute_error_train_knn = mean_absolute_error(y_train, bestKNN.predict(X_train))
mean_absolute_error_test_knn = mean_absolute_error(y_test, bestKNN.predict(X_test))
mean_squared_error_train_knn = mean_squared_error(y_train, bestKNN.predict(X_train), squared = False)
mean_squared_error_test_knn = mean_squared_error(y_test, bestKNN.predict(X_test), squared = False)
r2_score_train_knn = r2_score(y_train, bestKNN.predict(X_train))
r2_score_test_knn = r2_score(y_test, bestKNN.predict(X_test))

#Stacking to increase predictive accuracy

y_train = pd.DataFrame(y_train).reset_index().drop(columns = 'index')
y_test = pd.DataFrame(y_test).reset_index().drop(columns = 'index')

brazil_results_train = (y_train.merge(pd.DataFrame(brazil_predict_knn_train), how = 'left',
                                  left_index = True, right_index = True)).merge(pd.DataFrame(brazil_predict_train),
                                                                                how = 'left', left_index = True,
                                                                                    right_index = True)

brazil_results_test = (y_test.merge(pd.DataFrame(brazil_predict_knn_test), how = 'left',
                                  left_index = True, right_index = True)).merge(pd.DataFrame(brazil_predict_test),
                                                                                how = 'left', left_index = True,
                                                                                    right_index = True)

brazil_results_train['averaged_prediction'] = brazil_results_train[['0_x', '0_y']].mean(axis = 1)
brazil_results_test['averaged_prediction'] = brazil_results_test[['0_x', '0_y']].mean(axis = 1)

mean_absolute_error_train_stacked = mean_absolute_error(brazil_results_train['new_persons_vaccinated_mean'],
                                                     brazil_results_train['averaged_prediction'])

mean_absolute_error_test_stacked = mean_absolute_error(brazil_results_test['new_persons_vaccinated_mean'],
                                                     brazil_results_test['averaged_prediction'])

mean_squared_error_train_stacked = mean_squared_error(brazil_results_train['new_persons_vaccinated_mean'],
                                                     brazil_results_train['averaged_prediction'],
                                                     squared = False)

mean_squared_error_test_stacked = mean_squared_error(brazil_results_test['new_persons_vaccinated_mean'],
                                                     brazil_results_test['averaged_prediction'],
                                                     squared = False)

r2_score_train_stacked = r2_score(brazil_results_train['new_persons_vaccinated_mean'],
                                  brazil_results_train['averaged_prediction'])

r2_score_test_stacked = r2_score(brazil_results_test['new_persons_vaccinated_mean'],
                                 brazil_results_test['averaged_prediction'])

brazil_evaluation_results = pd.DataFrame({

        'Country': ['Brazil', 'Brazil', 'Brazil'],

        'Model' : ['Decision_Tree', 'KNN', 'Stacked'],

        'MEA (Train)' : [mean_absolute_error_train_tree,  mean_absolute_error_train_knn,
                         mean_absolute_error_train_stacked],

        'MEA (Test)' :  [mean_absolute_error_test_tree, mean_absolute_error_test_knn,
                         mean_absolute_error_test_stacked],

        'RMSE (Train)' : [mean_squared_error_train_tree, mean_squared_error_train_knn,
                          mean_squared_error_train_stacked],

        'RMSE (Test)' :  [mean_squared_error_test_tree, mean_squared_error_test_knn,
                         mean_squared_error_test_stacked],

        'R2 (Train)' :   [r2_score_train_tree, r2_score_train_knn, r2_score_train_stacked],

        'R2 (Test)'  :   [r2_score_test_tree, r2_score_test_knn, r2_score_test_stacked]})


brazil_evaluation_results

Unnamed: 0,Country,Model,MEA (Train),MEA (Test),RMSE (Train),RMSE (Test),R2 (Train),R2 (Test)
0,Brazil,Decision_Tree,336.256588,392.797667,4140.032383,7138.581971,0.559178,-0.090939
1,Brazil,KNN,335.529132,348.503735,4938.100051,5614.596314,0.372844,0.32514
2,Brazil,Stacked,329.539,365.795538,4290.043911,5731.409601,0.526654,0.296767


In [None]:
gridSearch_tree.best_params_

{'max_depth': 10, 'min_impurity_decrease': 0, 'min_samples_split': 1000}

#### Brazil Model Results

In [None]:
brazil_best_tree = gridSearch_tree.best_estimator_
brazil_best_knn = gridSearch_knn.best_estimator_

brazil_results_tree = brazil_best_tree.predict(X)
brazil_results_knn = brazil_best_knn.predict(sc.fit_transform(X[bckwd_select_predictors]))

brazil_results_tree = pd.DataFrame(brazil_results_tree)
brazil_results_knn = pd.DataFrame(brazil_results_knn)
brazil_results = (X.merge(brazil_results_knn, how = 'inner', left_index = True, right_index = True)).merge(brazil_results_tree,
                                                                                                  how = 'inner',
                                                                                                  left_index = True,
                                                                                                  right_index = True)

brazil_results.rename(columns = {'0_x': 'Decision_Tree_Result',
                            '0_y': 'KNN_Result'}, inplace = True)

brazil_results['Stacking_Results'] = us_results[['Decision_Tree_Result', 'KNN_Result']].mean(axis = 1)

brazil_results.head()

Unnamed: 0,school_closing,restrictions_on_gatherings,population_mean,population_male_mean,population_age_00_09_mean,population_age_10_19_mean,area_sq_km_mean,gdp_usd_mean,new_confirmed_mean,cumulative_tested_mean,...,physicians_per_1000_mean,health_expenditure_usd_mean,new_hospitalized_patients_mean,cumulative_hospitalized_patients_mean,spring,summer,winter,Decision_Tree_Result,KNN_Result,Stacking_Results
0,0.0,0.0,18567.0,9334.0,2682.0,3796.0,1061.0,17906830000.0,1.033333,1796.772727,...,0.5269,204.492249,1.0,22.0,0,0,0,74.657036,138.138482,
1,1.0,1.0,24091.0,12139.0,4100.0,5364.0,267.0,15460030000.0,4.766667,414.5,...,0.0791,2585.563965,1.6,24.666667,0,1,0,156.898673,211.976029,
2,1.0,0.0,18392.0,9414.0,2926.0,3294.0,423.0,302571300000.0,16.354839,3558.074074,...,0.5809,105.768456,1.26087,50.565217,1,0,0,281.743438,138.138482,
3,1.0,1.0,44240.0,22091.0,7622.0,9829.0,1111.0,31580.0,4.193548,5938.692308,...,1.1189,222.015488,0.5,81.333333,0,0,1,122.789533,138.138482,
4,1.0,0.0,80604.0,39769.0,13513.0,17225.0,2019.0,55361200000.0,16.193548,13122.870968,...,5.0794,98.824577,0.818182,27.909091,1,0,0,61.114955,138.138482,


#### Feature Importance Chart - Brazil

In [None]:
matplotlib.use('TkAgg')

importances = bestRegTree.feature_importances_
feature_names = X.columns

# Sort the feature importances in descending order
indices = np.argsort(importances)[::-1]
sorted_feature_names = [feature_names[i] for i in indices]
sorted_importances = importances[indices]

# Create the plot
plt.figure()
plt.title("Feature Importance")
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), sorted_feature_names, rotation=90)
plt.show()

In [None]:
gridSearch_tree.best_params_

{'max_depth': 10, 'min_impurity_decrease': 0, 'min_samples_split': 1000}

In [None]:
gridSearch_knn.best_params_

{'n_neighbors': 50}

#### Regression - UK

In [None]:
df_8 = df_5[df_5['country_name'] == 'United Kingdom']

outcome = 'new_persons_vaccinated_mean'

columns_to_drop_from_feature_set = ['YEAR', 'MONTH', 'country_name', 'country_code', 'location_key', 'Seasons',
                    #outcome variables must also be dropped
                      'cumulative_persons_vaccinatedmean', 'new_persons_vaccinated_mean']

X = df_8.drop(columns = columns_to_drop_from_feature_set)
y = df_8[outcome]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

bckwd_select_predictors = step_reg.backward_regression(X_train, y_train, 0.05, verbose = False)

## Decision Tree Regression

param_grid = {
        'max_depth': [5, 10, 15, 50],
        'min_samples_split': [50, 100, 500, 1000],
        'min_impurity_decrease': [0, 0.0001, 0.0025, 0.0005]}

gridSearch_tree = GridSearchCV(DecisionTreeRegressor(random_state=1), param_grid, cv=5, n_jobs=-1) # n_jobs=-1 will utilize all available CPUs
gridSearch_tree.fit(X_train, y_train)

bestRegTree = gridSearch_tree.best_estimator_

bestRegTree.fit(X_train, y_train)
uk_predict_train = bestRegTree.predict(X_train)
uk_predict_test = bestRegTree.predict(X_test)

mean_absolute_error_train_tree = mean_absolute_error(y_train, bestRegTree.predict(X_train))
mean_absolute_error_test_tree = mean_absolute_error(y_test, bestRegTree.predict(X_test))
mean_squared_error_train_tree = mean_squared_error(y_train, bestRegTree.predict(X_train), squared = False)
mean_squared_error_test_tree = mean_squared_error(y_test, bestRegTree.predict(X_test), squared = False)
r2_score_train_tree = r2_score(y_train, bestRegTree.predict(X_train))
r2_score_test_tree = r2_score(y_test, bestRegTree.predict(X_test))

##KNN Regression

sc = StandardScaler()
X_train = sc.fit_transform(X_train[bckwd_select_predictors])
X_test = sc.fit_transform(X_test[bckwd_select_predictors])
knn = KNeighborsRegressor()

param_grid = {'n_neighbors': [50, 100, 250, 750, 1000]}

gridSearch_knn = GridSearchCV(knn, param_grid, cv=5, n_jobs=-1) # n_jobs=-1 will utilize all available CPUs
gridSearch_knn.fit(X_train, y_train)

bestKNN = gridSearch_knn.best_estimator_
bestKNN.fit(X_train, y_train)
uk_predict_knn_train = bestKNN.predict(X_train)
uk_predict_knn_test = bestKNN.predict(X_test)

mean_absolute_error_train_knn = mean_absolute_error(y_train, bestKNN.predict(X_train))
mean_absolute_error_test_knn = mean_absolute_error(y_test, bestKNN.predict(X_test))
mean_squared_error_train_knn = mean_squared_error(y_train, bestKNN.predict(X_train), squared = False)
mean_squared_error_test_knn = mean_squared_error(y_test, bestKNN.predict(X_test), squared = False)
r2_score_train_knn = r2_score(y_train, bestKNN.predict(X_train))
r2_score_test_knn = r2_score(y_test, bestKNN.predict(X_test))

## Stacking to increase predictive accuracy

y_train = pd.DataFrame(y_train).reset_index().drop(columns = 'index')
y_test = pd.DataFrame(y_test).reset_index().drop(columns = 'index')

uk_results_train = (y_train.merge(pd.DataFrame(uk_predict_knn_train), how = 'left',
                                  left_index = True, right_index = True)).merge(pd.DataFrame(uk_predict_train),
                                                                                how = 'left', left_index = True,
                                                                                    right_index = True)

uk_results_test = (y_test.merge(pd.DataFrame(uk_predict_knn_test), how = 'left',
                                  left_index = True, right_index = True)).merge(pd.DataFrame(uk_predict_test),
                                                                                how = 'left', left_index = True,
                                                                                    right_index = True)

uk_results_train['averaged_prediction'] = uk_results_train[['0_x', '0_y']].mean(axis = 1)
uk_results_test['averaged_prediction'] = uk_results_test[['0_x', '0_y']].mean(axis = 1)

mean_absolute_error_train_stacked = mean_absolute_error(uk_results_train['new_persons_vaccinated_mean'],
                                                     uk_results_train['averaged_prediction'])

mean_absolute_error_test_stacked = mean_absolute_error(uk_results_test['new_persons_vaccinated_mean'],
                                                     uk_results_test['averaged_prediction'])

mean_squared_error_train_stacked = mean_squared_error(uk_results_train['new_persons_vaccinated_mean'],
                                                     uk_results_train['averaged_prediction'],
                                                     squared = False)

mean_squared_error_test_stacked = mean_squared_error(uk_results_test['new_persons_vaccinated_mean'],
                                                     uk_results_test['averaged_prediction'],
                                                     squared = False)

r2_score_train_stacked = r2_score(uk_results_train['new_persons_vaccinated_mean'],
                                  uk_results_train['averaged_prediction'])

r2_score_test_stacked = r2_score(uk_results_test['new_persons_vaccinated_mean'],
                                 uk_results_test['averaged_prediction'])

uk_evaluation_results = pd.DataFrame({

        'Country': ['UK', 'UK', 'UK'],

        'Model' : ['Decision_Tree', 'KNN', 'Stacked'],

        'MEA (Train)' : [mean_absolute_error_train_tree,  mean_absolute_error_train_knn,
                         mean_absolute_error_train_stacked],

        'MEA (Test)' :  [mean_absolute_error_test_tree, mean_absolute_error_test_knn,
                         mean_absolute_error_test_stacked],

        'RMSE (Train)' : [mean_squared_error_train_tree, mean_squared_error_train_knn,
                          mean_squared_error_train_stacked],

        'RMSE (Test)' :  [mean_squared_error_test_tree, mean_squared_error_test_knn,
                         mean_squared_error_test_stacked],

        'R2 (Train)' :   [r2_score_train_tree, r2_score_train_knn, r2_score_train_stacked],

        'R2 (Test)'  :   [r2_score_test_tree, r2_score_test_knn, r2_score_test_stacked]})


uk_evaluation_results

Unnamed: 0,Country,Model,MEA (Train),MEA (Test),RMSE (Train),RMSE (Test),R2 (Train),R2 (Test)
0,UK,Decision_Tree,3841.018611,15678.79961,21146.226713,288334.3647,0.27171,0.002858
1,UK,KNN,4038.723165,15452.276983,21868.304349,288736.983257,0.221123,7.1e-05
2,UK,Stacked,3870.740356,15498.138722,21259.023231,288517.026568,0.263919,0.001594


#### Final Model Results - UK

In [None]:
uk_best_tree = gridSearch_tree.best_estimator_
uk_best_knn = gridSearch_knn.best_estimator_

uk_results_tree = uk_best_tree.predict(X)
uk_results_knn = uk_best_knn.predict(sc.fit_transform(X[bckwd_select_predictors]))

uk_results_tree = pd.DataFrame(uk_results_tree)
uk_results_knn = pd.DataFrame(uk_results_knn)
uk_results = (X.merge(uk_results_knn, how = 'inner', left_index = True, right_index = True)).merge(uk_results_tree,
                                                                                                  how = 'inner',
                                                                                                  left_index = True,
                                                                                                  right_index = True)

uk_results.rename(columns = {'0_x': 'Decision_Tree_Result',
                            '0_y': 'KNN_Result'}, inplace = True)

uk_results['Stacking_Results'] = uk_results[['Decision_Tree_Result', 'KNN_Result']].mean(axis = 1)

uk_results.head()

Unnamed: 0,school_closing,restrictions_on_gatherings,population_mean,population_male_mean,population_age_00_09_mean,population_age_10_19_mean,area_sq_km_mean,gdp_usd_mean,new_confirmed_mean,cumulative_tested_mean,...,physicians_per_1000_mean,health_expenditure_usd_mean,new_hospitalized_patients_mean,cumulative_hospitalized_patients_mean,spring,summer,winter,Decision_Tree_Result,KNN_Result,Stacking_Results
399,0.0,0.0,106522.0,51880.0,12481.0,12049.0,197.0,37333960000.0,1.580645,40.428571,...,1.3544,340.661804,0.0,106.0,1,0,0,111.824743,174.970709,143.397726
400,1.0,1.0,139098.0,68956.0,16428.0,15214.0,34.0,16022670000.0,1.903226,0.0,...,0.0838,191.185776,2.935484,12.266667,0,1,0,215.458037,174.970709,195.214373
401,1.0,1.0,568612.0,276253.0,60212.0,60546.0,3545.0,12741920000.0,402.935484,25894.0,...,2.0068,62.124634,0.0,4582.483871,0,0,0,74.520739,174.970709,124.745724
402,0.0,0.0,330712.0,163437.0,39241.0,37085.0,188.0,10045150000.0,5.193548,4065.16129,...,0.134,3361.644775,0.032258,12.625,0,1,0,122.375123,174.970709,148.672916
403,1.0,1.0,150265.0,71188.0,17195.0,16250.0,64.0,31580.0,3.529412,1455.071429,...,1.1189,171.41748,75.666667,575.935484,1,0,0,132.491907,174.970709,153.731308


#### Feature Importance Chart - UK

In [None]:
matplotlib.use('TkAgg')

importances = bestRegTree.feature_importances_
feature_names = X.columns

# Sort the feature importances in descending order
indices = np.argsort(importances)[::-1]
sorted_feature_names = [feature_names[i] for i in indices]
sorted_importances = importances[indices]

# Create the plot
plt.figure()
plt.title("Feature Importance")
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), sorted_feature_names, rotation=90)
plt.show()

In [None]:
gridSearch_tree.best_params_

{'max_depth': 10, 'min_impurity_decrease': 0, 'min_samples_split': 100}

In [None]:
gridSearch_knn.best_params_

{'n_neighbors': 50}

#### Regression - Japan

In [None]:
df_9 = df_5[df_5['country_name'] == 'Japan']

outcome = 'new_persons_vaccinated_mean'

columns_to_drop_from_feature_set = ['YEAR', 'MONTH', 'country_name', 'country_code', 'location_key', 'Seasons',
                    #outcome variables must also be dropped
                      'cumulative_persons_vaccinatedmean', 'new_persons_vaccinated_mean']

X = df_8.drop(columns = columns_to_drop_from_feature_set)
y = df_8[outcome]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

bckwd_select_predictors = step_reg.backward_regression(X_train, y_train, 0.05, verbose = False)

## Decision Tree Regression

param_grid = {
        'max_depth': [5, 10, 15, 20],
        'min_samples_split': [50, 100, 250, 500],
        'min_impurity_decrease': [0, 0.0001, 0.0025, 0.0005]}

gridSearch_tree = GridSearchCV(DecisionTreeRegressor(random_state=1), param_grid, cv=5, n_jobs=-1,
                              scoring = 'r2') # n_jobs=-1 will utilize all available CPUs
gridSearch_tree.fit(X_train, y_train)

bestRegTree = gridSearch_tree.best_estimator_

bestRegTree.fit(X_train, y_train)
japan_predict_train = bestRegTree.predict(X_train)
japan_predict_test = bestRegTree.predict(X_test)

mean_absolute_error_train_tree = mean_absolute_error(y_train, bestRegTree.predict(X_train))
mean_absolute_error_test_tree = mean_absolute_error(y_test, bestRegTree.predict(X_test))
mean_squared_error_train_tree = mean_squared_error(y_train, bestRegTree.predict(X_train), squared = False)
mean_squared_error_test_tree = mean_squared_error(y_test, bestRegTree.predict(X_test), squared = False)
r2_score_train_tree = r2_score(y_train, bestRegTree.predict(X_train))
r2_score_test_tree = r2_score(y_test, bestRegTree.predict(X_test))

#KNN Regression

sc = StandardScaler()
X_train = sc.fit_transform(X_train[bckwd_select_predictors])
X_test = sc.fit_transform(X_test[bckwd_select_predictors])
knn = KNeighborsRegressor()

param_grid = {'n_neighbors': [25, 50, 100, 500]}

gridSearch_knn = GridSearchCV(knn, param_grid, cv=5, n_jobs=-1, scoring = 'r2') # n_jobs=-1 will utilize all available CPUs
gridSearch_knn.fit(X_train, y_train)

bestKNN = gridSearch_knn.best_estimator_
bestKNN.fit(X_train, y_train)
japan_predict_knn_train = bestKNN.predict(X_train)
japan_predict_knn_test = bestKNN.predict(X_test)

mean_absolute_error_train_knn = mean_absolute_error(y_train, bestKNN.predict(X_train))
mean_absolute_error_test_knn = mean_absolute_error(y_test, bestKNN.predict(X_test))
mean_squared_error_train_knn = mean_squared_error(y_train, bestKNN.predict(X_train), squared = False)
mean_squared_error_test_knn = mean_squared_error(y_test, bestKNN.predict(X_test), squared = False)
r2_score_train_knn = r2_score(y_train, bestKNN.predict(X_train))
r2_score_test_knn = r2_score(y_test, bestKNN.predict(X_test))

# Stacking to increase accuracy

y_train = pd.DataFrame(y_train).reset_index().drop(columns = 'index')
y_test = pd.DataFrame(y_test).reset_index().drop(columns = 'index')

japan_results_train = (y_train.merge(pd.DataFrame(japan_predict_knn_train), how = 'left',
                                  left_index = True, right_index = True)).merge(pd.DataFrame(japan_predict_train),
                                                                                how = 'left', left_index = True,
                                                                                    right_index = True)

japan_results_test = (y_test.merge(pd.DataFrame(japan_predict_knn_test), how = 'left',
                                  left_index = True, right_index = True)).merge(pd.DataFrame(japan_predict_test),
                                                                                how = 'left', left_index = True,
                                                                                    right_index = True)

japan_results_train['averaged_prediction'] = japan_results_train[['0_x', '0_y']].mean(axis = 1)
japan_results_test['averaged_prediction'] = japan_results_test[['0_x', '0_y']].mean(axis = 1)

mean_absolute_error_train_stacked = mean_absolute_error(japan_results_train['new_persons_vaccinated_mean'],
                                                     japan_results_train['averaged_prediction'])

mean_absolute_error_test_stacked = mean_absolute_error(japan_results_test['new_persons_vaccinated_mean'],
                                                     japan_results_test['averaged_prediction'])

mean_squared_error_train_stacked = mean_squared_error(japan_results_train['new_persons_vaccinated_mean'],
                                                     japan_results_train['averaged_prediction'],
                                                     squared = False)

mean_squared_error_test_stacked = mean_squared_error(japan_results_test['new_persons_vaccinated_mean'],
                                                     japan_results_test['averaged_prediction'],
                                                     squared = False)

r2_score_train_stacked = r2_score(japan_results_train['new_persons_vaccinated_mean'],
                                  japan_results_train['averaged_prediction'])

r2_score_test_stacked = r2_score(japan_results_test['new_persons_vaccinated_mean'],
                                 japan_results_test['averaged_prediction'])

japan_evaluation_results = pd.DataFrame({

        'Country': ['Japan', 'Japan', 'Japan'],

        'Model' : ['Decision_Tree', 'KNN', 'Stacked'],

        'MEA (Train)' : [mean_absolute_error_train_tree,  mean_absolute_error_train_knn,
                         mean_absolute_error_train_stacked],

        'MEA (Test)' :  [mean_absolute_error_test_tree, mean_absolute_error_test_knn,
                         mean_absolute_error_test_stacked],

        'RMSE (Train)' : [mean_squared_error_train_tree, mean_squared_error_train_knn,
                          mean_squared_error_train_stacked],

        'RMSE (Test)' :  [mean_squared_error_test_tree, mean_squared_error_test_knn,
                         mean_squared_error_test_stacked],

        'R2 (Train)' :   [r2_score_train_tree, r2_score_train_knn, r2_score_train_stacked],

        'R2 (Test)'  :   [r2_score_test_tree, r2_score_test_knn, r2_score_test_stacked]})


japan_evaluation_results

Unnamed: 0,Country,Model,MEA (Train),MEA (Test),RMSE (Train),RMSE (Test),R2 (Train),R2 (Test)
0,Japan,Decision_Tree,4007.989055,15794.814059,21265.817167,288505.631443,0.263449,0.001673
1,Japan,KNN,3995.638062,15641.63203,21009.572677,288636.106543,0.281092,0.00077
2,Japan,Stacked,3941.329501,15676.909764,20825.119829,288548.148194,0.29366,0.001379


#### Final Model Output - Japan

In [None]:
japan_best_tree = gridSearch_tree.best_estimator_
japan_best_knn = gridSearch_knn.best_estimator_

japan_results_tree = japan_best_tree.predict(X)
japan_results_knn = japan_best_knn.predict(sc.fit_transform(X[bckwd_select_predictors]))

japan_results_tree = pd.DataFrame(japan_results_tree)
japan_results_knn = pd.DataFrame(japan_results_knn)
japan_results = (X.merge(japan_results_knn, how = 'inner', left_index = True, right_index = True)).merge(japan_results_tree,
                                                                                                  how = 'inner',
                                                                                                  left_index = True,
                                                                                                  right_index = True)

japan_results.rename(columns = {'0_x': 'Decision_Tree_Result',
                            '0_y': 'KNN_Result'}, inplace = True)

japan_results['Stacking_Results'] = japan_results[['Decision_Tree_Result', 'KNN_Result']].mean(axis = 1)

japan_results.head()

Unnamed: 0,school_closing,restrictions_on_gatherings,population_mean,population_male_mean,population_age_00_09_mean,population_age_10_19_mean,area_sq_km_mean,gdp_usd_mean,new_confirmed_mean,cumulative_tested_mean,...,physicians_per_1000_mean,health_expenditure_usd_mean,new_hospitalized_patients_mean,cumulative_hospitalized_patients_mean,spring,summer,winter,Decision_Tree_Result,KNN_Result,Stacking_Results
399,0.0,0.0,106522.0,51880.0,12481.0,12049.0,197.0,37333960000.0,1.580645,40.428571,...,1.3544,340.661804,0.0,106.0,1,0,0,145.208439,174.970709,160.089574
400,1.0,1.0,139098.0,68956.0,16428.0,15214.0,34.0,16022670000.0,1.903226,0.0,...,0.0838,191.185776,2.935484,12.266667,0,1,0,73.228734,174.970709,124.099722
401,1.0,1.0,568612.0,276253.0,60212.0,60546.0,3545.0,12741920000.0,402.935484,25894.0,...,2.0068,62.124634,0.0,4582.483871,0,0,0,62.131095,174.970709,118.550902
402,0.0,0.0,330712.0,163437.0,39241.0,37085.0,188.0,10045150000.0,5.193548,4065.16129,...,0.134,3361.644775,0.032258,12.625,0,1,0,153.817561,174.970709,164.394135
403,1.0,1.0,150265.0,71188.0,17195.0,16250.0,64.0,31580.0,3.529412,1455.071429,...,1.1189,171.41748,75.666667,575.935484,1,0,0,156.519934,174.970709,165.745322


#### Feature Importance Chart - Japan

In [None]:
importances = bestRegTree.feature_importances_
feature_names = X.columns

# Sort the feature importances in descending order
indices = np.argsort(importances)[::-1]
sorted_feature_names = [feature_names[i] for i in indices]
sorted_importances = importances[indices]

# Create the plot
plt.figure()
plt.title("Feature Importance")
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), sorted_feature_names, rotation=90)
plt.show()

In [None]:
gridSearch_tree.best_params_

{'max_depth': 10, 'min_impurity_decrease': 0, 'min_samples_split': 250}

In [None]:
gridSearch_knn.best_params_

{'n_neighbors': 25}

#### Final Model Output

In [None]:
final_output = pd.concat([japan_results, uk_results, brazil_results, us_results])

In [None]:
final_output.to_csv('Regression_Model_Outputs.csv')

#### Final Model Results

In [None]:
final_model_evaluation = pd.concat([japan_evaluation_results, uk_evaluation_results,
                                    brazil_evaluation_results, us_evaluation_results])

In [None]:
final_model_evaluation

Unnamed: 0,Country,Model,MEA (Train),MEA (Test),RMSE (Train),RMSE (Test),R2 (Train),R2 (Test)
0,Japan,Decision_Tree,4007.989055,15794.814059,21265.817167,288505.631443,0.263449,0.001673
1,Japan,KNN,3995.638062,15641.63203,21009.572677,288636.106543,0.281092,0.00077
2,Japan,Stacked,3941.329501,15676.909764,20825.119829,288548.148194,0.29366,0.001379
0,UK,Decision_Tree,3841.018611,15678.79961,21146.226713,288334.3647,0.27171,0.002858
1,UK,KNN,4038.723165,15452.276983,21868.304349,288736.983257,0.221123,7.1e-05
2,UK,Stacked,3870.740356,15498.138722,21259.023231,288517.026568,0.263919,0.001594
0,Brazil,Decision_Tree,336.256588,392.797667,4140.032383,7138.581971,0.559178,-0.090939
1,Brazil,KNN,335.529132,348.503735,4938.100051,5614.596314,0.372844,0.32514
2,Brazil,Stacked,329.539,365.795538,4290.043911,5731.409601,0.526654,0.296767
0,USA,Decision_Tree,452.511807,499.732525,5925.808988,8645.050474,0.44745,0.381864
