### The concept of hyperparameter
Hyperparameter tuning is one of the most important parts of a machine learning pipeline. A wrong choice of the hyperparameters’ values may lead to wrong results and a model with poor performance.<br>

Hyperparameters are configuration values that are set **before** training, rather than learned from the data.<br> These hyperparameters might address model design questions such as:

- What **degree of polynomial features** should I use for my linear model?
- What should be the **maximum depth** allowed for my decision tree?
- What should be the **minimum number of samples** required at a leaf node in my decision tree?
- **How many trees** should I include in my random forest?
- **How many neurons** should I have in my neural network layer?
- **How many layers** should I have in my neural network?
- What should I set my **learning rate** to for gradient descent?

Let's make it simple. For example, **the number of neurons** of a feed-forward neural network is a hyperparameter, because we set it before training. Other examples are **the number of trees** in a random forest and the penalty strength of a Lasso regression. As you can see, hyperparameters are values (usually numbers, though some are categorical choices such as a split criterion) that are set before the training phase and that affect the behavior of the model.

### IMPORTANT!
Hyperparameters are **not** model parameters and they cannot be directly trained from the data. Model parameters are **learned** during training when we optimize a loss function using something like gradient descent.
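
To make the distinction concrete, here is a minimal sketch in plain Python. The function `fit_slope` and the toy data are assumptions made up for this illustration; they are not part of the notebook's dataset. The slope `w` is a model parameter because gradient descent learns it from the data, while `learning_rate` and `n_steps` are hyperparameters because we choose them before training starts.

```python
def fit_slope(xs, ys, learning_rate, n_steps):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0  # model parameter: learned from the data
    n = len(xs)
    for _ in range(n_steps):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w

# Toy data generated with a true slope of 2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Same data, same model, different hyperparameters
w_good = fit_slope(xs, ys, learning_rate=0.05, n_steps=200)
w_slow = fit_slope(xs, ys, learning_rate=0.0001, n_steps=200)

print('learning_rate=0.05   ->', w_good)
print('learning_rate=0.0001 ->', w_slow)
```

With the larger learning rate the learned slope reaches the true value, while the very small learning rate leaves it far from 2 after the same number of steps. The data did not change; only the hyperparameter did.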


### The reason for tuning the hyperparameters
Why should we tune the hyperparameters of a model?<br>

That is because we don’t really know a model's optimal hyperparameter values in advance. A model with different hyperparameters is, in effect, a different model, so it may perform differently. In the case of neural networks, too few neurons can cause underfitting and too many can cause overfitting. In both cases the model is not good, so we need to find the number of neurons that gives the best performance.<br>

If the model has several hyperparameters, we need to find the best combination of their values by searching in a multi-dimensional space. That’s why hyperparameter tuning, which is the process of finding the right values of the hyperparameters, is a complex and time-consuming task.

### Hyperparameter tuning in practice
For a decision tree, tuning hyperparameters largely means deciding on the **stopping criteria** that control how far the tree grows. There are several stopping criteria, but we will start with four:
1. The maximum depth: max_depth
2. The minimum size of the node: min_samples_split
3. The minimum lift (minimum impurity decrease): min_impurity_decrease
4. The cost-complexity<br>
---
The **max depth** is the maximum depth of the decision tree. The tree cannot grow deeper than the value we set using **`max_depth`**. The smaller it is, the smaller the tree will be.<br>

The **minimum size of the node** is the minimum number of samples a node must contain before it can be split. We can set this using **`min_samples_split`**. The smaller the value, the larger the tree will be, and its default value is 2.<br>

The **minimum lift**, that is, the minimum impurity decrease, is a threshold on how much a split must reduce the impurity. We can set it using **`min_impurity_decrease`**. A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease is:<br>

$$\frac{N_t}{N} \times (impurity - \frac{N_{tR}}{N_t} \times right\;impurity - \frac{N_{tL}}{N_t} \times left\;impurity)$$

Where<br>
$N$ is the total number of samples<br>
$N_t$ is the number of samples in the current node<br>
$N_{tL}$ is the number of samples in the left child<br>
$N_{tR}$ is the number of samples in the right child<br>
$N$, $N_t$, $N_{tL}$, $N_{tR}$ all refer to the weighted sum, if `sample_weight` is passed.<br>

When the impurity decrease is smaller than the value set, the node will not be split any further. The smaller the value, the larger the tree will be.<br>
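
The formula above can be checked by hand with a small sketch. The helper name `weighted_impurity_decrease` and all the numbers below are hypothetical, chosen only so the arithmetic is easy to follow; this is an illustration of the formula, not the library's internal code.

```python
def weighted_impurity_decrease(n_total, n_node, n_left, n_right,
                               impurity, left_impurity, right_impurity):
    """Weighted impurity decrease of one candidate split (formula above)."""
    return (n_node / n_total) * (
        impurity
        - (n_right / n_node) * right_impurity
        - (n_left / n_node) * left_impurity
    )

# Hypothetical node: 40 of 100 samples reach it, and the split sends
# 25 samples to the left child and 15 to the right child.
decrease = weighted_impurity_decrease(
    n_total=100, n_node=40, n_left=25, n_right=15,
    impurity=0.5, left_impurity=0.32, right_impurity=0.0,
)

# The node is split only if the decrease reaches the threshold
min_impurity_decrease = 0.1  # hypothetical threshold
should_split = decrease >= min_impurity_decrease

print('weighted impurity decrease:', round(decrease, 4))
print('split this node?', should_split)
```

Here the decrease is 0.4 × (0.5 − 0.375 × 0.0 − 0.625 × 0.32) = 0.12, which is above the threshold of 0.1, so the node would be split. Raising the threshold above 0.12 would stop the split and keep the tree smaller.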

There are two types of pruning: **pre-pruning** and **post-pruning**. Pre-pruning is also called **early stopping**: we literally stop growing the tree early, for example by limiting the max depth or the number of branches. Post-pruning prunes the tree after the model has been trained. We can do post-pruning using the cost-complexity pruning technique.<br>

The **cost complexity** is a concept that is used in **cost complexity pruning**. Pruning is a technique to prevent overfitting; cost complexity pruning does this by adding a penalty for the size of the tree to its training error (impurity).<br>

In practice, cost complexity pruning repeatedly finds the sub-tree that contributes the least error reduction per leaf (the one with the smallest effective **$\alpha$**, often called the weakest link) and prunes it. The cost-complexity measure is:

$$R_\alpha (T) = R(T) + \alpha |T|$$

where<br>
$R(T)$ is the total training error (impurity) of the leaf nodes<br>
$|T|$ is the number of leaf nodes<br>
$\alpha$ is the complexity parameter

When we focus on reducing the $R(T)$ value only, the size of the tree gets bigger. It means the tree structure has more branches. $\alpha$ decides how many leaf nodes remain, so we need to adjust it to prevent overfitting. The bigger the $\alpha$ value, the more nodes are pruned.<br>

Note that we also need to calculate $R_\alpha (T_t)$ for each sub-tree $T_t$ rooted at a node $t$. The equation has the same form as the one above.

$$R_\alpha (T_t) = R(T_t) + \alpha |T_t|$$
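
As a rough illustration of how $\alpha$ trades error against tree size, the sketch below compares keeping a sub-tree with collapsing it into a single leaf. All numbers are hypothetical, and the expression for the effective $\alpha$ is the usual weakest-link formula, $(R(t) - R(T_t)) / (|T_t| - 1)$, which the text above does not spell out, so treat it as an assumption of this example.

```python
def cost_complexity(error, n_leaves, alpha):
    """R_alpha(T) = R(T) + alpha * |T|."""
    return error + alpha * n_leaves

# Hypothetical sub-tree: keeping it gives error 0.10 with 5 leaves,
# collapsing it into one leaf gives error 0.30 with 1 leaf.
subtree_error, subtree_leaves = 0.10, 5
collapsed_error, collapsed_leaves = 0.30, 1

# Alpha at which keeping and collapsing cost exactly the same
effective_alpha = (collapsed_error - subtree_error) / (subtree_leaves - collapsed_leaves)
print('effective alpha:', round(effective_alpha, 4))

decisions = {}
for alpha in (0.01, 0.10):
    keep = cost_complexity(subtree_error, subtree_leaves, alpha)
    prune = cost_complexity(collapsed_error, collapsed_leaves, alpha)
    decisions[alpha] = 'prune' if prune < keep else 'keep'
    print(f'alpha={alpha}: keep={keep:.2f}, prune={prune:.2f} -> {decisions[alpha]}')
```

With a small $\alpha$ the extra leaves are cheap, so the sub-tree is kept; once $\alpha$ exceeds the effective value of 0.05, collapsing the sub-tree gives the lower cost and it is pruned.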

---
Using stopping criteria such as those above, we can control how the model is trained, and the process of choosing their values is called hyperparameter tuning.

### GridSearch

Amongst the hyperparameter tuning techniques, GridSearch is the most thorough within the values we give it, because it is an exhaustive search: it evaluates every possible combination in the grid and returns the best one. However, GridSearch also has a cost, because training a model for every combination consumes a lot of time.<br>

For now, we will implement an exhaustive search using scikit-learn's `GridSearchCV` class.
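
To see why an exhaustive search gets expensive, the sketch below enumerates a grid by hand with `itertools.product`. It uses the same three value lists that appear in the `params` dictionary later in this notebook; the fold count of 3 matches the `cv = 3` used there. It only counts the work and does not train anything.

```python
from itertools import product

# Same grid as the one passed to GridSearchCV later in the notebook
params = {'min_samples_split': [30, 50, 70],
          'max_depth': [5, 6, 7],
          'n_estimators': [50, 150, 250]}

# Every combination of one value per hyperparameter
names = list(params)
combinations = [dict(zip(names, values)) for values in product(*params.values())]

n_folds = 3  # cv = 3
n_fits = len(combinations) * n_folds  # one fit per combination per fold

print('combinations:', len(combinations))
print('cross-validation fits:', n_fits)
print('first combination:', combinations[0])
```

Three hyperparameters with three values each already give 27 combinations and 81 cross-validation fits for a single search, plus one final refit on the full training set. Adding a fourth hyperparameter with three values would triple that.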

In [None]:
# Downloading data
!wget 'https://bit.ly/3gLj0Q6'

# Unzip the downloaded data
import zipfile
with zipfile.ZipFile('3gLj0Q6', 'r') as existing_zip:
    existing_zip.extractall('data')

In [None]:
# Import pandas and RandomForestRegressor
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Import numpy
import numpy as np

In [None]:
# Load data
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [None]:
# Check if the data loading is successful
# Note: info() prints its summary itself and returns None, so call it on its own line
print('============ Train Data ============\n')
print('Train Data Information')
train.info()
print('\nTrain Data Shape: ', train.shape, '\n')

print('============ Test Data ============')
print('Test Data Information')
test.info()
print('\nTest Data Shape: ', test.shape, '\n')

In [None]:
# Check the features of the data
train.describe()

### Handling missing values

We have four options for now:<br>

1: Delete the rows that contain missing values<br>
2: Replace the missing value with a specific scalar value<br>
3: Replace the missing value with the mean of the feature<br>
4: Replace the missing value using interpolation<br>

Since a Jupyter notebook does not handle multiple interactions in a single cell well, we will write our code in separate cells.<br>

- I noticed this after I wrote the code below. I was stupid.
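
To compare what the four options actually do, here is a minimal sketch on a tiny made-up DataFrame. The column names `temp` and `wind` and their values are assumptions for illustration only, not columns of the real dataset. Each option is applied to a fresh copy so the results can be compared side by side.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame with one missing value per column
df = pd.DataFrame({'temp': [10.0, np.nan, 14.0, 16.0],
                   'wind': [1.0, 2.0, np.nan, 4.0]})

dropped = df.dropna()                # 1: delete rows with missing values
filled_const = df.fillna(0)          # 2: replace with a specific scalar value
filled_mean = df.fillna(df.mean())   # 3: replace with the mean of each feature
interpolated = df.interpolate()      # 4: linear interpolation between neighbours

print('rows left after dropna:', len(dropped))
print('temp filled with the mean:', round(filled_mean.loc[1, 'temp'], 2))
print('temp interpolated:', interpolated.loc[1, 'temp'])
```

Dropping rows loses half of this small frame, while the mean and interpolation options keep every row but fill the gap with different values (about 13.33 versus 12.0 for `temp`). Which one is appropriate depends on the feature, for example whether neighbouring rows are related in time.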

In [None]:
# print('Missing value handling options\n')
# print('You have four options now to handle the missing value.')
# print('Input the number of process you want to, then hit the return key.\n')
# print('  1: Delete the missing value')
# print('  2: Replace the missing value with specific scalar value')
# print('  3: Replace the missing value with the mean of the feature')
# print('  4: Replace the missing value using interpolation')

# user_input = int(input())  # input() returns a string, so convert it before comparing with 1 to 4

# if user_input == 1:
#     train.dropna(inplace=True)
#     print('You have deleted the missing values!')
# elif user_input == 2:
#     print('Input the value you want to replace the missing value')
#     inputted_value = input()
#     train.fillna(inputted_value, inplace=True)
#     test.fillna(inputted_value, inplace=True)
#     print(f'You have replace the missing values of train data and test data with {inputted_value}.')
# elif user_input == 3:
#     train.fillna(train.mean(), inplace=True)
#     test.fillna(test.mean(), inplace=True)
#     print(f'You have replace the missing values of train data {train.mean()} and test data with {test.mean()}.')
# elif user_input == 4:
#     train.interpolate(inplace=True)
#     test.interpolate(inplace=True)
#     print('You have replace the missing values of train data and test data using linear interpolation.')

In [None]:
# Check if there are missing values
print(train.isnull().sum(), '\n')
print(test.isnull().sum())

In [None]:
# Delete the missing values
train.dropna(inplace=True)
print('You have deleted the missing values!')

In [None]:
# Replace the missing values with a specific scalar value
# input() returns a string, so convert it to a number to keep the columns numeric
inputted_value = float(input())
train.fillna(inputted_value, inplace=True)
test.fillna(inputted_value, inplace=True)
print(f'You have replaced the missing values of train data and test data with {inputted_value}.')

In [None]:
# Replace the missing values with the mean of each feature
train.fillna(train.mean(), inplace=True)
test.fillna(test.mean(), inplace=True)
print(f'You have replaced the missing values of train data with\n{train.mean()}\n\nand of test data with\n{test.mean()}.')

In [None]:
# Replace the missing values using linear interpolation
train.interpolate(inplace=True)
test.interpolate(inplace=True)
print('You have replaced the missing values of train data and test data using linear interpolation.')

In [None]:
# Check if the null values are replaced well.
print(train.isnull().sum(), '\n')
print(test.isnull().sum())

In [None]:
# Import libraries for visualization
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Make the minus sign render properly on the axes
plt.rc('axes', unicode_minus=False)

In [None]:
# Hide warnings that are not necessary for the analysis
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Visualize features
for column in train.columns:
    plt.figure(figsize=(4, 4))
    plt.title(column)
    sns.histplot(train[column])
    plt.show()

In [None]:
# When a feature's distribution is skewed, a log transform (log1p) can make it more even
train['hour_bef_pm2.5'] = np.log1p(train['hour_bef_pm2.5'])
train['hour_bef_pm10'] = np.log1p(train['hour_bef_pm10'])

test['hour_bef_pm2.5'] = np.log1p(test['hour_bef_pm2.5'])
test['hour_bef_pm10'] = np.log1p(test['hour_bef_pm10'])

fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(20, 20))


sns.histplot(train['hour_bef_pm2.5'], ax=ax[0, 0])
sns.histplot(train['hour_bef_pm10'], ax=ax[0, 1])

sns.histplot(test['hour_bef_pm2.5'], ax=ax[1, 0])
sns.histplot(test['hour_bef_pm10'], ax=ax[1, 1])
plt.show()

In [None]:
# Compute pairwise correlation of columns, excluding NA/null values.
train.corr()

In [None]:
# Visualize the correlation
plt.figure(figsize = (12, 12))

# annot: optional. If True, write the data value in each cell.
# If an array-like with the same shape as data, then we can use this option to annotate the heatmap.
# The annotation will replace the heatmap's data.
# Note that DataFrames will match on position, not index.
sns.heatmap(train.corr(), annot = True)

In [None]:
# We also can visualize the data using bar plot
sns.barplot(x = 'hour', y = 'count', data = train)

### Modeling

In [None]:
# Separate the features and the target
X_train = train.drop(['count'], axis=1)
Y_train = train['count']

# Declare and train the model
model = RandomForestRegressor(criterion = 'squared_error')
model.fit(X_train, Y_train)

In [None]:
# Print the feature importances
model.feature_importances_

In [None]:
# Visualizing the feature importances
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

importance_values = model.feature_importances_
importances = pd.Series(importance_values, index = X_train.columns)
importance_top10 = importances.sort_values(ascending=False)[:10]

plt.figure(figsize=(8, 6))
plt.title('Top 10 feature importances')
sns.barplot(x = importance_top10, y = importance_top10.index)
plt.show()

In [None]:
# Create train datasets by removing the less important features
X_train1 = train.drop(['count', 'hour_bef_precipitation'], axis=1)
X_train2 = train.drop(['count', 'hour_bef_precipitation', 'hour_bef_pm2.5'], axis=1)
X_train3 = train.drop(['count', 'hour_bef_precipitation', 'hour_bef_pm2.5', 'id'], axis=1)
X_train4 = train.drop(['count', 'hour_bef_precipitation', 'hour_bef_pm2.5', 'id', 'hour_bef_windspeed'], axis=1)

Y_train = train['count']

# Create test datasets
test1 = test.drop(['hour_bef_precipitation'], axis=1)
test2 = test.drop(['hour_bef_precipitation', 'hour_bef_pm2.5'], axis=1)
test3 = test.drop(['hour_bef_precipitation', 'hour_bef_pm2.5', 'id'], axis=1)
test4 = test.drop(['hour_bef_precipitation', 'hour_bef_pm2.5', 'id', 'hour_bef_windspeed'], axis=1)

In [None]:
# Check the shape of training and test data
print('X_train1.shape: ', X_train1.shape, '\n')
print('X_train2.shape: ', X_train2.shape, '\n')
print('X_train3.shape: ', X_train3.shape, '\n')
print('X_train4.shape: ', X_train4.shape, '\n')
print('Y_train.shape: ', Y_train.shape, '\n')
print('test1.shape', test1.shape, '\n')
print('test2.shape', test2.shape, '\n')
print('test3.shape', test3.shape, '\n')
print('test4.shape', test4.shape, '\n')

In [None]:
# Declare separate models
model1 = RandomForestRegressor(criterion = 'squared_error')
model2 = RandomForestRegressor(criterion = 'squared_error')
model3 = RandomForestRegressor(criterion = 'squared_error')
model4 = RandomForestRegressor(criterion = 'squared_error')

# Train the separate models
model1.fit(X_train1, Y_train)
model2.fit(X_train2, Y_train)
model3.fit(X_train3, Y_train)
model4.fit(X_train4, Y_train)

### RandomForest Hyperparameters

**n_estimators**: Number of decision trees in the forest
- Default = 100 (it was 10 before scikit-learn 0.22)
- Increasing it may improve performance, but it also increases the training time.<br>

**min_samples_split**: The minimum number of samples required to split a node
- Used to control overfitting
- Default = 2: The smaller the value, the greater the possibility of overfitting, because more nodes get split<br>

**min_samples_leaf**: The minimum number of samples required at a leaf node
- Along with min_samples_split, it is used to control overfitting
- When the data is imbalanced, a specific class may have very few samples, so this value needs to be kept small<br>

**max_features**: Maximum number of features considered when looking for the best split
- The default depends on the estimator and the scikit-learn version: older versions used 'auto', while recent versions use 1.0 (all features) for RandomForestRegressor and 'sqrt' for RandomForestClassifier
    - Note: The default value of max_features is None in a decision tree
- When specified as an int: The number of features
- When specified as a float: The fraction of features
- 'sqrt': Considers $\sqrt{The\;number\;of\;whole\;features}$ features
- 'log2': Considers $\log_2{(The\;number\;of\;whole\;features)}$ features<br>

**max_depth**: Maximum depth of the tree
- Default = None
    - Split until every leaf is pure (for classification, until the class value is completely determined)
    - Or until the number of samples in a node is less than min_samples_split
- As the depth increases, the model may overfit, so proper control is required.<br>

**max_leaf_nodes**: The maximum number of leaf nodes
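
Because some of these defaults have changed between scikit-learn releases, it can be safer to read them from the installed version than to rely on a list. The sketch below does that with `get_params()`, which every scikit-learn estimator provides; the output depends on your installed version, so none is shown here.

```python
from sklearn.ensemble import RandomForestRegressor

# An untrained model with every hyperparameter left at its default
model = RandomForestRegressor()
defaults = model.get_params()

# Print only the hyperparameters discussed above
for name in ('n_estimators', 'min_samples_split', 'min_samples_leaf',
             'max_features', 'max_depth', 'max_leaf_nodes'):
    print(name, '=', defaults[name])
```

The keys of this dictionary are also exactly the names that can be used in a `param_grid`.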

### GridSearchCV initializer
- estimator: The model to tune: a classifier, regressor, pipeline, and so on.

- param_grid: A dictionary that maps each parameter name to the list of values to try.

- scoring: Method to evaluate the prediction performance. For classification it is often accuracy; for regression, a metric such as 'neg_mean_squared_error'. If it is omitted, the estimator's own score method is used.

- cv: Specifies the number of divisions in cross-validation (the number of folds).

- refit: The default value is True. With the default, once the best hyperparameters are found, the estimator is retrained with them on the whole training set.

- n_jobs: The default value is None, which means 1 job. Set -1 to use all cores.
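
The sketch below shows these arguments working together on a small synthetic regression problem. The data, the tiny grid, and the seed values are assumptions chosen so that the example runs in a few seconds; they are unrelated to the dataset used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic data: the target depends mostly on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=60)

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={'max_depth': [2, 4], 'n_estimators': [10, 20]},
    scoring='neg_mean_squared_error',  # higher (closer to 0) is better
    cv=3,
    n_jobs=1,
)
search.fit(X, y)

print('best parameters:', search.best_params_)
print('best cross-validated MSE:', -search.best_score_)
print('combinations tried:', len(search.cv_results_['params']))
```

Because `refit` is left at its default of True, `search.best_estimator_` is already retrained on all of `X` and `y`, and calling `search.predict(...)` uses that model. Note that the score is the negative MSE, so the sign is flipped before printing.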

In [None]:
from sklearn.model_selection import GridSearchCV
import time

model = RandomForestRegressor(criterion = 'mse',
                              random_state=2022)

params = {'min_samples_split': [30, 50, 70],
          'max_depth': [5, 6, 7],
          'n_estimators': [50, 150, 250]}

# Declare GridSearchCV for each model 
greedy_CV1 = GridSearchCV(estimator = model1,
                          param_grid = params,
                          cv = 3,
                          scoring = 'neg_mean_squared_error')

greedy_CV2 = GridSearchCV(estimator = model2,
                          param_grid = params,
                          cv = 3,
                          scoring = 'neg_mean_squared_error')

greedy_CV3 = GridSearchCV(estimator = model3,
                          param_grid = params,
                          cv = 3,
                          scoring = 'neg_mean_squared_error')

greedy_CV4 = GridSearchCV(estimator = model4,
                          param_grid = params,
                          cv = 3,
                          scoring = 'neg_mean_squared_error')

start_time = time.time()

# Train for each dataset
greedy_CV1.fit(X_train1, Y_train)
greedy_CV2.fit(X_train2, Y_train)
greedy_CV3.fit(X_train3, Y_train)
greedy_CV4.fit(X_train4, Y_train)

end_time = time.time()

print("Processing time: ", end_time-start_time, 'seconds.')

In [None]:
# Predict with each trained model
prediction1 = greedy_CV1.predict(test1)
prediction2 = greedy_CV2.predict(test2)
prediction3 = greedy_CV3.predict(test3)
prediction4 = greedy_CV4.predict(test4)

print(prediction1)
print(prediction2)
print(prediction3)
print(prediction4)

In [None]:
# Save the prediction results
GridSearchCV1 = pd.read_csv('data/submission.csv')
GridSearchCV2 = pd.read_csv('data/submission.csv')
GridSearchCV3 = pd.read_csv('data/submission.csv')
GridSearchCV4 = pd.read_csv('data/submission.csv')

GridSearchCV1['count'] = np.round(prediction1, 2)
GridSearchCV2['count'] = np.round(prediction2, 2)
GridSearchCV3['count'] = np.round(prediction3, 2)
GridSearchCV4['count'] = np.round(prediction4, 2)

print(GridSearchCV1.head(), '\n\n',
      GridSearchCV2.head(), '\n\n',
      GridSearchCV3.head(), '\n\n',
      GridSearchCV4.head())

In [23]:
# Declare separate models
model1 = RandomForestRegressor(criterion = 'squared_error')
model2 = RandomForestRegressor(criterion = 'squared_error')
model3 = RandomForestRegressor(criterion = 'squared_error')
model4 = RandomForestRegressor(criterion = 'squared_error')

# Train the saparated models
model1.fit(X_train1, Y_train)
model2.fit(X_train2, Y_train)
model3.fit(X_train3, Y_train)
model4.fit(X_train4, Y_train)

RandomForestRegressor()

### RandomForest Hyperparameters

**n_estimators:** Number of decision making tree
- Default = 10
- When increase it, the performance may get better, but may cause too much train time.<br>

**min_samples_split**: The minimum number of sample used to split node
- Used to control overfitting
- Default = 2: The smaller the value, the greater possibility of overfitting because of the increasing node split<br>

**min_samples_leaf**: The minimum number of samples to be leaf node
- Along to min_samples_split, it is used to control the overfitting
- When the data is imbalanced, some data of a specific class may extremely small, thus it needs to be kept the small value<br>

**max_features**: Maximum number of features for optimal split
- Default = 'auto'
    - Note: The default value of max_feature is none in decision tree
- When specified in int type: The number of features
- When specified in float type: The ratio of features
- 'sqrt' or 'auto': Samples as many as $\sqrt{The\;number\;of\;whole\;features}$
- log : Samples as many as $\log_2{(The\;number\;of\;whole\;features)}$<br>

**max_depth**: Maximum depth of the tree
- Default = none
    - Split until the class value is completely determined
    - Or until the number of data is less than min_samples_split
- As the depth increases, it may overfit, so proper control is required.<br>

**max_leaf_nodes**: The maximum number of leaf nodes

### GridSearchCV initializer
- estimator: classifier, regressor, pipeline, and so on.

- param_grid: In the dictionary type, input the parameters that are going to be used for parameter tuning.

- scoring: Method to evaluate the prediction performance. Usually set to accuracy.

- cv: Specifies the number of divisions in cross-validation(The number of fold).

- refit: The default value is True. When it is set default, it finds the optimal hyperparameter and retrains it.

- n_jobs: The default value is 1, Set -1 to use all cores.

In [25]:
from sklearn.model_selection import GridSearchCV
import time

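# Note: criterion 'mse' was renamed to 'squared_error' in scikit-learn 1.0.
# This model is not passed to the searches below; they use model1 to model4.
model = RandomForestRegressor(criterion = 'squared_error',
                              random_state=2022)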

params = {'min_samples_split': [30, 50, 70],
          'max_depth': [5, 6, 7],
          'n_estimators': [50, 150, 250]}

# Declare GridSearchCV for each model 
greedy_CV1 = GridSearchCV(estimator = model1,
                          param_grid = params,
                          cv = 3,
                          scoring = 'neg_mean_squared_error')

greedy_CV2 = GridSearchCV(estimator = model2,
                          param_grid = params,
                          cv = 3,
                          scoring = 'neg_mean_squared_error')

greedy_CV3 = GridSearchCV(estimator = model3,
                          param_grid = params,
                          cv = 3,
                          scoring = 'neg_mean_squared_error')

greedy_CV4 = GridSearchCV(estimator = model4,
                          param_grid = params,
                          cv = 3,
                          scoring = 'neg_mean_squared_error')

start_time = time.time()

# Train for each dataset
greedy_CV1.fit(X_train1, Y_train)
greedy_CV2.fit(X_train2, Y_train)
greedy_CV3.fit(X_train3, Y_train)
greedy_CV4.fit(X_train4, Y_train)

end_time = time.time()

print("Processing time: ", end_time-start_time, 'seconds.')

Processing time:  73.97013258934021 seconds.


In [26]:
# Predict with each trained model
prediction1 = greedy_CV1.predict(test1)
prediction2 = greedy_CV2.predict(test2)
prediction3 = greedy_CV3.predict(test3)
prediction4 = greedy_CV4.predict(test4)

print(prediction1)
print(prediction2)
print(prediction3)
print(prediction4)

[ 86.71871077 237.14406021 106.20606326  31.79610907  53.48793414
 128.27713441 147.99642531 301.07885586  30.43086512 120.09864994
 295.44385557 249.09845249 128.4820542   39.30426673 240.82862855
 171.18264697  25.36847846 218.91678238 306.751906   172.41276054
 220.86813906  72.23829393  18.05434887 143.52577735 150.88297553
 108.88445571  24.75084585 115.93666015 110.10973064 150.11630809
  85.5457806   25.80662933  61.68099739 134.1035387  265.79127535
  29.70709515 135.6365994  146.71357134 214.59255148  65.82110303
  59.50237478 117.87305628 161.85648869  81.11902861 297.88675324
 192.44188024  69.07290419  61.08627149  18.57810893  85.87551384
 245.7245125   86.92760831 157.73476915 107.71915097 193.06145208
 158.62478771  40.37197872 174.50725831  18.11587828  17.68930299
  88.54702468  88.10633585 265.3801545  287.04923833 161.24382821
 293.77447707  17.95296809 215.84186713 138.07391273  25.54013975
 102.91132763  38.57747868 159.51049757  18.05434887 300.33495727
 208.81601

In [27]:
# Save the prediction results
GridSearchCV1 = pd.read_csv('data/submission.csv')
GridSearchCV2 = pd.read_csv('data/submission.csv')
GridSearchCV3 = pd.read_csv('data/submission.csv')
GridSearchCV4 = pd.read_csv('data/submission.csv')

GridSearchCV1['count'] = np.round(prediction1, 2)
GridSearchCV2['count'] = np.round(prediction2, 2)
GridSearchCV3['count'] = np.round(prediction3, 2)
GridSearchCV4['count'] = np.round(prediction4, 2)

print(GridSearchCV1.head(), '\n\n',
      GridSearchCV2.head(), '\n\n',
      GridSearchCV3.head(), '\n\n',
      GridSearchCV4.head())

   id   count
0   0   86.72
1   1  237.14
2   2  106.21
3   4   31.80
4   5   53.49 

    id   count
0   0   83.63
1   1  245.92
2   2  107.89
3   4   31.34
4   5   57.46 

    id   count
0   0   84.82
1   1  238.43
2   2   99.67
3   4   31.31
4   5   50.00 

    id   count
0   0   86.19
1   1  231.67
2   2   95.37
3   4   30.93
4   5   49.02


In [28]:
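# Save the results (index=False keeps the DataFrame index out of the CSV files)
GridSearchCV1.to_csv('GridSearchCV1_result.csv', index=False)
GridSearchCV2.to_csv('GridSearchCV2_result.csv', index=False)
GridSearchCV3.to_csv('GridSearchCV3_result.csv', index=False)
GridSearchCV4.to_csv('GridSearchCV4_result.csv', index=False)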