## Bash, Git, Hyperparameter Tuning and Logging Experiments

### By Kimberly Ton-Mai, Muhammad Chaudhry and Sahan Bulathwela

This week we are going to learn about a few ideas that will help you carry out the development part of your thesis projects more successfully. We are going to learn about a few tools, namely:

- Bash commands
- Git commands
- Doing hyperparameter tuning
- Logging relevant information when training models


## Learning Goals

There are several learning goals behind this week's exercise. You will familiarise yourself with how to:


- Use [GitHub](https://github.com/) to manage your code
- Use `bash` commands and `git` commands to add new changes to your code while keeping track of them
- Train a machine learning model on your personal computer
- Add *hyperparameter tuning* to your code
- Log the hyperparameters and metrics to track your experimental settings


## Exercise Plan

Today's exercise consists of several steps that need to be executed sequentially to obtain the final code that will conduct hyperparameter tuning for your machine learning model.

- Step 1: Forking a public github repository to your own git profile
- Step 2: Pulling the git repository to your local environment
- Step 3: Do the instructed changes to your code to introduce hyperparameter tuning
- Step 4: Add logging to record the hyperparameters and the metrics
- Step 5: Commit the changes and push them to a remote git repository. 
- Step 6: Submit a pull request

## Step 1: Forking a public github repository to your own git profile

- Sign in to your [GitHub account](https://github.com/) or register for one [here](https://github.com/signup).
- Fork the tutorial code found [here](https://github.com/comp0190/git_tutorial) to create your own version of the repository.

## Step 2: Pulling the git repository to your local environment

- Firstly, install Git on your local machine. You can find instructions for doing this [here](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).
- Then go to your desired folder and create a directory named `hyperparameter_tuning_tutorial` using *bash* commands.
- Use the relevant git command to `clone` the repository into the directory you just created.
- Finally, create a ***new branch*** named `hyperparams_and_logging` from the `master` branch of your repository before making changes to the code.


## Step 3: Do the instructed changes to your code to introduce hyperparameter tuning

- Open this notebook using Jupyter Notebook.
- Instructions for installing Anaconda, which includes the Jupyter Notebook editor, can be found [here](https://docs.anaconda.com/anaconda/install/index.html).

### Importing Relevant Python Modules

There are several Python modules that we aim to use today. Let us import them here. We use numpy and pandas for data manipulation. We use scikit-learn to split the data into train and test sets and to implement grid search. We use xgboost to implement the desired machine learning model. We use the standard `logging` library to log information.

***Hint:*** You may use the `pip` python package manager to install these libraries in your local python environment. 

```
pip install numpy
pip install pandas
pip install scikit-learn
pip install xgboost
```

In [None]:
# Import relevant libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
import logging

### Loading the Dataset

We use a publicly available dataset for this exercise. It is provided with the repository as a Comma-Separated Values (CSV) file called `concrete_data.csv`.

- Let us load the data to our notebook now. 

In [None]:
# Load the dataset into a DataFrame named `data` (it is used in the cells below)

# -- Insert your code here -- 


In [None]:
# Preview the first few lines of the data
data.head()

### Processing the dataset

The objective of this task is to use features of concrete to predict its ***strength***. This makes it a regression problem that falls under the supervised learning domain. 

- Now let us separate the data into the features and the labels of the dataset.

In [None]:
# Separate the training data and targets
X = data.iloc[:, :8].values
Y = data.iloc[:, 8].values.reshape(-1,1) # the strength column is the label. 

# Preview the size of the feature matrix and the label matrix/ vector
print(np.shape(X))
print(np.shape(Y))

### Splitting the data into train and test split

We use the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function in the scikit-learn Python library to split the data into the train and test sets.

- Let us split the data in a way that the *training dataset gets* ***80%*** of the observations.
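As a rough illustration of how an 80/20 split behaves, the sketch below applies `train_test_split` to a small synthetic array rather than to the concrete dataset. The array sizes, the `_demo` variable names and the `random_state` value are assumptions made for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 observations with 8 features (not the concrete dataset)
X_demo = np.arange(800, dtype=float).reshape(100, 8)
Y_demo = np.arange(100, dtype=float).reshape(-1, 1)

# test_size=0.2 keeps 80% of the rows for training; random_state fixes the shuffle
X_train_demo, X_test_demo, Y_train_demo, Y_test_demo = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=2023
)

print(X_train_demo.shape, X_test_demo.shape)
```

With 100 rows, this leaves 80 rows for training and 20 rows for testing. Fixing `random_state` makes the split repeatable, which helps when comparing experiments.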

In [None]:
# Implement the train and test split (name the outputs X_train, X_test, Y_train and Y_test)

# -- Insert your code here -- 


In [None]:
# Print the shapes of the data
print(np.shape(X_train))
print(np.shape(Y_train))
print(np.shape(X_test))
print(np.shape(Y_test))

### Implement the Model

As mentioned earlier, we are going to use the [XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.sklearn) model to tackle this prediction task. 

- Let us instantiate the model as `xgb_model`.

In [None]:
xgb_model = XGBRegressor(random_state = 2023)

### Defining the hyperparameter values we want to explore for the grid search

The XGBoost model, like many other models, has hyperparameters that guide the parameter learning process. In the context of XGBoost, a few hyperparameters are particularly important to tune:
- n_estimators
- max_depth
- gamma
- learning_rate

We create a dictionary where we specify the different hyperparameter values that we want to explore for each hyperparameter of interest. 

In [None]:
# Make a dictionary of the combination of hyperparameter values that you want to apply Grid Search on
# Keys are hyperparameter names and values are the candidate values to try for each hyperparameter
search_space = {
    "n_estimators" : [100, 200, 500],
    "max_depth" : [3, 6, 9],
    "gamma" : [0.01, 0.1],
    "learning_rate" : [0.001, 0.01, 0.1, 1]
}
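To get a feel for the cost of an exhaustive grid search, the sketch below counts the combinations in a search space with the same values as the one above. It uses only the standard library; the names `search_space_demo`, `combinations` and `n_folds` are assumptions for this illustration.

```python
from itertools import product

# Same candidate values as the search space defined above
search_space_demo = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 6, 9],
    "gamma": [0.01, 0.1],
    "learning_rate": [0.001, 0.01, 0.1, 1],
}

# Grid search tries every combination of one value per hyperparameter
combinations = list(product(*search_space_demo.values()))
n_folds = 5  # number of cross-validation folds used in this tutorial

print(len(combinations))            # 3 * 3 * 2 * 4 = 72 candidate settings
print(len(combinations) * n_folds)  # 72 * 5 = 360 model fits during cross-validation
```

The number of fits grows multiplicatively with every value added to the grid, which is why the fitting step can take a few minutes.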



### Exercise: Introducing the grid search to the code

Now that we have determined the set of hyperparameter values we want to consider, we specify the grid search mechanism that will find the hyperparameter combination giving the best performance on this dataset when using the xgboost model.

- Use the [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class to explore the performance of our model across different values of the hyperparameters we defined in the search space above.
- Name the instantiated grid search object as `GS`.
- Use `r2` and `neg_root_mean_squared_error` to report two different evaluation metrics that can be used to assess the goodness of fit of a regression model. The different scorers available in scikit-learn are listed [here](https://scikit-learn.org/stable/modules/model_evaluation.html).
- Set the number of cross-validation folds to 5. More information about k-fold cross-validation can be found [here](https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py).

***Hint:*** Use the documentation to understand the role of the different parameters and use the appropriate set to define the grid search correctly. 
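The sketch below shows, under stated assumptions, how multi-metric scoring behaves in `GridSearchCV`. It deliberately uses a `Ridge` model, a one-parameter grid and synthetic data as stand-ins, so it is not the solution to the exercise. The names `GS_demo`, `X_demo` and `y_demo` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data as a stand-in for the concrete dataset
X_demo, y_demo = make_regression(n_samples=100, n_features=8, noise=10.0, random_state=0)

GS_demo = GridSearchCV(
    estimator=Ridge(),                       # stand-in model, not XGBRegressor
    param_grid={"alpha": [0.1, 1.0, 10.0]},  # hypothetical search space
    scoring=["r2", "neg_root_mean_squared_error"],
    refit="r2",  # with several metrics, refit names the one used to pick the best model
    cv=5,
)
GS_demo.fit(X_demo, y_demo)

print(GS_demo.best_params_)
# Each metric gets its own columns in cv_results_, e.g. mean_test_r2 and rank_test_r2
print(sorted(key for key in GS_demo.cv_results_ if key.startswith("rank_test_")))
```

Note that `neg_root_mean_squared_error` reports the RMSE with its sign flipped, so that a larger score is always better. The values are therefore zero or negative.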


In [None]:
# Implement the code for cross-validation Grid Search 

# -- Insert your code here -- 


### Fit the model to the training data. 

***Note: This may take a few minutes***

In [None]:
# Fit the model on the training data
GS.fit(X_train, Y_train)


### Explore the results for best parameter values

Once the grid search is complete, we can access different attributes populated in the [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) object to explore the results obtained from the grid search.

- For example, the model object of the best estimator

In [None]:
print(GS.best_estimator_) # to get the complete details of the best model

- The best performing hyperparameter combination

In [None]:
print(GS.best_params_) # to get only the best hyperparameter values that we searched for


- The score obtained for the best hyperparameter combination

In [None]:
print(GS.best_score_) # score according to the metric we passed in refit


### Exporting the results to a CSV for further analysis

Once we have the grid search results, we can export them to a CSV file and analyse them further using a different tool (e.g. Excel).

In [None]:
# Exporting the grid search results to a csv file for analysis 
df = pd.DataFrame(GS.cv_results_)
df = df.sort_values("rank_test_r2")
df.to_csv("cv_results.csv", index = False)

***Hint***: From the analysis of the model's performance across different hyperparameter values, you may have noticed that the best model with 500 trees reaches an r-squared value of 0.9228 while the model with 200 trees reaches 0.9221. You have to decide whether the computational cost of 300 more trees in the model is worth this tiny improvement in performance.
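The trade-off in the hint can be made concrete with a little arithmetic. The sketch below only uses the two r-squared values and tree counts quoted above; the variable names are assumptions for illustration.

```python
# Figures quoted in the hint above
r2_500_trees = 0.9228
r2_200_trees = 0.9221

absolute_gain = r2_500_trees - r2_200_trees
tree_ratio = 500 / 200

print(f"R-squared gain: {absolute_gain:.4f}")      # R-squared gain: 0.0007
print(f"Trees needed: {tree_ratio:.1f}x as many")  # Trees needed: 2.5x as many
```

A gain of 0.0007 in r-squared for 2.5 times as many trees suggests the smaller model may be the more practical choice, depending on your compute budget.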

## Step 4: Add logging to record the hyperparameters and the metrics

- Instead of printing the best set of hyperparameter values, use logging to add log messages that can capture these values. 

***Hint:*** We can use the `INFO` logging level to report such values. 

For example:

In [None]:
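# Configure logging so that INFO messages are shown (the default threshold is WARNING)
logging.basicConfig(level=logging.INFO)

# Instantiate the logger
log = logging.getLogger("my-logger")
log.setLevel(logging.INFO)

# Submit log entry
log.info("Best Gamma Value is: {}".format(GS.best_params_["gamma"]))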
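Log messages are most useful for tracking experiments when they are saved with a timestamp. The sketch below is one possible way to write `INFO` messages to a file using the standard `logging` library. The logger name, the file name and the values in `best_params_demo` are placeholders for illustration, not results from the grid search.

```python
import logging
import os
import tempfile

# Placeholder values standing in for GS.best_params_ (not real results)
best_params_demo = {"gamma": 0.1, "learning_rate": 0.1}

# Hypothetical log file in a temporary directory
log_path = os.path.join(tempfile.mkdtemp(), "experiment.log")

demo_log = logging.getLogger("demo-logger")
demo_log.setLevel(logging.INFO)  # let INFO messages through

# Write each record to the file with a timestamp and the log level
handler = logging.FileHandler(log_path)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
demo_log.addHandler(handler)

for name, value in best_params_demo.items():
    demo_log.info("Best %s value is: %s", name, value)

# Detach and close the handler so the file is flushed before reading it back
demo_log.removeHandler(handler)
handler.close()

with open(log_path) as log_file:
    log_text = log_file.read()
print(log_text)
```

Each line in the file records when the value was logged, which makes it easier to match hyperparameters to a particular training run later on.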

### Exercise

Report the remaining hyperparameter values (`learning_rate`, `max_depth` and `n_estimators`) using the `info` log level.

In [None]:
# Add logging statements

# -- Insert your code here --


## Step 5: Commit the changes and push them to a remote git repository. 

- Use the relevant git command to *commit* the code into the local version of your repository
- Use the relevant git command to *push* the committed code into the `hyperparams_and_logging` branch of the remote repository (in github)

## Step 6: Submit a pull request

- Use the github web user interface to submit a ***pull request*** to your repository. Details found [here](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request)  

***Hint:*** Details about pushing the new code to the branch are also found [here](https://comp0190.github.io/lectures/topics/3_tuning/tutorial.html#committing-your-changes-and-pushing-to-your-remote-repository)

## (Optional)  Week 1 Cont... 
### Bayesian Search Optimization

- Explore how Scikit Bayesian Optimisation works with [Skopt](https://scikit-optimize.github.io/stable/auto_examples/bayesian-optimization.html). In the documentation you can see how it works with different acquisition functions for choosing the next point at which to query the original function.

***Hint:*** You need to install a new Python library called `scikit-optimize`, which is imported as `skopt`.

```
pip install scikit-optimize
```

- We will be defining the search space and using the `skopt.BayesSearchCV` class to optimise the hyperparameters. You can see how this class samples and explores hyperparameter values from specific distributions [here](https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html)

In [None]:
# Importing relevant modules for Bayesian Optimisation 

# -- Insert your code here --


In [None]:
# Defining the search space for Bayesian Optimisation 
search_space_bo = {
    "n_estimators" : Integer(100,500),
    "max_depth" : Integer(4, 400), 
    "gamma" : (1e-2, 0.1, "log-uniform"),
    "learning_rate" : (0.0001, 0.1, "log-uniform")
}
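The `"log-uniform"` prior in the search space above means that values are spread evenly on a logarithmic scale rather than a linear one. The sketch below illustrates that idea with the standard library only. It is not how skopt implements sampling internally, and `sample_log_uniform` is a hypothetical helper written for this example.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw one value whose base-10 logarithm is uniform between log10(low) and log10(high)."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


rng = random.Random(2023)  # seeded for repeatability
samples = [sample_log_uniform(1e-4, 0.1, rng) for _ in range(1000)]

# The range 1e-4 to 0.1 spans three decades, so roughly a third of the draws
# are expected in each of 1e-4..1e-3, 1e-3..1e-2 and 1e-2..1e-1
below_1e3 = sum(s < 1e-3 for s in samples)
print(min(samples), max(samples), below_1e3)
```

A plain uniform draw over the same range would place only about 1% of the samples below 0.001. This is why a log-uniform prior is a common choice for hyperparameters such as `learning_rate` that vary over orders of magnitude.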

In [None]:
# Make a Bayes search object named `BO` (it is used in the cells below)

# -- Insert your code here --


#### Fitting the model based on the Bayes Search CV object we defined earlier

***Note: This may take a few minutes to run***

In [None]:
# Fit the model on the training data
BO.fit(X_train, Y_train)

#### Exploring the results of best hyperparameter combinations

In [None]:
print(BO.best_score_) # score according to the metric we passed in refit


In [None]:
print(BO.best_params_) # to get only the best hyperparameter values that we searched for


In [None]:
print(BO.best_estimator_) # to get the complete details of the best model

In [None]:
# Exporting the Bayesian Optimization results to a csv file for analysis 
df1 = pd.DataFrame(BO.cv_results_)
df1 = df1.sort_values("mean_test_score", ascending=False)  # best-scoring configurations first
df1.to_csv("bo_cv_results.csv", index = False)