![AIAP Banner](../images/AIAP-Banner.png "AIAP Banner")

<h1><center>Assignment 2 - Part 2:
<br>
Regression, Validation & Hyperparameter Tuning</center></h1>

<h3>Name of Apprentice:</h3>

---

# 1. Introduction
In the second part of this assignment, we will explore several regression models, along with methods for model validation and hyperparameter tuning.

#### 1.1. Topics
1. EDA & Feature Engineering
2. Regression models (linear models, decision trees, k-nearest neighbours, boosting, stacked models) & metrics 
3. Model training
4. Feature selection
5. Hyperparameter tuning
6. L1 & L2 Regularisation


#### 1.2. Deliverables
1. Jupyter notebook
2. Python scripts (some templates are provided)

# 2. Data preparation

For this assignment, we will use regression models to predict the price of resale HDB flats. The data is hosted on the same database instance as the previous 2 assignments.

    server = 'aiap-training.database.windows.net'
    database = 'aiap'
    username = 'apprentice'
    password = 'Pa55w.rd'
    driver= '{ODBC Driver 17 for SQL Server}'
    

The data is stored in the tables `transactions`, `towns` and `flat_models`.

#### 2.1. Extract the data from the SQL server. Perform the necessary steps needed to combine the tables in an SQL query. Save the final table as a `.csv` file on your laptop.
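The join pattern can be sketched as follows. This is only an illustration against an in-memory SQLite stand-in, not the course database: apart from `town_id` and `flatm_id`, every column name here (`id`, `town`, `flat_model`, `resale_price`) is a hypothetical placeholder, and the real tables may use different key columns.

```python
import csv
import os
import sqlite3
import tempfile

# In-memory stand-in for the course database. Column names other than
# town_id and flatm_id are hypothetical placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE towns (town_id INTEGER PRIMARY KEY, town TEXT);
    CREATE TABLE flat_models (flatm_id INTEGER PRIMARY KEY, flat_model TEXT);
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY, town_id INTEGER, flatm_id INTEGER, resale_price REAL
    );
    INSERT INTO towns VALUES (1, 'TOWN A'), (2, 'TOWN B');
    INSERT INTO flat_models VALUES (10, 'MODEL X'), (20, 'MODEL Y');
    INSERT INTO transactions VALUES (100, 1, 10, 300000.0), (101, 2, 20, 450000.0);
""")

# Attach the lookup tables to each transaction via the two id columns.
query = """
    SELECT t.*, tw.town, fm.flat_model
    FROM transactions AS t
    LEFT JOIN towns AS tw ON t.town_id = tw.town_id
    LEFT JOIN flat_models AS fm ON t.flatm_id = fm.flatm_id
    ORDER BY t.id
"""
cursor = conn.execute(query)
header = [col[0] for col in cursor.description]
rows = cursor.fetchall()
conn.close()

# Save the combined table as a CSV file (a temporary directory is used here).
csv_path = os.path.join(tempfile.mkdtemp(), "resale_sketch.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)
```

A `LEFT JOIN` keeps transactions whose lookup id has no match, which makes missing reference data visible during cleaning; an `INNER JOIN` would silently drop those rows.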

#### 2.2. Perform data cleaning and EDA. Drop the `flatm_id` and `town_id` columns. The final dataframe should have 12 columns (including the id column).

#### 2.3. Engineer at least 3 features.

# 3. Model Validation

Model validation is the process of verifying the model's performance. The first step is to choose a metric for evaluating the model.

#### 3.1. Describe the differences between MSE, MAE and RMSE and the considerations when choosing between each metric. 
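As a starting point for the comparison, the three metrics can be computed side by side. This is a minimal sketch on made-up numbers; the helper name `regression_errors` is hypothetical.

```python
import numpy as np


def regression_errors(y_true, y_pred):
    """Return MSE, MAE and RMSE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)      # squares the errors, so outliers dominate
    mae = np.mean(np.abs(residuals))   # same units as the target, robust to outliers
    rmse = np.sqrt(mse)                # back in the units of the target
    return {"mse": mse, "mae": mae, "rmse": rmse}


# Made-up values; the last prediction is off by 100.
y_true = [300.0, 400.0, 500.0, 600.0]
y_pred = [310.0, 390.0, 500.0, 700.0]
errors = regression_errors(y_true, y_pred)
print(errors)
```

In this example the single large error pushes RMSE (about 50.5) well above MAE (30), which illustrates how the squared metrics weight large errors more heavily.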

#### 3.2. Explain the purpose of splitting the full dataset into these 3 subsets: train, development/validation and test. State any assumption about the relationship between the 3 datasets.

#### 3.3. In general, how large (in terms of proportion of the full dataset) should the train / dev / test set be?

#### 3.4. Describe some considerations when choosing between using train-dev-test split and k-fold cross validation.

#### 3.5. Perform a train-test split on your dataset. Save both the train and test sets as separate `csv` files.
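One possible way to do this is with scikit-learn's `train_test_split`. The sketch below uses synthetic arrays and an 80/20 split as assumptions; the column names and output location are placeholders.

```python
import os
import tempfile

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset: 100 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Fixing random_state makes the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Save features and target together, one file per subset.
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train.csv")
test_path = os.path.join(out_dir, "test.csv")
for path, features, target in [
    (train_path, X_train, y_train),
    (test_path, X_test, y_test),
]:
    np.savetxt(
        path,
        np.column_stack([features, target]),
        delimiter=",",
        header="x1,x2,x3,target",
        comments="",
    )
```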

#### 3.6. Consolidate the data preprocessing steps in a `datapipeline.py` file in the [src folder](./src). 


The module should contain the class `Datapipeline` with at least two methods, `transform_train_data` and `transform_test_data`.

`transform_train_data` should take in the path to the training data and return two numpy arrays, `X_train` and `y_train`.

`transform_test_data` should take in the path to the test data and return two numpy arrays, `X_test` and `y_test`.
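A minimal sketch of the interface described above, assuming pandas is available. The target column name `resale_price`, the example columns, and the specific preprocessing steps (one-hot encoding, median imputation) are assumptions for illustration, not the required solution. The key idea is that anything learned from the data is learned on the train set only and then reused on the test set.

```python
import os
import tempfile

import pandas as pd


class Datapipeline:
    def __init__(self, target="resale_price"):  # hypothetical target name
        self.target = target
        self.feature_columns = None
        self.fill_values = None

    def transform_train_data(self, train_data_path):
        df = pd.read_csv(train_data_path)
        X = pd.get_dummies(df.drop(columns=[self.target]), dtype=float)
        # Remember the train-time columns and imputation values.
        self.feature_columns = X.columns
        self.fill_values = X.median()
        X = X.fillna(self.fill_values)
        return X.to_numpy(), df[self.target].to_numpy()

    def transform_test_data(self, test_data_path):
        df = pd.read_csv(test_data_path)
        X = pd.get_dummies(df.drop(columns=[self.target]), dtype=float)
        # Align to the train columns: unseen categories are dropped,
        # categories missing from the test set become all-zero columns.
        X = X.reindex(columns=self.feature_columns, fill_value=0.0)
        X = X.fillna(self.fill_values)
        return X.to_numpy(), df[self.target].to_numpy()


# Tiny made-up files to exercise the class.
tmp_dir = tempfile.mkdtemp()
train_csv = os.path.join(tmp_dir, "train.csv")
test_csv = os.path.join(tmp_dir, "test.csv")
pd.DataFrame(
    {"floor_area": [70, 90, 110], "town": ["A", "B", "A"],
     "resale_price": [300000.0, 400000.0, 500000.0]}
).to_csv(train_csv, index=False)
pd.DataFrame(
    {"floor_area": [80], "town": ["C"], "resale_price": [350000.0]}
).to_csv(test_csv, index=False)

pipeline = Datapipeline()
X_train, y_train = pipeline.transform_train_data(train_csv)
X_test, y_test = pipeline.transform_test_data(test_csv)
```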

# 4. Model Training

You may find sklearn's `pipeline` useful for building pipelines so that the models you build will be easier to consume later during meta-model ensembling.
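For instance, a scaler and a regressor can be chained into a single estimator. The sketch below uses synthetic data, and the step names `scaler` and `regressor` are arbitrary labels chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: a linear signal plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# The pipeline behaves like one estimator: fit() fits the scaler on the
# training rows only, then fits the regressor on the scaled features.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
model.fit(X[:150], y[:150])

# score() returns R^2 on the held-out rows.
r2 = model.score(X[150:], y[150:])
```

Because preprocessing lives inside the estimator, the same object can be passed to cross-validation or search utilities without leaking test-set statistics into the scaler.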

We will begin by building the most basic baseline model - a model that predicts the mean value for all data points regardless of the features.

#### 4.1. Calculate the mean of the target variable for the train dataset. Evaluate the "model" performance on the test set.
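One way to express this baseline is scikit-learn's `DummyRegressor`, which ignores the features and always predicts the training mean. The numbers below are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up targets; the features are irrelevant to this baseline.
y_train = np.array([300.0, 400.0, 500.0])
y_test = np.array([350.0, 450.0])
X_train = np.zeros((3, 1))
X_test = np.zeros((2, 1))

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
predictions = baseline.predict(X_test)   # every prediction is the train mean

baseline_mse = mean_squared_error(y_test, predictions)
baseline_rmse = np.sqrt(baseline_mse)
```

Any later model should beat this number on the same test set; if it does not, the features are adding no predictive value.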

#### 4.2. Train a linear model and evaluate the model's performance on the test set. Make use

#### 4.3. Using the `model.py` file as a template, train a k-nearest neighbours regressor and evaluate the model's performance on the test set.

#### 4.4. Using the `model.py` file as a template, train a decision tree regressor and evaluate the model's performance on the test set.

#### 4.5. Identify the best model among the linear model, k-nearest neighbours and decision tree models. We will use this as our improved baseline.

# 5. Feature Selection

Feature selection is the process of identifying the features which contribute the most to the model's performance and dropping those which contribute less. One common method of selecting features is Recursive Feature Elimination (RFE). In linear models, RFE is closely related to backward stepwise regression (backward elimination).

#### 5.1. Describe the RFE algorithm.

#### 5.2. Using the RFE algorithm, determine the optimal number of features to keep for linear regression and decision tree regressor models.

Additional instructions
- Use the train set. 
- Determine the optimal number of features to keep for each regressor. 
- Report the MSE (on the test set) for each of the linear regression and decision tree models with reduced features.
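The selection step can be sketched with scikit-learn's `RFECV`, which runs RFE inside cross-validation and keeps the feature count with the best score. The synthetic data and the choice of 5 folds are assumptions; a `DecisionTreeRegressor` could be substituted for the linear model, since RFE only needs an estimator exposing `coef_` or `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the train set: 8 features, only 3 informative.
X_train, y_train = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=10.0, random_state=0
)

# Drop one feature per iteration and score each subset size by CV MSE.
selector = RFECV(
    LinearRegression(),
    step=1,
    cv=5,
    scoring="neg_mean_squared_error",
)
selector.fit(X_train, y_train)

n_kept = selector.n_features_     # number of features selected
kept_mask = selector.support_     # boolean mask over the original columns
ranking = selector.ranking_       # 1 marks a selected feature

# Reduce any dataset to the selected columns with the fitted selector.
X_train_reduced = selector.transform(X_train)
```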

# 6. Hyperparameter Tuning

Most ML algorithms have hyperparameters that can be tuned to improve model performance. Hyperparameter tuning is an optimisation problem, and one of the most common search strategies for it is grid search.

#### 6.1. Describe the grid search algorithm. What are some drawbacks of grid search and what other search algorithms are there?
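To get a feel for the cost of an exhaustive search, the candidate combinations can be enumerated with scikit-learn's `ParameterGrid`. The grid values below are arbitrary examples for a k-nearest neighbours regressor.

```python
from sklearn.model_selection import ParameterGrid

# Example grid: every combination of these values is one candidate model.
param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}
grid = ParameterGrid(param_grid)

n_candidates = len(grid)          # 4 * 2 * 2 combinations
n_fits_5_fold = n_candidates * 5  # each candidate is fitted once per fold

for params in grid:
    print(params)
```

The number of candidates is the product of the list lengths, so each additional hyperparameter multiplies the number of model fits.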

#### 6.2. Explain why gradient descent is not used for hyperparameter tuning.

We will now start to look at hyperparameters for these models. Of the 3 models, k-nearest neighbours probably has the fewest hyperparameters to tune, so let's start there. As we have limited time, you might not want to try all of the available hyperparameters. This is a good time to do some research into which ones are worth optimising.

#### 6.3. Perform hyperparameter tuning for k-nearest neighbours regressor and decision tree regressor (with all features). Report the metrics (on the test set) for the best model for each algorithm below.
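A minimal sketch of the tuning loop using `GridSearchCV` on a k-nearest neighbours regressor. The data is synthetic and the grid values are arbitrary; the same pattern applies to `DecisionTreeRegressor` with its own grid (for example `max_depth` and `min_samples_leaf`).

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the prepared dataset.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Cross-validate every combination on the train set only.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [3, 5, 9, 15], "weights": ["uniform", "distance"]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

best_params = search.best_params_
# By default the best candidate is refitted on the full train set,
# so the search object can predict directly.
test_mse = mean_squared_error(y_test, search.predict(X_test))
```

The test set is touched only once, after the search has finished, so the reported metric is not biased by the tuning process.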

The linear model also requires tuning, but the API we are currently using, `linear_model.LinearRegression`, has no regularisation hyperparameters to tune. Instead, we will work with elastic nets to build a better linear regression model. We will look at this again in the regularisation section.

# 7. Regularisation

Regularisation penalises large weights during the model fitting process and forces the model to trade off between large weights and model accuracy. This is done by adding a term to the loss function that grows as the magnitude of the weights grows. L1 and L2 regularisation are the two most common forms.
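As a worked form of the penalty term, the two regularised losses for a linear model can be written as below. The notation is an assumption introduced here: $w_j$ are the $p$ model weights, $\hat{y}_i$ is the prediction for sample $i$ of $n$, and $\lambda \ge 0$ is the regularisation strength (libraries may scale the terms differently).

$$
J_{L1}(w) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} \lvert w_j \rvert
$$

$$
J_{L2}(w) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} w_j^2
$$

Setting $\lambda = 0$ recovers ordinary least squares in both cases, while a larger $\lambda$ shrinks the weights more strongly.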

#### 7.1. Explain how regularisation (forcing the model to have smaller weights/parameters/coefficients) can help with feature selection.

#### 7.2. Determine the optimal value of the regularisation parameter for a linear model with L1 regularisation (LASSO regression). Report the test MSE for the best model.

#### 7.3. State the coefficients for the linear model with L1 regularisation.

#### 7.4. Repeat 7.2. and 7.3. for a linear model with L2 regularisation (ridge regression). Comment on any differences observed between the coefficients of the two models.
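One possible approach is to let `LassoCV` and `RidgeCV` choose the regularisation strength by cross-validation over a shared grid. The sketch below uses synthetic data and an arbitrary log-spaced grid of `alpha` values; both are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10 features, only 3 of which carry signal.
X, y = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate regularisation strengths shared by both models.
alphas = np.logspace(-3, 2, 20)

lasso = LassoCV(alphas=alphas, cv=5).fit(X_train, y_train)
ridge = RidgeCV(alphas=alphas).fit(X_train, y_train)

lasso_mse = mean_squared_error(y_test, lasso.predict(X_test))
ridge_mse = mean_squared_error(y_test, ridge.predict(X_test))

# Count coefficients that were driven exactly to zero by each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("lasso alpha:", lasso.alpha_, "zero coefficients:", n_zero_lasso)
print("ridge alpha:", ridge.alpha_, "zero coefficients:", n_zero_ridge)
```

When comparing the two coefficient vectors, look for entries that the L1 model sets exactly to zero while the L2 model only shrinks towards zero.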

<h1><center>End of Assignment 2 - Part 2:<br>Regression, Validation & Hyperparameter Tuning</center></h1>