In [68]:
# Initialize Otter
import otter
grader = otter.Notebook("hw5.ipynb")

# CPSC 330 - Applied Machine Learning 

## Homework 5: Putting it all together 
### Associated lectures: All material till lecture 13 

**Due date: [Monday, Mar 10, 11:59 pm](https://github.com/UBC-CS/cpsc330-2024W2?tab=readme-ov-file#deliverable-due-dates-tentative)**

## Table of contents
0. [Submission instructions](#si)
1. [Understanding the problem](#1)
2. [Data splitting](#2)
3. [EDA](#3)
4. [Feature engineering](#4)
5. [Preprocessing and transformations](#5) 
6. [Baseline model](#6)
7. [Linear models](#7)
8. [Different models](#8)
9. [Feature selection](#9)
10. [Hyperparameter optimization](#10)
11. [Interpretation and feature importances](#11) 
12. [Results on the test set](#12)
13. [Summary of the results](#13)
14. [Your takeaway from the course](#15)

<div class="alert alert-info">

## Submission instructions
<hr>
rubric={points:4}

**You may work with a partner on this homework and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 2. 
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).
- If you would like to use late tokens for the homework, all group members must have the necessary late tokens available. Please note that the late tokens will be counted for all members of the group.   


Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330-2024W2/blob/master/docs/homework_instructions.md). 

1. Before submitting the assignment, run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from "1" will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Follow the [CPSC 330 homework instructions](https://ubc-cs.github.io/cpsc330-2024W2/docs/homework_instructions.html), which include information on how to do your assignment and how to submit your assignment.
4. Upload your solution on Gradescope. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 
5. Make sure that the plots and output are rendered properly in your submitted file. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb so that the TAs can view your submission on Gradescope.


_Note: The assignments will get gradually more open-ended as we progress through the course. In many cases, there won't be a single correct solution. Sometimes you will have to make your own choices and your own decisions (for example, on what parameter values to use when they are not explicitly provided in the instructions). Use your own judgment in such cases and justify your choices, if necessary._

</div>

<!-- BEGIN QUESTION -->

## Imports

<div class="alert alert-warning">
    
Imports
    
</div>

_Points:_ 0

In [69]:
import os
import string
import sys
from collections import deque

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

sys.path.append(os.path.join(os.path.abspath(".."), "code"))
import seaborn as sns
# from plotting_functions import *
from sklearn import datasets
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFE
# from utils import *
from sklearn.metrics import mean_absolute_percentage_error

import warnings

# Suppress Warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", message=".*frozen modules.*")
warnings.filterwarnings("ignore", message=".*'force_all_finite' was renamed.*")
warnings.simplefilter("ignore")
warnings.simplefilter(action="ignore", category=DeprecationWarning)
warnings.simplefilter(action="ignore", category=UserWarning)


<!-- END QUESTION -->

## Introduction <a name="in"></a>

In this homework you will be working on an open-ended mini-project, where you will put all the different things you have learned so far together to solve an interesting problem.

A few notes and tips when you work on this mini-project: 

#### Tips
1. This mini-project is open-ended, and while working on it, there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary. 
2. **Do not include everything you ever tried in your submission** -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code. 
3. If you realize that you are repeating a lot of code, try to organize it into functions. Clear presentation of your code, experiments, and results is the key to being successful in this lab. You may use code from lecture notes or previous lab solutions with appropriate attribution. 

#### Assessment
We plan to grade fairly and leniently. We don't have some secret target score that you need to achieve to get a good grade. **You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results.** For example, if you just have a bunch of code and no text or figures, that's not good. If you do a bunch of sane things and get a lower accuracy than your friend, don't sweat it.


#### A final note
Finally, the style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "a few hours" (15-20 hours???) is a good guideline for this project. Of course if you're having fun you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or to get the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found this kind of lab useful and fun, and I hope you enjoy it as well. 

<br><br>

<!-- BEGIN QUESTION -->

## 1. Pick your problem and explain the prediction problem <a name="1"></a>
<hr>
rubric={points:3}

In this mini-project, you can choose which dataset to work on. The tasks you will need to carry out are similar regardless of your choice.

### Option 1
You can choose to work on a classification problem of predicting whether a credit card client will default or not. 
For this problem, you will use [Default of Credit Card Clients Dataset](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). In this data set, there are 30,000 examples and 24 features, and the goal is to estimate whether a person will default (fail to pay) their credit card bills; this column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. You may take some ideas and compare your results with [the associated research paper](https://www.sciencedirect.com/science/article/pii/S0957417407006719), which is available through [the UBC library](https://www.library.ubc.ca/). 


### Option 2
You can choose to work on a regression problem using a [dataset](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data) of New York City Airbnb listings from 2019. As usual, you'll need to start by downloading the dataset, then you will try to predict `reviews_per_month`, as a proxy for the popularity of the listing. Airbnb could use this sort of model to predict how popular future listings might be before they are posted, perhaps to help guide hosts in creating more appealing listings. In reality they might instead use something like vacancy rate or average rating as their target, but we do not have that available here.

> Note there is an updated version of this dataset with more features available [here](http://insideairbnb.com/). The features we are using are in `listings.csv.gz` for the New York City datasets. You will also see some other files like `reviews.csv.gz`. For your own interest you may want to explore the expanded dataset and try your analysis there. However, please submit your results on the dataset obtained from Kaggle.


**Your tasks:**

1. Spend some time understanding the options and pick the one you find more interesting (it may help to spend some time looking at the documentation available on Kaggle for each dataset).
2. After making your choice, focus on understanding the problem and what each feature means, again using the documentation on the dataset page on Kaggle. Write a few sentences on your initial thoughts on the problem and the dataset. 
3. Download the dataset and read it as a pandas dataframe. 

<div class="alert alert-warning">
    
Solution_1
    
</div>

_Points:_ 3

_Type your answer here, replacing this text._

In [None]:
airbnb_df = pd.read_csv("data/AB_NYC_2019.csv")
airbnb_df.head()

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 2. Data splitting <a name="2"></a>
<hr>
rubric={points:2}

**Your tasks:**

1. Split the data into train (70%) and test (30%) portions with `random_state=123`.

> If your computer cannot handle training on 70% training data, make the test split bigger.  

<div class="alert alert-warning">
    
Solution_2
    
</div>

_Points:_ 2

In [71]:
train_df, test_df = train_test_split(airbnb_df, test_size=0.30, random_state=123)

X_train, y_train = (
    train_df.drop(columns=["reviews_per_month"]),
    train_df["reviews_per_month"],
)

X_test, y_test = (
    test_df.drop(columns=["reviews_per_month"]),
    test_df["reviews_per_month"],
)

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 3. EDA <a name="3"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Perform exploratory data analysis on the train set.
2. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
3. Summarize your initial observations about the data. 
4. Pick appropriate metric/metrics for assessment. 

<div class="alert alert-warning">
    
Solution_3
    
</div>

_Points:_ 10

_Type your answer here, replacing this text._

In [None]:
train_df.head()

In [None]:
train_df.info()

The following columns have some missing values:
  - name
  - host_name
  - last_review
  - reviews_per_month

In [74]:
# For reviews_per_month, it might be reasonable to fill NaN with 0

y_train = y_train.fillna(0)
y_test = y_test.fillna(0)

In [None]:
train_df.describe(include="all")

In [None]:
train_df.hist(bins=50, figsize=(20, 15));

Since reviews_per_month is a continuous variable, this is a regression problem. 
Possible metrics would be:
- RMSE
- R^2
- MAPE
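
As a rough illustration of how these three metrics behave, the sketch below computes them on a tiny made-up example that, like `reviews_per_month` after filling missing values with 0, contains a zero in the target. The arrays `y_true_toy` and `y_pred_toy` are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

# Hypothetical targets and predictions; note the zero in the true values
y_true_toy = np.array([0.0, 1.0, 2.0, 4.0])
y_pred_toy = np.array([0.5, 1.0, 1.5, 3.0])

# RMSE: typical error in the units of the target
rmse = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))

# R^2: share of the target's variance explained by the predictions
r2 = r2_score(y_true_toy, y_pred_toy)

# MAPE: mean relative error; divides by |y_true|, so zeros inflate it
mape = mean_absolute_percentage_error(y_true_toy, y_pred_toy)

print(f"RMSE: {rmse:.3f}, R^2: {r2:.3f}, MAPE: {mape:.3g}")
```

The single zero target makes MAPE extremely large, because the error for that example is divided by a value close to zero. On a target with many zeros, MAPE should therefore be read with care.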

The histograms above show that `reviews_per_month`, `price`, `number_of_reviews`, and `calculated_host_listings_count` are skewed.

A common trick in such cases is applying a log transform on the target column to make it more normal and less skewed.
That is, transform $y \rightarrow \log(1 + y)$ (the `+ 1` keeps the transform defined for listings with zero reviews per month).
Linear regression will usually work better on something that looks more normal.
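
One way to apply such a target transform without transforming `y` by hand is scikit-learn's `TransformedTargetRegressor`, which fits on the transformed target and inverts the transform at prediction time. This is a minimal sketch on synthetic data, not the approach used in the rest of this notebook; the `X_toy`, `y_toy`, and `ttr` names are made up for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# Synthetic, non-negative, right-skewed target (illustrative only)
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(200, 3))
linear_part = X_toy @ np.array([0.5, 0.2, 0.1]) + rng.normal(scale=0.1, size=200)
y_toy = np.expm1(np.abs(linear_part))

# log1p and expm1 are exact inverses of each other
roundtrip = np.expm1(np.log1p(y_toy))

# Ridge is trained on log1p(y); predictions come back on the original scale
ttr = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0), func=np.log1p, inverse_func=np.expm1
)
ttr.fit(X_toy, y_toy)
preds = ttr.predict(X_toy)
```

With this wrapper, scores reported by `cross_validate` are computed on the original scale of the target, which differs from scoring on the log scale.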

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 4. Feature engineering <a name="4"></a>
<hr>
rubric={points:1}

**Your tasks:**

1. Carry out feature engineering. In other words, extract new features relevant for the problem and work with your new feature set in the following exercises. You may have to go back and forth between feature engineering and preprocessing. 

<div class="alert alert-warning">
    
Solution_4
    
</div>

_Points:_ 1

<!-- END QUESTION -->

<br><br>

In [77]:
# Group neighbourhood_group into categories
def categorize_neighbourhood_group(group):
    if group == "Manhattan":
        return 'Manhattan'
    elif group == "Brooklyn":
        return 'Brooklyn'
    else:
        return "Others"

X_train['neighbourhood_group'] = X_train['neighbourhood_group'].apply(categorize_neighbourhood_group)
X_test['neighbourhood_group'] = X_test['neighbourhood_group'].apply(categorize_neighbourhood_group)

In [78]:
# Group room_type into categories
def categorize_room_type(room_type):
    if room_type == "Entire home/apt":
        return 'Entire home/apt'
    elif room_type == "Private room":
        return 'Private room'
    else:
        return "Others"

X_train['room_type'] = X_train['room_type'].apply(categorize_room_type)
X_test['room_type'] = X_test['room_type'].apply(categorize_room_type)

In [79]:
# Group minimum_nights into categories
def categorize_min_nights(nights):
    if nights <= 1:
        return 'Short Stay'
    elif nights <= 3:
        return 'Medium_Short Stay'
    elif nights <= 5:
        return 'Medium_Long Stay'
    else:
        return "Long Stay"

X_train['minimum_nights'] = X_train['minimum_nights'].apply(categorize_min_nights)
X_test['minimum_nights'] = X_test['minimum_nights'].apply(categorize_min_nights)

In [80]:
# Group calculated_host_listings_count into categories
# number of listings per host

def categorize_host_listings(listing_count):
    if listing_count <= 10:
        return 'Little or no listings'
    else:
        return "Many listings"

X_train['calculated_host_listings_count'] = X_train['calculated_host_listings_count'].apply(categorize_host_listings)
X_test['calculated_host_listings_count'] = X_test['calculated_host_listings_count'].apply(categorize_host_listings)
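
The threshold-based grouping of `minimum_nights` above can also be written without a row-by-row `apply`. The following is a minimal sketch using `pd.cut`, assuming the same cut points (1, 3, 5) and labels as `categorize_min_nights`; `toy_nights`, `bins`, and `labels` are hypothetical names used only here.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for the minimum_nights column
toy_nights = pd.Series([1, 2, 3, 4, 5, 30])

# Right-closed bins: (-inf, 1], (1, 3], (3, 5], (5, inf)
bins = [-np.inf, 1, 3, 5, np.inf]
labels = ["Short Stay", "Medium_Short Stay", "Medium_Long Stay", "Long Stay"]

# pd.cut returns a categorical; cast to str to match the apply-based version
binned = pd.cut(toy_nights, bins=bins, labels=labels).astype(str)
print(binned.tolist())
```

Because the bins are closed on the right, each value lands in the same category as with the `<=` comparisons in the function above.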

<!-- BEGIN QUESTION -->

## 5. Preprocessing and transformations <a name="5"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Identify different feature types and the transformations you would apply on each feature type. 
2. Define a column transformer, if necessary. 

<div class="alert alert-warning">
    
Solution_5
    
</div>

_Points:_ 10

In [81]:
normal_numeric_feats = ['latitude','longitude']

skewed_numeric_feats = ['availability_365',
                 'number_of_reviews', 'price']

categorical_feats = ['minimum_nights','neighbourhood_group','neighbourhood','room_type','calculated_host_listings_count'] 
 # apply one-hot encoding # room type ordinal???

drop_feats = [
    'id',
    'name',
    'host_id',
    'host_name',
    'last_review',
]

In [82]:
# Transformer to apply log1p
# FunctionTransformer is part of the public API of sklearn.preprocessing
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log1p, feature_names_out='one-to-one')

preprocessor = make_column_transformer(
    # 1. Right-skewed numeric features → impute → log1p → scale
    (make_pipeline(SimpleImputer(), log_transformer, StandardScaler()), skewed_numeric_feats),

    # 2. Normally distributed numeric features → impute → scale
    (make_pipeline(SimpleImputer(), StandardScaler()), normal_numeric_feats),

    # 3. Categorical features → one-hot encode
    (OneHotEncoder(handle_unknown="ignore"), categorical_feats),

    # 4. Drop unwanted features
    ("drop", drop_feats)
)

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 6. Baseline model <a name="6"></a>
<hr>
rubric={points:2}

**Your tasks:**
1. Try `scikit-learn`'s baseline model and report results.

<div class="alert alert-warning">
    
Solution_6
    
</div>

_Points:_ 2

In [83]:
results_dict = {}  # dictionary to store all the results

In [84]:
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross validation

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data

    Returns
    ----------
        pandas Series with mean scores from cross_validation
    """

    y_train_log = np.log1p(y_train)  # log(1 + y), avoids issues with 0

    scores = cross_validate(model, X_train, y_train_log, **kwargs)

    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()
    # Format each metric as "mean (+/- std)"
    out_col = [
        f"{mean:0.3f} (+/- {std:0.3f})"
        for mean, std in zip(mean_scores, std_scores)
    ]

    return pd.Series(data=out_col, index=mean_scores.index)

In [None]:
# Baseline model

from sklearn.dummy import DummyRegressor

dummy = DummyRegressor()
pipe = make_pipeline(preprocessor, dummy)
results_dict["dummy"] = mean_std_cross_val_scores(pipe, X_train, y_train, cv=5,return_train_score=True)
results_df = pd.DataFrame(results_dict).T
results_df

<!-- END QUESTION -->

<br><br>

I evaluated the `DummyRegressor` with R² (the default scoring metric for regressors) and obtained approximately 0.000 on both the training and the cross-validation folds. This is expected: the dummy regressor always predicts a constant (the mean of the training target), so it explains none of the variance in the data.

Since R² measures how much of the target's variance the model explains, a score close to zero means the model does no better than always guessing the average. In this context, however, the target `reviews_per_month` contains a large number of zeros and has a skewed distribution, which makes R² harder to interpret on its own.

Therefore, it is worth reporting an error metric alongside R². RMSE expresses the typical error in the units of the target, while MAPE expresses relative error. Note that MAPE becomes unstable when the true value is zero or close to zero, which is the case for many listings here, so RMSE is the safer of the two.
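
The claim that a mean-predicting baseline scores an R² of about zero can be checked directly. This is a minimal sketch on synthetic data with a skewed, non-negative target; `X_toy`, `y_toy`, and `dummy_toy` are illustrative names, not objects from this notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# Synthetic features and a skewed, non-negative target (illustrative only)
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(100, 2))
y_toy = rng.exponential(scale=1.0, size=100)

# Always predicts the mean of the target it was fitted on
dummy_toy = DummyRegressor(strategy="mean").fit(X_toy, y_toy)

# R^2 on the data the baseline was fitted on
train_r2 = dummy_toy.score(X_toy, y_toy)

# R^2 on held-out folds: the mean of the other folds is used as the prediction
cv_r2 = cross_val_score(dummy_toy, X_toy, y_toy, cv=5)

print(f"train R^2: {train_r2:.3f}")
print("cv R^2:", np.round(cv_r2, 3))
```

On held-out folds the score is zero or slightly negative, because the mean of the training folds is never a better constant guess than the mean of the validation fold itself.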

<!-- BEGIN QUESTION -->

## 7. Linear models <a name="7"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Try a linear model as a first real attempt. 
2. Carry out hyperparameter tuning to explore different values for the complexity hyperparameter. 
3. Report cross-validation scores along with standard deviation. 
4. Summarize your results.

<div class="alert alert-warning">
    
Solution_7
    
</div>

_Points:_ 10

_Type your answer here, replacing this text._

In [86]:
ridge_pipe = make_pipeline(preprocessor, Ridge(random_state=123))

In [None]:
param_grid = {"ridge__alpha": 10.0 ** np.arange(-6, 6, 1),
              "ridge__max_iter": [10,15,20,25,30]}

search = RandomizedSearchCV(
    ridge_pipe, param_grid, n_iter=12, return_train_score=True, n_jobs=-1, verbose=0,random_state=123)

y_train_log = np.log1p(y_train)  # log(1 + y), avoids issues with 0

# Fit the search object
search.fit(X_train, y_train_log)

# Print the best hyperparameters and score
print("Best hyperparameter values: ", search.best_params_)
print("Best score: %0.3f" % (search.best_score_))

# Convert the search results to a DataFrame and display
pd.DataFrame(search.cv_results_)[
    [
        "mean_train_score",
        "mean_test_score",
        "param_ridge__alpha",
        "param_ridge__max_iter",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index().T

In [88]:
ridge_pipe_tuned = make_pipeline(preprocessor, Ridge(alpha=10,max_iter=30, random_state=123))

In [None]:
results_dict["ridge"] = mean_std_cross_val_scores(
    ridge_pipe_tuned, X_train, y_train, cv=5,return_train_score=True
)
results_df = pd.DataFrame(results_dict).T
results_df
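
Rather than copying the best values from the search output into a new pipeline by hand, the tuned model can be read from the fitted search object, which keeps the result reproducible if the search is re-run. This is a minimal sketch on synthetic data, assuming a simplified pipeline (`StandardScaler` plus `Ridge`) in place of the full preprocessor; all `toy_` names are made up.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only)
X_toy, y_toy = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=123
)

toy_pipe = make_pipeline(StandardScaler(), Ridge())
toy_param_grid = {"ridge__alpha": 10.0 ** np.arange(-3, 4)}

toy_search = RandomizedSearchCV(
    toy_pipe, toy_param_grid, n_iter=5, cv=5, random_state=123
)
toy_search.fit(X_toy, y_toy)

# Best hyperparameters, and the pipeline refitted on all data with them
best_alpha = toy_search.best_params_["ridge__alpha"]
tuned_pipe = toy_search.best_estimator_

print("best alpha:", best_alpha)
```

`best_estimator_` is available because `refit=True` is the default, so the search refits the best configuration on the full training data after cross-validation.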

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 8. Different models <a name="8"></a>
<hr>
rubric={points:12}

**Your tasks:**
1. Try at least 3 other models aside from a linear model. One of these models should be a tree-based ensemble model. 
2. Summarize your results in terms of overfitting/underfitting and fit and score times. Can you beat a linear model? 

<div class="alert alert-warning">
    
Solution_8
    
</div>

_Points:_ 12

_Type your answer here, replacing this text._

In [None]:
from lightgbm import LGBMRegressor

pipe_lgbm = make_pipeline(
    preprocessor, LGBMRegressor(verbosity=-1, random_state=123, n_jobs=-1)
)
results_dict["LGBM"] = mean_std_cross_val_scores(
    pipe_lgbm, X_train, y_train, cv=5, return_train_score=True
)
results_df = pd.DataFrame(results_dict).T
results_df

In [None]:
from xgboost import XGBRegressor

pipe_xgb = make_pipeline(preprocessor, XGBRegressor(n_jobs=-1,random_state=123))
results_dict["xgboost"] = mean_std_cross_val_scores(
    pipe_xgb, X_train, y_train, cv=5, return_train_score=True
)
results_df = pd.DataFrame(results_dict).T
results_df

In [None]:
forest_pipe = make_pipeline(preprocessor, RandomForestRegressor(n_jobs=-1, max_features=26,random_state=123))
results_dict["random_forest"] = mean_std_cross_val_scores(
    forest_pipe, X_train, y_train, cv=5, return_train_score=True
)
results_df = pd.DataFrame(results_dict).T
results_df
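
The three cells above repeat the same pattern, so the comparison can also be written as a loop over a dictionary of models. This is a sketch on synthetic data using only scikit-learn estimators as stand-ins; the model choices and all `toy_` names are assumptions for illustration, not the models evaluated above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic regression data (illustrative only)
X_toy, y_toy = make_regression(
    n_samples=150, n_features=5, noise=5.0, random_state=123
)

toy_models = {
    "dummy": DummyRegressor(),
    "ridge": Ridge(),
    "random_forest": RandomForestRegressor(n_estimators=20, random_state=123),
}

# One row per model: mean fit time, score time, test score, and train score
toy_results = {}
for name, model in toy_models.items():
    scores = cross_validate(model, X_toy, y_toy, cv=5, return_train_score=True)
    toy_results[name] = pd.DataFrame(scores).mean()

toy_results_df = pd.DataFrame(toy_results).T
print(toy_results_df)
```

Comparing `train_score` with `test_score` in each row gives a quick read on overfitting, and the `fit_time` and `score_time` columns cover the timing part of the comparison.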

<!-- BEGIN QUESTION -->

## 9. Feature selection <a name="9"></a>
<hr>
rubric={points:2}

**Your tasks:**

Make some attempts to select relevant features. You may try `RFECV` or forward selection for this. Do the results improve with feature selection? Summarize your results. If you see improvements in the results, keep feature selection in your pipeline. If not, you may abandon it in the next exercises. 

<div class="alert alert-warning">
    
Solution_9
    
</div>

_Points:_ 2

_Type your answer here, replacing this text._

In [124]:
from sklearn.feature_selection import RFECV

# Apply preprocessing
X_train_transformed = preprocessor.fit_transform(X_train)

# Initialize RFECV with the model
rfe_cv = RFECV(LGBMRegressor(verbosity=-1, random_state=123, n_jobs=-1), cv=5, step=10)

# Fit the model
rfe_cv.fit(X_train_transformed, y_train)

# If X_train is a DataFrame, we need to get the feature names
# If you used a OneHotEncoder or similar, we get the feature names from the preprocessor
if hasattr(preprocessor, 'get_feature_names_out'):  # If using OneHotEncoder or similar
    feature_names = preprocessor.get_feature_names_out()
else:
    feature_names = X_train.columns  # If it's just scaling or simple transformations

# Now apply the selected features mask to get the feature names
selected_features = feature_names[rfe_cv.support_]

# Get the fitted model from RFECV
model = rfe_cv.estimator_

# Get feature importances from the fitted model
feature_importances = model.feature_importances_

# Create a DataFrame to show feature names with their corresponding importances
importance_df = pd.DataFrame({
    'Feature': selected_features,
    'Importance': feature_importances
})

# Sort by importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Print the feature importance
print(importance_df)


                                         Feature  Importance
1                  pipeline-1__number_of_reviews         672
0                   pipeline-1__availability_365         586
4                          pipeline-2__longitude         380
3                           pipeline-2__latitude         356
2                              pipeline-1__price         313
..                                           ...         ...
48      onehotencoder__neighbourhood_South Slope           1
44        onehotencoder__neighbourhood_Rego Park           1
24        onehotencoder__neighbourhood_Flatlands           1
13          onehotencoder__neighbourhood_Astoria           1
58  onehotencoder__neighbourhood_Windsor Terrace           0

[65 rows x 2 columns]


In [125]:
from sklearn.feature_selection import RFECV

# Apply preprocessing
X_train_transformed = preprocessor.fit_transform(X_train)

# Initialize RFECV with the model
rfe_cv = RFECV(XGBRegressor(n_jobs=-1,random_state=123), cv=5, step=10)

# Fit the model
rfe_cv.fit(X_train_transformed, y_train)

# If X_train is a DataFrame, we need to get the feature names
# If you used a OneHotEncoder or similar, we get the feature names from the preprocessor
if hasattr(preprocessor, 'get_feature_names_out'):  # If using OneHotEncoder or similar
    feature_names = preprocessor.get_feature_names_out()
else:
    feature_names = X_train.columns  # If it's just scaling or simple transformations

# Now apply the selected features mask to get the feature names
selected_features = feature_names[rfe_cv.support_]

# Get the fitted model from RFECV
model = rfe_cv.estimator_

# Get feature importances from the fitted model
feature_importances = model.feature_importances_

# Create a DataFrame to show feature names with their corresponding importances
importance_df = pd.DataFrame({
    'Feature': selected_features,
    'Importance': feature_importances
})

# Sort by importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Print the feature importance
print(importance_df)


                                              Feature  Importance
125     onehotencoder__neighbourhood_Theater District    0.134544
120  onehotencoder__neighbourhood_Springfield Gardens    0.077014
8            onehotencoder__minimum_nights_Short Stay    0.066137
1                       pipeline-1__number_of_reviews    0.063016
5             onehotencoder__minimum_nights_Long Stay    0.058757
..                                                ...         ...
127              onehotencoder__neighbourhood_Tremont    0.000509
112             onehotencoder__neighbourhood_Rosebank    0.000469
50           onehotencoder__neighbourhood_Eastchester    0.000383
24           onehotencoder__neighbourhood_Boerum Hill    0.000311
88               onehotencoder__neighbourhood_Midwood    0.000306

[145 rows x 2 columns]
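
To answer whether feature selection improves the results, the selector can be placed inside a pipeline and scored with the same cross-validation as the model without it. This is a minimal sketch on synthetic data, assuming `Ridge` as both the selector's estimator and the final model; it is not the LightGBM or XGBoost setup used above, and all `toy_` names are made up.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data where only 3 of 10 features carry signal
X_toy, y_toy = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=123
)

# Selection happens inside the pipeline, so each CV fold selects on its own training part
toy_rfe_pipe = make_pipeline(StandardScaler(), RFECV(Ridge(), cv=5), Ridge())
toy_plain_pipe = make_pipeline(StandardScaler(), Ridge())

rfe_scores = cross_val_score(toy_rfe_pipe, X_toy, y_toy, cv=5)
plain_scores = cross_val_score(toy_plain_pipe, X_toy, y_toy, cv=5)

# Number of features kept when fitting on all of the toy data
toy_rfe_pipe.fit(X_toy, y_toy)
n_selected = toy_rfe_pipe.named_steps["rfecv"].n_features_

print(f"with RFECV:    {rfe_scores.mean():.3f}")
print(f"without RFECV: {plain_scores.mean():.3f}")
print("features kept:", n_selected)
```

Keeping the selector inside the pipeline avoids selecting features using data from the validation folds, which would make the comparison overly optimistic.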


<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 10. Hyperparameter optimization <a name="10"></a>
<hr>
rubric={points:10}

**Your tasks:**

Make some attempts to optimize hyperparameters for the models you've tried and summarize your results. In at least one case you should be optimizing multiple hyperparameters for a single model. You may use `sklearn`'s methods for hyperparameter optimization or fancier Bayesian optimization methods. 
  - [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)   
  - [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
  - [scikit-optimize](https://github.com/scikit-optimize/scikit-optimize) 

<div class="alert alert-warning">
    
Solution_10
    
</div>

_Points:_ 10

- `n_estimators`: number of boosting rounds.
- `learning_rate`: the learning rate of training. It controls how strongly each tree tries to correct the mistakes of the previous trees. A higher learning rate means each tree can make stronger corrections, which means a more complex model.
- `max_depth`: maximum depth of the trees (similar to decision trees).
- `scale_pos_weight`: balancing of positive and negative weights (for class imbalance).
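
Several of these hyperparameters can be tuned jointly with a randomized search over distributions. The sketch below uses scikit-learn's `GradientBoostingRegressor` as a stand-in for `LGBMRegressor` so that it runs without extra dependencies; the ranges, the synthetic data, and all `toy_` names are assumptions chosen for illustration only.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (illustrative only)
X_toy, y_toy = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=123
)

# Distributions to sample from; learning_rate is sampled on a log scale
toy_param_dist = {
    "n_estimators": randint(10, 100),
    "learning_rate": loguniform(1e-2, 1.0),
    "max_depth": randint(1, 6),
}

toy_search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=123),
    toy_param_dist,
    n_iter=5,
    cv=3,
    random_state=123,
)
toy_search.fit(X_toy, y_toy)

print("best params:", toy_search.best_params_)
print(f"best cv score: {toy_search.best_score_:.3f}")
```

Sampling `learning_rate` from `loguniform` spreads the trials evenly across orders of magnitude, which tends to suit hyperparameters whose useful values span a wide range.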

_Type your answer here, replacing this text._

In [96]:
def make_hyperparameter_tuning_plot(preprocessor, X_train, y_train, type, parameters, train_color, cv_color, text_color):
    """
    Plot train and cross-validation scores against n_estimators OR max_depth for LGBMRegressor

    Parameters
    ----------
    preprocessor: sklearn transformer
        The preprocessing pipeline to apply before the model
    X_train: numpy.ndarray or DataFrame
        The X part of the train set
    y_train: numpy.ndarray or Series
        The y part of the train set (log1p is applied inside this function)
    type: str
        Which hyperparameter to vary: "n_estimators" or "max_depth"
    parameters: list of int
        Values to try for the hyperparameter named by `type`
    train_color, cv_color, text_color: str
        Colors for the train curve, the cv curve, and the point labels
    Returns
    -------
    None
        Shows the hyperparameter vs score plot
    """
    train_scores = []
    test_scores = []

    y_train_log = np.log1p(y_train)  # log(1 + y), avoids issues with 0

    if type == "n_estimators":
        for n in parameters:
            model = make_pipeline(
                preprocessor,
                LGBMRegressor(
                    verbosity=-1,
                    random_state=123,
                    n_jobs=-1,
                    n_estimators=n
                )
            )
            scores = cross_validate(
                model,
                X_train,
                y_train_log,
                return_train_score=True
            )
            train_scores.append(np.mean(scores["train_score"]))
            test_scores.append(np.mean(scores["test_score"]))
    elif type == "max_depth":
        for m in parameters:
            model = make_pipeline(
                preprocessor,
                LGBMRegressor(
                    verbosity=-1,
                    random_state=123,
                    n_jobs=-1,
                    max_depth=m
                )
            )
            scores = cross_validate(
                model,
                X_train,
                y_train_log,
                return_train_score=True
            )
            train_scores.append(np.mean(scores["train_score"]))
            test_scores.append(np.mean(scores["test_score"]))

    # Plot
    plt.figure(figsize=(10, 6))
    plt.semilogx(parameters, train_scores, label="train", color=train_color, marker='o')
    plt.semilogx(parameters, test_scores, label="cv", color=cv_color, marker='o')

    # Add text labels beside each point
    for i, n in enumerate(parameters):
        plt.text(n, test_scores[i], f"{n}", fontsize=12, ha='right', va='bottom', color=text_color)

    plt.legend()
    plt.xlabel(type)
    plt.ylabel("score")
    plt.title(type + " vs cv performance")
    plt.grid(True)
    plt.show()


In [None]:
make_hyperparameter_tuning_plot(preprocessor, X_train, y_train,"n_estimators",[1, 5, 10, 25, 50, 100, 200, 400], "#96ddff", "#f6c5ef", "#dcbbf5")

In [None]:
make_hyperparameter_tuning_plot(preprocessor, X_train, y_train,"max_depth",[1, 5, 10, 25, 50, 100, 200, 400],"#ffcd94","#9afcae","#f19e9c")

In [None]:
pipe_lgbm_tuned_first_time = make_pipeline(
    preprocessor, LGBMRegressor(verbosity=-1, n_jobs=-1, random_state=123, n_estimators=50, max_depth=10)
)
results_dict["LGBM_tuned_first_time"] = mean_std_cross_val_scores(
    pipe_lgbm_tuned_first_time, X_train, y_train, cv=5, return_train_score=True
)
results_df = pd.DataFrame(results_dict).T
results_df

In [33]:
# param_grid = {"lgbmregressor__learning_rate": [0.01, 0.1, 1],
#                "lgbmregressor__max_depth": [100,200,300],
#               }

# search = RandomizedSearchCV(
#     pipe_lgbm, param_grid, return_train_score=True, n_jobs=-1, verbose=0,random_state=123)

# y_train_log = np.log1p(y_train)  # log(1 + y), avoids issues with 0

# # Fit the search object
# search.fit(X_train, y_train_log)

# # Print the best hyperparameters and score
# print("Best hyperparameter values: ", search.best_params_)
# print("Best score: %0.3f" % (search.best_score_))

# # Convert the search results to a DataFrame and display
# pd.DataFrame(search.cv_results_)[
#     [
#         "mean_train_score",
#         "mean_test_score",
#         "param_lgbmregressor__learning_rate",
#         "param_lgbmregressor__max_depth",
#         "mean_fit_time",
#         "rank_test_score",
#     ]
# ].set_index("rank_test_score").sort_index().T

In [34]:
# pipe_lgbm_tuned_second_time = make_pipeline(
#     preprocessor, LGBMRegressor(verbosity=-1,n_estimators=50, learning_rate=0.065, max_depth=10, random_state=123)
# )
# results_dict["LGBM_tuned_second_time"] = mean_std_cross_val_scores(
#     pipe_lgbm_tuned_second_time, X_train, y_train, cv=5, return_train_score=True
# )
# results_df = pd.DataFrame(results_dict).T
# results_df

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 11. Interpretation and feature importances <a name="11"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Use the methods we saw in class (e.g., `shap`), or any other method of your choice, to examine the most important features of one of the non-linear models. 
2. Summarize your observations. 
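
As one possible starting point, here is a minimal sketch of permutation importance, a model-agnostic alternative to `shap`. It runs on synthetic data, not the homework dataset. The feature names, the target formula, and every `_demo` variable are assumptions made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): the target depends strongly and non-linearly
# on feat_a, weakly on feat_b, and not at all on noise.
rng = np.random.default_rng(123)
feature_names_demo = ["feat_a", "feat_b", "noise"]
X_demo = rng.normal(size=(400, 3))
y_demo = 5.0 * X_demo[:, 0] ** 2 + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=400)

X_tr_demo, X_val_demo, y_tr_demo, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123
)

# Any fitted non-linear model works here; a random forest keeps the sketch short.
model_demo = RandomForestRegressor(n_estimators=100, random_state=123)
model_demo.fit(X_tr_demo, y_tr_demo)

# Shuffle one column at a time on held-out data and measure the drop in score.
result_demo = permutation_importance(
    model_demo, X_val_demo, y_val_demo, n_repeats=10, random_state=123
)

# Print features from most to least important.
for idx in np.argsort(result_demo.importances_mean)[::-1]:
    print(
        f"{feature_names_demo[idx]}: "
        f"{result_demo.importances_mean[idx]:.3f} +/- {result_demo.importances_std[idx]:.3f}"
    )
```

Permutation importance gives a global ranking of features. It does not show the direction of a feature's effect or explain individual predictions, which is where SHAP values are more informative.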

<div class="alert alert-warning">
    
Solution_11
    
</div>

_Points:_ 10

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 12. Results on the test set <a name="12"></a>
<hr>

rubric={points:10}

**Your tasks:**

1. Try your best performing model on the test data and report test scores. 
2. Do the test scores agree with the validation scores from before? To what extent do you trust your results? Do you think you've had issues with optimization bias? 
3. Take one or two test predictions and explain these individual predictions (e.g., with SHAP force plots).  
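
For the comparison in task 2, a rough sketch of one way to set cross-validation scores beside a held-out test score is shown here. It uses synthetic data and a `Ridge` model purely as stand-ins. The data, the model choice, and all `_demo` names are assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data (hypothetical), standing in for the real dataset.
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=500)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

model_demo = Ridge(alpha=1.0)

# Validation estimate: 5-fold cross-validation on the training split only.
cv_scores_demo = cross_val_score(model_demo, X_train_demo, y_train_demo, cv=5)

# Test estimate: fit once on the full training split, score once on the test split.
model_demo.fit(X_train_demo, y_train_demo)
test_score_demo = model_demo.score(X_test_demo, y_test_demo)

gap_demo = test_score_demo - cv_scores_demo.mean()
print(f"CV R^2:   {cv_scores_demo.mean():.3f} +/- {cv_scores_demo.std():.3f}")
print(f"Test R^2: {test_score_demo:.3f} (gap vs. CV mean: {gap_demo:+.3f})")
```

A test score that falls well below the cross-validation mean, relative to the spread across folds, can be a sign of optimization bias, especially after extensive hyperparameter tuning.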

<div class="alert alert-warning">
    
Solution_12
    
</div>

_Points:_ 10

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 13. Summary of results <a name="13"></a>
<hr>
rubric={points:12}

Imagine that you want to present the summary of these results to your boss and co-workers. 

**Your tasks:**

1. Create a table summarizing important results. 
2. Write concluding remarks.
3. Discuss other ideas that you did not try but that could potentially improve the performance or interpretability. 
4. Report your final test score along with the metric you used at the top of this notebook in the [Submission instructions section](#si).
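
A minimal sketch of how such a summary table could be assembled with `pandas` follows. The scores are computed on synthetic data, so the numbers mean nothing for this assignment. The model names and column labels are hypothetical choices.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic data (hypothetical), used only to have something to score.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=123)

models_demo = {"dummy": DummyRegressor(), "ridge": Ridge(alpha=1.0)}

# Collect one row of summary statistics per model.
rows_demo = {}
for name, model in models_demo.items():
    scores = cross_validate(model, X_demo, y_demo, cv=5, return_train_score=True)
    rows_demo[name] = {
        "mean_train_r2": scores["train_score"].mean(),
        "mean_cv_r2": scores["test_score"].mean(),
        "std_cv_r2": scores["test_score"].std(),
    }

# Models as rows, metrics as columns, rounded for presentation.
summary_df_demo = pd.DataFrame(rows_demo).T.round(3)
print(summary_df_demo)
```

Keeping one row per model and one column per metric makes it easy to add fit times or the final test score as extra columns later.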

<div class="alert alert-warning">
    
Solution_13
    
</div>

_Points:_ 12

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 14. Your takeaway <a name="15"></a>
<hr>
rubric={points:2}

**Your tasks:**

What is your biggest takeaway from the supervised machine learning material we have learned so far? Please write thoughtful answers.  

<div class="alert alert-warning">
    
Solution_14
    
</div>

_Points:_ 2

<!-- END QUESTION -->

<br><br>

**PLEASE READ BEFORE YOU SUBMIT:** 

When you are ready to submit your assignment, do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from "1" will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 
4. Make sure that the plots and output are rendered properly in your submitted file. If the .ipynb file is too big and doesn't render on Gradescope, upload a PDF or HTML version in addition to the .ipynb so that the TAs can view your submission on Gradescope. 

This was a tricky one but you did it! 

![](img/eva-well-done.png)