# DSCI 573 - Feature and Model Selection

## Lab 1: Evaluation metrics

## Table of contents

- [Submission instructions](#si)
- [Exercise 0: Remembering important concepts from 571](#0)
- [Exercise 1: Precision, recall, and f1 score by hand](#1)
- [Exercise 2: Classification evaluation metrics using `sklearn`](#2)
- [Exercise 3: Regression metrics](#3)

In [2]:
import os
import re
import sys
from hashlib import sha1

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tests_lab1
from sklearn import datasets
from sklearn.compose import (
    ColumnTransformer,
    TransformedTargetRegressor,
    make_column_transformer,
)
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    make_scorer,
    precision_score,
    recall_score,
)
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC, SVR

## Submission instructions <a name="si"></a>
<hr>
rubric={mechanics:2}

You will receive marks for correctly submitting this assignment. 

To correctly submit this assignment follow the instructions below:

- Push your assignment to your GitHub repository. 
- Add a link to your GitHub repository here: LINK TO YOUR GITHUB REPO 
- Upload an HTML render of your assignment to Canvas. The last cell of this notebook will help you do that.
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).

[Here](https://github.com/UBC-MDS/public/tree/master/rubric) you will find the description of each rubric used in MDS.

**NOTE: The data you download for use in this lab SHOULD NOT BE PUSHED TO YOUR REPOSITORY. You might be penalised for pushing datasets to your repository. I have seeded the repository with a `.gitignore` file, which should keep you from pushing CSVs.**

## Exercise 0: Remembering important concepts from 571 <a name="0"></a>
<hr>

### 0.1 Overfitting 
rubric={reasoning:2}

Why would it be a problem to set the `max_depth` hyperparameter of `DecisionTreeClassifier` purely based on training accuracy?

### YOUR ANSWER HERE

### 0.2 Parameters and hyperparameters
rubric={reasoning:2}

Explain the difference between parameters and hyperparameters. Which ones do we tend to tune with cross-validation? Why wouldn't cross-validation be appropriate in the other case?

### YOUR ANSWER HERE

### 0.3 Cross-validation
rubric={reasoning:2}

1. What is an advantage and a disadvantage of 20-fold cross-validation over 3-fold cross-validation? 
2. Assuming $n$ is the number of training examples, if you carry out $n$-fold cross-validation, how many validation examples would there be in each fold? 
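
As a reminder of how fold sizes depend on the number of folds, here is a minimal sketch on toy data. The array `X_toy` and the choice of 5 folds are assumptions made purely for illustration; they are not part of the lab data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 20 examples, 2 features (values are arbitrary)
X_toy = np.arange(40).reshape(20, 2)

# 5-fold cross-validation: each example lands in exactly one validation fold
kf = KFold(n_splits=5, shuffle=True, random_state=123)
fold_sizes = [
    (len(train_idx), len(valid_idx)) for train_idx, valid_idx in kf.split(X_toy)
]
print(fold_sizes)  # [(16, 4), (16, 4), (16, 4), (16, 4), (16, 4)]
```

Changing `n_splits` changes both how many models are trained and how many examples each validation fold holds.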

### YOUR ANSWER HERE

### (optional) 0.4 
rubric={reasoning:1}

You have a (fictional) model that trains in $\mathcal{O}(n^2d)$ time (total) and makes predictions in $\mathcal{O}(d^2)$ time (per example), where $n$ is the number of examples and $d$ is the number of features. What is the time complexity of evaluating this model with $k$-fold cross-validation? Answer using big-O notation and explain or show your work. Your answer may depend on $n$, $d$, and/or $k$.

### YOUR ANSWER HERE

## Exercise 1: Precision, recall, and f1 score by hand <a name="1"></a>
<hr>

Consider the problem of predicting whether a patient has a disease or not. Below are confusion matrices of two machine learning models: Model A and Model B. 

- Model A
|    Actual/Predicted      | Predicted disease | Predicted no disease |
| :------------- | -----------------------: | -----------------------: |
| **Actual disease**       | 2 | 8 |
| **Actual no disease**       | 0 | 100 |


- Model B
|    Actual/Predicted      | Predicted disease | Predicted no disease |
| :------------- | -----------------------: | -----------------------: |
| **Actual disease**       | 6 | 4 |
| **Actual no disease**       | 10 | 90 |

### 1.1 Positive vs. negative class 
rubric={reasoning:2}

**Your tasks:**

Precision, recall, and f1 score depend crucially on which class is considered "positive", that is, the thing you wish to find. In the example above, which class is likely to be the "positive" class? Why? 

### YOUR ANSWER HERE

### 1.2 Accuracy
rubric={autograde:2}

**Your tasks:**

Calculate accuracies for Model A and Model B. 

We'll store all metrics associated with Model A and Model B in the `results_dict` below. 

In [None]:
### BEGIN STARTER CODE

results_dict = {"A": {}, "B": {}}

### END STARTER CODE

In [None]:
results_dict["A"]["accuracy"] = None
results_dict["B"]["accuracy"] = None

### YOUR ANSWER HERE

In [None]:
assert tests_lab1.ex1_2_1(
    results_dict["A"]["accuracy"]
), "Your answer is incorrect, see traceback above."
print("Success")

In [None]:
assert tests_lab1.ex1_2_2(
    results_dict["B"]["accuracy"]
), "Your answer is incorrect, see traceback above."
print("Success")

In [None]:
pd.DataFrame(results_dict)

### 1.3 Which model would you pick? 
rubric={reasoning:1}

Which model would you pick simply based on the accuracy metric? 

### YOUR ANSWER HERE

### 1.4 Precision, recall, f1-score
rubric={autograde:6}

**Your tasks:**

1. Calculate precision, recall, f1-score for Model A and Model B without using `scikit-learn` tools. 
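
The standard definitions can be sketched in plain Python as follows. The counts below belong to a hypothetical confusion matrix (they are not Model A or Model B), and the sketch assumes the same layout as the tables above: rows are actual classes, columns are predicted classes, and the positive class comes first.

```python
# Hypothetical confusion matrix counts (NOT Model A or Model B)
tp, fn = 30, 10  # actual positive row: predicted positive, predicted negative
fp, tn = 20, 40  # actual negative row: predicted positive, predicted negative

accuracy = (tp + tn) / (tp + fn + fp + tn)  # fraction of all predictions that are correct
precision = tp / (tp + fp)  # of the predicted positives, how many are truly positive
recall = tp / (tp + fn)  # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, precision, recall, round(f1, 3))
```

For these made-up counts, precision is 30/50 and recall is 30/40. Note that precision is undefined when a model predicts no positives at all, so that case needs separate handling.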


In [None]:
results_dict["A"]["precision"] = None
results_dict["B"]["precision"] = None
results_dict["A"]["recall"] = None
results_dict["B"]["recall"] = None
results_dict["A"]["f1"] = None
results_dict["B"]["f1"] = None

### YOUR ANSWER HERE

In [None]:
assert tests_lab1.ex1_4_1(
    results_dict["A"]["precision"]
), "Your answer is incorrect, see traceback above."
print("Success")

In [None]:
assert tests_lab1.ex1_4_2(
    results_dict["B"]["precision"]
), "Your answer is incorrect, see traceback above."
print("Success")

In [None]:
assert tests_lab1.ex1_4_3(
    results_dict["A"]["recall"]
), "Your answer is incorrect, see traceback above."
print("Success")

In [None]:
assert tests_lab1.ex1_4_4(
    results_dict["B"]["recall"]
), "Your answer is incorrect, see traceback above."
print("Success")

In [None]:
assert tests_lab1.ex1_4_5(
    results_dict["A"]["f1"]
), "Your answer is incorrect, see traceback above."
print("Success")

In [None]:
assert tests_lab1.ex1_4_6(
    results_dict["B"]["f1"]
), "Your answer is incorrect, see traceback above."
print("Success")

Show the dataframe with all results

In [None]:
pd.DataFrame(results_dict)

### 1.5 Precision, recall, f1-score
rubric={reasoning:2}

**Your tasks:**
1. Which metric is more informative in this case? Why? 
2. Which model would you pick based on this information? 

### YOUR ANSWER HERE

### (optional) 1.6 
rubric={reasoning:1}

**Your tasks:**
1. Are confusion matrices, precision, and recall useful only where there is class imbalance? 
2. Provide 4 to 5 example classification datasets (with links) where the accuracy metric would be misleading. Discuss which evaluation metric would be more appropriate for each dataset. You may consider datasets we used in 571. You could also look up datasets on Kaggle. 

## Exercise 2: Classification evaluation metrics using `sklearn` <a name="2"></a>
<hr>

In general, when the dataset is imbalanced, accuracy does not tell the whole story. In class, we looked at a credit card fraud dataset, which is a classic example of an imbalanced dataset. Another example is customer churn datasets, where most customers stay with the service and a small minority cancel their subscription. These are the customers the company might want to target in order to convince them to stay. For the next questions you will be using a [telecom customer churn dataset](https://www.kaggle.com/becksddf/churn-in-telecoms-dataset) from Kaggle. 

Download the data CSV and save it under your lab folder. **Please do not push the CSV in your repo.** 
The starter code below reads the data CSV as a pandas dataframe and splits it into 70% train and 30% test. 

Note that the `churn` column in the dataset is the target. "True" means the customer left the subscription (churned) and "False" means they stayed.

In [None]:
### BEGIN STARTER CODE

df = pd.read_csv("bigml_59c28831336c6604c800002a.csv", encoding="latin-1")
train_df, test_df = train_test_split(df, test_size=0.3, random_state=123)
train_df

### END STARTER CODE

### 2.1 EDA
rubric={reasoning:3}

**Your tasks:**

Examine the distribution of target values in the train split. Do you see class imbalance? If yes, do we need to deal with it? Why or why not? 
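
One common way to look at a target distribution is `value_counts`. The sketch below uses a made-up dataframe `toy_df` with a boolean `churn` column; the proportions are invented and say nothing about the real dataset.

```python
import pandas as pd

# Toy target column; the values are made up for illustration
toy_df = pd.DataFrame({"churn": [False] * 8 + [True] * 2})

# normalize=True returns class proportions instead of raw counts
class_proportions = toy_df["churn"].value_counts(normalize=True)
print(class_proportions)
```

Proportions are often easier to compare across splits than raw counts.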

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### (optional) 2.2  
rubric={reasoning:1}

**Your tasks:**
1. Choose 5 features which you think are most relevant for the prediction task. Examine correlations of these features with the target.  
2. Which features look most promising? Note your observations.    

### 2.3 Identify numeric, categorical, binary, and other features
rubric={accuracy:1,reasoning:5}

**Your tasks:**

1. Identify different feature types (e.g., numeric, categorical, binary, drop features). 
2. Separate `X` and `y` from `train_df` and `test_df`.

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### 2.4 Feature transformations
rubric={accuracy:4,reasoning:2}

**Your tasks:**
1. Describe your plan for feature transformations. Would you drop any features? 
2. Define a preprocessor with appropriate feature transformations using `ColumnTransformer`. 

Note that if you do not see any missing values, it's fine to skip imputation. 
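
As a reminder of the general shape of a column transformer, here is a minimal sketch on a toy dataframe. The column names (`minutes`, `plan`, `customer_id`) and the feature groupings are hypothetical and are not a recommendation for the churn dataset.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with hypothetical column names (not the churn dataset's columns)
toy_X = pd.DataFrame(
    {
        "minutes": [10.0, 20.0, 30.0, 40.0],
        "plan": ["a", "b", "a", "c"],
        "customer_id": [101, 102, 103, 104],
    }
)

toy_preprocessor = make_column_transformer(
    (StandardScaler(), ["minutes"]),  # scale numeric features
    (OneHotEncoder(handle_unknown="ignore"), ["plan"]),  # encode categorical features
    ("drop", ["customer_id"]),  # drop identifier-like features
)

transformed = toy_preprocessor.fit_transform(toy_X)
print(transformed.shape)  # (4, 4): 1 scaled column + 3 one-hot columns
```

`make_column_transformer` is a shorthand for `ColumnTransformer` that names the steps automatically; either form works inside a pipeline.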

### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### 2.5 `DummyClassifier`
rubric={accuracy:2, reasoning:1}

**Your tasks:**

1. Carry out cross validation with `DummyClassifier`. Pass the following `scoring` metrics to `cross_validate`. 
    - accuracy
    - f1
    - recall
    - precision
    - average_precision
    - roc_auc

2. Comment on the scores given by `DummyClassifier`.  

You may use the `results_df` dictionary below to store all your results. 
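
The mechanics of passing several metrics to `cross_validate` can be sketched on synthetic data as follows. The data from `make_classification` and the explicit `strategy="most_frequent"` are assumptions for illustration only; with your own data you would pass your training features and target instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Synthetic imbalanced data (roughly 90% negative, 10% positive)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=123
)

scoring = ["accuracy", "f1", "recall", "precision", "average_precision", "roc_auc"]

# strategy="most_frequent" always predicts the majority class
dummy_scores = cross_validate(
    DummyClassifier(strategy="most_frequent"), X_syn, y_syn, scoring=scoring
)

# One row per fold, one "test_<metric>" column per metric; average over folds
dummy_summary = pd.DataFrame(dummy_scores).mean()
print(dummy_summary)
```

Because this baseline never predicts the positive class, `scikit-learn` is likely to warn that precision is ill-defined and report it as 0.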

In [None]:
results_df = {}

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### 2.6 Exploring different scoring methods on different classifiers
rubric={accuracy:6,reasoning:4}

In this exercise you will be using one of the most popular classifiers called [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) which we haven't studied yet. We'll look at the details of the classifier later in the course but that shouldn't prevent you from using it. At this point you should feel comfortable using models with our usual ML workflow even if you haven't seen them before. 

**Your tasks:**
1. For each of the following classifiers in the starter code below, carry out cross-validation using the `cross_validate` function and the following evaluation metrics. Note that you can pass multiple [scoring metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) as a list or a dict to the `scoring` parameter. 
    - `accuracy`
    - `precision`
    - `recall`
    - `f1`
    - `average_precision`
    - `roc_auc`
2. Discuss the results focusing on the following points
    1. How do the results compare to the baseline, i.e., `DummyClassifier`?
    2. Comment on precision and recall of `LogisticRegression`.    
    3. Comment on precision and recall of `RandomForestClassifier`.    
    4. How are the results affected when `class_weight` parameter is used?
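
One possible way to organize such a comparison is to loop over a dictionary of models and collect the mean scores in a dataframe. The sketch below uses synthetic data, a reduced metric set in dict form, and a hypothetical `toy_classifiers` dictionary; it is not the lab's starter code, and `max_iter=1000` is only there to avoid convergence warnings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic imbalanced data, for illustration only
X_syn, y_syn = make_classification(
    n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=123
)

# Dict form of `scoring`: keys become the suffixes of the result columns
scoring = {"acc": "accuracy", "prec": "precision", "rec": "recall"}

toy_classifiers = {
    "lr": LogisticRegression(max_iter=1000),
    "lr (balanced)": LogisticRegression(max_iter=1000, class_weight="balanced"),
}

mean_scores = {}
for name, clf in toy_classifiers.items():
    scores = cross_validate(
        clf, X_syn, y_syn, scoring=scoring, return_train_score=True
    )
    mean_scores[name] = pd.DataFrame(scores).mean()

# Models as columns, metrics (plus fit/score times) as rows
comparison = pd.DataFrame(mean_scores)
print(comparison)
```

Putting the models side by side makes it easier to see how precision and recall move when `class_weight` changes.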
    


In [None]:
### BEGIN STARTER CODE

classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Logistic Regression (balanced)": LogisticRegression(class_weight="balanced"),
    "Random Forest": RandomForestClassifier(),
    "Random Forest (balanced)": RandomForestClassifier(class_weight="balanced"),
}

### END STARTER CODE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### 2.7 Hyperparameter optimization and test results
rubric={accuracy:4,viz:2,reasoning:4}

The starter code below defines a pipeline with a preprocessor (assuming that your preprocessor is named `preprocessor`) and a `RandomForestClassifier` model, along with parameter distributions for the search. 

**Your tasks:**
1. Carry out hyperparameter optimization with `RandomizedSearchCV`, using this pipeline and these parameter distributions, with `f1` as the scoring metric. 
2. Try the best model on the test set. 
    - Display confusion matrix. 
    - Display classification report. 
    - Show precision-recall curve with average precision score.     
    - Show ROC curve with AUC. 
3. Comment on the results.     
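
The overall flow of these steps can be sketched on synthetic data as below. The data, the small search ranges, `n_iter=5`, and `cv=3` are assumptions chosen to keep the example fast; there is no preprocessor here, so the parameter names carry no `randomforestclassifier__` prefix.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    average_precision_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic imbalanced data, for illustration only
X_syn, y_syn = make_classification(
    n_samples=400, n_features=5, weights=[0.85, 0.15], random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=123
)

toy_search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=123),
    param_distributions={"n_estimators": randint(10, 50), "max_depth": randint(2, 10)},
    n_iter=5,
    scoring="f1",  # candidates are ranked by cross-validated f1
    cv=3,
    random_state=123,
)
toy_search.fit(X_tr, y_tr)

# The fitted search object predicts with the refit best estimator
y_pred = toy_search.predict(X_te)
pos_proba = toy_search.predict_proba(X_te)[:, 1]  # probability of the positive class

cm = confusion_matrix(y_te, y_pred)
print(cm)
print(classification_report(y_te, y_pred))
print("Average precision:", average_precision_score(y_te, pos_proba))
print("ROC AUC:", roc_auc_score(y_te, pos_proba))
```

For the curves themselves, recent versions of `scikit-learn` provide `PrecisionRecallDisplay.from_estimator` and `RocCurveDisplay.from_estimator` in `sklearn.metrics`; check that your installed version includes them before relying on them.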

In [None]:
### BEGIN STARTER CODE

from scipy.stats import randint

rf_pipeline = make_pipeline(
    preprocessor, RandomForestClassifier(class_weight="balanced", random_state=123)
)

# Distributions sampled by the randomized search
param_dist = {
    "randomforestclassifier__n_estimators": randint(low=10, high=300),
    "randomforestclassifier__max_depth": randint(low=2, high=20),
}

### END STARTER CODE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### (optional) 2.8 SMOTE
rubric={reasoning:1}

**Your tasks:**
- Instead of using `class_weight`, use SMOTE with the above classifiers to deal with class imbalance. 
- Compare your results with the results you obtained using `class_weight`.

Note that you'll have to use [imbalanced-learn](https://imbalanced-learn.readthedocs.io/en/stable/install.html) for SMOTE. 

In [None]:
### YOUR ANSWER HERE

## Exercise 3: Regression metrics <a name="3"></a>
<hr> 


For this exercise, we'll use the [California housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) from `sklearn.datasets`.  

In [None]:
### BEGIN STARTER CODE

from sklearn.datasets import fetch_california_housing

housing_df = fetch_california_housing(as_frame=True).frame

### END STARTER CODE

### 3.1 Data splitting and exploration 
rubric={accuracy:1,reasoning:3}

**Your tasks:**

1. Split the data into train and test portions. 
2. Explore the train split. Do you need to apply any transformations? If yes, which transformations would you apply? 
3. Separate `X` and `y` in train and test splits. 

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### 3.2 Baseline: DummyRegressor 
rubric={accuracy:1,reasoning:1}

**Your tasks:**
1. Carry out cross-validation using `DummyRegressor`. 
2. What metric is used by default? 

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### 3.3 Different regressors
rubric={accuracy:4,reasoning:2}

In this exercise, we are going to use the `RandomForestRegressor` model, which we haven't looked into yet. As I said before, at this point you should feel comfortable using models with our usual ML workflow even if you don't know the details. We'll talk about `RandomForestRegressor` later in the course.  

**Your tasks:**

1. Using the models in the starter code below, carry out cross-validation with each model (use a pipeline with the model as the final estimator if you are applying any transformations) and all of the following evaluation metrics. You can do this by passing the evaluation metrics to the `scoring` argument of `cross_validate`. 
    - neg_mean_squared_error
    - neg_root_mean_squared_error
    - neg_mean_absolute_error
    - r2
    - mape_scorer (user-defined scorer given in the starter code below)
2. Show the results as a dataframe with models as columns and metrics as rows. 
3. Interpret the results. How do the models compare to the baseline? Which model seems to be performing well with all different metrics? 
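
A point that is easy to miss when reading these scores is the sign convention: `neg_*` scorers, and any scorer built with `greater_is_better=False`, report negated errors so that higher is always better. Here is a minimal sketch on synthetic data, where `toy_mape`, `X_reg`, and `y_reg` are made up for illustration and the target is shifted away from zero so the percentage error is well defined.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_validate


def toy_mape(true, pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs((pred - true) / true))


# greater_is_better=False makes the scorer return the negated MAPE
toy_mape_scorer = make_scorer(toy_mape, greater_is_better=False)

# Synthetic regression data with a strictly positive target
rng = np.random.default_rng(123)
X_reg = rng.normal(size=(200, 3))
y_reg = 10.0 + X_reg @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=200)

reg_scoring = {
    "neg_rmse": "neg_root_mean_squared_error",
    "r2": "r2",
    "mape": toy_mape_scorer,
}
reg_scores = pd.DataFrame(cross_validate(Ridge(), X_reg, y_reg, scoring=reg_scoring))
print(reg_scores.mean())
```

The `test_neg_rmse` and `test_mape` columns come out non-positive here; flip the sign if you want to report the errors themselves.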


In [None]:
### BEGIN STARTER CODE


def mape(true, pred):
    return 100.0 * np.mean(np.abs((pred - true) / true))


# make a scorer function that we can pass into cross-validation
mape_scorer = make_scorer(mape, greater_is_better=False)

models = {
    "Ridge": Ridge(),
    "Random Forest": RandomForestRegressor(),
}

### END STARTER CODE

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### 3.4 Hyperparameter optimization 
rubric={accuracy:2,reasoning:3}

**Your tasks:**
1. Carry out hyperparameter optimization using `RandomizedSearchCV` and `Ridge`. The `alpha` hyperparameter of `Ridge` controls the fundamental tradeoff. Choose the metric of your choice for hyperparameter optimization. 
2. Are you getting better scores compared to the default values? 
3. Try the best `Ridge` model on the test set.
4. Comment on the results.  

**Note: If you get errors with `n_jobs=-1`, try to use `n_jobs=1`, i.e., sequential processing.**
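
A randomized search over `alpha` could look roughly like the sketch below. The synthetic data, the log-uniform range from `1e-3` to `1e3`, and the choice of `neg_mean_absolute_error` are all assumptions for illustration; other ranges and metrics are equally valid.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data, for illustration only
rng = np.random.default_rng(123)
X_reg = rng.normal(size=(200, 3))
y_reg = X_reg @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.5, size=200)

ridge_search = RandomizedSearchCV(
    Ridge(),
    # alpha spans several orders of magnitude, so sample it on a log scale
    param_distributions={"alpha": loguniform(1e-3, 1e3)},
    n_iter=20,
    scoring="neg_mean_absolute_error",
    random_state=123,
    n_jobs=1,  # sequential processing
)
ridge_search.fit(X_reg, y_reg)

best_alpha = ridge_search.best_params_["alpha"]
print(best_alpha, ridge_search.best_score_)
```

If `Ridge` sits inside a pipeline built with `make_pipeline`, the parameter name would be `ridge__alpha` rather than `alpha`.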

In [None]:
### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### (optional) 3.5 Hyperparameter optimization with `RandomForestRegressor`
rubric={reasoning:1}

**Your tasks:**
1. Carry out hyperparameter optimization using `RandomForestRegressor` and hyperparameter optimization tool of your choice. You can find the relevant hyperparameters in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html). 
2. Are you getting better scores compared to the default values of the regressor? 
3. Try the best model on the test set and comment on the results. 

### 3.6 Model interpretation  
rubric={viz:2,reasoning:2}

Ridge is a linear model and it learns coefficients associated with each feature during fit. 

**Your tasks:**

1. Visualize the coefficients that the `best_estimator_` of the random search learned for each feature, as a pandas dataframe with two columns: features and coefficients. 
2. Increasing which feature values would result in higher housing prices, and which in lower housing prices? 
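
Pulling coefficients out of a fitted pipeline can be sketched as follows. The feature names `feat_up` and `feat_down`, the toy data, and the pipeline itself are hypothetical; with a fitted search object, the pipeline would be its `best_estimator_` instead of `toy_pipe`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data where one feature raises the target and the other lowers it
rng = np.random.default_rng(123)
toy_features = ["feat_up", "feat_down"]  # hypothetical feature names
X_toy = pd.DataFrame(rng.normal(size=(200, 2)), columns=toy_features)
y_toy = (
    3.0 * X_toy["feat_up"]
    - 2.0 * X_toy["feat_down"]
    + rng.normal(scale=0.1, size=200)
)

toy_pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
toy_pipe.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name
coef_df = pd.DataFrame(
    {"features": toy_features, "coefficients": toy_pipe.named_steps["ridge"].coef_}
).sort_values("coefficients", ascending=False)
print(coef_df)
```

Because the features are standardized first, the coefficients are on a comparable scale; their signs indicate the direction of each feature's association with the prediction.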

In [None]:
### YOUR ANSWER HERE

### YOUR ANSWER HERE

### Submission to Canvas

**PLEASE READ: When you are ready to submit your assignment do the following:**

- Run all cells in your notebook to make sure there are no errors by doing Kernel -->  Restart Kernel and Run All Cells...
- If you are using the "573" `conda` environment, make sure to select it before running all cells. 
- Convert your notebook to .html format using the `convert_notebook()` function below or by File -> Export Notebook As... -> Export Notebook to HTML
- Run the code `submit()` below to go through an interactive submission process to Canvas.
- After submission, be sure to do a final push of all your work to GitHub (including the rendered HTML file).

In [None]:
# from canvasutils.submit import convert_notebook, submit

# convert_notebook("lab1.ipynb", "html")  # uncomment and run when you want to try converting your notebook (or you can convert manually from the File menu)
# submit(course_code=53670, token=False)  # uncomment and run when ready to submit to Canvas

Congratulations on finishing the lab!! 