In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("p1.ipynb")

# Practice 1: Decision trees and machine learning fundamentals

The first section of this practice notebook walks through the steps we learned using a toy dataset, and the second uses a real dataset. This way you get to practice what you have learned twice. Don't worry if you can't finish the entire second part with the real dataset; most questions give automated feedback, so you should be able to continue on your own time after the workshop. This notebook is adapted for this workshop from our teaching material, so some questions have been removed and you might notice that the question numbers are not always sequential. Have fun and let us know if you have any questions!

<br><br>

## Imports

Run this cell to initialize a couple of libraries we use for the automated feedback.

In [None]:
# Run this cell to start
from hashlib import sha1
import numpy as np

## Exercise 1: Decision trees with a toy dataset 
<hr>

Suppose you have three different job offers with comparable salaries and job descriptions. You want to decide which one to accept, and you want to make this decision based on which job is likely to make you happy. Being a very systematic person, you come up with three features associated with the offers, which are important for your happiness: whether the colleagues are supportive, whether there is work-hour flexibility, and whether the company is a start-up or not. So the `X` of your offer data looks as follows: 

In [None]:
import pandas as pd


offer_data = {
    # Features
    "supportive_colleagues": [1, 0, 0, 1],
    "work_hour_flexibility": [0, 0, 1, 1],
    "start_up": [0, 1, 1, 1],    
}

offer_df = pd.DataFrame(offer_data)
offer_df

Your goal is to get predictions for these rows. In other words, for each row, you want to predict whether that job would make you **happy** or **unhappy**.   

So you ask the following questions to some of your friends (who you think have similar notions of happiness) regarding their jobs:

1. Do you have supportive colleagues? (1 for 'yes' and 0 for 'no')
2. Do you have flexible work hours? (1 for 'yes' and 0 for 'no')
3. Do you work for a start-up? (1 for 'start up' and 0 for 'non start up')
4. Are you happy in your job? (happy or unhappy)

Suppose you get the following data from this toy survey. You decide to train a machine learning model using this toy survey data and use this model to predict which job from `offer_df` is likely to make you happy. 

In [None]:
happiness_data = {
    # Features
    "supportive_colleagues": [1, 1, 1, 0, 0, 1, 1, 0, 1, 0],
    "work_hour_flexibility": [1, 1, 0, 1, 1, 0, 1, 0, 0, 0],
    "start_up": [1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
    # Target
    "target": [
        "happy",
        "happy",
        "happy",
        "unhappy",
        "unhappy",
        "happy",
        "happy",
        "unhappy",
        "unhappy",
        "unhappy",
    ],
}

train_df = pd.DataFrame(happiness_data)
train_df

<br><br>

### 1.1 Decision stump by hand 
rubric={autograde:2}

**Your tasks:**

With this toy dataset, build a decision stump (decision tree with only 1 split) by hand, splitting on the condition `supportive_colleagues <= 0.5`. What training accuracy would you get with this decision stump? Save the accuracy as a decimal in an object named `supportive_colleagues_acc`. 

> You do not have to show any calculations or code. 
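
If you would like to double-check a hand calculation, the sketch below shows one way to compute the training accuracy of a decision stump: each side of the split predicts its majority class. The mini-dataset, the `has_parking` feature, and the `stump_accuracy` helper are all hypothetical and only serve as an illustration; they are not part of the survey data or the autograder.

```python
import pandas as pd

# Hypothetical mini-dataset for illustration (not the survey data above)
toy = pd.DataFrame(
    {
        "has_parking": [1, 1, 1, 0, 0, 0],
        "target": ["happy", "happy", "unhappy", "unhappy", "unhappy", "happy"],
    }
)


def stump_accuracy(df, feature, threshold=0.5, target="target"):
    """Training accuracy of a one-split tree that predicts the majority class on each side."""
    left = df[df[feature] <= threshold][target]
    right = df[df[feature] > threshold][target]
    # On each side, the stump classifies the majority-class examples correctly
    correct = sum(side.value_counts().max() for side in (left, right) if len(side) > 0)
    return correct / len(df)


acc = stump_accuracy(toy, "has_parking")
print(acc)
```

Here each side of the split contains two examples of its majority class and one of the other class, so four of the six examples are classified correctly.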

<div class="alert alert-warning">

Solution_1.1
    
</div>

In [None]:
supportive_colleagues_acc = None

...

In [None]:
grader.check("q1.1")

<br><br>

### 1.2 Separating features and target
rubric={autograde:2}

Recall that in `scikit-learn`, before building a classifier, we need to separate features and target. 
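
As a reminder of the general pattern, here is a minimal sketch on a hypothetical dataframe (the column names `feat_a`, `feat_b`, and `label` are made up for illustration): the features are everything except the target column, and the target is that single column.

```python
import pandas as pd

# Hypothetical dataframe for illustration only
df = pd.DataFrame(
    {
        "feat_a": [1, 0, 1],
        "feat_b": [0, 0, 1],
        "label": ["yes", "no", "yes"],
    }
)

X = df.drop(columns=["label"])  # all columns except the target
y = df["label"]  # the target column as a Series
```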

**Your tasks:**

1. Separate features and target from `train_df` and save them in `X_train_toy` and `y_train_toy`, respectively. 

<div class="alert alert-warning">

Solution_1.2
    
</div>

In [None]:
X_train_toy = None
y_train_toy = None

...

In [None]:
grader.check("q1.2")

<br><br>

### 1.3 Create a decision tree classifier object
rubric={autograde:1}

**Your tasks:**

1. Create a `DecisionTreeClassifier` object with `random_state=16` and store it in a variable called `toy_tree`.

<div class="alert alert-warning">

Solution_1.3
    
</div>

In [None]:
# Import the decision tree classifier
...

toy_tree = None

...

In [None]:
grader.check("q1.3")

<br><br>

### 1.4 `fit` the decision tree classifier 
rubric={autograde:1}

**Your tasks:**

1. Now train a decision tree model by calling `fit` on `toy_tree` with `X_train_toy` and `y_train_toy` created above. 

<div class="alert alert-warning">

Solution_1.4
    
</div>

In [None]:
...

In [None]:
grader.check("q1.4")

<br><br>

### 1.5 Visualize the trained decision tree
rubric={autograde:2}

**Your tasks:**

1. Visualize the trained decision tree model using the same function we used in the lecture notes (we have copied it here for you). Save the visualization tree returned by the function below in a variable called `toy_tree_viz`.  

In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree


def display_tree(model, filled=True, impurity=False, ax=None, figsize=(12, 8), **kwargs):
    if ax is None:
        ax = plt.subplots(figsize=figsize)[1]
    return plot_tree(
        model,
        feature_names=model.feature_names_in_,
        class_names=model.classes_.astype(str) if hasattr(model, 'classes_') else None, # To avoid errors when using regression trees
        filled=filled,
        impurity=impurity,
        ax=ax,
        **kwargs,
    )

<div class="alert alert-warning">

Solution_1.5
    
</div>

In [None]:
toy_tree_viz = None

...

toy_tree_viz;

In [None]:
grader.check("q1.5")

<br><br>

### 1.6 Depth of the tree
rubric={autograde:1}

**Your tasks:**

1. What's the depth of the learned decision tree model? Save it as an integer in the variable `toy_depth` below. Hint: You can either input the depth manually by looking at your visualization above or use the `.get_depth()` method of the decision tree object.

<div class="alert alert-warning">

Solution_1.6
    
</div>

In [None]:
toy_depth = None

...

In [None]:
grader.check("q1.6")

<br><br>

### 1.7 Accuracy calculation
rubric={autograde:1}

**Your tasks:**

1. Evaluate the `toy_tree` on the training data (i.e., call `score()` on `X_train_toy` and `y_train_toy`) and store the score in a variable called `train_acc`.

<div class="alert alert-warning">

Solution_1.7
    
</div>

In [None]:
train_acc = None

...

In [None]:
grader.check("q1.7")

<br><br>

<!-- BEGIN QUESTION -->

### 1.8 Discussion
rubric={reasoning:2}

**Your tasks:**

1. Do you get perfect training accuracy? Why or why not? 

<details><summary>Solutions</summary>

We do not get perfect training accuracy. Notice that the model made an "error" on the example with index 8; the original target is "unhappy" and the predicted one is "happy". This is because there is an inconsistency in the training data: two examples in the dataset have exactly the same feature vector but different targets.
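
One way to look for this kind of inconsistency is to group the rows by their feature values and count the distinct targets in each group. The sketch below uses a hypothetical dataframe with made-up columns `f1` and `f2`; it is an illustration, not the method used to produce the solution above.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 share the same features but have different targets
df = pd.DataFrame(
    {
        "f1": [1, 1, 0, 1],
        "f2": [0, 0, 1, 1],
        "target": ["happy", "unhappy", "happy", "happy"],
    }
)
feature_cols = ["f1", "f2"]

# For every row, count the distinct targets seen for its feature combination
n_labels = df.groupby(feature_cols)["target"].transform("nunique")
conflicts = df[n_labels > 1]
print(conflicts)
```

No classifier that predicts from these features alone can get both conflicting rows right, which caps the achievable training accuracy.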

</details>

<div class="alert alert-warning">

Solution_1.8
    
</div>

<!-- END QUESTION -->

<br><br>

### 1.9 Predicting on the offer data 
rubric={autograde:3}

Recall that our goal is to predict in which jobs you are likely to be happy. The `offer_df` dataframe below has all the job offers you have received. 

**Your tasks:**

1. Using the trained decision tree above, predict the targets for all examples in `offer_df` and store them in the `predictions` variable below. In which of the job offers is the model predicting that you will be happy?

In [None]:
offer_df

<div class="alert alert-warning">

Solution_1.9
    
</div>

In [None]:
...

predictions

In [None]:
grader.check("q1.9")

<br><br><br><br>

## Exercise 2: Decision trees on Spotify Song Attributes dataset <a name="2"></a>
<hr>

### Introducing the dataset
  
For the rest of this practice session you'll be using [a dataset of Spotify Song Attributes](https://raw.githubusercontent.com/UBC-MDS/intro-to-ml-workshop/main/practice/p1/data/spotify.csv). The dataset contains a number of features of songs from 2017 and a binary variable `target` that represents whether the user liked the song (encoded as 1) or not (encoded as 0). See the documentation of all the features [here](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/). 

### 2.2 Data splitting 
rubric={autograde:2}

We have provided the code to read in the data CSV directly from the URL and to store it as a pandas dataframe named `spotify_df`.


**Your tasks:**

1. Split the dataframe into `train_df` and `test_df` with `random_state=123` and `test_size=0.2`. 
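
As a reminder of how `train_test_split` behaves when it is given a whole dataframe, here is a small sketch on hypothetical data (the dataframe, `test_size`, and `random_state` values below are made up and differ from the ones requested in the task).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataframe with 8 rows
df = pd.DataFrame({"x": range(8), "label": [0, 1] * 4})

# A single dataframe in, two dataframes out; rows are shuffled before splitting
train_part, test_part = train_test_split(df, test_size=0.25, random_state=0)
print(train_part.shape, test_part.shape)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows end up in each part every time the cell is run.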

<div class="alert alert-warning">

Solution_2.2
    
</div>

In [None]:
# To simplify the problem, we are only keeping a subset of the original columns
url = 'https://raw.githubusercontent.com/UBC-MDS/intro-to-ml-workshop/main/practice/p1/data/spotify.csv'
spotify_df = pd.read_csv(url, index_col=0)[['acousticness', 'danceability', 'liveness', 'tempo', 'energy', 'valence', 'loudness', 'target']]
spotify_df

In [None]:
# Import the train test split function
...

train_df, test_df = ..., ...

...

In [None]:
grader.check("q2.2")

<br><br>

### Exploratory data analysis (EDA)
rubric={autograde:2}

Right after splitting our data into train and test, we want to do some exploratory data analysis on the training dataframe. This analysis can help us identify any data cleaning we need to do, which features could be informative for the target value, which models might be suitable for our problem, and more. The golden rule applies here too: the information we gather from the EDA will inform our downstream data science decisions, so we don't want any information from the test dataset to be used here.

Since we didn't cover EDA in this workshop, we will show you a couple of summary statistics and plots that can be useful. For more inspiration on what to look at during EDA, please [see some examples here](https://joelostblom.github.io/altair_ally/examples.html).

In [None]:
# Show summary statistics for each column
train_df.describe()

The starter code below produces distribution plots and pairwise scatter plots for all the features, separated for positive (target=1, i.e., user liked the song) and negative (target=0, i.e., user disliked the song) examples. For example, the `loudness` distribution shows that extremely quiet songs tend to be disliked (more blue than orange on the far left), and very loud songs also tend to be disliked (more blue than orange on the far right).

Let's say that, for a particular feature, the distribution plots of that feature are identical for the two target classes. Does that mean the feature is not useful for predicting the target class?
No, the feature might still be useful, because it may be predictive in conjunction with other features. For example, the valence feature density plots (below) do indeed overlap considerably. But it may be the case that very high valence in conjunction with low tempo is very predictive of a liked song. This type of pattern would not show up in the individual density plots, and might not show up in the pairwise scatter plots either if the relationship is complex, but a decision tree could potentially still learn it.
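
The sketch below illustrates this idea with a deliberately extreme, hypothetical example (XOR-style data that is unrelated to the Spotify dataset): each feature on its own looks identical across the two classes, yet the two features together determine the label.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical XOR-style data: the label is 1 only when exactly one feature is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10)
y = X[:, 0] ^ X[:, 1]

# A tree that only sees the first feature
single = DecisionTreeClassifier(random_state=0).fit(X[:, [0]], y)
# A tree that sees both features
both = DecisionTreeClassifier(random_state=0).fit(X, y)

acc_single = single.score(X[:, [0]], y)
acc_both = both.score(X, y)
print(acc_single, acc_both)
```

In this construction, splitting on one feature alone leaves both sides evenly mixed, so the single-feature tree cannot beat chance, while the tree with access to both features can separate the classes.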

In [None]:
import seaborn as sns


sns.pairplot(train_df, hue='target', corner=True)

<br><br><br><br>

## Model building
<hr>

Now that we did some preliminary exploratory data analysis (EDA), let's move on to modeling. 

<br><br>

### 3.1 Creating `X` and `y`
rubric={autograde:2}

**Your tasks:**

1. Separate `X` and `y` from `train_df` and `test_df` from the previous exercise and store them as `X_train`, `y_train`, `X_test`, `y_test`, respectively.

<div class="alert alert-warning">

Solution_3.1
    
</div>

In [None]:
X_train = None
y_train = None
X_test = None
y_test = None

...

In [None]:
grader.check("q3.1")

<br><br>

<br><br>

### 3.2 The baseline model: `DummyClassifier`
rubric={autograde:2}

**Your tasks:**
1. Create a `DummyClassifier` object and carry out 10-fold cross-validation with it using `cross_validate` on `X_train` and `y_train`. Pass `return_train_score=True` to `cross_validate` so that the train scores are returned as well, and store the results as a dataframe in the `dummy_score` variable below.
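
As a reminder of what `cross_validate` returns, here is a minimal sketch on synthetic data. The data from `make_classification`, the `most_frequent` strategy, and `cv=5` are assumptions chosen for illustration; they are not the values asked for in this task.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

baseline = DummyClassifier(strategy="most_frequent")

# cross_validate returns a dict of arrays with one entry per fold,
# which converts directly into a dataframe with one row per fold
cv_results = pd.DataFrame(
    cross_validate(baseline, X_demo, y_demo, cv=5, return_train_score=True)
)
print(cv_results)
```

The resulting dataframe has the columns `fit_time`, `score_time`, `test_score`, and `train_score`.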

<div class="alert alert-warning">

Solution_3.2
    
</div>

In [None]:
# Import the dummy classifier class and the cross_validate function
...

dummy_score = None
...

In [None]:
grader.check("q3.2")

<br><br>

### 3.3 Create a decision tree classifier object
rubric={autograde:2}

**Your tasks:**

1. Create a `DecisionTreeClassifier` with `random_state=123` and store it in a variable called `spotify_tree`.

<div class="alert alert-warning">

Solution_3.3
    
</div>

In [None]:
# Make the necessary import
...

spotify_tree = None

...

In [None]:
grader.check("q3.3")

<br><br>

### 3.4 Cross-validation with `DecisionTreeClassifier`
rubric={autograde:4}

**Your tasks:** 

1. Carry out 10-fold cross validation with the `spotify_tree` object above using `cross_validate` on `X_train` and `y_train`. Pass `return_train_score=True` to `cross_validate`. Save the results as a pandas dataframe in a variable called `dt_scores_df`. 

<div class="alert alert-warning">

Solution_3.4
    
</div>

In [None]:
dt_scores_df = None

...

In [None]:
grader.check("q3.4")

<br><br>

## Hyperparameters <a name="4"></a>
<hr>

### 4.1 Train and cross-validation plots
rubric={autograde:12}

In this exercise, you'll experiment with the `max_depth` hyperparameter of the decision tree classifier. See the [`DecisionTreeClassifier` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for more details.

**Your tasks:**

1. Explore the `max_depth` hyperparameter. Run 10-fold cross-validation for trees with the following values of `max_depth`: `np.arange(1, 25, 2)`. Set the `random_state` of `DecisionTreeClassifier` to 123 in each case for reproducibility. 
2. For each `max_depth`, get both the mean train accuracy and the mean cross-validation accuracy.
3. Make a plot with `max_depth` on the *x*-axis and the train and cross-validation accuracies on the *y*-axis. That is, your plot should have two curves, one for train and one for cross-validation. Include a legend to specify which is which and make sure each curve and the axes have reasonable names. Save the plot to `max_depth_plot`.
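
The general shape of such a hyperparameter sweep can be sketched as follows. This is one possible approach on synthetic data; the data from `make_classification`, the depth values, `cv=5`, and the column names are all assumptions for illustration, and the autograder may expect something different.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, somewhat noisy data for illustration only
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, flip_y=0.2, random_state=0
)

rows = []
for depth in [1, 3, 5, 10]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_validate(tree, X_demo, y_demo, cv=5, return_train_score=True)
    # Average the per-fold scores to get one number per depth
    rows.append(
        {
            "max_depth": depth,
            "mean_train_score": scores["train_score"].mean(),
            "mean_cv_score": scores["test_score"].mean(),
        }
    )

results = pd.DataFrame(rows)
print(results)
```

A dataframe in this shape can be plotted directly, for example with `results.set_index("max_depth").plot()`, which draws one curve per score column and adds a legend.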


**There are some automatic checks on this question, but they don't represent the only way of solving it, so check in with us about whether your chart looks reasonable if you think it is correct but the tests are failing.**

In [None]:
depths = np.arange(1, 25, 2)
depths

<div class="alert alert-warning">

Solution_4.1
    
</div>

In [None]:
# max_depth_plot: the figure plotted for this exercise
max_depth_plot = None

...

In [None]:
grader.check("q4.1")

<br><br>

### 4.3 Picking the best value for `max_depth`
rubric={autograde:2}

Before continuing, think about how changing the `max_depth` hyperparameter affects the training and cross-validation accuracy. 

<details><summary>Solution</summary>
    
For the training data, a higher value of the `max_depth` parameter results in higher accuracy. When the accuracy is 1.0, it means that the model is able to classify all training examples perfectly. This happens because for higher `max_depth` values, the decision tree learns a specific rule for almost all examples in the training data. For the cross-validation scores, the accuracy initially increases a bit and then goes back down, since the model is no longer learning the general relationship between the input features and the target, but is instead learning the noise in the training data.

</details>


**Your tasks:**
1. From your results, pick the "best" `max_depth`, the one which gives the maximum cross-validation score. Store it in a variable called `best_max_depth` as an integer. 
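
If your results are stored in a dataframe, the row with the highest mean cross-validation score can be looked up with `idxmax`. The sketch below uses a hypothetical results table with made-up scores and column names; your own table will differ.

```python
import pandas as pd

# Hypothetical results table with made-up scores, for illustration only
results = pd.DataFrame(
    {
        "max_depth": [1, 3, 5, 7],
        "mean_cv_score": [0.62, 0.70, 0.68, 0.66],
    }
)

# idxmax gives the index label of the row with the highest score
best_idx = results["mean_cv_score"].idxmax()
# Convert from a NumPy integer to a plain Python int
best_depth = int(results.loc[best_idx, "max_depth"])
print(best_depth)
```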

<div class="alert alert-warning">

Solution_4.3   
    
</div>

In [None]:
best_max_depth = None

...

In [None]:
grader.check("q4.3")

<br><br>

### 4.4 Final assessment on the test split 
rubric={autograde:2}

Now that we have our finalized model, we are ready to evaluate it on the test set!

**Your tasks:**

1. Create a decision tree model `best_model` using the `best_max_depth` you chose in the previous exercise. 
2. Fit the `best_model` on the _entire training set_ (`X_train` and `y_train`). 
3. Compute the test score (on `X_test` and `y_test`) and store it in a variable called `test_score` below. 

<div class="alert alert-warning">

Solution_4.4
    
</div>

In [None]:
best_model = None
test_score = None

...

In [None]:
grader.check("q4.4")

<br><br>

### 4.5 Visualizing Spotify decision tree
rubric={autograde:3}

**Your tasks:**
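1. Visualize `best_model` with the `display_tree` function from Exercise 1.5. The copy of `display_tree` in this notebook forwards extra keyword arguments to `plot_tree`, which has no `counts` parameter, so do not pass `counts=True`; the sample counts are shown in each node by default. Store the visualization in the `spotify_tree_viz` variable below. 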
2. Which feature did the model pick as the best feature? In other words, what feature did the model use for the first split? Store the name of the feature as a string in the variable called `best_feat` below. 

<div class="alert alert-warning">

Solution_4.5
    
</div>

In [None]:
spotify_tree_viz = None
best_feat = None

...

spotify_tree_viz;

In [None]:
grader.check("q4.5")

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-warning">

Solution_4.6
    
</div>

### 4.6 Analysis
rubric={reasoning:6}

**Your tasks:**

Reflect on the following questions:

1. How do the test scores compare to the cross-validation scores? Briefly discuss. 
2. Why can't you simply pick the value of `max_depth` that gives the best accuracy on the training data? (Answer in maximum 2 to 3 sentences.)
3. Do you think that the `max_depth` you chose would generalize to other "spotify" datasets (i.e., data on other spotify users)?


<details><summary>Solutions</summary>
    
1. We see that the test score is a bit higher than the cross-validation score, but I would not trust this result too much. Looking at the plot, we can see that the cross-validation score curve is quite "bumpy", and even if `max_depth=5` is a pretty good value, there is probably also some luck involved.
2. If we were to pick `max_depth` based only on the training data, we would pick the highest value of the parameter, as it performs best on the training set. (See the table and plot in 4.1.) That model would be overfit, however, and would not generalize well to the validation data. That's why we treat it as a hyperparameter and pick the best value based on the cross-validation accuracy. 
3. Whether the chosen `max_depth` generalizes to other users depends on how similar the new user is to this user; in other words, on whether the training data for this user is representative of the new user. That said, the chosen `max_depth` of 5 would most likely do better than a higher depth would.  
    
</details>

<!-- END QUESTION -->

### Congratulations on practice 1! Well done 👏👏! 

![](img/eva-well-done.png)