In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw5.ipynb")

# CPSC 330 - Applied Machine Learning 

## Homework 5: Putting it all together 
### Associated lectures: All material till lecture 13 

**Due date: See the [Calendar](https://htmlpreview.github.io/?https://github.com/UBC-CS/cpsc330/blob/master/docs/calendar.html).**

## Table of contents
0. [Submission instructions](#si)
1. [Understanding the problem](#1)
2. [Data splitting](#2)
3. [EDA](#3)
4. (Optional) [Feature engineering](#4)
5. [Preprocessing and transformations](#5) 
6. [Baseline model](#6)
7. [Linear models](#7)
8. [Different models](#8)
9. (Optional) [Feature selection](#9)
10. [Hyperparameter optimization](#10)
11. [Interpretation and feature importances](#11) 
12. [Results on the test set](#12)
13. [Summary of the results](#13)
14. (Optional) [Your takeaway from the course](#15)

## Submission instructions <a name="si"></a>
<hr>
rubric={points:4}

You will receive marks for correctly submitting this assignment. To submit this assignment, follow the instructions below:

- **You may work on this assignment in a group (group size <= 4) and submit your assignment as a group.** 
- Below are some instructions on working as a group.  
    - The maximum group size is 4. 
    - You can choose your own group members. 
    - Use group work as an opportunity to collaborate and learn new things from each other. 
    - Be respectful to each other and make sure you understand all the concepts in the assignment well. 
    - It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. [Here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members) are some instructions on adding group members in Gradescope.  
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).
- Upload the .ipynb file to Gradescope.
- **If the .ipynb file is too big or doesn't render on Gradescope for some reason, also upload a pdf or html in addition to the .ipynb.** 
- Make sure that your plots/output are rendered properly in Gradescope.

Model used: Random Forest Classifier, metric: accuracy, test score: 0.820 (+/- 0.004)

## Imports

In [None]:
import os
%matplotlib inline
import string
import sys
from collections import deque
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import eli5
import shap


sys.path.append("code/.")
from plotting_functions import *


from sklearn import datasets
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.feature_selection import RFECV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier
from utils import *
from pandas_profiling import ProfileReport


## Introduction <a name="in"></a>

In this homework you will be working on an open-ended mini-project, where you will put all the different things you have learned so far together to solve an interesting problem.

A few notes and tips when you work on this mini-project: 

#### Tips
1. This mini-project is open-ended, and while working on it, there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary. 
2. **Do not include everything you ever tried in your submission** -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code. 
3. If you realize that you are repeating a lot of code, try to organize it in functions. Clear presentation of your code, experiments, and results is the key to success in this lab. You may use code from lecture notes or previous lab solutions with appropriate attributions. 

#### Assessment
We plan to grade fairly and leniently. We don't have some secret target score that you need to achieve to get a good grade. **You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results.** For example, if you just have a bunch of code and no text or figures, that's not good. If you do a bunch of sane things and get a lower accuracy than your friend, don't sweat it.


#### A final note
Finally, the style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "a few hours" (15-20 hours???) is a good guideline for this project. Of course, if you're having fun you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or to get the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found this kind of lab useful and fun, and I hope you enjoy it as well. 

<br><br>

<!-- BEGIN QUESTION -->

## 1. Pick your problem and explain the prediction problem <a name="1"></a>
<hr>
rubric={points:3}

In this mini project, you will be working on a classification problem of predicting whether a credit card client will default or not. 
For this problem, you will use [Default of Credit Card Clients Dataset](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). In this data set, there are 30,000 examples and 24 features, and the goal is to estimate whether a person will default (fail to pay) their credit card bills; this column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. You may take some ideas and compare your results with [the associated research paper](https://www.sciencedirect.com/science/article/pii/S0957417407006719), which is available through [the UBC library](https://www.library.ubc.ca/). 

**Your tasks:**

1. Spend some time understanding the problem and what each feature means. You can find this information in the documentation on [the dataset page on Kaggle](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). Write a few sentences on your initial thoughts on the problem and the dataset. 
2. Download the dataset and read it as a pandas dataframe. 

<div class="alert alert-warning">
    
Solution_1
    
</div>

In [None]:
credit_df = pd.read_csv("data/UCI_Credit_Card.csv")
credit_df.head()

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 2. Data splitting <a name="2"></a>
<hr>
rubric={points:2}

**Your tasks:**

1. Split the data into train (70%) and test (30%) portions with `random_state=123`.

> If your computer cannot handle training on 70% training data, make the test split bigger.  

In [None]:
credit_train, credit_test = train_test_split(credit_df, test_size = 0.3, random_state = 123)

<div class="alert alert-warning">
    
Solution_2
    
</div>

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 3. EDA <a name="3"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Perform exploratory data analysis on the train set.
2. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
3. Summarize your initial observations about the data. 
4. Pick appropriate metric/metrics for assessment. 

<div class="alert alert-warning">
    
Solution_3
    
</div>

In [None]:
credit_train.head()

In [None]:
X_train = credit_train.drop(columns = ["default.payment.next.month"])
y_train = credit_train["default.payment.next.month"]
X_test = credit_test.drop(columns = ["default.payment.next.month"])
y_test = credit_test["default.payment.next.month"]

The features are on very different scales. Since `default.payment.next.month` is either 0 or 1, this is a binary classification problem.

In [None]:
credit_train.info()

All features are numeric. Some categorical features, such as SEX, EDUCATION, MARRIAGE, and the target, have already been encoded as numbers.

In [None]:
profile = ProfileReport(credit_train, title = "Pandas Profiling Report")
profile

From the pandas profiling report, there are many things we can observe. First, there are about 50% more women than men in the data. Second, PAY_0 is highly correlated with the target, but half of the column is filled with zeroes. This makes it hard to trust, because the dataset description does not say what 0 means. My assumption is that 0 means the bill was paid, but either there was an issue with the processing, it was paid late (though less than a month late), or the full amount was not paid.

In [None]:
# EDA is done on the train set only, so the test labels stay unseen
y_train.value_counts(normalize=True)

There is class imbalance, since clients are more likely to pay their credit card bill than to default. Since defaulting on the next payment is rare, I will ignore the class imbalance and use accuracy, the default scikit-learn score for classifiers, as the metric.
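One caveat worth keeping in mind when reading accuracy on imbalanced labels: a model that always predicts the majority class can score well on accuracy while never identifying a single default. The sketch below illustrates this on synthetic labels with an arbitrary 20% positive rate (an assumption chosen for illustration, not the credit card data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Synthetic, imbalanced labels (assumption: about 20% positives, for illustration only)
rng = np.random.default_rng(123)
y = (rng.random(1000) < 0.2).astype(int)
X = np.zeros((len(y), 1))  # features are irrelevant for a most-frequent baseline

# Baseline that always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)

acc = accuracy_score(y, y_pred)
rec = recall_score(y, y_pred, zero_division=0)
f1 = f1_score(y, y_pred, zero_division=0)
print(f"accuracy={acc:.3f}, recall={rec:.3f}, f1={f1:.3f}")
```

Passing something like `scoring=["accuracy", "recall", "f1"]` to `cross_validate` would report these metrics side by side for any of the models in this notebook.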

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## (Optional) 4. Feature engineering <a name="4"></a>
<hr>
rubric={points:1}

**Your tasks:**

1. Carry out feature engineering. In other words, extract new features relevant for the problem and work with your new feature set in the following exercises. You may have to go back and forth between feature engineering and preprocessing. 

<div class="alert alert-warning">
    
Solution_4
    
</div>

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 5. Preprocessing and transformations <a name="5"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Identify different feature types and the transformations you would apply on each feature type. 
2. Define a column transformer, if necessary. 

<div class="alert alert-warning">
    
Solution_5
    
</div>

All features are numeric, and the ordinal ones are already encoded in order, so neither one-hot encoding nor ordinal encoding is needed. There are no missing values, so no imputation is needed. Scaling is applied to the continuous numeric features; the target and the features that are already encoded as categories are left unscaled.

In [None]:
drop_features = ["ID"]
passthrough_features = ["SEX", "EDUCATION", "MARRIAGE", "PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
numeric_features = ["LIMIT_BAL", "BILL_AMT1","BILL_AMT2","BILL_AMT3","BILL_AMT4","BILL_AMT5","BILL_AMT6",
                      "PAY_AMT1","PAY_AMT2","PAY_AMT3","PAY_AMT4","PAY_AMT5","PAY_AMT6"]



In [None]:
preprocessor = make_column_transformer(
    ("drop", drop_features),
    (StandardScaler(), numeric_features),
    ("passthrough", passthrough_features)
)

preprocessor.fit(X_train)

In [None]:
new_columns = numeric_features + passthrough_features

X_train_enc = pd.DataFrame(
 preprocessor.transform(X_train), index=X_train.index, columns=new_columns
)
X_train_enc.head()

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 6. Baseline model <a name="6"></a>
<hr>
rubric={points:2}

**Your tasks:**
1. Try `scikit-learn`'s baseline model and report results.

<div class="alert alert-warning">
    
Solution_6
    
</div>

In [None]:
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns the mean and standard deviation of each cross-validation result.
    """
    scores = cross_validate(model, X_train, y_train, **kwargs)
    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()
    out_col = []
    for i in range(len(mean_scores)):
        # Positional access with .iloc; integer keys on a string-labelled Series are deprecated
        out_col.append("%0.3f (+/- %0.3f)" % (mean_scores.iloc[i], std_scores.iloc[i]))
    return pd.Series(data=out_col, index=mean_scores.index)



Above code was taken from lecture 3

In [None]:
dummy = DummyClassifier()
results = {}
results["Dummy"] = mean_std_cross_val_scores(dummy, X_train, y_train, cv = 10, return_train_score = True)
pd.DataFrame(results).T

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 7. Linear models <a name="7"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Try a linear model as a first real attempt. 
2. Carry out hyperparameter tuning to explore different values for the complexity hyperparameter. 
3. Report cross-validation scores along with standard deviation. 
4. Summarize your results.

<div class="alert alert-warning">
    
Solution_7
    
</div>

In [None]:
pipe_lr = make_pipeline(preprocessor, LogisticRegression(max_iter = 2000, random_state = 123))

pipe_lr.fit(X_train, y_train)

results["First Logistic Regression"] = mean_std_cross_val_scores(pipe_lr, X_train, y_train, return_train_score = True)
pd.DataFrame(results).T

In [None]:
scores_dict = {
    "C": 10.0 ** np.arange(-3, 3, 1),
    "mean_train_scores": list(),
    "mean_cv_scores": list(),
}
for C in scores_dict["C"]:
    pipe_lr1 = make_pipeline(
        preprocessor,
        LogisticRegression(max_iter=2000, C=C, random_state=123),
    )
    scores = cross_validate(pipe_lr1, X_train, y_train, return_train_score=True)
    scores_dict["mean_train_scores"].append(scores["train_score"].mean())
    scores_dict["mean_cv_scores"].append(scores["test_score"].mean())
results_df = pd.DataFrame(scores_dict)
results_df

# Code taken and adapted from lecture 7
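The same sweep over `C` can also be expressed with `GridSearchCV`, which additionally records the standard deviation of the validation scores. The sketch below is a minimal version under stated assumptions: it uses synthetic data from `make_classification` and a plain `StandardScaler` in place of the notebook's preprocessor, so the numbers it prints say nothing about the credit card data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption, not the credit dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=123)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000, random_state=123))
param_grid = {"logisticregression__C": 10.0 ** np.arange(-3, 3, 1)}

grid = GridSearchCV(pipe, param_grid, cv=5, return_train_score=True)
grid.fit(X_toy, y_toy)

# One row per value of C, with the mean and standard deviation of the validation score
summary = pd.DataFrame(grid.cv_results_)[
    ["param_logisticregression__C", "mean_train_score", "mean_test_score", "std_test_score"]
]
print(summary)
print(grid.best_params_)
```

A widening gap between `mean_train_score` and `mean_test_score` as `C` grows would suggest overfitting, since larger `C` means weaker regularization.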

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 8. Different models <a name="8"></a>
<hr>
rubric={points:12}

**Your tasks:**
1. Try at least 3 other models aside from a linear model. One of these models should be a tree-based ensemble model. 
2. Summarize your results in terms of overfitting/underfitting and fit and score times. Can you beat a linear model? 

<div class="alert alert-warning">
    
Solution_8
    
</div>

I am going to use an SVM, a decision tree, and a random forest.

In [None]:
pipe_svm = make_pipeline(preprocessor, SVC())
pipe_svm.fit(X_train,y_train)
results["SVM"] = mean_std_cross_val_scores(pipe_svm, X_train, y_train, return_train_score = True)

In [None]:
pipe_dt = make_pipeline(preprocessor, DecisionTreeClassifier())
pipe_dt.fit(X_train,y_train)
results["Decision Tree"] = mean_std_cross_val_scores(pipe_dt, X_train, y_train, return_train_score = True)

In [None]:
pipe_rf = make_pipeline(preprocessor, RandomForestClassifier(random_state = 123, n_jobs = -1))
pipe_rf.fit(X_train,y_train)
results["Random forests"] = mean_std_cross_val_scores(pipe_rf, X_train, y_train, return_train_score=True)

In [None]:
pd.DataFrame(results).T

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## (Optional) 9. Feature selection <a name="9"></a>
<hr>
rubric={points:2}

**Your tasks:**

Make some attempts to select relevant features. You may try `RFECV` or forward selection for this. Do the results improve with feature selection? Summarize your results. If you see improvements in the results, keep feature selection in your pipeline. If not, you may abandon it in the next exercises. 

<div class="alert alert-warning">
    
Solution_9
    
</div>

In [None]:
rfe_cv = RFECV(LogisticRegression(max_iter = 2000), cv = 10)
rfe_pipe = make_pipeline(preprocessor, rfe_cv,SVC())
rfe_pipe.fit(X_train,y_train)
results["RFECV SVM"] = mean_std_cross_val_scores(rfe_pipe, X_train, y_train, return_train_score = True)

# Adapted from lecture 13

In [None]:
pd.DataFrame(results).T

Will abandon feature selection.
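To see what `RFECV` actually selected before deciding to keep or drop it, the fitted selector exposes `n_features_` and `support_`. The following is a small sketch on synthetic data (an assumption; the feature counts here are arbitrary and unrelated to the credit card data).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data with 3 informative features out of 10 (illustrative assumption)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=123
)

selector = RFECV(LogisticRegression(max_iter=2000), cv=5)
selector.fit(X_toy, y_toy)

# Number of features kept and a boolean mask over the input columns
print("features kept:", selector.n_features_)
print("mask:", selector.support_)
```

In the notebook's pipeline the fitted selector would be reachable as `rfe_pipe.named_steps["rfecv"]`, following the naming that `make_pipeline` applies to its steps.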

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 10. Hyperparameter optimization <a name="10"></a>
<hr>
rubric={points:10}

**Your tasks:**

Make some attempts to optimize hyperparameters for the models you've tried and summarize your results. In at least one case you should be optimizing multiple hyperparameters for a single model. You may use `sklearn`'s methods for hyperparameter optimization or fancier Bayesian optimization methods. 
  - [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)   
  - [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
  - [scikit-optimize](https://github.com/scikit-optimize/scikit-optimize) 

<div class="alert alert-warning">
    
Solution_10
    
</div>

In [None]:
param_grid_svm = {
    "svc__gamma": [0.001, 0.01, 0.1, 1.0, 10, 100],
    "svc__C": np.linspace(2, 3, 6),
}

param_grid_lr = {
    "logisticregression__C": np.arange(1, 16),
}

param_grid_dt = {
    "decisiontreeclassifier__max_depth": np.arange(1, 16),
}

param_grid_rf = {
    "randomforestclassifier__n_estimators": np.arange(1, 16),
    "randomforestclassifier__max_depth": np.arange(1, 16),
    "randomforestclassifier__max_features": np.arange(1, 16),
}

# Adapted from lecture 8

In [None]:
random_search_svm = RandomizedSearchCV(
    pipe_svm, param_distributions=param_grid_svm, n_jobs=4, n_iter=20, cv=5, random_state=123
)

random_search_svm.fit(X_train, y_train);

In [None]:
random_search_lr = RandomizedSearchCV(
    pipe_lr, param_distributions=param_grid_lr, n_jobs=-1, n_iter=15, cv=5, random_state=123
)

random_search_lr.fit(X_train, y_train);

In [None]:
random_search_dt = RandomizedSearchCV(
    pipe_dt, param_distributions=param_grid_dt, n_jobs=-1, n_iter=15, cv=5, random_state=123
)

random_search_dt.fit(X_train, y_train);

In [None]:
random_search_rf = RandomizedSearchCV(
    pipe_rf, param_distributions=param_grid_rf, n_jobs=-1, n_iter=15, cv=5, random_state=123
)

random_search_rf.fit(X_train, y_train);

In [None]:
pd.DataFrame(random_search_svm.cv_results_)[
    [
        "mean_test_score",
        "param_svc__gamma",
        "param_svc__C",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index().T

In [None]:
pd.DataFrame(random_search_lr.cv_results_)[
    [
        "mean_test_score",
        "param_logisticregression__C",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index().T

In [None]:
pd.DataFrame(random_search_dt.cv_results_)[
    [
        "mean_test_score",
        "param_decisiontreeclassifier__max_depth",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index().T

In [None]:
pd.DataFrame(random_search_rf.cv_results_)[
    [
        "mean_test_score",
        "param_randomforestclassifier__n_estimators",
        "param_randomforestclassifier__max_features",
        "param_randomforestclassifier__max_depth",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index().T

In [None]:
pipe_rf_opt = make_pipeline(preprocessor, RandomForestClassifier(n_estimators = 13, max_features = 15, max_depth = 5, random_state = 123, n_jobs = -1))
pipe_rf_opt.fit(X_train,y_train)
results["Random forests Optimized"] = mean_std_cross_val_scores(pipe_rf_opt, X_train, y_train, return_train_score=True)

In [None]:
pipe_svm_opt = make_pipeline(preprocessor, SVC(gamma = 0.01, C = 3))
pipe_svm_opt.fit(X_train,y_train)
results["SVM Optimized"] = mean_std_cross_val_scores(pipe_svm_opt, X_train, y_train, return_train_score = True)
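Rather than copying the winning values by hand, a fitted search object exposes them as `best_params_`, and the pipeline refit with those values as `best_estimator_`. The sketch below shows this under stated assumptions: synthetic data, a plain `StandardScaler` instead of the notebook's preprocessor, and a smaller search (`n_iter=5`, `cv=3`) so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption, not the credit dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=123)

pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=123))
param_dist = {
    "randomforestclassifier__n_estimators": np.arange(1, 16),
    "randomforestclassifier__max_depth": np.arange(1, 16),
    "randomforestclassifier__max_features": np.arange(1, 9),  # at most the 8 toy features
}

search = RandomizedSearchCV(
    pipe, param_distributions=param_dist, n_iter=5, cv=3, random_state=123
)
search.fit(X_toy, y_toy)

best_params = search.best_params_
best_model = search.best_estimator_  # refit on all of X_toy because refit=True by default
print(best_params, round(search.best_score_, 3))
```

Applied to this notebook, `random_search_rf.best_estimator_` could stand in for a pipeline rebuilt with hard-coded values, which keeps the chosen hyperparameters traceable to the experiment that produced them.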

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 11. Interpretation and feature importances <a name="11"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Use the methods we saw in class (e.g., `eli5`, `shap`) (or any other methods of your choice) to examine the most important features of one of the non-linear models. 
2. Summarize your observations. 

<div class="alert alert-warning">
    
Solution_11
    
</div>

In [None]:
eli5_weights = eli5.explain_weights(
    pipe_rf.named_steps["randomforestclassifier"], feature_names = new_columns
)
eli5_weights

Interpretation of the weights:
We knew PAY_0 was highly correlated with the target, but with so many zeroes in PAY_0, its weight has the very high standard deviation shown above. LIMIT_BAL appears to be a much more consistent predictor of the target than PAY_0. BILL_AMT1 is also weighted highly for predicting the target. This makes sense: if the first bill is large, the following bills are likely to be large too, which could lead to a default.
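As a cross-check on impurity-based weights, permutation importance is a different technique that measures how much a held-out score drops when one feature's values are shuffled. This is not what `eli5.explain_weights` computes for a random forest; it is a swapped-in alternative, sketched here on synthetic data (an assumption) with scikit-learn's `permutation_importance`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 2 informative features out of 6 (illustrative assumption)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=6, n_informative=2, n_redundant=0, random_state=123
)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.3, random_state=123)

rf = RandomForestClassifier(n_estimators=50, random_state=123).fit(X_tr, y_tr)

# Shuffle each feature 10 times on held-out data and record the drop in accuracy
perm = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=123)
for i in perm.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {perm.importances_mean[i]:.3f} +/- {perm.importances_std[i]:.3f}")
```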

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 12. Results on the test set <a name="12"></a>
<hr>

rubric={points:10}

**Your tasks:**

1. Try your best performing model on the test data and report test scores. 
2. Do the test scores agree with the validation scores from before? To what extent do you trust your results? Do you think you've had issues with optimization bias? 
3. Take one or two test predictions and explain these individual predictions (e.g., with SHAP force plots).  

<div class="alert alert-warning">
    
Solution_12
    
</div>

In [None]:
pd.DataFrame(results).T

Using the results above, I will use Random Forests.

In [None]:
pipe_rf_opt = make_pipeline(preprocessor, RandomForestClassifier(n_estimators = 13, max_features = 15, max_depth = 5, random_state = 123, n_jobs = -1))
pipe_rf_opt.fit(X_train,y_train)
pipe_rf_opt.score(X_test,y_test)

I believe my results are trustworthy, since the difference between the validation score and the test score is only 0.003, which is within one standard deviation of the cross-validation scores. I do not believe I had issues with optimization bias.
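The comparison described above can be made explicit in code. The sketch below computes the cross-validation mean and standard deviation, the held-out test score, and the gap between the two. It runs on synthetic data (an assumption), so the printed values are unrelated to the scores reported in this notebook; only the forest's hyperparameters mirror the ones used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the credit data (assumption)
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=123)

model = RandomForestClassifier(n_estimators=13, max_depth=5, random_state=123)

# Validation scores come from the training portion only
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5)
# The test score uses the held-out portion once, after fitting on all training data
test_score = model.fit(X_tr, y_tr).score(X_te, y_te)

gap = abs(test_score - cv_scores.mean())
print(f"cv: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f}), test: {test_score:.3f}, gap: {gap:.3f}")
```

A gap that is small relative to the spread of the validation scores is consistent with little optimization bias, although a single test split cannot rule it out.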

In [None]:
y_test_reset = y_test.reset_index(drop = True)
y_test_reset

In [None]:
default_yes_ind = y_test_reset[y_test_reset == 1].index.tolist()

ex_default_yes_index = default_yes_ind[2]

In [None]:
pipe_rf_opt.named_steps["randomforestclassifier"].classes_

In [None]:
X_test_enc = pd.DataFrame(
    data = preprocessor.transform(X_test),
    columns = new_columns,
    index = X_test.index,
)
X_test_enc.shape

In [None]:
pipe_rf_opt.named_steps["randomforestclassifier"].predict_proba(X_test_enc)[ex_default_yes_index]

In [None]:
pipe_rf_opt.named_steps["randomforestclassifier"].predict(X_test_enc) [
    ex_default_yes_index
]

In [None]:
rf_opt_explainer = shap.TreeExplainer(pipe_rf_opt.named_steps["randomforestclassifier"])

In [None]:
rf_opt_explainer.expected_value[1] # On average, this is raw score for defaulting payment

In [None]:
rf_opt_explainer.expected_value[0]

In [None]:
X_test_enc = round(X_test_enc, 3)

In [None]:
test_rf_shap_values = rf_opt_explainer.shap_values(X_test_enc)

In [None]:
shap.force_plot(
 rf_opt_explainer.expected_value[1],
 test_rf_shap_values[1][ex_default_yes_index, :],
 X_test_enc.iloc[ex_default_yes_index, :],
 matplotlib=True,
)

We can observe that the features measuring the time between a bill's due date and the day the bill is actually paid are the largest contributors for this example.

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

## 13. Summary of results <a name="13"></a>
<hr>
rubric={points:12}

Imagine that you want to present the summary of these results to your boss and co-workers. 

**Your tasks:**

1. Create a table summarizing important results. 
2. Write concluding remarks.
3. Discuss other ideas that you did not try but could potentially improve the performance/interpretability. 
4. Report your final test score along with the metric you used at the top of this notebook in the [Submission instructions section](#si).

<div class="alert alert-warning">
    
Solution_13
    
</div>

In [None]:
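# Look rows and columns up by label so the table does not depend on cell execution order
summary_df = pd.DataFrame(results).T
table_data = {
    "Model Used": ["Random Forests Classifier"],
    # cross_validate reports the validation score under the name "test_score"
    "Cross-validation Score": [summary_df.loc["Random forests Optimized", "test_score"]],
    "Train Score": [summary_df.loc["Random forests Optimized", "train_score"]],
    "Weight of first payment": ["0.0936 \u00B1 0.0793"],
    "Weight of balance limit": ["0.0673 \u00B1 0.0110"],
    "Weight of first bill": ["0.0650 \u00B1 0.0107"],
}
pd.DataFrame(table_data).T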

Using a Random Forest Classifier model, a test score of 82% was achieved. The train score is negligibly higher at 82.6%. One idea is to revisit support vector machines, since they could be a better model to use, but I do not have access to a workstation or server CPU, and training and optimizing an SVM is too computationally expensive on my machine. I did not engineer additional features, but adding some could perhaps increase model performance.

<!-- END QUESTION -->

<br><br>

<br><br>

<!-- BEGIN QUESTION -->

## (Optional) 14. Your takeaway <a name="15"></a>
<hr>
rubric={points:2}

**Your tasks:**

What is your biggest takeaway from the supervised machine learning material we have learned so far? Please write thoughtful answers.  

<div class="alert alert-warning">
    
Solution_14
    
</div>

I personally feel that the content I have learned thus far is really relevant in today's world of machine learning, and I am grateful for learning it. I really enjoyed doing this assignment alone, as it reassured me that I have the skills to perform supervised machine learning from start to finish.

<!-- END QUESTION -->

<br><br>

**PLEASE READ BEFORE YOU SUBMIT:** 

When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from "1" will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 
4. Make sure that the plots and output are rendered properly in your submitted file. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb so that the TAs can view your submission on Gradescope. 

Congratulations on finishing this project. This was a tricky one but you did it!

In [None]:
from IPython.display import Image

Image("img/eva-well-done.png")