# Midterm Notebook

Welcome to your midterm. Logistics:
- Open book/open note/**no internet**
- You are not allowed to discuss the exam with each other
- Send all questions about the exam to me by email. Do not send public messages about the exam to me or to each other.
- If any clarifications are required, I will post them on Brightspace and update this document.

A note on the **kinds** of answers I expect: As is our style on HW and in class, many of these questions are open-ended and are **not** asking you to repeat what you've read or heard in class. On the contrary, if I read my own words (or a text's) I will mark that down! I expect you to demonstrate your original thoughts. Almost none of these questions call for 3-word answers (a few do, and the question should make that clear!). Having said that, I also don't want you to just type out vocabulary words that we've used in class.

**Tip**: If you feel you can't answer a question, skip it and come back. Sometimes reading the entire exam will help clarify the individual parts. If all else fails, I will award partial credit for effort and for a clear explanation of what you're confused about and why.

**Try and explain your confusion!**

## Changelog

<br/>
<div class='alert alert-info'>

<font size='5'>👾</font> **Note:**  This is **version 1**, updated on 2022-03-18 at 9:00am.

</div>

## Notebook Setup

In [1]:
# imports
import numpy as np
import matplotlib.pyplot as plt
colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]
import seaborn as sns

# 3d figures
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

# creating animations
import matplotlib.animation
from IPython.display import HTML

# styling additions (HTML is already imported above)
style = '''
    <style>
        div.info{
            padding: 15px; 
            border: 1px solid transparent; 
            border-left: 5px solid #dfb5b4; 
            border-color: transparent; 
            margin-bottom: 10px; 
            border-radius: 4px; 
            background-color: #fcf8e3; 
            border-color: #faebcc;
        }
        hr{
            border: 1px solid;
            border-radius: 5px;
        }
    </style>'''
HTML(style)

# Problem 1

For this problem, we're going to use the `penguins` dataset built into `seaborn`.

In [2]:
penguins_df = sns.load_dataset("penguins").dropna()

Note that this gives you a dataframe like the following:

In [3]:
penguins_df

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |


## 1.A

As we did in the "Exploratory Data Analysis" discussion, draw a "scatter plot matrix" of this dataset's 4 numeric features:
- `bill_length_mm`
- `bill_depth_mm`
- `flipper_length_mm`
- `body_mass_g`

and colored by their `species`.

<br/>
<div class='info'>
    
<font size='5'>👇🏽</font> **Note:** This is the figure I want you to generate

</div>

![](penguins_df.png)

In [4]:
# CODE HERE:


<br/>
<div class='info'>
    
<font size='5'>🤔</font> **Question:** Explain what we are looking at in this figure. Explain its diagonal and its off-diagonal elements. Why is this kind of graphic helpful? Give a thorough explanation here!

</div>

---

---

<div class='info'>
    
<font size='5'>🤔</font> **Question:** Based on the figure and your explanation, if I wanted to tell apart the **Adelie** from the **Chinstrap** penguins, **which two features would be the best?** Why? Give a good explanation here! 

</div>

---

---

## 1.B

Now that you've picked these two features, fit a **linear regression** and use it to try to classify these two species. 

**Make sure you report the weight values you've learned!**

<br/>
<div class='info'>

<font size='5'>☝🏽</font> **Hint:**  This is very similar to what we did on HW1!

</div>
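One common way to use least squares as a classifier is sketched below. It is an illustration under assumptions and may differ from the HW1 approach: the two classes are encoded as -1 and +1, the data are two synthetic clusters rather than penguin features, and a column of ones is appended so that the last weight acts as a bias.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic, well-separated clusters standing in for the two species.
X0 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
X = np.vstack([X0, X1])
y = np.concatenate([-np.ones(50), np.ones(50)])  # classes encoded as -1 / +1

# Append a column of ones so the last weight plays the role of a bias term.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Ordinary least squares fit of the +/-1 targets.
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

# Classify by the sign of the regression output.
y_pred = np.sign(X_aug @ w)
accuracy = np.mean(y_pred == y)

print("weights (w1, w2, bias):", w)
print("training accuracy:", accuracy)
```

The learned line is the set of points where `X_aug @ w` equals zero, which is what a regression-line plot would show.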

### Bonus

Show a figure, plotting the datapoints with your regression line.

## 1.C

Fit a **logistic regression** and use it to try to classify these two species. **Make sure you report the values you've learned!**

<br/>
<div class='info'>
    
<font size='5'>☝🏽</font> **Hint:** Remember, these are available under the `.coef_` and `.intercept_` names for logistic regression in `sklearn`

</div>
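To illustrate where those attributes live, here is a minimal sketch on synthetic two-cluster data (not the penguin features). With two features and two classes, `coef_` has shape `(1, 2)` and `intercept_` has shape `(1,)`; the decision boundary is the line where the linear score is zero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic stand-in data: two clusters with labels 0 and 1.
X = np.vstack([
    rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)

# Learned parameters: one weight per feature, plus one intercept.
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]
print("weights:", (w1, w2), "intercept:", b)

# Points on the decision boundary satisfy w1*x1 + w2*x2 + b = 0.
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 5)
x2 = -(w1 * x1 + b) / w2
```

Plotting `x1` against `x2` on top of a scatter of the data gives the 2D boundary line.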

In [5]:
from sklearn.linear_model import LogisticRegression

### Bonus

Show a 2D figure, plotting the datapoints with your logistic regression line.

## 1.D 

Calculate the error in both the linear regression and logistic regression models. Which performs better?
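"Error" can be read in more than one way. One possible reading, assumed here, is the misclassification rate: the fraction of samples whose predicted label differs from the true one. The helper name and the tiny label arrays below are hypothetical.

```python
import numpy as np

def misclassification_rate(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

# Hypothetical labels, just to exercise the helper: 2 of 5 disagree.
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])
rate = misclassification_rate(y_true, y_pred)
print("misclassification rate:", rate)
```

Applying the same function to the predictions of both models puts them on a common scale.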

## 1.E 

<br/>
<div class='info'>

<font size='5'>🤔</font> **Question:**  Wait a minute. How did we use linear regression and logistic regression for **classification**? Explain why these are called regression models, and how we used them for classification!

</div>

---

---

# Problem 2

Let's continue with the same dataset as above. For this problem we're going to train:
- KNN with `K = 4`
- SVM with `C` "very high", and "very low".

## 2.A
First, let's answer some questions:
<br/>
<div class='info'>

<font size='5'>🤔</font> **Question:**  What effect does `K` have on a KNN? What happens when it's small? When it's large? 

</div>

---

---

<br/>
<div class='info'>

<font size='5'>🤔</font> **Question:**  What about the `C` parameter on SVMs? What effect does it have?

</div>

---

---

## 2.B

OK, let's code it up! Here, I want you to:

- Split your dataset into 80% training, and 20% testing. 
- Normalize your features
- Train **each** model on the training data, and then evaluate them on the testing data. 
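The three steps above can be sketched as follows. This is an illustration under assumptions: the data are synthetic, `C=1e3` and `C=1e-3` are arbitrary stand-ins for "very high" and "very low", and the linear kernel is a choice made here, not something the exam specifies. The scaler is fit on the training split only, so no test-set statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# Synthetic stand-in data: two overlapping clusters with labels 0 and 1.
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2)),
    rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

# 80% training / 20% testing, keeping the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Normalize using statistics from the training split only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# C values and kernel are assumptions made for this sketch.
models = {
    "KNN (K=4)": KNeighborsClassifier(n_neighbors=4),
    "SVM (C=1e3)": SVC(kernel="linear", C=1e3),
    "SVM (C=1e-3)": SVC(kernel="linear", C=1e-3),
}

scores = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    scores[name] = model.score(X_test_s, y_test)  # accuracy on held-out data
    print(f"{name}: test accuracy = {scores[name]:.2f}")
```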

## 2.C

<br/>
<div class='info'>

<font size='5'>🤔</font> **Question**: Which model does better? Can you think of reasons why this model would do better in this case? Think about this one! Give a good answer!

</div>

---

---

# Problem 3

Let's discuss a **regularized linear regression** given by:

$$
\min_w \| \boldsymbol{X}w-y \|^2_2 + \alpha \|w\|^2_2 
$$
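For reference, a standard derivation that is not part of the original statement: assuming $\alpha > 0$ and no separate intercept term, setting the gradient of this objective to zero yields a closed-form minimizer, where $\boldsymbol{I}$ is the identity matrix of matching size.

$$
2\boldsymbol{X}^\top(\boldsymbol{X}w-y) + 2\alpha w = 0
\quad\Longrightarrow\quad
w^\star = \left(\boldsymbol{X}^\top \boldsymbol{X} + \alpha \boldsymbol{I}\right)^{-1}\boldsymbol{X}^\top y
$$

With $\alpha = 0$ this reduces to the ordinary least-squares normal equations, provided $\boldsymbol{X}^\top \boldsymbol{X}$ is invertible.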

## 3.A
This equation has two terms. The term on the left is the usual **linear regression error** we discussed in class.

<br/>
<div class='info'>

<font size='5'>🤔</font> **Question:**  If $\boldsymbol{X}$ is of size $100 \times 3$,

- what is the dimension of $w$?
- what is the dimension of $y$?
- how big is my dataset?

</div>

---

---

## 3.B

Now for the second term!

<br/>
<div class='info'>

<font size='5'>🤔</font> **Question:** Explain in words what the second term is doing. How does having it in here **regularize** the problem?

</div>

<br/>
<div class='info'>

<font size='5'>☝🏽</font> **Hint:**  We gave three different explanations for this in class! 

</div>

---

---

## 3.C

In sklearn, the regularized linear regression is implemented in a class called `Ridge` in the `linear_model` module, and takes an `alpha` parameter as discussed above:

In [6]:
from sklearn.linear_model import Ridge

In [7]:
Ridge?

Let's implement this model on the same dataset above with different `alpha` values:

In [8]:
alphas = [1e-2,1e-1,0.5,1,5]
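A loop over these values might look like the sketch below. It assumes synthetic regression data with a hypothetical `true_w` (the exam does not say which penguin columns to use as features and target), and it records the norm of the fitted coefficients for each `alpha` so that the shrinkage can be inspected.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)

# Synthetic regression problem; true_w is a hypothetical ground truth.
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

alphas = [1e-2, 1e-1, 0.5, 1, 5]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    # Track the size of the learned weights as the penalty grows.
    coef_norms.append(np.linalg.norm(model.coef_))
    print(f"alpha={alpha}: coef={model.coef_}, intercept={model.intercept_:.3f}")
```

Note that `Ridge` fits an unpenalized intercept by default (`fit_intercept=True`), so `coef_` holds only the penalized weights.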

Type your code below:

## 3.D

<br/>
<div class='info'>

<font size='5'>🤔</font> **Question:**  Comment on what you observe here. Which does better? Why? Does this make sense given what we've discussed in class? Why or why not? I'm not looking for a short answer here!

</div>

---

---

# Problem 4

## 4.A

<br/>
<div class='info'>
    
<font size='5'>🤔</font> **Question:** Explain these concepts in words:
- underfitting
- overfitting
- bias
- variance

Draw a figure (draw it yourself, **not** using any code) to aid in your explanation!
   
</div>

**Note:** Once you've drawn this figure, either on paper, or using your mouse using an online tool, etc. you can **embed it** into your notebook like this:

```
![](foldername/figurename.jpg)
```

---

---

## 4.B

<br/>
<div class='info'>

<font size='5'>🤔</font> **Question:**  Above, you explained what the `C` parameter does for SVMs. Now explain it from the perspective of **regularization**! That is, how does it act as a regularizer for the SVM? Think about it!

</div>

---


---

# Bonus

How does the material we've learned so far relate to any other material you've seen before? Any similarities? Differences? 

If this is your first exposure to the material, then how does this relate to what you thought it was going to be before taking the class? How about what you wanted to use ML for in the future?


- Note: I'm **not** asking about style, difficulty, or coverage in the class!
- This question is **not a request for feedback**! 

It's asking you to reconcile the material we've learned in class with the other material you've learned so far in other classes/your career, or what you thought it was going to be! I'm looking for things like: 

 - In class A we learned about the Bias-Variance trade-off like this, which is different from how we learned about it here. They are different in this way, they are similar in this way. I'm having a hard time seeing how they are talking about the same thing because ... etc.
 - I haven't had any previous exposure to these topics before. I didn't realize what a big deal it was to select an appropriate hypothesis space and algorithm, etc. 
 - I always heard deep learning just works. But how come they don't overfit all the time? etc.
 - I'm confused about how exactly SVMs can do better than KNNs, since KNNs use distance to .... etc.
 - In my lab project we always use ___ method, but that doesn't seem to fit with what we are learning about overfitting, etc.
 
I want you to think deeply about this question. This exercise causes you to re-evaluate what you know, in terms of what you're learning and vice-versa, often changing your interpretation of both! 

<br/>
<div class='info'>

<font size='5'>☝🏽</font> **Note: This bonus is "open ended", and I will award points based on the depth of your answer.**

</div>


---