<h1 align="center">Module 4 Assessment</h1>

## Overview

This assessment is designed to test your understanding of the Mod 4 material. It covers:

* Bayes' Theorem
* Calculus, Cost Function, and Gradient Descent
* Extensions to Linear Models
* Time Series

Read the instructions carefully. You will be asked both to write code and to respond to a few short-answer questions.

### Note on the short answer questions

For the short answer questions, please use your own words. The expectation is that you have not copied and pasted from an external source, even if you consult another source to help craft your response. While the short answer questions are not being assessed on grammatical correctness or sentence structure, do your best to express yourself clearly.

In [1]:
## Part 1: Bayesian Statistics [Suggested time: 15 minutes]
### a. Bayes' Theorem

Thomas wants to get a new puppy 🐕 🐶 🐩 


<img src="https://media.giphy.com/media/rD8R00QOKwfxC/giphy.gif" />

He can choose to get his new puppy either from the pet store or the pound. The probability of him going to the pet store is $0.2$. 

He can choose to get either a large, medium, or small puppy.

If he goes to the pet store, the probability of him getting a small puppy is $0.6$. The probability of him getting a medium puppy is $0.3$, and the probability of him getting a large puppy is $0.1$.

If he goes to the pound, the probability of him getting a small puppy is $0.1$. The probability of him getting a medium puppy is $0.35$, and the probability of him getting a large puppy is $0.55$.

4.a.1) What is the probability of Thomas getting a small puppy? <br>
4.a.2) Given that he got a large puppy, what is the probability that Thomas went to the pet store? <br>
4.a.3) Given that Thomas got a small puppy, is it more likely that he went to the pet store or to the pound? <br>
4.a.4) For question 4.a.2, what are the prior, the posterior, and the likelihood?

In [22]:
p_s_a = 0.6
p_m_a = 0.3
p_l_a = 0.1

p_s_b = 0.1
p_m_b = 0.35
p_l_b = 0.55

p_a = 0.2
p_b = 0.8

# The remaining step is to apply the law of total probability and Bayes' theorem to the values above.
# That computation was not completed in the time available, so the answers below are left as placeholders.

ans1 = None
ans2 = None
ans3 = "answer here"
ans4_prior = "answer here"
ans4_posterior = "answer here"
ans4_likelihood = "answer here"
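
The cell above stops before the arithmetic. As a minimal sketch of how the four questions could be worked from the stated probabilities, the code below applies the law of total probability and Bayes' theorem. The helper names `total_probability` and `posterior_store` are made up for this illustration and are not part of the assessment.

```python
# Probabilities copied from the problem statement above.
p_store, p_pound = 0.2, 0.8
p_size_given_store = {"small": 0.6, "medium": 0.3, "large": 0.1}
p_size_given_pound = {"small": 0.1, "medium": 0.35, "large": 0.55}


def total_probability(size):
    """P(size) via the law of total probability over both sources."""
    return (p_size_given_store[size] * p_store
            + p_size_given_pound[size] * p_pound)


def posterior_store(size):
    """P(pet store | size) via Bayes' theorem."""
    return p_size_given_store[size] * p_store / total_probability(size)


p_small = total_probability("small")            # 4.a.1
p_store_given_large = posterior_store("large")  # 4.a.2
p_store_given_small = posterior_store("small")  # 4.a.3
more_likely = "pet store" if p_store_given_small > 0.5 else "pound"

print(round(p_small, 4), round(p_store_given_large, 4), more_likely)
# prints: 0.2 0.0435 pet store
```

For 4.a.4, under this reading the prior is P(pet store) = 0.2, the likelihood is P(large | pet store) = 0.1, and the posterior is P(pet store | large), roughly 0.0435.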

---
## Part 2: Calculus, Cost Function, and Gradient Descent [Suggested time: 15 minutes]
---

![best fit line](visuals/best_fit_line.png)

The best fit line through the scatterplot above can be written in the general form: $$y = mx + b$$

Of all the possible lines, we can see why that particular line was chosen by looking at the plot below:

![](visuals/cost_curve.png)

where RSS is defined as the residual sum of squares:

$$ 
\begin{align}
RSS &= \sum_{i=1}^n(\text{actual}_i - \text{predicted}_i)^2 \\
&= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\
&= \sum_{i=1}^n(y_i - (mx_i + b))^2
\end{align}
$$ 

### 1. What is a more generalized name for the RSS curve above? How is it related to machine learning models?

In [2]:
# Your answer here
# ----
# The RSS curve above is an example of a cost function (also called a loss function): a function that assigns a numerical cost to a
# model in order to evaluate how well it fits the data. For a regression like this one, the residual sum of squares is a mathematically
# well-defined way to assign this cost. In general, a cost function should assign higher costs to worse models, so that when we minimize
# the function while training, we approach the better/best model.

### 2. Would you rather choose an $m$ value of 0.08 or 0.05 from the RSS curve above? What is the relation between the position on the cost curve, the error, and the slope of the line?

In [3]:
# Your answer here
# ----
# In this case we want to select an m value of 0.05. Because this is close to the minimum point on the graph of the model's cost
# function above, the model fits our training data best there (i.e. the errors are the smallest overall). In particular, the cost
# function is a function on the input space of model parameters, so each position on the curve corresponds to one candidate slope m,
# and the height of the curve at that position is the error of the line with that slope. Minimizing in this space therefore moves us
# towards the optimal parameters. Since the minimum is close to 0.05, we should select this value for the optimal model.

![](visuals/gd.png)

### 3. Using the gradient descent visual from above, explain why the distance between each step is getting smaller as more steps occur with gradient descent.

In [4]:
# Your answer here
# ----
# Each step of gradient descent becomes smaller in the plot above because the size of each step is controlled by the magnitude of
# the gradient at that point. While the learning rate is constant, as we move down the curve towards the minimum, the slope
# (the gradient in this one-parameter case) becomes smaller in magnitude, so the product of the slope and the learning rate
# (which determines how far the next step goes) becomes smaller as well.
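
The shrinking steps can be reproduced on a toy problem. The sketch below assumes a made-up quadratic cost `J(m) = (m - 0.05) ** 2`, chosen only so that its minimum sits near the value discussed above; it is not the RSS curve from the figure. The learning rate and starting point are likewise arbitrary.

```python
def gradient(m):
    """Derivative of the toy cost J(m) = (m - 0.05) ** 2."""
    return 2 * (m - 0.05)


learning_rate = 0.1   # held constant for every step
m = 1.0               # arbitrary starting slope
step_sizes = []

for _ in range(10):
    step = learning_rate * gradient(m)  # step = learning rate * gradient
    m -= step
    step_sizes.append(abs(step))

print([round(s, 4) for s in step_sizes])
```

Even though `learning_rate` never changes, every printed step is smaller than the one before it, because the gradient shrinks as `m` approaches the minimum.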

### 4. What is the purpose of a learning rate in gradient descent? Explain how a very small and a very large learning rate would affect the gradient descent.

In [5]:
# Your answer here
# ----
# The purpose of the learning rate is to control the size of each step that gradient descent takes. The gradient tells us the
# direction of greatest increase (so we step in the opposite direction), and each step is the gradient multiplied by the learning
# rate. With a very large learning rate, each step is larger, which can speed up the descent but lowers granularity and increases
# the chance that the algorithm will "overshoot" the minimum, oscillate around it, or even diverge. With a very small learning rate,
# the steps are tiny and the algorithm may not converge to the minimum in a reasonable amount of time. Choosing the learning rate
# lets us balance these two effects for a given model, whose gradients may themselves be very large or very small.
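
To make the two failure modes concrete, here is a small sketch that runs the same number of steps with three learning rates. The cost `J(m) = m ** 2`, the function name `run_gradient_descent`, and the specific rates are all assumptions picked for illustration.

```python
def run_gradient_descent(learning_rate, start=1.0, n_steps=25):
    """Minimize the toy cost J(m) = m ** 2, whose gradient is 2 * m."""
    m = start
    for _ in range(n_steps):
        m -= learning_rate * 2 * m
    return m


# Very small, moderate, and very large learning rates.
for rate in (0.001, 0.1, 1.1):
    print(rate, run_gradient_descent(rate))
```

With the smallest rate the parameter has barely moved from its starting value after 25 steps, with the moderate rate it is close to the minimum at 0, and with the largest rate each step overshoots by more than the last, so the value grows instead of converging.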

---
## Part 3: Extensions to Linear Regression [Suggested time: 20 minutes]
---

In this section, you're going to be creating linear models that are more complicated than a simple linear regression. In the cells below, we are importing relevant modules that you might need later on. We also load and prepare the dataset for you.

In [6]:
import itertools
import pickle

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

%matplotlib inline

In [7]:
data = pd.read_csv('raw_data/advertising.csv').drop('Unnamed: 0', axis=1)
data.describe()

| | TV | radio | newspaper | sales |
|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 147.0425 | 23.264 | 30.554 | 14.0225 |
| std | 85.854236 | 14.846809 | 21.778621 | 5.217457 |
| min | 0.7 | 0.0 | 0.3 | 1.6 |
| 25% | 74.375 | 9.975 | 12.75 | 10.375 |
| 50% | 149.75 | 22.9 | 25.75 | 12.9 |
| 75% | 218.825 | 36.525 | 45.1 | 17.4 |
| max | 296.4 | 49.6 | 114.0 | 27.0 |


In [8]:
X = data.drop('sales', axis=1)
y = data['sales']

In [9]:
# Split the data into training and testing sets. Please do not change the random state!
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

In [11]:
# ----
X_train.shape

(150, 3)

### 1. We'd like to add a bit of complexity to the model created in the example above, and we will do it by adding some polynomial terms. Write a function to calculate train and test error for different polynomial degrees.

This function should:
* take `degree` as a parameter that will be used to create polynomial features to be used in a linear regression model
* create a PolynomialFeatures object for each degree and fit a linear regression model using the transformed data
* calculate the mean squared error for each level of polynomial
* return the `train_error` and `test_error` 


In [16]:
from sklearn.metrics import mean_squared_error

def polynomial_regression(degree):
    """
    Calculate train and test error for a linear regression with polynomial features.
    (Hint: use PolynomialFeatures)
    
    input: Polynomial degree
    output: Mean squared error for train and test set
    """
    # Expand the features into polynomial terms up to the requested degree.
    poly_transformer = PolynomialFeatures(degree=degree)

    # No manual concatenation is needed: for input columns a and b,
    # PolynomialFeatures already returns the bias column 1 plus a and b.
    # Fit on the training data only, then reuse the same transform on the test data.
    X_train_poly = poly_transformer.fit_transform(X_train)
    X_test_poly = poly_transformer.transform(X_test)

    lin_reg = LinearRegression()
    lin_reg.fit(X_train_poly, y_train)
    y_train_pred = lin_reg.predict(X_train_poly)
    y_test_pred = lin_reg.predict(X_test_poly)

    train_error = mean_squared_error(y_train, y_train_pred)
    test_error = mean_squared_error(y_test, y_test_pred)
    return train_error, test_error

#### Try out your new function

In [17]:
polynomial_regression(3)

(0.24235967358392063, 0.15281375976869446)

#### Check your answers

MSE for degree 3:
- Train: 0.2423596735839209
- Test: 0.15281375973923944

MSE for degree 4:
- Train: 0.18179109317368244
- Test: 1.9522597174462015

In [18]:
polynomial_regression(4)

(0.18179109287485934, 1.952259624453726)

### 2. What is the optimal number of degrees for our polynomial features in this model? In general, how does increasing the polynomial degree relate to the Bias/Variance tradeoff?  (Note that this graph shows RMSE and not MSE.)

<img src ="visuals/rsme_poly_2.png" width = "600">

<!---
fig, ax = plt.subplots(figsize=(7, 7))
degree = list(range(1, 10 + 1))
ax.plot(degree, error_train[0:len(degree)], "-", label="Train Error")
ax.plot(degree, error_test[0:len(degree)], "-", label="Test Error")
ax.set_yscale("log")
ax.set_xlabel("Polynomial Feature Degree")
ax.set_ylabel("Root Mean Squared Error")
ax.legend()
ax.set_title("Relationship Between Degree and Error")
fig.tight_layout()
fig.savefig("visuals/rsme_poly.png",
            dpi=150,
            bbox_inches="tight")
--->

In [19]:
# Your answer here
# ----
# From the above graph, it looks as though the optimal degree for the polynomial features is 3. We want to avoid overfitting to the
# training set, as this produces higher error on the testing set. Beyond degree 3, the graph shows a sharp increase in the error on
# the test set while the error on the training set keeps going down. This indicates that the model is fitting the training data too
# closely to predict accurately on the testing data. Because the testing data was not seen by the model during fitting, accuracy on
# this set is critical to having a robust model that can make predictions on new data. In general, bias is produced when the model
# isn't sensitive enough to features in the training data to make accurate predictions, and variance is produced when the model is
# too sensitive to them, so that a slight deviation from the training data produces a high error due to overfitting. Increasing the
# polynomial degree makes the model more flexible, which lowers bias but raises variance.
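
The pattern described above, training error falling steadily while test error eventually turns upward, can be explored on synthetic data. The sketch below is only an illustration under stated assumptions: the data are a made-up noisy cubic in one variable (not the advertising dataset), and `np.polyfit` stands in for the `PolynomialFeatures` plus `LinearRegression` pipeline used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a cubic signal plus noise.
x = rng.uniform(-1, 1, size=60)
y = x ** 3 - 0.5 * x + rng.normal(scale=0.1, size=60)
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


errors = {}
for degree in range(1, 9):
    # Least-squares polynomial fit on the training split only.
    coefs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (
        rmse(y_train, np.polyval(coefs, x_train)),
        rmse(y_test, np.polyval(coefs, x_test)),
    )

best_degree = min(errors, key=lambda d: errors[d][1])
for degree, (train_rmse, test_rmse) in errors.items():
    print(degree, round(train_rmse, 3), round(test_rmse, 3))
print("lowest test RMSE at degree", best_degree)
```

Because each higher degree contains the lower-degree fits as special cases, the training error cannot get worse as the degree grows; only the test column can reveal where extra flexibility stops helping.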

### 3. In general, what methods can you use to reduce overfitting and underfitting? Provide an example for both and explain how each technique works to reduce the problems of underfitting and overfitting.

In [20]:
# Your answer here
# ----
# To check that overfitting and underfitting are not issues, a major technique is to hold out a validation set of data in addition
# to the testing set. The validation set is used to evaluate the model on data it was not trained on while we tune it, so its results
# influence parameter selection: it is our marker of model performance as we change parameters to find an optimal model. In contrast,
# the testing set is left completely untouched until the final stage of evaluation. At that point, we have found an "optimal" model
# by fitting to the training data and tuning parameters until things also look good on the validation data (neither under- nor
# overfit). However, we cannot be sure until we evaluate the final model on the testing data, which reveals whether it has been
# overfit or underfit to the data used during tuning.

# This is a general strategy for improving model selection. To reduce overfitting specifically, we might also look to methods of
# regularization. In general, the idea behind these methods is to add a penalty on the size of the model coefficients to the cost
# function, so that the features that explain the most variance in the target keep larger coefficients, while the features that
# explain little are shrunk towards zero or dropped entirely. This keeps small discrepancies that look like a local trend, but don't
# actually contribute much, from being baked into the model, which would cause overfitting.

# To reduce underfitting, we would need to perform some additional feature engineering, as underfitting arises when the model
# fails to capture patterns that contribute to the variance in the data, or when the relevant features are not included in the
# model. In fact, we tried this above, where we produced polynomial features of the given data. This adds complexity to the model
# and gives it more ways to account for variance in the data. Of course, there is a trade-off, as too many features will eventually
# contribute to overfitting, as demonstrated in the graph above.

### 4. What is the difference between the two types of regularization for linear regression?

In [None]:
# Your answer here
# ----
# In linear regression there are two main types of regularization: L1 regularization (Lasso) and L2 regularization (Ridge). The two
# methods are very similar and differ only in the form of the regularization term. In both cases, the regularization term is an
# additional term that is added to the cost function to penalize large coefficients.

# In L1 regularization, the regularization term is the sum of the absolute values of the model coefficients, multiplied by a constant
# lambda. In L2 regularization, the regularization term is the sum of the squares of the model coefficients, also multiplied by lambda.
# These correspond to the L1 norm and the squared L2 norm of the vector of model coefficients. In both cases, the regularization term
# shrinks the coefficients of features that don't explain much of the variance in the data. The big difference is the degree to which
# they do this. L2 regularization shrinks the coefficients of features that are not descriptive of the data but does not fully remove
# them, whereas L1 regularization can set them exactly to zero. Sometimes this can be advantageous, sometimes not.
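
The "shrink versus remove" distinction can be seen in the simplest possible setting: a single coefficient `b` pulled towards an unpenalized estimate `z`. The sketch below uses the closed-form minimizers for that one-dimensional case. The function names and the example coefficients are made up, and a full Lasso or Ridge fit on correlated features will not reduce to these formulas exactly.

```python
def lasso_shrink(z, lam):
    """Minimizer of 0.5 * (b - z) ** 2 + lam * abs(b): soft thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_shrink(z, lam):
    """Minimizer of 0.5 * (b - z) ** 2 + 0.5 * lam * b ** 2."""
    return z / (1 + lam)


unpenalized = [3.0, 0.4, -0.2, -1.5]  # made-up coefficients
lam = 0.5
print([lasso_shrink(z, lam) for z in unpenalized])
print([round(ridge_shrink(z, lam), 4) for z in unpenalized])
```

The L1 penalty sets the two small coefficients exactly to zero, while the L2 penalty scales every coefficient down by the same factor and leaves none of them at zero.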

### 5. Why is scaling input variables a necessary step before regularization?

In [17]:
# Your answer here
# ----
# We want to scale input variables before regularization so that our lambda value has meaning in the context of our overall algorithm.
# Without a scaling process, the value of the norm will be skewed by the magnitude of the features, so the penalty controlled by lambda
# would fall unevenly on the coefficients depending on each feature's units rather than on its importance. Also, scaling will make
# gradient descent faster overall.
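
As a small illustration of why units matter, the sketch below standardizes the same made-up spending figures expressed in two different units. The `standardize` helper is hypothetical and only mimics the idea behind the `StandardScaler` imported earlier (subtract the mean, divide by the standard deviation).

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (population std)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


spend_dollars = [230.1, 44.5, 17.2, 151.5, 180.8]    # made-up values
spend_thousands = [v / 1000 for v in spend_dollars]  # same data, new units

scaled_a = standardize(spend_dollars)
scaled_b = standardize(spend_thousands)
print([round(v, 3) for v in scaled_a])
print([round(v, 3) for v in scaled_b])
```

Both printed lists match, so after scaling the coefficient a model assigns to this feature, and therefore the penalty it attracts, no longer depends on the unit the feature happened to be recorded in.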

## Part 4: Time Series [Suggested time: 10 minutes]

<!---
# load data
ads_df = pd.read_csv("raw_data/social_network_ads.csv")

# one hot encode categorical feature
def is_female(x):
    """Returns 1 if Female; else 0"""
    if x == "Female":
        return 1
    else:
        return 0
        
ads_df["Female"] = ads_df["Gender"].apply(is_female)
ads_df.drop(["User ID", "Gender"], axis=1, inplace=True)
ads_df.head()

# separate features and target
X = ads_df.drop("Purchased", axis=1)
y = ads_df["Purchased"]

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=19)

# preprocessing
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

# save preprocessed train/test split objects
pickle.dump(X_train, open("write_data/social_network_ads/X_train_scaled.pkl", "wb"))
pickle.dump(X_test, open("write_data/social_network_ads/X_test_scaled.pkl", "wb"))
pickle.dump(y_train, open("write_data/social_network_ads/y_train.pkl", "wb"))
pickle.dump(y_test, open("write_data/social_network_ads/y_test.pkl", "wb"))

# build model
model = LogisticRegression(C=1e5, solver="lbfgs")
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
y_train_pred = model.predict(X_train)

from sklearn.metrics import confusion_matrix

# create confusion matrix
# tn, fp, fn, tp
cnf_matrix = confusion_matrix(y_test, y_test_pred)
cnf_matrix

# build confusion matrix plot
plt.imshow(cnf_matrix,  cmap=plt.cm.Blues) #Create the basic matrix.

# Add title and Axis Labels
plt.title('Confusion Matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')

# Add appropriate Axis Scales
class_names = set(y_test) #Get class labels to add to matrix
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

# Add Labels to Each Cell
thresh = cnf_matrix.max() / 2. #Used for text coloring below
#Here we iterate through the confusion matrix and append labels to our visualization.
for i, j in itertools.product(range(cnf_matrix.shape[0]), range(cnf_matrix.shape[1])):
        plt.text(j, i, cnf_matrix[i, j],
                 horizontalalignment="center",
                 color="white" if cnf_matrix[i, j] > thresh else "black")

# Add a Side Bar Legend Showing Colors
plt.colorbar()

# Add padding
plt.tight_layout()
plt.savefig("visuals/cnf_matrix.png",
            dpi=150,
            bbox_inches="tight")
--->

### 1) Which of the following can’t be a component of a time series?
A) Seasonality <br>
B) Trend <br>
C) Cyclical <br>
D) Noise<br>
E) None of the above

In [None]:
# Your answer here
# ----
# E) None of the above. Seasonality, trend, cyclical behavior, and noise can all appear as components of a time series (a cyclical
# component resembles seasonality but without a fixed period), so none of the options can be ruled out.


### 2) What does autocovariance measure?

A) Linear dependence between multiple points on the different series observed at different times<br>
B) Quadratic dependence between two points on the same series observed at different times<br>
C) Linear dependence between two points on different series observed at same time<br>
D) Linear dependence between two points on the same series observed at different times<br>


In [8]:
# Your answer here
# ----
# D
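
Option D can be made concrete with a few lines of code. The sketch below computes a sample autocovariance by pairing each point of a single series with the point `lag` steps later. The function name and the series values are made up, and dividing by `n` (rather than `n - lag`) is one common convention, assumed here.

```python
def autocovariance(series, lag):
    """Sample autocovariance of one series at the given lag (divides by n)."""
    n = len(series)
    mu = sum(series) / n
    return sum(
        (series[t] - mu) * (series[t + lag] - mu) for t in range(n - lag)
    ) / n


series = [2.0, 4.0, 6.0, 4.0, 2.0, 4.0, 6.0, 4.0]  # made-up values
print(autocovariance(series, 0))  # 2.0: lag 0 is the population variance
print(autocovariance(series, 2))  # -1.5: points two steps apart move oppositely
```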

### 3) Looking at the ACF plot below, would you suggest applying an AR or an MA term in an ARIMA model?

![](visuals/acf.jpg)

A) AR<br>
B) MA <br>
C) Can’t Say <br>



In [7]:
# Your answer here
# ----
# A) AR (the series looks slightly underdifferenced)

### 4) Stationarity is a desirable property for a time series process.

A) TRUE <br>
B) FALSE <br>

In [10]:
# Your answer here
# ----
# A) True

### 5) Which of the following statements is correct?

1. If the autoregressive parameter (p) in an ARIMA model is 1, it means that there is no autocorrelation in the series.

2. If the moving average component (q) in an ARIMA model is 1, it means that there is autocorrelation in the series with lag 1.

3. If the integrated component (d) in an ARIMA model is 0, it means that the series is not stationary.

A) Only 1 <br>
B) Both 1 and 2 <br>
C) Only 2 <br>
D) All of the statements <br>



In [11]:
# Your answer here
# ----
# C) Only 2
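
Statement 3 concerns the integrated component d, which is the number of times the series is differenced before the AR and MA terms are fitted. A minimal sketch of that operation, using a hypothetical `difference` helper and a made-up series with a linear trend, is:

```python
def difference(series, d=1):
    """Apply first differencing d times, as the integrated step in ARIMA does."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


trend = [3 * t + 10 for t in range(8)]  # made-up series with a linear trend
print(difference(trend, d=0))  # unchanged: d = 0 applies no differencing
print(difference(trend, d=1))  # constant: the linear trend is removed
```

Seen this way, d = 0 simply means the series is modeled as-is, which is appropriate when it is already stationary. That is the opposite of what statement 3 claims.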

### 6) BIC penalizes complex models more strongly than the AIC. 

A) TRUE <br>
B) FALSE <br>


In [14]:
# Your answer here
# ----
# A) TRUE
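
The comparison follows from the standard definitions, AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of parameters, n the number of observations, and L the maximized likelihood. The sketch below evaluates both for a made-up log-likelihood held fixed while k grows; the numbers are illustrative only.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


log_likelihood = -120.0  # made-up value, held fixed
n_obs = 200
for n_params in (1, 3, 5):
    print(n_params, aic(log_likelihood, n_params),
          round(bic(log_likelihood, n_params, n_obs), 2))
```

Each extra parameter costs 2 under AIC and ln(n) under BIC, so BIC is the stricter of the two whenever ln(n) > 2, that is, for any series with at least 8 observations.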

### 7) Looking at the ACF and PACF plots below, how many AR and MA terms should be included for the time series?


![](visuals/acf_pacf.png)


A) AR(1) MA(0) <br>
B) AR(0) MA(1) <br>
C) AR(2) MA(1) <br>
D) AR(1) MA(2) <br>
E) Can’t Say <br>

In [21]:
# Your answer here
# ----
# We want one MA term because the ACF is negative at lag 1 only.
# We want 0 AR terms because the PACF is negative through the first two lags.
# Hence we select B) AR(0) MA(1).
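
For reference, a pure MA(1) process has a simple theoretical ACF: a single nonzero value at lag 1 and zero afterwards. The sketch below evaluates that formula for a made-up negative coefficient, assuming the convention x_t = e_t + theta * e_(t-1) with white-noise e_t. It describes the idealized process, not the plotted series.

```python
def ma1_autocorrelation(theta, lag):
    """Theoretical ACF of x_t = e_t + theta * e_(t-1) with white-noise e_t."""
    if lag == 0:
        return 1.0
    if lag == 1:
        return theta / (1 + theta ** 2)
    return 0.0


theta = -0.6  # made-up coefficient
print([round(ma1_autocorrelation(theta, k), 3) for k in range(4)])
# prints: [1.0, -0.441, 0.0, 0.0]
```

A negative spike at lag 1 followed by values near zero in a sample ACF is therefore the kind of signature that points towards a single MA term.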