# Assignment 1

### <span style="color:chocolate"> Submission requirements </span>

Your work will not be graded if your notebook doesn't include output. In other words, <span style="color:red"> make sure to rerun your notebook before submitting to Gradescope </span>. (Note: if you are using Google Colab, go to Edit > Notebook settings and uncheck "Omit code cell output when saving this notebook"; otherwise the output is not saved with the notebook.)

Additional points may be deducted if these requirements are not met:

    
* Comment your code;
* Each graph should have a title, labels for each axis, and (if needed) a legend. Each graph should be understandable on its own;
* Try to minimize the use of the global namespace (that is, keep things inside functions).
---

### Import libraries

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

### Define functions

In [None]:
def create_1d_data(num_examples, w, b, bound):
  """Create X, Y data with a linear relationship with added noise.

  Args:
    num_examples: number of examples to generate
    w: desired slope
    b: desired intercept
    bound: half-width of the uniform noise interval; noise is drawn from [-bound, bound)

  Returns:
    X and Y, each with shape (num_examples,)
  """
  np.random.seed(4)  # consistent random number generation
  X = np.arange(num_examples)
  deltas = np.random.uniform(low=-bound, high=bound, size=X.shape) # added noise
  Y = b + deltas + w * X

  return X, Y

---
### Step 1: Data ingestion

Supervised learning is all about learning to make predictions: given an input $x$ (e.g. home square footage), can we produce an output $\hat{y}$ (e.g. estimated value) as close as possible to the actual observed output $y$ (e.g. sale price)? Note that the "hat" above $y$ denotes an estimated or predicted value.

Let's start by generating some artificial data. We'll create a vector of inputs, $X$, and a corresponding vector of target outputs, $Y$. In general, we'll refer to individual examples with a lowercase letter ($x$), and to a vector or matrix containing multiple examples with a capital letter ($X$).

### <span style="color:chocolate">Exercise 1:</span> Create data (10 points)

Create artificial data using the function <span style="color:chocolate">create_1d_data()</span> defined at the top of this notebook. Set the following argument values:
- number of examples = 70;
- slope (w) = 2;
- intercept (b) = 1;
- bound = 1.

Denote the output by X and Y. Print the shape and the first 10 elements for each object.

In [None]:
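# YOUR CODE HERE
# Keep the returned tuple and unpack it into X and Y, as the exercise asks
artific_data = create_1d_data(num_examples=70, w=2, b=1, bound=1)
X, Y = artific_data
print(f"X shape: {X.shape}, first 10 elements: {X[:10]}")
print(f"Y shape: {Y.shape}, first 10 elements: {Y[:10]}")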

---
### Step 2: Data preprocessing

Given the simplicity of the data (just one feature in X), our sole task here is to divide the data into training and test sets.
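
To make the idea of a split concrete, here is a minimal sketch of a 70/30 shuffle-and-split done by hand with NumPy. It is only an illustration: `manual_split`, `toy_X`, and `toy_Y` are hypothetical names, and this is not how scikit-learn implements `train_test_split()`, so the partitions it produces will differ from the ones in the exercise.

```python
import numpy as np

def manual_split(X, Y, test_size=0.3, seed=0):
    """Shuffle example indices and split X, Y into train and test parts (illustrative only)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))         # random order of example indices
    n_test = int(round(test_size * len(X)))   # number of examples held out for testing
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]

# Hypothetical toy data: 10 points on the line y = 1 + 2x
toy_X = np.arange(10)
toy_Y = 1 + 2 * toy_X

X_tr, X_te, Y_tr, Y_te = manual_split(toy_X, toy_Y)
print(X_tr.shape, X_te.shape)  # 7 training examples and 3 test examples
```

Each example lands in exactly one partition, and every input stays paired with its own target.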

### <span style="color:chocolate">Exercise 2:</span> Data splits (10 points)

Using the <span style="color:chocolate">train_test_split()</span> method available in scikit-learn:
1. Split the (X,Y) data into training and test partitions by setting test_size=0.3 and random_state=1234. Leave all other arguments of the method at their default values. Name the resulting arrays X_train, X_test, Y_train, Y_test;
2. Print the shape of each array.

In [None]:
# YOUR CODE HERE

# 1:
X_train, X_test, Y_train, Y_test = train_test_split(artific_data[0], artific_data[1], test_size=0.3, random_state=1234)

# 2:
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"Y_train shape: {Y_train.shape}")
print(f"Y_test shape: {Y_test.shape}")


---
### Step 3: Exploratory data analysis (EDA)

EDA helps us to gain insights into the distribution and characteristics of the dataset we are dealing with.
This understanding is fundamental for making informed decisions regarding:
- data cleaning;
- feature selection;
- model building;
- model evaluation, etc.
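
Alongside plots, a few summary statistics are a common first EDA step. The sketch below uses a small made-up sample (`toy_x` and `toy_y` are hypothetical, not the assignment data, and `summarize` is a helper named here for illustration) to show the kind of numbers one might inspect before modeling.

```python
import numpy as np

# Hypothetical toy sample, for illustration only
toy_x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
toy_y = np.array([1.2, 2.9, 5.1, 7.0, 8.8])

def summarize(values):
    """Return basic descriptive statistics for a 1-D array."""
    return {
        "mean": float(np.mean(values)),
        "std": float(np.std(values)),
        "min": float(np.min(values)),
        "max": float(np.max(values)),
    }

print("x summary:", summarize(toy_x))
print("y summary:", summarize(toy_y))

# A Pearson correlation close to 1 suggests a strong positive linear relationship
corr = float(np.corrcoef(toy_x, toy_y)[0, 1])
print(f"correlation(x, y) = {corr:.3f}")
```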

### <span style="color:chocolate">Exercise 3:</span> Plots (10 points)

1. Generate a scatter plot displaying the X_train data along the x-axis and the Y_train data along the y-axis, ensuring clear labeling of both axes. Add a title "Exploratory Data Analysis: Training Data" and a legend "observed training data" to the plot;
2. Enhance the plot by incorporating a vertical red line to denote the mean value of X_train. Accompany it with a legend clarifying the meaning of the line and the mean value of X_train.

In [None]:
# YOUR CODE HERE
# 1.
plt.scatter(X_train, Y_train, label="observed training data", color='black')
plt.xlabel('X_train')
plt.ylabel('Y_train')
plt.title('Exploratory Data Analysis: Training Data')

# 2:
mean_x = np.mean(X_train)
plt.axvline(x=mean_x, color='red', linestyle='--', label=f'Mean of X_train = {mean_x:.2f}')
plt.legend()

plt.show()


---
### Step 4: Modeling

In this section, our objective is to propose models to describe the data generation process. Remember a model is a function that takes an input $x$ and produces a prediction $\hat{y}$.

Let's consider two possible models for this data:
1. $M_1(x) = 5+x$
2. $M_2(x) = 1+2x$
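
Because a model is just a function from inputs to predictions, the two candidates can be written directly as Python functions. This is a minimal sketch; the names `model_1`, `model_2`, and `sample_x` are chosen here for illustration.

```python
import numpy as np

def model_1(x):
    """M1(x) = 5 + x"""
    return 5 + x

def model_2(x):
    """M2(x) = 1 + 2x"""
    return 1 + 2 * x

# Hypothetical inputs; NumPy broadcasting applies each model element-wise
sample_x = np.array([0, 1, 2, 3])
print(model_1(sample_x))  # [5 6 7 8]
print(model_2(sample_x))  # [1 3 5 7]
```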

### <span style="color:chocolate">Exercise 4:</span> Models for data (10 points)

1. Compute the predictions of models $M_1$ and $M_2$ for the values in X_train. These predictions should be vectors of the same shape as Y_train. Call these predictions M1_hat_train and M2_hat_train. Hint: the "learned" parameters are already provided to you;
2. Plot the prediction lines of these two models overlaid on the observed data (X_train, Y_train). Note: you will generate only one plot. Make sure to include axis labels, a title, and a legend.

In [None]:
# YOUR CODE HERE

# 1.
M1_hat_train = 5 + X_train
M2_hat_train = 1 + 2 * X_train

# 2.
plt.scatter(X_train, Y_train, label="Observed Training Data", color='black')
plt.plot(X_train, M1_hat_train, label="Prediction by M1(x) = 5 + x", color='blue')
plt.plot(X_train, M2_hat_train, label="Prediction by M2(x) = 1 + 2x", color='purple')
plt.xlabel('X_train')
plt.ylabel('Y_train')
plt.title('Model Predictions vs Observed Training Data')
mean_x = np.mean(X_train)
plt.axvline(x=mean_x, color='red', linestyle='--', label=f'Mean of X_train = {mean_x:.2f}')
plt.legend()

plt.show()


---
### Step 5: Evaluation and Generalization

How good are our models? Intuitively, the better the model, the more closely it fits the data we have. That is, for each $x$, we'll compare $y$, the true value, with $\hat{y}$, the predicted value. This comparison is often called the *loss* or the *error*. One common such comparison is *squared error*: $(y-\hat{y})^2$. Averaging over all our data points, we get the *mean squared error*:

\begin{equation}
\textit{MSE} = \frac{1}{n} \sum_{y_i \in Y}(y_i - \hat{y}_i)^2
\end{equation}
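
As a small worked instance of the formula, take three hypothetical observations and predictions (the numbers are made up for illustration). The errors are 1, 0, and -2, so the squared errors are 1, 0, and 4, and their mean is 5/3.

```python
# Hypothetical true values and predictions, for illustration only
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Squared error for each example: (y_i - y_hat_i)^2
squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]

# Mean squared error: the average of the squared errors
mse_value = sum(squared_errors) / len(squared_errors)

print(squared_errors)       # [1.0, 0.0, 4.0]
print(round(mse_value, 4))  # 1.6667
```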

How well do our models generalize? The test dataset serves as a proxy for unseen data in real-world applications. By evaluating the model on the test data, you can assess its ability to generalize beyond the training data. This ensures that the model can make accurate predictions on new data it hasn't seen during training.
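
One simple way to read generalization off the numbers is to compare a model's training MSE with its test MSE: a test error far above the training error hints at overfitting. The sketch below is an illustration under assumptions; `mean_squared_error_np`, `generalization_gap`, `linear_model`, and the toy arrays are hypothetical names and data, not part of the assignment.

```python
import numpy as np

def mean_squared_error_np(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def generalization_gap(model, x_train, y_train, x_test, y_test):
    """Return (train MSE, test MSE, test MSE minus train MSE) for a callable model."""
    train_mse = mean_squared_error_np(y_train, model(x_train))
    test_mse = mean_squared_error_np(y_test, model(x_test))
    return train_mse, test_mse, test_mse - train_mse

def linear_model(x):
    """Hypothetical fixed model: y_hat = 1 + 2x."""
    return 1 + 2 * x

# Made-up data near the line y = 1 + 2x, split by hand into train and test parts
toy_x_train = np.array([0.0, 1.0, 2.0, 3.0])
toy_y_train = np.array([1.1, 2.9, 5.2, 6.8])
toy_x_test = np.array([4.0, 5.0])
toy_y_test = np.array([9.1, 10.9])

train_mse, test_mse, gap = generalization_gap(
    linear_model, toy_x_train, toy_y_train, toy_x_test, toy_y_test
)
print(f"train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}, gap = {gap:.4f}")
```

A gap near zero, as in this toy case, means the model does about as well on held-out data as on the data it was compared against during training.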

### <span style="color:chocolate">Exercise 5:</span> Computing MSE (20 points)

1. Write a function for computing the MSE metric based on the provided definition;
2. Utilizing this function, calculate the training data MSE for the two models, $M_1$ and $M_2$.
3. Comment on which model fits the training data better.

In [None]:
# YOUR CODE HERE
def MSE(true_values, predicted_values):
  """Return the MSE between true_values and predicted_values."""
  return np.mean((true_values - predicted_values) ** 2)  # mean of the squared errors

In [None]:
# YOUR CODE HERE
# 2.
mse_M1 = MSE(Y_train, M1_hat_train)
mse_M2 = MSE(Y_train, M2_hat_train)

print(f'MSE for M1(x) = 5 + x: {mse_M1}')
print(f'MSE for M2(x) = 1 + 2x: {mse_M2}')

# 3.
if mse_M1 < mse_M2:
    print("Model M1 fits the training data better (lower MSE).")
else:
    print("Model M2 fits the training data better (lower MSE).")

### <span style="color:chocolate">Exercise 6:</span> Generalization (15 points)

1. Compute the predictions of models $M_1$ and $M_2$ for the values in X_test. These predictions should be vectors of the same shape as Y_test. Call these predictions M1_hat_test and M2_hat_test.
2. Calculate the test data MSE for the two models, $M_1$ and $M_2$, using the <span style="color:chocolate">MSE()</span> function defined above.
3. Does the model you chose in Exercise 5 generalize well?

In [None]:
# YOUR CODE HERE
# 1.
M1_hat_test = 5 + X_test
M2_hat_test = 1 + 2 * X_test

# 2.
MSE_M1_test = MSE(Y_test, M1_hat_test)
MSE_M2_test = MSE(Y_test, M2_hat_test)

print(f"MSE for Model M1 on test data: {MSE_M1_test}")
print(f'MSE for Model M1 on train data: {mse_M1}')
print(f"MSE for Model M2 on test data: {MSE_M2_test}")
print(f'MSE for Model M2 on train data: {mse_M2}')
# 3.
# Based on the printed MSE values, the model with the lower MSE on the test data (M2) also performed well on the training data.
# In fact, the MSE values are very similar (~0.33 on test and ~0.31 on train). Thus, the model can be considered to generalize well.
# While M1 was already stated in exercise 5 to not fit the training data as well as M2, it should be noted that it also generalizes
# well with the test data, as the MSE values are again very similar (~1300 on test and ~1358 on train).

### <span style="color:chocolate">Exercise 7:</span> More features (25 points)

1. Fit an 8th-degree polynomial to (X_train, Y_train). Call the predictions of this model M3_hat_train. Hint: see <span style="color:chocolate">np.polyfit()</span> for details.
2. Plot the prediction line of $M_3$ overlaid on the observed data (X_train, Y_train). Note: you will generate only one plot. Make sure to include axis labels, a title, and a legend.
3. Calculate the training data MSE for the $M_3$ model using the <span style="color:chocolate">MSE()</span> function defined above.
4. Does model $M_3$ do better than your chosen model in Exercise 5 at predicting the labels for new unseen data? Hint: your new unseen data is the test dataset.
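
For reference, `np.polyfit()` returns polynomial coefficients ordered from the highest degree down to the constant term, and `np.poly1d()` wraps them in a callable. The sketch below fits a degree-2 polynomial to noise-free toy data (`toy_x` and `toy_y` are hypothetical), which is deliberately not the degree-8 fit the exercise asks for.

```python
import numpy as np

# Hypothetical noise-free data from the quadratic 3x^2 - 2x + 1
toy_x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
toy_y = 3 * toy_x**2 - 2 * toy_x + 1

# Least-squares fit; coefficients come back highest degree first
coeffs = np.polyfit(toy_x, toy_y, deg=2)

# Wrap the coefficients so the fitted polynomial can be called like a function
poly = np.poly1d(coeffs)

print(np.round(coeffs, 6))
print(poly(5.0))
```

Since the toy data is an exact quadratic, the fitted coefficients should match 3, -2, and 1 up to floating-point error.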

In [None]:
# YOUR CODE HERE
# 1.
M3 = np.polyfit(X_train, Y_train, deg=8)
eight_model = np.poly1d(M3)
M3_hat_train = eight_model(X_train)

# 2.
plt.scatter(X_train, Y_train, label="Observed Training Data", color='black', alpha=0.5)
order = np.argsort(X_train)  # sort by x so the fitted curve is drawn left to right
plt.plot(X_train[order], M3_hat_train[order], label='Model M3 (deg 8)', color='blue', lw=2)
plt.xlabel('X_train')
plt.ylabel('Y_train')
plt.title('Polynomial Fit (Deg 8) on Training Data')
plt.legend()
plt.show()

# 3.
MSE_M3_train = MSE(Y_train, M3_hat_train)
print(f"MSE for Model M3 on training data: {MSE_M3_train}")

# 4.
M3_hat_test = eight_model(X_test)
MSE_M3_test = MSE(Y_test, M3_hat_test)
print(f"MSE for Model M3 on test data: {MSE_M3_test}")
print(f"MSE for Model M2 on test data: {MSE_M2_test}")
print(f'MSE for Model M2 on train data: {mse_M2}')

if MSE_M3_test < MSE_M2_test:
    print("Model M3 performs better than M2, our chosen model in Exercise 5, at predicting the labels for new unseen data (lower MSE on test data)")
else:
    print("Model M2, our chosen model in Exercise 5, performs better than M3 at predicting the labels for new unseen data (lower MSE on test data)")

# By using a degree 8 polynomial, our model fits too well to the training data and fails to generalize to the test data as well as M2.

----
#### <span style="color:chocolate">Additional practice question</span> (not graded)

Would you perform EDA on the test dataset?
1. Why or why not?
2. Provide a link to a paper/article to support your answer.

In [None]:
# YOUR ANSWER HERE
# 1.

# No, I would not perform EDA on the test data, because it is supposed to stay unseen and be used only at evaluation time.
# Exploring the test data reveals information about the held-out examples that can influence our training decisions
# (feature engineering, hyperparameter tuning, etc.). This defeats the purpose of having a test dataset: the model may end up overfit to it,
# and the measured test error would no longer reflect performance on genuinely new data. This would invalidate any claim about how well
# the model generalizes. Thus, EDA should be done only on the training data, and the model then evaluated on the test set (allowing us to understand
# whether our preprocessing decisions were useful, as well as the usability of the model).

# 2.
# The following source states: """
# The first thing we will want to do with this data is construct a train/test split. Constructing a train test split before EDA and data cleaning can often
# be helpful. This allows us to see if our data cleaning and any conclusions we draw from visualizations generalize to new data. This can be done by re-running
# the data cleaning and EDA process on the test dataset."""
# Thus, the source supports the point above: the test data is split off first and kept out of the initial EDA.
# Source: https://ds100.org/sp20/resources/assets/lectures/lec18/TrainTestSplitAndCrossValidation.html