# CS145 Introduction to Data Mining - Assignment 1
## Deadline: 12:00PM (noon), April 12, 2024

## Instructions
Each assignment is structured as a Jupyter notebook, offering interactive tutorials that align with our lectures. You will encounter two types of problems: *write-up problems* and *coding problems*.

1. **Write-up Problems:** These problems are primarily theoretical, requiring you to demonstrate your understanding of lecture concepts and to provide mathematical proofs for key theorems. Your answers should include sufficient steps for the mathematical derivations.
2. **Coding Problems:** Here, you will be engaging with practical coding tasks. These may involve completing code segments provided in the notebooks or developing models from scratch.

To ensure clarity and consistency in your submissions, please adhere to the following guidelines:

* For write-up problems, use Markdown bullet points to format text answers. Also, express all mathematical equations using $\LaTeX$ and avoid plain text such as `x0`, `x^1`, or `R x Q` for equations.
* For coding problems, comment on your code thoroughly for readability and ensure your code is executable. Non-runnable code may lead to a loss of **all** points. Coding problems have automated grading, and altering the grading code will result in a deduction of **all** points.
* Your submission should show the entire process of data loading, preprocessing, model implementation, training, and result analysis. This can be achieved through a mix of explanatory text cells, inline comments, intermediate result displays, and experimental visualizations.
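
As an illustration of the formatting guideline for write-up problems, the plain-text expressions mentioned above could be typeset as shown below. The displayed equation is only an example chosen for this sketch, not a required form:

```latex
$x_0$, $x^1$, $R \times Q$

$$ \hat{y} = X w, \qquad X \in \mathbb{R}^{n \times d}, \; w \in \mathbb{R}^{d} $$
```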

### Collaboration and Integrity

* Collaboration is encouraged, but all final submissions must be your own work. Please acknowledge any collaboration or external sources used, including websites, papers, and GitHub repositories.
* Any suspected cases of academic misconduct will be reported to the Office of the Dean of Students.

## Before You Start

Useful information about managing environments can be found [here](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).

You may also want to review the basics of Python and the NumPy package, which the coding problems use for matrix operations.

You must not delete any code cells in this notebook. If you change any code outside the blocks that you are allowed to edit (between `START YOUR CODE HERE` and `END YOUR CODE HERE`), you must highlight these changes. You may add cells to help explain your results and observations.

### Linear Regression with Closed-Form Solution (10 points)

In [5]:
import numpy as np
import seaborn as sns
import torch
from sklearn.model_selection import train_test_split

In [8]:
data = sns.load_dataset("anscombe")
data = data[data.dataset == "II"]
x = np.array(data['x'])
y = np.array(data['y'])
x = torch.from_numpy(x).float().unsqueeze(1)
y = torch.from_numpy(y).float().unsqueeze(1)
# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Augment the features with an additional all-ones column (bias term)
X = torch.cat([x_train, torch.ones_like(x_train)], dim=1)

In [None]:
# START YOUR CODE HERE
# TODO: Compute the closed-form solution for the weights (w) using the normal equation
# Hint: Use torch.linalg.inv() for matrix inversion and @ for matrix multiplication
w_closed_form = None
# END YOUR CODE HERE

# Compute training and test error for closed-form solution
y_train_pred = x_train @ w_closed_form[0] + w_closed_form[1]
y_test_pred = x_test @ w_closed_form[0] + w_closed_form[1]
train_error_closed_form = torch.mean((y_train_pred - y_train.reshape(-1))**2).item()
test_error_closed_form = torch.mean((y_test_pred - y_test.reshape(-1))**2).item()

print(f"Closed-form solution:")
print(f"Training error: {train_error_closed_form:.4f}")
print(f"Test error: {test_error_closed_form:.4f}")

In [None]:
from matplotlib import pyplot as plt
x_plot = torch.Tensor(np.linspace(x.min(), x.max(), 1000)).unsqueeze(1)
y_plot = x_plot @ w_closed_form[0] + w_closed_form[1]
plt.scatter(x_plot, y_plot)
plt.scatter(x_train, y_train)
plt.scatter(x_test, y_test)
plt.title("Closed-form solution")
plt.show()

### Linear Regression with Gradient Descent (20 points)

In [None]:
class LinearRegression(torch.nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.w = torch.nn.Parameter(torch.rand(input_dim, 1))
        
    def forward(self, x):
        # START YOUR CODE HERE
        # TODO: Implement the forward pass of the linear regression model
        # Hint: Perform matrix multiplication between the input features (x) and the weights (self.w)
        y_pred = None
        # END YOUR CODE HERE
        return y_pred
    
    def compute_gradient(self, x, y):
        # START YOUR CODE HERE
        # TODO: Compute the gradient of the mean squared error loss with respect to the weights (w)
        # Hint: Use the chain rule to break down the gradient computation
        #       d_loss_d_w = d_loss_d_y * d_y_d_w, where d_loss_d_y is the gradient of the loss w.r.t. the predictions,
        #       and d_y_d_w is the gradient of the predictions w.r.t. the weights
        d_loss_d_w = None
        # END YOUR CODE HERE
        return d_loss_d_w
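
One way to sanity-check a hand-derived gradient such as the one requested in `compute_gradient` is to compare it against the gradient that autograd computes for the same loss. The sketch below is only an illustration: the helper `gradients_match` is a hypothetical name, and the toy loss $f(w) = \sum_j w_j^2$ is deliberately not the assignment's loss.

```python
import torch

def gradients_match(loss_fn, manual_grad_fn, w, atol=1e-5):
    """Compare a hand-derived gradient with the one autograd computes."""
    # Work on a fresh copy so the caller's tensor is left untouched
    w = w.clone().detach().requires_grad_(True)
    loss = loss_fn(w)
    (auto_grad,) = torch.autograd.grad(loss, w)
    return torch.allclose(auto_grad, manual_grad_fn(w.detach()), atol=atol)

# Toy example (not the assignment's loss): f(w) = sum(w^2), whose gradient is 2w
w0 = torch.tensor([[1.0], [-2.0]])
check_ok = gradients_match(lambda w: (w ** 2).sum(), lambda w: 2 * w, w0)
check_bad = gradients_match(lambda w: (w ** 2).sum(), lambda w: w, w0)
print(check_ok, check_bad)  # True False
```

The same pattern should carry over to a mean squared error loss by passing the loss and the manual gradient as the two callables.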

In [None]:
# Augment the input features
x_train_augmented = torch.cat([x_train, torch.ones(len(x_train), 1)], dim=1)
x_test_augmented = torch.cat([x_test, torch.ones(len(x_test), 1)], dim=1)

input_dim = x_train_augmented.shape[1]
model = LinearRegression(input_dim)

num_epochs = 2000
learning_rate = 0.01
losses = []

for epoch in range(1, num_epochs + 1):
    # START YOUR CODE HERE 
    # TODO: Compute the gradient for the samples and update the model weights
    # Hint: Use model.compute_gradient() to compute the gradients
    # END YOUR CODE HERE
    
    with torch.no_grad():
        y_train_pred = model(x_train_augmented)
        y_test_pred = model(x_test_augmented)
        train_error = torch.mean((y_train_pred - y_train) ** 2).item()
        test_error = torch.mean((y_test_pred - y_test) ** 2).item()
        losses.append((train_error, test_error))
        if epoch % 50 == 0:
            print(f"Epoch {epoch}, training error: {train_error:.4f}, test error: {test_error:.4f}")

with torch.no_grad():
    y_train_pred = model(x_train_augmented)
    y_test_pred = model(x_test_augmented)
    train_error_bgd = torch.mean((y_train_pred - y_train) ** 2).item()
    test_error_bgd = torch.mean((y_test_pred - y_test)**2).item()

print(f"\nBatch Gradient Descent:")
print(f"Training error: {train_error_bgd:.4f}")
print(f"Test error: {test_error_bgd:.4f}")

In [None]:
from matplotlib import pyplot as plt
x_plot = torch.Tensor(np.linspace(x.min(), x.max(), 1000)).unsqueeze(1)
y_plot = x_plot @ model.w[0] + model.w[1]
y_plot = y_plot.detach().numpy()
plt.scatter(x_plot, y_plot)
plt.scatter(x_train, y_train)
plt.scatter(x_test, y_test)
plt.title("Linear Regression with Batch Gradient Descent")
plt.show()

In [None]:
plt.plot(losses)

### Linear Regression with Stochastic Gradient Descent (20 points)

In [None]:
model_sgd = LinearRegression(input_dim)

num_epochs = 1000
batch_size = 2
learning_rate = 0.001
losses = []

for epoch in range(1, num_epochs + 1):
    for i in range(0, len(x_train), batch_size):
        # START YOUR CODE HERE
        # TODO: Extract the current batch of training data (x_batch, y_batch)
        # Hint: Use indexing on x_train_augmented and y_train
        x_batch = None
        y_batch = None
        # END YOUR CODE HERE
        
        # START YOUR CODE HERE 
        # TODO: Compute the gradient for the current batch and update the model weights
        # Hint: Use model_sgd.compute_gradient() to compute the gradients
        # END YOUR CODE HERE

    with torch.no_grad():
        y_train_pred = model_sgd(x_train_augmented)
        y_test_pred = model_sgd(x_test_augmented)
        train_error_sgd = torch.mean((y_train_pred - y_train) ** 2).item()
        test_error_sgd = torch.mean((y_test_pred - y_test) ** 2).item()
        losses.append((train_error_sgd, test_error_sgd))

        if epoch % 50 == 0:
            print(f"Epoch: {epoch}, Train error: {train_error_sgd:.4f}, Test error: {test_error_sgd:.4f}")

print(f"\nStochastic Gradient Descent:")
print(f"Training error: {train_error_sgd:.4f}")
print(f"Test error: {test_error_sgd:.4f}")
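
The loop above visits the mini-batches in the same order every epoch. A common variation, sketched here on toy tensors, reshuffles the sample indices at the start of each epoch with `torch.randperm`. The names `features`, `targets`, and `iterate_minibatches` are hypothetical and not part of the assignment's code.

```python
import torch

torch.manual_seed(0)

# Toy data, not the assignment's dataset: 5 samples with 2 features each
features = torch.arange(10, dtype=torch.float32).reshape(5, 2)
targets = torch.arange(5, dtype=torch.float32).unsqueeze(1)
batch_size = 2

def iterate_minibatches(features, targets, batch_size):
    """Yield (x_batch, y_batch) pairs in a fresh random order."""
    permutation = torch.randperm(len(features))
    for start in range(0, len(features), batch_size):
        idx = permutation[start:start + batch_size]
        # Index both tensors with the same indices so rows stay aligned
        yield features[idx], targets[idx]

batches = list(iterate_minibatches(features, targets, batch_size))
batch_sizes = [len(x_batch) for x_batch, _ in batches]
print(batch_sizes)  # [2, 2, 1]: the last batch is smaller when batch_size does not divide n
```

Shuffling is optional for this assignment; it mainly prevents the updates from following the same sequence of batches in every epoch.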

In [None]:
x_plot = torch.Tensor(np.linspace(x.min(), x.max(), 1000)).unsqueeze(1)
y_plot = x_plot @ model_sgd.w[0] + model_sgd.w[1]
y_plot = y_plot.detach().numpy()
plt.scatter(x_plot, y_plot)
plt.scatter(x_train, y_train)
plt.scatter(x_test, y_test)
plt.title("Linear Regression with Stochastic Gradient Descent")
plt.show()

In [None]:
plt.plot(losses)

### Questions: 
1. (10 points) Compare the MSE values on the test dataset for each algorithm. Are they the same? Why or why not?
2. (10 points) Apply z-score normalization (which scales the values of a feature to have a mean of 0 and a standard deviation of 1) to the feature and comment on whether or not it affects the three algorithms.
3. (10 points) Apply a polynomial function to augment the feature and compare the MSE values for batch gradient descent. What do you observe?
4. (10 points) Change the learning rate of the stochastic gradient descent algorithm and compare the loss curves. What do you observe?
5. (10 points) Ridge regression adds an $L_2$ regularization term to the original mean squared error objective. The objective function becomes:
    $$ J(\beta) = \frac{1}{2n} \sum_i \left(x_i^\mathsf{T}\beta - y_i \right)^2 + \frac{\lambda}{2n} \sum_j \beta_j^2 ,$$ 
    where $\lambda \geq 0$ is a hyperparameter that controls the trade-off between fitting the data and keeping the coefficients small. Take the derivative of this objective function and derive the closed-form solution for $\beta$.
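
The preprocessing steps named in Questions 2 and 3 could be sketched as follows. This is a minimal illustration on toy tensors, not a prescribed solution: the variable names and the helper `polynomial_features` are hypothetical, and computing the statistics from the training split only is an assumption about good practice rather than a stated requirement. Note that `torch.std` uses the sample (Bessel-corrected) estimate by default.

```python
import torch

# Toy 1-D feature, not the assignment's dataset
x_train_toy = torch.tensor([[4.0], [8.0], [10.0], [14.0]])
x_test_toy = torch.tensor([[6.0], [12.0]])

# z-score normalization: statistics come from the training split only,
# and the same shift and scale are reused for the test split
mean = x_train_toy.mean(dim=0, keepdim=True)
std = x_train_toy.std(dim=0, keepdim=True)
x_train_z = (x_train_toy - mean) / std
x_test_z = (x_test_toy - mean) / std

def polynomial_features(x, degree):
    """Stack x, x^2, ..., x^degree and a trailing column of ones."""
    powers = [x ** p for p in range(1, degree + 1)]
    return torch.cat(powers + [torch.ones_like(x)], dim=1)

X_poly = polynomial_features(x_train_z, degree=2)
print(X_poly.shape)  # torch.Size([4, 3])
```

With `degree=2`, each row holds the normalized feature, its square, and the bias column, which matches the layout of the augmented matrix used earlier in the notebook.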

### Your answer here: