# Jupyter Notebook for DS4400 Homework 2
## Advait Dharampal 

### *[C]* Problem 2: Linear regression (15 points)

In this problem, you will use an existing package of your choice for training and testing a linear regression model for the house prediction
dataset.

1. Use an existing package to train a multiple linear regression model on the training set using all the features (except the ones excluded
above). Report the coefficients of the linear regression models and the following metrics on the training data: (1) MSE metric; (2)
$R^2$ metric.
2. Evaluate the model on the testing set. Report the MSE and $R^2$ metrics on the testing set.
3. Interpret the results in your own words. Which features contribute mostly to the linear regression model? Is the model fitting the data
well? How large is the model error? How do the training and testing MSE relate?

In [1]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd

In [2]:
# Load Datasets
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

In [3]:
# Clean Features
excluded = ['id', 'date', 'zipcode', 'price', 'Unnamed: 0']
features = [f for f in train_df.columns if f not in excluded]

# Define splits
X_train = train_df[features]
y_train = train_df['price'] / 1000

X_test = test_df[features]
y_test = test_df['price'] / 1000 

# Data Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training 
model = LinearRegression()
model.fit(X_train_scaled, y_train)

In [4]:
# Evaluation Train: 
y_train_pred = model.predict(X_train_scaled)
mse_train = mean_squared_error(y_train, y_train_pred)
r2_train = r2_score(y_train, y_train_pred)

print(f"Train MSE: {mse_train}")
print(f"Train R^2 {r2_train}")

# Evaluation Test
y_test_pred = model.predict(X_test_scaled)
mse_test = mean_squared_error(y_test, y_test_pred)
r2_test = r2_score(y_test, y_test_pred)

print(f"\nTest MSE: {mse_test}")
print(f"Test R^2 {r2_test}")

# Extract Coefficients
coef_df = pd.DataFrame({'Feature': features, 'Coefficient': model.coef_})
print()
print(coef_df)

Train MSE: 31486.16777579488
Train R^2 0.7265334318706018

Test MSE: 57628.15470567037
Test R^2 0.6543560876120955

          Feature  Coefficient
0        bedrooms   -12.521962
1       bathrooms    18.527633
2     sqft_living    56.748837
3        sqft_lot    10.881868
4          floors     8.043721
5      waterfront    63.742900
6            view    48.200109
7       condition    12.964269
8           grade    92.231475
9      sqft_above    48.290089
10  sqft_basement    27.137032
11       yr_built   -67.643117
12   yr_renovated    17.271380
13            lat    78.375737
14           long    -1.035203
15  sqft_living15    45.577658
16     sqft_lot15   -12.930091


2.3:
The features with the biggest impact on price in this linear regression model are grade, lat, yr_built, and waterfront. They have the largest coefficients in absolute value, and because the features were standardized, the coefficient magnitudes are directly comparable. The model fits the data moderately well: the training R^2 of 0.727 means that roughly 73% of the variance in price is accounted for, so the relationship between price and the features is only partly captured. Going from the training set to the testing set, R^2 drops by 0.0722 (to 0.654) and MSE rises by about 26,142 (from 31,486 to 57,628, in thousands of dollars squared). The higher test MSE indicates some overfitting and suggests that predictions are less reliable for houses outside the training set. Taking the square root of the MSE gives a typical error of about 177 thousand dollars on the training houses and about 240 thousand dollars on the testing houses.
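
The size of the error in dollars can be read off the reported MSE values by taking a square root (RMSE). The snippet below is only a small sketch of that arithmetic; the two MSE values are copied from the printed output above, and the variable names `rmse_train` and `rmse_test` are introduced here for illustration.

```python
import math

# MSE values copied from the printed output above. The targets are
# price / 1000, so MSE is in (thousands of dollars) squared.
mse_train = 31486.16777579488
mse_test = 57628.15470567037

# The square root brings the error back to thousands of dollars.
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

print(f"Train RMSE: {rmse_train:.1f} thousand dollars")
print(f"Test RMSE:  {rmse_test:.1f} thousand dollars")
```

RMSE is in the same units as the target, which makes it easier to compare against typical house prices than the raw MSE.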


### *[C]* Problem 3:  Implementing closed-form solution for linear regression (15 points)

In this problem, you will implement your own linear regression model, using the closed-form solution we derived in class. You will also
compare your model with the one trained with the package in Problem 2 on the same house price prediction dataset.

- Implement the closed-form solution for multiple linear regression using matrix operations and train a model on the training set. Write
a function to predict the response on a new testing point.
- Compare the models given by your implementation with those trained in Problem 2 by the Python packages. Report the MSE and $R^2$
metrics for the models you implemented on both training and testing sets and compare these metrics to the ones given by the package
implementation from Problem 2. Discuss if the results of your implementation are similar to those of the package.

In [5]:
def closed_form_linreg(X, y):
    # Normal equations: w = (X^T X)^+ X^T y, with the pseudo-inverse
    # used so a singular X^T X does not raise an error.
    X_T = X.T
    X_T_X_inv = np.linalg.pinv(X_T @ X)

    w = X_T_X_inv @ (X_T @ y)

    return w

# Function to predict response with new testing point
def pred_closed_form(X, w):
    return X @ w

In [6]:
# Adding intercept: 
X_train_scaled = np.hstack(
    (np.ones((X_train_scaled.shape[0], 1)), X_train_scaled)
)

X_test_scaled = np.hstack(
    (np.ones((X_test_scaled.shape[0], 1)), X_test_scaled)
)

# R^2 and MSE: 
train_w = closed_form_linreg(X_train_scaled, y_train)

y_cf_pred_train = pred_closed_form(X_train_scaled, train_w)
mse_train = mean_squared_error(y_train, y_cf_pred_train)
r2_train = r2_score(y_train, y_cf_pred_train)

print(f"CF Train MSE: {mse_train}")
print(f"CF Train R^2 {r2_train}")

y_cf_pred_test = pred_closed_form(X_test_scaled, train_w)
mse_test = mean_squared_error(y_test, y_cf_pred_test)
r2_test = r2_score(y_test, y_cf_pred_test)

print(f"\nCF Test MSE: {mse_test}")
print(f"CF Test R^2 {r2_test}")

CF Train MSE: 31486.167775794875
CF Train R^2 0.7265334318706018

CF Test MSE: 57628.154705670146
CF Test R^2 0.6543560876120968


The results match the sklearn results from Problem 2 up to floating-point precision: the MSE and R^2 values agree on both the training and testing sets. This indicates that the closed-form solution, once an intercept column is added, computes the same least-squares fit as sklearn's LinearRegression.
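
The agreement can also be checked at the level of the coefficients. The sketch below uses synthetic data (not the housing set) and compares the normal-equation solution with NumPy's `np.linalg.lstsq` as a reference least-squares solver; the data, the sizes `n` and `d`, and the names `w_normal` and `w_lstsq` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a scaled design matrix (hypothetical data).
n, d = 200, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = 3.0 + X @ true_w + 0.1 * rng.normal(size=n)

# Prepend a column of ones so the first weight acts as the intercept.
X_b = np.hstack((np.ones((n, 1)), X))

# Normal-equation solution, mirroring closed_form_linreg above.
w_normal = np.linalg.pinv(X_b.T @ X_b) @ (X_b.T @ y)

# Reference least-squares solution from NumPy.
w_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

max_diff = np.max(np.abs(w_normal - w_lstsq))
print(f"Largest coefficient difference: {max_diff:.2e}")
```

On well-conditioned data like this, the two solutions should differ only by rounding error.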

### *[C]* Problem 4: Polynomial Regression (15 points)

- Consider a feature $X$, a response variable $Y$, and $N$ samples of training data. Implement a polynomial regression model that fits a polynomial of degree $p$ to the data using the least-square method. Use your own implementation from Problem 3 and adapt it for polynomial
regression. If $p=2$, the model will use two features ($X$ and $X^2$), if $p=3$ the model will use 3 features ($X,X^2,X^3$), and so on for larger values of $p$.
- Consider the house price prediction problem with feature $X=$ `sqft_living`. Train a polynomial regression model for different values of $p \le 5$ using your implementation. Include a table with the MSE and $R^2$ metrics on both the training and testing data for at least 3 different values of $p$. Discuss your observations on how the MSE and $R^2$ metrics change with the degree of the polynomial.

In [7]:
def closed_poly(X, p):
    X = X.reshape(-1, 1) 
    X_poly = np.hstack([X**i for i in range(1, p+1)])
    X_poly = np.hstack((np.ones((X_poly.shape[0], 1)), X_poly))
    return X_poly

In [8]:
# Training on sqft 
sqft_train = train_df['sqft_living'].values
sqft_test = test_df['sqft_living'].values

results = []
for p in range(1, 6):
    
    X_poly_train = closed_poly(sqft_train, p)
    X_poly_test = closed_poly(sqft_test, p)
    
    w = closed_form_linreg(X_poly_train, y_train)
    
    y_train_pred = X_poly_train @ w
    y_test_pred = X_poly_test @ w
    
    results.append({
        "Degree p": p,
        "Train MSE": mean_squared_error(y_train, y_train_pred),
        "Train R2": r2_score(y_train, y_train_pred),
        "Test MSE": mean_squared_error(y_test, y_test_pred),
        "Test R2": r2_score(y_test, y_test_pred)
    })

results_df = pd.DataFrame(results)
print(results_df)

   Degree p      Train MSE  Train R2      Test MSE    Test R2
0         1   57947.526161  0.496709  8.857598e+04   0.468736
1         2   56662.028385  0.507874  8.112345e+04   0.513435
2         3   65018.564631  0.435295  1.380955e+05   0.171726
3         4   89997.823678  0.218343  4.657158e+05  -1.793285
4         5  129385.494552 -0.123751  1.369025e+07 -81.111807


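Polynomial regression on sqft_living performs best with the quadratic model (p=2), which reaches an R^2 of about 0.51 and the lowest MSE on both the training and testing sets. Every higher degree does worse: by p=5 the training R^2 is -0.12 and the testing R^2 is about -81. Because each higher-degree model contains the lower-degree ones, an exact least-squares fit could not increase the training MSE. The rising training error therefore points to numerical instability, most likely because powers of the unscaled sqft_living values make X^T X badly conditioned, rather than to overfitting alone.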
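
One way to see why high-degree fits on raw square footage can become unstable is to look at the condition number of the polynomial design matrix. The sketch below uses made-up square-footage values (not the real data) and a hypothetical helper `poly_design`; it compares raw powers with powers of the standardized feature. Forming X^T X in the normal equations squares whatever condition number the design matrix has.

```python
import numpy as np

def poly_design(x, p):
    """Design matrix with columns 1, x, x^2, ..., x^p."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.hstack([x**i for i in range(0, p + 1)])

rng = np.random.default_rng(0)
# Hypothetical square-footage values in a plausible range (not the real data).
sqft = rng.uniform(500, 8000, size=500)

# Standardize the single feature before raising it to powers.
sqft_std = (sqft - sqft.mean()) / sqft.std()

for p in (1, 3, 5):
    cond_raw = np.linalg.cond(poly_design(sqft, p))
    cond_std = np.linalg.cond(poly_design(sqft_std, p))
    print(f"p={p}: cond(raw)={cond_raw:.2e}, cond(standardized)={cond_std:.2e}")
```

If the same pattern holds for the housing data, standardizing sqft_living before building the polynomial features would be a reasonable thing to try.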

### *[C]* Problem 5: Gradient descent (20 points)

In this problem, you will implement your own gradient descent algorithm and apply it to linear regression on the same house prediction dataset.

1. Write code for gradient descent for training linear regression using the algorithm from class.
2. Vary the value of the learning rate (at least 3 different values $\alpha \in \{0.01,0.1,0.5\}$) and report the value of the model parameter $\theta$ after different number of iterations (10, 50, and 100). Include in a table the MSE and $R^2$ metrics on the training and testing set for the different number of iterations and different learning rates. You can choose more values of the learning rates to observe how the  behavior of the algorithm changes.
3. Write some observations about the behavior of the algorithm: How do the metrics change with different learning rates; How many iterations are needed; Does the algorithm converge to the optimal solution, etc.

In [9]:
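def gradient_descent(X, y, epochs, learning_rate):
    # X has one row per sample and one column per feature (including the intercept)
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)

    for _ in range(epochs):
        # Gradient of the MSE loss with respect to theta
        gradient = (2/n_samples) * X.T @ (X @ theta - y)
        theta = theta - learning_rate * gradient
    
    return theta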

In [12]:
rates = [0.01, 0.1, 0.5]
iterations = [10, 50, 100]
gd_results = []
for rate in rates:
    for iters in iterations:
        
        theta = gradient_descent(X_train_scaled, y_train, iters, rate)
        
        y_train_pred = X_train_scaled @ theta
        y_test_pred = X_test_scaled @ theta
        
        gd_results.append({
            "learning_rate": rate,
            "iterations": iters,
            "Train MSE": mean_squared_error(y_train, y_train_pred),
            "Train R2": r2_score(y_train, y_train_pred),
            "Test MSE": mean_squared_error(y_test, y_test_pred),
            "Test R2": r2_score(y_test, y_test_pred)
        })

pd.set_option("display.float_format", "{:.2f}".format)

gd_results_df = pd.DataFrame(gd_results).round(2)
print(gd_results_df)

   learning_rate  iterations  \
0           0.01          10   
1           0.01          50   
2           0.01         100   
3           0.10          10   
4           0.10          50   
5           0.10         100   
6           0.50          10   
7           0.50          50   
8           0.50         100   

                                           Train MSE  \
0                                          235727.77   
1                                           69720.50   
2                                           36820.35   
3                                           35105.10   
4                                           31497.26   
5                                           31486.43   
6                              145606446341105152.00   
7 12595424869718327231925228640627754361405936290...   
8 33227915364327251158449253502425703813999040941...   

                                            Train R2  \
0                                              -1.05   
1     

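This shows that at a learning rate of 0.5 the algorithm diverges, with an extremely large MSE (about 1.5e17 on the training set after only 10 iterations) and a negative R^2. For the two smaller learning rates, the MSE decreases as the number of iterations grows, meaning the model is converging. With 0.01 convergence is slow: the training MSE falls from 235,727.77 after 10 iterations to 36,820.35 after 100, still above the closed-form value of 31,486.17. With 0.1 the training MSE reaches 31,497.26 after 50 iterations and 31,486.43 after 100, essentially the optimal solution from Problem 3. Of the three values, 0.1 was the best learning rate, as it did not converge too slowly like 0.01 and did not diverge like 0.5.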
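
The divergence at large learning rates has a standard explanation: for the MSE loss the Hessian is (2/n) X^T X, and plain gradient descent converges only when the learning rate is below 2 divided by the largest eigenvalue of that Hessian. The sketch below illustrates this on synthetic, correlated features (an assumption, not the housing data); `alpha_limit`, `mse_safe`, and `mse_unsafe` are names introduced here for illustration.

```python
import numpy as np

def gradient_descent(X, y, epochs, learning_rate):
    """Batch gradient descent on the MSE loss, same update as above."""
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(epochs):
        gradient = (2 / n_samples) * X.T @ (X @ theta - y)
        theta = theta - learning_rate * gradient
    return theta

rng = np.random.default_rng(0)
# Hypothetical correlated, standardized features plus an intercept column.
n = 500
base = rng.normal(size=(n, 1))
X = base + 0.5 * rng.normal(size=(n, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)
X = np.hstack((np.ones((n, 1)), X))
y = X @ np.array([5.0, 1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

# Step-size limit: learning_rate must stay below 2 / lambda_max(Hessian).
hessian = (2 / n) * X.T @ X
lambda_max = np.linalg.eigvalsh(hessian).max()
alpha_limit = 2 / lambda_max

def mse(theta):
    return np.mean((X @ theta - y) ** 2)

mse_safe = mse(gradient_descent(X, y, 100, 0.5 * alpha_limit))
mse_unsafe = mse(gradient_descent(X, y, 100, 1.5 * alpha_limit))
print(f"step-size limit: {alpha_limit:.3f}")
print(f"MSE below the limit: {mse_safe:.3f}")
print(f"MSE above the limit: {mse_unsafe:.3e}")
```

Correlated features raise the largest eigenvalue and so lower the limit. For the housing features the limit presumably lies between 0.1 and 0.5, which would be consistent with the table above, though that was not computed here.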