Name: Dev Patel 

Course: DS4400 Data Mining and Machine Learning 1

Prof: Silvio Amir

University: Northeastern University

Problem 3: Implement your own linear regression using the closed-form solution and compare with Problem 2.

- Implement the closed-form solution β = (X^T X)^{-1} X^T y using matrix operations; write a predict function.
- Compare MSE and R² on train/test with the sklearn model from Problem 2. Discuss if results are similar.
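
For reference, the closed-form expression in the first bullet follows from setting the gradient of the squared-error loss to zero. The short derivation below is a sketch under two assumptions not stated in the problem text: X already contains a column of ones for the intercept, and X^T X is invertible.

```latex
\begin{aligned}
L(\beta) &= \lVert y - X\beta \rVert_2^2 \\
\nabla_\beta L(\beta) &= -2\, X^\top (y - X\beta) = 0 \\
X^\top X\, \beta &= X^\top y \\
\hat{\beta} &= (X^\top X)^{-1} X^\top y
\end{aligned}
```

The third line is the system of normal equations; the last line requires the inverse to exist, which fails when feature columns are linearly dependent.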

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)
print("\nTrain columns:", list(train_df.columns))
train_df.head()

Train shape: (1000, 20)
Test shape: (1000, 22)

Train columns: ['Unnamed: 0', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']


,Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,1,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,2,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,3,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,4,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,5,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [3]:
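# Drop the target, identifier-like columns, and zipcode; names that are not
# present in the frame are simply ignored by the filter below.
exclude_cols = ['id', 'date', 'zipcode', 'price', '', 'Unnamed: 0']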
feature_cols = [c for c in train_df.columns if c not in exclude_cols]

X_train = train_df[feature_cols].values
y_train = train_df['price'].values
X_test = test_df[feature_cols].values
y_test = test_df['price'].values

print(f"Features: {feature_cols}")
print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")

Features: ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15']
X_train: (1000, 17), y_train: (1000,)


In [4]:
def fit_linear_regression(X, y):
    """
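    Least-squares fit of y ≈ β0 + X β.

    When the design matrix has full column rank, the result equals the
    closed-form solution β = (X^T X)^{-1} X^T y. np.linalg.lstsq (SVD-based)
    is used instead of an explicit inverse for numerical stability; if the
    columns are linearly dependent, it returns the minimum-norm solution.
    X: (n x p) feature matrix; the intercept column is added inside.
    Returns β of length p + 1, with the intercept in β[0].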
    """
    X_design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta

def predict(X, beta):
    """Predict response for new points. X should NOT include intercept (it is in beta[0])."""
    X_design = np.column_stack([np.ones(len(X)), X])
    return X_design @ beta

# Train: add intercept via design matrix, solve for β
beta = fit_linear_regression(X_train, y_train)
intercept, coefs = beta[0], beta[1:]
print(f"Coefficients (β): intercept={intercept:.2f}")
print(f"Feature coefs: {dict(zip(feature_cols, coefs.round(4)))}")

Coefficients (β): intercept=-23088898.61
Feature coefs: {'bedrooms': -14704.2805, 'bathrooms': 25687.784, 'sqft_living': 83.0836, 'sqft_lot': 0.3759, 'floors': 15555.581, 'waterfront': 715535.1705, 'view': 63027.898, 'condition': 18816.4028, 'grade': 79534.6027, 'sqft_above': 42.0111, 'sqft_basement': 41.0743, 'yr_built': -2400.6693, 'yr_renovated': 43.6829, 'lat': 553505.0323, 'long': -7424.0271, 'sqft_living15': 68.0158, 'sqft_lot15': -0.5155}
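
The cell above relies on `np.linalg.lstsq`. For comparison, here is a minimal sketch of solving the normal equations with explicit matrix operations, as the problem statement words it. The helper name `fit_normal_equations` and the synthetic data are assumptions for illustration only; they are not part of the notebook and do not use the housing data.

```python
import numpy as np

def fit_normal_equations(X, y):
    """Hypothetical helper: solve (X^T X) beta = X^T y with an intercept column prepended."""
    X_design = np.column_stack([np.ones(len(X)), X])
    gram = X_design.T @ X_design      # (p+1) x (p+1) matrix X^T X
    moment = X_design.T @ y           # (p+1,) vector X^T y
    # solve() avoids forming the inverse explicitly. It raises LinAlgError when it
    # detects a singular matrix; a nearly singular one may give unstable results.
    return np.linalg.solve(gram, moment)

# Synthetic, well-conditioned data (an assumption for the demo, not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 4.0 + X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

beta_ne = fit_normal_equations(X_demo, y_demo)

# Reference solution from lstsq on the same design matrix
X_demo_design = np.column_stack([np.ones(len(X_demo)), X_demo])
beta_lstsq, *_ = np.linalg.lstsq(X_demo_design, y_demo, rcond=None)

print("normal equations:", beta_ne.round(3))
print("lstsq:           ", beta_lstsq.round(3))
print("agree:", np.allclose(beta_ne, beta_lstsq))
```

On full-rank data like this the two routes agree. With linearly dependent columns, X^T X is singular and only the `lstsq` route (or `np.linalg.pinv`) is reliable.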


In [5]:
y_train_pred_ours = predict(X_train, beta)
y_test_pred_ours = predict(X_test, beta)

mse_train_ours = mean_squared_error(y_train, y_train_pred_ours)
r2_train_ours = r2_score(y_train, y_train_pred_ours)
mse_test_ours = mean_squared_error(y_test, y_test_pred_ours)
r2_test_ours = r2_score(y_test, y_test_pred_ours)

print("--- Our Implementation (Closed-Form) ---")
print(f"Train MSE: {mse_train_ours:,.2f}  R²: {r2_train_ours:.4f}")
print(f"Test  MSE: {mse_test_ours:,.2f}  R²: {r2_test_ours:.4f}")

--- Our Implementation (Closed-Form) ---
Train MSE: 31,486,167,775.79  R²: 0.7265
Test  MSE: 57,628,154,705.67  R²: 0.6544
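
The metrics above come from sklearn's `mean_squared_error` and `r2_score`. As a rough sketch of what those numbers mean, both can be written in a few lines of NumPy. The function names `mse` and `r2` and the toy arrays are hypothetical, chosen for illustration only.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy values (illustrative, not taken from the housing data)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 8.0])

print("MSE:", mse(y_true_demo, y_pred_demo))
print("R2: ", r2(y_true_demo, y_pred_demo))
```

Because prices are in dollars, MSE is in squared dollars, which is why the values above are on the order of 10^10. R² is unitless and easier to compare across splits.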


In [6]:
# sklearn model from Problem 2
model_sk = LinearRegression()
model_sk.fit(X_train, y_train)
y_train_pred_sk = model_sk.predict(X_train)
y_test_pred_sk = model_sk.predict(X_test)
mse_train_sk = mean_squared_error(y_train, y_train_pred_sk)
r2_train_sk = r2_score(y_train, y_train_pred_sk)
mse_test_sk = mean_squared_error(y_test, y_test_pred_sk)
r2_test_sk = r2_score(y_test, y_test_pred_sk)

print("--- sklearn (Problem 2) ---")
print(f"Train MSE: {mse_train_sk:,.2f}  R²: {r2_train_sk:.4f}")
print(f"Test  MSE: {mse_test_sk:,.2f}  R²: {r2_test_sk:.4f}")

--- sklearn (Problem 2) ---
Train MSE: 31,486,167,775.79  R²: 0.7265
Test  MSE: 57,628,154,705.67  R²: 0.6544


In [7]:
# Comparison
print("Comparison (ours vs sklearn):")
print(pd.DataFrame({
    'Ours (Train)': [mse_train_ours, r2_train_ours],
    'sklearn (Train)': [mse_train_sk, r2_train_sk],
    'Ours (Test)': [mse_test_ours, r2_test_ours],
    'sklearn (Test)': [mse_test_sk, r2_test_sk]
}, index=['MSE', 'R²']))
print("\nMSE match (train):", np.isclose(mse_train_ours, mse_train_sk))
print("MSE match (test):", np.isclose(mse_test_ours, mse_test_sk))

Comparison (ours vs sklearn):
     Ours (Train)  sklearn (Train)   Ours (Test)  sklearn (Test)
MSE  3.148617e+10     3.148617e+10  5.762815e+10    5.762815e+10
R²   7.265334e-01     7.265334e-01  6.543561e-01    6.543561e-01

MSE match (train): True
MSE match (test): True


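**Discussion:** The results are the same. Our implementation and sklearn's `LinearRegression` report identical metrics on both splits (train MSE 31,486,167,775.79, R² 0.7265; test MSE 57,628,154,705.67, R² 0.6544), and `np.isclose` confirms the MSE match. This is expected, since both solve the same ordinary least-squares problem. Note that `fit_linear_regression` calls `np.linalg.lstsq` rather than forming (X^T X)^{-1} explicitly; the two agree whenever X^T X is invertible, and `lstsq` remains stable when it is not. That caveat may matter here: in the rows shown by `head()`, `sqft_living` equals `sqft_above + sqft_basement`, and if that holds for every row the design matrix is rank-deficient, so the coefficient vector is not unique and the explicit inverse does not exist. Individual coefficients could then differ between solvers, but the fitted values are still uniquely determined, as are predictions on any rows that obey the same dependency, which is why the predictions and metrics are effectively identical. The drop from train R² (0.7265) to test R² (0.6544) is common to both models, so it reflects generalization rather than an implementation difference.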
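
The discussion's point about ill-conditioning can be checked directly. The sketch below uses synthetic data (an assumption; it is not the housing data) in which one column is the exact sum of two others. It shows that the design matrix loses rank, and that two different coefficient vectors can still give the same predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)

# Third feature is the exact sum of the first two, so the columns are dependent
X_demo = np.column_stack([a, b, a + b])
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])
y_demo = 1.0 + 2.0 * a - 1.0 * b + rng.normal(scale=0.1, size=100)

rank = np.linalg.matrix_rank(X_design)
cond = np.linalg.cond(X_design)
print(f"columns: {X_design.shape[1]}, rank: {rank}, condition number: {cond:.3e}")

# Minimum-norm least-squares solution from lstsq
beta_min, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

# A second solution, shifted along the null-space direction [0, 1, 1, -1]
# (a + b - (a + b) = 0, so this shift leaves X_design @ beta unchanged)
beta_alt = beta_min + 5.0 * np.array([0.0, 1.0, 1.0, -1.0])

print(np.allclose(beta_min, beta_alt))                        # False: coefficients differ
print(np.allclose(X_design @ beta_min, X_design @ beta_alt))  # True: predictions agree
```

Running the same `matrix_rank` and `cond` checks on the notebook's own design matrix would show whether the housing features are exactly or only nearly dependent.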