Name: Dev Patel 

Course: DS4400 Data Mining and Machine Learning 1

Prof: Silvio Amir

University: Northeastern University

Problem 4: Polynomial Regression

- Implement polynomial regression using the closed-form solution from Problem 3. For degree $p$, the model uses features $(X, X^2, \ldots, X^p)$.
- Train on `sqft_living` for $p = 1, 2, 3, 4, 5$. Report MSE and R² on train/test. Discuss how metrics change with $p$.
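
For reference, the degree-$p$ model and the closed-form estimate can be written out as below. The symbols $\Phi$ (the design matrix including an intercept column), $x_i$ (the `sqft_living` value of sample $i$) and $n$ (the number of training samples) are notation assumed here, not taken from the assignment text.

$$
\hat{y}_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_p x_i^p
$$

$$
\Phi =
\begin{bmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^p \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^p
\end{bmatrix},
\qquad
\hat{\beta} = (\Phi^\top \Phi)^{-1} \Phi^\top y
$$

The model is still linear in $\beta$, which is why the linear-regression solution applies unchanged once the powers of $x$ are treated as separate features.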

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

X_train_raw = train_df['sqft_living'].values.astype(float)
y_train = train_df['price'].values.astype(float)
X_test_raw = test_df['sqft_living'].values.astype(float)
y_test = test_df['price'].values.astype(float)

print(f"Train: {len(X_train_raw)} samples, Test: {len(X_test_raw)} samples")
print(f"sqft_living range: [{X_train_raw.min():.0f}, {X_train_raw.max():.0f}]")

Train: 1000 samples, Test: 1000 samples
sqft_living range: [380, 6070]


In [3]:
def poly_features(X, degree):
    """Build polynomial feature matrix [X, X^2, ..., X^p] from 1-D array X."""
    return np.column_stack([X ** d for d in range(1, degree + 1)])

def fit_linear_regression(X, y):
    """Closed-form: β = (X^T X)^{-1} X^T y (via lstsq). Adds intercept column."""
    X_design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta

def predict(X, beta):
    """Predict using β. X should NOT include intercept."""
    X_design = np.column_stack([np.ones(len(X)), X])
    return X_design @ beta
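
A quick sanity check for these helpers is to fit noiseless data generated from a known polynomial and confirm the coefficients are recovered. The sketch below repeats the three functions so that it runs on its own; the toy data (`x_toy`, `true_beta`) is made up for illustration and is not part of the assignment data.

```python
import numpy as np

def poly_features(X, degree):
    """Build polynomial feature matrix [X, X^2, ..., X^p] from 1-D array X."""
    return np.column_stack([X ** d for d in range(1, degree + 1)])

def fit_linear_regression(X, y):
    """Least-squares fit via lstsq. Adds an intercept column."""
    X_design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta

def predict(X, beta):
    """Predict using beta. X should NOT include the intercept column."""
    X_design = np.column_stack([np.ones(len(X)), X])
    return X_design @ beta

# Hypothetical toy data: a noiseless quadratic y = 1 - 3x + 0.5x^2
x_toy = np.linspace(-2.0, 2.0, 21)
true_beta = np.array([1.0, -3.0, 0.5])  # [intercept, x, x^2]
y_toy = true_beta[0] + true_beta[1] * x_toy + true_beta[2] * x_toy ** 2

# A degree-2 fit should recover the generating coefficients
X_toy_poly = poly_features(x_toy, 2)
beta_hat = fit_linear_regression(X_toy_poly, y_toy)
y_hat = predict(X_toy_poly, beta_hat)

print("Recovered coefficients:", np.round(beta_hat, 6))
print("Max abs prediction error:", np.abs(y_hat - y_toy).max())
```

Because the toy inputs are small and centered around zero, this check isolates the logic of the helpers from the scaling problems that raw square-footage values introduce.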

In [4]:
results = []

for p in range(1, 6):
    X_train_poly = poly_features(X_train_raw, p)
    X_test_poly = poly_features(X_test_raw, p)

    beta = fit_linear_regression(X_train_poly, y_train)
    
    y_train_pred = predict(X_train_poly, beta)
    y_test_pred = predict(X_test_poly, beta)
    
    train_mse = mean_squared_error(y_train, y_train_pred)
    train_r2 = r2_score(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    
    results.append({'Degree (p)': p,
                    'Train MSE': train_mse,
                    'Train R²': train_r2,
                    'Test MSE': test_mse,
                    'Test R²': test_r2})

results_df = pd.DataFrame(results)
results_df['Train MSE'] = results_df['Train MSE'].map('{:,.0f}'.format)
results_df['Test MSE'] = results_df['Test MSE'].map('{:,.0f}'.format)
results_df['Train R²'] = results_df['Train R²'].map('{:.4f}'.format)
results_df['Test R²'] = results_df['Test R²'].map('{:.4f}'.format)
print("Polynomial Regression Results (feature = sqft_living):\n")
results_df

Polynomial Regression Results (feature = sqft_living):



| Degree (p) | Train MSE | Train R² | Test MSE | Test R² |
|---|---|---|---|---|
| 1 | 57,947,526,161 | 0.4967 | 88,575,978,543 | 0.4687 |
| 2 | 54,822,665,116 | 0.5238 | 71,791,679,479 | 0.5694 |
| 3 | 53,785,194,716 | 0.5329 | 99,833,483,777 | 0.4012 |
| 4 | 52,795,850,030 | 0.5415 | 249,716,356,906 | -0.4978 |
| 5 | 54,114,912,599 | 0.5300 | 1,113,838,006,160 | -5.6806 |
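
The reported numbers can be summarized with a few lines of pandas. This is a sketch that re-enters the MSE values from the output above by hand (the frame `reported` and its column names are introduced here for illustration) and computes the test-to-train MSE ratio, the degree with the lowest test MSE, and any degree where train MSE went up instead of down.

```python
import pandas as pd

# MSE values copied from the results table above
reported = pd.DataFrame({
    "degree": [1, 2, 3, 4, 5],
    "train_mse": [57947526161, 54822665116, 53785194716,
                  52795850030, 54114912599],
    "test_mse": [88575978543, 71791679479, 99833483777,
                 249716356906, 1113838006160],
})

# Ratio of test to train MSE: a rough indicator of the generalization gap
reported["test_to_train"] = reported["test_mse"] / reported["train_mse"]

# Degree with the lowest test MSE
best_degree = int(reported.loc[reported["test_mse"].idxmin(), "degree"])

# Degrees whose train MSE is higher than at the previous degree
train_increase = reported.loc[reported["train_mse"].diff() > 0, "degree"].tolist()

print(reported.round(2))
print("Lowest test MSE at degree:", best_degree)
print("Train MSE increased at degree(s):", train_increase)
```

A train MSE that increases when a feature is added is worth flagging, because an exact least-squares fit on nested feature sets cannot do worse on the training data.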


**Discussion:**

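- **Increasing p improves training fit, up to a point:** As the polynomial degree increases, the model has more flexibility. Train MSE falls from about $5.79 \times 10^{10}$ at $p=1$ to $5.28 \times 10^{10}$ at $p=4$, and Train R² rises from 0.4967 to 0.5415, since a higher-degree polynomial can capture more of the non-linear relationship between `sqft_living` and `price`. At $p=5$, however, Train MSE rises again to $5.41 \times 10^{10}$. An exact least-squares fit cannot get worse on the training data when a feature is added, so this increase most likely reflects numerical ill-conditioning of the raw polynomial features rather than a real loss of fit.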

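- **Test performance peaks at p=2, then degrades:** The linear model ($p=1$) underfits, with Test R² of 0.4687. The quadratic model ($p=2$) gives the best test results, with the lowest Test MSE (about $7.18 \times 10^{10}$) and the highest Test R² (0.5694). Beyond that, test error grows quickly: Test R² drops to 0.4012 at $p=3$, -0.4978 at $p=4$, and -5.6806 at $p=5$. A negative R² means the model predicts the test set worse than a constant equal to the mean test price.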

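- **Overfitting risk:** The gap between train and test MSE widens sharply at high $p$. At $p=5$ the Test MSE (about $1.11 \times 10^{12}$) is roughly 20 times the Train MSE (about $5.41 \times 10^{10}$), which signals that the model fits training patterns that do not generalize. Polynomial features with large exponents also create very large values: with `sqft_living` up to 6070, $X^5$ reaches about $8 \times 10^{18}$, which makes the least-squares problem poorly conditioned and can amplify numerical instability.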

- **Best trade-off:** Among the degrees tried, $p=2$ minimizes Test MSE and maximizes Test R², so it represents the best bias-variance trade-off for this single-feature regression.
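
The numerical-instability point can be illustrated by comparing the condition number of the degree-5 design matrix built from raw values against one built from standardized values. The sketch below is not part of the submitted solution: it uses synthetic inputs drawn uniformly from the training range [380, 6070] as a stand-in for `sqft_living`, and `poly_features_scaled` is a hypothetical helper introduced here.

```python
import numpy as np

def poly_features(X, degree):
    """Build polynomial feature matrix [X, X^2, ..., X^p] from 1-D array X."""
    return np.column_stack([X ** d for d in range(1, degree + 1)])

def poly_features_scaled(X, degree, mean, std):
    """Hypothetical variant: standardize X with the given statistics, then expand."""
    z = (X - mean) / std
    return np.column_stack([z ** d for d in range(1, degree + 1)])

# Synthetic stand-in for sqft_living (assumption: uniform over the training range)
rng = np.random.default_rng(0)
x_synth = rng.uniform(380.0, 6070.0, size=1000)

# Scaling statistics would come from the training split only
mu, sigma = x_synth.mean(), x_synth.std()

ones = np.ones(len(x_synth))
design_raw = np.column_stack([ones, poly_features(x_synth, 5)])
design_scaled = np.column_stack([ones, poly_features_scaled(x_synth, 5, mu, sigma)])

# A large condition number means small rounding errors can be strongly amplified
cond_raw = np.linalg.cond(design_raw)
cond_scaled = np.linalg.cond(design_scaled)

print(f"Condition number, raw features:    {cond_raw:.2e}")
print(f"Condition number, scaled features: {cond_scaled:.2e}")
```

Standardizing before expansion spans the same set of degree-$p$ polynomials, so in exact arithmetic the fitted curve would be unchanged. It addresses only the numerical side of the problem and would not by itself remove the overfitting seen at high degrees.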