Name: Dev Patel 

Course: DS4400 Data Mining and Machine Learning 1

Prof: Silvio Amir

University: Northeastern University

Problem 2: Use an existing package to train and test a linear regression model for the house prediction dataset.

1. Train a multiple linear regression model on the training set using all features (except id, date, zipcode). Report coefficients, MSE, and R² on training data.
2. Evaluate on the testing set. Report MSE and R².
3. Interpret: Which features contribute most? Is the model fitting well? How large is the error? How do training and testing MSE relate?

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)
print("\nTrain columns:", list(train_df.columns))
print("\nTrain head:")
train_df.head()

Train shape: (1000, 20)
Test shape: (1000, 22)

Train columns: ['Unnamed: 0', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']

Train head:


Unnamed: 0.1,Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,1,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,2,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,3,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,4,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,5,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [3]:
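# Drop the target plus the columns excluded by the assignment (id, date, zipcode)
# and the unnamed index column left over from the CSV export.
exclude_cols = ['id', 'date', 'zipcode', 'price', 'Unnamed: 0']
feature_cols = [c for c in train_df.columns if c not in exclude_cols]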

X_train = train_df[feature_cols]
y_train = train_df['price']
X_test = test_df[feature_cols]
y_test = test_df['price']

print(f"Features: {feature_cols}")
print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"X_test: {X_test.shape}, y_test: {y_test.shape}")

Features: ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15']
X_train: (1000, 17), y_train: (1000,)
X_test: (1000, 17), y_test: (1000,)


In [4]:
model = LinearRegression()
model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

coef_df = pd.DataFrame({'Feature': X_train.columns, 'Coefficient': model.coef_})
coef_df = coef_df.sort_values('Coefficient', key=abs, ascending=False)
print("Coefficients (sorted by |coef|):")
print(coef_df.to_string(index=False))
print(f"\nIntercept: {model.intercept_:.2f}")
print(f"\n--- Training Metrics ---")
print(f"MSE: {train_mse:,.2f}")
print(f"R²:  {train_r2:.4f}")

Coefficients (sorted by |coef|):
      Feature   Coefficient
   waterfront 715535.170470
          lat 553505.032276
        grade  79534.602722
         view  63027.898001
    bathrooms  25687.783987
    condition  18816.402756
       floors  15555.580988
     bedrooms -14704.280497
         long  -7424.027121
     yr_built  -2400.669330
  sqft_living     83.084210
sqft_living15     68.015792
 yr_renovated     43.682942
   sqft_above     42.010495
sqft_basement     41.073715
   sqft_lot15     -0.515528
     sqft_lot      0.375930

Intercept: -23088898.61

--- Training Metrics ---
MSE: 31,486,167,775.79
R²:  0.7265


In [5]:
y_test_pred = model.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

print("--- Testing Metrics ---")
print(f"MSE: {test_mse:,.2f}")
print(f"R²:  {test_r2:.4f}")
print(f"RMSE: ${np.sqrt(test_mse):,.0f}")

--- Testing Metrics ---
MSE: 57,628,154,705.67
R²:  0.6544
RMSE: $240,059


## Part 3: Interpretation of Results

**Which features contribute most to the model?**  
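Ranked by absolute coefficient in the training output, the largest are `waterfront` (about 715,535), `lat` (about 553,505), `grade` (about 79,535), and `view` (about 63,028). Raw coefficients are measured per unit of each feature, so this ranking partly reflects scale: `waterfront` is a 0/1 flag, while `sqft_living` adds only about 83 dollars per square foot yet differs by hundreds or thousands of square feet between houses. Location, quality, and size all appear to drive price, but comparing their strength fairly would require standardized features.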
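
Because raw coefficient size depends on each feature's units, one hedged way to compare features is to refit on standardized inputs. The sketch below uses synthetic data, not the house dataset; the `flag` and `size` columns and their true effects are assumptions chosen only to show how the ranking can flip.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the house dataset).
rng = np.random.default_rng(0)
n = 500
flag = rng.integers(0, 2, size=n).astype(float)   # 0/1 feature, similar in kind to waterfront
size = rng.normal(2000.0, 800.0, size=n)          # large-unit feature, similar in kind to sqft_living
y = 50_000.0 * flag + 100.0 * size + rng.normal(0.0, 10_000.0, size=n)
X = np.column_stack([flag, size])

# Raw fit: coefficients are "price change per one unit of the feature".
raw = LinearRegression().fit(X, y)

# Standardized fit: coefficients are "price change per one standard deviation".
std = LinearRegression().fit(StandardScaler().fit_transform(X), y)

print("raw coefficients [flag, size]:         ", raw.coef_)
print("standardized coefficients [flag, size]:", std.coef_)
```

In this synthetic setup the flag has the larger raw coefficient, but the size column has the larger standardized one, which is why a raw ranking alone should be read with caution.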

**Is the model fitting the data well?**  
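R² measures the share of price variance that the model explains, with values near 1 indicating a close fit. Here the model explains about 73% of the variance on the training set (R² = 0.7265) and about 65% on the test set (R² = 0.6544). That is a moderate fit: a linear model captures the main trends, but house prices also depend on non-linear effects and interactions that it cannot represent.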
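
For reference, the value reported by `r2_score` follows the standard definition below, where $y_i$ is an observed price, $\hat{y}_i$ is the model's prediction, and $\bar{y}$ is the mean observed price in the set being evaluated:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

The numerator is $n$ times the MSE, so R² compares the model's squared error with that of always predicting the mean price.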

**How large is the model error?**  
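MSE is the average squared prediction error, so its units are dollars squared. RMSE (√MSE) is in dollars and is easier to read as a typical prediction error. The test RMSE is about 240,059 dollars and the training RMSE about 177,443 dollars. Against a mean training price of about 520,415 dollars, that is roughly 46% and 34% of the mean, so the errors are large in practical terms.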
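
The arithmetic behind that comparison can be sketched as follows. The three numbers are copied from this notebook's printed output; the mean price is the rounded training mean, so the ratios are approximate.

```python
import math

# Values copied from the notebook's printed output.
train_mse = 31_486_167_775.79
test_mse = 57_628_154_705.67
mean_price = 520_415.0  # rounded mean of the training prices

# RMSE puts the error back into dollars.
train_rmse = math.sqrt(train_mse)
test_rmse = math.sqrt(test_mse)

# Express each RMSE as a fraction of the mean price.
print(f"Train RMSE / mean price: {train_rmse / mean_price:.1%}")
print(f"Test RMSE / mean price:  {test_rmse / mean_price:.1%}")
```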

**How do training and testing MSE relate?**  
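- If test MSE ≈ train MSE, the model generalizes well.
- If test MSE >> train MSE, the model may be overfitting, or the test set may contain more extreme prices than the training set.
- If both are high, the model underfits; a more flexible model or better features may help.

Here the test MSE (about 57.6 billion) is about 1.83 times the training MSE (about 31.5 billion), and R² falls from 0.7265 to 0.6544. The model generalizes only moderately well, and the training R² of about 0.73 suggests some underfitting as well.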
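
A single train/test split gives only one estimate of the gap. One optional check, not part of the assignment, is k-fold cross-validation, which shows how much MSE varies from split to split. The sketch below runs on synthetic data from `make_regression`, not the house data; the sample count, feature count, and noise level are assumptions chosen to roughly mirror the training set's shape.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training set (NOT the house data).
X, y = make_regression(n_samples=1000, n_features=17, noise=25.0, random_state=0)

# scikit-learn reports negated MSE so that larger is always better; flip the sign back.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)
fold_mse = -neg_mse

print("Per-fold MSE:", np.round(fold_mse, 1))
print(f"Mean: {fold_mse.mean():.1f}  Std: {fold_mse.std():.1f}")
```

If the per-fold MSE values on the real data spread widely, part of the 1.83x gap could be due to which houses landed in the test set rather than to overfitting.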

In [6]:
# Interpretation summary
print(f"Train RMSE: ${np.sqrt(train_mse):,.0f}  |  Test RMSE: ${np.sqrt(test_mse):,.0f}")
print(f"Mean price: ${y_train.mean():,.0f}  |  Test/Train MSE ratio: {test_mse/train_mse:.2f}x")

Train RMSE: $177,443  |  Test RMSE: $240,059
Mean price: $520,415  |  Test/Train MSE ratio: 1.83x
