# Week 4 - Day 4 Assignment  
## Dataset: House Prices  

### Objective:
- Perform 10-Fold Cross Validation for:
  - Linear Regression
  - Random Forest Regressor
- Compare performance metrics
- Document insights on model stability and generalization

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings("ignore")

In [11]:
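# Load the dataset and take a first look at its contents
df = pd.read_csv("Housing.csv", encoding='latin-1')
df.head()
df.columns
# One-hot encode categorical columns; drop_first avoids redundant dummy columns
df = pd.get_dummies(df, drop_first=True)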
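
To illustrate what the `pd.get_dummies(df, drop_first=True)` step does, here is a minimal sketch on a toy frame. The column names and values below are hypothetical and are not taken from `Housing.csv`; only the encoding behavior is the point.

```python
import pandas as pd

# Hypothetical toy data (assumption): not the real Housing.csv columns.
toy = pd.DataFrame({
    "price": [100, 200, 150],
    "mainroad": ["yes", "no", "yes"],
    "furnishing": ["furnished", "semi", "unfurnished"],
})

# Each text column becomes one indicator column per category,
# minus the first category (alphabetically), which is dropped.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Numeric columns such as `price` pass through unchanged, and a two-level column such as `mainroad` collapses to a single indicator column.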


In [14]:
# Define Features & Target
X = df.drop("price", axis=1)
y = df["price"]
# Define Models
lr = LinearRegression()
rf = RandomForestRegressor(n_estimators=100, random_state=42)
# 10-Fold Cross Validation
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
lr_scores = cross_val_score(lr, X, y, cv=kfold, scoring='r2')
rf_scores = cross_val_score(rf, X, y, cv=kfold, scoring='r2')
print("Linear Regression Mean R2:", lr_scores.mean())
print("Random Forest Mean R2:", rf_scores.mean())
print("Linear Regression Std Dev:", lr_scores.std())
print("Random Forest Std Dev:", rf_scores.std())

Linear Regression Mean R2: 0.6081896049094507
Random Forest Mean R2: 0.5645506637453517
Linear Regression Std Dev: 0.10006513090561928
Random Forest Std Dev: 0.13470537931844273
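
The cross-validation above reports only R². As a possible extension, `cross_validate` can score several metrics on the same folds, so that MAE and RMSE are also averaged over 10 folds rather than taken from a single split. This is a sketch under assumptions: it uses synthetic data from `make_regression` so that it runs on its own, and the names `X_demo`, `y_demo`, `kfold_demo`, and `summary` are illustrative. In the notebook, `X` and `y` would take the place of the synthetic arrays.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption): swap in the notebook's X and y.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=15.0, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Error metrics are exposed as negated scores so that "higher is better".
scoring = {
    "r2": "r2",
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
}

kfold_demo = KFold(n_splits=10, shuffle=True, random_state=42)

summary = {}
for name, model in models.items():
    cv = cross_validate(model, X_demo, y_demo, cv=kfold_demo, scoring=scoring)
    summary[name] = {
        "r2_mean": cv["test_r2"].mean(),
        "r2_std": cv["test_r2"].std(),
        "mae_mean": -cv["test_mae"].mean(),    # flip sign back to an error
        "rmse_mean": -cv["test_rmse"].mean(),  # flip sign back to an error
    }
    print(name, summary[name])
```

Reusing one `KFold` object for both models keeps the folds identical, so the per-metric averages are directly comparable.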


In [15]:
# Train-Test Split for Metrics
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr.fit(X_train, y_train)
rf.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
rf_pred = rf.predict(X_test)

In [17]:
# Compare Metrics
# Linear Regression
print("Linear Regression")
print("MAE:", mean_absolute_error(y_test, lr_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, lr_pred)))
print("R2 Score:", r2_score(y_test, lr_pred))

Linear Regression
MAE: 970043.403920164
RMSE: 1324506.9600914388
R2 Score: 0.6529242642153184


In [18]:
# Random Forest
print("Random Forest")
print("MAE:", mean_absolute_error(y_test, rf_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, rf_pred)))
print("R2 Score:", r2_score(y_test, rf_pred))

Random Forest
MAE: 1021546.0353211008
RMSE: 1400565.9728553821
R2 Score: 0.611918531405699


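## Interpretation of Results

### Model Comparison

- Linear Regression achieved the higher **R² score** in both evaluations: a 10-fold mean of 0.608 versus 0.565 for Random Forest, and 0.653 versus 0.612 on the held-out test split.
- Linear Regression also produced the lower **RMSE (Root Mean Squared Error)** on the test split (about 1.32 million versus 1.40 million) and the lower MAE (about 0.97 million versus 1.02 million), meaning its predictions were more accurate.
- Linear Regression showed lower variation across cross-validation folds (standard deviation 0.100 versus 0.135), indicating more stable behavior from the simpler model.
- Random Forest can model complex and non-linear relationships in housing features, but with the settings used here that flexibility did not translate into better scores.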
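
Because both models are scored on the same `KFold` splits, their fold scores are paired and can be compared fold by fold rather than only through the means. The following is a minimal sketch of that comparison, assuming synthetic data from `make_regression` so it runs standalone; `X_demo`, `y_demo`, `lr_fold`, `rf_fold`, and `diff` are illustrative names, and the notebook's `lr_scores` and `rf_scores` could be substituted directly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): swap in the notebook's X and y.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=15.0, random_state=0
)
kfold_demo = KFold(n_splits=10, shuffle=True, random_state=42)

lr_fold = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=kfold_demo, scoring="r2"
)
rf_fold = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=kfold_demo, scoring="r2",
)

# Per-fold difference: positive values mean Linear Regression scored higher.
diff = lr_fold - rf_fold
lr_wins = int((diff > 0).sum())

print("Per-fold R2 difference (LR - RF):", np.round(diff, 3))
print("Folds where LR scores higher:", lr_wins, "of", len(diff))
print("Mean difference:", diff.mean())
```

A model that wins on most folds is a stronger sign of a real difference than a higher mean driven by one or two folds.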

---

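### Stability and Generalization

- 10-fold cross-validation results show that Linear Regression generalizes slightly better to unseen data on this dataset (mean R² of 0.608 versus 0.565).
- The standard deviation across folds is about 0.10 for Linear Regression and 0.13 for Random Forest, so Linear Regression is the more consistent of the two, although neither spread is small relative to its mean.
- The gap in mean R² (about 0.04) is smaller than either standard deviation, so the ranking of the two models should be treated as tentative.
- Random Forest is generally well suited to non-linear feature interactions, but it was run here with 100 trees and otherwise default hyperparameters, and no tuning was attempted.
- Linear Regression, while simpler, may underperform when relationships between variables are strongly non-linear; the results here suggest that is not the dominant effect in this dataset.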
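
One simple way to put the fold-to-fold spread in context is to divide each model's standard deviation by its mean R². The sketch below does this arithmetic using the cross-validation figures printed earlier in the notebook; the ratio itself is an illustrative choice, not a metric the notebook computes, and `reported` and `ratios` are hypothetical names.

```python
# Values copied from the notebook's 10-fold cross-validation output.
reported = {
    "Linear Regression": {"mean_r2": 0.6081896049094507, "std_r2": 0.10006513090561928},
    "Random Forest": {"mean_r2": 0.5645506637453517, "std_r2": 0.13470537931844273},
}

# Relative spread: standard deviation as a fraction of the mean score.
ratios = {
    name: stats["std_r2"] / stats["mean_r2"] for name, stats in reported.items()
}

for name, ratio in ratios.items():
    print(f"{name}: std / mean R2 = {ratio:.3f}")

# Gap between the two mean scores, for comparison with the spreads.
gap = reported["Linear Regression"]["mean_r2"] - reported["Random Forest"]["mean_r2"]
print(f"Difference in mean R2: {gap:.3f}")
```

A smaller ratio indicates more consistent performance across folds relative to the typical score.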

---

### Conclusion

Linear Regression outperforms the Random Forest Regressor on the housing price dataset: it has the higher mean R² in 10-fold cross-validation and the higher R², lower MAE, and lower RMSE on the held-out test split.  
The margin is modest relative to the variation across folds, so the difference should not be overstated.  

Linear Regression also has the advantages of simplicity, interpretability, and lower computational cost.

Overall, for this dataset, **Linear Regression is the preferred model for price prediction**, while Random Forest with the settings used here did not improve on that baseline.