<a href="https://colab.research.google.com/github/aryansinghsisodia3/BostonHousing-Dataset/blob/main/BostonHousing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The Boston Housing Dataset

This notebook uses the Boston Housing dataset, originally available as `sklearn.datasets.load_boston`.

*`load_boston` was removed from scikit-learn in version 1.2, so a copy of the dataset hosted on GitHub is used instead:*
[BostonHousing.csv](https://gist.github.com/nnbphuong/def91b5553736764e8e08f6255390f37)

Load and preprocess the dataset after finding the best attributes

1.   Split into training and test sets.
2.   Normalize/standardize the features if required.
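
The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the notebook's own pipeline: `X_demo` and `y_demo` are synthetic stand-ins (an assumption) for the housing features and target, chosen so that the columns sit on very different scales. The scaler is fitted on the training split only and then reused on the test split, so no test-set statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three numeric columns on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.5, 300.0, 6.0], scale=[0.1, 150.0, 0.7], size=(200, 3))
y_demo = rng.normal(size=200)

# Step 1: split into training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Step 2: standardize. Learn mean and standard deviation from the training split only...
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
# ...and apply the same statistics to the test split.
X_te_scaled = scaler.transform(X_te)

print("Training means after scaling:", X_tr_scaled.mean(axis=0).round(3))
print("Training std devs after scaling:", X_tr_scaled.std(axis=0).round(3))
```

Standardization matters mostly for distance-based models such as KNN; ordinary least squares predictions do not change under it.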

Apply Multiple Linear Regression (MLR) to predict house prices.

1.   Report the Mean Squared Error (MSE) and Adjusted R² score on the test set.
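
For reference, Adjusted R² is usually derived from the ordinary R² as shown below. Here $n$ is the number of samples the score is evaluated on (the test set, for this task) and $p$ is the number of features; these symbols are introduced here for illustration and do not appear elsewhere in the notebook.

$$
R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
$$

Because the correction factor $\frac{n-1}{n-p-1}$ is greater than 1 whenever $p \ge 1$, the adjusted score is never higher than the plain R² and penalizes models that use many features relative to the sample size.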

Apply K-Nearest Neighbors Regression (KNN Regression) with different values of k (1–15).

1.   Plot test set accuracy/error vs. k.
2.   Report the best performance (lowest MSE).

Compare the performance of MLR vs. KNN Regression.

1.   Which performs better on this dataset?
2.   Give a short explanation.


In [None]:
import pandas as pd

df = pd.read_csv('/content/BostonHousing.csv')
display(df.head())
df.info()  # prints its summary directly and returns None, so display() is not needed
display(df.isnull().sum())

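Note: the saved output of this cell is `FileNotFoundError: [Errno 2] No such file or directory: '/content/BostonHousing.csv'`. Upload the CSV to `/content` (or adjust the path) before running the cell.

Remark: once the file is loaded, the dataset has 506 entries and 14 columns, with no missing values.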

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('MEDV', axis=1)
y = df['MEDV']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

### MLR:
Fit a Multiple Linear Regression model to the Boston Housing data and evaluate it on the test set using MSE and Adjusted R².


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

y_pred_lr = lr_model.predict(X_test)

# MSE and R-squared on the test set
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

# Adjusted R-squared on the test set:
# 1 - (1 - R^2) * (n - 1) / (n - p - 1), with n test samples and p features
n_test, p = X_test.shape
adj_r2_lr = 1 - (1 - r2_lr) * (n_test - 1) / (n_test - p - 1)

print(f"Linear Regression - Mean Squared Error (MSE): {mse_lr:.4f}")
print(f"Linear Regression - R-squared: {r2_lr:.4f}")
print(f"Linear Regression - Adjusted R-squared (test set): {adj_r2_lr:.4f}")

### KNN
Fit KNN Regression models for k = 1 to 15 and evaluate each one using test-set MSE.


In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

mse_knn_list = []

for k in range(1, 16):
    knn_model = KNeighborsRegressor(n_neighbors=k)
    knn_model.fit(X_train, y_train)
    y_pred_knn = knn_model.predict(X_test)
    mse_knn = mean_squared_error(y_test, y_pred_knn)
    mse_knn_list.append(mse_knn)

print("MSE for k=1 to 15:", mse_knn_list)
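
The loop above fits KNN on unscaled features and picks k by looking at the test set. One possible alternative, sketched here on synthetic stand-in data (an assumption; `X_demo`, `y_demo`, `pipe`, `search`, and `cv_best_k` are illustrative names, not part of the notebook), wraps scaling and KNN in a `Pipeline` and lets `GridSearchCV` choose k by cross-validation on the training split, so the test set is only used once at the end.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); swap in the notebook's X and y to use it there.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scaling lives inside the pipeline, so each CV fold is scaled with its own training statistics.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

# Try k = 1..15 with 5-fold cross-validation on the training split only.
search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": list(range(1, 16))},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_tr, y_tr)

cv_best_k = search.best_params_["knn__n_neighbors"]
test_mse = mean_squared_error(y_te, search.predict(X_te))
print(f"k chosen by cross-validation: {cv_best_k}")
print(f"Test MSE at that k: {test_mse:.4f}")
```

The numbers this prints describe the synthetic data only and are not comparable to the MSE values reported for the housing data.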

### Plot KNN:


In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(range(1, 16), mse_knn_list, marker='o')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Mean Squared Error (MSE)')
plt.title('KNN Regression Performance vs. Number of Neighbors (k)')
plt.grid(True)
plt.show()

### Correlation Heatmap

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Boston Housing Dataset')
plt.show()

### Scatter Plot of Predicted vs. Actual Values (Linear Regression)

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_lr, alpha=0.5)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual vs. Predicted Prices (Linear Regression)")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2) # Diagonal line
plt.grid(True)
plt.show()

### Scatter Plot of Predicted vs. Actual Values (KNN Regression - Best k)

In [None]:
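# Pick the k with the lowest test MSE from the earlier sweep (list index 0 corresponds to k=1)
best_k = mse_knn_list.index(min(mse_knn_list)) + 1

# Re-train KNN with that k
knn_model_best = KNeighborsRegressor(n_neighbors=best_k)
knn_model_best.fit(X_train, y_train)
y_pred_knn_best = knn_model_best.predict(X_test)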

plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_knn_best, alpha=0.5)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title(f"Actual vs. Predicted Prices (KNN Regression with k={best_k})")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2) # Diagonal line
plt.grid(True)
plt.show()

## Compare MLR and KNN

Compare the MLR model with the best KNN model by their test-set MSE to see which performs better.


In [None]:
min_mse_knn = min(mse_knn_list)
best_k = mse_knn_list.index(min_mse_knn) + 1

print(f"Minimum MSE for KNN: {min_mse_knn:.4f} at k = {best_k}")
print(f"MSE for Linear Regression: {mse_lr:.4f}")

if min_mse_knn < mse_lr:
    print("KNN Regression performs better based on MSE.")
else:
    print("Linear Regression performs better based on MSE.")

## Explanation

* Minimum MSE for KNN: 23.3970 at k = 6
* MSE for Linear Regression: 14.8007
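* A lower MSE means the model's predictions are, on average, closer to the actual values.
* On this test set, Linear Regression (MSE 14.8007) therefore outperformed the best KNN model (MSE 23.3970 at k = 6).
* One likely contributor: KNN was fit on unscaled features, so its distances are dominated by the features with the largest numeric ranges, whereas Linear Regression predictions do not depend on feature scale. The best k was also selected on the test set, so the KNN figure is, if anything, optimistic.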
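
A single train/test split can favor either model by chance. As a rough sketch of a more robust comparison, the snippet below scores both models with 5-fold cross-validation. It runs on synthetic stand-in data (an assumption; `X_demo`, `y_demo`, `models`, and `cv_mse` are illustrative names), and that data is linear by construction, so its outcome says nothing about which model wins on the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); swap in the notebook's X and y to use it there.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# k=6 mirrors the best k reported above; KNN gets scaled inputs via the pipeline.
models = {
    "Linear Regression": LinearRegression(),
    "KNN (k=6, scaled)": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=6)),
}

cv_mse = {}
for name, model in models.items():
    # scikit-learn reports negated MSE so that higher is better; flip the sign back.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()
    print(f"{name}: mean CV MSE = {cv_mse[name]:.4f}")
```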