# House Price Prediction

**About the Dataset:**

The dataset used for this project is the **[Boston House Prices Dataset](https://www.kaggle.com/datasets/vikrishnan/boston-house-prices/data)**, which contains information about various attributes related to Boston suburbs and towns. These records were collected from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970.

**Attributes in the Dataset:**

- **CRIM**: Per capita crime rate by town.

- **ZN**: Proportion of residential land zoned for lots over 25,000 sq.ft.

- **INDUS**: Proportion of non-retail business acres per town.

- **CHAS**: Charles River dummy variable (1 if tract bounds river; 0 otherwise).

- **NOX**: Nitric oxides concentration (parts per 10 million).

- **RM**: Average number of rooms per dwelling.

- **AGE**: Proportion of owner-occupied units built before 1940.

- **DIS**: Weighted distances to five Boston employment centers.

- **RAD**: Index of accessibility to radial highways.

- **TAX**: Full-value property-tax rate per $10,000.

- **PTRATIO**: Pupil-teacher ratio by town.

- **B**: A derived feature, calculated as 1000 * (Bk - 0.63)^2, where Bk is the proportion of Black residents by town.

- **LSTAT**: Percentage of lower status population.

- **MEDV**: Median value of owner-occupied homes in $1000s.



## Import the necessary libraries:

In [None]:
import os
import sys
import random
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [None]:
import warnings
warnings.filterwarnings('ignore')

## Loading the Dataset:


In [None]:
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

In [None]:
path="/kaggle/input/boston-house-prices/housing.csv"
df=pd.read_csv(path,header=None, delimiter=r"\s+", names=column_names)

In [None]:
df.head(10)

In [None]:
df.shape

In [None]:
df.describe(include='all')

In [None]:
df.info()

In [None]:
df.isnull().sum()

No missing values

In [None]:
# Count missing values per column and express them as a percentage of all rows
miss_val = pd.DataFrame(data=df.isnull().sum().sort_values(ascending=False), columns=['MissValueCount'])

miss_val['Percent'] = miss_val.MissValueCount.apply(lambda x: '{:.2f}'.format(float(x) / df.shape[0] * 100))
miss_val

In [None]:
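# distplot is deprecated in seaborn; histplot with a KDE overlay is the closest replacement
sns.histplot(df.MEDV, kde=True)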

## Initial Data Observations and Considerations

In the initial exploration of the dataset, several interesting observations and considerations have been identified that can inform our approach to the house price prediction task. These observations are as follows:

### 1. ZN (Proportion of Residential Land Zoned for Large Lots)
- The 25th and 50th percentiles of the 'ZN' column are both 0, indicating that at least half of the rows have a zero value for this feature.
- It's reasonable to assume that this variable might not have a strong influence on house prices if most values are zero.

### 2. CHAS (Charles River Dummy Variable)
- The 25th, 50th, and 75th percentiles of the 'CHAS' column are all 0, so at least 75% of the tracts do not bound the river. With so few positive cases, this binary variable might not have a significant impact on house prices in this dataset.

### 3. MEDV (Median Value of Owner-Occupied Homes)
- The maximum value of 'MEDV' is 50, and the data description indicates that this variable is censored at 50.00, corresponding to a median price of $50,000.
- Rows recorded at exactly 50 may hide higher true values, so they may not provide useful information for predicting house prices. This censoring should be considered when building the regression model.

These initial observations guide our data preprocessing decisions, including the potential exclusion of the 'ZN' and 'CHAS' columns from our feature set and the handling of 'MEDV' values at the cap of 50. Furthermore, we will visualize the data to explore trends and relationships with the target variable ('MEDV') for a more comprehensive analysis.

Data exploration and understanding are fundamental steps in preparing the dataset for modeling and making informed decisions throughout the project.
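
The observations above can be checked directly rather than read off the percentile table. The following is a minimal sketch on a small made-up frame (the `toy` values are hypothetical and are not rows from the real dataset); it measures the share of zeros in 'ZN' and 'CHAS' and counts the rows sitting at the 'MEDV' cap.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "ZN":   [0.0, 0.0, 12.5, 0.0, 80.0, 0.0],
    "CHAS": [0, 0, 0, 1, 0, 0],
    "MEDV": [24.0, 50.0, 21.6, 50.0, 33.4, 18.9],
})

# Share of rows where each feature is exactly zero.
zero_share = (toy[["ZN", "CHAS"]] == 0).mean()

# Number of rows recorded at the censoring cap of 50.0.
n_censored = int((toy["MEDV"] >= 50.0).sum())

print(zero_share)
print("Rows at the MEDV cap:", n_censored)
```

Running the same two expressions on `df` would give the actual proportions for this dataset.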


In [None]:
fig, axs = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
axs = axs.flatten()
for k,v in df.items():
    sns.boxplot(y=k, data=df, ax=axs[index])
    index += 1
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)

Columns like CRIM, ZN, RM, and B seem to have outliers. Let's look at the percentage of outliers in every column.

In [None]:
for k, v in df.items():
    # Flag values at or beyond 1.5 * IQR from the quartiles as outliers
    q1 = v.quantile(0.25)
    q3 = v.quantile(0.75)
    iqr = q3 - q1
    v_col = v[(v <= q1 - 1.5 * iqr) | (v >= q3 + 1.5 * iqr)]
    perc = np.shape(v_col)[0] * 100.0 / np.shape(df)[0]
    print("Column %s outliers = %.2f%%" % (k, perc))
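
As a worked illustration of the 1.5 * IQR rule used above, here is a small sketch with a hypothetical helper, `iqr_outlier_share`, applied to made-up values. It uses strict inequalities, which is an assumption that differs from the loop above: with `<=` and `>=`, a column whose quartiles coincide (IQR of 0, as can happen for a mostly-zero column such as CHAS) would have every value flagged.

```python
import pandas as pd

def iqr_outlier_share(values: pd.Series, k: float = 1.5) -> float:
    """Return the percentage of values strictly outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return 100.0 * mask.mean()

# Made-up values: only 100 lies outside the fences.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
share = iqr_outlier_share(toy)

# Made-up binary column whose IQR is 0: only the single 1 is flagged.
mostly_zero = pd.Series([0, 0, 0, 0, 1])
share_binary = iqr_outlier_share(mostly_zero)

print(share, share_binary)
```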

Let's remove the censored MEDV rows (MEDV = 50.0) before plotting more distributions.

In [None]:
data = df[~(df['MEDV'] >= 50.0)]
print(np.shape(data))

Let's see what the distributions of these features and MEDV look like.

In [None]:
fig, axs = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
axs = axs.flatten()
for k,v in data.items():
    # histplot with a KDE overlay replaces the deprecated distplot
    sns.histplot(v, kde=True, ax=axs[index])
    index += 1
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)

The histograms also show that the columns CRIM, ZN, and B have highly skewed distributions. MEDV (the prediction target) looks roughly normally distributed, and the other columns appear to have normal or bimodal distributions, except CHAS, which is a discrete variable.

Now let's plot the pairwise correlation on data.

In [None]:
plt.figure(figsize=(20, 10))
sns.heatmap(df.corr().abs(),  annot=True, cmap='viridis')
plt.title("Correlation HeatMap")
plt.show()

From the correlation matrix, we see that TAX and RAD are highly correlated features. The columns LSTAT, INDUS, RM, TAX, NOX, and PTRATIO have an absolute correlation above 0.5 with MEDV, which is a good indication that they are useful predictors. Let's plot these columns against MEDV.
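
The threshold-based selection described above can also be done programmatically instead of by reading the heatmap. This is a minimal sketch on a made-up frame (the `toy` values and the 0.5 cutoff variable are illustrative assumptions); it ranks features by absolute correlation with the target and keeps those above the cutoff.

```python
import pandas as pd

# Hypothetical toy data: RM and LSTAT track MEDV closely, CHAS does not.
toy = pd.DataFrame({
    "RM":    [5.0, 6.0, 6.5, 7.0, 8.0],
    "LSTAT": [20.0, 12.0, 10.0, 6.0, 3.0],
    "CHAS":  [0, 1, 0, 1, 0],
    "MEDV":  [15.0, 22.0, 24.0, 30.0, 42.0],
})

threshold = 0.5  # same cutoff as the one mentioned in the text

# Absolute Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    toy.corr()["MEDV"].drop("MEDV").abs().sort_values(ascending=False)
)
selected = corr_with_target[corr_with_target > threshold].index.tolist()

print(corr_with_target)
print("Selected features:", selected)
```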

In [None]:
y = df['MEDV']
column_sels = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

In [None]:
from sklearn import preprocessing

# Scale the selected columns to [0, 1] before plotting them against MEDV
min_max_scaler = preprocessing.MinMaxScaler()
column_sels = ['LSTAT', 'INDUS', 'NOX', 'PTRATIO', 'RM', 'TAX', 'DIS', 'AGE']
x = data.loc[:, column_sels]
y = data['MEDV']
# Keep the original row index so x stays aligned with y after the MEDV filter
x = pd.DataFrame(data=min_max_scaler.fit_transform(x), columns=column_sels, index=x.index)
fig, axs = plt.subplots(ncols=4, nrows=2, figsize=(20, 10))
axs = axs.flatten()
for i, k in enumerate(column_sels):
    sns.regplot(y=y, x=x[k], ax=axs[i])
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)

The regression plots above visualize the relationships between the scaled features ('LSTAT', 'INDUS', 'NOX', 'PTRATIO', 'RM', 'TAX', 'DIS', 'AGE') and the target variable 'MEDV'. Scaling the features to a common range ([0, 1]) puts them on a similar scale and helps in comparing their effects on the target variable. These plots can provide insight into how each feature relates to the target variable on a comparable scale.
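
To make the [0, 1] scaling concrete, the sketch below applies the min-max formula, (value - min) / (max - min), by hand to a few made-up numbers and compares the result with `MinMaxScaler`. The input values are purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature column (one row per sample).
values = np.array([[2.0], [4.0], [6.0], [10.0]])

# Min-max scaling written out by hand.
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# The same transformation via scikit-learn.
scaled = MinMaxScaler().fit_transform(values)

print(manual.ravel())
print(scaled.ravel())
```

Both arrays agree: the smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.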

Based on this analysis, we may try to predict MEDV with the 'LSTAT', 'INDUS', 'NOX', 'PTRATIO', 'RM', 'TAX', 'DIS', and 'AGE' features. Let's try to reduce the skewness of the data through a log transformation.

In [None]:
df.skew().sort_values(ascending=False)

In [None]:
y =  np.log1p(y)
for col in x.columns:
    if np.abs(x[col].skew()) > 0.3:
        x[col] = np.log1p(x[col])
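
Because the target is transformed with `np.log1p`, the models trained afterwards predict on the log scale, and their error metrics are in log units as well. The sketch below, using made-up prices, shows that `np.expm1` inverts the transformation and how a log-scale error can be read as a relative error. The `log_error` value is an illustrative assumption, not a result from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up MEDV-like values in $1000s.
prices = pd.Series([12.0, 18.5, 21.0, 23.4, 27.9, 35.2, 48.3])

# Forward transform used for the target, and its inverse.
log_prices = np.log1p(prices)
recovered = np.expm1(log_prices)

# An absolute error of 0.10 on the log1p scale means (1 + predicted) is
# about exp(0.10) times (1 + actual), i.e. roughly a 10.5% relative gap.
log_error = 0.10
relative_error = np.expm1(log_error)

print(recovered.round(2).tolist())
print(f"Relative error for a log-scale error of {log_error}: {relative_error:.3f}")
```

Applying `np.expm1` to a model's predictions would return them to the original $1000s scale before reporting prices.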

## Split the Dataset into train and test data

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=3)

In [None]:
from sklearn.linear_model import LinearRegression

# Create a Linear Regression model instance
lr = LinearRegression()

# Fit the model to your training data
lr.fit(x_train, y_train)

In [None]:
y_pred = lr.predict(x_test)
print("Predicted Values:", y_pred)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Calculate the metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Print the metrics
print("Linear Regression Model: ")
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2) Score:", r2)

In [None]:
coefficients = lr.coef_
intercept = lr.intercept_

# Print the coefficients (weights) for each feature
print("Coefficients (Weights) for each feature:")
for feature, coef in zip(x_train.columns, coefficients):
    print(f"{feature}: {coef}")

# Print the intercept
print("Intercept (Bias):", intercept)

### Visualization of Regression model:


In [None]:
import matplotlib.pyplot as plt

# Scatter plot
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted Values")

# Add the identity line y = x (perfect predictions would fall on it)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red')

plt.show()

## SVM Model:

In [None]:
from sklearn.svm import SVR
svr_model = SVR(kernel='rbf')
svr_model.fit(x_train, y_train)

In [None]:
y_pred = svr_model.predict(x_test)
print("Predicted Values:", y_pred)

In [None]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # squared=False is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)

# Print the metrics
print("SVM MODEL Metrics: ")
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2) Score:", r2)

## Decision Tree Regressor Model:


In [None]:
from sklearn.tree import DecisionTreeRegressor

# Create a Decision Tree Regressor instance
dt_regressor = DecisionTreeRegressor(random_state=42)

# Train the Decision Tree Regressor on your training data
dt_regressor.fit(x_train, y_train)

In [None]:
y_pred = dt_regressor.predict(x_test)
print("Predicted Values:", y_pred)

In [None]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # squared=False is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)

# Print the metrics
print("Decision Tree Regressor Metrics: ")
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2) Score:", r2)

### Visualization of the Decision Tree Regressor:


In [None]:
from sklearn.tree import plot_tree

# Plot the Decision Tree
plt.figure(figsize=(12, 6))
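# Pass the feature names as a plain list, which plot_tree accepts across scikit-learn versions
plot_tree(dt_regressor, filled=True, feature_names=list(x_train.columns))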
plt.show()

## KNeighborsRegressor:

In [None]:
from sklearn.neighbors import KNeighborsRegressor
knn_regressor = KNeighborsRegressor(n_neighbors=5)
knn_regressor.fit(x_train, y_train)

In [None]:
y_pred=knn_regressor.predict(x_test)
print('Predicted Values: ', y_pred)

In [None]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # squared=False is deprecated in newer scikit-learn releases
r2_knr = r2_score(y_test, y_pred)

# Print the metrics
print("KNeighborsRegressor Metrics: ")
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2) Score:", r2_knr)

## GradientBoostingRegressor:

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
# Create a Gradient Boosting Regressor instance with specified hyperparameters
gb_regressor = GradientBoostingRegressor(
    n_estimators=100,  # Number of boosting stages to be used
    learning_rate=0.1,  # Learning rate (step size shrinkage)
    max_depth=3,  # Maximum depth of individual estimators
    random_state=42
)

# Train the Gradient Boosting Regressor on your training data
gb_regressor.fit(x_train, y_train)

In [None]:
y_pred=gb_regressor.predict(x_test)
print("Predicted Values: ", y_pred)

In [None]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # squared=False is deprecated in newer scikit-learn releases
r2_gbr = r2_score(y_test, y_pred)

# Print the metrics
print("GradientBoostingRegressor Metrics: ")
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2) Score:", r2_gbr)

In [None]:
from IPython.display import display, HTML

# Create a DataFrame with the metrics
metrics_data = {
    'Model': ['Linear Regression', 'SVR (SVM)', 'Decision Tree', 'K-Neighbors', 'Gradient Boosting'],
    'MAE': [0.1037, 0.0984, 0.1317, 0.1288, 0.1037],
    'MSE': [0.0249, 0.0236, 0.0361, 0.0314, 0.0249],
    'RMSE': [0.1579, 0.1536, 0.1901, 0.1773, 0.1579],
    'R² Score': [0.8326, 0.8416, 0.7574, 0.7888, 0.8326]
}

metrics_df = pd.DataFrame(metrics_data)

# Display the DataFrame as an HTML table
display(HTML(metrics_df.to_html()))
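
The table above is typed in by hand, which makes it easy for a copied value to go stale. As an alternative, the metrics can be collected in a loop over fitted models. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_regression` and only two of the models so that it runs on its own, and the names `models`, `rows`, and `metrics_table` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing features and target.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}

# Fit each model and record its test-set metrics as one table row.
rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    rows.append({
        "Model": name,
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": r2_score(y_test, pred),
    })

metrics_table = pd.DataFrame(rows).round(4)
print(metrics_table)
```

In the notebook itself, the same loop could iterate over the already fitted `lr`, `svr_model`, `dt_regressor`, `knn_regressor`, and `gb_regressor` with `x_test` and `y_test`.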

1. **SVR (SVM Model) appears to be the best-performing model** based on the R² (R-squared) score. It achieved the highest R² score of 0.8416 on the held-out test set, indicating that it explains the most variance in the log-transformed target variable. This suggests that SVR provides a strong fit to the data and effectively captures the underlying patterns.

2. **Linear Regression and Gradient Boosting Regressor** also performed well, with R² scores of 0.8326. These models provide good explanations of the variance in the target variable and are competitive choices.

3. **K-Neighbors Regressor** performed slightly below the top models, with an R² score of 0.7888. While it still provides a reasonable fit, it may not capture as much variance as the top models.

4. **Decision Tree Regressor** achieved the lowest R² score of 0.7574, indicating that it explained less variance in the target variable compared to the other models. It may be more prone to overfitting or may not capture the underlying patterns as effectively.

In conclusion, **SVR (SVM Model) is recommended as the best model for this regression task**, as it achieved the highest R² score on the single train/test split used here, signifying its ability to explain the most variance in the target variable. However, the choice of the best model should also consider other factors, including model complexity, interpretability, and domain-specific requirements. Because the comparison rests on one split, it's advisable to validate the selected model further, for example with cross-validation or on a separate dataset, to confirm its performance on unseen data.

### Model Comparison Visualization:

In [None]:
linear_regression_scores = [0.8326, 0.8326, 0.8326, 0.8326, 0.8326]
svr_scores = [0.8416, 0.8416, 0.8416, 0.8416, 0.8416]
decision_tree_scores = [0.7574, 0.7574, 0.7574, 0.7574, 0.7574]
k_neighbors_scores = [0.7888, 0.7888, 0.7888, 0.7888, 0.7888]
gradient_boosting_scores = [0.8326, 0.8326, 0.8326, 0.8326, 0.8326]

# Create a DataFrame with the scores (each list repeats one test-set R2 score, so every box collapses to a flat line)
scores_map = {
    'Linear Regression': linear_regression_scores,
    'SVR (SVM)': svr_scores,
    'Decision Tree': decision_tree_scores,
    'K-Neighbors': k_neighbors_scores,
    'Gradient Boosting': gradient_boosting_scores
}

scores_df = pd.DataFrame(scores_map)

# Create a boxplot to visualize model performance with improved styling
plt.figure(figsize=(12, 8))
sns.set(style="whitegrid")  # Set a white grid background
ax = sns.boxplot(data=scores_df, palette="viridis")  # Use a color palette for boxes
plt.title("Model Performance Comparison", fontsize=16)
plt.ylabel("R-squared (R2) Score", fontsize=14)
plt.xlabel("Regression Models", fontsize=14)
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.show()
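
A boxplot is more informative when each model contributes several different scores. One way to obtain them is k-fold cross-validation. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_regression`, two of the models, and a 5-fold split, and the names `cv` and `cv_scores` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the housing features and target.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Five shuffled folds, so each model gets five R2 scores.
cv = KFold(n_splits=5, shuffle=True, random_state=3)

models = {
    "Linear Regression": LinearRegression(),
    "K-Neighbors": KNeighborsRegressor(n_neighbors=5),
}

# One column per model, one row per fold.
cv_scores = pd.DataFrame({
    name: cross_val_score(model, X, y, cv=cv, scoring="r2")
    for name, model in models.items()
})

print(cv_scores.describe().loc[["mean", "std"]])
```

A frame shaped like `cv_scores` can be passed to `sns.boxplot(data=...)` in place of `scores_df`, so that the boxes reflect the spread across folds.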

The **Support Vector Regression (SVR) model** was chosen for this project because it achieved the highest R² score in the comparison above and can handle both linear and non-linear relationships in the data, which suits house prices, as they often exhibit complex patterns. SVR also copes well with high-dimensional datasets, making it a reasonable choice for price forecasting.

* Visit the [Boston House Price ML Predictor](https://boston-house-price-ml-predictor-saimanoj.streamlit.app/) web application.