# Real Estate House Price Prediction

## Introduction
This notebook is a simple example of how to use the `scikit-learn` library to predict house prices. The dataset used in this notebook is the [USA Real Estate Dataset](https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset?resource=download) from Kaggle. Our goal is to predict the price of a house from its features using a range of regression algorithms. The notebook covers:

- **Data Loading and Preprocessing**: Loading the dataset and splitting it into training and testing sets.
- **Model Definition**: Defining a variety of regression models from `scikit-learn`.
- **Hyperparameter Tuning**: Utilizing `GridSearchCV` to find the optimal hyperparameters for each model.
- **Model Evaluation**: Evaluating the performance of each model using regression metrics (MAE, MSE, RMSE, and R²).
- **Best Model Identification**: Identifying the best model based on its performance on the test set.

## About the Dataset
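The dataset is a single CSV file. The 8 columns below are described by the source; the file loaded in this notebook also contains `status`, `city`, `state`, and `prev_sold_date`, as the sample output further down shows.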

realtor-data.csv (2,226,382 entries)

- brokered by (categorically encoded agency/broker)
- price (housing price: either the current listing price, or the recently sold price if the house was sold recently)
- bed (# of beds)
- bath (# of bathrooms)
- acre_lot (Property / Land size in acres)
- street (categorically encoded street address)
- zip_code (postal code of the area)
- house_size (house area/size/living space in square feet)

NB:

- brokered by and street addresses were categorically encoded due to data privacy policy
- acre_lot means the total land area, and house_size denotes the living space/building area

## Objectives
1. To predict the price of a house based on its features.
2. To identify the best model for the prediction task.

## Case Study Outline

1. **Exploratory Data Analysis (EDA)**
   - Data profiling
   - Visualization of key features

2. **Data Preprocessing**
   - Handling missing values
   - Encoding categorical variables
   - Feature scaling

3. **Model Building, Hyperparameter Tuning & Evaluation**

   i. Model Building
      - Linear Regression, Ridge, and Lasso
      - Decision Tree
      - Random Forest
      - Gradient Boosting
      - AdaBoost
      - Support Vector Regression (SVR)
      - K-Nearest Neighbors (KNN)

   ii. Hyperparameter Tuning
      - Grid Search (`GridSearchCV` with 5-fold cross-validation)

   iii. Model Evaluation
      - Train-test split
      - Cross-validation
      - Performance metrics (MAE, MSE, RMSE, R²)
      - Learning curves

## Acknowledgements
Data was collected from realtor.com, a real estate listing website operated by the News Corp subsidiary Move, Inc. and based in Santa Clara, California. It is the second most visited real estate listing website in the United States as of 2024, with over 100 million monthly active users. [source](https://www.realtor.com/)


In [1]:
# importing the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, learning_curve
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
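try:
    from ydata_profiling import ProfileReport  # current name of the pandas-profiling package
except ImportError:
    from pandas_profiling import ProfileReport  # legacy package name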
import warnings

# suppress warnings
warnings.filterwarnings('ignore')

# Load preprocessed data
df = pd.read_csv('realtor-data.zip.csv')
df = df.sample(frac=0.0001, random_state=42) # take a sample of the data to speed up the process
df.head()


| Unnamed: 0 | brokered_by | status | price | bed | bath | acre_lot | street | city | state | zip_code | house_size | prev_sold_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1696936 | 54239.0 | sold | 275000.0 | 1.0 | 1.0 | NaN | 1617038.0 | Miami | Florida | 33156.0 | 846.0 | 2022-02-28 |
| 2092671 | 90564.0 | sold | 399900.0 | 1.0 | 1.0 | NaN | 1497499.0 | San Diego | California | 92108.0 | 667.0 | 2022-04-28 |
| 742044 | 53271.0 | for_sale | 75000.0 | NaN | NaN | 2.25 | 1877529.0 | Oceola Township | Michigan | 48855.0 | NaN | NaN |
| 1424136 | 12926.0 | sold | 325000.0 | 3.0 | 2.0 | 0.09 | 892999.0 | Worcester | Massachusetts | 1603.0 | 1409.0 | 2021-11-29 |
| 812329 | 79221.0 | for_sale | 169900.0 | NaN | NaN | 3.7 | 1998116.0 | Holmen | Wisconsin | 54636.0 | NaN | NaN |


## Exploratory Data Analysis
With the sampled data loaded, we start with some exploratory data analysis (EDA) to better understand it. We use `ProfileReport` to generate an automated profiling report for the `pandas` DataFrame.

In [2]:
# Generate the profiling report
profile = ProfileReport(df, title="House Price Prediction Data Profiling Report", explorative=True)

# Display the profiling report
profile.to_notebook_iframe()
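
If the profiling package is unavailable, a few plain `pandas` calls give a rough version of the same overview. The sketch below is only an illustration: the `toy` frame is a hypothetical miniature that copies a handful of values from the sample rows, not the notebook's `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame mirroring a few columns of the sampled data.
toy = pd.DataFrame({
    "price": [275000.0, 399900.0, 75000.0, 325000.0, 169900.0],
    "bed": [1.0, 1.0, np.nan, 3.0, np.nan],
    "bath": [1.0, 1.0, np.nan, 2.0, np.nan],
    "acre_lot": [np.nan, np.nan, 2.25, 0.09, 3.7],
    "house_size": [846.0, 667.0, np.nan, 1409.0, np.nan],
    "status": ["sold", "sold", "for_sale", "sold", "for_sale"],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Summary statistics for the numeric columns only.
numeric_summary = toy.describe()

# Frequency of each category in a categorical column.
status_counts = toy["status"].value_counts()

print(missing_share)
print(numeric_summary.loc[["mean", "min", "max"]])
print(status_counts)
```

The same three calls can be applied to `df` directly; the missing-value shares are worth checking before any `dropna()` step, since columns with many gaps determine how many rows survive.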


## Data Preprocessing
In this step, we preprocess the data by handling missing values, encoding categorical variables, and performing feature scaling. This is necessary to ensure that the data is in a suitable format for training the machine learning models.

In [3]:
# Handle missing values
df = df.dropna()

# Convert prev_sold_date to datetime and then to the number of days since the earliest date
df['prev_sold_date'] = pd.to_datetime(df['prev_sold_date'])
earliest_date = df['prev_sold_date'].min()
df['prev_sold_date'] = (df['prev_sold_date'] - earliest_date).dt.days

# Encode categorical variables
label_encoder = LabelEncoder()
df['brokered_by'] = label_encoder.fit_transform(df['brokered_by'])
df['status'] = label_encoder.fit_transform(df['status'])
df['street'] = label_encoder.fit_transform(df['street'])
df['city'] = label_encoder.fit_transform(df['city'])
df['state'] = label_encoder.fit_transform(df['state'])

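# Feature scaling
# Note: the target `price` is standardized here as well, so the error metrics
# reported later are in standard-deviation units of price, not dollars.
# The scaler is also fit on all rows before the train/test split, which lets
# test-set statistics influence the training data.
scaler = StandardScaler()
df[['price', 'bed', 'bath', 'acre_lot', 'house_size']] = scaler.fit_transform(df[['price', 'bed', 'bath', 'acre_lot', 'house_size']])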

# Split the data into features and target
X = df.drop(columns=['price'])
y = df['price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
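
The cell above fits `StandardScaler` and `LabelEncoder` on all rows before calling `train_test_split`. One common way to keep test-set statistics out of training is to wrap the preprocessing in a `Pipeline`, so that it is fit on the training rows only. The following is a minimal sketch under stated assumptions, not the notebook's method: the data is synthetic, the column subset and the `Ridge` regressor are arbitrary choices, and `OrdinalEncoder` is swapped in for `LabelEncoder` because it is designed for feature columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Synthetic stand-in for the real data (assumption: not the notebook's df).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "bed": rng.integers(1, 6, size=n).astype(float),
    "bath": rng.integers(1, 4, size=n).astype(float),
    "house_size": rng.uniform(500, 4000, size=n),
    "state": rng.choice(["Florida", "California", "Michigan"], size=n),
})
toy["price"] = (100.0 * toy["house_size"] + 20000.0 * toy["bed"]
                + rng.normal(0, 10000, size=n))

numeric_cols = ["bed", "bath", "house_size"]
categorical_cols = ["state"]

# Scale numeric features and integer-encode categorical ones.
# Categories unseen during fit are mapped to -1 instead of raising an error.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     categorical_cols),
])

model = Pipeline([("preprocess", preprocess), ("regressor", Ridge(alpha=1.0))])

X_toy = toy.drop(columns=["price"])
y_toy = toy["price"]  # target left in dollars, so errors stay interpretable
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# fit() learns scaling and encoding from the training rows only.
model.fit(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
print(f"Held-out R2 on the synthetic data: {test_r2:.3f}")
```

A pipeline like this can be passed to `GridSearchCV` in place of a bare estimator, in which case parameter names take a step prefix such as `regressor__alpha`.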

## Model Building, Hyperparameter Tuning, and Evaluation
We build a variety of regression models using the `scikit-learn` library. Each model is tuned with `GridSearchCV` using 5-fold cross-validation on the training set, and the tuned model is then evaluated on the held-out test set.

In [4]:
# Define models and their hyperparameters
models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'DecisionTree': DecisionTreeRegressor(),
    'RandomForest': RandomForestRegressor(),
    'GradientBoosting': GradientBoostingRegressor(),
    'AdaBoost': AdaBoostRegressor(),
    'SVR': SVR(),
    'KNN': KNeighborsRegressor()
}

param_grids = {
    'LinearRegression': {},
    'Ridge': {'alpha': [0.1, 1, 10]},
    'Lasso': {'alpha': [0.1, 1, 10]},
    'DecisionTree': {'max_depth': [None, 10, 20], 'min_samples_split': [2, 10, 20]},
    'RandomForest': {'n_estimators': [100, 200], 'max_depth': [None, 10, 20]},
    'GradientBoosting': {'n_estimators': [100, 200], 'learning_rate': [0.01, 0.1, 1], 'max_depth': [3, 5, 7]},
    'AdaBoost': {'n_estimators': [50, 100], 'learning_rate': [0.01, 0.1, 1]},
    'SVR': {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
    'KNN': {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
}

# Function to perform model tuning and evaluation
def tune_and_evaluate(models, param_grids, X_train, y_train, X_test, y_test):
    best_model = None
    best_score = float('inf')
    best_model_name = ""
    model_scores = []

    for model_name, model in models.items():
        print(f"Tuning {model_name}...")
        grid_search = GridSearchCV(model, param_grids[model_name], cv=5, scoring='neg_mean_squared_error')
        grid_search.fit(X_train, y_train)

        best_estimator = grid_search.best_estimator_
        predictions = best_estimator.predict(X_test)
        mae = mean_absolute_error(y_test, predictions)
        mse = mean_squared_error(y_test, predictions)
        rmse = np.sqrt(mse)  # avoids the `squared=False` argument, which newer scikit-learn releases no longer accept
        r2 = r2_score(y_test, predictions)

        print(f"Best parameters for {model_name}: {grid_search.best_params_}")
        print(f"MAE for {model_name}: {mae}")
        print(f"MSE for {model_name}: {mse}")
        print(f"RMSE for {model_name}: {rmse}")
        print(f"R2 for {model_name}: {r2}")
        
        model_scores.append((model_name, rmse))

        if rmse < best_score:
            best_score = rmse
            best_model = best_estimator
            best_model_name = model_name

    print(f"\nBest Model: {best_model_name} with RMSE: {best_score}")

    # Plotting model RMSEs
    model_scores = sorted(model_scores, key=lambda x: x[1])
    names, scores = zip(*model_scores)  # `names` avoids shadowing the `models` argument
    plt.figure(figsize=(12, 6))
    bars = plt.bar(names, scores, color=sns.color_palette("husl", len(names)))

    # Add RMSE numbers on top of bars
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width() / 2.0, height + 0.02, f'{height:.2f}', 
                 ha='center', va='bottom', fontsize=10)

    plt.xlabel('Model')
    plt.ylabel('RMSE')
    plt.title('Model RMSE Comparison')
    plt.xticks(rotation=45)
    plt.show()

    return best_model_name, best_model, best_score

# Run the tuning and evaluation
best_model_name, best_model, best_score = tune_and_evaluate(models, param_grids, X_train, y_train, X_test, y_test)

Tuning LinearRegression...
Best parameters for LinearRegression: {}
MAE for LinearRegression: 0.7616897660910735
MSE for LinearRegression: 0.8960322326492577
RMSE for LinearRegression: 0.9465897911182318
R2 for LinearRegression: -0.761489767335382
Tuning Ridge...
Best parameters for Ridge: {'alpha': 10}
MAE for Ridge: 0.66854667821996
MSE for Ridge: 0.6693666836366479
RMSE for Ridge: 0.8181483261833687
R2 for Ridge: -0.3158930235523285
Tuning Lasso...
Best parameters for Lasso: {'alpha': 0.1}
MAE for Lasso: 0.6373784616208031
MSE for Lasso: 0.6593558025624782
RMSE for Lasso: 0.8120072675552099
R2 for Lasso: -0.2962128558846726
Tuning DecisionTree...
Best parameters for DecisionTree: {'max_depth': 20, 'min_samples_split': 2}
MAE for DecisionTree: 0.4337120761680093
MSE for DecisionTree: 0.4892168091298932
RMSE for DecisionTree: 0.6994403542332206
R2 for DecisionTree: 0.03825959391787126
Tuning RandomForest...
Best parameters for RandomForest: {'max_depth': None, 'n_estimators': 100}
MAE
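
Grid search fits every combination in a grid, so its cost grows multiplicatively with each added parameter. A randomized search that samples a fixed number of candidates is a common cheaper alternative. The sketch below is an illustration on synthetic data from `make_regression`; the parameter ranges, `n_iter=5`, and `cv=3` are arbitrary assumptions rather than settings used in this notebook.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem (assumption: stands in for X_train, y_train).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0,
                                 random_state=42)

# Distributions are sampled; lists are drawn from uniformly.
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": [None, 5, 10],
    "min_samples_split": randint(2, 11),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                                # number of sampled candidates
    cv=3,
    scoring="neg_root_mean_squared_error",   # higher (closer to 0) is better
    random_state=42,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated RMSE:", -search.best_score_)
```

`RandomizedSearchCV` exposes the same `best_estimator_`, `best_params_`, and `best_score_` attributes as `GridSearchCV`, so it could be swapped into `tune_and_evaluate` with only the constructor call changed.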


## Conclusion

In this analysis, we evaluated nine regression models for predicting house prices on a 0.01% sample of the real estate dataset. Each model's performance on the test set was assessed using Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R² Score. Because `price` was standardized during preprocessing, MAE, MSE, and RMSE are expressed in standard-deviation units of price rather than dollars. The following models were tuned and assessed:

### Model Performance Summary

| Model | Best Parameters | MAE | MSE | RMSE | R² Score |
|---|---|---|---|---|---|
| Linear Regression | `{}` | 0.76 | 0.90 | 0.95 | -0.76 |
| Ridge Regression | `{'alpha': 10}` | 0.67 | 0.67 | 0.82 | -0.32 |
| Lasso Regression | `{'alpha': 0.1}` | 0.64 | 0.66 | 0.81 | -0.30 |
| Decision Tree | `{'max_depth': 20, 'min_samples_split': 2}` | 0.43 | 0.49 | 0.70 | 0.04 |
| Random Forest | `{'max_depth': None, 'n_estimators': 100}` | 0.42 | 0.43 | 0.65 | 0.16 |
| Gradient Boosting | `{'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 100}` | 0.39 | 0.47 | 0.69 | 0.07 |
| AdaBoost | `{'learning_rate': 0.01, 'n_estimators': 50}` | 0.37 | 0.48 | 0.69 | 0.06 |
| Support Vector Regression (SVR) | `{'C': 1, 'kernel': 'rbf'}` | 0.37 | 0.43 | 0.66 | 0.15 |
| K-Nearest Neighbors (KNN) | `{'n_neighbors': 7, 'weights': 'uniform'}` | 0.41 | 0.32 | 0.57 | 0.37 |
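
Several of the R² scores above are negative, which can look surprising. R² is defined as 1 - SS_res / SS_tot, so it drops below zero whenever a model's squared error exceeds that of a constant prediction at the mean of the true values. The small worked example below uses made-up numbers (not results from this notebook) to show how the four metrics relate and how a negative R² arises.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and two sets of predictions (illustrative only).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_good = np.array([1.5, 2.0, 2.5, 4.0])   # close to the truth
y_bad = np.array([4.0, 3.0, 2.0, 1.0])    # order reversed

def report(y_true, y_pred):
    """Return (MAE, MSE, RMSE, R2) for one set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)                    # RMSE is the square root of MSE
    r2 = r2_score(y_true, y_pred)          # 1 - SS_res / SS_tot
    return mae, mse, rmse, r2

good = report(y_true, y_good)
bad = report(y_true, y_bad)
print("good:", good)   # MAE 0.25, MSE 0.125, R2 0.9
print("bad: ", bad)    # MAE 2.0,  MSE 5.0,   R2 -3.0
```

For `y_bad`, SS_res is 20 while SS_tot is 5, giving R² = 1 - 20/5 = -3: the predictions are worse than always guessing the mean of 2.5.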

### Best Model

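The best-performing model is **K-Nearest Neighbors (KNN)**, with the lowest RMSE (**0.57**) and the highest R² Score (0.37) among all evaluated models. An R² of 0.37 means the model explains roughly 37% of the variance in test-set prices, so the fit is modest in absolute terms. The three linear models have negative R² scores, meaning they perform worse on the test set than a constant prediction at the mean price. These results come from a 0.01% sample of the data, and the best model was selected on the same test set that is used to report its score, so the ranking should be read as indicative rather than conclusive.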
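
Because `price` was passed through `StandardScaler` during preprocessing, an RMSE of 0.57 is measured in standard deviations of price. Multiplying by the scaler's standard deviation for the price column converts it back to dollars. The sketch below shows the arithmetic only: the five prices are taken from the sample rows shown earlier and are an assumption for illustration, so the resulting dollar figure is not the notebook's actual error.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative prices only (assumption: NOT the full training sample).
prices = np.array([[275000.0], [399900.0], [75000.0], [325000.0], [169900.0]])

# StandardScaler stores the per-column standard deviation in `scale_`.
price_scaler = StandardScaler().fit(prices)
price_std = price_scaler.scale_[0]

rmse_standardized = 0.57                    # KNN's reported RMSE
rmse_dollars = rmse_standardized * price_std

print(f"Price standard deviation: {price_std:,.0f}")
print(f"RMSE in dollars (illustrative): {rmse_dollars:,.0f}")
```

In the notebook itself, `price` is the first of the five columns passed to `scaler.fit_transform`, so `scaler.scale_[0]` would be the corresponding value.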

Future work could involve further hyperparameter tuning, feature engineering, and exploring additional models to enhance the predictive performance and robustness of the analysis.
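
One concrete follow-up is a learning curve: `learning_curve` is imported in the first cell but never called in the cells shown. It reports training and validation scores at increasing training-set sizes, which helps judge whether more data (for example, a larger sample fraction) would be likely to help. The sketch below is a minimal example on synthetic data; the dataset, the `n_neighbors=7` setting, and the size grid are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem (assumption: stands in for the real features).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15.0,
                                 random_state=42)

# Scores are computed for 5 training-set sizes, each with 5-fold CV.
train_sizes, train_scores, valid_scores = learning_curve(
    KNeighborsRegressor(n_neighbors=7),
    X_demo,
    y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# Flip the sign to get RMSE, then average over the CV folds.
train_rmse = -train_scores.mean(axis=1)
valid_rmse = -valid_scores.mean(axis=1)

for size, tr, va in zip(train_sizes, train_rmse, valid_rmse):
    print(f"n={size:4d}  train RMSE={tr:8.2f}  validation RMSE={va:8.2f}")
```

If validation RMSE is still falling at the largest size, more training data is likely to improve the model; if the two curves have converged, changes to features or model class are more promising.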

## Acknowledgments

I would like to express my gratitude to the following:

- **Kaggle**: For providing the dataset which made this analysis possible.
- **Realtor.com**: For being a valuable source of real estate data.
- **Open Source Libraries**: Such as scikit-learn, matplotlib, and seaborn for their valuable tools and resources.

Thank you for your support and contributions.

## References

- [Kaggle USA Real Estate Dataset](https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset?resource=download)
- [Scikit-Learn Documentation](https://scikit-learn.org/stable/documentation.html)
- [Matplotlib Documentation](https://matplotlib.org/stable/contents.html)
- [Seaborn Documentation](https://seaborn.pydata.org/tutorial.html)
- [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/index.html)
- [NumPy Documentation](https://numpy.org/doc/stable/)
- [Machine Learning Mastery](https://machinelearningmastery.com/)

## Author

- [Ahmad Bin Sadiq](https://www.linkedin.com/in/ahmadbinsadiq/)
- **Email:** ahmadbinsadiq@gmail.com

---