## Delhi House Price Prediction

Given *data about houses in Delhi*, let's try to predict the **price** of a given house. 

We will train thirteen regression models, from plain linear regression to gradient-boosted trees, and compare their R^2 scores on a held-out test set.

Data source: https://www.kaggle.com/datasets/neelkamal692/delhi-house-price-prediction

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import LinearSVR, SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

import warnings
warnings.filterwarnings(action='ignore')

In [2]:
data = pd.read_csv('MagicBricks.csv')
data

,Area,BHK,Bathroom,Furnishing,Locality,Parking,Price,Status,Transaction,Type,Per_Sqft
0,800.0,3,2.0,Semi-Furnished,Rohini Sector 25,1.0,6500000,Ready_to_move,New_Property,Builder_Floor,
1,750.0,2,2.0,Semi-Furnished,"J R Designers Floors, Rohini Sector 24",1.0,5000000,Ready_to_move,New_Property,Apartment,6667.0
2,950.0,2,2.0,Furnished,"Citizen Apartment, Rohini Sector 13",1.0,15500000,Ready_to_move,Resale,Apartment,6667.0
3,600.0,2,2.0,Semi-Furnished,Rohini Sector 24,1.0,4200000,Ready_to_move,Resale,Builder_Floor,6667.0
4,650.0,2,2.0,Semi-Furnished,Rohini Sector 24 carpet area 650 sqft status R...,1.0,6200000,Ready_to_move,New_Property,Builder_Floor,6667.0
...,...,...,...,...,...,...,...,...,...,...,...
1254,4118.0,4,5.0,Unfurnished,Chittaranjan Park,3.0,55000000,Ready_to_move,New_Property,Builder_Floor,12916.0
1255,1050.0,3,2.0,Semi-Furnished,Chittaranjan Park,3.0,12500000,Ready_to_move,Resale,Builder_Floor,12916.0
1256,875.0,3,3.0,Semi-Furnished,Chittaranjan Park,3.0,17500000,Ready_to_move,New_Property,Builder_Floor,12916.0
1257,990.0,2,2.0,Unfurnished,Chittaranjan Park Block A,1.0,11500000,Ready_to_move,Resale,Builder_Floor,12916.0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Area         1259 non-null   float64
 1   BHK          1259 non-null   int64  
 2   Bathroom     1257 non-null   float64
 3   Furnishing   1254 non-null   object 
 4   Locality     1259 non-null   object 
 5   Parking      1226 non-null   float64
 6   Price        1259 non-null   int64  
 7   Status       1259 non-null   object 
 8   Transaction  1259 non-null   object 
 9   Type         1254 non-null   object 
 10  Per_Sqft     1018 non-null   float64
dtypes: float64(4), int64(2), object(5)
memory usage: 108.3+ KB


### Preprocessing

In [4]:
df = data.copy()

In [5]:
df.isna().sum()

Area             0
BHK              0
Bathroom         2
Furnishing       5
Locality         0
Parking         33
Price            0
Status           0
Transaction      0
Type             5
Per_Sqft       241
dtype: int64

In [6]:
# Drop the Per_Sqft column, which is missing in 241 of the 1259 rows
df = df.drop('Per_Sqft', axis=1)

In [7]:
df.isna().sum()

Area            0
BHK             0
Bathroom        2
Furnishing      5
Locality        0
Parking        33
Price           0
Status          0
Transaction     0
Type            5
dtype: int64

In [8]:
{column: list(df[column].unique()) for column in df.select_dtypes('object').columns.drop('Locality')}

{'Furnishing': ['Semi-Furnished', 'Furnished', 'Unfurnished', nan],
 'Status': ['Ready_to_move', 'Almost_ready'],
 'Transaction': ['New_Property', 'Resale'],
 'Type': ['Builder_Floor', 'Apartment', nan]}

In [9]:
# Fill missing values with each column's mode (Furnishing's 5 NaNs are left as-is)
for column in ['Parking', 'Type', 'Bathroom']:
    df[column] = df[column].fillna(df[column].mode()[0])

In [10]:
df.isna().sum()

Area           0
BHK            0
Bathroom       0
Furnishing     5
Locality       0
Parking        0
Price          0
Status         0
Transaction    0
Type           0
dtype: int64

In [11]:
# Binary Encoding
df['Status'] = df['Status'].replace({'Almost_ready': 0, 'Ready_to_move': 1})
df['Transaction'] = df['Transaction'].replace({'New_Property': 0, 'Resale': 1})
df['Type'] = df['Type'].replace({'Builder_Floor': 0, 'Apartment': 1})

In [12]:
df

,Area,BHK,Bathroom,Furnishing,Locality,Parking,Price,Status,Transaction,Type
0,800.0,3,2.0,Semi-Furnished,Rohini Sector 25,1.0,6500000,1,0,0
1,750.0,2,2.0,Semi-Furnished,"J R Designers Floors, Rohini Sector 24",1.0,5000000,1,0,1
2,950.0,2,2.0,Furnished,"Citizen Apartment, Rohini Sector 13",1.0,15500000,1,1,1
3,600.0,2,2.0,Semi-Furnished,Rohini Sector 24,1.0,4200000,1,1,0
4,650.0,2,2.0,Semi-Furnished,Rohini Sector 24 carpet area 650 sqft status R...,1.0,6200000,1,0,0
...,...,...,...,...,...,...,...,...,...,...
1254,4118.0,4,5.0,Unfurnished,Chittaranjan Park,3.0,55000000,1,0,0
1255,1050.0,3,2.0,Semi-Furnished,Chittaranjan Park,3.0,12500000,1,1,0
1256,875.0,3,3.0,Semi-Furnished,Chittaranjan Park,3.0,17500000,1,0,0
1257,990.0,2,2.0,Unfurnished,Chittaranjan Park Block A,1.0,11500000,1,1,0


In [13]:
def onehot_encode(df, column, rename=False):
    """Return a copy of df with `column` replaced by one-hot dummy columns."""
    df = df.copy()
    if rename:
        # Map each unique value to an integer so the dummy column names stay short
        df[column] = df[column].replace({x: i for i, x in enumerate(df[column].unique())})
    dummies = pd.get_dummies(df[column], prefix=column)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df

In [14]:
# One Hot Encoding
df = onehot_encode(df, column='Furnishing', rename=False)
df = onehot_encode(df, column='Locality', rename=True)
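Note that `Furnishing` still contains 5 missing values when it is encoded. By default `pd.get_dummies` does not create a column for NaN, so those rows simply get `False` in every `Furnishing_*` column. A minimal sketch on a made-up three-row frame (the `toy` frame is illustrative, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
toy = pd.DataFrame({"Furnishing": ["Furnished", np.nan, "Unfurnished"]})

# Default behavior (dummy_na=False): no dummy column is created for NaN
dummies = pd.get_dummies(toy["Furnishing"], prefix="Furnishing")

# Count how many dummy columns are set in each row
row_sums = dummies.sum(axis=1)

print(dummies)
print(row_sums.tolist())
```

The row with the missing value sums to zero, which is why no NaNs remain in `df` after encoding even though `Furnishing` was never imputed.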

In [15]:
df

,Area,BHK,Bathroom,Parking,Price,Status,Transaction,Type,Furnishing_Furnished,Furnishing_Semi-Furnished,...,Locality_355,Locality_356,Locality_357,Locality_358,Locality_359,Locality_360,Locality_361,Locality_362,Locality_363,Locality_364
0,800.0,3,2.0,1.0,6500000,1,0,0,False,True,...,False,False,False,False,False,False,False,False,False,False
1,750.0,2,2.0,1.0,5000000,1,0,1,False,True,...,False,False,False,False,False,False,False,False,False,False
2,950.0,2,2.0,1.0,15500000,1,1,1,True,False,...,False,False,False,False,False,False,False,False,False,False
3,600.0,2,2.0,1.0,4200000,1,1,0,False,True,...,False,False,False,False,False,False,False,False,False,False
4,650.0,2,2.0,1.0,6200000,1,0,0,False,True,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1254,4118.0,4,5.0,3.0,55000000,1,0,0,False,False,...,False,False,False,False,True,False,False,False,False,False
1255,1050.0,3,2.0,3.0,12500000,1,1,0,False,True,...,False,False,False,False,True,False,False,False,False,False
1256,875.0,3,3.0,3.0,17500000,1,0,0,False,True,...,False,False,False,False,True,False,False,False,False,False
1257,990.0,2,2.0,1.0,11500000,1,1,0,False,False,...,False,False,False,False,False,False,False,False,False,True


In [21]:
df.isna().sum().sum()

0

In [17]:
{column: list(df[column].unique()) for column in df.select_dtypes('object').columns}

{}

In [18]:
# Split df into X and y
y = df['Price']
X = df.drop('Price', axis=1)

In [19]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)

In [22]:
# Scale X
scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)

In [24]:
X_train.var()

Area            1.001136
BHK             1.001136
Bathroom        1.001136
Parking         1.001136
Status          1.001136
                  ...   
Locality_360    0.000000
Locality_361    1.001136
Locality_362    1.001136
Locality_363    1.001136
Locality_364    1.001136
Length: 375, dtype: float64
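The variances are 1.001136 rather than exactly 1 because `StandardScaler` normalizes by the population standard deviation (`ddof=0`), while pandas' `var()` reports the sample variance (`ddof=1`). With 881 training rows the ratio is 881/880 ≈ 1.001136. A column that is constant in the training split, as `Locality_360` appears to be, comes out with variance 0. The sketch below reproduces both effects on synthetic data; the `toy` frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n_rows = 881  # same number of rows as the 70% training split
rng = np.random.default_rng(0)

# Hypothetical data: one ordinary column and one constant column
toy = pd.DataFrame({
    "varying": rng.normal(size=n_rows),
    "constant": np.zeros(n_rows),
})
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

sample_var = scaled["varying"].var()            # pandas default: ddof=1
population_var = scaled["varying"].var(ddof=0)  # the quantity StandardScaler sets to 1
constant_var = scaled["constant"].var()         # constant column stays all zeros

print(round(sample_var, 6), round(population_var, 6), constant_var)
print(round(n_rows / (n_rows - 1), 6))
```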

In [25]:
y_train

582     55000000
976     14000000
886      1490000
561     30000000
1083     4000000
          ...   
715     14300000
905     67000000
1096     5500000
235     13000000
1061     8000000
Name: Price, Length: 881, dtype: int64

### Training

In [27]:
models = {
    "                     Linear Regression": LinearRegression(),
    " Linear Regression (L2 Regularization)": Ridge(),
    " Linear Regression (L1 Regularization)": Lasso(),
    "                   K-Nearest Neighbors": KNeighborsRegressor(),
    "                        Neural Network": MLPRegressor(),
    "Support Vector Machine (Linear Kernel)": LinearSVR(),
    "   Support Vector Machine (RBF Kernel)": SVR(),
    "                         Decision Tree": DecisionTreeRegressor(),
    "                         Random Forest": RandomForestRegressor(),
    "                     Gradient Boosting": GradientBoostingRegressor(),
    "                               XGBoost": XGBRegressor(),
    "                              LightGBM": LGBMRegressor(verbose=0),
    "                              CatBoost": CatBoostRegressor(verbose=0)
}

In [28]:
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

                     Linear Regression trained.
 Linear Regression (L2 Regularization) trained.
 Linear Regression (L1 Regularization) trained.
                   K-Nearest Neighbors trained.
                        Neural Network trained.
Support Vector Machine (Linear Kernel) trained.
   Support Vector Machine (RBF Kernel) trained.
                         Decision Tree trained.
                         Random Forest trained.
                     Gradient Boosting trained.
                               XGBoost trained.
                              LightGBM trained.
                              CatBoost trained.


### Results

In [29]:
for name, model in models.items():
    print(name + " R^2 Score: {:.5f}".format(model.score(X_test, y_test)))

                     Linear Regression R^2 Score: 0.67648
 Linear Regression (L2 Regularization) R^2 Score: 0.67672
 Linear Regression (L1 Regularization) R^2 Score: 0.67647
                   K-Nearest Neighbors R^2 Score: 0.59531
                        Neural Network R^2 Score: -0.62241
Support Vector Machine (Linear Kernel) R^2 Score: -0.62248
   Support Vector Machine (RBF Kernel) R^2 Score: -0.07453
                         Decision Tree R^2 Score: 0.70921
                         Random Forest R^2 Score: 0.81970
                     Gradient Boosting R^2 Score: 0.84855
                               XGBoost R^2 Score: 0.87506
                              LightGBM R^2 Score: 0.77550
                              CatBoost R^2 Score: 0.85818
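The negative scores for the neural network and the two support vector machines mean those models do worse than predicting the mean price. One plausible cause (an assumption, not verified on this dataset) is that `Price` is left unscaled in the tens of millions, which the default `MLPRegressor` and `SVR` settings handle poorly. The sketch below illustrates the idea on synthetic data by wrapping `SVR` in scikit-learn's `TransformedTargetRegressor`; `X_toy` and `y_toy` are made up for illustration and the numbers will not match the results above.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical data: standardized features and a target in the tens of millions
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(300, 3))
y_toy = 1e7 + 5e6 * X_toy[:, 0] + 2e6 * X_toy[:, 1] + rng.normal(scale=5e5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, train_size=0.7, random_state=1)

# Default SVR fitted directly on the large-valued target
raw_r2 = SVR().fit(X_tr, y_tr).score(X_te, y_te)

# Same SVR, but the target is standardized for fitting and
# predictions are mapped back to the original scale before scoring
wrapped = TransformedTargetRegressor(regressor=SVR(), transformer=StandardScaler())
scaled_r2 = wrapped.fit(X_tr, y_tr).score(X_te, y_te)

print("raw target R^2:    {:.5f}".format(raw_r2))
print("scaled target R^2: {:.5f}".format(scaled_r2))
```

If the same pattern holds here, wrapping the affected models this way (or tuning `C`, `epsilon`, and `max_iter`) would be a natural next experiment before concluding that they are unsuited to the problem.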
