# **Bengaluru House Price Prediction**

Data Science Regression Project - Predicting Home Prices in Bangalore.

> [Kaggle Dataset](https://www.kaggle.com/datasets/amitabhajoy/bengaluru-house-price-data)

In [None]:
# Install Kaggle.
!pip install --upgrade --force-reinstall --no-deps kaggle

In [None]:
# Upload kaggle.json (the Kaggle API token file).
from google.colab import files

files.upload()

In [None]:
# Create the Kaggle config folder (-p avoids an error if it already exists).
!mkdir -p ~/.kaggle

# Copy kaggle.json into the folder created above.
!cp kaggle.json ~/.kaggle/

# Restrict the file permissions so that only the owner can read the API token.
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
# Dataset Download.
!kaggle datasets download -d amitabhajoy/bengaluru-house-price-data

In [None]:
# Unzip Dataset.
!unzip bengaluru-house-price-data.zip

# **References**

> [**Real Estate Price Prediction Project - YouTube Tutorial**](https://www.youtube.com/watch?v=rdfbcdP75KI&list=PLeo1K3hjS3uu7clOTtwsp94PcHbzqpAdg)

In [None]:
# Import Library.
import pandas as pd
import numpy as np

# Load Dataset.
data = pd.read_csv("Bengaluru_House_Data.csv")
data.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [None]:
# Shape of the Dataset.
print("Dataset Shape is", data.shape)

Dataset Shape is (13320, 9)


In [None]:
# Drop features that are not used in this project.
data = data.drop(["area_type", "availability", "society", "balcony"], axis="columns")

# Remove rows with missing values.
data = data.dropna()

# Extract the number of bedrooms from "size" as an integer (e.g., "2 BHK" -> 2).
data["size"] = data["size"].apply(lambda x: int(x.split(" ")[0]))

# Convert the "bath" feature to integer values.
data["bath"] = data["bath"].astype("int")

# Strip leading and trailing spaces from the "location" feature.
data["location"] = data["location"].apply(lambda x: x.strip())

In [None]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13246 entries, 0 to 13319
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   location    13246 non-null  object 
 1   size        13246 non-null  int64  
 2   total_sqft  13246 non-null  object 
 3   bath        13246 non-null  int64  
 4   price       13246 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 620.9+ KB
None


In [None]:
# Explore the "total_sqft" feature.
def is_float(x):
    """Return True if x can be parsed as a float."""
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True


data[~data["total_sqft"].apply(is_float)].head(10)

Unnamed: 0,location,size,total_sqft,bath,price
30,Yelahanka,4,2100 - 2850,4,186.0
122,Hebbal,4,3067 - 8156,4,477.0
137,8th Phase JP Nagar,2,1042 - 1105,2,54.005
165,Sarjapur,2,1145 - 1340,2,43.49
188,KR Puram,2,1015 - 1540,2,56.8
410,Kengeri,1,34.46Sq. Meter,1,18.5
549,Hennur Road,2,1195 - 1440,2,63.77
648,Arekere,9,4125Perch,9,265.0
661,Yelahanka,2,1120 - 1145,2,48.13
672,Bettahalsoor,4,3090 - 5002,4,445.0


The table above shows that the "**total_sqft**" feature can also be a range (e.g., 2100 - 2850). In that case, take the average of the "$min$" and "$max$" values of the range. Other values carry a unit, such as "34.46Sq. Meter", and could be converted to square feet with a unit conversion. For simplicity, this example drops such corner cases.

In [None]:
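def convert_sqft_to_num(x):
    """Convert a "total_sqft" entry to a float; return None if it cannot be parsed."""
    tokens = x.split("-")
    try:
        if len(tokens) == 2:
            # Range such as "2100 - 2850": use the average of both ends.
            return (float(tokens[0]) + float(tokens[1])) / 2
        return float(x)
    except ValueError:
        # Entries with units (e.g., "34.46Sq. Meter") become None and are dropped below.
        return None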


data["total_sqft"] = data["total_sqft"].apply(convert_sqft_to_num)
data = data[data["total_sqft"].notnull()]
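
The cell above drops values that carry a unit. As a rough sketch of the unit conversion mentioned earlier, the hypothetical helper below (`sqft_with_units` is not part of the notebook) also converts "Sq. Meter" entries, assuming 1 square meter is about 10.7639 square feet. Other units, such as "Perch", are still returned as `None`.

```python
import re

# Assumption: 1 square meter is approximately 10.7639 square feet.
SQFT_PER_SQ_METER = 10.7639


def sqft_with_units(value):
    """Hypothetical variant of convert_sqft_to_num that also handles "Sq. Meter"."""
    text = str(value).strip()

    # Range such as "2100 - 2850": use the average of both ends.
    parts = text.split(" - ")
    if len(parts) == 2:
        try:
            return (float(parts[0]) + float(parts[1])) / 2
        except ValueError:
            return None

    # Value with a square-meter suffix such as "34.46Sq. Meter".
    match = re.fullmatch(r"([0-9]*\.?[0-9]+)\s*Sq\. Meter", text)
    if match:
        return float(match.group(1)) * SQFT_PER_SQ_METER

    # Plain number, or an unsupported unit that cannot be parsed.
    try:
        return float(text)
    except ValueError:
        return None


print(sqft_with_units("2100 - 2850"))
print(sqft_with_units("34.46Sq. Meter"))
print(sqft_with_units("4125Perch"))
```

Whether the extra rows are worth keeping depends on how many entries use each unit; the notebook keeps the simpler approach.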

In [None]:
# Add a new feature called "price_per_sqft" ("price" is in lakhs, so multiply by 100000 to get rupees).
data["price_per_sqft"] = data["price"] * 100000 / data["total_sqft"]
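
As a quick check of the formula, the sketch below recomputes the value for the first row of the dataset shown earlier (1056 sqft. at a price of 39.07). It assumes that "price" is expressed in lakhs of rupees (1 lakh = 100000), which is what the factor 100000 in the cell above suggests.

```python
# Assumption: "price" is in lakhs of rupees, so 39.07 stands for Rs 3,907,000.
price_lakhs = 39.07
total_sqft = 1056.0

price_per_sqft = price_lakhs * 100000 / total_sqft
print(round(price_per_sqft, 2))
```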

**Examine the "location" feature, which is a categorical variable. Apply a dimensionality reduction technique here to reduce the number of locations.**

In [None]:
data["location"].value_counts(ascending=False)

Whitefield                   533
Sarjapur  Road               392
Electronic City              304
Kanakpura Road               264
Thanisandra                  235
                            ... 
Rajanna Layout                 1
Subramanyanagar                1
Lakshmipura Vidyaanyapura      1
Malur Hosur Road               1
Abshot Layout                  1
Name: location, Length: 1287, dtype: int64

**Any location with 20 or fewer data points is tagged as "Others". This drastically reduces the number of categories.**

In [None]:
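# Count the data points per location once, then tag the rare locations as "Others".
location_counts = data["location"].value_counts()
rare_locations = location_counts[location_counts <= 20].index
data["location"] = data["location"].apply(
    lambda x: "Others" if x in rare_locations else x
)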
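
To make the grouping step concrete, here is a minimal sketch on a made-up table. The `toy` frame and the threshold of 2 are illustrative assumptions (the notebook uses 20), and `Series.where` is used instead of `apply`.

```python
import pandas as pd

# Made-up data: three listings in one location, two in another, one in a third.
toy = pd.DataFrame(
    {"location": ["Whitefield"] * 3 + ["Hebbal"] * 2 + ["Abshot Layout"]}
)
threshold = 2  # hypothetical value for this toy example; the notebook uses 20

# Count listings per location once and collect the rare location names.
counts = toy["location"].value_counts()
rare = counts[counts <= threshold].index

# Keep the original name unless the location is rare, in which case use "Others".
toy["location"] = toy["location"].where(~toy["location"].isin(rare), "Others")
print(toy["location"].value_counts())
```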

## **Outlier Removal using Business Logic.**

As a data scientist, when you talk to your business manager (who has expertise in real estate), they will tell you that a bedroom normally needs at least 300 sqft. (i.e., a 2 BHK apartment has a minimum of 600 sqft.). For example, a 2 BHK apartment with 400 sqft. looks suspicious and can be removed as an outlier. We remove such outliers by setting the minimum threshold per BHK to 300 sqft.

The table below lists some of these data points. For example, there is a 6 BHK apartment with 1020 sqft. and an 8 BHK apartment with 600 sqft. These data points are outliers and must be removed.

In [None]:
print(data.shape)

(13200, 6)


In [None]:
data[data["total_sqft"] / data["size"] < 300].head()

Unnamed: 0,location,size,total_sqft,bath,price,price_per_sqft
9,Others,6,1020.0,6,370.0,36274.509804
45,HSR Layout,8,600.0,9,200.0,33333.333333
58,Others,6,1407.0,4,150.0,10660.98081
68,Others,8,1350.0,7,85.0,6296.296296
70,Others,3,500.0,3,100.0,20000.0


In [None]:
# Remove Outliers.
data = data[~(data["total_sqft"] / data["size"] < 300)]
print(data.shape)

(12456, 6)
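
The rule can be checked on a few hand-picked rows. The sketch below uses a made-up `toy` frame (two of its rows mirror the 6 BHK and 8 BHK examples mentioned above) and applies the same 300 sqft. per bedroom threshold.

```python
import pandas as pd

MIN_SQFT_PER_BHK = 300

# Made-up rows: only the first one has at least 300 sqft. per bedroom.
toy = pd.DataFrame(
    {"size": [2, 2, 6, 8], "total_sqft": [1200.0, 400.0, 1020.0, 600.0]}
)

sqft_per_bhk = toy["total_sqft"] / toy["size"]
kept = toy[~(sqft_per_bhk < MIN_SQFT_PER_BHK)]
print(kept)
```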


## **Outlier Removal Using Standard Deviation and Mean.**

The description below shows that the minimum price per sqft. is about Rs 267, whereas the maximum is about Rs 176470. This is a wide variation in property prices. We remove outliers per location by keeping only the data points whose price per sqft. lies within one standard deviation of that location's mean.

In [None]:
data["price_per_sqft"].describe()

count     12456.000000
mean       6308.502826
std        4168.127339
min         267.829813
25%        4210.526316
50%        5294.117647
75%        6916.666667
max      176470.588235
Name: price_per_sqft, dtype: float64

In [None]:
def remove_pps_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby("location"):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        reduced_df = subdf[
            (subdf.price_per_sqft > (m - st)) & (subdf.price_per_sqft <= (m + st))
        ]
        df_out = pd.concat([df_out, reduced_df], ignore_index=True)
    return df_out


data = remove_pps_outliers(data)
print(data.shape)

(10431, 6)
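
The same per-location filter can also be written without an explicit loop. The function below is a hypothetical alternative (the name `remove_pps_outliers_vectorized` and the `toy` data are assumptions), sketched with `groupby().transform()`. Unlike the loop version, it keeps the original row order instead of ordering rows by location.

```python
import pandas as pd


def remove_pps_outliers_vectorized(df):
    """Hypothetical alternative to remove_pps_outliers using groupby().transform()."""
    grouped = df.groupby("location")["price_per_sqft"]
    mean = grouped.transform("mean")
    # ddof=0 gives the population standard deviation, which is what np.std computes.
    std = grouped.transform(lambda s: s.std(ddof=0))
    keep = (df["price_per_sqft"] > mean - std) & (df["price_per_sqft"] <= mean + std)
    return df[keep].reset_index(drop=True)


# Made-up data: location "A" has one extreme value, location "B" is tightly spread.
toy = pd.DataFrame(
    {
        "location": ["A", "A", "A", "A", "B", "B", "B"],
        "price_per_sqft": [100.0, 110.0, 120.0, 500.0, 200.0, 205.0, 210.0],
    }
)
result = remove_pps_outliers_vectorized(toy)
print(result)
```

Note that a one-standard-deviation band is strict: in location "B" only the middle value survives, even though none of the three values looks extreme.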


We also remove the $N$ BHK apartments whose "**price_per_sqft**" is less than the mean "**price_per_sqft**" of the $(N-1)$ BHK apartments in the same location, provided that the location has more than 5 such $(N-1)$ BHK apartments. Here $N$ is the number of bedrooms and is greater than 1. For example, we can remove those 2 BHK apartments whose "**price_per_sqft**" is less than the mean "**price_per_sqft**" of the 1 BHK apartments in the same location.

In [None]:
def remove_bhk_outliers(df):
    exclude_indices = np.array([])
    for location, location_df in df.groupby("location"):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby("size"):
            bhk_stats[bhk] = {
                "mean": np.mean(bhk_df.price_per_sqft),
                "std": np.std(bhk_df.price_per_sqft),
                "count": bhk_df.shape[0],
            }
        for bhk, bhk_df in location_df.groupby("size"):
            stats = bhk_stats.get(bhk - 1)
            if stats and stats["count"] > 5:
                exclude_indices = np.append(
                    exclude_indices,
                    bhk_df[bhk_df.price_per_sqft < (stats["mean"])].index.values,
                )
    return df.drop(exclude_indices, axis="index")


data = remove_bhk_outliers(data)
print(data.shape)

(6972, 6)
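
The rule is easier to follow on a single location. The sketch below uses made-up numbers (the `toy` frame is an assumption) with six 1 BHK listings and two 2 BHK listings, and flags the 2 BHK listing that is cheaper per sqft. than the 1 BHK mean.

```python
import pandas as pd

# Made-up listings for one location: six 1 BHK and two 2 BHK apartments.
toy = pd.DataFrame(
    {
        "location": ["Hebbal"] * 8,
        "size": [1, 1, 1, 1, 1, 1, 2, 2],
        "price_per_sqft": [5000.0, 5200.0, 4800.0, 5100.0, 4900.0, 5000.0, 4500.0, 6000.0],
    }
)

one_bhk = toy[toy["size"] == 1]
two_bhk = toy[toy["size"] == 2]
one_bhk_mean = one_bhk["price_per_sqft"].mean()

# The comparison is only applied when there are more than 5 one-bedroom listings.
if len(one_bhk) > 5:
    flagged = two_bhk[two_bhk["price_per_sqft"] < one_bhk_mean]
else:
    flagged = two_bhk.iloc[0:0]

print(one_bhk_mean)
print(flagged)
```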


## **Outlier Removal using Bathrooms Features.**

In [None]:
data["bath"].unique()

array([ 3,  5,  4,  2,  8,  1,  6,  7,  9, 12, 16, 13])

In [None]:
# No. of Bedrooms >= No. of Bathrooms.
data = data[data["bath"] < data["size"] + 1]

# Remove data points where BHK >= 5 (keep only BHK < 5).
data = data[data["size"] < 5]

print(data.shape)

(6405, 6)
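
As a small illustration of the two filters above, the sketch below applies them to a made-up `toy` frame. For integer columns, `bath < size + 1` is the same condition as `bath <= size`.

```python
import pandas as pd

# Made-up rows: row 1 has more bathrooms than bedrooms, row 3 has 6 bedrooms.
toy = pd.DataFrame({"size": [2, 2, 3, 6], "bath": [2, 4, 3, 5]})

# Keep rows where bathrooms do not exceed bedrooms and there are fewer than 5 bedrooms.
kept = toy[(toy["bath"] <= toy["size"]) & (toy["size"] < 5)]
print(kept)
```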


**After outlier removal, some locations have few data points left, so the grouping is applied again: any location with 20 or fewer data points is tagged as "Others".**

In [None]:
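# Recount the data points per location, then tag the rare locations as "Others".
location_counts = data["location"].value_counts()
rare_locations = location_counts[location_counts <= 20].index
data["location"] = data["location"].apply(
    lambda x: "Others" if x in rare_locations else x
)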

In [None]:
# Drop the "price_per_sqft" and "bath" features.
data = data.drop(["price_per_sqft", "bath"], axis="columns")

# Shuffle the dataset rows and reset the row index.
data = data.sample(frac=1).reset_index(drop=True)

print(data)

                location  size  total_sqft  price
0                 Others     3      1740.0  120.0
1     7th Phase JP Nagar     2      1175.0   82.5
2                 Others     3      1893.0  115.0
3              Bellandur     3      1846.0  135.0
4     5th Phase JP Nagar     2      1030.0   57.0
...                  ...   ...         ...    ...
6400               Hoodi     3      2144.0  140.0
6401         Hennur Road     2      1020.0   47.0
6402         Hegde Nagar     3      1569.0  101.0
6403              Hebbal     2      1440.0  115.0
6404              Others     2      1200.0   80.0

[6405 rows x 4 columns]


In [None]:
# Save the new dataset (index=False avoids writing the row index as an extra column).
data.to_csv("housingprice.csv", index=False)

# **ML Model Building.**

In [None]:
# Handling Categorical Features: Label Encoding.
from sklearn import preprocessing

lb = preprocessing.LabelEncoder()
data["location"] = lb.fit_transform(data["location"])

# Save the Encoder Object.
import pickle

with open("catencoder.pkl", "wb") as encoder_file:
    pickle.dump(lb, encoder_file)

# Split Dataset into Feature and Target Set.
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Split Dataset into Train and Test Set.
from sklearn.model_selection import train_test_split, RandomizedSearchCV

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
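
The encoder is saved so that the same location-to-integer mapping can be reused at prediction time. The sketch below illustrates that round trip with an in-memory pickle instead of a file; the three location names are a made-up subset, and `LabelEncoder` assigns codes in sorted order of the class names.

```python
import pickle

from sklearn.preprocessing import LabelEncoder

# Fit an encoder on a made-up subset of location names.
encoder = LabelEncoder().fit(["Hebbal", "Others", "Whitefield"])

# Round-trip through pickle, as would happen when loading the saved encoder.
restored = pickle.loads(pickle.dumps(encoder))

# Encode a location for prediction, then decode it again.
code = restored.transform(["Whitefield"])[0]
name = restored.inverse_transform([code])[0]
print(code, name)
```

A location that was not seen during fitting raises a `ValueError` in `transform`, so a caller would need to map unknown names to "Others" first.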

## **Train the Random Forest Regressor Model.**

> [**sklearn.ensemble.RandomForestRegressor**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)

In [None]:
# Using Random Forest, with Hyperparameter Optimization via randomized search.
from sklearn.ensemble import RandomForestRegressor

parameters = {
    "n_estimators": [100, 200, 300, 400],
    "criterion": ["squared_error", "absolute_error", "poisson"],
    "max_depth": [None, 2, 3, 4, 5],
    "min_samples_split": [2, 3, 4, 5],
    # None uses all features; recent scikit-learn versions no longer accept "auto".
    "max_features": [None, "sqrt", "log2"],
    "min_samples_leaf": [1, 2, 3, 4],
}

clf = RandomForestRegressor()
random_search = RandomizedSearchCV(
    estimator=clf,
    param_distributions=parameters,
    n_iter=100,
    cv=10,
    verbose=2,
    random_state=42,
    n_jobs=-1,
)
random_search = random_search.fit(X_train, y_train)

In [None]:
best_random_grid = random_search.best_estimator_
print(random_search.best_estimator_)
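
Besides `best_estimator_`, a fitted `RandomizedSearchCV` also exposes `best_params_` and `best_score_` (the mean cross-validated score of the best candidate, which is R2 for regressors by default). The sketch below shows them on synthetic data; the `make_regression` data, the small parameter grid, and the low `n_iter` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only for illustration.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=3, noise=10.0, random_state=0
)

search = RandomizedSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_distributions={
        "max_depth": [2, 3, 4, 5],
        "min_samples_leaf": [1, 2, 3, 4],
    },
    n_iter=5,
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

# Best sampled parameter combination and its mean cross-validated score.
print(search.best_params_)
print(search.best_score_)
```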

In [None]:
# Predict the Test Set results.
y_pred = best_random_grid.predict(X_test)

In [None]:
# Save the Model in the Pickle File.
with open("rf_model.pkl", "wb") as model_file:
    pickle.dump(best_random_grid, model_file)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("MSE Score {}".format(mean_squared_error(y_test, y_pred)))
print("MAE Score {}".format(mean_absolute_error(y_test, y_pred)))
print("R2 Score {}".format(r2_score(y_test, y_pred)))
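
For reference, the three metrics can be reproduced by hand. The sketch below computes them with NumPy on three made-up values (`y_true` and `y_hat` are assumptions, not notebook results): MSE is the mean squared error, MAE the mean absolute error, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Made-up true prices and predictions.
y_true = np.array([50.0, 100.0, 150.0])
y_hat = np.array([60.0, 90.0, 150.0])

errors = y_true - y_hat
mse = np.mean(errors ** 2)
mae = np.mean(np.abs(errors))
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MSE Score {}".format(mse))
print("MAE Score {}".format(mae))
print("R2 Score {}".format(r2))
```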

## **Train the Decision Tree Regressor Model.**

> [**sklearn.tree.DecisionTreeRegressor**](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor)

In [None]:
# Using Decision Tree, with Hyperparameter Optimization via randomized search.
from sklearn.tree import DecisionTreeRegressor

parameters = {
    "criterion": ["squared_error", "absolute_error", "poisson"],
    "max_depth": [None, 2, 3, 4, 5],
    "min_samples_split": [2, 3, 4, 5],
    # None uses all features; recent scikit-learn versions no longer accept "auto".
    "max_features": [None, "sqrt", "log2"],
    "min_samples_leaf": [1, 2, 3, 4],
}

clf = DecisionTreeRegressor()
random_search = RandomizedSearchCV(
    estimator=clf,
    param_distributions=parameters,
    n_iter=100,
    cv=10,
    verbose=2,
    random_state=42,
    n_jobs=-1,
)
random_search = random_search.fit(X_train, y_train)

In [None]:
best_random_grid = random_search.best_estimator_
print(random_search.best_estimator_)

In [None]:
# Predict the Test Set results.
y_pred = best_random_grid.predict(X_test)

In [None]:
# Save the Model in the Pickle File.
with open("dt_model.pkl", "wb") as model_file:
    pickle.dump(best_random_grid, model_file)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("MSE Score {}".format(mean_squared_error(y_test, y_pred)))
print("MAE Score {}".format(mean_absolute_error(y_test, y_pred)))
print("R2 Score {}".format(r2_score(y_test, y_pred)))

## **Train the Support Vector Regressor Model.**

> [**sklearn.svm.SVR**](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR)

In [None]:
# Using SVR, with Hyperparameter Optimization via randomized search.
from sklearn.svm import SVR

parameters = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "degree": [1, 2, 3, 4, 5],
    "gamma": ["scale", "auto"],
    "C": [0.001, 0.01, 0.1, 1, 10],
}

clf = SVR()
random_search = RandomizedSearchCV(
    estimator=clf,
    param_distributions=parameters,
    n_iter=100,
    cv=10,
    verbose=2,
    random_state=42,
    n_jobs=-1,
)
random_search = random_search.fit(X_train, y_train)

In [None]:
best_random_grid = random_search.best_estimator_
print(random_search.best_estimator_)

In [None]:
# Predict the Test Set results.
y_pred = best_random_grid.predict(X_test)

In [None]:
# Save the Model in the Pickle File.
with open("svr_model.pkl", "wb") as model_file:
    pickle.dump(best_random_grid, model_file)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("MSE Score {}".format(mean_squared_error(y_test, y_pred)))
print("MAE Score {}".format(mean_absolute_error(y_test, y_pred)))
print("R2 Score {}".format(r2_score(y_test, y_pred)))