# Bangalore House Price Estimator

This project uses a Linear Regression model, built on the principles of Machine Learning (ML) and Data Science, to estimate house prices in the city of Bangalore, India from the given input features after data preprocessing.

# Importing Libraries

In [185]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from jupyterthemes import jtplot
jtplot.style()

# Loading Data

In [186]:
price = pd.read_csv('C:/Users/LENOVO/# Jupyter Notebook Files/Machine Learning/Resources/py/DataScience/BangloreHomePrices/model/bengaluru_house_prices.csv')
price.head()

FileNotFoundError: [Errno 2] No such file or directory: 'Data_Set/bengaluru_house_prices.csv'

#### Finding the size of the dataset
Checking the number of rows and columns present in the dataset

In [None]:
price.shape

# Data Cleaning

* #### Removing redundant features

Since this would be a simple model, features like 'area_type', 'availability', 'society', 'balcony' would not be of much use.
<br> Hence these features could be dropped.

In [None]:
price.drop(['area_type', 'availability', 'society', 'balcony'],axis='columns',inplace=True)
price.head()

* #### Detecting Null Values

In [None]:
price.isnull().sum()

Since the number of null values detected is small and the dataset is fairly big, all rows containing null values can be dropped.

In [None]:
price.dropna(inplace=True)
price.isnull().sum()

* #### Converting data to uniform data types in features

The feature 'size' contains data which convey the same meaning but are represented in different formats.
<br> The model needs only numeric data, so only the numeric part of the data present in 'size' needs to be taken.
<br> This can be done by creating a new column/feature 'bhk' which stores only the numeric part of the data in 'size' by applying a function to 'size'.

In [None]:
price['bhk'] = price['size'].apply(lambda x: int(x.split(' ')[0]))
price.bhk.unique()
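As a minimal sketch of the extraction step above, the snippet below applies the same idea to a few toy strings. The sample values are made up for illustration and only assume that every entry starts with a number followed by a space.

```python
import pandas as pd

# Toy values in the style of the 'size' column (made up for illustration)
sizes = pd.Series(["2 BHK", "4 Bedroom", "3 BHK", "1 RK"])

# Take the first space-separated token of each entry and convert it to an integer
bhk = sizes.apply(lambda s: int(s.split(" ")[0]))

print(bhk.tolist())
```

An entry that does not start with a number would raise a ValueError here, so this approach relies on the null values having been dropped and on the column following this pattern.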

After glimpsing through the data in 'total_sqft', some discrepancies were observed.

In [None]:
def is_float(x):
    """Check whether a value can be converted to a float."""
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

In [None]:
price[~price['total_sqft'].apply(is_float)].head(10)
# The negation (~) selects the rows whose 'total_sqft' value cannot be converted to a float

The data represented in 'total_sqft' were found not to be of a uniform data type.
<br> Some entries contained a range of values as well.

In [None]:
def convert_range_to_num(x):
    """Convert a range of the form 'low - high' into a single float by taking the average of its lower
    and upper limits; values that cannot be converted return None."""
    tokens = x.split('-')
    try:
        if len(tokens) == 2:
            return (float(tokens[0])+float(tokens[1]))/2
        return float(x) # Values that are already plain numbers are only cast to float
    except ValueError:
        return None

In [None]:
price.total_sqft = price.total_sqft.apply(convert_range_to_num)
# Removing the values that could not be converted, since the applied function returns None for them
price = price[price.total_sqft.notnull()] 
price.head(2)

In [None]:
price.loc[30] # Checking if the function was applied correctly
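To illustrate how the range conversion behaves, here is a small self-contained sketch. It restates the conversion function and runs it on three made-up strings (a plain number, a range, and a value with a unit); the sample strings are assumptions, not rows quoted from the dataset.

```python
def convert_range_to_num(x):
    """Average a 'low - high' range, cast plain numbers to float, else return None."""
    tokens = x.split('-')
    try:
        if len(tokens) == 2:
            return (float(tokens[0]) + float(tokens[1])) / 2
        return float(x)
    except ValueError:
        return None

# Made-up sample inputs: a plain number, a range, and a value with a unit
samples = ["2100", "1133 - 1384", "34.46Sq. Meter"]
results = [convert_range_to_num(s) for s in samples]

print(results)  # [2100.0, 1258.5, None]
```

Values that come back as None are the ones later filtered out with `notnull()`.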

* #### Outlier Detection and Removal/Correction

The feature 'price_per_sqft' is important in house pricing scenarios, and it can be used to detect the outliers present in 'total_sqft'.

In [None]:
price['price_per_sqft'] = price['price']*100000/price['total_sqft']
# The feature 'price' was multiplied by a factor of 100000 as 'price' was represented in Lakhs (INR)
price.head()
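The unit conversion can be checked by hand with a worked example. The sketch below uses two made-up rows and assumes, as stated above, that 'price' is in Lakhs (1 Lakh = 100,000 INR).

```python
import pandas as pd

LAKH = 100_000  # 1 Lakh = 100,000 INR

# Two made-up listings: price in Lakhs, area in square feet
toy = pd.DataFrame({"price": [50.0, 120.0], "total_sqft": [1000.0, 1500.0]})

# Convert the price to INR and divide by the area
toy["price_per_sqft"] = toy["price"] * LAKH / toy["total_sqft"]

print(toy["price_per_sqft"].tolist())  # [5000.0, 8000.0]
```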

In [None]:
# Viewing the statistics of 'price_per_sqft' to understand anomalies and outliers better
price_stats = price['price_per_sqft'].describe()
price_stats

After glimpsing through the feature 'location', the following observation was made.

In [None]:
len(price.location.unique())

The feature 'location' had far too many unique values,
<br> and many of these values appeared only once.
<br> Since 'location' is a categorical feature, this could lead to a high dimensionality problem once it is encoded.

In [None]:
price.location = price.location.apply(lambda x: x.strip())
location_stats = price['location'].value_counts(ascending=False)
location_stats

In [None]:
# Finding the number of locations with more than 10 appearances
len(location_stats[location_stats>10])

In [None]:
# Finding the number of locations with 10 or fewer appearances
len(location_stats[location_stats<=10])

In [None]:
# Viewing the locations with 10 or fewer appearances and storing them
location_stats_less_than_10 = location_stats[location_stats<=10]
location_stats_less_than_10

The high dimensionality problem that could arise later can be avoided by renaming every location which appears 10 times or fewer as 'other', by applying a function to 'location'.

In [None]:
# 'x in <Series>' tests membership in the Series index, which holds the location names here
price.location = price.location.apply(lambda x: 'other' if x in location_stats_less_than_10 else x)
len(price.location.unique())
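The bucketing of rare locations can be sketched on a toy Series. The snippet below is an illustration under assumptions: the location counts are made up, the threshold is lowered to 2 so the effect is visible on six rows (the notebook uses 10), and it uses `isin` with `where` as an alternative to the `apply` call above.

```python
import pandas as pd

# Made-up locations: 'Hebbal' appears 3 times, 'Whitefield' twice, 'Yelahanka' once
locations = pd.Series(["Hebbal"] * 3 + ["Whitefield"] * 2 + ["Yelahanka"])

THRESHOLD = 2  # illustrative only; the notebook uses 10

counts = locations.value_counts()
rare = counts[counts <= THRESHOLD].index  # names of the rare locations

# Keep a value where it is not rare, otherwise replace it with 'other'
bucketed = locations.where(~locations.isin(rare), "other")

print(sorted(bucketed.unique()))  # ['Hebbal', 'other']
```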

After glimpsing through the feature 'bhk', the following observation was made.

For some rows the number of 'bhk' did not match the respective 'total_sqft' area, which could be counted as an anomaly.

In [None]:
price[price.total_sqft/price.bhk<300].head()

Using domain knowledge from the field of real estate or house pricing, it was found that
<br>a single bhk usually takes about 300 sqft of area.
<br>Hence the rows with a ratio of 'total_sqft' to 'bhk' of less than 300 could be removed.

In [None]:
price = price[~(price.total_sqft/price.bhk<300)]
price.shape
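A small sketch of this area-per-bhk rule on made-up rows is shown below. The 300 sqft threshold comes from the text above; the constant name and the toy values are assumptions for illustration.

```python
import pandas as pd

MIN_SQFT_PER_BHK = 300  # threshold taken from the text above

# Made-up rows: 400, 300 and 125 sqft per bhk respectively
toy = pd.DataFrame({"total_sqft": [1200.0, 600.0, 500.0], "bhk": [3, 2, 4]})

sqft_per_bhk = toy["total_sqft"] / toy["bhk"]

# Keep the rows that are NOT below the threshold, as in the notebook
kept = toy[~(sqft_per_bhk < MIN_SQFT_PER_BHK)]

print(kept.index.tolist())  # [0, 1]
```

A row with exactly 300 sqft per bhk is kept, because only ratios strictly below the threshold are removed.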

In [None]:
# Viewing the statistics of 'price_per_sqft' to check for anomalies or outliers
price.price_per_sqft.describe()

It was observed that the maximum and minimum values of 'price_per_sqft' did not match what is known from the domain of real estate or house pricing.

In [None]:
def remove_price_per_sqft_outliers(df):
    """Remove outliers in 'price_per_sqft' for each location by keeping only the rows whose value lies
    within one standard deviation of that location's mean, i.e. in the range (mean - std, mean + std].
    For normally distributed data about 68 % of the values fall within this range
    (99.7 % would fall within three standard deviations)."""
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        mean = np.mean(subdf.price_per_sqft)
        std = np.std(subdf.price_per_sqft)
        df_reduced = subdf[(subdf.price_per_sqft>(mean-std)) & (subdf.price_per_sqft<=(mean+std))]
        df_out = pd.concat([df_out,df_reduced],ignore_index=True)
    return df_out

In [None]:
price = remove_price_per_sqft_outliers(price)
price.shape
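The per-location filter can also be written without an explicit loop. The sketch below is one possible vectorized variant, not the notebook's implementation: it uses `groupby(...).transform` on made-up data and passes `ddof=0` so that the standard deviation matches `np.std`, which the function above uses.

```python
import pandas as pd

# Made-up data: each location has one obviously high price_per_sqft value
toy = pd.DataFrame({
    "location": ["A"] * 5 + ["B"] * 5,
    "price_per_sqft": [4000, 5000, 5000, 5000, 9000,
                       3000, 3100, 3200, 3300, 8000],
})

grouped = toy.groupby("location")["price_per_sqft"]
mean = grouped.transform("mean")                 # per-row mean of its location
std = grouped.transform(lambda s: s.std(ddof=0))  # population std, like np.std

# Keep the values inside (mean - std, mean + std] for their own location
mask = (toy["price_per_sqft"] > mean - std) & (toy["price_per_sqft"] <= mean + std)
filtered = toy[mask].reset_index(drop=True)

print(filtered["price_per_sqft"].tolist())
```

In this toy data only the values 9000 and 8000 fall outside one standard deviation of their location's mean, so eight rows remain.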

A further look at the feature 'bhk' revealed another anomaly.

The price of some houses with 2 'bhk' was higher than that of houses with 3 'bhk'; this could be counted as an anomaly.
<br> However, this could also depend on the feature 'location', since prime locations cost more than others.

In [None]:
def plot_scatter_chart(df,location):
    """Plot 'price' against 'total_sqft' for the 2 bhk and 3 bhk houses of a location as a scatter plot,
    to detect outliers by visualizing the data."""
    bhk2 = df[(df.location==location) & (df.bhk==2)]
    bhk3 = df[(df.location==location) & (df.bhk==3)]
    plt.scatter(bhk2.total_sqft,bhk2.price,color='cyan',label='2 BHK', s=50)
    plt.scatter(bhk3.total_sqft,bhk3.price,marker='x', color='lightgreen',label='3 BHK', s=50)
    plt.xlabel("Total Square Feet Area")
    plt.ylabel("Price (Lakh Indian Rupees)")
    plt.title(location)
    plt.legend()


In [None]:
# Viewing an example of outliers detected after glimpsing through the feature
plot_scatter_chart(price,"Rajaji Nagar")

In [None]:
# Viewing another example of outliers detected after glimpsing through the feature
plot_scatter_chart(price,"Hebbal")

In [None]:
def remove_bhk_outliers(df):
    """Remove outliers in the feature 'bhk': within each location, drop the houses whose 'price_per_sqft'
    is lower than the mean 'price_per_sqft' of the houses with one bhk fewer, provided that smaller-bhk
    group has more than 5 entries."""
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df.price_per_sqft),
                'std': np.std(bhk_df.price_per_sqft),
                'count': bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count']>5:
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft<(stats['mean'])].index.values)
    return df.drop(exclude_indices,axis='index')


In [None]:
price = remove_bhk_outliers(price)
price.shape
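The rule applied by `remove_bhk_outliers` can be illustrated for a single location. The following sketch uses made-up values and checks only the 2 bhk versus 3 bhk case: a 3 bhk row is flagged when its 'price_per_sqft' is below the mean of the 2 bhk rows and there are more than 5 such 2 bhk rows.

```python
import pandas as pd

# Made-up rows for one location: six 2 bhk houses and two 3 bhk houses
toy = pd.DataFrame({
    "bhk": [2, 2, 2, 2, 2, 2, 3, 3],
    "price_per_sqft": [5000, 5200, 4800, 5100, 4900, 5000, 4500, 6000],
})

# Mean and count of price_per_sqft for each bhk value
stats = toy.groupby("bhk")["price_per_sqft"].agg(["mean", "count"])
lower = stats.loc[2]  # statistics of the group with one bhk fewer than 3

is_3bhk = toy["bhk"] == 3
if lower["count"] > 5:
    # Flag 3 bhk rows that are cheaper per sqft than the average 2 bhk house
    flagged = toy[is_3bhk & (toy["price_per_sqft"] < lower["mean"])]
else:
    flagged = toy.iloc[0:0]  # too few 2 bhk rows to compare against

print(flagged.index.tolist())  # [6]
```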

In [None]:
plot_scatter_chart(price,"Rajaji Nagar")

In [None]:
plot_scatter_chart(price,"Hebbal")

Checking whether 'price_per_sqft' roughly follows a normal distribution.

In [None]:
plt.hist(price.price_per_sqft,rwidth=0.8,color='cyan')
plt.xlabel("Price Per Square Feet")
plt.ylabel("Count")

After glimpsing through the feature 'bath', some discrepancies were observed.

In [None]:
price.bath.unique()

An unusually high number of baths was noticed for some rows, which does not match the number of bhk.

In [None]:
plt.hist(price.bath,rwidth=0.8,color='cyan')
plt.xlabel("Number of bathrooms")
plt.ylabel("Count")

In [None]:
# Viewing the data with more than 10 baths, as 10 is an unusually high number of baths
price[price.bath>10]

These observed values could be anomalies or outliers.
<br> From the domain knowledge of real estate or house pricing, it was found that
<br> the number of baths would usually be equal to the number of 'bhk'.
<br> Removing the rows with 2 (as a buffer) more baths than the number of bhk could be used to remove these anomalies.

In [None]:
# Viewing the data with the above conditions
price[price.bath>price.bhk+2]

In [None]:
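# Note: this keeps only the rows with bath < bhk + 2, so rows with bath == bhk + 2 are removed as well
price = price[price.bath<price.bhk+2]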
price.shape
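The boundary behaviour of the bath filter is easy to miss, so here is a small sketch on made-up rows. It contrasts the rows shown by the viewing condition `bath > bhk + 2` with the rows kept by the filter `bath < bhk + 2`.

```python
import pandas as pd

# Made-up rows; row 2 sits exactly on the boundary (bath == bhk + 2)
toy = pd.DataFrame({"bhk": [2, 2, 2, 3], "bath": [2, 3, 4, 6]})

viewed = toy[toy["bath"] > toy["bhk"] + 2]  # rows shown as anomalies
kept = toy[toy["bath"] < toy["bhk"] + 2]    # rows that survive the filter

print(viewed.index.tolist())  # [3]
print(kept.index.tolist())    # [0, 1]
```

Row 2 is neither shown as an anomaly nor kept, which is the effect of using a strict inequality in both conditions.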

In [None]:
price.bath.unique()

In [None]:
# Dropping the features 'size' and 'price_per_sqft' as they are no longer of use
price = price.drop(['size','price_per_sqft'],axis='columns')
price.head(3)

* #### Converting categorical data into numerical data
    * One Hot Encoding

In [None]:
location_dummies = pd.get_dummies(price.location)
location_dummies.head()

In [None]:
# Dropping one dummy column ('other') to avoid the dummy variable trap
price = pd.concat([price,location_dummies.drop('other',axis='columns')],axis='columns')
price.head()
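As a minimal sketch of the encoding step, the snippet below one-hot encodes a toy 'location' column and drops the 'other' column. The three rows are made up; the point is that a row whose location is 'other' ends up with all location columns equal to zero, so no information is lost by dropping that column.

```python
import pandas as pd

# Made-up rows with three location values
toy = pd.DataFrame({"location": ["Hebbal", "other", "Whitefield"], "bhk": [2, 3, 2]})

dummies = pd.get_dummies(toy["location"])

# Drop 'other' from the dummies and 'location' from the original frame
encoded = pd.concat(
    [toy.drop(columns="location"), dummies.drop(columns="other")],
    axis="columns",
)

print(encoded.columns.tolist())  # ['bhk', 'Hebbal', 'Whitefield']
```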

In [None]:
# Dropping the feature 'location' as it is no longer of use
price = price.drop('location',axis='columns')
price.head(2)

In [None]:
price.shape

# Separating dependent variable from independent variables

In [None]:
x = price.drop(['price'],axis='columns')
y = price.price

In [None]:
x.shape

In [None]:
y.shape

# Train Test Split

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=10)
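To make the effect of `test_size=0.2` concrete, the sketch below splits a made-up array of 10 samples with the same arguments and prints the resulting shapes. The arrays are placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 20 % of the samples go to the test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```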

# Finding the best algorithm

In [None]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

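def best_model_using_gridsearchcv(x,y):
    """Run a grid search with 5 shuffle-split cross-validation rounds for each candidate algorithm and
    return the best score and best parameters of each algorithm as a DataFrame."""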
    algorithms = {
        'Linear_Regression' : {
            'model': LinearRegression(),
            'params': {
                'normalize': [True, False] # Note: 'normalize' was removed in scikit-learn 1.2
            }
        },
        'Lasso_Regression': {
            'model': Lasso(),
            'params': {
                'alpha': [1,2],
                'selection': ['random', 'cyclic']
            }
        },
        'Decision_Tree_Regressor': {
            'model': DecisionTreeRegressor(),
            'params': {
                'criterion' : ['mse','friedman_mse'], # Note: 'mse' was renamed to 'squared_error' in scikit-learn 1.0 and removed in 1.2
                'splitter': ['best','random']
            }
        }
    }
    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algorithm_name, config in algorithms.items():
        gs =  GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        gs.fit(x,y)
        scores.append({
            'Model': algorithm_name,
            'Best_Score': str(round(gs.best_score_*100,2)) + ' %',
            'Best_Parameters': gs.best_params_
        })

    return pd.DataFrame(scores,columns=['Model','Best_Score','Best_Parameters'])

In [None]:
best_model_stats = best_model_using_gridsearchcv(x,y)
best_model_stats

From the above observations, <br>
Linear Regression with the parameter 'normalize' set to False gave the best cross-validation score (R², the default score for regressors).
<br>Hence Linear Regression will be used to create the model.

# Model creation

In [None]:
house_price_estimator = LinearRegression()

# Training the Model

In [None]:
house_price_estimator.fit(x_train,y_train)

# Testing the Model

In [None]:
house_price_estimated = house_price_estimator.predict(x_test)

# Accuracy of the Model

In [None]:
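# For regressors, score() returns the coefficient of determination (R^2), reported here as the accuracy
score =  house_price_estimator.score(x_test,y_test)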
print("The Model estimated the prices of houses in the city of Bangalore with an Accuracy of {} % "
      .format(str(round(score*100,2))))
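The value returned by `score` for a regressor is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand with NumPy for four made-up price pairs, to show what the reported percentage measures; the numbers are illustrative and not taken from the model.

```python
import numpy as np

# Made-up true and predicted prices (in Lakhs)
y_true = np.array([50.0, 80.0, 120.0, 60.0])
y_pred = np.array([55.0, 75.0, 110.0, 70.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

r2 = 1 - ss_res / ss_tot

print(round(r2 * 100, 2), "%")
```

An R² of 1 means the predictions match the true values exactly, while 0 means the model does no better than always predicting the mean price.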

# Exporting the trained Model 

In [None]:
import pickle
with open('banglore_house_price_estimator.pickle','wb') as file:
    pickle.dump(house_price_estimator,file)

# Export Location and Column Information to JSON File

In [None]:
import json
columns = {
    'data_columns' : [col.lower() for col in x.columns]
}
with open("columns_information.json","w") as file:
    file.write(json.dumps(columns))
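The two exported files are enough to rebuild a prediction outside the notebook. The sketch below shows one way this could work, under stated assumptions: `estimate_price` is a hypothetical helper, and the model and column list are tiny in-memory stand-ins (made-up training rows, serialized with `pickle.dumps` and `json.dumps`) rather than the real exported files.

```python
import json
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-ins for the exported artifacts (made-up columns and training rows)
columns = ["total_sqft", "bath", "bhk", "hebbal", "whitefield"]
X = np.array([[1000, 2, 2, 1, 0], [1500, 3, 3, 0, 1], [1200, 2, 2, 0, 0],
              [1800, 3, 3, 1, 0], [900, 1, 1, 0, 1], [2000, 4, 4, 0, 0]],
             dtype=float)
y = np.array([60, 95, 55, 120, 50, 100], dtype=float)
model_bytes = pickle.dumps(LinearRegression().fit(X, y))
columns_json = json.dumps({"data_columns": columns})

# Loading step, mirroring what a consumer of the two files would do
model = pickle.loads(model_bytes)
data_columns = json.loads(columns_json)["data_columns"]

def estimate_price(location, total_sqft, bath, bhk):
    """Hypothetical helper: build one feature row and return the predicted price."""
    features = np.zeros(len(data_columns))
    features[data_columns.index("total_sqft")] = total_sqft
    features[data_columns.index("bath")] = bath
    features[data_columns.index("bhk")] = bhk
    loc = location.lower()  # column names were lower-cased on export
    if loc in data_columns:
        features[data_columns.index(loc)] = 1
    # An unknown location leaves all location columns at 0, i.e. 'other'
    return float(model.predict(features.reshape(1, -1))[0])

print(estimate_price("Hebbal", 1000, 2, 2))
print(estimate_price("Unknown Place", 1000, 2, 2))
```

Leaving every location column at zero for an unknown location matches the encoding used in the notebook, where the dropped 'other' column serves as the baseline.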