## Real Estate Price Prediction (Data Science Project)

The first step of every project is to import all the necessary libraries. Let's do it.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
%matplotlib inline

Great! Now let's get our data and load it into a variable.

In [2]:
data_location = 'data/Bengaluru_House_Data.csv'
data1 = pd.read_csv(data_location)

Awesome :) It's time to explore our data

In [3]:
data1.head(10)

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0
5,Super built-up Area,Ready To Move,Whitefield,2 BHK,DuenaTa,1170,2.0,1.0,38.0
6,Super built-up Area,18-May,Old Airport Road,4 BHK,Jaades,2732,4.0,,204.0
7,Super built-up Area,Ready To Move,Rajaji Nagar,4 BHK,Brway G,3300,4.0,,600.0
8,Super built-up Area,Ready To Move,Marathahalli,3 BHK,,1310,3.0,1.0,63.25
9,Plot Area,Ready To Move,Gandhi Bazar,6 Bedroom,,1020,6.0,,370.0


In [4]:
data1.tail(10)

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
13310,Super built-up Area,Ready To Move,Rachenahalli,2 BHK,,1050,2.0,2.0,52.71
13311,Plot Area,Ready To Move,Ramamurthy Nagar,7 Bedroom,,1500,9.0,2.0,250.0
13312,Super built-up Area,Ready To Move,Bellandur,2 BHK,,1262,2.0,2.0,47.0
13313,Super built-up Area,Ready To Move,Uttarahalli,3 BHK,Aklia R,1345,2.0,1.0,57.0
13314,Super built-up Area,Ready To Move,Green Glen Layout,3 BHK,SoosePr,1715,3.0,3.0,112.0
13315,Built-up Area,Ready To Move,Whitefield,5 Bedroom,ArsiaEx,3453,4.0,0.0,231.0
13316,Super built-up Area,Ready To Move,Richards Town,4 BHK,,3600,5.0,,400.0
13317,Built-up Area,Ready To Move,Raja Rajeshwari Nagar,2 BHK,Mahla T,1141,2.0,1.0,60.0
13318,Super built-up Area,18-Jun,Padmanabhanagar,4 BHK,SollyCl,4689,4.0,1.0,488.0
13319,Super built-up Area,Ready To Move,Doddathoguru,1 BHK,,550,1.0,1.0,17.0


In [5]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


In [6]:
data1.describe()

Unnamed: 0,bath,balcony,price
count,13247.0,12711.0,13320.0
mean,2.69261,1.584376,112.565627
std,1.341458,0.817263,148.971674
min,1.0,0.0,8.0
25%,2.0,1.0,50.0
50%,2.0,2.0,72.0
75%,3.0,2.0,120.0
max,40.0,3.0,3600.0


What have we inferred from the cells above?
1. Shape of our data: `(13320, 9)`
2. We do have missing values. In the `society` column, approximately `41%` of the data is missing (7818 non-null values out of 13320).
3. We do have some unrealistic records, like `40 bathrooms`. Are you serious? 40 bathrooms? Is that a hotel?
4. We have untidy data, such as the `size` and `availability` columns. Also, the `total_sqft` column should be numeric, but it is stored as object in the dataset.

After these observations, we're pretty sure that we need to do a lot of work to clean & manipulate the data. Inspecting the data like this is what we call `Exploratory Data Analysis` or simply `EDA`.
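
The share of missing values per column can also be computed directly instead of being read off the `info()` output. Here is a minimal sketch on a tiny stand-in frame (the values are hypothetical; in the notebook the same expression would be applied to `data1`):

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values); in the notebook this would be data1.
sample = pd.DataFrame({
    "society": ["Coomee", None, None, "Soiewre"],
    "balcony": [1.0, 3.0, None, 1.0],
    "price": [39.07, 120.0, 62.0, 95.0],
})

# isna() marks missing cells; the column mean of those booleans
# is the fraction of missing values, so times 100 gives a percentage.
missing_pct = sample.isna().mean().mul(100).round(1)
print(missing_pct)
```

Sorting the result with `sort_values(ascending=False)` puts the worst columns first.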

In [7]:
data1['area_type'].value_counts()

Super built-up  Area    8790
Built-up  Area          2418
Plot  Area              2025
Carpet  Area              87
Name: area_type, dtype: int64

In [8]:
data2 = data1.drop(['area_type', 'availability', 'society', 'bath'], axis=1)

In [9]:
data2

Unnamed: 0,location,size,total_sqft,balcony,price
0,Electronic City Phase II,2 BHK,1056,1.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,3.0,120.00
2,Uttarahalli,3 BHK,1440,3.0,62.00
3,Lingadheeranahalli,3 BHK,1521,1.0,95.00
4,Kothanur,2 BHK,1200,1.0,51.00
...,...,...,...,...,...
13315,Whitefield,5 Bedroom,3453,0.0,231.00
13316,Richards Town,4 BHK,3600,,400.00
13317,Raja Rajeshwari Nagar,2 BHK,1141,1.0,60.00
13318,Padmanabhanagar,4 BHK,4689,1.0,488.00


In [10]:
data2.isna().sum()

location        1
size           16
total_sqft      0
balcony       609
price           0
dtype: int64

We can't reliably estimate the size of a home or its number of balconies from the other columns (they are largely independent of them), and there aren't many missing values, so it's better to drop these rows than to estimate them.

In [11]:
data2.shape

(13320, 5)

In [12]:
data2.dropna(inplace=True)

In [13]:
data2.isna().sum()

location      0
size          0
total_sqft    0
balcony       0
price         0
dtype: int64

In [14]:
data2.shape

(12710, 5)

Now we have 12710 records.

In [15]:
data2

Unnamed: 0,location,size,total_sqft,balcony,price
0,Electronic City Phase II,2 BHK,1056,1.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,3.0,120.00
2,Uttarahalli,3 BHK,1440,3.0,62.00
3,Lingadheeranahalli,3 BHK,1521,1.0,95.00
4,Kothanur,2 BHK,1200,1.0,51.00
...,...,...,...,...,...
13314,Green Glen Layout,3 BHK,1715,3.0,112.00
13315,Whitefield,5 Bedroom,3453,0.0,231.00
13317,Raja Rajeshwari Nagar,2 BHK,1141,1.0,60.00
13318,Padmanabhanagar,4 BHK,4689,1.0,488.00


Let's work on the `size` column. As we can see, it is a little bit untidy: the same information appears as `BHK` or `Bedroom`. So we'll keep just the leading number.

In [16]:
data2['size'] = data2['size'].apply(lambda x: int(x.split()[0]))
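
The `split()` approach relies on every value starting with a number and on missing values having been dropped already. As an alternative sketch (not what this notebook does), a regular expression can pull out the leading digits and leave anything unexpected as `NaN`. The sample values below are hypothetical:

```python
import pandas as pd

# Hypothetical sample in the same style as the `size` column.
sizes = pd.Series(["2 BHK", "4 Bedroom", "1 RK", None])

# Capture the leading digits; rows without a match (or missing) stay NaN.
rooms = sizes.str.extract(r"^(\d+)", expand=False).astype("float")
print(rooms.tolist())
```

The result is float because of the `NaN`; once those rows are handled it can be cast to `int`.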

In [17]:
data2

Unnamed: 0,location,size,total_sqft,balcony,price
0,Electronic City Phase II,2,1056,1.0,39.07
1,Chikka Tirupathi,4,2600,3.0,120.00
2,Uttarahalli,3,1440,3.0,62.00
3,Lingadheeranahalli,3,1521,1.0,95.00
4,Kothanur,2,1200,1.0,51.00
...,...,...,...,...,...
13314,Green Glen Layout,3,1715,3.0,112.00
13315,Whitefield,5,3453,0.0,231.00
13317,Raja Rajeshwari Nagar,2,1141,1.0,60.00
13318,Padmanabhanagar,4,4689,1.0,488.00


In [18]:
data2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12710 entries, 0 to 13319
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   location    12710 non-null  object 
 1   size        12710 non-null  int64  
 2   total_sqft  12710 non-null  object 
 3   balcony     12710 non-null  float64
 4   price       12710 non-null  float64
dtypes: float64(2), int64(1), object(2)
memory usage: 595.8+ KB


That's great. We kept only the number from each object and converted it to an integer.
* Now we need to convert our `total_sqft` column to a numeric datatype.
* For this, we'll create a function `object_to_float` that tries to convert a value to float and returns `True` if it succeeds.
* If a value can't be converted to float, it returns `False`. We will save these boolean values in a new column `bool`.

In [19]:
# Check whether a 'total_sqft' value can be converted to float

def object_to_float(value):
    try:
        float(value)
    except (TypeError, ValueError):
        return False
    return True

In [20]:
data3 = data2.copy()
data3['bool'] = data3['total_sqft'].apply(object_to_float)

These are the records that can't be converted to float. As you can see, the values shown are ranges such as `2100 - 2850`, so we'll take the midpoint of each range.

In [21]:
data3[~ data3['bool']]

Unnamed: 0,location,size,total_sqft,balcony,price,bool
30,Yelahanka,4,2100 - 2850,0.0,186.000,False
122,Hebbal,4,3067 - 8156,0.0,477.000,False
137,8th Phase JP Nagar,2,1042 - 1105,0.0,54.005,False
165,Sarjapur,2,1145 - 1340,0.0,43.490,False
188,KR Puram,2,1015 - 1540,0.0,56.800,False
...,...,...,...,...,...,...
12975,Whitefield,2,850 - 1060,0.0,38.190,False
12990,Talaghattapura,3,1804 - 2273,0.0,122.000,False
13059,Harlur,2,1200 - 1470,0.0,72.760,False
13265,Hoodi,2,1133 - 1384,0.0,59.135,False


Function to get mid-values

In [22]:
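def floatConversion(x):
    # Ranges such as '2100 - 2850' become their midpoint; plain numbers become floats
    values = x.split('-')
    try:
        if len(values) == 2:
            return (float(values[0]) + float(values[1])) / 2
        return float(x)
    except ValueError:
        # Anything else cannot be parsed
        return None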
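
To see what the midpoint logic does on individual values, here is a small self-contained sketch. The helper name `range_midpoint` is made up for this illustration, and the last input is a hypothetical unparseable value:

```python
def range_midpoint(text):
    """Midpoint of 'a - b', float value of a plain number, otherwise None."""
    parts = text.split("-")
    try:
        if len(parts) == 2:
            return (float(parts[0]) + float(parts[1])) / 2
        return float(text)
    except ValueError:
        return None

print(range_midpoint("2100 - 2850"))     # 2475.0
print(range_midpoint("1056"))            # 1056.0
print(range_midpoint("34.46Sq. Meter"))  # None (hypothetical input)
```

`float()` ignores surrounding whitespace, which is why `"2100 "` and `" 2850"` parse without an explicit `strip()`.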

In [23]:
data3['area'] = data3['total_sqft'].apply(floatConversion)

In [24]:
data3.isna().sum()

location       0
size           0
total_sqft     0
balcony        0
price          0
bool           0
area          42
dtype: int64

We still have 42 values that are neither plain numbers nor ranges, so they can't be converted to float. Let's drop them.

In [25]:
data3.dropna(inplace=True)

In [26]:
data3.isna().sum()

location      0
size          0
total_sqft    0
balcony       0
price         0
bool          0
area          0
dtype: int64

In [27]:
data3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12668 entries, 0 to 13319
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   location    12668 non-null  object 
 1   size        12668 non-null  int64  
 2   total_sqft  12668 non-null  object 
 3   balcony     12668 non-null  float64
 4   price       12668 non-null  float64
 5   bool        12668 non-null  bool   
 6   area        12668 non-null  float64
dtypes: bool(1), float64(3), int64(1), object(2)
memory usage: 705.2+ KB


In [28]:
data4 = data3.drop(['total_sqft', 'bool'], axis=1)

In [29]:
data4

Unnamed: 0,location,size,balcony,price,area
0,Electronic City Phase II,2,1.0,39.07,1056.0
1,Chikka Tirupathi,4,3.0,120.00,2600.0
2,Uttarahalli,3,3.0,62.00,1440.0
3,Lingadheeranahalli,3,1.0,95.00,1521.0
4,Kothanur,2,1.0,51.00,1200.0
...,...,...,...,...,...
13314,Green Glen Layout,3,3.0,112.00,1715.0
13315,Whitefield,5,0.0,231.00,3453.0
13317,Raja Rajeshwari Nagar,2,1.0,60.00,1141.0
13318,Padmanabhanagar,4,1.0,488.00,4689.0


In [30]:
len(data4.location.unique())

1259

We have 1259 unique locations. Encoding that many categories would be very cumbersome for our model, because each one becomes its own column. So we'll rename all locations that have 10 or fewer datapoints as `Others`.

In [31]:
data5 = data4.copy()
data5.location = data5.location.apply(lambda x: x.strip())

In [32]:
locations = data5['location'].value_counts()
unnecessary_locations = locations[locations <= 10]
unnecessary_locations

1st Block Koramangala    10
Gunjur Palya             10
Kalkere                  10
Nagappa Reddy Layout     10
Dairy Circle             10
                         ..
Subbannaiah Palya         1
whitefiled                1
Medi Agrahara             1
Sadduguntepalya           1
Abshot Layout             1
Name: location, Length: 1013, dtype: int64

In [33]:
data5['location'] = data5.location.apply(lambda x: 'Others' if x in unnecessary_locations else x)
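
The lambda above works because `x in some_series` tests the Series index, which here holds the location names. A vectorized variant of the same idea can be sketched with `isin` and `where`. The locations and the threshold of 1 below are chosen only to make the effect visible on a tiny sample:

```python
import pandas as pd

# Hypothetical sample; threshold lowered to 1 so the effect shows up.
loc = pd.Series(["Whitefield", "Whitefield", "Hebbal", "Hebbal", "Kalkere"])

counts = loc.value_counts()
rare = counts[counts <= 1].index  # labels to collapse into 'Others'

# Keep a value where it is NOT rare; put 'Others' everywhere else.
grouped = loc.where(~loc.isin(rare), "Others")
print(grouped.tolist())
# ['Whitefield', 'Whitefield', 'Hebbal', 'Hebbal', 'Others']
```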

In [34]:
data5[data5.location == 'Others']

Unnamed: 0,location,size,balcony,price,area
18,Others,3,2.0,290.00,2770.0
19,Others,2,2.0,48.00,1100.0
25,Others,3,2.0,56.00,1250.0
42,Others,1,0.0,38.00,600.0
49,Others,2,1.0,36.00,869.0
...,...,...,...,...,...
13278,Others,2,1.0,65.00,1256.0
13285,Others,2,2.0,110.00,1353.0
13291,Others,1,0.0,26.00,812.0
13292,Others,3,2.0,63.93,1440.0


In [35]:
len(data5.location.unique())

236

That's great! We now have only `236` unique values instead of 1259. That would be a lot easier for our model to encode

In [36]:
data5['size'].unique()

array([ 2,  4,  3,  1,  6,  8,  7,  5, 11,  9, 27, 43, 14, 12, 10, 13])

We have datapoints with 27 and 43 rooms. Doesn't that sound like a hotel? So we will remove datapoints with 11 or more bedrooms.

In [37]:
data6 = data5.copy()
data6 = data6[data6['size'] < 11]

In [38]:
data6['size'].unique()

array([ 2,  4,  3,  1,  6,  8,  7,  5,  9, 10])

In [39]:
data6.shape

(12660, 5)

In [40]:
data6

Unnamed: 0,location,size,balcony,price,area
0,Electronic City Phase II,2,1.0,39.07,1056.0
1,Chikka Tirupathi,4,3.0,120.00,2600.0
2,Uttarahalli,3,3.0,62.00,1440.0
3,Lingadheeranahalli,3,1.0,95.00,1521.0
4,Kothanur,2,1.0,51.00,1200.0
...,...,...,...,...,...
13314,Green Glen Layout,3,3.0,112.00,1715.0
13315,Whitefield,5,0.0,231.00,3453.0
13317,Raja Rajeshwari Nagar,2,1.0,60.00,1141.0
13318,Padmanabhanagar,4,1.0,488.00,4689.0


Let's look at the average price of a home per sqft. For this, we'll create a new column `price_per_sqft` equal to `price * 100000 / area`. The factor of 100000 converts `price` to rupees, assuming it is recorded in lakhs.

In [41]:
data6['price_per_sqft'] = (data6['price']*100000)/data6['area']
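
As a quick sanity check of the arithmetic, here is the calculation for the first row of the dataset (price `39.07`, area `1056`). The constant `LAKH` is a name introduced for this sketch, and reading `price` as lakhs of rupees is an assumption based on the multiplier used in the code:

```python
LAKH = 100_000  # rupees per lakh (assumed unit of the price column)

price_lakh = 39.07   # first row of the dataset
area_sqft = 1056.0

price_per_sqft = price_lakh * LAKH / area_sqft
print(round(price_per_sqft, 2))  # 3699.81
```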

In [42]:
data6

Unnamed: 0,location,size,balcony,price,area,price_per_sqft
0,Electronic City Phase II,2,1.0,39.07,1056.0,3699.810606
1,Chikka Tirupathi,4,3.0,120.00,2600.0,4615.384615
2,Uttarahalli,3,3.0,62.00,1440.0,4305.555556
3,Lingadheeranahalli,3,1.0,95.00,1521.0,6245.890861
4,Kothanur,2,1.0,51.00,1200.0,4250.000000
...,...,...,...,...,...,...
13314,Green Glen Layout,3,3.0,112.00,1715.0,6530.612245
13315,Whitefield,5,0.0,231.00,3453.0,6689.834926
13317,Raja Rajeshwari Nagar,2,1.0,60.00,1141.0,5258.545136
13318,Padmanabhanagar,4,1.0,488.00,4689.0,10407.336319


In [43]:
data6['price_per_sqft'].describe()

count    1.266000e+04
mean     6.873296e+03
std      2.263968e+04
min      2.678298e+02
25%      4.242424e+03
50%      5.375608e+03
75%      7.142857e+03
max      2.300000e+06
Name: price_per_sqft, dtype: float64

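We'll flag outliers with the interquartile range (IQR) rule: with `IQR = Q3 - Q1`, any value below `Q1 - 1.5*IQR` or above `Q3 + 1.5*IQR` counts as an outlier. Only the upper bound is applied in the cells that follow.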

In [44]:
Q1 = data6['price_per_sqft'].quantile(0.25)
Q3 = data6['price_per_sqft'].quantile(0.75)
IQR = Q3-Q1
Out1 = Q1-(1.5*IQR)
Out2 = Q3+(1.5*IQR)

In [45]:
data6 = data6[~(data6['price_per_sqft'] > Out2)]
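
Plugging the quartiles reported by `describe()` into the IQR rule shows why filtering on the upper bound alone is enough here: the lower bound comes out negative, and the smallest `price_per_sqft` is about 267.8. The variable names below are chosen for this sketch:

```python
# Quartiles of price_per_sqft as reported by describe().
q1 = 4242.424
q3 = 7142.857

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(round(lower_fence, 1), round(upper_fence, 1))  # -108.2 11493.5
```

So any home priced above roughly 11,500 per sqft is treated as an outlier and removed.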

In [46]:
data6.shape

(11475, 6)

That's great. From `12660` to `11475` rows. Outliers of this kind would distort what our model learns.

Now, a home usually has at most 1 or 2 more balconies than bedrooms. Let's remove the datapoints that have more than `size + 2` balconies.

In [47]:
data6 = data6[(data6['balcony'] <= data6['size'] + 2)]

In [48]:
data6.shape

(11475, 6)

Ohkay! We don't have such datapoints.

One more thing we should do. We'll treat 300 sqft per bedroom as the minimum plausible area for a home in India, so a 4 BHK home should not be smaller than 1200 sqft (4 × 300). Let's see whether we have such data or not.

In [49]:
data7 = data6.copy()
data7 = data7[~(data7['area']/data7['size'] < 300)]
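
The same area-per-bedroom check on a few rows makes the rule concrete. The rows are hypothetical, loosely modelled on the dataset, and `MIN_SQFT_PER_ROOM` is a name introduced for this sketch:

```python
import pandas as pd

# Hypothetical rows: the second packs 6 bedrooms into 1020 sqft (170 sqft each).
homes = pd.DataFrame({"size": [2, 6, 4], "area": [1056.0, 1020.0, 1200.0]})

MIN_SQFT_PER_ROOM = 300  # same threshold as in the cell above
area_per_room = homes["area"] / homes["size"]

# Keep rows with at least 300 sqft per bedroom; exactly 300 is kept.
kept = homes[area_per_room >= MIN_SQFT_PER_ROOM]
print(kept)
```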

In [50]:
data7.shape

(11199, 6)

Wohoo! We had that type of data. But don't worry folks, we removed them :)

In [51]:
data7.columns

Index(['location', 'size', 'balcony', 'price', 'area', 'price_per_sqft'], dtype='object')

In [52]:
newColumnPosition = ['location', 'size', 'balcony', 'area', 'price_per_sqft', 'price']
data7 = data7[newColumnPosition]
data7.drop('price_per_sqft', axis=1, inplace=True)

In [53]:
data7

Unnamed: 0,location,size,balcony,area,price
0,Electronic City Phase II,2,1.0,1056.0,39.07
1,Chikka Tirupathi,4,3.0,2600.0,120.00
2,Uttarahalli,3,3.0,1440.0,62.00
3,Lingadheeranahalli,3,1.0,1521.0,95.00
4,Kothanur,2,1.0,1200.0,51.00
...,...,...,...,...,...
13314,Green Glen Layout,3,3.0,1715.0,112.00
13315,Whitefield,5,0.0,3453.0,231.00
13317,Raja Rajeshwari Nagar,2,1.0,1141.0,60.00
13318,Padmanabhanagar,4,1.0,4689.0,488.00


In [54]:
data7.isna().sum()

location    0
size        0
balcony     0
area        0
price       0
dtype: int64

In [55]:
data7.shape

(11199, 5)

In [56]:
data7.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11199 entries, 0 to 13319
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   location  11199 non-null  object 
 1   size      11199 non-null  int64  
 2   balcony   11199 non-null  float64
 3   area      11199 non-null  float64
 4   price     11199 non-null  float64
dtypes: float64(3), int64(1), object(1)
memory usage: 525.0+ KB


It's time to do some Machine Learning stuff.
* First we'll encode categorical features to numerical data.
* Then we'll split our data into train and test sets.
* Finally we'll train our model.

In [57]:
X = data7.drop('price', axis=1)
y = data7['price']

In [86]:
# Turn categorical features (location) into numerical data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_feature = ['location']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot', one_hot, categorical_feature)], remainder='passthrough')

transformed_X = transformer.fit_transform(X)

In [85]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2, random_state=10)

model1 = RandomForestRegressor()

model1.fit(X_train, y_train)
model1.score(X_test, y_test)

0.7682037117048978
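
The score above is the R² on the test set, and because `RandomForestRegressor()` is created without a `random_state`, it will vary a little between runs. As a hedged alternative (not the approach used in this notebook), the encoder and the model can be wrapped in a scikit-learn `Pipeline`, so the encoder is fitted on the training rows only and the run is reproducible. The data below is synthetic and only mimics the column names of `data7`:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for data7 (made-up values, same column names).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "location": rng.choice(["Whitefield", "Hebbal", "Others"], size=n),
    "size": rng.integers(1, 5, size=n),
    "balcony": rng.integers(0, 3, size=n).astype(float),
})
toy["area"] = toy["size"] * 500.0 + rng.normal(0, 50, size=n)
toy["price"] = toy["area"] * 0.05 + rng.normal(0, 5, size=n)

X = toy.drop("price", axis=1)
y = toy["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

# handle_unknown='ignore' encodes locations unseen during fit as all zeros.
encode = ColumnTransformer(
    [("one_hot", OneHotEncoder(handle_unknown="ignore"), ["location"])],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("encode", encode),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=10)),
])

pipeline.fit(X_train, y_train)  # encoder and forest both see training rows only
r2 = pipeline.score(X_test, y_test)
print(f"R^2 on held-out rows: {r2:.3f}")
```

The fitted `pipeline` can then be called with `pipeline.predict(new_rows)` on a raw DataFrame, without encoding it by hand first.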