### Data Science Regression Project: 
<b><font color='green'>Predicting Home Prices in Bangalore</font></b>

This data science project series walks step by step through building a real estate price prediction website. First, we build a model with sklearn and linear regression using the Bangalore home prices dataset from kaggle.com. Second, we write a Python Flask server that uses the saved model to serve HTTP requests. The third component is a website built in HTML, CSS and JavaScript that lets the user enter the home's square-foot area, number of bedrooms, etc., and calls the Python Flask server to retrieve the predicted price. During model building we cover many core data science concepts, such as data loading and cleaning, outlier detection and removal, feature engineering, dimensionality reduction, GridSearchCV for hyperparameter tuning, and k-fold cross-validation. In terms of technology and tools, this project covers:

+ Python
+ Numpy and Pandas for data cleaning
+ Matplotlib for data visualization
+ Sklearn for model building
+ Jupyter Notebook, Visual Studio Code and PyCharm as IDEs
+ Python Flask for the HTTP server
+ HTML/CSS/JavaScript for the UI

<a href="https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data">Bengaluru_House_Data.csv</a>

What are the things that a potential home buyer considers before purchasing a house? The location, the size of the property, vicinity to offices, schools, parks, restaurants, hospitals or the stereotypical white picket fence? What about the most important factor — the price?

Now with the lingering impact of demonetization, the enforcement of the Real Estate (Regulation and Development) Act (RERA), and the lack of trust in property developers in the city, housing units sold

<b>Import Libraries:</b>

In [1]:
# for data manipulation 
import pandas as pd

# numerical computing
import numpy as np

# plotting libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# chart figsize
plt.rcParams["figure.figsize"] = (20,10)

<b>Load Data:</b>

In [2]:
# reading csv file:
data = pd.read_csv('Bengaluru_House_Data.csv')

# printing first five rows of dataset:
data.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [3]:
# shape of dataset:
data.shape

(13320, 9)

<b>Dropping Columns:</b>
To keep the project as simple as possible, we drop a few columns that we will not use in the model.
Columns selected for dropping: area_type, availability, society, balcony

In [4]:
data2 = data.drop(['area_type' , 'availability' , 'society' , 'balcony'],axis=1)
data2.head()

Unnamed: 0,location,size,total_sqft,bath,price
0,Electronic City Phase II,2 BHK,1056,2.0,39.07
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0
2,Uttarahalli,3 BHK,1440,2.0,62.0
3,Lingadheeranahalli,3 BHK,1521,3.0,95.0
4,Kothanur,2 BHK,1200,2.0,51.0


<b>Null Values:</b>
In this part we find the null values in each column and decide how to handle them. When the dataset is large and only a few values are missing, we can simply drop the affected rows. When the dataset is small or a large share of its values is missing, dropping rows would discard too much data, so instead we fill the gaps with a statistic such as the mean, median, or mode.
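
As a rough illustration of the fill option, the sketch below imputes a numeric column with its median and a categorical column with its mode. The small DataFrame `toy` and its values are made up for the example; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "size": ["2 BHK", None, "3 BHK", "2 BHK"],
    "bath": [2.0, 3.0, None, 2.0],
})

filled = toy.copy()
# Numeric column: fill missing entries with the median of the observed values.
filled["bath"] = filled["bath"].fillna(filled["bath"].median())
# Categorical column: fill missing entries with the most frequent value (mode).
filled["size"] = filled["size"].fillna(filled["size"].mode()[0])

print(filled)
```

Whether the median, mean, or mode is the better choice depends on the column; this notebook takes the simpler route and drops the rows.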

In [5]:
# checking null values in each column:
data2.isnull().sum()

location       1
size          16
total_sqft     0
bath          73
price          0
dtype: int64

In this scenario we have a large dataset with 13320 rows and only a few missing values, so we can safely drop those rows.

In [6]:
# drop rows with missing values:
data3 = data2.dropna()

# checking null values in each column:
data3.isnull().sum()

location      0
size          0
total_sqft    0
bath          0
price         0
dtype: int64

We have successfully dropped all missing values from our dataset. Now compare the shape of "data" before dropping missing values with the shape of "data3" after dropping them.

In [7]:
print("Before dropping missing values: ",data.shape)
print("After dropping missing values: ",data3.shape)
print("Difference: ", data.shape[0] - data3.shape[0])

Before dropping missing values:  (13320, 9)
After dropping missing values:  (13246, 5)
Difference:  74


<b>Explore Size Column:</b>

In [8]:
# every unique value in the size column:
data['size'].unique()

array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
       '1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
       '7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
       '9 BHK', nan, '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
       '10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
       '12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)

In the size column, values such as "2 BHK" and "4 Bedroom" use different labels for the same thing: "2 BHK" means 2 bedrooms and "4 Bedroom" means 4 bedrooms. So first we create a new column "BHK" that holds the number of bedrooms.

In [9]:
# create a new column "BHK" holding the bedroom count extracted from the size column.
data3['BHK'] = data3['size'].apply(lambda x: int(x.split(' ')[0]))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data3['BHK'] = data3['size'].apply(lambda x: int(x.split(' ')[0]))
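
The warning above is pandas' SettingWithCopyWarning: `data3` came out of `dropna()`, and pandas flags it as a possible copy of a slice when a column is assigned to it. One common way to avoid the warning, sketched below on a made-up frame, is to take an explicit `.copy()` before adding the column. The names `raw` and `clean` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the dataset; values are invented.
raw = pd.DataFrame({"size": ["2 BHK", "4 Bedroom", None, "1 RK"]})

# An explicit copy makes `clean` an independent DataFrame,
# so adding a column to it does not trigger SettingWithCopyWarning.
clean = raw.dropna().copy()

# The leading token of each label is the bedroom count.
clean["BHK"] = clean["size"].str.split(" ").str[0].astype(int)

print(clean)
```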


In [10]:
data3.head()
# we added a new column named "BHK"

Unnamed: 0,location,size,total_sqft,bath,price,BHK
0,Electronic City Phase II,2 BHK,1056,2.0,39.07,2
1,Chikka Tirupathi,4 Bedroom,2600,5.0,120.0,4
2,Uttarahalli,3 BHK,1440,2.0,62.0,3
3,Lingadheeranahalli,3 BHK,1521,3.0,95.0,3
4,Kothanur,2 BHK,1200,2.0,51.0,2


In [11]:
# now check the unique values in the BHK column:
data3.BHK.unique()

array([ 2,  4,  3,  6,  1,  8,  7,  5, 11,  9, 27, 10, 19, 16, 43, 14, 12,
       13, 18], dtype=int64)

In [12]:
# check whether any home in this dataset has more than 20 bedrooms:
data3[data3['BHK']>20]

Unnamed: 0,location,size,total_sqft,bath,price,BHK
1718,2Electronic City Phase II,27 BHK,8000,27.0,230.0,27
4684,Munnekollal,43 Bedroom,2400,40.0,660.0,43


Here we find two homes, with 27 and 43 bedrooms. Looking at the total_sqft column, the 27-bedroom home has 8000 sqft while the 43-bedroom home has only 2400 sqft, which is implausible, so there appears to be an error in the data for that row.

<b>Explore total_sqft column:</b>

In [13]:
data3.total_sqft.unique()

array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)

We have to convert each range into a single number, for example its average. A value such as '1133 - 1384' is replaced by one number so the column can be used by a machine learning model.

First we check whether each value can be parsed as a float. We define a function that returns True if it can and False if it cannot.

In [14]:
def in_float(x):
    # return True if x can be parsed as a float, otherwise False
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

In [15]:
data3[~data3['total_sqft'].apply(in_float)].head(10)

Unnamed: 0,location,size,total_sqft,bath,price,BHK
30,Yelahanka,4 BHK,2100 - 2850,4.0,186.0,4
122,Hebbal,4 BHK,3067 - 8156,4.0,477.0,4
137,8th Phase JP Nagar,2 BHK,1042 - 1105,2.0,54.005,2
165,Sarjapur,2 BHK,1145 - 1340,2.0,43.49,2
188,KR Puram,2 BHK,1015 - 1540,2.0,56.8,2
410,Kengeri,1 BHK,34.46Sq. Meter,1.0,18.5,1
549,Hennur Road,2 BHK,1195 - 1440,2.0,63.77,2
648,Arekere,9 Bedroom,4125Perch,9.0,265.0,9
661,Yelahanka,2 BHK,1120 - 1145,2.0,48.13,2
672,Bettahalsoor,4 Bedroom,3090 - 5002,4.0,445.0,4
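
Another way to spot the same non-numeric entries, shown here only as a sketch, is `pd.to_numeric` with `errors="coerce"`, which turns anything it cannot parse into `NaN`. The sample values are taken from the tables above; the variable names are hypothetical.

```python
import pandas as pd

sample = pd.Series(["1056", "2100 - 2850", "34.46Sq. Meter", "4125Perch", "1200"])

# Values that cannot be parsed as numbers become NaN.
parsed = pd.to_numeric(sample, errors="coerce")

# Boolean mask of the entries that need special handling.
not_numeric = parsed.isna()
print(sample[not_numeric].tolist())
```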


Now we create a function that takes one argument and converts it to a float if it is a single value. If it is a range such as 1300 - 1600, the function returns the average of the two bounds. Any other input, such as 4125Perch, is ignored and returned as None.

In [16]:
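def conver_sqt_into_no(x):
    tokens = x.split('-')
    if len(tokens)==2:
        # average of the two bounds of a range such as '1133 - 1384'
        return (float(tokens[0]) + float(tokens[1])) / 2
    try:
        return float(x)
    except ValueError:
        # values with units, such as '4125Perch', cannot be converted
        return None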
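
To make the intended behaviour concrete, here is a small self-contained sketch of a range-to-midpoint converter run on values taken from the table above. The helper name `range_midpoint` is hypothetical and is not the notebook's own function.

```python
def range_midpoint(text):
    """Return a float for '1200', the midpoint for '1133 - 1384', else None."""
    tokens = text.split("-")
    if len(tokens) == 2:
        try:
            low, high = float(tokens[0]), float(tokens[1])
        except ValueError:
            return None
        # Parentheses matter: average the two bounds, not low + high / 2.
        return (low + high) / 2
    try:
        return float(text)
    except ValueError:
        return None


for value in ["1056", "1133 - 1384", "34.46Sq. Meter", "4125Perch"]:
    print(value, "->", range_midpoint(value))
```

For '1133 - 1384' the midpoint is (1133 + 1384) / 2 = 1258.5, while the two entries with units come back as None.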

In [17]:
data4 = data3.copy()
data4['total_sqft'] = data4['total_sqft'].apply(conver_sqt_into_no)
data4.head(10)

Unnamed: 0,location,size,total_sqft,bath,price,BHK
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.0,4
2,Uttarahalli,3 BHK,1440.0,2.0,62.0,3
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.0,3
4,Kothanur,2 BHK,1200.0,2.0,51.0,2
5,Whitefield,2 BHK,1170.0,2.0,38.0,2
6,Old Airport Road,4 BHK,2732.0,4.0,204.0,4
7,Rajaji Nagar,4 BHK,3300.0,4.0,600.0,4
8,Marathahalli,3 BHK,1310.0,3.0,63.25,3
9,Gandhi Bazar,6 Bedroom,1020.0,6.0,370.0,6


# Feature Engineering:

Copying data4 into data5:

In [18]:
data5 = data4.copy()

creating new features from existing features:

In [19]:
# price is in lakhs (1 lakh = 100000), so multiply the price column by 100000:
data5['Price_per_sqft'] = data5['price']*100000 / data5['total_sqft']
data5.head(5)

Unnamed: 0,location,size,total_sqft,bath,price,BHK,Price_per_sqft
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2,3699.810606
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.0,4,4615.384615
2,Uttarahalli,3 BHK,1440.0,2.0,62.0,3,4305.555556
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.0,3,6245.890861
4,Kothanur,2 BHK,1200.0,2.0,51.0,2,4250.0
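
As a quick check of the formula, the first row can be worked by hand: 39.07 lakh is 39.07 × 100000 = 3,907,000, and dividing by 1056 sqft gives roughly 3699.81 per sqft, which matches the table. The same arithmetic in plain Python (the function name is hypothetical):

```python
LAKH = 100_000  # 1 lakh = 100,000


def price_per_sqft(price_lakh, total_sqft):
    """Convert a price given in lakhs to a price per square foot."""
    return price_lakh * LAKH / total_sqft


# First two rows of the table above.
print(round(price_per_sqft(39.07, 1056.0), 2))
print(round(price_per_sqft(120.0, 2600.0), 2))
```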


Now we examine the locations. First we count the unique locations, then we look at how many rows each location has.

In [20]:
len(data5.location.unique())

1304

So we have 1304 unique values in the location column. This is a problem for categorical data: we have to encode the categories for the machine learning model, and an encoding such as one-hot would create one column per location, which is far too many to keep. This is a high-dimensionality problem, so we need to dig a little deeper and find a way to reduce the number of locations. Let's do that!

First, we find the locations that appear in only a few rows. We take every location that has 10 or fewer rows and assign all of them to a single "other" category. This way we reduce the number of distinct locations.

In [21]:
# removing extra spaces from location labels:
data5.location = data5.location.apply(lambda x: x.strip())

In [22]:
# how many rows in each location
data_stats = data5.groupby('location')['location'].agg('count').sort_values(ascending=False)
data_stats

location
Whitefield               535
Sarjapur  Road           392
Electronic City          304
Kanakpura Road           266
Thanisandra              236
                        ... 
1 Giri Nagar               1
Kanakapura Road,           1
Kanakapura main  Road      1
Karnataka Shabarimala      1
whitefiled                 1
Name: location, Length: 1293, dtype: int64

In [23]:
# count the locations that have at most 10 rows:
len(data_stats[data_stats<=10])

1052

In [24]:
# list the locations that have at most 10 rows:
location_less_10 = data_stats[data_stats<=10]
location_less_10

location
Basapura                 10
1st Block Koramangala    10
Gunjur Palya             10
Kalkere                  10
Sector 1 HSR Layout      10
                         ..
1 Giri Nagar              1
Kanakapura Road,          1
Kanakapura main  Road     1
Karnataka Shabarimala     1
whitefiled                1
Name: location, Length: 1052, dtype: int64

In [25]:
# replace every location that has at most 10 rows with "other":
data5.location = data5['location'].apply(lambda x: "other" if x in location_less_10 else x)

In [26]:
print("Total Unique Locations: ",len(data5.location.unique()))

Total Unique Locations:  242
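
The reason for grouping rare locations is that one-hot encoding creates one column per distinct label. The sketch below, on invented data with a hypothetical cut-off of one listing, shows the column count shrinking once rare labels are collapsed into "other". It uses `value_counts` with `Series.where`, which is one possible approach rather than the notebook's own code.

```python
import pandas as pd

# Hypothetical locations; the counts are invented for illustration.
locations = pd.Series(
    ["Whitefield"] * 3 + ["Hebbal"] * 2 + ["Kengeri", "Arekere", "Yelahanka"]
)

# Number of listings per location, aligned back to each row.
counts = locations.map(locations.value_counts())

# Keep labels seen more than once; replace the rest with "other".
grouped = locations.where(counts > 1, "other")

before = pd.get_dummies(locations).shape[1]
after = pd.get_dummies(grouped).shape[1]
print(before, "->", after)
```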


# Outliers:

In [29]:
# homes with less than 300 sqft per bedroom:
data5[data5.total_sqft/data5.BHK<300]

Unnamed: 0,location,size,total_sqft,bath,price,BHK,Price_per_sqft
9,other,6 Bedroom,1020.0,6.0,370.0,6,36274.509804
45,HSR Layout,8 Bedroom,600.0,9.0,200.0,8,33333.333333
58,Murugeshpalya,6 Bedroom,1407.0,4.0,150.0,6,10660.980810
68,Devarachikkanahalli,8 Bedroom,1350.0,7.0,85.0,8,6296.296296
70,other,3 Bedroom,500.0,3.0,100.0,3,20000.000000
...,...,...,...,...,...,...,...
13277,other,7 Bedroom,1400.0,7.0,218.0,7,15571.428571
13279,other,6 Bedroom,1200.0,5.0,130.0,6,10833.333333
13281,Margondanahalli,5 Bedroom,1375.0,5.0,125.0,5,9090.909091
13303,Vidyaranyapura,5 Bedroom,774.0,5.0,70.0,5,9043.927649


In [31]:
# drop homes with less than 300 sqft per bedroom:
data6 = data5[~(data5.total_sqft/data5.BHK<300)]
data6.head()

Unnamed: 0,location,size,total_sqft,bath,price,BHK,Price_per_sqft
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2,3699.810606
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.0,4,4615.384615
2,Uttarahalli,3 BHK,1440.0,2.0,62.0,3,4305.555556
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.0,3,6245.890861
4,Kothanur,2 BHK,1200.0,2.0,51.0,2,4250.0
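
The filter above removes homes with less than 300 sqft per bedroom. A minimal sketch of the same rule on a few rows modelled on the tables above, with the threshold pulled out into a hypothetical constant `MIN_SQFT_PER_BHK` so it is easy to change:

```python
import pandas as pd

MIN_SQFT_PER_BHK = 300  # assumed threshold, taken from the filter above

# Hypothetical rows for illustration.
homes = pd.DataFrame({
    "total_sqft": [1056.0, 600.0, 1020.0, 1200.0],
    "BHK": [2, 8, 6, 2],
})

sqft_per_bhk = homes["total_sqft"] / homes["BHK"]

# Rows below the threshold are treated as outliers and dropped.
# NaN areas compare as False here, so such rows would be kept.
kept = homes[~(sqft_per_bhk < MIN_SQFT_PER_BHK)]
print(len(homes), "->", len(kept))
```

Because the mask is written as a negated "less than", rows whose area could not be converted to a number are not removed by this step and need separate handling.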
