<h1 style='color:white' align='center'>Data Science Regression Project: Predicting Home Prices in Bangalore</h1>

In [None]:
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline
matplotlib.rcParams["figure.figsize"] = (20, 10)  # default figure size in inches
from sklearn.metrics import mean_squared_error, r2_score

<h2 style='color:white'>Data Load: Load Bangalore home prices into a DataFrame</h2>

In [None]:
df1 = pd.read_csv("bengaluru_house_prices.csv")
df1.head()

In [None]:
df1.shape

In [None]:
df1.columns

In [None]:
df1['area_type'].unique()

In [None]:
df1['area_type'].value_counts()

**Drop features that are not required to build our model**

In [None]:
df2 = df1.drop(['area_type','society','balcony','availability'],axis='columns')
df2.shape

<h2 style='color:white'>Data Cleaning: Handle NA values</h2>

In [None]:
df2.isnull().sum()

In [None]:
df2.shape

In [None]:
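# Take an explicit copy so that adding columns to df3 later does not
# trigger pandas' SettingWithCopyWarning
df3 = df2.dropna().copy()
df3.isnull().sum()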

In [None]:
df3.shape

<h2 style='color:white'>Feature Engineering</h2>

**Add a new integer feature, bhk (Bedrooms, Hall, Kitchen)**

In [None]:
df3['bhk'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))
df3.bhk.unique()

**Explore total_sqft feature**

In [None]:
def is_float(x):
    # True if x can be parsed as a single number, False otherwise
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

In [None]:
df3[~df3['total_sqft'].apply(is_float)].head(10)

**The rows above show that total_sqft can be a range (e.g. 2100 - 2850). In such cases we can take the average of the minimum and maximum values of the range. Other entries, such as 34.46Sq. Meter, could be converted to square feet with a unit conversion, but I am going to drop those corner cases to keep things simple.**

In [None]:
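def convert_sqft_to_num(x):
    # A range such as "2100 - 2850" is replaced by the average of its bounds
    tokens = x.split('-')
    try:
        if len(tokens) == 2:
            return (float(tokens[0]) + float(tokens[1])) / 2
        return float(x)
    except ValueError:
        # Anything else (e.g. "34.46Sq. Meter") becomes None and is dropped later
        return None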

In [None]:
df4 = df3.copy()
df4.total_sqft = df4.total_sqft.apply(convert_sqft_to_num)
df4 = df4[df4.total_sqft.notnull()]
df4.head(2)
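
The notebook drops values that carry a unit, but the unit conversion mentioned above could be sketched roughly as follows. This is only an illustration, not the method used in the rest of the notebook: `convert_sqft_with_units` and `UNIT_FACTORS` are hypothetical names, and only the "Sq. Meter" label seen above is handled (1 square metre is about 10.7639 square feet). Any other unit label still yields `None`.

```python
import re

# Hypothetical lookup table (not part of the original notebook).
# Only the "Sq. Meter" label shown above is covered.
UNIT_FACTORS = {"sq. meter": 10.7639}  # square feet per square metre (approx.)

def convert_sqft_with_units(x):
    """Return the area in square feet as a float, or None if it cannot be parsed."""
    text = str(x).strip()
    # Case 1: a plain number such as "1200"
    try:
        return float(text)
    except ValueError:
        pass
    # Case 2: a range such as "2100 - 2850" -> average of the two bounds
    tokens = text.split('-')
    if len(tokens) == 2:
        try:
            return (float(tokens[0]) + float(tokens[1])) / 2
        except ValueError:
            return None
    # Case 3: a number followed by a unit label such as "34.46Sq. Meter"
    match = re.fullmatch(r"([0-9]*\.?[0-9]+)\s*([A-Za-z. ]+)", text)
    if match:
        factor = UNIT_FACTORS.get(match.group(2).strip().lower())
        if factor is not None:
            return float(match.group(1)) * factor
    return None
```

It could be swapped in with `df4.total_sqft.apply(convert_sqft_with_units)` if keeping those rows matters more than simplicity.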

**The row below shows total_sqft as 2475, which is the average of the range 2100 - 2850**

In [None]:
df4.loc[30]

**Add a new feature: price per square foot**

In [None]:
df5 = df4.copy()
df5['price_per_sqft'] = df5['price']*100000/df5['total_sqft']
df5.head()
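
The factor of 100000 in the cell above suggests that `price` is recorded in lakh rupees (1 lakh = 100,000); that reading is an assumption, not something the notebook states. Under that assumption, the arithmetic for a single row can be sketched like this, with a hypothetical helper and made-up illustrative values rather than rows from the dataset:

```python
RUPEES_PER_LAKH = 100_000  # assumption: price is recorded in lakh rupees

def price_per_sqft(price_lakh, total_sqft):
    """Hypothetical helper: price per square foot in rupees for one listing."""
    return price_lakh * RUPEES_PER_LAKH / total_sqft

# Illustrative values only: a 1000 sq ft home priced at 50 lakh
example = price_per_sqft(50, 1000)
print(example)  # 5000.0 rupees per square foot
```

The column expression in the cell above applies the same formula to every row at once.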

In [None]:
df5_stats = df5['price_per_sqft'].describe()
df5_stats

In [None]:
df5.to_csv("bhp.csv",index=False)

In [None]:
# Strip surrounding whitespace so identical locations are counted together
df5['location'] = df5['location'].str.strip()
location_stats = df5['location'].value_counts(ascending=False)
location_stats

In [None]:
location_stats.values.sum()

In [None]:
len(location_stats[location_stats>10])

In [None]:
len(location_stats)

In [None]:
len(location_stats[location_stats<=10])