# <font color='Blue'> Solution Planning </font>

## Input
### Business Problem
1. What does a prospective home buyer consider before purchasing a house?
(Location, property size, proximity to offices, schools, parks, restaurants, and hospitals, or price.)
2. A dataset describing properties in Bengaluru.

## Output

1. Chart(s) showing the most important features when buying a property.
2. A model to predict house prices in Bengaluru.

## Task

1. What does a prospective home buyer consider before purchasing a house?
* Clean the data
* Create features
* Formulate business hypotheses about the features
* Run EDA to validate or refute the business hypotheses.






# 0.0. Imports

In [162]:
import re
import pandas as pd
import numpy as np
import xgboost as xgb
import seaborn as sns
import haversine as hs
import matplotlib.pyplot as plt

from boruta import BorutaPy
from geopy.geocoders import Nominatim
from pandas_profiling import ProfileReport

from sklearn import model_selection as ms
from sklearn import preprocessing as pp
from sklearn import dummy
from sklearn import metrics
from sklearn import linear_model as lm
from sklearn import ensemble as en
from sklearn.model_selection import cross_val_score

## 0.1. Helper Function

In [163]:
def metrics_cv(model, X, y, model_name='not defined', kfold=5):
    # sklearn scorers are negated errors (higher is better), so flip the sign back
    mae = -cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=kfold)
    mape = -cross_val_score(model, X, y, scoring='neg_mean_absolute_percentage_error', cv=kfold)
    # this scorer returns the ROOT mean squared error, so report it as RMSE
    rmse = -cross_val_score(model, X, y, scoring='neg_root_mean_squared_error', cv=kfold)

    # mean +/- standard deviation across the folds
    dictionary = {
        'Model': model_name,
        'MAE': f'{round(np.mean(mae), 3)} +/- {round(np.std(mae), 3)}',
        'MAPE': f'{round(np.mean(mape), 3)} +/- {round(np.std(mape), 3)}',
        'RMSE': f'{round(np.mean(rmse), 3)} +/- {round(np.std(rmse), 3)}'
    }

    return pd.DataFrame(dictionary, index=[0])


def descriptive_statistics(num_attr):
    # Central Tendency: mean, median
    c1 = pd.DataFrame(num_attr.apply(np.mean))
    c2 = pd.DataFrame(num_attr.apply(np.median))

    # Dispersion: min, max, range, std
    d1 = pd.DataFrame(num_attr.apply(min))
    d2 = pd.DataFrame(num_attr.apply(max))
    d3 = pd.DataFrame(num_attr.apply(lambda x: x.max() - x.min()))
    d4 = pd.DataFrame(num_attr.apply(lambda x: x.std()))
    
    # Measures of Shape
    s1 = pd.DataFrame(num_attr.apply(lambda x: x.skew()))
    s2 = pd.DataFrame(num_attr.apply(lambda x: x.kurtosis()))

    # concat
    m = pd.concat([d1,d2,d3,c1,c2,d4,s1,s2], axis=1).reset_index()
    m.columns = ['attributes', 'min', 'max', 'range', 'mean', 'median', 'std', 'skew', 'kurtosis']
    return m


## 0.2. Load Data

In [164]:
df_raw = pd.read_csv('../data/Bengaluru_House_Data.csv')

# 1.0. Data Description

In [165]:
df1 = df_raw.copy()

*  **Area_type** - describes the area
*  **Availability** - when the property can be possessed or when it is ready (categorical and time-series)
*  **Location** - where it is located in Bengaluru
*  **Price** - value of the property in lakhs (INR)
*  **Size** - in BHK or Bedroom (1-10 or more)
*  **Society** - the society to which it belongs
*  **Total_sqft** - size of the property in sq. ft
*  **Bath** - number of bathrooms
*  **Balcony** - number of balconies

## 1.1. Data dimensions

In [166]:
print('Number of Rows: {}'.format(df1.shape[0]))
print('Number of Columns: {}'.format(df1.shape[1]))

Number of Rows: 13320
Number of Columns: 9


## 1.2. Data Types

In [167]:
df1.dtypes

area_type        object
availability     object
location         object
size             object
society          object
total_sqft       object
bath            float64
balcony         float64
price           float64
dtype: object

## 1.3. Check NA

In [168]:
df1.isna().mean() *100

area_type        0.000000
availability     0.000000
location         0.007508
size             0.120120
society         41.306306
total_sqft       0.000000
bath             0.548048
balcony          4.572072
price            0.000000
dtype: float64

## 1.4. Replace NA

In [169]:
# drop rows with missing values in the columns below
# open question: remove the balcony column instead? (~4.6% missing)
df1 = df1.dropna(subset=['size', 'location', 'bath', 'balcony'])

## 1.5. Change Dtypes

In [170]:
df1.dtypes

area_type        object
availability     object
location         object
size             object
society          object
total_sqft       object
bath            float64
balcony         float64
price           float64
dtype: object

## 1.6. Descriptive Statistics

In [171]:
num_att = df1.select_dtypes(include=['int64', 'float64'])
cat_att = df1.select_dtypes(exclude=['int64', 'float64'])

### 1.6.1. Numerical Attributes

In [172]:
descriptive_statistics(num_att)

Unnamed: 0,attributes,min,max,range,mean,median,std,skew,kurtosis
0,bath,1.0,40.0,39.0,2.617309,2.0,1.226,4.590497,85.455663
1,balcony,0.0,3.0,3.0,1.584343,2.0,0.817287,0.005966,-0.544247
2,price,8.0,2912.0,2904.0,106.060778,70.0,131.766089,7.875011,107.376164


**Note:**
1. bath: 40 bathrooms?

### 1.6.2. Categorical Attributes

In [173]:
cat_att.describe(include=['object'])

Unnamed: 0,area_type,availability,location,size,society,total_sqft
count,12710,12710,12710,12710,7496,12710
unique,4,78,1265,27,2592,1976
top,Super built-up Area,Ready To Move,Whitefield,2 BHK,GrrvaGr,1200
freq,8481,10077,514,5152,80,788


In [174]:
cat_att['area_type'].value_counts(normalize=True)

Super built-up  Area    0.667270
Built-up  Area          0.181747
Plot  Area              0.144532
Carpet  Area            0.006452
Name: area_type, dtype: float64

In [175]:
cat_att['availability'].value_counts(normalize=True).head(10)

Ready To Move    0.792840
18-Dec           0.022895
18-May           0.022187
18-Apr           0.020535
18-Aug           0.015736
19-Dec           0.014319
18-Jul           0.011015
18-Mar           0.009284
20-Dec           0.007710
18-Jun           0.007553
Name: availability, dtype: float64

In [176]:
cat_att['location'].value_counts(normalize=True).head(10)

Whitefield               0.040441
Sarjapur  Road           0.029268
Electronic City          0.023603
Kanakpura Road           0.020535
Thanisandra              0.018175
Yelahanka                0.016208
Uttarahalli              0.014634
Hebbal                   0.013611
Raja Rajeshwari Nagar    0.013218
Marathahalli             0.012903
Name: location, dtype: float64

In [177]:
cat_att['size'].value_counts(normalize=True).head(10)

2 BHK        0.405350
3 BHK        0.324784
4 Bedroom    0.058930
1 BHK        0.041699
3 Bedroom    0.041463
4 BHK        0.038474
2 Bedroom    0.025806
5 Bedroom    0.020692
6 Bedroom    0.013297
1 Bedroom    0.008261
Name: size, dtype: float64

In [178]:
# total_sqft entries that are not made up of digits only
dirt = df1.loc[~df1['total_sqft'].apply(lambda x: bool(re.search(r'^([0-9]+)$', x))), 'total_sqft']

len(dirt)

272

In [179]:
dirt.tolist()[:10]

['2100 - 2850',
 '1330.74',
 '3067 - 8156',
 '1042 - 1105',
 '1563.05',
 '1145 - 1340',
 '1015 - 1540',
 '2023.71',
 '1113.27',
 '34.46Sq. Meter']

**Note**:

1. availability: categorical column with 78 distinct values (Ready To Move accounts for 79%).
2. location: extract latitude and longitude, then drop the variable.
3. size: clean the variable and convert it to an integer.
4. society: remove; it carries little information.
5. total_sqft: 272 entries are not digits only.
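
The non-numeric `total_sqft` entries listed above fall into a few shapes: decimals, ranges, and values with a unit. The sketch below shows one way such strings could be parsed instead of discarded. `parse_total_sqft` is a hypothetical helper, not part of this notebook, and representing a range by its midpoint is an assumption of the sketch; entries that carry a unit are left unparsed.

```python
import re
from typing import Optional

# Hypothetical helper for illustration; the notebook itself filters these rows out.
_NUMBER = r'\d+(?:\.\d+)?'
_SINGLE = re.compile(rf'^\s*({_NUMBER})\s*$')
_RANGE = re.compile(rf'^\s*({_NUMBER})\s*-\s*({_NUMBER})\s*$')


def parse_total_sqft(value: str) -> Optional[float]:
    """Return the area as a float, or None when the entry cannot be parsed."""
    single = _SINGLE.match(value)
    if single:
        return float(single.group(1))

    range_match = _RANGE.match(value)
    if range_match:
        low, high = (float(group) for group in range_match.groups())
        # Assumption: a range such as '2100 - 2850' is represented by its midpoint.
        return (low + high) / 2

    # Entries with a unit (e.g. '34.46Sq. Meter') are not converted in this sketch.
    return None


examples = ['1200', '1330.74', '2100 - 2850', '34.46Sq. Meter']
parsed = [parse_total_sqft(value) for value in examples]
print(parsed)
```

Whether the midpoint is a fair stand-in for a range depends on how the listings were written, so this is worth checking against the data before adopting it.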

# 2.0. Feature Engineering

In [180]:
df1['location'].value_counts(normalize=True)

Whitefield                                        0.040441
Sarjapur  Road                                    0.029268
Electronic City                                   0.023603
Kanakpura Road                                    0.020535
Thanisandra                                       0.018175
                                                    ...   
Jaladarsini Layout                                0.000079
6th block banashankari 3rd stage, 100 feet ORR    0.000079
Byappanahalli                                     0.000079
Basvasamithi Layout Vidyaranyapura                0.000079
EPIP AREA, WHITEFIELD                             0.000079
Name: location, Length: 1265, dtype: float64

In [181]:
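# df2 = df1.copy()
# Load the cached dataset that already includes the geocoded 'lat' and 'lon' columns
# (see the commented-out geocoding cell in 2.1.1).
df2 = pd.read_csv('../data/bengaluru_house_data_lat_lon.csv')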

## 2.1. Feature Creation

### 2.1.1.  Lat and Lon

In [182]:
# %%time 

# location -> lat and lon

# geolocator = Nominatim(user_agent="geoapiExercises")

# def location_lat(x):
#     if geolocator.geocode(x, timeout=None):
#         return geolocator.geocode(x, timeout=None).raw['lat']
#     else: 
#         return x

# df2['lat'] = df2['location'].apply(location_lat)

# def location_lon(x):
#     if geolocator.geocode(x, timeout=None):
#         return geolocator.geocode(x, timeout=None).raw['lon']
#     else: 
#         return x

# df2['lon'] = df2['location'].apply(location_lon)

# df2.to_csv('bengaluru_house_data_lat_lon.csv', index=False)
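
The commented-out cell above geocodes every row, and calls the geocoder twice per row. Since the dataset has far fewer distinct locations than rows, one possible refinement is to geocode each distinct name once and cache the result. The sketch below illustrates the idea with a stand-in geocoder so that it runs offline; `geocode_unique`, `fake_geocoder`, and the coordinates in it are hypothetical and are not output from Nominatim.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Coordinates = Optional[Tuple[float, float]]


def geocode_unique(locations: Iterable[str],
                   geocode_fn: Callable[[str], Coordinates]) -> Dict[str, Coordinates]:
    """Call geocode_fn once per distinct location name and cache the result."""
    cache: Dict[str, Coordinates] = {}
    for name in locations:
        if name not in cache:
            cache[name] = geocode_fn(name)
    return cache


calls: List[str] = []  # records which names actually reached the geocoder


def fake_geocoder(name: str) -> Coordinates:
    """Stand-in for a real geocoder; returns None when the place is unknown."""
    known = {'Whitefield': (12.97, 77.75)}  # illustrative values only
    calls.append(name)
    return known.get(name)


coords = geocode_unique(['Whitefield', 'Whitefield', 'Unknown Place'], fake_geocoder)
print(coords)
print(calls)
```

Returning `None` for a failed lookup, rather than the original location string, would keep the coordinate columns free of mixed types. With geopy, `geocode_fn` could wrap `geolocator.geocode(name)` and return `(result.latitude, result.longitude)` when a result is found.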

### 2.1.2. Number of Bedroom

In [183]:
df2['qt_bedroom'] = df2['size'].apply(lambda x: str(x).split()[0]).astype(int)

In [184]:
df2['size_bedroom_or_bhk'] = df2['size'].apply(lambda x: str(x).split()[1])

## 2.2. New Location-Based Features



### 2.2.0. Lat and Lon Manual

In [185]:
# Manually looked-up (lat, lon) pairs, kept as strings like the rest of the lat/lon columns
manual_coords = {
    'Kasavanhalli': ('13.953548', '76.700462'),
    'Bisuvanahalli': ('13.2292777', '77.5461785'),
    'Bhoganhalli': ('12.925617', '77.700203'),
    'Talaghattapura': ('12.86849', '77.536557'),
    'Lakshminarayana Pura': ('12.869087', '77.534158'),
    'Kumaraswami Layout': ('12.9037594', '77.56184389999999'),
    'Margondanahalli': ('12.957805', '77.713036'),
    'Kothannur': ('13.0551956', '77.64222059999997'),
    'Babusapalaya': ('13.022823', '77.652092'),
    'Somasundara Palya': ('12.89859', '77.651465'),
}

# Overwrite lat/lon for every row whose location matches one of the manual entries
for location, (lat, lon) in manual_coords.items():
    mask = df2['location'] == location
    df2.loc[mask, 'lat'] = lat
    df2.loc[mask, 'lon'] = lon

### 2.2.1. Technology Center

* In 2014, Bangalore contributed US$45 billion, or 38 percent of India's total IT exports.

* A large number of employees work in the city's IT corridors:

1. Whitefield
2. Electronics City
3. Outer Ring Road

In [186]:
def regex_technology_center(x):
    # 1 if the location name starts with one of the IT corridors, else 0
    if re.search(r'^(Electro|[Oo]uter|[Ww]hitefield)', x):
        return 1
    else:
        return 0

df2['is_technology_center'] = df2['location'].apply(regex_technology_center)

###  2.2.2. Near the airport

Bangalore is served by Kempegowda International Airport (IATA: BLR, ICAO: VOBL), located in Devanahalli, about 40 km (25 miles) from the city center.

In [187]:
df2['near_the_airport'] = df2['location'].apply(lambda x: 1 if bool((re.search('^[dD]evanahall', x))) else 0  )

###  2.2.3. Road

In [188]:
df2['is_road'] = df2['location'].apply(lambda x:1 if bool((re.search('[Rr]oad', x))) else 0 )
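
The three location flags above are all regex tests on the `location` string. As an alternative to `apply`, the sketch below expresses the same patterns with pandas' vectorized `str.contains`. The sample location names are made up for illustration; the column names and patterns follow the cells above.

```python
import pandas as pd

# Made-up sample of location names for illustration
locations = pd.Series([
    'Whitefield',
    'Electronic City Phase II',
    'Outer Ring Road East',
    'Devanahalli',
    'Hennur Road',
    'Hebbal',
])

flags = pd.DataFrame({
    'location': locations,
    # starts with one of the IT corridors
    'is_technology_center': locations.str.contains(r'^(?:Electro|[Oo]uter|[Ww]hitefield)').astype(int),
    # starts with Devanahall(i), where the airport is located
    'near_the_airport': locations.str.contains(r'^[dD]evanahall').astype(int),
    # mentions a road anywhere in the name
    'is_road': locations.str.contains(r'[Rr]oad').astype(int),
})

print(flags)
```

A name such as "Outer Ring Road East" sets both `is_technology_center` and `is_road`, so the flags are not mutually exclusive.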

### 2.2.4. Distance to Bangalore University

In [189]:
## Distance Between 2 Geolocations in Python
# https://towardsdatascience.com/calculating-distance-between-two-geolocations-in-python-26ad3afe287b

In [190]:
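def distance_bangalore_university(x):
    # reference point (lat, lon): Bangalore University
    bangalore_university = (12.9365945869, 77.5019063257)

    # only rows whose 'lat' looks like a coordinate string (e.g. '12.97...') are used
    if re.search(r'^[0-9-][0-9][\.]+', x['lat']):
        loc = float(x['lat']), float(x['lon'])
        return hs.haversine(bangalore_university, loc)  # kilometers by default
    else:
        return np.nan


df2['distance_bangalore_university'] = df2.apply(distance_bangalore_university, axis=1)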

### 2.2.5. Distance to Airport

In [191]:
def distance_airport(x):
    # reference point (lat, lon): airport
    airport = (13.199379, 77.710136)

    # only rows whose 'lat' looks like a coordinate string are used
    if re.search(r'^[0-9-][0-9][\.]+', x['lat']):
        loc = float(x['lat']), float(x['lon'])
        return hs.haversine(airport, loc)  # kilometers by default
    else:
        return np.nan


df2['distance_airport'] = df2.apply(distance_airport, axis=1)

### 2.2.6. Distance Hesaraghatta Lake

In [192]:
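def distance_hesaraghatta_lake(x):
    # reference point (lat, lon): Hesaraghatta Lake
    hesaraghatta_lake = (13.15, 77.49)

    # only rows whose 'lat' looks like a coordinate string are used
    if re.search(r'^[0-9-][0-9][\.]+', x['lat']):
        loc = float(x['lat']), float(x['lon'])
        return hs.haversine(hesaraghatta_lake, loc)  # kilometers by default
    else:
        return np.nan

df2['distance_hesaraghatta_lake'] = df2.apply(distance_hesaraghatta_lake, axis=1)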

### 2.2.7. Distance Center

In [193]:
def distance_center(x):
    # reference point (lat, lon): city center
    center = (12.9701977859, 77.5902776389)

    # only rows whose 'lat' looks like a coordinate string are used
    if re.search(r'^[0-9-][0-9][\.]+', x['lat']):
        loc = float(x['lat']), float(x['lon'])
        return hs.haversine(center, loc)  # kilometers by default
    else:
        return np.nan

df2['distance_center'] = df2.apply(distance_center, axis=1)
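
The four distance features above share the same logic and differ only in the reference point. The sketch below shows what the haversine formula computes and how a single helper could produce all four columns. `haversine_km`, `LANDMARKS`, and `add_distance_columns` are hypothetical names; the Earth radius is an assumption of the sketch; and `pd.to_numeric` accepts any numeric string, which is slightly more permissive than the regex check used in the notebook.

```python
import math

import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Reference points copied from the notebook cells
LANDMARKS = {
    'distance_bangalore_university': (12.9365945869, 77.5019063257),
    'distance_airport': (13.199379, 77.710136),
    'distance_hesaraghatta_lake': (13.15, 77.49),
    'distance_center': (12.9701977859, 77.5902776389),
}


def add_distance_columns(df, landmarks=LANDMARKS):
    """Return a copy of df with one distance column per landmark (NaN if lat/lon is unusable)."""
    out = df.copy()
    lat = pd.to_numeric(out['lat'], errors='coerce')
    lon = pd.to_numeric(out['lon'], errors='coerce')
    for column, landmark in landmarks.items():
        out[column] = [
            haversine_km(landmark, (la, lo)) if pd.notna(la) and pd.notna(lo) else float('nan')
            for la, lo in zip(lat, lon)
        ]
    return out


# Coordinates for Malleshwaram as they appear in the notebook output; 'Unknown' mimics a failed lookup
demo = pd.DataFrame({'location': ['Malleshwaram', 'Unknown'],
                     'lat': ['13.002735', 'Unknown'],
                     'lon': ['77.570325', 'Unknown']})
result = add_distance_columns(demo)
print(result[['location', 'distance_center']])
```

For the Malleshwaram row this gives roughly 4.2 km to the city center, in line with the `distance_center` value the notebook reports for that location.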

# 3.0. Data Filtering

In [194]:
df3 = df2.copy()

## 3.1. Filter Rows

In [195]:
# lat and lon: keep the rows that fail the coordinate pattern for inspection
df3_lat_dirt = df3.loc[~df3['lat'].apply(lambda x: bool(re.search(r'^[0-9-][0-9][\.]+', x))), :]
df3_lon_dirt = df3.loc[~df3['lon'].apply(lambda x: bool(re.search(r'^[0-9-][0-9][\.]+', x))), :]

# convert valid coordinates to float; rows that fail the pattern become NaN
df3['lat'] = df3.loc[df3['lat'].apply(lambda x: bool(re.search(r'^[0-9-][0-9][\.]+', x))), 'lat'].astype(float)
df3['lon'] = df3.loc[df3['lon'].apply(lambda x: bool(re.search(r'^[0-9-][0-9][\.]+', x))), 'lon'].astype(float)

In [196]:
# total_sqft_dirt = 
df3_total_sqft_dirt = df3.loc[~df3['total_sqft'].apply(lambda x: bool(re.search(r'^([\s\d]+)$', x))), :]

In [197]:
# total_sqft
df3_total_sqft_dirt = df3.loc[~df3['total_sqft'].apply(lambda x: bool(re.search(r'^([\s\d]+)$', x)) ), :]

df3['total_sqft'] = df3.loc[df3['total_sqft'].apply(lambda x: bool(re.search(r'^([\s\d]+)$', x))), 'total_sqft'].astype(int)

## 3.2. Filter Columns

In [198]:
df3.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price,lat,lon,qt_bedroom,size_bedroom_or_bhk,is_technology_center,near_the_airport,is_road,distance_bangalore_university,distance_airport,distance_hesaraghatta_lake,distance_center
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056.0,2.0,1.0,39.07,12.846854,77.676927,2,BHK,1,0,0,21.435172,39.363745,39.324556,16.62243
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600.0,5.0,3.0,120.0,12.895768,77.867101,4,Bedroom,0,0,0,39.839926,37.800097,49.680452,31.121232
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440.0,2.0,3.0,62.0,12.905568,77.545544,3,BHK,0,0,0,5.85403,37.218565,27.837715,8.668786
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521.0,3.0,1.0,95.0,,,3,BHK,0,0,0,,,,
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200.0,2.0,1.0,51.0,12.580537,77.333067,2,BHK,0,0,0,43.620978,80.034913,65.566917,51.529952


In [199]:
drop_cols = ['society', 'size']
df3 = df3.drop(drop_cols, axis=1)
df3.isna().mean()

area_type                        0.000000
availability                     0.000000
location                         0.000000
total_sqft                       0.021400
bath                             0.000000
balcony                          0.000000
price                            0.000000
lat                              0.082297
lon                              0.159166
qt_bedroom                       0.000000
size_bedroom_or_bhk              0.000000
is_technology_center             0.000000
near_the_airport                 0.000000
is_road                          0.000000
distance_bangalore_university    0.082297
distance_airport                 0.082297
distance_hesaraghatta_lake       0.082297
distance_center                  0.082297
dtype: float64

In [201]:
df3 = df3.dropna()
df3.isna().mean()

area_type                        0.0
availability                     0.0
location                         0.0
total_sqft                       0.0
bath                             0.0
balcony                          0.0
price                            0.0
lat                              0.0
lon                              0.0
qt_bedroom                       0.0
size_bedroom_or_bhk              0.0
is_technology_center             0.0
near_the_airport                 0.0
is_road                          0.0
distance_bangalore_university    0.0
distance_airport                 0.0
distance_hesaraghatta_lake       0.0
distance_center                  0.0
dtype: float64

In [202]:
round((1 - (df3.shape[0] / df_raw.shape[0])) * 100, 2)

21.76

# 4.0. EDA

In [203]:
df4 = df3.copy()

## 4.1. Univariate Analysis

In [204]:
# profile = ProfileReport(df4, title='Analysis - Bengaluru House')
# profile.to_file('../reports/figures/output_v1.html')

### total_sqft

In [205]:
df4.sort_values('total_sqft').head(10)

Unnamed: 0,area_type,availability,location,total_sqft,bath,balcony,price,lat,lon,qt_bedroom,size_bedroom_or_bhk,is_technology_center,near_the_airport,is_road,distance_bangalore_university,distance_airport,distance_hesaraghatta_lake,distance_center
4723,Built-up Area,Ready To Move,Srirampuram,5.0,7.0,3.0,115.0,17.43458,82.715203,7,BHK,0,0,0,750.28572,713.951666,735.448133,740.706499
333,Plot Area,18-Dec,Suragajakkanahalli,11.0,3.0,2.0,74.0,12.739777,77.668954,3,Bedroom,0,0,0,28.406925,51.299965,49.566145,27.004037
968,Carpet Area,Ready To Move,Weavers Colony,15.0,1.0,0.0,30.0,12.845495,77.583129,1,BHK,0,0,0,13.421027,41.68638,35.330959,13.887985
5671,Plot Area,Ready To Move,Mysore Road,45.0,1.0,0.0,23.0,12.387214,76.666963,1,Bedroom,0,0,1,109.257192,144.743482,123.127287,119.311009
12615,Super built-up Area,Ready To Move,Tilak Nagar,250.0,2.0,2.0,40.0,28.636548,77.096496,1,BHK,0,0,0,1746.261684,1717.709843,1722.508847,1742.770295
111,Plot Area,Ready To Move,Hennur Road,276.0,3.0,3.0,23.0,13.025809,77.630507,2,Bedroom,0,0,1,17.104793,21.139094,20.549522,7.565419
10029,Super built-up Area,Ready To Move,Yelahanka New Town,284.0,1.0,1.0,8.0,13.097804,77.581189,1,BHK,0,0,0,19.877271,17.958764,11.454251,14.223291
4607,Carpet Area,Ready To Move,Nagarbhavi,300.0,1.0,1.0,20.0,12.954674,77.512172,1,BHK,0,0,0,2.297656,34.642887,21.851672,8.637836
10955,Super built-up Area,Ready To Move,Malleshwaram,302.0,2.0,1.0,25.0,13.002735,77.570325,2,BHK,0,0,0,10.442843,26.596657,18.542847,4.214693
942,Plot Area,Ready To Move,Rajaji Nagar,315.0,4.0,2.0,90.0,18.014228,79.552624,4,Bedroom,0,0,0,605.836888,570.56746,584.22612,598.96317


In [206]:
df4.sort_values('total_sqft', ascending=False).head(10)

Unnamed: 0,area_type,availability,location,total_sqft,bath,balcony,price,lat,lon,qt_bedroom,size_bedroom_or_bhk,is_technology_center,near_the_airport,is_road,distance_bangalore_university,distance_airport,distance_hesaraghatta_lake,distance_center
1794,Plot Area,Ready To Move,Nelamangala,52272.0,2.0,1.0,140.0,13.095302,77.396359,3,Bedroom,0,0,0,21.028336,35.892835,11.824671,25.195785
5121,Super built-up Area,Ready To Move,Doddabommasandra,42000.0,8.0,3.0,175.0,13.064967,77.562966,9,BHK,0,0,0,15.732873,21.848508,12.322476,10.945435
5194,Super built-up Area,Ready To Move,Ulsoor,36000.0,4.0,2.0,450.0,12.977879,77.62467,4,BHK,0,0,0,14.072908,26.311678,24.064122,3.823236
641,Built-up Area,Ready To Move,Yelahanka,35000.0,3.0,3.0,130.0,13.100698,77.596345,3,BHK,0,0,0,20.920089,16.498904,12.754434,14.525884
12396,Plot Area,Ready To Move,Dodsworth Layout,30400.0,4.0,2.0,1824.0,12.97077,77.744557,6,Bedroom,0,0,0,26.568094,25.69216,34.021643,16.717528
6884,Plot Area,Ready To Move,Yelahanka,26136.0,1.0,0.0,150.0,13.100698,77.596345,1,Bedroom,0,0,0,20.920089,16.498904,12.754434,14.525884
1173,Plot Area,Ready To Move,Siddapura,14000.0,3.0,2.0,800.0,14.340956,74.892425,4,Bedroom,0,0,0,322.323243,329.717028,310.242118,328.943158
577,Super built-up Area,19-Jan,Malleshwaram,12000.0,7.0,3.0,2200.0,13.002735,77.570325,7,BHK,0,0,0,10.442843,26.596657,18.542847,4.214693
390,Super built-up Area,19-Jan,Rajaji Nagar,12000.0,6.0,3.0,2200.0,18.014228,79.552624,7,BHK,0,0,0,605.836888,570.56746,584.22612,598.96317
2489,Super built-up Area,Ready To Move,Sathya Sai Layout,11338.0,9.0,1.0,1000.0,12.973461,77.751115,6,BHK,0,0,0,27.314817,25.510031,34.428233,17.431717


**Notes:**
1. **total_sqft**: apartments with a total_sqft of 5, 11, 15, or 45?

### bath

In [207]:
df4.sort_values('bath', ascending=False).head(10)

Unnamed: 0,area_type,availability,location,total_sqft,bath,balcony,price,lat,lon,qt_bedroom,size_bedroom_or_bhk,is_technology_center,near_the_airport,is_road,distance_bangalore_university,distance_airport,distance_hesaraghatta_lake,distance_center
1877,Plot Area,Ready To Move,Hongasandra,990.0,12.0,0.0,120.0,12.901368,77.632057,8,Bedroom,0,0,0,14.639552,34.199629,31.641348,8.892508
1681,Plot Area,Ready To Move,1 Ramamurthy Nagar,1200.0,11.0,0.0,170.0,13.012022,77.677782,11,Bedroom,0,0,0,20.821178,21.125807,25.476453,10.560175
5555,Super built-up Area,Ready To Move,Vidyaranyapura,4700.0,10.0,3.0,130.0,13.406193,75.249867,9,BHK,0,0,0,249.354943,267.219384,244.099641,257.973221
2072,Plot Area,Ready To Move,NS Palya,1500.0,10.0,3.0,165.0,12.908793,77.604554,8,Bedroom,0,0,0,11.546386,34.275995,29.552849,7.000955
6373,Plot Area,Ready To Move,Kothanur,1020.0,10.0,0.0,155.0,12.580537,77.333067,8,Bedroom,0,0,0,43.620978,80.034913,65.566917,51.529952
6211,Plot Area,Ready To Move,Hoskote,1800.0,10.0,3.0,185.0,13.073014,77.792138,9,Bedroom,0,0,0,34.912258,16.621765,33.821706,24.67682
11500,Plot Area,Ready To Move,Rajaji Nagar,1200.0,10.0,2.0,180.0,18.014228,79.552624,8,Bedroom,0,0,0,605.836888,570.56746,584.22612,598.96317
1836,Plot Area,Ready To Move,Chikkasandra,1200.0,9.0,3.0,120.0,13.710558,76.84497,9,Bedroom,0,0,0,111.61999,109.474057,93.551989,115.237687
733,Plot Area,Ready To Move,Sector 3 HSR Layout,600.0,9.0,3.0,190.0,12.912736,77.638005,9,Bedroom,0,0,0,14.986828,32.817008,30.87265,8.22061
2682,Built-up Area,Ready To Move,Kadugodi,6200.0,9.0,0.0,200.0,12.998577,77.760972,9,Bedroom,0,0,0,28.905803,22.996982,33.836422,18.762297


### balcony

In [208]:
df4['balcony'].value_counts()

2.0    4273
1.0    4030
3.0    1379
0.0     739
Name: balcony, dtype: int64

### price

In [209]:
df4.sort_values('price', ascending=False).head(10)

Unnamed: 0,area_type,availability,location,total_sqft,bath,balcony,price,lat,lon,qt_bedroom,size_bedroom_or_bhk,is_technology_center,near_the_airport,is_road,distance_bangalore_university,distance_airport,distance_hesaraghatta_lake,distance_center
10558,Super built-up Area,18-Jan,Ashok Nagar,8321.0,5.0,2.0,2912.0,13.040073,80.215925,4,BHK,0,0,0,294.288157,271.934855,295.478689,284.575065
12600,Plot Area,Ready To Move,Defence Colony,8000.0,6.0,3.0,2800.0,34.01125,71.536452,6,Bedroom,0,0,0,2419.727021,2396.699545,2396.350943,2418.336727
11214,Plot Area,Ready To Move,Sadashiva Nagar,9600.0,7.0,2.0,2736.0,15.331903,75.12647,5,Bedroom,0,0,0,369.508415,365.707151,351.772371,373.525481
9816,Plot Area,Ready To Move,5th Block Jayanagar,10624.0,4.0,2.0,2340.0,12.929507,77.580165,4,Bedroom,0,0,0,8.517801,33.146607,26.391704,4.65544
6103,Plot Area,18-Sep,Bommenahalli,2940.0,3.0,2.0,2250.0,13.26871,76.641051,4,Bedroom,0,0,0,100.278299,115.976416,92.84439,108.020498
390,Super built-up Area,19-Jan,Rajaji Nagar,12000.0,6.0,3.0,2200.0,18.014228,79.552624,7,BHK,0,0,0,605.836888,570.56746,584.22612,598.96317
577,Super built-up Area,19-Jan,Malleshwaram,12000.0,7.0,3.0,2200.0,13.002735,77.570325,7,BHK,0,0,0,10.442843,26.596657,18.542847,4.214693
8119,Plot Area,Ready To Move,Dollars Colony,7800.0,3.0,2.0,2000.0,15.346072,75.116714,3,Bedroom,0,0,0,371.367519,367.522493,353.614524,375.375167
6954,Plot Area,18-Apr,Yemlur,11000.0,5.0,3.0,2000.0,12.946651,77.676065,4,Bedroom,0,0,0,18.906827,28.343324,30.290452,9.65792
12396,Plot Area,Ready To Move,Dodsworth Layout,30400.0,4.0,2.0,1824.0,12.97077,77.744557,6,Bedroom,0,0,0,26.568094,25.69216,34.021643,16.717528


In [210]:
df4.sort_values('price', ascending=True).head(10)

Unnamed: 0,area_type,availability,location,total_sqft,bath,balcony,price,lat,lon,qt_bedroom,size_bedroom_or_bhk,is_technology_center,near_the_airport,is_road,distance_bangalore_university,distance_airport,distance_hesaraghatta_lake,distance_center
10029,Super built-up Area,Ready To Move,Yelahanka New Town,284.0,1.0,1.0,8.0,13.097804,77.581189,1,BHK,0,0,0,19.877271,17.958764,11.454251,14.223291
8163,Built-up Area,Ready To Move,Chandapura,450.0,1.0,1.0,9.0,17.443639,77.433391,1,BHK,0,0,0,501.215027,472.872567,477.470083,497.70922
7111,Super built-up Area,Ready To Move,Alur,470.0,2.0,1.0,10.0,15.428596,77.261334,1,BHK,0,0,0,278.309091,252.549702,254.563727,275.65157
10569,Built-up Area,Ready To Move,Attibele,410.0,1.0,1.0,10.0,12.778259,77.771283,1,BHK,0,0,0,34.099194,47.292867,51.358348,28.991247
5138,Super built-up Area,Ready To Move,Attibele,400.0,1.0,1.0,10.0,12.778259,77.771283,1,BHK,0,0,0,34.099194,47.292867,51.358348,28.991247
12007,Super built-up Area,Ready To Move,Chandapura,410.0,1.0,1.0,10.0,17.443639,77.433391,1,BHK,0,0,0,501.215027,472.872567,477.470083,497.70922
1396,Built-up Area,18-Mar,Kengeri,340.0,1.0,1.0,10.0,12.917657,77.483757,1,BHK,0,0,0,2.881536,39.78211,25.844238,12.93782
11395,Super built-up Area,Ready To Move,Attibele,400.0,1.0,1.0,10.25,12.778259,77.771283,1,BHK,0,0,0,34.099194,47.292867,51.358348,28.991247
2313,Built-up Area,Ready To Move,Attibele,395.0,1.0,1.0,10.25,12.778259,77.771283,1,BHK,0,0,0,34.099194,47.292867,51.358348,28.991247
8217,Plot Area,Ready To Move,Doddaballapur,640.0,1.0,0.0,10.5,13.292958,77.543146,2,Bedroom,0,0,0,39.876773,20.855693,16.905254,36.250458


# 5.0. Data Preparation

In [256]:
df5 = df4.copy()

## 5.1. Standardization

## 5.2. Rescaling

In [257]:
mms_total_sqft = pp.MinMaxScaler()
mms_bath = pp.MinMaxScaler()
mms_balcony = pp.MinMaxScaler()
mms_lat = pp.MinMaxScaler()
mms_lon = pp.MinMaxScaler()
mms_qt_bedroom = pp.MinMaxScaler()

mms_distance_bangalore_university = pp.MinMaxScaler()
mms_distance_airport = pp.MinMaxScaler()
mms_distance_hesaraghatta_lake = pp.MinMaxScaler()
mms_distance_center = pp.MinMaxScaler()

In [258]:
df5['total_sqft'] = mms_total_sqft.fit_transform(df5[['total_sqft']])
df5['bath'] = mms_bath.fit_transform(df5[['bath']])
df5['balcony'] = mms_balcony.fit_transform(df5[['balcony']])
df5['lat'] = mms_lat.fit_transform(df5[['lat']])
df5['lon'] = mms_lon.fit_transform(df5[['lon']])
df5['qt_bedroom'] = mms_qt_bedroom.fit_transform(df5[['qt_bedroom']])

df5['distance_bangalore_university'] = mms_distance_bangalore_university.fit_transform(df5[['distance_bangalore_university']])
df5['distance_airport'] = mms_distance_airport.fit_transform(df5[['distance_airport']])
df5['distance_hesaraghatta_lake'] = mms_distance_hesaraghatta_lake.fit_transform(df5[['distance_hesaraghatta_lake']])
df5['distance_center'] = mms_distance_center.fit_transform(df5[['distance_center']])
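
The rescaling cells above create one `MinMaxScaler` per column and fit each on the full dataset. A possible variation, sketched below on toy data, keeps the scalers in a dictionary and fits them on the training rows only, so that the validation rows do not influence the learned minimum and maximum. The data frames and the `scalers` dictionary are hypothetical; this is not how the notebook currently does it.

```python
import pandas as pd
from sklearn import preprocessing as pp

# Toy data standing in for the notebook's features (hypothetical values)
train = pd.DataFrame({'total_sqft': [500.0, 1000.0, 1500.0], 'bath': [1.0, 2.0, 3.0]})
valid = pd.DataFrame({'total_sqft': [1250.0], 'bath': [2.0]})

scalers = {}
for column in ['total_sqft', 'bath']:
    scaler = pp.MinMaxScaler()
    # learn min and max from the training rows only
    train[column] = scaler.fit_transform(train[[column]]).ravel()
    # reuse the same min and max for the validation rows
    valid[column] = scaler.transform(valid[[column]]).ravel()
    # keep the fitted scaler so it can be applied to new data later
    scalers[column] = scaler

print(valid)
```

Keying the scalers by column name also makes it harder to reuse the wrong scaler object by accident.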

## 5.3. Encoding



In [259]:
# area_type
le_area_type = pp.LabelEncoder()
df5['area_type'] = le_area_type.fit_transform(df5[['area_type']].values.ravel())

In [260]:
# size_bedroom_or_bhk
le_size_bedroom_or_bhk = pp.LabelEncoder()
df5['size_bedroom_or_bhk'] = le_size_bedroom_or_bhk.fit_transform(df5[['size_bedroom_or_bhk']].values.ravel())

In [261]:
# area_type
df5 = pd.get_dummies(df5, prefix='area_type', columns=['area_type'] )

In [262]:
# availability: frequency encoding (share of rows per category)
me_availability = dict(df5['availability'].value_counts(normalize=True))
df5['availability'] = df5['availability'].map(me_availability)

In [263]:
# location: frequency encoding (share of rows per category)
me_location = dict(df5['location'].value_counts(normalize=True))
df5['location'] = df5['location'].map(me_location)
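
The `availability` and `location` cells above replace each category with its relative frequency in the data. The sketch below reproduces the idea on a made-up series and shows what happens to a category that is missing from the mapping. The sample values and the choice to fill unseen categories with 0 are assumptions of this example.

```python
import pandas as pd

# Made-up sample values for illustration
train = pd.Series(['Whitefield', 'Whitefield', 'Hebbal', 'Yelahanka'], name='location')
new = pd.Series(['Hebbal', 'Kengeri'], name='location')

# Share of rows per category, learned from the training data
frequencies = train.value_counts(normalize=True).to_dict()

train_encoded = train.map(frequencies)
# A category absent from the mapping becomes NaN; filling with 0 is one possible choice
new_encoded = new.map(frequencies).fillna(0.0)

print(train_encoded.tolist())
print(new_encoded.tolist())
```

Two categories with the same frequency receive the same code, so this encoding cannot tell them apart.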

In [264]:
df5.head().T

Unnamed: 0,0,1,2,4,6
availability,0.012955,0.80712,0.80712,0.80712,0.80712
location,0.012283,0.001343,0.017561,0.005662,0.015737
total_sqft,0.020108,0.049649,0.027455,0.022863,0.024968
bath,0.090909,0.363636,0.090909,0.090909,0.181818
balcony,0.333333,1.0,1.0,0.333333,0.333333
price,39.07,120.0,62.0,51.0,63.25
lat,0.254096,0.254755,0.254887,0.25051,0.255556
lon,0.790671,0.792458,0.789437,0.78744,0.790873
qt_bedroom,0.090909,0.272727,0.181818,0.090909,0.181818
size_bedroom_or_bhk,0.0,1.0,0.0,0.0,0.0


# 6.0. Feature Selection

In [268]:
df6 = df5.copy()

In [269]:
X = df6.drop(['price'], axis=1)
y = df6['price'].copy()

x_training, x_validation, y_train, y_validation = ms.train_test_split(X, y, test_size=0.2, random_state=42)

## 6.1. Boruta as Feature Selection

In [270]:
# x_boruta = X.copy().values
# y_boruta = y.ravel()

# rf = en.RandomForestRegressor(n_estimators=300)

# feat_selector_boruta = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=42).fit(x_boruta, y_boruta)

In [271]:
# feat_ranking = feat_selector_boruta.ranking_
# feat_selector = feat_selector_boruta.support_
# columns_name = X.columns

In [272]:
feat_ranking = [ 3,  6,  1,  1,  9,  2,  1,  7,  8, 13, 14, 12,  4,  2,  2,  1, 10, 14,  5, 11]
feat_selector = [False, False,  True,  True, False, False,  True, False, False,
                 False, False, False, False, False, False,  True, False, False,False, False]
columns_name = ['availability', 'location', 'total_sqft', 'bath', 'balcony', 'lat',
                'lon', 'qt_bedroom', 'size_bedroom_or_bhk', 'is_technology_center',
                'near_the_airport', 'is_road', 'distance_bangalore_university',
                'distance_airport', 'distance_hesaraghatta_lake', 'distance_center',
                'area_type_0', 'area_type_1', 'area_type_2', 'area_type_3']


df_boruta_ranking = pd.DataFrame({'ranking': feat_ranking, 
                                   'selected_boruta':feat_selector}, 
                                    index=columns_name).sort_values('ranking')

df_boruta_ranking

Unnamed: 0,ranking,selected_boruta
total_sqft,1,True
bath,1,True
lon,1,True
distance_center,1,True
lat,2,False
distance_airport,2,False
distance_hesaraghatta_lake,2,False
availability,3,False
distance_bangalore_university,4,False
area_type_2,5,False


## 6.2. Feature Importance

In [273]:
# x_fi = X.copy().values
# y_fi = y.ravel()

# forest = en.RandomForestRegressor().fit(x_fi, y_fi)

In [274]:
# feature_importances = forest.feature_importances_
# columns_names = X.columns

In [275]:
feature_importances = [2.83948758e-02, 2.92242380e-02, 5.93242661e-01, 6.05061977e-02,
                       1.16161054e-02, 2.78409536e-02, 3.12862980e-02, 1.92521651e-02,
                       6.98896514e-03, 1.21222355e-04, 1.68048597e-04, 1.18157922e-03,
                       2.75819747e-02, 2.73359672e-02, 3.93738080e-02, 7.33364147e-02,
                       3.55987262e-03, 4.89581851e-05, 1.71008717e-02, 1.83882303e-03]
columns_names = ['availability', 'location', 'total_sqft', 'bath', 'balcony', 'lat',
                'lon', 'qt_bedroom', 'size_bedroom_or_bhk', 'is_technology_center',
                'near_the_airport', 'is_road', 'distance_bangalore_university',
                'distance_airport', 'distance_hesaraghatta_lake', 'distance_center',
                'area_type_0', 'area_type_1', 'area_type_2', 'area_type_3']

df_fi_ranking = pd.DataFrame({'Feature Importance': feature_importances}, 
                             index=columns_names).sort_values('Feature Importance', ascending=False)
df_fi_ranking

Unnamed: 0,Feature Importance
total_sqft,0.593243
distance_center,0.073336
bath,0.060506
distance_hesaraghatta_lake,0.039374
lon,0.031286
location,0.029224
availability,0.028395
lat,0.027841
distance_bangalore_university,0.027582
distance_airport,0.027336
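
One way to read the importances is cumulatively: how many of the top features are needed to cover a given share of the total importance. The sketch below assumes a 90% cutoff, which is a hypothetical choice and not one made in the original notebook. The names `threshold`, `n_keep` and `cols_top` are likewise illustrative.

```python
import pandas as pd

# Values copied from the hard-coded random forest importances in this section.
feature_importances = [2.83948758e-02, 2.92242380e-02, 5.93242661e-01, 6.05061977e-02,
                       1.16161054e-02, 2.78409536e-02, 3.12862980e-02, 1.92521651e-02,
                       6.98896514e-03, 1.21222355e-04, 1.68048597e-04, 1.18157922e-03,
                       2.75819747e-02, 2.73359672e-02, 3.93738080e-02, 7.33364147e-02,
                       3.55987262e-03, 4.89581851e-05, 1.71008717e-02, 1.83882303e-03]
columns_names = ['availability', 'location', 'total_sqft', 'bath', 'balcony', 'lat',
                 'lon', 'qt_bedroom', 'size_bedroom_or_bhk', 'is_technology_center',
                 'near_the_airport', 'is_road', 'distance_bangalore_university',
                 'distance_airport', 'distance_hesaraghatta_lake', 'distance_center',
                 'area_type_0', 'area_type_1', 'area_type_2', 'area_type_3']

df_fi = pd.DataFrame({'Feature Importance': feature_importances},
                     index=columns_names).sort_values('Feature Importance', ascending=False)

# Running total of importance, from the most to the least important feature.
df_fi['cumulative'] = df_fi['Feature Importance'].cumsum()

threshold = 0.90  # hypothetical cutoff, not taken from the notebook

# Count the features strictly below the cutoff, plus the one that crosses it.
n_keep = int((df_fi['cumulative'] < threshold).sum()) + 1
cols_top = df_fi.index[:n_keep].tolist()

print(n_keep, cols_top)
```

Because `total_sqft` alone accounts for roughly 59% of the importance, any cutoff of this kind is dominated by that single feature.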


## 6.3. Manual Selection

In [276]:
cols_selected = ['availability',
                 'total_sqft',
                 'bath',
                 'balcony',
                 'lat',
                 'lon',
                 'location',
                 'qt_bedroom',
                 'size_bedroom_or_bhk',
                 'is_technology_center',
                 'near_the_airport',
                 'is_road',
                 'distance_bangalore_university',
                 'distance_airport',
                 'distance_hesaraghatta_lake',
                 'distance_center',
                 'area_type_0',
                 'area_type_1',
                 'area_type_2',
                 'area_type_3']

# 7.0. Model Training

In [284]:
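# keep only the manually selected columns in the validation and training sets
x_val = x_validation[cols_selected].copy()
y_val = y_validation.copy()
x_train = x_training[cols_selected].copy()

# full feature matrix, used by metrics_cv for cross-validation
X = X[cols_selected].copy()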

## 7.1. Average Model

In [285]:
# model definition and fit
model_baseline = dummy.DummyRegressor(strategy='mean').fit(x_train, y_train)

# model predict
yhat_baseline = model_baseline.predict(x_val)

# model performance
mae = metrics.mean_absolute_error(y_val, yhat_baseline)
mape = metrics.mean_absolute_percentage_error(y_val, yhat_baseline)
rmse = np.sqrt(metrics.mean_squared_error(y_val, yhat_baseline))

print('MAE: {} | MAPE: {} | RMSE: {}'.format(mae, mape, rmse))

MAE: 71.72002096537742 | MAPE: 0.8369691528243824 | RMSE: 162.87336198295392
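
The same three hold-out metrics are computed for every model in this section. The following is a minimal sketch of how they are defined, written with NumPy only. The function name `holdout_metrics` and the toy values are hypothetical, and unlike `sklearn.metrics.mean_absolute_percentage_error` this version assumes that no true value is zero.

```python
import numpy as np


def holdout_metrics(y_true, y_pred):
    """Return MAE, MAPE and RMSE for one set of predictions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    mae = np.mean(np.abs(err))                    # mean absolute error
    mape = np.mean(np.abs(err) / np.abs(y_true))  # as a fraction; assumes y_true != 0
    rmse = np.sqrt(np.mean(err ** 2))             # root mean squared error
    return {'MAE': mae, 'MAPE': mape, 'RMSE': rmse}


# Toy values for illustration only; these are not data from the notebook.
y_true_demo = [100.0, 200.0, 400.0]
y_pred_demo = [110.0, 180.0, 400.0]

demo = holdout_metrics(y_true_demo, y_pred_demo)
print('MAE: {MAE} | MAPE: {MAPE} | RMSE: {RMSE}'.format(**demo))
```

MAPE is reported as a fraction, so the baseline value of about 0.84 above corresponds to an average error of about 84% of the true price.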


In [286]:
#1 MAE: 69.18655009838794 | MAPE: 0.9189057519989557 | RMSE: 130.53405853555498
#2 MAE: 66.74150511881209 | MAPE: 0.8464271363888621 | RMSE: 134.87690120642173

In [287]:
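# note: the 'MSE' column returned by metrics_cv holds RMSE values
# (it is computed with scoring='neg_root_mean_squared_error')
result_baseline = metrics_cv(model_baseline, X, y, 'Average Model')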
result_baseline

Unnamed: 0,Model,MAE,MAPE,MSE
0,Average Model,65.513 +/- 2.704,0.856 +/- 0.034,129.141 +/- 15.399


## 7.2. Linear Regression Model

In [288]:
# model definition and fit
model_lr = lm.LinearRegression().fit(x_train, y_train)

# model predict
yhat_lr = model_lr.predict(x_val)

# model performance
mae = metrics.mean_absolute_error(y_val, yhat_lr)
mape = metrics.mean_absolute_percentage_error(y_val, yhat_lr)
rmse = np.sqrt(metrics.mean_squared_error(y_val, yhat_lr))

print('MAE: {} | MAPE: {} | RMSE: {}'.format(mae, mape, rmse))

MAE: 45.587953237410076 | MAPE: 0.4024040439866077 | RMSE: 130.01524406646908


In [289]:
#1 MAE: 51.39237039425772 | MAPE: 0.5167872441161455 | RMSE: 113.84866787174654
#2 MAE: 43.14026377555565 | MAPE: 0.4230519670345765 | RMSE: 121.40008663415217

In [290]:
result_lr = metrics_cv(model_lr, X, y, 'LinearRegression')
result_lr

Unnamed: 0,Model,MAE,MAPE,MSE
0,LinearRegression,40.399 +/- 1.879,0.403 +/- 0.01,105.04 +/- 15.353


## 7.3. Random Forest Model

In [291]:
# model definition and fit
model_rf = en.RandomForestRegressor().fit(x_train, y_train)

# model predict
yhat_rf = model_rf.predict(x_val)

# model performance
mae = metrics.mean_absolute_error(y_val, yhat_rf)
mape = metrics.mean_absolute_percentage_error(y_val, yhat_rf) 
rmse = np.sqrt(metrics.mean_squared_error(y_val, yhat_rf))

print('MAE: {} | MAPE: {} | RMSE: {}'.format(mae, mape, rmse))

MAE: 29.08292287498614 | MAPE: 0.22459632557403508 | RMSE: 91.76455881919502


In [292]:
#1 MAE: 39.84512783025325 | MAPE: 0.357494658863791 | RMSE: 103.6417705071636
#2 MAE: 29.032647859198246 | MAPE: 0.2633721578886517 | RMSE: 80.29353829948478

In [293]:
result_rf = metrics_cv(model_rf, X, y, 'RandomForestRegressor')
result_rf

Unnamed: 0,Model,MAE,MAPE,MSE
0,RandomForestRegressor,25.611 +/- 0.621,0.227 +/- 0.007,71.749 +/- 7.886


## 7.4. XGB Regression Model

In [294]:
# model definition and fit
model_xgb = xgb.XGBRegressor(objective='reg:squarederror').fit(x_train, y_train)

# model predict
yhat_xgb = model_xgb.predict(x_val)

# model performance
mae = metrics.mean_absolute_error(y_val, yhat_xgb)
mape = metrics.mean_absolute_percentage_error(y_val, yhat_xgb) 
rmse = np.sqrt(metrics.mean_squared_error(y_val, yhat_xgb))

print('MAE: {} | MAPE: {} | RMSE: {}'.format(mae, mape, rmse))

MAE: 32.98867952536164 | MAPE: 0.2882799738677832 | RMSE: 93.09754396532945


In [295]:
result_xgb = metrics_cv(model_xgb, X, y, 'XGB Regressor')
result_xgb

Unnamed: 0,Model,MAE,MAPE,MSE
0,XGB Regressor,29.227 +/- 0.605,0.282 +/- 0.009,71.862 +/- 9.315


## 7.5. Results

In [296]:
result = pd.concat([result_baseline, result_lr, result_rf, result_xgb])
result

Unnamed: 0,Model,MAE,MAPE,MSE
0,Average Model,65.513 +/- 2.704,0.856 +/- 0.034,129.141 +/- 15.399
0,LinearRegression,40.399 +/- 1.879,0.403 +/- 0.01,105.04 +/- 15.353
0,RandomForestRegressor,25.611 +/- 0.621,0.227 +/- 0.007,71.749 +/- 7.886
0,XGB Regressor,29.227 +/- 0.605,0.282 +/- 0.009,71.862 +/- 9.315
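
Because `metrics_cv` returns each metric as a `'mean +/- std'` string, the combined table cannot be sorted numerically as it stands. The sketch below rebuilds the table from the values shown above and ranks the models by mean cross-validated MAE. The helper `mean_of` and the column `MAE_mean` are hypothetical names. The column labelled `MSE` is left out here because the helper computes it with `neg_root_mean_squared_error`, so it actually holds RMSE values.

```python
import pandas as pd

# Values copied from the results table above.
result = pd.DataFrame({
    'Model': ['Average Model', 'LinearRegression', 'RandomForestRegressor', 'XGB Regressor'],
    'MAE': ['65.513 +/- 2.704', '40.399 +/- 1.879', '25.611 +/- 0.621', '29.227 +/- 0.605'],
    'MAPE': ['0.856 +/- 0.034', '0.403 +/- 0.01', '0.227 +/- 0.007', '0.282 +/- 0.009'],
})


def mean_of(cell):
    """Extract the mean from a 'mean +/- std' string (hypothetical helper)."""
    return float(cell.split('+/-')[0])


# Add a numeric column and sort so that the lowest mean MAE comes first.
ranked = (result.assign(MAE_mean=result['MAE'].map(mean_of))
                .sort_values('MAE_mean')
                .reset_index(drop=True))

print(ranked[['Model', 'MAE_mean']])
```

On these numbers the random forest has the lowest mean MAE and MAPE, while its RMSE is nearly identical to that of XGBoost.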


# 8.0. Hyperparameter Fine Tuning

# 9.0. Model Performance

# 10.0. Deploy to Production