# <font color='Blue'> Solution Planning </font>

## Input
### Business Problem
1. What does a potential residential buyer consider before buying a house?
(Location, property size, proximity to offices, schools, parks, restaurants and hospitals, or price.)
2. A dataset describing properties in Bengaluru.

## Output

1. Chart(s) showing the most important features when buying a property.
2. A model to predict house prices in Bengaluru.

## Tasks

1. What does a potential residential buyer consider before buying a house?
* Clean the data
* Create features
* Formulate business hypotheses about the features
* Run EDA to validate or refute the business hypotheses.






# 0.0. Imports

In [1]:
import re
import pandas as pd
import numpy as np
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt

from boruta import BorutaPy
from geopy.geocoders import Nominatim
from pandas_profiling import ProfileReport

from sklearn import model_selection as ms
from sklearn import preprocessing as pp
from sklearn import dummy
from sklearn import metrics
from sklearn import linear_model as lm
from sklearn import ensemble as en
from sklearn.model_selection import cross_val_score

## 0.1. Helper Function

In [2]:
def metrics_cv(model, X, y, model_name='not defined', kfold=5):
    # these scorers return negated errors, so flip the sign back
    mae = -cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=kfold)
    mape = -cross_val_score(model, X, y, scoring='neg_mean_absolute_percentage_error', cv=kfold)
    # this scorer is the *root* mean squared error, so it is reported as RMSE
    rmse = -cross_val_score(model, X, y, scoring='neg_root_mean_squared_error', cv=kfold)

    # mean +/- standard deviation across the folds
    dictionary = {
        'Model': model_name,
        'MAE': f'{round(np.mean(mae), 3)} +/- {round(np.std(mae), 3)}',
        'MAPE': f'{round(np.mean(mape), 3)} +/- {round(np.std(mape), 3)}',
        'RMSE': f'{round(np.mean(rmse), 3)} +/- {round(np.std(rmse), 3)}'
    }

    return pd.DataFrame(dictionary, index=[0])


def descriptive_statistics(num_attr):
    # Central Tendency: mean, median
    c1 = pd.DataFrame(num_attr.apply(np.mean))
    c2 = pd.DataFrame(num_attr.apply(np.median))

    # Dispersion: min, max, range, std
    d1 = pd.DataFrame(num_attr.apply(min))
    d2 = pd.DataFrame(num_attr.apply(max))
    d3 = pd.DataFrame(num_attr.apply(lambda x: x.max() - x.min()))
    d4 = pd.DataFrame(num_attr.apply(lambda x: x.std()))
    
    # Measures of Shape
    s1 = pd.DataFrame(num_attr.apply(lambda x: x.skew()))
    s2 = pd.DataFrame(num_attr.apply(lambda x: x.kurtosis()))

    # concat
    m = pd.concat([d1,d2,d3,c1,c2,d4,s1,s2], axis=1).reset_index()
    m.columns = ['attributes', 'min', 'max', 'range', 'mean', 'median', 'std', 'skew', 'kurtosis']
    return m
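
`metrics_cv` calls `cross_val_score` three times, so every fold is refitted once per metric. A minimal sketch of a single-pass variant, assuming scikit-learn's `cross_validate` with a dictionary of scorers; the name `metrics_cv_single_pass` and the synthetic demo data are hypothetical and not part of the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_validate


def metrics_cv_single_pass(model, X, y, model_name='not defined', kfold=5):
    # one cross-validation run that evaluates all three scorers per fold
    scoring = {
        'MAE': 'neg_mean_absolute_error',
        'MAPE': 'neg_mean_absolute_percentage_error',
        'RMSE': 'neg_root_mean_squared_error',
    }
    cv_result = cross_validate(model, X, y, scoring=scoring, cv=kfold)

    row = {'Model': model_name}
    for name in scoring:
        # scorers are negated errors, so flip the sign back
        scores = -cv_result[f'test_{name}']
        row[name] = f'{round(np.mean(scores), 3)} +/- {round(np.std(scores), 3)}'
    return pd.DataFrame(row, index=[0])


# synthetic data, only to show the call
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = rng.uniform(10, 100, size=50)
demo = metrics_cv_single_pass(DummyRegressor(strategy='median'), X_demo, y_demo, 'demo')
```

The returned frame has the same one-row layout as `metrics_cv`, so it could be concatenated with other results in the same way.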


## 0.2. Load Data

In [3]:
df_raw = pd.read_csv('../data/Bengaluru_House_Data.csv')

# 1.0. Data Description

In [4]:
df1 = df_raw.copy()

*  **Area_type** - describes the area
*  **Availability** - when the property can be possessed or when it is ready (categorical and time-series)
*  **Location** - where it is located in Bengaluru
*  **Price** - value of the property in lakhs (INR)
*  **Size** - in BHK or Bedroom (1-10 or more)
*  **Society** - the society it belongs to
*  **Total_sqft** - size of the property in sq. ft
*  **Bath** - number of bathrooms
*  **Balcony** - number of balconies

## 1.1. Data dimensions

In [5]:
print('Number of Rows: {}'.format(df1.shape[0]))
print('Number of Columns: {}'.format(df1.shape[1]))

Number of Rows: 13320
Number of Columns: 9


## 1.2. Data Types

In [6]:
df1.dtypes

area_type        object
availability     object
location         object
size             object
society          object
total_sqft       object
bath            float64
balcony         float64
price           float64
dtype: object

## 1.3. Check NA

In [7]:
df1.isna().mean() *100

area_type        0.000000
availability     0.000000
location         0.007508
size             0.120120
society         41.306306
total_sqft       0.000000
bath             0.548048
balcony          4.572072
price            0.000000
dtype: float64

## 1.4. Replace NA

In [8]:
# drop rows with missing values in these columns
# open question: balcony has about 4% missing values; drop the column instead?
df1 = df1.dropna(subset=['size', 'location', 'bath', 'balcony'])

## 1.5. Change Dtypes

In [9]:
df1.dtypes

area_type        object
availability     object
location         object
size             object
society          object
total_sqft       object
bath            float64
balcony         float64
price           float64
dtype: object

## 1.6. Descriptive Statistics

In [10]:
num_att = df1.select_dtypes(include=['int64', 'float64'])
cat_att = df1.select_dtypes(exclude=['int64', 'float64'])

### 1.6.1. Numerical Attributes

In [11]:
descriptive_statistics(num_att)

| | attributes | min | max | range | mean | median | std | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|---|
| 0 | bath | 1.0 | 40.0 | 39.0 | 2.617309 | 2.0 | 1.226 | 4.590497 | 85.455663 |
| 1 | balcony | 0.0 | 3.0 | 3.0 | 1.584343 | 2.0 | 0.817287 | 0.005966 | -0.544247 |
| 2 | price | 8.0 | 2912.0 | 2904.0 | 106.060778 | 70.0 | 131.766089 | 7.875011 | 107.376164 |


**Note:**
1. bath: 40 bathrooms?

### 1.6.2. Categorical Attributes

In [12]:
cat_att.describe(include=['object'])

| | area_type | availability | location | size | society | total_sqft |
|---|---|---|---|---|---|---|
| count | 12710 | 12710 | 12710 | 12710 | 7496 | 12710 |
| unique | 4 | 78 | 1265 | 27 | 2592 | 1976 |
| top | Super built-up Area | Ready To Move | Whitefield | 2 BHK | GrrvaGr | 1200 |
| freq | 8481 | 10077 | 514 | 5152 | 80 | 788 |


In [13]:
cat_att['area_type'].value_counts(normalize=True)

Super built-up  Area    0.667270
Built-up  Area          0.181747
Plot  Area              0.144532
Carpet  Area            0.006452
Name: area_type, dtype: float64

In [14]:
cat_att['availability'].value_counts(normalize=True).head(10)

Ready To Move    0.792840
18-Dec           0.022895
18-May           0.022187
18-Apr           0.020535
18-Aug           0.015736
19-Dec           0.014319
18-Jul           0.011015
18-Mar           0.009284
20-Dec           0.007710
18-Jun           0.007553
Name: availability, dtype: float64

In [15]:
cat_att['location'].value_counts(normalize=True).head(10)

Whitefield               0.040441
Sarjapur  Road           0.029268
Electronic City          0.023603
Kanakpura Road           0.020535
Thanisandra              0.018175
Yelahanka                0.016208
Uttarahalli              0.014634
Hebbal                   0.013611
Raja Rajeshwari Nagar    0.013218
Marathahalli             0.012903
Name: location, dtype: float64

In [16]:
cat_att['size'].value_counts(normalize=True).head(10)

2 BHK        0.405350
3 BHK        0.324784
4 Bedroom    0.058930
1 BHK        0.041699
3 Bedroom    0.041463
4 BHK        0.038474
2 Bedroom    0.025806
5 Bedroom    0.020692
6 Bedroom    0.013297
1 Bedroom    0.008261
Name: size, dtype: float64

In [17]:
# entries of total_sqft that are not made of digits only (ranges, decimals, other units)
dirt = df1.loc[~df1['total_sqft'].apply(lambda x: bool(re.search(r'^[0-9]+$', x))), 'total_sqft']

len(dirt)

272

In [18]:
dirt.tolist()[:10]

['2100 - 2850',
 '1330.74',
 '3067 - 8156',
 '1042 - 1105',
 '1563.05',
 '1145 - 1340',
 '1015 - 1540',
 '2023.71',
 '1113.27',
 '34.46Sq. Meter']

**Note**:

1. availability: categorical column with 78 distinct values (Ready To Move: 79%).
2. location: extract latitude and longitude, then drop the column.
3. size: clean the column and convert it to integer.
4. society: drop it, little information.
5. total_sqft: 272 entries that are not purely numeric.

# 2.0. Feature Engineering

In [19]:
# df2 = df1.copy()

## 2.1. Feature Creation

In [20]:
# %%time 

# location -> lat and lon

# geolocator = Nominatim(user_agent="geoapiExercises")

# def location_lat(x):
#     if geolocator.geocode(x, timeout=None):
#         return geolocator.geocode(x, timeout=None).raw['lat']
#     else: 
#         return x

# df2['lat'] = df2['location'].apply(location_lat)

# def location_lon(x):
#     if geolocator.geocode(x, timeout=None):
#         return geolocator.geocode(x, timeout=None).raw['lon']
#     else: 
#         return x

# df2['lon'] = df2['location'].apply(location_lon)

# df2.to_csv('bengaluru_house_data_lat_lon.csv', index=False)

df2 = pd.read_csv('../data/bengaluru_house_data_lat_lon.csv')
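
The commented-out cell above geocodes every row twice (once for the check, once for the value) and returns the location name itself when nothing is found, which is why `lat` and `lon` later need a regex filter. A minimal sketch of an alternative, assuming one lookup per unique location and `None` for misses; `geocode_unique`, `fake_lookup` and the coordinates are hypothetical stand-ins so the example runs without network access:

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinates = Optional[Tuple[float, float]]


def geocode_unique(locations: Iterable[str],
                   lookup: Callable[[str], Coordinates]) -> Dict[str, Coordinates]:
    # call the (slow) lookup only once per distinct location name
    cache: Dict[str, Coordinates] = {}
    for name in locations:
        if name not in cache:
            cache[name] = lookup(name)
    return cache


def fake_lookup(name: str) -> Coordinates:
    # hypothetical stand-in for a real geocoder; the coordinates are illustrative only
    table = {'Whitefield': (12.97, 77.75)}
    return table.get(name)  # None when the place is unknown


calls = []


def counting_lookup(name: str) -> Coordinates:
    # records each request so the caching effect is visible
    calls.append(name)
    return fake_lookup(name)


coords = geocode_unique(['Whitefield', 'Whitefield', 'Unknown Place'], counting_lookup)
```

In the notebook, `lookup` would presumably wrap `geolocator.geocode` and read `lat` and `lon` from its result. With 1265 distinct locations against roughly 12,700 rows, caching should cut the number of requests considerably.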

In [21]:
# size -> qt_bedroom

df2['qt_bedroom'] = df2['size'].apply(lambda x: str(x).split()[0]).astype(int)

# 3.0. Data Filtering

In [22]:
df3 = df2.copy()

## 3.1. Filter Rows

In [23]:
# lat and lon: keep only values that look like decimal coordinates;
# anything else (e.g. a location name left behind by a failed lookup) becomes NaN
coord_pattern = r'^[0-9-][0-9][\.]+'
is_lat = df3['lat'].apply(lambda x: bool(re.search(coord_pattern, x)))
is_lon = df3['lon'].apply(lambda x: bool(re.search(coord_pattern, x)))

df3_lat_dirt = df3.loc[~is_lat, :]
df3_lon_dirt = df3.loc[~is_lon, :]

# assignment aligns on the index, so rows outside the mask are filled with NaN
df3['lat'] = df3.loc[is_lat, 'lat'].astype(float)
df3['lon'] = df3.loc[is_lon, 'lon'].astype(float)

In [24]:
# rows whose total_sqft is not a plain integer (ranges, decimals, other units)
df3_total_sqft_dirt = df3.loc[~df3['total_sqft'].apply(lambda x: bool(re.search(r'^([\s\d]+)$', x))), :]

In [25]:
# total_sqft
df3_total_sqft_dirt = df3.loc[~df3['total_sqft'].apply(lambda x: bool(re.search(r'^([\s\d]+)$', x)) ), :]

df3['total_sqft'] = df3.loc[df3['total_sqft'].apply(lambda x: bool(re.search(r'^([\s\d]+)$', x))), 'total_sqft'].astype(int)
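
The filter above turns every non-integer `total_sqft` into NaN, including decimals such as `1330.74` and ranges such as `2100 - 2850`. A minimal sketch of a more lenient parser, assuming that a range can be represented by its midpoint and that values in other units are still discarded; `parse_total_sqft` is a hypothetical helper, not the notebook's method:

```python
import re
from typing import Optional

# "1200" or "1330.74"
NUMBER_RE = re.compile(r'^\s*\d+(?:\.\d+)?\s*$')
# "2100 - 2850"
RANGE_RE = re.compile(r'^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*$')


def parse_total_sqft(value) -> Optional[float]:
    text = str(value)
    if NUMBER_RE.match(text):
        return float(text)
    match = RANGE_RE.match(text)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        # assumption: the midpoint is an acceptable summary of a range
        return (low + high) / 2
    # other units (e.g. "34.46Sq. Meter") are left unparsed
    return None


samples = ['1200', '1330.74', '2100 - 2850', '34.46Sq. Meter']
parsed = [parse_total_sqft(s) for s in samples]
# parsed == [1200.0, 1330.74, 2475.0, None]
```

Applied with `df3['total_sqft'].apply(parse_total_sqft)`, this would keep most of the 272 flagged entries instead of dropping them; whether the midpoint is appropriate is a modelling decision.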

## 3.2. Filter Columns

In [26]:
df3.head()

| | area_type | availability | location | size | society | total_sqft | bath | balcony | price | lat | lon | qt_bedroom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Super built-up Area | 19-Dec | Electronic City Phase II | 2 BHK | Coomee | 1056.0 | 2.0 | 1.0 | 39.07 | 12.846854 | 77.676927 | 2 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | Theanmp | 2600.0 | 5.0 | 3.0 | 120.0 | 12.895768 | 77.867101 | 4 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK | NaN | 1440.0 | 2.0 | 3.0 | 62.0 | 12.905568 | 77.545544 | 3 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | Soiewre | 1521.0 | 3.0 | 1.0 | 95.0 | NaN | NaN | 3 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK | NaN | 1200.0 | 2.0 | 1.0 | 51.0 | 12.580537 | 77.333067 | 2 |


In [27]:
drop_cols = ['location', 'society', 'size']
df3 = df3.drop(drop_cols, axis=1)
df3.isna().mean()

area_type       0.000000
availability    0.000000
total_sqft      0.021400
bath            0.000000
balcony         0.000000
price           0.000000
lat             0.112274
lon             0.189142
qt_bedroom      0.000000
dtype: float64

In [29]:
df3 = df3.dropna()
df3.isna().mean()

area_type       0.0
availability    0.0
total_sqft      0.0
bath            0.0
balcony         0.0
price           0.0
lat             0.0
lon             0.0
qt_bedroom      0.0
dtype: float64

In [30]:
round((1 - (df3.shape[0] / df_raw.shape[0])) * 100, 2)

24.53

# 4.0. EDA

In [31]:
df4 = df3.copy()

## 4.1. Univariate Analysis

In [32]:
# profile = ProfileReport(df4, title='Analysis - Bengaluru House')
# profile.to_file('../reports/figures/output_v1.html')

### total_sqft

In [33]:
df4.sort_values('total_sqft').head(10)

| | area_type | availability | total_sqft | bath | balcony | price | lat | lon | qt_bedroom |
|---|---|---|---|---|---|---|---|---|---|
| 4723 | Built-up Area | Ready To Move | 5.0 | 7.0 | 3.0 | 115.0 | 17.43458 | 82.715203 | 7 |
| 333 | Plot Area | 18-Dec | 11.0 | 3.0 | 2.0 | 74.0 | 12.739777 | 77.668954 | 3 |
| 968 | Carpet Area | Ready To Move | 15.0 | 1.0 | 0.0 | 30.0 | 12.845495 | 77.583129 | 1 |
| 5671 | Plot Area | Ready To Move | 45.0 | 1.0 | 0.0 | 23.0 | 12.387214 | 76.666963 | 1 |
| 12615 | Super built-up Area | Ready To Move | 250.0 | 2.0 | 2.0 | 40.0 | 28.636548 | 77.096496 | 1 |
| 111 | Plot Area | Ready To Move | 276.0 | 3.0 | 3.0 | 23.0 | 13.025809 | 77.630507 | 2 |
| 10029 | Super built-up Area | Ready To Move | 284.0 | 1.0 | 1.0 | 8.0 | 13.097804 | 77.581189 | 1 |
| 4607 | Carpet Area | Ready To Move | 300.0 | 1.0 | 1.0 | 20.0 | 12.954674 | 77.512172 | 1 |
| 10955 | Super built-up Area | Ready To Move | 302.0 | 2.0 | 1.0 | 25.0 | 13.002735 | 77.570325 | 2 |
| 942 | Plot Area | Ready To Move | 315.0 | 4.0 | 2.0 | 90.0 | 18.014228 | 79.552624 | 4 |


In [34]:
df4.sort_values('total_sqft', ascending=False).head(10)

| | area_type | availability | total_sqft | bath | balcony | price | lat | lon | qt_bedroom |
|---|---|---|---|---|---|---|---|---|---|
| 1794 | Plot Area | Ready To Move | 52272.0 | 2.0 | 1.0 | 140.0 | 13.095302 | 77.396359 | 3 |
| 5121 | Super built-up Area | Ready To Move | 42000.0 | 8.0 | 3.0 | 175.0 | 13.064967 | 77.562966 | 9 |
| 5194 | Super built-up Area | Ready To Move | 36000.0 | 4.0 | 2.0 | 450.0 | 12.977879 | 77.62467 | 4 |
| 641 | Built-up Area | Ready To Move | 35000.0 | 3.0 | 3.0 | 130.0 | 13.100698 | 77.596345 | 3 |
| 12396 | Plot Area | Ready To Move | 30400.0 | 4.0 | 2.0 | 1824.0 | 12.97077 | 77.744557 | 6 |
| 6884 | Plot Area | Ready To Move | 26136.0 | 1.0 | 0.0 | 150.0 | 13.100698 | 77.596345 | 1 |
| 1173 | Plot Area | Ready To Move | 14000.0 | 3.0 | 2.0 | 800.0 | 14.340956 | 74.892425 | 4 |
| 577 | Super built-up Area | 19-Jan | 12000.0 | 7.0 | 3.0 | 2200.0 | 13.002735 | 77.570325 | 7 |
| 390 | Super built-up Area | 19-Jan | 12000.0 | 6.0 | 3.0 | 2200.0 | 18.014228 | 79.552624 | 7 |
| 2489 | Super built-up Area | Ready To Move | 11338.0 | 9.0 | 1.0 | 1000.0 | 12.973461 | 77.751115 | 6 |


**Notes:**
1. **total_sqft**: properties with a total_sqft of 5, 11, 15 or 45?

### bath

In [35]:
df4.sort_values('bath', ascending=False).head(10)

| | area_type | availability | total_sqft | bath | balcony | price | lat | lon | qt_bedroom |
|---|---|---|---|---|---|---|---|---|---|
| 1877 | Plot Area | Ready To Move | 990.0 | 12.0 | 0.0 | 120.0 | 12.901368 | 77.632057 | 8 |
| 1681 | Plot Area | Ready To Move | 1200.0 | 11.0 | 0.0 | 170.0 | 13.012022 | 77.677782 | 11 |
| 2072 | Plot Area | Ready To Move | 1500.0 | 10.0 | 3.0 | 165.0 | 12.908793 | 77.604554 | 8 |
| 6211 | Plot Area | Ready To Move | 1800.0 | 10.0 | 3.0 | 185.0 | 13.073014 | 77.792138 | 9 |
| 11500 | Plot Area | Ready To Move | 1200.0 | 10.0 | 2.0 | 180.0 | 18.014228 | 79.552624 | 8 |
| 5555 | Super built-up Area | Ready To Move | 4700.0 | 10.0 | 3.0 | 130.0 | 13.406193 | 75.249867 | 9 |
| 6373 | Plot Area | Ready To Move | 1020.0 | 10.0 | 0.0 | 155.0 | 12.580537 | 77.333067 | 8 |
| 9982 | Plot Area | Ready To Move | 1200.0 | 9.0 | 3.0 | 230.0 | 17.24653 | 80.143697 | 9 |
| 4329 | Plot Area | Ready To Move | 1800.0 | 9.0 | 1.0 | 180.0 | 12.982362 | 77.522638 | 9 |
| 10795 | Plot Area | Ready To Move | 3280.0 | 9.0 | 3.0 | 450.0 | 13.104198 | 77.617085 | 10 |


### balcony

In [36]:
df4['balcony'].value_counts()

2.0    4142
1.0    3874
3.0    1326
0.0     710
Name: balcony, dtype: int64

### price

In [37]:
df4.sort_values('price', ascending=False).head(10)

| | area_type | availability | total_sqft | bath | balcony | price | lat | lon | qt_bedroom |
|---|---|---|---|---|---|---|---|---|---|
| 10558 | Super built-up Area | 18-Jan | 8321.0 | 5.0 | 2.0 | 2912.0 | 13.040073 | 80.215925 | 4 |
| 12600 | Plot Area | Ready To Move | 8000.0 | 6.0 | 3.0 | 2800.0 | 34.01125 | 71.536452 | 6 |
| 11214 | Plot Area | Ready To Move | 9600.0 | 7.0 | 2.0 | 2736.0 | 15.331903 | 75.12647 | 5 |
| 9816 | Plot Area | Ready To Move | 10624.0 | 4.0 | 2.0 | 2340.0 | 12.929507 | 77.580165 | 4 |
| 6103 | Plot Area | 18-Sep | 2940.0 | 3.0 | 2.0 | 2250.0 | 13.26871 | 76.641051 | 4 |
| 577 | Super built-up Area | 19-Jan | 12000.0 | 7.0 | 3.0 | 2200.0 | 13.002735 | 77.570325 | 7 |
| 390 | Super built-up Area | 19-Jan | 12000.0 | 6.0 | 3.0 | 2200.0 | 18.014228 | 79.552624 | 7 |
| 8119 | Plot Area | Ready To Move | 7800.0 | 3.0 | 2.0 | 2000.0 | 15.346072 | 75.116714 | 3 |
| 6954 | Plot Area | 18-Apr | 11000.0 | 5.0 | 3.0 | 2000.0 | 12.946651 | 77.676065 | 4 |
| 12396 | Plot Area | Ready To Move | 30400.0 | 4.0 | 2.0 | 1824.0 | 12.97077 | 77.744557 | 6 |


In [38]:
df4.sort_values('price', ascending=True).head(10)

| | area_type | availability | total_sqft | bath | balcony | price | lat | lon | qt_bedroom |
|---|---|---|---|---|---|---|---|---|---|
| 10029 | Super built-up Area | Ready To Move | 284.0 | 1.0 | 1.0 | 8.0 | 13.097804 | 77.581189 | 1 |
| 8163 | Built-up Area | Ready To Move | 450.0 | 1.0 | 1.0 | 9.0 | 17.443639 | 77.433391 | 1 |
| 5138 | Super built-up Area | Ready To Move | 400.0 | 1.0 | 1.0 | 10.0 | 12.778259 | 77.771283 | 1 |
| 10569 | Built-up Area | Ready To Move | 410.0 | 1.0 | 1.0 | 10.0 | 12.778259 | 77.771283 | 1 |
| 1396 | Built-up Area | 18-Mar | 340.0 | 1.0 | 1.0 | 10.0 | 12.917657 | 77.483757 | 1 |
| 12007 | Super built-up Area | Ready To Move | 410.0 | 1.0 | 1.0 | 10.0 | 17.443639 | 77.433391 | 1 |
| 7111 | Super built-up Area | Ready To Move | 470.0 | 2.0 | 1.0 | 10.0 | 15.428596 | 77.261334 | 1 |
| 11395 | Super built-up Area | Ready To Move | 400.0 | 1.0 | 1.0 | 10.25 | 12.778259 | 77.771283 | 1 |
| 2313 | Built-up Area | Ready To Move | 395.0 | 1.0 | 1.0 | 10.25 | 12.778259 | 77.771283 | 1 |
| 8217 | Plot Area | Ready To Move | 640.0 | 1.0 | 0.0 | 10.5 | 13.292958 | 77.543146 | 2 |


# 5.0. Data Preparation

In [39]:
df5 = df4.copy()

## 5.1. Standardization

## 5.2. Rescaling

In [40]:
mms = pp.MinMaxScaler()

In [41]:
df5['total_sqft'] = mms.fit_transform(df5[['total_sqft']])
df5['bath'] = mms.fit_transform(df5[['bath']])
df5['balcony'] = mms.fit_transform(df5[['balcony']])
df5['lat'] = mms.fit_transform(df5[['lat']])
df5['lon'] = mms.fit_transform(df5[['lon']])
df5['qt_bedroom'] = mms.fit_transform(df5[['qt_bedroom']])
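
The rescaling above is fitted on the full DataFrame, while the train/validation split only happens in section 6.0, so the validation rows contribute to each column's min and max. Reusing one `mms` object also means only the last fit (`qt_bedroom`) survives for later use. A minimal sketch of the usual alternative, fitting on the training rows only; the synthetic column stands in for `total_sqft` and is not taken from the dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in for one numeric column such as total_sqft
rng = np.random.default_rng(42)
X_demo = rng.uniform(300, 5000, size=(100, 1))

X_train, X_val = train_test_split(X_demo, test_size=0.2, random_state=42)

# learn min and max from the training rows only
scaler = MinMaxScaler().fit(X_train)

# apply the same fitted transformation to both splits
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)
```

Validation values may then fall slightly outside [0, 1], which is expected. Keeping one fitted scaler per column (or a single scaler fitted on all numeric columns at once) would also make the transformation reusable at deployment time.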

## 5.3. Encoding



In [42]:
# Frequency encoder: map each category to its relative frequency
me_area_type = dict(df5['area_type'].value_counts(normalize=True))
df5['area_type'] = df5['area_type'].map(me_area_type)

me_availability = dict(df5['availability'].value_counts(normalize=True))
df5['availability'] = df5['availability'].map(me_availability)
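
The cell above maps each category to its relative frequency (`value_counts(normalize=True)`), which is usually called frequency encoding; a mean (target) encoder would instead map each category to the average `price` of that category. A minimal sketch of the difference on a made-up four-row frame (the values are illustrative, not from the dataset):

```python
import pandas as pd

# made-up rows, only to contrast the two encodings
demo = pd.DataFrame({
    'area_type': ['Plot Area', 'Plot Area', 'Carpet Area', 'Plot Area'],
    'price': [100.0, 200.0, 50.0, 300.0],
})

# frequency encoding: share of rows per category
freq_map = demo['area_type'].value_counts(normalize=True).to_dict()

# mean (target) encoding: average target value per category
mean_map = demo.groupby('area_type')['price'].mean().to_dict()

demo['area_type_freq'] = demo['area_type'].map(freq_map)
demo['area_type_mean'] = demo['area_type'].map(mean_map)
```

If mean encoding were used for the model, the mapping would need to be computed on the training rows only, since it is derived from the target.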

In [43]:
df5.head()

| | area_type | availability | total_sqft | bath | balcony | price | lat | lon | qt_bedroom |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.674294 | 0.013331 | 0.020108 | 0.090909 | 0.333333 | 39.07 | 0.254096 | 0.790671 | 0.090909 |
| 1 | 0.14226 | 0.8076 | 0.049649 | 0.363636 | 1.0 | 120.0 | 0.254755 | 0.792458 | 0.272727 |
| 2 | 0.178273 | 0.8076 | 0.027455 | 0.090909 | 1.0 | 62.0 | 0.254887 | 0.789437 | 0.181818 |
| 4 | 0.674294 | 0.8076 | 0.022863 | 0.090909 | 0.333333 | 51.0 | 0.25051 | 0.78744 | 0.090909 |
| 6 | 0.674294 | 0.8076 | 0.024968 | 0.181818 | 0.333333 | 63.25 | 0.255556 | 0.790873 | 0.181818 |


# 6.0. Feature Selection

In [44]:
df6 = df5.copy()

In [45]:
X = df5.drop(['price',], axis=1)
y = df5['price'].copy()

x_train, x_val, y_train, y_val = ms.train_test_split(X, y, test_size=0.2, random_state=42)

## 6.1. Boruta as Feature Selection

In [46]:
x_boruta = X.copy().values
y_boruta = y.to_numpy()  # 1-D array; Series.ravel() is deprecated in recent pandas

rf = en.RandomForestRegressor(n_estimators=300)

feat_selector_boruta = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=42).fit(x_boruta, y_boruta)

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	8
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	8
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	8
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	8
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	8
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	8
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	8
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	4
Tentative: 	0
Rejected: 	4


BorutaPy finished running.

Iteration: 	9 / 100
Confirmed: 	4
Tentative: 	0
Rejected: 	4


In [47]:
feat_ranking = feat_selector_boruta.ranking_
feat_selector = feat_selector_boruta.support_
columns_name = X.columns

In [48]:
df_boruta_ranking = pd.DataFrame({'ranking': feat_ranking, 
                                   'selected_boruta':feat_selector}, 
                                    index=columns_name).sort_values('ranking')

df_boruta_ranking

| | ranking | selected_boruta |
|---|---|---|
| total_sqft | 1 | True |
| bath | 1 | True |
| lat | 1 | True |
| lon | 1 | True |
| availability | 2 | False |
| area_type | 3 | False |
| qt_bedroom | 3 | False |
| balcony | 5 | False |


## 6.2. Feature Importance

In [49]:
x_fi = X.copy().values
y_fi = y.to_numpy()  # 1-D array; Series.ravel() is deprecated in recent pandas

forest = en.RandomForestRegressor().fit(x_fi, y_fi)

In [50]:
feature_importances = forest.feature_importances_
columns_names = X.columns

In [51]:
df_fi_ranking = pd.DataFrame({'Feature Importance': feature_importances}, 
                             index=columns_names).sort_values('Feature Importance', ascending=False)
df_fi_ranking

| | Feature Importance |
|---|---|
| total_sqft | 0.604893 |
| lat | 0.112347 |
| lon | 0.089319 |
| bath | 0.068094 |
| availability | 0.048662 |
| qt_bedroom | 0.030494 |
| area_type | 0.026467 |
| balcony | 0.019723 |


## 6.3. Manual Selection

In [52]:
cols_selected = ['total_sqft', 'lat', 'lon', 'bath', 'availability', 'qt_bedroom']
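
The manual list coincides with the six highest random-forest importances from section 6.2, and it contains the four features Boruta confirmed in section 6.1. A minimal sketch that reproduces it from the recorded values; the cut-off of six is an assumption inferred from the list, not a rule stated in the notebook:

```python
# importances copied from the output of section 6.2
feature_importance = {
    'total_sqft': 0.604893,
    'lat': 0.112347,
    'lon': 0.089319,
    'bath': 0.068094,
    'availability': 0.048662,
    'qt_bedroom': 0.030494,
    'area_type': 0.026467,
    'balcony': 0.019723,
}

# features Boruta marked as selected in section 6.1
boruta_confirmed = {'total_sqft', 'bath', 'lat', 'lon'}

# rank by importance and keep the top six (assumed cut-off)
ranked = sorted(feature_importance, key=feature_importance.get, reverse=True)
top_six = ranked[:6]

# every Boruta-confirmed feature is part of the selection
boruta_covered = boruta_confirmed.issubset(top_six)
```

Note that random-forest importances vary slightly between runs, so the ranking of the weaker features is not guaranteed to be stable.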

# 7.0. Model Training

In [53]:
x_val = x_val[cols_selected].copy()
x_train = x_train[cols_selected].copy()

X = X[cols_selected].copy()

## 7.1. Average Model

In [54]:
# model definition and fit
# note: strategy='median' makes this baseline always predict the training median
model_baseline = dummy.DummyRegressor(strategy='median').fit(x_train, y_train)

# model predict
yhat_baseline = model_baseline.predict(x_val)

# model performance
mae = metrics.mean_absolute_error(y_val, yhat_baseline)
mape = metrics.mean_absolute_percentage_error(y_val, yhat_baseline)
rmse = np.sqrt(metrics.mean_squared_error(y_val, yhat_baseline))

print('MAE: {} | MAPE: {} | RMSE: {}'.format(mae, mape, rmse))

MAE: 58.43193933366484 | MAPE: 0.49402897534467827 | RMSE: 140.02660227277417
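
To make the three numbers easier to read, here is a minimal worked example of the same metrics for a median baseline, written in plain Python on made-up prices (not taken from the dataset). As in scikit-learn, MAPE is a fraction, so the 0.494 above corresponds to about 49%:

```python
import math
from statistics import median

# made-up prices in lakhs, only for illustration
y_train_demo = [40.0, 60.0, 80.0, 200.0]
y_val_demo = [50.0, 100.0]

# the median baseline predicts the same value for every row
prediction = median(y_train_demo)  # 70.0
errors = [y - prediction for y in y_val_demo]  # [-20.0, 30.0]

# mean absolute error: (20 + 30) / 2
mae_demo = sum(abs(e) for e in errors) / len(errors)

# mean absolute percentage error: (20/50 + 30/100) / 2
mape_demo = sum(abs(e) / abs(y) for e, y in zip(errors, y_val_demo)) / len(errors)

# root mean squared error: sqrt((400 + 900) / 2)
rmse_demo = math.sqrt(sum(e * e for e in errors) / len(errors))
```

RMSE penalises large errors more strongly than MAE, which is consistent with the large gap between the two on this data, where `price` is heavily right-skewed.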


In [55]:
#1 MAE: 69.18655009838794 | MAPE: 0.9189057519989557 | RMSE: 130.53405853555498
#2 MAE: 66.74150511881209 | MAPE: 0.8464271363888621 | RMSE: 134.87690120642173

In [56]:
result_baseline = metrics_cv(model_baseline, X, y, 'Average Model')
result_baseline

| | Model | MAE | MAPE | RMSE |
|---|---|---|---|---|
| 0 | Average Model | 57.276 +/- 3.067 | 0.496 +/- 0.018 | 135.868 +/- 15.607 |


## 7.2. Linear Regression Model

In [57]:
# model definition and fit
model_lr = lm.LinearRegression().fit(x_train, y_train)

# model predict
yhat_lr = model_lr.predict(x_val)

# model performance
mae = metrics.mean_absolute_error(y_val, yhat_lr)
mape = metrics.mean_absolute_percentage_error(y_val, yhat_lr)
rmse = np.sqrt(metrics.mean_squared_error(y_val, yhat_lr))

print('MAE: {} | MAPE: {} | RMSE: {}'.format(mae, mape, rmse))

MAE: 43.60025400131541 | MAPE: 0.4321635307115498 | RMSE: 121.23072615127244


In [58]:
#1 MAE: 51.39237039425772 | MAPE: 0.5167872441161455 | RMSE: 113.84866787174654
#2 MAE: 43.14026377555565 | MAPE: 0.4230519670345765 | RMSE: 121.40008663415217

In [59]:
result_lr = metrics_cv(model_lr, X, y, 'LinearRegression')
result_lr

| | Model | MAE | MAPE | RMSE |
|---|---|---|---|---|
| 0 | LinearRegression | 43.298 +/- 1.967 | 0.439 +/- 0.01 | 108.275 +/- 14.817 |


## 7.3. Random Forest Model

In [60]:
# model definition and fit
model_rf = en.RandomForestRegressor().fit(x_train, y_train)

# model predict
yhat_rf = model_rf.predict(x_val)

# model performance
mae = metrics.mean_absolute_error(y_val, yhat_rf)
mape = metrics.mean_absolute_percentage_error(y_val, yhat_rf) 
rmse = np.sqrt(metrics.mean_squared_error(y_val, yhat_rf))

print('MAE: {} | MAPE: {} | RMSE: {}'.format(mae, mape, rmse))

MAE: 30.18761857885216 | MAPE: 0.27098841516021255 | RMSE: 81.70818692549372


In [61]:
#1 MAE: 39.84512783025325 | MAPE: 0.357494658863791 | RMSE: 103.6417705071636
#2 MAE: 29.032647859198246 | MAPE: 0.2633721578886517 | RMSE: 80.29353829948478

In [62]:
result_rf = metrics_cv(model_rf, X, y, 'RandomForestRegressor')
result_rf

| | Model | MAE | MAPE | RMSE |
|---|---|---|---|---|
| 0 | RandomForestRegressor | 28.455 +/- 0.877 | 0.253 +/- 0.008 | 75.008 +/- 8.023 |


## 7.4. XGB Regression Model

In [63]:
# model definition and fit
model_xgb = xgb.XGBRegressor(objective='reg:squarederror').fit(x_train, y_train)

# model predict
yhat_xgb = model_xgb.predict(x_val)

# model performance
mae = metrics.mean_absolute_error(y_val, yhat_xgb)
mape = metrics.mean_absolute_percentage_error(y_val, yhat_xgb) 
rmse = np.sqrt(metrics.mean_squared_error(y_val, yhat_xgb))

print('MAE: {} | MAPE: {} | RMSE: {}'.format(mae, mape, rmse))

MAE: 33.29181167697386 | MAPE: 0.319053723643867 | RMSE: 83.08787778156639


In [64]:
result_xgb = metrics_cv(model_xgb, X, y, 'XGB Regressor')
result_xgb

| | Model | MAE | MAPE | RMSE |
|---|---|---|---|---|
| 0 | XGB Regressor | 32.204 +/- 0.904 | 0.306 +/- 0.01 | 76.782 +/- 7.5 |


## 7.5. Results

In [65]:
result = pd.concat([result_baseline, result_lr, result_rf, result_xgb])
result

| | Model | MAE | MAPE | RMSE |
|---|---|---|---|---|
| 0 | Average Model | 57.276 +/- 3.067 | 0.496 +/- 0.018 | 135.868 +/- 15.607 |
| 0 | LinearRegression | 43.298 +/- 1.967 | 0.439 +/- 0.01 | 108.275 +/- 14.817 |
| 0 | RandomForestRegressor | 28.455 +/- 0.877 | 0.253 +/- 0.008 | 75.008 +/- 8.023 |
| 0 | XGB Regressor | 32.204 +/- 0.904 | 0.306 +/- 0.01 | 76.782 +/- 7.5 |


# 8.0. Hyperparameter Fine Tuning

# 9.0. Model Performance

# 10.0. Deploy to Product