# Challenge 2 - Properati

## Base model: Hedonic pricing

To interpret the relationships between the target variable and the explanatory variables, we start from a relatively simple specification, known as the hedonic pricing model, and fit it with statsmodels to check the significance of each variable. The model has the form:
$$ \ln{(\text{price_usd_per_m2})} = \beta_0+\beta_1\times\text{rooms}+\beta_2\times\text{surface_total_in_m2}+\sum_i\beta_{3i}\times\text{property_type}_i+\sum_i\beta_{4i}\times\text{localidad}_i$$

where $\text{property_type}$ and $\text{localidad}$ are categorical variables, each entered as one dummy per category. The specification fitted below additionally includes `surface_covered_in_m2` as a regressor.
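
Because the dependent variable is in logs, a dummy coefficient reads as a proportional shift rather than an amount in dollars. As a short derivation (standard for semi-log models, not something spelled out in the original notebook), write $p$ for `price_usd_per_m2` and compare two otherwise identical listings, one in category $i$ and one in the omitted reference category:

$$ \ln{(p_i)} - \ln{(p_{\text{ref}})} = \beta_{3i} \quad\Rightarrow\quad \frac{p_i}{p_{\text{ref}}} = e^{\beta_{3i}}, \qquad \%\Delta p = 100\times\left(e^{\beta_{3i}}-1\right) $$

For a continuous regressor such as rooms, $100\times\beta_1$ approximates the percentage change in $p$ per additional unit when $\beta_1$ is small.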

We begin by importing the required libraries and loading the cleaned dataset.

In [1]:
import numpy as np
import pandas as pd
import geopandas as gpd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

In [5]:
df = pd.read_csv('df2.csv')
df

Unnamed: 0.1,Unnamed: 0,property_type,place_name,geonames_id,price,currency,price_aprox_local_currency,price_aprox_usd,surface_total_in_m2,surface_covered_in_m2,...,Estrenar,Gimnasio,Lavadero,Parrilla,Pileta,SUM,Seguridad,log_price,log_price_aprox_usd,log_price_usd_per_m2
0,0,PH,Mataderos,3430787.0,62000.0,USD,1093959.00,62000.0,55.0,40.0,...,0,0,0,0,0,0,0,11.034890,11.034890,7.027556
1,2,apartment,Mataderos,3430787.0,72000.0,USD,1270404.00,72000.0,55.0,55.0,...,0,0,0,0,0,0,0,11.184421,11.184421,7.177088
2,4,apartment,Centro,3435548.0,64000.0,USD,1129248.00,64000.0,35.0,35.0,...,0,0,0,0,0,0,0,11.066638,11.066638,7.511290
3,5,house,Gualeguaychú,3433657.0,,USD,,,53.0,53.0,...,0,0,1,0,0,0,0,,,
4,6,PH,Munro,3430511.0,130000.0,USD,2293785.00,130000.0,106.0,78.0,...,0,0,0,0,0,0,0,11.775290,11.775290,7.111851
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98916,121215,apartment,Belgrano,3436077.0,870000.0,USD,15350715.00,870000.0,113.0,93.0,...,0,0,0,1,1,0,1,13.676248,13.676248,8.948861
98917,121216,house,Beccar,3436080.0,498000.0,USD,8786961.00,498000.0,360.0,360.0,...,0,0,0,1,1,0,0,13.118355,13.118355,7.232251
98918,121217,apartment,Villa Urquiza,3433775.0,131500.0,USD,2320251.75,131500.0,46.0,39.0,...,0,0,1,1,0,0,0,11.786762,11.786762,7.958121
98919,121218,apartment,Plaza Colón,,95900.0,USD,1692107.55,95900.0,48.0,48.0,...,0,0,1,0,0,0,0,11.471061,11.471061,7.599860


In [6]:
for i in df.columns:
    print(i, df[i].dtype)

Unnamed: 0 int64
property_type object
place_name object
geonames_id float64
price float64
currency object
price_aprox_local_currency float64
price_aprox_usd float64
surface_total_in_m2 float64
surface_covered_in_m2 float64
price_usd_per_m2 float64
price_per_m2 float64
rooms float64
description object
title object
lat float64
lon float64
precio_regex float64
moneda object
provincia object
localidad object
barrio object
Amenities int64
Cochera int64
Estrenar int64
Gimnasio int64
Lavadero int64
Parrilla int64
Pileta int64
SUM int64
Seguridad int64
log_price float64
log_price_aprox_usd float64
log_price_usd_per_m2 float64


##### We make sure the `price_usd_per_m2` variable is complete and recompute the log variables
<p style="color:red;"><b>DO NOT RUN</b></p>

In [4]:
#df['price_usd_per_m2'] = df.price_usd_per_m2.fillna(df.price/df.surface_total_in_m2)
#df['log_price'] = df.price.apply(np.log)
#df['log_price_aprox_usd'] = df.price_aprox_usd.apply(np.log)
#df['log_price_usd_per_m2'] = df.price_usd_per_m2.apply(np.log)

##### We build the model described above with statsmodels

In [7]:
model_base = smf.ols('log_price_usd_per_m2 ~ rooms + surface_total_in_m2 +surface_covered_in_m2 + C(property_type) + C(localidad)', data=df)
model_base.fit().summary2()



0,1,2,3
Model:,OLS,Adj. R-squared:,0.614
Dependent Variable:,log_price_usd_per_m2,AIC:,37371.5474
Date:,2019-10-08 19:49,BIC:,39676.9606
No. Observations:,27823,Log-Likelihood:,-18406.0
Df Model:,279,F-statistic:,159.9
Df Residuals:,27543,Prob (F-statistic):,0.0
R-squared:,0.618,Scale:,0.22208

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,7.2585,0.0704,103.0774,0.0000,7.1205,7.3966
C(property_type)[T.apartment],0.4409,0.0114,38.7563,0.0000,0.4186,0.4632
C(property_type)[T.house],-0.3786,0.0130,-29.1444,0.0000,-0.4040,-0.3531
C(property_type)[T.store],0.2874,0.0579,4.9648,0.0000,0.1740,0.4009
C(localidad)[T.Achiras],-0.0000,0.0000,-1.2652,0.2058,-0.0000,0.0000
C(localidad)[T.Adolfo Alsina],0.0000,0.0000,8.0925,0.0000,0.0000,0.0000
C(localidad)[T.Agronomía],0.0521,0.1241,0.4199,0.6746,-0.1912,0.2954
C(localidad)[T.Agua Blanca],0.0000,0.0000,11.2982,0.0000,0.0000,0.0000
C(localidad)[T.Agua de Oro],0.0000,0.0000,2.4596,0.0139,0.0000,0.0000

0,1,2,3
Omnibus:,5590.618,Durbin-Watson:,1.751
Prob(Omnibus):,0.0,Jarque-Bera (JB):,122727.048
Skew:,-0.398,Prob(JB):,0.000
Kurtosis:,13.258,Condition No.:,387746840649198272512


As the output above shows, the model is fitted on $27823$ observations and explains about $0.614$ of the variation in the log price per square metre (adjusted $R^2=0.614$, $R^2=0.618$).
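
The two goodness-of-fit figures in the summary table are linked by the usual correction for the number of regressors, and the dummy coefficients can be read as percentage shifts. A minimal check using only values printed in the table above (the helper names `adjusted_r2` and `pct_effect` are ours, not part of statsmodels):

```python
import math


def adjusted_r2(r2: float, n_obs: int, df_model: int) -> float:
    """Adjusted R^2 for a model with an intercept: 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - df_model - 1)


def pct_effect(coef: float) -> float:
    """Proportional change in the (unlogged) outcome implied by a dummy coefficient."""
    return math.exp(coef) - 1.0


# Values copied from the summary: R-squared, No. Observations, Df Model.
adj = adjusted_r2(r2=0.618, n_obs=27823, df_model=279)
print(round(adj, 3))  # 0.614, matching the reported Adj. R-squared

# Coefficient of C(property_type)[T.apartment]; PH is the omitted baseline.
apartment_vs_ph = pct_effect(0.4409)
print(f"{apartment_vs_ph:.1%}")
```

Under this reading, apartments are priced per square metre a bit more than half again above PH units with the same rooms, surface and locality, as far as this specification goes.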

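Next we try adding the dummies for Amenities, Seguridad, Cochera, Estrenar, Gimnasio, Lavadero, Parrilla, Pileta and SUM, together with a polynomial transformation of rooms, to increase the explanatory power of the model. We also choose not to estimate an intercept (`-1` in the formula), since it has little interpretive value here. The cell immediately below first refits the base specification with `log_price` as the dependent variable; the extended model follows it.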

In [10]:
model_base = smf.ols('log_price ~ rooms + surface_total_in_m2 +surface_covered_in_m2 + C(property_type) + C(localidad)', data=df)
model_base.fit().summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.401
Dependent Variable:,log_price,AIC:,77464.0077
Date:,2019-10-08 19:59,BIC:,80579.4017
No. Observations:,41353,Log-Likelihood:,-38371.0
Df Model:,360,F-statistic:,77.89
Df Residuals:,40992,Prob (F-statistic):,0.0
R-squared:,0.406,Scale:,0.37782

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,11.0396,0.0830,133.0063,0.0000,10.8769,11.2023
C(property_type)[T.apartment],0.1257,0.0122,10.3441,0.0000,0.1019,0.1496
C(property_type)[T.house],0.2969,0.0135,22.0125,0.0000,0.2704,0.3233
C(property_type)[T.store],0.4605,0.0670,6.8698,0.0000,0.3291,0.5918
C(localidad)[T.Achiras],0.3384,0.4425,0.7647,0.4444,-0.5289,1.2056
C(localidad)[T.Adolfo Alsina],0.0000,0.0000,0.1913,0.8483,-0.0000,0.0000
C(localidad)[T.Agronomía],0.1773,0.1349,1.3146,0.1887,-0.0871,0.4417
C(localidad)[T.Agua Blanca],-0.1994,0.6202,-0.3215,0.7478,-1.4150,1.0162
C(localidad)[T.Agua de Oro],0.0000,0.0000,1.3769,0.1685,-0.0000,0.0000

0,1,2,3
Omnibus:,11452.441,Durbin-Watson:,1.509
Prob(Omnibus):,0.0,Jarque-Bera (JB):,942926.570
Skew:,0.342,Prob(JB):,0.000
Kurtosis:,26.383,Condition No.:,693007518233025576960


In [9]:
df['rooms2'] = df.rooms**2
df['rooms3'] = df.rooms**3
df['rooms4'] = df.rooms**4
model_base_mod =  smf.ols('log_price_usd_per_m2 ~ rooms + rooms2 + rooms3 + rooms4 + surface_total_in_m2 + \
                            surface_covered_in_m2 + C(property_type) + C(localidad) + Seguridad + \
                            Amenities + Cochera + Estrenar + Gimnasio + Lavadero + Parrilla + Pileta + SUM -1', data=df)
model_base_mod.fit().summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.627
Dependent Variable:,log_price_usd_per_m2,AIC:,36428.1117
Date:,2019-10-08 19:52,BIC:,38832.3282
No. Observations:,27823,Log-Likelihood:,-17922.0
Df Model:,291,F-statistic:,162.1
Df Residuals:,27531,Prob (F-statistic):,0.0
R-squared:,0.631,Scale:,0.21459

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
C(property_type)[PH],7.3620,0.0719,102.3437,0.0000,7.2210,7.5030
C(property_type)[apartment],7.7575,0.0713,108.7501,0.0000,7.6177,7.8973
C(property_type)[house],6.9568,0.0719,96.8225,0.0000,6.8159,7.0976
C(property_type)[store],7.6121,0.0900,84.5434,0.0000,7.4356,7.7885
C(localidad)[T.Achiras],0.0000,0.0000,1.1678,0.2429,-0.0000,0.0000
C(localidad)[T.Adolfo Alsina],-0.0000,0.0000,-7.0432,0.0000,-0.0000,-0.0000
C(localidad)[T.Agronomía],-0.0324,0.1221,-0.2656,0.7905,-0.2718,0.2069
C(localidad)[T.Agua Blanca],0.0000,0.0000,11.3869,0.0000,0.0000,0.0000
C(localidad)[T.Agua de Oro],-0.0000,0.0000,-7.3928,0.0000,-0.0000,-0.0000

0,1,2,3
Omnibus:,5588.913,Durbin-Watson:,1.754
Prob(Omnibus):,0.0,Jarque-Bera (JB):,130434.338
Skew:,-0.37,Prob(JB):,0.000
Kurtosis:,13.581,Condition No.:,473968958627864379392
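
On the `-1` in the formula: when a full set of category dummies is present, dropping the intercept does not change the fit. It only relabels the coefficients, so each category gets its own level instead of an offset from a baseline, which is why all four property types appear above. A small synthetic illustration with NumPy, where the data and variable names are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Synthetic data: 3 categories and one numeric regressor (illustrative only).
cat = rng.integers(0, 3, size=n)
x = rng.normal(size=n)
y = np.array([7.0, 7.4, 6.9])[cat] + 0.1 * x + rng.normal(scale=0.2, size=n)

dummies = np.eye(3)[cat]  # one indicator column per category

# Parametrization A: intercept + dummies for categories 1 and 2 (category 0 is the baseline).
X_a = np.column_stack([np.ones(n), dummies[:, 1:], x])
# Parametrization B: no intercept, one dummy per category.
X_b = np.column_stack([dummies, x])

beta_a, *_ = np.linalg.lstsq(X_a, y, rcond=None)
beta_b, *_ = np.linalg.lstsq(X_b, y, rcond=None)

# Same fitted values; B's category levels equal A's intercept plus A's offsets.
same_fit = np.allclose(X_a @ beta_a, X_b @ beta_b)
levels_from_a = beta_a[0] + np.array([0.0, beta_a[1], beta_a[2]])
same_levels = np.allclose(levels_from_a, beta_b[:3])
print(same_fit, same_levels)
```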


We look into why so many localities and neighbourhoods are not significant. To do so, we group the observations by locality, compute the basic descriptive statistics, keep the localities whose standard deviation is not NaN, and sort them by standard deviation.

In [19]:
varianzas = df.groupby('localidad').log_price_usd_per_m2.describe()
display(varianzas.shape)
varianzas.loc[varianzas['std'].notna()].sort_values('std', ascending=False)

(519, 8)

localidad,count,mean,std,min,25%,50%,75%,max
Baradero,6.0,6.898410,3.123209,3.717183,4.260327,6.304836,9.761309,10.571317
Zárate,12.0,5.274652,2.851703,0.728528,3.029108,4.928395,7.787727,9.305651
La Cumbre,2.0,5.478362,2.792535,3.503741,4.491052,5.478362,6.465672,7.452982
San Antonio de Arredondo,14.0,5.955985,2.485753,2.642299,4.925027,5.407662,6.732212,12.847927
Nono,4.0,5.517404,2.309755,2.525729,4.361798,5.876872,7.032478,7.790144
...,...,...,...,...,...,...,...,...
Villa Icho Cruz,2.0,5.401889,0.002530,5.400100,5.400995,5.401889,5.402783,5.403678
Dorrego,3.0,5.636283,0.000000,5.636283,5.636283,5.636283,5.636283,5.636283
Manfredi,2.0,5.184297,0.000000,5.184297,5.184297,5.184297,5.184297,5.184297
Timbúes,3.0,4.448693,0.000000,4.448693,4.448693,4.448693,4.448693,4.448693


Of the 519 localities in the dataset, only 270 have log-price data, which explains why so many locality coefficients are zero or not significant. The table also shows which localities have the largest standard deviation.
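
One detail of how `describe()` behaves per group is worth keeping in mind when reading these counts: `std` is NaN whenever a locality has fewer than two non-missing values, so localities with a single listing drop out of the filter together with those that have no price at all. A toy reproduction (the data frame below is invented for illustration):

```python
import numpy as np
import pandas as pd

# Locality A has three prices, B has one, C has none.
toy = pd.DataFrame({
    "localidad": ["A", "A", "A", "B", "C", "C"],
    "log_price_usd_per_m2": [7.1, 7.3, 7.2, 6.9, np.nan, np.nan],
})

stats = toy.groupby("localidad").log_price_usd_per_m2.describe()
usable = stats.loc[stats["std"].notna()]

# Number of localities overall vs. those with a defined standard deviation.
print(len(stats), len(usable))
```

Comparing the `count` column against 0 and 1 separately would tell the two cases apart.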

In [10]:
model_caba_mod =  smf.ols('log_price_usd_per_m2 ~ C(rooms) + surface_total_in_m2 + \
                            surface_covered_in_m2 + C(property_type) + C(localidad) + Seguridad + \
                            Amenities + Cochera + Estrenar + Gimnasio + Lavadero + Parrilla + Pileta + SUM -1', data=df.loc[df.provincia=='Capital Federal'])
model_caba_mod.fit().summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.542
Dependent Variable:,log_price_usd_per_m2,AIC:,4824.7299
Date:,2019-10-07 21:19,BIC:,5451.6631
No. Observations:,11798,Log-Likelihood:,-2327.4
Df Model:,84,F-statistic:,167.4
Df Residuals:,11713,Prob (F-statistic):,0.0
R-squared:,0.546,Scale:,0.0875

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
C(rooms)[1.0],7.4157,0.0451,164.4886,0.0000,7.3273,7.5041
C(rooms)[2.0],7.3615,0.0450,163.6095,0.0000,7.2733,7.4497
C(rooms)[3.0],7.3462,0.0450,163.3752,0.0000,7.2581,7.4344
C(rooms)[4.0],7.3068,0.0454,160.9153,0.0000,7.2178,7.3958
C(rooms)[5.0],7.2793,0.0469,155.2463,0.0000,7.1874,7.3713
C(rooms)[6.0],7.2593,0.0503,144.3765,0.0000,7.1608,7.3579
C(rooms)[7.0],7.1618,0.0546,131.2458,0.0000,7.0548,7.2687
C(rooms)[8.0],7.1646,0.0689,103.9297,0.0000,7.0295,7.2997
C(rooms)[9.0],7.1611,0.1158,61.8440,0.0000,6.9341,7.3881

0,1,2,3
Omnibus:,3147.882,Durbin-Watson:,1.699
Prob(Omnibus):,0.0,Jarque-Bera (JB):,281247.951
Skew:,0.172,Prob(JB):,0.0
Kurtosis:,26.917,Condition No.:,1.0109638259643102e+16


In [20]:
df.loc[df.provincia=='Capital Federal'].groupby('localidad').log_price_usd_per_m2.describe().sort_values('std', ascending=False)

localidad,count,mean,std,min,25%,50%,75%,max
Villa Santa Rita,48.0,8.17611,1.540745,7.057396,7.487582,7.665518,7.768772,12.821258
Versalles,54.0,7.881811,1.207798,6.579251,7.282578,7.472836,7.82017,13.142166
Catalinas,3.0,8.364431,1.06482,7.513891,7.767324,8.020756,8.789702,9.558647
Monte Castro,72.0,7.666624,1.056871,2.713281,7.307741,7.598779,7.72982,12.923912
Pompeya,55.0,7.319088,0.932873,6.127597,6.856832,7.098811,7.52706,9.825526
Villa Riachuelo,5.0,7.445846,0.911166,6.645391,7.11437,7.195437,7.256062,9.017968
Villa Lugano,178.0,7.188193,0.831523,5.907123,6.843883,7.055267,7.336085,12.323856
Paternal,157.0,7.696419,0.809844,6.40698,7.418581,7.585635,7.749708,12.765688
Velez Sarsfield,35.0,7.48908,0.772381,6.655531,7.041887,7.286463,7.659794,9.987369
Boedo,262.0,7.781244,0.7351,6.781004,7.393263,7.616173,7.824046,10.449714


##### We generate dummies for locality (neighbourhood), property type and number of rooms to use in scikit-learn
We build two new dataframes:
- prices (holding all the price-related variables)
- X (holding all the descriptive variables from which the dummies are derived)

In [12]:
#df.drop(columns='Unnamed: 0', inplace=True)
# Price-related columns, each listed once (the repeated and trailing-space entries were dropped).
prices = df.filter(['price', 'price_aprox_local_currency', 'price_aprox_usd','price_usd_per_m2','price_per_m2', \
               'log_price', 'log_price_aprox_usd', 'log_price_usd_per_m2'])
#pd.get_dummies(df, columns=['property_type', 'localidad'], drop_first=True)
X = df.filter(['property_type', 'surface_total_in_m2', 'surface_covered_in_m2', 'rooms','localidad',\
               'barrio', 'Amenities', 'Cochera', 'Estrenar', 'Gimnasio','Lavadero', 'Parrilla', 'Pileta', 'SUM', \
               'Seguridad', 'rooms2', 'rooms3', 'rooms4' ])
# Replace each categorical column with its dummies, dropping the first level as the baseline.
X = pd.get_dummies(X.property_type, drop_first=True).join(X).drop('property_type', axis=1)
X = pd.get_dummies(X.localidad, drop_first=True).join(X).drop('localidad', axis=1)

In [13]:
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score


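def findAlphaLasso(X,y,randomState=53,tries=5,alphaFrom=0.00001,alphaTo=100000,steps=1000):
    """Search for the Lasso alpha by repeatedly narrowing a linear grid around the value chosen by LassoCV."""
    kf = KFold(n_splits=5, shuffle=True, random_state=randomState)
    step_value=(alphaTo-alphaFrom)/steps
    print("Lasso:")
    prevAlphaTest=0
    for i in range(0,tries):
        al_lasso = np.linspace(alphaFrom, alphaTo, steps)
        # Note: `normalize` was removed in scikit-learn 1.2; on newer versions, scale the features beforehand instead.
        lm_lasso_cv= LassoCV(alphas=al_lasso, cv=kf, normalize=True)
        lm_lasso_cv.fit(X, y)
        # Stop once the selected alpha is stable to 4 decimal places.
        if( round(prevAlphaTest*10000) == round(lm_lasso_cv.alpha_*10000) ):
            return lm_lasso_cv 
        prevAlphaTest=lm_lasso_cv.alpha_
        # Narrow the grid to two steps on either side, keeping the lower bound positive.
        alphaFrom = max(prevAlphaTest - step_value*2, 1e-5)
        alphaTo = prevAlphaTest   + step_value*2
        step_value=(alphaTo-alphaFrom)/steps
        print("attempt {} value {} from {} to {} ".format(i+1,lm_lasso_cv.alpha_,alphaFrom,alphaTo))
    return lm_lasso_cv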
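
A note on the initial grid: `np.linspace` between 1e-5 and 1e5 with 1000 points spaces the candidates roughly 100 apart, so the first pass has a single candidate below 100. A logarithmic grid is a common alternative for regularization strengths. The sketch below only compares the two grids with the function's default bounds; it is not what the notebook runs:

```python
import numpy as np

alpha_from, alpha_to, steps = 1e-5, 1e5, 1000

# Evenly spaced candidates, as in the function above.
linear_grid = np.linspace(alpha_from, alpha_to, steps)
# Candidates evenly spaced in log10(alpha) over the same range.
log_grid = np.logspace(np.log10(alpha_from), np.log10(alpha_to), steps)

# How many candidates fall below alpha = 1 in each grid?
n_linear_small = int((linear_grid < 1).sum())
n_log_small = int((log_grid < 1).sum())
print(n_linear_small, n_log_small)
```

With the log grid, half of the candidates cover the range below 1, where Lasso penalties on log-price targets often end up.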

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, prices.log_price_usd_per_m2, shuffle=True)
lasso = findAlphaLasso(X_train, y_train)
print(" Score Train Lasso: %.2f\n" % lasso.score(X_train, y_train))
print(" Score Test Lasso: %.2f\n" % lasso.score(X_test, y_test))

Lasso:


ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
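
The error indicates that missing or non-finite values are reaching the estimator; for instance, row 3 of the preview above has no price, so its log columns are empty. A hedged sketch of one way to filter rows before splitting, shown on a toy frame (the names `X_toy` and `y_toy` and their values are invented; whether to drop or impute, and which columns to keep, is a modelling decision the notebook has not made yet):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's X and target.
X_toy = pd.DataFrame({"rooms": [2.0, np.nan, 3.0, 1.0], "Pileta": [0, 1, 0, 1]})
y_toy = pd.Series([7.1, 7.4, np.nan, np.inf], name="log_price_usd_per_m2")

# Keep numeric columns only, then keep rows that are finite in both X and y.
X_num = X_toy.select_dtypes(include="number")
x_ok = np.isfinite(X_num.to_numpy(dtype=float)).all(axis=1)
y_ok = np.isfinite(y_toy.to_numpy(dtype=float))
mask = x_ok & y_ok

X_clean, y_clean = X_num.loc[mask], y_toy.loc[mask]
print(len(X_clean), len(y_clean))
```

The same mask applied to `X` and `prices.log_price_usd_per_m2` before `train_test_split` would keep features and target aligned.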