# Challenge 2 - Properati

## Baseline model: hedonic pricing

To interpret the relationships between the target variable and the explanatory variables, we start from a relatively simple model, known as the hedonic pricing model, and fit it with statsmodels to check the significance of each variable. The model has the form:
$$ \ln{(\text{price_usd_per_m2})} = \beta_0+\beta_1\times\text{rooms}+\beta_2\times\text{surface_total_in_m2}+\sum_i\beta_{3i}\times\text{property_type}_i+\sum_i\beta_{4i}\times\text{localidad}_i$$

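where $\text{property_type}$ and $\text{localidad}$ are categorical variables, each entering the model as a set of dummy indicators (one per category $i$, excluding a reference level).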

To do so, we import the required libraries and load the cleaned dataset.

In [15]:
import numpy as np
import pandas as pd
import geopandas as gpd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

In [2]:
df = pd.read_csv('df2.csv')
df.sample(20)



Unnamed: 0.1,Unnamed: 0,property_type,place_name,geonames_id,price,currency,price_aprox_local_currency,price_aprox_usd,surface_total_in_m2,surface_covered_in_m2,...,Estrenar,Gimnasio,Lavadero,Parrilla,Pileta,SUM,Seguridad,log_price,log_price_aprox_usd,log_price_usd_per_m2
46381,54851,house,San Vicente,3428056.0,650000.0,USD,11468925.0,650000.0,3400.0,400.0,...,0,0,0,0,0,0,0,13.384728,13.384728,5.253197
83830,101254,house,San Isidro,3428983.0,365000.0,USD,6440242.5,365000.0,217.0,217.0,...,0,0,0,1,1,0,1,12.807653,12.807653,
64303,76254,apartment,Lomas de Zamora,,140000.0,USD,2470230.0,140000.0,53.0,53.0,...,0,0,0,0,0,0,0,11.849398,11.849398,7.879106
15553,18081,apartment,Pinamar,3429971.0,255000.0,USD,4499347.5,255000.0,130.0,130.0,...,0,0,0,1,1,0,1,12.449019,12.449019,7.581484
22161,26400,apartment,Boedo,3436003.0,425687.0,USD,7511034.27,425687.0,30.0,30.0,...,0,0,0,0,0,0,0,12.96146,12.96146,9.560262
57622,67816,house,Castelar,3435607.0,115000.0,USD,2029117.5,115000.0,80.0,80.0,...,0,0,1,1,0,0,0,11.652687,11.652687,7.270661
70513,84112,apartment,Córdoba,3860255.0,60009.53,USD,1058838.15,60009.53,58.0,47.0,...,0,0,0,0,0,0,0,11.002259,11.002259,6.941816
21294,24755,apartment,Villa Carlos Paz,3832791.0,220000.0,USD,3881790.0,220000.0,140.0,140.0,...,0,0,0,0,1,0,1,12.301383,12.301383,7.35974
51419,60566,apartment,Recoleta,3429595.0,180000.0,USD,3176010.0,180000.0,40.0,40.0,...,0,0,1,0,0,0,1,12.100712,12.100712,8.411833
58847,69332,apartment,Monserrat,3430570.0,178000.0,USD,3140721.0,178000.0,139.0,139.0,...,0,0,0,0,0,0,0,12.089539,12.089539,


In [8]:
for i in df.columns:
    print(i, df[i].dtype)

Unnamed: 0 int64
property_type object
place_name object
geonames_id float64
price float64
currency object
price_aprox_local_currency float64
price_aprox_usd float64
surface_total_in_m2 float64
surface_covered_in_m2 float64
price_usd_per_m2 float64
price_per_m2 float64
rooms float64
description object
title object
lat float64
lon float64
precio_regex float64
moneda object
provincia object
localidad object
barrio object
Amenities int64
Cochera int64
Estrenar int64
Gimnasio int64
Lavadero int64
Parrilla int64
Pileta int64
SUM int64
Seguridad int64
log_price float64
log_price_aprox_usd float64
log_price_usd_per_m2 float64


##### We make sure the `price_usd_per_m2` variable is complete and recompute the log-transformed variables

In [16]:
df['price_usd_per_m2'] = df.price/df.surface_total_in_m2
df['log_price'] = df.price.apply(np.log)
df['log_price_aprox_usd'] = df.price_aprox_usd.apply(np.log)
df['log_price_usd_per_m2'] = df.price_usd_per_m2.apply(np.log)
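
One caveat with the ratio and the logs above: a listing with `surface_total_in_m2 == 0` yields an infinite price per m2, and a zero price has a log of `-inf`. The sketch below, on a made-up three-row frame (the values are illustrative, not taken from the dataset), shows one possible way to turn those cases into NaN so they can be dropped later instead of propagating into the models.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): the second row has zero surface, the third has zero price
toy = pd.DataFrame({
    "price": [100000.0, 80000.0, 0.0],
    "surface_total_in_m2": [50.0, 0.0, 40.0],
})

# Treat non-positive surfaces and prices as missing before dividing and taking logs
surface = toy["surface_total_in_m2"].where(toy["surface_total_in_m2"] > 0)
price = toy["price"].where(toy["price"] > 0)

toy["price_usd_per_m2"] = price / surface
toy["log_price_usd_per_m2"] = np.log(toy["price_usd_per_m2"])

print(toy)
```

With this guard, only the first toy row keeps a finite log price per m2; the other two become NaN rather than `inf` or `-inf`.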

##### We fit the model described above with statsmodels

In [17]:
model_base = smf.ols('log_price_usd_per_m2 ~ rooms + surface_total_in_m2 + C(property_type) + C(localidad)', data=df)
model_base.fit().summary2()



Model:,OLS,Adj. R-squared:,0.608
Dependent Variable:,log_price_usd_per_m2,AIC:,37834.9565
Date:,2019-10-07 14:47,BIC:,40132.136
No. Observations:,27823,Log-Likelihood:,-18638.0
Df Model:,278,F-statistic:,156.2
Df Residuals:,27544,Prob (F-statistic):,0.0
R-squared:,0.612,Scale:,0.22582

,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,7.2676,0.0710,102.3506,0.0000,7.1284,7.4068
C(property_type)[T.apartment],0.4421,0.0115,38.5368,0.0000,0.4196,0.4646
C(property_type)[T.house],-0.4133,0.0130,-31.8025,0.0000,-0.4388,-0.3878
C(property_type)[T.store],0.2470,0.0583,4.2337,0.0000,0.1327,0.3614
C(localidad)[T.Achiras],-0.0000,0.0000,-3.3149,0.0009,-0.0000,-0.0000
C(localidad)[T.Adolfo Alsina],0.0000,0.0000,3.9535,0.0001,0.0000,0.0000
C(localidad)[T.Agronomía],0.0514,0.1252,0.4106,0.6814,-0.1939,0.2967
C(localidad)[T.Agua Blanca],0.0000,0.0000,5.6179,0.0000,0.0000,0.0000
C(localidad)[T.Agua de Oro],-0.0000,0.0000,-1.8984,0.0576,-0.0000,0.0000

Omnibus:,6127.791,Durbin-Watson:,1.757
Prob(Omnibus):,0.0,Jarque-Bera (JB):,152057.393
Skew:,-0.469,Prob(JB):,0.000
Kurtosis:,14.414,Condition No.:,467892021479041335296


As the output above shows, the model is fitted on 27,823 observations and explains roughly 61% of the variance in the log of the price per square meter ($R^2 = 0.612$, adjusted $R^2 = 0.608$).
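
Because the dependent variable is in logs, a dummy coefficient $\beta$ translates into a relative difference of $e^{\beta} - 1$ with respect to the reference category (presumably `PH`, the property type that has no row of its own in the table). A quick sketch using the property-type coefficients reported in the summary:

```python
import math

# Coefficients copied from the summary table above
coefs = {"apartment": 0.4421, "house": -0.4133, "store": 0.2470}

# In a log-linear model, exp(beta) - 1 is the relative difference vs. the reference level
effects = {name: math.exp(beta) - 1 for name, beta in coefs.items()}

for name, effect in effects.items():
    print(f"{name}: {effect:+.1%}")
```

Under this reading, apartments are priced roughly 56% above the reference per square meter and houses roughly 34% below it, holding the other regressors fixed.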

Next, we add the dummies for Amenities, Seguridad, Cochera, Estrenar, Gimnasio, Lavadero, Parrilla, Pileta and SUM, together with the covered surface and polynomial terms in rooms (squared, cubed and fourth power), to increase the model's explanatory power. We also drop the intercept, since it has little meaning on its own here.

In [24]:
df['rooms2'] = df.rooms**2
df['rooms3'] = df.rooms**3
df['rooms4'] = df.rooms**4
model_base_mod =  smf.ols('log_price_usd_per_m2 ~ rooms + rooms2 + rooms3 + rooms4 + surface_total_in_m2 + \
                            surface_covered_in_m2 + C(property_type) + C(localidad) + Seguridad + \
                            Amenities + Cochera + Estrenar + Gimnasio + Lavadero + Parrilla + Pileta + SUM -1', data=df)
model_base_mod.fit().summary2()

Model:,OLS,Adj. R-squared:,0.627
Dependent Variable:,log_price_usd_per_m2,AIC:,36428.1117
Date:,2019-10-07 15:31,BIC:,38832.3282
No. Observations:,27823,Log-Likelihood:,-17922.0
Df Model:,291,F-statistic:,162.1
Df Residuals:,27531,Prob (F-statistic):,0.0
R-squared:,0.631,Scale:,0.21459

,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
C(property_type)[PH],7.3620,0.0719,102.3437,0.0000,7.2210,7.5030
C(property_type)[apartment],7.7575,0.0713,108.7501,0.0000,7.6177,7.8973
C(property_type)[house],6.9568,0.0719,96.8225,0.0000,6.8159,7.0976
C(property_type)[store],7.6121,0.0900,84.5434,0.0000,7.4356,7.7885
C(localidad)[T.Achiras],0.0000,0.0000,1.1678,0.2429,-0.0000,0.0000
C(localidad)[T.Adolfo Alsina],-0.0000,0.0000,-7.0432,0.0000,-0.0000,-0.0000
C(localidad)[T.Agronomía],-0.0324,0.1221,-0.2656,0.7905,-0.2718,0.2069
C(localidad)[T.Agua Blanca],0.0000,0.0000,11.3869,0.0000,0.0000,0.0000
C(localidad)[T.Agua de Oro],-0.0000,0.0000,-7.3928,0.0000,-0.0000,-0.0000

Omnibus:,5588.913,Durbin-Watson:,1.754
Prob(Omnibus):,0.0,Jarque-Bera (JB):,130434.338
Skew:,-0.37,Prob(JB):,0.000
Kurtosis:,13.581,Condition No.:,473968958627864379392
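
Dropping the intercept with `-1` does not change the fitted values when a full set of category dummies is present; it only changes how the coefficients are reported, which is why each property type now shows its own level (roughly 7 to 7.8) instead of an offset from a reference category. A minimal NumPy sketch on made-up data (the groups and values are purely illustrative):

```python
import numpy as np

# Toy data: three observations for each of two hypothetical categories, "a" and "b"
y = np.array([1.0, 1.2, 0.8, 2.0, 2.1, 1.9])
is_a = np.array([1, 1, 1, 0, 0, 0], dtype=float)
is_b = 1.0 - is_a

# Parameterization 1: intercept + dummy for "b" ("a" is the reference level)
X_ref = np.column_stack([np.ones_like(y), is_b])
beta_ref, *_ = np.linalg.lstsq(X_ref, y, rcond=None)

# Parameterization 2: no intercept, one dummy per category
X_full = np.column_stack([is_a, is_b])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Same fitted values; the no-intercept coefficients are simply the group means
print("intercept, offset:", beta_ref)
print("level a, level b:", beta_full)
print("same fit:", np.allclose(X_ref @ beta_ref, X_full @ beta_full))
```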


We now check why so many localities and neighborhoods are not significant. To do so, we group the observations by locality, compute the basic summary statistics, keep the groups whose standard deviation is not NaN and sort them by standard deviation.

In [39]:
varianzas = df.groupby('localidad').log_price_usd_per_m2.describe()
display(varianzas.shape)
varianzas.loc[varianzas['std'].notnull()].sort_values('std', ascending=False)

(519, 8)

localidad,count,mean,std,min,25%,50%,75%,max
La Cumbre,2.0,5.478362,2.792535e+00,3.503741,4.491052,5.478362,6.465672,7.452982
Nono,3.0,5.096565,2.634353e+00,2.525729,3.749775,4.973821,6.381983,7.790144
Brandsen,8.0,6.165967,2.110951e+00,1.487623,5.556291,7.272223,7.482739,7.536364
Plottier,7.0,6.382504,2.015403e+00,2.607617,5.844774,6.860058,7.180629,9.159047
Los Hornillos,2.0,5.425119,1.907923e+00,4.076014,4.750566,5.425119,6.099671,6.774224
...,...,...,...,...,...,...,...,...
Villa Icho Cruz,2.0,5.401889,2.529902e-03,5.400100,5.400995,5.401889,5.402783,5.403678
Dorrego,3.0,5.636283,1.087792e-15,5.636283,5.636283,5.636283,5.636283,5.636283
Timbúes,3.0,4.448693,0.000000e+00,4.448693,4.448693,4.448693,4.448693,4.448693
Pueblo Andino,2.0,3.824162,0.000000e+00,3.824162,3.824162,3.824162,3.824162,3.824162


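Of the 519 localities in the dataset, only 270 have enough valid log-price data to pass this filter, which explains the result: a locality with no usable observations most likely contributes an all-zero dummy column, so its coefficient is reported as 0.0000 and carries no information. The table also shows which localities have the largest standard deviation; those at the top have only a handful of observations each.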
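
The filter on `std` works because pandas reports a sample standard deviation of NaN for any group with fewer than two valid observations. A small sketch with hypothetical localities and values:

```python
import numpy as np
import pandas as pd

# Toy data: "A" has two valid values, "B" has one, "C" has none (all names are made up)
toy = pd.DataFrame({
    "localidad": ["A", "A", "B", "C"],
    "log_price_usd_per_m2": [7.0, 7.5, 6.8, np.nan],
})

stats = toy.groupby("localidad")["log_price_usd_per_m2"].describe()

# Only groups with at least two valid observations have a defined std
with_std = stats.loc[stats["std"].notnull()]
print(stats[["count", "std"]])
print("groups with a defined std:", list(with_std.index))
```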

In [46]:
model_caba_mod =  smf.ols('log_price_usd_per_m2 ~ C(rooms) + surface_total_in_m2 + \
                            surface_covered_in_m2 + C(property_type) + C(localidad) + Seguridad + \
                            Amenities + Cochera + Estrenar + Gimnasio + Lavadero + Parrilla + Pileta + SUM -1', data=df.loc[df.provincia=='Capital Federal'])
model_caba_mod.fit().summary2()

Model:,OLS,Adj. R-squared:,0.542
Dependent Variable:,log_price_usd_per_m2,AIC:,4824.7299
Date:,2019-10-07 15:58,BIC:,5451.6631
No. Observations:,11798,Log-Likelihood:,-2327.4
Df Model:,84,F-statistic:,167.4
Df Residuals:,11713,Prob (F-statistic):,0.0
R-squared:,0.546,Scale:,0.0875

,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
C(rooms)[1.0],7.4157,0.0451,164.4886,0.0000,7.3273,7.5041
C(rooms)[2.0],7.3615,0.0450,163.6095,0.0000,7.2733,7.4497
C(rooms)[3.0],7.3462,0.0450,163.3752,0.0000,7.2581,7.4344
C(rooms)[4.0],7.3068,0.0454,160.9153,0.0000,7.2178,7.3958
C(rooms)[5.0],7.2793,0.0469,155.2463,0.0000,7.1874,7.3713
C(rooms)[6.0],7.2593,0.0503,144.3765,0.0000,7.1608,7.3579
C(rooms)[7.0],7.1618,0.0546,131.2458,0.0000,7.0548,7.2687
C(rooms)[8.0],7.1646,0.0689,103.9297,0.0000,7.0295,7.2997
C(rooms)[9.0],7.1611,0.1158,61.8440,0.0000,6.9341,7.3881

Omnibus:,3147.882,Durbin-Watson:,1.699
Prob(Omnibus):,0.0,Jarque-Bera (JB):,281247.951
Skew:,0.172,Prob(JB):,0.0
Kurtosis:,26.917,Condition No.:,1.0109638259643102e+16


In [44]:
df.loc[df.provincia=='Capital Federal'].groupby('localidad').log_price_usd_per_m2.describe().sort_values('std', ascending=False)

localidad,count,mean,std,min,25%,50%,75%,max
Boedo,215.0,7.755845,0.747558,6.781004,7.313497,7.600902,7.824046,10.449714
Centro / Microcentro,186.0,7.665014,0.630684,3.687629,7.507378,7.702685,7.870453,10.349775
Parque Chacabuco,88.0,7.321837,0.536032,4.51175,7.164348,7.432278,7.666287,8.1425
Parque Avellaneda,30.0,7.119464,0.484047,5.352385,6.967139,7.26443,7.402318,7.786306
Mataderos,259.0,7.178445,0.476265,6.214608,6.804128,7.186927,7.513891,9.721166
Villa Soldati,6.0,6.669731,0.461537,5.999833,6.602692,6.659138,6.664525,7.45008
Once,117.0,7.539732,0.447027,6.186328,7.293418,7.513891,7.706263,9.998798
Boca,167.0,7.274257,0.445558,5.36874,7.138448,7.430557,7.526247,7.924129
Pompeya,31.0,6.900431,0.427604,6.127597,6.696305,6.882858,7.156078,7.708533
Nuñez,483.0,8.005744,0.418777,5.521461,7.779458,7.967901,8.160518,9.663643


##### We generate dummies for locality (neighborhood), property type and number of rooms to use in scikit-learn
We build two new dataframes:
- prices (holds all the price-related variables)
- X (holds all the descriptive variables from which the dummies are derived)

In [68]:
#df.drop(columns='Unnamed: 0', inplace=True)
# Price-related columns; each name is listed once so that
# prices.log_price_usd_per_m2 comes back as a single Series.
prices = df.filter(['price', 'price_aprox_local_currency', 'price_aprox_usd', 'price_usd_per_m2', 'price_per_m2',
                    'log_price', 'log_price_aprox_usd', 'log_price_usd_per_m2'])
#pd.get_dummies(df, columns=['property_type', 'localidad'], drop_first=True)
X = df.filter(['property_type', 'surface_total_in_m2', 'surface_covered_in_m2', 'rooms', 'localidad',
               'barrio', 'Amenities', 'Cochera', 'Estrenar', 'Gimnasio', 'Lavadero', 'Parrilla', 'Pileta', 'SUM',
               'Seguridad', 'rooms2', 'rooms3', 'rooms4'])
# One-hot encode property_type and localidad, dropping the first level of each as the reference
X = pd.get_dummies(X.property_type, drop_first=True).join(X).drop('property_type', axis=1)
X = pd.get_dummies(X.localidad, drop_first=True).join(X).drop('localidad', axis=1)
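
The two `join` calls above can also be expressed with the `columns` argument of `pd.get_dummies`, as in the commented-out line. A minimal sketch on a made-up frame (the rows are illustrative) showing what `drop_first=True` does:

```python
import pandas as pd

# Toy data with hypothetical rows; column names follow the dataset
toy = pd.DataFrame({
    "property_type": ["PH", "apartment", "house", "apartment"],
    "localidad": ["Boedo", "Once", "Boedo", "Recoleta"],
    "rooms": [3.0, 2.0, 4.0, 1.0],
})

# get_dummies prefixes each dummy with its source column name; drop_first removes
# one level per variable (the first in sorted order) to serve as the reference
encoded = pd.get_dummies(toy, columns=["property_type", "localidad"], drop_first=True)
print(list(encoded.columns))
```

The prefixed names make it easier to tell a property-type dummy from a locality dummy once both sets sit in the same matrix.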

In [69]:
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score


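def findAlphaLasso(X, y, randomState=53, tries=5, alphaFrom=0.00001, alphaTo=100000, steps=1000):
    """Search for the Lasso alpha by cross-validation, narrowing the grid around the best value on each try."""
    kf = KFold(n_splits=5, shuffle=True, random_state=randomState)
    minAlpha = alphaFrom  # alpha must stay positive, so the grid never goes below the initial lower bound
    step_value = (alphaTo - alphaFrom) / steps
    print("Lasso:")
    prevAlphaTest = 0
    for i in range(tries):
        al_lasso = np.linspace(alphaFrom, alphaTo, steps)
        # normalize=True matches the scikit-learn version used here; the argument was removed in 1.2,
        # where the features should be scaled beforehand (e.g. with StandardScaler).
        lm_lasso_cv = LassoCV(alphas=al_lasso, cv=kf, normalize=True)
        lm_lasso_cv.fit(X, y)
        # Stop once the selected alpha no longer changes at 4 decimal places
        if round(prevAlphaTest * 10000) == round(lm_lasso_cv.alpha_ * 10000):
            return lm_lasso_cv
        prevAlphaTest = lm_lasso_cv.alpha_
        # Re-center the grid two steps either side of the current best alpha
        alphaFrom = max(prevAlphaTest - step_value * 2, minAlpha)
        alphaTo = prevAlphaTest + step_value * 2
        step_value = (alphaTo - alphaFrom) / steps
        print("try {} value {} from {} to {} ".format(i + 1, lm_lasso_cv.alpha_, alphaFrom, alphaTo))
    return lm_lasso_cv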
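
One thing to keep in mind about the grid in `findAlphaLasso`: with `np.linspace` between 1e-5 and 1e5, consecutive candidates are about 100 apart, so the first pass barely explores small alphas. A log-spaced grid is a common alternative; the comparison below is only a sketch of that idea, not part of the original search.

```python
import numpy as np

alpha_from, alpha_to, steps = 1e-5, 1e5, 1000

linear_grid = np.linspace(alpha_from, alpha_to, steps)
# Log-spaced alternative (an assumption, not what the function above uses)
log_grid = np.logspace(np.log10(alpha_from), np.log10(alpha_to), steps)

# Count how many candidate alphas fall below 1 in each grid
print("linear grid, alphas < 1:", int((linear_grid < 1).sum()))
print("log grid, alphas < 1:", int((log_grid < 1).sum()))
```

The linear grid has a single candidate below 1 (its lower bound), whereas the log-spaced grid spends half of its candidates there.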

In [70]:
X_train, X_test, y_train, y_test = train_test_split(X, prices.log_price_usd_per_m2, shuffle=True)
lasso = findAlphaLasso(X_train, y_train)
print(" Score Train Lasso: %.2f\n" % lasso.score(X_train, y_train))
print(" Score Test Lasso: %.2f\n" % lasso.score(X_test, y_test))

Lasso:


ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
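
The `ValueError` indicates that `X_train` or `y_train` still contains missing or infinite values, which scikit-learn estimators reject. One possible fix, sketched below on made-up data, is to keep only the rows where the target and every numeric feature are finite before calling `train_test_split`. Whether dropping or imputing those rows is the better choice for this dataset is left open.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X and prices.log_price_usd_per_m2 (values are hypothetical)
X_toy = pd.DataFrame({
    "rooms": [2.0, np.nan, 3.0, 1.0],
    "surface_total_in_m2": [50.0, 80.0, 120.0, 35.0],
})
y_toy = pd.Series([7.6, 7.2, np.inf, 7.9], name="log_price_usd_per_m2")

# A row is usable only if the target and all numeric features are finite
numeric = X_toy.select_dtypes(include="number")
mask = np.isfinite(numeric).all(axis=1) & np.isfinite(y_toy)

X_clean, y_clean = X_toy.loc[mask], y_toy.loc[mask]
print(X_clean.shape, y_clean.shape)
```

Applying the same mask to both objects keeps features and target aligned row by row.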