# 2.1 - Feature selection

In this notebook I select the important columns of the `listings` dataset using three different methods. The first is correlation, computed with Pearson's, Spearman's and Kendall's tau coefficients, to look for relationships between the features and the target and to check for collinearity.

The second is OLS (Ordinary Least Squares), essentially a linear regression, used to obtain a p-value for each variable from the t-test on its coefficient (the `P>|t|` column of the model summary).

The third is a tree-based model such as a random forest or XGBoost, fitted not to make predictions but to report the importance of each feature.

In [1]:
# libraries

import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


In [2]:
listings=pd.read_csv('../data/transform_data/listings.csv')

# Downcast integer and float columns to the smallest dtype that fits, to save memory
for c in listings.select_dtypes(include='int'):
    listings[c]=pd.to_numeric(listings[c], downcast='integer')

for c in listings.select_dtypes(include='float'):
    listings[c]=pd.to_numeric(listings[c], downcast='float')
    
listings.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21312 entries, 0 to 21311
Columns: 244 entries, id to carbon_monoxide_detector
dtypes: float32(5), int16(10), int32(3), int8(224), object(2)
memory usage: 8.4 MB


### 1) Correlation

In [3]:
def correlacion(metodo: str, umbral: float) -> None:
    
    """
    Compute the correlation matrix of the dataframe
    and display the columns correlated with the price.
    
    param metodo: string, correlation method (pearson, spearman, kendall)
    param umbral: float, minimum absolute correlation for a column to be shown
    
    return: None (only prints)
    """
    
    corr=listings.select_dtypes(include='number').corr(method=metodo)

    # price correlates perfectly with itself, so drop it from the listing
    precio=corr.price.drop('price')

    print('\033[1m' + 'Positive correlation with price.' + '\033[0m')
    display(precio[precio > umbral].sort_values(ascending=False))

    print()
    print('\033[1m' + 'Negative correlation with price.' + '\033[0m')
    display(precio[precio < -umbral].sort_values(ascending=True))

In [4]:
correlacion('pearson', 0.1)

Positive correlation with price.


accommodates                    0.104994
property_type_boutique_hotel    0.101548
Name: price, dtype: float64


Negative correlation with price.


dishes_and_silverware   -0.103443
Name: price, dtype: float64

From the linear point of view of Pearson's $\rho$, there is practically no correlation with the price: the largest coefficient barely exceeds 0.1. Let's see what Spearman's coefficient shows, since it looks for a monotonic relationship, that is, one in which the variables tend to change together but not necessarily at a constant rate.
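
To make the difference concrete, here is a minimal sketch on made-up data (not the `listings` dataset): `y` grows exponentially with `x`, so the relationship is perfectly monotonic but far from linear. It relies on the fact that Spearman's coefficient is Pearson's coefficient computed on ranks; the names `toy`, `pearson` and `spearman` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up data: y increases with x, but at an accelerating rate
toy = pd.DataFrame({'x': np.arange(1.0, 21.0)})
toy['y'] = np.exp(toy['x'])

# Pearson measures how well a straight line describes the relationship
pearson = toy['x'].corr(toy['y'])

# Spearman is Pearson applied to the ranks, so only the ordering matters
spearman = toy['x'].rank().corr(toy['y'].rank())

print(f'pearson={pearson:.3f}  spearman={spearman:.3f}')
```

Spearman comes out at 1 because the ordering of `y` matches that of `x` exactly, while Pearson is noticeably lower because a straight line fits the curve poorly.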

In [5]:
correlacion('spearman', 0.2)

Positive correlation with price.


accommodates                                   0.563101
calculated_host_listings_count_entire_homes    0.450540
beds                                           0.436126
bedrooms                                       0.394437
air_conditioning                               0.336642
cleaning_fee                                   0.317212
guests_included                                0.303152
security_deposit                               0.276899
tv                                             0.265021
bathrooms                                      0.230433
availability_30                                0.215029
availability_60                                0.211080
availability_90                                0.202786
Name: price, dtype: float64


Negative correlation with price.


room_type_private_room                         -0.525274
calculated_host_listings_count_private_rooms   -0.457793
free_street_parking                            -0.206946
Name: price, dtype: float64

In [6]:
correlacion('kendall', 0.2)

Positive correlation with price.


accommodates                                   0.442502
beds                                           0.342913
calculated_host_listings_count_entire_homes    0.339121
bedrooms                                       0.310659
air_conditioning                               0.277069
cleaning_fee                                   0.259348
guests_included                                0.247116
security_deposit                               0.221687
tv                                             0.218122
Name: price, dtype: float64


Negative correlation with price.


room_type_private_room                         -0.432321
calculated_host_listings_count_private_rooms   -0.349861
Name: price, dtype: float64
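
The introduction mentions collinearity, which `correlacion` does not report because it only looks at the `price` column of the correlation matrix. A minimal sketch of one way to list strongly correlated feature pairs follows; `collinear_pairs`, its default threshold of 0.8 and the toy columns are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def collinear_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Made-up data: b is almost a multiple of a, c is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

pairs = collinear_pairs(toy)
print(pairs)
```

On the real data the call would presumably be `collinear_pairs(listings.select_dtypes(include='number').drop('price', axis=1))`, possibly with a different `method` passed to `corr`.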

### 2) OLS

In [7]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [8]:
# Features: every numeric column except the target
X=listings.select_dtypes(include='number').drop('price', axis=1)

# Target
y=listings.price

In [9]:
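# No explicit intercept is added here (see sm.add_constant); passing a plain
# array also drops the column names, so the summary labels them x1, x2, ...
modelo=sm.OLS(y, np.asarray(X)).fit()

# In-sample predictions (not used further in this section)
pred=modelo.predict(X)

modelo.summary()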

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.108 |
| Model: | OLS | Adj. R-squared: | 0.098 |
| Method: | Least Squares | F-statistic: | 10.77 |
| Date: | Tue, 01 Nov 2022 | Prob (F-statistic): | 0.0 |
| Time: | 22:11:21 | Log-Likelihood: | -154590.0 |
| No. Observations: | 21312 | AIC: | 309700.0 |
| Df Residuals: | 21074 | BIC: | 311600.0 |
| Df Model: | 237 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | 7.653e-07 | 3.49e-07 | 2.196 | 0.028 | 8.21e-08 | 1.45e-06 |
| x2 | 3.167e-08 | 3.16e-08 | 1.001 | 0.317 | -3.03e-08 | 9.37e-08 |
| x3 | 4.2323 | 6.617 | 0.640 | 0.522 | -8.737 | 17.202 |
| x4 | -607.4961 | 643.320 | -0.944 | 0.345 | -1868.452 | 653.460 |
| x5 | -7069.1269 | 6796.891 | -1.040 | 0.298 | -2.04e+04 | 6253.300 |
| x6 | 12.5991 | 2.578 | 4.887 | 0.000 | 7.546 | 17.652 |
| x7 | -7.5453 | 4.287 | -1.760 | 0.078 | -15.948 | 0.857 |
| x8 | 17.0439 | 4.285 | 3.978 | 0.000 | 8.646 | 25.442 |
| x9 | 5.9981 | 2.646 | 2.267 | 0.023 | 0.811 | 11.185 |

| | | | |
|---|---|---|---|
| Omnibus: | 42521.609 | Durbin-Watson: | 1.455 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 122498624.753 |
| Skew: | 16.342 | Prob(JB): | 0.0 |
| Kurtosis: | 372.974 | Cond. No. | 1.54e+24 |


In [10]:
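from io import StringIO

# Export the coefficient table to HTML and parse it back into a DataFrame
p_values=modelo.summary().tables[1].as_html()

# Newer pandas versions expect a file-like object rather than a raw HTML string
p_values=pd.read_html(StringIO(p_values), header=0, index_col=0)[0]

# Rows are labelled x1, x2, ...; attach the original column names
p_values['col']=X.columns.tolist()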

In [11]:
p_values

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] | col |
|---|---|---|---|---|---|---|---|
| x1 | 7.653e-07 | 3.49e-07 | 2.196 | 0.028 | 8.21e-08 | 1.45e-06 | id |
| x2 | 3.167e-08 | 3.16e-08 | 1.001 | 0.317 | -3.03e-08 | 9.37e-08 | host_id |
| x3 | 4.2323 | 6.617 | 0.64 | 0.522 | -8.737 | 17.202 | host_is_superhost |
| x4 | -607.4961 | 643.32 | -0.944 | 0.345 | -1868.452 | 653.46 | latitude |
| x5 | -7069.127 | 6796.891 | -1.04 | 0.298 | -20400.0 | 6253.3 | longitude |
| x6 | 12.5991 | 2.578 | 4.887 | 0.0 | 7.546 | 17.652 | accommodates |
| x7 | -7.5453 | 4.287 | -1.76 | 0.078 | -15.948 | 0.857 | bathrooms |
| x8 | 17.0439 | 4.285 | 3.978 | 0.0 | 8.646 | 25.442 | bedrooms |
| x9 | 5.9981 | 2.646 | 2.267 | 0.023 | 0.811 | 11.185 | beds |
| x10 | -0.0067 | 0.012 | -0.536 | 0.592 | -0.031 | 0.018 | security_deposit |


In [12]:
p_values[p_values['P>|t|'] < 0.05]['col']

x1                                               id
x6                                     accommodates
x8                                         bedrooms
x9                                             beds
x11                                    cleaning_fee
x12                                 guests_included
x16                                 availability_30
x20                               number_of_reviews
x21                           number_of_reviews_ltm
x25     calculated_host_listings_count_shared_rooms
x32                    property_type_boutique_hotel
x37                            property_type_chalet
x45                             property_type_hotel
x59                            room_type_hotel_room
x60                          room_type_private_room
x61                           room_type_shared_room
x67                                      bed_linens
x68                                             gym
x73                       well_lit_path_to_entrance
x77         

### 3) Feature importances

In [13]:
from sklearn.ensemble import RandomForestRegressor as RFR

In [14]:
rfr=RFR().fit(X, y)

In [15]:
dict(zip(X.columns, rfr.feature_importances_))   

{'id': 0.07980740454503804,
 'host_id': 0.07279704486623591,
 'host_is_superhost': 0.0052914669412842865,
 'latitude': 0.016515467165872145,
 'longitude': 0.019666156039164356,
 'accommodates': 0.025637032833951266,
 'bathrooms': 0.01648935935034492,
 'bedrooms': 0.018427614263255062,
 'beds': 0.022505341833229907,
 'security_deposit': 0.011043809668947482,
 'cleaning_fee': 0.02833609808091595,
 'guests_included': 0.026653912780520407,
 'extra_people': 0.014338340157657128,
 'minimum_nights': 0.014470464688489648,
 'maximum_nights': 0.01823893701948428,
 'availability_30': 0.01594034746498221,
 'availability_60': 0.010590345999194041,
 'availability_90': 0.025366381747847493,
 'availability_365': 0.022248363295699347,
 'number_of_reviews': 0.0433148315368604,
 'number_of_reviews_ltm': 0.02803419398595213,
 'calculated_host_listings_count': 0.023965481629241565,
 'calculated_host_listings_count_entire_homes': 0.02163879045729368,
 'calculated_host_listings_count_private_rooms': 0.041272
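
A plain dictionary with 241 entries is hard to scan. A minimal sketch of ranking the importances instead, shown on made-up data so that it runs on its own; `toy_X`, `toy_y`, `toy_rf` and the column names are hypothetical, and `n_estimators=50` and `random_state=0` are arbitrary choices for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up data: the target depends on 'signal' only
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    'signal': rng.normal(size=300),
    'noise_a': rng.normal(size=300),
    'noise_b': rng.normal(size=300),
})
toy_y = 5.0 * toy_X['signal'] + rng.normal(scale=0.5, size=300)

toy_rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(toy_X, toy_y)

# Label each importance with its column name and sort from most to least important
importances = pd.Series(toy_rf.feature_importances_, index=toy_X.columns)
ranked = importances.sort_values(ascending=False)
print(ranked)
```

On the real data the equivalent would be `pd.Series(rfr.feature_importances_, index=X.columns).sort_values(ascending=False)`, optionally followed by `.head(20)`.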

In [16]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21312 entries, 0 to 21311
Columns: 241 entries, id to carbon_monoxide_detector
dtypes: float32(5), int16(9), int32(3), int8(224)
memory usage: 5.6 MB


In [17]:
from xgboost import XGBRegressor as XGBR

from catboost import CatBoostRegressor as CTR

from lightgbm import LGBMRegressor as LGBMR

In [18]:
xgbr=XGBR().fit(X, y)
ctr=CTR(verbose=0).fit(X, y)
lgbmr=LGBMR().fit(X, y)

In [19]:
rfr.feature_importances_.sum()

1.0

In [20]:
dict(zip(X.columns, xgbr.feature_importances_))   # mean gain per split (XGBoost's default importance type)

{'id': 0.012306694,
 'host_id': 0.013455192,
 'host_is_superhost': 0.009549061,
 'latitude': 0.0030277763,
 'longitude': 0.0045347256,
 'accommodates': 0.00978895,
 'bathrooms': 0.008163057,
 'bedrooms': 0.011647178,
 'beds': 0.010540812,
 'security_deposit': 0.0065532043,
 'cleaning_fee': 0.010344079,
 'guests_included': 0.008803765,
 'extra_people': 0.00613107,
 'minimum_nights': 0.0084869005,
 'maximum_nights': 0.0034942206,
 'availability_30': 0.009556463,
 'availability_60': 0.002051389,
 'availability_90': 0.004279091,
 'availability_365': 0.006673673,
 'number_of_reviews': 0.011495705,
 'number_of_reviews_ltm': 0.023276342,
 'calculated_host_listings_count': 0.008977962,
 'calculated_host_listings_count_entire_homes': 0.009859823,
 'calculated_host_listings_count_private_rooms': 0.04585993,
 'calculated_host_listings_count_shared_rooms': 0.03459963,
 'x': 0.0062351744,
 'y': 0.008899557,
 'z': 0.0,
 'property_type_apartment': 0.00057977735,
 'property_type_barn': 0.0,
 'property

In [21]:
dict(zip(X.columns, ctr.feature_importances_))   

{'id': 10.097670931172573,
 'host_id': 7.745904317206861,
 'host_is_superhost': 1.0921397941831592,
 'latitude': 0.8352870510955389,
 'longitude': 2.5043622566064925,
 'accommodates': 2.63143470720288,
 'bathrooms': 0.49500159810203453,
 'bedrooms': 0.7168537383880924,
 'beds': 1.827466766918601,
 'security_deposit': 1.7480121675862808,
 'cleaning_fee': 1.0312822688222505,
 'guests_included': 1.2885345265810895,
 'extra_people': 2.0552447879462923,
 'minimum_nights': 2.621303037816319,
 'maximum_nights': 4.035750790857667,
 'availability_30': 0.7310179572819214,
 'availability_60': 0.973645398224882,
 'availability_90': 0.4076366536148636,
 'availability_365': 1.9774674384843312,
 'number_of_reviews': 4.416277622010214,
 'number_of_reviews_ltm': 2.231121765468031,
 'calculated_host_listings_count': 6.841092491754845,
 'calculated_host_listings_count_entire_homes': 10.747557826094726,
 'calculated_host_listings_count_private_rooms': 1.3956747977826724,
 'calculated_host_listings_count_s

In [22]:
dict(zip(X.columns, lgbmr.feature_importances_))  

{'id': 198,
 'host_id': 163,
 'host_is_superhost': 11,
 'latitude': 103,
 'longitude': 88,
 'accommodates': 100,
 'bathrooms': 36,
 'bedrooms': 47,
 'beds': 60,
 'security_deposit': 97,
 'cleaning_fee': 177,
 'guests_included': 38,
 'extra_people': 165,
 'minimum_nights': 44,
 'maximum_nights': 94,
 'availability_30': 35,
 'availability_60': 23,
 'availability_90': 32,
 'availability_365': 79,
 'number_of_reviews': 107,
 'number_of_reviews_ltm': 54,
 'calculated_host_listings_count': 120,
 'calculated_host_listings_count_entire_homes': 105,
 'calculated_host_listings_count_private_rooms': 66,
 'calculated_host_listings_count_shared_rooms': 75,
 'x': 80,
 'y': 60,
 'z': 34,
 'property_type_apartment': 6,
 'property_type_barn': 0,
 'property_type_bed_and_breakfast': 0,
 'property_type_boutique_hotel': 34,
 'property_type_bungalow': 0,
 'property_type_camper_rv': 0,
 'property_type_casa_particular_cuba': 0,
 'property_type_cave': 0,
 'property_type_chalet': 12,
 'property_type_condominium
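
The four libraries report importances on different scales, so the raw numbers above cannot be compared across models: the random forest values sum to 1 (as the `rfr.feature_importances_.sum()` cell shows), and, by default, XGBoost's are likewise normalized, CatBoost's sum to 100, and LightGBM's are split counts. A minimal sketch of putting them on a common footing follows, using four features copied (rounded) from the outputs above; the shares are relative to this four-feature subset only, and the `raw`, `shares` and `mean_rank` names are illustrative.

```python
import pandas as pd

# Importances copied (rounded) from the four outputs above, for four features only
raw = pd.DataFrame(
    {
        'random_forest': [0.0798, 0.0728, 0.0256, 0.0283],
        'xgboost': [0.0123, 0.0135, 0.0098, 0.0103],
        'catboost': [10.098, 7.746, 2.631, 1.031],
        'lightgbm': [198, 163, 100, 177],
    },
    index=['id', 'host_id', 'accommodates', 'cleaning_fee'],
)

# Rescale each model's column so that it sums to 1 within this subset
shares = raw / raw.sum()

# Rank features within each model (1 = most important) and average the ranks
mean_rank = raw.rank(ascending=False).mean(axis=1).sort_values()

print(shares.round(3))
print(mean_rank)
```

Averaging ranks is only one possible way to combine the models. Note also that `id` and `host_id` come out on top here; since they are identifiers, that may be worth a closer look before trusting the ranking.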