# 2.2 - Feature selection

In this notebook I select the relevant columns of the `listings` dataset using three different methods. The first is correlation, measured with Pearson's, Spearman's and Kendall's tau coefficients, to look for correlations between the variables and the target and to check whether collinearity exists.

The second is an OLS (Ordinary Least Squares) model, essentially a linear regression, used to obtain the p-value of each variable's coefficient from its t-test.

The third is a random forest or XGBoost model, fitted not to predict but to report the importance of each feature.

In [1]:
# libraries

import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


In [2]:
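# load the transformed listings
listings=pd.read_csv('../data/transform_data/listings.csv')

# drop identifier and coordinate columns
listings=listings.drop(columns=['id', 'host_id', 'latitude', 'longitude'])

# keep only listings priced below 150; copy to avoid assigning on a view
listings=listings[listings.price<150].copy()

# downcast integer and float columns to the smallest dtype that fits
for c in listings.select_dtypes(include='int'):
    listings[c]=pd.to_numeric(listings[c], downcast='integer')

for c in listings.select_dtypes(include='float'):
    listings[c]=pd.to_numeric(listings[c], downcast='float')
    
listings.info(memory_usage='deep')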

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17968 entries, 0 to 21311
Columns: 240 entries, host_is_superhost to suitable_for_events
dtypes: float32(3), int16(10), int32(1), int8(224), object(2)
memory usage: 6.9 MB
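
To make the effect of the downcasting loops above easier to see, here is a minimal sketch on a small toy frame. The column names and values are invented for the example and are not taken from `listings`. `pd.to_numeric` with `downcast` picks the smallest dtype that can still hold the values.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, not taken from listings
df = pd.DataFrame({
    'flag': np.array([0, 1, 1, 0], dtype='int64'),
    'n_reviews': np.array([10, 300, 25, 7], dtype='int64'),
    'score': np.array([0.5, 1.25, 3.0, 2.75], dtype='float64'),
})

before = df.memory_usage(deep=True).sum()

# integers shrink to the narrowest signed integer dtype that fits
for c in df.select_dtypes(include='int64'):
    df[c] = pd.to_numeric(df[c], downcast='integer')

# floats shrink to float32 when the values allow it
for c in df.select_dtypes(include='float64'):
    df[c] = pd.to_numeric(df[c], downcast='float')

after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(before, '->', after, 'bytes')
```

Most `listings` columns are 0/1 flags, which is probably why 224 of them end up as `int8` in the output above.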


### 1) Correlation

In [3]:
def correlacion(metodo: str, umbral: float) -> None:
    
    """
    Computes the correlation matrix of the dataframe
    and shows the columns correlated with price.
    
    param metodo: string, correlation method (pearson, spearman, kendall)
    param umbral: float, minimum absolute correlation for a column to be shown
    
    return: None (only prints)
    """
    
    corr=listings._get_numeric_data().corr(method=metodo)

    # positive correlations; [1:] skips the first row, which is price itself (1.0)
    display(corr.price[corr.price > umbral].sort_values(ascending=False)[1:])

    print('\n\033[1m' + 'Negative correlation with price.' + '\033[0m')
    print(corr.price[corr.price < -umbral].sort_values(ascending=True))

In [4]:
correlacion('pearson', 0.1)

accommodates                                   0.575794
cleaning_fee                                   0.428574
air_conditioning                               0.378497
bedrooms                                       0.358680
beds                                           0.356993
guests_included                                0.355316
tv                                             0.323897
dishwasher                                     0.249624
crib                                           0.240532
hair_dryer                                     0.236680
family_kid_friendly                            0.234675
iron                                           0.228396
security_deposit                               0.223139
coffee_maker                                   0.214136
washer                                         0.201105
oven                                           0.182781
kitchen                                        0.176213
high_chair                                     0


Negative correlation with price.
room_type_private_room                         -0.607842
calculated_host_listings_count_private_rooms   -0.236309
lock_on_bedroom_door                           -0.209718
free_street_parking                            -0.178777
room_type_shared_room                          -0.139754
smoking_allowed                                -0.129921
calculated_host_listings_count_shared_rooms    -0.127413
property_type_house                            -0.126582
pets_live_on_this_property                     -0.109517
y                                              -0.105206
Name: price, dtype: float64


From the linear point of view of Pearson's $\rho$, most variables show only a weak correlation with price; the strongest are `room_type_private_room` (-0.61) and `accommodates` (0.58). Let's see what happens with Spearman's coefficient, which looks for a monotonic relationship. In a monotonic relationship the variables tend to change together, but not necessarily at a constant rate.
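
As a quick illustration of the difference, here is a minimal sketch on made-up data (not from `listings`) where `y` always grows with `x`, but exponentially. Spearman's coefficient is simply Pearson's coefficient computed on the ranks, so it stays at 1 while Pearson's drops.

```python
import numpy as np
import pandas as pd

# hypothetical data: y grows with x, but exponentially rather than linearly
x = pd.Series(np.arange(1, 21, dtype=float))
y = np.exp(x / 2.0)

pearson = x.corr(y)                   # linear association
spearman = x.rank().corr(y.rank())    # Pearson on the ranks = Spearman

print(f'pearson:  {pearson:.3f}')
print(f'spearman: {spearman:.3f}')
```

A variable whose Spearman value is clearly higher than its Pearson value is a hint that its relationship with price is monotonic but not linear.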

In [5]:
correlacion('spearman', 0.2)

accommodates                                   0.637747
calculated_host_listings_count_entire_homes    0.575809
beds                                           0.461002
cleaning_fee                                   0.459942
guests_included                                0.433070
air_conditioning                               0.403420
bedrooms                                       0.369407
security_deposit                               0.366275
tv                                             0.353078
hair_dryer                                     0.252541
family_kid_friendly                            0.245876
iron                                           0.240857
dishwasher                                     0.240512
crib                                           0.239420
coffee_maker                                   0.229122
washer                                         0.216259
Name: price, dtype: float64


Negative correlation with price.
room_type_private_room                         -0.660642
calculated_host_listings_count_private_rooms   -0.597380
lock_on_bedroom_door                           -0.222941
Name: price, dtype: float64


In [6]:
correlacion('kendall', 0.2)

accommodates                                   0.502423
calculated_host_listings_count_entire_homes    0.431288
beds                                           0.362336
cleaning_fee                                   0.348891
guests_included                                0.346732
air_conditioning                               0.332910
tv                                             0.291366
bedrooms                                       0.289737
security_deposit                               0.285226
hair_dryer                                     0.208401
family_kid_friendly                            0.202902
Name: price, dtype: float64


Negative correlation with price.
room_type_private_room                         -0.545174
calculated_host_listings_count_private_rooms   -0.456566
Name: price, dtype: float64
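
The lists above only compare each variable with the price. To check collinearity between the features themselves, something like the following sketch could be used. The helper name `collinear_pairs`, the 0.8 threshold and the toy data are all assumptions made for the example, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def collinear_pairs(df: pd.DataFrame, threshold: float = 0.8,
                    method: str = 'pearson') -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds threshold."""
    corr = df.corr(method=method).abs()
    # keep only the upper triangle (k=1 excludes the diagonal),
    # so that each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# hypothetical toy data: 'beds' is built to track 'accommodates' closely
rng = np.random.default_rng(0)
accommodates = rng.integers(1, 9, size=200)
toy = pd.DataFrame({
    'accommodates': accommodates,
    'beds': accommodates + rng.integers(0, 2, size=200),
    'noise': rng.normal(size=200),
})

result = collinear_pairs(toy)
print(result)
```

On the real data this would be called on the numeric columns without `price`; variables such as `accommodates`, `beds`, `bedrooms` and `guests_included` are natural candidates to inspect.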


### 2) OLS

In [7]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [8]:
X=listings._get_numeric_data().drop('price', axis=1)

y=listings.price

In [9]:
# fit OLS on the raw array; note that sm.OLS does not add an intercept by itself
modelo=sm.OLS(y, np.asarray(X)).fit()

pred=modelo.predict(X)

#modelo.summary()

In [10]:
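from io import StringIO

# take the coefficient table of the summary as HTML and parse it back into a DataFrame
p_values=modelo.summary().tables[1].as_html()

# recent pandas versions expect a file-like object rather than a raw HTML string
p_values=pd.read_html(StringIO(p_values), header=0, index_col=0)

p_values=pd.DataFrame(p_values[0])

# rows are labelled x1..xN because the model was fitted on an array; attach the names
p_values['col']=X.columns.tolist()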

In [13]:
p_values[['P>|t|', 'col']].sort_values(by='P>|t|')

      P>|t|                      col
x1      0.0        host_is_superhost
x203    0.0  luggage_dropoff_allowed
x201    0.0                   washer
x193    0.0             coffee_maker
x190    0.0               dishwasher
x185    0.0         patio_or_balcony
x182    0.0                 elevator
x171    0.0         fireplace_guards
x160    0.0               bed_linens
x149    0.0                 internet


In [None]:
p_values[p_values['P>|t|'] < 0.05]['col']

### 3) Feature importances

In [None]:
from sklearn.ensemble import RandomForestRegressor as RFR

In [None]:
rfr=RFR().fit(X, y)

In [None]:
dict(zip(X.columns, rfr.feature_importances_))   

In [None]:
X.info()

In [None]:
from xgboost import XGBRegressor as XGBR

from catboost import CatBoostRegressor as CTR

from lightgbm import LGBMRegressor as LGBMR

In [None]:
xgbr=XGBR().fit(X, y)
ctr=CTR(verbose=0).fit(X, y)
lgbmr=LGBMR().fit(X, y)

In [None]:
rfr.feature_importances_.sum()

In [None]:
dict(zip(X.columns, xgbr.feature_importances_))   

In [None]:
dict(zip(X.columns, ctr.feature_importances_))   

In [None]:
dict(zip(X.columns, lgbmr.feature_importances_))  
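
The dictionaries above are hard to read with more than 200 columns. A sorted `pd.Series` makes the ranking easier to inspect. This is a minimal sketch on made-up data with a random forest only; the names `X_toy`, `y_toy`, `strong`, `weak` and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# hypothetical data: 'strong' dominates the target, 'noise' plays no role
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(300, 3)),
                     columns=['strong', 'weak', 'noise'])
y_toy = (5.0 * X_toy['strong'] + 0.5 * X_toy['weak']
         + rng.normal(scale=0.1, size=300))

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_toy, y_toy)

# label the importances with the column names and sort them
importances = pd.Series(model.feature_importances_,
                        index=X_toy.columns).sort_values(ascending=False)
print(importances)
```

The random forest importances sum to 1, but the other libraries may report theirs on a different scale, so it is probably safer to compare the rankings across models than the raw values.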