# 2.2 - Feature selection

In this notebook I will select the important columns of the `listings` dataset using three different methods. The first is correlation, computed with Pearson, Spearman, and Kendall's tau, to look for correlations between the variables and the target and to check for collinearity.

I will also fit an OLS (Ordinary Least Squares) model, essentially a linear regression, to obtain the p-value of each variable's coefficient from its t-test.

Finally, I will fit a random forest and several gradient-boosting models (XGBoost, CatBoost, LightGBM), not to make predictions but to obtain the importance each model assigns to the features.

In [1]:
# libraries

import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [2]:
listings=pd.read_csv('../data/transform_data/listings.csv')

listings=listings.drop(columns=['id', 'host_id']) # drop the ID columns so they don't count toward feature importance

listings=listings[(listings.price>=10) & (listings.price<=196)]  # remove price outliers

# downcast numeric dtypes to reduce memory usage
for c in listings.select_dtypes(include='int'):
    listings[c]=pd.to_numeric(listings[c], downcast='integer')

for c in listings.select_dtypes(include='float'):
    listings[c]=pd.to_numeric(listings[c], downcast='float')
    
listings.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18936 entries, 0 to 21311
Columns: 242 entries, host_is_superhost to suitable_for_events
dtypes: float32(5), int16(10), int32(1), int8(224), object(2)
memory usage: 7.5 MB


### 1) Correlation

In [3]:
def correlacion_precio(metodo: str, umbral: float) -> None:
    
    """
    Computes the correlation matrix of the dataframe
    and prints the columns correlated with the price.
    
    param metodo: string, correlation method (pearson, spearman, kendall)
    param umbral: float, a column is shown only if its correlation is above umbral (or below -umbral)
    
    return: None (only prints)
    """
    
    # public equivalent of the private _get_numeric_data()
    corr=listings.select_dtypes(include='number').corr(method=metodo)
    
    # [1:] skips the first entry, which is price against itself (correlation 1.0)
    print('\n\033[1m' + 'Correlación positiva con el precio.' + '\033[0m')
    print(corr.price[corr.price > umbral].sort_values(ascending=False)[1:])

    print('\n\033[1m' + 'Correlación negativa con el precio.' + '\033[0m')
    print(corr.price[corr.price < -umbral].sort_values(ascending=True))

In [4]:
correlacion_precio('pearson', 0.2)


Correlación positiva con el precio.
accommodates           0.555701
cleaning_fee           0.415570
bedrooms               0.400268
beds                   0.376470
air_conditioning       0.348422
guests_included        0.333567
tv                     0.296742
dishwasher             0.245567
security_deposit       0.240386
crib                   0.219767
family_kid_friendly    0.214021
hair_dryer             0.208491
iron                   0.205943
Name: price, dtype: float64

Correlación negativa con el precio.
room_type_private_room                         -0.548227
calculated_host_listings_count_private_rooms   -0.217194
Name: price, dtype: float64


From the linear point of view of Pearson's $\rho$, most columns show little correlation with the price: only 15 pass the $|\rho| > 0.2$ threshold, and the strongest, `accommodates` (0.56) and `room_type_private_room` (-0.55), are moderate at best. Let's see what happens with Spearman's coefficient, which looks for a monotonic relationship. In a monotonic relationship the variables tend to change together, but not necessarily at a constant rate.
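To make the difference between a linear and a monotonic relationship concrete, here is a small sketch on made-up data (the `demo` dataframe is purely illustrative and unrelated to `listings`). An exponential curve is perfectly monotonic but far from a straight line, so Spearman reaches 1 while Pearson stays visibly below it. The last line shows that Spearman is simply Pearson computed on the ranks.

```python
import numpy as np
import pandas as pd

# illustrative data: y grows with x, but at an increasing rate
x = np.linspace(0, 5, 50)
demo = pd.DataFrame({'x': x, 'y': np.exp(x)})

# Pearson measures how well a straight line describes the relationship
pearson = demo.corr(method='pearson').loc['x', 'y']

# Spearman only asks whether the ordering of x matches the ordering of y
spearman = demo.corr(method='spearman').loc['x', 'y']

# Spearman is the Pearson correlation of the ranked values
rank_pearson = demo.rank().corr(method='pearson').loc['x', 'y']

print(f'pearson={pearson:.3f}  spearman={spearman:.3f}  rank_pearson={rank_pearson:.3f}')
```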

In [5]:
correlacion_precio('spearman', 0.2)


Correlación positiva con el precio.
accommodates                                   0.634239
calculated_host_listings_count_entire_homes    0.559469
beds                                           0.474839
cleaning_fee                                   0.447736
guests_included                                0.410668
bedrooms                                       0.407235
air_conditioning                               0.397884
security_deposit                               0.367975
tv                                             0.347866
dishwasher                                     0.247182
hair_dryer                                     0.242765
family_kid_friendly                            0.239844
iron                                           0.234592
crib                                           0.233816
washer                                         0.214714
coffee_maker                                   0.207969
Name: price, dtype: float64

Correlación negativa con el precio.
[...]

In [6]:
correlacion_precio('kendall', 0.2)


Correlación positiva con el precio.
accommodates                                   0.499628
calculated_host_listings_count_entire_homes    0.418713
beds                                           0.372945
cleaning_fee                                   0.343179
guests_included                                0.329627
air_conditioning                               0.328060
bedrooms                                       0.319284
security_deposit                               0.287525
tv                                             0.286819
dishwasher                                     0.203804
hair_dryer                                     0.200162
Name: price, dtype: float64

Correlación negativa con el precio.
room_type_private_room                         -0.531283
calculated_host_listings_count_private_rooms   -0.445018
Name: price, dtype: float64
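The introduction also sets out to check for collinearity, which the calls above do not show because they only look at the `price` column of the correlation matrix. A minimal sketch of one way to list strongly correlated feature pairs follows. The helper `collinear_pairs`, its 0.8 default threshold, and the toy dataframe are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def collinear_pairs(df: pd.DataFrame, threshold: float = 0.8,
                    method: str = 'pearson') -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds threshold."""
    corr = df.corr(method=method).abs()
    # keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# toy data: b is a linear function of a, c is unrelated to both
toy = pd.DataFrame({
    'a': np.arange(10),
    'b': 2 * np.arange(10) + 1,
    'c': [1, -1] * 5,
})
pairs = collinear_pairs(toy)
print(pairs)
```

On the notebook's data the call would be something like `collinear_pairs(listings.select_dtypes(include='number').drop(columns='price'))`. Which column of each pair to drop remains a judgment call.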


### 2) OLS

In [7]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [8]:
X=listings.select_dtypes(include='number').drop('price', axis=1)  # public equivalent of _get_numeric_data()

y=listings.price

In [9]:
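# np.asarray drops the column names, so the summary labels the regressors x1, x2, ...
# sm.OLS does not add a constant on its own, so this model has no intercept
modelo=sm.OLS(y, np.asarray(X)).fit()

pred=modelo.predict(X)

#modelo.summary()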

In [10]:
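from io import StringIO

# take the coefficient table of the summary as HTML and parse it back into a dataframe
p_values=modelo.summary().tables[1].as_html()

# wrap the string in StringIO: recent pandas versions deprecate passing literal HTML
p_values=pd.read_html(StringIO(p_values), header=0, index_col=0)

p_values=pd.DataFrame(p_values[0])

# restore the column names that were lost when X was converted to an array
p_values['col']=X.columns.tolist()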

In [11]:
#p_values[['P>|t|', 'col']].sort_values(by='P>|t|')

In [23]:
p_values[p_values['P>|t|'] < 0.05].shape

(78, 7)

In [25]:
p_values[p_values['P>|t|'] < 0.05].head(20)

           coef  std err       t  P>|t|     [0.025    0.975]  col
x1       1.9239    0.525   3.667    0.0      0.896     2.952  host_is_superhost
x2    -216.9555   53.961  -4.021    0.0   -322.723  -111.188  latitude
x3   -1643.1315  568.995  -2.888  0.004  -2758.413   -527.85  longitude
x4       3.8637    0.221  17.522    0.0      3.432     4.296  accommodates
x5       3.9902    0.356  11.199    0.0      3.292     4.689  bathrooms
x6       6.5933    0.356   18.53    0.0      5.896     7.291  bedrooms
x8       0.0109    0.001   9.922    0.0      0.009     0.013  security_deposit
x9       0.1185     0.01  12.396    0.0        0.1     0.137  cleaning_fee
x10     -0.7212    0.205   -3.52    0.0     -1.123     -0.32  guests_included
x11     -0.0386    0.016  -2.471  0.013     -0.069    -0.008  extra_people


### 3) Feature importances

In [13]:
from sklearn.ensemble import RandomForestRegressor as RFR

In [14]:
rfr=RFR().fit(X, y)

In [26]:
importancias_rfr=pd.DataFrame(dict(zip(X.columns, rfr.feature_importances_)), index=[0]).T.sort_values(by=0, ascending=False)


importancias_rfr.head(20)

                               0
room_type_private_room  0.300728
bedrooms                0.083579
bathrooms               0.045043
z                       0.029111
security_deposit         0.02821
latitude                0.026777
cleaning_fee            0.026009
number_of_reviews       0.023182
y                       0.021744
x                       0.021538


In [17]:
from xgboost import XGBRegressor as XGBR

from catboost import CatBoostRegressor as CTR

from lightgbm import LGBMRegressor as LGBMR

In [18]:
xgbr=XGBR().fit(X, y)
ctr=CTR(verbose=0).fit(X, y)
lgbmr=LGBMR().fit(X, y)

In [20]:
importancias_xgbr=pd.DataFrame(dict(zip(X.columns, xgbr.feature_importances_)), index=[0]).T.sort_values(by=0, ascending=False)


importancias_xgbr.head()

                                                    0
room_type_private_room                       0.507157
room_type_shared_room                         0.04508
bedrooms                                     0.036214
calculated_host_listings_count_shared_rooms  0.034336
bathrooms                                    0.032189


In [21]:
importancias_ctr=pd.DataFrame(dict(zip(X.columns, ctr.feature_importances_)), index=[0]).T.sort_values(by=0, ascending=False)


importancias_ctr.head(40)

                                                     0
room_type_private_room                       15.366176
accommodates                                  5.823518
bedrooms                                      5.373563
cleaning_fee                                  5.000705
bathrooms                                     4.760987
security_deposit                              4.244257
extra_people                                  3.734797
calculated_host_listings_count_entire_homes   3.383733
latitude                                      3.161369
number_of_reviews                             2.902542


In [22]:
importancias_lgbmr=pd.DataFrame(dict(zip(X.columns, lgbmr.feature_importances_)), index=[0]).T.sort_values(by=0, ascending=False)

importancias_lgbmr.head()

                                  0
cleaning_fee                    199
calculated_host_listings_count  147
latitude                        138
extra_people                    132
minimum_nights                  123
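The four models report importances on different scales: the random forest and XGBoost values are fractions, the CatBoost values read like percentages, and the LightGBM values are integer counts (its default importance counts how often a feature is used in a split). Comparing them directly is therefore misleading. A minimal sketch of one way to put them side by side, assuming a hypothetical helper `compare_importances` and toy numbers that are not taken from the models above: rescale each model's importances to sum to 1, rank the features within each model, and sort by the average rank.

```python
import pandas as pd

def compare_importances(importances: dict) -> pd.DataFrame:
    """importances maps a model name to a Series of raw importances indexed by feature."""
    table = pd.DataFrame(importances)
    # rescale every model's column to sum to 1 so the scales are comparable
    shares = table / table.sum()
    # rank within each model (1 = most important), then average across models
    ranks = shares.rank(ascending=False)
    shares['mean_rank'] = ranks.mean(axis=1)
    return shares.sort_values('mean_rank')

# toy values on three different scales (illustrative only)
demo = {
    'fractions': pd.Series({'a': 0.6, 'b': 0.3, 'c': 0.1}),
    'percentages': pd.Series({'a': 50.0, 'b': 40.0, 'c': 10.0}),
    'split_counts': pd.Series({'a': 20, 'b': 70, 'c': 10}),
}
result = compare_importances(demo)
print(result)
```

With the notebook's objects, each Series would be column `0` of the corresponding dataframe, for example `compare_importances({'rfr': importancias_rfr[0], 'lgbmr': importancias_lgbmr[0]})`.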
