# 2.2 - Feature selection

In this notebook I select the important columns of the `listings` dataset using three different methods. The first is correlation, computed with Pearson, Spearman, and Kendall's tau, to look for relationships between each variable and the target and to check for collinearity.

The second is OLS (Ordinary Least Squares), essentially a linear regression, used to obtain a p-value for each variable from the t-test on its coefficient.

The third is tree-based models such as random forest or XGBoost, used not to predict but to report the importance of each feature.
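As a rough illustration of why all three correlation coefficients are worth computing, the sketch below compares them on synthetic data (not the `listings` dataset). The frame `demo` and its columns are made up for the example: `y` depends on `x` monotonically but not linearly, so the rank-based coefficients should report a stronger relationship than Pearson.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data (assumption, not the listings dataset): y grows
# monotonically but non-linearly with x; noise is unrelated to both.
x = rng.uniform(0, 5, size=500)
demo = pd.DataFrame({
    'x': x,
    'y': np.exp(x),                  # monotonic, strongly non-linear
    'noise': rng.normal(size=500),   # independent of x
})

# Correlation of every column with y under each method
resumen = pd.DataFrame({
    metodo: demo.corr(method=metodo)['y']
    for metodo in ('pearson', 'spearman', 'kendall')
})
print(resumen.round(3))
```

Pearson only measures linear association, while Spearman and Kendall work on ranks, which is why a variable can look more relevant under one method than under another.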

In [1]:
# libraries

import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [2]:
# load data
listings=pd.read_csv('../data/transform_data/listings_normal.csv')

listings=listings.drop(columns=['id', 'host_id']) # drop the id columns so they do not count towards importance

listings=listings[(listings.price>=10) & (listings.price<=196)]  # remove price outliers

# downcast numeric columns to smaller dtypes
for c in listings.select_dtypes(include='int'):
    listings[c]=pd.to_numeric(listings[c], downcast='integer')

for c in listings.select_dtypes(include='float'):
    listings[c]=pd.to_numeric(listings[c], downcast='float')
    
listings.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18936 entries, 0 to 21311
Columns: 242 entries, host_is_superhost to bathtub_with_bath_chair
dtypes: float32(25), int16(1), int8(214), object(2)
memory usage: 8.3 MB


### 1) Correlation

In [3]:
def correlacion_precio(metodo: str, umbral: float) -> None:
    
    """
    Compute the correlation matrix of the dataframe
    and print the columns correlated with price.
    
    param metodo: string, correlation method (pearson, spearman, kendall)
    param umbral: float, minimum absolute correlation for a column to be shown
    
    return: None (only prints)
    """
    
    corr=listings.select_dtypes(include='number').corr(method=metodo)
    
    # drop price itself (correlation 1 with itself) before filtering
    corr_precio=corr.price.drop('price')
    
    print('\n\033[1m' + f'{metodo.capitalize()} -- Positive correlation with price.' + '\033[0m')
    print(corr_precio[corr_precio > umbral].sort_values(ascending=False))

    print('\n\033[1m' + 'Negative correlation with price.' + '\033[0m')
    print(corr_precio[corr_precio < -umbral].sort_values(ascending=True))

In [4]:
correlacion_precio('pearson', 0.2)


Pearson -- Positive correlation with price.
accommodates           0.555701
cleaning_fee           0.415570
bedrooms               0.400268
beds                   0.376470
air_conditioning       0.348422
guests_included        0.333567
tv                     0.296742
dishwasher             0.245567
security_deposit       0.240386
crib                   0.219767
family_kid_friendly    0.214021
hair_dryer             0.208491
iron                   0.205943
Name: price, dtype: float64

Negative correlation with price.
room_type_private_room                         -0.548227
calculated_host_listings_count_private_rooms   -0.217194
Name: price, dtype: float64


In [5]:
correlacion_precio('spearman', 0.3)


Spearman -- Positive correlation with price.
accommodates                                   0.634239
calculated_host_listings_count_entire_homes    0.559469
beds                                           0.474839
cleaning_fee                                   0.447736
guests_included                                0.410668
bedrooms                                       0.407235
air_conditioning                               0.397884
security_deposit                               0.367975
tv                                             0.347866
Name: price, dtype: float64

Negative correlation with price.
room_type_private_room                         -0.644362
calculated_host_listings_count_private_rooms   -0.581432
Name: price, dtype: float64


In [6]:
correlacion_precio('kendall', 0.2)


Kendall -- Positive correlation with price.
accommodates                                   0.499628
calculated_host_listings_count_entire_homes    0.418713
beds                                           0.372945
cleaning_fee                                   0.343179
guests_included                                0.329627
air_conditioning                               0.328060
bedrooms                                       0.319284
security_deposit                               0.287525
tv                                             0.286819
dishwasher                                     0.203804
hair_dryer                                     0.200162
Name: price, dtype: float64

Negative correlation with price.
room_type_private_room                         -0.531283
calculated_host_listings_count_private_rooms   -0.445018
Name: price, dtype: float64


From the correlation standpoint, whether linear (Pearson) or monotonic and rank-based (Spearman, Kendall), the variables most related to price include the private room type, the number of guests accommodated, beds and bedrooms, the security deposit, and the cleaning fee. Let's see what ordinary least squares says.
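The correlations above only compare each variable with the price. One possible way to also check collinearity between the features themselves is sketched below. The helper `pares_colineales`, its default threshold of 0.8, and the synthetic frame `demo` are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def pares_colineales(df: pd.DataFrame, umbral: float = 0.8, metodo: str = 'pearson') -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds `umbral` (hypothetical helper)."""
    corr = df.corr(method=metodo).abs()
    # Keep only the strict upper triangle: each pair appears once, diagonal excluded
    mascara = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pares = corr.where(mascara).stack()
    # The comparison also discards any NaN left over from the masked cells
    return pares[pares > umbral].sort_values(ascending=False)

# Synthetic data (assumption): b is almost a copy of a, c is independent
rng = np.random.default_rng(42)
a = rng.normal(size=1000)
demo = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.normal(size=1000),
    'c': rng.normal(size=1000),
})
print(pares_colineales(demo, umbral=0.8))
```

On the real data the same helper could be called on the numeric columns of `listings` without `price`. Variables such as `accommodates`, `beds`, and `bedrooms` may well turn out to be correlated with each other, which would matter for interpreting the OLS coefficients in the next section.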

### 2) OLS

In [7]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [8]:
X=listings.select_dtypes(include='number').drop('price', axis=1)

y=listings.price

In [9]:
# fit without an intercept (no constant column is added); np.asarray drops the column names
modelo=sm.OLS(y, np.asarray(X)).fit()

pred=modelo.predict(X)

In [10]:
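# parse the coefficient table of the summary (HTML) into a dataframe
from io import StringIO

p_values=modelo.summary().tables[1].as_html()

p_values=pd.read_html(StringIO(p_values), header=0, index_col=0)

p_values=pd.DataFrame(p_values[0])

# restore the column names lost when X was converted to an array
p_values['col']=X.columns.tolist()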

In [11]:
p_values[p_values['P>|t|'] < 0.05].shape

(79, 7)

In [12]:
p_values[p_values['P>|t|'] < 0.05].head(10)

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] | col |
|---|---|---|---|---|---|---|---|
| x1 | 1.9239 | 0.525 | 3.668 | 0.0 | 0.896 | 2.952 | host_is_superhost |
| x2 | -216.2162 | 53.958 | -4.007 | 0.0 | -321.979 | -110.453 | latitude |
| x3 | -1643.0962 | 568.964 | -2.888 | 0.004 | -2758.317 | -527.875 | longitude |
| x4 | 7.8169 | 0.446 | 17.523 | 0.0 | 6.943 | 8.691 | accommodates |
| x5 | 2.8501 | 0.254 | 11.2 | 0.0 | 2.351 | 3.349 | bathrooms |
| x6 | 5.6689 | 0.306 | 18.531 | 0.0 | 5.069 | 6.269 | bedrooms |
| x8 | 2.4659 | 0.249 | 9.922 | 0.0 | 1.979 | 2.953 | security_deposit |
| x9 | 4.0422 | 0.326 | 12.397 | 0.0 | 3.403 | 4.681 | cleaning_fee |
| x10 | -1.013 | 0.288 | -3.52 | 0.0 | -1.577 | -0.449 | guests_included |
| x11 | -0.6287 | 0.254 | -2.471 | 0.013 | -1.128 | -0.13 | extra_people |


With the p-values from the per-coefficient t-test, OLS flags 79 variables as statistically significant at the 0.05 level. Keep in mind that this only captures linear relationships. Next, I fit four models and compare the feature importances each one reports before making the final selection.

### 3) Feature importances

In [13]:
from sklearn.ensemble import RandomForestRegressor as RFR

from xgboost import XGBRegressor as XGBR

from lightgbm import LGBMRegressor as LGBMR

from catboost import CatBoostRegressor as CTR

In [14]:
def extraer_importancias(modelo: object, X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    
    """
    Fit the given model and return its feature importances.
    
    param modelo: model to be trained and from which importances are extracted
    param X: feature data
    param y: target data
    
    return: dataframe with the importance of each feature, sorted in descending order
    """
    
    modelo.fit(X, y)
    
    impor_df=pd.DataFrame({'importancias': modelo.feature_importances_}, 
                          index=X.columns).sort_values(by='importancias', ascending=False)
    
    return impor_df

In [15]:
print('\n\033[1m' + 'Random Forest Regressor')
extraer_importancias(RFR(), X, y).head(10)


Random Forest Regressor

| | importancias |
|---|---|
| room_type_private_room | 0.299637 |
| bedrooms | 0.082683 |
| bathrooms | 0.045598 |
| security_deposit | 0.029177 |
| z | 0.02857 |
| latitude | 0.02709 |
| cleaning_fee | 0.02666 |
| number_of_reviews | 0.023267 |
| y | 0.021763 |
| x | 0.021682 |


In [16]:
print('\n\033[1m' + 'XG Boosting Regressor')
extraer_importancias(XGBR(), X, y).head(10)


XG Boosting Regressor

| | importancias |
|---|---|
| room_type_private_room | 0.506987 |
| room_type_shared_room | 0.045065 |
| bedrooms | 0.036202 |
| calculated_host_listings_count_shared_rooms | 0.034325 |
| bathrooms | 0.032179 |
| accommodates | 0.012216 |
| dishwasher | 0.010413 |
| dryer | 0.006938 |
| free_street_parking | 0.006715 |
| security_deposit | 0.006712 |


In [17]:
print('\n\033[1m' + 'LightGBM Regressor')
extraer_importancias(LGBMR(), X, y).head(10)


LightGBM Regressor

| | importancias |
|---|---|
| cleaning_fee | 205 |
| latitude | 155 |
| calculated_host_listings_count | 149 |
| minimum_nights | 127 |
| extra_people | 123 |
| number_of_reviews | 119 |
| y | 115 |
| accommodates | 112 |
| calculated_host_listings_count_entire_homes | 110 |
| x | 107 |


In [18]:
print('\n\033[1m' + 'Catboost Regressor')
extraer_importancias(CTR(verbose=0), X, y).head(10)


Catboost Regressor

| | importancias |
|---|---|
| room_type_private_room | 14.399498 |
| accommodates | 5.656601 |
| bedrooms | 5.52163 |
| cleaning_fee | 5.20972 |
| security_deposit | 4.557357 |
| bathrooms | 4.458555 |
| extra_people | 3.885898 |
| calculated_host_listings_count_entire_homes | 3.613878 |
| number_of_reviews | 3.052007 |
| z | 2.715579 |


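The four models agree reasonably well on which features matter. Only LightGBM differs noticeably. Note that LightGBM's default importance counts the number of splits that use each feature rather than the gain, so its values are not on the same scale as the others. I will rely on the importances extracted with CatBoost, which are expressed as percentages, to select the variables that will feed the final model, using an importance threshold a little under 1% (0.7% in the code that follows).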
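Since each library reports importances in its own units (fractions that sum to 1 for random forest and XGBoost, split counts by default for LightGBM, percentages for CatBoost), one way to compare them or to apply a single percentage threshold is to rescale them first. The sketch below does this for a scikit-learn random forest on synthetic data. The helper `importancias_porcentaje`, the data, and the 5% threshold are illustrative assumptions only.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

def importancias_porcentaje(modelo, X: pd.DataFrame) -> pd.Series:
    """Rescale a fitted model's feature_importances_ so they sum to 100 (hypothetical helper)."""
    imp = pd.Series(modelo.feature_importances_, index=X.columns, dtype=float)
    return (100 * imp / imp.sum()).sort_values(ascending=False)

# Synthetic data (assumption): 10 features, only 3 of them informative
X_arr, y_demo = make_regression(n_samples=500, n_features=10, n_informative=3,
                                noise=5.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(10)])

modelo_demo = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)
imp_pct = importancias_porcentaje(modelo_demo, X_demo)

umbral_demo = 5.0   # percent; illustrative value, not the one used in this notebook
seleccion = imp_pct[imp_pct > umbral_demo].index.tolist()
print(imp_pct.round(2))
print(seleccion)
```

Rescaling only changes the units, not the ranking, so it does not make split-count importances equivalent to gain-based ones. It simply lets one threshold be read the same way across models.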

### 4) CatBoost check

In [19]:
from sklearn.model_selection import train_test_split as tts 

from sklearn.metrics import mean_squared_error as mse 
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import r2_score as r2

In [20]:
importancias_ctr=extraer_importancias(CTR(verbose=0), X, y)

In [21]:
umbral=0.7  # 0.7 %

X_new=X[importancias_ctr[importancias_ctr['importancias']>umbral].index]

In [22]:
X_train, X_test, y_train, y_test = tts(X_new, y, train_size=0.8, test_size=0.2, random_state=42)

modelo=CTR(verbose=0)
modelo.fit(X_train, y_train)

<catboost.core.CatBoostRegressor at 0x295a2ba30>

In [23]:
y_pred=modelo.predict(X_train)

print(f'Train RMSE: {np.sqrt(mse(y_train, y_pred))}')
print(f'Train MAE: {mae(y_train, y_pred)}')
print(f'Train R2: {r2(y_train, y_pred)}')

Train RMSE: 16.926567727651268
Train MAE: 11.792501510348302
Train R2: 0.8034508821366838


In [24]:
y_pred=modelo.predict(X_test)

print(f'Test RMSE: {np.sqrt(mse(y_test, y_pred))}')
print(f'Test MAE: {mae(y_test, y_pred)}')
print(f'Test R2: {r2(y_test, y_pred)}')

Test RMSE: 21.71621647923623
Test MAE: 14.842099447371211
Test R2: 0.7113012109146164


In [25]:
X_new.shape

(18936, 29)

In [26]:
sorted(X_new.columns)

['accommodates',
 'air_conditioning',
 'availability_30',
 'availability_365',
 'availability_60',
 'availability_90',
 'bathrooms',
 'bedrooms',
 'beds',
 'calculated_host_listings_count',
 'calculated_host_listings_count_entire_homes',
 'calculated_host_listings_count_private_rooms',
 'calculated_host_listings_count_shared_rooms',
 'cleaning_fee',
 'dishwasher',
 'extra_people',
 'guests_included',
 'latitude',
 'longitude',
 'maximum_nights',
 'minimum_nights',
 'number_of_reviews',
 'number_of_reviews_ltm',
 'room_type_private_room',
 'room_type_shared_room',
 'security_deposit',
 'x',
 'y',
 'z']

This is the final selection of variables. How the location variables will be represented is still to be decided; the next notebook examines which transformation works best.