# TP3 - SVM
### Student: Francisco Javier Piqueras Martínez
### Classification exercise on the Airbnb data

Steps:
- 1. Statistical study and data cleaning
- 2. Classification on the room_type field using SVC and LinearSVC
- 3. Hyperparameter tuning
- 4. Result

In [19]:
import os
import pandas as pd
import numpy as np

from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score, GridSearchCV
from sklearn.preprocessing import OneHotEncoder,OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import confusion_matrix

In [2]:
airbnb_data = pd.read_csv(os.path.join("data","airbnb.csv"))

In [3]:
airbnb_data.head()

| | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Centro | Justicia | 40.424715 | -3.698638 | Entire home/apt | 49 | 28 | 35 | 0.42 | 1 | 99 |
| 1 | Centro | Embajadores | 40.413418 | -3.706838 | Entire home/apt | 80 | 5 | 18 | 0.3 | 1 | 188 |
| 2 | Moncloa - Aravaca | Argüelles | 40.42492 | -3.713446 | Entire home/apt | 40 | 2 | 21 | 0.25 | 9 | 195 |
| 3 | Moncloa - Aravaca | Casa de Campo | 40.431027 | -3.724586 | Entire home/apt | 55 | 2 | 3 | 0.13 | 9 | 334 |
| 4 | Latina | Cármenes | 40.40341 | -3.740842 | Private room | 16 | 2 | 23 | 0.76 | 2 | 250 |


### 1. Statistical study and data cleaning

First, we look at the distinct values of the `room_type` variable:

In [4]:
print(airbnb_data['room_type'].unique())

['Entire home/apt' 'Private room' 'Shared room']


And we check how balanced they are:

In [5]:
airbnb_data.groupby("room_type").size()

room_type
Entire home/apt    7926
Private room       5203
Shared room         192
dtype: int64

When training the model, it helps to know whether the classes are balanced. Here they are not: `Shared room` accounts for only 192 of the 13321 listings.
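To put numbers on the imbalance, here is a small sketch that turns the counts above into class shares. It also computes the weights that scikit-learn's documented `class_weight='balanced'` heuristic would assign, `n_samples / (n_classes * class_count)`. The counts are copied from the output above; the variable names are ours and are not part of the original notebook.

```python
# Class counts copied from the groupby output above.
counts = {"Entire home/apt": 7926, "Private room": 5203, "Shared room": 192}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of each class in the full dataset.
proportions = {label: n / n_samples for label, n in counts.items()}

# "balanced" heuristic: n_samples / (n_classes * class_count).
# Rare classes get proportionally larger weights.
balanced_weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label in counts:
    print(f"{label:16s} share={proportions[label]:.3f} weight={balanced_weights[label]:.2f}")
```

The rarest class gets a weight roughly forty times larger than the most common one, which is why the choice of `class_weight` matters for this dataset.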

First, we drop variables that add noise, such as `neighbourhood`. It gives the location of the listing, just as `neighbourhood_group` does. From experience in practical assignment 1, the model works better with `neighbourhood_group` than with `neighbourhood`, so we keep the former and drop the latter.

In [6]:
# Use the columns= keyword: passing the axis positionally is rejected by recent pandas versions
airbnb = airbnb_data.drop(columns="neighbourhood")

Now let's take a look at the remaining variables (types, counts, etc.):

In [7]:
airbnb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13321 entries, 0 to 13320
Data columns (total 10 columns):
neighbourhood_group               13321 non-null object
latitude                          13321 non-null float64
longitude                         13321 non-null float64
room_type                         13321 non-null object
price                             13321 non-null int64
minimum_nights                    13321 non-null int64
number_of_reviews                 13321 non-null int64
reviews_per_month                 13321 non-null float64
calculated_host_listings_count    13321 non-null int64
availability_365                  13321 non-null int64
dtypes: float64(3), int64(5), object(2)
memory usage: 1.0+ MB


As shown, every column has 13321 non-null entries, matching the number of rows, so there are no missing values.

Next, we split the dataset into a training set (80%) and a test set (20%), stratified on `room_type`:

In [8]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=38)

# split() yields positional indices, so select rows with iloc rather than loc
for id_train, id_test in split.split(airbnb, airbnb["room_type"]):
    train_set, test_set = airbnb.iloc[id_train], airbnb.iloc[id_test]

# Separate the label from the features
y_train = train_set["room_type"]
X_train = train_set.drop(columns="room_type")
y_test = test_set["room_type"]
X_test = test_set.drop(columns="room_type")

Next, we scale the numerical values and use OneHotEncoder to one-hot encode ("dummify") the categorical variables.

In [9]:
airbnb_categorical = ["neighbourhood_group"]
airbnb_numerical = ["latitude","longitude","price","minimum_nights","number_of_reviews","reviews_per_month","calculated_host_listings_count","availability_365"]
airbnb_label = ["room_type"]

airbnb_col_transformer = ColumnTransformer([
    ("num_parser", StandardScaler(), airbnb_numerical),
    ("cat_parser", OneHotEncoder(), airbnb_categorical),
])
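As a rough illustration of what these two transformers compute, here is a hand-written NumPy sketch. It does not use the scikit-learn classes themselves. It takes the `price` and `neighbourhood_group` values of the five rows shown by `head()` above; the variable names are ours.

```python
import numpy as np

# price and neighbourhood_group of the five rows printed by head() above.
prices = np.array([49.0, 80.0, 40.0, 55.0, 16.0])
groups = np.array(["Centro", "Centro", "Moncloa - Aravaca", "Moncloa - Aravaca", "Latina"])

# Standardization: subtract the mean and divide by the population
# standard deviation (ddof=0), which is what StandardScaler does.
prices_scaled = (prices - prices.mean()) / prices.std()

# One-hot encoding: one 0/1 column per distinct category, in sorted order.
categories = np.unique(groups)
one_hot = (groups[:, None] == categories[None, :]).astype(float)

print(prices_scaled.round(3))
print(categories)
print(one_hot)
```

Scaling matters for SVMs in particular, because features on larger scales would otherwise dominate the distances the kernel is built on.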

In [10]:
airbnb_train_pipeline_svc = Pipeline([
    ("col_transformer", airbnb_col_transformer),
    ("train", SVC())
])

airbnb_train_pipeline_linearsvc = Pipeline([
    ("col_transformer", airbnb_col_transformer),
    ("train", LinearSVC())
])

### 2. Classification on the room_type field using SVC and LinearSVC

Next, we classify our data using SVC with 5-fold cross-validation:

In [11]:
cross_val_score(airbnb_train_pipeline_svc, X_train, y_train, cv=5, scoring="accuracy", n_jobs=1)



array([0.86772983, 0.88883677, 0.8620366 , 0.89300798, 0.8713615 ])

Not bad at all. Now let's classify the data with LinearSVC:

In [12]:
cross_val_score(airbnb_train_pipeline_linearsvc, X_train, y_train, cv=5, scoring="accuracy", n_jobs=1)



array([0.72654784, 0.72795497, 0.71328015, 0.76912248, 0.72957746])

At first glance, SVC works better than LinearSVC: its accuracy is higher on every fold.
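To compare the two runs with a single number each, we can average the fold scores. This sketch simply copies the two arrays printed above; the variable names are ours.

```python
import numpy as np

# Fold accuracies copied from the two cross_val_score outputs above.
svc_scores = np.array([0.86772983, 0.88883677, 0.8620366, 0.89300798, 0.8713615])
linear_scores = np.array([0.72654784, 0.72795497, 0.71328015, 0.76912248, 0.72957746])

for name, scores in [("SVC", svc_scores), ("LinearSVC", linear_scores)]:
    print(f"{name:10s} mean={scores.mean():.4f} std={scores.std():.4f}")

# The worst SVC fold still beats the best LinearSVC fold.
print(svc_scores.min() > linear_scores.max())
```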

### 3. Hyperparameter tuning

We are going to look for the best values of gamma and C for our model with kernel=RBF.

To do this, we run two searches: first a broad one, then a narrower one around the region with the best results to fine-tune the hyperparameters.
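For intuition about what gamma controls, here is a minimal sketch of the RBF kernel, K(x, z) = exp(-gamma * ||x - z||^2), evaluated on two hypothetical standardized feature vectors. The helper `rbf_kernel` and the vectors `a` and `b` are illustrative and not part of the notebook.

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) for two vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Two hypothetical standardized feature vectors (squared distance = 2).
a = [0.0, 0.0]
b = [1.0, 1.0]

# Small gamma: distant points still look similar (smoother boundary).
# Large gamma: similarity decays quickly (more local, wigglier boundary).
for gamma in (2.0 ** -10, 2.0 ** -5, 1.0, 2.0 ** 5):
    print(f"gamma={gamma:<12g} K(a, b)={rbf_kernel(a, b, gamma):.6f}")
```

C plays the complementary role: it sets how heavily misclassified training points are penalized, so larger values fit the training data more tightly.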

In [13]:
airbnb_train_pipeline_svc.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'col_transformer', 'train', 'col_transformer__n_jobs', 'col_transformer__remainder', 'col_transformer__sparse_threshold', 'col_transformer__transformer_weights', 'col_transformer__transformers', 'col_transformer__verbose', 'col_transformer__num_parser', 'col_transformer__cat_parser', 'col_transformer__num_parser__copy', 'col_transformer__num_parser__with_mean', 'col_transformer__num_parser__with_std', 'col_transformer__cat_parser__categorical_features', 'col_transformer__cat_parser__categories', 'col_transformer__cat_parser__drop', 'col_transformer__cat_parser__dtype', 'col_transformer__cat_parser__handle_unknown', 'col_transformer__cat_parser__n_values', 'col_transformer__cat_parser__sparse', 'train__C', 'train__cache_size', 'train__class_weight', 'train__coef0', 'train__decision_function_shape', 'train__degree', 'train__gamma', 'train__kernel', 'train__max_iter', 'train__probability', 'train__random_state', 'train__shrinking', 'train__tol', 't

In [15]:
grid_first_params = {
    'train__C' : np.logspace(-5, 15, base=2, num=5),
    'train__gamma' : np.logspace(-15, 5, base=2, num=5),
    'train__kernel' : ['rbf'],
    'train__class_weight' : ['balanced']
}

grid_first_search = GridSearchCV(airbnb_train_pipeline_svc, grid_first_params, cv=5, n_jobs=-1)

grid_first_result = grid_first_search.fit(X_train, y_train)

In [16]:
grid_first_result.best_params_

{'train__C': 32768.0,
 'train__class_weight': 'balanced',
 'train__gamma': 0.005524271728019903,
 'train__kernel': 'rbf'}

In [17]:
grid_second_params = {
    'train__C' : np.logspace(12, 17, base=2, num=5),
    'train__gamma' : np.logspace(-10, 5, base=2, num=5),
    'train__kernel' : ['rbf'],
    'train__class_weight' : ['balanced']
}

grid_second_search = GridSearchCV(airbnb_train_pipeline_svc, grid_second_params, cv=5, n_jobs=-1)

grid_second_result = grid_second_search.fit(X_train, y_train)

In [18]:
grid_second_result.best_params_

{'train__C': 55108.98747006743,
 'train__class_weight': 'balanced',
 'train__gamma': 0.013139006488339289,
 'train__kernel': 'rbf'}
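The values returned by the second search are easier to read as powers of two. This sketch rebuilds the second grid with the same `np.logspace` calls and locates the selected values in it; the best values are copied from the output above.

```python
import numpy as np

# Same grids as in the second search.
c_grid = np.logspace(12, 17, base=2, num=5)
gamma_grid = np.logspace(-10, 5, base=2, num=5)

# With base=2 the grid is evenly spaced in the exponent.
print("C exponents:    ", np.log2(c_grid))
print("gamma exponents:", np.log2(gamma_grid))

# Best values copied from grid_second_result.best_params_ above.
best_c = 55108.98747006743
best_gamma = 0.013139006488339289
print("best C = 2 **", np.log2(best_c))
print("best gamma = 2 **", np.log2(best_gamma))
```

Both selected values fall inside the second grid rather than on its boundary.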

In [20]:
airbnb_best_c = grid_second_result.best_params_['train__C']

In [21]:
airbnb_best_gamma = grid_second_result.best_params_['train__gamma']

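### 4. Result

Finally, we build our model with the hyperparameters found above and look at the results on the test set. Note that the SVC below sets only C and gamma, so class_weight keeps its default value rather than the 'balanced' setting used during the search.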

In [22]:
airbnb_train_pipeline_svc = Pipeline([
    ("col_transformer", airbnb_col_transformer),
    ("test", SVC(C=airbnb_best_c, gamma=airbnb_best_gamma))
])

airbnb_train_pipeline_svc.fit(X_train, y_train)
airbnb_train_pipeline_svc_prediction = airbnb_train_pipeline_svc.predict(X_test)

confusion_matrix(y_test, airbnb_train_pipeline_svc_prediction)

array([[1476,  109,    1],
       [ 131,  906,    4],
       [   1,   30,    7]])
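To summarize the matrix, we can derive overall accuracy plus per-class recall and precision from it. This is a sketch using the matrix printed above. It assumes the rows and columns follow the sorted label order that `confusion_matrix` uses by default, with rows as true classes and columns as predictions.

```python
import numpy as np

# Labels in sorted order; rows are true classes, columns are predictions.
labels = ["Entire home/apt", "Private room", "Shared room"]
cm = np.array([[1476, 109, 1],
               [131, 906, 4],
               [1, 30, 7]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # correct / true count per class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / predicted count per class

print(f"accuracy = {accuracy:.4f}")
for label, r, p in zip(labels, recall, precision):
    print(f"{label:16s} recall={r:.3f} precision={p:.3f}")
```

Overall accuracy is high, but only 7 of the 38 `Shared room` listings in the test set are recovered, and most of the rest are predicted as `Private room`.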