We run a first model using Random Forest, because it usually gives good results even without hyperparameter tuning and tends to control variance and avoid overfitting. We also try a logistic regression, to test a model with similar characteristics that is not tree-based.

The results show a high accuracy, but this figure should not be misleading: the model learns that it maximizes accuracy by predicting the majority class for almost every case. We verify this with the contingency table and by computing metrics other than accuracy.
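
To illustrate why accuracy alone is misleading here, the following minimal sketch scores a classifier that always predicts the most frequent label. The labels are synthetic and assume a split of roughly 83% Good / 17% Bad, similar to the `Category` variable in this notebook; the names `y_demo`, `X_demo` and `baseline` are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Synthetic labels (assumption): 83 'Good' and 17 'Bad' reviews
y_demo = np.array(['Good'] * 83 + ['Bad'] * 17)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
pred_demo = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, pred_demo)
baseline_kappa = cohen_kappa_score(y_demo, pred_demo)
print('Accuracy:', baseline_accuracy)
print('Kappa:   ', baseline_kappa)
```

Such a baseline reaches an accuracy equal to the majority-class share while its kappa is zero, so any real model should be judged against it rather than against 50%.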

### Open File

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
from matplotlib import pyplot as plt

In [2]:
df = pd.read_csv('./data/df_features.gz')

### Create Predicted Category for final models (2 categories)

In [3]:
# Create per-hotel summary statistics to feed into the model, since each hotel behaves differently
diff_hotels = df[['Hotel_Address','Diff']].groupby('Hotel_Address').describe()
diff_hotels = diff_hotels.Diff.reset_index()

In [4]:
df = pd.merge(df, diff_hotels, on='Hotel_Address')

In [5]:
category = np.array(['Bad' if i < 7 else 'Good' for i in df.Reviewer_Score])
df.loc[:, 'Category'] = category
df.Category.value_counts() / len(df)

Good    0.831599
Bad     0.168401
Name: Category, dtype: float64

### Prepare Data for Modeling

Subset a small fraction to run the first models

In [6]:
df_model = df.sample(n=10000, random_state=1)

In [7]:
x_categorical = ['Review_Month','City','Pet','Purpose','Whom','Room_Recode','Nationality_Recode','Length_Recode','Stars']
x_numerical = ['Average_Score', 'Total_Number_of_Reviews_Reviewer_Has_Given', 'Close_Landmarks', 'Dist_Center', 
               'Dist_Train', 'Dist_Airport','food_Neg_Hotel','staff_Neg_Hotel', 'location_Neg_Hotel', 'value_Neg_Hotel',
               'comfort_Neg_Hotel', 'room_Neg_Hotel', 'facilities_Neg_Hotel','cleanliness_Neg_Hotel', 
               'food_Pos_Hotel', 'staff_Pos_Hotel','location_Pos_Hotel', 'value_Pos_Hotel', 'comfort_Pos_Hotel',
               'room_Pos_Hotel', 'facilities_Pos_Hotel', 'cleanliness_Pos_Hotel','Price','Reservation_ADR',
               'count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
x_col = x_categorical + x_numerical
y_col = 'Category'

In [8]:
X_numerical = df_model[x_numerical]
X_numerical_std = X_numerical.apply(lambda x: ((x-np.mean(x)) / np.std(x)))
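
The cell above standardizes with the mean and standard deviation of the whole sample, before the train/test split. A possible alternative, sketched below on synthetic data, is to fit the scaling on the training rows only and reuse it on the test rows. The frame `demo` and its values are made up for illustration; only the column names are borrowed from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numerical features (assumption): values are random, not the hotel data
rng = np.random.default_rng(1)
demo = pd.DataFrame({'Price': rng.normal(100, 20, 200),
                     'Dist_Center': rng.normal(3, 1, 200)})
train_demo, test_demo = train_test_split(demo, test_size=0.2, random_state=1)

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(train_demo)
train_std = pd.DataFrame(scaler.transform(train_demo),
                         columns=demo.columns, index=train_demo.index)
# Apply the same training statistics to the test rows
test_std = pd.DataFrame(scaler.transform(test_demo),
                        columns=demo.columns, index=test_demo.index)
```

`StandardScaler` divides by the population standard deviation, the same convention as `np.std` in the cell above.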

Create binary (dummy) variables from the categorical variables

In [9]:
df_model['Review_Month'] = df_model['Review_Month'].astype(str)
X_categorical = pd.get_dummies(df_model[x_categorical], prefix_sep='_', drop_first=True)
X_categorical = X_categorical.fillna('Not Available')
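
If the intent of the `fillna` call is to give missing categories their own level, one option is to fill the missing values before encoding, so that `get_dummies` creates a dedicated column for them. The toy frame `demo` below is hypothetical and only reuses the `Purpose` column name.

```python
import pandas as pd

# Toy categorical column with one missing value (assumption: made-up data)
demo = pd.DataFrame({'Purpose': ['Leisure', None, 'Business']})

# Fill first, then encode: the missing value becomes its own category
filled = demo.fillna('Not Available')
dummies = pd.get_dummies(filled, prefix_sep='_')
print(list(dummies.columns))
```

With `drop_first=True`, as in the cell above, the first level in sorted order would be dropped and act as the reference category.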

Merge numerical and categorical variables

In [10]:
X = pd.concat([X_numerical_std, X_categorical], axis=1, sort=False)
y = df_model[y_col]

Split into Train and Test

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)
X_test.shape, y_test.shape, X_train.shape, y_train.shape

((2000, 78), (2000,), (8000, 78), (8000,))

### Random Forest

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score, precision_score, recall_score

In [13]:
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

High accuracy but low scores on the other important metrics with Random Forest

In [14]:
# scikit-learn metrics take (y_true, y_pred)
print('Accuracy: ', accuracy_score(y_test, pred))
print('Kappa:    ', cohen_kappa_score(y_test, pred))
print('F1-Score: ', f1_score(y_test, pred, pos_label='Bad'))
# Recall of the 'Bad' class: share of actual bad reviews that the model detects
print('Recall:   ', recall_score(y_test, pred, pos_label='Bad'))

Accuracy:  0.808
Kappa:     0.1031913775587654
F1-Score:  0.1864406779661017
Recall:    0.12753623188405797


In [15]:
pd.crosstab(pred, y_test)

Category  Bad  Good
row_0
Bad        44    83
Good      301  1572
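
The per-class metrics can be recomputed by hand from the contingency table above, where rows are the predicted labels and columns the actual ones. This is a small worked sketch; the variable names are illustrative.

```python
# Counts read from the Random Forest contingency table (rows = predicted, columns = actual)
tp = 44    # predicted Bad, actually Bad
fp = 83    # predicted Bad, actually Good
fn = 301   # predicted Good, actually Bad
tn = 1572  # predicted Good, actually Good

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision_bad = tp / (tp + fp)   # of the reviews predicted Bad, how many are Bad
recall_bad = tp / (tp + fn)      # of the actual Bad reviews, how many are found
f1_bad = 2 * precision_bad * recall_bad / (precision_bad + recall_bad)

print(round(accuracy, 3), round(precision_bad, 3), round(recall_bad, 3), round(f1_bad, 3))
```

Only 44 of the 345 actual Bad reviews are detected, which is why the accuracy stays high while the metrics for the Bad class are poor.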


### Logistic Regression

In [16]:
from sklearn.linear_model import LogisticRegression

In [17]:
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

High accuracy but low scores on the other important metrics, also with Logistic Regression

In [18]:
# scikit-learn metrics take (y_true, y_pred)
print('Accuracy: ', accuracy_score(y_test, pred))
print('Kappa:    ', cohen_kappa_score(y_test, pred))
print('F1-Score: ', f1_score(y_test, pred, pos_label='Bad'))
# Recall of the 'Bad' class: share of actual bad reviews that the model detects
print('Recall:   ', recall_score(y_test, pred, pos_label='Bad'))

Accuracy:  0.825
Kappa:     0.05544535751393209
F1-Score:  0.08854166666666669
Recall:    0.04927536231884058


In [19]:
pd.crosstab(pred, y_test)

Category  Bad  Good
row_0
Bad        17    22
Good      328  1633


In [21]:
# Share of test cases that the logistic regression predicts as 'Good'
(328+1633)/(328+1633+17+22)

0.9805

We should balance the classes of the target variable. The model classifies the vast majority of cases (98%) as Good, since this is the most common label.
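
One common way to compensate for the imbalance, not necessarily the approach taken later in this project, is to reweight the classes during training with scikit-learn's `class_weight='balanced'` option. The sketch below uses synthetic data from `make_classification` as a stand-in for the hotel features, with label 1 as the minority class; all names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): label 1 is the minority class, about 17% of cases
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.83, 0.17], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=1)

# Compare minority-class recall with and without class reweighting
recalls = {}
for name, weight in [('unweighted', None), ('balanced', 'balanced')]:
    model = LogisticRegression(max_iter=1000, class_weight=weight)
    model.fit(X_tr, y_tr)
    recalls[name] = recall_score(y_te, model.predict(X_te), pos_label=1)

print(recalls)
```

Resampling the training set is another option. In either case, the test split should keep its original class proportions so that the reported metrics reflect real conditions.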