# Activity No. 08: Hugging Face and Gradio

## Team members

**Group No. 03**

- Adriana Villalobos
- Gustavo Ledesma
- Alejo Cuello

## Activity description

We work with the *test.csv* dataset, which covers airline passenger satisfaction. The goal of the activity is to build a model that can be used from a Hugging Face Space, with Gradio providing the user interface.

# Code

## Importing libraries and data

In [44]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pickle

from funpymodeling.exploratory import status
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

## 1) Data preparation

- Load the dataset:

In [29]:
all_data = pd.read_csv("../test.csv", sep=',', index_col=0)
all_data.head(3)

| | id | Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | ... | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19556 | Female | Loyal Customer | 52 | Business travel | Eco | 160 | 5 | 4 | 3 | ... | 5 | 5 | 5 | 5 | 2 | 5 | 5 | 50 | 44.0 | satisfied |
| 1 | 90035 | Female | Loyal Customer | 36 | Business travel | Business | 2863 | 1 | 1 | 3 | ... | 4 | 4 | 4 | 4 | 3 | 4 | 5 | 0 | 0.0 | satisfied |
| 2 | 12360 | Male | disloyal Customer | 20 | Business travel | Eco | 192 | 2 | 0 | 2 | ... | 2 | 4 | 1 | 3 | 2 | 2 | 2 | 0 | 0.0 | neutral or dissatisfied |


In [30]:
status(all_data)

| | variable | q_nan | p_nan | q_zeros | p_zeros | unique | type |
|---|---|---|---|---|---|---|---|
| 0 | id | 0 | 0.0 | 0 | 0.0 | 25976 | int64 |
| 1 | Gender | 0 | 0.0 | 0 | 0.0 | 2 | object |
| 2 | Customer Type | 0 | 0.0 | 0 | 0.0 | 2 | object |
| 3 | Age | 0 | 0.0 | 0 | 0.0 | 75 | int64 |
| 4 | Type of Travel | 0 | 0.0 | 0 | 0.0 | 2 | object |
| 5 | Class | 0 | 0.0 | 0 | 0.0 | 3 | object |
| 6 | Flight Distance | 0 | 0.0 | 0 | 0.0 | 3281 | int64 |
| 7 | Inflight wifi service | 0 | 0.0 | 813 | 0.031298 | 6 | int64 |
| 8 | Departure/Arrival time convenient | 0 | 0.0 | 1381 | 0.053164 | 6 | int64 |
| 9 | Ease of Online booking | 0 | 0.0 | 1195 | 0.046004 | 6 | int64 |
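The summary above comes from `funpymodeling`. As a rough illustration of what those columns mean, here is a minimal pandas-only sketch that computes similar per-column counts. The function name `status_sketch` and the toy frame are hypothetical, and this is not claimed to match the library's implementation.

```python
import pandas as pd


def status_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary with the same column names as the table above (illustrative only)."""
    n_rows = len(df)
    q_nan = df.isna().sum()
    # Count zeros only in numeric columns; non-numeric columns get 0.
    q_zeros = df.select_dtypes("number").eq(0).sum().reindex(df.columns, fill_value=0)
    return pd.DataFrame({
        "variable": df.columns,
        "q_nan": q_nan.values,
        "p_nan": q_nan.values / n_rows,
        "q_zeros": q_zeros.values,
        "p_zeros": q_zeros.values / n_rows,
        "unique": df.nunique().values,
        "type": df.dtypes.astype(str).values,
    })


# Hypothetical toy data, not taken from test.csv
toy = pd.DataFrame({
    "Age": [52, 36, 20, 44],
    "Wifi": [5, 0, 2, 0],
    "Class": ["Eco", "Business", "Eco", "Eco"],
})
summary = status_sketch(toy)
print(summary)
```

Reading `q_zeros` and `p_zeros` this way makes it easier to see that a rating column with zeros (such as the wifi rating) may be encoding "not applicable" rather than a real score.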


- This dataset has many columns, so we keep only a few of them:

In [31]:
data = all_data[['Age', 'Class', 'Inflight wifi service', 'Ease of Online booking', 'Seat comfort', 'Checkin service', 'satisfaction']].copy()
data.columns

Index(['Age', 'Class', 'Inflight wifi service', 'Ease of Online booking',
       'Seat comfort', 'Checkin service', 'satisfaction'],
      dtype='object')

- Rename the columns to avoid whitespace and make them more concise.

In [32]:
data.rename(
    columns={
        'Inflight wifi service': 'Wifi',
        'Ease of Online booking': 'Booking',
        'Seat comfort': 'Seat',
        'Checkin service': 'Checkin',
    },
    inplace=True,
)
data.columns

Index(['Age', 'Class', 'Wifi', 'Booking', 'Seat', 'Checkin', 'satisfaction'], dtype='object')

- Recode the values of the `satisfaction` column:

In [33]:
class_map = {'neutral or dissatisfied':0, 'satisfied':1}
data['satisfaction'] = data['satisfaction'].map(class_map)
data['satisfaction']

0        1
1        1
2        0
3        1
4        1
        ..
25971    0
25972    1
25973    0
25974    1
25975    0
Name: satisfaction, Length: 25976, dtype: int64
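`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a label with unexpected casing or whitespace would silently become missing. A small sketch of a sanity check, using hypothetical raw values rather than the real column:

```python
import pandas as pd

class_map = {"neutral or dissatisfied": 0, "satisfied": 1}

# Hypothetical raw labels; the last one differs in casing and is not a key of class_map.
raw = pd.Series(["satisfied", "neutral or dissatisfied", "Satisfied"])

mapped = raw.map(class_map)

# Collect the original labels that could not be mapped.
unmapped = raw[mapped.isna()].unique().tolist()
print(unmapped)
```

In the notebook the output shows `dtype: int64`, which already suggests that no value was left unmapped, since a column containing `NaN` would have been cast to float.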

- One-hot encode the categorical columns with `pd.get_dummies`.

In [34]:
data = pd.get_dummies(data, drop_first=True, dtype="int64")
data

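| | Age | Wifi | Booking | Seat | Checkin | satisfaction | Class_Eco | Class_Eco Plus |
|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 5 | 3 | 3 | 2 | 1 | 1 | 0 |
| 1 | 36 | 1 | 3 | 5 | 3 | 1 | 0 | 0 |
| 2 | 20 | 2 | 2 | 2 | 2 | 0 | 1 | 0 |
| 3 | 44 | 0 | 0 | 4 | 3 | 1 | 0 | 0 |
| 4 | 49 | 2 | 4 | 2 | 4 | 1 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25971 | 34 | 3 | 3 | 4 | 4 | 0 | 0 | 0 |
| 25972 | 23 | 4 | 4 | 4 | 5 | 1 | 0 | 0 |
| 25973 | 17 | 2 | 1 | 2 | 5 | 0 | 1 | 0 |
| 25974 | 14 | 3 | 3 | 4 | 4 | 1 | 0 | 0 |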
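With `drop_first=True`, the first category in sorted order is dropped and becomes the implicit baseline. For `Class` that is `Business`, which is why only `Class_Eco` and `Class_Eco Plus` appear above. A small sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical rows, one per travel class
toy = pd.DataFrame({
    "Age": [52, 36, 23],
    "Class": ["Eco", "Business", "Eco Plus"],
})

encoded = pd.get_dummies(toy, drop_first=True, dtype="int64")
print(encoded)

# The Business row is represented by zeros in both dummy columns.
business_row = encoded.loc[1, ["Class_Eco", "Class_Eco Plus"]].tolist()
```

A `Business` passenger is therefore encoded as `Class_Eco = 0` and `Class_Eco Plus = 0`, which matters later when building inputs by hand for the model.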


## 2) Classification

- The target variable to classify is `satisfaction`.
- Remember to comment out and NOT use the following cell:
    ```
    data_x = data_x.values
    data_y = data_y.values
    ``` 

In [38]:
x_data = data.drop(columns=["satisfaction"])
y_data = data[["satisfaction"]]

- Use 30% of the dataset for testing. Note that the cell below passes `test_size=0.2`, so the shapes shown correspond to a 20% test split.

In [39]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=42)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(20780, 7)
(5196, 7)
(20780, 1)
(5196, 1)
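The printed shapes can be checked with simple arithmetic. The sketch below assumes that the test share is rounded up and the remainder goes to training, which is consistent with the 20780 / 5196 split shown above. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test), assuming the test count is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


sizes_20 = split_sizes(25976, 0.2)  # the value used in the cell above
sizes_30 = split_sizes(25976, 0.3)  # the 30% share for comparison
print(sizes_20, sizes_30)
```

Under the same assumption, a 30% test split of the 25976 rows would leave 18183 rows for training and 7793 for testing.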


- For the Random Forest, use the parameters `n_estimators = 5000` and `random_state = 19`.

In [None]:
# rf = RandomForestClassifier(n_estimators=5000,random_state=19)
# rf.fit(x_train,y_train)



| Parameter | Value |
|---|---|
| n_estimators | 5000 |
| criterion | 'gini' |
| max_depth | |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | |
| min_impurity_decrease | 0.0 |
| bootstrap | True |
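scikit-learn classifiers expect a one-dimensional target. Passing a one-column DataFrame such as `data[["satisfaction"]]` still works, but it typically triggers a `DataConversionWarning`. A minimal sketch of fitting with a Series target instead, on synthetic data (the toy frame and the smaller `n_estimators=50` are assumptions chosen only to keep the example fast):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real features and target
rng = np.random.default_rng(0)
x_toy = pd.DataFrame({
    "Age": rng.integers(7, 86, size=200),
    "Wifi": rng.integers(0, 6, size=200),
})
y_toy = pd.DataFrame({"satisfaction": (x_toy["Wifi"] >= 3).astype("int64")})

# Far fewer trees than the 5000 used in the activity, only for speed
rf_small = RandomForestClassifier(n_estimators=50, random_state=19)

# Select the column so the target is a 1-D Series, not a (n, 1) DataFrame
rf_small.fit(x_toy, y_toy["satisfaction"])

train_acc = rf_small.score(x_toy, y_toy["satisfaction"])
preds = rf_small.predict(x_toy)
print(train_acc, preds.shape)
```

The same idea applied to the notebook would be `rf.fit(x_train, y_train["satisfaction"])`, which keeps the features as a DataFrame and only changes the shape of the target.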


In [None]:
# Load the previously trained model instead of refitting (training is commented out above)
with open("rf.pkl", "rb") as handle:
    rf = pickle.load(handle)

In [55]:
y_train_pred = rf.predict(x_train)
y_train_pred

array([1, 1, 1, ..., 1, 0, 0], dtype=int64)
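For a binary problem, `predict` effectively applies a 0.5 cutoff to the predicted probability of class 1. Different cutoffs can be compared from a single set of probabilities without calling the model again for each one. The sketch below uses hypothetical probabilities; in the notebook they could come from `rf.predict_proba(x_test)[:, 1]`. The helper name `accuracy_by_cutoff` is an assumption.

```python
import numpy as np


def accuracy_by_cutoff(proba_pos, y_true, cutoffs):
    """Accuracy obtained when predicting 1 whenever proba_pos >= cutoff."""
    proba_pos = np.asarray(proba_pos, dtype=float)
    y_true = np.asarray(y_true).astype(bool)
    cutoffs = np.asarray(cutoffs, dtype=float)
    # Broadcast to shape (n_cutoffs, n_samples): one row of predictions per cutoff
    preds = proba_pos[None, :] >= cutoffs[:, None]
    return (preds == y_true[None, :]).mean(axis=1)


# Hypothetical probabilities and labels, not taken from the dataset
proba_pos = np.array([0.9, 0.4, 0.6, 0.2])
y_true = np.array([1, 0, 0, 0])
cutoffs = [0.3, 0.5, 0.7]

acc = accuracy_by_cutoff(proba_pos, y_true, cutoffs)
print(dict(zip(cutoffs, acc.tolist())))
```

Because the probabilities are computed once and the comparison is vectorized, adding more cutoffs costs very little compared with the model call itself.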

- **IMPORTANT:** Item g), the analysis of the different cutoff points, TAKES A VERY LONG TIME, roughly more than 2,000 minutes. **Feel free to skip it if you prefer.**

- Save the model under the name `rf.pkl`.
  **NOTE:** This file is very large, so be careful if you want to push the model to a GitHub repository. **That is why this exercise does NOT ask you to upload the model to a repository.**

In [None]:
# Regarding the note above: I already added an entry to .gitignore so the model file(s) saved in this folder are not tracked
# with open("rf.pkl", "wb") as handle:
#     pickle.dump(rf, handle, protocol=pickle.HIGHEST_PROTOCOL)
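Since the pickled forest is very large, one possible option is to write it through `gzip`, using only the standard library. This is a sketch under assumptions: the file name `rf.pkl.gz` and the helper names are hypothetical, a small dictionary stands in for the model, and the size reduction for a real forest has not been measured here.

```python
import gzip
import os
import pickle
import tempfile


def dump_compressed(obj, path):
    """Pickle obj into a gzip-compressed file."""
    with gzip.open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Load an object written by dump_compressed."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in object; in the notebook this would be the fitted `rf`
stand_in = {"n_estimators": 5000, "random_state": 19}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rf.pkl.gz")
    dump_compressed(stand_in, path)
    restored = load_compressed(path)

print(restored)
```

Any code that later loads the model would need to use the matching `gzip.open` call instead of the plain `open` shown elsewhere in this notebook.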

- Save the column names.

In [None]:
with open("categories_ohe.pkl", "wb") as handle:
    pickle.dump(x_data.columns, handle, protocol=pickle.HIGHEST_PROTOCOL)
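The saved column names are useful at prediction time, because a single new input will not produce every dummy column on its own. A minimal sketch of aligning a new row to the training columns with `reindex`. The column list is written inline to mirror the notebook's `x_data.columns`, and the passenger values are hypothetical.

```python
import pandas as pd

# Mirrors x_data.columns; in practice this would be loaded from categories_ohe.pkl
saved_columns = pd.Index(
    ["Age", "Wifi", "Booking", "Seat", "Checkin", "Class_Eco", "Class_Eco Plus"]
)

# Hypothetical new passenger, using the renamed raw columns
new_row = pd.DataFrame([{
    "Age": 30, "Class": "Eco Plus", "Wifi": 4, "Booking": 3, "Seat": 5, "Checkin": 2,
}])

# No drop_first here: with a single row it would drop the only category present.
# reindex adds the missing dummy columns as 0, drops extras, and fixes the order.
encoded = pd.get_dummies(new_row, dtype="int64").reindex(
    columns=saved_columns, fill_value=0
)
print(encoded)
```

A `Business` passenger would produce a `Class_Business` column that `reindex` discards, leaving both `Class_*` columns at 0, which matches the baseline encoding used during training.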

In [57]:
x_data.dtypes

Age               int64
Wifi              int64
Booking           int64
Seat              int64
Checkin           int64
Class_Eco         int64
Class_Eco Plus    int64
dtype: object
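With the model and the column order saved, the Gradio side of the activity could be wired roughly as follows. This is a sketch under assumptions, not the group's actual app: the helper names, the slider ranges (0 to 100 for age, 0 to 5 for ratings), and the default values are all hypothetical, and `gradio` is imported lazily so the encoding logic can be used without it.

```python
import pickle

import pandas as pd

# Same order as x_data.columns in the notebook
COLUMNS = ["Age", "Wifi", "Booking", "Seat", "Checkin", "Class_Eco", "Class_Eco Plus"]


def build_row(age, travel_class, wifi, booking, seat, checkin):
    """Encode one passenger the way the training data was encoded (Business is the baseline)."""
    row = {
        "Age": int(age),
        "Wifi": int(wifi),
        "Booking": int(booking),
        "Seat": int(seat),
        "Checkin": int(checkin),
        "Class_Eco": int(travel_class == "Eco"),
        "Class_Eco Plus": int(travel_class == "Eco Plus"),
    }
    return pd.DataFrame([row], columns=COLUMNS)


def make_predict(model):
    """Wrap a fitted classifier in a function with the signature Gradio will call."""
    def predict(age, travel_class, wifi, booking, seat, checkin):
        features = build_row(age, travel_class, wifi, booking, seat, checkin)
        label = int(model.predict(features)[0])
        return "satisfied" if label == 1 else "neutral or dissatisfied"
    return predict


def build_demo(model_path="rf.pkl"):
    """Build the interface; assumes gradio is installed and rf.pkl is available."""
    import gradio as gr  # imported here so the helpers above work without gradio

    with open(model_path, "rb") as handle:
        model = pickle.load(handle)

    def rating(label):
        # Assumed 0-5 rating scale
        return gr.Slider(minimum=0, maximum=5, step=1, value=3, label=label)

    return gr.Interface(
        fn=make_predict(model),
        inputs=[
            gr.Slider(minimum=0, maximum=100, step=1, value=30, label="Age"),
            gr.Radio(choices=["Business", "Eco", "Eco Plus"], value="Eco", label="Class"),
            rating("Wifi"),
            rating("Booking"),
            rating("Seat"),
            rating("Checkin"),
        ],
        outputs=gr.Textbox(label="Prediction"),
        title="Airline passenger satisfaction",
    )
```

In a Space, an `app.py` containing this code would end with `build_demo().launch()`. Because the model file is very large, how it is made available to the Space is left open here.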