# Data Science and MLOps Bootcamp

<img src="https://i.ibb.co/5RM26Cw/LOGO-COLOR2.png" width="500px">

Created at [escueladedatosvivos.ai](https://escueladedatosvivos.ai) 🚀.

Questions? The site offers guided AI support, a community, and access to certification.

<br>

---  

# 0) Dataset ✈️🌎

We build on the notebook from Week 5, when we covered Regression.
<br>It uses the Kaggle dataset: [Travel Insurance](https://www.kaggle.com/datasets/tejashvi14/travel-insurance-prediction-data).

# 1) Loading the data 📕

In [51]:
import pandas as pd
from funpymodeling.exploratory import status

In [52]:
# In this case we want to display every column
pd.set_option('display.max_columns', None)

Since we are using scikit-learn, we import `mlflow.sklearn`.

In [53]:
#!pip3 install mlflow

In [54]:
import mlflow.sklearn

We give the experiment a name.

In [55]:
mlflow.set_experiment(experiment_name="proyecto_bootcamp_martes")

<Experiment: artifact_location='file:///c:/Users/Carlos/Documents/EDVAI/MLOps_EDVAI/s9_MLflow/martes/mlruns/908263344126470351', creation_time=1684725974019, experiment_id='908263344126470351', last_update_time=1684725974019, lifecycle_stage='active', name='proyecto_bootcamp_martes', tags={}>

In [56]:
#!mlflow ui

This dataset was saved together with its index, so we keep it.
<br>To do that we pass `index_col=0`, which tells pandas to use column 0 as the index.

In [None]:
data = pd.read_csv("../05/TravelInsurancePrediction.csv", sep=',', index_col=0) 

In [58]:
data.head(5)

| | Age | Employment Type | GraduateOrNot | AnnualIncome | FamilyMembers | ChronicDiseases | FrequentFlyer | EverTravelledAbroad | TravelInsurance |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | Government Sector | Yes | 400000 | 6 | 1 | No | No | 0 |
| 1 | 31 | Private Sector/Self Employed | Yes | 1250000 | 7 | 0 | No | No | 0 |
| 2 | 34 | Private Sector/Self Employed | Yes | 500000 | 4 | 1 | No | No | 1 |
| 3 | 28 | Private Sector/Self Employed | Yes | 700000 | 3 | 1 | No | No | 0 |
| 4 | 28 | Private Sector/Self Employed | Yes | 700000 | 8 | 1 | Yes | No | 0 |


*Note:* the default value of `sep` in `read_csv` is already the comma `,`.
<br>Even so, I always set it explicitly, because files sometimes come separated by semicolons or by another delimiter such as a tab. It is good practice, and it also applies when you write files.
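
To make the note above concrete, here is a minimal sketch of what happens when the separator is left implicit on a semicolon-separated file. The in-memory text and its column values are made up for illustration; they are not the course file.

```python
import io
import pandas as pd

# Hypothetical in-memory file that uses semicolons instead of commas
raw = "id;Age;AnnualIncome\n0;31;400000\n1;34;500000\n"

# With the default separator, each row lands in a single column
wrong = pd.read_csv(io.StringIO(raw))

# Stating the separator explicitly parses the fields correctly
right = pd.read_csv(io.StringIO(raw), sep=';', index_col=0)

# The same habit applies when writing: state the separator explicitly
buffer = io.StringIO()
right.to_csv(buffer, sep=';')
roundtrip = pd.read_csv(io.StringIO(buffer.getvalue()), sep=';', index_col=0)

print(wrong.shape, right.shape)
```

Comparing the two shapes is a quick way to spot a wrong separator before going any further.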

# 2) Data preparation 👀

In [59]:
status(data)

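| | variable | q_nan | p_nan | q_zeros | p_zeros | unique | type |
|---|---|---|---|---|---|---|---|
| 0 | Age | 0 | 0.0 | 0 | 0.0 | 11 | int64 |
| 1 | Employment Type | 0 | 0.0 | 0 | 0.0 | 2 | object |
| 2 | GraduateOrNot | 0 | 0.0 | 0 | 0.0 | 2 | object |
| 3 | AnnualIncome | 0 | 0.0 | 0 | 0.0 | 30 | int64 |
| 4 | FamilyMembers | 0 | 0.0 | 0 | 0.0 | 8 | int64 |
| 5 | ChronicDiseases | 0 | 0.0 | 1435 | 0.722194 | 2 | int64 |
| 6 | FrequentFlyer | 0 | 0.0 | 0 | 0.0 | 2 | object |
| 7 | EverTravelledAbroad | 0 | 0.0 | 0 | 0.0 | 2 | object |
| 8 | TravelInsurance | 0 | 0.0 | 1277 | 0.642677 | 2 | int64 |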


Some columns hold binary `Yes/No` values, so we can convert them to `1/0`.
<br>This avoids ending up with extra columns, as we would if we used `get_dummies`.

In [60]:
class_map = {'No':0, 'Yes':1}
columns_booleans = ['GraduateOrNot', 'FrequentFlyer', 'EverTravelledAbroad']

for name_column in columns_booleans:
    data[name_column] = data[name_column].map(class_map)
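
One thing to watch with this approach: `Series.map` returns `NaN` for any value that is not a key in the dictionary. A minimal sketch of a sanity check, plus a column-count comparison with `get_dummies`, on a made-up three-row sample (the rows are hypothetical):

```python
import pandas as pd

# Hypothetical sample that mimics the Yes/No columns of the dataset
sample = pd.DataFrame({
    'GraduateOrNot': ['Yes', 'No', 'Yes'],
    'FrequentFlyer': ['No', 'No', 'Yes'],
})

class_map = {'No': 0, 'Yes': 1}
mapped = sample.copy()
for name_column in sample.columns:
    mapped[name_column] = sample[name_column].map(class_map)

# A NaN count of zero confirms every value was covered by the dictionary
unmapped = int(mapped.isna().sum().sum())

# For comparison, get_dummies creates one column per category
dummies = pd.get_dummies(sample)

print(unmapped, mapped.shape[1], dummies.shape[1])
```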

The column `Employment Type`, in turn, takes the values `Government Sector` and `Private Sector/Self Employed`.

In [61]:
class_map = {'Government Sector':0, 'Private Sector/Self Employed':1}
data['Employment Type'] = data['Employment Type'].map(class_map)

Rename the column `Employment Type` to `EmploymentType`.
<br>**Avoid spaces in column names.**

In [62]:
data.rename(columns = {'Employment Type':'EmploymentType'}, inplace = True)
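
If several columns had spaces, renaming them one by one would get tedious. A small sketch of doing it in one step, on a hypothetical frame (the column `Annual Income` is invented for the example):

```python
import pandas as pd

# Hypothetical frame with spaces in some column names
frame = pd.DataFrame({
    'Employment Type': [0, 1],
    'Annual Income': [400000, 500000],
    'Age': [31, 34],
})

# Remove every space from every column name in one step
frame.columns = frame.columns.str.replace(' ', '', regex=False)

# Names without spaces also work with attribute access
first_type = frame.EmploymentType.iloc[0]

print(list(frame.columns))
```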

View the cleaned and prepared data:

In [63]:
data.head(5)

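| | Age | EmploymentType | GraduateOrNot | AnnualIncome | FamilyMembers | ChronicDiseases | FrequentFlyer | EverTravelledAbroad | TravelInsurance |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | 0 | 1 | 400000 | 6 | 1 | 0 | 0 | 0 |
| 1 | 31 | 1 | 1 | 1250000 | 7 | 0 | 0 | 0 | 0 |
| 2 | 34 | 1 | 1 | 500000 | 4 | 1 | 0 | 0 | 1 |
| 3 | 28 | 1 | 1 | 700000 | 3 | 1 | 0 | 0 | 0 |
| 4 | 28 | 1 | 1 | 700000 | 8 | 1 | 1 | 0 | 0 |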


In [64]:
status(data)

| | variable | q_nan | p_nan | q_zeros | p_zeros | unique | type |
|---|---|---|---|---|---|---|---|
| 0 | Age | 0 | 0.0 | 0 | 0.0 | 11 | int64 |
| 1 | EmploymentType | 0 | 0.0 | 570 | 0.286865 | 2 | int64 |
| 2 | GraduateOrNot | 0 | 0.0 | 295 | 0.148465 | 2 | int64 |
| 3 | AnnualIncome | 0 | 0.0 | 0 | 0.0 | 30 | int64 |
| 4 | FamilyMembers | 0 | 0.0 | 0 | 0.0 | 8 | int64 |
| 5 | ChronicDiseases | 0 | 0.0 | 1435 | 0.722194 | 2 | int64 |
| 6 | FrequentFlyer | 0 | 0.0 | 1570 | 0.790136 | 2 | int64 |
| 7 | EverTravelledAbroad | 0 | 0.0 | 1607 | 0.808757 | 2 | int64 |
| 8 | TravelInsurance | 0 | 0.0 | 1277 | 0.642677 | 2 | int64 |


Log the dataset size

In [65]:
print(data.shape)

(1987, 9)


In [66]:
mlflow.log_param("Tamaño dataset", data.shape)

In [67]:
#mlflow.log_param("Autorización del TL", True)

True

# 3) Classification 🎯

## 3.1) Split X from Y, then train from test (routine):

In [68]:
data_x = data.drop('TravelInsurance', axis=1)
data_y = data['TravelInsurance']

We keep only the values of the dataframe, as NumPy arrays.

In [69]:
data_x = data_x.values
data_y = data_y.values
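
For reference, `.values` returns a plain NumPy array, so the column names are no longer attached to the data. The pandas documentation recommends the equivalent method `.to_numpy()`. A tiny sketch on two hypothetical rows:

```python
import pandas as pd

# Hypothetical rows, only to show the conversion
frame = pd.DataFrame({'Age': [31, 34], 'TravelInsurance': [0, 1]})

features = frame.drop('TravelInsurance', axis=1)
as_array = features.values        # NumPy array, column names are dropped
same_array = features.to_numpy()  # equivalent method form

print(type(as_array).__name__, as_array.shape, list(features.columns))
```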

In [70]:
# Fraction of the dataset held out for testing
TEST_SIZE = 0.3

Log the fraction of the dataset used for testing

In [71]:
mlflow.log_param("Porcentaje de test", TEST_SIZE)

0.3

In [72]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=TEST_SIZE)
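
The split above has no seed, so every execution selects different rows and the accuracy values change from run to run. A hedged sketch of a reproducible variant, using the `random_state` and `stratify` arguments of `train_test_split` on made-up data (the toy arrays and the seed 99 are assumptions; the 64/36 class balance loosely mirrors the `TravelInsurance` column):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, imbalanced binary target (36% ones)
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 64 + [1] * 36)

split_a = train_test_split(X, y, test_size=0.3, random_state=99, stratify=y)
split_b = train_test_split(X, y, test_size=0.3, random_state=99, stratify=y)

# Same seed, same rows in the test set
same_test_rows = np.array_equal(split_a[1], split_b[1])

# Stratifying keeps the share of positives close to the original 0.36
test_positive_share = float(split_a[3].mean())

print(same_test_rows, round(test_positive_share, 2))
```

If you fix the seed in the project, it is worth logging it with `mlflow.log_param` as well, so that the run can be reproduced later.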

## 3.2) Building the predictive model

In [73]:
# We will build 100 decision trees
NUM_ESTIMATORS = 100
RANDOM_STATE = 99

In [74]:
from sklearn.ensemble import RandomForestClassifier 

# Random forest made of NUM_ESTIMATORS decision trees
rf = RandomForestClassifier(n_estimators = NUM_ESTIMATORS, random_state = RANDOM_STATE)

Log the model parameters

In [75]:
mlflow.log_param("Número de estimadores", NUM_ESTIMATORS)
mlflow.log_param("Valor semilla", RANDOM_STATE)

99

In [76]:
rf.fit(x_train, y_train)

# 4) Class prediction and score ✅

In [77]:
# On training
pred_tr=rf.predict(x_train)

# On testing
pred_ts=rf.predict(x_test)

## 4.1 Validating on training
We build a dataframe with the actual values vs. the predicted ones

In [78]:
df_val_tr=pd.DataFrame({'y_train':y_train, 'pred_tr':pred_tr})

# What fraction did we get right?
sum(df_val_tr.y_train==df_val_tr.pred_tr)/len(df_val_tr)

0.9316546762589928

In [79]:
from sklearn.metrics import accuracy_score

In [80]:
accuracy_train = accuracy_score(df_val_tr.y_train, df_val_tr.pred_tr, normalize=True)
print(accuracy_train)

0.9316546762589928
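
Both cells give the same number because accuracy is simply the fraction of matching labels. A minimal sketch on five made-up labels, which also shows what `normalize=False` returns:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0, 1])  # hypothetical labels
y_pred = np.array([0, 1, 0, 0, 1])  # hypothetical predictions

manual = (y_true == y_pred).mean()                       # fraction of matches
fraction = accuracy_score(y_true, y_pred)                # normalize=True is the default
count = accuracy_score(y_true, y_pred, normalize=False)  # number of matches

print(manual, fraction, count)
```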


Log the training accuracy

In [81]:
mlflow.log_metric("Accuracy en Train", accuracy_train)

## 4.2 Validating on testing
We build a dataframe with the actual values vs. the predicted ones

In [82]:
df_val_ts=pd.DataFrame({'y_test':y_test, 'pred_ts':pred_ts})

In [83]:
# What fraction did we get right?
accuracy_test = accuracy_score(df_val_ts.y_test, df_val_ts.pred_ts, normalize=True)
print(accuracy_test)

0.7973199329983249


Log the test accuracy

In [84]:
mlflow.log_metric("Accuracy en Test", accuracy_test)

# 5) Saving a model 💾

First we save the classification model the conventional way, with `pickle`

In [85]:
import pickle

In [86]:
# Save to disk; the with block closes the file once the dump finishes
filename = 'rf.pkl'
with open(filename, 'wb') as f:
    pickle.dump(rf, f) # rf = our model

In [87]:
# Load it back to use it at another time
with open(filename, 'rb') as f:
    rf_loaded = pickle.load(f)
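
A quick way to confirm that a pickled model survives the round trip is to compare predictions before and after loading. The sketch below is self-contained and uses made-up data and a temporary folder, so the toy arrays, the 10 trees, and the path are assumptions rather than the course model:

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data, only to have a fitted model to save
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=99).fit(X, y)

# Write and read the model using context managers so the files get closed
path = Path(tempfile.mkdtemp()) / 'rf.pkl'
with open(path, 'wb') as f:
    pickle.dump(model, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly
identical = np.array_equal(model.predict(X), restored.predict(X))
print(identical)
```

Keep in mind that a pickle file should only be loaded from a source you trust, and ideally with the same scikit-learn version that wrote it.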

Log the model with MLflow

In [88]:
mlflow.sklearn.log_model(rf, "mi_modelo")

<mlflow.models.model.ModelInfo at 0x1fc82923490>

## 9) Accessing the MLflow UI 📍

The UI opens at `localhost:5000` and stays available until you stop the cell.

In [91]:
!mlflow ui

^C


In [90]:
#!mlflow models serve --model-uri models:/satisfaccion_de_vuelo/Staging -p 1234 --no-conda

Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3568.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3568.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "c:\Users\Carlos\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\Scripts\mlflow.exe\__main__.py", line 7, in <module>
  File "C:\Users\Carlos\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\Carlos\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\click\core.py", line 1053, in main
    rv = s