## Final Project – Machine Learning as a Service (MLaaS)

*__Description:__*  
Using the dataset you selected earlier, complete the following steps to build an API that serves predictions.

Inside a Notebook, write the code that accomplishes the following:

1. Create an environment in Anaconda.
2. Load the dataset.
3. EDA and profiling of the data.
4. Propose a solution for the warnings generated in the EDA.
5. Split the data into 80% training and 20% testing.
6. Three PyCaret setup configurations, and selection of the best model.
7. Store the 3 pipelines on disk.
8. API serving the 3 models.
9. One log entry per generated prediction.
10. Streamlit web application consuming the API.



### 1. Create an environment in Anaconda  
The environment must contain the components and libraries needed to run the code successfully. Some of the libraries to consider are:
* __a.__  PyCaret.
* __b.__  Flask.
* __c.__  Streamlit.
* __d.__  Pandas Profiling (now distributed as `ydata-profiling`).

-- Create a new env with Python 3.9:  
`$ conda create --name mlops python=3.9`

-- Activate the new env:  
`$ conda activate mlops`

-- Install PyCaret:  
`$ pip install pycaret`

-- Install Flask:  
`$ pip install Flask`

-- Install Streamlit:  
`$ pip install streamlit`

-- Install Pandas Profiling:  
`$ pip install ydata-profiling`
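
Once the packages are installed, it can help to confirm that the active environment actually sees them. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `installed_versions` helper are names invented for this illustration, and the distribution names are assumed to match the ones used in the `pip install` commands above.

```python
from importlib import metadata

# Distribution names as installed with pip above (assumed to match PyPI names).
REQUIRED = ["pycaret", "Flask", "streamlit", "ydata-profiling"]


def installed_versions(packages):
    """Return {package: version string, or None when it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name:<16} {version or 'MISSING'}")
```

Running it inside the `mlops` environment should list a version next to each package; any `MISSING` entry points at an install step that did not complete.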

In [2]:
from pycaret.regression import *
from pycaret.datasets import get_data
from flask import Flask, request, jsonify
import pandas as pd
from ydata_profiling import ProfileReport

from datetime import datetime

### 2. Load the dataset

Load the data into the Notebook.

In [3]:
# the raw data lives in the /data/raw/ directory
data = pd.read_csv("../data/raw/Customer_Churn_Dataset.csv")

### 3. EDA and profiling of the data

Run a data profile with pandas profiling and comment on the results in the generated report,
especially any warnings that appear when the data is loaded.

The dataset consists of more than 440K customer records, each with its features and a churn label; 4 of the variables are categorical and 8 are numeric. Less than 0.1% of the data is missing and there are 0 duplicate rows.

In the correlations, there is a strong positive correlation between Churn and the total number of support calls.

The Churn variable is 57% value 1 (yes) and 43% value 0 (no).

*__WARNINGS__*  
The report raises warnings for the variables Support Calls and Payment Delay; both warnings indicate that these variables contain a large share of zero values:  
* Support Calls - 15.9%  
* Payment Delay - 3.8%  

However, given what these variables measure, zero is a normal value for both, so the warnings are not treated.
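
The zero-value shares reported by the profiler can be cross-checked directly with pandas. This is a small sketch, not part of the original notebook: `zero_share` is a hypothetical helper, and the toy frame only stands in for the real dataset.

```python
import pandas as pd


def zero_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of rows equal to 0 for each numeric column, sorted descending."""
    numeric = df.select_dtypes(include="number")
    return (numeric.eq(0).mean() * 100).sort_values(ascending=False)


# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "Support Calls": [0, 2, 0, 5],
    "Payment Delay": [3, 0, 7, 12],
    "Gender": ["Female", "Male", "Male", "Female"],
})
print(zero_share(toy))
```

Calling `zero_share(data)` on the loaded dataset should give figures comparable to the ones listed in the warnings.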

In [4]:
profile = ProfileReport(data, title ="Customer Churn")
profile




### 4. Propose a solution for the warnings generated in the EDA

_**NOTE:**_ The warnings do not call for treatment. Some only flag variables that are highly correlated; the others flag that Payment Delay and Support Calls contain many zero values, which is normal given what these variables record. Therefore, no treatment is proposed for the generated warnings.




In [5]:
data.dropna(inplace=True)

### 5. Split the data into 80% training and 20% testing

Divide the data into two parts, one for training and one for testing.

In [6]:
num_row_train = int(len(data)*0.8)

data_train = data.sample(n=num_row_train, random_state=2023)
data_test = data.drop(data_train.index)

In [7]:
print(f"Rows in train: {data_train.shape[0]}")
print(f"Rows in test: {data_test.shape[0]}")

Rows in train: 51499
Rows in test: 12875
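
The sample-then-drop split used above relies on the index labels being unique. The sketch below reproduces the same idea on a toy frame and checks that the two parts are disjoint; `split_80_20` is a hypothetical helper, and the index reset is an extra precaution added for this illustration rather than something the notebook does.

```python
import pandas as pd


def split_80_20(df: pd.DataFrame, seed: int = 2023):
    """Sample 80% of the rows for training; the remaining rows form the test set.

    Dropping by index only works when index labels are unique, so the index
    is reset first (relevant, for example, after concatenating frames).
    """
    df = df.reset_index(drop=True)
    train = df.sample(n=int(len(df) * 0.8), random_state=seed)
    test = df.drop(train.index)
    return train, test


toy = pd.DataFrame({"x": range(10), "Churn": [0, 1] * 5})
train, test = split_80_20(toy)
print(len(train), len(test))
```

Note that PyCaret's `setup` performs its own internal train/test split on whatever frame it receives, so the 20% held out here stays completely unseen during model comparison.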


### 6. Three PyCaret setup configurations and selection of the best model  
Use PyCaret and its AutoML tools to build a model that makes predictions satisfactorily. Include at least three setup configurations that apply feature engineering, and select the best model across the three.  
Keep the following aspects of feature engineering in mind:  
* Imputation of numeric variables.  
* Imputation of categorical variables.  
* Encoding of categorical variables.  
* Variable transformation.  
* Outlier treatment.  
* Feature normalization.  
* Removal of features not used by the model.  
Imputation of numeric variables:  
_**mean**_

Imputation of categorical variables:  
_**mode**_

Encoding of categorical variables:  
_**one-hot encoding**_

Outlier treatment:  
_**drop**_

Feature normalization:  
_**min-max normalization**_

Removal of features not used by the model:  
_**NaN > 15%; low or no correlation; irrelevant to the model**_  
_**CustomerID**_  


In [8]:
dataset = setup(data = data_train,
                target='Churn',
                ignore_features = ['CustomerID'],
                normalize = True,
                normalize_method='minmax',
                session_id=2023,
                transformation= True, 
                transformation_method = 'yeo-johnson',
                transform_target = True,
                remove_outliers= False,
                remove_multicollinearity = False,
                low_variance_threshold = 0.1,
                imputation_type ='simple',
                numeric_imputation ='mean',
                categorical_imputation = 'mode')


Unnamed: 0,Description,Value
0,Session id,2023
1,Target,Churn
2,Target type,Regression
3,Original data shape,"(51499, 12)"
4,Transformed data shape,"(51499, 15)"
5,Transformed train set shape,"(36049, 15)"
6,Transformed test set shape,"(15450, 15)"
7,Ignore features,1
8,Ordinal features,1
9,Numeric features,7


In [9]:
dataset.X_train_transformed


Unnamed: 0,Age,Gender,Tenure,Usage Frequency,Support Calls,Payment Delay,Subscription Type_Standard,Subscription Type_Premium,Subscription Type_Basic,Contract Length_Annual,Contract Length_Monthly,Contract Length_Quarterly,Total Spend,Last Interaction
15532,0.048565,0.0,0.556910,0.435898,0.000000,0.000000,1.0,0.0,0.0,1.0,0.0,0.0,0.403346,0.488353
9458,0.432813,0.0,0.284797,0.325663,0.218517,0.235867,0.0,1.0,0.0,1.0,0.0,0.0,0.365686,0.618994
10596,0.189303,0.0,0.791736,0.202900,0.423027,0.602441,1.0,0.0,0.0,1.0,0.0,0.0,0.901015,0.419425
9093,0.980679,0.0,0.152682,0.947319,0.111854,0.535985,0.0,0.0,1.0,0.0,1.0,0.0,0.017665,0.347411
34644,0.119783,0.0,0.867134,0.633154,0.218517,0.336143,0.0,1.0,0.0,1.0,0.0,0.0,0.126715,0.618994
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38993,0.703621,0.0,0.911806,0.363541,1.000000,1.000000,0.0,1.0,0.0,0.0,1.0,0.0,0.077449,0.000000
13759,0.234819,1.0,0.985402,0.202900,1.000000,0.635644,0.0,1.0,0.0,0.0,0.0,1.0,0.982753,0.310009
8292,0.048565,0.0,0.373210,0.245610,0.906402,0.402848,1.0,0.0,0.0,1.0,0.0,0.0,0.631044,0.650412
42107,0.496709,0.0,0.699518,0.723722,0.811951,0.269326,0.0,1.0,0.0,0.0,0.0,1.0,0.294549,0.618994


#### Automatic model training and selection

Include at least three setup configurations that apply feature engineering, and select the best model across the three.
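
One way to organize three setup configurations is to keep the shared settings in one dictionary and the per-configuration overrides in another. The sketch below is only an illustration: the names `BASE`, `CONFIGS`, `setup_kwargs` and `best_r2_per_config` are hypothetical, `config_1` follows the normalization and transformation choices of the notebook's own `setup` call, and `config_2` and `config_3` are assumed variations that do not come from the notebook.

```python
# Settings shared by every configuration (taken from the notebook's setup call).
BASE = dict(
    target="Churn",
    ignore_features=["CustomerID"],
    session_id=2023,
    imputation_type="simple",
    numeric_imputation="mean",
    categorical_imputation="mode",
)

# Hypothetical variations; only "config_1" follows the notebook's choices.
CONFIGS = {
    "config_1": dict(normalize=True, normalize_method="minmax",
                     transformation=True, transformation_method="yeo-johnson",
                     remove_outliers=False),
    "config_2": dict(normalize=True, normalize_method="zscore",
                     transformation=False, remove_outliers=True),
    "config_3": dict(normalize=True, normalize_method="robust",
                     transformation=True, transformation_method="quantile",
                     numeric_imputation="median"),
}


def setup_kwargs(name: str) -> dict:
    """Merge the shared settings with one configuration's overrides."""
    return {**BASE, **CONFIGS[name]}


def best_r2_per_config(data):
    """Run setup + compare_models once per configuration; return the best R2 of each."""
    # Imported lazily so the dictionaries above can be inspected without PyCaret.
    from pycaret.regression import setup, compare_models, pull

    scores = {}
    for name in CONFIGS:
        setup(data=data, verbose=False, **setup_kwargs(name))
        compare_models(exclude=["lightgbm"], sort="R2")
        scores[name] = pull()["R2"].iloc[0]  # first row = best model by R2
    return scores
```

Calling `best_r2_per_config(data_train)` would then give one R2 per configuration, which makes the comparison across the three setups explicit.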

In [10]:
best = compare_models(exclude = ['lightgbm'], sort='R2')

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
rf,Random Forest Regressor,0.0046,0.0021,0.045,0.9917,0.0301,0.0059,5.15
dt,Decision Tree Regressor,0.0032,0.0032,0.0552,0.9872,0.0382,0.003,0.98
et,Extra Trees Regressor,0.0142,0.0037,0.0609,0.9851,0.0397,0.0198,6.774
gbr,Gradient Boosting Regressor,0.1381,0.0368,0.1917,0.8526,0.1181,0.2095,6.228
ada,AdaBoost Regressor,0.15,0.0658,0.2564,0.7363,0.1722,0.1939,3.157
knn,K Neighbors Regressor,0.148,0.0724,0.269,0.7098,0.1848,0.1721,1.911
huber,Huber Regressor,0.2953,0.1363,0.3692,0.4532,0.2513,0.3605,1.017
lar,Least Angle Regression,0.3035,0.1387,0.3724,0.4438,0.252,0.3883,0.801
ridge,Ridge Regression,0.3036,0.1387,0.3724,0.4438,0.252,0.3883,0.787
lr,Linear Regression,0.3035,0.1387,0.3724,0.4437,0.252,0.3884,1.704


The top three models and their metrics:


| Model                        | MAE    | MSE    | RMSE   | R2     | RMSLE  | MAPE   | TT (Sec) |
|------------------------------|--------|--------|--------|--------|--------|--------|----------|
| rf (Random Forest Regressor) | 0.0046 | 0.0021 | 0.0450 | 0.9917 | 0.0301 | 0.0059 | 5.1500   |
| dt (Decision Tree Regressor) | 0.0032 | 0.0032 | 0.0552 | 0.9872 | 0.0382 | 0.0030 | 0.9800   |
| et (Extra Trees Regressor)   | 0.0142 | 0.0037 | 0.0609 | 0.9851 | 0.0397 | 0.0198 | 6.7740   |

**The winning model is rf, which has the highest R2 (0.9917).**

In [11]:
rf_model = create_model('rf')
dt_model = create_model('dt')
et_model = create_model('et')

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,0.0036,0.0014,0.0374,0.9944,0.0246,0.005
1,0.0039,0.0015,0.0382,0.9941,0.025,0.0054
2,0.0036,0.0016,0.0404,0.9935,0.0279,0.004
3,0.0053,0.0024,0.0494,0.9902,0.0331,0.0069
4,0.0044,0.0022,0.0466,0.9913,0.0323,0.0048
5,0.0051,0.0023,0.0478,0.9908,0.0309,0.0075
6,0.0062,0.0032,0.0565,0.9872,0.0388,0.0076
7,0.0044,0.0018,0.0422,0.9929,0.0278,0.006
8,0.0043,0.0018,0.0421,0.9929,0.0279,0.0058
9,0.0046,0.0025,0.0498,0.99,0.0327,0.0066


Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,0.0008,0.0008,0.0288,0.9967,0.02,0.0012
1,0.0017,0.0017,0.0408,0.9933,0.0283,0.0029
2,0.0028,0.0028,0.0527,0.9889,0.0365,0.0029
3,0.0036,0.0036,0.0601,0.9855,0.0416,0.0024
4,0.0039,0.0039,0.0623,0.9844,0.0432,0.0029
5,0.0036,0.0036,0.0601,0.9856,0.0416,0.004
6,0.0055,0.0055,0.0745,0.9777,0.0516,0.0047
7,0.0028,0.0028,0.0527,0.9889,0.0365,0.0017
8,0.0031,0.0031,0.0552,0.9878,0.0383,0.0029
9,0.0042,0.0042,0.0645,0.9833,0.0447,0.0041


Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,0.0134,0.0032,0.0569,0.987,0.037,0.0184
1,0.0151,0.0042,0.0646,0.9833,0.0402,0.0236
2,0.0135,0.0033,0.0574,0.9868,0.039,0.0169
3,0.0137,0.0037,0.0609,0.985,0.0403,0.0188
4,0.0125,0.0032,0.0566,0.9871,0.0382,0.0163
5,0.0159,0.0045,0.0673,0.9819,0.0428,0.0232
6,0.0145,0.0036,0.06,0.9855,0.0389,0.0208
7,0.013,0.0032,0.0564,0.9872,0.037,0.0172
8,0.0154,0.0041,0.0639,0.9836,0.0422,0.021
9,0.0148,0.0042,0.0651,0.983,0.0416,0.0221


### 7. Store the pipelines on disk

Store the three winning pipelines, based on the results of the previous step.

They are saved at the location dictated by the folder hierarchy of the cookiecutter template in use.  
The paths are:  
* _'/models/model_v1'_
* _'/models/model_v2'_
* _'/models/model_v3'_
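
The mapping between version names, estimators and files can be kept in one place, which later helps the API find the right pickle. This is a sketch under assumptions: `MODELS_DIR`, `MODEL_VERSIONS` and the helper functions are hypothetical names, and the folder is taken relative to the notebook directory, as in this notebook's `save_model` calls. PyCaret's `save_model` appends `.pkl` to the name it is given.

```python
from pathlib import Path

# Models folder relative to the notebook directory; adjust if the layout differs.
MODELS_DIR = Path("../models")

# Version name -> PyCaret estimator id, following the ranking in step 6.
MODEL_VERSIONS = {"model_v1": "rf", "model_v2": "dt", "model_v3": "et"}


def model_name(version: str) -> str:
    """Value to pass as model_name to save_model/load_model (no extension)."""
    return str(MODELS_DIR / version)


def pickle_path(version: str) -> Path:
    """File expected on disk: PyCaret appends '.pkl' to the given name."""
    return MODELS_DIR / f"{version}.pkl"


def missing_models() -> list:
    """Versions whose pickle file is not on disk yet."""
    return [v for v in MODEL_VERSIONS if not pickle_path(v).exists()]
```

After the pipelines are saved, `missing_models()` should return an empty list when called from the notebook directory.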




In [13]:
model_v1 = finalize_model(estimator=rf_model)
model_v2 = finalize_model(estimator=dt_model)
model_v3 = finalize_model(estimator=et_model)

In [14]:
save_model(model=model_v1, model_name='../models/model_v1')
save_model(model=model_v2, model_name='../models/model_v2')
save_model(model=model_v3, model_name='../models/model_v3')

Transformation Pipeline and Model Successfully Saved
Transformation Pipeline and Model Successfully Saved
Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('target_transformation',
                  TransformerWrapperWithInverse(transformer=TargetTransformer(estimator=PowerTransformer(standardize=False)))),
                 ('numerical_imputer',
                  TransformerWrapper(include=['Age', 'Tenure', 'Usage Frequency',
                                              'Support Calls', 'Payment Delay',
                                              'Total Spend',
                                              'Last Interaction'],
                                     transformer=SimpleImputer())...
                  TransformerWrapper(exclude=[],
                                     transformer=VarianceThreshold(threshold=0.1))),
                 ('transformation',
                  TransformerWrapper(transformer=PowerTransformer(standardize=False))),
                 ('normalize', TransformerWrapper(transformer=MinMaxScaler())),
                 ('clean_column_names',
             

In [16]:
load_rf_model = load_model('../models/model_v1')
load_dt_model = load_model('../models/model_v2')
load_et_model = load_model('../models/model_v3')

Transformation Pipeline and Model Successfully Loaded
Transformation Pipeline and Model Successfully Loaded
Transformation Pipeline and Model Successfully Loaded


In [17]:
load_rf_model
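
With the three pipelines loaded, the API and the per-prediction log from the project plan could be sketched roughly as follows. Everything here is an assumption made for illustration: the route `/predict/<model_name>`, the JSON-lines log format, the `predictions.log` file name, and the `predictors` interface (a dictionary from model name to a callable that takes the request's feature dictionary and returns a number) are not taken from the notebook.

```python
import json
from datetime import datetime, timezone


def log_prediction(log_path, model_name, features, prediction):
    """Append one JSON line per prediction: timestamp, model, input, output."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model_name,
        "features": features,
        "prediction": prediction,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def create_app(predictors, log_path="predictions.log"):
    """Build a Flask app exposing POST /predict/<model_name>.

    `predictors` maps a model name to a callable that takes the request's
    JSON dict and returns a number (hypothetical interface for this sketch).
    """
    from flask import Flask, request, jsonify  # imported lazily

    app = Flask(__name__)

    @app.route("/predict/<model_name>", methods=["POST"])
    def predict(model_name):
        if model_name not in predictors:
            return jsonify(error=f"unknown model: {model_name}"), 404
        features = request.get_json(force=True)
        prediction = float(predictors[model_name](features))
        log_prediction(log_path, model_name, features, prediction)
        return jsonify(model=model_name, prediction=prediction)

    return app
```

For a PyCaret pipeline, a predictor could wrap `predict_model(pipeline, data=pd.DataFrame([features]))` and read the prediction column (named `prediction_label` in PyCaret 3.x). The Streamlit application would then post the form values to this endpoint.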

### 9. Build an API

With the predicted data, compute R2,
RMSE, MSE and MAPE, and provide your final comments on the results.
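
For reference, the four metrics can be computed from the true and predicted values with plain Python. This is a minimal sketch, not the notebook's code: `regression_metrics` is a hypothetical helper, and skipping rows whose true value is 0 in MAPE is a choice made here, so the result may differ from what a library reports for a 0/1 target such as Churn.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, R2 and MAPE for two equal-length sequences."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r * r for r in residuals) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R2 is undefined when every true value is identical (ss_tot == 0).
    r2 = 1 - (mse * n) / ss_tot if ss_tot else float("nan")
    # Assumption of this sketch: rows whose true value is 0 are left out of
    # MAPE, since the percentage error is undefined there.
    pct = [abs(r / t) for r, t in zip(residuals, y_true) if t != 0]
    mape = sum(pct) / len(pct) if pct else float("nan")
    return {"MSE": mse, "RMSE": math.sqrt(mse), "R2": r2, "MAPE": mape}


print(regression_metrics([1, 2, 3, 4], [1, 2, 3, 5]))
```

The same function can be applied to the held-out 20% once predictions are available for it.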

| Model                           | MAE       | MSE          | RMSE      | R2     | RMSLE  | MAPE   |
|---------------------------------|-----------|--------------|-----------|--------|--------|--------|
| Light Gradient Boosting Machine | 1506.4913 | 3077860.9119 | 1754.3833 | 0.2217 | 0.9312 | 2.8487 |
<br>  

  
---
<BR>  

**Final comments:**  
The two final models had slightly different R2 values; after applying PyCaret's tuning, those values shifted slightly and the model's predictions improved.  
This is why the R2 of the model in production differs slightly from the R2 of the same model in training: the `tune_model` function was called with the parameter `optimize = 'R2'`.
Example of use: `ml_gbr = tune_model(estimator=ml_gbr, optimize = 'R2')`. The same parameter lets us optimize for whichever metric we choose (MAE, RMSE, etc.).


