# **Deployment of the MeLi Model with Docker and FastAPI**

### ***Made by Elizabeth Granda***

### **[Link to the presentation](https://www.canva.com/design/DAGW-89Taz4/MGY-vM0yx48RKxoU-NP3kw/edit?utm_content=DAGW-89Taz4&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton)**

The model is productized with **FastAPI and Docker**. This process converts a trained model into a service that can receive external requests, process them, and return predictions.

As mentioned in the notebook `Modelo_Final_Elizabeth`, **XGBoost** is the classification algorithm chosen for productization, since its confusion matrix suited the context of the problem better than those of the other two models.

This exercise required a small module, `preprocessing.py`, which contains all the preprocessing the data must go through before the model can use it. For the steps to be chained in a scikit-learn `Pipeline`, each one is implemented as a class with `fit` and `transform` methods; the classes are instantiated and their methods are applied to the desired variables and/or features. Each class is documented in `preprocessing.py` itself, so refer to that file for details on how it works.
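To illustrate the class-based style that a scikit-learn `Pipeline` expects, here is a minimal sketch of one such step. The class name `MapColumnToBinary` and the "greater than zero" rule are assumptions made for illustration; the actual classes and their logic live in `preprocessing.py`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class MapColumnToBinary(BaseEstimator, TransformerMixin):
    """Hypothetical step: replace `column` with a 0/1 flag named `<column>_bin`."""

    def __init__(self, column="K"):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn from the data; return self so the step can be chained.
        return self

    def transform(self, X):
        X = X.copy()  # never modify the caller's frame in place
        flag_name = f"{self.column.lower()}_bin"
        # Assumed rule: any value above zero becomes 1, everything else 0.
        X[flag_name] = (X[self.column] > 0).astype(int)
        return X.drop(columns=[self.column])


pipe = Pipeline(steps=[("to_binary", MapColumnToBinary(column="K"))])
demo = pd.DataFrame({"A": [1, 2, 3], "K": [0.0, 0.7, 0.0]})
result = pipe.fit_transform(demo)
print(result)
```

Because the step inherits from `BaseEstimator` and `TransformerMixin`, it gets `get_params` and `fit_transform` for free and can sit next to any other step in a `Pipeline`.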

---


In [1]:
import pickle
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from preprocessing import *  # import everything defined in the module we created
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import mutual_info_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
data = pd.read_csv("TechnicalInterviewFraudPrevention.csv")

In [3]:
data.head()

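| Unnamed: 0 | A | B | C | D | E | F | G | H | I | J | ... | L | M | N | O | P | Q | R | S | Monto | Fraude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 10 | 50257.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | UY | ... | 0 | 3 | 1 | 0 | 5 | 0.0 | 0.0 | 7.25 | 37.51 | 1 |
| 1 | 0 | 10 | 29014.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | UY | ... | 0 | 1 | 1 | 0 | 3 | 0.0 | 0.0 | 11.66 | 8.18 | 1 |
| 2 | 0 | 7 | 92.0 | 0 | 1 | 0.0 | 0.0 | 0 | 1 | UY | ... | 0 | 3 | 1 | 0 | 2 | 0.0 | 0.0 | 86.97 | 13.96 | 1 |
| 3 | 9 | 16 | 50269.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | UY | ... | 0 | 3 | 1 | 0 | 5 | 0.0 | 0.0 | 2.51 | 93.67 | 1 |
| 4 | 0 | 8 | 8180.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | UY | ... | 0 | 1 | 1 | 0 | 1 | 0.0 | 0.0 | 25.96 | 135.4 | 1 |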


The `preprocessing.py` file describes one of its classes, `AssignCountryGroupsAndOneHotEncode()`. This class groups and then one-hot encodes column `J`, which holds the country where the transaction took place. As noted there, grouping this variable may seem to introduce a lot of bias; however, the ranges were chosen because they fit the frequency distribution of the countries. The following counts show this:


In [4]:
value_counts = data['J'].value_counts()


# Non-overlapping frequency ranges: 1-20, 21-350 and 351-9400 transactions
countries_1_to_20 = value_counts[(value_counts >= 1) & (value_counts <= 20)]
countries_20_to_350 = value_counts[(value_counts > 20) & (value_counts <= 350)]
countries_350_to_9400 = value_counts[(value_counts > 350) & (value_counts <= 9400)]

# Count countries in each range
count_1_to_20 = len(countries_1_to_20)
count_20_to_350 = len(countries_20_to_350)
count_350_to_9400 = len(countries_350_to_9400)

count_1_to_20, count_20_to_350, count_350_to_9400

(13, 3, 3)

In [5]:
data.J.value_counts()

J
AR    9329
BR    4428
MX    2366
ES     314
US     230
UY     180
CA      12
GB       8
GT       2
FR       2
UA       1
IT       1
CO       1
CL       1
PT       1
TR       1
CH       1
KR       1
AU       1
Name: count, dtype: int64
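The frequency-based grouping of column `J` can be sketched as follows. This is only an illustration under stated assumptions: the helper `country_to_group`, the group labels `low`/`mid`/`high`, the `J_group` prefix, and the handling of unseen countries are hypothetical, and the real logic is the one in `AssignCountryGroupsAndOneHotEncode()`. The thresholds (20 and 350) and the counts come from the cells above.

```python
import pandas as pd

# Country frequencies copied from the value_counts() output above (subset).
train_counts = {"AR": 9329, "BR": 4428, "MX": 2366,
                "ES": 314, "US": 230, "UY": 180,
                "CA": 12, "GB": 8}


def country_to_group(country: str) -> str:
    """Map a country code to a frequency group using the training counts."""
    n = train_counts.get(country, 0)  # assumption: unseen countries join the rarest group
    if n <= 20:
        return "low"
    if n <= 350:
        return "mid"
    return "high"


batch = pd.DataFrame({"J": ["AR", "UY", "CA", "JP"]})
groups = batch["J"].map(country_to_group)

# Fixing the categories keeps the one-hot columns stable even when a batch
# (or a single request) does not contain every group.
groups = groups.astype(pd.CategoricalDtype(["low", "mid", "high"]))
encoded = pd.get_dummies(groups, prefix="J_group").astype(int)
print(encoded)
```

Looking the frequency up in counts learned at training time, rather than recomputing it on incoming data, matters once the model is served, because a single request carries no frequency information of its own.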

In [6]:
features = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N',
       'O', 'P', 'Q', 'R', 'S', 'Monto']

data = data[features + ["Fraude"]].copy()  # copy so the fillna assignments below do not act on a slice

In [7]:
data["C"] = data["C"].fillna(data["C"].mean())  # fill missing values in C with the mean, as before
data["K"] = data["K"].fillna(0)  # fill missing values in K with zero (the preprocessing accounts for this)

In [8]:
data.isnull().sum()

A         0
B         0
C         0
D         0
E         0
F         0
G         0
H         0
I         0
J         0
K         0
L         0
M         0
N         0
O         0
P         0
Q         0
R         0
S         0
Monto     0
Fraude    0
dtype: int64

In [9]:
X = data.drop("Fraude", axis=1)
y = data["Fraude"]

In [10]:
X_full_train, X_test = train_test_split(data, test_size=0.2, random_state=1)
X_train, X_val = train_test_split(X_full_train, test_size=0.25, random_state=1)

In [11]:
X_train = X_train.reset_index(drop=True)
X_val = X_val.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)

In [12]:
y_train = X_train.Fraude.values
y_val = X_val.Fraude.values
y_test = X_test.Fraude.values

del X_train['Fraude']
del X_val['Fraude']
del X_test['Fraude']
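The two `train_test_split` calls above produce a 60/20/20 split, because the second call takes 25% of the remaining 80%. A short check of that arithmetic follows; the row count of 16,880 is an assumption obtained by summing the `J` value counts printed earlier.

```python
from fractions import Fraction

test_share = Fraction(1, 5)                    # test_size=0.2 on the full frame
val_share = (1 - test_share) * Fraction(1, 4)  # test_size=0.25 on the remaining 80%
train_share = 1 - test_share - val_share

n_rows = 16880  # assumption: sum of the J value counts shown earlier
sizes = {
    "train": int(train_share * n_rows),
    "val": int(val_share * n_rows),
    "test": int(test_share * n_rows),
}
print(train_share, val_share, test_share)
print(sizes)
```

Under that assumption the test set holds 3,376 rows, which agrees with the total of the test confusion matrix reported further down.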

In [13]:
data.dtypes

A           int64
B           int64
C         float64
D           int64
E           int64
F         float64
G         float64
H           int64
I           int64
J          object
K         float64
L           int64
M           int64
N           int64
O           int64
P           int64
Q          object
R          object
S         float64
Monto      object
Fraude      int64
dtype: object

We apply undersampling, as we did in the other notebook:

In [14]:
y_train_series = pd.Series(y_train)

undersample = RandomUnderSampler(sampling_strategy=0.75, random_state=1)
X_train_undersample, y_train_undersample = undersample.fit_resample(X_train, y_train_series)
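For a binary target, a float `sampling_strategy` is documented as the desired ratio of minority to majority samples after resampling. The sketch below reproduces only that arithmetic with the standard library; it is not imbalanced-learn's implementation, and the helper name and the toy label counts are made up for illustration.

```python
import random
from collections import Counter


def undersample_majority(labels, ratio, seed=1):
    """Return the indices to keep so that minority / majority is about `ratio`."""
    counts = Counter(labels)
    (majority, n_major), (minority, n_minor) = counts.most_common(2)
    n_keep = min(n_major, int(n_minor / ratio))  # majority rows that survive

    rng = random.Random(seed)
    major_idx = [i for i, y in enumerate(labels) if y == majority]
    minor_idx = [i for i, y in enumerate(labels) if y == minority]
    # All minority rows are kept; majority rows are sampled without replacement.
    return sorted(rng.sample(major_idx, n_keep) + minor_idx)


labels = [0] * 900 + [1] * 300  # toy data: 900 legitimate, 300 fraud
kept = undersample_majority(labels, ratio=0.75)
after = Counter(labels[i] for i in kept)
print(after)  # Counter({0: 400, 1: 300})
```

With a ratio of 0.75, every minority row is kept and the majority class is cut to 300 / 0.75 = 400 rows in this toy example.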

We build our `Pipeline`, whose first step is the `CustomPipeline()` class implemented in `preprocessing.py`; it runs the preprocessing steps in order so that the columns created in each step are concatenated. The second step is the XGBoost classifier with the chosen hyperparameters (same as the other notebook); the Random Forest variant is kept only as a disabled reference. It looks like this:

In [15]:
# Random Forest alternative, kept for reference:
# pipeline = Pipeline(steps=[
#     ("CustomPreprocessor", CustomPipeline()),
#     ("Classifier", RandomForestClassifier(
#         n_estimators=250,
#         random_state=42,
#         class_weight="balanced",
#         max_depth=3))
# ])
pipeline = Pipeline(steps=[
    ("CustomPreprocessor", CustomPipeline()),
    ("Classifier", xgb.XGBClassifier(
        n_estimators=250,
        random_state=42,
        scale_pos_weight=2.33,
        learning_rate=0.1,
        max_depth=4,
    )),
])

In [16]:
pipeline.fit(X_train_undersample, y_train_undersample)


[CleanAndNormalizeData] Columnas antes de transformar: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'Monto']
[CleanAndNormalizeData] Columnas categóricas detectadas: ['J', 'R']
[CleanAndNormalizeData] Columnas numéricas detectadas: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'S', 'Monto']
[CleanAndNormalizeData] Columnas después de transformar: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'Monto']

[MapToBinary] Columnas antes de transformar: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'Monto']
[MapToBinary] Columna creada: 'k_bin'
[MapToBinary] Columnas después de transformar: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'Monto', 'k_bin']

[AssignCountryGroupsAndOneHotEncode] Columnas antes de transformar: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I',

**Testing performance with Test Data:**

In [17]:
y_pred = pipeline.predict(X_test)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])

# Print the results
print("********************")
print("Recall:", recall)
print("Precision:", precision)
print("F1-Score:", f1)
print("AUC-ROC:", roc_auc)


[CleanAndNormalizeData] Columnas antes de transformar: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'Monto']
[CleanAndNormalizeData] Columnas categóricas detectadas: ['J', 'R']
[CleanAndNormalizeData] Columnas numéricas detectadas: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'S', 'Monto']
[CleanAndNormalizeData] Columnas después de transformar: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'Monto']

[MapToBinary] Columnas antes de transformar: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'Monto']
[MapToBinary] Columna creada: 'k_bin'
[MapToBinary] Columnas después de transformar: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'Monto', 'k_bin']

[AssignCountryGroupsAndOneHotEncode] Columnas antes de transformar: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I',

**Testing performance with Validation Data:**

In [18]:
y_pred_val = pipeline.predict(X_val)
recall_val = recall_score(y_val, y_pred_val)
precision_val = precision_score(y_val, y_pred_val)
f1_val = f1_score(y_val, y_pred_val)
roc_auc_val = roc_auc_score(y_val, pipeline.predict_proba(X_val)[:, 1])

# Print the results
print("********************")
print("Recall:", recall_val)
print("Precision:", precision_val)
print("F1-Score:", f1_val)
print("AUC-ROC:", roc_auc_val)


[CleanAndNormalizeData] Columnas antes de transformar: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'Monto']
[CleanAndNormalizeData] Columnas categóricas detectadas: ['J', 'R']
[CleanAndNormalizeData] Columnas numéricas detectadas: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'S', 'Monto']
[CleanAndNormalizeData] Columnas después de transformar: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'Monto']

[MapToBinary] Columnas antes de transformar: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'Monto']
[MapToBinary] Columna creada: 'k_bin'
[MapToBinary] Columnas después de transformar: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'Monto', 'k_bin']

[AssignCountryGroupsAndOneHotEncode] Columnas antes de transformar: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I',

Let's print the confusion matrix for our test data:

In [19]:
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[1577  885]
 [ 238  676]]
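The test-set metrics can be recomputed from this matrix alone. scikit-learn lays the matrix out with true classes as rows and predicted classes as columns, so with `Fraude = 1` as the positive class the four cells read TN, FP, FN, TP. The values below are copied from the output above; the variable names are illustrative.

```python
# Cells of the confusion matrix printed above (rows = true, columns = predicted).
tn, fp = 1577, 885
fn, tp = 238, 676

recall = tp / (tp + fn)     # share of actual frauds that were caught
precision = tp / (tp + fp)  # share of fraud alerts that were real frauds
f1 = 2 * precision * recall / (precision + recall)

print(f"Recall:    {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

This gives a recall of about 0.74, a precision of about 0.43 and an F1-score of about 0.55: the model catches roughly three out of four frauds, at the cost of 885 legitimate transactions being flagged.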


Now let's write the model to a `pickle` file with the `.pkl` extension:

In [20]:
with open('./app/model/meli_pipeline-0.0.1.pkl', 'wb') as f:
    pickle.dump(pipeline, f)