## Model Registry

After running the notebook `rumos_bank_lending_prediction.ipynb` and confirming that the dataset loaded properly, I verified that the preprocessing was working correctly. The models trained without errors and, after analyzing the performance metrics, I selected the **Random Forest** model as the best.

### MLflow Experiment

I started by creating an experiment in **MLflow**, since an experiment groups multiple runs. It is good practice to keep all the runs we want to compare in the same experiment, even when they use different models, as long as they are applied to the same data, because this makes the results easier to compare. I also chose to reuse the same experiment when working with new training data for the same problem, so that the evaluation stays consistent.
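
As a rough illustration of why a shared experiment helps, the sketch below ranks every run of one experiment by a logged metric. It assumes the tracking URI already points at the server used in this notebook and that each run logged a `total_cost` metric; the helper names `order_clause` and `runs_by_cost` are hypothetical and are not part of the notebook.

```python
EXPERIMENT_NAME = "Rumos Bank - Bad Payer Prediction"


def order_clause(metric, ascending=True):
    """Build an order_by entry for mlflow.search_runs (hypothetical helper)."""
    direction = "ASC" if ascending else "DESC"
    return f"metrics.{metric} {direction}"


def runs_by_cost(experiment_name=EXPERIMENT_NAME):
    """Return the runs of one experiment as a DataFrame, cheapest first.

    MLflow is imported lazily so that the helper above can be used
    even where MLflow is not installed.
    """
    import mlflow

    return mlflow.search_runs(
        experiment_names=[experiment_name],
        order_by=[order_clause("total_cost")],
    )
```

Calling `runs_by_cost()` against a running tracking server should return one row per run, which makes it easy to compare models that were logged under the same experiment.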

### Trained Models

In addition to **Random Forest**, I trained and evaluated the following models:

- **Logistic Regression**  
- **K-Nearest Neighbors (KNN)**  
- **Support Vector Machine (SVM)**  
- **Decision Tree**  
- **Multi-layer Perceptron (MLP)**  

All models were:

- Trained using **GridSearchCV** for hyperparameter tuning  
- Evaluated using metrics such as **accuracy** and **total cost** (`total_cost`)  
- Adjusted with a **custom threshold** to convert predicted probabilities into binary classes, adapted to the problem's context
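
To make the custom threshold concrete, here is a minimal sketch in plain Python. The probabilities are made-up illustration values, and `apply_threshold` is a hypothetical helper; it uses a strict `>` comparison, as the notebook's `total_cost` helper does.

```python
def apply_threshold(probabilities, threshold):
    """Convert predicted probabilities of default into 0/1 classes."""
    return [1 if p > threshold else 0 for p in probabilities]


# Illustrative probabilities of default for five clients (not real data)
probs = [0.10, 0.25, 0.35, 0.60, 0.90]

default_classes = apply_threshold(probs, 0.5)
custom_classes = apply_threshold(probs, 0.3)

print(default_classes)  # [0, 0, 0, 1, 1]
print(custom_classes)   # [0, 0, 1, 1, 1]
```

Lowering the threshold flags more clients as likely bad payers, which trades additional false positives for fewer false negatives.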

### MLflow Model Registry

Each model was logged in its own MLflow run and registered in the **MLflow Model Registry**. Each run records:

- Best hyperparameters (`mlflow.log_params`)  
- Metrics: `accuracy`, `total_cost`, and `min_cost_threshold` (`mlflow.log_metric`)  
- The trained model (`mlflow.sklearn.log_model(...)` with `registered_model_name`, which creates the registry entry)  

### Model in Production

Finally, since the **Random Forest** model achieved the best performance in terms of total cost, I promoted it to **@champion** in the **Model Registry**. This model is currently in production, served through an **API developed with FastAPI**.
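
The notebook does not show how the alias was assigned or how the API loads the model, so the following is only a sketch of one way to do it. It assumes an MLflow version that supports model aliases, a registered model named `random_forest`, and a tracking URI that is already configured; the helper names `champion_uri`, `promote_to_champion`, and `load_champion` are hypothetical.

```python
def champion_uri(model_name, alias="champion"):
    """Build the registry URI that resolves a model through its alias."""
    return f"models:/{model_name}@{alias}"


def promote_to_champion(model_name, version, alias="champion"):
    """Point the alias at a specific registered model version."""
    # Imported lazily so champion_uri works without MLflow installed
    from mlflow import MlflowClient

    client = MlflowClient()
    client.set_registered_model_alias(model_name, alias, str(version))


def load_champion(model_name):
    """Load whichever version currently holds the alias (for example, in an API)."""
    import mlflow.pyfunc

    return mlflow.pyfunc.load_model(champion_uri(model_name))


print(champion_uri("random_forest"))  # models:/random_forest@champion
```

Loading by alias rather than by version number means the serving code would not need to change when a different version is promoted.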


In [151]:
import os
import mlflow
import mlflow.sklearn
import pandas as pd
import numpy as np
import joblib
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from mlflow.models import infer_signature
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

In [None]:
# Get the project's root directory
root_dir = os.path.abspath(os.path.join(os.getcwd(), "../.."))

# Check if the script is running inside a Docker container
if os.getenv("DOCKER_ENV"):
    dataset_path = "/app/data/lending_data.csv"  # Path inside Docker
else:
    dataset_path = os.path.join(root_dir, "data/lending_data.csv")  # Local path

print("Using dataset at:", dataset_path)

# Check if the file actually exists before loading
if not os.path.exists(dataset_path):
    raise FileNotFoundError(f"File not found: {dataset_path}")

# Load the dataset
df = pd.read_csv(dataset_path)
seed = 42

Using dataset at: /Users/dinisguerreiro/Documents/Documentos/Cursos/Data Analysis/Operacionalização de Machine Learning/OML-trabalho-master/rumos_bank/data/lending_data.csv


## Set the tracking server where the experiments are stored

In [153]:
from pathlib import Path

uri = "http://0.0.0.0:5001"

mlflow.set_tracking_uri(uri)
#mlflow.set_tracking_uri("file:///Users/dinisguerreiro/Documents/Documentos/Cursos/Data Analysis/Operacionalização de Machine Learning/OML-trabalho-master/mlruns")

In [154]:
import mlflow
print("Tracking URI:", mlflow.get_tracking_uri())

Tracking URI: http://0.0.0.0:5001


## Set the experiment "Rumos Bank - Bad Payer Prediction"

In [155]:
mlflow.set_experiment("Rumos Bank - Bad Payer Prediction")

2025/04/01 21:20:35 INFO mlflow.tracking.fluent: Experiment with name 'Rumos Bank - Bad Payer Prediction' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/333381692465427545', creation_time=1743538835439, experiment_id='333381692465427545', last_update_time=1743538835439, lifecycle_stage='active', name='Rumos Bank - Bad Payer Prediction', tags={}>

## Load the dataset

In [156]:
df = pd.read_csv(dataset_path)

## Create the datasets

All features are numerical. I will remove the client ID:

In [157]:
df = df.drop('ID', axis = 1)

Let's now split the dataset into training and test sets:

In [158]:
train_set, test_set = train_test_split(df, test_size = 0.2, random_state = seed)

In [159]:
X_train = train_set.drop(['default.payment.next.month'], axis = 'columns')
y_train = train_set['default.payment.next.month']

X_test = test_set.drop(['default.payment.next.month'], axis = 1)
y_test = test_set['default.payment.next.month']

Normalization:

In [160]:
scaler = MinMaxScaler()

features_names = X_train.columns

X_train = scaler.fit_transform(X_train)
X_train = pd.DataFrame(X_train, columns = features_names)

X_test = scaler.transform(X_test)
X_test = pd.DataFrame(X_test, columns = features_names)
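
The scaler is fitted on the training set only and then reused on the test set. The plain-Python sketch below mirrors that idea for a single column; the values are illustrative and the helpers `fit_min_max` and `transform_min_max` are hypothetical stand-ins for what `MinMaxScaler` does with its default settings.

```python
def fit_min_max(column):
    """Learn the scaling range from the training data only."""
    return min(column), max(column)


def transform_min_max(column, lo, hi):
    """Rescale values with a previously learned range."""
    return [(x - lo) / (hi - lo) for x in column]


# Illustrative ages (not taken from the dataset)
train_age = [21, 30, 45, 60]
test_age = [25, 70]

lo, hi = fit_min_max(train_age)
scaled_train = transform_min_max(train_age, lo, hi)
scaled_test = transform_min_max(test_age, lo, hi)

print(scaled_train[0], scaled_train[-1])  # 0.0 1.0
print(scaled_test[1] > 1.0)               # True
```

Because the range comes from the training data, a test value above the training maximum is mapped above 1. That is expected and avoids leaking test-set information into the preprocessing.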

In [161]:
X_train

| | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.070707 | 1.0 | 0.333333 | 0.666667 | 0.051724 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | ... | 0.129346 | 0.229591 | 0.119957 | 0.212270 | 0.004010 | 0.002969 | 0.002335 | 0.001961 | 0.003388 | 0.001666 |
| 1 | 0.020202 | 0.0 | 0.333333 | 0.666667 | 0.120690 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | ... | 0.102352 | 0.183928 | 0.102464 | 0.178567 | 0.005731 | 0.000739 | 0.000950 | 0.001538 | 0.000000 | 0.000000 |
| 2 | 0.171717 | 1.0 | 0.833333 | 0.333333 | 0.396552 | 0.2 | 0.2 | 0.1 | 0.1 | 0.1 | ... | 0.086811 | 0.160138 | 0.087471 | 0.187399 | 0.000000 | 0.000505 | 0.000000 | 0.011081 | 0.024242 | 0.000345 |
| 3 | 0.050505 | 0.0 | 0.166667 | 0.666667 | 0.068966 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | ... | 0.107501 | 0.197477 | 0.119933 | 0.212000 | 0.002310 | 0.001128 | 0.002232 | 0.002415 | 0.004455 | 0.003794 |
| 4 | 0.121212 | 1.0 | 0.333333 | 0.666667 | 0.068966 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | ... | 0.149338 | 0.271125 | 0.200483 | 0.284403 | 0.004693 | 0.002494 | 0.005580 | 0.008052 | 0.011723 | 0.020298 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23995 | 0.040404 | 0.0 | 0.333333 | 0.666667 | 0.189655 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | ... | 0.116948 | 0.212849 | 0.109640 | 0.183794 | 0.002290 | 0.001781 | 0.001776 | 0.000116 | 0.002659 | 0.139281 |
| 23996 | 0.191919 | 0.0 | 0.166667 | 0.666667 | 0.275862 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | ... | 0.178796 | 0.314795 | 0.248252 | 0.325557 | 0.015454 | 0.003562 | 0.000000 | 0.012077 | 0.014067 | 0.007588 |
| 23997 | 0.040404 | 0.0 | 0.166667 | 0.666667 | 0.086207 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.086345 | 0.160138 | 0.080648 | 0.178567 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 23998 | 0.060606 | 1.0 | 0.333333 | 0.666667 | 0.068966 | 0.2 | 0.2 | 0.2 | 0.2 | 0.4 | ... | 0.114429 | 0.193222 | 0.109040 | 0.202517 | 0.003434 | 0.001187 | 0.005022 | 0.001932 | 0.000000 | 0.002276 |


In [162]:
joblib.dump(scaler, "scaler.pkl")
print("scaler.pkl saved successfully!")

scaler.pkl saved successfully!


## Save datasets, models, artifacts, metrics, and run parameters

In [163]:
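def total_cost(y_test, y_preds, threshold = 0.5):
    """Business cost of the errors made at a given probability threshold.

    y_preds holds predicted probabilities of default. Each false negative
    costs 3000 and each false positive costs 1000.
    """
    # Binarize the labels and the thresholded probabilities before counting errors
    tn, fp, fn, tp = confusion_matrix(y_test == 1, y_preds > threshold).ravel()

    cost_fn = fn*3000
    cost_fp = fp*1000

    return cost_fn + cost_fp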

In [164]:
def min_cost_threshold(y_true, y_proba):
    """Return the threshold in 0.0, 0.1, ..., 1.0 that gives the lowest total cost."""
    thresholds = np.arange(0, 1.1, 0.1)
    costs = {round(thresh, 1): total_cost(y_true, y_proba, threshold=thresh) for thresh in thresholds}
    return min(costs, key=costs.get)
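
The threshold search can be traced by hand on a tiny example. The sketch below reimplements the same idea in plain Python, with the same costs of 3000 per false negative and 1000 per false positive; the labels and probabilities are made up, and `cost_at_threshold` and `cheapest_threshold` are hypothetical names chosen to avoid clashing with the notebook's functions.

```python
COST_FN = 3000  # a bad payer predicted as good
COST_FP = 1000  # a good payer predicted as bad


def cost_at_threshold(y_true, y_proba, threshold):
    """Total cost of the errors when predicting 1 for probabilities above threshold."""
    fn = sum(1 for y, p in zip(y_true, y_proba) if y == 1 and not p > threshold)
    fp = sum(1 for y, p in zip(y_true, y_proba) if y == 0 and p > threshold)
    return fn * COST_FN + fp * COST_FP


def cheapest_threshold(y_true, y_proba):
    """Evaluate thresholds 0.0, 0.1, ..., 1.0 and return the cheapest with all costs."""
    thresholds = [round(0.1 * i, 1) for i in range(11)]
    costs = {t: cost_at_threshold(y_true, y_proba, t) for t in thresholds}
    return min(costs, key=costs.get), costs


# Illustrative labels and predicted probabilities (not real data)
y_true = [0, 0, 1, 0, 1, 1]
y_proba = [0.05, 0.25, 0.35, 0.45, 0.65, 0.85]

best, costs = cheapest_threshold(y_true, y_proba)
print(best, costs[best])  # 0.3 1000
print(costs[0.5])         # 3000
```

In this toy case the default threshold of 0.5 misses one bad payer (cost 3000), while 0.3 catches all three at the price of one false positive (cost 1000), so the asymmetric costs push the best threshold below 0.5.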


In [165]:
model_configs = [
    {
        "name": "logistic_regression",
        "estimator": LogisticRegression(max_iter=500, solver='lbfgs', class_weight='balanced', random_state=seed),
        "param_grid": {"C": [0.001, 0.01, 0.1, 1, 10, 100]},
        "threshold": 0.5
    },
    {
        "name": "knn_classifier",
        "estimator": KNeighborsClassifier(),
        "param_grid": {"n_neighbors": range(1, 10)},
        "threshold": 0.3
    },
    {
        "name": "svc_classifier",
        "estimator": SVC(probability=True, class_weight='balanced', gamma='scale', random_state=seed),
        "param_grid": {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]},
        "threshold": 0.3
    },
    {
        "name": "decision_tree",
        "estimator": DecisionTreeClassifier(class_weight='balanced', random_state=seed),
        "param_grid": {"max_depth": [3, 6], "min_samples_split": [2, 4, 10]},
        "threshold": 0.5
    },
    {
        "name": "random_forest",
        "estimator": RandomForestClassifier(class_weight='balanced', random_state=seed),
        "param_grid": {'n_estimators': [10, 100, 300, 1000]},
        "threshold": 0.3
    },
    {
        "name": "mlp_classifier",
        "estimator": MLPClassifier(solver='lbfgs', max_iter=1000, random_state=seed),
        "param_grid": {
            "hidden_layer_sizes": [(20,), (20, 10), (20, 10, 2)],
            "learning_rate_init": [0.0001, 0.001, 0.01, 0.1]
        },
        "threshold": 0.2
    }
]

In [166]:
for config in model_configs:
    print(f"Training and registering model: {config['name']}")
    
    with mlflow.start_run(run_name=config["name"]):
        mlflow.log_param("seed", seed)

        # Tune hyperparameters with 5-fold cross-validation on the training set
        clf = GridSearchCV(config["estimator"], config["param_grid"], cv=5).fit(X_train, y_train)
        best_model = clf.best_estimator_

        # Probability of the positive class (default) on the test set
        y_preds_proba = best_model.predict_proba(X_test)[:, 1]
        threshold = config["threshold"]
        # Strict comparison, matching the one used inside total_cost
        y_preds = (y_preds_proba > threshold).astype(int)

        acc = accuracy_score(y_test, y_preds)
        cost = total_cost(y_test, y_preds_proba, threshold)
        min_thresh = min_cost_threshold(y_test, y_preds_proba)

        # Log the tuned hyperparameters and the evaluation metrics to the run
        mlflow.log_params(clf.best_params_)
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("total_cost", cost)
        mlflow.log_metric("min_cost_threshold", min_thresh)

        signature = infer_signature(X_test, best_model.predict(X_test))

        # Log the model artifact and register it under the same name
        mlflow.sklearn.log_model(
            best_model,
            artifact_path=config["name"],
            registered_model_name=config["name"],
            signature=signature,
            input_example=X_test[:5]
        )

        print(f"Model '{config['name']}' registered successfully!")


Training and registering model: logistic_regression

Successfully registered model 'logistic_regression'.
2025/04/01 21:20:41 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: logistic_regression, version 1
Created version '1' of model 'logistic_regression'.

Model 'logistic_regression' registered successfully!
🏃 View run logistic_regression at: http://0.0.0.0:5001/#/experiments/333381692465427545/runs/c82e450dbeb5406395395dd04e5ff0b2
🧪 View experiment at: http://0.0.0.0:5001/#/experiments/333381692465427545
Training and registering model: knn_classifier

Successfully registered model 'knn_classifier'.
2025/04/01 21:20:53 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: knn_classifier, version 1
Created version '1' of model 'knn_classifier'.

Model 'knn_classifier' registered successfully!
🏃 View run knn_classifier at: http://0.0.0.0:5001/#/experiments/333381692465427545/runs/cb84e1e83cdc4cba93aa2b62c690765b
🧪 View experiment at: http://0.0.0.0:5001/#/experiments/333381692465427545
Training and registering model: svc_classifier

Successfully registered model 'svc_classifier'.
2025/04/01 21:49:18 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: svc_classifier, version 1
Created version '1' of model 'svc_classifier'.

Model 'svc_classifier' registered successfully!
🏃 View run svc_classifier at: http://0.0.0.0:5001/#/experiments/333381692465427545/runs/d62cf75593d2405183abc7f66fce078b
🧪 View experiment at: http://0.0.0.0:5001/#/experiments/333381692465427545
Training and registering model: decision_tree

Successfully registered model 'decision_tree'.
2025/04/01 21:49:25 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: decision_tree, version 1
Created version '1' of model 'decision_tree'.

Model 'decision_tree' registered successfully!
🏃 View run decision_tree at: http://0.0.0.0:5001/#/experiments/333381692465427545/runs/a4ed0c2448e94984ac751cf0e86b5b24
🧪 View experiment at: http://0.0.0.0:5001/#/experiments/333381692465427545
Training and registering model: random_forest

Successfully registered model 'random_forest'.
2025/04/01 21:53:58 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: random_forest, version 1
Created version '1' of model 'random_forest'.

Model 'random_forest' registered successfully!
🏃 View run random_forest at: http://0.0.0.0:5001/#/experiments/333381692465427545/runs/185af54e9845451f958d588ef42e512f
🧪 View experiment at: http://0.0.0.0:5001/#/experiments/333381692465427545
Training and registering model: mlp_classifier

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

Model 'mlp_classifier' registered successfully!
🏃 View run mlp_classifier at: http://0.0.0.0:5001/#/experiments/333381692465427545/runs/2ea4f384051a494dbcd014dfb7049a64
🧪 View experiment at: http://0.0.0.0:5001/#/experiments/333381692465427545

Created version '1' of model 'mlp_classifier'.


In [167]:
mlflow.end_run()

In [168]:
print(mlflow.get_tracking_uri())

http://0.0.0.0:5001
