<div style="color:#3c4d5a; border-top: 7px solid #42A5F5; border-bottom: 7px solid #42A5F5; padding: 5px; text-align: center; text-transform: uppercase"><h1>Incremental Retraining of the XGBoost Model</h1> </div>

This notebook implements the incremental retraining process of the Alzheimer's risk prediction model based on XGBoost. The objective is to update the previously trained model by incorporating new patient data, without the need to retrain it from scratch.

To do this, the notebook uses XGBoost's ability to continue training from an existing model through the `xgb_model` parameter. The base model acts as the starting point, and new decision trees are added using the recent preprocessed data. This lets the model adapt to new data while retaining the trees it has already learned.

- [Transform new samples](#tp)
- [Re-training](#re)
- [Save new version](#se)
- [Conclusion](#conclusion)

<div style="color:#37475a"><h2>Imported modules</h2> </div>

---

In [1]:
import pickle

import matplotlib.pyplot as plt
import mlflow
import mlflow.xgboost
import numpy as np
import pandas as pd
import seaborn as sns
from mlflow.models import infer_signature

<div style="color:#37475a"><h2>Load the transformer and model</h2> </div>

---

In [26]:
experiment_name = "Alzheimer_Preprocesamiento"
run_id = "a9302cdf7df7439d8a59ea7c3fb148ff"

# --- Load the transformer, the preprocessed dataset, and the model ---
prep_path = mlflow.artifacts.download_artifacts(
    run_id=run_id,
    artifact_path="preprocessor/preprocessor.pkl"
)
artifact_path = "dataset/dataset_transformado.pkl"

local_path = mlflow.artifacts.download_artifacts(
    run_id=run_id,
    artifact_path=artifact_path
)

print("File downloaded to:", local_path)

# Load the preprocessed train/test splits
with open(local_path, "rb") as f:
    data = pickle.load(f)

X_train = data["X_train_prep"]
X_test = data["X_test_prep"]
y_train = data["y_train"]
y_test = data["y_test"]

with open(prep_path, "rb") as f:
    transformador = pickle.load(f)

modelo = mlflow.xgboost.load_model(
    model_uri="models:/Alzheimer_XGBoost/latest"
)
print("Dataset successfully loaded from MLflow")

File downloaded to: C:\Users\user\AppData\Local\Temp\tmp8tsv7aqp\dataset_transformado.pkl
Dataset successfully loaded from MLflow


In [27]:
from sklearn.model_selection import train_test_split

def tomar_porcentaje(X, y, porcentaje):
    X_pct, _, y_pct, _ = train_test_split(
        X,
        y,
        train_size=porcentaje,
        random_state=42,
        stratify=y
    )
    return X_pct, y_pct


<div id="tp" style="color:#37475a; border-bottom: 7px solid orange; width: 100%; margin-bottom: 15px; padding-bottom: 2px"><h2>Transform new samples</h2> </div>

In [28]:
# Stratified subsets of the training data (40%, 75%, 100%)
X_40, y_40 = tomar_porcentaje(X_train, y_train, 0.40)
X_75, y_75 = tomar_porcentaje(X_train, y_train, 0.75)
X_100, y_100 = X_train, y_train


In [29]:
# --- new patients ---
nuevos_samples = pd.DataFrame([
    {
        "PatientID": 6901,
        "Age": 72,
        "Gender": 0,
        "Ethnicity": 1,
        "EducationLevel": 3,
        "BMI": 23.5,
        "Smoking": 0,
        "AlcoholConsumption": 10,
        "PhysicalActivity": 5,
        "DietQuality": 2,
        "SleepQuality": 6,
        "FamilyHistoryAlzheimers": 0,
        "CardiovascularDisease": 0,
        "Diabetes": 0,
        "Depression": 0,
        "HeadInjury": 0,
        "Hypertension": 1,
        "SystolicBP": 130,
        "DiastolicBP": 80,
        "CholesterolTotal": 200,
        "CholesterolLDL": 130,
        "CholesterolHDL": 50,
        "CholesterolTriglycerides": 150,
        "MMSE": 28,
        "FunctionalAssessment": 5,
        "MemoryComplaints": 0,
        "BehavioralProblems": 0,
        "ADL": 1,
        "Confusion": 0,
        "Disorientation": 0,
        "PersonalityChanges": 0,
        "DifficultyCompletingTasks": 1,
        "Forgetfulness": 0,
        "DoctorInCharge": "DrA"
    },
    {
        "PatientID": 6902,
        "Age": 78,
        "Gender": 0,
        "Ethnicity": 2,
        "EducationLevel": 1,
        "BMI": 28,
        "Smoking": 0,
        "AlcoholConsumption": 0,
        "PhysicalActivity": 2,
        "DietQuality": 1,
        "SleepQuality": 4,
        "FamilyHistoryAlzheimers": 1,
        "CardiovascularDisease": 1,
        "Diabetes": 0,
        "Depression": 1,
        "HeadInjury": 0,
        "Hypertension": 1,
        "SystolicBP": 150,
        "DiastolicBP": 90,
        "CholesterolTotal": 220,
        "CholesterolLDL": 160,
        "CholesterolHDL": 40,
        "CholesterolTriglycerides": 180,
        "MMSE": 22,
        "FunctionalAssessment": 8,
        "MemoryComplaints": 1,
        "BehavioralProblems": 1,
        "ADL": 3,
        "Confusion": 1,
        "Disorientation": 1,
        "PersonalityChanges": 1,
        "DifficultyCompletingTasks": 1,
        "Forgetfulness": 1,
        "DoctorInCharge": "DrC"
    },
    {
        "PatientID": 6903,
        "Age": 70,
        "Gender": 1,
        "Ethnicity": 1,
        "EducationLevel": 2,
        "BMI": 24.5,
        "Smoking": 0,
        "AlcoholConsumption": 3,
        "PhysicalActivity": 6,
        "DietQuality": 3,
        "SleepQuality": 7,
        "FamilyHistoryAlzheimers": 0,
        "CardiovascularDisease": 0,
        "Diabetes": 0,
        "Depression": 0,
        "HeadInjury": 0,
        "Hypertension": 0,
        "SystolicBP": 125,
        "DiastolicBP": 78,
        "CholesterolTotal": 190,
        "CholesterolLDL": 120,
        "CholesterolHDL": 55,
        "CholesterolTriglycerides": 140,
        "MMSE": 27,
        "FunctionalAssessment": 4,
        "MemoryComplaints": 0,
        "BehavioralProblems": 0,
        "ADL": 1,
        "Confusion": 0,
        "Disorientation": 0,
        "PersonalityChanges": 0,
        "DifficultyCompletingTasks": 0,
        "Forgetfulness": 0,
        "DoctorInCharge": "DrD"
    },
    {
        "PatientID": 6904,
        "Age": 75,
        "Gender": 0,
        "Ethnicity": 0,
        "EducationLevel": 1,
        "BMI": 29,
        "Smoking": 1,
        "AlcoholConsumption": 8,
        "PhysicalActivity": 2,
        "DietQuality": 1,
        "SleepQuality": 5,
        "FamilyHistoryAlzheimers": 1,
        "CardiovascularDisease": 1,
        "Diabetes": 1,
        "Depression": 1,
        "HeadInjury": 0,
        "Hypertension": 1,
        "SystolicBP": 145,
        "DiastolicBP": 88,
        "CholesterolTotal": 230,
        "CholesterolLDL": 170,
        "CholesterolHDL": 42,
        "CholesterolTriglycerides": 190,
        "MMSE": 23,
        "FunctionalAssessment": 7,
        "MemoryComplaints": 1,
        "BehavioralProblems": 1,
        "ADL": 2,
        "Confusion": 1,
        "Disorientation": 1,
        "PersonalityChanges": 1,
        "DifficultyCompletingTasks": 1,
        "Forgetfulness": 1,
        "DoctorInCharge": "DrE"
    }
])

data = nuevos_samples.copy()
data['age_mmse_interaction'] = data['Age'] * data['MMSE']
data['cognitive_decline_score'] = data['MMSE'] + data['FunctionalAssessment'] + data['ADL']
data['vascular_risk_score'] = data['Hypertension'] + data['CardiovascularDisease'] + data['Diabetes'] + data['Smoking']
data['cholesterol_ratio'] = data['CholesterolLDL'] / (data['CholesterolHDL'] + 0.01)
data['bp_ratio'] = data['SystolicBP'] / (data['DiastolicBP'] + 0.01)
data['symptom_count'] = (
    data['Confusion'] +
    data['Disorientation'] +
    data['PersonalityChanges'] +
    data['DifficultyCompletingTasks'] +
    data['Forgetfulness'] +
    data['MemoryComplaints'] +
    data['BehavioralProblems']
)
data['lifestyle_score'] = data['PhysicalActivity'] + data['DietQuality'] + data['SleepQuality']
data['age_group'] = pd.cut(data['Age'], bins=[59, 70, 80, 91], labels=[0, 1, 2])

features = data.drop(columns=["PatientID", "DoctorInCharge"])
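The `age_group` feature above relies on `pd.cut`, whose bins are right-inclusive by default: (59, 70], (70, 80], (80, 91]. Any age outside that range is left as `NaN`. The sketch below checks this on a handful of hypothetical ages (these values are illustrative and are not taken from the dataset). Whether the fitted preprocessor tolerates `NaN` in this column is not shown in this notebook, so it is worth checking before transforming new patients.

```python
import pandas as pd

# Hypothetical ages chosen to hit the bin edges (not real patient data).
ages = pd.Series([59, 60, 70, 71, 80, 91, 95], name="Age")

# Same bins and labels as in the feature-engineering cell.
age_group = pd.cut(ages, bins=[59, 70, 80, 91], labels=[0, 1, 2])

check = pd.DataFrame({"Age": ages, "age_group": age_group})
print(check)

# Ages that fall outside (59, 91] receive no group.
n_unassigned = int(age_group.isna().sum())
print("Ages without a group:", n_unassigned)
```

If ages of 59 or below, or above 91, can occur in new data, one option is to widen the outer bin edges or to pass `include_lowest=True`, provided the same change is applied to the training pipeline.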


In [30]:
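# Transform the new patients with the fitted preprocessor
X_new_prep = transformador.transform(data)
y_new = [0, 1, 0, 1]

# Example with the 40% subset. Note: X_new_prep and y_new are not appended here,
# so this training set contains only the subset of the original training data.
X_train_final = np.vstack([X_40])
y_train_final = np.hstack([y_40])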
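The cell above stacks only the 40% subset, so the transformed new patients do not reach the training set. A minimal sketch of how they could be appended is shown below. It uses stand-in arrays with assumed shapes (`X_subset`, `y_subset`, `X_new`, `y_new_labels` are hypothetical names playing the roles of `X_40`, `y_40`, `X_new_prep`, and `y_new`), and it assumes the preprocessor returns dense arrays; for sparse output, `scipy.sparse.vstack` would be the analogous call.

```python
import numpy as np

# Stand-ins for the notebook's arrays (shapes are illustrative assumptions).
X_subset = np.zeros((10, 6))             # plays the role of X_40
y_subset = np.array([0, 1] * 5)          # plays the role of y_40
X_new = np.ones((4, 6))                  # plays the role of X_new_prep
y_new_labels = np.array([0, 1, 0, 1])    # plays the role of y_new

# Both blocks must have the same number of feature columns before stacking.
if X_subset.shape[1] != X_new.shape[1]:
    raise ValueError("Feature count mismatch between old and new samples")

# Append the new rows and labels to the existing subset.
X_combined = np.vstack([X_subset, X_new])
y_combined = np.hstack([y_subset, y_new_labels])

print(X_combined.shape, y_combined.shape)
```

Training on the combined arrays would change the fitted trees, so the metrics reported later in this notebook would need to be recomputed.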


<div id="re" style="color:#37475a; border-bottom: 7px solid orange; width: 100%; margin-bottom: 15px; padding-bottom: 2px"><h2>Re-training</h2> </div>

In [31]:
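from xgboost import XGBClassifier

# n_estimators is the number of boosting rounds added on top of the base model
modelo_retrain = XGBClassifier(
    n_estimators=100,
    learning_rate=0.05,
    max_depth=5,
    eval_metric="logloss",
    random_state=42
)

# xgb_model=modelo continues training from the loaded base model
modelo_retrain.fit(
    X_train_final,
    y_train_final,
    xgb_model=modelo
)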


Output: fitted XGBClassifier (parameter table truncated; displayed values include objective='binary:logistic' and enable_categorical=False)


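The cell above creates a new XGBoost classifier configured to add 100 additional trees (`n_estimators=100`) and continues training from the previously loaded base model by passing it to `fit`:

**xgb_model=modelo**

Because the existing trees are kept and only new ones are appended, this approach:

* preserves what the base model has already learned
* fits the new trees to patterns in the recent data
* avoids the cost of retraining from scratch, which reduces training time
* helps keep model behavior stable, since the earlier trees are not modified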

In [32]:
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)

y_pred = modelo_retrain.predict(X_test)

acc = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="macro", zero_division=0)
recall = recall_score(y_test, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_test, y_pred, average="macro")

print(f"Accuracy : {acc:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1-macro : {f1:.4f}")

Accuracy : 0.9279
Precision: 0.9282
Recall   : 0.9129
F1-macro : 0.9198
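These metrics describe the retrained model only. To judge whether retraining kept or changed performance, the base model needs to be scored on the same test set. The helper below is a sketch of such a comparison; `compare_predictions` is a hypothetical function, and the labels and predictions are toy values rather than results from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score


def compare_predictions(y_true, pred_base, pred_retrained):
    """Return accuracy and macro F1 for both models plus retrained-minus-base deltas."""
    report = {}
    for name, pred in (("base", pred_base), ("retrained", pred_retrained)):
        report[name] = {
            "accuracy": accuracy_score(y_true, pred),
            "f1_macro": f1_score(y_true, pred, average="macro", zero_division=0),
        }
    report["delta"] = {
        key: report["retrained"][key] - report["base"][key]
        for key in ("accuracy", "f1_macro")
    }
    return report


# Toy labels and predictions (illustrative only).
y_true_toy = np.array([0, 0, 1, 1])
pred_base_toy = np.array([0, 1, 1, 1])
pred_retrained_toy = np.array([0, 0, 1, 1])

report = compare_predictions(y_true_toy, pred_base_toy, pred_retrained_toy)
for name, values in report.items():
    print(name, values)
```

In this notebook the inputs would be `y_test`, the base model's predictions on `X_test`, and `y_pred`, assuming the loaded base model exposes a scikit-learn style `predict` that returns class labels.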


<div id="se" style="color:#37475a; border-bottom: 7px solid orange; width: 100%; margin-bottom: 15px; padding-bottom: 2px"><h2>Save new version</h2> </div>

In [33]:
mlflow.set_experiment("Alzheimer_Modelamiento")

with mlflow.start_run(run_name="XGBoost_Retrain_v2"):

    # ===== params and metrics =====
    mlflow.log_param("retrain", True)
    mlflow.log_param("base_model_version", 1)

    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_macro", f1)
    mlflow.log_metric("precision_macro", precision)
    mlflow.log_metric("recall_macro", recall)

    # ===== signature =====
    signature = infer_signature(X_new_prep, modelo_retrain.predict(X_new_prep))

    # ===== save model =====
    mlflow.xgboost.log_model(
        xgb_model=modelo_retrain,
        name="xgb_model",
        registered_model_name="Alzheimer_XGBoost",
        signature=signature,
        input_example=X_new_prep[:5]
    )

print("Retrained model saved as NEW VERSION")



Registered model 'Alzheimer_XGBoost' already exists. Creating a new version of this model...


Retrained model saved as NEW VERSION


Created version '9' of model 'Alzheimer_XGBoost'.


<div id="conclusion" style="color:#37475a; border-bottom: 7px solid orange; width: 100%; margin-bottom: 15px; padding-bottom: 2px"><h2>Conclusion</h2> </div>

Incremental retraining allowed the Alzheimer's risk prediction model to be updated by incorporating new patient records without completely rebuilding the original model. This strategy retains previously learned knowledge and extends it with new information, which is especially useful in clinical settings where data grows progressively.

Post-retraining evaluation on the test set (accuracy 0.9279, macro F1 0.9198) indicates that predictive performance remains high after the update. In addition, the retrained model is versioned and logged in MLflow (version 9 of `Alzheimer_XGBoost`), enabling traceability and change control within the MLOps flow.

Taken together, this procedure demonstrates a best practice for updating models in production, aligned with principles of continuous learning and maintenance of applied artificial intelligence systems.