<div style="color:#3c4d5a; border-top: 7px solid #42A5F5; border-bottom: 7px solid #42A5F5; padding: 5px; text-align: center; text-transform: uppercase"><h1>Transformation and Processing of variables</h1> </div>

This notebook focuses on the preprocessing and transformation of variables in a dataset on Alzheimer's disease as a necessary step prior to predictive modeling. Variable processing is performed to ensure that the dataset is correctly structured and suitable for machine learning algorithms.

The workflow covers the preparation of numerical and categorical features, the construction of transformation objects, and the division of the data into training and test sets, providing a solid foundation for subsequent model development and evaluation.


- [Dataset Visualization (Dimensions)](#rela)
- [Data cleaning](#clean)
- [Correlation Matrix](#matrix)
- [Variable Processing](#proceso)
- [Definition of Columns](#definicion)
- [Split Train](#split)
- [Transforming Object](#obj)
- [Results](#results)
- [Conclusion](#conclusion)
- [References](#references)

<div style="color:#37475a"><h2>Imported modules</h2> </div>

---



In [2]:
import kagglehub
import os
from kagglehub import dataset_download
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from google.colab import files

print("Imported modules")

Imported modules


<div id="rela" style="color:#37475a; border-bottom: 7px solid orange; width: 100%; margin-bottom: 15px; padding-bottom: 2px"><h2>Dataset Visualization (Dimensions)</h2> </div>

In [3]:
path = dataset_download("rabieelkharoua/alzheimers-disease-dataset")

print("Dataset downloaded to:", path)

data = pd.read_csv(os.path.join(path, "alzheimers_disease_data.csv"))

print("Initial dimensions:", data.shape)
data.head()

Using Colab cache for faster access to the 'alzheimers-disease-dataset' dataset.
Dataset downloaded to: /kaggle/input/alzheimers-disease-dataset
Initial dimensions: (2149, 35)


| | PatientID | Age | Gender | Ethnicity | EducationLevel | BMI | Smoking | AlcoholConsumption | PhysicalActivity | DietQuality | ... | MemoryComplaints | BehavioralProblems | ADL | Confusion | Disorientation | PersonalityChanges | DifficultyCompletingTasks | Forgetfulness | Diagnosis | DoctorInCharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4751 | 73 | 0 | 0 | 2 | 22.927749 | 0 | 13.297218 | 6.327112 | 1.347214 | ... | 0 | 0 | 1.725883 | 0 | 0 | 0 | 1 | 0 | 0 | XXXConfid |
| 1 | 4752 | 89 | 0 | 0 | 0 | 26.827681 | 0 | 4.542524 | 7.619885 | 0.518767 | ... | 0 | 0 | 2.592424 | 0 | 0 | 0 | 0 | 1 | 0 | XXXConfid |
| 2 | 4753 | 73 | 0 | 3 | 1 | 17.795882 | 0 | 19.555085 | 7.844988 | 1.826335 | ... | 0 | 0 | 7.119548 | 0 | 1 | 0 | 1 | 0 | 0 | XXXConfid |
| 3 | 4754 | 74 | 1 | 0 | 1 | 33.800817 | 1 | 12.209266 | 8.428001 | 7.435604 | ... | 0 | 1 | 6.481226 | 0 | 0 | 0 | 0 | 0 | 0 | XXXConfid |
| 4 | 4755 | 89 | 0 | 0 | 0 | 20.716974 | 0 | 18.454356 | 6.310461 | 0.795498 | ... | 0 | 0 | 0.014691 | 0 | 0 | 1 | 1 | 0 | 0 | XXXConfid |


<div id="clean" style="color:#37475a; border-bottom: 7px solid orange; width: 100%; margin-bottom: 15px; padding-bottom: 2px"><h2>Data cleaning</h2> </div>


In [4]:
# This dataset has no null values, but the Phase 1 data-cleaning procedure is applied anyway
data = data.drop_duplicates()

# Check for null values
print(data.isnull().sum())

# Basic fill (adjust if a different strategy is needed).
# ffill()/bfill() replace the deprecated fillna(method=...) form.
data = data.ffill().bfill()


PatientID                    0
Age                          0
Gender                       0
Ethnicity                    0
EducationLevel               0
BMI                          0
Smoking                      0
AlcoholConsumption           0
PhysicalActivity             0
DietQuality                  0
SleepQuality                 0
FamilyHistoryAlzheimers      0
CardiovascularDisease        0
Diabetes                     0
Depression                   0
HeadInjury                   0
Hypertension                 0
SystolicBP                   0
DiastolicBP                  0
CholesterolTotal             0
CholesterolLDL               0
CholesterolHDL               0
CholesterolTriglycerides     0
MMSE                         0
FunctionalAssessment         0
MemoryComplaints             0
BehavioralProblems           0
ADL                          0
Confusion                    0
Disorientation               0
PersonalityChanges           0
DifficultyCompletingTasks    0
Forgetfu



<div id="proceso" style="color:#37475a; border-bottom: 7px solid orange; width: 100%; margin-bottom: 15px; padding-bottom: 2px"><h2>Variable Processing</h2> </div>

<div style="color:#37475a"><h2>Remove Variables</h2> </div>

---


The PatientID and DoctorInCharge variables do not provide useful predictive information for the model, so they will be removed from the dataset before the training phase.

In [5]:
data = data.drop(columns=["PatientID", "DoctorInCharge"]).copy()

<div style="color:#37475a"><h2>New Variables</h2> </div>

---



In [6]:
data['age_mmse_interaction'] = data['Age'] * data['MMSE']
data['cognitive_decline_score'] = data['MMSE'] + data['FunctionalAssessment'] + data['ADL']
data['vascular_risk_score'] = (
    data['Hypertension'] +
    data['CardiovascularDisease'] +
    data['Diabetes'] +
    data['Smoking']
)
data['cholesterol_ratio'] = data['CholesterolLDL'] / (data['CholesterolHDL'] + 0.01)
data['bp_ratio'] = data['SystolicBP'] / (data['DiastolicBP'] + 0.01)

data['symptom_count'] = (
    data['Confusion'] +
    data['Disorientation'] +
    data['PersonalityChanges'] +
    data['DifficultyCompletingTasks'] +
    data['Forgetfulness'] +
    data['MemoryComplaints'] +
    data['BehavioralProblems']
)

data['lifestyle_score'] = (
    data['PhysicalActivity'] +
    data['DietQuality'] +
    data['SleepQuality']
)

data['age_group'] = pd.cut(data['Age'], bins=[59, 70, 80, 91], labels=[0, 1, 2])

In order to enrich the dataset and capture complex relationships among the original variables, new derived features were created that may be relevant for predictive modeling:

**age_mmse_interaction:** interaction term (the product of age and the MMSE cognitive score), intended to capture how the effect of the cognitive score may vary with age.

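**cognitive_decline_score:** sum of MMSE, FunctionalAssessment, and ADL, used as an overall cognitive and functional score. Despite its name, lower values indicate greater decline, since lower scores on each component indicate greater impairment.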

**vascular_risk_score:** combination of hypertension, cardiovascular disease, diabetes, and smoking habit, reflecting the patient’s vascular risk.

**cholesterol_ratio:** ratio between LDL and HDL, used as an indicator of lipid-related cardiovascular risk.

**bp_ratio:** ratio between systolic and diastolic blood pressure, used to assess hemodynamic risk.

**symptom_count:** sum of the binary symptom indicators (confusion, disorientation, personality changes, difficulty completing tasks, forgetfulness, memory complaints, and behavioral problems), representing the total burden of cognitive and behavioral symptoms.

**lifestyle_score:** combination of physical activity, diet quality, and sleep quality, reflecting an indicator of healthy lifestyle habits.

**age_group:** categorization of age into groups (60–70, 71–80, 81–91) to facilitate segmented analyses by age range.

These derived variables allow the capture of interactions and complex patterns that may not be evident in individual variables, potentially improving the predictive performance of the models.
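The age binning above depends on how `pd.cut` treats interval edges, so a small check may help. The sketch below uses made-up ages (not taken from the dataset) with the same bins and labels as the notebook cell; the names `toy` and `outside` are illustrative only.

```python
import pandas as pd

# Toy ages (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({"Age": [60, 70, 71, 80, 81, 90]})

# Same bins and labels as the notebook cell. By default pd.cut uses
# right-closed intervals: (59, 70], (70, 80], (80, 91]
toy["age_group"] = pd.cut(toy["Age"], bins=[59, 70, 80, 91], labels=[0, 1, 2])
print(toy)

# Ages outside (59, 91] fall in no bin and become NaN
outside = pd.cut(pd.Series([59, 95]), bins=[59, 70, 80, 91], labels=[0, 1, 2])
print(outside.isna().tolist())
```

Because out-of-range ages become missing values, new data with ages below 60 or above 91 would need either wider bins or an explicit handling rule before being passed to the encoder.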

<div id="definicion" style="color:#37475a; border-bottom: 7px solid orange; width: 100%; margin-bottom: 15px; padding-bottom: 2px"><h2>Definition of Columns</h2> </div>



In [7]:

target = "Diagnosis"

categoricas = ["Gender", "Ethnicity", "EducationLevel", "age_group"]

numericas = [
    "Age", "BMI", "AlcoholConsumption", "PhysicalActivity",
    "DietQuality", "SleepQuality",
    "SystolicBP", "DiastolicBP", "CholesterolTotal",
    "CholesterolLDL", "CholesterolHDL", "CholesterolTriglycerides",
    "MMSE", "FunctionalAssessment", "ADL",
    "age_mmse_interaction", "cognitive_decline_score", "vascular_risk_score",
    "cholesterol_ratio", "bp_ratio", "symptom_count", "lifestyle_score"
]

binarias = [
    "Smoking", "FamilyHistoryAlzheimers", "CardiovascularDisease",
    "Diabetes", "Depression", "HeadInjury", "Hypertension",
    "MemoryComplaints", "BehavioralProblems",
    "Confusion", "Disorientation", "PersonalityChanges",
    "DifficultyCompletingTasks", "Forgetfulness"
]

<div id="split" style="color:#37475a; border-bottom: 7px solid orange; width: 100%; margin-bottom: 15px; padding-bottom: 2px"><h2>Split Train</h2> </div>

In [8]:
X = data.drop(columns=[target])
y = data[target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

The dataset was split into training and test sets using train_test_split. Eighty percent of the data was allocated to training (X_train, y_train) and the remaining 20% was reserved for evaluation (X_test, y_test), with random_state=42 fixing the shuffle so that the split is reproducible. The stratify=y parameter keeps the proportion of patients with and without Alzheimer’s disease approximately the same in both sets, so the test set reflects the original distribution of the target variable.
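The effect of stratification can be checked on synthetic data. The following is a minimal sketch with an invented, imbalanced binary target (the 65/35 class balance and the names `X_toy` and `y_toy` are assumptions for illustration, not properties of the Alzheimer's dataset).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary target (hypothetical: 650 zeros, 350 ones)
y_toy = np.array([0] * 650 + [1] * 350)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Same split settings as the notebook cell
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Sizes of each partition and the share of positive cases in each
print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```

With stratification, the share of positive cases in both partitions should stay close to the overall 0.35; without it, the two shares can drift apart, especially on small datasets.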


<div id="obj" style="color:#37475a; border-bottom: 7px solid orange; width: 100%; margin-bottom: 15px; padding-bottom: 2px"><h2>Transforming Object</h2> </div>


In [9]:
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore", drop='first'), categoricas),
        ("num", StandardScaler(), numericas),
        ("bin", "passthrough", binarias)
    ],
    remainder='drop'
)

# Fit on the training set only, then apply the fitted transformer to the test set
X_train_prep = preprocessor.fit_transform(X_train)
X_test_prep = preprocessor.transform(X_test)

A transformer object (preprocessor) was created with ColumnTransformer to preprocess the different types of variables automatically before training the models. In this transformer:

Categorical variables are encoded with One-Hot Encoding, dropping the first category to avoid multicollinearity.

Numerical variables are standardized with StandardScaler so that they have zero mean and unit standard deviation, which improves the performance of many machine learning algorithms.

Binary variables are kept as they are via passthrough.

Then fit_transform is applied to the training data to fit and transform it, and transform is applied to the test data so that the same fitted transformations are used, ensuring consistency between model training and evaluation.

<div style="color:#37475a"><h2>Save File</h2> </div>

---



In [10]:
with open("preprocessor.pkl", "wb") as f:
    pickle.dump(preprocessor, f)

# Transformed dataset for use in phase 2
dataset_transformado = {
    "X_train_prep": X_train_prep,
    "X_test_prep": X_test_prep,
    "y_train": y_train,
    "y_test": y_test
}

with open("dataset_transformado.pkl", "wb") as f:
    pickle.dump(dataset_transformado, f)

print("\n Files saved:")
print("   preprocessor.pkl")
print("   dataset_transformado.pkl")


 Files saved:
   preprocessor.pkl
   dataset_transformado.pkl


<div style="color:#37475a"><h2>Download</h2> </div>

---


In [None]:
files.download("preprocessor.pkl")
files.download("dataset_transformado.pkl")



<div id="results" style="color:#37475a; border-bottom: 7px solid orange; width: 100%; margin-bottom: 15px; padding-bottom: 2px"><h2>Results</h2> </div>


The original dataset contained 2,149 complete records and 35 variables with no missing values, so no additional data cleaning was required. The variables PatientID and DoctorInCharge were removed, as they did not provide predictive information. In addition, derived features were created to capture relevant interactions and combinations, such as age_mmse_interaction, cognitive_decline_score, vascular_risk_score, cholesterol_ratio, bp_ratio, symptom_count, lifestyle_score, and age_group.

The dataset was split into training (80%) and test (20%) sets, preserving the proportion of patients with and without Alzheimer’s disease using stratify. For preprocessing, a ColumnTransformer was then built to one-hot encode categorical variables, standardize numerical variables, and keep binary variables unchanged; it was fit on the training set with fit_transform and applied to the test set with transform.

Additionally, the preprocessor and the transformed dataset were saved and downloaded so they can be reused later: the transformed dataset will be used in phase 2 to train and evaluate the models, while the preprocessor will be used in phase 3 to process new data and generate predictions consistent with the same transformation scheme.
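The reuse described above relies on a pickled transformer keeping the statistics it learned at fit time. The sketch below illustrates this with a stand-in object: a plain StandardScaler fit on made-up numbers and round-tripped in memory, rather than the actual preprocessor.pkl file. The names `fitted`, `restored`, and `new_rows` are hypothetical.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the fitted preprocessor (toy data, not the real dataset)
fitted = StandardScaler().fit(np.array([[10.0], [20.0], [30.0]]))

# Serialize and restore in memory; the notebook writes to preprocessor.pkl instead
restored = pickle.loads(pickle.dumps(fitted))

# New, unseen rows are transformed with the statistics learned at fit time
new_rows = np.array([[20.0], [40.0]])
same = np.allclose(fitted.transform(new_rows), restored.transform(new_rows))
print(same)
```

In a later phase, the restored object should only be used with `transform`, never refit on new data, and it should be loaded with a scikit-learn version compatible with the one that created it. Pickle files should also only be loaded from trusted sources.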

<div id="conclusion" style="color:#37475a; border-bottom: 7px solid orange; width: 100%; margin-bottom: 15px; padding-bottom: 2px"><h2>Conclusion</h2> </div>

The dataset is clean, consistent, and enriched with relevant derived variables. With the applied preprocessing and the appropriate split into training and test sets, the data are ready to train robust predictive models, ensuring that meaningful information is leveraged and that irrelevant variables do not affect performance.