___
<img style="float: right; margin: 15px 15px 15px 15px;" src="https://img.freepik.com/free-vector/depression-concept-illustration_114360-3747.jpg?t=st=1657678284~exp=1657678884~hmac=b8b1d71ca0a8eb2e4ff5bf31d6a98624112f1a2254b0f39e92254ed12d7875b2" width="240px" height="180px" />

# <font color= #bbc28d> **Depression Classification - Data Cleaning** </font>
#### <font color= #2E9AFE> `Data Science Project`</font>
- <Strong> Sofía Maldonado, Diana Valdivia, Samantha Sánchez & Vivienne Toledo </Strong>
- <Strong> Date </Strong>: 30/09/2025.

___

<p style="text-align:right;"> Image retrieved from: https://img.freepik.com/free-vector/depression-concept-illustration_114360-3747.jpg?t=st=1657678284~exp=1657678884~hmac=b8b1d71ca0a8eb2e4ff5bf31d6a98624112f1a2254b0f39e92254ed12d7875b2</p>

## <font color= #bbc28d>• **Data Loading** </font>

In [2]:
# Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import mlflow

In [2]:
# Data import
df = pd.read_csv("../data/raw/depression_dataset.csv")
df

Unnamed: 0,id,Gender,Age,City,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,2,Male,33.0,Visakhapatnam,Student,5.0,0.0,8.97,2.0,0.0,5-6 hours,Healthy,B.Pharm,Yes,3.0,1.0,No,1
1,8,Female,24.0,Bangalore,Student,2.0,0.0,5.90,5.0,0.0,5-6 hours,Moderate,BSc,No,3.0,2.0,Yes,0
2,26,Male,31.0,Srinagar,Student,3.0,0.0,7.03,5.0,0.0,Less than 5 hours,Healthy,BA,No,9.0,1.0,Yes,0
3,30,Female,28.0,Varanasi,Student,3.0,0.0,5.59,2.0,0.0,7-8 hours,Moderate,BCA,Yes,4.0,5.0,Yes,1
4,32,Female,25.0,Jaipur,Student,4.0,0.0,8.13,3.0,0.0,5-6 hours,Moderate,M.Tech,Yes,1.0,1.0,No,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27896,140685,Female,27.0,Surat,Student,5.0,0.0,5.75,5.0,0.0,5-6 hours,Unhealthy,Class 12,Yes,7.0,1.0,Yes,0
27897,140686,Male,27.0,Ludhiana,Student,2.0,0.0,9.40,3.0,0.0,Less than 5 hours,Healthy,MSc,No,0.0,3.0,Yes,0
27898,140689,Male,31.0,Faridabad,Student,3.0,0.0,6.61,4.0,0.0,5-6 hours,Unhealthy,MD,No,12.0,2.0,No,0
27899,140690,Female,18.0,Ludhiana,Student,5.0,0.0,6.88,2.0,0.0,Less than 5 hours,Healthy,Class 12,Yes,10.0,5.0,No,1


## <font color= #bbc28d>• **Handling Missing Values** </font>
As we saw in the EDA, only the **Financial Stress** column has missing values (3 nulls). Since that is a very small number of rows, we drop them entirely:

In [3]:
# Drop rows with missing values
df.dropna(inplace=True)

# Confirm that no missing values remain
df.isnull().sum()

id                                       0
Gender                                   0
Age                                      0
City                                     0
Profession                               0
Academic Pressure                        0
Work Pressure                            0
CGPA                                     0
Study Satisfaction                       0
Job Satisfaction                         0
Sleep Duration                           0
Dietary Habits                           0
Degree                                   0
Have you ever had suicidal thoughts ?    0
Work/Study Hours                         0
Financial Stress                         0
Family History of Mental Illness         0
Depression                               0
dtype: int64
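
Before dropping rows, it can help to quantify how many rows each column would cost. The following is a minimal sketch on a toy frame (the helper name `missing_report` and the toy values are assumptions for illustration, not part of the project):

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and the percentage of rows affected, per column."""
    nulls = frame.isnull().sum()
    report = pd.DataFrame({
        "n_missing": nulls,
        "pct_missing": 100 * nulls / len(frame),
    })
    # Keep only the columns that actually contain nulls
    return report[report["n_missing"] > 0]


# Hypothetical toy data with a single missing value
toy = pd.DataFrame({
    "Age": [33.0, 24.0, 31.0, 28.0],
    "Financial Stress": [1.0, np.nan, 2.0, 5.0],
})

report = missing_report(toy)
rows_lost = len(toy) - len(toy.dropna())
print(report)
print(f"Rows removed by dropna(): {rows_lost}")
```

If the share of affected rows were large, imputation might be preferable to dropping; for a handful of rows, dropping is usually harmless.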

## <font color= #bbc28d>• **Data Filtering - Categorical** </font>
As we saw in the EDA, even with no missing values left, several columns, such as **City**, **Dietary Habits**, **Sleep Duration** & **Profession**, contain values that appear in only one or two rows. Our model is also focused on students. We therefore keep these variables but remove the rows whose values are too rare to contribute useful information:

In [4]:
# Check which cities have very few records
df['City'].value_counts()

City
Kalyan                1570
Srinagar              1372
Hyderabad             1339
Vasai-Virar           1290
Lucknow               1155
Thane                 1139
Ludhiana              1111
Agra                  1094
Surat                 1078
Kolkata               1065
Jaipur                1036
Patna                 1007
Visakhapatnam          969
Pune                   968
Ahmedabad              951
Bhopal                 934
Chennai                885
Meerut                 825
Rajkot                 816
Delhi                  768
Bangalore              767
Ghaziabad              745
Mumbai                 699
Vadodara               694
Varanasi               684
Nagpur                 651
Indore                 643
Kanpur                 609
Nashik                 547
Faridabad              461
Saanvi                   2
Bhavna                   2
City                     2
Harsha                   2
Less Delhi               1
M.Tech                   1
3.0                    

We can see that after **Faridabad** the number of records per city drops to only 2 or 1, and several of those values (for example **M.Tech** or **3.0**) are clearly not city names, so these rows must be removed.

In [5]:
# Drop rows whose city has fewer than 450 records
# Select the rare cities (compute the counts once and reuse them)
city_counts = df['City'].value_counts()
ciudades = city_counts[city_counts < 450]

# Filter the df
df = df[~df['City'].isin(ciudades.index)]

# Verify
df['City'].value_counts()

City
Kalyan           1570
Srinagar         1372
Hyderabad        1339
Vasai-Virar      1290
Lucknow          1155
Thane            1139
Ludhiana         1111
Agra             1094
Surat            1078
Kolkata          1065
Jaipur           1036
Patna            1007
Visakhapatnam     969
Pune              968
Ahmedabad         951
Bhopal            934
Chennai           885
Meerut            825
Rajkot            816
Delhi             768
Bangalore         767
Ghaziabad         745
Mumbai            699
Vadodara          694
Varanasi          684
Nagpur            651
Indore            643
Kanpur            609
Nashik            547
Faridabad         461
Name: count, dtype: int64
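
The same rare-category filter applies to several columns, so it can be wrapped in a small helper. This is a sketch under assumptions: the function name `drop_rare_categories` and the toy data are hypothetical, and the threshold is passed in rather than fixed at 450.

```python
import pandas as pd


def drop_rare_categories(frame: pd.DataFrame, column: str, min_count: int) -> pd.DataFrame:
    """Keep only rows whose value in `column` occurs at least `min_count` times."""
    counts = frame[column].value_counts()
    keep = counts[counts >= min_count].index
    # .copy() avoids chained-assignment warnings on later column edits
    return frame[frame[column].isin(keep)].copy()


# Hypothetical toy data: two frequent cities and two junk entries
toy = pd.DataFrame({"City": ["Kalyan"] * 3 + ["Srinagar"] * 3 + ["Saanvi", "3.0"]})
filtered = drop_rare_categories(toy, "City", min_count=2)
print(filtered["City"].value_counts())
```

Returning a copy keeps the helper safe to chain with the in-place column assignments used later in the notebook.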

In [6]:
# Check which dietary habits have very few records
df['Dietary Habits'].value_counts()

Dietary Habits
Unhealthy    10303
Moderate      9915
Healthy       7642
Others          12
Name: count, dtype: int64

We can see that **Others** has only **12** records, so we remove them as well, since they contribute little information.

In [7]:
# Filter the data
df = df[df['Dietary Habits'] != 'Others']
df['Dietary Habits'].value_counts()

Dietary Habits
Unhealthy    10303
Moderate      9915
Healthy       7642
Name: count, dtype: int64

In [8]:
# Check which sleep durations have very few records
df['Sleep Duration'].value_counts()

Sleep Duration
Less than 5 hours    8297
7-8 hours            7333
5-6 hours            6172
More than 8 hours    6040
Others                 18
Name: count, dtype: int64

We can see that **Others** has only **18** records, so we remove them as well, since they contribute little information.

In [9]:
# Filter the data
df = df[df['Sleep Duration'] != 'Others']
df['Sleep Duration'].value_counts()

Sleep Duration
Less than 5 hours    8297
7-8 hours            7333
5-6 hours            6172
More than 8 hours    6040
Name: count, dtype: int64

In [10]:
# Check which degrees have very few records
df['Degree'].value_counts()

Degree
Class 12    6074
B.Ed        1861
B.Com       1503
B.Arch      1476
BCA         1429
MSc         1187
B.Tech      1151
MCA         1041
M.Tech      1019
BHM          924
BSc          886
M.Ed         817
B.Pharm      809
M.Com        734
BBA          696
MBBS         694
LLB          669
BE           609
BA           595
M.Pharm      581
MD           571
MBA          560
MA           544
PhD          520
LLM          481
MHM          191
ME           185
Others        35
Name: count, dtype: int64

We can see that **Others** has only **35** records, so we can remove it.

In [11]:
# Filter the data
df = df[df['Degree'] != 'Others']
df['Degree'].value_counts()

Degree
Class 12    6074
B.Ed        1861
B.Com       1503
B.Arch      1476
BCA         1429
MSc         1187
B.Tech      1151
MCA         1041
M.Tech      1019
BHM          924
BSc          886
M.Ed         817
B.Pharm      809
M.Com        734
BBA          696
MBBS         694
LLB          669
BE           609
BA           595
M.Pharm      581
MD           571
MBA          560
MA           544
PhD          520
LLM          481
MHM          191
ME           185
Name: count, dtype: int64

For the **Profession** column we apply no filtering: keeping only **Student** would leave a single category, so it is better to drop the column entirely, which we do later in its own section.

## <font color= #bbc28d>• **Data Filtering - Numerical** </font>
As we also saw in the EDA, the **Age** column is strongly right-skewed: most records fall between 18 and 35, with outliers up to 50 years. These could be genuine, but we choose to remove them.

There is also a slight skew in **Academic Pressure** & **Study Satisfaction**: the values concentrate in the 1 to 5 range, with only a few samples at 0, so we remove those rows as well.

In [12]:
# Filter age
df = df[df['Age'] <= 35]

# Filter academic pressure
df = df[df['Academic Pressure'] > 0]

# Filter study satisfaction
df = df[df['Study Satisfaction'] > 0]

# Show the df
df

Unnamed: 0,id,Gender,Age,City,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,2,Male,33.0,Visakhapatnam,Student,5.0,0.0,8.97,2.0,0.0,5-6 hours,Healthy,B.Pharm,Yes,3.0,1.0,No,1
1,8,Female,24.0,Bangalore,Student,2.0,0.0,5.90,5.0,0.0,5-6 hours,Moderate,BSc,No,3.0,2.0,Yes,0
2,26,Male,31.0,Srinagar,Student,3.0,0.0,7.03,5.0,0.0,Less than 5 hours,Healthy,BA,No,9.0,1.0,Yes,0
3,30,Female,28.0,Varanasi,Student,3.0,0.0,5.59,2.0,0.0,7-8 hours,Moderate,BCA,Yes,4.0,5.0,Yes,1
4,32,Female,25.0,Jaipur,Student,4.0,0.0,8.13,3.0,0.0,5-6 hours,Moderate,M.Tech,Yes,1.0,1.0,No,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27896,140685,Female,27.0,Surat,Student,5.0,0.0,5.75,5.0,0.0,5-6 hours,Unhealthy,Class 12,Yes,7.0,1.0,Yes,0
27897,140686,Male,27.0,Ludhiana,Student,2.0,0.0,9.40,3.0,0.0,Less than 5 hours,Healthy,MSc,No,0.0,3.0,Yes,0
27898,140689,Male,31.0,Faridabad,Student,3.0,0.0,6.61,4.0,0.0,5-6 hours,Unhealthy,MD,No,12.0,2.0,No,0
27899,140690,Female,18.0,Ludhiana,Student,5.0,0.0,6.88,2.0,0.0,Less than 5 hours,Healthy,Class 12,Yes,10.0,5.0,No,1


As we can see, after filtering we retain almost the original number of rows.
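
To put a number on how many rows each filter removes, the filters can be applied one at a time while logging the row counts. This is a minimal sketch on toy data; the helper `apply_filters` and the toy values are assumptions, while the three conditions mirror the ones used above.

```python
import pandas as pd


def apply_filters(frame: pd.DataFrame, filters: dict):
    """Apply boolean-mask filters in order and record the row count after each one."""
    log = []
    for name, mask_fn in filters.items():
        before = len(frame)
        frame = frame[mask_fn(frame)]
        log.append((name, before, len(frame)))
    summary = pd.DataFrame(log, columns=["filter", "rows_before", "rows_after"])
    return frame, summary


# Hypothetical toy data
toy = pd.DataFrame({
    "Age": [20, 30, 40, 25, 50],
    "Academic Pressure": [3, 0, 4, 5, 2],
    "Study Satisfaction": [2, 3, 4, 0, 1],
})

filters = {
    "Age <= 35": lambda d: d["Age"] <= 35,
    "Academic Pressure > 0": lambda d: d["Academic Pressure"] > 0,
    "Study Satisfaction > 0": lambda d: d["Study Satisfaction"] > 0,
}

filtered, summary = apply_filters(toy, filters)
print(summary)
```

On the real dataset, the `rows_after` column of the summary would show how much each step contributes to the total loss.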

## <font color= #bbc28d>• **Dropping Variables** </font>
The EDA also showed that **Work Pressure** and **Job Satisfaction** are poor predictors. In both columns, 96% of the values [or more] **are concentrated at 0**, either because most students in this dataset did not work (and therefore had no work pressure) or because they were uniformly dissatisfied with their jobs. Either way, these columns contribute essentially no information to the models. We also drop **Profession**, for the reason given in the categorical filtering section, and **id**, which is only a record identifier:

In [13]:
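# Drop the columns (reassign instead of mutating the filtered frame in place;
# `columns=` already implies the axis, so `axis=1` is not needed)
df = df.drop(columns=['Work Pressure', 'Profession', 'Job Satisfaction', 'id'])
df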

Unnamed: 0,Gender,Age,City,Academic Pressure,CGPA,Study Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,Male,33.0,Visakhapatnam,5.0,8.97,2.0,5-6 hours,Healthy,B.Pharm,Yes,3.0,1.0,No,1
1,Female,24.0,Bangalore,2.0,5.90,5.0,5-6 hours,Moderate,BSc,No,3.0,2.0,Yes,0
2,Male,31.0,Srinagar,3.0,7.03,5.0,Less than 5 hours,Healthy,BA,No,9.0,1.0,Yes,0
3,Female,28.0,Varanasi,3.0,5.59,2.0,7-8 hours,Moderate,BCA,Yes,4.0,5.0,Yes,1
4,Female,25.0,Jaipur,4.0,8.13,3.0,5-6 hours,Moderate,M.Tech,Yes,1.0,1.0,No,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27896,Female,27.0,Surat,5.0,5.75,5.0,5-6 hours,Unhealthy,Class 12,Yes,7.0,1.0,Yes,0
27897,Male,27.0,Ludhiana,2.0,9.40,3.0,Less than 5 hours,Healthy,MSc,No,0.0,3.0,Yes,0
27898,Male,31.0,Faridabad,3.0,6.61,4.0,5-6 hours,Unhealthy,MD,No,12.0,2.0,No,0
27899,Female,18.0,Ludhiana,5.0,6.88,2.0,Less than 5 hours,Healthy,Class 12,Yes,10.0,5.0,No,1


## <font color= #bbc28d>• **Encoding Variables - Binary** </font>
For binary categories, One-Hot encoding would produce two dummy columns that are perfectly correlated with each other. Dropping one of the two dummies is equivalent to mapping the values to 0 and 1 inside the existing column, which is what we do here. The variables encoded this way are **Gender**, **Have you ever had suicidal thoughts ?** and **Family History of Mental Illness:**

In [14]:
# Define the value mapping for gender
gender = {'Male' : 0, 'Female' : 1}

# The other two columns share the same values, so one mapping covers both
general = {'Yes' : 1, 'No' : 0}

# Apply the mappings
df['Gender'] = df['Gender'].map(gender)
df['Have you ever had suicidal thoughts ?'] = df['Have you ever had suicidal thoughts ?'].map(general)
df['Family History of Mental Illness'] = df['Family History of Mental Illness'].map(general)

df

Unnamed: 0,Gender,Age,City,Academic Pressure,CGPA,Study Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,0,33.0,Visakhapatnam,5.0,8.97,2.0,5-6 hours,Healthy,B.Pharm,1,3.0,1.0,0,1
1,1,24.0,Bangalore,2.0,5.90,5.0,5-6 hours,Moderate,BSc,0,3.0,2.0,1,0
2,0,31.0,Srinagar,3.0,7.03,5.0,Less than 5 hours,Healthy,BA,0,9.0,1.0,1,0
3,1,28.0,Varanasi,3.0,5.59,2.0,7-8 hours,Moderate,BCA,1,4.0,5.0,1,1
4,1,25.0,Jaipur,4.0,8.13,3.0,5-6 hours,Moderate,M.Tech,1,1.0,1.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27896,1,27.0,Surat,5.0,5.75,5.0,5-6 hours,Unhealthy,Class 12,1,7.0,1.0,1,0
27897,0,27.0,Ludhiana,2.0,9.40,3.0,Less than 5 hours,Healthy,MSc,0,0.0,3.0,1,0
27898,0,31.0,Faridabad,3.0,6.61,4.0,5-6 hours,Unhealthy,MD,0,12.0,2.0,0,0
27899,1,18.0,Ludhiana,5.0,6.88,2.0,Less than 5 hours,Healthy,Class 12,1,10.0,5.0,0,1
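
The equivalence between a 0/1 mapping and One-Hot encoding with one dummy dropped can be checked on a toy series. This sketch uses `pd.get_dummies` with `drop_first=True`; the toy values are assumptions.

```python
import pandas as pd

# Hypothetical toy answers for one of the Yes/No columns
answers = pd.Series(["Yes", "No", "No", "Yes"], name="Family History of Mental Illness")

# Manual 0/1 mapping, as in the cell above
mapped = answers.map({"Yes": 1, "No": 0})

# One-Hot with the first (alphabetical) level dropped: "No" is removed, "Yes" remains
dummies = pd.get_dummies(answers, drop_first=True)

same = dummies["Yes"].astype(int).tolist() == mapped.tolist()
print(dummies.columns.tolist(), same)
```

Which level becomes the baseline depends on the alphabetical order of the labels, so for **Gender** the dropped dummy would be `Female` rather than `Male`; the manual mapping makes that choice explicit.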


## <font color= #bbc28d>• **Encoding Variables - Multi-class** </font>
Since these columns have only a few distinct values, we can also encode them manually, using a hierarchical order where one applies. Where there is no natural order, we use One-Hot encoding instead:

For the Degree column, which has many possible values, we first reduce the categories to just 4: Secondary, Undergraduate, Postgraduate & Doctorate.

In [15]:
# Define the degree grouping
degree = {
    "Class 12": "Secondary",
    "B.Pharm": "Undergraduate", "BSc": "Undergraduate", "BA": "Undergraduate", "BCA": "Undergraduate",
    "B.Ed": "Undergraduate", "LLB": "Undergraduate", "BE": "Undergraduate", "BHM": "Undergraduate",
    "B.Com": "Undergraduate", "B.Arch": "Undergraduate", "B.Tech": "Undergraduate", "BBA": "Undergraduate",
    "M.Tech": "Postgraduate", "M.Ed": "Postgraduate", "MSc": "Postgraduate", "M.Pharm": "Postgraduate",
    "MCA": "Postgraduate", "MA": "Postgraduate", "MBA": "Postgraduate", "M.Com": "Postgraduate", "MHM": "Postgraduate",
    # "ME" appears in the value counts above; without this entry those rows would map to NaN
    "ME": "Postgraduate",
    "PhD": "Doctorate", "MD": "Doctorate", "MBBS": "Doctorate", "LLM": "Doctorate"
}

# Apply the mapping
df['Degree'] = df['Degree'].map(degree)
df

Unnamed: 0,Gender,Age,City,Academic Pressure,CGPA,Study Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,0,33.0,Visakhapatnam,5.0,8.97,2.0,5-6 hours,Healthy,Undergraduate,1,3.0,1.0,0,1
1,1,24.0,Bangalore,2.0,5.90,5.0,5-6 hours,Moderate,Undergraduate,0,3.0,2.0,1,0
2,0,31.0,Srinagar,3.0,7.03,5.0,Less than 5 hours,Healthy,Undergraduate,0,9.0,1.0,1,0
3,1,28.0,Varanasi,3.0,5.59,2.0,7-8 hours,Moderate,Undergraduate,1,4.0,5.0,1,1
4,1,25.0,Jaipur,4.0,8.13,3.0,5-6 hours,Moderate,Postgraduate,1,1.0,1.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27896,1,27.0,Surat,5.0,5.75,5.0,5-6 hours,Unhealthy,Secondary,1,7.0,1.0,1,0
27897,0,27.0,Ludhiana,2.0,9.40,3.0,Less than 5 hours,Healthy,Postgraduate,0,0.0,3.0,1,0
27898,0,31.0,Faridabad,3.0,6.61,4.0,5-6 hours,Unhealthy,Doctorate,0,12.0,2.0,0,0
27899,1,18.0,Ludhiana,5.0,6.88,2.0,Less than 5 hours,Healthy,Secondary,1,10.0,5.0,0,1
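
`Series.map` with a dictionary silently returns `NaN` for any label that is missing from the dictionary, so a stricter wrapper can catch incomplete mappings early. This is a sketch; the helper name `strict_map` and the toy data are assumptions.

```python
import pandas as pd


def strict_map(series: pd.Series, mapping: dict) -> pd.Series:
    """Map values through `mapping`, raising if any non-null value has no entry."""
    missing = sorted(set(series.dropna().unique()) - set(mapping))
    if missing:
        raise KeyError(f"Unmapped values in '{series.name}': {missing}")
    return series.map(mapping)


# Hypothetical toy data and a deliberately incomplete mapping
degrees = pd.Series(["BSc", "MSc", "ME"], name="Degree")
partial = {"BSc": "Undergraduate", "MSc": "Postgraduate"}

try:
    strict_map(degrees, partial)
    error_message = None
except KeyError as exc:
    error_message = str(exc)

complete = strict_map(degrees, {**partial, "ME": "Postgraduate"})
print(error_message)
print(complete.tolist())
```

A float dtype in a column that should hold integer codes is often a hint that such `NaN` values slipped through.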


In [16]:
# Define the ordinal encodings
orden_degree = {"Secondary": 0, "Undergraduate": 1, "Postgraduate": 2, "Doctorate": 3}
# Note: this order is not monotonic in diet quality ('Moderate' receives the largest code)
orden_alimentos = {'Healthy': 0, 'Unhealthy': 1, 'Moderate': 2}
orden_siesta = {'Less than 5 hours': 0, '5-6 hours': 1, '7-8 hours': 2,'More than 8 hours': 3}

# Apply the mappings
df['Degree'] = df['Degree'].map(orden_degree)
df['Dietary Habits'] = df['Dietary Habits'].map(orden_alimentos)
df['Sleep Duration'] = df['Sleep Duration'].map(orden_siesta)

df

Unnamed: 0,Gender,Age,City,Academic Pressure,CGPA,Study Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,0,33.0,Visakhapatnam,5.0,8.97,2.0,1,0,1.0,1,3.0,1.0,0,1
1,1,24.0,Bangalore,2.0,5.90,5.0,1,2,1.0,0,3.0,2.0,1,0
2,0,31.0,Srinagar,3.0,7.03,5.0,0,0,1.0,0,9.0,1.0,1,0
3,1,28.0,Varanasi,3.0,5.59,2.0,2,2,1.0,1,4.0,5.0,1,1
4,1,25.0,Jaipur,4.0,8.13,3.0,1,2,2.0,1,1.0,1.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27896,1,27.0,Surat,5.0,5.75,5.0,1,1,0.0,1,7.0,1.0,1,0
27897,0,27.0,Ludhiana,2.0,9.40,3.0,0,0,2.0,0,0.0,3.0,1,0
27898,0,31.0,Faridabad,3.0,6.61,4.0,1,1,3.0,0,12.0,2.0,0,0
27899,1,18.0,Ludhiana,5.0,6.88,2.0,0,0,0.0,1,10.0,5.0,0,1
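
An ordinal encoding can also be expressed with an ordered pandas `Categorical`, which states the intended order as a list and marks unknown labels with the code `-1`. The sketch below reuses the Sleep Duration order from the mapping above; the toy series is an assumption.

```python
import pandas as pd

# Category order taken from the `orden_siesta` mapping
sleep_order = ["Less than 5 hours", "5-6 hours", "7-8 hours", "More than 8 hours"]

# Hypothetical toy data, including one label outside the known categories
sleep = pd.Series(["5-6 hours", "Less than 5 hours", "More than 8 hours", "7-8 hours", "Others"])

# Ordered categorical: codes follow the position of each label in `sleep_order`
codes = pd.Categorical(sleep, categories=sleep_order, ordered=True).codes
print(list(codes))
```

Because the order lives in a plain list, it is easy to review whether the codes really increase with the quantity being measured.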


## <font color= #bbc28d>• **Encoding Multi-class Variables - One Hot** </font>
To avoid data leakage, we split the data into training and test sets before fitting the encoder, so that the One-Hot categories are learned from the training data only. The other columns were already processed with fixed mappings, so splitting at this point causes no issues. This gives us two datasets: one to train the models and one to compute their performance metrics:

In [None]:
# Split into features and target
X = df.drop(['Depression'], axis=1)
y = df['Depression']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the One-Hot encoder
encoder = OneHotEncoder(
    drop='first',
    handle_unknown='ignore',   # avoids an error if an unseen category appears
    sparse_output=False
)

# Fit the encoder on the training data only
encoder.fit(X_train[['City']])
print(f'If all columns are zero, the city is: {encoder.categories_[0][0]}')

# Apply it
X_train_city = encoder.transform(X_train[['City']])
X_test_city = encoder.transform(X_test[['City']])

If all columns are zero, the city is: Agra


In [None]:
# One-Hot column names
city_cols = encoder.get_feature_names_out(['City'])  # automatically generated column names

# Build DataFrames from the encoded columns
X_train_city_df = pd.DataFrame(X_train_city, columns=city_cols, index=X_train.index)
X_test_city_df = pd.DataFrame(X_test_city, columns=city_cols, index=X_test.index)

# Drop the original column from the dataset
X_train = X_train.drop(columns=['City'])
X_test = X_test.drop(columns=['City'])

# Join the new columns with the rest of the dataset
X_train_final = pd.concat([X_train, X_train_city_df], axis=1)
X_test_final = pd.concat([X_test, X_test_city_df], axis=1)

## <font color= #bbc28d>• **Exporting the Data** </font>

In [None]:
# Forward slashes keep the paths portable and consistent with the read_csv call above
X_train_final.to_csv('../data/processed/Train.csv', index=False)
y_train.to_csv('../data/processed/y_Train.csv', index=False)
X_test_final.to_csv('../data/processed/Test.csv', index=False)
y_test.to_csv('../data/processed/y_Test.csv', index=False)
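
Because the files are written with `index=False`, the original row index is discarded and the feature and target files stay aligned only by row order. The sketch below checks a round trip in a temporary directory; the toy frames are assumptions and the file names simply mirror the ones above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy features and target sharing a shuffled index
X_demo = pd.DataFrame({"Age": [33.0, 24.0, 31.0], "Gender": [0, 1, 0]}, index=[10, 4, 7])
y_demo = pd.Series([1, 0, 0], index=[10, 4, 7], name="Depression")

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data" / "processed"
    out_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create folders

    X_demo.to_csv(out_dir / "Train.csv", index=False)
    y_demo.to_csv(out_dir / "y_Train.csv", index=False)

    X_back = pd.read_csv(out_dir / "Train.csv")
    y_back = pd.read_csv(out_dir / "y_Train.csv")

# After reloading, both files carry a fresh 0..n-1 index in the same row order
print(X_back.index.tolist(), y_back["Depression"].tolist())
```

`pathlib.Path` builds the path with the correct separator on any operating system.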

# <font color= #bbc28d> **Pipeline** </font>

In [None]:
df = pd.read_csv("../data/raw/depression_dataset.csv")

def clean_data(df, save_data=False):
    # 1. Drop missing values
    df = df.dropna()


    # 2. Filter out categories that carry little information due to their low prevalence
    # City
    city_counts = df['City'].value_counts()
    ciudades = city_counts[city_counts < 450]
    df = df[~df['City'].isin(ciudades.index)]
    # Dietary Habits
    df = df[df['Dietary Habits'] != 'Others']
    # Sleep Duration
    df = df[df['Sleep Duration'] != 'Others']
    # Degree
    df = df[df['Degree'] != 'Others']
    # Age
    df = df[df['Age'] <= 35]
    # Academic Pressure
    df = df[df['Academic Pressure'] > 0]
    # Study Satisfaction
    df = df[df['Study Satisfaction'] > 0]

    # 3. Drop variables that are poor predictors (reassign instead of mutating a filtered frame in place)
    df = df.drop(columns=['Work Pressure', 'Profession', 'Job Satisfaction', 'id'])


    # 4. Map the binary categorical variables
    gender = {'Male' : 0, 'Female' : 1}
    general = {'Yes' : 1, 'No' : 0}
    df['Gender'] = df['Gender'].map(gender)
    df['Have you ever had suicidal thoughts ?'] = df['Have you ever had suicidal thoughts ?'].map(general)
    df['Family History of Mental Illness'] = df['Family History of Mental Illness'].map(general)


    # 5. Map the multi-class categorical variables
    degree = {
        "Class 12": "Secondary",
        "B.Pharm": "Undergraduate", "BSc": "Undergraduate", "BA": "Undergraduate", "BCA": "Undergraduate",
        "B.Ed": "Undergraduate", "LLB": "Undergraduate", "BE": "Undergraduate", "BHM": "Undergraduate",
        "B.Com": "Undergraduate", "B.Arch": "Undergraduate", "B.Tech": "Undergraduate", "BBA": "Undergraduate",
        "M.Tech": "Postgraduate", "M.Ed": "Postgraduate", "MSc": "Postgraduate", "M.Pharm": "Postgraduate",
        "MCA": "Postgraduate", "MA": "Postgraduate", "MBA": "Postgraduate", "M.Com": "Postgraduate", "MHM": "Postgraduate",
        "PhD": "Doctorate", "MD": "Doctorate", "MBBS": "Doctorate", "LLM": "Doctorate", "ME": "Postgraduate"
    }
    orden_degree = {"Secondary": 0, "Undergraduate": 1, "Postgraduate": 2, "Doctorate": 3}
    orden_alimentos = {'Healthy': 0, 'Unhealthy': 1, 'Moderate': 2}
    orden_siesta = {'Less than 5 hours': 0, '5-6 hours': 1, '7-8 hours': 2,'More than 8 hours': 3}
    # Apply the mappings
    df['Degree'] = df['Degree'].map(degree)
    df['Degree'] = df['Degree'].map(orden_degree)
    df['Dietary Habits'] = df['Dietary Habits'].map(orden_alimentos)
    df['Sleep Duration'] = df['Sleep Duration'].map(orden_siesta)


    # 6. Train-Test-Val Split (70-20-10)
    X = df.drop(['Depression'], axis=1)
    y = df['Depression']

    # Hold out 30% of the rows, then split that hold-out 2:1 into test (20%) and validation (10%)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=1/3, random_state=42)

    if save_data:
        # Save the splits (forward slashes keep the paths portable)
        X_train.to_csv('../data/interim/X_train.csv', index=False)
        X_test.to_csv('../data/interim/X_test.csv', index=False)
        X_val.to_csv('../data/interim/X_val.csv', index=False)

        y_train.to_csv('../data/processed/y_train.csv', index=False)
        y_test.to_csv('../data/processed/y_test.csv', index=False)
        y_val.to_csv('../data/processed/y_val.csv', index=False)

    # Convert the target variables to NumPy arrays
    y_train = y_train.to_numpy().ravel()
    y_test = y_test.to_numpy().ravel()
    y_val = y_val.to_numpy().ravel()

    return X_train, X_test, X_val, y_train, y_test, y_val
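
The proportions produced by two chained `train_test_split` calls can be verified with simple arithmetic. This sketch assumes, as in step 6 of `clean_data`, that the second call splits the hold-out portion of the first call and that its second output is used as the validation set; the helper name `split_fractions` is hypothetical.

```python
def split_fractions(first_test_size: float, second_test_size: float):
    """Return the (train, test, val) shares of the full dataset for two chained splits."""
    train = 1 - first_test_size          # kept by the first split
    holdout = first_test_size            # passed on to the second split
    val = holdout * second_test_size     # "test" side of the second split
    test = holdout - val                 # "train" side of the second split
    return train, test, val


# A 70-20-10 target needs a 30% hold-out that is then divided 2:1
train, test, val = split_fractions(0.3, 1 / 3)
print(round(train, 3), round(test, 3), round(val, 3))
```

Note that `test_size` always refers to the second pair of outputs of `train_test_split`, so a value such as 0.7 in the first call would leave only 30% of the rows for training.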

In [None]:
def preprocessor(X_train, X_test, X_val, save_data=False):
    # Encode the multi-class variable with One-Hot
    encoder = OneHotEncoder(
        drop='first',
        handle_unknown='ignore',        # Avoids an error if an unseen category appears
        sparse_output=False
    )

    # Fit the encoder on the training data only
    encoder.fit(X_train[['City']])

    # Apply One-Hot
    X_train_city = encoder.transform(X_train[['City']])
    X_test_city = encoder.transform(X_test[['City']])
    X_val_city = encoder.transform(X_val[['City']])

    # Get the One-Hot column names
    city_cols = encoder.get_feature_names_out(['City'])  # Automatically generated column names

    # Build DataFrames from the encoded columns
    X_train_city_df = pd.DataFrame(X_train_city, columns=city_cols, index=X_train.index)
    X_test_city_df = pd.DataFrame(X_test_city, columns=city_cols, index=X_test.index)
    X_val_city_df = pd.DataFrame(X_val_city, columns=city_cols, index=X_val.index)

    # Drop the original column from the dataset
    X_train = X_train.drop(columns=['City'])
    X_test = X_test.drop(columns=['City'])
    X_val = X_val.drop(columns=['City'])

    # Join the new columns with the rest of the dataset
    X_train_final = pd.concat([X_train, X_train_city_df], axis=1)
    X_test_final = pd.concat([X_test, X_test_city_df], axis=1)
    X_val_final = pd.concat([X_val, X_val_city_df], axis=1)

    # Standardize the data (statistics are learned from the training set only)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_final)
    X_test_scaled = scaler.transform(X_test_final)
    X_val_scaled = scaler.transform(X_val_final)

    if save_data:
        # Convert the arrays back to DataFrames and save them
        X_train_df = pd.DataFrame(X_train_scaled, columns=X_train_final.columns, index=X_train_final.index)
        X_test_df = pd.DataFrame(X_test_scaled, columns=X_test_final.columns, index=X_test_final.index)
        X_val_df = pd.DataFrame(X_val_scaled, columns=X_val_final.columns, index=X_val_final.index)

        X_train_df.to_csv('../data/processed/X_train.csv', index=False)
        X_test_df.to_csv('../data/processed/X_test.csv', index=False)
        X_val_df.to_csv('../data/processed/X_val.csv', index=False)

    return X_train_scaled, X_test_scaled, X_val_scaled, encoder, scaler
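
`preprocessor` fits the `StandardScaler` on the training matrix and only calls `transform` on the test and validation sets. The sketch below, on toy arrays that are assumptions, shows that this amounts to standardizing the held-out data with the training mean and the training (population) standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, three training rows, one test row
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the train statistics

# Manual equivalent: np.std defaults to ddof=0, which matches StandardScaler
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(test_scaled, manual_test))
```

Fitting the scaler on the full dataset instead would let test-set statistics influence the training features, which is the leakage the split is meant to prevent.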

In [None]:
X_train, X_test, X_val, y_train, y_test, y_val = clean_data(df, save_data=True)

In [None]:
X_train, X_test, X_val, encoder, scaler = preprocessor(X_train, X_test, X_val, save_data=True)

In [None]:
# mlflow.data.from_numpy(X_train.data, targets=y_train, name="depression_train")
# mlflow.data.from_numpy(X_test.data, targets=y_test, name="depression_test")
# mlflow.data.from_numpy(X_val.data, targets=y_val, name="depression_val")