# **<span style="color:CadetBlue;"> ML Project: "Employee Absenteeism at Work" </span>**

Analysis and prediction of workplace absenteeism using the "Employee Absenteeism at Work" dataset, which covers employees of a company in Brazil.

Workplace absenteeism is a common problem in organizations and directly affects productivity, operational planning, and the work environment. By predicting absenteeism patterns, the company can act ahead of time and design more effective talent-management and well-being strategies.

The **objective** is to predict workplace absenteeism and detect behavior patterns that help anticipate cases of frequent absence. The questions to answer include: Which personal or work-related variables are most closely related to absenteeism? Is it possible to predict how many hours an employee will be absent based on their profile?

In [1]:
import sys, os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from utils import bootcampviztools as bt
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, FunctionTransformer, OneHotEncoder, RobustScaler


# Gradient Boosting
import xgboost as xgb
import lightgbm as lgb

## 1. Understanding the problem

The `Absenteeism at Work` dataset contains information about employees, social and work-related factors, and the number of hours each employee was absent from work.

It is a supervised problem because it has a target variable, **Absenteeism time in hours**. The target is numeric (hours), but this notebook binarizes it at its median in section 3, so it is treated as a binary classification problem.
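The target is recorded in hours, while the models used later are classifiers. A minimal sketch of the median-based binarization applied in section 3 follows; the hour values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical absence durations in hours (illustrative values only)
hours = pd.Series([0, 2, 3, 4, 8, 8, 16, 40], name="Absenteeism time in hours")

# Rows strictly above the median become class 1, all others class 0
threshold = hours.median()
y_binary = (hours > threshold).astype(int)

print("threshold:", threshold)
print("classes:", y_binary.tolist())
```

Because the comparison is strict, rows exactly at the median fall into class 0, so the two classes are not necessarily the same size.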


## 2. Loading the data

In [2]:
df = pd.read_csv("../data_sample/Absenteeism_at_work.csv", sep=";")

In [3]:
df.head()

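| | ID | Reason for absence | Month of absence | Day of the week | Seasons | Transportation expense | Distance from Residence to Work | Service time | Age | Work load Average/day | ... | Disciplinary failure | Education | Son | Social drinker | Social smoker | Pet | Weight | Height | Body mass index | Absenteeism time in hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 26 | 7 | 3 | 1 | 289 | 36 | 13 | 33 | 239.554 | ... | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 4 |
| 1 | 36 | 0 | 7 | 3 | 1 | 118 | 13 | 18 | 50 | 239.554 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 98 | 178 | 31 | 0 |
| 2 | 3 | 23 | 7 | 4 | 1 | 179 | 51 | 18 | 38 | 239.554 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 89 | 170 | 31 | 2 |
| 3 | 7 | 7 | 7 | 5 | 1 | 279 | 5 | 14 | 39 | 239.554 | ... | 0 | 1 | 2 | 1 | 1 | 0 | 68 | 168 | 24 | 4 |
| 4 | 11 | 23 | 7 | 5 | 1 | 289 | 36 | 13 | 33 | 239.554 | ... | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 2 |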


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 740 entries, 0 to 739
Data columns (total 21 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   ID                               740 non-null    int64  
 1   Reason for absence               740 non-null    int64  
 2   Month of absence                 740 non-null    int64  
 3   Day of the week                  740 non-null    int64  
 4   Seasons                          740 non-null    int64  
 5   Transportation expense           740 non-null    int64  
 6   Distance from Residence to Work  740 non-null    int64  
 7   Service time                     740 non-null    int64  
 8   Age                              740 non-null    int64  
 9   Work load Average/day            740 non-null    float64
 10  Hit target                       740 non-null    int64  
 11  Disciplinary failure             740 non-null    int64  
 12  Education             

*All columns are numeric and there are no null values.*
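The observation above can also be checked programmatically rather than read off the `df.info()` output. The sketch below uses a small stand-in frame (column names mirror the dataset, values are made up), so it runs without the CSV file.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Small stand-in frame; the real notebook would use `df` instead
demo = pd.DataFrame({
    "Age": [33, 50, 38],
    "Work load Average/day ": [239.554, 239.554, 239.554],
    "Education": [1, 1, 3],
})

# True only if every column has an integer or float dtype
all_numeric = all(is_numeric_dtype(dtype) for dtype in demo.dtypes)

# Total number of missing cells across the whole frame
null_count = int(demo.isna().sum().sum())

print("all numeric:", all_numeric, "| nulls:", null_count)
```

A numeric dtype does not imply a numeric variable: columns such as `Education` hold integer codes for categories and are treated as categorical later on.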

In [5]:
target = "Absenteeism time in hours"

In [6]:
df.drop(columns=["Height", "Weight"], inplace=True)

## 3. Train and test

In [7]:
X = df.drop([target], axis=1)
y = df[target]

In [8]:
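# Binarize the target: 1 if the hours exceed the median of the full dataset, otherwise 0.
# This overwrites the numeric y defined in the previous cell.
y = (df['Absenteeism time in hours'] > df['Absenteeism time in hours'].median()).astype(int)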

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 4. MiniEDA

In [10]:
X_train["Has_son"] = X_train["Son"].apply(lambda x: 1 if x > 0 else 0)
X_test["Has_son"] = X_test["Son"].apply(lambda x: 1 if x > 0 else 0)

X_train.drop(columns=["Son"], inplace=True)
X_test.drop(columns=["Son"], inplace=True)


In [11]:
X_train['Has_pet'] = X_train['Pet'].apply(lambda x: 1 if x > 0 else 0)
X_test['Has_pet'] = X_test['Pet'].apply(lambda x: 1 if x > 0 else 0)

X_train.drop(columns=["Pet"], inplace=True)
X_test.drop(columns=["Pet"], inplace=True)

Categorical
* Reason for absence
* Month of absence
* Day of the week
* Seasons
* Disciplinary failure
* Education
* Social drinker
* Social smoker
* Has_pet (derived from Pet)
* Has_son (derived from Son)

Numerical
* Transportation expense
* Distance from Residence to Work
* Service time
* Age
* Work load Average/day
* Hit target
* Body mass index
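Since every column is stored as a number, the split between categorical and numerical features has to be decided by hand. One rough aid is to look at the number of distinct values per column. The sketch below assumes a hypothetical cutoff `max_levels` and made-up data; it only proposes candidates and does not replace reading the dataset documentation.

```python
import pandas as pd

# Made-up sample; column names mirror the dataset
demo = pd.DataFrame({
    "Day of the week": [2, 3, 4, 5, 6, 2, 3, 4],
    "Social drinker": [1, 0, 1, 1, 0, 0, 1, 1],
    "Transportation expense": [289, 118, 179, 279, 361, 260, 155, 235],
})

max_levels = 5  # hypothetical cutoff, to be tuned by inspection

# Columns with few distinct values are candidates for categorical treatment
cardinality = demo.nunique()
candidate_cat = cardinality[cardinality <= max_levels].index.tolist()
candidate_num = cardinality[cardinality > max_levels].index.tolist()

print("categorical candidates:", candidate_cat)
print("numerical candidates:", candidate_num)
```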

In [12]:
features_cat = [ "Reason for absence", "Month of absence", "Day of the week", "Seasons", "Disciplinary failure", "Education", "Social drinker", "Social smoker", "Has_pet", "Has_son"]
features_num = ["Transportation expense", "Distance from Residence to Work", "Service time", "Age", "Work load Average/day ", "Hit target", "Body mass index"]

In [13]:
# Log transform (log1p) followed by robust scaling for the most skewed features
log_feats = ['Transportation expense', 'Work load Average/day ', 'Body mass index']

log1p_transformer = FunctionTransformer(np.log1p)
X_train_log = log1p_transformer.fit_transform(X_train[log_feats])
X_test_log = log1p_transformer.transform(X_test[log_feats])

robust_scaler = RobustScaler()
X_train_log_scaled = robust_scaler.fit_transform(X_train_log)
X_test_log_scaled = robust_scaler.transform(X_test_log)

# Standard scaling for the remaining numerical features
std_feats = ['Distance from Residence to Work', 'Service time', 'Age', 'Hit target']

standard_scaler = StandardScaler()
X_train_std = standard_scaler.fit_transform(X_train[std_feats])
X_test_std = standard_scaler.transform(X_test[std_feats])

In [14]:
X_train_transformed_num = pd.DataFrame(
    np.hstack([X_train_log_scaled, X_train_std]),
    columns=log_feats + std_feats,
    index=X_train.index
)

X_test_transformed_num = pd.DataFrame(
    np.hstack([X_test_log_scaled, X_test_std]),
    columns=log_feats + std_feats,
    index=X_test.index)
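The manual steps above (log1p plus `RobustScaler` for the skewed columns, `StandardScaler` for the rest, then `np.hstack`) could also be expressed with a `ColumnTransformer`. The following is a sketch under assumptions, not the notebook's actual code: it uses a reduced set of columns and made-up rows so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, RobustScaler, StandardScaler

# Reduced feature lists for illustration
log_feats = ["Transportation expense", "Body mass index"]
std_feats = ["Age", "Service time"]

# Made-up rows standing in for X_train and X_test
train = pd.DataFrame({
    "Transportation expense": [289, 118, 179, 279],
    "Body mass index": [30, 31, 31, 24],
    "Age": [33, 50, 38, 39],
    "Service time": [13, 18, 18, 14],
})
test = pd.DataFrame({
    "Transportation expense": [260, 155],
    "Body mass index": [27, 29],
    "Age": [28, 41],
    "Service time": [9, 16],
})

# Skewed columns: log1p first, then scale with median and IQR
log_pipeline = Pipeline([
    ("log1p", FunctionTransformer(np.log1p)),
    ("robust", RobustScaler()),
])

# Output columns follow the order of the transformers listed here
preprocessor = ColumnTransformer([
    ("log", log_pipeline, log_feats),
    ("std", StandardScaler(), std_feats),
])

# Fit on train only, then reuse the fitted statistics on test
train_num = pd.DataFrame(
    preprocessor.fit_transform(train),
    columns=log_feats + std_feats,
    index=train.index,
)
test_num = pd.DataFrame(
    preprocessor.transform(test),
    columns=log_feats + std_feats,
    index=test.index,
)
print(train_num.round(2))
```

Keeping all numeric preprocessing in one object makes it harder to fit a scaler on the test set by accident, and the same object can be placed in front of a model inside a `Pipeline`.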

In [15]:
razones = {
    1:'Infectious', 2:'Neoplasms', 3:'Blood', 4:'Metabolic',
    5:'Mental', 6:'Nervous', 7:'eye', 8:'ear',
    9:'circulation', 10:'respiratory', 11:'digestive', 12:'skin',
    13:'muscles', 14:'genitourinary', 15: 'pregnancy', 16:'perinatal', 
    17:'deformations', 18:'abnormalfindings', 19:'injury-poison', 20:'mortality', 
    21:'healthstatus', 22:'follow-up', 23:'consultation', 24:'blood-donation',
    25:'lab', 26:'unjustified', 27:'physio', 28:'dentist'
}

meses = {
    1: 'Jan', 2: 'Febr', 3: 'Mar', 4: 'April',
    5: 'May', 6: 'June', 7: 'July', 8: 'August',
    9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec', 0: 'UNK'
}

dias = {
    2: 'Lunes', 3: 'Martes', 4: 'Miércoles',
    5: 'Jueves', 6: 'Viernes'
}

estaciones = {
    1: 'Summer', 2: 'Autumn', 3: 'Winter', 4: 'Spring'
}

# Codes absent from a mapping (e.g. reason 0) become NaN and later get all-zero dummy columns
X_train["Reason for absence"] = X_train["Reason for absence"].map(razones)
X_train['Month of absence'] = X_train['Month of absence'].map(meses)
X_train['Day of the week'] = X_train['Day of the week'].map(dias)
X_train['Seasons'] = X_train['Seasons'].map(estaciones)

In [16]:

X_test["Reason for absence"] = X_test["Reason for absence"].map(razones)
X_test['Month of absence'] = X_test['Month of absence'].map(meses)
X_test['Day of the week'] = X_test['Day of the week'].map(dias)
X_test['Seasons'] = X_test['Seasons'].map(estaciones)

In [17]:
cat_feats= ["Reason for absence", "Month of absence", "Day of the week", "Seasons"]
X_train_cat = pd.get_dummies(X_train[cat_feats], drop_first=True)
X_test_cat = pd.get_dummies(X_test[cat_feats], drop_first=True)

X_test_cat = X_test_cat.reindex(columns=X_train_cat.columns, fill_value=0)

In [18]:
X_train_transformed = pd.concat([X_train_transformed_num, X_train_cat], axis=1)
X_test_transformed = pd.concat([X_test_transformed_num, X_test_cat], axis=1)

In [19]:
X_train_st = pd.concat([X_train[features_num], X_train_cat], axis=1)
X_test_st = pd.concat([X_test[features_num], X_test_cat], axis=1)

In [20]:
X_train_st.columns

Index(['Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Body mass index', 'Reason for absence_Infectious',
       'Reason for absence_Mental', 'Reason for absence_Metabolic',
       'Reason for absence_Nervous', 'Reason for absence_abnormalfindings',
       'Reason for absence_blood-donation', 'Reason for absence_circulation',
       'Reason for absence_consultation', 'Reason for absence_dentist',
       'Reason for absence_digestive', 'Reason for absence_ear',
       'Reason for absence_eye', 'Reason for absence_follow-up',
       'Reason for absence_genitourinary', 'Reason for absence_healthstatus',
       'Reason for absence_injury-poison', 'Reason for absence_lab',
       'Reason for absence_muscles', 'Reason for absence_perinatal',
       'Reason for absence_physio', 'Reason for absence_pregnancy',
       'Reason for absence_respiratory', 'Reason for absence_skin',
       'Reason for absence_

In [21]:
X_train_transformed.columns

Index(['Transportation expense', 'Work load Average/day ', 'Body mass index',
       'Distance from Residence to Work', 'Service time', 'Age', 'Hit target',
       'Reason for absence_Infectious', 'Reason for absence_Mental',
       'Reason for absence_Metabolic', 'Reason for absence_Nervous',
       'Reason for absence_abnormalfindings',
       'Reason for absence_blood-donation', 'Reason for absence_circulation',
       'Reason for absence_consultation', 'Reason for absence_dentist',
       'Reason for absence_digestive', 'Reason for absence_ear',
       'Reason for absence_eye', 'Reason for absence_follow-up',
       'Reason for absence_genitourinary', 'Reason for absence_healthstatus',
       'Reason for absence_injury-poison', 'Reason for absence_lab',
       'Reason for absence_muscles', 'Reason for absence_perinatal',
       'Reason for absence_physio', 'Reason for absence_pregnancy',
       'Reason for absence_respiratory', 'Reason for absence_skin',
       'Reason for absence_

**MODELS**

In [22]:
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

In [23]:
modelos_escalados = {
    "Logistic": LogisticRegression(max_iter=2000, class_weight="balanced")
}

modelos_no_escalados = {
    "RandomF": RandomForestClassifier(max_depth=10, random_state=42, class_weight="balanced"),
    "XGB": XGBClassifier(max_depth=10, random_state=42, n_jobs=-1),
    "LGB": LGBMClassifier(max_depth=10, random_state=42, verbose=-100, class_weight="balanced", n_jobs=-1)
}


from sklearn.metrics import confusion_matrix

def evaluate_model(y_true, y_pred):
    """Print sensitivity, specificity and precision for a binary classifier."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensibilidad = tp / (tp + fn)    # recall of the positive class
    especificidad = tn / (tn + fp)   # recall of the negative class
    precision = tp / (tp + fp)
    print(f"✔️ Sensitivity: {sensibilidad:.2f}")
    print(f"✔️ Specificity: {especificidad:.2f}")
    print(f"✔️ Precision: {precision:.2f}")

# Models trained on the scaled features
for nombre, model in modelos_escalados.items():
    model.fit(X_train_transformed, y_train)
    print("Model:", nombre, "(scaled data)")
    pred_train = model.predict(X_train_transformed)
    pred_test = model.predict(X_test_transformed)
    evaluate_model(y_test, pred_test)
    print(classification_report(y_test, pred_test))


# Models trained on the unscaled features
for nombre, model in modelos_no_escalados.items():
    model.fit(X_train_st, y_train)
    print("Model:", nombre, "(unscaled data)")
    pred_train = model.predict(X_train_st)
    pred_test = model.predict(X_test_st)
    evaluate_model(y_test, pred_test)
    print(classification_report(y_test, pred_test))

Model: Logistic (scaled data)
✔️ Sensitivity: 0.78
✔️ Specificity: 0.74
✔️ Precision: 0.71
              precision    recall  f1-score   support

           0       0.80      0.74      0.77        81
           1       0.71      0.78      0.74        67

    accuracy                           0.76       148
   macro avg       0.76      0.76      0.76       148
weighted avg       0.76      0.76      0.76       148

Model: RandomF (unscaled data)
✔️ Sensitivity: 0.81
✔️ Specificity: 0.73
✔️ Precision: 0.71
              precision    recall  f1-score   support

           0       0.82      0.73      0.77        81
           1       0.71      0.81      0.76        67

    accuracy                           0.76       148
   macro avg       0.76      0.77      0.76       148
weighted avg       0.77      0.76      0.76       148

Model: XGB (unscaled data)
✔️ Sensitivity: 0.82
✔️ Specificity: 0.73
✔️ Precision: 0.71
              precision    recall  f1-score
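The three figures printed for each model are ratios of confusion-matrix cells. The worked example below uses a hand-made set of labels (not the notebook's predictions) so that each ratio can be traced by hand.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels: 4 positives and 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

sensitivity = tp / (tp + fn)   # share of true positives that were detected
specificity = tn / (tn + fp)   # share of true negatives that were detected
precision = tp / (tp + fp)     # share of predicted positives that were correct

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} precision={precision:.2f}")
```

Sensitivity and specificity correspond to the `recall` column of `classification_report` for classes 1 and 0 respectively, which is why the two outputs agree for each model above.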

In [24]:
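# Logistic regression
# Note: liblinear does not support penalty=None, so those combinations fail to fit
# and are scored as NaN (see the warning in the output below).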
logreg = LogisticRegression(max_iter=1000, random_state=42)
param_logreg = {
    'penalty': ['l1', 'l2', None],
    'solver': ['liblinear', 'saga'],  
    'C': [0.01, 0.1, 1, 10]
}
grid_logreg = GridSearchCV(logreg, param_logreg, cv=5, scoring='accuracy', n_jobs=-1)
grid_logreg.fit(X_train_transformed, y_train)

20 fits failed out of a total of 120.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\emmag\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\emmag\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "c:\Users\emmag\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1193, in fit
    solver = _check_solver(se

In [25]:
rf = RandomForestClassifier(random_state=42)
param_rf = {
    'n_estimators': [100, 200, 400],
    'max_depth': [None, 1, 5, 10],
    'min_samples_leaf': [1,10,20,100],
    'class_weight':['balanced', None],
    'max_features':['sqrt', 'log2', None]
}
grid_rf = GridSearchCV(rf, param_rf, cv=5, scoring='accuracy', n_jobs=-1)
grid_rf.fit(X_train_st, y_train)

In [26]:
lgbm = LGBMClassifier(random_state=42) 
param_lgbm = {
    'n_estimators': [100, 200, 400],
    'learning_rate': [0.1, 0.3, 0.6, 1], 
    'max_depth': [1, 6, 10, -1],  
    'min_child_samples': [1, 10, 20, 100], 
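    # Ratio of negatives to positives in the binarized training target.
    # (df[target] holds raw hours, so comparing it with 0 and 1 would not give the class ratio.)
    'scale_pos_weight': [
        (y_train == 0).sum() / (y_train == 1).sum(),
        1],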
    'colsample_bytree': [0.5, 1]
}
grid_lgbm = GridSearchCV(lgbm, param_lgbm, cv=5, scoring='accuracy', n_jobs=-1)
grid_lgbm.fit(X_train_st, y_train)

In [27]:
# Named xgb_clf so the estimator does not shadow the `xgb` module imported at the top
xgb_clf = XGBClassifier(max_depth=5, random_state=42)
param_xgb = {
    'n_estimators': [100, 200, 400],
    'eta': [0.1, 0.3, 0.6, 1],
    'max_depth': [1, 6, 10, None],
    'min_child_weight': [1,10,20,100],
    'colsample_bytree': [0.5,1]
}
grid_xgb = GridSearchCV(xgb_clf, param_xgb, cv=5, scoring='accuracy', n_jobs=-1)
grid_xgb.fit(X_train_st, y_train)

In [28]:
from sklearn.metrics import accuracy_score

# Evaluation
models_con = {
    "LogisticR": grid_logreg}

models_sin={
    "RandomForest": grid_rf,
    "LightGBM": grid_lgbm,
    "XGBoost": grid_xgb
}

for name, model in models_con.items():
    acc = accuracy_score(y_test, model.predict(X_test_transformed))
    print(f"{name} - Best Params: {model.best_params_} | CV Score: {model.best_score_:.4f} | Test Accuracy: {acc:.4f}")


for name, model in models_sin.items():
    acc = accuracy_score(y_test, model.predict(X_test_st))
    print(f"{name} - Best Params: {model.best_params_} | CV Score: {model.best_score_:.4f} | Test Accuracy: {acc:.4f}")


LogisticR - Best Params: {'C': 10, 'penalty': 'l2', 'solver': 'saga'} | CV Score: 0.7738 | Test Accuracy: 0.7703
RandomForest - Best Params: {'class_weight': None, 'max_depth': 10, 'max_features': 'log2', 'min_samples_leaf': 1, 'n_estimators': 400} | CV Score: 0.7669 | Test Accuracy: 0.7432
LightGBM - Best Params: {'colsample_bytree': 1, 'learning_rate': 1, 'max_depth': 1, 'min_child_samples': 1, 'n_estimators': 400, 'scale_pos_weight': 1} | CV Score: 0.7636 | Test Accuracy: 0.8108
XGBoost - Best Params: {'colsample_bytree': 0.5, 'eta': 0.3, 'max_depth': 1, 'min_child_weight': 1, 'n_estimators': 100} | CV Score: 0.7585 | Test Accuracy: 0.7838
