### Methodological overview and notebook's sections

We intend to leverage the fact that our dataset has several waves to balance our target variable without adding synthetic data or using random oversampling. Because our dataset consists mostly of categorical data, adding synthetic data through methods like SMOTE isn't particularly useful to us and might even be detrimental: it can introduce unrealistic categorical combinations that hinder the model's ability to generalize to real data. Even though there is a variant for categorical data (SMOTE-N), it would still introduce synthetic noise into the data.

To achieve this, we first need to reshape our data so that we have a single hospitalization variable instead of one hospitalization variable per wave. We will also use both respondent ('r'-prefixed) and spouse ('s'-prefixed) variables. This means that our number of rows will grow by a factor of *2w*, where *w* is the number of waves (with *w* = 5, the 26,839 raw rows become 268,390).
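The reshaping idea can be sketched on toy data. This is only an illustration under stated assumptions: the real `data_utils.stack_df` is not shown in this notebook, and the names `hh_id`, `stack_waves` and the toy `r1hosp1y`-style columns are made up for the example.

```python
import pandas as pd

# Toy wide-format data: one row per household, with hypothetical
# respondent ('r') and spouse ('s') hospitalization columns for two waves.
wide_df = pd.DataFrame({
    "hh_id": [1, 2, 3],
    "r1hosp1y": [0, 1, 0],
    "s1hosp1y": [0, 0, 1],
    "r2hosp1y": [1, 0, 0],
    "s2hosp1y": [0, 0, 0],
})


def stack_waves(df, waves, roles=("r", "s"), suffix="hosp1y"):
    """Stack one (role, wave) block at a time into a long dataframe."""
    blocks = []
    for wave in waves:
        for role in roles:
            block = df[["hh_id"]].copy()
            block["wave"] = wave
            block["role"] = role
            # All per-wave columns collapse into a single target column
            block["hospitalized"] = df[f"{role}{wave}{suffix}"].to_numpy()
            blocks.append(block)
    return pd.concat(blocks, ignore_index=True)


long_df = stack_waves(wide_df, waves=[1, 2])

# Rows grow by a factor of 2w: 3 households x 2 roles x 2 waves = 12 rows
print(long_df.shape)
```

With two waves the three toy rows become twelve, which mirrors the *2w* growth described above.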

Our strategy consists of two approaches. In the first, we use waves 3, 4 and 5: we take the entirety of wave 5 and add the rows from waves 3 and 4 where 'hospitalized' = 1. This results in a more balanced dataset without undersampling and, notably, without the noise of synthetic data: we balance our dataset organically. We chose these three waves because they are evenly spaced in time (2012, 2015, 2018) and because we want to establish baseline metrics against which to measure the impact of additional waves.

In our second approach, we will use the hospitalized rows from waves 1, 2, 3 and 4 to balance wave 5. This should result in an even smaller gap between the majority and minority classes.



<a id='sections'></a>
#### 📌 [Sections](#sections) 
- [Reshaping the data](#data-reshaping)  
- [Removal of proxies, respondents under 50 and missing values in target](#removals)   
- [Key distributions](#key-dists)  
- [Destringify values of categorical variables](#destringify-cat-vals)  
- [First approach: waves 3, 4 and 5](#approach-1) 
  - [Baseline: no imputations](#baseline-no-imputations) 
  - [Baseline: imputations](#inspect-hosp-dist)  
- [Second approach: all waves](#approach-2)  

In [1]:
import gc
import re

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report


from src import features_registry, data_utils

In [2]:
file_path = './data/H_MHAS_c2.dta'
raw_df = pd.read_stata(file_path)

print(f"Raw_df has shape {raw_df.shape}")

Raw_df has shape (26839, 5241)


<a id='data-reshaping'></a>
### Reshaping the data 

<small>[Back to top](#sections)</small>

In [3]:
# Get selected features 
selected_features_suffixes = features_registry.selected_features

target_suffix = 'hosp1y'

# Get prefixed target variables
waves = range(1, 6)
target_variables = data_utils.join_prefix_suffix(waves, target_suffix)

# Get stacked df
stacked_df = data_utils.stack_df(raw_df, target_variables, selected_features_suffixes, rural=False)

# Verify that the row count grew by the expected factor (2w = 10)
print(f"New dataframe shape: {stacked_df.shape}")


# Release raw_df from memory
del raw_df
gc.collect()


New dataframe shape: (268390, 37)


5268

<a id='removals'></a>
### Removal of proxies, respondents under 50 and missing values in target

<small>[Back to top](#sections)</small>



In [4]:
#  <---- Inspect unique values for 'proxy' and keep only non-proxy respondents (rows with a missing 'proxy' value are dropped as well) ---->

print(stacked_df['proxy'].value_counts(dropna=False))
stacked_df.drop(index=stacked_df[stacked_df['proxy'] != '0.Not proxy'].index, inplace=True)
print(stacked_df['proxy'].value_counts(dropna=False))
print(f"stacked_df now has {stacked_df.shape[0]} rows")


proxy
NaN            143790
0.Not proxy    115618
1.Proxy          8982
Name: count, dtype: int64
proxy
0.Not proxy    115618
Name: count, dtype: int64
stacked_df now has 115618 rows


In [5]:
# <---- Inspect missing values for 'hospitalized' and remove them ---->

print(stacked_df['hospitalized'].value_counts(dropna=False))
stacked_df.dropna(subset=['hospitalized'], inplace=True)
print(stacked_df['hospitalized'].value_counts(dropna=False))
print(f"stacked_df now has {stacked_df.shape[0]} rows")

hospitalized
0.No     102518
1.Yes     12915
NaN         185
Name: count, dtype: int64
hospitalized
0.No     102518
1.Yes     12915
Name: count, dtype: int64
stacked_df now has 115433 rows


In [6]:
# <---- Inspect respondents under 50 and remove them ---->

unique_agey = sorted(stacked_df['agey'].unique(), key=lambda x: (not pd.isna(x), x))
print(unique_agey)
stacked_df.dropna(subset=['agey'], inplace=True)  # Remove NaN values
stacked_df.drop(index=stacked_df[stacked_df['agey'] < 50].index, inplace=True)  # Remove values < 50
unique_agey = sorted(stacked_df['agey'].unique(), key=lambda x: (not pd.isna(x), x))
print(unique_agey)

print(f"stacked_df now has {stacked_df.shape[0]} rows")

[nan, 16.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 94.0, 95.0, 96.0, 97.0, 98.0, 99.0, 100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 110.0, 112.0]
[50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 94.0, 95.0, 96.0, 97.0, 98.0, 99.0, 100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 110.0, 112.0]
stacked_df now has 106537 rows
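The three removals above can also be expressed as a single boolean mask. The sketch below runs on a made-up five-row frame (`toy_df` and its values are illustrative, not taken from the dataset) and is meant only to show how each condition treats missing values.

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the three columns used by the filters above
toy_df = pd.DataFrame({
    "proxy": ["0.Not proxy", "1.Proxy", np.nan, "0.Not proxy", "0.Not proxy"],
    "hospitalized": ["0.No", "1.Yes", "0.No", np.nan, "1.Yes"],
    "agey": [63.0, 70.0, 55.0, 81.0, 47.0],
})

keep = (
    (toy_df["proxy"] == "0.Not proxy")   # drops proxies and rows with missing proxy
    & toy_df["hospitalized"].notna()     # drops rows without a target value
    & (toy_df["agey"] >= 50)             # drops under-50s; NaN ages compare as False
)
filtered_toy_df = toy_df[keep].reset_index(drop=True)

print(len(filtered_toy_df))
```

Only the first toy row survives: each of the other four fails exactly one of the conditions.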


<a id='key-dists'></a>
### Key distributions 

<small>[Back to top](#sections)</small>

In [None]:
# <---- Inspect target imbalances ---->

hospitalized_proportions = stacked_df['hospitalized'].value_counts(normalize=True)
print(f'Hospitalized proportions across all waves: {hospitalized_proportions}')

hospitalized_proportions_by_wave = (
    stacked_df.groupby('wave')['hospitalized']
    .value_counts(normalize=True)
    .unstack()
)
print(f'Hospitalized proportions by wave: {hospitalized_proportions_by_wave}')

Hospitalized proportions across all waves: hospitalized
0.No     0.886368
1.Yes    0.113632
Name: proportion, dtype: float64
Hospitalized proportions by wave: hospitalized      0.No     1.Yes
wave                            
1             0.907720  0.092280
2             0.893766  0.106234
3             0.895553  0.104447
4             0.869383  0.130617
5             0.866445  0.133555


In [None]:
# <---- Inspect gender proportions across waves ---->

gender_proportions_by_wave = (
    stacked_df.groupby('wave')['gender']
    .value_counts(normalize=True)
    .unstack()
)
print(f'Gender proportions by wave: {gender_proportions_by_wave}')

Gender proportions by wave: gender     1.Man   2.Woman
wave                      
1       0.496927  0.503073
2       0.474486  0.525514
3       0.463969  0.536031
4       0.461626  0.538374
5       0.453300  0.546700


In [None]:
# <---- Inspect gender proportions by wave and hospitalization status ---->

hospitalized_gender_proportions_by_wave = (
    stacked_df.groupby(['wave', 'hospitalized'])['gender']
    .value_counts(normalize=True)
    .unstack()
)
print(hospitalized_gender_proportions_by_wave)



gender                1.Man   2.Woman
wave hospitalized                    
1    0.No          0.501693  0.498307
     1.Yes         0.450052  0.549948
2    0.No          0.479895  0.520105
     1.Yes         0.428983  0.571017
3    0.No          0.470340  0.529660
     1.Yes         0.409340  0.590660
4    0.No          0.466974  0.533026
     1.Yes         0.426029  0.573971
5    0.No          0.456380  0.543620
     1.Yes         0.433322  0.566678


In [16]:
# <---- Inspect age distributions by wave and gender ---->

age_distribution_by_gender_wave = stacked_df.groupby(['wave', 'gender'])['agey'].describe()
print(age_distribution_by_gender_wave)

                count       mean       std   min   25%   50%   75%    max
wave gender                                                              
1    1.Man    10350.0  61.853623  9.236059  50.0  54.0  60.0  68.0  105.0
     2.Woman  10478.0  61.106509  8.929668  50.0  54.0  59.0  67.0   99.0
2    1.Man     9308.0  63.969059  8.982577  50.0  57.0  62.0  70.0  107.0
     2.Woman  10309.0  62.129887  8.997397  50.0  55.0  60.0  68.0  103.0
3    1.Man    10559.0  65.247372  9.190321  50.0  58.0  64.0  71.0  110.0
     2.Woman  12199.0  63.440200  9.068684  50.0  56.0  62.0  69.0  112.0
4    1.Man    10129.0  66.935235  9.090652  50.0  60.0  67.0  73.0  101.0
     2.Woman  11813.0  64.908321  9.162998  50.0  58.0  64.0  71.0  100.0
5    1.Man     9697.0  66.570279  9.920810  50.0  58.0  67.0  74.0  101.0
     2.Woman  11695.0  64.546131  9.787528  50.0  56.0  64.0  71.0  102.0


<a id='destringify-cat-vals'></a>
### Destringify values of categorical variables

<small>[Back to top](#sections)</small>

In [7]:
# <---- Inspect values of categorical variables ---->

categorical_columns = stacked_df.select_dtypes(include=['object', 'category']).columns

column_summary_dict = {
    col: {
        'Type': stacked_df[col].dtype.name,
        'Unique values': stacked_df[col].astype(str).unique().tolist()
    }
    for col in categorical_columns    
}

import json
print(json.dumps(column_summary_dict, indent=4))


del categorical_columns
gc.collect()

{
    "hospitalized": {
        "Type": "object",
        "Unique values": [
            "0.No",
            "1.Yes"
        ]
    },
    "gender": {
        "Type": "object",
        "Unique values": [
            "2.Woman",
            "1.Man"
        ]
    },
    "shlt": {
        "Type": "object",
        "Unique values": [
            "4.Fair",
            "5.Poor",
            "2.Very good",
            "3.Good",
            "1.Excellent",
            "nan"
        ]
    },
    "hltc": {
        "Type": "object",
        "Unique values": [
            "4.Somewhat worse",
            "3.More or less the same",
            "2.Somewhat better",
            "5.Much worse",
            "1.Much better",
            "nan"
        ]
    },
    "mobilseva": {
        "Type": "object",
        "Unique values": [
            "0.No",
            "1.Yes",
            "nan"
        ]
    },
    "diabe": {
        "Type": "object",
        "Unique values": [
            "0.no",
            "1.y

33

In [8]:
# <---- Extract numeric element of string and generate a mapping ---->

# Function to extract numbers from categorical values (like "1.Yes" → 1)
def extract_number(value):
    if isinstance(value, str):
        match = re.search(r'\d+', value)  # Find the first number
        return float(match.group()) if match else value  # Return the number if found
    return float(value) if isinstance(value, int) else value  # Keep NaN unchanged

# Step 1: Generate mappings
mappings = {}

for col, info in column_summary_dict.items():
    unique_values = info["Unique values"]
    
    # Convert values to numbers if possible
    mapping = {val: extract_number(val) for val in unique_values}
    
    # Ensure NaN is preserved correctly
    mapping["nan"] = np.nan  
    
    # Store mapping
    mappings[col] = mapping

import json
print(json.dumps(mappings, indent=4))


{
    "hospitalized": {
        "0.No": 0.0,
        "1.Yes": 1.0,
        "nan": NaN
    },
    "gender": {
        "2.Woman": 2.0,
        "1.Man": 1.0,
        "nan": NaN
    },
    "shlt": {
        "4.Fair": 4.0,
        "5.Poor": 5.0,
        "2.Very good": 2.0,
        "3.Good": 3.0,
        "1.Excellent": 1.0,
        "nan": NaN
    },
    "hltc": {
        "4.Somewhat worse": 4.0,
        "3.More or less the same": 3.0,
        "2.Somewhat better": 2.0,
        "5.Much worse": 5.0,
        "1.Much better": 1.0,
        "nan": NaN
    },
    "mobilseva": {
        "0.No": 0.0,
        "1.Yes": 1.0,
        "nan": NaN
    },
    "diabe": {
        "0.no": 0.0,
        "1.yes": 1.0,
        "nan": NaN
    },
    "hrtatte": {
        "0.no": 0.0,
        "1.yes": 1.0,
        "nan": NaN
    },
    "stroke": {
        "0.no": 0.0,
        "1.yes": 1.0,
        "nan": NaN
    },
    "breath_m": {
        "0.no": 0.0,
        "1.yes": 1.0,
        "nan": NaN
    },
    "doctor1y": {


In [9]:
# <---- Apply mappings while preserving 'category' type of variables ---->
for col, mapping in mappings.items():
    if col in stacked_df.columns:
        stacked_df[col] = stacked_df[col].map(mapping).astype('category')

In [20]:
# <---- Save purged dataset ---->
stacked_df.to_pickle('./data/stacked_purged_recast_df.pickle')
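One caveat about the mapping step above: the keys of `mappings` are the original string labels, so the mapping is meant to be applied exactly once. The toy series below (values invented for the illustration) sketches what happens if it is applied again to a column that has already been recast.

```python
import numpy as np
import pandas as pd

raw = pd.Series(["0.No", "1.Yes", np.nan, "1.Yes"])
mapping = {"0.No": 0.0, "1.Yes": 1.0, "nan": np.nan}

# First pass: string labels -> numeric codes, stored as 'category'
recast = raw.map(mapping).astype("category")

# Second pass on the already-recast column: its values are now floats,
# so none of the string keys match and every value becomes NaN
recast_twice = recast.map(mapping)

print(recast.tolist())
print(recast_twice.isna().all())
```

Because the pickled dataset already stores the recast columns, later cells that load it should not need to call `.map(mapping)` again.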

<a id='approach-1'></a>
### First approach: waves 3, 4 and 5

<small>[Back to top](#sections)</small>

In [2]:
### <---- Load purged dataset - CAN SKIP EXECUTION OF ALL PREVIOUS CELLS EXCEPT IMPORT CELL ----> 

file_path = './data/stacked_purged_recast_df.pickle'
stacked_df = pd.read_pickle(file_path)


In [None]:
# Select all rows in wave 5 and all rows from waves 3 and 4 where 'hospitalized' = 1

wave_5_df = stacked_df[stacked_df['wave'] == 5]

waves_3_4_hosp_df = stacked_df[
    (stacked_df['wave'].isin([3, 4])) & (stacked_df['hospitalized'] == 1)
]

filtered_df = pd.concat([wave_5_df, waves_3_4_hosp_df])

filtered_df = filtered_df.reset_index(drop=True)


hospitalized_counts_filtered_df = filtered_df['hospitalized'].value_counts(normalize=True)
print(hospitalized_counts_filtered_df)


hospitalized
0.0    0.604908
1.0    0.395092
Name: proportion, dtype: float64


<a id='baseline-no-imputations'></a>
#### Baseline: no imputations

<small>[Back to top](#sections)</small>

In [None]:
# --------------------------
# 1) Categorical dtypes
# --------------------------
# The categorical columns were already mapped to numeric codes and cast to
# 'category' before the dataset was pickled. Re-applying `mappings` here would
# turn every value into NaN (its keys are the original string labels), so no
# further conversion is done.

# ----------------------
# 2) Define X and y
# ----------------------
target_column = 'hospitalized'
drop_columns = ['id', 'proxy', 'wave']

X = filtered_df.drop(columns=[target_column] + drop_columns)
y = filtered_df[target_column].astype('int')  # Keep the target as int

# ----------------------
# 3) Train/Test Split
# ----------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -----------------------------------
# 4) Initialize XGBoost with enable_categorical
# -----------------------------------
model = xgb.XGBClassifier(
    use_label_encoder=False,
    eval_metric='logloss',
    enable_categorical=True,  
    tree_method='hist'        
)

model.fit(X_train, y_train)

# ----------------------
# 5) Make Predictions
# ----------------------
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probability for ROC AUC

# ----------------------
# 6) Evaluate
# ----------------------
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
class_report = classification_report(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")
print(f"ROC AUC Score: {roc_auc:.4f}")
print("\nClassification Report:\n", class_report)




Model Accuracy: 0.8001
ROC AUC Score: 0.8399

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.92      0.87      3737
           1       0.73      0.52      0.61      1590

    accuracy                           0.80      5327
   macro avg       0.77      0.72      0.74      5327
weighted avg       0.79      0.80      0.79      5327
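The f1-score column in the report above is the harmonic mean of precision and recall, which is why the minority class scores well below the overall accuracy. A minimal check, using the rounded values printed in the report (so the results are only good to about two decimals) and a helper name, `f1_score_from`, chosen for this example:

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall taken from the classification report
f1_class_0 = f1_score_from(0.82, 0.92)
f1_class_1 = f1_score_from(0.73, 0.52)

# The macro average weights both classes equally, regardless of support
macro_f1 = (f1_class_0 + f1_class_1) / 2

print(f"{f1_class_0:.2f} {f1_class_1:.2f} {macro_f1:.2f}")
```

The three values reproduce the 0.87, 0.61 and 0.74 shown in the report, with the low recall of class 1 (0.52) doing most of the damage.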



<a id='inspect-hosp-dist'></a>
#### Baseline: imputations 

<small>[Back to top](#sections)</small>

In [3]:
### <---- Load purged dataset - CAN SKIP EXECUTION OF ALL PREVIOUS CELLS EXCEPT IMPORT CELL ----> 

file_path = './data/stacked_purged_recast_df.pickle'
stacked_df = pd.read_pickle(file_path)

In [11]:
# <---- Inspect proportions of missing values in selected features ---->
missing_data = stacked_df.isna().sum().to_frame(name='NaNs count')
missing_data['NaNs Proportion'] = (missing_data['NaNs count'] / len(stacked_df))

missing_data = missing_data.sort_values(by='NaNs Proportion', ascending=False)

print(f"Total row count for reference: {stacked_df.shape[0]}")
print(missing_data)

Total row count for reference: 106537
              NaNs count  NaNs Proportion
prost              57703         0.541624
papsm              50742         0.476285
mammog             50227         0.471451
breast             50218         0.471367
bmi                18201         0.170842
rxhibp               807         0.007575
rxdiab               796         0.007472
diabe                770         0.007228
hibpe                735         0.006899
cancre               696         0.006533
rxresp               694         0.006514
respe                676         0.006345
rxhrtat              661         0.006204
hrtatte              656         0.006157
rxstrok              651         0.006111
stroke               643         0.006035
work                 474         0.004449
drinkd               430         0.004036
cholst               301         0.002825
vigact               209         0.001962
doctim1y             169         0.001586
doctor1y             169         0.001

In [None]:
# <---- Imputation of missing values for bmi ---->

# Proportion of missing BMI values by gender
missing_proportion = stacked_df.groupby('gender', observed=False)['bmi'].apply(lambda x: x.isna().mean())

# Print numerical summary
print(missing_proportion.to_frame(name="Proportion of BMI Missing"))

# Define 5-year age bins from 50 to 119 (the oldest respondent is 112)
age_bins = list(range(50, 121, 5))  # Bin edges 50, 55, ..., 120
age_labels = [f"{age_bins[i]}-{age_bins[i+1]-1}" for i in range(len(age_bins)-1)]

# Create a new age group column
stacked_df['age_group'] = pd.cut(stacked_df['agey'], bins=age_bins, labels=age_labels, right=False)

# Compute BMI missing proportion by age group
missing_by_age = stacked_df.groupby('age_group', observed=False)['bmi'].apply(lambda x: x.isna().mean())

# Display results
print("BMI Missingness by Age Group:")
print(missing_by_age)

# Group by both age group and gender, then calculate the proportion of missing BMI values
missing_bmi_by_age_gender = stacked_df.groupby(['age_group', 'gender'], observed=False)['bmi'].apply(lambda x: x.isna().mean()).unstack()

# Display the results
print("BMI Missingness by Age Group and Gender:")
print(missing_bmi_by_age_gender)

# Compute median BMI by 5-year age groups and gender
median_bmi_by_age_gender = stacked_df.groupby(['age_group', 'gender'], observed=False)['bmi'].median().unstack()

# Display results
print("Median BMI by 5-year Age Groups and Gender:")
print(median_bmi_by_age_gender)

#  Step 1: Ensure 'age_group' column exists (reuses the bins and labels defined above)
stacked_df['age_group'] = pd.cut(stacked_df['agey'], bins=age_bins, labels=age_labels, right=False)

#  Step 2: Compute median BMI for (age_group, gender)
bmi_medians = stacked_df.groupby(['age_group', 'gender'], observed=False)['bmi'].median()

#  Step 3: Identify and handle small sample sizes
min_sample_size = 10  # Define minimum required count per group
group_sizes = stacked_df.groupby(['age_group', 'gender'], observed=False)['bmi'].count()

# If a group has fewer than 'min_sample_size' observations, borrow the median of the previous (younger) age group
for (age_group, gender), count in group_sizes.items():
    if count < min_sample_size:
        # Find the previous group; split on '-' so three-digit labels such as '100-104' parse correctly
        lower_bound = int(str(age_group).split('-')[0])
        prev_group = f"{lower_bound - 5}-{lower_bound - 1}"
        if prev_group in bmi_medians.index.get_level_values(0):
            bmi_medians.loc[(age_group, gender)] = bmi_medians.loc[(prev_group, gender)]

#  Step 4: Define function to impute missing BMI
def impute_bmi(row):
    if pd.isna(row['bmi']):  # If BMI is missing
        return bmi_medians.get((row['age_group'], row['gender']), np.nan)  # Lookup median value
    return row['bmi']  # Keep existing value if not missing

#  Step 5: Apply median imputation
stacked_df['bmi'] = stacked_df.apply(impute_bmi, axis=1)

#  Step 6: Verify results
print("BMI Missing Values AFTER Imputation:", stacked_df['bmi'].isna().sum())
print(stacked_df[['age_group', 'gender', 'bmi']].sample(10))  # Sample check

stacked_df.drop(columns=['age_group'], inplace=True)


        Proportion of BMI Missing
gender                           
1.0                      0.113822
2.0                      0.221351
BMI Missingness by Age Group:
age_group
50-54      0.141942
55-59      0.154147
60-64      0.161497
65-69      0.167431
70-74      0.177162
75-79      0.211681
80-84      0.249365
85-89      0.268004
90-94      0.342237
95-99      0.440945
100-104    0.409091
105-109    0.333333
110-114    0.000000
115-119         NaN
Name: bmi, dtype: float64
BMI Missingness by Age Group and Gender:
gender          1.0       2.0
age_group                    
50-54      0.084261  0.181320
55-59      0.096646  0.200000
60-64      0.100380  0.211358
65-69      0.108093  0.227247
70-74      0.118395  0.240968
75-79      0.148237  0.282402
80-84      0.190321  0.313466
85-89      0.215667  0.327473
90-94      0.288401  0.403571
95-99      0.312500  0.571429
100-104    0.384615  0.444444
105-109    0.333333       NaN
110-114    0.000000  0.000000
115-119         NaN       N



BMI Missing Values AFTER Imputation: 0
       age_group gender        bmi
38974      75-79    2.0  32.444443
208704     55-59    2.0  28.000000
101661     70-74    2.0  33.163265
42584      70-74    1.0  28.303852
105160     75-79    2.0  26.562500
44147      55-59    1.0  26.851852
126969     70-74    1.0  21.203104
213018     55-59    2.0  31.980541
212991     55-59    1.0  22.230988
143628     80-84    1.0  23.388685



In [None]:
# <---- Rerun previous xgboost with imputed bmi ---->

# --------------------------
# 1) Rebuild the modelling subset
# --------------------------
# The categorical columns are already numeric 'category' columns, so no
# conversion is needed. The waves 3-5 subset is rebuilt from the imputed
# dataframe so that the model actually sees the imputed BMI values.
wave_5_df = stacked_df[stacked_df['wave'] == 5]
waves_3_4_hosp_df = stacked_df[
    (stacked_df['wave'].isin([3, 4])) & (stacked_df['hospitalized'] == 1)
]
filtered_df = pd.concat([wave_5_df, waves_3_4_hosp_df]).reset_index(drop=True)

# ----------------------
# 2) Define X and y
# ----------------------
target_column = 'hospitalized'
drop_columns = ['id', 'proxy', 'wave']

X = filtered_df.drop(columns=[target_column] + drop_columns)
y = filtered_df[target_column].astype('int')  # Keep the target as int

# ----------------------
# 3) Train/Test Split
# ----------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -----------------------------------
# 4) Initialize XGBoost with enable_categorical
# -----------------------------------
model = xgb.XGBClassifier(
    use_label_encoder=False,
    eval_metric='logloss',
    enable_categorical=True,  
    tree_method='hist'        
)

model.fit(X_train, y_train)

# ----------------------
# 5) Make Predictions
# ----------------------
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probability for ROC AUC

# ----------------------
# 6) Evaluate
# ----------------------
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
class_report = classification_report(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")
print(f"ROC AUC Score: {roc_auc:.4f}")
print("\nClassification Report:\n", class_report)



Model Accuracy: 0.8031
ROC AUC Score: 0.8403

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.92      0.87      3737
           1       0.73      0.54      0.62      1590

    accuracy                           0.80      5327
   macro avg       0.78      0.73      0.74      5327
weighted avg       0.80      0.80      0.79      5327



In [14]:
# <---- Save purged and imputed dataset ---->
stacked_df.to_pickle('./data/stacked_purged_recast_imputed_df.pickle')

<a id='approach-2'></a>
### Second approach: all waves

<small>[Back to top](#sections)</small>

In [None]:
### <---- Load purged & imputed dataset ----> 

file_path = './data/stacked_purged_recast_imputed_df.pickle'
stacked_df = pd.read_pickle(file_path)

In [3]:
# Select all rows in wave 5 and all rows from waves 1, 2, 3 and 4 where 'hospitalized' = 1

wave_5_df = stacked_df[stacked_df['wave'] == 5]

waves_1_2_3_4_hosp_df = stacked_df[
    (stacked_df['wave'].isin([1, 2, 3, 4])) & (stacked_df['hospitalized'] == 1)
]

filtered_df = pd.concat([wave_5_df, waves_1_2_3_4_hosp_df])

filtered_df = filtered_df.reset_index(drop=True)


hospitalized_counts_filtered_df = filtered_df['hospitalized'].value_counts(normalize=True)
print(hospitalized_counts_filtered_df)

hospitalized
0.0    0.604908
1.0    0.395092
Name: proportion, dtype: float64
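The class balance each approach should produce can be cross-checked by hand from the per-wave row counts and hospitalization rates printed under Key distributions. The sketch below is a back-of-the-envelope calculation, not part of the pipeline; the helper name `expected_positive_share` is introduced here for illustration, and the positive counts are recovered by rounding.

```python
# Rows per wave after the removals (men + women counts from the age table)
rows_per_wave = {1: 10350 + 10478, 2: 9308 + 10309, 3: 10559 + 12199,
                 4: 10129 + 11813, 5: 9697 + 11695}

# Share of hospitalized rows per wave (from the per-wave proportions table)
hosp_rate = {1: 0.092280, 2: 0.106234, 3: 0.104447, 4: 0.130617, 5: 0.133555}


def expected_positive_share(donor_waves, base_wave=5):
    """Share of hospitalized rows after adding donor-wave positives to the base wave."""
    positives = {w: round(rows_per_wave[w] * hosp_rate[w]) for w in rows_per_wave}
    added = sum(positives[w] for w in donor_waves)
    return (positives[base_wave] + added) / (rows_per_wave[base_wave] + added)


share_first = expected_positive_share([3, 4])
share_second = expected_positive_share([1, 2, 3, 4])

print(f"First approach:  {share_first:.3f}")
print(f"Second approach: {share_second:.3f}")
```

Under these figures the first approach comes out at about 0.304 hospitalized and the second at about 0.395, which is in line with the test-set supports in the two classification reports (1590 of 5327 and 2354 of 6129).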


In [None]:
# <---- Run xgboost with all waves ---->

# --------------------------
# 1) Categorical dtypes
# --------------------------
# The categorical columns were already mapped to numeric codes and cast to
# 'category' before the dataset was pickled. Re-applying `mappings` here would
# turn every value into NaN (its keys are the original string labels), so no
# further conversion is done.

# ----------------------
# 2) Define X and y
# ----------------------
target_column = 'hospitalized'
drop_columns = ['id', 'proxy', 'wave']

X = filtered_df.drop(columns=[target_column] + drop_columns)
y = filtered_df[target_column].astype('int')  # Keep the target as int

# ----------------------
# 3) Train/Test Split
# ----------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -----------------------------------
# 4) Initialize XGBoost with enable_categorical
# -----------------------------------
model = xgb.XGBClassifier(
    use_label_encoder=False,
    eval_metric='logloss',
    enable_categorical=True,  # <--- Tells XGBoost to handle categorical splits
    tree_method='hist'        # or 'gpu_hist'
)

model.fit(X_train, y_train)

# ----------------------
# 5) Make Predictions
# ----------------------
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probability for ROC AUC

# ----------------------
# 6) Evaluate
# ----------------------
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
class_report = classification_report(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")
print(f"ROC AUC Score: {roc_auc:.4f}")
print("\nClassification Report:\n", class_report)



Model Accuracy: 0.7802
ROC AUC Score: 0.8415

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.87      0.83      3775
           1       0.75      0.64      0.69      2354

    accuracy                           0.78      6129
   macro avg       0.77      0.75      0.76      6129
weighted avg       0.78      0.78      0.78      6129



In [14]:
# Extract feature importance
feature_importances = model.feature_importances_

# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Display the feature importance values
print(feature_importance_df)

      Feature  Importance
11   doctim1y    0.069782
1        shlt    0.058321
21    fatigue    0.052362
10   doctor1y    0.049721
29    rxhrtat    0.045892
12     cholst    0.044773
4   mobilseva    0.042787
22      swell    0.038335
15      papsm    0.036367
6     hrtatte    0.033358
14     mammog    0.032489
3     adltot6    0.032237
2        hltc    0.029073
18     drinkd    0.026333
30    rxstrok    0.026296
31       work    0.026251
26     rxresp    0.024965
23      hibpe    0.024803
24     cancre    0.024520
28     rxdiab    0.023064
25      respe    0.022942
9         bmi    0.022780
16      prost    0.022231
20     painfr    0.021277
13     breast    0.021158
32       agey    0.020684
27     rxhibp    0.019580
19     smoken    0.018915
17     vigact    0.018639
5       diabe    0.018595
7      stroke    0.018438
8    breath_m    0.016872
0      gender    0.016162

