# Antibiotic resistance prediction

## Project Introduction
This project predicts antibiotic resistance from structured electronic health record (EHR) data in the Antibiotic Resistance Microbiology Dataset (ARMD). The task is binary classification: given clinical, demographic, microbiological, and treatment-related features, predict whether a bacterial isolate is susceptible (S) or resistant (R) to a given antibiotic. Such a model is intended to support empirical antibiotic selection and, in turn, efforts to combat antimicrobial resistance in clinical settings.

In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

final_armd_ds = 'ARMD_Dataset/selected_features_output.parquet'

df = pd.read_parquet(final_armd_ds)
print(df.shape)



(2184195, 27)


### Separate target + binary encoding

In [5]:
target_col = 'susceptibility_label'
# Binary encoding: S (susceptible) -> 0, R (resistant) -> 1; any other label becomes NaN
df[target_col] = df[target_col].map({'S': 0, 'R': 1})
y = df[target_col]
X = df.drop(columns=[target_col])
print('y: ',y.shape)
print('X: ',X.shape)

y:  (2184195,)
X:  (2184195, 26)
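
`Series.map` returns NaN for any label that is not a key of the dictionary, so values other than `'S'` and `'R'` would silently become missing targets. The following is a minimal sketch of a check for that, run on a small made-up series rather than the ARMD data; the label `'I'` is only a stand-in for an unexpected value and is not known to occur in the dataset.

```python
import pandas as pd

# Toy labels for illustration only; 'I' stands in for any unexpected value.
labels = pd.Series(['S', 'R', 'I', 'S'])
encoded = labels.map({'S': 0, 'R': 1})

# Rows whose label was not in the mapping come back as NaN.
unmapped = labels[encoded.isna()]
print(unmapped.value_counts())

# Keep only rows with a valid binary target.
clean = encoded.dropna().astype(int)
print(clean.tolist())
```

If the count of unmapped labels is zero on the real data, the mapping is safe to use as is.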


### Identify column types

In [7]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2184195 entries, 0 to 2184194
Data columns (total 26 columns):
 #   Column                          Dtype  
---  ------                          -----  
 0   organism_x                      string 
 1   antibiotic_x                    string 
 2   resistant_time_to_culturetime   float64
 3   age                             string 
 4   gender                          string 
 5   adi_score                       string 
 6   adi_state_rank                  string 
 7   median_wbc                      string 
 8   median_neutrophils              string 
 9   median_lymphocytes              string 
 10  median_hgb                      string 
 11  median_plt                      string 
 12  median_na                       string 
 13  median_hco3                     string 
 14  median_bun                      string 
 15  median_cr                       string 
 16  median_lactate                  string 
 17  median_procalcitonin       

## Column Categorization
- Group the feature columns by type: categorical, numeric (vitals and timing), numeric (lab medians), and ordinal
- Combine the two numeric groups into a single list

In [9]:
true_categorical_cols = ['organism_x', 'antibiotic_x', 'gender', 'medication_category']
numeric_cols = ['resistant_time_to_culturetime', 'median_heartrate', 'median_resprate',
               'median_temp', 'median_sysbp', 'median_diasbp',
               'medication_time_to_culturetime', 'nursing_home_visit_culture']
numerical_med_cols = ['median_wbc', 'median_neutrophils', 'median_lymphocytes',
                     'median_hgb', 'median_plt', 'median_na', 'median_hco3',
                     'median_bun', 'median_cr', 'median_lactate', 'median_procalcitonin']
ordinal_cols = ['age', 'adi_score', 'adi_state_rank']


all_numerical_cols = numeric_cols + numerical_med_cols


all_columns = true_categorical_cols + all_numerical_cols + ordinal_cols
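
The four lists above name 26 columns in total, which matches the width of `X`. A hand-maintained grouping like this can drift from the data, so a quick consistency check may be useful. The helper below is a hypothetical sketch (the name `check_column_groups` is not from the notebook) and is shown on toy column names.

```python
def check_column_groups(columns, groups):
    """Return (missing, unassigned, duplicated) column names for the given groups."""
    listed = [c for cols in groups.values() for c in cols]
    missing = sorted(set(listed) - set(columns))      # listed but absent from the frame
    unassigned = sorted(set(columns) - set(listed))   # in the frame but in no group
    duplicated = sorted({c for c in listed if listed.count(c) > 1})
    return missing, unassigned, duplicated

# Toy example; with the notebook's data this would be called with X.columns.
cols = ['organism_x', 'age', 'median_wbc', 'extra_col']
groups = {
    'categorical': ['organism_x'],
    'ordinal': ['age'],
    'numeric': ['median_wbc', 'median_temp'],
}
missing, unassigned, duplicated = check_column_groups(cols, groups)
print(missing, unassigned, duplicated)
```

All three lists being empty would indicate that every column is assigned to exactly one group.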


## Apply CatBoost Encoding

#### Data Splitting

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("\nFinal Training Set Shape:", X_train.shape)
print("Final Test Set Shape:", X_test.shape)


Final Training Set Shape: (1747356, 26)
Final Test Set Shape: (436839, 26)
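
Passing `stratify=y` asks `train_test_split` to keep the class proportions the same in both partitions. A small sketch on synthetic labels (not the ARMD data) shows how this can be confirmed by comparing the share of the positive class on each side.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 60 negatives and 40 positives.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both shares should be close to the overall 0.40.
print(f"train positive share: {y_tr.mean():.2f}")
print(f"test positive share:  {y_te.mean():.2f}")
```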


#### Ensure proper data types

In [14]:
X_train[true_categorical_cols] = X_train[true_categorical_cols].astype(str)
X_test[true_categorical_cols] = X_test[true_categorical_cols].astype(str)


#### Handle missing medication categories

In [16]:

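# astype(str) above may have turned missing values into 'nan'/'<NA>' strings, so map those too
fill = {'nan': 'No_medication_recorded', '<NA>': 'No_medication_recorded'}
X_train['medication_category'] = X_train['medication_category'].replace(fill).fillna('No_medication_recorded')
X_test['medication_category'] = X_test['medication_category'].replace(fill).fillna('No_medication_recorded')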


#### Verify no more missing values

In [18]:

print("Missing values after treatment:")
print(X_train[true_categorical_cols].isnull().sum())

Missing values after treatment:
organism_x             0
antibiotic_x           0
gender                 0
medication_category    0
dtype: int64
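
Note that `isnull()` only detects true missing values. Depending on the pandas version, a cast to `str` can turn a missing entry into ordinary text, which this check then counts as present. A sketch on toy data of one way to sidestep the issue is to fill before casting; the medication names here are invented for illustration.

```python
import pandas as pd

# Toy column with one missing entry.
s = pd.Series(['antibiotic_a', None, 'antibiotic_b'], dtype='string')

# Fill first, then cast: the placeholder replaces the missing entry
# while it is still detectable as missing.
filled = s.fillna('No_medication_recorded').astype(str)
print(filled.tolist())
```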


#### Apply CatBoost Encoding
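Initialize the CatBoost encoder with the following settings:
- `sigma=0.1`: Gaussian noise added to the training encodings to reduce overfitting (test encodings are not perturbed)
- `a=1.0`: additive smoothing parameter

Fit the encoder on the training set only and apply the fitted encoder to the test set, so that no test labels leak into the encoding.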

In [20]:
from category_encoders import CatBoostEncoder
cbe = CatBoostEncoder(
    cols=true_categorical_cols,
    random_state=42,
    sigma=0.1,  # noise
    a=1.0       # Smoothing
)

X_train_encoded = cbe.fit_transform(X_train[true_categorical_cols], y_train)
X_test_encoded = cbe.transform(X_test[true_categorical_cols])
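
To make the encoder's behaviour on training data more concrete, the sketch below reimplements the ordered target statistic in plain pandas. It follows the formula as documented for `category_encoders` (an assumption worth checking against the installed version): each row is encoded from the targets of earlier rows in the same category, smoothed toward the global target mean by `a`. The function name `ordered_target_stat` is hypothetical, and the noise controlled by `sigma` is left out.

```python
import pandas as pd

def ordered_target_stat(cat, y, a=1.0):
    """Encode each row from the targets of *earlier* rows in the same category."""
    prior = y.mean()               # global target mean used as the prior
    grouped = y.groupby(cat)
    prev_sum = grouped.cumsum() - y  # sum of targets of earlier rows in the category
    prev_cnt = grouped.cumcount()    # number of earlier rows in the category
    return (prev_sum + prior * a) / (prev_cnt + a)

cat = pd.Series(['a', 'a', 'b', 'a'])
y = pd.Series([1, 0, 1, 1])
print(ordered_target_stat(cat, y).round(4).tolist())
# → [0.75, 0.875, 0.75, 0.5833]
```

Because each row sees a different history, rows of the same category receive different values even before noise is applied. This is consistent with the very large number of unique encoded values reported in the verification output further down.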

#### Create final feature sets

In [54]:
final_features = true_categorical_cols + all_numerical_cols + ordinal_cols
X_train_final = pd.concat([
    X_train_encoded,
    X_train[all_numerical_cols + ordinal_cols]
], axis=1)[final_features]  # Ensure consistent column order

X_test_final = pd.concat([
    X_test_encoded,
    X_test[all_numerical_cols + ordinal_cols]
], axis=1)[final_features]
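
The ordinal columns are concatenated unchanged, so they are still stored as strings at this point. If a column such as `age` holds ordered text bins, one option is scikit-learn's `OrdinalEncoder` with an explicit category order. The bin labels below are hypothetical; the real ones would have to be read from the data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical age bins, ordered youngest to oldest.
age_order = ['18-24 years', '25-34 years', '35-44 years']

enc = OrdinalEncoder(
    categories=[age_order],
    handle_unknown='use_encoded_value',
    unknown_value=-1,   # bins not listed above are mapped to -1
)

train_age = pd.DataFrame({'age': ['25-34 years', '18-24 years', '35-44 years']})
test_age = pd.DataFrame({'age': ['35-44 years', '90+ years']})

enc.fit(train_age)
test_codes = enc.transform(test_age).ravel()
print(test_codes.tolist())
# → [2.0, -1.0]
```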

#### Final verification

In [51]:
print("\nEncoded values validation:")
for col in true_categorical_cols:
    print(f"\n{col}:")
    print(f"Unique encoded values: {X_train_encoded[col].nunique()}")
    print("Value distribution:")
    print(X_train_encoded[col].describe())

print("\nFinal training set shape:", X_train_final.shape)
print("Final test set shape:", X_test_final.shape)


Encoded values validation:

organism_x:
Unique encoded values: 1745225
Value distribution:
count    1.747356e+06
mean     4.278749e-01
std      5.690025e-02
min      2.743977e-03
25%      4.421465e-01
50%      4.423718e-01
75%      4.424786e-01
max      9.170342e-01
Name: organism_x, dtype: float64

antibiotic_x:
Unique encoded values: 1744373
Value distribution:
count    1.747356e+06
mean     4.279393e-01
std      1.076565e-01
min      6.904201e-03
25%      4.227086e-01
50%      4.229173e-01
75%      4.229990e-01
max      8.954642e-01
Name: antibiotic_x, dtype: float64

gender:
Unique encoded values: 1747340
Value distribution:
count    1.747356e+06
mean     4.278643e-01
std      1.988554e-02
min      1.426868e-01
25%      4.209859e-01
50%      4.211375e-01
75%      4.212362e-01
max      7.140302e-01
Name: gender, dtype: float64

medication_category:
Unique encoded values: 1742488
Value distribution:
count    1.747356e+06
mean     4.278474e-01
std      5.591758e-02
min      3.419803e

#### Replace 'Null' with actual NaN values

In [67]:
print(X_train_final.isnull().sum().sum())  

2026527


In [71]:
import numpy as np
X_train_final = X_train_final.replace('Null', np.nan)
X_test_final = X_test_final.replace('Null', np.nan)
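
Replacing `'Null'` with NaN leaves the affected columns stored as strings and still containing missing values. A minimal sketch of the two follow-up steps, shown on toy frames that stand in for `X_train_final` and `X_test_final`: coerce the columns to numeric, then impute with medians learned from the training data only. Median imputation is an illustrative choice here, not something the notebook specifies.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for X_train_final / X_test_final.
train = pd.DataFrame({'median_wbc': ['7.1', 'Null', '9.3'],
                      'median_na': ['140', '138', 'Null']})
test = pd.DataFrame({'median_wbc': ['Null', '6.0'],
                     'median_na': ['135', 'Null']})
num_cols = ['median_wbc', 'median_na']

# errors='coerce' turns 'Null' (and any other non-numeric text) into NaN.
train[num_cols] = train[num_cols].apply(pd.to_numeric, errors='coerce')
test[num_cols] = test[num_cols].apply(pd.to_numeric, errors='coerce')

# Learn medians on the training data only, then reuse them for the test data.
imputer = SimpleImputer(strategy='median')
train_imp = pd.DataFrame(imputer.fit_transform(train[num_cols]),
                         columns=num_cols, index=train.index)
test_imp = pd.DataFrame(imputer.transform(test[num_cols]),
                        columns=num_cols, index=test.index)
print(test_imp)
```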


## Model implementation

In [60]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import cross_val_score

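# Initialize models
# Note: kernel SVC training time grows at least quadratically with the number of samples,
# so fitting it on the full ~1.7M-row training set may not be practical.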
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "SVM": SVC(probability=True, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42)
}

# Train and evaluate each model
results = {}
for name, model in models.items():
    # Train the model
    model.fit(X_train_final, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_final)
    
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    
    # Perform cross-validation
    cv_scores = cross_val_score(model, X_train_final, y_train, cv=5)
    
    # Store results
    results[name] = {
        'model': model,
        'accuracy': accuracy,
        'classification_report': report,
        'cv_mean_score': cv_scores.mean(),
        'cv_std': cv_scores.std()
    }
    
    # Print results
    print(f"--- {name} ---")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Cross-validation score: {cv_scores.mean():.4f} (±{cv_scores.std():.4f})")
    print(report)
    print("\n")

# Compare model performances
print("\nModel Comparison:")
for name, result in results.items():
    print(f"{name}:")
    print(f"  Test Accuracy: {result['accuracy']:.4f}")
    print(f"  CV Mean Accuracy: {result['cv_mean_score']:.4f} (±{result['cv_std']:.4f})")
    print()

ValueError: could not convert string to float: 'Null'
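
The error shows that the feature matrix still contained the text `'Null'` when the models were fit. The cell numbers suggest a likely cause: the model cell (`In [60]`) appears to have been executed before the replacement cell (`In [71]`). Even after the replacement, the columns would still be strings with missing values, which `LogisticRegression` and `SVC` do not accept. One way to make the order of operations explicit is to bundle preprocessing and model in a scikit-learn `Pipeline`, so that `cross_val_score` refits the preprocessing inside each fold. The sketch below uses synthetic data and a single model; `to_numeric_frame` is a hypothetical helper, and the imputation and scaling choices are assumptions rather than the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def to_numeric_frame(df):
    """Coerce every column to float; non-numeric text such as 'Null' becomes NaN."""
    return df.apply(pd.to_numeric, errors='coerce')

# Synthetic stand-in for the notebook's data: two numeric features stored as text.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y_demo = (x1 + 0.5 * x2 > 0).astype(int)
X_demo = pd.DataFrame({'f1': x1.astype(str), 'f2': x2.astype(str)})
X_demo.loc[::10, 'f2'] = 'Null'  # inject the placeholder seen in the error message

pipe = make_pipeline(
    FunctionTransformer(to_numeric_frame),  # text -> float, 'Null' -> NaN
    SimpleImputer(strategy='median'),       # fill NaN with per-fold training medians
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)

scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(f"CV accuracy: {scores.mean():.3f} (±{scores.std():.3f})")
```

The same pipeline object could replace each bare estimator in the `models` dictionary, with the final step swapped for the model being compared.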