In [51]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt

## Predicting Fetal Health with Tree-Based Models 

## Introduction + Goal

A cardiotocogram (CTG) is a medical test that records a fetus’s heart rate and the uterine contractions of the mother. Features extracted from a CTG often include things like baseline heart rate, variability, accelerations, and decelerations. Doctors use CTG data to assess fetal well-being. The CTG data, along with the doctor’s assessment of the fetus’s health, can also be used to train predictive models for fetal health classification.

The goal of this project is to train three classification models to predict whether a fetus’s health is normal, suspect or pathological based on CTG data. Model performance will be evaluated and the models compared based on accuracy, F1-score, and interpretability. All three classification methods used will be tree-based: a simple Decision Tree, AdaBoost, and Random Forest. A Decision Tree is particularly useful in this case because it can capture non-linear relationships among the cardiotocogram features and provide an interpretable structure for understanding the predictions.

## Data Description + Preprocessing

The features used for model training were automatically extracted from cardiotocogram readings. Expert obstetricians also evaluated each CTG and classified the fetus’s health as normal, suspect, or pathological. Each of the 2126 rows represents a CTG for a mother–fetus pair. 21 columns contain  features extracted from the CTG and the 22nd column contains the experts’ classifications.

Learn more about the dataset [here](https://www.kaggle.com/datasets/andrewmvd/fetal-health-classification).

In [52]:
df = pd.read_csv('fetal_health.csv')
df.head()

Unnamed: 0,baseline value,accelerations,fetal_movement,uterine_contractions,light_decelerations,severe_decelerations,prolongued_decelerations,abnormal_short_term_variability,mean_value_of_short_term_variability,percentage_of_time_with_abnormal_long_term_variability,...,histogram_min,histogram_max,histogram_number_of_peaks,histogram_number_of_zeroes,histogram_mode,histogram_mean,histogram_median,histogram_variance,histogram_tendency,fetal_health
0,120.0,0.0,0.0,0.0,0.0,0.0,0.0,73.0,0.5,43.0,...,62.0,126.0,2.0,0.0,120.0,137.0,121.0,73.0,1.0,2.0
1,132.0,0.006,0.0,0.006,0.003,0.0,0.0,17.0,2.1,0.0,...,68.0,198.0,6.0,1.0,141.0,136.0,140.0,12.0,0.0,1.0
2,133.0,0.003,0.0,0.008,0.003,0.0,0.0,16.0,2.1,0.0,...,68.0,198.0,5.0,1.0,141.0,135.0,138.0,13.0,0.0,1.0
3,134.0,0.003,0.0,0.008,0.003,0.0,0.0,16.0,2.4,0.0,...,53.0,170.0,11.0,0.0,137.0,134.0,137.0,13.0,1.0,1.0
4,132.0,0.007,0.0,0.008,0.0,0.0,0.0,16.0,2.4,0.0,...,53.0,170.0,9.0,0.0,137.0,136.0,138.0,11.0,1.0,1.0


To prepare the data for model training, I'll first extract the column holding the experts' diagnoses and remove it from the training data. The health diagnostic is encoded as 1 for Healthy, 2 for Suspect, and 3 for Pathological. 

In [53]:
target = df['fetal_health'].astype(int) # extract expert classifications 
data = df.drop(['fetal_health'], axis = 1)

In [54]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2126 entries, 0 to 2125
Data columns (total 21 columns):
 #   Column                                                  Non-Null Count  Dtype  
---  ------                                                  --------------  -----  
 0   baseline value                                          2126 non-null   float64
 1   accelerations                                           2126 non-null   float64
 2   fetal_movement                                          2126 non-null   float64
 3   uterine_contractions                                    2126 non-null   float64
 4   light_decelerations                                     2126 non-null   float64
 5   severe_decelerations                                    2126 non-null   float64
 6   prolongued_decelerations                                2126 non-null   float64
 7   abnormal_short_term_variability                         2126 non-null   float64
 8   mean_value_of_short_term_variability  

Since there is no missing data, I'm going to go ahead and split the training data 80/20 into training and testing sets for validation purposes. I'm also going to stratify the split based on the true labels to ensure each set has a similar distribution. 

In [55]:
X_train, X_test, y_train, y_test = train_test_split(data, target, 
                                                    test_size=0.2, 
                                                    stratify = target)

## Model Building

I'm going to build 3 classifiers using different algorithms: decision tree, AdaBoost, and random forest. 

### Decision Tree
A decision tree is a model that makes predictions by recursively splitting data on feature values. Each internal node represents a decision based on a feature (like whether a measurement exceeds a threshold) and each leaf node represents a predicted outcome. Before I can train the model, I'm going to tune its hyperparameters using cross-validation. Specifically, GridSearchCV performs cross-validation for each combination of hyperparameters and selects the best combination based on a performance metric. 

In [56]:
dt = DecisionTreeClassifier()
param_grid = {'max_depth': [None, 3, 5, 8], # values to test 
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [2, 5, 10]}
grid_search = GridSearchCV(
    estimator = dt,
    param_grid = param_grid,
    cv = 5,                 
    scoring = 'f1_macro',  # performance metric 
    n_jobs = -1)
grid_search.fit(X_train, y_train)
print('Best parameters:', grid_search.best_params_)

Best parameters: {'max_depth': None, 'min_samples_leaf': 5, 'min_samples_split': 10}


In [57]:
dt = DecisionTreeClassifier(max_depth = 5, min_samples_leaf = 1, min_samples_split = 20)
dt.fit(X_train, y_train)
dt_preds = dt.predict(X_test)

### AdaBoost 

Boosting is a technique to improve the performance of weak classifiers by iteratively reweighting the training data to focus on misclassified points. AdaBoost combines many weak classifiers (shallow decision trees), adjusting their weights at each step. The final classifier aggregates all of the weak learners through weighted majority voting. Before training the model, I will tune its hyperparameters with GridSearchCV to find the best number of estimators and learning rate.


In [58]:
base_estimator = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1)
ab = AdaBoostClassifier(estimator=base_estimator)
param_grid = {'n_estimators': [50, 100, 150, 250],   
              'learning_rate': [0.01, 0.1, 0.5, 1]}   
grid_search = GridSearchCV(
    estimator = ab,
    param_grid = param_grid,
    cv = 5,
    scoring = 'f1_macro',
    n_jobs = -1)
grid_search.fit(X_train, y_train)
print('Best parameters:', grid_search.best_params_)

Best parameters: {'learning_rate': 1, 'n_estimators': 150}


In [59]:
ab = AdaBoostClassifier(n_estimators = 100, 
                        learning_rate = 0.1)
ab.fit(X_train, y_train)
ab_preds = ab.predict(X_test)

### Random Forest

Random Forests is an algorithm that uses bagging (bootstrap aggregating): training many individual decision trees on different random subsets of the data (bootstrapped samples) and random subsets of features. Each tree makes its own prediction, and the final classification is determined by majority vote across all trees. I am again using GridSearchCV to tune the key hyperparameters of the random forest model.

In [None]:
rf = RandomForestClassifier()
param_grid = {'n_estimators': [200, 300, 400],      
              'max_depth': [None, 10, 15, 20],             
              'min_samples_split': [2, 5, 10],     
              'min_samples_leaf': [1, 2, 4], 
              'max_features': ['sqrt', 'log2', None]}

grid_search = GridSearchCV(
    estimator = rf,
    param_grid = param_grid,
    cv = 5,
    scoring = 'f1_macro',
    n_jobs = -1)

grid_search.fit(X_train, y_train)
print('Best parameters:', grid_search.best_params_)

In [None]:
rf = RandomForestClassifier(n_estimators = 200, 
                            max_depth = 20, 
                            min_samples_split = 10, 
                            min_samples_leaf = 1)
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)

## Model Evaluations

I'm going to evaluate and compare the performaces of the 3 classifiers using generalization error and macro F1. 

Generalization error quantifies how well the model will classify unseen data. It is computed as the accuracy score of the model predictions for the held-out testing set. While accuracy provides some measure of model performance, if the classes are imbalanced as they are in this case, accuracy can be misleading. 

In [None]:
target.value_counts()

Our dataset has more than 3x as many normal health diagnosis examples as suspect and pathological examples combined, so a poor model could classifier every new case as healthy and still have accuracy over 75% despite also having a false negative rate of 100%. 

A metric which accounts for unbalanced classes is the macro F1-score. Generally, an F1-score balances precision and recall to punish models that either miss too many true cases (low recall) or produce too many false positives (low precision). A high F1-score indicates a model is good at classifying true cases without causing too many false positives. A macro F1-score is used for multi-class cases because it computes the F1-score separately for each class and then averages those three scores. This method works better than accuracy for imbalanced classes because it treats all classes equally, regardless of how many examples they have. In this case, macro F1 will emphasize how well the model identifies suspect and pathological fetuses, not just the majority healthy class. 

In [None]:
accs = np.array([metrics.accuracy_score(y_test, dt_preds), 
                 metrics.accuracy_score(y_test, ab_preds), 
                 metrics.accuracy_score(y_test, rf_preds)])
f1s = np.array([metrics.f1_score(y_test, dt_preds, average='macro'), 
                metrics.f1_score(y_test, ab_preds, average='macro'), 
                metrics.f1_score(y_test, rf_preds, average='macro')])
models = np.array(['DecisionTree', 'AdaBoost', 'RandomForest'])
df = pd.DataFrame({'Model': models,
              'Accuracy': accs,
              'F1_macro': f1s})
df

In [None]:
x = np.arange(3)  
fig, ax = plt.subplots(figsize=(7,4))
ax.bar(x - width/2, df['Accuracy'], width, 
       label='Accuracy', color = 'darkblue')
ax.bar(x + width/2, df['F1_macro'], width, 
       label='Macro F1', color = 'darkorange')
ax.set_ylabel('Metric Value')
ax.set_title('Model Comparisons with Accuracy and Macro F1')
ax.set_xticks(x)
ax.set_xticklabels(models)
ax.set_ylim(0, 1)
ax.legend()
plt.show()

## Analysis 
The Random Forest model achieved the best overall performance, with an accuracy of 0.927 and a macro F1-score of 0.861, outperforming both the single Decision Tree and AdaBoost. The Decision Tree also performed well (accuracy 0.908, F1_macro 0.833), demonstrating that a single interpretable tree can capture much of the structure in the CTG dataset. AdaBoost, on the other hand, performed significantly worse (accuracy 0.798, F1_macro 0.419), likely due to the small dataset size or sensitivity to noisy cases. 

These results highlight that ensembles can improve robustness: Random Forest reduces variance through bagging and generalizes better than a single tree. In terms of interpretability, the single Decision Tree is the easiest to visualize and explain, making it useful for clinical contexts, while Random Forest and AdaBoost are harder to interpret directly. However, both ensembles allow for analysis of feature importance, which can provide insight into the most influential CTG features for making predictions.

In [None]:
fi_df = pd.DataFrame({'Feature': X_train.columns, 
                      'Importance': rf.feature_importances_})
fi_df = fi_df.sort_values(by='Importance', ascending=True)
plt.barh(fi_df['Feature'], fi_df['Importance'], color='darkgreen')
plt.xlabel('Importance')
plt.title('Random Forest Feature Importance')
plt.show()

The Random Forest model identified several CTG features as particularly important for predicting fetal health. The most influential features were abnormal_short_term_variability, percentage_of_time_with_abnormal_long_term_variability, and mean_value_of_short_term_variability, indicating that both short-term and long-term variability metrics are key indicators of fetal well-being. 

## Conclusion 

Altogether, the Random Forest model not only demonstrated the highest predictive performance among the three classifiers but also provides a valuable tool for real-world fetal health assessment. By accurately identifying cases as normal, suspect, or pathological, the model could assist obstetricians in prioritizing high-risk pregnancies and supporting timely clinical interventions. The ability to quantify feature importance further enhances its practical utility, allowing clinicians to understand which aspects of a CTG most strongly influence predictions. While not as immediately interpretable as a single Decision Tree, the combination of high accuracy, robustness to variance, and insight into key features makes Random Forest a promising model for aiding decision-making in prenatal care.