# Exploration of fetal health
[Data](https://www.kaggle.com/datasets/andrewmvd/fetal-health-classification)

Feature columns (FHR = fetal heart rate):

- `baseline value`: FHR baseline (beats per minute)
- `accelerations`: Number of accelerations per second
- `fetal_movement`: Number of fetal movements per second
- `uterine_contractions`: Number of uterine contractions per second
- `light_decelerations`: Number of light decelerations per second
- `severe_decelerations`: Number of severe decelerations per second
- `prolongued_decelerations`: Number of prolonged decelerations per second
- `abnormal_short_term_variability`: Percentage of time with abnormal short term variability
- `mean_value_of_short_term_variability`: Mean value of short term variability
- `percentage_of_time_with_abnormal_long_term_variability`: Percentage of time with abnormal long term variability
- `mean_value_of_long_term_variability`: Mean value of long term variability
- `histogram_width`: Width of FHR histogram
- `histogram_min`: Minimum (low frequency) of FHR histogram
- `histogram_max`: Maximum (high frequency) of FHR histogram
- `histogram_number_of_peaks`: Number of histogram peaks
- `histogram_number_of_zeroes`: Number of histogram zeros
- `histogram_mode`: Histogram mode
- `histogram_mean`: Histogram mean
- `histogram_median`: Histogram median
- `histogram_variance`: Histogram variance
- `histogram_tendency`: Histogram tendency

In [13]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('fetal_health.csv')


display(df.head())
df.info()  # info() prints its own summary and returns None, so it needs no display() wrapper

Unnamed: 0,baseline value,accelerations,fetal_movement,uterine_contractions,light_decelerations,severe_decelerations,prolongued_decelerations,abnormal_short_term_variability,mean_value_of_short_term_variability,percentage_of_time_with_abnormal_long_term_variability,...,histogram_min,histogram_max,histogram_number_of_peaks,histogram_number_of_zeroes,histogram_mode,histogram_mean,histogram_median,histogram_variance,histogram_tendency,fetal_health
0,120.0,0.0,0.0,0.0,0.0,0.0,0.0,73.0,0.5,43.0,...,62.0,126.0,2.0,0.0,120.0,137.0,121.0,73.0,1.0,2.0
1,132.0,0.006,0.0,0.006,0.003,0.0,0.0,17.0,2.1,0.0,...,68.0,198.0,6.0,1.0,141.0,136.0,140.0,12.0,0.0,1.0
2,133.0,0.003,0.0,0.008,0.003,0.0,0.0,16.0,2.1,0.0,...,68.0,198.0,5.0,1.0,141.0,135.0,138.0,13.0,0.0,1.0
3,134.0,0.003,0.0,0.008,0.003,0.0,0.0,16.0,2.4,0.0,...,53.0,170.0,11.0,0.0,137.0,134.0,137.0,13.0,1.0,1.0
4,132.0,0.007,0.0,0.008,0.0,0.0,0.0,16.0,2.4,0.0,...,53.0,170.0,9.0,0.0,137.0,136.0,138.0,11.0,1.0,1.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2126 entries, 0 to 2125
Data columns (total 22 columns):
 #   Column                                                  Non-Null Count  Dtype  
---  ------                                                  --------------  -----  
 0   baseline value                                          2126 non-null   float64
 1   accelerations                                           2126 non-null   float64
 2   fetal_movement                                          2126 non-null   float64
 3   uterine_contractions                                    2126 non-null   float64
 4   light_decelerations                                     2126 non-null   float64
 5   severe_decelerations                                    2126 non-null   float64
 6   prolongued_decelerations                                2126 non-null   float64
 7   abnormal_short_term_variability                         2126 non-null   float64
 8   mean_value_of_short_term_variability  

The value counts for `fetal_health` show a strong class imbalance: 77.8% of readings are normal, while only 8.3% are pathological.

In [14]:

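def get_pct_from_value_counts(df, value_col):
    """Return the count and percentage share of each distinct value in `value_col`."""
    counts_df = df[value_col].value_counts().reset_index()
    counts_df.columns = [value_col, 'Count']
    counts_df['pct'] = (counts_df['Count'] / counts_df['Count'].sum() * 100).round(1)
    return counts_df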

fetal_health_val_counts_df = get_pct_from_value_counts(df, 'fetal_health')
fetal_health_val_counts_df['fetal_health'] = fetal_health_val_counts_df['fetal_health'].map({1:'Normal', 2:'Suspect',3:'Pathological'})
print(f"Value counts for fetal_health (1=normal; 2=suspect; 3=pathological): \n{fetal_health_val_counts_df}")

px.bar(
    fetal_health_val_counts_df,
    x='fetal_health',
    y='pct',
    title='Value Count Distribution for Fetal Health'
)

Value counts for fetal_health (1=normal; 2=suspect; 3=pathological): 
   fetal_health  Count   pct
0        Normal   1655  77.8
1       Suspect    295  13.9
2  Pathological    176   8.3
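
Because the classes are this imbalanced, it helps to know what a trivial model would score before reading any accuracy figure. The sketch below is a worked calculation using only the counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value-count output above.
class_counts = {"Normal": 1655, "Suspect": 295, "Pathological": 176}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a trivial model that always predicts the most frequent class.
baseline_accuracy = class_counts[majority_class] / total

print(f"Total readings: {total}")
print(f"Majority class: {majority_class}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained on this data should be judged against that baseline (roughly 0.778) rather than against 0.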


# Preliminary test using all columns 

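Fit a logistic regression model as a first baseline, since the predictors are continuous and the target is a discrete class label.

Scaling the features to a common range is not strictly required, but it helps the solver converge and lets the default L2 penalty treat all features evenly.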
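
For reference, the two ingredients named above can be written out. The first expression is min-max scaling as `MinMaxScaler` applies it with its default range; the second is the multinomial (softmax) form that recent scikit-learn versions fit for a three-class target with the default solver. The symbols $x_j$, $w_k$ and $b_k$ are notation introduced here, not taken from the notebook.

```latex
x'_j = \frac{x_j - \min(x_j)}{\max(x_j) - \min(x_j)}, \qquad
P(y = k \mid x') = \frac{\exp\left(w_k^\top x' + b_k\right)}{\sum_{c=1}^{3} \exp\left(w_c^\top x' + b_c\right)}
```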

In [48]:
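def normalize_data(x):
    """Rescale every column of `x` to the [0, 1] range with min-max scaling."""
    scaler = MinMaxScaler()
    x_scaled = scaler.fit_transform(x)

    # fit_transform returns a NumPy array, so restore the column names
    return pd.DataFrame(x_scaled, columns=x.columns)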

x = normalize_data(df.drop(columns=['fetal_health']))
y = df['fetal_health']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=5)
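
One caveat with the cell above: `normalize_data` fits the scaler on the full dataset before the split, so the minimum and maximum of the test rows influence how the training rows are scaled. A minimal sketch of one alternative follows, assuming a scikit-learn `Pipeline` is acceptable here. It uses synthetic stand-in data so that it runs without the CSV; `X_demo`, `y_demo` and `model` are hypothetical names, and `stratify` is an optional addition that keeps class proportions equal across the splits.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption, so the sketch runs without the CSV):
# 300 rows, 5 features on very different scales, 3 imbalanced classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5)) * np.array([1.0, 10.0, 100.0, 0.01, 5.0])
y_demo = rng.choice([1.0, 2.0, 3.0], size=300, p=[0.78, 0.14, 0.08])

# Split the raw, unscaled data first; stratify preserves class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=5, stratify=y_demo
)

# The scaler lives inside the pipeline, so it is fit on the training rows only
# and then reused unchanged when scoring the test rows.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=10000))
model.fit(X_train, y_train)

scaler = model.named_steps["minmaxscaler"]
print(f"Rows seen by the scaler: {scaler.n_samples_seen_} of {len(X_demo)}")
print(f"Accuracy on the synthetic test split: {model.score(X_test, y_test):.3f}")
```

In the notebook, the same pattern would mean passing the unscaled `df.drop(columns=['fetal_health'])` to `train_test_split` and fitting the pipeline on the resulting training split. The accuracy obtained that way may differ slightly from the figure produced by the original cells.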

In [49]:
lr = LogisticRegression(max_iter=10000, random_state=10)

lr.fit(x_train, y_train)

accuracy = lr.score(x_test, y_test)
print(f"Accuracy with simple LogReg: {accuracy}")

Accuracy with simple LogReg: 0.8887147335423198
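
A single accuracy number can hide poor performance on the rare classes, which matter most here. The sketch below shows how per-class recall and a confusion matrix could be computed with `sklearn.metrics`. The label arrays are toy values invented for illustration; they are not predictions from the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

labels = [1.0, 2.0, 3.0]  # 1 = Normal, 2 = Suspect, 3 = Pathological

# Toy labels for illustration only; NOT results from the notebook's model.
y_true = np.array([1, 1, 1, 1, 1, 1, 2, 2, 3, 3], dtype=float)
y_pred = np.array([1, 1, 1, 1, 1, 1, 1, 2, 1, 3], dtype=float)

# Overall accuracy looks reasonable...
accuracy = float((y_true == y_pred).mean())

# ...but recall per class shows how many Suspect and Pathological cases are missed.
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall per class: {per_class_recall}")
print(cm)
```

To apply this to the notebook's model, replace the toy arrays with `y_test` and `lr.predict(x_test)`.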
