# Classification Model for Identifying Maternal Health Risk

## Introduction

Maternal health remains a critical issue worldwide, especially in rural regions and among lower-middle-class families in emerging countries. The lack of access to proper healthcare, inadequate information about maternal care, and insufficient monitoring during pregnancy contribute to high maternal mortality rates. The significance of timely interventions and constant monitoring during pregnancy cannot be overstated, as each moment is crucial to ensuring the health and safety of both the mother and the baby.
This report investigates maternal health risks using exploratory data analysis and classification techniques such as logistic regression, support vector classification (SVC), and naive Bayes to identify key factors that contribute to complications during pregnancy.

The primary question addressed in this project is: What are the key indicators that predict maternal health risks during pregnancy?

To answer this question, we use a dataset containing information on various maternal health factors. The goal of the project is to create a predictive model that can evaluate the risk factors associated with pregnancy.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## About Data 

Data was taken from the UC Irvine Machine Learning Repository \
Dataset link - https://archive.ics.uci.edu/dataset/863/maternal+health+risk \
Column descriptions: 

- Age: Age in years when a woman is pregnant.
- SystolicBP: Systolic (upper) blood pressure in mmHg, a significant attribute during pregnancy.
- DiastolicBP: Diastolic (lower) blood pressure in mmHg, another significant attribute during pregnancy.
- BS: Blood glucose level as a molar concentration, in mmol/L.
- BodyTemp: Body temperature (values such as 98.0 in the data suggest degrees Fahrenheit).
- HeartRate: Resting heart rate in beats per minute.
- RiskLevel: Predicted risk intensity level during pregnancy (low, mid, or high), based on the preceding attributes.

In [2]:
df = pd.read_csv('../data/Maternal Health Risk Data Set.csv')
df.head()

,Age,SystolicBP,DiastolicBP,BS,BodyTemp,HeartRate,RiskLevel
0,25,130,80,15.0,98.0,86,high risk
1,35,140,90,13.0,98.0,70,high risk
2,29,90,70,8.0,100.0,80,high risk
3,30,140,85,7.0,98.0,70,high risk
4,35,120,60,6.1,98.0,76,low risk


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1014 entries, 0 to 1013
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          1014 non-null   int64  
 1   SystolicBP   1014 non-null   int64  
 2   DiastolicBP  1014 non-null   int64  
 3   BS           1014 non-null   float64
 4   BodyTemp     1014 non-null   float64
 5   HeartRate    1014 non-null   int64  
 6   RiskLevel    1014 non-null   object 
dtypes: float64(2), int64(4), object(1)
memory usage: 55.6+ KB


For the EDA, let's make our target variable RiskLevel numeric by mapping low risk to 0, mid risk to 1, and high risk to 2.

In [4]:
df_eda = df.copy()
df_eda['RiskLevel'].unique()

array(['high risk', 'low risk', 'mid risk'], dtype=object)

In [5]:
# Ordinal encoding of the target: a higher number means a higher risk.
risk_level_map = {'low risk': 0,
                  'mid risk': 1,
                  'high risk': 2}

df_eda['RiskLevel'] = df_eda['RiskLevel'].map(risk_level_map).astype(float)
df_eda

,Age,SystolicBP,DiastolicBP,BS,BodyTemp,HeartRate,RiskLevel
0,25,130,80,15.0,98.0,86,2.0
1,35,140,90,13.0,98.0,70,2.0
2,29,90,70,8.0,100.0,80,2.0
3,30,140,85,7.0,98.0,70,2.0
4,35,120,60,6.1,98.0,76,0.0
...,...,...,...,...,...,...,...
1009,22,120,60,15.0,98.0,80,2.0
1010,55,120,90,18.0,98.0,60,2.0
1011,35,85,60,19.0,98.0,86,2.0
1012,43,120,90,18.0,98.0,70,2.0


## EDA

We can see that the data has no missing values and that all features besides the target variable are numeric.

In [9]:
import altair as alt

In [7]:
df_eda.isnull().sum()

Age            0
SystolicBP     0
DiastolicBP    0
BS             0
BodyTemp       0
HeartRate      0
RiskLevel      0
dtype: int64

With 1014 observations the dataset is small, but it should still be enough for initial modeling.

In [8]:
df_eda.shape

(1014, 7)

We have somewhat balanced data, with high risk having the fewest observations and low risk having the most.

In [26]:
# Bar chart of class counts; the color encoding assigns one color per risk level,
# so no fixed bar color is set on the mark.
chart = alt.Chart(df).mark_bar().encode(
    x = alt.X('RiskLevel:N', title = 'Risk Level', axis = alt.Axis(labelAngle = 0)),
    y = alt.Y('count()', title = 'Count'),
    color = alt.Color('RiskLevel:N', title = 'Risk Level')
).properties(
    title = "Countplot of Risk Level",
    width = 300,
    height = 300
)

chart.show()
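
The class balance described above can also be checked numerically. The following is a minimal sketch that uses a small, hypothetical label series rather than the real data; in the notebook, `labels` would be `df['RiskLevel']`. The `imbalance_ratio` name is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical toy labels (assumption); in the notebook use df['RiskLevel'].
labels = pd.Series(
    ['low risk'] * 5 + ['mid risk'] * 3 + ['high risk'] * 2,
    name='RiskLevel',
)

# Absolute counts and relative shares per class.
counts = labels.value_counts()
shares = labels.value_counts(normalize=True).round(2)

# Ratio of the largest class to the smallest; values near 1 mean balanced classes.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 supports treating the classes as balanced; a much larger ratio would be a reason to consider stratified splitting or class weights.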

Let's discuss some of the relationships visible in the correlation heatmap of the dataset (the heatmap shows absolute correlation values):

- Our target variable, "Risk Level", exhibits a noticeable correlation with "Blood Pressure". This suggests that individuals with higher or lower blood pressure levels may tend to fall into distinct risk categories. As a result, "Blood Pressure" could serve as an essential predictor in later stages of analysis.
- The upper (systolic) and lower (diastolic) blood pressure values show a strong correlation. This is expected because both measurements are closely related physiological metrics. While we might have considered dropping one of them, the correlation is not higher than 0.8, so we will retain both, as they may still provide valuable information when combined with other features.

In [35]:
correlation_matrix = df_eda.corr().abs().round(2).reset_index().melt(
    id_vars = 'index', 
    var_name = 'Variable', 
    value_name = 'Correlation'
)

heatmap = alt.Chart(correlation_matrix).mark_rect().encode(
    x = alt.X('Variable:N', title = '', sort = None),
    y = alt.Y('index:N', title = '', sort = None),
    color = alt.Color('Correlation:Q', scale = alt.Scale(scheme = 'viridis'), title = 'Correlation')
).properties(
    width = 600,
    height = 400
)

text = alt.Chart(correlation_matrix).mark_text(baseline = 'middle').encode(
    x = alt.X('Variable:N', sort = None, axis = alt.Axis(labelAngle = -45)),
    y = alt.Y('index:N', sort = None),
    text = alt.Text('Correlation:Q', format = '.2f'),
    color = alt.condition(
        'datum.Correlation > 0.5', 
        alt.value('white'), 
        alt.value('black')
    )
).properties(
    title = "Correlation Heatmap of the Maternal Health Data"
)

final_chart = heatmap + text

final_chart.show()
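
The 0.8 cut-off used above to decide whether to drop one of two correlated features can be applied programmatically. This is a minimal sketch on hypothetical toy columns; the helper name `high_correlation_pairs` is an assumption made for illustration and is not part of the original analysis. In the notebook, the feature columns of `df_eda` would be passed instead of `toy`.

```python
from itertools import combinations

import pandas as pd


def high_correlation_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, |r|) for every column pair with |r| above threshold."""
    corr = frame.corr().abs()
    pairs = []
    # combinations() visits each unordered pair once and skips the diagonal.
    for col_a, col_b in combinations(corr.columns, 2):
        r = corr.loc[col_a, col_b]
        if r > threshold:
            pairs.append((col_a, col_b, round(float(r), 2)))
    return pairs


# Hypothetical toy data (assumption), not the maternal health dataset.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],  # almost a multiple of 'a'
    'c': [5, 1, 4, 2, 3],   # only weakly related to 'a' and 'b'
})

flagged = high_correlation_pairs(toy, threshold=0.8)
print(flagged)
```

An empty list for the real features would be consistent with the decision to keep both blood pressure columns.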


We can see from the boxplots that women in the high-risk group generally have higher values across all our features. For example, older women are more likely to be classified as high risk. Additionally, the high-risk group shows a wider range for upper blood pressure and blood glucose levels, with the median value being slightly higher than that of the other groups.

In [49]:
columns = [col for col in df.columns.tolist() if col != 'RiskLevel']

boxplots = [
    alt.Chart(df).mark_boxplot(extent = 'min-max').encode(
        x = alt.X('RiskLevel:N', title = 'Risk Level', axis = alt.Axis(labelAngle = 0)),
        y = alt.Y(f'{col}:Q', title = col),
        color = 'RiskLevel:N'
    ).properties(
        title = f'Boxplot of {col} by RiskLevel',
        width = 300,
        height = 200
    )
    for col in columns
]

final_chart = alt.vconcat(*boxplots).resolve_scale(
    color = 'independent',
    y = 'independent'
)

final_chart.show()
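
The median differences read off the boxplots can be tabulated directly. The sketch below uses a few hypothetical rows with two of the feature names; in the notebook, `df.groupby('RiskLevel').median()` would give the same kind of table for the real data.

```python
import pandas as pd

# Hypothetical toy rows (assumption), not the maternal health dataset.
toy = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 48],
    'BS': [6.1, 6.9, 7.0, 7.5, 13.0, 15.0],
    'RiskLevel': ['low risk', 'low risk', 'mid risk',
                  'mid risk', 'high risk', 'high risk'],
})

# Median of every numeric feature per risk group,
# reordered from low to high risk for easier reading.
order = ['low risk', 'mid risk', 'high risk']
medians = toy.groupby('RiskLevel').median().reindex(order)

print(medians)
```

Reading the rows from low to high risk shows at a glance which features increase with the risk level.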

## Modelling
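
The three classifiers named in the introduction could be compared along the following lines. This is only a sketch under stated assumptions, not the project's actual modelling code: the data is synthetic, `GaussianNB` is assumed as the naive Bayes variant, and the split size and `random_state` are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real table (assumption): two features, three classes.
rng = np.random.default_rng(0)
n_per_class = 60
X = pd.DataFrame({
    'SystolicBP': np.concatenate(
        [rng.normal(loc, 8, n_per_class) for loc in (105, 120, 140)]
    ),
    'BS': np.concatenate(
        [rng.normal(loc, 1, n_per_class) for loc in (6.5, 8, 12)]
    ),
})
y = np.repeat(['low risk', 'mid risk', 'high risk'], n_per_class)

# Stratify so that each class keeps the same share in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

# Scaling lives inside the pipeline so it is fitted on the training data only.
models = {
    'Logistic regression': make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    'SVC': make_pipeline(StandardScaler(), SVC()),
    'Naive Bayes': make_pipeline(GaussianNB()),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test set
    print(f"{name}: test accuracy = {scores[name]:.2f}")
```

Keeping `StandardScaler` inside each pipeline avoids leaking test-set statistics into training, which matters for the scale-sensitive logistic regression and SVC models.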

## Hyperparameter Tuning 