# Classification Model for Identifying Maternal Health Risk

## Introduction

Maternal health remains a critical issue worldwide, especially in rural regions and among lower-middle-class families in emerging countries. The lack of access to proper healthcare, inadequate information about maternal care, and insufficient monitoring during pregnancy contribute to high maternal mortality rates. The significance of timely interventions and constant monitoring during pregnancy cannot be overstated, as each moment is crucial to ensuring the health and safety of both the mother and the baby.
This report investigates maternal health risks using exploratory data analysis and classification techniques such as decision trees, logistic regression, and support vector classification (SVC) to identify key factors that contribute to complications during pregnancy.

The primary question addressed in this project is: What are the key indicators that predict maternal health risks during pregnancy?

To answer this question, we used a dataset containing information on various maternal health factors. The goal of the project is to create a predictive model that can evaluate the risk factors associated with pregnancy.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## About the Data

The data was taken from the UC Irvine Machine Learning Repository. \
Dataset link: https://archive.ics.uci.edu/dataset/863/maternal+health+risk \
Column descriptions:

- Age: Age of the woman, in years, during pregnancy.
- SystolicBP: Upper value of blood pressure in mmHg, a significant attribute during pregnancy.
- DiastolicBP: Lower value of blood pressure in mmHg, another significant attribute during pregnancy.
- BS: Blood glucose level, expressed as a molar concentration in mmol/L.
- BodyTemp: Body temperature (the recorded values, 98 to 103, suggest degrees Fahrenheit).
- HeartRate: Resting heart rate in beats per minute.
- RiskLevel: Predicted risk intensity level during pregnancy, based on the preceding attributes.
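
Because BS is recorded in mmol/L, readers who are used to mg/dL may find a quick conversion helpful. The sketch below is only illustrative: the helper name `bs_mmol_to_mg_dl` is hypothetical and not part of the project, and the factor assumes the standard molar mass of glucose (about 180.16 g/mol).

```python
# Illustrative unit conversion; the helper name is hypothetical.
# Assumes the molar mass of glucose is about 180.16 g/mol,
# so 1 mmol/L corresponds to roughly 18.016 mg/dL.
GLUCOSE_MG_DL_PER_MMOL_L = 18.016


def bs_mmol_to_mg_dl(bs_mmol_l):
    """Convert a blood glucose value from mmol/L to mg/dL."""
    return bs_mmol_l * GLUCOSE_MG_DL_PER_MMOL_L


# Example with an arbitrary value of 7.5 mmol/L
example_mg_dl = bs_mmol_to_mg_dl(7.5)
print(round(example_mg_dl, 1))
```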

In [3]:
df = pd.read_csv('../data/Maternal Health Risk Data Set.csv')
df.head()

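|   | Age | SystolicBP | DiastolicBP | BS | BodyTemp | HeartRate | RiskLevel |
|---|---|---|---|---|---|---|---|
| 0 | 25 | 130 | 80 | 15.0 | 98.0 | 86 | high risk |
| 1 | 35 | 140 | 90 | 13.0 | 98.0 | 70 | high risk |
| 2 | 29 | 90 | 70 | 8.0 | 100.0 | 80 | high risk |
| 3 | 30 | 140 | 85 | 7.0 | 98.0 | 70 | high risk |
| 4 | 35 | 120 | 60 | 6.1 | 98.0 | 76 | low risk |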


In [79]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1014 entries, 0 to 1013
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          1014 non-null   int64  
 1   SystolicBP   1014 non-null   int64  
 2   DiastolicBP  1014 non-null   int64  
 3   BS           1014 non-null   float64
 4   BodyTemp     1014 non-null   float64
 5   HeartRate    1014 non-null   int64  
 6   RiskLevel    1014 non-null   object 
dtypes: float64(2), int64(4), object(1)
memory usage: 55.6+ KB


|   | Age | SystolicBP | DiastolicBP | BS | BodyTemp | HeartRate |
|---|---|---|---|---|---|---|
| count | 1014.0 | 1014.0 | 1014.0 | 1014.0 | 1014.0 | 1014.0 |
| mean | 29.871795 | 113.198225 | 76.460552 | 8.725986 | 98.665089 | 74.301775 |
| std | 13.474386 | 18.403913 | 13.885796 | 3.293532 | 1.371384 | 8.088702 |
| min | 10.0 | 70.0 | 49.0 | 6.0 | 98.0 | 7.0 |
| 25% | 19.0 | 100.0 | 65.0 | 6.9 | 98.0 | 70.0 |
| 50% | 26.0 | 120.0 | 80.0 | 7.5 | 98.0 | 76.0 |
| 75% | 39.0 | 120.0 | 90.0 | 8.0 | 98.0 | 80.0 |
| max | 70.0 | 160.0 | 100.0 | 19.0 | 103.0 | 90.0 |


## EDA

## Modelling

In [101]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_validate

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from scipy.stats import loguniform, uniform, randint

from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

In [138]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)
X_train = train_df.drop(columns=["RiskLevel"])
X_test = test_df.drop(columns=["RiskLevel"])
y_train = train_df["RiskLevel"]
y_test = test_df["RiskLevel"]

## The Baseline Model: Dummy Classifier

In [137]:
dummy_clf = DummyClassifier()
scores = cross_validate(dummy_clf, X_train, y_train, cv=10, return_train_score=True)
pd.DataFrame(scores)['test_score'].mean()

0.4007377295995182
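
To help interpret the baseline: with its default settings, scikit-learn's DummyClassifier always predicts the most frequent class seen during fitting, so its accuracy is roughly the share of that class. A score near 0.40 is therefore consistent with the most common risk level making up about 40% of the training rows. The minimal sketch below illustrates this on made-up labels (`X_toy` and `y_toy` are invented for the example and are not the project data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels for illustration: 4 of 10 samples are in the majority class
y_toy = np.array(["low risk"] * 4 + ["mid risk"] * 3 + ["high risk"] * 3)
X_toy = np.zeros((len(y_toy), 1))  # DummyClassifier ignores the features

# The default strategy predicts the most frequent training class
dummy = DummyClassifier()
dummy.fit(X_toy, y_toy)

baseline_accuracy = dummy.score(X_toy, y_toy)
_, counts = np.unique(y_toy, return_counts=True)
majority_share = counts.max() / len(y_toy)

# Both values are equal: accuracy matches the majority-class share
print(baseline_accuracy, majority_share)
```

Any real model should beat this number to be considered useful.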

## Model Comparison between Decision Tree, Logistic Regression and SVC
Here we run a simple comparison of the three models, each in a pipeline with StandardScaler, using 10-fold cross-validation.

In [129]:
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=123),
    "RBF SVM": SVC(random_state=123),
    "Logistic Regression": LogisticRegression(max_iter=2000, random_state=123),
}

In [130]:
# The function below is adapted from DSCI571 Supervised Learning I Lecture 4 notes
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross validation

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data

    Returns
    -------
        pandas Series with the mean and standard deviation of each
        cross-validation metric, formatted as "mean (+/- std)"
    """

    scores = pd.DataFrame(cross_validate(model, X_train, y_train, **kwargs))

    mean_scores = scores.mean()
    std_scores = scores.std()

    # Format each metric as "mean (+/- std)" with three decimal places
    out_col = [
        f"{mean:0.3f} (+/- {std:0.3f})"
        for mean, std in zip(mean_scores, std_scores)
    ]

    return pd.Series(data=out_col, index=mean_scores.index)

In [131]:
results_df = None
results_dict = {}

for model_name, model in models.items():
    clf_pipe = make_pipeline(StandardScaler(), model)
    results_dict[model_name] = mean_std_cross_val_scores(
        clf_pipe, X_train, y_train, cv=10, return_train_score=True, error_score='raise'
    )

results_df = pd.DataFrame(results_dict).T
results_df

|   | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| Decision Tree | 0.005 (+/- 0.005) | 0.001 (+/- 0.001) | 0.826 (+/- 0.045) | 0.931 (+/- 0.003) |
| RBF SVM | 0.008 (+/- 0.000) | 0.003 (+/- 0.000) | 0.699 (+/- 0.047) | 0.714 (+/- 0.007) |
| Logistic Regression | 0.003 (+/- 0.000) | 0.001 (+/- 0.000) | 0.613 (+/- 0.047) | 0.614 (+/- 0.007) |


The decision tree has the best performance in 10-fold cross-validation, with the highest mean validation score of 0.826.
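
The results table also shows a gap between the decision tree's training score (0.931) and its validation score (0.826), which may indicate some overfitting. One way to examine this is to compare the train/validation gap at different tree depths. The sketch below is a rough illustration on synthetic data generated with `make_classification`; the data, the names `X_demo`, `y_demo`, and `gap_df`, and the chosen depths are assumptions for the example, not part of the project.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the maternal health dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_classes=3, random_state=123
)

gaps = {}
for depth in [None, 5]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    cv = pd.DataFrame(
        cross_validate(tree, X_demo, y_demo, cv=10, return_train_score=True)
    )
    train_mean = cv["train_score"].mean()
    test_mean = cv["test_score"].mean()
    gaps[str(depth)] = {
        "train_score": train_mean,
        "test_score": test_mean,
        "gap": train_mean - test_mean,  # larger gap suggests more overfitting
    }

gap_df = pd.DataFrame(gaps).T
print(gap_df.round(3))
```

A large gap for the unrestricted tree relative to the depth-limited one would suggest that constraining `max_depth` is worth exploring.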


In [134]:
dt = DecisionTreeClassifier(random_state=123)

param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': randint(3, 20),
}

random_search = RandomizedSearchCV(
    dt,
    param_dist,
    n_iter=100,
    n_jobs=-1,
    return_train_score=True,
    random_state=123,
)
random_search.fit(X_train, y_train)


In [120]:
random_search.best_score_

0.8138983564341438
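
Two details may help when reading this score. First, `randint(3, 20)` from scipy samples integer depths from 3 to 19, since the upper bound is exclusive. Second, RandomizedSearchCV uses 5-fold cross-validation by default, so this score is not directly comparable to the 10-fold score of 0.826 reported earlier. The sketch below shows one way to inspect which settings were selected and how the top candidates compare; it runs on synthetic data (`X_demo`, `y_demo` are stand-ins) with a smaller `n_iter`, purely as an illustration.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the maternal health dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=123
)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=123),
    {"criterion": ["gini", "entropy"], "max_depth": randint(3, 20)},
    n_iter=10,  # kept small so the example runs quickly
    cv=5,  # the default, written out explicitly
    return_train_score=True,
    random_state=123,
)
search.fit(X_demo, y_demo)

# Hyperparameters of the best candidate
print(search.best_params_)

# Top candidates, with train and validation scores side by side
columns = [
    "param_criterion",
    "param_max_depth",
    "mean_train_score",
    "mean_test_score",
    "rank_test_score",
]
top = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score").head(5)
print(top)
```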

## Reporting test score

In [121]:
random_search.score(X_test, y_test)

0.8325123152709359
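
Since no `scoring` argument was passed to the search, the number above is plain accuracy. For a risk-screening task, a per-class view can be more informative, for example to see how often high-risk cases are missed. The sketch below shows how a confusion matrix and classification report could be produced; it uses synthetic data and a depth-5 tree as stand-ins (`X_demo`, `y_demo`, and the chosen depth are assumptions), not the tuned model from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with three classes (not the maternal health dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_classes=3, random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

tree = DecisionTreeClassifier(max_depth=5, random_state=123).fit(X_tr, y_tr)
y_pred = tree.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Overall accuracy is the diagonal (correct predictions) over the total
accuracy = cm.trace() / cm.sum()
print(round(accuracy, 3))

# Per-class precision, recall, and F1
print(classification_report(y_te, y_pred))
```

In the notebook itself, the same calls could be applied to `random_search.predict(X_test)` and `y_test`.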

## Hyperparameter Tuning 