# Identifying Age-Related Conditions Using Machine Learning Models
## Author: Boni M. Ale, MD, MSc, MPH
### Date: 08 June 2023

# 1. Introduction

Determining whether someone has these medical conditions requires a long and intrusive process of collecting information from patients. Predictive models can shorten this process and keep patient details private by collecting key characteristics related to the conditions and then encoding them.

In this project, I will use machine learning to detect conditions from measurements of anonymous characteristics. The general objective of this analysis is to predict whether a person has any of three medical conditions. To predict whether a person has one or more of the three medical conditions (Class 1) or none of them (Class 0), I will create a model trained on anonymous measurements of health characteristics.
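The binary target described above can be illustrated with a minimal sketch. The three per-condition flags and their names (`cond_a`, `cond_b`, `cond_c`) are hypothetical and exist only to show how "one or more of three conditions" collapses into a single `Class` label; the training data is assumed to provide the combined label directly.

```python
import pandas as pd

# Hypothetical per-condition flags; names and values are illustrative only
toy = pd.DataFrame({
    "cond_a": [0, 1, 0, 0],
    "cond_b": [0, 0, 0, 1],
    "cond_c": [0, 1, 0, 0],
})

# Class 1 = at least one of the three conditions, Class 0 = none of them
toy["Class"] = toy[["cond_a", "cond_b", "cond_c"]].any(axis=1).astype(int)
print(toy["Class"].tolist())
```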



**Load Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns


from sklearn.preprocessing import StandardScaler

**Load Datasets**

In [None]:
train_raw = pd.read_csv('/Users/boniale/Desktop/Data_Science/ML_iard/data/train.csv')
test = pd.read_csv('/Users/boniale/Desktop/Data_Science/ML_iard/data/test.csv')
greeks_raw = pd.read_csv('/Users/boniale/Desktop/Data_Science/ML_iard/data/greeks.csv')
sample_submission = pd.read_csv('/Users/boniale/Desktop/Data_Science/ML_iard/data/sample_submission.csv')

# 2. Exploratory Data Analysis

## 2.1. Data Description

In [None]:
print("Raw Train Data Set's size: ", train_raw.shape)

print("Raw Greeks Data Set's size: ", greeks_raw.shape)

# Separate variables into new data frames
numeric_data = train_raw.select_dtypes(include=[np.number])
cat_data = train_raw.select_dtypes(exclude=[np.number])
cat_data = cat_data.drop(['Id'], axis=1)
print("There are {} numeric and {} categorical columns in train raw data".format(numeric_data.shape[1], cat_data.shape[1]))

It seems that there are 56 numeric variables, including our target ("the person has one or more of the three medical conditions (Class 1)" or "none of the three medical conditions (Class 0)"). Class is stored as a number, but it is actually a categorical variable.
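A small sketch of why the target is counted among the numeric columns, and one possible way to make its categorical role explicit. The toy frame and its values are made up; only the column names `EJ` and `Class` come from the notebook, and `AB` is used as an illustrative feature name.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_raw; values are made up
toy = pd.DataFrame({
    "AB": [0.2, 0.5, 0.1],
    "EJ": ["A", "B", "A"],
    "Class": [0, 1, 0],
})

# Class is stored as integers, so it is counted as numeric
n_numeric_before = toy.select_dtypes(include=[np.number]).shape[1]

# Casting to 'category' makes its role explicit and removes it from the numeric count
toy["Class"] = toy["Class"].astype("category")
n_numeric_after = toy.select_dtypes(include=[np.number]).shape[1]

print(n_numeric_before, n_numeric_after)
```

The notebook itself keeps `Class` as an integer column and excludes it by name instead, which works equally well for the steps that follow.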

## 2.2. Numerical Variables Exploration

Let's work on the numerical features.

In [None]:
Target = ['Class']
allFeature = train_raw.columns.tolist()
included_features = [feature for feature in allFeature if feature not in Target]

numericalFeatures = train_raw[included_features].select_dtypes(include=['number'])

In [None]:
numericalFeatures

There are actually 55 numeric features once the target is excluded.

Let's explore the overall distribution of these variables.

In [None]:
num = [f for f in train_raw.columns if train_raw.dtypes[f] != 'object']
num.remove("Class")
nd = pd.melt(train_raw, value_vars=num)
# One histogram per numeric feature, each panel with its own axes
hist_grid = sns.FacetGrid(nd, col='variable',
                          col_wrap=5,
                          sharex=False,
                          sharey=False)
hist_grid = hist_grid.map(sns.histplot, 'value')
plt.show()

We can see that several variables are not normally distributed.
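The visual impression can be backed by a number. The sketch below computes sample skewness per column on synthetic data; the column names and the threshold of 1 are illustrative assumptions, not values taken from this dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns: one roughly symmetric, one strongly right-skewed
toy = pd.DataFrame({
    "symmetric": rng.normal(size=2000),
    "right_skewed": rng.lognormal(size=2000),
})

# Sample skewness per column; values near 0 suggest a symmetric distribution
skewness = toy.skew()
print(skewness.round(2))

# Flag columns whose absolute skewness exceeds an illustrative threshold of 1
skewed_cols = skewness[skewness.abs() > 1].index.tolist()
```

The same call on the notebook's `numericalFeatures` frame would give one skewness value per feature.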

The distributions also point to an imbalance in feature values between people who have the conditions and those who do not.

Are these variables highly correlated with each other? Let's explore this visually with a heatmap.

In [None]:
# Define functions that display a correlation heatmap
def display_correlation_heatmap(df, title):
    # numeric_only=True skips non-numeric columns such as Id and EJ
    corr_mat = np.round(df.corr(numeric_only=True), 3)

    fig, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(corr_mat, annot=True, fmt=".3f", cmap='coolwarm', cbar=False, square=True, linewidths=.5, annot_kws={"size": 12}, ax=ax)

    ax.set_title(title, fontsize=16, pad=20, y=1.05)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right", fontsize=12)
    ax.set_yticklabels(ax.get_yticklabels(), rotation=0, fontsize=12)

    plt.tight_layout()
    plt.show()

def plot_correlation_heatmap(df, column_name):
    correlation_matrix = df.corr(numeric_only=True)
    plt.figure(figsize=(12, 10))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
    plt.title(f'Correlation heatmap for {column_name}')
    plt.show()

In [None]:
# Display a heatmap of the top correlations
def plot_top_correlations(df: pd.DataFrame, n: int, title_name: str = 'Top Correlations') -> None:
    # Calculate correlation between all numeric variables
    corr = df.corr(numeric_only=True)

    # Select n variables by absolute correlation.
    # Note: nlargest orders rows by the first column and uses later columns
    # only to break ties, so this favours variables correlated with the first column.
    top_corr_cols = corr.abs().nlargest(n, columns=corr.columns).index
    top_corr = corr.loc[top_corr_cols, top_corr_cols]

    fig, axes = plt.subplots(figsize=(10, 5))
    # Hide the upper triangle (including the diagonal) to avoid duplicate cells
    mask = np.zeros_like(top_corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    sns.heatmap(top_corr, mask=mask, linewidths=.5, cmap='YlOrRd', annot=True)
    plt.title(title_name)
    plt.show()

# Plot heatmap of top 10 correlations in training data
# (numeric_data is the numeric frame built in section 2.1)
plot_top_correlations(numeric_data, 10, 'Top 10 Correlations in Train Dataset')

We can see that some variables are correlated, but correlations are generally not high, apart from BZ and BC, which are highly correlated.
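Highly correlated pairs can also be listed programmatically rather than read off the heatmap. This is a minimal sketch on synthetic data; the feature names `f1` to `f3` and the threshold of 0.8 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)

# Synthetic features: "f2" is built to track "f1" closely, "f3" is independent
toy = pd.DataFrame({
    "f1": base,
    "f2": base + rng.normal(scale=0.1, size=500),
    "f3": rng.normal(size=500),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# List pairs above an illustrative threshold of 0.8
pairs = upper.stack()
high_pairs = pairs[pairs > 0.8].index.tolist()
print(high_pairs)
```

Applied to the notebook's numeric features, a listing like this would be expected to surface the BZ and BC pair noted above.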

## 2.3. Target Distribution

First, let's calculate the frequency table for those who have one or more of the three medical conditions (Class 1) and those who have none of them (Class 0). Second, we will compute the percentage in each group. Finally, I will produce a visualisation of the distribution of our target.

#### *Frequency Table of Class*

In [None]:
freq_tab = pd.crosstab(index = train_raw["Class"],  # Make a crosstab
                     columns="Total")                  # Name the count column
freq_tab

#### *Percentage Table of Class*

In [None]:
my_tab = pd.crosstab(index = train_raw["Class"],  # Make a crosstab
                     columns="Percentage")                  # Name the count column

my_tab/my_tab.sum()*100 # Calculate the percentages 

#### *Visualisation of Class*

In [None]:
#define data
data_targ = [17.5 , 82.5]
labels = ['Has medical condition', 'No medical condition']

#define Seaborn color palette to use
colors = sns.color_palette('pastel')[0:2]  # one colour per class

#create pie chart
plt.pie(data_targ, labels = labels, colors = colors, autopct='%.0f%%')
plt.show()

This looks like imbalanced data, as the number of people who have one or more of the three medical conditions is much smaller than the number of people with none of them.
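The class shares used in the pie chart can be derived from the labels instead of being typed in by hand. A minimal sketch, using made-up labels in place of `train_raw["Class"]`:

```python
import pandas as pd

# Toy labels standing in for train_raw["Class"]; counts are made up
toy_class = pd.Series([0] * 8 + [1] * 2, name="Class")

# Share of each class in percent
pct = toy_class.value_counts(normalize=True).sort_index() * 100
print(pct.round(1).to_dict())

# Ratio of majority to minority class size
counts = toy_class.value_counts()
imbalance_ratio = counts.max() / counts.min()
```

Passing `pct.values` to `plt.pie` would keep the chart in sync with the data if the training set changes.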

## 2.4. Relationship between Target and Features

Let's see the distribution of our features among people who have one or more of the three medical conditions and those with none of them.

In [None]:
figsize = (4*4, 20)
fig = plt.figure(figsize=figsize)
for idx, col in enumerate(numericalFeatures):
    ax = plt.subplot(14,4, idx + 1)
    sns.kdeplot(
        hue='Class',
        data=train_raw, fill=True,
        x=col, palette=["Gray", "Red"], legend=False
    )
            
    # Remove axis labels and the top/right borders to reduce clutter
    ax.set_ylabel('')
    ax.set_xlabel('')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.set_title(col, loc='center',
                 weight='bold', fontsize=20)

fig.suptitle('Features distribution among people with one or more of any of the three medical conditions and those with none\n\n\n', ha='center', fontweight='bold', fontsize=21)
fig.legend([1, 0], loc='upper center', bbox_to_anchor=(0.5, 0.96), fontsize=21, ncol=3)
plt.tight_layout()
plt.show()

The distribution of the data confirms the imbalanced distribution of our features and target.
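A per-class summary table can complement the density plots. The sketch below compares class medians on a toy frame; the feature names `feat_1` and `feat_2` and all values are made up.

```python
import pandas as pd

# Toy frame standing in for train_raw; feature names and values are made up
toy = pd.DataFrame({
    "feat_1": [1.0, 1.2, 0.9, 3.0, 3.4],
    "feat_2": [5.0, 5.1, 4.9, 5.0, 5.2],
    "Class":  [0, 0, 0, 1, 1],
})

# Median of each feature within each class
medians = toy.groupby("Class").median()

# Absolute gap between class medians, largest first
gap = (medians.loc[1] - medians.loc[0]).abs().sort_values(ascending=False)
print(gap)
```

Because the features are on different scales, raw gaps like these are only comparable after scaling; the ranking here is meant as a quick first look.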

## 2.5. Data Wrangling

### 2.5.1. Data Balancing


In [None]:
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE

# Note: x_scaled is created in section 2.5.2, so that cell must be run first
y = train_raw['Class']
oversample = RandomOverSampler(random_state=0)
X, y = oversample.fit_resample(x_scaled, y)

# Summarize the class distribution after oversampling
counter = Counter(y)
for k, v in counter.items():
    per = v / len(y) * 100
    print('Class=%d, n=%d (%.3f%%)' % (k, v, per))

# Plot the class distribution
plt.bar(counter.keys(), counter.values())
plt.show()
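To make clear what random oversampling does, here is a hand-rolled NumPy sketch of the idea. It is not the `imblearn` implementation; the function name `random_oversample` and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Resample smaller classes with replacement until all match the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    keep = [np.arange(len(y))]  # keep every original row
    for cls, n in zip(classes, counts):
        if n < n_max:
            idx = np.flatnonzero(y == cls)
            # Draw extra row indices from this class, with replacement
            keep.append(rng.choice(idx, size=n_max - n, replace=True))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))
```

The added rows are exact copies of existing minority-class rows, so the class counts become equal without creating any new measurements.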

### 2.5.2. Data Scaling

In [None]:
sc = StandardScaler()
x_scaled = sc.fit_transform(numericalFeatures)
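With its default settings, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The NumPy sketch below reproduces that arithmetic on made-up values, so the effect of the step is visible.

```python
import numpy as np

# Made-up values: two columns on very different scales
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)  # population standard deviation (ddof=0)
X_std = (X_toy - mean) / std

# After scaling, each column has mean 0 and standard deviation 1
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```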

In [None]:
# Convert categorical data to numeric for the categorical column (e.g. EJ)
train_raw['EJ'] = train_raw['EJ'].replace({'A': 0, 'B': 1})
test['EJ'] = test['EJ'].replace({'A': 0, 'B': 1})
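One thing to watch with `replace` is that any value missing from the dictionary is left unchanged. A possible check, sketched here with `map`, turns such values into NaN so they are easy to spot. The category `"C"` is hypothetical and is included only to trigger the case.

```python
import pandas as pd

# Toy column standing in for EJ; the value "C" is a hypothetical unseen category
ej = pd.Series(["A", "B", "A", "C"])
codes = {"A": 0, "B": 1}

# map() turns any value missing from the dictionary into NaN
mapped = ej.map(codes)
unmapped = ej[mapped.isna()].unique().tolist()
print(unmapped)
```

If the column only ever contains the expected categories, both approaches give the same codes.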

Missing data?
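One way to start answering this question is to count missing values per column. The sketch uses a toy frame with made-up names and values; the same two lines could be run on `train_raw`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_raw; names and values are made up
toy = pd.DataFrame({
    "feat_1": [1.0, np.nan, 3.0, 4.0],
    "feat_2": [np.nan, np.nan, 1.0, 2.0],
    "feat_3": [1.0, 2.0, 3.0, 4.0],
})

# Number of missing values per column, keeping only columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing.to_dict())
```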