<a href="https://colab.research.google.com/github/allaalmouiz/deepLearning_stroke_prediction/blob/main/deepLearning_stroke_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stroke Risk Prediction Challenge - Module 4
Submitted by: **`Alaa Almouiz F. Moh.`**

ID Number: **`S2026_176`**

Track: **Machine Learning**

For: **ZAKA ©**

## **1- Problem Statement (Objective)**

I’ve been asked to help a public health organization identify the individuals most at risk of having a stroke, using a dataset of patient information and health indicators.

So, I will build a Deep Learning **Binary Classification Model** that predicts whether a patient will experience a stroke or not.


### **Dataset**
Stroke Risk Dataset, downloaded from Kaggle ([Data](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset)) and uploaded to [my repo](https://github.com/allaalmouiz/deepLearning_stroke_prediction/blob/739145a7bf7db1e6814682ae25a67beef25940fe/healthcare-dataset-stroke-data.csv).

The dataset variables include:
* `id`: Unique identifier for each patient.
* `gender`: Patient’s gender.
* `age`: Age of the patient.
* `hypertension`: Whether the patient has hypertension.
* `ever_married`: Marital status.
* `work_type`: Type of employment.
* `heart_disease`: Whether the patient has a history of heart disease.
* `Residence_type`: Patient’s area of residence.
* `avg_glucose_level`: Average blood glucose level.
* `bmi`: Body Mass Index.
* `smoking_status`: Patient’s smoking status.
* `stroke`: Whether the patient experienced a stroke.


## **2- Dataset Loading**

In [49]:
# Clone the dataset from my Github Repo
! git clone https://github.com/allaalmouiz/deepLearning_stroke_prediction.git

%cd deepLearning_stroke_prediction

Cloning into 'deepLearning_stroke_prediction'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 9 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (9/9), 68.78 KiB | 3.62 MiB/s, done.
Resolving deltas: 100% (1/1), done.
/content/deepLearning_stroke_prediction/deepLearning_stroke_prediction/deepLearning_stroke_prediction


In [50]:
# Importing the Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout



## **3- Exploring the Dataset**

In [51]:
# Loading the Dataset
df = pd.read_csv("/content/deepLearning_stroke_prediction/healthcare-dataset-stroke-data.csv")

In [52]:
df.head()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


In [53]:
# Removing the ID
df.drop("id", axis=1, inplace=True)

In [54]:
df.head()

| | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


**Notice**:
I removed the `id` column because it is only a row identifier with large numeric values. Keeping it could hurt the model's performance, since the model might mistake those large numbers for a meaningful feature.
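One quick way to confirm that a column like `id` only identifies rows is to check that every value in it is unique. The sketch below uses a small made-up frame (`toy` is a hypothetical name, and the values are for illustration only), not the real dataset.

```python
import pandas as pd

# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    "id": [9046, 51676, 31112, 60182],
    "age": [67.0, 61.0, 80.0, 49.0],
    "stroke": [1, 1, 0, 0],
})

# A column whose values are all distinct labels rows rather than describing them.
id_is_unique = toy["id"].is_unique

# drop() without inplace=True returns a new frame and leaves `toy` untouched.
features = toy.drop(columns=["id"])

print(id_is_unique)
print(list(features.columns))
```

Returning a new frame instead of dropping in place keeps the original data available if the cell is re-run.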

In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             5110 non-null   object 
 1   age                5110 non-null   float64
 2   hypertension       5110 non-null   int64  
 3   heart_disease      5110 non-null   int64  
 4   ever_married       5110 non-null   object 
 5   work_type          5110 non-null   object 
 6   Residence_type     5110 non-null   object 
 7   avg_glucose_level  5110 non-null   float64
 8   bmi                4909 non-null   float64
 9   smoking_status     5110 non-null   object 
 10  stroke             5110 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 439.3+ KB


In [56]:
print(f"The dataset has {df.shape[0]} rows (samples) and {df.shape[1]} columns (features)")

The dataset has 5110 rows (samples) and 11 columns (features)


**Notice**: Only `bmi` has missing values.

The **categorical values** are `gender`, `ever_married`, `work_type`, `Residence_type`, and `smoking_status`. `stroke`, `hypertension`, and `heart_disease` are also categorical, but they are stored as integers here, so they have to be cast to (or at least treated as) categorical.

The **numerical values** are `age`, `avg_glucose_level`, and `bmi` only.
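The cast described above can be sketched on a toy frame. This is only an illustration under assumed data: `toy`, `flag_cols`, `categorical_cols`, and `numerical_cols` are hypothetical names, and the notebook itself handles the split by listing the column names by hand.

```python
import pandas as pd

# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "hypertension": [0, 1, 0],
    "age": [67.0, 61.0, 80.0],
    "stroke": [1, 0, 0],
})

# Columns that are categorical in meaning, even when stored as 0/1 integers.
flag_cols = ["gender", "hypertension", "stroke"]
for col in flag_cols:
    toy[col] = toy[col].astype("category")

# After the cast, the dtype alone is enough to separate the two groups.
categorical_cols = toy.select_dtypes(include="category").columns.tolist()
numerical_cols = toy.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(numerical_cols)
```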


In [57]:
categorical = list(df.dtypes[df.dtypes == 'object'].index)
print("Categorical Columns")
print(categorical)

print("")

numerical = list(df.dtypes[df.dtypes != 'object'].index)
print("Numerical Columns")
print(numerical)

Categorical Columns
['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

Numerical Columns
['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi', 'stroke']


In [58]:
# Do the modification based on the Analysis Above
categorical = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status', 'hypertension', 'heart_disease', 'stroke' ]
numerical = ['age','avg_glucose_level', 'bmi']

In [59]:
# Information about the data and its values

print ("Information about the Categorical columns")
for col in df[categorical].columns:
    print(col)
    print("first 5 unique values", df[col].unique()[:5])
    print("unique values", df[col].nunique())
    print("")
print("======")

print ("Information about the Numerical columns")
for col in df[numerical].columns:
    print(col)
    print("first 5 unique values", df[col].unique()[:5])
    print("unique values", df[col].nunique())
    print("")


Information about the Categorical columns
gender
first 5 unique values ['Male' 'Female' 'Other']
unique values 3

ever_married
first 5 unique values ['Yes' 'No']
unique values 2

work_type
first 5 unique values ['Private' 'Self-employed' 'Govt_job' 'children' 'Never_worked']
unique values 5

Residence_type
first 5 unique values ['Urban' 'Rural']
unique values 2

smoking_status
first 5 unique values ['formerly smoked' 'never smoked' 'smokes' 'Unknown']
unique values 4

hypertension
first 5 unique values [0 1]
unique values 2

heart_disease
first 5 unique values [1 0]
unique values 2

stroke
first 5 unique values [1 0]
unique values 2

Information about the Numerical columns
age
first 5 unique values [67. 61. 80. 49. 79.]
unique values 104

avg_glucose_level
first 5 unique values [228.69 202.21 105.92 171.23 174.12]
unique values 3979

bmi
first 5 unique values [36.6  nan 32.5 34.4 24. ]
unique values 418



In [60]:
# Checking for duplicate rows in the dataset

df.duplicated().sum()

np.int64(0)

In [61]:
df[numerical].describe()

| | age | avg_glucose_level | bmi |
|---|---|---|---|
| count | 5110.0 | 5110.0 | 4909.0 |
| mean | 43.226614 | 106.147677 | 28.893237 |
| std | 22.612647 | 45.28356 | 7.854067 |
| min | 0.08 | 55.12 | 10.3 |
| 25% | 25.0 | 77.245 | 23.5 |
| 50% | 45.0 | 91.885 | 28.1 |
| 75% | 61.0 | 114.09 | 33.1 |
| max | 82.0 | 271.74 | 97.6 |


## **4- Cleaning the Dataset**

### Handling the Missing Values of `bmi`

In [62]:
bmi_mean = df["bmi"].mean()
# Assign the result back: fillna(..., inplace=True) on a selected column
# triggers a pandas warning and will stop working in pandas 3.0
df["bmi"] = df["bmi"].fillna(bmi_mean)
print(bmi_mean)
print(f"The null values in the bmi are {df['bmi'].isnull().sum()}")
print(" ")

28.893236911794666
The null values in the bmi are 0
 


Yay! We fixed the null values in the `bmi` column.
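To make the mean imputation concrete, here is a minimal sketch on a toy column (`toy` is a hypothetical frame with made-up values). It assigns the filled column back to the frame, which is the form pandas recommends over calling `fillna(..., inplace=True)` on a selected column.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value; the numbers are for illustration only.
toy = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, 34.4, 24.0]})

# mean() skips NaN by default, so it averages the four observed values.
bmi_mean = toy["bmi"].mean()

# Assign the filled column back to the frame.
toy["bmi"] = toy["bmi"].fillna(bmi_mean)

print(bmi_mean)
print(int(toy["bmi"].isnull().sum()))
```

The missing entry ends up holding the mean of the observed values, so the column mean is unchanged while its variance shrinks slightly.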

### Encoding the categorical data

In [63]:
df[categorical].head()

| | gender | ever_married | work_type | Residence_type | smoking_status | hypertension | heart_disease | stroke |
|---|---|---|---|---|---|---|---|---|
| 0 | Male | Yes | Private | Urban | formerly smoked | 0 | 1 | 1 |
| 1 | Female | Yes | Self-employed | Rural | never smoked | 0 | 0 | 1 |
| 2 | Male | Yes | Private | Rural | never smoked | 0 | 1 | 1 |
| 3 | Female | Yes | Private | Urban | smokes | 0 | 0 | 1 |
| 4 | Female | Yes | Self-employed | Rural | never smoked | 1 | 0 | 1 |


In [46]:
# Encoding all the categorical data using LabelEncoder

encoders = {}

for col in categorical:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    encoders[col] = le
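As a small illustration of what `LabelEncoder` does to one column, the sketch below encodes a made-up list of smoking statuses (`values` is a hypothetical name) and maps the codes back. Keeping each fitted encoder, as the `encoders` dictionary does, is what makes the reverse mapping possible. Note that the integer codes follow the sorted order of the class names, so they impose an order that has no real meaning for nominal categories. Also, this cell's execution number is lower than those around it and the `df.head()` output that follows still shows the original strings, which suggests the cell may not have been re-run in the same session.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values for illustration only.
values = ["smokes", "never smoked", "Unknown", "smokes"]

le = LabelEncoder()
codes = le.fit_transform(values)

# classes_ is sorted; the position of each class is its integer code.
print(list(le.classes_))

# Each value is replaced by the index of its class.
print(codes.tolist())

# The fitted encoder can translate the codes back to the original labels.
print(le.inverse_transform(codes).tolist())
```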

In [64]:
df.head()

| | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | 28.893237 | never smoked | 1 |
| 2 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |



In [67]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

encoded = ohe.fit_transform(df[categorical])

feature_names = ohe.get_feature_names_out(categorical)

encoded_df = pd.DataFrame(encoded, columns=feature_names, index=df.index)


In [69]:
encoded_df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5100,5101,5102,5103,5104,5105,5106,5107,5108,5109
gender_Female,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,...,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0
gender_Male,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
gender_Other,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ever_married_No,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
ever_married_Yes,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,...,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0
work_type_Govt_job,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
work_type_Never_worked,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
work_type_Private,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
work_type_Self-employed,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
work_type_children,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
