# QCTO - Workplace Module

### Project Title: Stroke Prediction Using Machine Learning: Identifying Risk Factors for Early Detection
#### Done By: Khumbelo Shaun Dowelani

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Introduce the project, outline its goals, and explain its significance.
* **Details:** Include information about the problem domain, the specific questions or challenges the project aims to address, and any relevant background information that sets the stage for the work.
---

### Context
Stroke is the **2nd leading cause of death globally**, responsible for approximately **11% of total deaths**, according to the **World Health Organization (WHO)**. A stroke occurs when blood flow to the brain is disrupted, leading to potential long-term disability or death. Early detection and prevention are **critical** in reducing stroke-related fatalities and improving patient outcomes.  

This dataset provides **medical and demographic information** to predict whether a person is at risk of having a stroke. It includes features such as **age, gender, underlying conditions (hypertension and heart disease), and lifestyle factors (e.g., smoking status)**. By analyzing these factors, we can build a **predictive model** to help identify high-risk individuals and support early medical intervention.  

### Purpose of the Project  
The goal of this project is to **develop a machine learning model** that predicts the likelihood of a patient experiencing a stroke based on their medical history and lifestyle. This will enable:  

- **Early detection of stroke risk**, allowing for timely medical intervention.  
- **Identification of key risk factors** contributing to stroke occurrences.  
- **Support for healthcare professionals** in decision-making and risk assessment.  

### Significance of the Study  
By leveraging **machine learning for stroke prediction**, this project aims to **reduce preventable deaths, enhance early diagnosis, and optimize healthcare resource allocation**. A well-performing model could assist in **screening patients efficiently**, leading to better stroke prevention strategies and improved patient care.

---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Set up the Python environment with necessary libraries and tools.
* **Details:** List and import all the Python packages that will be used throughout the project such as Pandas for data manipulation, Matplotlib/Seaborn for visualization, scikit-learn for modeling, etc.
---

In [51]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import matplotlib.pyplot as plt
import seaborn as sns

---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Describe how the data was collected and provide an overview of its characteristics.
* **Details:** Mention sources of the data, the methods used for collection (e.g., APIs, web scraping, datasets from repositories), and a general description of the dataset including size, scope, and types of data available (e.g., numerical, categorical).
---

### Data Source  
The dataset was obtained from **Kaggle**, a popular online repository for open datasets. While the exact data collection method is not specified, it is structured for **stroke prediction research** and contains relevant medical and demographic attributes. 

The dataset contains **5110 rows and 12 columns**, providing relevant information about the patient's health status and demographic data. Below is a detailed description of each attribute:

1. **id**:  
   - A **unique identifier** for each patient in the dataset.  
   
2. **gender**:  
   - The **gender** of the patient. Possible values are:
     - `"Male"`, `"Female"`, or `"Other"`.  
   
3. **age**:  
   - The **age** of the patient in **years**.  

4. **hypertension**:  
   - Indicates whether the patient has **hypertension** (high blood pressure).
     - `0` if the patient **doesn't** have hypertension,  
     - `1` if the patient **has** hypertension.  

5. **heart_disease**:  
   - Indicates whether the patient has a **heart disease**.  
     - `0` if the patient **doesn't** have heart disease,  
     - `1` if the patient **has** heart disease.  

6. **ever_married**:  
   - Indicates whether the patient has **ever been married**. Possible values:
     - `"No"` or `"Yes"`.  

7. **work_type**:  
   - The **type of work** the patient is involved in. Possible values:
     - `"children"`, `"Govt_job"`, `"Never_worked"`, `"Private"`, or `"Self-employed"`.  

8. **Residence_type**:  
   - Indicates whether the patient lives in a **rural** or **urban** area. Possible values:
     - `"Rural"` or `"Urban"`.  

9. **avg_glucose_level**:  
   - The **average glucose level** in the patient's blood, typically measured in **mg/dL**.  

10. **bmi**:  
    - The **body mass index (BMI)** of the patient, a measure of body fat based on height and weight.  

11. **smoking_status**:  
    - Indicates the patient's **smoking status**. Possible values:
      - `"formerly smoked"`, `"never smoked"`, `"smokes"`, or `"Unknown"`.  
    - Note: `"Unknown"` means that the smoking information is unavailable for the patient.  

12. **stroke**:  
    - The **target variable** indicating whether the patient had a stroke.  
      - `1` if the patient **had a stroke**,  
      - `0` if the patient **did not have a stroke**.  

These attributes form the foundation for predicting whether a patient is likely to experience a stroke. The features include **demographic details**, **health conditions**, and **lifestyle factors**, all of which contribute to the risk of stroke occurrence.  

---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Load the data into the notebook for manipulation and analysis.
* **Details:** Show the code used to load the data and display the first few rows to give a sense of what the raw data looks like.
---

In [2]:
# Load the dataset
df = pd.read_csv('Data/healthcare-dataset-stroke-data.csv')

# Display first few rows
df.head()

|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Prepare the data for analysis by cleaning and filtering.
* **Details:** Include steps for handling missing values, removing outliers, correcting errors, and possibly reducing the data (filtering based on certain criteria or features).
---

In [19]:
print("Missing Values in Each Column:")
print(df.isnull().sum())

Missing Values in Each Column:
id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64


In [20]:
# Drop rows where 'bmi' column has missing values
df = df.dropna(subset=['bmi'])

# Verify if any missing values remain
print(df.isnull().sum())

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64
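
As a small illustration of the two steps above, the sketch below counts missing values and drops incomplete rows on a made-up three-row frame. The values are invented for the example and are not taken from the dataset. Median imputation is included only as a possible alternative; it is an assumption, not something this notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing BMI value (values are illustrative only)
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0],
    "bmi": [36.6, np.nan, 32.5],
})

# Count missing values per column
missing_counts = toy.isnull().sum()

# Option used in this notebook: drop rows with a missing 'bmi'
toy_dropped = toy.dropna(subset=["bmi"])

# Hypothetical alternative: fill missing 'bmi' with the column median
toy_imputed = toy.assign(bmi=toy["bmi"].fillna(toy["bmi"].median()))

print(missing_counts)
print(f"Rows after dropping: {len(toy_dropped)}")
print(f"Imputed BMI for row 1: {toy_imputed.loc[1, 'bmi']}")
```

Dropping keeps only fully observed rows at the cost of sample size, while imputation keeps every row but introduces an estimated value.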


In [21]:
duplicate_count = df.duplicated().sum()
print(f"\nNumber of Duplicate Rows: {duplicate_count}")


Number of Duplicate Rows: 0


In [22]:
row_count = df.shape[0]
print(f"Number of rows: {row_count}")

Number of rows: 4252


In [23]:
# Define a function to remove outliers using IQR (Interquartile Range)
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Apply outlier removal to numeric columns
numeric_cols = ['age', 'avg_glucose_level', 'bmi']  # Continuous numeric columns in this dataset
for col in numeric_cols:
    df = remove_outliers(df, col)
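
To make the IQR rule above concrete, here is a minimal worked sketch on seven made-up BMI values (illustrative only, not taken from the dataset). It repeats the same arithmetic as `remove_outliers` so the bounds can be checked by hand.

```python
import pandas as pd

# Toy BMI values with one obvious extreme (illustrative only)
toy = pd.DataFrame({"bmi": [22.0, 24.0, 25.0, 27.0, 28.0, 30.0, 60.0]})

q1 = toy["bmi"].quantile(0.25)
q3 = toy["bmi"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only rows inside [lower_bound, upper_bound]
kept = toy[(toy["bmi"] >= lower_bound) & (toy["bmi"] <= upper_bound)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Bounds: [{lower_bound}, {upper_bound}]")
print(f"Rows kept: {len(kept)} of {len(toy)}")
```

Here Q1 is 24.5 and Q3 is 29.0, so the bounds are 17.75 and 35.75 and only the value 60.0 is removed. Note that the loop above filters one column at a time, so the quartiles for each later column are computed on rows that already survived the earlier filters.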



In [24]:
# Function to check for errors in each column
def check_data_errors(df):
    print("---- Checking for Errors in Each Column ----\n")

    # Check for duplicate IDs
    if 'id' in df.columns:
        duplicate_ids = df['id'].duplicated().sum()
        print(f"Duplicate IDs: {duplicate_ids}")

    # Check for invalid values in 'gender' column
    if 'gender' in df.columns:
        invalid_genders = df[~df['gender'].isin(['Male', 'Female', 'Other'])]
        print(f"Invalid gender values: {invalid_genders.shape[0]}")

    # Check for negative or unrealistic ages
    if 'age' in df.columns:
        invalid_ages = df[(df['age'] < 0) | (df['age'] > 120)]  
        print(f"Invalid ages: {invalid_ages.shape[0]}")

    # Check for invalid hypertension values (should be 0 or 1)
    if 'hypertension' in df.columns:
        invalid_hypertension = df[~df['hypertension'].isin([0, 1])]
        print(f"Invalid hypertension values: {invalid_hypertension.shape[0]}")

    # Check for invalid heart disease values (should be 0 or 1)
    if 'heart_disease' in df.columns:
        invalid_heart_disease = df[~df['heart_disease'].isin([0, 1])]
        print(f"Invalid heart disease values: {invalid_heart_disease.shape[0]}")

    # Check for invalid marital status
    if 'ever_married' in df.columns:
        invalid_marriage_status = df[~df['ever_married'].isin(['Yes', 'No'])]
        print(f"Invalid marriage status values: {invalid_marriage_status.shape[0]}")

    # Check for invalid work types
    if 'work_type' in df.columns:
        valid_work_types = ['children', 'Govt_job', 'Never_worked', 'Private', 'Self-employed']
        invalid_work_types = df[~df['work_type'].isin(valid_work_types)]
        print(f"Invalid work type values: {invalid_work_types.shape[0]}")

    # Check for invalid residence type
    if 'Residence_type' in df.columns:
        invalid_residence = df[~df['Residence_type'].isin(['Rural', 'Urban'])]
        print(f"Invalid residence type values: {invalid_residence.shape[0]}")

    # Check for zero or negative avg_glucose_level values
    if 'avg_glucose_level' in df.columns:
        invalid_glucose = df[df['avg_glucose_level'] <= 0]
        print(f"Invalid avg_glucose_level values: {invalid_glucose.shape[0]}")

    # Check for negative or extreme BMI values
    if 'bmi' in df.columns:
        invalid_bmi = df[(df['bmi'] <= 0) | (df['bmi'] > 100)]  # BMI over 100 is highly unlikely
        print(f"Invalid BMI values: {invalid_bmi.shape[0]}")

    # Check for invalid smoking status
    if 'smoking_status' in df.columns:
        valid_smoking_status = ['formerly smoked', 'never smoked', 'smokes', 'Unknown']
        invalid_smoking = df[~df['smoking_status'].isin(valid_smoking_status)]
        print(f"Invalid smoking status values: {invalid_smoking.shape[0]}")

    # Check for invalid stroke values (should be 0 or 1)
    if 'stroke' in df.columns:
        invalid_stroke = df[~df['stroke'].isin([0, 1])]
        print(f"Invalid stroke values: {invalid_stroke.shape[0]}")

# Run the function to check for errors
check_data_errors(df)

---- Checking for Errors in Each Column ----

Duplicate IDs: 0
Invalid gender values: 0
Invalid ages: 0
Invalid hypertension values: 0
Invalid heart disease values: 0
Invalid marriage status values: 0
Invalid work type values: 0
Invalid residence type values: 0
Invalid avg_glucose_level values: 0
Invalid BMI values: 0
Invalid smoking status values: 0
Invalid stroke values: 0


In [25]:
row_count = df.shape[0]
print(f"Number of rows: {row_count}")

Number of rows: 4129


---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
---


In [11]:
print("\n---- Summary Statistics ----")
print(df.describe())


---- Summary Statistics ----
                 id          age  hypertension  heart_disease  \
count   4252.000000  4252.000000   4252.000000    4252.000000   
mean   37143.407338    40.598655      0.068674       0.036453   
std    20987.203681    22.460205      0.252928       0.187438   
min       77.000000     0.080000      0.000000       0.000000   
25%    18677.750000    22.000000      0.000000       0.000000   
50%    37653.000000    41.000000      0.000000       0.000000   
75%    55401.750000    58.000000      0.000000       0.000000   
max    72940.000000    82.000000      1.000000       1.000000   

       avg_glucose_level          bmi       stroke  
count        4252.000000  4252.000000  4252.000000  
mean           91.513173    27.777046     0.031985  
std            22.713855     6.676779     0.175981  
min            55.120000    10.300000     0.000000  
25%            75.060000    23.000000     0.000000  
50%            88.055000    27.400000     0.000000  
75%          

In [65]:
numeric_cols = ['age', 'avg_glucose_level', 'bmi']

for col in numeric_cols:
    fig = px.histogram(df, x=col, nbins=30, title=f"Distribution of {col}",
                       color_discrete_sequence=['lightblue'])
    fig.update_layout(template="plotly_dark", bargap=0.1)
    fig.show()

In [62]:
stroke_colors = ['#ffffff', '#ff3131', '#00bf63', '#48b4bb', '#555555']

categorical_cols = ['gender', 'ever_married', 
                    'work_type', 'Residence_type', 'smoking_status']

for col in categorical_cols:
    count_df = df[col].value_counts().reset_index()
    count_df.columns = [col, 'count']  # Rename columns for clarity

    fig = px.bar(count_df, 
                 x=col, y='count', 
                 title=f"Count Plot of {col}",
                 color=col, 
                 color_discrete_sequence=stroke_colors)  # Apply stroke-related colors

    fig.update_layout(
        template="plotly_dark",
        xaxis_title=col, 
        yaxis_title="Count"
    )
    fig.show()


In [63]:
import plotly.graph_objects as go

# Define colors (green for 0, teal for 1)
stroke_colors = {0: '#00bf63', 1: '#48b4bb'}  # Green (No), Teal (Yes)

# Focus only on binary columns
binary_cols = ['hypertension', 'heart_disease', 'stroke']

for col in binary_cols:
    count_df = df[col].value_counts().sort_index()  # Ensure correct order (0 first, then 1)

    # Create the figure
    fig = go.Figure()

    # Add bars for each category (0 and 1)
    for val in count_df.index:
        fig.add_trace(go.Bar(
            x=[str(val)],  # Convert to string for proper labeling
            y=[count_df[val]],
            name=f"{col} = {val}",
            marker=dict(color=stroke_colors[val], line=dict(color='white', width=2))
        ))

    # Update layout
    fig.update_layout(
        title=f"Count Plot of {col}",
        xaxis_title=col,
        yaxis_title="Count",
        template="plotly_dark",
        plot_bgcolor='black',
        paper_bgcolor='black',
        font=dict(color="white"),
        showlegend=False
    )

    fig.show()


In [68]:
# Keep only numeric columns; text columns are left out of the correlation
df_numeric = df.select_dtypes(include=['number'])

# Compute the correlation matrix
corr_matrix = df_numeric.corr().round(2)

# Create heatmap
fig = ff.create_annotated_heatmap(z=corr_matrix.values, 
                                  x=list(corr_matrix.columns), 
                                  y=list(corr_matrix.index),
                                  colorscale="greens", 
                                  showscale=True)

# Update layout
fig.update_layout(
    title="Correlation Matrix",
    template="plotly_dark",
    plot_bgcolor="black",
    paper_bgcolor="black",
    font=dict(color="white")
)

fig.show()
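
For reference, `DataFrame.corr()` computes the Pearson correlation coefficient by default. For two columns $x$ and $y$ with $n$ paired observations, each cell of the matrix above is:

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

Values close to $1$ or $-1$ indicate a strong linear relationship, and values close to $0$ indicate little linear relationship. Because `id` is numeric, it also appears in the matrix, although it is only an identifier and its correlations carry no meaning.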

In [61]:
# Map stroke labels to colors (white = no stroke, red = stroke)
stroke_color_map = {"0": "#ffffff", "1": "#d62728"}

scatter_plots = [
    ('age', 'bmi'),
    ('age', 'avg_glucose_level'),
    ('bmi', 'avg_glucose_level')
]

for x_col, y_col in scatter_plots:
    # Cast stroke to string so Plotly treats it as a discrete category
    # and applies the explicit color map instead of its default palette
    fig = px.scatter(df, x=x_col, y=y_col, color=df["stroke"].astype(str),
                     color_discrete_map=stroke_color_map,
                     title=f"{x_col} vs {y_col} (Colored by Stroke)",
                     opacity=0.8)

    fig.update_traces(marker=dict(size=7))  # Adjust marker size for better visibility

    fig.update_layout(
        template="plotly_dark",
        plot_bgcolor="black",
        paper_bgcolor="black",
        font=dict(color="white"),
        coloraxis_showscale=False  # Hide color scale since it's binary
    )
    
    fig.show()

In [69]:
for col in numeric_cols:
    # Use an explicit map so colors do not depend on the order of values in df
    fig = px.violin(df, y=col, color="stroke", box=True, points="all",
                    title=f"Violin Plot of {col} (Colored by Stroke)",
                    color_discrete_map={0: "#ffffff", 1: "#d62728"})  # White = No Stroke, Red = Stroke

    fig.update_layout(
        template="plotly_dark",
        plot_bgcolor="black",
        paper_bgcolor="black",
        font=dict(color="white")
    )
    
    fig.show()
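
Alongside the plots, a numeric summary of the stroke rate per category can make relationships easier to compare. The sketch below shows the idea on a made-up six-row sample (the values are illustrative and are not results from the dataset). The name `summary` is introduced here only for the example.

```python
import pandas as pd

# Toy sample (values are illustrative, not taken from the dataset)
toy = pd.DataFrame({
    "smoking_status": ["smokes", "smokes", "never smoked",
                       "never smoked", "never smoked", "Unknown"],
    "stroke": [1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the proportion of positive cases
summary = (
    toy.groupby("smoking_status")["stroke"]
       .agg(patients="count", stroke_rate="mean")
       .sort_values("stroke_rate", ascending=False)
)
print(summary)
```

The same pattern should carry over to the cleaned `df` by replacing `toy` and looping over `categorical_cols`. Reporting the group size next to the rate helps avoid over-reading rates from very small groups.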

---
<a id="six"></a>
## **Modeling**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Develop and train predictive or statistical models.
* **Details:** Describe the choice of models, feature selection and engineering processes, and show how the models are trained. Include code for setting up the models and explanations of the model parameters.
---
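
As a starting point only, the sketch below shows one way a baseline classifier could be set up with scikit-learn: scale the numeric columns, one-hot encode a categorical column, and fit a logistic regression inside a single pipeline. The choice of logistic regression, `class_weight="balanced"`, the 80/20 split, and the feature subset are all assumptions made for illustration. The data is randomly generated with the dataset's column names and does not represent real patients.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with the same column names as the dataset
# (values are randomly generated, NOT real patient records)
rng = np.random.default_rng(42)
n = 400
demo = pd.DataFrame({
    "age": rng.uniform(1, 82, n),
    "avg_glucose_level": rng.uniform(55, 170, n),
    "bmi": rng.uniform(15, 45, n),
    "hypertension": rng.integers(0, 2, n),
    "smoking_status": rng.choice(
        ["formerly smoked", "never smoked", "smokes", "Unknown"], n),
})
# Synthetic target: older patients are more likely to be labelled 1
demo["stroke"] = (rng.uniform(0, 1, n) < demo["age"] / 200).astype(int)

numeric_features = ["age", "avg_glucose_level", "bmi"]
binary_features = ["hypertension"]
categorical_features = ["smoking_status"]

X = demo[numeric_features + binary_features + categorical_features]
y = demo["stroke"]

# Stratify so both splits keep a similar share of positive cases
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ],
    remainder="passthrough",  # binary 0/1 columns are passed through as-is
)

model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
model.fit(X_train, y_train)

# Predicted probability of the positive (stroke) class for each test row
stroke_probability = model.predict_proba(X_test)[:, 1]
print(stroke_probability[:5].round(3))
```

Keeping preprocessing inside the pipeline means the scaler and encoder are fitted on the training split only, which avoids leaking information from the test rows.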



---
<a id="seven"></a>
## **Evaluation and Validation**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Evaluate and validate the effectiveness and accuracy of the models.
* **Details:** Present metrics used to evaluate the models, such as accuracy, precision, recall, F1-score, etc. Discuss validation techniques employed, such as cross-validation or train/test split.
---
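
The summary statistics earlier in this notebook show a mean of about 0.032 for `stroke`, so only around 3% of rows are positive cases. With that imbalance, accuracy alone can look strong for a model that never predicts a stroke. The sketch below computes the metrics named above from confusion-matrix counts. The counts are made up to mirror that class balance, and `classification_metrics` is a hypothetical helper written for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical test set: 1000 patients, 32 of whom had a stroke.
# A model that always predicts "no stroke" misses every positive case.
always_negative = classification_metrics(tp=0, fp=0, fn=32, tn=968)

# A hypothetical model that catches 24 of the 32 strokes with 96 false alarms
example_model = classification_metrics(tp=24, fp=96, fn=8, tn=872)

print(always_negative)
print(example_model)
```

In this made-up case the always-negative model has the higher accuracy (0.968 against 0.896) but a recall of 0, while the second model reaches a recall of 0.75. For a screening task like this one, recall and F1 are therefore worth reporting next to accuracy.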


---
<a id="eight"></a>
## **Final Model**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Present the final model and its performance.
* **Details:** Highlight the best-performing model and discuss its configuration, performance, and why it was chosen over others.
---



---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---



---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---


## Additional Sections to Consider

* **Appendix:** For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.
* **Contributors:** If this is a group project, list the contributors and their roles or contributions to the project.
