# Hands-On Workshop - Big Data in Healthcare 8400

### Hadas Volkov - January 2024

##### Azure space and resources were kindly contributed by **Microsoft**

Welcome to this hands-on workshop on Big Data in Healthcare. In this workshop, we will perform Exploratory Data Analysis (EDA) to gain insights and intuition about a stroke records dataset.

We will be covering the following topics:

1. Introduction to the dataset and problem
2. Data cleaning and preparation
3. Exploratory Data Analysis (EDA)
4. Feature engineering
5. Model training and evaluation

Let's get started!

## Stroke Prediction Dataset

The dataset we will be using in this workshop is the [Stroke Prediction Dataset](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset) from **Kaggle**. It contains health records of over 5000 individuals, some of whom have suffered a stroke.

According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset can be used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

We will use this dataset to gain initial insights into the data and to get a *feel* for the work of a data scientist. We will learn how to use basic Python scripting, packages, and techniques to explore the data, and how to use this information to train a Machine Learning (ML) model in a subsequent workshop.

Hopefully, this workshop will give you a taste of what it's like to be a data scientist, and will demonstrate why it is beneficial to use Python and Jupyter notebooks for data science.

# Step 0: Imports and Reading Data

In [None]:
# Numpy and Pandas for data manipulation

# Matplotlib and Seaborn for visualization

In [None]:
# Read in data into a dataframe
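
A minimal sketch of the two cells above. The workshop's CSV file name is not given here, so the example reads a hypothetical three-row sample from an in-memory string; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import numpy as np                # numerical arrays
import pandas as pd               # tabular data
import matplotlib.pyplot as plt   # base plotting
import seaborn as sns             # statistical plots built on Matplotlib

# Hypothetical three-row sample standing in for the workshop CSV file.
csv_text = """id,gender,age,bmi,stroke
1,Male,67,36.6,1
2,Female,61,,1
3,Female,49,34.4,0
"""

# With a real file, pass its path instead of the StringIO object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

Note that the empty `bmi` field in the second row is read as `NaN`, which is how pandas marks missing values.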

# Step 1: Data Understanding

* Dataframe `shape`
* `head` and `tail`
* `dtypes`
* `describe`

In [None]:
# Display shape of dataframe

In [None]:
# Display first 10 rows of dataframe

In [None]:
# Display column names

The current dataset contains the following features:

* `id`: unique identifier
* `gender`: "Male", "Female" or "Other"
* `age`: age of the patient
* `hypertension`: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
* `heart_disease`: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
* `ever_married`: "No" or "Yes"
* `work_type`: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
* `Residence_type`: "Rural" or "Urban"
* `avg_glucose_level`: average glucose level in blood
* `bmi`: body mass index
* `smoking_status`: "formerly smoked", "never smoked", "smokes" or "Unknown"*
* `stroke`: 1 if the patient had a stroke or 0 if not

###### **Note**: "Unknown" in smoking_status means that the information is unavailable for this patient

In [None]:
# Display information about types of data in dataframe

In [None]:
# Describe basic statistics about dataframe
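
The inspection steps in this section could look roughly like the sketch below. The four-row frame is hypothetical toy data that only mimics a few of the dataset's columns.

```python
import pandas as pd

# Hypothetical toy records that mimic a few of the dataset's columns.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [67.0, 61.0, 49.0, 80.0],
    "bmi": [36.6, None, 34.4, 32.5],
    "stroke": [1, 1, 0, 0],
})

print(df.shape)             # (number of rows, number of columns)
print(df.head(10))          # first 10 rows (fewer if the frame is shorter)
print(df.columns.tolist())  # column names as a plain list
df.info()                   # dtypes and non-null counts, printed directly
print(df.describe())        # summary statistics for the numeric columns
```

`describe` skips missing values, so the `count` row is a quick first hint at which columns contain gaps.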

# Step 2: Data Preparation

* Dropping irrelevant columns and rows
* Identifying duplicated columns
* Renaming Columns
* Feature Creation

In [None]:
# Dropping the 'irrelevant_column' and 'duplicate_column'

In [None]:
# Renaming columns back to original

In [None]:
# Identifying duplicate rows
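
A small sketch of the three cells above. Only the names `irrelevant_column` and `duplicate_column` come from the cell comments; the capitalised column names, the rename mapping, and the values are assumptions made for illustration. The last step also removes the duplicate rows it finds, which goes one step beyond identifying them.

```python
import pandas as pd

# Hypothetical toy frame; the capitalised names and all values are made up.
df = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "Age": [67.0, 61.0, 61.0, 49.0],
    "irrelevant_column": ["x", "y", "y", "z"],
    "duplicate_column": [67.0, 61.0, 61.0, 49.0],
})

# Drop the columns that carry no extra information.
df = df.drop(columns=["irrelevant_column", "duplicate_column"])

# Rename columns (this mapping is an assumption for illustration).
df = df.rename(columns={"ID": "id", "Age": "age"})

# Count rows that repeat an earlier row exactly, then remove them.
n_duplicates = df.duplicated().sum()
df = df.drop_duplicates().reset_index(drop=True)
print(n_duplicates, df.shape)
```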

In [None]:
# Checking the number of missing values

In [None]:
# Percentage of missing bmi values

In [None]:
# Replacing missing bmi values with mean
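
The missing-value steps above can be sketched as follows, again on hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data with one missing bmi value.
df = pd.DataFrame({
    "age": [67.0, 61.0, 49.0, 80.0],
    "bmi": [36.0, None, 34.0, 32.0],
})

# Number of missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Share of missing bmi values, as a percentage.
pct_missing_bmi = df["bmi"].isna().mean() * 100
print(f"{pct_missing_bmi:.1f}% of bmi values are missing")

# Replace missing bmi values with the mean of the observed ones.
bmi_mean = df["bmi"].mean()  # NaN values are skipped by default
df["bmi"] = df["bmi"].fillna(bmi_mean)
```

Mean imputation is simple, but it pulls the filled values toward the centre and so reduces the variance of the column; the median is a common alternative when the distribution is skewed.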

In [None]:
# Creating a health risk score based on normalized values of certain health indicators
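
The exact recipe for the score is not spelled out here, so the sketch below is one possible reading: it min-max normalizes `age`, `avg_glucose_level`, and `bmi` to the range 0 to 1 and averages them. The choice of indicators, the equal weighting, and the column name `health_risk_score` are all assumptions.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "age": [20.0, 40.0, 60.0, 80.0],
    "avg_glucose_level": [80.0, 100.0, 150.0, 200.0],
    "bmi": [20.0, 25.0, 30.0, 40.0],
})

# Assumed set of indicators that feed into the score.
indicators = ["age", "avg_glucose_level", "bmi"]

# Min-max normalization: rescale each column to the range [0, 1].
col_min = df[indicators].min()
col_max = df[indicators].max()
normalized = (df[indicators] - col_min) / (col_max - col_min)

# Assumed score: unweighted mean of the normalized indicators.
df["health_risk_score"] = normalized.mean(axis=1)
print(df)
```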

# Step 3: Feature Understanding - Univariate Analysis

* Plotting Feature Distributions
  * Histogram
  * KDE
  * Boxplot

In [None]:
# Bar plot of smoking status (categorical variable)
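
For a categorical variable, a bar plot of the category counts is the usual first look. A minimal sketch on hypothetical toy data:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "smoking_status": [
        "never smoked", "smokes", "never smoked",
        "Unknown", "formerly smoked", "never smoked",
    ]
})

# Count how often each category occurs (sorted, most frequent first).
counts = df["smoking_status"].value_counts()

fig, ax = plt.subplots(figsize=(6, 4))
counts.plot.bar(ax=ax)
ax.set_xlabel("smoking_status")
ax.set_ylabel("Number of patients")
ax.set_title("Smoking status")
fig.tight_layout()
plt.show()
```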

In [None]:
# Histograms of age and avg_glucose_level (continuous variables) on twin axes

# Creating the figure and the first axis

# Plotting the histogram of 'age' on the first axis

# Creating the second axis

# Plotting the histogram of 'avg_glucose_level' on the second axis

# Adding legends

# Showing the plot
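
The commented steps above can be sketched as follows. The data are randomly generated stand-ins, and creating the second axis with `twinx()` (shared x-axis, separate count axis) is an assumption about what the original cell does.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical random stand-in data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.uniform(0, 82, size=200),
    "avg_glucose_level": rng.normal(106, 45, size=200).clip(55, 270),
})

# Creating the figure and the first axis
fig, ax1 = plt.subplots(figsize=(8, 4))

# Plotting the histogram of 'age' on the first axis
ax1.hist(df["age"], bins=20, alpha=0.5, color="tab:blue", label="age")
ax1.set_xlabel("value")
ax1.set_ylabel("count (age)")

# Creating the second axis (assumed: shares x, has its own y scale)
ax2 = ax1.twinx()

# Plotting the histogram of 'avg_glucose_level' on the second axis
ax2.hist(df["avg_glucose_level"], bins=20, alpha=0.5,
         color="tab:orange", label="avg_glucose_level")
ax2.set_ylabel("count (avg_glucose_level)")

# Adding one legend that combines the entries of both axes
handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(handles1 + handles2, labels1 + labels2, loc="upper right")

# Showing the plot
plt.show()
```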

In [None]:
# Kernel Density Estimation (KDE) plot of health risk score (continuous variable)

# Step 4: Feature Relationship - Bivariate Analysis

* Scatterplot
* Heatmap Correlation
* Pairplot
* Groupby comparisons

In [None]:
# Scatter plot of age and health risk score (continuous variables)

In [None]:
# Scatter plot of age and health risk score (continuous variables) with regression line and 95% confidence interval
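
Both scatter plots can be sketched side by side: a plain Matplotlib scatter on the left, and Seaborn's `regplot`, which adds a fitted regression line with a bootstrapped 95% confidence band, on the right. The data are randomly generated, and `health_risk_score` is again a hypothetical column name.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical random data with a roughly linear relationship.
rng = np.random.default_rng(0)
age = rng.uniform(1, 82, size=150)
score = (age / 100 + rng.normal(0, 0.08, size=150)).clip(0, 1)
df = pd.DataFrame({"age": age, "health_risk_score": score})

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Plain scatter plot
ax_left.scatter(df["age"], df["health_risk_score"], alpha=0.5)
ax_left.set_xlabel("age")
ax_left.set_ylabel("health_risk_score")
ax_left.set_title("Scatter")

# Scatter plot with regression line and 95% confidence interval
sns.regplot(data=df, x="age", y="health_risk_score", ci=95,
            scatter_kws={"alpha": 0.5}, line_kws={"color": "tab:red"},
            ax=ax_right)
ax_right.set_title("Scatter with regression line")

fig.tight_layout()
plt.show()
```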

In [None]:
# Pairplot of age, avg_glucose_level, bmi, hypertension, heart_disease

In [None]:
# A more refined pairplot of age, avg_glucose_level, bmi, hypertension, heart_disease

In [None]:
# Correlation matrix of age, avg_glucose_level, bmi, hypertension, heart_disease
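
A sketch of the correlation matrix and its heatmap, using randomly generated data. `DataFrame.corr` computes Pearson correlation by default.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical random stand-in data.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.uniform(1, 82, size=n),
    "avg_glucose_level": rng.normal(106, 45, size=n).clip(55, 270),
    "bmi": rng.normal(29, 7, size=n).clip(12, 60),
    "hypertension": (rng.random(n) < 0.2).astype(int),
    "heart_disease": (rng.random(n) < 0.2).astype(int),
})

features = ["age", "avg_glucose_level", "bmi", "hypertension", "heart_disease"]

# Pairwise Pearson correlation between the selected columns.
corr = df[features].corr()
print(corr.round(2))

# Heatmap with a fixed colour scale so that 0 sits in the middle.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)
fig.tight_layout()
plt.show()
```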

# Step 5: Ask a Question about the data
* Try to answer a question you have about the data using a plot or statistic.

In [None]:
# Which combination of residence type and work type has the most strokes?

# Filter only rows where stroke occurred and group by 'Residence_type' and 'work_type', calculating the count for each group

# Sort values by count in descending order to see which combination of 'Residence_type' and 'work_type' has the most strokes

# Concatenating 'Residence_type' and 'work_type' for better labeling

# Plotting the results with the combined label
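
The commented steps above can be sketched as follows, on a hypothetical seven-row toy frame.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "Residence_type": ["Urban", "Urban", "Rural", "Urban",
                       "Rural", "Rural", "Urban"],
    "work_type": ["Private", "Private", "Self-employed", "Govt_job",
                  "Private", "Private", "Private"],
    "stroke": [1, 1, 1, 0, 1, 0, 0],
})

# Keep only stroke cases, count them per group, and sort descending.
stroke_counts = (
    df[df["stroke"] == 1]
    .groupby(["Residence_type", "work_type"])
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)

# Concatenating 'Residence_type' and 'work_type' for better labeling
stroke_counts["label"] = (
    stroke_counts["Residence_type"] + " / " + stroke_counts["work_type"]
)

# Plotting the results with the combined label
fig, ax = plt.subplots(figsize=(6, 4))
ax.barh(stroke_counts["label"], stroke_counts["count"])
ax.invert_yaxis()  # largest count at the top
ax.set_xlabel("Number of strokes")
fig.tight_layout()
plt.show()
```

Raw counts favour large groups. To compare susceptibility between groups, the stroke rate per group, for example `df.groupby(["Residence_type", "work_type"])["stroke"].mean()`, is usually the more telling number.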


#### A static version of this file can be found in my GitHub repository: [https://github.com/hadasvolk/8400ML-WS](https://github.com/hadasvolk/8400ML-WS)