<a href="https://colab.research.google.com/github/anhle/AI-Healthcare/blob/master/AI_EHR/Ex/Compare_Demographics_Solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Demographic Analysis

### Dataset 
Heart Disease Dataset, donated to the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/heart+Disease. The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigators responsible for the data collection at each institution. They are:
1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

### Exercise Instructions
- Given the sex and age demographic fields, visualize the demographic breakdown, stratified by the predictor label 'num_label' (HINT: you can use Seaborn's catplot).
- Be sure to use the following age group boundaries when creating age bins: 0, 18, 25, 39, 54, 65, 90.
- The code below is provided for you to preprocess the dataframe.

In [0]:
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

### Preprocessing - Analyze different age groups

In [0]:
processed_basel_path = "https://raw.githubusercontent.com/anhle/AI-Healthcare/master/AI_EHR/Ex/data/processed_swiss.csv"
processed_swiss_df = pd.read_csv(processed_basel_path).replace('?', np.nan)
# Keep only rows with a binary label; copy so the assignments below do not
# write to a view of processed_swiss_df (avoids SettingWithCopyWarning)
subset_df = processed_swiss_df[processed_swiss_df['num_label'].isin([0, 1])].copy()
subset_df['sex'] = subset_df['sex'].replace({1: "male", 0: "female"})
subset_df['num_label'] = subset_df['num_label'].replace({1: "Positive Label: Diameter Narrowing", 0: "Negative Label: Less Diameter Narrowing"})
demo_features = ['sex', 'age', 'num_label']
demo_df = subset_df[demo_features].copy()

### Solution

In [0]:
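# Convert age to bins; pd.cut uses right-inclusive intervals by default, e.g. (0, 18]
age_bins = [0, 18, 25, 39, 54, 65, 90]
# Build labels such as "0 - 18" from consecutive bin edges
age_labels = [f"{lo} - {hi}" for lo, hi in zip(age_bins[:-1], age_bins[1:])]
# assign returns a new dataframe, so no slice of subset_df is modified in place
demo_df = demo_df.assign(age_bins=pd.cut(demo_df['age'], bins=age_bins, labels=age_labels))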
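
The binning step above depends on how `pd.cut` treats values that fall exactly on a boundary. The sketch below illustrates the default behavior on a handful of made-up ages (`toy_ages` is hypothetical and not part of the dataset): intervals are closed on the right, so an age of 18 lands in "0 - 18", and an age above the last edge becomes missing.

```python
import pandas as pd

age_bins = [0, 18, 25, 39, 54, 65, 90]
age_labels = [f"{lo} - {hi}" for lo, hi in zip(age_bins[:-1], age_bins[1:])]

# Hypothetical ages chosen to sit on or just past bin edges (not real data)
toy_ages = pd.Series([18, 19, 39, 40, 65, 66, 95])

# Default right=True: each bin is (lower, upper], so edge values go to the lower bin
toy_binned = pd.cut(toy_ages, bins=age_bins, labels=age_labels)

for age, label in zip(toy_ages, toy_binned):
    print(age, "->", label)
```

If ages outside 0 to 90 could occur in your data, it may be worth checking for missing values in the binned column before plotting.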

In [0]:
demo_df.head()

### Age Bin Grouping

In [0]:
ax = sns.countplot(x="age_bins", data=demo_df)
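
As a numeric cross-check of the bar heights in a count plot, the same per-bin totals can be tabulated with pandas. This is a minimal sketch on made-up ages (`toy_ages` is hypothetical); with the real data you would call `value_counts` on `demo_df['age_bins']` instead.

```python
import pandas as pd

age_bins = [0, 18, 25, 39, 54, 65, 90]
age_labels = [f"{lo} - {hi}" for lo, hi in zip(age_bins[:-1], age_bins[1:])]

# Hypothetical ages for illustration only (not real data)
toy_ages = pd.Series([34, 47, 52, 58, 61, 63, 70])
toy_bins = pd.cut(toy_ages, bins=age_bins, labels=age_labels)

# sort=False keeps the bins in age order rather than ordering by frequency
bin_counts = toy_bins.value_counts(sort=False)
print(bin_counts)
```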

### Age Bin and Gender Grouping

In [0]:
ax = sns.countplot(x="age_bins", hue="sex", data=demo_df)

### Age Bin and Gender Grouping Stratified by Heart Disease Condition (predictor label)

In [0]:
g = sns.catplot(x="age_bins", hue="sex", col="num_label",
                data=demo_df, kind="count",
                height=4, aspect=1.9);
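
The stratified plot can also be read as a table of counts. The sketch below uses `pd.crosstab` on a small made-up frame (`toy_df` and its shortened label values are hypothetical) to show the shape of such a table; applying the same call to `demo_df` should give the numbers behind each bar.

```python
import pandas as pd

# Hypothetical rows for illustration only (not real data)
toy_df = pd.DataFrame({
    "age_bins": ["39 - 54", "39 - 54", "54 - 65", "54 - 65", "54 - 65"],
    "sex": ["male", "female", "male", "male", "female"],
    "num_label": ["Positive", "Negative", "Positive", "Positive", "Negative"],
})

# Rows: label and age bin (one block per catplot column); columns: sex (the hue)
counts = pd.crosstab(
    index=[toy_df["num_label"], toy_df["age_bins"]],
    columns=toy_df["sex"],
)
print(counts)
```

A table like this makes it easier to spot age and sex groups that are sparse or absent in one of the label classes, which is harder to judge from bar heights alone.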