
**Reasons EDA is important**

* EDA can enable you to discover features or data transformations/aggregations that might have data leakage. This can save a tremendous amount of time and prevent you from building a flawed model.

* EDA can help you better translate and define modeling objectives and corresponding evaluation metrics from a machine learning/data science and business perspective.

* EDA can help inform strategies for handling missing, null, or zero-valued data. Missing values are common in EHR data, and you will need to choose imputation strategies accordingly.

* EDA can help identify subsets of features to use for feature engineering and modeling, along with appropriate feature transformations based on type (e.g. categorical vs. numerical features).


## 1. Dataset Schema Analysis

We will use the following UCI dataset for this lesson and the related exercise.

**Dataset**: Heart Disease Dataset donated to UCI ML Dataset Repository https://archive.ics.uci.edu/ml/datasets/heart+Disease. 

**Modeling Objective:** Predict the incidence of heart disease

Below is a list of areas that we will be looking for in our exploratory data analysis.
- Value Distributions - Is the feature's distribution uniform, normal, skewed, or severely unbalanced?
- Scale of Numerical Features
- Missing Values
- High Cardinality


**Dataset Schema**: The schema for the dataset that we will be using is on the page https://archive.ics.uci.edu/ml/datasets/heart+Disease under the **"Attribute Information"** header. Please note that only 14 attributes are used; they are listed below:
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
    * Value 1: typical angina
    * Value 2: atypical angina
    * Value 3: non-anginal pain
    * Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
    * Value 0: normal
    * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
    * Value 1: upsloping
    * Value 2: flat
    * Value 3: downsloping
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num: diagnosis of heart disease (angiographic disease status)
    * Value 0: < 50% diameter narrowing
    * Value 1: > 50% diameter narrowing
    * Values >1: linking to attributes 59 through 68, which are vessels (we won't focus on this for this course)
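
Because the course does not focus on label values above 1, one common approach is to collapse `num` into a binary target. This is an assumption for illustration, not something the schema prescribes. The sketch below uses a small hypothetical frame; `num_label` matches the column name used in the ETL step of this lesson, and `has_heart_disease` is a made-up name.

```python
import pandas as pd

# Hypothetical sample of the `num` column (0 = no disease, 1-4 = disease)
sample_df = pd.DataFrame({"num_label": [0, 2, 1, 0, 3, 4]})

# Collapse every value above 0 into a single positive class
sample_df["has_heart_disease"] = (sample_df["num_label"] > 0).astype(int)
print(sample_df)
```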


### OPTIONAL - Use TensorFlow Data Validation (TFDV) for EDA

You are free to use your tool of choice to explore the data and create an EDA report at the end. Note that TFDV currently has some bugs with the latest version of Chrome. The intention of this lesson is to expose you to TFDV as one option for exploring your data. While there are other tools for exploratory data analysis, below are some reasons that TFDV can be helpful:
* Interactive and simple descriptive statistics visualization tool  
* Scales to large datasets
    * It uses "Apache Beam's data-parallel processing framework to scale the computation of statistics over large datasets."  
* Can be used to detect anomalies and drift with new data or differences between training and testing splits

Before building a machine learning model, we must first analyze the dataset and assess it for common issues that may require preprocessing. We will use the TFDV library to help analyze and visualize the dataset. Some of the information has been adapted from the TFDV page (https://www.tensorflow.org/tfx/data_validation/get_started).

**IMPORTANT** You must use the Chrome browser to see the TFDV library visualizations.
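
As a rough sketch of what a TFDV pass over a pandas DataFrame could look like, the helper below computes descriptive statistics and infers a schema. The helper name `tfdv_overview` is hypothetical, and the import is done lazily on the assumption that TFDV may not be installed in every environment.

```python
import pandas as pd

def tfdv_overview(df: pd.DataFrame):
    """Compute TFDV statistics and an inferred schema for a DataFrame."""
    # Imported lazily so the rest of the notebook still runs without TFDV installed
    import tensorflow_data_validation as tfdv

    stats = tfdv.generate_statistics_from_dataframe(df)
    schema = tfdv.infer_schema(statistics=stats)
    return stats, schema

# In a notebook, once the dataset has been loaded into a DataFrame:
# import tensorflow_data_validation as tfdv
# stats, schema = tfdv_overview(processed_cleveland_df)
# tfdv.visualize_statistics(stats)   # interactive statistics view
# tfdv.display_schema(schema)        # inferred types and value domains
```

The same statistics object can later be compared against a new split with `tfdv.validate_statistics(statistics=new_stats, schema=schema)` to look for anomalies.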

NOTE: There are other ways to explore and analyze the data, but we will focus on these areas for the course.


### ETL 

In [0]:
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

**NOTE:** For this lesson and exercise we will use the processed Cleveland Clinic dataset rather than the raw one, so the categorical feature values have already been converted to numerical values.

In [0]:
# https://archive.ics.uci.edu/ml/datasets/heart+Disease, Cleveland dataset
processed_cleveland_path = "https://raw.githubusercontent.com/anhle/AI-Healthcare/master/AI_EHR/Ex/data/processed.cleveland.txt"
column_header_list = [
    'age',
    'sex',
    'cp',
    'trestbps',
    'chol',
    'fbs',
    'restecg',
    'thalach',
    'exang',
    'oldpeak',
    'slope',
    'ca',
    'thal',
    'num_label',
]
processed_cleveland_df = pd.read_csv(processed_cleveland_path, names=column_header_list)

In [0]:
processed_cleveland_df.head()

## 2. Analyze Value Distributions

* The normal (Gaussian) distribution is the well-known bell curve that most people are familiar with.
* The uniform distribution is where the unique values have almost the same frequency. This is important because it might indicate an issue with the data.
* Skewed/unbalanced distributions, as the name indicates, are where a smaller subset of values or a single value dominates.
* Other distributions such as bimodal and Poisson exist, but they are outside the scope of this course.
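
Alongside plots, a couple of summary numbers can hint at which of these shapes a feature has. The sketch below uses synthetic samples (not the lesson's dataset): sample skewness is close to 0 for symmetric data and positive for a long right tail, and the share of the most frequent value shows how strongly one value dominates. All variable names here are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
samples = pd.DataFrame({
    "normal": rng.normal(100, 17, 5000),
    "uniform": rng.uniform(-1, 0, 5000),
    "right_skewed": rng.exponential(1.0, 5000),
})

# Sample skewness: near 0 for symmetric data, positive for a long right tail
print(samples.skew().round(2))

# Share of the most frequent value; values near 1.0 mean one value dominates
labels = pd.Series(["no"] * 95 + ["yes"] * 5)
top_share = labels.value_counts(normalize=True).iloc[0]
print(top_share)
```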

In [0]:
# visualize categorical distributions
def visualize_distributions(df, c):
    df[c].value_counts().plot(kind='bar')
    plt.show()
    plt.close()

In [0]:
example_column1 = "sex"
print("Distribution for {} feature".format(example_column1))
processed_cleveland_df[example_column1].replace({1:"male", 0:"female"}).value_counts().plot(kind='bar')

Next, we will look at another categorical feature, chest pain type (`cp`).

In [0]:
example_column2 = "cp"
print("Distribution for {} feature".format(example_column2))
processed_cleveland_df[example_column2].replace({
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}).value_counts().plot(kind='barh')

### Review of normal and uniform distributions

**Normal Distribution**

In [0]:
mu, sigma = 100, 17.0 # mean and standard deviation
norm_dist = np.random.normal(mu, sigma, 100)

In [0]:
norm_ax = sns.histplot(norm_dist)  # histplot replaces the deprecated distplot(kde=False)
plt.show()

**Uniform Distribution**

In [0]:
uniform_dist = np.random.uniform(-1,0,1000)
uniform_ax = sns.histplot(uniform_dist)  # histplot replaces the deprecated distplot(kde=False)
plt.show()

**What type of distribution is this?**

In [0]:
# numerical field histogram
processed_cleveland_df['trestbps'].hist()

## 3. Missing Values


Missing values are especially common in healthcare, where you may have incomplete records or sparsely populated fields.

Missing Data Classification:

* MCAR stands for Missing Completely at Random. The data is missing for reasons unrelated to any values in the dataset, so there is no systematic reason for the missingness. In other words, every case has an equal probability of being missing. This is often due to an instrumentation or process issue, such as a broken instrument, that causes some of the data to go missing at random.

* MAR stands for Missing at Random. Despite the name, the missingness here is systematic: the probability that a value is missing depends on other observed data, but not on the missing value itself. For example, certain demographic groups might be more likely to skip some questions in a survey.

* MNAR stands for Missing Not at Random. This usually means the probability that a value is missing depends on the missing value itself.

Understanding why data is missing helps with choosing the best method to impute or drop the values in your dataset.
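
To make the three mechanisms concrete, the sketch below simulates each one on synthetic data (the columns, probabilities, and thresholds are all made up for illustration). Under MCAR every row has the same chance of losing its value, under MAR the chance depends on the observed `age`, and under MNAR it depends on the hidden `chol` value itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
age = rng.integers(30, 80, n)
# Synthetic cholesterol that rises slightly with age
df = pd.DataFrame({
    "age": age,
    "chol": 200 + 0.8 * (age - 30) + rng.normal(0, 45, n),
})

# MCAR: every row has the same 20% chance of losing its chol value
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends on an observed column (age), not on chol itself
mar_mask = rng.random(n) < np.where(df["age"] >= 60, 0.4, 0.05)

# MNAR: missingness depends on the value that goes missing (high chol is hidden more often)
mnar_mask = rng.random(n) < np.where(df["chol"] > 260, 0.6, 0.05)

print("full mean of chol:", round(df["chol"].mean(), 1))
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed = df["chol"].mask(mask)  # masked rows become NaN
    print(name, "missing rate:", round(observed.isna().mean(), 3),
          "| mean of observed chol:", round(observed.mean(), 1))
```

Comparing the observed means against the full mean gives a feel for why the mechanism matters: simply dropping rows is least harmful under MCAR, while MNAR distorts the observed values in a way the remaining data cannot reveal.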

**Code Concepts:** Create a function to check the percentage of missing and zero values in each column.

### Scaling of numerical features 
- Compare the min and max of each feature to see whether the features are on very different scales

In [0]:
numerical_feature_list = ['age',  'trestbps', 'chol', 'thalach', 'oldpeak' ]

In [0]:
processed_cleveland_df[numerical_feature_list].describe()
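
If the min and max differ widely across features, rescaling is one option. The sketch below shows two common choices, z-score standardization and min-max scaling, written in plain pandas on a few hypothetical rows; whether either is needed depends on the model you plan to use.

```python
import pandas as pd

# Hypothetical values on very different scales
num_df = pd.DataFrame({
    "age": [29, 45, 54, 63, 77],
    "chol": [126, 211, 240, 275, 564],
    "oldpeak": [0.0, 0.8, 1.5, 2.3, 6.2],
})

# Z-score standardization: subtract the mean, divide by the standard deviation
standardized = (num_df - num_df.mean()) / num_df.std(ddof=0)

# Min-max scaling: map each column onto the range [0, 1]
min_max = (num_df - num_df.min()) / (num_df.max() - num_df.min())

print(standardized.round(2))
print(min_max.round(2))
```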

In [0]:
# Missing values
def check_null_values(df):
    # Percent of null and zero entries per column
    null_df = pd.DataFrame({
        'columns': df.columns,
        'percent_null': df.isnull().sum() * 100 / len(df),
        'percent_zero': df.isin([0]).sum() * 100 / len(df),
    })
    return null_df

In [0]:
null_df = check_null_values(processed_cleveland_df)
null_df

View the results and see if there are any values that stand out. Again, you may need to handle different columns in different ways depending on their type and the reason for the missing or zero values.
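
One thing to watch for: the UCI files mark missing entries with `?`, which `isnull()` does not count, so it is worth checking whether the copy loaded here contains that placeholder. The sketch below, on hypothetical rows, converts such placeholders to NaN and then applies two simple imputation choices (median for a numerical feature, mode for a categorical code). These choices are illustrative, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; "?" stands in for a missing value
raw_df = pd.DataFrame({
    "chol": [233.0, 286.0, np.nan, 250.0],
    "ca": ["0.0", "3.0", "?", "0.0"],
})

clean_df = raw_df.copy()
# Anything that cannot be parsed as a number (such as "?") becomes NaN
clean_df["ca"] = pd.to_numeric(clean_df["ca"], errors="coerce")

# Numerical feature: fill with the median; categorical code: fill with the mode
clean_df["chol"] = clean_df["chol"].fillna(clean_df["chol"].median())
clean_df["ca"] = clean_df["ca"].fillna(clean_df["ca"].mode().iloc[0])
print(clean_df)
```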

Additional Resources:
* [Imputation Methods](https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779)
* [Advanced Imputation Methods](https://www.sciencedirect.com/science/article/pii/S2352914819302783)

## 4. Outliers

In [0]:
sns.boxplot(y=processed_cleveland_df['age'])

In [0]:
sns.boxplot(y=processed_cleveland_df['chol'])
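
The points drawn beyond the whiskers in a seaborn box plot are, by default, those more than 1.5 times the interquartile range (IQR) away from the quartiles. The sketch below applies the same rule numerically to a few hypothetical cholesterol readings; the helper name `iqr_outlier_mask` is made up for this example.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical cholesterol readings with one extreme value
chol = pd.Series([204, 212, 233, 240, 250, 263, 286, 564])
print(chol[iqr_outlier_mask(chol)])
```

A flagged value is not necessarily an error; in clinical data it may be a genuine extreme reading, so it is worth reviewing before dropping or capping it.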

## 5. High Cardinality

In [0]:
import numpy as np

def create_cardinality_feature(df):
    # Simulate a high-cardinality code by sampling from 900 possible values (100-999)
    num_rows = len(df)
    random_code_list = np.arange(100, 1000, 1)
    return np.random.choice(random_code_list, num_rows)

def count_unique_values(df, cat_col_list):
    # Copy so the new column is not written into a slice of the original DataFrame
    cat_df = df[cat_col_list].copy()
    # add feature with high cardinality
    cat_df['principal_diagnosis_code'] = create_cardinality_feature(cat_df)
    val_df = pd.DataFrame({'columns': cat_df.columns,
                           'cardinality': cat_df.nunique()})
    return val_df

In [0]:
categorical_feature_list = [ 'sex', 'cp', 'restecg', 'exang', 'slope', 'ca', 'thal']


In [0]:
val_df = count_unique_values(processed_cleveland_df, categorical_feature_list) 
val_df
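
The synthetic `principal_diagnosis_code` column above stands in for a real high-cardinality field. One common way to reduce cardinality, shown here as an assumption rather than the course's prescribed method, is to group rarely seen categories into a single bucket. The helper name, threshold, and codes below are hypothetical.

```python
import pandas as pd

def bucket_rare_categories(values: pd.Series, min_count: int,
                           other_label: str = "other") -> pd.Series:
    """Replace categories seen fewer than min_count times with other_label."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; swap the rest for the bucket label
    return values.where(~values.isin(rare), other_label)

codes = pd.Series(["250", "250", "250", "401", "401", "428", "585", "715"])
bucketed = bucket_rare_categories(codes, min_count=2)
print(codes.nunique(), "unique codes before,", bucketed.nunique(), "after")
```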

## 6. Demographic Analysis

In [0]:
# convert age to bins
demo_features = ['sex', 'age']
demo_df = processed_cleveland_df[demo_features].copy()
# Map the sex codes only; replacing on the whole frame would also touch any age equal to 0 or 1
demo_df['sex'] = demo_df['sex'].replace({1: "male", 0: "female"})
age_bins = np.arange(0, 90, 10)
# Labels such as "0 - 10", "10 - 20", one per bin
age_labels = ["{} - {}".format(lo, hi) for lo, hi in zip(age_bins[:-1], age_bins[1:])]
demo_df['age_bins'] = pd.cut(demo_df['age'], bins=age_bins, labels=age_labels)

In [0]:
demo_df

### Group by Age Bins

In [0]:
ax = sns.countplot(x="age_bins", data=demo_df)

### Group by Gender and Age Bins

In [0]:
ax = sns.countplot(x="age_bins", hue="sex", data=demo_df)
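
The same grouping can be read as a table instead of a chart. The sketch below builds a small hypothetical frame shaped like `demo_df` and uses `pd.crosstab` to count patients per age bin and sex; the rows and bin edges are made up for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like demo_df
sample_demo_df = pd.DataFrame({
    "sex": ["male", "female", "male", "male", "female", "male"],
    "age": [63, 41, 57, 67, 62, 37],
})
sample_demo_df["age_bins"] = pd.cut(
    sample_demo_df["age"],
    bins=[30, 40, 50, 60, 70],
    labels=["30 - 40", "40 - 50", "50 - 60", "60 - 70"],
)

# Rows: age bins, columns: sex, cells: number of patients
counts = pd.crosstab(sample_demo_df["age_bins"], sample_demo_df["sex"])
print(counts)
```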