
## Data Analysis: EDA Report

### Instructions:  
- For the following exercise, please create an EDA report and provide the requested information for each part. The exercise covers:
    - a. Which features are likely to be numerical features? 
    - b. Give the number of missing/zero values for each field.  
        - Why might the 'chol' field be all zeros?
    - c. Analyze value distributions for some selected fields.
    - d. Check for outliers and visualize with a box plot.
    - e. Analyze cardinality of categorical fields.

We will again use the UCI heart disease data for this exercise, but this time with a different dataset that shares the same schema.

**Dataset**: Heart Disease Dataset donated to UCI ML Dataset Repository https://archive.ics.uci.edu/ml/datasets/heart+Disease. The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigator responsible for the data collection at each institution. They would be:
1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

### ETL

In [0]:
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

**NOTE:** For this lesson and exercise we will use the processed dataset instead of the raw dataset, so the categorical feature values have already been converted to numerical values.

In [0]:
# The file holds the Swiss data, so the path variable is named to match
processed_swiss_path = "https://raw.githubusercontent.com/anhle/AI-Healthcare/master/AI_EHR/Ex/data/processed_swiss.csv"
processed_swiss_df = pd.read_csv(processed_swiss_path).replace('?', np.nan)
processed_swiss_df.head()

In [0]:
processed_swiss_df.dtypes
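
If any column in the CSV contained '?', replacing it with NaN does not by itself make the column numeric: the remaining entries are still strings. The sketch below shows one way to finish the conversion with `pd.to_numeric`. The `toy_df` frame is a hypothetical stand-in, not the exercise data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a CSV where '?' marks a missing value.
toy_df = pd.DataFrame({"age": [63, 41, 57], "oldpeak": ["1.5", "?", "0.0"]})

# Replacing '?' gives NaN, but the other entries are still strings, not floats.
toy_df = toy_df.replace("?", np.nan)
dtype_before = toy_df["oldpeak"].dtype

# pd.to_numeric parses the strings; errors="coerce" turns anything
# unparseable into NaN instead of raising.
numeric_df = toy_df.apply(pd.to_numeric, errors="coerce")
print(dtype_before, "->", numeric_df["oldpeak"].dtype)
```

Checking `dtypes` again after such a conversion is a quick way to confirm that the numerical fields are ready for plotting and summary statistics.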

### A. Dataset Schema Analysis
- Based on the schema provided, which features are likely to be numerical?

**Dataset Schema**: The schema for the dataset that we will be using is on the page https://archive.ics.uci.edu/ml/datasets/heart+Disease under the **"Attribute Information"** header. Please note that only 14 attributes are used; they are listed below.
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
    * Value 1: typical angina
    * Value 2: atypical angina
    * Value 3: non-anginal pain
    * Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
    * Value 0: normal
    * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
    * Value 1: upsloping
    * Value 2: flat
    * Value 3: downsloping
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num: diagnosis of heart disease (angiographic disease status)
    * Value 0: < 50% diameter narrowing
    * Value 1: > 50% diameter narrowing
    * Values >1: linking to attributes 59 through 68, which are vessels (we won't focus on this for this course)


### Solution
- 'age'  
- 'trestbps'
- 'chol' 
- 'thalach'
- 'oldpeak' 
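
The schema is the authoritative source for this answer, but a rough data-driven cross-check can help on unfamiliar datasets. The sketch below is a heuristic under stated assumptions: the function name `guess_numerical_columns` and the `min_unique=10` threshold are illustrative choices, not part of the lesson.

```python
import pandas as pd

def guess_numerical_columns(df, min_unique=10):
    """Return numeric-dtype columns with at least `min_unique` distinct values."""
    numeric_cols = df.select_dtypes(include="number").columns
    return [col for col in numeric_cols if df[col].nunique() >= min_unique]

# Hypothetical toy data: 'age' varies widely, 'sex' is a 0/1 code.
toy_df = pd.DataFrame({"age": list(range(30, 50)), "sex": [0, 1] * 10})
print(guess_numerical_columns(toy_df))
```

A heuristic like this would miss a numerical field that happens to be constant, such as an all-zero 'chol' column, which is why the schema should have the final say.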

### B. Missing Values
- Give the number of missing/zero values for each field
- Why might the 'chol' field be all zeros?

### Solution

In [0]:
# Missing and zero values, reported as a percentage of rows for each column
def check_null_values(df):
    null_df = pd.DataFrame({'columns': df.columns,
                            'percent_null': df.isnull().sum() * 100 / len(df),
                            'percent_zero': df.isin([0]).sum() * 100 / len(df)
                            })
    return null_df

In [0]:
null_df = check_null_values(processed_swiss_df)
null_df
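
The question asks for the number of missing/zero values, while `check_null_values` reports percentages. A minimal sketch of a count-based variant follows; `count_null_and_zero` and the toy frame are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def count_null_and_zero(df):
    """Return the number (not the percentage) of null and zero entries per column."""
    return pd.DataFrame({
        "num_null": df.isnull().sum(),
        "num_zero": (df == 0).sum(),  # NaN never compares equal to 0
    })

# Hypothetical toy data: 'chol' is all zeros, 'age' has one missing value.
toy_df = pd.DataFrame({"chol": [0, 0, 0], "age": [63, np.nan, 41]})
counts_df = count_null_and_zero(toy_df)
print(counts_df)
```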

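**Answer:** The 'chol' field might be all zeros because missing values were imputed with 0. A serum cholesterol of 0 mg/dl is not physiologically plausible, so the zeros most likely stand in for measurements that were never recorded.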
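
If the zeros in 'chol' are indeed placeholders, one option is to convert them back to NaN so they are counted as missing rather than treated as real measurements. The sketch below assumes that 0 means "not measured" for 'chol' only; the toy frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring an all-zero 'chol' column.
toy_df = pd.DataFrame({"chol": [0, 0, 0], "age": [63, 41, 57]})

# Assumption: 0 is a placeholder for "not measured" in 'chol' only,
# so other columns (where 0 can be a valid code) are left untouched.
cleaned_df = toy_df.copy()
cleaned_df["chol"] = cleaned_df["chol"].replace(0, np.nan)
print(cleaned_df.isnull().sum())
```

Restricting the replacement to one column matters here because 0 is a legitimate value for fields such as 'sex', 'fbs', and 'exang'.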

### C. Value Distributions
- Analyze value distribution for the categorical feature 'cp' and the numerical feature 'oldpeak'.
- Is the distribution of the 'oldpeak' feature closer to normal or uniform?

### Solution

**Note:** Feel free to use the Pandas dataframe value counts based function provided in the lesson.

In [0]:
# countplot draws a bar chart of the value counts for a categorical field
sns.countplot(x=processed_swiss_df['cp'])
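
The same information can be read as a table with `value_counts`. The sketch below uses a hypothetical toy series in place of the 'cp' column and shows both raw counts and shares.

```python
import pandas as pd

# Hypothetical chest pain codes standing in for the 'cp' column.
toy_cp = pd.Series([4, 4, 3, 4, 2, 3, 4], name="cp")

counts = toy_cp.value_counts().sort_index()                # rows per category
shares = toy_cp.value_counts(normalize=True).sort_index()  # fraction of rows
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```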


In [0]:
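# distplot is deprecated in recent seaborn releases; histplot with kde=True is its replacement
sns.histplot(processed_swiss_df['oldpeak'].astype(float), kde=True)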

**Answer:** Normal distribution
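
A visual judgement can be backed up with summary statistics. The sketch below compares skewness and excess kurtosis on synthetic samples (not the exercise data): a normal distribution has excess kurtosis near 0, while a uniform distribution sits near -1.2. The helper name `shape_summary` is illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(size=10_000))
uniform_sample = pd.Series(rng.uniform(size=10_000))

def shape_summary(values):
    """Skewness and excess kurtosis (pandas reports kurtosis relative to normal)."""
    return {"skew": values.skew(), "excess_kurtosis": values.kurt()}

normal_stats = shape_summary(normal_sample)
uniform_stats = shape_summary(uniform_sample)
print("normal :", normal_stats)
print("uniform:", uniform_stats)
```

Applying `shape_summary` to a real column such as 'oldpeak' (after dropping missing values) gives a rough indication of which reference shape it is closer to; a strongly nonzero skew would suggest neither fits well.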

### D. Outliers
- Give one feature that has outliers and visualize it with a box plot.

### Solution

In [0]:
sns.boxplot(y=processed_swiss_df['age'])
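
By default, the points a box plot draws beyond its whiskers are those lying more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below lists those points explicitly; `iqr_outliers` and the toy ages are hypothetical.

```python
import pandas as pd

def iqr_outliers(values, whisker=1.5):
    """Return entries lying more than `whisker` * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical ages with one value far from the rest.
toy_age = pd.Series([50, 52, 54, 55, 56, 58, 60, 95], name="age")
print(iqr_outliers(toy_age))
```

Listing the flagged rows alongside the plot makes it easier to decide whether they are data-entry errors or genuine extreme values.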

### E. Analyzing a Dataset for High Cardinality
- Select the categorical fields and give the cardinality for each field
- A synthetic diagnosis code field has been added to the dataset below for you.

In [0]:
def create_cardinality_feature(df):
    num_rows = len(df)
    random_code_list = np.arange(100, 1000, 1)
    return np.random.choice(random_code_list, num_rows)

new_df = processed_swiss_df.copy()
new_df['principal_diagnosis_code'] = create_cardinality_feature(new_df)

### Solution

In [0]:
categorical_feature_list = [ 'sex', 'cp', 'restecg', 'exang', 'slope', 'ca', 'thal', 'principal_diagnosis_code']

In [0]:
def count_unique_values(df, cat_col_list):
    cat_df = df[cat_col_list]
    val_df = pd.DataFrame({'columns': cat_df.columns, 
                       'cardinality': cat_df.nunique() } )
    return val_df

In [0]:
val_df = count_unique_values(new_df, categorical_feature_list) 
val_df
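
Once cardinality is known, a simple threshold can flag fields that may need special handling before modeling. This is a minimal sketch: the function name `flag_high_cardinality` and the `max_unique=20` cutoff are assumptions chosen for illustration, and the toy frame only imitates the synthetic code field above.

```python
import numpy as np
import pandas as pd

def flag_high_cardinality(df, cat_col_list, max_unique=20):
    """Report cardinality per column and whether it exceeds `max_unique`."""
    cardinality = df[cat_col_list].nunique()
    return pd.DataFrame({
        "cardinality": cardinality,
        "high_cardinality": cardinality > max_unique,
    })

# Hypothetical data: a binary field and a code field drawn from 900 values.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "sex": rng.integers(0, 2, size=200),
    "principal_diagnosis_code": rng.integers(100, 1000, size=200),
})
flags_df = flag_high_cardinality(toy_df, ["sex", "principal_diagnosis_code"])
print(flags_df)
```

The right cutoff depends on the dataset size and the downstream model, so the value used here should be treated as a starting point only.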