# Introduction
As explained in [01.data-introduction](01.data-introduction.ipynb), the author (Dr. Roberto Detrano) intended to build a model that predicts heart disease from 13 features and 1 target variable. He built the model on the Cleveland dataset and tested its predictions against the other datasets. The datasets are summarized in the table below:

| Filename                   | Records (Rows) | Features (Cols) | Description                 |
|:---------------------------|:--------------:|:---------------:|:----------------------------|
| processed.cleveland.data   |      303       |       14        | Used for building the model |
| processed.hungarian.data   |      294       |       14        | Used for testing            |
| processed.va.data          |      200       |       14        | Used for testing            |
| processed.switzerland.data |      123       |       14        | Used for testing            |

## Data Dictionary
For the processed datasets (described in [01.data-introduction](01.data-introduction.ipynb)), the author mapped the following variables from the original (raw) dataset to the processed dataset used for building the model. The table below is the <u>data dictionary for the processed dataset used in this study</u>.

| No  | Name         | Description                                              | Values                                                                                                                              |
|:----|:-------------|:---------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------|
| 1.  | age          | age in years                                             |                                                                                                                                     |
| 2.  | sex          | sex                                                      | 1: male <br> 0: female                                                                                                              |
| 3.  | cp           | chest pain type                                          | 1: typical angina <br> 2: atypical angina <br> 3: non-anginal pain <br> 4: asymptomatic                                             |
| 4.  | trestbps     | systolic blood pressure at rest (in mm Hg)               |                                                                                                                                     |
| 5.  | chol         | serum cholesterol in mg/dl                               |                                                                                                                                     |
| 6.  | fbs          | fasting blood sugar > 120 mg/dl                          | 1: true <br> 0: false                                                                                                               |
| 7.  | restecg      | resting electrocardiographic results                     | 0: normal <br> 1: having ST-T wave abnormality <br> 2: showing probable or definite left ventricular hypertrophy by Estes' criteria |
| 8.  | thalach      | maximum heart rate achieved                              |                                                                                                                                     |
| 9.  | exang        | exercise induced angina                                  | 1: yes <br> 0: no                                                                                                                   |
| 10. | oldpeak      | ST depression induced by exercise relative to rest       |                                                                                                                                     |
| 11. | slope        | the slope of the peak exercise ST segment                | 1: upsloping <br> 2: flat <br> 3: downsloping                                                                                       |
| 12. | ca           | number of major vessels (0-3) colored by fluoroscopy     |                                                                                                                                     |
| 13. | thal         |                                                          | 3: normal <br> 6: fixed defect <br> 7: reversible defect                                                                            |
| 14. | num (target) | diagnosis of heart disease (angiographic disease status) | 0: < 50% diameter narrowing <br> 1: > 50% diameter narrowing                                                                        |
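
The coded values in the data dictionary can be turned into readable labels when inspecting rows. The following is a minimal sketch, assuming the codes listed above; the `sample` frame and the `*_LABELS` names are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Code-to-label lookups copied from the data dictionary above.
SEX_LABELS = {1: "male", 0: "female"}
CP_LABELS = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}

# Tiny illustrative frame (made-up rows, not taken from the dataset).
sample = pd.DataFrame({"sex": [1, 0, 1], "cp": [4, 2, 1]})

# .map() yields NaN for any code missing from the lookup,
# which makes unexpected codes easy to spot.
sample["sex_label"] = sample["sex"].map(SEX_LABELS)
sample["cp_label"] = sample["cp"].map(CP_LABELS)
print(sample)
```

Keeping the numeric codes in the original columns and adding the labels as separate columns leaves the data usable for modelling.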

# The Objective
When building ML models with the heart-disease dataset, the following questions need to be answered at this stage (data exploration).
1. What problem needs to be solved?
2. Does the dataset have duplicate rows (records)?
3. What data types does the dataset contain?
4. Does the dataset have empty columns (features)?
5. Is the dataset sufficient for building the model(s)?
6. Which features appear more important than the others?

Once these questions are answered, the resulting list of action items will guide the next stage.

Let's load the necessary libraries and data for investigation.

In [2]:
# All required libraries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from custom_libs import helper
from importlib import reload
import numpy as np

In [8]:
filename = "switzerland"
header =['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
data = pd.read_csv(f'data/uci-heart-disease/processed.{filename}.data', names=header)
# The Cleveland file has 303 records; the Switzerland file loaded here has 123. Both have 14 columns.
data.head(5)

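|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|:--|----:|----:|---:|---------:|-----:|----:|--------:|--------:|------:|--------:|------:|---:|-----:|----:|
| 0 | 32  | 1   | 1  | 95       | 0    | ?   | 0       | 127     | 0     | .7      | 1     | ?  | ?    | 1   |
| 1 | 34  | 1   | 4  | 115      | 0    | ?   | ?       | 154     | 0     | .2      | 1     | ?  | ?    | 1   |
| 2 | 35  | 1   | 4  | ?        | 0    | ?   | 0       | 130     | 1     | ?       | ?     | ?  | 7    | 3   |
| 3 | 36  | 1   | 4  | 110      | 0    | ?   | 0       | 125     | 1     | 1       | 2     | ?  | 6    | 1   |
| 4 | 38  | 0   | 4  | 105      | 0    | ?   | 0       | 166     | 0     | 2.8     | 1     | ?  | ?    | 2   |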


## 1. What problem needs to be solved?
<b>Classify</b> - whether a patient has heart disease, based on the patient's medical data.

At a glance, this appears to be a binary classification problem.\
From the data dictionary, `num` is the target variable. Let's investigate whether the target variable can support a <u>binary classification</u> problem.

In [9]:
# Target variable (num) has integer data-type.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123 entries, 0 to 122
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   age       123 non-null    int64 
 1   sex       123 non-null    int64 
 2   cp        123 non-null    int64 
 3   trestbps  123 non-null    object
 4   chol      123 non-null    int64 
 5   fbs       123 non-null    object
 6   restecg   123 non-null    object
 7   thalach   123 non-null    object
 8   exang     123 non-null    object
 9   oldpeak   123 non-null    object
 10  slope     123 non-null    object
 11  ca        123 non-null    object
 12  thal      123 non-null    object
 13  num       123 non-null    int64 
dtypes: int64(5), object(9)
memory usage: 13.6+ KB


In [10]:
# Target variable (num) has more than two values/classes. Meanwhile, binary classification requires only 2 values/classes.
data['num'].value_counts()

num
1    48
2    32
3    30
0     8
4     5
Name: count, dtype: int64

### Observation:
The target variable (num) cannot support binary classification as-is, since it has more than 2 values/classes. \
As we understand the original intention from the data dictionary and the literature:
- Any patient with less than 50% vessel narrowing was marked as `value: 0` -- no heart disease.
- Any patient with more than 50% vessel narrowing was marked as `value: 1` -- has heart disease. This was further expanded to 1, 2, 3 and 4 based on the affected major vessels.

### Conclusion:
We <u>can safely convert this to a binary classification problem</u> by replacing every value of the target variable (num) other than `0` with `1`. In other words, any patient with more than 50% vessel narrowing is treated as having heart disease, which does not distort the original meaning much.

### Action(s)
1. Convert every value of the target variable (num) other than `0` to `1`, to support a binary classification problem.
2. Also, rename the target variable from `num` to `target` to give it a more meaningful name.

In [None]:
# 1. Convert any values in target variable (num) other than `0` to `1` - to support binary classification problem.
# Let's see the count before conversion
data['num'].value_counts()

In [None]:
# For the Cleveland data, classes 1, 2, 3 and 4 should add up to 55 + 36 + 35 + 13 = 139 records in class 1 after conversion.
data.loc[data['num']!=0,"num"]=1
data['num'].value_counts()

In [None]:
# 2. Also, rename the target variable from `num` to `target` to give a more meaningful name.
data.rename(columns={'num':'target'}, inplace=True)

In [None]:
# The new 'target' column tallies with the old 'num' column.
data['target'].value_counts()

## 2. Does the dataset have duplicate rows (records)?
We converted the target variable from multi-class to binary to support a binary classification problem. \
Now, let's investigate whether any duplicate records (rows) are present in the dataset.

In [None]:
# All columns were checked, and no duplicate records were found.
data[data.duplicated()]
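
To make the behaviour of `duplicated()` concrete, here is a minimal sketch on a made-up frame (the `toy` rows are hypothetical, not taken from the dataset). It assumes pandas' default of flagging only the later copies of a repeated row.

```python
import pandas as pd

# Made-up rows for illustration; rows 0 and 2 are identical.
toy = pd.DataFrame({"age": [63, 41, 63], "sex": [1, 0, 1], "num": [0, 1, 0]})

# duplicated() flags only the later copies by default (keep="first").
n_duplicates = int(toy.duplicated().sum())

# keep=False flags every member of a duplicate group,
# which is handy for inspecting the copies side by side.
all_copies = toy[toy.duplicated(keep=False)]

print(n_duplicates)
print(all_copies)
```

An empty result from `data[data.duplicated()]`, as in the cell above, therefore means that no row repeats an earlier row across all columns.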

## 3. What data types does the dataset contain?
We got a basic idea of the data types from the data dictionary. Let's investigate further and verify the content.

In [None]:
data.info()

All features appear to be numerical except for `ca` and `thal`. According to the data dictionary, they are categorical. Let's investigate further.

In [None]:
# For column 'ca', 4 records contain the value '?'.
data['ca'].value_counts()

In [None]:
# For column 'thal', 2 records contain the value '?'.
data['thal'].value_counts()

In [None]:
# Let's check whether the 6 records (4 from 'ca' and 2 from 'thal') are disjoint, i.e. no record has '?' in both features.
# Selecting rows with the OR operator returns 6 rows, which equals 4 + 2, so no row is counted twice.
# If OR had returned fewer than 6 rows, the AND operator would show the rows that have '?' in both features.
data[(data['thal']=='?') | (data['ca']=='?')]

In [None]:
# Less than 2% of the values in 'ca' and 'thal' are the uninterpretable '?' character.
# The outer quotes are double quotes so that the single-quoted arguments also parse on Python versions before 3.12.
print(f"In ca, {round(helper.value_count(data, 'ca', '?'), 2)}% of the values are '?'.")
print(f"In thal, {round(helper.value_count(data, 'thal', '?'), 2)}% of the values are '?'.")
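
The source of `helper.value_count` is not shown in this notebook. Judging from how it is used, it presumably returns the percentage of rows in a column that equal a given value. A minimal sketch of such a helper, under that assumption, might look like this (the name `value_percentage` and the `toy` frame are hypothetical):

```python
import pandas as pd


def value_percentage(frame: pd.DataFrame, column: str, value) -> float:
    """Return the percentage of rows in `column` that equal `value`."""
    if len(frame) == 0:
        return 0.0
    # The mean of a boolean Series is the fraction of True entries.
    return float((frame[column] == value).mean() * 100)


# Made-up values for illustration: one '?' out of four rows.
toy = pd.DataFrame({"ca": ["0.0", "?", "1.0", "0.0"]})
pct = value_percentage(toy, "ca", "?")
print(f"In ca, {round(pct, 2)}% of the values are '?'.")
```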

### Observation:
From the investigation, features `ca` and `thal` contain the uninterpretable value `?`. \
Together there are 6 such records (4 from `ca` and 2 from `thal`), and they are disjoint. \
Less than 2% of the data in `ca` and `thal` is uninterpretable.

### Conclusion
Since the features in the dataset were already narrowed from 76 to 14 based on their importance for meaningful medical interpretation, dropping the 6 records appears to be the more reasonable option. The reason: the two variables cannot be imputed with the mean because, although they appear in numeric format, they were originally categorical and have already been converted to codes.
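
To illustrate why the mean is a poor fill value for coded categories, here is a minimal sketch with made-up `thal` values (not taken from the dataset). It assumes the three codes from the data dictionary: 3, 6 and 7.

```python
import pandas as pd

# Made-up `thal` values for illustration; '?' marks an unknown entry.
thal = pd.Series(["3.0", "7.0", "?", "6.0", "7.0"])

# Convert to numbers; '?' cannot be parsed and becomes NaN.
thal_numeric = pd.to_numeric(thal, errors="coerce")

# The mean of the known codes is (3 + 7 + 6 + 7) / 4 = 5.75,
# which is not one of the legal category codes.
valid_codes = {3.0, 6.0, 7.0}
mean_value = thal_numeric.mean()
print(mean_value, mean_value in valid_codes)

# Dropping the unknown entries keeps only legal codes.
thal_clean = thal_numeric.dropna()
print(sorted(thal_clean.unique()))
```

A value such as 5.75 has no meaning in the data dictionary, which is why the unknown records are dropped here instead.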

### Action(s)
1. Drop the records where `ca` or `thal` is `?`.

In [None]:
# Drop the records where `ca` or `thal` is `?`. The new total is 303 - 6 = 297 records.
filtered = data[(data['thal'] == '?') | (data['ca'] == '?')].index
data.drop(filtered, inplace=True)
data.shape

## 4. Does the dataset have empty columns (features)?
Now that we have verified the data's consistency, let's <u>investigate whether any data is missing from the dataset</u>.

In [None]:
# No missing data. isnull() and isna() are the same; both check for None, NaN or NaT (datetime).
data.isnull().sum()
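
Note that `isnull()` only detects real missing markers. The `?` placeholders handled earlier are ordinary strings and would not be counted. The sketch below shows the difference on made-up CSV text (the rows and the reduced column list are hypothetical); `na_values` is a standard `pd.read_csv` parameter.

```python
import io

import pandas as pd

# Made-up CSV text in the same style as the processed files ('?' marks unknown values).
raw = "63,1,0.0,6.0\n67,1,?,3.0\n41,0,0.0,?\n"
cols = ["age", "sex", "ca", "thal"]

# Read as-is: '?' is an ordinary string, so isnull() reports nothing missing.
as_text = pd.read_csv(io.StringIO(raw), names=cols)
print(as_text.isnull().sum())

# Read with na_values="?": the same cells become NaN and are counted.
as_nan = pd.read_csv(io.StringIO(raw), names=cols, na_values="?")
print(as_nan.isnull().sum())
```

Either approach works, as long as the `?` cells are dealt with before the null check is trusted.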

## 5. Is the dataset sufficient for building the model(s)?
Now that the dataset has been cleansed, let's explore the data for further analysis (with graphs when needed).

In [None]:
# Let's save a copy of the cleansed dataset for building models.
# The name is derived from `filename`, so the saved file always matches the dataset that was loaded.
cleansed_path = f'data/uci-heart-disease/processed.{filename}-cleansed.data'
data.to_csv(cleansed_path, index=False)

# Load the saved data for verification.
df = pd.read_csv(cleansed_path)
df

In [None]:
# Ideally, both classes in the target variable have the same proportion, i.e. ~148 records each.
len(df["target"]) / 2

In [None]:
# Nevertheless, a slight variation is acceptable. Let's investigate whether the target classes are balanced.
df['target'].value_counts()

In [None]:
# To get the percentage proportion, let's view the normalized value counts.
# So, the deviation in the distribution is ~4%.
df['target'].value_counts(normalize=True)
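
The "deviation" quoted in the comment above can be computed directly. The sketch below assumes it means the distance of the larger class from an ideal 50/50 split, in percentage points; that reading is an interpretation, and the labels are made up for illustration.

```python
import pandas as pd

# Made-up labels for illustration: 54 negatives and 46 positives.
target = pd.Series([0] * 54 + [1] * 46, name="target")

proportions = target.value_counts(normalize=True)

# Deviation of the larger class from the ideal 50/50 split, in percentage points.
deviation = (proportions.max() - 0.5) * 100

print(proportions)
print(round(deviation, 1))
```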

In [None]:
# Let's see the distribution of the target classes in a bar chart.
df['target'].value_counts().plot(kind="bar", color=['steelblue', 'darksalmon']);

In [None]:
# Let's investigate which features have strong correlation with target.
df.corr()

In [None]:
# Let's see the correlation matrix with color intensity spectrum - the darker the blue is, the higher the correlation.
corr_matrix = df.corr(method='pearson')
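# Mask the upper triangle (including the diagonal) so each pair is shown once.
# Build the mask from corr_matrix (computed on df) instead of recomputing it from `data`.
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))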
plt.figure(figsize=(15, 10))

sns.heatmap(corr_matrix,
            mask=mask,
            annot=True,
            linewidths=0.5,
            fmt= ".2f",
            cmap="GnBu");

### 5.1 Feature Correlation
By eyeballing the chart above, we can draw conclusions from the Pearson correlation coefficients. The general rules for classifying correlation strength are:

| Level     |   Positive   |    Negative    |
|:----------|:------------:|:--------------:|
| Strong    |  0.70 to 1   |  -0.70 to -1   |
| Moderate  | 0.30 to 0.70 | -0.30 to -0.70 |
| Weak      |  0 to 0.30   |   0 to -0.30   |

If we apply the <u>general rules for classifying correlation</u> (using the table above), we observe:
* Features have <u>at most moderate correlation</u> with each other and with the target variable.
#### Positive correlation - positive linear relationship
* Six features have a moderate <u>positive correlation with the target</u> variable:
1. thal (0.53)
2. ca (0.46)
3. oldpeak (0.42)
4. exang (0.42)
5. cp (0.41)
6. slope (0.33)
* Six feature pairs have a moderate <u>positive correlation between variables</u>:
1. oldpeak and slope (0.58)
2. sex and thal (0.38)
3. cp and exang (0.38)
4. age and ca (0.36)
5. oldpeak and thal (0.34)
6. exang and thal (0.33)
#### Negative correlation - negative (inverse) linear relationship
* One feature has a moderate <u>negative correlation with the target</u> variable:
1. thalach (-0.42)
* Five feature pairs have a moderate <u>negative correlation between variables</u>:
1. age and thalach (-0.39)
2. thalach and slope (-0.39)
3. thalach and exang (-0.38)
4. thalach and oldpeak (-0.35)
5. cp and thalach (-0.34)
* Two more pairs fall just below the moderate threshold and count as weak:
1. thalach and thal (-0.27)
2. thalach and ca (-0.27)
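
The classification rules from the table can be written as a small function instead of being applied by eye. This is a minimal sketch: the function name is hypothetical, and assigning the boundary values 0.30 and 0.70 to the higher level is an assumption, since the table's ranges overlap at those points.

```python
def correlation_level(r: float) -> str:
    """Label a Pearson coefficient as 'strong', 'moderate' or 'weak'."""
    size = abs(r)
    if size > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if size >= 0.70:
        return "strong"
    if size >= 0.30:
        return "moderate"
    return "weak"


# A few coefficients quoted in the lists above.
for name, r in [("thal", 0.53), ("thalach", -0.42), ("thalach and ca", -0.27)]:
    print(f"{name}: {r:+.2f} -> {correlation_level(r)}")
```

Using the absolute value means the same thresholds apply to positive and negative correlations; the sign only gives the direction.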

### Further investigation needed
We applied Pearson's correlation to identify relationships. To make sure it is appropriate, we also need to check that:
- both variables are quantitative
- the variables are normally distributed
- there are no outliers
* Let's investigate the top 3 positive correlations:
1. oldpeak and slope (0.58)
2. thal and target (0.53)
3. ca and target (0.46)
* And one negative correlation:
1. thalach and target (-0.42)
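
Two of the checks listed above (the shape of the distribution and outliers) can be approximated numerically. The sketch below uses sample skewness and Tukey's 1.5 × IQR rule, which is one common convention and not necessarily the one the notebook goes on to use; the readings are made up for illustration.

```python
import pandas as pd

# Made-up resting blood pressure readings for illustration.
values = pd.Series([110, 120, 120, 125, 130, 130, 135, 140, 145, 200], name="trestbps")

# Positive skewness indicates a longer right tail.
skewness = values.skew()

# Tukey's rule: points more than 1.5 * IQR beyond the quartiles are flagged as outliers.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"skewness: {skewness:.2f}")
print(f"bounds: [{lower}, {upper}]")
print(outliers.tolist())
```

Skewness is only a rough indicator of non-normality; a histogram or a formal test gives a fuller picture.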

In [None]:
# The curve is slightly right-skewed. The mean and median are also slightly to the right.
reload(helper)
helper.draw_histogram_density_curve(df,'trestbps')

In [None]:
df['trestbps'].describe()

In [None]:
Q3 = df['trestbps'].quantile(0.75)