# Objective
Before building ML models on the UCI Heart-Disease dataset, this data-exploration stage needs to answer the following questions:
1. What problem needs solving?
2. What data does the dataset contain?
3. Which features appear more important than others?
4. What is the expected outcome?

Once these questions are answered, we can proceed to the next stage.

In [3]:
# Load required libraries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from custom_libs import helper
from importlib import reload
import numpy as np
import models.uci_heart_disease_dataset as uci
from models.uci_heart_disease_dataset import UCIHeartDiseaseData

In [4]:
reload(uci)
# The processed data file has no header row, so column names must be supplied when loading it. The original names are:
# ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
# The 'uci_heart_disease_dataset' library was created to handle matters related to the UCI dataset.
# Its 'get_standard_features' function returns a list of meaningful names that can be used as the dataframe header.
data = pd.read_csv(uci.UCIHeartDiseaseDataFile.cleveland_standard, names=uci.get_standard_features())

# 303 records and 14 columns.
data.head(5)

Unnamed: 0,Age,Gender,Chest Pain,BP Systolic,Cholesterol,Blood Sugar,Rest ECG,Exe. Max Heartrate,Exe. Induced Angina,Exe. ST Depression,Exe. ST Segment Slope,Major Vessels,Thalassemia,Target
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0


## 1. What problem needs solving?
<b>Classify</b> - whether a patient has heart disease based on his/her medical data.

At a glance, this appears to be a binary classification problem.\
From the data-dictionary, `Target` is the candidate label. Let's check whether this variable supports a <u>binary classification</u> problem.

In [5]:
# Let's see what the 'Target' variable is made of.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Age                    303 non-null    float64
 1   Gender                 303 non-null    float64
 2   Chest Pain             303 non-null    float64
 3   BP Systolic            303 non-null    float64
 4   Cholesterol            303 non-null    float64
 5   Blood Sugar            303 non-null    float64
 6   Rest ECG               303 non-null    float64
 7   Exe. Max Heartrate     303 non-null    float64
 8   Exe. Induced Angina    303 non-null    float64
 9   Exe. ST Depression     303 non-null    float64
 10  Exe. ST Segment Slope  303 non-null    float64
 11  Major Vessels          303 non-null    object 
 12  Thalassemia            303 non-null    object 
 13  Target                 303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB

In [6]:
# From the above, the 'Target' variable has an integer data type.
# Let's check whether it holds only two distinct values, as binary classification requires.
data[uci.UCIHeartDiseaseData.target].value_counts()

Target
0    164
1     55
2     36
3     35
4     13
Name: count, dtype: int64

### Observation:
The `Target` variable cannot support binary classification since it has more than 2 values/classes.

As we understand the original intention from the [data-dictionary](data_dictionary) and the literature:
- Any patient with less than 50% vessel narrowing was marked as `value: 0` -- no heart disease.
- Any patient with more than 50% vessel narrowing was marked as `value: 1` -- has heart disease. This was further expanded to 1, 2, 3 and 4 based on the affected major vessels.

### Conclusion:
We can convert this to a binary classification problem by replacing every target value other than `0` with `1`. The simplification is that any patient with more than 50% vessel narrowing is treated as having heart disease, which does not distort the original meaning much.

In [7]:
# Convert every target value other than `0` to `1` to support a binary classification problem.
# Classes 1, 2, 3 and 4 should add up to 55 + 36 + 35 + 13 = 139.
data.loc[data[uci.UCIHeartDiseaseData.target] != 0, uci.UCIHeartDiseaseData.target] = 1
data[uci.UCIHeartDiseaseData.target].value_counts()

Target
0    164
1    139
Name: count, dtype: int64
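As a sanity check on the arithmetic, the collapse of classes 1 to 4 into a single class can be reproduced from the class counts alone. The snippet below is a minimal standalone sketch, not part of the project code; the `binarize` helper is a hypothetical name introduced only for illustration.

```python
from collections import Counter

# Class counts copied from the value_counts() output before the conversion.
original_counts = {0: 164, 1: 55, 2: 36, 3: 35, 4: 13}


def binarize(label: int) -> int:
    """Map 0 to 0 (no heart disease) and any of 1-4 to 1 (heart disease)."""
    return 0 if label == 0 else 1


binary_counts = Counter()
for label, count in original_counts.items():
    binary_counts[binarize(label)] += count

print(dict(binary_counts))
```

The two resulting totals should match the `value_counts()` output above, and together they should still account for all 303 records.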

## 2. What data types does the dataset contain?
We got a basic idea of the data types from the 'data-dictionary'. Let's verify the actual content.

In [8]:
# The dictionary indicated all processed data is in numerical format. Let's investigate.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Age                    303 non-null    float64
 1   Gender                 303 non-null    float64
 2   Chest Pain             303 non-null    float64
 3   BP Systolic            303 non-null    float64
 4   Cholesterol            303 non-null    float64
 5   Blood Sugar            303 non-null    float64
 6   Rest ECG               303 non-null    float64
 7   Exe. Max Heartrate     303 non-null    float64
 8   Exe. Induced Angina    303 non-null    float64
 9   Exe. ST Depression     303 non-null    float64
 10  Exe. ST Segment Slope  303 non-null    float64
 11  Major Vessels          303 non-null    object 
 12  Thalassemia            303 non-null    object 
 13  Target                 303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB

In [9]:
# There are 12 numeric columns (11 float, 1 int) and two object columns ('Major Vessels' and 'Thalassemia'); the object type was not documented in the data-dictionary.
# Let's check for missing values (isnull/isna checks for None, NaN or NaT (datetime)).
data.isnull().sum()

Age                      0
Gender                   0
Chest Pain               0
BP Systolic              0
Cholesterol              0
Blood Sugar              0
Rest ECG                 0
Exe. Max Heartrate       0
Exe. Induced Angina      0
Exe. ST Depression       0
Exe. ST Segment Slope    0
Major Vessels            0
Thalassemia              0
Target                   0
dtype: int64

In [10]:
# The result shows no missing values. But why are 'Major Vessels' and 'Thalassemia' not numeric, as the data-dictionary indicates?
# Let's inspect the object columns manually to identify any missing-value markers.
for item in uci.get_standard_features():
    d = data[item]
    if d.dtype == 'object':
        print(data[item].value_counts())

Major Vessels
0.0    176
1.0     65
2.0     38
3.0     20
?        4
Name: count, dtype: int64
Thalassemia
3.0    166
7.0    117
6.0     18
?        2
Name: count, dtype: int64


In [11]:
# 'Major Vessels' and 'Thalassemia' should be numerical. However, both contain values marked as '?'.
# Let's look at Thalassemia on its own.
data[uci.UCIHeartDiseaseData.thalassemia].value_counts()

Thalassemia
3.0    166
7.0    117
6.0     18
?        2
Name: count, dtype: int64

In [12]:
# Let's list the records where either feature has a missing value marked as '?'.
data[(data[uci.UCIHeartDiseaseData.thalassemia]=='?') | (data[uci.UCIHeartDiseaseData.major_vessels]=='?')]

Unnamed: 0,Age,Gender,Chest Pain,BP Systolic,Cholesterol,Blood Sugar,Rest ECG,Exe. Max Heartrate,Exe. Induced Angina,Exe. ST Depression,Exe. ST Segment Slope,Major Vessels,Thalassemia,Target
87,53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0
166,52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0
192,43.0,1.0,4.0,132.0,247.0,1.0,2.0,143.0,1.0,0.1,2.0,?,7.0,1
266,52.0,1.0,4.0,128.0,204.0,1.0,0.0,156.0,1.0,1.0,2.0,0.0,?,1
287,58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,?,7.0,0
302,38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0


In [25]:
# There are 6 such records in total, and the two sets are disjoint (4 from 'Major Vessels' and 2 from 'Thalassemia'):
major_vessels_pct = round(helper.value_count(data, uci.UCIHeartDiseaseData.major_vessels, '?'), 2)
thalassemia_pct = round(helper.value_count(data, uci.UCIHeartDiseaseData.thalassemia, '?'), 2)
print(f"[{uci.UCIHeartDiseaseData.major_vessels}] has {major_vessels_pct}% of '?' values.")
print(f"[{uci.UCIHeartDiseaseData.thalassemia}] has {thalassemia_pct}% of '?' values.")

[Major Vessels] has 1.32% of '?' values.
[Thalassemia] has 0.66% of '?' values.
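The percentages above are simply the number of `?` entries divided by the number of records. The sketch below reproduces that arithmetic with the standard library only. `placeholder_percentage` is a hypothetical stand-in written for illustration; it is not the project's `helper.value_count`, whose implementation is not shown here.

```python
def placeholder_percentage(values, placeholder="?"):
    """Return the percentage of entries in `values` equal to `placeholder`."""
    if not values:
        return 0.0
    matches = sum(1 for value in values if value == placeholder)
    return 100.0 * matches / len(values)


# Rebuild the two columns from the value counts shown earlier (303 records each).
major_vessels = ["0.0"] * 176 + ["1.0"] * 65 + ["2.0"] * 38 + ["3.0"] * 20 + ["?"] * 4
thalassemia = ["3.0"] * 166 + ["7.0"] * 117 + ["6.0"] * 18 + ["?"] * 2

print(round(placeholder_percentage(major_vessels), 2))
print(round(placeholder_percentage(thalassemia), 2))
```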


### Observation:
- From the investigation, it appears the features `Major Vessels` and `Thalassemia` have missing values marked as `?`.
- Together there are 6 affected records (4 from `Major Vessels` and 2 from `Thalassemia`), and the two sets are disjoint.
- In total, fewer than 2% of the records (6 of 303) are affected.

### Conclusion
Though there are only 6 records with the undocumented missing-value marker `?`, the other datasets (Long Beach, Hungarian and Switzerland) need to be investigated too before proceeding.
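Since `isnull()` reported no missing values while `?` placeholders were present, one option is to tell pandas to parse `?` as missing at load time. The sketch below uses a small inline sample (three records copied from the outputs above) rather than the project's data files, and it uses the original UCI column names instead of `uci.get_standard_features()`.

```python
import io

import pandas as pd

# Original UCI column names, as listed in the loading cell of this notebook.
columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
           'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']

# Three records copied from the outputs above; two contain the '?' placeholder.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0\n"
    "52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0\n"
)

# na_values tells read_csv to parse '?' as NaN instead of keeping it as text.
sample_df = pd.read_csv(sample, names=columns, na_values='?')

print(sample_df[['ca', 'thal']].dtypes)
print(sample_df[['ca', 'thal']].isnull().sum())
```

With this option the two columns load as numeric and the placeholders show up in `isnull()`, so the manual scan over object columns becomes unnecessary. Whether that is preferable to keeping the raw `?` visible during investigation is a design choice.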

## A Brief Investigation on Other Datasets

In [26]:
# Hungarian dataset
reload(uci)
data = pd.read_csv(uci.UCIHeartDiseaseDataFile.hungarian_standard, names=uci.get_standard_features())
for item in uci.get_standard_features():
    d = data[item]
    if d.dtype == 'object':
        print(f"[{item}] has {round(helper.value_count(data, item, '?'), 2)}% of ? values.")

[BP Systolic] has 0.34% of ? values.
[Cholesterol] has 7.82% of ? values.
[Blood Sugar] has 2.72% of ? values.
[Rest ECG] has 0.34% of ? values.
[Exe. Max Heartrate] has 0.34% of ? values.
[Exe. Induced Angina] has 0.34% of ? values.
[Exe. ST Segment Slope] has 64.63% of ? values.
[Major Vessels] has 98.98% of ? values.
[Thalassemia] has 90.48% of ? values.


In [27]:
# Long Beach dataset
reload(uci)
data = pd.read_csv(uci.UCIHeartDiseaseDataFile.longbeach_standard, names=uci.get_standard_features())
for item in uci.get_standard_features():
    d = data[item]
    if d.dtype == 'object':
        print(f"[{item}] has {round(helper.value_count(data, item, '?'), 2)}% of ? values.")

[BP Systolic] has 28.0% of ? values.
[Cholesterol] has 3.5% of ? values.
[Blood Sugar] has 3.5% of ? values.
[Exe. Max Heartrate] has 26.5% of ? values.
[Exe. Induced Angina] has 26.5% of ? values.
[Exe. ST Depression] has 28.0% of ? values.
[Exe. ST Segment Slope] has 51.0% of ? values.
[Major Vessels] has 99.0% of ? values.
[Thalassemia] has 83.0% of ? values.


In [28]:
# Switzerland dataset
reload(uci)
data = pd.read_csv(uci.UCIHeartDiseaseDataFile.switzerland_standard, names=uci.get_standard_features())
for item in uci.get_standard_features():
    d = data[item]
    if d.dtype == 'object':
        print(f"[{item}] has {round(helper.value_count(data, item, '?'), 2)}% of ? values.")

[BP Systolic] has 1.63% of ? values.
[Blood Sugar] has 60.98% of ? values.
[Rest ECG] has 0.81% of ? values.
[Exe. Max Heartrate] has 0.81% of ? values.
[Exe. Induced Angina] has 0.81% of ? values.
[Exe. ST Depression] has 4.88% of ? values.
[Exe. ST Segment Slope] has 13.82% of ? values.
[Major Vessels] has 95.93% of ? values.
[Thalassemia] has 42.28% of ? values.



### Important Notice!!!
As the brief investigations above show, the other datasets (Long Beach, Hungarian and Switzerland) also contain the `?` character in many of their features.

This needs a thorough investigation before the datasets can be considered valid.
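To make the comparison across datasets easier to read, the percentages printed above can be screened against a cut-off. The sketch below is illustrative only: the 50% threshold and the `mostly_missing` helper are assumptions made for this example, not a rule taken from the data-dictionary.

```python
# Percentage of '?' values per feature, copied from the outputs above.
missing_pct = {
    "Hungarian": {
        "BP Systolic": 0.34, "Cholesterol": 7.82, "Blood Sugar": 2.72,
        "Rest ECG": 0.34, "Exe. Max Heartrate": 0.34, "Exe. Induced Angina": 0.34,
        "Exe. ST Segment Slope": 64.63, "Major Vessels": 98.98, "Thalassemia": 90.48,
    },
    "Long Beach": {
        "BP Systolic": 28.0, "Cholesterol": 3.5, "Blood Sugar": 3.5,
        "Exe. Max Heartrate": 26.5, "Exe. Induced Angina": 26.5, "Exe. ST Depression": 28.0,
        "Exe. ST Segment Slope": 51.0, "Major Vessels": 99.0, "Thalassemia": 83.0,
    },
    "Switzerland": {
        "BP Systolic": 1.63, "Blood Sugar": 60.98, "Rest ECG": 0.81,
        "Exe. Max Heartrate": 0.81, "Exe. Induced Angina": 0.81, "Exe. ST Depression": 4.88,
        "Exe. ST Segment Slope": 13.82, "Major Vessels": 95.93, "Thalassemia": 42.28,
    },
}

THRESHOLD = 50.0  # assumed cut-off, chosen for illustration only


def mostly_missing(feature_pcts, threshold=THRESHOLD):
    """Return the features whose '?' percentage exceeds the threshold, sorted by name."""
    return sorted(name for name, pct in feature_pcts.items() if pct > threshold)


for dataset, pcts in missing_pct.items():
    print(dataset, mostly_missing(pcts))
```

In all three datasets `Major Vessels` is almost entirely `?`, which is one reason these files cannot simply be merged with the Cleveland data.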

Meanwhile, the EDA was halted here. Further investigation of the processed dataset led to an investigation of the raw dataset too. Please find the details here:
- [processed dataset investigation](2.1-uci-processed-dataset-investigation.ipynb).
- [raw dataset investigation](1.1-uci-raw-dataset-investigation.ipynb).

### Investigation Summary:
After investigating the processed dataset

In [None]:
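# Drop the records where feature `ca` or `thal` is `?`. The new total is 303 - 6 = 297 records.
# NOTE: this assumes `data` holds the Cleveland dataset with the original UCI column names ('ca', 'thal').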
filtered = data[(data['thal'] == '?') | (data['ca'] == '?')].index
data.drop(filtered, inplace=True)
data.shape

## 4. Does the dataset have empty columns (features)?
Now that we have verified the data consistencies, let's <u>investigate if any data is missing in the dataset</u>.

In [None]:
# No missing data. isnull and isna are the same; both check for None, NaN or NaT (datetime).
data.isnull().sum()

## 5. Is the dataset sufficient for building the model(s)?
Now that the dataset has been cleansed, let's explore the data for further analysis (with graphs when needed).

In [None]:
# Let's save a copy of the cleansed dataset for building models.
data.to_csv('data/uci-heart-disease/processed.cleveland-cleansed.data', index=False)

# Load the saved data for verification.
df = pd.read_csv('data/uci-heart-disease/processed.cleveland-cleansed.data')
df

In [None]:
# Ideally, both classes in the target variable should have the same proportion, i.e. ~148 records each.
len(df["target"]) / 2

In [None]:
# Nevertheless, a slight imbalance can be accepted. Let's investigate if the target classes are balanced.
df['target'].value_counts()

In [None]:
# To get the percentage proportion, let's view the normalized value counts.
# So, the deviation in the distribution is ~4%.
df['target'].value_counts(normalize=True)
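The ~4% figure can be checked by hand from the counts shown earlier. The sketch below is a rough check that assumes the six dropped records are exactly those listed in the `?` investigation (four with `Target` 0 and two with `Target` 1). The cells in this part of the notebook were not executed, so these counts are derived rather than observed.

```python
# Class counts before cleansing, from the binarised value_counts() output.
before = {0: 164, 1: 139}

# Assumption: the six dropped records are those listed in the '?' investigation
# (four with Target 0, two with Target 1).
dropped = {0: 4, 1: 2}

after = {label: before[label] - dropped[label] for label in before}
total = sum(after.values())

for label, count in after.items():
    share = count / total
    print(f"class {label}: {count} records, share {share:.2%}, "
          f"deviation from an even split {abs(share - 0.5):.2%}")
```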

In [None]:
# Let's see the distribution of the target variable's classes in a bar chart.
df['target'].value_counts().plot(kind="bar", color=['steelblue', 'darksalmon']);

In [None]:
# Let's investigate which features have strong correlation with target.
df.corr()

In [None]:
# Let's see the correlation matrix as a heatmap - the darker the blue, the higher the correlation.
corr_matrix = df.corr(method='pearson')
# Mask the upper triangle (including the diagonal) so each pair is shown only once.
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
plt.figure(figsize=(15, 10))

sns.heatmap(corr_matrix,
            mask=mask,
            annot=True,
            linewidths=0.5,
            fmt= ".2f",
            cmap="GnBu");
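For readers unfamiliar with the masking step, `sns.heatmap` hides every cell where the mask is `True`. The sketch below shows what `np.triu` produces on a 3x3 stand-in for the correlation matrix; the `k=1` variant is an optional alternative, not what the cell above uses.

```python
import numpy as np

# A 3x3 stand-in for the shape of the correlation matrix.
full_mask = np.triu(np.ones((3, 3), dtype=bool))        # True on and above the diagonal
upper_only = np.triu(np.ones((3, 3), dtype=bool), k=1)  # True strictly above the diagonal

print(full_mask)
print(upper_only)
```

With the default `k=0` the diagonal (each feature against itself, always 1.0) is hidden along with the upper triangle. Passing `k=1` would keep the diagonal visible.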

### 5.1 Feature Correlation
By eyeballing the chart above, we can draw the following conclusions from Pearson's correlation.

| Level     |   Positive   |    Negative    |
|:----------|:------------:|:--------------:|
| Strong    |  0.70 to 1   |  -0.70 to -1   |
| Moderate  | 0.30 to 0.70 | -0.30 to -0.70 |
| Weak      |  0 to 0.30   |   0 to -0.30   |

If we apply the <u>general rules for classifying correlation</u> (using the table above), we observe:
* Features have <u>only moderate correlation</u> with each other and with the target variable.
#### Positive correlation - positive linear relationship
* Six features have a moderate <u>positive correlation to the target</u> variable:
1. thal (0.53)
2. ca (0.46)
3. oldpeak (0.42)
4. exang (0.42)
5. cp (0.41)
6. slope (0.33)
* Six feature pairs have a moderate <u>positive correlation between variables</u>:
1. oldpeak and slope (0.58)
2. sex and thal (0.38)
3. cp and exang (0.38)
4. age and ca (0.36)
5. oldpeak and thal (0.34)
6. exang and thal (0.33)
#### Negative correlation - negative (inverse) linear relationship
* One feature has moderate <u>negative correlation to target</u> variable:
1. thalach (-0.42)
* Five feature pairs have a moderate <u>negative correlation between variables</u>:
1. age and thalach (-0.39)
2. thalach and slope (-0.39)
3. thalach and exang (-0.38)
4. thalach and oldpeak (-0.35)
5. cp and thalach (-0.34)
* Two further pairs fall just below the moderate threshold and are weak by the table above:
1. thalach and thal (-0.27)
2. thalach and ca (-0.27)
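The thresholds in the table can be written as a small helper, which makes it easy to label any coefficient consistently. This is a minimal sketch: `correlation_level` is a hypothetical name, and because the table's ranges overlap at exactly 0.30 and 0.70, the code assumes a boundary value belongs to the higher level.

```python
def correlation_level(r: float) -> str:
    """Classify a Pearson coefficient by magnitude using the table's thresholds."""
    magnitude = abs(r)
    if magnitude >= 0.70:
        return "strong"
    if magnitude >= 0.30:  # assumption: boundary values count as the higher level
        return "moderate"
    return "weak"


# A few of the coefficients listed above.
examples = {
    "oldpeak and slope": 0.58,
    "thal and target": 0.53,
    "thalach and target": -0.42,
    "thalach and ca": -0.27,
}

for pair, r in examples.items():
    print(f"{pair}: {r:+.2f} -> {correlation_level(r)}")
```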

### Further investigation needed
We used Pearson's correlation to identify the relationships above. Pearson's correlation relies on the following assumptions, so we also need to verify that:
- both variables are quantitative
- the variables are normally distributed
- there are no outliers
* Let's investigate the top 3 positive correlations:
1. oldpeak and slope (0.58)
2. thal and target (0.53)
3. ca and target (0.46)
* And one negative correlation:
1. thalach and target (-0.42)
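To illustrate why the outlier check matters, the sketch below computes Pearson's coefficient from its definition on a tiny made-up series and then replaces one point with an extreme value. The numbers are synthetic and chosen only to make the effect visible; they are not taken from the dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's r: covariance of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


x = [1, 2, 3, 4, 5]
y_clean = [2, 4, 6, 8, 10]        # perfectly linear in x
y_with_outlier = [2, 4, 6, 8, -30]  # same series with one extreme value

r_clean = pearson(x, y_clean)
r_outlier = pearson(x, y_with_outlier)
print(r_clean, r_outlier)
```

A single extreme point is enough to turn a perfect positive correlation into a moderate negative one here, which is why the coefficients above should be read together with a check for outliers.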

In [None]:
# The curve is slightly right-skewed, so the mean and median sit slightly to the right of the peak.
reload(helper)
helper.draw_histogram_density_curve(df,'trestbps')

In [None]:
df['trestbps'].describe()

In [None]:
Q3 = df['trestbps'].quantile(0.75)