# 1. Objective
To build ML models using the UCI Heart-Disease dataset, this stage (data exploration) clarifies the following questions.
1. What problem needs solving?
2. What data does the dataset contain?
3. Which features appear more important than others?
4. What is the expected outcome?

Once these questions are clarified, a list of action items can be drawn up for the next stage.

In [None]:
# Load required libraries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from custom_libs import helper
from importlib import reload
import numpy as np
import models.uci_heart_disease_dataset as uci
from models.uci_heart_disease_dataset import UCIHeartDiseaseData

In [None]:
reload(uci)
# The processed data file has no header row, so the original column names could be mapped as follows:
# header =['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
# The 'uci_heart_disease_dataset' library was created to handle matters related to the UCI dataset.
# Its 'get_standard_features' method returns more meaningful column names instead, for readability.
data = pd.read_csv(uci.UCIHeartDiseaseDataFile.cleveland_standard, names=uci.get_standard_features())

# 303 records and 14 columns.
data.head(5)

## 1. What problem needs solving?
<b>Classify</b> - whether a patient has heart disease based on his/her medical data.

At a glance, it appears to be a binary classification problem.\
From the data-dictionary, `Target` appears to be the target variable. Let's investigate whether the target variable can support a <u>binary classification</u> problem.

In [None]:
# Target variable has integer data-type.
data.info()

In [None]:
# The target variable has more than two values/classes, whereas binary classification requires exactly two.
data[uci.UCIHeartDiseaseData.target].value_counts()

### Observation:
The target variable cannot support binary classification since it has more than 2 values/classes.

As we understand the original intention from the [data-dictionary](data_dictionary) and the literature:
- Any patient with less than 50% vessel narrowing was marked as `value: 0` -- no heart disease.
- Any patient with more than 50% vessel narrowing was marked as `value: 1` -- has heart disease. This was further expanded to 1, 2, 3 and 4 based on the affected major vessels.

### Conclusion:
We can safely convert this to a binary classification problem by replacing every value of the target variable other than `0` with `1`. This simplification means that any patient with more than 50% vessel narrowing is suspected of having heart disease, which does not distort the original meaning much.

In [None]:
# Convert every non-zero value of the target variable to `1` to support binary classification.
# The counts for classes 1, 2, 3 and 4 should add up to 55 + 36 + 35 + 13 = 139.
data.loc[data[uci.UCIHeartDiseaseData.target] != 0, uci.UCIHeartDiseaseData.target] = 1
data[uci.UCIHeartDiseaseData.target].value_counts()
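
As a quick sanity check of the arithmetic in the comment above, the following sketch rebuilds a toy target column from the quoted class counts and collapses it in the same way. The count of 164 for class `0` is an inference (303 - 139), and the variable names are illustrative rather than part of the notebook.

```python
import pandas as pd

# Class counts quoted in the notebook comment; 164 is inferred as 303 - 139.
counts = {0: 164, 1: 55, 2: 36, 3: 35, 4: 13}
toy_target = pd.Series([cls for cls, n in counts.items() for _ in range(n)])

# Collapse every non-zero class into 1 (a vectorized equivalent of the .loc assignment).
binary_target = (toy_target != 0).astype(int)
binary_counts = binary_target.value_counts().to_dict()
print(binary_counts)
```

If the collapse is correct, the two resulting classes should account for all 303 records.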

## 2. Does the dataset have duplicate rows (records)?
We converted the target variable from multi-class to binary to support a binary classification problem. \
Now, let's investigate whether any duplicate records (rows) are present in the dataset.

In [None]:
# All columns were checked, and no duplicate records were found.
data[data.duplicated()]

## 3. What data types does the dataset contain?
We got a basic idea of the data types from the 'data-dictionary'. Let's investigate and verify the content further.

In [None]:
# There are 11 numerical (float and int) columns and two object columns ('Major Vessels' and 'Thalassemia').
# The dictionary indicated that all processed data is in numerical format. Let's investigate.
data.info()

In [None]:
# Let's check for missing data. isnull/isna detect None, NaN and NaT (datetime), which should be sufficient for detecting missing values.
# The result shows no missing values. But why are the data types of 'Major Vessels' and 'Thalassemia' not numeric, as the data-dictionary indicates?
# Let's investigate further manually in the next cell.
data.isnull().sum()

In [None]:
# 'Major Vessels' and 'Thalassemia' should be numerical. However, 6 entries are marked with '?', presumably to flag missing values.
for item in uci.get_standard_features():
    d = data[item]
    if d.dtype == 'object':
        print(d.value_counts())

In [None]:
# For column 'Thalassemia', 2 records contain the value '?'.
data[uci.UCIHeartDiseaseData.thalassemia].value_counts()

In [None]:
# The total is 6, and the two sets are disjoint (4 from 'Major Vessels' and 2 from 'Thalassemia'):
data[(data[uci.UCIHeartDiseaseData.thalassemia]=='?') | (data[uci.UCIHeartDiseaseData.major_vessels]=='?')]

In [59]:
# Less than 2% of the values in 'Major Vessels' and 'Thalassemia' are '?'.
# Double quotes on the outside keep the f-strings valid on Python versions before 3.12.
print(f"In ca there is {round(helper.value_count(data, uci.UCIHeartDiseaseData.major_vessels, '?'), 2)}% of ? values found.")
print(f"In thal there is {round(helper.value_count(data, uci.UCIHeartDiseaseData.thalassemia, '?'), 2)}% of ? values found.")

In ca there is 1.32% of ? values found.
In thal there is 0.66% of ? values found.
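
The implementation of `helper.value_count` is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming it returns the percentage of rows in a column that equal a given value, is given below. The name `value_count_pct` and the toy frame are hypothetical.

```python
import pandas as pd

def value_count_pct(frame: pd.DataFrame, column: str, value) -> float:
    """Hypothetical stand-in for helper.value_count: percent of rows in `column` equal to `value`."""
    return float((frame[column] == value).mean() * 100)

# Toy frame mirroring the Cleveland proportions: 4 and 2 '?' entries out of 303 rows.
toy = pd.DataFrame({
    "ca": ["?"] * 4 + ["0.0"] * 299,
    "thal": ["?"] * 2 + ["3.0"] * 301,
})
ca_pct = round(value_count_pct(toy, "ca", "?"), 2)
thal_pct = round(value_count_pct(toy, "thal", "?"), 2)
print(ca_pct, thal_pct)
```

Under this assumption, 4 of 303 and 2 of 303 rows reproduce the percentages printed above.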


### Observation:
- From the investigation, it appears that the features `Major Vessels` and `Thalassemia` have missing values marked as `?`.
- Together there are 6 such records (4 from `Major Vessels` and 2 from `Thalassemia`), and they are disjoint.
- Less than 2% of the data in each of `Major Vessels` and `Thalassemia` is uninterpretable.
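
One way to make such markers visible to `isnull()` from the start is the `na_values` parameter of `pd.read_csv`. The sketch below uses three made-up rows and a reduced set of the original column names, not the actual data file, to show the effect.

```python
import io
import pandas as pd

# Three made-up rows in a headerless, comma-separated layout; '?' marks a missing entry.
raw = io.StringIO("63.0,1.0,0.0,6.0\n67.0,1.0,?,3.0\n37.0,0.0,0.0,?\n")
toy = pd.read_csv(raw, names=["age", "sex", "ca", "thal"], na_values="?")

# With '?' parsed as NaN, the columns stay numeric and isnull() reports the gaps.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
print(toy.dtypes)
```

With this option the affected columns are read as floats instead of objects, so the missing-value check and the dtype check agree.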

### Conclusion
- Although there are only 6 records with the undocumented missing-value marker '?', what about the other datasets (Long Beach, Hungarian and Switzerland)?
- Let's investigate the other datasets before proceeding.

## A Brief Investigation on Other Datasets

In [71]:
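reload(uci)
# Use a separate variable so that the Cleveland `data` frame is not overwritten.
other = pd.read_csv(uci.UCIHeartDiseaseDataFile.hungarian_standard, names=uci.get_standard_features())
# Note: the label is hard-coded as 'ca', but each printed line belongs to a different
# object-typed column, in column order.
for item in uci.get_standard_features():
    if other[item].dtype == 'object':
        print(f"In ca there is {round(helper.value_count(other, item, '?'), 2)}% of ? values found.")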

In ca there is 0.34% of ? values found.
In ca there is 7.82% of ? values found.
In ca there is 2.72% of ? values found.
In ca there is 0.34% of ? values found.
In ca there is 0.34% of ? values found.
In ca there is 0.34% of ? values found.
In ca there is 64.63% of ? values found.
In ca there is 98.98% of ? values found.
In ca there is 90.48% of ? values found.


In [72]:
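reload(uci)
# Use a separate variable so that the Cleveland `data` frame is not overwritten.
other = pd.read_csv(uci.UCIHeartDiseaseDataFile.longbeach_standard, names=uci.get_standard_features())
# Note: the label is hard-coded as 'ca', but each printed line belongs to a different
# object-typed column, in column order.
for item in uci.get_standard_features():
    if other[item].dtype == 'object':
        print(f"In ca there is {round(helper.value_count(other, item, '?'), 2)}% of ? values found.")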

In ca there is 28.0% of ? values found.
In ca there is 3.5% of ? values found.
In ca there is 3.5% of ? values found.
In ca there is 26.5% of ? values found.
In ca there is 26.5% of ? values found.
In ca there is 28.0% of ? values found.
In ca there is 51.0% of ? values found.
In ca there is 99.0% of ? values found.
In ca there is 83.0% of ? values found.


In [81]:
reload(uci)
# Use a separate variable so that the Cleveland `data` frame is not overwritten.
other = pd.read_csv(uci.UCIHeartDiseaseDataFile.switzerland_standard, names=uci.get_standard_features())
for item in uci.get_standard_features():
    if other[item].dtype == 'object':
        print(f"[{item}] has {round(helper.value_count(other, item, '?'), 2)}% of ? values.")

[BP Systolic] has 1.63% of ? values.
[Blood Sugar] has 60.98% of ? values.
[Rest ECG] has 0.81% of ? values.
[Exe. Max Heartrate] has 0.81% of ? values.
[Exe. induced Angina] has 0.81% of ? values.
[Exe. ST Depression] has 4.88% of ? values.
[Exe. ST Segment Slope] has 13.82% of ? values.
[Major Vessels] has 95.93% of ? values.
[Thalassemia] has 42.28% of ? values.


Since the features in the dataset were already narrowed from 76 to 14 based on their importance for meaningful medical interpretation, dropping the 6 records appears more reasonable than dropping or imputing the two features. The reason: the two variables cannot be imputed with the mean because, although they appear in numeric format, they were originally categorical and were converted to numeric codes.
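
To illustrate why the mean is unsuitable here, the sketch below uses a toy `thal` column with the category codes 3, 6 and 7. These are the codes listed in the UCI attribute documentation, but treat them as an assumption, since they do not appear in this notebook. The mean of the codes is not itself a valid category, whereas the mode is.

```python
import pandas as pd

# Toy 'thal' column using the assumed category codes 3, 6 and 7; None marks a missing entry.
thal = pd.Series([3.0, 3.0, 7.0, 6.0, 3.0, 7.0, None])
valid_codes = {3.0, 6.0, 7.0}

mean_fill = float(thal.mean())            # arithmetic mean of the category codes
mode_fill = float(thal.mode().iloc[0])    # most frequent category

print(mean_fill, mean_fill in valid_codes)
print(mode_fill, mode_fill in valid_codes)
```

Mode imputation would therefore be a possible alternative, but with only 6 affected records, dropping them is the simpler choice.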


### Important Notice
- The other datasets (Long Beach, Hungarian and Switzerland) also contain the `?` character in many of their features, which renders them unusable here. Please refer to the [data set investigation](1.1-uci-processed-dataset-investigation.ipynb) for more detailed reports.

In [None]:
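# Drop the records whose `ca` or `thal` value is `?`. The new total is 303 - 6 = 297 records.
# The column constants are the same ones used in the earlier '?' checks.
filtered = data[(data[uci.UCIHeartDiseaseData.thalassemia] == '?') | (data[uci.UCIHeartDiseaseData.major_vessels] == '?')].index
data.drop(filtered, inplace=True)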
data.shape

## 4. Does the dataset have empty columns (features)?
Now that we have verified the data's consistency, let's <u>investigate whether any data is missing in the dataset</u>.

In [None]:
# No missing data. isnull and isna are the same; both check for None, NaN or NaT (datetime).
data.isnull().sum()

## 5. Is the dataset sufficient for building the model(s)?
Now that the dataset has been cleansed, let's explore the data for further analysis (with graphs where needed).

In [None]:
# Let's save a copy of the cleansed dataset for building models.
data.to_csv('data/uci-heart-disease/processed.cleveland-cleansed.data', index=False)

# Load the saved data for verification.
df = pd.read_csv('data/uci-heart-disease/processed.cleveland-cleansed.data')
df

In [None]:
# Ideally, both classes in the target variable should have the same proportion, i.e. ~148 records each.
len(df["target"]) / 2

In [None]:
# Nevertheless, a slight variation is acceptable. Let's investigate whether the target classes are balanced.
df['target'].value_counts()

In [None]:
# To get the percentage proportion, let's view the normalized value counts.
# So, the deviation in the distribution is ~4%.
df['target'].value_counts(normalize=True)
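
The deviation from an even split can also be computed rather than read off the output. The helper below is an illustrative sketch. `balance_gap` is a hypothetical name, and the 54/46 toy counts are made up to show a gap of about 4 percentage points.

```python
import pandas as pd

def balance_gap(target: pd.Series) -> float:
    """Largest deviation (in percentage points) of any class share from an even split."""
    shares = target.value_counts(normalize=True)
    even_share = 1 / len(shares)
    return float((shares - even_share).abs().max() * 100)

# Made-up binary target with a 54/46 split.
toy_target = pd.Series([0] * 54 + [1] * 46)
gap = balance_gap(toy_target)
print(round(gap, 2))
```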

In [None]:
# Let's see the distribution of the target variable's classes in a bar chart.
df['target'].value_counts().plot(kind="bar", color=['steelblue', 'darksalmon']);

In [None]:
# Let's investigate which features have strong correlation with target.
df.corr()

In [None]:
# Let's view the correlation matrix as a heatmap - the darker the blue, the higher the correlation.
corr_matrix = df.corr(method='pearson')
# Mask the upper triangle (including the diagonal) so that each pair is shown only once.
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
plt.figure(figsize=(15, 10))

sns.heatmap(corr_matrix,
            mask=mask,
            annot=True,
            linewidths=0.5,
            fmt= ".2f",
            cmap="GnBu");

### 5.1 Feature Correlation
By eyeballing the chart (above), we can conclude the following from the Pearson's correlation.

| Level     |   Positive   |    Negative    |
|:----------|:------------:|:--------------:|
| Strong    |  0.70 to 1   |  -0.70 to -1   |
| Moderate  | 0.30 to 0.70 | -0.30 to -0.70 |
| Weak      |  0 to 0.30   |   0 to -0.30   |

If we apply the <u>general rules for classifying correlation</u> (using the table above), we observe:
* The features have <u>only moderate correlation</u> with each other and with the target variable.
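
The classification rule in the table can be expressed as a small function. This is a sketch with one assumption: the table lists 0.30 and 0.70 in two bands each, so the boundary values are assigned to the higher band here. The function name is illustrative.

```python
def correlation_level(r: float) -> str:
    """Classify a Pearson coefficient by magnitude using the thresholds from the table."""
    magnitude = abs(r)
    if magnitude >= 0.70:
        return "strong"
    if magnitude >= 0.30:
        return "moderate"
    return "weak"

# A few coefficients quoted in this section.
levels = {r: correlation_level(r) for r in (0.53, 0.33, -0.42, -0.27)}
print(levels)
```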
#### Positive correlation - positive linear relationship
* Six features have moderate <u>positive correlation with the target</u> variable:
1. thal (0.53)
2. ca (0.46)
3. oldpeak (0.42)
4. exang (0.42)
5. cp (0.41)
6. slope (0.33)
* Six feature pairs have moderate <u>positive correlation between variables</u>:
1. oldpeak and slope (0.58)
2. sex and thal (0.38)
3. cp and exang (0.38)
4. age and ca (0.36)
5. oldpeak and thal (0.34)
6. exang and thal (0.33)
#### Negative correlation - negative (inverse) linear relationship
* One feature has moderate <u>negative correlation with the target</u> variable:
1. thalach (-0.42)
* Five feature pairs have moderate <u>negative correlation between variables</u>:
1. age and thalach (-0.39)
2. thalach and slope (-0.39)
3. thalach and exang (-0.38)
4. thalach and oldpeak (-0.35)
5. cp and thalach (-0.34)
* Two further pairs fall just below the moderate threshold of 0.30:
1. thalach and thal (-0.27)
2. thalach and ca (-0.27)
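
Instead of reading the pairs off the heatmap, they can be extracted from the correlation matrix programmatically. The sketch below is one possible approach on a toy frame. The function name, the threshold default and the toy columns `a`, `b` and `c` are illustrative.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.30) -> pd.Series:
    """Return each column pair once, keeping those with |r| >= threshold, strongest first."""
    corr = frame.corr(method="pearson")
    # Keep only the strict upper triangle so that each pair appears exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Toy frame: b is a multiple of a, and c tends to move in the opposite direction.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})
pairs = correlated_pairs(toy)
print(pairs)
```

Applied to the cleansed dataset, the same function would list the moderate pairs directly, which avoids transcription errors.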

### Further investigation needed
We applied Pearson's correlation to identify the relationships. To ensure that its assumptions hold, we also need to verify the following:
- both variables are quantitative
- variables are normally distributed
- no outliers
* Let's take the top 3 positive correlations and investigate:
1. oldpeak and slope (0.58)
2. thal and target (0.53)
3. ca and target (0.46)
* And, one negative correlation:
1. thalach and target (-0.42)
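
The normality and outlier checks listed above can be approximated numerically. The sketch below uses the sample skewness and the conventional 1.5 x IQR rule. The helper name, the multiplier and the toy values are assumptions for illustration, not results from the dataset.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up readings with one unusually high value.
toy = pd.Series([110, 120, 120, 130, 130, 130, 140, 140, 150, 200])
outliers = iqr_outliers(toy)
skewness = toy.skew()  # a positive value indicates a longer right tail
print(outliers.tolist(), skewness > 0)
```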

In [None]:
# The curve is slightly right-skewed, so the mean should sit slightly to the right of the median.
reload(helper)
helper.draw_histogram_density_curve(df,'trestbps')

In [None]:
df['trestbps'].describe()

In [None]:
Q3 = df['trestbps'].quantile(0.75)