# Introduction
As explained in [01.data-introduction](01.data-introduction.ipynb), the author (Dr. Roberto Detrano) intended to build a model that predicts heart disease from 13 feature variables and 1 target variable. He built the model on the Cleveland dataset and tested its predictions against the other datasets. The datasets are summarized in the table below:

| Filename                   | Records (Rows) | Features (Cols) | Description                 |
|:---------------------------|:--------------:|:---------------:|:----------------------------|
| processed.cleveland.data   |      303       |       14        | Used for building the model |
| processed.hungarian.data   |      294       |       14        | Used for testing            |
| processed.va.data          |      200       |       14        | Used for testing            |
| processed.switzerland.data |      123       |       14        | Used for testing            |

## Data Dictionary
In the processed datasets (as described in [01.data-introduction](01.data-introduction.ipynb)), the author mapped the following variables from the original (raw) dataset to the processed dataset used for building the model. The table below is the <u>data dictionary for the processed dataset used in this study</u>.

| No  | Name         | Description                                              | Values                                                                                                                              |
|:----|:-------------|:---------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------|
| 1.  | age          | age in years                                             |                                                                                                                                     |
| 2.  | sex          | sex                                                      | 1: male <br> 0: female                                                                                                              |
| 3.  | cp           | chest pain type                                          | 1: typical angina <br> 2: atypical angina <br> 3: non-anginal pain <br> 4: asymptomatic                                             |
| 4.  | trestbps     | systolic blood pressure at rest (in mm Hg)               |                                                                                                                                     |
| 5.  | chol         | serum cholesterol in mg/dl                               |                                                                                                                                     |
| 6.  | fbs          | fasting blood sugar > 120 mg/dl                          | 1: true <br> 0: false                                                                                                               |
| 7.  | restecg      | resting electrocardiographic results                     | 0: normal <br> 1: having ST-T wave abnormality <br> 2: showing probable or definite left ventricular hypertrophy by Estes' criteria |
| 8.  | thalach      | maximum heart rate achieved                              |                                                                                                                                     |
| 9.  | exang        | exercise induced angina                                  | 1: yes <br> 0: no                                                                                                                   |
| 10. | oldpeak      | ST depression induced by exercise relative to rest       |                                                                                                                                     |
| 11. | slope        | the slope of the peak exercise ST segment                | 1: upsloping <br> 2: flat <br> 3: downsloping                                                                                       |
| 12. | ca           | number of major vessels (0-3) colored by fluoroscopy     |                                                                                                                                     |
| 13. | thal         |                                                          | 3: normal <br> 6: fixed defect <br> 7: reversible defect                                                                            |
| 14. | num (target) | diagnosis of heart disease (angiographic disease status) | 0: < 50% diameter narrowing <br> 1: > 50% diameter narrowing                                                                        |
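
The coded values in the data dictionary can be turned into readable labels during exploration. The following is a minimal sketch on toy rows (not taken from the dataset); the names `CP_LABELS`, `SEX_LABELS`, `toy` and `decoded` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Code-to-label mappings copied from the data dictionary.
CP_LABELS = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}
SEX_LABELS = {1: "male", 0: "female"}

# Toy rows for illustration only; the processed file stores these codes as floats.
toy = pd.DataFrame({"sex": [1.0, 0.0, 1.0], "cp": [1.0, 4.0, 3.0]})

# Cast the float codes to int, then look each code up in its dictionary.
# Codes missing from a dictionary would become NaN.
decoded = toy.assign(
    sex_label=toy["sex"].astype(int).map(SEX_LABELS),
    cp_label=toy["cp"].astype(int).map(CP_LABELS),
)
print(decoded)
```

Keeping the labels in separate columns leaves the numeric codes intact for modelling.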

# The Objective
To build ML models from the heart-disease dataset, the following questions need to be answered at this stage (data exploration).
1. What problem needs to be solved?
2. Does the dataset have duplicate rows (records)?
3. What data types does the dataset contain?
4. Does the dataset have empty columns (features)?
5. Is the dataset sufficient for building the model(s)?
6. Which features appear more important than the others?

Once these questions are answered, the resulting list of action items will guide the next stage.

Let's load the necessary libraries and data for investigation.

In [126]:
# All required libraries.
import pandas as pd
from custom_libs import helper

In [127]:
header =['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
data = pd.read_csv('data/uci-heart-disease/processed.cleveland.data', names=header)
# 303 records and 14 columns.
data.head(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0


## 1. What problem needs to be solved?
<b>Classify</b> - whether a patient has heart disease based on his/her medical data.

At a glance, it appears to be a binary classification problem.\
From the data dictionary, `num` appears to be the target variable. Let's check whether the target variable supports a <u>binary classification</u> problem.

In [128]:
# Column 'num' appears in integer format.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    float64
 1   sex       303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    float64
 7   thalach   303 non-null    float64
 8   exang     303 non-null    float64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    float64
 11  ca        303 non-null    object 
 12  thal      303 non-null    object 
 13  num       303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB


In [129]:
# The target variable (num) has five distinct values, whereas binary classification requires exactly two.
data['num'].value_counts()

num
0    164
1     55
2     36
3     35
4     13
Name: count, dtype: int64

### Summary:
The target variable cannot support binary classification as it stands, since it has more than 2 values. \
If we understood the original intention from the data dictionary and the literature correctly:
- Any patient with less than 50% vessel narrowing was marked as `value: 0` -- no heart disease
- Any patient with more than 50% vessel narrowing was marked as `value: 1` -- has heart disease. This was expanded further to 1, 2, 3 and 4 based on the affected major vessels.

We can safely convert this to a binary classification problem by replacing every value other than `0` with `1`. This simplification treats any patient with more than 50% vessel narrowing as suspected of having heart disease, without distorting the original meaning much.

### Identified Action(s)
1. Convert every value in the `num` column other than `0` to `1`, to support the binary classification problem.
2. Also, rename the column from `num` to `target` to give a more meaningful name.
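
The two actions above can also be expressed without modifying the original column in place. This is a minimal sketch on toy values, assuming the only requirement is "0 stays 0, anything else becomes 1"; the variable names are illustrative.

```python
import pandas as pd

# Toy target values for illustration (0 = no disease, 1-4 = disease).
num = pd.Series([0, 2, 1, 0, 4, 3, 0], name="num")

# The comparison yields booleans; astype(int) turns True/False into 1/0.
# rename() gives the new Series the more meaningful name.
target = (num != 0).astype(int).rename("target")

print(target.value_counts().to_dict())
```

Because `num` is left untouched, the original severity levels remain available for later comparison.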

In [130]:
# 1. Convert any values in `num` column other than `0` to `1` - to support binary classification problem.
# Let's see the count before conversion
data['num'].value_counts()

num
0    164
1     55
2     36
3     35
4     13
Name: count, dtype: int64

In [131]:
# After conversion, class 1 should total 55 + 36 + 35 + 13 = 139 records.
data.loc[data['num']!=0,"num"]=1
data['num'].value_counts()

num
0    164
1    139
Name: count, dtype: int64

In [132]:
# 2. Also, rename the column from `num` to `target` to give a more meaningful name.
data.rename(columns={'num':'target'}, inplace=True)

In [133]:
# The new 'target' column tallies with the old 'num' column.
data['target'].value_counts()

target
0    164
1    139
Name: count, dtype: int64

## 2. Does the dataset have duplicate rows (records)?
So far, we have converted the target variable from multi-class to binary to support the binary classification problem. \
Now, let's check whether any duplicate records (rows) are present in the dataset.

In [134]:
# All columns were checked, and no duplicate record found.
data[data.duplicated()]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
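
An empty result like the one above means no row repeats an earlier row across all columns. The same check can be reduced to a single count, as in this minimal sketch on a toy frame (the data and variable names are illustrative assumptions):

```python
import pandas as pd

# Toy frame where the third row repeats the first.
toy = pd.DataFrame({"age": [63, 67, 63], "chol": [233, 286, 233]})

# duplicated() marks each repeat of an earlier row as True;
# the first occurrence stays False, so the sum counts only the extras.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```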


## 3. What data types does the dataset contain?
We got a basic idea of the data types from the data dictionary. Let's verify them against the actual content.

In [135]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    float64
 1   sex       303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    float64
 7   thalach   303 non-null    float64
 8   exang     303 non-null    float64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    float64
 11  ca        303 non-null    object 
 12  thal      303 non-null    object 
 13  target    303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB


All features appear to be in numerical format except `ca` and `thal`, which pandas reports as `object`. According to the data dictionary, they appear to be categorical. Let's investigate further.

In [136]:
# In column 'ca', 4 records contain the value '?'.
data['ca'].value_counts()

ca
0.0    176
1.0     65
2.0     38
3.0     20
?        4
Name: count, dtype: int64

In [137]:
# In column 'thal', 2 records contain the value '?'.
data['thal'].value_counts()

thal
3.0    166
7.0    117
6.0     18
?        2
Name: count, dtype: int64

In [138]:
# Check whether the 6 records (4 from 'ca' and 2 from 'thal') are disjoint by combining both conditions with OR.
# The result has 6 rows, so no record has '?' in both columns: the two sets are disjoint.
data[(data['thal']=='?') | (data['ca']=='?')]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
87,53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0
166,52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0
192,43.0,1.0,4.0,132.0,247.0,1.0,2.0,143.0,1.0,0.1,2.0,?,7.0,1
266,52.0,1.0,4.0,128.0,204.0,1.0,0.0,156.0,1.0,1.0,2.0,0.0,?,1
287,58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,?,7.0,0
302,38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0


In [139]:
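# Less than 2% of the values in 'ca' and 'thal' are uninterpretable.
# Double quotes outside, single quotes inside: reusing the same quote type inside an f-string is a syntax error before Python 3.12.
print(f"In ca there is {round(helper.value_count(data, 'ca', '?'), 2)}% of ? values found.")
print(f"In thal there is {round(helper.value_count(data, 'thal', '?'), 2)}% of ? values found.")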

In ca there is 1.32% of ? values found.
In thal there is 0.66% of ? values found.
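
`helper.value_count` comes from the notebook's own `custom_libs` module, and its implementation is not shown here. Assuming it returns the percentage of rows in a column that equal a given value, an equivalent could be sketched as follows; `placeholder_percentage` is a hypothetical name, not the real helper.

```python
import pandas as pd


def placeholder_percentage(frame: pd.DataFrame, column: str, placeholder: str = "?") -> float:
    """Return the percentage of rows in `column` that equal `placeholder`."""
    # The mean of a boolean Series is the share of True values.
    return float((frame[column] == placeholder).mean() * 100)


# Toy column for illustration: 1 of 4 values is the placeholder.
toy = pd.DataFrame({"ca": ["0.0", "?", "1.0", "0.0"]})
share = placeholder_percentage(toy, "ca")
print(round(share, 2))
```

Applied to the figures above, 4 of 303 records is about 1.32% and 2 of 303 is about 0.66%, which matches the printed output.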


### Summary:
The investigation shows that features `ca` and `thal` contain an uninterpretable value, `?`. \
Together there are 6 affected records (4 from `ca` and 2 from `thal`), and they are disjoint. \
These records make up less than 2% of the dataset, so we can drop them.

### Identified Action(s)
1. Drop records for feature `ca` and `thal` that are `?`
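
One possible alternative to filtering on the string `?` is to let pandas treat it as missing at load time, using the `na_values` parameter of `pd.read_csv`. The sketch below uses a small in-memory CSV in place of the real file; the rows and column subset are illustrative assumptions.

```python
import io

import pandas as pd

# Toy CSV text standing in for the real file; "?" marks an unknown value.
raw = io.StringIO("63.0,0.0,6.0,0\n52.0,?,3.0,0\n53.0,0.0,?,0\n")

# na_values="?" makes pandas read every "?" as NaN,
# so the affected columns are parsed as float instead of object.
toy = pd.read_csv(raw, names=["age", "ca", "thal", "num"], na_values="?")

# Count rows with at least one missing value, then drop them.
rows_with_missing = int(toy.isna().any(axis=1).sum())
cleaned = toy.dropna().reset_index(drop=True)

print(rows_with_missing, cleaned.shape)
```

With this approach, `isnull()` and `dropna()` see the unknown values directly, and no separate dtype conversion is needed afterwards.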

In [140]:
# Drop records where feature `ca` or `thal` is `?`. The new total is 303 - 6 = 297 records.
filtered = data[(data['thal'] == '?') | (data['ca'] == '?')].index
data.drop(filtered, inplace=True)
data.shape

(297, 14)
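
Dropping the rows does not change the column dtypes: `ca` and `thal` were read as `object`, so their remaining values are presumably still strings such as `'0.0'`. A minimal sketch of converting them to numbers with `pd.to_numeric`, shown on toy values (an assumption about a likely next step, not something the notebook does here):

```python
import pandas as pd

# Toy columns holding numeric codes as strings, as happens when a column also contained "?".
toy = pd.DataFrame({"ca": ["0.0", "3.0", "1.0"], "thal": ["6.0", "3.0", "7.0"]})

# apply() runs pd.to_numeric on each column; a non-numeric string would raise an error.
converted = toy.apply(pd.to_numeric)

print(converted.dtypes)
```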

In [None]:
# No missing data expected. isnull() and isna() are aliases; both flag None, NaN or NaT (datetime).
data.isnull().sum()

# Data Exploration (EDA)

In [None]:
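# Check the target balance ('num' was renamed to 'target' above).
data["target"].value_counts()

In [None]:
data["target"].value_counts().plot(kind='bar');

In [None]:
# Collapse every non-zero target value to 1, touching only the target column
# (a no-op if the conversion above has already run).
data.loc[data["target"] != 0, "target"] = 1

In [None]:
data["target"].value_counts()

In [None]:
data["target"].value_counts().plot(kind='bar', color=['salmon','lightblue']);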

In [None]:
data.cp.value_counts()

In [None]:
 # Taken from heart-disease.names
 # The "goal" field refers to the presence of heart disease
 #     in the patient.  It is integer valued from 0 (no presence) to 4.
 #     Experiments with the Cleveland database have concentrated on simply
 #     attempting to distinguish presence (values 1,2,3,4) from absence (value
 #     0).
data.target.value_counts()

In [None]:
data.isna().sum()

In [None]:
# 'ca' and 'thal' are the only columns stored with a non-numeric (object) dtype.
data.info()

In [None]:
# Re-check 'ca': once the rows above are dropped, no '?' values should remain.
data['ca'].value_counts()

In [None]:
# Re-check 'thal': once the rows above are dropped, no '?' values should remain.
data['thal'].value_counts()

In [None]:
# Expected to be empty after the drop.
selected = data[data['thal']=='?']
selected

In [None]:
data['sex'].value_counts()

## 5. Is the dataset sufficient for building the model(s)?
1. Does the class have outliers?


In [None]:
data