# 1. Introduction
A few heart disease datasets can be found on Kaggle. Most of them can be traced back to the UCI Heart Disease dataset. Quite often, the Kaggle versions are oversampled or distorted, or their features are transposed and untraceable.

To understand the problem that the original author (Dr. Robert Detrano) intended to solve, it is more reliable to obtain the dataset from the original source.

The original <a href='https://archive.ics.uci.edu/dataset/45/heart+disease'>UCI Heart Disease</a> dataset contains four data files collected at the following four locations between 1981 and 1987:

1. Cleveland Clinic Foundation (cleveland.data), 1981 - 1984
2. Hungarian Institute of Cardiology, Budapest (hungarian.data), 1983 - 1987
3. V.A. Medical Center, Long Beach, CA (long-beach-va.data), 1984 - 1987
4. University Hospital, Zurich, Switzerland (switzerland.data), 1985

# 2. Acknowledgements
Anyone using this dataset is required to include the names of the principal investigators responsible for the data collection at each institution. They are:

1. Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

# 3. Background
This section gives a brief, necessary history and then <u>discusses the files used in this study and how they relate to each other</u>.
## 3.1 Dataset Content
After the original dataset was downloaded, a few files were extracted into the `data/uci-heart-disease` directory. To see what the dataset contains, the `cat` command can be used on the `Index` file.

In [None]:
!cat data/uci-heart-disease/Index

## 3.2 Dataset History
The author intended to build a model that predicts heart disease from 13 predictor variables and 1 target variable. He built the model on the Cleveland dataset and tested its predictions against the other datasets. The datasets are summarized in the table below:

| Filename                   | Records (Rows) | Features (Cols) | Description                 |
|:---------------------------|:--------------:|:---------------:|:----------------------------|
| processed.cleveland.data   |      303       |       14        | Used for building the model |
| processed.hungarian.data   |      294       |       14        | Used for testing            |
| processed.va.data          |      200       |       14        | Used for testing            |
| processed.switzerland.data |      123       |       14        | Used for testing            |

For unknown reasons, `cleveland.data` was corrupted during upload and is beyond recovery. This is noted in the `WARNING` file, which can be inspected with the `cat` command.

In [None]:
!cat data/uci-heart-disease/WARNING

## 3.3 Investigating the Corruption
Although the other raw data files (`hungarian.data`, `long-beach-va.data` and `switzerland.data`) are plain ASCII text, `cleveland.data` is reported as binary, which points to a disrupted upload. This can be checked with the `file -I` command (on most Linux systems the equivalent flag is `file -i`).

In [None]:
# An intact raw file is reported with MIME type 'text/plain' and charset=us-ascii.
!file -I data/uci-heart-disease/hungarian.data

In [None]:
# The corrupted 'cleveland.data' file is instead reported as 'octet-stream' with charset=binary.
!file -I data/uci-heart-disease/cleveland.data

Although the first half of `cleveland.data` appears to be plain ASCII, running `tail -n 100` on the file shows that the second half is binary, which is visible as gibberish characters.

In [None]:
!tail -n 100 data/uci-heart-disease/cleveland.data
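
The point at which a file stops being plain ASCII can also be located programmatically. The following is a minimal sketch, not part of the original analysis: `first_non_ascii_offset` is a hypothetical helper, and treating any byte above 127 or any NUL byte as "suspicious" is an assumption about what the corruption looks like.

```python
from pathlib import Path
from typing import Optional


def first_non_ascii_offset(path: str) -> Optional[int]:
    """Return the offset of the first suspicious byte, or None if none is found."""
    raw = Path(path).read_bytes()
    for offset, byte in enumerate(raw):
        # ASCII covers 0-127; NUL bytes are treated as a hint of binary content (assumption).
        if byte > 127 or byte == 0:
            return offset
    return None


# Path assumed from the cells above; the check is skipped if the file is absent.
target = Path('data/uci-heart-disease/cleveland.data')
if target.exists():
    offset = first_non_ascii_offset(str(target))
    print(f'first suspicious byte at offset {offset} of {target.stat().st_size}')
```

If the helper returns `None` for the other raw files and an offset for `cleveland.data`, that would be consistent with what `file -I` and `tail` show above.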

## 3.4 Processed Files as Alternatives
That is the necessary history (a mystery, perhaps): the original (raw) `cleveland.data` file was never available for preprocessing, and no attempt to re-upload the corrupted file has been seen since. As an alternative, a set of processed files was made available for working with this dataset.

The original (raw) files have 76 features each, while the processed files keep only 14 of them, which are listed in the [Data Dictionary](#data_dictionary) below. In his initial paper, "International application of a new probability algorithm for the diagnosis of coronary artery disease" (1989), the author indicated that he needed only 13 features to build the prediction model (the 14<sup>th</sup> is the target variable).

The tables below show how the original (raw) files relate to the processed files.
#### Original (raw)
| Filename                   | Records (Rows) | Features (Cols) |
|:---------------------------|:--------------:|:---------------:|
| cleveland.data             |      303       |       76        |
| hungarian.data             |      294       |       76        |
| long-beach-va.data         |      200       |       76        |
| switzerland.data           |      123       |       76        |

#### Processed
| Filename                   | Records (Rows) | Features (Cols) |
|:---------------------------|:--------------:|:---------------:|
| processed.cleveland.data   |      303       |       14        |
| processed.hungarian.data   |      294       |       14        |
| processed.va.data          |      200       |       14        |
| processed.switzerland.data |      123       |       14        |

<br>
The dimensions of the processed dataset can be verified using pandas:

In [11]:
import pandas as pd

# Headers are described in Data Dictionary section.
header = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
data = pd.read_csv('data/uci-heart-disease/processed.cleveland.data', names=header)
print(f'dataset shape: {data.shape}.')
data.head(3)


dataset shape: (303, 14).


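|     | age  | sex | cp  | trestbps | chol  | fbs | restecg | thalach | exang | oldpeak | slope | ca  | thal | num |
|:----|:----:|:---:|:---:|:--------:|:-----:|:---:|:-------:|:-------:|:-----:|:-------:|:-----:|:---:|:----:|:---:|
| 0   | 63.0 | 1.0 | 1.0 |  145.0   | 233.0 | 1.0 |   2.0   |  150.0  |  0.0  |   2.3   |  3.0  | 0.0 | 6.0  |  0  |
| 1   | 67.0 | 1.0 | 4.0 |  160.0   | 286.0 | 0.0 |   2.0   |  108.0  |  1.0  |   1.5   |  2.0  | 3.0 | 3.0  |  2  |
| 2   | 67.0 | 1.0 | 4.0 |  120.0   | 229.0 | 0.0 |   2.0   |  129.0  |  1.0  |   2.6   |  2.0  | 2.0 | 7.0  |  1  |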
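
One detail that a shape check does not reveal is missing values. The processed files appear to mark them with `?`; if so, pandas reads the affected columns as text unless it is told otherwise. The sketch below shows one way to handle this with the `na_values` parameter of `pd.read_csv`. The helper name `load_processed` is hypothetical, the second sample row is made up for illustration, and the `?` placeholder should be confirmed against the actual files.

```python
import io

import pandas as pd

HEADER = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
          'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']


def load_processed(source) -> pd.DataFrame:
    """Read a processed file, turning '?' placeholders into NaN (assumed marker)."""
    return pd.read_csv(source, names=HEADER, na_values='?')


# Two rows in the processed layout; the second is invented and has '?' in `ca` and `thal`.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,?,0\n"
)
frame = load_processed(sample)

# Count the missing entries per column and confirm the columns stay numeric.
print(frame[['ca', 'thal']].isna().sum())
print(frame[['ca', 'thal']].dtypes)
```

Passing a real path such as `'data/uci-heart-disease/processed.cleveland.data'` instead of the in-memory sample would apply the same handling to the actual file.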


In [12]:
from models.uci_heart_disease_dataset import get_standard_features

# Replacing the original header with meaningful/descriptive names.
header = get_standard_features()
data = pd.read_csv('data/uci-heart-disease/processed.cleveland.data', names=header)
print(f'dataset shape: {data.shape}.')
data.head(3)

dataset shape: (303, 14).


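|     | Age  | Gender | Chest Pain | BP Systolic | Cholesterol | Blood Sugar | Rest ECG | Exe. Max Heartrate | Exercise Exe. Angina | Exe. ST Depression | Exe. ST Segment Slope | Major Vessels | Thalassemia | Target |
|:----|:----:|:------:|:----------:|:-----------:|:-----------:|:-----------:|:--------:|:------------------:|:--------------------:|:------------------:|:---------------------:|:-------------:|:-----------:|:------:|
| 0   | 63.0 |  1.0   |    1.0     |    145.0    |    233.0    |     1.0     |   2.0    |       150.0        |         0.0          |        2.3         |          3.0          |      0.0      |     6.0     |   0    |
| 1   | 67.0 |  1.0   |    4.0     |    160.0    |    286.0    |     0.0     |   2.0    |       108.0        |         1.0          |        1.5         |          2.0          |      3.0      |     3.0     |   2    |
| 2   | 67.0 |  1.0   |    4.0     |    120.0    |    229.0    |     0.0     |   2.0    |       129.0        |         1.0          |        2.6         |          2.0          |      2.0      |     7.0     |   1    |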


In [13]:
data

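|     | Age  | Gender | Chest Pain | BP Systolic | Cholesterol | Blood Sugar | Rest ECG | Exe. Max Heartrate | Exercise Exe. Angina | Exe. ST Depression | Exe. ST Segment Slope | Major Vessels | Thalassemia | Target |
|:----|:----:|:------:|:----------:|:-----------:|:-----------:|:-----------:|:--------:|:------------------:|:--------------------:|:------------------:|:---------------------:|:-------------:|:-----------:|:------:|
| 0   | 63.0 |  1.0   |    1.0     |    145.0    |    233.0    |     1.0     |   2.0    |       150.0        |         0.0          |        2.3         |          3.0          |      0.0      |     6.0     |   0    |
| 1   | 67.0 |  1.0   |    4.0     |    160.0    |    286.0    |     0.0     |   2.0    |       108.0        |         1.0          |        1.5         |          2.0          |      3.0      |     3.0     |   2    |
| 2   | 67.0 |  1.0   |    4.0     |    120.0    |    229.0    |     0.0     |   2.0    |       129.0        |         1.0          |        2.6         |          2.0          |      2.0      |     7.0     |   1    |
| 3   | 37.0 |  1.0   |    3.0     |    130.0    |    250.0    |     0.0     |   0.0    |       187.0        |         0.0          |        3.5         |          3.0          |      0.0      |     3.0     |   0    |
| 4   | 41.0 |  0.0   |    2.0     |    130.0    |    204.0    |     0.0     |   2.0    |       172.0        |         0.0          |        1.4         |          1.0          |      0.0      |     3.0     |   0    |
| ... | ...  |  ...   |    ...     |     ...     |     ...     |     ...     |   ...    |        ...         |         ...          |        ...         |          ...          |      ...      |     ...     |  ...   |
| 298 | 45.0 |  1.0   |    1.0     |    110.0    |    264.0    |     0.0     |   0.0    |       132.0        |         0.0          |        1.2         |          2.0          |      0.0      |     7.0     |   1    |
| 299 | 68.0 |  1.0   |    4.0     |    144.0    |    193.0    |     1.0     |   0.0    |       141.0        |         0.0          |        3.4         |          2.0          |      2.0      |     7.0     |   2    |
| 300 | 57.0 |  1.0   |    4.0     |    130.0    |    131.0    |     0.0     |   0.0    |       115.0        |         1.0          |        1.2         |          2.0          |      1.0      |     7.0     |   3    |
| 301 | 57.0 |  0.0   |    2.0     |    130.0    |    236.0    |     0.0     |   2.0    |       174.0        |         0.0          |        0.0         |          2.0          |      1.0      |     3.0     |   1    |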
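
The shape check above covers only the Cleveland file. The same check can be repeated for all four processed files, as in the rough sketch below. It uses only the standard library so that it runs without extra dependencies. The directory is assumed from the cells above, the expected shapes are copied from the "Processed" table, and `file_shape` and `check_shapes` are hypothetical helpers.

```python
import csv
from pathlib import Path

DATA_DIR = Path('data/uci-heart-disease')  # directory assumed from the cells above

# Expected (rows, columns), copied from the "Processed" table.
EXPECTED = {
    'processed.cleveland.data': (303, 14),
    'processed.hungarian.data': (294, 14),
    'processed.va.data': (200, 14),
    'processed.switzerland.data': (123, 14),
}


def file_shape(path: Path) -> tuple:
    """Return (rows, columns) of a comma-separated file, ignoring blank lines."""
    with path.open(newline='') as handle:
        rows = [row for row in csv.reader(handle) if row]
    # The column count is taken from the first row only.
    return len(rows), (len(rows[0]) if rows else 0)


def check_shapes(data_dir: Path, filenames) -> dict:
    """Return {filename: (rows, columns)} for every listed file that exists."""
    return {name: file_shape(data_dir / name)
            for name in filenames if (data_dir / name).exists()}


for name, shape in check_shapes(DATA_DIR, EXPECTED).items():
    verdict = 'matches' if shape == EXPECTED[name] else 'differs from'
    print(f'{name}: {shape} {verdict} the table')
```

Files that are not present are skipped silently, so the loop prints nothing when the directory is missing.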


<a id='data_dictionary'></a>
# Data Dictionary
In the original journal article, the author states that he needed only 13 variables (features) to predict heart disease in a patient; the 14<sup>th</sup> variable is the target.

He therefore dropped the other 62 features, leaving only 14 variables in the processed datasets used for both model building and testing.

The detailed documentation can be found in the `heart-disease.names` file.

In [None]:
!cat data/uci-heart-disease/heart-disease.names

For the processed datasets, the author carried the following variables over from the original (raw) dataset in order to build the model. The table below is the <u>data dictionary for the processed dataset that is used in this study</u>.

| No  | Name         | Description                                              | Values                                                                                                                              |
|:----|:-------------|:---------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------|
| 1.  | age          | age in years                                             |                                                                                                                                     |
| 2.  | sex          | sex                                                      | 1: male <br> 0: female                                                                                                              |
| 3.  | cp           | chest pain type                                          | 1: typical angina <br> 2: atypical angina <br> 3: non-anginal pain <br> 4: asymptomatic                                             |
| 4.  | trestbps     | systolic blood pressure at rest (in mm Hg)               |                                                                                                                                     |
| 5.  | chol         | serum cholesterol in mg/dl                               |                                                                                                                                     |
| 6.  | fbs          | fasting blood sugar > 120 mg/dl                          | 1: true <br> 0: false                                                                                                               |
| 7.  | restecg      | resting electrocardiographic results                     | 0: normal <br> 1: having ST-T wave abnormality <br> 2: showing probable or definite left ventricular hypertrophy by Estes' criteria |
| 8.  | thalach      | maximum heart rate achieved                              |                                                                                                                                     |
| 9.  | exang        | exercise induced angina                                  | 1: yes <br> 0: no                                                                                                                   |
| 10. | oldpeak      | ST depression induced by exercise relative to rest       |                                                                                                                                     |
| 11. | slope        | the slope of the peak exercise ST segment                | 1: upsloping <br> 2: flat <br> 3: downsloping                                                                                       |
| 12. | ca           | number of major vessels (0-3) colored by fluoroscopy     | 0 <br> 1 <br> 2 <br> 3                                                                                                              |
| 13. | thal         |                                                          | 3: normal <br> 6: fixed defect <br> 7: reversible defect                                                                            |
| 14. | num (target) | diagnosis of heart disease (angiographic disease status) | 0: < 50% diameter narrowing <br> 1: > 50% diameter narrowing                                                                        |
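
The dictionary lists only the values 0 and 1 for `num`, while the Cleveland output earlier in this section also contains values such as 2 and 3. The dataset documentation describes the Cleveland target as an integer from 0 to 4, and experiments typically distinguish presence (1 to 4) from absence (0). A minimal sketch of that reduction follows. The helper `add_binary_target` and the column name `disease` are hypothetical and are not part of the original study.

```python
import pandas as pd


def add_binary_target(frame: pd.DataFrame, source: str = 'num',
                      target: str = 'disease') -> pd.DataFrame:
    """Return a copy with a 0/1 column: 0 stays 0, any positive value becomes 1."""
    result = frame.copy()
    result[target] = (result[source] > 0).astype(int)
    return result


# The `num` values of the first five Cleveland rows shown earlier: 0, 2, 1, 0, 0.
demo = pd.DataFrame({'num': [0, 2, 1, 0, 0]})
print(add_binary_target(demo)['disease'].tolist())  # [0, 1, 1, 0, 0]
```

Working on a copy keeps the original `num` column intact, so the graded values remain available if they are needed later.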