# Dataset Description
The original <a href='https://archive.ics.uci.edu/dataset/45/heart+disease'>UCI Heart Disease</a> dataset contains four data files, collected at the following four locations between 1981 and 1987:

1. Cleveland Clinic Foundation (cleveland.data), 1981 - 1984
2. Hungarian Institute of Cardiology, Budapest (hungarian.data), 1983 - 1987
3. V.A. Medical Center, Long Beach, CA (long-beach-va.data), 1984 - 1987
4. University Hospital, Zurich, Switzerland (switzerland.data), 1985

## 1. Acknowledgements
Any use of the dataset must acknowledge the principal investigator responsible for data collection at each institution:

1. Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

## 2. Background
In this section, after a brief but necessary history, <u>the files used in this study and their relationships are discussed</u>.
### 2.1 Dataset content
After the original dataset is downloaded, several files are extracted into the `data/uci-heart-disease` directory. To see what the dataset contains, run the `cat` command on the `Index` file.

In [None]:
!cat data/uci-heart-disease/Index

### 2.2 History (on raw data)
Dr. Robert Detrano used the raw `cleveland.data` file to build a model, which was then used to predict the other three datasets:
- hungarian.data
- long-beach-va.data
- switzerland.data

For unknown reasons, `cleveland.data` was corrupted during upload and can no longer be recovered. This is stated in the `WARNING` file, which can be read with the `cat` command.

In [None]:
!cat data/uci-heart-disease/WARNING

#### `cleveland.data` is Beyond Recovery
The other raw data files (`hungarian.data`, `long-beach-va.data` and `switzerland.data`) are plain ASCII, but `cleveland.data` is reported as binary, which suggests that the upload was disrupted. This can be checked with the `file -I` command (the macOS/BSD spelling; GNU `file` on Linux uses `-i`).

In [None]:
# Reported as MIME type text/plain with charset=us-ascii.
!file -I data/uci-heart-disease/hungarian.data

In [None]:
# Reported as MIME type application/octet-stream (binary) with charset=binary.
!file -I data/uci-heart-disease/cleveland.data

Although the first half of `cleveland.data` is readable ASCII, the second half is not: running `tail -n 100` on the file prints gibberish characters instead of data rows.

In [None]:
!tail -n 100 data/uci-heart-disease/cleveland.data
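To locate where the readable part ends, rather than eyeballing the `tail` output, one option is to scan the bytes for the first one that does not look like text. The sketch below is a heuristic under stated assumptions: the helper names (`first_non_text_offset`, `report`) and the definition of "text" (printable ASCII plus tab, LF and CR) are illustrative choices, not the rule that `file` applies, and the sample bytes are synthetic.

```python
from pathlib import Path

# Heuristic (an assumption, not the rule `file` itself applies):
# treat printable ASCII plus tab, LF and CR as "text".
TEXT_BYTES = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def first_non_text_offset(data: bytes):
    """Return the offset of the first byte outside TEXT_BYTES, or None."""
    for offset, byte in enumerate(data):
        if byte not in TEXT_BYTES:
            return offset
    return None


def report(path):
    """Describe where a file stops looking like plain text."""
    data = Path(path).read_bytes()
    offset = first_non_text_offset(data)
    if offset is None:
        return f"{path}: all {len(data)} bytes look like text"
    # Count the newlines before the offending byte to get a 1-based line number.
    line = data.count(b"\n", 0, offset) + 1
    return f"{path}: first non-text byte at offset {offset} of {len(data)} (line {line})"


# Synthetic bytes for illustration only; not taken from the dataset.
sample = b"63 1 1 145\n67 1 4 160\n\x00\x8f\x02"
print(first_non_text_offset(sample))
```

Calling `report('data/uci-heart-disease/cleveland.data')` would then give the approximate offset and line at which the file stops being readable, assuming the file is present at that path.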

### 2.3 From Raw to Processed
That is the necessary history (and something of a mystery): the original (raw) `cleveland.data` file has never been available for preprocessing, because no attempt to re-upload the corrupted file appears to have been made since. As a workaround, a set of processed files was made available.\
Each original (raw) file has 76 features (listed in the `Data Dictionary` section), whereas each processed file has only 14. In his initial paper, <i>International application of a new probability algorithm for the diagnosis of coronary artery disease</i> (1989), Dr. Robert Detrano indicates that only 13 features are needed to build his model (the 14<sup>th</sup> is the target variable).\
The tables below show how the original (raw) files relate to the processed ones.
#### Original (raw)
| Filename                   | Records (Rows) | Features (Cols) |
|:---------------------------|:--------------:|:---------------:|
| cleveland.data             |      303       |       76        |
| hungarian.data             |      294       |       76        |
| long-beach-va.data         |      200       |       76        |
| switzerland.data           |      123       |       76        |

#### Processed
| Filename                   | Records (Rows) | Features (Cols) |
|:---------------------------|:--------------:|:---------------:|
| processed.cleveland.data   |      303       |       14        |
| processed.hungarian.data   |      294       |       14        |
| processed.va.data          |      200       |       14        |
| processed.switzerland.data |      123       |       14        |

The dimensions of a processed file can be verified using pandas:

In [6]:
import pandas as pd

# Column names are described in the Data Dictionary section.
header = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
          'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
data = pd.read_csv('data/uci-heart-disease/processed.cleveland.data', names=header)
data.head(5)

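|   | age  | sex | cp  | trestbps | chol  | fbs | restecg | thalach | exang | oldpeak | slope | ca  | thal | num |
|:--|:----:|:---:|:---:|:--------:|:-----:|:---:|:-------:|:-------:|:-----:|:-------:|:-----:|:---:|:----:|:---:|
| 0 | 63.0 | 1.0 | 1.0 |  145.0   | 233.0 | 1.0 |   2.0   |  150.0  |  0.0  |   2.3   |  3.0  | 0.0 | 6.0  |  0  |
| 1 | 67.0 | 1.0 | 4.0 |  160.0   | 286.0 | 0.0 |   2.0   |  108.0  |  1.0  |   1.5   |  2.0  | 3.0 | 3.0  |  2  |
| 2 | 67.0 | 1.0 | 4.0 |  120.0   | 229.0 | 0.0 |   2.0   |  129.0  |  1.0  |   2.6   |  2.0  | 2.0 | 7.0  |  1  |
| 3 | 37.0 | 1.0 | 3.0 |  130.0   | 250.0 | 0.0 |   0.0   |  187.0  |  0.0  |   3.5   |  3.0  | 0.0 | 3.0  |  0  |
| 4 | 41.0 | 0.0 | 2.0 |  130.0   | 204.0 | 0.0 |   2.0   |  172.0  |  0.0  |   1.4   |  1.0  | 0.0 | 3.0  |  0  |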


In [7]:
data.shape

(303, 14)
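The check above covers only `processed.cleveland.data`. As a dependency-free cross-check of the whole Processed table, the sketch below counts rows and columns in all four files with the standard `csv` module. The helper names (`csv_shape`, `processed_shapes`) are hypothetical, and the code assumes the files are comma-separated with no header row, as the `read_csv` call above implies.

```python
import csv
from pathlib import Path

# File names as listed in the Processed table.
PROCESSED_FILES = [
    "processed.cleveland.data",
    "processed.hungarian.data",
    "processed.va.data",
    "processed.switzerland.data",
]


def csv_shape(path):
    """Return (rows, cols) for a comma-separated file, skipping blank lines."""
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle) if row]
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        # A ragged file would make a single column count meaningless.
        raise ValueError(f"{path}: inconsistent column counts {sorted(widths)}")
    return len(rows), (widths.pop() if widths else 0)


def processed_shapes(data_dir):
    """Map each processed file name to its (rows, cols) shape."""
    data_dir = Path(data_dir)
    return {name: csv_shape(data_dir / name) for name in PROCESSED_FILES}
```

If the Processed table is accurate, `processed_shapes('data/uci-heart-disease')` should report 14 columns for every file, with row counts matching the table.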

## 3. Data dictionary