# 1. Acknowledgements
Any use of this dataset must credit the principal investigator responsible for the data collection at each institution. They are:

1. Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

# 2. Introduction

Several heart disease datasets available on the internet can be traced back to the UCI Heart Disease dataset. Quite often, these processed datasets do not discuss their origin or the issues they inherited.

To understand the underlying problem that the original author (Dr. Robert Detrano) intended to solve, it is more reliable to work from the original source.

The original <a href='https://archive.ics.uci.edu/dataset/45/heart+disease'>UCI Heart Disease</a> dataset contains 4 data files collected at the following four locations between 1981 and 1987:

| Sample Location                   | Collection Year | Filename                   | Records | Features | Purpose                     |
|:----------------------------------|:---------------:|:---------------------------|:-------:|:--------:|:----------------------------|
| Cleveland Clinic Foundation       |   1981 - 1984   | processed.cleveland.data   |   303   |    14    | Used for building the model |
| Hungarian Institute of Cardiology |   1983 - 1987   | processed.hungarian.data   |   294   |    14    | Used for testing            |
| Long Beach Medical Center         |   1984 - 1987   | processed.va.data          |   200   |    14    | Used for testing            |
| University Hospital Switzerland   |      1985       | processed.switzerland.data |   123   |    14    | Used for testing            |
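
As a minimal sketch of how the four files in the table could be read, the snippet below uses only the Python standard library. It assumes the processed files are comma-separated with no header row, that missing values are marked with `?`, and that the columns follow the 14-attribute order of the processed datasets. `load_processed` and `load_all` are hypothetical helper names introduced for illustration.

```python
import csv
from pathlib import Path

# Assumed column order of the 14 processed attributes.
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num",
]

# File names and purposes taken from the table above.
PROCESSED_FILES = {
    "processed.cleveland.data": "model building",
    "processed.hungarian.data": "testing",
    "processed.va.data": "testing",
    "processed.switzerland.data": "testing",
}


def load_processed(path):
    """Read one processed file into a list of dicts; '?' becomes None."""
    rows = []
    with open(path, newline="") as handle:
        for record in csv.reader(handle):
            if not record:  # skip blank lines
                continue
            values = [None if v.strip() == "?" else float(v) for v in record]
            rows.append(dict(zip(COLUMNS, values)))
    return rows


def load_all(data_dir):
    """Load every processed file that exists in data_dir, keyed by file name."""
    data_dir = Path(data_dir)
    return {
        name: load_processed(data_dir / name)
        for name in PROCESSED_FILES
        if (data_dir / name).exists()
    }
```

With the directory layout used in this notebook, `load_all("data/uci-heart-disease")` would return one list of records per file, so the per-file record counts can be compared against the table.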

### Background
Although the data is widely used in the ML field, its validity has drawn a fair amount of criticism.

For instance, two datasets (raw and processed) are available for each sample, and the author has noted that the original Cleveland data was corrupted.

Also, in most cases only the `processed.cleveland.data` dataset, with 303 records, is used, and the other datasets are ignored.

|        Raw         | Processed                  | Records | Raw Features | Processed Features |
|:------------------:|:---------------------------|:-------:|:------------:|:------------------:|
|   cleveland.data   | processed.cleveland.data   |   303   |      76      |         14         |
|   hungarian.data   | processed.hungarian.data   |   294   |      76      |         14         |
| long-beach-va.data | processed.va.data          |   200   |      76      |         14         |
|  switzerland.data  | processed.switzerland.data |   123   |      76      |         14         |

### Questions that need clarification:
- Since the combined dataset should provide 920 records, why are only 303 records typically used?
- Could datasets derived from this one and published elsewhere have inherited its issues without discussing them?
- Without understanding the background of the data, how can a model built on it be trusted?
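
The 920-record figure in the first question follows directly from the record counts in the tables above. The short check below is only an arithmetic sketch; the `RECORDS` dictionary is a hypothetical name that restates the table values.

```python
# Record counts per processed file, copied from the tables above.
RECORDS = {
    "processed.cleveland.data": 303,
    "processed.hungarian.data": 294,
    "processed.va.data": 200,
    "processed.switzerland.data": 123,
}

total = sum(RECORDS.values())
cleveland_share = RECORDS["processed.cleveland.data"] / total

print(total)                      # 920
print(f"{cleveland_share:.1%}")   # 32.9%
```

In other words, a study that uses only the Cleveland file works with roughly a third of the available records.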

The data corruption was acknowledged in the following file:

In [None]:
!cat data/uci-heart-disease/Warning

<a id='data_dictionary'></a>
# 3. Data Dictionary
The original (raw) files have 76 features, while the processed files have only 14 of them. In his <a href='International application of a new probability algorithm for the diagnosis of coronary artery disease.'>initial paper</a>, written in 1989, the author indicated that he needed only 13 features to build the prediction model (the 14<sup>th</sup> being the target variable).

So he dropped the other 62 features, leaving only 14 variables in the processed datasets, both for model building and for testing.

The detailed documentation can be found in the `heart-disease.names` file.

In [None]:
!cat data/uci-heart-disease/heart-disease.names

In the processed datasets, the author mapped the following variables from the original (raw) dataset. The data dictionary for the processed dataset used in this study is given below.

| No  | Original Name  |   Meaningful Name    | Description                                              | Values                                                                                                                              |
|:----|:--------------:|:--------------------:|----------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------|
| 1.  |      age       |         Age          | age in years                                             |                                                                                                                                     |
| 2.  |      sex       |        Gender        | sex                                                      | 1: male <br> 0: female                                                                                                              |
| 3.  |       cp       |      Chest Pain      | chest pain type                                          | 1: typical angina <br> 2: atypical angina <br> 3: non-anginal pain <br> 4: asymptomatic                                             |
| 4.  |    trestbps    |     BP Systolic      | systolic blood pressure at rest (in mm Hg)               |                                                                                                                                     |
| 5.  |      chol      |     Cholesterol      | serum cholesterol in mg/dl                               |                                                                                                                                     |
| 6.  |      fbs       |     Blood Sugar      | fasting blood sugar > 120 mg/dl                          | 1: true <br> 0: false                                                                                                               |
| 7.  |    restecg     |       Rest ECG       | resting electrocardiographic results                     | 0: normal <br> 1: having ST-T wave abnormality <br> 2: showing probable or definite left ventricular hypertrophy by Estes' criteria |
| 8.  |    thalach     |  Exe. Max Heartrate  | maximum heart rate achieved                              |                                                                                                                                     |
| 9.  |     exang      | Exe. Induced Angina  | exercise induced angina                                  | 1: yes <br> 0: no                                                                                                                   |
| 10. |    oldpeak     |  Exe. ST Depression  | ST depression induced by exercise relative to rest       |                                                                                                                                     |
| 11. |     slope      | Exe ST Segment Slope | the slope of the peak exercise ST segment                | 1: upsloping <br> 2: flat <br> 3: downsloping                                                                                       |
| 12. |       ca       |    Major Vessels     | number of major vessels (0-3) colored by fluoroscopy     | 0 <br> 1 <br> 2 <br> 3                                                                                                              |
| 13. |      thal      |    Thallium Scan     | thallium stress test result                              | 3: normal <br> 6: fixed defect <br> 7: reversible defect                                                                            |
| 14. |  num (target)  |        Target        | diagnosis of heart disease (angiographic disease status) | 0: < 50% diameter narrowing <br> 1: > 50% diameter narrowing                                                                        |
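
To make the coded values easier to read during exploration, the data dictionary can be expressed as a lookup table. The sketch below is one possible way to do that; `VALUE_LABELS` and `decode` are hypothetical names, the labels are shortened from the table above, and any code not listed in the table is deliberately left unchanged rather than guessed.

```python
# Labels for the categorical variables, taken from the data dictionary above.
VALUE_LABELS = {
    "sex": {1: "male", 0: "female"},
    "cp": {
        1: "typical angina",
        2: "atypical angina",
        3: "non-anginal pain",
        4: "asymptomatic",
    },
    "fbs": {1: "true", 0: "false"},
    "restecg": {
        0: "normal",
        1: "ST-T wave abnormality",
        2: "left ventricular hypertrophy",
    },
    "exang": {1: "yes", 0: "no"},
    "slope": {1: "upsloping", 2: "flat", 3: "downsloping"},
    "thal": {3: "normal", 6: "fixed defect", 7: "reversible defect"},
    "num": {0: "< 50% diameter narrowing", 1: "> 50% diameter narrowing"},
}


def decode(record):
    """Return a copy of record with categorical codes replaced by labels.

    Numeric variables, missing values (None), and codes that are not
    listed in VALUE_LABELS are passed through unchanged.
    """
    decoded = {}
    for name, value in record.items():
        labels = VALUE_LABELS.get(name)
        if labels is None or value is None:
            decoded[name] = value
        else:
            decoded[name] = labels.get(int(value), value)
    return decoded


example = {"age": 63.0, "sex": 1.0, "cp": 4.0, "thal": None}
print(decode(example))
```

Passing unknown codes through unchanged keeps any unexpected values visible, which is useful when investigating data validity.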

# 4. Objective
This study covers the following steps:
1. Exploratory Data Analysis, to:
    - define the problem statement
    - investigate the data validity
    - understand feature relationships and importance
2. Preprocessing, to prepare the data for model building
3. Build and fine-tune models:
    - SVM
    - LR
    - ?
4. Conclusion