# 1.0 Introduction
This notebook introduces the study: its objective, the approach taken, and a brief background on the dataset used.

A [data dictionary](#data_dictionary) for the dataset is provided at the end of the notebook.

## Background

Several heart disease datasets can be found on the internet, and most of them can be traced back to the <a href='https://archive.ics.uci.edu/dataset/45/heart+disease'>UCI Heart Disease</a> dataset. Quite often, these processed derivatives don't discuss their origin or the issues they inherited.

Before building ML models on such a dataset, it is important to understand the actual problem that the author, Dr. Robert Detrano, was trying to solve with it (see *International application of a new probability algorithm for the diagnosis of coronary artery disease*).

There are 4 samples in the UCI dataset as described below:
| Sample Location                   | Collection Year | Filename                   | Records | Features | Purpose                     |
|:----------------------------------|:---------------:|:---------------------------|:-------:|:--------:|:----------------------------|
| Cleveland Clinic Foundation       |   1981 - 1984   | processed.cleveland.data   |   303   |    14    | Used for building the model |
| Hungarian Institute of Cardiology |   1983 - 1987   | processed.hungarian.data   |   294   |    14    | Used for testing            |
| Long Beach Medical Center         |   1984 - 1987   | processed.va.data          |   200   |    14    | Used for testing            |
| University Hospital Switzerland   |      1985       | processed.switzerland.data |   123   |    14    | Used for testing            |

The author built 2 ML algorithms using the Cleveland sample and tested them on the other samples (Hungarian, Long Beach and Switzerland).
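To make the record counts in the table concrete, here is a minimal sketch of how the four processed files could be read and combined with pandas. The column names follow the data dictionary at the end of the notebook; the helper names (`load_sample`, `combine_samples`), the `location` column, and the assumption that missing values are encoded as `?` are mine, and the two in-memory rows are illustrative stand-ins so the sketch runs without the real files.

```python
import io

import pandas as pd

# The 14 processed features, in the order listed in the data dictionary.
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num",
]

# Record counts per sample, taken from the table above.
EXPECTED_RECORDS = {"cleveland": 303, "hungarian": 294, "va": 200, "switzerland": 123}


def load_sample(source, location):
    """Read one processed sample (path or file-like) and tag it with its location."""
    # Assumption: the files have no header row and mark missing values with "?".
    df = pd.read_csv(source, header=None, names=COLUMNS, na_values="?")
    df["location"] = location
    return df


def combine_samples(sources):
    """Concatenate samples given as a {location: path_or_buffer} mapping."""
    frames = [load_sample(src, loc) for loc, src in sources.items()]
    return pd.concat(frames, ignore_index=True)


# Illustrative one-row stand-ins for two of the files.
demo_sources = {
    "cleveland": io.StringIO("63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"),
    "switzerland": io.StringIO("32,1,1,95,0,?,0,127,0,.7,1,?,?,1\n"),
}
demo = combine_samples(demo_sources)
print(demo.shape)
print(sum(EXPECTED_RECORDS.values()))
```

With the real files, the mapping would instead point at paths such as `data/uci-heart-disease/processed.cleveland.data`, and the length of the combined frame can be compared against the sum of `EXPECTED_RECORDS` (303 + 294 + 200 + 123 = 920).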

## Objective
Although the data is widely used in the ML field, a few questions remain:
- The combined dataset should provide 920 records, so why are only 303 records usually mentioned?
- Could data derived from it and published elsewhere have inherited the same undocumented issues?
- Without data transparency, how can a model built on it be trusted?

In this study, the dataset is investigated thoroughly before any ML model is built on it.

To that end, the following steps are carried out:
1. Exploratory Data Analysis, to:
    - define the problem statement
    - investigate the data validity
    - understand feature relationships and importance
2. Preprocessing, to prepare the data for model building
3. Model building and fine-tuning:
    - SVM
    - LR
    - ?
4. Conclusion

## Approach
The approach is mostly the familiar ML pipeline described in the Objective. Only during EDA were a few detours taken, and they are worth explaining for clarity.

As illustrated in the image below:
- During EDA, missing values of more than 60% were discovered (see 2.0), which is unexpected for a processed dataset.
- The processed dataset was investigated for recovery, but the attempt failed (see 2.1).
- The raw dataset was then processed (2.3) and investigated (2.2) for recovery.
- The recovered data was concatenated into a DataFrame, and EDA and the remaining stages continued from there.

<img src="resources/approach-img.png" width="50%" height="50%">
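The missing-value check that triggered this detour can be sketched as follows. This is only an illustration of the idea, not the code used in section 2.0: the function name `missing_report`, the 60% default threshold argument, and the small demo frame are assumptions made for the example.

```python
import pandas as pd

NAN = float("nan")


def missing_report(df, threshold=60.0):
    """Return the features whose share of missing values (in %) exceeds the threshold."""
    pct = df.isna().mean().mul(100).round(1)
    return pct[pct > threshold].sort_values(ascending=False)


# Made-up values, only to exercise the function.
demo = pd.DataFrame({
    "age": [63, 67, 41, 56, 62],
    "chol": [233.0, 286.0, NAN, 236.0, 268.0],
    "ca": [0.0, NAN, NAN, NAN, NAN],
    "thal": [6.0, NAN, NAN, NAN, 7.0],
})
report = missing_report(demo)
print(report)
```

Because the comparison is strict, a feature with exactly 60% missing values (`thal` in the demo) is not reported; only `ca`, at 80%, is.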


<a id='data_dictionary'></a>
## Data Dictionary
Each sample has 2 data files, raw and processed, as described below:
|        Raw         | Processed                  | Records | Raw Features | Processed Features |
|:------------------:|:---------------------------|:-------:|:------------:|:------------------:|
|   cleveland.data   | processed.cleveland.data   |   303   |      76      |         14         |
|   hungarian.data   | processed.hungarian.data   |   294   |      76      |         14         |
| long-beach-va.data | processed.va.data          |   200   |      76      |         14         |
|  switzerland.data  | processed.switzerland.data |   123   |      76      |         14         |

The original (raw) files have 76 features each, while the processed files have only 14. The author needed only these 14 features to build the prediction model and dropped the rest.
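As a rough illustration of the 76-to-14 reduction, the sketch below splits a raw file into fixed-width records and picks the 14 used attributes by position. Everything here is an assumption to be verified against `heart-disease.names` and the files themselves: that a raw file is a stream of whitespace-separated tokens with 76 tokens per record (possibly wrapped over several lines), and that the 1-based attribute numbers in `USED_ATTRIBUTES` match the documentation. The helper names are hypothetical, and the demo text is synthetic.

```python
# 1-based attribute numbers of the 14 used features, as I read them in
# heart-disease.names; check them against your copy before relying on them.
USED_ATTRIBUTES = {
    "age": 3, "sex": 4, "cp": 9, "trestbps": 10, "chol": 12, "fbs": 16,
    "restecg": 19, "thalach": 32, "exang": 38, "oldpeak": 40, "slope": 41,
    "ca": 44, "thal": 51, "num": 58,
}
RAW_WIDTH = 76  # number of raw attributes per record


def split_records(text, width=RAW_WIDTH):
    """Split a whitespace-separated token stream into records of `width` tokens."""
    tokens = text.split()
    usable = len(tokens) - len(tokens) % width  # drop a trailing partial record
    return [tokens[i:i + width] for i in range(0, usable, width)]


def select_used(record):
    """Pick the 14 used attributes out of one 76-token raw record."""
    return {name: record[pos - 1] for name, pos in USED_ATTRIBUTES.items()}


# Synthetic input: token i of each record is simply the string str(i).
one_record = " ".join(str(i) for i in range(1, RAW_WIDTH + 1))
demo_text = "\n".join([one_record, one_record])

records = split_records(demo_text)
first = select_used(records[0])
print(len(records), first["age"], first["num"])
```

Splitting purely by token count is fragile: a single malformed record shifts every record after it, so any real use should validate each record before trusting the result.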

The detailed documentation can be found in the `heart-disease.names` file.

In [1]:
!cat data/uci-heart-disease/heart-disease.names

Publication Request: 
   >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
   This file describes the contents of the heart-disease directory.

   This directory contains 4 databases concerning heart disease diagnosis.
   All attributes are numeric-valued.  The data was collected from the
   four following locations:

     1. Cleveland Clinic Foundation (cleveland.data)
     2. Hungarian Institute of Cardiology, Budapest (hungarian.data)
     3. V.A. Medical Center, Long Beach, CA (long-beach-va.data)
     4. University Hospital, Zurich, Switzerland (switzerland.data)

   Each database has the same instance format.  While the databases have 76
   raw attributes, only 14 of them are actually used.  Thus I've taken the
   liberty of making 2 copies of each database: one with all the attributes
   and 1 with the 14 attributes actually used in past experiments.

   The authors of the databases have requested:

      ...that any publications resultin

In the processed data files, the author mapped the following variables from the original (raw) dataset.

The <u>data dictionary</u> for the processed dataset used in this study is given below.

| No  | Original Name  |   Meaningful Name    | Description                                              | Values                                                                                                                              |
|:----|:--------------:|:--------------------:|----------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------|
| 1.  |      age       |         Age          | age in years                                             |                                                                                                                                     |
| 2.  |      sex       |        Gender        | sex                                                      | 1: male <br> 0: female                                                                                                              |
| 3.  |       cp       |      Chest Pain      | chest pain type                                          | 1: typical angina <br> 2: atypical angina <br> 3: non-anginal pain <br> 4: asymptomatic                                             |
| 4.  |    trestbps    |     BP Systolic      | systolic blood pressure at rest (in mm Hg)               |                                                                                                                                     |
| 5.  |      chol      |     Cholesterol      | serum cholesterol in mg/dl                               |                                                                                                                                     |
| 6.  |      fbs       |     Blood Sugar      | fasting blood sugar > 120 mg/dl                          | 1: true <br> 0: false                                                                                                               |
| 7.  |    restecg     |       Rest ECG       | resting electrocardiographic results                     | 0: normal <br> 1: having ST-T wave abnormality <br> 2: showing probable or definite left ventricular hypertrophy by Estes' criteria |
| 8.  |    thalach     |  Exe. Max Heartrate  | maximum heart rate achieved                              |                                                                                                                                     |
| 9.  |     exang      | Exe. Induced Angina  | exercise induced angina                                  | 1: yes <br> 0: no                                                                                                                   |
| 10. |    oldpeak     |  Exe. ST Depression  | ST depression induced by exercise relative to rest       |                                                                                                                                     |
| 11. |     slope      | Exe ST Segment Slope | the slope of the peak exercise ST segment                | 1: upsloping <br> 2: flat <br> 3: downsloping                                                                                       |
| 12. |       ca       |    Major Vessels     | number of major vessels (0-3) colored by fluoroscopy     | 0 <br> 1 <br> 2 <br> 3                                                                                                              |
| 13. |      thal      |     Thalassemia      |                                                          | 3: normal <br> 6: fixed defect <br> 7: reversible defect                                                                            |
| 14. |  num (target)  |        Target        | diagnosis of heart disease (angiographic disease status) | 0: < 50% diameter narrowing <br> 1: > 50% diameter narrowing                                                                        |
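The dictionary above can also be kept in code, so that coded columns are decoded and renamed consistently during EDA. The sketch below is one possible way to do that with pandas; the names `MEANINGFUL_NAMES`, `VALUE_LABELS` and `decode` are hypothetical, only a subset of the coded features is mapped, and the two demo rows are made up.

```python
import pandas as pd

# Original name -> meaningful name, following the data dictionary table.
MEANINGFUL_NAMES = {
    "age": "Age", "sex": "Gender", "cp": "Chest Pain", "trestbps": "BP Systolic",
    "chol": "Cholesterol", "fbs": "Blood Sugar", "restecg": "Rest ECG",
    "thalach": "Exe. Max Heartrate", "exang": "Exe. Induced Angina",
    "oldpeak": "Exe. ST Depression", "slope": "Exe ST Segment Slope",
    "ca": "Major Vessels", "thal": "Thalassemia", "num": "Target",
}

# Code -> label for some of the categorical features in the table.
VALUE_LABELS = {
    "sex": {1: "male", 0: "female"},
    "cp": {1: "typical angina", 2: "atypical angina",
           3: "non-anginal pain", 4: "asymptomatic"},
    "fbs": {1: "true", 0: "false"},
    "exang": {1: "yes", 0: "no"},
    "slope": {1: "upsloping", 2: "flat", 3: "downsloping"},
    "thal": {3: "normal", 6: "fixed defect", 7: "reversible defect"},
}


def decode(df):
    """Return a copy with coded columns replaced by labels and columns renamed."""
    out = df.copy()
    for col, labels in VALUE_LABELS.items():
        if col in out.columns:
            # Codes that are not in the dictionary become NaN.
            out[col] = out[col].map(labels)
    return out.rename(columns=MEANINGFUL_NAMES)


# Made-up rows, only to exercise the function.
demo = pd.DataFrame({"age": [63, 41], "sex": [1, 0], "cp": [4, 2], "num": [1, 0]})
decoded = decode(demo)
print(decoded)
```

Decoding is best applied to a copy used for plots and summaries, while the numeric codes are kept for model building.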