# Dataset Description
The original <a href='https://archive.ics.uci.edu/dataset/45/heart+disease'>UCI Heart Disease</a> dataset contains four data files, collected at the following four locations between 1981 and 1987:

1. Cleveland Clinic Foundation (cleveland.data), 1981 - 1984
2. Hungarian Institute of Cardiology, Budapest (hungarian.data), 1983 - 1987
3. V.A. Medical Center, Long Beach, CA (long-beach-va.data), 1984 - 1987
4. University Hospital, Zurich, Switzerland (switzerland.data), 1985

## 1. Acknowledgements
Any use of the dataset must credit the principal investigator responsible for data collection at each institution:

1. Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

## 2. Background
This section gives a brief history of the dataset, then describes <u>the files used in this study and how they relate to each other</u>:
### 2.1 Dataset content
After the original dataset is downloaded, a few files are extracted into the `/heart-disease` directory. To see what the dataset contains, run the `cat` command on the `Index` file.

In [15]:
!cat data/uci-heart-disease/Index

Index of heart-disease

02 Dec 1996      644 Index
02 Dec 1996      dir costs
23 Jul 1996    11058 reprocessed.hungarian.data
14 Aug 1991     6737 bak
14 Aug 1991    10263 processed.hungarian.data
14 Aug 1991     4109 processed.switzerland.data
14 Aug 1991     6737 processed.va.data
20 Jul 1990   389771 new.data
06 Jun 1990    10060 heart-disease.names
15 Mar 1990      587 ask-detrano
15 Mar 1990    62192 hungarian.data
13 Mar 1990    23941 cleve.mod
06 Mar 1990    18461 processed.cleveland.data
31 Jan 1990    60669 cleveland.data
30 May 1989    39892 long-beach-va.data
30 May 1989    24674 switzerland.data


### 2.2 History (on raw data)
Dr. Robert Detrano used the raw `cleveland.data` file to build a model, which was then used to predict the other three datasets:
- hungarian.data
- long-beach-va.data
- switzerland.data

For unknown reasons, `cleveland.data` was corrupted during upload and can no longer be recovered. The `WARNING` file documents this (it can be inspected with the `cat` command).

In [16]:
!cat data/uci-heart-disease/WARNING

The file cleveland.data has been unfortunately messed up when we lost
node cip2 and loaded the file on node ics.  The file processed.cleveland.data
seems to be in good shape and is useable (for the 14 attributes situation).
I'll clean up cleveland.data as soon as possible.

Bad news: my original copy of the database appears to be corrupted.
I'll have to go back to the donor to get a new copy.

David Aha


#### `cleveland.data` Is Beyond Recovery
The other raw data files (`hungarian.data`, `long-beach-va.data` and `switzerland.data`) are plain ASCII, whereas `cleveland.data` is detected as binary, which suggests that its upload was disrupted. This can be checked with the `file -I` command (the macOS spelling; GNU `file` uses `-i` for the same MIME output).

In [17]:
# `file -I` reports the detected MIME type (text/plain) and character set (us-ascii).
!file -I data/uci-heart-disease/hungarian.data

data/uci-heart-disease/hungarian.data: text/plain; charset=us-ascii


In [18]:
# Here the detected MIME type is application/octet-stream (generic binary) with charset=binary.
!file -I data/uci-heart-disease/cleveland.data

data/uci-heart-disease/cleveland.data: application/octet-stream; charset=binary


Although the first half of `cleveland.data` appears to be readable ASCII, `tail -n 100 cleveland.data` shows that the second half is corrupted: the records are fragmented and interspersed with gibberish characters.

In [19]:
!tail -n 100 data/uci-heart-disease/cleveland.data

-21 1 1 1 1
1 1 1
1 1 -9 -9 n1 1 1
1 1 -9 -9 n1  10 0 0 1 50 0 0 1 5020 0 -9-9 20  1 -9 -9 -9
-9 3 1382 -9-9 9
-9 9
- 84 0 1  84 0 1  3 -9 2050 701 0 1 01 0 1 01-9 -9 -9
-9 4 136-9 -9 -9
-9 4 136-9
1 9
1 92 -9 1 2 1 10 81 0 0 81 0 0  060 360 36  67.5 3.2 09 7 -9 7 -9  980 1 0
0 1
1 1 -7 177 177  70 0
0.2 0 0
0.2 0 24 81 0 0 01 0 0 01  15 -9 5 1 1 1
1 1 - 1 -9 1 1 1 1 1
1 10 80 10 80 10-9 3 --9 3 --  -60 60 60 13    0 1  0 1         1 1
1 1 -9 7
117
1174 0 13 12
4 11 -9 -9 -9
-9 -9 -9 3 -9 -9 -9 1 -9 1
-9 0 55  0 55     -9 0 2 19 -9
-9 4 130 5 0 -9 -9 -8 808 808 30
0  -9 -9 12
6 2 1 1 1
7 2 1 1 1
7  1 32<<<  <
5  26705 64 1 -9  6
891 0�40 0 1940 0 194 2 -9 22 2 -9 22               60 0  60 0     2116 3 -93 1 -9 -9 n
-9 1 -9 1 1 1 1 1
90 0 -90 0 -92 80 1 1 0 18 1585 15 -9 7 -9 -70 1070 107e
fff -9 -9 name
89 81 0 1  81 0 1     7 17 name
23 name
23     15 1 1 -9 -9 -9
-9 -7 2449494811 1 12 1282 1282  2-9 -9 -9 3 -0 1450 1450  5 8626 84 --9 name
75-
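
The same check can be made from Python by scanning a file for the first byte that is not plain text. The sketch below is a minimal illustration and not how `file` itself classifies content; the helper name `first_non_text_offset` and the choice of which bytes count as text are assumptions made for this example.

```python
from pathlib import Path

# Assumption: "text" means printable ASCII plus tab, line feed and carriage return.
TEXT_BYTES = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def first_non_text_offset(path):
    """Return the offset of the first byte outside TEXT_BYTES, or None if there is none."""
    raw = Path(path).read_bytes()
    for offset, byte in enumerate(raw):
        if byte not in TEXT_BYTES:
            return offset
    return None
```

Calling `first_non_text_offset('data/uci-heart-disease/cleveland.data')` would give the byte offset of the first non-text byte, while a file that is clean ASCII throughout returns `None`.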

### 2.3 From Raw to Preprocessed
That history explains why the original (raw) `cleveland.data` file was never available for preprocessing: no attempt to reload the corrupted file appears to have been made since. As a workaround, a set of processed files was made available instead.\
The original (raw) files have 76 features each (described in the `Data Dictionary` section), whereas the processed files have only 14. In his initial paper from 1989, *International application of a new probability algorithm for the diagnosis of coronary artery disease*, Dr. Robert Detrano indicates that his model needs only 13 of the features (the 14<sup>th</sup> is the target variable).\
The tables below show the relationship between the original (raw) and the processed data.
#### Original (raw)
| Filename                   | Records (Rows) | Features (Cols) |
|:---------------------------|:--------------:|:---------------:|
| cleveland.data             |      303       |       76        |
| hungarian.data             |      294       |       76        |
| long-beach-va.data         |      200       |       76        |
| switzerland.data           |      123       |       76        |

#### Processed
| Filename                   | Records (Rows) | Features (Cols) |
|:---------------------------|:--------------:|:---------------:|
| processed.cleveland.data   |      303       |       14        |
| processed.hungarian.data   |      294       |       14        |
| processed.va.data          |      200       |       14        |
| processed.switzerland.data |      123       |       14        |
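
The counts in these tables can be re-derived from the files themselves. The sketch below uses only the standard library. It assumes the processed files are comma-separated with one record per line, and that each raw record is a run of whitespace-separated tokens ending with the literal token `name` (the token visible in the `tail` output above). Both helper names are hypothetical.

```python
import csv


def processed_shape(path):
    """Return (rows, columns) for a comma-separated processed file."""
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle) if row]
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError(f"inconsistent column counts: {sorted(widths)}")
    return len(rows), (widths.pop() if widths else 0)


def raw_records(path, terminator="name"):
    """Group whitespace-separated tokens into records that end at `terminator`.

    Assumption: every raw record ends with the token 'name'. Tokens after
    the last terminator are dropped as an incomplete record.
    """
    with open(path, encoding="ascii", errors="replace") as handle:
        tokens = handle.read().split()
    records, current = [], []
    for token in tokens:
        current.append(token)
        if token == terminator:
            records.append(current)
            current = []
    return records
```

For an intact raw file, `len(raw_records(path))` should match the row count in the table, and each record's length gives its number of fields. A corrupted file would show up as records of uneven length.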


# Raw Data
Explain how the columns of the processed data relate to the original 76 columns.

## TODO:
1. Check for duplicate rows.

### Was the pressure converted to MAP?
- https://www.mdcalc.com/calc/74/mean-arterial-pressure-map
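
For reference, mean arterial pressure is commonly approximated from a systolic (SBP) and a diastolic (DBP) reading as MAP = DBP + (SBP - DBP) / 3. The sketch below only illustrates that arithmetic. The function name is hypothetical, and whether `trestbps` was derived this way remains the open question above.

```python
def mean_arterial_pressure(systolic, diastolic):
    """Approximate MAP in mm Hg: diastolic plus one third of the pulse pressure."""
    if systolic < diastolic:
        raise ValueError("systolic pressure should not be below diastolic pressure")
    return diastolic + (systolic - diastolic) / 3.0
```

A reading of 120/80 mm Hg gives roughly 93.3 mm Hg under this approximation.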

## Data dictionary



In z-lab
- With heart disease (1): 165
- No heart disease (0): 138

Actual
- With heart disease (1,2,3,4): 139
- No heart disease (0): 164

Value from 1 to 4
1. Mild heart disease.
2. Moderate heart disease.
3. Severe heart disease.
4. Very severe heart disease.

In [None]:
# All required libraries.
import pandas as pd

In [None]:
# Column names for the 14 processed attributes (the file has no header row).
header = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
data = pd.read_csv('data/uci-heart-disease/processed.cleveland.data', names=header)
# 303 records and 14 columns.
data.head(5)

## Check for duplicate rows and missing data.

In [None]:
# When all columns were checked, no duplicate rows were found.
duplicate = data[data.duplicated()]
duplicate

In [None]:
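# No NaN values are reported. isnull and isna are the same; both check for None, NaN or NaT (datetime).
# They do not count placeholder strings such as '?', which are checked separately below.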
data.isnull().sum()

# Data Exploration (EDA)

In [None]:
# Check the target balance.
data["num"].value_counts()

In [None]:
data["num"].value_counts().plot(kind='bar');

In [None]:
# Collapse grades 1-4 into 1 (disease present); only the 'num' column is modified.
data.loc[data["num"] != 0, "num"] = 1
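
An alternative that keeps the original 0 to 4 grades available is to derive a separate binary column instead of overwriting `num`. The sketch below is a minimal illustration on made-up rows; the helper `add_binary_target` and the column name `target` are assumptions, not part of the dataset.

```python
import pandas as pd


def add_binary_target(frame, source="num", target="target"):
    """Return a copy of `frame` with a 0/1 column that is 1 wherever `source` > 0."""
    result = frame.copy()
    result[target] = (result[source] > 0).astype(int)
    return result


# Made-up example rows, not taken from the dataset.
example = pd.DataFrame({"age": [50, 61, 47], "num": [0, 3, 1]})
labelled = add_binary_target(example)
```

Because the function works on a copy, `example` is left unchanged and every other column keeps its values.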

In [None]:
data["num"].value_counts()

In [None]:
data["num"].value_counts().plot(kind='bar', color=['salmon','lightblue']);

In [None]:
data.cp.value_counts()

In [None]:
# Taken from heart-disease.names:
#   The "goal" field refers to the presence of heart disease
#   in the patient.  It is integer valued from 0 (no presence) to 4.
#   Experiments with the Cleveland database have concentrated on simply
#   attempting to distinguish presence (values 1,2,3,4) from absence (value
#   0).
data.num.value_counts()

In [None]:
data.isna().sum()

In [None]:
# Two columns, 'ca' and 'thal', have dtype object because they contain non-numeric values.
data.info()

In [None]:
# In column 'ca', 4 records contain the value '?'.
data['ca'].value_counts()

In [None]:
# In column 'thal', 2 records contain the value '?'.
data['thal'].value_counts()

In [None]:
selected = data[data['thal']=='?']
selected
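
Because the `?` entries are placeholders for missing values, one option is to have pandas convert them to `NaN` while reading, so that `isna()` counts them and the affected columns stay numeric. The sketch below is a minimal illustration on made-up rows; the three-column layout and the helper name `load_with_missing` are assumptions for this example only.

```python
import io

import pandas as pd

COLUMNS = ["ca", "thal", "num"]
# Made-up rows in comma-separated form, not real patient records.
SAMPLE = "0.0,3.0,0\n?,7.0,1\n2.0,?,0\n"


def load_with_missing(source, columns):
    """Read comma-separated data, treating '?' as a missing value."""
    return pd.read_csv(source, names=columns, na_values="?")


frame = load_with_missing(io.StringIO(SAMPLE), COLUMNS)
missing_per_column = frame.isna().sum()
```

The same `na_values="?"` argument could be passed when reading `processed.cleveland.data`, after which the rows with missing `ca` or `thal` can be dropped or imputed as the analysis requires.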

In [None]:
data['sex'].value_counts()