# Introduction
As explained in [01.data-introduction](01.data-introduction.ipynb), the author (Dr. Roberto Detrano) set out to build a model that predicts heart disease from 13 features and 1 target variable. He built the model on the Cleveland dataset and tested it against the other datasets. The datasets are summarized in the table below:

| Filename                   | Records (Rows) | Features (Cols) | Description                 |
|:---------------------------|:--------------:|:---------------:|:----------------------------|
| processed.cleveland.data   |      303       |       14        | Used for building the model |
| processed.hungarian.data   |      294       |       14        | Used for testing            |
| processed.va.data          |      200       |       14        | Used for testing            |
| processed.switzerland.data |      123       |       14        | Used for testing            |

## Data Dictionary
In the processed datasets (described in [01.data-introduction](01.data-introduction.ipynb)), the author mapped the following variables from the original (raw) dataset to the processed dataset used for building the model. The table below is the <u>data dictionary for the processed dataset used in this study</u>.

| No  | Name         | Description                                              | Values                                                                                                                              |
|:----|:-------------|:---------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------|
| 1.  | age          | age in years                                             |                                                                                                                                     |
| 2.  | sex          | sex                                                      | 1: male <br> 0: female                                                                                                              |
| 3.  | cp           | chest pain type                                          | 1: typical angina <br> 2: atypical angina <br> 3: non-anginal pain <br> 4: asymptomatic                                             |
| 4.  | trestbps     | systolic blood pressure at rest (in mm Hg)               |                                                                                                                                     |
| 5.  | chol         | serum cholesterol in mg/dl                               |                                                                                                                                     |
| 6.  | fbs          | fasting blood sugar > 120 mg/dl                          | 1: true <br> 0: false                                                                                                               |
| 7.  | restecg      | resting electrocardiographic results                     | 0: normal <br> 1: having ST-T wave abnormality <br> 2: showing probable or definite left ventricular hypertrophy by Estes' criteria |
| 8.  | thalach      | maximum heart rate achieved                              |                                                                                                                                     |
| 9.  | exang        | exercise induced angina                                  | 1: yes <br> 0: no                                                                                                                   |
| 10. | oldpeak      | ST depression induced by exercise relative to rest       |                                                                                                                                     |
| 11. | slope        | the slope of the peak exercise ST segment                | 1: upsloping <br> 2: flat <br> 3: downsloping                                                                                       |
| 12. | ca           | number of major vessels (0-3) colored by fluoroscopy     |                                                                                                                                     |
| 13. | thal         |                                                          | 3: normal <br> 6: fixed defect <br> 7: reversible defect                                                                            |
| 14. | num (target) | diagnosis of heart disease (angiographic disease status) | 0: < 50% diameter narrowing <br> 1: > 50% diameter narrowing                                                                        |
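
The coded values in the data dictionary can be turned into readable labels during exploration. The following is a minimal sketch, assuming the codes listed in the table above; the names `CP_LABELS`, `SEX_LABELS` and the `sample` rows are hypothetical and used for illustration only.

```python
import pandas as pd

# Code-to-label mappings copied from the data dictionary above.
SEX_LABELS = {1: "male", 0: "female"}
CP_LABELS = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}

# Hypothetical sample rows, for illustration only (not taken from the dataset).
sample = pd.DataFrame({"sex": [1, 0, 1], "cp": [4, 2, 1]})

# Add label columns next to the original codes instead of overwriting them.
labelled = sample.assign(
    sex_label=sample["sex"].map(SEX_LABELS),
    cp_label=sample["cp"].map(CP_LABELS),
)
print(labelled)
```

Keeping the numeric codes alongside the labels leaves the original columns available for modelling.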

# The Objective
When building ML models on the heart-disease dataset, the following questions need to be answered at this stage (data exploration).
1. What problem needs to be solved?
2. What data types does the dataset contain?
3. Does the dataset have empty columns (features)?
4. Does the dataset have duplicate rows (records)?
5. Is the dataset sufficient for building the model(s)?
6. Which features appear more important than the others?

Once these questions are answered, a list of action items can be drawn up for the next stage.
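
Questions 2 to 4 can be answered mechanically with pandas. The following is a rough sketch, not part of the original analysis: `quick_overview` is a hypothetical helper and `toy` is made-up data used only to show the output shape.

```python
import pandas as pd


def quick_overview(df: pd.DataFrame) -> dict:
    """Summarise data types, fully empty columns and duplicate rows."""
    return {
        # Question 2: data type of each column.
        "dtypes": df.dtypes.astype(str).to_dict(),
        # Question 3: columns in which every value is missing.
        "empty_columns": [col for col in df.columns if df[col].isna().all()],
        # Question 4: rows that repeat an earlier row exactly.
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Hypothetical toy frame, for illustration only.
toy = pd.DataFrame(
    {
        "age": [63, 67, 67],
        "ca": ["0.0", "?", "?"],
        "unused": [None, None, None],
    }
)
overview = quick_overview(toy)
print(overview)
```

Note that a placeholder string such as `'?'` is not counted as missing by `isna()`, so a column filled with placeholders would not be reported as empty.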

## 1. What problem needs to be solved?
<b>Classify</b> - determine whether a patient has heart disease based on their medical data.

If it's classification problem,




In [1]:
# All required libraries.
import pandas as pd

In [None]:
header = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
data = pd.read_csv('data/uci-heart-disease/processed.cleveland.data', names=header)
# 303 records and 14 columns.
data.head(5)

## Check for duplicate rows and missing data.

In [None]:
# When all columns were checked, no duplicate rows were found.
duplicate = data[data.duplicated()]
duplicate

In [None]:
# No NaN values are reported. isnull and isna are aliases; both check for None, NaN or NaT (datetime).
# Note: placeholder strings such as '?' are not detected by this check.
data.isnull().sum()

# Data Exploration (EDA)

In [2]:
# Check the target balance.
data["num"].value_counts()

In [None]:
data["num"].value_counts().plot(kind='bar');

In [None]:
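# Binarize the target only: any non-zero value (1-4) becomes 1.
# Assigning to the whole row selection would overwrite every column with 1.
data.loc[data["num"] != 0, "num"] = 1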

In [None]:
data["num"].value_counts()

In [None]:
data["num"].value_counts().plot(kind='bar', color=['salmon','lightblue']);

In [None]:
data.cp.value_counts()

In [None]:
# Taken from heart-disease.names:
#     The "goal" field refers to the presence of heart disease
#     in the patient.  It is integer valued from 0 (no presence) to 4.
#     Experiments with the Cleveland database have concentrated on simply
#     attempting to distinguish presence (values 1,2,3,4) from absence (value
#     0).
data.num.value_counts()
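
The presence/absence distinction quoted above can also be derived without modifying the original column. The following is a minimal sketch, assuming target values in the 0 to 4 range; the `num` series is made-up data and `has_disease` is a hypothetical name.

```python
import pandas as pd

# Hypothetical target values in the 0-4 range, for illustration only.
num = pd.Series([0, 2, 1, 0, 4, 3], name="num")

# 1 where any disease is present (values 1-4), 0 where it is absent (value 0).
has_disease = (num > 0).astype(int)

print(has_disease.value_counts().sort_index())
```

Deriving a new series leaves the original 0 to 4 grading intact in case it is needed later.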

In [None]:
data.isna().sum()

In [None]:
# On inspection, two columns ('ca' and 'thal') contain non-numeric values.
data.info()

In [None]:
# In column 'ca', 4 records contain the value '?'.
data['ca'].value_counts()

In [None]:
# In column 'thal', 2 records contain the value '?'.
data['thal'].value_counts()

In [None]:
selected = data[data['thal']=='?']
selected
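
Because `'?'` marks missing values in `ca` and `thal`, one possible way to surface them is to tell `pd.read_csv` to treat `'?'` as NaN at load time. The following is a sketch under that assumption, not the approach taken in this notebook; the rows in `raw` are illustrative and only mimic the layout of the processed file.

```python
import io

import pandas as pd

header = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
          'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']

# Illustrative rows in the same comma-separated layout as the processed file.
raw = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0\n"
    "52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0\n"
)

# na_values="?" converts the placeholder to NaN, so the columns parse as numeric.
sample = pd.read_csv(raw, names=header, na_values="?")

missing_per_column = sample.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

With the placeholders converted, `isna().sum()` reports the missing entries and `ca` and `thal` are read as floating-point columns rather than text.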

In [None]:
data['sex'].value_counts()