# Practical session: Exploratory Data Analysis (EDA) - Heart disease 

The main purpose of EDA is to help look at data before making any assumptions. It can help to:
- identify obvious errors,
- understand patterns,
- detect outliers or anomalous events, 
- find relations among the variables.

## Dataset

This dataset dates from 1988 and combines four databases: Cleveland, Hungary, Switzerland, and Long Beach VA. It contains 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them. The `target` field indicates the presence of heart disease in the patient. It is integer-valued: 0 = no disease, 1 = disease.

This dataset contains 14 columns that represent the following features:

- `age`: age
- `sex`: sex (1 = male, 0 = female)
- `cp`: chest pain type (0 = typical angina; 1 = atypical angina; 2 = non-anginal pain; 3 = asymptomatic)
- `trestbps`: resting blood pressure
- `chol`: cholesterol
- `fbs`: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- `restecg`: resting ECG result (0 = normal; 1 = ST-T wave abnormality; 2 = probable or definite left ventricular hypertrophy by Estes' criteria)
- `thalach`: maximum heart rate
- `exang`: exercise-induced angina pectoris (1 - yes; 0 - no)
- `oldpeak`: ST depression induced by exercise relative to rest
- `slope`: slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping)
- `ca`: number of major blood vessels colored by fluoroscopy (0-4)
- `thal`: a blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
- `target`: presence of heart disease in the patient (0 - no; 1 - yes)

```python
## Importing Libraries

# Base libraries
import os
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Visualisation
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Default options
pd.set_option("display.float_format", lambda x: "%.2f" % x)
```

## Data loading

- Folder: `heart` 
- File: `heart.csv`
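
A minimal loading sketch, assuming the folder and file listed above sit next to the notebook. The helper name `load_heart` and the `DATA_PATH` constant are illustrative choices, not part of the original material.

```python
import os
import pandas as pd

# Assumed location, built from the folder and file names listed above.
DATA_PATH = os.path.join("heart", "heart.csv")

def load_heart(path=DATA_PATH):
    """Read the heart disease CSV into a DataFrame (hypothetical helper)."""
    return pd.read_csv(path)

# Only load when the file is actually present at the assumed path.
if os.path.exists(DATA_PATH):
    df = load_heart()
    print(df.shape)   # (rows, columns)
    print(df.head())  # first five rows
```

If the file lives elsewhere, pass its path explicitly, for example `load_heart("path/to/heart.csv")`.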

## Data overview

### Variables

#### Independent Variable 

*aka:* `Feature, Independent, Input, Column, Predictor, Explanatory`

A variable that is assumed to influence the dependent variable.

#### Dependent Variable

*aka:* `Target, Dependent, Output, Response`

The variable to be predicted or explained; in this dataset, the `target` column.

#### Variables types

- `Numeric Variables`: temperature, age, square footage, price, etc.
- `Categorical Variables` (Nominal, Ordinal): gender (nominal), survival status (nominal), football teams (nominal), educational level (ordinal)
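
In this dataset every column is stored as a number, so the data type alone cannot tell a numeric variable from a coded categorical one. One possible heuristic, sketched below, treats a column with few distinct values as categorical. The helper `split_by_type`, its `max_categories` threshold, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd

def split_by_type(df, max_categories=5):
    """Split columns into (numeric, categorical) by number of distinct values.

    max_categories is an arbitrary, assumed threshold, not a fixed rule.
    """
    categorical = [c for c in df.columns if df[c].nunique() <= max_categories]
    numeric = [c for c in df.columns if c not in categorical]
    return numeric, categorical

# Made-up rows for illustration only, not taken from heart.csv.
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44],
    "chol": [233, 250, 204, 236, 354, 263],
    "sex": [1, 1, 0, 1, 0, 1],
    "cp": [3, 2, 1, 1, 0, 2],
})

numeric_cols, categorical_cols = split_by_type(sample)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

The threshold should be checked against the feature list above, since a truly numeric column with few distinct values would be misclassified.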

## Statistics  
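
As a starting point for this step, the sketch below computes summary statistics and missing-value counts with pandas. The sample rows are made up for illustration; on the real data the same calls would run on the loaded DataFrame.

```python
import pandas as pd

# Made-up rows for illustration only, not taken from heart.csv.
sample = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
})

# One row per column: count, mean, std, min, quartiles, max.
summary = sample.describe().T
print(summary)

# Number of missing values in each column.
missing = sample.isna().sum()
print(missing)
```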


## Target variable
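
A quick way to inspect the target is to count how many patients fall in each class and what share of the data each class represents. The sketch below assumes the column is named `target`, as in the feature list; the helper `target_balance` and the sample values are illustrative.

```python
import pandas as pd

def target_balance(df, column="target"):
    """Return counts and shares per class of the target column."""
    counts = df[column].value_counts().sort_index()
    shares = df[column].value_counts(normalize=True).sort_index()
    return pd.DataFrame({"count": counts, "share": shares})

# Made-up labels for illustration only, not taken from heart.csv.
sample = pd.DataFrame({"target": [1, 0, 1, 1, 0]})

balance = target_balance(sample)
print(balance)
```

A strongly unbalanced target would affect how later results should be read, so it is worth checking before any feature analysis.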

## Feature analysis

## Correlation between features

- `POSITIVE CORRELATION`: If feature B tends to increase as feature A increases, the two are **positively correlated**. A value of 1 means perfect positive correlation.
- `NEGATIVE CORRELATION`: If feature B tends to decrease as feature A increases, the two are **negatively correlated**. A value of -1 means perfect negative correlation.

**Attention:** If two features are highly or perfectly correlated, they carry almost the same information, so keeping both adds little. This is known as **multicollinearity**.

*Note: correlation is computed for numeric features only.*
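
The sketch below computes a Pearson correlation matrix with `DataFrame.corr` and lists the feature pairs whose absolute correlation exceeds a threshold, which is one simple way to spot candidates for multicollinearity. The helper `high_correlation_pairs`, the 0.8 threshold, and the sample columns are assumptions made for illustration.

```python
import pandas as pd

def high_correlation_pairs(df, threshold=0.8):
    """Return the correlation matrix and the pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()  # Pearson by default
    cols = corr.columns
    pairs = []
    # Visit each unordered pair once (upper triangle of the matrix).
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return corr, pairs

# Made-up columns for illustration only.
sample = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # exactly 2 * a: perfect positive correlation
    "c": [5, 4, 3, 2, 1],   # exactly 6 - a: perfect negative correlation
    "d": [3, 1, 4, 2, 5],   # only moderately related to a
})

corr, pairs = high_correlation_pairs(sample)
print(corr)
for feature_a, feature_b, r in pairs:
    print(feature_a, feature_b, round(r, 2))
```

In a notebook, the matrix can also be drawn with `sns.heatmap(corr, annot=True)` for a visual overview.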

## Multivariate analysis

## Outliers

An outlier is a data point that deviates significantly from the rest of the (so-called normal) data points.

<img width="300" alt="image" src="https://www.researchgate.net/publication/353410712/figure/fig1/AS:1048732418203648@1627048701501/Removal-of-outliers-using-IQR-method.png">

The interquartile range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1):

```
IQR = Q3 - Q1
```

To detect outliers with this method, we define a decision range from the quartiles and the IQR:

```
Lower Bound: (Q1 - 1.5 * IQR)
Upper Bound: (Q3 + 1.5 * IQR)
```
Any data point below the lower bound or above the upper bound is considered an outlier.
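
The rule above can be sketched with pandas as follows. The helper names `iqr_bounds` and `iqr_outliers` and the sample values are illustrative; quartiles are taken with `Series.quantile`, which uses linear interpolation by default, so other quartile conventions may give slightly different bounds.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds: Q1 - k * IQR and Q3 + k * IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(series, k=1.5):
    """Return the values lying outside the IQR decision range."""
    lower, upper = iqr_bounds(series, k)
    return series[(series < lower) | (series > upper)]

# Made-up values for illustration only.
values = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 100])

lower, upper = iqr_bounds(values)
print(lower, upper)
print(iqr_outliers(values))
```

What to do with the flagged points (remove, cap, or keep them) depends on the feature and is a separate decision from detecting them.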

