# Feature Analysis #

This notebook plots each feature in the dataset to inform how the machine learning problem should be approached.



In [3]:
# import libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

**This is a list of all features in the dataset:**

In [4]:
heart_data = pd.read_csv("data/heart.csv")
heart_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


**Visualize the Age Feature**


![Age_Distribution](images\AgeDistr.png)       ![Age_Percents](images\AgePercents.png)


**Visualize the Gender Feature**

0 = female, 1 = male

![Gender_Distribution](images\GenderDistr.png)       

**Visualize the Chest Pain Feature**

0. no chest pain
1. typical angina
2. atypical angina
3. non-anginal pain
4. asymptomatic

![ChestPain_Distribution](images\ChestPainDistr.png)         ![ChestPain_Percents](images\ChestPainType.png)      

**Visualize the Resting Blood Pressure Feature**

| Top number (systolic) in mm Hg | Category |
| --- | --- |              
| Below 120     | Normal blood pressure                       |
| 120-129       | Elevated blood pressure                     |
| 130-139       | Stage 1 high blood pressure (hypertension)  |
| 140 or higher | Stage 2 high blood pressure (hypertension)  |

![BP_Distribution](images\BPDistr.png)         ![BP_Percents](images\BloodPressureCategory.png)      
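
The categories in the table above can be expressed as a small lookup function. This is only a sketch: the function name `bp_category` and the sample readings are made up for illustration and are not taken from the notebook or the dataset.

```python
def bp_category(systolic: float) -> str:
    """Return the blood pressure category for a systolic reading in mm Hg."""
    if systolic < 120:
        return "Normal"
    if systolic < 130:
        return "Elevated"
    if systolic < 140:
        return "Stage 1 hypertension"
    return "Stage 2 hypertension"


# Made-up readings, for illustration only (not values from heart.csv)
for reading in (118, 125, 135, 150):
    print(reading, bp_category(reading))
```

Assuming the `heart_data` frame loaded earlier, something like `heart_data["trestbps"].apply(bp_category).value_counts(normalize=True)` should give the share of patients in each category.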

**Visualize the Cholesterol Feature**

| Category | Serum cholesterol level |
| --- | --- | 
| Healthy | less than 200 mg/dL |
| Borderline high | 200-239 mg/dL |
| High | 240 mg/dL and above |

![Chol_Distribution](images\CholDistr.png)         ![Chol_Percents](images\CholestrolLevel.png)      
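
One possible way to assign the cholesterol levels from the table is a threshold lookup with the standard-library `bisect` module. The names `CHOL_THRESHOLDS`, `CHOL_LABELS`, and `chol_category` are hypothetical, and the sample values are invented for illustration.

```python
from bisect import bisect_right

# Values (mg/dL) at which the next category starts, per the table above
CHOL_THRESHOLDS = [200, 240]
CHOL_LABELS = ["Healthy", "Borderline high", "High"]


def chol_category(chol_mg_dl: float) -> str:
    """Return the cholesterol category for a serum cholesterol value in mg/dL."""
    # bisect_right counts how many thresholds are <= the value,
    # which is exactly the index of the matching label.
    return CHOL_LABELS[bisect_right(CHOL_THRESHOLDS, chol_mg_dl)]


# Made-up values, for illustration only (not values from heart.csv)
for value in (180, 200, 239, 240):
    print(value, chol_category(value))
```

Keeping the thresholds in a list means the cut-offs can be adjusted in one place if a different guideline is preferred.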

**Visualize the Blood Sugar Feature**

1 (true) - fasting blood sugar is greater than 120 mg/dl (prediabetes / diabetes)

0 (false) - fasting blood sugar is 120 mg/dl or lower (normal / prediabetes)

![BloodSugar_Distribution](images\BloodSugarDistr.png)       

**Visualize the ECG Feature**

0 = normal

1 = having ST-T wave abnormality (nonspecific findings found on an EKG which may represent areas of low blood flow to the heart)

2 = left ventricular hypertrophy (thickening of the wall of the left ventricle, commonly caused by high blood pressure)

![ECG_Distribution](images\ECGDistr.png)         ![ECG_Percents](images\RestingECG.png)      

**Visualize the Maximum Heart Rate and Exercise Induced Angina Features**

*Heart Rate (left)*

220 - Your Age = Predicted Maximum Heart Rate.

*Angina (right)*

1 = yes; 0 = no

![MaxHeartRate_Distribution](images\MaxHRDistr.png)         ![ExerciseAngina_Distribution](images\ExerAngDistr.png)          
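
The rule of thumb above (220 minus age) makes it possible to express a measured maximum heart rate as a percentage of the predicted one. The sketch below assumes that comparison is of interest. The function names and the example patient are hypothetical.

```python
def predicted_max_heart_rate(age: int) -> int:
    """Predicted maximum heart rate using the 220 - age rule of thumb."""
    return 220 - age


def percent_of_predicted(age: int, achieved_rate: float) -> float:
    """Achieved maximum heart rate as a percentage of the predicted maximum."""
    return 100 * achieved_rate / predicted_max_heart_rate(age)


# Made-up patient, for illustration only: age 50, achieved 153 bpm
print(predicted_max_heart_rate(50))
print(percent_of_predicted(50, 153))
```

With the `heart_data` frame, the same calculation could be written column-wise as `100 * heart_data["thalach"] / (220 - heart_data["age"])`.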

**Visualize the Old Peak (ST Depression) Feature**

0 - 1 mm: not significant

1 - 1.5 mm: significant in V5 - V6

1.5 - 2 mm: significant in aVF or III

2 mm or more: reversible ischaemia (in exercise stress test)



![STDepression_Distribution](images\STDeprDistr.png)         ![STDepression_Percents](images\STDepression.png)      

**Visualize the Slope Feature**

0: upsloping

1: flat

2: downsloping

"flat" is a healthy slope


![Slope_Distribution](images\SlopeDistr.png)         ![Slope_Percents](images\Slope.png)      

**Visualize the CA Feature**

CA refers to the number of major vessels (0-3) colored by fluoroscopy.

The documentation suggests the data will show values from 0 to 3.

However, values from 0 to 4 are present. It is unclear what the extra value means, or whether each value is a positive or negative indicator.

I wonder if 0 actually means the patient didn't have this test, because a majority of the patients have this value.

It doesn't make sense that the test would show no major vessels for most patients, but perhaps this will be explained further in the correlational studies later in this project.


![CA_Distribution](images\CADistr.png)         ![CA_Percents](images\CA.png)          
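
The mismatch between the documented range (0-3) and the observed values can be checked directly. The following is a minimal sketch, assuming the documented range is exactly 0 to 3. The helper `summarize_ca` and the sample list are made up for illustration and do not reflect the real counts in the dataset.

```python
from collections import Counter


def summarize_ca(values, documented=range(0, 4)):
    """Return the share of each value and any values outside the documented range."""
    counts = Counter(values)
    total = len(values)
    # Fraction of rows holding each distinct value, in ascending value order
    shares = {value: counts[value] / total for value in sorted(counts)}
    # Values that appear in the data but not in the documentation
    unexpected = sorted(value for value in counts if value not in documented)
    return shares, unexpected


# Made-up sample, for illustration only (not values from heart.csv)
sample = [0, 0, 0, 0, 1, 2, 3, 4]
shares, unexpected = summarize_ca(sample)
print(shares)
print(unexpected)
```

Calling `summarize_ca(heart_data["ca"].tolist())` on the loaded frame should list which undocumented values occur and how common each value is.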