# Cardiovascular Disease Prediction Project

This project is for educational purposes. Here I practice machine learning skills, specifically classification algorithms.

[data source: Kaggle](https://www.kaggle.com/sulianova/cardiovascular-disease-dataset/code)


**Data Description:**

There are 3 types of input features:

* Objective: factual information;
* Examination: results of medical examination;
* Subjective: information given by the patient.

Features:

| Feature | Feature type | Column | Data type | Values |
|---|---|---|---|---|
| Age | Objective Feature | age | int (days) | |
| Height | Objective Feature | height | int (cm) | |
| Weight | Objective Feature | weight | float (kg) | |
| Gender | Objective Feature | gender | categorical code | 1: women, 2: men |
| Systolic blood pressure | Examination Feature | ap_hi | int | |
| Diastolic blood pressure | Examination Feature | ap_lo | int | |
| Cholesterol | Examination Feature | cholesterol | | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary | 0: no, 1: yes |
| Alcohol intake | Subjective Feature | alco | binary | 0: no, 1: yes |
| Physical activity | Subjective Feature | active | binary | 0: no, 1: yes |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary | 0: no, 1: yes |

All of the dataset values were collected at the moment of medical examination.

# Imports

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


# Helper Functions

In [2]:
%matplotlib inline
%pylab inline

plt.style.use('bmh')
plt.rcParams['figure.figsize'] = [16, 8]
plt.rcParams['font.size'] = 18

pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.set_option('display.expand_frame_repr', False)

sns.set()


Populating the interactive namespace from numpy and matplotlib


# Load Data

In [3]:
df1 = pd.read_csv('C:\\Users\\felip\\repos\\cardio\\Cardiovascular-disease-prediction\\data\\cardio_train.csv',
                 sep = ';', 
                 index_col = 'id')
df1.head()

| id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |


## Data Description

In [4]:
df1.dtypes

age              int64
gender           int64
height           int64
weight         float64
ap_hi            int64
ap_lo            int64
cholesterol      int64
gluc             int64
smoke            int64
alco             int64
active           int64
cardio           int64
dtype: object

In [5]:
# Checking for missing (NaN) values.

df1.isnull().sum().sum()

0

In [6]:
df1.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 70000.0 | 19468.865814 | 2467.251667 | 10798.0 | 17664.0 | 19703.0 | 21327.0 | 23713.0 |
| gender | 70000.0 | 1.349571 | 0.476838 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 |
| height | 70000.0 | 164.359229 | 8.210126 | 55.0 | 159.0 | 165.0 | 170.0 | 250.0 |
| weight | 70000.0 | 74.20569 | 14.395757 | 10.0 | 65.0 | 72.0 | 82.0 | 200.0 |
| ap_hi | 70000.0 | 128.817286 | 154.011419 | -150.0 | 120.0 | 120.0 | 140.0 | 16020.0 |
| ap_lo | 70000.0 | 96.630414 | 188.47253 | -70.0 | 80.0 | 80.0 | 90.0 | 11000.0 |
| cholesterol | 70000.0 | 1.366871 | 0.68025 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| gluc | 70000.0 | 1.226457 | 0.57227 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 |
| smoke | 70000.0 | 0.088129 | 0.283484 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| alco | 70000.0 | 0.053771 | 0.225568 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [7]:
df1.shape

(70000, 12)

# Data Cleaning

## Questions Raised by the Descriptive Statistics

**Height:** The minimum height is 55 cm and the maximum is 250 cm. Is that right? Are there patients with dwarfism and gigantism in the dataset?

In [8]:
# Checking how many height outliers there are and what share of the dataset they represent.

shorter = len(df1[df1["height"] < 130])
taller = len(df1[df1["height"] > 210])

print(f'There are {shorter} patients shorter than 130 cm'
      f' and {taller} patient(s) taller than 210 cm.'
      f'\nTogether they make up {round((taller + shorter) * 100 / len(df1), 2)}% of the dataset.')

There are 92 patients shorter than 130 cm and 1 patient(s) taller than 210 cm.
Together they make up 0.13% of the dataset.


In [9]:
# Patients shorter than 130 cm or taller than 210 cm are excluded from the dataset.

df2 = df1[df1["height"] >= 130]
df2 = df2[df2["height"] <= 210]
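The two-step height filter above can also be written as a single range check. This is a minimal sketch, assuming a small made-up DataFrame (`toy`) in place of `df1`; `Series.between` is inclusive on both ends by default, which matches the `>= 130` and `<= 210` bounds used in the cell.

```python
import pandas as pd

# Toy stand-in for df1; these heights are made up for illustration.
toy = pd.DataFrame({"height": [55, 129, 130, 165, 210, 211, 250]})

# Series.between is inclusive on both ends by default,
# so this keeps exactly the rows with 130 <= height <= 210.
kept = toy[toy["height"].between(130, 210)]

print(kept["height"].tolist())
```

Both forms should select the same rows; the single mask just avoids building an intermediate DataFrame.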

**Weight:** The minimum weight is 10 kg and the maximum is 200 kg. Is that right? Are there really patients weighing just 10 kg?

In [11]:
# Checking for outliers in the weight feature.

thinner = len(df2[df2['weight'] < 40])
print(f'There are {thinner} patients with weight under 40kg.')

There are 50 patients with weight under 40kg.


In [12]:
# Excluding patients weighing 40 kg or less (the strict `>` also drops weights of exactly 40 kg).

df2 = df2[df2['weight'] > 40]
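The count cell uses `< 40` while the filter keeps `> 40`, so a patient weighing exactly 40 kg is removed without having been counted. The sketch below illustrates the boundary case on made-up weights; `toy`, `counted` and `dropped` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Made-up weights, including one patient at exactly 40 kg.
toy = pd.DataFrame({"weight": [10.0, 39.5, 40.0, 41.0, 72.0]})

# What the count cell reports: weights strictly below 40 kg.
counted = int((toy["weight"] < 40).sum())

# What the filter actually removes: everything not strictly above 40 kg.
dropped = len(toy) - int((toy["weight"] > 40).sum())

print(f"counted as outliers: {counted}, actually dropped: {dropped}")
```

If the intent is to keep patients at exactly 40 kg, `>= 40` would make the filter agree with the count.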

**Blood pressure:** Some blood pressure values are negative. Is that possible?

- After some research, I found that a negative blood pressure reading is possible, but because these cases have little impact on the dataset, I decided to exclude them.

In [13]:
negative_ap_hi = len(df2[df2['ap_hi'] < 0])
negative_ap_lo = len(df2[df2['ap_lo'] < 0])

print(f'There are {negative_ap_hi} cases of negative ap_hi and {negative_ap_lo} cases of negative ap_lo')

There are 7 cases of negative ap_hi and 1 cases of negative ap_lo


In [14]:
# Excluding the negative blood pressure cases (the strict `>` also drops readings of exactly 0).

df2 = df2[df2['ap_hi'] > 0]
df2 = df2[df2['ap_lo'] > 0]
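Because the filter keeps only strictly positive readings, it can help to count non-positive values (negative or zero) per column before dropping them. A minimal sketch on made-up readings, assuming only the dataset's column names `ap_hi` and `ap_lo`:

```python
import pandas as pd

# Made-up readings; only the column names follow the dataset.
toy = pd.DataFrame({
    "ap_hi": [120, -150, 0, 140, 110],
    "ap_lo": [80, 90, 70, -70, 60],
})

bp_cols = ["ap_hi", "ap_lo"]

# Number of non-positive readings (negative or zero) in each column.
non_positive = (toy[bp_cols] <= 0).sum()

# Keep a row only if both readings are strictly positive.
kept = toy[(toy[bp_cols] > 0).all(axis=1)]

print(non_positive.to_dict())
print(len(kept))
```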

In [15]:
print(f'After data cleaning, {df1.shape[0] - df2.shape[0]} rows were excluded from the original dataset.')

After data cleaning, 214 rows were excluded from the original dataset.


In [16]:
df2.shape

(69786, 12)
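To see how much each cleaning rule contributes to the total number of removed rows, the filters can be applied through a small helper that records the row count before and after each step. This is only a sketch: `apply_filters` and the toy data are hypothetical, and the thresholds simply mirror the ones used in the cells above.

```python
import pandas as pd

def apply_filters(df, steps):
    """Apply named boolean filters in order and record rows removed by each.

    `steps` is a list of (label, function) pairs; each function takes a
    DataFrame and returns a boolean Series marking the rows to keep.
    """
    removed = {}
    for label, keep in steps:
        before = len(df)
        df = df[keep(df)]
        removed[label] = before - len(df)
    return df, removed

# Made-up rows for illustration.
toy = pd.DataFrame({
    "height": [165, 55, 170, 160, 180],
    "weight": [70.0, 60.0, 10.0, 65.0, 80.0],
    "ap_hi": [120, 120, 130, -150, 140],
})

steps = [
    ("height", lambda d: d["height"].between(130, 210)),
    ("weight", lambda d: d["weight"] > 40),
    ("ap_hi", lambda d: d["ap_hi"] > 0),
]

cleaned, removed = apply_filters(toy, steps)
print(removed, len(cleaned))
```

Since the filters run in sequence, each count covers only rows that survived the earlier steps, so the counts add up to the total number of removed rows.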

In [17]:
df2.describe().T 

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 69786.0 | 19469.483822 | 2466.293106 | 10798.0 | 17666.0 | 19703.0 | 21326.0 | 23713.0 |
| gender | 69786.0 | 1.349784 | 0.476905 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 |
| height | 69786.0 | 164.455765 | 7.8449 | 130.0 | 159.0 | 165.0 | 170.0 | 207.0 |
| weight | 69786.0 | 74.248019 | 14.304075 | 41.0 | 65.0 | 72.0 | 82.0 | 200.0 |
| ap_hi | 69786.0 | 128.805462 | 154.056748 | 1.0 | 120.0 | 120.0 | 140.0 | 16020.0 |
| ap_lo | 69786.0 | 96.661909 | 188.61784 | 1.0 | 80.0 | 80.0 | 90.0 | 11000.0 |
| cholesterol | 69786.0 | 1.367165 | 0.68049 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| gluc | 69786.0 | 1.226707 | 0.572547 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 |
| smoke | 69786.0 | 0.088213 | 0.283606 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| alco | 69786.0 | 0.053822 | 0.225667 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


# Data Transformation

In [18]:
df3 = df2.copy()

## Features Transformation

In [19]:
# Transforming age from days to whole years (floor division; leap days are ignored)

df3['age'] = df3['age'] // 365
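As a worked example of the conversion, the sketch below applies integer division by 365 to the five ages shown in the `df1.head()` output. The names `age_days` and `age_years` are illustrative only.

```python
import pandas as pd

# Ages in days of the first five rows shown by df1.head().
age_days = pd.Series([18393, 20228, 18857, 17623, 17474])

# Floor division truncates to whole years; leap days are ignored,
# so the result can be off by one close to a birthday.
age_years = age_days // 365

print(age_years.tolist())  # [50, 55, 51, 48, 47]
```

For these positive integer ages, `// 365` gives the same result as `int(x / 365)` while staying vectorized.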

In [20]:
df3.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 69786.0 | 52.842304 | 6.764156 | 29.0 | 48.0 | 53.0 | 58.0 | 64.0 |
| gender | 69786.0 | 1.349784 | 0.476905 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 |
| height | 69786.0 | 164.455765 | 7.8449 | 130.0 | 159.0 | 165.0 | 170.0 | 207.0 |
| weight | 69786.0 | 74.248019 | 14.304075 | 41.0 | 65.0 | 72.0 | 82.0 | 200.0 |
| ap_hi | 69786.0 | 128.805462 | 154.056748 | 1.0 | 120.0 | 120.0 | 140.0 | 16020.0 |
| ap_lo | 69786.0 | 96.661909 | 188.61784 | 1.0 | 80.0 | 80.0 | 90.0 | 11000.0 |
| cholesterol | 69786.0 | 1.367165 | 0.68049 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| gluc | 69786.0 | 1.226707 | 0.572547 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 |
| smoke | 69786.0 | 0.088213 | 0.283606 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| alco | 69786.0 | 0.053822 | 0.225667 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
