# Cardiovascular Disease Data Analysis

### This dataset contains variables that could contribute to cardiovascular disease.
### The purpose of this notebook is to examine these variables and present any correlations found through data visualizations.

The dataset is taken from https://www.kaggle.com/sulianova/cardiovascular-disease-dataset

# Exploratory Data Analysis

In [184]:
import pandas as pd
import numpy as np
import seaborn as sns

In [185]:
path = 'data/cardio_train.csv'
# The file is semicolon-delimited rather than comma-delimited
df = pd.read_csv(path, sep=';')

### Description of the different variables that will be analyzed in this notebook
- Age: in days
- Gender: 1 - women, 2 - men
- Height: cm
- Weight: kg
- ap_hi: Systolic Blood Pressure
- ap_lo: Diastolic Blood Pressure
- Cholesterol: 1 - normal, 2 - above normal, 3 - well above normal
- Gluc: 1 - normal, 2 - above normal, 3 - well above normal
- Smoke - Binary
- Alco - Binary
- Active - Binary
- Cardio - Binary

*For the binary variables 0 is no and 1 is yes*
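
Because age is stored in days, it can be easier to read once converted to years. The snippet below is a minimal sketch of one way to do that; the `age_years` column name and the 365.25 days-per-year factor are assumptions for illustration, not part of the original dataset or notebook. The three ages are taken from the first rows of the data.

```python
import pandas as pd

# Tiny illustrative frame; the ages (in days) come from the first rows of the dataset.
sample = pd.DataFrame({"age": [18393, 20228, 18857]})

# Assumption: 365.25 days per year, which averages over leap years.
DAYS_PER_YEAR = 365.25

# "age_years" is a hypothetical column name introduced for this sketch.
sample["age_years"] = (sample["age"] / DAYS_PER_YEAR).round(1)

print(sample)
```

The same expression could be applied to `df.age` directly if a years column is wanted for plotting.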


# Data Inspection

In [186]:
df.head()

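| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |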


In [187]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
id             70000 non-null int64
age            70000 non-null int64
gender         70000 non-null int64
height         70000 non-null int64
weight         70000 non-null float64
ap_hi          70000 non-null int64
ap_lo          70000 non-null int64
cholesterol    70000 non-null int64
gluc           70000 non-null int64
smoke          70000 non-null int64
alco           70000 non-null int64
active         70000 non-null int64
cardio         70000 non-null int64
dtypes: float64(1), int64(12)
memory usage: 6.9 MB


In [188]:
df.shape

(70000, 13)

In [189]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 70000.0 | 49972.4199 | 28851.302323 | 0.0 | 25006.75 | 50001.5 | 74889.25 | 99999.0 |
| age | 70000.0 | 19468.865814 | 2467.251667 | 10798.0 | 17664.0 | 19703.0 | 21327.0 | 23713.0 |
| gender | 70000.0 | 1.349571 | 0.476838 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 |
| height | 70000.0 | 164.359229 | 8.210126 | 55.0 | 159.0 | 165.0 | 170.0 | 250.0 |
| weight | 70000.0 | 74.20569 | 14.395757 | 10.0 | 65.0 | 72.0 | 82.0 | 200.0 |
| ap_hi | 70000.0 | 128.817286 | 154.011419 | -150.0 | 120.0 | 120.0 | 140.0 | 16020.0 |
| ap_lo | 70000.0 | 96.630414 | 188.47253 | -70.0 | 80.0 | 80.0 | 90.0 | 11000.0 |
| cholesterol | 70000.0 | 1.366871 | 0.68025 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| gluc | 70000.0 | 1.226457 | 0.57227 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 |
| smoke | 70000.0 | 0.088129 | 0.283484 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
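
The summary statistics show blood pressure values that cannot be real readings (for example a minimum `ap_hi` of -150 and a maximum of 16020), and such entries can distort any mean computed later. The following is a rough sketch of how those rows might be flagged. The plausibility bounds are assumptions chosen for illustration, not thresholds from the dataset documentation, and the sample rows are illustrative values that echo the extremes above.

```python
import pandas as pd

# Illustrative rows only; the extreme values echo the min/max reported by describe().
sample = pd.DataFrame({
    "ap_hi": [110, 140, -150, 16020, 120],
    "ap_lo": [80, 90, -70, 11000, 80],
})

# Assumed plausibility bounds in mmHg (hypothetical, adjust as needed).
plausible = (
    sample["ap_hi"].between(60, 250)
    & sample["ap_lo"].between(30, 200)
    & (sample["ap_hi"] > sample["ap_lo"])  # systolic should exceed diastolic
)

print(f"{(~plausible).sum()} of {len(sample)} rows flagged as implausible")

# Keep only the rows that pass every check.
cleaned = sample[plausible]
```

Whether to drop or correct flagged rows is a judgment call; this notebook keeps all rows, so the means reported below include them.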


In [190]:
# Check whether there are any NA values
df.isna().values.any()

False

In [191]:
# Are there any duplicated rows?

df.duplicated().any()

False

In [192]:
# How many inactive smokers are there in the dataset?
unactive_smokers = df.loc[(df.active == 0) & (df.smoke == 1)]
len(unactive_smokers)

1007

In [193]:
# What percent of CV patients are smokers in this dataset?

CV_pts = df.loc[df.cardio == 1]
smoke_cv_pts = CV_pts.loc[CV_pts.smoke == 1]

len(smoke_cv_pts) / len(CV_pts) * 100

8.373595585922983

In [194]:
# What percent of CV patients are inactive vs. active?
# Filter on the `active` column (the original filtered on `smoke` by mistake)
active_pt = CV_pts.loc[CV_pts.active == 1]
unactive_pt = CV_pts.loc[CV_pts.active == 0]

unactive_pct = len(unactive_pt) / len(CV_pts) * 100
active_pct = len(active_pt) / len(CV_pts) * 100

print(f'{unactive_pct:.2f} % of patients with cardiovascular disease are inactive and {active_pct:.2f} % of patients with cardiovascular disease are active')

21.04 % of patients with cardiovascular disease are inactive and 78.96 % of patients with cardiovascular disease are active


In [195]:
# What percent of inactive smokers have cardiovascular disease?

pct = len(unactive_smokers.loc[unactive_smokers.cardio == 1]) / len(unactive_smokers) * 100

print(f'{pct}% of inactive smokers have cardiovascular disease.')

55.41211519364448% of inactive smokers have cardiovascular disease.


## How do CV patients compare with patients without CV disease?

In cardiovascular patients:
 - mean systolic BP is 137.21
 - mean diastolic BP is 109.02 (likely inflated by erroneous `ap_lo` entries, such as the maximum of 11000 in the summary statistics)
 - mean cholesterol is 1.52 (where 1 is normal cholesterol)

In [196]:
CV_pts.mean()

id             50082.102233
age            20056.813031
gender             1.353441
height           164.270334
weight            76.822368
ap_hi            137.212042
ap_lo            109.023929
cholesterol        1.517396
gluc               1.277595
smoke              0.083736
alco               0.052117
active             0.789559
cardio             1.000000
dtype: float64

For patients without cardiovascular disease:
- mean systolic BP is 120.43
- mean diastolic BP is 84.25
- mean cholesterol is 1.22 (where 1 is normal cholesterol)

In [197]:
no_CV_pts = df.loc[(df.cardio == 0)]
no_CV_pts.mean()

id             49862.869107
age            18881.623711
gender             1.345707
height           164.448017
weight            71.592150
ap_hi            120.432598
ap_lo             84.251763
cholesterol        1.216527
gluc               1.175380
smoke              0.092516
alco               0.055424
active             0.817881
cardio             0.000000
dtype: float64
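
The two `.mean()` calls above can also be combined into a single side-by-side comparison. The sketch below shows one possible way to do this with `groupby`; it uses only the five rows displayed by `df.head()` as sample data, so its numbers are not the dataset-wide means.

```python
import pandas as pd

# The five rows shown by df.head(), restricted to the columns of interest.
sample = pd.DataFrame({
    "cardio": [0, 1, 1, 1, 0],
    "ap_hi": [110, 140, 130, 150, 100],
    "ap_lo": [80, 90, 70, 100, 60],
    "cholesterol": [1, 3, 3, 1, 1],
})

# One row per cardio group (0 = no CV disease, 1 = CV disease), one column per variable.
group_means = sample.groupby("cardio")[["ap_hi", "ap_lo", "cholesterol"]].mean()

print(group_means)
```

Swapping `.mean()` for `.median()` would give a comparison that is less sensitive to extreme blood pressure entries.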

In [198]:
# What percentage of patients without CV disease are smokers?

len(no_CV_pts.loc[(no_CV_pts.smoke == 1)]) / len(no_CV_pts) * 100

9.251591902001657
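
Since `smoke` is coded 0/1, its mean within a group is the share of smokers in that group, so both smoker percentages can be computed in one step. The following is a minimal sketch on made-up sample data (the eight rows are hypothetical and chosen only to make the arithmetic easy to follow).

```python
import pandas as pd

# Hypothetical rows: 1 of 4 patients without CV disease smokes, 2 of 4 with CV disease smoke.
sample = pd.DataFrame({
    "cardio": [0, 0, 0, 0, 1, 1, 1, 1],
    "smoke":  [1, 0, 0, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the fraction of 1s; multiply by 100 for a percentage.
smoker_pct = sample.groupby("cardio")["smoke"].mean() * 100

print(smoker_pct)
```

Applied to `df`, the same expression should reproduce the two smoker percentages computed separately above.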

# Major Takeaways
1. On average, patients with cardiovascular disease have higher systolic and diastolic blood pressure than patients without CV disease.
2. On average, patients with cardiovascular disease have higher levels of cholesterol and glucose than patients without CV disease.

# Interesting findings
1. In this dataset, 8.4% of CV patients were marked as smokers, while 9.3% of patients without CV disease were marked as smokers.
2. Although this is an interesting finding, it is important to note that patients were labeled only as smokers or nonsmokers; frequency of smoking is not captured in this dataset.

# Data Prep for Data Visualization
- adjust binary variables to more readable categorical variables

In [199]:
df

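| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 69995 | 99993 | 19240 | 2 | 168 | 76.0 | 120 | 80 | 1 | 1 | 1 | 0 | 1 | 0 |
| 69996 | 99995 | 22601 | 1 | 158 | 126.0 | 140 | 90 | 2 | 2 | 0 | 0 | 1 | 1 |
| 69997 | 99996 | 19066 | 2 | 183 | 105.0 | 180 | 90 | 3 | 1 | 0 | 1 | 0 | 1 |
| 69998 | 99998 | 22431 | 1 | 163 | 72.0 | 135 | 80 | 1 | 2 | 0 | 0 | 0 | 1 |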


In [200]:
def binary_to_gender(x):
    if x == 1:
        return 'female'
    if x == 2:
        return 'male'

In [201]:
df.gender = df.gender.apply(binary_to_gender)
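
As an alternative to a helper function with `apply`, a dictionary passed to `Series.map` can do the same recoding, and the same pattern extends to the 0/1 columns. This is only a sketch on a small sample frame; the `"no"`/`"yes"` labels and the `binary_cols` list are assumptions for illustration, and the notebook itself recodes only `gender`.

```python
import pandas as pd

# Small sample frame using the dataset's coding (gender: 1/2, binary columns: 0/1).
sample = pd.DataFrame({
    "gender": [2, 1, 1],
    "smoke": [0, 0, 1],
    "cardio": [0, 1, 1],
})

# Recode gender with a lookup dictionary instead of a helper function.
sample["gender"] = sample["gender"].map({1: "female", 2: "male"})

# Hypothetical extension: recode the 0/1 columns to readable labels.
binary_cols = ["smoke", "cardio"]
for col in binary_cols:
    sample[col] = sample[col].map({0: "no", 1: "yes"})

print(sample)
```

Note that `map` returns `NaN` for any value missing from the dictionary, which makes unexpected codes easy to spot.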

In [202]:
df.head()

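| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | male | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | female | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | female | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | male | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | female | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |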


In [207]:
df.to_csv('clean_data.csv', index=False)