# Cardiovascular Disease Dataset

We will work with a dataset on cardiovascular disease.

We'll use it to explore concepts such as:

- true (population) means,
- confidence intervals,
- one-sample t test,
- independent-samples t test,
- homogeneity of variance check (Levene's test),
- one-way ANOVA,
- chi-square test.

Dataset from: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset

# Data Preparation

⭐ Import pandas, scipy.stats, seaborn, and matplotlib.pyplot libraries

In [1]:
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

⭐Run the following code to read in the "cardio.csv" file.

In [2]:
df = pd.read_csv("cardio.csv", sep=";")

In [3]:
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0


In [14]:
df.shape

(70000, 13)

In [10]:
df0= df.sample(500).copy()
df0

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
22186,31700,16872,1,158,85.0,130,80,1,1,0,0,1,1
13908,19843,23190,1,159,60.0,130,80,2,1,0,0,1,1
60366,86191,20628,1,162,74.0,120,80,1,1,0,0,1,0
65003,92776,19694,2,176,88.0,140,100,1,1,0,0,1,1
6757,9635,22732,1,162,72.0,140,80,1,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
47143,67329,19535,1,153,51.0,120,80,1,1,0,0,1,1
35565,50798,22829,1,164,89.0,120,80,3,3,0,0,1,1
33435,47764,22682,1,170,72.0,120,80,1,1,0,0,1,1
15525,22182,19818,2,172,81.0,130,95,1,1,0,1,1,0


In [9]:
df0.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 35558 to 60411
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           500 non-null    int64  
 1   age          500 non-null    int64  
 2   gender       500 non-null    int64  
 3   height       500 non-null    int64  
 4   weight       500 non-null    float64
 5   ap_hi        500 non-null    int64  
 6   ap_lo        500 non-null    int64  
 7   cholesterol  500 non-null    int64  
 8   gluc         500 non-null    int64  
 9   smoke        500 non-null    int64  
 10  alco         500 non-null    int64  
 11  active       500 non-null    int64  
 12  cardio       500 non-null    int64  
dtypes: float64(1), int64(12)
memory usage: 54.7 KB


In [21]:
df0.drop(columns="id", inplace=True)

In [22]:
df0.shape

(500, 12)

In [23]:
df0.describe()

Unnamed: 0,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
count,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0
mean,19384.544,1.34,164.174,74.146,127.03,94.124,1.368,1.246,0.072,0.064,0.81,0.512
std,2484.54214,0.474183,8.301419,13.924528,18.489695,110.625642,0.69393,0.605158,0.258747,0.244998,0.392694,0.500357
min,14354.0,1.0,96.0,45.0,12.0,60.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,17466.5,1.0,159.0,65.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0
50%,19753.0,1.0,164.0,72.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,1.0
75%,21225.5,2.0,169.0,82.0,140.0,90.0,1.0,1.0,0.0,0.0,1.0,1.0
max,23657.0,2.0,187.0,129.0,200.0,1120.0,3.0,3.0,1.0,1.0,1.0,1.0


⭐Let's remove the outliers; moreover, blood pressure cannot be negative!

In [28]:
# Drop rows outside the 2.5th-97.5th percentile range, first for ap_hi, then for ap_lo on the trimmed sample
df0.drop(df0[(df0['ap_hi'] > df0['ap_hi'].quantile(0.975)) | (df0['ap_hi'] < df0['ap_hi'].quantile(0.025))].index,inplace=True)
df0.drop(df0[(df0['ap_lo'] > df0['ap_lo'].quantile(0.975)) | (df0['ap_lo'] < df0['ap_lo'].quantile(0.025))].index,inplace=True)

df0.head()

Unnamed: 0,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
22186,16872,1,158,85.0,130,80,1,1,0,0,1,1
13908,23190,1,159,60.0,130,80,2,1,0,0,1,1
60366,20628,1,162,74.0,120,80,1,1,0,0,1,0
65003,19694,2,176,88.0,140,100,1,1,0,0,1,1
6757,22732,1,162,72.0,140,80,1,1,0,0,0,1


## Task-1. Is the Population Mean of Systolic Blood Pressure 122 mm Hg?

`ap_hi` is the systolic blood pressure, i.e. the pressure exerted when blood is ejected into the arteries. Normal value: 122 mm Hg for all adults aged 18 and over.

⭐What is the mean for Systolic blood pressure?

In [32]:
mean= df0["ap_hi"].mean()
mean

125.49574468085106

⭐What is the standard deviation for Systolic blood pressure?

In [31]:
std = df0["ap_hi"].std()
std

14.012900809722336

⭐What is the standard error of the mean for Systolic blood pressure?

In [34]:
import numpy as np

# Standard error of the mean: divide by the size of the trimmed sample df0, not the full df
stde = std / np.sqrt(len(df0))
stde

⭐What are the descriptive statistics for Systolic blood pressure?

In [35]:
# describe() is a Series method, so call it on the column rather than on the scalar mean
df0["ap_hi"].describe()

## Confidence Interval using the t Distribution

Key Notes about Confidence Intervals

💡A point estimate is a single number.

💡A confidence interval, naturally, is an interval.

💡Confidence intervals are the typical way to present estimates as an interval range.

💡The point estimate is located exactly in the middle of the confidence interval.

💡Compared with a point estimate, a confidence interval provides much more information and is preferred when making inferences.

💡The more data you have, the less variable a sample estimate will be.

💡The lower the level of confidence you can tolerate, the narrower the confidence interval will be.

⭐Investigate the given task by calculating the confidence interval. (Use 90%, 95% and 99% CIs)
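
One way to approach this is to build the interval by hand from the t distribution. The sketch below is only illustrative: `mean_confidence_interval` and `report_intervals` are hypothetical helper names, not part of the notebook, and the reference value of 122 comes from Task-1.

```python
import numpy as np
from scipy import stats


def mean_confidence_interval(values, confidence=0.95):
    """Return (lower, upper) t-based confidence bounds for the mean."""
    x = np.asarray(values, dtype=float)
    n = x.size
    mean = x.mean()
    sem = x.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    # Critical t value for a two-sided interval with n - 1 degrees of freedom
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    margin = t_crit * sem
    return mean - margin, mean + margin


def report_intervals(values, levels=(0.90, 0.95, 0.99), reference=122):
    """Print each interval and whether it contains the reference value."""
    for level in levels:
        low, high = mean_confidence_interval(values, level)
        inside = low <= reference <= high
        print(f"{level:.0%} CI: ({low:.2f}, {high:.2f}) contains {reference}: {inside}")
```

In this notebook the helper could be called as `report_intervals(df0["ap_hi"])`. If 122 falls outside an interval, that is evidence at the corresponding confidence level that the population mean differs from 122 mm Hg.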

## One Sample t Test

⭐Investigate the given task by using One Sample t Test.

Key Notes about Hypothesis Testing (Significance Testing)

💡Assumptions

💡Null and Alternative Hypothesis

💡Test Statistic

💡P-value

💡Conclusion

Conduct the significance test. Use scipy.stats.ttest_1samp
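
A minimal sketch of the test, assuming a two-sided alternative and a significance level of 0.05 (neither is fixed by the task). The wrapper name `one_sample_t_test` is hypothetical; the underlying call is `scipy.stats.ttest_1samp`.

```python
from scipy import stats


def one_sample_t_test(values, popmean=122, alpha=0.05):
    """Two-sided one-sample t test of H0: population mean == popmean."""
    result = stats.ttest_1samp(values, popmean=popmean)
    # Reject H0 when the p-value is below the chosen significance level
    reject_h0 = bool(result.pvalue < alpha)
    return float(result.statistic), float(result.pvalue), reject_h0
```

In this notebook it could be called as `one_sample_t_test(df0["ap_hi"], popmean=122)`. The conclusion should agree with the 95% confidence interval: H0 is rejected at alpha = 0.05 exactly when 122 lies outside that interval.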

## Task-2. Is There a Significant Difference Between Males and Females in Systolic Blood Pressure?

H0: µ1 = µ2 ("the two population means are equal")

H1: µ1 ≠ µ2 ("the two population means are not equal")

⭐Show descriptives for 2 groups
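
Group descriptives can be sketched with a `groupby` followed by `describe`. The helper name `group_descriptives` is hypothetical; the column names `gender` and `ap_hi` are the ones shown in the data above.

```python
import pandas as pd


def group_descriptives(frame, group_col, value_col):
    """Return count, mean, std and quartiles of value_col for each group."""
    return frame.groupby(group_col)[value_col].describe()
```

In this notebook it could be called as `group_descriptives(df0, "gender", "ap_hi")`, which returns one row per gender code.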

🚀Test the assumption of homogeneity of variance. Hint: Levene’s test

The hypotheses for Levene’s test are:

H0: "the population variances of group 1 and 2 are equal"

H1: "the population variances of group 1 and 2 are not equal"
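
A possible sketch of Levene's test, assuming a significance level of 0.05 and the median-centered variant (SciPy's default). The wrapper name `levene_equal_variance` is hypothetical.

```python
from scipy import stats


def levene_equal_variance(group_a, group_b, alpha=0.05):
    """Levene's test of H0: the two population variances are equal."""
    result = stats.levene(group_a, group_b, center="median")
    # Failing to reject H0 means the equal-variance assumption is tenable
    variances_equal = bool(result.pvalue >= alpha)
    return float(result.statistic), float(result.pvalue), variances_equal
```

In this notebook the two groups could be selected as `df0.loc[df0["gender"] == 1, "ap_hi"]` and `df0.loc[df0["gender"] == 2, "ap_hi"]`. The returned flag can then decide the `equal_var` argument of the t test.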

🚀Conduct the significance test. Use scipy.stats.ttest_ind

H0: µ1 = µ2 ("the two population means are equal")

H1: µ1 ≠ µ2 ("the two population means are not equal")
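
A minimal sketch of the independent samples t test, assuming a two-sided alternative and alpha = 0.05. The wrapper name `two_sample_t_test` is hypothetical. Setting `equal_var=False` switches `scipy.stats.ttest_ind` to Welch's t test, which does not assume equal variances.

```python
from scipy import stats


def two_sample_t_test(group_a, group_b, equal_var=True, alpha=0.05):
    """Two-sided independent samples t test of H0: mu1 == mu2."""
    result = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
    reject_h0 = bool(result.pvalue < alpha)
    return float(result.statistic), float(result.pvalue), reject_h0
```

In this notebook the inputs could again be the `ap_hi` values for each gender code, with `equal_var` chosen according to the outcome of Levene's test.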

## Task-3. Is There a Relationship Between Glucose and Systolic Blood Pressure?

⭐Draw a boxplot to see the relationship.

⭐Show the descriptive statistics of 3 groups.

⭐Conduct the relevant statistical test to see if there is a significant difference between the means of the groups.
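
With three glucose groups, one candidate test is a one-way ANOVA. The sketch below assumes that choice; the helper name `one_way_anova` is hypothetical, and it relies on `scipy.stats.f_oneway`.

```python
import pandas as pd
from scipy import stats


def one_way_anova(frame, group_col, value_col):
    """One-way ANOVA of H0: all group means of value_col are equal."""
    # Collect the values of each group into its own array
    groups = [g[value_col].to_numpy() for _, g in frame.groupby(group_col)]
    result = stats.f_oneway(*groups)
    return float(result.statistic), float(result.pvalue)
```

In this notebook it could be called as `one_way_anova(df0, "gluc", "ap_hi")`. A small p-value suggests that at least one group mean differs, without saying which one.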

## Task-4. Is There a Relationship Between Physical Activity and the Presence or Absence of Cardiovascular Disease?

### Physical activity vs. Presence or absence of cardiovascular disease

⭐Create a crosstab using Pandas.

⭐Conduct a chi-square test to see whether there is a relationship between the 2 categorical variables.
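
Both steps can be sketched together: build the contingency table with `pd.crosstab`, then pass it to `scipy.stats.chi2_contingency`. The helper name `chi_square_independence` is hypothetical. Note that SciPy applies Yates' continuity correction by default for 2x2 tables.

```python
import pandas as pd
from scipy import stats


def chi_square_independence(frame, row_col, col_col):
    """Chi-square test of H0: row_col and col_col are independent."""
    # Observed frequencies for every combination of the two categories
    table = pd.crosstab(frame[row_col], frame[col_col])
    chi2, p_value, dof, expected = stats.chi2_contingency(table)
    return table, float(chi2), float(p_value), int(dof)
```

In this notebook it could be called as `chi_square_independence(df0, "active", "cardio")`. A small p-value suggests that physical activity and cardiovascular disease status are not independent in the population.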