# Purpose:
# EDA (exploratory data analysis) of the factors that increase the risk of heart attack

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
data = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')

In [None]:
data.head()

In [None]:
data.isnull().sum()

In [None]:
data.describe()

The data appears to be already cleaned: there are no null or missing values.


    age : age of the patient

    sex : sex of the patient

    exng : exercise-induced angina (1 = yes; 0 = no)

    caa : number of major vessels (0-3)

    cp : chest pain type
        Value 1: typical angina
        Value 2: atypical angina
        Value 3: non-anginal pain
        Value 4: asymptomatic
        (heart.csv stores these as 0-3; the plots below treat 0 as typical angina through 3 as asymptomatic)

    trtbps : resting blood pressure (in mm Hg)

    chol : cholesterol in mg/dl fetched via BMI sensor

    fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

    restecg : resting electrocardiographic results
        Value 0: normal
        Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
        Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

    thalachh : maximum heart rate achieved

    output : 0 = less chance of heart attack; 1 = more chance of heart attack


In [None]:
f, ax = plt.subplots(1,2,figsize=(18,8))
data['output'].value_counts().plot.pie(explode=[0,0.1],autopct = '%1.1f%%',ax=ax[0], shadow=True)
ax[0].set_title('At Risk')
ax[0].set_ylabel('')
sns.countplot(x='output',data=data,ax=ax[1])
ax[1].set_title('At Risk')
plt.show()


In this sample, the share of people labeled at risk of a heart attack (output = 1) is about 9 percentage points higher than the share labeled not at risk.
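To put a number on the class balance instead of reading it off the pie chart, here is a minimal sketch. The `toy` frame is a made-up stand-in for the `output` column (it is not the real data); on the notebook's `data` the same two lines with `value_counts` would apply.

```python
import pandas as pd

# Hypothetical stand-in for data['output']; the real values come from heart.csv.
toy = pd.DataFrame({'output': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]})

# Share of each class as a percentage of all rows.
share = toy['output'].value_counts(normalize=True) * 100

# Gap between the two classes, in percentage points.
gap_pp = share[1] - share[0]

print(share.round(1))
print(f'Gap: {gap_pp:.1f} percentage points')
```

The gap is a difference between two shares of the same sample, so it is expressed in percentage points rather than as a relative increase.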

Let's explore the features and see which ones are associated with being at risk.

Categorical Features: sex, exng (exercise-induced angina), fbs (fasting blood sugar)

Ordinal Features: caa (number of major vessels), cp (chest pain), restecg

Continuous Features: age, trtbps (resting blood pressure), chol (cholesterol), thalachh (max heart rate)

# Categorical Feature: sex

In [None]:
f,ax = plt.subplots(1,1,figsize=(5,5))
sns.countplot(x='sex',data=data,ax=ax)
plt.show()

In [None]:
f,ax = plt.subplots(1,2,figsize=(18,8))
data[['sex','output']].groupby(['sex']).mean().plot.bar(ax=ax[0])
ax[0].set_title('At Risk vs Sex')
sns.countplot(x='sex',hue='output',data=data,ax=ax[1])
ax[1].set_title('Sex: At Risk vs Not At Risk')
plt.show()

So the group coded as sex = 0 has the smaller sample size, yet roughly 75% of it is at risk of a heart attack. That rate is nearly 30 percentage points higher than the rate for sex = 1.
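Since the two groups differ in size, it helps to look at the group size and the at-risk rate side by side. The sketch below does this on a small made-up frame (`toy` and its values are hypothetical, chosen only to make the arithmetic easy to follow); the same `groupby` would work on `data`.

```python
import pandas as pd

# Hypothetical toy sample: 4 rows with sex = 0 and 8 rows with sex = 1.
toy = pd.DataFrame({
    'sex':    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    'output': [1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
})

# Group size and at-risk rate (mean of a 0/1 column) per sex.
by_sex = toy.groupby('sex')['output'].agg(n='count', at_risk_rate='mean')

# Absolute gap in percentage points vs. relative ratio of the two rates.
diff_pp = (by_sex.loc[0, 'at_risk_rate'] - by_sex.loc[1, 'at_risk_rate']) * 100
ratio = by_sex.loc[0, 'at_risk_rate'] / by_sex.loc[1, 'at_risk_rate']

print(by_sex)
print(f'Difference: {diff_pp:.0f} percentage points, ratio: {ratio:.2f}x')
```

Reporting both numbers avoids the ambiguity of a bare "X% more likely", which could mean either the absolute gap or the relative ratio.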

# Categorical Feature: exng (exercise-induced angina)

In [None]:
f,ax = plt.subplots(1,1,figsize=(5,5))
sns.countplot(x='exng',data=data,ax=ax)
plt.show()

In [None]:
f,ax = plt.subplots(1,1,figsize=(5,5))
sns.countplot(x='exng',hue='sex',data=data,ax=ax)

In [None]:
f,ax = plt.subplots(1,1,figsize=(5,5))
data[['exng','output']].groupby(['exng']).mean().plot.bar(ax=ax)
ax.set_title('At Risk vs exng')
plt.show()

About 70% of people without exercise-induced angina are at risk of a heart attack, compared with about 23% of people who do suffer from it. So in this sample, exercise-induced angina does not appear to add risk.
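The rates read off the bar chart can also be tabulated directly with a row-normalized crosstab. This is only a sketch on made-up values (`toy` is hypothetical); with the notebook's frame the call would be `pd.crosstab(data['exng'], data['output'], normalize='index')`.

```python
import pandas as pd

# Hypothetical toy sample: 10 rows without exng, 5 rows with it.
toy = pd.DataFrame({
    'exng':   [0] * 10 + [1] * 5,
    'output': [1] * 7 + [0] * 3 + [1] * 1 + [0] * 4,
})

# normalize='index' makes every row sum to 1, so each cell is the share of
# that exng group falling into each output class.
rates = pd.crosstab(toy['exng'], toy['output'], normalize='index')

print(rates.round(2))
```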

# Categorical Feature: fbs(fasting blood sugar)

High fasting blood sugar usually indicates diabetes. A value of 1 means the person has fasting blood sugar greater than 120 mg/dl, which indicates diabetes or prediabetes.

In [None]:
f,ax = plt.subplots(1,1,figsize=(5,5))
sns.countplot(x='fbs',data=data,ax=ax)
plt.show()

In [None]:
f, ax = plt.subplots(1,1, figsize=(5,5))
data[['fbs','output']].groupby(['fbs']).mean().plot.bar(ax=ax)
ax.set_title('fbs vs At Risk')
plt.show()

The group with high fasting blood sugar is significantly smaller than the group without it, yet both groups show a greater than 50% at-risk rate, so high blood sugar should still be considered as a factor.
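Because the two groups differ so much in size, the rate for the smaller group is also the less certain one. One way to make that visible is a confidence interval for each proportion. The sketch below uses the Wilson score interval; the helper name `wilson_interval` and the counts passed to it are illustrative assumptions, not values taken from the notebook output.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    if n <= 0:
        raise ValueError('n must be positive')
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


# Illustrative counts only: a small group and a larger group with similar rates.
small = wilson_interval(23, 45)
large = wilson_interval(142, 258)

print('small group: %.2f to %.2f' % small)
print('large group: %.2f to %.2f' % large)
```

The interval for the smaller group comes out wider, which is a reminder that a similar bar height does not carry the same weight of evidence in both groups.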

Over time, high blood sugar can damage blood vessels and the nerves that control the heart, which can lead to heart disease.

# Ordinal Feature: caa (number of major vessels)

In [None]:
pd.crosstab(data.caa,data.output,margins=True).style.background_gradient(cmap='summer_r')

In [None]:
sns.catplot(x='caa',y='output',hue='sex',data=data,kind='point')
plt.show()

# Ordinal Feature: cp (chest pain)

In [None]:
pd.crosstab(data.cp,data.output,margins=True).style.background_gradient(cmap='summer_r')

In [None]:
sns.catplot(x='cp',y='output',hue='sex',data=data,kind='point')
plt.show()

The more serious the angina/chest pain, the greater the chance of being at risk of a heart attack. Gender 0 also has the greater chance of a heart problem when chest pain occurs.

In [None]:
f,ax = plt.subplots(3,figsize=(15,15))
sns.countplot(x='cp',data=data,ax=ax[0])
ax[0].set_title('No. of Chest Pain Variants')
sns.countplot(x='cp',hue='sex',data=data,ax=ax[1])
ax[1].set_title('Gender Split per Chest Pain Variant')
sns.countplot(x='cp',hue='output',data=data,ax=ax[2])
ax[2].set_title('Chest Pain vs At Risk')

Even at chest pain type 1, the share of people at risk of a heart attack increases significantly.

# Ordinal Feature: restecg (resting electrocardiographic results)

In [None]:
pd.crosstab(data.restecg,data.output,margins=True).style.background_gradient(cmap='summer_r')

In [None]:
sns.catplot(x='restecg',y='output',hue='sex',data=data,kind='point')
plt.show()

In [None]:
f,ax = plt.subplots(3,figsize=(15,15))
sns.countplot(x='restecg',data=data,ax=ax[0])
ax[0].set_title('No. of ecg result types')
sns.countplot(x='restecg',hue='sex',data=data,ax=ax[1])
ax[1].set_title('Gender Split per ecg result type')
sns.countplot(x='restecg',hue='output',data=data,ax=ax[2])
ax[2].set_title('Ecg Result Type vs At Risk')

Having an ST-T wave abnormality seems to be a significant factor in being more at risk.

# Continuous Feature: age

In [None]:
print('Oldest Patient is:',data['age'].max(),'Years')
print('Youngest Patient is:',data['age'].min(),'Years')
print('Average Patient Age is',data['age'].mean(),'Years')

In [None]:
f, ax = plt.subplots(1,2,figsize = (18,8))
sns.violinplot(x='cp',y='age',hue='output',data=data,split=True,ax=ax[0])
ax[0].set_title('Chest Pain and Age vs At Risk')
sns.violinplot(x='sex',y='age',hue='output',data=data,split=True,ax=ax[1])
ax[1].set_title('Sex and Age vs At Risk')

Observations:

- The younger you are, the lower your risk of a heart attack.

- The more serious the chest pain, the greater the chance of being at risk as you get older (50-70 age range).

- Gender 0 is significantly less at risk than Gender 1 with increasing age.

In [None]:
f,ax = plt.subplots(1,2,figsize=(20,10))
data[data['output'] == 0].age.plot.hist(ax=ax[0],bins=20,edgecolor='black',color='red')
ax[0].set_title('At Risk = 0')
x1 = list(range(35,80,5))
ax[0].set_xticks(x1)
data[data['output'] == 1].age.plot.hist(ax=ax[1],color='blue',bins=20,edgecolor='black')
ax[1].set_title('At Risk = 1')
x2 = list(range(30,80,5))
ax[1].set_xticks(x2)
plt.show()

Observations:

- Most people in the not-at-risk graph are in the 55-65 age range (this may just mean the sample skews older).

- The number of people at risk rises sharply from the 40s onward.

- Young people (under 40) have the lowest risk of a heart attack.
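The two histograms show counts, and the two classes have different totals, so the bars are hard to compare directly. A rough alternative is to bin age and compute the at-risk rate per bin. The sketch below does this on made-up values (the `toy` frame, the bin edges and the labels are all assumptions for illustration and say nothing about the real data).

```python
import pandas as pd

# Hypothetical toy sample, not the real heart.csv values.
toy = pd.DataFrame({
    'age':    [29, 35, 38, 42, 47, 51, 55, 58, 62, 67],
    'output': [0,  1,  0,  1,  0,  1,  1,  0,  0,  1],
})

# Left-closed bins: [0, 40), [40, 50), [50, 60), [60, 120).
bins = [0, 40, 50, 60, 120]
labels = ['<40', '40-49', '50-59', '60+']
toy['age_group'] = pd.cut(toy['age'], bins=bins, labels=labels, right=False)

# Group size and at-risk rate per age band.
summary = toy.groupby('age_group', observed=True)['output'].agg(
    n='count', at_risk_rate='mean'
)

print(summary)
```

Keeping `n` next to the rate matters here as well: an age band with only a handful of people can show an extreme rate by chance.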

# Continuous Feature: trtbps (resting blood pressure)

In [None]:
sns.scatterplot(x='age',y='trtbps',data=data).set_title('Age vs Resting Blood Pressure')
plt.show()

In [None]:
f, ax = plt.subplots(1,2,figsize = (18,8))
sns.regplot(x='age',y='trtbps',data=data[data['output'] == 0],ax=ax[0])
ax[0].set_title('Not At Risk, Age vs Resting Blood Pressure')
sns.regplot(x='age',y='trtbps',data=data[data['output'] == 1],ax=ax[1])
ax[1].set_title('At Risk, Age vs Resting Blood Pressure')

Observations:

- As you get older, your resting blood pressure increases.

- No solid differences between the at-risk and not-at-risk groups.

# Continuous Feature: chol (cholesterol)

In [None]:
sns.regplot(x='age',y='chol',data=data).set_title('Age vs Cholesterol')
plt.show()

In [None]:
f,ax = plt.subplots(1,4,figsize=(20,8))
# histplot with kde=True and stat='density' replaces the deprecated distplot
sns.histplot(data[data['cp'] == 0].chol,kde=True,stat='density',ax=ax[0])
ax[0].set_title('Cholesterol for Typical Angina')
sns.histplot(data[data['cp'] == 1].chol,kde=True,stat='density',ax=ax[1])
ax[1].set_title('Cholesterol for Atypical Angina')
sns.histplot(data[data['cp'] == 2].chol,kde=True,stat='density',ax=ax[2])
ax[2].set_title('Cholesterol for Non-Anginal Pain')
sns.histplot(data[data['cp'] == 3].chol,kde=True,stat='density',ax=ax[3])
ax[3].set_title('Cholesterol for Asymptomatic')

Observations:

- As age increases, cholesterol increases.

- Cholesterol levels over 200 mg/dL are usually considered borderline high, and 240 mg/dL or more is considered high. For most of the chest pain types, the cholesterol distribution peaks in that borderline-high to high range.
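To count how many people fall into each cholesterol band rather than eyeballing the peaks, the thresholds can be applied with `pd.cut`. This is a sketch on made-up values (the `chol` series below and the band labels are assumptions for illustration); on the notebook's frame the input would be `data['chol']`.

```python
import pandas as pd

# Hypothetical cholesterol readings in mg/dL, not the real data.
chol = pd.Series([180, 199, 200, 239, 240, 310], name='chol')

# Left-closed bands: below 200, 200 up to (not including) 240, 240 and above.
bands = pd.cut(
    chol,
    bins=[0, 200, 240, float('inf')],
    labels=['desirable', 'borderline high', 'high'],
    right=False,
)

print(pd.concat([chol, bands.rename('band')], axis=1))
```

Using `right=False` puts the boundary values 200 and 240 into the upper band of each pair, which matches the "200 or more" and "240 or more" reading of the thresholds.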

# Continuous Feature: thalachh (max heart rate)

In [None]:
sns.regplot(x='age',y='thalachh',data=data).set_title('Age vs Max Heart Rate')
plt.show()

In [None]:
f,ax = plt.subplots(1,4,figsize=(20,8))
# histplot with kde=True and stat='density' replaces the deprecated distplot
sns.histplot(data[data['cp'] == 0].thalachh,kde=True,stat='density',ax=ax[0])
ax[0].set_title('Max Heart Rate for Typical Angina')
sns.histplot(data[data['cp'] == 1].thalachh,kde=True,stat='density',ax=ax[1])
ax[1].set_title('Max Heart Rate for Atypical Angina')
sns.histplot(data[data['cp'] == 2].thalachh,kde=True,stat='density',ax=ax[2])
ax[2].set_title('Max Heart Rate for Non-Anginal Pain')
sns.histplot(data[data['cp'] == 3].thalachh,kde=True,stat='density',ax=ax[3])
ax[3].set_title('Max Heart Rate for Asymptomatic')

Observations:

- Max heart rate is considerably higher with increasing chest pain.

# Correlation Between Features

In [None]:
sns.heatmap(data.corr(),annot=True,cmap='RdYlGn',linewidths = 0.2)
fig=plt.gcf()
fig.set_size_inches(10,8)
plt.show()

Observations:

- Among the features we observed, the strongest correlations are around +/- 0.44.

- Being at risk of a heart attack (output) correlates most strongly with chest pain (cp) and max heart rate (thalachh), at 0.43/0.44.

- All features can be used, since none are correlated strongly enough with each other to indicate multicollinearity (features carrying largely the same information, which adds redundancy to a model).
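Instead of scanning the heatmap by eye, the feature pairs can be ranked by absolute correlation. The sketch below runs on synthetic columns (`a`, `b`, `c` are made up, with `b` built as a near copy of `a`), and the 0.8 cutoff is an assumed rule of thumb rather than a fixed standard.

```python
import numpy as np
import pandas as pd

# Synthetic data: b is deliberately almost a copy of a, c is independent noise.
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
toy = pd.DataFrame({
    'a': a,
    'b': 0.9 * a + rng.normal(scale=0.1, size=n),
    'c': rng.normal(size=n),
})

# Absolute correlations, keeping only the upper triangle so that each pair
# appears once and the diagonal (always 1.0) is dropped.
corr = toy.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

# Assumed rule-of-thumb cutoff for "too correlated".
flagged = pairs[pairs > 0.8]

print(pairs.round(2))
print('Flagged pairs:', list(flagged.index))
```

On the notebook's frame the same steps would start from `data.corr()`; with the strongest correlations around 0.44, nothing would pass a cutoff of this size.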

In [None]:
data.drop(['slp','caa','thall','oldpeak'],axis=1,inplace = True)
sns.heatmap(data.corr(),annot=True,cmap='RdYlGn',linewidths= 0.2,annot_kws={'size':20})
fig=plt.gcf()
fig.set_size_inches(18,15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

Exploratory Observations:

What increases the chance of a heart attack?

- Being 40 years old or older

- Having had some degree of chest pain, with risk increasing as the pain gets more serious

- Having an unusually high max heart rate