<h1 align="center">Heart Attack Analysis and Prediction</h1>

## Dataset Attributes
`age` - Age of the patient

`sex` - Sex of the patient

`cp` - Chest pain type ~ 0 = Typical Angina, 1 = Atypical Angina, 2 = Non-anginal Pain, 3 = Asymptomatic

`trtbps` - Resting blood pressure (in mm Hg)

`chol` - Cholesterol in mg/dl fetched via BMI sensor

`fbs` - (fasting blood sugar > 120 mg/dl) ~ 1 = True, 0 = False

`restecg` - Resting electrocardiographic results ~ 0 = Normal, 1 = ST-T wave abnormality, 2 = Left ventricular hypertrophy

`thalachh` - Maximum heart rate achieved

`oldpeak` - Previous peak (ST depression induced by exercise relative to rest)

`slp` - Slope of the peak exercise ST segment

`caa` - Number of major vessels

`thall` - Thallium stress test result ~ values 0 to 3

`exng` - Exercise induced angina ~ 1 = Yes, 0 = No

`output` - Target variable
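
The integer codes listed above can be mapped to readable labels while exploring the data. The following is a minimal sketch on a hypothetical three-row sample; `sample`, `CP_LABELS`, and `EXNG_LABELS` are names introduced here for illustration and are not part of the notebook. The mappings themselves are copied from the attribute list.

```python
import pandas as pd

# Code-to-label mappings copied from the attribute list above.
CP_LABELS = {
    0: "Typical Angina",
    1: "Atypical Angina",
    2: "Non-anginal Pain",
    3: "Asymptomatic",
}
EXNG_LABELS = {1: "Yes", 0: "No"}

# Hypothetical sample rows; the real values come from heart.csv.
sample = pd.DataFrame({"cp": [3, 2, 1], "exng": [0, 0, 1]})

# Add human-readable columns next to the encoded ones.
decoded = sample.assign(
    cp_label=sample["cp"].map(CP_LABELS),
    exng_label=sample["exng"].map(EXNG_LABELS),
)
print(decoded)
```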

## Packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

### Reading CSV Data

In [2]:
df = pd.read_csv("/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv")

## Data Preprocessing

### Basic Information About the Data

In [3]:
print("The shape of the dataset is : ", df.shape)

The shape of the dataset is :  (303, 14)


In [4]:
df.head()

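| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |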


In [16]:
ncon_cols = ['sex','exng','caa','cp','fbs','restecg','slp','thall']
con_cols = ["age","trtbps","chol","thalachh","oldpeak"]
target_col = ["output"]
print("Non-continuous columns : ", ncon_cols)
print("Continuous columns : ", con_cols)
print("The target variable is :  ", target_col)

Non-continuous columns :  ['sex', 'exng', 'caa', 'cp', 'fbs', 'restecg', 'slp', 'thall']
Continuous columns :  ['age', 'trtbps', 'chol', 'thalachh', 'oldpeak']
The target variable is :   ['output']


### Checking for Missing Values

In [6]:
df.isnull().sum()

age         0
sex         0
cp          0
trtbps      0
chol        0
fbs         0
restecg     0
thalachh    0
exng        0
oldpeak     0
slp         0
caa         0
thall       0
output      0
dtype: int64

### Printing the correlation matrix for continuous columns

In [7]:
df_corr = df[con_cols].corr().transpose()
df_corr

| | age | trtbps | chol | thalachh | oldpeak |
|---|---|---|---|---|---|
| age | 1.0 | 0.279351 | 0.213678 | -0.398522 | 0.210013 |
| trtbps | 0.279351 | 1.0 | 0.123174 | -0.046698 | 0.193216 |
| chol | 0.213678 | 0.123174 | 1.0 | -0.00994 | 0.053952 |
| thalachh | -0.398522 | -0.046698 | -0.00994 | 1.0 | -0.344187 |
| oldpeak | 0.210013 | 0.193216 | 0.053952 | -0.344187 | 1.0 |
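
To read the matrix more easily, the pairs can be ranked by the absolute value of their coefficient. The sketch below re-enters the values printed above by hand (in the notebook, `df_corr` could be used directly); the names `corr` and `ranked` are introduced here for illustration.

```python
from itertools import combinations

import pandas as pd

cols = ["age", "trtbps", "chol", "thalachh", "oldpeak"]

# Correlation values copied from the matrix printed above.
corr = pd.DataFrame(
    [
        [1.0, 0.279351, 0.213678, -0.398522, 0.210013],
        [0.279351, 1.0, 0.123174, -0.046698, 0.193216],
        [0.213678, 0.123174, 1.0, -0.00994, 0.053952],
        [-0.398522, -0.046698, -0.00994, 1.0, -0.344187],
        [0.210013, 0.193216, 0.053952, -0.344187, 1.0],
    ],
    index=cols,
    columns=cols,
)

# Visit each unordered pair once (the matrix is symmetric) and rank by |r|.
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(cols, 2)}
ranked = sorted(pairs.items(), key=lambda item: abs(item[1]), reverse=True)

for (a, b), r in ranked[:3]:
    print(f"{a:>8} vs {b:<8} r = {r:+.3f}")
```

The strongest relationship in this matrix is the negative one between `age` and `thalachh`, and even that is moderate at best.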


### Conclusions from Data Preprocessing with RapidMiner

- There are no NULL values in the data, so no imputation (for example, replacing missing entries with the mean) is needed.
- The data contains very few outliers. (Checked using RapidMiner.)
- There is no strong linear correlation between the continuous variables; the largest coefficient in magnitude is about -0.40, between `age` and `thalachh`. (Checked using the RapidMiner correlation matrix.)
- Intuitively, older people might be expected to have a higher chance of a heart attack, but the data suggests otherwise: most heart attacks occurred in middle-aged people.
- Even though the dataset contains many more male records than female records, the numbers of positive male and female records are almost equal.
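
The outlier check above was done in RapidMiner, and its exact method is not stated. As a rough pandas counterpart, one common convention flags values lying more than 1.5 × IQR outside the quartiles; this rule is an assumption here, not necessarily what RapidMiner applied. The helper `iqr_outlier_counts` and the toy values are hypothetical.

```python
import pandas as pd


def iqr_outlier_counts(frame, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (frame[columns] < q1 - k * iqr) | (frame[columns] > q3 + k * iqr)
    return outside.sum()


# Toy data (not from heart.csv): one extreme chol value, no extreme ages.
toy = pd.DataFrame(
    {
        "chol": [200, 210, 220, 230, 240, 564],
        "age": [40, 45, 50, 55, 60, 65],
    }
)
counts = iqr_outlier_counts(toy, ["chol", "age"])
print(counts)
```

In the notebook, the same helper could be called as `iqr_outlier_counts(df, con_cols)`.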

### Importing sklearn packages

In [33]:

from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

print('Packages imported...')

Packages imported...



## Scaling and Encoding the Features

In [9]:

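# Work on a copy so the original DataFrame stays untouched
df1 = df.copy()

cat_cols = ['sex','exng','caa','cp','fbs','restecg','slp','thall']
con_cols = ["age","trtbps","chol","thalachh","oldpeak"]
# One-hot encode the categorical columns, dropping the first level of each
df1 = pd.get_dummies(df1, columns = cat_cols, drop_first = True)

# Separate the features from the target
X = df1.drop(['output'],axis=1)
y = df1[['output']]

# Scale the continuous columns using their median and interquartile range
scaler = RobustScaler()
X[con_cols] = scaler.fit_transform(X[con_cols])
print("The first 5 rows of X are")
X.head()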

The first 5 rows of X are


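| | age | trtbps | chol | thalachh | oldpeak | sex_1 | exng_1 | caa_1 | caa_2 | caa_3 | ... | cp_2 | cp_3 | fbs_1 | restecg_1 | restecg_2 | slp_1 | slp_2 | thall_1 | thall_2 | thall_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.592593 | 0.75 | -0.110236 | -0.092308 | 0.9375 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | -1.333333 | 0.0 | 0.15748 | 1.046154 | 1.6875 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | -1.037037 | 0.0 | -0.566929 | 0.584615 | 0.375 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0.074074 | -0.5 | -0.062992 | 0.769231 | 0.0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4 | 0.148148 | -0.5 | 1.795276 | 0.307692 | -0.125 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |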
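
With its default settings, `RobustScaler` subtracts each column's median and divides by its interquartile range, which keeps a single extreme value from dominating the scale. The sketch below checks this on a toy column (the array `x` is made up for illustration) by comparing the scaler's output with the manual formula.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one extreme value (not taken from heart.csv).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Manual version of the default behaviour: (x - median) / (Q3 - Q1).
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Here the median is 3 and the interquartile range is 2, so the bulk of the values land between -1 and 0.5 while the extreme value stays clearly separated.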


### Splitting the data

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 42)
print("The shape of X_train is      ", X_train.shape)
print("The shape of X_test is       ",X_test.shape)
print("The shape of y_train is      ",y_train.shape)
print("The shape of y_test is       ",y_test.shape)

The shape of X_train is       (242, 22)
The shape of X_test is        (61, 22)
The shape of y_train is       (242, 1)
The shape of y_test is        (61, 1)
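
The 242/61 split follows from the arguments: with `test_size = 0.2` and 303 rows, the test set gets 61 rows (0.2 × 303 = 60.6, rounded up) and the training set the remaining 242. The sketch below reproduces those sizes on placeholder arrays and also shows the optional `stratify` argument, which the notebook does not use; the alternating labels in `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the notebook's X (303 rows, 22 columns).
X_toy = np.zeros((303, 22))
y_toy = np.arange(303) % 2  # made-up alternating labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
# With stratify, the class balance of the test set tracks the full data.
print(round(y_toy.mean(), 3), round(y_te.mean(), 3))
```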


## Implementing the Models

### Decision Tree

In [35]:
dt = DecisionTreeClassifier(random_state = 42)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
print("The test accuracy score of Decision Tree is ", accuracy_score(y_test, y_pred))
confusion_matrix(y_test, y_pred)

The test accuracy score of Decision Tree is  0.7868852459016393


array([[25,  4],
       [ 9, 23]])
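
The confusion matrix carries more information than the accuracy alone. In scikit-learn's layout, rows are true classes and columns are predicted classes, so the matrix above unpacks into true negatives, false positives, false negatives, and true positives. A small worked example using the numbers printed above:

```python
import numpy as np

# Decision Tree confusion matrix printed above (rows: true, columns: predicted).
cm = np.array([[25, 4],
               [9, 23]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 48 / 61
precision = tp / (tp + fp)        # 23 / 27
recall = tp / (tp + fn)           # 23 / 32

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
```

The accuracy matches the printed score of 0.7869, and the recall shows that 9 of the 32 positive cases in the test set were missed.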

### Random Forest

In [38]:

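rf = RandomForestClassifier()
rf.fit(X_train, y_train)
# Predict with the fitted random forest, not the decision tree
y_pred = rf.predict(X_test)
print("The test accuracy score of Random Forest is ", accuracy_score(y_test, y_pred))
confusion_matrix(y_test, y_pred)

The output originally recorded here (accuracy 0.7868852459016393, confusion matrix [[25, 4], [9, 23]]) is identical to the Decision Tree result because the predictions were taken from `dt` rather than `rf`. The corrected cell must be re-run to obtain the Random Forest score.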

### Support Vector Machine

In [37]:

clf = SVC(kernel='linear', C=1, random_state=42).fit(X_train,y_train)

y_pred = clf.predict(X_test)

print("The test accuracy score of SVM is ", accuracy_score(y_test, y_pred))
confusion_matrix(y_test, y_pred)

The test accuracy score of SVM is  0.8688524590163934


array([[26,  3],
       [ 5, 27]])
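
A single 61-row test set gives a fairly noisy accuracy estimate. `cross_val_score` is imported earlier but never called; the sketch below shows how it could complement the test score for the linear SVM. It runs on synthetic data from `make_classification` (an assumption made so the example is self-contained), so the numbers it prints say nothing about the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train (not the heart data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# 5-fold cross-validated accuracy for the same model settings as above.
scores = cross_val_score(SVC(kernel="linear", C=1), X_demo, y_demo, cv=5)

print("Fold accuracies:", scores.round(3))
print("Mean accuracy  :", round(scores.mean(), 3))
```

In the notebook, the equivalent call would pass `X_train` and `y_train.values.ravel()` instead of the synthetic arrays.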

#### Using Hyperparameter Tuning

In [36]:

svm = SVC()

parameters = {"C":np.arange(1,10,1),'gamma':[0.00001,0.00005, 0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.5,1,5]}

searcher = GridSearchCV(svm, parameters)

searcher.fit(X_train, y_train)
print("The best params are :", searcher.best_params_)
print("The best score is   :", searcher.best_score_)
y_pred = searcher.predict(X_test)
print("The test accuracy score of SVM after hyper-parameter tuning is ", accuracy_score(y_test, y_pred))
confusion_matrix(y_test, y_pred)

The best params are : {'C': 3, 'gamma': 0.1}
The best score is   : 0.8384353741496599
The test accuracy score of SVM after hyper-parameter tuning is  0.9016393442622951


array([[26,  3],
       [ 3, 29]])
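
The grid above covers 9 values of `C` and 12 values of `gamma`, that is 108 combinations, each evaluated by cross-validation. Because the notebook scales the features before splitting, every fold sees scaling statistics computed from the full dataset. One possible variation, sketched below on synthetic data with a much smaller grid, puts the scaler inside a `Pipeline` so that it is re-fitted on the training part of each fold. The step names `scale` and `svc` and the reduced grid are choices made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the heart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is a pipeline step, so each CV fold fits it on its own training part.
pipe = Pipeline([("scale", RobustScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"svc__C": [1, 3, 10], "svc__gamma": [0.01, 0.1, 1]}

searcher = GridSearchCV(pipe, param_grid, cv=5)
searcher.fit(X_tr, y_tr)

test_score = searcher.score(X_te, y_te)
print("Best params  :", searcher.best_params_)
print("Best CV score:", round(searcher.best_score_, 3))
print("Test accuracy:", round(test_score, 3))
```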