# Heart Disease Prediction

## Algorithm: Logistic Regression
Dataset: https://www.kaggle.com/ronitf/heart-disease-uci

In [1]:
import numpy as np
import pandas as pd

Load the dataset. The file "heart.csv" is expected in the same folder as this notebook. If you saved the downloaded dataset somewhere else, pass the full path instead.

In [2]:
data = pd.read_csv("heart.csv")

This is what the first few rows of the data look like.

In [3]:
data.head()

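| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |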


In [4]:
print("The dataset contains {} rows and {} columns".format(data.shape[0], data.shape[1]))

The dataset contains 303 rows and 14 columns


### Data Analysis

Let us look at each column (feature) and check which categories or range of values it contains. The column we will be predicting is "target".

In [5]:
# Age
print("Age ranges from: {} to {}".format(data['age'].min(), data['age'].max()))

Age ranges from: 29 to 77


In [6]:
# Sex
print("Categories in column Sex:", data['sex'].unique())

Categories in column Sex: [1 0]


In [7]:
# Chest pain type
print("Categories in column cp:", data['cp'].unique())

Categories in column cp: [3 2 1 0]


In [8]:
# Resting blood pressure
print("Range in trestbps column: {} to {}".format(data['trestbps'].min(), data['trestbps'].max()))

Range in trestbps column: 94 to 200


In [9]:
# Serum cholesterol in mg/dl
print("Range in chol column: {} to {}".format(data['chol'].min(), data['chol'].max()))

Range in chol column: 126 to 564


In [10]:
# Fasting blood sugar > 120 mg/dl
print("Categories in fbs column:", data['fbs'].unique())

Categories in fbs column: [1 0]


In [11]:
# Resting electrocardiographic results
print("Categories in column restecg:", data['restecg'].unique())

Categories in column restecg: [0 1 2]


In [12]:
# Maximum heart rate achieved
print("Range in column thalach: {} to {}".format(data['thalach'].min(), data['thalach'].max()))

Range in column thalach: 71 to 202


In [13]:
# Exercise induced angina
print("Categories in exang column:", data['exang'].unique())

Categories in exang column: [0 1]


In [14]:
# Oldpeak = ST depression induced by exercise relative to rest
print("Range in column oldpeak: {} to {}".format(data['oldpeak'].min(), data['oldpeak'].max()))

Range in column oldpeak: 0.0 to 6.2


In [15]:
# the slope of the peak exercise ST segment
print("Categories in slope column:", data['slope'].unique())

Categories in slope column: [0 2 1]


In [16]:
# Number of major vessels (0-3) colored by fluoroscopy
print("Range in column ca: {} to {}".format(data['ca'].min(), data['ca'].max()))

Range in column ca: 0 to 4


In [17]:
# thal: Thallium stress test result
print("Categories in thal column:", data['thal'].unique())

Categories in thal column: [1 2 3 0]


In [18]:
# target
print("Categories in target column:", data['target'].unique())

Categories in target column: [1 0]
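
The per-column checks above can also be condensed into a single pass. The following is a minimal sketch, not part of the original notebook: `summarize_columns` and its `max_categories` threshold are hypothetical names, and the small `sample` frame only reproduces a few columns of the five rows shown by `data.head()`. Columns with few distinct values are treated as categorical, the rest are reported as a range.

```python
import pandas as pd

def summarize_columns(df, max_categories=5):
    """Return {column: (kind, info)}: categories for low-cardinality columns, else (min, max)."""
    summary = {}
    for col in df.columns:
        values = df[col]
        if values.nunique() <= max_categories:
            # Few distinct values: list them like the unique() cells above
            summary[col] = ("categories", sorted(values.unique().tolist()))
        else:
            # Many distinct values: report the range like the min()/max() cells above
            summary[col] = ("range", (values.min(), values.max()))
    return summary

# Stand-in data: four columns of the first five rows shown by data.head()
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "sex": [1, 1, 0, 1, 0],
    "cp": [3, 2, 1, 1, 0],
    "chol": [233, 250, 204, 236, 354],
})

result = summarize_columns(sample, max_categories=4)
for col, (kind, info) in result.items():
    print(col, kind, info)
```

On the full dataset you would call `summarize_columns(data)` instead. The threshold is a judgment call: with the default of 5, every categorical column in this dataset (at most four distinct values each) would be listed by category, and so would "ca", which has at most five distinct values.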


### Split features and labels

The dataset has 14 columns. "target" is the column we will predict, so it is called the label and is separated from the rest of the dataset.

After this step, X holds the remaining 13 columns, called "features", and y holds the single "target" column, called the "label".

In [19]:
y = data['target']
X = data.drop(['target'], axis=1)

### Scaling

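Min-max scaling maps each feature linearly onto the range 0 to 1, so that the column minimum becomes 0 and the column maximum becomes 1. Putting all features on a comparable scale helps the logistic regression solver converge and keeps large-valued columns such as "chol" from dominating the regularized fit, so we apply this step here.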

In [20]:
numeric_features = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 
                    'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']

In [21]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Work on a copy so the unscaled X is left untouched
X_scaled = X.copy()
X_scaled[numeric_features] = scaler.fit_transform(X_scaled[numeric_features])

In [22]:
X_scaled.head()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.708333 | 1.0 | 1.0 | 0.481132 | 0.244292 | 1.0 | 0.0 | 0.603053 | 0.0 | 0.370968 | 0.0 | 0.0 | 0.333333 |
| 1 | 0.166667 | 1.0 | 0.666667 | 0.339623 | 0.283105 | 0.0 | 0.5 | 0.885496 | 0.0 | 0.564516 | 0.0 | 0.0 | 0.666667 |
| 2 | 0.25 | 0.0 | 0.333333 | 0.339623 | 0.178082 | 0.0 | 0.0 | 0.770992 | 0.0 | 0.225806 | 1.0 | 0.0 | 0.666667 |
| 3 | 0.5625 | 1.0 | 0.333333 | 0.245283 | 0.251142 | 0.0 | 0.5 | 0.816794 | 0.0 | 0.129032 | 1.0 | 0.0 | 0.666667 |
| 4 | 0.583333 | 0.0 | 0.0 | 0.245283 | 0.520548 | 0.0 | 0.5 | 0.70229 | 1.0 | 0.096774 | 1.0 | 0.0 | 0.666667 |
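
To see where these numbers come from, the scaled values can be reproduced by hand with the min-max formula `(x - min) / (max - min)`. The sketch below is an illustration only: it uses the five ages from `data.head()` plus the age minimum (29) and maximum (77) reported earlier as a stand-in for the full column, and the variable names are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the age column: first five rows, plus the column min (29) and max (77)
ages = np.array([63, 37, 41, 56, 57, 29, 77], dtype=float).reshape(-1, 1)

# Manual min-max scaling: (x - min) / (max - min)
manual = (ages - ages.min()) / (ages.max() - ages.min())

# The same transformation done by scikit-learn
scaled = MinMaxScaler().fit_transform(ages)

print(np.round(manual[:5].ravel(), 6))
print(np.allclose(manual, scaled))
```

For the first row, (63 - 29) / (77 - 29) = 34 / 48 ≈ 0.708333, which matches the scaled age in the output above.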


### Split training and testing set

Now the dataset is split into training and testing sets in a 70:30 ratio, which is set by passing 0.3 as the test_size parameter. You can use a different test size if you like.

The dataset is also shuffled to break up any ordering that may be present in the file.

A random state (any number works; I am using 10) is set so that the shuffle produces the same training and testing sets every time. This makes sure repeated runs of this program give the same results.

In [23]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, shuffle=True, random_state=10)
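
The effect of `random_state` and `test_size` can be checked on a small experiment. This is a sketch on synthetic stand-in arrays (`X_demo` and `y_demo` are made-up names with the same number of rows as the dataset, 303), not on the heart data itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset (303)
X_demo = np.arange(303 * 2).reshape(303, 2)
y_demo = np.arange(303) % 2

# Two splits with the same random_state
split_a = train_test_split(X_demo, y_demo, test_size=0.3, shuffle=True, random_state=10)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, shuffle=True, random_state=10)

# A third split with a different random_state
split_c = train_test_split(X_demo, y_demo, test_size=0.3, shuffle=True, random_state=11)

same_seed_equal = np.array_equal(split_a[1], split_b[1])
other_seed_equal = np.array_equal(split_a[1], split_c[1])
n_train, n_test = len(split_a[0]), len(split_a[1])

print("Same seed gives same test rows:", same_seed_equal)
print("Different seed gives same test rows:", other_seed_equal)
print("Train rows:", n_train, "Test rows:", n_test)
```

With 303 rows and test_size=0.3, scikit-learn rounds the test set up to 91 rows and leaves 212 for training. If the two classes were unbalanced, passing `stratify=y` to `train_test_split` would keep the class proportions the same in both sets; the notebook does not do this.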

### Model fitting and prediction

In [24]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
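
In the steps above, the scaler was fitted on the whole dataset before the split, so the test rows contribute to the minimum and maximum used for scaling. A common alternative, which this notebook does not use, is to chain the scaler and the model in a pipeline so that the scaler is fitted on the training rows only. The sketch below assumes synthetic data from `make_classification` as a stand-in for the heart dataset, so its predictions say nothing about the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in: 303 rows and 13 features, like the heart dataset
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# Split first, on the unscaled data
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=10)

# fit() learns the scaling from the training rows only, then fits the model
pipeline = make_pipeline(MinMaxScaler(), LogisticRegression(random_state=0))
pipeline.fit(X_tr, y_tr)

# predict() reuses the training-set scaling on the test rows
demo_pred = pipeline.predict(X_te)

# predict_proba() returns one probability per class for each test row
demo_proba = pipeline.predict_proba(X_te)
print(demo_pred[:5], demo_proba[:5].round(3))
```

`predict_proba` is useful when you want to choose your own decision threshold instead of the default of 0.5 that `predict` applies for two classes.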

### Calculate accuracy score

In [25]:
from sklearn.metrics import accuracy_score
# accuracy_score expects the true labels first, then the predictions
score = accuracy_score(y_test, y_pred)
print("Accuracy of the model is:", score)

Accuracy of the model is: 0.7802197802197802


This model gives about 78% accuracy on the testing set, which corresponds to 71 of the 91 test samples being classified correctly.
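
Accuracy alone does not show which kind of mistake the model makes. A confusion matrix separates false positives from false negatives, which matters for a medical prediction. The sketch below uses small made-up label arrays (`y_true_demo`, `y_pred_demo`) purely to show the mechanics; on the notebook's data you would pass `y_test` and `y_pred` instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only (1 = disease, 0 = no disease)
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of correct predictions: (tp + tn) / total
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:", acc)
```

Here three positives and three negatives are predicted correctly, with one false positive and one false negative, so the accuracy is 6 / 8 = 0.75.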