This notebook works with a new dataset, heart_data.csv, which provides characteristics such as gender, weight, height, smoking status, and activity level for predicting the risk of cardiovascular disease.

These characteristics are used to fit a logistic regression model that aims to predict whether or not someone is at risk of cardiovascular disease.
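
As a rough picture of what logistic regression does with these characteristics: it takes a weighted sum of the features and squashes it through the sigmoid function to get a probability, then applies a threshold. The sketch below uses made-up weights and feature values (not fitted to heart_data.csv), and the helper name `predict_risk` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    # Maps any real number to the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_risk(features, weights, bias, threshold=0.5):
    # Weighted sum of the features, turned into a probability, then a 0/1 label.
    z = np.dot(features, weights) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)

# Made-up numbers purely for illustration.
weights = np.array([0.8, -0.3])
bias = -0.1
features = np.array([1.0, 0.5])

prob, label = predict_risk(features, weights, bias)
print(prob, label)
```

Training the model amounts to finding the weights and bias that make these probabilities match the known labels as closely as possible.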

In [74]:
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
data = pd.read_csv("heart_data.csv")
data = data.drop(columns = ['index', 'id'])
data["cardio"] = data["cardio"].astype("category", copy = False)
data["age"] = data["age"]/365 #days to years

#if unexplained issues arise, try dropping the last 3 boolean columns
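
Before moving on, it can help to confirm that the cleaning did what was intended. The sketch below is one possible sanity check, run on a tiny made-up frame (`toy`) rather than on heart_data.csv; the values are invented, and only the column names `age` and `cardio` come from the notebook.

```python
import pandas as pd

# Tiny invented stand-in for the real data; ages are given in days.
toy = pd.DataFrame({
    "age": [18250, 21900, 14600, 16425],
    "weight": [62.0, 85.0, 64.0, 82.0],
    "cardio": [0, 1, 1, 1],
})

toy["age"] = toy["age"] / 365  # days to years
toy["cardio"] = toy["cardio"].astype("category")

# Count missing values per column and check how balanced the target is.
missing_per_column = toy.isna().sum()
positive_share = (toy["cardio"] == 1).mean()

print(missing_per_column)
print(positive_share)
```

A strongly unbalanced target would make plain accuracy a misleading metric, so this share is worth knowing before training.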

After cleaning the data, separate the predictors from the target

In [76]:
#target value is what we're guessing -- cardio
y = data["cardio"]

#predictors support guessing y
x = data.drop(columns = "cardio")

In [93]:
#standardizing the predictors (zero mean, unit variance) so that no variable dominates just because of its scale
#note: fitting the scaler on all rows lets test-set statistics influence the scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
x_scaled


array([[-0.43606151,  1.36405487,  0.44345206, ..., -0.31087913,
        -0.23838436,  0.49416711],
       [ 0.30768633, -0.73310834, -1.01816804, ..., -0.31087913,
        -0.23838436,  0.49416711],
       [-0.24799666, -0.73310834,  0.07804703, ..., -0.31087913,
        -0.23838436, -2.02360695],
       ...,
       [-0.16328642,  1.36405487,  2.27047718, ..., -0.31087913,
         4.19490608, -2.02360695],
       [ 1.20058905, -0.73310834, -0.16555632, ..., -0.31087913,
        -0.23838436, -2.02360695],
       [ 0.43414373, -0.73310834,  0.68705541, ..., -0.31087913,
        -0.23838436,  0.49416711]])
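
One caveat with scaling the full dataset is that the test rows contribute to the means and standard deviations. A common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below shows that pattern on synthetic data (`x_demo`, `y_demo` are invented stand-ins, not the heart data); it is an alternative ordering, not what this notebook runs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors and target.
rng = np.random.default_rng(42)
x_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = (x_demo[:, 0] + rng.normal(scale=5.0, size=200) > 50.0).astype(int)

# Split first, so the test rows stay unseen while the scaler is fitted.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=42
)

# The pipeline fits StandardScaler on the training rows only, then reuses
# those same means and standard deviations whenever it sees new rows.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(x_tr, y_tr)

print(model.score(x_te, y_te))
```

With a pipeline, new rows can be passed to `model.predict` unscaled, since the scaling step is applied automatically.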

In [82]:
from sklearn.model_selection import train_test_split #simple way to split data into training and testing sets

xtrain, xtest, ytrain, ytest = train_test_split(x_scaled, y, test_size = 0.3, random_state = 42)

actual training yay!!!

In [83]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression() #an instance of the regression class
lr.fit(xtrain, ytrain) #the actual training

ypred = lr.predict(xtest) #testing the now-trained model
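
Because the predictors are standardized, the size of each fitted coefficient gives a rough sense of how strongly that variable pushes the prediction. The sketch below shows one way to rank coefficients, using synthetic data with invented column names (`strong_signal`, `pure_noise`); it does not reproduce the coefficients of the model trained above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: one column drives the target, the other is unrelated.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "strong_signal": rng.normal(size=500),
    "pure_noise": rng.normal(size=500),
})
target = (demo["strong_signal"] + 0.3 * rng.normal(size=500) > 0).astype(int)

scaled = StandardScaler().fit_transform(demo)
clf = LogisticRegression().fit(scaled, target)

# coef_ has shape (1, n_features) for a binary problem; label it by column.
coefs = pd.Series(clf.coef_[0], index=demo.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

For the notebook's model, the equivalent would be pairing `lr.coef_[0]` with `x.columns`, assuming the column order was not changed by scaling.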

In [85]:
#evaluating the model

from sklearn.metrics import classification_report

report = classification_report(ytest, ypred) #precision, recall and f1 per class, plus overall accuracy
print(report)

              precision    recall  f1-score   support

           0       0.70      0.76      0.73     10461
           1       0.74      0.68      0.71     10539

    accuracy                           0.72     21000
   macro avg       0.72      0.72      0.72     21000
weighted avg       0.72      0.72      0.72     21000
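
The precision and recall columns come straight from the counts of correct and incorrect predictions. In the report above, a recall of 0.68 for class 1 means roughly a third of the at-risk people in the test set were predicted as not at risk. The sketch below shows the arithmetic on ten hand-written labels (invented for illustration, not taken from the test set).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Ten invented labels: four true negatives-or-false-positives, six actual positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Precision: of everyone predicted positive, how many really were.
precision = tp / (tp + fp)
# Recall: of everyone actually positive, how many were caught.
recall = tp / (tp + fn)

print(tn, fp, fn, tp)
print(precision, recall)
```

For a screening task like this one, missed positives (low recall on class 1) are usually the more costly error, so that number deserves the most attention.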



Using the model to predict specific hypothetical instances

In [103]:
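#one row in the same column order as x, keeping the column names the scaler was fitted with
single_x = pd.DataFrame([[45.6383561644,1,157,72.0,150,30,1,1,0,0,1]], columns = x.columns)
single_x = scaler.transform(single_x)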

print(lr.predict(single_x))
#taken from row 420 (with modified age), this individual did in fact have high cardiovascular risk!

[1]
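
A bare 0 or 1 hides how confident the model is. `predict_proba` returns the estimated probability of each class, which is often more useful for a single person. The sketch below illustrates this on synthetic data; the training frame, the rule generating the labels, and the example person are all invented, with only the column names `age` and `weight` borrowed from the dataset description.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Invented training data: heavier rows are more likely to be labelled 1.
rng = np.random.default_rng(1)
train = pd.DataFrame({
    "age": rng.uniform(30, 65, size=300),
    "weight": rng.normal(75, 12, size=300),
})
labels = (train["weight"] + rng.normal(0, 10, size=300) > 80).astype(int)

sc = StandardScaler().fit(train)
clf = LogisticRegression().fit(sc.transform(train), labels)

# A single hypothetical person, built with the same column names and order.
person = pd.DataFrame([[45.6, 110.0]], columns=train.columns)
proba = clf.predict_proba(sc.transform(person))[0]

# classes_ gives the label that each probability column refers to.
print(dict(zip(clf.classes_, proba)))
```

In this notebook the equivalent call would be `lr.predict_proba(single_x)`, using the already scaled `single_x`.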


