# Multiple Linear Regression Baseline Model

In this project, we use a linear regression model as the baseline against which more advanced machine-learning models are compared. First, we import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import csv

We load the imputed training set from Google Drive to train the linear regression model.

In [3]:
# Rewrite the Google Drive share link as a direct-download link that pandas can read
url = 'https://drive.google.com/file/d/1S7em_UuD7yOiN4JMq__mlNDfcwV6NCYU/view'
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_csv(url)
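The link rewriting in this cell can be wrapped in a small helper so that the same logic is reused for every file. This is only a sketch: the function name `drive_share_to_direct` is hypothetical, and it assumes the share link has the `.../file/d/<id>/view` shape used in this notebook.

```python
def drive_share_to_direct(share_url: str) -> str:
    """Turn a Drive '.../file/d/<id>/view' share link into a 'uc?id=<id>' link."""
    # The file id is the second-to-last path segment of the share link.
    file_id = share_url.split('/')[-2]
    return 'https://drive.google.com/uc?id=' + file_id


# FILE_ID is a placeholder, not a real file.
print(drive_share_to_direct('https://drive.google.com/file/d/FILE_ID/view'))
```

The returned string can then be passed to `pd.read_csv` in the same way as `url` is in the cell.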

We separate the training set into the target feature (whether the case has heart disease) and the other predictive features.

In [4]:
y = df['HeartDiseaseorAttack']
x = df.drop(['HeartDiseaseorAttack'], axis=1)
#print(y)
#print(x)

We first check how well a linear regression on each feature individually explains heart disease, reporting the coefficient of determination (R²) returned by the model's score method. We use the sklearn implementation of linear regression, which supports both simple (single-feature) linear regression and multiple linear regression (which we use afterwards).

In [5]:
for column in x:
    # sklearn expects a 2-D feature array, so reshape the column to (n_samples, 1)
    xi = x[column].values.reshape(-1, 1)
    model = LinearRegression().fit(xi, y)
    r_sq = model.score(xi, y)  # R^2 on the training data
    print(f"{column}: {r_sq}")

HighBP: 0.029299416338264606
HighChol: 6.931231870122012e-06
CholCheck: 0.0025641724307252645
BMI: 0.0031652373435376857
Smoker: 0.0034158924119184197
Stroke: 0.010310748518839619
Diabetes: 0.02663870374081989
PhysActivity: 5.754127579793433e-05
Fruits: 8.16016018689858e-06
Veggies: 0.0001743827047662183
HvyAlcoholConsump: 0.00014304306761603325
AnyHealthcare: 0.0006809096216109367
NoDocbcCost: 5.0752932300279774e-05
GenHlth: 0.05979163435135426
MentHlth: 7.116011862196014e-05
PhysHlth: 0.009439931188026773
DiffWalk: 0.009522540045494576
Sex: 0.004964691413729683
Age: 0.048122100926456746
Education: 0.006813084089474342
Income: 4.970479972055131e-08
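The scores printed above are R² values. For a simple linear regression with an intercept, R² equals the squared Pearson correlation between the feature and the target, which gives a quick way to sanity-check them. The sketch below illustrates this on synthetic stand-in data (the arrays `feature` and `target` are assumptions, not the project's dataset).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not the project's dataset):
# one binary feature and a binary target loosely related to it.
feature = rng.integers(0, 2, size=1000)
target = (rng.random(1000) < 0.05 + 0.10 * feature).astype(int)

xi = feature.reshape(-1, 1)
r_sq = LinearRegression().fit(xi, target).score(xi, target)

# Pearson correlation between the feature and the target.
corr = np.corrcoef(feature, target)[0, 1]

print(r_sq, corr ** 2)  # the two values agree up to floating-point error
```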


We then fit a multiple linear regression on all of the predictive features together, modelling heart disease as a linear function of every other feature. The printed value is the R² of this model on the training data.

In [6]:
model = LinearRegression().fit(x,y)
r_sq = model.score(x, y)
print(r_sq)

0.12012609349462
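To make explicit what this number measures, the sketch below recomputes R² by hand as 1 - SS_res / SS_tot and compares it with the value returned by `score`. It uses synthetic stand-in data; the names `X`, `y_demo` and `demo_model` are assumptions chosen for the illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic stand-in data (an assumption): three features, noisy linear target.
X = rng.normal(size=(500, 3))
y_demo = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=500)

demo_model = LinearRegression().fit(X, y_demo)
y_hat = demo_model.predict(X)

ss_res = np.sum((y_demo - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total sum of squares
manual_r_sq = 1 - ss_res / ss_tot

print(manual_r_sq, demo_model.score(X, y_demo))  # the two values should match
```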


We then load the test set from Google Drive and use the multiple linear regression model fitted on the training set to generate predictions from the test set's predictive features. The predictions are saved to LinearRegProb.csv.

In [7]:
url='https://drive.google.com/file/d/1qkbZ7ZctdRZXez9FPPHNavVRfXGD8PSf/view'
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
test = pd.read_csv(url)
indep = test.drop(['HeartDiseaseorAttack'], axis=1)
predictions = model.predict(indep)

np.savetxt("LinearRegProb.csv", predictions, delimiter=",")
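Linear regression output is unbounded, so predictions for a 0/1 target can fall below 0 or above 1. If the saved values are meant to be read as probabilities (an assumption suggested only by the file name), one option is to clip them into [0, 1] before saving. The sketch below shows this on synthetic stand-in data; `X_train`, `y_train` and `X_test` are hypothetical and are not the project's files.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Synthetic stand-in data (an assumption): a binary target driven by one feature.
X_train = rng.normal(size=(400, 1))
y_train = (X_train[:, 0] + rng.normal(size=400) > 1.0).astype(int)
# A wider spread in the test features makes out-of-range predictions likely.
X_test = rng.normal(size=(100, 1)) * 3

raw = LinearRegression().fit(X_train, y_train).predict(X_test)
clipped = np.clip(raw, 0.0, 1.0)  # force every value into [0, 1]

print(raw.min(), raw.max())
print(clipped.min(), clipped.max())
```

Clipping changes only the values outside the interval and leaves all others untouched. Whether it is appropriate depends on how the saved predictions are used downstream.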