# Multiple Linear Regression Baseline Model

In this project, a linear regression model serves as the baseline against which more advanced machine-learning models are compared. First, we import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import csv

We load the imputed training set, which is used to fit the linear regression model.

In [2]:
df = pd.read_csv('imputed_train_90_10.csv')

We split the training set into the target variable (whether the case has heart disease) and the remaining predictive features.

In [3]:
y = df['HeartDiseaseorAttack']
x = df.drop(['HeartDiseaseorAttack'], axis=1)
#print(y)
#print(x)

We first fit a separate linear regression between heart disease and each feature individually, and report the R² score of each fit. We use the scikit-learn implementation of linear regression, which handles both simple (single-feature) linear regression and multiple linear regression (which we use in the next step).

In [4]:
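for column in x:
    # scikit-learn expects a 2-D feature array, so reshape the single column
    xi = x[column].values.reshape(-1, 1)
    model = LinearRegression().fit(xi, y)
    # score() returns the coefficient of determination (R²) on the given data
    r_sq = model.score(xi, y)
    print(f"{column}: {r_sq}")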

HighBP: 0.02929941633826416
HighChol: 6.931231870122012e-06
CholCheck: 0.0025641724307253755
BMI: 0.0031652373435379078
Smoker: 0.003415892411918753
Stroke: 0.010310748518839619
Diabetes: 0.026638703740820002
PhysActivity: 5.7541275798045355e-05
Fruits: 8.160160187342669e-06
Veggies: 0.00017438270476599627
HvyAlcoholConsump: 0.00014304306761669938
AnyHealthcare: 0.0006809096216112698
NoDocbcCost: 5.075293230005773e-05
GenHlth: 0.05979163435135448
MentHlth: 7.116011862184912e-05
PhysHlth: 0.009439931188027217
DiffWalk: 0.009522540045494576
Sex: 0.004964691413729572
Age: 0.04812210092645641
Education: 0.006813084089474342
Income: 4.970479972055131e-08
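
The scores listed above are coefficients of determination (R²). For a regression with one predictor and an intercept, R² equals the squared Pearson correlation between that predictor and the target, which gives a quick way to sanity-check the numbers. The sketch below illustrates this on synthetic data; the arrays `feature` and `target` are made up for the example and are not taken from the project's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins (assumption): one numeric feature and a binary target
rng = np.random.default_rng(0)
feature = rng.normal(size=500)
target = (feature + rng.normal(scale=2.0, size=500) > 1.0).astype(float)

# Fit a single-feature regression and read off R²
demo_model = LinearRegression().fit(feature.reshape(-1, 1), target)
r_sq = demo_model.score(feature.reshape(-1, 1), target)

# Squared Pearson correlation between the feature and the target
pearson_r = np.corrcoef(feature, target)[0, 1]

print(r_sq, pearson_r ** 2)  # the two values agree up to floating-point error
```

A low single-feature R² therefore means only that the linear correlation with the target is weak; it says nothing about non-linear relationships or about how the feature behaves in combination with others.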


We then fit a multiple linear regression of heart disease on all the other features together.

In [5]:
model = LinearRegression().fit(x,y)
r_sq = model.score(x, y)
print(r_sq)

0.12012609349461989
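
The value printed above is the R² on the training data, so the combined model accounts for about 12% of the variance in the target. As a rough illustration of what `score` computes, the sketch below reproduces R² from its definition, 1 - SS_res / SS_tot, on synthetic data; `X_demo`, `y_demo`, and the coefficients used to generate them are hypothetical and serve only the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins (assumption): three features and a noisy linear target
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.0]) + rng.normal(size=200)

demo_model = LinearRegression().fit(X_demo, y_demo)
fitted = demo_model.predict(X_demo)

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - fitted) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
manual_r_sq = 1 - ss_res / ss_tot

print(manual_r_sq, demo_model.score(X_demo, y_demo))  # both values should match
```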


We then load the test set and use the multiple linear regression model fitted on the training set to generate predictions from the test set's predictive features.

In [6]:
test = pd.read_csv('imputed_test_90_10.csv')
indep = test.drop(['HeartDiseaseorAttack'], axis=1)
predictions = model.predict(indep)

np.savetxt("LinearRegProb.csv", predictions, delimiter=",")
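
Because the target is binary, the saved predictions are continuous values that a linear model does not restrict to the range 0 to 1. The notebook does not say how they are used downstream; one possible way to turn them into class labels is to clip them to [0, 1] and apply a threshold. The sketch below assumes a cutoff of 0.5 and uses small hypothetical arrays in place of `predictions` and the test-set target.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical stand-ins for model.predict(indep) and the true test labels
demo_predictions = np.array([-0.05, 0.10, 0.62, 0.35, 1.08, 0.48])
demo_truth = np.array([0, 0, 1, 0, 1, 1])

# Clip to [0, 1] so the values can be read as rough probability-like scores
scores = np.clip(demo_predictions, 0.0, 1.0)

# Assumed cutoff of 0.5; a different threshold may suit an imbalanced target
labels = (scores >= 0.5).astype(int)

print(labels)
print(accuracy_score(demo_truth, labels))
```

The threshold is a design choice rather than part of the regression itself, so it would normally be tuned on validation data before comparing against other models.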