
# Pre-processing Data 

In [0]:
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

In [0]:
# Importing the dataset
df = pd.read_csv('heart.csv')

In [0]:
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

In [0]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

In [0]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# R-squared 

To calculate the variance inflation factor (VIF), see: https://etav.github.io/python/vif_factor_python.html
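As a rough illustration of what a variance inflation factor measures, the sketch below computes it with NumPy alone: each column is regressed on the remaining columns, and VIF_j = 1 / (1 - R_j^2). The helper name `vif` and the synthetic data are assumptions made for illustration; they come neither from the linked tutorial nor from heart.csv.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of every column of X (hypothetical helper).

    Column j is regressed, with an intercept, on all other columns,
    and VIF_j = 1 / (1 - R_j^2).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        ss_res = resid @ resid
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic data (an assumption, not heart.csv): c is almost a copy of a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
vif_values = vif(np.column_stack([a, b, c]))
print(vif_values)
```

The two nearly collinear columns should get large values, while the independent column stays close to 1.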

Definitions quoted from web search snippets:

"Pearson Product-Moment Correlation Coefficient. The Excel RSQ Function returns the square of the Pearson product-moment correlation coefficient, which is a statistical measurement of the correlation (linear association) between two sets of values."

"R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model."

(Only for a single feature; the first column of X_train is used here as an example)
correlation_matrix = np.corrcoef(X_train[:, 0], y_train)
correlation_xy = correlation_matrix[0,1]
r_squared = correlation_xy**2

print(r_squared)
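For a single feature, the squared Pearson correlation equals the R-squared of a straight-line least-squares fit. The sketch below checks this on synthetic data; the arrays `x` and `y` are made up for illustration and are not columns of heart.csv.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)

# Route 1: square of the Pearson correlation coefficient.
r = np.corrcoef(x, y)[0, 1]
r_squared_corr = r ** 2

# Route 2: 1 - SS_res / SS_tot from a least-squares line.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared_fit = 1.0 - ss_res / ss_tot

print(r_squared_corr, r_squared_fit)
```

Both routes should print the same number up to floating-point error. With more than one feature the two notions diverge, which is why the shortcut is limited to a single parameter.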

Forward selection: start with no variables in the model, test the addition of each variable using a chosen model fit criterion, add the variable (if any) whose inclusion gives the most statistically significant improvement of the fit, and repeat until no addition improves the model to a statistically significant extent.

Backward elimination: start with all candidate variables, test the deletion of each variable using a chosen model fit criterion, delete the variable (if any) whose removal causes the least statistically significant deterioration of the model fit, and repeat until no further variable can be deleted without a statistically significant loss of fit.

Bidirectional elimination: a combination of the above, testing at each step for variables to be included or excluded.
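The forward-selection procedure described above can be sketched roughly as follows. This is a minimal illustration, not the method used later in this notebook: it uses the gain in adjusted R-squared as the fit criterion instead of a significance test, and the function names, the `min_gain` threshold and the synthetic data are all assumptions.

```python
import numpy as np

def adjusted_r2(X, y, cols):
    """Adjusted R-squared of an OLS fit of y on an intercept plus X[:, cols]."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    p = len(cols)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def forward_selection(X, y, min_gain=0.01):
    """Greedily add the column that raises adjusted R-squared the most."""
    selected, remaining = [], list(range(X.shape[1]))
    best = 0.0  # adjusted R-squared of the intercept-only model
    while remaining:
        scores = {c: adjusted_r2(X, y, selected + [c]) for c in remaining}
        candidate = max(scores, key=scores.get)
        if scores[candidate] - best < min_gain:
            break  # no remaining column improves the fit enough
        selected.append(candidate)
        remaining.remove(candidate)
        best = scores[candidate]
    return selected

# Synthetic data (an assumption): only columns 0 and 2 influence y.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(size=300)
chosen = forward_selection(X_demo, y_demo)
print(chosen)
```

Backward elimination runs the same loop in the opposite direction: it starts from all columns and removes one per step.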

So, to select k features out of a total of n features, we use feature selection or feature extraction. This speeds up model evaluation and sometimes lets us visualize the data as well (k = 2 or 3 when visualizing). Let's implement feature selection in this notebook.

# Backward elimination with Logistic Regression

In [0]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train,y_train) 

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
print('Test accuracy {:.2f}%'.format(classifier.score(X_test,y_test)*100))

Test accuracy 82.89%


In [0]:
y_pred = classifier.predict(X_test)
print(y_pred)

[0 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 1 0
 1 0 0 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0
 0 1]


In [0]:
import statsmodels.regression.linear_model as sm
# Prepend a column of ones so that OLS fits an intercept (sm.OLS does not add one itself)
X = np.append(arr = np.ones((X.shape[0], 1)).astype(int), values = X, axis = 1)

In [0]:
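# Choose a significance level (usually 0.05). If the highest p-value among the
# features is above it, remove that feature and refit.
# Note: indices 0-12 select the intercept and the first 12 feature columns; if
# heart.csv has 13 feature columns, the last one (index 13) is left out here.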
X_opt = np.array(X[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]], dtype=float)
classifier_ols = sm.OLS(endog = y, exog = X_opt).fit() 
classifier_ols.summary() 

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.499 |
| Model: | OLS | Adj. R-squared: | 0.478 |
| Method: | Least Squares | F-statistic: | 24.06 |
| Date: | Sat, 18 Apr 2020 | Prob (F-statistic): | 5.77e-37 |
| Time: | 13:23:37 | Log-Likelihood: | -114.02 |
| No. Observations: | 303 | AIC: | 254.0 |
| Df Residuals: | 290 | BIC: | 302.3 |
| Df Model: | 12 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.6784 | 0.294 | 2.304 | 0.022 | 0.099 | 1.258 |
| x1 | -0.0010 | 0.003 | -0.357 | 0.721 | -0.006 | 0.004 |
| x2 | -0.2275 | 0.047 | -4.841 | 0.000 | -0.320 | -0.135 |
| x3 | 0.1181 | 0.023 | 5.201 | 0.000 | 0.073 | 0.163 |
| x4 | -0.0021 | 0.001 | -1.667 | 0.097 | -0.005 | 0.000 |
| x5 | -0.0005 | 0.000 | -1.197 | 0.232 | -0.001 | 0.000 |
| x6 | 0.0283 | 0.061 | 0.467 | 0.641 | -0.091 | 0.148 |
| x7 | 0.0443 | 0.041 | 1.091 | 0.276 | -0.036 | 0.124 |
| x8 | 0.0029 | 0.001 | 2.500 | 0.013 | 0.001 | 0.005 |

| | | | |
|---|---|---|---|
| Omnibus: | 7.875 | Durbin-Watson: | 0.981 |
| Prob(Omnibus): | 0.019 | Jarque-Bera (JB): | 7.857 |
| Skew: | -0.362 | Prob(JB): | 0.0197 |
| Kurtosis: | 2.685 | Cond. No. | 4620.0 |


So, now we repeatedly remove the feature column with the highest p-value, refitting the model after each removal, until no remaining feature has a p-value greater than 0.05 (the conventional significance level).
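The manual remove-and-refit cycle can also be written as a loop. The sketch below is a NumPy-only illustration on synthetic data, not the notebook's statsmodels workflow: the function names are hypothetical, and the p-values use a normal approximation to the t distribution, so they will differ slightly from the `P>|t|` column that statsmodels reports.

```python
import math
import numpy as np

def ols_p_values(design, y):
    """Approximate two-sided p-values for OLS coefficients.

    Uses a normal approximation to the t distribution, which is
    reasonable for a few hundred rows.
    """
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = (resid @ resid) / (n - k)
    cov = sigma2 * np.linalg.inv(design.T @ design)
    t_stats = beta / np.sqrt(np.diag(cov))
    return np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats])

def backward_elimination(design, y, alpha=0.05):
    """Drop the least significant column until every p-value is <= alpha.

    Column 0 is assumed to be the intercept and is never removed.
    """
    keep = list(range(design.shape[1]))
    while len(keep) > 1:
        p = ols_p_values(design[:, keep], y)
        worst = int(np.argmax(p[1:])) + 1  # skip the intercept
        if p[worst] <= alpha:
            break
        del keep[worst]
    return keep

# Synthetic data (an assumption): only design columns 1 and 3 matter.
rng = np.random.default_rng(2)
features = rng.normal(size=(300, 4))
design_demo = np.column_stack([np.ones(300), features])
y_demo = 1.0 + 2.0 * features[:, 0] - 1.5 * features[:, 2] + rng.normal(size=300)
kept = backward_elimination(design_demo, y_demo)
print(kept)
```

With statsmodels available, the `pvalues` attribute of the fitted OLS result gives the exact t-based values and could replace `ols_p_values` in the same loop.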

In [0]:
#Removing 1st Feature
X_opt = np.array(X[:, [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]], dtype=float)
classifier_ols = sm.OLS(endog = y, exog = X_opt).fit() 
classifier_ols.summary() 

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.499 |
| Model: | OLS | Adj. R-squared: | 0.48 |
| Method: | Least Squares | F-statistic: | 26.32 |
| Date: | Sat, 18 Apr 2020 | Prob (F-statistic): | 1.16e-37 |
| Time: | 13:41:07 | Log-Likelihood: | -114.08 |
| No. Observations: | 303 | AIC: | 252.2 |
| Df Residuals: | 291 | BIC: | 296.7 |
| Df Model: | 11 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.6228 | 0.250 | 2.494 | 0.013 | 0.131 | 1.114 |
| x1 | -0.2258 | 0.047 | -4.837 | 0.000 | -0.318 | -0.134 |
| x2 | 0.1177 | 0.023 | 5.198 | 0.000 | 0.073 | 0.162 |
| x3 | -0.0022 | 0.001 | -1.796 | 0.074 | -0.005 | 0.000 |
| x4 | -0.0005 | 0.000 | -1.280 | 0.202 | -0.001 | 0.000 |
| x5 | 0.0269 | 0.060 | 0.446 | 0.656 | -0.092 | 0.146 |
| x6 | 0.0451 | 0.040 | 1.114 | 0.266 | -0.035 | 0.125 |
| x7 | 0.0030 | 0.001 | 2.836 | 0.005 | 0.001 | 0.005 |
| x8 | -0.1600 | 0.052 | -3.091 | 0.002 | -0.262 | -0.058 |

| | | | |
|---|---|---|---|
| Omnibus: | 8.14 | Durbin-Watson: | 0.979 |
| Prob(Omnibus): | 0.017 | Jarque-Bera (JB): | 8.148 |
| Skew: | -0.37 | Prob(JB): | 0.017 |
| Kurtosis: | 2.686 | Cond. No. | 3870.0 |


In [0]:
#Removing 6th Feature
X_opt = np.array(X[:, [0, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12]], dtype=float)
classifier_ols = sm.OLS(endog = y, exog = X_opt).fit() 
classifier_ols.summary() 

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.498 |
| Model: | OLS | Adj. R-squared: | 0.481 |
| Method: | Least Squares | F-statistic: | 29.01 |
| Date: | Sat, 18 Apr 2020 | Prob (F-statistic): | 2.31e-38 |
| Time: | 13:42:14 | Log-Likelihood: | -114.19 |
| No. Observations: | 303 | AIC: | 250.4 |
| Df Residuals: | 292 | BIC: | 291.2 |
| Df Model: | 10 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.6150 | 0.249 | 2.472 | 0.014 | 0.125 | 1.105 |
| x1 | -0.2250 | 0.047 | -4.830 | 0.000 | -0.317 | -0.133 |
| x2 | 0.1189 | 0.022 | 5.295 | 0.000 | 0.075 | 0.163 |
| x3 | -0.0021 | 0.001 | -1.750 | 0.081 | -0.005 | 0.000 |
| x4 | -0.0005 | 0.000 | -1.284 | 0.200 | -0.001 | 0.000 |
| x5 | 0.0441 | 0.040 | 1.093 | 0.275 | -0.035 | 0.123 |
| x6 | 0.0030 | 0.001 | 2.847 | 0.005 | 0.001 | 0.005 |
| x7 | -0.1589 | 0.052 | -3.078 | 0.002 | -0.261 | -0.057 |
| x8 | -0.0686 | 0.023 | -2.980 | 0.003 | -0.114 | -0.023 |

| | | | |
|---|---|---|---|
| Omnibus: | 8.432 | Durbin-Watson: | 0.98 |
| Prob(Omnibus): | 0.015 | Jarque-Bera (JB): | 8.493 |
| Skew: | -0.38 | Prob(JB): | 0.0143 |
| Kurtosis: | 2.693 | Cond. No. | 3860.0 |


In [0]:
#Removing 7th feature
X_opt = np.array(X[:, [0, 2, 3, 4, 5, 8, 9, 10, 11, 12]], dtype=float)
classifier_ols = sm.OLS(endog = y, exog = X_opt).fit() 
classifier_ols.summary() 

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.496 |
| Model: | OLS | Adj. R-squared: | 0.481 |
| Method: | Least Squares | F-statistic: | 32.08 |
| Date: | Sat, 18 Apr 2020 | Prob (F-statistic): | 7.06e-39 |
| Time: | 13:44:35 | Log-Likelihood: | -114.8 |
| No. Observations: | 303 | AIC: | 249.6 |
| Df Residuals: | 293 | BIC: | 286.7 |
| Df Model: | 9 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.6702 | 0.244 | 2.750 | 0.006 | 0.191 | 1.150 |
| x1 | -0.2293 | 0.046 | -4.939 | 0.000 | -0.321 | -0.138 |
| x2 | 0.1192 | 0.022 | 5.308 | 0.000 | 0.075 | 0.163 |
| x3 | -0.0023 | 0.001 | -1.857 | 0.064 | -0.005 | 0.000 |
| x4 | -0.0006 | 0.000 | -1.464 | 0.144 | -0.001 | 0.000 |
| x5 | 0.0030 | 0.001 | 2.837 | 0.005 | 0.001 | 0.005 |
| x6 | -0.1601 | 0.052 | -3.099 | 0.002 | -0.262 | -0.058 |
| x7 | -0.0676 | 0.023 | -2.940 | 0.004 | -0.113 | -0.022 |
| x8 | 0.0800 | 0.043 | 1.871 | 0.062 | -0.004 | 0.164 |

| | | | |
|---|---|---|---|
| Omnibus: | 7.873 | Durbin-Watson: | 0.983 |
| Prob(Omnibus): | 0.02 | Jarque-Bera (JB): | 7.961 |
| Skew: | -0.37 | Prob(JB): | 0.0187 |
| Kurtosis: | 2.71 | Cond. No. | 3780.0 |


In [0]:
#Removing 5th feature
X_opt = np.array(X[:, [0, 2, 3, 4, 8, 9, 10, 11, 12]], dtype=float)
classifier_ols = sm.OLS(endog = y, exog = X_opt).fit() 
classifier_ols.summary() 

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.493 |
| Model: | OLS | Adj. R-squared: | 0.479 |
| Method: | Least Squares | F-statistic: | 35.68 |
| Date: | Sat, 18 Apr 2020 | Prob (F-statistic): | 3.23e-39 |
| Time: | 13:46:34 | Log-Likelihood: | -115.91 |
| No. Observations: | 303 | AIC: | 249.8 |
| Df Residuals: | 294 | BIC: | 283.2 |
| Df Model: | 8 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.5499 | 0.230 | 2.392 | 0.017 | 0.098 | 1.002 |
| x1 | -0.2148 | 0.045 | -4.727 | 0.000 | -0.304 | -0.125 |
| x2 | 0.1211 | 0.022 | 5.392 | 0.000 | 0.077 | 0.165 |
| x3 | -0.0024 | 0.001 | -2.010 | 0.045 | -0.005 | -5.05e-05 |
| x4 | 0.0030 | 0.001 | 2.775 | 0.006 | 0.001 | 0.005 |
| x5 | -0.1649 | 0.052 | -3.193 | 0.002 | -0.267 | -0.063 |
| x6 | -0.0690 | 0.023 | -2.998 | 0.003 | -0.114 | -0.024 |
| x7 | 0.0777 | 0.043 | 1.815 | 0.071 | -0.007 | 0.162 |
| x8 | -0.1090 | 0.021 | -5.078 | 0.000 | -0.151 | -0.067 |

| | | | |
|---|---|---|---|
| Omnibus: | 7.839 | Durbin-Watson: | 0.975 |
| Prob(Omnibus): | 0.02 | Jarque-Bera (JB): | 8.08 |
| Skew: | -0.383 | Prob(JB): | 0.0176 |
| Kurtosis: | 2.771 | Cond. No. | 2230.0 |


In [0]:
#Removing 11th feature
X_opt = np.array(X[:, [0, 2, 3, 4, 8, 9, 10, 12]], dtype=float)
classifier_ols = sm.OLS(endog = y, exog = X_opt).fit() 
classifier_ols.summary() 

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.487 |
| Model: | OLS | Adj. R-squared: | 0.475 |
| Method: | Least Squares | F-statistic: | 40.0 |
| Date: | Sat, 18 Apr 2020 | Prob (F-statistic): | 2.44e-39 |
| Time: | 13:49:55 | Log-Likelihood: | -117.6 |
| No. Observations: | 303 | AIC: | 251.2 |
| Df Residuals: | 295 | BIC: | 280.9 |
| Df Model: | 7 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.6164 | 0.228 | 2.706 | 0.007 | 0.168 | 1.065 |
| x1 | -0.2124 | 0.046 | -4.658 | 0.000 | -0.302 | -0.123 |
| x2 | 0.1200 | 0.023 | 5.324 | 0.000 | 0.076 | 0.164 |
| x3 | -0.0025 | 0.001 | -2.036 | 0.043 | -0.005 | -8.33e-05 |
| x4 | 0.0034 | 0.001 | 3.279 | 0.001 | 0.001 | 0.005 |
| x5 | -0.1700 | 0.052 | -3.283 | 0.001 | -0.272 | -0.068 |
| x6 | -0.0900 | 0.020 | -4.504 | 0.000 | -0.129 | -0.051 |
| x7 | -0.1053 | 0.021 | -4.908 | 0.000 | -0.147 | -0.063 |

| | | | |
|---|---|---|---|
| Omnibus: | 8.461 | Durbin-Watson: | 0.966 |
| Prob(Omnibus): | 0.015 | Jarque-Bera (JB): | 8.749 |
| Skew: | -0.4 | Prob(JB): | 0.0126 |
| Kurtosis: | 2.767 | Cond. No. | 2200.0 |


So, every remaining feature now has a p-value below 0.05. Let's fit the logistic regression model on this reduced feature set.

In [0]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_opt, y, test_size = 0.25, random_state = 0)

In [0]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

In [0]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
print('Test accuracy {:.2f}%'.format(classifier.score(X_test,y_test)*100))

Test accuracy 78.95%


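The test accuracy dropped from 82.89% to 78.95%. This is consistent with the R-squared and Adjusted R-squared values, which decreased after the 5th feature was removed, suggesting that the later eliminations discarded useful information. Let's try the model at the peak Adjusted R-squared value (0.481), i.e., with the feature set obtained after removing the 6th feature.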
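Plain R-squared can never increase when a feature is dropped from an OLS model, so Adjusted R-squared, which penalizes the number of regressors p via 1 - (1 - R²)(n - 1)/(n - p - 1), is the more useful guide here. As a small check, the sketch below recomputes it from the R-squared and Df Model values reported in the summaries above, with n = 303 observations. The function name is chosen for illustration, and because the reported R-squared values are rounded, agreement is only expected to about three decimals.

```python
def adjusted_r_squared(r2, n_obs, n_features):
    """Adjusted R-squared for a model with an intercept and n_features regressors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# (R-squared, Df Model) pairs copied from the six OLS summaries above.
reported = [(0.499, 12), (0.499, 11), (0.498, 10), (0.496, 9), (0.493, 8), (0.487, 7)]
recomputed = [round(adjusted_r_squared(r2, 303, k), 3) for r2, k in reported]
for (r2, k), adj in zip(reported, recomputed):
    print(k, r2, adj)
```

The recomputed values rise until 10 and 9 regressors remain and fall afterwards, which matches the Adj. R-squared column of the summaries.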

In [0]:
X_opt = np.array(X[:, [0, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12]], dtype=float)

In [0]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_opt, y, test_size = 0.25, random_state = 0)

In [0]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

In [0]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
print('Test accuracy {:.2f}%'.format(classifier.score(X_test,y_test)*100))

Test accuracy 80.26%


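Better, but still not up to the mark: the model trained on all features reached 82.89%. This is not surprising, since the p-values come from an OLS linear regression while our problem is a classification task. Still, we were successful in implementing backward elimination :)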