## Heart Failure Prediction For Adult Males Machine Learning Model Using Logistic Regression
## David Abraham

## Memo

Cardiovascular diseases are the number one cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by cardiovascular diseases. People with cardiovascular disease need early detection and management, so in the research below we look at how much risk of a death event adult males face given the features diabetes, anaemia, high blood pressure, smoking, and age. Using logistic regression, we estimate which of these features are most strongly linked to a death event from cardiovascular disease. To do this, we began by feature engineering a male column and dropped all females from the dataset. This brought our cases down from 299 rows of data to 194, which means about 65% of this case study came from men. We found that interesting and decided to focus on that group. We then split the data into X and Y data frames: Y is the dependent variable, whether a death event occurred, and X holds the features listed above.

To understand what puts adult men at the biggest risk of a death event from cardiovascular disease, we look at the coefficients for the features in Figure 1. The two largest positive coefficients belong to high blood pressure and age. A logistic regression coefficient is a change in the log-odds of a death event, not a direct change in probability: the negative constant means the baseline odds for adult men are low, each additional year of age adds about 0.05 to the log-odds, and having high blood pressure adds about 0.13. Interestingly, the coefficients for diabetes, anaemia, and smoking are negative, so in this sample those features are associated with a lower chance of a death event due to cardiovascular disease. For diabetes, one possible explanation is that a diabetic's cause of death can come from a number of different things, and many diabetics end up taking good care of themselves. For smoking, my assumption is that smokers' deaths are more likely to be linked to lung failure or cancer than to cardiovascular disease. These are interpretations rather than findings, and the model does not test them. Figure 2 shows the pairwise correlations between the features. Age and high blood pressure are weakly positively correlated (about 0.11), so older men are slightly more likely to have high blood pressure, which puts them at higher risk of cardiovascular disease. Age and anaemia are also weakly positively correlated (about 0.10). Smoking and diabetes are weakly negatively correlated (about -0.11), which also makes sense: if you have diabetes, chances are you are less likely to smoke.
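The coefficient reading above can be sketched in a few lines. This is a minimal illustration, not the notebook's own code: it assumes the intercept printed in Figure 1 (-4.1252) and the two rounded slopes quoted in the memo (0.05 for age, 0.13 for high blood pressure), and the 60-year-old patient is hypothetical. Because the slopes are rounded, the results are approximate.

```python
import math

# Values quoted in the memo and Figure 1; the two slopes are rounded,
# so every result below is approximate.
CONST = -4.1252   # intercept from Figure 1 (log-odds)
COEF_AGE = 0.05   # per year of age, as quoted in the memo
COEF_HBP = 0.13   # high blood pressure, as quoted in the memo


def odds_ratio(coef):
    """Convert a log-odds coefficient into a multiplicative odds ratio."""
    return math.exp(coef)


def probability(log_odds):
    """Logistic (sigmoid) function: map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical patient: 60 years old, with diabetes, anaemia,
# high blood pressure, and smoking all equal to 0.
log_odds_60 = CONST + COEF_AGE * 60
p_60 = probability(log_odds_60)

print(f"odds ratio per year of age: {odds_ratio(COEF_AGE):.3f}")
print(f"odds ratio for high blood pressure: {odds_ratio(COEF_HBP):.3f}")
print(f"illustrative probability at age 60: {p_60:.3f}")
```

An odds ratio above 1 means the feature raises the odds of a death event, and one below 1 (from a negative coefficient) lowers them.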
    
Next we check for multicollinearity, which refers to high correlation among the model's features. To diagnose it we look at the variance inflation factor (VIF) scores in Figure 5, where we are generally concerned with VIFs greater than two. None of the feature VIFs in Figure 5 is greater than two (the large value for const belongs to the intercept, not to a feature), so we do not have a problem. We then split the data set into train and test sets to see whether the models for the two samples are comparable; these are Figure 3 and Figure 4. Comparing the pseudo R-squared of two similar models shows which one fits its outcome better. Figure 3 has a pseudo R-squared of 0.1580 while Figure 4 has 0.08729, so the train model in Figure 3 fits better. In Figure 3, smoking and high blood pressure are the leading factors: their coefficients rise sharply, which suggests that many of the men who landed in the train sample and had a death event were smokers with high blood pressure. Figure 1 still shows that the older you are, and if you have high blood pressure, the more likely you are to have a death event linked to cardiovascular disease.
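As a rough check on the pseudo R-squared comparison, the reported values can be reproduced from the log-likelihoods printed in the appendix. The sketch below assumes the "Pseudo R-squ." in the summaries is McFadden's pseudo R-squared (one minus the ratio of the fitted log-likelihood to the null log-likelihood); the function name is illustrative, and the inputs are the rounded numbers from Figures 1, 3, and 4.

```python
def mcfadden_pseudo_r2(log_likelihood, log_likelihood_null):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(null)."""
    return 1.0 - log_likelihood / log_likelihood_null


# (Log-Likelihood, LL-Null) pairs copied from the summaries in the appendix.
fits = {
    "Figure 1 (all males)": (-111.79, -121.55),
    "Figure 3 (train half)": (-44.734, -53.130),
    "Figure 4 (test half)": (-59.657, -65.362),
}

for name, (ll, ll_null) in fits.items():
    print(f"{name}: pseudo R-squared = {mcfadden_pseudo_r2(ll, ll_null):.4f}")
```

Small differences in the last digit against the printed summaries come from the rounded log-likelihoods used as input.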

In [1]:
import pandas as pd
heart = pd.read_csv('heart_failure_clinical_records_dataset.csv')
heart.head()

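| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |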


In [2]:
heart.shape

(299, 13)

**What does binary output mean for each category?**
- Sex: Gender of patient Male = 1, Female =0
- Age: Age of patient
- Diabetes: 0 = No, 1 = Yes
- Anaemia: 0 = No, 1 = Yes
- High_blood_pressure: 0 = No, 1 = Yes
- Smoking: 0 = No, 1 = Yes
- DEATH_EVENT: 0 = No, 1 = Yes

A couple of interesting points about this data:
- The average age in this data set is 60.83, so it's mainly geared towards that age group
- A majority of the people in this data set are male
- A majority do not smoke or have high blood pressure
- Most importantly, only 32% of people in this data set died. 

In [3]:
heart.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 299.0 | 60.833893 | 11.894809 | 40.0 | 51.0 | 60.0 | 70.0 | 95.0 |
| anaemia | 299.0 | 0.431438 | 0.496107 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| creatinine_phosphokinase | 299.0 | 581.839465 | 970.287881 | 23.0 | 116.5 | 250.0 | 582.0 | 7861.0 |
| diabetes | 299.0 | 0.41806 | 0.494067 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| ejection_fraction | 299.0 | 38.083612 | 11.834841 | 14.0 | 30.0 | 38.0 | 45.0 | 80.0 |
| high_blood_pressure | 299.0 | 0.351171 | 0.478136 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| platelets | 299.0 | 263358.029264 | 97804.236869 | 25100.0 | 212500.0 | 262000.0 | 303500.0 | 850000.0 |
| serum_creatinine | 299.0 | 1.39388 | 1.03451 | 0.5 | 0.9 | 1.1 | 1.4 | 9.4 |
| serum_sodium | 299.0 | 136.625418 | 4.412477 | 113.0 | 134.0 | 137.0 | 140.0 | 148.0 |
| sex | 299.0 | 0.648829 | 0.478136 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |


In [4]:
df = heart.copy()

In [5]:
# Make sure there are no null values
df.isnull().sum()

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64

In [6]:
df2 = df.copy()
df2.shape

(299, 13)

Created two new columns, adults and male:
- adults: True if the age is 18 or over
- male: 1 if the patient is male, 0 if female

In [7]:
# Flag rows as True when age is 18 or above
df2['age'] = pd.to_numeric(df['age'])
df2['adults'] = df2['age'] >= 18

In [8]:
# Create a new column named male: 1 if sex == 1 (male), otherwise 0.
# Using .loc assigns in place and avoids pandas' SettingWithCopyWarning.
df2['male'] = 0
df2.loc[df2['sex'] == 1, 'male'] = 1


In [9]:
# These columns are not used in the model, so drop them from df2
df2 = df2.drop(['creatinine_phosphokinase', 'ejection_fraction', 'platelets',
                'serum_creatinine', 'serum_sodium', 'time'], axis=1)

In [10]:
df2.head()

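| | age | anaemia | diabetes | high_blood_pressure | sex | smoking | DEATH_EVENT | adults | male |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 0 | 1 | 1 | 0 | 1 | True | 1 |
| 1 | 55.0 | 0 | 0 | 0 | 1 | 0 | 1 | True | 1 |
| 2 | 65.0 | 0 | 0 | 0 | 1 | 1 | 1 | True | 1 |
| 3 | 50.0 | 1 | 0 | 0 | 1 | 0 | 1 | True | 1 |
| 4 | 65.0 | 1 | 1 | 0 | 0 | 0 | 1 | True | 0 |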


In [11]:
df3 = df2.copy()

We decided to drop everyone who was female and focus on the males in the dataset, since they are the majority. We also drop rows for people younger than 18, because they are at low risk for heart disease; the minimum age in this dataset is 40, so this filter does not remove any rows.

In [12]:
df3.drop(df3[df3['adults'] == False].index, inplace=True)
df3.drop(df3[df3['male'] == 0].index, inplace=True)
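
A quick arithmetic check on the filtering step above, using only numbers already printed by `heart.shape` and `heart.describe()`: because `sex` is coded 0/1, its mean is the share of males, so the expected number of rows left after dropping females can be estimated without re-running the notebook. The variable names are illustrative.

```python
TOTAL_ROWS = 299        # from heart.shape
MALE_SHARE = 0.648829   # mean of the binary sex column in heart.describe()
MIN_AGE = 40.0          # minimum age in heart.describe()

# The mean of a 0/1 column times the row count gives the number of ones.
expected_males = round(TOTAL_ROWS * MALE_SHARE)

# The adults filter (age >= 18) only removes rows if someone is under 18.
adult_filter_removes_rows = MIN_AGE < 18

print(f"expected male rows: {expected_males}")
print(f"adults filter removes rows: {adult_filter_removes_rows}")
```

The result agrees with the "No. Observations" value reported in the Figure 1 summary.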

In [13]:
# split data into X and Y dataframes
X = df3[['diabetes','anaemia','high_blood_pressure','smoking','age']].copy() 
Y = df3['DEATH_EVENT'].copy()

In [14]:
df3.head()

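| | age | anaemia | diabetes | high_blood_pressure | sex | smoking | DEATH_EVENT | adults | male |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 0 | 1 | 1 | 0 | 1 | True | 1 |
| 1 | 55.0 | 0 | 0 | 0 | 1 | 0 | 1 | True | 1 |
| 2 | 65.0 | 0 | 0 | 0 | 1 | 1 | 1 | True | 1 |
| 3 | 50.0 | 1 | 0 | 0 | 1 | 0 | 1 | True | 1 |
| 5 | 90.0 | 1 | 0 | 1 | 1 | 1 | 1 | True | 1 |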


## Appendix

**Figure 1**

In [15]:
import statsmodels.api as sm
X = sm.add_constant(X) # add an intercept (const) column to the features
est = sm.Logit(Y, X).fit() # fit the logistic regression by maximum likelihood
predictions = est.predict() # predicted probabilities for the fitted rows
print(est.summary())

Optimization terminated successfully.
         Current function value: 0.576238
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:            DEATH_EVENT   No. Observations:                  194
Model:                          Logit   Df Residuals:                      188
Method:                           MLE   Df Model:                            5
Date:                Sun, 06 Dec 2020   Pseudo R-squ.:                 0.08031
Time:                        14:09:24   Log-Likelihood:                -111.79
converged:                       True   LL-Null:                       -121.55
                                        LLR p-value:                  0.001535
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                  -4.1252      0.943     -4.373      0.000      -5.974      -2.276
di

**Figure 2**

In [16]:
corr = X.corr()
corr.style.background_gradient()


| | const | diabetes | anaemia | high_blood_pressure | smoking | age |
|---|---|---|---|---|---|---|
| const | NaN | NaN | NaN | NaN | NaN | NaN |
| diabetes | NaN | 1.0 | -0.0610655 | -0.0233564 | -0.111688 | -0.0613733 |
| anaemia | NaN | -0.0610655 | 1.0 | 0.0178963 | -0.0952734 | 0.104937 |
| high_blood_pressure | NaN | -0.0233564 | 0.0178963 | 1.0 | -0.0428634 | 0.110417 |
| smoking | NaN | -0.111688 | -0.0952734 | -0.0428634 | 1.0 | -0.0256584 |
| age | NaN | -0.0613733 | 0.104937 | 0.110417 | -0.0256584 | 1.0 |


**Figure 3**

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.5, random_state=2)

In [18]:
import statsmodels.api as sm
X_train = sm.add_constant(X_train) # adds an intercept column if one is missing
est = sm.Logit(y_train, X_train).fit() # fit model on the train half
predictions = est.predict() # predicted probabilities for the train rows
print(est.summary()) # prints full regression results

Optimization terminated successfully.
         Current function value: 0.461176
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:            DEATH_EVENT   No. Observations:                   97
Model:                          Logit   Df Residuals:                       91
Method:                           MLE   Df Model:                            5
Date:                Sun, 06 Dec 2020   Pseudo R-squ.:                  0.1580
Time:                        14:09:24   Log-Likelihood:                -44.734
converged:                       True   LL-Null:                       -53.130
                                        LLR p-value:                  0.004913
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                  -6.1351      1.671     -3.672      0.000      -9.409      -2.861
di

**Figure 4**

In [19]:
import statsmodels.api as sm

X_test = sm.add_constant(X_test) # required if constant expected
est = sm.Logit(y_test,X_test).fit() # fit model
predictions = est.predict() # get predicted values
print(est.summary()) # prints full regression results

Optimization terminated successfully.
         Current function value: 0.615022
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:            DEATH_EVENT   No. Observations:                   97
Model:                          Logit   Df Residuals:                       91
Method:                           MLE   Df Model:                            5
Date:                Sun, 06 Dec 2020   Pseudo R-squ.:                 0.08729
Time:                        14:09:24   Log-Likelihood:                -59.657
converged:                       True   LL-Null:                       -65.362
                                        LLR p-value:                   0.04382
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                  -2.8929      1.207     -2.397      0.017      -5.259      -0.527
di

**Figure 5**

In [20]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

variables = est.model.exog # get model features
vif = pd.DataFrame() # create a dataframe
vif["VIF Factor"] = [variance_inflation_factor(variables, i) for i in range(variables.shape[1])]
vif["features"] = X_test.columns
print('VIF: {}'.format(vif))

VIF:    VIF Factor             features
0   28.309503                const
1    1.024993             diabetes
2    1.017343              anaemia
3    1.002764  high_blood_pressure
4    1.038371              smoking
5    1.003674                  age
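
To put the VIF scores above in more familiar terms, each VIF can be converted back into the R-squared it implies, assuming the standard definition VIF = 1 / (1 - R²), where R² comes from regressing one feature on all the others. This is a small sketch with illustrative helper names; the inputs are the feature VIFs printed in Figure 5, with the const row left out because it is the intercept.

```python
def r_squared_from_vif(vif):
    """Invert VIF = 1 / (1 - R^2) to recover the implied R^2."""
    return 1.0 - 1.0 / vif


def vif_from_r_squared(r_squared):
    """VIF for a feature whose regression on the other features has this R^2."""
    return 1.0 / (1.0 - r_squared)


# Feature VIFs copied from Figure 5 (const excluded).
reported_vif = {
    "diabetes": 1.024993,
    "anaemia": 1.017343,
    "high_blood_pressure": 1.002764,
    "smoking": 1.038371,
    "age": 1.003674,
}

VIF_THRESHOLD = 2.0  # the cutoff used in the memo

implied_r2 = {name: r_squared_from_vif(v) for name, v in reported_vif.items()}
for name, r2 in implied_r2.items():
    print(f"{name}: implied R^2 = {r2:.4f}")

print(f"R^2 at the VIF threshold: {r_squared_from_vif(VIF_THRESHOLD):.2f}")
```

Under this definition, a VIF of two corresponds to half of a feature's variance being explained by the other features; every feature here sits far below that.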
