# Heart failure prediction

#### Heart failure — sometimes known as congestive heart failure — occurs when the heart muscle doesn't pump blood as well as it should. When this happens, blood often backs up and fluid can build up in the lungs, causing shortness of breath.

#### Certain heart conditions, such as narrowed arteries in the heart (coronary artery disease) or high blood pressure, gradually leave the heart too weak or stiff to fill and pump blood properly.

#### Proper treatment can improve the signs and symptoms of heart failure and may help some people live longer. Lifestyle changes — such as losing weight, exercising, reducing salt (sodium) in your diet and managing stress — can improve your quality of life. However, heart failure can be life-threatening. People with heart failure may have severe symptoms, and some may need a heart transplant or a ventricular assist device (VAD).

#### One way to prevent heart failure is to prevent and control conditions that can cause it, such as coronary artery disease, high blood pressure, diabetes and obesity.

In [None]:
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
data = pd.read_csv(r'../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

In [None]:
data.head()

In [None]:
data.columns

In [None]:
# convert column names to lowercase for ease of use
data.columns = data.columns.str.lower()

# Understanding features:

- Anemia - anemia is a condition in which you lack enough healthy red blood cells to carry adequate oxygen to your body's tissues. Having anemia, also referred to as low hemoglobin, can make you feel tired and weak. (0 = no anemia, 1 = anemia)

- Creatinine_phosphokinase (CPK) - CPK is an enzyme in the body. It is found mainly in the heart, brain, and skeletal muscle. Total CPK normal values: 10 to 120 micrograms per liter (mcg/L).

- Ejection_fraction (EF) - EF is a measurement, expressed as a percentage, of how much blood the left ventricle pumps out with each contraction. An ejection fraction of 60 percent means that 60 percent of the total amount of blood in the left ventricle is pushed out with each heartbeat. This indication of how well your heart is pumping out blood can help to diagnose and track heart failure. A normal heart’s ejection fraction may be between 50 and 70 percent.

- Platelets - platelets are colorless blood cells that help blood clot. Platelets stop bleeding by clumping and forming plugs in blood vessel injuries. Thrombocytopenia might occur as a result of a bone marrow disorder such as leukemia or an immune system problem. The normal number of platelets in the blood is 150,000 to 400,000 platelets per microliter (mcL) or 150 to 400 × 10^9/L.

- Serum_creatinine - The amount of creatinine in your blood should be relatively stable. An increased level of creatinine may be a sign of poor kidney function. Serum creatinine is reported as milligrams of creatinine per deciliter of blood (mg/dL) or micromoles of creatinine per liter of blood (micromoles/L). Normal values by sex and age: 0.9 to 1.3 mg/dL for adult males, 0.6 to 1.1 mg/dL for adult females, and 0.5 to 1.0 mg/dL for children ages 3 to 18 years.

- Serum_sodium - Measurement of serum sodium is routine in assessing electrolyte, acid-base, and water balance, as well as renal function. Sodium accounts for approximately 95% of the osmotically active substances in the extracellular compartment, provided that the patient is not in renal failure or does not have severe hyperglycemia. The normal range for blood sodium levels is 135 to 145 milliequivalents per liter (mEq/L).

- Time - follow-up period (days)

- High_blood_pressure - (True - 1, False - 0)

- Age - between 40 - 95

- Diabetes - (True - 1, False - 0)

- Sex - (male - 1, female - 0)

- Smoking - (True - 1, False - 0)

- Death event - (True - 1, False - 0)
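
The reference ranges quoted in the list above can be encoded as a quick sanity check. The sketch below is only an illustration on a small made-up frame: the `toy` values, the `REFERENCE_RANGES` dict and the `flag_out_of_range` helper are hypothetical names introduced here, not part of the dataset or the notebook.

```python
import pandas as pd

# Reference ranges copied from the feature descriptions above.
# The dict and helper names are illustrative only.
REFERENCE_RANGES = {
    "ejection_fraction": (50, 70),      # percent
    "platelets": (150_000, 400_000),    # platelets per microliter
    "serum_sodium": (135, 145),         # mEq/L
}

def flag_out_of_range(df, ranges=REFERENCE_RANGES):
    """Return a boolean frame: True where a value lies outside its range."""
    flags = pd.DataFrame(index=df.index)
    for col, (low, high) in ranges.items():
        if col in df.columns:
            # Series.between is inclusive on both ends by default
            flags[col] = ~df[col].between(low, high)
    return flags

# Made-up values, for illustration only.
toy = pd.DataFrame({
    "ejection_fraction": [20, 60, 38],
    "platelets": [265_000, 451_000, 162_000],
    "serum_sodium": [130, 136, 140],
})
flags = flag_out_of_range(toy)
print(flags.mean())  # share of rows outside the reference range, per column
```

On the real data, the same helper could be called on `data` after the column names have been lowercased.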

# EDA

### Make a copy of the data for the visualisation:

In [None]:
df_vis = data.copy()
df_vis.death_event = df_vis.death_event.map({0:'Alive',1:'Dead'})
df_vis.diabetes = df_vis.diabetes.map({0:'No',1:'Yes'})
df_vis.smoking = df_vis.smoking.map({0:'No',1:'Yes'})
df_vis.sex = df_vis.sex.map({0:'Female',1:'Male'})


## Distribution of Death by heart failure in dataset:

In [None]:
plt.figure(figsize=(7,7))
plt.pie(data['death_event'].value_counts(),labels=['Alive','Dead'],autopct='%1.1f%%',shadow=True,explode=[0,0.1], colors = ['lightblue','lightgreen'])
plt.title('Death Event',fontsize=20)
plt.show()

##### The classes are imbalanced: survivors clearly outnumber deaths.
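
The degree of imbalance can be expressed as class shares and a majority-to-minority ratio. This is a minimal sketch on made-up labels; `toy_target` is a hypothetical stand-in for `data['death_event']`.

```python
import pandas as pd

# Made-up labels for illustration; the notebook's column is data['death_event'].
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 1], name="death_event")

counts = toy_target.value_counts()                 # absolute class sizes
shares = toy_target.value_counts(normalize=True)   # class sizes as fractions
imbalance_ratio = counts.max() / counts.min()      # majority size / minority size

print(shares)
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```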

## Sex distribution:

In [None]:
df_vis['sex'].value_counts()

##### Males outnumber females in the dataset, so raw counts are skewed towards males and cannot be compared directly between the sexes.

## Is there any relation between gender and death event?

In [None]:
plt.figure(figsize=(7,7))
sns.countplot(x='sex',hue='death_event',data=df_vis)

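#### Because males outnumber females in the dataset, it is expected that more males than females died in absolute terms. This alone does not show that males are more likely to die from heart failure; that requires comparing death rates within each sex.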
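
One way to separate group size from risk is to normalise the counts within each sex. The sketch below uses a small made-up frame (`toy` is hypothetical). On the notebook's data, the same call would take `df_vis['sex']` and `df_vis['death_event']`.

```python
import pandas as pd

# Made-up records, for illustration only.
toy = pd.DataFrame({
    "sex": ["Male"] * 6 + ["Female"] * 4,
    "death_event": ["Dead", "Dead", "Alive", "Alive", "Alive", "Alive",
                    "Dead", "Alive", "Alive", "Alive"],
})

# normalize="index" divides each row by its total, giving rates within each sex
rates = pd.crosstab(toy["sex"], toy["death_event"], normalize="index")
print(rates)
```

In this toy frame males account for twice as many deaths as females, but the within-group rates (2 of 6 versus 1 of 4) are much closer than the raw counts suggest.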

## Is there a relationship between age and heart failure?

In [None]:
plt.figure(figsize=(7,7))
sns.violinplot(x='death_event',y='age',data=df_vis, palette='Set3')
plt.title('Age with Death Event',fontsize=20)
plt.show()


#### As the plot shows, patients who died tend to be older than those who survived.

## Can diabetes be a cause of heart failure?

In [None]:
plt.figure(figsize=(7,7))
sns.countplot(x='diabetes',hue='death_event',data=df_vis, palette='Set1')
plt.title('Diabetes with Death Event',fontsize=20)
plt.show()


#### As the plot shows, deaths are split similarly between people with and without diabetes, so diabetes alone does not appear to separate the two outcomes.
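
A count plot only suggests whether two groups differ. A chi-square test of independence is one common way to check it. The sketch below assumes SciPy is available and uses a made-up contingency table (`toy_table` is hypothetical). On the notebook's data, the table could come from `pd.crosstab(df_vis['diabetes'], df_vis['death_event'])`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts, for illustration only: rows = diabetes, columns = outcome.
toy_table = pd.DataFrame(
    [[30, 15], [28, 17]],
    index=pd.Index(["No", "Yes"], name="diabetes"),
    columns=pd.Index(["Alive", "Dead"], name="death_event"),
)

# Returns the test statistic, the p-value, the degrees of freedom and the
# expected counts under independence (Yates' correction is applied to 2x2 tables).
chi2, p_value, dof, expected = chi2_contingency(toy_table.to_numpy())
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

A large p-value means the data are compatible with no association between the two variables. It does not prove that there is none.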

## Can smoking be a cause of heart failure?

In [None]:
# chart for smoking vs death event
plt.figure(figsize=(7,7))
sns.countplot(x='smoking',hue='death_event',data=df_vis, palette='Set2')
plt.title('Smoking with Death Event',fontsize=20)
plt.show()

#### As we can see, people who smoke are more likely to die from heart failure.


## creatinine_phosphokinase vs death event

In [None]:
import plotly.express as px
fig = px.violin(df_vis, y="creatinine_phosphokinase", x="death_event", color="death_event", box=True, points="all", hover_data=df_vis.columns)
fig.show()

# Data preprocessing

#### Check for missing values


In [None]:
data.isnull().sum()

## Check for outliers

In [None]:
plt.style.use("ggplot")  # set the style before plotting so it applies to this figure
nums = data.select_dtypes(exclude=["object"])
nums.plot(subplots=True, kind='box', layout=(15, 4), figsize=(25, 35), patch_artist=True, color="#6F266E")
plt.subplots_adjust(wspace=0.5)
plt.show()

#### Outliers are present in the data
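
With matplotlib's default settings, the points drawn beyond the whiskers of a box plot are those lying more than 1.5 × IQR outside the quartiles. The sketch below applies the same rule numerically to a made-up series. The helper name `iqr_outlier_mask` and the `toy` values are hypothetical, and this rule is not the method used for the deletion step in this notebook.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return True for values further than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values, for illustration only.
toy = pd.Series([1.0, 1.1, 0.9, 1.2, 1.0, 1.3, 0.8, 9.4], name="serum_creatinine")
mask = iqr_outlier_mask(toy)
print(toy[mask])  # values flagged as outliers
```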

## Deletion of outliers

In [None]:
data = data.drop(data[data['platelets']>420000].index)
data = data.drop(data[data['serum_creatinine']>2.5].index)
data = data.drop(data[data['creatinine_phosphokinase']>1500].index)

## Correlation heatmap:

In [None]:
# heatmap for correlation
plt.figure(figsize=(14,10))
sns.heatmap(data.corr(),annot=True,cmap='coolwarm')
plt.title('Correlation Heatmap',fontsize=20)
plt.show()


## Importance of features:

In [None]:
# Feature Selection

from sklearn.ensemble import ExtraTreesClassifier

plt.rcParams['figure.figsize'] = 15, 6
sns.set_style("darkgrid")
x = data.iloc[:, :-1]  # all columns except the last one
y = data.iloc[:, -1]   # target: the last column
model = ExtraTreesClassifier()
model.fit(x, y)
print(model.feature_importances_)
feat_importances = pd.Series(model.feature_importances_, index=x.columns)
feat_importances.nlargest(12).plot(kind='barh', color='blue')
plt.show()


## VIF
#### Check for multicollinearity
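
The variance inflation factor of a column j is 1 / (1 - R_j²), where R_j² comes from regressing that column on all the others. The sketch below computes it by hand with NumPy on made-up data. The helper `vif_from_r2` is hypothetical and adds an intercept to each regression. As far as I know, statsmodels' `variance_inflation_factor` regresses on the columns exactly as passed, so its values can differ from this sketch when no constant column is included.

```python
import numpy as np

def vif_from_r2(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()  # share of variance explained by the others
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Made-up columns: `c` is nearly a copy of `a`, `b` is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
vifs = vif_from_r2(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly identical columns get large VIFs, while the independent one stays close to 1.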

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_data = data
VIF = pd.Series(
    [variance_inflation_factor(vif_data.values, i) for i in range(vif_data.shape[1])],
    index=vif_data.columns,
)
VIF

### Treatment of multicollinearity:


In [None]:
def MC_remover(data):
    """Drop the column with the highest VIF if that VIF exceeds 13."""
    vif = pd.Series(
        [variance_inflation_factor(data.values, i) for i in range(data.shape[1])],
        index=data.columns,
    )
    if vif.max() > 13:
        worst = vif.idxmax()  # column with the largest VIF
        print(worst, 'has been removed')
        return data.drop(columns=[worst])
    print("No multicollinearity present anymore")
    return data

In [None]:
for i in range(10):
    vif_data=MC_remover(vif_data)
vif_data.head()

### Calculating VIF for remaining columns:


In [None]:
VIF=pd.Series([variance_inflation_factor(vif_data.values,i) for i in range(vif_data.shape[1])],index=vif_data.columns)
VIF,len(vif_data.columns)

### Separating features and target:

In [None]:
X = vif_data.drop('death_event',axis=1)
y = vif_data['death_event']


#### Use SMOTE to balance the data

In [None]:
# balance the data by oversampling the minority class

from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_sm, y_sm = sm.fit_resample(X, y)



## Scaling data:

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_sm = scaler.fit_transform(X_sm)


## Model comparison:

In [None]:
# # lazy prediction
# from lazypredict.Supervised import LazyClassifier
# clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
# models, predictions = clf.fit(X_sm, X_sm, y_sm, y_sm)
# models

## Split into train and test data:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.2, random_state=42)
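
When resampling and scaling are fitted before the split, information from the test rows can leak into training and make test scores look better than they are. A common alternative order is split, then oversample, then scale, with the last two fitted on the training part only. The sketch below illustrates that order on made-up data. All `*_toy`, `*_tr`, `*_te` and `*_bal` names are hypothetical. Random oversampling with `sklearn.utils.resample` stands in for SMOTE here so that the block depends only on scikit-learn, and `stratify` is an added assumption that the notebook's split does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Made-up imbalanced data, for illustration only.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 70 + [1] * 30)

# 1. Split first, so the test rows are never seen by the resampler or scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# 2. Oversample the minority class in the training part only
#    (random duplication here; SMOTE would be applied at this same point).
X_maj, X_min = X_tr[y_tr == 0], X_tr[y_tr == 1]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))

# 3. Fit the scaler on the training data and reuse it for the test data.
toy_scaler = StandardScaler().fit(X_bal)
X_bal_scaled = toy_scaler.transform(X_bal)
X_te_scaled = toy_scaler.transform(X_te)

print(np.bincount(y_bal), X_te_scaled.shape)
```

The test set keeps its original class balance, so its metrics reflect the distribution the model would actually face.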

## Logistic regression

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))


## Random forest

In [None]:
# random forest classifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
print('Accuracy of random forest classifier on test set: {:.2f}'.format(rfc.score(X_test, y_test)))

# ANN

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout


# stop training once the validation loss has not improved for 20 epochs
early_stopping = tf.keras.callbacks.EarlyStopping(
    min_delta=0.001,
    patience=20,
    restore_best_weights=True
)


model = Sequential()

# layers
# input_dim follows the number of features left after the VIF step
model.add(Dense(units = 16, kernel_initializer = 'uniform', activation = 'relu', input_dim = X_train.shape[1]))
model.add(Dense(units = 8, kernel_initializer = 'uniform', activation = 'relu'))
model.add(Dropout(0.25))
model.add(Dense(units = 8, kernel_initializer = 'uniform', activation = 'relu'))
model.add(Dropout(0.25))
model.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
model.add(Dropout(0.01))
model.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

# Compiling the ANN
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Train the ANN
history = model.fit(X_train, y_train, batch_size = 32, epochs = 200,callbacks=[early_stopping], validation_split=0.2)

# plot the loss and accuracy 
plt.title('Training and Validation loss')
plt.plot(history.history['loss'], label='loss', color='blue')
plt.plot(history.history['val_loss'], label='validation loss', color='orange')
plt.legend()
plt.show()


plt.title('Training and Validation accuracy')
plt.plot(history.history['accuracy'], label='accuracy', color='green')
plt.plot(history.history['val_accuracy'], label='validation accuracy', color='red')
plt.legend()
plt.show()

# predict the test set and convert the sigmoid probabilities to 0/1 labels
y_pred = model.predict(X_test)
y_pred = (y_pred > 0.5).astype(int).ravel()




## Classification report (ANN)


In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

## Confusion matrix


In [None]:
# plot confusion matrix with seaborn heatmap
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.show()
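
The figures in a classification report can be read directly off a binary confusion matrix. The sketch below uses made-up counts in scikit-learn's layout (rows are true classes, columns are predicted classes). `cm_toy` is hypothetical and does not reflect this notebook's results.

```python
import numpy as np

# Made-up confusion matrix, for illustration only:
# rows = true class (0, 1), columns = predicted class (0, 1).
cm_toy = np.array([[40, 5],
                   [10, 25]])
tn, fp, fn, tp = cm_toy.ravel()

accuracy = (tp + tn) / cm_toy.sum()
precision = tp / (tp + fp)   # of the predicted deaths, how many were real
recall = tp / (tp + fn)      # of the real deaths, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For a task like this one, recall on the positive class is often the figure to watch, because a false negative is a missed death.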


