<a href="https://www.kaggle.com/code/amirulmahmud/heart-failure-prediction-with-logistic-regression?scriptVersionId=124935993" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# The Objective

The objective of this work is to build a classification model that predicts whether or not a person has heart failure, based on that person's physical features.

# Load The Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/heart-failure-prediction/heart.csv')

In [None]:
df.head()

In [None]:
df.info()

# Data Cleaning

Before using the data, let's check whether there are any missing values.

In [None]:
df.isna().sum()

Fortunately, the data is complete and does not have any missing values.

# Exploratory Data Analysis

**Create a statistical summary**

In [None]:
df.describe().transpose()

**Create a pairplot**

In [None]:
sns.pairplot(df,hue='HeartDisease')

**Create a heatmap that displays the correlation between features**

In [None]:
plt.figure(figsize=(6,4),dpi=150)
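# numeric_only=True skips the string columns, which newer pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only=True),cmap='viridis',annot=True)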

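**Find the top 5 numeric features most correlated with the target.**

In [None]:
# tail(6) includes HeartDisease itself (correlation 1), which leaves the top 5 features
np.abs(df.corr(numeric_only=True)['HeartDisease']).sort_values().tail(6)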

# Split The Data

The approach here uses cross-validation on 90% of the dataset, then evaluates the final model on a held-out test set made up of the remaining 10%.

In [None]:
X = df.drop('HeartDisease',axis=1)
y = df['HeartDisease']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

Check the class balance of the labels in the training set.

In [None]:
y_train.value_counts()

In [None]:
sns.countplot(x=y_train)

In [None]:
372/(454+372)

As shown here, the label classes in the training data are reasonably balanced (45% : 55%), so the data is ready to use in the next steps.
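The proportion computed by hand above can also be read directly from `value_counts(normalize=True)`. The sketch below uses a small made-up label series (not the heart dataset), and additionally shows the optional `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits. The notebook itself does not stratify; `X_demo`, `y_demo` and the other names here are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels for illustration only: 45 negatives, 55 positives
y_demo = pd.Series([0] * 45 + [1] * 55, name="label")
X_demo = pd.DataFrame({"feature": range(len(y_demo))})

# Class proportions without manual arithmetic
print(y_demo.value_counts(normalize=True))

# stratify=y_demo asks the splitter to preserve the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
train_ratio = y_tr.value_counts(normalize=True)
test_ratio = y_te.value_counts(normalize=True)
print(train_ratio, test_ratio, sep="\n")
```

With classes this close to balanced, stratification is optional, but it removes one source of variation between the training and test sets.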

# Feature Engineering

ColumnTransformer is particularly handy for the case of datasets that contain heterogeneous data types, since we may want to scale the numeric features and one-hot encode the categorical ones.

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

In [None]:
df.info()

In [None]:
df['FastingBS'].value_counts()

In [None]:
col_numeric = ['Age','RestingBP','Cholesterol','MaxHR','Oldpeak']
col_categoric = ['Sex','ChestPainType','FastingBS','RestingECG','ExerciseAngina','ST_Slope']

In [None]:
scaler = StandardScaler()
encoder = OneHotEncoder(drop='first')

In [None]:
preprocessor = ColumnTransformer([
    ('num',scaler,col_numeric),
    ('cat',encoder,col_categoric)
])

# Logistic Regression Model

**Create a base model of Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
base_model = LogisticRegression(max_iter=1000)

**Create a Pipeline**

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
pipe = Pipeline([('preprocessor',preprocessor),('base_model',base_model)])

Perform a grid-search with the pipeline to test various parameters and report back the best performing parameters.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
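# Note: LogisticRegression does not support every penalty/solver pair
# ('l1' needs 'liblinear' or 'saga'; 'elasticnet' needs 'saga' plus an l1_ratio).
# With GridSearchCV's default error_score=np.nan, unsupported pairs fail to fit and score NaN.
parameters = {'base_model__C': [0.001,0.01,0.1,1,10],
              'base_model__penalty': ['l1','l2','elasticnet'],
              'base_model__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
             }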

In [None]:
grid = GridSearchCV(estimator=pipe,param_grid=parameters,scoring='accuracy',cv=5)

In [None]:
grid.fit(X_train,y_train)
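The grid above crosses every penalty with every solver, but `LogisticRegression` supports only some of those pairs, so part of the search is spent on fits that fail. One possible alternative, sketched here on synthetic data rather than the heart dataset, is to pass `GridSearchCV` a list of dictionaries that contains only supported combinations. The names `demo_pipe`, `valid_grid` and `demo_grid`, the C values, and the `l1_ratio` of 0.5 are assumptions made for illustration, and the sketch assumes a scikit-learn version in which `penalty` is still an accepted argument.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the heart dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("base_model", LogisticRegression(max_iter=1000)),
])

c_values = [0.01, 0.1, 1, 10]
valid_grid = [
    # These solvers support the l2 penalty only
    {"base_model__solver": ["newton-cg", "lbfgs", "sag"],
     "base_model__penalty": ["l2"],
     "base_model__C": c_values},
    # liblinear and saga support both l1 and l2
    {"base_model__solver": ["liblinear", "saga"],
     "base_model__penalty": ["l1", "l2"],
     "base_model__C": c_values},
    # elasticnet requires saga and an explicit l1_ratio
    {"base_model__solver": ["saga"],
     "base_model__penalty": ["elasticnet"],
     "base_model__l1_ratio": [0.5],
     "base_model__C": c_values},
]

demo_grid = GridSearchCV(demo_pipe, valid_grid, scoring="accuracy", cv=5)
demo_grid.fit(X_demo, y_demo)

# Every candidate was fitted, so no mean score should be NaN
n_candidates = len(demo_grid.cv_results_["params"])
n_nan = int(np.isnan(demo_grid.cv_results_["mean_test_score"]).sum())
print(n_candidates, n_nan)
```

Restricting the grid this way also keeps per-parameter averages of `mean_test_score` from being computed over different numbers of successful fits.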

Find the best estimator and its parameters.

In [None]:
grid.best_estimator_

In [None]:
grid.best_estimator_.get_params()

# Cross Validation Results

In [None]:
cv_results = pd.DataFrame(grid.cv_results_)

In [None]:
cv_results.info()

**Result: C Values**

In [None]:
# Select the score column before averaging; cv_results also holds non-numeric columns
cv_c = cv_results.groupby('param_base_model__C')['mean_test_score'].mean()
cv_c

**Result: Penalty**

In [None]:
cv_penalty = cv_results.groupby('param_base_model__penalty')['mean_test_score'].mean()
cv_penalty

**Result: Solver**

In [None]:
cv_solver = cv_results.groupby('param_base_model__solver')['mean_test_score'].mean()
cv_solver
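Averaging over one parameter at a time hides how the parameters interact. A complementary view, sketched below on synthetic data with a deliberately small grid, is to sort the results by `rank_test_score` and look at the best few candidates as whole configurations. The names `demo_grid`, `demo_results` and `top` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the heart dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1, 10]},
    scoring="accuracy",
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

demo_results = pd.DataFrame(demo_grid.cv_results_)

# Best candidates first, with the spread across folds next to the mean
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
top = demo_results.sort_values("rank_test_score")[columns].head(3)
print(top)
```

Looking at `std_test_score` alongside the mean helps judge whether the gap between the top candidates is larger than the fold-to-fold variation.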

# Final Model Evaluation

In [None]:
y_pred = grid.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix,ConfusionMatrixDisplay

In [None]:
confusion_matrix(y_test,y_pred)

In [None]:
print(classification_report(y_test,y_pred))

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(grid,X_test,y_test)

In [None]:
from sklearn.metrics import accuracy_score,f1_score

In [None]:
accuracy_score(y_test,y_pred)

In [None]:
f1_score(y_test,y_pred)
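Both scores can be derived from the four cells of the confusion matrix, which makes it easier to see what each one rewards. The sketch below uses made-up labels (not the notebook's predictions) and checks the manual formulas against scikit-learn; `y_true_demo` and `y_pred_demo` are hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical labels for illustration only
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Accuracy: share of all predictions that are correct
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)

# F1: harmonic mean of precision and recall, which ignores true negatives
f1_manual = 2 * tp / (2 * tp + fp + fn)

print(accuracy_manual, accuracy_score(y_true_demo, y_pred_demo))
print(f1_manual, f1_score(y_true_demo, y_pred_demo))
```

Because F1 leaves out true negatives, it can differ from accuracy when the two kinds of error are not equally common.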

# Conclusion

1. The best Logistic Regression parameters found by the grid search are C = 1, penalty = 'l2', and solver = 'newton-cg'.
1. The model performs quite well on the unseen data (X_test), with an accuracy of 84.78% and an F1 score of 86.79%.