![image.png](heart.png)

## **Heart Failure Prediction**
The task of this project is to analyze a dataset containing different characteristics of 918 patients in order to predict heart failure, using Python, machine learning, and data visualization tools. The pandas data visualization tools are used to show the correlation between the variables and to identify the factors that are most significant in heart failure. A machine learning model is then built to assess the likelihood of a possible heart disease event.
**Understanding the terms/characteristics in the dataset**
**ChestPainType** - TA: Typical Angina - substernal chest pain precipitated by physical exertion or emotional stress and relieved with rest or nitroglycerin; ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic
**RestingBP** - resting blood pressure in mm Hg (millimeters of mercury). A normal adult reading is below 120/80 mm Hg and above 90/60 mm Hg.
**Cholesterol** - total or serum cholesterol measured in mg/dL. Below 200 mg/dL is desirable/normal; 200-239 mg/dL is borderline high; 240 mg/dL and above is high.
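**RestingECG** - resting electrocardiogram results
**MaxHR** - maximum heart rate achieved
**Oldpeak** - ST depression measured on the electrocardiogram (numeric value)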
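
The cholesterol ranges quoted above can be turned into a small helper. This is only a sketch: the function name `cholesterol_category` is hypothetical and is not part of the notebook, and the cut-offs are simply the ones listed in the description of **Cholesterol**.

```python
def cholesterol_category(mg_dl: float) -> str:
    """Classify total cholesterol (mg/dL) using the cut-offs quoted above."""
    if mg_dl < 200:
        return "desirable"
    if mg_dl < 240:
        return "borderline high"
    return "high"


# Illustrative values only, not taken from the dataset
for value in (180, 215, 260):
    print(value, cholesterol_category(value))
```

A label like this could be useful for grouping patients in plots, but it is not required by the models used later.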

## Predictions: 
This analysis tries to find the best model for detecting whether a patient will develop heart disease. To come up with a solution, I will use the following models:
- LogisticRegression and
- RandomForestClassifier.

I expect logistic regression to be the better-suited model for this dataset, because I am trying to predict a binary outcome: whether heart failure is going to happen or not. In addition, the features/variables in the dataset are potentially highly correlated, and I expect logistic regression to perform well on such data.

The random forest fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. In this case, however, I predict that the random forest will overfit and that logistic regression will fit better, since the heart data I was provided with is strongly linked together and highly correlated.
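
One way to check the overfitting prediction is to compare training and testing accuracy for both models. The sketch below is an assumption-laden illustration, not the notebook's actual result: it uses synthetic data from `make_classification` (with some redundant, correlated features) as a stand-in for the heart data, and the feature count and hyperparameters are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (an assumption, not the real dataset):
# 918 rows, with 4 features that are linear combinations of the informative ones.
X, y = make_classification(
    n_samples=918, n_features=11, n_informative=5, n_redundant=4,
    random_state=21,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=21)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=21),
}

scores = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    train_acc = model.score(X_train_scaled, y_train)
    test_acc = model.score(X_test_scaled, y_test)
    scores[name] = (train_acc, test_acc)
    # A large gap between training and testing accuracy hints at overfitting
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}, gap={train_acc - test_acc:.3f}")
```

On the real data the same loop could be run on the scaled splits prepared in Step II; which model comes out ahead is an empirical question that this sketch does not answer.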

## Step I: Data Preprocessing

In [1]:
# Import our dependencies
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import seaborn as sns
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, LabelEncoder


In [None]:
#  Import and read the heart.csv.
heart = pd.read_csv("heart.csv")
heart

In [None]:
heart.dtypes

In [None]:
heart.shape

In [None]:
#Display statistical description of the features
heart.describe()

In [None]:
# Cast the binary columns to str so they are treated as categorical in the count plots below
heart['FastingBS'] = heart['FastingBS'].astype(str)
heart['HeartDisease'] = heart['HeartDisease'].astype(str)

In [None]:
# Defining plot design
def plot_design(title):
    # The title is passed in explicitly instead of being read from a global loop variable
    plt.xlabel('')
    plt.ylabel('')
    plt.yticks(fontsize=15, color='black')
    plt.xticks(fontsize=15, color='black')
    plt.box(False)
    plt.title(title, fontsize=24, color='black')
    plt.tight_layout(pad=5.0)

In [None]:
# Select categorical variables
categ = heart.select_dtypes(include=object).columns

# Visualize: create an empty figure; the subplots are added in the loop below
fig = plt.figure(figsize=(15, 15))
fig.patch.set_facecolor('#CAD5E0')
plt.rcParams['font.family'] = 'TeX Gyre Heros'

# Loop over the categorical columns, one subplot per column
for idx, col in enumerate(categ):
    plt.subplot(4, 2, idx + 1)
    sns.countplot(y=col, data=heart, order=heart[col].value_counts().index,
                  palette='Greens_r', edgecolor='black')
    plot_design(col)

# Set the figure-level title once, after all subplots are drawn
plt.suptitle('Categorical Variables', fontsize=40)

In [None]:
# Check the total missing values in each column. A NULL value is a field that was left blank when the record was created.
print("Total NULL values in each column")
print("*********************************")
print(heart.isnull().sum())

In [None]:
# Clean the dataset by removing all rows with a zero in the "Cholesterol" column, since a cholesterol level of 0 is not possible.
clean_df=heart[heart['Cholesterol'] !=0]
clean_df
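
The NULL check and the cholesterol filter above address two different problems: `isnull()` only counts truly empty fields, while a zero used as a placeholder is a regular number and goes unnoticed. The sketch below shows the difference on a few hypothetical rows (the values are made up; only the column names follow the dataset).

```python
import pandas as pd

# Hypothetical rows; only the column names follow the dataset described above
sample = pd.DataFrame({
    "RestingBP": [140, 0, 130, 120],
    "Cholesterol": [289, 0, 0, 214],
    "MaxHR": [172, 156, 98, 108],
})

# isnull() only counts NaN/None, so zeros used as placeholders are not reported
null_counts = sample.isnull().sum()

# Counting exact zeros per column exposes those placeholders
zero_counts = (sample == 0).sum()
print(zero_counts)

# Same kind of filter as in the cell above: keep rows with a non-zero cholesterol value
cleaned = sample[sample["Cholesterol"] != 0]
```

Running the zero count on `heart` before filtering would show how many rows the cleaning step removes and whether other columns need the same treatment.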

## Step II: Machine Learning model


In [None]:
# Convert categorical data to numeric and separate target feature for training data
# One-hot encode the categorical features so that they can be scaled later
X = pd.get_dummies(clean_df.drop('HeartDisease', axis=1))
# HeartDisease was cast to str for the count plots above; convert it back to int
y = clean_df['HeartDisease'].astype(int)
target_names = ["negative", "positive"]

In [None]:
# Prepare the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=21)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Data Visualization

In [None]:
#  Group by gender and get the number to plot
distr_gender = clean_df.groupby(["Sex"])
gender_df= pd.DataFrame(distr_gender.size())
gender_df

# Create the dataframe with total count of Female and Male
gender_df.columns = ["Total Count"]
gender_df

In [None]:
# Gender breakdown
colors = ["yellowgreen", "lightcoral"]
# Define how far each pie slice is pulled out (exploded) from the center
explode = (0.1, 0)
gender_df.plot.pie(y='Total Count', colors = colors, startangle=120, explode = explode, shadow = True, autopct="%1.1f%%")
plt.title("Distribution of Female vs. Male")
plt.axis("equal")
plt.show()
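
The introduction mentions showing the correlation between the variables. A minimal sketch of how that could be done with pandas is given below. The rows are hypothetical and only the column names follow the dataset; in this notebook `HeartDisease` was cast to str earlier, so it would need `.astype(int)` before being included.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the dataset described above
sample = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 39],
    "MaxHR": [172, 156, 98, 108, 122, 170],
    "Oldpeak": [0.0, 1.0, 0.0, 1.5, 0.0, 0.0],
    "Sex": ["M", "F", "M", "F", "M", "M"],
    "HeartDisease": [0, 1, 0, 1, 0, 0],
})

# Keep numeric columns only; Pearson correlation is not defined for strings
numeric = sample.select_dtypes(include="number")
corr = numeric.corr()

# Rank features by the absolute strength of their correlation with the target
ranking = corr["HeartDisease"].drop("HeartDisease").abs().sort_values(ascending=False)
print(ranking)
```

The resulting matrix can be passed to `sns.heatmap(corr, annot=True)` for a visual overview. Correlation only captures linear relationships, so a low value does not by itself rule a feature out.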