Attribute Information

    Age: age of the patient [years]
    
    Sex: sex of the patient [M: Male, F: Female]
    
    ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
    
    RestingBP: resting blood pressure [mm Hg]
    
    Cholesterol: serum cholesterol [mm/dl]
    
    FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
    
    RestingECG:resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST      elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by  Estes' riteria]
    
    MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
    
    ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
    
    Oldpeak: oldpeak = ST [Numeric value measured in depression]
    
    ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
    
    HeartDisease: output class [1: heart disease, 0: Normal]


In [2]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [7]:
df = pd.read_csv('../input/heart-failure-prediction/heart.csv')

**Exploratory Data Analysis**

In [8]:
df.head(10)

In [9]:
df.tail(10)

In [10]:
df.info()

In [12]:
# Age
df['Age'].describe()

In [14]:
# Show age distribution
plt.style.use('bmh')
plt.figure(figsize=(15, 10))
plt.hist(df['Age'], bins=40, color='yellowgreen' , edgecolor="#6A9662")
plt.gca().set(title='Age Distribution', xlabel='Age', ylabel='Frequency')
plt.show()

In [15]:
# Sex
df['Sex'].value_counts()

In [17]:
# Encode Sex column 
from sklearn.preprocessing import LabelEncoder
LabelEncoderModel = LabelEncoder()
df['Sex'] = LabelEncoderModel.fit_transform(df['Sex'])

In [18]:
df['Sex'].value_counts()

In [19]:
# ChestPainType
ChestPainType = df['ChestPainType'].value_counts()
ChestPainType

In [22]:
plt.figure(figsize=(15, 10))
plt.pie(ChestPainType, labels=['ASY', 'NAP', 'ATA', 'TA'],
                        autopct='%1.1f%%', shadow=False,
                        explode=[0.0, 0.0, 0.1, 0.1])
plt.title('Chest Pain Types')
plt.legend()
plt.show()

In [24]:
# Encode ChestPainType column
df['ChestPainType'] = LabelEncoderModel.fit_transform(df['ChestPainType'])

In [26]:
df['ChestPainType'].value_counts()

In [28]:
# RestingBP
df['RestingBP'].describe()

In [29]:
# Show Blood Pressure At Different Ages
plt.figure(figsize=(15, 10))
plt.scatter(df['RestingBP'], df['Age'], color='yellowgreen', edgecolor="#6A9662")
plt.title('Blood Pressure At Different Ages')
plt.xlabel('Blood Pressure')
plt.ylabel('Age')
plt.show()

In [30]:
# Observations which have 0
df[df['RestingBP'] == 0]

In [31]:
# drop index number 449
df.drop(index=449, inplace=True)

In [32]:
# Show Blood Pressure At Different Ages
plt.figure(figsize=(15, 10))
plt.scatter(df['RestingBP'], df['Age'], color='yellowgreen', edgecolor="#6A9662")
plt.title('Blood Pressure At Different Ages')
plt.xlabel('Blood Pressure')
plt.ylabel('Age')
plt.show()

In [33]:
# Cholesterol
# Cholesterol column is an object type, so we will convert it to int64 
df['Cholesterol'] = df['Cholesterol'].astype('int64')

In [34]:
df['Cholesterol'].describe()

In [35]:
# Show Serum Cholesterol At Different Ages
plt.figure(figsize=(15, 10))
plt.scatter(df['Cholesterol'], df['Age'], color='yellowgreen', edgecolor="#6A9662")
plt.title('Cholesterol At Different Ages')
plt.xlabel('Cholesterol')
plt.ylabel('Age')
plt.show()

In [36]:
# Observations which have 0
df[df['Cholesterol'] == 0]['Cholesterol'].count()

In [37]:
# Drop observation which have 0
indexes = []
c = 0
for i in df['Cholesterol']:
    c = c+1
    if i == 0:
        indexes.append(c)

df.drop(labels=indexes, inplace=True)

In [39]:
df[df['Cholesterol'] == 0]['Cholesterol']

In [40]:
list = [293, 421, 423, 427, 434, 446]
df.drop(labels=list, inplace=True)

In [41]:
# Show Serum Cholesterol At Different Ages
plt.figure(figsize=(15, 10))
plt.scatter(df['Cholesterol'], df['Age'], color='yellowgreen', edgecolor="#6A9662")
plt.title('Cholesterol At Different Ages')
plt.xlabel('Cholesterol')
plt.ylabel('Age')
plt.show()

Sometimes the cholesterol value can reach 600, but in this case, you need immediate treatment.

In [43]:
# Observation which have cholestrol more value than 400
df[df['Cholesterol'] >= 400]

In [44]:
# FastingBS
df['FastingBS'].value_counts()

In [45]:
# RestingECG
RestingECG = df['RestingECG'].value_counts()
RestingECG

In [47]:
# Show resting electrocardiogram
plt.figure(figsize=(15, 10))
plt.pie(RestingECG, labels=['Normal', 'LVH', 'ST'],
                    autopct='%1.1f%%', shadow=False,
                    explode=[0.0, 0.1, 0.1])
plt.title('Resting Electrocardiogram')
plt.legend()
plt.show()

In [48]:
# Encode RestingECG column
df['RestingECG'] = LabelEncoderModel.fit_transform(df['RestingECG'])

In [49]:
df['RestingECG'].value_counts()

In [50]:
# MaxHR
df['MaxHR'].describe()

In [51]:
# Show Maximum Heart Rate At Different Ages
plt.figure(figsize=(15, 10))
plt.scatter(df['MaxHR'], df['Age'], color='yellowgreen', edgecolor="#6A9662")
plt.title('Maximum Heart Rate At Different Ages')
plt.xlabel('Maximum Heart Rate')
plt.ylabel('Age')
plt.show()

In [52]:
# ExerciseAngina
df['ExerciseAngina'].value_counts()

In [53]:
# Encode ExerciseAngina column
df['ExerciseAngina'] = LabelEncoderModel.fit_transform(df['ExerciseAngina'])

In [54]:
df['ExerciseAngina'].value_counts()

In [55]:
# Oldpeak
df['Oldpeak'].describe()

In [56]:
# ST_Slope
df['ST_Slope'].value_counts()

In [57]:
# Encode ST_Slope column
df['ST_Slope'] = LabelEncoderModel.fit_transform(df['ST_Slope'])

In [58]:
df['ST_Slope'].value_counts()

# Machine Learning Model

In [59]:
# Divide data into dependent and independent
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [62]:
X.shape

In [63]:
y.shape

In [64]:
# Scaling Model
from sklearn.preprocessing import StandardScaler
StandardScalerModel = StandardScaler()
X = StandardScalerModel.fit_transform(X)

In [65]:
# Splitting Model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, shuffle=True)

In [66]:
# Logistic Regression Model
from sklearn.linear_model import LogisticRegression
LogisticRegressionModel = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=100)
LogisticRegressionModel.fit(X_train, y_train)

In [69]:
# Score of model
print('Score of train data = ', round(LogisticRegressionModel.score(X_train, y_train), 4))
print('Score of test data = ', round(LogisticRegressionModel.score(X_test, y_test), 4))

In [70]:
# SVC Model
from sklearn.svm import SVC
SVCModel = SVC(C=1, kernel='rbf')
SVCModel.fit(X_train, y_train)

In [71]:
# Score of model
print('Score of train data = ', round(SVCModel.score(X_train, y_train), 4))
print('Score of test data = ', round(SVCModel.score(X_test, y_test), 4))

In [72]:
# Decision Tree Classifier Model
from sklearn.tree import DecisionTreeClassifier
DecisionTreeClassifierModel = DecisionTreeClassifier()
DecisionTreeClassifierModel.fit(X_train, y_train)

In [73]:
# Score of model
print('Score of train data = ', round(DecisionTreeClassifierModel.score(X_train, y_train), 4))
print('Score of test data = ', round(DecisionTreeClassifierModel.score(X_test, y_test), 4))

**Oooh it has an overfit problem..**

In [74]:
# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
RandomForestClassifierModel = RandomForestClassifier(n_estimators=10, criterion='entropy')
RandomForestClassifierModel.fit(X_train, y_train)

In [75]:
# Score of model
print('Score of train data = ', round(RandomForestClassifierModel.score(X_train, y_train), 4))
print('Score of test data = ', round(RandomForestClassifierModel.score(X_test, y_test), 4))

**It also has an overfit problem..**

In [76]:
# K-Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier
KNeighborsClassifierModel = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
KNeighborsClassifierModel.fit(X_train, y_train)

In [77]:
# Score of model
print('Score of train data = ', round(KNeighborsClassifierModel.score(X_train, y_train), 4))
print('Score of test data = ', round(KNeighborsClassifierModel.score(X_test, y_test), 4))

**It seems that K-Neighbors Classifier is the best one.
Number of observation is a little bit, so we can't have more than this accurecy..**

In [78]:
y_pred = KNeighborsClassifierModel.predict(X_test)

In [79]:
# Confusion Metric
from sklearn.metrics import confusion_matrix
ConfusionMetricModel = confusion_matrix(y_test, y_pred)

In [80]:
ConfusionMetricModel

there're 17 prediction will be wrong..

# THANK YOU..