## About Dataset


### Context:
<p>
This is the Cleveland Heart Disease dataset from the UCI repository. It contains data on 303 individuals in 14 columns (extracted from a larger set of 75). In the raw file loaded below, a few missing values are marked with "?". The classification task is to predict whether an individual has heart disease (0: absence, 1: presence).

original data: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
</p>

In [None]:
# IMPORTING THE BASIC NECESSARY MODULES FOR THE PROJECT

import pandas as pd  # to load and manipulate the data
import numpy as np   # for numerical operations
import matplotlib.pyplot as plt    # to draw graphs

In [None]:
# LOADING THE DATASET USING PANDAS

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
                header=None)


In [None]:
# PRINTING THE FIRST 5 ROWS
df.head()

The file has no header row, so pandas labels the columns with numbers. Named columns are easier to work with, so we replace the numbers with the following names:

- **age**
- **sex**
- **cp** (chest pain)
- **restbp** (resting blood pressure in mm Hg)
- **chol** (cholesterol in mg/dl)
- **fbs** (fasting blood sugar)
- **restecg** (resting electrocardiographic results)
- **thalach** (maximum heart rate achieved)
- **exang** (exercise induced angina)
- **oldpeak** (ST depression)
- **slope** (slope of the peak exercise ST segment)
- **ca** (number of major vessels, 0-3, colored by fluoroscopy)
- **thal** (short for thallium heart scan)
- **hd** (diagnosis of heart disease)

In [None]:
# CHANGING COLUMN NUMBERS TO COLUMN NAMES

df.columns = ['age', 'sex', 'cp', 'restbp', 
             'chol', 'fbs', 'restecg', 'thalach', 'exang',
             'oldpeak', 'slope', 'ca', 'thal', 'hd']

In [None]:
df.head()

In [None]:
# IDENTIFYING THE MISSING DATA

<p>
There are two main ways to deal with missing data:

1. We can remove the rows that contain missing data from the dataset. This is relatively easy to do, but it wastes all of the other values that we
collected. How big a waste this is depends on how important the missing value is for classification. For example, if we are missing a value for
age, and age is not useful for classifying whether people have heart disease, then it would be a shame to throw out all of someone's data just
because we do not have their age.
2. We can impute the values that are missing. In this context, impute is just a fancy way of saying "we can make an educated guess about what
the value should be". Continuing our example where we are missing a value for age, instead of throwing out the entire row of data, we can fill the
missing value with the average age or the median age, or use some other, more sophisticated approach, to guess at an appropriate value.
</p>
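The two options above can be sketched on a tiny toy frame. The values below are made up for illustration (they are not rows from the Cleveland data), and median imputation is just one of several possible choices.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (not rows from the Cleveland data)
toy = pd.DataFrame({
    "age": [63.0, np.nan, 41.0, 56.0],
    "chol": [233.0, 286.0, 204.0, np.nan],
    "hd": [0, 1, 0, 1],
})

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill each missing value with the median of its column
imputed = toy.fillna(toy.median(numeric_only=True))

print(len(toy), len(dropped), len(imputed))
print(imputed)
```

Dropping keeps only the complete rows, while imputing keeps every row at the cost of introducing guessed values.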

In [None]:
# CHECK THE DATA TYPE OF EACH COLUMN

df.dtypes

In [None]:
df['ca'].unique()

In [None]:
df['thal'].unique()

In [None]:
# '?' may refer to the missing data
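As an alternative to comparing against the string "?", pandas can be asked to treat that marker as missing while reading the file. The sketch below assumes a small in-memory sample with made-up rows and only four of the columns; `na_values` is a standard `pd.read_csv` parameter.

```python
import io
import pandas as pd

# Made-up rows in the same comma-separated layout (for illustration only)
raw = io.StringIO(
    "63.0,0.0,6.0,0\n"
    "67.0,?,3.0,2\n"
    "52.0,1.0,?,0\n"
)

# na_values="?" turns every "?" cell into NaN, so the columns stay numeric
toy = pd.read_csv(raw, header=None,
                  names=["age", "ca", "thal", "hd"],
                  na_values="?")

print(toy.dtypes)
print(toy.isna().sum())   # number of missing values per column
print(len(toy.dropna()))  # rows left after dropping incomplete ones
```

With this approach `ca` and `thal` are read as floats rather than strings, and `dropna()` or `isna()` can be used instead of string comparisons.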

In [None]:
# DEALING WITH MISSING DATA

len(df.loc[(df['ca'] == "?") | (df['thal'] == "?")])

In [None]:
# ROWS HAVING MISSING DATA

df.loc[(df['ca'] == "?") | (df['thal'] == "?")]

In [None]:
len(df)  # TOTAL ROWS IN A DATASET

In [None]:
# As only 6 rows have missing data, they can be dropped from the dataset

In [None]:
df_no_missing = df.loc[(df['ca'] != "?") & (df['thal'] != "?")].copy()  # copy so later edits do not touch df

In [None]:
df_no_missing

In [None]:
df_no_missing['ca'].unique()

In [None]:
df_no_missing['thal'].unique()

In [None]:
# FORMATTING THE DATA: SPLIT THE DATA INTO DEPENDENT AND INDEPENDENT VARIABLES

In [None]:
X = df_no_missing.drop("hd", axis=1).copy()
X['ca'] = X['ca'].astype(float)  # 'ca' was read as strings because of the '?' markers
y = df_no_missing["hd"].copy()

In [None]:
# FORMATTING THE DATA: ONE-HOT ENCODING

X.dtypes

In [None]:
X['cp'].unique()

In [None]:
pd.get_dummies(X, columns=['cp'], dtype=int).head()

In [None]:
X_encoded = pd.get_dummies(X, columns=['cp', 'restecg',
                                      'slope', 'thal'], 
                          dtype=int)

In [None]:
X_encoded.head()


In [None]:
y.unique()

In [None]:
# Collapse the severity levels 1-4 into a single "has heart disease" class (1)
y = (y > 0).astype(int)

y.unique()

## BUILDING THE CLASSIFICATION TREE

In [None]:
# SPLITTING THE DATA INTO TRAINING AND TESTING SETS

from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, random_state=42)

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
clf_dt = DecisionTreeClassifier(random_state=42)
clf_dt = clf_dt.fit(X_train, y_train)

In [None]:
# PLOTTING THE TREE

from sklearn.tree import plot_tree

plt.figure(figsize=(25, 14), dpi=300)
plot_tree(clf_dt,
         filled=True,
         rounded=True,
         class_names=["No HD", "Yes HD"],
         feature_names=X_encoded.columns.tolist());

In [None]:
# COST COMPLEXITY PRUNING

In [None]:
path = clf_dt.cost_complexity_pruning_path(X_train, y_train) # determine values for alpha
ccp_alphas = path.ccp_alphas # extract different values for alpha
ccp_alphas = ccp_alphas[:-1] # exclude the maximum value for alpha (it prunes the tree down to the root)

clf_dts = [] # create an array that we will put decision trees into

## now create one decision tree per value for alpha and store it in the array
for ccp_alpha in ccp_alphas:
    clf_dt = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf_dt.fit(X_train, y_train)
    clf_dts.append(clf_dt)

In [None]:
train_scores = [clf_dt.score(X_train, y_train) for clf_dt in clf_dts]
test_scores = [clf_dt.score(X_test, y_test) for clf_dt in clf_dts]

fig, ax = plt.subplots()

ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test", drawstyle="steps-post")
ax.legend()
plt.show()


In [None]:
# CROSS VALIDATION
from sklearn.model_selection import cross_val_score


clf_dt = DecisionTreeClassifier(random_state=42, ccp_alpha=0.016)
scores = cross_val_score(clf_dt, X_train, y_train, cv=5)
# Use a new name so the original dataset in df is not overwritten
cv_scores_df = pd.DataFrame(data={'tree': range(5), 'accuracy': scores})
cv_scores_df.plot(x="tree", y="accuracy", marker='o', linestyle='--')

In [None]:
# Create a list to store the mean and standard deviation of the CV scores for each alpha

alpha_loop_values = []

for ccp_alpha in ccp_alphas:
    clf_dt = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    scores = cross_val_score(clf_dt, X_train, y_train, cv=5)
    alpha_loop_values.append([ccp_alpha, np.mean(scores), np.std(scores)])

## Now we can draw a graph of the means and standard deviations of the scores
## for each candidate value for alpha
alpha_results = pd.DataFrame(alpha_loop_values,columns=['alpha', 'mean_accuracy', 'std'])

alpha_results.plot(x='alpha',
                y='mean_accuracy',
                yerr='std',
                marker='o',
                linestyle='--')

In [None]:
alpha_results[(alpha_results['alpha'] > 0.014)
              & (alpha_results['alpha'] < 0.015)]

In [None]:
ideal_ccp_alpha = alpha_results[(alpha_results['alpha'] > 0.014)
                                & (alpha_results['alpha'] < 0.015)]['alpha']

ideal_ccp_alpha

In [None]:
ideal_ccp_alpha = float(ideal_ccp_alpha.iloc[0])  # take the single matching value out of the Series
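The 0.014 to 0.015 window above was read off the plot by eye. As a hedged alternative, the alpha with the highest mean cross-validated accuracy can be picked programmatically. The sketch below uses synthetic stand-in data from `make_classification` (an assumption, so the alpha it prints will not match the one for the heart disease data), and the names `X_demo`, `results`, and `best_alpha` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the heart disease data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Candidate alphas from the pruning path, dropping the largest (root-only tree)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
# Clip to guard against tiny negative values caused by floating-point rounding
alphas = np.clip(path.ccp_alphas[:-1], 0, None)

rows = []
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    scores = cross_val_score(tree, X_tr, y_tr, cv=5)
    rows.append({"alpha": alpha,
                 "mean_accuracy": scores.mean(),
                 "std": scores.std()})

results = pd.DataFrame(rows)

# Pick the alpha with the highest mean cross-validated accuracy
best_alpha = float(results.loc[results["mean_accuracy"].idxmax(), "alpha"])
print(best_alpha)
```

Selecting by `idxmax` removes the need to hard-code a range, although ties and noisy scores may still make a visual check of the plot worthwhile.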

In [None]:
# BUILDING THE PRUNED CLASSIFICATION TREE

clf_dt_pruned = DecisionTreeClassifier(random_state=42,
                                       ccp_alpha=ideal_ccp_alpha)

clf_dt_pruned = clf_dt_pruned.fit(X_train, y_train)

In [None]:
plt.figure(figsize=(20, 11), dpi = 250)
plot_tree(clf_dt_pruned,
         filled=True,
         rounded=True,
         class_names=["No HD", "Yes HD"],
         feature_names=X_encoded.columns.tolist());
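One possible next step is to check how a pruned tree performs on the held-out test set. The sketch below is a minimal illustration on synthetic stand-in data (an assumption, not the heart disease data), and `ccp_alpha=0.01` is an arbitrary value chosen for the example rather than a tuned one.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the heart disease data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# ccp_alpha=0.01 is an arbitrary illustrative value, not a tuned one
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=0.01).fit(X_tr, y_tr)
y_pred = pruned.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
acc = accuracy_score(y_te, y_pred)

print(cm)
print(acc)
```

In the notebook itself, the same two calls could be applied to `clf_dt_pruned` with `X_test` and `y_test`.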