## **ST10083706**
## **PDAN8411 Part 2**

#### **Reference to dataset:**

Kaggle. 2024. Heart Disease Dataset. [Online] Available at: https://www.kaggle.com/datasets/mexwell/heart-disease-dataset [Accessed: 20 May 2024]

#### **Dataset description:**

According to Kaggle (2024), this heart disease dataset was curated by combining five popular heart disease datasets that were previously available only independently. The datasets were brought together in a single place to support research on machine learning and data mining algorithms related to coronary artery disease (CAD), and to help advance clinical diagnosis and early treatment (Kaggle, 2024).
This dataset consists of 1190 instances with 11 features and one target column.

#### **Classification:**

Classification is the act of sorting objects into sub-categories based on their features or attributes (GeeksforGeeks, 2024). Classification falls under supervised machine learning, where the model attempts to predict the correct label for the input data. The model is trained on training data and then evaluated on test data before being used to make predictions on new, unseen data. The main goal of classification is to build a model that can allocate labels or categories to new observations based on their features (GeeksforGeeks, 2024).
For example, the dataset used here contains labeled features of factors relating to CAD, and a classification model can use this data to support the diagnosis of CAD.

**Classification Types**

Binary Classification: 
The target variable has only two possible classes or labels (e.g., spam vs. non-spam email). The objective is to classify the data into one of the two classes (Datacamp, 2022).

Multi-class Classification: 
The target variable has more than two classes (e.g., classifying images of handwritten digits into one of 10 classes). The objective is to classify the data into one of the multiple classes (Datacamp, 2022).

**Characteristics**

- Input Data: Classification algorithms take input data where each feature describes a particular aspect of the data (Datacamp, 2022). 

- Target Labels: Each data point in the training set is associated with a target label or class (Datacamp, 2022). 

- Training Data: A labeled dataset is used to train the classification model. It consists of input data points along with their corresponding target labels. The model learns patterns and relationships in the training data to make predictions on unseen data (Datacamp, 2022).

- Model Building: Classification algorithms build a model that can predict the class label of new data based on the patterns the model learned from the training data (Datacamp, 2022). 

- Evaluation: After the model is trained, it is evaluated on a separate dataset, the test set. The performance of the model is assessed using metrics such as accuracy, precision, recall, and F1 score (Datacamp, 2022).

- Prediction: After evaluation, the trained model can be used to make predictions on new and unseen data (Datacamp, 2022). 
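
The characteristics above can be sketched as one short workflow. This is a minimal illustration on synthetic data generated with scikit-learn's `make_classification`, not on the heart disease dataset; the sample size, the number of features, and the choice of logistic regression are assumptions made only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Input data and target labels (assumption: synthetic stand-in for real labeled data)
X, y = make_classification(n_samples=500, n_features=11, random_state=0)

# Training data and a held-out test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Model building: learn patterns from the labeled training set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluation: score the model on data it has not seen during training
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.3f}")

# Prediction: assign a label to a single new observation
new_observation = X_test[:1]
predicted_label = model.predict(new_observation)
print("Predicted label:", predicted_label[0])
```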

#### **Why the chosen dataset is appropriate for analysis with classification:**

Dataset Overview:

The dataset provided includes labeled features related to CAD (Kaggle, 2024).

The features are:

- Age
- Sex
- Chest pain type
- Resting blood pressure
- Cholesterol level
- Fasting blood sugar
- Resting electrocardiographic results
- Maximum heart rate achieved
- Exercise-induced angina
- ST depression induced by exercise relative to rest (oldpeak)
- Slope of the peak exercise ST segment
- Target (the presence or absence of CAD)

Each data point represents a patient and includes both the features (the input data) and the target label (the diagnosis of CAD).

Input Data: 
The dataset includes various features that describe each patient's health parameters. These features serve as the input data for the classification algorithm. According to Datacamp (2022), classification algorithms require input data where each feature describes a particular aspect of the data, making this dataset suitable for such analysis.

Target Labels: 
The target variable in the dataset is the diagnosis of CAD (presence or absence), which makes this a binary classification problem. This aligns with the definition of binary classification, where the target variable has only two possible classes (GeeksforGeeks, 2024; Datacamp, 2022).

Training Data: 
The dataset is labeled, meaning each patient record includes a target label indicating the presence or absence of CAD. This labeled dataset can be used to train the classification model, allowing the model to learn patterns and relationships in the data to make predictions on unseen data (Datacamp, 2022).

Model Building: 
Classification algorithms can use the features provided to build a model that predicts the CAD diagnosis based on the patterns learned from the training data. The diverse and relevant features in the dataset (age, sex, cholesterol levels, ...) are suitable for creating a predictive model (Datacamp, 2022).

Evaluation: 
The dataset can be split into training and test sets. The model can be trained on the training set and then evaluated on the test set using metrics such as accuracy, precision, recall, and F1 score. This evaluation process is critical to assess the model's performance and its ability to generalize to new data (Datacamp, 2022).

Prediction: 
Once evaluated and fine-tuned, the trained model can be used to make predictions on new and unseen patient data. This application aligns with the goal of classification, which is to allocate labels or categories to new observations based on their features (GeeksforGeeks, 2024; Datacamp, 2022).

The features included in the dataset are medically relevant and commonly used indicators of heart health and CAD risk. Features such as age, cholesterol levels, blood pressure, and exercise-induced angina are critical in diagnosing CAD (Mayo Clinic Staff, 2022).

Binary Classification Task: 
The diagnosis of CAD (presence or absence) fits into a binary classification framework. This matches the classification types discussed above, making the dataset suitable for algorithms designed for binary classification tasks (Datacamp, 2022).

Sufficient Data for Training: 
The dataset includes 1190 entries, which is a reasonable amount of data for training and evaluating a classification model on 11 features. A dataset of this size helps reduce the risk of overfitting, although it does not remove it, so the model's ability to generalise should still be checked on held-out data.

Balanced Class Distribution: 
The dataset description does not state the class distribution explicitly, but the summary statistics later in this notebook show a target mean of about 0.53, meaning roughly 53% of the records are positive for CAD. This is close to balanced, which is beneficial for training. If a meaningful imbalance were present, techniques such as oversampling, undersampling, or imbalance-aware evaluation metrics could help address it.
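
A quick way to check the class distribution, and to keep it similar across the training and test sets, is sketched below. The `labels` values are hypothetical and chosen only for illustration; with the real data the same calls would be applied to the `target` column.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: 1 = CAD present, 0 = CAD absent (illustrative values only)
labels = pd.Series(
    [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1],
    name="target",
)

# Share of each class in the full set
class_share = labels.value_counts(normalize=True)
print(class_share)

# A stratified split keeps the class ratio similar in both parts
train_labels, test_labels = train_test_split(
    labels, test_size=0.25, random_state=0, stratify=labels
)
print("Positive share (train):", train_labels.mean())
print("Positive share (test):", test_labels.mean())
```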

Real-world Applicability: 
The use of real patient data with medically relevant features means that a model built from this dataset could be applied in clinical settings to aid in the diagnosis of CAD, potentially improving diagnostic accuracy and patient outcomes.

#### **Analysis:**

Purpose of the Analysis:

The primary purpose of the analysis is to develop a predictive model for the diagnosis of Coronary Artery Disease (CAD) using machine learning classification techniques. This analysis aims to leverage patient data to accurately categorize individuals into those who are at risk of CAD and those who are not. 

Data Preparation: 
Clean and pre-process the dataset to handle any missing values (CF Blog, 2022). 
Normalise numerical features and encode categorical variables.
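
The preparation steps above could be sketched with a scikit-learn `ColumnTransformer`. The small `sample` frame is hypothetical and only reuses column names from the dataset. Treating a cholesterol value of 0 as a missing measurement is an assumption that would need to be checked against the data source (the summary statistics in this notebook show a minimum of 0 for cholesterol, while `df.info()` reports no nulls). Median imputation and one-hot encoding are illustrative choices, not the only options.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical sample using column names from the dataset (values are illustrative)
sample = pd.DataFrame({
    "age": [40, 49, 37, 48, 54],
    "cholesterol": [289, 180, 0, 214, 195],
    "chest pain type": [2, 3, 2, 4, 3],
})

# Assumption: a cholesterol reading of 0 is a placeholder for a missing value
sample["cholesterol"] = sample["cholesterol"].replace(0, np.nan)

numeric_cols = ["age", "cholesterol"]
categorical_cols = ["chest pain type"]

# Numeric columns: fill missing values with the median, then standardise
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: one-hot encode; sparse_threshold=0 forces a dense array
preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,
)

prepared = preprocessor.fit_transform(sample)
print(prepared.shape)  # 2 scaled numeric columns + 3 one-hot columns
```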

Exploratory Data Analysis (EDA): 
Conduct EDA to understand the distribution of features, identify potential correlations, and visualise the data to gain insights into patterns that might influence CAD diagnosis.
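
A first pass at EDA might look like the sketch below. The `sample` frame is hypothetical (a few illustrative rows using column names from the dataset); with the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical rows for illustration only
sample = pd.DataFrame({
    "age": [40, 49, 37, 48, 54, 63, 58, 45],
    "max heart rate": [172, 156, 98, 108, 122, 110, 130, 165],
    "target": [0, 1, 0, 1, 0, 1, 1, 0],
})

# Distribution of each feature
summary = sample.describe()
print(summary)

# Linear correlation of each feature with the target
correlations = sample.corr()["target"].drop("target")
print(correlations)

# Feature averages for patients with and without CAD
group_means = sample.groupby("target").mean()
print(group_means)
```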

Model Selection: 
Choose appropriate classification algorithms (e.g., logistic regression, decision trees, support vector machines, or neural networks) based on the characteristics of the dataset and the problem at hand.
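
One hedged way to compare candidate algorithms is to score each with the same cross-validation setup. The sketch below uses synthetic data from `make_classification` as a stand-in for the prepared dataset; the three candidates and their settings are assumptions for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and labels
X, y = make_classification(n_samples=300, n_features=11, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "support vector machine": SVC(),
}

# Score every candidate with the same 5-fold cross-validation
mean_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```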

Training and Validation: 
Split the dataset into training and validation sets. Train the model on the training set and validate its performance using the validation set. Use techniques such as cross-validation to ensure the model's robustness.
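
The split-then-cross-validate procedure can be sketched as follows. The data is synthetic, and the fold count, split ratio, and random forest settings are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the prepared features and labels
X, y = make_classification(n_samples=300, n_features=11, random_state=0)

# Hold out a validation set that is not touched during cross-validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 5-fold cross-validation on the training portion only
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for fit_idx, check_idx in folds.split(X_train, y_train):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_train[fit_idx], y_train[fit_idx])
    fold_scores.append(model.score(X_train[check_idx], y_train[check_idx]))

print(f"CV accuracy: {np.mean(fold_scores):.3f} (std {np.std(fold_scores):.3f})")

# Refit on the full training portion and check against the validation set
final_model = RandomForestClassifier(n_estimators=50, random_state=0)
final_model.fit(X_train, y_train)
validation_accuracy = final_model.score(X_val, y_val)
print(f"Validation accuracy: {validation_accuracy:.3f}")
```

A large gap between the cross-validation score and the validation score would suggest the model is not generalising consistently.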

Evaluation: 
Evaluate the model using metrics such as accuracy, precision, recall, F1 score, and ROC-AUC to assess its performance. These metrics provide a comprehensive view of the model's ability to correctly diagnose CAD.
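
The metrics named above can be computed with `sklearn.metrics`. The labels and scores below are a small hypothetical example, picked so the results can be checked by hand: 3 true positives, 3 true negatives, 1 false positive, and 1 false negative.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical true labels and predicted probabilities of CAD
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3]

# Convert probabilities to labels with a 0.5 threshold
y_pred = [1 if score >= 0.5 else 0 for score in y_score]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 6 / 8 = 0.75
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4 = 0.75
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4 = 0.75
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 0.75
roc_auc = roc_auc_score(y_true, y_score)     # 15 of 16 pairs ranked correctly = 0.9375

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
matrix = confusion_matrix(y_true, y_pred)

print(accuracy, precision, recall, f1, roc_auc)
print(matrix)
```

In a diagnostic setting, recall is often watched closely, because a false negative corresponds to a patient with CAD who is missed.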

Optimization: 
Fine-tune the model by adjusting hyperparameters, selecting the most relevant features, and implementing techniques to handle class imbalance if necessary.
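
Hyperparameter tuning could be sketched with `RandomizedSearchCV` and `randint`, both of which are imported in this notebook. The search ranges, the number of iterations, the F1 scoring, and `class_weight="balanced"` (one simple way to account for class imbalance) are assumptions for illustration, and the data is synthetic.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the prepared features and labels
X, y = make_classification(n_samples=200, n_features=11, random_state=0)

# Ranges to sample from; randint(low, high) excludes the upper bound
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": randint(2, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0, class_weight="balanced"),
    param_distributions=param_distributions,
    n_iter=5,         # number of sampled parameter combinations
    cv=3,             # 3-fold cross-validation per combination
    scoring="f1",
    random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```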

Prediction: 
Use the optimized model to make predictions on new, unseen data. Assess the model's predictive power in a real-world setting to ensure its practical applicability.
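
Predicting for a new patient might look like the sketch below. The training rows and the `new_patient` record are hypothetical, and only three of the dataset's columns are used to keep the example short.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training records (illustrative values only)
train = pd.DataFrame({
    "age": [40, 49, 37, 48, 54, 63, 58, 45],
    "max heart rate": [172, 156, 98, 108, 122, 110, 130, 165],
    "exercise angina": [0, 0, 0, 1, 0, 1, 1, 0],
    "target": [0, 1, 0, 1, 0, 1, 1, 0],
})
feature_cols = ["age", "max heart rate", "exercise angina"]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(train[feature_cols], train["target"])

# A new, unseen record with the same columns in the same order
new_patient = pd.DataFrame(
    [{"age": 52, "max heart rate": 120, "exercise angina": 1}]
)[feature_cols]

# Column 1 of predict_proba corresponds to class 1 (CAD present)
probability = model.predict_proba(new_patient)[0, 1]
label = model.predict(new_patient)[0]
print(f"Estimated probability of CAD: {probability:.2f}, predicted label: {label}")
```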



#### **Importing Packages**
For data processing: 


In [11]:
import pandas as pd
import numpy as np

For modelling:


In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from scipy.stats import randint

For tree visualisation:


In [13]:
from sklearn.tree import export_graphviz
from IPython.display import Image
import graphviz
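
As a rough sketch of how these imports are typically used, a single tree from a fitted random forest can be exported to Graphviz DOT text with `export_graphviz`. The data, the `feature_names`, and the class names below are placeholders, not taken from the dataset; rendering the result requires the Graphviz system package, so that step is left as a comment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Synthetic stand-in data with placeholder feature names
X, y = make_classification(n_samples=200, n_features=11, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# Export the first tree of the forest; out_file=None returns the DOT source as a string
first_tree = forest.estimators_[0]
dot_data = export_graphviz(
    first_tree,
    out_file=None,
    feature_names=feature_names,
    class_names=["no CAD", "CAD"],
    filled=True,
    max_depth=2,  # draw only the top levels to keep the figure readable
)
print(dot_data[:80])

# In a notebook with Graphviz installed, the tree could then be displayed with:
# import graphviz
# graphviz.Source(dot_data)
```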

Loading dataset:

In [14]:
df = pd.read_csv('heart_statlog_cleveland_hungary_final.csv')
print(df)

      age  sex  chest pain type  resting bp s  cholesterol  \
0      40    1                2           140          289   
1      49    0                3           160          180   
2      37    1                2           130          283   
3      48    0                4           138          214   
4      54    1                3           150          195   
...   ...  ...              ...           ...          ...   
1185   45    1                1           110          264   
1186   68    1                4           144          193   
1187   57    1                4           130          131   
1188   57    0                2           130          236   
1189   38    1                3           138          175   

      fasting blood sugar  resting ecg  max heart rate  exercise angina  \
0                       0            0             172                0   
1                       0            0             156                0   
2                       0     

In [16]:
df.describe()

| | age | sex | chest pain type | resting bp s | cholesterol | fasting blood sugar | resting ecg | max heart rate | exercise angina | oldpeak | ST slope | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1190.0 | 1190.0 | 1190.0 | 1190.0 | 1190.0 | 1190.0 | 1190.0 | 1190.0 | 1190.0 | 1190.0 | 1190.0 | 1190.0 |
| mean | 53.720168 | 0.763866 | 3.232773 | 132.153782 | 210.363866 | 0.213445 | 0.698319 | 139.732773 | 0.387395 | 0.922773 | 1.62437 | 0.528571 |
| std | 9.358203 | 0.424884 | 0.93548 | 18.368823 | 101.420489 | 0.409912 | 0.870359 | 25.517636 | 0.48736 | 1.086337 | 0.610459 | 0.499393 |
| min | 28.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 60.0 | 0.0 | -2.6 | 0.0 | 0.0 |
| 25% | 47.0 | 1.0 | 3.0 | 120.0 | 188.0 | 0.0 | 0.0 | 121.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 50% | 54.0 | 1.0 | 4.0 | 130.0 | 229.0 | 0.0 | 0.0 | 140.5 | 0.0 | 0.6 | 2.0 | 1.0 |
| 75% | 60.0 | 1.0 | 4.0 | 140.0 | 269.75 | 0.0 | 2.0 | 160.0 | 1.0 | 1.6 | 2.0 | 1.0 |
| max | 77.0 | 1.0 | 4.0 | 200.0 | 603.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 3.0 | 1.0 |


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1190 entries, 0 to 1189
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  1190 non-null   int64  
 1   sex                  1190 non-null   int64  
 2   chest pain type      1190 non-null   int64  
 3   resting bp s         1190 non-null   int64  
 4   cholesterol          1190 non-null   int64  
 5   fasting blood sugar  1190 non-null   int64  
 6   resting ecg          1190 non-null   int64  
 7   max heart rate       1190 non-null   int64  
 8   exercise angina      1190 non-null   int64  
 9   oldpeak              1190 non-null   float64
 10  ST slope             1190 non-null   int64  
 11  target               1190 non-null   int64  
dtypes: float64(1), int64(11)
memory usage: 111.7 KB


#### **References:**

CF Blog. 2022. A Step-by-Step Guide to the Data Analysis Process. [Online] Available at: https://careerfoundry.com/en/blog/data-analytics/the-data-analysis-process-step-by-step/ [Accessed: 20 May 2024]

Datacamp. 2022. Classification in Machine Learning: A Guide for Beginners. [Online] Available at: https://www.datacamp.com/blog/classification-machine-learning [Accessed: 20 May 2024]

GeeksforGeeks. 2024. Getting started with Classification - GeeksforGeeks. [Online] Available at: https://www.geeksforgeeks.org/getting-started-with-classification/ [Accessed: 20 May 2024]

Kaggle. 2024. Heart Disease Dataset. [Online] Available at: https://www.kaggle.com/datasets/mexwell/heart-disease-dataset [Accessed: 20 May 2024]

Mayo Clinic Staff. 2022. Coronary artery disease - Symptoms and causes. [Online] Available at: https://www.mayoclinic.org/diseases-conditions/coronary-artery-disease/symptoms-causes/syc-20350613#:~:text=Coronary%20artery%20disease%2C%20also%20called [Accessed: 20 May 2024]