# SC1015 Mini-Project

Group: FCEE

Lee Heng Sheng Brandon, U2322900C \
Alan Lee Leman, \
Zi Hao



**TO-DO:**

1. Remember to add a README file (3-5 minute summary, with all consulted references).
2. Add citations/references for the introduction.
3. Weekly consultation with TA.
4. Determine what techniques we will use in our project (logistic regression, decision tree/random forest, GridSearchCV).
5. Submit our presentation video, the PPT/PDF slides used for the presentation, and all code on GitHub with their references.



## Introduction

Heart disease, also known as cardiovascular disease (CVD), refers to a range of conditions that affect the heart and blood vessels. It is one of the leading causes of death globally, taking an estimated 17.9 million lives annually. In Singapore alone, CVD accounted for 31.4% of all deaths in 2022.

Traditional methods of detecting CVD include electrocardiography (ECG) and angiography. Although ECG is non-invasive, it cannot provide a definitive diagnosis of CVD. Conversely, while angiography may provide a more definitive diagnosis, it is invasive and can have various side effects and complications.

Our project aims to provide a way to detect CVD that is both non-invasive and definitive.

## Problem Statement

How may we accurately detect heart disease in a patient?

### Attribute Information

> 1. `age`: Age in years
> 2. `sex`: 0 = female; 1 = male
> 3. `cp`: Chest pain type (4 values)
> 4. `trestbps`: Resting blood pressure (in mm Hg on admission to the hospital)
> 5. `chol`: Serum cholesterol in mg/dl
> 6. `fbs`: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
> 7. `restecg`: Resting electrocardiographic results (values 0, 1, 2)
> 8. `thalach`: Maximum heart rate achieved (in bpm)
> 9. `exang`: Exercise-induced angina (0 = no; 1 = yes)
> 10. `oldpeak`: ST depression induced by exercise relative to rest
> 11. `slope`: The slope of the peak exercise ST segment
> 12. `ca`: Number of major vessels (0-3) colored by fluoroscopy
> 13. `thal`: 1 = normal; 2 = fixed defect; 3 = reversible defect
> 14. `target`: 0 = no heart disease; 1 = heart disease

The names and social security numbers of the patients have been removed from the database and replaced with dummy values.

## Overview

10% for coming up with your own problem definition based on a dataset \
10% for data preparation and cleaning to suit the problem of your choice \
20% for exploratory data analysis/visualization to gather relevant insights \
20% for the use of machine learning techniques to solve specific problem \
20% for the presentation of data-driven insights and the recommendations \
10% for the quality of your final team presentation and overall impressions \
10% for learning something new and doing something beyond this course

1. Basic Exploratory Analysis
2. Data Cleaning
3. Exploratory Data Analysis/Visualisation
4. Decision Tree & Random Forest
5. Logistic Regression
6. New technique beyond the course
7. Comparing our models
8. Insights and Conclusions

### Essential Libraries

Let us begin by importing the essential Python Libraries.

> NumPy : Library for Numeric Computations in Python \
Pandas : Library for Data Acquisition and Preparation \
Matplotlib : Low-level library for Data Visualization \
Seaborn : Higher-level library for Data Visualization 

In [30]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot

# Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import plot_tree

sb.set() # set the default Seaborn style for graphics

### Import the Dataset

Dataset on [Heart Disease](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset). By David Lapp. 

### Context

This dataset dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them. The `target` field indicates the presence of heart disease in the patient. It is integer-valued: 0 = no disease and 1 = disease.

In [31]:
data = pd.read_csv("heart.csv")

print("Data dimensions:", data.shape)

data

Data dimensions: (1025, 14)


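|  | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 |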


### Basic Exploratory Analysis

From the attribute information, we know that variables such as `sex`, `cp`, `fbs`, `restecg`, `exang`, `ca`, `thal` and `target` are categorical, even though they are encoded as integers. The remaining variables should be numerical.

In [32]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB
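The `info()` output shows that every categorical variable is stored as `int64`. If we want pandas to treat them as categories, one option is to cast them explicitly. The snippet below is a minimal sketch on a small made-up frame (`toy`, `categorical_cols` and `typed` are illustrative names, not part of the notebook); the same call could be applied to `data` with the full list of categorical columns.

```python
import pandas as pd

# Tiny made-up frame standing in for heart.csv (values are illustrative only).
toy = pd.DataFrame({
    "age": [52, 53, 70],
    "sex": [1, 1, 0],
    "cp": [0, 0, 2],
    "oldpeak": [1.0, 3.1, 2.6],
    "target": [0, 0, 1],
})

# Categorical columns named in the text that exist in this toy frame.
categorical_cols = ["sex", "cp", "target"]

# astype("category") changes only the dtype; the stored values are unchanged.
typed = toy.astype({col: "category" for col in categorical_cols})

print(typed.dtypes)
```

Numeric columns such as `age` and `oldpeak` keep their original dtypes, so `describe()` would then summarise only the truly numerical variables by default.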


In [33]:
data.describe().round(2)

|  | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 |
| mean | 54.43 | 0.7 | 0.94 | 131.61 | 246.0 | 0.15 | 0.53 | 149.11 | 0.34 | 1.07 | 1.39 | 0.75 | 2.32 | 0.51 |
| std | 9.07 | 0.46 | 1.03 | 17.52 | 51.59 | 0.36 | 0.53 | 23.01 | 0.47 | 1.18 | 0.62 | 1.03 | 0.62 | 0.5 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 48.0 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 132.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 56.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 152.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 275.0 | 0.0 | 1.0 | 166.0 | 1.0 | 1.8 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


### Data Cleaning

The attribute information states that `thal` should lie between 1 and 3 and `ca` between 0 and 3. However, the statistical summary shows that some rows have a `thal` value of 0 (min = 0.00) and some have a `ca` value of 4 (max = 4.00). Let us first remove these rows.

In [34]:
# Drop rows with the invalid codes thal == 0 or ca == 4
clean_data = data[(data["thal"] != 0) & (data["ca"] != 4)].copy()

clean_data

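|  | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 |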
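Because the preview of the cleaned frame looks the same as the original, it helps to confirm how many rows the filter actually removed. The sketch below expresses the same rule as a range check with `Series.between`, on a made-up frame (`toy`, `valid` and `toy_clean` are illustrative names); it assumes the documented ranges of 0-3 for `ca` and 1-3 for `thal`.

```python
import pandas as pd

# Made-up rows; the last two carry the out-of-range codes described above.
toy = pd.DataFrame({
    "ca":   [0, 2, 3, 4, 1],
    "thal": [1, 3, 2, 2, 0],
})

# Valid ranges taken from the attribute information: ca in 0-3, thal in 1-3.
# between() is inclusive on both ends by default.
valid = toy["ca"].between(0, 3) & toy["thal"].between(1, 3)

toy_clean = toy[valid].copy()
n_removed = len(toy) - len(toy_clean)
print("Rows removed:", n_removed)
```

A range check of this kind would also catch any other unexpected codes, whereas the `!= 0` and `!= 4` comparisons only exclude the two values seen in the summary.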


### Checking for Skew

We should check whether the two `target` classes are balanced. An imbalanced dataset could distort the false positive rate (FPR) and false negative rate (FNR) of our models.

In [35]:
print("CVD absent:", clean_data[clean_data["target"] == 0].shape[0])
print("CVD present:", clean_data[clean_data["target"] == 1].shape[0])

CVD absent: 492
CVD present: 508
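The same check can be expressed as proportions, which are easier to compare than raw counts. This is a minimal sketch that rebuilds the labels from the two counts printed above rather than reading `heart.csv`; on the real frame, `clean_data["target"].value_counts(normalize=True)` would be the equivalent call.

```python
import pandas as pd

# Rebuild the class labels from the counts printed above (492 absent, 508 present).
target = pd.Series([0] * 492 + [1] * 508, name="target")

# normalize=True returns proportions instead of raw counts.
proportions = target.value_counts(normalize=True).sort_index()
print(proportions)
```

With 49.2% of patients in class 0 and 50.8% in class 1, the two classes are close to balanced.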


### Dropping Outliers

We should also drop any outliers that may affect our model. 
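The notebook does not yet say how outliers will be identified. One common option, shown here purely as an assumption, is the 1.5 × IQR rule applied to a numerical column. The sketch uses a handful of illustrative `chol` values (`toy`, `is_outlier` and `no_outliers` are hypothetical names).

```python
import pandas as pd

# Illustrative cholesterol readings; 564 is deliberately extreme.
toy = pd.DataFrame({"chol": [212, 203, 174, 203, 294, 221, 258, 275, 254, 564]})

# Interquartile range of the column.
q1 = toy["chol"].quantile(0.25)
q3 = toy["chol"].quantile(0.75)
iqr = q3 - q1

# Assumed rule: anything beyond 1.5 * IQR from the quartiles is an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (toy["chol"] < lower) | (toy["chol"] > upper)
no_outliers = toy[~is_outlier]
print("Outliers dropped:", int(is_outlier.sum()))
```

This rule only makes sense for numerical variables such as `age`, `trestbps`, `chol`, `thalach` and `oldpeak`; applying it to integer-coded categories would discard valid rows.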

### Feature Engineering

### Determining Relevant Variables

Next, we identify the predictor variables that correlate strongly with `target`.
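One way to do this, sketched below under the assumption that Pearson correlation is an acceptable first pass, is to take the `target` column of the correlation matrix and rank the predictors by absolute value. The frame here holds a few illustrative values (far too few to draw conclusions), and `toy`, `corr_with_target` and `ranked` are hypothetical names.

```python
import pandas as pd

# Illustrative values only; the column names mirror the dataset.
toy = pd.DataFrame({
    "thalach": [168, 155, 125, 161, 106, 164, 141, 159],
    "oldpeak": [1.0, 3.1, 2.6, 0.0, 1.9, 0.0, 2.8, 0.0],
    "chol":    [212, 203, 174, 203, 294, 221, 258, 254],
    "target":  [0, 0, 0, 0, 0, 1, 0, 1],
})

# Pearson correlation of every predictor with target.
corr_with_target = toy.corr()["target"].drop("target")

# Rank by absolute value so strong negative correlations are not overlooked.
ranked = corr_with_target.reindex(
    corr_with_target.abs().sort_values(ascending=False).index
)
print(ranked.round(2))
```

Pearson correlation is most meaningful for the numerical predictors; for the categorical ones, a different measure of association may be more appropriate.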