# SC1015 Mini-Project

Group: 2, FCEE

Lee Heng Sheng Brandon, U2322900C \
Alan Lee Leman, U2321753B \
Wee Zi Hao, U2323380H

### Final Attribute Information

> 1. `age`: age in years (Numerical)
2. `sex`: 0 = female; 1 = male (Categorical)
3. `cp` changed to `chest_pain`: Chest pain type (4 values) (Categorical)
4. `trestbps` changed to `blood_pressure`: Resting blood pressure (in mm Hg on admission to the hospital) (Numerical)
5. `chol` changed to `cholesterol`: Serum cholesterol in mg/dl (Numerical)
6. `fbs` changed to `fasting_blood_sugar`: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) (Categorical)
7. `restecg` changed to `resting_ecg_result`: Resting electrocardiographic results (values 0,1,2) (Categorical)
8. `thalach` changed to `max_heart_rate`: Maximum heart rate achieved (in bpm) (Numerical)
9. `exang` changed to `exercise_induced_angina`: Exercise induced angina (0 = no; 1 = yes) (Categorical)
10. `oldpeak` changed to `st_depression`: ST depression induced by exercise relative to rest (Numerical)
11. `new_st_depression`: The presence of ST depression induced by exercise relative to rest (0 = no; 1 = yes) (Categorical)
12. `slope`: The slope of the peak exercise ST segment (0, 1, 2) (Categorical)
13. `ca` changed to `num_affected_vessels`: Number of major vessels (0-3) colored by fluoroscopy (Categorical)
14. `thal` changed to `defect_type`: 1 = normal; 2 = fixed defect; 3 = reversible defect (Categorical)
15. `target` changed to `heart_disease`: 0 = no heart disease; 1 = heart disease (Categorical)

### Essential Libraries

Let us begin by importing the essential Python Libraries for Data Extraction and Cleaning.

> NumPy : Library for Numeric Computations in Python \
Pandas : Library for Data Acquisition and Preparation \
Matplotlib : Low-level library for Data Visualization \
Seaborn : Higher-level library for Data Visualization 

In [13]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot

### Import the Dataset

We import the `clean_data.csv` dataset that we saved previously.\
It is a cleaned version of the [Heart Disease](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset) dataset by David Lapp.


In [14]:
# Importing our dataset
clean_data = pd.read_csv("datasets/clean_data.csv") # forward slash avoids the invalid "\c" escape and works on every OS

print("Data dimensions:", clean_data.shape)

clean_data

Data dimensions: (1000, 15)


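| Unnamed: 0 | age | sex | chest_pain | blood_pressure | cholesterol | fasting_blood_sugar | resting_ecg_result | max_heart_rate | exercise_induced_angina | st_depression | new_st_depression | slope | num_affected_vessels | defect_type | heart_disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 1 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 1 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 1 | 3 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 0 | 2 | 0 | 2 | 1 |
| 996 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 1 | 3 | 0 |
| 997 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 1 | 2 | 0 |
| 998 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 0 | 2 | 0 | 2 | 1 |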


In [15]:
# Make a list of numerical and categorical variables
cat_var = ["sex", "chest_pain", "fasting_blood_sugar", "resting_ecg_result", "exercise_induced_angina", "new_st_depression", 
           "slope", "num_affected_vessels", "defect_type", "heart_disease"]
num_var = [var for var in clean_data.columns if var not in cat_var]

## Assumptions of Logistic Regression

Logistic regression does not assume that the predictors are normally distributed, so skew in the data does not violate any model assumption. We therefore do not remove outliers from our data.

Rather, logistic regression assumes the following:

1. Independence of Observations (which we shall assume)
2. Absence of Multicollinearity (independent variables should not be highly correlated with any other variable in the model)
3. Linearity of Logit (the logit of the dependent variable is linearly related to each continuous independent variable)

We can use these assumptions to determine the relevant independent variables for our model. 

### Absence of Multicollinearity

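We can test for this with the VIF (Variance Inflation Factor). For the $i$-th independent variable, it is given by:

$VIF_i = \frac{1}{1 - R_i^2}$

where $R_i^2$ is the coefficient of determination obtained by regressing the $i$-th independent variable on all the other independent variables. Generally, a VIF above 5 indicates high multicollinearity, and we should avoid using such variables in our model.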
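To make the formula concrete, here is a minimal sketch that computes the VIF by hand with NumPy. The data are synthetic and the names `vif_by_hand` and `X_demo` are hypothetical, introduced only for illustration; this is not the `statsmodels` routine used in the next cell.

```python
import numpy as np

def vif_by_hand(X, i):
    """VIF of column i: regress it on the other columns plus an intercept."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept + remaining predictors
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)    # ordinary least squares fit
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()                # R_i^2 of the auxiliary regression
    return 1.0 / (1.0 - r2)

# Synthetic predictors (assumption: made-up data, not the heart disease dataset)
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)             # unrelated to a
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of a
X_demo = np.column_stack([a, b, c])

vifs = [vif_by_hand(X_demo, i) for i in range(X_demo.shape[1])]
print(vifs)
```

In this sketch, `a` and `c` should come out far above the threshold of 5 because each is almost fully explained by the other, while `b` should stay close to 1.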

In [16]:
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
from statsmodels.tools.tools import add_constant

# Get independent variables (the response column was renamed from "target" to "heart_disease")
independent_vars = clean_data.drop("heart_disease", axis = 1)

# variance_inflation_factor expects the presence of a constant in the matrix of explanatory variables
# We can add a constant column using add_constant from statsmodels
independent_vars = add_constant(independent_vars)

VIF_df = pd.DataFrame(independent_vars.columns).rename({0 : "VARIABLES"}, axis = 1) # rename variable column

VIF_df["VIF"] = [vif(independent_vars, i) for i in range(len(independent_vars.columns))]

VIF_df

It appears that all our variables have a VIF below 5, and we do not need to drop any of them. 

### One-Hot Encoding

Since Logistic Regression is a linear model, we will need to convert categorical variables into a set of binary (dummy) variables before fitting them in the model. 
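As a small illustration of what one-hot encoding does, the sketch below encodes a made-up categorical column with plain NumPy. The array `chest_pain_demo` and its values are assumptions for illustration only; the actual encoding in this notebook uses `OneHotEncoder` from scikit-learn.

```python
import numpy as np

# Made-up chest pain codes for six hypothetical patients
chest_pain_demo = np.array([0, 2, 1, 3, 0, 2])

# Find the distinct levels and the position of each observation's level
levels, codes = np.unique(chest_pain_demo, return_inverse=True)

# Row k of the identity matrix is the dummy vector for level k
dummies = np.eye(len(levels), dtype=int)[codes]

print(levels)
print(dummies)
```

Each row of `dummies` contains exactly one 1, so a variable with four levels becomes four binary columns.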

In [None]:
# Import the encoder from sklearn
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()

# One-Hot Encoding of categorical predictors
cat_pred = clean_data[cat_var].drop("heart_disease", axis = 1) # the response column is named "heart_disease", not "target"
ohe.fit(cat_pred)

cat_pred_ohe = pd.DataFrame(ohe.transform(cat_pred).toarray(), 
             columns = ohe.get_feature_names_out(cat_pred.columns))

# Check the encoded variables
cat_pred_ohe.info()

In [None]:
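# Concatenate with the numeric variables
# The response column "heart_disease" is renamed to "target", the name used in the cells that follow
clean_data_ohe = pd.concat([clean_data[num_var], cat_pred_ohe, clean_data["heart_disease"].rename("target")], axis = 1)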

print("Dimensions:", clean_data_ohe.shape)

# Check the final DataFrame
clean_data_ohe

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split the data into predictors and response
X = clean_data_ohe.drop(["target"], axis = 1)
y = clean_data_ohe["target"]

# Split the dataset into train and test (75:25 ratio, as set by test_size = 0.25)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

logreg = LogisticRegression(max_iter = 300) # the default of 100 iterations triggers a convergence warning
logreg.fit(X_train, y_train)

print("Classes:", logreg.classes_)
print("Intercept:", logreg.intercept_)
print("Coefficients:", [i.round(4) for i in logreg.coef_[0]])

### Goodness of Fit of Model

Let us check its classification accuracy and its confusion matrix. 

In [None]:
# Predict target with model
y_train_pred = logreg.predict(X_train)
y_test_pred = logreg.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", logreg.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", logreg.score(X_test, y_test))
print()

In [None]:
from sklearn.metrics import confusion_matrix

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))

sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

axes[0].set_title("Train")
axes[1].set_title("Test");

In [None]:
# Define a function to print rate metrics
def printMetrics(true, pred):
    # For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
    TN, FP, FN, TP = confusion_matrix(true, pred).ravel()
    TPR = TP / (TP + FN) # true positive rate
    FPR = FP / (FP + TN) # false positive rate
    TNR = TN / (TN + FP) # true negative rate
    FNR = FN / (FN + TP) # false negative rate
    print("TPR:\t", TPR)
    print("FPR:\t", FPR)
    print("TNR:\t", TNR)
    print("FNR:\t", FNR)
    print()

print("TRAIN SET:")
printMetrics(y_train, y_train_pred)
print("TEST SET:")
printMetrics(y_test, y_test_pred)
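
As a quick sanity check on these rate definitions, the sketch below applies them to a hand-made confusion matrix. The counts and the helper name `rates` are hypothetical and are not results from our model.

```python
def rates(tn, fp, fn, tp):
    """Return the four rate metrics from raw confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # share of actual positives predicted positive
        "FPR": fp / (fp + tn),  # share of actual negatives predicted positive
        "TNR": tn / (tn + fp),  # share of actual negatives predicted negative
        "FNR": fn / (fn + tp),  # share of actual positives predicted negative
    }

# Hypothetical counts: 50 actual negatives and 50 actual positives
demo = rates(tn=40, fp=10, fn=5, tp=45)
print(demo)
```

Note that TPR and FNR always sum to 1, as do TNR and FPR, so only two of the four rates carry independent information.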

### Linearity of Logit

We shall check for this by plotting the logit of the dependent variable `target` against the independent variables in a scatterplot. We can also check for linearity with Pearson's correlation. The formula for the logit is given as:

$\mathrm{logit}(p) = \log \left( \frac{p}{1 - p} \right)$

where $p$ is the probability of success and $1 - p$ is the probability of failure, so that $\frac{p}{1 - p}$ is the odds of success. We can get the estimated probabilities from our model with `predict_proba()`. Let us first check the linearity on the train set.

In [None]:
proba_arr = logreg.predict_proba(X_train)

proba_arr
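
Before applying the logit to these probabilities, here is a small standalone sketch of the logit and its inverse, the sigmoid. The intercept `b0`, coefficient `b1` and input `x` are made-up values, not the fitted parameters of our model; the point is only that the logit of a logistic model's predicted probability recovers the linear predictor.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: maps a real number to a probability."""
    return 1 / (1 + math.exp(-z))

# Hypothetical one-variable logistic model (values are assumptions)
b0, b1 = -1.0, 0.05
x = 52

z_demo = b0 + b1 * x          # linear predictor
p_demo = sigmoid(z_demo)      # predicted probability of the positive class
print(p_demo, logit(p_demo))  # the logit equals z_demo up to rounding error
```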

In [None]:
import math

# Make the logit function
logit = lambda x : math.log(x[1] / x[0])

target = clean_data_ohe["target"]
clean_data_ohe = clean_data_ohe.drop(["target"], axis = 1)

# Get the estimated probabilities
proba_arr = logreg.predict_proba(clean_data_ohe)

# Apply the logit function across all probability values
logit_arr = np.array(list(map(logit, proba_arr)))

In [None]:
clean_data_ohe = pd.concat([clean_data_ohe, pd.DataFrame(logit_arr), target], axis = 1).rename({0 : "logit"}, axis = 1)

clean_data_ohe["logit"]

In [None]:
f, axes = plt.subplots(len(num_var), 1, figsize = (24, 12))

for i, var in enumerate(num_var):
    sb.scatterplot(data = clean_data_ohe, x = var, y = "logit", ax = axes[i]).set(xlabel = var)
    print(f"{var} correlation:", clean_data_ohe[var].corr(clean_data_ohe["logit"]))

f.tight_layout()

# Ok, something probably went wrong here. 

### Check for Relevant Variables via Coefficients

### Dimensionality Issue of One-Hot Encoding