In the context of data types, "ordered" data refers to ordinal data.
Ordinal data is a type of categorical data with an order (or rank).
The order of these values is significant and typically represents some sort of hierarchy.
For example, ratings data (like "poor", "average", "good", "excellent") is ordinal
because there is a clear order to the categories.
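
As an illustration of the idea above, ordinal data can be represented in pandas with an ordered categorical type. This is a minimal sketch; the `ratings` values and variable names are made up for the example and are not part of the dataset.

```python
import pandas as pd

# Hypothetical ratings column, used only for illustration
ratings = pd.Series(["good", "poor", "excellent", "average", "good"])

# Declare the categories together with their rank, lowest first
rating_type = pd.CategoricalDtype(
    categories=["poor", "average", "good", "excellent"], ordered=True
)
ordinal_ratings = ratings.astype(rating_type)

# Order-aware operations now work: min/max and comparisons
print(ordinal_ratings.min(), ordinal_ratings.max())
print((ordinal_ratings >= "good").tolist())

# Integer codes preserve the rank and can be fed to a model
print(ordinal_ratings.cat.codes.tolist())
```

A nominal feature would use the same type with `ordered=False`, in which case comparisons such as `>=` are not allowed.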

| Feature         | Description                                                  | Type
| ---             | ---                                                          | ---                                    
| **age**         | Age                                                          | Real
| **sex**         | Sex                                                          | Binary
| **cp**          | Chest pain type (4 values)                                   | Nominal
| **trestbps**    | Resting blood pressure                                       | Real
| **chol**        | Serum cholesterol (in mg/dl)                                 | Real
| **fbs**         | Fasting blood sugar > 120 mg/dl                              | Binary
| **restecg**     | Resting electrocardiographic results (values 0,1,2)          | Nominal
| **thalach**     | Maximum heart rate achieved                                  | Real
| **exang**       | Exercise induced angina                                      | Binary
| **oldpeak**     | Oldpeak = ST depression induced by exercise relative to rest | Real
| **slope**       | The slope of the peak exercise ST segment                    | Ordered
| **ca**          | Number of major vessels (0-3) colored by fluoroscopy         | Real
| **thal**        | Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect    | Nominal
| **target**      | 1 = no disease; 2 = presence of disease                      | Binary

In [None]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

detail = {"age": "Age", "sex": "Sex", "cp": "Chest Pain Type", "trestbps": "Resting Blood Pressure",
          "chol": "Serum Cholesterol", "fbs": "Fasting Blood Sugar", "restecg": "Resting ECG",
          "thalach": "Max Heart Rate", "exang": "Exercise Induced Angina", "oldpeak": "Oldpeak",
          "slope": "Slope", "ca": "Number of major vessels", "thal": "Thal", "target": "(0 - no disease, 1 - disease)"}

sns.set_theme(context="paper", font_scale=1.5, style="whitegrid", palette="Set2")

data = pd.read_csv("heart.dat", sep="\\s+", header=None)
data.columns = detail.keys()

numericFeatures = ["age", "trestbps", "chol", "thalach", "oldpeak", "ca"]
categoricalFeatures = ["sex", "cp", "fbs", "restecg", "exang", "slope", "thal"]

# Data Pre-Processing


In [56]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Check for missing values
print("Number of missing values:", data.isnull().sum().sum(), "\n")
# Check for duplicates
print("Number of duplicates:", data.duplicated().sum(), "\n")

# Features (copied so the scaling below does not write into `data`) and 0/1 target
X = data.iloc[:, :-1].copy()
Y = data.iloc[:, -1] - 1

# Print class percentages
class_counts = Y.value_counts()
class_percentages = class_counts / len(Y) * 100
print("Absence of heart disease class percentage: ", class_percentages[0].round(2), "%")
print("Presence of heart disease class percentage: ", class_percentages[1].round(2), "%\n")

# print(data.describe())

# The continuous features are the numeric ones defined above
continuousFeatures = numericFeatures

# Standardize (zero mean, unit variance) the continuous variables only
X[continuousFeatures] = StandardScaler().fit_transform(X[continuousFeatures])
standardizedX = X.copy()

# Min-max normalized alternative: rescale the continuous variables to [0, 1]
normalizedX = X.copy()
normalizedX[continuousFeatures] = MinMaxScaler().fit_transform(X[continuousFeatures])

# One-hot encode the multi-valued categorical features
one_hot_X = pd.get_dummies(X, columns=["cp", "restecg", "slope", "thal"])
X_oneHot = one_hot_X  # alias used by later cells
one_hot_standardizedX = pd.get_dummies(standardizedX, columns=["cp", "restecg", "slope", "thal"])
one_hot_normalizedX = pd.get_dummies(normalizedX, columns=["cp", "restecg", "slope", "thal"])

print("one hot", one_hot_X.head(), "\n")
print("one hot standard", one_hot_standardizedX.head(), "\n")
print("one hot norm", one_hot_normalizedX.head(), "\n")




Number of missing values: 0 

Number of duplicates: 0 

Absence of heart disease class percentage:  55.56 %
Presence of heart disease class percentage:  44.44 %

one hot         age  sex  trestbps      chol  fbs   thalach  exang   oldpeak  \
0  1.712094  1.0 -0.075410  1.402212  0.0 -1.759208    0.0  1.181012   
1  1.382140  0.0 -0.916759  6.093004  0.0  0.446409    0.0  0.481153   
2  0.282294  1.0 -0.411950  0.219823  0.0 -0.375291    0.0 -0.656118   
3  1.052186  1.0 -0.187590  0.258589  0.0 -1.932198    1.0 -0.743600   
4  2.152032  0.0 -0.636310  0.374890  0.0 -1.240239    1.0 -0.743600   

         ca  cp_1.0  ...  cp_4.0  restecg_0.0  restecg_1.0  restecg_2.0  \
0  2.472682   False  ...    True        False        False         True   
1 -0.711535   False  ...   False        False        False         True   
2 -0.711535   False  ...   False         True        False        False   
3  0.349871   False  ...    True         True        False        False   
4  0.349871   False  .

# Feature Correlation

## Heatmap

Although a correlation heatmap works best with numeric features, binary categorical features are simple enough (two values, coded 0 and 1) that a numeric correlation is still meaningful for them.
Features used:
- **numerical** - age, trestbps, chol, thalach, oldpeak, ca
- **categorical** - sex, fbs, exang
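
To make the point about binary features concrete: the Pearson correlation between a 0/1 feature and a numeric one is just a scaled difference between the two group means (the point-biserial correlation). The sketch below checks this on synthetic data; the column names `binary` and `numeric` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: a 0/1 feature and a numeric feature whose mean shifts with it
binary = rng.integers(0, 2, size=200)
numeric = rng.normal(loc=0.0, scale=1.0, size=200) + 0.8 * binary
toy = pd.DataFrame({"binary": binary, "numeric": numeric})

# Pearson correlation, the same quantity DataFrame.corr() reports
pearson_r = toy["binary"].corr(toy["numeric"])

# Point-biserial form: difference of group means, scaled by the overall spread
mean_1 = toy.loc[toy["binary"] == 1, "numeric"].mean()
mean_0 = toy.loc[toy["binary"] == 0, "numeric"].mean()
p = toy["binary"].mean()
point_biserial_r = (mean_1 - mean_0) / toy["numeric"].std(ddof=0) * np.sqrt(p * (1 - p))

print(round(pearson_r, 4), round(point_biserial_r, 4))
```

The two printed values agree, which is why including binary features in the heatmap is reasonable. The same argument does not carry over to nominal features with more than two values.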

In [None]:
figsize = (10, 8)
vmin = -0.75
vmax = 0.75

X_heatmap = X[numericFeatures + ["sex", "fbs", "exang"]]

dataCorr = pd.concat([X_heatmap, data["target"]], axis=1).corr()

upperHalf_mask = np.tril(np.ones_like(dataCorr, dtype=bool))    # hide the lower triangle and the diagonal

plt.figure(figsize=figsize)
plt.title("Feature Heatmap")
sns.heatmap(dataCorr, annot=True, linewidths=2,
            mask=upperHalf_mask, cmap="Spectral_r", vmin=vmin, vmax=vmax
)
plt.savefig(f"plots/heatmap/heatmap.png")

Features like **fbs**, **chol** and **trestbps** seem uninteresting for target prediction. Before trying the models without them, let's try removing the outliers and see if anything changes.
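
One common way to flag outliers is the 1.5 × IQR rule. A minimal sketch on a toy frame follows; the helper name `iqr_inlier_mask` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def iqr_inlier_mask(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Return True for rows with no value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < (q1 - k * iqr)) | (frame > (q3 + k * iqr))
    return ~outside.any(axis=1)

# Toy data: the last row holds an extreme value in column "a"
toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
mask = iqr_inlier_mask(toy)
print(toy[mask])
```

Returning a boolean mask rather than a filtered frame makes it easy to apply the same row selection to the target and to any other columns, so that everything stays aligned.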

In [None]:
X_heatmap = X[numericFeatures]

Q1 = X_heatmap.quantile(0.25)
Q3 = X_heatmap.quantile(0.75)
IQR = Q3 - Q1

# Keep only rows where every numeric feature lies within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
X_heatmap_noOut = X_heatmap[~((X_heatmap < (Q1 - 1.5 * IQR)) | (X_heatmap > (Q3 + 1.5 * IQR))).any(axis=1)]
# join="inner" restricts the binary features and the target to the retained rows
X_heatmap_noOut = pd.concat([X_heatmap_noOut, X[["sex", "fbs", "exang"]]], axis=1, join="inner")

dataCorr_noOut = pd.concat([X_heatmap_noOut, data["target"]], axis=1, join="inner").corr()

upperHalf_mask = np.tril(np.ones_like(dataCorr_noOut, dtype=bool))

plt.figure(figsize=figsize)
plt.title("No Outliers Feature Heatmap")
sns.heatmap(dataCorr_noOut, annot=True, linewidths=2,
            mask=upperHalf_mask, cmap="Spectral_r", vmin=vmin, vmax=vmax
)
plt.savefig(f"plots/heatmap/heatmap_noOutliers.png")


Plotting the difference between heatmaps...

In [None]:
# Calculate the difference between the two correlation matrices
heatmap_diff = dataCorr_noOut - dataCorr

# Plot the difference heatmap

plt.figure(figsize=figsize)
plt.title("Difference between Heatmaps")
sns.heatmap(heatmap_diff, annot=True, linewidths=2, mask=upperHalf_mask, cmap="coolwarm")
plt.savefig(f"plots/heatmap/heatmap_diff.png")


The feature **trestbps** was the most affected by outlier removal, as the correlation decreased 0.79 from 0.16. This could suggest that the outliers were relevant cases to consider, but its density distribution and boxplot show that it provides little information for the classification, as the distributions for disease and no disease practically overlap.  

A milder version of the same phenomenon happens for **chol**.  

The categorical feature **fbs** also appears to have little effect on separating the two classes, as the normalized bar plot shows the same proportions between target = 2 (disease) and target = 1 (no disease) for both **fbs** classes.  
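
The proportions behind such a normalized bar plot can be computed with a row-normalized contingency table. This is a minimal sketch; the toy values below are made up and only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration
toy = pd.DataFrame({
    "fbs":    [0, 0, 0, 0, 1, 1, 1, 1],
    "target": [1, 1, 2, 2, 1, 1, 2, 2],
})

# Row-normalized contingency table: share of each target class within each fbs value
proportions = pd.crosstab(toy["fbs"], toy["target"], normalize="index")
print(proportions)
```

When the rows of this table are nearly identical, as in the toy data, the feature does not help separate the classes. Calling `proportions.plot(kind="bar", stacked=True)` would draw the corresponding bar plot.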

It should be relevant to try out the models without these 3 features.

Also, removing features like **age** for Naive Bayes could in theory provide better results, as it is not strongly correlated with the target but correlates a bit with many other features, which conflicts with the independence assumption of Naive Bayes. 
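
One way to test this kind of hypothesis is to compare cross-validated scores with and without each feature. The sketch below uses synthetic data from `make_classification` rather than the heart dataset; the helper `mean_cv_accuracy` and the feature names `f0` to `f5` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real data, including two redundant (correlated) features
features, labels = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=2, random_state=0
)
toy_X = pd.DataFrame(features, columns=[f"f{i}" for i in range(6)])

def mean_cv_accuracy(frame, y, dropped=()):
    """Mean 5-fold accuracy of GaussianNB after dropping the given columns."""
    subset = frame.drop(columns=list(dropped))
    return cross_val_score(GaussianNB(), subset, y, cv=5, scoring="accuracy").mean()

baseline = mean_cv_accuracy(toy_X, labels)
scores = {col: mean_cv_accuracy(toy_X, labels, dropped=[col]) for col in toy_X.columns}

print(f"baseline: {baseline:.3f}")
for col, score in scores.items():
    print(f"without {col}: {score:.3f}")
```

A feature whose removal leaves the score unchanged or improves it is a candidate for dropping, although small differences can be within fold-to-fold noise.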


## Principal Component Analysis

[PCA Explanation](https://builtin.com/data-science/step-step-explanation-principal-component-analysis)  

Advantages of using PCA for dimensionality reduction (from GitHub Copilot):

1. **Removes Correlated Features**: In the real world, features are often correlated. PCA allows you to identify the most important features of your dataset, reducing it to a smaller set of uncorrelated features, known as principal components.

2. **Improves Algorithm Performance**: With fewer features, a machine learning algorithm can train faster and may also perform better.

3. **Reduces Overfitting**: By reducing the dimensionality of your feature space, you're less likely to overfit your model.

4. **Improves Visualization**: It's hard to visualize high dimensional data. PCA transforms a high dimensional data set to 2 or 3 dimensions so we can plot and understand data better.

Disadvantages of using PCA:

1. **Independent variables become less interpretable**: After implementing PCA, your original features will turn into Principal Components. Principal Components are linear combinations of your original features, so they are not as readable and interpretable as the original features.

2. **Data standardization is a must before PCA**: You must standardize your data before implementing PCA; otherwise features with larger scales dominate the variance and PCA will not find meaningful Principal Components.

3. **Information Loss**: Although principal components attempt to retain as much information as possible, some information is lost when reducing dimensions, which can potentially degrade the performance of your machine learning model.

4. **Doesn't handle non-linear features well**: PCA assumes that the principal components are a linear combination of the original features. If this assumption is not true, PCA may not give you the results you're looking for.
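
The standardization point in the list above can be seen on a small synthetic example. This is a sketch under assumed data: two independent features, one on a scale 1000 times larger than the other.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent synthetic features on very different scales
small_scale = rng.normal(0, 1, size=500)
large_scale = rng.normal(0, 1000, size=500)
raw = np.column_stack([small_scale, large_scale])

# Without scaling, the large-scale feature dominates the first component
ratio_raw = PCA(n_components=2).fit(raw).explained_variance_ratio_

# After standardization, both features contribute comparably
scaled = StandardScaler().fit_transform(raw)
ratio_scaled = PCA(n_components=2).fit(scaled).explained_variance_ratio_

print("raw:   ", ratio_raw.round(3))
print("scaled:", ratio_scaled.round(3))
```

On the raw data the first component captures almost all of the variance simply because of the units, not because the second feature is uninformative.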

In [None]:
from sklearn.decomposition import PCA

X_pca = X[numericFeatures]

pca = PCA(n_components=len(X_pca.columns))
pca_result = pca.fit_transform(X_pca)

# Plot the number of components against the cumulative explained variance
explained_variance = np.cumsum(pca.explained_variance_ratio_)
plt.figure(figsize=(4, 3))
# Start the x-axis at 1 so each point lines up with its component count
plt.plot(np.arange(1, len(explained_variance) + 1), explained_variance)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()



By plotting the *Number of Components* against the *Cumulative Explained Variance*, we can see that all 6 principal components are needed to explain 100% of the variance, the same number as there are numeric features. PCA therefore does not provide the benefit of reducing the dimensionality of this dataset.
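
Reading the number of components off such a curve can also be done numerically. The following is a minimal sketch on synthetic data; the 0.95 threshold and the toy array are assumptions, not values taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 6 columns, of which the last 3 are near-copies of the first 3
base = rng.normal(size=(300, 3))
toy = np.hstack([base, base + 0.05 * rng.normal(size=(300, 3))])

pca = PCA().fit(toy)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the threshold
threshold = 0.95  # assumed cut-off, chosen only for this example
n_needed = int(np.argmax(cumulative >= threshold)) + 1

# Array positions start at 0 but component counts start at 1
component_counts = np.arange(1, len(cumulative) + 1)
print(dict(zip(component_counts.tolist(), cumulative.round(3).tolist())))
print("components needed:", n_needed)
```

In this toy case the redundant columns let PCA compress the data. When the features are only weakly correlated, the curve rises almost linearly and nearly every component is needed.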

That said, PCA can still be useful in this case for other reasons (from Copilot):

1. **Feature Independence**: The PCs are uncorrelated with each other, which can help models that are sensitive to correlated features (like linear regression, which suffers from multicollinearity).

2. **Interpretability**: PCs can sometimes be interpreted in terms of the original features, which can provide insights into the structure of your data.

3. **Noise Reduction**: PCA can help to reduce noise in your data by focusing on the directions of maximum variance and ignoring smaller, potentially noisy fluctuations.
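
The first point in the list above can be checked directly: after projecting onto the principal components, the resulting columns are uncorrelated. This is a sketch on synthetic data with deliberately correlated columns.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic correlated features: the second column is mostly a copy of the first
first = rng.normal(size=400)
second = 0.9 * first + 0.1 * rng.normal(size=400)
third = rng.normal(size=400)
toy = np.column_stack([first, second, third])

scores = PCA(n_components=3).fit_transform(toy)

# Correlation matrices before and after the projection
corr_before = np.corrcoef(toy, rowvar=False)
corr_after = np.corrcoef(scores, rowvar=False)

print(corr_before.round(2))
print(corr_after.round(2))
```

Note that uncorrelated is weaker than statistically independent, so this helps with linear dependence between features but does not guarantee independence.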

In [None]:
from mpl_toolkits.mplot3d import Axes3D

nPCs = 5
pca = PCA(n_components=nPCs)
pca_result = pca.fit_transform(X_pca)

explained_variance = pca.explained_variance_ratio_
print(f"Explained Variance: {explained_variance}")
# print(pca.components_)  # feature weight for each pc

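# Convert it back to a DataFrame, keeping the original row index for alignment
pca_df = pd.DataFrame(data=pca_result, columns=["PC" + str(i + 1) for i in range(nPCs)], index=X_pca.index)

# Replace the numeric columns of the one-hot frame with the principal components
X_oneHot_PCAed = pd.concat([pca_df, X_oneHot.drop(columns=X_pca.columns)], axis=1)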

# print(X_oneHot_PCAed.columns)

# Naive Bayes

In [58]:
from sklearn.model_selection import train_test_split, cross_val_predict, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix


# Numeric (continuous) features only
continuousX = X[numericFeatures].copy()
#print(continuousX.head())

# Same as above, without thalach
croppedX = continuousX.drop(["thalach"], axis=1)
#print(croppedX.head())

y_pred_all = cross_val_predict(GaussianNB(), X, Y, cv=5)
y_pred_continuous = cross_val_predict(GaussianNB(), continuousX, Y, cv=5)
y_pred_cropped = cross_val_predict(GaussianNB(), croppedX, Y, cv=5)

print(y_pred_all[0:5])
print(y_pred_continuous[0:5])
print(y_pred_cropped[0:5])




nb_results = cross_validate(GaussianNB(), X, Y, cv=5, scoring=["accuracy","precision", "recall", "f1", "roc_auc"], return_train_score=True)
results_df = pd.DataFrame(nb_results)
print("ALL FEATURES \n", results_df.mean(), "\n\n")

nb_results = cross_validate(GaussianNB(), continuousX, Y, cv=5, scoring=["accuracy","precision", "recall", "f1", "roc_auc"], return_train_score=True)
results_df = pd.DataFrame(nb_results)
print("CONTINUOUS FEATURES \n", results_df.mean(), "\n\n")

nb_results = cross_validate(GaussianNB(), croppedX, Y, cv=5, scoring=["accuracy","precision", "recall", "f1", "roc_auc"], return_train_score=True)
results_df = pd.DataFrame(nb_results)
print("CONTINUOUS FEATURES W/o THALACH \n", results_df.mean(), "\n\n")

[1 1 0 1 0]
[1 1 0 1 1]
[1 1 0 0 0]
ALL FEATURES 
 fit_time           0.003132
score_time         0.009930
test_accuracy      0.840741
train_accuracy     0.862963
test_precision     0.835804
train_precision    0.854784
test_recall        0.800000
train_recall       0.833333
test_f1            0.817353
train_f1           0.843868
test_roc_auc       0.900000
train_roc_auc      0.918003
dtype: float64 


CONTINUOUS FEATURES 
 fit_time           0.002387
score_time         0.009433
test_accuracy      0.762963
train_accuracy     0.778704
test_precision     0.755897
train_precision    0.773451
test_recall        0.708333
train_recall       0.710417
test_f1            0.727403
train_f1           0.740534
test_roc_auc       0.826667
train_roc_auc      0.844167
dtype: float64 


CONTINUOUS FEATURES W/o THALACH 
 fit_time           0.001802
score_time         0.008785
test_accuracy      0.744444
train_accuracy     0.756481
test_precision     0.773581
train_precision    0.780472
test_recall      

score_time        0.007607
test_accuracy     0.744444
test_precision    0.736559
test_recall       0.846667
test_f1           0.784980
test_roc_auc      0.805278

# Logistic Regression

In [44]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate


# nb_results = cross_validate(LogisticRegression(), standardizedX, Y, cv=5, scoring=["accuracy", "precision", "recall", "f1", "neg_log_loss"])
# results_df = pd.DataFrame(nb_results)
# print("standardizedX", "\n", results_df.mean(), "\n\n")


# Sistema

nb_results = cross_validate(LogisticRegression(), one_hot_standardizedX, Y, cv=5, scoring=["accuracy", "precision", "recall", "f1", "roc_auc"])
results_df = pd.DataFrame(nb_results)
print("one_hot_standardizedX","\n", results_df.mean(), "\n\n")

# nb_results = cross_validate(LogisticRegression(), one_hot_normalizedX, Y, cv=5, scoring=["accuracy", "precision", "recall", "f1", "neg_log_loss"])
# results_df = pd.DataFrame(nb_results)
# print("one_hot_normalizedX","\n", results_df.mean(), "\n\n")

param_grid = [    
    {
    'C' : [0.001,0.01,0.1,1,10,100,1000],
    }
]

clf = GridSearchCV(LogisticRegression(), param_grid = param_grid, cv = 5, verbose=True, n_jobs=-1, scoring=["accuracy", "precision", "recall", "f1", "roc_auc"], return_train_score=True, refit="recall")
clf.fit(one_hot_standardizedX, Y)
results_gs = pd.DataFrame(clf.cv_results_)
results_gs[['param_C','mean_test_recall','mean_test_f1', 'mean_test_roc_auc']]




one_hot_standardizedX 
 fit_time          0.010674
score_time        0.012706
test_accuracy     0.848148
test_precision    0.876926
test_recall       0.766667
test_f1           0.817716
test_roc_auc      0.908333
dtype: float64 


Fitting 5 folds for each of 7 candidates, totalling 35 fits


|     | param_C | mean_test_recall | mean_test_f1 | mean_test_roc_auc
| --- | ---     | ---              | ---          | ---
| 0   | 0.001   | 0.175            | 0.291074     | 0.893889
| 1   | 0.01    | 0.708333         | 0.791018     | 0.907222
| 2   | 0.1     | 0.775            | 0.820146     | 0.9125
| 3   | 1.0     | 0.766667         | 0.817716     | 0.908333
| 4   | 10.0    | 0.783333         | 0.824153     | 0.901944
| 5   | 100.0   | 0.8              | 0.831662     | 0.900278
| 6   | 1000.0  | 0.791667         | 0.827037     | 0.898333


# K-Nearest Neighbors

In [23]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Ensure one_hot_standardizedX and Y are properly defined

# Try every k from 1 to 30 (np.linspace(1, 30) would yield 50 values with duplicates after the int cast)
param_grid = {"n_neighbors": np.arange(1, 31)}

# Choose a single scoring metric (e.g., "accuracy") for refit
refit_metric = "accuracy"

clf = GridSearchCV(
    KNeighborsClassifier(),
    param_grid=param_grid,
    cv=5,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"],
    return_train_score=True,
    refit=refit_metric
)

clf.fit(one_hot_standardizedX, Y)

# Collect the cross-validation results of this grid search
results_gs = pd.DataFrame(clf.cv_results_)

# Print the metrics table
print("Metrics Table:")
print(results_gs[['param_n_neighbors', 'mean_test_accuracy', 'mean_test_precision', 'mean_test_recall', 'mean_test_f1', 'mean_test_roc_auc']])



KeyboardInterrupt: 

# Decision Trees

In [52]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

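# Reduced feature set kept for experimentation; note that the grid search below is fitted on the full X
X_DT = X[["thalach", "oldpeak", "ca", "exang"]].copy()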

param_grid = {"max_depth": [3, 5, 7, 9], "min_samples_split": [2, 4, 8, 10]}

clf = GridSearchCV(DecisionTreeClassifier(), param_grid=param_grid, cv=5, scoring=["accuracy", "precision", "recall", "f1", "roc_auc"], return_train_score=True, refit=False)
clf.fit(X, Y)
results_gs = pd.DataFrame(clf.cv_results_)
results_gs[['param_max_depth','param_min_samples_split', 'mean_train_recall','mean_train_f1','mean_train_roc_auc','mean_test_recall', 'mean_test_f1','mean_test_roc_auc']]


|     | param_max_depth | param_min_samples_split | mean_train_recall | mean_train_f1 | mean_train_roc_auc | mean_test_recall | mean_test_f1 | mean_test_roc_auc
| --- | ---             | ---                     | ---               | ---           | ---                | ---              | ---          | ---
| 0   | 3               | 2                       | 0.7875            | 0.833166      | 0.912951           | 0.691667         | 0.755672     | 0.827222
| 1   | 3               | 4                       | 0.7875            | 0.833166      | 0.912951           | 0.691667         | 0.755672     | 0.827222
| 2   | 3               | 8                       | 0.7875            | 0.833166      | 0.912951           | 0.691667         | 0.755672     | 0.83125
| 3   | 3               | 10                      | 0.7875            | 0.833166      | 0.912951           | 0.691667         | 0.755672     | 0.83125
| 4   | 5               | 2                       | 0.902083          | 0.936005      | 0.983516           | 0.7              | 0.723168     | 0.774583
| 5   | 5               | 4                       | 0.902083          | 0.933         | 0.982734           | 0.683333         | 0.719009     | 0.779306
| 6   | 5               | 8                       | 0.879167          | 0.906312      | 0.974601           | 0.683333         | 0.718523     | 0.80125
| 7   | 5               | 10                      | 0.866667          | 0.900344      | 0.972457           | 0.675            | 0.716899     | 0.798889
| 8   | 7               | 2                       | 0.979167          | 0.986386      | 0.999141           | 0.675            | 0.691034     | 0.729167
| 9   | 7               | 4                       | 0.939583          | 0.959527      | 0.996493           | 0.708333         | 0.720378     | 0.759861
