# Homework 2

## 1. Imports and Data Cleaning

In [1]:
### Import necessary libraries for the following purposes
# 1. Read CSV from S3
# 2. Sklearn and binary classification models for data analysis

import boto3
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

from helpers import *

### Define the function to read CSV from S3
# Credentials must not be hard-coded in a notebook; read them from the environment instead
import os
AAKI = os.environ['AWS_ACCESS_KEY_ID']
ASAK = os.environ['AWS_SECRET_ACCESS_KEY']
AST = os.environ['AWS_SESSION_TOKEN']
bucket = "de300spring2024"
key = "bill_yin/heart_disease.csv"
cleaned_key = "bill_yin/heart_disease_cleaned.csv"

s3 = boto3.client('s3', aws_access_key_id=AAKI, aws_secret_access_key=ASAK, aws_session_token=AST)
obj = s3.get_object(Bucket=bucket, Key=key)
data = pd.read_csv(obj['Body'])

## 1.1 Understanding the columns
Each column in the dataset is described below:

| Variable  | Description                                         | Values                                                         |
|-----------|-----------------------------------------------------|----------------------------------------------------------------|
| age       | Age of the patient                                  | age in years                                                   |
| sex       | Sex of the patient                                  | 1: male; 0: female                                             |
| painloc   | Chest pain location                                 | 1: substernal, 0: otherwise                                    |
| painexer  | Chest pain during exercise                          | 1: provoked by exertion, 0: otherwise                          |
| cp        | Chest pain type                                     | 1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic |
| trestbps  | Resting blood pressure                              | mmHg                                                           |
| smoke     | Smoking status                                      | 1: smoker, 0: otherwise                                        |
| fbs       | Fasting blood sugar > 120 mg/dl                     | 1: true, 0: false                                              |
| prop      | Beta blocker used during exercise ECG               | 1: used, 0: not used                                           |
| nitr      | Nitrates used during exercise ECG                   | 1: used, 0: not used                                           |
| pro       | Calcium channel blocker used during exercise ECG    | 1: used, 0: not used                                           |
| diuretic  | Diuretic used during exercise ECG                   | 1: used, 0: not used                                           |
| thaldur   | Duration of exercise test                           | minutes                                                        |
| thalach   | Maximum heart rate achieved                         | bpm                                                            |
| exang     | Exercise induced angina                             | 1: yes, 0: no                                                  |
| oldpeak   | ST depression induced by exercise relative to rest  | mm                                                             |
| slope     | Slope of the peak exercise ST segment               | 1: upsloping, 2: flat, 3: downsloping                          |
| target    | Diagnosis of heart disease                          | 1: heart disease, 0: no heart disease                          |


In [2]:
# Define the columns to use
columns = ['age', 'sex', 'painloc', 'painexer', 'cp', 'trestbps', 'smoke', 'fbs', 'prop', 'nitr', 
           'pro', 'diuretic', 'thaldur', 'thalach', 'exang', 'oldpeak', 'slope', 'target']

# Filter the data for wanted columns
data = data[columns]

# Validity filtering
# Function to check if any value in a row contains spaces or is non-numeric
def is_invalid(row):
    for item in row:
        if isinstance(item, str) and (' ' in item or not item.replace('.', '', 1).isdigit()):
            return True
    return False

# Filter out rows with invalid data
data = data[~data.apply(is_invalid, axis=1)]

# Keep only rows where at least 70% of the columns are non-missing (thresh counts non-NA values)
data = data.dropna(thresh=0.7*len(columns))

# Turn age from string to int if it is not a NaN, then turn NaNs to modes
data['age'] = data['age'].apply(lambda x: int(x) if not pd.isnull(x) else x)
data['age'] = data['age'].fillna(data['age'].mode()[0])

This filtering step is similar to that in HW1: it removes rows that are poorly formatted and cause issues with parsing. Rows with excessive missing values (fewer than 70% of the selected columns present) are also removed.
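
The row filter relies on the `thresh` argument of `dropna`, which counts non-missing values rather than missing ones. As a rough plain-Python sketch (the helper names `min_non_missing` and `keeps_row` are hypothetical, and the rule assumes a row is kept when its non-missing count is at least `thresh`), the 18 selected columns give a cutoff of 13 non-missing values:

```python
import math

def min_non_missing(n_columns, fraction=0.7):
    """Smallest whole number of non-missing values satisfying count >= fraction * n_columns."""
    return math.ceil(fraction * n_columns)

def keeps_row(row, fraction=0.7):
    """Return True if the row has enough non-missing (non-None) values to be kept."""
    present = sum(value is not None for value in row)
    return present >= fraction * len(row)

print(min_non_missing(18))                 # 13
print(keeps_row([1] * 13 + [None] * 5))    # True
print(keeps_row([1] * 12 + [None] * 6))    # False
```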

In [3]:
# Examine the data and find their statistics
print(data.describe())

              age         sex     painloc    painexer          cp    trestbps  \
count  840.000000  840.000000  560.000000  560.000000  840.000000  837.000000   
mean    53.058333    0.779762    0.923214    0.601786    3.253571  132.044205   
std      9.410778    0.414654    0.266489    0.489968    0.929576   19.160970   
min     28.000000    0.000000    0.000000    0.000000    1.000000    0.000000   
25%     46.000000    1.000000    1.000000    0.000000    3.000000  120.000000   
50%     54.000000    1.000000    1.000000    1.000000    4.000000  130.000000   
75%     60.000000    1.000000    1.000000    1.000000    4.000000  140.000000   
max     77.000000    1.000000    1.000000    1.000000    4.000000  200.000000   

            smoke         fbs        prop        nitr         pro    diuretic  \
count  175.000000  750.000000  833.000000  834.000000  836.000000  817.000000   
mean     0.502857    0.150667    0.283313    0.266187    0.172249    0.112607   
std      0.501427    0.3579

In [4]:
# Upload the data into S3 and download it again
data.to_csv('cleaned_data.csv', index=False)
s3.upload_file('cleaned_data.csv', bucket, cleaned_key)

obj = s3.get_object(Bucket=bucket, Key=cleaned_key)
data = pd.read_csv(obj['Body'])

In [5]:
# Visualize the number of NAs in each column
print(data.isna().sum())

age           0
sex           0
painloc     280
painexer    280
cp            0
trestbps      3
smoke       665
fbs          90
prop          7
nitr          6
pro           4
diuretic     23
thaldur       1
thalach       1
exang         1
oldpeak       8
slope       252
target        0
dtype: int64


In [6]:
# Clean the data
# 1. Replace painloc and painexer NAs with 0 (default value)
data['painloc'] = data['painloc'].fillna(0)
data['painexer'] = data['painexer'].fillna(0)

# 2. CP is a categorical variable, remove all NA rows
data = data.dropna(subset=['cp'])

# 3. Trestbps is a continuous variable, replace NAs and values <= 100 with the mode
data['trestbps'] = data['trestbps'].apply(lambda x: data['trestbps'].mode()[0] if x <= 100 else x)
data['trestbps'] = data['trestbps'].fillna(data['trestbps'].mode()[0])

# 4. Clean the smoke column
data = clean_smoke(data)

# 5. Replace fbs, prop, nitr, pro, diuretic NAs and values greater than one with 0
for col in ['fbs', 'prop', 'nitr', 'pro', 'diuretic']:
    data[col] = data[col].apply(lambda x: 0 if x > 1 else x)
    data[col] = data[col].fillna(0)

# 6. Thaldur and Thalach are continuous variables, replace NAs with the mean
data['thaldur'] = data['thaldur'].fillna(data['thaldur'].mean())
data['thalach'] = data['thalach'].fillna(data['thalach'].mean())

# 7. Exang is a binary variable, replace NAs with 0
data['exang'] = data['exang'].fillna(0)

# 8. Oldpeak is a continuous variable, replace NAs and values >= 4 or <= 0 with the mean
data['oldpeak'] = data['oldpeak'].apply(lambda x: data['oldpeak'].mean() if x >= 4 or x <= 0 else x)
data['oldpeak'] = data['oldpeak'].fillna(data['oldpeak'].mean())

# 9. Slope is a categorical variable, replace NAs with the mode
data['slope'] = data['slope'].fillna(data['slope'].mode()[0])

# 10. Review all columns to ensure no NAs
print(f"Number of rows: {len(data)}")
print("NAs review")
print(data.isna().sum())


Number of rows: 840
NAs review
age         0
sex         0
painloc     0
painexer    0
cp          0
trestbps    0
smoke       0
fbs         0
prop        0
nitr        0
pro         0
diuretic    0
thaldur     0
thalach     0
exang       0
oldpeak     0
slope       0
target      0
dtype: int64


## 1.2 Data Cleaning
Missing values in all binary columns are filled with the default value, 0. This is done to avoid any prior assumptions, as the default value encompasses all cases that are not true.

For cp and slope, the processing is done differently. Since cp has a low number of NAs, any rows that have NAs in cp are removed. For slope, the NAs are replaced with the mode of the column, as there are more than 250 such rows. Removing those rows would compromise the size of the dataset.

Lastly, missing or out-of-range values in the continuous variables are imputed: trestbps with the column's mode, and thaldur, thalach, and oldpeak with the column's mean.
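
The difference between mode and mean imputation can be illustrated with a small plain-Python sketch. The helper `impute` and the toy values are hypothetical and only meant to show the idea, not to reproduce the pandas calls used in the notebook:

```python
from statistics import mean, mode

def impute(values, strategy):
    """Replace None entries with the mode or mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if strategy == "mode" else mean(observed)
    return [fill if v is None else v for v in values]

# Toy values (illustrative only, not taken from the dataset)
resting_bp = [120, 130, 120, None, 140]
exercise_minutes = [6.0, 9.0, None, 12.0]

print(impute(resting_bp, "mode"))        # [120, 130, 120, 120, 140]
print(impute(exercise_minutes, "mean"))  # [6.0, 9.0, 9.0, 12.0]
```

Mode imputation keeps the filled value equal to one that actually occurs in the column, while mean imputation can introduce a value that no patient had.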

In [7]:
# Further augmentation
# Turn cp into a one-hot encoding
data = pd.get_dummies(data, columns=['cp'])

# Turn slope into a one-hot encoding
data = pd.get_dummies(data, columns=['slope'])

## 1.3 Parsing Categorical Variables
The categorical variables are parsed using one-hot encoding. This is done to ensure that the model does not assume any ordinal relationship between the categories.
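
As a plain-Python sketch of the idea behind one-hot encoding (the helper `one_hot` is hypothetical, and the exact column names produced by `pd.get_dummies` may differ), each category becomes its own 0/1 indicator column:

```python
def one_hot(values, categories, prefix):
    """Map each value to a dict of 0/1 indicator columns, one per category."""
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]

# Three hypothetical patients with chest pain types 1, 4, and 3
rows = one_hot([1, 4, 3], categories=[1, 2, 3, 4], prefix="cp")
print(rows[1])  # {'cp_1': 0, 'cp_2': 0, 'cp_3': 0, 'cp_4': 1}
```

Because exactly one indicator is set per row, a model cannot read cp = 4 as "twice" cp = 2.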

## 2. Modeling
The following modeling task will use Sklearn.

## NOTE
It is important to note that the results can change every time the code is run. The train/test split itself is seeded (random_state=42), but estimators such as the Random Forest and Decision Tree classifiers are created without a fixed random_state, and some hyperparameter combinations do not converge. The results discussed in the text are from the latest evaluation run, so they can differ from the printed cell outputs.

In [8]:
# 1. Split data training/testing into 90/10
X_train, X_test, y_train, y_test = train_test_split(data.drop('target', axis=1), data['target'], test_size=0.1, random_state=42, stratify=data['target'])

### 2.1 Different models testing
In the following code blocks, we will be using the listed models to evaluate the dataset:
1. Logistic Regression Classifier
2. Random Forest Classifier
3. Support Vector Classifier
4. Decision Tree Classifier
5. K-Nearest Neighbors Classifier

Each model will be trained and tested using the same dataset and the same train/test split. The models will be evaluated using the following metrics:
1. Accuracy
2. Precision
3. Recall
4. F1 Score
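
For reference, all four metrics can be derived from a 2x2 confusion matrix. The sketch below uses the cross-validation confusion matrix reported for Logistic Regression in this notebook; the helper `binary_metrics` is hypothetical and treats class 1 (heart disease) as the positive class:

```python
def binary_metrics(confusion):
    """Compute accuracy, precision, recall, and F1 for class 1.

    `confusion` is laid out as [[TN, FP], [FN, TP]] (rows: actual, columns: predicted).
    """
    (tn, fp), (fn, tp) = confusion
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Cross-validation confusion matrix printed for Logistic Regression
acc, prec, rec, f1 = binary_metrics([[274, 72], [67, 343]])
print(round(acc, 4), round(prec, 2), round(rec, 2), round(f1, 2))  # 0.8161 0.83 0.84 0.83
```

These values agree with the accuracy and the class 1 row of the corresponding classification report.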

To evaluate the models across different hyperparameters, GridSearchCV, a built-in function from Sklearn, will be used. It tests every combination of the hyperparameters in the grid, and the corresponding _best model_ is then used to evaluate the dataset. Its metrics will also be displayed.

It is important to note that the **best** model is the one with the highest five-fold cross-validation accuracy on the training set; that model is then evaluated on the held-out test set.
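
To get a feel for the cost of an exhaustive search, the combinations in a grid can be enumerated directly. This is only a sketch of the enumeration, not Sklearn's implementation; the helper `expand_grid` is hypothetical, and the grid is the Logistic Regression grid from this notebook:

```python
from itertools import product

def expand_grid(grid):
    """Return every hyperparameter combination in a dict-of-lists grid."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]

params_lr = {
    'C': [0.1, 1, 10],
    'penalty': ['l2', 'l1'],
    'solver': ['liblinear', 'lbfgs'],
    'multi_class': ['auto'],
    'max_iter': [1000],
}

combos = expand_grid(params_lr)
print(len(combos))       # 12 combinations
print(len(combos) * 5)   # 60 cross-validation fits with five folds
```

The number of fits grows multiplicatively with every value added to the grid, which is why the grids in this notebook are kept small.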

In [9]:
### Logistic Regression Classifier
params_lr = {
    'C': [0.1, 1, 10],           # Narrow down the range to middle values
    'penalty': ['l2', 'l1'],           # Note: 'l1' is supported by 'liblinear' but not by 'lbfgs'
    'solver': ['liblinear', 'lbfgs'],  # Focus on the most commonly effective solvers
    'multi_class': ['auto'],     # Let the model decide based on the dataset
    'max_iter': [1000]            # Raise max_iter above the default of 100 to help convergence
}

clf_lr = LogisticRegression()
find_best_model(clf_lr, params_lr, X_train, y_train, X_test, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Best Parameters: {'C': 1, 'max_iter': 1000, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}
Best Score is 0.8162077378877658
########################## Cross-Validation Evaluation ##########################
Cross-Validation Accuracy: 0.8161375661375662
Confusion Matrix:
[[274  72]
 [ 67 343]]
Confusion Matrix (as percentages):
          Predicted 0  Predicted 1
Actual 0    36.243386      9.52381
Actual 1     8.862434     45.37037

Classification Report:
              precision    recall  f1-score   support

         0.0       0.80      0.79      0.80       346
         1.0       0.83      0.84      0.83       410

    accuracy                           0.82       756
   macro avg       0.82      0.81      0.81       756
weighted avg       0.82      0.82      0.82       756

########################## Test Set Evaluation ##########################
Test Set Accuracy: 0.7976190476190477
Confusion Matrix:
[[26 12]
 [ 5 41]]
Confusion Matrix (as percentages):
          Predic

### 2.1.1 Logistic Regression Classifier
After reading the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), the following hyperparameters seem useful:
1. C (float): inverse of regularization strength
2. penalty ('l1' or 'l2'): penalty type used
3. solver ('newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'): algorithm used in optimization problem
4. multi_class ('ovr', 'multinomial', 'auto'): type of multi-class classification
5. max_iter (int): maximum number of iterations taken for the solvers to converge

From the results above, the best model had the following hyperparameters:
| Parameter   | Value     | Note     |
|-------------|-----------|----------|
| C           | 0.1       |          |
| Max_iter    | 1000      |          |
| Multi_class | auto      | default  |
| Penalty     | l2        | default  |
| Solver      | lbfgs     | default  |

As seen, penalty, solver, and multi_class are at their default values, so the defaults are already a good fit for this model. max_iter was fixed at 1000 in the grid rather than selected by the search; the default of 100 was raised to give the solvers more room to converge, and the warnings above show that some fits still hit the iteration limit.

With C = 0.1, regularization is stronger than with the default of C = 1. This means that the model is less likely to overfit the data.

However, given the wide disparity between the cross-validation accuracy and the test accuracy, it is possible that the model is still overfitting the data.

**5-CV Accuracy**: 0.819

**Test Accuracy**: 0.75
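
The convergence warnings above recommend scaling the data. One possible way to do this, sketched here under assumptions, is to put `StandardScaler` and the classifier in a `Pipeline` so that the scaler is fitted only on the training folds during the grid search. The synthetic data from `make_classification` is a stand-in for the heart disease dataset, and the step names `scale` and `clf` are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the heart disease dataset)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                 # zero mean, unit variance per feature
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {"clf__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Whether scaling changes the selected hyperparameters or the accuracy on this dataset would need to be checked by rerunning the notebook.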

In [10]:
### Random Forest Classifier
# Hyperparameter grid
params_rf = {
    'n_estimators': [100, 300],  # Focus on fewer but more impactful options
    'max_depth': [None, 20],     # Test only unlimited and a moderately deep tree
    'bootstrap': [True],         # Bootstrap typically provides better results
    'min_samples_leaf': [1, 2, 4],  # Test the extremes to see the effect of this parameter
    'max_leaf_nodes': [None, 50] # Limiting the choice to unrestricted and a reasonable limit
}

clf_rf = RandomForestClassifier()
find_best_model(clf_rf, params_rf, X_train, y_train, X_test, y_test)

Best Parameters: {'bootstrap': True, 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_leaf': 4, 'n_estimators': 100}
Best Score is 0.8056291390728477
########################## Cross-Validation Evaluation ##########################
Cross-Validation Accuracy: 0.8015873015873016
Confusion Matrix:
[[262  84]
 [ 66 344]]
Confusion Matrix (as percentages):
          Predicted 0  Predicted 1
Actual 0    34.656085    11.111111
Actual 1     8.730159    45.502646

Classification Report:
              precision    recall  f1-score   support

         0.0       0.80      0.76      0.78       346
         1.0       0.80      0.84      0.82       410

    accuracy                           0.80       756
   macro avg       0.80      0.80      0.80       756
weighted avg       0.80      0.80      0.80       756

########################## Test Set Evaluation ##########################
Test Set Accuracy: 0.8214285714285714
Confusion Matrix:
[[27 11]
 [ 4 42]]
Confusion Matrix (as percentages):

### 2.2 Random Forest Classifier
After reading the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), the following hyperparameters seem useful:
1. n_estimators (int): number of trees in the forest
2. max_depth (int): maximum depth of the tree
3. bootstrap (bool): whether bootstrap samples are used when building trees
4. min_samples_leaf (int): minimum number of samples required to be at a leaf node
5. max_leaf_nodes (int): maximum number of leaf nodes in each tree

From the results above, the best model had the following hyperparameters:
| Parameter         | Value     | Note     |
|-------------------|-----------|----------|
| Bootstrap         | True      | default  |
| Max_depth         | None      | default  |
| Max_leaf_nodes    | 50        |          |
| Min_samples_leaf  | 4         |          |
| N_estimators      | 100       | default  |

Similar to the Logistic Regression Classifier, the best Random Forest Classifier also used default values for most of its hyperparameters.

Max_leaf_nodes = 50 and min_samples_leaf = 4 are the only hyperparameters that are not at their defaults. With these values, the model is less likely to overfit the data, as each tree is limited to 50 leaf nodes and every leaf must contain at least 4 samples.

Since one of the weaknesses of Random Forest is the potential for overfitting (albeit less likely compared to a single decision tree, as the forest aggregates the votes of many trees), the hyperparameters selected in the best model are useful to prevent this.

**5-CV Accuracy**: 0.803

**Test Accuracy**: 0.738
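
The voting idea behind a forest can be sketched in a few lines of plain Python. This is a simplified picture: Sklearn's RandomForestClassifier averages the class probabilities of its trees rather than counting hard votes, and the helper `majority_vote` and the votes below are hypothetical:

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the most trees for one sample."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five trees for a single patient
print(majority_vote([1, 0, 1, 1, 0]))  # 1
```

A single tree that overfits can be outvoted by the others, which is why the ensemble tends to be more stable than one decision tree.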

In [11]:
### Support Vector Machine Classifier
params_svc = {
    'C': [1, 10],  # Focusing on a narrower, effective range
    'kernel': ['rbf', 'linear', 'sigmoid'],  # Most commonly used and generally effective
    'gamma': ['scale', 'auto']  # Automatic adaptation to feature scale
}
clf_svc = SVC()
find_best_model(clf_svc, params_svc, X_train, y_train, X_test, y_test)

Best Parameters: {'C': 10, 'gamma': 'scale', 'kernel': 'linear'}
Best Score is 0.8135500174276752
########################## Cross-Validation Evaluation ##########################
Cross-Validation Accuracy: 0.8134920634920635
Confusion Matrix:
[[254  92]
 [ 49 361]]
Confusion Matrix (as percentages):
          Predicted 0  Predicted 1
Actual 0    33.597884    12.169312
Actual 1     6.481481    47.751323

Classification Report:
              precision    recall  f1-score   support

         0.0       0.84      0.73      0.78       346
         1.0       0.80      0.88      0.84       410

    accuracy                           0.81       756
   macro avg       0.82      0.81      0.81       756
weighted avg       0.82      0.81      0.81       756

########################## Test Set Evaluation ##########################
Test Set Accuracy: 0.7976190476190477
Confusion Matrix:
[[24 14]
 [ 3 43]]
Confusion Matrix (as percentages):
          Predicted 0  Predicted 1
Actual 0    28.571429  

### 2.3 Support Vector Classifier
After reading the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), the following hyperparameters seem useful:

1. C (float): penalty parameter of the error term
2. kernel ('linear', 'poly', 'rbf', 'sigmoid', 'precomputed'): kernel type used in the algorithm
3. gamma ('scale', 'auto', float): kernel coefficient for 'rbf', 'poly', 'sigmoid'

From the results above, the best model had the following hyperparameters:
| Parameter |  Value   |   Note   |
|-----------|----------|----------|
| C         | 10       |          |
| Gamma     | scale    | default  |
| Kernel    | linear   |          |

Because this model uses a linear kernel, it is less likely to overfit the data: the linear kernel is less complex than the other kernels.

However, the C value is 10, larger than the default of 1, which means weaker regularization. That the model still does not appear to overfit with less regularization can be attributed either to the quality of the cleaned data or to a good model. Further testing and parameter tuning are required to determine this.

**5-CV Accuracy**: 0.819

**Test Accuracy**: 0.75
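
To make the difference in kernel complexity concrete, here is a small plain-Python sketch of the linear and RBF kernel functions. The helper names and the input vectors are hypothetical, and `gamma` is passed as a plain number rather than derived from the data as `'scale'` would do:

```python
import math

def linear_kernel(x, y):
    """Dot product: similarity grows linearly with the features."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: similarity decays with squared distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y))          # 11.0
print(rbf_kernel(x, x, gamma=0.5))  # 1.0
print(rbf_kernel(x, y, gamma=0.5))  # exp(-4), roughly 0.018
```

The linear kernel can only produce a flat decision boundary in the original feature space, whereas the RBF kernel can bend the boundary around individual training points, which is where the extra risk of overfitting comes from.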

In [12]:
### Decision Tree Classifier
params_dt = {
    'max_depth': [None, 10],     # Focus on either no limit or a practical depth
    'min_samples_split': [2, 4, 10],  # Test the minimum and a slightly higher value
    'min_samples_leaf': [1, 4],  # Small and moderate leaves
    'max_features': [10, 20, 100],    # Number of features considered at each split
    'max_leaf_nodes': [None, 50],  # Unrestricted and a moderate cap
    'criterion': ['gini']        # Focus on Gini which is generally faster and as effective
}
clf_dt = DecisionTreeClassifier()
find_best_model(clf_dt, params_dt, X_train, y_train, X_test, y_test)

Best Parameters: {'criterion': 'gini', 'max_depth': 10, 'max_features': 20, 'max_leaf_nodes': 50, 'min_samples_leaf': 1, 'min_samples_split': 4}
Best Score is 0.7843848030672708
########################## Cross-Validation Evaluation ##########################
Cross-Validation Accuracy: 0.75
Confusion Matrix:
[[252  94]
 [ 95 315]]
Confusion Matrix (as percentages):
          Predicted 0  Predicted 1
Actual 0    33.333333    12.433862
Actual 1    12.566138    41.666667

Classification Report:
              precision    recall  f1-score   support

         0.0       0.73      0.73      0.73       346
         1.0       0.77      0.77      0.77       410

    accuracy                           0.75       756
   macro avg       0.75      0.75      0.75       756
weighted avg       0.75      0.75      0.75       756

########################## Test Set Evaluation ##########################
Test Set Accuracy: 0.8214285714285714
Confusion Matrix:
[[30  8]
 [ 7 39]]
Confusion Matrix (as percen

### 2.4 Decision Tree Classifier
After reading the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), the following hyperparameters seem useful:

1. max_depth (int): maximum depth of the tree
2. min_samples_split (int): minimum number of samples required to split an internal node
3. min_samples_leaf (int): minimum number of samples required to be at a leaf node
4. max_features (int, float, 'sqrt', 'log2', or None): number of features to consider when looking for the best split
5. criterion ('gini', 'entropy'): function to measure the quality of a split

Interestingly, the Decision Tree Classifier's important hyperparameters are similar to those of the Random Forest Classifier. This is not surprising, as a Random Forest is an ensemble of Decision Trees.

From the results above, the best model had the following hyperparameters:
| Parameter         | Value     | Note     |
|-------------------|-----------|----------|
| Criterion         | gini      | default  |
| Max_depth         | 10        |          |
| Max_features      | 20        |          |
| Max_leaf_nodes    | 50        |          |
| Min_samples_leaf  | 1         | default  |
| Min_samples_split | 4         |          |

The Decision Tree Classifier has a few hyperparameters that differ from those of the Random Forest Classifier. The most notable difference is max_features, which is set to 20. This means that at most 20 features are considered when looking for the best split at each node. Limiting the candidate features in this way can help prevent the model from overfitting the data.

Hyperparameters such as max_depth, max_features, and max_leaf_nodes are all here to prevent the model from overfitting the data.

**5-CV Accuracy**: 0.771

**Test Accuracy**: 0.738
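
The `gini` criterion scores a candidate split by how mixed the class labels are in each resulting node. A minimal plain-Python sketch of the impurity measure follows; the helper `gini_impurity` and the label lists are hypothetical:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini_impurity([0, 0, 1, 1]))  # 0.5
print(gini_impurity([1, 1, 1, 1]))  # 0.0
```

A node with an even mix of the two classes has the maximum binary impurity of 0.5, and a pure node has impurity 0, so the tree prefers splits that move its children toward 0.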

In [13]:
### K-Nearest Neighbors Classifier
params_knn = {
    'n_neighbors': [5, 10],      # Test a moderate number of neighbors
    'weights': ['uniform', 'distance'],  # Still important to compare these two
    'algorithm': ['ball_tree', 'kd_tree', 'brute'],       # Compare the explicit neighbor-search algorithms
    'leaf_size': [30, 60],        # Test a couple of reasonable sizes
    'p': [1, 2]                     # 1: Manhattan distance, 2: Euclidean distance
}
clf_knn = KNeighborsClassifier()
find_best_model(clf_knn, params_knn, X_train, y_train, X_test, y_test)

Best Parameters: {'algorithm': 'ball_tree', 'leaf_size': 30, 'n_neighbors': 10, 'p': 1, 'weights': 'distance'}
Best Score is 0.7195887068665041
########################## Cross-Validation Evaluation ##########################
Cross-Validation Accuracy: 0.7195767195767195
Confusion Matrix:
[[236 110]
 [102 308]]
Confusion Matrix (as percentages):
          Predicted 0  Predicted 1
Actual 0    31.216931    14.550265
Actual 1    13.492063    40.740741

Classification Report:
              precision    recall  f1-score   support

         0.0       0.70      0.68      0.69       346
         1.0       0.74      0.75      0.74       410

    accuracy                           0.72       756
   macro avg       0.72      0.72      0.72       756
weighted avg       0.72      0.72      0.72       756

########################## Test Set Evaluation ##########################
Test Set Accuracy: 0.7261904761904762
Confusion Matrix:
[[23 15]
 [ 8 38]]
Confusion Matrix (as percentages):
          Pr

### 2.5 KNN Classifier
After reading the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), the following hyperparameters seem useful:

1. n_neighbors (int): number of neighbors to use
2. weights ('uniform', 'distance'): weight function used in prediction
3. algorithm ('auto', 'ball_tree', 'kd_tree', 'brute'): algorithm used to compute the nearest neighbors
4. leaf_size (int): leaf size passed to BallTree or KDTree
5. p (int): power parameter for the Minkowski metric (1 for manhattan, 2 for euclidean)

From the results above, the best model had the following hyperparameters:
| Parameter   | Value     | Note    |
|-------------|-----------|---------|
| Algorithm   | ball_tree |         |
| Leaf_size   | 30        | default |
| N_neighbors | 10        |         | 
| p           | 1         |         |
| Weights     | uniform   | default |

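N_neighbors being larger than the default of 5 is useful, as relying on more neighbors helps prevent the model from overfitting the data. A notable hyperparameter here is the algorithm parameter, where ball_tree is used. Ball_tree tends to work well in higher dimensional data and in datasets that naturally cluster, which intuitively makes sense in a healthcare dataset. Note that ball_tree, kd_tree, and brute all perform an exact neighbor search, so this choice mainly affects speed rather than accuracy.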

**5-CV Accuracy**: 0.721

**Test Accuracy**: 0.679
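
The `p` parameter selects the order of the Minkowski distance used to find neighbors. A short plain-Python sketch (the helper `minkowski` and the two points are hypothetical) shows how p = 1 and p = 2 differ:

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

a, b = [0, 0], [3, 4]
print(minkowski(a, b, p=1))  # 7.0 (Manhattan)
print(minkowski(a, b, p=2))  # 5.0 (Euclidean)
```

Because both distances add up raw feature differences, features with large numeric ranges such as trestbps or thalach can dominate the binary columns unless the features are scaled.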

### 2.6 Classifier Conclusion
Below is a table that compares the models based on their 5-CV and testing accuracy:

| Model                   | 5-CV Accuracy | Test Accuracy |
|-------------------------|---------------|---------------|
| Logistic Regression     | 0.819         | 0.75          |
| Random Forest           | 0.803         | 0.738         |
| Support Vector          | 0.819         | 0.75          |
| Decision Tree           | 0.771         | 0.738         |
| KNN                     | 0.721         | 0.679         |

From the table above, the Logistic Regression and Support Vector Classifiers have the highest 5-CV accuracy (0.819) and the highest test accuracy (0.75). However, it is important to note that not all hyperparameter combinations were tested. The data cleaning process could also be improved, and a larger dataset could further improve the models' performance.
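
When reading the test accuracies, it helps to remember how small the test set is. With 840 cleaned rows and test_size=0.1, the test set holds 84 rows, so the arithmetic below (a back-of-the-envelope check, not part of the original analysis) shows how much a single prediction moves the score:

```python
n_total = 840                      # rows after cleaning
n_test = round(n_total * 0.1)      # 84 rows with test_size=0.1

# Accuracy change caused by one additional misclassified test row
print(round(1 / n_test, 4))        # 0.0119

# Correct predictions implied by the test accuracies in the table
for correct in (63, 62, 57):
    print(correct, round(correct / n_test, 3))
```

The test accuracies of 0.75 and 0.738 therefore differ by a single test row, so the ranking among the top four models on the test set should be read with caution.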