# Decision Tree and Ensemble Models

This notebook covers the Decision Tree Classifier and several ensemble methods.
First, we compare the performance of a standard DecisionTreeClassifier with ensemble models, using two examples each of bagging and boosting, on the preprocessed training data.
Second, halving grid search with cross-validation is used to tune the hyperparameters of each model, taking different resampling methods and principal component analysis (PCA) into account.
Finally, precision-recall curves are used for analysis, and the best-performing classifier among the selected Decision Tree and ensemble methods is determined by comparing the models on the F1 score, which serves as the scoring metric in the halving grid search.


The steps followed in this notebook are:
1. **Initial Exploration of Models**: Evaluate the performance of DecisionTreeClassifier and of ensemble methods, namely bagging and boosting (with two examples each), on the preprocessed training data.
2. **Hyperparameter Tuning - Each Model**: Use halving grid search with cross-validation, different resampling methods, and PCA (data with reduced dimensionality) to find the optimal hyperparameters for each classifier based on the F1 score.
3. **Hyperparameter Tuning - Best Model**: Use halving grid search with cross-validation, different resampling methods, and PCA (data with reduced dimensionality) to find the best model with its optimal hyperparameters based on the F1 score.
4. **Precision-Recall Analysis**: Plot precision-recall curves for the tuned version of each model and, finally, for the best-performing model, because the F1 scoring metric is the harmonic mean of precision and recall.

Through these steps, we intend to find the classifier with the best F1 score among the selected Decision Tree and ensemble models.
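
Since the F1 score drives model selection throughout this notebook, the following minimal sketch shows how it follows from precision and recall. The function name `f1_from_counts` and the counts are hypothetical and chosen only for illustration. In the notebook itself the metric comes from scikit-learn.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion-matrix counts (not results from this notebook)
example_f1 = f1_from_counts(tp=30, fp=10, fn=30)
print(f"precision=0.75, recall=0.50, F1={example_f1:.2f}")
```

The harmonic mean is pulled toward the smaller of the two values, so a model cannot reach a high F1 by maximizing only precision or only recall.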


# 1. Initial Exploration of Models

In [None]:
# Make imports and preparations to load the data
import os
import sys

sys.path.append(os.path.abspath("../scripts"))
from data_loader import DataLoader

# Load the data
data_loader = DataLoader()
X_train, y_train = data_loader.training_data
X_val, y_val = data_loader.validation_data
X_test, y_test = data_loader.test_data

In [None]:
# Imports for the decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the decision tree model with default parameters
decision_tree = DecisionTreeClassifier(random_state=42)

# Train the model on the preprocessed training data
decision_tree.fit(X_train, y_train)

# Make predictions on the training set
y_train_pred = decision_tree.predict(X_train)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train, y_train_pred)

# Make predictions on the validation set
y_val_pred = decision_tree.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

=> TODO: The model is overfitting: training accuracy is almost 100%, while validation accuracy is only 78%.

In [None]:
# TODO above

In [None]:
#TODO is three layers already too much?

import matplotlib.pyplot as plt
from sklearn import tree

# Visualize the tree of the model but only show the first three layers (max_depth=3)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    decision_tree,
    feature_names=X_train.columns,
    class_names=["No Diabetes", "Diabetes"],
    max_depth=3 
)
plt.show()

## Bagging

### Bagging (Example: Bagging with Decision Tree)
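
As a rough illustration of the idea behind bagging, the sketch below draws a bootstrap replicate and combines base-estimator predictions by majority vote. The helper names are hypothetical, and this is not how `BaggingClassifier` is implemented: scikit-learn averages predicted probabilities when the base estimator provides them, whereas hard voting is used here for simplicity.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap replicate)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most common label among the base estimators' predictions."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)
rows = list(range(10))  # stand-in for ten training rows
replicate = bootstrap_sample(rows, rng)

# Same size as the original data, but typically with repeated rows
print(len(replicate), len(set(replicate)))

# Hypothetical predictions of five base estimators for one sample
vote = majority_vote([1, 0, 1, 1, 0])
print(vote)
```

Each base estimator sees a slightly different replicate, which is what lets the ensemble average out part of the variance of a single deep tree.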

In [None]:
# Imports for the bagging classifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the base estimator
estimator = DecisionTreeClassifier(random_state=42)

# Initialize the Bagging classifier with the base estimator
bagging_decision_tree = BaggingClassifier(estimator=estimator, n_estimators=100, random_state=42)

# Train the model on the preprocessed training data
bagging_decision_tree.fit(X_train, y_train)

# Make predictions on the training set
y_train_pred = bagging_decision_tree.predict(X_train)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train, y_train_pred)

# Make predictions on the validation set
y_val_pred = bagging_decision_tree.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

=> still overfitting

In [None]:
#TODO is three layers already too much?

import matplotlib.pyplot as plt
from sklearn import tree

# Visualize the first tree of the model but only show the first three layers (max_depth=3)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    bagging_decision_tree.estimators_[0],
    feature_names=X_train.columns,
    class_names=["No Diabetes", "Diabetes"],
    max_depth=3 
)
plt.show()

### Bagging (Example: Random Forest Classifier)

In [None]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

#TODO Adjust max_depth to? 
#TODO random_state = 0 is said to be common for reproducibility, but I have also read that 42 is used as an arbitrary number?
#TODO is it okay to set zero_division to 1?

# Initialize the random forest ensemble model with default parameters
bagging_random_forest = RandomForestClassifier(random_state=42)

# Train the model on the preprocessed training data
bagging_random_forest.fit(X_train, y_train)

# Make predictions on the training set
y_train_pred = bagging_random_forest.predict(X_train)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train, y_train_pred)

# Make predictions on the validation set
y_val_pred = bagging_random_forest.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

=> Still overfitting

In [None]:
#TODO is three layers already too much?

import matplotlib.pyplot as plt
from sklearn import tree

# Visualize the first tree of the model but only show the first three layers (max_depth=3)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    bagging_random_forest.estimators_[0],  
    feature_names=X_train.columns,
    class_names=["No Diabetes", "Diabetes"],
    max_depth=3 
)
plt.show()


## Boosting

### Boosting (Example: Adaptive Boosting)
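
To make the reweighting idea behind adaptive boosting concrete, here is a minimal sketch of one round of textbook discrete AdaBoost with labels in {-1, +1}. The function name `adaboost_round` is hypothetical, and the formulas are the classic ones, not necessarily the exact variant that `AdaBoostClassifier` runs internally.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One round of discrete AdaBoost for labels in {-1, +1}.

    Assumes the sample weights sum to 1. Returns the weak learner's
    weight (alpha) and the renormalized sample weights.
    """
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (t * p == -1) are scaled up, correct ones down
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Four equally weighted samples; the weak learner gets the last one wrong
alpha, new_weights = adaboost_round(
    weights=[0.25, 0.25, 0.25, 0.25],
    y_true=[1, 1, -1, -1],
    y_pred=[1, 1, -1, 1],
)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.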

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# Initialize the AdaBoost ensemble model (default base estimator: a depth-1 decision tree, i.e. a decision stump; other parameters at their defaults)
adaptive_boosting = AdaBoostClassifier(random_state=42)

# Train the model on the preprocessed training data
adaptive_boosting.fit(X_train, y_train)

# Make predictions on the training set
y_train_pred = adaptive_boosting.predict(X_train)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train, y_train_pred)

# Make predictions on the validation set
y_val_pred = adaptive_boosting.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

=> No overfitting, finally

In [None]:
#TODO only one layer => decision stump

import matplotlib.pyplot as plt
from sklearn import tree

# Visualize the first tree of the model (with this model only one split and layer occurs)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    adaptive_boosting.estimators_[0],  
    feature_names=X_train.columns,
    class_names=["No Diabetes", "Diabetes"],
)
plt.show()

=> Only one split and one layer occur; such a tree is called a decision stump.
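
A single-split tree can be sketched in a few lines. The version below is a simplified illustration with hypothetical names and data: it picks the threshold that minimizes the misclassification count on one feature, whereas scikit-learn's trees choose splits by an impurity criterion (Gini by default) over all features.

```python
def fit_stump(values, labels):
    """Find the threshold on one feature that minimizes misclassifications.

    Predicts 1 when value > threshold, else 0. Returns (threshold, errors).
    """
    best_threshold, best_errors = None, len(labels) + 1
    candidates = sorted(set(values))
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2  # midpoint between neighboring values
        errors = sum((v > threshold) != bool(y) for v, y in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


# Hypothetical feature values and binary labels (not taken from the dataset)
values = [85, 90, 110, 140, 150, 160]
labels = [0, 0, 0, 1, 1, 1]
stump = fit_stump(values, labels)
print(stump)
```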

### Boosting (Example: Extreme Gradient Boosting)


In [None]:
from xgboost import XGBClassifier

# Initialize the XGBoost ensemble model (with a decision tree as the default base classifier and other default parameters)
extreme_gradient_boosting = XGBClassifier(random_state=42)

# Train the model on the preprocessed training data
extreme_gradient_boosting.fit(X_train, y_train)

# Make predictions on the training set
y_train_pred = extreme_gradient_boosting.predict(X_train)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train, y_train_pred)

# Make predictions on the validation set
y_val_pred = extreme_gradient_boosting.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

=> No overfitting

In [None]:
# #TODO is three layers already too much?

# import matplotlib.pyplot as plt
# from sklearn import tree

# # Visualize the first tree of the model but only show the first three layers (max_depth=3)
# plt.figure(figsize=(20, 10))
# tree.plot_tree(
#     extreme_gradient_boosting.get_booster().get_dump()[0],  
#     feature_names=X_train.columns,
#     class_names=["No Diabetes", "Diabetes"],
#     max_depth=3
# )
# plt.show()

In [None]:
#TODO XGBClassifier comes from the xgboost library, not from sklearn, so sklearn's tree.plot_tree cannot be used here
# therefore need to find a different way to visualize the tree (e.g. xgboost.plot_tree, which requires graphviz)

### Random Undersampling
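
The following sketch illustrates what random undersampling does conceptually: rows of the larger classes are dropped at random until every class is as small as the rarest one. The function `random_undersample` and the toy data are hypothetical. How `DataLoader` produces `training_data_undersampling_random` is not shown in this notebook.

```python
import random
from collections import Counter, defaultdict


def random_undersample(X, y, seed=42):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    target = min(len(indices) for indices in by_class.values())
    kept = sorted(
        i for indices in by_class.values() for i in rng.sample(indices, target)
    )
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data: eight majority rows and two minority rows
X_toy = [[i] for i in range(10)]
y_toy = [0] * 8 + [1] * 2
X_under, y_under = random_undersample(X_toy, y_toy)
print(Counter(y_under))
```

The training set shrinks, which is why the shapes printed in the next cell are smaller than those of the original training data.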

In [None]:
# test random undersampling
X_train_undersampling_random, y_train_undersampling_random = data_loader.training_data_undersampling_random
X_val, y_val = data_loader.validation_data
X_test, y_test = data_loader.test_data

print(f"X_train_undersampling_random shape: {X_train_undersampling_random.shape}")
print(f"y_train_undersampling_random shape: {y_train_undersampling_random.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

In [None]:
# Imports for the decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the decision tree model with default parameters
decision_tree = DecisionTreeClassifier(random_state=42)

# Train the model on the preprocessed training data
decision_tree.fit(X_train_undersampling_random, y_train_undersampling_random)

# Make predictions on the training set
y_train_pred_undersampling_random = decision_tree.predict(X_train_undersampling_random)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train_undersampling_random, y_train_pred_undersampling_random)

# Make predictions on the validation set
y_val_pred = decision_tree.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

In [None]:
#TODO is three layers already too much?

import matplotlib.pyplot as plt
from sklearn import tree

# Visualize the tree of the model but only show the first three layers (max_depth=3)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    decision_tree,
    feature_names=X_train_undersampling_random.columns,
    class_names=["No Diabetes", "Diabetes"],
    max_depth=3 
)
plt.show()

In [None]:
# Imports for the bagging classifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the base estimator
estimator = DecisionTreeClassifier(random_state=42)

# Initialize the Bagging classifier with the base estimator
bagging_decision_tree = BaggingClassifier(estimator=estimator, n_estimators=100, random_state=42)

# Train the model on the preprocessed training data
bagging_decision_tree.fit(X_train_undersampling_random, y_train_undersampling_random)

# Make predictions on the training set
y_train_pred_undersampling_random = bagging_decision_tree.predict(X_train_undersampling_random)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train_undersampling_random, y_train_pred_undersampling_random)

# Make predictions on the validation set
y_val_pred = bagging_decision_tree.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

In [None]:
#TODO is three layers already too much?

import matplotlib.pyplot as plt
from sklearn import tree

# Visualize the first tree of the model but only show the first three layers (max_depth=3)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    bagging_decision_tree.estimators_[0],
    feature_names=X_train_undersampling_random.columns,
    class_names=["No Diabetes", "Diabetes"],
    max_depth=3 
)
plt.show()

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# Initialize the AdaBoost ensemble model (with a decision tree as the default base classifier and other default parameters)
adaptive_boosting = AdaBoostClassifier(random_state=42)

# Train the model on the preprocessed training data
adaptive_boosting.fit(X_train_undersampling_random, y_train_undersampling_random)

# Make predictions on the training set
y_train_pred_undersampling_random = adaptive_boosting.predict(X_train_undersampling_random)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train_undersampling_random, y_train_pred_undersampling_random)

# Make predictions on the validation set
y_val_pred = adaptive_boosting.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

In [None]:
#TODO only one layer => decision stump

import matplotlib.pyplot as plt
from sklearn import tree

# Visualize the first tree of the model (with this model only one split and layer occurs)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    adaptive_boosting.estimators_[0],  
    feature_names=X_train_undersampling_random.columns,
    class_names=["No Diabetes", "Diabetes"],
)
plt.show()

In [None]:
from xgboost import XGBClassifier

# Initialize the XGBoost ensemble model (with a decision tree as the default base classifier and other default parameters)
extreme_gradient_boosting = XGBClassifier(random_state=42)

# Train the model on the preprocessed training data
extreme_gradient_boosting.fit(X_train_undersampling_random, y_train_undersampling_random)

# Make predictions on the training set
y_train_pred_undersampling_random = extreme_gradient_boosting.predict(X_train_undersampling_random)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train_undersampling_random, y_train_pred_undersampling_random)

# Make predictions on the validation set
y_val_pred = extreme_gradient_boosting.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

### Random Oversampling
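
Random oversampling is the mirror image of undersampling: minority rows are duplicated at random until all classes match the largest one. The sketch below uses a hypothetical helper and toy data. The actual resampling behind `training_data_oversampling_random` lives in `DataLoader` and is not reproduced here.

```python
import random
from collections import Counter, defaultdict


def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    target = max(len(indices) for indices in by_class.values())
    chosen = []
    for indices in by_class.values():
        # Sample with replacement to fill the gap to the majority size
        extra = [rng.choice(indices) for _ in range(target - len(indices))]
        chosen.extend(indices + extra)
    return [X[i] for i in chosen], [y[i] for i in chosen]


# Toy data: eight majority rows and two minority rows
X_toy = [[i] for i in range(10)]
y_toy = [0] * 8 + [1] * 2
X_over, y_over = random_oversample(X_toy, y_toy)
print(Counter(y_over))
```

Because the added rows are exact copies, tree-based models can memorize them easily, so training accuracy on oversampled data tends to look optimistic.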

In [None]:
# test random oversampling
X_train_oversampling_random, y_train_oversampling_random = data_loader.training_data_oversampling_random
X_val, y_val = data_loader.validation_data
X_test, y_test = data_loader.test_data

print(f"X_train_oversampling_random shape: {X_train_oversampling_random.shape}")
print(f"y_train_oversampling_random shape: {y_train_oversampling_random.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

In [None]:
# Imports for the decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the decision tree model with default parameters
decision_tree = DecisionTreeClassifier(random_state=42)

# Train the model on the preprocessed training data
decision_tree.fit(X_train_oversampling_random, y_train_oversampling_random)

# Make predictions on the training set
y_train_pred_oversampling_random = decision_tree.predict(X_train_oversampling_random)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train_oversampling_random, y_train_pred_oversampling_random)

# Make predictions on the validation set
y_val_pred = decision_tree.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

In [None]:
#TODO is three layers already too much?

import matplotlib.pyplot as plt
from sklearn import tree

# Visualize the tree of the model but only show the first three layers (max_depth=3)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    decision_tree,
    feature_names=X_train_oversampling_random.columns,
    class_names=["No Diabetes", "Diabetes"],
    max_depth=3 
)
plt.show()

In [None]:
# Imports for the bagging classifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the base estimator
estimator = DecisionTreeClassifier(random_state=42)

# Initialize the Bagging classifier with the base estimator
bagging_decision_tree = BaggingClassifier(estimator=estimator, n_estimators=100, random_state=42)

# Train the model on the preprocessed training data
bagging_decision_tree.fit(X_train_oversampling_random, y_train_oversampling_random)

# Make predictions on the training set
y_train_pred_oversampling_random = bagging_decision_tree.predict(X_train_oversampling_random)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train_oversampling_random, y_train_pred_oversampling_random)

# Make predictions on the validation set
y_val_pred = bagging_decision_tree.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

In [None]:
#TODO is three layers already too much?

import matplotlib.pyplot as plt
from sklearn import tree

# Visualize the first tree of the model but only show the first three layers (max_depth=3)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    bagging_decision_tree.estimators_[0],
    feature_names=X_train_oversampling_random.columns,
    class_names=["No Diabetes", "Diabetes"],
    max_depth=3 
)
plt.show()

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# Initialize the AdaBoost ensemble model (with a decision tree as the default base classifier and other default parameters)
adaptive_boosting = AdaBoostClassifier(random_state=42)

# Train the model on the preprocessed training data
adaptive_boosting.fit(X_train_oversampling_random, y_train_oversampling_random)

# Make predictions on the training set
y_train_pred_oversampling_random = adaptive_boosting.predict(X_train_oversampling_random)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train_oversampling_random, y_train_pred_oversampling_random)

# Make predictions on the validation set
y_val_pred = adaptive_boosting.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

In [None]:
#TODO only one layer => decision stump

import matplotlib.pyplot as plt
from sklearn import tree

# Visualize the first tree of the model (with this model only one split and layer occurs)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    adaptive_boosting.estimators_[0],  
    feature_names=X_train_oversampling_random.columns,
    class_names=["No Diabetes", "Diabetes"],
)
plt.show()

In [None]:
from xgboost import XGBClassifier

# Initialize the XGBoost ensemble model (with a decision tree as the default base classifier and other default parameters)
extreme_gradient_boosting = XGBClassifier(random_state=42)

# Train the model on the preprocessed training data
extreme_gradient_boosting.fit(X_train_oversampling_random, y_train_oversampling_random)

# Make predictions on the training set
y_train_pred_oversampling_random = extreme_gradient_boosting.predict(X_train_oversampling_random)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train_oversampling_random, y_train_pred_oversampling_random)

# Make predictions on the validation set
y_val_pred = extreme_gradient_boosting.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

### SMOTE Oversampling
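
Unlike random oversampling, SMOTE creates new minority points by interpolating between a minority sample and one of its nearest minority neighbors. The sketch below generates a single synthetic point under simplified assumptions (Euclidean distance, hypothetical function name and toy points). It is not the imbalanced-learn implementation.

```python
import random


def smote_like_sample(minority, k=2, seed=42):
    """Create one synthetic point between a minority row and a near neighbor."""
    rng = random.Random(seed)

    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    base = rng.choice(minority)
    # The k nearest minority neighbors of the chosen base point
    neighbors = sorted(
        (p for p in minority if p is not base),
        key=lambda p: squared_distance(base, p),
    )[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [u + gap * (v - u) for u, v in zip(base, neighbor)]


# Hypothetical minority-class points in two dimensions
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [5.0, 5.0]]
synthetic = smote_like_sample(minority)
print(synthetic)
```

The synthetic point always lies on the line segment between two existing minority points, so it never leaves the region spanned by the minority class.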

In [None]:
# test smote oversampling
X_train_oversampling_smote, y_train_oversampling_smote = data_loader.training_data_oversampling_smote
X_val, y_val = data_loader.validation_data
X_test, y_test = data_loader.test_data

print(f"X_train_oversampling_smote shape: {X_train_oversampling_smote.shape}")
print(f"y_train_oversampling_smote shape: {y_train_oversampling_smote.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

In [None]:
# Imports for the decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the decision tree model with default parameters
decision_tree = DecisionTreeClassifier(random_state=42)

# Train the model on the preprocessed training data
decision_tree.fit(X_train_oversampling_smote, y_train_oversampling_smote)

# Make predictions on the training set
y_train_pred_oversampling_smote = decision_tree.predict(X_train_oversampling_smote)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train_oversampling_smote, y_train_pred_oversampling_smote)

# Make predictions on the validation set
y_val_pred = decision_tree.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

In [None]:
#TODO is three layers already too much?

import matplotlib.pyplot as plt
from sklearn import tree

# Visualize the tree of the model but only show the first three layers (max_depth=3)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    decision_tree,
    feature_names=X_train_oversampling_smote.columns,
    class_names=["No Diabetes", "Diabetes"],
    max_depth=3 
)
plt.show()

In [None]:
# Imports for the bagging classifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the base estimator
estimator = DecisionTreeClassifier(random_state=42)

# Initialize the Bagging classifier with the base estimator
bagging_decision_tree = BaggingClassifier(estimator=estimator, n_estimators=100, random_state=42)

# Train the model on the preprocessed training data
bagging_decision_tree.fit(X_train_oversampling_smote, y_train_oversampling_smote)

# Make predictions on the training set
y_train_pred_oversampling_smote = bagging_decision_tree.predict(X_train_oversampling_smote)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train_oversampling_smote, y_train_pred_oversampling_smote)

# Make predictions on the validation set
y_val_pred = bagging_decision_tree.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

In [None]:
#TODO is three layers already too much?

import matplotlib.pyplot as plt
from sklearn import tree

# Visualize the first tree of the model but only show the first three layers (max_depth=3)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    bagging_decision_tree.estimators_[0],
    feature_names=X_train_oversampling_smote.columns,
    class_names=["No Diabetes", "Diabetes"],
    max_depth=3 
)
plt.show()

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# Initialize the AdaBoost ensemble model (with a decision tree as the default base classifier and other default parameters)
adaptive_boosting = AdaBoostClassifier(random_state=42)

# Train the model on the preprocessed training data
adaptive_boosting.fit(X_train_oversampling_smote, y_train_oversampling_smote)

# Make predictions on the training set
y_train_pred_oversampling_smote = adaptive_boosting.predict(X_train_oversampling_smote)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train_oversampling_smote, y_train_pred_oversampling_smote)

# Make predictions on the validation set
y_val_pred = adaptive_boosting.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

In [None]:
#TODO only one layer => decision stump

import matplotlib.pyplot as plt
from sklearn import tree

# Visualize the first tree of the model (with this model only one split and layer occurs)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    adaptive_boosting.estimators_[0],  
    feature_names=X_train_oversampling_smote.columns,
    class_names=["No Diabetes", "Diabetes"],
)
plt.show()

In [None]:
from xgboost import XGBClassifier

# Initialize the XGBoost ensemble model (with a decision tree as the default base classifier and other default parameters)
extreme_gradient_boosting = XGBClassifier(random_state=42)

# Train the model on the preprocessed training data
extreme_gradient_boosting.fit(X_train_oversampling_smote, y_train_oversampling_smote)

# Make predictions on the training set
y_train_pred_oversampling_smote = extreme_gradient_boosting.predict(X_train_oversampling_smote)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train_oversampling_smote, y_train_pred_oversampling_smote)

# Make predictions on the validation set
y_val_pred = extreme_gradient_boosting.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

### SMOTE Tomek
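
SMOTE Tomek combines SMOTE oversampling with a cleaning step based on Tomek links: pairs of samples from different classes that are each other's nearest neighbor. The sketch below only detects such pairs on toy data. The function name and the data are hypothetical, and the brute-force search is meant for illustration rather than efficiency.

```python
def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""

    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    def nearest(i):
        others = (j for j in range(len(X)) if j != i)
        return min(others, key=lambda j: squared_distance(X[i], X[j]))

    nearest_of = [nearest(i) for i in range(len(X))]
    return [
        (i, j)
        for i, j in enumerate(nearest_of)
        if i < j and nearest_of[j] == i and y[i] != y[j]
    ]


# Toy one-dimensional data: samples 2 and 3 sit right at the class boundary
X_toy = [[0.0], [0.1], [1.0], [1.05], [2.0]]
y_toy = [0, 0, 0, 1, 1]
links = tomek_links(X_toy, y_toy)
print(links)
```

Removing the samples in such pairs clears borderline or noisy points, which can make the class boundary easier for a tree to separate.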

In [None]:
X_train_oversampling_smote_tomek, y_train_oversampling_smote_tomek = data_loader.training_data_resampling_smote_tomek
X_val, y_val = data_loader.validation_data
X_test, y_test = data_loader.test_data

print(f"X_train_oversampling_smote_tomek shape: {X_train_oversampling_smote_tomek.shape}")
print(f"y_train_oversampling_smote_tomek shape: {y_train_oversampling_smote_tomek.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

In [None]:
# Imports for the decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the decision tree model with default parameters
decision_tree = DecisionTreeClassifier(random_state=42)

# Train the model on the preprocessed training data
decision_tree.fit(X_train_oversampling_smote_tomek, y_train_oversampling_smote_tomek)

# Make predictions on the training set
y_train_pred_oversampling_smote_tomek = decision_tree.predict(X_train_oversampling_smote_tomek)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train_oversampling_smote_tomek, y_train_pred_oversampling_smote_tomek)

# Make predictions on the validation set
y_val_pred = decision_tree.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

In [None]:
#TODO is three layers already too much?

import matplotlib.pyplot as plt
from sklearn import tree

# Visualize the tree of the model but only show the first three layers (max_depth=3)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    decision_tree,
    feature_names=X_train_oversampling_smote_tomek.columns,
    class_names=["No Diabetes", "Diabetes"],
    max_depth=3 
)
plt.show()

In [None]:
# Imports for the bagging classifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the base estimator
estimator = DecisionTreeClassifier(random_state=42)

# Initialize the Bagging classifier with the base estimator
bagging_decision_tree = BaggingClassifier(estimator=estimator, n_estimators=100, random_state=42)

# Train the model on the preprocessed training data
bagging_decision_tree.fit(X_train_oversampling_smote_tomek, y_train_oversampling_smote_tomek)

# Make predictions on the training set
y_train_pred_oversampling_smote_tomek = bagging_decision_tree.predict(X_train_oversampling_smote_tomek)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train_oversampling_smote_tomek, y_train_pred_oversampling_smote_tomek)

# Make predictions on the validation set
y_val_pred = bagging_decision_tree.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

In [None]:
#TODO is three layers already too much?

import matplotlib.pyplot as plt
from sklearn import tree

# Visualize the first tree of the model but only show the first three layers (max_depth=3)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    bagging_decision_tree.estimators_[0],
    feature_names=X_train_oversampling_smote_tomek.columns,
    class_names=["No Diabetes", "Diabetes"],
    max_depth=3 
)
plt.show()

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# Initialize the AdaBoost ensemble model (with a decision tree as the default base classifier and other default parameters)
adaptive_boosting = AdaBoostClassifier(random_state=42)

# Train the model on the preprocessed training data
adaptive_boosting.fit(X_train_oversampling_smote_tomek, y_train_oversampling_smote_tomek)

# Make predictions on the training set
y_train_pred_oversampling_smote_tomek = adaptive_boosting.predict(X_train_oversampling_smote_tomek)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train_oversampling_smote_tomek, y_train_pred_oversampling_smote_tomek)

# Make predictions on the validation set
y_val_pred = adaptive_boosting.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

In [None]:
#TODO only one layer => decision stump

import matplotlib.pyplot as plt
from sklearn import tree

# Visualize the first tree of the model (with this model only one split and layer occurs)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    adaptive_boosting.estimators_[0],  
    feature_names=list(X_train_oversampling_smote_tomek.columns),  # plain list is accepted by all scikit-learn versions
    class_names=["No Diabetes", "Diabetes"],
)
plt.show()
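The stump plotted above is only the first weak learner. After each round, AdaBoost increases the weights of the samples that the current stump misclassified, so that the next stump concentrates on them. The sketch below works through one such reweighting step for the two-class SAMME update with a learning rate of 1. It is a simplified illustration, not scikit-learn's code; the function name `adaboost_round` and the five toy samples are assumptions.

```python
import math


def adaboost_round(weights, misclassified):
    """One two-class SAMME reweighting step (learning rate 1, assumes 0 < error < 1)."""
    total = sum(weights)
    # Weighted error of the current weak learner
    error = sum(w for w, m in zip(weights, misclassified) if m) / total
    # Say of the weak learner in the final vote
    alpha = math.log((1 - error) / error)
    # Misclassified samples are scaled up by exp(alpha), the others are kept
    updated = [w * math.exp(alpha) if m else w for w, m in zip(weights, misclassified)]
    norm = sum(updated)
    return error, alpha, [w / norm for w in updated]


# Five toy samples with equal weights; the stump gets only the third one wrong
weights = [0.2] * 5
error, alpha, new_weights = adaboost_round(weights, [False, False, True, False, False])
print(f"error={error:.2f}, alpha={alpha:.3f}, new weights={[round(w, 3) for w in new_weights]}")
```

In this toy case the single misclassified sample ends up carrying half of the total weight, which is why later stumps tend to split on whatever separates the previously misclassified samples.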

In [None]:
from xgboost import XGBClassifier

# Initialize the XGBoost ensemble model (with a decision tree as the default base classifier and other default parameters)
extreme_gradient_boosting = XGBClassifier(random_state=42)

# Train the model on the preprocessed training data
extreme_gradient_boosting.fit(X_train_oversampling_smote_tomek, y_train_oversampling_smote_tomek)

# Make predictions on the training set
y_train_pred_oversampling_smote_tomek = extreme_gradient_boosting.predict(X_train_oversampling_smote_tomek)

# Evaluate the model's performance on the training dataset
accuracy_train = accuracy_score(y_train_oversampling_smote_tomek, y_train_pred_oversampling_smote_tomek)

# Make predictions on the validation set
y_val_pred = extreme_gradient_boosting.predict(X_val)

# Evaluate the model's performance on the validation dataset
report = classification_report(y_val, y_val_pred, digits=4)

print("Training Accuracy", accuracy_train)
print("Classification Report:\n", report)

## Hyperparameter Tuning (with Cross-Validation and PCA)

### Halving Grid Search
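Halving grid search evaluates all candidates on a small amount of data first and then repeatedly keeps roughly the best `1/factor` of them while multiplying the amount of data by `factor`. The sketch below reproduces this schedule in plain Python. It is a simplified model of the idea, not scikit-learn's exact resource logic; the function name `halving_schedule` and the numbers (81 candidates, 100 to 10,000 samples) are assumptions, while `factor=3` matches the default of `HalvingGridSearchCV`.

```python
import math


def halving_schedule(n_candidates, min_resources, max_resources, factor=3):
    """Return (candidates, samples per candidate) for each successive-halving round."""
    schedule = []
    resources = min_resources
    while True:
        schedule.append((n_candidates, resources))
        # Stop when one candidate is left or the next round would exceed the data budget
        if n_candidates <= 1 or resources * factor > max_resources:
            break
        n_candidates = math.ceil(n_candidates / factor)
        resources *= factor
    return schedule


# Hypothetical search: 81 candidates, between 100 and 10,000 training samples
schedule = halving_schedule(n_candidates=81, min_resources=100, max_resources=10_000)
halving_cost = sum(c * r for c, r in schedule)   # candidate-sample units actually used
exhaustive_cost = 81 * 10_000                    # every candidate on all samples
for round_index, (c, r) in enumerate(schedule):
    print(f"round {round_index}: {c} candidates on {r} samples each")
print(f"halving cost: {halving_cost}, exhaustive cost: {exhaustive_cost}")
```

Under these assumptions every round costs the same amount of work, and the whole search uses a small fraction of what an exhaustive grid search on the full data would need (before multiplying by the number of cross-validation folds).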


#### Halving Grid Search for Decision Tree Classifier (with Cross-Validation and PCA)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek
from sklearn.decomposition import PCA

pipeline = Pipeline(
    [
        ("resampler", None),  # Placeholder for resampling method
        ("pca", None),  # Placeholder for PCA
        ('classifier', None)  # Placeholder for classifier
    ]
)

param_grid = [
# Source of Hyperparameters for Decision Tree & ChatGPT (ranges for large dataset):  https://ken-hoffman.medium.com/decision-tree-hyperparameters-explained-49158ee1268e
    {
    'classifier': [DecisionTreeClassifier()],
    'classifier__max_depth': [1, 2, 5, 10, 50, 100, None], 
    'classifier__max_leaf_nodes': [2, 5, 10, 50, 100, 500, None],  # must be >= 2 or None; 1 is rejected by scikit-learn
    'classifier__max_features': ['sqrt', 'log2', None],  # 'auto' was removed in scikit-learn 1.3
    'classifier__min_samples_split': [2, 5, 10, 50, 100], 
    'classifier__min_samples_leaf': [1, 2, 5, 10, 20, 50], 
    'classifier__criterion': ['gini', 'entropy', 'log_loss'],  
    'classifier__splitter': ['best', 'random'],  
    'resampler': [
        None,
        RandomOverSampler(random_state=42),
        RandomUnderSampler(random_state=42),
        SMOTE(random_state=42),
        SMOTETomek(random_state=42)
    ],  
    "pca": [None, PCA(n_components=5), PCA(n_components=10), PCA(n_components=None)
    ]  # PCA options for dimensionality reduction.
}
]

# Set up HalvingGridSearchCV
halving_grid_search = HalvingGridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=10,  # 10-fold cross-validation
    scoring="f1",  # Scoring metric
    n_jobs=-1,  # Use all processors
    verbose=1,  # To track progress
)

# Fit the halving grid search on training data
halving_grid_search.fit(X_train, y_train)

# Get the best parameters and score
print("Best Parameters:", halving_grid_search.best_params_)
print("Best Cross-Validation F1 Score:", halving_grid_search.best_score_)
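The size of a parameter grid is the product of the number of values per parameter, summed over the grid dictionaries, which is why the grid above is searched with successive halving rather than exhaustively. The following sketch counts the candidates of a small grid. The helper `count_candidates` and the toy grid are assumptions made for illustration; in scikit-learn, `len(ParameterGrid(param_grid))` gives the same count for a real grid.

```python
from math import prod


def count_candidates(param_grid):
    """Number of parameter combinations in a list of grid dictionaries."""
    return sum(prod(len(values) for values in grid.values()) for grid in param_grid)


# Toy grid (hypothetical values); strings stand in for the resampler objects
toy_grid = [
    {
        "classifier__max_depth": [1, 2, 5],
        "classifier__criterion": ["gini", "entropy"],
        "resampler": ["none", "smote"],
    },
    {
        "classifier__n_estimators": [10, 50],
        "resampler": ["none", "smote"],
    },
]

n_candidates = count_candidates(toy_grid)  # 3 * 2 * 2 + 2 * 2
print("Number of candidates:", n_candidates)
```

Each additional parameter multiplies the count, and with 10-fold cross-validation every candidate is fitted ten times per round.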

#### Halving Grid Search for Bagging Classifier (with Cross-Validation and PCA)

In [None]:
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek
from sklearn.decomposition import PCA

pipeline = Pipeline(
    [
        ("resampler", None),  # Placeholder for resampling method
        ("pca", None),  # Placeholder for PCA
        ('classifier', None)  # Placeholder for classifier
    ]
)

param_grid = [

# Source of Hyperparameters for Bagging Classifier & ChatGPT (ranges for large dataset): https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.BaggingClassifier.html
   {
    'classifier': [BaggingClassifier(estimator=DecisionTreeClassifier())],
    'classifier__n_estimators': [10, 50, 100, 200], 
    'classifier__max_samples': [0.5, 0.7, 1.0],  
    'classifier__max_features': [0.25, 0.5, 0.7, 1.0],  
    'classifier__bootstrap': [True, False],  
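    # oob_score, warm_start, n_jobs and verbose stay at their defaults:
    # oob_score=True is invalid with bootstrap=False, warm_start=True cannot be combined
    # with oob_score=True, and n_jobs/verbose do not change the fitted model
    'classifier__random_state': [42],
    'resampler': [
        None,
        RandomOverSampler(random_state=42),
        RandomUnderSampler(random_state=42),
        SMOTE(random_state=42),
        SMOTETomek(random_state=42)
    ],
    'pca': [
        None,
        PCA(n_components=5),
        PCA(n_components=10),
        PCA(n_components=None)
    ]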
   }
]

# Set up HalvingGridSearchCV
halving_grid_search = HalvingGridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=10,  # 10-fold cross-validation
    scoring="f1",  # Scoring metric
    n_jobs=-1,  # Use all processors
    verbose=1,  # To track progress
)

# Fit the halving grid search on training data
halving_grid_search.fit(X_train, y_train)

# Get the best parameters and score
print("Best Parameters:", halving_grid_search.best_params_)
print("Best Cross-Validation F1 Score:", halving_grid_search.best_score_)

#### Halving Grid Search for Random Forest Classifier (with Cross-Validation and PCA)

In [None]:
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek
from sklearn.decomposition import PCA

pipeline = Pipeline(
    [
        ("resampler", None),  # Placeholder for resampling method
        ("pca", None),  # Placeholder for PCA
        ('classifier', None)  # Placeholder for classifier
    ]
)

param_grid = [

    # Source of Hyperparameters for Random Forest Classifier adjusted ranges with ChatGPT
    {
    'classifier': [RandomForestClassifier()],
    'classifier__max_depth': [1, 2, 5, 10, 50, 100, None], 
    'classifier__max_features': ['sqrt', 'log2', 0.5, 0.7, 1.0],  
    'classifier__min_samples_split': [2, 3, 5, 10, 50, 100], 
    'classifier__bootstrap': [True, False],  
    'classifier__criterion': ['gini', 'entropy', 'log_loss'], 
    'resampler': [
        None,
        RandomOverSampler(random_state=42),
        RandomUnderSampler(random_state=42),
        SMOTE(random_state=42),
        SMOTETomek(random_state=42)
    ],  
    "pca": [
        None, 
        PCA(n_components=5), 
        PCA(n_components=10), 
        PCA(n_components=None)
    ]  
    }
]


# Set up HalvingGridSearchCV
halving_grid_search = HalvingGridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=10,  # 10-fold cross-validation
    scoring="f1",  # Scoring metric
    n_jobs=-1,  # Use all processors
    verbose=1,  # To track progress
)

# Fit the halving grid search on training data
halving_grid_search.fit(X_train, y_train)

# Get the best parameters and score
print("Best Parameters:", halving_grid_search.best_params_)
print("Best Cross-Validation F1 Score:", halving_grid_search.best_score_)

#### Halving Grid Search for Adaptive Boosting Classifier (with Cross-Validation and PCA)

In [None]:
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek
from sklearn.decomposition import PCA

pipeline = Pipeline(
    [
        ("resampler", None),  # Placeholder for resampling method
        ("pca", None),  # Placeholder for PCA
        ('classifier', None)  # Placeholder for classifier
    ]
)

param_grid = [
    # Source of Hyperparameters for Adaptive Boosting Classifier adjusted ranges with ChatGPT
    {
    'classifier': [AdaBoostClassifier()],
    'classifier__n_estimators': [10, 50, 100, 500], 
    'classifier__learning_rate': [0.001, 0.01, 0.1, 1, 10],  
    'classifier__algorithm': ['SAMME'],
    'classifier__random_state': [42],  
    'resampler': [
        None,
        RandomOverSampler(random_state=42),
        RandomUnderSampler(random_state=42),
        SMOTE(random_state=42),
        SMOTETomek(random_state=42)
    ],  
    "pca": [
        None, 
        PCA(n_components=5), 
        PCA(n_components=10), 
        PCA(n_components=None)
    ]  
    }
]


# Set up HalvingGridSearchCV
halving_grid_search = HalvingGridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=10,  # 10-fold cross-validation
    scoring="f1",  # Scoring metric
    n_jobs=-1,  # Use all processors
    verbose=1,  # To track progress
)

# Fit the halving grid search on training data
halving_grid_search.fit(X_train, y_train)

# Get the best parameters and score
print("Best Parameters:", halving_grid_search.best_params_)
print("Best Cross-Validation F1 Score:", halving_grid_search.best_score_)

#### Halving Grid Search for Extreme Gradient Boosting Classifier (with Cross-Validation and PCA)

In [None]:
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek
from sklearn.decomposition import PCA

pipeline = Pipeline(
    [
        ("resampler", None),  # Placeholder for resampling method
        ("pca", None),  # Placeholder for PCA
        ('classifier', None)  # Placeholder for classifier
    ]
)

param_grid = [
# Source of Hyperparameters for XGBoost & ChatGPT: https://medium.com/@amitsinghrajput_92567/understanding-hyperparameters-in-decision-trees-xgboost-and-lightgbm-7b64cfed77f0
    {
        'classifier': [XGBClassifier()],
        'classifier__max_depth': [1, 2, 5, 10, 50, 100, None], 
        'classifier__learning_rate': [0.001, 0.01, 0.1, 1],  # documented range of eta is [0, 1]
        'classifier__min_child_weight': [1, 3, 5, 10, 50],
        'classifier__gamma': [0, 0.1, 0.5, 1, 5, 10],
        'classifier__subsample': [0.1, 0.3, 0.5, 0.7, 1],  # documented range is (0, 1]; 0 would sample no rows
        'classifier__colsample_bytree': [0.1, 0.3, 0.5, 0.7, 1],  # documented range is (0, 1]
        'resampler': [
        RandomOverSampler(random_state=42),
        RandomUnderSampler(random_state=42),
        SMOTE(random_state=42),
        SMOTETomek(random_state=42)
    ],  
    "pca": [
        None, 
        PCA(n_components=5), 
        PCA(n_components=10), 
        PCA(n_components=None)
    ]  
    }
]


# Set up HalvingGridSearchCV
halving_grid_search = HalvingGridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=10,  # 10-fold cross-validation
    scoring="f1",  # Scoring metric
    n_jobs=-1,  # Use all processors
    verbose=1,  # To track progress
)

# Fit the halving grid search on training data
halving_grid_search.fit(X_train, y_train)

# Get the best parameters and score
print("Best Parameters:", halving_grid_search.best_params_)
print("Best Cross-Validation F1 Score:", halving_grid_search.best_score_)

### Halving Grid Search for All Classifiers (with Cross-Validation and PCA)

In [None]:
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek
from sklearn.decomposition import PCA

pipeline = Pipeline(
    [
        ("resampler", None),  # Placeholder for resampling method
        ("pca", None),  # Placeholder for PCA
        ('classifier', None)  # Placeholder for classifier
    ]
)

param_grid = [
# Source of Hyperparameters for Decision Tree & ChatGPT (ranges for large dataset):  https://ken-hoffman.medium.com/decision-tree-hyperparameters-explained-49158ee1268e
    {
    'classifier': [DecisionTreeClassifier()],
    'classifier__max_depth': [1, 2, 5, 10, 50, 100, None], 
    'classifier__max_leaf_nodes': [2, 5, 10, 50, 100, 500, None],  # must be >= 2 or None; 1 is rejected by scikit-learn
    'classifier__max_features': ['sqrt', 'log2', None],  # 'auto' was removed in scikit-learn 1.3
    'classifier__min_samples_split': [2, 5, 10, 50, 100], 
    'classifier__min_samples_leaf': [1, 2, 5, 10, 20, 50], 
    'classifier__criterion': ['gini', 'entropy', 'log_loss'],  
    'classifier__splitter': ['best', 'random'],  
    'resampler': [
        None,
        RandomOverSampler(random_state=42),
        RandomUnderSampler(random_state=42),
        SMOTE(random_state=42),
        SMOTETomek(random_state=42)
    ],  
    "pca": [None, PCA(n_components=5), PCA(n_components=10), PCA(n_components=None)
    ]  # PCA options for dimensionality reduction.
},
# # Source of Hyperparameters for Bagging Classifier & ChatGPT (ranges for large dataset): https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.BaggingClassifier.html
#    {
#     'classifier': [BaggingClassifier(estimator=DecisionTreeClassifier())],
#     'classifier__n_estimators': [10, 50, 100, 200], 
#     'classifier__max_samples': [0.5, 0.7, 1.0],  
#     'classifier__max_features': [0.25, 0.5, 0.7, 1.0],  
#     'classifier__bootstrap': [True, False],  
#     'classifier__oob_score': [True, False],  
#     'classifier__warm_start': [True, False],  
#     'classifier__n_jobs': [None,-1],  
#     'classifier__random_state': [42], 
#     'classifier__verbose': [0, 1], 
#     'resampler': [
#         None,
#         RandomOverSampler(random_state=42),
#         RandomUnderSampler(random_state=42),
#         SMOTE(random_state=42),
#         SMOTETomek(random_state=42)
#     ],  
#     'pca': [
#         None, 
#         PCA(n_components=5), 
#         PCA(n_components=10), 
#         PCA(n_components=None)
#     ] 
#     },
    # Source of Hyperparameters for Random Forest Classifier adjusted ranges with ChatGPT
    {
    'classifier': [RandomForestClassifier()],
    'classifier__max_depth': [1, 2, 5, 10, 50, 100, None], 
    'classifier__max_features': ['sqrt', 'log2', 0.5, 0.7, 1.0],  
    'classifier__min_samples_split': [2, 3, 5, 10, 50, 100], 
    'classifier__bootstrap': [True, False],  
    'classifier__criterion': ['gini', 'entropy', 'log_loss'], 
    'resampler': [
        None,
        RandomOverSampler(random_state=42),
        RandomUnderSampler(random_state=42),
        SMOTE(random_state=42),
        SMOTETomek(random_state=42)
    ],  
    "pca": [
        None, 
        PCA(n_components=5), 
        PCA(n_components=10), 
        PCA(n_components=None)
    ]  
    },
    # Source of Hyperparameters for Adaptive Boosting Classifier adjusted ranges with ChatGPT
    {
    'classifier': [AdaBoostClassifier()],
    'classifier__n_estimators': [10, 50, 100, 500], 
    'classifier__learning_rate': [0.001, 0.01, 0.1, 1, 10],  
    'classifier__algorithm': ['SAMME'],
    'classifier__random_state': [42],  
    'resampler': [
        None,
        RandomOverSampler(random_state=42),
        RandomUnderSampler(random_state=42),
        SMOTE(random_state=42),
        SMOTETomek(random_state=42)
    ],  
    "pca": [
        None, 
        PCA(n_components=5), 
        PCA(n_components=10), 
        PCA(n_components=None)
    ]  
    },
# Source of Hyperparameters for XGBoost & ChatGPT: https://medium.com/@amitsinghrajput_92567/understanding-hyperparameters-in-decision-trees-xgboost-and-lightgbm-7b64cfed77f0
    {
        'classifier': [XGBClassifier()],
        'classifier__max_depth': [1, 2, 5, 10, 50, 100, None], 
        'classifier__learning_rate': [0.001, 0.01, 0.1, 1],  # documented range of eta is [0, 1]
        'classifier__min_child_weight': [1, 3, 5, 10, 50],
        'classifier__gamma': [0, 0.1, 0.5, 1, 5, 10],
        'classifier__subsample': [0.1, 0.3, 0.5, 0.7, 1],  # documented range is (0, 1]; 0 would sample no rows
        'classifier__colsample_bytree': [0.1, 0.3, 0.5, 0.7, 1],  # documented range is (0, 1]
        'resampler': [
        RandomOverSampler(random_state=42),
        RandomUnderSampler(random_state=42),
        SMOTE(random_state=42),
        SMOTETomek(random_state=42)
    ],  
    "pca": [
        None, 
        PCA(n_components=5), 
        PCA(n_components=10), 
        PCA(n_components=None)
    ]  
    }
]


# Set up HalvingGridSearchCV
halving_grid_search = HalvingGridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=10,  # 10-fold cross-validation
    scoring="f1",  # Scoring metric
    n_jobs=-1,  # Use all processors
    verbose=1,  # To track progress
)

# Fit the halving grid search on training data
halving_grid_search.fit(X_train, y_train)

# Get the best parameters and score
print("Best Model w/ Best Parameters:", halving_grid_search.best_params_)
print("Best Cross-Validation F1 Score:", halving_grid_search.best_score_)

In [None]:
#TODO Precision Recall Curves for Positive Class of Best Performing Classifier (with respect to F1 Score)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, classification_report

# Best pipeline found by the halving grid search; with the default refit=True it is
# already refit on the full training data, so no additional fit is needed
best_clf = halving_grid_search.best_estimator_

# Predict probabilities for the positive class of target variable
y_scores = best_clf.predict_proba(X_test)[:, 1]

# Calculate precision and recall
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)

# Plot Precision-Recall curve
plt.figure(figsize=(10, 6))
plt.plot(recall, precision, label='Precision-Recall curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve for Positive Class')
plt.legend()
plt.show()

# Calculate F1 score for each threshold; precision and recall have one more entry than
# thresholds (the end point precision=1, recall=0), so that last entry is dropped
denominator = precision[:-1] + recall[:-1]
f1_scores = np.divide(
    2 * precision[:-1] * recall[:-1],
    denominator,
    out=np.zeros_like(denominator),
    where=denominator > 0,  # avoid 0/0 where precision and recall are both zero
)

# Find the threshold that gives the best F1 score
# (chosen on X_test itself, so the resulting scores are optimistic)
best_threshold = thresholds[f1_scores.argmax()]

# Print the best F1 score and the corresponding threshold
print(f'Best F1 Score: {f1_scores.max():.2f}')
print(f'Best Threshold: {best_threshold:.2f}')

# Classification report for the best threshold
y_pred_best_threshold = (y_scores >= best_threshold).astype(int)
print(classification_report(y_test, y_pred_best_threshold))
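Because F1 is the harmonic mean of precision and recall, the threshold search above amounts to recomputing both quantities at every candidate threshold and keeping the one with the highest F1. The sketch below does this by hand for six toy samples. The labels, scores and the helper `precision_recall_f1` are assumptions for illustration and are unrelated to the diabetes data.

```python
def precision_recall_f1(y_true, y_scores, threshold):
    """Precision, recall and F1 of the positive class when predicting 1 for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, y_scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, y_scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, y_scores) if t == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels and predicted probabilities (hypothetical)
y_true = [0, 0, 1, 0, 1, 1]
y_scores = [0.10, 0.40, 0.35, 0.80, 0.65, 0.90]

# Every distinct score is a candidate threshold
results = {t: precision_recall_f1(y_true, y_scores, t) for t in sorted(set(y_scores))}
best_threshold = max(results, key=lambda t: results[t][2])
best_f1 = results[best_threshold][2]

for t, (p, r, f1) in results.items():
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")
print(f"best threshold={best_threshold:.2f}, best F1={best_f1:.2f}")
```

In this toy example a low threshold wins because it keeps recall at 1 while precision is still acceptable; raising the threshold trades recall for precision.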

In [None]:
#TODO Precision Recall Curves for Negative Class of Best Performing Classifier (with respect to F1 Score)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, classification_report

# Best pipeline found by the halving grid search; with the default refit=True it is
# already refit on the full training data, so no additional fit is needed
best_clf = halving_grid_search.best_estimator_

# Predict probabilities for the negative class
y_scores_negative = best_clf.predict_proba(X_test)[:, 0]

# Calculate precision and recall for the negative class (labels are flipped so that class 0 counts as positive)
precision_neg, recall_neg, thresholds_neg = precision_recall_curve(1 - y_test, y_scores_negative)

# Plot Precision-Recall curve for the negative class of target variable
plt.figure(figsize=(10, 6))
plt.plot(recall_neg, precision_neg, label='Precision-Recall curve (Negative Class)')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve for Negative Class')
plt.legend()
plt.show()

# Calculate F1 score for each threshold; precision and recall have one more entry than
# thresholds (the end point precision=1, recall=0), so that last entry is dropped
denominator_neg = precision_neg[:-1] + recall_neg[:-1]
f1_scores_neg = np.divide(
    2 * precision_neg[:-1] * recall_neg[:-1],
    denominator_neg,
    out=np.zeros_like(denominator_neg),
    where=denominator_neg > 0,  # avoid 0/0 where precision and recall are both zero
)

# Find the threshold that gives the best F1 score
# (chosen on X_test itself, so the resulting scores are optimistic)
best_threshold_neg = thresholds_neg[f1_scores_neg.argmax()]

# Print the best F1 score and the corresponding threshold
print(f'Best F1 Score (Negative Class): {f1_scores_neg.max():.2f}')
print(f'Best Threshold (Negative Class): {best_threshold_neg:.2f}')

# Classification report for the best threshold
y_pred_best_threshold_neg = (y_scores_negative >= best_threshold_neg).astype(int)
print(classification_report(1 - y_test, y_pred_best_threshold_neg))

In [None]:
# Keep a handle on the best model so it can be evaluated and later saved to a .pkl file
import joblib
from datetime import datetime

# Get the best model from the halving grid search
best_model = halving_grid_search.best_estimator_


In [None]:
from sklearn.metrics import classification_report

# Make predictions on the validation set
y_val_pred = best_model.predict(X_val)

# print(y_val_pred)
# Evaluate the model's performance on the validation set
report = classification_report(y_val, y_val_pred, digits=4)
print("Classification Report:\n", report)

In [None]:
# import joblib
# from datetime import datetime

# # Get the best model from the halving grid search
# best_model = halving_grid_search.best_estimator_

# # Get the current timestamp
# timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# # Save the best model to a file with a timestamp
# model_filename = f'../models/decision_tree_ensemble_methods/lr_model_sampling_{timestamp}.pkl'
# joblib.dump(best_model, model_filename)

# print(f"Best model saved to '{model_filename}'")

## Further Possible Exploration: Base Learners for Voting and Stacking

In [None]:
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.linear_model import LogisticRegression
# from sklearn.naive_bayes import GaussianNB
# from sklearn.svm import SVC
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.metrics import accuracy_score, classification_report


# #TODO How many neighbours n= ?
# #TODO Do I need probability=True for soft voting?
# # TODO Need the best n for k-NN? => I chose 5 for now

# #Initialize base classifiers 
# base_classifiers = {
#     "decision tree": DecisionTreeClassifier(),
#     "random forest": RandomForestClassifier(),
#     "logistic regression": LogisticRegression(),
#     "naive bayes":GaussianNB(), 
#     "support vector machines": SVC(probability=True), 
#     "k-NN with n = 5": KNeighborsClassifier(n_neighbors=5)
# }

# # TODO => should I rather use recall?
# # Implement function to display accuracy as performance metric for different classifiers

# def evaluate_classifier(e_name, e, X_train, y_train, X_val, y_val):
#     # Train the model on the preprocessed training data
#     e.fit(X_train, y_train)

#     # Make predictions on the training set
#     y_train_pred = e.predict(X_train)

#     # Evaluate the model's performance on the training dataset
#     accuracy_train = accuracy_score(y_train, y_train_pred)

#     # Make predictions on the validation set
#     y_val_pred = e.predict(X_val)

#     # Evaluate the model's performance on the validation dataset
#     report = classification_report(y_val, y_val_pred, digits=4)

#     print(f'Training Accuracy of {e_name}: {accuracy_train}')
#     print(f'Classification Report of {e_name}:\n{report}')


#     # y_val_pred = e.fit(X_train, y_train).predict(X_val)
#     # acc = accuracy_score(y_val, y_pred)
#     # print(f'{e_name}: ACC={acc:.2f}')

# # Evaluate the base classifiers on the preprocessed training and validation data
# for e_name, e in base_classifiers.items():
#     evaluate_classifier(e_name, e, X_train, y_train, X_val, y_val)

In [None]:
#TODO n_estimator = ? by hyperparameter tuning

In [None]:
# TODO for Voting => check whether the classifiers are independent, i.e. no correlation between their predictions

In [None]:
# from sklearn.metrics import accuracy_score
# import numpy as np

# # Get predictions from the classifiers
# predictions_knn = estimators['k-NN'].predict(X_test)
# predictions_ncc = estimators['NCC'].predict(X_test)

# # Calculate the correlation between the predictions
# correlation = np.corrcoef(predictions_knn, predictions_ncc)[0, 1]
# print(f'Correlation between k-NN and NCC predictions: {correlation:.2f}')

# # Check if the classifiers are dependent
# if correlation > 0.5:
#     print("The classifiers are likely dependent.")
# else:
#     print("The classifiers are likely independent.")
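The independence check outlined in the commented cell above can be tried without any fitted models. The sketch below computes the Pearson correlation between two lists of binary predictions in plain Python. The prediction lists and the helper `pearson_correlation` are assumptions for illustration, and the helper assumes that neither list is constant, since the correlation is undefined in that case.

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation of two equally long numeric lists (neither may be constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    variance_a = sum((x - mean_a) ** 2 for x in a)
    variance_b = sum((y - mean_b) ** 2 for y in b)
    return covariance / math.sqrt(variance_a * variance_b)


# Hypothetical predictions of two classifiers on the same eight samples
predictions_a = [1, 0, 1, 1, 0, 0, 1, 0]
predictions_b = [1, 0, 1, 0, 0, 1, 1, 0]

correlation = pearson_correlation(predictions_a, predictions_b)
print(f"Correlation between the two prediction vectors: {correlation:.2f}")
```

Two reasonably accurate classifiers will agree on most samples simply because both are often right, so it may be more informative to apply the same function to their error indicators (1 where a prediction is wrong, 0 otherwise) than to the raw predictions.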

### Voting (Example: with all base learners)

In [None]:
# from sklearn.ensemble import VotingClassifier

# #Use voting ensemble method as classifier
# #TODO random_states = ? and weights = ? voting soft=?
# voting = VotingClassifier(
#     estimators=[
#         ("dt", DecisionTreeClassifier()),
#         ("rf", RandomForestClassifier()),
#         ("lr", LogisticRegression()),
#         ("nb", GaussianNB()),
#         ("svm", SVC(probability=True)),  # SVM with probabilities for soft voting
#         ("knn_5", KNeighborsClassifier(n_neighbors=5)),
#     ],
#     voting="hard",  # default; "soft" requires predict_proba on every estimator
# )

# #TODO Complete voting
# #TODO Evaluate the classifiers' accuracies
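Regarding the open question of hard versus soft voting: hard voting counts the class labels predicted by the base learners, whereas soft voting averages their predicted probabilities, and the two can disagree. The plain-Python sketch below shows such a case for one sample and three classifiers. The probabilities and the helpers `hard_vote` and `soft_vote` are assumptions for illustration and only mimic the binary case, not the `VotingClassifier` implementation.

```python
def hard_vote(probabilities, threshold=0.5):
    """Majority vote over the class labels implied by each classifier's probability."""
    votes = [int(p >= threshold) for p in probabilities]
    return int(sum(votes) > len(votes) / 2)


def soft_vote(probabilities, weights=None, threshold=0.5):
    """Threshold the (optionally weighted) mean of the predicted probabilities."""
    weights = weights or [1.0] * len(probabilities)
    mean = sum(w * p for w, p in zip(weights, probabilities)) / sum(weights)
    return int(mean >= threshold)


# Hypothetical positive-class probabilities of three classifiers for one sample
positive_probabilities = [0.45, 0.45, 0.90]

hard_result = hard_vote(positive_probabilities)  # labels 0, 0, 1
soft_result = soft_vote(positive_probabilities)  # mean probability of about 0.6
print("hard voting:", hard_result, "| soft voting:", soft_result)
```

Here two hesitant classifiers outvote one confident classifier under hard voting, while soft voting lets the confident one decide. Soft voting therefore only makes sense if all base learners provide reasonably calibrated probabilities.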


### Stacking (Example: with all base learners)