# **Week 4: Colab Experiment**

# I. Introduction
In this exercise, we load the Breast Cancer Wisconsin dataset and use it for a binary classification task: predicting whether a tumor is benign or malignant.

# II. Methods

This code loads the Breast Cancer Wisconsin dataset and classifies each sample as benign or malignant.

The dataset was split into five parts using K-Fold cross-validation. In each fold, four parts form the training set and the remaining part forms the test set. Three machine learning models (Logistic Regression, SVM, and Decision Tree) were trained and evaluated on every fold. Finally, the mean and standard deviation of the per-fold error rates (zero-one loss) were calculated.

In [45]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
from collections import Counter
from datetime import datetime
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import zero_one_loss

In [46]:
# Define the dependent and independent variables.
data = load_breast_cancer()
Y = data.target
X = data.data

In [47]:
# Create a KFold object that will split data into 5 folds for
# cross-validation, shuffling the data
# and using a random seed (0) for reproducibility.
num_folds = 5
kf = KFold(n_splits=num_folds, random_state=0, shuffle=True)

# Initialize a dictionary to store the indices of the training and test sets
kfold_indices = {}

# Loop over the 5 folds and store the indices in the dictionary
for i, (train_index, test_index) in enumerate(kf.split(X)):
  kfold_indices[f"fold_{i}"] = {'train': train_index, 'test': test_index}
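
The split can be sanity-checked before any model is trained. The following is a minimal sketch, not part of the original notebook, that uses the same `KFold` settings and verifies that training and test indices never overlap within a fold and that every sample lands in exactly one test fold. The names `test_indices`, `covers_all`, and `fold_sizes` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold

X, _ = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

test_indices = []
for train_index, test_index in kf.split(X):
    # Within a fold, training and test indices must be disjoint.
    assert np.intersect1d(train_index, test_index).size == 0
    test_indices.append(test_index)

# Across folds, every sample should be used for testing exactly once.
all_test = np.concatenate(test_indices)
covers_all = np.array_equal(np.sort(all_test), np.arange(len(X)))
fold_sizes = [len(idx) for idx in test_indices]

print("Every sample tested exactly once:", covers_all)
print("Test fold sizes:", fold_sizes)
```

Because the number of samples is not divisible by five, the test folds differ in size by at most one sample.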

In [48]:
# Train models and apply them to the test set

# Initialize a dictionary to store the error rates of the models
Error_rate = {'log_reg': [], 'svm': [], 'decision_tree': []}

# Loop over the 5 folds
for fold_id in range(num_folds):

  # Training data extraction 
  X_train = X[kfold_indices[f"fold_{fold_id}"]['train']]
  Y_train = Y[kfold_indices[f"fold_{fold_id}"]['train']]
  # Test data extraction
  X_test = X[kfold_indices[f"fold_{fold_id}"]['test']]
  Y_test = Y[kfold_indices[f"fold_{fold_id}"]['test']]

  # Logistic regression
  # Create a pipeline that standardizes each feature
  # (zero mean, unit variance), then applies logistic regression
  # with a max of 10,000 iterations
  log_reg_pipeline = Pipeline([
      ('scaler', StandardScaler()),
      ('log_reg', LogisticRegression(max_iter=10000))
  ])
  # Fit the pipeline to the training data
  log_reg_pipeline.fit(X_train, Y_train)
  # Predict the test data
  Y_pred_log_reg = log_reg_pipeline.predict(X_test)
  # Calculate the error
  error_log_reg = zero_one_loss(Y_test, Y_pred_log_reg)
  # Append the error to the dictionary
  Error_rate['log_reg'].append(error_log_reg)

  # SVM
  # Create a pipeline that standardizes each feature
  # (zero mean, unit variance), then applies a support vector machine
  svm_pipeline = Pipeline([
      ('scaler', StandardScaler()),
      ('svm', SVC())
  ])
  # Fit the pipeline to the training data
  svm_pipeline.fit(X_train, Y_train)
  # Predict the test data
  Y_pred_svm = svm_pipeline.predict(X_test)
  # Calculate the error
  error_svm = zero_one_loss(Y_test, Y_pred_svm)
  # Append the error to the dictionary
  Error_rate['svm'].append(error_svm)

  # Decision tree
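  # Create a pipeline that standardizes each feature
  # (zero mean, unit variance), then applies a decision tree.
  # Trees are largely insensitive to feature scaling; the scaler
  # is kept for consistency with the other pipelines.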
  decision_tree_pipeline = Pipeline([
      ('scaler', StandardScaler()),
      ('dt', DecisionTreeClassifier())
  ])
  # Fit the pipeline to the training data
  decision_tree_pipeline.fit(X_train, Y_train)
  # Predict the test data
  Y_pred_dt = decision_tree_pipeline.predict(X_test)
  # Calculate the error
  error_dt = zero_one_loss(Y_test, Y_pred_dt)
  # Append the error to the dictionary
  Error_rate['decision_tree'].append(error_dt)
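
As a cross-check, the explicit loop above could also be expressed with scikit-learn's `cross_val_score` helper. The following is a minimal sketch, not part of the original experiment; it covers only the logistic regression pipeline, and the names `model`, `cv`, and `errors` are chosen here for illustration. Because it reuses the same `KFold` settings, it should visit the same splits as the loop above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Same splitter settings as in the notebook.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# The scaler is refit on each training fold, so no test data leaks into it.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# cross_val_score returns one accuracy per fold; error rate = 1 - accuracy,
# which matches the normalized zero-one loss.
accuracy = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
errors = 1.0 - accuracy

print(f"log_reg: mean = {errors.mean():.4f}, std = {errors.std():.4f}")
```

The explicit loop remains useful when per-fold predictions or several models need to be handled side by side; the helper is mainly a shorter way to get the per-fold scores.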

# III. Results

Here we report the mean and standard deviation of the error rates over 5 folds for each method.

In [49]:
print("The error rate over 5 folds in CV:")

# Calculate the mean and standard deviation of the error rates
for method, errors in Error_rate.items():
    mean_error = np.mean(errors)
    std_error = np.std(errors)
    print(f"{method}: mean = {mean_error:.4f}, std = {std_error:.4f}")

The error rate over 5 folds in CV:
log_reg: mean = 0.0281, std = 0.0170
svm: mean = 0.0211, std = 0.0131
decision_tree: mean = 0.0773, std = 0.0237
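
One detail worth keeping in mind when reading these numbers: `np.std` computes the population standard deviation by default (`ddof=0`). With only five folds, the sample standard deviation (`ddof=1`) is larger by a factor of sqrt(5/4), roughly 1.118. The sketch below illustrates the difference; the values in `example_errors` are hypothetical and are not results from this experiment.

```python
import numpy as np

# Hypothetical per-fold error rates, for illustration only.
example_errors = np.array([0.02, 0.03, 0.01, 0.05, 0.04])

# Default: divide by n (population standard deviation).
population_std = np.std(example_errors)

# ddof=1: divide by n - 1 (sample standard deviation).
sample_std = np.std(example_errors, ddof=1)

print(f"ddof=0: {population_std:.4f}")
print(f"ddof=1: {sample_std:.4f}")
print(f"ratio:  {sample_std / population_std:.4f}")
```

Either convention is acceptable as long as it is stated; the ranking of the models here does not depend on it.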


# IV. Conclusion and Discussion


In conclusion, the results are:
1. SVM performs the best overall, with the lowest mean error and the smallest standard deviation. This suggests that it is the most accurate and the most stable model for this dataset, although its margin over Logistic Regression (0.0211 vs. 0.0281) is smaller than one standard deviation.
2. Logistic Regression performs well, with a slightly higher mean error and standard deviation. This indicates that it is still a very good option for this classification problem.
3. The Decision Tree performs the worst, with the highest mean error and standard deviation. This shows that the Decision Tree is less accurate and its performance fluctuates more between different folds. Possible reasons include the following.

    (i) Overfitting: Decision Trees are prone to overfitting, especially if not pruned or regularized. The tree here uses the default settings of `DecisionTreeClassifier()`, which impose no depth limit.
    
    (ii) Complexity: A Decision Tree builds a piecewise-constant decision boundary from axis-aligned splits, whereas Logistic Regression is a linear model and `SVC()` with its default RBF kernel, although non-linear, produces a smooth boundary. On a dataset like the breast cancer dataset, which might not have complex decision boundaries, these simpler, smoother models might be more suitable, leading to better performance.
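
The overfitting explanation can be probed by comparing training and test error for a fully grown tree and a depth-limited one. This is a minimal sketch under assumptions that are not in the original notebook: a single stratified 80/20 split instead of 5-fold CV, and `max_depth=3` as an arbitrary illustrative limit. A large gap between training and test error for the fully grown tree would be consistent with overfitting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Assumption: one stratified hold-out split, used only for illustration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

results = {}
for name, depth in [("full", None), ("depth_3", 3)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # score() returns accuracy, so the error rate is 1 - accuracy.
    results[name] = {
        "train": 1.0 - tree.score(X_train, y_train),
        "test": 1.0 - tree.score(X_test, y_test),
    }

for name, errs in results.items():
    print(f"{name}: train error = {errs['train']:.4f}, "
          f"test error = {errs['test']:.4f}")
```

A single split is noisy, so any difference seen here should be confirmed with cross-validation before drawing conclusions.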

Next, I try two other classification methods taught in the lecture: Bagging and Random Forest.

In [50]:
# Required Libraries
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
Y = data.target

# Create a KFold object for 5-fold cross-validation 
num_folds = 5
kf = KFold(n_splits=num_folds, random_state=0, shuffle=True)

# Initialize a dictionary to store the error rates of the models
Error_rate = {'bagging': [], 'random_forest': []}

# Loop over the 5 folds
for fold_id, (train_index, test_index) in enumerate(kf.split(X)):
    # Training and Test Data extractions
    X_train = X[train_index]
    Y_train = Y[train_index]
    X_test = X[test_index]
    Y_test = Y[test_index]

    # Bagging
    # Create a pipeline to scale the data and apply the Bagging classifier
    bagging_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('bagging', BaggingClassifier(estimator=DecisionTreeClassifier(),
                                      n_estimators=100, max_samples=1.0, 
                                      random_state=0, n_jobs=-1)) 
    ])
    # Fit the pipeline to the training data
    bagging_pipeline.fit(X_train, Y_train)
    # Predict the test data
    Y_pred_bagging = bagging_pipeline.predict(X_test)
    # Calculate the error
    error_bagging = zero_one_loss(Y_test, Y_pred_bagging)
    # Append the error to the dictionary
    Error_rate['bagging'].append(error_bagging)

    # Random Forest
    # Create a pipeline to scale the data and apply the Random Forest classifier
    random_forest_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('random_forest', RandomForestClassifier(n_estimators=100, max_depth=None,
                                                 random_state=0, n_jobs=-1, 
                                                 max_features='sqrt')) 
    ])
    # Fit the pipeline to the training data
    random_forest_pipeline.fit(X_train, Y_train)
    # Predict the test data
    Y_pred_rf = random_forest_pipeline.predict(X_test)
    # Calculate the error
    error_rf = zero_one_loss(Y_test, Y_pred_rf)
    # Append the error to the dictionary
    Error_rate['random_forest'].append(error_rf)

# Print the results
print("The error rate over 5 folds in CV:")

# Calculate the mean and standard deviation of the error rates
for method, errors in Error_rate.items():
    mean_error = np.mean(errors)
    std_error = np.std(errors)
    print(f"{method}: mean = {mean_error:.4f}, std = {std_error:.4f}")



The error rate over 5 folds in CV:
bagging: mean = 0.0386, std = 0.0172
random_forest: mean = 0.0386, std = 0.0131


Overall, Bagging and Random Forest outperform the single Decision Tree, yet they do not surpass Logistic Regression and SVM. This suggests that Bagging and Random Forest mitigate the overfitting of a single tree and provide more stable predictions. However, the increased model complexity introduced by these ensemble methods does not result in better outcomes than those achieved with Logistic Regression and SVM on this particular dataset. Therefore, it may be useful to explore other, simpler models or to tune hyperparameters more systematically.
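
One way to follow up on the hyperparameter idea is a grid search over the SVM pipeline using `GridSearchCV`. The sketch below is illustrative only: the grid values for `C` and `gamma` are assumptions chosen for the example, not tuned recommendations, and no results from it are reported here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC()),
])

# Assumed, illustrative grid; parameter names use the "<step>__<param>" form.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", 0.01, 0.1],
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

best_error = 1.0 - search.best_score_
print("Best parameters:", search.best_params_)
print(f"Cross-validated error of best setting: {best_error:.4f}")
```

Note that the error of the best setting is optimistically biased, because the same folds are used both to pick the hyperparameters and to score them. A fair comparison with the numbers above would need nested cross-validation or a separate held-out test set.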