# Portfolio assignment week 7

## 1. Bagging vs Boosting
The scikit-learn library provides several options for bagging and boosting. It is possible to create your own bagging or boosting model on top of a base model. For instance, you can create a tree-based bagging model. In addition, scikit-learn provides AdaBoost. For XGBoost it is best to use the xgboost library.
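
As a minimal sketch of these options, the block below builds a tree-based bagging model, an AdaBoost model and a dummy baseline, and compares them with 5-fold cross-validation. It assumes scikit-learn's bundled Wisconsin breast cancer data (`load_breast_cancer`) as a stand-in for the Kaggle CSV, and the hyperparameters and the names `X_demo`, `y_demo`, `models` and `results` are illustrative choices, not part of the assignment.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset is used instead of the Kaggle CSV
X_demo, y_demo = load_breast_cancer(return_X_y=True)

models = {
    # Baseline that always predicts the most frequent class
    "dummy (most frequent)": DummyClassifier(strategy="most_frequent"),
    # Bagging: many trees fitted on bootstrap samples, predictions aggregated by voting
    "bagging (decision trees)": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=0
    ),
    # Boosting: weak learners fitted sequentially on reweighted samples
    "boosting (AdaBoost)": AdaBoostClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The dummy classifier gives the accuracy that any useful model should beat, which makes the scores of the two ensembles easier to interpret.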

Based on the theory in the [accompanying notebook](../Exercises/E_BAGGING_BOOSTING.ipynb), create a bagging, boosting and dummy classifier. Test these classifiers on the [breast cancer dataset](https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset). Go through the data science pipeline as you've done before:

1. Try to understand the dataset globally.
2. Load the data.
3. Exploratory analysis
4. Preprocess data (skewness, normality, etc.)
5. Modeling (cross-validation and training). (**Create several bagging classifiers with different estimators**.)
6. Evaluation (**Use the evaluation methods as described in the previous lessons. Then compare the different models**.)
7. Try to understand why some methods perform better than others. Try different configurations for your bagging and boosting models.


**1. Try to understand the dataset globally**

The breast cancer dataset contains information that can be used for binary classification of breast tumors. It consists of 569 patients with either a malignant or a benign breast tumor. For each patient, 30 tumor parameters are provided. These parameters can be used to build a machine learning model that classifies tumors as malignant or benign.

**2. Load the data**

In [14]:
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

In [2]:
# Loading data
import yaml

def get_config():
    with open("config.yaml", 'r') as stream:
        config = yaml.safe_load(stream)
    return config

config = get_config()
data = (config['breast-cancer'])
df = pd.read_csv(data)

**3. Exploratory analysis**

In [3]:
# Displaying number of malignant vs benign tumors
print(f'\nNumber of malignant tumors: {len(df[df.diagnosis == "M"])}')
print(f'Number of benign tumors: {len(df[df.diagnosis == "B"])}\n')

# Displaying number of columns and rows, datatypes, missing values
# (df.info() prints its summary itself, so no print() is needed)
df.info()


Number of malignant tumors: 212
Number of benign tumors: 357

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se       

**4. Preprocess data (skewness, normality, etc.)**

The features have varying ranges and should be scaled. The diagnosis (y) should be converted from a categorical variable to a numerical representation. The preprocessed data is then split into training and testing sets for modelling.

In [4]:
X = df.drop(['id', 'diagnosis'], axis = 1)

# Scaling the features
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

In [5]:
y = df['diagnosis']

# Encoding the "diagnosis" column
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
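
To interpret the classification reports later on, it helps to know which integer each diagnosis receives. `LabelEncoder` sorts the class labels before numbering them, so "B" should map to 0 and "M" to 1. The sketch below checks this on a small hypothetical label list (`demo_labels` is made up for illustration and is not taken from the dataset).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of diagnosis labels, for illustration only
demo_labels = ["M", "B", "B", "M"]

demo_encoder = LabelEncoder()
demo_encoded = demo_encoder.fit_transform(demo_labels)

# classes_ holds the sorted labels; the position of a label is its integer code
mapping = {str(label): code for code, label in enumerate(demo_encoder.classes_)}
print(mapping)
print(demo_encoded.tolist())
```

With this mapping, class 1 in the reports refers to malignant tumors and class 0 to benign ones.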

In [6]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_encoded, test_size=0.2, random_state=42)
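
Note that the scaler above is fitted on all rows before the split, so the minimum and maximum of the test rows influence the scaling of the training data. One possible way to avoid this is to put the scaler and the model in a single pipeline, so that the scaler is fitted only on the training rows (or training folds). The sketch below assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset is used instead of the Kaggle CSV
X_raw, y_raw = load_breast_cancer(return_X_y=True)

# Split first, on the unscaled data; stratify keeps the class ratio in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

# The scaler is part of the model, so it is refitted on each set of training rows
leak_free_model = make_pipeline(
    MinMaxScaler(),
    BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0),
)

cv_scores = cross_val_score(leak_free_model, X_tr, y_tr, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

leak_free_model.fit(X_tr, y_tr)
test_accuracy = leak_free_model.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Tree-based models are largely insensitive to feature scaling, so the effect on the scores here is probably small, but the pattern matters more for scale-sensitive estimators.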

**5. Modeling (cross-validation and training). Create several bagging classifiers with different estimators.**

In [10]:
# Create a list of base estimators
base_estimators = [
    DecisionTreeClassifier(),
    RandomForestClassifier(n_estimators=100),
    ExtraTreesClassifier(n_estimators=100)
]

# Create a dictionary of bagging classifiers with different base estimators
bagging_classifiers = {}
for base_estimator in base_estimators:
    bagging_classifier = BaggingClassifier(base_estimator, n_estimators=10)
    classifier_name = base_estimator.__class__.__name__
    bagging_classifiers[classifier_name] = bagging_classifier

# Evaluate each bagging classifier using cross-validation
# Note: X and y are the unscaled features and the original string labels of the full dataset
for classifier_name, bagging_classifier in bagging_classifiers.items():
    scores = cross_val_score(bagging_classifier, X, y, cv=5)
    print(f"{classifier_name}: Mean Accuracy: {scores.mean():.3f}, Standard Deviation: {scores.std():.3f}")

DecisionTreeClassifier: Mean Accuracy: 0.949, Standard Deviation: 0.031
RandomForestClassifier: Mean Accuracy: 0.958, Standard Deviation: 0.018
ExtraTreesClassifier: Mean Accuracy: 0.965, Standard Deviation: 0.020


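ExtraTreesClassifier has the highest mean accuracy (0.965). RandomForestClassifier has the lowest standard deviation (0.018). DecisionTreeClassifier appears to be the least accurate (0.949). The differences between the means are of the same size as the standard deviations across folds, so the ranking should be interpreted with some caution.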
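
Accuracy alone does not show whether errors fall on malignant or benign cases. As a sketch, `cross_validate` can report several metrics per fold. The block assumes scikit-learn's bundled `load_breast_cancer` data, where target 0 is malignant, and recodes it so that 1 means malignant; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset is used instead of the Kaggle CSV
cancer = load_breast_cancer()
X_cv = cancer.data
# In the bundled data 0 = malignant; recode so that 1 = malignant (the positive class)
y_cv = (cancer.target == 0).astype(int)

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "precision", "recall"]

cv_results = cross_validate(
    BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0),
    X_cv,
    y_cv,
    cv=folds,
    scoring=metrics,
)

for metric in metrics:
    values = cv_results[f"test_{metric}"]
    print(f"{metric}: {values.mean():.3f} +/- {values.std():.3f}")
```

Here recall is the fraction of malignant tumors that the model detects, which is often the metric of most interest for a screening task.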

**6. Evaluation (Use the evaluation methods as described in the previous lessons. Then compare the different models.)**

In [13]:
# Train and evaluate the models
for classifier_name, bagging_classifier in bagging_classifiers.items():
    bagging_classifier.fit(X_train, y_train)
    y_pred = bagging_classifier.predict(X_test)
    
    print(f"\n{classifier_name} Bagging Classifier:")
    # Classification report
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    
    # Confusion matrix
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))


DecisionTreeClassifier Bagging Classifier:
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.97      0.97        71
           1       0.95      0.93      0.94        43

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

Confusion Matrix:
[[69  2]
 [ 3 40]]

RandomForestClassifier Bagging Classifier:
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.99      0.97        71
           1       0.98      0.93      0.95        43

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114

Confusion Matrix:
[[70  1]
 [ 3 40]]

ExtraTreesClassifier Bagging Classifier:
Classification Report:
              precision    recall  f1-score   support

           0      

The ExtraTreesClassifier Bagging Classifier achieves the highest precision, recall, F1-score, and accuracy among the three classifiers. It shows a better performance in correctly identifying both the positive (malignant) and negative (benign) cases.

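The RandomForestClassifier Bagging Classifier also performs well, especially in terms of precision for the malignant class (0.98): only one benign tumor is misclassified as malignant. Its recall for the malignant class (0.93) equals that of the decision tree variant, as both miss 3 of the 43 malignant tumors.

The DecisionTreeClassifier Bagging Classifier has a slightly lower recall for the benign class (0.97 versus 0.99 for the random forest variant). It still appears to be performing well.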
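
The precision and recall values in the reports can be recomputed by hand from the confusion matrices printed above. The sketch below does this for the two complete matrices, assuming that label 1 is the malignant class and that rows are true labels and columns are predicted labels (the scikit-learn convention).

```python
import numpy as np

# Confusion matrices copied from the output above (rows: true, columns: predicted)
confusion_matrices = {
    "DecisionTreeClassifier": np.array([[69, 2], [3, 40]]),
    "RandomForestClassifier": np.array([[70, 1], [3, 40]]),
}

malignant_metrics = {}
for name, cm in confusion_matrices.items():
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp)  # share of predicted malignant that is truly malignant
    recall = tp / (tp + fn)     # share of truly malignant that is detected
    malignant_metrics[name] = (round(float(precision), 2), round(float(recall), 2))
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
```

Both models miss the same number of malignant tumors (3 false negatives), so the difference between them lies only in the number of false positives.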

The performance differences between the classifiers can be explained by the characteristics of the base estimators and the variability introduced by the bagging technique. Decision trees tend to have high variance but can capture complex relationships. Random forests combine multiple decision trees to reduce variance and improve performance. Extra trees add further randomness to the tree-building process by choosing split thresholds at random, which can reduce variance even more. This may explain why the ExtraTreesClassifier seems to be performing the best.
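
One way to probe this variance argument is to compare a single decision tree with the ensembles under repeated cross-validation, which gives more score samples than one 5-fold run. This is only a sketch: it assumes scikit-learn's bundled `load_breast_cancer` data, and the settings and names (`candidates`, `summary`) are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset is used instead of the Kaggle CSV
X_var, y_var = load_breast_cancer(return_X_y=True)

# 5 folds repeated 3 times with different shuffles gives 15 scores per model
repeated_folds = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

candidates = {
    "single decision tree": DecisionTreeClassifier(random_state=0),
    "bagged decision trees": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=0
    ),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "extra trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_var, y_var, cv=repeated_folds)
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the variance argument holds for this dataset, the ensembles should show a higher mean and usually a smaller spread than the single tree, although the exact numbers depend on the random seeds.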

In [None]:
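# Further optimising the tree-based bagging classifiers
# Note: this list is not used below; the grid search loops over every entry in bagging_classifiers
base_estimators = [
    ExtraTreesClassifier(n_estimators=100)
]

# Define the parameter grid for hyperparameter tuning
# 'n_estimators' is the number of bagged models; 'estimator__max_depth' is passed to the base estimator
# (in scikit-learn versions before 1.2 the prefix is 'base_estimator__' instead of 'estimator__')
param_grid = {
    'n_estimators': [10, 50, 100],
    'estimator__max_depth': [None, 3, 5, 7]
}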

# Perform grid search to find the best hyperparameters for each bagging classifier
for classifier_name, bagging_classifier in bagging_classifiers.items():
    grid_search = GridSearchCV(bagging_classifier, param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    
    # Get the best hyperparameters and score
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    
    print(f"\n{classifier_name} Bagging Classifier:")
    print("Best Hyperparameters:", best_params)
    print("Best Score:", best_score)
    
    # Train and evaluate the bagging classifier with the best hyperparameters
    bagging_classifier.set_params(**best_params)