# 1. Introduction
## 1.1 Definition
### AdaBoost Ensemble Algorithms for Machine Learning
* **AdaBoost, short for Adaptive Boosting,** is a powerful and versatile ensemble learning technique that combines multiple **"weak"** learners (often decision trees), each with slightly better-than-chance performance, to create a strong predictor.

* It's a popular choice for its simplicity, effectiveness, and ability to handle complex datasets and improve the accuracy of the model iteratively.

* AdaBoost is widely used in various applications, including face detection, fraud detection, and customer churn prediction, where the ability to boost the performance of weak learners is valuable.


* **Boosting** adds models to the ensemble sequentially, with each new model attempting to correct the errors made by the models already in the ensemble. As a result, adding more members is expected to reduce the ensemble's errors, up to a limit supported by the data and before the ensemble starts **overfitting** the training dataset.

* **Boosting** was first developed as a theoretical idea, and the **AdaBoost algorithm** was the first successful approach to realizing a boosting-based ensemble algorithm.
  
* **AdaBoost** works by fitting **decision trees** on versions of the training dataset that are weighted so that each new tree pays more attention to examples (rows) that the prior models got wrong, and less attention to those that the prior models got correct.

* Rather than full decision trees, AdaBoost typically uses very simple trees that make a single decision on one input variable before making a prediction. These short trees are referred to as **decision stumps.**

* AdaBoost is available in scikit-learn via the **AdaBoostClassifier** and **AdaBoostRegressor** classes, which use a shallow decision tree as the base model by default (a decision stump in the case of the classifier). You can specify the number of trees to create via the **n_estimators** argument.
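
The reweighting idea described above can be sketched from scratch. The following is a minimal illustration of discrete AdaBoost for two classes, not the scikit-learn implementation: the helper names `fit_adaboost_stumps` and `predict_adaboost`, the choice of 20 rounds, and the `-1/+1` label encoding are all assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_adaboost_stumps(X, y, n_rounds=20):
    """Hypothetical helper: discrete AdaBoost for labels in {-1, +1}."""
    n_rows = len(y)
    weights = np.full(n_rows, 1.0 / n_rows)  # start with uniform row weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        # a decision stump is a tree limited to a single split
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        # weighted error rate (the weights sum to 1)
        err = np.clip(np.sum(weights[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # vote strength of this stump
        # increase weights of misclassified rows, decrease the others
        weights *= np.exp(-alpha * y * pred)
        weights /= weights.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)


def predict_adaboost(stumps, alphas, X):
    """Hypothetical helper: weighted vote of all stumps."""
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(votes >= 0, 1, -1)


X_demo, y01 = make_classification(random_state=1)
y_demo = np.where(y01 == 1, 1, -1)  # recode labels from {0, 1} to {-1, +1}
stumps, alphas = fit_adaboost_stumps(X_demo, y_demo)
train_acc = np.mean(predict_adaboost(stumps, alphas, X_demo) == y_demo)
print('Training accuracy of the sketch: %.3f' % train_acc)
```

The accuracy printed here is measured on the training rows only, so it says nothing about generalization; the rest of this notebook uses cross-validation for that.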

## 1.2 Purposes
Evaluate the performance of an **AdaBoost ensemble model** on a **classification** task, using repeated cross-validation on a synthetic dataset.

**Thanks to**
* [Ensemble Machine Learning (7-Day Mini-Course) by Jason Brownlee](https://machinelearningmastery.com/ensemble-machine-learning-with-python-7-day-mini-course/)

# 2. Import libraries

In [1]:
# example of evaluating an AdaBoost ensemble for classification
# calculate the mean and standard deviation of the model's accuracy scores.
from numpy import mean
from numpy import std

# generating synthetic datasets
from sklearn.datasets import make_classification

# evaluating models
from sklearn.model_selection import cross_val_score

# creating resampling strategies
from sklearn.model_selection import RepeatedStratifiedKFold

# implementing the AdaBoost ensemble method
from sklearn.ensemble import AdaBoostClassifier

# 3. Create the Synthetic Dataset

In [2]:
# create the synthetic classification dataset
# X : input features matrix
# y : output/target labels
# random_state=1 : ensures that the data generation process is reproducible.
# all other arguments keep their defaults (100 rows, 20 features, 2 classes).
X, y = make_classification(random_state=1)
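
A quick check of what was generated can help before modelling. This is a small sketch that repeats the same `make_classification` call and inspects the result; the use of `Counter` for the class counts is just one convenient choice.

```python
from collections import Counter

from sklearn.datasets import make_classification

# same call as above, repeated so this snippet runs on its own
X, y = make_classification(random_state=1)

# with the default arguments this gives 100 rows and 20 input features
print(X.shape, y.shape)

# number of rows per class label
print(Counter(y))
```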

# 4. Configure the AdaBoost Ensemble Model
**n_estimators=150**

Specifies the maximum number of **weak learners (e.g., decision trees)** to be used in the ensemble; boosting can stop earlier if a learner fits the data perfectly. AdaBoost builds each new weak learner *based on the errors of the previous ones, improving the model's performance.*



In [3]:
# configure the ensemble model
model = AdaBoostClassifier(n_estimators=150)
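
To get a feel for how the number of estimators affects accuracy, one option is `staged_score`, which reports the score after each boosting iteration. The sketch below is an illustration only: the single 70/30 hold-out split, the `random_state=1` on the model, and the `demo_`/`X_demo` names are assumptions added for this example and are not part of the evaluation in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1)

demo_model = AdaBoostClassifier(n_estimators=150, random_state=1)
demo_model.fit(X_train, y_train)

# staged_score yields the test accuracy after each boosting iteration
staged = list(demo_model.staged_score(X_test, y_test))

# boosting may stop early, so only report checkpoints that were reached
checkpoints = [n for n in (1, 10, 50, 150) if n <= len(staged)]
for n in checkpoints:
    print('%3d estimators: %.3f' % (n, staged[n - 1]))

# the default base model of the classifier is a depth-1 tree (a stump)
print('Depth of first tree:', demo_model.estimators_[0].get_depth())
```

A single hold-out split on 100 rows is noisy, so the numbers are only indicative; the repeated cross-validation configured next gives a steadier estimate.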

# 5. Configure the Resampling Method
* **n_splits=10 :** Split the dataset into **10 parts (folds)**. Because the split is stratified, each fold keeps approximately the same class proportions as the full dataset.
* **n_repeats=3 :** Repeat the **10-fold cross-validation** process **three times** with a different random split each time. This yields 30 scores in total and gives a more reliable estimate by reducing the effect of any single random split.
* **random_state=1 :** Ensures that the data splitting process is reproducible.


In [4]:
# configure the resampling method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
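
The splits produced by this configuration can be inspected directly. The following sketch rebuilds the dataset and the splitter under new names (`X_demo`, `y_demo`, `cv_demo`, chosen here only to keep the snippet self-contained) and looks at the first train/test split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold

X_demo, y_demo = make_classification(random_state=1)
cv_demo = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# 10 folds repeated 3 times
print('Total train/test splits:', cv_demo.get_n_splits())

# look at the first split only
train_idx, test_idx = next(cv_demo.split(X_demo, y_demo))
print('Train rows:', len(train_idx), 'Test rows:', len(test_idx))

# stratification keeps both classes present in the held-out fold
test_counts = np.bincount(y_demo[test_idx], minlength=2)
print('Class counts in test fold:', test_counts)
```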

# 6. Evaluate the Model Using Cross-Validation
* **scoring='accuracy':** The metric used to evaluate the model is accuracy, which is the *ratio of correctly predicted instances to the total instances.*
* **cv=cv :** Uses the **RepeatedStratifiedKFold** cross-validation strategy defined earlier.
* **n_jobs=-1 :** Utilizes all available CPU cores to perform the cross-validation in parallel, speeding up the process.



In [5]:
# evaluate the ensemble on the dataset using the resampling method
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
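
Conceptually, `cross_val_score` fits a fresh copy of the model on each training split and scores it on the matching held-out fold. The loop below is a rough sketch of that process, not the library's internal code. To keep it quick it assumes a smaller `n_estimators=50` and a fixed `random_state=1` on the model, so its scores will not match the cell above exactly.

```python
from numpy import mean
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold

X_demo, y_demo = make_classification(random_state=1)
base_model = AdaBoostClassifier(n_estimators=50, random_state=1)
cv_demo = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

manual_scores = []
for train_idx, test_idx in cv_demo.split(X_demo, y_demo):
    # clone gives an unfitted copy so no state leaks between folds
    fold_model = clone(base_model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = fold_model.predict(X_demo[test_idx])
    manual_scores.append(accuracy_score(y_demo[test_idx], preds))

print('Number of scores:', len(manual_scores))
print('Mean accuracy: %.3f' % mean(manual_scores))
```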

# 7. Report the Model's Performance
* **mean(n_scores) :** Calculates the **mean (average) accuracy** across *all 30 cross-validation scores (10 folds × 3 repeats)*, providing an overall performance estimate of the model.
* **std(n_scores) :** Calculates the **standard deviation of the accuracy scores,** indicating how much the *model's performance varies across different folds and repeats.*

In [6]:
# report ensemble performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Mean Accuracy: 0.950 (0.089)
