<a href="https://colab.research.google.com/github/cloudpedagogy/data-science-programming/blob/main/machine-learning-scikit-learn/09_Advanced_Topics_in_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Advanced Topics in Machine Learning


#Ensemble methods (bagging, boosting)




##Bagging


Bagging, short for Bootstrap Aggregating, is an ensemble learning method. It draws multiple bootstrap samples from the original training set (random samples taken with replacement, typically the same size as the original) and trains a separate model on each one. The predictions from the models are then combined, by majority vote for classification or by averaging for regression, to make the final prediction.
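To make the mechanics concrete, here is a minimal sketch of bagging done by hand. It is not how `BaggingClassifier` is implemented internally. It assumes synthetic data from `make_classification` in place of a real dataset, binary 0/1 labels, and illustrative names such as `n_models` and `all_votes`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a majority vote on 0/1 labels cannot tie
all_votes = []

for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    all_votes.append(tree.predict(X_test))

# Aggregate: majority vote across the models for each test row
votes = np.array(all_votes)  # shape (n_models, n_test)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)

accuracy = (y_pred == y_test).mean()
print("Manual bagging accuracy:", accuracy)
```

Each tree sees a slightly different version of the training set, so their individual errors differ, and the vote tends to cancel some of them out.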

In Scikit-Learn, the `BaggingClassifier` and `BaggingRegressor` classes implement bagging for classification and regression respectively. Each base estimator in the ensemble is trained on a random sample of the original data, drawn with replacement by default (`bootstrap=True`).
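As a quick illustration of the regression variant, the sketch below fits a `BaggingRegressor` on synthetic data. The dataset from `make_regression`, the choice of 50 estimators, and the variable names are assumptions made for this example only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# With no base estimator given, a decision tree regressor is used
bagging_reg = BaggingRegressor(n_estimators=50, random_state=0)
bagging_reg.fit(X_train, y_train)

# Predictions are the average of the individual trees' predictions
y_pred = bagging_reg.predict(X_test)
r2 = r2_score(y_test, y_pred)
print("R^2 on the test set:", r2)
```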

Here's an example of using bagging with the Pima Indian Diabetes dataset for classification:


In [None]:
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Separate the features and target variable
X = dataset.drop("Outcome", axis=1)
y = dataset["Outcome"]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a base classifier (decision tree)
base_classifier = DecisionTreeClassifier()

# Create the bagging classifier
bagging = BaggingClassifier(base_classifier, n_estimators=10, random_state=42)

# Train the bagging classifier
bagging.fit(X_train, y_train)

# Make predictions on the test set
y_pred = bagging.predict(X_test)

# Evaluate the accuracy of the bagging classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


In this example, we first load the Pima Indian Diabetes dataset using Pandas. We separate the features (X) and the target variable (y). Then, we split the data into training and test sets using the `train_test_split` function from Scikit-Learn.

Next, we create a base classifier, which is a decision tree in this case. We then create a `BaggingClassifier` and pass the base classifier as its first argument. The `n_estimators` parameter specifies the number of base estimators to train in the ensemble (10 here, which is also the default).

We train the bagging classifier using the `fit` method on the training data. After that, we make predictions on the test set using the `predict` method. Finally, we evaluate the accuracy of the bagging classifier by comparing the predicted labels (`y_pred`) with the true labels (`y_test`). The accuracy is printed as the output.
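To get a feel for how the `n_estimators` setting described above affects results, the following sketch compares a few ensemble sizes using 5-fold cross-validation. It uses synthetic data from `make_classification` rather than the diabetes dataset so that it runs offline; the sizes tried and the name `scores_by_size` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

scores_by_size = {}
for n in (1, 10, 50):
    model = BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=n, random_state=0
    )
    # Mean accuracy over 5 cross-validation folds
    scores_by_size[n] = cross_val_score(model, X, y, cv=5).mean()
    print(f"n_estimators={n:>3}: mean CV accuracy = {scores_by_size[n]:.3f}")
```

The exact numbers depend on the data, so treat the output as a way to explore the trade-off between ensemble size and training time rather than as a benchmark.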

Note: The code above already imports everything it needs from `pandas` and Scikit-Learn. To run it, both packages must be installed (`pip install pandas scikit-learn`), and the dataset URL must be reachable.


##Boosting
Boosting is an ensemble technique in which multiple weak models (also known as base learners) are trained sequentially and combined into a stronger predictive model, with each new model concentrating on the errors of the ensemble built so far. Scikit-Learn provides several boosting algorithms for classification and regression tasks.

Here are two popular boosting algorithms available in Scikit-Learn:

1. AdaBoost (Adaptive Boosting): It focuses on training weak models sequentially, where each subsequent model tries to correct the mistakes made by the previous models. It assigns higher weights to the misclassified instances, making them more important in the next iteration.

2. Gradient Boosting: It builds the model in a stage-wise manner, where each subsequent model is fit to the negative gradient of the loss function (e.g., mean squared error for regression) with respect to the current ensemble's predictions. For squared error, this negative gradient is simply the residual. Gradient Boosting typically uses decision trees as weak learners.
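The two ideas above can be sketched by hand in a few lines. The first part shows one round of the textbook (discrete) AdaBoost reweighting for binary labels, and the second shows gradient boosting for squared error, where each tree is fit to the current residuals. This is a simplified illustration under assumptions (synthetic data, decision stumps, a fixed `learning_rate` of 0.1, 50 rounds); Scikit-Learn's own implementations differ in their details.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# --- One round of AdaBoost-style reweighting (binary labels) ---
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
weights = np.full(len(y), 1 / len(y))        # start with uniform weights
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y, sample_weight=weights)

missed = stump.predict(X) != y               # misclassified samples
error = weights[missed].sum()                # weighted error, assumed in (0, 0.5)
alpha = 0.5 * np.log((1 - error) / error)    # this learner's say in the vote

# Increase weights of misclassified samples, decrease the rest, renormalize
weights = weights * np.exp(np.where(missed, alpha, -alpha))
weights /= weights.sum()

# --- Gradient boosting for squared error (fit trees to residuals) ---
Xr, yr = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
learning_rate = 0.1
prediction = np.full(len(yr), yr.mean())     # start from a constant model
initial_mse = np.mean((yr - prediction) ** 2)

for _ in range(50):
    residuals = yr - prediction              # negative gradient of 1/2 * squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(Xr, residuals)
    prediction += learning_rate * tree.predict(Xr)

final_mse = np.mean((yr - prediction) ** 2)
print("Training MSE before and after boosting:", initial_mse, final_mse)
```

After the reweighting step, the misclassified samples carry half of the total weight, which is what forces the next weak learner to pay attention to them.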

Now let's see an example of using AdaBoost and Gradient Boosting with the Pima Indian Diabetes dataset for classification:


In [None]:
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Pima Indian Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
column_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
dataset = pd.read_csv(url, names=column_names)

# Split the dataset into features and target variable
X = dataset.drop('Outcome', axis=1)
y = dataset['Outcome']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# AdaBoost classifier
adaBoost = AdaBoostClassifier(n_estimators=100, random_state=42)
adaBoost.fit(X_train, y_train)
adaBoost_predictions = adaBoost.predict(X_test)
adaBoost_accuracy = accuracy_score(y_test, adaBoost_predictions)
print("AdaBoost accuracy:", adaBoost_accuracy)

# Gradient Boosting classifier
gradientBoost = GradientBoostingClassifier(n_estimators=100, random_state=42)
gradientBoost.fit(X_train, y_train)
gradientBoost_predictions = gradientBoost.predict(X_test)
gradientBoost_accuracy = accuracy_score(y_test, gradientBoost_predictions)
print("Gradient Boosting accuracy:", gradientBoost_accuracy)


In this example, we first load the Pima Indian Diabetes dataset using Pandas. Then, we split the dataset into features (X) and the target variable (y). Next, we split the data into training and testing sets using `train_test_split()`.

We then create an AdaBoost classifier and a Gradient Boosting classifier using `AdaBoostClassifier` and `GradientBoostingClassifier` classes from Scikit-Learn. We set the number of estimators (weak models) to 100 for both classifiers.

After that, we fit the classifiers to the training data using `fit()` and make predictions on the test set using `predict()`. Finally, we calculate the accuracy of each classifier using the `accuracy_score()` function and print the results.
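Because boosting adds models one at a time, it can be instructive to look at test accuracy after each stage. The sketch below uses the `staged_predict` method that both classifiers provide. It runs on synthetic data from `make_classification` instead of the diabetes dataset, and the name `staged_accuracy` is an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

staged_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # staged_predict yields the ensemble's predictions after each boosting stage
    staged_accuracy[name] = [
        accuracy_score(y_test, stage_pred)
        for stage_pred in model.staged_predict(X_test)
    ]
    print(name, "accuracy after first and last stage:",
          staged_accuracy[name][0], staged_accuracy[name][-1])
```

If accuracy on held-out data peaks and then declines as stages are added, that is a sign the ensemble has started to overfit and a smaller `n_estimators` may be preferable.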


#Reflection points

1. **What are ensemble methods in machine learning?** Ensemble methods combine multiple base models to make predictions. Reflect on the advantages of using ensemble methods compared to single models.

2. **What is bagging and how does it work?** Discuss the concept of bagging, which involves training multiple models independently on different bootstrap samples of the training data. Reflect on how combining their predictions reduces variance and improves model performance.

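3. **What are the popular bagging algorithms in Python?** Reflect on the bagging-based and closely related averaging ensembles available in Python, such as Random Forest and Extra Trees (which, by default in Scikit-Learn, trains each tree on the full training set rather than a bootstrap sample), and their strengths and weaknesses.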

4. **What is boosting and how does it work?** Discuss the concept of boosting, which involves training multiple models sequentially, where each subsequent model focuses on correcting the mistakes of the previous ones. Reflect on how boosting reduces bias and improves model performance.

5. **What are the popular boosting algorithms in Python?** Reflect on the different boosting algorithms available in Python, such as AdaBoost, Gradient Boosting, and XGBoost, and their specific characteristics and use cases.

6. **What are the key hyperparameters for bagging and boosting algorithms?** Reflect on the hyperparameters that are crucial for bagging and boosting algorithms, such as the number of estimators, learning rate, depth of trees, and regularization parameters.

7. **How can you evaluate ensemble models effectively?** Reflect on the evaluation metrics suitable for ensemble models, such as accuracy, precision, recall, F1-score, and area under the ROC curve. Consider the importance of cross-validation and out-of-bag error estimation.

8. **What are the feature importance measures in ensemble methods?** Reflect on the feature importance measures provided by ensemble methods, such as Gini importance and permutation importance, and discuss their significance in feature selection and interpretation.

9. **How can ensemble methods be applied to regression problems?** Reflect on the adaptation of ensemble methods for regression tasks, including the use of bagging and boosting algorithms for predicting continuous outcomes.

10. **What are the potential challenges and limitations of ensemble methods?** Reflect on the limitations of ensemble methods, such as increased model complexity, longer training times, and the potential for overfitting. Discuss strategies to mitigate these challenges.
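Several of the reflection points above (out-of-bag estimation, cross-validated metrics, and feature importance) can be explored with a short experiment. The following is one possible sketch, assuming synthetic data from `make_classification` and a `RandomForestClassifier` as the ensemble; the parameter values are illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Out-of-bag estimate: each tree is scored on the rows it did not see
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)
print("Out-of-bag accuracy:", forest.oob_score_)

# Cross-validated area under the ROC curve
auc_scores = cross_val_score(forest, X, y, cv=5, scoring="roc_auc")
print("Mean cross-validated ROC AUC:", auc_scores.mean())

# Impurity-based (Gini) importance, computed from the training data
gini_importance = forest.feature_importances_

# Permutation importance, computed on held-out data
perm = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)
for i in range(X.shape[1]):
    print(f"feature {i}: gini={gini_importance[i]:.3f}, "
          f"permutation={perm.importances_mean[i]:.3f}")
```

Comparing the two importance columns is a useful exercise: they often agree on the strongest features but can rank weaker ones differently.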


#A quiz on Ensemble methods (bagging, boosting)



1. What is the main purpose of Bagging and Boosting in machine learning?
   <br>a) To reduce model complexity
   <br>b) To increase model diversity
   <br>c) To improve model accuracy
   <br>d) To speed up model training

2. Which ensemble technique involves training multiple models independently on different subsets of the training data and then combining their predictions through voting or averaging?
   <br>a) Bagging
   <br>b) Boosting
   <br>c) Stacking
   <br>d) Gradient Boosting

3. In Bagging, how are the subsets of the training data created for each model?
   <br>a) By using all of the training data for each model
   <br>b) By selecting random subsets with replacement from the training data
   <br>c) By selecting random subsets without replacement from the training data
   <br>d) By selecting the hardest examples from the training data

4. Which specific algorithm increases the weights of misclassified samples after each round so that the next model focuses on them?
   <br>a) Bagging
   <br>b) Boosting
   <br>c) AdaBoost
   <br>d) Random Forest

5. Which of the following statements is true?
   <br>a) Bagging can lead to overfitting on the training data.
   <br>b) Boosting can lead to overfitting on the training data.
   <br>c) Bagging only works with decision tree models.
   <br>d) Boosting only works with linear models.

---
**Answers**

1. c) To improve model accuracy
2. a) Bagging
3. b) By selecting random subsets with replacement from the training data
4. c) AdaBoost
5. b) Boosting can lead to overfitting on the training data.

---