# Ensemble methods

* Pre-reqs:
    * Linear classifiers in Python
    * Machine Learning with Tree-Based Models in Python
    * Supervised Learning with Scikit-Learn

### CAPSTONE TO DO: choose an evaluation metric for ensemble model: accuracy, sensitivity, specificity, other?
* **Accuracy:** the percentage of dataset instances that the model predicts correctly
* **Sensitivity:** the percentage of COVID-19 positive patients that the model correctly identifies
* **Specificity:** the percentage of COVID-19 negative patients that the model correctly identifies
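
The three metrics above can all be read off a confusion matrix. A minimal sketch, assuming binary labels where 1 means positive and 0 means negative; the arrays `y_true` and `y_pred` are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 = positive, 0 = negative
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# For binary labels, ravel() yields tn, fp, fn, tp (in that order)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)  # (tp + tn) / total
sensitivity = tp / (tp + fn)               # recall of the positive class
specificity = tn / (tn + fp)               # recall of the negative class

print("Accuracy: {:.3f}".format(accuracy))
print("Sensitivity: {:.3f}".format(sensitivity))
print("Specificity: {:.3f}".format(specificity))
```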

* Create cross-val table of: CCI, TP, FP, Precision, Recall, F Measure/F1 Score, ROC/AUC?
* **F1 Score for imbalanced data classes**
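
One possible way to build such a table is `cross_validate` with several scorers, collected into a DataFrame. This is only a sketch: the imbalanced dataset comes from `make_classification` and the decision tree is a stand-in, since the capstone data and model are not shown here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption: replace with the real dataset)
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)

scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
cv_results = cross_validate(
    DecisionTreeClassifier(max_depth=3, random_state=500),
    X, y, cv=5, scoring=scoring,
)

# One row per fold, one column per metric, plus a row of means
table = pd.DataFrame({m: cv_results["test_" + m] for m in scoring})
table.loc["mean"] = table.mean()
print(table.round(3))
```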


### Combining multiple models
* When you're building a model, you want to choose the one that performs the best according to some evaluation metric
* Ensemble methods: form a new model by combining existing ones; the combined responses of different models will likely lead to a better decision than relying on a single response.
    * the combined model will often perform better than any of the individual models (or at least about as well as the best individual model), although this is not guaranteed
* Ensemble learning is one of the most effective techniques in machine learning
* `mlxtend` is a useful Python library for machine learning
* Main scikit-learn module: `sklearn.ensemble`


```
# Template only: MetaEstimator, Model1, ..., ModelN are placeholders, not real classes
from sklearn.ensemble import MetaEstimator
# Base estimators
est1 = Model1()
est2 = Model2()
estN = ModelN()
# Meta estimator
est_combined = MetaEstimator(estimators=[est1, est2,  ..., estN], 
                # Additional parameters (specific to the ensemble method)
)
# Train and test
est_combined.fit(X_train, y_train)
pred = est_combined.predict(X_test)
```
* The best feature of the meta estimator is that it works similarly to the scikit-learn estimators you already know, with the standard methods `fit` and `predict`
* Decision trees are the building block of many ensemble methods

RECAP DT CODE:

```
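from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Split into train (80%) and test (20%) sets (X and y are assumed to be loaded already)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)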

# Instantiate the regressor
reg_dt = DecisionTreeRegressor(min_samples_leaf=3, min_samples_split=9, random_state=500)

# Fit to the training set
reg_dt.fit(X_train, y_train)

# Evaluate the performance of the model on the test set
y_pred = reg_dt.predict(X_test)
print('MAE: {:.3f}'.format(mean_absolute_error(y_test, y_pred)))
```

### Voting

* Concept of "wisdom of the crowds": refers to the collective intelligence based on a group of individuals instead of a single expert 
* Also known as **collective intelligence**
* The aggregated opinion of the crowd can be as good as (and is usually superior to) the answer of any individual, even that of an expert.
* Useful technique commonly applied to problem solving, decision making, innovation, and prediction (we are particularly interested in prediction)

* **Majority voting:** a technique that combines the outputs of several classifiers by predicting the class that most of them vote for
    * Properties:
        * Classification problems
        * Majority voting $\Rightarrow$ Mode of individual predictions
        * **It is recommended to use an odd number of classifiers:** Use at least **three** classifiers, and when problem constraints allow it, use five or more
        
        
* **Wise Crowd Characteristics:**
    * **The ensemble needs to be diverse:** do this by using different algorithms or different datasets.
    * **Independent and uncorrelated:** Each prediction needs to be independent and uncorrelated from the rest.
    * **Use individual knowledge:** Each model should be able to make its own prediction without relying on the other predictions
    * **Aggregate individual predictions:** into a collective one
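
Majority voting as the mode of the individual predictions can be sketched with plain NumPy. The three prediction arrays below are made up for illustration and stand for the outputs of three already trained classifiers.

```python
import numpy as np

# Hypothetical predictions of three classifiers for five instances
pred_1 = np.array([1, 0, 1, 1, 0])
pred_2 = np.array([1, 1, 0, 1, 0])
pred_3 = np.array([0, 0, 1, 1, 1])

# Stack into shape (n_classifiers, n_instances)
all_preds = np.vstack([pred_1, pred_2, pred_3])

# Majority vote = most frequent label in each column
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print(majority)  # [1 0 1 1 0]
```

With three voters and two classes, no column can end in a tie, which is the reason for recommending an odd number of classifiers.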
    
**NOTE:** Majority Voting can only be applied to classification problems
    * For regression problems, see scikit-learn's `VotingRegressor`, which averages the individual predictions
    
```
from sklearn.ensemble import VotingClassifier
# clf_1 ... clf_N are placeholder classifier instances
clf_voting = VotingClassifier(
    estimators=[
        ('label1', clf_1),
        ('label2', clf_2),
        ('labelN', clf_N)])
```

```
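from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Create the individual models
clf_knn = KNeighborsClassifier(5) # k=5 nearest neighbors (to avoid multi-modal predictions)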
clf_dt = DecisionTreeClassifier()
clf_lr = LogisticRegression()
# Create voting classifier
clf_voting = VotingClassifier(
    estimators=[
        ('knn', clf_knn),
        ('dt', clf_dt),
        ('lr', clf_lr)])
# Fit combined model to the training set and predict
clf_voting.fit(X_train, y_train)
y_pred = clf_voting.predict(X_test)
```
#### Evaluate the performance

```
from sklearn.metrics import accuracy_score

# Get the accuracy score (true labels first, then predictions)
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:0.3f}".format(acc))
```


* The main input, passed with the keyword `estimators`, is a list of (string, estimator) tuples.
* Each string is a label and each estimator is an sklearn classifier
* **You do not need to fit the classifiers individually, as the voting classifier will take care of that.**
    * The voting classifier does not tune hyperparameters, though: each model keeps the settings it was instantiated with, so any tuning has to be done separately (for example with `GridSearchCV`)
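
On the hyperparameter question: fitting a `VotingClassifier` leaves the settings of its sub-models untouched, but they can be tuned through the ensemble with `GridSearchCV`, addressing each one as `<label>__<parameter>`. A minimal sketch on synthetic data; the grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=300, random_state=42)

clf_voting = VotingClassifier(estimators=[
    ("knn", KNeighborsClassifier()),
    ("dt", DecisionTreeClassifier(random_state=500)),
    ("lr", LogisticRegression(max_iter=1000)),
])

# Parameters of the sub-models are addressed as <label>__<parameter>
param_grid = {
    "knn__n_neighbors": [3, 5],
    "dt__max_depth": [2, 4],
    "lr__C": [0.1, 1.0],
}
search = GridSearchCV(clf_voting, param_grid, cv=3, scoring="f1")
search.fit(X, y)
print(search.best_params_)
```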
    
```
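from sklearn.metrics import f1_score

# Fit each model on its own: VotingClassifier fits clones, so the originals are still unfitted
for clf in (clf_lr, clf_dt, clf_knn):
    clf.fit(X_train, y_train)

# Make the individual predictions
pred_lr = clf_lr.predict(X_test)
pred_dt = clf_dt.predict(X_test)
pred_knn = clf_knn.predict(X_test)

# Evaluate the performance of each model
score_lr = f1_score(y_test, pred_lr)
score_dt = f1_score(y_test, pred_dt)
score_knn = f1_score(y_test, pred_knn)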
```

* **1) Choose the best model**
    * generate predictions as per each model
    * get evaluation metric scores for each model's predictions (F1 score, accuracy, precision, recall, etc)
    * Decide which model performs the best
    
* **2) Instantiate models**

```
# Instantiate the individual models
clf_knn = KNeighborsClassifier(n_neighbors=5)
clf_lr = LogisticRegression(class_weight='balanced')
clf_dt = DecisionTreeClassifier(min_samples_leaf=3, min_samples_split=9, random_state=500)

# Create and fit the voting classifier
clf_vote = VotingClassifier(
    estimators=[('knn', clf_knn), ('lr', clf_lr), ('dt', clf_dt)]
)
clf_vote.fit(X_train, y_train)
```

* **3) Evaluate your ensemble**

```
from sklearn.metrics import classification_report, f1_score

# Calculate the predictions using the voting classifier
pred_vote = clf_vote.predict(X_test)

# Calculate the F1-Score of the voting classifier
score_vote = f1_score(y_test, pred_vote)
print('F1-Score: {:.3f}'.format(score_vote))

# Calculate the classification report
report = classification_report(y_test, pred_vote)
print(report)
```

### Averaging aka "Soft Voting"
* Averaging is another popular ensemble method
* Averaging is also referred to as **soft voting**
* Averaging / soft voting can be applied to both classification and regression.
* In this technique, the combined prediction is the mean of the individual predictions
* **Soft voting: Mean**
    * **Regression:** mean of predicted values
    * **Classification:** mean of predicted *probabilities*
* As the mean doesn't have ambiguous cases (like the mode), we can use any number of estimators (odd or even), as long as we have at least two.

* To build an averaging classifier, we'll use the same class as before: **`VotingClassifier()`**
    * **Main difference:** we specify an additional parameter: **`voting='soft'`**
        * Default value of voting is `'hard'`
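
To make "mean of predicted probabilities" concrete, the sketch below averages the `predict_proba` outputs of the fitted sub-models by hand and compares the result with what a soft-voting `VotingClassifier` reports. The data is synthetic and the two models are arbitrary stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=200, random_state=42)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=3, random_state=500)),
]
clf_soft = VotingClassifier(estimators=estimators, voting="soft").fit(X, y)

# Average the class probabilities of the fitted sub-models by hand
manual_proba = np.mean([est.predict_proba(X) for est in clf_soft.estimators_], axis=0)
# The predicted class is the one with the highest mean probability
manual_pred = clf_soft.classes_[manual_proba.argmax(axis=1)]

print(np.allclose(manual_proba, clf_soft.predict_proba(X)))
print((manual_pred == clf_soft.predict(X)).all())
```
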
#### Averaging Classifier    
    
```
from sklearn.ensemble import VotingClassifier
clf_voting = VotingClassifier(
                estimators=[
                    ('label1', clf_1),
                    ('label2', clf_2),
                    ...
                    ('labelN', clf_N)],
                voting = 'soft', 
                weights=[w_1, w_2, ..., w_N]
)
```
* parameter `weights` is optional; specifies a weight for each of the estimators
    * If specified, the combined prediction is a weighted average of the individual ones
    * Otherwise, weights are considered uniform.
* To build an averaging regressor, use **`VotingRegressor()`**:
#### Averaging Regressor:

```
from sklearn.ensemble import VotingRegressor
# VotingRegressor always averages, so it has no `voting` parameter
reg_voting = VotingRegressor(
        estimators = [
            ('label1', reg_1),
            ('label2', reg_2),
            ...
            ('labelN', reg_N)],
        weights = [w_1, w_2, ..., w_N]
)
```
* The first parameter is also a list of the string/estimator tuples, but instead of classifiers, we use regressors
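
A runnable version of the regressor template might look like the following. The data is synthetic, and the two regressors and the weights `[2, 1]` are arbitrary choices for illustration. The last lines check that the ensemble's prediction is the weighted mean of the individual ones.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

weights = [2, 1]
reg_voting = VotingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("dt", DecisionTreeRegressor(max_depth=3, random_state=500)),
    ],
    weights=weights,
).fit(X, y)

# The combined prediction is the weighted mean of the individual predictions
individual = np.array([est.predict(X) for est in reg_voting.estimators_])
manual = np.average(individual, axis=0, weights=weights)
print(np.allclose(manual, reg_voting.predict(X)))
```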


***
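* **Averaging classifier example:**
```
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier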
clf_knn = KNeighborsClassifier(5)
clf_dt = DecisionTreeClassifier()
clf_lr = LogisticRegression()
clf_voting = VotingClassifier(
        estimators = [
            ('knn', clf_knn),
            ('dt', clf_dt),
            ('lr', clf_lr)],
        voting='soft',
        weights = [1, 2, 1]
)
```
* Assuming we know that Decision Tree has better individual performance, we give it a higher weight
* Ideally, the weights should be tuned while training the model, for example, using grid search cross-validation (`GridSearchCV`).
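
A cross-validated search over candidate weight vectors could look like this. The sketch does the search by hand with `cross_val_score` to keep the mechanics visible; since `weights` is an ordinary constructor parameter, the same candidates could instead be passed to `GridSearchCV` as `param_grid={'weights': [...]}`. Data, models and candidate weights are illustrative assumptions, and `make_voter` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=300, random_state=42)

def make_voter(weights):
    """Hypothetical helper: soft-voting ensemble of fresh models with the given weights."""
    return VotingClassifier(
        estimators=[
            ("knn", KNeighborsClassifier(5)),
            ("dt", DecisionTreeClassifier(max_depth=3, random_state=500)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",
        weights=weights,
    )

# Candidate weight vectors (one weight per estimator, in the order knn, dt, lr)
candidates = [(1, 1, 1), (1, 2, 1), (2, 1, 1), (1, 1, 2)]
cv_scores = {w: cross_val_score(make_voter(list(w)), X, y, cv=5).mean() for w in candidates}

# Keep the weights with the best mean cross-validated accuracy
best_weights = max(cv_scores, key=cv_scores.get)
print(best_weights, round(cv_scores[best_weights], 3))
```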

```
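from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Build the individual models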
clf_lr = LogisticRegression(class_weight='balanced')
clf_dt = DecisionTreeClassifier(min_samples_leaf=3, min_samples_split=9, random_state=500)
clf_svm = SVC(probability=True, class_weight='balanced', random_state=500)

# List of (string, estimator) tuples
estimators = [('lr', clf_lr), ('dt', clf_dt), ('svm', clf_svm)]

# Build and fit an averaging classifier
clf_avg = VotingClassifier(estimators = estimators,voting='soft')
clf_avg.fit(X_train, y_train)

# Evaluate model performance
acc_avg = accuracy_score(y_test,  clf_avg.predict(X_test))
print('Accuracy: {:.2f}'.format(acc_avg))
```

## Bagging
#### The strength of weak models
* What is a weak model and how to identify one by its properties?
* Voting and averaging work by combining the predictions of already trained models
    * Small number of estimators
    * **Fine-tuned estimators**
    * Individually optimized and trained for the problem
    
#### Weak vs. fine-tuned model

#### Weak estimator
* A "weak" model does not mean that it is a "bad" model, just that it is not as strong as a highly-optimized, finely-tuned model.
* Performance is slightly better than random guessing
* The error rate is less than 50%, but close to it
* A weak model should be light in terms of space and computational requirements, and fast during training and prediction.
* One good example: a decision tree limited to a depth of two
* **The three desired properties:**
    * Low performance (just above guessing)
    * It is light
    * Low training and evaluation time
* Common examples of weak models:
    * decision tree with small depth
    * Logistic Regression (makes the assumption that the classes are linearly separable)
        * could also limit number of iterations for training
        * or, specify a high value of the parameter C to use a weak regularization
    * Linear Regression 
        * Makes the assumption that the output is a linear function of the input features
        * could limit the number of iterations or not use normalization
    * Other restricted models

#### Bagging = Bootstrap Aggregating
* Heterogenous vs homogenous ensemble methods
#### Heterogenous:
* Different algorithms (fine-tuned)
* Work well with a small number of estimators
* For example, we could combine a decision tree, a logistic regression, and a support vector machine using voting to improve the results
* Included: Voting, Averaging, Stacking
#### Homogenous
* methods such as bagging
* work by applying the same algorithm on all the estimators, and this algorithm must be a "weak" model
* Large number of weak estimators
* Bagging and Boosting are some of the most popular of this kind

#### Condorcet's Jury Theorem:
* **Requirements:**
* All models must be independent
* Each model performs better than random guessing
* All individual models have similar performance

* If these three conditions are met, then adding more models increases the probability that the ensemble is correct, and this probability tends to 1 (i.e. 100%)
* The second and third requirements can be fulfilled by using the same weak model for all the estimators
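
The theorem can be checked numerically. If each of $n$ independent models is correct with probability $p$, the majority is correct with probability $\sum_{k=\lfloor n/2 \rfloor + 1}^{n} \binom{n}{k} p^k (1-p)^{n-k}$. The sketch below evaluates this for $p = 0.6$, an arbitrary value chosen for illustration; `majority_correct_probability` is a hypothetical helper name.

```python
from math import comb

def majority_correct_probability(n_models, p):
    """P(more than half of n independent models are right), each right with probability p."""
    k_min = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

# Odd ensemble sizes, so that a strict majority always exists
for n in [1, 3, 11, 101]:
    print(n, round(majority_correct_probability(n, 0.6), 3))
```

The probability grows with the number of models as long as $p > 0.5$; for $p < 0.5$ the same formula shows it shrinking instead, which is why each model must beat random guessing.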

* To approximate the first requirement of the theorem (independence), the **bagging algorithm** trains each individual model on its own random subsample drawn with replacement (known as **bootstrapping**)

* A wise crowd needs to be diverse, either through using different datasets or algorithms

* **Bootstrapping** requires:
    * Random subsamples $\Rightarrow$ provides **diversity** of data
    * Using replacement
    
* Bagging helps reduce variance, as the sampling is truly random
* Bias may also be reduced somewhat since we use voting or averaging to combine the models, although variance reduction is the main effect
* Bagging provides stability and robustness
* However, **bagging is computationally expensive in terms of space and time**

* To take a sample, you'll use pandas' `.sample()` method, which has a replace parameter. For example, the following line of code samples with replacement from the whole DataFrame df:
    * `df.sample(frac=1.0, replace=True, random_state=42)`
    
```
# Take a sample with replacement
X_train_sample = X_train.sample(frac=1.0, replace=True, random_state=42)
y_train_sample = y_train.loc[X_train_sample.index]

# Build a "weak" Decision Tree classifier
clf = DecisionTreeClassifier(max_depth=4, random_state=500)

# Fit the model to the training sample
clf.fit(X_train_sample, y_train_sample)
```
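
Sampling with replacement, as above, repeats some rows and leaves others out. For a sample as large as the original data, the expected share of distinct rows is about $1 - 1/e \approx 63\%$. The sketch below checks this on a made-up DataFrame of 10,000 rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only the index matters here
df = pd.DataFrame({"x": np.arange(10_000)})

# Bootstrap sample: same size as the original, drawn with replacement
boot = df.sample(frac=1.0, replace=True, random_state=42)

in_bag_fraction = boot.index.nunique() / len(df)
print("Distinct rows in the sample: {:.3f}".format(in_bag_fraction))
print("Rows left out of the sample: {:.3f}".format(1 - in_bag_fraction))
```

The rows left out are the ones that an out-of-bag evaluation can later use.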

```
def build_decision_tree(X_train, y_train, random_state=None):
    # Takes a sample with replacement,
    # builds a "weak" decision tree,
    # and fits it to the train set

def predict_voting(classifiers, X_test):
    # Makes the individual predictions 
    # and then combines them using "Voting"
```

#### BaggingClassifier
* **Heterogenous Ensemble Functions**
* A key difference between the ensemble functions from heterogenous and homogenous methods:

```
het_est = HeterogenousEnsemble(
    estimators= [('est1', est1), ('est2', est2), ...], 
    # additional parameters
)
```

* **Homogenous Ensemble Functions:**
* To build a homogenous ensemble model, instead of a list of estimators, we pass the parameter `estimator` (named `base_estimator` before scikit-learn 1.2), which is the instantiated "weak" model we have chosen for our ensemble:

```
hom_est = HomogenousEnsemble(
            estimator= est_base,  # base_estimator before scikit-learn 1.2
            n_estimators= chosen_number,
            # additional parameters
)
```

#### BaggingClassifier example:

```
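from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

clf_dt = DecisionTreeClassifier(max_depth=3)
clf_bag = BaggingClassifier(
            estimator=clf_dt,  # named base_estimator before scikit-learn 1.2
            n_estimators=5
)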
clf_bag.fit(X_train, y_train)
y_pred = clf_bag.predict(X_test)
```

#### BaggingRegressor example

```
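from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression

reg_lr = LinearRegression()  # no normalization is applied by default
reg_bag = BaggingRegressor(
            estimator=reg_lr,  # named base_estimator before scikit-learn 1.2
            oob_score=True  # oob_score = out-of-bag score
)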
reg_bag.fit(X_train, y_train)
y_pred = reg_bag.predict(X_test)
```
* Note: `n_estimators` defaults to 10 when left unspecified, as above

#### Out-of-bag score
* In a bagging ensemble, each estimator is trained on a bootstrap sample. Therefore, each of the samples will leave out some of the instances, which are then used to evaluate each estimator, similar to a train-test split
* To get the out-of-bag score for each instance, we calculate the predictions using all the estimators for which it was out of the sample
* Then, combine individual predictions
* Evaluate the metric on those predictions
    * For **classification** the default metric is accuracy
    * For **regression** the default metric is R2 (aka **coefficient of determination**)
* **Out of bag score helps avoid the need for an independent test set**
    * However, it's often lower than the actual performance
* To get the out-of-bag score from a Bagging ensemble, we need to set the parameter `oob_score` to True
* After training the model, we can access the oob score using the attribute `.oob_score_`
* It's good to compare oob score to actual metric (for example, R2 or accuracy)
    * The two values being close is a good indicator of the model's ability to generalize to new data
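
The steps above can be traced by hand on a fitted `BaggingClassifier`, using its `estimators_samples_` attribute (the in-bag indices of each estimator). This is a sketch of the idea on synthetic data, not a reference implementation; the hand-computed value should agree with `oob_score_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=300, random_state=42)

# The base model is passed positionally, since its keyword name differs across versions
clf_bag = BaggingClassifier(
    DecisionTreeClassifier(max_depth=4),
    n_estimators=25,
    oob_score=True,
    random_state=500,
).fit(X, y)

# Per instance, sum the class probabilities from the trees that did NOT train on it
proba_sum = np.zeros((len(X), len(clf_bag.classes_)))
for tree, in_bag in zip(clf_bag.estimators_, clf_bag.estimators_samples_):
    oob_mask = ~np.isin(np.arange(len(X)), in_bag)
    proba_sum[oob_mask] += tree.predict_proba(X[oob_mask])

# Combine the out-of-bag predictions and evaluate accuracy on them
manual_oob = np.mean(clf_bag.classes_[proba_sum.argmax(axis=1)] == y)
print("Manual OOB accuracy: {:.3f}".format(manual_oob))
print("oob_score_:          {:.3f}".format(clf_bag.oob_score_))
```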
    
```
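from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

# Instantiate the base model
clf_dt = DecisionTreeClassifier(max_depth=4)

# Build and train the Bagging classifier
clf_bag = BaggingClassifier(
  estimator=clf_dt,  # named base_estimator before scikit-learn 1.2
  n_estimators=21,
  random_state=500)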
clf_bag.fit(X_train, y_train)

# Predict the labels of the test set
pred = clf_bag.predict(X_test)

# Show the F1-score
print('F1-Score: {:.3f}'.format(f1_score(y_test, pred)))
```

```
# Build and train the bagging classifier
clf_bag = BaggingClassifier(
  estimator=clf_dt,  # named base_estimator before scikit-learn 1.2
  n_estimators=21,
  oob_score=True,
  random_state=500)
clf_bag.fit(X_train, y_train)

# Print the out-of-bag score
print('OOB-Score: {:.3f}'.format(clf_bag.oob_score_))

# Evaluate the performance on the test set to compare
pred = clf_bag.predict(X_test)
print('Accuracy: {:.3f}'.format(accuracy_score(y_test, pred)))
```