# Ensemble methods

* Pre-reqs:
    * Linear classifiers in Python
    * Machine Learning with Tree-Based Models in Python
    * Supervised Learning with Scikit-Learn

### CAPSTONE TO DO: choose an evaluation metric for the ensemble model: accuracy, sensitivity, specificity, other?
* **Accuracy:** the percentage of all instances in the dataset that the model predicts correctly
* **Sensitivity:** the percentage of COVID-19 positive patients correctly identified by the models (true positive rate)
* **Specificity:** the percentage of COVID-19 negative patients correctly identified by the models (true negative rate)
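
As a rough illustration of how these three metrics relate, the sketch below derives them from a confusion matrix. The labels are made up for the example (1 = positive, 0 = negative); they are not results from any real model.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels [0, 1], ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # all correct predictions
sensitivity = tp / (tp + fn)                # positives correctly identified (recall)
specificity = tn / (tn + fp)                # negatives correctly identified

print(f"Accuracy:    {accuracy:.3f}")
print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
```

Sensitivity and specificity can diverge sharply from accuracy on imbalanced data, which is one reason to look at all three before choosing.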

* Create cross-val table of: CCI, TP, FP, Precision, Recall, F Measure/F1 Score, ROC/AUC?
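
One possible way to build such a table is scikit-learn's `cross_validate` with several scorers at once. This is only a sketch: the synthetic data and the decision tree are stand-ins (assumptions, not the capstone setup), and it covers accuracy, precision, recall, F1, and ROC AUC but not raw TP/FP counts.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data, a stand-in for the real dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

clf = DecisionTreeClassifier(min_samples_leaf=3, random_state=42)
metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]

# 5-fold cross-validation, scoring every metric on each held-out fold
cv_results = cross_validate(clf, X, y, cv=5, scoring=metrics)

# One row per metric: mean and standard deviation across the folds
summary = {
    m: (cv_results[f"test_{m}"].mean(), cv_results[f"test_{m}"].std())
    for m in metrics
}
print(f"{'metric':<10} {'mean':>6} {'std':>6}")
for name, (mean, std) in summary.items():
    print(f"{name:<10} {mean:>6.3f} {std:>6.3f}")
```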


### Combining multiple models
* When you're building a model, you want to choose the one that performs the best according to some evaluation metric
* Ensemble methods: form a new model by combining existing ones; the combined responses of different models will likely lead to a better decision than relying on a single response.
    * the combined model often performs better than any of the individual models, or at least about as well as the best one, although this is not guaranteed
* Ensemble learning is one of the most effective techniques in machine learning
* `mlxtend` is a useful Python library for machine learning
* Main scikit-learn module: `sklearn.ensemble`


```
# Template only: MetaEstimator and Model1 ... ModelN are placeholders, not real classes
from sklearn.ensemble import MetaEstimator
# Base estimators
est1 = Model1()
est2 = Model2()
estN = ModelN()
# Meta estimator
est_combined = MetaEstimator(estimators=[est1, est2,  ..., estN], 
                # Additional parameters (specific to the ensemble method)
)
# Train and test
est_combined.fit(X_train, y_train)
pred = est_combined.predict(X_test)
```
* The best feature of the meta estimator is that it works similarly to the scikit-learn estimators you already know, with the standard methods of fit and predict
* Decision trees are the building block of many ensemble methods

RECAP DT CODE:

```
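from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# X (features) and y (target) are assumed to be loaded already
# Split into train (80%) and test (20%) sets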
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate the regressor
reg_dt = DecisionTreeRegressor(min_samples_leaf=3, min_samples_split=9, random_state=500)

# Fit to the training set
reg_dt.fit(X_train, y_train)

# Evaluate the performance of the model on the test set
y_pred = reg_dt.predict(X_test)
print('MAE: {:.3f}'.format(mean_absolute_error(y_test, y_pred)))
```

### Voting

* "Wisdom of the crowds": relying on the collective judgment of a group of individuals instead of a single expert
* Also known as **collective intelligence**
* The aggregated opinion of the crowd can be as good as, and is often better than, the answer of any individual, even that of an expert.
* Useful technique commonly applied to problem solving, decision making, innovation, and prediction (we are particularly interested in prediction)

* **Majority voting:** a technique that combines the outputs of several classifiers by predicting the class that most of them choose
    * Properties:
        * Classification problems
        * Majority voting $\Rightarrow$ Mode of individual predictions
        * **It is recommended to use an odd number of classifiers:** with two classes, an odd number rules out tied votes. Use at least **three** classifiers, and when problem constraints allow it, use five or more
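
The "mode of individual predictions" idea can be sketched in a few lines of plain Python. The helper name `majority_vote` and the prediction lists are hypothetical, chosen only to show the mechanics.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label per sample.

    predictions: one list of predicted labels per classifier,
    all of the same length (one entry per sample).
    """
    voted = []
    # zip(*predictions) groups the votes that belong to the same sample
    for sample_votes in zip(*predictions):
        label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted


# Hypothetical predictions from three classifiers for four samples
preds_a = [1, 0, 1, 1]
preds_b = [1, 1, 0, 1]
preds_c = [0, 0, 1, 1]

print(majority_vote([preds_a, preds_b, preds_c]))  # [1, 0, 1, 1]
```

With three voters and two classes there is always a strict majority, which is why an odd number of classifiers is convenient.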
        
        
* **Wise Crowd Characteristics:**
    * **The ensemble needs to be diverse:** do this by using different algorithms or different datasets.
    * **Independent and uncorrelated:** Each prediction needs to be independent and uncorrelated from the rest.
    * **Use individual knowledge:** Each model should be able to make its own prediction without relying on the other predictions
    * **Aggregate individual predictions:** Combine the individual predictions into a collective one
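
To see why independence matters, here is a small worked calculation under an idealized assumption: a binary problem in which every classifier is correct with the same probability `p` and the errors are fully independent. Real models are usually correlated, so treat the numbers as an upper bound on the benefit, not a prediction. The function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p, n_classifiers):
    """Probability that a majority of n independent classifiers is correct.

    Assumes a binary problem, equal accuracy p for every classifier,
    and an odd n_classifiers so that ties cannot occur.
    """
    min_votes = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k) * p**k * (1 - p) ** (n_classifiers - k)
        for k in range(min_votes, n_classifiers + 1)
    )


for n in (1, 3, 5):
    print(n, round(majority_vote_accuracy(0.7, n), 3))
```

Under these assumptions, three classifiers that are each 70% accurate reach 0.784 when voting, and five reach about 0.837.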
    
**NOTE:** Majority voting can only be applied to classification problems
    * For regression problems, see `VotingRegressor` in `sklearn.ensemble`, which averages the individual predictions instead of taking the mode
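
For the regression case, a minimal sketch with `VotingRegressor` might look like the following. The synthetic data and the choice of base regressors are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

reg_voting = VotingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("dt", DecisionTreeRegressor(min_samples_leaf=3, random_state=500)),
    ]
)
reg_voting.fit(X_train, y_train)
y_pred = reg_voting.predict(X_test)

# With no weights, the ensemble prediction is the mean of the fitted estimators' predictions
individual = np.array([est.predict(X_test) for est in reg_voting.estimators_])
print(np.allclose(y_pred, individual.mean(axis=0)))
print("MAE: {:.3f}".format(mean_absolute_error(y_test, y_pred)))
```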
    
```
from sklearn.ensemble import VotingClassifier
clf_voting = VotingClassifier(
       estimators=[
           ('label1', clf_1),
           ('label2', clf_2),
           ('labelN', clf_N)])
```

```
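from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Create the individual models
clf_knn = KNeighborsClassifier(5) # k=5 nearest neighbors (an odd k avoids tied, multi-modal votes in binary problems)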
clf_dt = DecisionTreeClassifier()
clf_lr = LogisticRegression()
# Create voting classifier
clf_voting = VotingClassifier(
    estimators=[
        ('knn', clf_knn),
        ('dt', clf_dt),
        ('lr', clf_lr)])
# Fit combined model to the training set and predict
clf_voting.fit(X_train, y_train)
y_pred = clf_voting.predict(X_test)
```
#### Evaluate the performance

```
from sklearn.metrics import accuracy_score

# Get the accuracy score (true labels first, then predictions)
acc = accuracy_score(y_test, y_pred)
print("Accuracy: {:0.3f}".format(acc))
```
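
The snippets above assume that `X_train`, `X_test`, `y_train`, and `y_test` already exist. A self-contained variant on synthetic data (an assumption, used only so the code runs on its own) could compare each classifier with the voting ensemble like this. The individual models are fitted here purely for the comparison.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

estimators = [
    ("knn", KNeighborsClassifier(5)),
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("lr", LogisticRegression()),
]
clf_voting = VotingClassifier(estimators=estimators)

# Fit and score each model, then the ensemble, on the same split
scores = {}
for name, clf in estimators + [("voting", clf_voting)]:
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print("{:<7} accuracy: {:.3f}".format(name, scores[name]))
```

The ensemble is not guaranteed to come out on top for a given split, so it is worth checking the comparison on your own data.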


* The main input, passed with the keyword `estimators`, is a list of (string, estimator) tuples.
* Each string is a label and each estimator is an sklearn classifier
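* **You do not need to fit the classifiers individually, as the voting classifier will take care of that.**
    * Hyperparameters are not tuned for you, though: each classifier is fitted with the settings it was created with, so any tuning (for example with `GridSearchCV`) still has to be done separately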
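
The sketch below illustrates both points under assumed settings: the voting classifier fits internal copies of the estimators, and their hyperparameters can be searched through the ensemble by addressing them as `<label>__<parameter>`. The data is synthetic and the parameter grid is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

clf_dt = DecisionTreeClassifier(random_state=42)
clf_voting = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("dt", clf_dt),
        ("lr", LogisticRegression()),
    ]
)

# Fitting the ensemble trains clones; the original clf_dt object stays unfitted
clf_voting.fit(X, y)
original_is_fitted = hasattr(clf_dt, "tree_")
clone_is_fitted = hasattr(clf_voting.named_estimators_["dt"], "tree_")
print(original_is_fitted, clone_is_fitted)

# Hypothetical grid: parameters are addressed as <label>__<parameter>
param_grid = {"knn__n_neighbors": [3, 5, 7], "dt__min_samples_leaf": [1, 3, 5]}
grid = GridSearchCV(clf_voting, param_grid, cv=3, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_)
```

Searching over the ensemble tunes the classifiers jointly for the voted prediction; tuning each one separately beforehand is a simpler alternative.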