# Non-Boosting Methods

This notebook contains additional problems on non-boosting ensemble learning methods. The material corresponds to the following lectures:
- `Lectures/Supervised Learning/Ensemble Learning/1. What is Ensemble Learning`,
- `Lectures/Supervised Learning/Ensemble Learning/2. Random Forests`,
- `Lectures/Supervised Learning/Ensemble Learning/3. Bagging and Pasting` and
- `Lectures/Supervised Learning/Ensemble Learning/8. Voter Models`.

In [None]:
## For data handling
import pandas as pd
import numpy as np

## For plotting
import matplotlib.pyplot as plt
import seaborn as sns

## This sets the plot style
## to have a grid on a white background
sns.set_style("whitegrid")

##### 1. 

If you have trained five different models on the exact same training data, and they all achieve $95\%$ precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?

##### Write here




##### 2. Voter Model Regression

While we implemented a voter model for a classification problem, voter models can also be used for regression. In this setting the voter model makes a prediction by taking a weighted average of its constituent regressors' predictions (uniform weights are the default).
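As a worked version of the weighted average just described (the notation here is an assumption of this note, not taken from the lectures): with $M$ constituent regressors $\hat{f}_1, \dots, \hat{f}_M$ and nonnegative weights $w_1, \dots, w_M$, the voter prediction at an input $x$ can be written as

$$
\hat{f}_{\text{voter}}(x) = \frac{\sum_{m=1}^{M} w_m \hat{f}_m(x)}{\sum_{m=1}^{M} w_m},
$$

which reduces to the plain mean $\frac{1}{M}\sum_{m=1}^{M} \hat{f}_m(x)$ when all weights are equal.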

In `sklearn` this is done with `VotingRegressor` <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingRegressor.html">https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingRegressor.html</a>.

Load in the `baseball_run_diff.csv` data set from the `Data` folder. Build a `VotingRegressor` model to predict `W` using `RD`. Let your constituent models be simple linear regression, $k$-nearest neighbors using $k=10$, and an extra trees regressor with `max_depth=4`. Plot the predictions of the voter model and of each individual constituent model on top of the training data. In addition, create a validation set and report the performance of the voter model and of each individual constituent model on it.
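To make the mechanics concrete before you start, here is a minimal sketch of a `VotingRegressor` with the three constituent model types named above. It is fit on synthetic stand-in data, not the baseball data; the names `X_demo`, `y_demo`, `X_grid`, and the estimator labels `"slr"`, `"knn"`, `"et"` are assumptions made for illustration. It only checks that, with uniform weights, the voter prediction is the average of the fitted constituents' predictions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption, NOT the baseball data):
# a single feature and a noisy linear target
rng = np.random.default_rng(440)
X_demo = rng.uniform(-300, 300, size=(200, 1))
y_demo = 81 + 0.1 * X_demo[:, 0] + rng.normal(0, 4, size=200)

# The voter takes a list of (name, estimator) pairs
voter = VotingRegressor(
    [
        ("slr", LinearRegression()),
        ("knn", KNeighborsRegressor(n_neighbors=10)),
        ("et", ExtraTreesRegressor(max_depth=4, random_state=440)),
    ]
)
voter.fit(X_demo, y_demo)

# With the default uniform weights, the voter prediction is the
# plain average of the fitted constituent models' predictions
X_grid = np.linspace(-300, 300, 50).reshape(-1, 1)
voter_pred = voter.predict(X_grid)
manual_avg = np.mean([est.predict(X_grid) for est in voter.estimators_], axis=0)

print(np.allclose(voter_pred, manual_avg))
```

The fitted constituents are stored in `voter.estimators_`, so each one can be plotted or scored on its own without refitting.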

In [None]:
## load the data here
baseball = pd.read_csv("../../../Data/baseball_run_diff.csv")

In [None]:
## make a train test split
from sklearn.model_selection import train_test_split

baseball_train, baseball_test = train_test_split(baseball.copy(), 
                                                    shuffle=True,
                                                    random_state=314,
                                                    test_size=.2)

## make a validation set
baseball_train_train, baseball_val = train_test_split(baseball_train.copy(), 
                                                        shuffle=True,
                                                        random_state=13241,
                                                        test_size=.2)

In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



##### 3. Bagging/Pasting Regression

As in problem 2, we can use bagging/pasting models for regression as well. Here the base estimator is a regression model, and the ensemble's prediction is the average of the predictions of all `n_estimators` fitted copies. In `sklearn` this is performed with `BaggingRegressor`, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html">https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html</a>, where, just like with `BaggingClassifier`, the `bootstrap` argument determines whether you perform bagging or pasting.

Build a bagging regression model on the baseball data using a $k$NN regressor with $k=5$ as the base estimator. Plot the training data, the $k$NN regression prediction, and the bagging prediction on the same plot. Use `n_estimators=5`, `bootstrap=True` and `max_samples = int(.25*len(baseball_train))`.


<i>Would using a bagging regressor introduce more bias or variance to your model?</i>
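The following is a minimal sketch of the `BaggingRegressor` setup described above, again on synthetic stand-in data rather than the baseball data (`X_demo`, `y_demo`, `X_grid`, and `n_sub` are names assumed for illustration). It shows that the bagging prediction is the average over the fitted copies of the base estimator; it does not answer the bias/variance question.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption, NOT the baseball data)
rng = np.random.default_rng(314)
X_demo = rng.uniform(-300, 300, size=(200, 1))
y_demo = 81 + 0.1 * X_demo[:, 0] + rng.normal(0, 4, size=200)

# Each copy of the base estimator sees 25% of the training rows,
# drawn with replacement because bootstrap=True (bagging)
n_sub = int(.25 * len(X_demo))
bag = BaggingRegressor(
    KNeighborsRegressor(n_neighbors=5),
    n_estimators=5,
    bootstrap=True,
    max_samples=n_sub,
    random_state=314,
)
bag.fit(X_demo, y_demo)

# The ensemble prediction is the average of the copies' predictions
X_grid = np.linspace(-300, 300, 50).reshape(-1, 1)
bag_pred = bag.predict(X_grid)
manual_avg = np.mean([est.predict(X_grid) for est in bag.estimators_], axis=0)

print(np.allclose(bag_pred, manual_avg))
```

Setting `bootstrap=False` in the same call would draw the subsets without replacement, which is pasting.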

In [None]:
## code here




In [None]:
## code here




In [None]:
## code here




In [None]:
## code here




In [None]:
## code here




##### 4. Introducing MNIST

The MNIST dataset is a database of handwritten digits, with $60{,}000$ training images and $10{,}000$ test images. It has been used to develop computer vision algorithms for recognizing handwritten digits, and it is a common dataset for teaching classification.

We will first go through the data together, then you will build a voting classifier on the data.

In [None]:
## Load the data, this can take a while
## For speed, we'll use the sklearn version of the data,
## which is a smaller data set with a lower resolution
## than the original data set
from sklearn.datasets import load_digits

X,y = load_digits(return_X_y=True)

In [None]:
# Each observation contains the grayscale values (integers from 0 to 16)
# for an 8 x 8 grid, flattened into 64 features
X[0,:]

In [None]:
sns.set_style("white")

fig,ax = plt.subplots(2,5,figsize=(20,8))

for i in range(10):
    ax[i//5,i%5].imshow(X[i,:].reshape(8, 8), cmap='gray_r')
    ax[i//5,i%5].text(.1,.1,str(y[i]),fontsize=16)

plt.show()

In [None]:
## I'll scale X for you prior to the split. The sklearn digits
## have pixel values from 0 to 16, so dividing by 16 puts
## each value in the range of 0 to 1
X = X/16

## Perform the split here
## set aside 20% for the test set
## stratify on y
X_train,X_test,y_train,y_test = train_test_split(X,y,
                                                    test_size = .2,
                                                    shuffle = True,
                                                    random_state=440,
                                                    stratify=y) 

Use cross-validation on the training set to find the number of neighbors for a $k$NN classifier that produces the highest mean cross-validation accuracy.
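One possible way to organize this search is sketched below. It reloads the digits so that it runs on its own; the candidate values in `neighbor_grid`, the choice of five stratified folds, and the variable names are all assumptions, not requirements of the problem.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Reload the digits so the sketch is self-contained. Raw pixel values
# are used here; a uniform rescaling would not change which
# neighbors are nearest.
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_d, y_d, test_size=.2, shuffle=True, random_state=440, stratify=y_d
)

# Hypothetical candidate values and fold count
neighbor_grid = [1, 3, 5, 7, 9, 11]
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=440)

# Mean CV accuracy on the training set for each candidate k
mean_accs = {}
for k in neighbor_grid:
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=kfold, scoring="accuracy"
    )
    mean_accs[k] = scores.mean()

best_k = max(mean_accs, key=mean_accs.get)
print(best_k, mean_accs[best_k])
```

Passing the same `kfold` object for every `k` keeps the folds identical across candidates, so the mean accuracies are directly comparable.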

In [None]:
## code here




In [None]:
## code here




In [None]:
## code here




In [None]:
## code here




Perform cross-validation to select the optimal value for `max_depth` for a `RandomForestClassifier`.
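A hedged sketch of the same idea using `GridSearchCV` follows. The depth candidates in `depth_grid`, the reduced `n_estimators=50` (chosen only to keep the run short), and the variable names are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Reload the digits so the sketch is self-contained
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_d, y_d, test_size=.2, shuffle=True, random_state=440, stratify=y_d
)

# Hypothetical candidate depths
depth_grid = [2, 4, 8, 12]

# GridSearchCV fits one forest per depth per fold and records
# the mean CV accuracy for each depth
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=440),
    param_grid={"max_depth": depth_grid},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=440),
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_depth = search.best_params_["max_depth"]
print(best_depth, search.best_score_)
```

The per-depth mean accuracies are available in `search.cv_results_["mean_test_score"]` if you want to plot them against `max_depth`.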

In [None]:
## code here




In [None]:
## code here




In [None]:
## code here




In [None]:
## code here




Now build a voter model with KNN using the number of neighbors you just found, a random forest model using the maximum depth you just found and a linear discriminant analysis model. Find the cross-validation accuracy for the voter model and each model individually.
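The comparison could be set up along the following lines. This is a sketch under stated assumptions: `n_neighbors=3` and `max_depth=8` are placeholders for the values you found above, hard voting is one of the two available `voting` options, and the names `base_models`, `models`, and `cv_acc` are made up for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Reload the digits so the sketch is self-contained
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_d, y_d, test_size=.2, shuffle=True, random_state=440, stratify=y_d
)

# Placeholder hyperparameters; substitute the values you selected
base_models = [
    ("knn", KNeighborsClassifier(n_neighbors=3)),
    ("rf", RandomForestClassifier(max_depth=8, n_estimators=50, random_state=440)),
    ("lda", LinearDiscriminantAnalysis()),
]

# The voter clones and refits its constituents, so the same
# (name, estimator) pairs can also be scored individually
models = dict(base_models)
models["voter"] = VotingClassifier(base_models, voting="hard")

# Mean CV accuracy for each individual model and for the voter
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=440)
cv_acc = {
    name: cross_val_score(model, X_tr, y_tr, cv=kfold, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, acc in cv_acc.items():
    print(name, round(acc, 4))
```

Using one `kfold` object for all four models means every model is scored on the same splits.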

In [None]:
## code here




In [None]:
## code here




In [None]:
## code here




--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).