# Conservation Bagging Support Vector Classifiers

## Introduction
This project explores an idea derived from *Conservation Random Forests*, written by Moshe Sipper and Jason H. Moore. In their paper, Sipper and Moore apply cultivation methods to decision trees and show that a significant improvement can be attained by reusing models that are already on hand. This project extends that work by evaluating the same cultivation methods when decision trees are replaced with support vector classifiers (SVCs).

## Methods
As in *Conservation Random Forests*, three methods are used to produce final ensembles: factory, super-ensemble, and lexiworkshop.

### 1. Factory
The factory method is similar to the jungle method used in *Conservation Random Forests*. With the factory method, multiple bagging classifiers that use SVCs as base estimators are trained. Then, all SVCs in each bagging classifier are added to a collection called a factory. The name "factory" arises from the fact that a factory has many machines; in this case, the machines are support vector machines. The factory performs class predictions through majority voting, where each SVC votes for a single class.
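
The factory construction described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the dataset (iris), the number of bagging classifiers (3), the ensemble size (10), and the helper name `majority_vote` are all assumptions chosen to keep the example small. Ties are broken in favour of the lowest class index, which is also just a choice made for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Train a few small bagging classifiers (sizes are illustrative only).
factory = []
for seed in range(3):
    bag = BaggingClassifier(SVC(), n_estimators=10, random_state=seed)
    bag.fit(X_train, y_train)
    # With the default max_features=1.0 every SVC sees all features,
    # so the fitted base estimators can be pooled and used directly.
    factory.extend(bag.estimators_)
    classes = bag.classes_  # identical for every bag trained on y_train


def majority_vote(estimators, X, classes):
    """Hypothetical helper: each estimator casts one vote per sample."""
    # Base estimators inside BaggingClassifier predict class *indices*,
    # so the winning index is mapped back through `classes`.
    votes = np.stack([est.predict(X) for est in estimators]).astype(int)
    winners = [np.bincount(col, minlength=len(classes)).argmax() for col in votes.T]
    return classes[np.array(winners)]


factory_pred = majority_vote(factory, X_test, classes)
accuracy = float((factory_pred == y_test).mean())
print(f"factory of {len(factory)} SVCs, accuracy = {accuracy:.3f}")
```

The key point is that the factory votes at the level of individual SVCs, regardless of which bagging classifier each one came from.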

### 2. Super-ensemble
The super-ensemble method is similar to the super-ensemble method used in *Conservation Random Forests*. With the super-ensemble method, each bagging classifier is added to a collection called a super-ensemble. The name "super-ensemble" comes from the fact that a super-ensemble collection is an ensemble of ensembles. Prediction for the super-ensemble is also made through majority voting, where each bagging classifier votes for a single class.
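
As a rough sketch of the idea, the snippet below lets whole bagging classifiers vote, rather than their individual SVCs. The dataset (wine), the number of bagging classifiers (5), their size (10), and the helper name `vote` are assumptions made for illustration and are not taken from the project's code.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The super-ensemble is simply a list of fitted bagging classifiers.
super_ensemble = [
    BaggingClassifier(SVC(), n_estimators=10, random_state=seed).fit(X_train, y_train)
    for seed in range(5)
]


def vote(column):
    """Hypothetical helper: return the most frequent label in one column of votes."""
    labels, counts = np.unique(column, return_counts=True)
    return labels[counts.argmax()]  # ties go to the smallest label


# One row per bagging classifier, one column per test sample.
member_preds = np.stack([bag.predict(X_test) for bag in super_ensemble])
super_pred = np.array([vote(col) for col in member_preds.T])
super_accuracy = float((super_pred == y_test).mean())
print(f"super-ensemble accuracy = {super_accuracy:.3f}")
```

Each bagging classifier has already aggregated its own SVCs internally, so the super-ensemble vote is a second level of aggregation on top of that.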

### 3. Lexiworkshop
The lexiworkshop method is similar to the lexigarden method used in *Conservation Random Forests*. The lexiworkshop method receives a factory (a collection of SVCs) and returns a workshop (a subset of the collection of SVCs) of a specified size. The SVCs in the workshop are selected through lexicase selection. The name "workshop" is used here because a workshop contains fewer machines relative to a factory. Like the previous methods, the lexiworkshop method performs class predictions via majority voting, where each SVC in the workshop votes for a single class.
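
For readers unfamiliar with lexicase selection, here is a minimal sketch of how a workshop could be filled. It works on a precomputed correctness matrix instead of fitted SVCs, and the function names `lexicase_select` and `lexiworkshop_sketch` are hypothetical. Two details are assumptions of this sketch, not statements about the project's `lexiworkshop()`: the same model may be selected more than once, and a case that no remaining candidate classifies correctly is skipped.

```python
import random


def lexicase_select(correct, rng):
    """Pick one model index via lexicase selection.

    correct[m][c] is True if model m classifies training case c correctly.
    """
    candidates = list(range(len(correct)))
    cases = list(range(len(correct[0])))
    rng.shuffle(cases)  # a fresh random case ordering for every selection
    for case in cases:
        if len(candidates) <= 1:
            break
        survivors = [m for m in candidates if correct[m][case]]
        if survivors:  # if nobody gets this case right, keep all candidates
            candidates = survivors
    return rng.choice(candidates)


def lexiworkshop_sketch(correct, workshop_size, seed=0):
    """Fill a workshop by repeated lexicase selection (repeats allowed here)."""
    rng = random.Random(seed)
    return [lexicase_select(correct, rng) for _ in range(workshop_size)]


# Toy correctness matrix: 4 hypothetical models evaluated on 5 training cases.
correct = [
    [True, True, False, False, True],
    [False, True, True, True, False],
    [True, False, True, False, True],
    [False, False, False, False, False],  # a model that is never correct
]
workshop = lexiworkshop_sketch(correct, workshop_size=6)
print(workshop)
```

Because every case in the toy matrix is solved by at least one model, the model that is never correct is filtered out at the first case and cannot enter the workshop. Models that are correct on different cases can still be selected under different case orderings.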


## Datasets

The datasets used in this project are a subset of those used in *Conservation Random Forests*. They were selected to form a diverse collection. Here, our datasets come from 4 different sources:

### 1. Easy
This is Scikit-learn's "easy" collection of datasets where high performance is expected. These datasets are loaded via Scikit-learn's API. The datasets used from this source are:
- [iris](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris)
- [wine](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine)
- [breast cancer](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer)
- [digits](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits)

### 2. Clf
This source refers to datasets created using Scikit-learn's [`make_classification`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification) function that "initially creates clusters of points normally distributed (std=1) about vertices of an `n_informative`-dimensional hypercube with sides of length 2*`class_sep` and assigns an equal number of clusters to each class." This project creates three different datasets using the following function calls:
- `make_classification(n_samples=500, n_features=400, n_informative=200, n_classes=4, random_state=0)`
- `make_classification(n_samples=1000, n_features=100, n_informative=90, n_classes=2, random_state=0)`
- `make_classification(n_samples=1000, n_features=300, n_informative=200, n_classes=4, random_state=0)`
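
As a quick illustration, the first call above can be run and inspected as follows. The helper `dataset_name` is hypothetical. It only mirrors the `n_samples_n_features_n_informative_n_classes` pattern that the dataset names in the results table appear to follow.

```python
import numpy as np
from sklearn.datasets import make_classification

params = dict(n_samples=500, n_features=400, n_informative=200, n_classes=4)
X, y = make_classification(random_state=0, **params)


def dataset_name(p):
    """Hypothetical helper: build a label such as '500_400_200_4'."""
    return "_".join(
        str(p[key]) for key in ("n_samples", "n_features", "n_informative", "n_classes")
    )


name = dataset_name(params)
print(name, X.shape, np.unique(y))
```

Fixing `random_state=0` makes each generated dataset reproducible across runs.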

### 3. OpenML
OpenML is a repository of over 21,000 datasets. The datasets selected from OpenML are listed below and saved in `datasets/openML/` as `.csv` files:
- [teachingAssistant](https://www.openml.org/d/1115)
- [monk-problems-2](https://www.openml.org/d/334)
- [one-hundred-plants-margin](https://www.openml.org/d/1491)

### 4. PMLB
The [Penn Machine Learning Benchmark](https://epistasislab.github.io/pmlb/) repository is a public benchmark resource to help identify the strengths and weaknesses of different machine learning techniques. This project accesses these datasets using the `pmlb` package from the Python Package Index (PyPI). The PMLB datasets used in this project are listed below and cached in `datasets/PMLB/`:
- [allrep](https://epistasislab.github.io/pmlb/profile/allrep.html)
- [biomed](https://epistasislab.github.io/pmlb/profile/biomed.html)
- [car_evaluation](https://epistasislab.github.io/pmlb/profile/car_evaluation.html)
- [cloud](https://epistasislab.github.io/pmlb/profile/cloud.html)

## Experiments

### Setup
The experimental setup for this project is very similar to that of *Conservation Random Forests*; however, the experiments here are scaled back relative to those conducted in the paper. 

The project's experiments use Scikit-learn's `SVC` (with default parameters) and `BaggingClassifier` objects. Ten replicate runs are conducted for each of the 14 datasets above. For each replicate run, 5 folds are created for 5-fold cross-validation. For each fold, the dataset is split into a training set of 4 folds and the left-out test fold, and 10 runs are conducted. Each run fits a 100-SVC bagging classifier to the training set and tests the fitted bagging classifier on the test set. In addition, all SVCs are saved into a factory, and all bagging classifiers are saved into a super-ensemble. Both the factory and the super-ensemble are tested on the test set. The factory is also used to produce workshops of sizes 100, 300, and 500, which are tested as well.
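
The loop structure of a single replicate run might look roughly like the sketch below. It is not the project's `experiment()`: the function name `replicate_run`, the use of a shuffled `KFold`, and the way seeds are derived are assumptions, and the testing of the factory, super-ensemble, and workshops is omitted. The usage line shrinks the run count and ensemble size so the example finishes quickly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold
from sklearn.svm import SVC


def replicate_run(X, y, n_folds=5, runs_per_fold=10, n_svcs=100, seed=0):
    """One replicate: mean bagging accuracy over folds, plus per-fold factory sizes."""
    fold_scores, factory_sizes = [], []
    # Shuffled, unstratified folds are an assumption of this sketch.
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        factory, super_ensemble, run_scores = [], [], []
        for run in range(runs_per_fold):
            bag = BaggingClassifier(
                SVC(), n_estimators=n_svcs, random_state=seed * 1000 + run
            )
            bag.fit(X[train_idx], y[train_idx])
            run_scores.append(bag.score(X[test_idx], y[test_idx]))
            factory.extend(bag.estimators_)  # every SVC goes into the factory
            super_ensemble.append(bag)       # every bag goes into the super-ensemble
        fold_scores.append(np.mean(run_scores))
        factory_sizes.append(len(factory))
    return float(np.mean(fold_scores)), factory_sizes


X, y = load_iris(return_X_y=True)
mean_acc, sizes = replicate_run(X, y, runs_per_fold=2, n_svcs=5)
print(mean_acc, sizes)
```

Under this reading of the setup, the full-size settings (10 runs of 100 SVCs) would give each fold a factory of 1,000 SVCs from which the workshops are drawn.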

For each of the 10 replicate runs on a dataset, the mean accuracy score across all 5 folds is saved for each method (factory, super-ensemble, lexiworkshop of size 100, lexiworkshop of size 300, and lexiworkshop of size 500) as a `.csv` file in `results/`.

### Code
#### `experiment.py`
This file contains all the functions for each experiment. The main algorithm for the experimental setup is detailed in `experiment()`. The lexiworkshop method is detailed in `lexiworkshop()`.

#### `drivers.py`
This file contains driver functions that call `experiment()` on the datasets for each source. For example, `openml_driver()` runs experiments on all the datasets from OpenML. Each driver function is called in this `.ipynb` file and run on Google Colab. The following cells run each driver function; however, there is no need to run these cells again since the results of the experiments have already been saved in `results/`.


In [1]:
# uncomment and run this cell for google colab

# from google.colab import drive
# drive.mount('/content/gdrive/')
# %cd /content/gdrive/MyDrive/COSC-247/Final\ Project/conservation-bagging-svcs

In [2]:
# install the pmlb package from PyPI using pip
%pip install pmlb

Note: you may need to restart the kernel to use updated packages.


In [3]:
# import driver functions from drivers.py
from drivers import easy_driver, clf_driver, openml_driver, pmlb_driver

In [4]:
# uncomment and run this cell to run all experiments
# however, there is no need to because the results have already been saved in results/

# easy_driver()
# clf_driver()
# openml_driver()
# pmlb_driver()

## Results

`transform_results.py` contains a function called `get_combined_results()` that compiles all the results from each experiment to display as a `pandas.DataFrame`. Each row of the `DataFrame` presents the results of running `experiment()` (10 replicate runs) on a single dataset. Each column of the `DataFrame` is detailed below:
- `source`: The source of the dataset.
- `dataset`: The name of the dataset.
- `bag_svc`: The mean accuracy score of bagging classifiers across all replicate runs.
- `factory`: The mean accuracy score of the factory method across all replicate runs.
- `super_ensemble`: The mean accuracy score of the super-ensemble method across all replicate runs.
- `workshop_100`: The mean accuracy score of the lexiworkshop method of size 100 across all replicate runs.
- `workshop_300`: The mean accuracy score of the lexiworkshop method of size 300 across all replicate runs.
- `workshop_500`: The mean accuracy score of the lexiworkshop method of size 500 across all replicate runs.

In [5]:
from transform_results import get_combined_results

get_combined_results()

| | source | dataset | bag_svc | factory | super_ensemble | workshop_100 | workshop_300 | workshop_500 |
|---|---|---|---|---|---|---|---|---|
| 0 | Easy | iris | 96.1 | 96.1 | 96.1 | 99.6 | 99.6 | 99.6 |
| 1 | Easy | wine | 98.4 | 98.4 | 98.4 | 100.0 | 100.0 | 100.0 |
| 2 | Easy | breast cancer | 97.7 | 97.7 | 97.6 | 99.1 | 99.1 | 99.1 |
| 3 | Easy | digits | 98.1 | 98.1 | 98.1 | 98.8 | 98.8 | 98.8 |
| 4 | Clf | 500_400_200_4 | 43.7 | 43.6 | 43.7 | 48.2 | 48.3 | 48.4 |
| 5 | Clf | 1000_100_90_2 | 89.6 | 89.6 | 89.5 | 92.0 | 92.1 | 92.0 |
| 6 | Clf | 1000_300_200_4 | 58.1 | 58.0 | 58.2 | 60.3 | 60.3 | 60.4 |
| 7 | OpenML | teachingAssistant | 57.7 | 58.4 | 58.2 | 69.1 | 68.7 | 69.0 |
| 8 | OpenML | monk-problems-2 | 74.0 | 74.3 | 74.0 | 80.8 | 81.2 | 81.2 |
| 9 | OpenML | one-hundred-plants-margin | 79.8 | 79.9 | 79.9 | 81.9 | 81.9 | 82.0 |
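
One way to read these numbers is to subtract the `bag_svc` baseline from every other column. The sketch below does this with `pandas`, using the ten rows shown above copied by hand. The variable names are illustrative, and in practice the `DataFrame` returned by `get_combined_results()` could presumably be used in place of the hand-built one.

```python
import pandas as pd

columns = ["dataset", "bag_svc", "factory", "super_ensemble",
           "workshop_100", "workshop_300", "workshop_500"]
rows = [
    ("iris", 96.1, 96.1, 96.1, 99.6, 99.6, 99.6),
    ("wine", 98.4, 98.4, 98.4, 100.0, 100.0, 100.0),
    ("breast cancer", 97.7, 97.7, 97.6, 99.1, 99.1, 99.1),
    ("digits", 98.1, 98.1, 98.1, 98.8, 98.8, 98.8),
    ("500_400_200_4", 43.7, 43.6, 43.7, 48.2, 48.3, 48.4),
    ("1000_100_90_2", 89.6, 89.6, 89.5, 92.0, 92.1, 92.0),
    ("1000_300_200_4", 58.1, 58.0, 58.2, 60.3, 60.3, 60.4),
    ("teachingAssistant", 57.7, 58.4, 58.2, 69.1, 68.7, 69.0),
    ("monk-problems-2", 74.0, 74.3, 74.0, 80.8, 81.2, 81.2),
    ("one-hundred-plants-margin", 79.8, 79.9, 79.9, 81.9, 81.9, 82.0),
]
df = pd.DataFrame(rows, columns=columns).set_index("dataset")

# Gain of each method over the plain bagging classifier, in accuracy points.
methods = [c for c in df.columns if c != "bag_svc"]
gain = df[methods].sub(df["bag_svc"], axis=0)
mean_gain = gain.mean()
print(mean_gain.round(2))
```

On these ten rows, every workshop column improves on `bag_svc` for every dataset, while the factory and super-ensemble columns stay within a fraction of a point of the baseline.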


## Concluding Remarks

As can be seen in the above `DataFrame`, the results of this project differ from those found in *Conservation Random Forests*. In *Conservation Random Forests*, all "methods used to produce final ensembles proved efficacious to some degree or other," with lexigarden methods taking the lead. However, in this project, only the lexiworkshop methods had better accuracy scores than those of standard bagging classifiers. In this case, while it cannot be concluded that the factory and super-ensemble methods were effective in increasing accuracy, it is evident that the lexiworkshop method was effective in doing so. This result further exemplifies the ability of lexicase selection to improve the accuracy of cultivation methods.

With regard to future work, more exploration could be done to understand why the factory and super-ensemble methods did not have better accuracy scores relative to standard bagging classifiers in this project. Additionally, other base classifiers such as logistic regression could be used to explore the generalizability of the cultivation methods across different models. Similarly, as mentioned in *Conservation Random Forests*, further research on applying these methods to regression and clustering tasks instead of classification tasks could prove to be beneficial as well.