# Conservation Bagging Support Vector Classifiers


## Project Description
For this project, I explore an idea derived from *Conservation Random Forests* by Moshe Sipper and Jason Moore. Instead of building ensembles from decision tree classifiers, I extend their work by using support vector classifiers (SVCs) as the base estimators. As in *Conservation Random Forests*, I use three methods for producing final ensembles: 1) factory, 2) super-ensemble, and 3) lexiworkshop.

With the factory method, I first train multiple bagging classifiers that use SVCs as base estimators. Then, all SVCs from every bagging classifier are added to a single collection known as the factory. (Note: the name "factory" arises from the fact that a factory has many machines; in our case, the machines are support vector machines.) The factory predicts classes through majority voting, where each SVC votes for a single class.

With the super-ensemble method, each bagging classifier is added to a super-ensemble collection, which is an ensemble of ensembles. The super-ensemble also predicts through majority voting, where each bagging classifier votes for a single class.

The lexiworkshop method receives a factory (a collection of SVCs) and returns a workshop (a subset of that collection) of a specified size, where the SVCs are chosen through lexicase selection. (Note: I use the name "workshop" here because a workshop contains fewer machines than a factory.) The authors of *Conservation Random Forests* showed that lexigarden, the method that lexiworkshop mirrors, performed best among their methods. Similarly, I would like to explore whether lexiworkshop performs better than the factory and super-ensemble methods.
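
The lexicase selection step behind lexiworkshop can be sketched as follows. This is a minimal illustration of the usual lexicase formulation, not necessarily what `experiment.py` does: the names `lexicase_select` and `lexiworkshop` and the boolean `correct` matrix are hypothetical, and I assume that SVCs are drawn with replacement, so the same SVC may appear in a workshop more than once.

```python
import numpy as np


def lexicase_select(correct, rng):
    """Pick one estimator index via lexicase selection.

    correct: boolean array of shape (n_estimators, n_cases), where
    correct[i, j] is True when estimator i classifies case j correctly.
    """
    n_estimators, n_cases = correct.shape
    candidates = np.arange(n_estimators)
    # Visit the cases in a fresh random order for every selection.
    for case in rng.permutation(n_cases):
        if len(candidates) == 1:
            break
        scores = correct[candidates, case]
        # Keep only the candidates that do best on this case. If no
        # candidate is correct, every candidate survives this step.
        candidates = candidates[scores == scores.max()]
    # Break any remaining tie at random.
    return int(rng.choice(candidates))


def lexiworkshop(correct, workshop_size, seed=0):
    """Return indices of the estimators selected for the workshop."""
    rng = np.random.default_rng(seed)
    return [lexicase_select(correct, rng) for _ in range(workshop_size)]


# Toy example: estimator 0 is correct on every case, so it survives
# every filtering step and is the only one that can be selected.
toy_correct = np.array([
    [True, True, True],
    [True, False, True],
    [False, True, True],
])
toy_workshop = lexiworkshop(toy_correct, workshop_size=5, seed=1)
```

In practice, `correct` would be built by comparing each factory SVC's predictions with the true labels of the cases used for selection.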

I replicate the experimental setup described in *Conservation Random Forests* as closely as possible; however, due to time and computation constraints, I have scaled back my experiments relative to those conducted in the paper. My experiments use Scikit-learn’s `SVC` (with default parameters) and `BaggingClassifier` classes. For each replicate experiment, I split the dataset into 5 folds for 5-fold cross-validation. Each fold in turn serves as the test set, with the remaining 4 folds forming the training set. Ten runs are conducted for each fold. Each run fits a 100-SVC bagging classifier to the training set and tests the fitted bagging classifier on the test set. In addition, all SVCs from the 10 runs are saved into a factory, and all 10 bagging classifiers are saved into a super-ensemble. The factory and super-ensemble are also tested on the test set, and the factory is used to produce workshops of sizes 100, 300, and 500, which are tested as well. As time permits, I will run the experiments on as many of the datasets used in *Conservation Random Forests* as possible.
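
The per-fold procedure can be sketched roughly as below. This is an illustration under stated assumptions, not the code in `experiment.py`: the helpers `majority_vote` and `run_fold` are hypothetical, labels are assumed to be integers `0..k-1`, ties in the vote go to the lowest class index, and the fold is drawn with a shuffled `KFold`. The example uses 2 runs of 5 SVCs instead of 10 runs of 100 SVCs so that it finishes quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold
from sklearn.svm import SVC


def majority_vote(estimators, X, n_classes):
    """Predict by letting every estimator cast one vote per sample."""
    # votes has shape (n_estimators, n_samples).
    votes = np.stack([est.predict(X) for est in estimators]).astype(int)
    # counts has shape (n_classes, n_samples).
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)


def run_fold(X_train, y_train, X_test, y_test, n_runs=10, n_svcs=100, seed=0):
    """Fit n_runs bagging classifiers, then score the factory and super-ensemble."""
    factory, super_ensemble, bag_scores = [], [], []
    for run in range(n_runs):
        # The base estimator is passed positionally because its keyword
        # name differs between scikit-learn versions.
        bag = BaggingClassifier(SVC(), n_estimators=n_svcs, random_state=seed + run)
        bag.fit(X_train, y_train)
        bag_scores.append(bag.score(X_test, y_test))
        factory.extend(bag.estimators_)  # every fitted SVC joins the factory
        super_ensemble.append(bag)       # every bagging classifier joins the super-ensemble
    n_classes = len(np.unique(y_train))
    factory_score = np.mean(majority_vote(factory, X_test, n_classes) == y_test)
    super_score = np.mean(majority_vote(super_ensemble, X_test, n_classes) == y_test)
    return bag_scores, factory_score, super_score


X, y = make_classification(random_state=0)
train_idx, test_idx = next(KFold(n_splits=5, shuffle=True, random_state=0).split(X, y))
bag_scores, factory_score, super_score = run_fold(
    X[train_idx], y[train_idx], X[test_idx], y[test_idx], n_runs=2, n_svcs=5
)
```

The factory votes with individual SVCs, while the super-ensemble votes with whole bagging classifiers, so the two can disagree even though they are built from the same fitted models.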

As in *Conservation Random Forests*, I plan to report the mean accuracy of each ensemble method across the 10 replicate experiments for each dataset. The results table will have one row per dataset. Each row will contain the dataset name, the number of samples, the number of features, the number of informative features (when known), the number of target classes, the mean accuracy of the SVC bagging classifier on the test set across all replicate experiments (with standard deviation in parentheses), and the corresponding results for workshops of 100, 300, and 500 SVCs, for a factory of 1,000 SVCs, and for a super-ensemble of 10 SVC bagging classifiers.
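
A table cell in the "mean (standard deviation)" form could be produced with a small helper like the one below. The function name `summarize` is hypothetical, the sample standard deviation (`ddof=1`) is my assumption, and the scores in the example are placeholders for illustrating the format, not experimental results.

```python
import numpy as np


def summarize(scores, digits=3):
    """Format a list of accuracy scores as 'mean (std)'."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    std = scores.std(ddof=1)  # sample standard deviation (an assumption)
    return f"{mean:.{digits}f} ({std:.{digits}f})"


# Placeholder scores, used only to show the output format.
cell = summarize([0.9, 0.8])
```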

## Current Status
So far, I have finished writing the code needed to conduct the experiments. What remains is to write code that stores the results of each experiment in a dataset; for now, the results are only printed as output. I also need to obtain more datasets and transform them so that they can be used in my experiments. The code for the experiments is in `experiment.py`, and the datasets that I use are stored in `datasets/`.
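
One possible way to store results instead of printing them is to collect one record per score and write the records out as CSV. The sketch below is only a suggestion: the helpers `log_result` and `save_results` and the column names are hypothetical, and the two scores are taken from the `make_classification` output shown later in this document (rep 1, fold 1).

```python
import io

import pandas as pd


def log_result(records, dataset, rep, fold, method, score):
    """Append one score to the running list of result records."""
    records.append(
        {"dataset": dataset, "rep": rep, "fold": fold, "method": method, "score": score}
    )


def save_results(records, path_or_buf):
    """Write all records as CSV to a file path or a file-like buffer."""
    df = pd.DataFrame(records)
    df.to_csv(path_or_buf, index=False)
    return df


records = []
log_result(records, "make_classification", 1, 1, "factory", 0.9)
log_result(records, "make_classification", 1, 1, "workshop_100", 0.95)

# An in-memory buffer stands in for a real file path here.
buffer = io.StringIO()
results = save_results(records, buffer)
```

A long format like this keeps one row per score, which makes it straightforward to group by dataset and method later.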

## Code

The following code runs 2 replicate experiments on 2 different datasets. Only 2 replicate experiments are conducted below for demonstrative purposes. The first dataset is constructed from `sklearn`'s `make_classification` function using default parameters. The second dataset is OpenML's [teachingAssistant](https://www.openml.org/d/1115) dataset. Datasets are standardized and labels are encoded as necessary.

The following list describes the output when running `experiment(X, y)`:
- `rep`: the current replicate experiment
- `fold`: the current fold for 5-fold cross-validation
- `run`: the current run for the current fold
- `bag_svc_score`: the accuracy of the run's 100-SVC bagging classifier on the test fold
- `factory_score`: the factory's accuracy on the test fold
- `super_ensemble_score`: the super-ensemble's accuracy on the test fold
- `workshop_100_score`: the size-100 workshop's accuracy on the test fold
- `workshop_300_score`: the size-300 workshop's accuracy on the test fold
- `workshop_500_score`: the size-500 workshop's accuracy on the test fold



In [1]:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from experiment import experiment


### `make_classification` dataset

In [2]:
X, y = make_classification(random_state=0)
experiment(X, y, n_repl=2)

rep: 1
	fold: 1
		run: 1
			bag_svc_score: 0.9
		run: 2
			bag_svc_score: 0.9
		run: 3
			bag_svc_score: 0.9
		run: 4
			bag_svc_score: 0.9
		run: 5
			bag_svc_score: 0.9
		run: 6
			bag_svc_score: 0.9
		run: 7
			bag_svc_score: 0.9
		run: 8
			bag_svc_score: 0.9
		run: 9
			bag_svc_score: 0.9
		run: 10
			bag_svc_score: 0.9
		factory_score: 0.9
		super_ensemble_score: 0.9
		workshop_100_score: 0.95
		workshop_300_score: 0.95
		workshop_500_score: 0.95
	fold: 2
		run: 1
			bag_svc_score: 0.75
		run: 2
			bag_svc_score: 0.75
		run: 3
			bag_svc_score: 0.75
		run: 4
			bag_svc_score: 0.75
		run: 5
			bag_svc_score: 0.75
		run: 6
			bag_svc_score: 0.75
		run: 7
			bag_svc_score: 0.75
		run: 8
			bag_svc_score: 0.75
		run: 9
			bag_svc_score: 0.75
		run: 10
			bag_svc_score: 0.75
		factory_score: 0.75
		super_ensemble_score: 0.75
		workshop_100_score: 0.85
		workshop_300_score: 0.85
		workshop_500_score: 0.85
	fold: 3
		run: 1
			bag_svc_score: 0.9
		run: 2
			bag_svc_score: 0.9
		run: 3
	

### teachingAssistant dataset

In [3]:
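# Load the dataset; the last column holds the class label
df = pd.read_csv('datasets/teachingAssistant.csv')
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Standardize the features and encode the labels as integers
X = StandardScaler().fit_transform(X)
y = LabelEncoder().fit_transform(y)

experiment(X, y, n_repl=2)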

rep: 1
	fold: 1
		run: 1
			bag_svc_score: 0.6129032258064516
		run: 2
			bag_svc_score: 0.6129032258064516
		run: 3
			bag_svc_score: 0.6129032258064516
		run: 4
			bag_svc_score: 0.6129032258064516
		run: 5
			bag_svc_score: 0.6129032258064516
		run: 6
			bag_svc_score: 0.6129032258064516
		run: 7
			bag_svc_score: 0.6129032258064516
		run: 8
			bag_svc_score: 0.6129032258064516
		run: 9
			bag_svc_score: 0.6129032258064516
		run: 10
			bag_svc_score: 0.6129032258064516
		factory_score: 0.6129032258064516
		super_ensemble_score: 0.6129032258064516
		workshop_100_score: 0.7419354838709677
		workshop_300_score: 0.7419354838709677
		workshop_500_score: 0.7419354838709677
	fold: 2
		run: 1
			bag_svc_score: 0.5333333333333333
		run: 2
			bag_svc_score: 0.5333333333333333
		run: 3
			bag_svc_score: 0.5333333333333333
		run: 4
			bag_svc_score: 0.5333333333333333
		run: 5
			bag_svc_score: 0.5333333333333333
		run: 6
			bag_svc_score: 0.5333333333333333
		run: 7
			bag_svc_score: 0.5333333