# Stratified Splitting

This notebook provides several tutorials on how to use the algorithms in the
**straSplit** package to split a multi-label dataset with the less-explored
[stratified strategy](https://bit.ly/3s3IDA8). Please install the
[anaconda](https://www.anaconda.com/) package and the other modules listed
in the [requirements.txt](../../requirements.txt) file.

## Load modules and datasets

First, let us change the working directory to the `model` folder.

In [1]:
import os
os.chdir('../model')
os.getcwd()

'D:\\MultiLabel\\straSplit\\src\\model'

Also, load the following modules to run the algorithms introduced in this notebook.

In [2]:
import pickle as pkl
import pandas as pd
from IPython.display import HTML, display

## load utilities
from utils import DATASET_PATH, RESULT_PATH, data_properties
from utils import check_type, custom_shuffle, LabelBinarizer

## load modules
from naive2split import NaiveStratification
from iterative2split import IterativeStratification
from extreme2split import ExtremeStratification
from plssvd2split import ClusterStratification
from eigencluster2split import ClusteringEigenStratification
from comm2split import CommunityStratification
from enhance2split import LabelEnhancementStratification
from active2split import ActiveStratification
from gan2split import GANStratification

## Set dataframe to maxwidth
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', None)  # or 199

Make sure that **DATASET_PATH** (the dataset folder) and **RESULT_PATH** (the results folder, used for outputs such as dataset properties) are non-empty and set appropriately in the [utils.py](utils.py) module.

Now, let us assign values to the following arguments:

In [3]:
split_type = "extreme"
split_size = 0.80
num_epochs = 10
num_jobs = 2
use_solver = False

where `use_solver` is only applicable to the `active2split` module. It determines whether to use the sklearn-based optimization algorithm ([SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)) or the custom-built optimization algorithm. `split_type` accepts only `"extreme"`, `"iterative"`, and `"naive"`, while `split_size` is the proportion of the dataset to include in the training set.
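As a rough illustration of what `split_size` implies, the sketch below checks the two arguments and computes the partition sizes one would expect from a given proportion. The helper `expected_partition_sizes` and the constant `ALLOWED_SPLIT_TYPES` are hypothetical names introduced here, not part of straSplit, and the rounding rule is an assumption; the algorithms below may not hit these sizes exactly.

```python
# Hypothetical helper (not part of straSplit): sanity-check the configuration
# and estimate partition sizes, assuming simple rounding.
ALLOWED_SPLIT_TYPES = {"extreme", "iterative", "naive"}


def expected_partition_sizes(num_examples, split_size, split_type="naive"):
    """Return (num_train, num_test) expected for a given split proportion."""
    if split_type not in ALLOWED_SPLIT_TYPES:
        raise ValueError(f"split_type must be one of {sorted(ALLOWED_SPLIT_TYPES)}")
    if not 0.0 < split_size < 1.0:
        raise ValueError("split_size must lie strictly between 0 and 1")
    num_train = int(round(num_examples * split_size))
    return num_train, num_examples - num_train


sizes = expected_partition_sizes(351, 0.80)
print(sizes)
```

For 351 examples and `split_size = 0.80`, the product is 280.8, so this rounding gives 281 training and 70 test examples.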

Let us use the well-known "birds" multi-label data. Once the examples without any label are dropped (see the cell below), the data comprises 351 examples, each a feature vector in $\mathbb{R}^{260}$, and 19 distinct classes.

In [4]:
dsname = "birds"
X_name = dsname + "_X.pkl"
y_name = dsname + "_y.pkl"
file_path = os.path.join(DATASET_PATH, y_name)
with open(file_path, mode="rb") as f_in:
    y = pkl.load(f_in)
    ## keep only the examples that have at least one label
    idx = list(set(y.nonzero()[0]))
    y = y[idx]

file_path = os.path.join(DATASET_PATH, X_name)
with open(file_path, mode="rb") as f_in:
    X = pkl.load(f_in)
    X = X[idx]

print("Size of the data: ", X.shape)
print("Label size of the data: ", y.shape[1])

Size of the data:  (351, 260)
Label size of the data:  19


Using the above data and configuration arguments, we will show some interesting outcomes for each splitting strategy.

**We note that our discussions are primarily focused on the "birds" dataset and are not necessarily extensible to other datasets.**

## Naive approach
This strategy was proposed in the [paper](https://doi.org/10.1371/journal.pcbi.1008174) and integrated into the [mlLGPR](https://github.com/hallamlab/mlLGPR) software for pathway prediction. It is an iterative procedure: the algorithm first selects a label, independently of the others, and finds the examples associated with it. It then splits these examples according to the `split_size` parameter and assigns them to the training and test sets. If an example has already been added to the training or test set, it is skipped, and the process continues by selecting another label at random. The process iterates until all examples are consumed. Because the same example can carry multiple labels, the final partition may not match the `split_size` proportion exactly. Although simple, this approach scales to large datasets. However, it suffers from the class imbalance problem and, being naive, does not consider label correlations when splitting a dataset.
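The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions (a dense binary label matrix, a per-label cut rounded up, no batching or parallel workers); the function `naive_label_split` is a hypothetical name and is not the `NaiveStratification` implementation.

```python
import numpy as np


def naive_label_split(y, split_size=0.8, seed=0):
    """Label-by-label split of a dense binary label matrix (examples x labels)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    assigned = np.zeros(y.shape[0], dtype=bool)
    train_idx, test_idx = [], []
    # Visit the labels in random order, each independently of the others.
    for label in rng.permutation(y.shape[1]):
        # Examples carrying this label that no earlier label has claimed.
        candidates = np.flatnonzero((y[:, label] > 0) & ~assigned)
        if candidates.size == 0:
            continue
        rng.shuffle(candidates)
        # Assumed rounding rule: round the training share up.
        cut = int(np.ceil(split_size * candidates.size))
        train_idx.extend(candidates[:cut].tolist())
        test_idx.extend(candidates[cut:].tolist())
        assigned[candidates] = True
    return sorted(train_idx), sorted(test_idx)


# Toy data (hypothetical): random labels, keeping only labelled examples.
toy_rng = np.random.default_rng(42)
y_toy = (toy_rng.random((200, 6)) < 0.3).astype(int)
y_toy = y_toy[y_toy.sum(axis=1) > 0]

train_idx, test_idx = naive_label_split(y_toy, split_size=0.8)
print(len(train_idx), len(test_idx))
```

Because every label group is cut separately and rounded, the overall training share can drift away from `split_size`, which mirrors the caveat in the description above.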

To see the results using this algorithm, you may run the following command.

In [5]:
st = NaiveStratification(shuffle=True, split_size=split_size, batch_size=500,
                         num_jobs=num_jobs)
training_idx, test_idx = st.fit(y=y)

## Configuration parameters to naive based stratified multi-label dataset
   splitting:
		1. Shuffle the dataset? True
		2. Split size: 0.8
		3. Number of examples to use in each iteration: 500
		4. Number of parallel workers: 2


	>> Perform splitting...
		--> Splitting progress: 100.00%...


where *training_idx* and *test_idx* are two lists of example indices into the given dataset.
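Index lists of this form can be used to slice the feature and label matrices. The snippet below is a small sketch on toy NumPy arrays; the arrays and the index lists `toy_training_idx` and `toy_test_idx` are hypothetical stand-ins, not output of the package.

```python
import numpy as np

# Toy stand-ins for the feature and label matrices (hypothetical data).
X_toy = np.arange(12, dtype=float).reshape(6, 2)
y_toy = np.array([[1, 0], [0, 1], [1, 1], [1, 0], [0, 1], [1, 1]])

# Hypothetical index lists, in the same form as training_idx and test_idx.
toy_training_idx = [0, 1, 2, 4]
toy_test_idx = [3, 5]

# One would expect the two lists to cover every example exactly once.
assert set(toy_training_idx).isdisjoint(toy_test_idx)
assert len(toy_training_idx) + len(toy_test_idx) == X_toy.shape[0]

# Row-wise indexing with a list of integers selects the matching examples.
X_train, y_train = X_toy[toy_training_idx], y_toy[toy_training_idx]
X_test, y_test = X_toy[toy_test_idx], y_toy[toy_test_idx]
print(X_train.shape, X_test.shape)
```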

Let us explore some properties of the resulting training and test sets.

In [6]:
model_name = "naive2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=False)
df

| Properties for birds | Complete set | Training set | Test set |
|---|---|---|---|
| Number of examples | 351.0 | 281.0 | 70.0 |
| Number of labels | 654.0 | 524.0 | 130.0 |
| Label cardinality | 1.863248 | 1.864769 | 1.857143 |
| Label density | 0.002849 | 0.003559 | 0.014286 |
| Distinct label sets | 19.0 | 19.0 | 19.0 |
| Frequency of distinct label sets | 0.054131 | 0.067616 | 0.271429 |
| Labels having less than or equal to 5 examples | 0.0 | 1.0 | 8.0 |
| Labels having more than 6 examples | 19.0 | 18.0 | 11.0 |
| KL difference between complete and data partition | 0.0 | 0.000883 | 0.016654 |


where *Label cardinality* is the mean number of labels associated with an example, *Label density* is the cardinality divided by the number of labels, *Distinct label sets* is the number of distinct labels in the data, *Frequency of distinct label sets* is the number of appearances of distinct labels divided by the total number of examples, and the *KL* (Kullback-Leibler divergence) difference between the complete data and a data partition measures the difference between their two probability distributions; a low value indicates that the partition closely resembles the label distribution of the complete data.
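To make these definitions concrete, here is a minimal sketch of three of the measures on a toy label matrix. It is not the `data_properties` implementation: the function names are hypothetical, density is assumed to divide by the number of label columns, and the KL term is assumed to compare normalized per-label frequencies, so the values reported in the table may be computed differently.

```python
import numpy as np


def label_cardinality(y):
    """Mean number of labels per example."""
    return float(np.asarray(y).sum(axis=1).mean())


def label_density(y):
    """Cardinality divided by the number of label columns (assumed reading)."""
    y = np.asarray(y)
    return label_cardinality(y) / y.shape[1]


def kl_difference(y_full, y_part, eps=1e-12):
    """KL(p || q) between normalized per-label frequencies (assumed form)."""
    p = np.asarray(y_full).sum(axis=0).astype(float) + eps
    q = np.asarray(y_part).sum(axis=0).astype(float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))


# Toy label matrix (hypothetical): 4 examples, 3 labels.
y_full = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_part = y_full[[0, 2]]  # a partition that over-represents the first label

print(label_cardinality(y_full), label_density(y_full))
print(kl_difference(y_full, y_full), kl_difference(y_full, y_part))
```

A partition identical to the complete data gives a KL value of zero, and the value grows as the partition's label frequencies drift from those of the complete data.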

From the table, we observe that both the training and test sets closely approximate the complete data with regard to the KL metric.

Let us plot the resulting data in terms of the frequency of examples for each label.

In [7]:
chart

The above chart confirms our observation from the table.

## Iterative approach

This is a modified version of the algorithm from the [paper](https://bit.ly/2QqHd4V), which splits the dataset iteratively. The algorithm starts by calculating the desired number of examples and the proportions for the training and test sets. It then estimates the desired number of examples of each label in each partition. Next, the algorithm iteratively examines one label at a time, starting with the label that has the fewest examples. For each example of this label, the algorithm selects the appropriate partition. Once the partition is selected, we add the example to it and decrement the number of desired examples for each label of this example, as well as the total number of desired examples for that partition.
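The steps above can be sketched as follows for two partitions. This is a simplified illustration, assuming a dense binary label matrix and a tie-breaking rule (overall need first, then random); `iterative_split` is a hypothetical function and not the `IterativeStratification` implementation.

```python
import numpy as np


def iterative_split(y, split_size=0.8, seed=0):
    """Two-way iterative stratification sketch for a dense binary label matrix."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = y.shape[0]
    ratios = np.array([split_size, 1.0 - split_size])
    desired = ratios * n                                 # examples wanted per partition
    desired_per_label = np.outer(ratios, y.sum(axis=0))  # per partition and label
    remaining = np.ones(n, dtype=bool)
    parts = [[], []]
    while remaining.any():
        counts = y[remaining].sum(axis=0)
        if counts.max() == 0:
            # Leftover examples without labels go wherever overall need is largest.
            for i in np.flatnonzero(remaining):
                k = int(np.argmax(desired))
                parts[k].append(int(i))
                desired[k] -= 1
            break
        # Pick the label with the fewest (but at least one) remaining examples.
        label = int(np.argmin(np.where(counts > 0, counts, np.inf)))
        for i in np.flatnonzero(remaining & (y[:, label] > 0)):
            need = desired_per_label[:, label]
            best = np.flatnonzero(need == need.max())
            if best.size > 1:  # tie: prefer the partition needing more examples
                overall = desired[best]
                best = best[overall == overall.max()]
            k = int(rng.choice(best))
            parts[k].append(int(i))
            remaining[i] = False
            desired_per_label[k] -= y[i]  # decrement every label of this example
            desired[k] -= 1
    return sorted(parts[0]), sorted(parts[1])


# Toy data (hypothetical): random labels, keeping only labelled examples.
toy_rng = np.random.default_rng(7)
y_demo = (toy_rng.random((200, 6)) < 0.3).astype(int)
y_demo = y_demo[y_demo.sum(axis=1) > 0]

train_part, test_part = iterative_split(y_demo, split_size=0.8)
print(len(train_part), len(test_part))
```

Processing the rarest label first gives scarce labels priority while both partitions still have room for them.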

Let us apply this algorithm.

In [8]:
st = IterativeStratification(shuffle=True, split_size=split_size)
training_idx, test_idx = st.fit(y=y)

## Configuration parameters to iteratively stratifying a multi-label
   dataset splitting:
		1. Shuffle the dataset? True
		2. Split size: 0.8


	>> Perform splitting (iterative)...
		--> Splitting progress: 100.00%...

In [9]:
model_name = "iterative2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=False)
df

| Properties for birds | Complete set | Training set | Test set |
|---|---|---|---|
| Number of examples | 351.0 | 279.0 | 72.0 |
| Number of labels | 654.0 | 521.0 | 133.0 |
| Label cardinality | 1.863248 | 1.867384 | 1.847222 |
| Label density | 0.002849 | 0.003584 | 0.013889 |
| Distinct label sets | 19.0 | 19.0 | 19.0 |
| Frequency of distinct label sets | 0.054131 | 0.0681 | 0.263889 |
| Labels having less than or equal to 5 examples | 0.0 | 1.0 | 9.0 |
| Labels having more than 6 examples | 19.0 | 18.0 | 10.0 |
| KL difference between complete and data partition | 0.0 | 0.001513 | 0.02422 |


In [10]:
chart

As can be observed, this algorithm produces less optimal results (in terms of the KL metric) than the naive approach on both the training and test sets.

## Stratifying XML data approach

This strategy, which targets extreme multi-label (XML) data, is described in the [paper](https://arxiv.org/pdf/2103.03494.pdf).


In [None]:
st = ExtremeStratification(swap_probability=0.1, threshold_proportion=0.1, decay=0.1,
                           shuffle=True, split_size=split_size, num_epochs=num_epochs)
training_idx, test_idx = st.fit(X=X, y=y)

In [None]:
model_name = "extreme2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=False)
df

In [None]:
chart

## Clustering based strategy


In [None]:
st = ClusterStratification(num_clusters=5, swap_probability=0.1, threshold_proportion=0.1,
                           decay=0.1, shuffle=True, split_size=split_size, batch_size=100,
                           num_epochs=num_epochs, lr=0.0001, num_jobs=num_jobs)
training_idx, test_idx = st.fit(X=X, y=y)

In [None]:
model_name = "plssvd2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=False)
df

In [None]:
chart

## Clustering eigenvalues based strategy


In [None]:
st = ClusteringEigenStratification(num_subsamples=10000, num_clusters=5, sigma=2, swap_probability=0.1,
                                   threshold_proportion=0.1, decay=0.1, shuffle=True, split_size=split_size,
                                   batch_size=500, num_epochs=num_epochs, num_jobs=num_jobs)
training_idx, test_idx = st.fit(X=X, y=y)

In [None]:
model_name = "eigencluster2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=False)
df

In [None]:
chart

## Community based splitting strategy


In [None]:
st = CommunityStratification(num_subsamples=20000, num_communities=5, sigma=2, swap_probability=0.1,
                             threshold_proportion=0.1, decay=0.1, shuffle=True, split_size=split_size,
                             batch_size=500, num_epochs=num_epochs, num_jobs=num_jobs)
training_idx, test_idx = st.fit(X=X, y=y)

In [None]:
model_name = "comm2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=False)
df

In [None]:
chart

## Label enhancement based strategy


In [None]:
st = LabelEnhancementStratification(num_subsamples=10000, num_communities=10, sigma=2, alpha=0.2,
                                    swap_probability=0.1, threshold_proportion=0.1, decay=0.1, shuffle=True,
                                    split_size=split_size, batch_size=500, num_epochs=num_epochs,
                                    num_jobs=num_jobs)
training_idx, test_idx = st.fit(X=X, y=y)

In [None]:
model_name = "enhance2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=False)
df

In [None]:
chart

## Active learning based splitting strategy


In [None]:
st = ActiveStratification(subsample_labels_size=10, acquisition_type="entropy", top_k=5, calc_ads=False,
                          ads_percent=0.7, use_solver=use_solver, loss_function="hinge", swap_probability=0.1,
                          threshold_proportion=0.1, decay=0.1, penalty='elasticnet', alpha_elastic=0.0001,
                          l1_ratio=0.65, alpha_l21=0.01, loss_threshold=0.05, shuffle=True,
                          split_size=split_size, batch_size=500, num_epochs=num_epochs, lr=1e-3,
                          display_interval=1, num_jobs=num_jobs)
training_idx, test_idx = st.fit(X=X, y=y)

In [None]:
model_name = "active2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=False)
df

In [None]:
chart

## GAN learning based splitting strategy


In [None]:
st = GANStratification(dimension_size=50, num_examples2gen=20, update_ratio=1, window_size=2,
                       num_subsamples=10000, num_clusters=5, sigma=2, swap_probability=0.1,
                       threshold_proportion=0.1, decay=0.1, shuffle=True, split_size=split_size,
                       batch_size=1000, max_iter_gen=30, max_iter_dis=30, num_epochs=num_epochs, 
                       lambda_gen=1e-5, lambda_dis=1e-5, lr=1e-3, display_interval=2, 
                       num_jobs=num_jobs)
training_idx, test_idx = st.fit(X=X, y=y)

In [None]:
model_name = "gan2split"
df, chart = data_properties(y=y.toarray(), selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=True, display_figure=True)
df

In [None]:
chart

## Next steps

This tutorial has shown how to run various splitting algorithms while exploring outcomes. 

As a next step, you could try to improve the algorithms or analyze results on a larger number of multi-label or single-label datasets. You may also rerun the algorithms using different configurations. For instance, you could try setting the `split_type` parameter to `"iterative"` or `"naive"`, or use a range of split size values (`split_size` $\in (0,1)$), and document the performance results. If you choose to apply `comm2split.py` or `enhance2split.py`, use data with a small label set.