# Stratified Splitting

This notebook provides several tutorials on how to use the algorithms in the
**straSplit** package to split a multi-label dataset using the less-explored
[stratified strategy](https://bit.ly/3s3IDA8). Please install the
[Anaconda](https://www.anaconda.com/) distribution and the other modules listed
in the [requirements.txt](../../requirements.txt) file.

# Load modules and datasets

First, let us change the directory to the `model`.

In [1]:
import os
os.chdir('../model')
os.getcwd()

'D:\\MultiLabel\\straSplit\\src\\model'

Also, load the following modules to run the algorithms introduced in this notebook.

In [2]:
import pickle as pkl
import pandas as pd
from IPython.display import HTML, display

## load utilities
from utils import DATASET_PATH, RESULT_PATH
from utils import check_type, custom_shuffle, data_properties, LabelBinarizer

## load modules
from naive2split import NaiveStratification
from iterative2split import IterativeStratification
from extreme2split import ExtremeStratification
from plssvd2split import ClusterStratification
from eigencluster2split import ClusteringEigenStratification
from comm2split import CommunityStratification
from enhance2split import LabelEnhancementStratification
from active2split import ActiveStratification

## multi-label model
from skmultilearn.problem_transform import ClassifierChain
from sklearn.svm import SVC

## evaluation metrics
from sklearn.metrics import f1_score

## Set dataframe to maxwidth
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', None)  # or 199

Make sure that **DATASET_PATH** (dataset folder) and **RESULT_PATH** (results folder, such as dataset properties) are non-empty and set appropriately in the [utils.py](utils.py) module.

Now, let us assign values to the following arguments:

In [3]:
split_type = "extreme"
split_size = 0.80
num_epochs = 50
num_jobs = 2
use_solver = False

where `use_solver` applies only to the `active2split` module. It controls whether to use the scikit-learn optimizer ([SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)) or the custom-built optimization algorithm. `split_type` accepts only `"extreme"`, `"iterative"`, or `"naive"`, while `split_size` is the proportion of the dataset to include in the training set.

Let us use the well-known "birds" multi-label dataset. The cell below keeps only the examples that have at least one label; the resulting data comprises 351 examples, each a feature vector in $\mathbb{R}^{260}$, and 19 distinct labels.

In [4]:
dsname="birds"
X_name = dsname + "_X.pkl"
y_name = dsname + "_y.pkl"
file_path = os.path.join(DATASET_PATH, y_name)
with open(file_path, mode="rb") as f_in:
    y = pkl.load(f_in)
    idx = list(set(y.nonzero()[0]))
    y = y[idx]

file_path = os.path.join(DATASET_PATH, X_name)
with open(file_path, mode="rb") as f_in:
    X = pkl.load(f_in)
    X = X[idx]

print("Size of the data: ", X.shape)
print("Label size of the data: ", y.shape[1])

Size of the data:  (351, 260)
Label size of the data:  19


Using the above data and configuration arguments, we will show some interesting outcomes for each splitting strategy.

We also define a multi-label model to validate the resulting splits. Here, we choose the well-known Classifier Chains ([paper](https://link.springer.com/chapter/10.1007/978-3-642-04174-7_17)) method, which constructs a sequence of classifiers according to the Bayesian chain rule. This method preserves label correlations while keeping computational complexity low.

We apply the Classifier Chains multi-label classifier with an `SVC` base classifier, which supports sparse input, as defined below:

In [5]:
model = ClassifierChain(classifier = SVC(), require_dense = [False, True])

For performance evaluation, this notebook uses the F1-score metric, as implemented in the `score` function below.

In [6]:
def score(y_true, y_pred):
    # F1 averaged per example, over all label decisions, and per label
    f1_samples = f1_score(y_true, y_pred, average='samples')
    f1_micro = f1_score(y_true, y_pred, average='micro')
    f1_macro = f1_score(y_true, y_pred, average='macro')
    print('\t>> Average sample f1-score: {0:.4f}'.format(f1_samples))
    print('\t>> Micro f1-score: {0:.4f}'.format(f1_micro))
    print('\t>> Macro f1-score: {0:.4f}'.format(f1_macro))
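
To make the three averaging modes concrete, here is a minimal pure-Python sketch on a toy prediction. It is not scikit-learn's implementation; the helper names `f1` and `mean` and the toy matrices are assumptions made for illustration only.

```python
def f1(true_bits, pred_bits):
    """F1 for two equal-length 0/1 sequences; 0.0 when both are all zeros."""
    tp = sum(t * p for t, p in zip(true_bits, pred_bits))
    denom = sum(true_bits) + sum(pred_bits)
    return 2 * tp / denom if denom else 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Toy data (hypothetical): 3 examples x 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 1, 1]]

# 'samples': one F1 per example (row), then average over examples.
f1_samples = mean(f1(t, p) for t, p in zip(y_true, y_pred))
# 'macro': one F1 per label (column), then average over labels.
f1_macro = mean(f1(t, p) for t, p in zip(zip(*y_true), zip(*y_pred)))
# 'micro': pool every (example, label) decision and compute a single F1.
flat_true = [v for row in y_true for v in row]
flat_pred = [v for row in y_pred for v in row]
f1_micro = f1(flat_true, flat_pred)

print(round(f1_samples, 4), round(f1_micro, 4), round(f1_macro, 4))
```

The macro average is pulled down by the third label, which is never predicted correctly; on data with many rare labels the macro score is therefore often the lowest of the three.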

**We note that our discussions are primarily focused on the "birds" dataset and are not necessarily extensible to other datasets.**

## Naive approach
This strategy was proposed in the [paper](https://doi.org/10.1371/journal.pcbi.1008174) and integrated into the [mlLGPR](https://github.com/hallamlab/mlLGPR) software for pathway prediction. It is an iterative procedure. First, it selects a label, independently of the others, and finds the examples associated with that label. Next, it splits those examples according to the `split_size` parameter and assigns them to the training and test sets. If an example has already been added to either set, it is skipped, and the process continues by selecting another label at random. The process iterates until all examples are consumed. The final partition may not match `split_size` exactly, because the same example can carry multiple labels. Although simple, this approach scales to large datasets. However, it suffers from the class imbalance problem and does not consider label correlations when splitting a dataset.
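
The procedure described above can be sketched in a few lines of pure Python. This is only an illustration of the idea, not the `NaiveStratification` implementation; the function name `naive_split`, the rounding rule, and the toy label matrix are assumptions.

```python
import random

def naive_split(Y, split_size=0.8, seed=0):
    """Label-by-label split of a dense 0/1 label matrix (illustrative only)."""
    rng = random.Random(seed)
    labels = list(range(len(Y[0])))
    rng.shuffle(labels)                      # visit labels in random order
    assigned = set()
    train, test = [], []
    for label in labels:
        # Examples carrying this label that no earlier label has claimed.
        candidates = [i for i, row in enumerate(Y)
                      if row[label] and i not in assigned]
        rng.shuffle(candidates)
        cut = int(round(split_size * len(candidates)))
        train.extend(candidates[:cut])
        test.extend(candidates[cut:])
        assigned.update(candidates)
    # Examples without any label are never visited; the notebook removes
    # them before splitting, so the sketch makes the same assumption.
    return sorted(train), sorted(test)

# Toy data (hypothetical): 10 examples x 3 labels, every row has a label.
Y = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 1], [0, 0, 1],
     [1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
train_idx, test_idx = naive_split(Y, split_size=0.8)
print(len(train_idx), len(test_idx))
```

Because each label's candidates are rounded separately, the realized train/test ratio can drift from `split_size`, which matches the caveat in the description.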

To see the results using this algorithm, you may run the following command.

In [7]:
st = NaiveStratification(shuffle=True, split_size=split_size, batch_size=500,
                         num_jobs=num_jobs)
training_idx, test_idx = st.fit(y=y)

## Configuration parameters to naive based stratified multi-label dataset
   splitting:
		1. Shuffle the dataset? True
		2. Split size: 0.8
		3. Number of examples to use in each iteration: 500
		4. Number of parallel workers: 2


	>> Perform splitting...
		--> Splitting progress: 100.00%...


where *training_idx* and *test_idx* are two lists corresponding to the indices of the given dataset.

Let us explore some properties of the resulting training and test sets.

In [8]:
model_name = "naive2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=True, display_figure=True)
df

| | Properties for birds | Complete set | Training set | Test set |
|---|---|---|---|---|
| 0 | Number of examples | 351.0 | 281.0 | 70.0 |
| 1 | Number of labels | 654.0 | 524.0 | 130.0 |
| 2 | Label cardinality | 1.863248 | 1.864769 | 1.857143 |
| 3 | Label density | 0.098066 | 0.098146 | 0.097744 |
| 4 | Distinct labels | 19.0 | 19.0 | 19.0 |
| 5 | Distinct label sets | 132.0 | 121.0 | 48.0 |
| 6 | Frequency of distinct label sets | 0.376068 | 0.376068 | 0.376068 |
| 7 | Mean imbalance ratio intra-class for all labels | 18.425781 | 18.449253 | 22.127525 |
| 8 | Mean imbalance ratio inter-class for all labels | 5.406996 | 5.318112 | 6.954365 |
| 9 | Mean imbalance ratio labelsets for all labels | 10.67039 | 11.98899 | 57.891645 |


where *Label cardinality* is the mean number of labels per example, *Label density* is the cardinality divided by the number of labels, *Distinct label sets* is the number of distinct label combinations in the data, *Frequency of distinct label sets* is the number of distinct label sets divided by the total number of examples, and the *KL* (Kullback-Leibler) divergence between the complete data and a partition measures the difference between two probability distributions; a low value indicates that the partition closely resembles the label distribution of the complete data.
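
As a quick check against the table, the complete set has 654 label assignments over 351 examples, so the cardinality is 654 / 351 ≈ 1.863248 and the density is 1.863248 / 19 ≈ 0.098066. The sketch below computes the same quantities on a toy label matrix; the function name `label_properties` and the toy data are assumptions, and this is not the code behind `data_properties`.

```python
def label_properties(Y):
    """Basic multi-label statistics for a dense 0/1 label matrix."""
    num_examples = len(Y)
    num_labels = len(Y[0])
    total_assignments = sum(sum(row) for row in Y)
    cardinality = total_assignments / num_examples     # mean labels per example
    density = cardinality / num_labels                 # cardinality per label
    distinct_label_sets = len({tuple(row) for row in Y})
    return {
        "cardinality": cardinality,
        "density": density,
        "distinct_label_sets": distinct_label_sets,
        "frequency_of_distinct_label_sets": distinct_label_sets / num_examples,
    }

# Toy data (hypothetical): 4 examples x 3 labels; rows 0 and 2 share a label set.
Y = [[1, 0, 1],
     [0, 1, 0],
     [1, 0, 1],
     [1, 1, 0]]
props = label_properties(Y)
print(props)
```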

From the table, we observe that both the training and test sets closely approximate the complete data with regard to the KL metric.

Let us plot the resulting data in terms of the frequency of examples for each label.
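
The notebook does not show how the KL value is computed, so the following is only one plausible reading: normalize the per-label counts of the complete data and of a partition into probability distributions, then compute $\mathrm{KL}(P \,\|\, Q) = \sum_i p_i \log(p_i / q_i)$. The function names, the smoothing constant `eps`, and the toy data are assumptions.

```python
import math

def label_distribution(Y, eps=1e-12):
    """Per-label counts normalized to sum to 1 (eps avoids zero entries)."""
    counts = [sum(col) for col in zip(*Y)]
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy data (hypothetical): the partition is the first two rows of the complete set.
complete = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0]]
partition = complete[:2]

p = label_distribution(complete)
q = label_distribution(partition)
kl = kl_divergence(p, q)
print(round(kl, 4))
```

A value of 0 means the partition reproduces the label frequencies of the complete data exactly; larger values indicate a larger mismatch.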

In [9]:
chart

The above chart confirms our observation from the table.

In [10]:
# train
model.fit(X[training_idx], y[training_idx])

ClassifierChain(classifier=SVC(), require_dense=[False, True])

In [11]:
# predict
y_pred = model.predict(X[test_idx])
score(y_true=y[test_idx].toarray(), y_pred=y_pred.toarray())

	>> Average sample f1-score: 0.0152
	>> Micro f1-score: 0.0299
	>> Macro f1-score: 0.0105


## Iterative approach

This is a modified version of the algorithm from the [paper](https://link.springer.com/chapter/10.1007/978-3-642-23808-6_10), which splits the dataset iteratively. The algorithm starts by calculating the desired number of examples and proportions for the training and test sets. It then estimates the desired number of examples of each label in each partition. Next, the algorithm iteratively examines one label at a time, starting with the label that has the fewest examples. For each example of this label, the algorithm selects the appropriate partition. Once the partition is selected, we add the example to it and decrement the desired number of examples for each label of this example, as well as the total number of desired examples for that partition.
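
The steps above can be sketched for a two-way train/test split as follows. This is a simplified illustration, not the `IterativeStratification` implementation; the function name `iterative_split`, the tie-breaking order, and the toy data are assumptions.

```python
import random

def iterative_split(Y, split_size=0.8, seed=0):
    """Iterative stratification into [train, test] (illustrative only)."""
    rng = random.Random(seed)
    n, num_labels = len(Y), len(Y[0])
    ratios = [split_size, 1.0 - split_size]
    desired = [n * r for r in ratios]                  # examples wanted per fold
    label_counts = [sum(row[l] for row in Y) for l in range(num_labels)]
    desired_label = [[c * r for c in label_counts] for r in ratios]
    remaining = set(range(n))
    folds = [[], []]

    def assign(i, fold):
        folds[fold].append(i)
        remaining.discard(i)
        desired[fold] -= 1
        for l in range(num_labels):
            if Y[i][l]:
                desired_label[fold][l] -= 1

    while remaining:
        counts = {l: sum(Y[i][l] for i in remaining) for l in range(num_labels)}
        counts = {l: c for l, c in counts.items() if c > 0}
        if not counts:
            # Only unlabeled examples are left: place them by overall demand.
            for i in sorted(remaining):
                assign(i, max(range(2), key=lambda f: desired[f]))
            break
        label = min(counts, key=counts.get)            # rarest remaining label
        for i in [j for j in sorted(remaining) if Y[j][label]]:
            # Prefer the fold that still wants this label most; break ties
            # by overall demand, then at random.
            best = max(range(2), key=lambda f: (desired_label[f][label],
                                                desired[f], rng.random()))
            assign(i, best)
    return sorted(folds[0]), sorted(folds[1])

# Toy data (hypothetical): 10 examples x 3 labels.
Y = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 1], [0, 0, 1],
     [1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
train_idx, test_idx = iterative_split(Y, split_size=0.8)
print(len(train_idx), len(test_idx))
```

Since per-label demand takes priority over the overall fold size, the realized split on a dataset this small can deviate from `split_size`.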

Let us apply this algorithm.

In [12]:
st = IterativeStratification(shuffle=True, split_size=split_size)
training_idx, test_idx = st.fit(y=y)

## Configuration parameters to iteratively stratifying a multi-label
   dataset splitting:
		1. Shuffle the dataset? True
		2. Split size: 0.8


	>> Perform splitting (iterative)...
		--> Splitting progress: 100.00%...

In [13]:
model_name = "iterative2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=True, display_figure=True)
df

| | Properties for birds | Complete set | Training set | Test set |
|---|---|---|---|---|
| 0 | Number of examples | 351.0 | 279.0 | 72.0 |
| 1 | Number of labels | 654.0 | 521.0 | 133.0 |
| 2 | Label cardinality | 1.863248 | 1.867384 | 1.847222 |
| 3 | Label density | 0.098066 | 0.098283 | 0.097222 |
| 4 | Distinct labels | 19.0 | 19.0 | 19.0 |
| 5 | Distinct label sets | 132.0 | 116.0 | 52.0 |
| 6 | Frequency of distinct label sets | 0.376068 | 0.376068 | 0.376068 |
| 7 | Mean imbalance ratio intra-class for all labels | 18.425781 | 18.131342 | 21.420437 |
| 8 | Mean imbalance ratio inter-class for all labels | 5.406996 | 5.653859 | 6.247628 |
| 9 | Mean imbalance ratio labelsets for all labels | 10.67039 | 13.499738 | 64.161868 |


In [14]:
chart

As can be observed, this algorithm produces less optimal results (KL metric) than the naive approach.

In [15]:
# train
model.fit(X[training_idx], y[training_idx])
# predict
y_pred = model.predict(X[test_idx])
score(y_true=y[test_idx].toarray(), y_pred=y_pred.toarray())

	>> Average sample f1-score: 0.0370
	>> Micro f1-score: 0.0580
	>> Macro f1-score: 0.0162


## Stratifying XML data approach

See the [paper](https://arxiv.org/pdf/2103.03494.pdf) for details on this strategy.

In [16]:
st = ExtremeStratification(swap_probability=0.1, threshold_proportion=0.1, decay=0.1,
                           shuffle=True, split_size=split_size, num_epochs=num_epochs)
training_idx, test_idx = st.fit(X=X, y=y)

## Configuration parameters to stratifying a large scale multi-label
   dataset splitting:
		1. A hyper-parameter for extreme stratification: 0.1
		2. A hyper-parameter for extreme stratification: 0.1
		3. A hyper-parameter for extreme stratification: 0.1
		4. Shuffle the dataset? True
		5. Split size: 0.8
		6. Number of loops over a dataset: 50


	>> Perform splitting (extreme)...
		--> Starting score: 22
		--> Splitting progress: 100.00%; score: -9.17


In [17]:
model_name = "extreme2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=True, display_figure=True)
df

| | Properties for birds | Complete set | Training set | Test set |
|---|---|---|---|---|
| 0 | Number of examples | 351.0 | 273.0 | 78.0 |
| 1 | Number of labels | 654.0 | 514.0 | 140.0 |
| 2 | Label cardinality | 1.863248 | 1.882784 | 1.794872 |
| 3 | Label density | 0.098066 | 0.099094 | 0.094467 |
| 4 | Distinct labels | 19.0 | 19.0 | 19.0 |
| 5 | Distinct label sets | 132.0 | 117.0 | 47.0 |
| 6 | Frequency of distinct label sets | 0.376068 | 0.376068 | 0.376068 |
| 7 | Mean imbalance ratio intra-class for all labels | 18.425781 | 18.119333 | 24.492804 |
| 8 | Mean imbalance ratio inter-class for all labels | 5.406996 | 5.641551 | 5.966196 |
| 9 | Mean imbalance ratio labelsets for all labels | 10.67039 | 12.316828 | 39.652826 |


In [18]:
chart

In [19]:
# train
model.fit(X[training_idx], y[training_idx])
# predict
y_pred = model.predict(X[test_idx])
score(y_true=y[test_idx].toarray(), y_pred=y_pred.toarray())

	>> Average sample f1-score: 0.0150
	>> Micro f1-score: 0.0282
	>> Macro f1-score: 0.0100


## Clustering based strategy

In [20]:
st = ClusterStratification(num_clusters=5, swap_probability=0.1, threshold_proportion=0.1,
                           decay=0.1, shuffle=True, split_size=split_size, batch_size=100,
                           num_epochs=num_epochs, lr=0.0001, num_jobs=num_jobs)
training_idx, test_idx = st.fit(X=X, y=y)

## Configuration parameters to stratifying a multi-label dataset splitting
   based on clustering the covariance of X and y using PLSSVD:
		1. Number of clusters to form: 5
		2. A hyper-parameter: 0.1
		3. A hyper-parameter: 0.1
		4. A hyper-parameter: 0.1
		5. Shuffle the dataset? True
		6. Split size: 0.8
		7. Number of examples to use in each iteration: 100
		8. Number of loops over training set: 50
		9. Learning rate: 0.0001
		10. Number of parallel workers: 2


	>> Computing the covariance of X and y using PLSSVD: 100.00%...
	>> Projecting examples onto the obtained low dimensional U orthonormal basis...
	>> Clustering the resulted low dimensional examples...
	>> Perform splitting (extreme)...
		--> Starting score: 21
		--> Splitting progress: 100.00%; score: 11.56


In [21]:
model_name = "plssvd2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=True, display_figure=True)
df

| | Properties for birds | Complete set | Training set | Test set |
|---|---|---|---|---|
| 0 | Number of examples | 351.0 | 276.0 | 75.0 |
| 1 | Number of labels | 654.0 | 515.0 | 139.0 |
| 2 | Label cardinality | 1.863248 | 1.865942 | 1.853333 |
| 3 | Label density | 0.098066 | 0.098207 | 0.097544 |
| 4 | Distinct labels | 19.0 | 19.0 | 19.0 |
| 5 | Distinct label sets | 132.0 | 113.0 | 50.0 |
| 6 | Frequency of distinct label sets | 0.376068 | 0.376068 | 0.376068 |
| 7 | Mean imbalance ratio intra-class for all labels | 18.425781 | 20.295392 | 20.276916 |
| 8 | Mean imbalance ratio inter-class for all labels | 5.406996 | 6.103324 | 5.677536 |
| 9 | Mean imbalance ratio labelsets for all labels | 10.67039 | 13.1184 | 67.632397 |


In [22]:
chart

In [23]:
# train
model.fit(X[training_idx], y[training_idx])
# predict
y_pred = model.predict(X[test_idx])
score(y_true=y[test_idx].toarray(), y_pred=y_pred.toarray())

	>> Average sample f1-score: 0.0467
	>> Micro f1-score: 0.0556
	>> Macro f1-score: 0.0168


## Clustering eigenvalues based strategy

In [24]:
st = ClusteringEigenStratification(num_subsamples=10000, num_clusters=5, sigma=2, swap_probability=0.1,
                                   threshold_proportion=0.1, decay=0.1, shuffle=True, split_size=split_size,
                                   batch_size=500, num_epochs=num_epochs, num_jobs=num_jobs)
training_idx, test_idx = st.fit(X=X, y=y)

## Configuration parameters to stratifying a multi-label dataset splitting
   based on clustering eigen values of the label adjacency matrix:
		1. Subsampling input size: 10000
		2. Number of communities: 5
		3. Constant that scales the amount of laplacian norm regularization: 2
		4. A hyper-parameter: 0.1
		5. A hyper-parameter: 0.1
		6. A hyper-parameter: 0.1
		7. Shuffle the dataset? True
		8. Split size: 0.8
		9. Number of examples to use in each iteration: 500
		10. Number of loops over training set: 50
		11. Number of parallel workers: 2


	>> Extracting clusters...
	>> Perform splitting (extreme)...
		--> Starting score: 69
		--> Splitting progress: 100.00%; score: 2.91


In [25]:
model_name = "eigencluster2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=True, display_figure=True)
df

| | Properties for birds | Complete set | Training set | Test set |
|---|---|---|---|---|
| 0 | Number of examples | 351.0 | 282.0 | 69.0 |
| 1 | Number of labels | 654.0 | 514.0 | 140.0 |
| 2 | Label cardinality | 1.863248 | 1.822695 | 2.028986 |
| 3 | Label density | 0.098066 | 0.095931 | 0.106789 |
| 4 | Distinct labels | 19.0 | 19.0 | 19.0 |
| 5 | Distinct label sets | 132.0 | 112.0 | 51.0 |
| 6 | Frequency of distinct label sets | 0.376068 | 0.376068 | 0.376068 |
| 7 | Mean imbalance ratio intra-class for all labels | 18.425781 | 21.116011 | 16.513232 |
| 8 | Mean imbalance ratio inter-class for all labels | 5.406996 | 5.990358 | 5.504411 |
| 9 | Mean imbalance ratio labelsets for all labels | 10.67039 | 43.845009 | 18.209313 |


In [26]:
chart

In [27]:
# train
model.fit(X[training_idx], y[training_idx])
# predict
y_pred = model.predict(X[test_idx])
score(y_true=y[test_idx].toarray(), y_pred=y_pred.toarray())

	>> Average sample f1-score: 0.0155
	>> Micro f1-score: 0.0272
	>> Macro f1-score: 0.0124


## Community based splitting strategy

In [28]:
st = CommunityStratification(num_subsamples=20000, num_communities=5, sigma=2, swap_probability=0.1,
                             threshold_proportion=0.1, decay=0.1, shuffle=True, split_size=split_size,
                             batch_size=500, num_epochs=num_epochs, num_jobs=num_jobs)
training_idx, test_idx = st.fit(X=X, y=y)

## Configuration parameters to stratifying a multi-label dataset splitting
   based on community detection approach:
		1. Subsampling input size: 20000
		2. Number of communities: 5
		3. Constant that scales the amount of laplacian norm regularization: 2
		4. A hyper-parameter: 0.1
		5. A hyper-parameter: 0.1
		6. A hyper-parameter: 0.1
		7. Shuffle the dataset? True
		8. Split size: 0.8
		9. Number of examples to use in each iteration: 500
		10. Number of loops over training set: 50
		11. Number of parallel workers: 2


	>> Building Graph...
	>> Perform splitting (extreme)...
		--> Starting score: 5
		--> Splitting progress: 100.00%; score: -7.13


In [29]:
model_name = "comm2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=True, display_figure=True)
df

| | Properties for birds | Complete set | Training set | Test set |
|---|---|---|---|---|
| 0 | Number of examples | 351.0 | 267.0 | 84.0 |
| 1 | Number of labels | 654.0 | 498.0 | 156.0 |
| 2 | Label cardinality | 1.863248 | 1.865169 | 1.857143 |
| 3 | Label density | 0.098066 | 0.098167 | 0.097744 |
| 4 | Distinct labels | 19.0 | 19.0 | 19.0 |
| 5 | Distinct label sets | 132.0 | 113.0 | 50.0 |
| 6 | Frequency of distinct label sets | 0.376068 | 0.376068 | 0.376068 |
| 7 | Mean imbalance ratio intra-class for all labels | 18.425781 | 19.241148 | 19.011722 |
| 8 | Mean imbalance ratio inter-class for all labels | 5.406996 | 5.909266 | 4.752931 |
| 9 | Mean imbalance ratio labelsets for all labels | 10.67039 | 13.113031 | 58.012602 |


In [30]:
chart

In [31]:
# train
model.fit(X[training_idx], y[training_idx])
# predict
y_pred = model.predict(X[test_idx])
score(y_true=y[test_idx].toarray(), y_pred=y_pred.toarray())

	>> Average sample f1-score: 0.0425
	>> Micro f1-score: 0.0727
	>> Macro f1-score: 0.0234


## Label enhancement based strategy

In [32]:
st = LabelEnhancementStratification(num_subsamples=10000, num_communities=10, sigma=2, alpha=0.2,
                                    swap_probability=0.1, threshold_proportion=0.1, decay=0.1, shuffle=True,
                                    split_size=split_size, batch_size=500, num_epochs=num_epochs,
                                    num_jobs=num_jobs)
training_idx, test_idx = st.fit(X=X, y=y)

## Configuration parameters to stratifying a multi-label dataset splitting
   based on label enhancement approach:
		1. Subsampling input size: 10000
		2. Number of communities: 10
		3. Constant that scales the amount of laplacian norm regularization: 2
		4. A hyperparameter to balancing parameterwhich controls the fraction of the information inherited from the label propagation and the label matrix.: 0.2
		5. A hyper-parameter: 0.1
		6. A hyper-parameter: 0.1
		7. A hyper-parameter: 0.1
		8. Shuffle the dataset? True
		9. Split size: 0.8
		10. Number of examples to use in each iteration: 500
		11. Number of loops over training set: 50
		12. Number of parallel workers: 2


	>> Building Graph...
	>> Perform splitting (extreme)...
		--> Starting score: 4
		--> Splitting progress: 100.00%; score: 3.63


In [33]:
model_name = "enhance2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=True, display_figure=True)
df

| | Properties for birds | Complete set | Training set | Test set |
|---|---|---|---|---|
| 0 | Number of examples | 351.0 | 280.0 | 71.0 |
| 1 | Number of labels | 654.0 | 513.0 | 141.0 |
| 2 | Label cardinality | 1.863248 | 1.832143 | 1.985915 |
| 3 | Label density | 0.098066 | 0.096429 | 0.104522 |
| 4 | Distinct labels | 19.0 | 19.0 | 17.0 |
| 5 | Distinct label sets | 132.0 | 116.0 | 49.0 |
| 6 | Frequency of distinct label sets | 0.376068 | 0.376068 | 0.376068 |
| 7 | Mean imbalance ratio intra-class for all labels | 18.425781 | 17.239248 | 17.768821 |
| 8 | Mean imbalance ratio inter-class for all labels | 5.406996 | 4.617656 | 7.007422 |
| 9 | Mean imbalance ratio labelsets for all labels | 10.67039 | 39.418394 | 11.802143 |


In [34]:
chart

In [35]:
# train
model.fit(X[training_idx], y[training_idx])
# predict
y_pred = model.predict(X[test_idx])
score(y_true=y[test_idx].toarray(), y_pred=y_pred.toarray())

	>> Average sample f1-score: 0.0338
	>> Micro f1-score: 0.0411
	>> Macro f1-score: 0.0126


## Active learning based splitting strategy

In [40]:
st = ActiveStratification(subsample_labels_size=10, acquisition_type="entropy", top_k=5, calc_ads=False,
                          ads_percent=0.7, use_solver=use_solver, loss_function="hinge", swap_probability=0.1,
                          threshold_proportion=0.1, decay=0.1, penalty='elasticnet', alpha_elastic=0.0001,
                          l1_ratio=0.65, alpha_l21=0.01, loss_threshold=0.05, shuffle=True,
                          split_size=split_size, batch_size=500, num_epochs=num_epochs, lr=1e-3,
                          display_interval=1, num_jobs=num_jobs)
training_idx, test_idx = st.fit(X=X, y=y)

## Configuration parameters to estimating examples predictive uncertainty
   scores to group example with high informativeness into training set
   using a modified approach to splitting an extreme large scale multi-
   label dataset:
		1. Subsampling labels: 10
		2. The acquisition function for estimating the predictive uncertainty: entropy
		3. Apply sklearn optimizers? False
		4. The loss function: hinge
		5. A hyper-parameter for extreme stratification: 0.1
		6. A hyper-parameter for extreme stratification: 0.1
		7. A hyper-parameter for extreme stratification: 0.1
		8. The penalty (aka regularization term): elasticnet
		9. Constant controlling the elastic term: 0.0001
		10. The elastic net mixing parameter: 0.65
		11. A cutoff threshold between two consecutive rounds: 0.05
		12. Shuffle the dataset? True
		13. Split size: 0.8
		14. Number of examples to use in each iteration: 500
		15. Number of loops over training set: 50
		16. Learning rate: 0.001
		17. How often to evaluate? 1


	>> Training to learn a model...
	   1)- Epoch count (1/50)...
  		<<<------------<<<------------<<<
  		>> Feed-Backward...
			--> Optimizing Theta: 100.00%...
  		>>>------------>>>------------>>>
  		>> Feed-Forward...
  		>> Predictive uncertainty using entropy...
  		>> Compute cost...
			--> Calculating cost: 100.00%...
			--> New cost: 0.7237; Old cost: inf
			--> Epoch 1 took 0.047 seconds...
	   2)- Epoch count (2/50)...
  		<<<------------<<<------------<<<
  		>> Feed-Backward...
			--> Optimizing Theta: 100.00%...
  		>>>------------>>>------------>>>
  		>> Feed-Forward...
  		>> Predictive uncertainty using entropy...
  		>> Compute cost...
			--> Calculating cost: 100.00%...
			--> New cost: 0.7549; Old cost: 0.7237
			--> Epoch 2 took 0.059 seconds...
	   3)- Epoch count (3/50)...
  		<<<------------<<<------------<<<
  		>> Feed-Backward...
			--> Optimizing Theta: 100.00%...
  		>>>------------>>>------------>>>
  		>> Feed-Forward...
  		>> Predictive uncertainty usi

In [41]:
model_name = "active2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], num_tails=5, dataset_name=dsname,
                            model_name=model_name, rspath=RESULT_PATH, display_dataframe=True, display_figure=True)
df

| | Properties for birds | Complete set | Training set | Test set |
|---|---|---|---|---|
| 0 | Number of examples | 351.0 | 273.0 | 78.0 |
| 1 | Number of labels | 654.0 | 509.0 | 145.0 |
| 2 | Label cardinality | 1.863248 | 1.864469 | 1.858974 |
| 3 | Label density | 0.098066 | 0.09813 | 0.097841 |
| 4 | Distinct labels | 19.0 | 19.0 | 18.0 |
| 5 | Distinct label sets | 132.0 | 114.0 | 48.0 |
| 6 | Frequency of distinct label sets | 0.376068 | 0.376068 | 0.376068 |
| 7 | Mean imbalance ratio intra-class for all labels | 18.425781 | 17.316722 | 21.181672 |
| 8 | Mean imbalance ratio inter-class for all labels | 5.406996 | 5.074497 | 6.245878 |
| 9 | Mean imbalance ratio labelsets for all labels | 10.67039 | 16.803822 | 52.783562 |


In [42]:
chart

In [43]:
# train
model.fit(X[training_idx], y[training_idx])
# predict
y_pred = model.predict(X[test_idx])
score(y_true=y[test_idx].toarray(), y_pred=y_pred.toarray())

	>> Average sample f1-score: 0.0385
	>> Micro f1-score: 0.0403
	>> Macro f1-score: 0.0175


## Next steps

This tutorial has shown how to run various splitting algorithms while exploring outcomes. 

As a next step, you could try to improve the algorithms or analyze results on a larger number of multi-label or single-label datasets. You may also rerun the algorithms with different configurations. For instance, you could set the `split_type` parameter to `"iterative"` or `"naive"`, or use a range of split sizes (`split_size` $\in (0,1)$), and document the performance results. If you choose to apply `comm2split.py` or `enhance2split.py`, use data with a small label set.