# Stratified splitting

This notebook walks through the algorithms in the **straSplit** package for partitioning a multi-label dataset with a [stratified strategy](https://bit.ly/3s3IDA8). Before running it, install
[Anaconda](https://www.anaconda.com/) and the modules listed in the [requirements.txt](../../requirements.txt) file.

## Load modules and datasets

First, let us change the current directory to the `model` folder.

In [1]:
import os
os.chdir('../model')

Then, load the following modules.

In [2]:
from IPython import display

import pickle as pkl
import pandas as pd

## load utilities
from utils import DATASET_PATH, RESULT_PATH, data_properties
from utils import check_type, custom_shuffle, LabelBinarizer

## load modules
from naive2split import NaiveStratification
from iterative2split import IterativeStratification
from extreme2split import ExtremeStratification
from plssvd2split import ClusterStratification
from eigencluster2split import ClusteringEigenStratification
from comm2split import CommunityStratification
from enhance2split import LabelEnhancementStratification
from active2split import ActiveStratification

## multi-label model
from sklearn.multioutput import ClassifierChain
from sklearn.svm import LinearSVC

## evaluation metrics
from sklearn.metrics import f1_score

## Set dataframe to maxwidth
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

Ensure that **DATASET_PATH** (dataset folder) and **RESULT_PATH** (results folder) are properly set in the [utils.py](utils.py) module.

Now, let us assign arbitrary values to the following arguments.

In [3]:
split_type = "extreme"
split_size = 0.80
batch_size = 500
num_epochs = 50
num_jobs = 2
num_clusters = 5
use_solver = False

where `use_solver` indicates whether the `active2split` module uses the sklearn-based optimization algorithm ([SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)) or the custom-built one. `split_type` accepts only `"extreme"`, `"iterative"`, or `"naive"`, while `split_size` is the proportion of the dataset to include in the training set.

### Load the "yeast" dataset

Let us use the well-known [*yeast*](https://www.openml.org/d/40597) multi-label dataset, which consists of micro-array expression data and phylogenetic profiles of yeast. It comprises 2417 genes described by 103 variables, and each gene can be tagged with a subset of 14 distinct classes. The data is provided in the `sample` folder.

In [4]:
dsname = "yeast"
X_name = dsname + "_X.pkl"
y_name = dsname + "_y.pkl"
file_path = os.path.join(DATASET_PATH, y_name)
with open(file_path, mode="rb") as f_in:
    y = pkl.load(f_in)
    # keep only the examples that have at least one label
    idx = sorted(set(y.nonzero()[0]))
    y = y[idx]

file_path = os.path.join(DATASET_PATH, X_name)
with open(file_path, mode="rb") as f_in:
    X = pkl.load(f_in)
    X = X[idx]

print("Size of the data: ", X.shape)
print("Label size of the data: ", y.shape[1])

Size of the data:  (2417, 103)
Label size of the data:  14


### Define the ClassifierChain model

To evaluate performance on the training and test sets, we construct a `Classifier Chains` ([paper](https://link.springer.com/chapter/10.1007/978-3-642-04174-7_17)) model, which arranges binary classifiers into a chain. This method is expected to preserve label correlations in multi-label data.

For the base estimator, we apply [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC) (linear support vector classification) using the default settings.

In [5]:
model = ClassifierChain(LinearSVC(), order='random', random_state=12345)

In this notebook, we evaluate performance with the F1-score, as implemented in the `score` function.

In [6]:
def score(y_true, y_pred, tag:str = "training set"):
    f1_samples_average = f1_score(y_true, y_pred, average='samples')
    f1_samples_micro = f1_score(y_true, y_pred, average='micro')
    f1_samples_macro = f1_score(y_true, y_pred, average='macro')
    print('\t>> F1-score for {0}...'.format(tag))
    print('\t\t--> Average sample f1-score: {0:.4f}'.format(f1_samples_average))
    print('\t\t--> Micro f1-score: {0:.4f}'.format(f1_samples_micro))
    print('\t\t--> Macro f1-score: {0:.4f}'.format(f1_samples_macro))
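
The three averages differ only in where the counts are pooled. As a rough, dependency-free sketch (the helper names `f1`, `counts`, and `f1_averages` are made up for this illustration and are not part of straSplit or scikit-learn), the sample, micro, and macro variants can be computed on a toy label matrix as follows:

```python
def f1(tp, fp, fn):
    """F1 from raw counts; taken as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def counts(true_vec, pred_vec):
    """Count true positives, false positives and false negatives."""
    tp = sum(1 for t, p in zip(true_vec, pred_vec) if t and p)
    fp = sum(1 for t, p in zip(true_vec, pred_vec) if not t and p)
    fn = sum(1 for t, p in zip(true_vec, pred_vec) if t and not p)
    return tp, fp, fn


def f1_averages(y_true, y_pred):
    """Return (samples, micro, macro) F1 for binary indicator matrices."""
    rows = list(zip(y_true, y_pred))
    cols = list(zip(zip(*y_true), zip(*y_pred)))
    # samples: one F1 per example (row), then the mean over examples
    samples = sum(f1(*counts(t, p)) for t, p in rows) / len(rows)
    # macro: one F1 per label (column), then the mean over labels
    per_label = [counts(t, p) for t, p in cols]
    macro = sum(f1(*c) for c in per_label) / len(per_label)
    # micro: pool the counts of all labels, then a single F1
    micro = f1(*(sum(c[k] for c in per_label) for k in range(3)))
    return samples, micro, macro


y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
samples, micro, macro = f1_averages(y_true, y_pred)
print(round(samples, 4), round(micro, 4), round(macro, 4))
```

Macro averaging gives every label the same weight, which is why a rarely predicted label pulls the macro score well below the micro score.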

Using the above data, the chain model, and the configuration arguments, we will explore the performance of each splitting strategy.

## Algorithms


### Naive approach

This splitter was proposed in the [paper](https://doi.org/10.1371/journal.pcbi.1008174) and integrated into the metabolic pathway prediction software [mlLGPR](https://github.com/hallamlab/mlLGPR). It is an iterative procedure: the algorithm first selects a label, independently of the others, and finds the examples associated with it. It then splits those examples according to the `split_size` parameter and assigns them to the training and test sets. If an example has already been added to the training or test set, the algorithm skips it and continues by selecting another label at random. The process repeats until all examples have been assigned. In most cases, the proportion of examples in the resulting training set will not exactly equal `split_size`. This approach scales to large datasets, although it suffers from the class imbalance problem and does not consider label correlations when splitting.
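The procedure can be pictured with a small pure-Python sketch. This is only an illustration of the idea described above, not the code in `naive2split`; the function name `naive_split` and the toy `label_sets` are assumptions made for the example.

```python
import random


def naive_split(label_sets, split_size=0.8, seed=0):
    """Split example indices label by label, ignoring label correlations.

    `label_sets` holds one set of label ids per example (hypothetical input).
    """
    rng = random.Random(seed)
    labels = sorted(set().union(*label_sets))
    rng.shuffle(labels)  # labels are visited in random order
    unassigned = set(range(len(label_sets)))
    train, test = [], []
    for label in labels:
        # examples that carry this label and were not assigned earlier
        candidates = sorted(i for i in unassigned if label in label_sets[i])
        rng.shuffle(candidates)
        cut = round(split_size * len(candidates))
        train.extend(candidates[:cut])
        test.extend(candidates[cut:])
        unassigned.difference_update(candidates)
    # examples without any label are never reached by the loop above
    for i in sorted(unassigned):
        (train if rng.random() < split_size else test).append(i)
    return sorted(train), sorted(test)


label_sets = [{i % 3, 3 + (i // 2) % 4} for i in range(20)]
train_idx, test_idx = naive_split(label_sets, split_size=0.8)
print(len(train_idx), len(test_idx))
```

Because the cut is rounded separately for every label, the overall training proportion usually drifts away from `split_size`, which matches the behaviour noted above.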

In [7]:
st = NaiveStratification(shuffle=True, split_size=split_size, batch_size=batch_size,
                         num_jobs=num_jobs)
training_idx, test_idx = st.fit(y=y)

## Configuration parameters to naive based stratified multi-label dataset
   splitting:
		1. Shuffle the dataset? True
		2. Split size: 0.8
		3. Number of examples to use in each iteration: 500
		4. Number of parallel workers: 2


	>> Perform splitting...
		--> Splitting progress: 100.00%...


where *training_idx* and *test_idx* are two lists of indices into the yeast data.
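Since the achieved proportion can differ from `split_size`, it can be worth checking it directly from the two index lists. The helper below is a hypothetical convenience function, not part of the package; the stand-in index lists simply reuse the partition sizes that this run reports for yeast (1935 and 482).

```python
def achieved_split(training_idx, test_idx):
    """Fraction of all examples that ended up in the training set."""
    total = len(training_idx) + len(test_idx)
    return len(training_idx) / total


# stand-in index lists with the same sizes as the partitions in this run
ratio = achieved_split(list(range(1935)), list(range(1935, 2417)))
print("Achieved training proportion: {0:.4f}".format(ratio))
```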

Let us explore some properties of the resulting training and test sets.

In [8]:
model_name = "naive2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], 
                            num_tails=5, dataset_name=dsname, model_name=model_name, 
                            rspath=RESULT_PATH, display_dataframe=True, 
                            display_figure=True)
df

| | Properties for yeast | Complete set | Training set | Test set |
|---|---|---|---|---|
| 0 | Number of examples | 2417.0 | 1935.0 | 482.0 |
| 1 | Number of labels | 10241.0 | 8181.0 | 2060.0 |
| 2 | Label cardinality | 4.237071 | 4.227907 | 4.273859 |
| 3 | Label density | 0.302648 | 0.301993 | 0.305276 |
| 4 | Distinct labels | 14.0 | 14.0 | 14.0 |
| 5 | Distinct label sets | 198.0 | 182.0 | 102.0 |
| 6 | Frequency of distinct label sets | 0.08192 | 0.08192 | 0.08192 |
| 7 | Mean imbalance ratio intra-class for all labels | 9.578575 | 10.232741 | 8.081865 |
| 8 | Mean imbalance ratio inter-class for all labels | 7.196811 | 7.652081 | 6.187154 |
| 9 | Mean imbalance ratio labelsets for all labels | 0.468229 | 0.398937 | 0.674251 |


where *Label cardinality* is the mean number of labels associated with an example, *Label density* is the cardinality divided by the number of labels, *Distinct label sets* is the number of label combinations appearing in the dataset, and *Frequency of distinct label sets* is the number of distinct label sets divided by the total number of examples. The *KL* (Kullback-Leibler divergence) difference between the complete and partitioned data measures the difference between two probability distributions; a low KL value means that the partitioned data closely approximates the label distribution of the complete data. *Mean imbalance ratio intra-class for all labels* represents the degree of imbalance within a label, *Mean imbalance ratio inter-class for all labels* represents the degree of imbalance among labels, and *Mean imbalance ratio labelsets for all labels* represents the degree of imbalance among labelsets.
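As a quick sanity check of these definitions, the sketch below computes cardinality, density, and the number of distinct label sets on a toy indicator matrix, together with a KL divergence between two label distributions. The function names are hypothetical, and how `data_properties` computes KL internally is not shown in this notebook, so the KL part is an assumption based on the textbook definition.

```python
import math


def label_statistics(Y):
    """Cardinality, density and distinct label sets of a 0/1 label matrix."""
    num_examples, num_labels = len(Y), len(Y[0])
    cardinality = sum(sum(row) for row in Y) / num_examples
    return {
        "cardinality": cardinality,
        "density": cardinality / num_labels,
        "distinct_label_sets": len({tuple(row) for row in Y}),
    }


def label_distribution(Y):
    """Relative frequency of each label, normalised to sum to one."""
    label_counts = [sum(col) for col in zip(*Y)]
    total = sum(label_counts)
    return [c / total for c in label_counts]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon to avoid division by zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


Y = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [1, 0, 1]]
stats = label_statistics(Y)
kl_train_vs_all = kl_divergence(label_distribution(Y[:3]), label_distribution(Y))
print(stats, kl_train_vs_all)
```

The same arithmetic reproduces the table: 10241 labels over 2417 examples gives a cardinality of about 4.2371, and dividing by 14 labels gives a density of about 0.3026.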

From the table, we observe that both the training and test sets are closely aligned with the complete data in terms of the KL metric. However, the three imbalance ratio metrics indicate that these sets differ from each other. In fact, the test set contains fewer distinct label sets (102) than the complete yeast data (198). This limitation is not addressed by the naive approach.

Let us plot the resulting data in terms of the frequency of examples for each label.

In [9]:
chart

Let's see how the ClassifierChain model performs on the split data.

In [10]:
# train
model.fit(X[training_idx].toarray(), y[training_idx].toarray())

ClassifierChain(base_estimator=LinearSVC(), order='random', random_state=12345)

In [11]:
# evaluate
y_pred = model.predict(X[training_idx].toarray())
score(y_true=y[training_idx], y_pred=y_pred, tag="training set")

y_pred = model.predict(X[test_idx].toarray())
score(y_true=y[test_idx], y_pred=y_pred, tag="test set")

	>> F1-score for training set...
		--> Average sample f1-score: 0.6364
		--> Micro f1-score: 0.6586
		--> Macro f1-score: 0.4053
	>> F1-score for test set...
		--> Average sample f1-score: 0.6198
		--> Micro f1-score: 0.6444
		--> Macro f1-score: 0.3566


The model performs similarly on both sets, even though the two sets differ in their imbalance ratios.

### Iterative approach

This is a modified version of the algorithm from the [paper](https://link.springer.com/chapter/10.1007/978-3-642-23808-6_10), which splits the dataset iteratively. The algorithm starts by calculating the desired number of examples and the proportions for the training and test sets. It then estimates the desired number of examples of each label in each partition. Next, for each example of a selected label, the algorithm chooses the most appropriate partition. Once the partition is selected, the example is added to it, and the desired number of examples is decremented for each label of this example as well as for the partition as a whole.
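The bookkeeping described above can be sketched in plain Python. This follows the general iterative stratification recipe (rarest remaining label first, assign to the partition that still wants that label the most) and is not the modified implementation in `iterative2split`; the name `iterative_split` and the toy data are assumptions.

```python
import random


def iterative_split(label_sets, split_size=0.8, seed=0):
    """Two-way iterative stratification over a list of label-id sets."""
    rng = random.Random(seed)
    ratios = [split_size, 1.0 - split_size]  # training, test
    labels = sorted(set().union(*label_sets))
    # desired number of examples per partition, overall and per label
    desired = [len(label_sets) * r for r in ratios]
    desired_per_label = [
        {l: r * sum(l in s for s in label_sets) for l in labels}
        for r in ratios
    ]
    parts = [[], []]
    unassigned = set(range(len(label_sets)))

    def assign(i, j):
        parts[j].append(i)
        unassigned.discard(i)
        desired[j] -= 1
        for l in label_sets[i]:
            desired_per_label[j][l] -= 1

    while unassigned:
        remaining = {}
        for i in unassigned:
            for l in label_sets[i]:
                remaining.setdefault(l, []).append(i)
        if not remaining:  # only unlabeled examples are left
            for i in sorted(unassigned):
                assign(i, max(range(2), key=lambda k: (desired[k], rng.random())))
            break
        # pick the label with the fewest remaining examples
        label = min(sorted(remaining), key=lambda l: len(remaining[l]))
        for i in sorted(remaining[label]):
            # prefer the partition that needs this label most; break ties by
            # overall need, then at random
            j = max(range(2), key=lambda k: (desired_per_label[k][label],
                                             desired[k], rng.random()))
            assign(i, j)
    return sorted(parts[0]), sorted(parts[1])


label_sets = [{i % 3, 3 + (i // 2) % 4} for i in range(20)]
train_idx, test_idx = iterative_split(label_sets, split_size=0.8)
print(len(train_idx), len(test_idx))
```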

In [12]:
st = IterativeStratification(shuffle=True, split_size=split_size)
training_idx, test_idx = st.fit(y=y)

## Configuration parameters to iteratively stratifying a multi-label
   dataset splitting:
		1. Shuffle the dataset? True
		2. Split size: 0.8


	>> Perform splitting (iterative)...
		--> Splitting progress: 100.00%...

In [13]:
model_name = "iterative2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], 
                            num_tails=5, dataset_name=dsname, model_name=model_name, 
                            rspath=RESULT_PATH, display_dataframe=True, 
                            display_figure=True)
df

Unnamed: 0,Properties for yeast,Complete set,Training set,Test set
0,Number of examples,2417.0,1940.0,477.0
1,Number of labels,10241.0,8182.0,2059.0
2,Label cardinality,4.237071,4.217526,4.316562
3,Label density,0.302648,0.301252,0.308326
4,Distinct labels,14.0,14.0,14.0
5,Distinct label sets,198.0,174.0,109.0
6,Frequency of distinct label sets,0.08192,0.08192,0.08192
7,Mean imbalance ratio intra-class for all labels,9.578575,10.097934,8.090211
8,Mean imbalance ratio inter-class for all labels,7.196811,7.56304,6.156702
9,Mean imbalance ratio labelsets for all labels,0.468229,0.602303,1.985403


In [14]:
chart

As can be observed, this algorithm produces less optimal results (KL metric) than the naive approach.

In [15]:
# train
model.fit(X[training_idx].toarray(), y[training_idx].toarray())
# evaluate
y_pred = model.predict(X[training_idx].toarray())
score(y_true=y[training_idx], y_pred=y_pred, tag="training set")

y_pred = model.predict(X[test_idx].toarray())
score(y_true=y[test_idx], y_pred=y_pred, tag="test set")

	>> F1-score for training set...
		--> Average sample f1-score: 0.6435
		--> Micro f1-score: 0.6637
		--> Macro f1-score: 0.3979
	>> F1-score for test set...
		--> Average sample f1-score: 0.5976
		--> Micro f1-score: 0.6209
		--> Macro f1-score: 0.3545


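The results suggest that this approach is less optimal than the naive approach: the average sample F1-score on the test set is 0.5976, compared with 0.6198 for the naive split.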

### Stratifying XML data approach

Stratifying XML, proposed in the [paper](https://arxiv.org/abs/2103.03494), starts by randomly allocating each example to the training or test set according to the `split_size` parameter. It then performs iterative stratification for `num_epochs` rounds. At the beginning, the algorithm computes the number of examples of each label assigned to each partition. Then, at each iteration, it computes a score for each label describing how far the label's training proportion diverges from the `split_size` parameter. Next, it calculates a score for each example from the scores of its labels for the training and test sets. A high example score means that many of its labels have too many examples in its current partition, so that example should preferably be moved to the other partition. This process iterates for `num_epochs` rounds, after which the algorithm terminates and returns the resulting partitions.
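A heavily simplified sketch of this scoring-and-swapping loop is given below. It keeps only `split_size`, `num_epochs`, and `swap_probability`; the `threshold_proportion` and `decay` hyper-parameters of `ExtremeStratification` are left out, and the scoring rule is an assumption chosen for readability rather than the one used in the paper or the package.

```python
import random


def swap_refine(label_sets, split_size=0.8, num_epochs=50,
                swap_probability=0.1, seed=0):
    """Random allocation followed by score-driven moves between partitions."""
    rng = random.Random(seed)
    n = len(label_sets)
    labels = sorted(set().union(*label_sets))
    totals = {l: sum(l in s for s in label_sets) for l in labels}
    # random initial allocation according to split_size
    in_train = [rng.random() < split_size for _ in range(n)]

    def label_scores():
        """Positive score: the label is over-represented in training."""
        train_counts = {l: 0 for l in labels}
        for i, s in enumerate(label_sets):
            if in_train[i]:
                for l in s:
                    train_counts[l] += 1
        return {l: train_counts[l] / totals[l] - split_size for l in labels}

    for _ in range(num_epochs):
        scores = label_scores()
        for i, s in enumerate(label_sets):
            total = sum(scores[l] for l in s)
            # a training example should leave when its labels are
            # over-represented in training; a test example when they are
            # under-represented in training
            example_score = total if in_train[i] else -total
            if example_score > 0 and rng.random() < swap_probability:
                in_train[i] = not in_train[i]
    train = [i for i in range(n) if in_train[i]]
    test = [i for i in range(n) if not in_train[i]]
    return train, test


label_sets = [{i % 3, 3 + (i // 2) % 4} for i in range(40)]
train_idx, test_idx = swap_refine(label_sets, split_size=0.8)
print(len(train_idx), len(test_idx))
```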

Let's apply the algorithm with default settings.

In [16]:
st = ExtremeStratification(swap_probability=0.1, threshold_proportion=0.1, decay=0.1,
                           shuffle=True, split_size=split_size, num_epochs=num_epochs)
training_idx, test_idx = st.fit(X=X, y=y)

## Configuration parameters to stratifying a large scale multi-label
   dataset splitting:
		1. A hyper-parameter for extreme stratification: 0.1
		2. A hyper-parameter for extreme stratification: 0.1
		3. A hyper-parameter for extreme stratification: 0.1
		4. Shuffle the dataset? True
		5. Split size: 0.8
		6. Number of loops over a dataset: 50


	>> Perform splitting (extreme)...
		--> Starting score: 96
		--> Splitting progress: 100.00%; score: -2.19


In [17]:
model_name = "extreme2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], 
                            num_tails=5, dataset_name=dsname, model_name=model_name, 
                            rspath=RESULT_PATH, display_dataframe=True, 
                            display_figure=True)
df

| | Properties for yeast | Complete set | Training set | Test set |
|---|---|---|---|---|
| 0 | Number of examples | 2417.0 | 1931.0 | 486.0 |
| 1 | Number of labels | 10241.0 | 8142.0 | 2099.0 |
| 2 | Label cardinality | 4.237071 | 4.216468 | 4.31893 |
| 3 | Label density | 0.302648 | 0.301176 | 0.308495 |
| 4 | Distinct labels | 14.0 | 14.0 | 14.0 |
| 5 | Distinct label sets | 198.0 | 177.0 | 110.0 |
| 6 | Frequency of distinct label sets | 0.08192 | 0.08192 | 0.08192 |
| 7 | Mean imbalance ratio intra-class for all labels | 9.578575 | 10.479014 | 7.77096 |
| 8 | Mean imbalance ratio inter-class for all labels | 7.196811 | 7.86333 | 5.868194 |
| 9 | Mean imbalance ratio labelsets for all labels | 0.468229 | 0.371309 | 0.690851 |


Again, we train the chain classifier using the resulting training and test sets.

In [18]:
chart

In [19]:
# train
model.fit(X[training_idx].toarray(), y[training_idx].toarray())
# evaluate
y_pred = model.predict(X[training_idx].toarray())
score(y_true=y[training_idx], y_pred=y_pred, tag="training set")

y_pred = model.predict(X[test_idx].toarray())
score(y_true=y[test_idx], y_pred=y_pred, tag="test set")

	>> F1-score for training set...
		--> Average sample f1-score: 0.6366
		--> Micro f1-score: 0.6569
		--> Macro f1-score: 0.3870
	>> F1-score for test set...
		--> Average sample f1-score: 0.6161
		--> Micro f1-score: 0.6356
		--> Macro f1-score: 0.3630


The results show better test-set performance than the iterative approach (average sample F1-score of 0.6161 versus 0.5976), with slightly more distinct label sets in the test data (110 versus 109).

### Clustering based strategy

This approach is inspired by [Partial Least Squares (PLS)](https://www.sciencedirect.com/science/article/abs/pii/S0169743901001551): it finds the directions of maximum cross-covariance between the explanatory and the response variables using [Singular Value Decomposition (SVD)](https://epubs.siam.org/doi/abs/10.1137/s0895479896305696). The explanatory variables are then projected onto the left singular vectors to obtain a low-dimensional expression matrix, which is subsequently fed into the [K-means clustering algorithm](https://web.cse.msu.edu/~cse802/notes/ConstrainedKmeans.pdf). The clusters are used to remap the response variables to their nearest centroids. Finally, any of the Naive, Iterative, or Stratifying XML algorithms can be applied to perform the split. This approach aims to improve the partitioning while preserving correlations between the response and attribute variables.
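The projection step can be illustrated with NumPy alone. The sketch assumes dense, mean-centred matrices and a hypothetical helper name `plssvd_projection`; it stops before the K-means and remapping steps and is not the code in `plssvd2split`.

```python
import numpy as np


def plssvd_projection(X, Y, num_components=2):
    """Project X onto the leading left singular vectors of cov(X, Y)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # cross-covariance between features and labels (up to a constant factor)
    C = Xc.T @ Yc
    U, singular_values, Vt = np.linalg.svd(C, full_matrices=False)
    U = U[:, :num_components]
    # low-dimensional representation of the examples
    return Xc @ U, U


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))                        # toy feature matrix
Y = (rng.random(size=(30, 4)) < 0.3).astype(float)  # toy label matrix
Z, U = plssvd_projection(X, Y, num_components=2)
print(Z.shape)
```

The rows of `Z` are what a clustering algorithm such as K-means would receive in the next step.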

In [20]:
st = ClusterStratification(num_clusters=num_clusters, swap_probability=0.1, threshold_proportion=0.1,
                           decay=0.1, shuffle=True, split_size=split_size, batch_size=batch_size,
                           num_epochs=num_epochs, lr=0.0001, num_jobs=num_jobs)
training_idx, test_idx = st.fit(X=X, y=y)

## Configuration parameters to stratifying a multi-label dataset splitting
   based on clustering the covariance of X and y using PLSSVD:
		1. Number of clusters to form: 5
		2. A hyper-parameter: 0.1
		3. A hyper-parameter: 0.1
		4. A hyper-parameter: 0.1
		5. Shuffle the dataset? True
		6. Split size: 0.8
		7. Number of examples to use in each iteration: 500
		8. Number of loops over training set: 50
		9. Learning rate: 0.0001
		10. Number of parallel workers: 2


	>> Computing the covariance of X and y using PLSSVD: 100.00%...
	>> Projecting examples onto the obtained low dimensional U orthonormal basis...
	>> Clustering the resulted low dimensional examples...
	>> Perform splitting (extreme)...
		--> Starting score: -21
		--> Splitting progress: 100.00%; score: -5.23


Let's inspect the resulting training and test sets and then train the chain classifier on them.

In [21]:
model_name = "plssvd2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], 
                            num_tails=5, dataset_name=dsname, model_name=model_name, 
                            rspath=RESULT_PATH, display_dataframe=True, 
                            display_figure=True)
df

Unnamed: 0,Properties for yeast,Complete set,Training set,Test set
0,Number of examples,2417.0,1924.0,493.0
1,Number of labels,10241.0,8095.0,2146.0
2,Label cardinality,4.237071,4.20738,4.352941
3,Label density,0.302648,0.300527,0.310924
4,Distinct labels,14.0,14.0,14.0
5,Distinct label sets,198.0,181.0,105.0
6,Frequency of distinct label sets,0.08192,0.08192,0.08192
7,Mean imbalance ratio intra-class for all labels,9.578575,10.022388,8.368533
8,Mean imbalance ratio inter-class for all labels,7.196811,7.45949,6.51829
9,Mean imbalance ratio labelsets for all labels,0.468229,0.550249,2.050292


In [22]:
chart

In [23]:
# train
model.fit(X[training_idx].toarray(), y[training_idx].toarray())
# evaluate
y_pred = model.predict(X[training_idx].toarray())
score(y_true=y[training_idx], y_pred=y_pred, tag="training set")

y_pred = model.predict(X[test_idx].toarray())
score(y_true=y[test_idx], y_pred=y_pred, tag="test set")

	>> F1-score for training set...
		--> Average sample f1-score: 0.6330
		--> Micro f1-score: 0.6542
		--> Macro f1-score: 0.3901
	>> F1-score for test set...
		--> Average sample f1-score: 0.6307
		--> Micro f1-score: 0.6470
		--> Macro f1-score: 0.3740


The test-set results improve on those of Stratifying XML (average sample F1-score of 0.6307 versus 0.6161).

### Clustering eigenvalues based approach

Similar to the above method, this algorithm builds an adjacency matrix from the response variables and then decomposes this matrix to extract eigenvectors, which are subsequently fed into the [K-means clustering algorithm](https://web.cse.msu.edu/~cse802/notes/ConstrainedKmeans.pdf). The clusters are then used to remap the response variables to their nearest centroids. As with the previous approach, any of the Naive, Iterative, or Stratifying XML algorithms can be applied to perform the split.
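One way to picture the decomposition step, assuming the adjacency matrix is a label co-occurrence matrix (the notebook does not spell out how `eigencluster2split` builds it, so this is a guess), is the following NumPy sketch with a hypothetical helper `label_eigenvectors`:

```python
import numpy as np


def label_eigenvectors(Y, num_components=2):
    """Leading eigenvectors of a label co-occurrence adjacency matrix."""
    Y = np.asarray(Y, dtype=float)
    A = Y.T @ Y                 # A[a, b] counts examples tagged with a and b
    np.fill_diagonal(A, 0.0)    # no self-loops
    eigenvalues, eigenvectors = np.linalg.eigh(A)  # A is symmetric
    order = np.argsort(eigenvalues)[::-1][:num_components]
    return A, eigenvalues[order], eigenvectors[:, order]


Y = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1],
              [1, 0, 0, 0]])
A, values, vectors = label_eigenvectors(Y, num_components=2)
print(vectors.shape)
```

Each row of `vectors` describes one label in the reduced space, which is the kind of input a clustering step could then group.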

In [24]:
st = ClusteringEigenStratification(num_subsamples=10000, num_clusters=num_clusters, sigma=2, 
                                   swap_probability=0.1, threshold_proportion=0.1, 
                                   decay=0.1, shuffle=True, split_size=split_size,
                                   batch_size=batch_size, num_epochs=num_epochs, 
                                   num_jobs=num_jobs)
training_idx, test_idx = st.fit(X=X, y=y)

## Configuration parameters to stratifying a multi-label dataset splitting
   based on clustering eigen values of the label adjacency matrix:
		1. Subsampling input size: 10000
		2. Number of communities: 5
		3. Constant that scales the amount of laplacian norm regularization: 2
		4. A hyper-parameter: 0.1
		5. A hyper-parameter: 0.1
		6. A hyper-parameter: 0.1
		7. Shuffle the dataset? True
		8. Split size: 0.8
		9. Number of examples to use in each iteration: 500
		10. Number of loops over training set: 50
		11. Number of parallel workers: 2


	>> Extracting clusters...
	>> Perform splitting (extreme)...
		--> Starting score: 359
		--> Splitting progress: 100.00%; score: 136.40


In [25]:
model_name = "eigencluster2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], 
                            num_tails=5, dataset_name=dsname, model_name=model_name, 
                            rspath=RESULT_PATH, display_dataframe=True, 
                            display_figure=True)
df

Unnamed: 0,Properties for yeast,Complete set,Training set,Test set
0,Number of examples,2417.0,1954.0,463.0
1,Number of labels,10241.0,8288.0,1953.0
2,Label cardinality,4.237071,4.241556,4.218143
3,Label density,0.302648,0.302968,0.301296
4,Distinct labels,14.0,14.0,14.0
5,Distinct label sets,198.0,179.0,109.0
6,Frequency of distinct label sets,0.08192,0.08192,0.08192
7,Mean imbalance ratio intra-class for all labels,9.578575,9.512279,10.007766
8,Mean imbalance ratio inter-class for all labels,7.196811,7.160984,7.457191
9,Mean imbalance ratio labelsets for all labels,0.468229,0.377509,1.042197


In [26]:
chart

As usual, we train the chain classifier using the resulting training and test sets.

In [27]:
# train
model.fit(X[training_idx].toarray(), y[training_idx].toarray())
# evaluate
y_pred = model.predict(X[training_idx].toarray())
score(y_true=y[training_idx], y_pred=y_pred, tag="training set")

y_pred = model.predict(X[test_idx].toarray())
score(y_true=y[test_idx], y_pred=y_pred, tag="test set")

	>> F1-score for training set...
		--> Average sample f1-score: 0.6392
		--> Micro f1-score: 0.6600
		--> Macro f1-score: 0.4021
	>> F1-score for test set...
		--> Average sample f1-score: 0.6212
		--> Micro f1-score: 0.6398
		--> Macro f1-score: 0.3764


The test-set results improve slightly on those of Stratifying XML (average sample F1-score of 0.6212 versus 0.6161).

### Community based splitting strategy

Inspired by [community detection](https://www.pnas.org/content/99/12/7821.short), this approach identifies communities over a graph constructed from the response variables. The response variables are then remapped to their communities, after which any of the Naive, Iterative, or Stratifying XML algorithms can be applied to perform the split.
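To make the remapping step concrete, the sketch below builds a label co-occurrence graph and groups labels by connected components. Connected components are used here only as a simple stand-in for a real community detection algorithm, and all function names are hypothetical; `comm2split` itself may build the graph and detect communities differently.

```python
from itertools import combinations


def cooccurrence_edges(label_sets, min_count=1):
    """Edges between labels that appear together at least `min_count` times."""
    pair_counts = {}
    for s in label_sets:
        for a, b in combinations(sorted(s), 2):
            pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1
    return {edge for edge, c in pair_counts.items() if c >= min_count}


def connected_components(nodes, edges):
    """Map each node to a component id (stand-in for community detection)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for v in nodes:
        groups.setdefault(find(v), []).append(v)
    # number the components by their smallest member for stable ids
    ordered = sorted(groups.values(), key=min)
    return {v: cid for cid, members in enumerate(ordered) for v in members}


def remap(label_sets, community_of):
    """Replace every label by the id of its community."""
    return [{community_of[l] for l in s} for s in label_sets]


label_sets = [{0, 1}, {1, 2}, {3, 4}, {4}, {0, 2}, {5}]
labels = sorted(set().union(*label_sets))
community_of = connected_components(labels, cooccurrence_edges(label_sets))
remapped = remap(label_sets, community_of)
print(remapped)
```

After remapping, the splitter works with far fewer distinct targets (three communities instead of six labels in this toy case).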

In [28]:
st = CommunityStratification(num_subsamples=20000, num_communities=num_clusters, sigma=2, 
                             swap_probability=0.1, threshold_proportion=0.1, decay=0.1, 
                             shuffle=True, split_size=split_size, batch_size=batch_size, 
                             num_epochs=num_epochs, num_jobs=num_jobs)
training_idx, test_idx = st.fit(X=X, y=y)

## Configuration parameters to stratifying a multi-label dataset splitting
   based on community detection approach:
		1. Subsampling input size: 20000
		2. Number of communities: 5
		3. Constant that scales the amount of laplacian norm regularization: 2
		4. A hyper-parameter: 0.1
		5. A hyper-parameter: 0.1
		6. A hyper-parameter: 0.1
		7. Shuffle the dataset? True
		8. Split size: 0.8
		9. Number of examples to use in each iteration: 500
		10. Number of loops over training set: 50
		11. Number of parallel workers: 2


	>> Building Graph...
	>> Perform splitting (extreme)...
		--> Starting score: 72
		--> Splitting progress: 100.00%; score: 15.08


In [29]:
model_name = "comm2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], 
                            num_tails=5, dataset_name=dsname, model_name=model_name, 
                            rspath=RESULT_PATH, display_dataframe=True, 
                            display_figure=True)
df

| | Properties for yeast | Complete set | Training set | Test set |
|---|---|---|---|---|
| 0 | Number of examples | 2417.0 | 1937.0 | 480.0 |
| 1 | Number of labels | 10241.0 | 8250.0 | 1991.0 |
| 2 | Label cardinality | 4.237071 | 4.259164 | 4.147917 |
| 3 | Label density | 0.302648 | 0.304226 | 0.29628 |
| 4 | Distinct labels | 14.0 | 14.0 | 14.0 |
| 5 | Distinct label sets | 198.0 | 187.0 | 92.0 |
| 6 | Frequency of distinct label sets | 0.08192 | 0.08192 | 0.08192 |
| 7 | Mean imbalance ratio intra-class for all labels | 9.578575 | 9.092897 | 12.569147 |
| 8 | Mean imbalance ratio inter-class for all labels | 7.196811 | 6.802069 | 9.61016 |
| 9 | Mean imbalance ratio labelsets for all labels | 0.468229 | 0.883835 | 0.677507 |


In [30]:
chart

As we did before, let's train the chain classifier.

In [31]:
# train
model.fit(X[training_idx].toarray(), y[training_idx].toarray())
# evaluate
y_pred = model.predict(X[training_idx].toarray())
score(y_true=y[training_idx], y_pred=y_pred, tag="training set")

y_pred = model.predict(X[test_idx].toarray())
score(y_true=y[test_idx], y_pred=y_pred, tag="test set")

	>> F1-score for training set...
		--> Average sample f1-score: 0.6372
		--> Micro f1-score: 0.6596
		--> Macro f1-score: 0.4107
	>> F1-score for test set...
		--> Average sample f1-score: 0.6162
		--> Micro f1-score: 0.6350
		--> Macro f1-score: 0.3542


### Label enhancement based strategy

Label enhancement is the process of recovering label distributions from logical (given) labels ([paper](https://crad.ict.ac.cn/EN/Y2017/V54/I6/1171)). This idea can be used to perform splitting. In particular, this algorithm uses an [iterative label propagation technique](https://ieeexplore.ieee.org/abstract/document/7373329) to recover the label distributions. We then apply the community detection method, as in the previous approach, to reassign response labels to their associated communities. Afterward, one of the Naive, Iterative, or Stratifying XML algorithms is applied to perform the split.
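A generic form of iterative label propagation can be sketched with NumPy as below. The update rule `F = alpha * P @ F + (1 - alpha) * Y` and the Gaussian similarity with width `sigma` are the textbook formulation and are assumptions here; the roles that `sigma` and `alpha` play inside `enhance2split` may differ, and `enhance_labels` is a hypothetical name.

```python
import numpy as np


def enhance_labels(X, Y, sigma=2.0, alpha=0.2, num_iters=50):
    """Recover soft label distributions by propagating labels over a graph."""
    # Gaussian similarity between examples, without self-similarity
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # symmetric normalisation P = D^{-1/2} W D^{-1/2}
    inv_sqrt_degree = 1.0 / np.sqrt(W.sum(axis=1))
    P = W * inv_sqrt_degree[:, None] * inv_sqrt_degree[None, :]
    F = Y.astype(float).copy()
    for _ in range(num_iters):
        # alpha weighs propagated information against the given labels
        F = alpha * (P @ F) + (1.0 - alpha) * Y
    # normalise each row into a distribution over labels
    return F / F.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                   # toy features
Y = np.eye(3)[rng.integers(0, 3, size=20)]     # one given label per example
D = enhance_labels(X, Y, sigma=2.0, alpha=0.2)
print(D[:3].round(3))
```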

In [32]:
st = LabelEnhancementStratification(num_subsamples=10000, num_communities=num_clusters, 
                                    sigma=2, alpha=0.2, swap_probability=0.1, 
                                    threshold_proportion=0.1, decay=0.1, shuffle=True, 
                                    split_size=split_size, batch_size=batch_size, 
                                    num_epochs=num_epochs, num_jobs=num_jobs)
training_idx, test_idx = st.fit(X=X, y=y)

## Configuration parameters to stratifying a multi-label dataset splitting
   based on label enhancement approach:
		1. Subsampling input size: 10000
		2. Number of communities: 5
		3. Constant that scales the amount of laplacian norm regularization: 2
		4. A hyperparameter to balancing parameterwhich controls the fraction of the information inherited from the label propagation and the label matrix.: 0.2
		5. A hyper-parameter: 0.1
		6. A hyper-parameter: 0.1
		7. A hyper-parameter: 0.1
		8. Shuffle the dataset? True
		9. Split size: 0.8
		10. Number of examples to use in each iteration: 500
		11. Number of loops over training set: 50
		12. Number of parallel workers: 2


	>> Building Graph...
	>> Perform splitting (extreme)...
		--> Starting score: 28
		--> Splitting progress: 100.00%; score: -0.40


In [33]:
model_name = "enhance2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], 
                            num_tails=5, dataset_name=dsname, model_name=model_name, 
                            rspath=RESULT_PATH, display_dataframe=True, 
                            display_figure=True)
df

| | Properties for yeast | Complete set | Training set | Test set |
|---|---|---|---|---|
| 0 | Number of examples | 2417.0 | 1910.0 | 507.0 |
| 1 | Number of labels | 10241.0 | 8064.0 | 2177.0 |
| 2 | Label cardinality | 4.237071 | 4.22199 | 4.293886 |
| 3 | Label density | 0.302648 | 0.301571 | 0.306706 |
| 4 | Distinct labels | 14.0 | 14.0 | 14.0 |
| 5 | Distinct label sets | 198.0 | 174.0 | 108.0 |
| 6 | Frequency of distinct label sets | 0.08192 | 0.08192 | 0.08192 |
| 7 | Mean imbalance ratio intra-class for all labels | 9.578575 | 9.552217 | 9.717836 |
| 8 | Mean imbalance ratio inter-class for all labels | 7.196811 | 7.096647 | 7.60943 |
| 9 | Mean imbalance ratio labelsets for all labels | 0.468229 | 0.537607 | 1.70713 |


In [34]:
chart

Let's train the chain classifier to evaluate the performance of partitions.

In [35]:
# train
model.fit(X[training_idx].toarray(), y[training_idx].toarray())
# evaluate
y_pred = model.predict(X[training_idx].toarray())
score(y_true=y[training_idx], y_pred=y_pred, tag="training set")

y_pred = model.predict(X[test_idx].toarray())
score(y_true=y[test_idx], y_pred=y_pred, tag="test set")

	>> F1-score for training set...
		--> Average sample f1-score: 0.6332
		--> Micro f1-score: 0.6535
		--> Macro f1-score: 0.3919
	>> F1-score for test set...
		--> Average sample f1-score: 0.6251
		--> Micro f1-score: 0.6506
		--> Macro f1-score: 0.3823


The test-set results improve on those of Stratifying XML (average sample F1-score of 0.6251 versus 0.6161).

### Active learning based splitting strategy

The subsampling approach aims to select the most informative examples using an appropriate acquisition function, such as entropy, and then train on the selected examples. An extended version was introduced in the [paper](https://www.biorxiv.org/content/10.1101/2020.09.14.297424v1) and integrated into the [leADS](https://github.com/hallamlab/leADS) software. Here, the aim is to measure the uncertainty of each example over its labels using either entropy or [normalized propensity scored precision (nPSP)](https://www.biorxiv.org/content/10.1101/2020.09.14.297424v1), and then use these scores to calibrate the example scores in Stratifying XML based splitting.
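As a small illustration of an entropy-based acquisition function, the sketch below sums the binary entropy of each predicted label probability to score an example. This is the generic definition, not necessarily the exact formula used in leADS or `active2split`, and the helper names and probabilities are made up for the example.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def example_uncertainty(label_probs):
    """Total predictive uncertainty of one example over all of its labels."""
    return sum(binary_entropy(p) for p in label_probs)


# hypothetical predicted label probabilities for three examples
probs = [[0.5, 0.5, 0.5],     # maximally uncertain on every label
         [0.99, 0.01, 0.9],   # fairly confident
         [0.5, 1.0, 0.0]]     # uncertain on a single label
scores = [example_uncertainty(row) for row in probs]
# most informative examples first
ranking = sorted(range(len(probs)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Examples at the top of `ranking` are the ones an uncertainty-driven splitter would treat as most informative.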

In [36]:
st = ActiveStratification(subsample_labels_size=10, acquisition_type="entropy", 
                          top_k=5, calc_ads=False, ads_percent=0.7, 
                          use_solver=use_solver, loss_function="hinge", 
                          swap_probability=0.1, threshold_proportion=0.1, decay=0.1, 
                          penalty='elasticnet', alpha_elastic=0.0001, l1_ratio=0.65, 
                          alpha_l21=0.01, loss_threshold=0.05, shuffle=True,
                          split_size=split_size, batch_size=batch_size, num_epochs=num_epochs, 
                          lr=1e-3, display_interval=1, num_jobs=num_jobs)
training_idx, test_idx = st.fit(X=X, y=y)

## Configuration parameters to estimating examples predictive uncertainty
   scores to group example with high informativeness into training set
   using a modified approach to splitting an extreme large scale multi-
   label dataset:
		1. Subsampling labels: 10
		2. The acquisition function for estimating the predictive uncertainty: entropy
		3. Apply sklearn optimizers? False
		4. The loss function: hinge
		5. A hyper-parameter for extreme stratification: 0.1
		6. A hyper-parameter for extreme stratification: 0.1
		7. A hyper-parameter for extreme stratification: 0.1
		8. The penalty (aka regularization term): elasticnet
		9. Constant controlling the elastic term: 0.0001
		10. The elastic net mixing parameter: 0.65
		11. A cutoff threshold between two consecutive rounds: 0.05
		12. Shuffle the dataset? True
		13. Split size: 0.8
		14. Number of examples to use in each iteration: 500
		15. Number of loops over training set: 50
		16. Learning rate: 0.001
		17. How often to evaluate? 1


	>> Training to learn a model...
	   1)- Epoch count (1/50)...
  		<<<------------<<<------------<<<
  		>> Feed-Backward...
			--> Optimizing Theta: 100.00%...
  		>>>------------>>>------------>>>
  		>> Feed-Forward...
  		>> Predictive uncertainty using entropy...0000%...
  		>> Compute cost...
			--> Calculating cost: 100.00%...
			--> New cost: 0.8492; Old cost: inf
			--> Epoch 1 took 0.232 seconds...
	   2)- Epoch count (2/50)...
  		<<<------------<<<------------<<<
  		>> Feed-Backward...
			--> Optimizing Theta: 100.00%...
  		>>>------------>>>------------>>>
  		>> Feed-Forward...
  		>> Predictive uncertainty using entropy...
  		>> Compute cost...
			--> Calculating cost: 100.00%...
			--> New cost: 0.8542; Old cost: 0.8492
			--> Epoch 2 took 0.272 seconds...
	   3)- Epoch count (3/50)...
  		<<<------------<<<------------<<<
  		>> Feed-Backward...
			--> Optimizing Theta: 100.00%...
  		>>>------------>>>------------>>>
  		>> Feed-Forward...
  		>> Predictive uncerta

  		>> Predictive uncertainty using entropy...
  		>> Compute cost...
			--> Calculating cost: 100.00%...
			--> New cost: 0.8579; Old cost: 0.8489
			--> Epoch 22 took 0.245 seconds...
	   23)- Epoch count (23/50)...
  		<<<------------<<<------------<<<
  		>> Feed-Backward...
			--> Optimizing Theta: 100.00%...
  		>>>------------>>>------------>>>
  		>> Feed-Forward...
  		>> Predictive uncertainty using entropy...
  		>> Compute cost...
			--> Calculating cost: 100.00%...
			--> New cost: 0.8586; Old cost: 0.8489
			--> Epoch 23 took 0.312 seconds...
	   24)- Epoch count (24/50)...
  		<<<------------<<<------------<<<
  		>> Feed-Backward...
			--> Optimizing Theta: 100.00%...
  		>>>------------>>>------------>>>
  		>> Feed-Forward...
  		>> Predictive uncertainty using entropy...0000%...
  		>> Compute cost...
			--> Calculating cost: 100.00%...
			--> New cost: 0.8581; Old cost: 0.8489
			--> Epoch 24 took 0.252 seconds...
	   25)- Epoch count (25/50)...
  		<<<------------<

  		>> Predictive uncertainty using entropy...0000%...
  		>> Compute cost...
			--> Calculating cost: 100.00%...
			--> New cost: 0.8597; Old cost: 0.8489
			--> Epoch 44 took 0.255 seconds...
	   45)- Epoch count (45/50)...
  		<<<------------<<<------------<<<
  		>> Feed-Backward...
			--> Optimizing Theta: 100.00%...
  		>>>------------>>>------------>>>
  		>> Feed-Forward...
  		>> Predictive uncertainty using entropy...0000%...
  		>> Compute cost...
			--> Calculating cost: 100.00%...
			--> New cost: 0.8618; Old cost: 0.8489
			--> Epoch 45 took 0.268 seconds...
	   46)- Epoch count (46/50)...
  		<<<------------<<<------------<<<
  		>> Feed-Backward...
			--> Optimizing Theta: 100.00%...
  		>>>------------>>>------------>>>
  		>> Feed-Forward...
  		>> Predictive uncertainty using entropy...
  		>> Compute cost...
			--> Calculating cost: 100.00%...
			--> New cost: 0.8616; Old cost: 0.8489
			--> Epoch 46 took 0.321 seconds...
	   47)- Epoch count (47/50)...
  		<<<-----

In [37]:
model_name = "active2split"
df, chart = data_properties(y=y, selected_examples=[training_idx, test_idx], 
                            num_tails=5, dataset_name=dsname, model_name=model_name, 
                            rspath=RESULT_PATH, display_dataframe=True, 
                            display_figure=True)
df

| | Properties for yeast | Complete set | Training set | Test set |
|---|---|---|---|---|
| 0 | Number of examples | 2417.0 | 1925.0 | 492.0 |
| 1 | Number of labels | 10241.0 | 8120.0 | 2121.0 |
| 2 | Label cardinality | 4.237071 | 4.218182 | 4.310976 |
| 3 | Label density | 0.302648 | 0.301299 | 0.307927 |
| 4 | Distinct labels | 14.0 | 14.0 | 14.0 |
| 5 | Distinct label sets | 198.0 | 172.0 | 113.0 |
| 6 | Frequency of distinct label sets | 0.08192 | 0.08192 | 0.08192 |
| 7 | Mean imbalance ratio intra-class for all labels | 9.578575 | 9.397859 | 11.079869 |
| 8 | Mean imbalance ratio inter-class for all labels | 7.196811 | 7.108199 | 8.107221 |
| 9 | Mean imbalance ratio labelsets for all labels | 0.468229 | 0.525651 | 0.749716 |
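The first rows of the table above can be checked by hand. The reported values are consistent with the usual definitions: label cardinality is the average number of positive labels per example (10241 / 2417 ≈ 4.2371 for the complete set), and label density is the cardinality divided by the number of distinct labels (4.2371 / 14 ≈ 0.3026). The sketch below assumes those definitions; `label_cardinality_density` is a hypothetical helper, not part of `data_properties`, and it expects a dense 0/1 array (call `.toarray()` first on a sparse matrix).

```python
import numpy as np

def label_cardinality_density(y):
    """Return (cardinality, density) for a dense binary label matrix (hypothetical helper)."""
    y = np.asarray(y)
    num_examples, num_labels = y.shape
    # Average number of positive labels per example.
    cardinality = y.sum() / num_examples
    # Cardinality normalized by the size of the label space.
    density = cardinality / num_labels
    return float(cardinality), float(density)

# Toy label matrix: 3 examples x 4 labels, 6 positive assignments in total.
toy_y = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
])
card, dens = label_cardinality_density(toy_y)
print(card, dens)
```

Comparing these two values across the complete, training, and test sets gives a quick check of how well a split preserves the overall label distribution.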


In [38]:
chart

As we did before, we train the chain classifier to evaluate the performance of partitions.

In [39]:
# train
model.fit(X[training_idx].toarray(), y[training_idx].toarray())
# evaluate
y_pred = model.predict(X[training_idx].toarray())
score(y_true=y[training_idx], y_pred=y_pred, tag="training set")

y_pred = model.predict(X[test_idx].toarray())
score(y_true=y[test_idx], y_pred=y_pred, tag="test set")

	>> F1-score for training set...
		--> Average sample f1-score: 0.6465
		--> Micro f1-score: 0.6668
		--> Macro f1-score: 0.3997
	>> F1-score for test set...
		--> Average sample f1-score: 0.5778
		--> Micro f1-score: 0.6015
		--> Macro f1-score: 0.3328


Given the results in the table and the F1-scores, this approach seems to have split the data much better than Stratified XML.

## Next steps

The examples provided here are constrained to specific settings to keep the running time reasonable for this tutorial, so both the splits and the resulting performance may be suboptimal. Therefore, as a next step, you could try to:

- improve the algorithms or analyze results on a larger number of multi-label datasets.
- rerun the algorithms using different configurations. For instance, you could try setting the `split_type` parameter to `"iterative"` or `"naive"`, or use a range of split size values (`split_size` $\in (0,1)$) and document the performance results. If you choose to apply `comm2split.py` or `enhance2split.py`, then **use a small-scale dataset**.
- apply different metrics from the `../model/mlmetrics.py` module to evaluate the results. As can be seen, a single metric cannot fully characterize the performance of the splitters.
- use alternative classification models. For example, try [leADS](https://github.com/hallamlab/leADS).
- explore deep learning or Bayesian approaches to splitting data.
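
As a starting point for the suggestion about using more than one metric, the following sketch reports several multi-label metrics side by side. It uses scikit-learn's metrics rather than `mlmetrics.py` (whose functions are not shown in this tutorial), and the `y_true` and `y_pred` arrays are toy values, not results from any splitter.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, jaccard_score

# Toy ground truth and predictions: 3 examples x 3 labels (illustrative only).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 1]])

report = {
    # Fraction of individual label assignments that are wrong.
    "hamming_loss": hamming_loss(y_true, y_pred),
    # Intersection over union of label sets, averaged over examples.
    "jaccard_samples": jaccard_score(y_true, y_pred, average="samples"),
    # F1 pooled over all label assignments.
    "f1_micro": f1_score(y_true, y_pred, average="micro"),
    # Unweighted mean of per-label F1, sensitive to rare labels.
    "f1_macro": f1_score(y_true, y_pred, average="macro"),
}
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

In practice, `y_true` and `y_pred` would be replaced by `y[test_idx].toarray()` and the chain classifier's predictions, so that each splitter can be compared on the same set of metrics.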