OTSeq2Set: An Optimal Transport Enhanced Sequence-to-Set Model for Extreme Multi-label Text Classification
torch==1.9.0
torchtext==0.10.0
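Assuming dependencies are managed with pip (the repository may not ship a requirements file, so this is a sketch), the pins above can be placed in a requirements.txt:

```
torch==1.9.0
torchtext==0.10.0
```

and installed with `pip install -r requirements.txt`.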
OTSeq2Set uses the same datasets as AttentionXML; please download each dataset from the following links.
The GloVe embedding (840B, 300d) in gensim format is provided by AttentionXML here.
For Wiki10-31K and AmazonCat-13K, the label vocabulary is downloaded from The Extreme Classification Repository.
We provide the four datasets, compressed together with the label vocabularies and the GloVe embedding, here.
The structure of the dataset should be:
OTSeq2Set
|-- config
|-- data
| |-- Eurlex
| |-- AmazonCat-13K
| |-- Amazon-670K
| |-- Wiki10-31K
| |-- glove.840B.300d.gensim
| |-- glove.840B.300d.gensim.vectors.npy
|-- OTSeq2Set.ipynb
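As a convenience, the layout above can be verified before training. This is a hypothetical helper, not part of the repository; the path names mirror the tree shown here.

```python
from pathlib import Path

# Paths expected under the OTSeq2Set root, as listed in the tree above.
EXPECTED = [
    "config",
    "data/Eurlex",
    "data/AmazonCat-13K",
    "data/Amazon-670K",
    "data/Wiki10-31K",
    "data/glove.840B.300d.gensim",
    "data/glove.840B.300d.gensim.vectors.npy",
]

def missing_paths(root="."):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]

if __name__ == "__main__":
    for p in missing_paths():
        print(f"missing: {p}")
```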
The file config/OTSeq2Set.json contains the OTSeq2Set configuration whose results are reported in the paper.
config/baselines.json contains the configuration of baseline models.
Description of configuration:
- dl_conv : whether to use lightweight convolution
- lambda_embedding : the weight lambda of the semantic optimal transport distance
- finish : whether training has finished; set to true to skip training this model
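The keys above come from this README; the surrounding structure of config/OTSeq2Set.json is an assumption, so the fragment below is illustrative only (the lambda_embedding value shown is made up, not from the paper):

```python
import json

# Illustrative config fragment; only the key names are taken from the README.
example_json = """{
  "dl_conv": true,
  "lambda_embedding": 0.1,
  "finish": false
}"""
cfg = json.loads(example_json)

def should_train(cfg):
    """Train only when the config is not marked as finished."""
    return not cfg.get("finish", False)

print(should_train(cfg))  # -> True: "finish" is false, so training runs
```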
Run OTSeq2Set.ipynb to train and evaluate the model.