Topic Modeling with Wasserstein Autoencoders
Source code for Nan, F., Ding, R., Nallapati, R., & Xiang, B. (2019, July). Topic Modeling with Wasserstein Autoencoders. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 6345-6381).
- Download or clone the w-lda repo. Denote the repo location as SOURCE_DIR.
- Create a conda environment and install the necessary packages:
  conda create --name w-lda python=3.6 and conda activate w-lda
  pip install mxnet (or mxnet-cu90, depending on your CUDA version), plus the other packages required by the repo.
We provide a script to process the Wikitext-103 dataset. Running it will download the dataset and store the pre-processed data under SOURCE_DIR/data/wikitext-103 (note that the pre-processing may take a while).
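As a quick sanity check once pre-processing finishes, you can list what was written to that directory. The snippet below is a generic sketch that assumes only the path mentioned above; replace the SOURCE_DIR placeholder with your actual repo location.

```python
# Sketch: list the pre-processed Wikitext-103 files and their sizes.
# "SOURCE_DIR" below is a placeholder for your actual repo location.
import os

data_dir = os.path.join("SOURCE_DIR", "data", "wikitext-103")
for name in sorted(os.listdir(data_dir)):
    path = os.path.join(data_dir, name)
    print(f"{name}\t{os.path.getsize(path):,} bytes")
```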
Training the model:
The results are saved under SOURCE_DIR/examples/results. In particular, the top words of the topics are saved in eval_record.p under the keys Top Words and Top Words2. Top Words2 are the top words obtained by ranking the decoder matrix weights; Top Words are the top words based on the decoder output for each topic (the corresponding column of the decoder matrix plus the offset).
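To inspect the saved topics, you can load the pickle file directly. The file name and the two keys come from the description above; the exact results subdirectory for a run and the layout of the stored values (assumed here to be one list of top words per topic) are assumptions for illustration.

```python
# Minimal sketch of inspecting eval_record.p; adjust the path to your run's
# results directory. Only the file name and key names are taken from the
# README; the value layout is an assumption.
import os
import pickle

results_dir = os.path.join("SOURCE_DIR", "examples", "results")  # placeholder path
with open(os.path.join(results_dir, "eval_record.p"), "rb") as f:
    eval_record = pickle.load(f)

for key in ("Top Words", "Top Words2"):
    print(f"--- {key} ---")
    for topic_id, words in enumerate(eval_record[key]):
        print(topic_id, " ".join(map(str, words)))
```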
Note that in order to evaluate NPMI scores, a separate server process needs to run
npmi_calc.py, which requires the dictionary and inverted-index files for the Wikipedia corpus. We do not currently provide these files, so the NPMI scores are set to 0.
However, readers can refer to other open-source packages for NPMI evaluation.
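As a rough guide to what NPMI coherence measures (this is not a sketch of npmi_calc.py), the snippet below averages the normalized pointwise mutual information over all pairs of a topic's top words, given document-frequency counts from a reference corpus such as Wikipedia. The function and variable names are illustrative and not part of the repo.

```python
# Sketch of NPMI-based topic coherence. Inputs: df[w] = number of reference
# documents containing word w, df_pair[(wi, wj)] = number containing both
# (keys stored as sorted tuples), n_docs = total number of reference documents.
from itertools import combinations
from math import log

def topic_npmi(top_words, df, df_pair, n_docs, eps=1e-12):
    """Average NPMI over all pairs of a topic's top words."""
    scores = []
    for wi, wj in combinations(top_words, 2):
        p_i = df.get(wi, 0) / n_docs
        p_j = df.get(wj, 0) / n_docs
        p_ij = df_pair.get(tuple(sorted((wi, wj))), 0) / n_docs
        if p_i <= 0 or p_j <= 0 or p_ij <= 0:
            scores.append(-1.0)  # one common convention for unseen pairs
            continue
        pmi = log(p_ij / (p_i * p_j))
        scores.append(pmi / (-log(p_ij) + eps))  # normalize PMI into [-1, 1]
    return sum(scores) / len(scores) if scores else 0.0
```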
This project is licensed under the Apache-2.0 License.