BobXWu/TopMost

TopMost

TopMost covers the complete lifecycle of topic modeling: datasets, preprocessing, models, training, and evaluation. It supports the most popular topic modeling scenarios, including basic, dynamic, hierarchical, and cross-lingual topic modeling.


If you use our toolkit, please cite it as:
@article{wu2023topmost,
    title={Towards the TopMost: A Topic Modeling System Toolkit},
    author={Wu, Xiaobao and Pan, Fengjun and Luu, Anh Tuan},
    journal={arXiv preprint arXiv:2309.06908},
    year={2023}
}

@article{wu2023survey,
    title={A Survey on Neural Topic Models: Methods, Applications, and Challenges},
    author={Wu, Xiaobao and Nguyen, Thong and Luu, Anh Tuan},
    journal={Artificial Intelligence Review},
    url={https://doi.org/10.1007/s10462-023-10661-7},
    year={2024},
    publisher={Springer}
}

Overview

TopMost offers the following topic modeling scenarios with models, evaluation metrics, and datasets:

| Scenario | Model | Evaluation Metric | Datasets |
| --- | --- | --- | --- |
| Basic Topic Modeling | | TC, TD, Clustering, Classification | 20NG, IMDB, NeurIPS, ACL, NYT, Wikitext-103 |
| Hierarchical Topic Modeling | HDP, SawETM, HyperMiner, ProGBN, TraCo | TC over levels, TD over levels, Clustering over levels, Classification over levels | 20NG, IMDB, NeurIPS, ACL, NYT, Wikitext-103 |
| Dynamic Topic Modeling | DTM, DETM | TC over time slices, TD over time slices, Clustering, Classification | NeurIPS, ACL, NYT |
| Cross-lingual Topic Modeling | NMTM, InfoCTM | TC (CNPMI), TD over languages, Classification (Intra and Cross-lingual) | ECNews, Amazon Review Rakuten |

Quick Start

Install TopMost

Install TopMost with pip:

$ pip install topmost

Discover topics from your own datasets

Training returns two results: the top words of the discovered topics (topic_top_words) and the topic distributions of the documents (doc_topic_dist). The preprocessing steps are configurable; see our documentation.

import topmost
from topmost.preprocessing import Preprocessing

# Your own documents
docs = [
    "This is a document about space, including words like space, satellite, launch, orbit.",
    "This is a document about Microsoft Windows, including words like windows, files, dos.",
    # more documents...
]

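device = 'cuda' # or 'cpu'

# Build a dataset from the raw documents with the default preprocessing settings
preprocessing = Preprocessing()
dataset = topmost.data.RawDatasetHandler(docs, preprocessing, device=device, as_tensor=True)

# Create a ProdLDA model with two topics and move it to the chosen device
model = topmost.models.ProdLDA(dataset.vocab_size, num_topics=2)
model = model.to(device)

trainer = topmost.trainers.BasicTrainer(model)

# Train the model, then get the top words per topic and the doc-topic distributions
topic_top_words, doc_topic_dist = trainer.fit_transform(dataset, num_top_words=15, verbose=False)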
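
To inspect the two results, a minimal sketch follows. It assumes that topic_top_words is a list holding one space-separated string of words per topic and that doc_topic_dist can be iterated as rows of per-topic weights; both are assumptions about the return types. The helpers format_topics and format_doc_weights are hypothetical, and the stand-in values are made up for illustration.

```python
# Stand-in values shaped like the assumed outputs (made up for illustration).
example_top_words = [
    "space satellite launch orbit",
    "windows files dos microsoft",
]
example_dist = [
    [0.9, 0.1],
    [0.2, 0.8],
]


def format_topics(top_words):
    """Return one 'Topic k: words' line per topic."""
    return [f"Topic {k}: {words}" for k, words in enumerate(top_words)]


def format_doc_weights(dist):
    """Return one line per document listing its weight on each topic."""
    lines = []
    for i, row in enumerate(dist):
        weights = ", ".join(f"{float(w):.2f}" for w in row)
        lines.append(f"Doc {i}: [{weights}]")
    return lines


for line in format_topics(example_top_words) + format_doc_weights(example_dist):
    print(line)
```

If the assumptions hold, passing the real topic_top_words and doc_topic_dist to the same helpers should work unchanged.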

Usage

Download a preprocessed dataset

import topmost
from topmost.data import download_dataset

download_dataset('20NG', cache_path='./datasets')

Train a model

device = "cuda" # or "cpu"

# load a preprocessed dataset
dataset = topmost.data.BasicDatasetHandler("./datasets/20NG", device=device, read_labels=True, as_tensor=True)
# create a model
model = topmost.models.ProdLDA(dataset.vocab_size)
model = model.to(device)

# create a trainer
trainer = topmost.trainers.BasicTrainer(model)

# train the model
trainer.train(dataset)
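
Hard-coding device = "cuda" will typically raise an error on a machine without a CUDA-capable GPU. One way to choose the device automatically is sketched below using PyTorch's torch.cuda.is_available(). The helper pick_device is hypothetical and not part of TopMost.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not importable; fall back so the helper still returns a value.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The resulting string can be passed wherever the snippets above use device.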

Evaluate

# get theta (doc-topic distributions)
train_theta, test_theta = trainer.export_theta(dataset)
# get top words of topics
topic_top_words = trainer.export_top_words(dataset.vocab)

# evaluate topic diversity
TD = topmost.evaluations.compute_topic_diversity(topic_top_words)

# evaluate clustering
clustering_results = topmost.evaluations.evaluate_clustering(test_theta, dataset.test_labels)

# evaluate classification
classification_results = topmost.evaluations.evaluate_classification(train_theta, test_theta, dataset.train_labels, dataset.test_labels)
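
For intuition about what a topic diversity score measures, here is a small sketch of one common reading of the metric: the fraction of top-word slots filled by words that appear in exactly one topic. This illustrates the idea and is not necessarily what compute_topic_diversity does internally. The function name topic_diversity is hypothetical, and the input is assumed to be one space-separated string of top words per topic.

```python
from collections import Counter


def topic_diversity(top_words):
    """Fraction of top-word slots whose word occurs in exactly one topic."""
    # Split each topic string into its set of distinct words.
    topics = [set(words.split()) for words in top_words]
    # Count in how many topics each word appears.
    counts = Counter(word for topic in topics for word in topic)
    total = sum(len(topic) for topic in topics)
    unique = sum(1 for count in counts.values() if count == 1)
    return unique / total if total else 0.0


# "launch" appears in both topics, so 4 of the 6 slots hold a unique word.
print(topic_diversity(["space orbit launch", "windows files launch"]))
```

A score of 1.0 means no word is shared between topics; lower scores indicate more overlapping topics.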

Test new documents

import torch
from topmost.preprocessing import Preprocessing

new_docs = [
    "This is a new document about space, including words like space, satellite, launch, orbit.",
    "This is a new document about Microsoft Windows, including words like windows, files, dos."
]

preprocessing = Preprocessing()
parsed_new_docs, new_bow = preprocessing.parse(new_docs, vocab=dataset.vocab)
new_doc_topic_dist = trainer.test(torch.as_tensor(new_bow, device=device).float())
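
A common next step is to assign each new document to its highest-weight topic. The sketch below does this for any sequence of per-document weight rows. It assumes new_doc_topic_dist has one row per document and one column per topic. The helper dominant_topics is hypothetical, and the stand-in numbers are made up.

```python
def dominant_topics(doc_topic_dist):
    """Return the index of the highest-weight topic for each document row."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in doc_topic_dist]


# Stand-in for new_doc_topic_dist: two documents, two topics (made-up numbers).
example_new_dist = [[0.85, 0.15], [0.30, 0.70]]
print(dominant_topics(example_new_dist))  # [0, 1]
```

If the returned value is a NumPy array, new_doc_topic_dist.argmax(axis=1) gives the same result.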

Installation

Stable release

To install TopMost, run this command in your terminal:

$ pip install topmost

This is the preferred method to install TopMost, as it will always install the most recent stable release.

From sources

The sources for TopMost can be downloaded from the GitHub repository. Clone the public repository with:

$ git clone https://github.com/BobXWu/TopMost.git

Then install TopMost with:

$ python setup.py install
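
After either installation method, you can check that the package is visible to your Python environment. The sketch below uses the standard-library importlib.metadata module. It assumes the distribution is registered under the name topmost, as in the pip command above; installed_version is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="topmost"):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(installed_version())
```

A printed None suggests the installation did not reach the interpreter you are running.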

Tutorials

We provide tutorials for different usages:

| Name | Link |
| --- | --- |
| Quickstart | Open In GitHub |
| How to preprocess datasets | Open In GitHub |
| How to train and evaluate a basic topic model | Open In GitHub |
| How to train and evaluate a hierarchical topic model | Open In GitHub |
| How to train and evaluate a dynamic topic model | Open In GitHub |
| How to train and evaluate a cross-lingual topic model | Open In GitHub |

Disclaimer

This library includes some datasets for demonstration. If you are a dataset owner who wants to exclude your dataset from this library, please contact Xiaobao Wu.

Authors

Xiaobao Wu
Fengjun Pan

Acknowledgments

  • If you want to add any models to this package, we welcome your pull requests.
  • If you encounter any problems, please contact Xiaobao Wu directly or open an issue in the GitHub repo.
  • Icon by Flat-icons-com.