This repository contains a Python reference implementation of methods for ensemble topic modeling with Non-negative Matrix Factorization (NMF).

Details of these methods are described in the following paper:

Belford, M., Mac Namee, B., & Greene, D. (2018). Stability of topic modeling via matrix factorization. 
Expert Systems with Applications, 91, 159-169.

Draft pre-print:

Additional pre-processed datasets for use with this package can be downloaded here (179MB).


Tested with Python 3.5. The following packages are required and are available via pip:

Basic Usage

Step 1.

Before applying topic modeling to a corpus, the first step is to pre-process the corpus and store it in a suitable format. The script '' can be used to parse a directory of plain text documents. Here, we parse all .txt files in 'data/sample-text' and its sub-directories.

python data/sample-text/ -o sample --tfidf --norm

The output will be sample.pkl, stored as a Joblib binary file. The identifiers of the documents in the dataset correspond to the original text input filenames.

Alternatively, if all of your documents are stored in a text file, with one document per line, the script '' can be used:

python data/sample.txt -o sample --tfidf --norm
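
To make the pre-processing step more concrete, the following is a minimal standard-library sketch of TF-IDF weighting followed by unit-length normalization, which is what the --tfidf and --norm options are assumed to request. The function name tfidf_unit_vectors, the whitespace tokenizer, and the smoothed IDF variant are illustrative assumptions and are not taken from this package.

```python
import math
from collections import Counter


def tfidf_unit_vectors(docs):
    """Return (vocab, rows), with one sparse {term: weight} dict per document.

    Hypothetical helper for illustration only; not part of this package.
    """
    # Assumption: naive lowercase whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df)
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1.
        row = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
               for t, c in tf.items()}
        # Scale each document vector to unit Euclidean length.
        length = math.sqrt(sum(w * w for w in row.values()))
        rows.append({t: w / length for t, w in row.items()} if length else row)
    return vocab, rows


docs = ["nmf topic model", "topic model ensemble ensemble", "matrix factorization"]
vocab, rows = tfidf_unit_vectors(docs)
print(vocab)
print(rows[1])
```

Normalizing each document to unit length keeps long documents from dominating the factorization purely because they contain more tokens.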

Step 2.

Next, we generate a set of "base" topic models, which represent the members of the ensemble. We provide two different ways to do this.

Firstly, we can generate a specified number of base topic models using NMF and random initialization (the "Basic Ensemble" approach). For instance, we can generate 20 models, each containing k=4 topics, where each NMF run will execute for a maximum of 100 iterations. The models will be written to the directory 'models/base' as separate Joblib files.

python sample.pkl -k 4 -r 20 --maxiters 100 -o models/base
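
As a rough illustration of what a set of randomly initialized base models looks like, the sketch below runs a small pure-Python NMF with multiplicative updates several times, changing only the random seed. It is a toy under stated assumptions: the function names, the choice of multiplicative updates, and the dense list-of-lists matrices are all illustrative, and the package's own solver may differ.

```python
import random


def matmul(A, B):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def nmf(X, k, max_iters=100, seed=0, eps=1e-9):
    """Approximate non-negative X (n x m) as W (n x k) times H (k x m)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    # Random non-negative initialization; the seed is what varies between runs.
    W = [[rng.random() for _ in range(k)] for _ in range(n)]
    H = [[rng.random() for _ in range(m)] for _ in range(k)]
    for _ in range(max_iters):
        # H <- H * (W^T X) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (X H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H


def squared_error(X, W, H):
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))


def basic_ensemble(X, k, runs, max_iters=100):
    """One (W, H) pair per run; runs differ only in their random seed."""
    return [nmf(X, k, max_iters=max_iters, seed=s) for s in range(runs)]


# Toy document-term matrix with two obvious blocks of co-occurring terms.
X = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
models = basic_ensemble(X, k=2, runs=3, max_iters=100)
for W, H in models:
    print(round(squared_error(X, W, H), 6))
```

Because NMF is sensitive to its starting point, different seeds can yield different factors, which is the variation the ensemble is meant to average out.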

Alternatively, we can use the "K-Fold" ensemble approach. For instance, to execute 5 repetitions of 10 folds, we run:

python sample.pkl -k 4 -r 5 -f 10 --maxiters 100 -o models/base
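
One way to picture "5 repetitions of 10 folds" is as 50 training subsets, each omitting one fold of a fresh random split. The sketch below only builds those subsets of document indices. It assumes that each base model is trained on the documents outside a single held-out fold, and the function name kfold_training_sets is hypothetical.

```python
import random


def kfold_training_sets(n_docs, folds, repetitions, seed=0):
    """Return repetitions * folds lists of document indices to train on."""
    rng = random.Random(seed)
    training_sets = []
    for _ in range(repetitions):
        # Each repetition uses a new random ordering of the documents.
        order = list(range(n_docs))
        rng.shuffle(order)
        # Deal the shuffled indices into `folds` near-equal parts.
        parts = [order[i::folds] for i in range(folds)]
        for held_out in range(folds):
            # Assumption: train on every fold except the held-out one.
            train = sorted(d for i, part in enumerate(parts)
                           if i != held_out for d in part)
            training_sets.append(train)
    return training_sets


subsets = kfold_training_sets(n_docs=100, folds=10, repetitions=5)
print(len(subsets), len(subsets[0]))
```

With 100 documents, every training subset holds 90 of them, and within one repetition each document is left out exactly once.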

Step 3.

The next step is to combine the base topic models using an ensemble approach, to produce a final ensemble model. Note that we specify all of the factor files from the base topic models to combine, along with the number of overall ensemble topics (here again we specify k=4). The model will be written as a number of files to the directory 'models/ensemble'.

python sample.pkl models/base/*factors*.pkl -k 4 -o models/ensemble
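
A plausible reading of this combination step is that the topic-term factors of all base models are first stacked into a single matrix with one row per base topic, which is then factorized again to obtain the k ensemble topics. The sketch below shows only the stacking, on hand-written toy factors. The function name and the in-memory list-of-lists layout are assumptions; the contents of the real factor files are not documented here.

```python
def stack_topic_term_factors(base_factors):
    """Stack the k x m topic-term factor of each base model into (r * k) x m.

    Hypothetical helper: `base_factors` is a list of r matrices, each a list of
    k topic rows over the same m terms.
    """
    n_terms = len(base_factors[0][0])
    stacked = []
    for H in base_factors:
        for topic_row in H:
            if len(topic_row) != n_terms:
                raise ValueError("all base models must share one vocabulary")
            stacked.append(list(topic_row))
    return stacked


# Two toy base models, each with k=2 topics over m=3 terms.
H1 = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
H2 = [[0.1, 0.1, 0.7], [0.8, 0.2, 0.0]]
M = stack_topic_term_factors([H1, H2])
print(len(M), len(M[0]))
```

Under this reading, similar topics from different base models appear as similar rows of the stacked matrix, so factorizing it groups them into ensemble topics.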

Browsing Results

We can display the top 10 terms in the topic descriptors for the final ensemble results in tabular format:

python models/ensemble/ranks_ensemble_k04.pkl 

Or using a line-by-line format:

python -l models/ensemble/ranks_ensemble_k04.pkl 

Similarly, we can display the identifiers of the top-ranked documents for each topic:

python models/ensemble/factors_ensemble_k04.pkl 
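
Both listings amount to sorting one vector of weights and keeping the highest-scoring entries: a topic's row of term weights for descriptors, and a topic's column of document weights for top-ranked documents. The sketch below shows this on toy factors. The names top_ranked, H, and W and the example values are illustrative and do not reflect the layout of the files above.

```python
def top_ranked(labels, weights, top=10):
    """Return the labels with the largest weights, highest first."""
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    return [labels[i] for i in order[:top]]


terms = ["match", "goal", "vote", "party"]
doc_ids = ["d1.txt", "d2.txt", "d3.txt"]
# Toy factors: H holds topic-term weights, W holds document-topic weights.
H = [[0.9, 0.7, 0.0, 0.1], [0.0, 0.1, 0.8, 0.6]]
W = [[0.8, 0.1], [0.2, 0.9], [0.6, 0.0]]

for topic in range(len(H)):
    descriptor = top_ranked(terms, H[topic], top=2)
    documents = top_ranked(doc_ids, [row[topic] for row in W], top=2)
    print(topic, descriptor, documents)
```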

Evaluation Measures

To evaluate the Normalized Mutual Information (NMI) accuracy of the document partitions associated with one or more topic models, relative to a ground truth dataset, run:

python sample.pkl models/base/partition*.pkl 

To evaluate the stability of a collection of document partitions using Pairwise Normalized Mutual Information (PNMI), run:

python models/base/partition*.pkl 
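
The two partition measures above can be sketched in a few lines. NMI compares one partition against a reference labelling, and PNMI averages NMI over every pair of partitions in a collection. The sketch assumes that a partition is a list with one cluster label per document and that NMI is normalized by the arithmetic mean of the two entropies; other normalizations exist, and the package may use a different one.

```python
from collections import Counter
from itertools import combinations
from math import log


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """Normalized mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum((c / n) * log(c * n / (count_a[x] * count_b[y]))
             for (x, y), c in joint.items())
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        return 1.0  # both partitions place everything in a single cluster
    # Assumption: normalize by the arithmetic mean of the two entropies.
    return mi / ((h_a + h_b) / 2)


def pnmi(partitions):
    """Mean NMI over all unordered pairs of partitions."""
    pairs = list(combinations(partitions, 2))
    return sum(nmi(a, b) for a, b in pairs) / len(pairs)


truth = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]   # same grouping, different label names
unrelated = [0, 1, 0, 1]    # statistically independent of `truth`
print(nmi(truth, relabelled), nmi(truth, unrelated))
print(pnmi([truth, relabelled, unrelated]))
```

NMI ignores the names of the cluster labels, so two partitions that group documents identically score 1 even when their labels differ.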

To evaluate the stability of a collection of term rankings from topic models using Average Term Stability (ATS), run:

python models/base/ranks*.pkl 
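
As a hedged illustration of term stability, the sketch below scores a pair of models by the mean Jaccard similarity between their topic descriptors under the best one-to-one matching of topics, then averages that score over all pairs of models. This definition is an assumption based on a reading of the paper cited above, and the brute-force search over permutations is only practical for small k; an assignment solver such as the Hungarian algorithm would be the usual choice for larger models.

```python
from itertools import combinations, permutations


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def term_stability(model_a, model_b):
    """Mean Jaccard between descriptors under the best topic matching.

    Each model is a list of k descriptors (lists of top terms); both models
    are assumed to have the same k.
    """
    k = len(model_a)
    best = max(sum(jaccard(model_a[i], model_b[p[i]]) for i in range(k))
               for p in permutations(range(k)))
    return best / k


def average_term_stability(models):
    pairs = list(combinations(models, 2))
    return sum(term_stability(a, b) for a, b in pairs) / len(pairs)


m1 = [["a", "b"], ["c", "d"]]
m2 = [["c", "d"], ["a", "b"]]   # same topics in a different order
m3 = [["a", "x"], ["c", "d"]]   # one descriptor differs by one term
print(term_stability(m1, m2), term_stability(m1, m3))
print(average_term_stability([m1, m2, m3]))
```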

To evaluate the stability of a collection of term rankings from topic models using Average Descriptor Set Difference (ADSD), run:

python models/base/ranks*.pkl
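
For the descriptor set difference, a minimal sketch is given below. It pools all descriptor terms of a model into one set, measures the symmetric difference between two such sets, and averages over all pairs of models. The normalization by t * k (descriptor length times number of topics) is an assumption based on a reading of the paper cited above, and the function names are hypothetical.

```python
from itertools import combinations


def descriptor_set_difference(model_a, model_b):
    """Size of the symmetric difference of the pooled descriptor terms.

    Each model is a list of k descriptors, each a list of t top terms.
    Assumption: the count is normalized by t * k.
    """
    terms_a = {term for descriptor in model_a for term in descriptor}
    terms_b = {term for descriptor in model_b for term in descriptor}
    k, t = len(model_a), len(model_a[0])
    return len(terms_a ^ terms_b) / (t * k)


def average_descriptor_set_difference(models):
    pairs = list(combinations(models, 2))
    return sum(descriptor_set_difference(a, b) for a, b in pairs) / len(pairs)


m1 = [["a", "b"], ["c", "d"]]
m2 = [["c", "d"], ["a", "b"]]   # same terms overall
m3 = [["a", "x"], ["c", "d"]]   # "b" replaced by "x"
print(descriptor_set_difference(m1, m3))
print(average_descriptor_set_difference([m1, m2, m3]))
```

Unlike the term stability sketch, lower values here indicate more stable models, since the measure counts terms that are not shared.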


Ensemble topic modeling with matrix factorization






