Tutorial on topic models in Python with scikit-learn
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.



This repository contains notebooks, slides, and data for the short tutorial "Topic modelling with Scikit-learn", presented at PyData Dublin in September 2017.


The summary tutorial is covered in these slides. There are three associated IPython notebooks:

  1. Text Preprocessing: Provides a basic introduction to preprocessing documents with scitkit-learn.
  2. NMF Topic Models: Covers the application and interpretation of topic models via the NMF implementation provided by scitkit-learn.
  3. Parameter Selection for NMF: More advanced material on selecting the number of topics for NMF, using topic coherence.

To demonstrate the topic modelling techniques, a sample dataset is provided here. This consists of 4,551 news articles from 2016, stored in a single text file (25MB), one article per line.


This code has been tested with Python 3.6. The core package requirements are:

  • scikit-learn (tested with v0.19.0)
  • numpy
  • matplotlib

The model selection code also relies on the gensim package to build a Word2Vec model. A pre-built Word2Vec model for the sample dataset is also provided here for download (71MB).

Links and References

  • Scikit-learn home
  • NMF documentation for scikit-learn
  • Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature. [PDF]
  • Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4). [Link]
  • O’Callaghan, D., Greene, D., Carthy, J., & Cunningham, P. (2015). An analysis of the coherence of descriptors in topic modeling. Expert Systems with Applications. [PDF]