Skip to content


Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?

Latest commit


Failed to load latest commit information.


This repository contains notebooks, slides, and data for the short tutorial "Topic modelling with Scikit-learn", presented at PyData Dublin in September 2017.


The summary tutorial is covered in these slides. There are three associated IPython notebooks:

  1. Text Preprocessing: Provides a basic introduction to preprocessing documents with scitkit-learn.
  2. NMF Topic Models: Covers the application and interpretation of topic models via the NMF implementation provided by scitkit-learn.
  3. Parameter Selection for NMF: More advanced material on selecting the number of topics for NMF, using topic coherence.

To demonstrate the topic modelling techniques, a sample dataset is provided here. This consists of 4,551 news articles collected from the Guardian News API in 2016, stored in a single text file (25MB), with one article per line.


This code has been tested with Python 3.6-3.8. The core package requirements are:

  • scikit-learn
  • numpy
  • matplotlib

The model selection code also relies on the gensim package to build a Word2Vec model (tested with v4.1.2). A sample pre-built Word2Vec model for the sample dataset is also provided here for download (71MB).

Links and References

  • Scikit-learn home
  • NMF documentation for scikit-learn
  • Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature. [Article]
  • Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4). [Article]
  • O’Callaghan, D., Greene, D., Carthy, J., & Cunningham, P. (2015). An analysis of the coherence of descriptors in topic modeling. Expert Systems with Applications. [PDF]


Tutorial on topic models in Python with scikit-learn






No releases published


No packages published