

This repository contains notebooks, slides, and data for the short tutorial "Topic modelling with Scikit-learn", presented at PyData Dublin in September 2017.


The tutorial is summarised in these slides. There are three associated IPython notebooks:

  1. Text Preprocessing: Provides a basic introduction to preprocessing documents with scikit-learn.
  2. NMF Topic Models: Covers the application and interpretation of topic models via the NMF implementation provided by scikit-learn.
  3. Parameter Selection for NMF: More advanced material on selecting the number of topics for NMF, using topic coherence.
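
As a rough illustration of what the first two notebooks cover, the sketch below builds a TF-IDF document-term matrix and factorises it with scikit-learn's NMF. It is not taken from the notebooks: the helper name `fit_nmf_topics`, the toy documents, and the parameter values are assumptions chosen for the example.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer


def fit_nmf_topics(documents, n_topics=2, n_top_terms=3, random_state=0):
    """Hypothetical helper: TF-IDF preprocessing followed by NMF."""
    # Preprocessing: tokenise, drop English stop words, weight by TF-IDF.
    vectorizer = TfidfVectorizer(stop_words="english")
    A = vectorizer.fit_transform(documents)

    # Factorise A (documents x terms) into W (documents x topics)
    # and H (topics x terms).
    model = NMF(n_components=n_topics, init="nndsvd", random_state=random_state)
    W = model.fit_transform(A)
    H = model.components_

    # Newer scikit-learn releases expose get_feature_names_out();
    # older ones (such as v0.19) expose get_feature_names().
    if hasattr(vectorizer, "get_feature_names_out"):
        terms = vectorizer.get_feature_names_out()
    else:
        terms = vectorizer.get_feature_names()

    # A topic descriptor is the list of its highest-weighted terms.
    descriptors = []
    for topic_weights in H:
        top_indices = topic_weights.argsort()[::-1][:n_top_terms]
        descriptors.append([str(terms[i]) for i in top_indices])
    return W, descriptors


# Toy documents, invented for this example only.
toy_documents = [
    "the football team won the league match",
    "the striker scored a goal in the football match",
    "the central bank raised interest rates",
    "interest rates and inflation worry the bank",
]
W, descriptors = fit_nmf_topics(toy_documents)
for index, descriptor in enumerate(descriptors):
    print("Topic %d: %s" % (index + 1, ", ".join(descriptor)))
```

Each row of `W` gives a document's weight on every topic, so the largest entry in a row indicates the topic most associated with that document.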

To demonstrate the topic modelling techniques, a sample dataset is provided here. This consists of 4,551 news articles from 2016, stored in a single text file (25MB), one article per line.
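
Because the dataset stores one article per line, it can be read into a list of strings with the standard library alone. The sketch below assumes UTF-8 encoding and uses a placeholder file name, `articles.txt`; neither is confirmed by this README.

```python
def load_articles(path, encoding="utf-8"):
    """Read a text file with one article per line into a list of strings.

    The encoding is an assumption; adjust it if the file differs.
    """
    articles = []
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            text = line.strip()
            if text:  # skip blank lines
                articles.append(text)
    return articles


# Usage, with a placeholder file name:
# articles = load_articles("articles.txt")
# print("Read %d articles" % len(articles))
```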


This code has been tested with Python 3.6. The core package requirements are:

  • scikit-learn (tested with v0.19.0)
  • numpy
  • matplotlib

The model selection code also relies on the gensim package to build a Word2Vec model. A pre-built Word2Vec model for the sample dataset is also provided here for download (71MB).
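
One way a word-embedding model can be used for model selection is to score each topic by the mean pairwise similarity of its top terms, then average over topics and compare the scores for different numbers of topics. The sketch below illustrates that idea with NumPy and a hand-made dictionary of toy vectors standing in for a trained Word2Vec model. The function names and vectors are assumptions, and the notebook's actual implementation may differ.

```python
from itertools import combinations

import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def topic_coherence(descriptor, vectors):
    """Mean pairwise cosine similarity between the terms of one topic."""
    pairs = list(combinations(descriptor, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def model_coherence(descriptors, vectors):
    """Mean topic coherence across all topics of a model."""
    scores = [topic_coherence(d, vectors) for d in descriptors]
    return sum(scores) / len(scores)


# Toy 2-D vectors, invented for this example; a real run would look the
# terms up in the trained embedding model instead.
toy_vectors = {
    "goal": np.array([1.0, 0.0]),
    "match": np.array([1.0, 0.0]),
    "bank": np.array([0.0, 1.0]),
    "rates": np.array([0.0, 1.0]),
}
coherent = model_coherence([["goal", "match"], ["bank", "rates"]], toy_vectors)
mixed = model_coherence([["goal", "bank"], ["match", "rates"]], toy_vectors)
print(coherent, mixed)
```

With a gensim Word2Vec model, the pairwise score could instead come from the model's own word similarity lookup, and the number of topics with the highest mean coherence would be the candidate to keep.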

Links and References

  • Scikit-learn home
  • NMF documentation for scikit-learn
  • Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature. [PDF]
  • Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4). [Link]
  • O’Callaghan, D., Greene, D., Carthy, J., & Cunningham, P. (2015). An analysis of the coherence of descriptors in topic modeling. Expert Systems with Applications. [PDF]


Tutorial on topic models in Python with scikit-learn


