This repository contains notebooks, slides, and data for the short tutorial "Topic modelling with Scikit-learn", presented at PyData Dublin in September 2017.
The tutorial is summarised in these slides. There are three associated IPython notebooks:
- Text Preprocessing: Provides a basic introduction to preprocessing documents with scikit-learn.
- NMF Topic Models: Covers the application and interpretation of topic models via the NMF implementation provided by scikit-learn.
- Parameter Selection for NMF: More advanced material on selecting the number of topics for NMF, using topic coherence.
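As a rough illustration of the first two notebooks, the sketch below builds a TF-IDF document-term matrix and fits an NMF topic model with scikit-learn. The toy documents, the choice of k=2, the `nndsvd` initialisation, and the `top_terms` helper are assumptions made for this example and are not taken from the notebooks.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the news articles (illustrative only).
documents = [
    "the striker scored a late goal in the football match",
    "the football team won the league match",
    "the goalkeeper saved a penalty in the match",
    "the bank raised interest rates as inflation rose",
    "markets fell after the bank announced new interest rates",
    "inflation and interest rates worry investors",
]

# Preprocessing: tokenise, drop English stop words, apply TF-IDF weighting.
vectorizer = TfidfVectorizer(stop_words="english")
A = vectorizer.fit_transform(documents)

# Factorise A into W (documents x topics) and H (topics x terms).
k = 2  # number of topics, chosen by hand for this toy example
model = NMF(n_components=k, init="nndsvd", random_state=0)
W = model.fit_transform(A)
H = model.components_

# Recover the term for each column index from the fitted vocabulary.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)


def top_terms(topic_index, n=5):
    """Return the n highest-weighted terms for one topic (hypothetical helper)."""
    order = H[topic_index].argsort()[::-1][:n]
    return [terms[i] for i in order]


for t in range(k):
    print("Topic %d: %s" % (t + 1, ", ".join(top_terms(t))))
```

Each row of H can be read as a topic descriptor, and each row of W gives the topic weights for one document.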
To demonstrate the topic modelling techniques, a sample dataset is provided here. This consists of 4,551 news articles from 2016, stored in a single text file (25MB), one article per line.
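Because the dataset stores one article per line, it can be read with a few lines of standard Python. The following is a minimal sketch; the `load_articles` helper and the temporary sample file are hypothetical and only stand in for the real dataset file.

```python
import os
import tempfile


def load_articles(path):
    """Read a text file containing one article per line, skipping blank lines."""
    articles = []
    with open(path, "r", encoding="utf8") as fin:
        for line in fin:
            text = line.strip()
            if text:
                articles.append(text)
    return articles


# Demonstration on a small temporary file rather than the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf8") as fout:
        fout.write("First article text.\n\nSecond article text.\n")
    articles = load_articles(sample_path)

print("Read %d articles" % len(articles))
```

The resulting list of strings can be passed directly to a scikit-learn vectorizer.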
This code has been tested with Python 3.6. The core package requirements are:
- scikit-learn (tested with v0.19.0)
The model selection code also relies on the gensim package to build a Word2Vec model. A pre-built Word2Vec model for the sample dataset is available here for download (71MB).
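One common way to use word embeddings for model selection is to score each topic by the mean pairwise similarity of its top terms, then average over topics and compare the result across candidate numbers of topics. The sketch below assumes that formulation; the function names and the toy two-dimensional vectors are hypothetical. With gensim, the `similarity` argument could be the `wv.similarity` method of a trained Word2Vec model, which takes two words and returns their cosine similarity.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def topic_coherence(top_terms, similarity):
    """Mean similarity over all unordered pairs of a topic's top terms."""
    pairs = list(combinations(top_terms, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def mean_coherence(topics, similarity):
    """Average coherence over all topics in a model."""
    return sum(topic_coherence(t, similarity) for t in topics) / len(topics)


# Toy vectors standing in for a Word2Vec model (illustrative values only).
toy_vectors = {
    "goal": (1.0, 0.1), "match": (0.9, 0.2), "league": (0.8, 0.1),
    "bank": (0.1, 1.0), "rates": (0.2, 0.9), "inflation": (0.1, 0.8),
}


def toy_similarity(a, b):
    return cosine(toy_vectors[a], toy_vectors[b])


coherent = [["goal", "match", "league"], ["bank", "rates", "inflation"]]
mixed = [["goal", "bank", "league"], ["match", "rates", "inflation"]]

print("coherent topics:", mean_coherence(coherent, toy_similarity))
print("mixed topics:", mean_coherence(mixed, toy_similarity))
```

Topics whose top terms are semantically related receive a higher score, so the candidate number of topics with the highest mean coherence is a reasonable choice.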
Links and References
- Scikit-learn home
- NMF documentation for scikit-learn
- Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature. [PDF]
- Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4). [Link]
- O’Callaghan, D., Greene, D., Carthy, J., & Cunningham, P. (2015). An analysis of the coherence of descriptors in topic modeling. Expert Systems with Applications. [PDF]