In [1]:
%matplotlib inline


LDA Model
=========

Introduces Gensim's LDA model and demonstrates its use on the NIPS corpus.


In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

The purpose of this tutorial is to demonstrate training an LDA model and
obtaining good results.

This tutorial will **not**:

* Explain how Latent Dirichlet Allocation works
* Explain how the LDA model performs inference
* Teach you how to use Gensim's LDA implementation in its entirety

If you are not familiar with the LDA model or how to use it in Gensim, I
suggest you read up on that before continuing with this tutorial.

* [Introduction to LDA](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation)
* [Gensim tutorial](sphx_glr_auto_examples_core_run_topics_and_transformations.py)

Data
----

I have used a corpus of NIPS papers in this tutorial, but if you're following
this tutorial just to learn about LDA I encourage you to consider picking a
corpus on a subject that you are familiar with. Qualitatively evaluating the
output of an LDA model is challenging and can require you to understand the
subject matter of your corpus (depending on your goal with the model).

NIPS (Neural Information Processing Systems) is a machine learning conference
so the subject matter should be well suited for most of the target audience
of this tutorial. You can download the original data from Sam Roweis'
[website](http://www.cs.nyu.edu/~roweis/data.html). The code below will
also do that for you.

The corpus contains 1740 relatively short documents. Keep in mind that this tutorial is not geared towards efficiency, and be careful before applying the code to a large dataset.

In [3]:
import io
import os.path
import re
import tarfile

import smart_open

def extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):
    fname = url.split('/')[-1]
    
    # Download the file to local storage first.
    # We can't read it on the fly because of 
    # https://github.com/RaRe-Technologies/smart_open/issues/331
    if not os.path.isfile(fname):
        with smart_open.open(url, "rb") as fin:
            with smart_open.open(fname, 'wb') as fout:
                while True:
                    buf = fin.read(io.DEFAULT_BUFFER_SIZE)
                    if not buf:
                        break
                    fout.write(buf)
                         
    with tarfile.open(fname, mode='r:gz') as tar:
        # Ignore directory entries, as well as files like README, etc.
        files = [
            m for m in tar.getmembers()
            if m.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', m.name)
        ]
        for member in sorted(files, key=lambda x: x.name):
            member_bytes = tar.extractfile(member).read()
            yield member_bytes.decode('utf-8', errors='replace')

docs = list(extract_documents())



So we have a list of 1740 documents, where each document is a Unicode string. 
If you're thinking about using your own corpus, then you need to make sure
that it's in the same format (list of Unicode strings) before proceeding
with the rest of this tutorial.
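
If your own corpus lives in a directory of plain-text files, something like the following sketch would give you the same list-of-strings format. The helper name ``load_documents`` and the throwaway files are my own illustration (they are not part of Gensim), and I assume UTF-8 text with one document per ``.txt`` file.

```python
import os
import tempfile


def load_documents(directory, encoding='utf-8'):
    """Read every .txt file in `directory` into a list of Unicode strings.

    Hypothetical helper for illustration; assumes one document per file.
    """
    texts = []
    for name in sorted(os.listdir(directory)):
        if name.endswith('.txt'):
            path = os.path.join(directory, name)
            # errors='replace' mirrors the lenient decoding used for the NIPS files.
            with open(path, encoding=encoding, errors='replace') as fin:
                texts.append(fin.read())
    return texts


# Small demo on two throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [('a.txt', 'First document.'), ('b.txt', 'Second document.')]:
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as fout:
            fout.write(text)
    my_docs = load_documents(tmp)
```

The resulting ``my_docs`` could then take the place of ``docs`` in the rest of the tutorial.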




In [4]:
print(len(docs))
print(docs[0][:500])

1740
1 
CONNECTIVITY VERSUS ENTROPY 
Yaser S. Abu-Mostafa 
California Institute of Technology 
Pasadena, CA 91125 
ABSTRACT 
How does the connectivity of a neural network (number of synapses per 
neuron) relate to the complexity of the problems it can handle (measured by 
the entropy)? Switching theory would suggest no relation at all, since all Boolean 
functions can be implemented using a circuit with very low connectivity (e.g., 
using two-input NAND gates). However, for a network that learns a pr


Pre-process and vectorize the documents
---------------------------------------

As part of preprocessing, we will:

* Tokenize (split the documents into tokens).
* Lemmatize the tokens.
* Compute bigrams.
* Compute a bag-of-words representation of the data.

First we tokenize the text using a regular expression tokenizer from NLTK. We
remove numeric tokens and tokens that are only a single character, as they
don't tend to be useful, and the dataset contains a lot of them.

You can replace NLTK with something else if you want.
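
As one possible replacement, here is a minimal sketch that uses only the standard library's ``re`` module with the same ``\w+`` pattern and the same two filters (numeric tokens and single-character tokens). The function name ``tokenize`` is my own, and I have not checked that it matches NLTK token for token on every input.

```python
import re

# Same pattern as the NLTK tokenizer used in this tutorial.
TOKEN_RE = re.compile(r'\w+')


def tokenize(text):
    """Lowercase, split on non-word characters, drop numbers and 1-char tokens."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if not t.isnumeric() and len(t) > 1]


sample = 'A 2-layer network with 10 hidden units (e.g., NAND gates).'
tokens = tokenize(sample)
print(tokens)
```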

In [5]:
# Tokenize the documents.
from nltk.tokenize import RegexpTokenizer

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]

* We use the **WordNet lemmatizer** from NLTK. A lemmatizer is preferred over a stemmer in this case because it produces more readable words. 
* Output that is easy to read is very desirable in topic modelling.

In [6]:
# Lemmatize the documents.
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

* We find **bigrams** in the documents. Bigrams are sets of two adjacent words.
Using bigrams we can get phrases like "machine_learning" in our output
(spaces are replaced with underscores); without bigrams we would only get
"machine" and "learning".

* Note: below, we find bigrams and add them to the original data, because we want to keep the words "machine" and "learning" as well as the bigram "machine_learning".

* Remember: computing n-grams of large datasets can be very computationally and memory intensive.
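
To make the idea concrete, here is a toy pure-Python sketch of bigram detection on three tiny documents. It is not Gensim's implementation: the helper names ``find_bigrams`` and ``append_bigrams`` are made up, and the score is a simplified count-based one that I am assuming only for illustration. The point is the last step, where detected bigrams are appended to each document so the original unigrams are kept.

```python
from collections import Counter


def find_bigrams(docs, min_count=1, threshold=1.0):
    """Return adjacent word pairs whose (assumed, simplified) score exceeds threshold."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    pairs = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    selected = set()
    for (a, b), n_ab in pairs.items():
        if n_ab < min_count:
            continue
        # Pairs that co-occur often relative to their individual counts score high.
        score = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        if score > threshold:
            selected.add((a, b))
    return selected


def append_bigrams(doc, selected):
    """Keep the original tokens and append 'a_b' for every detected pair."""
    extra = ['%s_%s' % (a, b) for a, b in zip(doc, doc[1:]) if (a, b) in selected]
    return doc + extra


toy_docs = [
    ['machine', 'learning', 'is', 'fun'],
    ['we', 'study', 'machine', 'learning'],
    ['machine', 'learning', 'needs', 'data'],
]
selected = find_bigrams(toy_docs, min_count=1, threshold=1.0)
toy_with_bigrams = [append_bigrams(doc, selected) for doc in toy_docs]
print(toy_with_bigrams[0])
```

On a real corpus you would use ``Phrases`` as in the cell below, with a much higher ``min_count``.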

In [8]:
import timeit

In [10]:
# Compute bigrams.
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)

   
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)

2020-04-22 07:59:30,363 : INFO : collecting all words and their counts
2020-04-22 07:59:30,364 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2020-04-22 07:59:40,361 : INFO : collected 1311757 word types from a corpus of 4953968 words (unigram + bigrams) and 1740 sentences
2020-04-22 07:59:40,362 : INFO : using 1311757 counts as vocab in Phrases<0 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000>


* Remove rare words and common words based on their *document frequency*.
* Below we remove words that appear in fewer than 20 documents or in more than 50% of the documents.
* Consider also trying to remove words based only on their raw **frequency**.
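
The document-frequency rule can be sketched in plain Python as follows. This is only an illustration of the rule on a toy corpus: the function name ``filter_by_document_frequency`` is hypothetical, and unlike ``filter_extremes`` (which prunes the dictionary) it filters the token lists directly.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and at most `no_above` (fraction) of docs."""
    num_docs = len(docs)
    # Count each token once per document, not once per occurrence.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    keep = {tok for tok, df in doc_freq.items()
            if no_below <= df <= no_above * num_docs}
    filtered = [[tok for tok in doc if tok in keep] for doc in docs]
    return filtered, keep


toy_docs = [
    ['the', 'neuron', 'spike'],
    ['the', 'neuron', 'layer'],
    ['the', 'layer', 'kernel'],
    ['the', 'spike', 'zzz'],
]
filtered, keep = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(keep))
```

Here "the" is dropped because it appears in every document, while "kernel" and "zzz" are dropped because they appear in only one.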

In [11]:
# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

2020-04-22 08:06:05,488 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-04-22 08:06:09,750 : INFO : built Dictionary(79983 unique tokens: ['0a', '2h', '2h2', '2he', '2n']...) from 1740 documents (total 5617423 corpus positions)
2020-04-22 08:06:09,895 : INFO : discarding 70969 tokens: [('0a', 19), ('2h', 16), ('2h2', 1), ('2he', 3), ('__c', 2), ('_k', 6), ('a', 1740), ('about', 1058), ('abstract', 1740), ('after', 1087)]...
2020-04-22 08:06:09,896 : INFO : keeping 9014 tokens which were in no less than 20 and no more than 870 (=50.0%) documents
2020-04-22 08:06:09,935 : INFO : resulting dictionary: Dictionary(9014 unique tokens: ['2n', '_c', 'a2', 'a_follows', 'ability']...)


* Transform the documents into a vectorized (bag-of-words) form.
* We simply compute the frequency of each word, including the bigrams.
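
For intuition, a bag-of-words vector is just a list of ``(token_id, count)`` pairs per document. The sketch below builds such vectors by hand. The helpers ``build_token2id`` and ``to_bow`` are hypothetical, and the ids they assign will generally differ from the ones Gensim's ``Dictionary`` assigns.

```python
from collections import Counter


def build_token2id(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from the mapping are skipped."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


toy_docs = [['neuron', 'spike', 'neuron'], ['spike', 'layer']]
token2id = build_token2id(toy_docs)
toy_corpus = [to_bow(doc, token2id) for doc in toy_docs]
print(toy_corpus)
```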

In [12]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

* How many tokens and documents do we have to train on?

In [13]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 9014
Number of documents: 1740


Training
--------

We are ready to train the LDA model. We will first discuss how to set some of
the training parameters.

* First: **how many topics do I need?** There is no easy answer for this; it will depend on both your data and your application. I have used **10 topics here** because I wanted to have a few topics that I could interpret and "label", and because that turned out to give me reasonably good results. You might not need to interpret all your topics, so you could use a large number of topics, for example 100.

* ``chunksize`` controls how many documents are processed at a time during training.
* A bigger ``chunksize`` speeds up training, as long as the chunk of documents easily fits into memory.
* I've set ``chunksize = 2000``, which is more than the number of documents, so I process all the data in one go. Note that ``chunksize`` can influence model quality (see Hoffman and co-authors [2]).
* ``passes``, also known as "epochs", controls how often we train the model on the entire corpus.
* ``iterations`` controls how often we repeat a particular loop over each document. It is important to set both the number of passes and the number of iterations high enough.

* **How to** choose iterations and passes:
    * Enable logging and set ``eval_every = 1`` in ``LdaModel``.
    * When training, look in the log for a line that looks something like this:

        ```2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - 68/1566 documents converged within 400 iterations```
    * If you set ``passes = 20`` you will see this line 20 times.
    * Make sure that by the final passes, most of the documents have converged. You want to choose both passes and iterations to be high enough for this to happen.

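* We set **alpha = 'auto'** and **eta = 'auto'**, so that these two priors are learned automatically from the data instead of being specified explicitly.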
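
The convergence check described above can be partly automated by scanning the training log. The sketch below assumes you have captured the log as a list of lines and that the message format matches the sample line shown earlier. The helper name ``convergence_fractions`` is my own, and the second log line is invented for the demo. Note also that the sample line is tagged DEBUG, so I assume the logging level may need to be lowered from INFO for these lines to appear.

```python
import re

# Matches e.g. "68/1566 documents converged within 400 iterations".
CONVERGED_RE = re.compile(r'(\d+)/(\d+) documents converged within (\d+) iterations')


def convergence_fractions(log_lines):
    """Return the fraction of converged documents for each matching log line, in order."""
    fractions = []
    for line in log_lines:
        match = CONVERGED_RE.search(line)
        if match:
            converged, total = int(match.group(1)), int(match.group(2))
            fractions.append(converged / total)
    return fractions


sample_log = [
    # Sample line quoted in this tutorial.
    '2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - '
    '68/1566 documents converged within 400 iterations',
    # Hypothetical later pass, invented for this demo.
    'DEBUG - 1500/1566 documents converged within 400 iterations',
]
fractions = convergence_fractions(sample_log)
print(['%.3f' % f for f in fractions])
```

If the last values in ``fractions`` are still far below 1, that would suggest increasing ``passes`` or ``iterations``.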




In [14]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus       =corpus,
    id2word      =id2word,
    chunksize    =chunksize,
    alpha        ='auto',
    eta          ='auto',
    iterations   =iterations,
    num_topics   =num_topics,
    passes       =passes,
    eval_every   =eval_every
)

2020-04-22 08:20:39,835 : INFO : using autotuned alpha, starting with [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
2020-04-22 08:20:39,838 : INFO : using serial LDA version on this node
2020-04-22 08:20:39,854 : INFO : running online (multi-pass) LDA training, 10 topics, 20 passes over the supplied corpus of 1740 documents, updating model once every 1740 documents, evaluating perplexity every 0 documents, iterating 400x with a convergence threshold of 0.001000
2020-04-22 08:20:39,856 : INFO : PROGRESS: pass 0, at document #1740/1740
2020-04-22 08:20:57,492 : INFO : optimized alpha [0.0693838, 0.10109078, 0.08631849, 0.059373535, 0.08150441, 0.07864842, 0.06427334, 0.08243809, 0.05018346, 0.11289825]
2020-04-22 08:20:57,513 : INFO : topic #8 (0.050): 0.004*"direction" + 0.003*"cell" + 0.003*"training_set" + 0.003*"signal" + 0.003*"image" + 0.002*"class" + 0.002*"map" + 0.002*"layer" + 0.002*"density" + 0.002*"activity"
2020-04-22 08:20:57,515 : INFO : topic #3 (0.059): 0.004*"circ

* We can compute the **topic coherence** of each topic. Below we display the
average topic coherence and print the topics in order of topic coherence.

* Note: we use the "Umass" topic coherence measure. Gensim has recently
obtained an implementation of the "AKSW" topic coherence measure (see http://rare-technologies.com/what-is-topic-coherence/).

* If you find methods that give better results, feel free to share them on the blog at http://rare-technologies.com/lda-training-tips/
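
For intuition about what a UMass-style coherence score rewards, here is a rough sketch based on document co-occurrence counts: a topic scores higher when its top words tend to appear in the same documents. This is my own simplified version (add-one smoothing, averaged over word pairs) and is not guaranteed to reproduce the numbers Gensim reports. The function name ``umass_coherence`` and the toy documents are made up.

```python
import math


def umass_coherence(top_words, docs):
    """Average of log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words.

    D(...) counts documents containing all the given words. Assumes every
    top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for m in range(1, len(top_words)):
        for l in range(m):
            co_count = doc_count(top_words[m], top_words[l])
            scores.append(math.log((co_count + 1) / doc_count(top_words[l])))
    return sum(scores) / len(scores)


toy_docs = [
    ['cell', 'neuron', 'spike'],
    ['cell', 'neuron'],
    ['kernel', 'margin'],
    ['kernel', 'margin', 'cell'],
]
coherent = umass_coherence(['cell', 'neuron'], toy_docs)
mixed = umass_coherence(['cell', 'margin'], toy_docs)
print(coherent, mixed)
```

Words that rarely share a document pull the score down, which is why less interpretable topics tend to end up at the bottom of the ranking.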

In [15]:
top_topics = model.top_topics(corpus)

# Average topic coherence is the sum(topic coherences of all topics)/#topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

2020-04-22 08:27:06,050 : INFO : CorpusAccumulator accumulated stats from 1000 documents


Average topic coherence: -1.3123.
[([(0.015313651, 'cell'),
   (0.0149424905, 'neuron'),
   (0.0073651653, 'response'),
   (0.0068880743, 'stimulus'),
   (0.006190292, 'spike'),
   (0.0059941225, 'activity'),
   (0.0050231, 'synaptic'),
   (0.004933211, 'receptive_field'),
   (0.0046267975, 'firing'),
   (0.0045634303, 'visual'),
   (0.0043524248, 'firing_rate'),
   (0.004018963, 'cortex'),
   (0.003965662, 'signal'),
   (0.0036408643, 'connection'),
   (0.0035720216, 'frequency'),
   (0.0034900145, 'spike_train'),
   (0.0033038252, 'field'),
   (0.0031401764, 'fig'),
   (0.0031081017, 'cortical'),
   (0.0030926934, 'potential')],
  -0.977067078652941),
 ([(0.013519373, 'hidden_unit'),
   (0.009738461, 'hidden'),
   (0.007895276, 'recognition'),
   (0.00785422, 'training_set'),
   (0.0065473174, 'speech'),
   (0.0061372593, 'word'),
   (0.005670263, 'layer'),
   (0.0054374444, 'test_set'),
   (0.005188131, 'trained'),
   (0.004860516, 'speech_recognition'),
   (0.0047743507, 'hidden_la

Things to experiment with
-------------------------

* The ``no_above`` and ``no_below`` parameters of the ``filter_extremes`` method.
* Adding trigrams or even higher order n-grams.
* Consider whether using a hold-out set or cross-validation is the way to go for you.
* Try other datasets.
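
If you want to try a hold-out set, a minimal way to split a bag-of-words corpus could look like the sketch below. The helper name ``split_corpus``, the 20% fraction, and the toy corpus are assumptions made for illustration.

```python
import random


def split_corpus(corpus, holdout_fraction=0.1, seed=0):
    """Randomly split a list of documents into (train, holdout) lists."""
    indices = list(range(len(corpus)))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_holdout = int(len(corpus) * holdout_fraction)
    holdout_idx = set(indices[:n_holdout])
    train = [doc for i, doc in enumerate(corpus) if i not in holdout_idx]
    holdout = [doc for i, doc in enumerate(corpus) if i in holdout_idx]
    return train, holdout


# Toy corpus of ten one-word documents in bag-of-words form.
toy_corpus = [[(i, 1)] for i in range(10)]
train, holdout = split_corpus(toy_corpus, holdout_fraction=0.2)
print(len(train), len(holdout))
```

You would then train on ``train`` only and evaluate on ``holdout``, for example by passing it to ``top_topics`` or ``log_perplexity``.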

Where to go from here
---------------------

* Check out a RaRe blog post on the AKSW topic coherence measure (http://rare-technologies.com/what-is-topic-coherence/).
* pyLDAvis (https://pyldavis.readthedocs.io/en/latest/index.html).
* Read some more Gensim tutorials (https://github.com/RaRe-Technologies/gensim/blob/develop/tutorials.md#tutorials).
* If you haven't already, read [1] and [2] (see references).

References
----------

1. "Latent Dirichlet Allocation", Blei et al. 2003.
2. "Online Learning for Latent Dirichlet Allocation", Hoffman et al. 2010.


