In [1]:
%matplotlib inline


LDA Model
=========

Introduces Gensim's LDA model and demonstrates its use on the NIPS corpus.


In [2]:
#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

* More info on LDA theory:
* [Introduction to LDA](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation)
* [Gensim tutorial](sphx_glr_auto_examples_core_run_topics_and_transformations.py)

Data
----
* Consider using a corpus on a topic you know well. Evaluating an LDA model is challenging and can require you to understand the
subject matter.
* The NIPS (Neural Information Processing Systems) papers are well suited for this tutorial. You can download [Sam Roweis' dataset](http://www.cs.nyu.edu/~roweis/data.html). It contains 1740 short documents. This tutorial is not geared towards efficiency, so be careful before applying it to a large dataset.

In [1]:
import io, os.path, re, tarfile, smart_open, pprint

def extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):
    fname = url.split('/')[-1]
    
    # Download the file to local storage first.
    # We can't read it on the fly because of 
    # https://github.com/RaRe-Technologies/smart_open/issues/331

    if not os.path.isfile(fname):
        with smart_open.open(url, "rb") as fin:
            with smart_open.open(fname, 'wb') as fout:
                while True:
                    buf = fin.read(io.DEFAULT_BUFFER_SIZE)
                    if not buf:
                        break
                    fout.write(buf)
                         
    with tarfile.open(fname, mode='r:gz') as tar:
        # Ignore directory entries, as well as files like README, etc.
        files = [
            m for m in tar.getmembers()
            if m.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', m.name)
        ]
        for member in sorted(files, key=lambda x: x.name):
            member_bytes = tar.extractfile(member).read()
            yield member_bytes.decode('utf-8', errors='replace')

docs = list(extract_documents())



* If you're thinking about using your own corpus, ensure that it's in the same format (list of Unicode strings) before proceeding with the rest of this tutorial.
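
If your own corpus lives in a directory of plain-text files, a minimal sketch for loading it into the expected format could look like the following. The helper name `load_text_files`, the `*.txt` pattern, and the UTF-8 encoding are assumptions for illustration, not part of this tutorial.

```python
from pathlib import Path


def load_text_files(directory, pattern='*.txt', encoding='utf-8'):
    """Return the contents of matching files as a list of Unicode strings.

    Hypothetical helper: the directory layout and encoding are assumptions.
    """
    # Sort the paths so the document order is stable between runs.
    paths = sorted(Path(directory).glob(pattern))
    # errors='replace' mirrors the decoding used for the NIPS files above.
    return [p.read_text(encoding=encoding, errors='replace') for p in paths]
```

Calling `docs = load_text_files('my_corpus')` (with a directory name of your own) would then give a list that can be fed into the rest of the pipeline.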

In [3]:
print(len(docs))
print(docs[0][:250])

1740
1 
CONNECTIVITY VERSUS ENTROPY 
Yaser S. Abu-Mostafa 
California Institute of Technology 
Pasadena, CA 91125 
ABSTRACT 
How does the connectivity of a neural network (number of synapses per 
neuron) relate to the complexity of the problems it can han


Pre-process and vectorize the documents
---------------------------------------
* Tokenize (split the documents into tokens, using the NLTK tokenizer).
* Lemmatize the tokens.
* Compute bigrams.
* Compute a bag-of-words representation of the data.

In [4]:
# Tokenize the documents.
from nltk.tokenize import RegexpTokenizer

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]
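
To see what these three steps do, here is a small trace on a made-up sentence. It is a sketch that uses only the standard library: `re.findall(r'\w+', ...)` stands in for `RegexpTokenizer(r'\w+')`, on the assumption that both match the same pattern. The sample sentence is hypothetical.

```python
import re

sample = "In 1987, NIPS published 90 papers; a 2D grid has x and y axes."

# Lowercase, then split on anything that is not a word character.
tokens = re.findall(r'\w+', sample.lower())

# Remove pure numbers: drops '1987' and '90' but keeps '2d'.
tokens = [t for t in tokens if not t.isnumeric()]

# Remove one-character tokens: drops 'a', 'x' and 'y'.
tokens = [t for t in tokens if len(t) > 1]

print(tokens)
# ['in', 'nips', 'published', 'papers', '2d', 'grid', 'has', 'and', 'axes']
```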

* We use the **WordNet lemmatizer** from NLTK. A lemmatizer is preferred over a stemmer in this case because it produces more readable words. 
* Output that is easy to read is very desirable in topic modelling.

In [5]:
# Lemmatize the documents.
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

* We find **bigrams** in the documents. Bigrams are sets of two adjacent words.
Using bigrams we can get phrases like "machine_learning" in our output
(spaces are replaced with underscores); without bigrams we would only get
"machine" and "learning".

* Note: below, we find bigrams and add them to the original data, because we want to keep the words "machine" and "learning" as well as the bigram "machine_learning".

* Remember: Computing n-grams of large datasets is computationally & memory intensive.
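
The idea of appending bigrams while keeping the original words can be sketched in plain Python. This is a deliberately simplified illustration: it only counts raw adjacent pairs, whereas Gensim's `Phrases` also scores each pair against a threshold. The function name `add_frequent_bigrams` and the toy documents are hypothetical.

```python
from collections import Counter


def add_frequent_bigrams(docs, min_count=2):
    """Append 'w1_w2' tokens for adjacent pairs seen at least min_count times.

    Simplified illustration only: no scoring or threshold as in Phrases.
    """
    # Count every adjacent word pair across all documents.
    pair_counts = Counter(
        pair for doc in docs for pair in zip(doc, doc[1:])
    )
    result = []
    for doc in docs:
        bigrams = [
            f'{a}_{b}' for a, b in zip(doc, doc[1:])
            if pair_counts[(a, b)] >= min_count
        ]
        # Keep the original unigrams and add the bigrams at the end.
        result.append(doc + bigrams)
    return result


toy_docs = [
    ['machine', 'learning', 'is', 'fun'],
    ['we', 'study', 'machine', 'learning'],
]
print(add_frequent_bigrams(toy_docs))
```

Only the pair ("machine", "learning") occurs twice here, so each toy document gains a single `machine_learning` token and keeps both of its original words.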

In [6]:
import timeit

In [7]:
# Compute bigrams.
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)

* Remove rare words and common words based on their *document frequency*: words that appear in fewer than 20 documents or in more than 50% of the documents.
* Consider also trying to remove words based only on their **frequency**.

In [8]:
# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)
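
For intuition, document-frequency filtering can be sketched without Gensim. The sketch below filters the documents directly rather than a dictionary, and it ignores other options of `filter_extremes` such as `keep_n`. The function name and the toy documents are hypothetical.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least no_below documents and in at most
    a fraction no_above of all documents."""
    # set(doc) makes sure each token is counted once per document.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    keep = {t for t, df in doc_freq.items() if no_below <= df <= max_docs}
    return [[t for t in doc if t in keep] for doc in docs]


toy_docs = [
    ['the', 'neuron', 'fires'],
    ['the', 'neuron', 'spike'],
    ['the', 'kernel', 'trick'],
    ['the', 'kernel', 'spike'],
]
print(filter_by_document_frequency(toy_docs))
# [['neuron'], ['neuron', 'spike'], ['kernel'], ['kernel', 'spike']]
```

Here "the" is dropped because it appears in every document, while "fires" and "trick" are dropped because they appear in only one.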

* Transform documents to vector. 
* Compute the frequency of each word, including the bigrams.

In [9]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]
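
A bag-of-words vector is a list of `(token_id, count)` pairs. The following standard-library sketch shows the idea on toy data; the helper names `build_vocab` and `to_bow` are hypothetical, and the ids Gensim assigns to tokens may differ from the ones produced here.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_docs = [['neuron', 'spike', 'neuron'], ['kernel', 'spike']]
vocab = build_vocab(toy_docs)
print(vocab)                       # {'neuron': 0, 'spike': 1, 'kernel': 2}
print(to_bow(toy_docs[0], vocab))  # [(0, 2), (1, 1)]
print(to_bow(toy_docs[1], vocab))  # [(1, 1), (2, 1)]
```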

* Check how many tokens and documents we have to train on.

In [10]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 8644
Number of documents: 1740


Training
--------

We are ready to train the LDA model. We will first discuss how to set some of
the training parameters.

* First: **how many topics do I need?** There is no easy answer; it depends on your data and application. I have used **10 topics here** because I wanted a few topics that I could interpret and "label", and because that turned out to give
reasonably good results. You might not need to interpret all your topics, so
you could use a larger number of topics, for example 100.

* ``chunksize`` controls how many documents are processed at a time during training.
* Bigger chunk sizes speed up training, as long as the chunk of documents easily fits into memory.
* I've set ``chunksize = 2000``, which is more than the number of documents, so I process all the data in one go. Chunksize can, however, influence model quality (see Hoffman and co-authors [2]).
* ``passes``, aka "epochs", controls how often we train the model on the entire corpus.
* ``iterations`` controls how often we repeat a particular loop over each document. It is important to set both the number of passes and the number of iterations high enough.
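
As a rough piece of arithmetic, the number of chunks per pass follows from the corpus size and ``chunksize``. The sketch below uses the values from this tutorial; the alternative chunk size of 500 is a hypothetical value chosen only for comparison.

```python
import math

num_docs = 1740   # Size of the NIPS corpus used in this tutorial.
chunksize = 2000
passes = 20

# With chunksize larger than the corpus, each pass is a single chunk.
chunks_per_pass = math.ceil(num_docs / chunksize)
print(chunks_per_pass)           # 1

# Total number of chunks processed over the whole training run.
total_chunks = chunks_per_pass * passes
print(total_chunks)              # 20

# A smaller, hypothetical chunksize splits the corpus into several chunks.
smaller_chunks_per_pass = math.ceil(num_docs / 500)
print(smaller_chunks_per_pass)   # 4
```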

* **How to** choose iterations and passes: 
    * Enable logging and set ``eval_every = 1`` in ``LdaModel``.
    * When training, look in the log for a line like this:
    
        ```2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - 68/1566 documents converged within 400 iterations```
    * If you set ``passes = 20`` you will see this line 20 times. 
    * Make sure that by the final passes, most of the documents have converged. You want to choose both passes and iterations to be high enough for this to happen.
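
If you save the log to a file, you could extract the share of converged documents per pass with a small script. This is a sketch that assumes the log lines follow the format quoted above; the function name `convergence_ratios` is hypothetical.

```python
import re

# Matches e.g. "68/1566 documents converged within 400 iterations".
CONVERGED = re.compile(r'(\d+)/(\d+) documents converged within (\d+) iterations')


def convergence_ratios(log_lines):
    """Return the fraction of converged documents for each matching log line."""
    ratios = []
    for line in log_lines:
        match = CONVERGED.search(line)
        if match:
            converged, total = int(match.group(1)), int(match.group(2))
            ratios.append(converged / total)
    return ratios


sample_log = [
    '2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - '
    '68/1566 documents converged within 400 iterations',
]
print(convergence_ratios(sample_log))
```

For the sample line, only a small fraction of the documents has converged. If the ratios are still low in the final passes, that suggests increasing ``passes`` or ``iterations``.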

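* We set **alpha = 'auto'** and **eta = 'auto'**, so the model learns these two priors from the data instead of requiring us to specify them explicitly.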




In [11]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

* We can compute the **topic coherence** of each topic. Below we display the
average topic coherence & print the topics in order of topic coherence.

* Note: we use the "UMass" topic coherence measure. Gensim has also recently
gained an implementation of the "AKSW" topic coherence measure (see http://rare-technologies.com/what-is-topic-coherence/).

* If you find methods that give better topics, consider sharing them on the blog at http://rare-technologies.com/lda-training-tips/
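
To give a feel for what a UMass-style score measures, here is a standard-library sketch of the commonly cited formulation: for every pair of top words, take the log of the smoothed co-document count divided by the document count of the higher-ranked word, then sum. This is an illustration only; Gensim's implementation may differ in smoothing and in how pair scores are aggregated. The function name and toy documents are hypothetical.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """Sum log((D(w_m, w_l) + 1) / D(w_l)) over pairs of top words, where
    w_l is ranked above w_m and D counts documents containing all given words.

    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    # combinations() yields (higher-ranked, lower-ranked) pairs in list order.
    for w_l, w_m in combinations(top_words, 2):
        score += math.log((doc_count(w_m, w_l) + 1) / doc_count(w_l))
    return score


toy_docs = [
    ['neuron', 'spike', 'cell'],
    ['neuron', 'cell'],
    ['kernel', 'margin'],
    ['kernel', 'neuron'],
]
related = umass_coherence(['neuron', 'cell'], toy_docs)
unrelated = umass_coherence(['neuron', 'margin'], toy_docs)
print(related, unrelated)
```

In this toy corpus "neuron" and "cell" often share a document while "neuron" and "margin" never do, so the first pair gets the higher score.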

In [14]:
# Average topic coherence is the sum of the coherences of all topics, divided by the number of topics.

top_topics = model.top_topics(corpus)
avg_topic_coherence = sum(t[1] for t in top_topics) / num_topics

print('Average topic coherence: %.4f.' % avg_topic_coherence)

pprint.pprint(top_topics)

Average topic coherence: -1.1545.
[([(0.019677075, 'neuron'),
   (0.01687375, 'cell'),
   (0.007679801, 'stimulus'),
   (0.0075856154, 'response'),
   (0.0075718933, 'activity'),
   (0.0072249984, 'spike'),
   (0.006526222, 'synaptic'),
   (0.0054767067, 'firing'),
   (0.0049440907, 'connection'),
   (0.004776207, 'cortex'),
   (0.0045617186, 'signal'),
   (0.0044404753, 'frequency'),
   (0.00400868, 'visual'),
   (0.003995763, 'fig'),
   (0.003983469, 'cortical'),
   (0.0038858573, 'orientation'),
   (0.0038649454, 'noise'),
   (0.0036327932, 'potential'),
   (0.0035797937, 'field'),
   (0.0034529855, 'layer')],
  -0.8491954138363681),
 ([(0.0076544937, 'gaussian'),
   (0.0062915096, 'mixture'),
   (0.0060493224, 'density'),
   (0.005303372, 'likelihood'),
   (0.005155422, 'matrix'),
   (0.005063279, 'prior'),
   (0.005018555, 'component'),
   (0.0049988, 'estimate'),
   (0.004524429, 'bayesian'),
   (0.0041858866, 'class'),
   (0.004135563, 'sample'),
   (0.0041156034, 'log'),
   (0.

Resources
---------
* [AKSW topic coherence measure (RaRe)](http://rare-technologies.com/what-is-topic-coherence/).
* [pyLDAvis](https://pyldavis.readthedocs.io/en/latest/index.html).