Some code to understand how Latent Dirichlet Allocation (LDA) works and to see whether we can apply basic sketching techniques.  The sample documents are given below (from https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html).

In [1]:
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."

# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

The cleaning steps are (1) _tokenizing_: splitting a document into atomic elements (words), (2) _stopping_: removing common words that carry little meaning, and (3) _stemming_: reducing words that share a root to a common stem.

In [3]:
# 1. Tokenization
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

The `RegexpTokenizer` with the pattern `\w+` matches runs of word characters and stops at any non-word character, e.g. a space or a punctuation mark.  This is simple but splits words that contain apostrophes (for example "don't"), so it needs to be improved for more accurate tokenization.  The cell below tokenizes a single document; handling all documents requires a loop.

In [5]:
raw = doc_a.lower()
tokens = tokenizer.tokenize(raw)
print(tokens)

['brocolli', 'is', 'good', 'to', 'eat', 'my', 'brother', 'likes', 'to', 'eat', 'good', 'brocolli', 'but', 'not', 'my', 'mother']
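
The apostrophe issue noted above can be seen with the standard-library `re` module, which is used here only to mirror the `\w+` pattern.  The second pattern is a hypothetical alternative, not something the original notebook uses, and the example sentence is made up for illustration.

```python
import re

# Made-up sentence containing an apostrophe.
sentence = "My mother doesn't like brocolli."

# Same pattern as above: runs of word characters only.
simple = re.findall(r"\w+", sentence.lower())

# Hypothetical alternative: allow apostrophes inside a word.
with_apostrophes = re.findall(r"\w+(?:'\w+)*", sentence.lower())

print(simple)            # ['my', 'mother', 'doesn', 't', 'like', 'brocolli']
print(with_apostrophes)  # ['my', 'mother', "doesn't", 'like', 'brocolli']
```

Whether keeping "doesn't" whole is an improvement depends on the stop word list that is applied afterwards.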


In [6]:
# 2. Stopping - removing meaningless words
from stop_words import get_stop_words

# english stop words
en_stop = get_stop_words('en')

In [8]:
# removing stop words:
stopped_tokens = [word for word in tokens
                      if word not in en_stop]
print(stopped_tokens)

['brocolli', 'good', 'eat', 'brother', 'likes', 'eat', 'good', 'brocolli', 'mother']


In [9]:
# 3. Stemming - combining similar words

from nltk.stem.porter import PorterStemmer

# create p_stemmer, an instance of the PorterStemmer class
p_stemmer = PorterStemmer()

In [23]:
# p_stemmer requires tokens input as str type
texts = [p_stemmer.stem(word) for word in stopped_tokens]

print(texts)

['brocolli', 'good', 'eat', 'brother', 'like', 'eat', 'good', 'brocolli', 'mother']


In [25]:
# do full process in one function

def clean_document(doc):
    """
    Tokenizes, stops, and stems the doc.
    Returns the list of stemmed tokens."""
    # 1. tokenize the lower-cased text
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # 2. remove English stop words
    en_stop = get_stop_words('en')
    stopped_tokens = [word for word in tokens
                      if word not in en_stop]

    # 3. stem the remaining tokens
    p_stemmer = PorterStemmer()
    texts = [p_stemmer.stem(word) for word in stopped_tokens]

    return texts

In [29]:
# clean every document; the result is a list of token lists
clean_texts = [clean_document(doc) for doc in doc_set]
    
clean_texts

[['brocolli',
  'good',
  'eat',
  'brother',
  'like',
  'eat',
  'good',
  'brocolli',
  'mother'],
 ['mother',
  'spend',
  'lot',
  'time',
  'drive',
  'brother',
  'around',
  'basebal',
  'practic'],
 ['health',
  'expert',
  'suggest',
  'drive',
  'may',
  'caus',
  'increas',
  'tension',
  'blood',
  'pressur'],
 ['often',
  'feel',
  'pressur',
  'perform',
  'well',
  'school',
  'mother',
  'never',
  'seem',
  'drive',
  'brother',
  'better'],
 ['health', 'profession', 'say', 'brocolli', 'good', 'health']]

## Document Term Matrix
The result of the cleaning stage for a single document is `texts`, a tokenized, stopped, and stemmed list of words.  Applying the same steps to every document gives `clean_texts`, a list of lists with one token list per document.

To generate the LDA model we need to count how frequently each term occurs in each document by creating a document-term matrix using `gensim`.

In [41]:
import gensim
from gensim import corpora, models

dictionary = corpora.Dictionary(clean_texts)

In [33]:
print(dictionary)

Dictionary(32 unique tokens: ['brocolli', 'brother', 'eat', 'good', 'like']...)


The `Dictionary` class looks over all of `clean_texts` and assigns a unique integer id to each unique token (word) whilst also collecting word counts and statistics.  We can view each token's unique id as:

In [34]:
print(dictionary.token2id)

{'brocolli': 0, 'brother': 1, 'eat': 2, 'good': 3, 'like': 4, 'mother': 5, 'around': 6, 'basebal': 7, 'drive': 8, 'lot': 9, 'practic': 10, 'spend': 11, 'time': 12, 'blood': 13, 'caus': 14, 'expert': 15, 'health': 16, 'increas': 17, 'may': 18, 'pressur': 19, 'suggest': 20, 'tension': 21, 'better': 22, 'feel': 23, 'never': 24, 'often': 25, 'perform': 26, 'school': 27, 'seem': 28, 'well': 29, 'profession': 30, 'say': 31}
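
To make the id assignment concrete, here is a minimal pure-Python sketch that reproduces the ids printed above for the first two documents.  `build_token2id` is a hypothetical helper, not gensim's implementation; it assumes new tokens are numbered in alphabetical order within each document, which is consistent with the output above.

```python
def build_token2id(documents):
    """Hypothetical helper: give each unseen token the next free integer id."""
    token2id = {}
    for tokens in documents:
        # Tokens not seen in earlier documents, in alphabetical order.
        new_tokens = sorted({t for t in tokens if t not in token2id})
        for token in new_tokens:
            token2id[token] = len(token2id)
    return token2id

# First two cleaned documents, copied from clean_texts above.
docs = [
    ['brocolli', 'good', 'eat', 'brother', 'like', 'eat', 'good', 'brocolli', 'mother'],
    ['mother', 'spend', 'lot', 'time', 'drive', 'brother', 'around', 'basebal', 'practic'],
]

token2id = build_token2id(docs)
print(token2id)
```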


For later work, this might be the point at which we can apply hashing for a Count-Min sketch.  For now, however, we carry on and convert to a bag-of-words model.
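
As an aside for that later work: a Count-Min sketch replaces the exact token-to-id map with a small table of counters indexed by several hash functions, so memory stays fixed however many distinct tokens arrive.  The following is a minimal sketch only.  The class name, the `width` and `depth` values, and the use of salted `hashlib.blake2b` hashes are all assumptions for illustration, not part of the original notebook.

```python
import hashlib

class CountMinSketch:
    """Illustrative Count-Min sketch: depth rows of width counters each."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, token, row):
        # One hash function per row, obtained by salting with the row number.
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, token, count=1):
        for row in range(self.depth):
            self.table[row][self._index(token, row)] += count

    def estimate(self, token):
        # Collisions can only inflate a counter, so the minimum over rows
        # never underestimates the true count.
        return min(self.table[row][self._index(token, row)]
                   for row in range(self.depth))

# Cleaned tokens of doc_a, copied from the stemming output above.
doc_a_tokens = ['brocolli', 'good', 'eat', 'brother', 'like',
                'eat', 'good', 'brocolli', 'mother']

sketch = CountMinSketch()
for token in doc_a_tokens:
    sketch.add(token)

print(sketch.estimate('brocolli'))  # at least 2, the true count
```

The estimate is an upper bound on the true count; a wider table reduces the chance of collisions at the cost of memory.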

In [36]:
corpus = [dictionary.doc2bow(text) for text in clean_texts]

In [37]:
print(corpus[0])

[(0, 2), (1, 1), (2, 2), (3, 2), (4, 1), (5, 1)]


In [38]:
print(corpus)

[[(0, 2), (1, 1), (2, 2), (3, 2), (4, 1), (5, 1)], [(1, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)], [(8, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1)], [(1, 1), (5, 1), (8, 1), (19, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)], [(0, 1), (3, 1), (16, 2), (30, 1), (31, 1)]]


`corpus` is now a list of vectors, one per document.  Each vector is a list of tuples, and the list representing the first document is `corpus[0]`.  Each tuple is a `(term_id, term_frequency)` pair: since `dictionary.token2id['brocolli']` is 0, the first tuple `(0, 2)` indicates that `brocolli` appeared twice in `doc_a`.

`doc2bow()` returns a sparse representation: only terms that occur in the document are listed, and terms with a count of zero are omitted.

In [39]:
print(corpus[0])

[(0, 2), (1, 1), (2, 2), (3, 2), (4, 1), (5, 1)]
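
The sparse `(term_id, term_frequency)` form can be related to a full row of the document-term matrix with a few lines of plain Python.  This is an illustrative sketch: `to_bow` and `to_dense` are hypothetical helpers rather than gensim functions, and the ids are copied from the `token2id` output above.

```python
from collections import Counter

# Ids for the tokens of doc_a, copied from dictionary.token2id above.
token2id = {'brocolli': 0, 'brother': 1, 'eat': 2, 'good': 3, 'like': 4, 'mother': 5}
doc_a_tokens = ['brocolli', 'good', 'eat', 'brother', 'like',
                'eat', 'good', 'brocolli', 'mother']

def to_bow(tokens, token2id):
    """Hypothetical helper: sparse (term_id, count) pairs sorted by id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

def to_dense(bow, vocab_size):
    """Hypothetical helper: expand sparse pairs to a full-length count vector."""
    dense = [0] * vocab_size
    for term_id, count in bow:
        dense[term_id] = count
    return dense

bow = to_bow(doc_a_tokens, token2id)
print(bow)                           # [(0, 2), (1, 1), (2, 2), (3, 2), (4, 1), (5, 1)]
print(to_dense(bow, vocab_size=32))  # 32 entries, zero for every term not in doc_a
```

With 32 unique tokens, the dense row for `doc_a` has 26 zeros, which is why the sparse form is preferred.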


### Applying LDA
Now that we have generated `corpus`, which is a document-term matrix, we can fit the LDA model.

In [43]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3,
                                          id2word=dictionary, passes=20)

Parameters:
1. `num_topics` - _required in practice_. An LDA model needs the user to decide how many topics should be generated (this is part of the generative model); gensim falls back to a default of 100 if the argument is omitted.  This document set is small, so we only ask for three topics.
2. `id2word` - _needed here_. The `LdaModel` class uses the dictionary built earlier to map ids back to strings, so that topics are printed as words rather than integer ids.
3. `passes` - _optional_. The number of passes made over `corpus` during training.  For a later streaming application we will want this to be 1.

In [44]:
print(ldamodel.print_topics(num_topics=3, num_words=3))

[(0, '0.068*"drive" + 0.068*"lot" + 0.068*"spend"'), (1, '0.125*"health" + 0.050*"pressur" + 0.050*"suggest"'), (2, '0.074*"good" + 0.074*"brocolli" + 0.074*"brother"')]


The output is a list with one `(topic_id, description)` tuple per topic, and each description shows the three most probable words in that topic together with their weights.
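
The topic descriptions are plain strings, so the words and weights have to be parsed out before they can be used numerically.  A minimal sketch using the first topic string from the output above; `parse_topic` is a hypothetical helper (gensim also offers `show_topic`, which should return such pairs directly).

```python
import re

# First topic description, copied from the print_topics output above.
topic = '0.068*"drive" + 0.068*"lot" + 0.068*"spend"'

def parse_topic(description):
    """Hypothetical helper: split '0.068*"drive" + ...' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', description)
    return [(word, float(weight)) for weight, word in pairs]

print(parse_topic(topic))  # [('drive', 0.068), ('lot', 0.068), ('spend', 0.068)]
```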

In [45]:
# refit the model with two topics instead of three
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2,
                                          id2word=dictionary, passes=20)

In [46]:
print(ldamodel.print_topics(num_topics=2, num_words=4))

[(0, '0.082*"health" + 0.048*"mother" + 0.048*"brother" + 0.048*"drive"'), (1, '0.054*"good" + 0.054*"brocolli" + 0.053*"drive" + 0.053*"brother"')]


LDA comes from a generative model which assumes each document is built as follows:
1. Choose the number of words in the document.
2. Choose a mixture of topics for the document, i.e. the proportion of the document drawn from each of a fixed set of topics.
3. Fill each word slot by first picking a topic according to that mixture and then picking a word from that topic's word distribution.

LDA then works backwards from the observed words to infer which topics most plausibly generated them.
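
The generative story can be simulated directly.  The sketch below follows the standard formulation, in which each document draws its topic mixture from a Dirichlet distribution.  The two topics, their word probabilities, and the `alpha` values are made-up illustrative numbers, not values fitted by the model above, and the helper names are hypothetical.

```python
import random

# Hypothetical topic-word distributions (illustrative values, not fitted by LDA).
topics = {
    "food":   {"brocolli": 0.4, "good": 0.3, "eat": 0.3},
    "family": {"mother": 0.4, "brother": 0.4, "drive": 0.2},
}

def sample_dirichlet(alpha, rng):
    """Draw a Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, topics, alpha, rng):
    names = list(topics)
    # Step 2: topic mixture for this document.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Step 3: pick a topic from the mixture, then a word from that topic.
        topic = rng.choices(names, weights=theta, k=1)[0]
        dist = topics[topic]
        word = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        words.append(word)
    return theta, words

rng = random.Random(0)
# Step 1: the document length is fixed at 8 words here for simplicity.
theta, words = generate_document(8, topics, alpha=[0.5, 0.5], rng=rng)
print(theta)
print(words)
```

Fitting LDA amounts to inverting this process: given only the words, estimate the topic mixtures and the per-topic word distributions.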