In [1]:
%reload_kedro

2019-12-14 19:56:06,551 - root - INFO - ** Kedro project Dynamic Topic Modeling
2019-12-14 19:56:06,551 - root - INFO - Defined global variable `context` and `catalog`


## Data

In [2]:
docs = catalog.load("nips12raw")
print(len(docs))
print(docs[0][:500])

2019-12-14 19:56:07,177 - kedro.io.data_catalog - INFO - Loading data from `nips12raw` (TgzLocalDataSet)...
1740
1 
CONNECTIVITY VERSUS ENTROPY 
Yaser S. Abu-Mostafa 
California Institute of Technology 
Pasadena, CA 91125 
ABSTRACT 
How does the connectivity of a neural network (number of synapses per 
neuron) relate to the complexity of the problems it can handle (measured by 
the entropy)? Switching theory would suggest no relation at all, since all Boolean 
functions can be implemented using a circuit with very low connectivity (e.g., 
using two-input NAND gates). However, for a network that learns a pr


## Pre-process

- tokenize
- remove numbers, stop words and custom words, 1-character words
- lemmatize
- compute n-grams (bigrams)
- generate dictionary
- filter extremes
- convert corpus (bag-of-words)

In [13]:
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.models import Phrases
from gensim.corpora import Dictionary, MmCorpus

In [None]:
stop_words = set(stopwords.words('english'))
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = word_tokenize(docs[idx])  # Split into words.
    docs[idx] = [w for w in docs[idx] if w not in stop_words]  # Remove stop words.

In [14]:
stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.
    docs[idx] = [w for w in docs[idx] if w not in stop_words]  # Remove stop words.

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]
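
The cleaning steps in this cell can be tried on a single sentence without NLTK. This is a rough sketch under two assumptions: `re.findall(r"\w+", ...)` stands in for `RegexpTokenizer(r'\w+')`, and `toy_stop_words` is a hypothetical three-word replacement for the NLTK English stop-word list.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
toy_stop_words = {"the", "of", "a"}

text = "The entropy of a 2n network x in 1987"

tokens = re.findall(r"\w+", text.lower())                 # lowercase, then keep runs of word characters
tokens = [t for t in tokens if t not in toy_stop_words]   # remove stop words
tokens = [t for t in tokens if not t.isnumeric()]         # drops '1987' but keeps '2n'
tokens = [t for t in tokens if len(t) > 1]                # drops the one-character token 'x'

print(tokens)
```

Tokens that mix digits and letters, such as `2n`, survive the numeric filter, which is why entries like `'2n'` and `'a2'` still appear in the dictionary built later.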

In [15]:
# Lemmatize the documents
lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

In [16]:
# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)

2019-12-14 19:17:56,668 - gensim.models.phrases - INFO - collecting all words and their counts
2019-12-14 19:17:56,669 - gensim.models.phrases - INFO - PROGRESS: at sentence #0, processed 0 words and 0 word types
2019-12-14 19:18:00,728 - gensim.models.phrases - INFO - collected 1398637 word types from a corpus of 2871355 words (unigram + bigrams) and 1740 sentences
2019-12-14 19:18:00,729 - gensim.models.phrases - INFO - using 1398637 counts as vocab in Phrases<0 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000>
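
The log above reports `min_count=20` and `threshold=10.0`. As a hedged illustration of how these two values interact, the sketch below assumes gensim's default scoring formula for `Phrases`, `(count_ab - min_count) / count_a / count_b * vocab_size`; the function name `bigram_score` and all counts are made up for this example and are not taken from the NIPS corpus.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count=20):
    """Score a candidate bigram under the assumed default formula."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

# Made-up counts, for illustration only.
vocab_size = 10_000
score_frequent_pair = bigram_score(count_a=500, count_b=400, count_ab=300, vocab_size=vocab_size)
score_rare_pair = bigram_score(count_a=500, count_b=400, count_ab=25, vocab_size=vocab_size)

threshold = 10.0
print(score_frequent_pair, score_frequent_pair > threshold)  # pair would be joined with '_'
print(score_rare_pair, score_rare_pair > threshold)          # pair would be left as two tokens
```

Under this assumption, a pair must both co-occur well above `min_count` and be frequent relative to its two component words before it is appended to the document as a `word_word` token.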


In [17]:
# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

2019-12-14 19:18:09,196 - gensim.corpora.dictionary - INFO - adding document #0 to Dictionary(0 unique tokens: [])
2019-12-14 19:18:11,445 - gensim.corpora.dictionary - INFO - built Dictionary(79585 unique tokens: ['0a', '2h', '2h2', '2he', '2n']...) from 1740 documents (total 3129442 corpus positions)
2019-12-14 19:18:11,593 - gensim.corpora.dictionary - INFO - discarding 70959 tokens: [('0a', 19), ('2h', 16), ('2h2', 1), ('2he', 3), ('__c', 2), ('_k', 6), ('abstract', 1740), ('alently', 2), ('also', 1630), ('arned', 2)]...
2019-12-14 19:18:11,594 - gensim.corpora.dictionary - INFO - keeping 8626 tokens which were in no less than 20 and no more than 870 (=50.0%) documents
2019-12-14 19:18:11,619 - gensim.corpora.dictionary - INFO - resulting dictionary: Dictionary(8626 unique tokens: ['2n', '_c', 'a2', 'ability', 'abu']...)
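
The log reports that tokens are kept when they appear in no fewer than 20 and no more than 870 (=50.0%) documents. A minimal pure-Python sketch of that document-frequency rule follows. It is not gensim's implementation: the helper `filter_by_document_frequency` and the toy documents are hypothetical, and other `filter_extremes` options are ignored.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Return the tokens whose document frequency lies within both thresholds."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))      # count each token once per document
    max_docs = no_above * len(docs)    # convert the fraction to a number of documents
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Hypothetical toy corpus: 'abstract' is in every document, 'kernel' and 'bound' in one each.
toy_docs = [["neuron", "network", "abstract"],
            ["network", "abstract", "kernel"],
            ["neuron", "abstract", "bound"],
            ["network", "abstract", "neuron"]]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))

# The same conversion for this notebook: 50% of 1740 documents.
print(0.5 * 1740)
```

This mirrors why `abstract` (present in all 1740 documents) and rare tokens such as `2h2` are listed among the discarded tokens in the log.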


In [18]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]
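
For readers unfamiliar with the bag-of-words format, here is a small pure-Python sketch of what a `(token_id, count)` representation looks like. The vocabulary `token2id` and the document are hypothetical, and `Counter` is used in place of gensim's own counting.

```python
from collections import Counter

token2id = {"network": 0, "neuron": 1, "bound": 2}   # hypothetical vocabulary
doc = ["neuron", "network", "neuron", "kernel"]       # 'kernel' is not in the vocabulary

# Count only the tokens that have an id, then sort by id.
counts = Counter(token2id[tok] for tok in doc if tok in token2id)
bow = sorted(counts.items())
print(bow)
```

Tokens removed from the dictionary by the filtering step are dropped here in the same way, so each document becomes a sparse list of counts over the 8626 retained tokens.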

In [19]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 8626
Number of documents: 1740


In [20]:
MmCorpus.serialize('./data/05_model_input/nips12.mm', corpus)
dictionary.save('./data/05_model_input/nips12.dict')

2019-12-14 19:18:24,820 - gensim.corpora.mmcorpus - INFO - storing corpus in Matrix Market format to ./data/05_model_input/nips12.mm
2019-12-14 19:18:24,823 - gensim.matutils - INFO - saving sparse matrix to ./data/05_model_input/nips12.mm
2019-12-14 19:18:24,824 - gensim.matutils - INFO - PROGRESS: saving document #0
2019-12-14 19:18:25,319 - gensim.matutils - INFO - PROGRESS: saving document #1000
2019-12-14 19:18:25,722 - gensim.matutils - INFO - saved 1740x8626 matrix, density=6.068% (910831/15009240)
2019-12-14 19:18:25,724 - gensim.corpora.indexedcorpus - INFO - saving MmCorpus index to ./data/05_model_input/nips12.mm.index
2019-12-14 19:18:25,725 - gensim.utils - INFO - saving Dictionary object under ./data/05_model_input/nips12.dict, separately None
2019-12-14 19:18:25,729 - gensim.utils - INFO - saved ./data/05_model_input/nips12.dict
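
The density figure in the log can be checked by hand from the three numbers it reports. This is plain arithmetic on the logged values, not a gensim call.

```python
# Values copied from the log line "saved 1740x8626 matrix, density=6.068% (910831/15009240)".
num_docs, num_terms, non_zero = 1740, 8626, 910831

cells = num_docs * num_terms   # size of the full document-term matrix
density = non_zero / cells     # fraction of cells holding a non-zero count

print(cells)
print(f"{density:.3%}")
```

Roughly 94% of the matrix is empty, which is why the corpus is stored in the sparse Matrix Market format.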


## Training

In [1]:
from gensim.corpora import Dictionary, MmCorpus

corpus = MmCorpus('./data/05_model_input/nips12.mm')
dictionary = Dictionary.load('./data/05_model_input/nips12.dict')

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

2019-12-14 19:18:31,651 - gensim.summarization.textcleaner - INFO - 'pattern' package not found; tag filters are not available for English


In [2]:
from gensim.models import LdaModel

2019-12-14 19:18:31,666 - gensim.corpora.indexedcorpus - INFO - loaded corpus index from ./data/05_model_input/nips12.mm.index
2019-12-14 19:18:31,667 - gensim.corpora._mmreader - INFO - initializing cython corpus reader from ./data/05_model_input/nips12.mm
2019-12-14 19:18:31,668 - gensim.corpora._mmreader - INFO - accepted corpus with 1740 documents, 8626 features, 910831 non-zero entries
2019-12-14 19:18:31,669 - gensim.utils - INFO - loading Dictionary object from ./data/05_model_input/nips12.dict
2019-12-14 19:18:31,675 - gensim.utils - INFO - loaded ./data/05_model_input/nips12.dict


In [4]:
# Set training parameters.
num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

2019-12-14 19:18:35,756 - gensim.models.ldamodel - INFO - using autotuned alpha, starting with [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
2019-12-14 19:18:35,760 - gensim.models.ldamodel - INFO - using serial LDA version on this node
2019-12-14 19:18:35,796 - gensim.models.ldamodel - INFO - running online (multi-pass) LDA training, 10 topics, 20 passes over the supplied corpus of 1740 documents, updating model once every 1740 documents, evaluating perplexity every 0 documents, iterating 400x with a convergence threshold of 0.001000
2019-12-14 19:18:36,171 - gensim.models.ldamodel - INFO - PROGRESS: pass 0, at document #1740/1740
2019-12-14 19:18:50,715 - gensim.models.ldamodel - INFO - optimized alpha [0.062012203, 0.065208, 0.08343, 0.07865155, 0.06288165, 0.07505426, 0.10154667, 0.063270524, 0.09782959, 0.06569147]
2019-12-14 19:18:50,727 - gensim.models.ldamodel - INFO - topic #0 (0.062): 0.004*"circuit" + 0.004*"field" + 0.003*"sample" + 0.003*"approximation" + 0.003*"thres

2019-12-14 19:19:19,826 - gensim.models.ldamodel - INFO - topic #2 (0.055): 0.006*"image" + 0.006*"class" + 0.005*"layer" + 0.005*"hidden" + 0.004*"net" + 0.003*"classifier" + 0.003*"solution" + 0.003*"kernel" + 0.003*"generalization" + 0.003*"machine"
2019-12-14 19:19:19,827 - gensim.models.ldamodel - INFO - topic diff=0.211397, rho=0.408248
2019-12-14 19:19:20,221 - gensim.models.ldamodel - INFO - PROGRESS: pass 5, at document #1740/1740
2019-12-14 19:19:25,138 - gensim.models.ldamodel - INFO - optimized alpha [0.050862856, 0.05159408, 0.052995816, 0.04742664, 0.03467562, 0.043895956, 0.04950838, 0.042020608, 0.044184774, 0.04448872]
2019-12-14 19:19:25,146 - gensim.models.ldamodel - INFO - topic #4 (0.035): 0.011*"rule" + 0.008*"cell" + 0.008*"object" + 0.008*"image" + 0.008*"source" + 0.007*"component" + 0.007*"signal" + 0.005*"matrix" + 0.005*"independent" + 0.005*"ica"
2019-12-14 19:19:25,147 - gensim.models.ldamodel - INFO - topic #7 (0.042): 0.007*"classifier" + 0.006*"classifi

2019-12-14 19:19:45,713 - gensim.models.ldamodel - INFO - topic #0 (0.056): 0.006*"bound" + 0.006*"approximation" + 0.005*"let" + 0.004*"theorem" + 0.004*"gaussian" + 0.004*"sample" + 0.004*"generalization" + 0.004*"matrix" + 0.004*"noise" + 0.004*"estimate"
2019-12-14 19:19:45,714 - gensim.models.ldamodel - INFO - topic diff=0.171031, rho=0.301511
2019-12-14 19:19:46,086 - gensim.models.ldamodel - INFO - PROGRESS: pass 10, at document #1740/1740
2019-12-14 19:19:50,274 - gensim.models.ldamodel - INFO - optimized alpha [0.05736358, 0.052264597, 0.047579166, 0.041526094, 0.02933066, 0.039997395, 0.04562452, 0.042081386, 0.039410762, 0.04161426]
2019-12-14 19:19:50,282 - gensim.models.ldamodel - INFO - topic #4 (0.029): 0.011*"rule" + 0.011*"component" + 0.010*"source" + 0.009*"image" + 0.009*"signal" + 0.008*"object" + 0.008*"matrix" + 0.007*"independent" + 0.006*"ica" + 0.006*"cell"
2019-12-14 19:19:50,283 - gensim.models.ldamodel - INFO - topic #8 (0.039): 0.013*"memory" + 0.008*"dist

2019-12-14 19:20:08,265 - gensim.models.ldamodel - INFO - topic #0 (0.063): 0.006*"bound" + 0.006*"approximation" + 0.005*"let" + 0.004*"theorem" + 0.004*"noise" + 0.004*"matrix" + 0.004*"generalization" + 0.004*"sample" + 0.004*"gaussian" + 0.004*"estimate"
2019-12-14 19:20:08,266 - gensim.models.ldamodel - INFO - topic diff=0.129984, rho=0.250000
2019-12-14 19:20:08,641 - gensim.models.ldamodel - INFO - PROGRESS: pass 15, at document #1740/1740
2019-12-14 19:20:12,504 - gensim.models.ldamodel - INFO - optimized alpha [0.06459158, 0.054765593, 0.046663716, 0.039999172, 0.027550472, 0.039295692, 0.046163477, 0.045692574, 0.04004167, 0.041262783]
2019-12-14 19:20:12,512 - gensim.models.ldamodel - INFO - topic #4 (0.028): 0.013*"component" + 0.011*"source" + 0.010*"image" + 0.010*"signal" + 0.009*"matrix" + 0.009*"rule" + 0.007*"independent" + 0.007*"face" + 0.007*"object" + 0.006*"ica"
2019-12-14 19:20:12,513 - gensim.models.ldamodel - INFO - topic #5 (0.039): 0.009*"node" + 0.008*"hidd

2019-12-14 19:20:30,550 - gensim.models.ldamodel - INFO - topic #0 (0.070): 0.006*"bound" + 0.006*"approximation" + 0.005*"let" + 0.005*"theorem" + 0.004*"noise" + 0.004*"estimate" + 0.004*"optimal" + 0.004*"matrix" + 0.004*"generalization" + 0.004*"sample"
2019-12-14 19:20:30,550 - gensim.models.ldamodel - INFO - topic diff=0.096100, rho=0.218218
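
The training log says the model is updated "once every 1740 documents". A short arithmetic sketch of why, assuming one model update per chunk (the serial, single-node case reported in the log):

```python
import math

# Values from the training cell and the log.
num_docs, chunksize, passes = 1740, 2000, 20

chunks_per_pass = math.ceil(num_docs / chunksize)   # chunksize exceeds the corpus, so one chunk
total_updates = chunks_per_pass * passes

print(chunks_per_pass, total_updates)
```

Because `chunksize` is larger than the corpus, each pass processes all documents in a single chunk, so the 20 passes correspond to 20 updates under this assumption.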


In [5]:
model.save('./data/06_models/nips12.model')

2019-12-14 19:20:30,612 - gensim.utils - INFO - saving LdaState object under ./data/06_models/nips12.model.state, separately None
2019-12-14 19:20:30,617 - gensim.utils - INFO - saved ./data/06_models/nips12.model.state
2019-12-14 19:20:30,622 - gensim.utils - INFO - saving LdaModel object under ./data/06_models/nips12.model, separately ['expElogbeta', 'sstats']
2019-12-14 19:20:30,623 - gensim.utils - INFO - storing np array 'expElogbeta' to ./data/06_models/nips12.model.expElogbeta.npy
2019-12-14 19:20:30,627 - gensim.utils - INFO - not storing attribute state
2019-12-14 19:20:30,627 - gensim.utils - INFO - not storing attribute dispatcher
2019-12-14 19:20:30,628 - gensim.utils - INFO - not storing attribute id2word
2019-12-14 19:20:30,629 - gensim.utils - INFO - saved ./data/06_models/nips12.model


## Predict

In [1]:
from gensim.corpora import Dictionary, MmCorpus

corpus = MmCorpus('./data/05_model_input/nips12.mm')
dictionary = Dictionary.load('./data/05_model_input/nips12.dict')

2019-12-14 19:05:55,963 - gensim.summarization.textcleaner - INFO - 'pattern' package not found; tag filters are not available for English


In [7]:
from gensim.models import LdaModel

# Training parameters of the saved model; only num_topics is used below.
num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

model = LdaModel.load('./data/06_models/nips12.model')

2019-12-14 19:06:48,584 - gensim.corpora.indexedcorpus - INFO - loaded corpus index from ./data/05_model_input/nips12.mm.index
2019-12-14 19:06:48,585 - gensim.corpora._mmreader - INFO - initializing cython corpus reader from ./data/05_model_input/nips12.mm
2019-12-14 19:06:48,586 - gensim.corpora._mmreader - INFO - accepted corpus with 1740 documents, 8644 features, 951904 non-zero entries
2019-12-14 19:06:48,587 - gensim.utils - INFO - loading Dictionary object from ./data/05_model_input/nips12.dict
2019-12-14 19:06:48,592 - gensim.utils - INFO - loaded ./data/05_model_input/nips12.dict
2019-12-14 19:06:48,593 - gensim.utils - INFO - loading LdaModel object from ./data/06_models/nips12.model
2019-12-14 19:06:48,595 - gensim.utils - INFO - loading expElogbeta from ./data/06_models/nips12.model.expElogbeta.npy with mmap=None
2019-12-14 19:06:48,596 - gensim.utils - INFO - setting ignored attribute dispatcher to None
2019-12-14 19:06:48,597 - gensim.utils - INFO - setting ignored attrib

In [8]:
top_topics = model.top_topics(corpus) #, num_words=20)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

2019-12-14 19:06:49,163 - gensim.topic_coherence.text_analysis - INFO - CorpusAccumulator accumulated stats from 1000 documents
Average topic coherence: -1.1306.
[([(0.011300996, 'hidden'),
   (0.010701707, 'layer'),
   (0.009174217, 'recognition'),
   (0.008248592, 'word'),
   (0.00815106, 'speech'),
   (0.0074451244, 'net'),
   (0.0059543457, 'hidden_unit'),
   (0.0052499757, 'trained'),
   (0.0047664186, 'architecture'),
   (0.004440651, 'node'),
   (0.0042951847, 'back'),
   (0.004026782, 'propagation'),
   (0.0037917835, 'signal'),
   (0.0033209438, 'classification'),
   (0.0032666498, 'rule'),
   (0.003175322, 'speaker'),
   (0.003158244, 'back_propagation'),
   (0.0030343765, 'table'),
   (0.0029238581, 'activation'),
   (0.0029038575, 'class')],
  -0.9429485558755631),
 ([(0.008398214, 'action'),
   (0.007676156, 'policy'),
   (0.0051605855, 'optimal'),
   (0.004469503, 'reinforcement'),
   (0.0039273654, 'rule'),
   (0.0038814063, 'markov'),
   (0.0037810968, 'bayesian'),
   (
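
The averaging step can be illustrated on a toy list with the same shape as the output above, a list of `(topic_terms, coherence)` pairs. The values in `toy_top_topics` are hypothetical, and the sketch divides by `len(...)` instead of a separate `num_topics` variable so the two cannot drift apart.

```python
# Hypothetical (topic_terms, coherence) pairs in the shape printed above.
toy_top_topics = [
    ([(0.011, "hidden"), (0.010, "layer")], -0.9),
    ([(0.008, "action"), (0.007, "policy")], -1.3),
]

avg = sum(score for _, score in toy_top_topics) / len(toy_top_topics)
print(f"Average topic coherence: {avg:.4f}")
```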

## Dataviz

In [1]:
from gensim.corpora import Dictionary, MmCorpus

corpus = MmCorpus('./data/05_model_input/nips12.mm')
dictionary = Dictionary.load('./data/05_model_input/nips12.dict')

2019-12-14 19:20:41,442 - gensim.summarization.textcleaner - INFO - 'pattern' package not found; tag filters are not available for English
2019-12-14 19:20:41,446 - gensim.corpora.indexedcorpus - INFO - loaded corpus index from ./data/05_model_input/nips12.mm.index
2019-12-14 19:20:41,446 - gensim.corpora._mmreader - INFO - initializing cython corpus reader from ./data/05_model_input/nips12.mm
2019-12-14 19:20:41,448 - gensim.corpora._mmreader - INFO - accepted corpus with 1740 documents, 8626 features, 910831 non-zero entries
2019-12-14 19:20:41,448 - gensim.utils - INFO - loading Dictionary object from ./data/05_model_input/nips12.dict
2019-12-14 19:20:41,453 - gensim.utils - INFO - loaded ./data/05_model_input/nips12.dict
2019-12-14 19:20:41,454 - gensim.utils - INFO - loading LdaModel object from ./data/06_models/nips12.model
2019-12-14 19:20:41,455 - gensim.utils - INFO - loading expElogbeta from ./data/06_models/nips12.model.expElogbeta.npy with mmap=None
2019-12-14 19:20:41,462 

In [None]:
from gensim.models import LdaModel

# Set training parameters.
num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

model = LdaModel.load('./data/06_models/nips12.model')

In [2]:
import pyLDAvis
import pyLDAvis.gensim as gensimvis
pyLDAvis.enable_notebook()



In [3]:
vis_data = gensimvis.prepare(model, corpus, dictionary, mds='pcoa') # MDS available : pcoa, tsne, mmds
pyLDAvis.display(vis_data)

2019-12-14 19:20:48,309 - numexpr.utils - INFO - NumExpr defaulting to 8 threads.


