# Example of GibbsLDA and vbLDA

This example requires three NLTK corpora: `nltk.corpus.reuters`, `nltk.corpus.words`, and `nltk.corpus.stopwords`.

You can download them with `nltk.download()`.

In [1]:
import os
import sys
PACKAGE = os.path.abspath('../python-topic-model')
# print(PACKAGE)
sys.path.append(PACKAGE)

In [2]:
import logging

import numpy as np
from ptm import GibbsLDA
from ptm import vbLDA
from ptm.nltk_corpus import get_reuters_ids_cnt
from ptm.utils import convert_cnt_to_list, get_top_words

## Loading the Reuters corpus from NLTK

Load 1000 documents from the NLTK Reuters corpus, with a maximum vocabulary size of 10000.

In [3]:
# Download the required corpora with NLTK
import nltk
for resource in ['reuters','words','stopwords']:
    nltk.download(resource)

[nltk_data] Downloading package reuters to /Users/YuLong/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package words to /Users/YuLong/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/YuLong/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
n_doc = 1000
voca, doc_ids, doc_cnt = get_reuters_ids_cnt(num_doc=n_doc, max_voca=10000)
docs = convert_cnt_to_list(doc_ids, doc_cnt)
n_voca = len(voca)
print('Vocabulary size:%d' % n_voca)

Vocabulary size:4629
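
The cell above keeps the corpus in two layouts: `doc_ids`/`doc_cnt` (passed to `vbLDA.fit` later) and `docs` (passed to `GibbsLDA.fit`). The sketch below illustrates how the first could be expanded into the second on a toy input. `expand_counts` is a hypothetical helper written for this note; it assumes each document is stored as unique word ids plus per-word counts, and it is not the library's own code.

```python
def expand_counts(word_ids, word_cnt):
    """Expand (unique word ids, counts) pairs into flat token lists.

    Hypothetical stand-in for illustration: assumes word_ids[d][i]
    occurs word_cnt[d][i] times in document d.
    """
    corpus = []
    for ids, counts in zip(word_ids, word_cnt):
        doc = []
        for word_id, count in zip(ids, counts):
            # Repeat the word id once per occurrence.
            doc.extend([word_id] * count)
        corpus.append(doc)
    return corpus


# Toy input: document 0 contains word 3 twice and word 7 once.
toy_ids = [[3, 7], [1]]
toy_cnt = [[2, 1], [4]]
toy_docs = expand_counts(toy_ids, toy_cnt)
print(toy_docs)  # [[3, 3, 7], [1, 1, 1, 1]]
```

The count layout is more compact, while the token layout gives a sampler one slot per token to attach a topic assignment to.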


In [25]:
# voca: vocabulary array (4629 words)
# n_doc: number of documents (1000)
voca, n_doc

(array(['would', 'nil', 'last', ..., 'receivables', 'recoverable', 'elect'],
       dtype='<U16'),
 1000)

## Inference through Gibbs sampling

In [5]:
max_iter = 100  # number of iterations
n_topic = 10    # number of topics

logger = logging.getLogger('GibbsLDA')
logger.propagate = False

model = GibbsLDA(n_doc, len(voca), n_topic)
model.fit(docs, max_iter=max_iter)

2021-07-14 00:11:19 INFO:GibbsLDA:[ITER] 0,	elapsed time:0.54,	log_likelihood:-447583.70
2021-07-14 00:11:20 INFO:GibbsLDA:[ITER] 1,	elapsed time:0.54,	log_likelihood:-421115.19
2021-07-14 00:11:20 INFO:GibbsLDA:[ITER] 2,	elapsed time:0.54,	log_likelihood:-404051.47
2021-07-14 00:11:21 INFO:GibbsLDA:[ITER] 3,	elapsed time:0.56,	log_likelihood:-392140.18
2021-07-14 00:11:21 INFO:GibbsLDA:[ITER] 4,	elapsed time:0.58,	log_likelihood:-383707.46
2021-07-14 00:11:22 INFO:GibbsLDA:[ITER] 5,	elapsed time:0.54,	log_likelihood:-377761.32
2021-07-14 00:11:22 INFO:GibbsLDA:[ITER] 6,	elapsed time:0.54,	log_likelihood:-372990.73
2021-07-14 00:11:23 INFO:GibbsLDA:[ITER] 7,	elapsed time:0.54,	log_likelihood:-369235.16
2021-07-14 00:11:24 INFO:GibbsLDA:[ITER] 8,	elapsed time:0.54,	log_likelihood:-366877.99
2021-07-14 00:11:24 INFO:GibbsLDA:[ITER] 9,	elapsed time:0.54,	log_likelihood:-363952.43
2021-07-14 00:11:25 INFO:GibbsLDA:[ITER] 10,	elapsed time:0.54,	log_likelihood:-362131.33
...

2021-07-14 00:12:12 INFO:GibbsLDA:[ITER] 92,	elapsed time:0.54,	log_likelihood:-343175.38
2021-07-14 00:12:12 INFO:GibbsLDA:[ITER] 93,	elapsed time:0.54,	log_likelihood:-343129.88
2021-07-14 00:12:13 INFO:GibbsLDA:[ITER] 94,	elapsed time:0.55,	log_likelihood:-343006.38
2021-07-14 00:12:13 INFO:GibbsLDA:[ITER] 95,	elapsed time:0.54,	log_likelihood:-343191.66
2021-07-14 00:12:14 INFO:GibbsLDA:[ITER] 96,	elapsed time:0.54,	log_likelihood:-343006.92
2021-07-14 00:12:14 INFO:GibbsLDA:[ITER] 97,	elapsed time:0.54,	log_likelihood:-342921.05
2021-07-14 00:12:15 INFO:GibbsLDA:[ITER] 98,	elapsed time:0.55,	log_likelihood:-342788.71
2021-07-14 00:12:16 INFO:GibbsLDA:[ITER] 99,	elapsed time:0.54,	log_likelihood:-342968.25
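
For readers who want to see what a sampler of this kind does on each iteration, here is a minimal sketch of collapsed Gibbs sampling for LDA on a toy corpus. It is an illustration under stated assumptions, not the `GibbsLDA` implementation: the function name `gibbs_lda_sketch`, the hyperparameters `alpha` and `beta`, and the count-matrix layout are choices made for this note (the name `TW` is borrowed from the `model.TW` attribute used below).

```python
import numpy as np


def gibbs_lda_sketch(docs, n_voca, n_topic, n_iter=20, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of token-id lists (illustrative)."""
    rng = np.random.default_rng(seed)
    DT = np.zeros((len(docs), n_topic))  # document-topic counts
    TW = np.zeros((n_topic, n_voca))     # topic-word counts
    sum_T = np.zeros(n_topic)            # total number of tokens per topic

    # Give every token a random initial topic and record it in the counts.
    assign = []
    for d, doc in enumerate(docs):
        z_doc = rng.integers(n_topic, size=len(doc))
        for w, z in zip(doc, z_doc):
            DT[d, z] += 1
            TW[z, w] += 1
            sum_T[z] += 1
        assign.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the token's current assignment from the counts.
                DT[d, z] -= 1
                TW[z, w] -= 1
                sum_T[z] -= 1
                # p(z = k | everything else) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                prob = (DT[d] + alpha) * (TW[:, w] + beta) / (sum_T + n_voca * beta)
                z = rng.choice(n_topic, p=prob / prob.sum())
                # Add the token back under the newly sampled topic.
                assign[d][i] = z
                DT[d, z] += 1
                TW[z, w] += 1
                sum_T[z] += 1
    return DT, TW


# Toy corpus: documents 0-1 use words {0, 1}, documents 2-3 use words {2, 3}.
toy_docs = [[0, 0, 1, 1], [0, 1, 0], [2, 3, 3, 2], [3, 2, 2]]
DT, TW = gibbs_lda_sketch(toy_docs, n_voca=4, n_topic=2)
print(TW)
```

Because every sweep draws new random assignments, the log-likelihood of a sampler like this need not rise at every step; the log above shows the same behaviour, for example the drop between iterations 94 and 95.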


### Print the top 15 words for each topic

In [6]:
# Topics may or may not be interpretable, and topic indices change between runs.
# In the output shown below:
# Topic 0 looks like oil, gas and acquisitions
# Topic 1 looks like earnings news
# Topic 2 looks like Japan, trade and currency

for ti in range(n_topic):
    top_words = get_top_words(model.TW, voca, ti, n_words=15)
    print('Topic', ti ,': ', ','.join(top_words))

Topic 0 :  corp,gas,unit,also,acquisition,oil,group,sale,purchase,gold,statement,subsidiary,acquired,new,international
Topic 1 :  loss,quarter,first,profit,corp,share,note,earnings,tax,income,gain,bank,ago,six,march
Topic 2 :  japan,trade,dollar,yen,deficit,japanese,west,currency,economic,meeting,market,surplus,paris,around,economy
Topic 3 :  would,also,oil,government,growth,debt,added,exchange,long,five,time,trading,economic,term,demand
Topic 4 :  week,last,february,rise,march,rose,fell,foreign,lower,january,recent,increase,coffee,trade,new
Topic 5 :  would,business,new,industry,chairman,record,management,market,told,american,agreement,last,production,plant,union
Topic 6 :  bank,rate,market,money,fed,days,policy,federal,two,monetary,today,interest,reserve,cut,day
Topic 7 :  stock,share,april,offer,dividend,may,one,corp,dome,record,common,pay,prior,div,split
Topic 8 :  nil,stocks,wheat,last,production,month,department,grain,crop,total,agriculture,end,corn,start,use
Topic 9 :  price,sug
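
Ranking the words of a topic amounts to sorting one row of a topic-word matrix. The sketch below shows one way to do that with NumPy on a toy matrix; `top_words_sketch` is a hypothetical function written for this note, assuming the matrix has one row per topic and one column per vocabulary entry, and it is not necessarily how `get_top_words` is implemented.

```python
import numpy as np


def top_words_sketch(topic_word, vocab, topic, n_words=10):
    """Return the n_words vocabulary entries with the largest weight in one topic row."""
    vocab = np.asarray(vocab)
    # argsort is ascending, so reverse it to get descending weights.
    order = np.argsort(topic_word[topic])[::-1]
    return vocab[order[:n_words]]


toy_vocab = ['oil', 'gas', 'bank', 'rate']
toy_tw = np.array([[5.0, 3.0, 0.0, 1.0],
                   [0.0, 1.0, 4.0, 6.0]])
print(','.join(top_words_sketch(toy_tw, toy_vocab, 0, n_words=2)))  # oil,gas
print(','.join(top_words_sketch(toy_tw, toy_vocab, 1, n_words=2)))  # rate,bank
```

The ranking is the same whether the rows hold raw counts or normalized probabilities, since normalizing a row does not change the order of its entries.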

## Inference through variational Bayes

In [7]:
logger = logging.getLogger('vbLDA')
logger.propagate = False

vbmodel = vbLDA(n_doc, n_voca, n_topic)
vbmodel.fit(doc_ids, doc_cnt, max_iter=max_iter)

2021-07-14 00:12:16 INFO:vbLDA:[ITER] 0,	elapsed time:0.53,	ELBO:-478165.30
2021-07-14 00:12:17 INFO:vbLDA:[ITER] 1,	elapsed time:0.51,	ELBO:-423168.55
2021-07-14 00:12:17 INFO:vbLDA:[ITER] 2,	elapsed time:0.50,	ELBO:-379311.76
2021-07-14 00:12:18 INFO:vbLDA:[ITER] 3,	elapsed time:0.47,	ELBO:-363504.22
2021-07-14 00:12:18 INFO:vbLDA:[ITER] 4,	elapsed time:0.46,	ELBO:-357206.33
2021-07-14 00:12:19 INFO:vbLDA:[ITER] 5,	elapsed time:0.45,	ELBO:-354247.30
2021-07-14 00:12:19 INFO:vbLDA:[ITER] 6,	elapsed time:0.44,	ELBO:-352609.74
2021-07-14 00:12:19 INFO:vbLDA:[ITER] 7,	elapsed time:0.44,	ELBO:-351564.89
2021-07-14 00:12:20 INFO:vbLDA:[ITER] 8,	elapsed time:0.44,	ELBO:-350812.02
2021-07-14 00:12:20 INFO:vbLDA:[ITER] 9,	elapsed time:0.44,	ELBO:-350295.12
2021-07-14 00:12:21 INFO:vbLDA:[ITER] 10,	elapsed time:0.44,	ELBO:-349965.13
2021-07-14 00:12:21 INFO:vbLDA:[ITER] 11,	elapsed time:0.43,	ELBO:-349698.35
2021-07-14 00:12:22 INFO:vbLDA:[ITER] 12,	elapsed time:0.48,	ELBO:-349498.88
...
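
The ELBO values in the log above can be used to judge how quickly the fit levels off. The sketch below computes the relative gain per iteration from the logged values for iterations 0 to 12 and reports the first iteration whose gain falls under a threshold. The helper names and the `1e-3` tolerance are assumptions chosen for illustration; the `fit` call above simply runs for `max_iter` iterations, and nothing here implies the library exposes such a stopping rule.

```python
# ELBO values copied from iterations 0-12 of the log above.
elbo = [-478165.30, -423168.55, -379311.76, -363504.22, -357206.33,
        -354247.30, -352609.74, -351564.89, -350812.02, -350295.12,
        -349965.13, -349698.35, -349498.88]


def relative_gains(values):
    """Relative change of each value with respect to the previous one."""
    return [(cur - prev) / abs(prev) for prev, cur in zip(values, values[1:])]


def first_converged(values, tol):
    """First iteration whose absolute relative change is below tol, or None."""
    for it, gain in enumerate(relative_gains(values), start=1):
        if abs(gain) < tol:
            return it
    return None


gains = relative_gains(elbo)
print(all(g > 0 for g in gains))        # True: the ELBO rises at every logged step
print(first_converged(elbo, tol=1e-3))  # 10
```

With this illustrative tolerance, the logged run would already count as converged at iteration 10, well before `max_iter`.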

### Print the top 10 words for each topic

In [8]:
for ti in range(n_topic):
    top_words = get_top_words(vbmodel._lambda, voca, ti, n_words=10)
    print('Topic', ti ,': ', ','.join(top_words))

Topic 0 :  loss,profit,quarter,first,note,loan,interest,bank,tax,corp
Topic 1 :  loss,quarter,bank,first,week,last,market,one,two,earnings
Topic 2 :  share,may,growth,dividend,stock,would,one,tax,first,five
Topic 3 :  corp,share,price,would,stock,oil,new,per,acquisition,borg
Topic 4 :  nil,stocks,end,total,use,production,start,supply,soybean,demand
Topic 5 :  offer,april,oil,gas,stock,would,one,share,may,common
Topic 6 :  dollar,japan,dome,would,trade,deficit,yen,economic,market,debt
Topic 7 :  february,march,days,trade,surplus,january,april,last,department,rose
Topic 8 :  corp,price,would,trade,national,told,cut,april,wheat,china
Topic 9 :  last,japan,month,trade,sugar,japanese,yen,bank,crop,around
