# Table of Contents

- 1 Example of GibbsLDA and vbLDA
  - 1.1 Loading the Reuters corpus from NLTK
  - 1.2 Inference through Gibbs sampling
    - 1.2.1 Print the 10 most probable words for each topic
  - 1.3 Inference through Variational Bayes
    - 1.3.1 Print the 10 most probable words for each topic

# Example of GibbsLDA and vbLDA

This example requires three NLTK corpora: `nltk.corpus.reuters`, `nltk.corpus.words`, and `nltk.corpus.stopwords`.

You can download them via `nltk.download()`.
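
Calling `nltk.download()` with no arguments opens the interactive downloader. As a minimal sketch of fetching only the three corpora non-interactively, the helper below passes each package identifier to `nltk.download`. The names `REQUIRED_CORPORA`, `download_corpora`, and the `downloader` parameter are illustrative and not part of NLTK or ptm.

```python
REQUIRED_CORPORA = ("reuters", "words", "stopwords")


def download_corpora(names=REQUIRED_CORPORA, downloader=None):
    """Download each named NLTK corpus and return the names that succeeded."""
    if downloader is None:
        import nltk  # imported lazily so this module loads without NLTK
        downloader = nltk.download
    # nltk.download returns True on success and False otherwise.
    return [name for name in names if downloader(name)]


# Run once before the rest of the notebook (requires network access):
# download_corpora()
```

The `downloader` argument exists only so the helper can be exercised without network access; in normal use it can be left at its default.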

In [1]:
import logging

import numpy as np
from ptm import GibbsLDA
from ptm import vbLDA
from ptm.nltk_corpus import get_reuters_ids_cnt
from ptm.utils import convert_cnt_to_list, get_top_words

## Loading the Reuters corpus from NLTK

Load 1000 documents from the NLTK Reuters corpus, with a maximum vocabulary size of 10000.

In [2]:
n_doc = 1000
voca, doc_ids, doc_cnt = get_reuters_ids_cnt(num_doc=n_doc, max_voca=10000)
docs = convert_cnt_to_list(doc_ids, doc_cnt)
n_voca = len(voca)
print('Vocabulary size:%d' % n_voca)

Vocabulary size:4632
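
The loader returns the corpus in a bag-of-words form (`doc_ids`, `doc_cnt`), and `convert_cnt_to_list` turns it into the per-document token lists that `GibbsLDA.fit` consumes. The sketch below shows what such a conversion presumably does, assuming `doc_ids[d]` holds the distinct word ids of document `d` and `doc_cnt[d]` the matching counts. The function name `expand_counts` is hypothetical, and this is not ptm's actual implementation.

```python
def expand_counts(doc_ids, doc_cnt):
    """Expand (word id, count) pairs into one flat token list per document."""
    docs = []
    for ids, counts in zip(doc_ids, doc_cnt):
        tokens = []
        for word_id, count in zip(ids, counts):
            # Repeat each word id as many times as it occurs in the document.
            tokens.extend([word_id] * int(count))
        docs.append(tokens)
    return docs


# Toy corpus: document 0 holds word 2 three times and word 5 once.
toy_ids = [[2, 5], [0]]
toy_cnt = [[3, 1], [2]]
toy_docs = expand_counts(toy_ids, toy_cnt)
print(toy_docs)  # [[2, 2, 2, 5], [0, 0]]
```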


## Inference through Gibbs sampling

In [3]:
max_iter = 100
n_topic = 10

logger = logging.getLogger('GibbsLDA')
logger.propagate = False

model = GibbsLDA(n_doc, len(voca), n_topic)
model.fit(docs, max_iter=max_iter)

2016-02-10 19:42:01 INFO:GibbsLDA:[ITER] 0,	elapsed time:0.86,	log_likelihood:-447909.18
2016-02-10 19:42:02 INFO:GibbsLDA:[ITER] 1,	elapsed time:0.89,	log_likelihood:-421738.22
2016-02-10 19:42:03 INFO:GibbsLDA:[ITER] 2,	elapsed time:0.94,	log_likelihood:-405181.71
2016-02-10 19:42:04 INFO:GibbsLDA:[ITER] 3,	elapsed time:0.87,	log_likelihood:-393867.42
2016-02-10 19:42:05 INFO:GibbsLDA:[ITER] 4,	elapsed time:0.90,	log_likelihood:-385570.47
2016-02-10 19:42:06 INFO:GibbsLDA:[ITER] 5,	elapsed time:0.90,	log_likelihood:-379114.11
2016-02-10 19:42:07 INFO:GibbsLDA:[ITER] 6,	elapsed time:0.92,	log_likelihood:-374416.99
2016-02-10 19:42:08 INFO:GibbsLDA:[ITER] 7,	elapsed time:0.90,	log_likelihood:-371338.53
2016-02-10 19:42:09 INFO:GibbsLDA:[ITER] 8,	elapsed time:0.88,	log_likelihood:-368035.03
2016-02-10 19:42:10 INFO:GibbsLDA:[ITER] 9,	elapsed time:0.93,	log_likelihood:-365556.67
2016-02-10 19:42:11 INFO:GibbsLDA:[ITER] 10,	elapsed time:0.87,	log_likelihood:-363627.94
...
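
The log-likelihood in the log above rises as the sampler reassigns tokens to topics. For readers who want to see the mechanics, here is a minimal sketch of textbook collapsed Gibbs sampling for LDA on token lists like `docs`. It is not ptm's implementation: the function name `gibbs_lda`, the hyperparameter defaults `alpha` and `beta`, and the toy corpus are all assumptions made for illustration.

```python
import numpy as np


def gibbs_lda(docs, n_voca, n_topic, alpha=0.1, beta=0.01, max_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (doc-topic, topic-word) counts."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topic))  # tokens in doc d assigned to topic k
    topic_word = np.zeros((n_topic, n_voca))    # occurrences of word w assigned to topic k
    topic_sum = np.zeros(n_topic)               # total tokens assigned to topic k

    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        z_d = rng.integers(n_topic, size=len(doc))
        assign.append(z_d)
        for w, z in zip(doc, z_d):
            doc_topic[d, z] += 1
            topic_word[z, w] += 1
            topic_sum[z] += 1

    for _ in range(max_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                z = assign[d][n]
                # Remove the token's current assignment from the counts.
                doc_topic[d, z] -= 1
                topic_word[z, w] -= 1
                topic_sum[z] -= 1
                # Conditional p(z = k | all other assignments), up to a constant.
                prob = ((doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                        / (topic_sum + n_voca * beta))
                z = rng.choice(n_topic, p=prob / prob.sum())
                # Add the token back under its newly sampled topic.
                assign[d][n] = z
                doc_topic[d, z] += 1
                topic_word[z, w] += 1
                topic_sum[z] += 1
    return doc_topic, topic_word


# Toy corpus with a 4-word vocabulary and 3 short documents.
toy_docs = [[0, 1, 0, 1], [2, 3, 3, 2], [0, 1, 2, 3]]
doc_topic, topic_word = gibbs_lda(toy_docs, n_voca=4, n_topic=2)
print(topic_word)
```

Because every token is removed from the counts before its topic is resampled and added back afterwards, the count matrices always sum to the number of tokens in the corpus.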

### Print the 10 most probable words for each topic

In [4]:
for ti in range(n_topic):
    top_words = get_top_words(model.TW, voca, ti, n_words=10)
    print('Topic', ti, ': ', ','.join(top_words))

Topic 0 :  market,bank,week,rate,rose,money,two,rise,three,fed
Topic 1 :  quarter,first,april,record,earnings,dividend,share,prior,may,one
Topic 2 :  oil,dome,one,debt,gas,price,plan,new,would,energy
Topic 3 :  nil,stocks,production,total,end,use,start,soybean,supply,demand
Topic 4 :  last,month,wheat,crop,grain,department,sugar,april,week,export
Topic 5 :  loss,profit,corp,note,tax,chemical,gain,quarter,nine,operating
Topic 6 :  trade,government,last,also,deficit,would,surplus,foreign,canada,industry
Topic 7 :  japan,would,could,economic,japanese,market,west,growth,meeting,policy
Topic 8 :  dollar,bank,yen,interest,exchange,term,days,currency,rate,current
Topic 9 :  share,offer,stock,corp,acquisition,would,group,common,also,cash
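
The listing above comes from ranking each topic's row of the topic-word matrix. As a rough sketch of that ranking step, assuming the matrix has one row per topic and one column per vocabulary entry, and that `voca` can be indexed by word id, the helper below sorts one row and maps the best indices back to words. The name `top_words` is hypothetical, and ptm's `get_top_words` may differ in its details.

```python
import numpy as np


def top_words(topic_word, voca, topic, n_words=10):
    """Return the n_words highest-weighted vocabulary entries of one topic."""
    # argsort is ascending, so reverse it and keep the first n_words indices.
    order = np.argsort(topic_word[topic])[::-1][:n_words]
    return [voca[i] for i in order]


toy_voca = ["oil", "bank", "wheat", "yen"]
toy_tw = np.array([[5.0, 1.0, 0.0, 2.0],
                   [0.0, 3.0, 4.0, 1.0]])
print(top_words(toy_tw, toy_voca, 0, n_words=2))  # ['oil', 'yen']
print(top_words(toy_tw, toy_voca, 1, n_words=2))  # ['wheat', 'bank']
```

Since only the ordering within a row matters, the same ranking works on raw counts, on normalized probabilities, or on unnormalized variational parameters.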


## Inference through Variational Bayes

In [5]:
logger = logging.getLogger('vbLDA')
logger.propagate = False

vbmodel = vbLDA(n_doc, n_voca, n_topic)
vbmodel.fit(doc_ids, doc_cnt, max_iter=max_iter)

2016-02-10 19:43:32 INFO:vbLDA:[ITER] 0,	elapsed time:0.79,	ELBO:-478629.24
2016-02-10 19:43:33 INFO:vbLDA:[ITER] 1,	elapsed time:0.78,	ELBO:-424352.68
2016-02-10 19:43:34 INFO:vbLDA:[ITER] 2,	elapsed time:0.79,	ELBO:-380711.73
2016-02-10 19:43:34 INFO:vbLDA:[ITER] 3,	elapsed time:0.76,	ELBO:-364218.72
2016-02-10 19:43:35 INFO:vbLDA:[ITER] 4,	elapsed time:0.72,	ELBO:-357506.75
2016-02-10 19:43:36 INFO:vbLDA:[ITER] 5,	elapsed time:0.69,	ELBO:-354117.34
2016-02-10 19:43:37 INFO:vbLDA:[ITER] 6,	elapsed time:0.69,	ELBO:-352265.21
2016-02-10 19:43:37 INFO:vbLDA:[ITER] 7,	elapsed time:0.69,	ELBO:-351168.75
2016-02-10 19:43:38 INFO:vbLDA:[ITER] 8,	elapsed time:0.65,	ELBO:-350393.52
2016-02-10 19:43:39 INFO:vbLDA:[ITER] 9,	elapsed time:0.65,	ELBO:-349864.68
2016-02-10 19:43:39 INFO:vbLDA:[ITER] 10,	elapsed time:0.64,	ELBO:-349479.59
2016-02-10 19:43:40 INFO:vbLDA:[ITER] 11,	elapsed time:0.66,	ELBO:-349231.45
2016-02-10 19:43:40 INFO:vbLDA:[ITER] 12,	elapsed time:0.64,	ELBO:-349048.99
...
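
The ELBO in the log above rises as the variational parameters are updated. As a rough illustration of one such update, the sketch below runs the standard mean-field per-document step for LDA on a single document given in (`ids`, `cnt`) form, holding the topic-word parameter fixed. It is not ptm's code: the function name `e_step`, the default `alpha`, the fixed iteration count, and the toy `toy_lambda` matrix are assumptions, and it relies on `scipy.special.digamma`.

```python
import numpy as np
from scipy.special import digamma


def e_step(ids, cnt, lam, alpha=0.1, n_iter=50):
    """Update one document's variational parameters (gamma, phi) given lambda."""
    ids = np.asarray(ids)
    cnt = np.asarray(cnt, dtype=float)
    n_topic = lam.shape[0]
    # E[log beta_kw] under q(beta_k) = Dirichlet(lambda_k), for this document's words.
    e_log_beta = digamma(lam[:, ids]) - digamma(lam.sum(axis=1))[:, None]
    gamma = np.full(n_topic, alpha + cnt.sum() / n_topic)
    for _ in range(n_iter):
        # E[log theta_k] under q(theta) = Dirichlet(gamma).
        e_log_theta = digamma(gamma) - digamma(gamma.sum())
        # phi[k, n] is proportional to exp(E[log theta_k] + E[log beta_k,w_n]).
        log_phi = e_log_theta[:, None] + e_log_beta
        phi = np.exp(log_phi - log_phi.max(axis=0))
        phi /= phi.sum(axis=0)
        # gamma_k = alpha + sum over words of count * phi[k, n].
        gamma = alpha + phi @ cnt
    return gamma, phi


# Topic 0 favors words 0 and 1; topic 1 favors words 2 and 3.
toy_lambda = np.array([[9.0, 9.0, 1.0, 1.0],
                       [1.0, 1.0, 9.0, 9.0]])
gamma, phi = e_step(ids=[0, 1], cnt=[3, 2], lam=toy_lambda)
print(gamma)
```

Working on (`ids`, `cnt`) pairs rather than token lists means each distinct word is processed once per document, which matches the way `vbLDA.fit` is called with `doc_ids` and `doc_cnt`.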

### Print the 10 most probable words for each topic

In [6]:
for ti in range(n_topic):
    top_words = get_top_words(vbmodel._lambda, voca, ti, n_words=10)
    print('Topic', ti, ': ', ','.join(top_words))

Topic 0 :  share,stock,profit,would,offer,corp,earnings,per,dividend,first
Topic 1 :  fed,price,trade,may,two,april,market,reserve,would,japan
Topic 2 :  dollar,would,one,foreign,growth,last,trade,economic,week,rise
Topic 3 :  loss,profit,corp,note,quarter,national,share,gain,one,first
Topic 4 :  bank,market,week,days,rate,money,new,april,today,day
Topic 5 :  quarter,first,tax,share,income,april,bank,dividend,record,may
Topic 6 :  oil,quarter,first,gas,march,gold,february,price,earnings,last
Topic 7 :  japan,dollar,trade,would,yen,dome,japanese,market,also,agreement
Topic 8 :  nil,last,stocks,month,production,total,grain,crop,wheat,end
Topic 9 :  share,corp,april,wheat,price,new,group,would,exchange,department
