First, let's import the Python bindings, as usual.

In [1]:
import metapy

In [2]:
metapy.__version__ # you will want your version to be >= to this

'0.2.13'

If you would like to, you can inform MeTA to output log data to stderr like so:

In [3]:
metapy.log_to_stderr()

Now, let's download a list of stopwords and a sample dataset to begin exploring MeTA's topic models.

In [4]:
#!wget -N https://raw.githubusercontent.com/meta-toolkit/meta/master/data/lemur-stopwords.txt

In [5]:
fidx = metapy.index.make_forward_index('review.toml')

1558167117: [info]     Creating forward index: reviews-idx/fwd (C:/Users/appveyor/AppData/Local/Temp/1/pip-req-build-7ct0ssv5/deps/meta/src/index/forward_index.cpp:239)
 > Tokenizing Docs: [>                                       ]   0% ETA 00:00:00 
 
 > Tokenizing Docs: [====>                                   ]  10% ETA 00:00:04 
 
 
 
 
 
 
 
 
 
 
 
 
1558167124: [info]     Done creating index: reviews-idx/fwd (C:/Users/appveyor/AppData/Local/Temp/1/pip-req-build-7ct0ssv5/deps/meta/src/index/forward_index.cpp:278)


Just like in classification, the feature set used for the topic modeling will be the feature set used at the time of indexing, so if you want to play with a different set of features (like bigram words), you will need to re-index your data.

For now, we've just stuck with the default filter chain for unigram words, so we're operating in the traditional bag-of-words space.

Let's load our documents into memory to run the topic model inference now.

In [6]:
dset = metapy.learn.Dataset(fidx)



Now, let's try to find some topics for this dataset. To do so, we're going to use a generative model called a topic model.

There are many different topic models in the literature, but the most commonly used topic model is Latent Dirichlet Allocation. Here, we propose that there are K topics (represented with a categorical distribution over words) $\phi_k$ from which all of our documents are genereated. These K topics are modeled as being sampled from a Dirichlet distribution with parameter $\vec{\alpha}$. Then, to generate a document $d$, we first sample a distribution over the K topics $\theta_d$ from another Dirichlet distribution with parameter $\vec{\beta}$. Then, for each word in this document, we first sample a topic identifier $z \sim \theta_d$ and then the word by drawing from the topic we selected ($w \sim \phi_z$). Refer to the [Wikipedia article on LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) for more information.

The goal of running inference for an LDA model is to infer the latent variables $\phi_k$ and $\theta_d$ for all of the $K$ topics and $D$ documents, respectively. MeTA provides a number of different inference algorithms for LDA, as each one entails a different set of trade-offs (inference in LDA is intractable, so all inference algorithms are approximations; different algorithms entail different approximation guarantees, running times, and required memroy consumption). For now, let's run a Variational Infernce algorithm called CVB0 to find two topics. (In practice you will likely be finding many more topics than just two, but this is a very small toy dataset.)

In [7]:
lda_inf = metapy.topics.LDACollapsedVB(dset, num_topics=10, alpha=1.0, beta=0.01)
lda_inf.run(num_iters=1000)

Iteration 1 maximum change in gamma: 1.35668                                                                 
Iteration 2 maximum change in gamma: 0.347496                                                                
Iteration 3 maximum change in gamma: 0.575276                                                                 
Iteration 4 maximum change in gamma: 0.979212                                                                 
Iteration 5 maximum change in gamma: 0.978383                                                                
Iteration 6 maximum change in gamma: 1.08539                                                                 
Iteration 7 maximum change in gamma: 1.15643                                                                 
Iteration 8 maximum change in gamma: 1.07446                                                                  
Iteration 9 maximum change in gamma: 0.926256                                                                
Iterati

Iteration 74 maximum change in gamma: 0.291093                                                              
Iteration 75 maximum change in gamma: 0.290534                                                               
Iteration 76 maximum change in gamma: 0.292218                                                               
Iteration 77 maximum change in gamma: 0.332982                                                               
Iteration 78 maximum change in gamma: 0.3211                                                                 
Iteration 79 maximum change in gamma: 0.372007                                                               
Iteration 80 maximum change in gamma: 0.343436                                                                
Iteration 81 maximum change in gamma: 0.295127                                                                
Iteration 82 maximum change in gamma: 0.25854                                                                
Iteration

Iteration 148 maximum change in gamma: 0.271247                                                             
Iteration 149 maximum change in gamma: 0.344348                                                              
Iteration 150 maximum change in gamma: 0.569639                                                              
Iteration 151 maximum change in gamma: 0.576529                                                              
Iteration 152 maximum change in gamma: 0.470848                                                              
Iteration 153 maximum change in gamma: 0.582473                                                               
Iteration 154 maximum change in gamma: 0.412794                                                              
Iteration 155 maximum change in gamma: 0.230025                                                              
Iteration 156 maximum change in gamma: 0.24163                                                               
Iteration 

Iteration 222 maximum change in gamma: 0.320492                                                             
Iteration 223 maximum change in gamma: 0.346648                                                              
Iteration 224 maximum change in gamma: 0.298322                                                              
Iteration 225 maximum change in gamma: 0.227953                                                               
Iteration 226 maximum change in gamma: 0.19051                                                               
Iteration 227 maximum change in gamma: 0.209888                                                              
Iteration 228 maximum change in gamma: 0.237263                                                              
Iteration 229 maximum change in gamma: 0.266067                                                              
Iteration 230 maximum change in gamma: 0.295107                                                              
Iteration 

Iteration 296 maximum change in gamma: 0.283456                                                             
Iteration 297 maximum change in gamma: 0.2844                                                                
Iteration 298 maximum change in gamma: 0.385749                                                               
Iteration 299 maximum change in gamma: 0.407851                                                              
Iteration 300 maximum change in gamma: 0.326266                                                              
Iteration 301 maximum change in gamma: 0.287801                                                               
Iteration 302 maximum change in gamma: 0.306749                                                              
Iteration 303 maximum change in gamma: 0.287271                                                              
Iteration 304 maximum change in gamma: 0.320497                                                               
Iteratio

Iteration 370 maximum change in gamma: 0.458454                                                             
Iteration 371 maximum change in gamma: 0.464898                                                              
Iteration 372 maximum change in gamma: 0.275584                                                              
Iteration 373 maximum change in gamma: 0.162683                                                              
Iteration 374 maximum change in gamma: 0.16803                                                               
Iteration 375 maximum change in gamma: 0.163565                                                              
Iteration 376 maximum change in gamma: 0.251778                                                              
Iteration 377 maximum change in gamma: 0.381335                                                              
Iteration 378 maximum change in gamma: 0.446002                                                              
Iteration 3

Iteration 444 maximum change in gamma: 0.319343                                                             
Iteration 445 maximum change in gamma: 0.403198                                                              
Iteration 446 maximum change in gamma: 0.400576                                                              
Iteration 447 maximum change in gamma: 0.322766                                                              
Iteration 448 maximum change in gamma: 0.18545                                                               
Iteration 449 maximum change in gamma: 0.203271                                                              
Iteration 450 maximum change in gamma: 0.228098                                                              
Iteration 451 maximum change in gamma: 0.234156                                                              
Iteration 452 maximum change in gamma: 0.212596                                                              
Iteration 4

Iteration 518 maximum change in gamma: 0.27763                                                               
Iteration 519 maximum change in gamma: 0.283561                                                               
Iteration 520 maximum change in gamma: 0.275364                                                              
Iteration 521 maximum change in gamma: 0.211964                                                               
Iteration 522 maximum change in gamma: 0.14513                                                                
Iteration 523 maximum change in gamma: 0.157137                                                               
Iteration 524 maximum change in gamma: 0.158622                                                               
Iteration 525 maximum change in gamma: 0.15285                                                                
Iteration 526 maximum change in gamma: 0.131685                                                               
Ite

Iteration 591 maximum change in gamma: 0.0881543                                                             
Iteration 592 maximum change in gamma: 0.0808485                                                              
Iteration 593 maximum change in gamma: 0.0931631                                                              
Iteration 594 maximum change in gamma: 0.117338                                                                 
Iteration 595 maximum change in gamma: 0.146796                                                                 
Iteration 596 maximum change in gamma: 0.177498                                                               
Iteration 597 maximum change in gamma: 0.199494                                                               
Iteration 598 maximum change in gamma: 0.202358                                                                
Iteration 599 maximum change in gamma: 0.189162                                                             

Iteration 664 maximum change in gamma: 0.276146                                                              
Iteration 665 maximum change in gamma: 0.238963                                                               
Iteration 666 maximum change in gamma: 0.153026                                                               
Iteration 667 maximum change in gamma: 0.168869                                                              
Iteration 668 maximum change in gamma: 0.194212                                                              
Iteration 669 maximum change in gamma: 0.210181                                                               
Iteration 670 maximum change in gamma: 0.200154                                                              
Iteration 671 maximum change in gamma: 0.165376                                                               
Iteration 672 maximum change in gamma: 0.118073                                                               
Itera

Iteration 737 maximum change in gamma: 0.0753582                                                             
Iteration 738 maximum change in gamma: 0.0834769                                                              
Iteration 739 maximum change in gamma: 0.0911141                                                              
Iteration 740 maximum change in gamma: 0.105829                                                               
Iteration 741 maximum change in gamma: 0.1317                                                                 
Iteration 742 maximum change in gamma: 0.159468                                                               
Iteration 743 maximum change in gamma: 0.181572                                                               
Iteration 744 maximum change in gamma: 0.187101                                                               
Iteration 745 maximum change in gamma: 0.180805                                                               
It

Iteration 810 maximum change in gamma: 0.103148                                                              
Iteration 811 maximum change in gamma: 0.108748                                                               
Iteration 812 maximum change in gamma: 0.111246                                                                
Iteration 813 maximum change in gamma: 0.110111                                                               
Iteration 814 maximum change in gamma: 0.105286                                                               
Iteration 815 maximum change in gamma: 0.0972335                                                              
Iteration 816 maximum change in gamma: 0.0868428                                                              
Iteration 817 maximum change in gamma: 0.075159                                                               
Iteration 818 maximum change in gamma: 0.0740115                                                              
I

Iteration 883 maximum change in gamma: 0.0823785                                                             
Iteration 884 maximum change in gamma: 0.0767509                                                              
Iteration 885 maximum change in gamma: 0.079638                                                               
Iteration 886 maximum change in gamma: 0.0813111                                                               
Iteration 887 maximum change in gamma: 0.0815311                                                              
Iteration 888 maximum change in gamma: 0.0801492                                                              
Iteration 889 maximum change in gamma: 0.0771457                                                               
Iteration 890 maximum change in gamma: 0.072648                                                               
Iteration 891 maximum change in gamma: 0.0669204                                                              


Iteration 956 maximum change in gamma: 0.0906904                                                               
Iteration 957 maximum change in gamma: 0.103656                                                                 
Iteration 958 maximum change in gamma: 0.114969                                                                 
Iteration 959 maximum change in gamma: 0.121917                                                               
Iteration 960 maximum change in gamma: 0.121722                                                                
Iteration 961 maximum change in gamma: 0.113011                                                               
Iteration 962 maximum change in gamma: 0.0968587                                                              
Iteration 963 maximum change in gamma: 0.0996825                                                              
Iteration 964 maximum change in gamma: 0.113062                                                           

The above ran the CVB0 algorithm for 1000 iterations, or until an algorithm-specific convergence criterion was met. Now let's save the current estimate for our topics and topic proportions.

In [8]:
lda_inf.save('lda-cvb0')

We can interrogate the topic inference results by using the `TopicModel` query class. Let's load our inference results back in.

In [29]:
model = metapy.topics.TopicModel('lda-cvb0')



Now, let's have a look at our topics. A typical way of doing this is to print the top $k$ words in each topic, so let's do that.

In [30]:
model.top_k(tid=0)

[(35529, 0.0356366064502679),
 (6488, 0.032412340260260596),
 (17679, 0.02883476015125708),
 (40362, 0.02530187122707259),
 (19334, 0.019688848655901633),
 (5773, 0.01809963306987989),
 (8278, 0.016885006086425822),
 (33135, 0.01582886091573045),
 (14358, 0.012588553428843087),
 (48046, 0.01016252301972237)]

The models operate on term ids instead of raw text strings, so let's convert this to a human readable format by using the vocabulary contained in our `ForwardIndex` to map the term ids to strings.

In [33]:
topic = [[(fidx.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=id)] for id in range(0,10)]

We can pretty clearly see that this particular dataset was about two major issues: smoking in public and part time jobs for students. This dataset is actually a collection of essays written by students, and there just so happen to be two different topics they can choose from!

The topics are pretty clear in this case, but in some cases it is also useful to score the terms in a topic using some function of the probability of the word in the topic and the probability of the word in the other topics. Intuitively, we might want to select words from each topic that best reflect that topic's content by picking words that both have high probability in that topic **and** have low probability in the other topics. In other words, we want to balance between high probability terms and highly specific terms (this is kind of like a tf-idf weighting). One such scoring function is provided by the toolkit in `BLTermScorer`, which implements a scoring function proposed by Blei and Lafferty.

In [39]:
'ddd'+'a'

'ddda'

In [42]:
scorer = metapy.topics.BLTermScorer(model)
topic_words = {'Topic ' + str(id) : [(fidx.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=id, scorer=scorer)] for id in range(0,10)}
topic_words

{'Topic 0': [('pizza', 0.4387231206616917),
  ('burger', 0.39975651446206595),
  ('sandwich', 0.2871798664431046),
  ('fri', 0.2675637997698189),
  ('breakfast', 0.2003754370876098),
  ('chees', 0.13471966896754017),
  ('egg', 0.12638271310132687),
  ('bacon', 0.09053566645499546),
  ('order', 0.07843862408003499),
  ('wing', 0.06759458349635811)],
 'Topic 1': [('chicken', 0.25024883853595986),
  ('taco', 0.21346828288501873),
  ('chip', 0.1336681517468307),
  ('mexican', 0.1265819692685006),
  ('bean', 0.11060991324233882),
  ('salsa', 0.1097639834388628),
  ('sauc', 0.09448052767035807),
  ('bbq', 0.09224234220000961),
  ('burrito', 0.08119836625267962),
  ('good', 0.08054806897655738)],
 'Topic 2': [('it', 0.0656415651632207),
  ("i'm", 0.060605704458270855),
  ('know', 0.056872950590452824),
  ("don't", 0.05372320272211005),
  ('locat', 0.03627744217672341),
  ('review', 0.0358918839961071),
  ('old', 0.026949627807821445),
  ('year', 0.025937348447821012),
  ("you'r", 0.0256555588

Here we can see that the uninformative word stem "think" was downweighted from the word list from each topic, since it had relatively high probability in either topic.

We can also see the inferred topic distribution for each document.

In [44]:
import json

with open ( 'review_topic.json', 'w') as f:
    json.dump(topic_words, f)

Chinese Resturant Review Positive 

In [18]:
fidx_pos = metapy.index.make_forward_index('chinese_review_pos.toml')
dset_pos = metapy.learn.Dataset(fidx_pos)

1558186830: [info]     Creating forward index: chinese_reviews_pos-idx/fwd (C:/Users/appveyor/AppData/Local/Temp/1/pip-req-build-7ct0ssv5/deps/meta/src/index/forward_index.cpp:239)
 > Tokenizing Docs: [>                                       ]   0% ETA 00:00:00 
1558186832: [info]     Done creating index: chinese_reviews_pos-idx/fwd (C:/Users/appveyor/AppData/Local/Temp/1/pip-req-build-7ct0ssv5/deps/meta/src/index/forward_index.cpp:278)


In [19]:
lda_inf_pos = metapy.topics.LDACollapsedVB(dset_pos, num_topics=10, alpha=1.0, beta=0.01)
lda_inf_pos.run(num_iters=1000)
lda_inf_pos.save('lda-cvb0-pos')

Iteration 1 maximum change in gamma: 1.31323                                            
Iteration 2 maximum change in gamma: 0.47757                                            
Iteration 3 maximum change in gamma: 0.986129                                           
Iteration 4 maximum change in gamma: 1.13021                                            
Iteration 5 maximum change in gamma: 1.12014                                            
Iteration 6 maximum change in gamma: 0.872339                                            
Iteration 7 maximum change in gamma: 0.625976                                            
Iteration 8 maximum change in gamma: 0.633009                                           
Iteration 9 maximum change in gamma: 0.716217                                           
Iteration 10 maximum change in gamma: 0.847852                                          
Iteration 11 maximum change in gamma: 0.928589                                          
Iteration 12 maximu

Iteration 91 maximum change in gamma: 0.681102                                         
Iteration 92 maximum change in gamma: 0.520491                                          
Iteration 93 maximum change in gamma: 0.446401                                           
Iteration 94 maximum change in gamma: 0.347761                                          
Iteration 95 maximum change in gamma: 0.336727                                          
Iteration 96 maximum change in gamma: 0.311449                                          
Iteration 97 maximum change in gamma: 0.434311                                          
Iteration 98 maximum change in gamma: 0.506866                                          
Iteration 99 maximum change in gamma: 0.531022                                          
Iteration 100 maximum change in gamma: 0.50095                                          
Iteration 101 maximum change in gamma: 0.539083                                         
Iteration 102 maximum

Iteration 182 maximum change in gamma: 0.346314                                        
Iteration 183 maximum change in gamma: 0.209953                                         
Iteration 184 maximum change in gamma: 0.243434                                         
Iteration 185 maximum change in gamma: 0.246501                                         
Iteration 186 maximum change in gamma: 0.22474                                          
Iteration 187 maximum change in gamma: 0.25624                                          
Iteration 188 maximum change in gamma: 0.30945                                          
Iteration 189 maximum change in gamma: 0.34544                                          
Iteration 190 maximum change in gamma: 0.35822                                          
Iteration 191 maximum change in gamma: 0.387449                                         
Iteration 192 maximum change in gamma: 0.345872                                         
Iteration 193 maximum 

Iteration 274 maximum change in gamma: 0.146431                                        
Iteration 275 maximum change in gamma: 0.190371                                         
Iteration 276 maximum change in gamma: 0.233795                                         
Iteration 277 maximum change in gamma: 0.28462                                          
Iteration 278 maximum change in gamma: 0.289174                                         
Iteration 279 maximum change in gamma: 0.214966                                         
Iteration 280 maximum change in gamma: 0.122307                                         
Iteration 281 maximum change in gamma: 0.102295                                          
Iteration 282 maximum change in gamma: 0.108357                                          
Iteration 283 maximum change in gamma: 0.111627                                          
Iteration 284 maximum change in gamma: 0.119329                                          
Iteration 285 maxi

Iteration 364 maximum change in gamma: 0.216069                                        
Iteration 365 maximum change in gamma: 0.237034                                         
Iteration 366 maximum change in gamma: 0.255587                                         
Iteration 367 maximum change in gamma: 0.265165                                         
Iteration 368 maximum change in gamma: 0.222197                                         
Iteration 369 maximum change in gamma: 0.199976                                         
Iteration 370 maximum change in gamma: 0.251098                                         
Iteration 371 maximum change in gamma: 0.30053                                          
Iteration 372 maximum change in gamma: 0.31284                                          
Iteration 373 maximum change in gamma: 0.321361                                         
Iteration 374 maximum change in gamma: 0.382176                                         
Iteration 375 maximum 

Iteration 455 maximum change in gamma: 0.154017                                        
Iteration 456 maximum change in gamma: 0.154526                                         
Iteration 457 maximum change in gamma: 0.144979                                         
Iteration 458 maximum change in gamma: 0.122229                                          
Iteration 459 maximum change in gamma: 0.0922504                                        
Iteration 460 maximum change in gamma: 0.0862487                                        
Iteration 461 maximum change in gamma: 0.0899013                                         
Iteration 462 maximum change in gamma: 0.0914311                                         
Iteration 463 maximum change in gamma: 0.0936017                                         
Iteration 464 maximum change in gamma: 0.0997501                                         
Iteration 465 maximum change in gamma: 0.119812                                          
Iteration 466 ma

Iteration 546 maximum change in gamma: 0.036005                                        
Iteration 547 maximum change in gamma: 0.0375692                                        
Iteration 548 maximum change in gamma: 0.0391524                                        
Iteration 549 maximum change in gamma: 0.0406492                                         
Iteration 550 maximum change in gamma: 0.0420315                                         
Iteration 551 maximum change in gamma: 0.0448213                                         
Iteration 552 maximum change in gamma: 0.0479567                                         
Iteration 553 maximum change in gamma: 0.0510952                                        
Iteration 554 maximum change in gamma: 0.0543511                                        
Iteration 555 maximum change in gamma: 0.0584592                                        
Iteration 556 maximum change in gamma: 0.0624182                                        
Iteration 557 maxi

Iteration 637 maximum change in gamma: 0.068922                                        
Iteration 638 maximum change in gamma: 0.0660538                                        
Iteration 639 maximum change in gamma: 0.0624873                                        
Iteration 640 maximum change in gamma: 0.0583617                                         
Iteration 641 maximum change in gamma: 0.0538316                                        
Iteration 642 maximum change in gamma: 0.0490568                                        
Iteration 643 maximum change in gamma: 0.0501606                                         
Iteration 644 maximum change in gamma: 0.0544656                                        
Iteration 645 maximum change in gamma: 0.0587669                                        
Iteration 646 maximum change in gamma: 0.062864                                          
Iteration 647 maximum change in gamma: 0.0664931                                         
Iteration 648 maxi

Iteration 728 maximum change in gamma: 0.058191                                        
Iteration 729 maximum change in gamma: 0.0573395                                        
Iteration 730 maximum change in gamma: 0.0555922                                        
Iteration 731 maximum change in gamma: 0.0529661                                        
Iteration 732 maximum change in gamma: 0.0514368                                         
Iteration 733 maximum change in gamma: 0.0551516                                        
Iteration 734 maximum change in gamma: 0.0584957                                         
Iteration 735 maximum change in gamma: 0.0612063                                        
Iteration 736 maximum change in gamma: 0.0629999                                        
Iteration 737 maximum change in gamma: 0.0649513                                        
Iteration 738 maximum change in gamma: 0.0658974                                        
Iteration 739 maximu

Iteration 819 maximum change in gamma: 0.0366661                                        
Iteration 820 maximum change in gamma: 0.0385459                                        
Iteration 821 maximum change in gamma: 0.040389                                         
Iteration 822 maximum change in gamma: 0.0421752                                        
Iteration 823 maximum change in gamma: 0.0438124                                        
Iteration 824 maximum change in gamma: 0.045237                                         
Iteration 825 maximum change in gamma: 0.0463808                                         
Iteration 826 maximum change in gamma: 0.0471744                                        
Iteration 827 maximum change in gamma: 0.0485975                                        
Iteration 828 maximum change in gamma: 0.0499166                                        
Iteration 829 maximum change in gamma: 0.0507555                                        
Iteration 830 maximu

Iteration 910 maximum change in gamma: 0.0737457                                        
Iteration 911 maximum change in gamma: 0.0826635                                        
Iteration 912 maximum change in gamma: 0.0918228                                         
Iteration 913 maximum change in gamma: 0.100331                                          
Iteration 914 maximum change in gamma: 0.105084                                          
Iteration 915 maximum change in gamma: 0.104196                                          
Iteration 916 maximum change in gamma: 0.0966585                                        
Iteration 917 maximum change in gamma: 0.083429                                         
Iteration 918 maximum change in gamma: 0.0669973                                         
Iteration 919 maximum change in gamma: 0.0504463                                        
Iteration 920 maximum change in gamma: 0.0373749                                        
Iteration 921 ma

In [47]:
model = metapy.topics.TopicModel('lda-cvb0-pos')

scorer = metapy.topics.BLTermScorer(model)
topic_words = {'Topic '+str(id): [(fidx_pos.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=id, scorer=scorer)] for id in range(0,10)}



In [49]:
with open ( 'chinese_pos_topic.json', 'w') as f:
    json.dump(topic_words, f)

topic_words

{'Topic 0': [('order', 0.18609484884683222),
  ('wait', 0.09935275349068518),
  ('ask', 0.09458999489467562),
  ('tabl', 0.08401145355687928),
  ('came', 0.08009414550334812),
  ('minut', 0.07936646921246783),
  ('time', 0.06948061058714489),
  ('got', 0.0670061241540397),
  ('took', 0.06372612248202003),
  ('did', 0.0601870056513798)],
 'Topic 1': [('chines', 0.7898718898940076),
  ('restaur', 0.26132790279789264),
  ('food', 0.20043935474934305),
  ('authent', 0.12002137212802559),
  ('place', 0.10069767040218026),
  ('best', 0.09465091893364452),
  ('area', 0.07357137579629303),
  ('china', 0.06960600883597047),
  ('phoenix', 0.06640020763331163),
  ('new', 0.06591198015147486)],
 'Topic 2': [('chicken', 0.8104478051526193),
  ('rice', 0.3476494035684613),
  ('fri', 0.2840430328863542),
  ('egg', 0.2201903727378909),
  ('soup', 0.20029155903646073),
  ('sour', 0.15949638845653774),
  ('order', 0.14075188971830943),
  ('lunch', 0.13717533423254472),
  ('orang', 0.13260244627736728),


Chinese Restaurant Negative Reviews

In [None]:
fidx_neg = metapy.index.make_forward_index('chinese_review_neg.toml')
dset_neg = metapy.learn.Dataset(fidx_pos)
lda_inf_neg = metapy.topics.LDACollapsedVB(dset_pos, num_topics=10, alpha=1.0, beta=0.01)
lda_inf_neg.run(num_iters=1000)
lda_inf_neg.save('lda-cvb0-neg')

1558195032: [info]     Creating forward index: chinese_reviews_neg-idx/fwd (C:/Users/appveyor/AppData/Local/Temp/1/pip-req-build-7ct0ssv5/deps/meta/src/index/forward_index.cpp:239)
 > Tokenizing Docs: [>                                       ]   0% ETA 00:00:00 
1558195032: [info]     Done creating index: chinese_reviews_neg-idx/fwd (C:/Users/appveyor/AppData/Local/Temp/1/pip-req-build-7ct0ssv5/deps/meta/src/index/forward_index.cpp:278)
Iteration 1 maximum change in gamma: 1.31323                                             
Iteration 2 maximum change in gamma: 0.47757                                            
Iteration 3 maximum change in gamma: 0.986129                                           
Iteration 4 maximum change in gamma: 1.13021                                            
Iteration 5 maximum change in gamma: 1.12014                                            
Iteration 6 maximum change in gamma: 0.872339                                           
Iteration 7 maximum chan

In [None]:
model = metapy.topics.TopicModel('lda-cvb0-neg')

scorer = metapy.topics.BLTermScorer(model)
topic_words = {'Topic '+str(id): [(fidx_neg.term_text(pr[0]), pr[1]) for pr in model.top_k(tid=id, scorer=scorer)] for id in range(0,10)}

with open ( 'chinese_neg_topic.json', 'w') as f:
    json.dump(topic_words, f)

topic_words

In [23]:
model.topic_distribution(0)

<metapy.stats.Multinomial {0: 0.978659, 1: 0.021341}>

It looks like our first document was written by a student who chose the part-time job essay topic...

In [24]:
model.topic_distribution(900)

<metapy.stats.Multinomial {0: 0.021203, 1: 0.978797}>

...whereas this document looks like it was written by a student who chose the public smoking essay topic.

We can also infer topics for a brand new document. First, let's create the document and use the forward index we loaded before to convert it to a feature vector:

In [25]:
doc = metapy.index.Document()
doc.content("I think smoking in public is bad for others' health.")
fvec = fidx.tokenize(doc)

Now, let's load a topic model inferencer that uses the same CVB inference method we used earlier:

In [26]:
inferencer = metapy.topics.CVBInferencer('lda-cvb0.phi.bin', alpha=1.0)

 


Now, let's infer the topic proportions for the new document:

In [27]:
proportions = inferencer.infer(fvec, max_iters=20, convergence=1e-4)
print(proportions)

<metapy.stats.Multinomial {0: 0.185608, 1: 0.814392}>
