## Extraction of LDA model parameters

In [1]:
import pickle

from pandas import read_csv
from pandas import DataFrame, Series 
from collections import defaultdict

from gensim import corpora
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel as lda



## I.  Gensim corpus format 

### A standard workflow to define a corpus

Let us start with simple sentences and convert them into token vectors.

In [2]:
# Initial set of documents
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

display(texts)

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]

Before building the token dictionary used for the final integer encoding, we drop tokens that occur only once in the whole collection.

In [3]:
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]
display(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

Let's build an integer encoding for the words with the gensim library and encode a single text as an example. A corpus is a list or other sequence of such encodings.

In [4]:
# Dictionary
dictionary = Dictionary(texts)
print("Dictionary of encodings")
print(dictionary.token2id)
print("")

# Encoding with the dictionary
new_doc = "Human human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print("Encoding of '{}' as word-count pairs".format(new_doc))
print(new_vec)

Dictionary of encodings
{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}

Encoding of 'Human human computer interaction' as word-count pairs
[(0, 1), (1, 2)]
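
To make the word-count encoding concrete, here is a minimal pure-Python sketch of what the dictionary and the encoding step amount to, applied to every document so that the result is a full corpus. The helpers `build_token2id` and `encode_bow` are hypothetical names introduced for illustration; they are not gensim functions, and the id assignment order (alphabetical within each document) is chosen only so that the result matches the mapping printed above.

```python
from collections import Counter

# Filtered token lists, copied from the output of the previous cell.
texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
    ["system", "human", "system", "eps"],
    ["user", "response", "time"],
    ["trees"],
    ["graph", "trees"],
    ["graph", "minors", "trees"],
    ["graph", "minors", "survey"],
]


def build_token2id(docs):
    """Assign consecutive integer ids to tokens, document by document."""
    token2id = {}
    for doc in docs:
        # Illustrative assumption: new tokens get ids in alphabetical order.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def encode_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


token2id = build_token2id(texts)
corpus_bow = [encode_bow(doc, token2id) for doc in texts]

new_vec = encode_bow("human human computer interaction".split(), token2id)
print(new_vec)        # [(0, 1), (1, 2)]
print(corpus_bow[3])  # [(1, 1), (5, 2), (8, 1)]
```

The word `interaction` is silently dropped because it is not in the mapping, which is the same behaviour as in the gensim output above. With gensim itself, the corpus would be built by calling `dictionary.doc2bow` on each token list.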


### What is a document for collocations?

* Recall that for adjective-noun collocations each document corresponds to a noun.
* A document contains adjectives that form collocation pairs with the noun.
* This information is stored as a dataframe.

Hence we can write our own converter from a dataframe to a gensim corpus.


In [6]:
data = read_csv("../data/df.csv", index_col=0)

In [7]:
def dataframe2corpus(x: DataFrame):
    return [list(enumerate(row)) for _, row in x.iterrows()]    

In [8]:
def dataframe2idmapping(x: DataFrame):
    return {key: value for key, value in enumerate(x.columns)}

In [9]:
tmp = data.iloc[:3, :3]
display(tmp)
display(dataframe2corpus(tmp))
display(dataframe2idmapping(tmp))

|        | eelmine | järgmine | viimane |
|--------|---------|----------|---------|
| aasta  | 53250   | 38107    | 24410   |
| aeg    | 75      | 65       | 27915   |
| määrus | 16      | 110      | 17      |


[[(0, 53250), (1, 38107), (2, 24410)],
 [(0, 75), (1, 65), (2, 27915)],
 [(0, 16), (1, 110), (2, 17)]]

{0: 'eelmine', 1: 'järgmine', 2: 'viimane'}
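
Note that `dataframe2corpus` emits a pair for every column, including columns whose count is zero. A sparse variant that skips zero counts is sketched below on plain Python lists standing in for the dataframe. The function name `rows2sparse_corpus` and the last data row are made up for illustration, and whether dropping zeros changes anything for the real data is not checked here.

```python
# The first three rows repeat the table above; the last row is hypothetical
# and only serves to show how zero counts are skipped.
columns = ["eelmine", "järgmine", "viimane"]
rows = [
    [53250, 38107, 24410],  # aasta
    [75, 65, 27915],        # aeg
    [16, 110, 17],          # määrus
    [0, 4, 0],              # hypothetical noun with two unseen adjectives
]


def rows2sparse_corpus(count_rows):
    """Return one list of (column_id, count) pairs per row, skipping zeros."""
    return [
        [(col_id, count) for col_id, count in enumerate(counts) if count > 0]
        for counts in count_rows
    ]


sparse_corpus = rows2sparse_corpus(rows)
print(sparse_corpus[0])   # [(0, 53250), (1, 38107), (2, 24410)]
print(sparse_corpus[-1])  # [(1, 4)]
```

For a dataframe, the same filter could be added to the list comprehension inside `dataframe2corpus`.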

## II. Gensim call 

Setting the parameter `alpha` to `'auto'` makes the implementation learn the prior distribution for topics from the data as well. The other settings keep the initial prior fixed during training, as can be seen from the gensim source code.

In [10]:
tmp = data.iloc[:100, :100]
corpus = dataframe2corpus(tmp)
idmapping = dataframe2idmapping(tmp)

ldamodel = lda(corpus, num_topics=5, id2word=idmapping, alpha='auto')
assert ldamodel.optimize_alpha, "The training does not look for optimal alpha"
ldamodel.print_topics()

[(0,
  '0.206*"käesolev" + 0.111*"kogu" + 0.071*"viimane" + 0.062*"uus" + 0.041*"lugupeetud" + 0.031*"pikk" + 0.027*"muudetud" + 0.027*"vaba" + 0.022*"hea" + 0.021*"õige"'),
 (1,
  '0.178*"eelmine" + 0.143*"järgmine" + 0.093*"viimane" + 0.073*"hea" + 0.042*"uus" + 0.041*"käesolev" + 0.035*"tulev" + 0.031*"suur" + 0.029*"juriidiline" + 0.022*"läinud"'),
 (2,
  '0.182*"suur" + 0.082*"kohalik" + 0.060*"viimane" + 0.049*"uus" + 0.045*"noor" + 0.034*"lugupeetud" + 0.029*"hea" + 0.028*"praegune" + 0.024*"väike" + 0.022*"järgmine"'),
 (3,
  '0.128*"uus" + 0.081*"avalik" + 0.077*"hea" + 0.076*"viimane" + 0.054*"kogu" + 0.032*"isiklik" + 0.028*"pikk" + 0.027*"poliitiline" + 0.026*"praegune" + 0.025*"õige"'),
 (4,
  '0.105*"viimane" + 0.100*"järgmine" + 0.057*"ettenähtud" + 0.049*"terve" + 0.048*"vajalik" + 0.048*"eelmine" + 0.046*"kohalik" + 0.036*"uus" + 0.034*"külm" + 0.032*"üksikasjalik"')]

## III. Proper extraction of model parameters

We extract two sets of parameters: the matrix $\boldsymbol{\beta}$ of word probabilities for each topic and the prior $\boldsymbol{\alpha}$ over topics.

In [11]:
beta = DataFrame(ldamodel.get_topics()).rename(columns=idmapping).transpose()
display(beta)

alpha = DataFrame(ldamodel.alpha, columns=['prior'])
display(alpha)

with open('model.pkl', 'wb') as f:
    pickle.dump([alpha, beta], f)

|             | 0        | 1        | 2        | 3        | 4        |
|-------------|----------|----------|----------|----------|----------|
| eelmine     | 0.018361 | 0.177570 | 0.014753 | 0.019628 | 0.047648 |
| järgmine    | 0.020323 | 0.143423 | 0.022139 | 0.016380 | 0.100096 |
| viimane     | 0.071203 | 0.093035 | 0.059599 | 0.075985 | 0.104720 |
| käesolev    | 0.206296 | 0.040864 | 0.007287 | 0.010630 | 0.026075 |
| kogu        | 0.110683 | 0.012068 | 0.022004 | 0.054339 | 0.023873 |
| ...         | ...      | ...      | ...      | ...      | ...      |
| õigusjärgne | 0.004404 | 0.000118 | 0.000238 | 0.000534 | 0.000008 |
| kõva        | 0.000423 | 0.000722 | 0.006294 | 0.001153 | 0.000275 |
| eraõiguslik | 0.000205 | 0.003066 | 0.000507 | 0.000259 | 0.000397 |
| võrdne      | 0.001321 | 0.002863 | 0.003406 | 0.004784 | 0.000585 |


|   | prior    |
|---|----------|
| 0 | 0.183308 |
| 1 | 0.207601 |
| 2 | 0.250066 |
| 3 | 0.235056 |
| 4 | 0.085477 |
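
The learned `alpha` values do not sum to one, which fits reading them as Dirichlet concentration parameters rather than as probabilities. Under that reading, the expected share of topic $k$ in a document is the Dirichlet mean $\alpha_k / \sum_j \alpha_j$. The sketch below does this arithmetic on the values printed above; the variable names are introduced here for illustration only.

```python
# Values copied from the `alpha` table printed above.
alpha = [0.183308, 0.207601, 0.250066, 0.235056, 0.085477]

# The sum is about 0.9615, so the vector is not itself a distribution.
alpha_sum = sum(alpha)

# Mean of a Dirichlet distribution: each parameter divided by the total.
expected_proportions = [a / alpha_sum for a in alpha]

print(round(alpha_sum, 6))
for topic, share in enumerate(expected_proportions):
    print(topic, round(share, 4))
```

The normalised values preserve the ordering of the topics, so topic 2 remains the most probable a priori and topic 4 the least.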


## IV. Notes on training

The LDA implementation in gensim is quite similar to the sklearn implementation. However, their perplexity functions differ: the two return different values, but these are monotonically related.

* https://stackoverflow.com/questions/40524768/perplexity-comparision-issue-in-sklearn-lda-vs-gensim-lda
* https://github.com/RaRe-Technologies/gensim/issues/457
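
One way such a monotone relation can arise is sketched below. It assumes, in line with the discussions linked above but not verified here, that both libraries start from the same per-word log-likelihood bound $b$ and that one reports $2^{-b}$ while the other reports $e^{-b}$. The function names and the bound values are hypothetical.

```python
import math


def perplexity_base2(per_word_bound):
    """Perplexity computed as 2 ** (-bound)."""
    return 2.0 ** (-per_word_bound)


def perplexity_base_e(per_word_bound):
    """Perplexity computed as exp(-bound)."""
    return math.exp(-per_word_bound)


# Hypothetical per-word bounds, ordered from worst to best model.
bounds = [-9.0, -8.0, -7.5, -7.0]

base2_values = [perplexity_base2(b) for b in bounds]
base_e_values = [perplexity_base_e(b) for b in bounds]

for p2, pe in zip(base2_values, base_e_values):
    # exp(-b) equals (2 ** -b) raised to the power 1 / ln 2.
    assert math.isclose(pe, p2 ** (1.0 / math.log(2.0)))
    print(round(p2, 1), round(pe, 1))
```

Because one value is a fixed positive power of the other, both rank a set of models in the same order even though the numbers differ.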