This notebook implements the Biterm Topic Model (BTM) proposed <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.402.4032&rep=rep1&type=pdf">in this paper</a>. It uses a stored pickle of an English-language page from the `data` folder. It is a deprecated version of the Topic Modelling notebook; the C++ version is more efficient.<br>

<a href="https://github.com/markoarnauto/biterm">The Github link</a> to the biterm library utilized here.

In [None]:
import pandas as pd
from utils.lngselection import abbreviation
from wikiwho_wrapper import WikiWho
from external.wikipedia import WikipediaDV, WikipediaAPI
from metrics.conflict import ConflictManager
import numpy as np
import pyLDAvis
import random
from sklearn.feature_extraction.text import CountVectorizer
from biterm.utility import vec_to_biterms, topic_summuary
from biterm.btm import oBTM

Loading the page pickle:

In [None]:
%%capture
## Some Extensions ##
%load_ext autoreload
%autoreload 2
%store -r the_page

if 'the_page' not in locals():
    import pickle
    print("Loading default data...")
    # the context manager closes the file once the pickle has been read
    with open("data/the_page.p", 'rb') as f:
        the_page = pickle.load(f)

lng = abbreviation('English')

Instantiating WikiWho, then retrieving `all_content` (every token that has been changed) and `revisions` (data about each revision).

In [None]:
wikiwho = WikiWho(lng=lng)
all_content = wikiwho.dv.all_content(the_page['page_id'])
revisions = wikiwho.dv.rev_ids_of_article(the_page['page_id'])

`ConflictManager` is used to retrieve conflicts and conflicting tokens. The dataframe `token` contains all tokens that have been changed on the page, with information on revisions, original insertions and editors. `tokens_processed` is a dataframe in which the tokens are grouped by revision: each row holds a rev_id and the list of token_ids changed in that revision.

In [None]:
con_manager = ConflictManager(all_content.copy(),
                              revisions.copy(),
                              lng=lng,
                              include_stopwords=False)

In [None]:
con_manager.calculate()
token = con_manager.all_actions.copy()
tokens_processed = token[['rev_id', 'rev_time', 'editor', 'token_id', 'token']].groupby("rev_id")['token_id'].apply(lambda group_series: group_series.to_numpy()).reset_index()
tokens_processed['token_id']
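
As a minimal sketch of the grouping step above, the same `groupby` can be tried on a small hand-made frame. The frame `toy_token` and its values are invented for illustration and only mimic the columns used here; they are not real WikiWho output.

```python
import pandas as pd

# Hypothetical stand-in for `con_manager.all_actions` (values are made up).
toy_token = pd.DataFrame({
    "rev_id":   [10, 10, 11, 12, 12, 12],
    "token_id": [1, 2, 3, 1, 4, 5],
    "token":    ["camp", "saints", "novel", "camp", "french", "author"],
})

# One row per revision; the token_ids of that revision are collected into an array.
toy_processed = (
    toy_token.groupby("rev_id")["token_id"]
    .apply(lambda group_series: group_series.to_numpy())
    .reset_index()
)
print(toy_processed)
```

Revision 11 ends up with a single token_id, which is the case handled in the next step.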

If a revision contains only one changed token, its token_id is appended to either the previous or the next revision, chosen at random. `token_ids` contains all unique token_ids. `X_new` is a NumPy array with one row per revision; each row has one entry per unique token_id, set to 1 if that token_id is present in the revision and 0 otherwise. In other words, the lists of token_ids are turned into a binary document-term matrix (presence rather than counts).

In [None]:

#this is used for vectorizing tokenized lists
def dummy(doc):
    return doc
vec = CountVectorizer(
        tokenizer=dummy,
        preprocessor=dummy,
    )  

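#this loop randomly merges lists with 1 token_id into the previous or next revision
#(np.append returns a new array, so the result has to be assigned back)
last = len(tokens_processed) - 1
for i, row in tokens_processed[np.array(list(map(len,tokens_processed.token_id.values)))==1].iterrows():
    neighbours = [j for j in (i - 1, i + 1) if 0 <= j <= last] #ignore positions outside the frame
    if neighbours:
        j = random.choice(neighbours)
        tokens_processed.at[j, 'token_id'] = np.append(tokens_processed.at[j, 'token_id'], row['token_id'][0])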
    
tokens_processed = tokens_processed[np.array(list(map(len,tokens_processed.token_id.values)))>1] #now dropping revisions with 1 token_id
token_ids = token[['token', 'token_id']].drop_duplicates()['token_id'].to_numpy() 
#tokens_processed.drop(443, inplace=True)
#tokens_processed.reset_index(inplace=True)
#X_old = vec.fit_transform(tokens_processed['token_id'].tolist()).toarray()
X = tokens_processed['token_id'].to_numpy()
#one row per revision: 1 where the token_id occurs in that revision, 0 otherwise
X_new = np.array([np.isin(token_ids, x).astype(int) for x in X])

In [None]:
X_new
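
To make the shape of `X_new` concrete, here is a small sketch of the same presence matrix built from invented data. The names `token_ids_toy`, `revisions_toy` and `X_toy` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Made-up vocabulary of token_ids and three made-up revisions.
token_ids_toy = np.array([1, 2, 3, 4, 5])
revisions_toy = [np.array([1, 2]), np.array([1, 4, 5]), np.array([3, 5])]

# Row r, column c is 1 when token_ids_toy[c] occurs in revision r.
X_toy = np.array([np.isin(token_ids_toy, rev).astype(int) for rev in revisions_toy])
print(X_toy)

# The two summaries passed to the visualization later on:
doc_lengths_toy = np.count_nonzero(X_toy, axis=1)  # distinct tokens per revision
term_frequency_toy = X_toy.sum(axis=0)             # revisions containing each token
```

Because the matrix is binary, the column sums count revisions rather than raw token occurrences.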

Then vocabulary and biterms are extracted to match the btm requirements. <a href="https://github.com/markoarnauto/biterm/blob/master/biterm/utility.py#L7">Function code</a>.

In [None]:
#vocab = np.array(vec.get_feature_names())
vocab = token[['token', 'token_id']].drop_duplicates()['token'].to_numpy()
biterms = vec_to_biterms(X_new)
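
A biterm is an unordered pair of distinct words that occur in the same document (here, the same revision). The sketch below shows one way to derive such pairs from a binary matrix; `biterms_from_rows` is a hypothetical helper written for illustration and should not be read as the library's `vec_to_biterms`, whose actual code is linked above.

```python
from itertools import combinations

import numpy as np


def biterms_from_rows(X):
    """Return, for each row, all pairs of column indices that are non-zero."""
    return [
        list(combinations(np.flatnonzero(row).tolist(), 2))
        for row in np.asarray(X)
    ]


# Made-up presence matrix: 3 revisions, 5 vocabulary entries.
X_demo = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 0, 1],
])
biterms_demo = biterms_from_rows(X_demo)
print(biterms_demo)
```

A revision with n distinct tokens yields n(n-1)/2 biterms, so a revision with a single token would yield none.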

Creating a BTM and training it on the biterms. Please keep in mind that this takes some time (approximately 5 hours for The Camp of the Saints article). Training uses Gibbs sampling. <a href="https://github.com/markoarnauto/biterm/blob/master/biterm/btm.py">Biterm Topic Model code</a>. `alpha` and `beta` are the Dirichlet hyperparameters for the topic distribution and the topic-word distributions, and `l` is used for tuning `alpha` and `beta` after fitting.

In [None]:
btm = oBTM(num_topics=30, V=vocab, alpha = 0.1, beta = 0.01) #default alpha=1., beta=0.01, l=0.5
topics = btm.fit_transform(biterms, iterations=100)
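
To illustrate what Gibbs sampling does during training, the following is a simplified sketch of a collapsed Gibbs sampler for BTM, based on the update rule as I read it in the paper. It is not the biterm library's implementation; the function `btm_gibbs`, its parameters and the toy data are assumptions made for this example.

```python
import numpy as np


def btm_gibbs(biterms, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iterations=50, seed=0):
    """Toy collapsed Gibbs sampler for BTM over a flat list of (w1, w2) pairs."""
    rng = np.random.default_rng(seed)
    biterms = np.asarray(biterms)
    z = rng.integers(num_topics, size=len(biterms))    # random initial topics
    n_z = np.bincount(z, minlength=num_topics).astype(float)
    n_wz = np.zeros((vocab_size, num_topics))          # word-topic counts
    for (w1, w2), k in zip(biterms, z):
        n_wz[w1, k] += 1
        n_wz[w2, k] += 1

    for _ in range(iterations):
        for b, (w1, w2) in enumerate(biterms):
            k = z[b]
            # Remove the current assignment of biterm b from the counts.
            n_z[k] -= 1
            n_wz[w1, k] -= 1
            n_wz[w2, k] -= 1
            # Each biterm contributes two words, so a topic holds 2 * n_z words.
            denom = (2 * n_z + vocab_size * beta) ** 2
            p = (n_z + alpha) * (n_wz[w1] + beta) * (n_wz[w2] + beta) / denom
            k = rng.choice(num_topics, p=p / p.sum())
            # Record the newly sampled topic.
            z[b] = k
            n_z[k] += 1
            n_wz[w1, k] += 1
            n_wz[w2, k] += 1

    theta = (n_z + alpha) / (n_z.sum() + num_topics * alpha)        # P(z)
    phi = (n_wz + beta) / (n_wz.sum(axis=0) + vocab_size * beta)    # P(w | z)
    return theta, phi


# Made-up corpus: two word pairs that never co-occur with each other.
toy_biterms = [(0, 1), (0, 1), (0, 1), (2, 3), (2, 3), (2, 3)]
theta, phi = btm_gibbs(toy_biterms, num_topics=2, vocab_size=4, iterations=20)
print(theta)
print(phi.round(3))
```

Every sweep resamples the topic of each biterm in turn, which is why the running time grows with both the number of biterms and the number of iterations.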

Using LDA visualization (pyLDAvis) to display the topics and their most frequent words. The plot is also saved as an HTML file (`topic_modelling_03_07.html`).

In [None]:
vis = pyLDAvis.prepare(btm.phi_wz.T, topics, np.count_nonzero(X_new, axis=1), vocab, np.sum(X_new, axis=0))

pyLDAvis.save_html(vis, 'topic_modelling_03_07.html')  # path to output

#print("\n\n Topic coherence ..")
#topic_summuary(btm.phi_wz.T, X, vocab, 10)


In [None]:
pyLDAvis.display(vis)
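
The `pyLDAvis.prepare` call above takes, in order, the topic-term distributions, the document-topic distributions, the document lengths, the vocabulary and the term frequencies. As far as I know, pyLDAvis expects each row of the two distribution matrices to sum to 1, so a quick check before plotting can save a long rerun. The helper `check_vis_inputs` below is a hypothetical sketch using only NumPy, with made-up inputs.

```python
import numpy as np


def check_vis_inputs(topic_term, doc_topic, X):
    """Derive doc lengths and term frequencies and list obvious input problems."""
    topic_term = np.asarray(topic_term, dtype=float)
    doc_topic = np.asarray(doc_topic, dtype=float)
    X = np.asarray(X)
    doc_lengths = np.count_nonzero(X, axis=1)
    term_frequency = X.sum(axis=0)

    problems = []
    if topic_term.shape[1] != X.shape[1]:
        problems.append("topic-term columns do not match the vocabulary size")
    if doc_topic.shape != (X.shape[0], topic_term.shape[0]):
        problems.append("doc-topic shape is not (documents, topics)")
    if np.isnan(doc_topic).any():
        problems.append("doc-topic matrix contains NaN values")
    elif not np.allclose(doc_topic.sum(axis=1), 1.0):
        problems.append("doc-topic rows do not sum to 1")
    if not np.allclose(topic_term.sum(axis=1), 1.0):
        problems.append("topic-term rows do not sum to 1")
    return doc_lengths, term_frequency, problems


# Made-up inputs: 2 topics, 2 documents, 3 vocabulary entries.
topic_term_demo = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
doc_topic_demo = [[0.9, 0.1], [0.2, 0.8]]
X_vis_demo = [[1, 1, 0], [0, 1, 1]]
lengths_demo, freqs_demo, problems_demo = check_vis_inputs(
    topic_term_demo, doc_topic_demo, X_vis_demo
)
print(problems_demo)
```

In this notebook the corresponding call would be `check_vis_inputs(btm.phi_wz.T, topics, X_new)`, assuming the variables from the cells above are still in memory.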