Basic Text Pre-processing and Topic Modelling
======

In [1]:
import glob
import math
import re
import threading
import time

import gensim
import nltk
import pyLDAvis
import tomotopy as tp
import tqdm

Data ingestion
-----

In [2]:
data_files = glob.glob("./data/bbc/*/*.txt")

In [3]:
raw_docs = []
for file in tqdm.tqdm(data_files):
    with open(file) as f:
        doc = f.read()
        raw_docs.append(doc)

print(raw_docs[0])

100%|██████████| 2225/2225 [00:00<00:00, 10695.57it/s]

Ad sales boost Time Warner profit

Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (Â£600m) for the three months to December, from $639m year-earlier.

The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.

Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AO




Text pre-processing and tokenisation
------

- Naive tokenisation (by whitespace)
- Strip leading/trailing non-informative punctuation from tokens

In [4]:
remove_punctuation = "'\"()?!,.:;<>/|_"


def naive_tokenise(doc):
    tokens = doc.split()
    tokens = [x.strip(remove_punctuation) for x in tokens]
    return tokens


docs = [naive_tokenise(doc) for doc in raw_docs]

print(" ".join(docs[0]))

Ad sales boost Time Warner profit Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn Â£600m for the three months to December from $639m year-earlier The firm which is now one of the biggest investors in Google benefited from sales of high-speed internet connections and higher advert sales TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros and less users for AOL Time Warner said on Friday that it now owns 8% of search-engine Google But its own internet business AOL had has mixed fortunes It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters However the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing custome
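As a small illustration of what the stripping step does and does not do, the sketch below applies the same `str.strip` call to a few hand-picked tokens (the sample tokens are made up for this example). Only leading and trailing characters are removed, so interior apostrophes, commas and hyphens survive, and a token made entirely of listed punctuation collapses to an empty string.

```python
# Same punctuation set as in the cell above.
PUNCT = "'\"()?!,.:;<>/|_"

# Hypothetical sample tokens, chosen to show the edge cases.
samples = ["($639m)", "AOL's", "464,000", "quarter.", '"high-speed"', "..."]

# str.strip(chars) removes any of the given characters from both ends only.
stripped = [s.strip(PUNCT) for s in samples]

for before, after in zip(samples, stripped):
    print(f"{before!r:>16} -> {after!r}")
```

Interior punctuation such as the apostrophe in "AOL's" is left for the second tokenisation pass.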

Chunk into significant bigrams/trigrams based on collocation frequency

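- Min count: Words and bigrams whose total count is below this value are ignored. Here it is set to 0.1% of the number of documents, rounded up (3 for 2,225 documents)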

- Scoring: "default" or "npmi"

- Threshold: Intuitively, higher threshold means fewer phrases. With the default scorer, this is greater than or equal to 0; with the NPMI scorer, this is in the range -1 to 1.

- Common terms: These terms will be ignored if they come between normal words. E.g., if `common_terms` includes the word "of", then when the phraser sees "Wheel of Fortune" it actually evaluates _"Wheel Fortune"_ as an n-gram, putting "of" back in only at the output level. (gensim 4.0 renamed this parameter to `connector_words`.)
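To make the NPMI threshold range more concrete, here is a minimal sketch of the textbook normalised pointwise mutual information formula, computed from raw counts. The function name and the counts are hypothetical, and this is not claimed to be gensim's exact implementation; it only shows why scores fall between -1 and 1.

```python
import math


def npmi(count_a, count_b, count_ab, total):
    """Textbook NPMI from raw counts (count_ab must be > 0 and < total)."""
    p_a = count_a / total
    p_b = count_b / total
    p_ab = count_ab / total
    # PMI divided by -log p(a, b), which bounds the score to [-1, 1].
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


# Hypothetical counts in a corpus of 1000 tokens.
always_together = npmi(10, 10, 10, 1000)    # the two words only ever co-occur
independent = npmi(100, 100, 10, 1000)      # co-occur exactly as often as chance
rarely_together = npmi(100, 100, 1, 1000)   # co-occur less often than chance

print(always_together, independent, rarely_together)
```

Under this definition, a threshold of 0.75 keeps only word pairs that appear together far more often than chance would predict.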

In [5]:
min_count = math.ceil(len(docs) / 1000)
scoring = "npmi"
# We want a relatively high threshold so that we don't start littering spurious n-grams all over our corpus, diluting our results.
# E.g., we want "Lord_of_the_Rings", but not "slightly_better_than_analysts"
threshold = 0.75
common_terms = ["a", "an", "the", "of", "on", "in", "at"]

This could take a while, so run the work in a background thread and print a basic progress indicator from the main thread.

In [6]:
def find_ngrams(docs, results):
    bigram = gensim.models.Phrases(
        docs,
        min_count=min_count,
        threshold=threshold,
        scoring=scoring,
        common_terms=common_terms,
    )
    trigram = gensim.models.Phrases(
        bigram[docs],
        min_count=min_count,
        threshold=threshold,
        scoring=scoring,
        common_terms=common_terms,
    )

    # Finalise the bigram/trigram generators for efficiency
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram_mod = gensim.models.phrases.Phraser(trigram)

    results[0] = bigram_mod
    results[1] = trigram_mod

In [7]:
print("Generating n-grams", flush=True, end="")

results = [None, None]
t = threading.Thread(target=find_ngrams, args=(docs, results))
t.start()

progress_countdown = 1.0

# Thread.isAlive() was removed in Python 3.9; is_alive() works on all Python 3 versions
while t.is_alive():
    time.sleep(0.1)
    progress_countdown -= 0.1
    if progress_countdown <= 0:
        print(" .", flush=True, end="")
        progress_countdown = 1

print(" Done.")

bigram_mod = results[0]
trigram_mod = results[1]

docs = [trigram_mod[bigram_mod[doc]] for doc in docs]

print(" ".join(docs[0]))

Generating n-grams . . . . . . . . . . Done.
Ad sales boost Time Warner profit Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn Â£600m for the three months to December from $639m year-earlier The firm which is now one of the biggest investors in Google benefited from sales of high-speed internet connections and higher advert sales TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn Its profits were buoyed by one-off gains which offset a profit dip at Warner_Bros and less users for AOL Time Warner said on Friday that it now owns 8% of search-engine Google But its own internet business AOL had has mixed fortunes It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters However the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues It hopes to increase subscribers by offering the online service free to TimeWarner internet customers a
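The thread-plus-progress-dots pattern above is used again later for the visualisation step, so it could be wrapped in a reusable helper. The following is a rough sketch using only the standard library; `run_with_progress` and `slow_square` are hypothetical names introduced for this example, not part of the notebook.

```python
import threading
import time


def run_with_progress(func, *args, label="Working", tick=1.0):
    """Run func(*args) in a thread, printing a dot every `tick` seconds."""
    result = {}

    def worker():
        # Note: an exception raised here is not propagated to the caller.
        result["value"] = func(*args)

    thread = threading.Thread(target=worker)
    print(label, end="", flush=True)
    thread.start()
    while thread.is_alive():
        # join() with a timeout doubles as the sleep between progress dots.
        thread.join(timeout=tick)
        if thread.is_alive():
            print(" .", end="", flush=True)
    print(" Done.")
    return result["value"]


def slow_square(x):
    # Stand-in for a slow computation.
    time.sleep(0.3)
    return x * x


answer = run_with_progress(slow_square, 7, label="Squaring", tick=0.1)
```

Using `join(timeout=...)` rather than `time.sleep` means the loop exits as soon as the worker finishes instead of waiting out the current sleep.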

Second-pass tokenisation
- Case folding
- Remove single apostrophes
- Remove stop words, remove purely numeric/non-alphabetic tokens

In [8]:
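# stopset = set(nltk.corpus.stopwords.words("english"))
# Stop words are deliberately kept while testing IDF term weighting, which is
# meant to down-weight very common words without an explicit stop list.
stopset = set()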


def second_tokenise(tokens):
    new_tokens = []
    for token in tokens:
        token = token.casefold()
        token = token.replace("'", "")
        # Also skip tokens that were reduced to an empty string by stripping
        if not token or token in stopset or re.match("^[^a-z]+$", token):
            continue
        new_tokens.append(token)

    return new_tokens


docs = [second_tokenise(doc) for doc in docs]

print(" ".join(docs[0]))

ad sales boost time warner profit quarterly profits at us media giant timewarner jumped to $1.13bn â£600m for the three months to december from $639m year-earlier the firm which is now one of the biggest investors in google benefited from sales of high-speed internet connections and higher advert sales timewarner said fourth quarter sales rose to $11.1bn from $10.9bn its profits were buoyed by one-off gains which offset a profit dip at warner_bros and less users for aol time warner said on friday that it now owns of search-engine google but its own internet business aol had has mixed fortunes it lost subscribers in the fourth quarter profits were lower than in the preceding three quarters however the company said aols underlying profit before exceptional items rose on the back of stronger internet advertising revenues it hopes to increase subscribers by offering the online service free to timewarner internet customers and will try to sign up aols existing customers for high-speed broad
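The output above keeps "$1.13bn" but drops "76%" and "464,000". The sketch below reproduces just the case-folding, apostrophe removal and regular-expression filter on a few tokens to show why: a token is dropped only if it contains no letter from a to z at all. The helper name `normalise` is hypothetical.

```python
import re

# Matches tokens that consist entirely of characters other than a-z.
NON_ALPHA = re.compile(r"^[^a-z]+$")


def normalise(tokens):
    """Case-fold, drop apostrophes, and discard tokens with no a-z letters."""
    kept = []
    for token in tokens:
        token = token.casefold().replace("'", "")
        if NON_ALPHA.match(token):
            continue
        kept.append(token)
    return kept


samples = ["76%", "464,000", "$1.13bn", "AOL's", "high-speed"]
result = normalise(samples)
print(result)
```

Tokens such as "$1.13bn" survive because the trailing "bn" contains letters, so mixed alphanumeric amounts remain in the vocabulary.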

Model training (LDA)
----

Add processed docs to the LDA model and train it.

The random seed and parallelisation can affect results, so setting both the seed and number of workers is necessary for reproducibility.

In [9]:
# Persistence
model_seed = 11399
num_workers = 12

# Model options
model_file = "model.bin"
num_topics = 20

# Training iterations
load_saved_model = False
train_batch = 100
train_total = 1000

# Extended training
train_until_min_ll = True
max_iterations = 10000

In [10]:
if load_saved_model:
    model = tp.LDAModel.load(model_file)
    print(f"Loaded from '{model_file}'.")
else:
    model = tp.LDAModel(tw=tp.TermWeight.IDF, seed=model_seed, k=num_topics)

    for doc in tqdm.tqdm(docs):
        model.add_doc(doc)

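    # Train for 0 iterations to initialise the model so that the corpus
    # statistics printed below are available
    model.train(0, workers=num_workers, parallel=tp.ParallelScheme.DEFAULT)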
    print(
        f"Num docs: {len(model.docs)}, Vocab size: {model.num_vocabs}, "
        f"Num words: {model.num_words}"
    )
    print(f"Removed top words: {model.removed_top_words}")

    print("Training model...", flush=True)

    try:
        for i in range(0, train_total, train_batch):
            start_time = time.perf_counter()
            model.train(
                train_batch, workers=num_workers, parallel=tp.ParallelScheme.DEFAULT
            )
            elapsed = time.perf_counter() - start_time
            print(
                f"Iteration: {i + train_batch}\tLog-likelihood: {model.ll_per_word}\t"
                f"Time: {elapsed:.3f}s",
                flush=True,
            )
    except KeyboardInterrupt:
        print("Stopping train sequence.")
    model.save(model_file)
    print(f"Saved to '{model_file}'.")

100%|██████████| 2225/2225 [00:00<00:00, 14260.28it/s]

Num docs: 2225, Vocab size: 34999, Num words: 823446
Removed top words: []
Training model...





Iteration: 100	Log-likelihood: -20.668145037626903	Time: 1.537s
Iteration: 200	Log-likelihood: -20.34755185784935	Time: 1.336s
Iteration: 300	Log-likelihood: -20.18207772854758	Time: 1.325s
Iteration: 400	Log-likelihood: -20.084874617274668	Time: 1.352s
Iteration: 500	Log-likelihood: -20.022642685940397	Time: 1.324s
Iteration: 600	Log-likelihood: -19.969369306127888	Time: 1.339s
Iteration: 700	Log-likelihood: -19.935580342669827	Time: 1.332s
Iteration: 800	Log-likelihood: -19.910520841113637	Time: 1.322s
Iteration: 900	Log-likelihood: -19.893262059659186	Time: 1.319s
Iteration: 1000	Log-likelihood: -19.875714284677485	Time: 1.318s
Saved to 'model.bin'.


In [11]:
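# Keep training in batches until the per-word log-likelihood has dropped for
# two consecutive batches (or max_iterations is reached), saving the model
# after every batch that did not drop.
if train_until_min_ll: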
    print("Continuing to train until minimum log-likelihood...")
    print("(N.B.: This may not correlate with increased human interpretability)")
    last_ll = model.ll_per_word
    i = 0
    consecutive_loss = 0

    while True:
        try:
            start_time = time.perf_counter()
            model.train(
                train_batch, workers=num_workers, parallel=tp.ParallelScheme.DEFAULT
            )
            i += train_batch
            elapsed = time.perf_counter() - start_time
            print(
                f"Iteration: {i}\tLog-likelihood: {model.ll_per_word}\t"
                f"Time: {elapsed:.3f}s",
                flush=True,
            )

            if model.ll_per_word < last_ll:
                consecutive_loss += 1
            else:
                consecutive_loss = 0
                model.save(model_file)
            last_ll = model.ll_per_word

            if consecutive_loss == 2 or i >= max_iterations:
                break

        except KeyboardInterrupt:
            print("Stopping extended train sequence.")
            break

    model = tp.LDAModel.load(model_file)
    print(f"Best recent model saved at '{model_file}' (LL: {model.ll_per_word}).")

Continuing to train until minimum log-likelihood...
(N.B.: This may not correlate with increased human interpretability)
Iteration: 100	Log-likelihood: -19.862121603155117	Time: 1.404s
Iteration: 200	Log-likelihood: -19.847816531532246	Time: 1.336s
Iteration: 300	Log-likelihood: -19.845141649939766	Time: 1.325s
Iteration: 400	Log-likelihood: -19.835607049491735	Time: 1.331s
Iteration: 500	Log-likelihood: -19.830071912621634	Time: 1.323s
Iteration: 600	Log-likelihood: -19.82426470629624	Time: 1.320s
Iteration: 700	Log-likelihood: -19.822704955326003	Time: 1.483s
Iteration: 800	Log-likelihood: -19.815492980037998	Time: 1.558s
Iteration: 900	Log-likelihood: -19.81584760424373	Time: 1.734s
Iteration: 1000	Log-likelihood: -19.803949302088988	Time: 1.854s
Iteration: 1100	Log-likelihood: -19.803707856724454	Time: 2.011s
Iteration: 1200	Log-likelihood: -19.796494038931105	Time: 1.910s
Iteration: 1300	Log-likelihood: -19.796152273916416	Time: 1.691s
Iteration: 1400	Log-likelihood: -19.791519038
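The stopping rule in the extended training loop can be looked at in isolation. The sketch below replays the same "two consecutive drops" logic over a list of log-likelihood values; the function name and the numbers are hypothetical and are not taken from the run above.

```python
def stop_index(initial_ll, lls, patience=2):
    """Return the index of the batch at which training would stop,
    or None if the rule never triggers."""
    last_ll = initial_ll
    consecutive_loss = 0
    for i, ll in enumerate(lls):
        if ll < last_ll:
            consecutive_loss += 1
        else:
            consecutive_loss = 0
        last_ll = ll
        if consecutive_loss == patience:
            return i
    return None


# Hypothetical per-word log-likelihoods after each batch.
history = [-19.90, -19.80, -19.85, -19.86, -19.70]
stopped_at = stop_index(-20.0, history)
print(stopped_at)
```

A single drop followed by an improvement resets the counter, which is why the isolated dips visible in a noisy log do not end training.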

Topic labelling
------

In [13]:
print("Extracting suggested topic labels...", flush=True)
# extractor = tp.label.PMIExtractor(min_cf=10, min_df=5, max_len=5, max_cand=10000)
extractor = tp.label.PMIExtractor(min_cf=5, min_df=3, max_len=5, max_cand=20000)
candidates = extractor.extract(model)
# labeler = tp.label.FoRelevance(model, candidates, min_df=5, smoothing=1e-2,
# mu=0.25)
labeler = tp.label.FoRelevance(
    model, candidates, min_df=3, smoothing=1e-2, mu=0.25, workers=num_workers
)
print("Done.")

Extracting suggested topic labels...
Done.


Print results
------

In [14]:
def print_topic(topic_id):
    # Labels
    labels = ", ".join(
        label for label, score in labeler.get_topic_labels(topic_id, top_n=10)
    )
    print(f"Suggested labels: {labels}")

    # Print this topic
    words_probs = model.get_topic_words(topic_id, top_n=10)
    words = [x[0] for x in words_probs]

    words = ", ".join(words)
    print(words)

In [15]:
for k in range(model.k):
    print(f"[Topic {k+1}]")
    print_topic(k)
    print()

[Topic 1]
Suggested labels: world number, mauresmo, australian_open, andy roddick, roddick, mario_ancic, carlos_moya, french open, the australian_open, world number one
roddick, seed, her, nadal, match, tennis, australian_open, henman, davis_cup, she

[Topic 2]
Suggested labels: gardener, 1500m, heptathlon, 60m, indoor, 800m, sotherton, the norwich_union, jason_gardener, kelly sotherton
her, she, olympic, race, champion, holmes, indoor, athens, marathon, mens

[Topic 3]
Suggested labels: film, actor, actress, films, comedy, oscar, starring, hollywood, awards, best director
film, best, award, actor, films, awards, actress, comedy, oscar, aviator

[Topic 4]
Suggested labels: the club, club, we all want, premier_league, steve morgan, if gerrard, knows what, wolves, entertain, man_utd
club, v, parry, liverpool, rangers, glazer, gerrard, football, manager, manchester_united

[Topic 5]
Suggested labels: header, free-kick, striker, keeper, subs_not_used, yards, goal, header from, minutes late

Visualise
--------
- Present data in the format expected by pyLDAvis

In [16]:
model_data = {
    "topic_term_dists": [model.get_topic_word_dist(k) for k in range(model.k)],
    "doc_topic_dists": [doc.get_topic_dist() for doc in model.docs],
    "doc_lengths": [len(doc.words) for doc in model.docs],
    "vocab": model.vocabs,
    "term_frequency": model.vocab_freq,
}
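Before handing a dictionary like this to pyLDAvis, it can help to sanity-check that the pieces agree with each other. The following is a rough, standard-library-only sketch with toy data; `check_vis_inputs` is a hypothetical helper, and the checks reflect the general shape requirements (matching dimensions, distributions summing to 1) rather than pyLDAvis's own validation code.

```python
import math


def check_vis_inputs(data, tol=1e-6):
    """Return a list of human-readable problems found in the input dict."""
    topic_term = data["topic_term_dists"]
    doc_topic = data["doc_topic_dists"]
    n_topics = len(topic_term)
    n_terms = len(data["vocab"])
    n_docs = len(data["doc_lengths"])

    problems = []
    if any(len(row) != n_terms for row in topic_term):
        problems.append("topic_term_dists rows must have one entry per vocab term")
    if len(data["term_frequency"]) != n_terms:
        problems.append("term_frequency must have one entry per vocab term")
    if len(doc_topic) != n_docs:
        problems.append("doc_topic_dists must have one row per document")
    if any(len(row) != n_topics for row in doc_topic):
        problems.append("doc_topic_dists rows must have one entry per topic")
    for name, rows in (("topic_term_dists", topic_term), ("doc_topic_dists", doc_topic)):
        if any(not math.isclose(sum(row), 1.0, abs_tol=tol) for row in rows):
            problems.append(f"{name} rows must each sum to 1")
    return problems


# Toy data: 2 topics, 3 vocabulary terms, 2 documents.
toy = {
    "topic_term_dists": [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]],
    "doc_topic_dists": [[0.9, 0.1], [0.25, 0.75]],
    "doc_lengths": [12, 7],
    "vocab": ["film", "club", "match"],
    "term_frequency": [8, 6, 5],
}
problems = check_vis_inputs(toy)
print(problems or "Inputs look consistent.")
```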

Again, this could take a while

In [17]:
def prepare_vis(model_data, results):
    vis_data = pyLDAvis.prepare(**model_data)
    results[0] = vis_data

In [18]:
print("Preparing LDA visualisation", flush=True, end="")

results = [None]
t = threading.Thread(target=prepare_vis, args=(model_data, results))
t.start()

progress_countdown = 1.0

# Thread.isAlive() was removed in Python 3.9; is_alive() works on all Python 3 versions
while t.is_alive():
    time.sleep(0.1)
    progress_countdown -= 0.1
    if progress_countdown <= 0:
        print(" .", flush=True, end="")
        progress_countdown = 1

print(" Done.")

vis_data = results[0]

Preparing LDA visualisation . . . . . . . . . . . . . . . . . . . Done.


In [19]:
pyLDAvis.display(vis_data)

Iterate
--------
- See what the main topics might be, slice initial corpus and re-run LDA to get sub-topics
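One possible way to slice the corpus for a second pass is to keep only the documents in which a chosen topic dominates. The sketch below uses toy data; `slice_by_topic` and the 0.5 cut-off are assumptions made for illustration, and in the notebook the distributions could come from `model_data["doc_topic_dists"]`, assuming its rows line up one-to-one with `docs`.

```python
def slice_by_topic(docs, doc_topic_dists, topic_id, min_share=0.5):
    """Keep documents whose share of `topic_id` is at least `min_share`."""
    return [
        doc
        for doc, dist in zip(docs, doc_topic_dists)
        if dist[topic_id] >= min_share
    ]


# Toy corpus: three tokenised documents and their topic distributions.
toy_docs = [
    ["roddick", "seed", "match"],
    ["film", "oscar", "actor"],
    ["tennis", "davis_cup", "film"],
]
toy_dists = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]

tennis_docs = slice_by_topic(toy_docs, toy_dists, topic_id=0)
print(tennis_docs)
```

The resulting subset can then go through the same add-documents-and-train steps as the full corpus, typically with a smaller number of topics.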