Basic Text Pre-processing and Topic Modelling
======

In [1]:
import glob
import math
import re
import threading
import time

import gensim
import nltk
import pyLDAvis
import tomotopy as tp
import tqdm

Data ingestion
-----

In [2]:
data_files = glob.glob("./data/bbc/*/*.txt")

In [3]:
raw_docs = []
for file in tqdm.tqdm(data_files):
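    # Read as UTF-8 explicitly: the platform default encoding can garble
    # characters such as "£". Undecodable bytes are replaced, not fatal.
    with open(file, encoding="utf-8", errors="replace") as f:
        raw_docs.append(f.read())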

print(raw_docs[0])

100%|██████████| 2225/2225 [00:00<00:00, 9757.48it/s]

Ad sales boost Time Warner profit

Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.

The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.

Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AO




Text pre-processing and tokenisation
------

- Naive tokenisation (by whitespace)
- Strip leading/trailing non-informative punctuation from tokens

In [4]:
remove_punctuation = "'\"()?!,."


def naive_tokenise(doc):
    tokens = doc.split()
    tokens = [x.strip(remove_punctuation) for x in tokens]
    return tokens


docs = [naive_tokenise(doc) for doc in raw_docs]

print(" ".join(docs[0]))

Ad sales boost Time Warner profit Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn £600m for the three months to December from $639m year-earlier The firm which is now one of the biggest investors in Google benefited from sales of high-speed internet connections and higher advert sales TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros and less users for AOL Time Warner said on Friday that it now owns 8% of search-engine Google But its own internet business AOL had has mixed fortunes It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters However the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing custome
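A small illustration of what the naive tokeniser does and does not handle. Because `str.strip` only removes characters from the two ends of a token, punctuation inside a token (as in "$1.13bn" or "AOL's") survives. The sample sentence is made up for this example.

```python
remove_punctuation = "'\"()?!,."


def naive_tokenise(doc):
    # Split on runs of whitespace, then trim punctuation from both ends only
    return [token.strip(remove_punctuation) for token in doc.split()]


# Invented sample sentence, not taken from the corpus
sample = 'Profits rose to $1.13bn (up 76%), said "AOL\'s" owner.'
tokens = naive_tokenise(sample)
print(tokens)
```

Note that "76%" keeps its percent sign and "$1.13bn" keeps its dollar sign and decimal point, since neither character is in `remove_punctuation`.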

Chunk into significant bigrams/trigrams based on collocation frequency
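- Min count: Words and bigrams whose total count in the corpus is below this value are ignored. Here it is set to 0.1% of the number of documents, rounded up
- Threshold: The minimum collocation score needed to join words into a phrase, so a higher threshold means fewer phrases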
- Common terms: These terms will be ignored if they come between normal words. E.g., if `common_terms` includes the word "of", then when the phraser sees "Wheel of Fortune" it actually evaluates _"Wheel Fortune"_ as an n-gram, putting "of" back in only at the output level.

In [5]:
min_count = math.ceil(len(docs) / 1000)
threshold = 25
common_terms = ["a", "an", "the", "of", "on", "in", "at"]
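To make the effect of `min_count` and `threshold` concrete, here is a sketch of the default collocation score described in the gensim documentation: a candidate bigram is joined when its score exceeds `threshold`. The function name `phrase_score` and all counts below are invented for illustration; this is not gensim's own code.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Default-style score: (count_ab - min_count) * vocab_size / (count_a * count_b)."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


min_count = 3
threshold = 25
vocab_size = 40_000  # invented vocabulary size

# Two rare words that almost always appear together: strong collocation
strong = phrase_score(
    count_a=20, count_b=25, count_ab=18, vocab_size=vocab_size, min_count=min_count
)

# Two frequent words that only occasionally appear together: weak collocation
weak = phrase_score(
    count_a=2000, count_b=1500, count_ab=30, vocab_size=vocab_size, min_count=min_count
)

print(f"strong: {strong:.2f}, joined: {strong > threshold}")
print(f"weak:   {weak:.2f}, joined: {weak > threshold}")
```

Raising `min_count` lowers every score, and raising `threshold` raises the bar each score must clear, so both make the phraser more conservative.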

This could take a while, so run it in a worker thread and show a basic progress indicator from the main thread.

In [6]:
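def find_ngrams(docs, results):
    # Note: gensim 4.0 renamed `common_terms` to `connector_words`;
    # the keyword used here targets gensim 3.x.
    bigram = gensim.models.Phrases(
        docs, min_count=min_count, threshold=threshold, common_terms=common_terms
    )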
    trigram = gensim.models.Phrases(
        bigram[docs],
        min_count=min_count,
        threshold=threshold,
        common_terms=common_terms,
    )

    # Finalise the bigram/trigram generators for efficiency
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram_mod = gensim.models.phrases.Phraser(trigram)

    results[0] = bigram_mod
    results[1] = trigram_mod

In [7]:
print("Generating n-grams", flush=True, end="")

results = [None, None]
t = threading.Thread(target=find_ngrams, args=(docs, results))
t.start()

progress_countdown = 1.0

while t.is_alive():
    time.sleep(0.1)
    progress_countdown -= 0.1
    if progress_countdown <= 0:
        print(" .", flush=True, end="")
        progress_countdown = 1

print(" Done.")

bigram_mod = results[0]
trigram_mod = results[1]

docs = [trigram_mod[bigram_mod[doc]] for doc in docs]

print(" ".join(docs[0]))

Generating n-grams . . . . . . . . . . Done.
Ad sales boost Time Warner profit Quarterly profits at US media_giant TimeWarner jumped 76% to $1.13bn £600m for the three_months to December from $639m year-earlier The firm which is now one_of_the_biggest investors in Google benefited from sales of high-speed_internet connections and higher advert sales TimeWarner said fourth_quarter sales_rose 2% to $11.1bn from $10.9bn Its profits were buoyed_by one-off gains which offset a profit dip at Warner_Bros and less users for AOL Time Warner said on Friday that it now_owns 8% of search-engine Google But its_own internet business AOL had has mixed fortunes It lost 464,000 subscribers in the fourth_quarter_profits were lower_than in the preceding three quarters However the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising_revenues It hopes to increase subscribers by offering the online service free to TimeWarner internet customers a

Second-pass tokenisation
- Case folding
- Remove single apostrophes
- Remove stop words, remove purely numeric/non-alphabetic tokens

In [8]:
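# Requires the NLTK stop word list: run nltk.download("stopwords") once
stopset = set(nltk.corpus.stopwords.words("english"))


def second_tokenise(tokens):
    new_tokens = []
    for token in tokens:
        # Case-fold and drop apostrophes (e.g. "AOL's" -> "aols")
        token = token.casefold()
        token = token.replace("'", "")
        # Skip empty tokens, stop words, and tokens with no letters a-z
        if not token or token in stopset or re.match(r"^[^a-z]+$", token):
            continue
        new_tokens.append(token)

    return new_tokens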


docs = [second_tokenise(doc) for doc in docs]

print(" ".join(docs[0]))

ad sales boost time warner profit quarterly profits us media_giant timewarner jumped $1.13bn £600m three_months december $639m year-earlier firm one_of_the_biggest investors google benefited sales high-speed_internet connections higher advert sales timewarner said fourth_quarter sales_rose $11.1bn $10.9bn profits buoyed_by one-off gains offset profit dip warner_bros less users aol time warner said friday now_owns search-engine google its_own internet business aol mixed fortunes lost subscribers fourth_quarter_profits lower_than preceding three quarters however company said aols underlying profit exceptional items rose back stronger internet advertising_revenues hopes increase subscribers offering online service free timewarner internet customers try sign_up aols existing customers high-speed broadband timewarner also restate results following probe us securities exchange_commission_sec close concluding time warners fourth_quarter_profits slightly_better_than analysts expectations film
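The filter in the second pass is easiest to understand on a handful of tokens. The sketch below mirrors the same rules with a tiny stand-in stop list (an assumption made so the example runs without downloading NLTK data); the token list is invented.

```python
import re

# Tiny stand-in stop list; the notebook uses NLTK's English stop words instead
stopset = {"the", "to", "in", "for"}

# Matches tokens that contain no lowercase letter a-z at all
NON_ALPHA = re.compile(r"^[^a-z]+$")


def second_pass(tokens):
    kept = []
    for token in tokens:
        token = token.casefold().replace("'", "")
        if token in stopset or NON_ALPHA.match(token):
            continue
        kept.append(token)
    return kept


tokens = ["The", "76%", "$1.13bn", "2004", "AOL's", "three_months", "--"]
kept = second_pass(tokens)
print(kept)
```

Tokens such as "76%" and "2004" are dropped because they contain no letters, while "$1.13bn" is kept because of its trailing "bn". That explains why currency amounts remain in the processed output above.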

Model training (LDA)
----

Add processed docs to the LDA model and train it

In [9]:
model_seed = 11399
num_topics = 20
train_and_save = True
model_file = "model.bin"
# Training iterations
train_batch = 50
train_total = 500

In [10]:
if train_and_save:
    model = tp.LDAModel(seed=model_seed, k=num_topics)

    for doc in tqdm.tqdm(docs):
        model.add_doc(doc)

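    # A zero-iteration call initialises the model so that the corpus
    # statistics printed below are available before training starts
    model.train(0, workers=0)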
    print(
        f"Num docs: {len(model.docs)}, Vocab size: {model.num_vocabs}, "
        f"Num words: {model.num_words}"
    )
    print(f"Removed top words: {model.removed_top_words}")

    print("Training model...", flush=True)

    try:
        for i in range(0, train_total, train_batch):
            start_time = time.perf_counter()
            model.train(train_batch, workers=0)
            elapsed = time.perf_counter() - start_time
            print(
                f"Iteration: {i + train_batch}\tLog-likelihood: {model.ll_per_word}\t"
                f"Time: {elapsed:.3f}s",
                flush=True,
            )
    except KeyboardInterrupt:
        print("Stopping train sequence.")
    model.save(model_file)
    print(f"Saved to '{model_file}'.")
else:
    model = tp.LDAModel.load(model_file)
    print(f"Loaded from '{model_file}'.")

100%|██████████| 2225/2225 [00:00<00:00, 20040.82it/s]

Num docs: 2225, Vocab size: 40709, Num words: 442125
Removed top words: []
Training model...





Iteration: 50	Log-likelihood: -9.562028606800228	Time: 0.623s
Iteration: 100	Log-likelihood: -9.436368643777802	Time: 0.504s
Iteration: 150	Log-likelihood: -9.367648934418513	Time: 0.519s
Iteration: 200	Log-likelihood: -9.3241451223967	Time: 0.492s
Iteration: 250	Log-likelihood: -9.296855778229718	Time: 0.490s
Iteration: 300	Log-likelihood: -9.271015426605544	Time: 0.486s
Iteration: 350	Log-likelihood: -9.248314670445591	Time: 0.489s
Iteration: 400	Log-likelihood: -9.229455485350943	Time: 0.491s
Iteration: 450	Log-likelihood: -9.224303670255114	Time: 0.498s
Iteration: 500	Log-likelihood: -9.216025007060669	Time: 0.495s
Saved to 'model.bin'.
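The per-word log-likelihood in the log above can be easier to compare across runs when converted to perplexity. A minimal sketch, assuming the reported value is a natural-log likelihood averaged per word, so that perplexity is `exp(-ll_per_word)`; the three values are copied from the training log (iterations 50, 200 and 500).

```python
import math

# Log-likelihood per word at iterations 50, 200 and 500 (from the log above)
ll_per_word = [-9.562028606800228, -9.3241451223967, -9.216025007060669]

# Assumption: natural logarithm, averaged per word
perplexities = [math.exp(-ll) for ll in ll_per_word]

for ll, ppl in zip(ll_per_word, perplexities):
    print(f"ll_per_word={ll:.3f}  perplexity={ppl:,.0f}")
```

Lower perplexity is better; the shrinking gains between later batches suggest the sampler is approaching convergence.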


Topic labelling
-----

In [11]:
print("Extracting suggested topic labels...", flush=True)
# extractor = tp.label.PMIExtractor(min_cf=10, min_df=5, max_len=5, max_cand=10000)
extractor = tp.label.PMIExtractor(min_cf=5, min_df=3, max_len=5, max_cand=20000)
candidates = extractor.extract(model)
# labeler = tp.label.FoRelevance(model, candidates, min_df=5, smoothing=1e-2,
# mu=0.25)
labeler = tp.label.FoRelevance(model, candidates, min_df=3, smoothing=1e-2, mu=0.25)
print("Done.")

Extracting suggested topic labels...
Done.
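`PMIExtractor` proposes candidate labels using pointwise mutual information (PMI). The sketch below computes the textbook PMI of a word pair from invented counts. It illustrates the idea only and is not tomotopy's implementation, which additionally applies the frequency and length limits passed to the extractor above.

```python
import math


def pmi(count_ab, count_a, count_b, total):
    """Textbook PMI: log( p(a, b) / (p(a) * p(b)) )."""
    p_ab = count_ab / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log(p_ab / (p_a * p_b))


total = 100_000  # invented number of observation windows

# Two rare words that nearly always occur together
pmi_collocation = pmi(count_ab=40, count_a=50, count_b=60, total=total)

# Two frequent words that co-occur less often than chance would predict
pmi_unrelated = pmi(count_ab=40, count_a=5000, count_b=4000, total=total)

print(f"collocation: {pmi_collocation:.3f}")
print(f"unrelated:   {pmi_unrelated:.3f}")
```

A positive PMI means the pair co-occurs more often than independent words would, which is why phrases such as "six_nations" surface as label candidates.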


Print results
------

In [12]:
def print_topic(topic_id):
    # Labels
    labels = ", ".join(
        label for label, score in labeler.get_topic_labels(topic_id, top_n=10)
    )
    print(f"Suggested labels: {labels}")

    # Print this topic
    words_probs = model.get_topic_words(topic_id, top_n=10)
    words = [x[0] for x in words_probs]

    words = ", ".join(words)
    print(words)

In [13]:
for k in range(model.k):
    print(f"[Topic {k+1}]")
    print_topic(k)
    print()

[Topic 1]
Suggested labels: six_nations, flanker, lock, fly-half, wing, rugby, ireland, capt, line-out, england
england, wales, ireland, game, france, players, six_nations, side, squad, try

[Topic 2]
Suggested labels: modern_day, detective, school, died, without_fear, laughable, outstanding_contribution, civilians, coped, too_much time
also, life, said, london, made, school, family, later, children, home

[Topic 3]
Suggested labels: gamers, gaming, nintendo, sony, portable, gameboy, consoles, console, nintendo_ds, games
games, game, music, gaming, play, said, also, online, market, players

[Topic 4]
Suggested labels: album, hip-hop, eminem, band, sound_of_2004, music, best_pop, the_streets, natasha_bedingfield, joss_stone
music, album, band, song, us, singer, show, uk, best, artists

[Topic 5]
Suggested labels: de_niro, oceans_twelve, meet_the_fockers, new_years_day, north_american_box_office, according studio_estimates, studio_estimates, comedy_meet, top film, barbra_streisand
took, 

Visualise
--------
- Present data in the format expected by pyLDAvis

In [14]:
model_data = {
    "topic_term_dists": [model.get_topic_word_dist(k) for k in range(model.k)],
    "doc_topic_dists": [doc.get_topic_dist() for doc in model.docs],
    "doc_lengths": [len(doc.words) for doc in model.docs],
    "vocab": model.vocabs,
    "term_frequency": model.vocab_freq,
}
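`pyLDAvis.prepare` expects each row of `topic_term_dists` and `doc_topic_dists` to be a probability distribution, and `vocab` and `term_frequency` to line up with the columns of `topic_term_dists`. A quick sanity check before the slow preparation step can catch shape problems early. This is a sketch on toy data; `check_model_data` is a hypothetical helper, not part of pyLDAvis.

```python
import math


def check_model_data(data, tol=1e-6):
    """Return a list of problems found in a pyLDAvis-style input dict."""
    problems = []
    n_terms = len(data["vocab"])
    n_topics = len(data["topic_term_dists"])

    if len(data["term_frequency"]) != n_terms:
        problems.append("term_frequency and vocab differ in length")
    if len(data["doc_lengths"]) != len(data["doc_topic_dists"]):
        problems.append("doc_lengths and doc_topic_dists differ in length")

    # Each row must have the expected width and sum to 1
    for name, width in (("topic_term_dists", n_terms), ("doc_topic_dists", n_topics)):
        for i, row in enumerate(data[name]):
            if len(row) != width:
                problems.append(f"{name}[{i}] has {len(row)} entries, expected {width}")
            elif not math.isclose(sum(row), 1.0, abs_tol=tol):
                problems.append(f"{name}[{i}] does not sum to 1")
    return problems


# Toy input: 2 topics, 3 terms, 2 documents (all values invented)
toy = {
    "topic_term_dists": [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]],
    "doc_topic_dists": [[0.9, 0.1], [0.4, 0.6]],
    "doc_lengths": [12, 7],
    "vocab": ["game", "music", "film"],
    "term_frequency": [8, 5, 6],
}
problems = check_model_data(toy)
print(problems or "OK")
```

The same function could be called on the notebook's `model_data` before handing it to `pyLDAvis.prepare`.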

Again, this could take a while

In [15]:
def prepare_vis(model_data, results):
    vis_data = pyLDAvis.prepare(**model_data)
    results[0] = vis_data

In [16]:
print("Preparing LDA visualisation", flush=True, end="")

results = [None]
t = threading.Thread(target=prepare_vis, args=(model_data, results))
t.start()

progress_countdown = 1.0

while t.is_alive():
    time.sleep(0.1)
    progress_countdown -= 0.1
    if progress_countdown <= 0:
        print(" .", flush=True, end="")
        progress_countdown = 1

print(" Done.")

vis_data = results[0]

Preparing LDA visualisation . . . . . . . . . . . . . . . . . . . . . . Done.
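The thread-plus-progress-dots pattern appears twice in this notebook, so it could be factored into one helper. A minimal sketch, assuming a hypothetical function name `run_with_progress`; it uses `Thread.join(timeout=...)` in place of the manual countdown and re-raises any exception from the worker in the main thread.

```python
import threading
import time


def run_with_progress(func, *args, message="Working", interval=1.0):
    """Run func(*args) in a worker thread, printing a dot every `interval` seconds."""
    outcome = {}

    def worker():
        try:
            outcome["value"] = func(*args)
        except BaseException as exc:  # hand the error back to the main thread
            outcome["error"] = exc

    t = threading.Thread(target=worker)
    print(message, end="", flush=True)
    t.start()
    while t.is_alive():
        t.join(timeout=interval)  # wait up to `interval` seconds for the worker
        if t.is_alive():
            print(" .", end="", flush=True)
    print(" Done.")

    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


def slow_square(x):
    # Stand-in for a slow call such as pyLDAvis.prepare
    time.sleep(0.25)
    return x * x


answer = run_with_progress(slow_square, 7, message="Squaring", interval=0.1)
print(answer)
```

With such a helper, the visualisation step would reduce to one call wrapping a function that invokes `pyLDAvis.prepare(**model_data)`.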


In [17]:
pyLDAvis.display(vis_data)

Iterate
--------
- Inspect the main topics, then slice the initial corpus by topic and re-run LDA on each slice to find sub-topics
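One possible way to slice the corpus is to assign each document to its most probable topic. The sketch below does this on invented distributions; `group_by_dominant_topic` is a hypothetical helper, and assigning by the single largest probability is only one of several reasonable choices.

```python
from collections import defaultdict


def group_by_dominant_topic(doc_topic_dists):
    """Map each topic id to the indices of the documents it dominates."""
    groups = defaultdict(list)
    for doc_id, dist in enumerate(doc_topic_dists):
        # Index of the largest probability in this document's distribution
        dominant = max(range(len(dist)), key=dist.__getitem__)
        groups[dominant].append(doc_id)
    return dict(groups)


# Invented distributions for 4 documents over 3 topics
toy_dists = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
groups = group_by_dominant_topic(toy_dists)
print(groups)
```

With the trained model, the distributions would come from `doc.get_topic_dist()` for each entry in `model.docs`, and the resulting indices could select the rows of `docs` to feed into a second LDA run.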