Ignis: Corpus Preparation
==========================

## Configuration

### Input/Output

This notebook will process the specified `input_files` into the consolidated `output_file` in the main codebase folder.

In [1]:
import glob

input_files = glob.glob("data/bbc/*/*.txt")

To add files from other paths to the same corpus, uncomment and modify the code in the next cell to extend the `input_files` list.

In [2]:
# input_files += glob.glob("some/other/path/*.txt")
# input_files += glob.glob("yet/another/path/*.txt")

In [3]:
output_file = "data/bbc.corpus"

### Pre-tokenisation cleaning (Regex)

Set these options to keep or remove email addresses and/or URLs found in the documents before we split their textual content into individual tokens.

In [4]:
remove_emails_regex = True
remove_urls_regex = True
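
As a rough illustration of what the email option does, the sketch below applies the rough email pattern used later in this notebook to a made-up sentence. The sample text and the names `EMAIL_REGEX`, `sample` and `cleaned` are only for this example.

```python
import re

# The rough email pattern used by this notebook's pre-tokenisation step
EMAIL_REGEX = re.compile(r"\b[a-z0-9._%+-]+@[a-z0-9.-]+[.][a-z]{2,}\b", re.IGNORECASE)

# Made-up sample text, for illustration only
sample = "Contact press@example.org for details."

# The address is replaced with an empty string; surrounding whitespace is left alone
cleaned = EMAIL_REGEX.sub("", sample)
print(repr(cleaned))
```

Note that the surrounding spaces are kept, which is harmless because tokenisation later splits on whitespace.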

### Automated n-gram detection

The script can try to automatically chunk document tokens into multi-word phrases ("n-grams") based on collocation frequency.

This process is controlled by the following parameters:

- `ngram_min_count`: How many documents each n-gram minimally needs to appear in before being considered valid.

- `ngram_scoring`: `"default"` or `"npmi"`.

- `ngram_threshold`: Intuitively, a higher threshold means fewer phrases.
  - With the default scorer, this is greater than or equal to 0; with the NPMI scorer, this is in the range `-1` to `1`.
  - We want a relatively high threshold (e.g., `0.8` for NPMI scoring) so that we don't start littering spurious n-grams all over our corpus, diluting our results. E.g., we want _Lord of the Rings_, but not _slightly better than analysts_.

- `ngram_connector_words`: These terms will be ignored if they come between normal words.
  - E.g., if `ngram_connector_words` includes the word "of", then when the phraser sees "Wheel of Fortune" it actually evaluates _"Wheel Fortune"_ as an n-gram, putting "of" back in only at the output level to give _wheel_of_fortune_.

Set the value of `ngram_iterations` to control how many iterations of the phrasing algorithm to run:
1 iteration detects bigrams, 2 iterations detects trigrams, etc.

To disable the automated phraser completely, set `ngram_iterations` to `0`.

In [5]:
ngram_min_count = 5
ngram_scoring = "npmi"
ngram_threshold = 0.6
ngram_connector_words = ["a", "an", "the", "of", "on", "in", "at"]
ngram_iterations = 2
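
To get a feel for the `-1` to `1` range of the NPMI threshold, here is a minimal sketch of the standard normalised pointwise mutual information formula. It is not taken from the Ignis or Gensim source; the function name `npmi` and the choice of `total` as the normalising count are assumptions made for illustration.

```python
import math


def npmi(count_a, count_b, count_ab, total):
    """Standard NPMI of a word pair, computed from raw counts (illustrative only)."""
    p_a = count_a / total
    p_b = count_b / total
    p_ab = count_ab / total
    # PMI divided by -log p(a, b), which bounds the score to [-1, 1]
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


# Two words that only ever occur together score (close to) 1
always_together = npmi(count_a=10, count_b=10, count_ab=10, total=1000)

# Two words that co-occur exactly as often as chance predicts score (close to) 0
independent = npmi(count_a=100, count_b=100, count_ab=10, total=1000)

print(round(always_together, 3), round(independent, 3))
```

Under this definition, a threshold such as `0.6` only accepts pairs that co-occur far more often than chance would predict.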

### File-based stop word lists

Each of the files referenced below should be a list of stop words/phrases (one per line).

Blank lines and lines starting with a hash symbol (`#`) are ignored.

Stop words can also be added or removed in real-time at the modelling stage, so the lists defined here do not need to be comprehensive.

In [6]:
# Add or remove entries from the list below as necessary.
stop_word_files = [
    "data/nltk-stopwords-en.txt"
]
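
The file format described above can be sketched as follows. The helper `parse_stop_list` and the example lines are hypothetical; they only mirror the rules stated here (one entry per line, blank lines and `#` lines ignored).

```python
def parse_stop_list(lines):
    """Collect stop words/phrases from an iterable of lines (illustrative helper)."""
    stop_words = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines
        if line != "" and not line.startswith("#"):
            stop_words.add(line)
    return stop_words


# Made-up file contents, for illustration only
example_lines = ["# English function words", "the", "", "of course", "  and  "]
parsed = parse_stop_list(example_lines)
print(sorted(parsed))
```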

### Automated stop word detection

The script employs an automated stop word detection system that can pick out potential stop words even if they are not within the file-based stop lists.

The automated stop word detection system calculates the TF-IDF scores for each term within the corpus, then picks out the `n_lowest` lowest-scoring terms from each document as candidate stop words.

For each candidate stop word, if it is among the `n_lowest` lowest-scoring terms in more than `stop_list_proportion` of the documents, it is put on the final stop word list.

- `n_lowest`: Number of terms with lowest TF-IDF scores to consider per document
- `stop_list_proportion`: The proportion of all per-document stop word lists that each candidate has to appear in before it is considered a final stop word. (The higher the threshold, the fewer the number of stop words produced. If set to `1`, no stop words will be produced.)

In [7]:
n_lowest = 50
stop_list_proportion = 0.35
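
The two-stage selection can be sketched on a toy corpus. This is a simplified illustration rather than the notebook's exact code; the documents and the names `toy_docs`, `toy_n_lowest` and `toy_proportion` are made up.

```python
import collections
import math

# Made-up toy corpus: "the" is frequent in three of the four documents
toy_docs = [
    ["the", "cat", "sat", "the", "mat"],
    ["the", "dog", "ran", "the", "park"],
    ["the", "bird", "flew", "the", "nest"],
    ["a", "fish", "swam"],
]
toy_n_lowest = 1
toy_proportion = 0.5

# Document frequency and IDF for every term
df = collections.Counter(token for doc in toy_docs for token in set(doc))
idf = {token: math.log(len(toy_docs) / count) for token, count in df.items()}

# Stage 1: each document nominates its lowest-scoring terms as candidates
candidate_counts = collections.Counter()
for doc in toy_docs:
    counts = collections.Counter(doc)
    scores = {token: (counts[token] / len(doc)) * idf[token] for token in counts}
    lowest = sorted(scores, key=scores.get)[:toy_n_lowest]
    candidate_counts.update(lowest)

# Stage 2: keep candidates nominated by more than the threshold proportion of documents
toy_stop_words = [
    token
    for token, count in candidate_counts.items()
    if count / len(toy_docs) > toy_proportion
]
print(toy_stop_words)
```

Here "the" is nominated by three of the four documents (0.75 > 0.5) and is kept, while the single nomination from the last document falls below the threshold.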

### Hapax legomena

If `True`, will add hapax legomena (i.e., words that occur in only a single document in the entire corpus) to the stop word list.

In [8]:
remove_hapax = True
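
Because the definition used here is document-level, a word that is repeated within one document still counts as a hapax if no other document contains it. A minimal sketch with made-up documents:

```python
import collections

# Made-up toy corpus: "slump" occurs twice, but only within a single document
toy_docs = [
    ["market", "rally"],
    ["market", "slump", "slump"],
    ["market", "rally"],
]

# Count the number of documents each term appears in (not its total occurrences)
doc_freq = collections.Counter(token for doc in toy_docs for token in set(doc))

toy_hapax = sorted(token for token, count in doc_freq.items() if count == 1)
print(toy_hapax)
```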

### Supplementary stop word list

Any ad hoc stop words specific to this corpus can be appended directly to this list.

As before, the stop words defined here need not be comprehensive.

In [9]:
supplementary_stop_words = ["mr", "could"]

### Whitelist

Terms in this list will be unconditionally removed from the stop list generated by the options in the previous cells.

In [10]:
whitelist = []
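
Put together, the stop list is a union of the individual sources with the whitelist subtracted last, so a whitelisted term survives no matter which source proposed it. A small sketch with made-up values (all names here are illustrative):

```python
# Made-up example values, for illustration only
automatic = {"the", "say", "year"}
supplementary = {"mr", "could"}
keep = {"year"}  # plays the role of the whitelist

# Union of all sources first, whitelist removed at the very end
toy_stop_set = (automatic | supplementary) - keep
print(sorted(toy_stop_set))
```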

### Corpus constraints

The LDA algorithm does not work well with documents that are very short; `min_doc_tokens` sets the minimum length (in final tokens) that a document needs to be in order to be added to the corpus.

Separately, if `remove_duplicates` is `True`, documents that have the same tokens **after processing** are treated as duplicates, and only one copy is kept. N.B.: Such documents need not have started out as exact duplicates; they only need to end up with identical final token lists.

In [11]:
min_doc_tokens = 5
remove_duplicates = True
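
Both constraints can be sketched in one pass over already-tokenised documents. The helper `apply_constraints` is hypothetical and combines two steps that this notebook performs in separate cells.

```python
def apply_constraints(token_lists, min_tokens, dedupe):
    """Drop short documents and, optionally, repeated token lists (illustrative helper)."""
    seen = set()
    kept = []
    for tokens in token_lists:
        if len(tokens) < min_tokens:
            continue
        # Lists are not hashable, so a tuple of the tokens serves as the key
        key = tuple(tokens)
        if dedupe and key in seen:
            continue
        seen.add(key)
        kept.append(tokens)
    return kept


# Made-up token lists: the first two are identical, the third is too short
toy_docs = [
    ["shares", "rise", "sharply"],
    ["shares", "rise", "sharply"],
    ["shares", "fall"],
]
kept = apply_constraints(toy_docs, min_tokens=3, dedupe=True)
print(kept)
```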

### Lemmatisation

The script can optionally attempt to lemmatise all document tokens using spaCy's English dictionary.

Lemmatisation attempts to reduce all words to their basic form: E.g., `"fly"`, `"flying"`, `"flew"` and `"flown"` should all be lemmatised to the single root word `"fly"`.

Because lemmatisation is a highly language-specific process, it only makes sense to use if the documents are actually in English.

In [12]:
lemmatise_as_english = True
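
Conceptually, lookup-based lemmatisation is a dictionary mapping from inflected forms to root forms, with unknown words passed through unchanged. The tiny table below is made up for illustration and is not spaCy's data; `toy_lemma_table` and `lemmatise_tokens` are hypothetical names.

```python
# Made-up lookup table; a real English table has many thousands of entries
toy_lemma_table = {"flying": "fly", "flew": "fly", "flown": "fly"}


def lemmatise_tokens(tokens, table):
    """Replace each token with its lemma if the table knows it (illustrative helper)."""
    return [table.get(token, token) for token in tokens]


lemmatised = lemmatise_tokens(["fly", "flying", "flew", "birds"], toy_lemma_table)
print(lemmatised)
```

The word "birds" is left untouched here only because the toy table has no entry for it.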

## All set!

When all the configuration options above have been set properly, select "Cell" -> "Run All" from the Jupyter Notebook menu to run the full corpus preparation script.

A summary of the script's output will be generated in the last cell of this notebook upon successful completion.

----------

In [13]:
# General library imports
import collections
import json
import pathlib
import pprint
import re

from tqdm.auto import tqdm

import ignis

Data ingestion
--------------------

We will track the contents and filename of each document, then tokenise them all and feed them into an `ignis.Corpus` that will be saved.

Ideally, we would write a single text cleaning function and run the raw text through it in one pass, but processing the documents in stages lets us see the effects of each step of the data cleaning.

In [14]:
RawDocument = collections.namedtuple("RawDocument", "metadata, tokens, display_str")
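
Each stage of the pipeline produces new `RawDocument` objects via `_replace` rather than modifying the old ones, which is what keeps the earlier stages available for inspection. A minimal sketch with made-up field values:

```python
import collections

RawDocument = collections.namedtuple("RawDocument", "metadata, tokens, display_str")

# Made-up document, for illustration only
original = RawDocument(
    {"filename": "example.txt"}, "Some Raw Text", "<p>Some Raw Text</p>"
)

# `_replace` returns a new tuple; the original is left untouched
lowered = original._replace(tokens=original.tokens.casefold())

print(original.tokens, "->", lowered.tokens)
# The metadata dict is shared between the two, so copy it before changing it
print(lowered.metadata is original.metadata)
```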

In [15]:
raw_docs = []
for file in tqdm(input_files, miniters=1):
    filename = pathlib.Path(file).as_posix()

    metadata = {"filename": filename}

    # Basic HTML conversion (we could just use the plain text as well, BeautifulSoup can parse it)
    with open(file) as f:
        tokens = f.read()
        lines = [line for line in tokens.split("\n") if line != ""]

        # Assume first line is the title
        title = f"<strong>{lines[0]}</strong>"
        paras = [f"<p>{line}</p>" for line in lines[1:]]
        body = "\n".join(paras)
        display_str = f"<html><body>{title}{body}</body></html>"

    raw_doc = RawDocument(metadata, tokens, display_str)

    raw_docs.append(raw_doc)

  0%|          | 0/2225 [00:00<?, ?it/s]

In [16]:
# Simple deduplication, round 1
if remove_duplicates:
    seen_docs = set()
    dupe_count = 0
    deduped_docs = []
    for doc in raw_docs:
        if len(doc.tokens) == 0:
            continue

        # At this stage `doc.tokens` is still the raw text (a string), which is
        # hashable as-is, so it can be used directly as the deduplication key
        doc_hash = doc.tokens
        if doc_hash in seen_docs:
            # print(f"Duplicate document: {doc[0]['filename']}")
            dupe_count += 1
        else:
            seen_docs.add(doc_hash)
            deduped_docs.append(doc)

    raw_docs = deduped_docs
    print(f"{dupe_count} duplicate documents discarded before tokenisation.")

98 duplicate documents discarded before tokenisation.


Text pre-processing and tokenisation
------

### Pre-tokenisation cleaning
- URLs/Emails

In [17]:
URL_REGEX = re.compile(
    # protocol identifier
    r"(?:(?:(?:https?|ftp):)?//)"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?" r"(?:"
    # IP address exclusion
    # private & local networks
    r"(?!(?:10|127)(?:\.\d{1,3}){3})"
    r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
    r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    r"|"
    # host & domain names, may end with dot
    # can be replaced by a shortest alternative
    # r"(?![-_])(?:[-\w\u00a1-\uffff]{0,63}[^-_]\.)+"
    # r"(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
    # # domain name
    # r"(?:\.(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)*"
    r"(?:"
    r"(?:"
    r"[a-z0-9\u00a1-\uffff]"
    r"[a-z0-9\u00a1-\uffff_-]{0,62}"
    r")?"
    r"[a-z0-9\u00a1-\uffff]\."
    r")+"
    # TLD identifier name, may end with dot
    r"(?:[a-z\u00a1-\uffff]{2,}\.?)" r")"
    # port number (optional)
    r"(?::\d{2,5})?"
    # resource path (optional)
    r"(?:[/?#]\S*)?",
    re.IGNORECASE,
)

In [18]:
pre_regexes = []

if remove_emails_regex:
    # Rough email regex
    pre_regexes.append(
        re.compile(r"\b[a-z0-9._%+-]+@[a-z0-9.-]+[.][a-z]{2,}\b", re.IGNORECASE),
    )
    
if remove_urls_regex:
    # URL regex
    pre_regexes.append(URL_REGEX)
    # Rougher URL regex
    pre_regexes.append(re.compile(r"\bwww[.][a-z0-9.-]+[.][a-z]{2,}\b", re.IGNORECASE))

In [19]:
def pre_regex(doc):
    for regex in pre_regexes:
        doc = regex.sub("", doc)
    return doc


no_url_docs = []
for doc in raw_docs:
    no_url_docs.append(doc._replace(tokens=pre_regex(doc.tokens)))

### Naive tokenisation (by whitespace)
- Split by whitespace and selected punctuation
- Case folding
- Strip leading/trailing non-informative punctuation from tokens
- Remove single apostrophes
- Remove single brackets within words
  - For dealing with cases like "the recipient(s)", which would otherwise get tokenised to "the recipient(s"
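
The per-token steps listed above can be tried out in isolation. The sketch below reuses the strip characters and bracket pairs from this notebook, but `clean_token` itself is a hypothetical helper that handles one token at a time.

```python
STRIP_PUNCTUATION = "'\"()[]<>?!,.:;/|_-+*"
BRACKET_PAIRS = [("(", ")"), ("[", "]")]


def clean_token(token):
    """Case-fold one token, strip edge punctuation and fix unbalanced brackets."""
    token = token.casefold().strip(STRIP_PUNCTUATION).replace("'", "")
    for opener, closer in BRACKET_PAIRS:
        # Only remove a bracket when its partner is missing from the token
        if opener in token and closer not in token:
            token = token.replace(opener, "")
        if closer in token and opener not in token:
            token = token.replace(closer, "")
    return token


print(clean_token("recipient(s)"))
print(clean_token("Don't!"))
print(clean_token("f(x)y"))
```

In the first case the trailing `)` is stripped as edge punctuation, which leaves an unmatched `(` that the bracket step then removes. In the last case both brackets are still present after stripping, so the token is left alone.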

In [20]:
split_punctuation = "_/."
strip_punctuation = "'\"()[]<>?!,.:;/|_-+*"
bracket_pairs = [
    ["(", ")"],
    ["[", "]"],
]


def naive_tokenise(doc):
    """
    Naively tokenises a document.
    
    Returns
    -------
    list of str
        The document's cleaned tokens, in their original order
    """
    new_tokens = []
    
    for punctuation in split_punctuation:
        doc = doc.replace(punctuation, f" {punctuation} ")

    tokens = doc.split()
    for token in tokens:
        token = token.casefold()
        token = token.strip(strip_punctuation)
        token = token.replace("'", "")

        for bracket_pair in bracket_pairs:
            if bracket_pair[0] in token and bracket_pair[1] not in token:
                token = token.replace(bracket_pair[0], "")
            if bracket_pair[1] in token and bracket_pair[0] not in token:
                token = token.replace(bracket_pair[1], "")

        if token != "":
            new_tokens.append(token)

    return new_tokens

In [21]:
naive_docs = []
for doc in no_url_docs:
    naive_docs.append(doc._replace(tokens=naive_tokenise(doc.tokens)))

In [22]:
# Lemmatisation
if lemmatise_as_english:
    import copy
    import spacy.lemmatizer
    
    sp = spacy.load("en_core_web_sm")
    lookups = sp.vocab.lookups
    lemmas = spacy.lemmatizer.Lemmatizer(lookups)
    
    def lemmatise_doc(doc):
        new_tokens = [lemmas.lookup(token) for token in doc.tokens]
        
        metadata = copy.deepcopy(doc.metadata)
        metadata["lemmatised"] = True
        
        return doc._replace(metadata=metadata, tokens=new_tokens)
    
    lemma_docs = []
    for doc in naive_docs:
        lemma_docs.append(lemmatise_doc(doc))
else:
    lemma_docs = naive_docs[:]

### Automated n-gram detection

In [23]:
for_phrasing = [doc.tokens for doc in lemma_docs]

for i in range(ngram_iterations):
    print(f"Iteration {i + 1}...")
    
    # We use the built-in Ignis phraser, which has a few algorithmic improvements
    # over the basic Gensim one
    phraser = ignis.util.ImprovedPhraser(
        for_phrasing,
        min_count=ngram_min_count,
        threshold=ngram_threshold,
        scoring=ngram_scoring,
        connector_words=ngram_connector_words,
        drop_non_alpha=True,
        verbose=True,
    )
    for_phrasing = phraser.find_ngrams(for_phrasing, verbose=True)

# `for_phrasing` contains the post-phrasing tokens for each document;
# We need to recombine them with each document's metadata etc.
phrased_docs = []
for index, doc in enumerate(for_phrasing):
    phrased_docs.append(lemma_docs[index]._replace(tokens=doc))

Iteration 1...
Gensim Phraser initialised. 4.67s
Improved Phraser initialised. 0.09s

  0%|          | 0/2127 [00:00<?, ?it/s]

Iteration 2...
Gensim Phraser initialised. 2.26s
Improved Phraser initialised. 0.01s

  0%|          | 0/2127 [00:00<?, ?it/s]

In [24]:
# # Optionally, display all the ngrams found (uncomment to show)
# seen_tokens = set()
# found_ngrams = []
# for document in phrased_docs:
#     for token in document.tokens:
#         if token.count(" ") >= 1:
#             if token not in seen_tokens:
#                 found_ngrams.append(token)
#                 seen_tokens.add(token)
# print("\n".join(sorted(found_ngrams)))

### Post-phrasing cleaning

- Remove stop words (optional)
- Remove purely numeric/non-alphabetic/single-character tokens
  - Under the assumption that significant tokens, like the "19" in "Covid 19" or the "11" in "Chapter 11 (bankruptcy)" would have been picked up by the phraser
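
The effect of the second rule can be sketched with the same test that the cleaning function below applies. The token list is made up, `keep_token` is a hypothetical helper, and the underscore-joined form of the phrased token is an assumption about how the phraser writes n-grams.

```python
import re


def keep_token(token):
    """True unless the token has no a-z characters or is a single character."""
    return not (re.search("^[^a-z]+$", token) or len(token) <= 1)


# Made-up tokens; "covid_19" stands in for an n-gram found by the phraser
tokens = ["covid_19", "19", "x", "2005", "b2b", "--"]
kept = [token for token in tokens if keep_token(token)]
print(kept)
```

A bare "19" is dropped, but the same number survives once the phraser has attached it to an alphabetic word.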

In [25]:
# Stoplist based on TF-IDF
# Not theoretically needed if using term weighting during the training process, but stopword
# removal could still help with runtimes and interpretability


import collections
import math

# Calculate IDF (For calculations of this size, the overhead of Pandas/Numpy is probably not worth it)
df_dict = {}
for doc in phrased_docs:
    for token in set(doc.tokens):
        if token in df_dict:
            df_dict[token] += 1
        else:
            df_dict[token] = 1

# While we're at it, we could remove document-level hapax legomena
hapax = [token for token, count in df_dict.items() if count == 1]

token_idf = {}
for token in df_dict:
    token_idf[token] = math.log(len(phrased_docs) / df_dict[token])


def tf_idf(tokens, token_idf):
    """
    Calculates the TF-IDF for each unique term in the given list of document tokens
    
    Parameters
    ----------
    tokens: iterable of str
        Tokens that make up a single document
    token_idf: dict
        Mapping of terms to their global IDF values
    """
    token_tf_idf = {}
    counts = collections.Counter(tokens)
    for token in set(tokens):
        token_tf = counts[token] / len(tokens)
        # In particular, accessing a Pandas Series by string index is much slower than accessing a Dictionary by key
        token_tf_idf[token] = token_tf * token_idf[token]
    return token_tf_idf

In [26]:
# Thresholding

# (using `n_lowest` and `stop_list_proportion` from the Configuration section)

per_doc_stopwords = []
for doc in phrased_docs:
    token_tf_idf = tf_idf(doc.tokens, token_idf)
    # token_tf_idf is a dict of token -> score
    scores = sorted(list(token_tf_idf.items()), key=lambda x: x[1])
    lowest_n = [score[0] for score in scores[:n_lowest]]
    per_doc_stopwords.append(set(lowest_n))

# Check how many *stopword lists* each token appears in; this is more discriminative than
# checking how many actual *documents* each token appears in instead
stopword_df = {}
for per_doc in per_doc_stopwords:
    for stopword in per_doc:
        if stopword in stopword_df:
            stopword_df[stopword] += 1
        else:
            stopword_df[stopword] = 1

total_docs = len(phrased_docs)
final_stopwords = []
for stopword, count in stopword_df.items():
    if count / total_docs > stop_list_proportion:
        final_stopwords.append((stopword, count))
final_stopwords.sort(key=lambda x: x[1], reverse=True)

# List of removed words, sorted by the number of per-document stop lists each one
# appears in (so in roughly decreasing order of commonness)
tf_idf_removed_words = [stopword for stopword, count in final_stopwords]
stop_set = set(tf_idf_removed_words)

print(f"Stopwords, ordered by TF-IDF (threshold: {stop_list_proportion}):")
print(", ".join(tf_idf_removed_words))
print(f"({len(tf_idf_removed_words)})")

Stopwords, ordered by TF-IDF (threshold: 0.35):
the, be, a, to, of, in, and, have, for, on, say, with, it, at, by, that, but, from, this, will, which, also, not, its, make, can, up, much, year, who, he, one, take, after, their, would, out, do
(38)


In [27]:
# Hapax legomena
if remove_hapax:
    stop_set = stop_set.union(set(hapax))
    print(f"Also removing document-level hapaxes ({len(hapax)}).")

Also removing document-level hapaxes (13238).


In [28]:
# Stop word files
stop_word_file_count = 0
for file in stop_word_files:
    with open(file, "r", encoding="utf8") as fp:
        for stop_word in fp:
            stop_word = stop_word.strip()
            if stop_word != "" and not stop_word.startswith("#"):
                stop_set.add(stop_word)
                stop_word_file_count += 1
print(f"Removing stop words using file-based lists ({stop_word_file_count}).")

Removing stop words using file-based lists (179).


In [29]:
# Supplementary stop words
stop_set = stop_set.union(set(supplementary_stop_words))
print(f"Removing supplementary stop words ({len(supplementary_stop_words)}).")

Removing supplementary stop words (2).


In [30]:
# Whitelisting
stop_set -= set(whitelist)
print(f"Not removing whitelisted words ({len(whitelist)}).")

Not removing whitelisted words (0).


Stop words will not be removed from each document's list of raw tokens directly, but will instead be saved as part of the corpus metadata and only processed at run-time when documents are retrieved for modelling.

Put simply, what this means is that all the documents still retain all their raw tokens, so we can _modify_ the stop word list later on without needing to re-run this corpus preparation script.
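
The idea of deferring stop word removal can be sketched as follows. This is not the Ignis API: `visible_tokens` is a hypothetical helper and the tokens are made up.

```python
def visible_tokens(raw_tokens, stop_words):
    """Filter a document's stored tokens against the current stop list."""
    return [token for token in raw_tokens if token not in stop_words]


# Made-up stored tokens and stop list, for illustration only
raw_tokens = ["the", "bank", "say", "rates", "rise"]
stop_words = {"the", "say"}

before = visible_tokens(raw_tokens, stop_words)

# Changing the stop list later needs no re-tokenisation: the raw tokens are intact
stop_words.discard("say")
after = visible_tokens(raw_tokens, stop_words)

print(before)
print(after)
```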

In [31]:
def second_tokenise(tokens):
    new_tokens = []
    for token in tokens:
        # We still do a permanent removal of all non-alphabetic tokens as well as all
        # tokens consisting of a single character.
        if re.search("^[^a-z]+$", token) or len(token) <= 1:
            continue
        new_tokens.append(token)

    return new_tokens

In [32]:
final_docs = []
for phrased_doc in phrased_docs:
    final_docs.append(
        phrased_doc._replace(tokens=second_tokenise(phrased_doc.tokens))
    )

In [33]:
# See the top remaining high-frequency words (for further cleaning if necessary)
corpus_tf = {}
for doc in final_docs:
    doc_counts = collections.Counter(doc.tokens)

    for token in set(doc.tokens):
        if token in corpus_tf:
            corpus_tf[token] += doc_counts[token]
        else:
            corpus_tf[token] = doc_counts[token]

In [34]:
# Simple deduplication, round 2
if remove_duplicates:
    seen_docs = set()
    post_dupe_count = 0
    deduped_docs = []
    for doc in final_docs:
        if len(doc.tokens) == 0:
            continue

        # Cast the document tokens as a tuple so that we can use it as a deduplicating hash
        doc_hash = tuple(doc.tokens)
        if doc_hash in seen_docs:
            # print(f"Duplicate document: {doc[0]['filename']}")
            dupe_count += 1
            post_dupe_count += 1
        else:
            seen_docs.add(doc_hash)
            deduped_docs.append(doc)

    final_docs = deduped_docs
    print(
        f"{post_dupe_count} duplicate documents discarded after tokenisation "
        f"({dupe_count} total)."
    )

7 duplicate documents discarded after tokenisation (105 total).


Save to Ignis Corpus
----

In [35]:
corpus = ignis.Corpus(stop_words=stop_set)
short_doc_count = 0

for doc in final_docs:
    if len(doc.tokens) < min_doc_tokens:
        short_doc_count += 1
        continue
    corpus.add_doc(**doc._asdict())

In [36]:
# And make sure it saves/loads without errors.
corpus.save(output_file)
corpus = ignis.load_corpus(output_file)

In [37]:
# Jupyter notebook setup for final document view
import ipywidgets as widgets
from IPython.display import display, HTML

# - Prevent vertical scrollbars in output subareas
jupyter_styles = """
<style>
   div.cell > div.output_wrapper > div.output.output_scroll {
     height: auto;
   }
   .jupyter-widgets-output-area .output_scroll {
        height: unset;
        border-radius: unset;
        -webkit-box-shadow: unset;
        box-shadow: unset;
    }
    .jupyter-widgets-output-area, div.output_stdout, div.output_result  {
        height: auto;
        max-height: 50em;
        overflow-y: auto;
    }
</style>
"""
display(HTML(jupyter_styles))


corpus_doc_ids = corpus.document_ids


def show_corpus_doc(index=0):
    doc = corpus.get_document(corpus_doc_ids[index])
    print(doc.metadata)
    print()
    print("-" * 10)
    print()
    print("|".join(doc.tokens))
    print()
    print("-" * 10)
    print()

    # Jupyter notebooks will interpret anything between $ signs as LaTeX formulae when rendering HTML output,
    # so we need to replace them with escaped $ signs (only in Jupyter environments)
    display_str = doc.display_str.replace("$", r"\$")
    display(HTML(display_str))

if len(corpus_doc_ids) > 0:
    widgets.interact(show_corpus_doc, index=(0, len(corpus_doc_ids) - 1))
else:
    print(
        "Corpus is empty -- No significant documents were found at the input path(s)."
    )

In [38]:
print("Summary:")
print(f"{len(corpus_doc_ids)} total documents.")
print()
print(f"{stop_word_file_count} stop word(s) from stop word file(s):")
pprint.pprint(stop_word_files)
print()
print(
    f"{len(tf_idf_removed_words)} stop word(s) automatically removed by TF-IDF "
    f"(threshold: {stop_list_proportion}):"
)
print(" ".join(tf_idf_removed_words))
print()
if remove_hapax:
    print(f"{len(hapax)} hapax legomena removed.")
    print()
print(f"{len(supplementary_stop_words)} supplementary stop word(s):")
print(" ".join(supplementary_stop_words))
print()
print(f"{len(whitelist)} whitelisted word(s):")
print(" ".join(whitelist))
print()
print("Top 20 remaining terms in corpus by frequency:")
# Remove stop words from top corpus terms
term_counts = [
    term_count for term_count in corpus_tf.items() if term_count[0] not in stop_set
]
pprint.pprint(sorted(term_counts, key=lambda x: x[1], reverse=True)[:20])
print()
print(f"{short_doc_count} short documents (< {min_doc_tokens} tokens) discarded.")
print()
if remove_duplicates:
    print(f"{dupe_count} duplicate documents discarded.")
    print()
print(f"Corpus saved to:\n{output_file}")

Summary:
2120 total documents.

179 stop word(s) from stop word file(s):
['data/nltk-stopwords-en.txt']

38 stop word(s) automatically removed by TF-IDF (threshold: 0.35):
the be a to of in and have for on say with it at by that but from this will which also not its make can up much year who he one take after their would out do

13238 hapax legomena removed.

2 supplementary stop word(s):
mr could

0 whitelisted word(s):


Top 20 remaining terms in corpus by frequency:
[('people', 1895),
 ('us', 1822),
 ('new', 1646),
 ('go', 1642),
 ('get', 1500),
 ('game', 1447),
 ('time', 1425),
 ('win', 1402),
 ('use', 1399),
 ('good', 1384),
 ('well', 1251),
 ('come', 1171),
 ('government', 1164),
 ('play', 1162),
 ('two', 1138),
 ('see', 1124),
 ('world', 1083),
 ('show', 1082),
 ('film', 1072),
 ('work', 1024)]

0 short documents (< 5 tokens) discarded.

105 duplicate documents discarded.

Corpus saved to:
data/bbc.corpus
