In [16]:
%load_ext autoreload

%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [17]:
from ngram_tools.download_ngrams import download_ngram_files
from ngram_tools.convert_to_jsonl import convert_to_jsonl_files
from ngram_tools.lowercase_ngrams import lowercase_ngrams
from ngram_tools.lemmatize_ngrams import lemmatize_ngrams
from ngram_tools.filter_ngrams import filter_ngrams
from ngram_tools.sort_ngrams import sort_ngrams
from ngram_tools.consolidate_ngrams import consolidate_duplicate_ngrams
from ngram_tools.make_yearly_files import make_yearly_files
from ngram_tools.helpers.verify_sort import check_file_sorted
from ngram_tools.helpers.print_jsonl_lines import print_jsonl_lines
from ngram_tools.helpers.save_stripped_ngrams import extract_all_ngrams

# **Process Multigrams for Training Word-Embedding Models**

## **Goal**: Download and preprocess multigrams for use in training `word2vec` models.

This workflow is resource-intensive and is probably only practical when run on a computing cluster. On my university's High Performance Computing (HPC) cluster, I request the maximum 14 cores (48 logical processors) and 128 GB of memory and use a 2 TB fast-I/O NVMe SSD filespace, and I still run up against time and resource limits. I've designed the code to be efficient, although further optimization is surely possible.

The code affords options to conserve resources. Throughout the workflow you can specify `compress=True`, which tells a script to compress its output files. In my experience, there is little downside to using LZ4 compression, since it's very fast and cuts file sizes by about half. Downstream modules will see the `.lz4` extensions and handle the files accordingly. If you know your workflow runs correctly and wish to further conserve space, you can specify `delete_input=True` for many of the scripts; this will delete the source files for a given step once it is complete. The scripts are fairly memory-efficient, with the exception of `sort_ngrams` and `index_and_create_vocab_files`, which sort multiple files in memory at once. When processing multigrams, I've found that allocating more than ~10 workers in these scripts leads to memory exhaustion (even with 128 GB!) and slow processing.
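
To illustrate how a downstream step might "see" the `.lz4` extension, here is a minimal sketch. It is not the `ngram_tools` implementation: the helper names `open_ngram_file` and `read_records` are hypothetical, and the compressed branch assumes the third-party `lz4` package is installed.

```python
import json
from pathlib import Path


def open_ngram_file(path, mode="rt"):
    """Open a plain or LZ4-compressed text file based on its extension.

    Hypothetical helper for illustration only.
    """
    if Path(path).suffix == ".lz4":
        import lz4.frame  # third-party `lz4` package; imported only when needed
        return lz4.frame.open(str(path), mode=mode, encoding="utf-8")
    return open(path, mode=mode, encoding="utf-8")


def read_records(path):
    """Yield one parsed JSON record per line of a JSONL file."""
    with open_ngram_file(path) as handle:
        for line in handle:
            yield json.loads(line)
```

Because the choice is made from the file name alone, callers never need to know whether an earlier step was run with `compress=True`.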

**NOTE:** You'll probably want to have run `workflow_unigrams.ipynb` before processing multigrams. That workflow allows you to create a vocabulary file for filtering out uncommon tokens from the multigrams. Although you can run the `filter_ngrams` module without a vocab file, most use cases will call for one.

### Download multigrams
Here, I'm using the `download_ngrams` module to fetch 5grams appended with part-of-speech (POS) tags (e.g., `_VERB`). Although you can specify `ngram_type='untagged'`, POS tags are necessary to lemmatize the tokens. Specify the number of parallel processes you wish to use by setting `workers` (the default is all available processors). You may wish to specify `compress=True` because 5gram files are _big_.
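
For orientation, the file URLs follow a regular pattern that can be read off the log output of the download cell in this notebook. The sketch below only reconstructs that pattern; `build_file_urls` is a hypothetical helper, and since the notebook does not show how the module discovers the number of files, `n_files` is passed in by hand.

```python
def build_file_urls(ngram_size, repo_release_id, repo_corpus_id, n_files):
    """Rebuild the list of ngram file URLs from the pattern seen in the log.

    Hypothetical helper; the file count is an input, not discovered.
    """
    base = (
        "https://storage.googleapis.com/books/ngrams/books/"
        f"{repo_release_id}/{repo_corpus_id}"
    )
    return [
        f"{base}/{ngram_size}-{i:05d}-of-{n_files:05d}.gz"
        for i in range(n_files)
    ]


urls = build_file_urls(5, "20200217", "eng-fiction", 1449)
```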

In [11]:
download_ngram_files(
    ngram_size=5,
    ngram_type='tagged',
    repo_release_id='20200217',
    repo_corpus_id='eng-fiction',
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction',
    compress=True,
    overwrite=True
)

Start Time:                2025-04-07 16:35:43.665449

Download Info
Ngram repository:          https://storage.googleapis.com/books/ngrams/books/20200217/eng-fiction/eng-fiction-5-ngrams_exports.html
Output directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/1download
File index range:          0 to 1448
File URLs available:       1449
File URLs to use:          1449
First file to get:         https://storage.googleapis.com/books/ngrams/books/20200217/eng-fiction/5-00000-of-01449.gz
Last file to get:          https://storage.googleapis.com/books/ngrams/books/20200217/eng-fiction/5-01448-of-01449.gz
Ngram size:                5
Ngram type:                tagged
Number of workers:         48
Compress saved files:      True
Overwrite existing files:  True



Downloading:   0%|          | 0/1449 [00:00<?, ?files/s]

End Time:                  2025-04-07 16:58:50.365110
Total runtime:             0:23:06.699661


### Convert files from TXT to JSONL
This module converts the original multigram files' text data to a more flexible JSON Lines (JSONL) format. Although this increases storage demands, it makes downstream processing more efficient.
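
As a rough illustration of the conversion, the sketch below turns one raw line into a JSON record. It assumes the raw layout is the ngram followed by tab-separated `year,match_count,volume_count` triples, which this notebook does not confirm; the output fields mirror the record printed later in this notebook (`ngram`, `freq_tot`, `doc_tot`, `freq`, `doc`). The function name `ngram_line_to_record` is hypothetical.

```python
import json


def ngram_line_to_record(line):
    """Convert one raw ngram line into a dict ready for JSONL output.

    Assumed input layout (an assumption, not taken from ngram_tools):
    "<ngram>\t<year>,<match_count>,<volume_count>\t..."
    """
    ngram, *year_fields = line.rstrip("\n").split("\t")
    freq, doc = {}, {}
    for field in year_fields:
        year, match_count, volume_count = field.split(",")
        freq[year] = int(match_count)
        doc[year] = int(volume_count)
    return {
        "ngram": ngram,
        "freq_tot": sum(freq.values()),
        "doc_tot": sum(doc.values()),
        "freq": freq,
        "doc": doc,
    }


raw = "the_DET people_NOUN of_ADP this_DET land_NOUN\t1990,3,2\t1991,5,4\n"
record = ngram_line_to_record(raw)
jsonl_line = json.dumps(record)  # one line of the JSONL output
```

Keeping the yearly counts in keyed dictionaries is what makes later steps (summing duplicates, splitting by year) simple lookups rather than string parsing.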

In [12]:
convert_to_jsonl_files(
    ngram_size=5,
    ngram_type='tagged',
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction',
    compress=True,
    overwrite=True,
    delete_input=True
)

Start Time:                2025-04-07 17:03:12.690524

Conversion Info
Input directory:           /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/1download
Output directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/2convert
File index range:          0 to 1448
Files available:           1449
Files to use:              1449
First file to get:         /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/1download/5-00000-of-01449.txt.lz4
Last file to get:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/1download/5-01448-of-01449.txt.lz4
Ngram size:                5
Ngram type:                tagged
Number of workers:         48
Compress output files:     True
Overwrite existing files:  True
Delete input directory:    True



Converting:   0%|          | 0/1449 [00:00<?, ?files/s]

End Time:                  2025-04-07 17:10:51.923774
Total runtime:             0:07:39.233250


### Make multigrams all lowercase
This module lowercases all characters in the multigrams. Most use cases benefit from this.

In [13]:
lowercase_ngrams(
    ngram_size=5,
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction',
    compress=True,
    overwrite=True,
    delete_input=True
)

Start Time:                2025-04-07 17:11:49.117078

Lowercasing Info
Input directory:           /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/2convert
Output directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/3lowercase
File index range:          0 to 596
Files available:           597
Files to use:              597
First file to get:         /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/2convert/5-00000-of-01449.jsonl.lz4
Last file to get:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/2convert/5-01448-of-01449.jsonl.lz4
Ngram size:                5
Number of workers:         48
Compress output files:     True
Overwrite existing files:  True
Delete input directory:    True



Lowercasing:   0%|          | 0/597 [00:00<?, ?files/s]

End Time:                  2025-04-07 17:15:57.430066
Total runtime:             0:04:08.312988


### Lemmatize the multigrams
Likewise, most use cases will benefit from multigrams that are lemmatized—that is, reduced to their base form. This requires POS-tagged multigrams. Example: `people_NOUN` ("the people of this land") will be converted to `person` in the output; `people_VERB` ("to people this land") will not. The POS tag will then be discarded as it is no longer useful.
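
The idea can be sketched with a toy lookup table. This is only an illustration: the lemmatizer that `lemmatize_ngrams` actually uses is not shown in this notebook, and `TOY_LEMMAS`, `lemmatize_tagged_token`, and `lemmatize_ngram` are hypothetical names.

```python
# Toy lemma table keyed by (lowercased token, POS tag); purely illustrative.
TOY_LEMMAS = {
    ("people", "NOUN"): "person",
    ("lands", "NOUN"): "land",
    ("peopled", "VERB"): "people",
}


def lemmatize_tagged_token(tagged_token):
    """Split 'token_TAG', look up a lemma, and return it without the tag."""
    token, sep, tag = tagged_token.rpartition("_")
    if not sep:  # no underscore, so there is no tag to strip
        return tagged_token
    # Fall back to the bare token when the table has no entry.
    return TOY_LEMMAS.get((token, tag), token)


def lemmatize_ngram(ngram):
    """Lemmatize every token of a space-separated, POS-tagged ngram."""
    return " ".join(lemmatize_tagged_token(t) for t in ngram.split())


noun_case = lemmatize_ngram("the_DET people_NOUN of_ADP this_DET land_NOUN")
verb_case = lemmatize_ngram("to_PRT people_VERB this_DET land_NOUN")
```

The lookup key includes the POS tag, which is why the noun and verb readings of "people" can map to different lemmas.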

In [14]:
lemmatize_ngrams(
    ngram_size=5,
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction',
    compress=True,
    overwrite=True,
    delete_input=True
)

Start Time:                2025-04-07 17:17:24.519488

Lemmatizing Info
Input directory:           /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/3lowercase
Output directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/4lemmatize
File index range:          0 to 596
Files available:           597
Files to use:              597
First file to get:         /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/3lowercase/5-00000-of-01449.jsonl.lz4
Last file to get:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/3lowercase/5-01448-of-01449.jsonl.lz4
Ngram size:                5
Number of workers:         48
Compress output files:     True
Overwrite existing files:  True
Delete input directory:    True



Lemmatizing:   0%|          | 0/597 [00:00<?, ?files/s]

End Time:                  2025-04-07 17:25:47.710195
Total runtime:             0:08:23.190707


### Filter the multigrams
This module removes tokens that provide little information about words' semantic context—specifically, those that contain numerals (`numerals=True`), nonalphabetic characters (`nonalpha=True`), stopwords (high-frequency, low information tokens like "the" and "into"; `stops=True`), or short words (those below a certain user-specified character count; here, `min_token_length=3`). You can also specify a **vocabulary file** like the one produced in the unigram workflow. A vocabulary file is simply a list of the _N_ most common words in the unigram corpus; the multigram tokens are checked against this list and those that don't appear in it are dropped.

The `replace_unk` option controls what happens to dropped tokens. If `replace_unk=False` (the default), these tokens are simply erased from the ngrams. If `replace_unk=True`, ineligible tokens are replaced with `UNK` (the "unknown" symbol). The filtering process will inevitably reduce the amount of useful information contained in some ngrams: If `replace_unk=False`, some longer ngrams (e.g., 5grams) will become shorter (e.g., 3grams) after unwanted tokens are dropped; if `replace_unk=True`, filtering will reduce the number of real tokens in certain ngrams by replacing one or more tokens with `UNK`.

The training of word-embedding models requires _linguistic context_, which in turn requires ngrams containing more than one token. (A unigram isn't useful for helping a model learn what "company" a word keeps.) The `min_tokens` option allows you to drop ngrams that fall below a specified length (or number of real tokens) during filtering. If filtering results in an ngram with fewer than the minimum number of real tokens, all information for that ngram is dropped entirely. I usually set `min_tokens=2`, since two tokens (and higher) provide at least some contextual information.
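
The interplay between token filtering, `replace_unk`, and `min_tokens` can be sketched as follows. This is a simplified illustration, not the module's code: `STOPWORDS` and `VOCAB` are tiny made-up sets, and `token_is_eligible` and `filter_ngram` are hypothetical helpers.

```python
STOPWORDS = {"the", "into", "of", "and"}          # tiny illustrative list
VOCAB = {"fertile", "breeding", "ground", "receive", "surprising"}  # made-up vocab


def token_is_eligible(token, min_token_length=3):
    """Return True if a token survives every filter in this sketch."""
    if not token.isalpha():          # rejects numerals and other non-alphabetic chars
        return False
    if token in STOPWORDS:
        return False
    if len(token) < min_token_length:
        return False
    return token in VOCAB


def filter_ngram(ngram, replace_unk=False, min_tokens=2):
    """Return the filtered ngram, or None if too few real tokens remain."""
    kept = [t if token_is_eligible(t) else None for t in ngram.split()]
    n_real = sum(t is not None for t in kept)
    if n_real < min_tokens:
        return None                  # the whole ngram is dropped
    if replace_unk:
        return " ".join(t if t is not None else "UNK" for t in kept)
    return " ".join(t for t in kept if t is not None)


with_unk = filter_ngram("a fertile breeding ground for", replace_unk=True)
without_unk = filter_ngram("a fertile breeding ground for", replace_unk=False)
dropped = filter_ngram("the 1990s into ground 42", replace_unk=True)
```

With `replace_unk=True` the ngram keeps its original length and token positions; with `replace_unk=False` it shrinks to the surviving tokens.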

In [15]:
filter_ngrams(
    ngram_size=5,
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction',
    numerals=True,
    nonalpha=True,
    stops=True,
    min_token_length=3,
    min_tokens=2,
    replace_unk=True,
    vocab_file='1gram-corpus-vocab_list_match.txt',
    compress=True,
    overwrite=True,
    delete_input=True
)

Start Time:                   2025-04-07 17:25:51.159025

Filtering Info
Input directory:              /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/4lemmatize
Output directory:             /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/5filter
File index range:             0 to 596
Files available:              597
Files to use:                 597
First file to get:            /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/4lemmatize/5-00000-of-01449.jsonl.lz4
Last file to get:             /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/4lemmatize/5-01448-of-01449.jsonl.lz4
Ngram size:                   5
Number of workers:            48
Compress output files:        True
Overwrite existing files:     True
Delete input directory:       True

Filtering Options
Drop stopwords:               True
Drop tokens under:            3 chars
Drop tokens with numerals:    True

Filtering:   0%|          | 0/597 [00:00<?, ?files/s]


Filtering Results (Dropped)
Stopword tokens:              0 
Short-word tokens:            0 
Tokens with numerals:         0 
Tokens with non-alpha chars:  0
Out-of-vocab tokens:          623721673
Entire ngrams:                5476482 

End Time:                  2025-04-07 17:30:32.498445
Total runtime:             0:04:41.339420


### Sort and combine the multigram files
This module creates a single, fully-sorted multigram file out of the filtered files. This is crucial for the next step (ngram consolidation; see below).

Sorting a giant file is a resource-hungry process, and I've tried to implement an efficient approach that leverages parallelism: We first sort the filtered files in parallel using Python's standard sorting algorithm, [Timsort](https://en.wikipedia.org/wiki/Timsort); then, we incrementally [heapsort](https://en.wikipedia.org/wiki/Heapsort) the files in parallel until only two or three files remain. Finally, we heapsort those last files (necessarily using one processor) to arrive at a single combined and sorted multigram file.

Because this step can take a _very_ long time for larger multigrams (e.g., 5grams), we can run it in sessions using the `start_iteration` and `end_iteration` options. Iteration 1 comes after the initial file sort. If you only have time to complete, say, iterations 1–3, you can set `end_iteration=3`. During a later session, you can specify `start_iteration=4` to pick up where you left off.
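
A compact, in-memory sketch of the sort-then-merge idea follows. It is not the `sort_ngrams` implementation, which works on files and in parallel: here each "file" is a list of records, and the grouping rule (pairs, with an odd file folded into the last group) is an assumption inferred from the iteration log in this notebook. The names `sort_chunk` and `merge_iteration` are hypothetical.

```python
import heapq


def sort_chunk(records):
    """Step 1: sort one file's records in memory (sorted() uses Timsort)."""
    return sorted(records, key=lambda r: r["ngram"])


def merge_iteration(sorted_files):
    """One iteration: merge neighboring sorted files with a heap-based merge."""
    groups = [sorted_files[i:i + 2] for i in range(0, len(sorted_files), 2)]
    # Fold a leftover single file into the previous group (assumed behavior).
    if len(groups) > 1 and len(groups[-1]) == 1:
        groups[-2].extend(groups.pop())
    return [
        list(heapq.merge(*group, key=lambda r: r["ngram"]))
        for group in groups
    ]


files = [
    [{"ngram": "b c"}, {"ngram": "a b"}],
    [{"ngram": "d e"}, {"ngram": "a a"}],
    [{"ngram": "c d"}],
]
sorted_files = [sort_chunk(f) for f in files]
while len(sorted_files) > 1:        # each pass corresponds to one iteration
    sorted_files = merge_iteration(sorted_files)
merged = sorted_files[0]
```

Because every iteration only needs the output of the previous one, the loop can be stopped and resumed between passes, which is what `start_iteration` and `end_iteration` make possible.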

In [18]:
sort_ngrams(
    ngram_size=5,
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction',
    workers=10,
    sort_key='ngram',
    start_iteration=4,
    end_iteration=6,
    compress=True,
    overwrite=True,
    sort_order='ascending',
    delete_input=True
)

Start Time:                2025-04-07 18:06:52.939003

Sort Info
Input directory:           /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/5filter
Sorted directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/temp
Temp directory:            /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/tmp
Merged file:               /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/5gram-merged.jsonl.lz4
Files available:           596
First file to get:         /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/5filter/5-00009-of-01449.jsonl.lz4
Last file to get:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/5filter/5-01448-of-01449.jsonl.lz4
Files to use:              596
Ngram size:                5
Number of workers:         10
Compress output files:     True
Overwrite existing files:  True
Sort key:                  

Sorting:   0%|          | 0/596 [00:00<?, ?files/s]


Iteration 4: merging 12 files into 6 chunks using 6 workers.
  6 chunk(s) with 2 file(s)

Iteration 5: merging 6 files into 3 chunks using 3 workers.
  3 chunk(s) with 2 file(s)

Iteration 6: merging 3 files into 1 chunks using 1 workers.
  1 chunk(s) with 3 file(s)
Merging complete. Final file: /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/5gram-merged.jsonl.lz4

End Time:                  2025-04-07 19:07:10.419140
Total runtime:             1:00:17.480137


### Verify sort [OPTIONAL]
If we want, we can verify that the output file is correctly sorted. Bear in mind that you need to specify the file path manually here; be sure to use the right file extension based on whether `sort_ngrams` was run with `compress=True`.
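
Conceptually, such a check only needs to compare each record with the one before it, so it can stream through the file without loading it into memory. A minimal sketch, assuming records are dicts with an `ngram` field (`is_sorted` is a hypothetical helper, not the `check_file_sorted` implementation):

```python
def is_sorted(records, field="ngram", sort_order="ascending"):
    """Return True if consecutive records are in the requested order."""
    previous = None
    for record in records:
        value = record[field]
        if previous is not None:
            if sort_order == "ascending":
                out_of_order = value < previous
            else:
                out_of_order = value > previous
            if out_of_order:
                return False        # stop at the first violation
        previous = value
    return True


good = [{"ngram": "a a"}, {"ngram": "a b"}, {"ngram": "a b"}, {"ngram": "b c"}]
bad = [{"ngram": "a b"}, {"ngram": "a a"}]
```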

In [20]:
check_file_sorted(
    input_file=(
        '/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/'
        '5gram_files/6corpus/5gram-merged.jsonl.lz4'
    ),
    field="ngram",
    sort_order="ascending"
)

Lines: 198074082line [21:59, 150096.54line/s]

The file is sorted.


### Consolidate duplicate multigrams
This module consolidates the sorted multigram file. Lowercasing, lemmatizing, and filtering produce duplicate multigrams. Now that the file is sorted, we can scan through it and consolidate consecutive identical ngrams. This involves summing their overall and yearly frequencies and document counts. It also leads to a much smaller file.
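
Because duplicates are adjacent in a sorted file, consolidation can be done in a single pass. The sketch below shows the idea on in-memory records using the field names seen elsewhere in this notebook; it is an illustration, not the module's chunked, parallel implementation, and `consolidate` is a hypothetical name.

```python
from collections import Counter
from itertools import groupby


def consolidate(sorted_records):
    """Merge runs of identical ngrams, summing yearly and total counts."""
    for ngram, group in groupby(sorted_records, key=lambda r: r["ngram"]):
        freq, doc = Counter(), Counter()
        for record in group:
            freq.update(record["freq"])   # adds counts year by year
            doc.update(record["doc"])
        yield {
            "ngram": ngram,
            "freq_tot": sum(freq.values()),
            "doc_tot": sum(doc.values()),
            "freq": dict(sorted(freq.items())),
            "doc": dict(sorted(doc.items())),
        }


records = [
    {"ngram": "UNK receive", "freq": {"1990": 1}, "doc": {"1990": 1}},
    {"ngram": "UNK receive", "freq": {"1990": 2, "1991": 1}, "doc": {"1990": 2, "1991": 1}},
    {"ngram": "fertile ground", "freq": {"2019": 8}, "doc": {"2019": 7}},
]
consolidated = list(consolidate(records))
```

Note that `groupby` only merges consecutive items, which is why the preceding sort step is a hard requirement.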

In [21]:
consolidate_duplicate_ngrams(
    ngram_size=5,
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction',
    lines_per_chunk=500000,
    compress=True,
    overwrite=True
)

Start Time:                2025-04-07 19:33:06.741519

Consolidation Info
Merged file:               /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/5gram-merged.jsonl.lz4
Corpus file:               /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/5gram-corpus.jsonl.lz4
Temporary directory:       /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/temp_chunks
Ngram size:                5
Number of workers:         48
Compress output files:     True
Overwrite existing files:  True

Created and Sorted: 395 chunks
Merged: 395 chunks

End Time:                  2025-04-07 20:00:59.676167
Total runtime:             0:27:52.934648


### View line [OPTIONAL]
If we want, we can inspect a line in the file.

In [22]:
print_jsonl_lines(
    file_path=(
        '/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/'
        '5gram_files/6corpus/5gram-corpus.jsonl.lz4'
    ),
    start_line=1650262,
    end_line=1650263,
    parse_json=True
)

Line 1650262: {'ngram': 'UNK receive UNK UNK surprising', 'freq_tot': 91, 'doc_tot': 90, 'freq': {'1809': 1, '1810': 1, '1839': 1, '1840': 2, '1841': 1, '1846': 1, '1855': 4, '1869': 5, '1870': 2, '1872': 1, '1882': 2, '1888': 2, '1889': 1, '1891': 1, '1895': 9, '1896': 1, '1903': 3, '1904': 8, '1907': 1, '1908': 1, '1909': 1, '1911': 1, '1920': 1, '1927': 2, '1938': 1, '1942': 1, '1948': 3, '1957': 1, '1960': 1, '1961': 1, '1963': 1, '1966': 2, '1972': 1, '1974': 1, '2001': 2, '2004': 1, '2009': 1, '2012': 1, '2013': 4, '2015': 4, '2016': 2, '2017': 2, '2018': 3, '2019': 5}, 'doc': {'1809': 1, '1810': 1, '1839': 1, '1840': 2, '1841': 1, '1846': 1, '1855': 4, '1869': 5, '1870': 2, '1872': 1, '1882': 2, '1888': 2, '1889': 1, '1891': 1, '1895': 9, '1896': 1, '1903': 3, '1904': 8, '1907': 1, '1908': 1, '1909': 1, '1911': 1, '1920': 1, '1927': 2, '1938': 1, '1942': 1, '1948': 3, '1957': 1, '1960': 1, '1961': 1, '1963': 1, '1966': 2, '1972': 1, '1974': 1, '2001': 2, '2004': 1, '2009': 1, '2

### Make yearly files
This module converts the overall corpus file into yearly corpora. For every year in which an ngram appeared, a `<year>.jsonl` file (or `<year>.jsonl.lz4` if `compress=True`) will be created. Each line in a yearly file contains an ngram, a `freq` value (the number of times it appeared that year), and a `doc` value (the number of unique documents it appeared in that year).
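
The reshaping from one corpus record to per-year records can be sketched like this. It works on in-memory dicts and ignores chunking, temporary files, and compression, so treat it as an illustration of the data layout only; `split_by_year` is a hypothetical helper.

```python
from collections import defaultdict


def split_by_year(records):
    """Map each year to the list of (ngram, freq, doc) records for that year."""
    yearly = defaultdict(list)
    for record in records:
        for year, count in record["freq"].items():
            yearly[year].append(
                {
                    "ngram": record["ngram"],
                    "freq": count,
                    "doc": record["doc"][year],
                }
            )
    return dict(yearly)


corpus = [
    {
        "ngram": "UNK fertile breeding ground UNK",
        "freq": {"2018": 3, "2019": 8},
        "doc": {"2018": 3, "2019": 7},
    },
    {
        "ngram": "UNK receive UNK UNK surprising",
        "freq": {"2019": 5},
        "doc": {"2019": 5},
    },
]
by_year = split_by_year(corpus)
```

Each key of `by_year` corresponds to one `<year>.jsonl` file, and each list element to one line in that file.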

I found it difficult to prevent memory exhaustion when processing 5grams with 128GB of RAM. Users may have to reduce the number of processors and/or the `chunk_size` to stay within their limits. Also note that the final clean-up step, in which many temporary files get deleted, can take several minutes to complete. 

After creating yearly corpora, we can proceed to train `word2vec` models as shown in the `workflow_train_models.ipynb` notebook.

In [23]:
make_yearly_files(
    ngram_size=5,
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction',
    overwrite=True,
    compress=True,
    workers=14,
    chunk_size=500000
)

Start Time:                2025-04-07 20:02:24.443765

Processing Info
Corpus file:               /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/5gram-corpus.jsonl.lz4
Yearly file directory:     /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data
Compress output files:     True
Number of workers:         14
Overwrite existing files:  True

Created and processed 112 chunks
Merged temp files for 390 years

End Time:                  2025-04-07 20:15:57.133325
Total runtime:             0:13:32.689560


In [18]:
print_jsonl_lines(
    file_path=(
        '/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/'
        '5gram_files/6corpus/yearly_files/data/2019.jsonl.lz4'
    ),
    start_line=1650262,
    end_line=1650263,
    parse_json=True
)

Line 1650262: {'ngram': 'UNK fertile breeding UNK ground', 'freq': 1, 'doc': 1}
Line 1650263: {'ngram': 'UNK fertile breeding ground UNK', 'freq': 8, 'doc': 7}


In [19]:
extract_all_ngrams(
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction',
    workers=48,
    overwrite=True
)

Extracting ngrams:   0%|          | 0/390 [00:00<?, ?it/s]

### Next Steps
Now that you've created yearly corpora of multigrams, it's time to train word embeddings using `word2vec`. See the `workflow_train_models.ipynb` notebook for a guide to training and optimizing yearly word embeddings.