In [9]:
%load_ext autoreload

%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [10]:
from ngram_tools.download_ngrams import download_ngram_files
from ngram_tools.convert_to_jsonl import convert_to_jsonl_files
from ngram_tools.lowercase_ngrams import lowercase_ngrams
from ngram_tools.lemmatize_ngrams import lemmatize_ngrams
from ngram_tools.filter_ngrams import filter_ngrams
from ngram_tools.sort_ngrams import sort_ngrams
from ngram_tools.consolidate_ngrams import consolidate_duplicate_ngrams
from ngram_tools.index_and_create_vocab import index_and_create_vocab_files
from ngram_tools.helpers.verify_sort import check_file_sorted
from ngram_tools.helpers.print_jsonl_lines import print_jsonl_lines

# **Process Unigrams to Generate a Vocabulary File**

## **Goal**: Make a list of the _n_ most common unigrams for later use in filtering multigrams

This workflow is resource-intensive and is probably only practical when run on a computing cluster. On my university's High Performance Computing (HPC) cluster, I request the maximum 14 cores (48 logical processors) and 128G of memory and use a 2T fast-I/O NVMe SSD filespace—and I still run up against time and resource limits. I've designed the code to be efficient, although further optimization is surely possible.

The code affords options to conserve resources. Throughout the workflow you can specify `compress=True`, which tells a script to compress its output files. In my experience, there is little downside to using LZ4 compression, since it's very fast and cuts file sizes by about half. Downstream modules will see the `.lz4` extensions and handle the files accordingly. If you know your workflow runs correctly and wish to further conserve space, you can specify `delete_input=True` for many of the scripts; this will delete the source files for a given step once it is complete. The scripts are fairly memory-efficient—with the exception of `sort_ngrams` and `index_and_create_vocab_files`, which sort multiple files in memory at once. When processing multigrams, I've found that allocating more than ~10 workers in these scripts leads to memory exhaustion (with 128G!) and slow processing.
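To illustrate how a downstream step might pick a reader or writer from the file extension, here is a minimal sketch. The helper name `open_ngram_text` is hypothetical and is not part of `ngram_tools`; it assumes the third-party `lz4` package is installed whenever a `.lz4` file is actually opened.

```python
from pathlib import Path


def open_ngram_text(path, mode="rt"):
    """Open a plain or LZ4-compressed text file, chosen by file extension.

    Hypothetical helper for illustration only; not the ngram_tools API.
    """
    if mode not in ("rt", "wt"):
        raise ValueError("mode must be 'rt' or 'wt'")
    path = Path(path)
    if path.suffix == ".lz4":
        # Imported lazily so that uncompressed runs don't need the lz4 package.
        import lz4.frame
        return lz4.frame.open(str(path), mode=mode, encoding="utf-8")
    return open(path, mode=mode, encoding="utf-8")
```

With a helper like this, the same loop can read `1-00000-of-00014.jsonl` or `1-00000-of-00014.jsonl.lz4` without the caller caring which one is on disk.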

### Download unigrams
Here, I'm using the `download_ngrams` module to fetch unigrams appended with part-of-speech (POS) tags (e.g., `_VERB`). Although you can specify `ngram_type='untagged'`, POS tags are necessary to lemmatize the tokens. Specify the number of parallel processes you wish to use by setting `workers` (the default is all available processors). Note the `repo_release_id` and `repo_corpus_id` parameters; these tell the module which ngram corpus and release to download.
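As a rough illustration of what `repo_release_id` and `repo_corpus_id` select, the sketch below rebuilds the file URLs that appear in the download log. The function `ngram_file_urls` is hypothetical, and passing the file count in by hand is an assumption; the real module may discover it from the repository's export page.

```python
def ngram_file_urls(release_id, corpus_id, ngram_size, n_files):
    """Build the list of .gz URLs for one corpus, release, and ngram size.

    Hypothetical helper; the URL pattern is copied from the download log.
    """
    base = (
        "https://storage.googleapis.com/books/ngrams/books/"
        f"{release_id}/{corpus_id}"
    )
    return [
        f"{base}/{ngram_size}-{i:05d}-of-{n_files:05d}.gz"
        for i in range(n_files)
    ]


urls = ngram_file_urls("20200217", "eng-us", ngram_size=1, n_files=14)
print(urls[0])   # first file to get
print(urls[-1])  # last file to get
```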

In [3]:
download_ngram_files(
    ngram_size=1,
    ngram_type='tagged',
    repo_release_id='20200217',
    repo_corpus_id='eng-us',
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-us',
    compress=False,
    overwrite=True
)

Start Time:                2025-04-05 20:25:57.254973

Download Info
Ngram repository:          https://storage.googleapis.com/books/ngrams/books/20200217/eng-us/eng-us-1-ngrams_exports.html
Output directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/1download
File index range:          0 to 13
File URLs available:       14
File URLs to use:          14
First file to get:         https://storage.googleapis.com/books/ngrams/books/20200217/eng-us/1-00000-of-00014.gz
Last file to get:          https://storage.googleapis.com/books/ngrams/books/20200217/eng-us/1-00013-of-00014.gz
Ngram size:                1
Ngram type:                tagged
Number of workers:         48
Compress saved files:      False
Overwrite existing files:  True



Downloading:   0%|          | 0/14 [00:00<?, ?files/s]

End Time:                  2025-04-05 20:26:21.430271
Total runtime:             0:00:24.175298


### Convert files from TXT to JSONL
This module converts the original unigram files' text data to a more flexible JSON Lines (JSONL) format. Although this increases storage demands, it makes downstream processing more efficient.
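The conversion can be pictured with a short sketch. It assumes the raw files use the tab-separated layout `ngram<TAB>year,match_count,volume_count<TAB>...` and that each JSONL record carries the `ngram`, `freq_tot`, `doc_tot`, and `freq` fields visible in the indexed output at the end of this notebook. The per-year `doc` field, the function name, and the sample counts are assumptions made for illustration.

```python
import json


def line_to_record(line):
    """Turn one tab-separated ngram line into a JSON-serializable record."""
    ngram, *year_fields = line.rstrip("\n").split("\t")
    freq, doc = {}, {}
    for field in year_fields:
        year, match_count, volume_count = field.split(",")
        freq[year] = int(match_count)
        doc[year] = int(volume_count)
    return {
        "ngram": ngram,
        "freq_tot": sum(freq.values()),
        "doc_tot": sum(doc.values()),
        "freq": freq,
        "doc": doc,  # assumed field name for yearly document counts
    }


# Made-up sample line, not real corpus data.
sample = "people_NOUN\t1900,10,4\t1901,12,5"
print(json.dumps(line_to_record(sample)))
```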

In [4]:
convert_to_jsonl_files(
    ngram_size=1,
    ngram_type='tagged',
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-us',
    compress=False,
    overwrite=True,
    delete_input=False
)

Start Time:                2025-04-05 20:26:32.159192

Conversion Info
Input directory:           /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/1download
Output directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/2convert
File index range:          0 to 13
Files available:           14
Files to use:              14
First file to get:         /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/1download/1-00000-of-00014.txt
Last file to get:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/1download/1-00013-of-00014.txt
Ngram size:                1
Ngram type:                tagged
Number of workers:         48
Compress output files:     False
Overwrite existing files:  True
Delete input directory:    False



Converting:   0%|          | 0/14 [00:00<?, ?files/s]

End Time:                  2025-04-05 20:28:14.470697
Total runtime:             0:01:42.311505


### Make unigrams all lowercase
This module lowercases all characters in the unigrams. Most use cases benefit from this.

In [5]:
lowercase_ngrams(
    ngram_size=1,
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-us',
    compress=False,
    overwrite=True,
    delete_input=False
)

Start Time:                2025-04-05 20:28:33.554586

Lowercasing Info
Input directory:           /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/2convert
Output directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/3lowercase
File index range:          0 to 13
Files available:           14
Files to use:              14
First file to get:         /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/2convert/1-00000-of-00014.jsonl
Last file to get:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/2convert/1-00013-of-00014.jsonl
Ngram size:                1
Number of workers:         48
Compress output files:     False
Overwrite existing files:  True
Delete input directory:    False



Lowercasing:   0%|          | 0/14 [00:00<?, ?files/s]

End Time:                  2025-04-05 20:29:22.536318
Total runtime:             0:00:48.981732


### Lemmatize the unigrams
This module lemmatizes the unigrams; that is, it reduces them to their base forms. This is desirable for most use cases. Example: `people_NOUN` ("the people of this land") will be converted to `person` in the output; `people_VERB` ("to people this land") will not. My code uses the [NLTK WordNetLemmatizer](https://www.nltk.org/api/nltk.stem.WordNetLemmatizer.html?highlight=wordnet), which requires POS-tagged unigrams. The tags are discarded after lemmatization, since they're no longer useful; this also saves space.
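A minimal sketch of the tag handling follows. The tag-to-WordNet mapping and both function names are assumptions for illustration, not the actual `lemmatize_ngrams` code. The lemmatizer is passed in as a callable so the example runs without the WordNet data; the toy stand-in used here only strips a trailing "s" from nouns.

```python
# Google Books POS tags, and an assumed mapping onto WordNet POS codes.
GOOGLE_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
               "ADP", "NUM", "CONJ", "PRT", ".", "X"}
TAG_TO_WORDNET = {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}


def split_tagged(token):
    """Split 'word_TAG' into (word, tag); return (token, None) if untagged."""
    word, sep, tag = token.rpartition("_")
    if sep and word and tag in GOOGLE_TAGS:
        return word, tag
    return token, None


def lemmatize_token(token, lemmatize):
    """Lemmatize one tagged token and drop its tag.

    `lemmatize` is any callable taking (word, pos), for example
    nltk.stem.WordNetLemmatizer().lemmatize once the WordNet data
    has been fetched with nltk.download("wordnet").
    """
    word, tag = split_tagged(token)
    pos = TAG_TO_WORDNET.get(tag)
    return lemmatize(word, pos) if pos else word


def toy_lemmatize(word, pos):
    """Toy stand-in for a real lemmatizer: strips a plural 's' from nouns."""
    return word[:-1] if pos == "n" and word.endswith("s") else word


print(lemmatize_token("cats_NOUN", toy_lemmatize))
print(lemmatize_token("the_DET", toy_lemmatize))
```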

In [6]:
lemmatize_ngrams(
    ngram_size=1,
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-us',
    compress=False,
    overwrite=True,
    delete_input=False
)

Start Time:                2025-04-05 20:29:37.159819

Lemmatizing Info
Input directory:           /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/3lowercase
Output directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/4lemmatize
File index range:          0 to 13
Files available:           14
Files to use:              14
First file to get:         /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/3lowercase/1-00000-of-00014.jsonl
Last file to get:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/3lowercase/1-00013-of-00014.jsonl
Ngram size:                1
Number of workers:         48
Compress output files:     False
Overwrite existing files:  True
Delete input directory:    False



Lemmatizing:   0%|          | 0/14 [00:00<?, ?files/s]

End Time:                  2025-04-05 20:30:30.555406
Total runtime:             0:00:53.395587


### Filter the unigrams
This module removes tokens that provide little information about words' semantic context: tokens that contain numerals (`numerals=True`) or nonalphabetic characters (`nonalpha=True`), stopwords (high-frequency, low-information tokens like "the" and "into"; `stops=True`), and short words (those below a user-specified character count; here, `min_token_length=3`).

You can also specify `min_tokens`, the minimum number of tokens an ngram must retain after filtering. This is mainly intended for use when processing multigrams. However, it's still good to specify `min_tokens=1` for unigrams, as it completely discards the data for any unigram violating our criteria. Empty output files will be deleted.
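The filtering rules can be sketched as below. This is an illustrative reimplementation under stated assumptions, not the `filter_ngrams` code: the stopword set is a tiny placeholder, the function name is hypothetical, and the `replace_unk` option is not modelled.

```python
# Tiny placeholder list; a real run would use a full stopword list.
STOPWORDS = {"the", "into", "and", "of"}


def filter_tokens(tokens, numerals=True, nonalpha=True, stops=True,
                  min_token_length=3, min_tokens=1):
    """Return the tokens that survive filtering, or None to drop the ngram."""
    kept = []
    for tok in tokens:
        if numerals and any(ch.isdigit() for ch in tok):
            continue  # contains a numeral
        if nonalpha and not tok.isalpha():
            continue  # contains a nonalphabetic character
        if stops and tok in STOPWORDS:
            continue  # stopword
        if len(tok) < min_token_length:
            continue  # too short
        kept.append(tok)
    # Discard the whole ngram if too few tokens remain.
    return kept if len(kept) >= min_tokens else None


print(filter_tokens(["house"]))
print(filter_tokens(["the"]))
print(filter_tokens(["the", "old", "house"], min_tokens=2))
```

For unigrams, `min_tokens=1` means a record is kept only if its single token passes every check.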

In [11]:
filter_ngrams(
    ngram_size=1,
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-us',
    numerals=True,
    nonalpha=True,
    stops=True,
    min_token_length=3,
    min_tokens=1,
    compress=False,
    overwrite=True,
    replace_unk=True,
    delete_input=False
)

Start Time:                   2025-04-05 20:50:24.917836

Filtering Info
Input directory:              /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/4lemmatize
Output directory:             /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/5filter
File index range:             0 to 13
Files available:              14
Files to use:                 14
First file to get:            /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/4lemmatize/1-00000-of-00014.jsonl
Last file to get:             /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/4lemmatize/1-00013-of-00014.jsonl
Ngram size:                   1
Number of workers:            48
Compress output files:        False
Overwrite existing files:     True
Delete input directory:       False

Filtering Options
Drop stopwords:               True
Drop tokens under:            3 chars
Drop tokens with numerals:    True
Drop non-alphabetic:        

Filtering:   0%|          | 0/14 [00:00<?, ?files/s]


Filtering Results (Dropped)
Stopword tokens:              4993
Short-word tokens:            20751
Tokens with numerals:         9437345
Tokens with non-alpha chars:  1388761
Out-of-vocab tokens:          0
Entire ngrams:                10851850

End Time:                  2025-04-05 20:51:08.081778
Total runtime:             0:00:43.163942


### Sort and combine the unigram files
This module creates a single, fully sorted unigram file out of the filtered files. This is crucial for the next step (ngram consolidation; see below).

Sorting a giant file is a resource-hungry process, and I've tried to implement an efficient approach that leverages parallelism: we first sort the filtered files in parallel using Python's standard sorting algorithm, [Timsort](https://en.wikipedia.org/wiki/Timsort); then, we incrementally [heapsort](https://en.wikipedia.org/wiki/Heapsort) the files in parallel until we get down to 2 files. Finally, we heapsort the final 2 files (necessarily using one processor) to arrive at a single combined and sorted unigram file.
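The sort-then-merge idea can be sketched in a few lines. This version uses `sorted` for each chunk and the standard library's `heapq.merge`, a heap-based k-way merge, to combine them; whether `sort_ngrams` uses these exact calls is an assumption, and the sketch leaves out the parallelism and the on-disk chunk files.

```python
import heapq
import json


def sort_chunk(lines, key="ngram"):
    """Sort one file's JSONL lines in memory (Python's built-in sort)."""
    return sorted(lines, key=lambda ln: json.loads(ln)[key])


def merge_sorted(chunks, key="ngram"):
    """Lazily merge already-sorted chunks into one sorted stream."""
    return heapq.merge(*chunks, key=lambda ln: json.loads(ln)[key])


# Two made-up "files", each an unsorted list of JSONL lines.
file_a = ['{"ngram": "pear"}', '{"ngram": "apple"}']
file_b = ['{"ngram": "fig"}', '{"ngram": "banana"}']

chunks = [sort_chunk(file_a), sort_chunk(file_b)]
merged = [json.loads(ln)["ngram"] for ln in merge_sorted(chunks)]
print(merged)
```

Because `heapq.merge` is lazy, only one line per input stream needs to be held in memory during the merge, which is why the per-file sort is the memory-hungry part.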

In [13]:
sort_ngrams(
    ngram_size=1,
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-us',
    workers=12,
    sort_key='ngram',
    compress=False,
    overwrite=True,
    sort_order='ascending',
    delete_input=False
)

Start Time:                2025-04-05 20:54:51.374121

Sort Info
Input directory:           /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/5filter
Sorted directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/temp
Temp directory:            /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/tmp
Merged file:               /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/6corpus/1gram-merged.jsonl
Files available:           10
First file to get:         /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/5filter/1-00004-of-00014.jsonl
Last file to get:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/5filter/1-00013-of-00014.jsonl
Files to use:              10
Ngram size:                1
Number of workers:         12
Compress output files:     False
Overwrite existing files:  True
Sort key:                  ngram
Sort order:                ascending


Sorting:   0%|          | 0/10 [00:00<?, ?files/s]


Iteration 1: merging 10 files into 5 chunks using 5 workers.
  5 chunk(s) with 2 file(s)

Iteration 2: merging 5 files into 2 chunks using 2 workers.
  1 chunk(s) with 2 file(s)
  1 chunk(s) with 3 file(s)

Iteration 3: final merge of 2 files.


Merging:   0%|          | 0/15441065 [00:00<?, ?lines/s]


Merging complete. Final merged file:
/vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/6corpus/1gram-merged.jsonl

End Time:                  2025-04-05 21:02:39.314724
Total runtime:             0:07:47.940603


### Verify sort [OPTIONAL]
If we want, we can verify that the output file is correctly sorted. If the script reports `The file is sorted.`, the check passed. Bear in mind that you need to specify the file path manually here; be sure to use the right file extension based on whether `sort_ngrams` was run with `compress=True`.
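A sortedness check only needs to compare each record with the one before it, so it can stream through the file in constant memory. The sketch below shows the idea on an in-memory list of lines; the function name `is_sorted` is hypothetical and this is not the `check_file_sorted` implementation.

```python
import json


def is_sorted(lines, field="ngram", sort_order="ascending"):
    """Return True if the JSONL lines are ordered by `field`."""
    prev = None
    for line in lines:
        value = json.loads(line)[field]
        if prev is not None:
            if sort_order == "ascending" and value < prev:
                return False
            if sort_order == "descending" and value > prev:
                return False
        prev = value
    return True


print(is_sorted(['{"ngram": "apple"}', '{"ngram": "fig"}']))
print(is_sorted(['{"ngram": "fig"}', '{"ngram": "apple"}']))
```

Any iterable of lines works, so an open file handle can be passed in directly.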

In [17]:
check_file_sorted(
    input_file=(
        '/vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/'
        '1gram_files/6corpus/1gram-merged.jsonl'
    ),
    field="ngram",
    sort_order="ascending"
)

Lines: 15441065line [02:35, 99058.29line/s] 

The file is sorted.


### Consolidate duplicate unigrams
This module consolidates the sorted unigram file. Lowercasing and lemmatizing produce duplicate unigrams. Now that the file is sorted, we can scan through it and consolidate consecutive duplicates. This involves summing their overall and yearly frequencies and document counts. It also leads to a much smaller file.

`Runtime with compression:  0:07:33.662163`
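Because duplicates are adjacent in a sorted file, consolidation can be done in a single pass. The sketch below groups consecutive records with `itertools.groupby` and sums their counts; it is an illustration under the assumption that records carry the `ngram`, `freq_tot`, `doc_tot`, and `freq` fields, not the `consolidate_duplicate_ngrams` code. Yearly document counts would be summed the same way as `freq`.

```python
from collections import Counter
from itertools import groupby


def consolidate(records):
    """Merge consecutive records that share the same ngram."""
    for ngram, group in groupby(records, key=lambda r: r["ngram"]):
        freq = Counter()
        freq_tot = doc_tot = 0
        for rec in group:
            freq.update(rec["freq"])  # adds the yearly counts
            freq_tot += rec["freq_tot"]
            doc_tot += rec["doc_tot"]
        yield {
            "ngram": ngram,
            "freq_tot": freq_tot,
            "doc_tot": doc_tot,
            "freq": dict(sorted(freq.items())),
        }


# Made-up records: "cat" appears twice (e.g., from "Cat" and "cats").
records = [
    {"ngram": "cat", "freq_tot": 3, "doc_tot": 2, "freq": {"1900": 3}},
    {"ngram": "cat", "freq_tot": 5, "doc_tot": 4, "freq": {"1900": 1, "1901": 4}},
    {"ngram": "dog", "freq_tot": 7, "doc_tot": 1, "freq": {"1901": 7}},
]
for rec in consolidate(records):
    print(rec)
```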

In [18]:
consolidate_duplicate_ngrams(
    ngram_size=1,
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-us',
    lines_per_chunk=500000,
    compress=False,
    overwrite=True
)

Start Time:                2025-04-05 21:19:40.172839

Consolidation Info
Merged file:               /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/6corpus/1gram-merged.jsonl
Corpus file:               /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/6corpus/1gram-corpus.jsonl
Temporary directory:       /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/temp_chunks
Ngram size:                1
Number of workers:         48
Compress output files:     False
Overwrite existing files:  True

Created and Sorted: 31 chunks
Merged: 31 chunks

End Time:                  2025-04-05 21:23:20.433356
Total runtime:             0:03:40.260517


### Index unigrams and create vocabulary file
Most use cases will require an indexed list of "valid" (i.e., reasonably common) vocabulary words. This indexing script serves the dual functions of (1) mapping each unigram to an index number (saved in `6corpus/1gram-corpus-indexed.jsonl`) and (2) culling this file into a vocabulary list consisting of the _n_ most frequent unigrams (saved in `6corpus/1gram-corpus-vocab_list_match.txt`). The vocabulary file provides a critical means of filtering excessively rare words out of the corpus. Unlike files upstream in the workflow, the vocabulary files are not large and don't need to be compressed.
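The two outputs can be sketched as follows. This is a simplified, in-memory illustration: the field name `idx`, the 1-based numbering, and the function name are assumptions, and the actual script works in chunks on disk and may assign indices in a different order.

```python
import heapq


def index_and_vocab(records, vocab_n):
    """Attach an index to every record and list the top-N ngrams by frequency."""
    # Assumed scheme: number records in file order, starting at 1.
    indexed = [dict(rec, idx=i) for i, rec in enumerate(records, start=1)]
    top = heapq.nlargest(vocab_n, indexed, key=lambda r: r["freq_tot"])
    vocab = [r["ngram"] for r in top]
    return indexed, vocab


# Made-up records for illustration.
records = [
    {"ngram": "apple", "freq_tot": 5},
    {"ngram": "banana", "freq_tot": 9},
    {"ngram": "cherry", "freq_tot": 1},
]
indexed, vocab = index_and_vocab(records, vocab_n=2)
print(vocab)
```

A vocabulary built this way is what a later filtering step would consult to decide whether a token is common enough to keep.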

In [19]:
index_and_create_vocab_files(
    ngram_size=1,
    proj_dir='/vast/edk202/NLP_corpora/Google_Books/20200217/eng-us',
    compress=False,
    overwrite=True,
    vocab_n=80000
)

Start Time:                2025-04-05 21:31:34.037914

Indexing Info
Corpus file:               /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/6corpus/1gram-corpus.jsonl
Indexed file:              /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/6corpus/1gram-indexed.jsonl
Ngram size:                1
Number of workers:         48
Compress output files:     False
Overwrite existing files:  True

Vocabulary Info
Vocab size (top N):        80000
Match File:                /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/6corpus/1gram-corpus-vocab_list_match.txt
Lookup File:               /vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/6corpus/1gram-corpus-vocab_list_lookup.jsonl



Chunking:   0%|          | 0/8221439 [00:00<?, ?lines/s]

Sorting:   0%|          | 0/83 [00:00<?, ?chunks/s]

Merging:   0%|          | 0/8221439 [00:00<?, ?lines/s]

Indexing:   0%|          | 0/8221439 [00:00<?, ?lines/s]


Indexed 8221439 lines.
Final indexed file: 1gram-corpus-indexed.jsonl
Created vocab_list_match and vocab_list_lookup files for top 80000 ngrams.

End Time:                  2025-04-05 21:37:42.289504
Total runtime:             0:06:08.251590


### Verify indexing [OPTIONAL]
We can verify that the final indexed file looks right.
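Spot-checking a slice of a large JSONL file does not require loading it; `itertools.islice` can skip ahead line by line. The sketch below is a hypothetical stand-in for this kind of helper, assuming 1-based, inclusive line numbers and an uncompressed file.

```python
import json
from itertools import islice


def read_jsonl_lines(path, start_line, end_line):
    """Yield (line_number, parsed_record) for lines start_line..end_line."""
    with open(path, encoding="utf-8") as f:
        selected = islice(f, start_line - 1, end_line)
        for n, line in enumerate(selected, start=start_line):
            yield n, json.loads(line)
```

Looping over `read_jsonl_lines(path, 70000, 70002)` and printing each pair would show three records without reading the rest of the file into memory.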

In [23]:
print_jsonl_lines(
    file_path=(
        '/vast/edk202/NLP_corpora/Google_Books/20200217/eng-us/'
        '1gram_files/6corpus/1gram-corpus-indexed.jsonl'
    ),
    start_line=70000,
    end_line=70002,
    parse_json=True
)

Line 70000: {'ngram': 'enue', 'freq_tot': 140480, 'doc_tot': 92524, 'freq': {'1501': 2, '1681': 2, '1701': 2, '1717': 1, '1751': 4, '1753': 1, '1759': 2, '1776': 1, '1778': 2, '1789': 13, '1795': 1, '1796': 1, '1798': 3, '1800': 3, '1802': 2, '1804': 3, '1805': 3, '1807': 13, '1808': 53, '1809': 15, '1810': 6, '1811': 7, '1812': 11, '1813': 15, '1814': 5, '1815': 7, '1816': 8, '1817': 9, '1818': 8, '1819': 18, '1820': 3, '1821': 5, '1822': 12, '1823': 8, '1824': 15, '1825': 66, '1826': 12, '1827': 9, '1828': 39, '1829': 42, '1830': 23, '1831': 37, '1832': 245, '1833': 56, '1834': 119, '1835': 109, '1836': 167, '1837': 225, '1838': 49, '1839': 51, '1840': 41, '1841': 97, '1842': 100, '1843': 134, '1844': 635, '1845': 113, '1846': 135, '1847': 105, '1848': 98, '1849': 113, '1850': 122, '1851': 200, '1852': 154, '1853': 217, '1854': 231, '1855': 166, '1856': 154, '1857': 160, '1858': 181, '1859': 225, '1860': 189, '1861': 206, '1862': 113, '1863': 149, '1864': 125, '1865': 118, '1866': 32

## Next Steps
If you've gotten this far, you're ready to start pre-processing multigrams using the `workflow_multigrams.ipynb` notebook!