# **Process Ngram Files**

## Process multigram files
Download multigrams (_n_ = 2–5). You can specify a **vocabulary file** for filtering. Vocabulary filtering discards ngrams containing tokens absent from the vocabulary file.
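A minimal sketch of the vocabulary filter described above. It assumes the vocabulary file holds one token per line; that layout and the helper names `load_vocab` and `keep_ngram` are illustrative assumptions, not the pipeline's actual code.

```python
import io

def load_vocab(lines):
    """Build a set of allowed tokens from an iterable of lines (one token per line assumed)."""
    return {line.strip() for line in lines if line.strip()}

def keep_ngram(ngram, vocab):
    """Return True only if every space-separated token of the ngram is in the vocabulary."""
    return all(token in vocab for token in ngram.split())

# Toy vocabulary standing in for a real vocabulary file
vocab = load_vocab(io.StringIO("water\nriver\nflow\n"))
ngrams = ["river water flow", "river water zzyzx"]
kept = [ng for ng in ngrams if keep_ngram(ng, vocab)]
print(kept)
```

Holding the vocabulary in a `set` keeps each membership test constant-time, which matters when hundreds of millions of ngrams are checked.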

#### Set the appropriate base directory

In [None]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

#### Train a basic model

In [None]:
!python train_word2vec.py \
    --corpus_file "{base_dir}/5gram_files/year_files/text/2019.txt" \
    --model_file "{base_dir}/5gram_files/word2vec_model.model" \
    --vector_file "{base_dir}/5gram_files/word_vectors.txt" \
    --vector_size 300 \
    --window 5 \
    --sg 1 \
    --negative 10 \
    --min_count 5 \
    --sample 1e-5 \
    --workers 48 \
    --epochs 10 \
    --alpha 0.025 \
    --min_alpha 0.0001

In [None]:
from gensim.models import Word2Vec

# Load the model
model = Word2Vec.load(f'{base_dir}/5gram_files/word2vec_model.model')

similar_words = model.wv.most_similar('water_ADJ', topn=10)
print(similar_words)

## Generate Vocabulary File

Make a list of the _n_ most common unigrams (1grams). This file is used to filter multi-token ngrams.

### Set base directory
The scripts need to know where your project is stored, and will add subdirectories there as they go.

In [2]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

### Download unigrams

Download unigrams (`--ngram_size 1`) with part-of-speech (POS) tags appended (e.g., `_VERB`). Although you can specify `--ngram_type untagged`, POS tags are necessary to lemmatize the tokens. If storage space is limited, specify `--compress` here and throughout the workflow; this tells the code to use LZ4 compression when storing output files. Downstream scripts will see the `.lz4` extensions and handle the files accordingly. Set `--workers` to the number of parallel processes you have available or wish to use.

In [3]:
!python 1_download.py \
    --ngram_size 1 \
    --ngram_type tagged \
    --proj_dir {base_dir} \
    --workers 48 \
    --overwrite \
    --compress

Start Time:                2025-01-01 18:16:37.088160

Download Info
Ngram repository:          https://storage.googleapis.com/books/ngrams/books/20200217/eng/eng-1-ngrams_exports.html
Output directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1download
File index range:          0 to 23
File URLs available:       24
File URLs to use:          24
First file to get:         https://storage.googleapis.com/books/ngrams/books/20200217/eng/1-00000-of-00024.gz
Last file to get:          https://storage.googleapis.com/books/ngrams/books/20200217/eng/1-00023-of-00024.gz
Ngram size:                1
Ngram type:                tagged
Number of workers:         48
Compress saved files:      True
Overwrite existing files:  True

Downloading     |██████████████████████████████████████████████████| 100.0% 24          /24

End Time:                  2025-01-01 18:17:16.648932
Total runtime:             0:00:39.560772


### Convert unigrams to JSONL files

Convert the original unigram files' text data to a more flexible JSON Lines (JSONL) format. Although this increases storage demands, it makes downstream processing more efficient.
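As a rough illustration of the conversion, the sketch below turns one raw line into a JSON record with the same keys as the records printed later in this notebook (`ngram`, `freq_tot`, `doc_tot`, `freq`, `doc`). It assumes the raw format is `ngram<TAB>year,match_count,volume_count<TAB>...`; the function name `convert_line` is hypothetical, and the real `2_convert.py` may work differently.

```python
import json

def convert_line(raw_line):
    """Convert one tab-separated raw ngram line into a JSON string."""
    ngram, *year_fields = raw_line.rstrip("\n").split("\t")
    freq, doc = {}, {}
    for field in year_fields:
        # Each field is assumed to be "year,match_count,volume_count"
        year, match_count, volume_count = field.split(",")
        freq[year] = int(match_count)
        doc[year] = int(volume_count)
    record = {
        "ngram": ngram,
        "freq_tot": sum(freq.values()),
        "doc_tot": sum(doc.values()),
        "freq": freq,
        "doc": doc,
    }
    return json.dumps(record)

# Made-up sample line, for illustration only
sample = "bandage_NOUN\t1804,1,1\t1817,3,2\n"
print(convert_line(sample))
```

Keying the yearly counts by year means later steps can sum or look up a year directly instead of re-parsing comma-separated fields.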

In [4]:
!python 2_convert.py \
    --ngram_size 1 \
    --ngram_type tagged \
    --proj_dir {base_dir} \
    --workers 48 \
    --overwrite \
    --compress

Start Time:                2025-01-01 18:18:13.673572

Lowercasing Info
Input directory:           /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1download
Output directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/2convert
File index range:          0 to 23
Files available:           24
Files to use:              24
First file to get:         /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1download/1-00000-of-00024.txt.lz4
Last file to get:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1download/1-00023-of-00024.txt.lz4
Ngram size:                1
Ngram type:                tagged
Number of workers:         48
Compress output files:     True
Overwrite existing files:  True
Delete input directory:    False

Conversion Progress
Converting      |██████████████████████████████████████████████████| 100.0% 24          /24

End Time:                  2

### Make unigrams all lowercase
Most use cases will benefit from unigrams that are all lowercase.

In [5]:
!python 3_lowercase.py \
    --ngram_size 1 \
    --proj_dir {base_dir} \
    --workers 48 \
    --overwrite \
    --compress

Start Time:                2025-01-01 18:20:46.699502

Lowercasing Info
Input directory:           /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/2convert
Output directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/3lowercase
File index range:          0 to 23
Files available:           24
Files to use:              24
First file to get:         /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/2convert/1-00000-of-00024.jsonl.lz4
Last file to get:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/2convert/1-00023-of-00024.jsonl.lz4
Ngram size:                1
Number of workers:         48
Compress output files:     True
Overwrite existing files:  True
Delete input directory:    False

Lowercasing Progress
Lowercasing     |██████████████████████████████████████████████████| 100.0% 24          /24

End Time:                  2025-01-01 18:21:51.655646


### Lemmatize the unigrams
Likewise, most use cases will benefit from unigrams that are lemmatized—that is, reduced to their base form. This requires POS-tagged unigrams. Example: `aggregating_VERB` will be converted to `aggregate` in the output. The POS tag will then be discarded as it is no longer useful.
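The tag handling can be sketched as follows. This is only an illustration: the tag set, the tag-to-code mapping, and the toy lemmatizer are assumptions, and the lemmatizer actually used by `4_lemmatize.py` is not shown in this notebook.

```python
# POS tags assumed to appear in the tagged corpus
KNOWN_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "NUM", "CONJ", "PRT", "X", "."}
# Single-letter codes of the kind many lemmatizers accept (an assumption for illustration)
POS_CODES = {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}

def split_tagged(unigram):
    """Split 'token_TAG' into (token, tag); tag is None when no known tag is present."""
    token, sep, tag = unigram.rpartition("_")
    if sep and tag in KNOWN_TAGS:
        return token, tag
    return unigram, None

def lemmatize_tagged(unigram, lemmatize):
    """Strip the POS tag and return the lemma from the supplied lemmatize(token, pos) callable."""
    token, tag = split_tagged(unigram)
    pos = POS_CODES.get(tag, "n")  # fall back to noun when the tag has no code
    return lemmatize(token, pos)

# Toy stand-in for a real lemmatizer
TOY_LEMMAS = {("aggregating", "v"): "aggregate"}

def toy_lemmatize(token, pos):
    return TOY_LEMMAS.get((token, pos), token)

print(lemmatize_tagged("aggregating_VERB", toy_lemmatize))  # aggregate
```

Splitting on the last underscore keeps tokens that themselves contain underscores intact.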

In [6]:
!python 4_lemmatize.py \
    --ngram_size 1 \
    --proj_dir {base_dir} \
    --workers 48 \
    --overwrite \
    --compress

Start Time:                2025-01-01 18:22:15.453722

Lemmatizing Info
Input directory:           /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/3lowercase
Output directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/4lemmatize
File index range:          0 to 23
Files available:           24
Files to use:              24
First file to get:         /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/3lowercase/1-00000-of-00024.jsonl.lz4
Last file to get:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/3lowercase/1-00023-of-00024.jsonl.lz4
Ngram size:                1
Number of workers:         48
Compress output files:     True
Overwrite existing files:  True
Delete input directory:    False

Lemmatization Progress
Lemmatizing     |██████████████████████████████████████████████████| 100.0% 24          /24

End Time:                  2025-01-01 18:23:36.749

### Filter the unigrams
Remove tokens that contain numerals (`--numerals`) or nonalphabetic characters (`--nonalpha`), that are stopwords (`--stopwords`), or that are shorter than three characters (`--min_token_length 3`). Since we're processing unigrams, `--min_tokens 1` means that any unigram whose single token is removed will be discarded entirely. Any empty output files will be deleted.
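The filtering rules above might look like this in outline. The stopword list is a tiny placeholder, the function names are hypothetical, and the assumption that surviving tokens are rejoined (relevant for multigrams) is inferred from the `--min_tokens` option rather than confirmed by the script.

```python
import re

# Tiny illustrative stopword list; the real pipeline's list is not shown here
STOPWORDS = {"the", "and", "of"}

def filter_tokens(tokens, numerals=True, nonalpha=True, stopwords=True, min_token_length=3):
    """Return the tokens that survive the enabled filters."""
    kept = []
    for tok in tokens:
        if numerals and re.search(r"\d", tok):
            continue  # contains a digit
        if nonalpha and not tok.isalpha():
            continue  # contains a non-alphabetic character
        if stopwords and tok in STOPWORDS:
            continue  # is a stopword
        if len(tok) < min_token_length:
            continue  # too short
        kept.append(tok)
    return kept

def filter_ngram(ngram, min_tokens=1, **options):
    """Return the filtered ngram, or None if fewer than min_tokens tokens remain."""
    kept = filter_tokens(ngram.split(), **options)
    return " ".join(kept) if len(kept) >= min_tokens else None

for unigram in ["water", "the", "h2o", "ox", "can't"]:
    print(unigram, "->", filter_ngram(unigram, min_tokens=1))
```

With `min_tokens=1`, a unigram either passes unchanged or disappears; with a larger value, a multigram survives only if enough of its tokens do.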

In [7]:
!python 5_filter.py \
    --ngram_size 1 \
    --proj_dir {base_dir} \
    --workers 48 \
    --numerals \
    --nonalpha \
    --stopwords \
    --min_token_length 3 \
    --min_tokens 1 \
    --overwrite \
    --compress

Start Time:                   2025-01-01 18:24:38.960769

Filtering Info
Input directory:              /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/4lemmatize
Output directory:             /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/5filter
File index range:             0 to 23
Files available:              24
Files to use:                 24
First file to get:            /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/4lemmatize/1-00000-of-00024.jsonl.lz4
Last file to get:             /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/4lemmatize/1-00023-of-00024.jsonl.lz4
Ngram size:                   1
Number of workers:            48
Compress output files:        True
Overwrite existing files:     True
Delete input directory:       False

Filtering Options
Drop stopwords:               True
Drop tokens under:            3 chars
Drop tokens with numerals:    True
Drop non-alphabetic:          Tru

### Sort and combine the unigram files

Create a single, fully sorted unigram file out of the filtered files. The command below sorts by `ngram` in ascending order (`--sort_key ngram --sort_order ascending`), which places identical unigrams on consecutive lines so they can be consolidated in the next step. Ranking by frequency (`freq_tot`) for the vocabulary file is left to the indexing step.

Sorting is expensive in computation, memory, and storage, so the script is designed to be as economical as possible. We start by sorting each individual filtered file using Python's standard sorting algorithm (Timsort). Then, we incrementally merge the sorted files in parallel, using a heap-merge algorithm, until only 2 files remain. Finally, we heap-merge those 2 files (necessarily using one process) to arrive at a single combined and sorted unigram file.
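The sort-then-merge idea can be sketched in memory with the standard library. This is a simplified illustration, not the script itself: the real workflow operates on files and merges them in parallel rounds, whereas this sketch sorts small lists and merges them in one pass. The function names are hypothetical.

```python
import heapq

def sort_chunk(records, key="ngram"):
    """Sort one chunk of records; Python's sorted() uses Timsort."""
    return sorted(records, key=lambda r: r[key])

def merge_sorted(chunks, key="ngram"):
    """Lazily heap-merge already-sorted chunks into one sorted stream."""
    return heapq.merge(*chunks, key=lambda r: r[key])

chunks = [
    [{"ngram": "pear"}, {"ngram": "apple"}],
    [{"ngram": "fig"}, {"ngram": "banana"}],
    [{"ngram": "cherry"}],
]
sorted_chunks = [sort_chunk(c) for c in chunks]
merged = list(merge_sorted(sorted_chunks))
print([r["ngram"] for r in merged])
```

`heapq.merge` only holds one pending record per input at a time, which is why a heap merge suits inputs that are too large to load together.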

In [17]:
!python 6_sort2.py \
    --ngram_size 1 \
    --proj_dir {base_dir} \
    --workers 10 \
    --sort_key ngram \
    --compress \
    --overwrite \
    --sort_order ascending

Start Time:                2025-01-01 19:10:57.817487

Sort Info
Input directory:           /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/5filter
Sorted directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/temp
Temp directory:            /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/tmp
Merged file:               /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/6corpus/1gram-merged.jsonl.lz4
Files available:           18
Files to use:              18
First file to get:         /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/5filter/1-00006-of-00024.jsonl.lz4
Last file to get:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/5filter/1-00023-of-00024.jsonl.lz4
Ngram size:                1
Number of workers:         10
Compress output files:     True
Overwrite existing files:  True
Sort key:                  ngram
Sort order:                ascending
Heap-me

### Verify sort [OPTIONAL]
If we want, we can verify that the output file is correctly sorted.

In [18]:
!python verify_sort.py \
    --input_file "{base_dir}/1gram_files/6corpus/1gram-merged.jsonl.lz4" \
    --field ngram \
    --sort_order ascending

Lines: 25127726line [05:01, 83447.21line/s] 

The file is sorted.

Processing complete.
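A check like the one above can be done in a single streaming pass that compares each record with the one before it. The sketch below is an assumption about the approach, not the contents of `verify_sort.py`; it reads plain JSONL lines and leaves decompression aside.

```python
import io
import json

def is_sorted(lines, field="ngram", ascending=True):
    """Stream JSONL lines and report whether `field` is in sorted order (ties allowed)."""
    previous = None
    for line in lines:
        value = json.loads(line)[field]
        if previous is not None:
            out_of_order = value < previous if ascending else value > previous
            if out_of_order:
                return False
        previous = value
    return True

good = io.StringIO('{"ngram": "apple"}\n{"ngram": "apple"}\n{"ngram": "pear"}\n')
bad = io.StringIO('{"ngram": "pear"}\n{"ngram": "apple"}\n')
print(is_sorted(good), is_sorted(bad))
```

Ties are allowed because the merged file still contains duplicate unigrams at this stage.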


### Consolidate duplicate unigrams
Lowercasing and lemmatizing produce duplicate unigrams. Now that the file is sorted, we can scan through it and consolidate consecutive identical entries. This involves summing their overall and yearly frequencies and document counts.
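Because duplicates are adjacent in a sorted file, consolidation needs only one pass. The sketch below illustrates the idea with `itertools.groupby` on in-memory records shaped like the ones in this notebook; the function name and the sample data are made up, and the real `7_consolidate.py` may be implemented differently.

```python
from collections import Counter
from itertools import groupby

def consolidate(sorted_records):
    """Merge consecutive records that share the same ngram, summing all counts."""
    for ngram, group in groupby(sorted_records, key=lambda r: r["ngram"]):
        freq, doc = Counter(), Counter()
        for rec in group:
            freq.update(rec["freq"])  # adds yearly counts key by key
            doc.update(rec["doc"])
        yield {
            "ngram": ngram,
            "freq_tot": sum(freq.values()),
            "doc_tot": sum(doc.values()),
            "freq": dict(freq),
            "doc": dict(doc),
        }

records = [
    {"ngram": "run", "freq_tot": 3, "doc_tot": 2, "freq": {"1900": 3}, "doc": {"1900": 2}},
    {"ngram": "run", "freq_tot": 5, "doc_tot": 4,
     "freq": {"1900": 1, "1901": 4}, "doc": {"1900": 1, "1901": 3}},
    {"ngram": "walk", "freq_tot": 2, "doc_tot": 1, "freq": {"1950": 2}, "doc": {"1950": 1}},
]
for rec in consolidate(records):
    print(rec)
```

`groupby` only groups neighboring items, so this works only on input that is already sorted by `ngram`.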

In [19]:
!python 7_consolidate.py \
    --ngram_size 1 \
    --proj_dir {base_dir} \
    --overwrite \
    --compress

Start Time:                2025-01-01 19:42:40.646053

Consolidation Info
Merged file:               /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/6corpus/1gram-merged.jsonl.lz4
Consolidated directory:    /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/6corpus/1gram-consolidated.jsonl.lz4
Ngram size:                1
Compress output files:     True
Overwrite existing files:  True

Consolidating   |██████████████████████████████████████████████████| 100.0% 25127726    /25127726

Lines before consolidation:  25127726
Lines after consolidation:   13499384

End Time:                  2025-01-01 19:54:29.839860
Total runtime:             0:11:49.193807


### View line [OPTIONAL]
If we want, we can inspect a line in the file.

In [27]:
!python print_jsonl_lines.py \
    --file_path "{base_dir}/1gram_files/6corpus/1gram-consolidated.jsonl.lz4" \
    --start 1000000 \
    --end 1000001 \
    --parse

Line 1000000: {'ngram': 'bandageing', 'freq_tot': 71, 'doc_tot': 51, 'freq': {'1804': 1, '1817': 1, '1845': 2, '1870': 1, '1872': 1, '1873': 1, '1874': 1, '1878': 1, '1885': 5, '1891': 1, '1907': 1, '1911': 2, '1914': 2, '1916': 1, '1917': 24, '1921': 1, '1964': 1, '1966': 1, '1973': 3, '1974': 1, '1988': 1, '1991': 1, '2000': 4, '2004': 2, '2005': 3, '2007': 2, '2012': 2, '2014': 4}, 'doc': {'1804': 1, '1817': 1, '1845': 2, '1870': 1, '1872': 1, '1873': 1, '1874': 1, '1878': 1, '1885': 5, '1891': 1, '1907': 1, '1911': 2, '1914': 2, '1916': 1, '1917': 12, '1921': 1, '1964': 1, '1966': 1, '1973': 2, '1974': 1, '1988': 1, '1991': 1, '2000': 2, '2004': 1, '2005': 2, '2007': 1, '2012': 2, '2014': 2}}
Line 1000001: {'ngram': 'bandageless', 'freq_tot': 79, 'doc_tot': 66, 'freq': {'1907': 2, '1909': 10, '1910': 1, '1921': 20, '1926': 1, '1927': 3, '1934': 2, '1935': 1, '1945': 2, '1958': 4, '1969': 5, '1982': 1, '1985': 1, '1987': 1, '1998': 1, '2001': 1, '2003': 1, '2004': 1, '2005': 2, '200

### Index unigrams and create vocabulary file
Most use cases will require an indexed list of "valid" (i.e., reasonably common) vocabulary words. The indexing script serves two functions: (1) it maps each unigram to an index number (saved in `6corpus/1gram-consolidated-indexed.jsonl`), and (2) it culls this file into a vocabulary list consisting of the _n_ most frequent unigrams (saved in `6corpus/1gram-consolidated-vocab_list_match.txt`). Here, `--vocab_file 80000` keeps the top 80,000. Unlike files upstream in the workflow, the vocabulary files are not very large and are therefore not compressed.
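The two outputs can be sketched as below. This assumes the index number is simply the unigram's rank by descending `freq_tot` (ties broken alphabetically); that assumption, the function name `build_index_and_vocab`, and the toy records are illustrative and not taken from `7_index.py`.

```python
def build_index_and_vocab(records, top_n):
    """Rank unigrams by total frequency; return (index_map, vocab_list)."""
    ranked = sorted(records, key=lambda r: (-r["freq_tot"], r["ngram"]))
    # Index every unigram by its 1-based frequency rank (assumed convention)
    index_map = {rec["ngram"]: rank for rank, rec in enumerate(ranked, start=1)}
    # The vocabulary list is the top_n most frequent unigrams
    vocab_list = [rec["ngram"] for rec in ranked[:top_n]]
    return index_map, vocab_list

records = [
    {"ngram": "apple", "freq_tot": 50},
    {"ngram": "fig", "freq_tot": 5},
    {"ngram": "pear", "freq_tot": 120},
]
index_map, vocab_list = build_index_and_vocab(records, top_n=2)
print(index_map)
print(vocab_list)
```

For a file with millions of unigrams, a full in-memory sort like this may not fit, which is presumably why the script chunks, sorts, and merges instead.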

In [25]:
!python 7_index.py \
    --ngram_size 1 \
    --proj_dir {base_dir} \
    --input_file {base_dir}/1gram_files/6corpus/1gram-consolidated.jsonl.lz4 \
    --overwrite \
    --vocab_file 80000 \
    --workers 48

Start Time:                2025-01-01 22:32:01.094067

Indexing Info
Project directory:         /vast/edk202/NLP_corpora/Google_Books/20200217/eng
Output directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/6corpus
Input file:                /vast/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/6corpus/1gram-consolidated.jsonl.lz4
Ngram size:                1
Overwrite existing files:  True
Workers:                   48
Vocab size (top N):        80000


Indexing Info
Chunking        |██████████████████████████████████████████████████| 100.0% 13499384    /13499384
Sorting         |██████████████████████████████████████████████████| 100.0% 135         /135
Merging         |██████████████████████████████████████████████████| 100.0% 13499384    /13499384
Indexing        |██████████████████████████████████████████████████| 100.0% 13499384    /13499384

Indexe

## Process Multigrams

In [1]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

In [None]:
!python 1_download.py \
    --ngram_size 5 \
    --ngram_type tagged \
    --proj_dir {base_dir} \
    --overwrite \
    --compress

In [None]:
!python 2_convert.py \
    --ngram_size 5 \
    --ngram_type tagged \
    --proj_dir {base_dir} \
    --compress

In [None]:
!python 3_lowercase.py \
    --ngram_size 5 \
    --proj_dir {base_dir} \
    --compress

In [None]:
!python 4_lemmatize.py \
    --ngram_size 5 \
    --proj_dir {base_dir} \
    --compress

In [None]:
!python 5_filter.py \
    --ngram_size 5 \
    --proj_dir {base_dir} \
    --numerals \
    --nonalpha \
    --stopwords \
    --min_token_length 3 \
    --min_tokens 2 \
    --compress \
    --vocab_file {base_dir}/1gram_files/6corpus/1gram-consolidated-vocab_list_match.txt

In [None]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

!python 6_sort2.py \
    --ngram_size 5 \
    --proj_dir {base_dir} \
    --workers 10 \
    --sort_key ngram \
    --compress \
    --sort_order ascending \
    --end_iteration 2

Start Time:                2025-01-01 15:54:18.573790

Sort Info
Input directory:           /vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5filter
Sorted directory:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/temp
Temp directory:            /vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/tmp
Merged file:               /vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/6corpus/5gram-merged.jsonl.lz4
Files available:           6520
Files to use:              6520
First file to get:         /vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5filter/5-00069-of-19423.jsonl.lz4
Last file to get:          /vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5filter/5-19422-of-19423.jsonl.lz4
Ngram size:                5
Number of workers:         10
Compress output files:     True
Overwrite existing files:  False
Sort key:                  ngram
Sort order:                ascending
He

In [2]:
!python 7_consolidate.py \
    --ngram_size 5 \
    --proj_dir {base_dir} \
    --overwrite \
    --compress

Start Time:                2024-12-31 17:52:08.091974

Consolidation Info
Merged file:               /vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/6corpus/5gram-merged.jsonl.lz4
Consolidated directory:    /vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/6corpus/5gram-consolidated.jsonl.lz4
Ngram size:                5
Compress output files:     True
Overwrite existing files:  True

Consolidating   |██████████████████████████████████████████████████| 100.0% 276470316   /276470316

Lines before consolidation:  276470316
Lines after consolidation:   75107076

End Time:                  2024-12-31 19:10:05.741234


In [3]:
import shutil

import lz4.frame

input_path = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/6corpus/5gram-consolidated.jsonl.lz4"
output_path = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/6corpus/5gram-consolidated.jsonl"

def decompress_lz4_file(input_path, output_path):
    """Decompress an LZ4 frame file to disk without loading it all into memory."""
    with lz4.frame.open(input_path, "rb") as compressed_file:
        with open(output_path, "wb") as decompressed_file:
            # Copy in fixed-size chunks; the decompressed file can be very large
            shutil.copyfileobj(compressed_file, decompressed_file)

decompress_lz4_file(input_path, output_path)

In [4]:
!python print_jsonl_lines.py \
    --file_path "{base_dir}/5gram_files/6corpus/5gram-consolidated.jsonl" \
    --start 50000 \
    --end 50100


In [4]:
!python verify_sort.py \
    --input_file "{base_dir}/5gram_files/6corpus/5gram-consolidated.jsonl" \
    --field ngram \
    --sort_order ascending

Lines: 75107076line [08:47, 142445.80line/s]

The file is sorted.

Processing complete.


In [None]:
!python simulate_merge.py \
    --file_dir "{base_dir}/5gram_files/temp" \
    --workers 48

In [None]:
base_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng"

!python simulate_merge2.py \
    --ngram_size 5 \
    --file_dir "{base_dir}/5gram_files/temp" \
    --tmp_dir "{base_dir}/5gram_files/tmp" \
    --compress \
    --sort_key ngram \
    --sort_order ascending \
    --workers 48

In [22]:
!lz4 -t {base_dir}/1gram_files/6corpus/1gram-consolidated.jsonl.lz4

/vast/edk202/NLP_corpora/Googl : decoded 18959398668 bytes                     
