# **Multigrams: Full Workflow**
The multigram workflow mirrors the unigram workflow, with two differences. First, instead of _creating_ a whitelist, you filter the multigram corpus _using_ a whitelist containing the top-_N_ unigrams. Second, the multigram workflow adds a **pivoting** step. Pivoting reorganizes the database so that each key begins with a year, which makes year-ngram combinations easy to query. For instance, you can learn how many times the word "nuclear" appeared in 2011 by querying the key `[2011] nuclear`. This is useful for analyzing changes in word meanings over time.
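
To make the idea of pivoting concrete, here is a minimal in-memory sketch. It assumes a hypothetical pre-pivot layout in which each ngram maps to a list of `(year, occurrences, documents)` tuples. The names `ngram_indexed` and `pivot` and all counts are illustrative and are not part of `ngramprep`.

```python
# Hypothetical pre-pivot layout: one record per ngram, holding
# (year, occurrences, documents) tuples. All counts are made-up toy values.
ngram_indexed = {
    "nuclear": [(2010, 12, 9), (2011, 15, 11)],
    "family": [(2011, 40, 32)],
}


def pivot(records):
    """Reorganize ngram -> [(year, occ, docs)] into (year, ngram) -> (occ, docs)."""
    pivoted = {}
    for ngram, yearly in records.items():
        for year, occurrences, documents in yearly:
            pivoted[(year, ngram)] = (occurrences, documents)
    return pivoted


year_indexed = pivot(ngram_indexed)
print(year_indexed[(2011, "nuclear")])  # (15, 11)
```

After pivoting, a single lookup answers "how often did this ngram appear in this year?" without scanning the ngram's full time series.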

## **Setup**
### Imports

In [3]:
%load_ext autoreload
%autoreload 2

from ngramprep.ngram_acquire import download_and_ingest_to_rocksdb
from ngramprep.ngram_filter.pipeline.orchestrator import build_processed_db
from ngramprep.ngram_pivot.pipeline import build_pivoted_db
from ngramprep.utilities.peek import db_head, db_peek, db_peek_prefix



### Configure
Here we set basic parameters: the corpus to download, the size of the ngrams to download, and the size of the year bins.

In [4]:
db_path_stub = '/scratch/edk202/NLP_corpora/Google_Books/'
archive_path_stub = None
release = '20200217'
language = 'eng-us'
ngram_size = 5
bin_size = 1

## **Step 1: Download and Ingest**

Passing a set of compound terms as `combined_bigrams` (here, `combined_bigrams_download`) converts each listed term into a single hyphenated token. Matching is case-sensitive, so we list all common capitalization patterns.
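
The following sketch illustrates the merging idea on a plain token list. It assumes untagged, whitespace-separated tokens, and `combine_bigrams` is a hypothetical helper written for illustration. The actual ingestion code, which runs here with `ngram_type="tagged"`, may handle tags and edge cases differently.

```python
def combine_bigrams(tokens, bigrams):
    """Merge adjacent token pairs listed in `bigrams` into one hyphenated token."""
    merged = []
    i = 0
    while i < len(tokens):
        # Only look ahead when a second token exists.
        if i + 1 < len(tokens) and f"{tokens[i]} {tokens[i + 1]}" in bigrams:
            merged.append(f"{tokens[i]}-{tokens[i + 1]}")
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


bigrams = {"working class", "Working Class"}

print(combine_bigrams("the working class movement".split(), bigrams))
# ['the', 'working-class', 'movement']

# Matching is case-sensitive, so an unlisted capitalization is left alone.
print(combine_bigrams("the WORKING CLASS movement".split(), bigrams))
# ['the', 'WORKING', 'CLASS', 'movement']
```

The second call shows why every capitalization pattern you care about has to appear in the set.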

In [4]:
# Each line ends with a comma; otherwise Python concatenates adjacent string
# literals across lines (e.g. "Working Classes" "middle class").
combined_bigrams_download = {
    "working class", "Working class", "Working Class", "working classes", "Working classes", "Working Classes",
    "middle class", "Middle class", "Middle Class", "middle classes", "Middle classes", "Middle Classes",
    "lower class", "Lower class", "Lower Class", "lower classes", "Lower classes", "Lower Classes",
    "upper class", "Upper class", "Upper Class", "upper classes", "Upper classes", "Upper Classes",
    "human being", "Human being", "Human Being", "human beings", "Human beings", "Human Beings",
}

download_and_ingest_to_rocksdb(
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    archive_path_stub=archive_path_stub,
    ngram_type="tagged",
    random_seed=289,
    overwrite_db=True,
    workers=128,
    write_batch_size=5_000,
    open_type="write:packed24",
    compact_after_ingest=False,
    combined_bigrams=combined_bigrams_download
)

N-GRAM ACQUISITION PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2026-01-05 09:25:22

Download Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Ngram repo:           https://books.storage.googleapis.com/?prefix=ngrams/books/20200217/eng-us/5-
DB path:              /scratch/edk202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams.db
File range:           0 to 11144
Total files:          11145
Files to get:         11145
Skipping:             0
Download workers:     128
Batch size:           5,000
Ngram size:           5
Ngram type:           tagged
Overwrite DB:         True
DB Profile:           write:packed24

Download Progress
════════════════════════════════════════════════════════════════════════════════════════════════════


Files Processed: 100%|█████████████████████████████████████████████████| 11145/11145 [1:38:40<00:00]



Processing complete!

Final Summary
════════════════════════════════════════════════════════════════════════════════════════════════════
Fully processed files:       11145
Failed files:                0
Total entries written:       1,391,714,327
Write batches flushed:       3963
Uncompressed data processed: 17.02 TB
Processing throughput:       3003.07 MB/sec

End Time: 2026-01-05 11:04:25.300122
Total Runtime: 1:39:02.892968
Time per file: 0:00:00.533234
Files per hour: 6751.3


## **Step 2: Filter and Normalize**
`config.py` contains generic defaults for the filtering pipeline. You can override these defaults by passing option dictionaries to the `build_processed_db` function, as seen below. Here, we filter the multigram corpus with the whitelist produced by the unigram workflow. If we weren't using a whitelist, we could normalize, filter, and lemmatize each token just as we did for the unigrams.

`always_include_tokens` is applied after case-normalization, so we use all lowercase.
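
As a rough illustration of whitelist filtering, the sketch below lowercases each token and replaces anything outside the whitelist with `<UNK>`. The replacement behavior is inferred from the `<UNK>` tokens visible in the inspected keys later in this notebook. `filter_ngram` is a hypothetical helper, and the sketch does not model any other rules the pipeline may apply, such as dropping ngrams that contain no whitelisted tokens.

```python
UNK = "<UNK>"


def filter_ngram(ngram, whitelist, always_include=frozenset()):
    """Lowercase each token, then keep it only if whitelisted or always included."""
    allowed = set(whitelist) | set(always_include)
    tokens = [tok.lower() for tok in ngram.split()]
    return " ".join(tok if tok in allowed else UNK for tok in tokens)


whitelist = {"able", "absence"}          # toy stand-in for the top-N unigrams
always_include = {"herself", "himself"}  # lowercase: matched after normalization

print(filter_ngram("Herself was not yet able", whitelist, always_include))
# herself <UNK> <UNK> <UNK> able
```

Because the comparison happens after lowercasing, an entry such as `'Herself'` in `always_include` would never match, which is why the set above is all lowercase.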

In [6]:
filter_options = {
    'bin_size': bin_size
}

whitelist_options = {
    'whitelist_path': f'{db_path_stub}{release}/{language}/1gram_files/1grams_processed.db/whitelist.txt',
    'output_whitelist_path': None
}

always_include_tokens = {
    'he', 'she',
    'him', 'her',
    'his', 'hers',
    'himself', 'herself',
    'man', 'woman',
    'men', 'women',
    'male', 'female',
    'boy', 'girl',
    'boys', 'girls',
    'father', 'mother',
    'fathers', 'mothers',
    'son', 'daughter',
    'sons', 'daughters',
    'brother', 'sister',
    'brothers', 'sisters'
}

build_processed_db(
    mode="restart",
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=36,
    num_initial_work_units=600,
    work_unit_claim_order="random",
    cache_partitions=True,
    use_cached_partitions=True,
    progress_every_s=30.0,
    always_include=always_include_tokens,
    compact_after_ingest=False,
    **filter_options,
    **whitelist_options
);


N-GRAM FILTER PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2026-01-06 19:41:08
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            /scratch/edk202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams.db
Target DB:            ...02/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams_processed.db
Temp directory:       .../edk202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/processing_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              36
Initial work units:   600

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24

Ingestion Configuration
─────────────────────

Loading whitelist...
Loaded 30,000 tokens

Phase 1: Creating work units...
════════════════════════════════════════════════════════════════════════════════════════════════════
Clean restart - creating new work units
Loading cached partitions (600 work units)...
Loaded 600 work units from cache

Phase 2: Processing 600 work units with 36 workers...
════════════════════════════════════════════════════════════════════════════════════════════════════

    items         kept%        workers        units          rate        elapsed    
────────────────────────────────────────────────────────────────────────────────────
    2.85M         74.7%          4/36        591·4·5       70.5k/s         40s      
    14.54M        84.3%          7/36        572·7·21      206.6k/s       1m10s     
    38.45M        86.2%         10/36       554·10·36      383.0k/s       1m40s     
    73.11M        86.0%         14/36       540·14·46      560.8k/s       2m10s     
   109.67M        85.7%         17/36 

Shards Ingested: 100%|███████████████████████████████████████████████████████| 600/600 [44:36<00:00]



Ingestion complete: 600 shards, 325,177,784 items in 2676.3s (121,505 items/s)

Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ PROCESSING COMPLETE                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Items: 282,846,041 (estimated)                                                                   │
│ Size: 297.18 GB                                                                                  │
│ Database: ...h/edk202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams_processed.db   │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘


## **Step 3: Pivot to Yearly Indices**

In [7]:
build_pivoted_db(
    mode="restart",
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=30,
    num_initial_work_units=600,
    cache_partitions=True,
    use_cached_partitions=True,
    compact_after_ingest=False,
    progress_every_s=10.0,
);


PARALLEL N-GRAM DATABASE PIVOT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2026-01-06 20:37:14
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            ...02/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams_processed.db
Target DB:            ...k202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams_pivoted.db
Temp directory:       ...ch/edk202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/pivoting_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              30
Initial work units:   600

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24
Ingest profile:       write:packed24



SST Files Ingested: 100%|████████████████████████████████████████████████████| 600/600 [00:36<00:00]



Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ PROCESSING COMPLETE                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Items: 8,614,673,680 (estimated)                                                                 │
│ Size: 125.08 GB                                                                                  │
│ Database: ...tch/edk202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams_pivoted.db   │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘


# **Inspect Final Database**
Here are three functions you can use to inspect the final database.
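
The three access patterns can be sketched on a sorted list of keys. This is only a toy model: the real functions read from the database, and the string keys, the function names `head`, `peek`, and `peek_prefix`, and their signatures below are illustrative assumptions rather than the `ngramprep` API.

```python
from bisect import bisect_left

# Toy sorted key list standing in for the database's ordered key space.
keys = sorted([
    "[1910] <UNK> himself <UNK> <UNK> able",
    "[1910] <UNK> himself <UNK> <UNK> abject",
    "[1910] he said <UNK> <UNK> <UNK>",
    "[2019] herself <UNK> <UNK> <UNK> able",
])


def head(keys, n):
    """Return the first n keys in sort order."""
    return keys[:n]


def peek(keys, start_key, n):
    """Return up to n keys starting at the first key >= start_key."""
    start = bisect_left(keys, start_key)
    return keys[start:start + n]


def peek_prefix(keys, prefix, n):
    """Return up to n keys that begin with prefix."""
    start = bisect_left(keys, prefix)
    matches = []
    for key in keys[start:]:
        if len(matches) == n or not key.startswith(prefix):
            break
        matches.append(key)
    return matches


print(peek_prefix(keys, "[1910] <UNK> himself", n=5))
```

The difference between the last two is the stopping rule: `peek` keeps going past the starting key into whatever sorts next, while `peek_prefix` stops at the first key that no longer matches.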

## `db_head`: First N records

In [8]:
pivoted_db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_pivoted.db'

db_head(pivoted_db, n=5)

First 5 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1472] <UNK> <UNK> <UNK> <UNK> absence
     Value: 5 occurrences in 3 documents

[ 2] Key:   [1472] <UNK> <UNK> <UNK> <UNK> accordance
     Value: 1 occurrences in 1 documents

[ 3] Key:   [1472] <UNK> <UNK> <UNK> <UNK> accordingly
     Value: 1 occurrences in 1 documents

[ 4] Key:   [1472] <UNK> <UNK> <UNK> <UNK> accrue
     Value: 1 occurrences in 1 documents

[ 5] Key:   [1472] <UNK> <UNK> <UNK> <UNK> acre
     Value: 2 occurrences in 2 documents



## `db_peek`: Records starting from a key

In [13]:
pivoted_db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_pivoted.db'

db_peek(pivoted_db, start_key="[2019] herself <UNK> <UNK> <UNK> <UNK>", n=5)

5 key-value pairs starting from 000007e368657273656c66203c554e4b3e203c554e4b3e203c554e4b3e203c554e4b3e:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [2019] herself <UNK> <UNK> <UNK> <UNK>
     Value: 15,021 occurrences in 14,857 documents

[ 2] Key:   [2019] herself <UNK> <UNK> <UNK> able
     Value: 53 occurrences in 51 documents

[ 3] Key:   [2019] herself <UNK> <UNK> <UNK> absence
     Value: 1 occurrences in 1 documents

[ 4] Key:   [2019] herself <UNK> <UNK> <UNK> absurd
     Value: 1 occurrences in 1 documents

[ 5] Key:   [2019] herself <UNK> <UNK> <UNK> abyss
     Value: 2 occurrences in 2 documents



## `db_peek_prefix`: Records matching a prefix

In [14]:
pivoted_db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_pivoted.db'

db_peek_prefix(pivoted_db, prefix="[1910] <UNK> himself", n=5)

5 key-value pairs with prefix 000007763c554e4b3e2068696d73656c66:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1910] <UNK> himself <UNK> <UNK> <UNK>
     Value: 94,539 occurrences in 90,099 documents

[ 2] Key:   [1910] <UNK> himself <UNK> <UNK> ability
     Value: 35 occurrences in 29 documents

[ 3] Key:   [1910] <UNK> himself <UNK> <UNK> abject
     Value: 2 occurrences in 2 documents

[ 4] Key:   [1910] <UNK> himself <UNK> <UNK> able
     Value: 99 occurrences in 99 documents

[ 5] Key:   [1910] <UNK> himself <UNK> <UNK> aboriginal
     Value: 5 occurrences in 5 documents
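
The hexadecimal strings printed by `db_peek` and `db_peek_prefix` above are consistent with a key layout of a 4-byte big-endian year followed by the UTF-8 bytes of the ngram. The sketch below reproduces those strings under that assumption. The layout is inferred from the printed output rather than taken from the library's source, and `encode_key` and `decode_key` are hypothetical helpers.

```python
import struct


def encode_key(year, ngram):
    """Pack the year as a 4-byte big-endian unsigned int, then append the UTF-8 ngram."""
    return struct.pack(">I", year) + ngram.encode("utf-8")


def decode_key(key):
    """Split a raw key back into (year, ngram)."""
    (year,) = struct.unpack(">I", key[:4])
    return year, key[4:].decode("utf-8")


print(encode_key(2019, "herself <UNK> <UNK> <UNK> <UNK>").hex())
# 000007e368657273656c66203c554e4b3e203c554e4b3e203c554e4b3e203c554e4b3e

print(encode_key(1910, "<UNK> himself").hex())
# 000007763c554e4b3e2068696d73656c66
```

With a big-endian year prefix, byte-wise key order matches numeric year order, so all records for a given year sit next to each other in the key space.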

