# **Multigrams: Full Workflow**
The multigram workflow mirrors the unigram workflow, but with two differences. First, instead of _creating_ a whitelist, you filter the multigram corpus _using_ a whitelist containing the top-_N_ unigrams. Second, the multigram workflow adds a **pivoting** step. Pivoting reorganizes the database so that it's easy to query year-ngram combinations. For instance, you can learn how many times the word "nuclear" appeared in 2011 by querying the key `[2011] nuclear`. This is useful for analyzing changes in word meanings over time.
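
As a rough illustration of what pivoting does, the sketch below re-keys a toy set of records with plain Python dictionaries. The record layout, the counts, and the `pivot` helper are all made up for illustration; the real databases store packed binary values in RocksDB, and only the `[year] ngram` key shape comes from the description above.

```python
# Hypothetical unpivoted records: one entry per ngram, holding
# (occurrences, volumes) per year. The counts are invented.
unpivoted = {
    "nuclear": {2010: (120, 95), 2011: (150, 110)},
    "energy": {2011: (300, 210)},
}

def pivot(records):
    """Re-key ngram -> {year: counts} records as '[year] ngram' -> counts."""
    pivoted = {}
    for ngram, by_year in records.items():
        for year, counts in by_year.items():
            pivoted[f"[{year}] {ngram}"] = counts
    return pivoted

pivoted = pivot(unpivoted)

# A year-ngram combination is now a single key lookup.
occurrences, volumes = pivoted["[2011] nuclear"]
print(occurrences, volumes)
```

After pivoting, one ngram that appears in many years becomes many small records, one per year, which is why the pivoted database holds more items than the filtered one.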

## **Setup**
### Imports

In [3]:
%load_ext autoreload
%autoreload 2

from ngramprep.ngram_acquire import download_and_ingest_to_rocksdb
from ngramprep.ngram_filter.pipeline.orchestrator import build_processed_db
from ngramprep.ngram_pivot.pipeline import build_pivoted_db
from ngramprep.utilities.peek import db_head, db_peek, db_peek_prefix
from ngramprep.utilities.count_items import count_db_items



### Configure
Here we set basic parameters: the corpus to download, the size of the ngrams to download, and the size of the year bins.

In [4]:
db_path_stub = '/scratch/edk202/NLP_corpora/Google_Books/'
archive_path_stub = None
release = '20200217'
language = 'eng'
ngram_size = 5
bin_size = 1
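
The parameters above determine where each database lives on disk. The helper below is a hypothetical convenience (it is not part of `ngramprep`) that reproduces the path pattern used in the inspection cells later in this notebook. It assumes `db_path_stub` ends with a trailing slash, as it does here.

```python
def db_path(stub, release, language, ngram_size, suffix=""):
    """Build a database path following the pattern used later in this notebook.

    `suffix` is '' for the raw database, '_processed' for the filtered one,
    and '_pivoted' for the pivoted one.
    """
    return f"{stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams{suffix}.db"

stub = "/scratch/edk202/NLP_corpora/Google_Books/"
raw_db = db_path(stub, "20200217", "eng", 5)
processed_db = db_path(stub, "20200217", "eng", 5, suffix="_processed")
pivoted_db = db_path(stub, "20200217", "eng", 5, suffix="_pivoted")

print(raw_db)
print(processed_db)
print(pivoted_db)
```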

## **Step 1: Download and Ingest**

Passing a set of compound terms through the `combined_bigrams` argument converts each one into a single, hyphenated token. Matching is case-sensitive, so we list all common capitalization patterns.

If you're resuming a download after an interruption, there may be a lag before any output appears while the RocksDB database is repaired.

In [None]:
combined_bigrams_download = {
    "human being", "Human being", "Human Being", "human beings", "Human beings", "Human Beings"
}

download_and_ingest_to_rocksdb(
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    archive_path_stub=archive_path_stub,
    ngram_type="tagged",
    random_seed=923,
    overwrite_db=False,
    workers=128,
    write_batch_size=5_000,
    open_type="write:packed24",
    compact_after_ingest=True,
    combined_bigrams=combined_bigrams_download
)
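
To make the case-sensitivity point concrete, here is a minimal sketch of how adjacent tokens might be merged into a hyphenated token. The `merge_compounds` function is hypothetical and is not the library's implementation (which also has to handle tagged ngrams); it only illustrates why each capitalization pattern must be listed explicitly.

```python
def merge_compounds(tokens, compounds):
    """Join adjacent token pairs found in `compounds` into one hyphenated token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        # Exact (case-sensitive) match against the set of compound terms.
        if i + 1 < len(tokens) and pair in compounds:
            merged.append(pair.replace(" ", "-"))
            i += 2  # skip the second token of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

compounds = {"human being", "Human being", "Human Being"}

listed = merge_compounds("every Human being has rights".split(), compounds)
unlisted = merge_compounds("every HUMAN BEING has rights".split(), compounds)

print(listed)    # the listed pattern is merged
print(unlisted)  # the unlisted all-caps pattern is left alone
```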

## **Step 2: Filter and Normalize**
`config.py` contains generic defaults for the filtering pipeline. You can override these defaults by passing option dictionaries to the `build_processed_db` function, as seen below. As implemented here, we use the whitelist from the unigram workflow to filter the multigram corpus. If we weren't using a whitelist, we could normalize, filter, and lemmatize each token just as we did for the unigrams.

`always_include_tokens` is applied after case-normalization, so we use all lowercase.
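
The sketch below shows one way whitelist filtering with an always-include set could work. It is an assumption-laden illustration, not the pipeline's code: the `apply_whitelist` helper is hypothetical, and the `<UNK>` placeholder for dropped tokens is inferred from the `db_head` output shown later in this notebook.

```python
UNK = "<UNK>"  # placeholder seen in this notebook's inspection output

def apply_whitelist(ngram, whitelist, always_include=frozenset()):
    """Lowercase each token, then keep it only if it is whitelisted or always included."""
    kept = []
    for token in ngram.split():
        token = token.lower()  # case-normalize first, so the sets must be lowercase
        if token in whitelist or token in always_include:
            kept.append(token)
        else:
            kept.append(UNK)
    return " ".join(kept)

whitelist = {"abaca", "banana"}
always_include = {"she"}

filtered = apply_whitelist("She ate the abaca banana", whitelist, always_include)
print(filtered)
```

Because lookup happens after lowercasing, an entry such as `'She'` in the always-include set would never match, which is why the set above is written in lowercase.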

In [None]:
filter_options = {
    'bin_size': bin_size
}

whitelist_options = {
    'whitelist_path': f'{db_path_stub}{release}/{language}/1gram_files/1grams_processed.db/whitelist.txt',
    'output_whitelist_path': None
}

always_include_tokens = {
    'he', 'she',
    'him', 'her',
    'his', 'hers',
    'himself', 'herself',
    'man', 'woman',
    'men', 'women',
    'male', 'female',
    'boy', 'girl',
    'boys', 'girls',
    'father', 'mother',
    'fathers', 'mothers',
    'son', 'daughter',
    'sons', 'daughters',
    'brother', 'sister',
    'brothers', 'sisters'
}

build_processed_db(
    mode="restart",
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=36,
    num_initial_work_units=600,
    work_unit_claim_order="random",
    cache_partitions=True,
    use_cached_partitions=True,
    progress_every_s=30.0,
    always_include=always_include_tokens,
    compact_after_ingest=True,
    **filter_options,
    **whitelist_options
);


N-GRAM FILTER PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2026-01-13 02:50:57
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            /scratch/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5grams.db
Target DB:            ...dk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5grams_processed.db
Temp directory:       ...tch/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/processing_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              36
Initial work units:   600

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24

Ingestion Configuration
────────────────────────

Shards Ingested:   6%|███▌                                                    | 38/600 [01:04<16:53]

## **Step 3: Pivot to Yearly Indices**
This function rearranges the filtered data so that each record has a year (or year-bin) prefix, which is ideal for time-series analysis.
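
A year prefix helps because a sorted key-value store keeps all records for one year next to each other, so a single range scan retrieves them. The sketch below imitates that with a sorted Python list and `bisect`. The keys are invented and written as readable strings; the actual byte encoding of the year prefix in the pivoted database may differ.

```python
from bisect import bisect_left

# Invented keys in the '[year] ngram' shape, kept in sorted order
# the way a sorted key-value store would keep them.
keys = sorted([
    "[2010] nuclear energy",
    "[2011] nuclear energy",
    "[2011] nuclear power",
    "[2012] nuclear power",
])

def scan_prefix(sorted_keys, prefix):
    """Return every key that starts with `prefix` using one contiguous scan."""
    start = bisect_left(sorted_keys, prefix)  # first key >= prefix
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order guarantees no later key can match
        matches.append(key)
    return matches

keys_2011 = scan_prefix(keys, "[2011]")
print(keys_2011)
```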

In [None]:
build_pivoted_db(
    mode="restart",
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=30,
    num_initial_work_units=600,
    cache_partitions=True,
    use_cached_partitions=False,
    compact_after_ingest=True,
    progress_every_s=30.0,
);


PARALLEL N-GRAM DATABASE PIVOT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2026-01-12 21:50:30
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            ...dk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5grams_processed.db
Target DB:            .../edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5grams_pivoted.db
Temp directory:       /scratch/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/pivoting_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              30
Initial work units:   600

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24
Ingest profile:       write:packed24



SST Files Ingested: 100%|████████████████████████████████████████████████████| 600/600 [00:44<00:00]



Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

Post-Ingestion Compaction
────────────────────────────────────────────────────────────────────────────────────────────────────
Initial DB size:         211.29 GB
Compaction completed in 0:44:59
Size before:             211.29 GB
Size after:              517.97 GB
Space saved:             -306.68 GB (-145.1%)

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ PROCESSING COMPLETE                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Items: 14,661,317,082 (estimated)                                                                │
│ Size: 517.97 GB                                                                                  │
│ Database: /scratch/edk202/NLP_corpora/Google_Bo

## **Optional: Inspect Database Files**

### `db_head`: Show first N records

In [7]:
db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_processed.db'

db_head(db, n=5)

First 5 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   <UNK> <UNK> <UNK> abaca banana
     Value: Total: 59 occurrences in 59 volumes (1992-2013, 17 bins)

[ 2] Key:   <UNK> <UNK> <UNK> abaca industry
     Value: Total: 267 occurrences in 223 volumes (1901-2019, 65 bins)

[ 3] Key:   <UNK> <UNK> <UNK> abaca plant
     Value: Total: 487 occurrences in 477 volumes (1887-2019, 111 bins)

[ 4] Key:   <UNK> <UNK> <UNK> abaca plantation
     Value: Total: 69 occurrences in 51 volumes (1898-1987, 19 bins)

[ 5] Key:   <UNK> <UNK> <UNK> abaca production
     Value: Total: 630 occurrences in 550 volumes (1949-2003, 39 bins)



### `db_peek`: Show records starting from a key

In [8]:
db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_processed.db'

db_peek(db, start_key="à <UNK> <UNK> <UNK> <UNK>", n=5)

5 key-value pairs starting from c3a0203c554e4b3e203c554e4b3e203c554e4b3e203c554e4b3e:
────────────────────────────────────────────────────────────────────────────────────────────────────


### `db_peek_prefix`: Show records matching a prefix

In [10]:
db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_processed.db'

db_peek_prefix(db, prefix="b", n=5)

5 key-value pairs with prefix 62:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   baa <UNK> <UNK> <UNK> also
     Value: Total: 59 occurrences in 57 volumes (1921-1988, 22 bins)

[ 2] Key:   baa <UNK> <UNK> <UNK> authority
     Value: Total: 299 occurrences in 260 volumes (1967-2019, 38 bins)

[ 3] Key:   baa <UNK> <UNK> <UNK> average
     Value: Total: 284 occurrences in 45 volumes (1953-2003, 23 bins)

[ 4] Key:   baa <UNK> <UNK> <UNK> baa
     Value: Total: 127 occurrences in 121 volumes (1926-2017, 40 bins)

[ 5] Key:   baa <UNK> <UNK> <UNK> better
     Value: Total: 147 occurrences in 121 volumes (1966-2016, 35 bins)
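
The hexadecimal strings in the `db_peek` and `db_peek_prefix` headers above are the UTF-8 bytes of the requested key or prefix. The short check below reproduces both using only the standard library, which can help when a lookup returns nothing and you want to confirm what bytes were actually searched for.

```python
start_key = "à <UNK> <UNK> <UNK> <UNK>"
prefix = "b"

# Encode to UTF-8 and render as hex, as shown in the peek headers.
start_key_hex = start_key.encode("utf-8").hex()
prefix_hex = prefix.encode("utf-8").hex()

print(start_key_hex)
print(prefix_hex)

# The conversion is reversible, so a header can be decoded back to text.
decoded = bytes.fromhex(start_key_hex).decode("utf-8")
print(decoded)
```

Note that "à" occupies two bytes (`c3a0`) in UTF-8, so the hex string is longer than the character count suggests.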



## **Optional: Count Database Items**

In [None]:
raw_db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams.db'

raw_count = count_db_items(raw_db)

DATABASE ITEM COUNTER
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
/scratch/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/5grams.db
Progress interval: every 10,000,000 items

COUNTING
────────────────────────────────────────────────────────────────────────────────────────────────────
[     10,000,000] | elapsed     25.2s | rate 397,507 items/sec
[     20,000,000] | elapsed     47.3s | rate 422,881 items/sec
[     30,000,000] | elapsed     74.4s | rate 402,994 items/sec
[     40,000,000] | elapsed     97.4s | rate 410,517 items/sec
[     50,000,000] | elapsed    118.2s | rate 422,954 items/sec
[     60,000,000] | elapsed    137.2s | rate 437,405 items/sec
[     70,000,000] | elapsed    158.3s | rate 442,239 items/sec
[     80,000,000] | elapsed    185.7s | rate 430,781 items/sec
[     90,000,000] | elapsed    205.5s | rate 437,971 items/sec
[    100,000,000] | elapsed    224.9s | rate 444,696 items/sec
[    110,0

In [None]:
filtered_db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_processed.db'

filtered_count = count_db_items(filtered_db)


DATABASE ITEM COUNTER
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
/scratch/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5grams_processed.db
Progress interval: every 10,000,000 items

COUNTING
────────────────────────────────────────────────────────────────────────────────────────────────────


In [None]:
pivoted_db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_pivoted.db'

pivoted_count = count_db_items(pivoted_db, progress_interval=50_000_000)

DATABASE ITEM COUNTER
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
...atch/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/5grams_pivoted.db
Progress interval: every 50,000,000 items

COUNTING
────────────────────────────────────────────────────────────────────────────────────────────────────
[     50,000,000] | elapsed     22.7s | rate 2,199,962 items/sec
[    100,000,000] | elapsed     45.6s | rate 2,194,238 items/sec
[    150,000,000] | elapsed     68.6s | rate 2,185,296 items/sec
[    200,000,000] | elapsed     92.2s | rate 2,169,326 items/sec
[    250,000,000] | elapsed    115.2s | rate 2,170,430 items/sec
[    300,000,000] | elapsed    138.9s | rate 2,159,459 items/sec
[    350,000,000] | elapsed    161.4s | rate 2,169,175 items/sec
[    400,000,000] | elapsed    184.3s | rate 2,170,344 items/sec
[    450,000,000] | elapsed    207.3s | rate 2,170,534 items/sec
[    500,000,000] | elapsed    230.3s | rate 2,1

In [None]:
pivoted_db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_pivoted.db'

pivoted_count_per_bin = count_db_items(pivoted_db, progress_interval=50_000_000, grouping='year_bin')

DATABASE ITEM COUNTER
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
/scratch/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5grams_pivoted.db
Progress interval: every 50,000,000 items
Grouping by: year_bin

COUNTING
────────────────────────────────────────────────────────────────────────────────────────────────────
[     50,000,000] | elapsed     36.8s | rate 1,360,272 items/sec
[    100,000,000] | elapsed     75.9s | rate 1,316,799 items/sec
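
Conceptually, grouping by year bin amounts to tallying keys by their year prefix. The sketch below shows that idea on a handful of invented keys. It is not how `count_db_items` is implemented, and it assumes the readable `[year] ngram` key shape described in the introduction.

```python
import re
from collections import Counter

YEAR_PREFIX = re.compile(r"^\[(\d+)\]")  # matches a leading '[2011]'-style prefix

def count_by_year_bin(keys):
    """Tally keys by the year found in their '[year]' prefix."""
    counts = Counter()
    for key in keys:
        match = YEAR_PREFIX.match(key)
        if match:  # keys without a year prefix are ignored
            counts[int(match.group(1))] += 1
    return counts

# Invented keys for illustration only.
sample_keys = [
    "[2010] nuclear energy",
    "[2011] nuclear energy",
    "[2011] nuclear power",
    "no year prefix here",
]

per_bin = count_by_year_bin(sample_keys)
for year in sorted(per_bin):
    print(year, per_bin[year])
```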
