# **Multigrams: Full Workflow**
The multigram workflow mirrors the unigram workflow, with two differences. First, instead of _creating_ a whitelist, you filter the multigram corpus _using_ a whitelist containing the top-_N_ unigrams. Second, the multigram workflow adds a **pivoting** step, which reorganizes the database so that each key is a year-ngram combination. For instance, you can learn how many times the word "nuclear" appeared in 2011 by querying the key `[2011] nuclear`. Grouping records by year in this way is useful for analyzing changes in word meanings over time.
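To make the pivoting idea concrete, here is a minimal sketch on plain Python dictionaries. It is not how `ngramkit` implements the step: the pre-pivot record layout, the `pivot` helper, and all counts are made-up illustrations. Only the `[year] ngram` key format comes from this notebook.

```python
# Toy pre-pivot layout (an assumption for illustration): one record per n-gram,
# holding a list of (year, occurrences, documents) tuples. Counts are invented.
ngram_records = {
    "nuclear": [(2010, 120, 80), (2011, 150, 95)],
    "atomic energy": [(2011, 40, 30)],
}


def pivot(records):
    """Re-key records from ngram -> yearly list to '[year] ngram' -> counts."""
    pivoted = {}
    for ngram, yearly in records.items():
        for year, occurrences, documents in yearly:
            pivoted[f"[{year}] {ngram}"] = (occurrences, documents)
    # Sort by key so that all entries for one year sit next to each other.
    return dict(sorted(pivoted.items()))


pivoted = pivot(ngram_records)
occurrences, documents = pivoted["[2011] nuclear"]
print(f"{occurrences} occurrences in {documents} documents")
```

After pivoting, a single lookup answers a year-specific question, and all keys for one year are adjacent in sort order.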

## **Setup**
### Imports

In [1]:
%load_ext autoreload
%autoreload 2

from ngramkit.ngram_acquire import download_and_ingest_to_rocksdb
from ngramkit.ngram_filter.pipeline.orchestrator import build_processed_db
from ngramkit.ngram_pivot.pipeline import build_pivoted_db
from ngramkit.utilities.peek import db_head, db_peek, db_peek_prefix

### Configure

In [2]:
db_path_stub = '/scratch/edk202/NLP_corpora/Google_Books/'
archive_path_stub = '/scratch/edk202/NLP_archive/Google_Books/'
release = '20200217'
language = 'eng'
size = 5

## **Step 1: Download and Ingest**

In [3]:
combined_bigrams_download = {"lower class", "working class", "middle class", "upper class", "blue collar", "white collar", "african american", "african americans", "european american", "european americans", "white people", "white person", "white americans", "black american", "black americans", "black person", "black people", "human being"}

download_and_ingest_to_rocksdb(
    ngram_size=size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    archive_path_stub=None,
    ngram_type="tagged",
    random_seed=98,
    overwrite_db=False,
    workers=80,
    write_batch_size=5_000,
    open_type="write:packed24",
    compact_after_ingest=True,
    combined_bigrams=combined_bigrams_download
)

N-GRAM ACQUISITION PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-23 20:40:01

Download Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Ngram repo:           https://books.storage.googleapis.com/?prefix=ngrams/books/20200217/eng/5-
DB path:              /scratch/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5grams.db
File range:           0 to 19422
Total files:          19423
Files to get:         0
Skipping:             19423
Download workers:     80
Batch size:           5,000
Ngram size:           5
Ngram type:           tagged
Overwrite DB:         False
DB Profile:           write:packed24

Download Progress
════════════════════════════════════════════════════════════════════════════════════════════════════


Files Processed:   0%|                                                               | 0/0 [00:00<?]


Post-Ingestion Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════
Initial DB size:         2.21 TB





Compaction completed in 0:42:07
Size before:             2.21 TB
Size after:              2.21 TB
Space saved:             -143.08 KB (-0.0%)

Processing complete!

Final Summary
════════════════════════════════════════════════════════════════════════════════════════════════════
Fully processed files:       0
Failed files:                0
Total entries written:       0
Write batches flushed:       0
Uncompressed data processed: 0.00 B
Processing throughput:       0.00 MB/sec

End Time: 2025-11-23 21:22:26.459660
Total Runtime: 0:42:25.150204
Time per file: 0:00:00
Files per hour: 0.0


## **Step 2: Filter and Normalize**
`config.py` contains generic defaults for the filtering pipeline. You can override these defaults by passing option dictionaries to the `build_processed_db` function, as seen below. Here, we filter the multigram corpus with the whitelist produced by the unigram workflow, and we pass the hyphenated forms of the bigrams combined in Step 1 as `always_include` tokens. Without a whitelist, we could instead normalize, filter, and lemmatize each token just as we did for the unigrams.
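As a rough illustration of whitelist filtering, the sketch below replaces every token that is not whitelisted with `<UNK>`, the placeholder visible in the database output later in this notebook. The `apply_whitelist` helper and the tiny whitelist are hypothetical, and the real pipeline may apply further steps that are not modeled here.

```python
UNK = "<UNK>"  # placeholder token, as seen in the inspected database keys


def apply_whitelist(ngram, whitelist, always_include=frozenset()):
    """Replace tokens outside the whitelist with <UNK> (hypothetical helper)."""
    keep = set(whitelist) | set(always_include)
    return " ".join(tok if tok in keep else UNK for tok in ngram.split())


# Tiny stand-in for the top-N unigram whitelist (illustrative only).
whitelist = {"able", "case"}
always_include = {"lower-class"}

print(apply_whitelist("the lower-class is able", whitelist, always_include))
```

Keeping the position of dropped tokens, rather than deleting them, preserves the distance between the surviving words.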

In [3]:
whitelist_options = {
    'whitelist_path': f'{db_path_stub}{release}/{language}/1gram_files/1grams_processed.db/whitelist.txt',
    'output_whitelist_path': None
}

always_include_tokens = {"lower-class", "working-class", "middle-class", "upper-class", "blue-collar", "white-collar", "african-american", "african-americans", "european-american", "european-americans", "white-people", "white-person", "white-americans", "black-american", "black-americans", "black-person", "black-people", "human-being"}

build_processed_db(
    mode="restart",
    ngram_size=size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=36,
    num_initial_work_units=600,
    work_unit_claim_order="random",
    cache_partitions=True,
    use_cached_partitions=True,
    progress_every_s=30.0,
    always_include=always_include_tokens,
    **whitelist_options
);


N-GRAM FILTER PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-23 22:15:22
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            /scratch/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5grams.db
Target DB:            ...dk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5grams_processed.db
Temp directory:       ...tch/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/processing_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              36
Initial work units:   600

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24

Ingestion Configuration
────────────────────────

Shards Ingested: 100%|███████████████████████████████████████████████████████| 600/600 [42:53<00:00]



Ingestion complete: 600 shards, 454,493,730 items in 2573.3s (176,622 items/s)

Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

Post-Ingestion Compaction
────────────────────────────────────────────────────────────────────────────────────────────────────
Initial DB size:         443.99 GB
Compaction completed in 0:10:27
Size before:             443.99 GB
Size after:              277.36 GB
Space saved:             166.62 GB (37.5%)

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ PROCESSING COMPLETE                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Items: 405,318,801 (estimated)                                                                   │
│ Size: 277.36 GB                                                        

## **Step 3: Pivot to Yearly Indices**

In [None]:
build_pivoted_db(
    mode="restart",
    ngram_size=size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=40,
    num_initial_work_units=600,
    cache_partitions=True,
    use_cached_partitions=True,
    progress_every_s=30.0,
);


PARALLEL N-GRAM DATABASE PIVOT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-23 23:38:11
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            ...dk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5grams_processed.db
Target DB:            .../edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/5grams_pivoted.db
Temp directory:       /scratch/edk202/NLP_corpora/Google_Books/20200217/eng/5gram_files/pivoting_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              40
Initial work units:   600

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24
Ingest profile:       write:packed24



# **Inspect Final Database**
Three functions from `ngramkit.utilities.peek` let you inspect the pivoted database.

## `db_head`: First N records

In [4]:
pivoted_db = f'{db_path_stub}{release}/{language}/{size}gram_files/{size}grams_pivoted.db'

db_head(pivoted_db, n=5)

First 5 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1470] <UNK> <UNK> <UNK> <UNK> convenient
     Value: 1 occurrences in 1 documents

[ 2] Key:   [1470] <UNK> <UNK> <UNK> <UNK> eng
     Value: 1 occurrences in 1 documents

[ 3] Key:   [1470] <UNK> <UNK> <UNK> atomic energy
     Value: 1 occurrences in 1 documents

[ 4] Key:   [1470] <UNK> <UNK> <UNK> convenient form
     Value: 1 occurrences in 1 documents

[ 5] Key:   [1470] <UNK> <UNK> <UNK> convenient one
     Value: 1 occurrences in 1 documents



## `db_peek`: Records starting from a key

In [5]:
pivoted_db = f'{db_path_stub}{release}/{language}/{size}gram_files/{size}grams_pivoted.db'

db_peek(pivoted_db, start_key="[2019] working-class <UNK> <UNK> <UNK> <UNK>", n=5)

5 key-value pairs starting from 000007e3776f726b696e672d636c617373203c554e4b3e203c554e4b3e203c554e4b3e203c554e4b3e:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [2019] working-class <UNK> <UNK> ability
     Value: 2 occurrences in 2 documents

[ 2] Key:   [2019] working-class <UNK> <UNK> able
     Value: 28 occurrences in 27 documents

[ 3] Key:   [2019] working-class <UNK> <UNK> abolition
     Value: 5 occurrences in 5 documents

[ 4] Key:   [2019] working-class <UNK> <UNK> absence
     Value: 3 occurrences in 3 documents

[ 5] Key:   [2019] working-class <UNK> <UNK> achievement
     Value: 2 occurrences in 2 documents
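The hex string in the header line of the output above hints at how pivoted keys are stored. Judging only from the two hex strings printed in this notebook (not from the `ngramkit` source), a key appears to be a 4-byte big-endian year followed by the UTF-8 n-gram text, with the `[2019] ` form being a display rendering. The helper names below are hypothetical.

```python
import struct


def encode_pivoted_key(year, ngram):
    """Build a raw key: 4-byte big-endian year + UTF-8 n-gram (inferred layout)."""
    return struct.pack(">I", year) + ngram.encode("utf-8")


def decode_pivoted_key(hex_key):
    """Split a hex-encoded raw key back into (year, ngram)."""
    raw = bytes.fromhex(hex_key)
    (year,) = struct.unpack(">I", raw[:4])
    return year, raw[4:].decode("utf-8")


header_hex = (
    "000007e3776f726b696e672d636c617373"
    "203c554e4b3e203c554e4b3e203c554e4b3e203c554e4b3e"
)
print(decode_pivoted_key(header_hex))
# (2019, 'working-class <UNK> <UNK> <UNK> <UNK>')
```

A big-endian year prefix means that byte-wise key order is also chronological order, which is consistent with the 1470 records appearing first in `db_head`.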



## `db_peek_prefix`: Records matching a prefix

In [11]:
pivoted_db = f'{db_path_stub}{release}/{language}/{size}gram_files/{size}grams_pivoted.db'

db_peek_prefix(pivoted_db, prefix="[1900] <UNK> lower-class", n=5)

5 key-value pairs with prefix 0000076c3c554e4b3e206c6f7765722d636c617373:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1900] <UNK> lower-class <UNK> <UNK>
     Value: 1,323 occurrences in 1,286 documents

[ 2] Key:   [1900] <UNK> lower-class <UNK> able
     Value: 7 occurrences in 7 documents

[ 3] Key:   [1900] <UNK> lower-class <UNK> although
     Value: 1 occurrences in 1 documents

[ 4] Key:   [1900] <UNK> lower-class <UNK> case
     Value: 14 occurrences in 14 documents

[ 5] Key:   [1900] <UNK> lower-class <UNK> caste
     Value: 2 occurrences in 2 documents
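The difference between "starting from a key" and "matching a prefix" can be sketched on a sorted list of keys. This is an assumption about the general behavior of ordered key-value stores, not a description of the `db_peek` or `db_peek_prefix` internals. The `peek` and `peek_prefix` functions are hypothetical stand-ins, and the keys are display strings copied from the outputs above.

```python
import bisect

# A handful of keys taken from the outputs in this notebook, kept sorted.
keys = sorted([
    "[1900] <UNK> lower-class <UNK> <UNK>",
    "[1900] <UNK> lower-class <UNK> able",
    "[1900] <UNK> lower-class <UNK> case",
    "[2019] working-class <UNK> <UNK> ability",
    "[2019] working-class <UNK> <UNK> able",
])


def peek(sorted_keys, start_key, n):
    """Return up to n keys at or after start_key, regardless of prefix."""
    i = bisect.bisect_left(sorted_keys, start_key)
    return sorted_keys[i:i + n]


def peek_prefix(sorted_keys, prefix, n):
    """Return up to n keys that begin with prefix, stopping at the first miss."""
    i = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[i:]:
        if len(matches) == n or not key.startswith(prefix):
            break
        matches.append(key)
    return matches


print(peek(keys, "[1900] <UNK> lower-class <UNK> case", n=3))
print(peek_prefix(keys, "[1900] <UNK> lower-class", n=5))
```

In this sketch, `peek` runs on into the 2019 keys once the 1900 keys are exhausted, while `peek_prefix` returns only the three matching 1900 keys even though five were requested.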

