# **Multigrams: Full Workflow**
The multigram workflow mirrors the unigram workflow, with two differences. First, instead of _creating_ a whitelist, you filter the multigram corpus _using_ the whitelist of top-_N_ unigrams from the unigram workflow. Second, the multigram workflow adds a **pivoting** step. Pivoting reorganizes the database so that each key is a year-ngram combination, which makes per-year queries easy. For instance, you can learn how many times the word "nuclear" appeared in 2011 by querying the key `[2011] nuclear`. This is useful for analyzing changes in word meanings over time.
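
To make the pivoting idea concrete, here is a minimal sketch using plain dictionaries. It assumes an unpivoted layout in which each ngram maps to a list of `(year, occurrences, documents)` tuples; that layout, the `pivot` helper, and the toy counts are all hypothetical and are not taken from `ngramkit`.

```python
# Toy, made-up data: one entry per ngram with its yearly statistics.
# This layout is an assumption for illustration, not ngramkit's storage format.
unpivoted = {
    "nuclear": [(2010, 120, 45), (2011, 150, 60)],
    "atomic": [(2011, 30, 12)],
}

def pivot(records):
    """Re-key ngram -> yearly stats as (year, ngram) -> (occurrences, documents)."""
    pivoted = {}
    for ngram, yearly in records.items():
        for year, occurrences, documents in yearly:
            pivoted[(year, ngram)] = (occurrences, documents)
    return pivoted

pivoted = pivot(unpivoted)

# A single lookup now answers "how often did this ngram appear in this year?"
print(pivoted[(2011, "nuclear")])  # (150, 60)
```

After pivoting, one year-ngram lookup replaces a scan through an ngram's full yearly history.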

## **Setup**
### Imports

In [1]:
%load_ext autoreload
%autoreload 2

from ngramkit.ngram_acquire import download_and_ingest_to_rocksdb
from ngramkit.ngram_filter.pipeline.orchestrator import build_processed_db
from ngramkit.ngram_pivot.pipeline import build_pivoted_db
from ngramkit.utilities.peek import db_head, db_peek, db_peek_prefix

### Configure

In [2]:
db_path_stub = '/scratch/edk202/NLP_corpora/Google_Books/'
archive_path_stub = '/scratch/edk202/NLP_archive/Google_Books/'
release = '20200217'
language = 'eng-fiction'
size = 5

## **Step 1: Download and Ingest**

In [3]:
download_and_ingest_to_rocksdb(
    ngram_size=size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    archive_path_stub=None,
    ngram_type="tagged",
    random_seed=67,
    overwrite_db=False,
    workers=90,
    write_batch_size=5_000,
    open_type="write:packed24",
    compact_after_ingest=False
)

N-GRAM ACQUISITION PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-16 12:59:55

Download Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Ngram repo:           ...//books.storage.googleapis.com/?prefix=ngrams/books/20200217/eng-fiction/5-
DB path:              .../edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/5grams.db
File range:           0 to 1448
Total files:          1449
Files to get:         1449
Skipping:             0
Download workers:     90
Batch size:           5,000
Ngram size:           5
Ngram type:           tagged
Overwrite DB:         False
DB Profile:           write:packed24

Download Progress
════════════════════════════════════════════════════════════════════════════════════════════════════


Files Processed: 100%|█████████████████████████████████████████████████████| 1449/1449 [16:21<00:00]



Processing complete!

Final Summary
════════════════════════════════════════════════════════════════════════════════════════════════════
Fully processed files:       1449
Failed files:                0
Total entries written:       203,550,564
Write batches flushed:       589
Uncompressed data processed: 2.42 TB
Processing throughput:       2549.47 MB/sec

End Time: 2025-11-16 13:16:28.929272
Total Runtime: 0:16:33.886320
Time per file: 0:00:00.685912
Files per hour: 5248.5


## **Step 2: Filter and Normalize**
`config.py` contains generic defaults for the filtering pipeline. You can override them by passing option dictionaries to the `build_processed_db` function, as seen below. Here, we filter the multigram corpus using the whitelist from the unigram workflow. Without a whitelist, we could instead normalize, filter, and lemmatize each token just as we did for the unigrams.
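
As a rough illustration of whitelist filtering, the sketch below replaces every token that is not in the whitelist with `<UNK>`. That behavior is inferred from the `<UNK>` tokens visible in the database keys printed later in this notebook, not from the `ngramkit` source, and the real pipeline may do more. The `apply_whitelist` helper, the toy whitelist, and the input phrase are hypothetical.

```python
UNK = "<UNK>"

def apply_whitelist(ngram, whitelist):
    """Replace each space-separated token that is not whitelisted with <UNK>."""
    return " ".join(tok if tok in whitelist else UNK for tok in ngram.split(" "))

# Toy whitelist for illustration; the real one is read from whitelist.txt.
whitelist = {"attack", "world", "trade"}

print(apply_whitelist("attack on the world trade", whitelist))
# attack <UNK> <UNK> world trade
```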

In [4]:
whitelist_options = {
    'whitelist_path': f'{db_path_stub}{release}/{language}/1gram_files/1grams_processed.db/whitelist.txt',
    'output_whitelist_path': None
}

build_processed_db(
    mode="restart",
    ngram_size=size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=40,
    num_initial_work_units=600,
    cache_partitions=True,
    use_cached_partitions=True,
    progress_every_s=10.0,
    **whitelist_options
)


N-GRAM FILTER PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-16 13:52:03
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            .../edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/5grams.db
Target DB:            ...P_corpora/Google_Books/20200217/eng-fiction/5gram_files/5grams_processed.db
Temp directory:       ...02/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/processing_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              40
Initial work units:   600

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24

Ingestion Configuration
─────────────────────

Shards Ingested: 100%|███████████████████████████████████████████████████████| 600/600 [05:27<00:00]



Ingestion complete: 600 shards, 48,052,657 items in 328.0s (146,522 items/s)

Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

Post-Ingestion Compaction
────────────────────────────────────────────────────────────────────────────────────────────────────
Initial DB size:         34.48 GB
Compaction completed in 0:00:57
Size before:             34.48 GB
Size after:              27.71 GB
Space saved:             6.77 GB (19.6%)

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ PROCESSING COMPLETE                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Items: 33,699,321 (estimated)                                                                    │
│ Size: 27.71 GB                                                                

'/scratch/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/5grams_processed.db'

## **Step 3: Pivot to Yearly Indices**

In [5]:
build_pivoted_db(
    mode="restart",
    ngram_size=size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=40,
    num_initial_work_units=600,
    cache_partitions=True,
    use_cached_partitions=True,
    progress_every_s=10.0,
);

2025-11-16 14:04:32 INFO     root: Logging initialized: /scratch/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/pivoting_tmp/logs/pivot_pipeline_20251116_140432.log
2025-11-16 14:04:32 INFO     ngramkit.ngram_pivot.pipeline.orchestrator: Pivot Pipeline
2025-11-16 14:04:32 INFO     ngramkit.ngram_pivot.pipeline.orchestrator: Started: 2025-11-16 14:04:32
2025-11-16 14:04:33 INFO     ngramkit.ngram_pivot.pipeline.orchestrator: Source DB: /scratch/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/5grams_processed.db
2025-11-16 14:04:33 INFO     ngramkit.ngram_pivot.pipeline.orchestrator: Destination DB: /scratch/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/5grams_pivoted.db
2025-11-16 14:04:33 INFO     ngramkit.ngram_pivot.pipeline.orchestrator: Log file: /scratch/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/pivoting_tmp/logs/pivot_pipeline_20251116_140432.log



PARALLEL N-GRAM DATABASE PIVOT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-16 14:04:33
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            ...P_corpora/Google_Books/20200217/eng-fiction/5gram_files/5grams_processed.db
Target DB:            ...NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/5grams_pivoted.db
Temp directory:       ...k202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/pivoting_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              40
Initial work units:   600

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24
Ingest profile:       write:packed24



SST Files Ingested: 100%|████████████████████████████████████████████████████| 600/600 [00:28<00:00]



Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════


2025-11-16 14:08:51 INFO     ngramkit.ngram_pivot.pipeline.orchestrator: Starting post-ingestion compaction



Post-Ingestion Compaction
────────────────────────────────────────────────────────────────────────────────────────────────────
Initial DB size:         16.96 GB
Compaction completed in 0:02:52
Size before:             16.96 GB
Size after:              43.42 GB
Space saved:             -26.46 GB (-156.0%)

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ PROCESSING COMPLETE                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Items: 1,212,469,973 (estimated)                                                                 │
│ Size: 43.42 GB                                                                                   │
│ Database: ...dk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/5grams_pivoted.db   │
└─────────────────────────────────────────────────────────────────────────────────────

# **Inspect Final Database**
Here are three functions you can use to inspect the final database.

## `db_head`: First N records

In [10]:
pivoted_db = f'{db_path_stub}{release}/{language}/{size}gram_files/{size}grams_pivoted.db'

db_head(pivoted_db, n=5)

First 5 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1578] <UNK> <UNK> <UNK> <UNK> air
     Value: 1 occurrences in 1 documents

[ 2] Key:   [1578] <UNK> <UNK> <UNK> <UNK> answer
     Value: 1 occurrences in 1 documents

[ 3] Key:   [1578] <UNK> <UNK> <UNK> <UNK> fleet
     Value: 2 occurrences in 1 documents

[ 4] Key:   [1578] <UNK> <UNK> <UNK> <UNK> man
     Value: 1 occurrences in 1 documents

[ 5] Key:   [1578] <UNK> <UNK> <UNK> <UNK> sake
     Value: 1 occurrences in 1 documents



## `db_peek`: Records starting from a key

In [12]:
pivoted_db = f'{db_path_stub}{release}/{language}/{size}gram_files/{size}grams_pivoted.db'

db_peek(pivoted_db, start_key="[2002] attack <UNK> <UNK> world trade", n=5)

5 key-value pairs starting from 000007d261747461636b203c554e4b3e203c554e4b3e20776f726c64207472616465:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [2002] attack <UNK> <UNK> world trade
     Value: 152 occurrences in 114 documents

[ 2] Key:   [2002] attack <UNK> <UNK> year ago
     Value: 1 occurrences in 1 documents

[ 3] Key:   [2002] attack <UNK> <UNK> yesterday evening
     Value: 3 occurrences in 3 documents

[ 4] Key:   [2002] attack <UNK> <UNK> yet <UNK>
     Value: 4 occurrences in 4 documents

[ 5] Key:   [2002] attack <UNK> <UNK> young age
     Value: 2 occurrences in 2 documents
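
The hex string in the output header above suggests how the displayed key `[2002] attack <UNK> <UNK> world trade` is stored. The first four bytes, `000007d2`, are 2002 as a big-endian integer, and the remaining bytes are the UTF-8 text of the ngram. The sketch below reproduces that layout. It is inferred from the printed hex only, so treat the helper names `encode_pivot_key` and `decode_pivot_key` as hypothetical and the unsigned integer type as an assumption.

```python
import struct

def encode_pivot_key(year, ngram):
    """Assumed layout: 4-byte big-endian year followed by the UTF-8 ngram."""
    return struct.pack(">I", year) + ngram.encode("utf-8")

def decode_pivot_key(key):
    """Split a raw key back into (year, ngram) under the same assumption."""
    (year,) = struct.unpack(">I", key[:4])
    return year, key[4:].decode("utf-8")

key = encode_pivot_key(2002, "attack <UNK> <UNK> world trade")
print(key.hex())
# 000007d261747461636b203c554e4b3e203c554e4b3e20776f726c64207472616465

print(decode_pivot_key(key))
# (2002, 'attack <UNK> <UNK> world trade')
```

With a fixed-width big-endian year, byte-wise key ordering sorts records by year first and then by ngram, which is what makes a range scan over one year possible.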



## `db_peek_prefix`: Records matching a prefix

In [13]:
pivoted_db = f'{db_path_stub}{release}/{language}/{size}gram_files/{size}grams_pivoted.db'

db_peek_prefix(pivoted_db, prefix="[2011] poor <UNK> happy", n=5)

5 key-value pairs with prefix 000007db706f6f72203c554e4b3e206861707079:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [2011] poor <UNK> happy <UNK> <UNK>
     Value: 4 occurrences in 4 documents

[ 2] Key:   [2011] poor <UNK> happy <UNK> rich
     Value: 4 occurrences in 4 documents

[ 3] Key:   [2011] poor <UNK> happy <UNK> sad
     Value: 2 occurrences in 2 documents

[ 4] Key:   [2011] poor <UNK> happy <UNK> unhappy
     Value: 1 occurrences in 1 documents
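
The difference between `db_peek` and `db_peek_prefix` can be sketched over a sorted list of keys. A start-key scan returns whatever follows the start position. A prefix scan stops as soon as a key no longer matches, which is why the query above returned four records even though `n=5`. This is a conceptual sketch only: the `peek` and `peek_prefix` helpers are hypothetical, they work on display-form strings rather than raw bytes, and the last two keys in the list are made up.

```python
import bisect

# Sorted toy keys; the first two appear in the output above, the rest are made up.
keys = sorted([
    "[2011] poor <UNK> happy <UNK> rich",
    "[2011] poor <UNK> happy <UNK> sad",
    "[2011] poor <UNK> man <UNK> <UNK>",
    "[2011] rich <UNK> <UNK> <UNK> man",
])

def peek(sorted_keys, start_key, n):
    """Return up to n keys at or after start_key, regardless of prefix."""
    i = bisect.bisect_left(sorted_keys, start_key)
    return sorted_keys[i:i + n]

def peek_prefix(sorted_keys, prefix, n):
    """Return up to n keys that start with prefix, stopping at the first mismatch."""
    i = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for k in sorted_keys[i:]:
        if len(matches) == n or not k.startswith(prefix):
            break
        matches.append(k)
    return matches

print(len(peek(keys, "[2011] poor <UNK> happy", 3)))         # 3
print(len(peek_prefix(keys, "[2011] poor <UNK> happy", 3)))  # 2
```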

