# **Multigrams: Full Workflow**
The multigram workflow mirrors the unigram workflow, with two differences. First, instead of _creating_ a whitelist, you filter the multigram corpus _using_ a whitelist containing the top-_N_ unigrams. Second, the multigram workflow adds a **pivoting** step. Pivoting reorganizes the database so that each key combines a year with an ngram, which makes it easy to look up a given ngram in a given year. For instance, you can learn how many times the word "nuclear" appeared in 2011 by querying the key `[2011] nuclear`. This is useful for analyzing changes in word meanings over time.
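
To make the pivot concrete, here is a minimal sketch using plain Python dictionaries. It assumes the unpivoted database holds one record per ngram with a list of per-year counts; that layout, the `pivot` function, and the counts themselves are illustrative assumptions, not the `ngramprep` implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical unpivoted layout: ngram -> [(year, occurrences, documents), ...]
YearRecord = Tuple[int, int, int]


def pivot(unpivoted: Dict[str, List[YearRecord]]) -> Dict[Tuple[int, str], Tuple[int, int]]:
    """Re-key records from ngram -> per-year list to (year, ngram) -> counts."""
    pivoted = {}
    for ngram, records in unpivoted.items():
        for year, occurrences, documents in records:
            pivoted[(year, ngram)] = (occurrences, documents)
    return pivoted


# Made-up counts, for illustration only.
unpivoted = {
    "nuclear": [(2010, 120, 80), (2011, 150, 95)],
    "solar": [(2011, 60, 40)],
}
pivoted = pivot(unpivoted)

# A single lookup now answers "how often did this ngram appear in this year?"
occurrences, documents = pivoted[(2011, "nuclear")]
print(f"[2011] nuclear: {occurrences} occurrences in {documents} documents")
```

One record per ngram becomes one record per year-ngram pair, so a pivoted database generally holds many more (and smaller) entries than its source.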

## **Setup**
### Imports

In [1]:
%load_ext autoreload
%autoreload 2

from ngramprep.ngram_acquire import download_and_ingest_to_rocksdb
from ngramprep.ngram_filter.pipeline.orchestrator import build_processed_db
from ngramprep.ngram_pivot.pipeline import build_pivoted_db
from ngramprep.utilities.peek import db_head, db_peek, db_peek_prefix


### Configure
Here we set basic parameters: the corpus to download, the size of the ngrams to download, and the size of the year bins.

In [2]:
db_path_stub = '/scratch/edk202/NLP_corpora/Google_Books/'
archive_path_stub = None
release = '20200217'
language = 'eng-us'
ngram_size = 5
bin_size = 1
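
The pipeline output later in this notebook prints database paths that appear to be derived from these parameters. The sketch below reproduces that apparent naming convention. The helper `corpus_db_path` is hypothetical, and the convention is inferred from the printed paths rather than taken from the library source.

```python
from pathlib import PurePosixPath


def corpus_db_path(db_path_stub, release, language, ngram_size, suffix=""):
    """Build <stub>/<release>/<language>/<n>gram_files/<n>grams<suffix>.db."""
    return str(
        PurePosixPath(db_path_stub)
        / release
        / language
        / f"{ngram_size}gram_files"
        / f"{ngram_size}grams{suffix}.db"
    )


stub = "/scratch/edk202/NLP_corpora/Google_Books/"
raw_path = corpus_db_path(stub, "20200217", "eng-us", 5)
processed_path = corpus_db_path(stub, "20200217", "eng-us", 5, suffix="_processed")
pivoted_path = corpus_db_path(stub, "20200217", "eng-us", 5, suffix="_pivoted")

print(raw_path)
print(processed_path)
print(pivoted_path)
```

Building paths with `PurePosixPath` avoids doubled or missing slashes regardless of whether the stub ends in `/`.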

## **Step 1: Download and Ingest**

In [4]:
combined_bigrams_download = {
    "working class", "working classes",
    "middle class", "middle classes",
    "lower class", "lower classes",
    "upper class", "upper classes",
    "human being", "human beings"
}

download_and_ingest_to_rocksdb(
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    archive_path_stub=archive_path_stub,
    ngram_type="tagged",
    random_seed=89,
    overwrite_db=False,
    workers=100,
    write_batch_size=5_000,
    open_type="write:packed24",
    compact_after_ingest=False,
    combined_bigrams=combined_bigrams_download
)

N-GRAM ACQUISITION PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2026-01-03 19:17:22

Download Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Ngram repo:           https://books.storage.googleapis.com/?prefix=ngrams/books/20200217/eng-us/5-
DB path:              /scratch/edk202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams.db
File range:           0 to 11144
Total files:          11145
Files to get:         0
Skipping:             11145
Download workers:     100
Batch size:           5,000
Ngram size:           5
Ngram type:           tagged
Overwrite DB:         False
DB Profile:           write:packed24

Download Progress
════════════════════════════════════════════════════════════════════════════════════════════════════


Files Processed:   0%|                                                               | 0/0 [00:00<?]



Processing complete!

Final Summary
════════════════════════════════════════════════════════════════════════════════════════════════════
Fully processed files:       0
Failed files:                0
Total entries written:       0
Write batches flushed:       0
Uncompressed data processed: 0.00 B
Processing throughput:       0.00 MB/sec

End Time: 2026-01-03 19:18:03.544998
Total Runtime: 0:00:40.683806
Time per file: 0:00:00
Files per hour: 0.0
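
The `combined_bigrams` argument lists two-word phrases, and later steps in this notebook refer to hyphenated single tokens such as `working-class`. The sketch below shows one way such merging could work on a list of tokens. The function `combine_bigrams` is hypothetical, it ignores the part-of-speech tags present in tagged ngrams, and it is not the library's implementation.

```python
def combine_bigrams(tokens, bigrams):
    """Merge adjacent token pairs found in `bigrams` into one hyphenated token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if i + 1 < len(tokens) and pair in bigrams:
            merged.append(pair.replace(" ", "-"))
            i += 2  # consume both halves of the bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged


bigrams = {"working class", "middle class", "human beings"}
tokens = ["the", "working", "class", "of", "england"]
print(combine_bigrams(tokens, bigrams))
```

Under this scheme a 5-gram that contains one listed bigram ends up with four tokens.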


## **Step 2: Filter and Normalize**
`config.py` contains generic defaults for the filtering pipeline. You can override these defaults by passing option dictionaries to the `build_processed_db` function, as seen below. As implemented here, we use the whitelist from the unigram workflow to filter the multigram corpus. If we weren't using a whitelist, we could normalize, filter, and lemmatize each token just as we did for the unigrams.

In [3]:
filter_options = {
    'bin_size': bin_size
}

whitelist_options = {
    'whitelist_path': f'{db_path_stub}{release}/{language}/1gram_files/1grams_processed.db/whitelist.txt',
    'output_whitelist_path': None
}

always_include_tokens = {
    "working-class", "working-classes",
    "middle-class", "middle-classes",
    "lower-class", "lower-classes",
    "upper-class", "upper-classes",
    "human-being", "human-beings"
}

build_processed_db(
    mode="restart",
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=36,
    num_initial_work_units=600,
    work_unit_claim_order="random",
    cache_partitions=True,
    use_cached_partitions=True,
    progress_every_s=30.0,
    always_include=always_include_tokens,
    compact_after_ingest=False,
    **filter_options,
    **whitelist_options
);


N-GRAM FILTER PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2026-01-03 23:13:41
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            /scratch/edk202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams.db
Target DB:            ...02/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams_processed.db
Temp directory:       .../edk202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/processing_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              36
Initial work units:   600

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24

Ingestion Configuration
─────────────────────

Shards Ingested: 100%|███████████████████████████████████████████████████████| 600/600 [43:25<00:00]



Ingestion complete: 600 shards, 309,909,747 items in 2605.8s (118,932 items/s)

Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ PROCESSING COMPLETE                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Items: 267,449,339 (estimated)                                                                   │
│ Size: 278.94 GB                                                                                  │
│ Database: ...h/edk202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams_processed.db   │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
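
As a rough illustration of whitelist filtering, the sketch below masks every token that is neither whitelisted nor explicitly always-included. The `<UNK>` placeholder matches the keys shown in the inspection output at the end of this notebook; the function `apply_whitelist` and the tiny whitelist are hypothetical, and the real pipeline may apply further rules that are not shown here.

```python
UNK = "<UNK>"


def apply_whitelist(tokens, whitelist, always_include=frozenset()):
    """Keep whitelisted or always-included tokens; mask everything else."""
    allowed = set(whitelist) | set(always_include)
    return [tok if tok in allowed else UNK for tok in tokens]


whitelist = {"able", "ability", "also"}  # stand-in for the contents of whitelist.txt
always_include = {"working-class", "human-being"}

tokens = ["the", "working-class", "is", "also", "able"]
print(" ".join(apply_whitelist(tokens, whitelist, always_include)))
```

Passing the hyphenated forms through `always_include` keeps them even when they are absent from the unigram whitelist.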


## **Step 3: Pivot to Yearly Indices**

In [4]:
build_pivoted_db(
    mode="restart",
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=30,
    num_initial_work_units=600,
    cache_partitions=True,
    use_cached_partitions=True,
    compact_after_ingest=False,
    progress_every_s=10.0,
);


PARALLEL N-GRAM DATABASE PIVOT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2026-01-04 00:08:42
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            ...02/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams_processed.db
Target DB:            ...k202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams_pivoted.db
Temp directory:       ...ch/edk202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/pivoting_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              30
Initial work units:   600

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24
Ingest profile:       write:packed24



SST Files Ingested: 100%|████████████████████████████████████████████████████| 600/600 [00:37<00:00]



Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ PROCESSING COMPLETE                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Items: 8,029,663,534 (estimated)                                                                 │
│ Size: 117.21 GB                                                                                  │
│ Database: ...tch/edk202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams_pivoted.db   │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘


# **Inspect Final Database**
Here are three functions you can use to inspect the final database.

## `db_head`: First N records

In [5]:
pivoted_db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_pivoted.db'

db_head(pivoted_db, n=5)

First 5 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1472] <UNK> <UNK> <UNK> <UNK> absence
     Value: 5 occurrences in 3 documents

[ 2] Key:   [1472] <UNK> <UNK> <UNK> <UNK> accordance
     Value: 1 occurrences in 1 documents

[ 3] Key:   [1472] <UNK> <UNK> <UNK> <UNK> accordingly
     Value: 1 occurrences in 1 documents

[ 4] Key:   [1472] <UNK> <UNK> <UNK> <UNK> accrue
     Value: 1 occurrences in 1 documents

[ 5] Key:   [1472] <UNK> <UNK> <UNK> <UNK> acre
     Value: 2 occurrences in 2 documents



## `db_peek`: Records starting from a key

In [6]:
pivoted_db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_pivoted.db'

db_peek(pivoted_db, start_key="[2019] working-class <UNK> <UNK> <UNK> <UNK>", n=5)

5 key-value pairs starting from 000007e3776f726b696e672d636c617373203c554e4b3e203c554e4b3e203c554e4b3e203c554e4b3e:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [2019] working-class <UNK> <UNK> ability
     Value: 1 occurrences in 1 documents

[ 2] Key:   [2019] working-class <UNK> <UNK> able
     Value: 4 occurrences in 4 documents

[ 3] Key:   [2019] working-class <UNK> <UNK> aim
     Value: 1 occurrences in 1 documents

[ 4] Key:   [2019] working-class <UNK> <UNK> also
     Value: 16 occurrences in 16 documents

[ 5] Key:   [2019] working-class <UNK> <UNK> although
     Value: 2 occurrences in 2 documents
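
The hex string in the header above is the raw start key. Judging from the output alone, it looks like a 4-byte big-endian year followed by the UTF-8 ngram text. The sketch below decodes it under that assumption; `decode_pivoted_key` and `encode_pivoted_key` are hypothetical helpers, not part of `ngramprep`.

```python
import struct


def decode_pivoted_key(hex_key):
    """Split a hex-encoded key into (year, ngram text)."""
    raw = bytes.fromhex(hex_key)
    (year,) = struct.unpack(">I", raw[:4])  # assumed: 4-byte big-endian year
    return year, raw[4:].decode("utf-8")


def encode_pivoted_key(year, ngram):
    """Inverse of decode_pivoted_key, under the same assumed layout."""
    return (struct.pack(">I", year) + ngram.encode("utf-8")).hex()


hex_key = (
    "000007e3776f726b696e672d636c617373"
    "203c554e4b3e203c554e4b3e203c554e4b3e203c554e4b3e"
)
print(decode_pivoted_key(hex_key))
```

If the layout is as assumed, a big-endian year prefix makes byte-wise key order match chronological order, which would explain why `db_head` starts with the earliest year.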



## `db_peek_prefix`: Records matching a prefix

In [7]:
pivoted_db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_pivoted.db'

db_peek_prefix(pivoted_db, prefix="[2006] <UNK> working-class", n=5)

5 key-value pairs with prefix 000007d63c554e4b3e20776f726b696e672d636c617373:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [2006] <UNK> working-class <UNK> <UNK>
     Value: 11,174 occurrences in 10,238 documents

[ 2] Key:   [2006] <UNK> working-class <UNK> able
     Value: 4 occurrences in 4 documents

[ 3] Key:   [2006] <UNK> working-class <UNK> achieve
     Value: 3 occurrences in 2 documents

[ 4] Key:   [2006] <UNK> working-class <UNK> action
     Value: 2 occurrences in 2 documents

[ 5] Key:   [2006] <UNK> working-class <UNK> actually
     Value: 14 occurrences in 14 documents
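
A prefix query like the one above can be understood as a range scan over sorted keys: seek to the first key at or after the prefix, then read forward until a key no longer starts with it. The sketch below imitates this with an in-memory sorted list and `bisect`. It stands in for the database iterator, assumes keys are a 4-byte big-endian year followed by UTF-8 text, and uses made-up keys; `make_key` and `prefix_scan` are hypothetical names.

```python
import bisect
import struct


def make_key(year, ngram):
    # Assumed layout: 4-byte big-endian year followed by UTF-8 ngram text.
    return struct.pack(">I", year) + ngram.encode("utf-8")


def prefix_scan(sorted_keys, prefix, n=5):
    """Return up to n keys that start with prefix, in sorted order."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix) or len(matches) == n:
            break
        matches.append(key)
    return matches


keys = sorted(
    make_key(year, ngram)
    for year, ngram in [
        (2005, "<UNK> working-class <UNK> able"),
        (2006, "<UNK> middle-class <UNK> able"),
        (2006, "<UNK> working-class <UNK> <UNK>"),
        (2006, "<UNK> working-class <UNK> able"),
        (2007, "<UNK> working-class <UNK> able"),
    ]
)

for key in prefix_scan(keys, make_key(2006, "<UNK> working-class")):
    year = struct.unpack(">I", key[:4])[0]
    print(f"[{year}] {key[4:].decode('utf-8')}")
```

Because the year leads the key, a prefix that includes the year confines the scan to that year's entries.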

