# **Multigrams: Full Workflow**
The multigram workflow mirrors the unigram workflow, but with two differences. First, instead of _creating_ a whitelist, you filter the multigram corpus _using_ a whitelist containing the top-_N_ unigrams. Second, the multigram workflow adds a **pivoting** step. Pivoting reorganizes the database so that it's easy to query year-ngram combinations. For instance, you can learn how many times the word "nuclear" appeared in 2011 by querying the key `[2011] nuclear`. This is useful for analyzing changes in word meanings over time.
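To make the pivot concrete, here is a minimal in-memory sketch. It assumes the unpivoted database maps each ngram to a list of `(year, occurrences, documents)` records. That layout, the function names `pivot` and `format_key`, and the toy counts are all illustrative assumptions, not the `ngramprep` implementation.

```python
def format_key(year: int, ngram: str) -> str:
    """Render a pivoted key in the '[year] ngram' form used in this notebook."""
    return f"[{year}] {ngram}"


def pivot(unpivoted: dict) -> dict:
    """Turn ngram -> [(year, occurrences, documents), ...] into
    '[year] ngram' -> (occurrences, documents)."""
    pivoted = {}
    for ngram, records in unpivoted.items():
        for year, occurrences, documents in records:
            pivoted[format_key(year, ngram)] = (occurrences, documents)
    return pivoted


# Toy input with made-up counts (not real corpus statistics).
unpivoted = {
    "nuclear": [(2010, 7, 3), (2011, 12, 5)],
    "energy": [(2011, 4, 2)],
}

pivoted = pivot(unpivoted)
print(pivoted["[2011] nuclear"])  # one lookup per year-ngram combination
```

After pivoting, all records for a given year sit next to each other when the keys are sorted, which is what makes per-year queries cheap.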

## **Setup**
### Imports

In [1]:
%load_ext autoreload
%autoreload 2

from ngramprep.ngram_acquire import download_and_ingest_to_rocksdb
from ngramprep.ngram_filter.pipeline.orchestrator import build_processed_db
from ngramprep.ngram_pivot.pipeline import build_pivoted_db
from ngramprep.utilities.peek import db_head, db_peek, db_peek_prefix

### Configure
Here we set basic parameters: the corpus to download, the size of the ngrams to download, and the size of the year bins.

In [2]:
db_path_stub = '/scratch/edk202/NLP_corpora/Google_Books/'
archive_path_stub = None
release = '20200217'
language = 'eng-us'
ngram_size = 5
bin_size = 1
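
The database locations used later in this notebook follow from these parameters. The helper below is a hypothetical convenience (it is not part of `ngramprep`) that reproduces the path layout shown in the pipeline logs. It uses `os.path.join`, so a trailing slash on `db_path_stub` does not produce a doubled separator.

```python
import os


def ngram_db_path(db_path_stub, release, language, ngram_size, suffix=""):
    """Build <stub>/<release>/<language>/<n>gram_files/<n>grams<suffix>.db.

    Hypothetical helper for illustration; the layout is inferred from the
    'Source DB' and 'Target DB' lines in the pipeline output.
    """
    return os.path.join(
        db_path_stub,
        release,
        language,
        f"{ngram_size}gram_files",
        f"{ngram_size}grams{suffix}.db",
    )


stub = "/scratch/edk202/NLP_corpora/Google_Books/"
raw_db = ngram_db_path(stub, "20200217", "eng-us", 5)
pivoted_db_example = ngram_db_path(stub, "20200217", "eng-us", 5, "_pivoted")
print(raw_db)
print(pivoted_db_example)
```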

## **Step 1: Download and Ingest**

Passing a set of compound terms through the `combined_bigrams` argument converts each one into a single, hyphenated token. Matching is case-sensitive, so we list every common capitalization pattern.
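
As a rough illustration of the idea, the sketch below merges adjacent tokens whose space-joined form appears in a set of bigrams. The function `combine_bigrams` is hypothetical. How `ngramprep` performs this step internally is not shown in this notebook, so treat this as an assumption about the behavior rather than a description of it.

```python
def combine_bigrams(tokens, combined_bigrams):
    """Merge adjacent token pairs listed in combined_bigrams into one
    hyphenated token. Matching is exact, and therefore case-sensitive."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = f"{tokens[i]} {tokens[i + 1]}" if i + 1 < len(tokens) else None
        if pair in combined_bigrams:
            merged.append(f"{tokens[i]}-{tokens[i + 1]}")
            i += 2  # skip the second half of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


bigrams = {"working class", "Working class"}
matched = combine_bigrams(["the", "working", "class", "movement"], bigrams)
unmatched = combine_bigrams(["the", "WORKING", "CLASS", "movement"], bigrams)
print(matched)
print(unmatched)  # unchanged: this capitalization is not in the set
```

The second call shows why each capitalization pattern has to be listed explicitly: a variant missing from the set is left as two separate tokens.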

In [None]:
# Each line needs a trailing comma; without it, Python silently concatenates
# adjacent string literals across lines (e.g. "Working Classesmiddle class").
combined_bigrams_download = {
    "working class", "Working class", "Working Class", "working classes", "Working classes", "Working Classes",
    "middle class", "Middle class", "Middle Class", "middle classes", "Middle classes", "Middle Classes",
    "lower class", "Lower class", "Lower Class", "lower classes", "Lower classes", "Lower Classes",
    "upper class", "Upper class", "Upper Class", "upper classes", "Upper classes", "Upper Classes",
    "human being", "Human being", "Human Being", "human beings", "Human beings", "Human Beings",
}

download_and_ingest_to_rocksdb(
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    archive_path_stub=archive_path_stub,
    ngram_type="tagged",
    random_seed=289,
    overwrite_db=True,
    workers=128,
    write_batch_size=5_000,
    open_type="write:packed24",
    compact_after_ingest=False,
    combined_bigrams=combined_bigrams_download
)


## **Step 2: Filter and Normalize**
`config.py` contains generic defaults for the filtering pipeline. You can override these defaults by passing option dictionaries to the `build_processed_db` function, as seen below. As implemented here, we use the whitelist from the unigram workflow to filter the multigram corpus. If we weren't using a whitelist, we could normalize, filter, and lemmatize each token just as we did for the unigrams.
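
The sketch below shows one way whitelist filtering could work on a single ngram. It assumes that tokens outside the whitelist are replaced with `<UNK>` (consistent with the keys in the inspection output later in this notebook) and that `always_include` tokens are treated as additions to the whitelist. Both points, and the function name `apply_whitelist`, are assumptions, not confirmed details of `build_processed_db`.

```python
def apply_whitelist(tokens, whitelist, always_include=frozenset(), unk="<UNK>"):
    """Replace every token that is not allowed with the unk placeholder.

    Assumption: allowed tokens are the union of the whitelist and the
    always_include set. Tokens are assumed to be normalized already.
    """
    allowed = set(whitelist) | set(always_include)
    return [tok if tok in allowed else unk for tok in tokens]


# Toy whitelist; the real one holds the top-N unigrams.
toy_whitelist = {"the", "absence", "of"}
filtered = apply_whitelist(
    ["the", "absence", "of", "xylograph", "geoffrey"],
    toy_whitelist,
    always_include={"geoffrey"},
)
print(filtered)
```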

In [3]:
filter_options = {
    'bin_size': bin_size
}

whitelist_options = {
    'whitelist_path': f'{db_path_stub}{release}/{language}/1gram_files/1grams_processed.db/whitelist.txt',
    'output_whitelist_path': None
}

always_include_tokens = {
    "brad", "brendan", "geoffrey", "greg", "brett", "jay", "matthew", "neil", "todd",
    "allison", "anne", "carrie", "emily", "jill", "laurie", "kristen", "meredith", "sarah",
    "darnell", "hakim", "jermaine", "kareem", "jamal", "leroy", "rasheed", "tremayne", "tyrone",
    "aisha", "ebony", "keisha", "kenya", "latonya", "lakisha", "latoya", "tamika", "tanisha",
    "joy", "love", "peace", "wonderful", "pleasure", "friend", "laughter", "happy",
    "agony", "terrible", "horrible", "nasty", "evil", "war", "awful", "failure"
}

build_processed_db(
    mode="resume",
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=36,
    num_initial_work_units=600,
    work_unit_claim_order="random",
    cache_partitions=True,
    use_cached_partitions=True,
    progress_every_s=30.0,
    always_include=always_include_tokens,
    compact_after_ingest=False,
    **filter_options,
    **whitelist_options
);


N-GRAM FILTER PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2026-01-05 08:58:43
Mode:       RESUME

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            /scratch/edk202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams.db
Target DB:            ...02/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams_processed.db
Temp directory:       .../edk202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/processing_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              36
Initial work units:   600

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24

Ingestion Configuration
──────────────────────

Loading whitelist...
Loaded 30,000 tokens

Phase 1: Creating work units...
════════════════════════════════════════════════════════════════════════════════════════════════════
Resuming existing work units
Resuming: 0 completed, 0 processing, 0 pending

Phase 2: Processing 0 work units with 36 workers...
════════════════════════════════════════════════════════════════════════════════════════════════════
No pending work units - skipping processing phase

Phase 3: Ingesting 0 shards with 4 parallel readers...
════════════════════════════════════════════════════════════════════════════════════════════════════
All shards already ingested - skipping ingestion phase

Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ PROCESSING COMPLETE                                                                           

## **Step 3: Pivot to Yearly Indices**

In [4]:
build_pivoted_db(
    mode="restart",
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=30,
    num_initial_work_units=600,
    cache_partitions=True,
    use_cached_partitions=True,
    compact_after_ingest=False,
    progress_every_s=10.0,
);


PARALLEL N-GRAM DATABASE PIVOT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2026-01-05 08:59:38
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            ...02/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams_processed.db
Target DB:            ...k202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams_pivoted.db
Temp directory:       ...ch/edk202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/pivoting_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              30
Initial work units:   600

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24
Ingest profile:       write:packed24



SST Files Ingested: 100%|████████████████████████████████████████████████████| 600/600 [00:35<00:00]



Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ PROCESSING COMPLETE                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Items: 8,028,458,847 (estimated)                                                                 │
│ Size: 117.19 GB                                                                                  │
│ Database: ...tch/edk202/NLP_corpora/Google_Books/20200217/eng-us/5gram_files/5grams_pivoted.db   │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘


# **Inspect Final Database**
Here are three functions you can use to inspect the final database.

## `db_head`: First N records

In [5]:
pivoted_db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_pivoted.db'

db_head(pivoted_db, n=5)

First 5 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1472] <UNK> <UNK> <UNK> <UNK> absence
     Value: 5 occurrences in 3 documents

[ 2] Key:   [1472] <UNK> <UNK> <UNK> <UNK> accordance
     Value: 1 occurrences in 1 documents

[ 3] Key:   [1472] <UNK> <UNK> <UNK> <UNK> accordingly
     Value: 1 occurrences in 1 documents

[ 4] Key:   [1472] <UNK> <UNK> <UNK> <UNK> accrue
     Value: 1 occurrences in 1 documents

[ 5] Key:   [1472] <UNK> <UNK> <UNK> <UNK> acre
     Value: 2 occurrences in 2 documents



## `db_peek`: Records starting from a key

In [6]:
pivoted_db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_pivoted.db'

db_peek(pivoted_db, start_key="[2019] Geoffrey <UNK> <UNK> <UNK> <UNK>", n=5)

5 key-value pairs starting from 000007e347656f6666726579203c554e4b3e203c554e4b3e203c554e4b3e203c554e4b3e:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [2019] ab <UNK> <UNK> <UNK> <UNK>
     Value: 1,029 occurrences in 793 documents

[ 2] Key:   [2019] ab <UNK> <UNK> <UNK> ab
     Value: 2 occurrences in 2 documents

[ 3] Key:   [2019] ab <UNK> <UNK> <UNK> ad
     Value: 37 occurrences in 27 documents

[ 4] Key:   [2019] ab <UNK> <UNK> <UNK> adequate
     Value: 1 occurrences in 1 documents

[ 5] Key:   [2019] ab <UNK> <UNK> <UNK> also
     Value: 2 occurrences in 2 documents



## `db_peek_prefix`: Records matching a prefix

In [7]:
pivoted_db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_pivoted.db'

db_peek_prefix(pivoted_db, prefix="[2006] <UNK> Geoffrey", n=5)

5 key-value pairs with prefix 000007d63c554e4b3e2047656f6666726579:
────────────────────────────────────────────────────────────────────────────────────────────────────
No keys found with prefix 000007d63c554e4b3e2047656f6666726579
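
The hex strings printed by `db_peek` and `db_peek_prefix` above suggest that each key is stored as a 4-byte big-endian year followed by the UTF-8 bytes of the ngram. The sketch below reproduces those hex strings under that assumption. The function `encode_key` is hypothetical and inferred from the output, not taken from the library.

```python
import struct


def encode_key(year: int, ngram: str) -> bytes:
    """Assumed key layout: 4-byte big-endian year + UTF-8 ngram text."""
    return struct.pack(">I", year) + ngram.encode("utf-8")


start = encode_key(2019, "Geoffrey <UNK> <UNK> <UNK> <UNK>")
prefix = encode_key(2006, "<UNK> Geoffrey")
print(start.hex())
print(prefix.hex())

# Byte-wise ordering: uppercase 'G' (0x47) sorts before lowercase 'a' (0x61).
first_hit = encode_key(2019, "ab <UNK> <UNK> <UNK> <UNK>")
lowercase = encode_key(2019, "geoffrey <UNK> <UNK> <UNK> <UNK>")
print(start < first_hit)
print(first_hit < lowercase)
```

If keys are compared byte by byte, a start key beginning with a capitalized `Geoffrey` sorts ahead of every key whose first token starts with a lowercase letter. That would explain why `db_peek` returned records beginning at `ab`. The tokens in the output are all lowercase, so the capitalized prefix in the `db_peek_prefix` call most likely matches nothing for the same reason. Querying with lowercase `geoffrey` is probably what was intended, though that has not been run here.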
