# **Multigrams: Full Workflow**
The multigram workflow mirrors the unigram workflow, with two differences. First, instead of _creating_ a whitelist, you filter the multigram corpus _using_ the whitelist of top-_N_ unigrams produced by the unigram workflow. Second, the multigram workflow adds a **pivoting** step. Pivoting reorganizes the database so that each key begins with a year, which makes year-ngram combinations easy to query. For instance, you can learn how many times the word "nuclear" appeared in 2011 by querying the key `[2011] nuclear`. This layout is useful for analyzing changes in word meanings over time.
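
To make the pivoting idea concrete, here is a minimal in-memory sketch. It assumes, purely for illustration, that each unpivoted record maps an ngram to a list of `(year, occurrences, documents)` tuples. The `pivot` function, the record layout, and the toy counts are all hypothetical and do not reflect how `ngramkit` stores or processes data internally.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout, for illustration only:
# ngram -> [(year, occurrences, documents), ...]
YearlyCounts = List[Tuple[int, int, int]]


def pivot(records: Dict[str, YearlyCounts]) -> Dict[str, Tuple[int, int]]:
    """Re-key ngram-indexed records as '[year] ngram' -> (occurrences, documents)."""
    pivoted = {}
    for ngram, yearly in records.items():
        for year, occurrences, documents in yearly:
            pivoted[f"[{year}] {ngram}"] = (occurrences, documents)
    # Sort so that all keys for the same year sit next to each other.
    return dict(sorted(pivoted.items()))


# Made-up counts, not taken from the corpus.
toy = {
    "nuclear": [(2010, 7, 5), (2011, 9, 6)],
    "bomb": [(2011, 3, 2)],
}
pivoted = pivot(toy)
print(pivoted["[2011] nuclear"])
```

After pivoting, a single key lookup answers a question about one ngram in one year, and keys for the same year are adjacent.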

## **Setup**
### Imports

In [1]:
%load_ext autoreload
%autoreload 2

from pathlib import Path
from stop_words import get_stop_words
from ngramkit.ngram_filter.lemmatizer import SpacyLemmatizer
from ngramkit.ngram_acquire import download_and_ingest_to_rocksdb
from ngramkit.ngram_filter.config import PipelineConfig as FilterPipelineConfig
from ngramkit.ngram_filter.config import FilterConfig
from ngramkit.ngram_filter.pipeline.orchestrator import build_processed_db
from ngramkit.ngram_pivot.config import PipelineConfig as PivotPipelineConfig
from ngramkit.ngram_pivot.pipeline import run_pivot_pipeline
from ngramkit.utilities.peek import db_head, db_peek, db_peek_prefix
from ngramkit.utilities.notebook_logging import setup_notebook_logging

### Configure

In [2]:
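release = '20200217'   # Google Books Ngram release ID
language = 'rus'       # corpus ID (Russian)
size = 5               # ngram size (5-grams)
num_files = 633        # number of 5-gram files to download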

In [3]:
base_path = Path(f"/scratch/edk202/NLP_corpora/Google_Books/{release}/{language}/{size}gram_files")
raw_db = base_path / f"{size}grams.db"
filtered_db = base_path / f"{size}grams_processed.db"
pivoted_db = base_path / f"{size}grams_pivoted.db"
filter_tmp_dir = base_path / "processing_tmp"
pivot_tmp_dir = base_path / "pivot_tmp"
whitelist_path = f"/scratch/edk202/NLP_corpora/Google_Books/{release}/{language}/1gram_files/1grams_processed.db/whitelist.txt"

## **Step 1: Download and Ingest**

In [5]:
setup_notebook_logging(
    workflow_name="multigrams_acquire",
    console=False
)

download_and_ingest_to_rocksdb(
    ngram_size=size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub="/scratch/edk202/NLP_corpora/Google_Books/",
    file_range=(0, num_files - 1),  # inclusive: files 0 through 632
    workers=99,
    ngram_type="tagged",
    overwrite_db=True,
    write_batch_size=1_000_000,
    open_type="write:packed24",
    compact_after_ingest=True
)

N-GRAM ACQUISITION PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-13 00:09:23

Download Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Ngram repo:           https://books.storage.googleapis.com/?prefix=ngrams/books/20200217/rus/5-
DB path:              /scratch/edk202/NLP_corpora/Google_Books/20200217/rus/5gram_files/5grams.db
File range:           0 to 632
Total files:          633
Files to get:         633
Skipping:             0
Download workers:     99
Batch size:           1,000,000
Ngram size:           5
Ngram type:           tagged
Overwrite DB:         True
DB Profile:           write:packed24

Download Progress
════════════════════════════════════════════════════════════════════════════════════════════════════


Files Processed: 100%|███████████████████████████████████████████████████████| 633/633 [05:38<00:00]



Post-Ingestion Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════
Initial DB size:         36.11 GB
Compaction completed in 0:01:20
Size before:             36.11 GB
Size after:              37.76 GB
Space saved:             -1.65 GB (-4.6%)

Processing complete!

Final Summary
════════════════════════════════════════════════════════════════════════════════════════════════════
Fully processed files:       633
Failed files:                0
Total entries written:       50,094,861
Write batches flushed:       234
Uncompressed data processed: 1.36 TB
Processing throughput:       3383.32 MB/sec

End Time: 2025-11-13 00:16:24.343211
Total Runtime: 0:07:00.795373
Time per file: 0:00:00.664764
Files per hour: 5415.5


## **Step 2: Filter and Normalize**
`config.py` contains generic defaults for the filtering pipeline. You can override them by passing `FilterConfig` and `FilterPipelineConfig` objects to `build_processed_db`. Here, we filter the multigram corpus with the whitelist produced by the unigram workflow. Without a whitelist, we could instead normalize, filter, and lemmatize each token, just as we did in the unigram workflow.
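
The `<UNK>` placeholders visible in the final database suggest that tokens missing from the whitelist are replaced rather than silently dropped. The sketch below illustrates that idea on a toy example. It rests on that assumption: `apply_whitelist`, the `UNK` constant, and the in-memory whitelist are hypothetical and are not part of `ngramkit`, and the real pipeline may also normalize, lemmatize, or discard ngrams in ways not shown here.

```python
from typing import Iterable, Set

UNK = "<UNK>"  # placeholder token, as seen in the pivoted database keys


def apply_whitelist(tokens: Iterable[str], whitelist: Set[str]) -> str:
    """Replace every token that is not in the whitelist with the UNK placeholder."""
    return " ".join(tok if tok in whitelist else UNK for tok in tokens)


# Toy whitelist; in the workflow it would come from the unigram whitelist file.
whitelist = {"ядерный", "взрыв", "атмосфера"}
filtered = apply_whitelist(["при", "ядерный", "взрыв", "в", "атмосфера"], whitelist)
print(filtered)
# <UNK> ядерный взрыв <UNK> атмосфера
```

Keeping a placeholder in place of each removed token preserves the position of the surviving words within the 5-gram window.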

In [4]:
setup_notebook_logging(
    workflow_name="multigrams_filter",
    data_path=str(base_path),
    console=False
)

stop_set = set(get_stop_words("russian"))
lemmatizer = SpacyLemmatizer(language="ru")

filter_config = FilterConfig(
    stop_set=stop_set,
    lemma_gen=lemmatizer,
    whitelist_path=whitelist_path
)

pipeline_config = FilterPipelineConfig(
    src_db=raw_db,
    dst_db=filtered_db,
    tmp_dir=filter_tmp_dir,
    num_workers=50,
    use_smart_partitioning=True,
    num_initial_work_units=500,
    cache_partitions=True,
    use_cached_partitions=True,
    samples_per_worker=500_000,
    work_unit_claim_order="random",
    flush_interval_s=5.0,
    mode="restart",
    progress_every_s=15.0,
    ingest_num_readers=10,
    ingest_batch_items=1_000_000,
    ingest_queue_size=3,
    compact_after_ingest=True
)

build_processed_db(pipeline_config, filter_config)


N-GRAM FILTER PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-13 00:46:56
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            /scratch/edk202/NLP_corpora/Google_Books/20200217/rus/5gram_files/5grams.db
Target DB:            ...dk202/NLP_corpora/Google_Books/20200217/rus/5gram_files/5grams_processed.db
Temp directory:       ...tch/edk202/NLP_corpora/Google_Books/20200217/rus/5gram_files/processing_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              50
Initial work units:   500

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24

Ingestion Configuration
────────────────────────

Shards Ingested: 100%|███████████████████████████████████████████████████████| 500/500 [02:17<00:00]



Ingestion complete: 500 shards, 18,567,640 items in 137.7s (134,881 items/s)

Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

Post-Ingestion Compaction
────────────────────────────────────────────────────────────────────────────────────────────────────
Initial DB size:         11.61 GB
Compaction completed in 0:00:27
Size before:             11.61 GB
Size after:              10.39 GB
Space saved:             1.22 GB (10.5%)

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ PROCESSING COMPLETE                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Items: 14,296,958 (estimated)                                                                    │
│ Size: 10.39 GB                                                                

## **Step 3: Pivot to Yearly Indices**

In [4]:
setup_notebook_logging(
    workflow_name="multigrams_pivot",
    data_path=str(base_path),
    console=False
)

pivot_config = PivotPipelineConfig(
    src_db=filtered_db,
    dst_db=pivoted_db,
    tmp_dir=pivot_tmp_dir,
    num_workers=24,
    num_initial_work_units=500,
    cache_partitions=True,
    use_cached_partitions=True,
    samples_per_worker=500_000,
    work_unit_claim_order="random",
    flush_interval_s=15.0,
    progress_every_s=10.0,
    mode="restart",
    num_ingest_readers=1,
    ingest_buffer_shards=1,
    use_smart_partitioning=True,
    ingest_mode="direct_sst"
)

run_pivot_pipeline(pivot_config)


PARALLEL N-GRAM DATABASE PIVOT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-13 01:05:23
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            ...dk202/NLP_corpora/Google_Books/20200217/rus/5gram_files/5grams_processed.db
Target DB:            .../edk202/NLP_corpora/Google_Books/20200217/rus/5gram_files/5grams_pivoted.db
Temp directory:       /scratch/edk202/NLP_corpora/Google_Books/20200217/rus/5gram_files/pivot_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              24
Initial work units:   500

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24
Ingest profile:       write:packed24

SST Files Ingested: 100%|████████████████████████████████████████████████████| 500/500 [00:26<00:00]



Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

Post-Ingestion Compaction
────────────────────────────────────────────────────────────────────────────────────────────────────
Initial DB size:         9.07 GB
Compaction completed in 0:01:36
Size before:             9.07 GB
Size after:              21.17 GB
Space saved:             -12.10 GB (-133.4%)

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ PROCESSING COMPLETE                                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Items: 444,807,860 (estimated)                                                                   │
│ Size: 21.17 GB                                                                                   │
│ Database: /scratch/edk202/NLP_corpora/Google_Books/20

# **Inspect Final Database**
Here are three functions you can use to inspect the final database.

## `db_head`: First N records

In [5]:
db_head(str(pivoted_db), n=10)

First 10 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1486] <UNK> <UNK> <UNK> <UNK> Италия
     Value: 1 occurrences in 1 documents

[ 2] Key:   [1486] <UNK> <UNK> <UNK> <UNK> Надо
     Value: 1 occurrences in 1 documents

[ 3] Key:   [1486] <UNK> <UNK> <UNK> <UNK> При
     Value: 1 occurrences in 1 documents

[ 4] Key:   [1486] <UNK> <UNK> <UNK> <UNK> автор
     Value: 1 occurrences in 1 documents

[ 5] Key:   [1486] <UNK> <UNK> <UNK> <UNK> акт
     Value: 1 occurrences in 1 documents

[ 6] Key:   [1486] <UNK> <UNK> <UNK> <UNK> ангел
     Value: 1 occurrences in 1 documents

[ 7] Key:   [1486] <UNK> <UNK> <UNK> <UNK> апостол
     Value: 1 occurrences in 1 documents

[ 8] Key:   [1486] <UNK> <UNK> <UNK> <UNK> благодаря
     Value: 2 occurrences in 2 documents

[ 9] Key:   [1486] <UNK> <UNK> <UNK> <UNK> бог
     Value: 6 occurrences in 6 documents

[10] Key:   [1486] <UNK> <UNK> <UNK> <UNK> божественны

## `db_peek`: Records starting from a key

In [14]:
db_peek(str(pivoted_db), start_key="[1964] <UNK> ядерный бомба <UNK> <UNK> <UNK>", n=5)

5 key-value pairs starting from 000007ac3c554e4b3e20d18fd0b4d0b5d180d0bdd18bd0b920d0b1d0bed0bcd0b1d0b0203c554e4b3e203c554e4b3e203c554e4b3e:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1964] <UNK> ядерный бомба <UNK> борт
     Value: 5 occurrences in 5 documents

[ 2] Key:   [1964] <UNK> ядерный веко <UNK> <UNK>
     Value: 6 occurrences in 6 documents

[ 3] Key:   [1964] <UNK> ядерный взрыв <UNK> <UNK>
     Value: 8 occurrences in 7 documents

[ 4] Key:   [1964] <UNK> ядерный взрыв <UNK> атмосфера
     Value: 2 occurrences in 2 documents

[ 5] Key:   [1964] <UNK> ядерный взрыв <UNK> включая
     Value: 10 occurrences in 10 documents
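
The hexadecimal start key printed above hints at how pivoted keys are laid out: the first four bytes, `000007ac`, equal 1964 when read as a big-endian integer, and the remaining bytes are the UTF-8 encoding of the ngram. The sketch below reproduces that layout. It is inferred solely from the printed output, and `encode_pivot_key` is a hypothetical helper, not an `ngramkit` function.

```python
import struct


def encode_pivot_key(year: int, ngram: str) -> bytes:
    """Pack the year as a 4-byte big-endian unsigned integer, then append the UTF-8 ngram."""
    return struct.pack(">I", year) + ngram.encode("utf-8")


key = encode_pivot_key(1964, "<UNK> ядерный бомба <UNK> <UNK> <UNK>")
print(key.hex())
```

Because a big-endian year prefix sorts bytewise in chronological order, a layout like this would keep all ngrams for a given year contiguous, which is consistent with `db_head` returning the earliest year first.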



## `db_peek_prefix`: Records matching a prefix

In [27]:
db_peek_prefix(str(pivoted_db), prefix="[2019] <UNK> <UNK> нищий <UNK> <UNK>", n=5)

5 key-value pairs with prefix 000007e33c554e4b3e203c554e4b3e20d0bdd0b8d189d0b8d0b9203c554e4b3e203c554e4b3e:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [2019] <UNK> <UNK> нищий <UNK> <UNK>
     Value: 1,289 occurrences in 1,175 documents
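
For analysis across years it can be convenient to turn a printed listing into structured records. The sketch below parses text in the format shown above. It assumes you already have that text as a string, and `parse_listing` and `Record` are hypothetical helpers rather than part of `ngramkit`. Whether the inspection functions can also return values directly is not covered here.

```python
import re
from typing import List, NamedTuple


class Record(NamedTuple):
    year: int
    ngram: str
    occurrences: int
    documents: int


KEY_RE = re.compile(r"Key:\s+\[(\d+)\]\s+(.+)")
VALUE_RE = re.compile(r"Value:\s+([\d,]+) occurrences in ([\d,]+) documents")


def parse_listing(text: str) -> List[Record]:
    """Pair each 'Key:' line with the 'Value:' line that follows it."""
    records = []
    pending = None  # (year, ngram) waiting for its Value line
    for line in text.splitlines():
        key_match = KEY_RE.search(line)
        value_match = VALUE_RE.search(line)
        if key_match:
            pending = (int(key_match.group(1)), key_match.group(2).strip())
        elif value_match and pending:
            # Counts are printed with thousands separators, e.g. "1,289".
            occurrences = int(value_match.group(1).replace(",", ""))
            documents = int(value_match.group(2).replace(",", ""))
            records.append(Record(*pending, occurrences, documents))
            pending = None
    return records


sample = """\
[ 1] Key:   [2019] <UNK> <UNK> нищий <UNK> <UNK>
     Value: 1,289 occurrences in 1,175 documents
"""
records = parse_listing(sample)
print(records)
```

A key line without a matching value line, such as a listing cut off mid-record, is simply skipped.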

