# **Unigrams: Full Workflow**
The primary purpose of the unigram workflow is to generate a vocabulary **whitelist** that can be used to filter unwanted words out of a multigram corpus. The workflow has two steps: (1) download the unigram corpus and ingest it into a RocksDB database, and (2) filter and normalize the corpus and generate the whitelist.

## **Setup**
### Imports

In [4]:
%load_ext autoreload
%autoreload 2

from pathlib import Path
from stop_words import get_stop_words
from ngramkit.ngram_filter.lemmatizer import SpacyLemmatizer
from ngramkit.ngram_acquire import download_and_ingest_to_rocksdb
from ngramkit.ngram_filter.config import PipelineConfig as FilterPipelineConfig
from ngramkit.ngram_filter.config import FilterConfig
from ngramkit.ngram_filter.pipeline.orchestrator import build_processed_db
from ngramkit.utilities.peek import db_head, db_peek, db_peek_prefix
from ngramkit.utilities.notebook_logging import setup_notebook_logging



### Configure

In [5]:
release = '20200217'
language = 'rus'
size = 1
num_files = 2

In [3]:
base_path = Path(f"/scratch/edk202/NLP_corpora/Google_Books/{release}/{language}/{size}gram_files")
raw_db = base_path / f"{size}grams.db"
filtered_db = base_path / f"{size}grams_processed.db"
pivoted_db = base_path / f"{size}grams_pivoted.db"
filter_tmp_dir = base_path / "processing_tmp"
pivot_tmp_dir = base_path / "pivot_tmp"
whitelist_path = filtered_db / "whitelist.txt"
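
As a quick check on the path layout, the sketch below rebuilds the same paths with `pathlib` under a hypothetical root (`/data/corpora/Google_Books` is a placeholder, not the path used in this notebook). It shows that `whitelist_path` resolves to a file inside the processed database directory.

```python
from pathlib import Path

# Hypothetical root used only for illustration; substitute your own storage path.
root = Path("/data/corpora/Google_Books")
release, language, size = "20200217", "rus", 1

base_path = root / release / language / f"{size}gram_files"
raw_db = base_path / f"{size}grams.db"
filtered_db = base_path / f"{size}grams_processed.db"
whitelist_path = filtered_db / "whitelist.txt"

print(raw_db.as_posix())
print(whitelist_path.as_posix())
# The whitelist file sits inside the processed database directory.
print(whitelist_path.parent == filtered_db)
```

Because the whitelist lives inside the database directory, it travels with the database. Whether that directory is cleared when the filter pipeline runs in `restart` mode is not shown here, so treat it as something to verify.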

## **Step 1: Download and Ingest**

In [9]:
setup_notebook_logging(
    workflow_name="unigrams_acquire",
    data_path=str(base_path),
    console=False
)

download_and_ingest_to_rocksdb(
    ngram_size=size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub="/scratch/edk202/NLP_corpora/Google_Books/",
    file_range=(0, num_files-1),
    workers=2,
    use_threads=False,
    ngram_type="tagged",
    overwrite_db=True,
    write_batch_size=100_000,
    open_type="write:packed24",
    compact_after_ingest=True
)

N-GRAM ACQUISITION PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-12 21:41:07

Download Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Ngram repo:           https://books.storage.googleapis.com/?prefix=ngrams/books/20200217/rus/1-
DB path:              /scratch/edk202/NLP_corpora/Google_Books/20200217/rus/1gram_files/1grams.db
File range:           0 to 1
Total files:          2
Files to get:         2
Skipping:             0
Download workers:     2
Batch size:           100,000
Ngram size:           1
Ngram type:           tagged
Overwrite DB:         True
DB Profile:           write:packed24

Download Progress
════════════════════════════════════════════════════════════════════════════════════════════════════


Files Processed: 100%|███████████████████████████████████████████████████████████| 2/2 [03:23<00:00]



Post-Ingestion Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════
Compaction completed in 0:00:34

Processing complete!

Final Summary
════════════════════════════════════════════════════════════════════════════════════════════════════
Fully processed files:       2
Failed files:                0
Total entries written:       6,862,543
Write batches flushed:       2
Uncompressed data processed: 7.05 GB
Processing throughput:       30.32 MB/sec

End Time: 2025-11-12 21:45:05.261661
Total Runtime: 0:03:58.208891
Time per file: 0:01:59.104446
Files per hour: 30.2
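
The derived figures in the summary above can be rechecked from the reported runtime, file count, and data volume. The sketch below is one way to do that. It assumes the log reports binary units (1 GB = 1024 MB), which is a guess, and because the data volume is rounded in the log the throughput only agrees approximately.

```python
from datetime import timedelta

# Values copied from the Final Summary above.
total_runtime = timedelta(minutes=3, seconds=58, microseconds=208891)
files_processed = 2
uncompressed_gb = 7.05  # rounded in the log, so throughput is approximate

seconds = total_runtime.total_seconds()
time_per_file = total_runtime / files_processed
files_per_hour = files_processed / (seconds / 3600)
# Assumption: 1 GB = 1024 MB in the pipeline's reporting.
throughput_mb_s = uncompressed_gb * 1024 / seconds

print(time_per_file)
print(round(files_per_hour, 1))
print(round(throughput_mb_s, 1))
```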


## **Step 2: Filter, Normalize, and Generate Whitelist**
`config.py` contains generic defaults for the filtering pipeline. You can override these defaults by passing `FilterConfig` and `FilterPipelineConfig` objects to the `build_processed_db` function, as seen below. By default, we:
1. case-normalize the tokens
2. remove tokens containing non-alphanumeric characters
3. remove stopwords using the `stop-words` package
4. remove tokens shorter than 3 characters
5. lemmatize the tokens using the `spaCy` package, first using a lookup table and falling back to rules when lookups fail
6. create a whitelist of the 6,000 most frequent words that pass a `pyenchant` spell-check and appear in all corpora from 1900 to 2019 (inclusive)
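
The steps above can be illustrated with a small standalone sketch. This is not the `ngramkit` implementation: the stop set, the lemma table, the fallback rule, and the helper names (`normalize`, `lemmatize`, `build_whitelist`) are all hypothetical stand-ins, the spell-check step is omitted, and "appear in all corpora" is interpreted here as "has a nonzero count in every year of the range", which is an assumption.

```python
from collections import Counter

# Toy stand-ins: the real pipeline uses the stop-words package and a spaCy lemmatizer.
STOP_WORDS = {"the", "and"}
LEMMA_LOOKUP = {"houses": "house", "ran": "run"}


def lemmatize(token):
    """Lookup first, then a crude rule-based fallback (strip a trailing 's')."""
    if token in LEMMA_LOOKUP:
        return LEMMA_LOOKUP[token]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def normalize(token, min_len=3):
    """Return the normalized token, or None if any filter rejects it."""
    token = token.lower()              # 1. case-normalize
    if not token.isalnum():            # 2. reject non-alphanumeric tokens
        return None
    if token in STOP_WORDS:            # 3. reject stopwords
        return None
    if len(token) < min_len:           # 4. reject short tokens
        return None
    return lemmatize(token)            # 5. lemmatize


def build_whitelist(yearly_counts, top_n, year_range):
    """Top-N tokens by total count among those seen in every year of the range."""
    first, last = year_range
    years = range(first, last + 1)     # inclusive on both ends
    totals = Counter()
    for year in years:
        totals.update(yearly_counts.get(year, {}))
    in_every_year = [
        tok for tok in totals
        if all(yearly_counts.get(y, {}).get(tok, 0) > 0 for y in years)
    ]
    # Sort by descending total count, breaking ties alphabetically.
    in_every_year.sort(key=lambda tok: (-totals[tok], tok))
    return in_every_year[:top_n]


raw = ["The", "Houses", "ran", "to", "books", "e-mail", "Cats"]
normalized = [normalize(t) for t in raw]
print(normalized)

counts = {
    1900: {"house": 5, "book": 2, "cat": 1},
    1901: {"house": 3, "book": 4},
}
whitelist = build_whitelist(counts, top_n=2, year_range=(1900, 1901))
print(whitelist)
```

In this toy data, `cat` is excluded from the whitelist because it has no count in 1901, even though it passes every token-level filter.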

In [7]:
setup_notebook_logging(
    workflow_name="unigrams_filter",
    data_path=str(base_path),
    console=False
)

stop_set = set(get_stop_words("russian"))
lemmatizer = SpacyLemmatizer(language="ru")

filter_config = FilterConfig(
    stop_set=stop_set,
    lemma_gen=lemmatizer,
)

pipeline_config = FilterPipelineConfig(
    src_db=raw_db,
    dst_db=filtered_db,
    tmp_dir=filter_tmp_dir,
    num_workers=24,
    use_smart_partitioning=True,
    num_initial_work_units=300,
    cache_partitions=True,
    use_cached_partitions=True,
    samples_per_worker=500_000,
    work_unit_claim_order="random",
    flush_interval_s=15.0,
    mode="restart",
    progress_every_s=10.0,
    ingest_num_readers=10,
    ingest_batch_items=2_000_000,
    ingest_queue_size=2,
    output_whitelist_path=whitelist_path,
    output_whitelist_top_n=6_000,
    output_whitelist_year_range=(1900, 2019),
    output_whitelist_spell_check=True,
    output_whitelist_spell_check_language="ru"
)

build_processed_db(pipeline_config, filter_config)


N-GRAM FILTER PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-12 23:13:25
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            /scratch/edk202/NLP_corpora/Google_Books/20200217/rus/1gram_files/1grams.db
Target DB:            ...dk202/NLP_corpora/Google_Books/20200217/rus/1gram_files/1grams_processed.db
Temp directory:       ...tch/edk202/NLP_corpora/Google_Books/20200217/rus/1gram_files/processing_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              24
Initial work units:   300

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24

Ingestion Configuration
────────────────────────

Shards Ingested: 100%|███████████████████████████████████████████████████████| 300/300 [01:04<00:00]



Ingestion complete: 300 shards, 4,049,498 items in 64.5s (62,818 items/s)

Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

Post-Ingestion Compaction
────────────────────────────────────────────────────────────────────────────────────────────────────
Initial DB size:         4.30 GB
Compaction completed in 0:00:10
Size before:             4.30 GB
Size after:              4.30 GB
Space saved:             50.19 KB (0.0%)

Phase 5: Generating output whitelist...
════════════════════════════════════════════════════════════════════════════════════════════════════
  Output path: ...LP_corpora/Google_Books/20200217/rus/1gram_files/1grams_processed.db/whitelist.txt
  Extracting top 6,000 tokens
  Spell checking enabled (ru)
  Year range filter: 1900-2019 (inclusive)


ValueError: Spell checking language 'ru' not found. Install it with: enchant.broker.list_dicts() to see available languages.
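
The error above indicates that no Enchant dictionary is registered under the tag `ru` in this environment. A minimal diagnostic sketch follows, assuming `pyenchant` is the spell-check backend as stated earlier. `enchant.list_languages()` is part of `pyenchant`; the helper names `available_spellcheck_languages` and `matching_tags` are hypothetical and exist only for this example.

```python
def available_spellcheck_languages():
    """List dictionary tags visible to pyenchant, or [] if pyenchant is unavailable."""
    try:
        import enchant  # pyenchant; also requires the Enchant C library
    except ImportError:
        return []
    return sorted(enchant.list_languages())


def matching_tags(requested, available):
    """Tags equal to the request or sharing its language prefix (e.g. 'ru' -> 'ru_RU')."""
    prefix = requested.lower()
    return [
        tag for tag in available
        if tag.lower() == prefix or tag.lower().startswith(prefix + "_")
    ]


installed = available_spellcheck_languages()
print(installed)
print(matching_tags("ru", installed))
```

If a tag such as `ru_RU` is listed, passing it as `output_whitelist_spell_check_language` may resolve the error, though whether the pipeline accepts region-qualified tags is an assumption. If nothing matches, a Russian dictionary for one of Enchant's backends (for example Hunspell) has to be installed before the whitelist step can run with spell checking enabled.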

## **Optional: Inspect Database**

### `db_head`: Show first N records

In [4]:
db_head(str(filtered_db), n=5)

First 5 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   FALSE
     Value: Total: 124,503 occurrences in 110,465 volumes (1538-2019, 380 years)

[ 2] Key:   TRUE
     Value: Total: 4,241,065 occurrences in 2,916,658 volumes (1501-2019, 440 years)

[ 3] Key:   aaa
     Value: Total: 5,093,481 occurrences in 1,181,441 volumes (1477-2019, 404 years)

[ 4] Key:   aaaa
     Value: Total: 474,351 occurrences in 92,702 volumes (1477-2019, 338 years)

[ 5] Key:   aaaaa
     Value: Total: 54,556 occurrences in 23,113 volumes (1581-2019, 274 years)



### `db_peek`: Show records starting from a key

In [5]:
db_peek(str(filtered_db), start_key=b"time", n=5)

5 key-value pairs starting from 74696d65:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   time
     Value: Total: 2,893,115,714 occurrences in 69,504,388 volumes (1470-2019, 517 years)

[ 2] Key:   timea
     Value: Total: 55,191 occurrences in 38,897 volumes (1591-2019, 343 years)

[ 3] Key:   timeable
     Value: Total: 2,076 occurrences in 1,823 volumes (1792-2019, 143 years)

[ 4] Key:   timeabout
     Value: Total: 499 occurrences in 489 volumes (1614-2019, 152 years)

[ 5] Key:   timeabove
     Value: Total: 61 occurrences in 53 volumes (1724-2009, 44 years)



### `db_peek_prefix`: Show records matching a prefix

In [6]:
db_peek_prefix(str(filtered_db), prefix=b"tarnation", n=5)

5 key-value pairs with prefix 7461726e6174696f6e:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   tarnation
     Value: Total: 69,016 occurrences in 49,433 volumes (1606-2019, 265 years)

[ 2] Key:   tarnational
     Value: Total: 114 occurrences in 92 volumes (1913-2010, 53 years)

[ 3] Key:   tarnationest
     Value: Total: 53 occurrences in 53 volumes (1840-2019, 30 years)

[ 4] Key:   tarnationly
     Value: Total: 94 occurrences in 94 volumes (1833-2019, 44 years)
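
The hex strings in the output headers above (`74696d65`, `7461726e6174696f6e`) are the hex encodings of the byte keys passed in. The sketch below shows that, along with a toy model of the difference between seeking from a start key and scanning a prefix. The in-memory sorted list and the `peek` and `peek_prefix` helpers are hypothetical stand-ins; the actual behavior of `db_peek` and `db_peek_prefix` is assumed from their output, not taken from the `ngramkit` source.

```python
from bisect import bisect_left

# Toy sorted key space standing in for the database's byte-ordered keys.
keys = sorted([b"tarnation", b"tarnational", b"tarnish", b"time", b"timea", b"timeable"])


def peek(start_key, n):
    """First n keys at or after start_key, in byte order."""
    i = bisect_left(keys, start_key)
    return keys[i:i + n]


def peek_prefix(prefix, n):
    """Up to n keys starting with prefix; stops at the first key that does not match."""
    out = []
    for key in keys[bisect_left(keys, prefix):]:
        if len(out) == n or not key.startswith(prefix):
            break
        out.append(key)
    return out


print(b"time".hex())
print(b"tarnation".hex())
print(peek(b"time", 2))
print(peek_prefix(b"tarnation", 5))
```

Under this model a prefix scan can return fewer than `n` records when the prefix is exhausted, whereas a plain seek keeps reading into whatever keys come next.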

