# **Unigrams: Full Workflow**
The unigram workflow produces a vocabulary **whitelist** that can be used to filter unwanted words out of a multigram corpus. It has two steps: (1) download the unigram corpus and ingest it into a RocksDB database, and (2) filter and normalize the corpus, then generate the whitelist.

## **Setup**
### Imports

In [1]:
%load_ext autoreload
%autoreload 2

from pathlib import Path
from stop_words import get_stop_words
from ngramkit.ngram_filter.lemmatizer import SpacyLemmatizer
from ngramkit.ngram_acquire import download_and_ingest_to_rocksdb
from ngramkit.ngram_filter.pipeline.orchestrator import build_processed_db
from ngramkit.utilities.peek import db_head, db_peek, db_peek_prefix

### Configure

In [2]:
db_path_stub = '/scratch/edk202/NLP_corpora/Google_Books/'
archive_path_stub = '/scratch/edk202/NLP_archive/Google_Books/'
release = '20200217'
language = 'eng'
size = 1
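
The database paths printed in the Step 1 and Step 2 logs suggest how these settings combine into on-disk locations. Below is a minimal sketch, assuming the layout `<stub>/<release>/<language>/<size>gram_files/<size>grams.db` inferred from those logs. `ngram_db_path` is a hypothetical helper, not part of `ngramkit`.

```python
from pathlib import Path


def ngram_db_path(stub: str, release: str, language: str, size: int,
                  processed: bool = False) -> Path:
    """Compose a database path following the layout seen in the pipeline logs.

    Hypothetical helper: the layout is inferred from log output, not taken
    from the ngramkit API.
    """
    suffix = "_processed" if processed else ""
    return (Path(stub) / release / language
            / f"{size}gram_files" / f"{size}grams{suffix}.db")


stub = '/scratch/edk202/NLP_corpora/Google_Books/'
raw_db = ngram_db_path(stub, '20200217', 'eng', 1)
processed_db = ngram_db_path(stub, '20200217', 'eng', 1, processed=True)
print(raw_db)
print(processed_db)
```

Deriving paths this way keeps the inspection cells in sync with the configuration instead of repeating a hard-coded string.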

## **Step 1: Download and Ingest**

In [3]:
download_and_ingest_to_rocksdb(
    ngram_size=size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    archive_path_stub=archive_path_stub,
    ngram_type="tagged",
    overwrite_db=True,
    open_type="write:packed24",
    compact_after_ingest=True
)

N-GRAM ACQUISITION PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-17 18:58:09

Download Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Ngram repo:           https://books.storage.googleapis.com/?prefix=ngrams/books/20200217/eng/1-
DB path:              /scratch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams.db
File range:           0 to 23
Total files:          24
Files to get:         24
Skipping:             0
Download workers:     24
Batch size:           50,000
Ngram size:           1
Ngram type:           tagged
Overwrite DB:         True
DB Profile:           write:packed24

Download Progress
════════════════════════════════════════════════════════════════════════════════════════════════════


Files Processed: 100%|█████████████████████████████████████████████████████████| 24/24 [05:38<00:00]



Post-Ingestion Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════
Initial DB size:         46.46 GB
Compaction completed in 0:02:24
Size before:             46.46 GB
Size after:              57.76 GB
Space saved:             -11.30 GB (-24.3%)

Database Archival
════════════════════════════════════════════════════════════════════════════════════════════════════
Compressing database to temporary location: ...1/job-468790/ngram_archive_7b_r6t0m/1grams.db.tar.zst
Compressing /scratch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams.db to /state/partition1/job-468790/ngram_archive_7b_r6t0m/1grams.db.tar.zst
Using 128 threads at compression level 3
Compression complete: /state/partition1/job-468790/ngram_archive_7b_r6t0m/1grams.db.tar.zst
Transferring archive to ...dk202/NLP_archive/Google_Books/20200217/eng/1gram_files/1grams.db.tar.zst
Archive created: /scratch/edk202/NLP_archive/Google_Books/20200217/eng/1gram_f
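
The negative "Space saved" figure in the compaction summary means the reported database size grew during compaction in this run. The logged numbers are consistent with a before-minus-after difference expressed as a percentage of the pre-compaction size. The sketch below reproduces that arithmetic; it is not the pipeline's reporting code, and `space_saved` is a hypothetical name.

```python
def space_saved(before_gb: float, after_gb: float) -> tuple[float, float]:
    """Return (GB saved, percent saved relative to the size before compaction)."""
    saved = before_gb - after_gb
    return saved, 100.0 * saved / before_gb


# Figures taken from the compaction summary above.
saved, pct = space_saved(46.46, 57.76)
print(f"Space saved: {saved:.2f} GB ({pct:.1f}%)")
```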

## **Step 2: Filter, Normalize, and Generate Whitelist**
`config.py` contains generic defaults for the filtering pipeline. You can override these defaults by passing option dictionaries to the `build_processed_db` function, as seen below. By default, we:
1. case-normalize the tokens
2. remove tokens containing non-alphanumeric characters
3. remove stopwords using the `stop-words` package
4. remove tokens shorter than 3 characters
5. lemmatize the tokens using the `spaCy` package, first using a lookup table and falling back to rules when lookups fail
6. create a whitelist of the top 20,000 most frequent words that pass a `pyenchant` spell-check and appear in every yearly corpus from 1900–2019 (inclusive); the call below overrides this to the top 15,000
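
Steps 1 through 5 can be pictured as a per-token function that either returns a lemma or rejects the token. The sketch below is illustrative only: it assumes the steps run in the order listed, works on plain untagged strings, and uses toy stand-ins for the stopword list and lemmatizer. `filter_token` is a hypothetical name and does not reflect how `ngramkit` implements the pipeline.

```python
from typing import Callable, Optional


def filter_token(
    token: str,
    stop_set: frozenset[str],
    lemmatize: Callable[[str], str],
    min_len: int = 3,
) -> Optional[str]:
    """Apply steps 1-5 to one token; return the lemma, or None if rejected."""
    token = token.lower()          # 1. case-normalize
    if not token.isalnum():        # 2. reject tokens with non-alphanumeric chars
        return None
    if token in stop_set:          # 3. reject stopwords
        return None
    if len(token) < min_len:       # 4. reject tokens shorter than min_len
        return None
    return lemmatize(token)        # 5. lemmatize


# Toy stand-ins for the stop-words list and the spaCy lemmatizer.
toy_stops = frozenset({"the", "and", "of"})
toy_lemmas = {"books": "book", "running": "run"}


def toy_lemmatize(tok: str) -> str:
    """Look up a lemma, falling back to the token itself."""
    return toy_lemmas.get(tok, tok)


for raw in ["Books", "the", "e-mail", "ox", "Running", "corpus"]:
    print(raw, "->", filter_token(raw, toy_stops, toy_lemmatize))
```

Passing the stopword set and lemmatizer in as arguments mirrors the `stop_set` and `lemma_gen` options used in the call below.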

In [4]:
filter_options = {
    'stop_set': set(get_stop_words("english")),
    'lemma_gen': SpacyLemmatizer(language="en")
}

whitelist_options = {
    'output_whitelist_top_n': 15_000,
    'output_whitelist_year_range': (1900, 2019),
    'output_whitelist_spell_check': True,
    'output_whitelist_spell_check_language': "en"
}

build_processed_db(
    mode="restart",
    ngram_size=size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=100,
    num_initial_work_units=600,
    cache_partitions=True,
    use_cached_partitions=True,
    progress_every_s=10.0,
    **filter_options,
    **whitelist_options
);


N-GRAM FILTER PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-17 19:08:39
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            /scratch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams.db
Target DB:            ...dk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams_processed.db
Temp directory:       ...tch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/processing_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              100
Initial work units:   600

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24

Ingestion Configuration
───────────────────────

Shards Ingested: 100%|███████████████████████████████████████████████████████| 600/600 [03:26<00:00]



Ingestion complete: 600 shards, 17,731,248 items in 206.8s (85,725 items/s)

Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

Post-Ingestion Compaction
────────────────────────────────────────────────────────────────────────────────────────────────────
Initial DB size:         25.08 GB
Compaction completed in 0:00:58
Size before:             25.08 GB
Size after:              21.13 GB
Space saved:             3.95 GB (15.7%)

Phase 5: Generating output whitelist...
════════════════════════════════════════════════════════════════════════════════════════════════════
  Output path: ...LP_corpora/Google_Books/20200217/eng/1gram_files/1grams_processed.db/whitelist.txt
  Extracting top 15,000 tokens
  Spell checking enabled (en)
  Year range filter: 1900-2019 (inclusive)
  Generated whitelist with 15,000 tokens in 257.5s

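
The whitelist's intended use, filtering a multigram corpus, can be sketched as follows. This assumes that `whitelist.txt` holds one entry per line with the token as the first whitespace-separated field (the file format is not shown in the output above), and that an n-gram is kept only when every one of its tokens is whitelisted. `load_whitelist` and `keep_ngram` are hypothetical helpers, not `ngramkit` functions.

```python
import tempfile
from pathlib import Path


def load_whitelist(path: Path) -> set[str]:
    """Read tokens from a whitelist file, skipping blank lines.

    Assumes the token is the first whitespace-separated field on each line.
    """
    tokens = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.split()
            if fields:
                tokens.add(fields[0])
    return tokens


def keep_ngram(ngram: str, whitelist: set[str]) -> bool:
    """Keep an n-gram only if every space-separated token is whitelisted."""
    return all(tok in whitelist for tok in ngram.split())


# Demo with a throwaway file standing in for the real whitelist.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "whitelist.txt"
    demo.write_text("much\nbook\nread\n", encoding="utf-8")
    vocab = load_whitelist(demo)

for ngram in ["read much book", "read tarnation book"]:
    print(ngram, "->", keep_ngram(ngram, vocab))
```

Loading the tokens into a `set` keeps each membership test constant-time, which matters when the check runs once per token across a large multigram corpus.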

## **Optional: Inspect Database**

### `db_head`: Show first N records

In [8]:
db_path = '/scratch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams_processed.db'

db_head(db_path, n=5)

First 5 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   FALSE
     Value: Total: 124,503 occurrences in 110,465 volumes (1538-2019, 380 years)

[ 2] Key:   TRUE
     Value: Total: 4,241,065 occurrences in 2,916,658 volumes (1501-2019, 440 years)

[ 3] Key:   aaa
     Value: Total: 5,093,481 occurrences in 1,181,441 volumes (1477-2019, 404 years)

[ 4] Key:   aaaa
     Value: Total: 474,351 occurrences in 92,702 volumes (1477-2019, 338 years)

[ 5] Key:   aaaaa
     Value: Total: 54,556 occurrences in 23,113 volumes (1581-2019, 274 years)



### `db_peek`: Show records starting from a key

In [9]:
db_path = '/scratch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams_processed.db'

db_peek(db_path, start_key=b"much", n=5)

5 key-value pairs starting from 6d756368:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   much
     Value: Total: 5,422,182,873 occurrences in 147,811,157 volumes (1470-2019, 519 years)

[ 2] Key:   mucha
     Value: Total: 253,936 occurrences in 119,832 volumes (1554-2019, 411 years)

[ 3] Key:   muchaa
     Value: Total: 141 occurrences in 108 volumes (1807-2017, 75 years)

[ 4] Key:   muchaal
     Value: Total: 232 occurrences in 118 volumes (1993-2019, 21 years)

[ 5] Key:   muchaba
     Value: Total: 371 occurrences in 65 volumes (1848-2019, 33 years)
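
The header of this output shows the start key in hexadecimal (`6d756368` is `b"much"`), and the records that follow are the keys at or after it in sorted order. Assuming the database uses RocksDB's default bytewise key ordering, uppercase keys such as `FALSE` sort before lowercase ones, which matches the `db_head` output. The toy sketch below mimics that seek behavior on an in-memory sorted list; it does not call `ngramkit` or RocksDB, and `peek` is a hypothetical name.

```python
from bisect import bisect_left

# Toy key set; sorted() on bytes uses the same bytewise ordering.
keys = sorted([b"aaa", b"FALSE", b"much", b"mucha", b"muchaa", b"mud", b"TRUE"])


def peek(sorted_keys: list[bytes], start_key: bytes, n: int) -> list[bytes]:
    """Return up to n keys at or after start_key in bytewise order."""
    i = bisect_left(sorted_keys, start_key)
    return sorted_keys[i:i + n]


print(b"much".hex())           # hex form of the start key
print(keys[:2])                # uppercase keys come first
print(peek(keys, b"much", 3))  # seek, then read forward
```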



### `db_peek_prefix`: Show records matching a prefix

In [12]:
db_path = '/scratch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams_processed.db'

db_peek_prefix(db_path, prefix=b"tarnation", n=5)

5 key-value pairs with prefix 7461726e6174696f6e:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   tarnation
     Value: Total: 69,016 occurrences in 49,433 volumes (1606-2019, 265 years)

[ 2] Key:   tarnational
     Value: Total: 114 occurrences in 92 volumes (1913-2010, 53 years)

[ 3] Key:   tarnationest
     Value: Total: 53 occurrences in 53 volumes (1840-2019, 30 years)

[ 4] Key:   tarnationly
     Value: Total: 94 occurrences in 94 volumes (1833-2019, 44 years)
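
A prefix scan differs from a plain seek in that it stops as soon as a key no longer starts with the prefix, so it can return fewer than `n` records. Below is a toy sketch, assuming keys are stored in bytewise-sorted order. `peek_prefix` is an illustrative stand-in, not the `db_peek_prefix` implementation.

```python
from bisect import bisect_left

# Toy key set with neighbors on both sides of the prefix range.
keys = sorted([b"tar", b"tarnation", b"tarnational", b"tarnationly", b"tarnish"])


def peek_prefix(sorted_keys: list[bytes], prefix: bytes, n: int) -> list[bytes]:
    """Return up to n keys that start with prefix, in bytewise order."""
    out = []
    # Seek to the first key >= prefix, then read forward while keys match.
    for key in sorted_keys[bisect_left(sorted_keys, prefix):]:
        if not key.startswith(prefix) or len(out) == n:
            break
        out.append(key)
    return out


print(b"tarnation".hex())
print(peek_prefix(keys, b"tarnation", 5))
```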

