# **Unigrams: Full Workflow**
The primary purpose of the unigram workflow is to generate a vocabulary **whitelist** that can be used to filter out unwanted words from a multigram corpus. The workflow consists of two steps: (1) downloading the unigram corpus into a database and (2) filtering and normalizing the corpus and generating the whitelist.

## **Setup**
### Imports

In [1]:
%load_ext autoreload
%autoreload 2

from stop_words import get_stop_words
from ngramprep.ngram_filter.lemmatizer import CachedSpacyLemmatizer
from ngramprep.ngram_acquire import download_and_ingest_to_rocksdb
from ngramprep.ngram_filter.pipeline.orchestrator import build_processed_db
from ngramprep.utilities.peek import db_head, db_peek, db_peek_prefix

### Configure
Here we set basic parameters: the corpus to download, the size of the ngrams to download, and the size of the year bins.

In [2]:
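db_path_stub = '/scratch/edk202/NLP_corpora/Google_Books/'  # root directory for the databases
archive_path_stub = None
release = '20200217'  # ngram release to download
language = 'eng'      # corpus to download
ngram_size = 1        # 1 = unigrams
bin_size = 1          # size of the year bins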
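
To make the `bin_size` parameter concrete, the sketch below shows one way yearly counts could be grouped into year bins. It is an illustration only: `bin_start` and `bin_counts` are hypothetical names, and aligning bins to multiples of `bin_size` is an assumption, so `ngramprep` may align its bins differently. With `bin_size = 1`, as configured here, every year is its own bin.

```python
from collections import Counter


def bin_start(year: int, bin_size: int) -> int:
    """Return the first year of the bin containing `year` (assumed alignment)."""
    if bin_size < 1:
        raise ValueError("bin_size must be a positive integer")
    return year - (year % bin_size)


def bin_counts(yearly_counts: dict[int, int], bin_size: int) -> dict[int, int]:
    """Sum per-year counts into bins keyed by each bin's first year."""
    binned = Counter()
    for year, count in yearly_counts.items():
        binned[bin_start(year, bin_size)] += count
    return dict(sorted(binned.items()))


# Toy counts for illustration; not taken from the corpus.
toy_counts = {1900: 3, 1901: 2, 1904: 1, 1905: 7}

annual = bin_counts(toy_counts, 1)      # one bin per year
five_year = bin_counts(toy_counts, 5)   # 1900-1904 and 1905-1909
```

Larger bins trade temporal resolution for fewer, denser records per token.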

## **Step 1: Download and Ingest**

In [3]:
download_and_ingest_to_rocksdb(
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    archive_path_stub=archive_path_stub,
    ngram_type="tagged",
    overwrite_db=True,
    open_type="write:packed24",
    compact_after_ingest=True
)

N-GRAM ACQUISITION PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2026-01-09 18:59:04

Download Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Ngram repo:           ...//books.storage.googleapis.com/?prefix=ngrams/books/20200217/eng-fiction/1-
DB path:              .../edk202/NLP_corpora/Google_Books/20200217/eng-fiction/1gram_files/1grams.db
File range:           0 to 0
Total files:          1
Files to get:         1
Skipping:             0
Download workers:     1
Batch size:           50,000
Ngram size:           1
Ngram type:           tagged
Overwrite DB:         True
DB Profile:           write:packed24

Download Progress
════════════════════════════════════════════════════════════════════════════════════════════════════


Files Processed: 100%|███████████████████████████████████████████████████████████| 1/1 [02:13<00:00]



Post-Ingestion Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════
Compaction completed in 0:00:20

Processing complete!

Final Summary
════════════════════════════════════════════════════════════════════════════════════════════════════
Fully processed files:       1
Failed files:                0
Total entries written:       2,489,457
Write batches flushed:       1
Uncompressed data processed: 2.78 GB
Processing throughput:       18.37 MB/sec

End Time: 2026-01-09 19:01:39.661829
Total Runtime: 0:02:34.814534
Time per file: 0:02:34.814534
Files per hour: 23.3


## **Step 2: Filter, Normalize, and Generate Whitelist**
`config.py` contains generic defaults for the filtering pipeline. You can override them by passing option dictionaries to the `build_processed_db` function, as seen below. By default, the pipeline does the following:
1. case-normalizes the tokens
2. removes tokens containing non-alphanumeric characters
3. removes stopwords using the `stop-words` package
4. lemmatizes the tokens using the `spaCy` package, first with a lookup table and then with rules when the lookup fails
5. creates a whitelist of the 20,000 most frequent words that pass a `pyenchant` spell check and appear in all corpora from 1900–2019 (inclusive); the call below sets `output_whitelist_top_n` to 30,000 instead
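
The first four steps can be pictured as a single per-token function. The following is a simplified sketch, not `ngramprep`'s implementation: `normalize_token` is a hypothetical name, the stop set and lemma table are tiny stand-ins for the `stop-words` list and the spaCy lemmatizer, and `str.isalnum` is just one possible reading of the non-alphanumeric check.

```python
from typing import Callable, Optional


def normalize_token(
    token: str,
    stop_set: set[str],
    lemmatize: Callable[[str], str],
) -> Optional[str]:
    """Return the normalized token, or None if the token is filtered out."""
    lowered = token.lower()        # step 1: case-normalize
    if not lowered.isalnum():      # step 2: drop tokens with non-alphanumeric characters
        return None
    if lowered in stop_set:        # step 3: drop stopwords
        return None
    return lemmatize(lowered)      # step 4: lemmatize


# Toy stand-ins (assumptions for illustration only).
toy_stop_set = {"the", "of"}
toy_lemma_table = {"wolves": "wolf", "running": "run"}


def toy_lemmatize(word: str) -> str:
    """Look the word up in the table; fall back to the word itself."""
    return toy_lemma_table.get(word, word)


tokens = ["Wolves", "The", "co-op", "running", "tree"]
results = [normalize_token(t, toy_stop_set, toy_lemmatize) for t in tokens]
```

Here `"The"` is dropped as a stopword and `"co-op"` is dropped because of the hyphen, while the remaining tokens are lowercased and lemmatized.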

In [3]:
filter_options = {
    'stop_set': set(get_stop_words("english")),
    'lemma_gen': CachedSpacyLemmatizer(language="en"),
    'bin_size': bin_size
}

whitelist_options = {
    'output_whitelist_top_n': 30_000,
    'output_whitelist_year_range': (1900, 2019),
    'output_whitelist_spell_check': True,
    'output_whitelist_spell_check_language': "en"
}

build_processed_db(
    mode="restart",
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=20,
    num_initial_work_units=300,
    cache_partitions=True,
    use_cached_partitions=False,
    progress_every_s=5,
    compact_after_ingest=True,
    **filter_options,
    **whitelist_options
);


N-GRAM FILTER PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2026-01-10 22:45:43
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            /scratch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams.db
Target DB:            ...dk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams_processed.db
Temp directory:       ...tch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/processing_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              20
Initial work units:   300

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24

Ingestion Configuration
────────────────────────


Phase 1: Creating work units...
════════════════════════════════════════════════════════════════════════════════════════════════════
Clean restart - creating new work units
Sampling database to create 300 density-based work units...
Created 300 balanced work units based on data density
Cached partition results for future use

Phase 2: Processing 300 work units with 20 workers...
════════════════════════════════════════════════════════════════════════════════════════════════════

    items         kept%        workers        units          rate        elapsed    
────────────────────────────────────────────────────────────────────────────────────
    2.70M         46.0%         20/20       259·20·21      185.5k/s        15s      
    8.75M         58.3%         18/20       222·18·60      446.6k/s        20s      
    14.25M        60.4%         20/20       190·20·90      579.7k/s        25s      
    19.08M        61.6%         20/20       152·20·128     645.2k/s        30s      

Shards Ingested: 100%|███████████████████████████████████████████████████████| 300/300 [06:56<00:00]



Ingestion complete: 300 shards, 18,967,924 items in 416.1s (45,588 items/s)

Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

Post-Ingestion Compaction
────────────────────────────────────────────────────────────────────────────────────────────────────
Initial DB size:         26.93 GB
Compaction completed in 0:00:52
Size before:             26.93 GB
Size after:              22.53 GB
Space saved:             4.40 GB (16.3%)

Phase 5: Generating output whitelist...
════════════════════════════════════════════════════════════════════════════════════════════════════
  Output path: ...LP_corpora/Google_Books/20200217/eng/1gram_files/1grams_processed.db/whitelist.txt
  Extracting top 30,000 tokens
  Spell checking enabled (en)
  Year range filter: 1900-2019 (inclusive)
  Generated whitelist with 30,000 tokens in 257.4s
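
The selection criteria behind the whitelist can be sketched in a few lines. This is a hedged illustration rather than the pipeline's code: `build_whitelist` is a hypothetical name, the spell check is replaced by a caller-supplied predicate standing in for `pyenchant`, and two details are assumptions: a token qualifies only if it has a nonzero count in every year of the range, and tokens are ranked by their total count within that range.

```python
from typing import Callable


def build_whitelist(
    counts: dict[str, dict[int, int]],
    top_n: int,
    year_range: tuple[int, int],
    is_spelled_ok: Callable[[str], bool],
) -> list[str]:
    """Return up to `top_n` tokens, most frequent first."""
    first, last = year_range
    required_years = range(first, last + 1)  # inclusive range
    candidates = []
    for token, by_year in counts.items():
        if not is_spelled_ok(token):
            continue  # fails the spell check
        if not all(by_year.get(year, 0) > 0 for year in required_years):
            continue  # missing from at least one year in the range
        total = sum(by_year[year] for year in required_years)
        candidates.append((total, token))
    # Sort by descending total, breaking ties alphabetically.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [token for _, token in candidates[:top_n]]


# Toy per-year counts; the numbers are invented for illustration.
toy_counts = {
    "much": {1900: 50, 1901: 60, 1902: 55},
    "tarnation": {1900: 2, 1901: 1, 1902: 3},
    "muchaa": {1900: 1, 1902: 1},        # absent in 1901
    "aaaa": {1900: 9, 1901: 9, 1902: 9},  # not in the toy dictionary
}
toy_dictionary = {"much", "tarnation", "muchaa"}

whitelist = build_whitelist(
    toy_counts, top_n=2, year_range=(1900, 1902),
    is_spelled_ok=toy_dictionary.__contains__,
)
```

In this toy run, `"muchaa"` is excluded for missing a year and `"aaaa"` for failing the spell check.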


## **Optional: Inspect Database Files**

### `db_head`: Show first N records

In [4]:
db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_processed.db'

db_head(db, n=5)

First 5 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   FALSE
     Value: Total: 124,503 occurrences in 110,465 volumes (1538-2019, 380 bins)

[ 2] Key:   TRUE
     Value: Total: 4,241,065 occurrences in 2,916,658 volumes (1501-2019, 440 bins)

[ 3] Key:   aaa
     Value: Total: 4,449,225 occurrences in 1,052,377 volumes (1477-2019, 402 bins)

[ 4] Key:   aaaa
     Value: Total: 472,912 occurrences in 91,764 volumes (1477-2019, 337 bins)

[ 5] Key:   aaaaa
     Value: Total: 54,371 occurrences in 22,966 volumes (1581-2019, 274 bins)
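
`FALSE` and `TRUE` come before `aaa` in this listing because uppercase ASCII letters have lower byte values than lowercase ones. The sketch below reproduces that ordering in plain Python, assuming the database iterates keys in bytewise order (RocksDB's default comparator). `toy_head` is a hypothetical stand-in, not the implementation of `db_head`.

```python
from itertools import islice


def toy_head(keys, n=5):
    """Return the first `n` keys in bytewise (lexicographic) order."""
    return list(islice(sorted(keys), n))


# Keys taken from the outputs in this notebook, deliberately shuffled.
toy_keys = [b"aaaa", b"TRUE", b"aaa", b"aaaaa", b"FALSE", b"much"]

first_five = toy_head(toy_keys, n=5)
```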



### `db_peek`: Show records starting from a key

In [5]:
db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_processed.db'

db_peek(db, start_key=b"much", n=5)

5 key-value pairs starting from 6d756368:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   much
     Value: Total: 5,422,178,221 occurrences in 147,808,024 volumes (1470-2019, 519 bins)

[ 2] Key:   mucha
     Value: Total: 196,322 occurrences in 86,933 volumes (1554-2019, 293 bins)

[ 3] Key:   muchaa
     Value: Total: 141 occurrences in 108 volumes (1807-2017, 75 bins)

[ 4] Key:   muchaal
     Value: Total: 232 occurrences in 118 volumes (1993-2019, 21 bins)

[ 5] Key:   muchaba
     Value: Total: 371 occurrences in 65 volumes (1848-2019, 33 bins)
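
The header above prints the start key in hexadecimal (`6d756368` is the byte string `b"much"`). The sketch below mimics a seek over a sorted key list: it finds the first key greater than or equal to the start key and returns the next `n` keys. It assumes bytewise key ordering, and `toy_peek` is a hypothetical stand-in rather than the implementation of `db_peek`.

```python
from bisect import bisect_left


def toy_peek(sorted_keys, start_key, n=5):
    """Return up to `n` keys at or after `start_key` in a sorted key list."""
    i = bisect_left(sorted_keys, start_key)
    return sorted_keys[i:i + n]


# Keys taken from the outputs in this notebook.
toy_keys = sorted(
    [b"aaa", b"much", b"mucha", b"muchaa", b"muchaal", b"muchaba", b"tarnation"]
)

header_hex = b"much".hex()                    # hex form shown in the header line
from_much = toy_peek(toy_keys, b"much", n=3)  # start key is present
from_muc = toy_peek(toy_keys, b"muc", n=1)    # start key is absent; seek lands on the next key
```

In this toy model the start key does not have to exist, since the seek simply lands on the next key in order.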



### `db_peek_prefix`: Show records matching a prefix

In [6]:
db = f'{db_path_stub}{release}/{language}/{ngram_size}gram_files/{ngram_size}grams_processed.db'

db_peek_prefix(db, prefix=b"tarnation", n=5)

5 key-value pairs with prefix 7461726e6174696f6e:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   tarnation
     Value: Total: 67,827 occurrences in 48,385 volumes (1633-2019, 260 bins)

[ 2] Key:   tarnational
     Value: Total: 114 occurrences in 92 volumes (1913-2010, 53 bins)

[ 3] Key:   tarnationest
     Value: Total: 53 occurrences in 53 volumes (1840-2019, 30 bins)

[ 4] Key:   tarnationly
     Value: Total: 94 occurrences in 94 volumes (1833-2019, 44 bins)

[ 5] Key:   tarnations
     Value: Total: 1,189 occurrences in 1,048 volumes (1606-2019, 198 bins)
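
A prefix scan differs from a plain seek in that it stops as soon as a key no longer starts with the prefix. The sketch below illustrates the difference on a sorted key list, assuming bytewise key ordering. `toy_peek_prefix` is a hypothetical stand-in rather than the implementation of `db_peek_prefix`, and `b"tarnish"` is an invented key added to show where the scan stops.

```python
from bisect import bisect_left
from itertools import islice, takewhile


def toy_peek_prefix(sorted_keys, prefix, n=5):
    """Return up to `n` keys that start with `prefix`, in sorted order."""
    i = bisect_left(sorted_keys, prefix)  # seek to the first possible match
    matching = takewhile(lambda key: key.startswith(prefix), sorted_keys[i:])
    return list(islice(matching, n))


toy_keys = sorted(
    [b"much", b"tarnation", b"tarnational", b"tarnations", b"tarnish"]
)

with_prefix = toy_peek_prefix(toy_keys, b"tarnation", n=5)
no_match = toy_peek_prefix(toy_keys, b"tarnationx", n=5)
```

A seek from `b"tarnation"` without the prefix check would continue on to `b"tarnish"`, whereas the prefix scan ends at `b"tarnations"`.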

