# **Unigrams: Full Workflow**
The unigram workflow exists mainly to generate a vocabulary **whitelist** for filtering unwanted words out of a multigram corpus. It has two steps: (1) download the unigram corpus into a database, and (2) filter and normalize the corpus, then generate the whitelist.

## **Setup**
### Imports

In [1]:
%load_ext autoreload
%autoreload 2

from stop_words import get_stop_words
from ngramprep.ngram_filter.lemmatizer import CachedSpacyLemmatizer
from ngramprep.ngram_acquire import download_and_ingest_to_rocksdb
from ngramprep.ngram_filter.pipeline.orchestrator import build_processed_db
from ngramprep.utilities.peek import db_head, db_peek, db_peek_prefix

### Configure
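Here we set the basic parameters: where the database lives (`db_path_stub`), which corpus to download (`release` and `language`), the size of the ngrams to download (`ngram_size`; 1 for unigrams), and the size of the year bins (`bin_size`).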

In [2]:
db_path_stub = '/scratch/edk202/NLP_corpora/Google_Books/'
archive_path_stub = None
release = '20200217'
language = 'eng-us'
ngram_size = 1
bin_size = 1
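
The pipeline logs later in this notebook report database paths that look like a simple combination of these parameters. The sketch below rebuilds those paths so you can predict where the databases will land. `expected_db_path` is a hypothetical helper written for illustration, not part of `ngramprep`, and the layout is inferred only from the logged paths.

```python
from pathlib import PurePosixPath


def expected_db_path(db_path_stub, release, language, ngram_size, processed=False):
    """Hypothetical helper: rebuild the DB path pattern seen in the pipeline logs.

    Assumed layout (inferred from the logs, not from the library source):
    <stub>/<release>/<language>/<n>gram_files/<n>grams[_processed].db
    """
    suffix = "_processed" if processed else ""
    return str(
        PurePosixPath(db_path_stub)
        / release
        / language
        / f"{ngram_size}gram_files"
        / f"{ngram_size}grams{suffix}.db"
    )


stub = "/scratch/edk202/NLP_corpora/Google_Books/"
print(expected_db_path(stub, "20200217", "eng-us", 1))
print(expected_db_path(stub, "20200217", "eng-us", 1, processed=True))
```

If the library changes its directory layout, this helper will no longer match; treat the logged `DB path` as the source of truth.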

## **Step 1: Download and Ingest**

In [3]:
download_and_ingest_to_rocksdb(
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    archive_path_stub=archive_path_stub,
    ngram_type="tagged",
    overwrite_db=True,
    open_type="write:packed24",
    compact_after_ingest=True
)

N-GRAM ACQUISITION PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2026-01-06 22:44:33

Download Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Ngram repo:           https://books.storage.googleapis.com/?prefix=ngrams/books/20200217/eng-us/1-
DB path:              /scratch/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/1grams.db
File range:           0 to 13
Total files:          14
Files to get:         14
Skipping:             0
Download workers:     14
Batch size:           50,000
Ngram size:           1
Ngram type:           tagged
Overwrite DB:         True
DB Profile:           write:packed24

Download Progress
════════════════════════════════════════════════════════════════════════════════════════════════════


Files Processed: 100%|█████████████████████████████████████████████████████████| 14/14 [03:41<00:00]



Post-Ingestion Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════
Initial DB size:         17.44 GB
Compaction completed in 0:01:29
Size before:             17.44 GB
Size after:              35.28 GB
Space saved:             -17.84 GB (-102.2%)

Processing complete!

Final Summary
════════════════════════════════════════════════════════════════════════════════════════════════════
Fully processed files:       14
Failed files:                0
Total entries written:       26,292,918
Write batches flushed:       14
Uncompressed data processed: 26.54 GB
Processing throughput:       87.19 MB/sec

End Time: 2026-01-06 22:49:44.894990
Total Runtime: 0:05:11.727399
Time per file: 0:00:22.266243
Files per hour: 161.7
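
The derived figures in the summary can be checked by hand. The sketch below recomputes the per-file timing and the compaction "space saved" from the logged values. The formulas are assumptions that appear consistent with the log, not the pipeline's actual code, and both function names are hypothetical.

```python
from datetime import timedelta


def per_file_stats(total_runtime, n_files):
    """Assumed formulas: average time per file and the implied files per hour."""
    time_per_file = total_runtime / n_files
    files_per_hour = 3600 / time_per_file.total_seconds()
    return time_per_file, files_per_hour


def space_saved(before_gb, after_gb):
    """Assumed formula: absolute and percentage change relative to the size before."""
    saved = before_gb - after_gb
    return saved, 100 * saved / before_gb


# Values copied from the log above
runtime = timedelta(minutes=5, seconds=11, microseconds=727399)
time_per_file, files_per_hour = per_file_stats(runtime, 14)
print(time_per_file, round(files_per_hour, 1))

saved_gb, saved_pct = space_saved(17.44, 35.28)
print(round(saved_gb, 2), round(saved_pct, 1))
```

A negative "space saved" means the database grew during compaction, here from 17.44 GB to 35.28 GB. The percentage computed from the rounded sizes can differ from the logged value in the last digit, presumably because the log works from unrounded byte counts.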


## **Step 2: Filter, Normalize, and Generate Whitelist**
`config.py` contains generic defaults for the filtering pipeline. You can override them by passing option dictionaries to the `build_processed_db` function, as seen below. By default, we:
1. case-normalize the tokens
2. remove tokens containing non-alphanumeric characters
3. remove stopwords using the `stop-words` package
4. lemmatize the tokens using the `spaCy` package, first using a lookup table and falling back to rules when lookups fail
5. create a whitelist of the 20,000 most frequent words that pass a `pyenchant` spell-check and appear in all corpora from 1900–2019 (inclusive).
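
To make steps 1–4 concrete, here is a minimal, self-contained sketch of the per-token logic. It is not the `ngramprep` implementation: the stopword set and lemma table are toy stand-ins for the `stop-words` list and the spaCy lemmatizer, and `toy_lemmatize` and `normalize` are hypothetical names.

```python
# Toy stand-ins (assumptions); the real pipeline uses `stop-words` and spaCy.
STOP_WORDS = {"the", "of", "and", "a"}
LEMMA_LOOKUP = {"books": "book", "ran": "run", "mice": "mouse"}


def toy_lemmatize(token):
    """Lookup table first, then a crude rule-based fallback (strip a plural 's')."""
    if token in LEMMA_LOOKUP:
        return LEMMA_LOOKUP[token]
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(token):
    """Return the normalized token, or None if the token is filtered out."""
    token = token.lower()           # step 1: case-normalize
    if not token.isalnum():         # step 2: drop non-alphanumeric tokens
        return None
    if token in STOP_WORDS:         # step 3: drop stopwords
        return None
    return toy_lemmatize(token)     # step 4: lemmatize


for raw in ["Books", "The", "rock-n-roll", "Letters", "glass"]:
    print(raw, "->", normalize(raw))
```

Tokens that normalize to the same string would then be merged in the processed database, which is presumably why it holds fewer entries than the raw one.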

In [3]:
filter_options = {
    'stop_set': set(get_stop_words("english")),
    'lemma_gen': CachedSpacyLemmatizer(language="en"),
    'bin_size': bin_size
}

whitelist_options = {
    'output_whitelist_top_n': 20_000,
    'output_whitelist_year_range': (1900, 2019),
    'output_whitelist_spell_check': True,
    'output_whitelist_spell_check_language': "en"
}

build_processed_db(
    mode="restart",
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=20,
    num_initial_work_units=300,
    cache_partitions=True,
    use_cached_partitions=True,
    progress_every_s=5,
    compact_after_ingest=True,
    **filter_options,
    **whitelist_options
);


N-GRAM FILTER PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2026-01-06 23:08:39
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            /scratch/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/1grams.db
Target DB:            ...02/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/1grams_processed.db
Temp directory:       .../edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/processing_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              20
Initial work units:   300

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24

Ingestion Configuration
─────────────────────

Shards Ingested: 100%|███████████████████████████████████████████████████████| 300/300 [06:27<00:00]



Ingestion complete: 300 shards, 11,509,809 items in 387.5s (29,704 items/s)

Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

Post-Ingestion Compaction
────────────────────────────────────────────────────────────────────────────────────────────────────
Initial DB size:         14.97 GB
Compaction completed in 0:00:37
Size before:             14.97 GB
Size after:              12.99 GB
Space saved:             1.98 GB (13.2%)

Phase 5: Generating output whitelist...
════════════════════════════════════════════════════════════════════════════════════════════════════
  Output path: ...corpora/Google_Books/20200217/eng-us/1gram_files/1grams_processed.db/whitelist.txt
  Extracting top 20,000 tokens
  Spell checking enabled (en)
  Year range filter: 1900-2019 (inclusive)
  Generated whitelist with 20,000 tokens in 144.5s

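
The whitelist criteria reported in Phase 5 (top N tokens, spell check, presence across the whole year range) can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline's code: the in-memory `{token: {year: count}}` layout, the `build_whitelist` name, and ranking by total occurrences within the year range are all hypothetical, and a simple predicate stands in for the `pyenchant` check.

```python
from collections import Counter


def build_whitelist(records, top_n, year_range, is_valid_word=lambda w: True):
    """Hypothetical sketch. records maps token -> {year: count}.

    Keeps tokens that pass the validity check and occur in every year of the
    inclusive range, then returns the top_n by total count within that range.
    """
    first, last = year_range
    required_years = range(first, last + 1)
    totals = Counter()
    for token, by_year in records.items():
        if not is_valid_word(token):
            continue
        if all(by_year.get(year, 0) > 0 for year in required_years):
            totals[token] = sum(by_year[year] for year in required_years)
    return [token for token, _ in totals.most_common(top_n)]


records = {
    "book": {1900: 5, 1901: 7, 1902: 6},
    "radio": {1901: 9, 1902: 30},             # absent in 1900, so excluded
    "teh": {1900: 50, 1901: 50, 1902: 50},    # rejected by the stand-in spell check
    "word": {1900: 2, 1901: 2, 1902: 2},
}
whitelist = build_whitelist(records, 2, (1900, 1902), is_valid_word=lambda w: w != "teh")
print(whitelist)
```

Requiring presence in every year keeps the vocabulary stable over time, at the cost of excluding words that entered the language partway through the range.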

## **Optional: Inspect Database**

### `db_head`: Show first N records

In [4]:
# Path to the processed database built in Step 2
db_path = '/scratch/edk202/NLP_corpora/Google_Books/20200217/eng-us/1gram_files/1grams_processed.db'

db_head(db_path, n=5)

First 5 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   FALSE
     Value: Total: 55,745 occurrences in 50,370 volumes (1575-2019, 240 bins)

[ 2] Key:   TRUE
     Value: Total: 2,290,338 occurrences in 1,573,605 volumes (1501-2019, 330 bins)

[ 3] Key:   aaa
     Value: Total: 3,420,745 occurrences in 734,111 volumes (1478-2019, 246 bins)

[ 4] Key:   aaaa
     Value: Total: 379,478 occurrences in 58,543 volumes (1508-2019, 220 bins)

[ 5] Key:   aaaaa
     Value: Total: 39,678 occurrences in 14,970 volumes (1616-2019, 195 bins)
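
The uppercase keys `FALSE` and `TRUE` come first because, assuming the database uses RocksDB's default bytewise comparator, keys are ordered by raw byte value, and ASCII uppercase letters sort before lowercase ones. A quick check in plain Python:

```python
# Python compares bytes objects lexicographically by byte value,
# which is the same ordering a bytewise key comparator produces.
keys = [b"aaa", b"TRUE", b"aaaa", b"FALSE", b"aaaaa"]
ordered = sorted(keys)
print(ordered)
```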



### `db_peek`: Show records starting from a key

In [5]:
# Note: this path points to the `eng` corpus, whereas the steps above built `eng-us`.
db_path = '/scratch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams_processed.db'

db_peek(db_path, start_key=b"much", n=5)

5 key-value pairs starting from 6d756368:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   much
     Value: Total: 5,422,178,221 occurrences in 147,808,024 volumes (1470-2019, 519 bins)

[ 2] Key:   mucha
     Value: Total: 196,322 occurrences in 86,933 volumes (1554-2019, 293 bins)

[ 3] Key:   muchaa
     Value: Total: 141 occurrences in 108 volumes (1807-2017, 75 bins)

[ 4] Key:   muchaal
     Value: Total: 232 occurrences in 118 volumes (1993-2019, 21 bins)

[ 5] Key:   muchaba
     Value: Total: 371 occurrences in 65 volumes (1848-2019, 33 bins)
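
The header line shows the start key in hexadecimal (`6d756368` is `b"much"`). Conceptually, a peek of this kind seeks to the first key at or after the start key and then reads forward in key order. The sketch below emulates that on a sorted list; `peek` is a hypothetical stand-in, not the `db_peek` implementation.

```python
from bisect import bisect_left


def peek(sorted_keys, start_key, n):
    """Emulate a seek: return up to n keys at or after start_key, in order."""
    i = bisect_left(sorted_keys, start_key)
    return sorted_keys[i:i + n]


keys = sorted([b"mucha", b"much", b"muchaa", b"mud", b"move"])
print(b"much".hex())             # hex form of the start key
print(peek(keys, b"much", 3))
print(peek(keys, b"muc", 1))     # the start key need not exist in the data
```

Because the scan simply continues in key order, it can run past keys that share the start key's spelling once `n` is large enough.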



### `db_peek_prefix`: Show records matching a prefix

In [6]:
# Note: this path points to the `eng` corpus, whereas the steps above built `eng-us`.
db_path = '/scratch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams_processed.db'

db_peek_prefix(db_path, prefix=b"tarnation", n=5)

5 key-value pairs with prefix 7461726e6174696f6e:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   tarnation
     Value: Total: 67,827 occurrences in 48,385 volumes (1633-2019, 260 bins)

[ 2] Key:   tarnational
     Value: Total: 114 occurrences in 92 volumes (1913-2010, 53 bins)

[ 3] Key:   tarnationest
     Value: Total: 53 occurrences in 53 volumes (1840-2019, 30 bins)

[ 4] Key:   tarnationly
     Value: Total: 94 occurrences in 94 volumes (1833-2019, 44 bins)

[ 5] Key:   tarnations
     Value: Total: 1,189 occurrences in 1,048 volumes (1606-2019, 198 bins)
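
A prefix scan differs from a plain peek in that it stops as soon as a key no longer starts with the prefix. The sketch below emulates this on a sorted list; `peek_prefix` is a hypothetical stand-in for illustration, not the `db_peek_prefix` implementation.

```python
from bisect import bisect_left


def peek_prefix(sorted_keys, prefix, n):
    """Return up to n keys that start with prefix, in key order."""
    matches = []
    for key in sorted_keys[bisect_left(sorted_keys, prefix):]:
        # Keys are sorted, so the first non-matching key ends the scan.
        if len(matches) == n or not key.startswith(prefix):
            break
        matches.append(key)
    return matches


keys = sorted([b"tarnation", b"tarnations", b"tarnish", b"tarn", b"tar"])
print(bytes.fromhex("7461726e6174696f6e"))   # decode the hex prefix from the header
print(peek_prefix(keys, b"tarnation", 5))
```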

