# **Unigrams: Full Workflow**
The unigram workflow produces a vocabulary **whitelist** that can be used to filter unwanted words out of a multigram corpus. It has two steps: (1) download the unigram corpus into a database, and (2) filter and normalize the corpus, then generate the whitelist.
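
To make the whitelist's role concrete, here is a minimal sketch of how such a list might be applied to multigrams downstream. It assumes the whitelist is a plain-text file with one token per line (any extra columns are ignored) and that an n-gram is kept only when every one of its tokens is whitelisted. Both are assumptions for illustration, and `load_whitelist` and `keep_ngram` are hypothetical helpers, not part of `ngramkit`.

```python
from pathlib import Path
import tempfile


def load_whitelist(path):
    """Read the first whitespace-delimited field of each non-empty line into a set.

    The one-token-per-line layout is an assumption about the whitelist file.
    """
    tokens = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.split()
            if fields:
                tokens.add(fields[0])
    return tokens


def keep_ngram(ngram, whitelist):
    """Keep a space-delimited n-gram only if every token is whitelisted."""
    tokens = ngram.split()
    return bool(tokens) and all(tok in whitelist for tok in tokens)


# Tiny demonstration with a throwaway whitelist file.
with tempfile.TemporaryDirectory() as tmp:
    wl_path = Path(tmp) / "whitelist.txt"
    wl_path.write_text("much\nlove\ntime\n", encoding="utf-8")
    whitelist = load_whitelist(wl_path)

ngrams = ["much love", "much tarnation", "love time much"]
kept = [ng for ng in ngrams if keep_ngram(ng, whitelist)]
print(kept)
```

Holding the whitelist in a `set` keeps each membership test constant-time, which matters when the multigram corpus is much larger than the vocabulary.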

## **Setup**
### Imports

In [1]:
%load_ext autoreload
%autoreload 2

from pathlib import Path
from stop_words import get_stop_words
from ngramkit.ngram_filter.lemmatizer import SpacyLemmatizer
from ngramkit.ngram_acquire import download_and_ingest_to_rocksdb
from ngramkit.ngram_filter.pipeline.orchestrator import build_processed_db
from ngramkit.utilities.peek import db_head, db_peek, db_peek_prefix

### Configure

In [2]:
db_path_stub = '/scratch/edk202/NLP_corpora/Google_Books/'
archive_path_stub = '/scratch/edk202/NLP_archive/Google_Books/'
release = '20200217'
language = 'eng-fiction'
size = 1
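
The log output in the steps below suggests that the stubs are expanded into a `<stub>/<release>/<corpus>/<n>gram_files/<n>grams.db` layout. The sketch below reconstructs that layout with `pathlib`, which can save typing the full database path by hand when inspecting the results. The helper name `ngram_db_path` is hypothetical, and the layout is inferred from the logs rather than taken from `ngramkit` itself.

```python
from pathlib import Path

# Values mirror the Configure cell above.
db_path_stub = "/scratch/edk202/NLP_corpora/Google_Books/"
release = "20200217"
language = "eng-fiction"
size = 1


def ngram_db_path(stub, release, corpus, size, processed=False):
    """Build a DB path following the layout seen in the pipeline logs (inferred)."""
    suffix = "_processed" if processed else ""
    return Path(stub) / release / corpus / f"{size}gram_files" / f"{size}grams{suffix}.db"


raw_db = ngram_db_path(db_path_stub, release, language, size)
processed_db = ngram_db_path(db_path_stub, release, language, size, processed=True)
print(raw_db)
print(processed_db)
```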

## **Step 1: Download and Ingest**

In [3]:
download_and_ingest_to_rocksdb(
    ngram_size=size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    archive_path_stub=archive_path_stub,
    ngram_type="tagged",
    overwrite_db=True,
    open_type="write:packed24",
    compact_after_ingest=True
)

N-GRAM ACQUISITION PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-16 12:50:32

Download Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Ngram repo:           ...//books.storage.googleapis.com/?prefix=ngrams/books/20200217/eng-fiction/1-
DB path:              .../edk202/NLP_corpora/Google_Books/20200217/eng-fiction/1gram_files/1grams.db
File range:           0 to 0
Total files:          1
Files to get:         1
Skipping:             0
Download workers:     1
Batch size:           50,000
Ngram size:           1
Ngram type:           tagged
Overwrite DB:         True
DB Profile:           write:packed24

Download Progress
════════════════════════════════════════════════════════════════════════════════════════════════════


Files Processed: 100%|███████████████████████████████████████████████████████████| 1/1 [02:06<00:00]



Post-Ingestion Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════
Compaction completed in 0:00:17

Database Archival
════════════════════════════════════════════════════════════════════════════════════════════════════
Compressing database to temporary location: ...1/job-454076/ngram_archive_qxqup1sl/1grams.db.tar.zst
Compressing /scratch/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/1gram_files/1grams.db to /state/partition1/job-454076/ngram_archive_qxqup1sl/1grams.db.tar.zst
Using 128 threads at compression level 3
Compression complete: /state/partition1/job-454076/ngram_archive_qxqup1sl/1grams.db.tar.zst
Transferring archive to ...P_archive/Google_Books/20200217/eng-fiction/1gram_files/1grams.db.tar.zst
Archive created: ...k202/NLP_archive/Google_Books/20200217/eng-fiction/1gram_files/1grams.db.tar.zst

Processing complete!

Final Summary
══════════════════════════════════════════════════════════════════════════════

## **Step 2: Filter, Normalize, and Generate Whitelist**
`config.py` contains generic defaults for the filtering pipeline. You can override them by passing option dictionaries to `build_processed_db`, as shown below. By default, we:
1. case-normalize the tokens
2. remove tokens containing non-alphanumeric characters
3. remove stopwords using the `stop-words` package
4. remove tokens shorter than 3 characters
5. lemmatize the tokens using the `spaCy` package, first with a lookup table and then with rules when the lookup fails
6. create a whitelist of the 15,000 most frequent words that pass a `pyenchant` spell check and appear in all corpora from 1900 to 2019 (inclusive)
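
Steps 1 through 5 can be sketched as a single per-token function. This is a rough illustration only: the order of the checks, the use of `str.isalnum`, and the toy stopword set and lemmatizer are all assumptions standing in for the `stop-words` and `spaCy` components, and `filter_token` and `toy_lemmatize` are hypothetical names rather than `ngramkit` functions.

```python
def filter_token(token, stop_set, lemmatize, min_len=3):
    """Return the normalized token, or None if any filter rejects it."""
    token = token.lower()            # 1. case-normalize
    if not token.isalnum():          # 2. reject tokens with non-alphanumeric characters
        return None
    if token in stop_set:            # 3. drop stopwords
        return None
    if len(token) < min_len:         # 4. drop tokens shorter than min_len
        return None
    return lemmatize(token)          # 5. lemmatize what survives


# Stand-ins for illustration only: a tiny stopword set and a lookup-then-rule lemmatizer.
STOP_SET = {"the", "and", "was"}
LEMMA_LOOKUP = {"mice": "mouse", "ran": "run"}


def toy_lemmatize(token):
    """Try a lookup table first, then fall back to a crude plural-stripping rule."""
    if token in LEMMA_LOOKUP:
        return LEMMA_LOOKUP[token]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


raw = ["The", "Mice", "ran", "to", "well-known", "houses", "and", "Gardens"]
candidates = (filter_token(r, STOP_SET, toy_lemmatize) for r in raw)
processed = [t for t in candidates if t is not None]
print(processed)
```

Returning `None` for rejected tokens lets the caller drop them in one pass without raising exceptions for ordinary, expected rejections.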

In [4]:
filter_options = {
    'stop_set': set(get_stop_words("english")),
    'lemma_gen': SpacyLemmatizer(language="en")
}

whitelist_options = {
    'output_whitelist_top_n': 15_000,
    'output_whitelist_year_range': (1900, 2019),
    'output_whitelist_spell_check': True,
    'output_whitelist_spell_check_language': "en"
}

build_processed_db(
    mode="restart",
    ngram_size=size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=100,
    num_initial_work_units=600,
    cache_partitions=True,
    use_cached_partitions=True,
    progress_every_s=10.0,
    **filter_options,
    **whitelist_options
);


N-GRAM FILTER PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-16 12:53:19
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            .../edk202/NLP_corpora/Google_Books/20200217/eng-fiction/1gram_files/1grams.db
Target DB:            ...P_corpora/Google_Books/20200217/eng-fiction/1gram_files/1grams_processed.db
Temp directory:       ...02/NLP_corpora/Google_Books/20200217/eng-fiction/1gram_files/processing_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              100
Initial work units:   600

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24

Ingestion Configuration
────────────────────

Shards Ingested: 100%|███████████████████████████████████████████████████████| 600/600 [01:19<00:00]



Ingestion complete: 600 shards, 1,394,337 items in 79.8s (17,482 items/s)

Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

Post-Ingestion Compaction
────────────────────────────────────────────────────────────────────────────────────────────────────
Initial DB size:         1.63 GB
Compaction completed in 0:00:07
Size before:             1.63 GB
Size after:              1.63 GB
Space saved:             17.06 KB (0.0%)

Phase 5: Generating output whitelist...
════════════════════════════════════════════════════════════════════════════════════════════════════
  Output path: ...ra/Google_Books/20200217/eng-fiction/1gram_files/1grams_processed.db/whitelist.txt
  Extracting top 15,000 tokens
  Spell checking enabled (en)
  Year range filter: 1900-2019 (inclusive)
  Generated whitelist with 15,000 tokens in 19.1s
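
The whitelist selection reported above can be approximated in a few lines. The sketch assumes per-token yearly counts are available as a dictionary, reads the year-range filter as "the token must occur in every year of the range", and ranks by total occurrences within that range. Those readings, the `build_whitelist` name, and the `spell_ok` callback (a stand-in for the `pyenchant` check) are assumptions, not the pipeline's actual implementation.

```python
def build_whitelist(year_counts, top_n, year_range, spell_ok=lambda tok: True):
    """Pick the top_n most frequent tokens that occur in every year of year_range."""
    first, last = year_range
    required_years = set(range(first, last + 1))    # inclusive on both ends
    totals = {}
    for token, counts in year_counts.items():
        if not required_years.issubset(counts):     # token must appear in every year
            continue
        if not spell_ok(token):                     # stand-in for a spell check
            continue
        totals[token] = sum(counts[y] for y in required_years)
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    return ranked[:top_n]


# Toy data: token -> {year: occurrences}.
toy = {
    "much": {1900: 50, 1901: 60, 1902: 55},
    "love": {1900: 30, 1901: 20, 1902: 25},
    "radio": {1901: 500, 1902: 900},       # missing 1900, so excluded
    "teh": {1900: 5, 1901: 5, 1902: 5},    # rejected by the toy spell check
}
whitelist = build_whitelist(
    toy, top_n=2, year_range=(1900, 1902), spell_ok=lambda t: t != "teh"
)
print(whitelist)
```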


## **Optional: Inspect Database**

### `db_head`: Show first N records

In [6]:
db_path = '/scratch/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/1gram_files/1grams_processed.db'

db_head(str(db_path), n=5)

First 5 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   FALSE
     Value: Total: 16,185 occurrences in 14,116 volumes (1685-2019, 240 years)

[ 2] Key:   TRUE
     Value: Total: 429,584 occurrences in 319,392 volumes (1587-2019, 299 years)

[ 3] Key:   aaa
     Value: Total: 54,159 occurrences in 30,144 volumes (1684-2019, 206 years)

[ 4] Key:   aaaa
     Value: Total: 5,845 occurrences in 4,093 volumes (1803-2019, 134 years)

[ 5] Key:   aaaaa
     Value: Total: 4,466 occurrences in 2,432 volumes (1818-2019, 102 years)
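
`FALSE` and `TRUE` appear before `aaa` in the listing above. That is consistent with keys being iterated in bytewise lexicographic order (the default comparator in RocksDB), where uppercase ASCII letters sort before lowercase ones. A quick illustration in plain Python, independent of `ngramkit`:

```python
# The five keys from the listing above, deliberately shuffled.
keys = [b"aaa", b"TRUE", b"aaaa", b"FALSE", b"aaaaa"]

# Python compares bytes objects byte by byte, so sorted() gives bytewise order.
ordered = sorted(keys)
print(ordered)

# 'F' (0x46) and 'T' (0x54) are smaller than 'a' (0x61).
print(hex(ord("F")), hex(ord("T")), hex(ord("a")))
```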



### `db_peek`: Show records starting from a key

In [7]:
db_path = '/scratch/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/1gram_files/1grams_processed.db'

db_peek(str(db_path), start_key=b"much", n=5)

5 key-value pairs starting from 6d756368:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   much
     Value: Total: 458,589,086 occurrences in 15,123,921 volumes (1578-2019, 375 years)

[ 2] Key:   mucha
     Value: Total: 17,451 occurrences in 9,880 volumes (1652-2019, 202 years)

[ 3] Key:   muchabashed
     Value: Total: 42 occurrences in 42 volumes (1851-2019, 17 years)

[ 4] Key:   muchabout
     Value: Total: 186 occurrences in 182 volumes (1790-2018, 34 years)

[ 5] Key:   muchabused
     Value: Total: 1,018 occurrences in 1,011 volumes (1784-2019, 157 years)



### `db_peek_prefix`: Show records matching a prefix

In [8]:
db_path = '/scratch/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/1gram_files/1grams_processed.db'

db_peek_prefix(str(db_path), prefix=b"tarnation", n=5)

5 key-value pairs with prefix 7461726e6174696f6e:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   tarnation
     Value: Total: 33,289 occurrences in 22,081 volumes (1797-2019, 201 years)
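
The two peek helpers differ in where they stop: judging by the output, `db_peek` seeks to the start key and keeps reading in key order, while `db_peek_prefix` returns only keys that begin with the prefix, which would explain why `tarnation` yields a single record even with `n=5`. The hex strings in the output headers match the hex encoding of the byte keys. The sketch below imitates both behaviours on a sorted list of toy keys; `peek` and `peek_prefix` are hypothetical stand-ins, not the `ngramkit` functions, and `mud` and `tarp` are made-up keys.

```python
import bisect

# Toy keys in bytewise order; b"mud" and b"tarp" are invented for illustration.
keys = sorted([b"much", b"mucha", b"muchabout", b"mud", b"tarnation", b"tarp"])


def peek(sorted_keys, start_key, n):
    """Return up to n keys >= start_key (seek semantics)."""
    i = bisect.bisect_left(sorted_keys, start_key)
    return sorted_keys[i:i + n]


def peek_prefix(sorted_keys, prefix, n):
    """Return up to n keys that begin with prefix (prefix-scan semantics)."""
    i = bisect.bisect_left(sorted_keys, prefix)
    out = []
    for key in sorted_keys[i:]:
        if len(out) == n or not key.startswith(prefix):
            break
        out.append(key)
    return out


print(peek(keys, b"much", 5))           # runs past the "much" prefix into later keys
print(peek_prefix(keys, b"much", 5))    # stops at the first key without the prefix
print(peek_prefix(keys, b"tarnation", 5))
print(b"much".hex(), b"tarnation".hex())
```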

