# **Unigrams: Full Workflow**
The primary purpose of the unigram workflow is to generate a vocabulary **whitelist** that can be used to filter out unwanted words from a multigram corpus. The workflow consists of two steps: (1) downloading the unigram corpus into a database and (2) filtering and normalizing the corpus and generating the whitelist.
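
To make the whitelist's downstream role concrete, here is a minimal sketch of filtering multigrams against a vocabulary whitelist. It assumes the whitelist is stored as one token per line (the format of `whitelist.txt` is not shown in this notebook) and that an n-gram is dropped when any of its tokens is missing from the whitelist. `load_whitelist` and `keep_ngram` are hypothetical helpers written for illustration, not part of `ngramkit`.

```python
from pathlib import Path
import tempfile


def load_whitelist(path):
    """Read a whitelist file, assuming one token per line (an assumption)."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def keep_ngram(ngram, whitelist):
    """Keep an n-gram only if every whitespace-separated token is whitelisted."""
    return all(token in whitelist for token in ngram.split())


# Tiny stand-in whitelist written to a temporary file for illustration.
with tempfile.TemporaryDirectory() as tmp:
    wl_path = Path(tmp) / "whitelist.txt"
    wl_path.write_text("much\ntime\npass\n", encoding="utf-8")
    whitelist = load_whitelist(wl_path)

candidates = ["much time", "much xqzt", "time pass"]
kept = [ng for ng in candidates if keep_ngram(ng, whitelist)]
print(kept)
```

Holding the whitelist in a `set` keeps each membership test constant-time, which matters when the same check runs over every token of a large multigram corpus.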

## **Setup**
### Imports

In [2]:
%load_ext autoreload
%autoreload 2

from pathlib import Path
from stop_words import get_stop_words
from ngramkit.ngram_filter.lemmatizer import SpacyLemmatizer
from ngramkit.ngram_acquire import download_and_ingest_to_rocksdb
from ngramkit.ngram_filter.pipeline.orchestrator import build_processed_db
from ngramkit.utilities.peek import db_head, db_peek, db_peek_prefix



### Configure
Here we set the basic parameters: the path stubs for the database and its archive, the Google Books release and corpus language to download, the n-gram size, and the width of the year bins (in years).

In [3]:
db_path_stub = '/scratch/edk202/NLP_corpora/Google_Books/'
archive_path_stub = '/scratch/edk202/NLP_archive/Google_Books/'
release = '20200217'
language = 'eng'
ngram_size = 1
bin_size = 5
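
As a rough illustration of what `bin_size = 5` implies, the sketch below groups per-year counts into 5-year bins. It assumes bins are aligned to multiples of the bin size and labeled by their first year, which is consistent with the year ranges printed in the inspection output later in this notebook but is not confirmed by it. `bin_start` and `bin_counts` are hypothetical helpers, not `ngramkit` functions.

```python
def bin_start(year, bin_size=5):
    """Return the first year of the bin containing `year` (assumed alignment)."""
    return year - (year % bin_size)


def bin_counts(yearly_counts, bin_size=5):
    """Sum per-year counts into bins keyed by each bin's first year."""
    binned = {}
    for year, count in yearly_counts.items():
        start = bin_start(year, bin_size)
        binned[start] = binned.get(start, 0) + count
    return binned


# Made-up counts for a single token, for illustration only.
yearly = {1900: 3, 1903: 4, 1905: 10, 1909: 1}
print(bin_counts(yearly))
```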

## **Step 1: Download and Ingest**

In [4]:
download_and_ingest_to_rocksdb(
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    archive_path_stub=archive_path_stub,
    ngram_type="tagged",
    overwrite_db=True,
    open_type="write:packed24",
    compact_after_ingest=True
)

N-GRAM ACQUISITION PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-29 18:25:23

Download Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Ngram repo:           https://books.storage.googleapis.com/?prefix=ngrams/books/20200217/eng/1-
DB path:              /scratch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams.db
File range:           0 to 23
Total files:          24
Files to get:         24
Skipping:             0
Download workers:     24
Batch size:           50,000
Ngram size:           1
Ngram type:           tagged
Overwrite DB:         True
DB Profile:           write:packed24

Download Progress
════════════════════════════════════════════════════════════════════════════════════════════════════


Files Processed: 100%|█████████████████████████████████████████████████████████| 24/24 [05:09<00:00]



Post-Ingestion Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════
Initial DB size:         45.90 GB
Compaction completed in 0:02:26
Size before:             45.90 GB
Size after:              57.76 GB
Space saved:             -11.86 GB (-25.8%)

Database Archival
════════════════════════════════════════════════════════════════════════════════════════════════════
Compressing database to temporary location: ...n1/job-57884/ngram_archive_ghmakj4v/1grams.db.tar.zst
Compressing /scratch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams.db to /state/partition1/job-57884/ngram_archive_ghmakj4v/1grams.db.tar.zst
Using 128 threads at compression level 3
Compression complete: /state/partition1/job-57884/ngram_archive_ghmakj4v/1grams.db.tar.zst
Transferring archive to ...dk202/NLP_archive/Google_Books/20200217/eng/1gram_files/1grams.db.tar.zst
Archive created: /scratch/edk202/NLP_archive/Google_Books/20200217/eng/1gram_fil

## **Step 2: Filter, Normalize, and Generate Whitelist**
`config.py` contains generic defaults for the filtering pipeline. You can override these defaults by passing option dictionaries to the `build_processed_db` function, as seen below. By default, we:
1. case-normalize the tokens
2. remove tokens containing non-alphanumeric text
3. remove stopwords using the `stop-words` package
4. remove tokens shorter than 3 characters
5. lemmatize the tokens with `spaCy`, first using a lookup table and falling back to rules when a lookup fails
6. create a whitelist of the 30,000 most frequent words that pass a `pyenchant` spell-check and appear in the corpus for every period from 1900 through 2019 (inclusive)
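
Steps 1 through 5 can be sketched for a single token as follows. This is only an illustration of the order of operations listed above, not the pipeline's actual code: the stop-word set, the lemma lookup table, and the plural-stripping fallback rule are tiny hypothetical stand-ins for the `stop-words` list and the `spaCy` lemmatizer, and `str.isalnum()` is an assumed approximation of the alphanumeric check.

```python
# Hypothetical stand-ins for the real resources used in this notebook.
STOP_WORDS = {"the", "and", "much"}
LEMMA_LOOKUP = {"books": "book", "children": "child"}
MIN_LENGTH = 3


def lemmatize(token):
    """Try the lookup table first, then a crude rule (strip a plural 's')."""
    if token in LEMMA_LOOKUP:
        return LEMMA_LOOKUP[token]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def normalize_token(token):
    """Return the normalized token, or None if any filter rejects it."""
    token = token.lower()                 # 1. case-normalize
    if not token.isalnum():               # 2. drop non-alphanumeric tokens
        return None
    if token in STOP_WORDS:               # 3. drop stopwords
        return None
    if len(token) < MIN_LENGTH:           # 4. drop tokens under 3 characters
        return None
    return lemmatize(token)               # 5. lemmatize what survives


samples = ["Books", "The", "e-mail", "ox", "Rivers", "glass"]
results = [normalize_token(t) for t in samples]
print(results)
```

Returning `None` for rejected tokens lets a caller separate filtering from normalization with a single check on the result.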

In [5]:
filter_options = {
    'stop_set': set(get_stop_words("english")),
    'lemma_gen': SpacyLemmatizer(language="en"),
    'bin_size': bin_size
}

whitelist_options = {
    'output_whitelist_top_n': 30_000,
    'output_whitelist_year_range': (1900, 2019),
    'output_whitelist_spell_check': True,
    'output_whitelist_spell_check_language': "en"
}

build_processed_db(
    mode="restart",
    ngram_size=ngram_size,
    repo_release_id=release,
    repo_corpus_id=language,
    db_path_stub=db_path_stub,
    num_workers=80,
    num_initial_work_units=600,
    cache_partitions=True,
    use_cached_partitions=True,
    progress_every_s=10.0,
    **filter_options,
    **whitelist_options
);


N-GRAM FILTER PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-11-29 18:36:29
Mode:       RESTART

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            /scratch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams.db
Target DB:            ...dk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams_processed.db
Temp directory:       ...tch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/processing_tmp

Parallelism
────────────────────────────────────────────────────────────────────────────────────────────────────
Workers:              80
Initial work units:   600

Database Profiles
────────────────────────────────────────────────────────────────────────────────────────────────────
Reader profile:       read:packed24
Writer profile:       write:packed24

Ingestion Configuration
────────────────────────

Shards Ingested: 100%|███████████████████████████████████████████████████████| 600/600 [02:10<00:00]



Ingestion complete: 600 shards, 17,731,248 items in 130.4s (135,960 items/s)

Phase 4: Finalizing database...
════════════════════════════════════════════════════════════════════════════════════════════════════

Post-Ingestion Compaction
────────────────────────────────────────────────────────────────────────────────────────────────────
Initial DB size:         8.92 GB
Compaction completed in 0:00:23
Size before:             8.92 GB
Size after:              7.41 GB
Space saved:             1.51 GB (16.9%)

Phase 5: Generating output whitelist...
════════════════════════════════════════════════════════════════════════════════════════════════════
  Output path: ...LP_corpora/Google_Books/20200217/eng/1gram_files/1grams_processed.db/whitelist.txt
  Extracting top 30,000 tokens
  Spell checking enabled (en)
  Year range filter: 1900-2019 (inclusive)
  Generated whitelist with 30,000 tokens in 102.8s
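
The whitelist phase logged above combines three criteria: a spell-check, presence across the year range, and a top-N frequency cut. The sketch below shows one way those criteria could compose. It assumes counts are stored per 5-year bin keyed by the bin's first year, that a token must have a nonzero count in every bin of the range, and that ranking uses the total within that range; none of these details is confirmed by the log. `build_whitelist` and the `is_valid_word` predicate are hypothetical, with the predicate standing in for the `pyenchant` check.

```python
def build_whitelist(binned_counts, top_n, year_range, bin_size=5,
                    is_valid_word=None):
    """Select the top_n tokens that pass the word check and occur in every bin."""
    start, end = year_range
    first_bin = start - (start % bin_size)
    required = range(first_bin, end + 1, bin_size)  # inclusive year range

    totals = {}
    for token, bins in binned_counts.items():
        if is_valid_word is not None and not is_valid_word(token):
            continue  # fails the (stand-in) spell-check
        if not all(bins.get(b, 0) > 0 for b in required):
            continue  # missing from at least one required bin
        totals[token] = sum(bins[b] for b in required)

    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    return ranked[:top_n]


# Made-up data for illustration only.
data = {
    "book":  {1900: 50, 1905: 60, 1910: 70},
    "time":  {1900: 90, 1905: 80, 1910: 100},
    "radio": {1905: 500, 1910: 900},              # absent from the 1900 bin
    "teh":   {1900: 1000, 1905: 1000, 1910: 1000},  # rejected by the word check
    "river": {1900: 10, 1905: 10, 1910: 10},
}
known_words = {"book", "time", "radio", "river"}

selected = build_whitelist(data, top_n=2, year_range=(1900, 1914),
                           is_valid_word=known_words.__contains__)
print(selected)
```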


## **Optional: Inspect Database**

### `db_head`: Show first N records

In [8]:
db_path = '/scratch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams_processed.db'

db_head(db_path, n=5)

First 5 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   FALSE
     Value: Total: 124,503 occurrences in 110,465 volumes (1535-2015, 90 bins)

[ 2] Key:   TRUE
     Value: Total: 4,241,065 occurrences in 2,916,658 volumes (1500-2015, 95 bins)

[ 3] Key:   aaa
     Value: Total: 5,093,481 occurrences in 1,181,441 volumes (1475-2015, 93 bins)

[ 4] Key:   aaaa
     Value: Total: 474,351 occurrences in 92,702 volumes (1475-2015, 82 bins)

[ 5] Key:   aaaaa
     Value: Total: 54,556 occurrences in 23,113 volumes (1580-2015, 79 bins)



### `db_peek`: Show records starting from a key

In [7]:
db_path = '/scratch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams_processed.db'

db_peek(db_path, start_key=b"much", n=5)

5 key-value pairs starting from 6d756368:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   much
     Value: Total: 5,422,182,873 occurrences in 147,811,157 volumes (1470-2015, 109 years)

[ 2] Key:   mucha
     Value: Total: 253,936 occurrences in 119,832 volumes (1550-2015, 93 years)

[ 3] Key:   muchaa
     Value: Total: 141 occurrences in 108 volumes (1805-2015, 35 years)

[ 4] Key:   muchaal
     Value: Total: 232 occurrences in 118 volumes (1990-2015, 6 bins)

[ 5] Key:   muchaba
     Value: Total: 371 occurrences in 65 volumes (1845-2015, 18 years)



### `db_peek_prefix`: Show records matching a prefix

In [9]:
db_path = '/scratch/edk202/NLP_corpora/Google_Books/20200217/eng/1gram_files/1grams_processed.db'

db_peek_prefix(db_path, prefix=b"tarnation", n=5)

5 key-value pairs with prefix 7461726e6174696f6e:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   tarnation
     Value: Total: 69,016 occurrences in 49,433 volumes (1605-2015, 66 bins)

[ 2] Key:   tarnational
     Value: Total: 114 occurrences in 92 volumes (1910-2010, 18 bins)

[ 3] Key:   tarnationest
     Value: Total: 53 occurrences in 53 volumes (1840-2015, 23 bins)

[ 4] Key:   tarnationly
     Value: Total: 94 occurrences in 94 volumes (1830-2015, 25 bins)
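
The difference between seeking from a key and scanning a prefix can be illustrated on a plain sorted list of byte keys, since RocksDB keeps keys in sorted order. This is a sketch of the general idea only, not how `db_peek` or `db_peek_prefix` are implemented; `seek` and `prefix_scan` are hypothetical helpers. It also shows where the hex strings in the output headers come from: they are the hex encoding of the key bytes.

```python
from bisect import bisect_left

# A small sorted key space, mimicking an ordered key-value store.
keys = sorted([
    b"much", b"mucha", b"muchaa",
    b"tarnation", b"tarnational", b"tarnationest", b"tarnationly",
    b"tarnish",
])


def seek(keys, start_key, n):
    """Return up to n keys at or after start_key, in sorted order."""
    i = bisect_left(keys, start_key)
    return keys[i:i + n]


def prefix_scan(keys, prefix, n):
    """Return up to n keys that start with prefix, stopping at the first miss."""
    matches = []
    for key in seek(keys, prefix, len(keys)):
        if not key.startswith(prefix):
            break
        matches.append(key)
        if len(matches) == n:
            break
    return matches


from_much = seek(keys, b"much", 5)
with_prefix = prefix_scan(keys, b"tarnation", 5)

print(b"much".hex())       # hex form of the start key
print(from_much)           # runs past the "much*" keys into later ones
print(with_prefix)         # fewer than 5 when only four keys match
```

A seek keeps returning keys past the point where they stop resembling the start key, whereas a prefix scan stops at the first key that does not match, so it can return fewer than `n` records.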

