# **Davies Corpus: Acquisition Workflow**
This workflow ingests a locally downloaded Davies corpus (e.g., COHA, COCA) into a RocksDB database. Unlike Google Books ngrams, which are downloaded on the fly, Davies corpora must be obtained separately and stored locally before running this notebook.

## **Setup**
### Imports

In [1]:
%load_ext autoreload
%autoreload 2

from stop_words import get_stop_words
from ngramkit.ngram_filter.lemmatizer import CachedSpacyLemmatizer
from davieskit.davies_acquire import ingest_davies_corpus
from davieskit.davies_filter import filter_davies_corpus, write_whitelist
from ngramkit.utilities.peek import db_head, db_peek

### Configure
Here we set the basic parameters: the corpus name and the base path (`db_path_stub`) under which the downloaded corpus files and the databases are stored.

In [2]:
corpus_name = 'COHA'
db_path_stub = f'/scratch/edk202/NLP_corpora/{corpus_name}/'
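
Because the corpus must already be on disk, it can help to confirm the expected layout before ingesting. The sketch below assumes the raw files sit in a `text` subdirectory of the corpus root, as the "Text directory" line of the ingestion log suggests. `find_corpus_files` is a hypothetical helper and not part of `davieskit`.

```python
from pathlib import Path


def find_corpus_files(corpus_root, text_subdir="text"):
    """Return the sorted files under <corpus_root>/<text_subdir>.

    Hypothetical helper: the 'text' subdirectory name is an assumption
    taken from the ingestion log, not from davieskit's documentation.
    """
    text_dir = Path(corpus_root) / text_subdir
    if not text_dir.is_dir():
        raise FileNotFoundError(f"Expected corpus text directory at {text_dir}")
    # Only regular files are returned; nested directories are ignored.
    return sorted(p for p in text_dir.iterdir() if p.is_file())
```

In this notebook it could be called as `find_corpus_files(db_path_stub)` before the ingestion cell; for COHA the log reports 20 text files.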

## **Ingest Corpus into Database**

In [3]:
combined_bigrams = {
    "working class", "working classes",
    "middle class", "middle classes",
    "lower class", "lower classes",
    "upper class", "upper classes",
    "human being", "human beings"
}

ingest_davies_corpus(
    db_path_stub=db_path_stub,
    workers=24,
    write_batch_size=500_000,
    track_genre=False,
    compact_after=True,
    combined_bigrams=combined_bigrams
)

COHA CORPUS ACQUISITION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-12-16 21:30:24

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Corpus path:          /scratch/edk202/NLP_corpora/COHA
Text directory:       /scratch/edk202/NLP_corpora/COHA/text
DB path:              /scratch/edk202/NLP_corpora/COHA/COHA
Text files found:     20
Workers:              24
Batch size:           500,000
Genre tracking:       Disabled

Processing Files
════════════════════════════════════════════════════════════════════════════════════════════════════


Files Processed: 100%|█████████████████████████████████████████████████████████| 20/20 [01:53<00:00]



Post-Ingestion Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════
Initial DB size:         1.71 GB
Compaction completed in 0:00:21
Size before:             1.71 GB
Size after:              2.44 GB
Space saved:             -741.19 MB (-42.3%)

Processing complete!

Final Summary
════════════════════════════════════════════════════════════════════════════════════════════════════
Files processed:          20/20
Failed files:             0
Total sentences written:  23,128,954
Database path:            /scratch/edk202/NLP_corpora/COHA/COHA

End Time: 2025-12-16 21:32:40
Total Runtime: 0:02:16.109666
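
The `combined_bigrams` argument appears to merge each listed two-word phrase into a single hyphenated token; the keys inspected later in this notebook contain `human-being`, for example. A minimal sketch of that idea follows. `merge_bigrams` is a hypothetical function, and the case-insensitive matching is an assumption; `davieskit` may implement this differently.

```python
def merge_bigrams(tokens, combined_bigrams, joiner="-"):
    """Join adjacent tokens that form a listed bigram into one token.

    Hypothetical sketch: matching is case-insensitive (an assumption),
    and the original casing of the tokens is preserved in the result.
    """
    pairs = {tuple(b.lower().split()) for b in combined_bigrams}
    merged = []
    i = 0
    while i < len(tokens):
        is_pair = (
            i + 1 < len(tokens)
            and (tokens[i].lower(), tokens[i + 1].lower()) in pairs
        )
        if is_pair:
            merged.append(tokens[i] + joiner + tokens[i + 1])
            i += 2  # skip the second word of the merged bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_bigrams("I am a human being too".split(), {"human being"}))
# ['I', 'am', 'a', 'human-being', 'too']
```

Merging at ingestion time means the hyphenated forms can later be protected from filtering by listing them in `always_include`.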



## **Filter Database**

In [4]:
filter_options = {
    'stop_set': set(get_stop_words("english")),
    'lemma_gen': CachedSpacyLemmatizer()
}

always_include_tokens = {
    "working-class", "working-classes",
    "middle-class", "middle-classes",
    "lower-class", "lower-classes",
    "upper-class", "upper-classes",
    "human-being", "human-beings"
}

filter_davies_corpus(
    src_db_path=f'{db_path_stub}/{corpus_name}',
    dst_db_path=f'{db_path_stub}/{corpus_name}_filtered',
    workers=100,
    batch_size=250_000,
    track_genre=False,
    create_whitelist=True,
    whitelist_path=f'{db_path_stub}/{corpus_name}_whitelist.txt',
    apply_whitelist=True,
    whitelist_size=30_000,
    whitelist_spell_check=True,
    whitelist_year_range=(1900, 2020),
    compact_after=True,
    always_include=always_include_tokens,
    **filter_options
)

COHA CORPUS FILTERING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-12-16 21:32:46

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            /scratch/edk202/NLP_corpora/COHA/COHA
Destination DB:       /scratch/edk202/NLP_corpora/COHA/COHA_filtered
Lowercase:            True
Alpha only:           True
Filter short:         True (min_len=3)
Filter stops:         True
Apply lemmas:         True
Workers:              100
Batch size:           250,000

Processing Sentences
════════════════════════════════════════════════════════════════════════════════════════════════════


Batches Processed: 100%|███████████████████████████████████████████████████████| 89/89 [00:54<00:00]



Building Whitelist
════════════════════════════════════════════════════════════════════════════════════════════════════
Whitelist path:          /scratch/edk202/NLP_corpora/COHA/COHA_whitelist.txt
Top N tokens:            30,000
Year range:              1900-2020
Spell check:             True

Scanning database...
Found 20,645,805 sentences in 413 batches
Detecting years present in corpus...


Building token frequencies: 100%|████████████████████████████████████████████| 413/413 [03:33<00:00]



Years present in corpus within range: 11 years
  Range: 1900 to 2000
  Years: [1900, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000]

Filtering tokens by year coverage (must appear in all 11 years)...
Tokens before year filter: 64,639
Tokens after year filter:  23,845
Tokens removed:            40,794

Ranking 23,845 unique tokens...
Selected top 30,000 tokens (+ 0 always_include)
Writing whitelist to /scratch/edk202/NLP_corpora/COHA/COHA_whitelist.txt...
Whitelist written successfully: /scratch/edk202/NLP_corpora/COHA/COHA_whitelist.txt

Applying Whitelist
════════════════════════════════════════════════════════════════════════════════════════════════════
Loading whitelist into memory...
Loaded 23,845 tokens from whitelist

Replacing non-whitelist tokens with <UNK>...



Batches Processed: 100%|███████████████████████████████████████████████████████| 83/83 [00:55<00:00]



Whitelist application complete!
Sentences processed:      20,645,805
Sentences modified:       719,178

Post-Filter Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════
Initial DB size:         2.39 GB
Compaction completed in 0:00:18
Size before:             2.39 GB
Size after:              2.39 GB
Space saved:             244.70 KB (0.0%)


Processing complete!

Final Summary
════════════════════════════════════════════════════════════════════════════════════════════════════
Sentences read:           22,172,210
Writes accumulated:       20,802,803
Sentences rejected:       1,369,407
Retention rate:           93.8%
Destination DB:           /scratch/edk202/NLP_corpora/COHA/COHA_filtered

End Time: 2025-12-16 21:40:18
Total Runtime: 0:07:32.701842
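
The whitelist step in the log above keeps only tokens that occur in every year bin within the requested range, ranks the survivors, keeps the top N, and then replaces all other tokens with `<UNK>`. The sketch below illustrates that logic on in-memory data. It is not `davieskit`'s implementation: `build_whitelist` and `apply_whitelist` are hypothetical names, the spell check and `always_include` handling are omitted, and ranking by raw total frequency is an assumption.

```python
from collections import Counter, defaultdict


def build_whitelist(sentences, years_required, top_n):
    """Pick the top_n most frequent tokens that occur in every required year.

    `sentences` is an iterable of (year, tokens) pairs. Hypothetical sketch:
    ties are broken alphabetically, which is an arbitrary choice.
    """
    totals = Counter()
    years_seen = defaultdict(set)
    for year, tokens in sentences:
        for tok in tokens:
            totals[tok] += 1
            years_seen[tok].add(year)
    required = set(years_required)
    # Year-coverage filter: the token must appear in all required years.
    eligible = [t for t in totals if required <= years_seen[t]]
    eligible.sort(key=lambda t: (-totals[t], t))
    return eligible[:top_n]


def apply_whitelist(tokens, whitelist, unk="<UNK>"):
    """Replace every token that is not on the whitelist with `unk`."""
    allowed = set(whitelist)
    return [t if t in allowed else unk for t in tokens]


corpus = [
    (1900, ["work", "class", "rare"]),
    (1910, ["work", "class"]),
    (1910, ["work"]),
]
whitelist = build_whitelist(corpus, years_required=[1900, 1910], top_n=2)
print(whitelist)                                               # ['work', 'class']
print(apply_whitelist(["work", "rare", "class"], whitelist))   # ['work', '<UNK>', 'class']
```

This ordering of steps also accounts for the whitelist holding 23,845 tokens rather than the requested 30,000: only 23,845 tokens passed the year-coverage filter, so the top-N cut had nothing left to remove.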



## **Optional: Inspect Database**

### `db_head`: Show first N records

In [5]:
db_path = f'/scratch/edk202/NLP_corpora/{corpus_name}/{corpus_name}'

db_head(db_path, n=15)

First 15 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1810] &; Leon
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 2] Key:   [1810] &; Mad
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 3] Key:   [1810] &; Moth
     Value: Total: 2 occurrences in 1 volumes (1810-1810, 1 bins)

[ 4] Key:   [1810] &; Stur
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 5] Key:   [1810] &; urg 'd him to the grave where sleeps His greatness Borodino 's field
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 6] Key:   [1810] 'The standard of good behavior for the continuance in office of the-judieial magistracy is certainly one of the most valuable of the modern improvements in the practice of government
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 7] Key:   [1810] 'm thine
     Value: Total: 1 occurrences in 1 volum

### `db_peek`: Show records starting from a key

In [5]:
db_path = f'/scratch/edk202/NLP_corpora/{corpus_name}/{corpus_name}'

db_peek(db_path, start_key="[1980] I am a human-being", n=5)


5 key-value pairs starting from 000007bc4920616d20612068756d616e2d6265696e67:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1980] I am a human-being too he pleaded on a note of despair
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 2] Key:   [1980] I am a hundred percent disbeliever in war
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 3] Key:   [1980] I am a layman a physician and a Catholic
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 4] Key:   [1980] I am a literature and history buff and therefore my examples are heavily weighted in those directions.
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 5] Key:   [1980] I am a little nervous too with everybody on and off the ship watching but this really
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)
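
The hexadecimal string in the header above is the raw start key. Decoding it suggests a layout of a 4-byte big-endian year followed by the UTF-8 sentence text (`0x000007bc` is 1980). The sketch below reproduces that layout with `struct`. It is inferred from the printed output rather than from `ngramkit`'s documented schema, and `encode_key` and `decode_key` are hypothetical names.

```python
import struct


def encode_key(year, text):
    """Pack a year as a 4-byte big-endian integer, then append UTF-8 text.

    Hypothetical helper: the layout is inferred from the hex string that
    db_peek prints, not taken from ngramkit's documentation.
    """
    return struct.pack(">I", year) + text.encode("utf-8")


def decode_key(raw):
    """Split a raw key back into (year, text) under the same assumed layout."""
    (year,) = struct.unpack(">I", raw[:4])
    return year, raw[4:].decode("utf-8")


key = encode_key(1980, "I am a human-being")
print(key.hex())        # 000007bc4920616d20612068756d616e2d6265696e67
print(decode_key(key))  # (1980, 'I am a human-being')
```

A big-endian year prefix keeps keys in chronological order under bytewise comparison, which would be consistent with `db_head` returning the earliest year first.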



### `db_peek`: Show records from the filtered database

In [6]:
db_path = f'/scratch/edk202/NLP_corpora/{corpus_name}/{corpus_name}_filtered'

db_peek(db_path, start_key="[1980] <UNK> <UNK> <UNK> human-being", n=15)

15 key-value pairs starting from 000007bc3c554e4b3e203c554e4b3e203c554e4b3e2068756d616e2d6265696e67:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1980] <UNK> <UNK> <UNK> human-being <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> movie <UNK> still available
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 2] Key:   [1980] <UNK> <UNK> <UNK> human-being <UNK> <UNK> <UNK> right <UNK> <UNK> <UNK> life <UNK> future
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 3] Key:   [1980] <UNK> <UNK> <UNK> human-being <UNK> <UNK> deserve well <UNK> <UNK>
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 4] Key:   [1980] <UNK> <UNK> <UNK> human-being <UNK> <UNK> plead <UNK> <UNK> note <UNK> despair
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 5] Key:   [1980] <UNK> <UNK> <UNK> human-being <UNK> put <UNK> <UNK> slight danger <UNK> <UNK> <UNK> <UNK>
     Valu