# **Davies Corpus: Acquisition Workflow**
This workflow ingests a locally downloaded Davies corpus (e.g., COHA or COCA) into a RocksDB database. Unlike Google Books ngrams, which are downloaded on the fly, Davies corpora must be obtained separately and stored locally before you run this notebook.

## **Setup**
### Imports

In [1]:
%load_ext autoreload
%autoreload 2

from stop_words import get_stop_words
from ngramprep.ngram_filter.lemmatizer import CachedSpacyLemmatizer
from daviesprep.davies_acquire import ingest_davies_corpus
from daviesprep.davies_filter import filter_davies_corpus, write_whitelist
from ngramprep.utilities.peek import db_head, db_peek

### Configure
Here we set the basic parameters: the corpus name, the genre(s) to focus on, and the path stub under which both the downloaded corpus files and the databases are stored.

In [2]:
corpus_name = 'COHA'
genre_focus = ['mag']
db_path_stub = f'/scratch/edk202/NLP_corpora/{corpus_name}/'
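
The log output later in this notebook suggests that every path is derived from `db_path_stub`, the corpus name, and the genre. The sketch below reproduces that naming pattern for a single genre. `derive_paths` is a hypothetical helper written for illustration and is not part of `daviesprep`; how the package names a database when several genres are selected is not shown in the logs, so that case is left out.

```python
from pathlib import PurePosixPath


def derive_paths(db_path_stub: str, corpus_name: str, genre: str) -> dict:
    """Hypothetical helper: mirror the path pattern seen in the logs."""
    base = PurePosixPath(db_path_stub)      # trailing slash is normalized away
    db_name = f"{corpus_name}_{genre}"      # e.g. "COHA_mag"
    return {
        "text_dir": base / "text",                       # raw corpus text files
        "raw_db": base / db_name,                        # ingested database
        "filtered_db": base / f"{db_name}_filtered",     # filtered database
        "whitelist": base / f"{db_name}_whitelist.txt",  # whitelist file
    }


paths = derive_paths("/scratch/edk202/NLP_corpora/COHA/", "COHA", "mag")
for label, path in paths.items():
    print(f"{label:12s} {path}")
```

Printing the paths before a long run is a cheap way to confirm that the stub points where you expect.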

## **Ingest Corpus into Database**

In [3]:
combined_bigrams = {
    "working class", "working classes",
    "middle class", "middle classes",
    "lower class", "lower classes",
    "upper class", "upper classes",
    "human being", "human beings"
}

ingest_davies_corpus(
    db_path_stub=db_path_stub,
    genre_focus=genre_focus,
    workers=24,
    write_batch_size=500_000,
    compact_after=True,
    combined_bigrams=combined_bigrams
)

COHA CORPUS ACQUISITION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-12-19 19:28:02

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Corpus path:          /scratch/edk202/NLP_corpora/COHA
Text directory:       /scratch/edk202/NLP_corpora/COHA/text
DB path:              /scratch/edk202/NLP_corpora/COHA/COHA_mag
Text files found:     20
Genre focus:          mag
Key format:           Year-only (training-ready)
Workers:              24
Batch size:           500,000

Processing Files
════════════════════════════════════════════════════════════════════════════════════════════════════


Files Processed: 100%|█████████████████████████████████████████████████████████| 20/20 [00:25<00:00]


Post-Ingestion Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════





Compaction completed in 0:00:12

Processing complete!

Final Summary
════════════════════════════════════════════════════════════════════════════════════════════════════
Files processed:          20/20
Failed files:             0
Total sentences written:  4,580,686
Database path:            /scratch/edk202/NLP_corpora/COHA/COHA_mag

Genre breakdown:
  mag        4,580,686 sentences

End Time: 2025-12-19 19:28:41
Total Runtime: 0:00:38.587874
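
The `combined_bigrams` set passed to `ingest_davies_corpus` lists two-word phrases, and the hyphenated forms of the same phrases (`"working-class"`, `"human-being"`, ...) appear later as tokens in the filtering step. That suggests the phrases are merged into single hyphenated tokens during ingestion. The following is a minimal sketch of such a merge, assuming whitespace tokenization and case-insensitive matching; `merge_bigrams` is a hypothetical function, not the `daviesprep` implementation.

```python
def merge_bigrams(tokens, combined_bigrams):
    """Join adjacent token pairs listed in combined_bigrams with a hyphen."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            pair = f"{tokens[i]} {tokens[i + 1]}"
            # Assumption: matching ignores case; the original casing is kept.
            if pair.lower() in combined_bigrams:
                merged.append(f"{tokens[i]}-{tokens[i + 1]}")
                i += 2  # skip both halves of the merged pair
                continue
        merged.append(tokens[i])
        i += 1
    return merged


example = "every human being in the working class".split()
print(merge_bigrams(example, {"human being", "working class"}))
```

Merging before filtering matters because a stop-word or length filter applied to the separate words could otherwise drop or alter one half of the phrase.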



## **Filter Database**

In [4]:
filter_options = {
    'stop_set': set(get_stop_words("english")),
    'lemma_gen': CachedSpacyLemmatizer()
}

always_include_tokens = {
    "working-class", "working-classes",
    "middle-class", "middle-classes",
    "lower-class", "lower-classes",
    "upper-class", "upper-classes",
    "human-being", "human-beings"
}

filter_davies_corpus(
    db_path_stub=db_path_stub,
    genre_focus=genre_focus,
    workers=100,
    batch_size=250_000,
    create_whitelist=True,
    apply_whitelist=True,
    whitelist_size=30_000,
    whitelist_spell_check=True,
    whitelist_year_range=(1900, 2000),
    compact_after=True,
    always_include=always_include_tokens,
    **filter_options
)

COHA CORPUS FILTERING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-12-19 19:28:53

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            /scratch/edk202/NLP_corpora/COHA/COHA_mag
Destination DB:       /scratch/edk202/NLP_corpora/COHA/COHA_mag_filtered
Lowercase:            True
Alpha only:           True
Filter short:         True (min_len=3)
Filter stops:         True
Apply lemmas:         True
Workers:              100
Batch size:           250,000

Processing Sentences
════════════════════════════════════════════════════════════════════════════════════════════════════


Batches Processed: 100%|███████████████████████████████████████████████████████| 19/19 [00:15<00:00]



Building Whitelist
════════════════════════════════════════════════════════════════════════════════════════════════════
Whitelist path:          /scratch/edk202/NLP_corpora/COHA/COHA_mag_whitelist.txt
Top N tokens:            30,000
Year range:              1900-2000
Spell check:             True

Scanning database...
Found 4,312,837 sentences in 87 batches
Detecting years present in corpus...


Building token frequencies: 100%|██████████████████████████████████████████████| 87/87 [00:44<00:00]



Years present in corpus within range: 11 years
  Range: 1900 to 2000
  Years: [1900, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000]

Filtering tokens by year coverage (must appear in all 11 years)...
Tokens before year filter: 54,263
Tokens after year filter:  16,330
Tokens removed:            37,933

Ranking 16,330 unique tokens...
Added 1 always_include tokens that were not found in corpus
Selected top 30,000 tokens (+ 0 always_include)
Writing whitelist to /scratch/edk202/NLP_corpora/COHA/COHA_mag_whitelist.txt...
Whitelist written successfully: /scratch/edk202/NLP_corpora/COHA/COHA_mag_whitelist.txt

Applying Whitelist
════════════════════════════════════════════════════════════════════════════════════════════════════
Loading whitelist into memory...
Loaded 16,330 tokens from whitelist

Replacing non-whitelist tokens with <UNK>...



Batches Processed: 100%|███████████████████████████████████████████████████████| 18/18 [00:13<00:00]



Whitelist application complete!
Sentences processed:      4,312,837
Sentences modified:       17,589

Post-Filter Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════
Initial DB size:         624.40 MB
Compaction completed in 0:00:07
Size before:             624.40 MB
Size after:              624.40 MB
Space saved:             1.35 KB (0.0%)


Processing complete!

Final Summary
════════════════════════════════════════════════════════════════════════════════════════════════════
Sentences read:           4,510,852
Writes accumulated:       4,319,570
Sentences rejected:       191,282
Retention rate:           95.8%
Destination DB:           /scratch/edk202/NLP_corpora/COHA/COHA_mag_filtered

End Time: 2025-12-19 19:30:42
Total Runtime: 0:01:49.645953
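
The whitelist log above describes three steps: keep only tokens that appear in every year present in the range, rank the survivors by frequency and take the top N, then replace every non-whitelisted token with `<UNK>`. The sketch below illustrates that logic on in-memory data. It is an assumption-laden illustration rather than the `daviesprep` code: the function names are hypothetical, the spell-check step is omitted, and ties are broken alphabetically.

```python
from collections import Counter

UNK = "<UNK>"


def build_whitelist(sentences_by_year, top_n, always_include=frozenset()):
    """Top-N tokens that occur in every year, plus always_include tokens."""
    counts = Counter()
    years_seen = {}
    for year, sentences in sentences_by_year.items():
        for sentence in sentences:
            for token in sentence.split():
                if token == UNK:
                    continue  # placeholders are never whitelist candidates
                counts[token] += 1
                years_seen.setdefault(token, set()).add(year)

    all_years = set(sentences_by_year)
    # Year-coverage filter: the token must appear in all years.
    covered = [t for t in counts if years_seen[t] == all_years]
    covered.sort(key=lambda t: (-counts[t], t))
    return set(covered[:top_n]) | set(always_include)


def apply_whitelist(sentence, whitelist):
    """Replace every token outside the whitelist with <UNK>."""
    return " ".join(t if t in whitelist else UNK for t in sentence.split())


data = {1900: ["class struggle class", "rare word"], 1910: ["class struggle"]}
whitelist = build_whitelist(data, top_n=1, always_include={"human-being"})
print(sorted(whitelist))
print(apply_whitelist("class struggle human-being", whitelist))
```

The year-coverage requirement explains why the log reports only 16,330 whitelist tokens despite `whitelist_size=30_000`: fewer tokens survive the coverage filter than the requested top N.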



## **Optional: Inspect Database**

### `db_head`: Show first N records

In [6]:
db_path = f'/scratch/edk202/NLP_corpora/{corpus_name}/{corpus_name}'

db_head(db_path, n=150)

First 150 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1810] &; Leon
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 2] Key:   [1810] &; Mad
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 3] Key:   [1810] &; Moth
     Value: Total: 2 occurrences in 1 volumes (1810-1810, 1 bins)

[ 4] Key:   [1810] &; Stur
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 5] Key:   [1810] &; urg 'd him to the grave where sleeps His greatness Borodino 's field
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 6] Key:   [1810] 'm thine
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 7] Key:   [1810] 's defeated The whole rank 's slaughter 'd scarce a man escap 'd To tell their fate
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 8] Key:   [1810] 's my mother
     Value: Total: 1 occurrences in 1 volu

### `db_peek`: Show records starting from a key

In [7]:
db_path = f'/scratch/edk202/NLP_corpora/{corpus_name}/{corpus_name}_mag_filtered'

db_peek(db_path, start_key="[1980] <UNK> <UNK> <UNK> human-being", n=5)


5 key-value pairs starting from 000007bc3c554e4b3e203c554e4b3e203c554e4b3e2068756d616e2d6265696e67:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1980] <UNK> <UNK> <UNK> human-being <UNK> <UNK> deserve well <UNK> <UNK>
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 2] Key:   [1980] <UNK> <UNK> <UNK> humane man <UNK> contemporary account <UNK> <UNK> behavior <UNK> <UNK> analyst indicate <UNK> <UNK> sometimes show quite human emotion <UNK> <UNK> patient occasionally reveal personal matter like <UNK> accident <UNK> <UNK> son <UNK> <UNK> even report <UNK> thump <UNK> table <UNK> hammer home <UNK> point
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 3] Key:   [1980] <UNK> <UNK> <UNK> humble aspiration <UNK> author believe <UNK> may fashion <UNK> <UNK> engage notion <UNK> citizenship <UNK> perhaps turn back <UNK> rise tide <UNK> barbarism
     Value: Total: 1 occurrences in 1 volum
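
The hexadecimal start key printed in the "starting from" line above can be decoded by hand: the first four bytes (`000007bc`) are 1980 as a big-endian integer, and the remaining bytes are the UTF-8 text of the key. The sketch below encodes and decodes keys under that assumption. The layout is inferred from the printed hex only, and `encode_key` and `decode_key` are hypothetical helpers, not functions from `ngramprep`.

```python
import struct


def encode_key(year: int, text: str) -> bytes:
    """Assumed layout: 4-byte big-endian year followed by UTF-8 text."""
    return struct.pack(">I", year) + text.encode("utf-8")


def decode_key(hex_key: str):
    """Split a hex-encoded key into its (year, text) parts."""
    raw = bytes.fromhex(hex_key)
    (year,) = struct.unpack(">I", raw[:4])
    return year, raw[4:].decode("utf-8")


hex_key = (
    "000007bc3c554e4b3e203c554e4b3e203c554e4b3e20"
    "68756d616e2d6265696e67"
)
print(decode_key(hex_key))
print(encode_key(1980, "<UNK> <UNK> <UNK> human-being").hex() == hex_key)
```

A fixed-width big-endian year prefix makes bytewise key order match chronological order, which would explain why `db_peek` returns records from the same year in lexicographic order of their text.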

### `db_peek`: Show records starting from a prefix

In [8]:
db_path = f'/scratch/edk202/NLP_corpora/{corpus_name}/{corpus_name}_mag_filtered'

db_peek(db_path, start_key="[1980] <UNK> <UNK> <UNK> animal", n=5)

5 key-value pairs starting from 000007bc3c554e4b3e203c554e4b3e203c554e4b3e20616e696d616c:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1980] <UNK> <UNK> <UNK> animal <UNK> <UNK> element take <UNK> <UNK> minute <UNK> leave
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 2] Key:   [1980] <UNK> <UNK> <UNK> animal <UNK> <UNK> must <UNK> query <UNK> <UNK> <UNK> <UNK> fathom <UNK> division <UNK> life <UNK> <UNK>
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 3] Key:   [1980] <UNK> <UNK> <UNK> animal <UNK> end <UNK> <UNK> <UNK> shelter <UNK> bring <UNK> <UNK> <UNK> five animal control officer <UNK> travel <UNK> city <UNK> dodge <UNK> van carefully fill <UNK> form <UNK> <UNK> <UNK> <UNK> <UNK> particular <UNK> <UNK> stray unwanted pet <UNK> wild animal <UNK> pick <UNK>
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 4] Key:   [1980] <UNK> <UNK> <UNK> animal <UNK>