# **Davies Corpus: Acquisition Workflow**
This workflow ingests a locally downloaded Davies corpus (e.g., COHA, COCA) into a RocksDB database. Unlike Google Books ngrams, which are downloaded on the fly, Davies corpora must be obtained separately and stored locally before running this notebook.

## **Setup**
### Imports

In [1]:
%load_ext autoreload
%autoreload 2

from stop_words import get_stop_words
from ngramkit.ngram_filter.lemmatizer import CachedSpacyLemmatizer
from davieskit.davies_acquire import ingest_davies_corpus
from davieskit.davies_filter import filter_davies_corpus, write_whitelist
from ngramkit.utilities.peek import db_head, db_peek

### Configure
Here we set the basic parameters: the corpus name and the base path (`db_path_stub`) under which the downloaded corpus files and the databases are stored.

In [2]:
corpus_name = 'COHA'
db_path_stub = f'/scratch/edk202/NLP_corpora/{corpus_name}'
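
Because the corpus has to be on disk before ingestion, a quick pre-flight check can catch a wrong path early. The sketch below is an illustration only: `count_corpus_text_files` is a hypothetical helper, and the `.txt` extension is an assumption. The `text` subdirectory name is taken from the ingestion log further down, which reports a `text` directory under the corpus path.

```python
from pathlib import Path


def count_corpus_text_files(corpus_dir, subdir="text", pattern="*.txt"):
    """Return the number of files matching `pattern` in `corpus_dir/subdir`.

    Raises FileNotFoundError if the directory does not exist.
    The subdirectory name and file pattern are assumptions about the layout.
    """
    text_dir = Path(corpus_dir) / subdir
    if not text_dir.is_dir():
        raise FileNotFoundError(f"Expected text directory not found: {text_dir}")
    # Count regular files only; nested directories are ignored.
    return sum(1 for p in text_dir.glob(pattern) if p.is_file())
```

Calling `count_corpus_text_files(db_path_stub)` before ingestion should then report how many text files are visible at that location.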

## **Ingest Corpus into Database**

In [6]:
ingest_davies_corpus(
    db_path_stub=db_path_stub,
    workers=24,
    write_batch_size=500_000,
    track_genre=False,
    compact_after=True
)

COHA CORPUS ACQUISITION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-12-14 19:12:59

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Corpus path:          /scratch/edk202/NLP_corpora/COHA
Text directory:       /scratch/edk202/NLP_corpora/COHA/text
DB path:              /scratch/edk202/NLP_corpora/COHA/COHA
Text files found:     20
Workers:              24
Batch size:           500,000
Genre tracking:       Disabled

Processing Files
════════════════════════════════════════════════════════════════════════════════════════════════════


Files Processed: 100%|█████████████████████████████████████████████████████████| 20/20 [01:57<00:00]



Post-Ingestion Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════
Initial DB size:         1.76 GB
Compaction completed in 0:00:20
Size before:             1.76 GB
Size after:              2.44 GB
Space saved:             -691.15 MB (-38.3%)

Processing complete!

Final Summary
════════════════════════════════════════════════════════════════════════════════════════════════════
Files processed:          20/20
Failed files:             0
Total sentences written:  23,128,970
Database path:            /scratch/edk202/NLP_corpora/COHA/COHA

End Time: 2025-12-14 19:15:19
Total Runtime: 0:02:19.421905
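
In the compaction report above, "Space saved" is negative: the database grew from 1.76 GB to 2.44 GB. The sketch below shows one way such a figure can be computed. `space_saved` is a hypothetical helper rather than the library's code, and the conversion of 1 GB to 1024 MB is an assumption.

```python
def space_saved(size_before, size_after):
    """Return (saved, percent_saved); both are negative if the size grew."""
    saved = size_before - size_after
    percent = 100.0 * saved / size_before if size_before else 0.0
    return saved, percent


# Rounded sizes from the report, in GB.
saved_gb, pct = space_saved(1.76, 2.44)
print(f"{saved_gb * 1024:.0f} MB ({pct:.1f}%)")  # -696 MB (-38.6%)
```

With the rounded sizes this gives about -696 MB (-38.6%), close to the reported -691.15 MB (-38.3%). The small gap is presumably due to rounding in the displayed sizes.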



## **Filter Database**

In [5]:
filter_options = {
    'stop_set': set(get_stop_words("english")),
    'lemma_gen': CachedSpacyLemmatizer()
}

filter_davies_corpus(
    src_db_path=f'{db_path_stub}/{corpus_name}',
    dst_db_path=f'{db_path_stub}/{corpus_name}_filtered',
    workers=100,
    batch_size=250_000,
    track_genre=False,
    create_whitelist=True,
    whitelist_path=f'{db_path_stub}/{corpus_name}_whitelist.txt',
    apply_whitelist=True,
    whitelist_size=30_000,
    whitelist_spell_check=True,
    whitelist_year_range=(1900, 2020),
    compact_after=True,
    **filter_options
)

COHA CORPUS FILTERING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-12-15 22:12:02

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            /scratch/edk202/NLP_corpora/COHA/COHA
Destination DB:       /scratch/edk202/NLP_corpora/COHA/COHA_filtered
Lowercase:            True
Alpha only:           True
Filter short:         True (min_len=3)
Filter stops:         True
Apply lemmas:         True
Workers:              100
Batch size:           250,000

Processing Sentences
════════════════════════════════════════════════════════════════════════════════════════════════════


Batches Processed: 100%|███████████████████████████████████████████████████████| 89/89 [00:57<00:00]



Building Whitelist
════════════════════════════════════════════════════════════════════════════════════════════════════
Whitelist path:          /scratch/edk202/NLP_corpora/COHA/COHA_whitelist.txt
Top N tokens:            30,000
Year range:              1900-2020
Spell check:             True

Scanning database...
Found 20,645,839 sentences
Detecting years present in corpus...


Building token frequencies: 100%|██████████████████████████████████| 20645839/20645839 [10:21<00:00]



Years present in corpus within range: 11 years
  Range: 1900 to 2000
  Years: [1900, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000]

Filtering tokens by year coverage (must appear in all 11 years)...
Tokens before year filter: 64,598
Tokens after year filter:  23,805
Tokens removed:            40,793

Ranking 23,805 unique tokens...
Selected top 30,000 tokens
Writing whitelist to /scratch/edk202/NLP_corpora/COHA/COHA_whitelist.txt...
Whitelist written successfully: /scratch/edk202/NLP_corpora/COHA/COHA_whitelist.txt

Applying Whitelist
════════════════════════════════════════════════════════════════════════════════════════════════════
Loading whitelist into memory...
Loaded 23,805 tokens from whitelist

Replacing non-whitelist tokens with <UNK>...



Batches Processed: 100%|███████████████████████████████████████████████████████| 83/83 [01:03<00:00]



Whitelist application complete!
Sentences processed:      20,645,839
Sentences modified:       733,338

Post-Filter Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════
Initial DB size:         2.39 GB
Compaction completed in 0:00:18
Size before:             2.39 GB
Size after:              2.39 GB
Space saved:             253.48 KB (0.0%)


Processing complete!

Final Summary
════════════════════════════════════════════════════════════════════════════════════════════════════
Sentences read:           22,172,225
Writes accumulated:       20,802,843
Sentences rejected:       1,369,382
Retention rate:           93.8%
Destination DB:           /scratch/edk202/NLP_corpora/COHA/COHA_filtered

End Time: 2025-12-15 22:26:41
Total Runtime: 0:14:38.559571
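
The whitelist stage in the log above counts token frequencies, keeps only tokens that appear in every year within the range, ranks the survivors, takes the top N, and replaces everything else with `<UNK>`. The sketch below mirrors those steps on an in-memory toy corpus. It is not `davieskit`'s implementation: `build_whitelist` and `apply_whitelist` are hypothetical names, skipping `<UNK>` while counting and the alphabetical tie-breaking are assumptions, and the spell-check step is omitted.

```python
from collections import Counter, defaultdict

UNK = "<UNK>"


def build_whitelist(sentences_by_year, top_n, year_range=None):
    """Return up to `top_n` tokens that occur in every year within `year_range`.

    `sentences_by_year` maps a year to a list of token lists.
    """
    if year_range is not None:
        lo, hi = year_range
        sentences_by_year = {
            y: s for y, s in sentences_by_year.items() if lo <= y <= hi
        }
    years = set(sentences_by_year)
    freq = Counter()
    years_seen = defaultdict(set)
    for year, sentences in sentences_by_year.items():
        for tokens in sentences:
            for tok in tokens:
                if tok == UNK:  # assumption: placeholders are not counted
                    continue
                freq[tok] += 1
                years_seen[tok].add(year)
    # Year-coverage filter: keep tokens present in every year of the range.
    covered = [t for t in freq if years_seen[t] == years]
    # Rank by frequency (ties broken alphabetically) and keep the top N.
    covered.sort(key=lambda t: (-freq[t], t))
    return covered[:top_n]


def apply_whitelist(tokens, whitelist):
    """Replace every token that is not on the whitelist with UNK."""
    allowed = set(whitelist)
    return [t if t in allowed else UNK for t in tokens]


toy_corpus = {
    1900: [["time", "road", "clerk"], ["time", "store"]],
    1910: [["time", "road"], ["road", "store", "witch"]],
}
whitelist = build_whitelist(toy_corpus, top_n=30_000, year_range=(1900, 2020))
print(whitelist)                                               # ['road', 'time', 'store']
print(apply_whitelist(["time", "witch", "store"], whitelist))  # ['time', '<UNK>', 'store']
```

When fewer tokens survive the coverage filter than `top_n`, the slice simply returns all of them. That would be consistent with the log above, where 23,805 tokens remain for a requested 30,000.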



## **Optional: Inspect Database**

### `db_head`: Show first N records

In [6]:
db_path = f'/scratch/edk202/NLP_corpora/{corpus_name}/{corpus_name}_filtered'

db_head(db_path, n=15)

First 15 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1810] 1 <UNK> 2 attendant
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 2] Key:   [1810] 1 <UNK> 2 fisherman fish <UNK> <UNK> rock
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 3] Key:   [1810] 1 <UNK> <UNK> <UNK> <UNK> clerk <UNK> norwich <UNK> <UNK> <UNK> partner <UNK> <UNK> general store <UNK> <UNK> con necticut town alway <UNK> everywhere successful
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 4] Key:   [1810] 1 <UNK> <UNK> <UNK> <UNK> educate secondly board thirdly lodge fourthly clothe <UNK> <UNK> <UNK> expence <UNK> sir spendall flinty
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 5] Key:   [1810] 1 <UNK> <UNK> road <UNK> crime 1 <UNK> must <UNK> <UNK> road <UNK> expiation
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 6] Key:  

### `db_peek`: Show records starting from a key

In [7]:
db_path = f'/scratch/edk202/NLP_corpora/{corpus_name}/{corpus_name}_filtered'

db_peek(db_path, start_key="[2000] <UNK> horror", n=5)


5 key-value pairs starting from 000007d03c554e4b3e20686f72726f72:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [2000] <UNK> horror <UNK> <UNK> <UNK> <UNK> <UNK> somehow make <UNK> wise wise even <UNK> <UNK> witch <UNK> <UNK> speak <UNK> along <UNK> <UNK> empty easy conversation <UNK> <UNK> <UNK> <UNK> <UNK> old friend <UNK> friend now seem like mirror show <UNK> whatever <UNK> want <UNK> see
     Value: Total: 2 occurrences in 2 volumes (2000-2000, 1 bins)

[ 2] Key:   [2000] <UNK> horror <UNK> <UNK> <UNK> beard <UNK> good <UNK> <UNK> heart begin <UNK> beat <UNK> fast <UNK> become almost <UNK>
     Value: Total: 1 occurrences in 1 volumes (2000-2000, 1 bins)

[ 3] Key:   [2000] <UNK> horror <UNK> <UNK> <UNK> happen fully hit <UNK>
     Value: Total: 2 occurrences in 2 volumes (2000-2000, 1 bins)

[ 4] Key:   [2000] <UNK> horror <UNK> <UNK> choice <UNK> <UNK> <UNK> still
     Value: Total: 2 occurrences in 2 volumes (20
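
The hex string in the "starting from" line above appears to be the binary form of the start key: its first four bytes, `000007d0`, equal 2000 as a big-endian integer, and the remaining bytes are the UTF-8 text `<UNK> horror`. The sketch below decodes it under that assumed layout. `decode_key` and `encode_key` are hypothetical helpers, not part of `ngramkit`.

```python
import struct


def decode_key(hex_key):
    """Split a hex key into (year, text), assuming a 4-byte big-endian year."""
    raw = bytes.fromhex(hex_key)
    (year,) = struct.unpack(">I", raw[:4])
    return year, raw[4:].decode("utf-8")


def encode_key(year, text):
    """Inverse of decode_key under the same assumed layout."""
    return struct.pack(">I", year) + text.encode("utf-8")


print(decode_key("000007d03c554e4b3e20686f72726f72"))  # (2000, '<UNK> horror')
```

A big-endian year prefix has a convenient property: comparing keys byte by byte orders them by year first, so `encode_key(1980, "time") < encode_key(2000, "a")` holds.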

### `db_peek`: Show records starting from a key prefix

In [12]:
db_path = f'/scratch/edk202/NLP_corpora/{corpus_name}/{corpus_name}_filtered'

db_peek(db_path, start_key="[1980] time", n=15)

15 key-value pairs starting from 000007bc74696d65:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1980] time <UNK> <UNK> 1 run <UNK> <UNK>
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 2] Key:   [1980] time <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> behavior <UNK> <UNK> quick word <UNK> <UNK> hard look <UNK> disapproval <UNK>
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 3] Key:   [1980] time <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> john <UNK> <UNK> page <UNK>
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 4] Key:   [1980] time <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> life <UNK> death <UNK> comprehend <UNK> language <UNK> dialect <UNK> other
     Value: Total: 1 occurrences in 1 volumes (1980-1980, 1 bins)

[ 5] Key:   [1980] time <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> start <UNK> new time
     Value: Total: 1 occurrences in 1 volumes (1980-19
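
On an ordered key-value store, a scan that starts from a key returns whatever keys sort at or after it, so it can run past the keys that share the starting prefix once those are exhausted. A prefix scan additionally stops at the first key that no longer matches. The sketch below illustrates the difference on a plain sorted list. It does not use RocksDB or `ngramkit`, and both function names are hypothetical.

```python
from bisect import bisect_left


def scan_from(sorted_keys, start_key, n):
    """Return up to n keys at or after start_key (a start-key scan)."""
    i = bisect_left(sorted_keys, start_key)
    return sorted_keys[i:i + n]


def scan_prefix(sorted_keys, prefix, n):
    """Return up to n keys that begin with prefix (a prefix scan)."""
    out = []
    for key in scan_from(sorted_keys, prefix, len(sorted_keys)):
        # Stop at the first non-matching key or once n keys are collected.
        if not key.startswith(prefix) or len(out) == n:
            break
        out.append(key)
    return out


keys = sorted([b"time flies", b"time waits", b"timid soul", b"tin roof"])
print(scan_from(keys, b"time", 3))    # [b'time flies', b'time waits', b'timid soul']
print(scan_prefix(keys, b"time", 3))  # [b'time flies', b'time waits']
```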