# **Davies Corpus: Acquisition Workflow**
This workflow ingests a locally downloaded Davies corpus (e.g., COHA or COCA) into a RocksDB database. Unlike Google Books ngrams, which are downloaded on the fly, Davies corpora must be obtained separately and stored locally before running this notebook.

## **Setup**
### Imports

In [44]:
%load_ext autoreload
%autoreload 2

from stop_words import get_stop_words
from ngramkit.ngram_filter.lemmatizer import CachedSpacyLemmatizer
from davieskit.davies_acquire import ingest_davies_corpus
from davieskit.davies_filter import filter_davies_corpus, FilterConfig
from ngramkit.utilities.peek import db_head, db_peek


### Configure
Here we set basic parameters: the corpus name and the path stub under which the downloaded corpus files and the databases are stored.

In [40]:
corpus_name = 'COHA'
db_path_stub = f'/scratch/edk202/NLP_corpora/{corpus_name}'
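
Because the corpus must already be on disk, it can help to confirm the directory layout before starting a long ingest. The sketch below assumes the layout printed in this notebook's ingestion log (a `text` subdirectory under the corpus path). `check_corpus_layout` is a hypothetical helper written for illustration, not part of `davieskit`.

```python
from pathlib import Path


def check_corpus_layout(corpus_path: str, text_subdir: str = "text") -> int:
    """Return the number of regular files in the corpus text directory.

    Hypothetical helper: the `text` subdirectory name is an assumption taken
    from the paths printed in the ingestion log, not a documented davieskit API.
    """
    text_dir = Path(corpus_path) / text_subdir
    if not text_dir.is_dir():
        raise FileNotFoundError(f"Expected text directory at {text_dir}")

    # Count every regular file; the real ingester may apply a stricter pattern.
    n_files = sum(1 for p in text_dir.iterdir() if p.is_file())
    if n_files == 0:
        raise FileNotFoundError(f"No files found in {text_dir}")
    return n_files
```

A call such as `check_corpus_layout(f'/scratch/edk202/NLP_corpora/{corpus_name}')` would raise early if the download is missing. Since the helper counts every regular file, its result may differ from the "Text files found" figure the ingester reports.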

## **Ingest Corpus into Database**

In [41]:
ingest_davies_corpus(
    db_path_stub=db_path_stub,
    workers=24,
    write_batch_size=500_000,
    compact_after_ingest=True
)

COHA CORPUS ACQUISITION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-12-09 15:41:16

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Corpus path:          /scratch/edk202/NLP_corpora/COHA
Text directory:       /scratch/edk202/NLP_corpora/COHA/text
DB path:              /scratch/edk202/NLP_corpora/COHA/db
Text files found:     20
Workers:              24
Batch size:           500,000

Processing Files
════════════════════════════════════════════════════════════════════════════════════════════════════
  Processed 5/20 files (2,330,050 sentences)
  Processed 10/20 files (7,204,389 sentences)
  Processed 15/20 files (14,626,031 sentences)
  Processed 20/20 files (23,168,949 sentences)

Post-Ingestion Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════
Initial DB size:         1.74 GB
Compactio

## **Filter Database**

In [None]:
filter_options = {
    'stop_set': set(get_stop_words("english")),
    'lemma_gen': CachedSpacyLemmatizer()
}

filter_davies_corpus(
    src_db_path=f'{db_path_stub}/db',
    dst_db_path=f'{db_path_stub}/db/{corpus_name}_filtered',
    **filter_options
)

DAVIES CORPUS FILTERING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2025-12-09 15:57:24

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            /scratch/edk202/NLP_corpora/COHA/db
Destination DB:       /scratch/edk202/NLP_corpora/COHA/db/COHA_filtered
Lowercase:            True
Alpha only:           True
Filter short:         True (min_len=3)
Filter stops:         True
Apply lemmas:         True
Batch size:           100,000

Processing Sentences
════════════════════════════════════════════════════════════════════════════════════════════════════
  Processed 500,000 sentences (469,044 unique seen, 29,178 rejected)
  Processed 1,000,000 sentences (942,508 unique seen, 53,932 rejected)
  Processed 1,500,000 sentences (1,421,916 unique seen, 72,716 rejected)
  Processed 2,000,000 sentences (1,892,790 unique seen, 99,805 rejected)
  Processed
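
The configuration block above lists five filtering steps: lowercasing, alphabetic-only tokens, a minimum length of 3, stop-word removal, and lemmatization. The following is a minimal sketch of the first four on a single tokenized sentence. `filter_tokens` is a hypothetical function, the order of the steps is an assumption, and lemmatization is left out; none of this is taken from the `davieskit` implementation, which may, for example, replace rejected tokens rather than drop them.

```python
def filter_tokens(tokens, stop_set, min_len=3):
    """Illustrative token filter mirroring the logged settings.

    Assumed order: lowercase -> alpha only -> min length -> stop words.
    Rejected tokens are simply dropped here (an assumption).
    """
    kept = []
    for tok in tokens:
        tok = tok.lower()
        if not tok.isalpha():      # "Alpha only: True"
            continue
        if len(tok) < min_len:     # "Filter short: True (min_len=3)"
            continue
        if tok in stop_set:        # "Filter stops: True"
            continue
        kept.append(tok)
    return kept


sentence = "The horror of being institutionalized has kept my daughter faithful"
# A tiny stand-in stop list; the notebook uses get_stop_words("english").
demo_stops = {"the", "of", "being", "has", "my"}
print(filter_tokens(sentence.split(), demo_stops))
# ['horror', 'institutionalized', 'kept', 'daughter', 'faithful']
```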

## **Optional: Inspect Database**

### `db_head`: Show first N records

In [42]:
db_path = f'/scratch/edk202/NLP_corpora/{corpus_name}/db'

db_head(str(db_path), n=5)

First 5 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1810] &; Leon
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 2] Key:   [1810] &; Mad
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 3] Key:   [1810] &; Moth
     Value: Total: 2 occurrences in 1 volumes (1810-1810, 1 bins)

[ 4] Key:   [1810] &; Stur
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)

[ 5] Key:   [1810] 'The standard of good behavior for the continuance in office of the-judieial magistracy is certainly one of the most valuable of the modern improvements in the practice of government
     Value: Total: 1 occurrences in 1 volumes (1810-1810, 1 bins)



### `db_peek`: Show records starting from a key

In [43]:
db_path = f'/scratch/edk202/NLP_corpora/{corpus_name}/db'

db_peek(db_path, start_key="[1990] The horror", n=5)


5 key-value pairs starting from 000007c654686520686f72726f72:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1990] The horror
     Value: Total: 4 occurrences in 2 volumes (1990-1990, 1 bins)

[ 2] Key:   [1990] The horror and the shame were like two vicious heavy blows
     Value: Total: 1 occurrences in 1 volumes (1990-1990, 1 bins)

[ 3] Key:   [1990] The horror meant he still was n't sure
     Value: Total: 1 occurrences in 1 volumes (1990-1990, 1 bins)

[ 4] Key:   [1990] The horror novel I read on the Bullet Train now rides low just inside my jacket pocket and Lew Spencer the agent who receives me at the door sees it and smiles
     Value: Total: 1 occurrences in 1 volumes (1990-1990, 1 bins)

[ 5] Key:   [1990] The horror of being institutionalized has kept my daughter faithful to her medications
     Value: Total: 1 occurrences in 1 volumes (1990-1990, 1 bins)
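
The hexadecimal start key printed above, `000007c654686520686f72726f72`, is consistent with a 4-byte big-endian year followed by the UTF-8 text of the sentence. The sketch below decodes it under that assumption. The layout is inferred from this output only, not from the library's documentation, and `encode_key` and `decode_key` are hypothetical helpers.

```python
import struct
from typing import Tuple


def encode_key(year: int, text: str) -> bytes:
    """Pack a year as a 4-byte big-endian integer, then append UTF-8 text."""
    return struct.pack(">I", year) + text.encode("utf-8")


def decode_key(raw: bytes) -> Tuple[int, str]:
    """Inverse of encode_key under the assumed layout."""
    (year,) = struct.unpack(">I", raw[:4])
    return year, raw[4:].decode("utf-8")


raw = bytes.fromhex("000007c654686520686f72726f72")
print(decode_key(raw))  # (1990, 'The horror')
```

A big-endian year prefix would also explain the ordering seen in the output: byte-wise key comparison then sorts records by year first and by text second.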



### `db_peek`: Show records sharing a key prefix

In [38]:
db_path = f'/scratch/edk202/NLP_corpora/{corpus_name}/db'

db_peek(db_path, start_key="[1980] nuclear weapons", n=5)

5 key-value pairs starting from 000007bc6e75636c65617220776561706f6e73:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1980] nuclear weapons and the payment of taxes
     Value: 1 occurrences in 1 documents

[ 2] Key:   [1980] nuclear weapons have no military purpose other than to prevent a nuclear attack on the homeland of their possessors
     Value: 1 occurrences in 1 documents

[ 3] Key:   [1980] nuclear weapons in Spain
     Value: 1 occurrences in 1 documents

[ 4] Key:   [1980] nuclear weapons policy
     Value: 1 occurrences in 1 documents

[ 5] Key:   [1980] nuclear weapons stored in that country and in Turkey
     Value: 1 occurrences in 1 documents
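
Starting from a key returns prefix matches only for as long as the sorted keys continue to share that prefix; after that, the listing runs on into unrelated records. The sketch below shows a strict prefix scan over an in-memory sorted list of byte keys. It illustrates the idea only: `scan_prefix` and `make_key` are hypothetical, the key layout (4-byte big-endian year plus UTF-8 text) is inferred from the hex start keys in this notebook, and no RocksDB or `ngramkit` call is used.

```python
import struct
from bisect import bisect_left


def make_key(year: int, text: str) -> bytes:
    """Assumed key layout: 4-byte big-endian year + UTF-8 text."""
    return struct.pack(">I", year) + text.encode("utf-8")


def scan_prefix(sorted_keys, prefix: bytes, n: int = 5):
    """Return up to n keys that start with prefix, in sorted order."""
    start = bisect_left(sorted_keys, prefix)  # first key >= prefix
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix) or len(matches) >= n:
            break  # past the prefix range, or enough results collected
        matches.append(key)
    return matches


keys = sorted([
    make_key(1980, "nuclear weapons policy"),
    make_key(1980, "nuclear weapons in Spain"),
    make_key(1980, "nuclear winter"),
    make_key(1990, "nuclear weapons"),
])
for key in scan_prefix(keys, make_key(1980, "nuclear weapons")):
    print(key[4:].decode("utf-8"))
# nuclear weapons in Spain
# nuclear weapons policy
```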

