# **Davies Corpus: Acquisition Workflow**
This workflow ingests a locally downloaded Davies corpus (e.g., COHA or COCA) into a RocksDB database. Unlike Google Books ngrams, which are downloaded on the fly, Davies corpora must be obtained separately and stored locally before running this notebook.

## **Setup**
### Imports

In [None]:
%load_ext autoreload
%autoreload 2

from pathlib import Path
from davieskit.davies_acquire import ingest_davies_corpus
from ngramkit.utilities.peek import db_head, db_peek

### Configure
Here we set the basic parameters: the corpus name, the local path to the downloaded corpus files, and the path where the database will be stored.

In [None]:
corpus_name = 'COHA'  # COHA, COCA, etc.
corpus_path = '/scratch/edk202/NLP_corpora/Davies/COHA_raw'  # Path to downloaded corpus
db_path = '/scratch/edk202/NLP_corpora/Davies/COHA/corpus.db'
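Since ingestion depends on the raw corpus already being on disk, a quick pre-flight check can catch a mistyped path before a long job starts. The sketch below is one possible way to do that; `check_paths` is a hypothetical helper written for illustration and is not part of `davieskit` or `ngramkit`.

```python
from pathlib import Path


def check_paths(corpus_path, db_path):
    """Hypothetical pre-flight check; not part of davieskit."""
    corpus_dir = Path(corpus_path)
    db_file = Path(db_path)

    # The raw corpus must already be on disk.
    if not corpus_dir.is_dir():
        raise FileNotFoundError(f"Corpus directory not found: {corpus_dir}")

    # An empty directory usually means the download or copy did not finish.
    if not any(corpus_dir.iterdir()):
        raise ValueError(f"Corpus directory is empty: {corpus_dir}")

    # Create the database's parent directory if it does not exist yet.
    db_file.parent.mkdir(parents=True, exist_ok=True)

    return corpus_dir, db_file
```

Calling `check_paths(corpus_path, db_path)` right after the configuration cell would raise immediately if the corpus directory is missing or empty, rather than failing partway through ingestion.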

## **Ingest Corpus into Database**

In [None]:
ingest_davies_corpus(
    corpus_name=corpus_name,
    corpus_path=corpus_path,
    db_path=db_path,
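    overwrite_db=True,         # replace any existing database at db_path
    workers=24,                # number of parallel workers
    write_batch_size=100_000,  # records per batched write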
)
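The call above hard-codes `workers=24`, which may exceed the CPUs actually available on a shared or scheduled machine. As a rough sketch, the worker count could be capped at what the current process is allowed to use. `pick_worker_count` is a hypothetical helper, not a `davieskit` function, and whether `ingest_davies_corpus` benefits from matching the CPU count exactly is an assumption.

```python
import os


def pick_worker_count(requested=24):
    """Hypothetical helper: cap `requested` at the CPUs usable by this process."""
    try:
        # Linux only: honors the CPU affinity mask of the current process.
        available = len(os.sched_getaffinity(0))
    except AttributeError:
        # Other platforms: fall back to the total logical CPU count.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))
```

The result could then be passed as `workers=pick_worker_count(24)` in place of the literal value.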

## **Optional: Inspect Database**

### `db_head`: Show first N records

In [None]:
db_head(str(db_path), n=5)

### `db_peek`: Show records starting from a key

In [None]:
# For Davies, keys are (year, sentence) tuples
db_peek(str(db_path), start_key=None, n=5)