# **Davies Corpus: Acquisition Workflow**
This workflow ingests a locally downloaded Davies corpus (e.g., COHA, COCA) into a RocksDB database. Unlike Google Books ngrams, which are downloaded on the fly, Davies corpora must be obtained separately and stored locally before running this notebook.

## **Setup**
### Imports

In [26]:
%load_ext autoreload
%autoreload 2

from stop_words import get_stop_words
from ngramprep.ngram_filter.lemmatizer import CachedSpacyLemmatizer
from daviesprep.davies_acquire import ingest_davies_corpus
from daviesprep.davies_filter import filter_davies_corpus, write_whitelist
from ngramprep.utilities.peek import db_head, db_peek, db_peek_prefix
from ngramprep.utilities.count_items import count_db_items



### Configure
Here we set the basic parameters: the corpus name, an optional genre focus, the year bin size, and the path stub under which the downloaded corpus files and the databases are stored.

In [27]:
corpus_name = 'COCA'
genre_focus = None
bin_size = 1
db_path_stub = f'/scratch/edk202/NLP_corpora/{corpus_name}/'
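
The logs further down report a text directory, a raw database, a filtered database, and a whitelist file, all located under `db_path_stub`. The sketch below reproduces that layout so the paths can be checked before a long run. `derive_corpus_paths` is a hypothetical helper, not part of `daviesprep`, and the layout is inferred from the log output for `genre_focus=None` only.

```python
from pathlib import PurePosixPath


def derive_corpus_paths(db_path_stub: str, corpus_name: str) -> dict:
    """Hypothetical helper: mirror the path layout reported in the logs."""
    root = PurePosixPath(db_path_stub)
    return {
        "text_dir": root / "text",
        "raw_db": root / corpus_name,
        "filtered_db": root / f"{corpus_name}_filtered",
        "whitelist": root / f"{corpus_name}_whitelist.txt",
    }


paths = derive_corpus_paths("/scratch/edk202/NLP_corpora/COCA/", "COCA")
for name, path in paths.items():
    print(f"{name:12s} {path}")
```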

## **Ingest Corpus into Database**

In [28]:
ingest_davies_corpus(
    db_path_stub=db_path_stub,
    genre_focus=genre_focus,
    bin_size=bin_size,
    workers=24,
    write_batch_size=500_000,
    compact_after=True,
    combined_bigrams=None
)

COCA CORPUS ACQUISITION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2026-01-16 18:05:11

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Corpus path:          /scratch/edk202/NLP_corpora/COCA
Text directory:       /scratch/edk202/NLP_corpora/COCA/text
DB path:              /scratch/edk202/NLP_corpora/COCA/COCA
Text files found:     8
Genre focus:          All genres
Key format:           Genre-prefixed (archival)
Year bin size:        1
Workers:              24
Batch size:           500,000

Processing Files
════════════════════════════════════════════════════════════════════════════════════════════════════


Files Processed: 100%|███████████████████████████████████████████████████████████| 8/8 [06:07<00:00]



Post-Ingestion Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════
Initial DB size:         5.27 GB
Compaction completed in 0:00:32
Size before:             5.27 GB
Size after:              6.17 GB
Space saved:             -915.57 MB (-17.0%)

Processing complete!

Final Summary
════════════════════════════════════════════════════════════════════════════════════════════════════
Files processed:          8/8
Failed files:             0
Total sentences written:  68,746,583
Database path:            /scratch/edk202/NLP_corpora/COCA/COCA

Genre breakdown:
  acad       6,013,628 sentences
  blog       6,615,009 sentences
  fic        9,243,359 sentences
  mag        6,773,413 sentences
  mov        17,154,021 sentences
  news       7,112,008 sentences
  spok       9,147,176 sentences
  web        6,687,969 sentences

End Time: 2026-01-16 18:11:53
Total Runtime: 0:06:42.000620
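
As a quick sanity check on the summary above, the per-genre sentence counts should add up to the reported total. The snippet below copies the numbers from the log by hand; it does not read the database.

```python
# Per-genre sentence counts copied from the ingestion log above.
genre_counts = {
    "acad": 6_013_628,
    "blog": 6_615_009,
    "fic": 9_243_359,
    "mag": 6_773_413,
    "mov": 17_154_021,
    "news": 7_112_008,
    "spok": 9_147_176,
    "web": 6_687_969,
}
total_reported = 68_746_583  # "Total sentences written" in the log

total = sum(genre_counts.values())
assert total == total_reported, f"genre sum {total:,} != {total_reported:,}"

# Identify the genre contributing the most sentences and its share of the total.
largest = max(genre_counts, key=genre_counts.get)
print(f"Largest genre: {largest} ({genre_counts[largest] / total:.1%} of sentences)")
```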



## **Filter Database**

In [29]:
filter_options = {
    'stop_set': set(get_stop_words("english")),
    'lemma_gen': CachedSpacyLemmatizer()
}

always_include_tokens = {
    'he', 'she',
    'him', 'her',
    'his', 'hers',
    'himself', 'herself',
    'man', 'woman',
    'men', 'women',
    'boy', 'girl',
    'boys', 'girls',
    'father', 'mother',
    'fathers', 'mothers',
    'son', 'daughter',
    'sons', 'daughters',
    'brother', 'sister',
    'brothers', 'sisters'
}

filter_davies_corpus(
    db_path_stub=db_path_stub,
    genre_focus=genre_focus,
    workers=128,
    batch_size=250_000,
    create_whitelist=True,
    apply_whitelist=True,
    whitelist_size=30_000,
    whitelist_spell_check=True,
    whitelist_year_range=(1950, 2019),
    compact_after=True,
    always_include=always_include_tokens,
    **filter_options
)

COCA CORPUS FILTERING
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start Time: 2026-01-16 18:11:59

Configuration
════════════════════════════════════════════════════════════════════════════════════════════════════
Source DB:            /scratch/edk202/NLP_corpora/COCA/COCA
Destination DB:       /scratch/edk202/NLP_corpora/COCA/COCA_filtered
Lowercase:            True
Alpha only:           True
ASCII alpha only:     True
Filter short:         True (min_len=3)
Filter stops:         True
Apply lemmas:         True
Workers:              128
Batch size:           250,000

Processing Sentences
════════════════════════════════════════════════════════════════════════════════════════════════════


Batches Processed: 100%|█████████████████████████████████████████████████████| 251/251 [02:31<00:00]



Building Whitelist
════════════════════════════════════════════════════════════════════════════════════════════════════
Whitelist path:          /scratch/edk202/NLP_corpora/COCA/COCA_whitelist.txt
Top N tokens:            30,000
Year range:              1950-2019
Spell check:             True

Scanning database...
Found 56,384,237 sentences in 1128 batches
Detecting years present in corpus...


Building token frequencies: 100%|██████████████████████████████████████████| 1128/1128 [13:08<00:00]



Years present in corpus within range: 30 years
  Range: 1990 to 2019
  Sample: [1990, 1991, 1992, 1993, 1994] ... [2015, 2016, 2017, 2018, 2019]

Filtering tokens by year coverage (must appear in all 30 years)...
Tokens before year filter: 46,135
Tokens after year filter:  23,776
Tokens removed:            22,359

Ranking 23,776 unique tokens...
Selected top 30,000 tokens (+ 0 always_include)
Writing whitelist to /scratch/edk202/NLP_corpora/COCA/COCA_whitelist.txt...
Whitelist written successfully: /scratch/edk202/NLP_corpora/COCA/COCA_whitelist.txt

Applying Whitelist
════════════════════════════════════════════════════════════════════════════════════════════════════
Loading whitelist into memory...
Loaded 23,776 tokens from whitelist

Replacing non-whitelist tokens with <UNK>...



Batches Processed: 100%|█████████████████████████████████████████████████████| 226/226 [02:34<00:00]



Whitelist application complete!
Sentences processed:      56,384,237
Sentences modified:       429,548

Post-Filter Compaction
════════════════════════════════════════════════════════════════════════════════════════════════════
Initial DB size:         5.83 GB
Compaction completed in 0:00:19
Size before:             5.83 GB
Size after:              5.82 GB
Space saved:             3.91 MB (0.1%)


Processing complete!

Final Summary
════════════════════════════════════════════════════════════════════════════════════════════════════
Sentences read:           62,668,682
Writes accumulated:       57,383,940
Sentences rejected:       5,284,742
Retention rate:           91.6%
Destination DB:           /scratch/edk202/NLP_corpora/COCA/COCA_filtered

End Time: 2026-01-16 18:35:08
Total Runtime: 0:23:08.671679
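
The whitelist log describes three steps: drop tokens that do not appear in every year within the range, rank the survivors, and replace everything else with `<UNK>`. The following is a minimal sketch of that logic on toy data, not the `daviesprep` implementation. `build_whitelist` and `apply_whitelist` are hypothetical names, ranking by total frequency is an assumption, and so is the way `always_include` is appended after ranking.

```python
from collections import Counter


def build_whitelist(year_token_counts, top_n, always_include=frozenset()):
    """Illustrative only: keep tokens seen in every year, ranked by total count."""
    n_years = len(year_token_counts)
    totals = Counter()    # token -> total occurrences across all years
    coverage = Counter()  # token -> number of years in which it appears
    for counts in year_token_counts.values():
        totals.update(counts)
        coverage.update(counts.keys())
    full_coverage = [tok for tok in totals if coverage[tok] == n_years]
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(full_coverage, key=lambda tok: (-totals[tok], tok))
    whitelist = ranked[:top_n]
    # Assumption: always_include tokens are appended if not already selected.
    extras = sorted(set(always_include) - set(whitelist))
    return whitelist + extras


def apply_whitelist(tokens, whitelist, unk="<UNK>"):
    """Replace every token outside the whitelist with the unknown marker."""
    allowed = set(whitelist)
    return [tok if tok in allowed else unk for tok in tokens]


toy_counts = {
    1990: Counter({"art": 5, "movie": 2, "fax": 3}),
    1991: Counter({"art": 4, "movie": 1, "blog": 6}),
}
whitelist = build_whitelist(toy_counts, top_n=10, always_include={"she"})
print(whitelist)  # ['art', 'movie', 'she']
print(apply_whitelist(["art", "blog", "movie"], whitelist))  # ['art', '<UNK>', 'movie']
```

Note that `top_n` is only an upper bound: in the run above, 30,000 tokens were requested but only 23,776 passed the year-coverage filter, so the whitelist file holds 23,776 tokens.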


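
The figures in the filter log can be cross-checked with simple arithmetic: rejected sentences are the difference between reads and writes, and the retention rate is writes divided by reads. The values below are copied by hand from the log above.

```python
# Values copied from the "Final Summary" and year-filter sections of the log.
sentences_read = 62_668_682
writes_accumulated = 57_383_940
sentences_rejected = 5_284_742

tokens_before_year_filter = 46_135
tokens_after_year_filter = 23_776
tokens_removed = 22_359

# Rejections account for every sentence that was read but not written.
assert sentences_read - writes_accumulated == sentences_rejected

# The year-coverage filter removes exactly the reported number of tokens.
assert tokens_before_year_filter - tokens_after_year_filter == tokens_removed

retention = writes_accumulated / sentences_read
print(f"Retention rate: {retention:.1%}")  # Retention rate: 91.6%
```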

## **Optional: Inspect Database Files**

### `db_head`: Show first N records

In [30]:
db = f'/scratch/edk202/NLP_corpora/{corpus_name}/{corpus_name}'

db_head(db, n=5)

First 5 key-value pairs:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [1990] # # Billy get your guns # @@5211283
     Value: Total: 1 occurrences in 1 volumes (1990-1990, 1 bins)

[ 2] Key:   [1990] # # California dreamin # # On such a winter 's day ## Look what I have in my hands
     Value: Total: 1 occurrences in 1 volumes (1990-1990, 1 bins)

[ 3] Key:   [1990] # # EDWARD # I 'll guard them and you with my life
     Value: Total: 1 occurrences in 1 volumes (1990-1990, 1 bins)

[ 4] Key:   [1990] # # GIDEON # Your mother asked me not to mention it but your mother 's birthday was last week
     Value: Total: 1 occurrences in 1 volumes (1990-1990, 1 bins)

[ 5] Key:   [1990] # # HANK # Well hoss hai n't got too much to say
     Value: Total: 1 occurrences in 1 volumes (1990-1990, 1 bins)



### `db_peek`: Show records starting from a key

In [31]:
db = f'/scratch/edk202/NLP_corpora/{corpus_name}/{corpus_name}_filtered'

db_peek(db, start_key="[2001] art", n=5)

5 key-value pairs starting from 000007d1617274:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [2001] art <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK>
     Value: Total: 1 occurrences in 1 volumes (2001-2001, 1 bins)

[ 2] Key:   [2001] art <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> movie
     Value: Total: 1 occurrences in 1 volumes (2001-2001, 1 bins)

[ 3] Key:   [2001] art <UNK> <UNK> <UNK> <UNK> seem <UNK> transcend <UNK> limitation
     Value: Total: 1 occurrences in 1 volumes (2001-2001, 1 bins)

[ 4] Key:   [2001] art <UNK> <UNK> <UNK> art <UNK> <UNK> <UNK> build <UNK> <UNK> <UNK>
     Value: Total: 1 occurrences in 1 volumes (2001-2001, 1 bins)

[ 5] Key:   [2001] art <UNK> <UNK> <UNK> goal <UNK> <UNK> occasion <UNK> method <UNK> locate <UNK> rhythm <UNK> bury possibility <UNK> <UNK> time
     Value: Total: 1 occurrences in 1 volumes (2001-2001, 1 bins)
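
The hex string `000007d1617274` printed for the start key `"[2001] art"` is consistent with a 4-byte big-endian year (`0x000007d1` is 2001) followed by the UTF-8 bytes of the text (`61 72 74` is `art`). The sketch below encodes and decodes keys under that assumption. The layout is inferred from the printed hex only, and `encode_key` and `decode_key` are hypothetical helpers rather than functions from `ngramprep`.

```python
import struct


def encode_key(year: int, text: str) -> bytes:
    """Assumed layout: 4-byte big-endian year, then the UTF-8 text."""
    return struct.pack(">I", year) + text.encode("utf-8")


def decode_key(key: bytes) -> tuple:
    """Inverse of encode_key: split off the year and decode the rest."""
    (year,) = struct.unpack(">I", key[:4])
    return year, key[4:].decode("utf-8")


print(encode_key(2001, "art").hex())      # 000007d1617274
print(encode_key(2018, "frankly").hex())  # 000007e26672616e6b6c79
print(decode_key(bytes.fromhex("000007d1617274")))  # (2001, 'art')
```

Because the year is stored big-endian, byte-wise key order sorts records by year first and then by text, which is what makes seeking to a year and token efficient.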



### `db_peek_prefix`: Records matching a prefix

In [36]:
db = f'/scratch/edk202/NLP_corpora/{corpus_name}/{corpus_name}_filtered'

db_peek_prefix(db, prefix="[2018] frankly", n=5)

5 key-value pairs with prefix 000007e26672616e6b6c79:
────────────────────────────────────────────────────────────────────────────────────────────────────
[ 1] Key:   [2018] frankly <UNK> <UNK> <UNK> <UNK> <UNK>
     Value: Total: 1 occurrences in 1 volumes (2018-2018, 1 bins)

[ 2] Key:   [2018] frankly <UNK> <UNK> <UNK> <UNK> clearly <UNK> case <UNK> <UNK> <UNK> <UNK> weight <UNK> vote
     Value: Total: 1 occurrences in 1 volumes (2018-2018, 1 bins)

[ 3] Key:   [2018] frankly <UNK> <UNK> <UNK> <UNK> help
     Value: Total: 1 occurrences in 1 volumes (2018-2018, 1 bins)

[ 4] Key:   [2018] frankly <UNK> <UNK> <UNK> <UNK> transparency <UNK> <UNK> see <UNK> <UNK> twitter statement <UNK> <UNK> long time even <UNK> <UNK> take <UNK> <UNK> half <UNK> day <UNK> get <UNK> <UNK> <UNK>
     Value: Total: 1 occurrences in 1 volumes (2018-2018, 1 bins)

[ 5] Key:   [2018] frankly <UNK> <UNK> <UNK> believe <UNK> <UNK> still live <UNK>
     Value: Total: 1 occurrences in 1 volumes (2018-2018, 1 b
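
Judging by the headings, `db_peek` returns records from a start key onward, while `db_peek_prefix` returns only records whose key begins with the prefix. The toy sketch below shows that difference on a sorted list of byte keys, standing in for an ordered key-value store. The `peek` and `peek_prefix` functions are illustrative stand-ins, and the behavior of the real utilities is assumed rather than confirmed.

```python
from bisect import bisect_left


def peek(sorted_keys, start_key, n):
    """Return up to n keys at or after start_key, regardless of prefix."""
    start = bisect_left(sorted_keys, start_key)
    return sorted_keys[start:start + n]


def peek_prefix(sorted_keys, prefix, n):
    """Return up to n keys that begin with prefix, stopping at the first miss."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if len(matches) == n or not key.startswith(prefix):
            break
        matches.append(key)
    return matches


keys = sorted([b"art gallery", b"art movie", b"artist", b"banana"])
print(peek(keys, b"art m", 3))         # [b'art movie', b'artist', b'banana']
print(peek_prefix(keys, b"art m", 3))  # [b'art movie']
```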

## **Optional: Count Database Items**

In [33]:
raw_db = f'/scratch/edk202/NLP_corpora/{corpus_name}/{corpus_name}'

raw_count = count_db_items(raw_db)

DATABASE ITEM COUNTER
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
/scratch/edk202/NLP_corpora/COCA/COCA
Progress interval: every 10,000,000 items

COUNTING
────────────────────────────────────────────────────────────────────────────────────────────────────
[     10,000,000] | elapsed      6.7s | rate 1,486,125 items/sec
[     20,000,000] | elapsed     13.8s | rate 1,453,764 items/sec
[     30,000,000] | elapsed     22.9s | rate 1,312,747 items/sec
[     40,000,000] | elapsed     34.5s | rate 1,158,450 items/sec
[     50,000,000] | elapsed     41.7s | rate 1,200,301 items/sec
[     60,000,000] | elapsed     48.7s | rate 1,232,552 items/sec

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ COUNT COMPLETE                                                                                   │
├────────────────────────────────────────────────────────────────────────────────────────────

In [34]:
filtered_db = f'/scratch/edk202/NLP_corpora/{corpus_name}/{corpus_name}_filtered'

filtered_count = count_db_items(filtered_db)

DATABASE ITEM COUNTER
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
/scratch/edk202/NLP_corpora/COCA/COCA_filtered
Progress interval: every 10,000,000 items

COUNTING
────────────────────────────────────────────────────────────────────────────────────────────────────


[     10,000,000] | elapsed      6.5s | rate 1,538,929 items/sec
[     20,000,000] | elapsed     15.9s | rate 1,255,239 items/sec
[     30,000,000] | elapsed     22.9s | rate 1,307,460 items/sec
[     40,000,000] | elapsed     30.5s | rate 1,310,869 items/sec
[     50,000,000] | elapsed     37.5s | rate 1,333,208 items/sec

┌──────────────────────────────────────────────────────────────────────────────────────────────────┐
│ COUNT COMPLETE                                                                                   │
├──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Items: 54,704,136                                                                                │
│ Elapsed: 42.06s                                                                                  │
│ Avg rate: 1,300,507 items/sec                                                                    │
│ Database: /scratch/edk202/NLP_corpora/COCA/COCA_filtered          
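
The next cell counts items per year bin via `grouping='year_bin'`. A minimal sketch of that kind of grouping follows, assuming each key starts with a 4-byte big-endian year and that a bin is labeled by rounding the year down to a multiple of `bin_size`. Both are assumptions for illustration, and `count_by_year_bin` is a hypothetical function rather than the `count_db_items` implementation.

```python
import struct
from collections import Counter


def year_bin(year: int, bin_size: int) -> int:
    """Assumed binning: round the year down to a multiple of bin_size."""
    return year - (year % bin_size)


def count_by_year_bin(keys, bin_size=1):
    """Count keys per year bin, reading the year from the first 4 bytes."""
    counts = Counter()
    for key in keys:
        (year,) = struct.unpack(">I", key[:4])
        counts[year_bin(year, bin_size)] += 1
    return dict(sorted(counts.items()))


toy_keys = [
    struct.pack(">I", 2001) + b"art movie",
    struct.pack(">I", 2001) + b"art time",
    struct.pack(">I", 2018) + b"frankly help",
]
print(count_by_year_bin(toy_keys, bin_size=1))   # {2001: 2, 2018: 1}
print(count_by_year_bin(toy_keys, bin_size=10))  # {2000: 2, 2010: 1}
```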

In [35]:
filtered_db = f'/scratch/edk202/NLP_corpora/{corpus_name}/{corpus_name}_filtered'

count_per_bin = count_db_items(filtered_db, progress_interval=1_000_000, grouping='year_bin')

DATABASE ITEM COUNTER
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
/scratch/edk202/NLP_corpora/COCA/COCA_filtered
Progress interval: every 1,000,000 items
Grouping by: year_bin

COUNTING
────────────────────────────────────────────────────────────────────────────────────────────────────
[      1,000,000] | elapsed      0.9s | rate 1,056,135 items/sec
[      2,000,000] | elapsed      1.9s | rate 1,070,470 items/sec
[      3,000,000] | elapsed      2.8s | rate 1,070,523 items/sec
[      4,000,000] | elapsed      3.7s | rate 1,071,370 items/sec
[      5,000,000] | elapsed      4.7s | rate 1,068,197 items/sec
[      6,000,000] | elapsed      5.6s | rate 1,070,215 items/sec
[      7,000,000] | elapsed      6.5s | rate 1,070,800 items/sec
[      8,000,000] | elapsed      7.5s | rate 1,071,166 items/sec
[      9,000,000] | elapsed      8.4s | rate 1,071,020 items/sec
[     10,000,000] | elapsed      9.4s | rate 1,061,643 items/sec
[     