In [371]:
# For development, use local paths.

import sys
sys.path.append("..")

In [372]:
# Automatically reload local modules when their source changes.
%load_ext autoreload
%autoreload 2



In [373]:
import nonconsumptive as nc

# Feature counts from text files.

This notebook creates a set of document-level feature count files akin to those distributed by the HathiTrust, but from Project Gutenberg text files in the folder `{nonconsumptive_root}/sample_inputs/gutenberg/texts`.
Metadata is read from a file called `metadata.ndjson` and bound to the text files by their filenames.

The feature counts are stored as Parquet files, which allows for fast processing.

# Create a corpus

First, create a corpus. Every corpus has to be built from a strategy for retrieving texts and a strategy for retrieving metadata.

Ideally these will be disentangled. Some strategies might include:

* metadata from { csv, yaml header block, TEI header blocks }
* text from { set of files }
* ids from { filename, filename plus directory, first column of MALLET input, etc. }

The ids allow looking up the texts in the metadata.
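
As a rough illustration of the "ids from filename" strategy, the sketch below binds text files to newline-delimited JSON metadata using only the standard library. The helper names `load_metadata` and `ids_from_filenames` are hypothetical and the sample files are made up for the demonstration; this is not how `nonconsumptive` itself is implemented.

```python
import json
import tempfile
from pathlib import Path


def load_metadata(ndjson_path, id_field="id"):
    """Read newline-delimited JSON into a dict keyed by the id field."""
    records = {}
    with open(ndjson_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                record = json.loads(line)
                records[str(record[id_field])] = record
    return records


def ids_from_filenames(folder, suffix=".txt"):
    """Use each file's stem (e.g. '15' for '15.txt') as its id."""
    return {p.stem: p for p in sorted(Path(folder).glob(f"*{suffix}"))}


# Tiny demonstration with made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "15.txt").write_text("Call me Ishmael.", encoding="utf-8")
    (root / "27.txt").write_text("A short sample text.", encoding="utf-8")
    (root / "metadata.ndjson").write_text(
        '{"id": "15", "title": "Moby-Dick; or, The Whale"}\n'
        '{"id": "27", "title": "Far from the Madding Crowd"}\n',
        encoding="utf-8",
    )

    metadata = load_metadata(root / "metadata.ndjson")
    paths = ids_from_filenames(root)
    # Each id derived from a filename looks up its metadata record.
    bound = {doc_id: metadata[doc_id]["title"] for doc_id in paths}
    print(bound)
```

Keeping the id strategy separate from the metadata reader means either one can be swapped (for example, ids from a directory plus filename) without touching the other.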



In [374]:
gutenberg = nc.corpus.FolderCorpus("../sample_inputs/gutenberg/texts/", metadata = "../sample_inputs/gutenberg/metadata.ndjson")

## Metadata

The metadata is stored internally as a pyarrow table with some wrappers to ensure type integrity.

Based on internal data and column types, this will leverage some Bookworm code to determine that "date" or "year" are date-type columns.
It should also be able to discriminate between "categorical" fields (in library parlance, "controlled vocabulary" fields) and free-entry ones, perhaps with additional help from configuration files.
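
To make the idea concrete, here is a deliberately crude sketch of such a type guess in plain Python. The rules, the `guess_column_type` name, the `max_distinct_ratio` threshold, and the `language` column are all assumptions made for illustration; the actual Bookworm logic is not reproduced here.

```python
def guess_column_type(name, values, max_distinct_ratio=0.5):
    """Guess whether a column is a date, categorical, or free text.

    Hypothetical rules:
    - name contains "date" or "year" and all values look like years -> "date"
    - values repeat often (few distinct values)                     -> "categorical"
    - anything else                                                 -> "free_text"
    """
    if not values:
        return "free_text"
    if any(part in name.lower() for part in ("date", "year")):
        if all(isinstance(v, int) and 0 < v < 3000 for v in values):
            return "date"
    distinct_ratio = len(set(values)) / len(values)
    if distinct_ratio <= max_distinct_ratio:
        return "categorical"
    return "free_text"


# "language" is an invented column, used only to show the categorical case.
columns = {
    "pubdate": [1851, 1894, 1905, 1917],
    "language": ["eng", "eng", "eng", "fre"],
    "title": [
        "Moby-Dick; or, The Whale",
        "Far from the Madding Crowd",
        "The Scarlet Pimpernel",
        "A Princess of Mars",
    ],
}
guesses = {name: guess_column_type(name, vals) for name, vals in columns.items()}
print(guesses)
```

A heuristic like this will misfire on small samples, which is one reason a configuration file that overrides the guess would be useful.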


In [375]:
gutenberg.metadata.tb.to_pandas().head(4)

Unnamed: 0,id,htid,pubdate,title,author
0,15,dul1.ark+=13960=t3kw6ns1s,1851,"Moby-Dick; or, The Whale","Melville, Herman"
1,27,coo.31924014152700,1894,Far from the Madding Crowd,"Hardy, Thomas"
2,60,nyp.33433075744890,1905,The Scarlet Pimpernel,"Orczy, Emmuska Orczy, Baroness"
3,62,hvd.32044004480208,1917,A Princess of Mars,"Burroughs, Edgar Rice"


Individual entries can be retrieved by their identifier. The identifier field should be called `filename` or `id`, or (ultimately) be specified in the corpus definition.

In [376]:
gutenberg.metadata.get("179")

{'id': '179',
 'htid': 'uc2.ark+=13960=t84j0c38h',
 'pubdate': 1879,
 'title': 'The Europeans',
 'author': 'James, Henry'}

# Documents

The documents part of the corpus is structured as an iterator, because it's generally foolhardy to read in all the documents at once.

Right now, the text of the document is read at iteration time. Ultimately, the strategy would be to read only those parts of the document that are needed. (For example, if you request feature counts, it's fine if the raw document isn't there.)
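
The lazy-reading idea can be sketched with a generator and a property that defers file access. `LazyDocument` and `iter_documents` are hypothetical names used only for this illustration, and the sample files are made up; the library's own document class may work differently.

```python
import tempfile
from pathlib import Path


class LazyDocument:
    """Hypothetical stand-in for a document that reads its text on demand."""

    def __init__(self, path):
        self.path = Path(path)
        self.id = self.path.stem
        self._text = None  # nothing has been read from disk yet

    @property
    def full_text(self):
        # Read the file the first time the text is requested, then cache it.
        if self._text is None:
            self._text = self.path.read_text(encoding="utf-8")
        return self._text


def iter_documents(folder):
    """Yield one document at a time instead of loading them all into a list."""
    for path in sorted(Path(folder).glob("*.txt")):
        yield LazyDocument(path)


with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("15", "Call me Ishmael."), ("27", "A short sample text.")]:
        (Path(tmp) / f"{name}.txt").write_text(text, encoding="utf-8")

    docs = iter_documents(tmp)
    first = next(docs)            # no file contents have been read so far
    first_id = first.id
    first_text = first.full_text  # the file is read here
    print(first_id, first_text)
```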

In [377]:
gutenberg.documents  # A bound method; call it to iterate over the documents.

<bound method FolderCorpus.documents of <nonconsumptive.corpus.FolderCorpus object at 0x7fd730479b50>>

## An individual document

`first` is a convenience method to get a single document.

In [378]:
one_book = gutenberg.first()
one_book

<DOCUMENT> {"id": "15", "htid": "dul1.ark+=13960=t3kw6ns1s", "pubdate": 1851, "title": "Moby-Dick; or, The Whale", "author": "Melville, Herman"}

In [379]:
one_book.metadata

{'id': '15',
 'htid': 'dul1.ark+=13960=t3kw6ns1s',
 'pubdate': 1851,
 'title': 'Moby-Dick; or, The Whale',
 'author': 'Melville, Herman'}

In [385]:
print(one_book.full_text[17000:17250])

e king.” —_Blackstone_.

“Soon to the sport of death the crews repair:
Rodmond unerring o’er his head suspends
The barbed steel, and every turn attends.”
—_Falconer’s Shipwreck_.

“Bright shone the roofs, the domes, the spires,
    And rockets blew s


In [386]:
one_book.tokens[1000:1010]

['.—', 'The', 'Chase', '.', 'First', 'Day', 'CHAPTER', 'CXXXV', '.—', 'The']

## Wordcounts

Wordcounts are a basic element of nonconsumptive reading that can be used in analysis or stored. They are returned as a pyarrow RecordBatch.

In [387]:
one_book.wordcounts

pyarrow.RecordBatch
token: string
count: uint32

In [388]:
one_book.wordcounts.to_pandas().query("token.str.match('whal')").head(5)

Unnamed: 0,token,count
500,whale,914
598,whales,239
1413,whale_,3
1829,whalemen,70
2443,whaling,118


## Metadata on wordcounts

The schema includes the document's metadata. Figuring out how to dress this up into full JSON-LD is a major goal.

In [420]:
one_book.wordcounts.schema.metadata

{b'nc_metadata': b'{"id": "15", "htid": "dul1.ark+=13960=t3kw6ns1s", "pubdate": 1851, "title": "Moby-Dick; or, The Whale", "author": "Melville, Herman"}'}

In [390]:
gutenberg.write_feature_counts("../sample_inputs/gutenberg/feature_counts/")

In [403]:
!ls ../sample_inputs/gutenberg/feature_counts/

103.parquet 141.parquet 165.parquet 178.parquet 27.parquet  84.parquet
105.parquet 142.parquet 170.parquet 179.parquet 60.parquet  86.parquet
119.parquet 144.parquet 171.parquet 203.parquet 62.parquet  91.parquet
121.parquet 145.parquet 172.parquet 215.parquet 64.parquet  94.parquet
126.parquet 15.parquet  173.parquet 217.parquet 72.parquet  95.parquet
133.parquet 155.parquet 174.parquet 222.parquet 73.parquet
135.parquet 158.parquet 175.parquet 224.parquet 76.parquet
139.parquet 161.parquet 176.parquet 233.parquet 78.parquet
140.parquet 164.parquet 177.parquet 234.parquet 82.parquet


## Iterating over feature counts

We can now iterate over the token count files to find, say, which books use the word "whale" the most.

In [419]:
whales = []

for meta, counts in gutenberg.feature_counts("../sample_inputs/gutenberg/feature_counts/"):
    # Convert each set of counts to pandas once and reuse the frame.
    df = counts.to_pandas()
    words = df['count'].sum()
    title = meta['title']
    whale_counts = df.query("token=='whale'")['count'].sum()
    whales.append((whale_counts, title, words))

whales.sort(reverse = True)
whales[:10]

[(914, 'Moby-Dick; or, The Whale', 265775),
 (34, 'Twenty Thousand Leagues under the Sea', 127587),
 (3, 'Frankenstein; Or, The Modern Prometheus', 89394),
 (1, 'The Red Badge of Courage: An Episode of the American Civil War', 59152),
 (1, 'The Poison Belt', 38688),
 (1, 'The Lost World', 93809),
 (1, 'The Call of the Wild', 41323),
 (1, 'McTeague: A Story of San Francisco', 146942),
 (1, 'Les Misérables', 689604),
 (1, 'Adventures of Huckleberry Finn', 146747)]

In [83]:
t.pre_tokenize_str(doc.fulltext[:1000])

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, decoders, trainers

tokenizer = Tokenizer(models.WordLevel())
tokenizer.normalizer = normalizers.NFKC()
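# UnicodeScripts and Whitespace live in the pre_tokenizers module imported above.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([pre_tokenizers.UnicodeScripts(), pre_tokenizers.Whitespace()])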
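# The Tokenizer attribute is `decoder` (singular). The space-joined decode output
# further down suggests the original assignment to `decoders` had no effect,
# so the line is left disabled rather than changing the recorded behavior.
# tokenizer.decoder = decoders.ByteLevel()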

trainer = trainers.WordLevelTrainer(
    show_progress=True,
    vocab_size=1_000_000,
)

# `files` was not defined in this cell; use the corpus's file listing instead.
tokenizer.train([str(f) for f in gutenberg.files()], trainer)

In [95]:
for file in gutenberg.files():
    print(file, end = "\r")
    tokenized = tokenizer.pre_tokenizer.pre_tokenize_str(file.open().read())

../sample_inputs/gutenberg/texts/119.txt

In [102]:
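import pyarrow as pa

# Each pre-tokenized item is a (token, (start, end)) pair; keep only the token.
pa.array([t[0] for t in tokenized])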

<pyarrow.lib.StringArray object at 0x7fd75224a760>
[
  "﻿",
  "The",
  "Project",
  "Gutenberg",
  "EBook",
  "of",
  "A",
  "Tramp",
  "Abroad",
  ",",
  ...
  "to",
  "our",
  "email",
  "newsletter",
  "to",
  "hear",
  "about",
  "new",
  "eBooks",
  "."
]

In [85]:
tokenizer.save("tmp")

In [91]:
len(tokenizer.encode(doc.fulltext[:1000]))

196

In [93]:
tokenizer.decode(tokenizer.encode(doc.fulltext[:1000]).ids)

'\ufeff The Project Gutenberg EBook of Confidence , by Henry James This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever . You may copy it , give it away or re - use it under the terms of the Project Gutenberg License included with this eBook or online at www . gutenberg . org Title : Confidence Author : Henry James Release Date : March 14 , 2006 [ EBook # 178 ] Last Updated : September 18 , 2016 Language : English Character set encoding : UTF - 8 *** START OF THIS PROJECT GUTENBERG EBOOK CONFIDENCE *** Produced by Judith Boss and David Widger CONFIDENCE by Henry James CHAPTER I It was in the early days of April ; Bernard Longueville had been spending the winter in Rome . He had travelled northward with the consciousness of several social duties that appealed to him from the further side of the Alps , but he was under the charm of the Italian spring , and he made a pretext for lingering . He had spent five days at Siena , where he had intend