# Corpora
This notebook demonstrates how the Hawthorne corpora were generated from the [Gale corpus of American Fiction](https://www.gale.com/c/american-fiction-1774-1920), and provides summary statistics about the corpora.

In [1]:
import pandas as pd

In [9]:
gale_path = '/Users/e/Documents/Corpora/Gale/txt'

In [14]:
meta_path = '/Users/e/Documents/Corpora/Gale/metadata.csv'

In [15]:
meta = pd.read_csv(meta_path, low_memory=False)  # low_memory=False: parse the file in one pass so mixed-dtype columns are inferred consistently

In [30]:
meta[:2] # example of what metadata looks like

Unnamed: 0,filename,year,author,author_birth_date,author_death_date,full_title,publication_place_city,total_pages,ocr_accuracy,valid_words,decade
0,AMFCF0002-C00000-B0014500.txt,1849,"Curtis, Newton Mallory",,,The Patrol of the Mountain: A Tale of the Revo...,New York,112.0,0.904358,64450,1840
1,AMFCF0002-C00000-B0781700.txt,1886,"Dromgoole, William Allen, Miss.",,,The Sunny Side of the Cumberland: A Story of t...,Philadelphia,438.0,0.914261,119077,1880


Date range of texts in Gale:

In [31]:
meta['year'].describe()

count    18150.000000
mean      1887.492837
std         25.129007
min       1785.000000
25%       1872.000000
50%       1895.000000
75%       1906.000000
max       1920.000000
Name: year, dtype: float64

The total number of words in Gale is calculated below. Note that `valid_words` is calculated by tokenizing the source text and validating the resulting tokens against an OED dictionary of about 3.5M words. As a result, `valid_words` excludes non-dictionary words and errors.

In [24]:
meta['valid_words'].sum()

1175772600
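
As a rough illustration of the kind of count described above, here is a minimal sketch of dictionary-validated word counting. The tiny `DICTIONARY` set, the regular-expression tokenizer, and the `count_valid_words` helper are all assumptions made for this example; they are not the procedure actually used to produce `valid_words` in the Gale metadata.

```python
import re

# Hypothetical stand-in for the dictionary; the real one has about 3.5M words.
DICTIONARY = {"the", "patrol", "of", "mountain", "a", "tale"}

def count_valid_words(text, dictionary=DICTIONARY):
    """Count alphabetic tokens that appear in the dictionary (case-insensitive)."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return sum(1 for token in tokens if token in dictionary)

# "tbe" imitates an OCR error; "Revolution" is simply absent from the toy dictionary.
sample = "The Patrol of the Mountain: a tale of tbe Revolution"
print(count_valid_words(sample))  # 8 of the 10 tokens are counted
```

Under this scheme both OCR errors and legitimate words missing from the dictionary are dropped, which is why `valid_words` is best read as a lower bound on the word count.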

And the number of unique authors in Gale:

In [91]:
len(meta['author'].unique())

8580
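
One caveat about this kind of count: `Series.unique()` keeps a missing value as one of its entries, while `Series.nunique()` excludes missing values by default. The sketch below uses a made-up `authors` series to show the difference; whether the Gale metadata actually contains rows with a missing author is not checked here.

```python
import numpy as np
import pandas as pd

# Toy data: one missing author and one repeated author.
authors = pd.Series([
    "Curtis, Newton Mallory",
    np.nan,
    "Hawthorne, Nathaniel",
    "Hawthorne, Nathaniel",
])

n_with_missing = len(authors.unique())  # the missing value counts as one entry
n_named = authors.nunique()             # missing values are excluded by default

print(n_with_missing, n_named)  # 3 2
```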

# Subsetting Gale
Here, we subset Gale to the years of Hawthorne's writing life (from *Fanshawe* in 1828 to his death in 1864) to capture contemporaneous publications:

In [60]:
sub = meta[meta['year'].between(1828, 1864)]

In [61]:
len(sub) # number of works in subset

3349

In [62]:
sub['valid_words'].sum() # number of words

215186977

Results are slightly different if we count from *Twice-Told Tales* (1837):

In [63]:
sub2 = meta[meta['year'].between(1837, 1864)]

In [64]:
len(sub2) # number of works in subset

2980

In [65]:
sub2['valid_words'].sum() # number of words

196460034
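
Both subsets rely on `Series.between`, which includes both endpoints by default, so works dated 1828 (or 1837) and 1864 are kept. The sketch below wraps the two steps (count works, sum words) in a hypothetical `summarize_range` helper and runs it on toy data rather than on the Gale metadata.

```python
import pandas as pd

def summarize_range(meta, start, end):
    """Return (number of works, total valid words) for start <= year <= end."""
    subset = meta[meta["year"].between(start, end)]
    return len(subset), int(subset["valid_words"].sum())

# Toy metadata with years just inside and just outside the ranges.
toy = pd.DataFrame({
    "year": [1827, 1828, 1837, 1864, 1865],
    "valid_words": [10, 20, 30, 40, 50],
})

print(summarize_range(toy, 1828, 1864))  # (3, 90)
print(summarize_range(toy, 1837, 1864))  # (2, 70)
```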

# Finding Hawthorne
`nh` collects all of the works by Hawthorne in the 1828-1864 subset of Gale:

In [100]:
nh = sub[sub['author'].str.contains('Hawthorne, Nathaniel', na = False)]

In [101]:
nh['valid_words'].sum() # Hawthorne words in corpus

1120788
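
The `na = False` argument matters when some rows have no author: it makes `str.contains` report `False` for those rows, so the mask is safe to use for row selection. The toy frame below is an assumption for illustration only and does not reproduce the Gale metadata.

```python
import numpy as np
import pandas as pd

# Toy metadata with one missing author.
toy = pd.DataFrame({
    "author": ["Hawthorne, Nathaniel", np.nan, "Curtis, Newton Mallory"],
    "valid_words": [100, 50, 25],
})

# With na=False, the row with a missing author is treated as a non-match.
mask = toy["author"].str.contains("Hawthorne, Nathaniel", na=False)

print(mask.tolist())                   # [True, False, False]
print(toy[mask]["valid_words"].sum())  # 100
```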

Hawthorne's words as a percentage of the corpus:

In [102]:
pct = nh['valid_words'].sum() / sub['valid_words'].sum()

In [109]:
print('{}%'.format(round(pct * 100, 3)))

0.521%


For comparison, we count the words in the [Library of America Hawthorne](https://loa.org/writers/257-nathaniel-hawthorne) (we do not have digital access to the Ohio editions):

In [49]:
loa = '/Users/e/code/hawthorne/local/loa_nh/loa_hawthorne_all.txt'

In [51]:
with open(loa) as f:
    text = f.read()
    words = [x for x in text.split(' ') if x] # split words by spaces; don't count blanks
len(words)

984400
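
Note that `split(' ')` breaks the text on space characters only, so two words separated by a newline or tab are counted as a single token. The sketch below contrasts it with `split()`, which breaks on any run of whitespace. The sample string is invented, and how much this affects the LoA figure above depends on how that file is laid out, which is not checked here.

```python
# Invented sample with a newline, a double space, and a tab between words.
text = "The scarlet\nletter  was\tred"

# Same approach as the cell above: split on spaces, drop empty strings.
by_space = [x for x in text.split(' ') if x]

# Alternative: split on any run of whitespace; no empty strings are produced.
by_whitespace = text.split()

print(len(by_space))       # 3
print(len(by_whitespace))  # 5
```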

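Gale's count for Hawthorne (1,120,788 valid words) exceeds the LoA count (984,400 space-delimited tokens) by roughly 136,000 words. This suggests that Gale contains a small amount of internal duplication of Hawthorne's work relative to the LoA (plus or minus any editorial inclusions or exclusions in the LoA that are not present in Gale's editions of Hawthorne's work). Because the two counts were produced by different methods, the comparison is approximate.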

# Creating corpora
First, the 1828-1864 corpus, inclusive:

- Revise into function

In [55]:
import os
from shutil import copyfile

In [71]:
to_path = 'local/corpus'

In [73]:
os.makedirs(to_path, exist_ok=True)  # also creates the parent 'local' directory if it is missing

In [74]:
texts = sub['filename']

In [75]:
texts.iloc[0]  # first filename by position; the subset keeps meta's original index labels

'AMFCF0002-C00000-B0014500.txt'

In [76]:
for text in texts:
    f_from = os.path.join(gale_path, text)
    f_to = os.path.join(to_path, text)
    copyfile(f_from, f_to)

And, second, 1828-1864 without Hawthorne:

In [84]:
to_path = 'local/corpus_no_nh'

In [85]:
nonh = sub[~sub['author'].str.contains('Hawthorne, Nathaniel', na = False)]

In [86]:
os.makedirs(to_path, exist_ok=True)  # also creates the parent 'local' directory if it is missing

In [87]:
texts = nonh['filename']

In [None]:
for text in texts:
    f_from = os.path.join(gale_path, text)
    f_to = os.path.join(to_path, text)
    copyfile(f_from, f_to)

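Finally, Hawthorne alone. Note that this selection is made from the full `meta` table rather than from the 1828-1864 subset `sub`, so it includes every Hawthorne text in Gale regardless of year: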

In [110]:
to_path = 'local/gale_nh/'

In [111]:
os.makedirs(to_path, exist_ok=True)  # also creates the parent 'local' directory if it is missing

In [118]:
nh = meta[meta['author'].str.contains('Hawthorne, Nathaniel', na = False)]

In [120]:
texts = nh['filename']

In [122]:
for text in texts:
    f_from = os.path.join(gale_path, text)
    f_to = os.path.join(to_path, text)
    copyfile(f_from, f_to)
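
The three copy loops above repeat the same steps, so they could be folded into a single helper. The following is a minimal sketch of one way to do that; the name `copy_corpus` and its return value are choices made for this example and are not part of the original notebook.

```python
import os
from shutil import copyfile

def copy_corpus(filenames, from_dir, to_dir):
    """Copy each named file from from_dir into to_dir, creating to_dir if needed.

    Returns the number of files copied.
    """
    os.makedirs(to_dir, exist_ok=True)
    copied = 0
    for name in filenames:
        copyfile(os.path.join(from_dir, name), os.path.join(to_dir, name))
        copied += 1
    return copied

# Intended usage with the variables defined earlier in this notebook:
# copy_corpus(sub['filename'], gale_path, 'local/corpus')
# copy_corpus(nonh['filename'], gale_path, 'local/corpus_no_nh')
# copy_corpus(nh['filename'], gale_path, 'local/gale_nh/')
```

Returning the count makes it easy to check the result against `len(sub)`, `len(nonh)`, or `len(nh)` after each call.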

Now that we have our corpora, we can begin the analyses.