Devin Short
30 June 2023
shortda@uw.edu

This is a tutorial demonstrating how to explore data parsed out of the catalogues. The idea is to understand what kind of content we've generated, get a feel for how well our current tools are working, and try improving them. This notebook requires functions from `reporting.py`, currently living in the `scripts` directory of the GitHub repository.

Let's import some packages and set parameters we'll use below.

## <u>Testing new methods</u>

If you can come up with a way to identify those author-to-author lines and separate them from the entries, you'll improve our data set. That's going to be pretty difficult though, so let's start with something easier.

Have a look at a sample of the longest entries:

The first two cases are relatively easy to handle. Let's start by finding the regular expressions our code uses to split entries apart and build on that to improve the current process. That code is documented in `Parsing_ECB_1912_wi23.ipynb`, currently living in the `scripts` directory of the GitHub repository. In the section "Splitting the Text into Entries" in that notebook, I find the following process:

In [18]:
# ALL CODE IN THIS CELL IS ADAPTED FROM Parsing_ECB_1912_wi23.ipynb
import re

# read the Hathi Trust OCR into memory
with open('ecb_1919.txt', encoding='utf8') as f:
    ocr_full_text = f.read()

# Separate main text from the front matter of the catalogue (note the
# string called ocr_main_text still has the back matter attached until
# a later step)
# Note python allows multiple assignment: the split operation below
# creates a two-element list whose elements are then assigned to two
# different variables
ocr_front_matter, ocr_main_text = re.split(r'centimetres', ocr_full_text)

appendix_pattern = (
    r"LEARNED SOCIETIES, PRINTING CLUBS, &c., WITH LISTS OF THEIR\nPUBLICATIONS, 1918\."
)

# Separate main text from the back matter of the catalogue
ocr_main_text, ocr_back_matter = re.split(appendix_pattern, ocr_main_text)

# Make a regular expression to capture headers at the top of each
# catalogue page
header_capital_letters = r"^(?:[A-Z\-\'\sÈ]+)"
header_pattern = r"^#(?s:.*?){}(?s:.*?){}(?s:.*?){}$".format(
    header_capital_letters, header_capital_letters, header_capital_letters
)

# Split the main text into pages and strip the headers. According to
# Parsing_ECB_1912_wi23.ipynb this method fails to remove six headers.
pages = [
    re.sub(header_pattern, '', page, flags=re.M)
    for page in ocr_main_text.split('\f')
]

# Now find all lines ending in 19 (optionally followed by a full stop)
# and insert a token to split on
entries_by_page = [
    re.sub(r'(\W19\.?$)', '\\1<ENTRY_CUT>', page, flags=re.M)
    for page in pages
]

# Split on the token to create a list of lists where each element is
# a list of entries on an individual page
entries_by_page = [
    re.split(r'<ENTRY_CUT>', page, flags=re.M)
    for page in entries_by_page
]

print(f'total entries: {sum([len(p) for p in entries_by_page])}')

total entries: 13918
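
To see what the cut-token step does in isolation, here is a minimal sketch on a few invented lines written in the catalogue's style. The sample text and the names `sample_page` and `sample_entries` are made up for illustration and are not from the OCR file. It applies the same expression as the cell above and shows one way that expression can miss an entry boundary: a year the OCR read as `lg` is not cut, so two entries come out merged.

```python
import re

# Invented sample in the catalogue's style (not taken from the OCR file)
sample_page = (
    "Sky pilot, Connor (R.) 6s. net\n"
    "Apr. '19\n"
    "Slack (J. E.)-At a glance. 7s. 6d. net\n"
    "TECHNICAL PUBLG. CO., Mar. 'lg\n"
    "Slater (E. I.)-Practical German grammar\n"
    "DENT, May '19.\n"
)

# Same expression as the cell above: a non-word character, then 19, an
# optional full stop, then the end of a line (re.M makes $ match at
# every line end rather than only at the end of the string)
marked = re.sub(r"(\W19\.?$)", "\\1<ENTRY_CUT>", sample_page, flags=re.M)

# Split on the token and drop the empty remainder after the last cut
sample_entries = [e.strip() for e in marked.split("<ENTRY_CUT>") if e.strip()]

for entry in sample_entries:
    print(repr(entry))
```

The second string printed contains both the Slack and the Slater lines, which is the kind of over-long entry this section is trying to break apart.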


Now let's test a new regular expression for splitting entries

In [30]:
# This cell assumes we're looking at the 1919 catalogue; try changing
# this code (and the year parameter in the first code cell of the
# notebook) to catch OCR mistakes for a different year.

# create a list of possible values the OCR software might have created
# when it ran into an 18 or 19 on the page
ocr_interpretations_of_18 = [
    '18'
]
ocr_interpretations_of_19 = [
    '19',
    'lg'
]
terminators = ocr_interpretations_of_18 + ocr_interpretations_of_19

# modify the existing regex with our new options
entry_terminator_regex = r'(\W({})\.?$)'.format('|'.join(terminators))

# get a new set of entries the same way as above, but using the new
# regular expression
new_entries_by_page = [
    re.sub(entry_terminator_regex, '\\1<ENTRY_CUT>', page, flags=re.M)
    for page in pages
]
new_entries_by_page = [
    re.split(r'<ENTRY_CUT>', page, flags=re.M)
    for page in new_entries_by_page
]

print(f'total entries: {sum([len(p) for p in new_entries_by_page])}')

#long_entries = clean_entries.loc[(lengths > 250)]
#print(long_entries.sample(20))



print('New entries on one page:')
for item in new_entries_by_page[215]:
    print(item)


total entries: 14646
New entries on one page:



2.
a
45. net
Skinner (Walter R.)-The Oil and petroleum
manual, 1919. 8vo. 81 x 54, pp. 276, 6s. net
SKINNER, June '19

Skipton (H. P. K.)-George Alfred Lefroy, Bishop
of Calcutta. Cr. 8vo. 2d. net
(19, Arthur Rd., Brixton)
MISSIONARY LIT. SUPPLY, Oct.'19

Skvortzov (J. V.)-Russian history. Vols. I and
18mo. Ea. swd. is, net
(78, Charing Cross Rd.) R. JASCHKE, June 19

Sky pilot, Connor (R.) 6s. net
. Apr.'19

Sky pilot, Connor (R.) is. net July '19

Sky pilot of No Man's Land, Connor (R.) 6s. net
May '19

Slack (Charles, Capt.) - Tourist's and student's
manual of languages. 7th ed. 18mo. 61 x 4,
pp. 128, 4s. 6d. net .. SIMPKIN, Aug. '19

Slack (J. E.) and Dorey (A.)--"At a glance."
Ryl. 8vo. 91 x6, pp. 76, 78. 6d. net
THE TECHNICAL PUBLG. CO., Mar. '19

Sladen (Douglas)—Paul's wife ; or, The Ostriches :
a romance of the awakening of Britain. Cr. 8vot
74 X 5, pp. 256, 6s. 9d. net HUTCHINSON, Feb.'19

Slater (E. I.)-Practical German gramma
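
Before trusting the larger total, it may help to see which terminator variant is responsible for each cut, since an option like `18` could also match lines that are not the end of an entry. The sketch below is only an illustration under that assumption: the `sample_pages` text is invented, and whether such false positives actually occur in the real OCR is something to check against `pages`.

```python
import re
from collections import Counter

# Invented sample pages in the catalogue's style (not from the OCR file)
sample_pages = [
    "Sky pilot, Connor (R.) 6s. net\n"
    "Apr. '19\n"
    "Slack (J. E.)-At a glance. 7s. 6d. net\n"
    "TECHNICAL PUBLG. CO., Mar. 'lg\n",
    "Smith (A.)-A history. Vols. 1 to 18\n"
    "Cr. 8vo. 6s. net\n"
    "DENT, Dec. '18\n"
    "Smith (B.)-Poems. 2s. net\n"
    "LANE, May '19.\n",
]

terminators = ["18", "19", "lg"]
# Capture only the variant so we can tell which option matched
terminator_regex = re.compile(r"\W({})\.?$".format("|".join(terminators)))

# Record (variant, line) for every line that would receive a cut token
hits = []
for page in sample_pages:
    for line in page.splitlines():
        match = terminator_regex.search(line)
        if match:
            hits.append((match.group(1), line))

cuts_per_variant = Counter(variant for variant, _ in hits)
print(cuts_per_variant)

# Print the lines cut by one variant so they can be checked by eye
for variant, line in hits:
    if variant == "18":
        print(line)
```

In this invented sample one of the two `18` cuts lands on a line ending in a volume number rather than a date, which is the sort of thing worth looking for before adding a new variant.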

OK, the two totals printed above (13918 and 14646) differ by 728, so the new expression made 728 cuts that the previous splitting expression missed. Each extra cut turns one string into two, meaning we might have gained as many as 1456 entries that were either not present or invalid in the previous data set. Let's see if that's really what happened.

In [None]:
import pandas as pd  # harmless if already imported earlier in the notebook

# It can be much more efficient to do things like check if elements in
# one sequence exist in another using pandas objects rather than the
# standard python library, so I flatten the lists of lists of entries
# into pandas series containing all the entries.
old_entries = pd.Series(
    [e for page in entries_by_page for e in page],
    dtype=pd.StringDtype()
)
new_entries = pd.Series(
    [e for page in new_entries_by_page for e in page],
    dtype=pd.StringDtype()
)

# strip leading and trailing whitespace and replace all newlines in
# each entry with spaces
old_entries = old_entries.str.strip()
new_entries = new_entries.str.strip()
old_entries = old_entries.str.replace('\n', ' ')
new_entries = new_entries.str.replace('\n', ' ')

# Get new entries that don't exist in the old set
new_entries = new_entries.loc[~new_entries.isin(old_entries)]

print(f'number of new entries: {len(new_entries)}')

# plot a histogram of lengths of the new entries
fig2 = new_entries.map(len).hist(bins=range(0, 400, 5))

The histogram we started with covers nearly 20,000 entries. Here we have far fewer, but the plot generated in the previous cell already looks extremely similar to the one at the top of the document. This means we can probably assume the entries we've produced here are just as good as the rest of the entries we're already capturing. Probably! I haven't actually looked at the new entries in detail, but it's 0200 and I need to finish this up.
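
One way to make "looks extremely similar" a little more concrete is to put a few summary numbers for the two sets of entry lengths side by side. The sketch below is a rough illustration, not part of the project code: `length_summary` is a hypothetical helper, the two lists are invented stand-ins for the old and new entries, and it uses only the standard library so it runs without the notebook's data. On the real pandas series, `Series.describe()` reports similar statistics.

```python
import statistics

def length_summary(entries):
    """Return count, min, quartiles and max of the entry lengths."""
    lengths = sorted(len(e) for e in entries)
    q1, median, q3 = statistics.quantiles(lengths, n=4)
    return {
        "count": len(lengths),
        "min": lengths[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": lengths[-1],
    }

# Invented stand-ins for the old and new sets of entries
old_sample = [
    "Sky pilot, Connor (R.) 6s. net Apr. '19",
    "Smith (B.)-Poems. 2s. net LANE, May '19.",
    "Skipton (H. P. K.)-George Alfred Lefroy. Cr. 8vo. 2d. net Oct. '19",
    "Slack (Charles, Capt.)-Manual of languages. 4s. 6d. net SIMPKIN, Aug. '19",
]
new_sample = [
    "Slack (J. E.)-At a glance. 7s. 6d. net TECHNICAL PUBLG. CO., Mar. 'lg",
    "Smith (A.)-A history. Cr. 8vo. 6s. net DENT, Dec. '18",
    "Sky pilot, Connor (R.) 1s. net July 'lg",
    "Slater (E. I.)-Practical German grammar DENT, May 'lg",
]

old_summary = length_summary(old_sample)
new_summary = length_summary(new_sample)

for key in old_summary:
    print(f"{key:>6}  old: {old_summary[key]:>6}  new: {new_summary[key]:>6}")
```

Similar numbers would only say that the new entries have plausible lengths. Reading a random sample of them is still the real check.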

## <u>Wrapping up</u>

The goal here was to demonstrate how to play around with the data and the code for our summer project. The general method I followed was something like the following:
1. Get some kind of overview of the output from our current parsing operations. I went with the size of the strings but there are all kinds of ways to do this. You could think about features of strings that have words in all caps and compare them to strings that don't have capitalized words (publishers in the author-first entries are almost always capitalized). You could start by exploring the dataframes, where the strings have been parsed out into specific fields, rather than thinking about whole entries. Explore the results we have and find something you think is interesting. If you can't think of a way to use that feature to explore the data, run it by one of your colleagues and see what they think. Your team is always your best asset!
2. Use that view of the data to characterize what's working and what isn't. In this case it was clear that unreasonably short or long strings were going to have some errors, so I was able to look at those entries to see where there might be problems in the existing process.
3. Find out which part of the existing code is relevant for the feature you're interested in. This part is probably going to be hard if you haven't had much experience with programming. If it isn't obvious how to proceed after you've had a look at some of the scripts, ask me or Anna how to do what you want to do.
4. See if you can modify the code to get different results.
5. Use the same overview process you started with to compare your new result with the old one.
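
As a starting point for the all-caps idea in step 1, here is a minimal sketch that flags entries containing at least one word in capitals. The three strings are lightly shortened from the page printed earlier; the helper `has_all_caps_word` and the choice of three or more capital letters as the threshold are assumptions made for illustration, not something taken from the project code.

```python
import re

# A run of three or more capital letters standing alone as a word.
# The threshold of three is a guess meant to skip initials like "R."
ALL_CAPS_WORD = re.compile(r"\b[A-Z]{3,}\b")

def has_all_caps_word(entry):
    """Return True if the entry contains a word written in capitals."""
    return ALL_CAPS_WORD.search(entry) is not None

# Entries lightly shortened from the page printed earlier
sample_entries = [
    "Skinner (Walter R.)-The Oil and petroleum manual, 1919. 8vo. 6s. net SKINNER, June '19",
    "Sky pilot, Connor (R.) 6s. net Apr.'19",
    "Slack (Charles, Capt.) - Tourist's and student's manual of languages. 4s. 6d. net SIMPKIN, Aug. '19",
]

flags = [has_all_caps_word(e) for e in sample_entries]

with_caps = [e for e, flag in zip(sample_entries, flags) if flag]
without_caps = [e for e, flag in zip(sample_entries, flags) if not flag]

print(f"entries with a capitalized word: {len(with_caps)}")
print(f"entries without one: {len(without_caps)}")
```

From here the two groups can be compared using whatever overview you started with, for example their entry lengths.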
