<a href="https://colab.research.google.com/github/jtcarlyle/parsing-ecb-1912/blob/main/Parsing_ECB_1912.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook, we're trying to parse information from the English
Catalogue of Books (1912) based on OCR'd text from digital scans. The
scans of the book can be found on HathiTrust here: https://babel.hathitrust.org/cgi/pt?id=nyp.33433087536938&view=1up&seq=



Import statements and other relevant setup go here. First, we import
our main packages.

In [None]:
from string import punctuation
import re
import pandas as pd

pd.set_option("display.max_colwidth", 800)
pd.set_option("display.max_rows", 800)
import gdown

Then we pull down the plain text file we are working with. 

(This is only necessary on Colab. Locally, I have the file saved in
my project directory.)


In [None]:
from google.colab import drive

drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
from google.colab import drive

drive.mount("/content/gdrive/", force_remount=True)

Mounted at /content/gdrive/


Download file from Google Drive.

In [None]:
google_drive_file = "https://drive.google.com/uc?id=1mU_bG5JfRhei_VWvOnNzrCb9u6x4XXzl"
output_filename = "ecb_1912_trial.txt"
gdown.download(google_drive_file, output_filename, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1mU_bG5JfRhei_VWvOnNzrCb9u6x4XXzl
To: /content/ecb_1912_trial.txt
100%|██████████| 2.19M/2.19M [00:00<00:00, 194MB/s]


'ecb_1912_trial.txt'

In [None]:
path = "/content/ecb_1912_trial.txt"

# the context manager closes the file once we are done with it
with open(path, "r", encoding="utf-8", errors="ignore") as infile:
    contents = infile.read()

In [None]:
print(contents[:1000000])

# Splitting the Text into Entries

Splitting off the front and back matter from the entries

In [None]:
patternFront = r"A\nACADEMY"
text_raw = re.split(patternFront, contents)

front_matter = text_raw[0]
ecb_content = text_raw[1]

appendix_pattern = r"APPENDIX\nLEARNED SOCIETIES, PRINTING CLUBS, &c., WITH LISTS OF THEIR\nPUBLICATIONS, 1912"
appendix_list = re.split(appendix_pattern, ecb_content)

ecb_content = appendix_list[0]
back_matter = appendix_list[1]

# print(ecb_content[:5000])

To get a list of pages, we split on the formfeed page breaks using
`'\f'`. 

In [None]:
ecb_pages = ecb_content.split("\f")
print(ecb_pages[7])

There is a rather serious issue with the OCR on page 8. I checked it
against the PDF: after the headers ALMANACK and AMES, the OCR skips
lines all the way down to a line that should come after the line that
begins "Alphabet, formation of the...." It is not clear how many
times this happened in the HathiTrust OCR. John considered redoing
the whole OCR using the Adobe proprietary OCR, but it ended up doing
a slightly worse job at page layout recognition and a much worse
job at character recognition, so we decided to abandon the attempt. Besides
being more accurate, the HathiTrust OCR (probably made using Google's OCR)
saves us from having to worry about the implications of taking
responsibility for the OCR ourselves.

One thing that was nice about the Adobe OCR was that it was able
to detect font style attributes like bold text and italics. This is
important information, since the headword of each main entry in the
catalogue is always given in bold. However, other structural information
in the entries allowed us to accurately infer which entries were main
entries. Recovering the additional font information was not worth the
other problems the Adobe OCR introduced.



Now that we have split by pages, we can split on lines that end in
a month abbreviation followed by 12 (for 1912) to get entries. Before
that, we need to strip out the page headers.

`pat3` below performed the best. It looks for the first three runs
of capital letters, which may include hyphens, apostrophes, whitespace,
and È. The scoped inline flag `(?s:...)` allows `.` to match line breaks
inside that group. The function calls that use the pattern pass the
multiline flag (`re.M`), so `^` and `$` match at line boundaries. There
were only two headers that the pattern could not match.
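To make the two flags concrete, here is a minimal sketch on a made-up page. The `sample_page` text is an assumption that only loosely imitates the OCR layout; it is not taken from the real file. It shows that `.` only crosses line breaks inside `(?s:...)`, and that a `pat3`-style pattern removes the header block while leaving the entry text in place.

```python
import re

# Toy page (an assumption, not real data): a "#"-prefixed scan marker,
# a page number, the running title, two catchwords, then one entry.
sample_page = (
    "## p. 8 (#18) ####\n"
    "8\n"
    "THE ENGLISH CATALOGUE\n"
    "ALMANACK\n"
    "AMES\n"
    "Almanack, Nautical, 2s. 6d. net ..Jan. 12\n"
)

# Without the scoped flag, "." stops at the end of the first line.
no_dotall = re.search(r"^#.*?^THE", sample_page, flags=re.M)
# With (?s:...), "." may cross line breaks inside that group only.
with_dotall = re.search(r"^#(?s:.*?)^THE", sample_page, flags=re.M)

# Same construction as pat3: three runs of capitals, each at a line start.
caps_line = r"^(?:[A-Z\-'\sÈ]+)"
header_re = r"^#(?s:.*?){0}(?s:.*?){0}(?s:.*?){0}$".format(caps_line)
stripped = re.sub(header_re, "", sample_page, flags=re.M)

print(no_dotall is None, with_dotall is not None)
print(repr(stripped))
```

Because the caps class also contains `\s`, a run of capitals can spill over a line break, so the final `$` is what keeps the match from eating into the first entry here.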


In [None]:
# wasn't getting all the headers
pat1 = r"^#(?s:.*?)^[A-Z]+(?s:.*?)^[A-Z]+(?s:.*?)^[A-Z]+$"
pat2 = r"^##(?s:.*?)^THE ENGLISH CATALOGUE(?s:.*?)^[A-Z]{3,}(?s:.*?)^[A-Z]{3,}$"

# I was missing hyphens, apostrophes, spaces and accented E in the headwords
caps_header = r"^(?:[A-Z\-\'\sÈ]+)"
pat3 = r"^#(?s:.*?){}(?s:.*?){}(?s:.*?){}$".format(
    caps_header, caps_header, caps_header
)
# the only remaining headers I can't grab are:
# ## p. 188 (#198) ############################################  188 [1912 THE ENGLISH CATALOGUE 58. net MacHARDY M'Hardy (George)-The Higher powers of the soul 12mo. 7 X 41, pp. 134, 25. net T. & T. CLARK, Dec. 12
# ## p. 330 (#340) ############################################  330 (1912 THE ENGLISH CATALOGUE
# this is because of the comma and lowercase letters in the former, and the lack of headwords in
# the latter

# here are some tests to get them all
len(ecb_pages)  # 330
whole_ecb = "\f".join(ecb_pages)

matches = re.findall(pat1, whole_ecb, flags=re.M)
len(matches)  # 313

matches = re.findall(pat2, whole_ecb, flags=re.M)
len(matches)  # 296

matches = re.findall(pat3, whole_ecb, flags=re.M)
len(matches)  # 328

# This deletes all but 6 headers.
ecb_pe = [
    re.sub(pat3, "", page, flags=re.M)  # this flag is for multiline regex
    for page in ecb_pages
]

Now we find each 12 at the end of a line that is preceded by a non-word
character and optionally followed by a period, and replace it with the
original capture group plus `<ENTRY_CUT>`. Then we split the entries on
`<ENTRY_CUT>`. This allows
us to split entries without losing any of the original text.
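As a small illustration of why the marker approach is lossless, here is a sketch on a two-entry toy page. The `sample_page` text is made up for the example (cleaned-up versions of entries like those in the catalogue), and it uses a plain string split rather than `re.split`.

```python
import re

# Toy page with two entries (illustrative text, not copied from the file).
sample_page = (
    "Every (Edward)-Songs and stories of a Saviour's\n"
    "love. 16mo., pp. 120, 1s. 6d. net\n"
    "SIMPKIN, Nov. 12\n"
    "Every man's own lawyer 1912. Cr. 8vo.,\n"
    "6s. 8d. net\n"
    "C. LOCKWOOD, Feb. 12.\n"
)

# Insert the marker after each line-final " 12" (or " 12."), keeping the match.
marked = re.sub(r"(\W12\.?$)", r"\1<ENTRY_CUT>", sample_page, flags=re.M)
pieces = marked.split("<ENTRY_CUT>")

for piece in pieces:
    print(repr(piece))
```

Note that the "1912." in the second entry is not treated as an entry boundary: the 12 there is preceded by a digit rather than a non-word character, and it does not end the line.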

In [None]:
# This inserts a token we can split on while preserving the month + 12
ecb_pe = [re.sub(r"(\W12\.?$)", "\\1<ENTRY_CUT>", page, flags=re.M) for page in ecb_pe]

ecb_pe = [re.split(r"<ENTRY_CUT>", page, flags=re.M) for page in ecb_pe]

for entry in ecb_pe[100]:
    print("--------------------\n", entry)

--------------------
 


Everett-Green (Iivelyn)--Two enthusiasts. Cheaper
reissue. Cr. 8vo. 7 * x5, pp. 312, 1s. 60.
R.T.S., Oct. 12
--------------------
 
Everett-Green (Evelyn)--The Wife of Arthur
Lorraine. 8vo., swd., 60. F. V. WHITE, Mar. 12
--------------------
 
Everett-Green (Evelyn) -- The Yellow pup: a
story for boys. Cr. 8vo. 7} x 41, pp. 170,
IS. 6d.
PARTRIDGE, Sep. 12
--------------------
 
Evers (B. S.) and Davies (C. E. Hughes)—The
Complete association footballer. Illus. 8vo.
9x5), pp. 242, 55. net.... METHUEN, Nov. 12
--------------------
 
Eversley (Lord)—Gladstone and Ireland : the
Irish policy of Parliament from 1850-1894.
8vo. 9 X54, pp. 400, ios. 6d. net
METHUEN, Mar. 12
--------------------
 
Every (Edward)—Songs and stories of a Saviour's
love. 16mo. 6} x 44, pp. 120, Is. 60. net
SIMPKIN, Nov. 12
--------------------
 
Every man's own lawyer 1912. Cr. 8vo.; 8 X5,
6s, 8d, net
.C. LOCKWOOD, Feb. 12
--------------------
 
Everybody's boy, Bashford (L.). 65. ....Feb.

In [None]:
type(ecb_pe)

list

In [None]:
# NOTE: "a" appends, so re-running this cell adds the entries to the file again
with open("entries_clean.txt", "a") as f:
    for page_entries in ecb_pe:
        for entry in page_entries:
            f.write(entry)

Now we flatten the nested list `ecb_pe` we made above into a single
list of entries. `pe` is an abbreviation of "page-entries", since the
original list was a nested list of pages and entries.

In [None]:
entries = [
    re.sub(r"\n", " ", entry.strip()) for entries in ecb_pe for entry in entries
]
print(len(entries))

import random

for entry in random.sample(entries, 100):
    print("---\n", entry)

19920
---
 Martin (Stuart)-Inheritance. Cr. 8vo. 73 x 44, DP 304, 6s. .... .J. OUSELEY, July-12
---
 Law relating to the relief of the poor (The), by the Editor of the Poor Law Officers' Journal. 8vo. 81 X5), pp. 272, 28. POOR LAW PUB. Co., Apr. 12
---
 Pemberton (Max)— The Gold wolf. Pop. edit. Cr. 8vo., pp. 382, is. net...... WARD, L., Mar.12
---
 In the ship of the church, Gresley (R. St. J.) is. net Mar. 12
---
 Meikle (Louis S.)-Confederation of the British West Indies versus annexation to the United States of America. 8vo. 82X51, pp. 292, 55. net Low, Feb. 12
---
 Buchanan (E. S.) ed.—The Epistles and apoca- lypse : from the Codex Harleianus. 8vo., swd., 2IS. net ...FROWDE, Nov. 12
---
 Poets, Days with the lyric: Keats, Longfellow, Burns. 35. 6d. net.. .Sep. 12
---
 Ibsen (Henrik)-Love's comedy. Trans. by C. H. Herford. Cr. 8vo. 7 X5, pp. 172, 2s, net DUCKWORTH, Nov. 12
---
 Law, International, Oppenheim (L.) 21S. net ...Aug. 12
---
 Marriage of Esther, Boothby (G.) 7d. ..Dec. 1

We are still getting a few suspicious entries that seem to run together. For example:

> STEEDMAN Statues, Buddhist, Vorobjev (N. J.) 28. 9d. Dec. II Statuettes, Italian bronze, of Renaissance Bedo, (w.) Vol. 3, £6 ros... ..... Sep.
>
> . Jan. 12 Evans (I.. Worthington)—The National Insurance Eugenics, Problem of practical, Pearson (K.) Act, 1911 Summary, with explanatory chapters Is, net... ...d pr. 12 and full index. 8vo. swd, id. ; superior edit., Eugenics, Problems in, Ios. 6d. net .. July 12 60. net Eugénie, Empress, and her circle, Barthez (E.) NATIONAL CONSERVATIVE UNION, Mar. 12

To find these, it was useful to search with a regular expression for a line-medial month abbreviation followed by 12 (for example, "Mar. 12" in the middle of an entry).

Others might be truncated at the front:

> lebration (The) th, 1912. Edit . by Prof. Knight MITH, E., May
>
> nests 1904, nes.) uly 12
>
> 155. net
>
>

For the ones missing the front part, checking whether the entry fails
to start with a capital letter could be a good test.
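Both checks can be sketched on a few toy entries. The helper names `looks_merged` and `looks_truncated` are hypothetical, the month list is an assumption and surely incomplete, and the patterns are simplified stand-ins for illustration only.

```python
import re

# Assumed month abbreviations; OCR variants will slip past this list.
MONTHS = r"(?:Jan|Feb|Mar|Apr|May|June|July|Aug|Sep|Sept|Oct|Nov|Dec)"

# A date followed by more text suggests two entries were merged.
medial_date_re = re.compile(MONTHS + r"\.?\W12\.?\s+\S")
# An entry that does not open with a capital or a quote may be truncated.
starts_badly_re = re.compile(r'^[^A-Z"]')


def looks_merged(entry):
    return bool(medial_date_re.search(entry))


def looks_truncated(entry):
    return bool(starts_badly_re.search(entry))


samples = [
    "Ibsen (Henrik)-Love's comedy. Cr. 8vo., 2s. net DUCKWORTH, Nov. 12",
    "Eugenics, Church and, Gerrard (T. J.) 6d. Aug. 12 Evans (George)-The "
    "Child of his adoption. 6s. Sep. 12",
    "net .. July 12",
]

for entry in samples:
    print(looks_merged(entry), looks_truncated(entry), "|", entry[:40])
```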

## Line-medial Dates

There are not many of these when searching only for well-formed date
abbreviations. I counted 82 at first. Many of these involved
a spurious `.` after the 12 in dates. I went back and fixed
the original split pattern to account for these. That reduced the total
to just 62, about 0.3% of all the entries we extracted. There are surely
more that this pattern did not account for, but I doubt they constitute
a significant part of the data. Of the 62 entries found here, it is
not clear whether all of them are indeed merged entries. I have decided
not to split all of them automatically for the nonce.


In [None]:
# NOTE use double braces in r"" strings that are formatted
#      when we need a specific number of repetitions, e.g.
# >>> pat = r".*l{{2,}}{}".format("bunk")
# >>> print(pat)
# .*l{2,}bunk

# there will be many instances this pattern cannot catch
# we can always add more to this list as we find them
month_abbrvs = [
    "Jan",
    "Feb",
    "Mar",
    "Apr",
    "May",
    "June",
    "July",
    "Aug",
    "Sept",
    "Oct",
    "Nov",
    "Dec",
]

linemid_re = re.compile(r".*({})\.?\W12\.?[^\.]+".format("|".join(month_abbrvs)))
linemid_entries = [entry for entry in entries if linemid_re.search(entry)]

print(len(linemid_entries))
print(len(linemid_entries) / len(entries))
for entry in linemid_entries:
    print(entry)

62
0.003112449799196787
net ..Mar. 12 Evans (Christopher J.)—Breconshire. Cr. 8vo., Eucken (Rudolf)-Main currents of modern 71 X4), pp. 184, Is. 6d. (Cambridge county thought : a study of the spiritual and intellec- geographies) .... CAMB. UNIV. PRESS, Mar. 12
Nobel lecture delivered at Stockholm Evans (Edwin)--Historical, descriptive and ana- March 27th, 1909. 8vo., pp. 44, swd. is. net lytical account of the entire works of Johannes HEFFER, Mar. 12 Brahms. Vol. I. The Vocal works. 8vo., lucken : a philosophy of life, Jones (A. J.) 6d. net 9 X54, pp. 620, 1os. ...W. REEVES, Mar. 12
Eugenics, Church and, Gerrard (T. J.) 60. Aug. 12 Evans (George)- The Child of his adoption. Eugenics, Darwinism, medical progress and, Cr. 8vo. 71 X5, pp. 444, 6s. Pearson (K.) is. net ..Sep. 12
Eugenics, Hercdity in relation to, Davenport Evans (Herbert A.)--Castles of Englaud and (C. B.) 8s. 64. net. . May 12 Wales. Illus. 8vo. 9 X 51, pp. 386, 125. 6d. net Eugenics, Intro. to, Whetham (W. C. D. and C. D

## Lines that Don't Start in Capital Letters

There are 637 entries that do not start with a capital letter or a
quotation mark, or about 3% of all the entries. If we assume that the
other halves of these entries are also in our data, and that there are
more truncated entries near the headers that we didn't catch, I think
it is reasonable to assume that about 6-10% of our entries have
truncation problems. This isn't so bad, but there might be an easy way
to improve it.


In [None]:
fronttrunc_re = re.compile(r"^[^A-ZÆ\"“]")
fronttrunc_entries = [entry for entry in entries if fronttrunc_re.search(entry)]

print(len(fronttrunc_entries))
print(len(fronttrunc_entries) / len(entries))

for entry in random.sample(fronttrunc_entries, 100):
    print(entry)

for entry in fronttrunc_entries:
    if re.match(r"^#", entry):
        print(entry)

637
0.031977911646586345
net, 35. net 38. 6d. Swinburne (James K.)—Beneath the cloak of England's respectability. Cr. 8vo. 71 X41, Pp. 190, 2s, net ..SKEFFINGTON, Jan. 12
39. 60. . * 3s. net gs. net 2 Lailey (B.)- The Law of extraordinary traffic on highways. In 8 parts. 8vo., 75. 6d. net SWEET & M., Oct. 12
+ Missionary methods, Allen (R.) 55. net.. Mar. 12
withernsea, Hist, of, Miles (G. T. J.) and Richard. son (W.) 6s. net.. .Dec. 11 Within : thoughts during convalescence, Youngo husband (Sir F.) 35. 6d. net ...Oct, 12
> net Fickle fortune, Garvice (C.) 6d........ . Jan. 12
а
portraits, 1600-1700. ..Jan. 12
on : ....Oct. 12
net .. July 12
1 By 1 Q:-See Quiller-Couch (Sir A. T.). Quain's Elements of anatomy. Vol. 2, part 1. Text-book of microscopical anatomy. Edward Albert Schäfer. Ryl, 8vo. 10 X6, pp. 754, 255. net. ..LONGMANS, Mar. 12
: . . . . .
155. net IS. 55. net • • • • • Camp-fire tales : a book of stirring episodes collected from the works of mighty hunters. Illus, by Edwin 

## Getting Clean Entries

We decided to drop the entries with line-medial dates and the entries
that do not start with a capital letter to sanitize the data.


In [None]:
clean_entries = [
    entry
    for entry in entries
    if not (linemid_re.search(entry) or fronttrunc_re.search(entry))
]

len(clean_entries)  # 19241
for entry in random.sample(clean_entries, 100):
    print(entry)

Elizabeth of Roumania (“Carmen Sylva”)- Sparks from the anvil; or, Thoughts of a Queen. 12mo. 6} x 41, pp. 108, 3s. net JARROLD, Oct. 12
Haggard (Sir H. Rider)-The Ghost Kings. 8vo., swd, 6d. ...CASSELL, Jan, 12
Heaton's Annual: the commercial handbook of Canada, 1912. Cr. 8vo., 5s. net SIMPKIN, Feb. 12
Children's Friend (The) for 1913. 4to., 25 ; gilt, 28. 6d. ; bds. is. 6d....... PARTRIDGE, Oct. 12
Kaye (Michael W.)-A Robin Hood of France. Cr. 8vo. 7* X5, pp. 334, 6s. S. PAUL, June 12
Monvel (Roger Boutet de)-Eminent English men and women in Paris, (Crowned by the French Academy in 1912.) Illus. 8vo. 9 X5, pp. 530, 12s. 6d. net ..NUTT, Dec, 12
Millar (Martha)-Useful hints household management. 12mo., pp. 128, is, net BLACKIE, Sep. 12
Schoolmasters' year-book and directory (The) 1912. Cr. 8vo., 12s. 6d. net YEAR BOOK PRESS, Feb. 12
Reichardt (F. Noel)— The Significance of ancient religions in relation to human evolution and brain development. 8vo. 8} x5), pp. 470, 12s. 6d. ..G. ALLEN,

There are occasional misses where the OCR failed to read the final
12, but they are few and far between.


# Making a Dataframe

## Getting the Main (Heavy Type) Entries

As I was perusing the catalogue, I noticed something interesting. Virtually
all of the main entries, which have the author's surname or first keyword
in heavy type, include the publisher in all capital letters immediately
before the date, while the "index entries" seem never to include the
publisher in all caps in this location. In this way, we are able to tell
the main entries from index entries without access to font style.
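Here is a minimal sketch of that test on two toy strings, one shaped like a main entry and one shaped like an index entry. The pattern is the same as `pat5` in the cell below; the helper name `is_main_entry` is hypothetical and the sample strings are tidied-up illustrations.

```python
import re

# An all-caps publisher, a comma, then a month-like token and a final 12.
main_entry_re = re.compile(
    r"[A-ZÀ-ž][A-ZÀ-ž\.\s&,'\-]+,\W\w[^A-ZÀ-ž]+(?:\.|,)?\W12\.?$"
)


def is_main_entry(entry):
    return main_entry_re.search(entry) is not None


main_like = "Ward (Mrs. Humphry)--Canadian born. 8vo. swd. 6d. ..NEWNES, Sep. 12"
index_like = "Marriage of Esther, Boothby (G.) 7d. ..Dec. 12"

print(is_main_entry(main_like))
print(is_main_entry(index_like))
```

The index-style string fails because no capital letter in it is followed by a run of capitals or punctuation ending in a comma right before the date.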

There is an appendix in the back with all the publishers. I am not
sure if it is worth digitizing all the headers for these entries (manually?).

In [None]:
pat1 = r"[A-Z]+\.?,?\W\w+\.?\W12\.?$"
pat2 = r"[A-Z]+\.?,\W.+\W12\.?$"  # 9246
pat3 = r"\W(?!\.+)([A-Z\.\s&,]+),\W\w+\.?\W12\.?$"  # 8526
pat4 = r"[A-Z]+\.?,\W\w+\.?\W12\.?$"  # 8564

# the same pattern used below to extract publishers and dates
pat5 = r"[A-ZÀ-ž][A-ZÀ-ž\.\s&,'\-]+,\W\w[^A-ZÀ-ž]+(?:\.|,)?\W12\.?$"  # 8834

main_entries = [entry for entry in clean_entries if re.search(pat5, entry)]

len(main_entries)  # now 8834, was 9246
for entry in random.sample(main_entries, 100):
    print(entry)

Bouverie-Pusey (S. E. B.)—The Past history of Ireland: a brief sketch. Ryl. 16mo., 64 x44, pp. 174, IS. 6d. net UNWIN, Feb. 12
Perry (Ralph Barton)-Present philosophical tendencies : a critical survey of naturalism, idealism, pragmatism and realism, &c. 8vo., IOS. 6d. net.. .LONGMANS, Mar. 12
Wilcox (Ella Wheeler)-Selected Poems. Printed on Japanese wood with a border in six colours and gold, corded, 8 X 6. Set of 12 poems, 3s. net SIEGLE, H., June 12
Cripps (Arthur Shcarly)--Pilgrimage of grace : verses on a mission. 12mo. 7 X1), pp. 120, 25. 6d. net ......B. H. BLACKWELL, Sep. 12
Ward (Mrs. Humphry)--Canadian born. 8vo. swd. 6d. ..NEWNES, Sep. 12
Taylor (William F.)—The Charterhouse of London ; monastery, palace, and Thomas Sutton's Foundation. Illus. 8vo. 81 X54, pp. 298, 78. 6d. net..... ..DENT, May 12
Daudet (Alphonse)—Sidonie's revenge. Cr. 8vo., 7 X 41, pp. 286, Is. 6d. net, Ithr. 25. net. (Lotus library) ..GREENING, Feb. 12
Lee (Vernon)-Vital lies : studies of some varieties of

I am estimating roughly 30 main entries per page for 330 pages giving
approximately 9900 total main entries. 8834 is a very reasonable number
of entries to start with, in my opinion.

## Splitting the Entries into Pandas Series

First, we make a Pandas series out of the list of main entries.

In [None]:
entries = pd.Series(main_entries)

Before anything else, we clean up the issues with 1 and I in
the entry strings.

In [None]:
# replace I with 1 when in close juncture with a number
entries = entries.str.replace(r"I(\d)", "1\\1", regex=True)
# r"\g<1>1" is "group 1, then a literal 1"; "\\1\1" would append chr(1) instead
entries = entries.str.replace(r"(\d)I", r"\g<1>1", regex=True)

# replace I with 1 before publishing formats
entries = entries.str.replace(r"I([tmv]o)", "1\\1", regex=True)

# replace word-separated cases of IS with 1s
entries = entries.str.replace(r"(\W)IS(\W)", r"\g<1>1s\g<2>", regex=True)

# replace word-separated cases of I/TIS with 11s
entries = entries.str.replace(r"(\W)[TI]IS(\W)", r"\g<1>11s\g<2>", regex=True)

# replace I with 1 before shillings and pence
entries = entries.str.replace(r"I(d|s)", "1\\1", regex=True)

# make sure the shilling "s" is lowercase
entries = entries.str.replace(r"(\d)S", "\\1s", regex=True)

# make floating I 1 before the above cases
entries = entries.str.replace(r"I\s+(\d+(?:d|s|[tmv]o))", "1\\1", regex=True)

# replace digits followed by 5. as digits followed by s.
entries = entries.str.replace(r"(\d+)5\.", "\\1s.", regex=True)
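One detail worth illustrating is how to write a replacement that puts a literal digit right after a group reference. This is a small sketch with plain `re.sub` on a made-up string; as far as I know, pandas passes the replacement string through to `re.sub` when `regex=True`, so the same replacement strings should apply there.

```python
import re

sample = "8vo., pp. 17I, IS. 6d. net"  # made-up OCR-style string

# r"\g<1>1" is "group 1, then a literal 1". Writing r"\11" would instead
# be read as a reference to group 11.
fixed = re.sub(r"(\d)I", r"\g<1>1", sample)
fixed = re.sub(r"(\W)IS(\W)", r"\g<1>1s\g<2>", fixed)

# Pitfall: in an ordinary (non-raw) string, "\1" is the control character
# chr(1), so this replacement yields group 1 followed by "\x01".
pitfall = re.sub(r"(\d)I", "\\1\1", "7I")

print(fixed)
print(repr(pitfall))
```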

### String-Final Information
Now we use the Pandas string extract function to put the information
we want into a dataframe. There is some complicated regex involved,
so I have documented it very carefully.

To start, we work our way up from the end of the string by splitting off
the publisher and date. The first capture group grabs everything up
until the first instance of the publisher pattern. I accomplished
this with a "lazy" /?/ operator after /.*/

Originally, I used a more complicated negative lookbehind to be sure
the last characters of front were not capital letters, period, white
space, ampersand, comma, or apostrophe. Anything following this would
be one or more characters that were capital letters. I think these conditions
are too complicated and open the door for problems. Grabbing the first
part lazily makes more sense.

In [None]:
# back_pat = r"(?P<front>.*)(?<![A-Z\.\s&,'À-ž]{2})[^A-ZÀ-ž]+"
back_pat = r"(?P<front>.*?)"

Next is the publisher capture group. We look for a capital
letter followed by one or more characters that are either capital letters,
period, white space, ampersand, comma, apostrophe, or hyphen. This is matched
greedily up until a comma, a non-word character, and the date pattern,
which is anchored to the end of the string.

In [None]:
# back_pat += r"(?P<publisher>[A-ZÀ-ž][A-Z\.\s&,'À-ž]+),\W"
back_pat += r"(?P<publisher>[A-ZÀ-ž][A-ZÀ-ž\.\s&,'\-]+),\W" # add hyphen

Finally, we finish with the date capture group. This looks for 12 at
the end of the string, optionally followed by a period that is left out
of the capture. Before the 12, there is one word character followed by
one or more characters that are not capital letters, an optional period
or comma, and one non-word character.

Extract fields for all the named captured groups in the full concatenated
pattern.

In [None]:
back_pat += r"(?P<date>\w[^A-ZÀ-ž]+(?:\.|,)?\W12)\.?$"
entry_backs = entries.str.extract(back_pat)
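To see what the three named groups capture, here is a sketch that applies the same concatenated pattern to a single cleaned-up entry with plain `re.search`. `Series.str.extract` returns the groups of the first match in each row, so the per-row result should correspond to this. The sample entry is an illustration, not a row copied from the dataframe.

```python
import re

# Same three pieces as above, concatenated into one pattern.
back_pat = (
    r"(?P<front>.*?)"
    r"(?P<publisher>[A-ZÀ-ž][A-ZÀ-ž\.\s&,'\-]+),\W"
    r"(?P<date>\w[^A-ZÀ-ž]+(?:\.|,)?\W12)\.?$"
)

sample = (
    "Boreham (F. W.)-The Luggage of life. Cr. 8vo., 3s. 6d. net "
    "C. H. KELLY, Sep. 12"
)

fields = re.search(back_pat, sample).groupdict()
for name, value in fields.items():
    print(name, "->", repr(value))
```

The initials in "(F. W.)" look like the start of a publisher, but they are never followed by a comma and a date, so the lazy `front` group keeps growing until it reaches "C. H. KELLY".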

### String-Initial Information
Now we can turn our attention to the front part of the entries. Here,
we are looking to extract a string of all of the authors and editors
("creators") and any cross-reference notes pertaining to them. We can
split and clean up this string later. The "middle" part of the entry
is everything else. This, in theory, should contain the title, publishing
format, and price information. We will sift out that information in
another stage.

First, the creators capture group is repeating, since there could be
many names separated by "and" outside of parentheses. I have wrapped
the inside of this capture group in a non-capture group followed by
/+/ to reflect this.

In [None]:
# front_pat = r"^(?P<creators>(?:(?:(?!and|see)[^\(\)—\s]+\s){1,3}"
front_pat = r"^(?P<creators>(?:"

I started a new line so this is easier to read. Now we have one to
three strings of one or more characters that are not parentheses, em
dash, or whitespace, each followed by whitespace. This is roughly equivalent
to up to three words outside of parentheses that don't involve an em dash.

In [None]:
front_pat += r"(?:[^()—\s]+\s){1,3}"

Next, we grab whatever is in the parentheses that follow, so long as it
does not begin with "The", "the", or "post free."

In [None]:
front_pat += r"\((?![Tt]he|post free)[^\)]+\)"

After that, we check for an optional space-delimited "and" or
a "(,) see [other main entry head]. " expression.

In [None]:
front_pat += r"(?:\sand\s|,?\ssee.*?\.(?![^\(]*\))\s*)?"

Close the wrapper non-capture group that can occur one or more times
and the creators capture group. I have made the creators capture group
optional since we will not be able to find it for every entry.

In [None]:
front_pat

'^(?P<creators>(?:(?:[^()—\\s]+\\s){1,3}\\((?![Tt]he|post free)[^\\)]+\\)(?:\\sand\\s|,?\\ssee.*?\\.(?![^\\(]*\\))\\s*)?'

In [None]:
front_pat += r")+)?"

Next, we see if the creator(s) were designated as editor(s). This is
also an optional capture group.

In [None]:
front_pat += r"\.?\s*(?P<is_editor>eds?\.,?)?"

Grab everything else that is not a string of intervening spaces or
dashes as "middle."

In [None]:
front_pat += r"[\-—\s]*(?![\-—\s]+)(?P<middle>.*)"
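As a check on how the assembled pattern behaves, here is a sketch that rebuilds it in one place and applies it to two toy front strings with `re.match`. The sample strings are illustrations: the first uses a plain hyphen where the catalogue would have a dash, and the second has no creator at all.

```python
import re

front_pat = (
    r"^(?P<creators>(?:"
    r"(?:[^()—\s]+\s){1,3}"                       # up to three words
    r"\((?![Tt]he|post free)[^\)]+\)"             # names in parentheses
    r"(?:\sand\s|,?\ssee.*?\.(?![^\(]*\))\s*)?"   # optional "and" / "see ..."
    r")+)?"
    r"\.?\s*(?P<is_editor>eds?\.,?)?"
    r"[\-—\s]*(?![\-—\s]+)(?P<middle>.*)"
)

with_creators = (
    "Walker (T.) and Shuker (J. W.) eds.-The Gospel according to S. Mark. "
    "Cr. 8vo., pp. 118, 1s. 6d."
)
without_creators = "Academy architecture and architectural review, 1911. 4to., 4s. net"

first = re.match(front_pat, with_creators).groupdict()
second = re.match(front_pat, without_creators).groupdict()

print(first)
print(second)
```

When no "Surname (Name)" shape is found at the start, both optional groups come back as `None` and the whole string lands in `middle`.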

Extract fields from our capture groups and put the dataframe together.

In [None]:
# what to do about "see ...," entries for related authors?
entry_fronts = entry_backs["front"].str.extract(
    front_pat
    # previously used creator patterns:
    # r"^(?P<authors>(?:[\w\-']+\s\((?![Tt]he|[Aa]nd?)[^\)]+\)(?:\sand\s)?)+)"
    # r"^(?P<authors>[\w\-']+\s\((?![Tt]he|[Aa]nd?)[^\)]+\)(?:\sand\s)?)(?P<mid>.*)"
)

df = pd.DataFrame()

df["entry"] = entries

df["front"] = entry_backs["front"]
df["publisher"] = entry_backs["publisher"]
df["date"] = entry_backs["date"]

df["creators"] = entry_fronts["creators"]
df["is_editor"] = entry_fronts["is_editor"]
df["middle"] = entry_fronts["middle"]
df = df[["entry", "front", "creators", "is_editor", "middle", "publisher", "date"]]

df[["creators", "middle", "publisher"]].head(100)

Unnamed: 0,creators,middle,publisher
0,Abercromby (Hon. John),"A Study of the bronze age pottery of Great Britain and Ireland and its associated grave-goods. Illus. 2 vols. 4to. 63s, net (Clarendon Press)",FROWDE
1,Abernathy (M.),"The Ride of the Abernathy Boys. Cr. 8vo., 3s. 6d.",HODDER & S.
2,Abhedananda (Swami),"Vedanta Philosophy : Great Saviours of the world. Vol. 1. (Krishna, Zoroaster, Lâo-Tze, and their teachings, with portraits.) Cr. 8vo., pp. 176, 4s. 6d. net",LUZAC
3,Abhedananda (Swami),"Vedanta philosophy : Human affection, and divine love. 12mo., pp. 46, 1s. 6d. net ..",LUZAC
4,Abraham (Ashley P.),"Beautiful Lakeland. Illus. 4to. 111 X 81, pp. 52, bds. 3s. 6d.",G. P. ABRAHAM
5,Abraham (George D.),"British mountain climbs. Cheaper edit. 12mo., 7 X4, pp. 464, 59. net",MILLS & B.
6,Abraham (George D.),"Swiss mountain climbs. Cheaper edit. 12mo. 7 X 4), pp. 448, 5s. net",MILLS & B.
7,Abram (A.),"English life and manners in the later middle ages. Illus. Cr. 8vo. 74 X5, pp. 368, 6s. ...",ROUTLEDGE
8,,"Academy architecture and architectural review, 1911.-Vol. 40, Founded by Alex. Koch. 4to., 4s. rod. net, swd. 4s. net....",SIMPKIN
9,,"Academy Architecture and architectural review. Vol. 41, 1912, part 1. 4to. 98 x74, pp. 168, 4s. rod. net; swd, 4s, net",SIMPKIN


Now we can clean up the creators field and split it into a list.

In [None]:
# substitute "Surname (Name1) and Surname (Name 2)" for
# "Surname (Name1 and 2)"
df["creators"] = df["creators"].str.replace(
    r"([^()]+)\(([^)]+) and ([^)]+)\)", "\\1(\\2) and \\1(\\3)", regex=True
)

# remove all cross-reference "see [other header]." expressions
# takes everything from "see" to the first period not in parens
df["creators"] = df["creators"].str.replace(
    r"see.*\.(?![^(]*\))\s*", " and ", regex=True
)

# get rid of any trailing ands
df["creators"] = df["creators"].str.replace(r"\s+and\s+$", "", regex=True)

# split each entry into a list of authors by " and " not in parens
df["creators"] = df["creators"].str.split(r"\s+(?:and)(?![^(]*\))\s+")

df["creators"].head(100)


for creator_list in df["creators"].dropna().tolist():
    if len(creator_list) > 1:
        print(creator_list)

['Adam (J. A. Stanley)', 'White (Bernard C.)']
['Adami (J. G.)', 'McCrae (J.)']
['Adams (Frank)', 'Adams (George Burton)']
['Agriculture', 'Fisheries (Board of)']
['Agriculture', 'Fisheries (Board of)']
['Agriculture', 'Fisheries (Board of)']
['Alder (J.)', 'Hancock (A.)']
['Alexander (T.)', 'Thomson (A. W.)']
['Allen (I. C.)', 'Jacobs (W. A.)']
['Allen-Brown (A.)', 'Allen-Brown (D.)']
['Allhusen (Beatrice)', 'Fox-Reeve (Iris)']
['Amelung (W.)', 'Holtzinger (H.)']
['Andersen (Knud)', 'Anderson (A. J.)']
['Andom (R.)', 'Hodder (Reginald)']
['Annett (E. A,)', 'Annett (E. M.)']
['Archbutt (L.)', 'Deeley (R. M.)']
['Armstrong (H. G.)', 'Brickdale (J. M. P.)']
['Arner (G. I..)', 'Arnim (Baroness von)']
['Amold (E. V.)', 'Pearce (J. W. E.)']
['Ashford (F.)', 'Ashley (C. G.)', 'Hayward (C. B.)']
['Askew (Alice)', 'Askew (Claude)']
['Askew (Alice)', 'Askew (Claude)']
['Askew (Alice)', 'Askew (Claude)']
['Askew (Alice)', 'Askew (Claude)']
['Askew (Alice)', 'Askew (Claude)']
['Askew (Alice)', 'A
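The splitting logic can be tried on single strings with plain `re`. This sketch uses a hypothetical helper, `split_creators`, and leaves out the "see ..." removal step for brevity, so it only covers the shared-surname expansion and the split on "and" outside parentheses.

```python
import re


def split_creators(creators):
    # "Surname (A and B)" -> "Surname (A) and Surname (B)"
    s = re.sub(r"([^()]+)\(([^)]+) and ([^)]+)\)", r"\1(\2) and \1(\3)", creators)
    # drop a trailing "and"
    s = re.sub(r"\s+and\s+$", "", s)
    # split on "and" only when it is not inside parentheses
    return re.split(r"\s+(?:and)(?![^(]*\))\s+", s)


print(split_creators("Askew (Alice and Claude)"))
print(split_creators("Adam (J. A. Stanley) and White (Bernard C.)"))
print(split_creators("Le Queux (William)"))
```

The negative lookahead `(?![^(]*\))` is what tells the two cases apart: after an "and" inside parentheses, a closing parenthesis is reached before any opening one.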

Now, take the head (first element) of each creator list to use for our
"last_name" and "first_name" columns.

In [None]:
head_names = df["creators"].apply(lambda x: x[0] if isinstance(x, list) else x)
head_names = head_names.str.extract(r"^(?P<last_name>[^()]+)\s\((?P<first_name>[^)]+)\)$")
df["last_name"] = head_names["last_name"]
df["first_name"] = head_names["first_name"]
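For a single string, the same pattern behaves as in the sketch below. The helper name `split_head_name` is hypothetical. A head that does not have the "Surname (Name)" shape simply yields no names.

```python
import re

head_name_re = re.compile(r"^(?P<last_name>[^()]+)\s\((?P<first_name>[^)]+)\)$")


def split_head_name(head):
    """Return (last_name, first_name), or (None, None) if the shape is wrong."""
    m = head_name_re.match(head)
    if m is None:
        return None, None
    return m.group("last_name"), m.group("first_name")


print(split_head_name("Le Queux (William)"))
print(split_head_name("Fisheries (Board of)"))
print(split_head_name("Agriculture"))
```

The second call shows the kind of false positive to expect: a corporate body in the "X (Y)" shape is split as if it were a personal name.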

### String-Medial Information

The title runs from the first capital letter or digit of the middle
string to the first period that is not part of an obvious abbreviation,
or to the beginning of expressions like Cr., Vo., No., English publishing
formats, price in shillings and/or pence, Illus., or Ryl.

We also do not match titles that begin with Cr., Vo., No., English publishing
formats, price in shillings and/or pence, Illus., or Ryl.

In [None]:
df["title"] = df["middle"].str.extract(
    r"(?!^(?:No\.|Cr\.|Vo\.|fo\.|\d+\s?\}?\w|Illus\.|Ryl\.).*)"
    + r"^[^\dA-ZÀ-ž]*([\dA-ZÀ-ž].+?)"
    + r"(?:(?<!\W[A-ZÀ-ž]|No|id|pp)\.|"
    + r"[,.]?\W(?=No\.|Cr\.|Vo\.|fo\.|\d+\s?\}?\w|Illus\.|Ryl\.))"
)


# (?![^(]*\)), got rid of in paren check because of unbalanced parens

Extract English publishing formats.

In [None]:
df["format"] = df["middle"].str.extract(
    r"\W(fo\.|\d+[tvm]o[,.]?)\W"
)
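A quick sketch of what this pattern picks up, using plain `re.search` on a few toy middle strings. The helper name `find_format` is hypothetical. Note that the pattern requires a non-word character on both sides, and that an OCR slip such as "Evo." for "8vo." is missed.

```python
import re

format_re = re.compile(r"\W(fo\.|\d+[tvm]o[,.]?)\W")


def find_format(middle):
    m = format_re.search(middle)
    return m.group(1) if m else None


print(find_format("Short methods and by-ways in arithmetic. Cr. 8vo., pp. 152, 1s."))
print(find_format("The Thief of virtue. 12mo. 7d. net"))
print(find_format("Beautiful Lakeland. Illus. 4to. 11 X 8, pp. 52, bds. 3s. 6d."))
print(find_format("The Luggage of life. Cr. Evo., 3s. 6d. net"))
```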

Extract price strings, clean, and convert.

Look for combinations of digits followed by d., s., d,,
s,, etc. Then look for "net" after this. Make sure this is the last
occurrence of the price capture group to get the final price in cases where
there are multiple prices. Check that the string ends in parenthetical
notes, spaces, or trailing periods that were leaders before the publisher
and date in the original text.

In [None]:
price_df = df["middle"].str.extract(
    r"(?P<price>\d+s\.?,?\s*\d+d\.?,?|\d+s\.?,?|\d+d\.?,?)"
    + r"\s*(?P<is_net>net)?"
    + r"(?!.*\1)(?=(?:\s*\([^\)]+\))*[\s.]*$)"
)

df["price_dirty"] = price_df["price"]
df["is_net"] = price_df["is_net"]

df["price"] = df["price_dirty"].str.replace(r"([ds]),", "\\1.", regex=True)
df["price"] = df["price"].str.replace(r"s\.?\s+", "s. ", regex=True)
# str.strip takes a set of characters, not a regex, so ",\s" would also strip "s"
df["price"] = df["price"].str.strip(", \t\n")
df["shillings"] = df["price"].str.extract(r"(\d+)s").fillna(0).astype(int)
df["pence"] = df["price"].str.extract(r"(\d+)d").fillna(0).astype(int)
df["price_in_pounds"] = df["pence"] / 240 + df["shillings"] / 20
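The conversion relies on pre-decimal currency: 20 shillings to the pound and 12 pence to the shilling, so 240 pence to the pound. Here is the same arithmetic as a small standalone function; the name `price_in_pounds` for a function is an assumption for illustration, and it expects an already cleaned price string.

```python
import re


def price_in_pounds(price):
    """Convert a cleaned price string such as '3s. 6d.' to decimal pounds."""
    shillings = re.search(r"(\d+)s", price)
    pence = re.search(r"(\d+)d", price)
    s = int(shillings.group(1)) if shillings else 0
    d = int(pence.group(1)) if pence else 0
    return s / 20 + d / 240


for price in ["3s. 6d.", "6s.", "7d."]:
    print(price, "->", round(price_in_pounds(price), 6))
```

A missing component counts as zero, which mirrors the `fillna(0)` above. It also means an unparsed price is indistinguishable from a price of zero.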

## Writing to CSV

Limit to the columns we want.

In [None]:
df["original_entry"] = pd.Series(main_entries)
df["author_name"] = df["first_name"].str.cat(df["last_name"], sep=" ")
df = df[
    [
        "entry",
        "last_name",
        "first_name",
        "title",
        "publisher",
        "price",
        "price_in_pounds",
        "format",
        "original_entry",
        "author_name",
        # new fields
        "creators",
        "is_editor",
        "date",
        "is_net"
    ]
]

Export for review

In [None]:
from datetime import datetime

today = datetime.today().strftime("%Y%m%d")
out = "ecb_1912_{}.csv".format(today)

df.to_csv("ecb_1912.csv")
#df.to_csv(out)

In [None]:
df.sample(100)

Unnamed: 0,entry,last_name,first_name,title,publisher,price,price_in_pounds,format,original_entry,author_name,creators,is_editor,date,is_net
4769,"Le Queux (William)- The Mystery of nine. Cr. 8vo., 7* *4*, pp. 320, 6s... ...NASH, Jan. 12",Le Queux,William,The Mystery of nine,NASH,6s.,0.3,8vo.,"Le Queux (William)- The Mystery of nine. Cr. 8vo., 7* *4*, pp. 320, 6s... ...NASH, Jan. 12",William Le Queux,[Le Queux (William)],,Jan. 12,
735,"Bennett (Arnold)--Leonora : a novel. New impres. Cr. 8vo. 7X 4), pp. 368, 2s, net CHATTO, Mar. 12",Bennett,Arnold,Leonora : a novel,CHATTO,2s.,0.1,8vo.,"Bennett (Arnold)--Leonora : a novel. New impres. Cr. 8vo. 7X 4), pp. 368, 2s, net CHATTO, Mar. 12",Arnold Bennett,[Bennett (Arnold)],,Mar. 12,net
761,"Benson (E. F.)—The Luck of the Vails. Cheaper re-issue. Cr. 8vo. 71 X4), pp. 332, 2s. net HEINEMANN, Aug. 12",Benson,E. F.,The Luck of the Vails,HEINEMANN,2s.,0.1,8vo.,"Benson (E. F.)—The Luck of the Vails. Cheaper re-issue. Cr. 8vo. 71 X4), pp. 332, 25. net HEINEMANN, Aug. 12",E. F. Benson,[Benson (E. F.)],,Aug. 12,net
7694,"HODGES, FIGGIS, Jan. 12",,,,"HODGES, FIGGIS",,0.0,,"HODGES, FIGGIS, Jan. 12",,,,Jan. 12,
2314,"Dickie (Hugh W.)--Short methods and by-ways in arithmetic. Cr. 8vo., pp. 152, is. CHAMBERS, June 12",Dickie,Hugh W.,Short methods and by-ways in arithmetic,CHAMBERS,,0.0,8vo.,"Dickie (Hugh W.)--Short methods and by-ways in arithmetic. Cr. 8vo., pp. 152, is. CHAMBERS, June 12",Hugh W. Dickie,[Dickie (Hugh W.)],,June 12,
8196,"Walker (T.) and Shuker (J. W.) eds.—The Gospel according to S. Mark. Cr. 8vo., pp. 118, is. 60. CLIVE, May 12",Walker,T.,The Gospel according to S. Mark,CLIVE,,0.0,8vo.,"Walker (T.) and Shuker (J. W.) eds.—The Gospel according to S. Mark. Cr. 8vo., pp. 118, is. 60. CLIVE, May 12",T. Walker,"[Walker (T.), Shuker (J. W.)]",eds.,May 12,
5637,"Molière (J. B. P.)--Les Femmes savants. Trans. by C. H. Page. Cr. 8vo., 3s. 6d. net PUTNAM, May 12",Molière,J. B. P.,Les Femmes savants,PUTNAM,3s. 6d.,0.175,8vo.,"Molière (J. B. P.)--Les Femmes savants. Trans. by C. H. Page. Cr. 8vo., 35. 6d. net PUTNAM, May 12",J. B. P. Molière,[Molière (J. B. P.)],,May 12,net
3675,"Haverfield (E. L.)-The Ogilvies' adventures Cr. 8vo. 71 x 5, pp. 320, 39. 60. FROWDE, Oct. 12",Haverfield,E. L.,The Ogilvies' adventures,FROWDE,,0.0,8vo.,"Haverfield (E. L.)-The Ogilvies' adventures Cr. 8vo. 71 x 5, pp. 320, 39. 60. FROWDE, Oct. 12",E. L. Haverfield,[Haverfield (E. L.)],,Oct. 12,
1023,"Boreham (F. W.)-The Luggage of life; or, a Fireside philosophy. Cr. Evo., 3s. 6d. net C. H. KELLY, Sep. 12",Boreham,F. W.,"The Luggage of life; or, a Fireside philosophy",C. H. KELLY,3s. 6d.,0.175,,"Boreham (F. W.)-The Luggage of life; or, a Fireside philosophy. Cr. Evo., 35. 6d. net C. H. KELLY, Sep. 12",F. W. Boreham,[Boreham (F. W.)],,Sep. 12,net
6381,"Philpotts (Eden)-The Thief of virtue. 12mo. 7d. net .HUTCHINSON, Jan, 12",Philpotts,Eden,The Thief of virtue,HUTCHINSON,7d.,0.029167,12mo.,"Philpotts (Eden)-The Thief of virtue. 12mo. 7d. net .HUTCHINSON, Jan, 12",Eden Philpotts,[Philpotts (Eden)],,"Jan, 12",net


# Data Mining



## Diagnosing Problems

How many entries have information for last name?

Previously, we distinguished between entries where we were uncertain whether
there were author or editor names and entries that we were certain contained
no names. Now we simply assume that every entry with no name captured
in the `creators` column of the dataframe has no name.

In [None]:
df['last_name'].value_counts()[:10]

Shakespeare    41
Smith          38
Dickens        33
Brown          31
Wilcox         30
Hall           28
Taylor         27
Watson         27
Le Queux       27
Wilson         26
Name: last_name, dtype: int64

There are 1597 entries with no author/editor name captured. Looking at the dataframe of unknown-name entries, most of them are entries where no name was given at all. A very small minority (2 or 3 entries?) seem to be cases where the name was not captured properly, and I still need to dig into the data to figure out how to refine the pattern. I suspect the problem is more serious in the other direction: the current pattern seems to produce more false positives, capturing strings that are not names, than missed names.
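To get a rough handle on the false positives, one option is to flag captured last names that do not look like surnames. The sketch below is only a coarse heuristic and not part of the parsing pipeline: the function name `is_suspect_name`, the character set, and the word-count threshold are all assumptions chosen for illustration. It will also flag some legitimate names (for example a surname written with "St. "), so the results need to be read by eye.

```python
import re

# Hypothetical heuristic: digits, stray punctuation, or a sentence-like
# "word. word" sequence inside a captured last name suggest a bad capture.
SUSPECT = re.compile(r"[0-9!?:;]|\.\s")


def is_suspect_name(name):
    """Return True if a captured last name looks unlike a surname."""
    if not isinstance(name, str):
        # missing values (NaN) are not captures at all
        return False
    # more than three words is unusual even for compound surnames
    return bool(SUSPECT.search(name)) or len(name.split()) > 3
```

Applied to the dataframe, this could look like `df[df["last_name"].map(is_suspect_name)]`, which would pick up captures such as `British Museum. Squire` and `Filipp!` while leaving `Le Queux` alone.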

In [None]:
df["last_name"].isna().sum()

1597

In [None]:
# df[df["last_name"] == "Unknown"].sample(100)
df[df["last_name"].isna()].sample(100)

Unnamed: 0,entry,last_name,first_name,title,publisher,price,price_in_pounds,format,original_entry,author_name,creators,is_editor,date,is_net
1741,"Christian World pulpit (The). Vol. 80, July-Dec., 1911. 4to., 4s. od.........J. CLARKE, Jan. 12",,,Christian World pulpit (The),J. CLARKE,,0.0,4to.,"Christian World pulpit (The). Vol. 80, July-Dec., 1911. 4to., 45. od.........J. CLARKE, Jan. 12",,,,Jan. 12,
3118,"Gerard Dorothea) -A Glorious lic. Cr. 8vo. 74 X5, pp. 320, 6s.... ......LONG, Jan. 12",,,Gerard Dorothea) -A Glorious lic,LONG,6s.,0.3,8vo.,"Gerard Dorothea) -A Glorious lic. Cr. 8vo. 74 X5, pp. 320, 65.... ......LONG, Jan. 12",,,,Jan. 12,
6323,"Peronne Marie : a spiritual daughter of Saint Francis de Sales, 1586-1637. By a Religious of the Visitation. Cr. 8vo., 3s. 6d. net BURNS & O., Jan. 12",,,Peronne Marie : a spiritual daughter of Saint Francis de Sales,BURNS & O.,3s. 6d.,0.175,8vo.,"Peronne Marie : a spiritual daughter of Saint Francis de Sales, 1586-1637. By a Religious of the Visitation. Cr. 8vo., 35. 6d. net BURNS & O., Jan. 12",,,,Jan. 12,net
2802,""" Financial Times” Oil handbook (The)-Narrow Cr. 8vo., 28. net..... OFFICE, Apr. 12",,,Financial Times” Oil handbook (The)-Narrow,OFFICE,,0.0,8vo.,""" Financial Times” Oil handbook (The)-Narrow Cr. 8vo., 28. net..... OFFICE, Apr. 12",,,,Apr. 12,
7344,"Snarer (The). By Brown Linnet. Cr. 8vo. 71 x 41, pp. 256, 3s. 6d. net ........MURRAY, Oct. 12",,,Snarer (The),MURRAY,3s. 6d.,0.175,8vo.,"Snarer (The). By Brown Linnet. Cr. 8vo. 71 x 41, pp. 256, 35. 6d. net ........MURRAY, Oct. 12",,,,Oct. 12,net
8010,"Truth about man, (The). By a Spinster who knows him. Cr. 8vo., 71 X 48, pp. 256, swd. is. net .HUTCHINSON, Aug. 12",,,"Truth about man, (The)",HUTCHINSON,,0.0,8vo.,"Truth about man, (The). By a Spinster who knows him. Cr. 8vo., 71 X 48, pp. 256, swd. is. net .HUTCHINSON, Aug. 12",,,,Aug. 12,
887,"Births, marriages and deaths—Report of Registrar General for Ireland, 1911, 3s. (post free) WYMAN, Aug. 12",,,"Births, marriages and deaths—Report of Registrar General for Ireland",WYMAN,3s.,0.15,,"Births, marriages and deaths—Report of Registrar General for Ireland, 1911, 35. (post free) WYMAN, Aug. 12",,,,Aug. 12,
35,"Acts—Egremont Urban District Water Act, 1912, 2s. id. ; Clyde Lighthouses Order Act, is. 4d. ; Dunstable Gas Act, 1912, 2s. id. (post free). WYMAN, Sep. 12",,,Acts—Egremont Urban District Water Act,WYMAN,,0.0,,"Acts—Egremont Urban District Water Act, 1912, 25. id. ; Clyde Lighthouses Order Act, is. 4d. ; Dunstable Gas Act, 1912, 25. id. (post free). WYMAN, Sep. 12",,,,Sep. 12,
6490,"Post Office London directory, 1913. 4to., 32s. KELLY, Dec. 12",,,Post Office London directory,KELLY,32s.,1.6,4to.,"Post Office London directory, 1913. 4to., 325. KELLY, Dec. 12",,,,Dec. 12,
1885,"Colonial Office list (The) 1912. 8vo., 15s. net WATERLOW, Apr. 12",,,Colonial Office list (The),WATERLOW,15s.,0.75,8vo.,"Colonial Office list (The) 1912. 8vo., 155. net WATERLOW, Apr. 12",,,,Apr. 12,net


In [None]:
# df_unknown = df[df["last_name"] == "Unknown"]
df_unknown = df[df["last_name"].isna()]

# The name pattern does not capture:
#   - entries where no author is listed
#   - entries where the last name is not followed by a first name in parens (as in Catullus)
#   - authors listed with initials
# A good number of these entries look like fragments split off from other entries.

df_unknown.sample(200)

Unnamed: 0,entry,last_name,first_name,title,publisher,price,price_in_pounds,format,original_entry,author_name,creators,is_editor,date,is_net
8140,"Virgil-Georgics. In English verse by Arthur S. Way. Imp. 16mo. 7x51, pp. 122, 28. 6d. net MACMILLAN, Dec. 12",,,Virgil-Georgics,MACMILLAN,6d.,0.025,16mo.,"Virgil-Georgics. In English verse by Arthur S. Way. Imp. 16mo. 7x51, pp. 122, 28. 6d. net MACMILLAN, Dec. 12",,,,Dec. 12,net
6507,"Prisons—Report for year ending March, 1912. Part 1 (post free), 10 d..... WYMAN, Sep. 12",,,Prisons—Report for year ending March,WYMAN,,0.0,,"Prisons—Report for year ending March, 1912. Part 1 (post free), 10 d..... WYMAN, Sep. 12",,,,Sep. 12,
7225,"Siege of Delhi (The): a record by a survivor who filled an important post at the glorious siege. Ryl. 8vo., swd., s... SIMPKIN, Sep: 12",,,Siege of Delhi (The): a record by a survivor who filled an important post at the glorious siege,SIMPKIN,,0.0,8vo.,"Siege of Delhi (The): a record by a survivor who filled an important post at the glorious siege. Ryl. 8vo., swd., IS... SIMPKIN, Sep: 12",,,,Sep: 12,
8535,"Wild-Fowl. By various authors.. Cheaper re- issue. Cr. 8vo. 7* x5, pp. 288, 2s. 6d. net (Fur, feather and fin ser.) LONGMANS, Jan 12.",,,Wild-Fowl,LONGMANS,2s. 6d.,0.125,8vo.,"Wild-Fowl. By various authors.. Cheaper re- issue. Cr. 8vo. 7* x5, pp. 288, 25. 6d. net (Fur, feather and fin ser.) LONGMANS, Jan 12.",,,,Jan 12,net
2565,"Education by life: a discussion of the problem of the school education of younger children. By various writers. Cr. 8vo. 8 X 5, pp. 220, 3s. 6d. net .G. PHILIP, Apr. 12",,,Education by life: a discussion of the problem of the school education of younger children,G. PHILIP,3s. 6d.,0.175,8vo.,"Education by life: a discussion of the problem of the school education of younger children. By various writers. Cr. 8vo. 8 X 5, pp. 220, 3s. 6d. net .G. PHILIP, Apr. 12",,,,Apr. 12,net
5754,"Motor manual (The). 15th edit., rev. Cr. 8vo. 1s. 6d. net.. . TEMPLE PRESS, Sep. 12",,,Motor manual (The),TEMPLE PRESS,1s. 6d.,0.075,8vo.,"Motor manual (The). 15th edit., rev. Cr. 8vo. Is. 6d. net.. . TEMPLE PRESS, Sep. 12",,,,Sep. 12,net
1466,"Cambridge University-Higher local examination class list and supplementary tables, December 1911. Demy 8vo., swd, 6d. CAMB. UNIV. PRESS, Jan. 12",,,"Cambridge University-Higher local examination class list and supplementary tables, December",CAMB. UNIV. PRESS,6d.,0.025,8vo.,"Cambridge University-Higher local examination class list and supplementary tables, December 1911. Demy 8vo., swd, 6d. CAMB. UNIV. PRESS, Jan. 12",,,,Jan. 12,
5847,"Navigation and Shipping-Annual Statement for 1911, 3s. 5d. (post free). .. WYMAN, Sep. 12",,,Navigation and Shipping-Annual Statement for,WYMAN,3s. 5d.,0.170833,,"Navigation and Shipping-Annual Statement for 1911, 35. 5d. (post free). .. WYMAN, Sep. 12",,,,Sep. 12,
81,"Admiralty—Distance tables. North and West coasts of Europe from White Sea to the Strait of Gibraltar, &c., is, 6d....... POTTER, J un. 12",,,Admiralty—Distance tables,POTTER,6d.,0.025,,"Admiralty—Distance tables. North and West coasts of Europe from White Sea to the Strait of Gibraltar, &c., is, 6d....... POTTER, J un. 12",,,,J un. 12,
6409,"Pitman's Exercises in business shorthand. 12mo. swd. is, net .. PITMAN, Nov. 12",,,Pitman's Exercises in business shorthand,PITMAN,,0.0,12mo.,"Pitman's Exercises in business shorthand. 12mo. swd. is, net .. PITMAN, Nov. 12",,,,Nov. 12,


In [None]:
df_unknown.to_csv("ecb_1912_unknowns.csv")
# unknown-name entries that do have a publisher: 1595 of the 1597
# (every entry without a publisher also failed to split off a name)
df_unknown["publisher"].dropna().size  # 1595 (not displayed; only the last expression is shown)
# share of all entries with a publisher that have no captured name
df_unknown["publisher"].dropna().size / df["publisher"].dropna().size  # ~18%

0.18059329710144928

As noted above, most of these are not problematic. A share of about 18% of entries listed with no author or editor seems right to me.

In [None]:
df["publisher"].value_counts()[:50]

MACMILLAN            383
FROWDE               330
WYMAN                302
HODDER & S.          294
LONGMANS             255
CAMB. UNIV. PRESS    182
METHUEN              176
CONSTABLE            170
SIMPKIN              168
NELSON               147
DENT                 144
WARD, L.             124
CASSELL              122
PITMAN               121
UNWIN                114
CHAPMAN & H.         109
HEINEMANN            106
S. PAUL              105
MURRAY               101
WESLEY                97
HUTCHINSON            94
BLACKIE               93
MILLS & B.            92
JACK                  83
E. ARNOLD             83
BLACK                 79
LONG                  78
DUCKWORTH             73
PUTNAM                70
SMITH, E.             67
LANE                  67
EVERETT               66
ROUTLEDGE             65
NEWNES                64
NASH                  63
BELL                  62
SIEGLE, H.            60
CLIVE                 59
S.P.C.K.              56
K. PAUL               55


## Simple Statistics

Who are the top authors?

In [None]:
df['last_name'].value_counts()[:40]

Shakespeare    41
Smith          38
Dickens        33
Brown          31
Wilcox         30
Hall           28
Taylor         27
Watson         27
Le Queux       27
Wilson         26
Hugo           25
Scott          25
Benson         25
Oppenheim      24
Hardy          23
Strang         22
Moore          22
Bennett        22
Russell        20
Green          19
Garvice        19
Graham         18
Marshall       18
Wood           18
Byron          18
Williams       18
Johnson        18
Wright         17
Fraser         17
Thomas         17
Williamson     17
Ward           17
Jackson        16
Askew          16
Miller         16
Mason          15
Jones          15
Robertson      15
Doyle          14
Young          14
Name: last_name, dtype: int64

In [None]:
df['author_name'].value_counts()[:40]

William Shakespeare           39
Charles Dickens               31
Ella Wheeler Wilcox           28
William Le Queux              25
Victor Hugo                   25
E. Phillips Oppenheim         20
Thomas Hardy                  18
Charles Garvice               17
Arnold Bennett                17
May Byron                     17
Alice Askew                   15
Herbert Strang                15
Sir Arthur Conan Doyle        14
Fergus Hume                   13
Charles Kingsley              12
E. F. Benson                  12
L. T. Meade                   11
Alexandre Dumas               11
Max Pemberton                 11
Eden Phillpotts               11
Mrs. Moles worth              10
W. H. G. Kingston             10
Maurice Hewlett               10
Annie S. Swan                 10
E. Temple Thurston            10
Florence Warden                9
Hilaire Belloc                 9
Count L. N. Tolstoy            9
Guy Boothby                    9
John Oxenham                   8
Endowed Ch

In [None]:
df[df['last_name'] == "Austen"]

Unnamed: 0,entry,last_name,first_name,title,publisher,price,price_in_pounds,format,original_entry,author_name,creators,is_editor,date,is_net
414,"Austen (Jane)—Mansfield Park. 12mo., pp. 478, is. net, Ith., 2s. net (Illus pocket classics) W. COLLINS, Mar. 12",Austen,Jane,Mansfield Park,W. COLLINS,2s.,0.1,12mo.,"Austen (Jane)—Mansfield Park. 12mo., pp. 478, is. net, Ith., 25. net (Illus pocket classics) W. COLLINS, Mar. 12",Jane Austen,[Austen (Jane)],,Mar. 12,net
415,"Austen (Jane)—Pride and prejudice. Edit., with intro. and notes, by K. M. Metcalfe. Cr. 8vo. Pp. 436, 2s. 6d. (Clarendon Press) FROWDF, May 12",Austen,Jane,Pride and prejudice,FROWDF,2s. 6d.,0.125,8vo.,"Austen (Jane)—Pride and prejudice. Edit., with intro. and notes, by K. M. Metcalfe. Cr. 8vo. Pp. 436, 25. 6d. (Clarendon Press) FROWDF, May 12",Jane Austen,[Austen (Jane)],,May 12,


In [None]:
df[df['last_name'] == "Thackeray"]

Unnamed: 0,entry,last_name,first_name,title,publisher,price,price_in_pounds,format,original_entry,author_name,creators,is_editor,date,is_net
7781,"Thackeray (William Makepeace)—The English humorists; and, The Four Georges. 12mo. pp. 444, 1s. net, Ithr. 2s. net. (Everyman's library) DENT, Sep. 12",Thackeray,William Makepeace,"The English humorists; and, The Four Georges",DENT,,0.0,12mo.,"Thackeray (William Makepeace)—The English humorists; and, The Four Georges. I2mo. pp. 444, Is. net, Ithr. 25. net. (Everyman's library) DENT, Sep. 12",William Makepeace Thackeray,[Thackeray (William Makepeace)],,Sep. 12,
7783,"Thackeray (William Makepeace)—Irish sketch book. 12mo, pp. 424, 1s. net, Ithr. 28. net (Illus. pocket classics) ...W. COLLINS, Sep. 12",Thackeray,William Makepeace,Irish sketch book,W. COLLINS,,0.0,"12mo,","Thackeray (William Makepeace)—Irish sketch book. 12mo, pp. 424, Is. net, Ithr. 28. net (Illus. pocket classics) ...W. COLLINS, Sep. 12",William Makepeace Thackeray,[Thackeray (William Makepeace)],,Sep. 12,
7784,"Thackeray (William Makepeace)—The Paris sketch-book. 12mo., pp. 380, is, net, ithr, 28. net (Illus. pocket classics) ..W. COLLINS, May 12",Thackeray,William Makepeace,The Paris sketch-book,W. COLLINS,,0.0,12mo.,"Thackeray (William Makepeace)—The Paris sketch-book. 12mo., pp. 380, is, net, ithr, 28. net (Illus. pocket classics) ..W. COLLINS, May 12",William Makepeace Thackeray,[Thackeray (William Makepeace)],,May 12,
7785,"Thackeray (William Makepeace)-Works. Oxford edit. Arranged and edit. by George Saintsbury. 20 vols. 12mo. 6X4}, ea. is. 6d. net, Ithr. 29. 6d. net FROWDE, Sep. 12",Thackeray,William Makepeace,Works,FROWDE,6d.,0.025,12mo.,"Thackeray (William Makepeace)-Works. Oxford edit. Arranged and edit. by George Saintsbury. 20 vols. 12mo. 6X4}, ea. is. 6d. net, Ithr. 29. 6d. net FROWDE, Sep. 12",William Makepeace Thackeray,[Thackeray (William Makepeace)],,Sep. 12,net


In [None]:
df[df['last_name'] == "Wells"]

Unnamed: 0,entry,last_name,first_name,title,publisher,price,price_in_pounds,format,original_entry,author_name,creators,is_editor,date,is_net
8377,"Wells (D. D.)—Her Ladyship's elephant. 12mo, 7d. net .HEINEMANN, July 12",Wells,D. D.,Her Ladyship's elephant,HEINEMANN,7d.,0.029167,"12mo,","Wells (D. D.)—Her Ladyship's elephant. 12mo, 7d. net .HEINEMANN, July 12",D. D. Wells,[Wells (D. D.)],,July 12,net
8378,"Wells (H. G.)— The History of Mr. Polly. 12mo., 7d, net .NELSON, June 12",Wells,H. G.,The History of Mr,NELSON,7d.,0.029167,12mo.,"Wells (H. G.)— The History of Mr. Polly. 12mo., 7d, net .NELSON, June 12",H. G. Wells,[Wells (H. G.)],,June 12,net
8379,"Wells (H. G.)-- In the days of the comet. Re-issue, Cr. 8vo. 73 X5, pp. 314, 3s. 60. MACMILLAN, Mar. 12",Wells,H. G.,In the days of the comet,MACMILLAN,,0.0,8vo.,"Wells (H. G.)-- In the days of the comet. Re-issue, Cr. 8vo. 73 X5, pp. 314, 35. 60. MACMILLAN, Mar. 12",H. G. Wells,[Wells (H. G.)],,Mar. 12,
8380,"Wells (H. G.)—Marriage. Cr. 8vo. 74 x5, pp. 560, 6s. .MACMILLAN, Sep. 12",Wells,H. G.,Marriage,MACMILLAN,6s.,0.3,8vo.,"Wells (H. G.)—Marriage. Cr. 8vo. 74 x5, pp. 560, 6s. .MACMILLAN, Sep. 12",H. G. Wells,[Wells (H. G.)],,Sep. 12,
8381,"Wells (H. G.)—The Stolen bacillus. 12mo., 7d. net MACMILLAN, Apr. 12",Wells,H. G.,The Stolen bacillus,MACMILLAN,7d.,0.029167,12mo.,"Wells (H. G.)—The Stolen bacillus. 12mo., 7d. net MACMILLAN, Apr. 12",H. G. Wells,[Wells (H. G.)],,Apr. 12,net
8382,"Wells (H. G.)-Tono-Bungay. Re-issue. Cr. 8vo. 7 * x5, pp. 500, 3s. 6d. MACMILLAN, Mar. 12",Wells,H. G.,Tono-Bungay,MACMILLAN,3s. 6d.,0.175,8vo.,"Wells (H. G.)-Tono-Bungay. Re-issue. Cr. 8vo. 7 * x5, pp. 500, 35. 6d. MACMILLAN, Mar. 12",H. G. Wells,[Wells (H. G.)],,Mar. 12,
8383,"Wells (H. G.)—The War of the worlds. 12mo. 7d. net.... .HEINEMANN, June 12",Wells,H. G.,The War of the worlds,HEINEMANN,7d.,0.029167,12mo.,"Wells (H. G.)—The War of the worlds. 12mo. 7d. net.... .HEINEMANN, June 12",H. G. Wells,[Wells (H. G.)],,June 12,net
8384,"Wells (H. Wharton)-A Handbook of music and musicians. 12mo., pp. 302, 1s. net (Encyclo- pædic library) . NELSON, Aug. 12",Wells,H. Wharton,A Handbook of music and musicians,NELSON,1s.,0.05,12mo.,"Wells (H. Wharton)-A Handbook of music and musicians. 12mo., pp. 302, Is. net (Encyclo- pædic library) . NELSON, Aug. 12",H. Wharton Wells,[Wells (H. Wharton)],,Aug. 12,net
8385,"Wells (J.) see How (W. W.) and Wells. Wells (W. Henry)—The A B C of book-keeping. 16mo., pp. 88, s..... ...DRANE, May 12",Wells,J.,The A B C of book-keeping,DRANE,,0.0,16mo.,"Wells (J.) see How (W. W.) and Wells. Wells (W. Henry)—The A B C of book-keeping. 16mo., pp. 88, IS..... ...DRANE, May 12",J. Wells,"[Wells (J.), Wells (W. Henry)]",,May 12,


We seem to be missing Spenser entirely now. Is this a mistake?

In [None]:
df[df['last_name'] == "Spenser"]

Unnamed: 0,entry,last_name,first_name,title,publisher,price,price_in_pounds,format,original_entry,author_name,creators,is_editor,date,is_net
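One way to tell a parsing miss from a genuine absence is to search the raw entry text instead of the parsed `last_name` column. The following is a minimal sketch under that assumption; `entries_mentioning` is a hypothetical helper, and an exact word match will not catch OCR variants of the name.

```python
import re


def entries_mentioning(entries, name):
    """Return the entry strings that contain `name` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE)
    # skip non-string values such as NaN
    return [e for e in entries if isinstance(e, str) and pattern.search(e)]
```

Calling `entries_mentioning(df["original_entry"], "Spenser")` would list any entries where the name survives in the OCR text but was not split off as a last name. An empty result would suggest the name is missing from this text file, not just from the parse.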


What are the top titles? The question doesn't really make sense on its own,
but it surfaces some other parsing issues that we need to work out.

---

I am a little concerned that there are still a few cases of "8vo." remaining, but the rest of these seem to be either proper titles or parts of titles.

In [None]:
df['title'].value_counts()[:40]

Poems                                           13
St                                              12
The                                             11
Poetical works                                  11
Consular reports                                10
Admiralty-Hydrographic                          10
Works                                            8
Mr                                               7
Rita                                             7
Mrs                                              6
Roland Yorke                                     6
Vol                                              5
Fairy tales                                      5
Letters                                          5
Nelson's Encyclopædia                            5
Dr                                               4
Æsop's Fables                                    4
Verses                                           4
8vo                                              4
IS                             
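Values such as `St`, `Mr`, `Mrs` and `Dr` suggest that titles are being cut at the period after an abbreviation (compare `The History of Mr` for "The History of Mr. Polly" in the Wells rows above). The sketch below shows one possible way to avoid that, assuming the title ends at the first period that is followed by whitespace and does not follow a known abbreviation. The function name `split_title` and the abbreviation list are illustrative assumptions, not the pattern the notebook currently uses.

```python
import re

# Hypothetical list; extend it as more truncated titles turn up.
ABBREVIATIONS = {"St", "Mr", "Mrs", "Dr"}


def split_title(text):
    """Return the title part of `text`, the entry text after the author name."""
    for m in re.finditer(r"\.(?=\s|$)", text):
        # look at the word immediately before this period
        preceding = re.search(r"(\w+)$", text[: m.start()])
        if preceding and preceding.group(1) in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep scanning
        return text[: m.start()]
    return text  # no usable period found
```

For example, `split_title("The History of Mr. Polly. 12mo., 7d, net")` keeps the full title, while `split_title("Marriage. Cr. 8vo. 74 x5")` still stops at the first period.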

Most expensive books?

--- 

The regex now catches prices fairly strictly. There are still a
few triple-digit shilling prices, but they all seem to be legitimate.
At this point it is more likely that we are missing prices.
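For reference, the `price_in_pounds` values in the table are consistent with pre-decimal arithmetic: 20 shillings to the pound and 12 pence to the shilling, so `3s. 6d.` is 3/20 + 6/240 = 0.175. A minimal sketch of that conversion follows. The function name `price_to_pounds` and its regex are assumptions made for illustration and are stricter than whatever the notebook uses; returning 0.0 for an unparsable price mirrors the 0.0 shown for rows with no captured price.

```python
import re

# Accepts "3s. 6d.", "6s." or "7d."; anything else is treated as unparsed.
PRICE = re.compile(r"(?:(\d+)s\.)?\s*(?:(\d+)d\.)?")


def price_to_pounds(price):
    """Convert a cleaned shillings/pence string to decimal pounds."""
    m = PRICE.fullmatch(price.strip())
    if m is None or not any(m.groups()):
        return 0.0  # no price captured
    shillings = int(m.group(1) or 0)
    pence = int(m.group(2) or 0)
    # 20 shillings per pound, 240 pence per pound
    return shillings / 20 + pence / 240
```

Rows where `price_in_pounds` is 0.0 but the original entry clearly contains a price (for example OCR readings such as `is.` or `39. 60.`) are the likely missed prices mentioned above.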


In [None]:
df.sort_values(by='price_in_pounds', ascending=False)[:100]

Unnamed: 0,entry,last_name,first_name,title,publisher,price,price_in_pounds,format,original_entry,author_name,creators,is_editor,date,is_net
4315,"Jones (E. A.)- A Catalogue of the objects in gold and silver and the Limoges enamels in the Jones (William Hughes) --At the foot of Enyri : collection of the Baroness James de Rothschild. a book about poetry in Wales. 12mo., pp. 204, Folio, 147s. net.......... . CONSTABLE, Nov. 12",Jones,E. A.,A Catalogue of the objects in gold and silver and the Limoges enamels in the Jones (William Hughes) --At the foot of Enyri : collection of the Baroness James de Rothschild,CONSTABLE,147s.,7.35,12mo.,"Jones (E. A.)- A Catalogue of the objects in gold and silver and the Limoges enamels in the Jones (William Hughes) --At the foot of Enyri : collection of the Baroness James de Rothschild. a book about poetry in Wales. 12mo., pp. 204, Folio, 1475. net.......... . CONSTABLE, Nov. 12",E. A. Jones,[Jones (E. A.)],,Nov. 12,net
464,"Baker (C. II. C.)-Lely and the Stuart portrait painters, 2 vols. Illus. 4to., 126s. net P. LEE WARNER, Oct. 12",Baker,C. II. C.,Lely and the Stuart portrait painters,P. LEE WARNER,126s.,6.3,4to.,"Baker (C. II. C.)-Lely and the Stuart portrait painters, 2 vols. Illus. 4to., 126s. net P. LEE WARNER, Oct. 12",C. II. C. Baker,[Baker (C. II. C.)],,Oct. 12,net
572,"Barratt (T. J.)-The Annals of Hampstead. 3 vols. Illus. 4to., 105s. net... ...BLACK, Oct. 12",Barratt,T. J.,The Annals of Hampstead,BLACK,105s.,5.25,4to.,"Barratt (T. J.)-The Annals of Hampstead. 3 vols. Illus. 4to., 105s. net... ...BLACK, Oct. 12",T. J. Barratt,[Barratt (T. J.)],,Oct. 12,net
4638,"Latham (A.) and English (T. C.) eds. -A System of treatment. By many writers. 4 vols. 8vo., 84s. net CHURCHILL, May 12",Latham,A.,A System of treatment,CHURCHILL,84s.,4.2,8vo.,"Latham (A.) and English (T. C.) eds. -A System of treatment. By many writers. 4 vols. 8vo., 845. net CHURCHILL, May 12",A. Latham,"[Latham (A.), English (T. C.)]",eds.,May 12,net
4198,"Jackson (R. T.)-Phylogeny of the Echini, with a revision of paleozoic species. Illus., ryl. 4to. 121 x 10}, pp. 492, 70s. net (Boston Soc. Nat. Hist.) . WESLEY, Dec. 12",Jackson,R. T.,"Phylogeny of the Echini, with a revision of paleozoic species",WESLEY,70s.,3.5,4to.,"Jackson (R. T.)-Phylogeny of the Echini, with a revision of paleozoic species. Illus., ryl. 4to. 121 x 10}, pp. 492, 70s. net (Boston Soc. Nat. Hist.) . WESLEY, Dec. 12",R. T. Jackson,[Jackson (R. T.)],,Dec. 12,net
0,"Abercromby (Hon. John)-A Study of the bronze age pottery of Great Britain and Ireland and its associated grave-goods. Illus. 2 vols. 4to. 63s, net (Clarendon Press) FROWDE, July 12",Abercromby,Hon. John,A Study of the bronze age pottery of Great Britain and Ireland and its associated grave-goods,FROWDE,63s.,3.15,4to.,"Abercromby (Hon. John)-A Study of the bronze age pottery of Great Britain and Ireland and its associated grave-goods. Illus. 2 vols. 4to. 63s, net (Clarendon Press) FROWDE, July 12",Hon. John Abercromby,[Abercromby (Hon. John)],,July 12,net
1167,"British Museum. Squire (W. B.)-Catalogue of printed music published between 1487 and 1800. 2 vols. 8vo., 63s. .FROWDE, Aug. 12",British Museum. Squire,W. B.,Catalogue of printed music published between,FROWDE,63s.,3.15,8vo.,"British Museum. Squire (W. B.)-Catalogue of printed music published between 1487 and 1800. 2 vols. 8vo., 63s. .FROWDE, Aug. 12",W. B. British Museum. Squire,[British Museum. Squire (W. B.)],,Aug. 12,
6504,"Prinz (H.)-Dental formulary: a practical guide for the preparation of chemical and technical compounds. 2nd edit. Cr. 8vo., Ios, 6d. net KEENER, Dec. 11 Prior (Edward, S.) and Gardner (Arthur)--An account of mediæval figure-sculpture in Eng- land. With 855 photographs. 4to. 11 X 81, pp. 748, 63s. net CAMB. UNIV. PRESS, Oct. 12",Prinz,H.,Dental formulary: a practical guide for the preparation of chemical and technical compounds,CAMB. UNIV. PRESS,63s.,3.15,8vo.,"Prinz (H.)-Dental formulary: a practical guide for the preparation of chemical and technical compounds. 2nd edit. Cr. 8vo., Ios, 6d. net KEENER, Dec. 11 Prior (Edward, S.) and Gardner (Arthur)--An account of mediæval figure-sculpture in Eng- land. With 855 photographs. 4to. 11 X 81, pp. 748, 63s. net CAMB. UNIV. PRESS, Oct. 12",H. Prinz,[Prinz (H.)],,Oct. 12,net
2798,"Filipp! (Filippo de)--Karakoram and Western Himalaya, 1909 : an account of the expedition of H.R.H. Prince Lingi Amedeo of Savoy, Duke of the Abruzzi. 2 vols. Illus. 4to. 103 X87, pp. 488 and maps and charts, 63s. net CONSTABLE, Nov. 12",Filipp!,Filippo de,Karakoram and Western Himalaya,CONSTABLE,63s.,3.15,4to.,"Filipp! (Filippo de)--Karakoram and Western Himalaya, 1909 : an account of the expedition of H.R.H. Prince Lingi Amedeo of Savoy, Duke of the Abruzzi. 2 vols. Illus. 4to. 103 X87, pp. 488 and maps and charts, 63s. net CONSTABLE, Nov. 12",Filippo de Filipp!,[Filipp! (Filippo de)],,Nov. 12,net
2254,"Deffand (Marquise)Lettres à Horace Walpole. Edit. by Mrs. P. Tonybee. 3 vols. 8vo., 63s. net METHUEN, Oct. 12",Deffand,Marquise,Lettres à Horace Walpole,METHUEN,63s.,3.15,8vo.,"Deffand (Marquise)Lettres à Horace Walpole. Edit. by Mrs. P. Tonybee. 3 vols. 8vo., 635. net METHUEN, Oct. 12",Marquise Deffand,[Deffand (Marquise)],,Oct. 12,net


In [None]:
df.to_csv("English-Catalogue-of-Books-1912.csv", index=False, encoding="utf-8")

In [None]:
from google.colab import drive
drive.mount('drive/')

Mounted at drive/


In [None]:
!cp "English-Catalogue-of-Books-1912.csv" "drive/My Drive/"

https://drive.google.com/file/d/1--4pC7ZJE_mJPGNndsULeEk5jVHeriNw/view?usp=sharing 