## Text data on the web

## Intro: The basic idea

In [181]:
# This happens to be Frankenstein, the most downloaded of all Gutenberg books on the day this
# notebook was created.
idx = 84
import urllib.request
url = f"https://gutenberg.org/cache/epub/{idx}/pg{idx}.txt"
with urllib.request.urlopen(url) as stream:
    byte_str = stream.read()

In [182]:
type(byte_str)

bytes

The result is a `bytes` object rather than a `str` because the raw bytes have not yet been decoded as UTF-8 text.

In [183]:
text=byte_str.decode("UTF8")

In [184]:
type(text)

str
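The round trip between `str` and `bytes` can be seen on a small example. This is only a sketch; the sample string is made up for illustration and is not taken from the downloaded book.

```python
# A made-up sample containing two non-ASCII letters
sample = "naïve café"

encoded = sample.encode("utf-8")   # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

# Each accented letter takes two bytes in UTF-8, so the lengths differ
print(type(encoded).__name__, len(sample), len(encoded))
print(decoded == sample)
```

The character count and the byte count differ whenever the text contains non-ASCII characters, which is one reason the two types are kept distinct.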

Make a function implementing the idea (retrieval of books from Gutenberg.org by book index).

In [34]:
def get_book (ind):
    url = f"https://gutenberg.org/cache/epub/{ind}/pg{ind}.txt"
    with urllib.request.urlopen(url) as stream:
        byte_str = stream.read()
        return byte_str.decode("UTF8")

In [36]:
# Get book 84 (= Frankenstein) from Gutenberg
text2= get_book(84)

In [37]:
text == text2

True

##  Get data pointers from a trusted source

The [Gutenberg.org page](https://gutenberg.org/browse/scores/top) contains a list of the 100
most downloaded books, including *Frankenstein*, which we just downloaded in our introductory section,
and *Pride and Prejudice*, which has shown up in a lot of our examples.

Let's get the list and compute some statistics from that data.

In [26]:
from bs4 import BeautifulSoup

book_list = "https://gutenberg.org/browse/scores/top"
with urllib.request.urlopen(book_list) as stream:
    html_doc = stream.read()
soup = BeautifulSoup(html_doc, 'html.parser')

In [138]:
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Top 100 | Project Gutenberg</title>
<link href="/gutenberg/style.css?v=1.1" rel="stylesheet"/>
<link href="/gutenberg/collapsible.css?1.1" rel="stylesheet"/>
<link href="/gutenberg/new_nav.css?v=1.321231" rel="stylesheet"/>
<link href="/gutenberg/pg-desktop-one.css" rel="stylesheet"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="books, ebooks, free, kindle, android, iphone, ipad" name="keywords">
<meta content="wucOEvSnj5kP3Ts_36OfP64laakK-1mVTg-ptrGC9io" name="google-site-verification"/>
<meta content="4WNaCljsE-A82vP_ih2H_UqXZvM" name="alexaVerifyID"/>
<link href="https://www.gnu.org/copyleft/fdl.html" rel="copyright">
<link href="/gutenberg/favicon.ico?v=1.1" rel="shortcut icon">
<meta content="Project Gutenberg" property="og:title"/>
<meta content="website" property="og:type"/>
<meta content="https://www.gutenberg.org/" property="og:url"/>
<m

A link looks like this:

```
<a href="/ebooks/4300">Ulysses by James Joyce (434)</a>
```

Find all of the links.  Extract the book idx (4300, in this case) and the title from the link.

Use the structure in the parsed HTML (the `soup` instance).

In [143]:
import re

#  Every link is an <a ...> tag; href=True skips anchors that have no href attribute
links = soup.find_all('a', href=True)

# We're restricting our harvest to links whose hrefs contain this pattern (links to book pages)
path_re = r"/ebooks/(\d+)"
reg_exp = re.compile(path_re)

## Containers for collected data
idxs = []
# Let's grab the book titles too (because we're humans and like names instead of numbers)
titles = dict()

#  Do the collecting
for link in links:
    # Get the ref string from inside the link instance
    ref = link.get("href")
    match = reg_exp.findall(ref)
    if match:
        # findall returns a list.  If there's a match, there will be only one idx
        idx = match[0]
        idxs.append(idx)
        titles[idx]  = link.get_text()

# There is more than one top 100 list on the page. They have duplicates.  Remove them
idxs = list(set(idxs))
print(f"{len(idxs)} book indices found.")
idxs[:10]

115 book indices found.


['521',
 '2160',
 '2600',
 '4300',
 '20228',
 '16119',
 '24869',
 '76',
 '10007',
 '8800']
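Converting to a `set` removes duplicates but also discards the page order, which is why the indices above no longer start with the most downloaded book. If the ranking matters, one alternative is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each key in order. A minimal sketch on a made-up sample list:

```python
# Hypothetical sample of harvested indices, with repeats as on the real page
sample_idxs = ["84", "1342", "84", "2701", "1342", "11"]

# dict keys are unique and remember insertion order
unique_in_order = list(dict.fromkeys(sample_idxs))
print(unique_in_order)  # ['84', '1342', '2701', '11']
```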

In [61]:
titles

{'84': 'Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley (104809)',
 '1342': 'Pride and Prejudice by Jane Austen (78103)',
 '2701': 'Moby Dick; Or, The Whale by Herman Melville (68617)',
 '1513': 'Romeo and Juliet by William Shakespeare (65500)',
 '145': 'Middlemarch by George Eliot (50027)',
 '100': 'The Complete Works of William Shakespeare by William Shakespeare (47432)',
 '2641': 'A Room with a View by E. M.  Forster (47687)',
 '37106': 'Little Women; Or, Meg, Jo, Beth, and Amy by Louisa May Alcott (46229)',
 '64317': 'The Great Gatsby by F. Scott  Fitzgerald (35947)',
 '67979': 'The Blue Castle: a novel by L. M.  Montgomery (42148)',
 '16389': 'The Enchanted April by Elizabeth Von Arnim (42084)',
 '6761': 'The Adventures of Ferdinand Count Fathom — Complete by T.  Smollett (39108)',
 '394': 'Cranford by Elizabeth Cleghorn Gaskell (38786)',
 '2160': 'The Expedition of Humphry Clinker by T.  Smollett (38022)',
 '11': "Alice's Adventures in Wonderland by Lewis C

For any puzzled Tolstoy fans: *graf* is Russian for Count.  Although the link
takes you to the Constance Garnett translation, the metadata lists the author as "graf Leo Tolstoy".
I don't know why.
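Each value in `titles` packs the title, the author, and a parenthesised number (presumably the download count) into one string. The sketch below splits such a string apart. The pattern is an assumption based on the sample entries shown above, and `split_entry` is a hypothetical helper, not part of the notebook.

```python
import re

# Assumed layout: "<title> by <author> (<number>)"
entry_re = re.compile(r"^(?P<title>.+) by (?P<author>.+?) \((?P<count>\d+)\)$")

def split_entry(entry):
    """Return (title, author, count), or None if the text does not fit the pattern."""
    m = entry_re.match(entry)
    if m is None:
        return None
    return m.group("title"), m.group("author"), int(m.group("count"))

print(split_entry("Ulysses by James Joyce (434)"))
print(split_entry("An entry with no author (452)"))
```

The greedy title group means the split happens at the last " by " in the string, so a title that itself contains " by " stays intact. Entries that list no author return `None` and need separate handling.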

##  Get the data

We have the **indexes** for the books we want.  Now download the data using `get_book` (defined
in the first section).

In [167]:
import time

books = []
errs = []

# Pause between downloads (in seconds) to be NICE to the host website
delay = 5
num_samples = 20

print(f"Getting {num_samples} books")
for (i,idx) in enumerate(idxs[:num_samples]):
    try:  #  There do seem to be missing books
        books.append(get_book(idx))
        print(f"{i} read!")
        time.sleep(delay)
    except urllib.request.HTTPError:
        print(f"Err {i}")
        errs.append(idx)

Getting 20 books
0 read!
1 read!
2 read!
3 read!
4 read!
5 read!
6 read!
7 read!
8 read!
9 read!
10 read!
11 read!
12 read!
13 read!
14 read!
15 read!
16 read!
17 read!
18 read!
19 read!


The books that had been moved or removed (from a run that tried to download all 115 books):

In [62]:
for idx in errs:
    print(titles[idx])

Calculus Made Easy by Silvanus P.  Thompson (10359)
Moby Word Lists by Grady Ward (355)
Tractatus Logico-Philosophicus by Ludwig Wittgenstein (12068)


## Cleanup

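Remove the Gutenberg.org identifying front matter, that is, everything up to and including the line that starts with `*** START OF THE PROJECT GUTENBERG EBOOK`.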

In [168]:
tag_str = "*** START OF THE PROJECT GUTENBERG EBOOK"

def get_tag_str_line_no(line_list, strict=False):
    # Return the index of the first line that starts with tag_str
    for (i, l) in enumerate(line_list):
        if l.startswith(tag_str):
            return i
    if strict:
        raise Exception("No luck!")
    # Not strict: returning -1 makes clean_book keep the whole text
    return -1

def clean_book(book_str, strict=False):
    # Keep only the lines after the start tag
    lines = book_str.splitlines()
    return '\n'.join(lines[get_tag_str_line_no(lines, strict=strict) + 1:])

In [169]:
cleaned_books = []

#  Raise an Exception if the start tag is not found
strict = True
print(f"Cleaning {len(books)} books")
for (i,book_str) in enumerate(books):
    try:
        cleaned_books.append(clean_book(book_str, strict=strict))
    except Exception:
        print(f"Err {i}")
        continue

print(len(cleaned_books))

Cleaning 20 books
20


Save space by rebinding `books` to the cleaned texts, so the raw downloads are no longer referenced.

In [170]:
books = cleaned_books
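The cleanup above removes only the front matter, so the license text at the end of each file still contributes to the counts that follow. A possible extension is sketched here. It assumes the files also carry a closing line that starts with `*** END OF THE PROJECT GUTENBERG EBOOK`; that marker and the helper name `strip_front_and_back` are assumptions, and the sample text is invented.

```python
START_TAG = "*** START OF THE PROJECT GUTENBERG EBOOK"
END_TAG = "*** END OF THE PROJECT GUTENBERG EBOOK"  # assumed closing marker

def strip_front_and_back(book_str):
    """Keep the lines between the start and end markers (whole text if absent)."""
    lines = book_str.splitlines()
    # First line after the start tag, or 0 if there is no start tag
    start = next((i + 1 for i, l in enumerate(lines) if l.startswith(START_TAG)), 0)
    # Line holding the end tag, or the end of the text if there is none
    end = next((i for i, l in enumerate(lines) if l.startswith(END_TAG)), len(lines))
    return "\n".join(lines[start:end])

sample = "\n".join([
    "Header line",
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***",
    "Chapter 1",
    "It was a dark night.",
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***",
    "License text",
])
print(strip_front_and_back(sample))
```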

## English letter frequencies

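We illustrate some simple statistics gathering on text, starting with character counts. Passing a string to `Counter.update` counts every character in it, including spaces and punctuation.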

In [171]:
from collections import Counter

ltr_ctr = Counter()

for book in books:
    ltr_ctr.update(book)

In [172]:
ltr_ctr.most_common(10)

[(' ', 3438463),
 ('e', 1606643),
 ('a', 1218392),
 ('t', 1163350),
 ('n', 1018244),
 ('o', 1007464),
 ('i', 895359),
 ('h', 861382),
 ('s', 852186),
 ('r', 767826)]

Removing white space:

In [91]:
ltr_ctr2 = Counter()

for book in books:
    ltr_ctr2.update(''.join(book.split()))

In [93]:
ltr_ctr2.most_common(10)

[('e', 1606643),
 ('a', 1218392),
 ('t', 1163350),
 ('n', 1018244),
 ('o', 1007464),
 ('i', 895359),
 ('h', 861382),
 ('s', 852186),
 ('r', 767826),
 ('d', 590909)]
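The counts above are raw and case-sensitive, so `'e'` and `'E'` are tallied separately and punctuation is still included. One possible refinement, sketched below on a made-up string, folds case, keeps only letters, and reports relative frequencies. The helper name `letter_frequencies` is hypothetical.

```python
from collections import Counter

def letter_frequencies(text):
    """Relative frequency of each letter, ignoring case and non-letters."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    # most_common() orders the letters from most to least frequent
    return {ltr: n / total for ltr, n in counts.most_common()}

freqs = letter_frequencies("The Whale, the whale!")
print(freqs)
```

Relative frequencies make it easier to compare corpora of different sizes than raw counts do.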

## English letter digraph frequencies

In [118]:
from nltk import bigrams

ltr_digraph_ctr = Counter()

for book in books:
    ltr_digraph_ctr.update(bigrams(''.join(book.split())))

In [119]:
ltr_digraph_ctr.most_common(10)

[(('t', 'h'), 396986),
 (('h', 'e'), 371264),
 (('a', 'n'), 268701),
 (('i', 'n'), 250742),
 (('e', 'r'), 219100),
 (('n', 'd'), 183155),
 (('r', 'e'), 174384),
 (('n', 'g'), 159345),
 (('a', 't'), 148314),
 (('e', 'd'), 148178)]

Criticism:  This technique creates spurious letter bigrams:

```
of the
```

becomes

```
ofthe
```

creating the unlikely digraph "ft".  And sure enough:

In [108]:
ltr_digraph_ctr["f","t"]

42605

To avoid this, update the digraph counts word by word (which is slower):

In [126]:
from nltk.tokenize import word_tokenize

ltr_digraph_ctr2 = Counter()

for book in books:
    for word in word_tokenize(book):
        ltr_digraph_ctr2.update(bigrams(word))

In [127]:
ltr_digraph_ctr2.most_common(10)

[(('t', 'h'), 376156),
 (('h', 'e'), 369879),
 (('a', 'n'), 263625),
 (('i', 'n'), 249591),
 (('e', 'r'), 204937),
 (('n', 'd'), 179190),
 (('r', 'e'), 170888),
 (('n', 'g'), 157199),
 (('a', 't'), 143285),
 (('h', 'a'), 135524)]

In [128]:
ltr_digraph_ctr.most_common(10)[9]

(('e', 'd'), 148178)

In [129]:
ltr_digraph_ctr["f","t"],ltr_digraph_ctr2["f","t"]

(42605, 11156)

In [130]:
ltr_digraph_ctr["e","d"],ltr_digraph_ctr2["e","d"]

(148178, 130603)

In [131]:
ltr_digraph_ctr["h","a"],ltr_digraph_ctr2["h","a"]

(146977, 135524)
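The effect can be reproduced on the two-word phrase from the criticism above without any downloaded data. This sketch uses `zip(s, s[1:])` as a stand-in for `nltk.bigrams`; for a string it should yield the same adjacent character pairs, and `char_bigrams` is a hypothetical helper name.

```python
from collections import Counter

def char_bigrams(s):
    """Adjacent character pairs of a string."""
    return zip(s, s[1:])

phrase = "of the"

# Strategy 1: strip white space first, then count bigrams across the whole string
joined = Counter(char_bigrams("".join(phrase.split())))

# Strategy 2: count bigrams inside each word separately
by_word = Counter()
for word in phrase.split():
    by_word.update(char_bigrams(word))

print(joined["f", "t"], by_word["f", "t"])  # 1 0
```

On real text the word-by-word count of "ft" does not drop to zero, because words such as *after* and *left* genuinely contain the digraph.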

##  English word frequencies

In [102]:

from nltk.tokenize import word_tokenize

wd_ctr =  Counter()

for (i,book)  in enumerate(books):
    print(f"{i}", end=" ")
    wd_ctr.update(word_tokenize(book))
    print(wd_ctr.most_common(10))
    print("="*20)
          

0 [(',', 11481), ('the', 5955), ('I', 5108), ('and', 4772), ('to', 4322), ('of', 3614), (';', 2332), ('a', 2312), ('.', 2206), ('my', 2087)]
1 [(',', 26360), ('the', 14123), ('and', 9810), ('of', 9270), ('to', 8633), ('I', 7888), ('a', 5945), ('.', 5015), ('in', 4985), (';', 4713)]
2 [(',', 66235), ('the', 45919), ('and', 30862), ('.', 28223), ('to', 25100), ('of', 24190), ('a', 15995), ('in', 13371), ('I', 12367), ('that', 11425)]
3 [(',', 82735), ('the', 59580), ('.', 48241), ('and', 37545), ('of', 32352), ('to', 29970), ('a', 21871), ('in', 18083), ('I', 15151), ('that', 13901)]
4 [(',', 93350), ('the', 59744), ('.', 53735), ('and', 37610), ('of', 32459), ('to', 30046), ('a', 21933), ('in', 18158), ('I', 15152), ('that', 13914)]
5 [(',', 95827), ('the', 61451), ('.', 55724), ('and', 38307), ('of', 33471), ('to', 30574), ('a', 22310), ('in', 18899), ('I', 15364), ('that', 14217)]
6 [(',', 137741), ('the', 81483), ('.', 75123), ('and', 49130), ('of', 43027), ('to', 37522), ('a', 25850

The sudden upsurge in the count of the word "|" is not typical of English data.  It happens
in the book with index 17.  So we confirm the anomaly and check the source:

In [106]:
books[17].count("|")

45626

In [173]:
titles[idxs[17]]

'Names and places in the Old and New Testament and Apocrypha, with their modern identifications (452)'

In [None]:
text17 = word_tokenize(books[17])

wd_idx = text17.index("|")

It looks as if the "|"s are column separators that derive from ASCII "markup" representations of tables.  Aargh.

In [180]:
' '.join(text17[wd_idx-60:wd_idx+100])

'number where the name is found ; ‘ B . R. , ’ _Robinson ’ s Biblical Researches_ . -- -- -- -- -- -- -- -- + -- -- -- -- -- -- -- + -- -- -- -- -- -+ -- -- -- -- -- + -- -- -- -- -- -- -- -- -- -- Bible and | | Modern | No . of |Remarks , References , Apocrypha | References . | Identifi- \\Sheet on/ and Name . | | cation . \\ ⅜-in./ No . of Sheet on | | |Map./ Large Map . -- -- -- -- -- -- -- -- + -- -- -- -- -- -- -- + -- -- -- -- -- -- -- + -- -+ -- -- -- -- -- -- -- -- -- -- -- -- =ABANA= , River |2 Kings v. 12 |_Nahr | 3 |One of the rivers of | | Abanias_ , a |and| Damascus'
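One way to keep table markup like this out of the word counts is to filter the tokens before counting. The sketch below keeps only tokens made entirely of letters and folds case. The token list is a hypothetical sample resembling the fragment above, and `count_words` is an illustrative helper, not part of the notebook.

```python
from collections import Counter

def count_words(tokens):
    """Count only tokens made entirely of letters, folded to lower case."""
    return Counter(tok.lower() for tok in tokens if tok.isalpha())

# Hypothetical tokens resembling the table fragment above
tokens = ["Bible", "and", "|", "|", "Modern", "|", "No", ".", "of", "--", "and"]

counts = count_words(tokens)
print(counts.most_common(1))  # [('and', 2)]
```

The filter is blunt: it also discards hyphenated words, numbers, and any token containing an apostrophe, so whether it is appropriate depends on what is being measured.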

## Letter counts revisited

Finally, letter counts can be defined in different ways.

Here's a common strategy:

In [132]:
ltr_ctr3 = Counter()

for word in wd_ctr:
    ltr_ctr3.update(word)
    
ltr_ctr3.most_common(10)

[('a', 105521),
 ('e', 84872),
 ('n', 75547),
 ('i', 72772),
 ('s', 59968),
 ('t', 55348),
 ('r', 54434),
 ('o', 50009),
 ('l', 48250),
 ('g', 37939)]

Compare these with the original figures, repeated below.  Explain the differences.

In [135]:
ltr_ctr2.most_common(10)

[('e', 1606643),
 ('a', 1218392),
 ('t', 1163350),
 ('n', 1018244),
 ('o', 1007464),
 ('i', 895359),
 ('h', 861382),
 ('s', 852186),
 ('r', 767826),
 ('d', 590909)]