## Text data on the web

## Intro: The basic idea

In [24]:
import urllib.request

# This happens to be Frankenstein, the most downloaded of all Gutenberg books on the day this
# NB was created.
idx= 84
url = f"https://gutenberg.org/cache/epub/{idx}/pg{idx}.txt"

with urllib.request.urlopen(url) as stream:
    byte_str = stream.read()

In [25]:
type(byte_str)

bytes

`byte_str` is a `bytes` object rather than a string because the raw bytes have not yet been decoded from UTF-8 into text.

In [26]:
text=byte_str.decode("UTF8")

In [27]:
type(text)

str

Make a function implementing the idea (retrieval of books from Gutenberg.org by book index).

In [13]:
import urllib.request

def get_book (ind):
    """
    Download a Gutenberg book by index and return it as a string.

    We replace "\\r\\n" (the Windows representation of a newline) with "\\n"
    (the Unix representation) to facilitate regexp matching across line
    boundaries, but this is only a patch on a bigger problem.
    """
    url = f"https://gutenberg.org/cache/epub/{ind}/pg{ind}.txt"
    with urllib.request.urlopen(url) as stream:
        byte_str = stream.read()
        return byte_str.decode("UTF8").replace("\r\n","\n")

In [14]:
# Get book 84 (= Frankenstein) from Gutenberg
text = get_book(84)

In [15]:
text

'\ufeffThe Project Gutenberg eBook of Frankenstein; Or, The Modern Prometheus\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook.\n\nTitle: Frankenstein; Or, The Modern Prometheus\n\nAuthor: Mary Wollstonecraft Shelley\n\nRelease date: October 1, 1993 [eBook #84]\n                Most recently updated: December 2, 2022\n\nLanguage: English\n\nCredits: Judith Boss, Christy Phillips, Lynn Hanninen and David Meltzer. HTML version by Al Haines.\n        Further corrections by Menno de Leeuw.\n\n\n*** START OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN; OR, THE MODERN PROMETHEUS ***\n\n\n\n\nFran

In [37]:
text == text2

True
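
The text printed above begins with `'\ufeff'`, a byte order mark (BOM) that the plain `UTF8` codec leaves in place. Here is a minimal sketch, assuming the file really is UTF-8 with a leading BOM, of how Python's standard `utf-8-sig` codec drops it. The sample bytes and the helper name `decode_book` are made up for illustration and are not part of the notebook's code.

```python
def decode_book(byte_str):
    """Decode UTF-8 bytes, dropping a leading BOM if present, and normalize newlines."""
    # "utf-8-sig" behaves like "utf-8" but skips a BOM at the start of the data
    return byte_str.decode("utf-8-sig").replace("\r\n", "\n")

# Hypothetical sample: BOM bytes, then text with a Windows line ending
sample = b"\xef\xbb\xbfThe Project Gutenberg eBook\r\nof Frankenstein"

plain = sample.decode("utf-8")   # keeps the BOM as '\ufeff'
clean = decode_book(sample)      # BOM removed, "\r\n" replaced by "\n"

print(repr(plain[:4]))
print(repr(clean[:4]))
```

Whether to strip the BOM at decode time or during later cleanup is a design choice; the cleanup step further down removes the front matter, and the BOM goes with it.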

##  Get data pointers from a trusted source

The [Gutenberg.org page](https://gutenberg.org/browse/scores/top) contains a list of the 100
most downloaded books, including *Frankenstein*, which we just downloaded in our introductory section,
and *Pride and Prejudice*, which has shown up in a lot of our examples.

Let's get the list and compute some statistics from that data.

In [30]:
from bs4 import BeautifulSoup

book_list = "https://gutenberg.org/browse/scores/top"
with urllib.request.urlopen(book_list) as stream:
    html_doc = stream.read()
soup = BeautifulSoup(html_doc, 'html.parser')

In [31]:
# soup is a document tree (sort of)
L = soup.findChildren()[0].findChildren()
for child in L[:15]:
    if child.name == "head":
        continue
    print(child)
    print("****")

<meta charset="utf-8"/>
****
<title>Top 100 | Project Gutenberg</title>
****
<link href="/gutenberg/style.css?v=1.1" rel="stylesheet"/>
****
<link href="/gutenberg/collapsible.css?1.1" rel="stylesheet"/>
****
<link href="/gutenberg/new_nav.css?v=1.321231" rel="stylesheet"/>
****
<link href="/gutenberg/pg-desktop-one.css" rel="stylesheet"/>
****
<meta content="width=device-width, initial-scale=1" name="viewport"/>
****
<meta content="books, ebooks, free, kindle, android, iphone, ipad" name="keywords">
<meta content="wucOEvSnj5kP3Ts_36OfP64laakK-1mVTg-ptrGC9io" name="google-site-verification"/>
<meta content="4WNaCljsE-A82vP_ih2H_UqXZvM" name="alexaVerifyID"/>
<link href="https://www.gnu.org/copyleft/fdl.html" rel="copyright">
<link href="/gutenberg/favicon.ico" rel="icon" sizes="16x16" type="image/png">
<meta content="Project Gutenberg" property="og:title"/>
<meta content="website" property="og:type"/>
<meta content="https://www.gutenberg.org/" property="og:url"/>
<meta content="Project

A properly closed link looks like this:

```
<a href="/ebooks/4300">Ulysses by James Joyce (434)</a>
```

Find all of the links.  Extract the book idx (4300, in this case) and the title from the link.

Use the structure in the parsed HTML (the `soup` instance).

In [32]:
import re

#  All links are inside the tag <a ... >
links = soup.find_all('a')

# We're restricting our harvest to links whose hrefs contain this path (links to book pages).
# Raw string, so that \d reaches the regex engine unchanged.
path_re = r"/ebooks/(\d+)"
reg_exp = re.compile(path_re)

## Containers for collected data
idxs = []
# Let's grab the book titles too (because we're humans and like names better than numbers)
titles = dict()

#  Do the collecting
for link in links:
    # Get the ref string from inside the link instance (None if the tag has no href)
    ref = link.get("href")
    if ref is None:
        continue
    match = reg_exp.findall(ref)
    if match:
        # findall returns a list.  If there's a match, there will be only one idx
        idx = match[0]
        idxs.append(idx)
        titles[idx]  = link.get_text()

# There is more than one top 100 list on the page. They have duplicates.  Remove them
idxs = list(set(idxs))
print(f"{len(idxs)} book indices found.")
idxs[:10]

138 book indices found.


['2680',
 '64317',
 '10940',
 '1727',
 '33283',
 '1661',
 '16',
 '41580',
 '6761',
 '1232']

In [33]:
titles

{'84': 'Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley (85269)',
 '1342': 'Pride and Prejudice by Jane Austen (68228)',
 '2701': 'Moby Dick; Or, The Whale by Herman Melville (63500)',
 '12233': 'Stonewall Jackson and the American Civil War by G. F. R.  Henderson (2177)',
 '1513': 'Romeo and Juliet by William Shakespeare (58361)',
 '145': 'Middlemarch by George Eliot (46099)',
 '37106': 'Little Women; Or, Meg, Jo, Beth, and Amy by Louisa May Alcott (42596)',
 '100': 'The Complete Works of William Shakespeare by William Shakespeare (43956)',
 '55231': 'A history of the Peninsular War, Vol. 3, Sep. 1809-Dec. 1810 : by Charles Oman (1470)',
 '2641': 'A Room with a View by E. M.  Forster (44091)',
 '16389': 'The Enchanted April by Elizabeth Von Arnim (39653)',
 '2542': "A Doll's House : a play by Henrik Ibsen (29019)",
 '67979': 'The Blue Castle: a novel by L. M.  Montgomery (39425)',
 '64317': 'The Great Gatsby by F. Scott  Fitzgerald (29613)',
 '844': 'The Importan

For you possibly puzzled Tolstoy fans, *graf* is Russian for Count.  Although the link
takes you to the Constance Garnett translation, the metadata lists the author as "graf Leo Tolstoy".
I don't know why.
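
Each value in `titles` packs a title, an author, and a trailing number in parentheses into one string. Here is a rough sketch of pulling those apart, assuming the author follows the last `" by "` and the number comes at the very end. The helper name `split_link_text` is hypothetical, and a title that itself ends in `" by ..."` with no author would be split wrongly.

```python
import re

# Trailing "(12345)" at the end of the link text
COUNT_RE = re.compile(r"\s*\((\d+)\)\s*$")

def split_link_text(link_text):
    """Split 'Title by Author (N)' into (title, author, N); missing parts are None."""
    m = COUNT_RE.search(link_text)
    number = int(m.group(1)) if m else None
    rest = link_text[:m.start()] if m else link_text
    # Split on the LAST " by " so that titles containing " by " survive
    title, sep, author = rest.rpartition(" by ")
    if not sep:
        return rest.strip(), None, number
    # Collapse runs of spaces such as "E. M.  Forster"
    return title.strip(), " ".join(author.split()), number

print(split_link_text("Pride and Prejudice by Jane Austen (68228)"))
print(split_link_text("A Room with a View by E. M.  Forster (44091)"))
```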

##  Get the data

We have the **indexes** for the books we want.  Now download the data using `get_book` (defined
in the first section).  Reset `num_samples` to fit your time and space requirements.

In [34]:
import time

books = []
errs = []

# Implement delay between downloads to be NICE to host website
delay = 5
num_samples = 20

print(f"Getting {num_samples} books")
for (i,idx) in enumerate(idxs[:num_samples]):
    try:  #  There do seem to be missing books
        books.append(get_book(idx))
        print(f"{i} read!")
        time.sleep(delay)
    except urllib.request.HTTPError:
        print(f"Err {i}")
        errs.append(idx)

Getting 20 books
0 read!
1 read!
2 read!
3 read!
Err 4
5 read!
6 read!
7 read!
8 read!
9 read!
10 read!
11 read!
12 read!
13 read!
14 read!
15 read!
16 read!
17 read!
18 read!
19 read!


The books that have been moved or removed (when downloading all 115 books):

In [35]:
for idx in errs:
    print(titles[idx])

Calculus Made Easy by Silvanus P.  Thompson (8578)


In [62]:
for idx in errs:
    print(titles[idx])

Calculus Made Easy by Silvanus P.  Thompson (10359)
Moby Word Lists by Grady Ward (355)
Tractatus Logico-Philosophicus by Ludwig Wittgenstein (12068)


## Cleanup

Remove the identifying Gutenberg.org front matter.

In [77]:
tag_str = "*** START OF THE PROJECT GUTENBERG EBOOK"
def get_tag_str_line_no (line_list, strict=False):
    for (i,l) in enumerate(line_list):
        if l.startswith(tag_str):
            return i
    if strict:
        raise Exception("No luck!")
    else:
        return -1
        
def clean_book (book_str,strict=False):
    lines = book_str.splitlines()
    return '\n'.join(lines[get_tag_str_line_no(lines,strict=strict)+1:])

In [78]:
cleaned_books = []

#  Return with Exception if cleaning fails
strict = True
print(f"Cleaning {len(books)} books")
for (i,book_str) in enumerate(books):
    try:
        cleaned_books.append(clean_book(book_str,strict=strict))
    except Exception:
        print(f"Err {i}")
        continue

print(len(cleaned_books))

Cleaning 19 books
19


Save space.

In [79]:
books = cleaned_books
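
The cleanup above removes only the front matter. Gutenberg files typically also carry a license section after a matching `*** END OF THE PROJECT GUTENBERG EBOOK` line. That end marker is an assumption here, since only the start marker appears in the output above. A sketch that trims both ends, with the hypothetical name `strip_gutenberg_matter`:

```python
START_TAG = "*** START OF THE PROJECT GUTENBERG EBOOK"
# Assumed end marker; check it against your downloaded files
END_TAG = "*** END OF THE PROJECT GUTENBERG EBOOK"

def strip_gutenberg_matter(book_str):
    """Keep only the lines between the start and end markers (whichever are present)."""
    lines = book_str.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith(START_TAG):
            start = i + 1
        elif line.startswith(END_TAG):
            end = i
            break
    return "\n".join(lines[start:end])

# Made-up miniature "book" for illustration
sample = "\n".join([
    "Title: Example",
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***",
    "Chapter 1",
    "It was a dark and stormy night.",
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***",
    "License text follows.",
])
print(strip_gutenberg_matter(sample))
```

If neither marker is found, the text comes back unchanged, which mirrors the non-strict behavior of `clean_book`.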

## English letter frequencies

We illustrate some simple statistics tracking with text.

In [36]:
from collections import Counter

ltr_ctr = Counter()

for book in books:
    ltr_ctr.update(book)

In [37]:
ltr_ctr.most_common(10)

[(' ', 2985656),
 ('e', 1701383),
 ('t', 1207977),
 ('o', 1065592),
 ('a', 1039458),
 ('n', 918774),
 ('i', 868089),
 ('h', 867218),
 ('s', 850086),
 ('r', 821677)]

Removing white space:

In [82]:
ltr_ctr2 = Counter()

for book in books:
    ltr_ctr2.update(''.join(book.split()))

In [83]:
ltr_ctr2.most_common(10)

[('e', 1602260),
 ('t', 1149828),
 ('a', 1007349),
 ('o', 961439),
 ('h', 914048),
 ('n', 875366),
 ('s', 791139),
 ('i', 785031),
 ('r', 726181),
 ('d', 578171)]
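
The counts above still distinguish upper from lower case and include punctuation and digits. Here is a small sketch, not the notebook's own method, that keeps only letters, folds case, and reports relative frequencies. The name `letter_frequencies` and the two sample strings are made up for illustration.

```python
from collections import Counter

def letter_frequencies(texts):
    """Relative frequency of each letter across texts, most common first."""
    ctr = Counter()
    for text in texts:
        # Lowercase first, then keep alphabetic characters only
        ctr.update(ch for ch in text.lower() if ch.isalpha())
    total = sum(ctr.values())
    return {ch: n / total for ch, n in ctr.most_common()}

freqs = letter_frequencies(["The cat.", "The hat!"])
for ch, freq in freqs.items():
    print(ch, round(freq, 3))
```

On the real data this would be called as `letter_frequencies(books)`.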

## English letter digraph frequencies

In [39]:
from nltk import bigrams

list(bigrams("abracadabra"))

[('a', 'b'),
 ('b', 'r'),
 ('r', 'a'),
 ('a', 'c'),
 ('c', 'a'),
 ('a', 'd'),
 ('d', 'a'),
 ('a', 'b'),
 ('b', 'r'),
 ('r', 'a')]
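
As the output shows, `bigrams` pairs each item with its successor. For strings and lists the same pairing can be sketched with the standard library alone; `char_bigrams` is a hypothetical name, not an NLTK function.

```python
def char_bigrams(seq):
    """Pair each element of a string or list with the element that follows it."""
    # zip stops at the shorter argument, so the last element starts no pair
    return list(zip(seq, seq[1:]))

pairs = char_bigrams("abracadabra")
print(pairs[:4])
print(len(pairs))
```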

In [40]:
from nltk import bigrams

ltr_digraph_ctr = Counter()

for book in books:
    ltr_digraph_ctr.update(bigrams(''.join(book.split())))

In [41]:
ltr_digraph_ctr.most_common(10)

[(('t', 'h'), 422250),
 (('h', 'e'), 367676),
 (('e', 'r'), 238425),
 (('i', 'n'), 222905),
 (('a', 'n'), 207068),
 (('r', 'e'), 187583),
 (('n', 'd'), 166000),
 (('e', 's'), 155890),
 (('e', 'n'), 155167),
 (('h', 'a'), 153920)]

Criticism:  This technique creates spurious letter bigrams:

```
of the
```

becomes

```
ofthe
```

creating the unlikely digraph "ft".  And sure enough:

In [86]:
ltr_digraph_ctr["f","t"]

46843

To avoid this, update the digraph counts word by word (which is slower):

In [42]:
from nltk.tokenize import word_tokenize

ltr_digraph_ctr2 = Counter()

for book in books:
    for word in word_tokenize(book):
        ltr_digraph_ctr2.update(bigrams(word))

In [88]:
ltr_digraph_ctr2.most_common(10)

[(('t', 'h'), 434605),
 (('h', 'e'), 398254),
 (('a', 'n'), 220483),
 (('i', 'n'), 209097),
 (('e', 'r'), 194978),
 (('n', 'd'), 194009),
 (('r', 'e'), 169484),
 (('h', 'a'), 142927),
 (('o', 'u'), 133577),
 (('a', 't'), 126597)]

In [89]:
ltr_digraph_ctr.most_common(10)[9]

(('e', 'n'), 135528)

In [90]:
ltr_digraph_ctr["f","t"],ltr_digraph_ctr2["f","t"]

(46843, 10031)

In [91]:
ltr_digraph_ctr["e","d"],ltr_digraph_ctr2["e","d"]

(133737, 116265)

In [92]:
ltr_digraph_ctr["h","a"],ltr_digraph_ctr2["h","a"]

(153818, 142927)
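
The drop in the "ft" count can be reproduced on a tiny input. This sketch uses a plain regular expression in place of `word_tokenize`, which is an assumption (the two will not split text identically), and `digraph_counts` is a hypothetical helper. Note that "ft" does not vanish in the real data because it also occurs inside words such as "after".

```python
import re
from collections import Counter

# Runs of letters only (no digits, underscores, or punctuation)
WORD_RE = re.compile(r"[^\W\d_]+")

def digraph_counts(text, within_words=True):
    """Count letter pairs, either inside each word or across squeezed-out spaces."""
    if within_words:
        chunks = WORD_RE.findall(text)
    else:
        chunks = ["".join(text.split())]
    ctr = Counter()
    for chunk in chunks:
        ctr.update(zip(chunk, chunk[1:]))
    return ctr

squeezed = digraph_counts("of the", within_words=False)
by_word = digraph_counts("of the")
print(squeezed[("f", "t")], by_word[("f", "t")])
```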


##  English word frequencies

In [43]:
from nltk.tokenize import word_tokenize
from collections import Counter

wd_ctr =  Counter()

for (i,book)  in enumerate(books):
    print(f"{i}", end=" ")
    wd_ctr.update(word_tokenize(book))
    if i < 5:
        print(wd_ctr.most_common(10))
        print("="*20)
          

0 [(',', 6539), ('and', 3159), ('.', 2972), ('the', 2637), ('of', 2485), ('to', 2001), ('that', 1902), ('is', 1441), ('in', 1137), ('it', 1133)]
1 [(',', 9652), ('.', 5445), ('the', 5013), ('and', 4699), ('of', 3693), ('to', 3192), ('a', 2499), ('that', 2483), ('in', 1962), ('I', 1854)]
2 [(',', 24648), ('the', 18273), ('of', 13483), ('.', 11159), ('and', 10071), ('to', 7412), ('a', 5496), ('in', 5413), ('that', 3830), ('was', 3107)]
3 [(',', 34023), ('the', 24992), ('of', 17090), ('and', 15346), ('.', 14779), ('to', 10951), ('a', 7492), ('in', 7275), ('that', 5126), ('was', 4168)]
4 [(',', 41819), ('the', 30414), ('of', 19823), ('.', 19307), ('and', 18212), ('to', 13678), ('a', 10070), ('in', 9015), ('I', 6939), ('that', 6775)]
5 6 7 8 9 10 11 12 13 14 15 16 17 18 

In [44]:
print(wd_ctr.most_common(10))

[(',', 292750), ('.', 173957), ('the', 161817), ('of', 100781), ('and', 88099), ('to', 77120), ('a', 57897), ('in', 51476), ('I', 45022), ('’', 40088)]
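
Punctuation tokens such as `','`, `'.'`, and `'’'` crowd the top of the list. Here is a sketch of one possible filter, assuming the tokens have already been produced by a tokenizer: keep tokens that contain at least one letter or digit, and fold case. The name `word_counts` and the sample token list are made up for illustration.

```python
from collections import Counter

def word_counts(tokens):
    """Count lowercased tokens, skipping tokens made up purely of punctuation."""
    return Counter(
        tok.lower() for tok in tokens
        if any(ch.isalnum() for ch in tok)
    )

tokens = ["The", "campaign", ",", "the", "battle", ".", "’", "I"]
ctr = word_counts(tokens)
print(ctr.most_common(3))
```

Folding case merges "The" with "the" but also turns "I" into "i"; whether that is acceptable depends on the question being asked.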


In [23]:
text[:200]

'\ufeffThe Project Gutenberg eBook of Frankenstein; Or, The Modern Prometheus\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with alm'

In [18]:
from nltk.tokenize import word_tokenize
tkns = word_tokenize(text[:10_000])

In [20]:
len(tkns)

1963

In [22]:
tkns[500:600]

['wonders',
 'and',
 'in',
 'beauty',
 'every',
 'region',
 'hitherto',
 'discovered',
 'on',
 'the',
 'habitable',
 'globe',
 '.',
 'Its',
 'productions',
 'and',
 'features',
 'may',
 'be',
 'without',
 'example',
 ',',
 'as',
 'the',
 'phenomena',
 'of',
 'the',
 'heavenly',
 'bodies',
 'undoubtedly',
 'are',
 'in',
 'those',
 'undiscovered',
 'solitudes',
 '.',
 'What',
 'may',
 'not',
 'be',
 'expected',
 'in',
 'a',
 'country',
 'of',
 'eternal',
 'light',
 '?',
 'I',
 'may',
 'there',
 'discover',
 'the',
 'wondrous',
 'power',
 'which',
 'attracts',
 'the',
 'needle',
 'and',
 'may',
 'regulate',
 'a',
 'thousand',
 'celestial',
 'observations',
 'that',
 'require',
 'only',
 'this',
 'voyage',
 'to',
 'render',
 'their',
 'seeming',
 'eccentricities',
 'consistent',
 'for',
 'ever',
 '.',
 'I',
 'shall',
 'satiate',
 'my',
 'ardent',
 'curiosity',
 'with',
 'the',
 'sight',
 'of',
 'a',
 'part',
 'of',
 'the',
 'world',
 'never',
 'before',
 'visited',
 ',',
 'and']

Compare the original figures.  Explain the differences.

## Creating a corpus of data

If you're running in Google Colab, do this first:

In [None]:
from google.colab import drive
drive.mount('/content/drive')
# Create the following folder in the root directory
!mkdir -p "/content/drive/My Drive/nltk"

nltk_corpus_dir = "/content/drive/My Drive/nltk"

Otherwise pick a corpus directory on your own file system and create a subdirectory for your corpus.

In [97]:
import os, os.path
import nltk.data

nltk_corpus_dir = '~/nltk_data/corpora/gutenberg2'
# Note: if you changed the path above, you have to retype it here (the !-command below does not use the variable)
!mkdir -p ~/nltk_data/corpora/gutenberg2

In [98]:
nltk_corpus_dir = os.path.expanduser(nltk_corpus_dir)

if nltk_corpus_dir not in nltk.data.path:
    nltk.data.path.append(nltk_corpus_dir)
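
If you would rather not retype the path in a shell command, the directory can be created from Python using the variable itself. This is a sketch with a hypothetical helper name, `ensure_dir`; the path in the demonstration is a throwaway temporary directory, not the corpus location used in this notebook.

```python
import os
import tempfile

def ensure_dir(path):
    """Expand a leading ~ and create the directory (and parents) if missing."""
    full_path = os.path.expanduser(path)
    # exist_ok=True makes this safe to run more than once
    os.makedirs(full_path, exist_ok=True)
    return full_path

# Demonstration in a temporary location
with tempfile.TemporaryDirectory() as tmp:
    made = ensure_dir(os.path.join(tmp, "corpora", "gutenberg2"))
    made_ok = os.path.isdir(made)
    print(made_ok)
```

In the notebook this would be `nltk_corpus_dir = ensure_dir('~/nltk_data/corpora/gutenberg2')`.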

Define the code to put your data in `nltk_corpus_dir`:

In [100]:
import re
# Many Gutenberg title tags include a digit sequence in parens.  Not needed.
# Making the pattern as specific as possible so as to not affect titles with genuine parentheses
reg_exp23 = r"(\(\d+\))"
reg_exp23_c = re.compile(reg_exp23)


def make_file_name (title):
    try:
        (start,end) = reg_exp23_c.search(title).span()
        ttn = title[:start] + title[end:]
    except AttributeError:
        # search returned None: no digit sequence in parens to remove
        ttn = title
    return '_'.join(ttn.split()) + ".txt"

def make_corpus(corpus_dir,idxs,books,titles,verbose=False):
    # Caution: this pairs books[i] with idxs[i], which is only correct if no
    # downloads failed (failed books are skipped in books but remain in idxs).
    for (i,book) in enumerate(books):
        idx = idxs[i]
        fn = make_file_name(titles[idx])
        # Write UTF-8 explicitly so the result doesn't depend on the platform default
        with open(os.path.join(corpus_dir,fn),'w',encoding='utf-8') as ofh:
            ofh.write(book)
        if verbose:
            print(f"{fn} written!")

Put your data in `nltk_corpus_dir`:

In [101]:
make_corpus(nltk_corpus_dir,idxs,books,titles,verbose=True)

Moby_Word_Lists_by_Grady_Ward.txt written!
The_Confessions_of_St._Augustine_by_Bishop_of_Hippo_Saint_Augustine.txt written!
Treasure_Island_by_Robert_Louis_Stevenson.txt written!
The_King_James_Version_of_the_Bible.txt written!
The_Yellow_Wallpaper_by_Charlotte_Perkins_Gilman.txt written!
The_giant_horse_of_Oz_by_Ruth_Plumly_Thompson.txt written!
The_Importance_of_Being_Earnest:_A_Trivial_Comedy_for_Serious_People_by_Oscar_Wilde.txt written!
Jane_Eyre:_An_Autobiography_by_Charlotte_Brontë.txt written!
Lady_Chatterley's_lover_by_D._H._Lawrence.txt written!
Middlemarch_by_George_Eliot.txt written!
The_divine_comedy_by_Dante_Alighieri.txt written!
The_War_of_the_Worlds_by_H._G._Wells.txt written!
The_Rámáyan_of_Válmíki,_translated_into_English_verse_by_Valmiki.txt written!
Thus_Spake_Zarathustra:_A_Book_for_All_and_None_by_Friedrich_Wilhelm_Nietzsche.txt written!
Peter_Pan_by_J._M._Barrie.txt written!
Pogo_Planet_by_Donald_A._Wollheim.txt written!
Walden,_and_On_The_Duty_Of_Civil_Disobedi
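
Some of the file names above contain characters such as `:` that Windows does not allow in file names; others, as the later listing shows, contain double quotes. If the corpus needs to be portable, a sketch like the following could be applied to the result of `make_file_name`. The name `safe_file_name` is hypothetical, and dropping the characters (rather than replacing them) is just one possible choice.

```python
import re

# Characters that Windows rejects in file names
UNSAFE_CHARS = re.compile(r'[<>:"/\\|?*]')

def safe_file_name(file_name):
    """Drop characters that are not portable across file systems."""
    return UNSAFE_CHARS.sub("", file_name)

print(safe_file_name("Jane_Eyre:_An_Autobiography_by_Charlotte_Brontë.txt"))
```

Legitimate non-ASCII letters such as the "ë" in "Brontë" are left alone.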

Import your spanking new corpus:

In [102]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

newcorpus = PlaintextCorpusReader(nltk_corpus_dir, '.*')

Sanity check.  There may be more books in the corpus if you do this on multiple days.

In [103]:
newcorpus.fileids()

['A_Modest_Proposal_by_Jonathan_Swift.txt',
 'A_Room_with_a_View_by_E._M._Forster.txt',
 'A_Study_in_Scarlet_by_Arthur_Conan_Doyle.txt',
 'Ang_"Filibusterismo"_(Karugtóng_ng_Noli_Me_Tangere)_by_José_Rizal.txt',
 'Carmilla_by_Joseph_Sheridan_Le_Fanu.txt',
 'Don_Quixote_by_Miguel_de_Cervantes_Saavedra.txt',
 'Dubliners_by_James_Joyce.txt',
 "Gulliver's_Travels_into_Several_Remote_Nations_of_the_World_by_Jonathan_Swift.txt",
 'Jane_Eyre:_An_Autobiography_by_Charlotte_Brontë.txt',
 "Lady_Chatterley's_lover_by_D._H._Lawrence.txt",
 'Middlemarch_by_George_Eliot.txt',
 'Moby_Word_Lists_by_Grady_Ward.txt',
 'Peter_Pan_by_J._M._Barrie.txt',
 'Pogo_Planet_by_Donald_A._Wollheim.txt',
 'Pygmalion_by_Bernard_Shaw.txt',
 'Second_Treatise_of_Government_by_John_Locke.txt',
 'Sense_and_Sensibility_by_Jane_Austen.txt',
 'The_Adventures_of_Tom_Sawyer,_Complete_by_Mark_Twain.txt',
 'The_Confessions_of_St._Augustine_by_Bishop_of_Hippo_Saint_Augustine.txt',
 'The_Count_of_Monte_Cristo_by_Alexandre_Dumas_and

Get the raw string for the third book:

In [104]:
first_book = newcorpus.fileids()[2]
first_book_str = newcorpus.raw(first_book)
first_book_str[:200]

'\n\n\n\nA STUDY IN SCARLET\n\nBy A. Conan Doyle\n\n\n\n\nCONTENTS\n\n A STUDY IN SCARLET.\n\n PART I.\n CHAPTER I. MR. SHERLOCK HOLMES.\n CHAPTER II. THE SCIENCE OF DEDUCTION.\n CHAPTER III. THE LAURISTON GARDENS MYSTE'

Sentence tokenize it.

In [105]:
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(first_book_str)

In [106]:
print(sentences[0])





A STUDY IN SCARLET

By A. Conan Doyle




CONTENTS

 A STUDY IN SCARLET.


In [107]:
print(*sentences[35:40],sep="\n\n******************\n\n")

The campaign brought honours and promotion to many, but for me it had
nothing but misfortune and disaster.

******************

I was removed from my brigade and
attached to the Berkshires, with whom I served at the fatal battle of
Maiwand.

******************

There I was struck on the shoulder by a Jezail bullet, which
shattered the bone and grazed the subclavian artery.

******************

I should have
fallen into the hands of the murderous Ghazis had it not been for the
devotion and courage shown by Murray, my orderly, who threw me across a
pack-horse, and succeeded in bringing me safely to the British lines.

******************

Worn with pain, and weak from the prolonged hardships which I had
undergone, I was removed, with a great train of wounded sufferers, to
the base hospital at Peshawar.


### Using regular expressions to search for patterns in your corpus.

For help using regular expressions for pattern searches, see [Andrew Kuchling's Regular Expressions Tutorial.](https://docs.python.org/3/howto/regex.html)

For a complete description of the Python regular expression language,
see [the very accessible Python regular expression documentation.](https://docs.python.org/3/library/re.html)

If you prefer using [a web interface](https://regex101.com/r/eY4wC6/2) to testing the regular expressions
directly in Python as in the next cell, try the link.

Also see the regular expressions notebook for this course.

Let's use regular expressions to find all sentences beginning with "Let's".

Here's a quick look at the search pattern and some test search results:

In [108]:
import re
# "Let's" at the start of the string (assuming sentence-tokenized input),
# followed by a word boundary (a nonalphanumeric character or the end of the string)
reg_exp29 = r"^Let's\b"
# Ignore case (re.I), let ^ match at line starts (re.M), let . match line breaks (re.S)
reg_exp29_c = re.compile(reg_exp29,re.I|re.M|re.S)

#########  Examples   ######################################
print(1, reg_exp29_c.search("Let's go fly a kite."))
# OK not to capitalize
print(2, reg_exp29_c.search("let's go fly a kite."))
# OK: Let's can be followed by any nonalphanumeric character
print(5, reg_exp29_c.search("Let's."))
# OK even if Let's is at the end of the string
print(5, reg_exp29_c.search("Let's"))

# Negative result.  Let's does not start the sentence.
print(3, reg_exp29_c.search("You said Let's go fly a kite."))
# Negative result.  Let's is word-internal: not followed by a nonalphanumeric character
print(4, reg_exp29_c.search("Let'sgo fly a kite."))


1 <re.Match object; span=(0, 5), match="Let's">
2 <re.Match object; span=(0, 5), match="let's">
5 <re.Match object; span=(0, 5), match="Let's">
5 <re.Match object; span=(0, 5), match="Let's">
3 None
4 None


In [109]:
import re
# Search for "Let's" sentence-initially
# With the ASCII apostrophe (the keyboard single quote, U+0027)
reg_exp29a = r"^Let's\b"
# With the typographic apostrophe (U+2019, a distinct Unicode character)
reg_exp29b = r"^Let’s\b"
reg_exp29_c = re.compile(reg_exp29b,re.I|re.M|re.S)
found = []

for sent in sent_tokenize(first_book_str):
    res = reg_exp29_c.search(sent)
    if res is not None:
        found.append(sent)

In [110]:
found

[]

Negative result.

Let's try a different pattern.

Search for "not a":

In [111]:
import re

reg_exp31 = r"not\s+a\b"
reg_exp31_c = re.compile(reg_exp31,re.I|re.M|re.S)
found = []

for sent in sent_tokenize(first_book_str):
    res = reg_exp31_c.search(sent)
    if res is not None:
        found.append(sent)

In [112]:
len(found)

12

Search for  forms of *be* followed by "therefore":

In [113]:
import re
#reg_exp31 = r"\bfall(en)?\s+into\b"
reg_exp37 = r"\b((was)|(were)|(are)|(been)|(be))\s+therefore\b"
reg_exp37_c = re.compile(reg_exp37,re.I|re.M|re.S)
found = []

for (i,sent) in enumerate(sent_tokenize(first_book_str)):
    res = reg_exp37_c.search(sent)
    if res is not None:
        found.append(sent)

In [114]:
len(found)

1

### Search through the whole saved NLTK corpus:

Using the code below assumes you have created an NLTK corpus and saved it on disk.
If you have a sequence of strings in memory (for example `books` as defined earlier in this NB),
consult the next section.

Define search function:

In [122]:
import re
import nltk

use_punkt=True

if use_punkt:
    sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle').tokenize
else:
    from nltk.tokenize import sent_tokenize

# flags is a customizable parameter, though this is a useful default.
# You may however not want to ignore case, and call this with flags=re.M|re.S
def search_nltk_corpus (corpus, pattern,flags=re.I|re.M|re.S,fileids=None):
    pattern_c = re.compile(pattern,flags)
    found = []
    # Use the corpus passed in, not the global newcorpus
    if fileids is None:
        fileids = corpus.fileids()
    for fileid in fileids:
        book_str = corpus.raw(fileid)
        for (sent_idx,sent) in enumerate(sent_tokenize(book_str)):
            res = pattern_c.search(sent)
            if res is not None:
                found.append((fileid, sent_idx, sent))
    return found

Load the created corpus:

In [116]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

newcorpus = PlaintextCorpusReader(nltk_corpus_dir, '.*')

In [123]:
newcorpus.fileids()

['A_Modest_Proposal_by_Jonathan_Swift.txt',
 'A_Room_with_a_View_by_E._M._Forster.txt',
 'A_Study_in_Scarlet_by_Arthur_Conan_Doyle.txt',
 'Ang_"Filibusterismo"_(Karugtóng_ng_Noli_Me_Tangere)_by_José_Rizal.txt',
 'Carmilla_by_Joseph_Sheridan_Le_Fanu.txt',
 'Don_Quixote_by_Miguel_de_Cervantes_Saavedra.txt',
 'Dubliners_by_James_Joyce.txt',
 "Gulliver's_Travels_into_Several_Remote_Nations_of_the_World_by_Jonathan_Swift.txt",
 'Jane_Eyre:_An_Autobiography_by_Charlotte_Brontë.txt',
 "Lady_Chatterley's_lover_by_D._H._Lawrence.txt",
 'Middlemarch_by_George_Eliot.txt',
 'Moby_Word_Lists_by_Grady_Ward.txt',
 'Peter_Pan_by_J._M._Barrie.txt',
 'Pogo_Planet_by_Donald_A._Wollheim.txt',
 'Pygmalion_by_Bernard_Shaw.txt',
 'Second_Treatise_of_Government_by_John_Locke.txt',
 'Sense_and_Sensibility_by_Jane_Austen.txt',
 'The_Adventures_of_Tom_Sawyer,_Complete_by_Mark_Twain.txt',
 'The_Confessions_of_St._Augustine_by_Bishop_of_Hippo_Saint_Augustine.txt',
 'The_Count_of_Monte_Cristo_by_Alexandre_Dumas_and

Search for forms of be followed by "therefore":

In [117]:
reg_exp37 = r"\b((was)|(were)|(are)|(been)|(be))\s+therefore\b"
found_37 = search_nltk_corpus (newcorpus, reg_exp37)

The search yielded many examples.

In [118]:
len(found_37)

49

#### The case of apostrophe (allo-characters)

This search failed before, but we tried only one book.  Now let's try the whole corpus.

In [133]:
# Keyboard apostrophe (=single quote)
reg_exp29a = r"^Let's\b"
# Unicode apostrophe
reg_exp29b = r"^Let’s\b"

# Unicode first
found_29b = search_nltk_corpus (newcorpus, reg_exp29b)

In [136]:
len(found_29b)

15

In [134]:
found_29b

[('A_Room_with_a_View_by_E._M._Forster.txt', 3879, 'Let’s tell her.'),
 ('A_Room_with_a_View_by_E._M._Forster.txt',
  3937,
  'Let’s turn\nin here.”\n\n“Here” was the British Museum.'),
 ('A_Room_with_a_View_by_E._M._Forster.txt', 3942, 'Let’s go to Mudie’s.'),
 ('A_Room_with_a_View_by_E._M._Forster.txt', 4015, 'Let’s\nall go.'),
 ('Pygmalion_by_Bernard_Shaw.txt',
  481,
  'Let’s see how fast you can make her hop it.'),
 ('Pygmalion_by_Bernard_Shaw.txt', 1345, 'Let’s give him ten.'),
 ('The_Adventures_of_Tom_Sawyer,_Complete_by_Mark_Twain.txt',
  1726,
  'Let’s us go, too, Tom.”\n\n“I won’t!'),
 ('The_Adventures_of_Tom_Sawyer,_Complete_by_Mark_Twain.txt',
  2508,
  'Let’s hide the tools in the bushes.”\n\nThe boys were there that night, about the appointed time.'),
 ('The_Adventures_of_Tom_Sawyer,_Complete_by_Mark_Twain.txt',
  2593,
  'Let’s run!”\n\n“Keep still!'),
 ('The_Adventures_of_Tom_Sawyer,_Complete_by_Mark_Twain.txt',
  3167,
  'Let’s try some other way, so as not to go\nthro

Then the ASCII apostrophe.

In [141]:
# ASCII apostrophe (keyboard single quote)
reg_exp29a = r"^Let's\b"
found_29a = search_nltk_corpus (newcorpus, reg_exp29a)

In fact, the two searches yield disjoint results:

In [142]:
len(found_29a)

10

In [143]:
set(found_29b) & set(found_29a)

set()

So two different Unicode characters can express the English apostrophe in our data: the ASCII apostrophe (U+0027, the keyboard single quote) and the typographic apostrophe (U+2019, the right single quotation mark). Therefore, to do the search, we need to look for one OR the other.

In [144]:
reg_exp29a_or_b = r"^Let('|’)s\b"
found_29a_or_b = search_nltk_corpus (newcorpus, reg_exp29a_or_b)
len(found_29a_or_b)

25
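
Two other ways of handling the pair of apostrophe characters are sketched below, as alternatives rather than corrections: a character class in the pattern, or normalizing the text before searching. The names `LETS_RE` and `normalize_apostrophes` are hypothetical, and the sketch assumes U+0027 and U+2019 are the only variants in the data.

```python
import re

# Character class: either the ASCII apostrophe or U+2019
LETS_RE = re.compile(r"^Let['\u2019]s\b", re.I | re.M | re.S)

def normalize_apostrophes(text):
    """Replace the typographic apostrophe (U+2019) with the ASCII one (U+0027)."""
    return text.replace("\u2019", "'")

for sent in ["Let's go fly a kite.", "Let\u2019s tell her.", "Lets go."]:
    print(bool(LETS_RE.search(sent)), normalize_apostrophes(sent))
```

Normalizing has the side effect of also rewriting closing single quotation marks, which share the U+2019 character, so the character class is the less invasive option.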

History: this is the search whose printed results first showed me examples of the Unicode apostrophe.

In [262]:
reg_exp29 = r"^Let\b"
found_29 = search_nltk_corpus (newcorpus, reg_exp29)

In [263]:
len(found_29)

328

#### A more complicated regular expression

Parsing the next reg exp (search for "fall into", "fell into", or "falling into" or "fallen into"):

$$
\begin{array}[t]{cccccccc}
\text{\\b}& \text{f} & \text{(a | e)} & \text{ll} & \text{((en)|(ing))?} & \text{\\s+} & \text{into}& \text{\\b}\\
(0) & (1) & (2) & (3) & (4) & (5) & (6) & (7) \\
\end{array}
$$

0.  No characters that can appear inside a word can precede the character "f".
1.  Character "f" here.
2.  Either the character "a" or the character "e" here.
3.  Characters "ll" (as in "llama") here.
4.  Optionally: Characters "en" or characters "ing" here.
5.  Arbitrary number of white space characters here, but at least one. White space characters include line breaks.
6.  Characters "into" here
7.  No characters that can appear inside a word can follow "into" so the word "into" can appear here and be followed by a space or a comma,  but not the characters "xication", as in "intoxication".

As a result, the regular expression matches the bracketed part of all of the following:

```
They may [fall into] trouble.
She is [falling into] bad habits.
The boy [fell into] a deep hole.
You have [fallen into] a trap
through [fall into] winter
```
as well as

```
I'm afraid I will [fall
into] bad habits.
```
 where "fall" and "into" appear on separate lines.  It also matches
 
 ```
 [fellen into]
 [felling into]
 ```
 
 It matches neither of the following (the failed match is in parentheses):
 
 ```
 she felt none of that summer sadness or (fall into)xication
 This sudden turn of events threw one of those men few tragedies be(fall into) a deep depression.
 ```
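
The numbered breakdown above can be written directly into the pattern using Python's verbose mode (`re.X`), which ignores whitespace in the pattern and allows `#` comments. This is a sketch for checking the examples; the name `FALL_INTO` and the test sentences are chosen here for illustration.

```python
import re

FALL_INTO = re.compile(r"""
    \b              # (0) word boundary: no word character before "f"
    f               # (1)
    (a|e)           # (2) "a" or "e"
    ll              # (3)
    ((en)|(ing))?   # (4) optional "en" or "ing"
    \s+             # (5) one or more white space characters, line breaks included
    into            # (6)
    \b              # (7) word boundary: no word character after "into"
""", re.I | re.X)

should_match = [
    "They may fall into trouble.",
    "She is falling into bad habits.",
    "The boy fell into a deep hole.",
    "You have fallen into a trap",
    "I'm afraid I will fall\ninto bad habits.",
    "felling into",
]
should_not_match = [
    "that summer sadness or fall intoxication",
    "few tragedies befall into a deep depression",
]

for sent in should_match + should_not_match:
    print(bool(FALL_INTO.search(sent)), repr(sent))
```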
 
 

In [146]:
# Allow fall into, fell into, fallen into, falling into
reg_exp31 = r"\bf(a|e)ll((en)|(ing))?\s+into\b"
found_31 = search_nltk_corpus (newcorpus, reg_exp31)
len(found_31)

231

In [147]:
found_31[:10]

[('A_Room_with_a_View_by_E._M._Forster.txt',
  2657,
  'The bank broke away, and he fell into the pool before he had weighed\nthe question properly.'),
 ('A_Room_with_a_View_by_E._M._Forster.txt',
  2996,
  'I fell into\nall those violets, and he was silly and surprised.'),
 ('A_Room_with_a_View_by_E._M._Forster.txt',
  3256,
  '“Also\nthat men fall into two classes—those who forget views and those who\nremember them, even in small rooms.”\n\n“Mr.'),
 ('A_Study_in_Scarlet_by_Arthur_Conan_Doyle.txt',
  38,
  'I should have\nfallen into the hands of the murderous Ghazis had it not been for the\ndevotion and courage shown by Murray, my orderly, who threw me across a\npack-horse, and succeeded in bringing me safely to the British lines.'),
 ('Carmilla_by_Joseph_Sheridan_Le_Fanu.txt',
  989,
  'Very late, she said, she had got to the\nhousekeeper’s bedroom in despair of finding us, and had then fallen\ninto a deep sleep which, long as it was, had hardly sufficed to recruit\nher strength aft

### Search through a list of doc strings:

This code will get you through the next two sections.

In [263]:
from nltk.tokenize import sent_tokenize
import re
import urllib.request

def get_book (ind):
    url = f"https://gutenberg.org/cache/epub/{ind}/pg{ind}.txt"
    with urllib.request.urlopen(url) as stream:
        byte_str = stream.read()
        return byte_str.decode("UTF8").replace("\r\n","\n")

def search_doc(book_str, pattern_c,found=None,book_id='',count_hits=False,flags=re.I|re.M|re.S):
    # These globals are left over from debugging; they expose the state of
    # the most recent search for inspection in the notebook.
    global findall_sents,matches,match_obj,sent0,sentx
    if found is None:
        found = []
    # Accept either a compiled pattern or a pattern string
    if isinstance(pattern_c, str):
        pattern_c = re.compile(pattern_c,flags)
    count,findall_sents,matches = 0,[],[]
    for (sent_idx,sent) in enumerate(sent_tokenize(book_str)):
        sentx=sent
        res = pattern_c.search(sent)
        if res is None:
            continue
        if count_hits:
            findalls=pattern_c.findall(sent)
            this_count = len(findalls)
            count += this_count
            findall_sents.extend(findalls)
            # Collect each hit with up to 5 characters of context on either side
            sent0 = sent
            for i in range(this_count):
                match_obj = pattern_c.search(sent0)
                if match_obj is None:
                    break
                start = max(match_obj.start() - 5, 0)
                extracted = sent0[start:match_obj.end()+5]
                if extracted:
                    matches.append(extracted)
                # Continue searching in the remainder of the sentence
                sent0 = sent0[match_obj.end()+1:]
        found.append((book_id, sent_idx, sent))
    if count_hits:
        return (count,found)
    else:
        return found
        
    
def search_doc_strings (book_strings, pattern):
    pattern_c = re.compile(pattern, re.I|re.M|re.S)
    found = []
    for (book_id,book_str) in enumerate(book_strings):
        search_doc(book_str, pattern_c,found=found,book_id=book_id)
    return found

To search through a list of strings, do the following.

If you have been following along, `books` should be a list of document strings (one string per book) defined earlier.

In [149]:
reg_exp37 = r"\b((was)|(were)|(are)|(been)|(be))\s+therefore\b"
found_37 = search_doc_strings(books, reg_exp37)

In [150]:
len(found_37)

23

###  Searching an arbitrary book on Gutenberg.org

When looking up the index of a book, make sure it is not the index of the audio book.  Many books appear on
Gutenberg in both print and audio form, and they have different indexes.

In [152]:
#frankenstein_idx = 84
reg_exp37 = r"\b((was)|(were)|(are)|(been)|(be))\s+therefore\b"
text2= get_book(84)
found = search_doc(text2, reg_exp37, book_id='Frankenstein')

In [153]:
found

[('Frankenstein',
  202,
  'You have been tutored and\nrefined by books and retirement from the world, and you are therefore\nsomewhat fastidious; but this only renders you the more fit to\nappreciate the extraordinary merits of this wonderful man.'),
 ('Frankenstein',
  384,
  'My\ndeparture was therefore fixed at an early date, but before the day\nresolved upon could arrive, the first misfortune of my life\noccurred—an omen, as it were, of my future misery.')]

When Sherlock Holmes is referred to, how often is his first name used?

In [154]:
# Search A Study in Scarlet; idx is 244
reg_exp43 = r"\bSherlock\b"
text3 = get_book(244)
found_43 = search_doc(text3, reg_exp43, book_id='A Study in Scarlet')

In [155]:
len(found_43)

52

How often is his last name used?

In [156]:
# Search A Study in Scarlet; idx is 244
reg_exp47 = r"\bHolmes\b"
# Uncomment if needed
#text3 = get_book(244)
found_47 = search_doc(text3, reg_exp47, book_id='A Study in Scarlet')

In [157]:
len(found_47)

97
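
Note that `len(found_47)` counts matching sentences, not occurrences: `search_doc` appends a sentence once no matter how many times the pattern matches inside it (passing `count_hits=True` returns the occurrence count as well). The difference can be seen in a small self-contained sketch, where the made-up list `toy_sents` stands in for the tokenized book:

```python
import re

# Made-up sentences for illustration only (not taken from the book).
toy_sents = [
    "Holmes smiled.",
    "I looked at Holmes, and Holmes looked at me.",
    "Nobody spoke.",
]
pat = re.compile(r"\bHolmes\b")

# Number of sentences with at least one match (what len(found) measures).
n_sentences = sum(1 for s in toy_sents if pat.search(s))
# Total number of matches across all sentences.
n_occurrences = sum(len(pat.findall(s)) for s in toy_sents)
print(n_sentences, n_occurrences)  # 2 3
```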

In [158]:
found_47[:10]

[('A Study in Scarlet', 5, 'CHAPTER I. MR. SHERLOCK HOLMES.'),
 ('A Study in Scarlet',
  32,
  '(_Being a reprint from the Reminiscences of_ JOHN H. WATSON, M.D.,\n_Late of the Army Medical Department._)\n\n\n\n\nCHAPTER I.\nMR. SHERLOCK HOLMES.'),
 ('A Study in Scarlet',
  63,
  '“You\ndon’t know Sherlock Holmes yet,” he said; “perhaps you would not care\nfor him as a constant companion.”\n\n“Why, what is there against him?”\n\n“Oh, I didn’t say there was anything against him.'),
 ('A Study in Scarlet',
  82,
  '“Holmes is a little too scientific for my tastes—it approaches\nto cold-bloodedness.'),
 ('A Study in Scarlet',
  100,
  'Watson, Mr. Sherlock Holmes,” said Stamford, introducing us.'),
 ('A Study in Scarlet',
  130,
  'Now we have the\nSherlock Holmes’ test, and there will no longer be any difficulty.”\n\nHis eyes fairly glittered as he spoke, and he put his hand over his\nheart and bowed as if to some applauding crowd conjured up by his\nimagination.'),
 ('A Study in Scarlet

Note that the previous two counts overlap, because some sentences use the full name
"Sherlock Holmes". A more difficult task is to write a regular expression that finds
uses of "Sherlock" that are not followed by "Holmes".
This involves **negative lookahead**.
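
To see how a negative lookahead behaves, here is a minimal sketch on a few made-up strings. The strings in `samples` are hypothetical; the pattern is the one used in this section.

```python
import re

# (?!...) succeeds only if the text at this position does NOT match "...";
# it consumes no characters.
pat = re.compile(r"\bSherlock(?!\s+Holmes\b)")

samples = [
    "Mr. Sherlock Holmes was in.",   # followed by "Holmes": no match
    "Sherlock\nHolmes",              # \s+ also spans a newline: no match
    "Come in, Sherlock!",            # not followed by "Holmes": match
]
results = [bool(pat.search(s)) for s in samples]
print(results)  # [False, False, True]
```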

In [159]:
# Sherlock not followed by " Holmes"
reg_exp53 = r"\bSherlock(?!\s+Holmes\b)"
found_53 = search_doc(text3, reg_exp53, book_id='A Study in Scarlet')

In [160]:
found_53

[]

This confirms what every Sherlock Holmes fan knows: Watson never calls Sherlock Holmes "Sherlock".
It's always "Holmes, how the devil do you know that?" or "It can't be him, Holmes!"  In fact,
Watson rarely does even that.  Suppose we search for occurrences of "Holmes" inside quotation
marks: a left quotation mark, followed by one or more characters that are not a right quotation
mark, followed by "Holmes" when it is not preceded by "Sherlock" (which requires a
**negative lookbehind**). We get only one genuine hit:

In [161]:
reg_exp57 = r"“[^”]+(?<!Sherlock)\s+Holmes"
search_doc(text3, reg_exp57, book_id='A Study in Scarlet')

[('A Study in Scarlet',
  564,
  '“There is nothing like first hand evidence,” he remarked; “as a matter\nof fact, my mind is entirely made up upon the case, but still we may as\nwell learn all that is to be learned.”\n\n“You amaze me, Holmes,” said I.')]
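
The lookbehind part of that pattern can be tried in isolation. This is a minimal sketch on made-up strings (the `samples` list is hypothetical). The third sample shows a caveat: the lookbehind is checked at the position where `\s+` starts, so extra whitespace between the two names lets the match through.

```python
import re

# (?<!...) succeeds only if the text just before this position is NOT "...".
# Python's re module requires the lookbehind pattern to have a fixed width.
pat = re.compile(r"(?<!Sherlock)\s+Holmes\b")

samples = [
    "You amaze me, Holmes",    # preceded by "me,": match
    "Mr. Sherlock Holmes",     # preceded by "Sherlock": no match
    "Mr. Sherlock  Holmes",    # two spaces: \s+ can start at the second one, so it matches
]
results = [bool(pat.search(s)) for s in samples]
print(results)  # [True, False, True]
```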

###  Looking at the occurrences of a taboo word in Huck Finn

In [264]:
idx=76
# Optional "s": match the plural as well
reg_exp_43 = r"niggers?"

# The \b version below has a few false negatives: instances of "_<PAT>_" are not
# matched, because `_` counts as a word character and so `\b` fails next to it.
#reg_exp_43a = r"\bniggers?\b"
#count_43a,found_43a = search_doc(text_43, reg_exp_43a, book_id='Huckleberry Finn',count_hits=True)

# Uncomment if needed
#text_43 = get_book(idx)
count_43,found_43 = search_doc(text_43, reg_exp_43, book_id='Huckleberry Finn',count_hits=True)

In [265]:
count_43

214

Our `count_43` just about agrees with the N-word count of 219 cited in [the NYT book review of Percival Everett's new book *James*](https://www.nytimes.com/2024/03/11/books/review/percival-everett-james.html),
which is *Huckleberry Finn* told from Jim's point of view.  We can't explain the discrepancy.
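
As the code comment in the cell notes, `\b` misses words wrapped in underscores, because `_` is itself a word character. The sketch below illustrates this on a made-up string, using *raft* as the search word. The letter-only lookarounds are one possible workaround and an assumption of this sketch, not something used elsewhere in the notebook.

```python
import re

sample = "the raft, a _raft_, and rafts"

# "_" is a word character, so there is no \b between "_" and "r".
strict = re.findall(r"\braft\b", sample)
# One possible alternative: forbid only letters on either side.
loose = re.findall(r"(?<![A-Za-z])raft(?![A-Za-z])", sample)
print(len(strict), len(loose))  # 1 2
```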

Sanity check:  the set of words matched with the reg exp:

In [266]:
set(findall_sents)

{'Nigger', 'Niggers', 'nigger', 'niggers'}

So the n-word occurs quite often: nearly twice as many times as the word *raft*, which is very important
in this story about a trip down the Mississippi.

In [267]:
reg_exp_59 = r"\braft\b"
count_59,found_59 = search_doc(text_43, reg_exp_59, book_id='Huckleberry Finn',count_hits=True)

In [268]:
count_59

118

#### Collecting the result of multiple searches

You can append the results of a search to previous results by passing the earlier list as `found`.
Just be sure to give `book_id` a useful value, so you know which book yielded which results.

In [305]:
# Search E.M. Forster's A Room with a View; idx is 2641
reg_exp47 = r"\bfashionable\s+world\b"
text4 = get_book(2641)
print(len(found))
search_doc(text4, reg_exp47, book_id='A Room with a View', found=found)
print(len(found))

52
54
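
Once hits from several books share one list, `book_id` is what lets you separate them again. A minimal sketch, assuming a hand-made list `all_hits` in the same `(book_id, sent_idx, sent)` format; the sentences are shortened, made-up stand-ins:

```python
from collections import defaultdict

# Hypothetical accumulated hits from several searches; sentences are made up.
all_hits = [
    ("Frankenstein", 202, "You are therefore somewhat fastidious."),
    ("Frankenstein", 384, "My departure was therefore fixed."),
    ("A Room with a View", 10, "The fashionable world was watching."),
]

# Group the (sent_idx, sent) pairs under the book they came from.
by_book = defaultdict(list)
for book_id, sent_idx, sent in all_hits:
    by_book[book_id].append((sent_idx, sent))

for book_id, hits in by_book.items():
    print(book_id, len(hits))
```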


### Twitter texts

In [1]:
with open('chat_corp/chat_corpus-master/twitter_en.txt', encoding='utf-8') as fh:
    text = fh.read()

Finding emoji

In [6]:
# Count each of the 80 code points starting at U+1F600 (the Emoticons block).
for code in range(0x1f600, 0x1f600+80):
    char = chr(code)
    #print(f"{code:08x} {char} {unicodedata.name(char)}")  # requires `import unicodedata`
    n_found = text.count(char)
    if n_found:
        print(char, n_found)

😀 804
😁 1136
😂 24225
😃 430
😄 620
😅 993
😆 539
😇 413
😈 687
😉 1938
😊 2872
😋 433
😌 375
😍 3783
😎 1314
😏 903
😐 469
😑 537
😒 974
😓 177
😔 620
😕 556
😖 206
😗 20
😘 2535
😙 98
😚 173
😛 605
😜 1045
😝 404
😞 501
😟 115
😠 205
😡 739
😢 814
😣 182
😤 444
😥 224
😦 48
😧 79
😨 161
😩 2551
😪 306
😫 573
😬 705
😭 7657
😮 120
😯 67
😰 182
😱 615
😲 84
😳 1165
😴 441
😵 119
😶 122
😷 288
😸 88
😹 319
😺 30
😻 367
😼 32
😽 51
😾 27
😿 76
🙀 260
🙁 164
🙂 550
🙃 1574
🙄 2875
🙅 134
🙆 88
🙇 72
🙈 527
🙉 51
🙊 195
🙋 212
🙌 1574
🙍 18
🙎 6
🙏 1750
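
The same 80 code points can also be written as a single regular-expression character class, which makes it possible to count them in one pass and sort by frequency. A minimal sketch, using a short made-up string `sample_text` in place of the Twitter corpus:

```python
import re
from collections import Counter

# Made-up sample text; in the notebook this would be the Twitter corpus.
sample_text = "so funny 😂😂 thanks 😊 lol 😂"

# One character class covering U+1F600 through U+1F64F (80 code points).
emoji_pat = re.compile("[\U0001F600-\U0001F64F]")
emoji_counts = Counter(emoji_pat.findall(sample_text))

# Most frequent emoji first.
for char, n in emoji_counts.most_common():
    print(char, n)
```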


In [2]:
type(text)

str