## Extracting Names:  Try #1

Let's try to extract all proper names from a large body of text.  As a first approximation, use the string method `istitle`, which tells us, by returning `True` or `False`, whether a string is capitalized (an uppercase first letter followed by lowercase letters) or not:

In [None]:
print('Fred'.istitle())
print('tablespoon'.istitle())

True
False
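
It is worth probing `istitle` a little further before relying on it.  The sketch below uses a handful of sample strings chosen only for illustration (they are not part of the original notebook) and shows that attached punctuation does not stop a word from counting as capitalized, while an internal capital does:

```python
# str.istitle() is True when every run of letters starts with an uppercase
# letter and continues in lowercase; non-letter characters only separate the runs.
samples = ['Fred', 'tablespoon', 'Jane,', 'I', 'USA', 'McDonald', '1.E.1.']
results = {s: s.istitle() for s in samples}
for s, flag in results.items():
    print(repr(s), flag)
```

Here `'Jane,'`, `'I'` and `'1.E.1.'` all come out `True`, while `'USA'` and `'McDonald'` come out `False`.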


We need to open a file, loop through every line, loop through every word in the line, and collect the ones that are capitalized into some sort of container.  Let's assume we don't want to know how many times a name occurs, just that it occurs.  So we want a container that can't contain duplicates.  To that end we want a `set`.  As our example text, let's use Jane Austen's *Pride and Prejudice*, downloaded for free from Project Gutenberg.
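
As a quick illustration of why a `set` fits here, this minimal sketch (with a made-up word list) adds the same word several times and ends up with one copy of it:

```python
names = set()
for word in ['Jane', 'Darcy', 'Jane', 'Jane']:
    names.add(word)      # adding an element that is already present does nothing
print(len(names))        # 2
print('Jane' in names)   # True
```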

Get Jane Austen's novel *Pride and Prejudice* from Project Gutenberg:

In [15]:
import urllib.request

url = "https://www.gutenberg.org/files/1342/1342-0.txt"
with urllib.request.urlopen(url) as req:
    # Web text gets downloaded as raw bytes.  Need an encoding to make it a string
    text = req.read().decode("utf-8")

cap_words = set()
## Loop through each line and in each line loop
## through each word
for line in text.splitlines():
    # Now split into words
    line_list = line.split()
    for word in line_list:
        if word.istitle():
            cap_words.add(word)
print(cap_words)


{'“Well,”', 'Mary?', 'Carter,', 'Kitty,--', 'J.', 'Kitty;', 'F.;', 'Lydia,”', 'Pray,', 'Spectator,”', '“As', 'Hill,', 'Lucases', '“Certainly,', '“_My_', 'Happiness', '“Lizzy', '_Monday,', '1.E.1.', 'Michael', 'Lakes.”', '_Chap', 'Words', '“Mrs.', 'Collins.', 'In', '“Hearing', 'Fitzwilliam,', 'There', '“Read', '“Pray', 'Does', 'Convinced', 'Sally', '“_I_', 'Jane.', 'Updated', 'William,', 'Darcy!”', 'Vain', 'Argemone', 'Fixed', 'Sir,', '1.F.5.', 'Jane,', 'God', 'Attention,', 'Other', 'Wednesday.', 'Burney,', 'Charlotte.', '“Ay,', 'Why,', 'Longbourn.”', 'Ring', 'Feb', 'Gardiner;', 'Must', 'There,', 'Younger', 'Adieu', 'Catherine,)', 'Two', 'Bennet,', '“Upon', 'However', 'Scott', 'Take', 'Teasing,', 'Saturday:', 'Wickham!', '“Implacable', '“No--I', '“Their', 'Fielding,', 'This', 'For,', 'Bingley:', 'Hursts', 'Release', 'Lodge.', 'Some', '“Why', 'Younge', 'Five', 'Breakfast', 'No', 'Bingley,”', '“Was', 'Sections', 'But', 'Fairfax', 'Whitman', 'Lakes.', '“What', 'Great', 'Monday,', 'Party', 

For the moment, let's ignore the flaws in what we've done (many, if not most, of the things collected are not proper names).  Suppose we want to do the same thing to another Jane Austen novel, *Emma*.  Here's code for that:

In [20]:
url2 = "https://www.gutenberg.org/cache/epub/158/pg158.txt"
with urllib.request.urlopen(url2) as req:
    # Web text gets downloaded as raw bytes.  Need an encoding to make it a string
    text2 = req.read().decode("utf-8")

cap_words2 = set()
## Loop through each line and in each line loop
## through each word
for line in text2.splitlines():
    line_list = line.split()
    for word in line_list:
        if word.istitle():
            cap_words2.add(word)
print(cap_words2)

{'“Well,”', '“Ill,', 'Wakefield.', '“Me!”', 'Richardson,', '“Nonsense!', 'Highbury.', 'Walk', 'Kings', '“Emma,”', 'Philip', 'Happily,', '‘No’', 'Pray,', 'Randalls.—But', '“As', 'Gratifying,', 'Richard?—I', 'Hill,', '“Certainly,', '“_My_', 'Square,', '1.E.1.', 'Welch', '‘But,’', 'Michael', 'Nature', '“Mrs.', 'In', 'There', '“Read', 'Serle', '“Pray', 'Perry?—Has', 'John,', 'Does', 'Men', '“Emma!”', 'Donwell?—He', '“_I_', 'Jane.', 'Taylor,', 'Updated', 'William,', 'Picture,', 'Sir,', '1.F.5.', 'Jane,', 'God', 'Other', 'Small', '‘Not', '“—Mrs.', '“Very.”', 'Dixon.', 'Must', 'There,', 'Weston!—Astonished', '“Low,', 'Two', '“‘Smallridge!’—What', '“Upon', 'Maple', 'Take', 'Bateses,', 'Goddard.', 'Barnes,', 'Hall;', 'Perrys,', 'Windsor?”', 'This', 'Perry,”', 'Release', 'Some', 'Sucklings,', 'Knightley!—I', '“Why', 'Five', 'Braithwaites,', 'Frank.', '“Till', 'Eltons?—Here', 'No', '“Better', '“Was', 'Bella,', 'Sections', 'But', '“No—Mrs.', 'Fairfax', 'August,', '“What', 'Bragge,', 'Since', 'Adel

So now we have two code cells above which are practically identical.  But what I want to focus on here is not the redundancy.  After all, the way I created the second cell was by cutting, pasting, and doing a small edit.  Not by laboriously retyping all the same things.  What I want to focus on is what it's going to be like to read this notebook three months from now.  We're going to see two cells that look nearly identical, and it's going to take some work to see where they differ.  That's unnecessary.  There are  really three complaints about the copy and paste method of reusing code (which you will do a lot):

  1.  There's no explicit, easy-to-read statement of what's different in the two uses.
  2.  Closely related, but not identical: there's no explicit statement of what the code *depends on*,
      that is, of what names have to be defined in order for the code to work.  This makes it hard to fit the
      code into a large data pipeline (which is another thing you'll be doing).
  3.  There isn't any explicit, easy-to-read statement of what the code **does**.  In a very practical sense
      it inputs a URL and outputs a set of capitalized words.  Now those facts happen to be related to the
      first and last lines of the code cell, so they're not hard to figure out, but in many cases the code flow
      will be more complicated and it won't be so easy to read the purpose off the order in the cell.
      
I made the first issue harder than it needed to be because I renamed more than one variable when I copied
and pasted the code.

```python
url
text
cap_words
```

So there were three differences to find.  I could certainly have gotten the same
basic functionality by just changing `url`, but what if I wanted to keep both sets
`cap_words` around for comparison?  What if I had other things I wanted to do 
to `text` and `text2`?


But suppose we revised the code so that it only needed the name `url` to be defined to work properly.  That is easy enough to see here, since I put it right at the beginning of the cell, but it could just as easily have been defined several cells earlier, if I'd needed that URL for other purposes.  Then it would take some careful reading to see that the cell needs the name `url` to be defined.  This is what I meant by problem 2 above: the dependency on that name is implicit, and has to be dug out by doing work.  So both complaints fall under the broad heading of *code readability*.  But we can go a bit further.  Both complaints 1 and 2 also go to code **reusability**.  In order to cut and paste the code to use it on a third file, I have to have these kinds of issues settled.  Moreover, the reusability issues get worse as the context dependency of the code gets worse.  Suppose the above code depended both on a file name and a website URL, and that I generally (maybe not always) changed both at the same time.  That's not explicit either, and that kind of implicit dependency is responsible for many, many bugs when code is reused.

## Functions are the solution

The solution is to encapsulate what you've done as a function definition.   This solves all three of our problems:

In [None]:
import urllib.request


def find_cap_words(url):
    with urllib.request.urlopen(url) as req:
        # Web text gets downloaded as raw bytes.  Need an encoding to make it a string
        text = req.read().decode("utf-8")
    cap_words = set()
    ## Loop through each line and in each line loop
    ## through each word
    for line in text.splitlines():
        line_list = line.split()
        for word in line_list:
            if word.istitle():
                cap_words.add(word)
    return cap_words

`url` is our input, `cap_words` is our output (what is **returned**).  What's different from use to use is the exact value of `url`.  What the code depends on is also the exact value of `url`.  Moreover suppose we want there to be another optional dependency (a website).  Now the code would look like this:

In [21]:
import os.path
def find_cap_words(filename, site = None):
    if site is not None:
        # Join with "/" (os.path.join would use "\" on Windows)
        filename = site.rstrip("/") + "/" + filename
    with urllib.request.urlopen(filename) as req:
        # Web text gets downloaded as raw bytes.  Need an encoding to make it a string
        text = req.read().decode("utf-8")
    cap_words = set()
    for line in text.splitlines():
        line_list = line.split()
        for word in line_list:
            if word.istitle():
                cap_words.add(word)
    return cap_words

The names in parentheses after the function name are called the **parameters** or **arguments** of the function.  Notice that in this case only one of the parameters is obligatory (there is no `= value` after `filename`, so no default value is supplied).  In the case of the `site` parameter in `find_cap_words`, there was a fairly natural answer as to what to do when `site` wasn't supplied: just assume `filename` includes the site URL.  But if `filename` isn't supplied, where are we to look?  There's no natural default, so we make that parameter obligatory.  The decision as to whether a parameter is optional or obligatory is an important thing to think about when defining a function.
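
To see the difference between an optional and an obligatory parameter without touching the network, here is a minimal sketch.  The helper `build_url` is hypothetical (it is not part of the notebook); it only mimics the way `find_cap_words` combines `site` and `filename`:

```python
def build_url(filename, site=None):
    """Hypothetical helper: combine an optional site with a filename."""
    if site is not None:
        return site.rstrip("/") + "/" + filename
    return filename

# The optional parameter can be left out ...
full = build_url("https://www.gutenberg.org/files/1342/1342-0.txt")
# ... or supplied by keyword.
joined = build_url("files/1342/1342-0.txt", site="https://www.gutenberg.org")
print(full == joined)

# The obligatory parameter cannot be left out.
try:
    build_url()
except TypeError as err:
    print("TypeError:", err)
```

Leaving out `site` is fine because a default exists; leaving out `filename` raises a `TypeError` at the call, which is Python's way of enforcing the obligatory parameter.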

Demonstrating `find_cap_words`:

In [22]:
url = "https://www.gutenberg.org/files/1342/1342-0.txt"
cap_words = find_cap_words(url)
print(len(cap_words), list(cap_words)[:100])

1382 ['“Well,”', 'Mary?', 'Carter,', 'Kitty,--', 'J.', 'Kitty;', 'F.;', 'Lydia,”', 'Pray,', 'Spectator,”', '“As', 'Hill,', 'Lucases', '“Certainly,', '“_My_', 'Happiness', '“Lizzy', '_Monday,', '1.E.1.', 'Michael', 'Lakes.”', '_Chap', 'Words', '“Mrs.', 'Collins.', 'In', '“Hearing', 'Fitzwilliam,', 'There', '“Read', '“Pray', 'Does', 'Convinced', 'Sally', '“_I_', 'Jane.', 'Updated', 'William,', 'Darcy!”', 'Vain', 'Argemone', 'Fixed', 'Sir,', '1.F.5.', 'Jane,', 'God', 'Attention,', 'Other', 'Wednesday.', 'Burney,', 'Charlotte.', '“Ay,', 'Why,', 'Longbourn.”', 'Ring', 'Feb', 'Gardiner;', 'Must', 'There,', 'Younger', 'Adieu', 'Catherine,)', 'Two', 'Bennet,', '“Upon', 'However', 'Scott', 'Take', 'Teasing,', 'Saturday:', 'Wickham!', '“Implacable', '“No--I', '“Their', 'Fielding,', 'This', 'For,', 'Bingley:', 'Hursts', 'Release', 'Lodge.', 'Some', '“Why', 'Younge', 'Five', 'Breakfast', 'No', 'Bingley,”', '“Was', 'Sections', 'But', 'Fairfax', 'Whitman', 'Lakes.', '“What', 'Great', 'Monday,', 'Par

And here we are using it on another file.  See?  Quite easy to see the differences.

In [23]:
url2 = "https://www.gutenberg.org/cache/epub/158/pg158.txt"
cap_words = find_cap_words(url2)
print(len(cap_words), list(cap_words)[:100])

1592 ['“Well,”', '“Ill,', 'Wakefield.', '“Me!”', 'Richardson,', '“Nonsense!', 'Highbury.', 'Walk', 'Kings', '“Emma,”', 'Philip', 'Happily,', '‘No’', 'Pray,', 'Randalls.—But', '“As', 'Gratifying,', 'Richard?—I', 'Hill,', '“Certainly,', '“_My_', 'Square,', '1.E.1.', 'Welch', '‘But,’', 'Michael', 'Nature', '“Mrs.', 'In', 'There', '“Read', 'Serle', '“Pray', 'Perry?—Has', 'John,', 'Does', 'Men', '“Emma!”', 'Donwell?—He', '“_I_', 'Jane.', 'Taylor,', 'Updated', 'William,', 'Picture,', 'Sir,', '1.F.5.', 'Jane,', 'God', 'Other', 'Small', '‘Not', '“—Mrs.', '“Very.”', 'Dixon.', 'Must', 'There,', 'Weston!—Astonished', '“Low,', 'Two', '“‘Smallridge!’—What', '“Upon', 'Maple', 'Take', 'Bateses,', 'Goddard.', 'Barnes,', 'Hall;', 'Perrys,', 'Windsor?”', 'This', 'Perry,”', 'Release', 'Some', 'Sucklings,', 'Knightley!—I', '“Why', 'Five', 'Braithwaites,', 'Frank.', '“Till', 'Eltons?—Here', 'No', '“Better', '“Was', 'Bella,', 'Sections', 'But', '“No—Mrs.', 'Fairfax', 'August,', '“What', 'Bragge,', 'Since', 

## Takeaways

So here's the new plan, now that we know about functions.  Write the code once, encapsulated as a function.
Reuse it as many times as you like, varying the **arguments** of the function (the part that changes from use to use, the information the code explicitly depends on): in this case what changes is the URL passed as `filename`.  In one case it's the Project Gutenberg URL for *Pride and Prejudice*.  In another it's the URL for *Emma*.

## Parameters and Return values (input/output)

We tackle the problem of breaking `find_cap_words` into two more natural pieces, each
a function, increasing the re-usability and (hopefully) the readability.

Eyeballing `find_cap_words` as currently written, it really does two things
in sequence, easily visible in the two biggest blocks of code:

In [None]:
def find_cap_words(filename, site = None):
    if site is not None:
        # Join with "/" (os.path.join would use "\" on Windows)
        filename = site.rstrip("/") + "/" + filename
    with urllib.request.urlopen(filename) as req:
        # Web text gets downloaded as raw bytes.  Need an encoding to make it a string
        text = req.read().decode("utf-8")
    cap_words = set()
    for line in text.splitlines():
        line_list = line.split()
        for word in line_list:
            if word.istitle():
                cap_words.add(word)
    return cap_words

It downloads a file from the web and decodes it, then extracts the capitalized words.  Let's write two functions,
`download_text_file` and `extract_caps`.

Here are the first drafts of those functions.  The parameters of the functions have been left out.
It's your job to:

1.  Add the parameters (arguments) to the functions, deciding which should be optional.
2.  Determine what each function should return.
3.  Write some code demonstrating how the functions should be called
    to download *Pride and Prejudice* and *Emma*.
    
Here's something to bear in mind as you edit the functions:  You want them to be usable beyond just
these two cases, and not every text file on the web will be encoded in UTF-8 (although
most will).
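
To make the encoding point concrete, here is a small sketch that needs no download.  The byte string is built by hand to stand in for what a server using Latin-1 might send; the word chosen is only an example:

```python
raw = "Brontë".encode("latin-1")   # bytes as a Latin-1 server might send them

decoded = raw.decode("latin-1")    # the right encoding recovers the text
print(decoded)

try:
    raw.decode("utf-8")            # the wrong encoding fails on the accented letter
    utf8_failed = False
except UnicodeDecodeError as err:
    utf8_failed = True
    print("UTF-8 decoding failed:", err)
```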

Some blank code cells have been provided for your answers.  Reasonable answers have been supplied
a few more cells down.  Be aware that this problem has more than one answer.

In [None]:
def download_text_file():
    if site is not None:
        # Join with "/" (os.path.join would use "\" on Windows)
        filename = site.rstrip("/") + "/" + filename
    with urllib.request.urlopen(filename) as req:
        # Web text gets downloaded as raw bytes.  Need an encoding to make it a string
        text = req.read().decode("utf-8")

def extract_caps ():
    cap_words = set()
    for line in text.splitlines():
        line_list = line.split()
        for word in line_list:
            if word.istitle():
                cap_words.add(word)

In [39]:
#### A reasonable answer to the question
import os.path

def download_text_file(filename, site = None, encoding="utf8"):
    if site is not None:
        # Join with "/" (os.path.join would use "\" on Windows)
        filename = site.rstrip("/") + "/" + filename
    with urllib.request.urlopen(filename) as req:
        # Web text gets downloaded as raw bytes.  Need an encoding to make it a string
        return req.read().decode(encoding)

def extract_caps (text):
    cap_words = set()
    for line in text.splitlines():
        line_list = line.split()
        for word in line_list:
            if word.istitle():
                cap_words.add(word)
    return cap_words


In [40]:
site = "https://www.gutenberg.org"

#  Run pipeline once
filename1 = "files/1342/1342-0.txt"
text1 = download_text_file(filename1, site = site)
caps1 = extract_caps (text1)

#  Run pipeline a second time
filename2 = "cache/epub/158/pg158.txt"
text2 = download_text_file(filename2, site = site)
caps2 = extract_caps (text2)

So now we have the above natural sequence of two commands whenever we get a new URL to extract information from.

1. download_text_file
2. extract_caps


This, like many of the pieces of code we will write in this course, is a **pipeline**.  We take in one piece of data, a filename, use it to extract some information, a text string, and pass that on for further processing.  A simple pipeline has the following structure:

```
output_1 = Function_1(input)
output_2 = Function_2(output_1, ... [other parameters])
output_3 = Function_3(output_2, ... [other parameters])
...
Function_n(output_{n-1}, ... [other parameters])
```

We are assuming here that Function_n is a function that saves the data to disk, so there is no output_n in the program.  In a simple variant, there is an output_n, because the pipeline is part of some larger program, which will do further processing on output_n.

Our simple pipeline has only two functions.  
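
The general shape can be sketched with a three-stage pipeline that runs offline.  This is only an illustration: `get_text` is a hypothetical stand-in for the download step (it returns a made-up sentence), and `save_words` and the output file name are likewise assumptions, added to show a final stage that writes to disk and returns nothing:

```python
import os
import tempfile

def get_text():
    """Stage 1 (stand-in): return text without touching the network."""
    return "Mr. Darcy bowed. Elizabeth smiled at Jane and Mr. Bingley."

def extract_caps(text):
    """Stage 2: collect the title-cased words, as in the notebook."""
    cap_words = set()
    for line in text.splitlines():
        for word in line.split():
            if word.istitle():
                cap_words.add(word)
    return cap_words

def save_words(words, path):
    """Stage 3: write the words to disk, one per line; returns nothing."""
    with open(path, "w", encoding="utf-8") as handle:
        for word in sorted(words):
            handle.write(word + "\n")

# Each stage consumes the output of the previous one.
text = get_text()
caps = extract_caps(text)
out_path = os.path.join(tempfile.gettempdir(), "caps_demo.txt")
save_words(caps, out_path)
```

Because each stage only talks to its neighbours through parameters and return values, any one of them can be swapped out (for example, replacing `get_text` with `download_text_file`) without touching the others.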

## Appendix:  Regular expression-based approach

Another good reason for encapsulating code in a function is that it's easy to change all uses of the function in a large body of code, and experiment with a new method, as we will frequently do in this course.  For example, here's an entirely new approach to the name-extraction problem:

In [61]:
import re

cap_word_re = r'\b([A-Z][a-z]+)\b'
x = set(re.findall(cap_word_re, text1))

Here we use a pattern-matching Python module called `re` (for **regular expression**) to try to extract all capitalized words that are at least two characters long.  The pattern is the string that's given the name `cap_word_re`, which is then used as the first argument of `re.findall`.  The part in the first pair of square brackets of the pattern says the first character must be a capital letter (`A-Z`), and then there must be one or more following lowercase letters (`[a-z]+`).  All of this must begin and end at a word boundary (`\b`).

Applying this pattern to a simple string we get:

In [62]:
re.findall(cap_word_re,'F3 I 22 love Beethoven')

['Beethoven']

Notice that this excludes `I` and `F3`, which our previous attempt didn't.  There are more powerful things one can say with regular expressions, and when we learn more about regular expressions, we'll return to this problem and come up with a still more satisfactory solution.
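
To see what the `[a-z]+` part of the pattern is doing, the sketch below compares it with a looser variant that uses `\w+` instead.  The looser pattern is not used anywhere in this notebook; it is introduced here only for contrast:

```python
import re

sample = 'F3 I 22 love Beethoven'

# Pattern from above: a capital letter followed by one or more lowercase letters.
strict = re.findall(r'\b([A-Z][a-z]+)\b', sample)
# Looser variant: a capital letter followed by one or more word characters
# (letters, digits, underscore).
loose = re.findall(r'\b([A-Z]\w+)\b', sample)

print(strict)  # ['Beethoven']
print(loose)   # ['F3', 'Beethoven']
```

Both patterns skip `I`, because each demands at least one character after the capital, but only the stricter one rejects `F3`.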


For now, your task is to redefine `extract_caps` to use a regular expression approach.  There is a reasonable
answer a few cells below.  Be sure to try out your new function on one of the examples
done above, to see if you can reproduce that functionality, or do better.

In [64]:
def extract_caps2 (text):
    cap_word_re = r'\b([A-Z][a-z]+)\b'
    return set(re.findall(cap_word_re, text))

To try this out we'd just re-execute any of the cells above that call `download_text_file`
to provide us with some text, and plug in our new function.  For example:

In [66]:
#  Run pipeline once
filename1 = "files/1342/1342-0.txt"
text1 = download_text_file(filename1, site = site)
caps3 = extract_caps2 (text1)

In [71]:
print(len(caps3), list(caps3)[:100])
'Darcy' in caps3

691 ['Goulding', 'Charing', 'Nay', 'Compared', 'Robinson', 'Much', 'From', 'Email', 'Interested', 'How', 'Promise', 'As', 'Caroline', 'Despite', 'Easter', 'Kent', 'Replacement', 'Commerce', 'Thomson', 'Frank', 'Unless', 'Persuaded', 'Ah', 'Spanish', 'Lucases', 'Road', 'Birmingham', 'Into', 'Happiness', 'Society', 'Christmas', 'Delighted', 'Carr', 'Consider', 'Were', 'Michael', 'Yet', 'Poor', 'Words', 'Regulars', 'Epsom', 'In', 'There', 'West', 'Does', 'Tell', 'International', 'Convinced', 'Can', 'Sally', 'November', 'Grosvenor', 'Covering', 'Also', 'License', 'Never', 'Cambridge', 'Nicholls', 'Smollett', 'Updated', 'Vain', 'Argemone', 'Certain', 'Fixed', 'Or', 'Implacable', 'Write', 'Vingt', 'God', 'De', 'By', 'Other', 'Stone', 'Lake', 'Protested', 'House', 'Affectation', 'Just', 'Ring', 'Feb', 'Saturday', 'Tease', 'Vernon', 'Piling', 'What', 'Philistinism', 'Swiftian', 'Maria', 'Must', 'Was', 'Younger', 'Insolent', 'Young', 'To', 'Adieu', 'Revenue', 'Excuse', 'Niece', 'Chawton', 'Two'

True

Let's see how different the two results from *Pride and Prejudice* are:

In [68]:
len(caps1), len(caps3)

(1382, 691)

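We've actually improved the results considerably.  Here's why.

A number of "names" containing punctuation as part of the word have been eliminated, and forms that differed only in their attached punctuation (such as `'Jane,'` and `'Jane.'`, both of which appear in the first result) now collapse into a single entry.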
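
A small offline comparison makes the effect visible.  The sample sentence below is made up for illustration, and the two functions are restated so the cell runs on its own:

```python
import re

def extract_caps(text):
    """First approach: split on whitespace and keep title-cased tokens."""
    cap_words = set()
    for line in text.splitlines():
        for word in line.split():
            if word.istitle():
                cap_words.add(word)
    return cap_words

def extract_caps2(text):
    """Regular-expression approach from this appendix."""
    return set(re.findall(r'\b([A-Z][a-z]+)\b', text))

sample = "“Well,” said Jane. Jane, Mary? Mr. Darcy!"
first = extract_caps(sample)
second = extract_caps2(sample)
print(len(first), len(second))  # 6 5
```

The first approach keeps `'Jane.'` and `'Jane,'` as two separate entries, punctuation included, while the regular expression returns the single clean entry `'Jane'`.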

But all is not sunshine and roses.  Our solution has bugs.  

Here are the words in `caps3` that do **not** appear in `caps1`, that is, words the first attempt never collected in this clean form:

In [79]:
ds = set(caps3).difference(caps1)
ds

{'Abbey',
 'Accompanied',
 'Accordingly',
 'Ah',
 'Allen',
 'Already',
 'Astonishment',
 'Attention',
 'Author',
 'Ay',
 'Bakewell',
 'Bates',
 'Bath',
 'Because',
 'Believe',
 'Bell',
 'Besides',
 'Beyond',
 'Birmingham',
 'Blame',
 'Blenheim',
 'Books',
 'Bountiful',
 'Bromley',
 'Brother',
 'Burney',
 'Cambridge',
 'Care',
 'Certainly',
 'Chatsworth',
 'Cheapside',
 'City',
 'Clapham',
 'Clarke',
 'Clement',
 'Commerce',
 'Compared',
 'Complied',
 'Conjecturing',
 'Courier',
 'Covering',
 'Dashwood',
 'Date',
 'Defects',
 'Delighted',
 'Depend',
 'Design',
 'Dining',
 'Eastbourne',
 'Eltons',
 'Esmond',
 'Esq',
 'Exceed',
 'Exceedingly',
 'Fordyce',
 'Frenchman',
 'Friday',
 'Generous',
 'Girls',
 'Go',
 'Grant',
 'Grantley',
 'Green',
 'Hate',
 'Hatfield',
 'Haye',
 'Hearing',
 'Illustration',
 'Implacable',
 'Impossible',
 'Insolent',
 'James',
 'Janites',
 'July',
 'Just',
 'Keep',
 'Kenilworth',
 'Kympton',
 'La',
 'Lane',
 'Language',
 'Lately',
 'Laugh',
 'Lavington',
 'Liverp

We can easily see that it's not really a matter of the shape of these names themselves:

In [75]:
d_str = ' '.join(ds)
extract_caps2 (d_str)

{'Abbey',
 'Accompanied',
 'Accordingly',
 'Ah',
 'Allen',
 'Already',
 'Astonishment',
 'Attention',
 'Author',
 'Ay',
 'Bakewell',
 'Bates',
 'Bath',
 'Because',
 'Believe',
 'Bell',
 'Besides',
 'Beyond',
 'Birmingham',
 'Blame',
 'Blenheim',
 'Books',
 'Bountiful',
 'Bromley',
 'Brother',
 'Burney',
 'Cambridge',
 'Care',
 'Certainly',
 'Chatsworth',
 'Cheapside',
 'City',
 'Clapham',
 'Clarke',
 'Clement',
 'Commerce',
 'Compared',
 'Complied',
 'Conjecturing',
 'Courier',
 'Covering',
 'Dashwood',
 'Date',
 'Defects',
 'Delighted',
 'Depend',
 'Design',
 'Dining',
 'Eastbourne',
 'Eltons',
 'Esmond',
 'Esq',
 'Exceed',
 'Exceedingly',
 'Fordyce',
 'Frenchman',
 'Friday',
 'Generous',
 'Girls',
 'Go',
 'Grant',
 'Grantley',
 'Green',
 'Hate',
 'Hatfield',
 'Haye',
 'Hearing',
 'Illustration',
 'Implacable',
 'Impossible',
 'Insolent',
 'James',
 'Janites',
 'July',
 'Just',
 'Keep',
 'Kenilworth',
 'Kympton',
 'La',
 'Lane',
 'Language',
 'Lately',
 'Laugh',
 'Lavington',
 'Liverp

In fact, all the names on their own match our capitalized-name regular expression.  So it's something about the context in
which they occur in the text of *Pride and Prejudice*.

Here's your clue, using one of the words from the above list and a slightly looser regular expression which allows
any single character before and after "Merely":

In [78]:
re.findall(".Merely.",text1)

['“Merely ']

This can be fixed.  You are welcome to try to fix `extract_caps2`.  Answer supplied on request.