# Function Creation

### Motivating example

The first reason to write a function is to avoid repeating code.

The next two code cells contain code snippets that do the same thing to slightly different data.  First, see if you can explain what that thing is.

The third code cell abstracts the shared code into a function.  See if you can
guess what the function will look like before you get to the third cell.


### snippet #1

In [33]:
from urllib.request import urlopen

# This is the Iliad
with urlopen('https://www.gutenberg.org/cache/epub/6130/pg6130.txt') as iliad_page:
    iliad_bytes = iliad_page.read()

iliad = iliad_bytes.decode('utf-8')
iliad_words = iliad.split()
iliad_vocab = set(iliad_words)

### snippet #2

In [9]:
from urllib.request import urlopen

# This is the odyssey
with urlopen('https://www.gutenberg.org/cache/epub/1727/pg1727.txt') as odyssey_page:
    odyssey_bytes = odyssey_page.read()
    
odyssey = odyssey_bytes.decode('utf-8')
odyssey_words = odyssey.split()
odyssey_vocab = set(odyssey_words)

Next, try to articulate the task that both of these code snippets perform.  The answer is in a Markdown (text) cell a few cells down.

Description A: **The code snippets extract the vocabulary from a url. If we wanted to write a function that performs that task, the function would take a url string as its input and it would return a set of words.**

Note that the page at that url address needs to contain text encoded in UTF-8 in order for
the code to give the expected result.  This is why pages on `gutenberg.org` were
chosen.

There are other possible answers you could have come up with.  You might have said:

Description B:  **The code snippets extract the vocabulary from a gutenberg.org book. If we wanted to write a function that performed that task, it could take a gutenberg
document number as its input and return a set of words.**


Next try to write a function that does what the code snippets do, consistent
with the description in Description A.

You can start by copying and pasting one of the snippets and indenting it
under:

```
def <function_name> (<parameters>):
```

but you will have to choose a legal function name, supply the parameters for the function,
and change the code snippet.

There are two things to think about:

1.  What the parameters (input) of the function will be.
2.  What the function returns

The answer is a few cells down.

In [183]:
from urllib.request import urlopen

def extract_vocab_from_text_url(url):
    with urlopen(url) as page:
        page_bytes = page.read()
    page_text = page_bytes.decode('utf-8')
    words = page_text.split()
    return set(words)

We decided the function would input a url string and output the vocabulary set found
at that url. The snippet the above function definition evolved from is:

```
with urlopen('https://www.gutenberg.org/cache/epub/1727/pg1727.txt') as odyssey_page:
    odyssey_bytes = odyssey_page.read()
odyssey = odyssey_bytes.decode('utf-8')
odyssey_words = odyssey.split()
odyssey_vocab = set(odyssey_words)
```

Note all the name changes in the code and make sure you understand them. Make sure
you understand the need to use `return` in the last line. Notice that
the necessary import statement was included in the cell defining
the function.  It is good practice when defining a function to declare
all its dependencies explicitly and place them near the function definition.
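The role of `return` can be seen without fetching a page. The following is a minimal sketch on an in-memory string; the names `vocab_with_return`, `vocab_without_return`, and `sample` are made up for this illustration and are not part of the notebook.

```python
def vocab_with_return(text):
    # Hypothetical helper: same last two steps as the function above
    words = text.split()
    return set(words)

def vocab_without_return(text):
    # Same computation, but the set is built and then thrown away
    words = text.split()
    set(words)

sample = "sing goddess sing the wrath"
print(vocab_with_return(sample))     # a set of four distinct words
print(vocab_without_return(sample))  # None
```

A function with no `return` statement hands back `None`, so the caller never sees the set that was computed inside it.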

Let's apply this function to our previous examples to show the benefits of our work.

In [144]:
odyssey_vocab = \
   extract_vocab_from_text_url('https://www.gutenberg.org/cache/epub/1727/pg1727.txt')

In [145]:
iliad_vocab = \
   extract_vocab_from_text_url('https://www.gutenberg.org/cache/epub/6130/pg6130.txt')

We demonstrate the function gets different results with different arguments (this is a good way of checking you made all the necessary name changes when you defined the function).

In [146]:
len(odyssey_vocab),len(iliad_vocab)

(14214, 27070)

Note: This is a noteworthy result.  *The Iliad* is about 16K lines,
and *The Odyssey* about 12K lines,  so it's interesting that the vocabulary size
of *The Iliad* is so much greater.  If you know anything about the stories,
you might be able to think of some story-specific reasons for this.

# Exercise 1

Here are two code snippets. 

1.  Write a description of what they do.
2.  Write a function that does what the code snippets do, consistent with your description of what they do.
3.  Test it by calling it on the appropriate arguments to reproduce what snippet #1
and snippet #2 do.  Check to see the results are different.

In [3]:
emails.split("\n")

['',
 'johnson,fred      fjohnson@gmail.com',
 'clark,kenneth     kclark223@gmail.com',
 'rotsler,elaine    er337@apple.com',
 'smith,howard      hgs@yahoo.com',
 '']

In [7]:
# Snippet 1

emails = \
"""
johnson,fred      fjohnson@gmail.com
clark,kenneth     kclark223@gmail.com
rotsler,elaine    er337@apple.com
smith,howard      hgs@yahoo.com
"""

email_dict = dict()
for line in emails.split("\n"):
    if line.strip():
        name,email = line.strip().split()
        email_dict[name] = email

In [8]:
email_dict

{'johnson,fred': 'fjohnson@gmail.com',
 'clark,kenneth': 'kclark223@gmail.com',
 'rotsler,elaine': 'er337@apple.com',
 'smith,howard': 'hgs@yahoo.com'}

In [11]:
# Snippet 2

tel_nums = \
"""
johnson,fred      619-564-7789
clark,kenneth     858-915-5460
rotsler,elaine    858-276-1990
smith,howard      619-352-9911
"""

dd = dict()
for pair in tel_nums.split("\n"):
    if pair.strip():
        name,num = pair.strip().split()
        dd[name] = num

Put your description of the function in the cell below.  Be sure to say what the input of the function is and what it returns.  And be sure to write a function consistent with your
description.  Remember: You have to double-click on a text (Markdown) cell to edit it.

#### Description goes here

The snippets take a string containing lines of data, each with two whitespace-separated items, and create a dictionary.  The
string before the whitespace is the key and the string after the whitespace is the value.

In [13]:
#Put your definition of the function here

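def string2dict (text):
    # The parameter is named text rather than str to avoid shadowing the built-in str type
    dd = dict()
    for pair in text.split("\n"):
        # Skip blank lines
        if pair.strip():
            name,num = pair.split()
            dd[name] = num
    return dd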

In [17]:
#Test your function on the email data and the telephone number data.
string2dict (tel_nums)

{'johnson,fred': '619-564-7789',
 'clark,kenneth': '858-915-5460',
 'rotsler,elaine': '858-276-1990',
 'smith,howard': '619-352-9911'}

In [16]:
string2dict (emails)

{'johnson,fred': 'fjohnson@gmail.com',
 'clark,kenneth': 'kclark223@gmail.com',
 'rotsler,elaine': 'er337@apple.com',
 'smith,howard': 'hgs@yahoo.com'}

# Exercise 2

Here are two more code snippets. 

1.  Write a description of what they do.
2.  Write a function that does what the code snippets do, consistent with your description of what they do.
3.  Test it by calling it on the appropriate arguments to reproduce what snippet #1
and snippet #2 do.  Check to see the results are different.

In [22]:
# Snippet #1

L1 = "demon navel hoard ram bib apple".split()
L2 = "yolk lapel dint mongoose slake".split()

sorted(L1 + L2)

['apple',
 'bib',
 'demon',
 'dint',
 'hoard',
 'lapel',
 'mongoose',
 'navel',
 'ram',
 'slake',
 'yolk']


In [18]:
# Snippet # 2

fruits = ['cherry', 'banana', 'apple', 'orange','mango']
months = ['January', 'February', 'March', 'April','May']

sorted(fruits + months)

['April',
 'February',
 'January',
 'March',
 'May',
 'apple',
 'banana',
 'cherry',
 'mango',
 'orange']

Put your description of the function in the cell below.  Be sure to say what the input of the function is and what it returns.  

#### Description goes here

The function joins two sequences of words into a single sequence of words, returning a list.  If the inputs
are strings they are converted into lists with `.split()`.  Otherwise they are assumed to already be lists.

In [25]:
# Function definition goes after this line.
def join_word_sequences (seq1,seq2):
    if isinstance(seq1,str):
        seq1 = seq1.split()
    if isinstance(seq2,str):
        seq2 = seq2.split()
    return seq1 + seq2

In [27]:
join_word_sequences (L1,L2)

['demon',
 'navel',
 'hoard',
 'ram',
 'bib',
 'apple',
 'yolk',
 'lapel',
 'dint',
 'mongoose',
 'slake']

In [28]:
#Test your function on the fruits data and the months data.
join_word_sequences (fruits,months)

['cherry',
 'banana',
 'apple',
 'orange',
 'mango',
 'January',
 'February',
 'March',
 'April',
 'May']

# Exercise 3

Go back to the motivating example.  Now write and test a function consistent
with description B of what the code snippets do. The new function should take a Gutenberg document number as its input and return a set of words. Call it `extract_vocab_from_gutenberg_doc`.

Hint:  Use an f-string.
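As a reminder of how f-strings work, here is a small sketch unrelated to Gutenberg; the variable names `author`, `book_count`, `sentence`, and `label` are invented for this illustration.

```python
author = "Homer"
book_count = 2

# Expressions inside curly braces are evaluated and inserted into the string.
sentence = f"{author} is credited with {book_count} epics."
print(sentence)  # Homer is credited with 2 epics.

# The same variable may appear more than once in a single f-string.
label = f"{book_count} + {book_count} = {book_count + book_count}"
print(label)  # 2 + 2 = 4
```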

Put your description of the function in the cell below.  Be sure to say what the input of the function is and what it returns.  And be sure to write a function consistent with your
description.

#### Function description

The function `extract_vocab_from_gutenberg_doc` takes a Gutenberg document number as its input and returns the vocabulary of the document as a set of strings.

In [31]:
# Function definition goes after this line
def extract_vocab_from_gutenberg_doc(doc_number):
    url = f'https://www.gutenberg.org/cache/epub/{doc_number}/pg{doc_number}.txt'
    with urlopen(url) as page:
        page_bytes = page.read()
    page_text = page_bytes.decode('utf-8')
    words = page_text.split()
    return set(words)

In [34]:
# Test your function on the document number of The Iliad
doc_number = 6130
iliad_vocab2 = extract_vocab_from_gutenberg_doc(doc_number)
# iliad vocab was computed in the Motivating Example section of this NB
iliad_vocab2 == iliad_vocab

True

In [164]:
# Test your function on the document number of The Odyssey
doc_number2 = 1727
odyssey_vocab2 = extract_vocab_from_gutenberg_doc(doc_number2)

# Exercise 4

Write a function that takes as its arguments two url strings and returns the vocabulary set common to both urls.

As an example, suppose we have two URLs containing the following very short texts:

```
url1: cats dogs urchin
url2: dogs my trail
```

Then this is the behavior we want:

```
>>> extract_shared_vocab_from_text_urls(url1,url2)
{'dogs'}
```

The point of this exercise is not to start from scratch.  Your function
should make use of the function `extract_vocab_from_text_url` already defined
in the motivating example.  If you do that, this should be very easy.
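The set operation involved can be tried on the two short example texts directly, with no URLs. This is only a sketch of the set step; the names `vocab1`, `vocab2`, and `shared` are placeholders chosen for the illustration.

```python
# Build the two vocabularies from the short example texts above
vocab1 = set("cats dogs urchin".split())
vocab2 = set("dogs my trail".split())

# intersection returns a new set holding the elements found in both sets
shared = vocab1.intersection(vocab2)
print(shared)  # {'dogs'}
```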

In [36]:
def extract_shared_vocab_from_text_urls(url1,url2):
    return extract_vocab_from_text_url(url1).intersection(extract_vocab_from_text_url(url2))
    

In [149]:
url1 = 'https://www.gutenberg.org/cache/epub/6130/pg6130.txt'
url2 = 'https://www.gutenberg.org/cache/epub/1727/pg1727.txt'
common_vocab1 = extract_shared_vocab_from_text_urls(url1,url2)
len(common_vocab1)

6597

# Exercise 4a

**Optional advanced problem**:  Define a new version of `extract_shared_vocab_from_text_urls`
that takes *any number of url arguments* and returns the vocabulary common
to all the URLs.  Like the two-argument version, it should build on
`extract_vocab_from_text_url`.

As an example, suppose we have four URLs containing the following very short texts:

```
url1: cats dogs urchin
url2: dogs my trail
url3: my dogs hurt
url4: raining cats and dogs
```

Then this is the behavior we want:

```
>>> extract_shared_vocab_from_text_urls(url1,url2,url3,url4)
{'dogs'}
```

To help streamline this task, we present some facts about
the asterisk prefix (`*`) in Python.  This prefix can be used
to define functions that take an arbitrary number of arguments
(which is what `extract_shared_vocab_from_text_urls` should do),
and it can also be used to pass the contents of a container
as arguments to a function without having to "take it apart".

First we define a tuple and apply `print` to it.

In [167]:
set_tuple = (set('abc'), set('bcd'), set('cde'), set('cgh'))
print(set_tuple)

({'c', 'b', 'a'}, {'c', 'b', 'd'}, {'e', 'c', 'd'}, {'h', 'c', 'g'})


Note that a tuple was printed out, complete with surrounding parentheses.

Suppose we want to print the individual elements of the tuple
instead of the tuple as a unit.  Then we could do: 

In [168]:
print(set_tuple[0],set_tuple[1],set_tuple[2],set_tuple[3])

{'c', 'b', 'a'} {'c', 'b', 'd'} {'e', 'c', 'd'} {'h', 'c', 'g'}


Note that there are no parentheses surrounding the tuple elements.
We can get exactly the same result using the asterisk prefix:

In [169]:
print(*set_tuple)

{'c', 'b', 'a'} {'c', 'b', 'd'} {'e', 'c', 'd'} {'h', 'c', 'g'}


In the first case

```
print(set_tuple)
```

we print a tuple; in the second, 

```
print(*set_tuple)
```

the individual elements of the tuple are passed as arguments to the print command, so it prints each of them.

Similarly, since the intersection method on sets allows any number of container arguments,
the two python expressions executed in the `print` commands in the next cell are equivalent:

In [72]:
set_tuple = (set('abc'), set('bcd'), set('cde'), set('cgh'))
print(set('cat').intersection(set_tuple[0],set_tuple[1],set_tuple[2],set_tuple[3]))
print(set('cat').intersection(*set_tuple))

{'c'}
{'c'}


We can also use the asterisk prefix in function signatures.

The following two functions have exactly the same definition
but different argument **signatures** (different ways of defining the function parameters), so they are called in slightly different ways:

In [54]:
def a_single_container_arg (args):
    for x in args:
        print(x)
def any_number_of_args (*args):
    for x in args:
        print(x)


a_single_container_arg ((1,2,3,4))
print('='*20)
any_number_of_args(1,2,3,4)

1
2
3
4
====================
1
2
3
4
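The two uses of the asterisk can be combined: unpacking a container at the call site feeds a function defined with `*args`. The sketch below uses a made-up function, `count_args`, purely to show how many arguments the function receives in each case.

```python
def count_args(*args):
    # Inside the function, args is a tuple of all the positional arguments.
    return len(args)

numbers = (1, 2, 3, 4)
print(count_args(numbers))   # 1: the tuple arrives as a single argument
print(count_args(*numbers))  # 4: the tuple is unpacked into four arguments
```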


Using asterisks and `extract_vocab_from_text_url`, it should be easy to define
`extract_shared_vocab_from_text_urls`:

Here are some docs to test on:

Add your definition to the next cell then execute it to run the test.

Hint:  It may help to know that the set method `.intersection()`, like `print`, takes any number of arguments, and that the method can be called using the type name `set`.

In [181]:
set.intersection(set("abc"),set("bcd"),set("cde"))

{'c'}

Here are three distinct ways to do it:

In [None]:
from urllib.request import urlopen

# Here is a repeat definition of the function defined above

def extract_vocab_from_text_url(url):
    with urlopen(url) as page:
        page_bytes = page.read()
    page_text = page_bytes.decode('utf-8')
    words = page_text.split()
    return set(words)

#Here are three equivalent answers using that definition

def extract_shared_vocab_from_text_urls0(*urls):
    result = extract_vocab_from_text_url(urls[0])
    for url in urls[1:]:
        result.intersection_update(extract_vocab_from_text_url(url))
    return result

def extract_shared_vocab_from_text_urls1(*urls):
    result = extract_vocab_from_text_url(urls[0])
    for url in urls[1:]:
        result = result.intersection(extract_vocab_from_text_url(url))
    return result

def extract_shared_vocab_from_text_urls2(*urls):
    return set.intersection(*[extract_vocab_from_text_url(url) for url in urls])

In [200]:
common_vocab2 = extract_shared_vocab_from_text_urls2 (url0,url1,url2,url3)
len(common_vocab2)

3201

In [199]:

common_vocab1 = extract_shared_vocab_from_text_urls1 (url0,url1,url2,url3)
len(common_vocab1)

3201

In [202]:
common_vocab0 = extract_shared_vocab_from_text_urls0 (url0,url1,url2,url3)
len(common_vocab0)

3201
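One edge case is worth noting: called with no URLs at all, the first two versions fail on `urls[0]` and the third fails because `set.intersection` needs at least one set. The sketch below shows one possible guard, using a hypothetical in-memory analogue, `shared_vocab_of_texts`, that takes text strings instead of URLs; returning an empty set for zero arguments is a design assumption, not something the exercise specifies.

```python
def shared_vocab_of_texts(*texts):
    # Hypothetical analogue of the functions above: arguments are strings, not URLs.
    if not texts:
        # Assumed convention: nothing to compare means no shared vocabulary.
        return set()
    vocabularies = [set(text.split()) for text in texts]
    return set.intersection(*vocabularies)

print(shared_vocab_of_texts("cats dogs urchin", "dogs my trail",
                            "my dogs hurt", "raining cats and dogs"))  # {'dogs'}
print(shared_vocab_of_texts())  # set()
```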

# Exercise 5

Copy and paste the definition of the function `extract_vocab_from_text_url` from above into a code cell and modify it so that it can accept an encoding
as an additional argument. This additional argument should be optional
and it should have `utf8` as its default value.  Call the modified version `new_extract_vocab_from_text_url`. 

The function `new_extract_vocab_from_text_url`
should work exactly like the
original version on, say, the Odyssey page,
`'https://www.gutenberg.org/cache/epub/1727/pg1727.txt'`,
but it should also work on text pages encoded in, say, UTF-16.

You can test `new_extract_vocab_from_text_url` two ways.

#### test # 1

After defining `new_extract_vocab_from_text_url` execute the cell below to make sure it works the same way `extract_vocab_from_text_url` does on the motivating examples. That is, using `new_extract_vocab_from_text_url` on a gutenberg.org URL should still work without supplying an encoding argument and it should return the same value as 
`extract_vocab_from_text_url` does.

In [None]:
def new_extract_vocab_from_text_url(url,encoding="utf-8"):
    with urlopen(url) as page:
        page_bytes = page.read()
    page_text = page_bytes.decode(encoding)
    words = page_text.split()
    return set(words)

In [28]:
# Test #1
new_iliad_vocab = \
   new_extract_vocab_from_text_url('https://www.gutenberg.org/cache/epub/6130/pg6130.txt')
new_iliad_vocab == iliad_vocab

True

#### test #2

It should also work on a UTF-16 version of *The Iliad*, which has been conveniently provided for you in the next cell; `new_extract_vocab_from_text_url` should return exactly the same vocabulary it did for the UTF-8 version on gutenberg.org.

In [29]:
# Test #2
iliad16_url = 'https://gawron.sdsu.edu/python_for_ss/course_core/data/iliad_utf16.txt'
iliad_vocab16 = new_extract_vocab_from_text_url(iliad16_url, encoding='utf16')
iliad_vocab16 == iliad_vocab

True

Comment: if test #2 raises a `UnicodeDecodeError` like the one in the code cell below, that's probably because you are still using the default encoding to try to read the UTF-16 version of the file.  Check to make sure you aren't always using UTF-8 in
the line calling `page_bytes.decode`.

We can say something stronger: Using the default encoding to try to read a UTF-16 file *should* cause an error (we will return to this fact when we discuss Unicode and text encodings later on in the course). So, for example, if you've
correctly defined `new_extract_vocab_from_text_url`, the following code cell raises an error:


In [30]:
# Correctly raises an error
iliad_vocabx = new_extract_vocab_from_text_url(iliad16_url)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
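The same behavior can be reproduced without a network connection. Python's `utf-16` codec writes a byte order mark at the start of the encoded bytes, and neither byte of that mark is a valid first byte in UTF-8. The sketch below assumes a hypothetical helper, `vocab_from_bytes`, that stands in for the decoding and splitting steps of `new_extract_vocab_from_text_url`.

```python
def vocab_from_bytes(page_bytes, encoding="utf-8"):
    # Hypothetical helper: decode the bytes, then split into a vocabulary set
    return set(page_bytes.decode(encoding).split())

text = "sing goddess the wrath"
utf16_bytes = text.encode("utf-16")

# Supplying the right encoding recovers the original vocabulary
print(vocab_from_bytes(utf16_bytes, encoding="utf-16") == set(text.split()))  # True

# Falling back on the default encoding fails
try:
    vocab_from_bytes(utf16_bytes)
except UnicodeDecodeError as err:
    print("decode failed:", err.reason)
```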

# Exercise 6

Here are two more complicated snippets that do the same thing to two different datasets.


1.  Write a description of what they do.
2.  Write a function that does what the code snippets do, consistent with your description of what they do.  Choose a good name consistent with your description.
3.  Test it by calling it on the appropriate arguments to reproduce what snippet #1
and snippet #2 do.  Check to see the results are different.


A common sense constraint:  Your function should return
the right data structure to represent the work that
was done in the code snippet.

In [150]:
# Snippet # 1

from urllib.request import urlopen

with urlopen(urls[0])  as zero_page:
    zero_bytes = zero_page.read()

zero_text = zero_bytes.decode('utf-8')
zero_words = zero_text.lower().split()

zero_dd = dict()

for word in zero_words:
    if word in zero_dd:
        zero_dd[word]  += 1
    else:
        zero_dd[word]  = 1      


In [151]:
# Snippet # 2

from urllib.request import urlopen

with urlopen(urls[1])  as one_page:
    one_bytes = one_page.read()

one_text = one_bytes.decode('utf-8')
one_words = one_text.lower().split()

one_dd = dict()

for word in one_words:
    if word in one_dd:
        one_dd[word]  += 1
    else:
        one_dd[word]  = 1      


#### Description

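These code snippets compute the frequency count of each vocabulary item in the text at a url.  A function that performs this task takes a url string as its input and returns a dictionary mapping each lowercased word to its count.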

In [153]:
#Put your function definition in the cell

def get_word_counts(url,encoding="utf-8"):
    from urllib.request import urlopen

    with urlopen(url)  as one_page:
        one_bytes = one_page.read()

    one_words = one_bytes.decode(encoding).lower().split()

    one_dd = dict()

    for word in one_words:
        if word in one_dd:
            one_dd[word]  += 1
        else:
            one_dd[word]  = 1      

    return one_dd

In [154]:
# Put your test code in this cell.
url0 = 'https://www.gutenberg.org/cache/epub/6130/pg6130.txt'
wc0 = get_word_counts(urls[0],encoding="utf-8")
wc1 = get_word_counts(urls[1],encoding="utf-8")
print(wc0 == zero_dd)
print(wc1 == one_dd)
                           

True
True
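The counting loop in these snippets is a common pattern, and the standard library's `collections.Counter` packages it up. The sketch below compares the two on an in-memory string; `count_words` and `sample` are names invented for this illustration, and using `Counter` is an alternative, not the approach the exercise asks for.

```python
from collections import Counter

def count_words(text):
    # Hypothetical helper: the same loop as the snippets, minus the download step
    counts = dict()
    for word in text.lower().split():
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

sample = "The wrath the WRATH of Achilles"
manual = count_words(sample)
counter = Counter(sample.lower().split())

print(manual)                   # {'the': 2, 'wrath': 2, 'of': 1, 'achilles': 1}
print(manual == dict(counter))  # True
```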


# Exercise  7

Write a function that does what the following snippet does.  The
function you define can (and should) make use of a function previously
defined in this notebook.

After you've written your function, use it to answer the following questions.

1.  What are the 20 most common content words in *The Odyssey*? The best definition of a content word is negative:  A content word is not a **function** word, that is, not a word like *the*, *and*, or *in* that is extremely frequent and serves a grammatical function (more examples below).
2.  What are the 20 most common content words in *The Iliad*? 

For distinguishing content words from function words, here's some help.  Function words are sometimes
called stop words in natural language processing.  Use the set `custom_stops` defined below
and find the 20 most frequent words from each text that are **not** stop words.

In [115]:
from urllib.request import urlopen

with urlopen(urls[1])  as odyssey_page:
    odyssey_bytes = odyssey_page.read()

odyssey_text = odyssey_bytes.decode('utf-8')
odyssey_words = odyssey_text.lower().split()

odyssey_dd = dict()

for word in odyssey_words:
    if word in odyssey_dd:
        odyssey_dd[word]  += 1
    else:
        odyssey_dd[word]  = 1      

LL = sorted(odyssey_dd.items(),key=lambda x:x[1],reverse=True)


In order to help you understand this code, you should look at the first 25 elements of `LL`.

In [97]:
LL[:25]

[('the', 7016),
 ('and', 5320),
 ('of', 3626),
 ('to', 3557),
 ('a', 2030),
 ('in', 1897),
 ('i', 1860),
 ('he', 1811),
 ('you', 1675),
 ('for', 1368),
 ('his', 1341),
 ('as', 1291),
 ('that', 1275),
 ('with', 1236),
 ('was', 1033),
 ('it', 1005),
 ('they', 952),
 ('is', 934),
 ('on', 882),
 ('had', 880),
 ('have', 858),
 ('but', 849),
 ('all', 828),
 ('my', 816),
 ('not', 772)]

It's a list of the pairs extracted from the dictionary `odyssey_dd`.  They have been sorted by the value of the second element,
from the largest value for that second element to the smallest value for that second element.  You're looking at the top 25 pairs after the sort.
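The sorting step can be tried on a tiny dictionary. This is a sketch with made-up counts; `counts` and `ranked` are illustrative names.

```python
counts = {"ship": 3, "the": 7, "sea": 3, "and": 5}

# .items() yields (word, count) pairs; the key function picks out the count,
# and reverse=True puts the largest counts first.
ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
print(ranked)  # [('the', 7), ('and', 5), ('ship', 3), ('sea', 3)]
```

Because Python's sort is stable, pairs with equal counts (here `ship` and `sea`) keep the order they had in the dictionary.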

All 25 of the top 25 words in `LL` are function words according to the list below:

In [86]:
from nltk.corpus import brown
# Words from the news category of the Brown corpus, tagged with the universal part-of-speech tagset
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')

In [138]:
# Start with every Brown news word whose tag marks it as a function word or punctuation
custom_stops = {wd.lower() for (wd,tag) in brown_news_tagged if (tag in {"DET",".","CONJ","PRT","ADP","PRON","NUM"})}
# Archaic pronouns
custom_stops.update({"thy","thou","thee","thine"})
# Very frequent verbs with little content
custom_stops.update({"have","has","had","having","'ve"})
custom_stops.update({"do","does","did","done","doing"})
custom_stops.update({"go","goes","went","gone","going"})
custom_stops.update({"get","gets","got","gotten","getting"})
custom_stops.update({"come","comes","came","coming"})
custom_stops.update({"say","says","said","saying"})
custom_stops.update({"tell","tells","told","telling"})
custom_stops.update({"take","takes","took","taken","taking"})
custom_stops.update({"let","lets","letting"})
custom_stops.update({"make","makes","made","making"})

# Forms of "be", modal verbs, and negation
custom_stops.update({"be","is","was","were","am","are","been","being"})
custom_stops.update({"will","would","may","might","should",
                     "shall","must","can","could"})
custom_stops.update({"not","n't"})
# Frequent adverbs and ordinals
custom_stops.update({"when","where","how","then","now","here","there","back","other","very",
                     "more","less","much","also","thus","o'er","o’er","first",
                     "second","third","fourth","fifth","sixth","seventh",
                     "eighth","ninth","tenth"})

In [131]:
[(wd,ct) for(wd,ct) in LL[:25] if wd not in custom_stops]

[]

#### Solution

In [203]:
def sort_words_by_frequency(url):
    with urlopen(url)  as page:
        p_bytes = page.read()

    words = p_bytes.decode('utf-8').lower().split()

    dd = dict()

    for word in words:
        if word in dd:
            dd[word]  += 1
        else:
            dd[word]  = 1      

    return sorted(dd.items(),key=lambda x:x[1],reverse=True)

In [204]:
freq_list_odyssey = sort_words_by_frequency(urls[1])

We show the 20 most common content words, along with their counts and ranks.  Notably,
none of the 47 most frequent words (ranks 0 through 46) counts as a content word.  Several
of the qualifying words are due to errors of **tokenization** ("him,", "and,", and "them,"
should not be words).  These errors come from using `.split()` to tokenize.  If we instead
use a tokenizer based on regular expressions, the results improve.

In [205]:
[(i,(wd, freq)) for (i,(wd,freq)) in enumerate(freq_list_odyssey) if wd not in custom_stops][:20]

[(47, ('ulysses', 434)),
 (66, ('own', 264)),
 (71, ('see', 249)),
 (72, ('man', 248)),
 (74, ('men', 246)),
 (78, ('house', 228)),
 (81, ('son', 218)),
 (89, ('good', 199)),
 (97, ('[', 186)),
 (99, ('great', 184)),
 (104, ('him,', 173)),
 (105, ('said,', 173)),
 (107, ('set', 166)),
 (108, ('telemachus', 165)),
 (112, ('till', 158)),
 (113, ('ship', 158)),
 (119, ('suitors', 149)),
 (125, ('home', 143)),
 (126, ('them,', 142)),
 (127, ('even', 142))]

In [206]:
freq_list_iliad = sort_words_by_frequency('https://www.gutenberg.org/cache/epub/6130/pg6130.txt')
[(i,(wd, freq)) for (i,(wd,freq)) in enumerate(freq_list_iliad) if wd not in custom_stops][:20]

[(39, ('great', 445)),
 (59, ('hector', 282)),
 (63, ('arms', 257)),
 (70, ('achilles', 232)),
 (76, ('trojan', 215)),
 (77, ('jove', 206)),
 (79, ('gods', 195)),
 (81, ('chief', 187)),
 (82, ('high', 187)),
 (86, ('god', 180)),
 (90, ('son', 165)),
 (91, ('grecian', 161)),
 (92, ('greece', 161)),
 (95, ('and,', 153)),
 (96, ('troy', 148)),
 (97, ('greeks', 148)),
 (98, ('still', 148)),
 (99, ('fierce', 146)),
 (102, ('till', 143)),
 (103, ('hand', 143))]
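Entries such as `'him,'`, `'them,'`, and `'and,'` in the lists above come from `.split()`, which leaves punctuation attached to words. The sketch below shows one possible regular-expression tokenizer on an in-memory sentence. The pattern is an assumption made for illustration: it keeps only runs of ASCII letters, optionally joined by one internal apostrophe, so it would need extending for accented characters or hyphenated words.

```python
import re

sample = "He said, give them, to him, and go."

# Whitespace splitting keeps the trailing commas and the final period
split_tokens = sample.lower().split()
print(split_tokens)
# ['he', 'said,', 'give', 'them,', 'to', 'him,', 'and', 'go.']

# Regex tokenizing keeps only the letter sequences
regex_tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", sample.lower())
print(regex_tokens)
# ['he', 'said', 'give', 'them', 'to', 'him', 'and', 'go']
```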