In [1]:
# Important: From this chapter onwards, our program samples will assume you begin your interactive session or your program with the 
# following import statements:

from __future__ import division, print_function  # Python 2 users only; __future__ imports must come first
import nltk, re, pprint
from nltk import word_tokenize

In [2]:
# 4   Writing Structured Programs

# By now you will have a sense of the capabilities of the Python programming language for processing natural language. However, if 
# you're new to Python or to programming, you may still be wrestling with Python and not feel like you are in full control yet. 
# In this chapter we'll address the following questions:

#    How can you write well-structured, readable programs that you and others will be able to re-use easily?
#    How do the fundamental building blocks work, such as loops, functions and assignment?
#    What are some of the pitfalls with Python programming and how can you avoid them?

# Along the way, you will consolidate your knowledge of fundamental programming constructs, learn more about using features of the 
# Python language in a natural and concise way, and learn some useful techniques in visualizing natural language data. As before, 
# this chapter contains many examples and exercises (and as before, some exercises introduce new material). Readers new to 
# programming should work through them carefully and consult other introductions to programming if necessary; experienced programmers
# can quickly skim this chapter.

# In the other chapters of this book, we have organized the programming concepts as dictated by the needs of NLP. Here we revert 
# to a more conventional approach where the material is more closely tied to the structure of the programming language. There's not 
# room for a complete presentation of the language, so we'll just focus on the language constructs and idioms that are most 
# important for NLP.

# 4.1   Back to the Basics

In [3]:
# Assignment

# Assignment would seem to be the most elementary programming concept, not deserving a separate discussion. However, there are some 
# surprising subtleties here. Consider the following code fragment:

foo = 'Monty'  # foo points to some memory space with 'Monty'
bar = foo      # bar points to the same memory space 'Monty' 
foo = 'Python' # foo points to another memory space with 'Python'
bar            # bar STILL points to memory space with 'Monty' 

'Monty'

In [4]:
# This behaves exactly as expected. When we write bar = foo in the above code, the value of foo (the string 'Monty') is assigned 
# to bar. That is, bar is a copy of foo, so when we overwrite foo with a new string 'Python' on the following line, the value of 
# bar is not affected.

# However, assignment statements do not always involve making copies in this way. Assignment always copies the value of an 
# expression, but a value is not always what you might expect it to be. In particular, the "value" of a structured object such as 
# a list is actually just a reference to the object. In the following example, bar = foo assigns the reference of foo to the new 
# variable bar. Now when we modify something inside foo with foo[1] = 'Bodkin', we can see that the contents of bar have also 
# been changed.

foo = ['Monty', 'Python'] # foo points to a list ['Monty', 'Python']
bar = foo # bar points to this same list
foo[1] = 'Bodkin' # Second element of the list is changed to 'Bodkin'
bar # Since bar points to this same list, the change is reflected

['Monty', 'Bodkin']

In [5]:
# The line bar = foo does not copy the contents of the variable, only its "object reference". To understand what is going on 
# here, we need to know how lists are stored in the computer's memory. In 4.1, we see that a list foo is a reference to an object 
# stored at location 3133 (which is itself a series of pointers to other locations holding strings). When we assign bar = foo, it 
# is just the object reference 3133 that gets copied. This behavior extends to other aspects of the language, such as parameter 
# passing (4.4).

# Let's experiment some more, by creating a variable empty holding the empty list, then using it three times on the next line.

In [6]:
empty = []
nested = [empty, empty, empty]
nested

[[], [], []]

In [8]:
nested[1].append('Python')
nested

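# Appending to nested[1] changes all three elements, because they are one and the same list. A single run of this cell gives 
# [['Python'], ['Python'], ['Python']]. The output below shows 'Python' twice, most likely because the cell was executed twice 
# (the counter skips from In [6] to In [8]), so append() ran twice on the shared list.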

[['Python', 'Python'], ['Python', 'Python'], ['Python', 'Python']]

In [9]:
# Observe that changing one of the items inside our nested list of lists changed them all. This is because each of the three 
# elements is actually just a reference to one and the same list in memory.
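
# A minimal sketch of how to avoid the sharing described above, assuming we want three independent inner lists. The names 
# shared and separate are illustrative only and do not appear in the original text.

```python
shared = [[]] * 3                   # three references to one and the same list
separate = [[] for _ in range(3)]   # a fresh list is created on each iteration

shared[1].append('Python')          # visible through every element of shared
separate[1].append('Python')        # visible only in separate[1]

print(shared)
print(separate)
```

# The list comprehension evaluates [] once per iteration, so each element refers to its own list object.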

In [10]:
# Now, notice that when we assign a new value to one of the elements of the list, it does not propagate to the others:

nested = [[]] * 3
nested[1].append('Python')
nested[1] = ['Monty']
nested

[['Python'], ['Monty'], ['Python']]

In [11]:
# We began with a list containing three references to a single empty list object. Then we modified that object by appending 
# 'Python' to it, resulting in a list containing three references to a single list object ['Python']. Next, we overwrote one of 
# those references with a reference to a new object ['Monty']. This last step modified one of the three object references inside 
# the nested list. However, the ['Python'] object wasn't changed, and is still referenced from two places in our nested list of lists. 
# It is crucial to appreciate this difference between modifying an object via an object reference, and overwriting an object 
# reference.

# Note

# Important: To copy the items from a list foo to a new list bar, you can write bar = foo[:]. This copies the object references 
# inside the list. To copy a structure without copying any object references, use copy.deepcopy().
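
# The difference between the two kinds of copy can be sketched as follows. The variable names shallow and deep are 
# illustrative assumptions, and the nested list is made-up sample data.

```python
import copy

foo = [['Monty'], ['Python']]
shallow = foo[:]               # new outer list, but the same inner list objects
deep = copy.deepcopy(foo)      # new outer list and new inner lists

foo[0].append('Bodkin')        # modify an inner list through foo

print(shallow)                 # the change shows up here
print(deep)                    # the deep copy is unaffected
```

# A slice copy is usually enough for a flat list of strings; deepcopy() matters once the list contains other mutable objects.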

In [12]:
# Equality

# Python provides two ways to check that a pair of items are the same. The is operator tests for object identity. We can use it to 
# verify our earlier observations about objects. First we create a list containing several copies of the same object, and demonstrate 
# that they are not only identical according to ==, but also that they are one and the same object:

size = 5
python = ['Python']
snake_nest = [python] * size
snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]

True

In [13]:
snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]

True

In [14]:
# Now let's put a new python in this nest. We can easily show that the objects are not all identical:

import random
position = random.choice(range(size))
print (position)
snake_nest[position] = ['Python']
snake_nest

4


[['Python'], ['Python'], ['Python'], ['Python'], ['Python']]

In [15]:
snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]

True

In [16]:
snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]

False

In [18]:
# You can do several pairwise tests to discover which position contains the interloper, but the id() function makes detection easier:

[id(snake) for snake in snake_nest]

# In the run shown here, the last item of the list (position 4, as printed above) has a distinct identifier. If you try running 
# this code snippet yourself, expect to see different numbers in the resulting list, and also the interloper may be in a 
# different position.

# Having two kinds of equality might seem strange. However, it's really just the type-token distinction, familiar from natural 
# language, here showing up in a programming language: == asks whether two items have the same value (the same type), while is 
# asks whether they are one and the same object (the same token).

[154562440L, 154562440L, 154562440L, 154562440L, 154584840L]

In [20]:
# Conditionals

# In the condition part of an if statement, a nonempty string or list is evaluated as true, while an empty string or list evaluates 
# as false.

mixed = ['cat', '', ['dog'], []]
for element in mixed:
    if element:
        print(element)

cat
['dog']


In [21]:
# That is, we don't need to say if len(element) > 0: in the condition.

# What's the difference between using if...elif as opposed to using a couple of if statements in a row? Well, consider the following 
# situation:

animals = ['cat', 'dog']
if 'cat' in animals:
    print(1)
elif 'dog' in animals:
    print(2)

1


In [22]:
# Since the if clause of the statement is satisfied, Python never tries to evaluate the elif clause, so we never get to print out 
# 2. By contrast, if we replaced the elif by an if, then we would print out both 1 and 2. So an elif clause potentially gives us more 
# information than a bare if clause; when it evaluates to true, it tells us not only that the condition is satisfied, but also that 
# the condition of the main if clause was not satisfied.
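
# To make the contrast concrete, here is a small sketch that records which branches run in each case. The lists chained and 
# separate are illustrative names introduced only for this comparison.

```python
animals = ['cat', 'dog']

chained = []                   # branches taken by if ... elif
if 'cat' in animals:
    chained.append(1)
elif 'dog' in animals:         # skipped, because the if clause already succeeded
    chained.append(2)

separate = []                  # branches taken by two independent if statements
if 'cat' in animals:
    separate.append(1)
if 'dog' in animals:           # tested regardless of the first result
    separate.append(2)

print(chained, separate)
```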

In [23]:
# The functions all() and any() can be applied to a list (or other sequence) to check whether all or any items meet some condition:

sent = ['No', 'good', 'fish', 'goes', 'anywhere', 'without', 'a', 'porpoise', '.']
all(len(w) > 4 for w in sent)

False

In [24]:
any(len(w) > 4 for w in sent)

True

# 4.2   Sequences

In [25]:
# So far, we have seen two kinds of sequence object: strings and lists. Another kind of sequence is called a tuple. Tuples are 
# formed with the comma operator, and typically enclosed using parentheses. We've actually seen them in the previous chapters, 
# and sometimes referred to them as "pairs", since there were always two members. However, tuples can have any number of members. 
# Like lists and strings, tuples can be indexed and sliced, and have a length, as the next four cells show.

# Let's define t as a 3-tuple
t = 'walk', 'fem', 3
t

('walk', 'fem', 3)

In [26]:
# What is the first element of the tuple t?
t[0]

'walk'

In [27]:
# Print out the 2nd through the end of the tuple's elements
t[1:]

('fem', 3)

In [28]:
# What is the length of the tuple t?
len(t) 

3

In [29]:
# Caution!

# Tuples are constructed using the comma operator. Parentheses are a more general feature of Python syntax, designed for grouping. 
# A tuple containing the single element 'snark' is defined by adding a trailing comma, like this: "'snark',". The empty tuple is a 
# special case, and is defined using empty parentheses ().
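
# A quick check of the three cases just described. The variable names single, grouped and empty are illustrative assumptions.

```python
single = 'snark',          # the trailing comma makes a 1-tuple
grouped = ('snark')        # parentheses alone only group: this is still a string
empty = ()                 # the empty tuple

print(type(single).__name__, type(grouped).__name__, type(empty).__name__)
```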

# Let's compare strings, lists and tuples directly, and do the indexing, slice, and length operation on each type:

raw = 'I turned off the spectroroute' # string
text = ['I', 'turned', 'off', 'the', 'spectroroute'] # list
pair = (6, 'turned') # 2-tuple
raw[2], text[3], pair[1] # third character of raw, fourth element of text, and second element of pair (indexing starts at 0)
# note that the actual output is itself a tuple

('t', 'the', 'turned')

In [30]:
raw[-3:], text[-3:], pair[-3:]
# the last three characters of the string
# the last three items of the list
# the last three items of the tuple; a 2-tuple has only two items, so both are returned
# note that the actual output is itself a tuple

('ute', ['off', 'the', 'spectroroute'], (6, 'turned'))

In [31]:
len(raw), len(text), len(pair)
# length of raw string, length of text list, and length of 2-tuple pair

(29, 5, 2)

In [32]:
# Notice in this code sample that we computed multiple values on a single line, separated by commas. These comma-separated 
# expressions are actually just tuples — Python allows us to omit the parentheses around tuples if there is no ambiguity. When we 
# print a tuple, the parentheses are always displayed. By using tuples in this way, we are implicitly aggregating items together.

In [33]:
# Let's try these sequences
s = [1,2,3,4,5,6,7,8,9,10]
for item in s:
    print (item)

1
2
3
4
5
6
7
8
9
10


In [34]:
s = [10,3,2,1,5,7,9,8,4,6]
for item in sorted(s):
    print (item)

1
2
3
4
5
6
7
8
9
10


In [35]:
s = [10,3,2,1,5,7,9,8,4,6,5,6,2,1,10,2,3]
for item in set(s):
    print (item)

1
2
3
4
5
6
7
8
9
10


In [36]:
for item in reversed(s):
    print (item)

3
2
10
1
2
6
5
6
4
8
9
7
5
1
2
3
10


In [37]:
t = [2,3,4,5]
for item in set(s).difference(t):
    print (item)

1
6
7
8
9
10


In [38]:
# Some other objects, such as a FreqDist, can be converted into a sequence (using list() or sorted()) and support iteration, e.g.

raw = 'Red lorry, yellow lorry, red lorry, yellow lorry.'
text = word_tokenize(raw)
print (text)
fdist = nltk.FreqDist(text)
sorted(fdist)

['Red', 'lorry', ',', 'yellow', 'lorry', ',', 'red', 'lorry', ',', 'yellow', 'lorry', '.']


[',', '.', 'Red', 'lorry', 'red', 'yellow']

In [39]:
for key in fdist:
    print(key + ':', fdist[key], end='; ')

,: 3; yellow: 2; .: 1; Red: 1; lorry: 4; red: 1; 

In [40]:
# In the next example, we use tuples to re-arrange the contents of our list. (We can omit the parentheses because the comma has 
# higher precedence than assignment.)

words = ['I', 'turned', 'off', 'the', 'spectroroute'] # define words as a list of words
words[2], words[3], words[4] = words[3], words[4], words[2] # should be I turned the spectroroute off
words

['I', 'turned', 'the', 'spectroroute', 'off']

In [41]:
# This is an idiomatic and readable way to move items inside a list. It is equivalent to the following traditional way of doing 
# such tasks that does not use tuples (notice that this method needs a temporary variable tmp). Here it is applied to the list we 
# have just rearranged, so the three items are rotated one more time.

tmp = words[2]
words[2] = words[3]
words[3] = words[4]
words[4] = tmp
words

['I', 'turned', 'spectroroute', 'off', 'the']

In [42]:
# As we have seen, Python has sequence functions such as sorted() and reversed() that rearrange the items of a sequence. There are 
# also functions that modify the structure of a sequence and which can be handy for language processing. Thus, zip() takes the items 
# of two or more sequences and "zips" them together into a single list of tuples (in Python 3 it returns an iterator instead, so 
# wrap it in list() to see the tuples). Given a sequence s, enumerate(s) returns pairs consisting of an index and the item at 
# that index.

words = ['I', 'turned', 'off', 'the', 'spectroroute']
tags = ['noun', 'verb', 'prep', 'det', 'noun']
zip(words, tags)

[('I', 'noun'),
 ('turned', 'verb'),
 ('off', 'prep'),
 ('the', 'det'),
 ('spectroroute', 'noun')]

In [43]:
list(zip(words, tags))

[('I', 'noun'),
 ('turned', 'verb'),
 ('off', 'prep'),
 ('the', 'det'),
 ('spectroroute', 'noun')]

In [44]:
list(enumerate(words))

[(0, 'I'), (1, 'turned'), (2, 'off'), (3, 'the'), (4, 'spectroroute')]

In [45]:
list(enumerate(tags))

[(0, 'noun'), (1, 'verb'), (2, 'prep'), (3, 'det'), (4, 'noun')]

In [46]:
# For some NLP tasks it is necessary to cut up a sequence into two or more parts. For instance, we might want to "train" a system 
# on 90% of the data and test it on the remaining 10%. To do this we decide the location where we want to cut the data (the cut 
# variable below), then cut the sequence at that location with two slices.

text = nltk.corpus.nps_chat.words() # use chat words as text
cut = int(0.9 * len(text)) # 90% of the length, truncated to an integer (int() rounds toward zero, not to the nearest value)
training_data, test_data = text[:cut], text[cut:] # set training data to first 90% and test data to last 10%
text == training_data + test_data # test that text is concatenation of training data and test data

True

In [47]:
len(training_data) / len(test_data) # another test that length of training data is 90% of total data

9.0

In [48]:
# Combining Different Sequence Types

# Let's combine our knowledge of these three sequence types, together with list comprehensions, to perform the task of sorting the 
# words in a string by their length.

words = 'I turned off the spectroroute'.split() # Words is a list of words that are split by whitespace from this string

wordlens = [(len(word), word) for word in words] # Create a list of tuples (length of word, word)
# wordlens should look like [(1, 'I'), (6, 'turned'), (3, 'off'), (3, 'the'), (12, 'spectroroute')]

wordlens.sort() # sorts in place by length, then alphabetically among words of equal length

' '.join(w for (_, w) in wordlens)

'I off the turned spectroroute'

In [49]:
# Each of the above lines of code contains a significant feature. A simple string is actually an object with methods defined on it 
# such as split(). We use a list comprehension to build a list of tuples, where each tuple consists of a number (the word 
# length) and the word, e.g. (3, 'the'). We use the sort() method to sort the list in-place. Finally, we discard the length 
# information and join the words back into a single string. (The underscore is just a regular Python variable, but we can 
# use underscore by convention to indicate that we will not use its value.)

In [50]:
# We began by talking about the commonalities in these sequence types, but the above code illustrates important differences in their 
# roles. First, strings appear at the beginning and the end: this is typical in the context where our program is reading in some 
# text and producing output for us to read. Lists and tuples are used in the middle, but for different purposes. A list is typically 
# a sequence of objects all having the same type, of arbitrary length. We often use lists to hold sequences of words. In contrast, 
# a tuple is typically a collection of objects of different types, of fixed length. We often use a tuple to hold a record, a 
# collection of different fields relating to some entity. This distinction between the use of lists and tuples takes some getting 
# used to, so here is another example:

lexicon = [
    ('the', 'det', ['Di:', 'D@']),
    ('off', 'prep', ['Qf', 'O:f'])
]

In [51]:
# Here, a lexicon is represented as a list because it is a collection of objects of a single type — lexical entries — of no 
# predetermined length. An individual entry is represented as a tuple because it is a collection of objects with different 
# interpretations, such as the orthographic form, the part of speech, and the pronunciations (represented in the SAMPA 
# computer-readable phonetic alphabet http://www.phon.ucl.ac.uk/home/sampa/). Note that these pronunciations are stored using a 
# list. (Why?)

In [52]:
# Note

# A good way to decide when to use tuples vs lists is to ask whether the interpretation of an item depends on its position. 
# For example, a tagged token combines two strings having different interpretation, and we choose to interpret the first item as 
# the token and the second item as the tag. Thus we use tuples like this: ('grail', 'noun'); a tuple of the form ('noun', 'grail') 
# would be nonsensical since it would be a word noun tagged grail. In contrast, the elements of a text are all tokens, and position 
# is not significant. Thus we use lists like this: ['venetian', 'blind']; a list of the form ['blind', 'venetian'] would be equally 
# valid. The linguistic meaning of the words might be different, but the interpretation of list items as tokens is unchanged.

In [53]:
# The distinction between lists and tuples has been described in terms of usage. However, there is a more fundamental difference: 
# in Python, lists are mutable, while tuples are immutable. In other words, lists can be modified, while tuples cannot. Here are 
# some of the operations on lists that do in-place modification of the list.

lexicon.sort()
lexicon[1] = ('turned', 'VBD', ['t3:nd', 't3`nd'])
del lexicon[0]
lexicon

[('turned', 'VBD', ['t3:nd', 't3`nd'])]
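
# By contrast, the same kind of modification fails on a tuple. The sketch below is only an illustration: entry and error are 
# assumed names, and the string 'placeholder' is made-up data rather than a real pronunciation.

```python
entry = ('turned', 'VBD', ['t3:nd', 't3`nd'])

try:
    entry[1] = 'VBN'               # tuples do not support item assignment
    error = None
except TypeError as e:
    error = e                      # keep the exception so we can inspect it

entry[2].append('placeholder')     # the list stored inside the tuple is still mutable

print(type(error).__name__)
print(entry)
```

# Immutability applies to the tuple's own slots, not to the objects those slots refer to.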

In [54]:
# Generator Expressions

# We've been making heavy use of list comprehensions, for compact and readable processing of texts. Here's an example where 
# we tokenize and normalize a text:

text = '''"When I use a word," Humpty Dumpty said in rather a scornful tone,
"it means just what I choose it to mean - neither more nor less."'''
[w.lower() for w in word_tokenize(text)]

['``',
 'when',
 'i',
 'use',
 'a',
 'word',
 ',',
 "''",
 'humpty',
 'dumpty',
 'said',
 'in',
 'rather',
 'a',
 'scornful',
 'tone',
 ',',
 "''",
 'it',
 'means',
 'just',
 'what',
 'i',
 'choose',
 'it',
 'to',
 'mean',
 '-',
 'neither',
 'more',
 'nor',
 'less',
 '.',
 "''"]

In [57]:
# Suppose we now want to process these words further. We can do this by inserting the above expression inside a call to some 
# other function [1], but Python allows us to omit the brackets [2].

max([w.lower() for w in word_tokenize(text)]) # [1]
# max() applied to strings returns the one that comes last in lexicographic sort order, which here is 'word'

'word'

In [58]:
max(w.lower() for w in word_tokenize(text)) # [2]

'word'

In [59]:
# The second line uses a generator expression. This is more than a notational convenience: in many language processing situations, 
# generator expressions will be more efficient. In [1], storage for the list object must be allocated before the value of max() 
# is computed. If the text is very large, this could be slow. In [2], the data is streamed to the calling function. Since the 
# calling function simply has to find the maximum value — the word which comes latest in lexicographic sort order — it can process 
# the stream of data without having to store anything more than the maximum value seen so far.
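
# The storage difference can be observed directly. This is a rough sketch under stated assumptions: the word list is made-up 
# sample data, and sys.getsizeof() reports only the size of the container object itself in CPython, not of the strings inside it.

```python
import sys

words = ['when', 'i', 'use', 'a', 'word'] * 1000

as_list = [w.upper() for w in words]    # all 5000 items are built up front
as_gen = (w.upper() for w in words)     # nothing is built until items are requested

list_size = sys.getsizeof(as_list)
gen_size = sys.getsizeof(as_gen)

result_list = max(as_list)
result_gen = max(as_gen)                # consumes the generator one item at a time

print(list_size, gen_size, result_list, result_gen)
```

# One caveat: a generator can be consumed only once, so after the call to max() above as_gen is exhausted.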

# 4.3   Questions of Style

In [61]:
# Programming is as much an art as a science. The undisputed "bible" of programming, a 2,500 page multi-volume work by Donald Knuth, 
# is called The Art of Computer Programming. Many books have been written on Literate Programming, recognizing that humans, not 
# just computers, must read and understand programs. Here we pick up on some issues of programming style that have important 
# ramifications for the readability of your code, including code layout, procedural vs declarative style, and the use of loop 
# variables.

# Python Coding Style

# When writing programs you make many subtle choices about names, spacing, comments, and so on. When you look at code written by 
# other people, needless differences in style make it harder to interpret the code. Therefore, the designers of the Python 
# language have published a style guide for Python code, available at http://www.python.org/dev/peps/pep-0008/. The underlying 
# value presented in the style guide is consistency, for the purpose of maximizing the readability of code. We briefly review 
# some of its key recommendations here, and refer readers to the full guide for detailed discussion with examples.

# Code layout should use four spaces per indentation level. You should make sure that when you write Python code in a file, you 
# avoid tabs for indentation, since these can be misinterpreted by different text editors and the indentation can be messed up. 
# Lines should be less than 80 characters long; if necessary you can break a line inside parentheses, brackets, or braces, 
# because Python is able to detect that the line continues over to the next line. If you need to break a line outside 
# parentheses, brackets, or braces, you can often add extra parentheses, and you can always add a backslash at the end of the 
# line that is broken:

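syllables = ['my', 'name', 'is', 'vic', 'to', 'ry']  # placeholder data: a list of syllable strings

def process(syllables):  # placeholder so that the example runs
    print(syllables)

# The indices below stay within range for a three-letter syllable.
if (len(syllables) > 4 and len(syllables[2]) == 3 and
   syllables[2][2] in 'aeiou' and syllables[2][0] == syllables[1][0]):
    process(syllables)
if len(syllables) > 4 and len(syllables[2]) == 3 and \
   syllables[2][2] in 'aeiou' and syllables[2][0] == syllables[1][0]:
    process(syllables)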
    
# Note

# Typing spaces instead of tabs soon becomes a chore. Many programming editors have built-in support for Python, and can 
# automatically indent code and highlight any syntax errors (including indentation errors). For a list of Python-aware editors, 
# please see http://wiki.python.org/moin/PythonEditors.

In [62]:
# Procedural vs Declarative Style

# We have just seen how the same task can be performed in different ways, with implications for efficiency. Another factor 
# influencing program development is programming style. Consider the following program to compute the average length of words 
# in the Brown Corpus:

tokens = nltk.corpus.brown.words(categories='news') # words from the news category of the Brown Corpus
count = 0 # number of tokens seen so far
total = 0 # combined length of the tokens seen so far
for token in tokens:
    count += 1
    total += len(token)
total / count

# In this program we use the variable count to keep track of the number of tokens seen, and total to store the combined length of 
# all words. This is a low-level style, not far removed from machine code, the primitive operations performed by the computer's CPU. 
# The two variables are just like a CPU's registers, accumulating values at many intermediate stages, values that are meaningless 
# until the end. We say that this program is written in a procedural style, dictating the machine operations step by step. 

4.401545438271973

In [63]:
# Now consider the following program that computes the same thing:

total = sum(len(t) for t in tokens)
print(total / len(tokens))

4.40154543827


In [66]:
# The first line uses a generator expression to sum the token lengths, while the second line computes the average as before. 
# Each line of code performs a complete, meaningful task, which can be understood in terms of high-level properties like: "total 
# is the sum of the lengths of the tokens". Implementation details are left to the Python interpreter. The second program uses a 
# built-in function, and constitutes programming at a more abstract level; the resulting code is more declarative. 

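# Let's look at an extreme example. Each token triggers a linear scan of word_list, so this loop is very slow on a corpus of 
# this size; the run recorded below was interrupted by hand (hence the KeyboardInterrupt):

word_list = [] # sorted list of the distinct tokens seen so far
i = 0 # index of the current token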
while i < len(tokens):
    j = 0
    while j < len(word_list) and word_list[j] <= tokens[i]:
        j += 1
    if j == 0 or tokens[i] != word_list[j-1]:
        word_list.insert(j, tokens[i])
    i += 1

KeyboardInterrupt: 

In [67]:
# The equivalent declarative version uses familiar built-in functions, and its purpose is instantly recognizable:

word_list = sorted(set(tokens))
word_list

[u'!',
 u'$1',
 u'$1,000',
 u'$1,000,000,000',
 u'$1,500',
 u'$1,500,000',
 u'$1,600',
 u'$1,800',
 u'$1.1',
 u'$1.4',
 u'$1.5',
 u'$1.80',
 u'$10',
 u'$10,000',
 u'$10,000-per-year',
 u'$100',
 u'$100,000',
 u'$102,285,000',
 u'$109',
 u'$11.50',
 u'$115,000',
 u'$12',
 u'$12,192,865',
 u'$12,500',
 u'$12.50',
 u'$12.7',
 u'$120',
 u'$125',
 u'$135',
 u'$139.3',
 u'$14',
 u'$15',
 u'$15,000',
 u'$15,000,000',
 u'$150',
 u'$157,460',
 u'$16',
 u'$16,000',
 u'$17',
 u'$17,000',
 u'$17.8',
 u'$172,000',
 u'$172,400',
 u'$18',
 u'$18.2',
 u'$18.9',
 u'$2',
 u'$2,000',
 u'$2,170',
 u'$2,330,000',
 u'$2,700',
 u'$2.50',
 u'$2.80',
 u'$20',
 u'$20,000',
 u'$20,447,000',
 u'$200,000',
 u'$214',
 u'$22',
 u'$22.50',
 u'$2400',
 u'$25',
 u'$25,000',
 u'$25-a-plate',
 u'$250',
 u'$250,000',
 u'$251',
 u'$253,355,000',
 u'$26,000,000',
 u'$278,877,000',
 u'$28',
 u'$28,700,000',
 u'$29,000',
 u'$3',
 u'$3,500',
 u'$3,675',
 u'$3.5',
 u'$30',
 u'$300',
 u'$300,000,000',
 u'$3100',
 u'$32,000',
 ...

In [68]:
# Another case where a loop variable seems to be necessary is for printing a counter with each line of output. Instead, we can use 
# enumerate(), which processes a sequence s and produces a tuple of the form (i, s[i]) for each item in s, starting with (0, s[0]). 
# Here we enumerate the words of the frequency distribution from most to least frequent, giving pairs (rank, word). We print 
# rank+1 so that the counting appears to start from 1, as required when producing a list of ranked items.

fd = nltk.FreqDist(nltk.corpus.brown.words())
cumulative = 0.0
most_common_words = [word for (word, count) in fd.most_common()]
for rank, word in enumerate(most_common_words):
    cumulative += fd.freq(word)
    print("%3d %6.2f%% %s" % (rank + 1, cumulative * 100, word))
    if cumulative > 0.25:
        break

  1   5.40% the
  2  10.42% ,
  3  14.67% .
  4  17.78% of
  5  20.19% and
  6  22.40% to
  7  24.29% a
  8  25.97% in
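
# As an alternative to adding 1 by hand, enumerate() accepts a start argument. The sketch below uses a short made-up word list 
# in place of the Brown Corpus frequencies; ranked and lines are illustrative names.

```python
words = ['the', ',', '.', 'of']                 # placeholder for a ranked word list

ranked = list(enumerate(words, start=1))        # counting begins at 1 instead of 0
lines = ["%3d %s" % (rank, word) for rank, word in ranked]

for line in lines:
    print(line)
```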


In [69]:
# It's sometimes tempting to use loop variables to store a maximum or minimum value seen so far. Let's use this method to find the 
# longest word in a text.

text = nltk.corpus.gutenberg.words('milton-paradise.txt')
longest = ''
for word in text:
    if len(word) > len(longest):
        longest = word
longest

u'unextinguishable'

In [70]:
# However, a more transparent solution uses a generator expression and a list comprehension, both having forms that should be 
# familiar by now:

maxlen = max(len(word) for word in text)
[word for word in text if len(word) == maxlen]

# Note that our first solution found the first word having the longest length, while the second solution found all of the longest 
# words (which is usually what we would want). Although there's a theoretical efficiency difference between the two solutions, 
# the main overhead is reading the data into main memory; once it's there, a second pass through the data is effectively instantaneous. 
# We also need to balance our concerns about program efficiency with programmer efficiency. A fast but cryptic solution will be harder 
# to understand and maintain.

[u'unextinguishable',
 u'transubstantiate',
 u'inextinguishable',
 u'incomprehensible']

In [71]:
# Some Legitimate Uses for Counters

# There are cases where we still want to use loop variables in a list comprehension. For example, we need to use a loop variable 
# to extract successive overlapping n-grams from a list:

sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
n = 3
[sent[i:i+n] for i in range(len(sent)-n+1)]

# It is quite tricky to get the range of the loop variable right. Since this is a common operation in NLP, NLTK supports it with 
# functions bigrams(text) and trigrams(text), and a general purpose ngrams(text, n).

[['The', 'dog', 'gave'],
 ['dog', 'gave', 'John'],
 ['gave', 'John', 'the'],
 ['John', 'the', 'newspaper']]
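
# For comparison, here is one way to produce the same n-grams without writing the range by hand. This is a sketch, not NLTK's 
# implementation; the function name ngrams_sketch is hypothetical, and it returns tuples rather than lists.

```python
def ngrams_sketch(seq, n):
    """Return the overlapping n-grams of seq as a list of tuples."""
    # seq[0:], seq[1:], ..., seq[n-1:] are shifted views of the sequence;
    # zip() stops at the shortest one, which gives exactly the right count.
    return list(zip(*[seq[i:] for i in range(n)]))

sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
trigrams = ngrams_sketch(sent, 3)
print(trigrams)
```

# If n is larger than the sequence, the last shifted slice is empty and the result is an empty list.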

In [72]:
# Here's an example of how we can use loop variables in building multidimensional structures. For example, to build an array with 
# m rows and n columns, where each cell is a set, we could use a nested list comprehension:
m, n = 3, 7
array = [[set() for i in range(n)] for j in range(m)]
array[2][5].add('Alice')
pprint.pprint(array)

[[set([]), set([]), set([]), set([]), set([]), set([]), set([])],
 [set([]), set([]), set([]), set([]), set([]), set([]), set([])],
 [set([]), set([]), set([]), set([]), set([]), set(['Alice']), set([])]]


In [74]:
# Observe that the loop variables i and j are not used anywhere in the resulting object; they are just needed for a syntactically 
# correct for clause. As another example of this usage, observe that the expression ['very' for i in range(3)] produces a 
# list containing three instances of 'very', with no integers in sight.

# Note that it would be incorrect to do this work using multiplication, for reasons concerning object copying that were discussed 
# earlier in this section.

array = [[set()] * n] * m
array[2][5].add(7)
pprint.pprint(array)

# Iteration is an important programming device. It is tempting to adopt idioms from other languages. However, Python offers some 
# elegant and highly readable alternatives, as we have seen.

[[set([7]), set([7]), set([7]), set([7]), set([7]), set([7]), set([7])],
 [set([7]), set([7]), set([7]), set([7]), set([7]), set([7]), set([7])],
 [set([7]), set([7]), set([7]), set([7]), set([7]), set([7]), set([7])]]


# 4.4   Functions: The Foundation of Structured Programming

In [75]:
# Functions provide an effective way to package and re-use program code, as already explained in 3. For example, suppose we find 
# that we often want to read text from an HTML file. This involves several steps: opening the file, reading it in, normalizing 
# whitespace, and stripping HTML markup. We can collect these steps into a function, and give it a name such as get_text(), as 
# shown in 4.2.

In [76]:
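import re
def get_text(file):
    """Read text from a file, normalizing whitespace and stripping HTML markup."""
    with open(file) as f:               # the file is closed as soon as it has been read
        text = f.read()
    text = re.sub(r'<.*?>', ' ', text)  # replace each HTML tag with a space
    text = re.sub(r'\s+', ' ', text)    # collapse runs of whitespace (raw string avoids an invalid escape)
    return text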

# Now, any time we want to get cleaned-up text from an HTML file, we can just call get_text() with the name of the file as its 
# only argument. It will return a string, and we can assign this to a variable, e.g.: contents = get_text("test.html"). Each time 
# we want to use this series of steps we only have to call the function.

# Using functions has the benefit of saving space in our program. More importantly, our choice of name for the function helps 
# make the program readable. In the case of the above example, whenever our program needs to read cleaned-up text from a file 
# we don't have to clutter the program with four lines of code, we simply need to call get_text(). This naming helps to provide 
# some "semantic interpretation" — it helps a reader of our program to see what the program "means".

In [None]:
# Notice that the above function definition contains a string. The first string inside a function definition is called a docstring. 
# Not only does it document the purpose of the function to someone reading the code, it is accessible to a programmer who has loaded 
# the code from a file:

# |   >>> help(get_text)
# |   Help on function get_text in module __main__:
# |
# |   get_text(file)
# |       Read text from a file, normalizing whitespace and stripping HTML markup.
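
# The docstring can also be read directly from the function object. A small sketch follows; the function body is deliberately 
# left out, since only the docstring matters here.

```python
import inspect

def get_text(file):
    """Read text from a file, normalizing whitespace and stripping HTML markup."""
    raise NotImplementedError("body omitted in this sketch")

doc = get_text.__doc__              # the docstring exactly as written
clean = inspect.getdoc(get_text)    # the same text with indentation tidied up

print(doc)
```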

# We have seen that functions help to make our work reusable and readable. They also help make it reliable. When we re-use code 
# that has already been developed and tested, we can be more confident that it handles a variety of cases correctly. We also remove 
# the risk that we forget some important step, or introduce a bug. The program that calls our function also has increased 
# reliability. The author of that program is dealing with a shorter program, and its components behave transparently.

# To summarize, as its name suggests, a function captures functionality. It is a segment of code that can be given a meaningful 
# name and which performs a well-defined task. Functions allow us to abstract away from the details, to see a bigger picture, 
# and to program more effectively.

# The rest of this section takes a closer look at functions, exploring the mechanics and discussing ways to make your programs 
# easier to read.

In [77]:
# Function Inputs and Outputs

# We pass information to functions using a function's parameters, the parenthesized list of variables and constants following 
# the function's name in the function definition. Here's a complete example:

# Define repeat(), which returns the string msg repeated num times, with the copies separated by spaces.
def repeat(msg, num):
    return ' '.join([msg] * num)

# define monty string
monty = 'Monty Python'

# Repeat monty string 3 times
repeat(monty, 3)

'Monty Python Monty Python Monty Python'

In [78]:
# We first define the function to take two parameters, msg and num. Then we call the function and pass it two arguments, 
# monty and 3; these arguments fill the "placeholders" provided by the parameters and provide values for the occurrences of 
# msg and num in the function body.

# It is not necessary to have any parameters, as we see in the following example:

def monty():
    return "Monty Python"
monty()

'Monty Python'

In [79]:
# A function usually communicates its results back to the calling program via the return statement, as we have just seen. To the 
# calling program, it looks as if the function call had been replaced with the function's result, e.g.:

repeat(monty(), 3)

'Monty Python Monty Python Monty Python'

In [80]:
repeat('Monty Python', 3)

'Monty Python Monty Python Monty Python'

In [81]:
# A Python function is not required to have a return statement. Some functions do their work as a side effect, printing a result, 
# modifying a file, or updating the contents of a parameter to the function (such functions are called "procedures" in some other 
# programming languages).

# Consider the following three sort functions. The third one is dangerous because a programmer could use it without realizing that 
# it had modified its input. In general, functions should modify the contents of a parameter (my_sort1()), or return a value 
# (my_sort2()), not both (my_sort3()).

def my_sort1(mylist):      # good: modifies its argument, no return value
    mylist.sort()

In [82]:
def my_sort2(mylist):      # good: doesn't touch its argument, returns value
    return sorted(mylist)

In [83]:
def my_sort3(mylist):      # bad: modifies its argument and also returns it
    mylist.sort()
    return mylist
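
# To see why the third style is risky, the three functions can be run side by side. This is only a sketch; the word lists 
# are invented for the demonstration.

```python
def my_sort1(mylist):      # modifies its argument, no return value
    mylist.sort()

def my_sort2(mylist):      # doesn't touch its argument, returns a new list
    return sorted(mylist)

def my_sort3(mylist):      # modifies its argument and also returns it
    mylist.sort()
    return mylist

words1 = ['pear', 'apple', 'fig']
result1 = my_sort1(words1)         # result1 is None; words1 is now sorted

words2 = ['pear', 'apple', 'fig']
result2 = my_sort2(words2)         # words2 keeps its original order

words3 = ['pear', 'apple', 'fig']
result3 = my_sort3(words3)         # looks like my_sort2, but words3 has changed too
aliased = result3 is words3        # the "result" is the very same list object

print(result1, words1)
print(result2, words2)
print(result3, words3, aliased)
```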

In [84]:
# Parameter Passing

# Back in 4.1 you saw that assignment works on values, but that the value of a structured object is a reference to that object. 
# The same is true for functions. Python interprets function parameters as values (this is known as call-by-value). In the following 
# code, set_up() has two parameters, both of which are modified inside the function. We begin by assigning an empty string to w and 
# an empty list to p. After calling the function, w is unchanged, while p is changed:

def set_up(word, properties):
    word = 'lolcat'
    properties.append('noun')
    properties = 5

w = ''
p = []
set_up(w, p)
w

''

In [85]:
p

['noun']

In [86]:
# Notice that w was not changed by the function. When we called set_up(w, p), the value of w (an empty string) was assigned to a 
# new variable word. Inside the function, the value of word was modified. However, that change did not propagate to w. This parameter 
# passing is identical to the following sequence of assignments:
  
w = ''
word = w
word = 'lolcat'
w

''

In [87]:
# Let's look at what happened with the list p. When we called set_up(w, p), the value of p (a reference to an empty list) was 
# assigned to a new local variable properties, so both variables now reference the same list object. The function modifies 
# that object through properties, and this change is also reflected in the value of p as we saw. The function also assigned a new 
# value to properties (the number 5); this did not modify the list, but merely rebound the local name properties to a different 
# object. This behavior is just as if we had done the following sequence of assignments:

p = []
properties = p
properties.append('noun')
properties = 5
p

# Thus, to understand Python's call-by-value parameter passing, it is enough to understand how assignment works. Remember that you 
# can use the id() function and is operator to check your understanding of object identity after each statement.

['noun']
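
# The identity checks suggested above can be sketched as follows. The same_* variable names are introduced here only to 
# record the result of each check.

```python
p = []
properties = p                                  # both names refer to one list object
same_after_assign = properties is p             # True

properties.append('noun')                       # mutates the shared list in place
same_after_append = id(properties) == id(p)     # still True: same object, new contents

properties = 5                                  # rebinds the name; the list is untouched
same_after_rebind = properties is p             # False

print(same_after_assign, same_after_append, same_after_rebind)
print(p)
```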

In [None]:
# Checking Parameter Types

# Python does not require us to declare the type of a variable when we write a program, and this permits us to define functions 
# that are flexible about the type of their arguments. For example, a tagger might expect a sequence of words, but it wouldn't 
# care whether this sequence is expressed as a list or a tuple (or an iterator, another sequence type that is outside the scope 
# of the current discussion).

# However, often we want to write programs for later use by others, and want to program in a defensive style, providing useful 
# warnings when functions have not been invoked correctly. The author of the following tag() function assumed that its argument 
# would always be a string.

In [88]:
def tag(word):
    if word in ['a', 'the', 'all']:
        return 'det'
    else:
        return 'noun'

tag('the')

'det'

In [89]:
tag('knight')

'noun'

In [90]:
tag(["'Tis", 'but', 'a', 'scratch'])

'noun'

In [91]:
# Here's a better solution, using an assert statement together with Python 2's basestring type, which generalizes over both 
# unicode and str. (basestring does not exist in Python 3, where str alone plays this role.)

def tag(word):
    assert isinstance(word, basestring), "argument to tag() must be a string"
    if word in ['a', 'the', 'all']:
        return 'det'
    else:
        return 'noun'

In [92]:
tag('the')

'det'

In [93]:
tag('knight')

'noun'

In [94]:
tag(["'Tis", 'but', 'a', 'scratch'])

AssertionError: argument to tag() must be a string

In [95]:
# If the assert statement fails, it will produce an error that cannot be ignored, since it halts program execution (unless Python 
# was started with the -O option, which skips assert statements). Additionally, the error message is easy to interpret. Adding 
# assertions to a program helps you find logical errors, and is a kind of defensive programming. A more fundamental approach is 
# to document the parameters to each function using docstrings as described later in this section.
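
# As a sketch of how the same check could be written so that it runs under both Python 2 and Python 3, the version below 
# picks the string type at import time. The name string_types is an assumption made for this example, not part of NLTK or the 
# standard library; the docstring shows one simple way to document the parameter.

```python
try:
    string_types = basestring      # Python 2: covers both str and unicode
except NameError:
    string_types = str             # Python 3: str is the only text type

def tag(word):
    """Return a coarse part-of-speech tag for a single word.

    word -- a string (not a list of strings)
    """
    assert isinstance(word, string_types), "argument to tag() must be a string"
    if word in ['a', 'the', 'all']:
        return 'det'
    else:
        return 'noun'

message = None
try:
    tag(["'Tis", 'but', 'a', 'scratch'])
except AssertionError as err:
    message = str(err)             # the text supplied to the assert statement

print(tag('the'), tag('knight'), message)
```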

# Functional Decomposition

# Well-structured programs usually make extensive use of functions. When a block of program code grows longer than 10-20 lines, 
# it is a great help to readability if the code is broken up into one or more functions, each one having a clear purpose. This is 
# analogous to the way a good essay is divided into paragraphs, each expressing one main idea.

# Functions provide an important kind of abstraction. They allow us to group multiple actions into a single, complex action, 
# and associate a name with it. (Compare this with the way we combine the actions of go and bring back into a single more 
# complex action fetch.) When we use functions, the main program can be written at a higher level of abstraction, making its 
# structure transparent, e.g.

# data = load_corpus()
# results = analyze(data)
# present(results)

# Appropriate use of functions makes programs more readable and maintainable. Additionally, it becomes possible to reimplement a 
# function — replacing the function's body with more efficient code — without having to be concerned with the rest of the program.
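
# The three-line main program above can be made concrete with a toy example. All three function bodies below are 
# hypothetical stand-ins (a hard-coded sentence instead of a real corpus, a simple count as the "analysis"), chosen only 
# to show how the decomposition keeps the top level readable.

```python
from collections import Counter

def load_corpus():
    """Hypothetical loader: returns a tiny hard-coded token list."""
    text = "the cat sat on the mat and the dog sat too"
    return text.split()

def analyze(data):
    """Count how often each token occurs."""
    return Counter(data)

def present(results, n=2):
    """Format the n most frequent tokens, one per line."""
    lines = ["%-5s %d" % (word, count) for word, count in results.most_common(n)]
    return "\n".join(lines)

# The main program reads like the outline in the text.
data = load_corpus()
results = analyze(data)
report = present(results)
print(report)
```

# Any one of the three functions could later be replaced (say, load_corpus() reading from disk) without touching the others.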

In [96]:
# Consider the freq_words function in 4.3. It updates the contents of a frequency distribution that is passed in as a parameter, 
# and it also prints a list of the n most frequent words.

try:
    from urllib import request        # Python 3
except ImportError:
    import urllib2 as request         # Python 2: urllib2 provides the same urlopen() call
from bs4 import BeautifulSoup

def freq_words(url, freqdist, n):
    html = request.urlopen(url).read().decode('utf8')
    raw = BeautifulSoup(html, 'html.parser').get_text()   # name the parser explicitly
    for word in word_tokenize(raw):
        freqdist[word.lower()] += 1
    result = []
    for word, count in freqdist.most_common(n):
        result = result + [word]
    print(result)
    
constitution = "http://www.archives.gov/exhibits/charters/constitution_transcript.html"
fd = nltk.FreqDist()
freq_words(constitution, fd, 30)
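
# freq_words() both updates its freqdist argument and prints, which is the mixing of side effects cautioned against 
# earlier in this section. A minimal sketch of a side-effect-free variant follows. It is not the book's code: the name 
# top_words is hypothetical, it takes text rather than a URL so that it runs offline, and it uses collections.Counter and 
# str.split() in place of nltk.FreqDist and word_tokenize().

```python
from collections import Counter

def top_words(text, n):
    """Return the n most frequent lowercased, whitespace-separated tokens in text."""
    counts = Counter(word.lower() for word in text.split())
    return [word for word, count in counts.most_common(n)]

# Made-up sample input; the caller decides what to do with the returned list.
sample = "We the People of the United States , in Order to form a more perfect Union , establish Justice"
top = top_words(sample, 2)
print(top)
```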

# 4.5   Doing More with Functions

In [97]:
# This section discusses more advanced features, which you may prefer to skip on the first time through this chapter.

# Functions as Arguments

# So far the arguments we have passed into functions have been simple objects like strings, or structured objects like lists. 
# Python also lets us pass a function as an argument to another function. Now we can abstract out the operation, and apply a 
# different operation on the same data. As the following examples show, we can pass the built-in function len() or a user-defined 
# function last_letter() as arguments to another function:

sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']

def extract_property(prop):
    return [prop(word) for word in sent]

extract_property(len)

# Note: here we pass the function len itself rather than a data object; extract_property() applies it to every word in sent 
# and returns the list of results.

[4, 4, 2, 3, 5, 1, 3, 3, 6, 4, 4, 4, 2, 10, 1]

In [99]:
def last_letter(word):
    return word[-1]
extract_property(last_letter)

['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.']

In [100]:
# The objects len and last_letter can be passed around like lists and dictionaries. Notice that parentheses are only used after 
# a function name if we are invoking the function; when we are simply treating the function as an object these are omitted.

# Python provides us with one more way to define functions as arguments to other functions, so-called lambda expressions. 
# Supposing there was no need to use the above last_letter() function in multiple places, and thus no need to give it a name. 
# We can equivalently write the following:

extract_property(lambda w: w[-1])

['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.']

In [101]:
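# Our next example illustrates passing a function to the sorted() function. When we call the latter with a single argument 
# (the list to be sorted), it uses the built-in comparison function cmp(). However, we can supply our own sort function, e.g. 
# to sort by decreasing length. (This form is specific to Python 2: cmp() and the comparison-function argument were removed in 
# Python 3, which takes a key function instead.)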

sorted(sent)

[',',
 '.',
 'Take',
 'and',
 'care',
 'care',
 'of',
 'of',
 'sense',
 'sounds',
 'take',
 'the',
 'the',
 'themselves',
 'will']

In [102]:
sorted(sent, cmp)

[',',
 '.',
 'Take',
 'and',
 'care',
 'care',
 'of',
 'of',
 'sense',
 'sounds',
 'take',
 'the',
 'the',
 'themselves',
 'will']

In [103]:
sorted(sent, lambda x, y: cmp(len(y), len(x)))

['themselves',
 'sounds',
 'sense',
 'Take',
 'care',
 'will',
 'take',
 'care',
 'the',
 'and',
 'the',
 'of',
 'of',
 ',',
 '.']
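
# The same ordering can be obtained with a key function, which works in both Python 2.7 and Python 3. The sketch below 
# also shows functools.cmp_to_key() as one way to reuse an old-style comparison function; the helper name 
# by_decreasing_length is made up for this example.

```python
from functools import cmp_to_key

sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']

# Sort by length, longest first; words of equal length keep their original order.
by_length = sorted(sent, key=len, reverse=True)

def by_decreasing_length(x, y):
    """Old-style comparison: negative if x should come before y."""
    return len(y) - len(x)

# Wrap the comparison function so that sorted() can use it as a key.
ported = sorted(sent, key=cmp_to_key(by_decreasing_length))

print(by_length)
```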

In [None]:
# Accumulative Functions

# These functions start by initializing some storage, and iterate over input to build it up, before returning some final object 
# (a large structure or aggregated result). A standard way to do this is to initialize an empty list, accumulate the material, 
# then return the list, as shown in function search1() in 4.6.
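
# The listing referred to is not reproduced in these notes. As a rough sketch of the pattern just described (initialize an 
# empty list, accumulate, return), a function in that style might look like this; the body and the word list are assumptions 
# made for illustration.

```python
def search1(substring, words):
    """Return every word that contains substring, in input order."""
    result = []                      # initialize the storage
    for word in words:               # iterate over the input
        if substring in word:
            result.append(word)      # accumulate matching items
    return result                    # return the finished list

words = ['zebra', 'gazelle', 'lion', 'maze']
matches = search1('z', words)
print(matches)                       # ['zebra', 'gazelle', 'maze']
```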

# I skipped the rest of this section...

In [None]:
# Skipped Higher-Order Functions - discusses functional programming
# Skipped Named Arguments

# I skipped the rest of the chapter. The remaining material is interesting, but it concerns programming and Python in general 
# rather than NLTK, and the chapter itself notes that much of it is optional and can be skipped on a first pass.