In [1]:
# 3   Processing Raw Text

# The most important source of texts is undoubtedly the Web. It's convenient to have existing text collections to explore, such as the 
# corpora we saw in the previous chapters. However, you probably have your own text sources in mind, and need to learn how to access 
# them.

# The goal of this chapter is to answer the following questions:

# 1. How can we write programs to access text from local files and from the web, in order to get hold of an unlimited range of 
#    language material?
# 2. How can we split documents up into individual words and punctuation symbols, so we can carry out the same kinds of analysis 
#    we did with text corpora in earlier chapters?
# 3. How can we write programs to produce formatted output and save it in a file?

# In order to address these questions, we will be covering key concepts in NLP, including tokenization and stemming. Along the way 
# you will consolidate your Python knowledge and learn about strings, files, and regular expressions. Since so much text on the web 
# is in HTML format, we will also see how to dispense with markup.

# Note

# Important: From this chapter onwards, our program samples will assume you begin your interactive session or your program with the 
# following import statements:

from __future__ import division  # Python 2 users only
import nltk, re, pprint
from nltk import word_tokenize

In [2]:
# 3.1   Accessing Text from the Web and from Disk

# Electronic Books

# A small sample of texts from Project Gutenberg appears in the NLTK corpus collection. However, you may be interested in analyzing 
# other texts from Project Gutenberg. You can browse the catalog of 25,000 free online books at http://www.gutenberg.org/catalog/, 
# and obtain a URL to an ASCII text file. Although 90% of the texts in Project Gutenberg are in English, it includes material in over 
# 50 other languages, including Catalan, Chinese, Dutch, Finnish, French, German, Italian, Portuguese and Spanish (with more than 100 
# texts each).

# Text number 2554 is an English translation of Crime and Punishment, and we can access it as follows.

from urllib2 import urlopen
# Import urlopen from urllib2 module

url = "http://www.gutenberg.org/files/2554/2554.txt"
# Specify url as this particular text string

response = urlopen(url)
# urlopen() sends the request and returns a file-like response object that we can read from

raw = response.read().decode('utf8')
# read() returns the raw bytes of the file; decode('utf8') interprets those bytes as UTF-8 and returns a Unicode string

type(raw)
# what is the type of the raw string data we read? Unicode

unicode
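
# The decode step can be tried in miniature without touching the network. This is only a sketch: the sample text is
# invented, and the type names mentioned in the code comment are those reported by Python 3 (Python 2 reports str and unicode).

```python
# Bytes as they might arrive from response.read(); the sample text is invented
data = u'caf\xe9'.encode('utf8')

# decode() interprets the bytes under the named encoding and returns text
text = data.decode('utf8')

# The accented letter takes two bytes in UTF-8 but is a single character
print('{0} bytes -> {1} characters'.format(len(data), len(text)))

# On Python 3 this prints: bytes -> str
print('{0} -> {1}'.format(type(data).__name__, type(text).__name__))
```

# The byte count and the character count differ, which is why len(raw) should be taken after decoding.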

In [3]:
len(raw)
# number of characters (including spaces) from this text file from the web

1176896

In [4]:
raw[:75]
# what are the first 75 characters from this text file?

u'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'

In [5]:
# Note: add this revised code to the GitHub issue tracker!

In [7]:
# The variable raw contains a string with 1,176,896 characters. (We can see that it is a string, using type(raw).) This is the raw 
# content of the book, including many details we are not interested in such as whitespace, line breaks and blank lines. Notice the \r 
# and \n in the opening line of the file, which is how Python displays the special carriage return and line feed characters (the file 
# must have been created on a Windows machine). For our language processing, we want to break up the string into words and 
# punctuation, as we saw in Chapter 1. This step is called tokenization, and it produces our familiar structure, a list of words and 
# punctuation.

tokens = word_tokenize(raw)
# word_tokenize converts raw string data into word tokens

type(tokens)
# Shows that these tokens are placed in a list

list

In [8]:
len(tokens)
# Return the number of word tokens

254352

In [9]:
tokens[:10]
# let's return the first 10 words/tokens

[u'The',
 u'Project',
 u'Gutenberg',
 u'EBook',
 u'of',
 u'Crime',
 u'and',
 u'Punishment',
 u',',
 u'by']
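
# To make the idea of tokenization concrete, here is a rough stand-in that needs only the standard library. The helper
# name simple_tokenize is hypothetical, and the regular expression is an assumption for illustration; it is not the
# method word_tokenize uses, and it handles contractions such as "don't" differently.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# The opening line of the downloaded file, as shown by raw[:75]
sample = "The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n"

tokens = simple_tokenize(sample)
print(tokens[:10])
```

# On this one line the first ten tokens agree with those shown above; on harder input the two tokenizers will diverge.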

In [10]:
# Notice that NLTK was needed for tokenization, but not for any of the earlier tasks of opening a URL and reading it into a string. 
# If we now take the further step of creating an NLTK text from this list, we can carry out all of the other linguistic processing we 
# saw in Chapter 1, along with the regular list operations like slicing:

text = nltk.Text(tokens)
# Wrap the token list in an nltk.Text object, which adds methods such as concordance() and collocations()

type(text)
# Now we have an NLTK text

nltk.text.Text

In [11]:
text[1024:1062]
# return this subset of words/tokens

[u'CHAPTER',
 u'I',
 u'On',
 u'an',
 u'exceptionally',
 u'hot',
 u'evening',
 u'early',
 u'in',
 u'July',
 u'a',
 u'young',
 u'man',
 u'came',
 u'out',
 u'of',
 u'the',
 u'garret',
 u'in',
 u'which',
 u'he',
 u'lodged',
 u'in',
 u'S.',
 u'Place',
 u'and',
 u'walked',
 u'slowly',
 u',',
 u'as',
 u'though',
 u'in',
 u'hesitation',
 u',',
 u'towards',
 u'K.',
 u'bridge',
 u'.']

In [12]:
text.collocations()

# Remember, "Collocations are expressions of multiple words which commonly co-occur."

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; Nikodim Fomitch; young man; Ilya Petrovitch; n't know;
Project Gutenberg; Dmitri Prokofitch; Andrey Semyonovitch; Hay Market


In [13]:
# Notice that Project Gutenberg appears as a collocation. This is because each text downloaded from Project Gutenberg contains a 
# header with the name of the text, the author, the names of people who scanned and corrected the text, a license, and so on. 
# Sometimes this information appears in a footer at the end of the file. We cannot reliably detect where the content begins and ends, 
# and so have to resort to manual inspection of the file, to discover unique strings that mark the beginning and the end, before 
# trimming raw to be just the content and nothing else:

raw.find("PART I")

5338

In [14]:
raw.rfind("End of Project Gutenberg's Crime")

1157746

In [15]:
# Here we trim raw down to the content itself, dropping the header and footer metadata.
# The find() and rfind() ("reverse find") methods give us the index values to use for slicing the string. We overwrite 
# raw with this slice, so now it begins with "PART I" and goes up to (but not including) the phrase that marks the end of the content.

start = raw.find("PART I")
end = raw.rfind("End of Project Gutenberg's Crime")
raw = raw[start:end]  # for the file downloaded above this is raw[5338:1157746]
raw.find("PART I")

# This was our first brush with the reality of the web: texts found on the web may contain unwanted material, and there may not be an 
# automatic way to remove it. But with a small amount of extra work we can extract the material we need.

0
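
# The same trimming steps can be tried on a tiny invented string. The sample text below is made up; only the two marker
# phrases come from the cells above. The check for -1 is an added precaution, not part of the book's code.

```python
# Invented miniature "download": header, content, footer
sample = ("Produced by volunteers\n"
          "PART I\n"
          "On an exceptionally hot evening early in July\n"
          "End of Project Gutenberg's Crime and Punishment\n"
          "License text")

start = sample.find("PART I")
end = sample.rfind("End of Project Gutenberg's Crime")

# find() and rfind() return -1 when the marker is missing; slicing with -1
# would silently cut at the wrong place, so check before slicing
if start == -1 or end == -1:
    raise ValueError("content markers not found")

body = sample[start:end]
print(body)
```

# Storing the positions in variables avoids copying index numbers by hand, which is easy to get wrong when the file changes.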

In [16]:
# Dealing with HTML

# Much of the text on the web is in the form of HTML documents. You can use a web browser to save a page as text to a local file, 
# then access this as described in the section on files below. However, if you're going to do this often, it's easiest to get Python 
# to do the work directly. The first step is the same as before, using urlopen. For fun we'll pick a BBC News story called Blondes to 
# die out in 200 years, an urban legend passed along by the BBC as established scientific fact:

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"

# In Python 2 we call urlopen (imported from urllib2 above) where the book's Python 3 code calls request.urlopen
html = urlopen(url).read().decode('utf8')
html[:60]

u'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

In [18]:
# You can type print(html) to see the HTML content in all its glory, including meta tags, an image map, JavaScript, forms, and tables.

# To get text out of HTML we will use a Python library called BeautifulSoup, available from 
# http://www.crummy.com/software/BeautifulSoup/:

from bs4 import BeautifulSoup
raw = BeautifulSoup(html, "lxml").get_text()  # "lxml" names the parser explicitly; it requires the lxml package
tokens = word_tokenize(raw)
tokens

[u'BBC',
 u'NEWS',
 u'|',
 u'Health',
 u'|',
 u'Blondes',
 u"'to",
 u'die',
 u'out',
 u'in',
 u'200',
 u"years'",
 u'NEWS',
 u'SPORT',
 u'WEATHER',
 u'WORLD',
 u'SERVICE',
 u'A-Z',
 u'INDEX',
 u'SEARCH',
 u'You',
 u'are',
 u'in',
 u':',
 u'Health',
 u'News',
 u'Front',
 u'Page',
 u'Africa',
 u'Americas',
 u'Asia-Pacific',
 u'Europe',
 u'Middle',
 u'East',
 u'South',
 u'Asia',
 u'UK',
 u'Business',
 u'Entertainment',
 u'Science/Nature',
 u'Technology',
 u'Health',
 u'Medical',
 u'notes',
 u'--',
 u'--',
 u'--',
 u'--',
 u'--',
 u'--',
 u'-',
 u'Talking',
 u'Point',
 u'--',
 u'--',
 u'--',
 u'--',
 u'--',
 u'--',
 u'-',
 u'Country',
 u'Profiles',
 u'In',
 u'Depth',
 u'--',
 u'--',
 u'--',
 u'--',
 u'--',
 u'--',
 u'-',
 u'Programmes',
 u'--',
 u'--',
 u'--',
 u'--',
 u'--',
 u'--',
 u'-',
 u'SERVICES',
 u'Daily',
 u'E-mail',
 u'News',
 u'Ticker',
 u'Mobile/PDAs',
 u'--',
 u'--',
 u'--',
 u'--',
 u'--',
 u'--',
 u'-',
 u'Text',
 u'Only',
 u'Feedback',
 u'Help',
 u'EDITIONS',
 u'Change',
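
# To see roughly what "getting text out of HTML" involves, here is a sketch that uses only the standard library instead of
# BeautifulSoup. It is written for Python 3 (the module is named HTMLParser in Python 2). The class name TextExtractor and
# the sample markup are hypothetical, and this is far cruder than what get_text() does on real pages.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

# Invented sample markup
html = ("<html><head><title>BBC NEWS</title><script>var x = 1;</script></head>"
        "<body><p>Blondes 'to die out in 200 years'</p></body></html>")

parser = TextExtractor()
parser.feed(html)
parser.close()

# Join the collected pieces and collapse runs of whitespace
text = " ".join(" ".join(parser.parts).split())
print(text)
```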
 

In [19]:
# This still contains unwanted material concerning site navigation and related stories. With some trial and error you can find the 
# start and end indexes of the content and select the tokens of interest, and initialize a text as before.

tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin


In [20]:
# Processing Search Engine Results

# The web can be thought of as a huge corpus of unannotated text. Web search engines provide an efficient means of searching this large 
# quantity of text for relevant linguistic examples. The main advantage of search engines is size: since you are searching such a large 
# set of documents, you are more likely to find any linguistic pattern you are interested in. Furthermore, you can make use of very 
# specific patterns, which would only match one or two examples in a smaller corpus, but which might match tens of thousands of 
# examples when run on the web. A second advantage of web search engines is that they are very easy to use. Thus, they provide a very 
# convenient tool for quickly checking a theory, to see if it is reasonable.

# Unfortunately, search engines have some significant shortcomings. First, the allowable range of search patterns is severely 
# restricted. Unlike local corpora, where you write programs to search for arbitrarily complex patterns, search engines generally only 
# allow you to search for individual words or strings of words, sometimes with wildcards. Second, search engines give inconsistent 
# results, and can give widely different figures when used at different times or in different geographical regions. When content has 
# been duplicated across multiple sites, search results may be boosted. Finally, the markup in the result returned by a search engine 
# may change unpredictably, breaking any pattern-based method of locating particular content (a problem which is ameliorated by the use 
# of search engine APIs).

In [21]:
# Processing RSS Feeds

# The blogosphere is an important source of text, in both formal and informal registers. With the help of a Python library called the 
# Universal Feed Parser, available from https://pypi.python.org/pypi/feedparser, we can access the content of a blog, as shown below:

import feedparser
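# Install the library first by typing "pip install feedparser" at the Windows Command Prompt. If the import still fails, as it
# does below, pip has probably installed the package into a different Python installation from the one this notebook uses.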

llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']['title']

ImportError: No module named feedparser

In [None]:
len(llog.entries)

In [None]:
post = llog.entries[2]
post.title

In [None]:
content = post.content[0].value
content[:70]

In [None]:
raw = BeautifulSoup(content, "lxml").get_text()  # name the parser explicitly, as in the BBC example
word_tokenize(raw)

# With some further work, we can write programs to create a small corpus of blog posts, and use this as the basis for our NLP work.

In [22]:
# Reading Local Files

# In order to read a local file, we need to use Python's built-in open() function, followed by the read() method. Suppose you have a 
# file document.txt, you can load its contents like this:

# first create a document.txt in c:\cgraph

f = open('document.txt')
raw = f.read()

In [23]:
# To check that the file that you are trying to open is really in the right directory, use IDLE's Open command in the File menu; this
# will display a list of all the files in the directory where IDLE is running. An alternative is to examine the current directory 
# from within Python:

import os
os.listdir('.')

# Note, I can use this in the context of my Migraine Dropbox folders, etc. Very useful. Also reference this in Learning Python book. 

['.ipynb_checkpoints',
 'Causal Extractor Oct 2015.ipynb',
 'data.csv',
 'data_big.csv',
 'data_original.csv',
 'document.txt',
 'NLTK Book Notes.docx',
 'NLTKCh1.ipynb',
 'NLTKCh2.ipynb',
 'NLTKCh3.ipynb',
 'NLTKCh4.ipynb',
 "NLTKCh5 (Bob-HP's conflicted copy 2015-08-30).ipynb",
 'NLTKCh5.ipynb',
 'NLTKCh6.ipynb',
 "NLTKCh7 (Bob-HP's conflicted copy 2015-08-30).ipynb",
 'NLTKCh7.ipynb',
 'Simple_Causal_Extractor.ipynb',
 't2.pkl']

In [24]:
# Another possible problem you might have encountered when accessing a text file is the newline conventions, which are different for 
# different operating systems. The built-in open() function has a second parameter for controlling how the file is opened: 
# open('document.txt', 'rU') — 'r' means to open the file for reading (the default), and 'U' stands for "Universal", which lets us 
# ignore the different conventions used for marking newlines.

# Assuming that you can open the file, there are several methods for reading it. The read() method creates a string with the contents
# of the entire file:

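f.read()

# The result here is an empty string because f was already read to the end in In [22]; to read the contents again, reopen the 
# file or call f.seek(0) first. In the file's contents, the '\n' characters are newlines; this is equivalent to pressing Enter 
# on a keyboard and starting a new line.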

''

In [25]:
# We can also read a file one line at a time using a for loop:

f = open('document.txt', 'rU')
for line in f:
    print(line.strip())
    
# Here we use the strip() method to remove the newline character at the end of the input line.

This is a sample document for the NLTK Book, Chapter 3.
I am so excited to work in NLTK and with NLP.
In no time, I will be a super star!
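
# The reading steps above can be tried end to end on a throwaway file. This sketch makes several assumptions: it writes an
# invented two-line file into a temporary directory, uses io.open so that the same code runs on Python 2 and 3, and uses
# "with" blocks so that each file is closed automatically. In recent Python 3 versions the 'U' flag is no longer accepted,
# because newline translation is already the default in text mode.

```python
import io
import os
import tempfile

# Create a small file with mixed newline conventions (invented content)
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'document.txt')
with io.open(path, 'w', encoding='utf8', newline='') as f:
    f.write(u'first line\r\nsecond line\n')

# Read line by line; strip() removes the trailing newline
with io.open(path, encoding='utf8') as f:
    lines = [line.strip() for line in f]

# read() consumes the file, so a second read() returns an empty string
with io.open(path, encoding='utf8') as f:
    first = f.read()
    second = f.read()

print(lines)
print(repr(second))

# Clean up the temporary file and directory
os.remove(path)
os.rmdir(tmpdir)
```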


In [26]:
# NLTK's corpus files can also be accessed using these methods. We simply have to use nltk.data.find() to get the filename for any 
# corpus item. Then we can open and read it in the way we just demonstrated above:

path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
raw = open(path, 'rU').read()

In [27]:
# Extracting Text from PDF, MSWord and other Binary Formats

# ASCII text and HTML text are human readable formats. Text often comes in binary formats — like PDF and MSWord — that can only be 
# opened using specialized software. Third-party libraries such as pypdf and pywin32 provide access to these formats. Extracting 
# text from multi-column documents is particularly challenging. For once-off conversion of a few documents, it is simpler to open the 
# document with a suitable application, then save it as text to your local drive, and access it as described above. If the document 
# is already on the web, you can enter its URL in Google's search box. The search result often includes a link to an HTML version of 
# the document, which you can save as text.

# Capturing User Input

# Sometimes we want to capture the text that a user inputs when she is interacting with our program. To prompt the user to type a 
# line of input, call the Python function input() (Python 3) or raw_input() (Python 2, used here). After saving the input to a 
# variable, we can manipulate it just as we have done for other strings.

s = raw_input("Enter some text: ")

# Note: in Python 2, input() evaluates what the user types as a Python expression, so plain text such as Bob raises an error;
# raw_input() returns the typed text as a string. In Python 3, input() behaves like Python 2's raw_input().

Enter some text: Bob


In [28]:
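print("You typed", len(word_tokenize(s)), "words.")
# In Python 2, print is a statement, so the parentheses above build a tuple, and the tuple is what gets printed.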

('You typed', 1, 'words.')


In [29]:
# The NLP Pipeline
# Figure 3.1 summarizes what we have covered in this section, including the process of building a vocabulary that we saw in 
# Chapter 1. (One step, normalization, will be discussed in Section 3.6.)

# Figure 3.1: The Processing Pipeline: We open a URL and read its HTML content, remove the markup and select a slice of characters; 
# this is then tokenized and optionally converted into an nltk.Text object; we can also lowercase all the words and extract the 
# vocabulary.

# There's a lot going on in this pipeline. To understand it properly, it helps to be clear about the type of each variable that it 
# mentions. We find out the type of any Python object x using type(x), e.g. type(1) is <int> since 1 is an integer.

# When we load the contents of a URL or file, and when we strip out HTML markup, we are dealing with strings, Python's <str> data 
# type. (We will learn more about strings in 3.2):

# Remember I created document.txt
raw = open('document.txt').read()
type(raw)

str

In [30]:
# When we tokenize a string we produce a list (of words), and this is Python's <list> type. Normalizing and sorting lists produces 
# other lists:

# Note that word_tokenize is from the NLTK library (as imported above)
tokens = word_tokenize(raw)

print(type(tokens))
print(tokens)

<type 'list'>
['This', 'is', 'a', 'sample', 'document', 'for', 'the', 'NLTK', 'Book', ',', 'Chapter', '3', '.', 'I', 'am', 'so', 'excited', 'to', 'work', 'in', 'NLTK', 'and', 'with', 'NLP', '.', 'In', 'no', 'time', ',', 'I', 'will', 'be', 'a', 'super', 'star', '!']


In [31]:
# Here we make all tokens lowercase and store them in a new list, words

words = [w.lower() for w in tokens]
print(type(words))
print(words)

<type 'list'>
['this', 'is', 'a', 'sample', 'document', 'for', 'the', 'nltk', 'book', ',', 'chapter', '3', '.', 'i', 'am', 'so', 'excited', 'to', 'work', 'in', 'nltk', 'and', 'with', 'nlp', '.', 'in', 'no', 'time', ',', 'i', 'will', 'be', 'a', 'super', 'star', '!']


In [32]:
# We take our lower case words, apply set (to get the "set" of words or vocabulary)
# Then we sort it and save to a new variable, vocab

vocab = sorted(set(words))
print(type(vocab))
print(vocab)

<type 'list'>
['!', ',', '.', '3', 'a', 'am', 'and', 'be', 'book', 'chapter', 'document', 'excited', 'for', 'i', 'in', 'is', 'nlp', 'nltk', 'no', 'sample', 'so', 'star', 'super', 'the', 'this', 'time', 'to', 'will', 'with', 'work']
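
# The whole pipeline, with the type of each intermediate value, can be sketched in a few lines. As an assumption for
# illustration, a simple regular expression stands in for word_tokenize so that the sketch runs without NLTK; the text is
# the content of document.txt shown earlier.

```python
import re

raw = ("This is a sample document for the NLTK Book, Chapter 3.\n"
       "I am so excited to work in NLTK and with NLP.\n"
       "In no time, I will be a super star!")

tokens = re.findall(r"\w+|[^\w\s]", raw)  # str -> list of str (stand-in tokenizer)
words = [w.lower() for w in tokens]       # list -> list, normalized to lowercase
vocab = sorted(set(words))                # list -> set (drops duplicates) -> sorted list

# Report the type and size of each stage
for name, value in [('raw', raw), ('tokens', tokens), ('words', words), ('vocab', vocab)]:
    print('{0}: {1}, length {2}'.format(name, type(value).__name__, len(value)))
```

# The length drops from tokens to vocab because set() keeps only one copy of each repeated word.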


In [33]:
# The type of an object determines what operations you can perform on it. So, for example, we can append to a list but not to a 
# string:

vocab.append('blog')
print (vocab)

# note, every time I rerun this code, I add "blog" to the end of it...

['!', ',', '.', '3', 'a', 'am', 'and', 'be', 'book', 'chapter', 'document', 'excited', 'for', 'i', 'in', 'is', 'nlp', 'nltk', 'no', 'sample', 'so', 'star', 'super', 'the', 'this', 'time', 'to', 'will', 'with', 'work', 'blog']


In [34]:
raw.append('blog')

AttributeError: 'str' object has no attribute 'append'

In [35]:
# The call above fails because strings have no append() method. However, concatenation with + builds a new string:

raw = raw + " blog"
print(raw)

# Yes, this works.

This is a sample document for the NLTK Book, Chapter 3.
I am so excited to work in NLTK and with NLP.
In no time, I will be a super star! blog


In [36]:
# Similarly, we can concatenate strings with strings, and lists with lists, but we cannot concatenate strings with lists:

# query is a string
query = 'Who knows?'

# beatles is a list

beatles = ['john', 'paul', 'george', 'ringo']

# but we cannot concatenate a string to a list...
query + beatles

TypeError: cannot concatenate 'str' and 'list' objects

In [37]:
# But, in my estimation, we could append the string to the list as follows:

beatles.append(query)
print (beatles)

# yes, this works.

['john', 'paul', 'george', 'ringo', 'Who knows?']


In [38]:
# 3.2   Strings: Text Processing at the Lowest Level

# It's time to examine a fundamental data type that we've been studiously avoiding so far. In earlier chapters we focused on a text as 
# a list of words. We didn't look too closely at words and how they are handled in the programming language. By using NLTK's corpus 
# interface we were able to ignore the files that these texts had come from. The contents of a word, and of a file, are represented 
# by programming languages as a fundamental data type known as a string. In this section we explore strings in detail, and show the 
# connection between strings, words, texts and files.

# Basic Operations with Strings

# Strings are specified using single quotes (In [38]) or double quotes (In [39]), as shown below. If a string contains a single 
# quote, we must backslash-escape the quote (In [40]) so Python knows a literal quote character is intended, or else put the string 
# in double quotes (In [39]). Otherwise, the quote inside the string (In [41]) will be interpreted as a close quote, and the Python 
# interpreter will report a syntax error:

monty = 'Monty Python'
monty

'Monty Python'

In [39]:
circus = "Monty Python's Flying Circus"
circus

"Monty Python's Flying Circus"

In [40]:
circus = 'Monty Python\'s Flying Circus'
circus

"Monty Python's Flying Circus"

In [41]:
circus = 'Monty Python's Flying Circus'

# Note: this gives us an error because it interpreted the second single quote as the end of the string

SyntaxError: invalid syntax (<ipython-input-41-af18724384f6>, line 1)

In [42]:
# Sometimes strings go over several lines. Python provides us with various ways of entering them. In the next example, a sequence of 
# two strings is joined into a single string. We need to use a backslash (this cell) or parentheses (next cell) so that the 
# interpreter knows that the statement is not complete after the first line.

couplet = "Shall I compare thee to a Summer's day?"\
           "Thou art more lovely and more temperate:"
print(couplet)

Shall I compare thee to a Summer's day?Thou art more lovely and more temperate:


In [43]:
couplet = ("Rough winds do shake the darling buds of May,"
           "And Summer's lease hath all too short a date:")
print(couplet)

Rough winds do shake the darling buds of May,And Summer's lease hath all too short a date:


In [44]:
# Unfortunately the above methods do not give us a newline between the two lines of the sonnet. Instead, we can use a triple-quoted 
# string as follows:

couplet = """Shall I compare thee to a Summer's day?
Thou art more lovely and more temperate:"""
print(couplet)

# Note that everything inside a triple-quoted string is kept literally, so the second line must start at the left margin;
# any leading spaces would become part of the string.

Shall I compare thee to a Summer's day?
Thou art more lovely and more temperate:


In [45]:
couplet = '''Rough winds do shake the darling buds of May,
And Summer's lease hath all too short a date:'''
print(couplet)
# This does the same thing. It does not matter whether this is single or double quotes, just as long as there are 3!

Rough winds do shake the darling buds of May,
And Summer's lease hath all too short a date:
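
# A third option, not shown in the cells above, keeps the parenthesized form but writes the line break explicitly as \n.
# This is a small sketch of that alternative; it lets the continuation line be indented without the indentation ending up
# inside the string.

```python
couplet = ("Rough winds do shake the darling buds of May,\n"
           "And Summer's lease hath all too short a date:")

# splitlines() breaks the string at the newline characters
lines = couplet.splitlines()

print(couplet)
print(len(lines))
```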


In [46]:
# Now that we can define strings, we can try some simple operations on them. First let's look at the + operation, known as 
# concatenation [1]. It produces a new string that is a copy of the two original strings pasted together end-to-end. Notice that 
# concatenation doesn't do anything clever like insert a space between the words. We can even multiply strings [2]:

'very' + 'very' + 'very'

'veryveryvery'

In [47]:
'very' * 3

'veryveryvery'

In [48]:
a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
b = [' ' * 2 * (7 - i) + 'very' * i for i in a]
for line in b:
    print(line)

            very
          veryvery
        veryveryvery
      veryveryveryvery
    veryveryveryveryvery
  veryveryveryveryveryvery
veryveryveryveryveryveryvery
  veryveryveryveryveryvery
    veryveryveryveryvery
      veryveryveryvery
        veryveryvery
          veryvery
            very


In [49]:
# We've seen that the addition and multiplication operations apply to strings, not just numbers. However, note that we cannot use 
# subtraction or division with strings:

'very' - 'y'

TypeError: unsupported operand type(s) for -: 'str' and 'str'

In [50]:
'very' / 2

TypeError: unsupported operand type(s) for /: 'str' and 'int'

In [51]:
# These error messages are another example of Python telling us that we have got our data types in a muddle. In the first case, we are 
# told that the operation of subtraction (i.e., -) cannot apply to objects of type str (strings), while in the second, we are told 
# that division cannot take str and int as its two operands.

# Printing Strings

# So far, when we have wanted to look at the contents of a variable or see the result of a calculation, we have just typed the 
# variable name into the interpreter. We can also see the contents of a variable using the print statement:

print(monty)

Monty Python


In [52]:
# Notice that there are no quotation marks this time. When we inspect a variable by typing its name in the interpreter, the 
# interpreter prints the Python representation of its value. Since it's a string, the result is quoted. However, when we tell the 
# interpreter to print the contents of the variable, we don't see quotation characters since there are none inside the string.

# The print statement allows us to display more than one item on a line in various ways, as shown below:

grail = 'Holy Grail'
print(monty + grail)

Monty PythonHoly Grail


In [53]:
print(monty, grail)
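# In Python 2, print is a statement, so the parentheses build a tuple and the tuple is printed (it is not a list).
# In Python 3, print is a function and this line would print: Monty Python Holy Grail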

('Monty Python', 'Holy Grail')


In [54]:
print(monty, "and the", grail)
# Again, Python 2 prints a tuple here, not a list.

('Monty Python', 'and the', 'Holy Grail')
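
# One way to get the same output on Python 2 and Python 3 is to build a single string first and print that. This is a
# sketch of two common approaches, join() and format(); neither comes from the book's text at this point.

```python
monty = 'Monty Python'
grail = 'Holy Grail'

# join() puts the separator between the items of a list of strings
joined = ' '.join([monty, 'and the', grail])

# format() fills the numbered placeholders with its arguments
formatted = '{0} and the {1}'.format(monty, grail)

print(joined)
print(formatted)
```

# Because print receives exactly one string in each call, the parentheses do not create a tuple in Python 2.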


In [55]:
# Accessing Individual Characters

# As we saw earlier for lists, strings are indexed, starting from zero. When we index a string, we get one of its characters 
# (or letters). A single character is nothing special — it's just a string of length 1.

# first element
monty[0]

'M'

In [56]:
# fourth element
monty[3]

't'

In [57]:
# sixth element
monty[5]

' '

In [58]:
# As with lists, if we try to access an index that is outside of the string we get an error:

monty[20]

IndexError: string index out of range

In [59]:
# Again as with lists, we can use negative indexes for strings, where -1 is the index of the last character. Positive and 
# negative indexes give us two ways to refer to any position in a string. In this case, when the string had a length of 12, indexes 
# 5 and -7 both refer to the same character (a space). (Notice that 5 = len(monty) - 7.)

# Access last character
monty[-1]

'n'

In [60]:
monty[5]

' '

In [61]:
monty[-7]

' '
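
# The relationship between positive and negative indexes can be checked for every position at once. A short sketch,
# using the same monty string:

```python
monty = 'Monty Python'
n = len(monty)

# Pair each non-negative index i with its negative counterpart i - n
pairs = [(i, i - n) for i in range(n)]

# Both members of every pair select the same character
same = all(monty[pos] == monty[neg] for pos, neg in pairs)

print(pairs[5])
print(same)
```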

In [62]:
# We can write for loops to iterate over the characters in strings. In Python 3 the print function takes an 
# optional end=' ' parameter, which tells Python to print a space instead of a newline at the end. The Python 2 
# equivalent, used here, is a trailing comma after the print statement.

sent = 'colorless green ideas sleep furiously'
for char in sent:
    print char,
# The book's print(char, end=' ') is Python 3 syntax and is a syntax error in Python 2 unless print_function 
# is imported from __future__. The trailing-comma form is described in Learning Python (5th ed.), page 201.

c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y


In [63]:
# We can count individual characters as well. We should ignore the case distinction by normalizing 
# everything to lowercase, and filter out non-alphabetic characters:

# import gutenberg from nltk.corpus - reference it as gutenberg
from nltk.corpus import gutenberg

# extract raw string data
raw = gutenberg.raw('melville-moby_dick.txt')

# Build a frequency distribution over the alphabetic characters only, 
# converting each one to lowercase first
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())

# Find most common characters represented in Moby Dick.
fdist.most_common(5)

# It happens to be 'e', 't', 'a', 'o', and 'n'

[(u'e', 117092), (u't', 87996), (u'a', 77916), (u'o', 69326), (u'n', 65617)]

In [64]:
[char for (char, count) in fdist.most_common()]

# This gives us the letters of the alphabet, with the most frequently occurring letters listed first 
# (this is quite complicated and we'll explain it more carefully below). You might like to visualize 
# the distribution using fdist.plot(). The relative character frequencies of a text can be used in 
# automatically identifying the language of the text.

[u'e',
 u't',
 u'a',
 u'o',
 u'n',
 u'i',
 u's',
 u'h',
 u'r',
 u'l',
 u'd',
 u'u',
 u'm',
 u'c',
 u'w',
 u'f',
 u'g',
 u'p',
 u'b',
 u'y',
 u'v',
 u'k',
 u'q',
 u'j',
 u'x',
 u'z']
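
# The same counting idea works on a short string with the standard library's collections.Counter, which offers the
# most_common() method used above. The sample sentence is chosen only for illustration.

```python
from collections import Counter

sample = "Call me Ishmael."

# Keep alphabetic characters only and fold case before counting
counts = Counter(ch.lower() for ch in sample if ch.isalpha())

print(counts.most_common(1))
print(sum(counts.values()))
```

# The period and the spaces are filtered out by isalpha(), so only the 13 letters are counted.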

In [65]:
# Accessing Substrings
# Figure 3.2: String Slicing: The string "Monty Python" is shown along with its positive and negative 
# indexes; two substrings are selected using "slice" notation. The slice [m:n] contains the characters 
# from position m through n-1.

# A substring is any continuous section of a string that we want to pull out for further processing. We 
# can easily access substrings using the same slice notation we used for lists (see Figure 3.2). For example, 
# the following code accesses the substring starting at index 6, up to (but not including) index 10:

monty[6:10]

'Pyth'

In [66]:
# Here we see the characters are 'P', 'y', 't', and 'h' which correspond to monty[6] ... monty[9] but 
# not monty[10]. This is because a slice starts at the first index but finishes one before the end index.

# We can also slice with negative indexes — the same basic rule of starting from the start index and 
# stopping one before the end index applies; here we stop before the space character.

monty[-12:-7]

'Monty'

In [67]:
# As with list slices, if we omit the first value, the substring begins at the start of the string. If 
# we omit the second value, the substring continues to the end of the string:
monty[:5]

'Monty'

In [68]:
monty[6:]

'Python'
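
# Two properties of slices are worth checking directly: splitting a string at any position and rejoining the halves
# gives back the original, and a slice that runs past the end is not an error. A brief sketch:

```python
monty = 'Monty Python'

# A slice [m:n] covers positions m through n-1
piece = monty[6:10]

# Splitting at any position k and rejoining restores the original string
rejoined = all(monty[:k] + monty[k:] == monty for k in range(len(monty) + 1))

# Unlike indexing, slicing past the end does not raise IndexError
tail = monty[6:100]
beyond = monty[20:]

print(piece)
print(rejoined)
print(repr(beyond))
```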

In [69]:
# We test if a string contains a particular substring using the in operator, as follows:

In [70]:
phrase = 'And now for something completely different'
if 'thing' in phrase:
    print('found "thing"')

found "thing"


In [71]:
# We can also find the position of a substring within a string, using find():
monty.find('Python')

6

In [72]:
# More operations on strings

# Python has comprehensive support for processing strings. A summary, including some operations we 
# haven't seen yet, is shown in Table 3.2. For more information on strings, type help(str) at the Python prompt.

# Table 3.2:
# Useful String Methods: operations on strings in addition to the string tests shown in 4.2; all methods 
# produce a new string or list
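
# The table itself is not reproduced in these notes, so here is a short sketch of several built-in string methods. The
# selection is an assumption about what is most useful here; help(str) lists the full set.

```python
s = '  Monty Python  '

stripped = s.strip()                 # remove leading and trailing whitespace
words = stripped.split(' ')          # string -> list of strings
rejoined = '-'.join(words)           # list of strings -> string
replaced = stripped.replace('Python', 'Hall')

print(stripped.lower())
print(stripped.upper())
print(stripped.find('Python'))       # index of the first match
print(stripped.find('Grail'))        # find() returns -1 when there is no match

try:
    stripped.index('Grail')          # index() raises instead of returning -1
except ValueError:
    print('index() raised ValueError')
```

# None of these calls changes s; each returns a new string or list.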

# The Difference between Lists and Strings

# Strings and lists are both kinds of sequence. We can pull them apart by indexing and slicing them, and 
# we can join them together by concatenating them. However, we cannot join strings and lists:

query = 'Who knows?'
beatles = ['John', 'Paul', 'George', 'Ringo']
query[2]

'o'

In [73]:
beatles[2]

'George'

In [74]:
query[:2]

'Wh'

In [75]:
beatles[:2]

['John', 'Paul']

In [76]:
query + " I don't"

"Who knows? I don't"

In [77]:
beatles + 'Brian'

TypeError: can only concatenate list (not "str") to list

In [78]:
beatles + ['Brian', 'Bob']

['John', 'Paul', 'George', 'Ringo', 'Brian', 'Bob']

In [79]:
beatles + ['Brian']

['John', 'Paul', 'George', 'Ringo', 'Brian']

In [80]:
# When we open a file for reading into a Python program, we get a string corresponding to the contents of
# the whole file. If we use a for loop to process the elements of this string, all we can pick out are the 
# individual characters — we don't get to choose the granularity. By contrast, the elements of a list can 
# be as big or small as we like: for example, they could be paragraphs, sentences, phrases, words, 
# characters. So lists have the advantage that we can be flexible about the elements they contain, and 
# correspondingly flexible about any downstream processing. Consequently, one of the first things we are 
# likely to do in a piece of NLP code is tokenize a string into a list of strings (3.7). Conversely, when 
# we want to write our results to a file, or to a terminal, we will usually format them as a string (3.9).

# Lists and strings do not have exactly the same functionality. Lists have the added power that you can 
# change their elements:

beatles[0] = "John Lennon"
del beatles[-1]
beatles

['John Lennon', 'Paul', 'George']

In [81]:
# On the other hand if we try to do that with a string — changing the 0th character in query to 'F' — 
# we get:

query[0] = 'F'

# This is because strings are immutable — you can't change a string once you have created it. However, 
# lists are mutable, and their contents can be modified at any time. As a result, lists support 
# operations that modify the original value rather than producing a new value.

TypeError: 'str' object does not support item assignment
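
# A minimal sketch of the practical consequence, reusing the query string from above: since a string
# cannot be edited in place, we build a new string instead, either by slicing or by going through a
# mutable list of characters. The names fixed, chars and rejoined are made up for this illustration.

```python
query = 'Who knows?'

# Build a new string from a replacement character plus a slice of the old one.
fixed = 'F' + query[1:]
print(fixed)

# Alternatively, convert to a (mutable) list of characters, modify it, and join it back.
chars = list(query)
chars[0] = 'F'
rejoined = ''.join(chars)
print(rejoined)

# The original string is untouched in both cases.
print(query)
```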

In [82]:
# 3.3 Text Processing with Unicode

# Our programs will often need to deal with different languages, and different character sets. The concept
# of "plain text" is a fiction. If you live in the English-speaking world you probably use ASCII, possibly 
# without realizing it. If you live in Europe you might use one of the extended Latin character sets, 
# containing such characters as "ø" for Danish and Norwegian, "ő" for Hungarian, "ñ" for Spanish and 
# Breton, and "ň" for Czech and Slovak. In this section, we will give an overview of how to use Unicode 
# for processing texts that use non-ASCII character sets.

# What is Unicode?

# Unicode supports over a million characters. Each character is assigned a number, called a code point. 
# In Python, code points are written in the form \uXXXX, where XXXX is the number in 4-digit hexadecimal 
# form.

# Within a program, we can manipulate Unicode strings just like normal strings. However, when Unicode 
# characters are stored in files or displayed on a terminal, they must be encoded as a stream of bytes. 
# Some encodings (such as ASCII and Latin-2) use a single byte per code point, so they can only support 
# a small subset of Unicode, enough for a single language. Other encodings (such as UTF-8) use multiple 
# bytes and can represent the full range of Unicode characters.

# Text in files will be in a particular encoding, so we need some mechanism for translating it into 
# Unicode — translation into Unicode is called decoding. Conversely, to write out Unicode to a file or a 
# terminal, we first need to translate it into a suitable encoding — this translation out of Unicode is 
# called encoding, and is illustrated in 3.3.

# From a Unicode perspective, characters are abstract entities which can be realized as one or more 
# glyphs. Only glyphs can appear on a screen or be printed on paper. A font is a mapping from characters 
# to glyphs.
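
# The encoding and decoding steps described above can be sketched as follows. This is only an
# illustration and assumes Python 3, where every str is a Unicode string and encode() returns bytes;
# the sample word and variable names are chosen for the example.

```python
# "Panstwowa" with n-acute (U+0144) as its third letter, written with an escape sequence.
text = 'Pa\u0144stwowa'

latin2_bytes = text.encode('iso-8859-2')   # Latin-2: one byte per character
utf8_bytes = text.encode('utf-8')          # UTF-8: the non-ASCII letter needs two bytes

print(len(text), len(latin2_bytes), len(utf8_bytes))

# Decoding with the same encoding recovers the original Unicode string.
roundtrip_latin2 = latin2_bytes.decode('iso-8859-2')
roundtrip_utf8 = utf8_bytes.decode('utf-8')
print(roundtrip_latin2 == text, roundtrip_utf8 == text)

# ASCII has no byte for U+0144, so encoding to ASCII fails.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```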

# Extracting encoded text from files

# Let's assume that we have a small text file, and that we know how it is encoded. For example, 
# polish-lat2.txt, as the name suggests, is a snippet of Polish text (from the Polish Wikipedia; see 
# http://pl.wikipedia.org/wiki/Biblioteka_Pruska). This file is encoded as Latin-2, also known as 
# ISO-8859-2. The function nltk.data.find() locates the file for us.

path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

In [83]:
# The Python open() function can read encoded data into Unicode strings, and write out Unicode strings 
# in encoded form. It takes a parameter to specify the encoding of the file being read or written. So 
# let's open our Polish file with the encoding 'latin2' and inspect the contents of the file:

import codecs

f = codecs.open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line)
    
# To open the file with Latin-2 encoding in Python 2.x, we need to use codecs.open() instead of open(),
# because the built-in open() only accepts an encoding parameter in Python 3.

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.


In [84]:
# If this does not display correctly on your terminal, or if we want to see the underlying numerical 
# values (or "codepoints") of the characters, then we can convert all non-ASCII characters into their 
# two-digit \xXX and four-digit \uXXXX representations:

f = codecs.open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line.encode('unicode_escape'))

Pruska Biblioteka Pa\u0144stwowa. Jej dawne zbiory znane pod nazw\u0105
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk, zosta\u0142y
odnalezione po 1945 r. na terytorium Polski. Trafi\u0142y do Biblioteki
Jagiello\u0144skiej w Krakowie, obejmuj\u0105 ponad 500 tys. zabytkowych
archiwali\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.


In [85]:
# Note - this discussion on Unicode characters can help with Google Calendar...

# The first line above illustrates a Unicode escape sequence introduced by \u, namely \u0144. The 
# relevant Unicode character will be displayed on the screen as the glyph ń. In the third line of the 
# preceding example, we see \xf3, which corresponds to the glyph ó, and is within the 128-255 range.

# In Python 3, source code is encoded using UTF-8 by default, and you can include Unicode characters in 
# strings if you are using IDLE or another program editor that supports Unicode. Arbitrary Unicode 
# characters can be included using the \uXXXX escape sequence. We find the integer ordinal of a character 
# using ord(). For example:

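# In Python 2, ord("ń") raises a TypeError when the literal is UTF-8 encoded, because the plain
# literal is a two-byte byte string rather than a single character. Prefixing the literal with u
# makes it a Unicode string, so this works:
[ord(x) for x in u'ń']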

[324]

In [86]:
# The hexadecimal 4 digit notation for 324 is 0144 (type hex(324) to discover this), and we can define a string with the appropriate 
# escape sequence.

hex(324)

'0x144'
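
# A small sketch tying ord(), hex() and the \uXXXX notation together. It assumes Python 3 (in Python 2
# the inverse of ord() for Unicode is unichr(), and the literal would need a u prefix); the variable
# names are made up for the example.

```python
code_point = ord('\u0144')                  # integer ordinal of n-acute
four_digit = '{:04x}'.format(code_point)    # zero-padded, 4-digit hexadecimal
escape_text = '\\u' + four_digit            # the escape sequence as it is typed in source code
print(code_point, hex(code_point), four_digit, escape_text)

# chr() is the inverse of ord(): it maps the integer back to the character.
same_char = chr(code_point)
print(same_char == '\u0144')
```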

In [87]:
nacute = u'\u0144'
nacute
# In Python 2 the u prefix is required: in a plain byte string such as '\u0144', \u is not an escape
# sequence, so the result would be the six literal characters \u0144.

u'\u0144'

In [88]:
# Note

# There are many factors determining what glyphs are rendered on your screen. If you are sure that you have the correct encoding, but 
# your Python code is still failing to produce the glyphs you expected, you should also check that you have the necessary fonts 
# installed on your system. It may be necessary to configure your locale to render UTF-8 encoded characters, then use 
# print(nacute.encode('utf8')) in order to see the ń displayed in your terminal.

print(nacute.encode('utf8'))

ń


In [89]:
# We can also see how this character is represented as a sequence of bytes inside a text file:
nacute.encode('utf8')

'\xc5\x84'

In [91]:
# The module unicodedata lets us inspect the properties of Unicode characters. In the following example, we select all characters in 
# the third line of our Polish text outside the ASCII range and print their UTF-8 byte sequence, followed by their code point integer 
# using the standard Unicode convention (i.e., prefixing the hex digits with U+), followed by their Unicode name.

import unicodedata
lines = codecs.open(path, encoding='latin2').readlines() # codecs.open for Python 2.X
line = lines[2]
print(line.encode('unicode_escape'))

Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk, zosta\u0142y\n


In [92]:
for c in line:
    if ord(c) > 127:
        print('{} U+{:04x} {}'.format(c.encode('utf8'), ord(c), unicodedata.name(c)))

ó U+00f3 LATIN SMALL LETTER O WITH ACUTE
ś U+015b LATIN SMALL LETTER S WITH ACUTE
Ś U+015a LATIN CAPITAL LETTER S WITH ACUTE
ą U+0105 LATIN SMALL LETTER A WITH OGONEK
ł U+0142 LATIN SMALL LETTER L WITH STROKE


In [93]:
# The next examples illustrate how Python string methods and the re module can work with Unicode characters. (We will take a close 
# look at the re module in the following section. The \w matches a "word character", cf 3.4).

line.find(u'zosta\u0142y')
# The search string needs the u prefix in Python 2; without it, \u0142 is not treated as an escape
# sequence and find() returns -1.

54

In [95]:
line = line.lower()
line

u'niemc\xf3w pod koniec ii wojny \u015bwiatowej na dolny \u015bl\u0105sk, zosta\u0142y\n'

In [96]:
line.encode('unicode_escape')

'niemc\\xf3w pod koniec ii wojny \\u015bwiatowej na dolny \\u015bl\\u0105sk, zosta\\u0142y\\n'

In [97]:
import re
m = re.search(u'\u015b\w*', line)
m.group()

# The pattern needs the u prefix in Python 2 so that \u015b is read as the character ś; as a plain
# byte string it matches nothing, and re.search() returns None.

u'\u015bwiatowej'

In [98]:
# NLTK tokenizers allow Unicode strings as input, and correspondingly yield Unicode strings as output.

word_tokenize(line)

[u'niemc\xf3w',
 u'pod',
 u'koniec',
 u'ii',
 u'wojny',
 u'\u015bwiatowej',
 u'na',
 u'dolny',
 u'\u015bl\u0105sk',
 u',',
 u'zosta\u0142y']
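
# As a rough point of comparison, a plain regular expression can also split Unicode text into word
# tokens. This is only a sketch, not how word_tokenize works: it assumes Python 3, where \w matches
# Unicode word characters by default, and it simply drops punctuation such as the comma.

```python
import re

# The lowercased third line of the Polish sample, written with escape sequences.
line = 'niemc\u00f3w pod koniec ii wojny \u015bwiatowej na dolny \u015bl\u0105sk, zosta\u0142y\n'

# Runs of word characters only; punctuation and whitespace act as separators.
rough_tokens = re.findall(r'\w+', line)

# ascii() shows non-ASCII characters as escapes, so the output is safe on any terminal.
print(ascii(rough_tokens))
```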

In [None]:
# Skipped small section "Using your local encoding in Python"

# 3.4 Regular Expressions for Detecting Word Patterns

In [99]:
# Many linguistic processing tasks involve pattern matching. For example, we can find words ending with ed using endswith('ed'). 
# We saw a variety of such "word tests" in 4.2 (in Chapter 1). Regular expressions give us a more powerful and flexible method for 
# describing the character patterns we are interested in.

# Note

# There are many other published introductions to regular expressions, organized around the syntax of regular expressions and 
# applied to searching text files. Instead of doing this again, we focus on the use of regular expressions at different stages of 
# linguistic processing. As usual, we'll adopt a problem-based approach and present new features only as they are needed to solve 
# practical problems. In our discussion we will mark regular expressions using chevrons like this: «patt».

In [100]:
# To use regular expressions in Python we need to import the re library using: import re. We also need a list of words to search; 
# we'll use the Words Corpus again (4). We will preprocess it to remove any proper names.

# Import Python's regular expression library
import re

# nltk.corpus.words.words('en') is a list of all English words in its corpus
# We find all lower case words (so that we don't repeat with capitalized words) and place them in wordlist
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()] 

In [101]:
# Using Basic Meta-Characters

# Let's find words ending with ed using the regular expression «ed$». We will use the re.search(p, s) function to check whether 
# the pattern p can be found somewhere inside the string s. We need to specify the characters of interest, and use the dollar sign 
# which has a special behavior in the context of regular expressions in that it matches the end of the word:

[w for w in wordlist if re.search('ed$', w)]

[u'abaissed',
 u'abandoned',
 u'abased',
 u'abashed',
 u'abatised',
 u'abed',
 u'aborted',
 u'abridged',
 u'abscessed',
 u'absconded',
 u'absorbed',
 u'abstracted',
 u'abstricted',
 u'accelerated',
 u'accepted',
 u'accidented',
 u'accoladed',
 u'accolated',
 u'accomplished',
 u'accosted',
 u'accredited',
 u'accursed',
 u'accused',
 u'accustomed',
 u'acetated',
 u'acheweed',
 u'aciculated',
 u'aciliated',
 u'acknowledged',
 u'acorned',
 u'acquainted',
 u'acquired',
 u'acquisited',
 u'acred',
 u'aculeated',
 u'addebted',
 u'added',
 u'addicted',
 u'addlebrained',
 u'addleheaded',
 u'addlepated',
 u'addorsed',
 u'adempted',
 u'adfected',
 u'adjoined',
 u'admired',
 u'admitted',
 u'adnexed',
 u'adopted',
 u'adossed',
 u'adreamed',
 u'adscripted',
 u'aduncated',
 u'advanced',
 u'advised',
 u'aeried',
 u'aethered',
 u'afeared',
 u'affected',
 u'affectioned',
 u'affined',
 u'afflicted',
 u'affricated',
 u'affrighted',
 u'affronted',
 u'aforenamed',
 u'afterfeed',
 u'aftershafted',
 u'aftertho

In [102]:
# The . wildcard symbol matches any single character. Suppose we have room in a crossword puzzle for an 8-letter word with j as 
# its third letter and t as its sixth letter. In place of each blank cell we use a period:

[w for w in wordlist if re.search('^..j..t..$', w)]

[u'abjectly',
 u'adjuster',
 u'dejected',
 u'dejectly',
 u'injector',
 u'majestic',
 u'objectee',
 u'objector',
 u'rejecter',
 u'rejector',
 u'unjilted',
 u'unjolted',
 u'unjustly']

In [103]:
# Note

# Your Turn: The caret symbol ^ matches the start of a string, just like the $ matches the end. What results do we get with the above 
# example if we leave out both of these, and search for «..j..t..»?
[w for w in wordlist if re.search('..j..t..', w)]

[u'abjectedness',
 u'abjection',
 u'abjective',
 u'abjectly',
 u'abjectness',
 u'adjection',
 u'adjectional',
 u'adjectival',
 u'adjectivally',
 u'adjective',
 u'adjectively',
 u'adjectivism',
 u'adjectivitis',
 u'adjustable',
 u'adjustably',
 u'adjustage',
 u'adjustation',
 u'adjuster',
 u'adjustive',
 u'adjustment',
 u'antejentacular',
 u'antiprojectivity',
 u'bijouterie',
 u'coadjustment',
 u'cojusticiar',
 u'conjective',
 u'conjecturable',
 u'conjecturably',
 u'conjectural',
 u'conjecturalist',
 u'conjecturality',
 u'conjecturally',
 u'conjecture',
 u'conjecturer',
 u'coprojector',
 u'counterobjection',
 u'dejected',
 u'dejectedly',
 u'dejectedness',
 u'dejectile',
 u'dejection',
 u'dejectly',
 u'dejectory',
 u'dejecture',
 u'disjection',
 u'guanajuatite',
 u'inadjustability',
 u'inadjustable',
 u'injectable',
 u'injection',
 u'injector',
 u'injustice',
 u'insubjection',
 u'interjection',
 u'interjectional',
 u'interjectionalize',
 u'interjectionally',
 u'interjectionary',
 u'inter
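
# The effect of the ^ and $ anchors is easier to see on a handful of words. The list below is a
# hypothetical stand-in for the Words Corpus, chosen for this sketch.

```python
import re

words = ['abjectly', 'abjection', 'dejected', 'dejectedly', 'majestic', 'jet']

# With anchors, the pattern must cover the whole word: exactly 8 letters.
anchored = [w for w in words if re.search(r'^..j..t..$', w)]

# Without anchors, the pattern may match anywhere inside a longer word.
unanchored = [w for w in words if re.search(r'..j..t..', w)]

print(anchored)
print(unanchored)
```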

In [104]:
# Finally, the ? symbol specifies that the previous character is optional. Thus «^e-?mail$» will match both email and e-mail. We 
# could count the total number of occurrences of this word (in either spelling) in a text using 
# sum(1 for w in text if re.search('^e-?mail$', w)).
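
# A minimal sketch of that counting expression. The token list here is hypothetical; any list of
# word strings would do.

```python
import re

text = ['Send', 'me', 'an', 'email', 'or', 'an', 'e-mail', ',', 'not', 'emails', 'or', 'mail']

# Count tokens that are exactly "email" or "e-mail"; "emails" and "mail" do not match.
count = sum(1 for w in text if re.search(r'^e-?mail$', w))
print(count)
```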

In [105]:
# Ranges and Closures

# The T9 system is used for entering text on mobile phones (see 3.5). Two or more words that are entered with the same sequence 
# of keystrokes are known as textonyms. For example, both hole and golf are entered by pressing the sequence 4653. What other words 
# could be produced with the same sequence? Here we use the regular expression «^[ghi][mno][jlk][def]$»:

In [106]:
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]
# word can be the concatenation of [g|h|i] [m|n|o] [j|l|k] [d|e|f]

[u'gold', u'golf', u'hold', u'hole']

In [107]:
# The first part of the expression, «^[ghi]», matches the start of a word followed by g, h, or i. The next part of the expression, 
# «[mno]», constrains the second character to be m, n, or o. The third and fourth characters are also constrained. Only four words 
# satisfy all these constraints. Note that the order of characters inside the square brackets is not significant, so we could have 
# written «^[hig][nom][ljk][fed]$» and matched the same words.
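
# The same idea can be generalized to any keystroke sequence. The sketch below is not part of NLTK:
# the KEYPAD table, the function name t9_pattern and the mini word list are assumptions made for
# this example.

```python
import re

# Letters on a standard phone keypad.
KEYPAD = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
          '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}

def t9_pattern(digits):
    """Build an anchored regular expression with one character class per keypress."""
    return '^' + ''.join('[' + KEYPAD[d] + ']' for d in digits) + '$'

pattern = t9_pattern('4653')
print(pattern)

# Hypothetical mini word list, standing in for the Words Corpus.
words = ['gold', 'golf', 'hold', 'hole', 'good', 'home', 'holes']
textonyms = [w for w in words if re.search(pattern, w)]
print(textonyms)
```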

In [108]:
# Note

# Your Turn: Look for some "finger-twisters", by searching for words that only use part of the number-pad. For example 
# «^[ghijklmno]+$», or more concisely, «^[g-o]+$», will match words that only use keys 4, 5, 6 in the center row, 
# and «^[a-fj-o]+$» will match words that use keys 2, 3, 5, 6 in the top-right corner. What do - and + mean?

[w for w in wordlist if re.search('^[ghijklmno]+$', w)]
# + means 1 or more instances of any of the characters in brackets

[u'g',
 u'ghoom',
 u'gig',
 u'giggling',
 u'gigolo',
 u'gilim',
 u'gill',
 u'gilling',
 u'gilo',
 u'gim',
 u'gin',
 u'ging',
 u'gingili',
 u'gink',
 u'ginkgo',
 u'ginning',
 u'gio',
 u'glink',
 u'glom',
 u'glonoin',
 u'gloom',
 u'glooming',
 u'gnomon',
 u'go',
 u'gog',
 u'gogo',
 u'goi',
 u'going',
 u'gol',
 u'goli',
 u'gon',
 u'gong',
 u'gonion',
 u'goo',
 u'googol',
 u'gook',
 u'gool',
 u'goon',
 u'h',
 u'hi',
 u'high',
 u'hill',
 u'him',
 u'hin',
 u'hing',
 u'hinoki',
 u'ho',
 u'hog',
 u'hoggin',
 u'hogling',
 u'hoi',
 u'hoin',
 u'holing',
 u'holl',
 u'hollin',
 u'hollo',
 u'hollong',
 u'holm',
 u'homo',
 u'homologon',
 u'hong',
 u'honk',
 u'hook',
 u'hoon',
 u'i',
 u'igloo',
 u'ihi',
 u'ilk',
 u'ill',
 u'imi',
 u'imino',
 u'immi',
 u'in',
 u'ing',
 u'ingoing',
 u'inion',
 u'ink',
 u'inkling',
 u'inlook',
 u'inn',
 u'inning',
 u'io',
 u'ion',
 u'j',
 u'jhool',
 u'jig',
 u'jing',
 u'jingling',
 u'jingo',
 u'jinjili',
 u'jink',
 u'jinn',
 u'jinni',
 u'jo',
 u'jog',
 u'johnin',
 u'join

In [109]:
[w for w in wordlist if re.search('^[a-fj-o]+$', w)]
# - means a range. So match any letters between a and f or j and o, a minimum of 1 instance
# alternatively, any word that does not have the letters g, h, i, p, q, r, s, t, u, v, w, x, y, z

[u'a',
 u'aa',
 u'aal',
 u'aam',
 u'aba',
 u'abac',
 u'abaca',
 u'aback',
 u'abaff',
 u'abalone',
 u'abandon',
 u'abandonable',
 u'abandoned',
 u'abandonee',
 u'abb',
 u'abdal',
 u'abdomen',
 u'abeam',
 u'abed',
 u'abele',
 u'able',
 u'abloom',
 u'abode',
 u'abolla',
 u'aboma',
 u'aboon',
 u'academe',
 u'acana',
 u'acca',
 u'accede',
 u'accedence',
 u'accend',
 u'accolade',
 u'accoladed',
 u'accolle',
 u'accommodable',
 u'ace',
 u'ackman',
 u'acle',
 u'acme',
 u'acne',
 u'acnodal',
 u'acnode',
 u'acock',
 u'acold',
 u'acoma',
 u'acone',
 u'ad',
 u'adad',
 u'adance',
 u'add',
 u'adda',
 u'addable',
 u'added',
 u'addend',
 u'addenda',
 u'addle',
 u'ade',
 u'adead',
 u'adeem',
 u'adenocele',
 u'adenoma',
 u'adman',
 u'ado',
 u'adobe',
 u'ae',
 u'aefald',
 u'aenean',
 u'aeon',
 u'aface',
 u'affa',
 u'affable',
 u'aflame',
 u'afoam',
 u'ajaja',
 u'ak',
 u'aka',
 u'akala',
 u'ake',
 u'akeake',
 u'akee',
 u'aknee',
 u'ako',
 u'al',
 u'ala',
 u'alack',
 u'alada',
 u'alala',
 u'alameda',
 u'ala

In [110]:
# Let's explore the + symbol a bit further. Notice that it can be applied to individual letters, or to bracketed sets of letters:

chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$', w)]
# I predict this means [one or more m, one or more i, one or more n, one or more e]
# this makes sense when you look at chat data

[u'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 u'miiiiiinnnnnnnnnneeeeeeee',
 u'mine',
 u'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

In [111]:
[w for w in chat_words if re.search('^[ha]+$', w)]
# I predict this is one or more of h or a. This could be words like haha, haaahaaa, etc.


[u'a',
 u'aaaaaaaaaaaaaaaaa',
 u'aaahhhh',
 u'ah',
 u'ahah',
 u'ahahah',
 u'ahh',
 u'ahhahahaha',
 u'ahhh',
 u'ahhhh',
 u'ahhhhhh',
 u'ahhhhhhhhhhhhhh',
 u'h',
 u'ha',
 u'haaa',
 u'hah',
 u'haha',
 u'hahaaa',
 u'hahah',
 u'hahaha',
 u'hahahaa',
 u'hahahah',
 u'hahahaha',
 u'hahahahaaa',
 u'hahahahahaha',
 u'hahahahahahaha',
 u'hahahahahahahahahahahahahahahaha',
 u'hahahhahah',
 u'hahhahahaha']

In [112]:
# It should be clear that + simply means "one or more instances of the preceding item", which could be an individual character 
# like m, a set like [fed] or a range like [d-f]. Now let's replace + with *, which means "zero or more instances of the preceding 
# item". The regular expression «^m*i*n*e*$» will match everything that we found using «^m+i+n+e+$», but also words where some of 
# the letters don't appear at all, e.g. me, min, and mmmmm. Note that the + and * symbols are sometimes referred to as Kleene 
# closures, or simply closures.

# The ^ operator has another function when it appears as the first character inside square brackets. For example «[^aeiouAEIOU]» 
# matches any character other than a vowel. We can search the NPS Chat Corpus for words that are made up entirely of non-vowel 
# characters using «^[^aeiouAEIOU]+$» to find items like these: :):):), grrr, cyb3r and zzzzzzzz. Notice this includes 
# non-alphabetic characters.
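
# The difference between +, * and a negated character class can be checked on a small list. The words
# below are a hypothetical sample in the spirit of the chat data, not drawn from the NPS Chat Corpus.

```python
import re

words = ['mine', 'me', 'min', 'mmmmm', 'miiine', 'grrr', 'cyb3r', ':):):)', 'zzz', 'hello']

# + requires at least one of each letter, in order.
plus_matches = [w for w in words if re.search(r'^m+i+n+e+$', w)]

# * also allows any of the letters to be missing.
star_matches = [w for w in words if re.search(r'^m*i*n*e*$', w)]

# [^...] matches any character NOT listed, including digits and punctuation.
no_vowels = [w for w in words if re.search(r'^[^aeiouAEIOU]+$', w)]

print(plus_matches)
print(star_matches)
print(no_vowels)
```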

# Here are some more examples of regular expressions being used to find tokens that match a particular pattern, illustrating 
# the use of some new symbols: \, {}, (), and |:

wsj = sorted(set(nltk.corpus.treebank.words()))
[w for w in wsj if re.search('^[0-9]+\.[0-9]+$', w)]
# beginning ^ means start
# [one or more numbers] . [one or more numbers]
# $ means the end

[u'0.0085',
 u'0.05',
 u'0.1',
 u'0.16',
 u'0.2',
 u'0.25',
 u'0.28',
 u'0.3',
 u'0.4',
 u'0.5',
 u'0.50',
 u'0.54',
 u'0.56',
 u'0.60',
 u'0.7',
 u'0.82',
 u'0.84',
 u'0.9',
 u'0.95',
 u'0.99',
 u'1.01',
 u'1.1',
 u'1.125',
 u'1.14',
 u'1.1650',
 u'1.17',
 u'1.18',
 u'1.19',
 u'1.2',
 u'1.20',
 u'1.24',
 u'1.25',
 u'1.26',
 u'1.28',
 u'1.35',
 u'1.39',
 u'1.4',
 u'1.457',
 u'1.46',
 u'1.49',
 u'1.5',
 u'1.50',
 u'1.55',
 u'1.56',
 u'1.5755',
 u'1.5805',
 u'1.6',
 u'1.61',
 u'1.637',
 u'1.64',
 u'1.65',
 u'1.7',
 u'1.75',
 u'1.76',
 u'1.8',
 u'1.82',
 u'1.8415',
 u'1.85',
 u'1.8500',
 u'1.9',
 u'1.916',
 u'1.92',
 u'10.19',
 u'10.2',
 u'10.5',
 u'107.03',
 u'107.9',
 u'109.73',
 u'11.10',
 u'11.5',
 u'11.57',
 u'11.6',
 u'11.72',
 u'11.95',
 u'112.9',
 u'113.2',
 u'116.3',
 u'116.4',
 u'116.7',
 u'116.9',
 u'118.6',
 u'12.09',
 u'12.5',
 u'12.52',
 u'12.68',
 u'12.7',
 u'12.82',
 u'12.97',
 u'120.7',
 u'1206.26',
 u'121.6',
 u'126.1',
 u'126.15',
 u'127.03',
 u'129.91',
 u'13.1',
 u'13

In [113]:
[w for w in wsj if re.search('^[A-Z]+\$$', w)]
# Prediction: one or more capital letters followed by $

[u'C$', u'US$']

In [114]:
[w for w in wsj if re.search('^[0-9]{4}$', w)]
# Prediction: [start] [four numbers] [end]

[u'1614',
 u'1637',
 u'1787',
 u'1901',
 u'1903',
 u'1917',
 u'1925',
 u'1929',
 u'1933',
 u'1934',
 u'1948',
 u'1953',
 u'1955',
 u'1956',
 u'1961',
 u'1965',
 u'1966',
 u'1967',
 u'1968',
 u'1969',
 u'1970',
 u'1971',
 u'1972',
 u'1973',
 u'1975',
 u'1976',
 u'1977',
 u'1979',
 u'1980',
 u'1981',
 u'1982',
 u'1983',
 u'1984',
 u'1985',
 u'1986',
 u'1987',
 u'1988',
 u'1989',
 u'1990',
 u'1991',
 u'1992',
 u'1993',
 u'1994',
 u'1995',
 u'1996',
 u'1997',
 u'1998',
 u'1999',
 u'2000',
 u'2005',
 u'2009',
 u'2017',
 u'2019',
 u'2029',
 u'3057',
 u'8300']

In [115]:
[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]
# Prediction: [start] [1 or more numbers] - [3-5 letters] [end]

[u'10-day',
 u'10-lap',
 u'10-year',
 u'100-share',
 u'12-point',
 u'12-year',
 u'14-hour',
 u'15-day',
 u'150-point',
 u'190-point',
 u'20-point',
 u'20-stock',
 u'21-month',
 u'237-seat',
 u'240-page',
 u'27-year',
 u'30-day',
 u'30-point',
 u'30-share',
 u'30-year',
 u'300-day',
 u'36-day',
 u'36-store',
 u'42-year',
 u'50-state',
 u'500-stock',
 u'52-week',
 u'69-point',
 u'84-month',
 u'87-store',
 u'90-day']

In [116]:
[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]
# Prediction: [start] [5 or more lowercase letters] - [2-3 lowercase letters] - [no more than 6 lowercase letters] [end]
# Yes! I got it right!

[u'black-and-white',
 u'bread-and-butter',
 u'father-in-law',
 u'machine-gun-toting',
 u'savings-and-loan']

In [117]:
[w for w in wsj if re.search('(ed|ing)$', w)]
# Prediction: any word that ends in ed or ing

[u'62%-owned',
 u'Absorbed',
 u'According',
 u'Adopting',
 u'Advanced',
 u'Advancing',
 u'Alfred',
 u'Allied',
 u'Annualized',
 u'Anything',
 u'Arbitrage-related',
 u'Arbitraging',
 u'Asked',
 u'Assuming',
 u'Atlanta-based',
 u'Baking',
 u'Banking',
 u'Beginning',
 u'Beijing',
 u'Being',
 u'Bermuda-based',
 u'Betting',
 u'Boeing',
 u'Broadcasting',
 u'Bucking',
 u'Buying',
 u'Calif.-based',
 u'Change-ringing',
 u'Citing',
 u'Concerned',
 u'Confronted',
 u'Conn.based',
 u'Consolidated',
 u'Continued',
 u'Continuing',
 u'Declining',
 u'Defending',
 u'Depending',
 u'Designated',
 u'Determining',
 u'Developed',
 u'Died',
 u'During',
 u'Encouraged',
 u'Encouraging',
 u'English-speaking',
 u'Estimated',
 u'Everything',
 u'Excluding',
 u'Exxon-owned',
 u'Faulding',
 u'Fed',
 u'Feeding',
 u'Filling',
 u'Filmed',
 u'Financing',
 u'Following',
 u'Founded',
 u'Fracturing',
 u'Francisco-based',
 u'Fred',
 u'Funded',
 u'Funding',
 u'Generalized',
 u'Germany-based',
 u'Getting',
 u'Guaranteed',
 u'H

In [None]:
# You probably worked out that a backslash means that the following character is deprived of its special powers and must literally 
# match a specific character in the word. Thus, while . is special, \. only matches a period. The braced expressions, like {3,5}, 
# specify the number of repeats of the previous item. The pipe character indicates a choice between the material on its left or its 
# right. Parentheses indicate the scope of an operator: they can be used together with the pipe (or disjunction) symbol like this: 
# «w(i|e|ai|oo)t», matching wit, wet, wait, and woot. It is instructive to see what happens when you omit the parentheses from the 
# last expression above, and search for «ed|ing$».
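
# To see what the parentheses contribute, here is a small sketch on a hypothetical word list. Without
# parentheses, the $ anchor applies only to ing, so ed may match anywhere in the word.

```python
import re

words = ['edit', 'walked', 'singing', 'red', 'bedtime', 'ringer']

# Parentheses make $ apply to both alternatives: the word must END in ed or ing.
with_parens = [w for w in words if re.search(r'(ed|ing)$', w)]

# Without them, the pattern means: "ed anywhere" OR "ing at the end".
without_parens = [w for w in words if re.search(r'ed|ing$', w)]

print(with_parens)
print(without_parens)
```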

# The meta-characters we have seen are summarized in 3.3.

# Table 3.3:

# Basic Regular Expression Meta-Characters, Including Wildcards, Ranges and Closures

# Operator     Behavior
# .            Wildcard, matches any character
# ^abc         Matches some pattern abc at the start of a string
# abc$         Matches some pattern abc at the end of a string
# [abc]        Matches one of a set of characters
# [A-Z0-9]     Matches one of a range of characters
# ed|ing|s     Matches one of the specified strings (disjunction)
# *            Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure)
# +            One or more of previous item, e.g. a+, [a-z]+
# ?            Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
# {n}          Exactly n repeats where n is a non-negative integer
# {n,}         At least n repeats
# {,n}         No more than n repeats
# {m,n}        At least m and no more than n repeats
# a(b|c)+      Parentheses that indicate the scope of the operators

# To the Python interpreter, a regular expression is just like any other string. If the string contains a backslash followed by 
# particular characters, it will interpret these specially. For example \b would be interpreted as the backspace character. In 
# general, when using regular expressions containing backslash, we should instruct the interpreter not to look inside the string at 
# all, but simply to pass it directly to the re library for processing. We do this by prefixing the string with the letter r, to 
# indicate that it is a raw string. For example, the raw string r'\band\b' contains two \b symbols that are interpreted by the 
# re library as matching word boundaries instead of backspace characters. If you get into the habit of using r'...' for regular 
# expressions — as we will do from now on — you will avoid having to think about these complications.
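
# A short sketch of the difference a raw string makes. The sample sentence is made up for the
# example; the behavior of '\b' versus r'\b' is standard Python.

```python
import re

plain = '\b'     # one character: backspace
raw = r'\b'      # two characters: a backslash followed by b
print(len(plain), len(raw))

sentence = 'sand and band and hand'

# Raw string: the re library sees \b and treats it as a word boundary.
with_raw = re.findall(r'\band\b', sentence)

# Ordinary string: Python turns \b into backspace before re ever sees the pattern.
without_raw = re.findall('\band\b', sentence)

print(with_raw, without_raw)
```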

# 3.5   Useful Applications of Regular Expressions

In [118]:
# The above examples all involved searching for words w that match some regular expression regexp using re.search(regexp, w). Apart 
# from checking if a regular expression matches a word, we can use regular expressions to extract material from words, or to modify 
# words in specific ways.

# Extracting Word Pieces

# The re.findall() ("find all") method finds all (non-overlapping) matches of the given regular expression. Let's find all the 
# vowels in a word, then count them:

word = 'supercalifragilisticexpialidocious'
re.findall(r'[aeiou]', word)

['u',
 'e',
 'a',
 'i',
 'a',
 'i',
 'i',
 'i',
 'e',
 'i',
 'a',
 'i',
 'o',
 'i',
 'o',
 'u']

In [119]:
len(re.findall(r'[aeiou]', word))

16

In [None]:
# Let's look for all sequences of two or more vowels in some text, and determine their relative frequency:

In [120]:
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj
                   for vs in re.findall(r'[aeiou]{2,}', word))

In [121]:
fd.most_common(12)

[(u'io', 549),
 (u'ea', 476),
 (u'ie', 331),
 (u'ou', 329),
 (u'ai', 261),
 (u'ia', 253),
 (u'ee', 217),
 (u'oo', 174),
 (u'ua', 109),
 (u'au', 106),
 (u'ue', 105),
 (u'ui', 95)]
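
# The same counting pattern can be tried without any corpus. In this sketch collections.Counter is
# used as a stand-in for nltk.FreqDist (both offer most_common()), and the word list is hypothetical.

```python
import re
from collections import Counter

words = ['nation', 'station', 'read', 'bread', 'book', 'fiction']

# Collect every run of two or more vowels from every word, then count them.
vowel_runs = Counter(vs for word in words
                     for vs in re.findall(r'[aeiou]{2,}', word))

print(vowel_runs.most_common(2))
```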

In [122]:
# Doing More with Word Pieces

# Once we can use re.findall() to extract material from words, there are interesting things to do with the pieces, like gluing them 
# back together or plotting them.

# It is sometimes noted that English text is highly redundant, and it is still easy to read when word-internal vowels are left 
# out. For example, declaration becomes dclrtn, and inalienable becomes inlnble, retaining any initial or final vowel sequences. 
# The regular expression in our next example matches initial vowel sequences, final vowel sequences, and all consonants; everything 
# else is ignored. This three-way disjunction is processed left-to-right: if one of the three parts matches the word, any later
# parts of the regular expression are ignored. We use re.findall() to extract all the matching pieces, and ''.join() to join them 
# together (see 3.9 for more about the join operation).

regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

english_udhr = nltk.corpus.udhr.words('English-Latin1')
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))

Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and
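
# The two words mentioned in the text can be checked directly, without loading the corpus. This sketch
# repeats the regular expression and the compress() function from above so that it runs on its own.

```python
import re

# Initial vowel run | final vowel run | any single non-vowel character.
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    """Drop word-internal vowels, keeping initial and final vowel sequences."""
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

for w in ['declaration', 'inalienable']:
    print(w, '->', compress(w))
```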


In [None]:
# I skipped the rest of this small section

In [123]:
# Finding Word Stems

# When we use a web search engine, we usually don't mind (or even notice) if the words in the document differ from our search terms 
# in having different endings. A query for laptops finds documents containing laptop and vice versa. Indeed, laptop and laptops are 
# just two forms of the same dictionary word (or lemma). For some language processing tasks we want to ignore word endings, and just 
# deal with word stems.

# There are various ways we can pull out the stem of a word. Here's a simple-minded approach which just strips off anything that 
# looks like a suffix:

def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

# Take in a word
# Loop over a list of 9 common suffixes
# If the word ends with the current suffix, return the word with that suffix removed
# If no suffix matches, return the word unchanged
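
# A quick check of the simple suffix-stripping approach on a few words. The function is redefined here
# under the made-up name simple_stem so that the sketch is self-contained; the test words are chosen
# to show both useful results and obvious mistakes.

```python
def simple_stem(word):
    """Strip the first matching suffix in the list; return the word unchanged otherwise."""
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

# The order of the list matters: 'es' is tried before 's', so "processes" loses "es".
for w in ['processing', 'processes', 'government', 'lying', 'is', 'women']:
    print(w, '->', simple_stem(w))
```

# Note that the approach over-strips short words: "lying" becomes "ly" and "is" becomes "i".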

In [124]:
# Although we will ultimately use NLTK's built-in stemmers, it's interesting to see how we can use regular expressions for this 
# task. Our first step is to build up a disjunction of all the suffixes. We need to enclose it in parentheses in order to limit 
# the scope of the disjunction.

re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

# Here, we say that we should match the string in 'processing' where we have 0 or more characters
# followed by one of 9 possible suffixes
# Here, we see that it is 'ing'

['ing']

In [125]:
# Here, re.findall() just gave us the suffix even though the regular expression matched the entire word. This is because the 
# parentheses have a second function, to select substrings to be extracted. If we want to use the parentheses to specify the 
# scope of the disjunction, but not to select the material to be output, we have to add ?:, which is just one of many arcane 
# subtleties of regular expressions. Here's the revised version.

re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['processing']

In [126]:
# However, we'd actually like to split the word into stem and suffix. So we should just parenthesize both parts of the regular 
# expression:

re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

[('process', 'ing')]

In [127]:
# This looks promising, but still has a problem. Let's look at a different word, processes:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('processe', 's')]

In [128]:
# The regular expression incorrectly found an -s suffix instead of an -es suffix. This demonstrates another subtlety: the star 
# operator is "greedy" and the .* part of the expression tries to consume as much of the input as possible. If we use the 
# "non-greedy" version of the star operator, written *?, we get what we want:

re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('process', 'es')]

In [129]:
# This works even when we allow an empty suffix, by making the content of the second parentheses optional:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')

[('language', '')]

In [130]:
# This approach still has many problems (can you spot them?) but we will move on to define a function to perform stemming, and 
# apply it to a whole text:

def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""

tokens = word_tokenize(raw)

[stem(t) for t in tokens]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'women',
 'ly',
 'in',
 'pond',
 'distribut',
 'sword',
 'i',
 'no',
 'basi',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'Supreme',
 'execut',
 'power',
 'deriv',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

In [131]:
# Notice that our regular expression removed the s from ponds but also from is and basis. It produced some non-words like distribut 
# and deriv, but these are acceptable stems in some applications.
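
# The over-stripping of short words noted above can be reduced, though not eliminated, with a simple guard. The following is a 
# minimal sketch rather than part of the original approach: the name stem_guarded and the min_stem threshold of 3 are arbitrary 
# choices made for illustration.

```python
import re

SUFFIXES = r'ing|ly|ed|ious|ies|ive|es|s|ment'

def stem_guarded(word, min_stem=3):
    """Strip one suffix, but only if at least min_stem characters remain."""
    stem, suffix = re.match(r'^(.*?)(' + SUFFIXES + r')?$', word).groups()
    # suffix is None when no suffix matched; keep the word unchanged in that case.
    if suffix and len(stem) >= min_stem:
        return stem
    return word

words = ['ponds', 'is', 'lying', 'basis', 'distributing']
stems = [stem_guarded(w) for w in words]
print(stems)
```

# With this guard, is and lying are left alone, but basis still loses its final s, which is one reason to prefer the 
# off-the-shelf stemmers.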

In [134]:
# Searching Tokenized Text

# You can use a special kind of regular expression for searching across multiple words in a text (where a text is a list of tokens). 
# For example, "<a> <man>" finds all instances of a man in the text. The angle brackets are used to mark token boundaries, and any 
# whitespace between the angle brackets is ignored (behaviors that are unique to NLTK's findall() method for texts). In the first 
# example below, we include <.*>, which will match any single token, and enclose it in parentheses so only the matched word 
# (e.g. monied) and not the matched phrase (e.g. a monied man) is produced. Dropping the leading <a> returns every token that 
# directly precedes man. The nps_chat example finds three-word phrases ending with the word bro. The last example finds sequences 
# of three or more words starting with the letter l.

from nltk.corpus import gutenberg, nps_chat
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> (<.*>) <man>")
moby.findall(r"(<.*>) <man>")

monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
artificial; any; that; a; monied; nervous; a; old; decent; This; a;
this; No; the; dangerous; a; white; a; white; this; the; a; a; any;
one; the; that; That; every; a; a; old; worsted; the; faithful;
Miserable; the; the; the; honest; the; is; a; a; a; fellow; fellow;
fellow; no; white; that; first; a; the; the; elderly; !; young; young;
Young; young; the; a; that; a; a; pious; our; young; young; young;
impenitent; ,; young; young; queer; like; good; good; good; young;
old; young; young; a; young; young; Young; Young; the; a; crazy; the;
a; good; a; the; of; a; that; a; mature; that; earnest; the;
steadfast; no; fearless; a; a; a; mighty; but; ruined; -; unfearing;
white; -; Cape; a; sepulchral; other; sleeping; less; old; old; old;
old; old; old; old; little; lit

In [135]:
chat = nltk.Text(nps_chat.words())
chat.findall(r"<.*> <.*> <bro>")

you rule bro; telling you bro; u twizted bro


In [136]:
chat.findall(r"<l.*>{3,}")

lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la
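
In [None]:
# To make the angle-bracket notation less mysterious, here is a simplified sketch of how such patterns could be emulated with the 
# plain re module. It is an illustration under stated assumptions, not NLTK's own code: token_findall is a hypothetical helper, 
# and it assumes that tokens contain no angle brackets and that the pattern contains no escaped dots.

```python
import re

def token_findall(tokens, pattern):
    """Return the token sequences matched by an angle-bracket pattern."""
    # Encode the token list as one string, wrapping every token in angle brackets.
    encoded = ''.join('<' + t + '>' for t in tokens)
    # Whitespace in the pattern is ignored.
    pattern = re.sub(r'\s+', '', pattern)
    # Wrap each <...> in a non-capturing group so that quantifiers apply to whole tokens.
    pattern = pattern.replace('<', '(?:<(?:').replace('>', ')>)')
    # Keep '.' from matching across a token boundary.
    pattern = pattern.replace('.', '[^<>]')
    hits = []
    for m in re.finditer(pattern, encoded):
        # Report the parenthesized part if the pattern has one, else the whole match.
        matched = m.group(1) if m.groups() else m.group(0)
        hits.append(matched[1:-1].split('><'))
    return hits

adjectives = token_findall('he is a monied man and she is a wise man'.split(),
                           '<a> (<.*>) <man>')
repeats = token_findall('lol lol lol ok la la'.split(), '<l.*>{3,}')
print(adjectives, repeats)
```

# Because each token is wrapped in its own group, a quantifier such as {3,} counts tokens rather than characters.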


In [138]:
# It is easy to build search patterns when the linguistic phenomenon we're studying is tied to particular words. In some cases, 
# a little creativity will go a long way. For instance, searching a large text corpus for expressions of the form x and other ys 
# allows us to discover hypernyms (cf 5):

from nltk.corpus import brown
hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")

speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals


In [140]:
# With enough text, this approach would give us a useful store of information about the taxonomy of objects, without the need for 
# any manual labor. However, our search results will usually contain false positives, i.e. cases that we would want to exclude. 
# For example, the result: demands and other factors suggests that demand is an instance of the type factor, but this sentence is 
# actually about wage demands. Nevertheless, we could construct our own ontology of English concepts by manually correcting the 
# output of such searches.
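
# The idea of collecting hypernym candidates for later manual correction can be sketched as follows. This is a minimal 
# illustration using the plain re module on a made-up sentence; the sample text and the candidates dictionary are assumptions, 
# not data from the Brown Corpus.

```python
import re
from collections import defaultdict

# Made-up sample text, already tokenized and joined with spaces.
sample = ("They studied iron and other metals , water and other liquids , "
          "and copper and other metals .")

# Map each candidate hypernym (the plural after "other") to the words seen before "and other".
candidates = defaultdict(set)
for hyponym, hypernym in re.findall(r'\b(\w+) and other (\w+s)\b', sample):
    candidates[hypernym].add(hyponym)

for hypernym in sorted(candidates):
    print(hypernym, sorted(candidates[hypernym]))
```

# Each entry is only a candidate: false positives like demands and other factors would still need to be removed by hand.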

# 3.6   Normalizing Text

In [141]:
# In earlier program examples we have often converted text to lowercase before doing anything with its words, 
# e.g. set(w.lower() for w in text). By using lower(), we have normalized the text to lowercase so that the distinction between 
# The and the is ignored. Often we want to go further than this, and strip off any affixes, a task known as stemming. A further step 
# is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization. We discuss each of these in 
# turn. First, we need to define the data we will use in this section:

raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)
tokens

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'women',
 'lying',
 'in',
 'ponds',
 'distributing',
 'swords',
 'is',
 'no',
 'basis',
 'for',
 'a',
 'system',
 'of',
 'government',
 '.',
 'Supreme',
 'executive',
 'power',
 'derives',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'masses',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

In [142]:
# Stemmers

# NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer you should use one of these in preference to 
# crafting your own using regular expressions, since these handle a wide range of irregular cases. The Porter and Lancaster 
# stemmers follow their own rules for stripping affixes. Observe that the Porter stemmer correctly handles the word lying 
# (mapping it to lie), while the Lancaster stemmer does not.

porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
[porter.stem(t) for t in tokens]

[u'DENNI',
 u':',
 u'Listen',
 u',',
 u'strang',
 u'women',
 u'lie',
 u'in',
 u'pond',
 u'distribut',
 u'sword',
 u'is',
 u'no',
 u'basi',
 u'for',
 u'a',
 u'system',
 u'of',
 u'govern',
 u'.',
 u'Suprem',
 u'execut',
 u'power',
 u'deriv',
 u'from',
 u'a',
 u'mandat',
 u'from',
 u'the',
 u'mass',
 u',',
 u'not',
 u'from',
 u'some',
 u'farcic',
 u'aquat',
 u'ceremoni',
 u'.']

In [143]:
[lancaster.stem(t) for t in tokens]

['den',
 ':',
 'list',
 ',',
 'strange',
 'wom',
 'lying',
 'in',
 'pond',
 'distribut',
 'sword',
 'is',
 'no',
 'bas',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'suprem',
 'execut',
 'pow',
 'der',
 'from',
 'a',
 'mand',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'som',
 'farc',
 'aqu',
 'ceremony',
 '.']

In [144]:
class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()
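
# The class above keeps an index from stems to token positions, so that a query for one form of a word finds the others. The 
# following minimal sketch shows that idea without NLTK. The names toy_stem and build_index are hypothetical, and toy_stem is 
# only a stand-in for a real stemmer, not the Porter algorithm.

```python
from collections import defaultdict

def toy_stem(word):
    """Stand-in stemmer: lowercase, then strip a final 's' from words longer than 3 letters."""
    word = word.lower()
    return word[:-1] if word.endswith('s') and len(word) > 3 else word

def build_index(tokens, stem=toy_stem):
    """Map each stem to the list of positions where a token with that stem occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[stem(token)].append(position)
    return index

tokens = 'All lies ! You may lie here . Lies , all of it'.split()
index = build_index(tokens)

# Stem the query the same way as the text, then look up its positions.
positions = index[toy_stem('lie')]
forms = [tokens[i] for i in positions]
print(positions, forms)
```

# Stemming both the indexed tokens and the query is what lets a search for lie retrieve lies and Lies as well.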

In [145]:
# Stemming is not a well-defined process, and we typically pick the stemmer that best suits the application we have in mind. The 
# Porter stemmer is a good choice if you are indexing some texts and want to support search using alternative forms of words 
# (illustrated by the IndexedText class above, which uses object-oriented programming techniques that are outside the scope of this 
# book, string formatting techniques to be covered in 3.9, and the enumerate() function to be explained in 4.2).

porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(porter, grail)
text.concordance('lie')

('r king ! DENNIS : Listen , strange women', 'lying in ponds distributing swords is no')
(' beat a very brave retreat . ROBIN : All', 'lies ! MINSTREL : [ singing ] Bravest of')
('       Nay . Nay . Come . Come . You may', 'lie here . Oh , but you are wounded !   ')
('doctors immediately ! No , no , please !', 'Lie down . [ clap clap ] PIGLET : Well  ')
('ere is much danger , for beyond the cave', 'lies the Gorge of Eternal Peril , which ')
('   you . Oh ... TIM : To the north there', 'lies a cave -- the cave of Caerbannog --')
('h it and lived ! Bones of full fifty men', 'lie strewn about its lair . So , brave k')
("not stop our fight ' til each one of you", 'lies dead , and the Holy Grail returns t')


In [146]:
# Lemmatization

# The WordNet lemmatizer only removes affixes if the resulting word is in its dictionary. This additional checking process makes 
# the lemmatizer slower than the above stemmers. Notice that it doesn't handle lying, but it converts women to woman.

wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 u'woman',
 'lying',
 'in',
 u'pond',
 'distributing',
 u'sword',
 'is',
 'no',
 'basis',
 'for',
 'a',
 'system',
 'of',
 'government',
 '.',
 'Supreme',
 'executive',
 'power',
 'derives',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 u'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

In [147]:
# The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and want a list of valid lemmas 
# (or lexicon headwords).

# Note

# Another normalization task involves identifying non-standard words including numbers, abbreviations, and dates, and mapping any 
# such tokens to a special vocabulary. For example, every decimal number could be mapped to a single token 0.0, and every acronym 
# could be mapped to AAA. This keeps the vocabulary small and improves the accuracy of many language modeling tasks.
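
# The mapping described in this note can be sketched with two regular expressions. This is a minimal illustration: the function 
# name normalize_token and the exact patterns for decimal numbers and acronyms are assumptions, and real text would need more 
# cases (dates, currency, and so on).

```python
import re

def normalize_token(token):
    """Map decimal numbers to '0.0' and acronyms to 'AAA'; leave other tokens unchanged."""
    if re.match(r'\d+\.\d+$', token):
        return '0.0'
    # Two or more capital letters, each optionally followed by a period, e.g. NASA or U.S.A.
    if re.match(r'(?:[A-Z]\.?){2,}$', token):
        return 'AAA'
    return token

tokens = ['The', 'U.S.A.', 'spent', '12.40', 'at', 'NASA', 'in', '1999']
normalized = [normalize_token(t) for t in tokens]
print(normalized)
```

# Here 1999 is left unchanged because the decimal pattern requires a decimal point.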

# 3.7   Regular Expressions for Tokenizing Text

In [148]:
# Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data. Although 
# it is a fundamental task, we have been able to delay it until now because many corpora are already tokenized, and because NLTK 
# includes some tokenizers. Now that you are familiar with regular expressions, you can learn how to use them to tokenize text, 
# and to have much more control over the process.

# Simple Approaches to Tokenization

# The very simplest method for tokenizing text is to split on whitespace. Consider the following text from Alice's Adventures in 
# Wonderland:

raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""

In [149]:
# We could split this raw text on whitespace using raw.split(). To do the same using a regular expression, it is not enough to match 
# any space characters in the string, as in re.split(r' ', raw), since this results in tokens that contain a \n newline character; 
# instead we need to match one or more spaces, tabs, or newlines, as in re.split(r'[ \t\n]+', raw):

re.split(r' ', raw)

["'When",
 "I'M",
 'a',
 "Duchess,'",
 'she',
 'said',
 'to',
 'herself,',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone\nthough),',
 "'I",
 "won't",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL.',
 'Soup',
 'does',
 'very\nwell',
 'without--Maybe',
 "it's",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 "hot-tempered,'..."]

In [150]:
re.split(r'[ \t\n]+', raw)

["'When",
 "I'M",
 'a',
 "Duchess,'",
 'she',
 'said',
 'to',
 'herself,',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though),',
 "'I",
 "won't",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL.',
 'Soup',
 'does',
 'very',
 'well',
 'without--Maybe',
 "it's",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 "hot-tempered,'..."]

In [151]:
# The regular expression «[ \t\n]+» matches one or more space, tab (\t) or newline (\n). Other whitespace characters, such as 
# carriage-return and form-feed should really be included too. Instead, we will use a built-in re abbreviation, \s, which means any 
# whitespace character. The above statement can be rewritten as re.split(r'\s+', raw).

# Note

# Important: Remember to prefix regular expressions with the letter r (meaning "raw"), which instructs the Python interpreter to 
# treat the string literally, rather than processing any backslashed characters it contains.

# Splitting on whitespace gives us tokens like '(not' and 'herself,'. An alternative is to use the fact that Python provides us with 
# a character class \w for word characters, equivalent to [a-zA-Z0-9_]. It also defines the complement of this class \W, i.e. all 
# characters other than letters, digits or underscore. We can use \W in a simple regular expression to split the input on anything 
# other than a word character:

re.split(r'\W+', raw)
# Split raw text on anything other than word characters

['',
 'When',
 'I',
 'M',
 'a',
 'Duchess',
 'she',
 'said',
 'to',
 'herself',
 'not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though',
 'I',
 'won',
 't',
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL',
 'Soup',
 'does',
 'very',
 'well',
 'without',
 'Maybe',
 'it',
 's',
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 'hot',
 'tempered',
 '']

In [152]:
# Observe that this gives us empty strings at the start and the end (to understand why, try doing 'xx'.split('x')). We get the same 
# tokens, but without the empty strings, with re.findall(r'\w+', raw), using a pattern that matches the words instead of the spaces. 
# Now that we're matching the words, we're in a position to extend the regular expression to cover a wider range of cases. The regular 
# expression «\w+|\S\w*» will first try to match any sequence of word characters. If no match is found, it will try to match any 
# non-whitespace character (\S is the complement of \s) followed by further word characters. This means that punctuation is grouped 
# with any following letters (e.g. 's) but that sequences of two or more punctuation characters are separated.

re.findall(r'\w+|\S\w*', raw)

["'When",
 'I',
 "'M",
 'a',
 'Duchess',
 ',',
 "'",
 'she',
 'said',
 'to',
 'herself',
 ',',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though',
 ')',
 ',',
 "'I",
 'won',
 "'t",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL',
 '.',
 'Soup',
 'does',
 'very',
 'well',
 'without',
 '-',
 '-Maybe',
 'it',
 "'s",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 'hot',
 '-tempered',
 ',',
 "'",
 '.',
 '.',
 '.']

In [153]:
# Let's generalize the \w+ in the above expression to permit word-internal hyphens and apostrophes: «\w+([-']\w+)*». This expression 
# means \w+ followed by zero or more instances of [-']\w+; it would match hot-tempered and it's. (We need to include ?: in this 
# expression for reasons discussed earlier.) We'll also add a pattern to match quote characters so these are kept separate from the 
# text they enclose.

print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))

["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']


In [None]:
# The above expression also included «[-.(]+» which causes the double hyphen, ellipsis, and open parenthesis to be tokenized 
# separately.

# Table 3.4 lists the regular expression character class symbols we have seen in this section, in addition to some other useful symbols.

# Table 3.4:

# Regular Expression Symbols

# Symbol Function
# \b     Word boundary (zero width)
# \d     Any decimal digit (equivalent to [0-9])
# \D     Any non-digit character (equivalent to [^0-9])
# \s     Any whitespace character (equivalent to [ \t\n\r\f\v])
# \S     Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
# \w     Any alphanumeric character (equivalent to [a-zA-Z0-9_])
# \W     Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
# \t     The tab character
# \n     The newline character

In [154]:
# NLTK's Regular Expression Tokenizer

# The function nltk.regexp_tokenize() is similar to re.findall() (as we've been using it for tokenization). In current NLTK 3 
# releases it hands the pattern to re.findall() when matching tokens, so grouping parentheses should be written as non-capturing 
# (?:...) for the reasons discussed earlier; with capturing groups, the groups rather than the whole tokens are returned. For 
# readability we break up the regular expression over several lines and add a comment about each line. The special (?x) "verbose 
# flag" tells Python to strip out the embedded whitespace and comments.

text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)       # set flag to allow verbose regexps
     (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
   | \w+(?:-\w+)*        # words with optional internal hyphens
   | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
   | \.\.\.              # ellipsis
   | [][.,;"'?():_`-]    # these are separate tokens; includes ], [ (hyphen last, so it is literal)
'''
nltk.regexp_tokenize(text, pattern)

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

In [155]:
# When using the verbose flag, you can no longer use ' ' to match a space character; use \s instead. The regexp_tokenize() function 
# has an optional gaps parameter. When set to True, the regular expression specifies the gaps between tokens, as with re.split().
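
# Both points can be checked with the plain re module. This is a minimal sketch on made-up strings; it uses re.findall() and 
# re.split() directly rather than calling regexp_tokenize().

```python
import re

sample = 'a b ab'

# In verbose mode the literal space is ignored, so this pattern means 'ab'.
literal_space = re.findall(r'(?x) a b', sample)

# \s survives verbose mode, so this pattern means 'a', whitespace, 'b'.
escaped_space = re.findall(r'(?x) a \s b', sample)

# Describing the gaps instead of the tokens: split wherever whitespace occurs.
gap_tokens = re.split(r'\s+', 'That poster costs $12.40')

print(literal_space, escaped_space, gap_tokens)
```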

# Note

# We can evaluate a tokenizer by comparing the resulting tokens with a wordlist, and reporting any tokens that don't appear in the 
# wordlist, using set(tokens).difference(wordlist). You'll probably want to lowercase all the tokens first.
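
# The check described in this note might look as follows. The tokens and the tiny wordlist are made up for illustration; in 
# practice the wordlist would come from a real lexicon.

```python
tokens = ['The', 'cat', 'sat', 'on', 'teh', 'mat', "n't"]
wordlist = {'the', 'cat', 'sat', 'on', 'mat'}   # toy wordlist (assumption)

# Lowercase the tokens first, then report those missing from the wordlist.
unknown = set(t.lower() for t in tokens).difference(wordlist)
print(sorted(unknown))
```

# The unknown tokens are a mixture of typos (teh) and tokenizer decisions (n't), so the list still has to be read by a person.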

# Further Issues with Tokenization

# Tokenization turns out to be a far more difficult task than you might have expected. No single solution works well across the board, 
# and we must decide what counts as a token depending on the application domain.

# When developing a tokenizer it helps to have access to raw text which has been manually tokenized, in order to compare the output 
# of your tokenizer with high-quality (or "gold-standard") tokens. The NLTK corpus collection includes a sample of Penn Treebank 
# data, including the raw Wall Street Journal text (nltk.corpus.treebank_raw.raw()) and the tokenized version 
# (nltk.corpus.treebank.words()).

# A final issue for tokenization is the presence of contractions, such as didn't. If we are analyzing the meaning of a sentence, 
# it would probably be more useful to normalize this form to two separate forms: did and n't (or not). We can do this work with the 
# help of a lookup table.
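
# A lookup table for contractions can be as simple as a dictionary. The following is a minimal sketch: the table entries and the 
# name expand_contractions are assumptions for illustration, and the lookup is an exact, case-sensitive match.

```python
# Hypothetical lookup table; a real one would list many more contractions.
CONTRACTIONS = {
    "didn't": ['did', "n't"],
    "it's": ['it', "'s"],
}

def expand_contractions(tokens):
    """Replace each contraction found in the table by its separate forms."""
    expanded = []
    for token in tokens:
        expanded.extend(CONTRACTIONS.get(token, [token]))
    return expanded

tokens = ['She', "didn't", 'say', "it's", 'late']
expanded = expand_contractions(tokens)
print(expanded)
```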

# 3.8   Segmentation

In [157]:
# This section discusses more advanced concepts, which you may prefer to skip on the first time through this chapter.

# Tokenization is an instance of a more general problem of segmentation. In this section we will look at two other instances of this 
# problem, which use radically different techniques to the ones we have seen so far in this chapter.

# Sentence Segmentation

# Manipulating texts at the level of individual words often presupposes the ability to divide a text into individual sentences. As 
# we have seen, some corpora already provide access at the sentence level. In the following example, we compute the average number 
# of words per sentence in the Brown Corpus:

len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())

20.250994070456922

In [158]:
# In other cases, the text is only available as a stream of characters. Before tokenizing the text into words, we need to segment 
# it into sentences. NLTK facilitates this by including the Punkt sentence segmenter (Kiss & Strunk, 2006). Here is an example 
# of its use in segmenting the text of a novel. (Note that if the segmenter's internal data has been updated by the time you read 
# this, you will see different output):

text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = nltk.sent_tokenize(text)
pprint.pprint(sents[79:89])

[u'"Nonsense!"',
 u'said Gregory, who was very rational when anyone else\nattempted paradox.',
 u'"Why do all the clerks and navvies in the\nrailway trains look so sad and tired, so very sad and tired?',
 u'I will\ntell you.',
 u'It is because they know that the train is going right.',
 u'It\nis because they know that whatever place they have taken a ticket\nfor that place they will reach.',
 u'It is because after they have\npassed Sloane Square they know that the next station must be\nVictoria, and nothing but Victoria.',
 u'Oh, their wild rapture!',
 u'oh,\ntheir eyes like stars and their souls again in Eden, if the next\nstation were unaccountably Baker Street!"',
 u'"It is you who are unpoetical," replied the poet Syme.']


In [None]:
# Notice that this example is really a single sentence, reporting the speech of Mr Lucian Gregory. However, the quoted speech 
# contains several sentences, and these have been split into individual strings. This is reasonable behavior for most applications.

# Sentence segmentation is difficult because the period is used to mark abbreviations, and some periods simultaneously mark an 
# abbreviation and terminate a sentence, as often happens with acronyms like U.S.A.

# For another approach to sentence segmentation, see 2.
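
# To see why the period is an unreliable cue, consider a naive rule that ends a sentence at every period, question mark or 
# exclamation mark followed by whitespace. This is a minimal sketch of that rule only; naive_sent_split is a hypothetical 
# helper and is not how the Punkt segmenter works.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

sample = 'She moved to the U.S.A. in 1999. It was cold.'
sents = naive_sent_split(sample)
print(sents)
```

# The rule wrongly breaks the first sentence after U.S.A., which is the kind of case a trained segmenter is designed to handle.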

In [None]:
# Skipped the small section on Word Segmentation

In [None]:
# Skipped 3.9   Formatting: From Lists to Strings
# Instead of formatting the strings for display to some screen, I will store the results in a graph database

In [None]:
# There are many online resources for Unicode. Useful discussions of Python's facilities for handling Unicode are:

#    Ned Batchelder, Pragmatic Unicode, http://nedbatchelder.com/text/unipain.html
#    Unicode HOWTO, Python Documentation, http://docs.python.org/3/howto/unicode.html
#    David Beazley, Mastering Python 3 I/O, http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o
#    Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character 
#    Sets (No Excuses!), http://www.joelonsoftware.com/articles/Unicode.html