This first bit was a discussion of character encoding: the 7-bit ASCII table, 8-bit bytes, and decimal (base 10) versus hexadecimal (base 16) notation.  I was showing that hexadecimal 3F (binary 0011 1111) is 63 in decimal (3×16 + 15 = 48 + 15), and how to get back from decimal to hexadecimal using `hex()`.  We also discussed Unicode, UTF-8, and UTF-16 here.

In [1]:
48+15

63

In [2]:
hex(63)

'0x3f'
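
To make the arithmetic concrete, here is a small sketch of moving between decimal, hexadecimal, and binary using only Python built-ins. The variable name `value` is just for illustration.

```python
# Hexadecimal 3F is 3 sixteens plus 15 ones.
value = 3 * 16 + 15
print(value)                 # 63

# int() with base 16 parses a hexadecimal string; a 0x literal works too.
print(int("3f", 16))         # 63
print(0x3F == value)         # True

# format() shows the same number as 8 binary digits (one byte).
print(format(value, "08b"))  # 00111111

# In the ASCII table, code 63 is the question mark.
print(chr(value))            # ?
```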

This task is to get Crime and Punishment from Project Gutenberg.  This tracks pretty closely with chapter 3. Note that the URL has changed from what is in the book. Who knows if it will change again, but this was the link for the UTF-8 text version when the class happened.

In [3]:
import nltk

In [4]:
from urllib import request

In [5]:
url = "https://www.gutenberg.org/files/2554/2554-0.txt"

In [6]:
response = request.urlopen(url)

In [7]:
raw = response.read().decode('utf8')

In [8]:
len(raw)

1176967

In [9]:
raw[:100]

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n\r\nThis eBook is for the '
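
The fetch-and-decode steps above can also be wrapped into one function. This is only a sketch: `fetch_text` and its `fallback_encoding` parameter are names I made up for illustration, and it assumes the server either declares its character set or is serving UTF-8.

```python
from urllib import request

def fetch_text(url, fallback_encoding="utf8"):
    """Download a URL and return its contents as a str (hypothetical helper)."""
    # The with-block closes the connection when we are done with it.
    with request.urlopen(url) as response:
        data = response.read()
        # Use the charset the server declares, if any; otherwise fall back.
        charset = response.headers.get_content_charset() or fallback_encoding
    return data.decode(charset)

# Usage (needs a network connection):
# raw = fetch_text("https://www.gutenberg.org/files/2554/2554-0.txt")
```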

`raw` now holds the full text.  I spent a bit of time talking about `\r` (carriage return) and `\n` (line feed), and pointed out that if you want a string with a line break in it, you need one or both of these (here `\n` alone will suffice both to return to the left and to go down a line), because you can't just press return in the middle of an ordinary quoted string. I demonstrated this below by attempting to print a string split over two lines. This results in an error, and if I leave it in, the notebook stops executing at that point.  So I have commented it out with `#` characters at the beginning, but if you want to recreate the experience from class, you can remove the comments and observe how...
```python
print("Hello.
There.")
```
...yields an error like...
```python
File "<ipython-input-12-822970d7f983>", line 2
    print("Hello.
                 ^
SyntaxError: EOL while scanning string literal
```

In [10]:
#print("Hello.
#There.")

The way to do what was intended above is to use `\n` (for "newline").

In [11]:
print("Hello.\nThere.")

Hello.
There.
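
As an aside that we did not cover in class: a triple-quoted string is the one kind of string literal that can span lines, and the line breaks you type end up as `\n` characters. A minimal sketch (the name `greeting` is just for illustration):

```python
# A triple-quoted string may span several lines; the line breaks you type
# become "\n" characters in the string.
greeting = """Hello.
There."""
print(greeting == "Hello.\nThere.")            # True

# splitlines() breaks a string at "\n", "\r\n", or "\r" alike.
print("Hello.\r\nThere.\nBye.".splitlines())   # ['Hello.', 'There.', 'Bye.']
```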


It is useful to use `type()` to determine the type of something (string, list, dictionary, etc.).  To wit:

In [12]:
type(raw)

str

In [13]:
x = [1, 2, 3]

In [14]:
type(x)

list

In [15]:
y = {'a':1, 'b':2}

In [16]:
type(y)

dict
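
If you want to test for a particular type rather than just look at it, `isinstance()` is the usual tool. A small sketch, with `samples` as a made-up list for illustration:

```python
samples = ["some text", [1, 2, 3], {"a": 1, "b": 2}]

for item in samples:
    # type(item).__name__ is the short name that the notebook displays.
    print(type(item).__name__, isinstance(item, str))
# str True
# list False
# dict False
```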

Now, here's the first 100 characters.  We'll deal with the `\ufeff` at the beginning there in a bit.

In [17]:
raw[:100]

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n\r\nThis eBook is for the '

We will "tokenize" this text using `nltk.word_tokenize()`, which splits everything up into words (at spaces, but it also separates punctuation into tokens of its own, etc.).

In [18]:
tokens = nltk.word_tokenize(raw)

In [19]:
type(tokens)

list

In [20]:
tokens[:5]

['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of']

If we need to do a lot of tokenization, we can save ourselves some typing by importing the function directly, which allows us to call it as `word_tokenize()` rather than as `nltk.word_tokenize()`.  Not a big deal, but the shorter form is what you see in the chapter, and that's why they're doing it.  They're typing 30 characters ("from nltk import word_tokenize") to save five characters of typing ("nltk.") each time.  Whether that was a good idea can be left for you to decide.

In [21]:
from nltk import word_tokenize

In [22]:
tokens2 = word_tokenize(raw)

Either way (`tokens2` via `word_tokenize()` or `tokens` via `nltk.word_tokenize()`), the result is the same.

In [23]:
tokens == tokens2

True

Now, what is the first token?

In [24]:
tokens2[0]

'\ufeffThe'

What is that first character of the first token? It turns out it is a single character, the Unicode character U+FEFF. You can look this up on Wikipedia: it is the byte order mark (BOM).  It doesn't really belong in a UTF-8 file; it is more at home in a UTF-16 file, where byte order matters.  So, we're going to want to get rid of that.

In [25]:
tokens2[0][0]

'\ufeff'

In [26]:
raw[0]

'\ufeff'

We can get rid of it by slicing it off as follows. The first letter will then be `T` rather than `\uFEFF`.

In [27]:
raw = raw[1:]

In [28]:
raw[0]

'T'

In [29]:
tokens = word_tokenize(raw)

In [30]:
tokens[0]

'The'
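
Slicing works fine, but Python also has a codec, `utf-8-sig`, that drops a leading U+FEFF while decoding. A minimal sketch on a few hand-made bytes (`sample_bytes` is an invented stand-in for what `response.read()` returns):

```python
# The first three bytes are U+FEFF encoded in UTF-8.
sample_bytes = b"\xef\xbb\xbfThe Project Gutenberg EBook"

with_bom = sample_bytes.decode("utf8")
without_bom = sample_bytes.decode("utf-8-sig")

print(repr(with_bom[0]))      # '\ufeff'
print(repr(without_bom[0]))   # 'T'
```

If the bytes have no BOM at the start, `utf-8-sig` decodes them just as `utf8` would.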

Now, we can turn it into a `Text` which provides access to various analysis functions, like `collocations()`.

In [31]:
text = nltk.Text(tokens)

In [32]:
type(text)

nltk.text.Text

In [33]:
text.collocations()

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; young man; Nikodim Fomitch; Ilya Petrovitch; Project
Gutenberg; Andrey Semyonovitch; Hay Market; Dmitri Prokofitch; Good
heavens


The fact that we have `Project Gutenberg-tm` in there reminds us that we actually have a bunch of junk in the file at the beginning and end, besides just the novel text.  So we'll locate `PART I` (the beginning of the novel) and `End of Project` (the beginning of the endmatter that Project Gutenberg added), and we'll slice out just the part between those.  Then we can re-tokenize it, so that we have just tokens from the novel.

In [34]:
raw.find("PART I")

5335

In [35]:
raw[5335:5345]

'PART I\r\n\r\n'

In [36]:
raw.rfind("End of Project")

1157811

In [37]:
realraw = raw[5335:1157811]

In [38]:
realraw[:10]

'PART I\r\n\r\n'

In [39]:
tokens = word_tokenize(realraw)

In [40]:
tokens[:10]

['PART',
 'I',
 'CHAPTER',
 'I',
 'On',
 'an',
 'exceptionally',
 'hot',
 'evening',
 'early']

In [41]:
text = nltk.Text(tokens)

In [42]:
text.collocations()

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Porfiry Petrovitch; Amalia Ivanovna; great deal; young man;
Nikodim Fomitch; Ilya Petrovitch; Andrey Semyonovitch; Hay Market;
Dmitri Prokofitch; Good heavens; police station; head clerk
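
The find-and-slice steps above used hard-coded positions (5335 and 1157811), which will be wrong if the file ever changes. Here is a sketch of the same idea as a function; `slice_between` and `sample_doc` are names I made up, and the sample string is a tiny stand-in for the real file.

```python
def slice_between(text, start_marker, end_marker):
    """Return text from the first start_marker up to the last end_marker."""
    start = text.find(start_marker)
    end = text.rfind(end_marker)
    # find() and rfind() return -1 when the marker is absent.
    if start == -1 or end == -1 or end < start:
        raise ValueError("marker not found")
    return text[start:end]

sample_doc = "Header junk\r\nPART I\r\n\r\nThe novel itself.\r\nEnd of Project junk"
print(repr(slice_between(sample_doc, "PART I", "End of Project")))
# 'PART I\r\n\r\nThe novel itself.\r\n'
```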


Ok, new topic, reading things from the web.  Let's grab this ancient story from an old BBC news article.  See the textbook for commentary on it and why it was chosen. I just used the URL from the book, rather than trying to find a new news article. But note that this works better with old news articles than new pages from the BBC site, because everything is all very interactive now and a "web page" usually consists of a kind of a skeleton with little pockets within it that are dynamically reloaded all the time. So, it's much harder now to "grab the text of a web page" because it is no longer a single document, but a whole array of different documents. If you need to do more complex things, there are tools to do them, but we'll start with this simple example anyway.

In [43]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"

In [44]:
html = request.urlopen(url).read().decode('utf8')

In [45]:
html[:600]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\r\n<html>\r\n<head>\r\n<title>BBC NEWS | Health | Blondes \'to die out in 200 years\'</title>\r\n<meta name="keywords" content="BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service">\r\n<meta name="OriginalPublicationDate" content="2002/09/27 11:51:55">\r\n<meta name="UKFS_URL" content="/1/hi/health/2284783.stm">\r\n<meta name="IFS_URL" content="/2/hi/health/2284783.stm">\r\n<meta name="HTTP-EQUIV" content="text/html;charset=iso-8859-1">\r\n<meta name="Headline" conte'

In [46]:
print(html)

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>BBC NEWS | Health | Blondes 'to die out in 200 years'</title>
<meta name="keywords" content="BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service">
<meta name="OriginalPublicationDate" content="2002/09/27 11:51:55">
<meta name="UKFS_URL" content="/1/hi/health/2284783.stm">
<meta name="IFS_URL" content="/2/hi/health/2284783.stm">
<meta name="HTTP-EQUIV" content="text/html;charset=iso-8859-1">
<meta name="Headline" content="Blondes 'to die out in 200 years'">
<meta name="Section" content="Health">
<meta name="Description" content="Natural blondes are an endangered species and will die out by 2202, a study suggests.">
<!-- GENMaps-->
<map name="banner">
<area alt="BBC NEWS" coords="7,9,167,32" href="http://news.bbc.co.uk/1/hi.html" shape="RECT">
</map>

<script src="/nol/shared/js/livestats_v1_1.js" langua

To analyze the contents of an HTML (Hypertext Markup Language) document, we will use `BeautifulSoup`.  Named that way because I don't know why.  If you feed it HTML, it knows how to `get_text()` which will remove a lot of the junk and return just the text.

In [47]:
from bs4 import BeautifulSoup

In [48]:
bbcraw = BeautifulSoup(html, "lxml").get_text()


In [49]:
bbcraw

'\n\nBBC NEWS | Health | Blondes \'to die out in 200 years\'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNEWS\n\xa0\xa0SPORT\n\xa0\xa0WEATHER\n\xa0\xa0WORLD SERVICE\n\n\xa0\xa0A-Z INDEX\xa0\n\n\xa0\xa0SEARCH\xa0\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\r\n    \xa0You are in:\xa0Health \xa0\r\n    \r\n    \r\n\n\n\n\n\n\n\n\n\n\n\nNews Front Page\n\n\n\n\n\nAfrica\n\n\nAmericas\n\n\nAsia-Pacific\n\n\nEurope\n\n\nMiddle East\n\n\nSouth Asia\n\n\nUK\n\n\nBusiness\n\n\nEntertainment\n\n\nScience/Nature\n\n\nTechnology\n\n\nHealth\n\n\nMedical notes\n\n\n-------------\n\n\nTalking Point\n\n\n-------------\n\n\nCountry Profiles\n\n\nIn Depth\n\n\n-------------\n\n\nProgrammes\n\n\n-------------\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSERVICES\r\n\n\n\n\n\n\n\nDaily E-mail\r\n\n\n\n\n\n\n\nNews Ticker\r\n\n\n\n\n\n\n\nMobile/PDAs\r\n\n\n\n\n\n\n-------------\r\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nText Only\n\n\

And that's as far as we took this, but at this point you have something that you can break into words using a tokenizer and analyze if you wish.
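
One cleanup step you might want first is to throw away all those blank lines and non-breaking spaces (`\xa0`). A minimal sketch using only string methods; `sample_page` is a short invented string in the style of the `get_text()` output above, not the real page.

```python
# A short sample in the style of the get_text() output above.
sample_page = "\n\nBBC NEWS | Health\n\n\n\nNEWS\n\xa0\xa0SPORT\n\xa0\xa0WEATHER\n\n"

# strip() removes surrounding whitespace, including the non-breaking
# space "\xa0"; lines that end up empty are dropped.
page_lines = [line.strip() for line in sample_page.splitlines() if line.strip()]
print(page_lines)   # ['BBC NEWS | Health', 'NEWS', 'SPORT', 'WEATHER']
```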

New topic: Unicode for other languages.  This is basically just following what is in the book, so you can look there for more description.

In [50]:
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

In [51]:
path

FileSystemPathPointer('/Users/hagstrom/nltk_data/corpora/unicode_samples/polish-lat2.txt')

In [52]:
f = open(path, encoding='latin2')

In [53]:
for line in f:
    line = line.strip()
    print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.


Since we opened the file, we should close it again. It's just polite.

In [54]:
f.close()
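
A common alternative to remembering `close()` is a `with` block, which closes the file for you when the block ends. This sketch writes its own small Latin-2 file into a temporary directory so it does not depend on the NLTK data; all of the names in it are made up for illustration.

```python
import os
import tempfile

sample_text = "Państwowa\nŚląsk\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample-lat2.txt")
    # Write a small Latin-2 file so the example does not depend on nltk_data.
    with open(sample_path, "w", encoding="latin2") as out_file:
        out_file.write(sample_text)
    # The file is closed automatically when the with-block ends.
    with open(sample_path, encoding="latin2") as in_file:
        sample_lines = [line.strip() for line in in_file]

print(sample_lines)      # ['Państwowa', 'Śląsk']
print(in_file.closed)    # True
```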

A quick explanation of what `strip()` does: it removes whitespace from the beginning and end of a string.

In [55]:
"    hello ".strip()

'hello'

This is what it would have looked like without `strip()`. The line breaks after each line would still be there, so it would look double-spaced.

In [56]:
f = open(path, encoding='latin2')

In [57]:
for line in f:
    print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą

"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez

Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały

odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki

Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych

archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.



In [58]:
f.close()

In [59]:
ord('ń')

324

In [60]:
hex(324)

'0x144'

In [61]:
print("\u0144")

ń


In [62]:
n_acute = "\u0144"

In [63]:
type(n_acute)

str

In [64]:
n_acute.encode('utf8')

b'\xc5\x84'
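
The cells above show that `ń` is code point 324 (hex 144) and that UTF-8 stores it as two bytes. As a sketch of why the encoding you name matters, here is the same character in a few encodings, plus what happens when bytes are decoded with the wrong one (`n_acute_char` is just an illustrative name):

```python
n_acute_char = "\u0144"   # ń

print(n_acute_char.encode("utf8"))       # b'\xc5\x84' (two bytes)
print(n_acute_char.encode("latin2"))     # b'\xf1'     (one byte)
print(n_acute_char.encode("utf-16-le"))  # b'D\x01'    (bytes 0x44 0x01)

# Decoding with the wrong encoding gives the wrong character, not an error.
print(b"\xf1".decode("latin2"))   # ń
print(b"\xf1".decode("latin1"))   # ñ
```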

New topic. Regular expressions.  Doing `import re` loads Python's regular expression module, which provides functions like `re.search()` and `re.findall()`.

In [65]:
import re

Gather together a corpus of English words.

In [66]:
engword = nltk.corpus.words.words('en')

In [67]:
engword[:10]

['A',
 'a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'Aani',
 'aardvark',
 'aardwolf',
 'Aaron']

Let's get rid of the proper names by keeping only the words that are entirely lowercase (`islower()` checks every letter, not just the first).

In [68]:
wordlist = [w for w in engword if w.islower()]

In [69]:
wordlist[:10]

['a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'aardvark',
 'aardwolf',
 'aba',
 'abac',
 'abaca']

In [70]:
len(wordlist)

210687
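
Note that `islower()` looks at the whole string, not just the first letter. A quick sketch of the difference, on a made-up list `candidates`:

```python
candidates = ["aardvark", "Aaron", "iPhone", "mid-air", "42"]

# islower() looks at every cased letter in the string.
print([w for w in candidates if w.islower()])
# ['aardvark', 'mid-air']

# Checking only the first character gives a different answer.
print([w for w in candidates if w[0].islower()])
# ['aardvark', 'iPhone', 'mid-air']
```

A string with no letters at all, like `'42'`, counts as not lowercase.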

Find words that end in `ed` (`$` anchors the match to the end of the word).

In [71]:
[w for w in wordlist if re.search('ed$', w)]

['abaissed',
 'abandoned',
 'abased',
 'abashed',
 'abatised',
 'abed',
 'aborted',
 'abridged',
 'abscessed',
 'absconded',
 'absorbed',
 'abstracted',
 'abstricted',
 'accelerated',
 'accepted',
 'accidented',
 'accoladed',
 'accolated',
 'accomplished',
 'accosted',
 'accredited',
 'accursed',
 'accused',
 'accustomed',
 'acetated',
 'acheweed',
 'aciculated',
 'aciliated',
 'acknowledged',
 'acorned',
 'acquainted',
 'acquired',
 'acquisited',
 'acred',
 'aculeated',
 'addebted',
 'added',
 'addicted',
 'addlebrained',
 'addleheaded',
 'addlepated',
 'addorsed',
 'adempted',
 'adfected',
 'adjoined',
 'admired',
 'admitted',
 'adnexed',
 'adopted',
 'adossed',
 'adreamed',
 'adscripted',
 'aduncated',
 'advanced',
 'advised',
 'aeried',
 'aethered',
 'afeared',
 'affected',
 'affectioned',
 'affined',
 'afflicted',
 'affricated',
 'affrighted',
 'affronted',
 'aforenamed',
 'afterfeed',
 'aftershafted',
 'afterthoughted',
 'afterwitted',
 'agazed',
 'aged',
 'agglomerated',
 'aggri

Find words that start with `ed` (`^` anchors the match to the start of the word).

In [72]:
[w for w in wordlist if re.search('^ed', w)]

['edacious',
 'edaciously',
 'edaciousness',
 'edacity',
 'edaphic',
 'edaphology',
 'edaphon',
 'edder',
 'eddish',
 'eddo',
 'eddy',
 'eddyroot',
 'edea',
 'edeagra',
 'edeitis',
 'edelweiss',
 'edema',
 'edematous',
 'edemic',
 'edenite',
 'edental',
 'edentalous',
 'edentate',
 'edentulate',
 'edentulous',
 'edeodynia',
 'edeology',
 'edeomania',
 'edeoscopy',
 'edeotomy',
 'edestan',
 'edestin',
 'edge',
 'edgebone',
 'edged',
 'edgeless',
 'edgemaker',
 'edgemaking',
 'edgeman',
 'edger',
 'edgerman',
 'edgeshot',
 'edgestone',
 'edgeways',
 'edgeweed',
 'edgewise',
 'edginess',
 'edging',
 'edgingly',
 'edgrew',
 'edgy',
 'edh',
 'edibility',
 'edible',
 'edibleness',
 'edict',
 'edictal',
 'edictally',
 'edicule',
 'edificable',
 'edification',
 'edificator',
 'edificatory',
 'edifice',
 'edificial',
 'edifier',
 'edify',
 'edifying',
 'edifyingly',
 'edifyingness',
 'edingtonite',
 'edit',
 'edital',
 'edition',
 'editor',
 'editorial',
 'editorialize',
 'editorially',
 'edito

Find words that are 8 letters long, have a `j` as the third letter and a `t` as the sixth (each `.` matches any single character). For those occasions when you want to cheat at crossword puzzles.

In [73]:
[w for w in wordlist if re.search('^..j..t..$', w)]

['abjectly',
 'adjuster',
 'dejected',
 'dejectly',
 'injector',
 'majestic',
 'objectee',
 'objector',
 'rejecter',
 'rejector',
 'unjilted',
 'unjolted',
 'unjustly']

Find words that start with either `ae` or `be`.

In [74]:
[w for w in wordlist if re.search("^[ab]e", w)]

['ae',
 'aecial',
 'aecidial',
 'aecidioform',
 'aecidiospore',
 'aecidiostage',
 'aecidium',
 'aeciospore',
 'aeciostage',
 'aecioteliospore',
 'aeciotelium',
 'aecium',
 'aedeagus',
 'aedicula',
 'aedile',
 'aedileship',
 'aedilian',
 'aedilic',
 'aedilitian',
 'aedility',
 'aedoeagus',
 'aefald',
 'aefaldness',
 'aefaldy',
 'aefauld',
 'aegagropila',
 'aegagropile',
 'aegagrus',
 'aegerian',
 'aegeriid',
 'aegicrania',
 'aegirine',
 'aegirinolite',
 'aegirite',
 'aegis',
 'aegithognathism',
 'aegithognathous',
 'aegrotant',
 'aegyptilla',
 'aegyrite',
 'aeluroid',
 'aelurophobe',
 'aelurophobia',
 'aeluropodous',
 'aenach',
 'aenean',
 'aeneolithic',
 'aeneous',
 'aenigmatite',
 'aeolharmonica',
 'aeolid',
 'aeolina',
 'aeoline',
 'aeolipile',
 'aeolistic',
 'aeolodicon',
 'aeolodion',
 'aeolomelodicon',
 'aeolopantalon',
 'aeolotropic',
 'aeolotropism',
 'aeolotropy',
 'aeolsklavier',
 'aeon',
 'aeonial',
 'aeonian',
 'aeonist',
 'aequoreal',
 'aer',
 'aerage',
 'aerarian',
 'aerar

Figure out what you might have meant by typing 4653 on your RAZR (on a phone keypad, 4 is `ghi`, 6 is `mno`, 5 is `jkl`, and 3 is `def`).

In [75]:
[w for w in wordlist if re.search("^[ghi][mno][jkl][def]$", w)]

['gold', 'golf', 'hold', 'hole']
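
The pattern above was typed out by hand from the keypad letters for the digits 4, 6, 5, 3. As a sketch, you could build such a pattern from any digit string. `KEYPAD`, `keypad_pattern`, and `sample_words` are names I made up, and the short word list stands in for `wordlist`.

```python
import re

# Letters on a standard phone keypad.
KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}

def keypad_pattern(digits):
    """Build a regex with one character class per digit pressed."""
    return "^" + "".join("[" + KEYPAD[d] + "]" for d in digits) + "$"

pattern = keypad_pattern("4653")
print(pattern)   # ^[ghi][mno][jkl][def]$

sample_words = ["gold", "golf", "hold", "hole", "help", "goldfish"]
print([w for w in sample_words if re.search(pattern, w)])
# ['gold', 'golf', 'hold', 'hole']
```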

Gather words from the chat corpus.

In [76]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))

Look for words that have one or more `m`s followed by one or more `i`s followed by one or more `n`s followed by one or more `e`s.

In [77]:
[w for w in chat_words if re.search('^m+i+n+e+$', w)]

['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 'miiiiiinnnnnnnnnneeeeeeee',
 'mine',
 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

Look for words that have zero or more `m`s followed by zero or more `i`s followed by zero or more `n`s followed by zero or more `e`s.

In [78]:
[w for w in chat_words if re.search('^m*i*n*e*$', w)]

['',
 'e',
 'i',
 'in',
 'm',
 'me',
 'meeeeeeeeeeeee',
 'mi',
 'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 'miiiiiinnnnnnnnnneeeeeeee',
 'min',
 'mine',
 'mm',
 'mmm',
 'mmmm',
 'mmmmm',
 'mmmmmm',
 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee',
 'mmmmmmmmmm',
 'mmmmmmmmmmmmm',
 'mmmmmmmmmmmmmm',
 'n',
 'ne']
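
The difference between `+` (one or more) and `*` (zero or more) is easier to see on a short list. A minimal sketch, with `samples_mine` as a made-up list rather than the chat corpus:

```python
import re

samples_mine = ["mine", "miiinnne", "me", "", "nine", "emin"]

# + requires at least one of each letter, in order.
plus_matches = [w for w in samples_mine if re.search(r"^m+i+n+e+$", w)]
print(plus_matches)   # ['mine', 'miiinnne']

# * lets any of the letters be missing, so even the empty string matches.
star_matches = [w for w in samples_mine if re.search(r"^m*i*n*e*$", w)]
print(star_matches)   # ['mine', 'miiinnne', 'me', '']
```

Neither pattern matches `nine` or `emin`, because the letters still have to come in the order m, i, n, e.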

Find words you can make with one or more `a`s or `h`s.

In [79]:
[w for w in chat_words if re.search('^[ha]+$', w)]

['a',
 'aaaaaaaaaaaaaaaaa',
 'aaahhhh',
 'ah',
 'ahah',
 'ahahah',
 'ahh',
 'ahhahahaha',
 'ahhh',
 'ahhhh',
 'ahhhhhh',
 'ahhhhhhhhhhhhhh',
 'h',
 'ha',
 'haaa',
 'hah',
 'haha',
 'hahaaa',
 'hahah',
 'hahaha',
 'hahahaa',
 'hahahah',
 'hahahaha',
 'hahahahaaa',
 'hahahahahaha',
 'hahahahahahaha',
 'hahahahahahahahahahahahahahahaha',
 'hahahhahah',
 'hahhahahaha']

An example of `re.findall()` that finds the vowels in a word.

In [80]:
word = "supercalifragilisticexpialidocious"

In [81]:
re.findall('[aeiou]', word)

['u',
 'e',
 'a',
 'i',
 'a',
 'i',
 'i',
 'i',
 'e',
 'i',
 'a',
 'i',
 'o',
 'i',
 'o',
 'u']

A brief sidetrack on why you might want to use a raw string as your regular expression pattern.  This becomes clearer when we note that something like `\w` stands for a word character (a letter, digit, or underscore).  In an ordinary string, `\` is a special character: it means that the next thing is going to be interpreted in some special way, so to get a literal `\` you need to "escape" it by putting another `\` before it.  That is (`\t` here is tab):

In [82]:
print("Hello\tWorld")

Hello	World


If we actually wanted the string `Hello\tWorld`, we'd need to escape the `\`.

In [83]:
print("Hello\\tWorld")

Hello\tWorld


In [84]:
print(r"Hello\\tWorld")

Hello\\tWorld


In [85]:
re.findall(r'[aeiou]', word)

['u',
 'e',
 'a',
 'i',
 'a',
 'i',
 'i',
 'i',
 'e',
 'i',
 'a',
 'i',
 'o',
 'i',
 'o',
 'u']
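
The vowel pattern has no backslashes in it, so the `r` makes no difference there. Here is a sketch of a case where it does matter, using `\b`, which means backspace to Python but "word boundary" to the regular expression engine. The variable `sentence` is just for illustration.

```python
import re

# In an ordinary string "\b" is one character (backspace);
# in a raw string r"\b" is two characters, a backslash and a b.
print(len("\b"), len(r"\b"))    # 1 2

sentence = "the cat and the hat"
# The regex engine reads the two-character \b as "word boundary".
print(re.findall(r"\bthe\b", sentence))   # ['the', 'the']
# Without the r, the pattern looks for backspace characters instead.
print(re.findall("\bthe\b", sentence))    # []
```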

Next task: look in the Wall St Journal corpus for sequences of three or more vowels in a row, and count how many distinct words each sequence turns up in.

In [86]:
wsj = sorted(set(nltk.corpus.treebank.words()))

In [87]:
len(wsj)

12408

In [88]:
fd = nltk.FreqDist(vs for word in wsj for vs in re.findall(r'[aeiou]{3,}', word))

In [89]:
fd.most_common(12)

[('iou', 27),
 ('eau', 10),
 ('oui', 6),
 ('eou', 5),
 ('uou', 5),
 ('uee', 4),
 ('ieu', 3),
 ('uie', 3),
 ('eei', 2),
 ('iai', 1),
 ('oei', 1),
 ('aii', 1)]
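
If you want to try the same counting idea without NLTK, `collections.Counter` from the standard library also has a `most_common()` method. A minimal sketch on a tiny made-up vocabulary (`sample_vocab` stands in for `wsj`):

```python
import re
from collections import Counter

sample_vocab = ["previous", "serious", "bureau", "queen", "quiet", "cat"]

# Collect every run of three or more vowels from every word, then count them.
vowel_runs = Counter(vs for w in sample_vocab
                     for vs in re.findall(r"[aeiou]{3,}", w))

print(vowel_runs.most_common(1))   # [('iou', 2)]
print(vowel_runs["eau"])           # 1
```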

And that's essentially as far as we got.