### Processing raw text
1. How can we write programs to access text from local files and from the web, in order to get hold of an unlimited range of language material?
2. How can we split documents up into individual words and punctuation symbols, so we can carry out the same kinds of analysis we did with text corpora in earlier chapters?
3. How can we write programs to produce formatted output and save it in a file?

Topics covered: tokenization, stemming, string methods, regular expressions, and more.

In [1]:
import nltk, re, pprint
from nltk import word_tokenize

### Books

In [4]:
from urllib import request
url = 'http://www.gutenberg.org/files/2554/2554-0.txt'
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)

str

In [5]:
len(raw)

1176965

In [7]:
raw[:80]

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n\r\nTh'

For language processing, we want to break the string into words and punctuation. This step is called tokenization, and it produces a familiar structure: a list of words and punctuation symbols.

Tokenization also discards the whitespace between tokens: spaces, line breaks, blank lines, and carriage returns.
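
To make the idea concrete on a small string, here is a minimal sketch of tokenization using a regular expression. It is only a stand-in for illustration: `simple_tokenize` is a hypothetical helper, not what `word_tokenize` does internally, and it will treat contractions and abbreviations differently.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: a run of word characters is one token, and each
    # punctuation character is its own token. Whitespace (spaces, carriage
    # returns, newlines) matches neither alternative, so it is dropped.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Crime and Punishment, by Fyodor Dostoevsky\r\n\r\nThis eBook"
tokens_sketch = simple_tokenize(sample)
print(tokens_sketch)
```

The result is a flat list of strings with no whitespace entries, which is the same shape of data that the cells below work with.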

In [8]:
tokens = word_tokenize(raw)
type(tokens)

list

In [9]:
len(tokens)

257726

In [14]:
tokens[:15]

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by',
 'Fyodor',
 'Dostoevsky',
 'This',
 'eBook',
 'is']
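
The first token above still carries `\ufeff`, the byte order mark (BOM) at the start of the file. One way to drop it, sketched here on a small hand-made byte string rather than the downloaded book, is Python's `utf-8-sig` codec, which decodes UTF-8 and strips a leading BOM if one is present.

```python
# A small stand-in for the downloaded bytes: UTF-8 BOM followed by text.
data = b'\xef\xbb\xbfThe Project Gutenberg EBook'

with_bom = data.decode('utf8')           # BOM survives as '\ufeff'
without_bom = data.decode('utf-8-sig')   # leading BOM is stripped

print(repr(with_bom[:4]))
print(repr(without_bom[:3]))
```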

Creating an NLTK Text from these tokens gives access to the text-processing methods used in earlier chapters, such as collocations().

In [15]:
text = nltk.Text(tokens)
type(text)

nltk.text.Text

In [18]:
text[1024:1028]

['an', 'exceptionally', 'hot', 'evening']

In [19]:
text.collocations()

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; young man; Nikodim Fomitch; Ilya Petrovitch; Project
Gutenberg; Andrey Semyonovitch; Hay Market; Dmitri Prokofitch; Good
heavens
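
Collocations are word pairs that occur together unusually often. As a rough illustration of the raw ingredient, the sketch below simply counts adjacent word pairs with `collections.Counter`. This is a deliberate simplification: NLTK's `collocations()` applies its own filtering and scoring, which this does not reproduce. `top_bigrams` and the word list are made up for the example.

```python
from collections import Counter

def top_bigrams(words, n=3):
    """Hypothetical helper: return the n most frequent adjacent word pairs."""
    pairs = zip(words, words[1:])      # (w0, w1), (w1, w2), ...
    return Counter(pairs).most_common(n)

words = ['old', 'woman', 'saw', 'the', 'young', 'man', 'and',
         'the', 'old', 'woman', 'left']
print(top_bigrams(words, n=1))
```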


In [20]:
raw.find('PART I')

5336

In [28]:
raw.rfind("End of Project Gutenberg’s Crime and Punishment")

1157810

In [29]:
# Slice from the start of 'PART I' (offset 5336, found above) up to the footer
raw = raw[5336:1157810]
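
`find()` and `rfind()` return -1 when the marker is missing, and slicing with -1 silently gives the wrong span. A minimal sketch of a guarded version follows; `trim_between` is a hypothetical helper and the sample string is made up.

```python
def trim_between(text, start_marker, end_marker):
    """Hypothetical helper: keep text from start_marker up to end_marker."""
    start = text.find(start_marker)    # first occurrence of the start marker
    end = text.rfind(end_marker)       # last occurrence of the end marker
    if start == -1 or end == -1 or end < start:
        raise ValueError('marker not found or markers out of order')
    return text[start:end]

sample = 'License header. PART I On a hot evening. End of Project footer'
body = trim_between(sample, 'PART I', 'End of Project')
print(body)
```

Computing the offsets from the markers, rather than typing them in by hand, also keeps the slice correct if the source file changes.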

### HTML

In [32]:
url = 'http://news.bbc.co.uk/2/hi/health/2284783.stm'
html = request.urlopen(url).read().decode('utf8')

In [33]:
html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

In [34]:
from bs4 import BeautifulSoup as bs
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
raw = bs(html, 'lxml').get_text()





In [39]:
tokens = word_tokenize(raw)
tokens[5:12]

['Blondes', "'to", 'die', 'out', 'in', '200', "years'"]
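
`get_text()` discards the markup and keeps the text. To show roughly what that involves, here is a sketch using only the standard library's `html.parser`. `TextExtractor` is a hypothetical class for illustration; it is not how BeautifulSoup is implemented, it handles far fewer edge cases, and skipping script and style content is a choice made here.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text nodes, skipping script and style."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

parser = TextExtractor()
parser.feed('<html><body><h1>Blondes</h1><script>var x;</script>'
            '<p>to die out</p></body></html>')
parser.close()
text_only = ' '.join(parser.parts)
print(text_only)
```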

### RSS Feeds

In [14]:
import feedparser
from bs4 import BeautifulSoup as bs

In [3]:
llog = feedparser.parse('http://languagelog.ldc.upenn.edu/nll/?feed=atom')

In [4]:
llog['feed']['title']

'Language Log'

In [5]:
len(llog.entries)

13

In [7]:
post=llog.entries[2]
post.title

'Language and politics in an Inner Mongolian post office'

In [11]:
content = post.content[0].value

In [13]:
content[:70]

'<p>[This is a guest post by Bathrobe.]</p>\n<p>Recently I travelled in '

In [16]:
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
raw = bs(content, 'lxml').get_text()
word_tokenize(raw)[:20]





['[',
 'This',
 'is',
 'a',
 'guest',
 'post',
 'by',
 'Bathrobe',
 '.',
 ']',
 'Recently',
 'I',
 'travelled',
 'in',
 'Inner',
 'Mongolia',
 '(',
 'China',
 ')',
 'where']

### String Processing

In [21]:
couplet = "Shall I compare thee to a Summer's day?"\
            " Thou art more lovely and more temperate"
couplet

"Shall I compare thee to a Summer's day? Thou art more lovely and more temperate"

In [22]:
for char in couplet:
    print(char, end=' ')

S h a l l   I   c o m p a r e   t h e e   t o   a   S u m m e r ' s   d a y ?   T h o u   a r t   m o r e   l o v e l y   a n d   m o r e   t e m p e r a t e 

In [23]:
couplet.find('lovely')

54
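
A few other string methods work in the same spirit as `find()`. The sketch below uses a shorter made-up string so the results are easy to verify by hand.

```python
s = 'more lovely and more temperate'

print(s.find('more'))       # index of the first match: 0
print(s.rfind('more'))      # index of the last match: 16
print(s.find('summer'))     # -1 when the substring is absent
print(s[5:11])              # slicing by index: 'lovely'
print(s.split())            # split on whitespace into a list of words
print('-'.join(s.split()))  # join the words back with a separator
```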

### Working with Unicode
Text in a file is stored in a particular encoding, so we need a mechanism for translating it into Unicode. This translation into Unicode is called decoding. Conversely, to write Unicode to a file or a terminal, we first translate it into a suitable encoding. This translation out of Unicode is called encoding.

From a Unicode perspective, characters are abstract entities that can be realized as one or more glyphs. Only glyphs can appear on a screen or be printed on paper. A font is a mapping from characters to glyphs.
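
The sketch below shows why the encoding must be known when decoding: the same byte yields different characters under different encodings. It uses only the built-in `str.encode()` and `bytes.decode()` methods; the variable names are made up for the example.

```python
char = '\u0144'                        # LATIN SMALL LETTER N WITH ACUTE

latin2_bytes = char.encode('latin2')   # encoding: Unicode -> bytes
utf8_bytes = char.encode('utf8')
print(latin2_bytes, utf8_bytes)

# Decoding: bytes -> Unicode. The result depends on the encoding we assume.
print(latin2_bytes.decode('latin2'))   # the original character
print(latin2_bytes.decode('latin1'))   # a different character from the same byte
```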

In [24]:
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

In [27]:
# open() decodes file contents into Unicode strings when reading and encodes
# Unicode strings when writing. The encoding parameter names the file's encoding.
with open(path, encoding='latin2') as f:
    for line in f:
        line = line.strip()
        print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.


In [28]:
# Same loop, but show non-ASCII characters as escape sequences
with open(path, encoding='latin2') as f:
    for line in f:
        line = line.strip()
        print(line.encode('unicode_escape'))

b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'


In [29]:
# find the integer ordinal (code point) of a character
ord('ń')

324

In [31]:
hex(324)

'0x144'

In [30]:
nacute = '\u0144'
nacute

'ń'

In [32]:
nacute.encode('utf8')

b'\xc5\x84'
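
The two bytes above follow from the UTF-8 scheme for code points between U+0080 and U+07FF: the bits of the code point are split into a 5-bit and a 6-bit group and placed in the templates `110xxxxx` and `10xxxxxx`. A worked sketch, checked against Python's own encoder:

```python
code_point = 0x144                       # ord('ń') == 324

# Two-byte form, valid only for 0x80 <= code_point <= 0x7FF.
lead = 0b11000000 | (code_point >> 6)    # 110xxxxx: top 5 bits
cont = 0b10000000 | (code_point & 0x3F)  # 10xxxxxx: low 6 bits

manual = bytes([lead, cont])
print(manual)
print(manual == chr(code_point).encode('utf8'))
```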

In [33]:
import unicodedata
lines = open(path, encoding='latin2').readlines()
line = lines[2]

In [34]:
print(line.encode('unicode_escape'))

b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y\\n'


In [35]:
for c in line:
    if ord(c) > 127:
        print('{} U+{:04x} {}'.format(c.encode('utf8'), ord(c), unicodedata.name(c)))

b'\xc3\xb3' U+00f3 LATIN SMALL LETTER O WITH ACUTE
b'\xc5\x9b' U+015b LATIN SMALL LETTER S WITH ACUTE
b'\xc5\x9a' U+015a LATIN CAPITAL LETTER S WITH ACUTE
b'\xc4\x85' U+0105 LATIN SMALL LETTER A WITH OGONEK
b'\xc5\x82' U+0142 LATIN SMALL LETTER L WITH STROKE


In [36]:
line.find('zosta\u0142y')

54

In [37]:
line = line.lower()
line

'niemców pod koniec ii wojny światowej na dolny śląsk, zostały\n'

In [38]:
line.encode('unicode_escape')

b'niemc\\xf3w pod koniec ii wojny \\u015bwiatowej na dolny \\u015bl\\u0105sk, zosta\\u0142y\\n'

In [40]:
import re
# Raw string so that \w reaches the regex engine unchanged; re understands \uXXXX itself
m = re.search(r'\u015b\w*', line)

In [41]:
m.group()

'światowej'
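
The match works because, for `str` patterns in Python 3, `\w` matches Unicode word characters by default. The sketch below contrasts this with the `re.ASCII` flag on a fragment of the same line; the variable names are made up for the example.

```python
import re

text_line = 'na dolny \u015bl\u0105sk, zosta\u0142y'

unicode_match = re.search(r'\u015b\w*', text_line)
ascii_match = re.search(r'\u015b\w*', text_line, re.ASCII)

print(unicode_match.group())   # \w also matches the non-ASCII letter
print(ascii_match.group())     # \w stops at the first non-ASCII letter
```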

In [42]:
word_tokenize(line)

['niemców',
 'pod',
 'koniec',
 'ii',
 'wojny',
 'światowej',
 'na',
 'dolny',
 'śląsk',
 ',',
 'zostały']