In [2]:
from __future__ import division
import nltk, re, pprint
from bs4 import BeautifulSoup


## Accessing Text from the Web and from Disk

In [3]:
''' 
Electronic Book:

A small sample of texts from Project Gutenberg appears in the NLTK corpus collection.
However, you may be interested in analyzing other texts from Project Gutenberg. You
can browse the catalog of 25,000 free online books at http://www.gutenberg.org/catalog/,
and obtain a URL to an ASCII text file. Although 90% of the texts in Project
Gutenberg are in English, it includes material in over 50 other languages, including
Catalan, Chinese, Dutch, Finnish, French, German, Italian, Portuguese, and Spanish
(with more than 100 texts each).
'''

from urllib.request import urlopen
url = "https://www.gutenberg.org/files/2554/2554-0.txt"

raw = urlopen(url).read().decode('utf-8')

In [4]:
print(len(raw))

raw[51:100]

1135214


'\n\n\nCRIME AND PUNISHMENT\n\nBy Fyodor Dostoevsky\n\n\n\n'

In [5]:
''' 
For our language processing, we want to break up the string into words and
punctuation. This step is called tokenization, and it produces
our familiar structure, a list of words and punctuation.
'''

tokens = nltk.word_tokenize(raw)
type(tokens)

list

In [6]:
len(tokens)

253688

In [7]:
tokens[:20]

['*',
 '*',
 '*',
 'START',
 'OF',
 'THE',
 'PROJECT',
 'GUTENBERG',
 'EBOOK',
 '2554',
 '*',
 '*',
 '*',
 'CRIME',
 'AND',
 'PUNISHMENT',
 'By',
 'Fyodor',
 'Dostoevsky',
 'Translated']

In [8]:
raw.find("PART I")

4638

In [9]:
raw.find("End of Project")

-1
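
A result of `-1` means `find()` did not locate the marker in this edition of the file. Using `-1` directly as a slice end would quietly drop only the last character rather than trim a trailer. The following is a minimal sketch of a guarded trim; the helper name `trim_between` and the sample string are made up for illustration and are not part of the original notebook.

```python
def trim_between(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker.

    If the start marker is missing, keep the text from the beginning;
    if the end marker is missing, keep the text through the end.
    """
    start = text.find(start_marker)
    if start == -1:
        start = 0
    end = text.find(end_marker, start)
    if end == -1:
        end = len(text)
    return text[start:end]

# Made-up stand-in for a downloaded book, not real Gutenberg text.
sample = "header junk PART I Once upon a time. *** END OF BOOK footer"

body = trim_between(sample, "PART I", "*** END")
print(repr(body))

# A marker that does not occur leaves the tail of the text untouched.
untrimmed = trim_between(sample, "PART I", "End of Project")
print(repr(untrimmed))
```

Checking for `-1` explicitly keeps a changed header or footer format from silently corrupting the slice.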

### Dealing with HTML

In [10]:
''' 
You can use a web browser to save a page as text to a local file, then access this as described in the later
section on files. However, if you’re going to do this often, it’s easiest to get Python to
do the work directly. The first step is the same as before, using urlopen.
'''

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read().decode('utf-8')


In [11]:
html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

In [12]:
print(html)

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>BBC NEWS | Health | Blondes 'to die out in 200 years'</title>
<meta name="keywords" content="BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service">
<meta name="OriginalPublicationDate" content="2002/09/27 11:51:55">
<meta name="UKFS_URL" content="/1/hi/health/2284783.stm">
<meta name="IFS_URL" content="/2/hi/health/2284783.stm">
<meta name="HTTP-EQUIV" content="text/html;charset=iso-8859-1">
<meta name="Headline" content="Blondes 'to die out in 200 years'">
<meta name="Section" content="Health">
<meta name="Description" content="Natural blondes are an endangered species and will die out by 2202, a study suggests.">
<!-- GENMaps-->
<map name="banner">
<area alt="BBC NEWS" coords="7,9,167,32" href="http://news.bbc.co.uk/1/hi.html" shape="RECT">
</map>

<script src="/nol/shared/js/livestats_v1_1.js" language="JavaScript" t

In [13]:
soup = BeautifulSoup(html, "html.parser")
raw = soup.get_text()
tokens = nltk.word_tokenize(raw)

print(tokens[:50]) 

['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'to", 'die', 'out', 'in', '200', 'years', "'", 'NEWS', 'SPORT', 'WEATHER', 'WORLD', 'SERVICE', 'A-Z', 'INDEX', 'SEARCH', 'You', 'are', 'in', ':', 'Health', 'News', 'Front', 'Page', 'Africa', 'Americas', 'Asia-Pacific', 'Europe', 'Middle', 'East', 'South', 'Asia', 'UK', 'Business', 'Entertainment', 'Science/Nature', 'Technology', 'Health', 'Medical', 'notes', '--', '--', '--', '--', '--']


## Processing RSS Feeds

In [14]:
''' 
The blogosphere is an important source of text, in both formal and informal registers.
With the help of a third-party Python library called the Universal Feed Parser, freely
downloadable from http://feedparser.org/, we can access the content of a blog.
'''

import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']['title']

'Language Log'

In [15]:
post = llog.entries[2]
post.title

'Phonemic analysis of animal sounds as spelled in various popular languages'

In [16]:
content = post.content[0].value
content[:60]

"<p>This is something I've been waiting for for decades:</p>\n"

In [17]:
soup = BeautifulSoup(content, "html.parser")

raw = soup.get_text()
tokens = nltk.word_tokenize(raw)

print(tokens[:50]) 

['This', 'is', 'something', 'I', "'ve", 'been', 'waiting', 'for', 'for', 'decades', ':', "''", 'Onomatopoeia', 'Odyssey', ':', 'How', 'do', 'animals', 'sound', 'across', 'languages', '?', '``', ',', 'by', 'Vivian', 'Li', ',', 'The', 'Pudding', '(', 'March', ',', '2025', ')', 'For', 'many', ',', 'our', 'first', 'memories', 'of', 'learning', 'animal', 'sounds', 'include', 'the', 'song', '“', 'Old']


## Reading Local Files

In [18]:
''' 
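In order to read a local file, we use Python’s built-in open() function. We can then
call the read() method to get the whole contents as one string, or iterate over the
file object to process it line by line, as this cell does.
'''

# The with statement closes the file automatically when the block ends.
with open('document.txt') as f:
    for line in f:
        print(line.strip())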

The quick brown fox jumps over the lazy dog.

Life is like riding a bicycle. To keep your balance, you must keep moving.

In the middle of difficulty lies opportunity.
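
For comparison, here is a small sketch of the two ways of reading a file: `read()` for the whole contents, and iteration for one line at a time. It writes its own throwaway file (the name `sample.txt` and its contents are invented for the example) so that it does not depend on `document.txt`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")

    # Create a small file to read back.
    with open(path, "w", encoding="utf-8") as out:
        out.write("first line\nsecond line\n")

    # read() returns the entire file as a single string.
    with open(path, encoding="utf-8") as f:
        whole = f.read()

    # Iterating over the file yields one line at a time, newline included.
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]

print(repr(whole))
print(lines)
```

Passing `encoding` explicitly avoids relying on the platform default, which varies between systems.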


## Capturing User Input

In [19]:
''' 
Sometimes we want to capture the text that a user inputs when she is interacting with
our program. To prompt the user to type a line of input, call the Python function
input(). After saving the input to a variable, we can manipulate it just as we have
done for other strings.
'''

s = input("Enter some text: ")

print("You entered", len(nltk.word_tokenize(s)), "words.")

You entered 3 words.


## The NLP Pipeline

![Screenshot](images/1.png)

## Exploring Unicode

Unicode supports over a million characters. Each character is assigned a number, called a **code point**. In Python, code points are written in the form `\uXXXX`, where `XXXX` is the number in four-digit hexadecimal form; code points above U+FFFF need the eight-digit form `\UXXXXXXXX`.

Within a program, we can manipulate Unicode strings just like normal strings. However, when Unicode characters are stored in files or displayed on a terminal, they must be **encoded** as a stream of bytes. 
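
As a quick illustration of code points, the sketch below uses the Polish letter ń (U+0144), which also appears in the Polish sample later in this notebook. The variable names are chosen only for this example.

```python
# '\u0144' is the escape for the code point U+0144.
n_acute = '\u0144'
print(n_acute)
print(ord(n_acute))            # 324, the code point as an integer
print(hex(ord(n_acute)))       # 0x144
print(chr(0x144) == n_acute)   # True: chr() is the inverse of ord()

# Unicode strings support the usual string operations.
word = 'Pa\u0144stwowa'
print(len(word))               # 9 characters, regardless of byte length
print(word.upper())
```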

### Encoding and Decoding
Some encodings (such as **ASCII** and **Latin-2**) use a single byte per code point, so they can support only a small subset of Unicode, enough for a single language. Other encodings (such as **UTF-8**) use multiple bytes and can represent the full range of Unicode characters.

Text in files will be in a particular encoding, so we need some mechanism for translating it into Unicode:
- **Decoding**: Translating from a file encoding into Unicode.
- **Encoding**: Translating Unicode into a suitable encoding to write to a file or display on a terminal.
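
The two directions can be sketched with `str.encode()` and `bytes.decode()`. The word used here is just a convenient example; the point is that the same string yields different bytes under different encodings, and that decoding with the wrong encoding produces the wrong characters rather than an error.

```python
text = 'Pa\u0144stwowa'

# Encoding: Unicode string -> bytes.
latin2_bytes = text.encode('latin2')
utf8_bytes = text.encode('utf-8')
print(latin2_bytes)                        # one byte for the letter n-acute
print(utf8_bytes)                          # two bytes for the same letter
print(len(latin2_bytes), len(utf8_bytes))  # 9 10

# Decoding: bytes -> Unicode string, using the encoding that produced them.
print(latin2_bytes.decode('latin2') == text)   # True

# Decoding with a different single-byte encoding succeeds but gives
# a different character (n with tilde instead of n with acute).
print(latin2_bytes.decode('latin1'))

# ASCII has no representation for this character, so encoding fails.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError as err:
    ascii_ok = False
    print('ASCII failed:', err.reason)
```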

### Characters vs. Glyphs
From a Unicode perspective, **characters** are abstract entities that can be realized as one or more **glyphs**. Only **glyphs** can appear on a screen or be printed on paper. A **font** is a mapping from characters to glyphs.


### Extracting Encoded Text from Files

In [20]:
'''
Let’s assume that we have a small text file, and that we know how it is encoded. For
example, polish-lat2.txt, as the name suggests, is a snippet of Polish text (from the Polish
Wikipedia; see http://pl.wikipedia.org/wiki/Biblioteka_Pruska). This file is encoded as
Latin-2, also known as ISO-8859-2. The function nltk.data.find() locates the file for
us.
''' 

path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

In [21]:
''' 
The Python codecs module provides functions to read encoded data into Unicode
strings, and to write out Unicode strings in encoded form. The codecs.open() function
takes an encoding parameter to specify the encoding of the file being read or written.
So let’s import the codecs module, and call it with the encoding 'latin2' to open our
Polish file as Unicode
'''

import codecs
f = codecs.open(path, encoding='latin2')

In [22]:
for line in f:
    line = line.strip()
    print(line.encode('unicode_escape'))

b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'


In [23]:
'''
The module unicodedata lets us inspect the properties of Unicode characters. In the
following example, we select all characters in the third line of our Polish text outside
the ASCII range and print their UTF-8 escaped value, followed by their code point
integer using the standard Unicode convention (i.e., prefixing the hex digits with U+),
followed by their Unicode name.
'''

import unicodedata
lines = codecs.open(path, encoding='latin2').readlines()

line = lines[2]

print(line.encode('unicode_escape'))
for c in line:
    if ord(c) > 127:
        print('%r U+%04x %s' % (c.encode('utf8'), ord(c), unicodedata.name(c)))

b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y\\n'
b'\xc3\xb3' U+00f3 LATIN SMALL LETTER O WITH ACUTE
b'\xc5\x9b' U+015b LATIN SMALL LETTER S WITH ACUTE
b'\xc5\x9a' U+015a LATIN CAPITAL LETTER S WITH ACUTE
b'\xc4\x85' U+0105 LATIN SMALL LETTER A WITH OGONEK
b'\xc5\x82' U+0142 LATIN SMALL LETTER L WITH STROKE


## Regular Expressions for Detecting Word Patterns

In [24]:
import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

In [25]:
[w for w in wordlist if re.search('ed$', w)]

['abaissed',
 'abandoned',
 'abased',
 'abashed',
 'abatised',
 'abed',
 'aborted',
 'abridged',
 'abscessed',
 'absconded',
 'absorbed',
 'abstracted',
 'abstricted',
 'accelerated',
 'accepted',
 'accidented',
 'accoladed',
 'accolated',
 'accomplished',
 'accosted',
 'accredited',
 'accursed',
 'accused',
 'accustomed',
 'acetated',
 'acheweed',
 'aciculated',
 'aciliated',
 'acknowledged',
 'acorned',
 'acquainted',
 'acquired',
 'acquisited',
 'acred',
 'aculeated',
 'addebted',
 'added',
 'addicted',
 'addlebrained',
 'addleheaded',
 'addlepated',
 'addorsed',
 'adempted',
 'adfected',
 'adjoined',
 'admired',
 'admitted',
 'adnexed',
 'adopted',
 'adossed',
 'adreamed',
 'adscripted',
 'aduncated',
 'advanced',
 'advised',
 'aeried',
 'aethered',
 'afeared',
 'affected',
 'affectioned',
 'affined',
 'afflicted',
 'affricated',
 'affrighted',
 'affronted',
 'aforenamed',
 'afterfeed',
 'aftershafted',
 'afterthoughted',
 'afterwitted',
 'agazed',
 'aged',
 'agglomerated',
 'aggri

In [26]:
[w for w in wordlist if re.search('^..j..t..$', w)]

['abjectly',
 'adjuster',
 'dejected',
 'dejectly',
 'injector',
 'majestic',
 'objectee',
 'objector',
 'rejecter',
 'rejector',
 'unjilted',
 'unjolted',
 'unjustly']

In [27]:
[w for w in wordlist if re.search('..j..t..', w)]


['abjectedness',
 'abjection',
 'abjective',
 'abjectly',
 'abjectness',
 'adjection',
 'adjectional',
 'adjectival',
 'adjectivally',
 'adjective',
 'adjectively',
 'adjectivism',
 'adjectivitis',
 'adjustable',
 'adjustably',
 'adjustage',
 'adjustation',
 'adjuster',
 'adjustive',
 'adjustment',
 'antejentacular',
 'antiprojectivity',
 'bijouterie',
 'coadjustment',
 'cojusticiar',
 'conjective',
 'conjecturable',
 'conjecturably',
 'conjectural',
 'conjecturalist',
 'conjecturality',
 'conjecturally',
 'conjecture',
 'conjecturer',
 'coprojector',
 'counterobjection',
 'dejected',
 'dejectedly',
 'dejectedness',
 'dejectile',
 'dejection',
 'dejectly',
 'dejectory',
 'dejecture',
 'disjection',
 'guanajuatite',
 'inadjustability',
 'inadjustable',
 'injectable',
 'injection',
 'injector',
 'injustice',
 'insubjection',
 'interjection',
 'interjectional',
 'interjectionalize',
 'interjectionally',
 'interjectionary',
 'interjectionize',
 'interjectiveness',
 'interjector',
 'interje
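
The two queries above differ only in the `^` and `$` anchors. With the anchors, the pattern must account for the whole word, so only eight-letter words match; without them, `re.search` accepts the pattern anywhere inside a longer word. A small sketch on a hand-picked list (chosen for this example, not drawn from the Words Corpus):

```python
import re

words = ['abjectly', 'adjective', 'injection', 'majestic', 'object']

# Anchored: j must be the third letter and t the sixth of an 8-letter word.
anchored = [w for w in words if re.search(r'^..j..t..$', w)]

# Unanchored: the same 8-character shape may occur anywhere in the word.
unanchored = [w for w in words if re.search(r'..j..t..', w)]

print(anchored)     # ['abjectly', 'majestic']
print(unanchored)   # ['abjectly', 'adjective', 'injection', 'majestic']
```

Note that `object` matches neither pattern, because nothing follows its `t`.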

## Ranges and Closures

In [29]:
''' 
The T9 system is used for entering text on mobile phones (see Figure 3-5). Two or more
words that are entered with the same sequence of keystrokes are known as
textonyms. For example, both hole and golf are entered by pressing the sequence 4653.
'''

[w for w in wordlist if re.search('^[ghi][mno][jkl][def]$', w)]

['gold', 'golf', 'hold', 'hole']
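
The character classes in that pattern come straight from the keypad: each digit stands for the set of letters printed on its key. As a sketch, the pattern can be generated from a key sequence. The `T9_KEYS` table and the `t9_pattern` helper are names invented for this example, and the candidate list is a small hand-made one rather than the full word list.

```python
import re

# Letter groups on a standard phone keypad (digits 2-9).
T9_KEYS = {
    '2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
    '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz',
}

def t9_pattern(digits):
    """Build an anchored regex matching every word typed with these keys."""
    return '^' + ''.join('[%s]' % T9_KEYS[d] for d in digits) + '$'

pattern = t9_pattern('4653')
print(pattern)   # ^[ghi][mno][jkl][def]$

candidates = ['gold', 'golf', 'hold', 'hole', 'good', 'help']
textonyms = [w for w in candidates if re.search(pattern, w)]
print(textonyms)   # ['gold', 'golf', 'hold', 'hole']
```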

In [30]:
[w for w in wordlist if re.search('^[ghijklmno]+$', w)]


['g',
 'ghoom',
 'gig',
 'giggling',
 'gigolo',
 'gilim',
 'gill',
 'gilling',
 'gilo',
 'gim',
 'gin',
 'ging',
 'gingili',
 'gink',
 'ginkgo',
 'ginning',
 'gio',
 'glink',
 'glom',
 'glonoin',
 'gloom',
 'glooming',
 'gnomon',
 'go',
 'gog',
 'gogo',
 'goi',
 'going',
 'gol',
 'goli',
 'gon',
 'gong',
 'gonion',
 'goo',
 'googol',
 'gook',
 'gool',
 'goon',
 'h',
 'hi',
 'high',
 'hill',
 'him',
 'hin',
 'hing',
 'hinoki',
 'ho',
 'hog',
 'hoggin',
 'hogling',
 'hoi',
 'hoin',
 'holing',
 'holl',
 'hollin',
 'hollo',
 'hollong',
 'holm',
 'homo',
 'homologon',
 'hong',
 'honk',
 'hook',
 'hoon',
 'i',
 'igloo',
 'ihi',
 'ilk',
 'ill',
 'imi',
 'imino',
 'immi',
 'in',
 'ing',
 'ingoing',
 'inion',
 'ink',
 'inkling',
 'inlook',
 'inn',
 'inning',
 'io',
 'ion',
 'j',
 'jhool',
 'jig',
 'jing',
 'jingling',
 'jingo',
 'jinjili',
 'jink',
 'jinn',
 'jinni',
 'jo',
 'jog',
 'johnin',
 'join',
 'joining',
 'joll',
 'joom',
 'k',
 'kiki',
 'kil',
 'kilhig',
 'kilim',
 'kill',
 'killing',

In [31]:
''' 
The ^ operator has another function when it appears as the first character inside square
brackets. For example, «[^aeiouAEIOU]» matches any character other than a vowel. We
can search the NPS Chat Corpus for words that are made up entirely of non-vowel
characters using «^[^aeiouAEIOU]+$» to find items like these: :):):), grrr, cyb3r, and
zzzzzzzz. Notice this includes non-alphabetic characters.

The cell below uses a different pattern: it searches the Penn Treebank sample for
decimal numbers, with a backslash before the period so that it matches a literal ".".
'''

wsj = sorted(set(nltk.corpus.treebank.words()))
[w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)]

['0.0085',
 '0.05',
 '0.1',
 '0.16',
 '0.2',
 '0.25',
 '0.28',
 '0.3',
 '0.4',
 '0.5',
 '0.50',
 '0.54',
 '0.56',
 '0.60',
 '0.7',
 '0.82',
 '0.84',
 '0.9',
 '0.95',
 '0.99',
 '1.01',
 '1.1',
 '1.125',
 '1.14',
 '1.1650',
 '1.17',
 '1.18',
 '1.19',
 '1.2',
 '1.20',
 '1.24',
 '1.25',
 '1.26',
 '1.28',
 '1.35',
 '1.39',
 '1.4',
 '1.457',
 '1.46',
 '1.49',
 '1.5',
 '1.50',
 '1.55',
 '1.56',
 '1.5755',
 '1.5805',
 '1.6',
 '1.61',
 '1.637',
 '1.64',
 '1.65',
 '1.7',
 '1.75',
 '1.76',
 '1.8',
 '1.82',
 '1.8415',
 '1.85',
 '1.8500',
 '1.9',
 '1.916',
 '1.92',
 '10.19',
 '10.2',
 '10.5',
 '107.03',
 '107.9',
 '109.73',
 '11.10',
 '11.5',
 '11.57',
 '11.6',
 '11.72',
 '11.95',
 '112.9',
 '113.2',
 '116.3',
 '116.4',
 '116.7',
 '116.9',
 '118.6',
 '12.09',
 '12.5',
 '12.52',
 '12.68',
 '12.7',
 '12.82',
 '12.97',
 '120.7',
 '1206.26',
 '121.6',
 '126.1',
 '126.15',
 '127.03',
 '129.91',
 '13.1',
 '13.15',
 '13.5',
 '13.50',
 '13.625',
 '13.65',
 '13.73',
 '13.8',
 '13.90',
 '130.6',
 '130.7',
 '
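
The negated character class described in the docstring of this cell is not exercised by the cell's code, so here is a small sketch of it. The token list is made up for the example (it reuses the items quoted in the docstring plus a few others) and stands in for the NPS Chat Corpus.

```python
import re

tokens = [':):):)', 'grrr', 'cyb3r', 'zzzzzzzz', 'hello', 'Ohh', 'hmm']

# Inside square brackets a leading ^ negates the class, so this pattern
# matches tokens made up entirely of characters that are not vowels.
no_vowels = [t for t in tokens if re.search(r'^[^aeiouAEIOU]+$', t)]
print(no_vowels)   # [':):):)', 'grrr', 'cyb3r', 'zzzzzzzz', 'hmm']
```

Digits and punctuation count as non-vowels here, which is why the emoticon and `cyb3r` are included.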

In [32]:
[w for w in wsj if re.search('(ed|ing)$', w)]

['62%-owned',
 'Absorbed',
 'According',
 'Adopting',
 'Advanced',
 'Advancing',
 'Alfred',
 'Allied',
 'Annualized',
 'Anything',
 'Arbitrage-related',
 'Arbitraging',
 'Asked',
 'Assuming',
 'Atlanta-based',
 'Baking',
 'Banking',
 'Beginning',
 'Beijing',
 'Being',
 'Bermuda-based',
 'Betting',
 'Boeing',
 'Broadcasting',
 'Bucking',
 'Buying',
 'Calif.-based',
 'Change-ringing',
 'Citing',
 'Concerned',
 'Confronted',
 'Conn.based',
 'Consolidated',
 'Continued',
 'Continuing',
 'Declining',
 'Defending',
 'Depending',
 'Designated',
 'Determining',
 'Developed',
 'Died',
 'During',
 'Encouraged',
 'Encouraging',
 'English-speaking',
 'Estimated',
 'Everything',
 'Excluding',
 'Exxon-owned',
 'Faulding',
 'Fed',
 'Feeding',
 'Filling',
 'Filmed',
 'Financing',
 'Following',
 'Founded',
 'Fracturing',
 'Francisco-based',
 'Fred',
 'Funded',
 'Funding',
 'Generalized',
 'Germany-based',
 'Getting',
 'Guaranteed',
 'Having',
 'Heating',
 'Heightened',
 'Holding',
 'Housing',
 'Illumin

## Basic Regular Expression Metacharacters, Including Wildcards, Ranges, and Closures

![Table](images/2.png)

In [33]:
[int(n) for n in re.findall(r'\d+', '2009-12-31')]

[2009, 12, 31]

## Word Segmentation

In [34]:
''' 
For some writing systems, tokenizing text is made more difficult by the fact that there
is no visual representation of word boundaries. For example, in Chinese, the three-character
string: 爱国人 (ai4 “love” [verb], guo3 “country”, ren2 “person”) could be
tokenized as 爱国 / 人, “country-loving person,” or as 爱 / 国人, “love country-person.”
'''

''' 
Our first challenge is simply to represent the problem: we need to find a way to separate
text content from the segmentation. We can do this by annotating each character with
a boolean value to indicate whether or not a word-break appears after the character (an
idea that will be used heavily for “chunking” in Chapter 7). Let’s assume that the learner
is given the utterance breaks, since these often correspond to extended pauses. Here is
a possible representation, including the initial and target segmentations:
'''

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"



In [36]:
def segment(text, segs):
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(text[last:i+1])
            last = i+1
    words.append(text[last:])
    return words 

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"

print(segment(text, seg1))
segment(text, seg2)


['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']


['do',
 'you',
 'see',
 'the',
 'kitty',
 'see',
 'the',
 'doggy',
 'do',
 'you',
 'like',
 'the',
 'kitty',
 'like',
 'the',
 'doggy']
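
To make the representation concrete, the sketch below goes in the opposite direction: given a list of words, it builds the boolean string that `segment()` would need to reproduce them. The function name `words_to_segs` is an invention for this example; `segment()` is repeated from the cell above so the block runs on its own.

```python
def segment(text, segs):
    """Split text after every position i where segs[i] == '1'."""
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(text[last:i+1])
            last = i+1
    words.append(text[last:])
    return words

def words_to_segs(words):
    """Encode word boundaries as a string one character shorter than the text."""
    # Each word contributes zeros for its inner characters and a final '1'.
    bits = ''.join('0' * (len(w) - 1) + '1' for w in words)
    # The break after the last word is implicit, so drop the final '1'.
    return bits[:-1]

words = ['do', 'you', 'see', 'the', 'kitty']
segs = words_to_segs(words)
print(segs)   # 010010010010000

# Round trip: segmenting the joined text recovers the original words.
print(segment(''.join(words), segs) == words)   # True
```

The segmentation string is always one character shorter than the text, because no boundary marker is needed after the final character.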