# 3. Processing Raw Text

## 3.1   Accessing Text from the Web and from Disk

## Electronic Books

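### The Gutenberg collection bundled with NLTK is only a small sample of Project Gutenberg. To analyze other texts, browse the catalog at http://www.gutenberg.org/catalog/ and download a text with request.urlopen() from urllib.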


In [3]:
from urllib import request
url = "http://www.gutenberg.org/files/12345/12345.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)

str

In [9]:
len(raw)

281779

In [10]:
raw[:75]

'The Project Gutenberg EBook of Friday, the Thirteenth, by Thomas W. Lawson\r'

### raw holds the raw content of the text as a single string, including whitespace characters

### word_tokenize() splits the text into word tokens; punctuation marks become tokens too

In [4]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(raw)
type(tokens)

list

In [5]:
len(tokens)

58388

In [16]:
tokens[:10]

['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Friday',
 ',',
 'the',
 'Thirteenth',
 ',']

### Convert tokens to an nltk.text.Text object, then use collocations() to list pairs of adjacent words that occur together unusually often

In [29]:
import nltk
text = nltk.Text(tokens)
type(text)

nltk.text.Text

In [19]:
text[1024:1062]

['of',
 'which',
 'I',
 'of',
 'all',
 'men',
 'best',
 'knew',
 'the',
 'meaning',
 '.',
 'The',
 'big',
 'brown',
 'eyes',
 'were',
 'set',
 'on',
 'space',
 ';',
 'the',
 'outer',
 'corners',
 'of',
 'the',
 'handsome',
 'mouth',
 'were',
 'drawn',
 'hard',
 'and',
 'tense',
 'as',
 'though',
 'weighted',
 '.',
 'As',
 'I']

In [20]:
text.collocations()

Barry Conant; Project Gutenberg-tm; Beulah Sands; Wall Street; Stock
Exchange; Project Gutenberg; New York; Bob Brownley; Miss Sands;
Literary Archive; United States; Mr. Brownley; Gutenberg-tm
electronic; electronic works; Archive Foundation; Gutenberg Literary;
Dear Sir; 'the Street; Mr. Randolph; 'Standard Oil


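### find() returns the index where a substring first occurs, and rfind() returns the index where its last occurrence starts; the two indexes can be used to slice out a passage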

In [47]:
raw.find("The big brown eyes")

4854

In [50]:
raw.rfind("though weighted")

4962

In [58]:
tmp = raw[4854:4962]
tmp

'The big brown eyes were set on space; the outer corners of the\r\nhandsome mouth were drawn hard and tense as '
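Because rfind() returns the index where the match starts, a slice that ends at that index stops just before the end marker. A minimal sketch on a short made-up string (the sample text and variable names are assumptions for illustration) shows how adding the marker's length extends the slice to include it:

```python
# Illustrative sample; any string works the same way.
sample = "intro. The big brown eyes were drawn as though weighted. outro."
start_marker = "The big brown eyes"
end_marker = "though weighted"

start = sample.find(start_marker)   # index where the first match starts
end = sample.rfind(end_marker)      # index where the last match starts

without_marker = sample[start:end]                  # stops before the end marker
with_marker = sample[start:end + len(end_marker)]   # includes the end marker

print(without_marker)
print(with_marker)
```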

## Dealing with HTML
### Most text on the web is in HTML form; BeautifulSoup can be used to parse it and extract the text

In [12]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

In [15]:
from bs4 import BeautifulSoup
raw = BeautifulSoup(html,"lxml").get_text()
tokens = word_tokenize(raw)
tokens[:60]

['BBC',
 'NEWS',
 '|',
 'Health',
 '|',
 'Blondes',
 "'to",
 'die',
 'out',
 'in',
 '200',
 "years'",
 'NEWS',
 'SPORT',
 'WEATHER',
 'WORLD',
 'SERVICE',
 'A-Z',
 'INDEX',
 'SEARCH',
 'You',
 'are',
 'in',
 ':',
 'Health',
 'News',
 'Front',
 'Page',
 'Africa',
 'Americas',
 'Asia-Pacific',
 'Europe',
 'Middle',
 'East',
 'South',
 'Asia',
 'UK',
 'Business',
 'Entertainment',
 'Science/Nature',
 'Technology',
 'Health',
 'Medical',
 'notes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Talking',
 'Point',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-']

### Use concordance() to search for a word and show each occurrence in context

In [64]:
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin


## Processing RSS Feeds
### feedparser handles articles published as RSS feeds. RSS is an XML-based format that lets site owners deliver their content to subscribers.

In [61]:
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']['title']

'Language Log'

### entries holds the posts in the feed; its length is the number of posts

In [68]:
len(llog.entries)

13

In [70]:
post = llog.entries[2]
post.title

'The vocabulary of sharp implements in Xinjiang'

In [77]:
content = post.content[0].value
content[:70]

'<p>Public notification posted in villages of <a href="https://en.wikip'

### Parse the HTML with BeautifulSoup, then tokenize the extracted text

In [None]:
raw = BeautifulSoup(content, "lxml").get_text()
word_tokenize(raw)

['Public',
 'notification',
 'posted',
 'in',
 'villages',
 'of',
 'Makit',
 'County',
 '(',
 'Màigàití',
 'xiàn',
 '麦盖提县',
 ';',
 'Mәkit',
 'nah̡iyisi',
 '/',
 'Мәкит',
 'наһийиси',
 ...]

## Reading Local Files
### Use open() and read() to read a local file

In [5]:
f = open('document.txt')
raw = f.read()
raw

'In case the real world’s not scary enough, there are Halloween attractions out there designed to completely freak you out.\nOne called "This Is Real" will "literally kidnap you and stash you in a Brooklyn（New York）warehouse."\nHauntWorld.com’s scariest haunted houses includes Erebus in Pontiac, Michigan, where "things grab you, bite you, land on top of you, and then we will bury you alive."\nBut there are family friendly events like Mickey’s Not-So-Scary Halloween Party at Disney World and Dollywood’s Great Pumpkin Luminights in Tennessee.\nHalloween parades include New York City’s massive parade with gigantic puppets through Greenwich Village on Oct. 31, and New Orleans’ Krewe of Boo parade Oct. 21.\nKey West, Florida, says it’s going ahead with its annual Fantasy Fest despite the aftermath of Hurricane Irma. （AP）'

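### Use a for loop to read the file one line at a time
### strip() removes the trailing newline, along with any other leading or trailing whitespace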

In [9]:
f = open('document.txt')
for line in f:
    print(line.strip())

In case the real world’s not scary enough, there are Halloween attractions out there designed to completely freak you out.
One called "This Is Real" will "literally kidnap you and stash you in a Brooklyn（New York）warehouse."
HauntWorld.com’s scariest haunted houses includes Erebus in Pontiac, Michigan, where "things grab you, bite you, land on top of you, and then we will bury you alive."
But there are family friendly events like Mickey’s Not-So-Scary Halloween Party at Disney World and Dollywood’s Great Pumpkin Luminights in Tennessee.
Halloween parades include New York City’s massive parade with gigantic puppets through Greenwich Village on Oct. 31, and New Orleans’ Krewe of Boo parade Oct. 21.
Key West, Florida, says it’s going ahead with its annual Fantasy Fest despite the aftermath of Hurricane Irma. （AP）


### You can also use nltk.data.find() to locate a file in the NLTK corpus collection, then read it with open() and read()

In [16]:
import nltk
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
raw = open(path, 'r').read()
raw[:75]

'[Moby Dick by Herman Melville 1851]\n\n\nETYMOLOGY.\n\n(Supplied by a Late Consu'

## Capturing User Input
### When we need the user to type in some text, input() returns it as a string that we can store and process

In [17]:
s = input("Enter some text: ")

Enter some text: Hello World!


In [20]:
print("You typed", len(word_tokenize(s)), "words.")

You typed 3 words.


## The NLP Pipeline
![](https://i.imgur.com/oPrXN0v.jpg)
### The figure above shows how to get from raw text to a vocabulary. Along the way, raw is a str, while the result of word_tokenize() is a list

In [21]:
raw = open('document.txt').read()
type(raw)

str

In [22]:
tokens = word_tokenize(raw)
type(tokens)

list

In [23]:
words = [w.lower() for w in tokens]
type(words)

list

In [24]:
vocab = sorted(set(words))
type(vocab)

list
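The same pipeline can be sketched without NLTK. This is only an approximation: the regular expression below is a stand-in for word_tokenize() and, like the sample sentence, is an assumption of this example.

```python
import re

raw = "The cat sat. The dog sat, too!"          # str: the raw text
tokens = re.findall(r"\w+|[^\w\s]", raw)        # list: words and punctuation marks
words = [w.lower() for w in tokens]             # list: lowercased tokens
vocab = sorted(set(words))                      # list: sorted unique tokens

print(len(tokens), len(vocab))
print(vocab)
```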

### A str can be extended with + (concatenation); to add a single item to a list, use append()

In [32]:
query = 'hello'
query += ' world'
query

'hello world'

In [34]:
test = ['john', 'paul', 'george', 'ringo']
test.append('blog')
test

['john', 'paul', 'george', 'ringo', 'blog']

## 3.2   Strings: Text Processing at the Lowest Level
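### In earlier chapters we treated a text as a list of words, without looking closely at the characters inside words or at how the programming language handles them. In this section we explore strings in detail and show how strings, words, texts, and files are connected.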
## Basic Operations with Strings

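### Strings can be enclosed in single or double quotes. A single quote inside a single-quoted string must be escaped with \ ; inside a double-quoted string it needs no escaping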

In [35]:
monty = 'Monty Python'
monty

'Monty Python'

In [36]:
circus = "Monty Python's Flying Circus"
circus

"Monty Python's Flying Circus"

In [37]:
circus = 'Monty Python\'s Flying Circus'
circus

"Monty Python's Flying Circus"

### A long string can be continued onto the next line with a trailing \ or by wrapping the pieces in ()

In [39]:
couplet = "Shall I compare thee to a Summer's day?"\
    "Thou are more lovely and more temperate:"
couplet

"Shall I compare thee to a Summer's day?Thou are more lovely and more temperate:"

In [40]:
couplet = ("Rough winds do shake the darling buds of May,"
           "And Summer's lease hath all too short a date:")
couplet

"Rough winds do shake the darling buds of May,And Summer's lease hath all too short a date:"

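### The approaches above do not insert a newline between the pieces; a string in triple double quotes or triple single quotes keeps its line breaks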

In [41]:
couplet = """Shall I compare thee to a Summer's day?
    Thou are more lovely and more temperate:"""
print(couplet)

Shall I compare thee to a Summer's day?
    Thou are more lovely and more temperate:


In [43]:
couplet = '''Rough winds do shake the darling buds of May,
    And Summer's lease hath all too short a date:'''
print(couplet)

Rough winds do shake the darling buds of May,
    And Summer's lease hath all too short a date:


### Strings support addition (concatenation) and multiplication (repetition), but not subtraction or division

In [44]:
'very' + 'very' + 'very'

'veryveryvery'

In [45]:
'very' * 3

'veryveryvery'

In [46]:
a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
b = [' ' * 2 * (7 - i) + 'very' * i for i in a]
for line in b:
    print(line)

            very
          veryvery
        veryveryvery
      veryveryveryvery
    veryveryveryveryvery
  veryveryveryveryveryvery
veryveryveryveryveryveryvery
  veryveryveryveryveryvery
    veryveryveryveryvery
      veryveryveryvery
        veryveryvery
          veryvery
            very


## Printing Strings
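### When printing with print(), strings can be joined with + (no space is inserted) or passed as separate arguments with , (print() puts a space between them)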

In [47]:
print(monty)

Monty Python


In [48]:
grail = 'Holy Grail'
print(monty + grail)

Monty PythonHoly Grail


In [49]:
print(monty, grail)

Monty Python Holy Grail


In [50]:
print(monty, "and the", grail)

Monty Python and the Holy Grail


## Accessing Individual Characters
### A string is a sequence of characters, so individual characters can be accessed by index with [ ]

In [58]:
monty = 'Monty Python'
monty[0]

'M'

In [52]:
monty[3]

't'

### [-1] returns the last character of the string

In [53]:
monty[-1]

'n'

In [54]:
sent = 'colorless green ideas sleep furiously'
for char in sent:
    print(char, end=' ')

c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y 

### Count the letters in melville-moby_dick.txt and list them from most to least frequent
### isalpha() tests whether a character is a letter

In [55]:
from nltk.corpus import gutenberg
raw = gutenberg.raw('melville-moby_dick.txt')
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
fdist.most_common(5)

[('e', 117092), ('t', 87996), ('a', 77916), ('o', 69326), ('n', 65617)]

In [56]:
[char for (char, count) in fdist.most_common()]

['e',
 't',
 'a',
 'o',
 'n',
 'i',
 's',
 'h',
 'r',
 'l',
 'd',
 'u',
 'm',
 'c',
 'w',
 'f',
 'g',
 'p',
 'b',
 'y',
 'v',
 'k',
 'q',
 'j',
 'x',
 'z']

## Accessing Substrings
![](https://i.imgur.com/FvzJJ5P.jpg)
### As noted above, a string is a sequence, so it can be sliced in the same way as a list
### [6:10] starts at index 6 and runs up to, but does not include, index 10

In [59]:
monty = 'Monty Python'
monty[6:10]

'Pyth'

In [61]:
monty[-12:-7]

'Monty'

### Omitting the first index starts the slice at the beginning of the string; omitting the second index runs it to the end

In [62]:
monty[:5]

'Monty'

In [63]:
monty[6:]

'Python'

### Use the in operator to test whether a substring occurs in a string

In [65]:
phrase = 'And now for something completely different'
if 'thing' in phrase:
    print('found "thing"')

found "thing"


### find() also searches for a substring: it returns the index of the first occurrence, or -1 if the substring is not found

In [69]:
monty.find('Python')

6

In [70]:
monty.find('132')

-1
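Since find() returns 0 for a match at the very start of the string and -1 for no match, its result should not be used directly as a truth value. A short sketch of the pitfall (the search strings are illustrative):

```python
monty = 'Monty Python'

# find() returns 0 here, which is falsy even though the substring is present.
starts_at = monty.find('Monty')

# -1 is truthy even though the substring is absent.
missing = monty.find('Spam')

# Compare against -1, or use the in operator for a plain membership test.
has_monty = monty.find('Monty') != -1
has_spam = 'Spam' in monty

print(starts_at, missing, has_monty, has_spam)
```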

## More operations on strings
![](https://i.imgur.com/IAiaPwo.jpg)

### rfind() searches from the end and returns the index of the last occurrence

In [73]:
monty.find('n')

2

In [74]:
monty.rfind('n')

11

### index() works like find(), but raises a ValueError when the substring is not found

In [77]:
monty.index('n')

2

In [84]:
monty.rindex('n')

11

In [79]:
monty.index('1223')

ValueError: substring not found

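### s.join(iterable) concatenates the items of the iterable, inserting the string s between them; here each character of '1234' is an item and monty is the separator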

In [91]:
monty.join('1234')

'1Monty Python2Monty Python3Monty Python4'
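In practice join() is more often called on a short separator string to glue a list of words together. A minimal sketch (the word list is illustrative):

```python
words = ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']

spaced = ' '.join(words)       # separator is a single space
dashed = '-'.join(words[:2])   # separator is a hyphen

print(spaced)
print(dashed)
```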

### split() splits a string at the given separator; splitlines() splits at line boundaries such as \n and \r

In [93]:
monty.split(' ')

['Monty', 'Python']
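A small sketch of how split(' '), split() with no argument, and splitlines() differ, using an illustrative string that contains a double space and two kinds of line ending:

```python
text = 'Monty  Python\r\nHoly Grail\n'

by_space = text.split(' ')     # splits at every single space; keeps empty strings
by_whitespace = text.split()   # splits at runs of any whitespace
by_line = text.splitlines()    # splits at line boundaries and drops them

print(by_space)
print(by_whitespace)
print(by_line)
```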

### upper() converts a string to uppercase; lower() converts it to lowercase

In [106]:
monty.upper()

'MONTY PYTHON'

### title() capitalizes the first letter of each word and lowercases the rest

In [109]:
test = 'test hello world'
test.title()

'Test Hello World'

### strip() removes leading and trailing whitespace

In [110]:
test = '    123123   '
test.strip()

'123123'

### replace(old, new) returns a copy of the string with every occurrence of old replaced by new

In [111]:
test = '123123123'
test.replace('2','4')

'143143143'

## The Difference between Lists and Strings
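### Strings and lists are both sequences that can be indexed and sliced, but a string and a list cannot be concatenated with each other. Lists are mutable, so their items can be changed in place, whereas strings are immutable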

In [112]:
query = 'Who knows?'
beatles = ['John', 'Paul', 'George', 'Ringo']
beatles[0] = "John Lennon"
beatles

['John Lennon', 'Paul', 'George', 'Ringo']

In [113]:
query[0] = 'F'

TypeError: 'str' object does not support item assignment
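To "change" a string we build a new one instead. A minimal sketch of two common ways, slicing plus concatenation and replace() (the replacement values are illustrative):

```python
query = 'Who knows?'

# Strings are immutable, so item assignment raises TypeError.
try:
    query[0] = 'F'
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Build a new string instead of modifying the old one.
new_query = 'F' + query[1:]
replaced = query.replace('Who', 'Nobody')

print(assignment_failed, new_query, replaced, query)
```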

## 3.3   Text Processing with Unicode
## Extracting encoded text from files
### Pass the encoding argument to open() to specify the file's encoding

In [136]:
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.


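### Calling encode() on a string converts it to bytes in the given encoding; 'unicode_escape' shows non-ASCII characters as escape sequences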

In [128]:
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line.encode('unicode_escape'))

b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'


### ord() returns the Unicode code point of a character (for ASCII characters this equals the ASCII code)

In [124]:
ord('ń')

324

### 324 written as four hexadecimal digits is 0144, so the escape sequence \u0144 produces the same character

In [121]:
nacute = '\u0144'
nacute

'ń'

### Encode the character as UTF-8 bytes

In [129]:
nacute.encode('utf8')

b'\xc5\x84'
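These conversions can be checked end to end: ord() and chr() convert between a character and its code point, while encode() and decode() convert between a string and bytes. A minimal sketch (variable names other than nacute are assumptions):

```python
nacute = '\u0144'                        # LATIN SMALL LETTER N WITH ACUTE

code_point = ord(nacute)                 # integer code point
hex_form = '{:04x}'.format(code_point)   # four hexadecimal digits
utf8_bytes = nacute.encode('utf8')       # two bytes in UTF-8
latin2_bytes = nacute.encode('latin2')   # one byte in ISO-8859-2

# Decoding with the matching encoding gives back the original character.
from_utf8 = utf8_bytes.decode('utf8')
from_latin2 = latin2_bytes.decode('latin2')

print(code_point, hex_form, utf8_bytes, len(latin2_bytes))
print(from_utf8 == nacute, from_latin2 == nacute, chr(code_point) == nacute)
```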

### The unicodedata module lets us inspect the properties of Unicode characters
### unicodedata.name() returns the official Unicode name of a character

In [155]:
import unicodedata
lines = open(path, encoding='latin2').readlines()
line = lines[2]
print(line.encode('unicode_escape'))

b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y\\n'


In [148]:
for c in line:
    if ord(c) > 127:
        print('{} U+{:04x} {}'.format(c, ord(c), unicodedata.name(c)))

ó U+00f3 LATIN SMALL LETTER O WITH ACUTE
ś U+015b LATIN SMALL LETTER S WITH ACUTE
Ś U+015a LATIN CAPITAL LETTER S WITH ACUTE
ą U+0105 LATIN SMALL LETTER A WITH OGONEK
ł U+0142 LATIN SMALL LETTER L WITH STROKE


### Use a regular expression to match \u015b together with the word characters that follow it

In [153]:
print('\u015b')

ś


In [152]:
import re
m = re.search(r'\u015b\w*', line)
m.group()

'światowej'

### word_tokenize() also works on Unicode text when splitting it into words

In [157]:
word_tokenize(line)

['Niemców',
 'pod',
 'koniec',
 'II',
 'wojny',
 'światowej',
 'na',
 'Dolny',
 'Śląsk',
 ',',
 'zostały']

## 3.4   Regular Expressions for Detecting Word Patterns
## Using Basic Meta-Characters
### Find the words in wordlist that end with ed
### $ matches the end of the string

In [None]:
import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
[w for w in wordlist if re.search('ed$', w)]

['abaissed',
 'abandoned',
 'abased',
 'abashed',
 'abatised',
 'abed',
 'aborted',
 'abridged',
 'abscessed',
 'absconded',
 'absorbed',
 'abstracted',
 'abstricted',
 ...]

### Find eight-letter words whose third letter is j and whose sixth letter is t
### ^ matches the start of the string
### . matches any single character

In [164]:
[w for w in wordlist if re.search('^..j..t..$', w)]

['abjectly',
 'adjuster',
 'dejected',
 'dejectly',
 'injector',
 'majestic',
 'objectee',
 'objector',
 'rejecter',
 'rejector',
 'unjilted',
 'unjolted',
 'unjustly']

## Ranges and Closures
### Find four-letter words whose first letter is one of [ghi], second one of [mno], third one of [jlk], and last one of [def]

In [165]:
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]

['gold', 'golf', 'hold', 'hole']

### + means one or more occurrences of the preceding item

In [166]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$', w)]

['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 'miiiiiinnnnnnnnnneeeeeeee',
 'mine',
 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

In [167]:
[w for w in chat_words if re.search('^[ha]+$', w)]

['a',
 'aaaaaaaaaaaaaaaaa',
 'aaahhhh',
 'ah',
 'ahah',
 'ahahah',
 'ahh',
 'ahhahahaha',
 'ahhh',
 'ahhhh',
 'ahhhhhh',
 'ahhhhhhhhhhhhhh',
 'h',
 'ha',
 'haaa',
 'hah',
 'haha',
 'hahaaa',
 'hahah',
 'hahaha',
 'hahahaa',
 'hahahah',
 'hahahaha',
 'hahahahaaa',
 'hahahahahaha',
 'hahahahahahaha',
 'hahahahahahahahahahahahahahahaha',
 'hahahhahah',
 'hahhahahaha']

### To match special characters such as . $ ^ literally, escape them with a preceding \

In [None]:
wsj = sorted(set(nltk.corpus.treebank.words()))
[w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)]

['0.0085',
 '0.05',
 '0.1',
 '0.16',
 '0.2',
 '0.25',
 '0.28',
 '0.3',
 '0.4',
 '0.5',
 '0.50',
 '0.54',
 '0.56',
 '0.60',
 '0.7',
 ...]

In [170]:
[w for w in wsj if re.search(r'^[A-Z]+\$$', w)]

['C$', 'US$']

### {n} means the preceding item must occur exactly n times

In [None]:
[w for w in wsj if re.search('^[0-9]{4}$', w)]

['1614',
 '1637',
 '1787',
 '1901',
 '1903',
 '1917',
 '1925',
 '1929',
 '1933',
 '1934',
 '1948',
 '1953',
 '1955',
 '1956',
 '1961',
 '1965',
 '1966',
 '1967',
 '1968',
 '1969',
 '1970',
 '1971',
 '1972',
 '1973',
 ...]

### {m,n} means at least m and at most n occurrences of the preceding item

In [172]:
[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]

['10-day',
 '10-lap',
 '10-year',
 '100-share',
 '12-point',
 '12-year',
 '14-hour',
 '15-day',
 '150-point',
 '190-point',
 '20-point',
 '20-stock',
 '21-month',
 '237-seat',
 '240-page',
 '27-year',
 '30-day',
 '30-point',
 '30-share',
 '30-year',
 '300-day',
 '36-day',
 '36-store',
 '42-year',
 '50-state',
 '500-stock',
 '52-week',
 '69-point',
 '84-month',
 '87-store',
 '90-day']

In [173]:
[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]

['black-and-white',
 'bread-and-butter',
 'father-in-law',
 'machine-gun-toting',
 'savings-and-loan']

### Find words that end with ed or ing

In [None]:
[w for w in wsj if re.search('(ed|ing)$', w)]

['62%-owned',
 'Absorbed',
 'According',
 'Adopting',
 'Advanced',
 'Advancing',
 'Alfred',
 'Allied',
 'Annualized',
 'Anything',
 'Arbitrage-related',
 'Arbitraging',
 'Asked',
 'Assuming',
 'Atlanta-based',
 'Baking',
 ...]

### Basic regular expression syntax
![](https://i.imgur.com/ETp6z1p.jpg)

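### . matches any single character
### ^abc matches abc at the start of a string
### abc$ matches abc at the end of a string
### [abc] matches one character from the set a, b, c
### [A-Z0-9] matches one character from the ranges A-Z or 0-9
### ed|ing|s matches ed or ing or s
### * matches the preceding item zero or more times
### + matches the preceding item one or more times
### ? matches the preceding item zero or one time
### {n} matches the preceding item exactly n times
### {n,} matches the preceding item at least n times
### {,n} matches the preceding item at most n times
### {m,n} matches the preceding item at least m and at most n times
### a(b|c)+ matches a followed by one or more occurrences of b or c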
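A few of these operators can be checked on a small hand-made word list. The list and the helper name matching are assumptions made for this sketch:

```python
import re

words = ['a', 'ab', 'abb', 'abbb', 'abc', 'acb', 'b']

def matching(pattern):
    """Return the words that the pattern matches."""
    return [w for w in words if re.search(pattern, w)]

print(matching(r'^ab*$'))       # * : zero or more b
print(matching(r'^ab+$'))       # + : one or more b
print(matching(r'^ab?$'))       # ? : zero or one b
print(matching(r'^ab{2}$'))     # {n} : exactly two b
print(matching(r'^ab{2,}$'))    # {n,} : at least two b
print(matching(r'^ab{,2}$'))    # {,n} : at most two b
print(matching(r'^a(b|c)+$'))   # a followed by one or more b or c
```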

## 3.5   Useful Applications of Regular Expressions

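### The examples above use re.search(regexp, w) to test whether the word w matches the regular expression regexp. Besides checking whether a regular expression matches a word, we can use regular expressions to extract material from words or to modify words in specific ways.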

In [9]:
import nltk, re

### re.findall() returns all non-overlapping matches of the given regular expression.

In [20]:
word = 'supercalifragilisticexpialidocious'
re.findall(r'[aeiou]', word)

['u',
 'e',
 'a',
 'i',
 'a',
 'i',
 'i',
 'i',
 'e',
 'i',
 'a',
 'i',
 'o',
 'i',
 'o',
 'u']

In [7]:
len(re.findall(r'[aeiou]', word))

16

### Find every sequence of two or more vowels in the words of the Treebank corpus and count how often each occurs.

In [14]:
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj
                       for vs in re.findall(r'[aeiou]{2,}', word))
fd.most_common(12)

[('io', 549),
 ('ea', 476),
 ('ie', 331),
 ('ou', 329),
 ('ai', 261),
 ('ia', 253),
 ('ee', 217),
 ('oo', 174),
 ('ua', 109),
 ('au', 106),
 ('ue', 105),
 ('ui', 95)]
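The same counting idea can be tried without the Treebank corpus. In this sketch collections.Counter stands in for nltk.FreqDist, and the word list is made up for illustration:

```python
import re
from collections import Counter

words = ['beautiful', 'queue', 'cooperation', 'see', 'tree']

# Count every run of two or more vowels in every word.
counts = Counter(vs for word in words
                 for vs in re.findall(r'[aeiou]{2,}', word))

print(counts.most_common(1))
print(sorted(counts))
```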

## Doing More with Word Pieces

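### English text is highly redundant: it is still fairly easy to read when word-internal vowels are dropped and only word-initial and word-final vowel sequences are kept.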

In [40]:
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

english_udhr = nltk.corpus.udhr.words('English-Latin1')
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))

Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and


### Combine regular expressions with conditional frequency distributions to tabulate the frequency of every consonant-vowel pair, with consonants [ptksvr] and vowels [aeiou], in the Rotokas lexicon.

In [18]:
rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()

    a   e   i   o   u 
k 418 148  94 420 173 
p  83  31 105  34  51 
r 187  63  84  89  79 
s   0   0 100   2   1 
t  47   8   0 148  37 
v  93  27 105  48  49 


The rows for s and t are almost in complementary distribution, which suggests that they are not distinct phonemes in this language.
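How such a table is built can be sketched with the standard library. Here a dictionary of Counter objects plays the role of nltk.ConditionalFreqDist; the three sample words and the variable names are assumptions for illustration, not the Rotokas corpus itself.

```python
import re
from collections import Counter, defaultdict

words = ['kasuari', 'kapo', 'kepoto']

# Map each consonant (the condition) to a Counter of the vowels that follow it.
table = defaultdict(Counter)
for w in words:
    for consonant, vowel in re.findall(r'[ptksvr][aeiou]', w):
        table[consonant][vowel] += 1

for consonant in sorted(table):
    print(consonant, dict(table[consonant]))
```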

### To inspect the words behind the numbers in the table, build an index and look up a consonant-vowel pair, for example cv_index['su'].

In [41]:
cv_word_pairs = [(cv, w) for w in rotokas_words
                        for cv in re.findall(r'[ptksvr][aeiou]', w)]
cv_index = nltk.Index(cv_word_pairs)
cv_index['su']

['kasuari']

In [42]:
cv_index['po']

['kaapo',
 'kaapopato',
 'kaipori',
 'kaiporipie',
 'kaiporivira',
 'kapo',
 'kapoa',
 'kapokao',
 'kapokapo',
 'kapokapo',
 'kapokapoa',
 'kapokapoa',
 'kapokapora',
 'kapokapora',
 'kapokaporo',
 'kapokaporo',
 'kapokari',
 'kapokarito',
 'kapokoa',
 'kapoo',
 'kapooto',
 'kapoovira',
 'kapopaa',
 'kaporo',
 'kaporo',
 'kaporopa',
 'kaporoto',
 'kapoto',
 'karokaropo',
 'karopo',
 'kepo',
 'kepoi',
 'keposi',
 'kepoto']

## Finding Word Stems

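### When using a web search engine, we usually don't mind if the words in a document differ from our search terms in their endings; for example, laptops and laptop count as the same term.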

#### stem() strips off anything that looks like a common suffix

In [43]:
def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

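### Several ways to split a word into a stem and a suffix with regular expressions: ( ) captures a group, (?: ) groups without capturing, and *? makes the star non-greedy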

In [44]:
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['ing']

In [45]:
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['processing']

In [46]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

[('process', 'ing')]

In [47]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('processe', 's')]

In [163]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('process', 'es')]

In [51]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')

[('language', '')]

### This approach still has many problems, but we go ahead and redefine stem() with the regular expression and apply it to a whole text.

In [55]:
from nltk import word_tokenize

def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""

tokens = word_tokenize(raw)
[stem(t) for t in tokens]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'women',
 'ly',
 'in',
 'pond',
 'distribut',
 'sword',
 'i',
 'no',
 'basi',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'Supreme',
 'execut',
 'power',
 'deriv',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

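#### The regular expression removes the s from ponds, but also from is and basis. It also produces non-words such as distribut and deriv, but these are acceptable stems in some applications.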

## Searching Tokenized Text

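### A special kind of regular expression can search across multiple words in a text; each token is written inside angle brackets.
#### For example, <a> <.*> <man> finds every instance of "a ... man"; the parentheses in the pattern below select only the middle word: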

In [68]:
from nltk.corpus import gutenberg, nps_chat

moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> (<.*>) <man>")

monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave


#### Find three-word phrases that end with bro

In [58]:
chat = nltk.Text(nps_chat.words())
chat.findall(r"<.*> <.*> <bro>")

you rule bro; telling you bro; u twizted bro


#### Find sequences of three or more consecutive words that start with the letter l

In [77]:
chat.findall(r"<l.*>{3,}")

lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la


### Search for phrases of the form "x and other ys"

In [91]:
from nltk.corpus import brown

hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")

speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals


## 3.6   Normalizing Text

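### Before processing text we often convert it to lowercase with lower(), so that differences such as The versus the are ignored.
#### Going further and stripping affixes from words is known as stemming. A further step, making sure that the resulting form is a known word in a dictionary, is called lemmatization.
#### We discuss each in turn below. First we define the data we will use: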

In [92]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)

## Stemmers

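### NLTK includes several off-the-shelf stemmers. If you need a stemmer, prefer one of these over crafting your own with regular expressions, since they handle a wide range of irregular cases.
#### The Porter and Lancaster stemmers each follow their own rules for stripping affixes: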

In [93]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
[porter.stem(t) for t in tokens]

['denni',
 ':',
 'listen',
 ',',
 'strang',
 'women',
 'lie',
 'in',
 'pond',
 'distribut',
 'sword',
 'is',
 'no',
 'basi',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'suprem',
 'execut',
 'power',
 'deriv',
 'from',
 'a',
 'mandat',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcic',
 'aquat',
 'ceremoni',
 '.']

In [94]:
[lancaster.stem(t) for t in tokens]

['den',
 ':',
 'list',
 ',',
 'strange',
 'wom',
 'lying',
 'in',
 'pond',
 'distribut',
 'sword',
 'is',
 'no',
 'bas',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'suprem',
 'execut',
 'pow',
 'der',
 'from',
 'a',
 'mand',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'som',
 'farc',
 'aqu',
 'ceremony',
 '.']

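### Stemming is not a well-defined process, so we typically pick the stemmer that best suits the application at hand.
#### The Porter stemmer is a good choice if you are indexing texts and want searches to match alternative forms of a word:
#### (Object-oriented programming is beyond the scope of the book; string formatting and enumerate() are covered in detail in Sections 3.9 and 4.2)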

In [95]:
class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()

In [98]:
porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(porter, grail)
text.concordance('lie')

r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !   
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well  
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which 
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t


## Lemmatization

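### The WordNet lemmatizer removes an affix only if the resulting word is in its dictionary. This extra check makes the lemmatizer slower than the stemmers above.
#### Note that it leaves lying unchanged but converts women to woman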

In [99]:
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'woman',
 'lying',
 'in',
 'pond',
 'distributing',
 'sword',
 'is',
 'no',
 'basis',
 'for',
 'a',
 'system',
 'of',
 'government',
 '.',
 'Supreme',
 'executive',
 'power',
 'derives',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

## 3.7   Regular Expressions for Tokenizing Text
### Tokenization is the task of cutting a string into identifiable units such as words. NLTK already includes several tokenizers.

## Simple Approaches to Tokenization
### The simplest way to tokenize a text is to split it on the space character " ":

In [202]:
raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""

In [109]:
re.split(r' ', raw)

["'When",
 "I'M",
 'a',
 "Duchess,'",
 'she',
 'said',
 'to',
 'herself,',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone\nthough),',
 "'I",
 "won't",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL.',
 'Soup',
 'does',
 'very\nwell',
 'without--Maybe',
 "it's",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 "hot-tempered,'..."]

### Split not only on spaces but on any run of one or more spaces, tabs (\t), or newlines (\n); this can also be written re.split(r'\s+', raw)

In [112]:
re.split(r'[ \t\n]+', raw)
# re.split(r'\s+', raw)

["'When",
 "I'M",
 'a',
 "Duchess,'",
 'she',
 'said',
 'to',
 'herself,',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though),',
 "'I",
 "won't",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL.',
 'Soup',
 'does',
 'very',
 'well',
 'without--Maybe',
 "it's",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 "hot-tempered,'..."]

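### Splitting on whitespace leaves tokens such as '(not' and 'herself,' with punctuation still attached.
### An alternative is to split on the character class \W, the complement of \w; for ASCII text it is equivalent to [^A-Za-z0-9_].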

In [120]:
re.split(r'\W+', raw)

['',
 'When',
 'I',
 'M',
 'a',
 'Duchess',
 'she',
 'said',
 'to',
 'herself',
 'not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though',
 'I',
 'won',
 't',
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL',
 'Soup',
 'does',
 'very',
 'well',
 'without',
 'Maybe',
 'it',
 's',
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 'hot',
 'tempered',
 '']

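### The pattern \w+|\S\w* first tries to match a run of word characters (\w is equivalent to [A-Za-z0-9_] for ASCII text); failing that, it matches any non-whitespace character (\S) followed by zero or more word characters.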

In [114]:
re.findall(r'\w+|\S\w*', raw)

["'When",
 'I',
 "'M",
 'a',
 'Duchess',
 ',',
 "'",
 'she',
 'said',
 'to',
 'herself',
 ',',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though',
 ')',
 ',',
 "'I",
 'won',
 "'t",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL',
 '.',
 'Soup',
 'does',
 'very',
 'well',
 'without',
 '-',
 '-Maybe',
 'it',
 "'s",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 'hot',
 '-tempered',
 ',',
 "'",
 '.',
 '.',
 '.']

In [203]:
print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))

["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']


![Regular_Expression_Symbols.jpg](https://i.imgur.com/i7aAmQH.jpg)
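The final pattern of this section is easier to read when written with the verbose flag and one alternative per line. A sketch on a shorter sentence (the sentence is an assumption made for illustration):

```python
import re

pattern = r"""(?x)        # verbose mode: whitespace and comments are ignored
    \w+(?:[-']\w+)*       # words, with optional internal hyphens or apostrophes
  | '                     # a quotation mark on its own
  | [-.(]+                # runs of hyphens, periods, or opening parentheses
  | \S\w*                 # any other non-whitespace character plus word characters
"""

tokens = re.findall(pattern, "'I won't pay--never!' (she said)...")
print(tokens)
```

Because the alternatives are tried from left to right, the word pattern wins whenever a token starts with a word character, and the catch-all \S\w* is reached only for leftover punctuation such as ! and ).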

## NLTK's Regular Expression Tokenizer
### nltk.regexp_tokenize() works much like re.findall(), but is more efficient for tokenization. Depending on the NLTK version, capturing groups may change what is returned, so writing groups as non-capturing, (?:...), is the safer choice.
#### With the verbose flag (?x), a literal ' ' in the pattern no longer matches a space character; use \s instead
#### regexp_tokenize() also takes an optional gaps parameter; when it is set to True, the pattern describes the gaps between tokens, as with re.split()

In [None]:
text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)        # set flag to allow verbose regexps
    (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*          # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                # ellipsis
  | [][.,;"'?():_`-]      # these are separate tokens; includes ], [
'''

nltk.regexp_tokenize(text, pattern)

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
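re.findall() treats parentheses specially: when a pattern contains a capturing group, it returns the text of the group rather than the whole match. This is why tokenizer patterns are usually written with non-capturing groups, (?:...). A short sketch with the standard library only:

```python
import re

text = 'poster-print costs'

# Capturing group: findall() returns only what the group matched last.
captured = re.findall(r'\w+(-\w+)*', text)

# Non-capturing group: findall() returns each whole match.
whole = re.findall(r'\w+(?:-\w+)*', text)

print(captured)
print(whole)
```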

## Further Issues with Tokenization
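### Tokenization turns out to be far harder than you might expect. No single solution works well in every case. When developing a tokenizer, it helps to have raw text that has been manually tokenized, so that your tokenizer's output can be compared against high-quality tokens.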

## 3.8   Segmentation
## Sentence Segmentation

### Compute the average number of words per sentence in the Brown corpus:

In [138]:
len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())

20.250994070456922

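### In other cases the text is available only as a stream of characters, and it must be segmented into sentences before it can be tokenized into words.
### NLTK provides this through the Punkt sentence segmenter: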

In [146]:
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = nltk.sent_tokenize(text)
print(sents[79:89])

['"Nonsense!"', 'said Gregory, who was very rational when anyone else\nattempted paradox.', '"Why do all the clerks and navvies in the\nrailway trains look so sad and tired, so very sad and tired?', 'I will\ntell you.', 'It is because they know that the train is going right.', 'It\nis because they know that whatever place they have taken a ticket\nfor that place they will reach.', 'It is because after they have\npassed Sloane Square they know that the next station must be\nVictoria, and nothing but Victoria.', 'Oh, their wild rapture!', 'oh,\ntheir eyes like stars and their souls again in Eden, if the next\nstation were unaccountably Baker Street!"', '"It is you who are unpoetical," replied the poet Syme.']
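
To see why a trained segmenter is useful, it can be compared with a naive rule that splits after every '.', '!' or '?' followed by whitespace. The sketch below uses only the standard `re` module; the sample sentence and the `naive_sent_split` helper are hypothetical, introduced for illustration.

```python
import re

def naive_sent_split(text):
    # Split at whitespace that follows sentence-final punctuation.
    return re.split(r'(?<=[.!?])\s+', text)

sample = 'Mr. Smith went to Washington. He arrived at 5 p.m. sharp.'
pieces = naive_sent_split(sample)
print(pieces)
```

The naive rule also breaks after the abbreviations "Mr." and "p.m.", which is the kind of error a sentence segmenter such as Punkt is meant to avoid.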


## Word Segmentation

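### For some writing systems, tokenizing text is harder because word boundaries have no visual representation.
### For example, 愛國人 can be segmented as 「愛國 / 人」 ("country-loving person") or as 「愛 / 國人」 ("love country-person").
### A similar problem arises in spoken language, where the hearer must segment a continuous speech stream into individual words. It becomes especially hard when the words are not known in advance.
#### The following example has the word boundaries removed: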
#### a. doyouseethekitty
#### b. seethedoggy
#### c. doyoulikethekitty
#### d. likethedoggy

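### The first step is to find a way to represent where a string of characters is split: annotate each character with a 1 if a word break follows it, and with a 0 otherwise.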

In [151]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"

### Note that the strings of 0s and 1s are one character shorter than the text, because a text of length n can be split in only n-1 places.

In [148]:
def segment(text, segs):
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(text[last:i+1])
            last = i+1
    words.append(text[last:])
    return words

In [152]:
segment(text, seg1)

['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']

In [153]:
segment(text, seg2)

['do',
 'you',
 'see',
 'the',
 'kitty',
 'see',
 'the',
 'doggy',
 'do',
 'you',
 'like',
 'the',
 'kitty',
 'like',
 'the',
 'doggy']

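### Computing the objective function: given a hypothetical segmentation of the source text, derive a lexicon and a derivation table that allow the source text to be reconstructed. Then total the number of characters used by each lexical item (including a boundary marker) and by the derivation table. This total is the score of the segmentation, and smaller is better.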

![Calculation_of_Objective_Function.jpg](https://i.imgur.com/AFfc4mH.jpg)

In [154]:
def evaluate(text, segs):
    words = segment(text, segs)
    text_size = len(words)
    lexicon_size = sum(len(word) + 1 for word in set(words))
    return text_size + lexicon_size

In [155]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"
seg3 = "0000100100000011001000000110000100010000001100010000001"
segment(text, seg3)

['doyou',
 'see',
 'thekitt',
 'y',
 'see',
 'thedogg',
 'y',
 'doyou',
 'like',
 'thekitt',
 'y',
 'like',
 'thedogg',
 'y']

In [156]:
evaluate(text, seg3)

47

In [157]:
evaluate(text, seg2)

48

In [158]:
evaluate(text, seg1)

64
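
The scores above can be broken into their two parts. The sketch below works directly on a word list instead of a boundary string; `score_breakdown` is a hypothetical helper that mirrors the arithmetic in evaluate().

```python
def score_breakdown(words):
    # Derivation table: one entry per word token in the text.
    text_size = len(words)
    # Lexicon: each distinct word costs its length plus one boundary marker.
    lexicon_size = sum(len(word) + 1 for word in set(words))
    return text_size, lexicon_size, text_size + lexicon_size

seg3_words = ['doyou', 'see', 'thekitt', 'y', 'see', 'thedogg', 'y',
              'doyou', 'like', 'thekitt', 'y', 'like', 'thedogg', 'y']
seg1_words = ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']

print(score_breakdown(seg3_words))
print(score_breakdown(seg1_words))
```

For seg3 the 14 tokens reuse only 6 distinct words, so the lexicon stays small. For seg1 the derivation table has just 4 entries, but every long "word" must be stored in full, which gives the larger total.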

### Search for the pattern of 0s and 1s that minimizes the objective function

In [204]:

from random import randint

def flip(segs, pos):
    # Toggle the boundary bit at position pos ('0' <-> '1').
    return segs[:pos] + str(1-int(segs[pos])) + segs[pos+1:]

def flip_n(segs, n):
    # Toggle n randomly chosen positions (the same position may be picked twice).
    for i in range(n):
        segs = flip(segs, randint(0, len(segs)-1))
    return segs

def anneal(text, segs, iterations, cooling_rate):
    temperature = float(len(segs))
    while temperature > 0.5:
        # Keep the best-scoring segmentation seen at this temperature.
        best_segs, best = segs, evaluate(text, segs)
        for i in range(iterations):
            # A higher temperature flips more bits per guess.
            guess = flip_n(segs, round(temperature))
            score = evaluate(text, guess)
            if score < best:
                best, best_segs = score, guess
        score, segs = best, best_segs
        temperature = temperature / cooling_rate
        print(evaluate(text, segs), segment(text, segs))
    print()
    return segs

In [205]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
anneal(text, seg1, 5000, 1.2)

64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
62 ['doy', 'ouseethe', 'kitty', 'seet', 'hedoggy', 'doy', 'ouliket', 'hekittyliket', 'hedoggy']
60 ['doy', 'ousee', 'thekittyseet', 'hedoggy', 'doy', 'oulikethekittylik', 'et', 'hedoggy']
60 ['doy', 'ousee', 'thekittyseet', 'hedoggy', 'doy'

'0010000000000000000100000010010000000000000000001000000'

#### With enough data, it is possible to segment text into words automatically with reasonable accuracy.
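
Because anneal() relies on random flips, its printed trace differs from run to run. The sketch below exercises the two helper functions in isolation and uses `random.seed`, an addition that is not in the original notebook, to make a run repeatable.

```python
from random import randint, seed

def flip(segs, pos):
    # Toggle the bit at position pos.
    return segs[:pos] + str(1 - int(segs[pos])) + segs[pos+1:]

def flip_n(segs, n):
    # Toggle n randomly chosen positions.
    for _ in range(n):
        segs = flip(segs, randint(0, len(segs) - 1))
    return segs

single = flip('0000', 1)        # toggles only position 1
print(single)

seed(0)                         # fix the random sequence
first = flip_n('00000000', 3)
seed(0)                         # same seed, same sequence of flips
second = flip_n('00000000', 3)
print(first == second)
```

Calling `seed` with a fixed value before anneal() would make its output reproducible in the same way, at the cost of exploring the same guesses every time.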

## 3.9   Formatting: From Lists to Strings

## From Lists to Strings
### A list of words is the simplest kind of structured object we use for text processing. To output such a list we must convert it to a string, which in Python is done with join().

In [206]:
silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']
' '.join(silly)

'We called him Tortoise because he taught us .'

In [207]:
';'.join(silly)

'We;called;him;Tortoise;because;he;taught;us;.'

In [208]:
''.join(silly)

'WecalledhimTortoisebecausehetaughtus.'
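
Two details of join() are worth a quick check: split() undoes a join on spaces, and join() accepts only strings. A minimal sketch; the `numbers` list is made up for illustration.

```python
silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']

joined = ' '.join(silly)
# split() with no argument splits on whitespace, undoing the join
# as long as no word contains whitespace itself.
round_trip = joined.split()

# join() accepts only strings, so other values must be converted first.
numbers = [3, 4, 1]
as_text = ', '.join(str(n) for n in numbers)

print(round_trip == silly)
print(as_text)
```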

## Strings and Formats
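### There are two ways to display the contents of an object:
### 1. The print command gives Python's attempt at the most human-readable form of the object.
### 2. Naming the variable at a prompt shows a string that can be used to recreate the object.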

In [212]:
word = 'cat'
sentence = """hello
world"""

In [210]:
print(word)

cat


In [213]:
print(sentence)

hello
world


In [214]:
word

'cat'

In [215]:
sentence

'hello\nworld'
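
The two displays correspond to the built-in functions str() and repr(). The sketch below is an illustration added here, and it uses `ast.literal_eval` to confirm that the repr string recreates the original value.

```python
import ast

sentence = """hello
world"""

readable = str(sentence)       # the form print() shows
recreatable = repr(sentence)   # the form the prompt shows

print(readable)
print(recreatable)
# Evaluating the repr string as a literal gives back an equal object.
print(ast.literal_eval(recreatable) == sentence)
```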

### Formatted output typically combines variables with pre-specified strings. Given a frequency distribution fdist, we could write:

In [216]:
fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
for word in sorted(fdist):
    print(word, '->', fdist[word], end='; ')

cat -> 3; dog -> 4; snake -> 1; 

### Print statements that mix variables and constants are hard to read and maintain. A better solution is to use string formatting:

In [217]:
for word in sorted(fdist):
    print('{}->{};'.format(word, fdist[word]), end=' ')

cat->3; dog->4; snake->1; 

In [220]:
'{}->{};'.format('cat', 3)

'cat->3;'

### Breaking the method above into parts:

In [221]:
'{}->'.format('cat')

'cat->'

In [222]:
'{}'.format(3)

'3'

In [223]:
'I want a {} right now'.format('coffee')

'I want a coffee right now'
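
In Python 3.6 and later, formatted string literals (f-strings) express the same template with the variables written inside the braces. This sketch uses `collections.Counter` in place of `nltk.FreqDist` so that it runs without NLTK; that substitution is an assumption made for illustration.

```python
from collections import Counter

counts = Counter(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])

# The same line built with str.format() and with an f-string.
with_format = ' '.join('{}->{};'.format(word, counts[word]) for word in sorted(counts))
with_fstring = ' '.join(f'{word}->{counts[word]};' for word in sorted(counts))

print(with_fstring)
print(with_format == with_fstring)
```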

### We can have any number of placeholders, but the str.format method must be called with at least as many arguments as there are placeholders.

In [277]:
'{} wants a {} {}'.format('Lee', 'sandwich', 'for lunch')

'Lee wants a sandwich for lunch'

In [225]:
'{} wants a {} {}'.format('sandwich', 'for lunch')

IndexError: tuple index out of range

In [228]:
'{} wants a {}'.format('Lee', 'sandwich', 'for lunch')

'Lee wants a sandwich'

#### The arguments to format() are consumed from left to right, and any superfluous arguments are ignored.

### 'from {} to {}' is equivalent to 'from {0} to {1}', but we can use the numbers to change the default order:

In [229]:
'from {1} to {0}'.format('A', 'B')

'from B to A'
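
Besides numbers, placeholders can carry names that are filled from keyword arguments. A minimal sketch; the field names `name`, `food`, and `when` are chosen here for illustration.

```python
template = '{name} wants a {food} {when}'
line = template.format(name='Lee', food='sandwich', when='for lunch')
print(line)

# A missing keyword raises KeyError, the named counterpart of the
# IndexError raised for a missing positional argument.
try:
    template.format(name='Lee', food='sandwich')
    missing = None
except KeyError as err:
    missing = err.args[0]
print(missing)
```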

### Use a for loop to fill each word into the sentence template

In [230]:
template = 'Lee wants a {} right now'
menu = ['sandwich', 'spam fritter', 'pancake']
for snack in menu:
    print(template.format(snack))

Lee wants a sandwich right now
Lee wants a spam fritter right now
Lee wants a pancake right now


## Lining Things Up
### {:6} pads the value to a field width of 6; numbers are right-justified by default and strings are left-justified

In [231]:
'{:6}'.format(41)

'    41'

### The < symbol forces left-justification

In [234]:
'{:<6}'.format(41)

'41    '

In [235]:
'{:6}'.format('dog')

'dog   '

### The > symbol forces right-justification

In [236]:
'{:>6}'.format('dog')

'   dog'
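
The format specification offers a few more alignment options beyond < and >. The sketch below shows centering, a custom fill character, and zero padding; these examples are additions for illustration and do not come from the original notebook.

```python
centered = '{:^6}'.format('dog')     # ^ centers the value within the field
starred = '{:*>6}'.format(41)        # a fill character may precede the alignment symbol
zero_padded = '{:06}'.format(41)     # a leading 0 pads numbers with zeros

print(repr(centered), repr(starred), repr(zero_padded))
```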

### Display a floating-point number to four decimal places

In [237]:
import math
'{:.4f}'.format(math.pi)

'3.1416'

### To display a percentage, add '%' to the format specification; the value is automatically multiplied by 100

In [238]:
count, total = 3205, 9375
"accuracy for {} words: {:.4%}".format(total, count / total)

'accuracy for 9375 words: 34.1867%'

### An important use of formatting strings is for tabulating data.

In [239]:
def tabulate(cfdist, words, categories):
    print('{:16}'.format('Category'), end=' ')                    # column headings
    for word in words:
        print('{:>6}'.format(word), end=' ')
    print()
    for category in categories:
        print('{:16}'.format(category), end=' ')                  # row heading
        for word in words:                                        # for each word
            print('{:6}'.format(cfdist[category][word]), end=' ') # print table cell
        print()                                                   # end the row


In [240]:
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
            (genre, word)
            for genre in brown.categories()
            for word in brown.words(categories=genre))

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
tabulate(cfd, modals, genres)

Category            can  could    may  might   must   will 
news                 93     86     66     38     50    389 
religion             82     59     78     12     54     71 
hobbies             268     58    131     22     83    264 
science_fiction      16     49      4     12      8     16 
romance              74    193     11     51     45     43 
humor                16     30      8      8      9     13 
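
The table above hard-codes a column width of 6. When the widest entry is not known in advance, the width can be computed and passed in through a nested replacement field. A minimal sketch under that assumption, using the same list of modals:

```python
modals = ['can', 'could', 'may', 'might', 'must', 'will']

# Width of the widest word, so every column fits its longest entry.
width = max(len(word) for word in modals)

# The nested field {width} supplies the field width at run time.
cells = ['{:>{width}}'.format(word, width=width) for word in modals]
header = ' '.join(cells)
print(header)
```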


## Writing Results to a File
### Open a writable file output.txt and save the program's output to it.

In [257]:
output_file = open('output.txt', 'w')
words = set(nltk.corpus.genesis.words('english-kjv.txt'))
for word in sorted(words):
    print(word, file=output_file)

In [258]:
len(words)

2789

In [259]:
str(len(words))

'2789'

In [264]:
print(str(len(words)), file=output_file)
output_file.close()  # flush buffered output and release the file
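
A `with` block is a common way to make sure the file is closed even if an error occurs while writing. The sketch below writes to a temporary directory and uses a small stand-in word set instead of the Genesis corpus; both are assumptions made so that the example runs without NLTK.

```python
import os
import tempfile

words = {'beginning', 'God', 'created', 'heaven', 'earth'}   # stand-in word set

# A temporary directory keeps the example from touching real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'output.txt')
    # The with-block closes the file on exit, flushing buffered writes.
    with open(path, 'w', encoding='utf8') as output_file:
        for word in sorted(words):
            print(word, file=output_file)
        print(len(words), file=output_file)
    # Read the file back to confirm what was written.
    with open(path, encoding='utf8') as input_file:
        lines = input_file.read().splitlines()

print(lines)
```

Capitalized words sort before lowercase ones, so 'God' comes first in the file.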

## Text Wrapping

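### When program output is text-like rather than tabular, it usually needs to be wrapped so that it displays conveniently. The output below overflows a single line: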

In [267]:
saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
          'more', 'is', 'said', 'than', 'done', '.']

for word in saying:
    print(word, '(' + str(len(word)) + '),', end=' ')

After (5), all (3), is (2), said (4), and (3), done (4), , (1), more (4), is (2), said (4), than (4), done (4), . (1), 

### Wrap the lines with the help of Python's textwrap module

In [275]:
from textwrap import fill
fmt = '%s (%d),'   # named fmt to avoid shadowing the built-in format()
pieces = [fmt % (word, len(word)) for word in saying]
output = ' '.join(pieces)
wrapped = fill(output)
print(wrapped)

After (5), all (3), is (2), said (4), and (3), done (4), , (1), more
(4), is (2), said (4), than (4), done (4), . (1),
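
fill() wraps at 70 characters unless told otherwise. A minimal sketch of the `width` parameter, assuming a 40-character display; the choice of 40 is arbitrary.

```python
from textwrap import fill

saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
          'more', 'is', 'said', 'than', 'done', '.']
output = ' '.join('{} ({}),'.format(word, len(word)) for word in saying)

# width sets the maximum line length (the default is 70).
wrapped = fill(output, width=40)
print(wrapped)
```

Line breaks fall only at whitespace, so a piece such as "more (4)," can end up split across two lines, as it is in the output above.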
