# 3. Processing Raw Text

## 3.1   Accessing Text from the Web and from Disk

## Electronic Books

### The Gutenberg collection in NLTK is only a small sample. To analyze other Project Gutenberg texts, browse the catalog at http://www.gutenberg.org/catalog/ and download a text with request.urlopen().


In [3]:
from urllib import request
url = "http://www.gutenberg.org/files/12345/12345.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)

str

In [9]:
len(raw)

281779

In [10]:
raw[:75]

'The Project Gutenberg EBook of Friday, the Thirteenth, by Thomas W. Lawson\r'

### raw is a string holding the original content of the book, including whitespace characters

### word_tokenize() splits the text into tokens: words and punctuation marks

In [4]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(raw)
type(tokens)

list

In [5]:
len(tokens)

58388

In [16]:
tokens[:10]

['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Friday',
 ',',
 'the',
 'Thirteenth',
 ',']

### Convert the tokens to an nltk.text.Text object, then use collocations() to find pairs of adjacent words that frequently occur together

In [29]:
import nltk
text = nltk.Text(tokens)
type(text)

nltk.text.Text

In [19]:
text[1024:1062]

['of',
 'which',
 'I',
 'of',
 'all',
 'men',
 'best',
 'knew',
 'the',
 'meaning',
 '.',
 'The',
 'big',
 'brown',
 'eyes',
 'were',
 'set',
 'on',
 'space',
 ';',
 'the',
 'outer',
 'corners',
 'of',
 'the',
 'handsome',
 'mouth',
 'were',
 'drawn',
 'hard',
 'and',
 'tense',
 'as',
 'though',
 'weighted',
 '.',
 'As',
 'I']

In [20]:
text.collocations()

Barry Conant; Project Gutenberg-tm; Beulah Sands; Wall Street; Stock
Exchange; Project Gutenberg; New York; Bob Brownley; Miss Sands;
Literary Archive; United States; Mr. Brownley; Gutenberg-tm
electronic; electronic works; Archive Foundation; Gutenberg Literary;
Dear Sir; 'the Street; Mr. Randolph; 'Standard Oil


### Use find() to get the position of the first occurrence of a substring, and rfind() to get the position of the last occurrence

In [47]:
raw.find("The big brown eyes")

4854

In [50]:
raw.rfind("though weighted")

4962

In [58]:
tmp = raw[4854:4962]
tmp

'The big brown eyes were set on space; the outer corners of the\r\nhandsome mouth were drawn hard and tense as '
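The find()/rfind() combination above is handy for trimming a text down to the span between two markers. A minimal sketch, assuming a hypothetical helper `slice_between` and a made-up sample string (neither comes from the notebook):

```python
def slice_between(text, start_marker, end_marker):
    """Return the text from the first start_marker up to (not including)
    the last end_marker, or None if either marker is missing."""
    start = text.find(start_marker)    # first occurrence, -1 if absent
    end = text.rfind(end_marker)       # last occurrence, -1 if absent
    if start == -1 or end == -1 or end < start:
        return None
    return text[start:end]

# Hypothetical sample text with a header and a footer around the body
sample = "HEADER *** START *** body text *** END *** footer"
body = slice_between(sample, "body", "*** END")
print(repr(body))
```

Checking for -1 matters here: a missing marker would otherwise silently produce a wrong slice, because -1 is a valid index.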

## Dealing with HTML
### Most text on the web is in HTML form; BeautifulSoup can be used to parse it

In [12]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

In [15]:
from bs4 import BeautifulSoup
raw = BeautifulSoup(html,"lxml").get_text()
tokens = word_tokenize(raw)
tokens[:60]

['BBC',
 'NEWS',
 '|',
 'Health',
 '|',
 'Blondes',
 "'to",
 'die',
 'out',
 'in',
 '200',
 "years'",
 'NEWS',
 'SPORT',
 'WEATHER',
 'WORLD',
 'SERVICE',
 'A-Z',
 'INDEX',
 'SEARCH',
 'You',
 'are',
 'in',
 ':',
 'Health',
 'News',
 'Front',
 'Page',
 'Africa',
 'Americas',
 'Asia-Pacific',
 'Europe',
 'Middle',
 'East',
 'South',
 'Asia',
 'UK',
 'Business',
 'Entertainment',
 'Science/Nature',
 'Technology',
 'Health',
 'Medical',
 'notes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Talking',
 'Point',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-']

### Use concordance() to search for a word and display its contexts

In [64]:
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin


## Processing RSS Feeds
### feedparser handles feeds in RSS format. RSS is an XML-based format that lets site owners deliver page content to subscribers.

In [61]:
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']['title']

'Language Log'

### entries holds the posts in the feed; len() gives how many there are

In [68]:
len(llog.entries)

13

In [70]:
post = llog.entries[2]
post.title

'The vocabulary of sharp implements in Xinjiang'

In [77]:
content = post.content[0].value
content[:70]

'<p>Public notification posted in villages of <a href="https://en.wikip'

### Parse the HTML with BeautifulSoup and tokenize the extracted text

In [None]:
raw = BeautifulSoup(content, "lxml").get_text()
word_tokenize(raw)

['Public',
 'notification',
 'posted',
 'in',
 'villages',
 'of',
 'Makit',
 'County',
 '(',
 'Màigàití',
 'xiàn',
 '麦盖提县',
 ';',
 'Mәkit',
 'nah̡iyisi',
 '/',
 'Мәкит',
 'наһийиси',
 ...]

## Reading Local Files
### Use open() and read() to read a local file

In [5]:
f = open('document.txt')
raw = f.read()
raw

'In case the real world’s not scary enough, there are Halloween attractions out there designed to completely freak you out.\nOne called "This Is Real" will "literally kidnap you and stash you in a Brooklyn（New York）warehouse."\nHauntWorld.com’s scariest haunted houses includes Erebus in Pontiac, Michigan, where "things grab you, bite you, land on top of you, and then we will bury you alive."\nBut there are family friendly events like Mickey’s Not-So-Scary Halloween Party at Disney World and Dollywood’s Great Pumpkin Luminights in Tennessee.\nHalloween parades include New York City’s massive parade with gigantic puppets through Greenwich Village on Oct. 31, and New Orleans’ Krewe of Boo parade Oct. 21.\nKey West, Florida, says it’s going ahead with its annual Fantasy Fest despite the aftermath of Hurricane Irma. （AP）'

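### Use a for loop to read the file one line at a time
### strip() removes leading and trailing whitespace, including the newline at the end of each line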

In [9]:
f = open('document.txt')
for line in f:
    print(line.strip())

In case the real world’s not scary enough, there are Halloween attractions out there designed to completely freak you out.
One called "This Is Real" will "literally kidnap you and stash you in a Brooklyn（New York）warehouse."
HauntWorld.com’s scariest haunted houses includes Erebus in Pontiac, Michigan, where "things grab you, bite you, land on top of you, and then we will bury you alive."
But there are family friendly events like Mickey’s Not-So-Scary Halloween Party at Disney World and Dollywood’s Great Pumpkin Luminights in Tennessee.
Halloween parades include New York City’s massive parade with gigantic puppets through Greenwich Village on Oct. 31, and New Orleans’ Krewe of Boo parade Oct. 21.
Key West, Florida, says it’s going ahead with its annual Fantasy Fest despite the aftermath of Hurricane Irma. （AP）
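The cells above leave the file object open. A hedged sketch of the same line-by-line loop using a with block, which closes the file automatically; the temporary file and its contents are made up so that the example runs on its own:

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf8') as tmp:
    tmp.write("first line\nsecond line\n")
    tmp_path = tmp.name

# The with block closes the file when it ends, even if an error occurs.
with open(tmp_path, encoding='utf8') as f:
    lines = [line.strip() for line in f]

os.remove(tmp_path)   # clean up the throwaway file
print(lines)
```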


### Alternatively, use nltk.data.find() to locate a file in an NLTK corpus, then read it with read()

In [16]:
import nltk
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
raw = open(path, 'r').read()
raw[:75]

'[Moby Dick by Herman Melville 1851]\n\n\nETYMOLOGY.\n\n(Supplied by a Late Consu'

## Capturing User Input
### When we need the user to enter some text, input() stores it as a string that we can then process

In [17]:
s = input("Enter some text: ")

Enter some text: Hello World!


In [20]:
print("You typed", len(word_tokenize(s)), "words.")

You typed 3 words.


## The NLP Pipeline
![](https://i.imgur.com/oPrXN0v.jpg)
### The figure above shows how to obtain the vocabulary of a text. Along the way, raw is a str, while word_tokenize() returns a list

In [21]:
raw = open('document.txt').read()
type(raw)

str

In [22]:
tokens = word_tokenize(raw)
type(tokens)

list

In [23]:
words = [w.lower() for w in tokens]
type(words)

list

In [24]:
vocab = sorted(set(words))
type(vocab)

list
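The pipeline steps above can be folded into one function. This is only a sketch: to stay free of external dependencies it swaps word_tokenize() for a simple regular expression, so its tokens will differ from NLTK's on harder input. The function name `vocabulary` is hypothetical.

```python
import re

def vocabulary(raw):
    """str -> list of tokens -> lowercased list -> sorted vocabulary."""
    # Stand-in for word_tokenize(): runs of word characters,
    # or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", raw)
    words = [w.lower() for w in tokens]
    return sorted(set(words))

vocab = vocabulary("The cat saw the other cat.")
print(vocab)
```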

### Strings can be extended with + (or +=); to add a single item to a list, use append()

In [32]:
query = 'hello'
query += ' world'
query

'hello world'

In [34]:
test = ['john', 'paul', 'george', 'ringo']
test.append('blog')
test

['john', 'paul', 'george', 'ringo', 'blog']

## 3.2   Strings: Text Processing at the Lowest Level
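### In previous chapters we treated a text as a list of words, without looking closely at characters or at how the programming language handles them. In this section we explore strings in detail and show the connections between strings, words, texts and files.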
## Basic Operations with Strings

### Strings can be delimited with single or double quotes. When a single-quoted string must contain a single quote, escape it with \

In [35]:
monty = 'Monty Python'
monty

'Monty Python'

In [36]:
circus = "Monty Python's Flying Circus"
circus

"Monty Python's Flying Circus"

In [37]:
circus = 'Monty Python\'s Flying Circus'
circus

"Monty Python's Flying Circus"

### When a string is too long for one line, use \ or () to join it with the string on the next line

In [39]:
couplet = "Shall I compare thee to a Summer's day?"\
    "Thou are more lovely and more temperate:"
couplet

"Shall I compare thee to a Summer's day?Thou are more lovely and more temperate:"

In [40]:
couplet = ("Rough winds do shake the darling buds of May,"
           "And Summer's lease hath all too short a date:")
couplet

"Rough winds do shake the darling buds of May,And Summer's lease hath all too short a date:"

### The approaches above do not insert a newline; triple double quotes or triple single quotes keep the line breaks

In [41]:
couplet = """Shall I compare thee to a Summer's day?
    Thou are more lovely and more temperate:"""
print(couplet)

Shall I compare thee to a Summer's day?
    Thou are more lovely and more temperate:


In [43]:
couplet = '''Rough winds do shake the darling buds of May,
    And Summer's lease hath all too short a date:'''
print(couplet)

Rough winds do shake the darling buds of May,
    And Summer's lease hath all too short a date:


### Strings support addition and multiplication, but not subtraction or division

In [44]:
'very' + 'very' + 'very'

'veryveryvery'

In [45]:
'very' * 3

'veryveryvery'

In [46]:
a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
b = [' ' * 2 * (7 - i) + 'very' * i for i in a]
for line in b:
    print(line)

            very
          veryvery
        veryveryvery
      veryveryveryvery
    veryveryveryveryvery
  veryveryveryveryveryvery
veryveryveryveryveryveryvery
  veryveryveryveryveryvery
    veryveryveryveryvery
      veryveryveryvery
        veryveryvery
          veryvery
            very


## Printing Strings
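### When printing with print(), strings can be concatenated with + or passed as separate arguments with , (which inserts a space between them)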

In [47]:
print(monty)

Monty Python


In [48]:
grail = 'Holy Grail'
print(monty + grail)

Monty PythonHoly Grail


In [49]:
print(monty, grail)

Monty Python Holy Grail


In [50]:
print(monty, "and the", grail)

Monty Python and the Holy Grail


## Accessing Individual Characters
### A string is a sequence of characters; use [ ] to access an individual character

In [58]:
monty = 'Monty Python'
monty [0]

'M'

In [52]:
monty [3]

't'

### [-1] returns the last character of the string

In [53]:
monty [-1]

'n'

In [54]:
sent = 'colorless green ideas sleep furiously'
for char in sent:
    print(char, end=' ')

c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y 

### List the letters that occur in melville-moby_dick.txt, ordered by frequency
### isalpha() tests whether a character is a letter

In [55]:
from nltk.corpus import gutenberg
raw = gutenberg.raw('melville-moby_dick.txt')
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
fdist.most_common(5)

[('e', 117092), ('t', 87996), ('a', 77916), ('o', 69326), ('n', 65617)]

In [56]:
[char for (char, count) in fdist.most_common()]

['e',
 't',
 'a',
 'o',
 'n',
 'i',
 's',
 'h',
 'r',
 'l',
 'd',
 'u',
 'm',
 'c',
 'w',
 'f',
 'g',
 'p',
 'b',
 'y',
 'v',
 'k',
 'q',
 'j',
 'x',
 'z']

## Accessing Substrings
![](https://i.imgur.com/FvzJJ5P.jpg)
### As noted above, a string is a sequence, so it can be sliced just like a list
### [6:10] starts at index 6 and runs up to, but does not include, index 10

In [59]:
monty = 'Monty Python'
monty [6:10]

'Pyth'

In [61]:
monty[-12:-7]

'Monty'

### Omitting the first value starts the slice at the beginning; omitting the second runs it to the end

In [62]:
monty[:5]

'Monty'

In [63]:
monty[6:]

'Python'

### Use the in operator to test whether a substring occurs in a string

In [65]:
phrase = 'And now for something completely different'
if 'thing' in phrase:
    print('found "thing"')

found "thing"


### find() can also be used: it returns the position of the first occurrence, or -1 if the substring is not found

In [69]:
monty.find('Python')

6

In [70]:
monty.find('132')

-1

## More operations on strings
![](https://i.imgur.com/IAiaPwo.jpg)

### rfind() searches from the end of the string and returns the index of the last occurrence

In [73]:
monty.find('n')

2

In [74]:
monty.rfind('n')

11

### index() works like find(), but raises ValueError when the substring is not found

In [77]:
monty.index('n')

2

In [84]:
monty.rindex('n')

11

In [79]:
monty.index('1223')

ValueError: substring not found

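### s.join(seq) concatenates the items of seq, placing the string s between them; because '1234' is a sequence of characters, monty is inserted between the digits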

In [91]:
monty.join('1234')

'1Monty Python2Monty Python3Monty Python4'
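In practice join() is more often called on a short separator to glue a list of words together. A small sketch with a made-up word list, showing that split() with the same separator reverses it:

```python
words = ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']

# The separator is the string join() is called on.
joined = ' '.join(words)
print(joined)

# split() with the same separator recovers the original list.
round_trip = joined.split(' ')
print(round_trip == words)

# A string is itself a sequence of characters, so join() also
# accepts one and puts the separator between its characters.
interleaved = '-'.join('1234')
print(interleaved)
```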

### split() splits a string at the given separator; splitlines() splits at line boundaries such as \n and \r

In [93]:
monty.split(' ')

['Monty', 'Python']

### upper() converts a string to uppercase; lower() converts it to lowercase

In [106]:
monty.upper()

'MONTY PYTHON'

### title() capitalizes the first letter of each word and lowercases the rest

In [109]:
test = 'test hello world'
test.title()

'Test Hello World'

### strip() removes leading and trailing whitespace

In [110]:
test = '    123123   '
test.strip()

'123123'

### replace() replaces every occurrence of one substring with another

In [111]:
test = '123123123'
test.replace('2','4')

'143143143'

## The Difference between Lists and Strings
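### Strings and lists are both sequences that can be indexed, but a string and a list cannot be added to each other. Lists are mutable, so their items can be changed freely, whereas strings cannot be modified in place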

In [112]:
query = 'Who knows?'
beatles = ['John', 'Paul', 'George', 'Ringo']
beatles[0] = "John Lennon"
beatles

['John Lennon', 'Paul', 'George', 'Ringo']

In [113]:
query[0] = 'F'

TypeError: 'str' object does not support item assignment
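Since item assignment fails on a string, the usual workaround is to build a new string. A minimal sketch of two ways to do that, reusing the query value from the cell above:

```python
query = 'Who knows?'

# Option 1: slice and concatenate into a new string.
fixed = 'F' + query[1:]

# Option 2: go through a (mutable) list of characters and join it back.
chars = list(query)
chars[0] = 'F'
fixed_via_list = ''.join(chars)

# The original string is left untouched in both cases.
print(fixed, fixed_via_list, query)
```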

## 3.3   Text Processing with Unicode
## Extracting encoded text from files
### Pass the encoding argument to open() to specify the file's encoding

In [136]:
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.


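### Calling encode() on a string converts it to bytes in the given encoding; the unicode_escape codec shows non-ASCII characters as escape sequences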

In [128]:
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line.encode('unicode_escape'))

b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'


### ord() returns the Unicode code point of a character

In [124]:
ord('ń')

324

### As a four-digit hexadecimal number, 324 is 0144, so the character can be written with the \u escape as '\u0144'

In [121]:
nacute = '\u0144'
nacute

'ń'

### Encode the character as UTF-8 bytes

In [129]:
nacute.encode('utf8')

b'\xc5\x84'
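The same character produces different bytes under different encodings, and decoding only works with the matching codec. A short sketch of the round trip, using the nacute character from the cells above and the latin2 codec used earlier for the Polish file:

```python
nacute = '\u0144'                        # the character ń

utf8_bytes = nacute.encode('utf8')       # two bytes in UTF-8
latin2_bytes = nacute.encode('latin2')   # a single byte in ISO-8859-2

print(utf8_bytes, latin2_bytes)

# Decoding with the matching codec restores the original string.
from_utf8 = utf8_bytes.decode('utf8')
from_latin2 = latin2_bytes.decode('latin2')
print(from_utf8 == nacute, from_latin2 == nacute)
```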

### The unicodedata module lets us inspect the properties of Unicode characters
### unicodedata.name() returns the Unicode name of a character

In [155]:
import unicodedata
lines = open(path, encoding='latin2').readlines()
line = lines[2]
print(line.encode('unicode_escape'))

b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y\\n'


In [148]:
for c in line:
    if ord(c) > 127:
        print('{} U+{:04x} {}'.format(c, ord(c), unicodedata.name(c)))

ó U+00f3 LATIN SMALL LETTER O WITH ACUTE
ś U+015b LATIN SMALL LETTER S WITH ACUTE
ś U+015b LATIN SMALL LETTER S WITH ACUTE
ą U+0105 LATIN SMALL LETTER A WITH OGONEK
ł U+0142 LATIN SMALL LETTER L WITH STROKE


### Use a regular expression to match \u015b

In [153]:
print('\u015b')

ś


In [152]:
import re
m = re.search('\u015b\\w*', line)
m.group()

'światowej'

### word_tokenize() can also be used to split the line into words

In [157]:
word_tokenize(line)

['Niemców',
 'pod',
 'koniec',
 'II',
 'wojny',
 'światowej',
 'na',
 'Dolny',
 'Śląsk',
 ',',
 'zostały']

## 3.4   Regular Expressions for Detecting Word Patterns
## Using Basic Meta-Characters
### Match the words in wordlist that end in ed
### $ matches the end of the string

In [None]:
import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
[w for w in wordlist if re.search('ed$', w)]

['abaissed',
 'abandoned',
 'abased',
 'abashed',
 'abatised',
 'abed',
 'aborted',
 'abridged',
 'abscessed',
 'absconded',
 'absorbed',
 'abstracted',
 'abstricted',
 ...]

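### Match eight-letter words whose third letter is j and whose sixth letter is t
### ^ matches the start of the string
### . matches any single character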

In [164]:
[w for w in wordlist if re.search('^..j..t..$', w)]

['abjectly',
 'adjuster',
 'dejected',
 'dejectly',
 'injector',
 'majestic',
 'objectee',
 'objector',
 'rejecter',
 'rejector',
 'unjilted',
 'unjolted',
 'unjustly']

## Ranges and Closures
### Match four-letter words whose first letter is from [ghi], second from [mno], third from [jlk], and fourth from [def]

In [165]:
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]

['gold', 'golf', 'hold', 'hole']

### + means one or more occurrences of the preceding item

In [166]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$', w)]

['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 'miiiiiinnnnnnnnnneeeeeeee',
 'mine',
 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

In [167]:
[w for w in chat_words if re.search('^[ha]+$', w)]

['a',
 'aaaaaaaaaaaaaaaaa',
 'aaahhhh',
 'ah',
 'ahah',
 'ahahah',
 'ahh',
 'ahhahahaha',
 'ahhh',
 'ahhhh',
 'ahhhhhh',
 'ahhhhhhhhhhhhhh',
 'h',
 'ha',
 'haaa',
 'hah',
 'haha',
 'hahaaa',
 'hahah',
 'hahaha',
 'hahahaa',
 'hahahah',
 'hahahaha',
 'hahahahaaa',
 'hahahahahaha',
 'hahahahahahaha',
 'hahahahahahahahahahahahahahahaha',
 'hahahhahah',
 'hahhahahaha']

### To match a character that has a special meaning, such as . $ or ^, precede it with \

In [None]:
wsj = sorted(set(nltk.corpus.treebank.words()))
[w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)]

['0.0085',
 '0.05',
 '0.1',
 '0.16',
 '0.2',
 '0.25',
 '0.28',
 '0.3',
 '0.4',
 '0.5',
 '0.50',
 '0.54',
 '0.56',
 '0.60',
 '0.7',
 ...]

In [170]:
[w for w in wsj if re.search(r'^[A-Z]+\$$', w)]

['C$', 'US$']

### {n} means the preceding item must occur exactly n times

In [None]:
[w for w in wsj if re.search('^[0-9]{4}$', w)]

['1614',
 '1637',
 '1787',
 '1901',
 '1903',
 '1917',
 '1925',
 '1929',
 '1933',
 '1934',
 '1948',
 '1953',
 '1955',
 '1956',
 '1961',
 '1965',
 '1966',
 '1967',
 '1968',
 '1969',
 '1970',
 '1971',
 '1972',
 '1973',
 ...]

### {x,y} means the preceding item occurs at least x and at most y times

In [172]:
[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]

['10-day',
 '10-lap',
 '10-year',
 '100-share',
 '12-point',
 '12-year',
 '14-hour',
 '15-day',
 '150-point',
 '190-point',
 '20-point',
 '20-stock',
 '21-month',
 '237-seat',
 '240-page',
 '27-year',
 '30-day',
 '30-point',
 '30-share',
 '30-year',
 '300-day',
 '36-day',
 '36-store',
 '42-year',
 '50-state',
 '500-stock',
 '52-week',
 '69-point',
 '84-month',
 '87-store',
 '90-day']

In [173]:
[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]

['black-and-white',
 'bread-and-butter',
 'father-in-law',
 'machine-gun-toting',
 'savings-and-loan']

### Match words ending in ed or ing

In [None]:
[w for w in wsj if re.search('(ed|ing)$', w)]

['62%-owned',
 'Absorbed',
 'According',
 'Adopting',
 'Advanced',
 'Advancing',
 'Alfred',
 'Allied',
 'Annualized',
 'Anything',
 'Arbitrage-related',
 'Arbitraging',
 'Asked',
 'Assuming',
 'Atlanta-based',
 'Baking',
 ...]

### Basic regular expression syntax
![](https://i.imgur.com/ETp6z1p.jpg)

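### . matches any single character
### ^abc matches abc at the start of a string
### abc$ matches abc at the end of a string
### [abc] matches one character from the set a, b, c
### [A-Z0-9] matches one character in the range A-Z or 0-9
### ed|ing|s matches ed, ing, or s
### * matches the previous item zero or more times
### + matches the previous item one or more times
### ? matches the previous item zero or one time
### {n} matches the previous item exactly n times
### {n,} matches the previous item at least n times
### {,n} matches the previous item at most n times
### {m,n} matches the previous item at least m and at most n times
### a(b|c)+ matches a followed by one or more occurrences of b or c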
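Each operator in the summary can be checked against one string that should match and one that should not. The test strings below are made up for illustration:

```python
import re

# (pattern, string that should match, string that should not)
examples = [
    (r'^abc',      'abcde',  'xabc'),   # ^ anchors at the start
    (r'abc$',      'xxabc',  'abcx'),   # $ anchors at the end
    (r'^[abc]$',   'b',      'd'),      # one character from the set
    (r'^a*$',      '',       'b'),      # * : zero or more
    (r'^a+$',      'aaa',    ''),       # + : one or more
    (r'^ab?$',     'a',      'abb'),    # ? : zero or one
    (r'^a{2}$',    'aa',     'aaa'),    # {n} : exactly n
    (r'^a{2,}$',   'aaaa',   'a'),      # {n,} : at least n
    (r'^a{,2}$',   'a',      'aaa'),    # {,n} : at most n
    (r'^a{1,2}$',  'aa',     'aaa'),    # {m,n} : between m and n
    (r'^a(b|c)+$', 'abcb',   'a'),      # grouped alternation, repeated
]

# For every row, the first search should succeed and the second should fail.
results = [(bool(re.search(p, yes)), bool(re.search(p, no)))
           for p, yes, no in examples]
print(results)
```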

## 3.5   Useful Applications of Regular Expressions

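### The examples above all used re.search(regexp, w) to find words w that match the regular expression regexp. Besides checking whether a regular expression matches a word, we can use regular expressions to extract material from words, or to modify words in specific ways.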

In [4]:
import nltk, re

### re.findall() finds all non-overlapping matches of the given regular expression.

In [12]:
word = 'supercalifragilisticexpialidocious'
re.findall(r'[aeiou]', word)
# [w for w in word if re.search(r'[aeiou]', w)] # equivalent using re.search()

['u',
 'e',
 'a',
 'i',
 'a',
 'i',
 'i',
 'i',
 'e',
 'i',
 'a',
 'i',
 'o',
 'i',
 'o',
 'u']

In [7]:
len(re.findall(r'[aeiou]', word))

16

### Find all sequences of two or more vowels in the treebank vocabulary and show their frequencies.

In [34]:
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj
                       for vs in re.findall(r'[aeiou]{2,}', word))
fd.most_common(12)

[('io', 549),
 ('ea', 476),
 ('ie', 331),
 ('ou', 329),
 ('ai', 261),
 ('ia', 253),
 ('ee', 217),
 ('oo', 174),
 ('ua', 109),
 ('au', 106),
 ('ue', 105),
 ('ui', 95)]
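The same counting idea works without NLTK. A rough sketch that uses collections.Counter in place of nltk.FreqDist, on a made-up word list rather than the treebank vocabulary:

```python
import re
from collections import Counter

# Made-up sample; the notebook uses the treebank vocabulary instead.
words = ['audio', 'radio', 'beautiful', 'tree', 'free']

# Count every run of two or more vowels across all words.
counts = Counter(vs for word in words
                 for vs in re.findall(r'[aeiou]{2,}', word))
print(counts.most_common())
```

Note that findall() returns the longest run at each position, so beautiful contributes eau rather than ea and au.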

## Doing More with Word Pieces

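### English text is highly redundant: it often remains readable when word-internal vowels are dropped and only the vowel sequences at the start and end of each word are kept.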

In [46]:
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
def compress(word):
    pieces = re.findall(regexp, word)
#     print(pieces) # e.g. -> ['U', 'n', 'v', 'r', 's', 'l']
    return ''.join(pieces)

english_udhr = nltk.corpus.udhr.words('English-Latin1')
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))

Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and


### Combine regular expressions with conditional frequency distributions to tabulate the frequency of each pairing of a consonant [ptksvr] with a vowel [aeiou] in the Rotokas language.

In [55]:
rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()

    a   e   i   o   u 
k 418 148  94 420 173 
p  83  31 105  34  51 
r 187  63  84  89  79 
s   0   0 100   2   1 
t  47   8   0 148  37 
v  93  27 105  48  49 


The rows for s and t are in near-complementary distribution, which suggests that they are not distinct phonemes in this language.

### To inspect the words behind the numbers in the table above, build an index and look up cv_index['<consonant><vowel>'].

In [90]:
cv_word_pairs = [(cv, w) for w in rotokas_words
                        for cv in re.findall(r'[ptksvr][aeiou]', w)]
cv_index = nltk.Index(cv_word_pairs)
cv_index['su']
# [value for (key, value) in cv_word_pairs if key == 'su'] # same result

['kasuari']

In [42]:
cv_index['po']

['kaapo',
 'kaapopato',
 'kaipori',
 'kaiporipie',
 'kaiporivira',
 'kapo',
 'kapoa',
 'kapokao',
 'kapokapo',
 'kapokapo',
 'kapokapoa',
 'kapokapoa',
 'kapokapora',
 'kapokapora',
 'kapokaporo',
 'kapokaporo',
 'kapokari',
 'kapokarito',
 'kapokoa',
 'kapoo',
 'kapooto',
 'kapoovira',
 'kapopaa',
 'kaporo',
 'kaporo',
 'kaporopa',
 'kaporoto',
 'kapoto',
 'karokaropo',
 'karopo',
 'kepo',
 'kepoi',
 'keposi',
 'kepoto']

## Finding Word Stems

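### When using a web search engine, we usually do not care about word endings; for example, laptops and laptop refer to the same thing.
### A stem is the part of a word that remains after its ending is removed, so different inflected forms of a word share one stem.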

#### stem() strips common endings to obtain the stem

In [163]:
def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

### Although we will ultimately use NLTK's built-in stemmers, here we show how words can be processed with regular expressions.

In [137]:
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['ing']

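### To use parentheses only to delimit the scope of the alternation, without capturing what they match, add ?: just inside the opening parenthesis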

In [99]:
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['processing']

### Split the word into stem and suffix

In [241]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

[('process', 'ing')]

### A problem shows up here: the pattern finds -s rather than the intended -es
#### The * operator is greedy, so .* consumes as much of the input as possible

In [47]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('processe', 's')]

#### To make * non-greedy, use *? instead

In [239]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('process', 'es')]

In [167]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')

[('language', '')]

### stem() still has many problems, as the results below show

In [165]:
from nltk import word_tokenize

def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""

tokens = word_tokenize(raw)
[stem(t) for t in tokens]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'women',
 'ly',
 'in',
 'pond',
 'distribut',
 'sword',
 'i',
 'no',
 'basi',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'Supreme',
 'execut',
 'power',
 'deriv',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

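#### Besides removing the s from ponds, the regular expression also strips the s from is and basis, and it produces non-words such as distribut and deriv. For some applications this is acceptable.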
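One cheap way to avoid mangling very short words such as is or lying is to require a minimum stem length. This is only a heuristic sketch, not a method from the notebook: the helper name `stem_min3` and the three-character threshold are assumptions, and words like basis are still cut to basi.

```python
import re

SUFFIXES = r'(ing|ly|ed|ious|ies|ive|es|s|ment)'

def stem_min3(word):
    """Strip a suffix only when at least three characters remain."""
    # .{3,}? is non-greedy but must consume at least three characters,
    # so short words fail to match and are returned unchanged.
    m = re.match(r'^(.{3,}?)' + SUFFIXES + r'$', word)
    return m.group(1) if m else word

stems = [stem_min3(w) for w in ['is', 'lying', 'ponds', 'processes', 'language']]
print(stems)
```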

## Searching Tokenized Text

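### A special kind of regular expression can be used to search across multiple words in a text.
### In NLTK's findall(), angle brackets <> mark token boundaries, and any whitespace between the angle brackets is ignored.
#### For example, <a> <.*> <man> finds all instances of a XXX man. Putting <.*> in parentheses makes findall() return only the matched word (e.g. monied) rather than the whole phrase (e.g. a monied man)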

In [243]:
from nltk.corpus import gutenberg, nps_chat

moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> (<.*>) <man>")
# moby.findall(r"<a> <.*> <man>")

monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave


#### Three-word phrases ending in bro

In [195]:
chat = nltk.Text(nps_chat.words())
chat.findall(r"<.*> <.*> <bro>")

you rule bro; telling you bro; u twizted bro


#### Sequences of three or more words that start with the letter l

In [77]:
chat.findall(r"<l.*>{3,}")

lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la


### Search for phrases of the form "x and other ys"

In [220]:
from nltk.corpus import brown

hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")

speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals


## 3.6   Normalizing Text

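### Before processing we often convert text to lowercase with lower(), so that distinctions such as The versus the are ignored.
#### Going further and stripping affixes from words is called stemming. A further step, making sure that the resulting form is a known word in a dictionary, is called lemmatization.
#### We discuss the two in turn, after defining the data we will use: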

In [92]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)

## Stemmers

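### NLTK includes several off-the-shelf stemmers. When you need a stemmer, prefer one of these over writing your own with regular expressions, because NLTK's stemmers handle a wide range of irregular cases.
#### The Porter and Lancaster stemmers each follow their own rules for stripping affixes:
(Note that the Porter stemmer correctly handles lying -> lie, while the Lancaster stemmer does not.)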

In [93]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
[porter.stem(t) for t in tokens]

['denni',
 ':',
 'listen',
 ',',
 'strang',
 'women',
 'lie',
 'in',
 'pond',
 'distribut',
 'sword',
 'is',
 'no',
 'basi',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'suprem',
 'execut',
 'power',
 'deriv',
 'from',
 'a',
 'mandat',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcic',
 'aquat',
 'ceremoni',
 '.']

In [94]:
[lancaster.stem(t) for t in tokens]

['den',
 ':',
 'list',
 ',',
 'strange',
 'wom',
 'lying',
 'in',
 'pond',
 'distribut',
 'sword',
 'is',
 'no',
 'bas',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'suprem',
 'execut',
 'pow',
 'der',
 'from',
 'a',
 'mand',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'som',
 'farc',
 'aqu',
 'ceremony',
 '.']

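### If you are indexing texts and want to support searches over alternative forms of words, the Porter stemmer is a good choice:
#### (Object-oriented programming is beyond the scope of this book; string formatting and enumerate() are covered in detail in 3.9 and 4.2)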

In [231]:
class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()

### enumerate() example:

In [227]:
words = ['I', 'turned', 'off', 'the', 'spectroroute']
[ele for ele in enumerate(words)]

[(0, 'I'), (1, 'turned'), (2, 'off'), (3, 'the'), (4, 'spectroroute')]
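IndexedText uses enumerate() to map each stem to the positions where it occurs. A rough stdlib sketch of that mapping, assuming a plain defaultdict(list) in place of nltk.Index and lowercasing in place of a real stemmer; the word list is made up:

```python
from collections import defaultdict

words = ['The', 'cat', 'saw', 'the', 'dog']

# key -> list of positions where the key occurs
index = defaultdict(list)
for i, word in enumerate(words):
    index[word.lower()].append(i)

print(index['the'])
print(index['dog'])
```

Looking up a key then gives the offsets needed to slice out the left and right context, which is what the concordance() method does.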

In [232]:
porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
# print(nltk.corpus.webtext.fileids())
text = IndexedText(porter, grail)
text.concordance('lie')

r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !   
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well  
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which 
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t


## Lemmatization

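### The WordNet lemmatizer removes affixes only if the resulting word is in its dictionary. This additional checking makes the lemmatizer slower than the stemmers above.
#### Note that it does not handle lying, but it does convert women to woman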

In [245]:
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'woman',
 'lying',
 'in',
 'pond',
 'distributing',
 'sword',
 'is',
 'no',
 'basis',
 'for',
 'a',
 'system',
 'of',
 'government',
 '.',
 'Supreme',
 'executive',
 'power',
 'derives',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

## 3.7   Regular Expressions for Tokenizing Text
### NLTK already includes many tokenizers. Tokenization is the task of cutting a string into identifiable words.

## Simple Approaches to Tokenization
### The simplest way to tokenize text is to split on the space character ' ':

In [202]:
raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""

In [109]:
re.split(r' ', raw)

["'When",
 "I'M",
 'a',
 "Duchess,'",
 'she',
 'said',
 'to',
 'herself,',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone\nthough),',
 "'I",
 "won't",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL.',
 'Soup',
 'does',
 'very\nwell',
 'without--Maybe',
 "it's",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 "hot-tempered,'..."]

### Split not only on single spaces but on runs of one or more spaces, tabs (\t) and newlines (\n); this can also be written re.split(r'\s+', raw)

In [247]:
re.split(r'[ \t\n]+', raw)
# re.split(r'\s+', raw)

["'When",
 "I'M",
 'a',
 "Duchess,'",
 'she',
 'said',
 'to',
 'herself,',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though),',
 "'I",
 "won't",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL.',
 'Soup',
 'does',
 'very',
 'well',
 'without--Maybe',
 "it's",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 "hot-tempered,'..."]

### Splitting on whitespace produces tokens such as '(not' and 'herself,' with punctuation still attached.
### Another approach uses Python's character class \W, which for ASCII text is equivalent to r'[^A-Za-z0-9_]'.

In [249]:
re.split(r'\W+', raw)

['',
 'When',
 'I',
 'M',
 'a',
 'Duchess',
 'she',
 'said',
 'to',
 'herself',
 'not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though',
 'I',
 'won',
 't',
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL',
 'Soup',
 'does',
 'very',
 'well',
 'without',
 'Maybe',
 'it',
 's',
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 'hot',
 'tempered',
 '']

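### \w is equivalent to r'[A-Za-z0-9_]' for ASCII text. The pattern \w+|\S\w* first tries to match a run of word characters; failing that, it matches any non-whitespace character followed by zero or more word characters.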

In [114]:
re.findall(r'\w+|\S\w*', raw)

["'When",
 'I',
 "'M",
 'a',
 'Duchess',
 ',',
 "'",
 'she',
 'said',
 'to',
 'herself',
 ',',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though',
 ')',
 ',',
 "'I",
 'won',
 "'t",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL',
 '.',
 'Soup',
 'does',
 'very',
 'well',
 'without',
 '-',
 '-Maybe',
 'it',
 "'s",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 'hot',
 '-tempered',
 ',',
 "'",
 '.',
 '.',
 '.']

In [203]:
print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))

["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']


![Regular_Expression_Symbols.jpg](https://i.imgur.com/i7aAmQH.jpg)

## NLTK's Regular Expression Tokenizer
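### nltk.regexp_tokenize() is similar to re.findall() as used above for tokenization, but is more efficient for this task. Note that in NLTK 3 the groups in the pattern should be written as non-capturing, (?:...); otherwise the parts captured by the groups are returned instead of whole tokens.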

In [None]:
text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)          # set flag to allow verbose regexps
      (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*          # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                # ellipsis
    | [][.,;"'?():_`-]      # these are separate tokens; includes ], [
'''
nltk.regexp_tokenize(text, pattern)

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

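#### The verbose flag tells Python to strip whitespace and comments embedded in the pattern, so a space character must be matched with \s rather than a literal ' '.
#### regexp_tokenize() also has an optional parameter "gaps"; when it is set to True, the pattern describes the gaps between tokens, as with re.split().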
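
#### The effect of the verbose flag on whitespace can be checked with the standard re module alone. This is a minimal sketch with a made-up sample string: a literal space in a verbose pattern is ignored, while \s still matches.

```python
import re

sample = "poster print"  # hypothetical sample text

# In verbose mode the literal spaces in the pattern are ignored,
# so this pattern is effectively 'posterprint' and finds nothing.
no_match = re.search(r'(?x) poster print', sample)

# Writing the whitespace as \s makes it part of the pattern again.
match = re.search(r'(?x) poster \s print', sample)

print(no_match)        # None
print(match.group())   # poster print
```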

## Further Issues with Tokenization
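### Tokenization is harder than you might expect. No single solution works in every case, so we must decide what counts as a token according to the application domain.
### When developing a tokenizer, it helps to have raw text that has already been manually tokenized, so that the tokenizer's output can be compared with high-quality ("gold-standard") tokens.
### The NLTK corpus collection includes a data sample from the Penn Treebank, which contains:
#### the raw Wall Street Journal (WSJ) text: nltk.corpus.treebank_raw.raw()
#### the tokenized version: nltk.corpus.treebank.words()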
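
### One possible way to compare a tokenizer against gold-standard tokens is to align both token lists to character spans and count the spans they share. The sentence, the hand-written gold tokens and the helper token_spans() below are illustrative assumptions, not data or functions taken from the Penn Treebank sample or from NLTK.

```python
import re

def token_spans(text, tokens):
    """Locate each token in text, left to right, and return its (start, end) span."""
    spans, pos = set(), 0
    for tok in tokens:
        start = text.index(tok, pos)   # assumes every token occurs in the text
        pos = start + len(tok)
        spans.add((start, pos))
    return spans

# Hypothetical sentence and hand-made, Treebank-style gold tokens.
raw = "Mr. Smith didn't pay $12.40."
gold = ['Mr.', 'Smith', 'did', "n't", 'pay', '$', '12.40', '.']

# The simple tokenizer pattern discussed earlier in this section.
predicted = re.findall(r"\w+|\S\w*", raw)

gold_spans = token_spans(raw, gold)
pred_spans = token_spans(raw, predicted)
correct = gold_spans & pred_spans

precision = len(correct) / len(pred_spans)   # share of predicted tokens that are right
recall = len(correct) / len(gold_spans)      # share of gold tokens that were found
print(predicted)
print(len(correct), round(precision, 3), round(recall, 3))
```

### Only 'Smith', 'pay' and the final '.' agree here, which shows how much a simple pattern can differ from a gold standard.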

## 3.8   Segmentation
## Sentence Segmentation

### Compute the average number of words per sentence in the Brown Corpus:

In [138]:
len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())

20.250994070456922

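### In some cases the text is available only as a stream of characters, and it must be split into sentences before it can be tokenized.
### NLTK provides this through the Punkt sentence segmenter. The example below uses a passage from the novel The Man Who Was Thursday: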

In [260]:
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = nltk.sent_tokenize(text)
print(sents[79:89])

['"Nonsense!"', 'said Gregory, who was very rational when anyone else\nattempted paradox.', '"Why do all the clerks and navvies in the\nrailway trains look so sad and tired, so very sad and tired?', 'I will\ntell you.', 'It is because they know that the train is going right.', 'It\nis because they know that whatever place they have taken a ticket\nfor that place they will reach.', 'It is because after they have\npassed Sloane Square they know that the next station must be\nVictoria, and nothing but Victoria.', 'Oh, their wild rapture!', 'oh,\ntheir eyes like stars and their souls again in Eden, if the next\nstation were unaccountably Baker Street!"', '"It is you who are unpoetical," replied the poet Syme.']


## Word Segmentation

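### For some writing systems, tokenizing text is harder because word boundaries have no visual representation.
### For example, the three-character Chinese string 愛國人 can be segmented as 愛國 / 人 ("country-loving person") or as 愛 / 國人 ("love country-person").
### A similar problem arises in spoken language, where the hearer must segment a continuous speech stream into individual words. It becomes especially hard when the words are not known in advance.
#### In the following example the word boundaries have been removed from the sentences: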
#### a. doyouseethekitty
#### b. seethedoggy
#### c. doyoulikethekitty
#### d. likethedoggy

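### The first step is to find a way to represent a segmentation separately from the text: each character is annotated with 1 if a break follows it and 0 otherwise. Below, seg1 splits the text into sentences and seg2 splits it into words.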

In [271]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"

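### Note that the segmentation strings consist of 0s and 1s and are one character shorter than the text, because a text of length n can be split in only n-1 places.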
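
### The relationship between a word list and its 0/1 string can be sketched in the opposite direction as well. The helper words_to_segs() below is a hypothetical name introduced only for illustration, and it assumes every word is non-empty.

```python
def words_to_segs(words):
    """Build the 0/1 boundary string for a list of non-empty words."""
    bits = []
    for word in words:
        # '0' for every character except the last, which is followed by a break.
        bits.append('0' * (len(word) - 1) + '1')
    # Drop the final '1': no boundary is recorded after the last character.
    return ''.join(bits)[:-1]

words = ['do', 'you', 'see', 'the', 'kitty']
sample_text = ''.join(words)
sample_segs = words_to_segs(words)

print(sample_segs)                         # 010010010010000
print(len(sample_text), len(sample_segs))  # 16 15
```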

In [262]:
def segment(text, segs):
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(text[last:i+1])
            last = i+1
    words.append(text[last:])
    return words

In [272]:
segment(text, seg1)

['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']

In [264]:
segment(text, seg2)

['do',
 'you',
 'see',
 'the',
 'kitty',
 'see',
 'the',
 'doggy',
 'do',
 'you',
 'like',
 'the',
 'kitty',
 'like',
 'the',
 'doggy']

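### Computing the objective function: given a hypothetical segmentation of the source text, derive a lexicon and a derivation table from which the source text can be reconstructed. The score of the segmentation is the total number of characters used by the lexical items (each including a boundary marker) plus the size of the derivation table. Smaller scores are better.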

![Calculation_of_Objective_Function.jpg](https://i.imgur.com/AFfc4mH.jpg)

In [278]:
def evaluate(text, segs):
    words = segment(text, segs)
    text_size = len(words)
    lexicon_size = sum(len(word) + 1 for word in set(words))
    return text_size + lexicon_size

In [279]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"
seg3 = "0000100100000011001000000110000100010000001100010000001"
segment(text, seg3)

['doyou',
 'see',
 'thekitt',
 'y',
 'see',
 'thedogg',
 'y',
 'doyou',
 'like',
 'thekitt',
 'y',
 'like',
 'thedogg',
 'y']

In [156]:
evaluate(text, seg3)

47

In [157]:
evaluate(text, seg2)

48

In [158]:
evaluate(text, seg1)

64
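
### The three scores above can be split into their two components to see where the differences come from. This is a sketch only: score_breakdown() is a hypothetical helper that returns the two terms summed by evaluate(), and segment() is repeated so that the block runs on its own.

```python
def segment(text, segs):
    """Split text after every position where segs has a '1'."""
    words, last = [], 0
    for i, bit in enumerate(segs):
        if bit == '1':
            words.append(text[last:i + 1])
            last = i + 1
    words.append(text[last:])
    return words

def score_breakdown(text, segs):
    """Return (derivation table size, lexicon size) for a segmentation."""
    words = segment(text, segs)
    text_size = len(words)                                  # one entry per word token
    lexicon_size = sum(len(w) + 1 for w in set(words))      # +1 for the boundary marker
    return text_size, lexicon_size

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"
seg3 = "0000100100000011001000000110000100010000001100010000001"

for name, segs in [('seg1', seg1), ('seg2', seg2), ('seg3', seg3)]:
    print(name, score_breakdown(text, segs))
# seg1 (4, 60)
# seg2 (16, 32)
# seg3 (14, 33)
```

### seg3 beats seg2 by one point because it needs two fewer entries in the derivation table, while its lexicon is only one character larger.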

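### Search for the pattern of 0s and 1s that minimizes the objective function. The search below is non-deterministic and uses simulated annealing.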

In [309]:
from random import randint

def flip(segs, pos):
    # Toggle the bit at position pos ('0' <-> '1').
    return segs[:pos] + str(1-int(segs[pos])) + segs[pos+1:]

def flip_n(segs, n):
    # Toggle n randomly chosen positions (a position may be picked more than once).
    for i in range(n):
        segs = flip(segs, randint(0, len(segs)-1))
    return segs

def anneal(text, segs, iterations, cooling_rate):
    # Start hot: at first, about len(segs) bits are flipped per guess.
    temperature = float(len(segs))
    while temperature > 0.5:
        best_segs, best = segs, evaluate(text, segs)
        for i in range(iterations):
            guess = flip_n(segs, round(temperature))
            print(guess)  # debug output: prints every candidate segmentation
            score = evaluate(text, guess)
            if score < best:
                best, best_segs = score, guess
        # Keep the best candidate of this round, then cool down.
        score, segs = best, best_segs
        temperature = temperature / cooling_rate
#         print(evaluate(text, segs), segment(text, segs))
#     print()
    return segs
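
### The perturbation step can be tried in isolation. The sketch below repeats flip() and uses a variant of flip_n() that takes a seeded random.Random instance, an assumption made here only so that runs are repeatable; the notebook's own version calls randint directly.

```python
import random

def flip(segs, pos):
    # Toggle the bit at position pos ('0' <-> '1').
    return segs[:pos] + str(1 - int(segs[pos])) + segs[pos + 1:]

def flip_n(segs, n, rng):
    # Toggle n randomly chosen positions using the supplied generator.
    for _ in range(n):
        segs = flip(segs, rng.randint(0, len(segs) - 1))
    return segs

print(flip("0000", 1))   # 0100
print(flip("0100", 1))   # 0000

rng = random.Random(0)   # fixed seed, for repeatability only
start = "00000000"
mutated = flip_n(start, 3, rng)

# Three flips change either three bits, or one bit if a position repeats.
changed = sum(a != b for a, b in zip(start, mutated))
print(mutated, changed)
```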

In [310]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
anneal(text, seg1, 5000, 1.2)

1011111000100001010100111111100101101110010100000100011
0001000111100010100000000110010101110001001110011001010
1001010000011011100010010100001101100010100111000101101
[... output truncated: one candidate segmentation string is printed per guess ...]

0001001000000101010000000000001100100000000100100000000
0001111000010011110100000000001000111000001100010100000
0001011000010111011000001010011000100000000100101100101
1001001000000011110000000110110000010011000100010000000
0101000010001101110000000110110001100000001100100010000
1001001000000001110010000000011000110000000101100010111
0001101100001111100001000010011001101000000100100000000
0000000010000101010100000110011000000001000000100001110
1101101000000001110001000010001100100100000100010010100
0000001000001001110100001010001010000010000111101000000
0001000100100101100100000000010000100010100100100010000
0001011000100001100100010010011100100000100001000100000
0010011000110100110000000000001000100010000110100000000
0101001000010001110000100010111100100000000000101000000
0101011000001001110010101000011000100011000100000000100
1000000100001000110000000000101010100000001110100000100
0001001000100001010010000010011010001001011110101000000
110110100000000110000001001001100010000001011011

0001001000100001000000000010010000110000010001100110100
0101001000000001111000010010010000110000101000100000001
0000001001010001110000000010010001000000000000100001100
0101011000100001100010100000011000001010000010100000001
0101001100011001110000000010111100100100010110000000001
0001001000000001110000000010011000110100101000100001001
0001101011000001110000001010011000100010100100101000110
0001010000000001110000000000011000100011001100110010001
0001000000001000110000001000011001001000000001000000000
0001011001001001110100000010011001100000000110110000000
1101100000000001000000000011111000110010000000000100000
1110001000001001110000010010011000100000000100001110100
0011001000001010110101001010101000100000001100000000000
1011001000010001110010010110001000100000110000110100001
0001011000000010100100001110010001110000010100100000000
0001101000100001100101000100011010100000000100100010000
0001000001000001110000000010011000011010100100100010010
010100101010010101000000001001100010000000010110

0011001000001001110000010010101000110000000100100111110
0001001000010001110100000010110000100000001100110000100
1001011100000011101000010000011010100000000100100000000
1001001000000101100000100011101000110000000101100000000
0001011001100000110100000010011001101000110000100001000
0000000000000001110000011010011000100000001101110101011
0001110010100001100000000010011110100110000100100100000
0011001000001000111000000011011000110000000100010000010
0001100000000000010000000001111000110000000101100100010
0011000000010001110110000010111000100100000110110000000
1001000000000011010000000010011010110100010100100100000
0001011000000101110100000010010000100001100100000000011
0001001000000001011000000011111010100001000100101100001
0001001000000111010000000110011000100100100000100001010
0001101000010000100000000010011011110000010100100100000
0001001000000001110100101110001010100001100100100100000
1101001011100001110000000010011100000000100100100010000
000100100000000111000000101001000110001000011000

0101001000101001110000000000011001101001010100100000001
0001001000000001101000100000101000100000000110100010010
1111001000100001110011010110011000100000000101100011000
1001011000000011110000000010100000100100000111000000010
0001011000000011110000000010001100100001011110000001100
0011101000010001100000010010010000100000000100100001000
0111001000000001110000010000001001100000100100100001001
0101001000000000110000000000011101000000000110101000100
0001010000101001110001000011011000101001000100000000000
0010001000100101100100000011111000000000000100100000000
0001011100101001110000101010001000100110100101100000000
0000100000100001111000000110011000100000101100100000010
0111000000000101111000100010000000100010000100110000100
0001011000010001010010100111011000100001010100110000001
0001001000000001110010000010010110100001000100110001000
0001000000000011111010000010011001100001001100110000010
0001000001000011111000000011001101100000000100110000000
101111100001000111000000001001100011100010011010

0000101000000001010001000110011000100001000000110000000
0000101100110001010101100010011100111001000100100000001
0000111001010111010000100010001001100101010100100000000
0000101000011000010001000011011010100001000100100100100
1010101000010001010000000011011100101001000000100100110
0000111000010101010001010010001000100001000100100010001
0000101101010001110000110110011000100001000000100000000
0100101010010001011100000010111100000001000101110000000
0000001001111000010100000011011100100001001100100000000
0000101000010001000010000010011100101001000110100000011
0000101000010001011000000010011010101101000100100101001
0100100000110001010000000010011000101000011100100100001
0110100010011001110000000110011010100001000110100000000
0000101000110000011000000110011000101000000100100001000
0000101100010011110000000111010000010001000100101000000
0000101000010000110000000100010100100001000100100000010
0010101001111001010110000000011000100001000110100100000
000010100001101101000000001001001010001100110110

0000100000000101010001110011011010100001000100110000000
0000110001011001000010000010011010100001000100000100000
0010101000010001100100010000011000100001010100100000000
1000101000010001000000010010011000100001001100100011010
0000100001010101010000101001011000100101000100100100000
0000100100010110000000000010011000110000000100100000010
0110110000010001010000010010011000100000000100100010000
1010111000010101010000000010111000100011101100100010000
1100101000010001011010000000011000100001000110111010000
0000101000000001000000001010011000000011000100110001000
1000001000010000010000100110011001000000000100100010000
0000100000011101011010000011010000101001000100100000100
0001100000010011010100000011000000100000000000100000000
0001101000010011010010000010011000100001100100110010001
0010101000010000011000010010011000100001010101100000001
0000101000010101000000000010011100101001000000100000000
0110001000010000011000100011011000100001000100100000000
000010110001000001000000011001110011000110010010

0000101000010101011000000010011000000001100100110100100
0001101001010101010101010010011000100000000100101000100
0000101001010101000001000010110000100001001100100000000
0010111010010001000000000010011000111001000100100000010
1000101000010000011000000010010100100000000100100010000
0000101000101001111000100010010000100001000100100001001
0000111010000011110010000000011000110001000100100010000
0001101010011001110000000010011100001000000100100100000
0000110000010001010000001010111000100001001001100010010
0000101000001001010000000100111000101001000100110000000
1000101100010001000001000000110000100001010100101000000
0000101000000001110000010011100000100001000101101000000
0000100000110101010100000110011000000001000100100100000
0000101001010101011000001011011001110001100100000000000
0000101010010001010010000110101100101001100100100000100
0000101001010101010000011011011000100001001101010000000
0010101000010001000000000010100001100101000100100000000
000010000001010101010000101001100010010110010110

1111101100000001010000000011011100100001000010100000000
0011001000010001010000001011011100110001000000100001100
1001101000111001010000010010011100100001001100100000001
1011101000010001011000000011010100101000001100100000000
1011100001010001010000000010111100101001000100110000100
0011101000110001010000000011011100100001100100000001000
1011101100010001010000001001011100100001010100100100000
1011001000110001010000000011011101100001100100100001000
1011101000000011010010000011001100100001000000101010000
1101101000010001110001000011011100100011000000100000010
1111101010110001010001000011111110100001000100100000100
1011101000010001010100000011001100000001010100101000000
1011101100011000010000001011011100110011000100100000010
1011101000010001000001000011011100101000001000100001000
1011101000000101010000000111011110100001000000110000100
1011101100010001010100000011011110100001110100100000000
1001101000010001010000001011111100100001001100100100000
100100100011000101000000001101010011000100010010

1011011000010001010000000111011000100011001110100000000
1011101000010011010000000001001101100001001100100100010
1011001000010001010000101111011100100001000000000010000
1001101001010001010000010011010100110000000100100000100
1111110000010001010000000011111101100001011100100000000
0010111000010001010001000011011100100001000110100101000
1011101000010101010000100011011000100011000100101000000
1011101101000001010001000001011101100001001100100000000
1111101100010011010000010011011100100001000001100000100
1111101000010000010010000011011100100001001100101000000
1011101010010111010000000011011101101011000100100100000
1011101001011001010000100011011100100101001000100100000
1011101000110101010000000011111100100001000101100010000
1011101000110011000000000011011000100000000100101100000
1011101110010001000100000001011110100001001100100000000
1010110001010001010000000011011100100101000100100000000
1101101010010000011001000011011100100001001100100000000
001110101000001111000000001101110010000100010010

1011101000010011010010000010011100100011000000100000010
1111101010010001000000010011000100100001000100100000000
1011101000010101010000000111111100100001000100101000000
1011101100010001010001010011111100100001000100110001000
1011101010010001010000001011011100100001000100100000000
1001001000010001010000000011011100100001001100101000000
1011111000000000000010000011011100100001000100110000000
1001101000010001110000000011011000100001000100100001101
1011001000010001010000001011011100000011010100100100000
1010101000011101010100010011011100101001000100100000000
1001100000010001010000000010010100100000000100110000000
1011101000110000010000000011111100100101000100100000000
1011101010010001011010000011010000100001000100100001000
0011101000010001110000000011010100111001000100100000010
1011101000000001010100000001011100100101000101100100000
1010101000011001010001000011011100110001000000100000001
1011101000010001000000000001011100110001110101100000000
101110100001000101100000011101110011000010010010

1111001000010001010000001011110110100001000100100000000
0001011000010001010000000011010100100011000100100000000
1010001000110011010000100011011100100001000100100000010
1010101000010001010100000011011100101101000100110100000
1011101000010111010000000011011110100001101100100000010
1011001000010001000000000011111100000001010100110000000
1011101010010010010000000001011100000001000100000000000
1011001000000001011000000011010100100001000100100000101
1011101000010001110000001011011100100001010100100000100
0011111001010001000000000011011100100001000100110000100
1011101000010000110000000010011100100010000100100000010
1011101000010001010000000010010100100001000101100100000
1011101000010101010000100111001101100001000100101000000
1111100000010001110001000011011100100001000100100000000
1011101000010001010010000011011100100001010101000000000
1011101000010000010101000011011100100001000100110000000
1011101000110001010000000000011100101001000000100010000
101110100101010101000000001101110010000100001011

1011100000000001110000000011011100100001000000100010000
1010101010010001010000000011011100100001000000100000000
1011101000010001010100000010011010100001000100110000000
1011101000010001010000000001010100100101110100100000000
0111100000010001010000000011011101100000000100100000000
1011101000010001111000000111011100100001000100100000000
1011101000010001010000010011011101100001000110100000000
1010101000010001010000000010011100100011001100100010000
1011101000010101010000100011001100100011000110100000000
1011001000110101000000000011011100100101000100100000000
1011101000010001010000011011011110100000001100100000000
1011111011010000010000100011011100100001000100100000000
1011011000010001010000000011011000100001010100000000000
1011101000010001010000100111011101100001000101100000001
1011101000010001010100000011011100100001000100101000010
1011101000000001010000101011011100100001000110100000001
1011101000010001010100000010111100100011100100100000000
001110100001001101000000101101010010000100010010

1011101000010001010000000011011000100101000110000000100
1011101001010011110000000011010100100001000100000000000
1001101001010001010000000001011100100001100100100001000
1111101000000001010000001011011100100001000101100000100
1011101010010011011000000011011100100101000100101000000
1001101000010001110000000011011100100001000000100100001
1011101100010001010000000011011101100101010100100010000
1011100000010111010010000011011100100001000100100010000
1111101000000001011000000011011100100001001100100000100
1011101000010001010000000001001000110001000100100001000
1011111010010001010000000010011100000000000100100000000
1011101001010001010000000011111101101001000101100000000
1011101000010001011000000011011100001000000100100000010
1001101000010001010100000010010100000001000100100000000
1001101000010101010000000011011100100001000100111010000
1011101000010001000110000011011100100101000100101000000
1011101000010011011000001011001100100001000100100010000
101110100001001101000001001101110010000000011010

1111101000010001110010000011001100100101000100100000000
1010101010010001010100000111011100100101000100100000000
1011101001011001010000000011001101100001000110100000000
0011101000000001011000000010011100100001000100100000100
1011101000010011010000000011111100100001001100100000000
1011101000010000010000001001011100100001100100100100000
1011101000010001010000000011011100101010000100100000101
1111001000000001010000001011011000100001000100100000000
1011101000010001000000000111011000000001001100100000000
0111101110010001010000100011011100100001000100100000000
1011101000010011010100010011011100101001000100100000001
0011001100000001010000000011011100100001100100100000000
1011100010010011010000000011011100100001000000100001000
1011101000011001010000010011011100110001000100000001000
1010101000010001110000100011011100100101000100000000000
1011101100010001011000000011011100000001001100000000000
1011101000010111000000000011111110100001000100100000000
101110100001000101010000001101100010001100110011

1011100000010001010000010011111100000001000100100000000
1011101100010001010000000011011100100001000000000001000
1011101000010001010000000011001100101001000110100001000
1011101000010001010100000010011100100001000100000100000
1111101000010001010000001011011100100100000100100000000
1011101000010001010000001011010100100001000100101001000
1011101000010001010000011011011100100000000100100000001
1011111010010001010000010011011110100001000100100000000
1011101000000001011000000010011100100001000100100010000
1011101001010001010100000011011110100011000100100000000
1011101000010001010100000011011100100001000000100000101
1011100000010001011000000011010110100001000100100000000
0011101000010101010000010011001100100001000100100000000
1001001100010001010000000011011100100000000100100000000
1011101000010001010000000001010100100001000100100000110
1010101000110001010000000011011100000001000100100000010
1011101000010011010000001111011100100001000100100010000
101110000001010101000001001101110010000100110010

1011101001010001010000001011011111100001000100100000000
1011101000010001010000000011011101100001000101101010000
1011101000010001010100000011111100100001110100100000000
1001101000010001010001000011011100100101000100100001000
1010101000010011010000000011011000100001000100100001000
1111101000010000010010000011011100100011000100100000000
1111101000010001010001000011011100100001010000100000000
1011001000011001010000000011011110100001000100100100000
0011101000010001010100000011011100100001001100100010000
1011001000011001010010000010011100100001000100100000000
0011101000011001010000000011011110100000000100100000000
1011101000010001010000000010011000100101100100100000000
1011101000010001010000000011011100100000000100100010000
1001101001011001010000000011011100100001000000100000000
1011101001110001010000000011011100100001000100100000000
1011101000011001010000100011001100100001010100100000000
1011101010010001010000000011011100100001100100100000000
101110100001000101000000010101110010000100010010

1011101001000001010000000111011000100000000100100000010
1011101000010100010000100011011100100000000100100000000
1011101000001001010000000011011100101000000100101000010
1011101010000011010010000011011100100000000100100000001
0011101000000001010010000011111100100000000100100100000
1011101000000011010000000011011100100000000100110000000
1011001000000001010010000010011100100000000000100000000
1111101000000001010000100011011100100001000100100000010
1011101000010001010000010011011000100000000100100000010
1011101001000000010000000011111100100000000110100000000
1011101010010001010000000011011100100000000100100011000
1011101000000000011000000011111100100001000100100000000
1011101000000001010000000011001100100000000100110100010
1011101000000101011000010011011100110000000100100000000
1011101000000001010000010011010100100000000100110000001
1011101100000001110000000011011100100001000100100100000
1011101000000001011000100111111100100000000100100000000
111110100000000101000000001101110010000001010010

1011101000000001010000000010011100100100000100110100000
1011101000000001010000001011001100100000000010100000000
1011101000000001010100000001011100100000000100000010000
1111101000000001011000100010011100100000000100100000000
1011101000000001010000000011011100100000000100101001000
1011101000000001010000001011111100100000000110100001000
1010101000000011010000000011011100100000000100100000000
1010101000000000010100000011011100100000100100100000000
1011101000000001010000000010011101110000010100100000000
0011101000000001010000000011011100100010000100100010001
1011100000000001010000000011011100101000000100100000000
1011101010000001010000000111011100100000010110100000000
1011101001001001010000000011011100100001000100100000100
1011101000000001010000000111011100100001000110100000100
1011101000000001010000010010011101100000000100100000010
1010001000000001010000000011011101100100000100100000000
1011100000000101000000000011011100100000000100110000000
101110100000010101000000001101110000000000010010

1010101000000001010000000011010100101000000100000100000
1010101010000001010000000011010100100000000100100110000
0010101000010001010001000011010100100000000100100000000
1010101000000001010000000111010101100000001100100000000
1010101010000001010000100011010100100000100100100000000
1010001000000001010000000001010100100000000100100000100
1010101000101001010000001011010100100000000100100000000
1011101000010001010000000011010100100000000110100000000
1011101001000001010000000011010100100000000100100000010
1010001000001001010000000011010100100000000100100010000
1010101000100001110000000011010110100000000100100000000
0010101100000001010000000011010100100000000100110000000
1010101000000011010000100011010100100000000100100000100
1000101000000001010000000011010100100100000100100000100
1010101000000001110000000011010100100000000101100001000
1010101000001001010000000011010100100000000100110010000
1010101000000001010100000011011100100000000100100000100
101010100000000101000000001101011010000010010110

1010101000000001010000000011010100110000000000100001000
1011101000001001010000010011010100100000000100100000000
1010101000000001010000100011010100100000000100110000100
1010111000000001010000000011010100100000010100000000000
1010101010000001010000000010010100100000000110100000000
1010101000000001110000010011010101100000000100100000000
1010101000000001010000000011110100100000001110100000000
1010101000000001010000000011010110100000010100000000000
1010101000000001010100000010010100100000000000100000000
1000101000000001010000000010010100100000000100100000001
1010101000000001010000000011011100100000001100100000010
1010101000000001010000000111010100000000000100110000000
1010101000000011010000000011010100100000010100100000100
1010100000000001010000000001010100100000000100100100000
0010101000000101010000000011011100100000000100100000000
1010001000100001010010000011010100100000000100100000000
1010101100001001000000000011010100100000000100100000000
101010100000000101000100001101010010110000010010

1011101000000001010000000011010100100000000100100000000
0010101000000001010000000011010100100010000100100000100
1000110000000001010000000011010100100000000100100000000
1010101000000001010000000011010100100100000010100000000
0010101000001001010000000011010100100000000100101000000
1010101000000001000000000011010100000000000100100001000
1110101000000001010000000011010100100000100100100001000
1110101000000001010000001011010100100000000100000000000
1010101000000001010000000011010100000000000100100101000
1010111000000001010100000011010101100000000100100000000
1010101000000101010000110011010100100000000100100000000
1010101000000001010000000011010100100100100100110000000
1010101000010001010000000011010100100010100100100000000
1010101000000001010100000011000100100000000100100100000
1010111000000001010000000011011101100000000100100000000
1110101100000001000000000011010100100000000100100000000
1010101000000001010000000011010101100000000100100101000
100010100000000100000000001111010010000000010010

1010101000000001110000010011010100100000000100100000000
1010101010000001010000000011010100110000000100100000000
1011101000000001010000000011010100100000000100110000000
1010101000000001010000000111010100100100000100100000000
1010101000000001010000000011000100100000000110100000000
1010101000000001010000000011010100100000000100100000000
0010101000000101010000000011010100100000000100100000000
1110101000000001010000000011010100100000100100100000000
1010101000000001010000000011010101100000000100110000000
1010101000000001010000000011010100010000000100100000000
1010101000000000000000000011010100100000000100100000000
1010101000000001010000000011010100100000000100110001000
1110100000000001010000000011010100100000000100100000000
1010101000000001010000000011010100100001000100100000001
1110101000000001010000000011010100100000100100100000000
1010101000000001010000000111010000100000000100100000000
1011101000001001010000000011010100100000000100100000000
101010100000000101000000001111010010000000110010

0010101000000001010000000011010100100000000110100000000
1010101000000001010100000011010100100000000100000000000
1010101000000001010000000011010100100010000100100000001
1010101000000001010010000011010100100100000100100000000
1010101000000001010000000011110100100000000110100000000
1010101000000001010100000011010100100000000100100001000
1010101000000001000000000011010101100000000100100000000
1010101000011001010000000011010100100000000100100000000
1010101000000001000000000011010100100000000100100000100
1000101000000001010000000011010100100000010100100000000
1011101000000001010000100011010100100000000100100000000
1010101000000001010100000011010100000000000100100000000
1010101000000001010000000011010101100000001100100000000
1010101010000001010000000001010100100000000100100000000
1010101010000001010000000011010100100000000100101000000
1010101000000001010000000011010100100100000100100000100
1010101000000001010000000011000100100000000000100000000
101010100000001101000000001101010010000000010010

1000101000000001000000000011100100100000000100100000000
1000101000001001110000000011000100100000000100100000000
1000101000100001010000000011000100100000001100100000000
1000101000000001010000000011000100101000001100100000000
0000101000000001010000000011000000100000000100100000000
1000101000000001010000000011000000100000000100100001000
1000101000000001010000000010000100100000001100100000000
1000101000000001010000100011001100100000000100100000000
1100101000000001010000000011000100100000001100100000000
1000101000000001010000000111000100100010000100100000000
1000101000000101000000000011000100100000000100100000000
1000101000000001010000000010100100100000000100100000000
1000100000001001010000000011000100100000000100100000000
1000101000000001010010000011000100100000000100000000000
1000101000000001010000000011001100100000000100101000000
1000101100000000010000000011000100100000000100100000000
1000101100000001010000000011000100100001000100100000000
100010100000000101000000001100010010000001010010

1000101000000011010010000011000100100000000100100000000
1000101000000001010001000111000100100000000100100000000
1000101000000001010000000011000110100000000100100100000
1000101010000001011000000011000100100000000100100000000
1001101000000001010000000011000100100000000100100100000
1000101100000001010000000011000100100000000100100100000
0001101000000001010000000011000100100000000100100000000
1000101000010001010000000011000100100000000100100000100
1000101000000001010000000011000000100000000100110000000
1000101000000101010000000111000100100000000100100000000
1000101000000001010000000011000100100010000110100000000
1000101000000001010000000011000100100000000101100000001
1000111000000001010000000011000100100000000110100000000
1000101000000001010010000011001100100000000100100000000
1001101000000001010000000011001100100000000100100000000
1000101000100001010000000011000100100000100100100000000
1000101000001001010000010011000100100000000100100000000
100010100001000101100000001100010010000000010010

0000101000100000010000000010000100100000000100100000000
0000101000000001010000000010000100100000000101100000001
0000101000000001110000000010100100100000000100100000000
0000101000110001010000000010000100100000000100100000000
0000101000100001010000000010000100100100000100100000000
0000111000000001010000000010000100100000000110100000000
0010101000000001010000000010100100100000000100100000000
0000101000000001010000000010000100000000000100100000010
0000101000000001110000000010000100100000000100110000000
0000101000000001000000000010000100100000001100100000000
0000101100000001010000000010100100100000000100100000000
0000101000000001010000000110000100101000000100100000000
0000101000000001010000000010000100100000000100000000010
0000101000000001010000000010000100100000000110100000001
0000101000000001010000000010000100100100000100100000010
0000101000000001110001000010000100100000000100100000000
0000101000010001011000000010000100100000000100100000000
000010100000000101000000101000010010000100010010

0000101000000001010000000011001100100000000100100000000
0000101000000001011000000011000100100000000100100000000
0000101000000001011000000010000100100000000100100010000
0000101000000001010000000010000101100000000100000000000
0000101000000001010010000010000100100000100100100000000
0000100000000001010000000010000100100000000100100100000
0001001000000001010000000010000100100000000100100000000
0000101000000001010000000010100100100000000100100001000
0000101000000001010000000010000100100000000100101010000
0000101000000001010000000010000100100000001100100010000
0000101000000000010000000010000100110000000100100000000
0000101000000101010000000010000100100000000100100010000
0000101000000001010000000010000100100100000100100000001
0000101000000001010000000010000100110000000100101000000
0000101000000001010000000010000100100000000110100100000
0000101000000001010000000010000100100000000100100000000
0000101000000001011000000110000100100000000100100000000
000010101000000101000000001000010010000000011010

0000101000000001011000000010000100100000000100100000000
0000101000000001010000000010000100100000010100100000000
0000101000000001010000000010000100100000000100110000000
0000001000000001010000000010000100100000000100100000000
0000101000000000010000000010000100100000000100100000000
0001101000000001010000000010000100100000000100100000000
0000101000000001010000000010000101100000000100100000000
0000101000000001010000000010000100100000001100100000000
0000101000000001010000000010000100100100000100100000000
0000101010000001010000000010000100100000000100100000000
0000101000000001010000000010000100100001000100100000000
0001101000000001010000000010000100100000000100100000000
0000101000000001010000000010000100100001000100100000000
0000101000000001010000000010000100100000000101100000000
0000101000000001010001000010000100100000000100100000000
0000101000000000010000000010000100100000000100100000000
0000101000000001010000000110000100100000000100100000000
000010100000000101000000001000010010000000010010

0000101000000001010000000000000100100000000100100000000
1000101000000001010000000010000100100000000100100000000
0000101000000001010000010010000100100000000100100000000
0000101000000001010000000010000100100000000100100100000
0000101000000001010000001010000100100000000100100000000
0000101001000001010000000010000100100000000100100000000
0000101000000001010000000010000100100000000100101000000
0000101000000001010000000010000100100001000100100000000
0000101000000001010000000010000100100000000000100000000
0000111000000001010000000010000100100000000100100000000
0000101000000001010000000010000100100000000110100000000
0000101000000001010000000010000100100000100100100000000
0000101000000001010100000010000100100000000100100000000
0000101000001001010000000010000100100000000100100000000
0000101000010001010000000010000100100000000100100000000
0000100000000001010000000010000100100000000100100000000
0000101000001001010000000010000100100000000100100000000
000010100000000101000000001000010010100000010010

0000101000000001010000000010000100100000000100100010000
0000101000000001010000000010000100100000000100100001000
0000101000000001010010000010000100100000000100100000000
0000101000000000010000000010000100100000000100100000000
0000101000000001011000000010000100100000000100100000000
0001101000000001010000000010000100100000000100100000000
1000101000000001010000000010000100100000000100100000000
0000101010000001010000000010000100100000000100100000000
0000101000000001010000000010000100100000001100100000000
0000101100000001010000000010000100100000000100100000000
0000101000001001010000000010000100100000000100100000000
0000101001000001010000000010000100100000000100100000000
0000101000000101010000000010000100100000000100100000000
0000101000000001010000000110000100100000000100100000000
0000101000000001010000000010000100100000000100100001000
0000101000000001010000000000000100100000000100100000000
0001101000000001010000000010000100100000000100100000000
000010100000000101000000001000010011000000010010

0000101000000001010000100010000100100000000100100000000
0010101000000001010000000010000100100000000100100000000
0000101000000001010000000010000100100000000100100000010
0000101000000001010001000010000100100000000100100000000
0000101000000001010000000010000100100000000100100000010
0000101000000001010100000010000100100000000100100000000
0000101000000001010000000010000100100000000100110000000
0000101000000001010010000010000100100000000100100000000
0000101000000001010000000010000100100000000100101000000
0000101000000001010000000000000100100000000100100000000
0000101000000001110000000010000100100000000100100000000
0000101000000001010000000010000100100010000100100000000
0000101100000001010000000010000100100000000100100000000
0000101000000001010001000010000100100000000100100000000
0000101000000001010000000010100100100000000100100000000
0000101000000001010000000010000100000000000100100000000
0000101000000001010000100010000100100000000100100000000
000010100000000101000000001000010010000100010010

0000001000000001010000000010000100100000000100100000000
0000101000000001011000000010000100100000000100100000000
0000101000000001010000000010000100100000000100100000001
0000101000000001010001000010000100100000000100100000000
0000101000000001010000000010000100100000000100110000000
0000101000000000010000000010000100100000000100100000000
0000101000000001010000000010000100100100000100100000000
0000101000000001010000000010000101100000000100100000000
0000101000000001010000000010000100100000100100100000000
0000101000000001010000000010000100100000000100100100000
0100101000000001010000000010000100100000000100100000000
0000100000000001010000000010000100100000000100100000000
0000111000000001010000000010000100100000000100100000000
0000101100000001010000000010000100100000000100100000000
0000101000000001000000000010000100100000000100100000000
0000101000000001010000000010000110100000000100100000000
0000101000000101010000000010000100100000000100100000000
000010100000000101000000001000010010010000010010

0000101000000001010000000010000100100000000100100001000
0000101000000001010000000010000100101000000100100000000
0000101000000101010000000010000100100000000100100000000
0000101000000001010000000010000100100001000100100000000
0000101000000011010000000010000100100000000100100000000
0000101000000001010000000010000100110000000100100000000
0000101000000001010000000010000100100000010100100000000
0000101000000001010000000010000100100000000100100010000
0000101000000001010000000010000100100000000100100001000
0000101000000001010000000010100100100000000100100000000
0000101000000001010000000010000100100000000100000000000
0000101000000101010000000010000100100000000100100000000
0000101000000001010000000010000100100000000100101000000
0000101000000001010010000010000100100000000100100000000
0000101001000001010000000010000100100000000100100000000
0000101000000001010000000000000100100000000100100000000
0000101000000001010000000010000100100000000100100000100
000010100000000101000000001000010010010000010010

0000101000000001010000000010000100100000100100100000000
0000101000000001010000000010000100000000000100100000000
0000101000000001010000000010000100100000010100100000000
0000101000000001010000000010000101100000000100100000000
0000101000000001010000000010000100100000000110100000000
0000001000000001010000000010000100100000000100100000000
0000101000000001010100000010000100100000000100100000000
0000101000001001010000000010000100100000000100100000000
0000101000000001010000000010000100100100000100100000000
0000101000000001010000000010000100100000001100100000000
0000111000000001010000000010000100100000000100100000000
0000101000000101010000000010000100100000000100100000000
0000101000000001010000000000000100100000000100100000000
0000101000000001010000000010000100100000100100100000000
0000101000000001010000000010000100100000000100100100000
0001101000000001010000000010000100100000000100100000000
0000101000001001010000000010000100100000000100100000000
000010100000000101000000001000010010000000000010

0000101000000001010000000010000100100000100100100000000
0000101000000001010000000010000100100000000100000000000
0000111000000001010000000010000100100000000100100000000
0000101000000001010000000010000110100000000100100000000
0000101000000001010000000010000100000000000100100000000
0000101000000001010000000011000100100000000100100000000
0000101000000001010100000010000100100000000100100000000
0001101000000001010000000010000100100000000100100000000
0000101000010001010000000010000100100000000100100000000
0000101000000001010000100010000100100000000100100000000
0000101000000001010010000010000100100000000100100000000
0000101000000001010000000010000100100000000100100000001
0000101000000001010010000010000100100000000100100000000
0000111000000001010000000010000100100000000100100000000
0000101000100001010000000010000100100000000100100000000
0000101000000001010000000010000100100001000100100000000
0000101000000001110000000010000100100000000100100000000
000010100000000101000000001000010010000000010010

0000101000000001010000000010000101100000000100100000000
0000101000000001010000000010000100100000000100000000000
0000101000000001010100000010000100100000000100100000000
0000101000000001011000000010000100100000000100100000000
0000101000000001010000000010000100100001000100100000000
0000101000000001010000000010000100100000000100100001000
0000100000000001010000000010000100100000000100100000000
0000101000000101010000000010000100100000000100100000000
0000101000000001010000000010000100100000000100100000010
0000101000000001010000000010000101100000000100100000000
0000101000000001010000000010100100100000000100100000000
0000101000000001010000000010000100000000000100100000000
0000101000000001010000000011000100100000000100100000000
0000101000010001010000000010000100100000000100100000000
0000101000000001010000000010000100110000000100100000000
0000101000000001010000000010000100100000000000100000000
0001101000000001010000000010000100100000000100100000000
000010100000000101000000001000010011000000010010

0000101001000001010000000010000100100000000100100000000
0000101000000001010000000010000100100000000100100001000
0000101000000001010010000010000100100000000100100000000
0000101000000001010000000010000100100000000100100010000
0100101000000001010000000010000100100000000100100000000
0000101000000001010000100010000100100000000100100000000
0000111000000001010000000010000100100000000100100000000
0000101000000001010000000010001100100000000100100000000
0000101000000001011000000010000100100000000100100000000
0000101000000011010000000010000100100000000100100000000
0000101000000001010000000010000100100000000100100000001
0000100000000001010000000010000100100000000100100000000
0010101000000001010000000010000100100000000100100000000
0000101000000001010000000010000100100000001100100000000
0000101000000001010000000010000100100000001100100000000
0000101000000001010000000010000100100001000100100000000
0000101000000001010000000010000100100000000100100000010
000010100000000101100000001000010010000000010010

0000101000000001010000000010000100110000000100100000000
0000101000000001010000000010000100100000001100100000000
0000101001000001010000000010000100100000000100100000000
0000101000000001010000000010000100100000000100110000000
0000101000000001010000000110000100100000000100100000000
0000101000000001010000000010000100100000000100100100000
0000101000000001011000000010000100100000000100100000000
0000101000000001010000000010000100100000000100110000000
0000101100000001010000000010000100100000000100100000000
0000101000000001010000000010000100100100000100100000000
0000101000000001010000000010001100100000000100100000000
0000101000000001010000000010000100100000000100100010000
0000101000000001000000000010000100100000000100100000000
0000101000001001010000000010000100100000000100100000000
0000101000000001010000010010000100100000000100100000000
0000101000000001010000000010000100100000000100000000000
0000101000000001010000000010000100100000100100100000000
000110100000000101000000001000010010000000010010

0000101000000001010000000010000100100000000100100000001
0000101000000001010000000010000100100000000100100000010
0000101000000001010000000010000100100000000100100010000
0000101000001001010000000010000100100000000100100000000
0000101000000001010000000010000100100010000100100000000
0000101000000001010000000011000100100000000100100000000
0000101000000001010000100010000100100000000100100000000
0000101000000001010000000010000100100010000100100000000
0000101000000001010000000010000100100000000100100100000
0000101000000001010000000010000100100000000100100100000
0000101000000001010000000010000100100000000100000000000
0000101000000001010000010010000100100000000100100000000
0000101000000001010000000010000100100100000100100000000
0001101000000001010000000010000100100000000100100000000
0000101000000001010000000011000100100000000100100000000
0000101000000001010100000010000100100000000100100000000
0000101000000001010000000010000100000000000100100000000
000010100000000101000000001000010010000000010010

'0000101000000001010000000010000100100000000100100000000'

#### With enough data, text can be automatically segmented into words with reasonable accuracy.

## 3.9   Formatting: From Lists to Strings

## From Lists to Strings
### A list of words is the simplest structured object we use for text processing. To output it, we must convert the list into a string, which in Python is done with join().

In [206]:
silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']
' '.join(silly)

'We called him Tortoise because he taught us .'

In [207]:
';'.join(silly)

'We;called;him;Tortoise;because;he;taught;us;.'

In [208]:
''.join(silly)

'WecalledhimTortoisebecausehetaughtus.'
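A small sketch of two related points, assuming nothing beyond the standard library: join() only accepts strings, so other values (the hypothetical `counts` list here) must be converted first, and split() reverses a join as long as the separator does not occur inside any item.

```python
# join() raises TypeError on non-string items, so convert them first.
counts = [3, 4, 1]  # hypothetical example data
joined = ' '.join(str(n) for n in counts)
print(joined)

# split() undoes join() when the separator never appears inside an item.
silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']
round_trip = ' '.join(silly).split(' ')
print(round_trip == silly)
```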

## Strings and Formats
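### There are two ways to display the contents of an object:
### 1. The print command produces the most human-readable form of the object.
### 2. Naming the variable at a prompt shows a string that can be used to recreate the object.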

In [212]:
word = 'cat'
sentence = """hello
world"""

In [210]:
print(word)

cat


In [213]:
print(sentence)

hello
world


In [214]:
word

'cat'

In [215]:
sentence

'hello\nworld'
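The two displays above correspond to the built-in functions str() and repr(). The following is a minimal sketch of that correspondence; the variable names are illustrative only.

```python
import ast

sentence = """hello
world"""

as_printed = str(sentence)   # the text that print() writes
as_echoed = repr(sentence)   # the text shown when the variable is named at a prompt

print(as_printed)
print(as_echoed)

# The repr of a string is a valid Python literal, so it can rebuild the object.
rebuilt = ast.literal_eval(as_echoed)
print(rebuilt == sentence)
```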

### Formatted output typically combines variables and pre-specified strings. Given a frequency distribution fdist, we could do:

In [216]:
fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
for word in sorted(fdist):
    print(word, '->', fdist[word], end='; ')

cat -> 3; dog -> 4; snake -> 1; 

### Mixing variables and constants like this is hard to read and maintain. A better approach is string formatting:

In [217]:
for word in sorted(fdist):
    print('{}->{};'.format(word, fdist[word]), end=' ')

cat->3; dog->4; snake->1; 

In [220]:
'{}->{};'.format('cat', 3)

'cat->3;'

### Breaking the expression above into pieces

In [221]:
'{}->'.format('cat')

'cat->'

In [222]:
'{}'.format(3)

'3'

In [223]:
'I want a {} right now'.format('coffee')

'I want a coffee right now'

### We can have any number of placeholders, but the str.format method must be called with at least as many arguments as there are placeholders.

In [277]:
'{} wants a {} {}'.format('Lee', 'sandwich', 'for lunch')

'Lee wants a sandwich for lunch'

In [225]:
'{} wants a {} {}'.format('sandwich', 'for lunch')

IndexError: tuple index out of range

In [228]:
'{} wants a {}'.format('Lee', 'sandwich', 'for lunch')

'Lee wants a sandwich'

#### format() consumes its arguments from left to right; any extra arguments are ignored.

### 'from {} to {}' is equivalent to 'from {0} to {1}', but we can use the numbers to change the default order:

In [229]:
'from {1} to {0}'.format('A', 'B')

'from B to A'
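Besides numeric indices, str.format also accepts named placeholders, and Python 3.6+ offers f-strings that read names from the surrounding scope. A brief sketch, with `start` and `end` as illustrative variable names:

```python
start, end = 'A', 'B'

# Named placeholders make the order explicit without counting positions.
by_name = 'from {end} to {start}'.format(start=start, end=end)

# An f-string (Python 3.6+) evaluates the names in the current scope.
by_fstring = f'from {end} to {start}'

print(by_name)
print(by_fstring)
```

Named fields tend to be easier to maintain than numeric ones when a template has more than two or three placeholders.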

### Use a for loop to fill each word into the sentence template

In [230]:
template = 'Lee wants a {} right now'
menu = ['sandwich', 'spam fritter', 'pancake']
for snack in menu:
    print(template.format(snack))

Lee wants a sandwich right now
Lee wants a spam fritter right now
Lee wants a pancake right now


## Lining Things Up
### {:6} pads the output to a width of 6 characters; numbers are right-aligned by default

In [231]:
'{:6}'.format(41)

'    41'

### The < option left-aligns the value; strings are left-aligned by default

In [234]:
'{:<6}'.format(41)

'41    '

In [235]:
'{:6}'.format('dog')

'dog   '

### The > option right-aligns the value

In [236]:
'{:>6}'.format('dog')

'   dog'

### Display a floating-point number to four decimal places

In [237]:
import math
'{:.4f}'.format(math.pi)

'3.1416'

### To display a percentage, add '%' to the format specification; the value is automatically multiplied by 100

In [238]:
count, total = 3205, 9375
"accuracy for {} words: {:.4%}".format(total, count / total)

'accuracy for 9375 words: 34.1867%'

### One important use of formatting strings is tabulating data.

In [239]:
def tabulate(cfdist, words, categories):
    print('{:16}'.format('Category'), end=' ')                    # column headings
    for word in words:
        print('{:>6}'.format(word), end=' ')
    print()
    for category in categories:
        print('{:16}'.format(category), end=' ')                  # row heading
        for word in words:                                        # for each word
            print('{:6}'.format(cfdist[category][word]), end=' ') # print table cell
        print()                                                   # end the row


In [240]:
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
            (genre, word)
            for genre in brown.categories()
            for word in brown.words(categories=genre))

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
tabulate(cfd, modals, genres)

Category            can  could    may  might   must   will 
news                 93     86     66     38     50    389 
religion             82     59     78     12     54     71 
hobbies             268     58    131     22     83    264 
science_fiction      16     49      4     12      8     16 
romance              74    193     11     51     45     43 
humor                16     30      8      8      9     13 
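The tabulate function above hard-codes the column widths 16 and 6. A width can also be passed as a parameter through a nested replacement field, which lets a column fit its longest entry. A minimal sketch with illustrative data:

```python
words = ['can', 'could', 'may']

# Compute the column width from the data instead of hard-coding it.
width = max(len(w) for w in words)

# '{:{width}}' fills in the width from the keyword argument at format time.
lines = ['{:{width}}|'.format(w, width=width) for w in words]
for line in lines:
    print(line)
```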


## Writing Results to a File
### Open a writable file, output.txt, and save the program's output to it.

In [257]:
output_file = open('output.txt', 'w')
words = set(nltk.corpus.genesis.words('english-kjv.txt'))
for word in sorted(words):
    print(word, file=output_file)

In [258]:
len(words)

2789

In [259]:
str(len(words))

'2789'

In [264]:
print(str(len(words)), file=output_file)
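The cells above never close output_file, so the last lines written may stay in a buffer until the interpreter exits. One common alternative is a with block, which closes the file automatically. The sketch below uses a small stand-in vocabulary and a temporary directory instead of the genesis corpus; both are assumptions made so the example runs on its own.

```python
import os
import tempfile

words = {'beginning', 'God', 'heaven', 'earth'}  # stand-in for the corpus vocabulary

# Write into a temporary directory so no existing output.txt is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

with open(path, 'w', encoding='utf8') as output_file:
    for word in sorted(words):
        print(word, file=output_file)
    print(len(words), file=output_file)  # print() converts the int to text itself
# The file is flushed and closed here, even if an exception was raised inside the block.

with open(path, encoding='utf8') as f:
    lines = f.read().splitlines()
print(lines)
```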

## Text Wrapping

### When a program's output is running text rather than a table, it needs to be wrapped for readability. The output below overflows a single line:

In [267]:
saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
          'more', 'is', 'said', 'than', 'done', '.']

for word in saying:
    print(word, '(' + str(len(word)) + '),', end=' ')

After (5), all (3), is (2), said (4), and (3), done (4), , (1), more (4), is (2), said (4), than (4), done (4), . (1), 

### Wrapping lines with the help of Python's textwrap module

In [275]:
from textwrap import fill
template = '%s (%d),'  # named `template` to avoid shadowing the built-in format()
pieces = [template % (word, len(word)) for word in saying]
output = ' '.join(pieces)
wrapped = fill(output)
print(wrapped)

After (5), all (3), is (2), said (4), and (3), done (4), , (1), more
(4), is (2), said (4), than (4), done (4), . (1),
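In the wrapped output above, the line break separates "more" from its length "(4),". One possible workaround, sketched here, is to join each word to its count with a placeholder character before wrapping and swap it back to a space afterwards. This assumes the placeholder (an underscore) never occurs in the words themselves.

```python
from textwrap import fill

saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
          'more', 'is', 'said', 'than', 'done', '.']

# Glue each word to its length with '_' so fill() cannot break between them.
pieces = ['{}_({}),'.format(word, len(word)) for word in saying]
wrapped = fill(' '.join(pieces))

# Restore the spaces once the line breaks have been chosen.
wrapped = wrapped.replace('_', ' ')
print(wrapped)
```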
