# NLTK for Python Test Bed
## Chapter 3:  Processing Raw Text
### Using examples from http://www.nltk.org
##### Whitney King  (6/5/2018)

In [1]:
from IPython.display import Image, display, HTML
import unicodedata
import nltk
from nltk import word_tokenize
import pandas as pd
import pprint as pp
import datetime as dt
import emoji
import re
#nltk.download()

In [2]:
#Define custom functions

def line():
    print("-" * 80)

## _Accessing Text from the Web and from Disk_

### _Electronic Books_

Resources such as *Project Gutenberg* have text versions of books online for free. There are over 25,000 to choose from on the PG website, in over 50 languages, all of which can be downloaded in ASCII.

In [18]:
from urllib import request
url = "http://www.gutenberg.org/files/19033/19033.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
print(len(raw), ' | ', type(raw), ' | ',  raw[:69])
line()
tokens = word_tokenize(raw)
print(type(tokens), len(tokens), tokens[:10])

74726  |  <class 'str'>  |  The Project Gutenberg EBook of Alice in Wonderland, by Lewis Carroll
--------------------------------------------------------------------------------
<class 'list'> 15758 ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Alice', 'in', 'Wonderland', ',', 'by']


The collocations below are dominated by descriptive information about the book (Project Gutenberg header and licence text), so that material should be trimmed from the raw data before analysis.

In [19]:
text = nltk.Text(tokens)
print(type(text))
pp.pprint(text[976:991])
text.collocations()  # prints its results directly and returns None, so no pprint wrapper is needed

<class 'nltk.text.Text'>
['alas',
 '!',
 'either',
 'the',
 'locks',
 'were',
 'too',
 'large',
 ',',
 'or',
 'the',
 'key',
 'was',
 'too',
 'small']
Project Gutenberg-tm; Project Gutenberg; said Alice; Literary Archive;
White Rabbit; Archive Foundation; Gutenberg-tm electronic; Gutenberg
Literary; electronic works; United States; March Hare; public domain;
set forth; golden key; electronic work; white kid-gloves; Gutenberg-tm
License; play croquet; Mary Ann; thought Alice


In [20]:
raw = raw[raw.find("I--DOWN THE RABBIT-HOLE"):raw.rfind("End of the Project Gutenberg EBook")]

raw.find("I--DOWN THE RABBIT-HOLE") #Trimmed book text

0
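The trimming step can also be tried offline on a toy string. This is only a minimal sketch: the sample text and the `START`/`END` markers are made up for illustration, and the `trim()` helper is hypothetical. One thing worth noting is that `find()` and `rfind()` return -1 when a marker is missing, which would silently slice the wrong span, so the sketch checks for that case.

```python
# Toy stand-in for a downloaded e-book (hypothetical content)
sample = (
    "Project header and licence text\n"
    "CHAPTER I\n"
    "Alice was beginning to get very tired.\n"
    "End of the Project EBook\n"
    "Licence footer\n"
)

START = "CHAPTER I"                  # hypothetical start marker
END = "End of the Project EBook"     # hypothetical end marker

def trim(raw, start_marker, end_marker):
    """Return the text from the first start_marker up to the last end_marker."""
    start = raw.find(start_marker)
    end = raw.rfind(end_marker)
    if start == -1 or end == -1:
        # find()/rfind() signal "not found" with -1 rather than raising
        raise ValueError("marker not found; refusing to slice")
    return raw[start:end]

body = trim(sample, START, END)
print(body)
```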

In [21]:
tokens = word_tokenize(raw)
text = nltk.Text(tokens)
text.collocations()  # prints directly and returns None; one call is enough

said Alice; White Rabbit; March Hare; golden key; white kid-gloves;
play croquet; Mary Ann; thought Alice; inches high; little golden;
feet high; cool fountains; yer honor; good deal; low voice; asking
riddles; right size; trembling voice; shrinking rapidly; came upon


### _HTML_

Packages such as BeautifulSoup make it possible to wrangle text out of HTML web pages.

In [22]:
from bs4 import BeautifulSoup
url = "http://sailormoon.wikia.com/wiki/Sailor_Galaxia"
html = request.urlopen(url).read().decode('utf8')
print(html[:1000])
line()

html_raw = BeautifulSoup(html[554:126566], "lxml")
tokens = word_tokenize(html_raw.get_text())
print(sorted(list(set([w.lower() for w in tokens if w.isalpha() and len(w) > 3]))[:100]))
line()

html_text = nltk.Text([w.lower() for w in tokens if w.isalpha() and len(w) > 3])
html_text.concordance('sailor')
line()

html_text.collocations()

<!doctype html>
<html lang="en" dir="ltr" class="">
<head>

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
	<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<meta name="generator" content="MediaWiki 1.19.24" />
<meta name="keywords" content="Sailor Moon Wiki,sailormoon,Sailor Galaxia,Act 50 - Stars 1,Chaos,Sailor Chaos,Sailor Chi and Sailor Phi,Sailor Lethe,Sailor Mnemosyne,Sailor Iron Mouse,Sailor Aluminum Siren,Sailor Lead Crow,Sailor Tin Nyanko" />
<meta name="description" content="Sailor Galaxia is one of the main antagonists in the final arc of Sailor Moon. She is well known amongst the Galaxy for wreaking havoc and ruining worlds in her quest to obtain the strongest Senshi Crystal. She is the most powerful Sailor Senshi in the galaxy and the ruler of Shadow Galactica..." />
<meta name="twitter:card" content="summary" />
<meta name="twitter:site" content="@getfandom" />
<meta name="twitter:url" content="http://sailormoo

### Search Engine Results

Although it is a fairly rudimentary method, counting search-engine hits for collocations (for example on Google) can be quite informative. Web search engines provide an efficient means of searching this large quantity of text for relevant linguistic examples. Unlike local corpora, where you can write programs to search for arbitrarily complex patterns, search engines generally only allow you to search for individual words or strings of words, sometimes with wildcards. Overall, they are a handy tool for quickly checking a theory.

**Google Hits for Collocations:** 

    Number of hits for collocations involving absolutely or definitely, followed by one of adore, love, like, or prefer

|Google hits | adore | love | like | prefer |
|-----|-----|-----|-----|-----|
|absolutely | 289,000 | 905,000 | 16,200 | 644|
|definitely | 1,460 | 51,000 | 158,000 | 62,600|
|ratio | 198:1 | 18:1 | 1:10 | 1:97|
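The ratio row can be recomputed from the two rows of counts. The sketch below is illustrative only: the `ratio()` helper is hypothetical, and the 158,000 count for *definitely like* is taken as the value consistent with the listed 1:10 ratio.

```python
# (absolutely, definitely) hit counts per verb, as listed in the table.
# Assumption: 158,000 for "definitely like", which agrees with the 1:10 ratio.
hits = {
    "adore":  (289_000, 1_460),
    "love":   (905_000, 51_000),
    "like":   (16_200, 158_000),
    "prefer": (644, 62_600),
}

def ratio(absolutely, definitely):
    """Express absolutely:definitely with the smaller side scaled to 1."""
    if absolutely >= definitely:
        return "{}:1".format(round(absolutely / definitely))
    return "1:{}".format(round(definitely / absolutely))

ratios = {verb: ratio(a, d) for verb, (a, d) in hits.items()}
print(ratios)
```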

### RSS Feeds

On the modern web, blogs are an important vector for individual voices to be heard, both formally and informally. Blogs are typically published as RSS feeds, which can be used as a source of text. The Python library ```feedparser``` can be used to access the content of a blog's RSS feed, which can then be munged and analyzed using various other libraries.

Most blogs use the same schema, which can then be broken down to get the text parts.
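To make the shared schema concrete, here is a rough offline sketch that parses a tiny, made-up RSS 2.0 document with the standard library instead of ```feedparser```. The feed content and URLs are hypothetical; the point is only the nesting of a channel title and repeated items, each with a title, link, and description.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry feed, written inline so the example runs offline
rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <description>Hello world</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
      <description>More text to analyze</description>
    </item>
  </channel>
</rss>"""

channel = ET.fromstring(rss).find("channel")
feed_title = channel.findtext("title")

# One dict per <item>, keeping only the text-bearing fields
entries = [
    {tag: item.findtext(tag) for tag in ("title", "link", "description")}
    for item in channel.findall("item")
]
print(feed_title, len(entries))
```

The `description` text is the part that would typically be passed on to a tokenizer.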

In [23]:
import feedparser as fp

# Ars Technica RSS feed
get_time = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
llog = fp.parse('http://feeds.arstechnica.com/arstechnica/index')

print('{0} | Retrieved: {1} PST'.format(llog.feed.title, get_time))
line()

for i, entry in enumerate(llog.entries, start=1):
    display(HTML('[{0}]&nbsp;&nbsp;<a href="{1}" target="_blank">{2}</a>'
                 .format(i, entry.link, entry.title)))

Ars Technica | Retrieved: 2018-06-07 18:56:54 PST
--------------------------------------------------------------------------------


In [24]:
import warnings
warnings.filterwarnings('ignore')

# Dig down into a post
post = llog.entries[8]
print(post.title)
line()

# Get post content
content = post.content[0].value
raw_content = BeautifulSoup(content, "lxml").get_text()  # name the parser explicitly, as in the HTML example
print(word_tokenize(raw_content))

Finally, scientists have found intriguing organic molecules on Mars
--------------------------------------------------------------------------------
['Enlarge', '/', 'Since', '2012', ',', 'NASA', "'s", 'Curiosity', 'rover', 'has', 'been', 'trying', 'to', 'find', 'organic', 'molecules', '.', 'Now', ',', 'it', 'has', 'succeeded', '.', '(', 'credit', ':', 'NASA', ')', 'After', 'more', 'than', 'four', 'decades', 'of', 'searching', 'for', 'organic', 'molecules', 'on', 'the', 'surface', 'of', 'Mars', ',', 'scientists', 'have', 'conclusively', 'found', 'them', 'in', 'mudstones', 'on', 'the', 'lower', 'slopes', 'of', 'Mount', 'Sharp', '.', 'A', 'variety', 'of', 'organic', 'compounds', 'were', 'discovered', 'by', 'NASA', "'s", 'Curiosity', 'rover', ',', 'which', 'heated', 'the', 'Martian', 'rocks', 'to', '500°', 'Celsius', 'to', 'release', 'the', 'chemicals', '.', 'The', 'finding', 'is', 'significant—for', 'life', 'to', 'have', 'ever', 'existed', 'on', 'Mars', 'there', 'would', 'almost', 'certa

### _Reading Local Files_

In addition to Python's built-in ```open()``` function, which reads flat files, third-party packages such as ```pypdf``` and ```pywin32``` can read PDF and Word files.

* Corpus files can also be read in this same manner
* Any string, including ```input()``` can be tokenized

In [25]:
f = 'Pokemon.txt'
with open(f, 'r') as fraw:
    # Iterate over the file directly; the name `row` avoids shadowing the line() helper defined earlier
    for row in fraw:
        print(row.strip())

Key	Number	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
1	1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	FALSE
2	2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	1	FALSE
3	3	Venusaur	Grass	Poison	525	80	82	83	100	100	80	1	FALSE
4	3	VenusaurMega Venusaur	Grass	Poison	625	80	100	123	122	120	80	1	FALSE
5	4	Charmander	Fire		309	39	52	43	60	50	65	1	FALSE
6	5	Charmeleon	Fire		405	58	64	58	80	65	80	1	FALSE
7	6	Charizard	Fire	Flying	534	78	84	78	109	85	100	1	FALSE
8	6	CharizardMega Charizard X	Fire	Dragon	634	78	130	111	130	85	100	1	FALSE
9	6	CharizardMega Charizard Y	Fire	Flying	634	78	104	78	159	115	100	1	FALSE
10	7	Squirtle	Water		314	44	48	65	50	64	43	1	FALSE
11	8	Wartortle	Water		405	59	63	80	65	80	58	1	FALSE
12	9	Blastoise	Water		530	79	83	100	85	105	78	1	FALSE
13	9	BlastoiseMega Blastoise	Water		630	79	103	120	135	115	78	1	FALSE
14	10	Caterpie	Bug		195	45	30	35	20	20	45	1	FALSE
15	11	Metapod	Bug		205	50	20	55	25	25	30	1	FALSE
16	12	Butterfree	Bug	F


### Unicode

Unicode is a standard for encoding characters that supports over 1 million characters. The ```open()``` function can read encoded data into Unicode strings, and write out Unicode strings in encoded form. Each character is assigned a number called a **code point**, which is written in the form ```\uXXXX```, where ```XXXX``` is a 4-digit hexadecimal number (code points above ```\uFFFF```, which include most emoji, are written with eight digits as ```\UXXXXXXXX```). From a Unicode perspective, characters are abstract entities which can be realized as one or more **glyphs**. Glyphs can appear on a screen or be printed on paper. A font is a mapping from characters to glyphs.
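A short sketch of the relationship between a character, its code point, and its encoded bytes, using only the standard library. The two sample characters are arbitrary choices for illustration.

```python
import unicodedata

samples = ["\u0144", "\U0001F32E"]   # a 4-digit and an 8-digit code point

for ch in samples:
    code_point = ord(ch)             # the character's number
    utf8 = ch.encode("utf8")         # its byte-level encoding
    print("U+{:04X}".format(code_point), utf8, unicodedata.name(ch))

# Round trip: decoding the bytes with the same codec gives the string back
round_trip = samples[0].encode("utf8").decode("utf8")
```

The first character needs two bytes in UTF-8 and the second needs four, which is why the length of a string and the length of its encoded form usually differ.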

In [5]:
# These are not working in the emoji module directly, even with modifications.
# Investigate further; they are directly from the module

def emoji_lis(string):
    """Return the location and emoji in list of dic format
    >>>emoji.emoji_lis("Hi, I am fine. 😁")
    >>>[{'location': 15, 'emoji': '😁'}]
    """
    _entities = []
    for pos,c in enumerate(string):
        if c in emoji.UNICODE_EMOJI:
            _entities.append({
                "location":pos,
                "emoji": c
                })
    return _entities

def emoji_count(string):
    """Returns the count of emojis in a string"""
    c=0
    for i in string:
        if i in emoji.UNICODE_EMOJI:
            c=c+1
    return(c)

In [6]:
# http://unicode.org/Public/emoji/11.0/ for updated list of codes
emoji_pattern = emoji.get_emoji_regexp()

In [7]:
# 😎\U1F60E 👾\U1F47E 🖖\U1F596 💞\U1F49E 🦄\U1F984 🌮\U1F32E

text = u"🦄 This text is rocking 🌮 some pretty sweet emojis 😎 👾 🖖 💞"

find_emoji = emoji_pattern.findall(text)
print(find_emoji)

['🦄', '🌮', '😎', '👾', '🖖', '💞']


In [116]:
# Currently replaces with English, but could be extended to replace with Unicode
demoji = emoji.demojize(text, delimiters=('__','__'))
print(demoji)

__unicorn_face__ This text is rocking __taco__ some pretty sweet emojis __smiling_face_with_sunglasses__ __alien_monster__ __vulcan_salute__ __revolving_hearts__


In [93]:
pp.pprint(emoji_lis(text))
line()
print('Emojis: ', emoji_count(text))

[{'emoji': '🦄', 'location': 0},
 {'emoji': '🌮', 'location': 23},
 {'emoji': '😎', 'location': 50},
 {'emoji': '👾', 'location': 52},
 {'emoji': '🖖', 'location': 54},
 {'emoji': '💞', 'location': 56}]
--------------------------------------------------------------------------------
Emojis:  6


The module ```unicodedata``` lets us inspect the properties of Unicode characters. We can use this to get the Unicode values of the emojis, as well as their other hardcoded properties, or we can encode the tag in other ways. Either way, there is a wealth of possibilities for how emojis can be used in text analysis.

In [101]:
for c in text:
    if ord(c) > 127: # Skip ASCII characters (code points 0-127)
        print('{} U+{:04x} {}'.format(c.encode('utf8'), ord(c), unicodedata.name(c)))

b'\xf0\x9f\xa6\x84' U+1f984 UNICORN FACE
b'\xf0\x9f\x8c\xae' U+1f32e TACO
b'\xf0\x9f\x98\x8e' U+1f60e SMILING FACE WITH SUNGLASSES
b'\xf0\x9f\x91\xbe' U+1f47e ALIEN MONSTER
b'\xf0\x9f\x96\x96' U+1f596 RAISED HAND WITH PART BETWEEN MIDDLE AND RING FINGERS
b'\xf0\x9f\x92\x9e' U+1f49e REVOLVING HEARTS


In [102]:
unicode = text.encode('unicode_escape')
print(unicode)
unicode.find(b'\\U0001f60e')  # doubled backslash: search for the literal escape text, not a character

b'\\U0001f984 This text is rocking \\U0001f32e some pretty sweet emojis \\U0001f60e \\U0001f47e \\U0001f596 \\U0001f49e'


68

### Tokenizing Text

When we tokenize a string we produce a list (of words), and this is Python's ```<list>``` type. Strings and lists are both kinds of sequence. We can pull them apart by indexing and slicing them, and we can join them together by concatenating them. However, we cannot join strings and lists.
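The sequence behavior described above can be sketched in a few lines. The token list is a made-up example.

```python
tokens = ["Monty", "Python", "and", "the", "Holy", "Grail"]
title = "Monty Python"

first_word = tokens[0]          # indexing a list gives a token
first_char = title[0]           # indexing a string gives a character
joined = " ".join(tokens[:2])   # slice the list, then glue with a space
longer = tokens + ["!"]         # list + list works

try:
    title + tokens              # str + list does not
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```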

Now that you can present emoji (or any other glyph) data in your desired format, you can tokenize it in whichever format best suits your needs:

In [117]:
# Shortcode (alias) tokens
shortcode_tokens = word_tokenize(demoji.lower())
print(shortcode_tokens)

['__unicorn_face__', 'this', 'text', 'is', 'rocking', '__taco__', 'some', 'pretty', 'sweet', 'emojis', '__smiling_face_with_sunglasses__', '__alien_monster__', '__vulcan_salute__', '__revolving_hearts__']


In [118]:
# Unicode escape tokens (lower() also turns the \U prefix into \u)
escape_tokens = word_tokenize(unicode.decode('utf-8').lower())
print(escape_tokens)

['\\u0001f984', 'this', 'text', 'is', 'rocking', '\\u0001f32e', 'some', 'pretty', 'sweet', 'emojis', '\\u0001f60e', '\\u0001f47e', '\\u0001f596', '\\u0001f49e']


In [120]:
# Emoji tokens
emoji_tokens = word_tokenize(text.lower())
print(emoji_tokens)

['🦄', 'this', 'text', 'is', 'rocking', '🌮', 'some', 'pretty', 'sweet', 'emojis', '😎', '👾', '🖖', '💞']


![NLPPipeline](http://whitneyontheweb.com/images/pipeline1.png "NLP Pipeline")

|```Method``` | Functionality | 
|-----|-----|
|```s.find(t)``` | index of first instance of string t inside s (-1 if not found) | 
|```s.rfind(t)``` | index of last instance of string t inside s (-1 if not found) | 
|```s.index(t)``` | like s.find(t) except it raises ValueError if not found | 
|```s.rindex(t)``` | like s.rfind(t) except it raises ValueError if not found | 
|```s.join(text)``` | combine the words of the text into a string using s as the glue | 
|```s.split(t)``` | split s into a list wherever a t is found (whitespace by default) | 
|```s.splitlines()``` | split s into a list of strings, one per line | 
|```s.lower()``` | a lowercased version of the string s | 
|```s.upper()``` | an uppercased version of the string s | 
|```s.title()``` | a titlecased version of the string s | 
|```s.strip()``` | a copy of s without leading or trailing whitespace | 
|```s.replace(t, u)``` | replace instances of t with u inside s | 
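A few of the methods from the table, applied to a throwaway sentence. This is just a quick illustration; the sentence is arbitrary.

```python
s = "  The quick brown fox jumps over the lazy dog  "
t = s.strip()                        # drop leading and trailing whitespace

first_the = t.lower().find("the")    # 0
last_the = t.lower().rfind("the")    # 31
missing = t.find("cat")              # -1, no exception
words = t.split()                    # split on whitespace by default
rebuilt = "-".join(words)            # "-" is the glue
swapped = t.replace("dog", "cat")

try:
    t.index("cat")                   # index() raises instead of returning -1
    index_raised = False
except ValueError:
    index_raised = True
```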

### _Regular Expressions_

Regexes are pattern-matching expressions. They can be used to extract many things; examples include extracting date parts in the desired format from text, locating words that end in 'ed' or 'ing', or finding words that have certain characters at particular positions.

In [26]:
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

# Date Parts
print([int(n) for n in re.findall(r'\d+', '2009-12-31')])

# Words ending in -ed
ed_words = [w for w in wordlist if re.search('ed$', w)]
print(ed_words[50:60])

# Words ending in -ing
ing_words = [w for w in wordlist if re.search('ing$', w)]
print(ing_words[30:40])

# Words matching ..j..t.. (8 letters: 2 wildcards, j, 2 wildcards, t, 2 wildcards)
# ^ anchors the start of the string and $ the end, which limits results to exactly 8 characters
jt_words = [w for w in wordlist if re.search('^..j..t..$', w)]
print(jt_words[:10])

[2009, 12, 31]
['adreamed', 'adscripted', 'aduncated', 'advanced', 'advised', 'aeried', 'aethered', 'afeared', 'affected', 'affectioned']
['aging', 'agoing', 'agreeing', 'ailing', 'aiming', 'airing', 'aisling', 'alarming', 'allthing', 'alluring']
['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', 'objectee', 'objector', 'rejecter', 'rejector']


Regular expressions are great for extracting parts of words, among other things.

|```Operator``` | Behavior | 
|-----|-----|
|```.``` | Wildcard, matches any character | 
|```^abc``` | Matches some pattern abc at the start of a string | 
|```abc$``` | Matches some pattern abc at the end of a string | 
|```[abc]``` | Matches one of a set of characters | 
|```[A-Z0-9]``` | Matches one of a range of characters | 
|```ed\|ing\|s``` | Matches one of the specified strings (disjunction) | 
|```*``` | Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure) | 
|```+``` | One or more of previous item, e.g. a+, [a-z]+ | 
|```?``` | Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]? | 
|```{n}``` | Exactly n repeats where n is a non-negative integer | 
|```{n,}``` | At least n repeats | 
|```{,n}``` | No more than n repeats | 
|```{m,n}``` | At least m and no more than n repeats | 
|```a(b\|c)+``` | Parentheses that indicate the scope of the operators | 
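Several operators from the table can be tried against a small, made-up word list. This is a sketch for experimentation, not an exhaustive tour.

```python
import re

words = ["walk", "walked", "walking", "walks", "wed", "aaargh", "a1", "B52"]

# Disjunction scoped by parentheses, anchored at the end of the string
ends_ed_ing_s = [w for w in words if re.search(r"(ed|ing|s)$", w)]
# Character range, anchored at the start
starts_range = [w for w in words if re.search(r"^[A-Z0-9]", w)]
# ? makes the preceding item optional
optional_s = [w for w in words if re.search(r"^walks?$", w)]
# {n} demands exactly n repeats of the preceding item
three_a = [w for w in words if re.search(r"a{3}", w)]
# Four wildcards between anchors: exactly four characters
any_four = [w for w in words if re.search(r"^....$", w)]
```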

The T9 system is used for entering text on mobile phone keypads, where each number key stands for several letters.

* Two or more words that are entered with the same sequence of keystrokes are known as **textonyms**
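Textonyms can also be found without hand-writing one regex per key sequence, by mapping every word to its keystrokes and grouping. The sketch below assumes the standard phone keypad layout; the word list and the `keystrokes()` helper are made up for illustration.

```python
from collections import defaultdict

# Standard phone keypad letter groups
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}

def keystrokes(word):
    """Translate a purely alphabetic word into its key sequence."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())

sample_words = ["hole", "golf", "good", "home", "gone", "hood", "ace", "bad"]

by_keys = defaultdict(list)
for w in sample_words:
    by_keys[keystrokes(w)].append(w)

# Keep only key sequences shared by more than one word
textonyms = {k: ws for k, ws in by_keys.items() if len(ws) > 1}
print(textonyms)
```

The group under `223` corresponds to the pattern `^[abc][abc][def]$`.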

In [33]:
# 3 letter textonyms
print([w for w in wordlist if re.search('^[abc][abc][def]$', w)])

# 5 letter textonyms
print([w for w in wordlist if re.search('^[pqrs][abc][abc][def][def]$', w)])

# "finger-twisters"
# words that only use part of the number-pad. 
# For example «^[ghijklmno]+$», or more concisely, «^[g-o]+$», 
# will match words that only use keys 4, 5, 6 in the center row, 
# and «^[a-fj-o]+$» will match words that use keys 2, 3, 5, 6 in the top-right corner
ft_g_o = [w for w in wordlist if re.search('^[g-o]+$', w)]
ft_a_fj_o = [w for w in wordlist if re.search('^[a-fj-o]+$', w)]
print(ft_g_o[:20])
print(ft_a_fj_o[20:40])

['ace', 'bad', 'bae', 'cad']
['paced', 'scaff']
['g', 'ghoom', 'gig', 'giggling', 'gigolo', 'gilim', 'gill', 'gilling', 'gilo', 'gim', 'gin', 'ging', 'gingili', 'gink', 'ginkgo', 'ginning', 'gio', 'glink', 'glom', 'glonoin']
['able', 'abloom', 'abode', 'abolla', 'aboma', 'aboon', 'academe', 'acana', 'acca', 'accede', 'accedence', 'accend', 'accolade', 'accoladed', 'accolle', 'accommodable', 'ace', 'ackman', 'acle', 'acme']


So what did the special characters accomplish for us in that last regex? 
-  ```+``` simply means "one or more instances of the preceding item". 
-  If we replace ```+``` with ```*```, it means "zero or more instances of the preceding item"
- Note that the + and * symbols are sometimes referred to as **Kleene closures**, or simply **closures**
- ```^``` as the first character inside square brackets, as in ```[^ed]```, matches any character EXCEPT those listed in the brackets with it.
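The difference between these operators is easy to see on a handful of invented chat-style tokens:

```python
import re

chat = ["mine", "miiine", "mn", "min", "haha", "ahh", "hat", "<3", ":)"]

# + requires at least one of each letter
plus = [w for w in chat if re.search(r"^m+i+n+e+$", w)]
# * lets any of the letters be absent
star = [w for w in chat if re.search(r"^m*i*n*e*$", w)]
# One or more characters drawn from the set {h, a}
only_h_a = [w for w in chat if re.search(r"^[ha]+$", w)]
# ^ inside the brackets negates the set: no lowercase vowels anywhere
no_vowels = [w for w in chat if re.search(r"^[^aeiou]+$", w)]
```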

If we explore these more, we can see how they behave when applied in different contexts.

In [573]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))

chat_posts = []
for post in nltk.corpus.nps_chat.posts():
    jp = ' '.join(post)
    chat_posts.append(jp)

In [38]:
# One or more of each of the letters annotated with +
mine = [w for w in chat_words if re.search('^m+i+n+e+$', w)]
print(mine)

['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'mine', 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']


In [39]:
# One or more of either of the bracketed characters in the set annotated with +
h_or_a = [w for w in chat_words if re.search('^[ha]+$', w)]
print(h_or_a[:10])

['a', 'aaaaaaaaaaaaaaaaa', 'aaahhhh', 'ah', 'ahah', 'ahahah', 'ahh', 'ahhahahaha', 'ahhh', 'ahhhh']


In [40]:
# Tokens with all non-vowel characters
non_vowels = [w for w in chat_words if re.search('^[^aeiouAEIOU]+$', w)]
print(non_vowels[305:345])

['99', '99701', '99703', '9:10', ':', ':(', ':)', ':):):)', ':-(', ':-)', ':-@', ':.', ':/', ':@', ':D', ':P', ':]', ':p', ':|', ';', '; ..', ';)', ';-(', ';-)', ';0', ';]', ';p', '<', '<,', '<-', '<--', '<---', '<----', '<----------', '<3', "<3's", '<33', '<333', '<3333', '<33333']


In [30]:
# Tokens that are decimal numbers
print([w for w in chat_words if re.search(r'^[0-9]+\.[0-9]+$', w)])

['1.98', '1.99', '102.6', '121.7', '147.7', '2.3', '39.3', '4.20', '45.5', '64.8', '9.53', '98.5', '98.6']


In [64]:
# Tokens that start with a capital letter followed by one or more non-digit characters (not strictly alphabetic: 'Dr.' matches too)
alphabetic = [w for w in chat_words if re.search('^[A-Z][^0-9]+', w)]
print(alphabetic[200:250])

['Diary', 'Did', 'Diego', 'Dipset', 'Dixie', 'Do', 'Does', 'Doing', 'Dokken', 'Dolls', 'Dood', 'Down', 'Downy', 'Dr', 'Dr.', 'Dreams', 'Drew', 'Drive', 'Drop', 'Dude', 'Dustin', 'Dying', 'ELSE', 'ENOUGH', 'EST', 'EVEN', 'EVERYTHING', 'Earth', 'Easily', 'Eastern', 'Eddie', 'Edgewood', 'Eggs', 'Elev', 'Elle', 'End', 'Eticket', 'Evanescence', 'Even', 'Everyone', 'Everytime', 'Evil', 'Eyes', 'FACE', 'FEMALE', 'FF', 'FINE', 'FL', 'FOLKS', 'FROM']


In [107]:
# Sly posts (one word followed by a smiley face)
[p for p in chat_posts if re.search(r'^[a-zA-Z]+\s(;\)|:\))', p)]

['OOooOO :)', 'funny :)', 'haha ;)', 'ty :)', 'hugsss :)', 'ty :)']

In [109]:
# Posts with <3
# Note: the range A-z also covers [ \ ] ^ _ and `, which is why '=] <3333' matches below
[p for p in chat_posts if re.search(r'[a-zA-z]+\s(<3)', p)]

['Daniel <3',
 'Marlaya <333333333 !!!!!',
 'U190 loves U3 =] <3333',
 'heya tiff <3333',
 ". ACTION <3's all over U197 ..",
 '. ACTION is <3 all ovr .']

In [110]:
# 4 digit numbers (Years)
[w for w in chat_words if re.search('^[0-9]{4}$', w)]

['1200', '1299', '1900', '1930', '1980', '1985', '1996', '2006']

In [112]:
# Double hyphenated phrases
# First phrase at least 5 characters long, 
# second between 2-3 characters
# 3rd no more than 6 letters
[w for w in chat_words if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]

['peace-and-quiet']

In [117]:
# Capital letter A-Z followed by 3 to 5 digits (placeholder usernames)
print([w for w in chat_words if re.search('^[A-Z][0-9]{3,5}$', w)])

['U100', 'U101', 'U102', 'U103', 'U104', 'U105', 'U106', 'U107', 'U108', 'U109', 'U110', 'U111', 'U112', 'U113', 'U114', 'U115', 'U116', 'U117', 'U118', 'U119', 'U120', 'U121', 'U122', 'U123', 'U126', 'U128', 'U129', 'U130', 'U132', 'U133', 'U134', 'U136', 'U137', 'U1370', 'U138', 'U139', 'U141', 'U142', 'U143', 'U144', 'U145', 'U146', 'U147', 'U148', 'U149', 'U150', 'U153', 'U154', 'U155', 'U156', 'U158', 'U163', 'U164', 'U165', 'U168', 'U169', 'U170', 'U172', 'U175', 'U181', 'U190', 'U196', 'U197', 'U219', 'U520', 'U542', 'U819', 'U820', 'U988', 'U989']


## Applying Regular Expressions

Regular expressions have a multitude of uses beyond basic searching. The ```re.findall()``` function finds all (non-overlapping) matches of the given regular expression.

In [21]:
# Find all individual vowels
word = 'supercalifragilisticexpialidocious'
print(re.findall(r'[aeiou]', word))

['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']


Another useful application is altering a string by removing the parts you don't want, such as vowels. We use ```re.findall()``` to extract all the matching pieces, and ```''.join()``` to join them together.

It's widely noted that English remains comprehensible when some of the letters are removed, due to the high level of redundancy in its structures. The regular expression in our next example matches *initial vowel sequences, final vowel sequences, and all consonants; everything else is ignored.* 
- This three-way disjunction is processed left-to-right: if one of the three parts matches, any later parts of the regular expression are ignored.

In [18]:
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

print(nltk.tokenwrap(compress(w) for w in tokens[10:85]))

fnl arc of Slr Mn . She is wll knwn amngst the Glxy fr wrkng hvc and
rnng wrlds in hr qst to obtn the strngst Snshi Crystl . She is the mst
pwrfl Slr Snshi in the glxy and the rlr of Shdw Glctca ... '' / > Slr
Glxia | Slr Mn Wki | FNDM pwrd by Wkia a : lng ( ar ) , a : lng ( ckb
) , a
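To see which pieces survive for an individual word, the same pattern can be applied to a few hand-picked examples. The word choices here are arbitrary.

```python
import re

regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

compressed = {}
for word in ["antagonists", "galaxy", "obtain", "idea"]:
    pieces = re.findall(regexp, word)   # initial vowels, final vowels, consonants
    compressed[word] = ''.join(pieces)
    print(word, pieces, compressed[word])
```

Internal vowels match none of the three alternatives, so `findall()` simply skips over them; a word like *idea* comes through unchanged because its vowels are all initial or final.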


We can further develop patterns to get all sequences of two or more vowels (or other such interesting subsets of characters or words), and determine their relative frequency.

In [193]:
wsj = sorted(set(nltk.corpus.treebank.words()))

# Freq Dist of two or more vowels
fd = nltk.FreqDist([vs for word in wsj
           for vs in re.findall(r'[aeiou]{2,}', word.lower())])

fd.most_common(12)

[('io', 556),
 ('ea', 494),
 ('ou', 334),
 ('ie', 333),
 ('ai', 269),
 ('ia', 263),
 ('ee', 219),
 ('oo', 176),
 ('au', 120),
 ('ua', 110),
 ('ue', 106),
 ('ui', 97)]
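The same counting idea can be tried without any corpus download. In this sketch `collections.Counter` stands in for `nltk.FreqDist`, and the word list is made up.

```python
import re
from collections import Counter

sample = ["question", "nation", "beautiful", "read", "queue", "cooperation"]

# Count every run of two or more consecutive vowels
vowel_runs = Counter(
    run for word in sample for run in re.findall(r"[aeiou]{2,}", word)
)
print(vowel_runs.most_common(3))
```

Because `{2,}` is greedy, a run such as the one in *queue* is counted once as a single four-letter sequence rather than as overlapping pairs.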

In [194]:
# Frequency matrix of lower case vowel pairs
fd_dist = [vs for word in wsj
           for vs in re.findall(r'[aeiou][aeiou]', word.lower())]

fd = nltk.ConditionalFreqDist(fd_dist)
fd = pd.DataFrame(fd).fillna(value=0).astype(dtype=int)
fd

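| | a | e | i | o | u |
|-----|-----|-----|-----|-----|-----|
| a | 3 | 504 | 266 | 62 | 110 |
| e | 13 | 223 | 336 | 16 | 111 |
| i | 272 | 89 | 2 | 66 | 101 |
| o | 6 | 46 | 586 | 177 | 13 |
| u | 120 | 23 | 14 | 340 | 1 |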


In [212]:
# Frequency matrix of lower case letter pairs
cfd_nums = [vs for word in wsj
            for vs in re.findall(r'[a-z][a-z]', word.lower())]

cfd = nltk.ConditionalFreqDist(cfd_nums)
df_cfd = pd.DataFrame(cfd).fillna(value=0).astype(dtype=int)
df_cfd

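| | a | b | c | d | e | f | g | h | i | j | ... | q | r | s | t | u | v | w | x | y | z |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| a | 1 | 183 | 288 | 102 | 163 | 135 | 106 | 157 | 100 | 23 | ... | 0 | 324 | 145 | 239 | 45 | 74 | 124 | 7 | 19 | 15 |
| b | 110 | 14 | 1 | 1 | 12 | 1 | 0 | 5 | 35 | 0 | ... | 0 | 27 | 7 | 3 | 17 | 0 | 0 | 0 | 8 | 0 |
| c | 214 | 4 | 17 | 4 | 157 | 1 | 0 | 1 | 256 | 0 | ... | 0 | 70 | 142 | 35 | 57 | 0 | 2 | 0 | 5 | 0 |
| d | 188 | 1 | 3 | 20 | 594 | 0 | 2 | 1 | 98 | 0 | ... | 0 | 97 | 8 | 1 | 37 | 0 | 2 | 0 | 1 | 1 |
| e | 7 | 183 | 245 | 477 | 93 | 121 | 197 | 187 | 144 | 25 | ... | 0 | 773 | 415 | 630 | 51 | 276 | 95 | 13 | 78 | 37 |
| f | 36 | 0 | 0 | 3 | 29 | 36 | 1 | 3 | 53 | 0 | ... | 0 | 15 | 13 | 0 | 2 | 0 | 1 | 0 | 1 | 0 |
| g | 110 | 0 | 0 | 25 | 20 | 0 | 19 | 0 | 68 | 0 | ... | 0 | 64 | 1 | 0 | 37 | 0 | 1 | 0 | 0 | 0 |
| h | 5 | 2 | 263 | 1 | 7 | 0 | 94 | 2 | 1 | 0 | ... | 0 | 9 | 190 | 209 | 0 | 0 | 51 | 0 | 0 | 2 |
| i | 117 | 82 | 176 | 304 | 30 | 171 | 90 | 135 | 4 | 6 | ... | 0 | 329 | 272 | 597 | 38 | 157 | 83 | 10 | 34 | 13 |
| j | 0 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |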


To make use of these word segments, it would be really useful to see the actual words that these frequencies come from, in case we have questions about patterns that arise in the data. We can look for patterns such as *partial complementary distributions*, which suggest things like letter pairs not being a distinct **phoneme** (perceptually distinct units of sound used to distinguish one word from another) in the language.

In [201]:
cfd_pairs = [(cv, w) for w in wsj
            for cv in re.findall(r'[a-z][a-z]', w.lower())]

cfd_index = nltk.Index(cfd_pairs)
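The idea behind the index can be sketched with a plain dictionary of lists. Here `defaultdict(list)` is used as a stand-in for `nltk.Index`, and the five sample words are picked by hand.

```python
import re
from collections import defaultdict

sample = ["Lawmakers", "lawmaking", "Bowman", "modify", "notify"]

pair_index = defaultdict(list)
for w in sample:
    # findall() returns non-overlapping pairs, scanning left to right
    for pair in re.findall(r"[a-z][a-z]", w.lower()):
        pair_index[pair].append(w)

print(pair_index["wm"])
print(pair_index["fy"])
```

Because the matches do not overlap, only pairs starting at even offsets are recorded: *lawmakers* yields `la`, `wm`, `ak`, `er`, so a pair such as `aw` never makes it into the index.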

In [270]:
print(cfd_index['wm'])
print(cfd_index['hr'])
print(cfd_index['fy'])

['Bowman', 'Lawmakers', 'lawmakers', 'lawmaking']
['Fahrenheit', 'Mehrens', 'Schroder', 'cutthroat', 'synchronized']
['defying', 'identify', 'modify', 'notify']


In [266]:
# Find Indexes with [AQ] letter pairs
aq = [w for w in cfd_index if re.search('^[aq]+$', w)]
print({k:cfd_index[k] for k in aq if k in cfd_index})

{'aq': ['Aquino', 'Nasdaq'], 'aa': ['Jalaalwalikraam'], 'qa': []}


In [280]:
# Find Indexes with [XY] letter pairs
yx = [w for w in cfd_index if re.search('^[yx]+$', w)]
display({k:cfd_index[k] for k in yx if k in cfd_index})

{'xy': ['sexy']}

### Finding Word Stems

A lot of word processing happens around us that we hardly even notice. For example, when we search the web, results are grouped by similar terms, and we usually pay it no mind. In fact, it's a behavior we've come to expect for a good user experience. *Computer* and *Computers* are just two forms of the same dictionary word, or *lemma*. For many processing tasks, we need to be able to ignore word endings, and deal directly with word stems.

There are many ways of doing this, some more cleanly than others, but Regexes offer yet another means of doing so quickly and efficiently.

In [None]:
suffixes = ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']

Now, we want to make a regex that will take these suffixes and chop them off words. In this example, we'll pull a random word out of our index using the letter pairs we created.

In [303]:
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', cfd_index['rt'][10])

['ment']

```re.findall()``` just gave us the suffix even though the regular expression matched the entire word. This is because the parentheses have a second function, to select substrings to be extracted. If we want to use the parentheses to specify the scope of the disjunction, but not to select the material to be output, we have to add ```?:```

In [304]:
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', cfd_index['rt'][10])

['Department']

This is progress, since we now get the whole matched word back instead of just the suffix, but rather than returning one or the other, we should make the regex give us the stem and the suffix separately. This can be done by putting parentheses around both parts of the pattern.

In [305]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', cfd_index['rt'][10])

[('Depart', 'ment')]

This is looking like what we would expect of a stemmer, but what happens if we try some more word stems?

In [326]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', cfd_index['pr'][13])

[('Preference', 's')]

In the case of 'Preferences', the suffix should have been '-es'. This happened because the ```*``` operator is greedy, so ```.*``` tries to consume as much of the input as possible. If we switch to the non-greedy version of the operator, ```*?```, the stem stops as early as possible and the 'e' ends up in the suffix where it's expected. We can also add ```?``` after the parentheses for the suffix to make that group optional altogether, thus allowing empty suffixes.

In [328]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', cfd_index['pr'][13])

[('Preferenc', 'es')]

In [332]:
print(re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', cfd_index['rt'][10]))
def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

stems = [stem(t) for t in cfd_index['rt'][40:60]]

print(stems)

[('Depart', 'ment')]
['Participant', 'Particular', 'Part', 'Partner', 'Partnership', 'Party', 'Porter', 'Port', 'Portugal', 'Report', 'Robert', 'Robertson', 'Stuart', 'Uncertainty', 'Unfortunate', 'Virtual', 'Wadsworth', 'Wertheim', 'Westport', 'Woolworth']


There are still a lot of problems here, but overall this gives a good general understanding of how stemming works, and how regexes can make it smarter. At the end of the day, you're almost always going to want to use a tried-and-true, readily available stemmer such as the ones built into NLTK. In general, it's important to understand the fundamentals of how stemming works, and what the concept of lemmatization is.
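The remaining problems are easy to reproduce. The sketch below applies the same non-greedy pattern to a few hand-picked words, some of which it handles well and some of which it over-stems.

```python
import re

def stem(word):
    """Regex stemmer using the non-greedy pattern with an optional suffix."""
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem_part, suffix = re.findall(regexp, word)[0]
    return stem_part

examples = ["processing", "language", "flies", "lying", "is", "basis"]
stems = {w: stem(w) for w in examples}
print(stems)
```

Words such as *processing* and *language* come out as expected, but *is* loses its final letter and *lying* is cut down to *ly*, since the pattern has no notion of which strings are real words.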

### Searching Tokenized Text

Regexes can also be written in a manner that allows you to search across multiple words in a corpus. For example, if we were curious about the impact of a particular event, we could search for patterns such as ```<the> <event>``` and see what comes up in context. Angle brackets are used to mark token boundaries, and any whitespace between the angle brackets is ignored **(behaviors that are unique to NLTK's ```findall()``` method for texts)**.

It is easy to build search patterns when the linguistic phenomenon we're studying is tied to particular words. In some cases, a little creativity will go a long way. For instance, searching a large text corpus for expressions of the form ```x and other ys``` allows us to discover **hypernyms**.

In [176]:
from nltk.corpus import brown
hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")

speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals
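To build some intuition for the angle-bracket notation, here is a toy emulation written with plain `re`. It is not NLTK's implementation, only a rough sketch of the idea: wrap every token in `<...>`, join them into one string, strip whitespace from the pattern, and search. The helper name `find_token_sequences()` and the token list are hypothetical, and the sketch assumes no token contains `<` or `>`.

```python
import re

def find_token_sequences(tokens, pattern):
    """Toy token-level search: each token is wrapped in <...> before matching."""
    haystack = ''.join('<' + t + '>' for t in tokens)
    regex = re.sub(r'\s', '', pattern)      # whitespace in the pattern is ignored
    hits = re.findall(regex, haystack)      # no groups, so full matches come back
    # Turn '<a><b>' back into ['a', 'b']
    return [hit[1:-1].split('><') for hit in hits]

tokens = ["iron", "and", "other", "metals", "were", "mined", ",",
          "like", "water", "and", "other", "liquids", "."]

matches = find_token_sequences(tokens, r"<\w*> <and> <other> <\w*s>")
for m in matches:
    print(' '.join(m))
```

Since `\w` never matches `<` or `>`, a `<\w*>` element cannot spill over into a neighboring token, which is what makes the brackets act as token boundaries.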


In [337]:
hobbies_learned.findall(r"<\w*> <\w*> <or> <other> <\w*> <\w*>")

with metal or other hard material; the rain or other inclement
weather; that Jones or other Confederate commanders; national forest
or other lands under; a tree or other object appears; on acetate or
other types of; other changes or other energies involved; of State or
other public officials; the president or other executive of; the
community or other school district; No soap or other cleaning agent; a
thermometer or other equivalent equipment


In [338]:
hobbies_learned.findall(r"<\w*> <and> <not> <\w*>")

campaigners and not only; Delawares and not much; official and not
just; sided and not balanced; 20 and not more; you and not cutting;
chlorine and not by; dispersed and not politically; lawmaking and not
leave; aside and not used; district and not that; expenses and not
the; moral and not by; hypothesis and not a; books and not out; will
and not inclined


**With enough text, this approach would give us a useful store of information about the taxonomy of objects, without the need for any manual labor**
- However, our search results will usually contain false positives
  -  For example, the result ```demands and other factors``` suggests that a demand is an instance of the type factor, but the sentence is actually about wage demands
  -  Conversely, these searches also suffer from false negatives, where cases we would want to include are omitted
-  Nevertheless, we could construct our own ontology of English concepts by manually correcting the output of such searches
  -  Generally, a balance is struck through some combination of automatic and manual processing
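
The ```x and other ys``` search can be approximated without NLTK to see both the idea and its limits. The sketch below is only an illustration: it works on a plain string with the standard ```re``` module rather than on tokens, and the helper name ```find_hypernym_pairs``` and the sample sentence are made up for the example.

```python
import re

def find_hypernym_pairs(text):
    """Return (hyponym, hypernym) pairs for phrases shaped like 'X and other Ys'."""
    # \w+s requires the second word to end in 's', a crude stand-in for a plural
    pattern = re.compile(r"\b(\w+) and other (\w+s)\b")
    return pattern.findall(text)

sample = "They sell iron and other metals, plus pearls and other jewels."
pairs = find_hypernym_pairs(sample)
print(pairs)
```

Each pair still needs a human check: a phrase such as ```demands and other factors``` would be returned just as readily as the genuine hypernym pairs.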
  
## Normalizing Text

As we've processed text thus far, one of the steps along the way has been to call the ```.lower()``` method. Calling it normalizes the text to lowercase, so that distinctions between upper and lower case are ignored. We then went a step further into stemming with regexes.

### Lemmatization

Regexes, while quick and dirty and good for certain applications, are not good for others. Some applications demand that proper dictionary words be returned, not partial words such as 'Preferenc'. This is where the task known as lemmatization comes back in. We'll first go over a few of NLTK's built-in stemmers, then its lemmatizer. **Stemming is not a well-defined process, and we typically pick the stemmer that best suits the application we have in mind**.


In [341]:
tokens = chat_words

In [374]:
class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = width // 4                  # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[max(0, i-wc):i])  # clamp at 0 so early hits keep their left context
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()

-  ```PorterStemmer()``` is a good choice if you are indexing some texts and want to support search using alternative forms of words

In [378]:
porter = nltk.PorterStemmer()
print([porter.stem(t) for t in tokens[500:550]])

['aww', 'awww', 'B', 'baaaaalllllllliiiiiiinnnnnnnnnnn', 'BE', 'big', 'blond', 'boot', 'booti', 'boy', 'but', 'but', 'bye', 'back', 'barbiee', 'baromet', 'beach', 'becaus', 'been', 'ben', 'benjamin', 'better', 'bibl', 'biiiiiitch', 'biographi', 'birdgang', 'bloooooooood', 'bloooooooooood', 'bloooooooooooood', 'bone', 'bonu', 'book', 'boon', 'booyah', 'borat', 'born', 'box', 'boyz', 'break', 'break', 'broken', 'bud', 'burger', 'but', 'bwhaha', 'bye', 'C', 'CA', 'cali', 'can']


In [371]:
stem_porter = IndexedText(porter, chat_posts)

In [376]:
stem_porter.concordance('damn')

not tried myspace for dating . JOIN JOIN damn U15 .. mopeds r for old men and gay
ts me all tingly n stuff bye honey bunny damn 19 / F / Wisconsin ooeer is sum1 go
ie like U12 or come haha did , no takers damn knows what everyone gets U44 for x-
 is having issues . ACTION waves to U5 . damn PART JOIN so U88 - why no pic . ACT
fy PART JOIN you never stay and chat U69 damn lol aww looks for D :beer: chatting
ve JOIN hahahahahahahahahahahahahahahaha damn hey ppl u twizted bro hey guys MY D
. pm me if you want to chat .... yea wow damn Barbieee . PART 15 f cali guys pm m
is fun haha PART i am satan and u r dead damn not me this room is bluer than my b
pewing goodie pm me haha hurling yes lol damn . ACTION * Pearl Jam - Better Man .
TION chops U34 's finers AND toes off .. DAMN JOIN . ACTION Fingers * . What did 


In [377]:
lancaster = nltk.LancasterStemmer()
print([lancaster.stem(t) for t in tokens[550:600]])

['cap', 'cdt', 'chat', 'chathid', 'chip', 'choco', 'co', 'com', 'com', 'csi', 'cst', 'ct', 'cuz', 'californ', 'cam', 'can', 'canehd', 'cardin', 'cardn', 'card', 'car', 'carolin', 'catterick', 'ceil', 'chamillionair', 'chang', 'chang', 'chat', 'check', 'check', 'cheeeez', 'chic', 'chick', 'childr', 'chin', 'chingy', 'chop', 'chris', 'christianity', 'ciar', 'city', 'cleveland', 'clock', 'coincid', 'com', 'comply', 'connect', 'connecticut', 'consid', 'constitut']


* The ```WordNetLemmatizer()``` only removes affixes if the resulting word is in its dictionary
* The ```WordNetLemmatizer()``` is a good choice if you want to compile the vocabulary of some texts and want a list of valid lemmas (or lexicon headwords)

In [382]:
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t)  for t in tokens[600:650]])

['Cookies', 'Cool', 'Could', 'Course', 'Covered', 'Cradle', 'Craig', 'Crazy', 'Cream', 'Cry', 'Ct', 'Ctrl', 'Cum', 'Current', 'Cute', 'Cyber', 'D', 'DAMN', 'DAamn', 'DELIGHTFUL', 'DETROIT', 'DING', 'DIRTY', 'DJ', 'DO', 'DOES', 'DOING', 'DON', 'DONT', 'DOWNS', 'DVD', 'Dakota', 'Damn', 'Dang', 'Daniel', 'Daveeee', 'David', 'Dawn', 'Dawnstar', 'Days', 'Death', 'Deep', 'Define', 'Denver', 'Depends', 'Devil', 'Dew', 'Diary', 'Did', 'Diego']


Another normalization task involves identifying non-standard words including numbers, abbreviations, and dates, and mapping any such tokens to a special vocabulary. 
-  For example, every decimal number could be mapped to a single token 0.0, and every acronym could be mapped to AAA. 
-  This keeps the vocabulary small and improves the accuracy of many language modeling tasks
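
A rough sketch of this mapping, using only the standard ```re``` module. The two patterns and the helper name ```normalize_token``` are illustrative assumptions, not part of NLTK; real text would need broader patterns to cover dates and other abbreviations.

```python
import re

# Hypothetical patterns: a plain decimal number, and two or more capital
# letters with optional periods (e.g. NASA, U.S.A.)
DECIMAL = re.compile(r"\d+\.\d+")
ACRONYM = re.compile(r"(?:[A-Z]\.?){2,}")

def normalize_token(token):
    """Map decimals to '0.0' and acronyms to 'AAA'; leave other tokens alone."""
    if DECIMAL.fullmatch(token):
        return "0.0"
    if ACRONYM.fullmatch(token):
        return "AAA"
    return token

sample = ["NASA", "paid", "3.14", "or", "2.5", "to", "the", "U.S.A.", "in", "May"]
normalized = [normalize_token(t) for t in sample]
print(normalized)
```

After the mapping, the distinct numbers and acronyms in the sample collapse into just two vocabulary entries.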

## Tokenizing Text with Regexes

Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data. The very simplest method for tokenizing text is to split on whitespace.

 - When using a regular expression, you must also account for tabs and new lines
 - This can be further refined into splitting the text on anything other than a word character
 - We can use \W in a simple regular expression to split the input on anything other than a word character

In [401]:
re_split_raw = re.split(r'[ \W]+', raw)
print(re_split_raw[100:150])

['up', 'and', 'picking', 'the', 'daisies', 'when', 'suddenly', 'a', 'White', 'Rabbit', 'with', 'pink', 'eyes', 'ran', 'close', 'by', 'her', 'There', 'was', 'nothing', 'so', 'very', 'remarkable', 'in', 'that', 'nor', 'did', 'Alice', 'think', 'it', 'so', 'very', 'much', 'out', 'of', 'the', 'way', 'to', 'hear', 'the', 'Rabbit', 'say', 'to', 'itself', 'Oh', 'dear', 'Oh', 'dear', 'I', 'shall']


| Symbol | Function | 
|-----|-----|
| \b | Word boundary (zero width) | 
| \d | Any decimal digit (equivalent to [0-9]) | 
| \D | Any non-digit character (equivalent to [^0-9]) | 
| \s | Any whitespace character (equivalent to [ \t\n\r\f\v]) | 
| \S | Any non-whitespace character (equivalent to [^ \t\n\r\f\v]) | 
| \w | Any alphanumeric character (equivalent to [a-zA-Z0-9_]) | 
| \W | Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_]) | 
| \t | The tab character | 
| \n | The newline character | 


With certain texts you will see empty strings at the start and the end of the result. To see why, split the string ```xx``` on ```x```. Using ```re.findall(r'\w+', raw)```, with a pattern that matches the words instead of the spaces, gives the same tokens without the empty strings.

In [386]:
'xx'.split('x')

['', '', '']

In [402]:
re_split_raw = re.findall(r'\w+', raw)
print(re_split_raw[100:150])

['up', 'and', 'picking', 'the', 'daisies', 'when', 'suddenly', 'a', 'White', 'Rabbit', 'with', 'pink', 'eyes', 'ran', 'close', 'by', 'her', 'There', 'was', 'nothing', 'so', 'very', 'remarkable', 'in', 'that', 'nor', 'did', 'Alice', 'think', 'it', 'so', 'very', 'much', 'out', 'of', 'the', 'way', 'to', 'hear', 'the', 'Rabbit', 'say', 'to', 'itself', 'Oh', 'dear', 'Oh', 'dear', 'I', 'shall']


Now that words are being matched the way we expect, we can extend the pattern. Let's say we want to match any sequence of word characters and, if no match is found, try to match any non-whitespace character (```\S``` is the complement of ```\s```) followed by further word characters.

This means that punctuation is grouped with any following letters (e.g. 's) but that sequences of two or more punctuation characters are separated.

In [403]:
re_split_raw = re.findall(r'\w+|\S\w*', raw)
print(re_split_raw[100:150])

['pleasure', 'of', 'making', 'a', 'daisy', '-chain', 'would', 'be', 'worth', 'the', 'trouble', 'of', 'getting', 'up', 'and', 'picking', 'the', 'daisies', ',', 'when', 'suddenly', 'a', 'White', 'Rabbit', 'with', 'pink', 'eyes', 'ran', 'close', 'by', 'her', '.', 'There', 'was', 'nothing', 'so', 'very', 'remarkable', 'in', 'that', ',', 'nor', 'did', 'Alice', 'think', 'it', 'so', 'very', 'much', 'out']


We could further generalize to allow hyphens and apostrophes inside words, and to keep ellipses, runs of hyphens, and open parentheses as tokens of their own.

In [407]:
re_split_raw = re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)
print(re_split_raw[100:150])

['pleasure', 'of', 'making', 'a', 'daisy-chain', 'would', 'be', 'worth', 'the', 'trouble', 'of', 'getting', 'up', 'and', 'picking', 'the', 'daisies', ',', 'when', 'suddenly', 'a', 'White', 'Rabbit', 'with', 'pink', 'eyes', 'ran', 'close', 'by', 'her', '.', 'There', 'was', 'nothing', 'so', 'very', 'remarkable', 'in', 'that', ',', 'nor', 'did', 'Alice', 'think', 'it', 'so', 'very', 'much', 'out', 'of']


Tokenization turns out to be a far more difficult task than you might have expected. No single solution works well across-the-board, and we must decide what counts as a token depending on the application domain.

When developing a tokenizer it helps to have access to raw text which has been manually tokenized, in order to compare the output of your tokenizer with high-quality (or "gold-standard") tokens.

A final issue for tokenization is the presence of contractions, such as didn't. If we are analyzing the meaning of a sentence, it would probably be more useful to normalize this form to two separate forms: did and n't (or not). We can do this work with the help of a lookup table.
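
A minimal sketch of such a lookup table follows. The table contents and the helper name ```expand_contractions``` are assumptions made for illustration; a real table would be much larger, and this version matches case-insensitively and returns the lowercase forms stored in the table.

```python
# Hypothetical lookup table: contraction -> the separate forms it normalizes to
CONTRACTIONS = {
    "didn't": ["did", "n't"],
    "can't": ["ca", "n't"],
    "won't": ["wo", "n't"],
    "it's": ["it", "'s"],
}

def expand_contractions(tokens):
    """Replace each known contraction with its parts; keep other tokens as-is."""
    expanded = []
    for token in tokens:
        expanded.extend(CONTRACTIONS.get(token.lower(), [token]))
    return expanded

print(expand_contractions(["She", "didn't", "think", "it's", "odd"]))
```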

### NLTK Regex Tokenizer

The function ```nltk.regexp_tokenize()``` is similar to ```re.findall()``` (as we've been using it for tokenization). However, ```nltk.regexp_tokenize()``` is more efficient for this task, and avoids the need for special treatment of parentheses.

We can evaluate a tokenizer by comparing the resulting tokens with a wordlist, and reporting any tokens that don't appear in the wordlist, using ```set(tokens).difference(wordlist)```

In [418]:
re_tokenize_raw = nltk.regexp_tokenize(raw, r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*")

print(re_tokenize_raw[100:150])

['pleasure', 'of', 'making', 'a', 'daisy-chain', 'would', 'be', 'worth', 'the', 'trouble', 'of', 'getting', 'up', 'and', 'picking', 'the', 'daisies', ',', 'when', 'suddenly', 'a', 'White', 'Rabbit', 'with', 'pink', 'eyes', 'ran', 'close', 'by', 'her', '.', 'There', 'was', 'nothing', 'so', 'very', 'remarkable', 'in', 'that', ',', 'nor', 'did', 'Alice', 'think', 'it', 'so', 'very', 'much', 'out', 'of']
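
The wordlist comparison mentioned above can be tried on a small scale. This sketch uses ```re.findall()``` with the same pattern so that it runs without any corpus downloads; the sample sentence and the tiny ```wordlist``` are invented for the example, and a real check would use a full dictionary.

```python
import re

PATTERN = r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*"
sample = "Alice's daisy-chain wasn't (very) remarkable..."
tokens = re.findall(PATTERN, sample)
print(tokens)

# Hypothetical, tiny wordlist standing in for a real dictionary
wordlist = {"alice", "daisy", "chain", "very", "remarkable"}

# Compare only tokens that contain a letter, ignoring case
word_tokens = {t.lower() for t in tokens if any(c.isalpha() for c in t)}
unknown = word_tokens.difference(wordlist)
print(sorted(unknown))
```

The tokens left over are the ones the tokenizer kept together (possessives, hyphenated words, contractions), which is exactly the kind of decision a wordlist check brings to the surface.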


## Segmentation

Tokenization is a specific instance of the more general task of segmentation. So far we have segmented text in one particular way, but there are many other ways to segment a text corpus in order to gain knowledge from it.

### Sentence Segmentation

Manipulating texts at the level of individual words often presupposes the ability to divide a text into individual sentences. Some corpora, such as the Brown corpus, already provide access to the data at the sentence level. In other cases, the text must first be segmented into sentences, for example with ```nltk.sent_tokenize()```.

In [422]:
print('Avg Words Per Snt Brown Corpus: '\
      , len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents()))

Avg Words Per Snt Brown Corpus:  20.250994070456922


In [32]:
text = nltk.corpus.gutenberg.raw('bryant-stories.txt')
sents = nltk.sent_tokenize(text)
pp.pprint(sents[79:89])

['"I will, then," said the little Red Hen, and she planted the grain of\r\n'
 'wheat.',
 'When the wheat was ripe she said, "Who will take this wheat to the\r\nmill?"',
 '"Not I," said the Goose.',
 '"Not I," said the Duck.',
 '"I will, then," said the little Red Hen, and she took the wheat to the\r\n'
 'mill.',
 'When she brought the flour home she said, "Who will make some bread with\r\n'
 'this flour?"',
 '"Not I," said the Goose.',
 '"Not I," said the Duck.',
 '"I will, then," said the little Red Hen.',
 'When the bread was baked, she said, "Who will eat this bread?"']


Sentence segmentation is difficult because the period is also used to mark abbreviations, and some periods simultaneously mark an abbreviation and terminate a sentence, as often happens with acronyms like U.S.A.
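
To see the difficulty concretely, here is a deliberately naive splitter built on ```re```. It is not how ```nltk.sent_tokenize()``` works; the rule (split after ```.```, ```!``` or ```?``` when followed by whitespace and a capital letter) and the name ```naive_sent_split``` are assumptions chosen to expose the abbreviation problem.

```python
import re

def naive_sent_split(text):
    """Split after ., ! or ? when followed by whitespace and an uppercase letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

sample = "She moved to the U.S.A. in 1999. Dr. Smith met her there."
for sentence in naive_sent_split(sample):
    print(repr(sentence))
```

The rule survives ```U.S.A.``` here only because a lowercase word follows it, and it wrongly breaks the text after ```Dr.```, so even this short sample comes out as three pieces instead of two.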

### Word Segmentation

For some writing systems, tokenizing text is made more difficult by the fact that there is no visual representation of word boundaries. For example, in Chinese, the three-character string: 爱国人 (ai4 "love" (verb), guo2 "country", ren2 "person") could be tokenized as 爱国 / 人, "country-loving person" or as 爱 / 国人, "love country-person."

A similar problem arises in the processing of spoken language, where the hearer must segment a continuous speech stream into individual words. A particularly challenging version of this problem arises when we don't know the words in advance. This is the problem faced by a language learner, such as a child hearing utterances from a parent. Consider the following artificial example, where word boundaries have been removed:

1.  doyouseethekitty
2.  seethedoggy
3.  doyoulikethekitty
4.  likethedoggy

Our first challenge is simply to represent the problem: we need to find a way to separate text content from the segmentation. We can do this by annotating each character with a boolean value to indicate whether or not a word-break appears after the character (an idea that will be used heavily for "chunking").

Let's assume that the learner is given the utterance breaks, since these often correspond to extended pauses. Here is a possible representation, including the initial and target segmentations.

-  The segmentation strings consist of zeros and ones
-  They are one character shorter than the source text
  -  A text of length ```n``` can only be broken up in ```n-1``` places.
  -  ```seg1``` and ```seg2``` represent the initial and final segmentations of some hypothetical child-directed speech
  -  The ```segment()``` function can use them to reproduce the segmented text

In [431]:
# Unsegmented text string containing both sentence and word boundaries
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"

# Known bit-string representations of the sentence (seg1) and word (seg2) segmentations of text
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"

In [448]:
# Segment text based on location of 1/0 in the segment index
def segment(text, segs):
    segments = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':                  # A '1' marks a break after character i
            segments.append(text[last:i+1]) # Append the segment text[last:i+1]
            last = i+1                      # Move last to the start of the next segment
    segments.append(text[last:])            # Catch the last segment
    return segments                         # Return all identified segments

In [449]:
# Test Segment Known Sentences
segment(text, seg1)

['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']

In [450]:
# Test Segment Known Words
print(segment(text, seg2))

['do', 'you', 'see', 'the', 'kitty', 'see', 'the', 'doggy', 'do', 'you', 'like', 'the', 'kitty', 'like', 'the', 'doggy']
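
Going the other way helps confirm the representation: given a known segmentation, we can rebuild its bit string. The helper ```to_segs``` below is a hypothetical addition for illustration, not part of NLTK.

```python
def to_segs(words):
    """Build the boundary bit string for a list of words (or sentences)."""
    # Each word contributes zeros for its inner characters and a '1' after its last one
    bits = ''.join('0' * (len(word) - 1) + '1' for word in words)
    return bits[:-1]    # nothing follows the final character, so drop the final bit

words = ['do', 'you', 'see', 'the', 'kitty']
segs = to_segs(words)
print(''.join(words))
print(segs)
```

The result is one character shorter than the joined text, matching the ```n-1``` rule described earlier.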


It may have seemed a little abstract at first, but now we can visualize how the segmentation comes together, and the words and sentences are formed using this type of abstraction. Now the segmentation task becomes a search problem: **find the bit string that causes the text string to be correctly segmented into words.**


<img src='https://www.nltk.org/images/brent.png'>

>-  We assume the learner is acquiring words and storing them in an internal lexicon. 
>-  Given a suitable lexicon, it is possible to reconstruct the source text as a sequence of lexical items
>-  We can define an objective function as a scoring function whose value we optimize 
>  - This is based on the size of the lexicon and the amount of information needed to reconstruct the source text
<br>

> **Calculation of Objective Function, Given a hypothetical segmentation of the source text:**
>1.  Derive a lexicon and a derivation table that permit the source text to be reconstructed
>2.  Total up the number of characters used by each lexical item (including a boundary marker), plus the number of lexical items used by each derivation
>  -  This serves as a score of the quality of the segmentation
>  -  Smaller values of the score indicate a better segmentation
>3.  Sum the total of the derived and lexicon table lengths

### Objective Function

In [459]:
# Objective function to evaluate and score a segmentation (bit string) of the text
def evaluate(text, segs):
    words = segment(text, segs) # Segment words
    text_size = len(words)      # Derivation size: number of lexical items used
    
    # Objective Function Calculation (Lexicon Size)
    # =============================================================================
    # sum(len(word) + 1 for word in set(words))
    #
    # len(word)               = num characters used by a lexical item
    # + 1                     = a boundary marker
    # for word in set(words)  = each distinct lexical item, counted once
    #
    lexicon_size = sum(len(word) + 1 for word in set(words))
    return text_size + lexicon_size                          # Derivation size + lexicon size

In [456]:
seg3 = "0000100100000011001000000110000100010000001100010000001"

In [591]:
seg4 = "0001001000000101001011000101001000100010011111100010110"

In [457]:
print(segment(text, seg3))

['doyou', 'see', 'thekitt', 'y', 'see', 'thedogg', 'y', 'doyou', 'like', 'thekitt', 'y', 'like', 'thedogg', 'y']


In [477]:
print(evaluate(text, seg3))
print(evaluate(text, seg2))
print(evaluate(text, seg1))

47
48
64
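
To see where a score comes from, the two parts of the objective function can be reported separately. The helper ```score_breakdown``` is a hypothetical name for this sketch; it applies the same arithmetic as ```evaluate()``` to the word list produced by ```seg2```.

```python
def score_breakdown(words):
    """Return the lexicon, its size (chars + boundary markers), and the derivation size."""
    lexicon = sorted(set(words))                        # each distinct word once
    lexicon_size = sum(len(w) + 1 for w in lexicon)     # characters plus one boundary marker each
    derivation_size = len(words)                        # one entry per word used in the text
    return lexicon, lexicon_size, derivation_size

words = ['do', 'you', 'see', 'the', 'kitty', 'see', 'the', 'doggy',
         'do', 'you', 'like', 'the', 'kitty', 'like', 'the', 'doggy']
lexicon, lexicon_size, derivation_size = score_breakdown(words)
print(lexicon)
print(lexicon_size, '+', derivation_size, '=', lexicon_size + derivation_size)
```

The seven distinct words cost 32 characters including markers, and the 16-word derivation adds 16, which accounts for the score of 48 reported for ```seg2```.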


In [568]:
print(evaluate(text, seg4))

43


The final step is to **search for the pattern of zeros and ones that minimizes this objective function**. We can accomplish this through a process known as **annealing**. 

>Per Katrina Ellison Geltman's <a href='http://katrinaeg.com/simulated-annealing.html'>blog on simulated annealing</a>, simulated annealing is 'a method for finding a good (not necessarily perfect) solution to an optimization problem'. She elaborates:
-  'Broadly, an optimization algorithm searches for the best solution by generating a random initial solution and "exploring" the area nearby
-  If a neighboring solution is better than the current one, then it moves to it
-  If not, then the algorithm stays put.'

Lastly, Katrina gives a very clear basic outline of the algorithm that is required to minimize our objective function through annealing:

1. First, generate a random solution
2. Calculate its cost using some cost function you've defined
3. Generate a random neighboring solution
 -  "Neighboring" means there's only one thing that differs between the old solution and the new solution. Effectively, you switch two elements of your solution and re-calculate the cost.
4. Calculate the new solution's cost
5. Compare them:
  1. If cnew < cold: move to the new solution
     - This is good, smaller is better, so the new one is retained
  2. If cnew > cold: maybe move to the new solution
     -  Sometimes the worse (higher-cost) solution is accepted anyway, so as not to get trapped in a local minimum
     -  This is not implemented in this solution
6. Repeat steps 3-5 above until an acceptable solution is found or you reach some maximum number of iterations.
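
The "maybe move" step that this notebook skips is commonly implemented with an acceptance probability. The sketch below assumes the standard Metropolis form, ```exp((c_old - c_new) / T)```; the function names are illustrative and nothing here is used by the ```anneal()``` function that follows.

```python
import math
import random

def acceptance_probability(c_old, c_new, T):
    """Probability of moving to the new solution at temperature T (lower cost is better)."""
    if c_new < c_old:
        return 1.0                            # always accept an improvement
    return math.exp((c_old - c_new) / T)      # worse moves become less likely as T drops

def accept(c_old, c_new, T, rng=random):
    """Decide whether to move, by comparing the probability with a random draw in [0, 1)."""
    return acceptance_probability(c_old, c_new, T) > rng.random()

for T in (50.0, 5.0, 0.5):
    print(T, round(acceptance_probability(43, 48, T), 4))
```

At a high temperature a solution that is 5 points worse is accepted most of the time; at a low temperature it is almost never accepted, which is what lets the search settle.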

In [569]:
from random import randint

# Flips the bit (1/0) of the segment bit string at index position pos
def flip(segs, pos):
    flipped = str(1-int(segs[pos]))             # Inverts 1/0 at specified position
    return segs[:pos] + flipped + segs[pos+1:]  # Puts inverted value back in string

# Flips n randomly selected indexes in bit string
def flip_n(segs, n):
    for i in range(n):    # Flip random position indexes n times
        segs = flip(segs, randint(0, len(segs)-1))
    return segs           # Generates new random bit string

# 'Heat and Allow to Cool Slowly': A Metaphor for our Optimization Algorithm
#
# Args: text: text
#       segs: initial bit string
# Fixed inside the function:
#       guesses: candidate bit strings tried at each temperature
#       cool: rate at which the temperature is cooled each iteration
#
def anneal(text, segs):
    guesses = 1000                         # Healthy minimum is 100 - 1000
    cool = .95                             # Generally alpha between .8 and .99
    T = float(len(segs))                   # Temperature: starts at the bit-string length

    while T > 0.1:                         # Stop once the temperature is close to zero
        # best_segs = segs                 # Retain segs as best_segs
        # best = evaluate(text, segs)      # Evaluate score of segs
        best_segs, best = segs, evaluate(text, segs)
        
        # Compare random bit strings to current best bit string iteratively
        for i in range(guesses):           
            guess = flip_n(segs, round(T))    # Generate new random bit string
            score = evaluate(text, guess)     # Evaluate score of random bit string
         
            if score < best:                  # If lower score than best is found:
                # best = score                # Retain the new score as best score
                # best_segs = guess           # Retain new segment as best segment
                best, best_segs = score, guess
        
        # Prep for next loop or hit max      
        score, segs = best, best_segs         # Save score and segs
        T = T * cool                          # Cool temperature for next loop
        
        print(evaluate(text, segs)            # Display report of evaluation score
            , segment(text, segs))            # and segmented words each cooloff
    print()
    return segs

In [592]:
bestseg = anneal(text, seg4)
print(bestseg)

82 ['d', 'oyo', 'u', 'se', 'e', 't', 'h', 'e', 'ki', 't', 'tyseet', 'hedoggydoyoulik', 'e', 'thekit', 't', 'ylikethe', 'doggy']
78 ['d', 'oyous', 'e', 'e', 't', 'heki', 't', 'ty', 'seethe', 'doggy', 'doy', 'ouliketh', 'e', 'k', 'itt', 'ylikethe', 'doggy']
78 ['d', 'oyous', 'e', 'e', 't', 'heki', 't', 'ty', 'seethe', 'doggy', 'doy', 'ouliketh', 'e', 'k', 'itt', 'ylikethe', 'doggy']
78 ['d', 'oyous', 'e', 'e', 't', 'heki', 't', 'ty', 'seethe', 'doggy', 'doy', 'ouliketh', 'e', 'k', 'itt', 'ylikethe', 'doggy']
78 ['d', 'oyous', 'e', 'e', 't', 'heki', 't', 'ty', 'seethe', 'doggy', 'doy', 'ouliketh', 'e', 'k', 'itt', 'ylikethe', 'doggy']
78 ['d', 'oyous', 'e', 'e', 't', 'heki', 't', 'ty', 'seethe', 'doggy', 'doy', 'ouliketh', 'e', 'k', 'itt', 'ylikethe', 'doggy']
77 ['do', 'you', 's', 'eethekit', 'tyseeth', 'e', 'do', 'ggydo', 'you', 'like', 't', 'hekit', 't', 'y', 'likethedoggy']
77 ['do', 'you', 's', 'eethekit', 'tyseeth', 'e', 'do', 'ggydo', 'you', 'like', 't', 'hekit', 't', 'y', 'likethe

43 ['doyou', 'se', 'ethekitty', 'se', 'ethedoggy', 'doyou', 'lik', 'ethekitty', 'lik', 'ethedoggy']
43 ['doyou', 'se', 'ethekitty', 'se', 'ethedoggy', 'doyou', 'lik', 'ethekitty', 'lik', 'ethedoggy']
43 ['doyou', 'se', 'ethekitty', 'se', 'ethedoggy', 'doyou', 'lik', 'ethekitty', 'lik', 'ethedoggy']
43 ['doyou', 'se', 'ethekitty', 'se', 'ethedoggy', 'doyou', 'lik', 'ethekitty', 'lik', 'ethedoggy']
43 ['doyou', 'se', 'ethekitty', 'se', 'ethedoggy', 'doyou', 'lik', 'ethekitty', 'lik', 'ethedoggy']
43 ['doyou', 'se', 'ethekitty', 'se', 'ethedoggy', 'doyou', 'lik', 'ethekitty', 'lik', 'ethedoggy']
43 ['doyou', 'se', 'ethekitty', 'se', 'ethedoggy', 'doyou', 'lik', 'ethekitty', 'lik', 'ethedoggy']
43 ['doyou', 'se', 'ethekitty', 'se', 'ethedoggy', 'doyou', 'lik', 'ethekitty', 'lik', 'ethedoggy']
43 ['doyou', 'se', 'ethekitty', 'se', 'ethedoggy', 'doyou', 'lik', 'ethekitty', 'lik', 'ethedoggy']
43 ['doyou', 'se', 'ethekitty', 'se', 'ethedoggy', 'doyou', 'lik', 'ethekitty', 'lik', 'ethedoggy']


Notice that the best segmentation found in this run includes "words" like ```ethekitty```, since there's not enough evidence in the data to split this any further. 

This performs a *Non-Deterministic Search Using Simulated Annealing*: the model begins searching with phrase segmentations only, then randomly perturbs the zeros and ones in proportion to the "temperature"
-  With each iteration the temperature is lowered and the perturbation of boundaries is reduced
-  With enough data, it is possible to automatically segment text into words with a reasonable degree of accuracy
-  Such methods can be applied to tokenization for writing systems that don't have any visual representation of word boundaries
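
The link between temperature and perturbation can be made visible by listing how many bits ```flip_n()``` is asked to flip at each cooling step. This sketch reuses the constants from ```anneal()``` above (start at the bit-string length, multiply by 0.95, stop at 0.1); the function name ```flips_per_step``` is hypothetical.

```python
def flips_per_step(n_bits, cool=0.95, t_min=0.1):
    """Return round(T) for every temperature T visited by the cooling loop."""
    T = float(n_bits)
    schedule = []
    while T > t_min:
        schedule.append(round(T))   # number of random flips per guess at this temperature
        T *= cool
    return schedule

schedule = flips_per_step(55)
print(len(schedule), 'cooling steps')
print(schedule[:5], '...', schedule[-5:])
```

Early steps flip dozens of boundaries at once, while the last steps round down to zero flips, so by then every guess is identical to the current segmentation and the result can no longer change.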

In [493]:
bestest = bestseg
print(bestest)

0000100000100001000001000010000100000010000100000010000


## Formatting: An Exploration of Outputs

Often we write a program to report a single data item, such as a particular element in a corpus that meets some complicated criterion, or a single summary statistic such as a word-count or the performance of a tagger. 

More often, we write a program to produce a structured result; for example, a tabulation of numbers or linguistic forms, or a reformatting of the original data. When the results to be presented are linguistic, textual output is usually the most natural choice. Here, we explore a variety of ways to format data.

### Lists to Strings

One of the simplest formatting conversions is using ```join()``` to merge a list into a single string. However, the result doesn't have to be one unbroken string: whatever we want to separate the items with can be used as the delimiter, given as the string that ```join()``` is called on.

In [589]:
print(''.join(re_split_raw[100:150]))
print()
print('|'.join(re_split_raw[100:150]))

pleasureofmakingadaisy-chainwouldbeworththetroubleofgettingupandpickingthedaisies,whensuddenlyaWhiteRabbitwithpinkeyesranclosebyher.Therewasnothingsoveryremarkableinthat,nordidAlicethinkitsoverymuchoutof

pleasure|of|making|a|daisy-chain|would|be|worth|the|trouble|of|getting|up|and|picking|the|daisies|,|when|suddenly|a|White|Rabbit|with|pink|eyes|ran|close|by|her|.|There|was|nothing|so|very|remarkable|in|that|,|nor|did|Alice|think|it|so|very|much|out|of


### String Formats

There are two ways to display the contents of an object. The ```print``` command yields Python's attempt to produce the most human-readable form of an object. The second method — *naming the variable at a prompt* — shows us a string that can be used to recreate this object. 

It is important to keep in mind that both of these are just strings, displayed for the benefit of us, the users. They do not give us any clue as to the actual internal representation of the object. There are many other ways to represent an object as a string of characters. This may be for the benefit of a human reader, or because we want to export our data to a particular file format for use in an external program.
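
The two display forms correspond to Python's built-in ```str()``` and ```repr()``` functions, which can be called directly to compare them. The date and word below are arbitrary sample values.

```python
import datetime

stamp = datetime.date(2018, 6, 5)
print(str(stamp))     # human-readable form, what print() shows
print(repr(stamp))    # form shown when naming the variable at a prompt

word = 'kitty\n'
print(str(word))      # the newline is printed as an actual line break
print(repr(word))     # the newline is shown as an escape sequence, inside quotes

# For many built-in types, evaluating the repr string recreates the object
rebuilt = eval(repr(stamp))
print(rebuilt == stamp)
```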

In [593]:
# Formatted output containing a combination of variables and pre-specified strings
# Using print() syntax
fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
for word in sorted(fdist):
    print(word, '->', fdist[word], end='; ')

cat -> 3; dog -> 4; snake -> 1; 

Another solution, which we used quietly in some examples above, is the ```format()``` method, which has an extensive customization syntax. 

The curly brackets '{}' mark the presence of a replacement field: this acts as a placeholder for the string values of objects that are passed to the ```format()``` method. We can embed occurrences of '{}' inside a string, then replace them with strings by calling ```format()``` with appropriate arguments. A string containing replacement fields is called a format string.
-  We can have any number of placeholders, but the ```str.format``` method must be called with at least that many arguments; too few raises an ```IndexError```, while extra arguments are ignored

In [594]:
# Formatted output containing a combination of variables and pre-specified strings
# Using format() function
for word in sorted(fdist):
    print('{0}->{1};'.format(word, fdist[word]), end=' ')

cat->3; dog->4; snake->1; 

In [596]:
# Iterative format() template fill
template = 'Lee wants a {} right now'
menu = ['sandwich', 'pizza', 'hot potato']

for snack in menu:
    print(template.format(snack))

Lee wants a sandwich right now
Lee wants a pizza right now
Lee wants a hot potato right now
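
The relationship between placeholders and arguments is easy to check directly. This short sketch uses a made-up three-field template and shows what happens with too few and too many arguments.

```python
template = '{} wants a {} {}'
print(template.format('Lee', 'sandwich', 'for lunch'))

# Too few arguments: the third placeholder has nothing to consume
try:
    template.format('Lee', 'sandwich')
    too_few_failed = False
except IndexError as err:
    too_few_failed = True
    print('IndexError:', err)

# Too many arguments: the surplus is silently ignored
surplus = '{} wants a {}'.format('Lee', 'sandwich', 'for lunch')
print(surplus)
```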


#### Alignment

Output can be hard to make sense of if it isn't lined up properly on the screen, so you can use ```format``` to align your output with padding: put a ```:``` between the curly brackets, followed by an integer giving the field width. The integer can be preceded by ```<``` or ```>``` to specify left or right justification. By default, numbers are right-justified and strings are left-justified.

In [605]:
print('{:6}Hi'.format(42))
print('{:<6}Hi' .format(42))

    42Hi
42    Hi
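
A few more variations on the same idea, with square brackets added so the padding is visible. Centering with ```^```, a fill character before the alignment symbol, and a nested ```{width}``` field (the form used in the ```concordance()``` method earlier) are all part of Python's standard format syntax; the values are arbitrary.

```python
print('[{:6}]'.format(42))        # numbers are right-justified by default
print('[{:6}]'.format('ab'))      # strings are left-justified by default
print('[{:>6}]'.format('ab'))     # force right justification
print('[{:^6}]'.format('ab'))     # center within the field
print('[{:*>6}]'.format(42))      # pad with '*' instead of spaces

# The width itself can be supplied as an argument through a nested field
width = 8
print('[{:{width}}]'.format('Monty', width=width))
```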


#### Precision

Other special formatting can be utilized to handle precision or recognition of various number types. The string formatting is smart enough to know that if you include a '%' in your format specification, then you want to represent the value as a percentage, so you don't need to do extra multiplication.

In [609]:
# Pi to 10 digits of Precision
import math
print('{:.10f}'.format(math.pi))
print('{:0g}'.format(2e10))

count, total = 8500, 9000
print("Accuracy for {} words: {:.4%}".format(total, count / total))

3.1415926536
2e+10
Accuracy for 9000 words: 94.4444%


### Tabulating Data

We saw what a tabulation looks like when produced from a conditional frequency distribution, but it's a good idea to have a function of our own in our toolkit that tabulates word distributions across topics.

In [766]:
# Tabulate Word Distribution Across Topics
def tabulateFile(name, cfdist, words, topics):
    path = '{0}_tabulate.txt'.format(name)

    # Write the table; the with-block closes the file even if an error occurs
    with open(path, 'w') as output:
        print('{:16}'.format('Topic')
              , end=' ', file=output)            # column headings

        for word in words:
            print('{:>6}'.format(word), end=' '
                  , file=output)                 # print to output file
        print(file=output)

        for topic in topics:
            print('{:16}'.format(topic)
                  , end=' ', file=output)        # row heading
            for word in words:                   # for each word
                print('{:6}'.format(cfdist[topic][word])
                      , end=' ', file=output)    # print table cell
            print(file=output)

    # Display printed results from file
    with open(path, 'r') as t:
        print(t.read())

# For comparison, we could also format this in a manner that is dataframe ready
def tabulateReturn(cfdist, words, topics):
    tabbed = []
    # Row Headings
    for topic in topics:
        row = []
        row.append('{:<16}'.format(topic))         # Uses topics as row names
        # For each word
        for word in words:                         # Fill table with
            row.append('{:5}'                      # word frequency in topic
                       .format(cfdist[topic][word]))  
        tabbed.append(row)
    return tabbed

In [768]:
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
           (genre, word)
           for genre in brown.categories()
           for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']


tabulateFile('brown', cfd, modals, genres)

df_brown = pd.DataFrame(tabulateReturn(cfd, modals, genres)
                        , columns=('Topic', 'can', 'could', 'may', 'might', 'must', 'will'))

Topic               can  could    may  might   must   will 
news                 93     86     66     38     50    389 
religion             82     59     78     12     54     71 
hobbies             268     58    131     22     83    264 
science_fiction      16     49      4     12      8     16 
romance              74    193     11     51     45     43 
humor                16     30      8      8      9     13 
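
The fixed widths of 16 and 6 used above fit this particular data. As a hedged variation, the sketch below computes the widths from the data and returns the table as a string instead of writing a file. The name `tabulate_text` and the `toy` counts are hypothetical, and a plain dict of `Counter` objects stands in for the `ConditionalFreqDist`.

```python
from collections import Counter

def tabulate_text(counts, words, topics):
    """Return the table as one string; `counts` maps topic -> Counter of words."""
    label_width = max(len(t) for t in list(topics) + ['Topic'])
    # One column of spacing plus the widest heading or count
    cell_width = 1 + max(max(len(w) for w in words),
                         max(len(str(counts[t][w])) for t in topics for w in words))
    header = '{:<{lw}}'.format('Topic', lw=label_width)
    header += ''.join('{:>{cw}}'.format(w, cw=cell_width) for w in words)
    lines = [header]
    for topic in topics:
        cells = ''.join('{:>{cw}}'.format(counts[topic][w], cw=cell_width)
                        for w in words)
        lines.append('{:<{lw}}'.format(topic, lw=label_width) + cells)
    return '\n'.join(lines)

# Tiny made-up counts, standing in for a ConditionalFreqDist
toy = {'news': Counter(can=93, will=389), 'humor': Counter(can=16, will=13)}
table = tabulate_text(toy, ['can', 'will'], ['news', 'humor'])
print(table)
```

Nested replacement fields such as `{:>{cw}}` let the width itself be passed in as an argument, which is what makes the column size adjustable.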



In [764]:
df_brown

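| | Topic | can | could | may | might | must | will |
|---|---|---|---|---|---|---|---|
| 0 | news | 93 | 86 | 66 | 38 | 50 | 389 |
| 1 | religion | 82 | 59 | 78 | 12 | 54 | 71 |
| 2 | hobbies | 268 | 58 | 131 | 22 | 83 | 264 |
| 3 | science_fiction | 16 | 49 | 4 | 12 | 8 | 16 |
| 4 | romance | 74 | 193 | 11 | 51 | 45 | 43 |
| 5 | humor | 16 | 30 | 8 | 8 | 9 | 13 |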


### Text Wrapping 

When the output of our program is text-like, instead of tabular, it will usually be necessary to wrap it so that it can be displayed conveniently. We can take care of line wrapping with the help of Python's ```textwrap``` module. For maximum clarity we will separate each step onto its own line.

In [770]:
saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
          'more', 'is', 'said', 'than', 'done', '.']

# Prints words, followed by their length in order
for word in saying:
    print(word, '(' + str(len(word)) + '),', end=' ')

After (5), all (3), is (2), said (4), and (3), done (4), , (1), more (4), is (2), said (4), than (4), done (4), . (1), 

In [772]:
from textwrap import fill

fmt = '%s_(%d),'         # named fmt to avoid shadowing the built-in format()
pieces = [fmt % (word, len(word)) for word in saying]
output = ' '.join(pieces)
wrapped = fill(output)
print(wrapped.replace('_', ' '))

After (5), all (3), is (2), said (4), and (3), done (4), , (1),
more (4), is (2), said (4), than (4), done (4), . (1),
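
The underscore in the cell above keeps each word attached to its length, so that the wrapping cannot split a line between the two. It is swapped back to a space after wrapping. A short sketch of the same idea with an explicit `width` (40 is an arbitrary choice here; the `textwrap` default is 70):

```python
from textwrap import fill, wrap

saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
          'more', 'is', 'said', 'than', 'done', '.']
pieces = ['{}_({}),'.format(word, len(word)) for word in saying]
joined = ' '.join(pieces)

# width caps the length of every output line
narrow = fill(joined, width=40).replace('_', ' ')
print(narrow)

# wrap() returns the lines as a list instead of one joined string
lines = wrap(joined, width=40)
```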


## Additional Exercises

In [847]:
# Write a statement that changes s to "colourless" 
# using only the slice and concatenation operations
s = 'colorless'
s = s[:4] + 'u' + s[4:]
print(s)

colourless


In [779]:
# Use slice notation to remove the affixes from these words:
words = ['dishes', 'running', 'nationality', 'undo', 'preheat']

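print(words[0][:-2])   # dish-es
print(words[1][:-4])   # run-ning
print(words[2][:-5])   # nation-ality
print(words[3][2:])    # un-do: the affix is a prefix, so slice from the front
print(words[4][3:])    # pre-heat

dish
run
nation
do
heat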


In [784]:
# What happens if you ask the interpreter to evaluate monty[::-1]?
monty = 'Monty Python'

# Reverses a string
print(monty[::-1])

nohtyP ytnoM
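
The third slice value is the step, so `[::-1]` walks the whole string backwards. A brief sketch of a few related slices, plus a hypothetical `is_palindrome` helper that relies on the same reversal:

```python
monty = 'Monty Python'

every_other = monty[::2]     # start to end, taking every second character
last_word   = monty[-6:]     # a negative start counts from the end
backwards   = monty[::-1]    # step -1 walks the string in reverse

def is_palindrome(s):
    """True when the letters of s read the same in both directions."""
    letters = [ch.lower() for ch in s if ch.isalpha()]
    return letters == letters[::-1]

print(every_other, last_word, backwards)
print(is_palindrome('A man, a plan, a canal: Panama'))
```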


In [843]:
# Write a utility function that takes a URL as its argument, 
# and returns the contents of the URL, with all HTML markup removed

def textFromHTML(url):
    from urllib import request 
    from bs4 import BeautifulSoup
    
    html = request.urlopen(url).read().decode('utf8')
    html_raw = BeautifulSoup(html, "lxml")
    tokens = word_tokenize(html_raw.get_text())
    html_text = nltk.Text([w.lower() for w in tokens if w.isalpha() and len(w) > 3])
    return html_text.tokens

In [796]:
textFromHTML('https://en.wikipedia.org/wiki/Contoso')

['contoso',
 'wikipedia',
 'function',
 'wgcanonicalnamespace',
 'wgcanonicalspecialpagename',
 'false',
 'wgnamespacenumber',
 'wgpagename',
 'contoso',
 'wgtitle',
 'contoso',
 'wgcurrevisionid',
 'wgrevisionid',
 'wgarticleid',
 'wgisarticle']
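
The `wg...` tokens in this output look like JavaScript variable names, which suggests that the contents of `<script>` elements survive the markup removal. One possible way to drop them, sketched here with the standard library's `html.parser` rather than BeautifulSoup, is to ignore text while inside `<script>` or `<style>`. The class and function names, and the `sample` page, are hypothetical.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""
    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style
        if not self.depth and data.strip():
            self.parts.append(data.strip())

def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.parts)

# Made-up page, small enough to check by eye
sample = ('<html><head><script>var wgPageName = "Contoso";</script></head>'
          '<body><h1>Contoso</h1><p>A <b>fictional</b> company.</p></body></html>')
print(visible_text(sample))
```

The returned string could then be passed to `word_tokenize` in the same way as `get_text()` is above.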

In [842]:
### Read in some text from a corpus, tokenize it, and print the list of all wh-word types that occur.
#
# (wh-words in English are used in questions, relative clauses and exclamations: 
# who, where, which, what, why, and so on; how is included here as well)

q = ['what', 'when', 'where', 'while', 'who', 'why', 'which', 'whom', 'whose', 'how']
qfd = nltk.FreqDist([word for word in monty if word in q])
display(qfd)

FreqDist({'how': 6,
          'what': 27,
          'when': 12,
          'where': 6,
          'which': 10,
          'while': 2,
          'who': 21,
          'whom': 4,
          'whose': 2,
          'why': 3})

In [49]:
# Are you able to write a regular expression to tokenize text in such a way 
# that the word don't is tokenized into do and n't? 
# Explain why this regular expression won't work: «n't|\w+»

contraction_match = r"[a-z]+n't"   # lowercase letters followed by n't

contractions = re.findall(contraction_match, text)

# Split each match into the stem and the trailing n't
tokens = [(word[:-3], word[-3:]) for word in contractions]
print(tokens)

[('ca', "n't"), ('ca', "n't"), ('ca', "n't"), ('wo', "n't"), ('had', "n't"), ('ca', "n't"), ('could', "n't"), ('ca', "n't"), ('could', "n't"), ('ca', "n't"), ('could', "n't"), ('o', "n't"), ('ca', "n't"), ('could', "n't"), ('ca', "n't"), ('could', "n't"), ('did', "n't"), ('ca', "n't"), ('ca', "n't"), ('do', "n't"), ('did', "n't"), ('could', "n't"), ('was', "n't"), ('could', "n't"), ('do', "n't"), ('wo', "n't"), ('ca', "n't"), ('do', "n't"), ('do', "n't"), ('o', "n't"), ('dare', "n't"), ('dare', "n't"), ('is', "n't"), ('did', "n't"), ('did', "n't"), ('did', "n't"), ('was', "n't"), ('was', "n't"), ('was', "n't"), ('was', "n't"), ('did', "n't"), ('could', "n't"), ('ca', "n't"), ('a', "n't"), ('could', "n't"), ('did', "n't"), ('do', "n't"), ('do', "n't"), ('could', "n't"), ('was', "n't"), ('ai', "n't"), ('ai', "n't"), ('o', "n't"), ('ai', "n't"), ('ai', "n't"), ('ai', "n't"), ('ai', "n't"), ('do', "n't"), ('did', "n't"), ('could', "n't"), ('did', "n't"), ('do', "n't"), ('do', "n't"), ('did
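
On the second half of the question: the regex engine scans from left to right and, at each position, tries the alternatives in order. At the start of `don't` the alternative `n't` cannot match at `d`, so `\w+` greedily consumes `don`, and the leftover `'t` can no longer be matched as `n't`. A minimal sketch of the problem and of one possible fix using a lookahead (the `sample` sentence is made up):

```python
import re

sample = "I don't think they can't; it isn't done."   # made-up test sentence

# \w+ grabs 'don' before n't gets a chance to match, leaving a stray 't'
naive = re.findall(r"n't|\w+", sample)
print(naive)

# Try a word that is directly followed by n't first, then n't itself,
# then any other word
pattern = r"\w+(?=n't)|n't|\w+"
tokens = re.findall(pattern, sample)
print(tokens)
```

The lookahead `(?=n't)` forces `\w+` to back off until what follows is `n't`, without consuming it, so the next match can pick up `n't` as its own token.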

#### Pig Latin is a simple transformation of English text. 
1. Write a function to convert a word to Pig Latin.
2. Write code that converts text, instead of individual words.
3. Extend it further to preserve capitalization, to keep qu together (i.e. so that quiet becomes ietquay), and to detect when y is used as a consonant (e.g. yellow) vs a vowel (e.g. style).

In [23]:
# Each word of the text is converted as follows: 
# Move any consonant (or consonant cluster) that appears at the start 
# of the word to the end, then append ay, 
# e.g. string → ingstray, idle → idleay

def convertToPigLatin(text):
    out = []
    # Leading consonant cluster; 'qu' is moved as a unit.
    # y is always treated as a vowel here (not yet handled as a consonant).
    onset = '^(?:qu|[bcdfghjklmnpqrstvwxz])+'
    for w in text:
        front = re.match(onset, w, flags=re.I)
        if front:
            # Move the whole cluster to the end of the word, then append ay
            piggify = '{0}{1}ay'.format(w[front.end():], front.group(0))
        else:
            piggify = '{0}ay'.format(w)
        out.append(piggify)
    return out


In [24]:
convertToPigLatin(bland)

['ewlynay',
 'ormedfay',
 'andblay',
 'ideasay',
 'areay',
 'inexpressibleay',
 'inay',
 'anay',
 'infuriatingay',
 'ayway']

In [14]:
[word for word in bland]

['newly',
 'formed',
 'bland',
 'ideas',
 'are',
 'inexpressible',
 'in',
 'an',
 'infuriating',
 'way']
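
For the third part of the Pig Latin exercise, here is a hedged sketch rather than a complete solution. It assumes a simple heuristic: y counts as a consonant only at the very start of a word, `qu` moves as a unit, and an initial capital is carried over to the converted word. The names `pig_latin_word` and `pig_latin_text` are hypothetical.

```python
import re

# Assumed heuristic: optional leading y, then consonants with 'qu' kept together
ONSET = re.compile(r'^y?(?:qu|[bcdfghjklmnpqrstvwxz])*', flags=re.I)

def pig_latin_word(word):
    onset = ONSET.match(word).group(0)
    rest = word[len(onset):]
    if not rest:                         # no vowel found: just append ay
        return word + 'ay'
    result = rest + onset.lower() + 'ay'
    if word[0].isupper():                # preserve an initial capital
        result = result.capitalize()
    return result

def pig_latin_text(text):
    # Convert alphabetic runs only, leaving spaces and punctuation in place
    return re.sub(r'[A-Za-z]+', lambda m: pig_latin_word(m.group(0)), text)

print(pig_latin_text('Quiet yellow style!'))
```

Words written entirely in capitals come out with only the first letter capitalized, which is one of several edge cases this heuristic does not try to handle.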

**Define a variable ```silly``` to contain the string: ```'newly formed bland ideas are inexpressible in an infuriating way'```** 

According to Wikipedia, this happens to be the legitimate interpretation that bilingual English-Spanish speakers can assign to Chomsky's famous nonsense phrase, ```colorless green ideas sleep furiously```. 
Now write code to perform the following tasks:

1. Split silly into a list of strings, one per word, using Python's split() operation, and save this to a variable called bland.
2. Extract the second letter of each word in silly and join them into a string, to get 'eoldrnnnna'.
3. Combine the words in bland back into a single string, using join(). Make sure the words in the resulting string are separated with whitespace.
4. Print the words of silly in alphabetical order, one per line.

In [5]:
silly = 'newly formed bland ideas are inexpressible in an infuriating way'
bland = silly.split(' ') #1 Check
print(bland)

s = ''
for word in bland:
    s += word[1]         #2 Check 
print(s)
    
join = ' '.join(bland)   #3 Check
print(join)

for word in sorted(bland):
    print(word)          #4 Check

['newly', 'formed', 'bland', 'ideas', 'are', 'inexpressible', 'in', 'an', 'infuriating', 'way']
eoldrnnnna
newly formed bland ideas are inexpressible in an infuriating way
an
are
bland
formed
ideas
in
inexpressible
infuriating
newly
way


### Soundex Algorithm

According to Wikipedia, Soundex is a phonetic algorithm for **indexing names by sound**, as pronounced in English, and it has been in use since the 1930s. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. 

The Soundex code for a name consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants.

The correct value can be found as follows:

1. Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
2. Replace consonants with digits as follows (after the first letter):
  - b, f, p, v → 1
  - c, g, j, k, q, s, x, z → 2
  - d, t → 3
  - l → 4
  - m, n → 5
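  - r → 6
3. Collapse adjacent letters that share the same digit into a single digit; letters separated only by h or w count as adjacent.
4. Keep the first three digits, padding with zeros if there are fewer than three.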

In [900]:
# Implementation of Soundex Name Indexing Algorithm in Python
def soundex(name):
    import re
    
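    output = name[0].upper()               # Retain the first letter of the name
    # Drop all occurrences of h, w after the first letter, so that a
    # leading H or W is still in place when the first character is chopped off
    name = name[0] + re.sub('[hw]', '', name[1:], flags=re.I)  # re.I ignores case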
    
    # Replace consonants with their respective digits
    name = re.sub('[bfpv]+', '1', name, flags=re.I)
    name = re.sub('[cgjkqsxz]+', '2', name, flags=re.I)
    name = re.sub('[dt]+', '3', name, flags=re.I)
    name = re.sub('l+', '4', name, flags=re.I)
    name = re.sub('[mn]+', '5', name, flags=re.I)
    name = re.sub('r+', '6', name, flags=re.I)
    
    # Chop off the first character (the retained letter, or the digit it became)
    name = name[1:]
    # Drop all occurrences of a, e, i, o, u, y
    name = re.sub('[aeiouy]', '', name, flags=re.I)
    # Append first 3 digits of transformed text to output
    output += name[:3]
    
    # Fill with 0s in the event there are < 4 chars
    if len(output) < 4:
        output += '0'*(4-len(output))
    return str(output)    

In [901]:
print(soundex('Whitney'))
print(soundex('Stephen'))
print(soundex('Mike'))
print(soundex('Gloria'))
print(soundex('Nat'))

W350
S315
M200
G460
N300
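
As a cross-check on the regex version, here is a table-driven sketch of the same rules. The name `soundex_sketch` is hypothetical, the function assumes a non-empty alphabetic name, and it is offered as an independent comparison rather than a reference implementation.

```python
# Build the letter -> digit table from the groups listed above
CODES = {}
for letters, digit in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                       ('l', '4'), ('mn', '5'), ('r', '6')]:
    for ch in letters:
        CODES[ch] = digit

def soundex_sketch(name):
    name = name.lower()
    first = name[0]
    digits = []
    prev = CODES.get(first, '')    # the first letter's code also blocks a repeat
    for ch in name[1:]:
        if ch in 'hw':
            continue               # h and w do not break a run of equal codes
        code = CODES.get(ch, '')   # vowels and y map to '' and end the run
        if code and code != prev:
            digits.append(code)
        prev = code
    # First letter, then three digits, padded with zeros when needed
    return (first.upper() + ''.join(digits) + '000')[:4]

for name in ['Whitney', 'Stephen', 'Mike', 'Gloria', 'Nat']:
    print(name, soundex_sketch(name))
```

Tracking the previous code in `prev` handles the adjacency rule directly: a repeated digit is skipped unless a vowel came in between.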


### Language Identification 

With the help of a multilingual corpus such as the Universal Declaration of Human Rights Corpus (nltk.corpus.udhr), and NLTK's frequency distribution and rank correlation functionality (nltk.FreqDist, nltk.spearman_correlation), develop a system that guesses the language of a previously unseen text.

**Conclusions:** The system is functional, but it doesn't do a great job when presented with more than a handful of languages at a time, especially when several of those languages have very similar roots (e.g. German and Dutch, or Spanish and Portuguese). It's also very sensitive to the length of the input text and to the characters it uses.

In [275]:
# Ingests specified language corpora from UDHR (fileids):
# Returns a tokenized frequency distribution for each to use as
# a base from which to calculate correlation to the unseen text

from nltk import FreqDist
from nltk.corpus import udhr
from nltk.metrics.spearman import ranks_from_sequence, spearman_correlation
from nltk.corpus import gutenberg

def indexlangs():   
    # Gather list of available language corpora
    langs = udhr.fileids()
    
    # Match Latin1 encoded languages
    L = ['French_Francais-Latin1', 'Spanish-Latin1', 'German_Deutsch-Latin1' , 'English-Latin1']
    langs = [fileid for fileid in langs if fileid in L]
    langs = list(enumerate(langs)) # Easy-nav index of languages
    
    udhr_corpus = []
    for lang in langs:
        udhr_corpus.append(processText(udhr.words(lang[1])))
    
    fd_lang = []
    rank_lang = []
    for lang in udhr_corpus:
        fq, rk = freqDistRank(lang)
        fd_lang.append(fq)
        rank_lang.append(rk)
    return langs, rank_lang

def indexUnseenText(text):
    # Ingests unseen texts, returns tokenized frequency distribution
    unseen = processText(text)
    
    # Get FreqDist, and Ranks for text
    fd_unseen, rank_unseen = freqDistRank(text)
    return unseen, rank_unseen

def processText(text):
    # Tokenize each letter of each word, in each language
    text = [list(word.lower()) for word in text if str(word).isalpha()]
    
    # Flatten the per-word character lists, creating one 
    # long tokenized list of characters for each language
    text = [char for word in text for char in word] 
    return text

def freqDistRank(text):
    # Returns FreqDist then converts that distribution
    # to a Spearman ranking by score
    
    dist = FreqDist(text)
    rank = list(ranks_from_sequence(dist))
    return dist, rank

def correlation(unseen_ranks, lang_ranks):
    # Calculates Spearman correlation between unseen text
    # ranks and the ranks of the indexed langs
    sc = []
    for lang in lang_ranks:
        cor = spearman_correlation(lang, unseen_ranks)
        sc.append(cor)
    return sc

def matchLanguage(text):
    # Calls all the functions to correlate unseen text to languages
    langs, lang_ranks = indexlangs()
    unseen, unseen_ranks = indexUnseenText(text)
    cor = correlation(unseen_ranks, lang_ranks)
    
    # Compiles languages and spearman correlations into one list
    zippy = list(zip(langs, cor))
    
    # Sorts the lists, and returns the highest value correlation
    zippy = sorted(zippy, key=lambda x: x[1]).pop()

    return zippy


In [276]:
spanish = "Dime con quién andas, y te diré quién eres"
french = "Chacun voit midi à sa porte"
german = "Des Teufels liebstes Möbelstück ist die lange Bank"
english = "If you are going through hell, keep going."

display(matchLanguage(spanish))
display(matchLanguage(french))
display(matchLanguage(german))
display(matchLanguage(english))
# 50% success rate... blah

((2, 'German_Deutsch-Latin1'), -0.6401098901098901)

((1, 'French_Francais-Latin1'), 0.08970588235294119)

((0, 'English-Latin1'), -0.6373626373626373)

((0, 'English-Latin1'), -0.5053571428571428)
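
To make the ranking step easier to inspect, here is a self-contained sketch of the same idea in pure Python. It is a stand-in for `nltk.FreqDist` and `nltk.spearman_correlation`, not NLTK's implementation, and all function names are hypothetical. One detail that seems to matter: the unseen text and each reference text go through the same normalization (lowercase, alphabetic characters only) before they are ranked.

```python
from collections import Counter

def char_ranks(text):
    """Map each alphabetic character to its frequency rank (0 = most common)."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))  # ties: alphabetical
    return {ch: rank for rank, ch in enumerate(ordered)}

def spearman(ranks_a, ranks_b):
    """Spearman's rho over the characters present in both rankings."""
    shared = sorted(ranks_a.keys() & ranks_b.keys())
    n = len(shared)
    if n < 2:
        return 0.0
    # Re-rank within the shared characters so both rankings run 0..n-1
    pos_a = {ch: i for i, ch in enumerate(sorted(shared, key=ranks_a.get))}
    pos_b = {ch: i for i, ch in enumerate(sorted(shared, key=ranks_b.get))}
    d2 = sum((pos_a[ch] - pos_b[ch]) ** 2 for ch in shared)
    return 1 - 6 * d2 / (n * (n * n - 1))

def guess_language(text, samples):
    """Return the best-scoring language and all scores; samples maps name -> text."""
    unseen = char_ranks(text)
    scores = {lang: spearman(char_ranks(sample), unseen)
              for lang, sample in samples.items()}
    return max(scores, key=scores.get), scores

# Short reference snippets (opening words of UDHR Article 1); real use would
# draw on much longer texts, such as the udhr corpus files
samples = {
    'english': 'all human beings are born free and equal in dignity and rights',
    'spanish': 'todos los seres humanos nacen libres e iguales en dignidad y derechos',
}
best, scores = guess_language('All human beings are born free', samples)
print(best, scores)
```

With inputs this short, the character ranks are dominated by ties and by whichever letters happen to occur, which fits the sensitivity to length noted in the conclusions above.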

In [179]:
# Whoops, I wrote my own Frequency Distribution function

#freqs = {}
#for lang in fids:
#    freqs[lang] = {}
#    for word in udhr.words(lang):
#        if word.isalpha():
#            for char in word.lower():
#                if char in freqs[lang]:
#                    freqs[lang][char] += 1
#                else: freqs[lang][char] = 1
#display(freqs)           