# Natural Language Processing with Python
## Chapter 3 Processing Raw Text
### 3.1 Accessing Text from the Web and from Disk
#### Electronic books
#### Dealing with HTML
#### Processing Search Engine Results
#### Processing RSS feeds
#### Reading local files
#### Extracting Text from PDF, MSWord and other Binary Formats
#### Capturing User Input
#### The NLP Pipeline
### 3.2 Strings: Text Processing at the Lowest Level
#### Basic Operations with Strings
#### Printing Strings
#### Accessing Individual Characters
#### Accessing Substrings
#### More operations on strings
#### The Difference between Lists and Strings
### 3.3 Text Processing with Unicode
#### What is Unicode?
#### Extracting encoded text from files
#### Using your local encoding in Python
### 3.4 Regular Expressions for Detecting Word Patterns
#### Using Basic Meta-Characters
#### Ranges and Closures
### 3.5 Useful Applications of Regular Expressions
#### Extracting Word Pieces
#### Doing More with Word Pieces

In [1]:
import nltk
import re
words = nltk.corpus.toolbox.words('rotokas.dic')
CosVow = [j for i in words for j in re.findall(r'[ptksvr][aeiou]',i) ]
CosVowCFD = nltk.ConditionalFreqDist(CosVow)
CosVowCFD.tabulate()

    a   e   i   o   u 
k 418 148  94 420 173 
p  83  31 105  34  51 
r 187  63  84  89  79 
s   0   0 100   2   1 
t  47   8   0 148  37 
v  93  27 105  48  49 


In [8]:
CosVowWord = [(cv,w)  for w in words for cv in re.findall(r'[ptksvr][aeiou]',w)]
CosVowIndex = nltk.Index(CosVowWord)
CosVowIndex['pa']

['kaakuupato',
 'kaapa',
 'kaapiepato',
 'kaapopato',
 'kaepaa',
 'Kakarapaia',
 'kakupaa',
 'kakuparei',
 'kakupato',
 'kapa']

#### Finding Word Stems

In [28]:
def stem(word):
    StemPattern = r'^(.*?)(ing|ly|ed|y|ious|ies|ive|es|s|ment)?$'
    stem,s = re.findall(StemPattern,word)[0]
    return stem
raw = """
DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony.
"""
StemWords = nltk.word_tokenize(raw)
print([stem(i) for i in StemWords])

['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond', 'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremon', '.']


#### Searching Tokenized Text

In [29]:
from nltk.corpus import gutenberg,nps_chat
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r'<a>(<.*>)<man>')

monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave


### 3.6 Normalizing Text
#### Stemmers

In [32]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
print([porter.stem(i) for i in StemWords])
print([lancaster.stem(i) for i in StemWords])

['denni', ':', 'listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not', 'from', 'som', 'farc', 'aqu', 'ceremony', '.']


In [7]:
class IndexedText(object):
    def __init__(self,_text,_stemmer):
        self._text = _text
        self._stemmer = _stemmer
        self._index = nltk.Index((self._stem(word),i) for (i,word) in enumerate(self._text))
    def _stem(self,word):
        return self._stemmer.stem(word).lower()
    def concordance(self,word,width=40):
        key = self._stem(word)
        WordCount = int(width/4)
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-WordCount:i])
            rcontext = ' '.join(self._text[i:i+WordCount])
            ldisplay = '{:>{width}}'.format(lcontext[-width:],width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width],width=width)
            print(ldisplay,rdisplay)
stemmer = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(grail,stemmer)
text.concordance('lie')

r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !   
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well  
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which 
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t


#### Lemmatization

In [33]:
WordNetLemma = nltk.WordNetLemmatizer()
print([WordNetLemma.lemmatize(i) for i in StemWords])

['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond', 'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']


### 3.7 Regular Expressions for Tokenizing Text
#### Simple Approaches to Tokenization
#### NLTK's Regular Expression Tokenizer

In [39]:
text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'(?:[A-Z]\.)+|\w+(?:-\w+)*|\$?\d+(?:\.\d+)?%?|\.\.\.|[][.,;"\'?():_`-]'
nltk.regexp_tokenize(text,pattern)

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
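
A verbose-mode variant of this tokenizer pattern is easier to audit, because `(?x)` allows one alternative per line with a comment beside it. The sketch below uses plain `re.findall` instead of `nltk.regexp_tokenize`, on the assumption that the two agree when every group is non-capturing; the name `VERBOSE_PATTERN` is made up for illustration.

```python
import re

VERBOSE_PATTERN = r'''(?x)      # verbose mode: whitespace and comments are ignored
      (?:[A-Z]\.)+              # abbreviations such as U.S.A.
    | \w+(?:-\w+)*              # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?        # currency amounts and percentages
    | \.\.\.                    # ellipsis
    | [][.,;"'?():_`-]          # punctuation marks as separate tokens
'''

tokens = re.findall(VERBOSE_PATTERN, 'That U.S.A. poster-print costs $12.40...')
print(tokens)
```

Placing the hyphen last inside the character class keeps it literal; written between two other characters it would define a range.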

#### Further Issues with Tokenization
### 3.8 Segmentation
#### Sentence Segmentation
#### Word Segmentation

In [68]:
def segment(text,segstr):
    result = []
    last = 0
    for i in range(len(segstr)):
        if segstr[i] == '1' :
            result.append(text[last:i+1])
            last = i+1
    result.append(text[last:])        
    return result
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0100100100100001001001000010100100010010000100010010000"
print(segment(text,seg1))  

['do', 'you', 'see', 'the', 'kitty', 'see', 'the', 'doggy', 'do', 'you', 'like', 'the', 'kitty', 'like', 'the', 'doggy']


In [69]:
def evaluate(text,segstr):
    WordList = segment(text,segstr)
    lexicon = sum([len(i)+1 for i in set(WordList)])
    derivation = len(WordList)
    return lexicon + derivation
evaluate(text,seg1)

48
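
The score of 48 can be checked by splitting it into its two parts: the lexicon cost (each distinct word plus one boundary marker) and the derivation cost (the number of words needed to rebuild the text). The sketch below restates `segment` so that it runs on its own; `score_parts` is a hypothetical helper, not part of the original cell.

```python
def segment(text, segstr):
    """Split text after every position where segstr has a '1'."""
    words, last = [], 0
    for i, flag in enumerate(segstr):
        if flag == '1':
            words.append(text[last:i + 1])
            last = i + 1
    words.append(text[last:])
    return words

def score_parts(text, segstr):
    """Return (lexicon cost, derivation cost) for one segmentation."""
    words = segment(text, segstr)
    lexicon = sum(len(w) + 1 for w in set(words))  # distinct words, plus one marker each
    derivation = len(words)                        # words needed to rebuild the text
    return lexicon, derivation

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0100100100100001001001000010100100010010000100010010000"
print(score_parts(text, seg1))
```

For `seg1` the seven distinct words (`do`, `you`, `see`, `the`, `kitty`, `doggy`, `like`) cost 3+4+4+4+6+6+5 = 32, and the 16 tokens cost 16, which gives 48.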

In [96]:
from random import randint
def FlipStr(string,pos):
    return string[:pos] + str(1-int(string[pos])) + string[pos+1:]
def CreateNFlip(text,n):
    for i in range(n):
        text = FlipStr(text,randint(0,len(text)-1))
    return text
def anneal(text,seg,iteration,CoolRate):
    temperature = float(len(seg))
    while temperature > 0.5 :
        BestScore,BestSeg = evaluate(text,seg),seg
        for i in range(iteration):
            guess = CreateNFlip(seg,round(temperature))
            score = evaluate(text,guess)
            if score < BestScore :
                BestScore,BestSeg = score,guess
        seg = BestSeg
        temperature = temperature * CoolRate
        print(BestScore,segment(text,BestSeg))
seg2 = "0000000000000001000000000010000000000000000100000000000"  
text2 = '盼望着盼望着东风来了春天的脚步近了' 
seg3 = '0010001000001000'
anneal(text,seg2,1000,0.95)

64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouse

### 3.9 Formatting: From Lists to Strings
#### From Lists to Strings
#### Strings and Formats
#### Lining Things Up
#### Writing Results to a File
#### Text Wrapping
### 3.10 Summary
### 3.11 Further Reading
### 3.12 Exercises
#### 1. Define a string `s = 'colorless'`. Write a Python statement that changes this to "colourless" using only the slice and concatenation operations.

In [99]:
s = 'colorless'
s[:4]+'u'+s[4:]

'colourless'

#### 2. We can use the slice notation to remove morphological endings on words. For example, `'dogs'[:-1]` removes the last character of `dogs`, leaving `dog`. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): `dish-es`, `run-ning`, `nation-ality`, `un-do`, `pre-heat`.

In [100]:
'dishes'[:-2]

'dish'

In [101]:
'undo'[2:]

'do'

#### 3. We saw how we can generate an `IndexError` by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?

In [102]:
'abc'[-5]

IndexError: string index out of range

#### 4. We can specify a "step" size for the slice. The following returns every second character within the slice: `monty[6:11:2]`. It also works in the reverse direction: `monty[10:5:-2]` Try these for yourself, then experiment with different step values.
#### 5. What happens if you ask the interpreter to evaluate `monty[::-1]`? Explain why this is a reasonable result.
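
A short sketch for exercises 4 and 5, assuming `monty` holds `'Monty Python'` as in the chapter text. The third slice leaves both bounds empty, so a step of -1 walks the whole string from the last character to the first.

```python
monty = 'Monty Python'

forward = monty[6:11:2]    # indices 6, 8, 10
backward = monty[10:5:-2]  # indices 10, 8, 6
reversed_all = monty[::-1] # every index, last to first

print(forward, backward, reversed_all)
```

This prints `Pto`, `otP` and `nohtyP ytnoM`. The reversed string is a reasonable result because omitted bounds default to "as far as the step direction allows".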
#### 6. Describe the class of strings matched by the following regular expressions.
+ `[a-zA-Z]+`
+ `[A-Z][a-z]*`
+ `p[aeiou]{,2}t`
+ `\d+(\.\d+)?`
+ `([^aeiou][aeiou][^aeiou])*`
+ `\w+|[^\w\s]+`  

Test your answers using `nltk.re_show()`.
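
One way to check a description of each pattern without NLTK is to test whole-string matches against a few hand-picked examples. This is only a sketch: `matches` and the sample strings are made up for illustration, and `re.fullmatch` is used so that partial matches do not count.

```python
import re

def matches(pattern, string):
    """True if the pattern matches the whole string."""
    return re.fullmatch(pattern, string) is not None

checks = [
    (r'[a-zA-Z]+', 'Python', True),     # one or more ASCII letters
    (r'[A-Z][a-z]*', 'Monty', True),    # a capital, then lowercase letters
    (r'[A-Z][a-z]*', 'monty', False),
    (r'p[aeiou]{,2}t', 'pout', True),   # p, at most two vowels, t
    (r'p[aeiou]{,2}t', 'pt', True),
    (r'\d+(\.\d+)?', '3.14', True),     # an integer or a decimal number
    (r'\d+(\.\d+)?', '3.', False),
    (r'([^aeiou][aeiou][^aeiou])*', 'pitcat', True),  # non-vowel, vowel, non-vowel triples
    (r'\w+|[^\w\s]+', '...', True),     # a run of word characters or of punctuation
    (r'\w+|[^\w\s]+', 'a.b', False),
]
for pattern, string, expected in checks:
    print(pattern, repr(string), matches(pattern, string) == expected)
```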
#### 7. Write regular expressions to match the following classes of strings:
##### a. A single determiner (assume that a, an, and the are the only determiners).
##### b. An arithmetic expression using integers, addition, and multiplication, such as `2*3+8`.

In [None]:
r'\b(?:an?|the)\b'
r'\d+(?:[+*]\d+)*'

#### 8. Write a utility function that takes a URL as its argument, and returns the contents of the URL, with all HTML markup removed. Use `from urllib import request` and then `request.urlopen('http://nltk.org/').read().decode('utf8')` to access the contents of the URL.
#### 9. Save some text into a file `corpus.txt`. Define a function `load(f)` that reads from the file named in its sole argument, and returns a string containing the text of the file.
##### a. Use `nltk.regexp_tokenize()` to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use one multi-line regular expression, with inline comments, using the verbose flag `(?x)`.
##### b. Use `nltk.regexp_tokenize()` to create a tokenizer that tokenizes the following kinds of expression: monetary amounts; dates; names of people and organizations.
#### 10. Rewrite the following loop as a list comprehension:
```
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = []
for word in sent:
    word_len = (word, len(word))
    result.append(word_len)
result
[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]
```

In [104]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
print([(i,len(i)) for i in sent])

[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]


#### 11. Define a string `raw` containing a sentence of your own choosing. Now, split `raw` on some character other than space, such as `'s'`.
#### 12. Write a `for` loop to print out the characters of a string, one per line.
#### 13. What is the difference between calling `split` on a string with no argument or with `' '` as the argument, e.g. `sent.split()` versus `sent.split(' ')`? What happens when the string being split contains tab characters, consecutive space characters, or a sequence of tabs and spaces? (In IDLE you will need to use `'\t'` to enter a tab character.)
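
A minimal sketch of the difference, using a made-up string that mixes double spaces and tabs. With no argument, `split()` treats any run of whitespace as one separator; with `' '`, every single space is a separator and tabs are ordinary characters.

```python
sent = 'the  quick\tbrown \t fox'

no_arg = sent.split()       # runs of spaces and tabs collapse
space_arg = sent.split(' ') # empty strings and tabs survive

print(no_arg)
print(space_arg)
```

The first call gives `['the', 'quick', 'brown', 'fox']`; the second gives `['the', '', 'quick\tbrown', '\t', 'fox']`.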
#### 14. Create a variable `words` containing a list of words. Experiment with `words.sort()` and `sorted(words)`. What is the difference?
#### 15. Explore the difference between strings and integers by typing the following at a Python prompt: `"3" * 7` and `3 * 7`. Try converting between strings and integers using `int("3")` and `str(3)`.
#### 16. Use a text editor to create a file called `prog.py` containing the single line `monty = 'Monty Python'`. Next, start up a new session with the Python interpreter, and enter the expression `monty` at the prompt. You will get an error from the interpreter. Now, try the following (note that you have to leave off the `.py` part of the filename):
```
from prog import monty
monty
```
This time, Python should return with a value. You can also try `import prog`, in which case Python should be able to evaluate the expression `prog.monty` at the prompt.
#### 17. What happens when the formatting strings %6s and %-6s are used to display strings that are longer than six characters?

In [109]:
print('%6s'% 'ddddddd')

ddddddd


#### 18. Read in some text from a corpus, tokenize it, and print the list of all wh-word types that occur. (wh-words in English are used in questions, relative clauses and exclamations: who, which, what, and so on.) Print them in order. Are any words duplicated in this list, because of the presence of case distinctions or punctuation?

In [115]:
from nltk.corpus import gutenberg
tokens = gutenberg.words('austen-emma.txt')
print(set([i for i in tokens if i.startswith('Wh')]))

{'What', 'When', 'While', 'Whenever', 'Where', 'Whatever', 'Whether', 'Whoever', 'Which', 'Why', 'Who', 'Whom'}


#### 19. Create a file consisting of words and (made up) frequencies, where each line consists of a word, the space character, and a positive integer, e.g. `fuzzy 53`. Read the file into a Python list using `open(filename).readlines()`. Next, break each line into its two fields using `split()`, and convert the number into an integer using `int()`. The result should be a list of the form: `[['fuzzy', 53], ...]`.
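
A possible solution sketch. The helper name `read_freqs` and the sample words are assumptions; the file is written to a temporary location only so that the example runs on its own.

```python
import os
import tempfile

def read_freqs(filename):
    """Read 'word count' lines into a list of [word, int] pairs."""
    with open(filename) as f:
        lines = f.readlines()
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2:  # skip blank or malformed lines
            pairs.append([fields[0], int(fields[1])])
    return pairs

# Write a small made-up frequency file, read it back, then clean up.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('fuzzy 53\nwuzzy 7\nbear 120\n')
freqs = read_freqs(tmp.name)
os.remove(tmp.name)
print(freqs)
```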
#### 20. Write code to access a favorite webpage and extract some text from it. For example, access a weather site and extract the forecast top temperature for your town or city today.
#### 21. Write a function `unknown()` that takes a URL as its argument, and returns a list of unknown words that occur on that webpage. In order to do this, extract all substrings consisting of lowercase letters (using `re.findall()`) and remove any items from this set that occur in the Words Corpus (`nltk.corpus.words`). Try to categorize these words manually and discuss your findings.
#### 22. Examine the results of processing the URL `http://news.bbc.co.uk/` using the regular expressions suggested above. You will see that there is still a fair amount of non-textual data there, particularly Javascript commands. You may also find that sentence breaks have not been properly preserved. Define further regular expressions that improve the extraction of text from this web page.
#### 23. Are you able to write a regular expression to tokenize text in such a way that the word don't is tokenized into do and n't? Explain why this regular expression won't work: «`n't|\w+`».

In [132]:
re.findall(r"\w+(?=n't)|n't|\w+", "don't")

['do', "n't"]

#### 24. Try to write code to convert text into hAck3r, using regular expressions and substitution, where `e` → `3`, `i` → `1`, `o` → `0`, `l` → `|`, `s` → `5`, `.` → `5w33t!`, `ate` → `8`. Normalize the text to lowercase before converting it. Add more substitutions of your own. Now try to map `s` to two different values: `$` for word-initial `s`, and `5` for word-internal `s`.

In [191]:
def convert(string):
    # Normalize to lowercase first, then replace 'ate' before its letters are rewritten one by one.
    result = re.sub(r'ate',r'8',string.lower())
    result = re.sub(r'e',r'3',result)
    result = re.sub(r'o',r'0',result)
    result = re.sub(r'i',r'1',result)
    result = re.sub(r'l',r'|',result)
    result = re.sub(r'\.',r'5w33t!',result)
    result = re.sub(r'\bs',r'$',result)
    result = re.sub(r'\Bs',r'5',result)
    return result
convert('seil os. ate')

'$31| 055w33t! 8'

#### 25. Pig Latin is a simple transformation of English text. Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append ay, e.g. string → ingstray, idle → idleay. `http://en.wikipedia.org/wiki/Pig_Latin`
##### a. Write a function to convert a word to Pig Latin.
##### b. Write code that converts text, instead of individual words.
##### c. Extend it further to preserve capitalization, to keep `qu` together (i.e. so that `quiet` becomes `ietquay`), and to detect when `y` is used as a consonant (e.g. `yellow`) vs a vowel (e.g. `style`).

In [32]:
def convertPL(word):
    cos = re.match(r'qu|[^aeiouAEIOUyY]*',word)
    return word[cos.span()[1]:] + cos.group() + 'ay'
convertPL('style')

'ylestay'
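
The function above covers part a. A sketch that also attempts parts b and c follows; the names `pig_latin` and `pig_latin_text` are made up, and the treatment of `y` (a consonant only at the start of a word) is a simplifying assumption rather than a full phonological rule.

```python
import re

def pig_latin(word):
    """Move the leading consonant cluster to the end and append 'ay'."""
    # Leading cluster: an optional word-initial 'y', then consonants,
    # with 'qu' kept together; any later 'y' counts as a vowel.
    match = re.match(r'y?(?:qu|[^aeiouy])*', word, re.IGNORECASE)
    cluster, rest = match.group(), word[match.end():]
    result = rest + cluster.lower() + 'ay'
    # Restore the capital if the original word started with one.
    return result.capitalize() if word[:1].isupper() else result

def pig_latin_text(text):
    """Convert every alphabetic word, leaving spaces and punctuation alone."""
    return re.sub(r'[A-Za-z]+', lambda m: pig_latin(m.group()), text)

print(pig_latin_text('The quick yellow style.'))
```

Capitalization is restored only for the first letter, so a word with internal capitals would come out lowercased after its first character.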

#### 26. Download some text from a language that has vowel harmony (e.g. Hungarian), extract the vowel sequences of words, and create a vowel bigram table.
#### 27. Python's `random` module includes a function `choice()` which randomly chooses an item from a sequence, e.g. `choice("aehh ")` will produce one of four possible characters, with the letter `h` being twice as frequent as the others. Write a generator expression that produces a sequence of 500 randomly chosen letters drawn from the string "aehh ", and put this expression inside a call to the `''.join()` function, to concatenate them into one long string. You should get a result that looks like uncontrolled sneezing or maniacal laughter: `he  haha ee  heheeh eha`. Use `split()` and `join()` again to normalize the whitespace in this string.

In [34]:
import random
print(''.join([random.choice('aehh ') for i in range(100)]))

eaeh aeeh e  hhhaeaahhhea aahhh e hhaeeeahehah  eehehhaaeea eeea a hha hahee hhah ah hhehe h hheheae
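
The remaining steps of the exercise can be sketched as follows: a generator expression (no square brackets) supplies 500 characters to `join()`, and a `split()` followed by `' '.join()` collapses every run of spaces. The fixed seed is an addition for repeatability only.

```python
import random

random.seed(0)  # fixed seed only so that repeated runs give the same string
# The generator expression feeds 500 random characters straight to join().
noise = ''.join(random.choice('aehh ') for _ in range(500))
# split() drops every run of spaces; ' '.join() puts single spaces back.
normalized = ' '.join(noise.split())
print(normalized[:60])
```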


#### 28. Consider the numeric expressions in the following sentence from the MedLine Corpus: The corresponding free cortisol fractions in these sera were 4.53 +/- 0.15% and 8.16 +/- 0.23%, respectively. Should we say that the numeric expression 4.53 +/- 0.15% is three words? Or should we say that it's a single compound word? Or should we say that it is actually nine words, since it's read "four point five three, plus or minus zero point fifteen percent"? Or should we say that it's not a "real" word at all, since it wouldn't appear in any dictionary? Discuss these different possibilities. Can you think of application domains that motivate at least two of these answers?
#### 29. Readability measures are used to score the reading difficulty of a text, for the purposes of selecting texts of appropriate difficulty for language learners. Let us define $μ_w$ to be the average number of letters per word, and $μ_s$ to be the average number of words per sentence, in a given text. The Automated Readability Index (ARI) of the text is defined to be: $4.71 μ_w + 0.5 μ_s - 21.43$. Compute the ARI score for various sections of the Brown Corpus, including section `f` (lore) and `j` (learned). Make use of the fact that `nltk.corpus.brown.words()` produces a sequence of words, while `nltk.corpus.brown.sents()` produces a sequence of sentences.

In [151]:
from nltk.corpus import brown
def ari(WordList,SenList):
    uw = sum([len(i) for i in WordList])/len(WordList)
    us = sum([len(i) for i in SenList])/len(SenList)
    return 4.71*uw+0.5*us-21.43
ari(brown.words(categories='lore'),brown.sents(categories = 'lore'))

10.254756197101155

#### 30. Use the Porter Stemmer to normalize some tokenized text, calling the stemmer on each word. Do the same thing with the Lancaster Stemmer and see if you observe any differences.
#### 31. Define the variable saying to contain the list `['After', 'all', 'is', 'said', 'and', 'done', ',', 'more', 'is', 'said', 'than', 'done', '.']`. Process this list using a `for` loop, and store the length of each word in a new list `lengths`. Hint: begin by assigning the empty list to `lengths`, using `lengths = []`. Then each time through the loop, use `append()` to add another length value to the list. Now do the same thing using a list comprehension.

In [None]:
saying = ['After', 'all', 'is', 'said', 'and', 'done', ',', 'more', 'is', 'said', 'than', 'done', '.']
lengths = [len(word) for word in saying]

#### 32. Define a variable `silly` to contain the string: `'newly formed bland ideas are inexpressible in an infuriating way'`. (This happens to be the legitimate interpretation that bilingual English-Spanish speakers can assign to Chomsky's famous nonsense phrase, colorless green ideas sleep furiously according to Wikipedia). Now write code to perform the following tasks:
##### a. Split `silly` into a list of strings, one per word, using Python's `split()` operation, and save this to a variable called `bland`.
##### b. Extract the second letter of each word in `silly` and join them into a string, to get `'eoldrnnnna'`.
##### c. Combine the words in `bland` back into a single string, using `join()`. Make sure the words in the resulting string are separated with whitespace.
##### d. Print the words of `silly` in alphabetical order, one per line.

In [47]:
silly = 'newly formed bland ideas are inexpressible in an infuriating way'
bland = silly.split()
print(''.join(w[1] for w in bland))
print(sorted(bland))

eoldrnnnna
['an', 'are', 'bland', 'formed', 'ideas', 'in', 'inexpressible', 'infuriating', 'newly', 'way']


#### 33. The `index()` function can be used to look up items in sequences. For example, `'inexpressible'.index('e')` tells us the index of the first position of the letter `e`.
##### a. What happens when you look up a substring, e.g. `'inexpressible'.index('re')`?
##### b. Define a variable `words` containing a list of words. Now use `words.index()` to look up the position of an individual word.
##### c. Define a variable `silly` as in the exercise above. Use the `index()` function in combination with list slicing to build a list `phrase` consisting of all the words up to (but not including) `in` in silly.

In [48]:
print('inexpressible'.index('re'))
words = silly.split()
phrase = words[:words.index('in')]
phrase

5


['newly', 'formed', 'bland', 'ideas', 'are', 'inexpressible']

#### 34. Write code to convert nationality adjectives like Canadian and Australian to their corresponding nouns Canada and Australia (see `http://en.wikipedia.org/wiki/List_of_adjectival_forms_of_place_names`).

In [51]:
def convertadj(word):
    return re.sub(r'ian\b',r'a',word)
convertadj('canadian')

'canada'
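
The single substitution `ian` → `a` works for `canadian` but turns `Australian` into `Australa`. A sketch that handles both is given here, assuming a default rule (drop the final `n`) plus a small exception table; the table entries and the name `adjective_to_noun` are hand-picked for illustration and far from complete.

```python
import re

# Hypothetical exception table; a real system would need many more entries.
IRREGULAR = {
    'canadian': 'Canada',
    'norwegian': 'Norway',
    'italian': 'Italy',
    'german': 'Germany',
}

def adjective_to_noun(word):
    """Map a nationality adjective ending in -an to a place name."""
    if word.lower() in IRREGULAR:
        return IRREGULAR[word.lower()]
    # Default rule: drop the final 'n' (Australian -> Australia).
    return re.sub(r'an$', 'a', word)

for adjective in ['Australian', 'Canadian', 'American', 'Italian']:
    print(adjective, '->', adjective_to_noun(adjective))
```

Adjectives that do not end in `-an` (for example `French`) pass through unchanged, so they would also need table entries.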

#### 35. Read the LanguageLog post on phrases of the form as best as p can and as best p can, where p is a pronoun. Investigate this phenomenon with the help of a corpus and the `findall()` method for searching tokenized text described in 3.5. `http://itre.cis.upenn.edu/~myl/languagelog/archives/002733.html`
#### 36. Study the lolcat version of the book of Genesis, accessible as `nltk.corpus.genesis.words('lolcat.txt')`, and the rules for converting text into lolspeak at `http://www.lolcatbible.com/index.php?title=How_to_speak_lolcat`. Define regular expressions to convert English words into corresponding lolspeak words.
#### 37. Read about the `re.sub()` function for string substitution using regular expressions, using `help(re.sub)` and by consulting the further readings for this chapter. Use `re.sub` in writing code to remove HTML tags from an HTML file, and to normalize whitespace.
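
A rough sketch of tag removal with `re.sub`, working on an HTML string so that it runs offline (for a live page the string could come from `request.urlopen(url).read().decode('utf8')`, as in exercise 8). The name `strip_html` is made up, and regular expressions are only an approximation here: character entities such as `&amp;` are left undecoded, and unusual markup may slip through.

```python
import re

def strip_html(html):
    """Remove tags from an HTML string and normalize whitespace."""
    # Drop script and style elements together with their contents.
    text = re.sub(r'(?is)<(script|style)\b.*?</\1\s*>', ' ', html)
    # Drop comments, then every remaining tag.
    text = re.sub(r'(?s)<!--.*?-->', ' ', text)
    text = re.sub(r'(?s)<[^>]+>', ' ', text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r'\s+', ' ', text).strip()

sample = ('<html><head><style>p {color: red}</style></head>\n'
          '<body><p>Hello,\n   <b>world</b></p></body></html>')
print(strip_html(sample))
```

Tags are replaced by a space, not by the empty string, so that words in adjacent elements are not glued together.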
#### 38. An interesting challenge for tokenization is words that have been split across a line-break. E.g. if long-term is split, then we have the string `long-\nterm`.
##### a. Write a regular expression that identifies words that are hyphenated at a line-break. The expression will need to include the `\n` character.
##### b. Use re.sub() to remove the `\n` character from these words.
##### c. How might you identify words that should not remain hyphenated once the newline is removed, e.g. 'encyclo-\npedia'?

In [77]:
s = 'long-\nterm'
re.sub(r'(\w+-)(\n)(\w+)',r'\1\3',s)

'long-term'

#### 39. Read the Wikipedia entry on Soundex. Implement this algorithm in Python.

In [132]:
def SoundEx(word):
    result = word[0]
    cos = word.lower()
    cos = re.sub(r'([bfpv])+([hw][bfpv]+)?',r'\1',cos)
    cos = re.sub(r'([cgjkqsxz])+([hw][cgjkqsxz]+)?',r'\1',cos)
    cos = re.sub(r'([dt])+([hw][dt]+)?',r'\1',cos)
    cos = re.sub(r'(l)+([hw]l+)?',r'l',cos)
    cos = re.sub(r'([mn])+([hw][mn]+)?',r'\1',cos)
    cos = re.sub(r'(r)+([hw]r+)?',r'\1',cos) 
    cos = cos[1:]
    cos = re.sub(r'[bfpv]',r'1',cos)
    cos = re.sub(r'[cgjkqsxz]',r'2',cos)
    cos = re.sub(r'[dt]',r'3',cos)
    cos = re.sub(r'l',r'4',cos)
    cos = re.sub(r'[mn]',r'5',cos)
    cos = re.sub(r'r',r'6',cos)
    cos = re.sub(r'[aeiouyhw]',r'',cos)
    result += cos
    if len(result[1:]) < 3 :
        result = result[0] + result[1:].ljust(3,'0')
    return result[:4]
SoundEx('Tymczak')

'T522'

#### 40. Obtain raw texts from two or more genres and compute their respective reading difficulty scores as in the earlier exercise on reading difficulty. E.g. compare ABC Rural News and ABC Science News (`nltk.corpus.abc`). Use Punkt to perform sentence segmentation.

In [155]:
text1 = nltk.corpus.abc.raw('rural.txt')
TokenList1 = nltk.corpus.abc.words('rural.txt')
SentList1 = [i.split() for i in nltk.sent_tokenize(text1)]
print(ari(TokenList1,SentList1))
text2 = nltk.corpus.abc.raw('science.txt')
TokenList2 = nltk.corpus.abc.words('science.txt')
SentList2 = [i.split() for i in nltk.sent_tokenize(text2)]
print(ari(TokenList2,SentList2))

10.66074843699441
10.703963706930097


#### 41. Rewrite the following nested loop as a nested list comprehension:
```
words = ['attribution', 'confabulation', 'elocution',
         'sequoia', 'tenacious', 'unidirectional']
vsequences = set()
for word in words:
    vowels = []
    for char in word:
        if char in 'aeiou':
            vowels.append(char)
    vsequences.add(''.join(vowels))
sorted(vsequences)
['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']
```

In [157]:
words = ['attribution', 'confabulation', 'elocution','sequoia', 'tenacious','unidirectional']
sorted(set([''.join([c for c in word if c in 'aeiou']) for word in words]))

['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']

#### 42. Use WordNet to create a semantic index for a text collection. Extend the concordance search program in 3.6, indexing each word using the offset of its first synset, e.g. `wn.synsets('dog')[0].offset` (and optionally the offset of some of its ancestors in the hypernym hierarchy).

In [16]:
from nltk.corpus import wordnet as wn
class WNIndexedText(IndexedText):
    def __init__(self,text,stemmer):
        self._text = text
        self._stemmer = stemmer
        # Key each token position by the offset of the word's first synset; skip words unknown to WordNet.
        self._index = nltk.Index((wn.synsets(word)[0].offset(),i) for (i,word) in enumerate(self._text) if wn.synsets(word))

#### 43. With the help of a multilingual corpus such as the Universal Declaration of Human Rights Corpus (`nltk.corpus.udhr`), and NLTK's frequency distribution and rank correlation functionality (`nltk.FreqDist, nltk.spearman_correlation`), develop a system that guesses the language of a previously unseen text. For simplicity, work with a single character encoding and just a few languages.

In [163]:
from nltk.corpus import udhr
FileList = ['English-Latin1','French_Francais-Latin1','Spanish-Latin1']
LangFreqList = [nltk.FreqDist(udhr.words(i)) for i in FileList]
test = nltk.FreqDist(brown.words('ca01'))
for i in range(3):
    print(FileList[i],nltk.spearman_correlation(test,LangFreqList[i]))

English-Latin1 0.9751468222795795
French_Francais-Latin1 -2.5911764705882354
Spanish-Latin1 -6.782967032967033
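
Spearman's coefficient lies between -1 and 1, so the French and Spanish values above suggest that the inputs were not treated as rankings. As far as the NLTK documentation describes it, `nltk.spearman_correlation` expects rank mappings over the same keys, not raw frequency counts. The sketch below shows the intended computation in plain Python: it ranks only the keys both tables share and then applies the usual formula. The function names and the toy profiles are made up; with the UDHR corpus, the counts would presumably come from `nltk.FreqDist` objects like those in the cell above.

```python
from collections import Counter

def spearman_from_counts(counts1, counts2):
    """Spearman's rho between two frequency tables, ranked over shared keys."""
    shared = sorted(set(counts1) & set(counts2))
    n = len(shared)
    if n < 2:
        return 0.0
    def rank(counts):
        # Rank 0 is the most frequent key; ties are broken alphabetically
        # (an arbitrary choice made for this sketch).
        ordered = sorted(shared, key=lambda k: (-counts[k], k))
        return {key: r for r, key in enumerate(ordered)}
    r1, r2 = rank(counts1), rank(counts2)
    d2 = sum((r1[k] - r2[k]) ** 2 for k in shared)
    return 1 - 6 * d2 / (n * (n * n - 1))

def guess_language(sample, profiles):
    """Return the profile name whose ranking correlates best with the sample."""
    return max(profiles, key=lambda name: spearman_from_counts(sample, profiles[name]))

# Toy profiles built from made-up strings, only to show the mechanics.
profiles = {'A': Counter('aaabbc'), 'B': Counter('cccbba')}
sample = Counter('aaaabbbcc')
print(guess_language(sample, profiles))
```

Because both rankings are permutations of the same positions, the result cannot leave the interval from -1 to 1.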


#### 44. Write a program that processes a text and discovers cases where a word has been used with a novel sense. For each word, compute the WordNet similarity between all synsets of the word and all synsets of the words in its context. (Note that this is a crude approach; doing it well is a difficult, open research problem.)

#### 45. Read the article on normalization of non-standard words (Sproat et al, 2001), and implement a similar system for text normalization.