In [1]:
import nltk
import re
import pprint
import random
from urllib import request
from nltk import word_tokenize
from nltk.corpus import brown
from nltk.corpus import wordnet as wn

**1. Define a string ** `s = 'colorless'` **. Write a Python statement that changes this to "colourless" using only the slice and concatenation operations.**

In [2]:
s = 'colorless'
s = s[:4] + 'u' + s[4:]

**2. We can use the slice notation to remove morphological endings on words. For example, ** `'dogs'[:-1]` ** removes the last character of ** `dogs` **, leaving ** `dog` **. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): ** `dish-es` **, ** `run-ning` **, ** `nation-ality` **, ** `un-do` **, ** `pre-heat` **.**

In [3]:
dish = 'dishes'[:-2]
run = 'running'[:-4]
nation = 'nationality'[:-5]
do = 'undo'[2:]
heat = 'preheat'[3:]

**3. We saw how we can generate an ** `IndexError` ** by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?**

Yes. For a string `s`, the valid negative indices run from `-1` down to `-len(s)`, so `s[-(len(s)+1)]` reaches before the start of the string and raises an `IndexError`.
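A quick check of that boundary, as a minimal sketch (the sample string is an arbitrary choice): the leftmost valid index still works, and one step further left raises the error.

```python
s = 'colorless'

# -len(s) is the leftmost valid index and refers to the first character.
first = s[-len(s)]

# One position further to the left is out of range.
try:
    s[-(len(s) + 1)]
    went_too_far = False
except IndexError:
    went_too_far = True

print(first, went_too_far)
```

Note that this applies to indexing only; a slice such as `s[-100:]` is clipped silently and does not raise.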

**4. We can specify a "step" size for the slice. The following returns every second character within the slice: ** `monty[6:11:2]` **. It also works in the reverse direction: ** `monty[10:5:-2]` ** Try these for yourself, then experiment with different step values.**

Omitted.

**5. What happens if you ask the interpreter to evaluate ** `monty[::-1]` **? Explain why this is a reasonable result.**

In [4]:
monty = 'Monty Python'
monty[::-1]

'nohtyP ytnoM'

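This reverses the string. Omitting the start and stop selects the whole string, and the step of `-1` walks through it from the last character to the first, so the result follows directly from ordinary slice semantics.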
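The step slices from the previous exercise and the full reversal can be compared side by side. This is only an illustrative sketch; the variable names are chosen here for the example.

```python
monty = 'Monty Python'

forward = monty[6:11:2]      # characters at indices 6, 8, 10
backward = monty[10:5:-2]    # characters at indices 10, 8, 6
reversed_all = monty[::-1]   # with a negative step, start and stop default to the two ends

print(forward, backward, reversed_all)
```

With a negative step the defaults for start and stop swap ends, which is why no explicit indices are needed to reverse the whole string.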

**6. Describe the class of strings matched by the following regular expressions.**  
a. `[a-zA-Z]+`  
b. `[A-Z][a-z]*`  
c. `p[aeiou]{,2}t`  
d. `\d+(\.\d+)?`  
e. `([^aeiou][aeiou][^aeiou])*`  
f. `\w+|[^\w\s]+`  
**Test your answers using ** `nltk.re_show()`.

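a. Alphabetic words: one or more ASCII letters in either case.  
b. Capitalized words: one upper-case letter followed by zero or more lower-case letters (a single capital such as `A` also matches).  
c. Words starting with `p`, ending with `t`, and with 0 to 2 vowels in between, e.g. `pt`, `pet`, `poet`.  
d. Unsigned integers and decimal numbers, e.g. `3` or `3.14` (no sign, and no bare `.5`).  
e. Zero or more repetitions of a three-character unit: non-vowel, vowel, non-vowel. Since `[^aeiou]` matches anything that is not a lower-case vowel, the "consonant" slots also accept digits, spaces, or punctuation, and the empty string matches as well.  
f. Either a run of word characters (letters, digits, underscore) or a run of characters that are neither word characters nor whitespace (i.e. punctuation), which makes the pattern usable as a simple tokenizer.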
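The descriptions above can be checked without `nltk.re_show()` by testing whole strings with `re.fullmatch`. This is a sketch; the helper `matches` and the test strings are made up for illustration.

```python
import re

patterns = {
    'a': r'[a-zA-Z]+',
    'b': r'[A-Z][a-z]*',
    'c': r'p[aeiou]{,2}t',
    'd': r'\d+(\.\d+)?',
    'e': r'([^aeiou][aeiou][^aeiou])*',
    'f': r'\w+|[^\w\s]+',
}

def matches(key, s):
    """Return True if pattern `key` matches the whole string s."""
    return re.fullmatch(patterns[key], s) is not None

print(matches('c', 'poet'), matches('c', 'pooot'))   # at most two vowels
print(matches('d', '3.14'), matches('d', '.5'))      # digits required before the point
print(matches('e', 'pitbat'), matches('e', ' a!'))   # non-vowels need not be letters
print(matches('f', '...'), matches('f', 'a.b'))      # one run of a single kind only
```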

**7. Write regular expressions to match the following classes of strings:**  
a. **A single determiner (assume that ** `a`**, **`an`**, and ** `the` ** are the only determiners).**  
b. **An arithmetic expression using integers, addition, and multiplication, such as ** `2*3+8`.

In [5]:
re_a = r'(\ban?\b|\bthe\b)'
re_b = r'\d+(?:[+*]\d+)*'        # integers joined by + or *, e.g. 2*3+8; a bare character class would also accept '+*+'
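A small test of both ideas, as a sketch: the two patterns below are one possible formulation (assuming no whitespace or parentheses in the arithmetic expressions), and the sample sentence is invented.

```python
import re

determiner = r'\b(?:an?|the)\b'
arithmetic = r'\d+(?:[+*]\d+)*'   # assumption: no spaces, no parentheses

# Word boundaries keep 'and', 'cat' or 'Athens' from contributing matches.
found = re.findall(determiner, 'the cat saw an owl and a bat near Athens')

ok = re.fullmatch(arithmetic, '2*3+8') is not None
bad = re.fullmatch(arithmetic, '2*+3') is not None

print(found, ok, bad)
```

Requiring a digit run on both sides of every operator is what rejects strings such as `2*+3`.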

**8. Write a utility function that takes a URL as its argument, and returns the contents of the URL, with all HTML markup removed. Use ** `from urllib import request` ** and then ** ` request.urlopen('http://nltk.org/').read().decode('utf8')` ** to access the contents of the URL.**

In [6]:
from bs4 import BeautifulSoup

def content_of_URL(URL):
    html = request.urlopen(URL).read().decode('utf8')
    # name the parser explicitly; otherwise bs4 warns and picks one itself
    raw = BeautifulSoup(html, 'html.parser').get_text()
    tokens = word_tokenize(raw)
    return tokens

# Not run here: BeautifulSoup was not installed in this environment.

**9. Save some text into a file ** `corpus.txt` **. Define a function ** `load(f)` ** that reads from the file named in its sole argument, and returns a string containing the text of the file.**  
a. **Use ** `nltk.regexp_tokenize()` ** to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use one multi-line regular expression, with inline comments, using the verbose flag (?x).**  
b. **Use ** `nltk.regexp_tokenize()` ** to create a tokenizer that tokenizes the following kinds of expression: monetary amounts; dates; names of people and organizations.**

In [11]:
def load_punctuations(f):
    file = open(f,encoding="utf8")
    raw = file.read()
    pattern = r'''(?x)        # set flag to allow verbose regexps
        [,\.]                 # comma, period
      | [\[\](){}<>]          # brackets () {} [] <>
      | ['"“]                 # quotation marks
      | [?!]                  # question mark and exclamation mark
      | [:;]                  # colon and semicolon
      | \.\.\.                # ellipsis; never reached, since [,\.] above consumes each dot first
      | [，。？！、‘：；]       # some Chinese punctuations
    '''
    return nltk.regexp_tokenize(raw, pattern)

load_punctuations('corpus.txt')

['.',
 ',',
 ',',
 ',',
 ',',
 ',',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 ',',
 ',',
 ',',
 ',',
 ',',
 '.',
 ',',
 ',',
 '.',
 ',',
 ',',
 ',',
 '.',
 '“',
 ',',
 ',',
 ',',
 '“',
 '.',
 '.',
 ',',
 ',',
 ',',
 ',',
 ',',
 '.',
 '.',
 '(',
 ':',
 '.',
 '.',
 ')',
 ',',
 '.',
 '.',
 '.',
 '?',
 '.',
 '.',
 '.']

In [12]:
def load_monetary(f):
    file = open(f,encoding="utf8")
    raw = file.read()
    pattern = r'''(?x)
        \$\d+(?:,\d+)*(?:\.\d+)?      # USD
      | £\d+(?:,\d+)*(?:\.\d+)?       # GBP
      | ￥\d+(?:\.\d+)?               # CNY
    '''
    return nltk.regexp_tokenize(raw, pattern)

load_monetary('corpus.txt')

['$1,000', '£999.99', '￥1000']

In [13]:
def load_date(f):
    file = open(f,encoding="utf8")
    raw = file.read()
    pattern = r'''(?x)
        \d{,4}[/\.-]\d{1,2}[/\.-]\d{1,2}       # big-endian, e.g., 1996-10-23, 1996.10.23, 1996/10/23
      | \d{1,2}[/\.-]\d{1,2}[/\.-]\d{,4}       # little-endian or middle-endian, dd/mm/yyyy or mm/dd/yyyy 
    '''
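    # Dates with the month spelled out in full or abbreviated also exist,
    # but the pattern becomes extremely tedious, so they are left out here.
    # Note: \d{,4} in the first alternative also accepts short years, so a date
    # such as 08/06/2018 is cut to '08/06/20'; \d{4} there would avoid this.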
    
    return nltk.regexp_tokenize(raw, pattern)
load_date('corpus.txt')

['2018-08-06', '2018.08.06', '08/06/20', '06/08/20', '06/08/18', '06-08-20']

**10. Rewrite the following loop as a list comprehension:**

```Python
>>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
>>> result = []
>>> for word in sent:
...     word_len = (word, len(word))
...     result.append(word_len)
>>> result
[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]
```

In [14]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = [(word, len(word)) for word in sent]
result

[('The', 3),
 ('dog', 3),
 ('gave', 4),
 ('John', 4),
 ('the', 3),
 ('newspaper', 9)]

**11. Define a string ** `raw` ** containing a sentence of your own choosing. Now, split ** `raw` ** on some character other than space, such as ** `'s'`.

In [15]:
raw = 'Define a string  raw containing a sentence of your own choosing.'
raw.split('s')

['Define a ', 'tring  raw containing a ', 'entence of your own choo', 'ing.']

**12. Write a ** `for` ** loop to print out the characters of a string, one per line.**

In [16]:
s = 'Hello world'
for char in s:
    print(char)

H
e
l
l
o
 
w
o
r
l
d


**13. What is the difference between calling split on a string with no argument or with ** `' '` ** as the argument, e.g. ** `sent.split()` ** versus ** `sent.split(' ')` **? What happens when the string being split contains tab characters, consecutive space characters, or a sequence of tabs and spaces? (In IDLE you will need to use ** `'\t'` ** to enter a tab character.)**

In [17]:
sent = 'Hello\t World\nNLTK'
print(sent.split(' '))
print(sent.split())

['Hello\t', 'World\nNLTK']
['Hello', 'World', 'NLTK']


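`sent.split(' ')` splits only on single space characters, so tabs and newlines stay inside the tokens and consecutive spaces produce empty strings. `sent.split()` splits on any run of whitespace (spaces, tabs, newlines) and never returns empty strings.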
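The cases the exercise asks about (consecutive spaces, tabs, and mixtures of both) can be seen in one example. This is a sketch with an invented test string.

```python
sent = 'Hello  world\t\tNLTK \t end'

by_space = sent.split(' ')    # splits at every single space character
by_whitespace = sent.split()  # splits at runs of any whitespace

print(by_space)
print(by_whitespace)
```

The empty string in `by_space` comes from the two adjacent spaces, and the lone `'\t'` from the tab sitting between two spaces.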

**14. Create a variable ** `words` ** containing a list of words. Experiment with ** `words.sort()` ** and ** `sorted(words)` **. What is the difference?**

`words.sort()` sorts the list `words` in place and returns `None`, so nothing is displayed by default.  
`sorted(words)` returns a new sorted list and leaves the original list unchanged.
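A minimal sketch of the difference, using a made-up word list:

```python
words = ['pear', 'apple', 'fig']

sorted_copy = sorted(words)   # new list; words is untouched at this point
before = list(words)          # snapshot of the original order

result = words.sort()         # sorts in place and returns None

print(sorted_copy, before, result, words)
```

A common slip is `words = words.sort()`, which replaces the list with `None`.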

**15. Explore the difference between strings and integers by typing the following at a Python prompt: ** `"3" * 7` ** and ** `3 * 7` **. Try converting between strings and integers using ** `int("3")` ** and ** `str(3)` **.**

In [18]:
print("3" * 7)
print(3 * 7)
print(type(int("3")))
print(type(str(3)))

3333333
21
<class 'int'>
<class 'str'>


**16. Use a text editor to create a file called ** `prog.py` ** containing the single line ** `monty = 'Monty Python'` **. Next, start up a new session with the Python interpreter, and enter the expression ** `monty` ** at the prompt. You will get an error from the interpreter. Now, try the following (note that you have to leave off the ** `.py` ** part of the filename):**
```Python
from prog import monty
monty
```
**This time, Python should return with a value. You can also try ** `import prog` **, in which case Python should be able to evaluate the expression ** `prog.monty` ** at the prompt.**

Omitted.

**17. What happens when the formatting strings ** `%6s` ** and ** `%-6s` ** are used to display strings that are longer than six characters?**

In [19]:
s = 'helloworld'
print('%6s' % s)
print('%-6s' % s)
# No difference: 6 is a minimum field width, so longer strings are printed in full, without padding or truncation.

helloworld
helloworld
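For contrast, a short sketch showing where the width does matter, and how a precision (`%.6s`) truncates instead. The variable names are illustrative only.

```python
short, long_ = 'hi', 'helloworld'

right = '%6s' % short    # padded on the left to width 6
left = '%-6s' % short    # padded on the right to width 6
full = '%6s' % long_     # longer than 6: printed unchanged
cut = '%.6s' % long_     # precision, not width, limits the length

print(repr(right), repr(left), repr(full), repr(cut))
```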


**18. Read in some text from a corpus, tokenize it, and print the list of all ** *wh-* **word types that occur. (** *wh-* **words in English are used in questions, relative clauses and exclamations: ** *who*, *which*, *what* **, and so on.) Print them in order. Are any words duplicated in this list, because of the presence of case distinctions or punctuation?**

In [21]:
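f = 'corpus.txt'
with open(f, encoding="utf8") as file:
    raw = file.read()
tokens = word_tokenize(raw)
# sorted(set(...)) keeps one entry per word type and prints the types in order;
# forms that differ only in case (e.g. 'What' and 'what') still count as separate types
print(sorted(set(wh for wh in tokens if wh.lower().startswith('wh'))))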

['What', 'Why', 'who']


**19. Create a file consisting of words and (made up) frequencies, where each line consists of a word, the space character, and a positive integer, e.g. ** `fuzzy 53` **. Read the file into a Python list using ** ` open(filename).readlines()` **. Next, break each line into its two fields using ** `split()` **, and convert the number into an integer using ** `int()` **. The result should be a list of the form: ** `[['fuzzy', 53], ...]` **.**

In [22]:
filename = 'word_freq.txt'
lines = open(filename).readlines()
fields = []
for line in lines:
    field = line.split()
    field[1] = int(field[1])
    fields.append(field)
fields

[['fuzzy', 53], ['natural', 14], ['language', 12], ['processing', 16]]

**20. Write code to access a favorite webpage and extract some text from it. For example, access a weather site and extract the forecast top temperature for your town or city today.**

In [20]:
	
from urllib import request

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)

str

In [16]:
len(raw)

1176967

In [33]:
# import urllib3.request
# url = 'https://weather.com/weather/5day/l/CHXX0044:1:CH'
# html = request.urlopen(url).read().decode('utf8')
# high = int(re.findall(r'High (\d+)F', html)[0])                    
# high
# # I just use regular expression instead of BeautifulSoup, which is a bit tricky:D

In [26]:


import re
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

url = 'https://medium.com/@aakankshaws/using-beautifulsoup-requests-to-scrape-weather-data-9c6e9d317800'

response = http.request('GET', url)
html = response.data.decode('utf8')
raw = BeautifulSoup(html, 'html.parser').get_text()
raw






'Using BeautifulSoup, requests to scrape weather data{"@context":"http://schema.org","@type":"NewsArticle","image":{"@type":"ImageObject","width":545,"height":106,"url":"https://cdn-images-1.medium.com/max/1090/1*OMF3fSqH8t4xBJ9-6oZDZw.png"},"url":"https://medium.com/@aakankshaws/using-beautifulsoup-requests-to-scrape-weather-data-9c6e9d317800","dateCreated":"2018-06-28T06:07:39.544Z","datePublished":"2018-06-28T06:07:39.544Z","dateModified":"2019-01-17T11:51:57.524Z","headline":"Using BeautifulSoup, requests to scrape weather data","name":"Using BeautifulSoup, requests to scrape weather data","articleId":"9c6e9d317800","thumbnailUrl":"https://cdn-images-1.medium.com/max/1090/1*OMF3fSqH8t4xBJ9-6oZDZw.png","keywords":["Tag:Python","Tag:Beautifulsoup","Tag:Pandas","Tag:Requests","Tag:First Post","LockedPostSource:0","Elevated:false","LayerCake:0"],"author":{"@type":"Person","name":"Aakanksha W S","url":"https://medium.com/@aakankshaws"},"creator":["Aakanksha W S"],"publisher":{"@type":"O

In [32]:
high=(re.findall(r'HTTP', raw))

'HTTP'

**21. Write a function ** `unknown()` ** that takes a URL as its argument, and returns a list of unknown words that occur on that webpage. In order to do this, extract all substrings consisting of lowercase letters (using ** `re.findall()` **) and remove any items from this set that occur in the Words Corpus ** `(nltk.corpus.words)` **. Try to categorize these words manually and discuss your findings.**

In [None]:
import nltk

def unknown(url):
    html = request.urlopen(url).read().decode('utf8')
    lowers = re.findall(r'\b[a-z]+', html)
    # build the vocabulary once as a set; testing against the full list for every word is very slow
    vocab = set(nltk.corpus.words.words())
    unknowns = [w for w in lowers if w not in vocab]
    return unknowns

unknown('https://en.wikipedia.org')

**22. Examine the results of processing the URL http://news.bbc.co.uk/ using the regular expressions suggested above. You will see that there is still a fair amount of non-textual data there, particularly Javascript commands. You may also find that sentence breaks have not been properly preserved. Define further regular expressions that improve the extraction of text from this web page.**

In [12]:
file = open('BBC.html').read()
# Because of the GFW, the BBC site is not accessible, and no proxy was set up
# for Jupyter Notebook, so the page was saved from Safari and is read from disk.

lowers = re.findall(r'[a-z]+', file)
unknowns = [w for w in lowers[:100] if w not in nltk.corpus.words.words()]
# Checking membership in NLTK's word list is costly,
# so only the first 100 matches are checked.
unknowns

FileNotFoundError: [Errno 2] No such file or directory: 'BBC.html'

**23. Are you able to write a regular expression to tokenize text in such a way that the word ** *don't* ** is tokenized into**  *do* ** and ** *n't* **? Explain why this regular expression won't work: ** `«n't|\w+»` **.**

In [13]:
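pattern = r"n't|\w+(?=n't)|\w+"

The regular expression `n't|\w+` fails because matching proceeds from left to right: at the `d` of *don't* the alternative `n't` does not match, so `\w+` greedily consumes `don`, including the `n`. The remaining `'t` can no longer match `n't`; the apostrophe is skipped and `t` is matched by `\w+`, which yields `don` and `t`. The pattern above avoids this by using a lookahead that makes `\w+` stop just before `n't`.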
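The failure and one possible repair can be compared directly. This is a sketch: the lookahead pattern is one way to do it, not necessarily the intended textbook answer, and the sample sentence is invented.

```python
import re

sentence = "I don't know"

# The pattern from the exercise: \w+ swallows the 'n' before n't can be tried.
naive = re.findall(r"n't|\w+", sentence)

# A lookahead makes \w+ stop just before "n't", so the clitic survives intact.
fixed = re.findall(r"n't|\w+(?=n't)|\w+", sentence)

print(naive)
print(fixed)
```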

**24. Try to write code to convert text into hAck3r, using regular expressions and substitution, where ** `e → 3, i → 1, o → 0, l → |, s → 5, . → 5w33t!, ate → 8` **. Normalize the text to lowercase before converting it. Add more substitutions of your own. Now try to map s to two different values: ** `$` ** for word-initial ** `s` **, and ** `5` ** for word-internal ** `s`**.**

In [14]:
text = 'I ate an apple yesterday in my car, said by Tom.'
text = text.lower()                     # normalize to lowercase first, as the exercise asks

# text = re.sub(pattern, repl, text)
text = re.sub(r'ate', '8', text)        # ate -> 8, replace it first, or 'e' will be replaced
text = re.sub(r'e', '3', text)          # e -> 3
text = re.sub(r'i', '1', text)          # i -> 1
text = re.sub(r'o', '0', text)          # o -> 0
text = re.sub(r'l', '|', text)          # l -> |
text = re.sub(r'\.', '5w33t!', text)    # . -> 5w33t!

text = re.sub(r'(\b)(s)', r'\1$', text)              # word-initial s
text = re.sub(r'(\w)(s)', r'\g<1>5', text)           # word-internal s
# reference: https://stackoverflow.com/questions/5984633/python-re-sub-group-number-after-number

text = re.sub(r'g', '9', text)          # g -> 9
text = re.sub(r'c', '(', text)          # c -> (
text

'1 8 an app|3 y35t3rday 1n my (ar, $a1d by t0m5w33t!'

**25. ** *Pig Latin* ** is a simple transformation of English text. Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append ** *ay* **, e.g.**  *string → ingstray*, *idle → idleay* **.  http://en.wikipedia.org/wiki/Pig_Latin**  
a. **Write a function to convert a word to Pig Latin.**  
b. **Write code that converts text, instead of individual words.**  
c. **Extend it further to preserve capitalization, to keep ** `qu` ** together (i.e. so that ** `quiet` ** becomes ** `ietquay` **), and to detect when ** `y` ** is used as a consonant (e.g. ** `yellow` **) vs a vowel (e.g. ** `style` **).**

In [15]:
def pig_latin_word(word):
    # Anchor the pattern to the whole word: with a bare \b, newer Python versions
    # also replace the empty match at the end of the word and append a second 'ay'.
    pattern = r'^([^aeiou]*)(\w*)$'
    repl = r'\2\1ay'
    return re.sub(pattern, repl, word)

pig_latin_word('string')

'ingstray'

In [16]:
def pig_latin_text(text):
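    # (?=\w) requires a word character, so the pattern cannot match the empty string at word ends
    pattern = r'\b(?=\w)([b-df-hj-np-tv-z]*)(\w*)\b'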
    repl = r'\2\1ay'
    text = re.sub(pattern, repl, text)
    return text

pig_latin_text('string idle')

'ingstray idleay'

In [17]:
def pig_latin_extension(text):
    # Word-initial 'y' is treated as a consonant; 'qu' is moved as a unit.
    # (?=\w) keeps the pattern from matching the empty string at word ends.
    pattern = r'(?i)\b(?=\w)(qu|[b-df-hj-np-tv-z]*)(\w*)\b'
    repl = r'\2\1ay'
    text = re.sub(pattern, repl, text)
    return text

pig_latin_extension('quite IDLE')

'itequay IDLEay'

**26. Download some text from a language that has vowel harmony (e.g. Hungarian), extract the vowel sequences of words, and create a vowel bigram table.**

Omitted.

**27. Python's random module includes a function ** `choice()` ** which randomly chooses an item from a sequence, e.g. ** `choice("aehh ")` ** will produce one of four possible characters, with the letter ** `h` ** being twice as frequent as the others. Write a generator expression that produces a sequence of 500 randomly chosen letters drawn from the string ** `"aehh "` **, and put this expression inside a call to the ** `''.join()` ** function, to concatenate them into one long string. You should get a result that looks like uncontrolled sneezing or maniacal laughter: ** `he  haha ee  heheeh eha` **. Use ** `split()` ** and ** `join()` ** again to normalize the whitespace in this string.**

In [19]:
import random
# a generator expression inside join(), as the exercise asks
ori = ''.join(random.choice("aehh ") for _ in range(500))   # may contain runs of several spaces
' '.join(ori.split())                # normalize the whitespace

'aa ahaee hah hhehhh ahhheeh hheh ae eaahaaahaahehh e ahha ehhehaa hhhh hhheh ahhehh aeae aaehheea heeh hh hh e hhheahhheah ae aa hh haaaeeeh ahhheh hhaaahh h ehah hhehhhhe ahaha ehhheaea ehhhehehehheehe ae heeh aeah hhheh e a aahe h h hhhhhhae ehe h ahh e e ha aha hhe eah ah hahaaehaeh ehh aahehh h hh ee ehaa heheheh hhhh a hhheaheaehhhh a eeh heaehae hhahh aea hheeeaehehaaeahhhae hhe eah h hha ea hahh aahh ehha e h aahhhhh hahaaeeeaeahh hhhhhae a hahh ehhhh eeh hhah eaahahahh'

**28. Consider the numeric expressions in the following sentence from the MedLine Corpus: ** *The corresponding free cortisol fractions in these sera were 4.53 +/- 0.15% and 8.16 +/- 0.23%, respectively.* ** Should we say that the numeric expression ** *4.53 +/- 0.15%* ** is three words? Or should we say that it's a single compound word? Or should we say that it is actually ** *nine* ** words, since it's read "four point five three, plus or minus zero point fifteen percent"? Or should we say that it's not a "real" word at all, since it wouldn't appear in any dictionary? Discuss these different possibilities. Can you think of application domains that motivate at least two of these answers?**

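All of these analyses are defensible, and the right one depends on the application. A tokenizer that feeds a tagger or parser would typically treat the expression as three tokens; information extraction from biomedical text is better served by a single compound token (a value with its error margin); a text-to-speech system has to expand it into the words that are actually spoken; and a dictionary-based spell checker would treat it as no word at all. One remark on the count: read aloud as "four point five three, plus or minus zero point fifteen percent", the expression has *eleven* words rather than *nine*.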
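The competing analyses can be made concrete in a few lines. This is only a sketch: the compound-token pattern is an assumption about what such expressions look like, not a general solution.

```python
import re

expr = '4.53 +/- 0.15%'

# Whitespace tokenization: three tokens.
three = expr.split()

# One compound token: number, '+/-', number, optional percent sign.
compound = r'\d+(?:\.\d+)?\s*\+/-\s*\d+(?:\.\d+)?%?'
single = re.findall(compound, expr)

# The spoken form quoted in the exercise, counted word by word.
spoken = 'four point five three plus or minus zero point fifteen percent'.split()

print(len(three), len(single), len(spoken))
```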

**29. Readability measures are used to score the reading difficulty of a text, for the purposes of selecting texts of appropriate difficulty for language learners. Let us define ** `μw` ** to be the average number of letters per word, and ** `μs` ** to be the average number of words per sentence, in a given text. The Automated Readability Index (ARI) of the text is defined to be: ** `4.71 μw + 0.5 μs - 21.43` **. Compute the ARI score for various sections of the Brown Corpus, including section ** `f`** (lore) and ** `j` ** (learned). Make use of the fact that ** `nltk.corpus.brown.words()` ** produces a sequence of words, while ** `nltk.corpus.brown.sents()` ** produces a sequence of sentences.**

In [21]:
from nltk.corpus import brown
def miu_w(category):
    word_length = sum(len(w) for w in brown.words(categories=category))
    word_number = len(brown.words(categories=category))
    return word_length / word_number

def miu_s(category):
    sent_length = sum(len(s) for s in brown.sents(categories=category))
    sent_number = len(brown.sents(categories=category))
    return sent_length / sent_number

def ari(category):
    return 4.71 * miu_w(category) + 0.5 * miu_s(category) - 21.43

for category in brown.categories():
    print(category, ari(category))

adventure 4.0841684990890705
belles_lettres 10.987652885621749
editorial 9.471025332953673
fiction 4.9104735321302115
government 12.08430349501021
hobbies 8.922356393630267
humor 7.887805248319808
learned 11.926007043317348
lore 10.254756197101155
mystery 3.8335518942055167
news 10.176684595052684
religion 10.203109907301261
reviews 10.769699888473433
romance 4.34922419804213
science_fiction 4.978058336905399


**30. Use the Porter Stemmer to normalize some tokenized text, calling the stemmer on each word. Do the same thing with the Lancaster Stemmer and see if you observe any differences.**

In [31]:

import nltk
from nltk import word_tokenize
raw = """THE Dawn of Love is an oil painting by English artist 
William Etty, first exhibited in 1828. Loosely based on a passage 
from John Milton's 1634 Comus, it shows Venus leaning across to 
wake the sleeping Love by stroking his wings. It was very poorly 
received when first exhibited; the stylised Venus was thought unduly 
influenced by foreign artists such as Rubens as well as being overly 
voluptuous and unrealistically coloured, while the painting as a whole 
was considered tasteless and obscene. The Dawn of Love was omitted 
from the major 1849 retrospective exhibition of Etty's works, and 
its exhibition in Glasgow in 1899 drew complaints for its supposed 
obscenity. In 1889 it was bought by Merton Russell-Cotes, and has 
remained in the collection of the Russell-Cotes Art Gallery & Museum ever since."""
# from Wikipedia 2018-08-08's featured article
tokens = word_tokenize(raw)

porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

porter_output = [porter.stem(t) for t in tokens] 
porter_output


['the',
 'dawn',
 'of',
 'love',
 'is',
 'an',
 'oil',
 'paint',
 'by',
 'english',
 'artist',
 'william',
 'etti',
 ',',
 'first',
 'exhibit',
 'in',
 '1828',
 '.',
 'loos',
 'base',
 'on',
 'a',
 'passag',
 'from',
 'john',
 'milton',
 "'s",
 '1634',
 'comu',
 ',',
 'it',
 'show',
 'venu',
 'lean',
 'across',
 'to',
 'wake',
 'the',
 'sleep',
 'love',
 'by',
 'stroke',
 'hi',
 'wing',
 '.',
 'It',
 'wa',
 'veri',
 'poorli',
 'receiv',
 'when',
 'first',
 'exhibit',
 ';',
 'the',
 'stylis',
 'venu',
 'wa',
 'thought',
 'unduli',
 'influenc',
 'by',
 'foreign',
 'artist',
 'such',
 'as',
 'ruben',
 'as',
 'well',
 'as',
 'be',
 'overli',
 'voluptu',
 'and',
 'unrealist',
 'colour',
 ',',
 'while',
 'the',
 'paint',
 'as',
 'a',
 'whole',
 'wa',
 'consid',
 'tasteless',
 'and',
 'obscen',
 '.',
 'the',
 'dawn',
 'of',
 'love',
 'wa',
 'omit',
 'from',
 'the',
 'major',
 '1849',
 'retrospect',
 'exhibit',
 'of',
 'etti',
 "'s",
 'work',
 ',',
 'and',
 'it',
 'exhibit',
 'in',
 'glasgow',

In [32]:
lancaster_output = [lancaster.stem(t) for t in tokens]
lancaster_output

['the',
 'dawn',
 'of',
 'lov',
 'is',
 'an',
 'oil',
 'paint',
 'by',
 'engl',
 'art',
 'william',
 'etty',
 ',',
 'first',
 'exhibit',
 'in',
 '1828',
 '.',
 'loos',
 'bas',
 'on',
 'a',
 'pass',
 'from',
 'john',
 'milton',
 "'s",
 '1634',
 'com',
 ',',
 'it',
 'show',
 'ven',
 'lean',
 'across',
 'to',
 'wak',
 'the',
 'sleep',
 'lov',
 'by',
 'stroking',
 'his',
 'wing',
 '.',
 'it',
 'was',
 'very',
 'poor',
 'receiv',
 'when',
 'first',
 'exhibit',
 ';',
 'the',
 'styl',
 'ven',
 'was',
 'thought',
 'undu',
 'influ',
 'by',
 'foreign',
 'art',
 'such',
 'as',
 'rub',
 'as',
 'wel',
 'as',
 'being',
 'ov',
 'voluptu',
 'and',
 'unr',
 'colo',
 ',',
 'whil',
 'the',
 'paint',
 'as',
 'a',
 'whol',
 'was',
 'consid',
 'tasteless',
 'and',
 'obsc',
 '.',
 'the',
 'dawn',
 'of',
 'lov',
 'was',
 'omit',
 'from',
 'the',
 'maj',
 '1849',
 'retrospect',
 'exhibit',
 'of',
 'etty',
 "'s",
 'work',
 ',',
 'and',
 'it',
 'exhibit',
 'in',
 'glasgow',
 'in',
 '1899',
 'drew',
 'complaint',

**31. Define the variable saying to contain the list ** `['After', 'all', 'is', 'said', 'and', 'done', ',', 'more',
'is', 'said', 'than', 'done', '.']` **. Process this list using a ** `for` ** loop, and store the length of each word in a new list lengths. Hint: begin by assigning the empty list to ** `lengths` **, using ** `lengths = []` **. Then each time through the loop, use ** `append()` ** to add another length value to the list. Now do the same thing using a list comprehension.**

In [33]:
saying = ['After', 'all', 'is', 'said', 'and', 'done', ',', 'more', 'is', 'said', 'than', 'done', '.']
lengths = []

for w in saying:
    lengths.append(len(w))

In [34]:
lengths = [len(w) for w in saying]

**32. Define a variable ** `silly` ** to contain the string: ** 'newly formed bland ideas are inexpressible in an infuriating way' **. (This happens to be the legitimate interpretation that bilingual English-Spanish speakers can assign to Chomsky's famous nonsense phrase, colorless green ideas sleep furiously according to Wikipedia). Now write code to perform the following tasks:**  
a. **Split ** `silly` ** into a list of strings, one per word, using Python's ** `split()` ** operation, and save this to a variable called ** `bland`.  
b. **Extract the second letter of each word in ** `silly` ** and join them into a string, to get ** 'eoldrnnnna'.  
c. **Combine the words in ** `bland` ** back into a single string, using ** `join()` **. Make sure the words in the resulting string are separated with whitespace.**  
d. **Print the words of ** `silly` ** in alphabetical order, one per line.**

In [35]:
silly = 'newly formed bland ideas are inexpressible in an infuriating way'
bland = silly.split()                 # a
''.join(w[1] for w in bland)          # b
' '.join(bland)                       # c
for w in sorted(bland):               # d: one word per line
    print(w)

an
are
bland
formed
ideas
in
inexpressible
infuriating
newly
way


**33. The ** `index()` ** function can be used to look up items in sequences. For example, ** `'inexpressible'.index('e')` ** tells us the index of the first position of the letter ** `e`.  
a. **What happens when you look up a substring, e.g. ** `'inexpressible'.index('re')` **?**  
b. **Define a variable ** `words` ** containing a list of words. Now use ** `words.index()` ** to look up the position of an individual word.**  
c. **Define a variable ** `silly` ** as in the exercise above. Use the ** `index()` ** function in combination with list slicing to build a list ** `phrase` ** consisting of all the words up to (but not including) ** `in` ** in silly.**

In [36]:
'inexpressible'.index('re')

5

In [37]:
words = ['a', 'list', 'of', 'words']
words.index('of')

2

In [38]:
phrase = bland[:bland.index('in')]             # use bland rather than silly here
phrase

['newly', 'formed', 'bland', 'ideas', 'are', 'inexpressible']

**34. Write code to convert nationality adjectives like ** *Canadian* ** and ** *Australian* ** to their corresponding nouns ** *Canada* ** and ** *Australia* ** (see http://en.wikipedia.org/wiki/List_of_adjectival_forms_of_place_names).**

In [39]:
# the link should be
# https://en.wikipedia.org/wiki/List_of_adjectival_and_demonymic_forms_for_countries_and_nations

# Argentina - Argentinian
# Australia - Australian
# Austria - Austrian
# to be finished...

pattern = r'(\w+)ian'
repl = r'\1a'
re.sub(pattern, repl, 'Canadian')

'Canada'

**35. Read the LanguageLog post on phrases of the form ** *as best as p can* ** and ** *as best p can* **, where ** *p* ** is a pronoun. Investigate this phenomenon with the help of a corpus and the ** `findall()` ** method for searching tokenized text described in 3.5. http://itre.cis.upenn.edu/~myl/languagelog/archives/002733.html**

In [40]:
text = """ I wil straight dispose, as best I can, th'inferiour Magistrate ...
And I haue thrust my selfe into this maze, Happily to wiue and thriue, as best I may ...
In fine, my life is that of a great schoolboy, getting into scrapes for the fun of it,
and fighting my way out as best as I can!
As best as she can she hides herself in the full sunlight
"""
# text sample from the given url link
re.findall(r'(?i)as best (?:as )?(?:I|we|you|he|she|they|it) can', text)

['as best I can', 'as best as I can', 'As best as she can']

**36. Study the ** *lolcat* ** version of the book of Genesis, accessible as ** `nltk.corpus.genesis.words('lolcat.txt')` **, and the rules for converting text into ** *lolspeak* ** at http://www.lolcatbible.com/index.php?title=How_to_speak_lolcat. Define regular expressions to convert English words into corresponding lolspeak words.**

In [41]:
# nltk.corpus.genesis.words('lolcat.txt')
text = 'sight kite dude over kitty little'
# just implement some easy-to-check rules
text = re.sub(r'ight', 'iet', text)                             # ight -> iet
text = re.sub(r'\bdude\b', 'dood', text)                        # dude -> dood
text = re.sub(r'([b-df-hj-np-tv-z])(e)\b', r'\2\1', text)       # swap a final consonant and the ending 'e'
text = re.sub(r'er\b', 'ah', text)                              # -er -> -ah
text = re.sub(r'y\b', 'eh', text)                               # -y -> -eh
text = re.sub(r'le\b', 'el', text)                              # -le -> -el
text

'siet kiet dood ovah kitteh littel'

**37. Read about the ** `re.sub()` ** function for string substitution using regular expressions, using ** `help(re.sub)` ** and by consulting the further readings for this chapter. Use ** `re.sub` ** in writing code to remove HTML tags from an HTML file, and to normalize whitespace.**

In [42]:
file = open('BBC.html').read()
file = re.sub(r'<[^>]+>', '', file)      # '<.*>' is greedy and would delete everything from the first '<' to the last '>' on a line
file = re.sub(r'\s+', ' ', file)

FileNotFoundError: [Errno 2] No such file or directory: 'BBC.html'
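Since `BBC.html` is not available here, the same two substitutions can be tried on an inline string. This is a minimal sketch with invented markup; it also shows what a greedy `<.*>` would do to the same input.

```python
import re

html = '<html><body><p>Hello,   <b>world</b>!</p>\n<p>Bye</p></body></html>'

# Greedy: on each line, everything from the first '<' to the last '>' is removed.
greedy = re.sub(r'<.*>', '', html)

# One tag at a time: '[^>]+' cannot run past the closing '>'.
text = re.sub(r'<[^>]+>', '', html)
text = re.sub(r'\s+', ' ', text).strip()

print(repr(greedy))
print(repr(text))
```

Regular expressions are a rough tool for HTML (comments, scripts, and `>` inside attributes will trip them up), so a parser is the safer choice for real pages.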

**38. An interesting challenge for tokenization is words that have been split across a line-break. E.g. if ** *long-term* ** is split, then we have the string **  *long-\nterm*.  
a. **Write a regular expression that identifies words that are hyphenated at a line-break. The expression will need to include the ** `\n` ** character.**  
b. **Use ** `re.sub()` ** to remove the \n character from these words.**  
c. **How might you identify words that should not remain hyphenated once the newline is removed, e.g. ** 'encyclo-\npedia'?

In [43]:
text = """long-
term"""
pattern = r'\w+-\n\w+'
re.findall(pattern, text)

['long-\nterm']

In [44]:
pattern = r'(\w+-)(\n)(\w+)'
re.findall(pattern, text)
re.sub(pattern, r'\1\3', text)

'long-term'

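For part c, join the two halves without the hyphen and check whether the result (e.g. `encyclopedia`) occurs in a word list such as the Words Corpus. If it does, the hyphen was only a line-break artifact and can be dropped; otherwise (e.g. `longterm`) the hyphen is kept.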
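The word-list check for part c could look roughly like this. It is a sketch under stated assumptions: `VOCAB` is a tiny hypothetical stand-in for a real word list (such as `nltk.corpus.words`), and the helper names are invented.

```python
import re

# Hypothetical stand-in vocabulary; a real run would use a full word list.
VOCAB = {'encyclopedia', 'long', 'term'}

def join_linebreak(match):
    """Drop the hyphen only if the joined form is a known word."""
    left, right = match.group(1), match.group(2)
    joined = left + right
    if joined.lower() in VOCAB:
        return joined
    return left + '-' + right

def dehyphenate(text):
    # A function as the replacement lets each match be decided individually.
    return re.sub(r'(\w+)-\n(\w+)', join_linebreak, text)

result = dehyphenate('an encyclo-\npedia of long-\nterm plans')
print(result)
```

The heuristic fails for joined forms that happen to be words on their own (or for genuine compounds missing from the list), so it is a first approximation only.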

**39. Read the Wikipedia entry on ** *Soundex* **. Implement this algorithm in Python.**

In [45]:
# https://en.wikipedia.org/wiki/Soundex
# cumbersome implementation...
def soundex(word):
    word = word.upper()         # convert the word to upper case for convenience
    
    # Step 1: Retain the first letter
    sound = word[0]

    # Step 3a: if two or more letters with the same number are adjacent
    # in the original name (before step 1), only retain the first letter
    # ('+' also collapses runs of three or more letters)
    word = re.sub(r'([BFPV])[BFPV]+', r'\1', word)
    word = re.sub(r'([CGJKQSXZ])[CGJKQSXZ]+', r'\1', word)
    word = re.sub(r'([DT])[DT]+', r'\1', word)
    word = re.sub(r'LL+', r'L', word)
    word = re.sub(r'([MN])[MN]+', r'\1', word)
    word = re.sub(r'RR+', r'R', word)
    
    # Step 3b: two letters with the same number separated by 'h' or 'w' are coded as a single number
    word = re.sub(r'([BFPV])([HW])[BFPV]', r'\1\2', word)
    word = re.sub(r'([CGJKQSXZ])([HW])[CGJKQSXZ]', r'\1\2', word)
    word = re.sub(r'([DT])([HW])[DT]', r'\1\2', word)
    word = re.sub(r'L([HW])L', r'L\1', word)
    word = re.sub(r'([MN])([HW])[MN]', r'\1\2', word)
    word = re.sub(r'R([HW])R', r'R\1', word)
    
    # Step 2: drop vowels, Y, H and W, then replace the consonants with digits
    word = re.sub(r'[AEIOUYHW]', r'', word)
    word = re.sub(r'[BFPV]', '1', word)
    word = re.sub(r'[CGJKQSXZ]', '2', word)
    word = re.sub(r'[DT]', '3', word)
    word = re.sub(r'L', '4', word)
    word = re.sub(r'[MN]', '5', word)
    word = re.sub(r'R', '6', word)
    
    # Step 4: If you have too few letters in your word that you can't assign three numbers, 
    # append with zeros until there are three numbers. If you have more than 3 letters, 
    # just retain the first 3 numbers.
    if sound in 'AEIOUYHW':
        sound = (sound + word + '000')[:4]
    else:
        sound = (sound + word[1:] + '000')[:4]
    return sound

soundex('Honeyman')

'H555'
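
For comparison, the same rules can be applied in a single pass over the letters instead of with a chain of substitutions. This is a sketch based on the American Soundex rules as described on the Wikipedia page; the name `soundex_compact` is made up here, and treating any letter outside the six consonant groups (including accented letters) like a vowel is an assumption.

```python
def soundex_compact(name):
    """Single-pass American Soundex (illustrative sketch)."""
    codes = {}
    for digit, letters in enumerate(['BFPV', 'CGJKQSXZ', 'DT', 'L', 'MN', 'R'], start=1):
        for ch in letters:
            codes[ch] = str(digit)

    name = ''.join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ''

    result = name[0]
    prev = codes.get(name[0], '')      # code of the previous letter, '' for vowels
    for ch in name[1:]:
        if ch in 'HW':
            continue                   # H and W do not separate letters with the same code
        code = codes.get(ch, '')       # vowels and Y have no code and reset `prev`
        if code and code != prev:
            result += code
        prev = code
    return (result + '000')[:4]        # pad with zeros, keep letter + three digits

for name in ['Robert', 'Rupert', 'Ashcraft', 'Tymczak', 'Pfister', 'Honeyman']:
    print(name, soundex_compact(name))
# Robert R163, Rupert R163, Ashcraft A261, Tymczak T522, Pfister P236, Honeyman H555
```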

**40. Obtain raw texts from two or more genres and compute their respective reading difficulty scores as in the earlier exercise on reading difficulty. E.g. compare ABC Rural News and ABC Science News (** `nltk.corpus.abc` **). Use Punkt to perform sentence segmentation.**

In [46]:
# nltk.corpus.abc.fileids()

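def ari(fileid):
    # Word statistics are computed over the corpus tokens
    words = nltk.corpus.abc.words(fileids=fileid)
    
    # Sentences come from Punkt (nltk.sent_tokenize) applied to the raw text
    text = nltk.corpus.abc.raw(fileids=fileid)
    sents = nltk.sent_tokenize(text)
    
    word_number = len(words)
    word_length = sum(len(w) for w in words)
    mu_w = word_length / word_number          # average letters per word

    sent_length = sum(len(s.split()) for s in sents)
    sent_number = len(sents)
    mu_s = sent_length / sent_number          # average whitespace-separated words per sentence
    
    # Automated Readability Index
    return 4.71 * mu_w + 0.5 * mu_s - 21.43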
print(ari('rural.txt'))
print(ari('science.txt'))

10.66074843699441
10.703963706930097
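
The score is the Automated Readability Index, `4.71 * mu_w + 0.5 * mu_s - 21.43`, where `mu_w` is the average number of letters per word and `mu_s` the average number of words per sentence. The arithmetic can be checked by hand on a tiny input. The sketch below assumes the text is already split into tokenized sentences (so no corpus or Punkt model is needed) and computes both averages over the same token list; `ari_from_sentences` and `sample` are names made up for illustration.

```python
def ari_from_sentences(sentences):
    """Automated Readability Index for a list of tokenized sentences."""
    words = [w for sent in sentences for w in sent]
    mu_w = sum(len(w) for w in words) / len(words)   # average letters per word
    mu_s = len(words) / len(sentences)               # average words per sentence
    return 4.71 * mu_w + 0.5 * mu_s - 21.43

# 6 words, 23 letters, 2 sentences: 4.71 * 23/6 + 0.5 * 3 - 21.43, roughly -1.875
sample = [['The', 'cat', 'sat'], ['It', 'purred', 'loudly']]
print(ari_from_sentences(sample))
```

Very short, simple sentences can score below zero, as here.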


**41. Rewrite the following nested loop as a nested list comprehension:**  
```Python
>>> words = ['attribution', 'confabulation', 'elocution',
...          'sequoia', 'tenacious', 'unidirectional']
>>> vsequences = set()
>>> for word in words:
...     vowels = []
...     for char in word:
...         if char in 'aeiou':
...             vowels.append(char)
...     vsequences.add(''.join(vowels))
>>> sorted(vsequences)
['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']
```

In [47]:
words = ['attribution', 'confabulation', 'elocution', 'sequoia', 'tenacious', 'unidirectional']
vsequences = set([''.join([char for char in word if char in 'aeiou']) for word in words])
sorted(vsequences)

['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']

**42. Use WordNet to create a semantic index for a text collection. Extend the concordance search program in 3.6, indexing each word using the offset of its first synset, e.g. ** `wn.synsets('dog')[0].offset` ** (and optionally the offset of some of its ancestors in the hypernym hierarchy).**

In [49]:
class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        # Basic idea: use the offset of the word's first WordNet synset as the
        # index key, rather than the stemmed word itself.
        # Words without any synset are skipped.
        self._index = nltk.Index((key, i)
                                 for (i, word) in enumerate(text)
                                 for key in [self._key(word)]
                                 if key is not None)

    def concordance(self, word, width=40):
        key = self._key(word)
        if key is None:                  # the query word has no synset
            print('No synset found for', word)
            return
        wc = int(width/4)                # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _key(self, word):
        # Look up WordNet once per word; return None if there is no synset
        synsets = wn.synsets(self._stem(word))
        return synsets[0].offset() if synsets else None

    def _stem(self, word):
        return self._stemmer.stem(word).lower()
    
porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(porter, grail)
text.concordance('women')

ModuleNotFoundError: No module named 'wn'

**43. With the help of a multilingual corpus such as the Universal Declaration of Human Rights Corpus (** `nltk.corpus.udhr` **), and NLTK's frequency distribution and rank correlation functionality (** `nltk.FreqDist`, `nltk.spearman_correlation` **), develop a system that guesses the language of a previously unseen text. For simplicity, work with a single character encoding and just a few languages.**

In [50]:
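def guess_language(text):
    candidate_languages = ['English-Latin1', 'French_Francais-Latin1',
                           'German_Deutsch-Latin1', 'Italian-Latin1', 'Spanish-Latin1']

    # Build each language's vocabulary once, instead of re-reading the corpus for every word
    vocab = {lang: set(nltk.corpus.udhr.words(lang)) for lang in candidate_languages}

    # Count, for each language, how many tokens of the text occur in its UDHR vocabulary
    fdist = nltk.FreqDist(lang for lang in candidate_languages
                               for w in text if w in vocab[lang])
    return fdist

# Each sample is kept on a single line; only the words matter, not the layout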
text_english = "Wikipedia is a project dedicated to the building of free encyclopedias in all languages of the world. The project started with the English-language Wikipedia on January 15, 2001. On March 23, 2001 it was joined by a French Wikipedia, and shortly afterwards by many other languages. Large efforts are underway to highlight the international nature of the project. On 20 September 2004 Wikipedia reached a total of 1,000,000 articles in over 100 languages.".split()
text_french = "Wikipédia Écouter est un projet d'encyclopédie universelle, multilingue, créé par Jimmy Wales et Larry Sanger le 15 janvier 2001 en wiki sous le nom de domaine wikipedia.org. Les versions des différentes langues utilisent le même logiciel de publication, MediaWiki, et ont la même apparence, mais elles comportent des variations dans leurs contenus, leurs structures et leurs modalités d'édition et de gestion.".split()
text_german = "Wikipedia ist ein am 15. Januar 2001 gegründetes gemeinnütziges Projekt zur Erstellung einer Enzyklopädie in zahlreichen Sprachen mit Hilfe des Wikiprinzips. Gemäß Publikumsnachfrage und Verbreitung gehört Wikipedia unterdessen zu den Massenmedien. Aufgrund der für die Entstehung und Weiterentwicklung dieser Enzyklopädie charakteristischen kollaborativen Erstellungs-, Kontroll- und Aushandlungsprozesse der ehrenamtlichen Beteiligten zählt Wikipedia zugleich zu den Social Media.".split()
text_italian = "Wikipedia (pronuncia: vedi sotto) è un'enciclopedia online a contenuto libero, collaborativa, multilingue e gratuita, nata nel 2001, sostenuta e ospitata dalla Wikimedia Foundation, un'organizzazione non a scopo di lucro statunitense. Lanciata da Jimmy Wales e Larry Sanger il 15 gennaio 2001, inizialmente nell'edizione in lingua inglese, nei mesi successivi ha aggiunto edizioni in numerose altre lingue. Sanger ne suggerì il nome, una parola macedonia nata dall'unione della radice wiki al suffisso pedia (da enciclopedia).".split()
text_spanish = "Wikipedia es una enciclopedia libre, políglota y editada de manera colaborativa. Es administrada por la Fundación Wikimedia, una organización sin ánimo de lucro cuya financiación está basada en donaciones. Sus más de 46 millones de artículos en 288 idiomas han sido redactados conjuntamente por voluntarios de todo el mundo, lo que hace un total de más de 2000 millones de ediciones, y prácticamente cualquier persona con acceso al proyecto puede editarlos, salvo que la página se encuentre protegida contra vandalismos para evitar problemas y/o trifulcas.".split()

print(guess_language(text_english).max())
print(guess_language(text_french).max())
print(guess_language(text_german).max())
print(guess_language(text_italian).max())
print(guess_language(text_spanish).max())

# Note: this only counts vocabulary overlap; I did not work out how to use
# the rank correlation functionality (nltk.spearman_correlation) here.

English-Latin1
French_Francais-Latin1
German_Deutsch-Latin1
Italian-Latin1
Spanish-Latin1
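
The rank correlation part of the exercise can be approached by ranking letters by frequency and comparing rankings. The sketch below is self-contained and makes several assumptions: three short inline reference sentences stand in for `nltk.corpus.udhr`, a `Counter` stands in for `nltk.FreqDist`, and the hand-written `spearman` function (which re-ranks within the shared letters) stands in for `nltk.spearman_correlation`. All function names are made up for illustration, and with references this small the guesses are not reliable.

```python
from collections import Counter

# Short reference samples (assumption: stand-ins for the UDHR corpus)
SAMPLES = {
    'English': "All human beings are born free and equal in dignity and rights. "
               "They are endowed with reason and conscience and should act "
               "towards one another in a spirit of brotherhood.",
    'French': "Tous les êtres humains naissent libres et égaux en dignité et en "
              "droits. Ils sont doués de raison et de conscience et doivent agir "
              "les uns envers les autres dans un esprit de fraternité.",
    'German': "Alle Menschen sind frei und gleich an Würde und Rechten geboren. "
              "Sie sind mit Vernunft und Gewissen begabt und sollen einander im "
              "Geist der Brüderlichkeit begegnen.",
}

def char_ranks(text):
    """Rank letters by frequency: 0 is the most frequent letter."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))   # ties broken alphabetically
    return {ch: rank for rank, ch in enumerate(ordered)}

def spearman(ranks1, ranks2):
    """Spearman's rho over the keys that both rankings share."""
    shared = [k for k in ranks1 if k in ranks2]
    n = len(shared)
    if n < 2:
        return 0.0
    # Re-rank within the shared keys so that ranks run 0..n-1 on both sides
    pos1 = {k: i for i, k in enumerate(sorted(shared, key=ranks1.get))}
    pos2 = {k: i for i, k in enumerate(sorted(shared, key=ranks2.get))}
    d2 = sum((pos1[k] - pos2[k]) ** 2 for k in shared)
    return 1 - 6 * d2 / (n * (n * n - 1))

def guess_language_by_rank(text, samples=SAMPLES):
    """Return the language whose letter ranking correlates best with the text."""
    ranks = char_ranks(text)
    scores = {lang: spearman(ranks, char_ranks(ref)) for lang, ref in samples.items()}
    return max(scores, key=scores.get)

print(guess_language_by_rank("The project started with the English-language Wikipedia."))
```

With the real corpus, the reference rankings could be built from `nltk.corpus.udhr.raw(lang)` for each candidate language instead of the inline samples.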


**44. Write a program that processes a text and discovers cases where a word has been used with a novel sense. For each word, compute the WordNet similarity between all synsets of the word and all synsets of the words in its context. (Note that this is a crude approach; doing it well is a difficult, open research problem.)**

In [51]:
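def novel_sense(text, window=3, threshold=0.1):
    # Crude approach: flag a word when none of its synsets is similar to any
    # synset of the words around it. `text` is a list of tokens; `window` and
    # `threshold` are arbitrary choices, not tuned values.
    candidates = []
    for i, word in enumerate(text):
        all_synsets = wn.synsets(word)
        context = text[max(0, i - window):i] + text[i + 1:i + 1 + window]
        context_synsets = [s for other_word in context for s in wn.synsets(other_word)]
        if not all_synsets or not context_synsets:
            continue                      # nothing to compare
        # path_similarity returns None when no path exists; treat that as 0
        best = max((s1.path_similarity(s2) or 0)
                   for s1 in all_synsets for s2 in context_synsets)
        if best < threshold:
            candidates.append((word, best))
    return candidates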

**45. Read the article on normalization of non-standard words (Sproat et al, 2001), and implement a similar system for text normalization.**

In [52]:
# paper link:
# http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.228.200&rep=rep1&type=pdf

# Not attempted: I have not had time to read the paper yet.