# Processing real text (from out on the inter-webs)

> _Reading_: NLPP Chapter 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.9, and 3.10. It's not important that you go in depth with everything here. The key thing is that you know that Chapter 3 of this book exists, and that it's a great place to return to if you're ever in need of an explanation of regular expressions, Unicode, or other topics that you forget as soon as you stop using them (and don't worry, I forget about those things too).

> _Exercise 1_: Just a couple of examples from the book: Work through the exercises NLPP1e 3.12: 6, 30.

6. Describe the class of strings matched by the following regular expressions.

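    a. **[a-zA-Z]+**: One or more consecutive ASCII letters, upper or lower case (e.g. "As", "concerned").
    
    b. **[A-Z][a-z]*** : One capital letter followed by zero or more lower-case letters (e.g. "A", "Pete").
    
    c. **p[aeiou]{,2}t** : A lower-case 'p', then at most two lower-case vowels (possibly none), then 't' (e.g. "pt", "pat", "pout").
    
    d. **\d+(\.\d+)?** : An integer or decimal number: one or more digits, optionally followed by a dot and one or more further digits (e.g. "50", "3.14").
    
    e. **([^aeiou][aeiou][^aeiou])** : Exactly three characters: a lower-case vowel with a non-vowel on each side. "Non-vowel" here means any character other than a, e, i, o, u, so spaces, digits, punctuation and capital letters all count (e.g. "lon", " as").
    
    f. **\w+|[^\w\s]+** : Either a run of word characters (letters, digits, underscore) or a run of characters that are neither word characters nor whitespace, i.e. punctuation and symbols. Applied repeatedly, it splits text into word tokens and punctuation tokens.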

Test your answers using nltk.re_show().

In [1]:
import nltk

In [2]:
# Raw strings (r'...') pass backslashes such as \d, \w and \s to the regex engine unchanged
regex_a = r'[a-zA-Z]+'
regex_b = r'[A-Z][a-z]*'
regex_c = r'p[aeiou]{,2}t'
regex_d = r'\d+(\.\d+)?'
regex_e = r'([^aeiou][aeiou][^aeiou])'
regex_f = r'\w+|[^\w\s]+'
regex = [regex_a,
         regex_b,
         regex_c,
         regex_d,
         regex_e,
         regex_f]

In [3]:
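# Loop variable is named 'pattern' so it does not shadow Python's built-in re module
for pattern in regex:
    print(pattern, ':')
    nltk.re_show(pattern, 'As long as Pete is concerned, 50 of those parts cost at least $5000.')
    print()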

[a-zA-Z]+ :
{As} {long} {as} {Pete} {is} {concerned}, 50 {of} {those} {parts} {cost} {at} {least} $5000.

[A-Z][a-z]* :
{As} long as {Pete} is concerned, 50 of those parts cost at least $5000.

p[aeiou]{,2}t :
As long as Pete is concerned, 50 of those parts cost at least $5000.

\d+(\.\d+)? :
As long as Pete is concerned, {50} of those parts cost at least ${5000}.

([^aeiou][aeiou][^aeiou]) :
As {lon}g{ as} {Pet}e{ is} {con}{cer}{ned}, 50{ of} t{hos}e {par}ts {cos}t{ at} least $5000.

\w+|[^\w\s]+ :
{As} {long} {as} {Pete} {is} {concerned}{,} {50} {of} {those} {parts} {cost} {at} {least} {$}{5000}{.}
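
The descriptions in exercise 6 can also be checked with Python's standard `re` module, which returns the matches as a list instead of marking them with braces. The following is a minimal sketch; the names `patterns` and `matches` are my own and are not part of NLTK or the book.

```python
import re

# The six patterns from exercise 6, as raw strings so the backslashes survive.
patterns = {
    "a": r"[a-zA-Z]+",
    "b": r"[A-Z][a-z]*",
    "c": r"p[aeiou]{,2}t",
    "d": r"\d+(\.\d+)?",
    "e": r"([^aeiou][aeiou][^aeiou])",
    "f": r"\w+|[^\w\s]+",
}


def matches(pattern, text):
    """Return every non-overlapping match of pattern in text, left to right."""
    # finditer + group(0) yields the whole match even when the pattern
    # contains capturing groups (re.findall would return the group instead).
    return [m.group(0) for m in re.finditer(pattern, text)]


sentence = "As long as Pete is concerned, 50 of those parts cost at least $5000."
for name, pattern in patterns.items():
    print(name, matches(pattern, sentence))

# Pattern c finds nothing in the sentence, so try it on a few made-up strings.
print(matches(patterns["c"], "pt pat pout peat paint"))
```

On the sentence used above, pattern b gives `['As', 'Pete']` and pattern d gives `['50', '5000']`, in line with the `nltk.re_show()` output. Pattern c matches nothing there, but on the made-up string it picks out "pt", "pat", "pout" and "peat" while skipping "paint", where a consonant sits between the vowels and the "t".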



30. Use the Porter Stemmer to normalize some tokenized text, calling the stemmer on each word. Do the same thing with the Lancaster Stemmer and see if you observe any differences.

In [4]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

In [5]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [6]:
[porter.stem(t) for t in text6]

['scene',
 '1',
 ':',
 '[',
 'wind',
 ']',
 '[',
 'clop',
 'clop',
 'clop',
 ']',
 'king',
 'arthur',
 ':',
 'whoa',
 'there',
 '!',
 '[',
 'clop',
 'clop',
 'clop',
 ']',
 'soldier',
 '#',
 '1',
 ':',
 'halt',
 '!',
 'who',
 'goe',
 'there',
 '?',
 'arthur',
 ':',
 'It',
 'is',
 'I',
 ',',
 'arthur',
 ',',
 'son',
 'of',
 'uther',
 'pendragon',
 ',',
 'from',
 'the',
 'castl',
 'of',
 'camelot',
 '.',
 'king',
 'of',
 'the',
 'briton',
 ',',
 'defeat',
 'of',
 'the',
 'saxon',
 ',',
 'sovereign',
 'of',
 'all',
 'england',
 '!',
 'soldier',
 '#',
 '1',
 ':',
 'pull',
 'the',
 'other',
 'one',
 '!',
 'arthur',
 ':',
 'I',
 'am',
 ',',
 '...',
 'and',
 'thi',
 'is',
 'my',
 'trusti',
 'servant',
 'patsi',
 '.',
 'We',
 'have',
 'ridden',
 'the',
 'length',
 'and',
 'breadth',
 'of',
 'the',
 'land',
 'in',
 'search',
 'of',
 'knight',
 'who',
 'will',
 'join',
 'me',
 'in',
 'my',
 'court',
 'at',
 'camelot',
 '.',
 'I',
 'must',
 'speak',
 'with',
 'your',
 'lord',
 'and',
 'master',
 

In [7]:
[lancaster.stem(t) for t in text6]

['scen',
 '1',
 ':',
 '[',
 'wind',
 ']',
 '[',
 'clop',
 'clop',
 'clop',
 ']',
 'king',
 'arth',
 ':',
 'who',
 'ther',
 '!',
 '[',
 'clop',
 'clop',
 'clop',
 ']',
 'soldy',
 '#',
 '1',
 ':',
 'halt',
 '!',
 'who',
 'goe',
 'ther',
 '?',
 'arth',
 ':',
 'it',
 'is',
 'i',
 ',',
 'arth',
 ',',
 'son',
 'of',
 'uth',
 'pendragon',
 ',',
 'from',
 'the',
 'castl',
 'of',
 'camelot',
 '.',
 'king',
 'of',
 'the',
 'briton',
 ',',
 'def',
 'of',
 'the',
 'saxon',
 ',',
 'sovereign',
 'of',
 'al',
 'england',
 '!',
 'soldy',
 '#',
 '1',
 ':',
 'pul',
 'the',
 'oth',
 'on',
 '!',
 'arth',
 ':',
 'i',
 'am',
 ',',
 '...',
 'and',
 'thi',
 'is',
 'my',
 'trusty',
 'serv',
 'patsy',
 '.',
 'we',
 'hav',
 'rid',
 'the',
 'leng',
 'and',
 'bread',
 'of',
 'the',
 'land',
 'in',
 'search',
 'of',
 'knight',
 'who',
 'wil',
 'join',
 'me',
 'in',
 'my',
 'court',
 'at',
 'camelot',
 '.',
 'i',
 'must',
 'speak',
 'with',
 'yo',
 'lord',
 'and',
 'mast',
 '.',
 'soldy',
 '#',
 '1',
 ':',
 'what',


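Both stemmers strip suffixes to reduce each word to a stem, but they apply different rule sets, and Lancaster is noticeably more aggressive. In the output above, Porter leaves 'arthur', 'soldier', 'length' and 'your' intact, while Lancaster cuts them down to 'arth', 'soldy', 'leng' and 'yo'. Lancaster also lower-cases every token ('It' becomes 'it'), whereas Porter here keeps 'It', 'I' and 'We' capitalized. On the other hand, Porter rewrites a final 'y' as 'i' ('trusti', 'patsi'), where Lancaster returns 'trusty' and 'patsy'. Neither stemmer is guaranteed to produce a dictionary word: both give 'castl' and 'goe'.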
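
Scrolling through two long lists makes the differences hard to spot, so it can help to print only the tokens on which the two stemmers disagree. Below is a rough sketch of that idea. `stem_differences` is a hypothetical helper of my own, not an NLTK function, and the sample sentence is simply taken from the text6 tokens shown above. The exact stems may vary between NLTK versions, so no expected output is given.

```python
def stem_differences(tokens, stem_a, stem_b):
    """Return (token, stem_a(token), stem_b(token)) for each distinct token
    on which the two stemming functions disagree, in order of first appearance."""
    seen = set()
    rows = []
    for tok in tokens:
        if tok in seen:
            continue  # report each distinct token only once
        seen.add(tok)
        a, b = stem_a(tok), stem_b(tok)
        if a != b:
            rows.append((tok, a, b))
    return rows


try:
    import nltk

    porter = nltk.PorterStemmer()
    lancaster = nltk.LancasterStemmer()
    tokens = "I must speak with your lord and master".split()
    for tok, p, l in stem_differences(tokens, porter.stem, lancaster.stem):
        print(f"{tok:10} porter={p:10} lancaster={l}")
except ImportError:
    # The helper itself only needs the standard library.
    print("nltk is not installed; skipping the stemmer comparison")
```

To compare over the whole script rather than one sentence, pass `text6` as `tokens` after running `from nltk.book import *`.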