 ## Text normalization
 

In [12]:
import nltk
from nltk.corpus import stopwords

text = " But her work onstage did not even begin to capture the stamina required to be in the corps." \
       " Spending a week shadowing Ms. Kretzschmar was exhausting — she gave new meaning to the idea of" \
       " being on your feet all day. Twelve-hour days at the David H. Koch Theater," \
       " the company’s Lincoln Center home, were hardly unusual."
text

' But her work onstage did not even begin to capture the stamina required to be in the corps. Spending a week shadowing Ms. Kretzschmar was exhausting — she gave new meaning to the idea of being on your feet all day. Twelve-hour days at the David H. Koch Theater, the company’s Lincoln Center home, were hardly unusual.'

    Split the paragraph into sentences and examine the result.

In [13]:
sentences_naive = text.split('.')
sentences_naive

[' But her work onstage did not even begin to capture the stamina required to be in the corps',
 ' Spending a week shadowing Ms',
 ' Kretzschmar was exhausting — she gave new meaning to the idea of being on your feet all day',
 ' Twelve-hour days at the David H',
 ' Koch Theater, the company’s Lincoln Center home, were hardly unusual',
 '']

    Use the `nltk` sentence tokenizer to split the paragraph into sentences.

In [14]:
sentences = nltk.sent_tokenize(text)
sentences[0]

' But her work onstage did not even begin to capture the stamina required to be in the corps.'
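The naive split breaks at "Ms." and "David H." because it treats every period as a sentence boundary. As a rough illustration of what a sentence tokenizer has to take into account, here is a minimal sketch that skips periods following a known abbreviation. The `ABBREVIATIONS` set and the `split_sentences` helper are made up for this example; this is not how `nltk.sent_tokenize` is implemented.

```python
import re

# Hypothetical, hand-picked abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"Ms", "Mr", "Mrs", "Dr", "H"}

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace,
    unless the period ends a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        # Word immediately before the punctuation mark.
        preceding = text[start:match.start()].split()
        last_word = preceding[-1] if preceding else ""
        if text[match.start()] == "." and last_word in ABBREVIATIONS:
            continue  # abbreviation, not a sentence boundary
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

sample = ("Spending a week shadowing Ms. Kretzschmar was exhausting. "
          "Twelve-hour days at the David H. Koch Theater were hardly unusual.")
result = split_sentences(sample)
for s in result:
    print(s)
```

A fixed abbreviation list does not scale to real text, which is one reason to prefer a tokenizer from a library.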

    Split the first sentence into words by splitting on whitespace.

In [15]:
words_naive = sentences[0].split(' ')
words_naive

['',
 'But',
 'her',
 'work',
 'onstage',
 'did',
 'not',
 'even',
 'begin',
 'to',
 'capture',
 'the',
 'stamina',
 'required',
 'to',
 'be',
 'in',
 'the',
 'corps.']

    Split the first sentence into words using `nltk.word_tokenize`.

In [16]:
words = nltk.word_tokenize(sentences[0])
words

['But',
 'her',
 'work',
 'onstage',
 'did',
 'not',
 'even',
 'begin',
 'to',
 'capture',
 'the',
 'stamina',
 'required',
 'to',
 'be',
 'in',
 'the',
 'corps',
 '.']
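Compared with the whitespace split, the tokenizer drops the empty string and separates the final period from 'corps'. A regular expression can approximate this behaviour. The sketch below is only an illustration under that assumption: `TOKEN_PATTERN` and `simple_tokenize` are hypothetical names, and the pattern does not reproduce all rules of `nltk.word_tokenize` (for example, it keeps "company’s" as one token).

```python
import re

# A word (optionally with internal hyphens or apostrophes) or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")

def simple_tokenize(sentence):
    """Return word and punctuation tokens found in the sentence."""
    return TOKEN_PATTERN.findall(sentence)

sentence = (" But her work onstage did not even begin to capture"
            " the stamina required to be in the corps.")
tokens = simple_tokenize(sentence)
print(tokens)
print(simple_tokenize("Twelve-hour days at the company’s home."))
```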

   ## Conversion to lowercase characters
   
   Before analysis, text is usually converted to lowercase so that the same word
   is not treated as two different words merely because one occurrence is
   capitalised.
   
   Note however that this transformation is lossy and can remove relevant information
   from the text, e.g. "White House" usually refers to something quite different from a "white house".

In [29]:
words_lowcase = [w.lower() for w in words]
words_lowcase

['but',
 'her',
 'work',
 'onstage',
 'did',
 'not',
 'even',
 'begin',
 'to',
 'capture',
 'the',
 'stamina',
 'required',
 'to',
 'be',
 'in',
 'the',
 'corps',
 '.']
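The effect on word counts, and the information that is lost, can be seen on a small made-up token list. This is only a sketch; the `tokens` list is invented for the illustration.

```python
from collections import Counter

# Invented example tokens (not taken from the text above).
tokens = ["The", "White", "House", "is", "not", "the", "white", "house", "next", "door"]

counts_original = Counter(tokens)
counts_lower = Counter(t.lower() for t in tokens)

# "The" and "the" are counted separately before lowercasing and together afterwards.
print(counts_original["the"], counts_lower["the"])
# The distinction between "White" and "white" is gone after lowercasing.
print(counts_original["White"], counts_lower["white"])
```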

   ## Stop words removal
   
   Some words, such as "the" and "a", occur very frequently in almost any text but carry
   little meaning on their own (stop words). In some tasks, e.g. text classification, these words contribute little
   to discriminating between documents and can be removed to reduce the feature space.
   
   `nltk` provides lists of stop words for several languages. Here we use the list of English
   stop words to remove them from the text. The list is all lowercase and the comparison is
   case-sensitive, so the capitalised 'But' survives the filter when it is applied to `words`.

In [20]:
english_stopwords = set(stopwords.words('english'))

filtered_sentence = [w for w in words if w not in english_stopwords]
filtered_sentence

['But',
 'work',
 'onstage',
 'even',
 'begin',
 'capture',
 'stamina',
 'required',
 'corps',
 '.']
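'But' remains in the result because the tokens were not lowercased before the lookup. The sketch below contrasts a case-sensitive and a case-insensitive filter. To stay self-contained it uses a small hypothetical stop word set, `mini_stopwords`, in place of the `nltk` list.

```python
# Tokens of the first sentence, as produced by the tokenizer above.
words = ["But", "her", "work", "onstage", "did", "not", "even", "begin", "to",
         "capture", "the", "stamina", "required", "to", "be", "in", "the", "corps", "."]

# Hypothetical mini stop word list; the real nltk list is much longer.
mini_stopwords = {"but", "her", "did", "not", "to", "the", "be", "in"}

# Case-sensitive lookup: 'But' is not in the (lowercase) set and is kept.
case_sensitive = [w for w in words if w not in mini_stopwords]

# Lowercasing only for the lookup removes 'But' but keeps the original casing elsewhere.
case_insensitive = [w for w in words if w.lower() not in mini_stopwords]

print(case_sensitive)
print(case_insensitive)
```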

 ## Stemming
 
 The same word often occurs in different forms, e.g. fly, flying, etc.
 One way to reduce the different forms to a common base form (stem) is to strip suffixes (ly, ing, ment, etc.) from the words.
 
 Here we use the Snowball stemmer for English (an updated version of the Porter stemmer, also known as Porter2) from `nltk`. Note that it also lowercases its input.

In [26]:
stemmer = nltk.SnowballStemmer('english')

stemmed_words = [stemmer.stem(w) for w in words]
stemmed_words

['but',
 'her',
 'work',
 'onstag',
 'did',
 'not',
 'even',
 'begin',
 'to',
 'captur',
 'the',
 'stamina',
 'requir',
 'to',
 'be',
 'in',
 'the',
 'corp',
 '.']
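To make the idea of suffix stripping concrete, here is a toy stemmer that removes the first matching suffix from a short list. It is a deliberately simplified sketch: the `SUFFIXES` tuple and the `toy_stem` function are invented for this illustration, and the Porter and Snowball algorithms apply many more rules and conditions.

```python
# Hypothetical, tiny rule set (an assumption for this sketch).
SUFFIXES = ("ment", "ing", "ly", "ed", "s")

def toy_stem(word, min_stem=3):
    """Lowercase the word and strip the first matching suffix,
    provided at least `min_stem` characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["fly", "flying", "required", "corps", "stamina"]:
    print(w, "->", toy_stem(w))
```

The `min_stem` guard keeps short words such as "fly" from being cut down to "f".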

  ## Lemmatization
  
  While stemming is useful for reducing the different forms of a word to a single form,
  it has the disadvantage that it can produce strings that are not valid words, such as
  'captur' and 'requir' in the example above.
  
  Lemmatisation addresses this problem by mapping each word to its dictionary form (lemma).

    We will use the `WordNetLemmatizer` provided by `nltk` to lemmatise the words
    from the first sentence. Notice that 'onstage' and 'capture' are now left intact.
    By default the lemmatizer treats every word as a noun, so 'required' and 'did' stay
    unchanged, while 'stamina' and 'corps' are read as plurals and reduced to 'stamen' and 'corp'.

In [28]:
lemmatizer = nltk.WordNetLemmatizer()

lemmatized_words = [lemmatizer.lemmatize(w) for w in words]
lemmatized_words

['But',
 'her',
 'work',
 'onstage',
 'did',
 'not',
 'even',
 'begin',
 'to',
 'capture',
 'the',
 'stamen',
 'required',
 'to',
 'be',
 'in',
 'the',
 'corp',
 '.']
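Conceptually, lemmatisation is a dictionary lookup keyed on the word form and its part of speech. The toy sketch below shows why the assumed part of speech matters. The `LEXICON` entries and the `toy_lemmatize` function are hypothetical and stand in for a real resource such as WordNet.

```python
# Hypothetical mini-lexicon: (surface form, POS letter) -> lemma.
LEXICON = {
    ("required", "v"): "require",
    ("did", "v"): "do",
    ("aunts", "n"): "aunt",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos); unknown pairs are returned unchanged."""
    return LEXICON.get((word.lower(), pos), word)

# Looked up as a noun (the default), 'required' is not found and stays as it is.
print(toy_lemmatize("required"))
# Looked up as a verb, it is mapped to its lemma.
print(toy_lemmatize("required", "v"))
```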

 ## Part of speech tagging
 
 Part of speech (POS) tagging assigns each word in a sentence its grammatical category. Some of the Penn Treebank tags are:

    CD   Cardinal number
    DT   Determiner
    JJ   Adjective
    JJR  Adjective, comparative
    MD   Modal
    NN   Noun, singular or mass
    NNS  Noun, plural
    RB   Adverb
    RBR  Adverb, comparative
    RBS  Adverb, superlative
    VB   Verb, base form
    WP   Wh-pronoun
    WP$  Possessive wh-pronoun
    
   For the full list of Penn Treebank POS tags, see their [web site](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). 
   
   Here we use the default POS tagger from `nltk`.  

In [33]:
tagged_words = nltk.pos_tag(words)
tagged_words

[('But', 'CC'),
 ('her', 'PRP$'),
 ('work', 'NN'),
 ('onstage', 'NN'),
 ('did', 'VBD'),
 ('not', 'RB'),
 ('even', 'RB'),
 ('begin', 'VB'),
 ('to', 'TO'),
 ('capture', 'VB'),
 ('the', 'DT'),
 ('stamina', 'NN'),
 ('required', 'VBN'),
 ('to', 'TO'),
 ('be', 'VB'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('corps', 'NN'),
 ('.', '.')]
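POS tags can also inform lemmatisation, because `WordNetLemmatizer.lemmatize` accepts a `pos` argument with a single-letter WordNet code. The sketch below maps Penn Treebank tags to those letters. The helper `penn_to_wordnet` is a hypothetical name, and the coarse prefix-based mapping with a noun fallback is an assumption that may not suit every application.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, falling back to noun."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun and everything else

# A few (word, tag) pairs taken from the tagger output above.
tagged = [("did", "VBD"), ("even", "RB"), ("required", "VBN"), ("corps", "NN")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

The resulting letter could then be passed along with the word, e.g. `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.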

 ## Ambiguities with lemmatisation 
 
 Consider the sentence "Visiting aunts can be quite annoying". It has two readings:
 either the act of visiting aunts is annoying, or aunts who come to visit are annoying.
 Examine the POS tags generated by `nltk.pos_tag`:

In [34]:
nltk.pos_tag(nltk.word_tokenize("Visiting aunts can be quite annoying"))

[('Visiting', 'VBG'),
 ('aunts', 'NNS'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('quite', 'RB'),
 ('annoying', 'VBG')]