# Week 2: Preprocessing Text (Part 2)


In [2]:
#necessary library imports and setup introduced previously

import sys
#sys.path.append(r'T:\Departments\Informatics\LanguageEngineering') 
#sys.path.append(r'\\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources')
sys.path.append(r'\Users\J\Desktop\code\sussex\nlp\labs\resources')

import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from nltk.tokenize import word_tokenize

from sussex_nltk.corpus_readers import ReutersCorpusReader

Sussex NLTK root directory is \Users\J\Desktop\code\sussex\nlp\labs\resources


## Overview 
Remember, a raw text document is just a sequence of characters. A number of basic steps are often performed when processing natural language text, and the lab sessions this week cover some of these basic pre-processing methods. Last time, you looked at:
- <b>segmentation</b> - breaking down large units of text into smaller units such as documents and sentences; and
- <b>tokenisation</b> - roughly speaking, grouping characters into words.

This time, you will be looking at:
- <b>case normalisation</b> - this involves converting all of the text into lower case; 
- <b>stemming</b> - this involves removing a word's inflections to find the stem; and 
- <b>punctuation and stop-word removal</b> - stop-words are common function words that in some situations can be ignored.

Note that we do not always apply all of the above preprocessing methods; it depends on the application. One of the things that you will be learning about in this module is when each of these methods is, and is not, appropriate.

## Normalising text and removing unimportant tokens
In this next section we will consider several methods that pre-process (tokenised) text in ways that are sometimes helpful to 'downstream' processing.

### Number and case normalisation
Without any kind of normalisation, the tokens `"help"` and `"Help"` are two distinct types. In some contexts you may not want to distinguish them.

As another example, `"1998"` and `"1999"` count as distinct types, but there are situations where there is no need to distinguish between different numbers.

The following code performs case normalisation and replaces tokens that consist entirely of digits with `"NUM"`. 
- Python strings provide a [number of methods](http://docs.python.org/library/stdtypes.html#string-methods), which you can call to analyse a string's content or to produce new strings from it.
- The code uses [list comprehension](http://docs.python.org/tutorial/datastructures.html#list-comprehensions) to build a new list by looping through and filtering items.

In [3]:
tokens = ["The","cake","is","a","LIE"]      #a list of tokens, some of which contain uppercase letters
print([token.lower() for token in tokens])   #print newly created list of all lowercase tokens

numbers = ['in', 'the', 'year', '120', 'of', 'the', 'fourth', 'age', ',', 'after', '120', 'years', 'as', 'king', ',' , 'aragorn', 'died', 'at', 'the', 'age', 'of', '210']
print(["NUM" if token.isdigit() else token for token in numbers])  #replace all number tokens with "NUM" in a new list of tokens

['the', 'cake', 'is', 'a', 'lie']
['in', 'the', 'year', 'NUM', 'of', 'the', 'fourth', 'age', ',', 'after', 'NUM', 'years', 'as', 'king', ',', 'aragorn', 'died', 'at', 'the', 'age', 'of', 'NUM']
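
Note that `isdigit` only matches tokens made up entirely of digits, so tokens such as `"6.22"` or `"1,000"` are left unchanged. The following is a minimal sketch of one way to catch those as well; the regular expression and the function name `normalise_numbers` are assumptions made for illustration and are not part of the lab code.

```python
import re

# Assumed pattern: optional sign, digits with optional thousands separators,
# and an optional decimal part, e.g. "120", "6.22", "1,000", "-3.5".
NUMBER_RE = re.compile(r"[+-]?\d+(?:,\d{3})*(?:\.\d+)?")

def normalise_numbers(tokens):
    """Replace number-like tokens with "NUM" and lowercase everything else."""
    return ["NUM" if NUMBER_RE.fullmatch(token) else token.lower()
            for token in tokens]

sample = ["Bonds", "rose", "6.22", "%", "to", "1,000", "in", "1999", "07/01/1999"]
normalised = normalise_numbers(sample)
print(normalised)
```

Dates such as `"07/01/1999"` do not match the pattern and are kept as they are; whether they should also be collapsed depends on the application.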


### Exercise 1.1
- Write a function <code>normalise</code> which 
    * replaces numbers with NUM; 
    * and replaces tokens such as `"4th"`, `"1st"` and `"22nd"` with `"Nth"`.
- Test your code on the list `["Within","5","minutes",",","the", "1st", "and", "2nd", "placed", "runners", "lapped", "the", "5th","."]`. 
- Check that the token `"and"` isn't changed to `"Nth"`.
- You will find [this page](http://docs.python.org/library/stdtypes.html#string-methods) useful.


In [9]:
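def normalise(tokens):
    ordinal_suffixes = ("st", "nd", "rd", "th")   #"rd" is included so that tokens such as "3rd" are covered
    normalised_tokens = []
    for token in tokens:
        if token.isdigit():
            normalised_tokens.append("NUM")
        #an ordinal is a run of digits followed by a two-letter suffix, so "and" is left alone
        elif token.endswith(ordinal_suffixes) and token[:-2].isdigit():
            normalised_tokens.append("Nth")
        else:
            normalised_tokens.append(token)
    return normalised_tokens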
        

print(normalise(["Within","5","minutes",",","the", "1st", "and", "2nd", "placed", "runners", "lapped", "the", "5th","."]))

['Within', 'NUM', 'minutes', ',', 'the', 'Nth', 'and', 'Nth', 'placed', 'runners', 'lapped', 'the', 'Nth', '.']


### Exercise 1.2
- Complete the code in the cell below. You have just two lines to complete. The goal is to use a large sample of the Reuters corpus to establish the extent to which vocabulary size is reduced when number and case normalisation are applied.
- For each of the two incomplete lines you should use nested list comprehensions. These are described in Section 5.1.4 of [this document](http://docs.python.org/tutorial/datastructures.html#list-comprehensions). Alternatively, you could define functions which iterate over the sentences in each sample and the tokens within each sentence.
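
As a reminder of how a nested list comprehension relates to explicit loops, here is a small sketch on toy data (the two-sentence list is made up for illustration). The inner comprehension transforms the tokens of one sentence and the outer one repeats that for every sentence.

```python
sentences = [["The", "cake"], ["is", "a", "LIE"]]

# Version 1: explicit nested loops
lowered_loop = []
for sentence in sentences:
    lowered_sentence = []
    for token in sentence:
        lowered_sentence.append(token.lower())
    lowered_loop.append(lowered_sentence)

# Version 2: the same computation as a nested list comprehension
lowered_comp = [[token.lower() for token in sentence] for sentence in sentences]

print(lowered_comp)
```

Both versions keep the sentence boundaries, so the result is still a list of token lists.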


In [16]:
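def vocabulary_size(sentences):
    #count how often each distinct token (type) occurs across all sentences
    tok_counts = {}
    for sentence in sentences:
        for token in sentence:
            tok_counts[token] = tok_counts.get(token, 0) + 1
    return len(tok_counts)   #the number of distinct types is the vocabulary size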

rcr = ReutersCorpusReader()    

sample_size = 10000

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

############################################
lowered_sentences = [[token.lower() for token in tokens] for tokens in tokenised_sentences]
normalised_sentences = [["NUM" if token.isdigit() else token for token in tokens] for tokens in lowered_sentences]
############################################

raw_vocab_size = vocabulary_size(tokenised_sentences)
normalised_vocab_size = vocabulary_size(normalised_sentences)
print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - normalised_vocab_size)/raw_vocab_size,raw_vocab_size,normalised_vocab_size))


Normalisation produced a 13.21% reduction in vocabulary size from 19253 to 16709
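
The percentage reported above is the drop in the number of types relative to the raw vocabulary size. A minimal sketch of that arithmetic as a reusable helper follows; the name `percent_reduction` is a hypothetical one, not something defined in the lab resources.

```python
def percent_reduction(before, after):
    """Return the percentage by which `after` is smaller than `before`."""
    return 100 * (before - after) / before

# Vocabulary sizes taken from the output above
print("{0:.2f}%".format(percent_reduction(19253, 16709)))
```

Because each run draws a fresh random sample, the raw vocabulary size and the percentage will vary slightly from run to run.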


## Stemming
A considerable amount of the lexical variation found in documents results from the use of morphological variants which we might not wish to distinguish - e.g. when determining the topic of a document. An easy way to remove these varied forms is to use a stemmer. NLTK includes a number of stemmers in the `nltk.stem` package.
- [NLTK stem module API](http://nltk.org/api/nltk.stem.html)

- [NLTK Porter stemmer](http://nltk.org/api/nltk.stem.html?highlight=stemmer#nltk.stem.porter.PorterStemmer)

- Look at the code below, which shows how the NLTK implementation of the Porter stemmer in `nltk.stem.porter.PorterStemmer` stems a sample of sentences in the Reuters corpus.
- Have a close look at the differences between the columns. This will give you a good indication of what the stemmer does.

In [17]:
from nltk.stem.porter import PorterStemmer

rcr = ReutersCorpusReader() 
st = PorterStemmer()

sample_size = 10

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

for sentence in tokenised_sentences:
    df = pd.DataFrame(list(zip_longest(sentence,[st.stem(token) for token in sentence])),columns=["BEFORE","AFTER"])
    print(df)

   BEFORE   AFTER
0       (       (
1  Approx  approx
2       .       .
       BEFORE       AFTER
0  07/01/1999  07/01/1999
1        860M        860m
2        6.22        6.22
3           %           %
       BEFORE   AFTER
0        over    over
1      German  german
2  government  govern
3       bonds    bond
4           .       .
      BEFORE      AFTER
0         --         --
1    Beijing       beij
2   Newsroom   newsroom
3          (          (
4       8610       8610
5          )          )
6  6532-1921  6532-1921
   BEFORE  AFTER
0    1230   1230
1     Wed    wed
2     CAN    can
3     NEW    new
4   HOUSE   hous
5   PRICE  price
6    YAPR   yapr
7     N/F    n/f
8     PCT    pct
9     N/A    n/a
10    0.2    0.2
         BEFORE    AFTER
0            If       If
1           the      the
2     countries  countri
3            of       of
4         South    south
5       America  america
6          want     want
7   disarmament  disarma
8             ,        ,
9          then     
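
To see why stemming shrinks the vocabulary, it can help to group tokens by their stem. The sketch below uses a deliberately crude suffix stripper, `toy_stem`, with a made-up suffix list; it is not the Porter algorithm, only an illustration of how several surface forms collapse onto one type.

```python
from collections import defaultdict

# Hypothetical, deliberately crude suffix list; NOT the Porter algorithm.
SUFFIXES = ["ments", "ment", "ies", "ing", "ed", "s"]

def toy_stem(token):
    """Lowercase the token and strip the first matching suffix, if any."""
    word = token.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so that short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["bond", "bonds", "Bonds", "govern", "governed", "government", "governments"]

# Map each stem to the list of original tokens that reduce to it.
groups = defaultdict(list)
for token in tokens:
    groups[toy_stem(token)].append(token)

print(dict(groups))
print(len(set(tokens)), "types before,", len(groups), "types after")
```

A real stemmer applies many more rules and conditions, which is also why it sometimes produces stems that are not words, such as `countri` in the output above.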

### Exercise 2.1
- By looking at the impact on a large sample of the Reuters corpus, establish the extent to which vocabulary size is reduced by stemming.
- Write code to do this in the empty cell below. You should be able to re-use a lot of the code you used when measuring the impact of lower case and number normalisation.

In [18]:
sample_size = 10000

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

############################################
stemmed_sentences = [[st.stem(token) for token in tokens] for tokens in tokenised_sentences]
############################################

raw_vocab_size = vocabulary_size(tokenised_sentences)
stemmed_vocab_size = vocabulary_size(stemmed_sentences)
print("Stemming produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - stemmed_vocab_size)/raw_vocab_size,raw_vocab_size,stemmed_vocab_size))


Stemming produced a 26.53% reduction in vocabulary size from 19234 to 14132


### Exercise 2.2
* Try using the WordNetLemmatizer <code>nltk.stem.wordnet.WordNetLemmatizer</code> instead of the Porter Stemmer.
* Using a large sample of the Reuters corpus, establish the extent to which the vocabulary size is reduced by lemmatization.
* As an extension, you could look at different sample sizes and/or different corpora and display the results in a table or graph (using <code>pandas</code> and <code>matplotlib</code>)

In [56]:
from nltk.stem.wordnet import WordNetLemmatizer

lem = WordNetLemmatizer()

def getLemReduction(sampleSize):
    raw_sentences = rcr.sample_raw_sents(sampleSize)
    tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

    ############################################
    lemmatized_sentences = [[lem.lemmatize(token) for token in tokens] for tokens in tokenised_sentences]
    ############################################

    raw_vocab_size = vocabulary_size(tokenised_sentences)
    lemmatized_vocab_size = vocabulary_size(lemmatized_sentences)
    
    reduction = '{0:.2f}%'.format(100*(raw_vocab_size - lemmatized_vocab_size)/raw_vocab_size)
    
    return (sampleSize, reduction, raw_vocab_size, lemmatized_vocab_size)

#one row of results per sample size: 10, 100 and 1000 sentences
rows = []
for i in range(1, 4):
    rows.append(getLemReduction(10**i))

pd.DataFrame(rows, columns=["Sample size", "Reduction in vocab", "Raw vocab size", "Lemmatized vocab size"])

   Sample size  Reduction in vocab  Raw vocab size  Lemmatized vocab size
0           10               0.00%              99                     99
1          100               1.93%             828                    812
2         1000               4.94%            4836                   4597


## Punctuation and stop-word removal
A stop-word is a word that occurs so often that it loses its usefulness in some tasks. We may get more meaningful information from our corpus analysis if we remove stop-words and punctuation.

The code below takes a list of tokens and creates a new list, which contains only those strings which are alphabetic and non-stop-words.

In [27]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
tokens="The cat , which is really fat , sat on the mat".lower().split()
filtered_tokens = [w for w in tokens if w.isalpha() and w not in stop]
print(tokens)
print(filtered_tokens)

['the', 'cat', ',', 'which', 'is', 'really', 'fat', ',', 'sat', 'on', 'the', 'mat']
['cat', 'really', 'fat', 'sat', 'mat']


**Note**: `isalpha` only returns `True` if the string is composed entirely of alphabetic characters. If you want a function that returns `True` even when a word contains digits, then you should use `isalnum`.
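
The difference between the two methods is easiest to see on a few example tokens. The token list below is made up for illustration.

```python
tokens = ["cat", "mp3", "1999", ",", "n/a", "6.22"]

# Show what each method returns for every token
for token in tokens:
    print(token, token.isalpha(), token.isalnum())

# Filtering with each method keeps a different subset of tokens
alphabetic = [t for t in tokens if t.isalpha()]
alphanumeric = [t for t in tokens if t.isalnum()]
print(alphabetic)
print(alphanumeric)
```

Neither method accepts tokens containing punctuation, so `"n/a"` and `"6.22"` are dropped by both filters.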

### Exercise 3.1
- In the empty cell below, write code that looks at a large sample of the Medline corpus, establishing what proportion of tokens are stop-words.
- As an extension, you could establish the mean number of stop-words per sentence (and/or its distribution), or compare the numbers of stop-words in different corpora.

In [32]:
from sussex_nltk.corpus_readers import MedlineCorpusReader

mcr = MedlineCorpusReader()   

sampleSize = 1000

raw_sentences = mcr.sample_raw_sents(sampleSize)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

stopWordCount = 0
tokenCount = 0
for tokens in tokenised_sentences:
    for token in tokens:
        if token in stop: stopWordCount += 1
        tokenCount += 1

#the proportion is reported as a fraction of all tokens (between 0 and 1), not as a percentage
print('Proportion of tokens that are stop-words: {0:.2f}'.format(stopWordCount / tokenCount))
print('Mean number of stop words per sentence: {0:.2f}'.format(stopWordCount / len(tokenised_sentences)))


Proportion of tokens that are stop-words: 0.32
Mean number of stop words per sentence: 7.73
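
For the extension, the distribution of stop-words per sentence can be built with a `Counter`. The sketch below is self-contained and uses a tiny made-up stop list, `toy_stop`, and three invented sentences rather than the NLTK list and the Medline sample. It also lowercases each token before the lookup, on the assumption that the stop list is all lowercase, so that `"The"` is counted as well as `"the"`.

```python
from collections import Counter

# Tiny hypothetical stop list for illustration only;
# the lab itself uses stopwords.words('english').
toy_stop = {"the", "is", "on", "which", "a", "of"}

sentences = [
    ["The", "cat", "sat", "on", "the", "mat", "."],
    ["Dogs", "bark", "."],
    ["The", "size", "of", "the", "sample", "is", "small", "."],
]

# Number of stop-words in each sentence (case-insensitive lookup)
per_sentence = [sum(1 for token in s if token.lower() in toy_stop) for s in sentences]

total_tokens = sum(len(s) for s in sentences)
proportion = sum(per_sentence) / total_tokens
mean_per_sentence = sum(per_sentence) / len(sentences)

# Maps "number of stop-words in a sentence" to "how many sentences have that number"
distribution = Counter(per_sentence)

print("Proportion of tokens that are stop-words: {0:.2f}".format(proportion))
print("Mean number of stop-words per sentence: {0:.2f}".format(mean_per_sentence))
print(sorted(distribution.items()))
```

Using a set for the stop list makes each membership test fast, which matters once the sample contains many thousands of tokens.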
