# Introduction to NLTK

This notebook demonstrates how to use NLTK for various NLP tasks. NLTK is an older package of tools that lets you work with low-level modules for text processing.

## Background reading

Before you check out this notebook, read through:

https://www.nltk.org/book/ch01.html

* section 1.2

https://www.nltk.org/book/ch03.html

* section 3.4 Regular Expressions for Detecting Word Patterns
* section 3.5 Useful Applications of Regular Expressions
* section 3.6 Normalizing Text
* section 3.7 Regular Expressions for Tokenizing Text
* section 3.8 Segmentation

https://www.nltk.org/book/ch05.html

* section 5.1 Using a Tagger
* section 5.4 Automatic Tagging


Tip: The best way to approach these assignments is to first read the section of the NLTK book that is mentioned, then start experimenting. The book gives you the hints you need to deal with an error message such as "NameError: name 'FreqDist' is not defined".

It's like cooking from a recipe for the first time: you probably want to read the full recipe before you start, so that you don't discover at step 5 that you should have soaked your beans 12 hours in advance.


## NLTK Chapter 1, section 1.2 

In [14]:
import nltk  

The first time you use nltk on your local machine, you need to download the data sets that are used in the course. Running the next cell opens a pop-up window in which you select data sets. The minimal data set you need is "book". Take your time to look through the different tabs to get an idea of what is available. Make sure you have sufficient disk space to store what you select.

If you have already downloaded the data once, you can skip the next step, because the data is stored on your local drive. If you need another data set or package, run the cell again and take your pick.

In [2]:
nltk.download() 

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True
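If the pop-up window is inconvenient, a package can also be fetched by passing its identifier to nltk.download. The sketch below is one possible way to download only what is missing; the helper name ensure_nltk_data is made up for this example, and the resource path and package name are assumptions you should adapt to the data you need.

```python
import nltk


def ensure_nltk_data(resource_path, package):
    """Download `package` only when `resource_path` is not installed yet.

    Returns True if a download was attempted, False if the data was found.
    (Hypothetical helper, not part of NLTK.)
    """
    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find(resource_path)
        return False
    except LookupError:
        # Passing an identifier downloads that package without the pop-up.
        nltk.download(package)
        return True
```

For example, ensure_nltk_data("corpora/stopwords", "stopwords") would fetch the stop word lists on first use, and nltk.download("book") fetches the collection used in this course.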

## Tokenisation with NLTK 

NLTK Chapter 5, section 1 Using a Tagger

In [8]:
example_text = "I'll refuse to permit you to obtain the refuse permit."
text = example_text.split()
text

["I'll",
 'refuse',
 'to',
 'permit',
 'you',
 'to',
 'obtain',
 'the',
 'refuse',
 'permit.']

Think about the output above: is splitting on whitespace actually the same as tokenising?
Let's try a real tokenizer.

In [9]:
from nltk import word_tokenize
text = word_tokenize("I'll refuse to permit you to obtain the refuse permit.")
text

['I',
 "'ll",
 'refuse',
 'to',
 'permit',
 'you',
 'to',
 'obtain',
 'the',
 'refuse',
 'permit',
 '.']
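To see where whitespace splitting falls short, the two approaches can be compared with the standard library alone. The regular expression below is a deliberately simple illustration and not the rule set that word_tokenize uses: it separates punctuation from words, but it also cuts the apostrophe off as its own token.

```python
import re

sentence = "I'll refuse to permit you to obtain the refuse permit."

# str.split() only cuts on whitespace, so punctuation stays glued to words.
whitespace_tokens = sentence.split()

# Simple pattern: a run of word characters, or one non-space symbol.
# This is an approximation for illustration, NOT what word_tokenize does.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens[-1])
print(regex_tokens[:3], regex_tokens[-2:])
```

The last whitespace token still carries the full stop, while the regular expression separates it. Unlike word_tokenize, the regular expression turns the contraction into three tokens instead of two.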

You can also call the tokenizer directly through the nltk package:

In [10]:
text = nltk.word_tokenize("I'll refuse to permit you to obtain the refuse permit.")
text

['I',
 "'ll",
 'refuse',
 'to',
 'permit',
 'you',
 'to',
 'obtain',
 'the',
 'refuse',
 'permit',
 '.']

In [11]:
nltk.pos_tag(text)

[('I', 'PRP'),
 ("'ll", 'MD'),
 ('refuse', 'VB'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('you', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN'),
 ('.', '.')]

Task: Make sure you know what each tag means. Would you have tagged each word using the same tag or do you disagree with the tagger? 

Observe some word contexts to get an idea of why NLTK tags the way it tags (Chapter 1, section 1.3) 
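As a starting point for the task, here is a small lookup table with glosses for the Penn Treebank tags that occur in the output above. The table is written by hand for this example and covers only these seven tags; NLTK can print the full tagset documentation with nltk.help.upenn_tagset(), provided the tagset documentation has been downloaded.

```python
# Hand-written glosses for the Penn Treebank tags in the output above.
PENN_GLOSSES = {
    "PRP": "personal pronoun",
    "MD": "modal verb",
    "VB": "verb, base form",
    "TO": "the word 'to'",
    "DT": "determiner",
    "NN": "noun, singular or mass",
    ".": "sentence-final punctuation",
}

# The tagger output from the cell above, copied by hand.
tagged = [("I", "PRP"), ("'ll", "MD"), ("refuse", "VB"), ("to", "TO"),
          ("permit", "VB"), ("you", "PRP"), ("to", "TO"), ("obtain", "VB"),
          ("the", "DT"), ("refuse", "NN"), ("permit", "NN"), (".", ".")]

for word, tag in tagged:
    print(f"{word:8} {tag:4} {PENN_GLOSSES.get(tag, 'unknown tag')}")

# The same word form receives two different tags in this sentence.
refuse_tags = [tag for word, tag in tagged if word == "refuse"]
print(refuse_tags)
```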


## Some more on cleaning up text and getting the basic forms for words

In [35]:
raw_docs = ["Here are some very simple basic sentences.",
"They won't be very interesting, I'm afraid.",
"The point of these examples is to _learn how basic text cleaning works_ on *very simple* data."]

In [36]:
# Tokenizing each document into a list of tokens
from nltk.tokenize import word_tokenize
tokenized_docs = [word_tokenize(doc) for doc in raw_docs]
print(tokenized_docs)

[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences', '.'], ['They', 'wo', "n't", 'be', 'very', 'interesting', ',', 'I', "'m", 'afraid', '.'], ['The', 'point', 'of', 'these', 'examples', 'is', 'to', '_learn', 'how', 'basic', 'text', 'cleaning', 'works_', 'on', '*very', 'simple*', 'data', '.']]


In [37]:
# Removing punctuation
import re
import string

# Character class that matches any single ASCII punctuation character
# (see https://docs.python.org/3/library/string.html)
regex = re.compile('[%s]' % re.escape(string.punctuation))

tokenized_docs_no_punctuation = []

for doc in tokenized_docs:
    new_doc = []
    for token in doc:
        # Strip all punctuation characters from the token
        new_token = regex.sub('', token)
        # Keep the token only if something is left
        if new_token:
            new_doc.append(new_token)

    tokenized_docs_no_punctuation.append(new_doc)

print(tokenized_docs_no_punctuation)

[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences'], ['They', 'wo', 'nt', 'be', 'very', 'interesting', 'I', 'm', 'afraid'], ['The', 'point', 'of', 'these', 'examples', 'is', 'to', 'learn', 'how', 'basic', 'text', 'cleaning', 'works', 'on', 'very', 'simple', 'data']]
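The same clean-up can be written without a regular expression. The sketch below uses str.translate with a deletion table; the input is a hand-copied part of the tokenized documents above. Note that string.punctuation covers ASCII characters only, so typographic quotes such as “ would survive either approach.

```python
import string

# Part of the tokenized documents from above, copied by hand.
tokenized_docs = [
    ["They", "wo", "n't", "be", "very", "interesting", ",", "I", "'m", "afraid", "."],
    ["to", "_learn", "how", "works_", "on", "*very", "simple*", "data", "."],
]

# Translation table that deletes every character in string.punctuation.
strip_punctuation = str.maketrans("", "", string.punctuation)

cleaned_docs = [
    # Drop tokens that become empty once punctuation is removed.
    [t for t in (token.translate(strip_punctuation) for token in doc) if t]
    for doc in tokenized_docs
]
print(cleaned_docs)
```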


In [38]:
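# Cleaning text of stopwords
from nltk.corpus import stopwords

# Build the stop list once; a set makes the membership test fast
english_stopwords = set(stopwords.words('english'))

tokenized_docs_no_stopwords = []

for doc in tokenized_docs_no_punctuation:
    new_term_vector = []
    for word in doc:
        # The comparison is case-sensitive and the stop list is lower-case
        if word not in english_stopwords:
            new_term_vector.append(word)

    tokenized_docs_no_stopwords.append(new_term_vector)

print(tokenized_docs_no_stopwords)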

[['Here', 'simple', 'basic', 'sentences'], ['They', 'wo', 'nt', 'interesting', 'I', 'afraid'], ['The', 'point', 'examples', 'learn', 'basic', 'text', 'cleaning', 'works', 'simple', 'data']]
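In the output above, 'Here', 'They', 'The' and 'I' are still present, because the stop list is lower-case and the comparison is case-sensitive. A minimal sketch of a case-insensitive filter follows; the small STOP_WORDS set is a stand-in written for this example, not the full NLTK list.

```python
# Hypothetical mini stop list for illustration; in the notebook the full
# list comes from stopwords.words('english'), which is all lower-case.
STOP_WORDS = {"here", "are", "some", "very", "they", "be", "i", "the", "of",
              "these", "is", "to", "how", "on"}

doc = ["Here", "are", "some", "very", "simple", "basic", "sentences"]

# Case-sensitive test, as in the cell above: "Here" survives.
case_sensitive = [w for w in doc if w not in STOP_WORDS]

# Lower-case only for the comparison; the kept tokens are unchanged.
case_insensitive = [w for w in doc if w.lower() not in STOP_WORDS]

print(case_sensitive)
print(case_insensitive)
```

Whether lower-casing is desirable depends on the task: it removes sentence-initial stop words, but it would also treat an acronym such as 'IT' as the pronoun 'it'.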


### Question:
What are stopwords and why would you want to remove these from a text?
What would be another way to define stop words than making a list?

## Stemming and lemmatizing
NLTK has various modules for stripping inflection from words (stemming) or finding the lemma (the form you can find in a dictionary). Below is a script that stems and lemmatizes the words in an example text after tokenizing it.

In [33]:
raw="SHUT UP! Enough already, Ballstein! Who cares about Derek Zoolander anyway? The man has only one look, for Christ's sake! Blue Steel? Ferrari? Le Tigra?"

In [47]:
# Stemming and Lemmatizing
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
wordnet = WordNetLemmatizer()
tokens = nltk.word_tokenize(raw)

porterlemmas = []
wordnetlemmas = []
snowballlemmas = []

for word in tokens:
    porterlemmas.append(porter.stem(word))
    snowballlemmas.append(snowball.stem(word))
    wordnetlemmas.append(wordnet.lemmatize(word))

print('Porter')
print(porterlemmas)
print('Snowball')
print(snowballlemmas)
print('Wordnet')
print(wordnetlemmas)

Porter
['shut', 'UP', '!', 'enough', 'alreadi', ',', 'ballstein', '!', 'who', 'care', 'about', 'derek', 'zooland', 'anyway', '?', 'the', 'man', 'ha', 'onli', 'one', 'look', ',', 'for', 'christ', "'s", 'sake', '!', 'blue', 'steel', '?', 'ferrari', '?', 'Le', 'tigra', '?']
Snowball
['shut', 'up', '!', 'enough', 'alreadi', ',', 'ballstein', '!', 'who', 'care', 'about', 'derek', 'zooland', 'anyway', '?', 'the', 'man', 'has', 'onli', 'one', 'look', ',', 'for', 'christ', "'s", 'sake', '!', 'blue', 'steel', '?', 'ferrari', '?', 'le', 'tigra', '?']
Wordnet
['SHUT', 'UP', '!', 'Enough', 'already', ',', 'Ballstein', '!', 'Who', 'care', 'about', 'Derek', 'Zoolander', 'anyway', '?', 'The', 'man', 'ha', 'only', 'one', 'look', ',', 'for', 'Christ', "'s", 'sake', '!', 'Blue', 'Steel', '?', 'Ferrari', '?', 'Le', 'Tigra', '?']
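One reason the WordNet output contains 'ha' for 'has' is that lemmatize treats every word as a noun unless it is given a part of speech through its pos argument. The helper below is a possible way to derive that argument from Penn Treebank tags; the function name penn_to_wordnet is made up for this example, and the mapping is a simplification.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the one-letter POS code WordNet uses.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


for word, tag in [("has", "VBZ"), ("cares", "VBZ"), ("only", "RB"), ("look", "NN")]:
    print(word, tag, penn_to_wordnet(tag))
```

The result can be passed on as wordnet.lemmatize(word, pos=penn_to_wordnet(tag)). With the verb code, 'has' would be expected to come out as 'have' rather than 'ha'.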


### Question:
What do you notice as a difference? What will be better for finding words in a dictionary? What output is better for looking up entities in Wikipedia?

## TIP
How to get a list of input files from a directory and read the complete content using NLTK PlaintextCorpusReader

In [15]:
## Loading your own text 

## NLTK Chapter 2, section 1.9

from nltk.corpus import PlaintextCorpusReader

corpus_root = '/Users/piek/nltk_data/corpora/subjectivity'

wordlists = PlaintextCorpusReader(corpus_root, '.*') 

print(f"There are {len(wordlists.fileids())} text files in this folder")
wordlists.fileids()


There are 3 text files in this folder


['README.txt', 'plot.tok.gt9.5000', 'quote.tok.gt9.5000']

In [16]:
# You can access all the words of each file as a list
wordlists.words('README.txt')

['Subjectivity', 'Dataset', 'v1', '.', '0', ...]
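To try the reader on your own files without touching nltk_data, the sketch below builds a throwaway folder with two small text files and reads them back. The file names and contents are invented for this example. The raw method returns the complete content of a file as one string.

```python
import tempfile
from pathlib import Path

from nltk.corpus import PlaintextCorpusReader

with tempfile.TemporaryDirectory() as corpus_root:
    # Two throwaway files standing in for your own documents.
    Path(corpus_root, "a.txt").write_text("Hello world.", encoding="utf-8")
    Path(corpus_root, "b.txt").write_text("A second file.", encoding="utf-8")

    # The second argument is a regular expression over file names.
    reader = PlaintextCorpusReader(corpus_root, r".*\.txt")

    file_ids = reader.fileids()
    # raw() returns the full text of one file as a single string.
    contents = {fid: reader.raw(fid) for fid in file_ids}
    # words() tokenizes with a simple regular-expression tokenizer by default.
    words_a = list(reader.words("a.txt"))

print(file_ids)
print(contents["b.txt"])
print(words_a)
```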

## A running example

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from pprint import pprint # for pretty print output

We are going to use a text on the Apple-Samsung patent cases. Download the examples file from Canvas and store it on your local drive. Adapt the path in the examples below to your own location. Note that Windows uses backslashes in directory paths, where Linux and Mac use forward slashes.
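Path separators do not have to be typed by hand: the standard pathlib module joins path components with the / operator and inserts the separator of the target system. The folder names below are hypothetical. The pure path classes are used here only so that both flavours can be shown on any machine.

```python
from pathlib import PurePosixPath, PureWindowsPath

file_name = "Lab1-apple-samsung-example.txt"

# Hypothetical folders, for illustration only.
posix_path = PurePosixPath("/Users/me/notebooks") / file_name
windows_path = PureWindowsPath("C:/Users/me/notebooks") / file_name

print(posix_path)    # joined with forward slashes
print(windows_path)  # joined with backslashes
```

In your own code, pathlib.Path picks the right flavour for the machine it runs on, and open accepts such a path object directly.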

In [2]:
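# Read the full content of a file. Make sure the path is correct for your machine.
# The with-statement closes the file again; encoding='utf-8' is an assumption about this file.
with open('/Users/piek/Desktop/TextMiningFEW-2019/text-mining-ba-git/notebooks/Lab1-apple-samsung-example.txt', 'r', encoding='utf-8') as f:
    example_text = f.read()
print(example_text)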

https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html

Documents filed to the San Jose federal court in California on November 23 list six Samsung products running the "Jelly Bean" and "Ice Cream Sandwich" operating systems, which Apple claims infringe its patents.
The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, Galaxy Rugby Pro and Galaxy S III mini.
Apple stated it had “acted quickly and diligently" in order to "determine that these newly released products do infringe many of the same claims already asserted by Apple."
In August, Samsung lost a US patent case to Apple and was ordered to pay its rival $1.05bn (£0.66bn) in damages for copying features of the iPad and iPhone in its Galaxy range of devices. Samsung, which is the world's top mobile phone maker, is appealing the ruling.
A similar case in the UK found in Samsung's fav

We split this document into a list of sentences:

In [3]:
sents = nltk.sent_tokenize(example_text)
len(sents)

6

In [4]:
# To get the last sentence
pprint(sents[-1])

("A similar case in the UK found in Samsung's favour and ordered Apple to "
 'publish an apology making clear that the South Korean firm had not copied '
 'its iPad when designing its own devices.')


## Tagging a text

https://www.nltk.org/book/ch05.html

In [5]:
def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

In [6]:
lastsentence = preprocess(sents[-1])
lastsentence

[('A', 'DT'),
 ('similar', 'JJ'),
 ('case', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('UK', 'NNP'),
 ('found', 'VBD'),
 ('in', 'IN'),
 ('Samsung', 'NNP'),
 ("'s", 'POS'),
 ('favour', 'NN'),
 ('and', 'CC'),
 ('ordered', 'VBD'),
 ('Apple', 'NNP'),
 ('to', 'TO'),
 ('publish', 'VB'),
 ('an', 'DT'),
 ('apology', 'NN'),
 ('making', 'VBG'),
 ('clear', 'JJ'),
 ('that', 'IN'),
 ('the', 'DT'),
 ('South', 'JJ'),
 ('Korean', 'JJ'),
 ('firm', 'NN'),
 ('had', 'VBD'),
 ('not', 'RB'),
 ('copied', 'VBN'),
 ('its', 'PRP$'),
 ('iPad', 'NN'),
 ('when', 'WRB'),
 ('designing', 'VBG'),
 ('its', 'PRP$'),
 ('own', 'JJ'),
 ('devices', 'NNS'),
 ('.', '.')]

In [7]:
patternNP = 'NP: {<DT>?<JJ>*<NNP>}'

In [10]:
constituent_parser = nltk.RegexpParser(patternNP)
constituent_structure = constituent_parser.parse(lastsentence)
print(constituent_structure)

(S
  A/DT
  similar/JJ
  case/NN
  in/IN
  (NP the/DT UK/NNP)
  found/VBD
  in/IN
  (NP Samsung/NNP)
  's/POS
  favour/NN
  and/CC
  ordered/VBD
  (NP Apple/NNP)
  to/TO
  publish/VB
  an/DT
  apology/NN
  making/VBG
  clear/JJ
  that/IN
  the/DT
  South/JJ
  Korean/JJ
  firm/NN
  had/VBD
  not/RB
  copied/VBN
  its/PRP$
  iPad/NN
  when/WRB
  designing/VBG
  its/PRP$
  own/JJ
  devices/NNS
  ./.)
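The pattern reads as: an optional determiner, any number of adjectives, then exactly one proper noun. That is why 'the UK' and 'Samsung' are chunked while 'favour' (tagged NN) is not. The sketch below applies the same pattern to a hand-tagged fragment, so that no tagger model is needed, and collects the text of each NP chunk.

```python
import nltk

# Hand-tagged fragment of the sentence above.
tagged = [("the", "DT"), ("UK", "NNP"), ("found", "VBD"), ("in", "IN"),
          ("Samsung", "NNP"), ("'s", "POS"), ("favour", "NN")]

# Optional determiner, any number of adjectives, then one proper noun.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NNP>}")
tree = chunker.parse(tagged)

# Collect the words of every subtree labelled NP.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)
```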


In [11]:
# Generate a drawing of the tree (opens a pop-up window)
constituent_structure.draw()

Note that you need to close the pop-up window that shows the drawing. The notebook will hang until you have closed it.
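If you prefer to stay inside the notebook, a tree can also be rendered with text characters through its pretty_print method. The small tree below is written by hand in bracket notation for illustration.

```python
from nltk import Tree

# A small hand-written tree in bracket notation.
tree = Tree.fromstring("(S (NP the UK) (VP found))")

# Draws the tree with text characters in the cell output; no window opens.
tree.pretty_print()
```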

You can create a constituent parser yourself using several rules, e.g.

In [13]:
parser = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
''')

In [14]:
constituent_structure = parser.parse(lastsentence)
print(constituent_structure)

(S
  (NP A/DT similar/JJ case/NN)
  (PP (P in/IN) (NP the/DT))
  UK/NNP
  (VP (V found/VBD))
  (P in/IN)
  Samsung/NNP
  's/POS
  (NP favour/NN)
  and/CC
  (VP (V ordered/VBD))
  Apple/NNP
  to/TO
  (VP (V publish/VB) (NP an/DT apology/NN))
  (VP
    (V making/VBG)
    (NP clear/JJ)
    (PP (P that/IN) (NP the/DT South/JJ Korean/JJ firm/NN)))
  (VP (V had/VBD))
  not/RB
  (VP (V copied/VBN))
  its/PRP$
  (NP iPad/NN)
  when/WRB
  (VP (V designing/VBG))
  its/PRP$
  (NP own/JJ)
  devices/NNS
  ./.)
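The rules are applied as stages from top to bottom, so a later rule such as PP can refer to the P and NP chunks built before it. In the output above, 'UK', 'Samsung' and 'devices' stay outside the NP chunks, because <NN> matches only the tag NN. A possible variation, shown on a hand-tagged fragment, is to require at least one noun of any kind with <NN.*>+; this is a sketch and not the grammar from the NLTK book.

```python
import nltk

# Hand-tagged start of the sentence above.
tagged = [("A", "DT"), ("similar", "JJ"), ("case", "NN"),
          ("in", "IN"), ("the", "DT"), ("UK", "NNP")]

# Stages run top to bottom; PP builds on the P and NP chunks made earlier.
grammar = r"""
NP: {<DT>?<JJ>*<NN.*>+}   # one or more nouns of any kind
P:  {<IN>}                # preposition
PP: {<P><NP>}             # PP -> P NP
"""
tree = nltk.RegexpParser(grammar).parse(tagged)
print(tree)

# Labels of the chunks directly under the root.
top_level_labels = [child.label() for child in tree]
print(top_level_labels)
```

With this variation, 'in the UK' becomes a single prepositional phrase and no NP consists of a determiner or an adjective alone.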


In [15]:
# Generate a drawing
constituent_structure.draw()

Don't forget to close the drawing. You can save it to a PostScript file before closing the window. The PostScript file can be opened as a PDF (and converted to other formats).

## Named entities in NLTK

You can use the ne_chunk function from NLTK to assign entity types to the tokenized and part-of-speech-tagged sentence.

In [17]:
from nltk.chunk import ne_chunk

ne_tree = ne_chunk(lastsentence)
print(ne_tree)

(S
  A/DT
  similar/JJ
  case/NN
  in/IN
  the/DT
  (ORGANIZATION UK/NNP)
  found/VBD
  in/IN
  (GPE Samsung/NNP)
  's/POS
  favour/NN
  and/CC
  ordered/VBD
  (PERSON Apple/NNP)
  to/TO
  publish/VB
  an/DT
  apology/NN
  making/VBG
  clear/JJ
  that/IN
  the/DT
  (LOCATION South/JJ Korean/JJ)
  firm/NN
  had/VBD
  not/RB
  copied/VBN
  its/PRP$
  iPad/NN
  when/WRB
  designing/VBG
  its/PRP$
  own/JJ
  devices/NNS
  ./.)
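In the resulting tree, every entity is a subtree whose label is the entity type, while ordinary tokens remain plain (word, tag) pairs. The sketch below rebuilds a fragment of the tree above by hand, so that no model is needed, and lists the entities with their types. The labels are copied from the output as they are, including the ones you may disagree with.

```python
from nltk import Tree

# Fragment of the tree printed above, rebuilt by hand.
ne_tree = Tree("S", [
    ("in", "IN"), ("the", "DT"),
    Tree("ORGANIZATION", [("UK", "NNP")]),
    ("that", "IN"), ("the", "DT"),
    Tree("LOCATION", [("South", "JJ"), ("Korean", "JJ")]),
    ("firm", "NN"),
])

# Entity chunks are subtrees; ordinary tokens are plain (word, tag) tuples.
entities = [
    (child.label(), " ".join(word for word, tag in child.leaves()))
    for child in ne_tree
    if isinstance(child, Tree)
]
print(entities)
```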


## Converting tagged tokens to CoNLL format
The next cell shows how tokenized text with simple labels (PoS tags and entity types) can be converted to the CoNLL format. This is a common format in NLP, for which a lot of training and evaluation data is available in many languages.

In [18]:
from nltk.chunk import conlltags2tree, tree2conlltags
iob_tagged = tree2conlltags(ne_tree)
pprint(iob_tagged)

[('A', 'DT', 'O'),
 ('similar', 'JJ', 'O'),
 ('case', 'NN', 'O'),
 ('in', 'IN', 'O'),
 ('the', 'DT', 'O'),
 ('UK', 'NNP', 'B-ORGANIZATION'),
 ('found', 'VBD', 'O'),
 ('in', 'IN', 'O'),
 ('Samsung', 'NNP', 'B-GPE'),
 ("'s", 'POS', 'O'),
 ('favour', 'NN', 'O'),
 ('and', 'CC', 'O'),
 ('ordered', 'VBD', 'O'),
 ('Apple', 'NNP', 'B-PERSON'),
 ('to', 'TO', 'O'),
 ('publish', 'VB', 'O'),
 ('an', 'DT', 'O'),
 ('apology', 'NN', 'O'),
 ('making', 'VBG', 'O'),
 ('clear', 'JJ', 'O'),
 ('that', 'IN', 'O'),
 ('the', 'DT', 'O'),
 ('South', 'JJ', 'B-LOCATION'),
 ('Korean', 'JJ', 'I-LOCATION'),
 ('firm', 'NN', 'O'),
 ('had', 'VBD', 'O'),
 ('not', 'RB', 'O'),
 ('copied', 'VBN', 'O'),
 ('its', 'PRP$', 'O'),
 ('iPad', 'NN', 'O'),
 ('when', 'WRB', 'O'),
 ('designing', 'VBG', 'O'),
 ('its', 'PRP$', 'O'),
 ('own', 'JJ', 'O'),
 ('devices', 'NNS', 'O'),
 ('.', '.', 'O')]
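The triples use the IOB scheme: B- marks the first token of an entity, I- a token inside it, and O a token outside any entity. The sketch below writes a few hand-copied triples as one token per line and converts them back to a tree with conlltags2tree. The tab-separated layout is an assumption made for this example; the exact columns of CoNLL files differ between shared tasks.

```python
from nltk.chunk import conlltags2tree, tree2conlltags

# A few triples from the output above, copied by hand.
iob_tagged = [("the", "DT", "O"), ("South", "JJ", "B-LOCATION"),
              ("Korean", "JJ", "I-LOCATION"), ("firm", "NN", "O")]

# One token per line, columns separated by tabs (assumed layout).
conll_lines = ["\t".join(triple) for triple in iob_tagged]
print("\n".join(conll_lines))

# conlltags2tree rebuilds the tree; tree2conlltags turns it into triples again.
tree = conlltags2tree(iob_tagged)
round_trip = tree2conlltags(tree)
print(tree)
```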
