## NLTK Chapter 1, section 1.2 

In [1]:
import nltk  

The first time you import nltk on your local machine, you need to download the data sets that are used in the course. A pop-up window lets you select data sets. The minimum you need is "book". Take your time to look through the different tabs to get an idea of what is available, and make sure you have enough disk space to store what you select.

If you have already downloaded the data once, you can skip the next step, as the data are on your local drive. If you need another data set or package later, run it again and take your pick.

In [2]:
nltk.download() 

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


SystemExit: 0

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
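
If the pop-up window is inconvenient (for example on a remote machine), the data can also be fetched by name. The sketch below is a minimal example, assuming the "book" collection is what you need. The helper name `ensure_nltk_data` and the resource path used to check for an existing download are illustrative choices, not part of the course material.

```python
import nltk


def ensure_nltk_data(resource_path, package_id):
    """Download package_id only if resource_path is not already on disk."""
    try:
        nltk.data.find(resource_path)   # raises LookupError when the data is missing
    except LookupError:
        nltk.download(package_id)       # download by name, without the pop-up window


# Example call (needs a network connection the first time):
# ensure_nltk_data('corpora/gutenberg', 'book')
```

Checking first means the cell can be re-run safely without downloading the same data again.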


## NLTK Chapter 1, section 3.1 Frequency Distributions

In [4]:
from nltk.book import *
from nltk import FreqDist
fdist1 = FreqDist(text1) 
print(fdist1)
fdist1.most_common(50)
fdist1['whale']

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
<FreqDist with 19317 samples and 260819 outcomes>


906
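
To see what "samples" and "outcomes" mean in the `FreqDist` summary above, it can help to try the same idea on a tiny example. `FreqDist` is built on Python's `collections.Counter`, so the sketch below uses a plain `Counter` on a made-up sentence (the sentence is an assumption for illustration, not taken from the book): samples are the distinct words, outcomes are all word occurrences.

```python
from collections import Counter

# A made-up toy "text", split on whitespace for simplicity
tokens = "the whale saw the other whale and the ship".split()
counts = Counter(tokens)

num_samples = len(counts)             # distinct word types
num_outcomes = sum(counts.values())   # total number of tokens

print(num_samples, num_outcomes)      # 6 9
print(counts.most_common(2))          # [('the', 3), ('whale', 2)]
print(counts['whale'])                # 2
```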

Tip: The best way to go about these assignments is to first read the section in the NLTK book that is mentioned, then start playing around with it. That should give you the hints to deal with an error message such as  "NameError: name 'FreqDist' is not defined". 

It's like cooking from a recipe for the first time: you probably want to read the full recipe before you start cooking, so that you don't find out at step 5 that you should have soaked your beans 12 hours in advance.


## Tokenisation with NLTK 

NLTK Chapter 5, section 1 Using a Tagger

In [22]:
text = "I'll refuse to permit you to obtain the refuse permit.".split()
text

["I'll",
 'refuse',
 'to',
 'permit',
 'you',
 'to',
 'obtain',
 'the',
 'refuse',
 'permit.']

Think about the line above: is splitting on whitespace actually the same as tokenising?
Let's try a real tokenizer.

In [23]:
from nltk import word_tokenize
text = word_tokenize("I'll refuse to permit you to obtain the refuse permit.")
text

['I',
 "'ll",
 'refuse',
 'to',
 'permit',
 'you',
 'to',
 'obtain',
 'the',
 'refuse',
 'permit',
 '.']
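
To get a feel for what a tokenizer has to decide, here is a rough sketch using only Python's `re` module. The pattern is a deliberately crude assumption for illustration and is not what `word_tokenize` does internally: it separates the final full stop like NLTK does, but it breaks "I'll" into three pieces instead of two.

```python
import re

sentence = "I'll refuse to permit you to obtain the refuse permit."

# Whitespace splitting leaves punctuation glued to the neighbouring word
whitespace_tokens = sentence.split()

# Crude rule: a run of word characters, or any single other non-space character
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens[-1])   # permit.
print(regex_tokens[-2:])       # ['permit', '.']
print(regex_tokens[:3])        # ['I', "'", 'll']
```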

You can also call the tokenizer directly from NLTK:

In [24]:
text = nltk.word_tokenize("I'll refuse to permit you to obtain the refuse permit.")
text

['I',
 "'ll",
 'refuse',
 'to',
 'permit',
 'you',
 'to',
 'obtain',
 'the',
 'refuse',
 'permit',
 '.']

In [25]:
nltk.pos_tag(text)

[('I', 'PRP'),
 ("'ll", 'MD'),
 ('refuse', 'VB'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('you', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN'),
 ('.', '.')]

Task: Make sure you know what each tag means. Would you have tagged each word using the same tag or do you disagree with the tagger? 

Observe some word contexts to get an idea of why NLTK tags the way it tags (Chapter 1, section 1.3) 


## Some more on cleaning up text and getting the basic forms for words

In [35]:
raw_docs = ["Here are some very simple basic sentences.",
"They won't be very interesting, I'm afraid.",
"The point of these examples is to _learn how basic text cleaning works_ on *very simple* data."]

In [36]:
# Tokenizing text into bags of words
from nltk.tokenize import word_tokenize
tokenized_docs = [word_tokenize(doc) for doc in raw_docs]
print(tokenized_docs)

[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences', '.'], ['They', 'wo', "n't", 'be', 'very', 'interesting', ',', 'I', "'m", 'afraid', '.'], ['The', 'point', 'of', 'these', 'examples', 'is', 'to', '_learn', 'how', 'basic', 'text', 'cleaning', 'works_', 'on', '*very', 'simple*', 'data', '.']]


In [37]:
# Removing punctuation
import re
import string
# Character class that matches any single ASCII punctuation character
regex = re.compile('[%s]' % re.escape(string.punctuation))
# See the string module documentation: http://docs.python.org/2/library/string.html

tokenized_docs_no_punctuation = []

for doc in tokenized_docs:
    new_doc = []
    for token in doc:
        # Strip every punctuation character; drop tokens that end up empty
        new_token = regex.sub('', token)
        if new_token != '':
            new_doc.append(new_token)

    tokenized_docs_no_punctuation.append(new_doc)
    
print(tokenized_docs_no_punctuation)

[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences'], ['They', 'wo', 'nt', 'be', 'very', 'interesting', 'I', 'm', 'afraid'], ['The', 'point', 'of', 'these', 'examples', 'is', 'to', 'learn', 'how', 'basic', 'text', 'cleaning', 'works', 'on', 'very', 'simple', 'data']]
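
The same clean-up can also be written without a regular expression. The sketch below is one possible alternative, using `str.translate` with a deletion table. It is run on the second example sentence only, and the variable names are chosen for illustration.

```python
import string

tokenized_docs = [
    ['They', 'wo', "n't", 'be', 'very', 'interesting', ',', 'I', "'m", 'afraid', '.'],
]

# Translation table that deletes every ASCII punctuation character
strip_punctuation = str.maketrans('', '', string.punctuation)

cleaned_docs = [
    # Keep a token only if something is left after stripping punctuation
    [cleaned for cleaned in (token.translate(strip_punctuation) for token in doc) if cleaned]
    for doc in tokenized_docs
]

print(cleaned_docs)
# [['They', 'wo', 'nt', 'be', 'very', 'interesting', 'I', 'm', 'afraid']]
```

Note that both versions turn "n't" into "nt" and "'m" into "m", which may or may not be what you want for later processing steps.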


In [38]:
# Cleaning text of stopwords
from nltk.corpus import stopwords

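# Build the stopword set once instead of reloading the list for every word
english_stopwords = set(stopwords.words('english'))

tokenized_docs_no_stopwords = []

for doc in tokenized_docs_no_punctuation:
    new_term_vector = []
    for word in doc:
        if word not in english_stopwords:
            new_term_vector.append(word)

    tokenized_docs_no_stopwords.append(new_term_vector)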

print(tokenized_docs_no_stopwords)

[['Here', 'simple', 'basic', 'sentences'], ['They', 'wo', 'nt', 'interesting', 'I', 'afraid'], ['The', 'point', 'examples', 'learn', 'basic', 'text', 'cleaning', 'works', 'simple', 'data']]
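
In the output above, 'Here', 'They', 'The' and 'I' are still present. NLTK's English stopword list is all lowercase and the membership test is case-sensitive, so capitalised words slip through. The sketch below shows the effect with a small hand-made stopword set (a hypothetical stand-in, much shorter than NLTK's list) so that it runs without any downloaded data.

```python
# Hypothetical mini stopword list; NLTK's English list is much longer
stopword_set = {'the', 'of', 'these', 'is', 'to', 'how', 'on', 'very'}

doc = ['The', 'point', 'of', 'these', 'examples', 'is', 'to', 'learn']

# Case-sensitive lookup: 'The' does not match 'the'
case_sensitive = [word for word in doc if word not in stopword_set]

# Lowercase each word for the lookup only, keeping the original spelling
case_insensitive = [word for word in doc if word.lower() not in stopword_set]

print(case_sensitive)     # ['The', 'point', 'examples', 'learn']
print(case_insensitive)   # ['point', 'examples', 'learn']
```

Whether to lowercase depends on the task: it removes more stopwords, but capitalisation can be a useful signal for later steps such as named entity recognition.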


### Question:
What are stopwords and why would you want to remove these from a text?
What would be another way to define stop words than making a list?

## Stemming and lemmatizing
NLTK has various modules for stripping inflection from words (stemming) or finding the lemma (the form you can find in a dictionary). Below is a script to stem and lemmatize the words in a text example after tokenizing the text.

In [33]:
raw="SHUT UP! Enough already, Ballstein! Who cares about Derek Zoolander anyway? The man has only one look, for Christ's sake! Blue Steel? Ferrari? Le Tigra?"

In [47]:
# Stemming and Lemmatizing
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
wordnet = WordNetLemmatizer()
tokens = nltk.word_tokenize(raw)

porterlemmas = []
wordnetlemmas = []
snowballlemmas = []

for word in tokens:
    porterlemmas.append(porter.stem(word))
    snowballlemmas.append(snowball.stem(word))
    wordnetlemmas.append(wordnet.lemmatize(word))

print('Porter')
print(porterlemmas)
print('Snowball')
print(snowballlemmas)
print('Wordnet')
print(wordnetlemmas)

Porter
['shut', 'UP', '!', 'enough', 'alreadi', ',', 'ballstein', '!', 'who', 'care', 'about', 'derek', 'zooland', 'anyway', '?', 'the', 'man', 'ha', 'onli', 'one', 'look', ',', 'for', 'christ', "'s", 'sake', '!', 'blue', 'steel', '?', 'ferrari', '?', 'Le', 'tigra', '?']
Snowball
['shut', 'up', '!', 'enough', 'alreadi', ',', 'ballstein', '!', 'who', 'care', 'about', 'derek', 'zooland', 'anyway', '?', 'the', 'man', 'has', 'onli', 'one', 'look', ',', 'for', 'christ', "'s", 'sake', '!', 'blue', 'steel', '?', 'ferrari', '?', 'le', 'tigra', '?']
Wordnet
['SHUT', 'UP', '!', 'Enough', 'already', ',', 'Ballstein', '!', 'Who', 'care', 'about', 'Derek', 'Zoolander', 'anyway', '?', 'The', 'man', 'ha', 'only', 'one', 'look', ',', 'for', 'Christ', "'s", 'sake', '!', 'Blue', 'Steel', '?', 'Ferrari', '?', 'Le', 'Tigra', '?']


### Question:
What do you notice as a difference? What will be better for finding words in a dictionary? What output is better for looking up entities in Wikipedia?

-- Do something on chunking, parsing and dependencies

== START OF THE LAB ASSIGNMENT == 

This lab assignment consists of 3 parts: Sentence Boundary Detection, Reading Tagged Corpora and Error Propagation. Submit your ipynb with the answers via Canvas. Please note that submissions should contain a title, your names and a description of the assignment, so that the document can be understood on its own. At the very least this means copying the task description!

You don't need to provide a long explanation for your commands; the commands you used, and perhaps some output, will suffice. We will strive to give you feedback before the next Lab session.

### Sentence boundary detection with NLTK

NLTK offers a sentence boundary detection function, similar to the word tokenisation function. It is called sent_tokenize(). Submit the commands that you use to print the last sentence of a text that has been split into sentences using NLTK. The task can be carried out in 4 or 5 steps.

- Create a string that contains several sentences. 
- Import the sent_tokenize function 
- Tokenize / detect sentence boundaries in your text and store the results in a list 
- Get the number of sentences (number of list items) (can be skipped for those more familiar with Python) 
- Print the last sentence.

In [None]:
# Provide your answer here

### NLTK Chapter 5, Section 2.2 & 2.3 Reading tagged corpora

Task: NLTK contains a Dutch tagged corpus. Can you print the most common tags from it?

For this you need to know that the Dutch tags are contained in the conll2002 corpus. 

The conll2002 corpus also contains Spanish text, so you need to only read in the Dutch part. 

You can check what types of information the corpus class contains with the dir command. 

When you create a subset of the conll corpus, you need to include the files that start with 'ned'.

Check the example of getting the 10 most common tags from the Brown corpus in Section 2.3 of chapter 5.

Note: the conll2002  corpus doesn't contain the universal POS tagset.

In [None]:
# Provide your answer here

### Error propagation 
(This means that an error in an earlier step of the pipeline, for example in the tokeniser, harms later steps. See http://ceur-ws.org/Vol-1386/piling_up.pdf for a more in-depth discussion.)

Tokenisation and POS-tagging are often used as pre-processing steps for higher level tasks such as named entity recognition. If you run the text "SHUT UP! Enough already, Ballstein! Who cares about Derek Zoolander anyway? The man has only one look, for Christ's sake! Blue Steel? Ferrari? Le Tigra?" through NLTK's tokeniser and POS-tagger, which errors do you think may arise that can harm the named entity recogniser at a later stage? 

In [None]:
# Provide your answer here

== END OF THE LAB ASSIGNMENT ==

In [30]:
### Working towards the final assignment & further tips & tricks

In [28]:
## Loading your own text 

## NLTK Chapter 2, section 1.9

from nltk.corpus import PlaintextCorpusReader

corpus_root = '/usr/share/dict' 

wordlists = PlaintextCorpusReader(corpus_root, '.*') 

wordlists.fileids()


['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']

In [29]:
wordlists.words('connectives')

['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]

--------------------------------------
Homework/Working towards the final assignment:

Check out the data for the final assignment, then tokenise and POS tag the texts you selected. You can find the link and description of the data in "Course Documents".

If you perform these steps now, and document them for yourself, you can 'grow' your data for the final assignment. 

--------------------------------------

Tips and tricks:

If you are not familiar with storing your code in scripts (for easy reuse), you can find some suggestions in Chapter 2, section 3.

--------------------------------------

Further reading & hacking: 

More information on dealing with character encodings:

NLTK Chapter 3, section 3.3: http://www.nltk.org/book/ch03.html#sec-unicode

Should you be interested in tokenising and POS tagging text in languages other than Dutch, you can find information on training your own tokeniser for a variety of languages at:

http://textminingonline.com/dive-into-nltk-part-ii-sentence-tokenize-and-word-tokenize