### **CS481 Notebook 14**: Part-of-speech tagging with NLTK

Let's import the `nltk` package first:

In [29]:
import nltk

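Let's download the Punkt tokenizer models (they may already be downloaded). We will use them to split text into sentences; NLTK's word tokenizer relies on them as well. Note that recent NLTK releases may ask for the `punkt_tab` resource instead of `punkt`.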

In [30]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dzikjac\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

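The Punkt tokenizer is downloaded and ready to use. What is it exactly, though? It is an unsupervised sentence boundary detection algorithm **that you can train yourself**. NLTK also ships a **pre-trained** English model, which `sent_tokenize` loads for us further down. Instantiating the class directly, as below, creates a tokenizer with default (untrained) parameters: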

In [31]:
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
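
To illustrate the "train it yourself" part, here is a minimal sketch that passes raw text to the `PunktSentenceTokenizer` constructor, which then learns its parameters from that text. The names `training_text`, `sample` and `custom_tokenizer` are made up for this illustration, and a training text this small is far too short to learn anything useful; real training needs a large corpus from the target domain.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical training text; in practice use a large, domain-specific corpus.
training_text = (
    "Dr. Seldon founded the project. Dr. Seldon predicted the fall. "
    "The plan took years. Dr. Seldon never saw the result."
)

# Passing text to the constructor trains the tokenizer on it (unsupervised).
custom_tokenizer = PunktSentenceTokenizer(training_text)

# Split a new piece of text with the freshly trained tokenizer.
sample = "The effects are top notch. The story is slow."
custom_sentences = custom_tokenizer.tokenize(sample)

for s in custom_sentences:
    print(s)
```

No resource download is involved here, because the parameters come from `training_text` rather than from a pre-trained model.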

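We will also need a part-of-speech tagger. Let's download the model for NLTK's averaged perceptron tagger (recent NLTK releases may ask for `averaged_perceptron_tagger_eng` instead):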

In [32]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dzikjac\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True
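
The tagger labels each token with a Penn Treebank tag. As a reading aid for the tagged output further down, here is a small lookup table for some common tags. This is only a sketch: the dictionary `TAG_MEANINGS`, the helper `describe` and the hand-written `sample` list are illustrative names, not part of NLTK, and the table covers only a fraction of the tagset.

```python
# A few Penn Treebank tags and their meanings (not the full tagset).
TAG_MEANINGS = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "RB": "adverb",
    "IN": "preposition or subordinating conjunction",
    "TO": "to",
    "PRP": "personal pronoun",
    "WDT": "wh-determiner",
    "VB": "verb, base form",
    "VBG": "verb, gerund or present participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
}


def describe(tagged_words):
    """Pair each (word, tag) tuple with a human-readable tag description."""
    return [(word, tag, TAG_MEANINGS.get(tag, "other")) for word, tag in tagged_words]


# Hand-written sample in the (word, tag) format that nltk.pos_tag returns.
sample = [("The", "DT"), ("effects", "NNS"), ("are", "VBP"), ("great", "JJ"), (".", ".")]

for word, tag, meaning in describe(sample):
    print(f"{word:10} {tag:4} {meaning}")
```

NLTK can also print tag documentation itself through `nltk.help.upenn_tagset()`, which requires an additional tagset resource to be downloaded.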

Here's our input text:

In [45]:
text = 'The futuristic special effects are top notch. Sadly the degree of characterisation necessary to power a fairly static storyline is missing. With limited action, performances need to be riveting in themselves. Imagine Anthony Hopkins as Hari Seldon. Tilda Swinton as Demerzel. Maybe a Mel Gibson as Cleon. These are people that can hold the screen with the slightest word or gesture. Foundation onscreen needs such characterisation as the plot itself is quite dry and action limited for the sake of concept.'

Let's segment the input text into individual sentences:

In [46]:
sentences = nltk.tokenize.sent_tokenize(text, language='english')

... and print them out:

In [47]:
for sentence in sentences:
    print(sentence)

The futuristic special effects are top notch.
Sadly the degree of characterisation necessary to power a fairly static storyline is missing.
With limited action, performances need to be riveting in themselves.
Imagine Anthony Hopkins as Hari Seldon.
Tilda Swinton as Demerzel.
Maybe a Mel Gibson as Cleon.
These are people that can hold the screen with the slightest word or gesture.
Foundation onscreen needs such characterisation as the plot itself is quite dry and action limited for the sake of concept.


Now that we have all sentences ready, we can tokenize each one and then use the POS tagger:

In [48]:
# If you also want to remove stop words, uncomment the two lines below
# (the stop word list needs a one-time download):
#nltk.download('stopwords')
#stopWords = set(nltk.corpus.stopwords.words('english'))

for sentence in sentences:
    # Let's tokenize the sentence
    words = nltk.word_tokenize(sentence)

    # You could remove stop words if you wanted to
    # (the stop word list is lowercase, so compare lowercased tokens)
    #words = [w for w in words if w.lower() not in stopWords]

    # Let's use the part-of-speech tagger to tag each word
    taggedWords = nltk.pos_tag(words)

    print(taggedWords)

[('The', 'DT'), ('futuristic', 'JJ'), ('special', 'JJ'), ('effects', 'NNS'), ('are', 'VBP'), ('top', 'JJ'), ('notch', 'NN'), ('.', '.')]
[('Sadly', 'RB'), ('the', 'DT'), ('degree', 'NN'), ('of', 'IN'), ('characterisation', 'NN'), ('necessary', 'JJ'), ('to', 'TO'), ('power', 'NN'), ('a', 'DT'), ('fairly', 'RB'), ('static', 'JJ'), ('storyline', 'NN'), ('is', 'VBZ'), ('missing', 'VBG'), ('.', '.')]
[('With', 'IN'), ('limited', 'JJ'), ('action', 'NN'), (',', ','), ('performances', 'NNS'), ('need', 'VBP'), ('to', 'TO'), ('be', 'VB'), ('riveting', 'VBG'), ('in', 'IN'), ('themselves', 'PRP'), ('.', '.')]
[('Imagine', 'NNP'), ('Anthony', 'NNP'), ('Hopkins', 'NNP'), ('as', 'IN'), ('Hari', 'NNP'), ('Seldon', 'NNP'), ('.', '.')]
[('Tilda', 'NNP'), ('Swinton', 'NNP'), ('as', 'IN'), ('Demerzel', 'NNP'), ('.', '.')]
[('Maybe', 'RB'), ('a', 'DT'), ('Mel', 'NNP'), ('Gibson', 'NNP'), ('as', 'IN'), ('Cleon', 'NNP'), ('.', '.')]
[('These', 'DT'), ('are', 'VBP'), ('people', 'NNS'), ('that', 'WDT'), ('can'