This material is copied (possibly with some modifications) from the [Python for Text-Analysis course](https://github.com/cltl/python-for-text-analysis/tree/master/Chapters).

# Chapter 24: NLP Pipeline with NLTK

Way to go! You have already learnt a lot of essential components of the Python language. Being able to deal with data structures, import packages, build your own functions and operate with files is not only essential for most tasks in Python, but also a prerequisite for text analysis. We have applied some common preprocessing steps like casefolding/lowercasing, punctuation removal and stemming/lemmatization. Did you know that there are some very useful NLP packages and modules that do some of these steps? One that is often used in text analysis is the Python package **NLTK (the Natural Language Toolkit)**.

### At the end of this chapter, you will be able to:
* describe the NLP tasks that make up an NLP pipeline;
* use the functions of the NLTK module to manipulate the content of files for NLP purposes (e.g. sentence splitting, tokenization, POS-tagging, and lemmatization);
* nest multiple for-loops and file operations.

### More NLP software for Python:
* [NLTK](http://www.nltk.org/)
* [SpaCy](https://spacy.io/)
* [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/index.html)
* [About Python NLP libraries](https://elitedatascience.com/python-nlp-libraries)


If you have **questions** about this chapter, drop em in the Slack.

# 1 A short intro to text processing

There are many aspects of text we can (try to) analyze. Commonly used analyses conducted in Natural Language Processing (**NLP**) are for instance:

* determining the part of speech of words in a text (verb, noun, etc.)
* analyzing the syntactic relations between words and phrases in a sentence (i.e. syntactic parsing)
* analyzing which entities (people, organizations, locations) are mentioned in a text

...and many more. Each of these aspects is addressed within its own **NLP task**. 

**The NLP pipeline**

Usually, these tasks are carried out sequentially, because they depend on each other. For instance, we first need to tokenize the text (split it into words) before we can assign part-of-speech tags to each word. This sequence is often called an **NLP pipeline**. For example, a general pipeline could consist of the components shown below (taken from [here](https://www.slideshare.net/YuriyGuts/natural-language-processing-nlp)). You can see the NLP pipeline of the NewsReader project [here](http://www.newsreader-project.eu/files/2014/02/SystemArchitecture.png); you can ignore the middle part of the picture and focus on the blue and green boxes in the outer row.

<img src='images/nlp-pipeline.jpg'>

In this chapter we will look into four simple NLP modules that are nevertheless very common in NLP: **tokenization, sentence splitting**, **lemmatization** and **POS tagging**. 
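The way pipeline steps build on each other can be sketched in a few lines of plain Python. The three helper functions below are deliberately naive stand-ins with made-up names (they are not NLTK functions); they only illustrate that each step consumes the output of the previous one.

```python
# Toy pipeline: each step consumes the output of the previous one.
# These helpers are simplified stand-ins for illustration only,
# not the NLTK functions introduced later in this chapter.

def split_sentences(text):
    """Naively split a text into sentences at every period."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]

def tokenize(sentence):
    """Naively split a sentence into tokens at whitespace."""
    return sentence.replace(".", " .").split()

def lowercase(tokens):
    """Casefold every token."""
    return [token.lower() for token in tokens]

toy_text = "Language is awesome. Pipelines are useful."

pipeline_output = []
for toy_sentence in split_sentences(toy_text):
    toy_tokens = tokenize(toy_sentence)    # step 2 needs the output of step 1
    toy_tokens = lowercase(toy_tokens)     # step 3 needs the output of step 2
    pipeline_output.append(toy_tokens)

print(pipeline_output)
```

Swapping the order of the steps would break the sketch: `lowercase()` expects a list of tokens, so it cannot run before `tokenize()`.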

There are also more advanced processing modules out there - feel free to do some research yourself :-) 

# 2 The NLTK package

NLTK (the Natural Language Toolkit) is a module we can use for most fundamental aspects of natural language processing. There are many more advanced approaches out there, but it is a good way of getting started. 

Here we will show you how to use it for tokenization, sentence splitting, POS tagging, and lemmatization. These steps are necessary processing steps for most NLP tasks. 

We will first give you an overview of all tasks and then delve into each of them in more detail. 

Before we can use NLTK for the first time, we have to make sure it is downloaded and installed on our computer (some of you may have already done this). 

To install NLTK, please try to run the following 2 cells. If this does not work, please try and follow the [documentation](http://www.nltk.org/install.html). If you don't manage to get this to work, please ask for help. 

In [1]:
%%bash
pip install nltk

Collecting nltk
  Downloading nltk-3.5.zip (1.4 MB)
Collecting regex
  Downloading regex-2020.5.7.tar.gz (696 kB)
Collecting tqdm
  Downloading tqdm-4.46.0-py2.py3-none-any.whl (63 kB)
Could not build wheels for nltk, since package 'wheel' is not installed.
Could not build wheels for click, since package 'wheel' is not installed.
Could not build wheels for joblib, since package 'wheel' is not installed.
Could not build wheels for regex, since package 'wheel' is not installed.
Installing collected packages: regex, tqdm, nltk
    Running setup.py install for regex: started
    Running setup.py install for regex: finished with status 'done'
    Running setup.py install for nltk: started
    Running setup.py install for nltk: finished with status 'done'
Successfully installed nltk-3.5 regex-2020.5.7 tqdm-4.46.0


In [2]:
# downloading nltk

import nltk
nltk.download('book')

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /Users/cody/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package brown to /Users/cody/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /Users/cody/nltk_data...
[nltk_data]    |   Unzipping corpora/chat80.zip.
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/cody/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /Users/cody/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2000.zip.
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /Users/cody/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2002.zip.
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |     /Users/cody/nltk_data...
[nltk_data]    |  

True

Now that we have installed and downloaded NLTK, let's look at an example of a simple NLP pipeline. In the following cell, you can observe how we split raw text into tokens and sentences, perform part-of-speech tagging and lemmatize one of the tokens. Don't worry about the details just yet - we will go through them step by step. 

In [3]:
text = "This example sentence is used for illustrating some basic NLP tasks. Language is awesome!"

# Tokenization
tokens = nltk.word_tokenize(text)

# Sentence splitting
sentences = nltk.sent_tokenize(text)

# POS tagging
tagged_tokens = nltk.pos_tag(tokens)

# Lemmatization
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
lemma=lmtzr.lemmatize(tokens[4], 'v')

# Printing all information
print(tokens)
print(sentences)
print(tagged_tokens)
print(lemma)

['This', 'example', 'sentence', 'is', 'used', 'for', 'illustrating', 'some', 'basic', 'NLP', 'tasks', '.', 'Language', 'is', 'awesome', '!']
['This example sentence is used for illustrating some basic NLP tasks.', 'Language is awesome!']
[('This', 'DT'), ('example', 'NN'), ('sentence', 'NN'), ('is', 'VBZ'), ('used', 'VBN'), ('for', 'IN'), ('illustrating', 'VBG'), ('some', 'DT'), ('basic', 'JJ'), ('NLP', 'NNP'), ('tasks', 'NNS'), ('.', '.'), ('Language', 'NN'), ('is', 'VBZ'), ('awesome', 'JJ'), ('!', '.')]
use


## 2.1 Tokenization and sentence splitting with NLTK

### 2.1.1 `word_tokenize()`

Now, let's try tokenizing our Charlie story! First, we will open and read the file again and assign the file contents to the variable `content`. Then, we can call the `word_tokenize()` function from the `nltk` module as follows:

In [5]:
with open("../data/charlie.txt") as infile:
    content = infile.read()

tokens = nltk.word_tokenize(content)
print(tokens)

['Charlie', 'Bucket', 'stared', 'around', 'the', 'gigantic', 'room', 'in', 'which', 'he', 'now', 'found', 'himself', '.', 'The', 'place', 'was', 'like', 'a', 'witch', '’', 's', 'kitchen', '!', 'All', 'about', 'him', 'black', 'metal', 'pots', 'were', 'boiling', 'and', 'bubbling', 'on', 'huge', 'stoves', ',', 'and', 'kettles', 'were', 'hissing', 'and', 'pans', 'were', 'sizzling', ',', 'and', 'strange', 'iron', 'machines', 'were', 'clanking', 'and', 'spluttering', ',', 'and', 'there', 'were', 'pipes', 'running', 'all', 'over', 'the', 'ceiling', 'and', 'walls', ',', 'and', 'the', 'whole', 'place', 'was', 'filled', 'with', 'smoke', 'and', 'steam', 'and', 'delicious', 'rich', 'smells', '.', 'Mr', 'Wonka', 'himself', 'had', 'suddenly', 'become', 'even', 'more', 'excited', 'than', 'usual', ',', 'and', 'anyone', 'could', 'see', 'that', 'this', 'was', 'the', 'room', 'he', 'loved', 'best', 'of', 'all', '.', 'He', 'was', 'hopping', 'about', 'among', 'the', 'saucepans', 'and', 'the', 'machines', 'l

As you can see, we now have a list of all words in the text. The punctuation marks are also in the list, but as separate tokens.
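If the punctuation tokens are not wanted, they can be filtered out afterwards. The following is a minimal sketch that uses the built-in string method `isalpha()` on a few tokens copied from the output above (the variable names are made up for this example).

```python
# A few tokens copied from the output above (hard-coded for illustration)
example_tokens = ['The', 'place', 'was', 'like', 'a', 'witch', '’', 's', 'kitchen', '!']

# Keep only the tokens that consist entirely of letters
words_only = [token for token in example_tokens if token.isalpha()]

print(words_only)
```

Keep in mind that this filter is crude: it also drops tokens such as numbers or contractions like `n't`, so whether it is appropriate depends on the task.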

### 2.1.2 `sent_tokenize()`

Another thing that NLTK can do for you is to split a text into sentences by using the `sent_tokenize()` function. We use it on the entire text (as a string):

In [6]:
with open("../data/charlie.txt") as infile:
    content = infile.read()

sentences = nltk.sent_tokenize(content)
print(sentences)

['Charlie Bucket stared around the gigantic room in which he now found himself.', 'The place was like a witch’s kitchen!', 'All about him black metal pots were boiling and bubbling on huge stoves, and kettles were hissing and pans were sizzling, and strange iron machines were clanking and spluttering, and there were pipes running all over the ceiling and walls, and the whole place was filled with smoke and steam and delicious rich smells.', 'Mr Wonka himself had suddenly become even more excited than usual, and anyone could see that this was the room he loved best of all.', 'He was hopping about among the saucepans and the machines like a child among his Christmas presents, not knowing which thing to look at first.', 'He lifted the lid from a huge pot and took a sniff; then he rushed over and dipped a finger into a barrel of sticky yellow stuff and had a taste; then he skipped across to one of the machines and turned half a dozen knobs this way and that; then he peered anxiously throug

We can now do all sorts of cool things with these lists. For example, we can search for all words that have certain letters in them and add them to a list. Let's say we want to find all present participles in the text. We know that present participles end with *-ing*, so we can do something like this:

In [7]:
# Open and read in file as a string, assign it to the variable `content`
with open("../data/charlie.txt") as infile:
    content = infile.read()
    
# Split up entire text into tokens using word_tokenize():
tokens = nltk.word_tokenize(content)

# create an empty list to collect all words having the present participle -ing:
present_participles = []

# looking through all tokens
for token in tokens:
    # checking if a token ends with the present participle suffix -ing
    if token.endswith("ing"):
        # if the condition is met, add it to the list we created above (present_participles)
        present_participles.append(token)
        
# Print the list to inspect it
print(present_participles)

['boiling', 'bubbling', 'hissing', 'sizzling', 'clanking', 'spluttering', 'running', 'ceiling', 'hopping', 'knowing', 'thing', 'rubbing', 'cackling', 'going']


This looks good! We now have a list of words like *boiling*, *sizzling*, etc. However, the list also contains words that are not present participles at all (*ceiling* and *thing*), because other words can end with *-ing* too. So if we want to find all present participles, we have to come up with a smarter solution. 

## 2.2 Part-of-speech (POS) tagging

Once again, NLTK comes to the rescue. Using the function `pos_tag()`, we can label each word in the text with its part of speech. 

To do pos-tagging, you first need to tokenize the text. We have already done this above, but we will repeat the steps here, so you get a sense of what an NLP pipeline may look like.

### 2.2.1 `pos_tag()`

To see how `pos_tag()` can be used, we can (as always) look at the documentation by using the `help()` function. As we can see, `pos_tag()` takes a tokenized text as input and returns a list of tuples in which the first element corresponds to the token and the second to the assigned pos-tag.

In [8]:
# As always, we can start by reading the documentation:
help(nltk.pos_tag)

Help on function pos_tag in module nltk.tag:

pos_tag(tokens, tagset=None, lang='eng')
    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.
    
        >>> from nltk.tag import pos_tag
        >>> from nltk.tokenize import word_tokenize
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
        ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
        [('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
        ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]
    
    NB. Use `pos_tag_sents()` for efficient tagging of more than one sentence.
    
    :param tokens: Sequence of tokens to be tagged
    :type tokens: list(str)
    :param tagset: the tagset to be u

In [10]:
# Open and read in file as a string, assign it to the variable `content`
with open("../data/charlie.txt") as infile:
    content = infile.read()
    
# Split up entire text into tokens using word_tokenize():
tokens = nltk.word_tokenize(content)

# Apply pos tagging to the tokenized text
tagged_tokens = nltk.pos_tag(tokens)

# Inspect pos tags
print(tagged_tokens)

[('Charlie', 'NNP'), ('Bucket', 'NNP'), ('stared', 'VBD'), ('around', 'IN'), ('the', 'DT'), ('gigantic', 'JJ'), ('room', 'NN'), ('in', 'IN'), ('which', 'WDT'), ('he', 'PRP'), ('now', 'RB'), ('found', 'VBD'), ('himself', 'PRP'), ('.', '.'), ('The', 'DT'), ('place', 'NN'), ('was', 'VBD'), ('like', 'IN'), ('a', 'DT'), ('witch', 'NN'), ('’', 'NN'), ('s', 'NN'), ('kitchen', 'NN'), ('!', '.'), ('All', 'DT'), ('about', 'IN'), ('him', 'PRP'), ('black', 'JJ'), ('metal', 'NN'), ('pots', 'NNS'), ('were', 'VBD'), ('boiling', 'VBG'), ('and', 'CC'), ('bubbling', 'VBG'), ('on', 'IN'), ('huge', 'JJ'), ('stoves', 'NNS'), (',', ','), ('and', 'CC'), ('kettles', 'NNS'), ('were', 'VBD'), ('hissing', 'VBG'), ('and', 'CC'), ('pans', 'NNS'), ('were', 'VBD'), ('sizzling', 'VBG'), (',', ','), ('and', 'CC'), ('strange', 'JJ'), ('iron', 'NN'), ('machines', 'NNS'), ('were', 'VBD'), ('clanking', 'VBG'), ('and', 'CC'), ('spluttering', 'NN'), (',', ','), ('and', 'CC'), ('there', 'EX'), ('were', 'VBD'), ('pipes', 'NNS

### 2.2.2 Working with POS tags

As we saw above, `pos_tag()` returns a list of tuples: The first element is the token, the second element indicates the part of speech (POS) of the token. 
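A minimal sketch of how such a list of tuples can be handled, using a few pairs copied from the output above. The variable names are made up for this example; the grouping step is only one possible way of organizing the tags.

```python
# A few (token, tag) pairs copied from the output above
example_tagged = [('Charlie', 'NNP'), ('Bucket', 'NNP'), ('stared', 'VBD'),
                  ('around', 'IN'), ('the', 'DT'), ('gigantic', 'JJ'), ('room', 'NN')]

# Access a single pair by index, then its elements by position
first_pair = example_tagged[0]
print(first_pair[0], first_pair[1])

# Unpack each tuple directly in the for-loop and group the tokens by tag
tokens_by_tag = {}
for token, tag in example_tagged:
    if tag not in tokens_by_tag:
        tokens_by_tag[tag] = []
    tokens_by_tag[tag].append(token)

print(tokens_by_tag)
```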

This POS tagger uses the POS tag set of the Penn Treebank Project, which can be found [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). For example, all tags starting with a V are used for verbs. 

We can now use this, for example, to identify all the verbs in a text:

In [11]:
# Open and read in file as a string, assign it to the variable `content`
with open("../data/charlie.txt") as infile:
    content = infile.read()
    
# Apply tokenization and POS tagging
tokens = nltk.word_tokenize(content)
tagged_tokens = nltk.pos_tag(tokens)

# List of verb tags (i.e. tags we are interested in)
verb_tags = ["VBD", "VBG", "VBN", "VBP", "VBZ"]

# Create an empty list to collect all verbs:
verbs = []

# Iterating over all tagged tokens
for token, tag in tagged_tokens:
 
    # Checking if the tag is any of the verb tags
    if tag in verb_tags:
        # if the condition is met, add it to the list we created above 
        verbs.append(token)
        
# Print the list to inspect it
print(verbs)

['stared', 'found', 'was', 'were', 'boiling', 'bubbling', 'were', 'hissing', 'were', 'sizzling', 'were', 'clanking', 'were', 'running', 'was', 'filled', 'had', 'become', 'was', 'loved', 'was', 'hopping', 'knowing', 'lifted', 'took', 'rushed', 'dipped', 'had', 'skipped', 'turned', 'peered', 'rubbing', 'cackling', 'saw', 'ran', 'kept', 'going', 'went', 'dropped']
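As an alternative to spelling out every tag in `verb_tags`, the shared prefix of the Penn Treebank verb tags can be tested with `startswith()`. The sketch below uses a hand-written sample in the format returned by `pos_tag()`, not actual tagger output. Note that this variant also matches the base-form tag `VB`, which the explicit list above leaves out.

```python
# Hand-written sample in the (token, tag) format returned by pos_tag()
sample_tagged = [('I', 'PRP'), ('saw', 'VBD'), ('a', 'DT'),
                 ('spider', 'NN'), ('go', 'VB'), ('by', 'IN')]

# Keep every token whose tag starts with "VB" (VB, VBD, VBG, VBN, VBP, VBZ)
sample_verbs = [token for token, tag in sample_tagged if tag.startswith("VB")]

print(sample_verbs)
```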


## 2.3 Lemmatization

We can also use NLTK to lemmatize words.

The lemma of a word is the form of the word which is usually used in dictionary entries. This is useful for many NLP tasks, as it generalizes better than the specific form in which a word appears. To a computer, `cat` and `cats` are two completely different tokens, even though we know they are both forms of the same lemma. 
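The effect of this generalization can be illustrated with toy data. In the sketch below, the dictionary `toy_lemmas` is a hypothetical stand-in for a real lemmatizer; it only serves to show that mapping forms to lemmas reduces the number of distinct items.

```python
# Hand-written toy data; the mapping is a stand-in for a real lemmatizer
toy_tokens = ['cat', 'cats', 'dog', 'cat', 'dogs', 'cats']
toy_lemmas = {'cats': 'cat', 'dogs': 'dog'}

# Without lemmatization, every surface form counts as a different item
distinct_forms = set(toy_tokens)

# With lemmatization, different forms of the same word are merged
distinct_lemmas = set(toy_lemmas.get(token, token) for token in toy_tokens)

print(len(distinct_forms), len(distinct_lemmas))
```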



### 2.3.1 The WordNet lemmatizer

We will use the WordNetLemmatizer for this using the `lemmatize()` function. In the code below, we loop through the list of verbs, lemmatize each of the verbs, and add them to a new list called `verb_lemmas`. Again, we show all the processing steps (consider the comments in the code below):

In [12]:
#################################################################################
#### Process text as explained above ###

with open("../data/charlie.txt") as infile:
    content = infile.read()
    
tokens = nltk.word_tokenize(content)
tagged_tokens = nltk.pos_tag(tokens)

verb_tags = ["VBD", "VBG", "VBN", "VBP", "VBZ"]
verbs = []

for token, tag in tagged_tokens:
    if tag in verb_tags:
        verbs.append(token)

print(verbs)

#############################################################################
#### Use the list of verbs collected above to lemmatize all the verbs ###

        
# Instantiate a lemmatizer object
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()

# Create list to collect all the verb lemmas:
verb_lemmas = []
        
for participle in verbs:
    # For this lemmatizer, we need to indicate the POS of the word (in this case, v = verb)
    lemma = lmtzr.lemmatize(participle, "v") 
    verb_lemmas.append(lemma)
print(verb_lemmas)

['stared', 'found', 'was', 'were', 'boiling', 'bubbling', 'were', 'hissing', 'were', 'sizzling', 'were', 'clanking', 'were', 'running', 'was', 'filled', 'had', 'become', 'was', 'loved', 'was', 'hopping', 'knowing', 'lifted', 'took', 'rushed', 'dipped', 'had', 'skipped', 'turned', 'peered', 'rubbing', 'cackling', 'saw', 'ran', 'kept', 'going', 'went', 'dropped']
['star', 'find', 'be', 'be', 'boil', 'bubble', 'be', 'hiss', 'be', 'sizzle', 'be', 'clank', 'be', 'run', 'be', 'fill', 'have', 'become', 'be', 'love', 'be', 'hop', 'know', 'lift', 'take', 'rush', 'dip', 'have', 'skip', 'turn', 'peer', 'rub', 'cackle', 'saw', 'run', 'keep', 'go', 'go', 'drop']


**Note about the wordnet lemmatizer:** 

We need to pass a POS tag to the WordNet lemmatizer in WordNet format ("n" for noun, "v" for verb, "a" for adjective, "r" for adverb). If we do not specify the part of speech, the WordNet lemmatizer treats the word as a noun (this is the default value of its `pos` argument). See the examples below:

In [13]:
test_nouns = ('building', 'applications', 'leafs')
for n in test_nouns:
    print(f"Noun in conjugated form: {n}")
    default_lemma=lmtzr.lemmatize(n) # default lemmatization, without specifying POS, n is interpreted as a noun!
    print(f"Default lemmatization: {default_lemma}")
    verb_lemma=lmtzr.lemmatize(n, 'v')
    print(f"Lemmatization as a verb: {verb_lemma}")
    noun_lemma=lmtzr.lemmatize(n, 'n')
    print(f"Lemmatization as a noun: {noun_lemma}")
    print()

Noun in conjugated form: building
Default lemmatization: building
Lemmatization as a verb: build
Lemmatization as a noun: building

Noun in conjugated form: applications
Default lemmatization: application
Lemmatization as a verb: applications
Lemmatization as a noun: application

Noun in conjugated form: leafs
Default lemmatization: leaf
Lemmatization as a verb: leaf
Lemmatization as a noun: leaf



In [14]:
test_verbs=('grew', 'standing', 'plays')
for v in test_verbs:
    print(f"Verb in conjugated form: {v}")
    default_lemma=lmtzr.lemmatize(v) # default lemmatization, without specifying POS, v is interpreted as a noun!
    print(f"Default lemmatization: {default_lemma}")
    verb_lemma=lmtzr.lemmatize(v, 'v')
    print(f"Lemmatization as a verb: {verb_lemma}")
    noun_lemma=lmtzr.lemmatize(v, 'n')
    print(f"Lemmatization as a noun: {noun_lemma}")
    print()

Verb in conjugated form: grew
Default lemmatization: grew
Lemmatization as a verb: grow
Lemmatization as a noun: grew

Verb in conjugated form: standing
Default lemmatization: standing
Lemmatization as a verb: stand
Lemmatization as a noun: standing

Verb in conjugated form: plays
Default lemmatization: play
Lemmatization as a verb: play
Lemmatization as a noun: play



### 2.3.2 Combining NLTK POS tags with the WordNet lemmatizer

The WordNet lemmatizer assumes every word is a noun unless specified differently. We need to be careful and specify the POS tag, because otherwise we will end up with wrong lemmatizations such as the cases shown in the previous two cells. For example, by default the lemmatizer treats "grew" as a noun, so it will not be lemmatized as a past-tense verb.

Luckily, we learned that we can also automatically infer the POS tags for each word. We can use these automatic POS tags as input to our lemmatizer to improve its accuracy for non-nouns. As an intermediate step, we need to translate the POS tags that we get from our POS tagger (these follow the Penn Treebank classification) to WordNet POS tags. Here is an example of how to lemmatize your words in a proper way, accounting for different POS tags (you can also read [this discussion](https://stackoverflow.com/questions/25534214/nltk-wordnet-lemmatizer-shouldnt-it-lemmatize-all-inflections-of-a-word)):

In [15]:
# Lemmatizing (the proper way, accounting for different POS tags)
from nltk.corpus import wordnet as wn


# We can write a general function to translate penn tree bank tags to wordnet tags
def penn_to_wn(penn_tag):
    """
    Returns the corresponding WordNet POS tag for a Penn TreeBank POS tag.
    """
    if penn_tag in ['NN', 'NNS', 'NNP', 'NNPS']:
        wn_tag = wn.NOUN
    elif penn_tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']:
        wn_tag = wn.VERB
    elif penn_tag in ['RB', 'RBR', 'RBS']:
        wn_tag = wn.ADV
    elif penn_tag in ['JJ', 'JJR', 'JJS']:
        wn_tag = wn.ADJ
    else:
        wn_tag = None
    return wn_tag

lmtzr = nltk.stem.wordnet.WordNetLemmatizer()

# create empty list to collect lemmas
lemmas = list()

# We use the tagged tokens we collected above and loop through the list of tuples
for token, pos in tagged_tokens:
    # convert Penn Treebank POS tag to WordNet POS tag
    wn_tag = penn_to_wn(pos) 
    # we check if a wordnet tag was assigned
    if wn_tag is not None:
        # we lemmatize using the translated wordnet tag
        lemma = lmtzr.lemmatize(token, wn_tag)
    else:
        # if there is no wordnet tag, we apply default lemmatization
        lemma = lmtzr.lemmatize(token)
    # add lemmas to list
    lemmas.append(lemma)
    
# Inspect lemmas by printing them
print(lemmas)


['Charlie', 'Bucket', 'star', 'around', 'the', 'gigantic', 'room', 'in', 'which', 'he', 'now', 'find', 'himself', '.', 'The', 'place', 'be', 'like', 'a', 'witch', '’', 's', 'kitchen', '!', 'All', 'about', 'him', 'black', 'metal', 'pot', 'be', 'boil', 'and', 'bubble', 'on', 'huge', 'stove', ',', 'and', 'kettle', 'be', 'hiss', 'and', 'pan', 'be', 'sizzle', ',', 'and', 'strange', 'iron', 'machine', 'be', 'clank', 'and', 'spluttering', ',', 'and', 'there', 'be', 'pipe', 'run', 'all', 'over', 'the', 'ceiling', 'and', 'wall', ',', 'and', 'the', 'whole', 'place', 'be', 'fill', 'with', 'smoke', 'and', 'steam', 'and', 'delicious', 'rich', 'smell', '.', 'Mr', 'Wonka', 'himself', 'have', 'suddenly', 'become', 'even', 'more', 'excited', 'than', 'usual', ',', 'and', 'anyone', 'could', 'see', 'that', 'this', 'be', 'the', 'room', 'he', 'love', 'best', 'of', 'all', '.', 'He', 'be', 'hop', 'about', 'among', 'the', 'saucepan', 'and', 'the', 'machine', 'like', 'a', 'child', 'among', 'his', 'Christmas', '
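The explicit tag lists in `penn_to_wn()` can also be written more compactly. The sketch below is an alternative, not the method used in this chapter: it relies on the observation that every Penn Treebank tag handled above starts with `NN`, `VB`, `JJ` or `RB`, and it returns the one-letter strings `'n'`, `'v'`, `'a'` and `'r'`, which are the values behind `wn.NOUN`, `wn.VERB`, `wn.ADJ` and `wn.ADV`. The names `PREFIX_TO_WN` and `penn_to_wn_short` are made up for this example.

```python
# Map the first two characters of a Penn Treebank tag to a WordNet POS tag.
# The one-letter values correspond to wn.NOUN, wn.VERB, wn.ADJ and wn.ADV.
PREFIX_TO_WN = {'NN': 'n', 'VB': 'v', 'JJ': 'a', 'RB': 'r'}

def penn_to_wn_short(penn_tag):
    """Return the WordNet POS tag for a Penn Treebank tag, or None."""
    return PREFIX_TO_WN.get(penn_tag[:2])

for example_tag in ['NNS', 'VBD', 'JJR', 'RBS', 'DT']:
    print(example_tag, penn_to_wn_short(example_tag))
```

Using two characters rather than one avoids accidental matches such as the particle tag `RP`, which starts with `R` but is not an adverb tag.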

# 3 Nesting

So far we typically used a single for-loop or we were opening a single file at a time. In Python (and most programming languages), one can **nest** multiple loops or files in one another. For instance, we can use one (external) for-loop to iterate through files, and then for each file iterate through all its sentences (internal for-loop). As we have learned above, `glob` is a convenient way of creating a list of files. 

You might think: can we extend this to more levels? Iterate through files, then iterate through the sentences in these files, then iterate through each word in these sentences, then iterate through each letter in these words, etc. This is possible. Python (like most programming languages) allows you to nest (in theory) as many loops as you want. Keep in mind that every extra level multiplies the number of iterations, so deeply nested loops can become slow on large datasets.
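A minimal sketch of such deeper nesting, using hand-written toy data instead of real files (the nested list `toy_documents` is a made-up stand-in for files that have already been split into sentences, and `split()` is a naive substitute for a proper tokenizer):

```python
# Toy data: a list of documents, each of which is a list of sentences
toy_documents = [
    ["I had a horse", "It ran"],   # document 1: two sentences
    ["We saw a cat"],              # document 2: one sentence
]

letter_count = 0
for document in toy_documents:             # level 1: documents
    for toy_sentence in document:          # level 2: sentences
        for word in toy_sentence.split():  # level 3: words
            for letter in word:            # level 4: letters
                letter_count += 1          # runs once per letter in the data

print(letter_count)
```

The innermost line runs once for every letter in the whole collection, which is why the amount of work grows quickly with the size of the data.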

In the code below, we want to get an idea of the number and length of the sentences in the texts stored in the `../data/Dreams` directory. We do this by creating two for-loops: we iterate over all the files in the directory (loop 1), apply sentence tokenization and iterate over all the sentences in the file (loop 2).

Look at the code and comments below to figure out what is going on:

In [16]:
import glob

### Loop 1 ####
# Loop1: iterate over all the files in the dreams directory
for filename in glob.glob("../data/Dreams/*.txt"): 
    # read in the file and assign the content to a variable
    with open(filename, "r") as infile:
        content = infile.read()
    sentences = nltk.sent_tokenize(content)                            # split the content into sentences
    print(f"INFO: File {filename} has {len(sentences)} sentences")     # Print the number of sentences in the file

    # For each file, assign a number to each sentence. Start with 0:
    counter=0

    #### Loop 2 ####
    # Loop 2: loop over all the sentences in a file:
    for sentence in sentences:
        counter+=1                                                    # add 1 to the counter
        tokens=nltk.word_tokenize(sentence)                           # tokenize the sentence
        print(f"Sentence {counter} has {len(tokens)} tokens")         # print the number of tokens per sentence
               
    # print an empty line after each file (this belongs to loop 1)
    print()

INFO: File ../data/Dreams/vickie8.txt has 17 sentences
Sentence 1 has 13 tokens
Sentence 2 has 15 tokens
Sentence 3 has 9 tokens
Sentence 4 has 11 tokens
Sentence 5 has 8 tokens
Sentence 6 has 13 tokens
Sentence 7 has 8 tokens
Sentence 8 has 10 tokens
Sentence 9 has 18 tokens
Sentence 10 has 7 tokens
Sentence 11 has 19 tokens
Sentence 12 has 10 tokens
Sentence 13 has 11 tokens
Sentence 14 has 5 tokens
Sentence 15 has 12 tokens
Sentence 16 has 9 tokens
Sentence 17 has 5 tokens

INFO: File ../data/Dreams/vickie9.txt has 9 sentences
Sentence 1 has 11 tokens
Sentence 2 has 12 tokens
Sentence 3 has 8 tokens
Sentence 4 has 9 tokens
Sentence 5 has 20 tokens
Sentence 6 has 6 tokens
Sentence 7 has 6 tokens
Sentence 8 has 13 tokens
Sentence 9 has 16 tokens

INFO: File ../data/Dreams/vickie10.txt has 16 sentences
Sentence 1 has 18 tokens
Sentence 2 has 5 tokens
Sentence 3 has 15 tokens
Sentence 4 has 19 tokens
Sentence 5 has 11 tokens
Sentence 6 has 13 tokens
Sentence 7 has 18 tokens
Sentence 8 h

# 4 Putting it all together

In this section, we will use what we have learned above to write a small NLP program. We will go through all the steps and show how they can be put together. In the last chapters, we have already learned how to write functions. We will make use of this skill here. 

Our goal is to collect all the nouns from Vickie's dream reports. 

Before we write actual code, it is always good to consider which steps we need to carry out to reach the goal. 

Important steps to remember:

* create a list of all the files we want to process
* open and read the files
* tokenize the texts
* perform pos-tagging
* collect all the tokens analyzed as nouns

Remember, we first needed to import `nltk` to use it. 

Since we want to carry out the same task for each of the files, it is very useful (and good practice!) to write a single function which can do the processing. The following function reads the specified file and returns the tokens with their POS tags:

## 4.1 Writing a processing function for a single file

In [17]:
import nltk

def tag_tokens_file(filepath):
    """Read the contents of the file found at the location specified in 
    FILEPATH and return a list of its tokens with their POS tags."""
    with open(filepath, "r") as infile:
        content = infile.read()
        tokens = nltk.word_tokenize(content)
        tagged_tokens = nltk.pos_tag(tokens)
    return tagged_tokens

Now, instead of having to open a file, read the contents and close the file, we can just call the function `tag_tokens_file` to do this. We can test it on a single file: 

In [18]:
filename = "../data/Dreams/vickie1.txt"
tagged_tokens = tag_tokens_file(filename)
print(tagged_tokens)

[('My', 'PRP$'), ('mom', 'NN'), ('and', 'CC'), ('I', 'PRP'), ('were', 'VBD'), ('in', 'IN'), ('the', 'DT'), ('grocery', 'NN'), ('store', 'NN'), ('.', '.'), ('I', 'PRP'), ('went', 'VBD'), ('over', 'IN'), ('to', 'TO'), ('the', 'DT'), ('free', 'JJ'), ('cookie', 'NN'), ('area', 'NN'), ('.', '.'), ('And', 'CC'), ('this', 'DT'), ('guy', 'NN'), ('gave', 'VBD'), ('me', 'PRP'), ('a', 'DT'), ('cookie', 'NN'), ('.', '.'), ('I', 'PRP'), ('had', 'VBD'), ('seen', 'VBN'), ('the', 'DT'), ('cookies', 'NNS'), (',', ','), ('and', 'CC'), ('they', 'PRP'), ('were', 'VBD'), ('pretend', 'JJ'), ('grasshoppers', 'NNS'), ('.', '.'), ('I', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('little', 'JJ'), ('spider', 'NN'), ('go', 'VB'), ('by', 'IN'), ('(', '('), ('on', 'IN'), ('the', 'DT'), ('cookie', 'NN'), (')', ')'), ('.', '.'), ('I', 'PRP'), ('said', 'VBD'), (',', ','), ('``', '``'), ('Oh', 'UH'), (',', ','), ('I', 'PRP'), ('do', 'VBP'), ("n't", 'RB'), ('like', 'VB'), ('this', 'DT'), ('cookie', 'NN'), ('.', '.'), ("''", "

## 4.2 Iterating over all the files and applying the processing function

We can also do this for each of the files in the `../data/Dreams` directory by using a for-loop:

In [19]:
import glob

# Iterate over the `.txt` files in the directory and perform POS tagging on each of them
for filename in glob.glob("../data/Dreams/*.txt"): 
    tagged_tokens = tag_tokens_file(filename)
    print(filename, "\n", tagged_tokens, "\n")

../data/Dreams/vickie8.txt 
 [('I', 'PRP'), ('had', 'VBD'), ('this', 'DT'), ('horse', 'NN'), (',', ','), ('and', 'CC'), ('I', 'PRP'), ('was', 'VBD'), ('going', 'VBG'), ('to', 'TO'), ('go', 'VB'), ('downtown', 'RB'), ('.', '.'), ('My', 'NNP'), ('friend', 'NN'), (',', ','), ('Sally', 'NNP'), (',', ','), ('was', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('back', 'NN'), ('of', 'IN'), ('me', 'PRP'), ('on', 'IN'), ('the', 'DT'), ('horse', 'NN'), ('.', '.'), ('I', 'PRP'), ('dropped', 'VBD'), ('her', 'PRP'), ('off', 'RP'), ('at', 'IN'), ('the', 'DT'), ('stop', 'NN'), ('sign', 'NN'), ('.', '.'), ('I', 'PRP'), ('was', 'VBD'), ('trying', 'VBG'), ('to', 'TO'), ('figure', 'VB'), ('out', 'RP'), ('how', 'WRB'), ('to', 'TO'), ('go', 'VB'), ('downtown', 'JJ'), ('.', '.'), ('I', 'PRP'), ('finally', 'RB'), ('got', 'VBD'), ('there', 'RB'), ('on', 'IN'), ('the', 'DT'), ('horse', 'NN'), ('.', '.'), ('I', 'PRP'), ('saw', 'VBD'), ('Valerie', 'NNP'), ('hanging', 'VBG'), ('out', 'RP'), ('and', 'CC'), ('said', 'VBD')

## 4.3 Collecting all the nouns

Now, we extend this code a bit so that we don't print all POS-tagged tokens of each file, but we get all (proper) nouns from the texts and add them to a list called `nouns_in_dreams`. Then, we print the set of nouns:

In [20]:
# Create a list that will contain all nouns
nouns_in_dreams = []

# Iterate over the `.txt` files in the directory and perform POS tagging on each of them
for filename in glob.glob("../data/Dreams/*.txt"): 
    tagged_tokens = tag_tokens_file(filename)
        
    # Get all (proper) nouns in the text ("NN" and "NNP") and add them to the list
    for token, pos in tagged_tokens:
        if pos in ["NN", "NNP"]:
            nouns_in_dreams.append(token)

# Print the set of nouns in all dreams
print(set(nouns_in_dreams))


{'Hey', 'Toys', 'head', 'kind', 'My', 'Sally', 'necklace', 'drink', 'Morrison', 'computer', 'pool', 'Charlotte', 'back', 'Fine', 'mother', 'university', 'middle', 'friend', 'song', 'Hop', 'house', 'dad', 'diamond', 'Doug', 'silk', 'Jeb', 'thing', 'stereo', 'Mark', 'Jim', 'picture', 'velvet', 'water', 'hand', 'soda', 'boss', 'lady', 'everything', 'school', 'bag', 'counter', 'downtown', 'Allison', 'cookie', 'presence', 'Vickie', 'stop', 'sister', 'sign', 'ball', 'door', 'cowboy', 'family', 'gun', 'Finally', 'office', 'girl', 'Ð', 'glitter', 'dress', 'Bess', 'mom', 'store', 'Us', 'dream', 'top', 'bar', 'life', 'father', 'god', 'castle', 'neck', 'air', 'truck', 'wire', 'help', 'fur', 'look', 'cigarette', 'bus', 'bunch', 'man', 'ghost', 'grocery', 'party', 'room', 'Wendy', 'spider', 'Mom', 'brother', 'horse', 'Nancy', 'Valerie', 'place', 'ride', 'bedroom', 'Bonnie', 'Hi', 'money', 'housing', 'neighbor', 'hospital', 'lock', 'someplace', 'queen', 'rectangle', 'Can', 'lace', 'R', 'Mary', 'gate

Now we have an idea what Vickie dreams about!
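Because `set()` discards how often each noun occurs, a frequency count can give a sharper picture. The following is a minimal sketch using `collections.Counter` from the standard library on a short hand-written list; with the real data you would pass `nouns_in_dreams` instead of `sample_nouns`.

```python
from collections import Counter

# Hand-written sample list (not the actual contents of nouns_in_dreams)
sample_nouns = ['cookie', 'horse', 'cookie', 'mom', 'horse', 'cookie']

# Count how often each noun occurs
noun_counts = Counter(sample_nouns)

# Show the two most frequent nouns with their counts
print(noun_counts.most_common(2))
```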


# Exercises

**Exercise 1:** 

Try to collect all the present participles in the text stored in `../data/charlie.txt` using the NLTK tokenizer and POS-tagger. 

In [19]:
# your code here

You should get the following list: 
`['boiling', 'bubbling', 'hissing', 'sizzling', 'clanking', 'running', 'hopping', 'knowing', 'rubbing', 'cackling', 'going']`

In [20]:
# we can test our code using the assert statement (don't worry about this now, 
# but if you want to use it, you can probably figure out how it works yourself :-) 
# If our code is correct, we should get a compliment :-)
assert len(present_participles) == 11 and type(present_participles[0]) == str
print("Well done!")

AssertionError: 
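An `AssertionError` like the one above appears whenever the condition after `assert` is false, for example while `present_participles` still holds the 14-item list from section 2.1. The sketch below shows both outcomes on a made-up list; the `try`/`except` is only there so that the example keeps running after the failed assertion.

```python
example_list = ['boiling', 'bubbling']

# The condition is True, so nothing happens and the program continues
assert len(example_list) == 2

# The condition is False, so Python raises an AssertionError;
# we catch it here only so that the example keeps running
try:
    assert len(example_list) == 11, "expected 11 items"
    assertion_failed = False
except AssertionError as error:
    assertion_failed = True
    print("AssertionError:", error)
```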

**Exercise 2:** 

The resulting list `verb_lemmas` above contains a lot of duplicates. Do you remember how you can get rid of these duplicates? Create a set in which each verb occurs only once and name it `unique_verbs`. Then print it.

In [None]:
## the list is stored under the variable 'verb_lemmas'

# your code here

In [None]:
# Test your code here! If your code is correct, you should get a compliment :-)
assert len(unique_verbs) == 28    
print("Well done!")

**Exercise 3:** 

Now use a for-loop to count the number of times that each of these verb lemmas occurs in the text! For each verb in the list you just created, get the count of this verb in `charlie.txt` using the `count()` method. Create a dictionary that contains the lemmas of the verbs as keys, and the counts of these verbs as values. Refer to the notebook about Topic 1 if you forgot how to use the `count()` method or how to create dictionary entries!

Tip: you don't need to read in the file again, you can just use the list called verb_lemmas.

In [None]:
verb_counts = {}

# Finish this for-loop
for verb in unique_verbs:
    # your code here

print(verb_counts) 

In [None]:
# Test your code here! If your code is correct, you should get a compliment :-)
assert len(verb_counts) == 28 and verb_counts["bubble"] == 1 and verb_counts["be"] == 9
print("Well done!")

**Exercise 4:**
    
Write your counts to a file called `charlie_verb_counts.txt` in the `../data` directory, using the following format:

verb, count

verb, count 

...

Don't forget to use newline characters at the end of each line. 