# A mini Mad Libs using tokenization and POS tagging

1. [Get some text to play with](#01)
1. [Tokenize the text](#02)
1. [Tag each token](#03)
1. [Explore the tags](#04)
1. [Using tags for simple analytics](#05)
1. [Using tags for procedural modifications](#06)
1. [Ideas for going further](#07)

<hr>

## <span id="01">Get some text to play with</span>

`Natural Language Toolkit` (usually referred to as `nltk`) is a Python library for processing 'natural' (human) language. We'll use several of the library's features in this code, introducing each one as we go.


In this first section, we will use `nltk` to load a set of texts from the Project Gutenberg **corpus**.

[Project Gutenberg](https://www.gutenberg.org/) hosts more than 60,000 public domain e-books made available for non-commercial use. The `gutenberg` corpus that `nltk` provides is a small selection of those texts.

In [2]:
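# Import Gutenberg corpus.
# If the corpus data is not installed yet, run this once first:
#   import nltk; nltk.download('gutenberg')
from nltk.corpus import gutenberg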

In [3]:
# Look at the file names for the texts that are included in the corpus.
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [4]:
# Store an entire story as raw text in a string variable.
alice_text = gutenberg.raw('carroll-alice.txt')

# Uncomment the next line to print the whole text (it is long).
# print(alice_text)

In [5]:
# Grab just the first paragraph of this text using a character range.
alice_paragraph = alice_text[90:392]

# Print a preview of the paragraph.
print(alice_paragraph)


Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'
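
Hard-coded character positions like `[90:392]` only work for this one file. As a rough alternative, the sketch below splits a text into paragraphs wherever a blank line occurs. The `sample_text` string and the variable names are made up for illustration; in the notebook you would pass `alice_text` instead, and you would still need to check which index holds the paragraph you want.

```python
import re

# A small stand-in for the raw text (hypothetical sample, not the real book).
sample_text = (
    "[Title line]\n"
    "\n"
    "CHAPTER I. Heading\n"
    "\n"
    "First paragraph, line one\n"
    "continues on line two.\n"
    "\n"
    "Second paragraph.\n"
)

# Split wherever one or more blank lines occur, then drop empty pieces.
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", sample_text) if p.strip()]

# In this sample, index 0 is the title and index 1 is the chapter heading.
first_body_paragraph = paragraphs[2]
print(first_body_paragraph)
```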


<hr>
    
## <span id="02">Tokenize the text</span>

**Tokenization** is the process of turning text into a list of individual **tokens**. "A token is the technical name for a sequence of characters...that we want to treat as a group." ([nltk.org](https://www.nltk.org/book/ch05.html))

The most common token is a word (e.g. "cat", "dog"), but tokens can also be symbols like punctuation marks (e.g. "?", "..."), emoticons (":)"), or sub-word units like [clitics](https://en.wikipedia.org/wiki/Clitic) (e.g. "'s", "n't"). We could even tokenize a text based on syllables. For this demo, we will tokenize text using nltk's `word_tokenize`.

More info: https://www.nltk.org/book/ch05.html

### Example invocation

```python
word_tokenize("Alice was beginning to get very tired...")
# Returns ['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', '...']
```
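
To get a feel for what a tokenizer does, here is a deliberately naive one built on a regular expression. It is only a sketch and is not how `word_tokenize` works internally; the function name `simple_tokenize` is made up for this example.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Alice had no pictures, did she?")
print(tokens)
# ['Alice', 'had', 'no', 'pictures', ',', 'did', 'she', '?']
```

This naive version breaks contractions apart at the apostrophe (`"didn't"` becomes `'didn'`, `"'"`, `'t'`), which is one reason to prefer a purpose-built tokenizer.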

In [6]:
# Import the word tokenizer.
from nltk.tokenize import word_tokenize

In [7]:
# Pass the trimmed text into the tokenizer.
alice_paragraph_tokenized = word_tokenize(alice_paragraph)

# Print tokenized text.
print(alice_paragraph_tokenized)

['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', "'and", 'what', 'is', 'the', 'use', 'of', 'a', 'book', ',', "'", 'thought', 'Alice', "'without", 'pictures', 'or', 'conversation', '?', "'"]


<hr>

## <span id="03">Tag each token</span>

A **part-of-speech tagger**, or **POS-tagger**, accepts an ordered list of tokens and attaches a part-of-speech tag to each one. We will use `nltk`'s `pos_tag`, which returns each token paired with its tag in a data structure called a `tuple`. For example: `('Alice', 'NNP')`.

### Example invocation

```python
pos_tag(['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired'])
# Returns [('Alice', 'NNP'), ('was', 'VBD'), ('beginning', 'VBG'), ('to', 'TO'), ('get', 'VB'), ('very', 'RB'), ('tired', 'JJ')]
```
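
Because each item is a two-part tuple, Python can unpack it into two variables in one step. The sketch below uses a short hand-typed list (`tagged_sample`, a made-up name) in place of real tagger output so that it runs without `nltk`.

```python
# A hand-typed stand-in for the output of pos_tag.
tagged_sample = [('Alice', 'NNP'), ('was', 'VBD'), ('beginning', 'VBG')]

# Unpack each (word, tag) pair while looping.
for word, tag in tagged_sample:
    print(f"{word:<12}{tag}")

# Split the pairs into two parallel lists.
words = [word for word, tag in tagged_sample]
tags = [tag for word, tag in tagged_sample]
```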

In [8]:
# Import POS tagger.
from nltk import pos_tag

In [9]:
# We pass our tokenized text into the part-of-speech tagger.
alice_paragraph_tagged = pos_tag(alice_paragraph_tokenized)

# Print a preview of the tagged paragraph.
print(alice_paragraph_tagged)

[('Alice', 'NNP'), ('was', 'VBD'), ('beginning', 'VBG'), ('to', 'TO'), ('get', 'VB'), ('very', 'RB'), ('tired', 'JJ'), ('of', 'IN'), ('sitting', 'VBG'), ('by', 'IN'), ('her', 'PRP$'), ('sister', 'NN'), ('on', 'IN'), ('the', 'DT'), ('bank', 'NN'), (',', ','), ('and', 'CC'), ('of', 'IN'), ('having', 'VBG'), ('nothing', 'NN'), ('to', 'TO'), ('do', 'VB'), (':', ':'), ('once', 'RB'), ('or', 'CC'), ('twice', 'VB'), ('she', 'PRP'), ('had', 'VBD'), ('peeped', 'VBN'), ('into', 'IN'), ('the', 'DT'), ('book', 'NN'), ('her', 'PRP$'), ('sister', 'NN'), ('was', 'VBD'), ('reading', 'VBG'), (',', ','), ('but', 'CC'), ('it', 'PRP'), ('had', 'VBD'), ('no', 'DT'), ('pictures', 'NNS'), ('or', 'CC'), ('conversations', 'NNS'), ('in', 'IN'), ('it', 'PRP'), (',', ','), ("'and", 'VB'), ('what', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('use', 'NN'), ('of', 'IN'), ('a', 'DT'), ('book', 'NN'), (',', ','), ("'", "''"), ('thought', 'JJ'), ('Alice', 'NNP'), ("'without", 'POS'), ('pictures', 'NNS'), ('or', 'CC'), ('conv

In [10]:
# Wait, what do those symbols mean??
from nltk.help import upenn_tagset

# Print a specific tag description.
upenn_tagset('NNP')

# Print all tag descriptions.
# upenn_tagset()

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
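
For quick reference while reading tagged output, it can help to keep a small glossary of the tags that show up most often. The dictionary below is a hand-written sketch covering some of the Penn Treebank tags seen in our paragraph; the wording is paraphrased rather than copied from `upenn_tagset`, and the names `tag_glossary` and `describe` are made up for this example.

```python
# Hand-written glossary of some Penn Treebank tags (paraphrased descriptions).
tag_glossary = {
    'NN': 'noun, singular or mass',
    'NNS': 'noun, plural',
    'NNP': 'proper noun, singular',
    'VB': 'verb, base form',
    'VBD': 'verb, past tense',
    'VBG': 'verb, gerund or present participle',
    'VBN': 'verb, past participle',
    'VBZ': 'verb, third person singular present',
    'JJ': 'adjective',
    'RB': 'adverb',
    'IN': 'preposition or subordinating conjunction',
    'DT': 'determiner',
    'CC': 'coordinating conjunction',
    'PRP': 'personal pronoun',
    'PRP$': 'possessive pronoun',
}

def describe(tag):
    """Look up a tag, with a fallback for tags missing from the glossary."""
    return tag_glossary.get(tag, 'not in this glossary')

for tag in ['NNP', 'VBD', 'PRP$']:
    print(tag, '=', describe(tag))
```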


<hr>

## <span id="04">Explore the tags</span>

In [11]:
# First let's remind ourselves what our data looks like...
print(alice_paragraph_tagged)

[('Alice', 'NNP'), ('was', 'VBD'), ('beginning', 'VBG'), ('to', 'TO'), ('get', 'VB'), ('very', 'RB'), ('tired', 'JJ'), ('of', 'IN'), ('sitting', 'VBG'), ('by', 'IN'), ('her', 'PRP$'), ('sister', 'NN'), ('on', 'IN'), ('the', 'DT'), ('bank', 'NN'), (',', ','), ('and', 'CC'), ('of', 'IN'), ('having', 'VBG'), ('nothing', 'NN'), ('to', 'TO'), ('do', 'VB'), (':', ':'), ('once', 'RB'), ('or', 'CC'), ('twice', 'VB'), ('she', 'PRP'), ('had', 'VBD'), ('peeped', 'VBN'), ('into', 'IN'), ('the', 'DT'), ('book', 'NN'), ('her', 'PRP$'), ('sister', 'NN'), ('was', 'VBD'), ('reading', 'VBG'), (',', ','), ('but', 'CC'), ('it', 'PRP'), ('had', 'VBD'), ('no', 'DT'), ('pictures', 'NNS'), ('or', 'CC'), ('conversations', 'NNS'), ('in', 'IN'), ('it', 'PRP'), (',', ','), ("'and", 'VB'), ('what', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('use', 'NN'), ('of', 'IN'), ('a', 'DT'), ('book', 'NN'), (',', ','), ("'", "''"), ('thought', 'JJ'), ('Alice', 'NNP'), ("'without", 'POS'), ('pictures', 'NNS'), ('or', 'CC'), ('conv

In [12]:
# Get a particular tuple based on its index.
alice_paragraph_tagged[0]

('Alice', 'NNP')

In [13]:
# Get a particular word based on its tuple index.
alice_paragraph_tagged[0][0]

'Alice'

In [14]:
# Get a particular POS tag based on its tuple index.
alice_paragraph_tagged[0][1]

'NNP'

In [15]:
# Get all tuples with a particular POS tag.
for t in alice_paragraph_tagged:
    if t[1] == 'NN':
        print(t)

('sister', 'NN')
('bank', 'NN')
('nothing', 'NN')
('book', 'NN')
('sister', 'NN')
('use', 'NN')
('book', 'NN')
('conversation', 'NN')


In [16]:
# Get all words with a particular POS tag.
for t in alice_paragraph_tagged:
    if t[1] == 'NN':
        print(t[0])

sister
bank
nothing
book
sister
use
book
conversation
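
The loop above can be wrapped in a small reusable function. This is a sketch with a made-up name (`words_with_tag`) and a hand-typed sample list. It matches on the start of the tag, so asking for `'NN'` also returns plural nouns (`NNS`) and proper nouns (`NNP`).

```python
def words_with_tag(tagged, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

# A hand-typed stand-in for tagger output.
tagged_sample = [
    ('her', 'PRP$'), ('sister', 'NN'), ('had', 'VBD'),
    ('no', 'DT'), ('pictures', 'NNS'), ('Alice', 'NNP'),
]

print(words_with_tag(tagged_sample, 'NN'))
# ['sister', 'pictures', 'Alice']
print(words_with_tag(tagged_sample, 'VB'))
# ['had']
```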


<hr>

## <span id="05">Using tags for simple analytics</span>

In [None]:
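# Count how many times each POS tag appears in the paragraph.
from collections import Counter

tag_counts = Counter(t[1] for t in alice_paragraph_tagged)
for tag, count in tag_counts.most_common():
    print(tag, count)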
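
Another simple analytic is to group the words themselves under each tag, which makes it easy to see which words the tagger treated as nouns, verbs, and so on. The sketch below uses a hand-typed sample list; `words_by_tag` is a made-up name.

```python
from collections import defaultdict

# A hand-typed stand-in for tagger output.
tagged_sample = [
    ('Alice', 'NNP'), ('was', 'VBD'), ('tired', 'JJ'), ('of', 'IN'),
    ('her', 'PRP$'), ('sister', 'NN'), ('and', 'CC'), ('the', 'DT'),
    ('book', 'NN'),
]

# Collect every word under its tag, keeping the original order.
words_by_tag = defaultdict(list)
for word, tag in tagged_sample:
    words_by_tag[tag].append(word)

for tag, words in words_by_tag.items():
    print(tag, len(words), words)
```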

<hr>

## <span id="06">Using tags for procedural modifications</span>

In [17]:
# Tuples cannot be modified, so first we want to change
# each tuple into a two-item list. Unlike tuples, lists can be modified.

# Create an empty list to hold all of our newly-formatted data.
alice_paragraph_tagged_list = []

# Go through the original data.
for t in alice_paragraph_tagged:
    # Convert each tuple "t" into a list "l".
    l = list(t)
    # Add that new list "l" to the data.
    alice_paragraph_tagged_list.append(l)

# Print the newly formatted data.
print(alice_paragraph_tagged_list)

[['Alice', 'NNP'], ['was', 'VBD'], ['beginning', 'VBG'], ['to', 'TO'], ['get', 'VB'], ['very', 'RB'], ['tired', 'JJ'], ['of', 'IN'], ['sitting', 'VBG'], ['by', 'IN'], ['her', 'PRP$'], ['sister', 'NN'], ['on', 'IN'], ['the', 'DT'], ['bank', 'NN'], [',', ','], ['and', 'CC'], ['of', 'IN'], ['having', 'VBG'], ['nothing', 'NN'], ['to', 'TO'], ['do', 'VB'], [':', ':'], ['once', 'RB'], ['or', 'CC'], ['twice', 'VB'], ['she', 'PRP'], ['had', 'VBD'], ['peeped', 'VBN'], ['into', 'IN'], ['the', 'DT'], ['book', 'NN'], ['her', 'PRP$'], ['sister', 'NN'], ['was', 'VBD'], ['reading', 'VBG'], [',', ','], ['but', 'CC'], ['it', 'PRP'], ['had', 'VBD'], ['no', 'DT'], ['pictures', 'NNS'], ['or', 'CC'], ['conversations', 'NNS'], ['in', 'IN'], ['it', 'PRP'], [',', ','], ["'and", 'VB'], ['what', 'WP'], ['is', 'VBZ'], ['the', 'DT'], ['use', 'NN'], ['of', 'IN'], ['a', 'DT'], ['book', 'NN'], [',', ','], ["'", "''"], ['thought', 'JJ'], ['Alice', 'NNP'], ["'without", 'POS'], ['pictures', 'NNS'], ['or', 'CC'], ['conv
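
To see why the conversion matters, the sketch below first tries to change a tuple (which fails) and then does the same conversion as above in a single line using a list comprehension. The sample data and variable names are made up for illustration.

```python
# A hand-typed stand-in for tagger output.
tagged_sample = [('Alice', 'NNP'), ('sister', 'NN')]

# Trying to change a tuple in place raises a TypeError.
try:
    tagged_sample[1][0] = 'banjo'
except TypeError as error:
    print('Tuples are read-only:', error)

# Convert every tuple to a list in one line, then modify freely.
editable = [list(pair) for pair in tagged_sample]
editable[1][0] = 'banjo'
print(editable)
# [['Alice', 'NNP'], ['banjo', 'NN']]
```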

In [18]:
# We can look up words based on part of speech just as we did before.
for l in alice_paragraph_tagged_list:
    if l[1] == 'NN':
        print(l[0])

sister
bank
nothing
book
sister
use
book
conversation


In [19]:
# But now we can also modify the content!
for l in alice_paragraph_tagged_list:
    if l[1] == 'NN':
        l[0] = '~~~~~~~'
        
print(alice_paragraph_tagged_list)

[['Alice', 'NNP'], ['was', 'VBD'], ['beginning', 'VBG'], ['to', 'TO'], ['get', 'VB'], ['very', 'RB'], ['tired', 'JJ'], ['of', 'IN'], ['sitting', 'VBG'], ['by', 'IN'], ['her', 'PRP$'], ['~~~~~~~', 'NN'], ['on', 'IN'], ['the', 'DT'], ['~~~~~~~', 'NN'], [',', ','], ['and', 'CC'], ['of', 'IN'], ['having', 'VBG'], ['~~~~~~~', 'NN'], ['to', 'TO'], ['do', 'VB'], [':', ':'], ['once', 'RB'], ['or', 'CC'], ['twice', 'VB'], ['she', 'PRP'], ['had', 'VBD'], ['peeped', 'VBN'], ['into', 'IN'], ['the', 'DT'], ['~~~~~~~', 'NN'], ['her', 'PRP$'], ['~~~~~~~', 'NN'], ['was', 'VBD'], ['reading', 'VBG'], [',', ','], ['but', 'CC'], ['it', 'PRP'], ['had', 'VBD'], ['no', 'DT'], ['pictures', 'NNS'], ['or', 'CC'], ['conversations', 'NNS'], ['in', 'IN'], ['it', 'PRP'], [',', ','], ["'and", 'VB'], ['what', 'WP'], ['is', 'VBZ'], ['the', 'DT'], ['~~~~~~~', 'NN'], ['of', 'IN'], ['a', 'DT'], ['~~~~~~~', 'NN'], [',', ','], ["'", "''"], ['thought', 'JJ'], ['Alice', 'NNP'], ["'without", 'POS'], ['pictures', 'NNS'], ['or'

In [20]:
# Let's define our own random set of nouns.
NNs = [
    "jar of pickles",
    "cotton candy",
    "frisbee",
    "homework",
    "kitchen knife",
    "baked potato",
    "cellphone",
    "harmonica",
    "banjo",
    "drama",
    "office",
    "desk",
    "celebration",
    "wife"
]

In [21]:
# Replace all noun slots with a random noun from our custom list.
# choice() can pick any item in the list, not only the first five.
from random import choice

for l in alice_paragraph_tagged_list:
    if l[1] == 'NN':
        replacement = choice(NNs)
        l[0] = replacement

print(alice_paragraph_tagged_list)

[['Alice', 'NNP'], ['was', 'VBD'], ['beginning', 'VBG'], ['to', 'TO'], ['get', 'VB'], ['very', 'RB'], ['tired', 'JJ'], ['of', 'IN'], ['sitting', 'VBG'], ['by', 'IN'], ['her', 'PRP$'], ['homework', 'NN'], ['on', 'IN'], ['the', 'DT'], ['kitchen knife', 'NN'], [',', ','], ['and', 'CC'], ['of', 'IN'], ['having', 'VBG'], ['kitchen knife', 'NN'], ['to', 'TO'], ['do', 'VB'], [':', ':'], ['once', 'RB'], ['or', 'CC'], ['twice', 'VB'], ['she', 'PRP'], ['had', 'VBD'], ['peeped', 'VBN'], ['into', 'IN'], ['the', 'DT'], ['jar of pickles', 'NN'], ['her', 'PRP$'], ['jar of pickles', 'NN'], ['was', 'VBD'], ['reading', 'VBG'], [',', ','], ['but', 'CC'], ['it', 'PRP'], ['had', 'VBD'], ['no', 'DT'], ['pictures', 'NNS'], ['or', 'CC'], ['conversations', 'NNS'], ['in', 'IN'], ['it', 'PRP'], [',', ','], ["'and", 'VB'], ['what', 'WP'], ['is', 'VBZ'], ['the', 'DT'], ['cotton candy', 'NN'], ['of', 'IN'], ['a', 'DT'], ['jar of pickles', 'NN'], [',', ','], ["'", "''"], ['thought', 'JJ'], ['Alice', 'NNP'], ["'witho
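
If you want the same "random" result every time you run the cell (handy when sharing a notebook), one option is a seeded random generator. The sketch below also builds a new list instead of changing the original in place. The seed value, the shortened noun list, and the variable names are arbitrary choices for this example.

```python
import random

nouns = ["jar of pickles", "cotton candy", "frisbee", "homework", "banjo"]

# A hand-typed stand-in for the tagged, list-formatted data.
tagged_sample = [
    ['her', 'PRP$'], ['sister', 'NN'], ['on', 'IN'],
    ['the', 'DT'], ['bank', 'NN'],
]

# A seeded generator makes the picks repeatable between runs.
rng = random.Random(42)

# Swap in a random noun for every 'NN' slot; copy everything else unchanged.
madlib = [
    [rng.choice(nouns), tag] if tag == 'NN' else [word, tag]
    for word, tag in tagged_sample
]
print(madlib)
```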

In [22]:
# Join the words back into a single string for reading.
output = ''

for l in alice_paragraph_tagged_list:
    output += (' ' + l[0])
    
print(output)

 Alice was beginning to get very tired of sitting by her homework on the kitchen knife , and of having kitchen knife to do : once or twice she had peeped into the jar of pickles her jar of pickles was reading , but it had no pictures or conversations in it , 'and what is the cotton candy of a jar of pickles , ' thought Alice 'without pictures or kitchen knife ? '
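
Notice the stray spaces before the punctuation in the output above. A rough way to tidy this up is to join the words with spaces and then remove the space in front of common punctuation marks. This is only a sketch: the function name `join_tokens` is made up, and it does not try to repair quotation marks.

```python
import re

def join_tokens(words):
    """Join tokens with spaces, then close the gap before punctuation."""
    text = " ".join(words)
    return re.sub(r"\s+([,.:;?!])", r"\1", text)

sentence = join_tokens(['Alice', 'was', 'tired', ',', 'and', 'bored', '.'])
print(sentence)
# Alice was tired, and bored.
```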


## <span id="07">Ideas for going further</span>

This is clearly very silly. But it lays the groundwork for quite a few interesting possibilities. You could...

- Make global replacements so that all instances of X get replaced with Y
- Mix multiple texts by replacing nouns in one text with nouns from another
- Re-make [Oulipo](https://en.wikipedia.org/wiki/Oulipo)'s famous S+7 procedure by using an external dictionary
- Replace all adjectives with their antonyms
- Start to think about article matching
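
As a starting point for the first idea in the list, here is a sketch of a global replacement: every distinct noun is assigned one substitute, and that substitute is reused each time the noun comes back. The function name, its parameters, and the sample data are all made up for this example.

```python
import random

def replace_consistently(tagged, replacements, target_tag='NN', seed=0):
    """Replace each distinct word with the target tag by one fixed substitute."""
    rng = random.Random(seed)
    mapping = {}  # original word -> chosen substitute
    result = []
    for word, tag in tagged:
        if tag == target_tag:
            # Pick a substitute the first time we meet this word.
            if word not in mapping:
                mapping[word] = rng.choice(replacements)
            word = mapping[word]
        result.append((word, tag))
    return result, mapping

# A hand-typed stand-in for tagger output.
tagged_sample = [
    ('the', 'DT'), ('book', 'NN'), ('her', 'PRP$'),
    ('sister', 'NN'), ('a', 'DT'), ('book', 'NN'),
]
nouns = ["frisbee", "harmonica", "banjo", "desk"]

replaced, mapping = replace_consistently(tagged_sample, nouns)
print(replaced)
print(mapping)
```

Both occurrences of `book` end up with the same substitute, although two different nouns may still happen to draw the same word from the list.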

...