<a href="https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing Demystified | Preprocessing
https://nlpdemystified.org<br>
https://github.com/futuremojo/nlp-demystified

### spaCy upgrade and package installation.

At the time this notebook was created, spaCy had newer releases but Colab was still using version 2.x by default. So the first step is to upgrade spaCy.
<br><br>
**IMPORTANT**<br>
If you're running this in the cloud rather than using a local Jupyter server on your machine, then the notebook will **time out** after a period of inactivity. If that happens and you don't reconnect in time, you will need to upgrade spaCy again and reinstall the requisite statistical packages.
<br><br>
Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:<br>
https://research.google.com/colaboratory/local-runtimes.html

In [None]:
!pip install -U spacy==3.* 

In [None]:
!python -m spacy info

In [None]:
import spacy

After importing spaCy, the next thing we need to do is load a suitable statistical model for our project. spaCy offers a variety of models for different languages. These models help with tokenization, part-of-speech tagging, named entity recognition, and more.

Here, we're loading the **en_core_web_sm** model which is the smallest English model spaCy offers and is a good starting point for NLP tasks.<br>
https://spacy.io/models/en#en_core_web_sm

Since we upgraded spaCy, we'll need to download the statistical model as well.

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
nlp = spacy.load('en_core_web_sm')

**en_core_web_sm** is trained on OntoNotes 5 which is an annotated corpus comprising news, blogs, transcripts, etc. Put simply, this means a bunch of documents were labelled with information such as how each sentence should be parsed, whether a particular word is a noun or adjective or other part-of-speech, whether a word is a special entity like a person or a real-world organization, and other language-related labels. A statistical model was then generated from these labelled documents.<br>
https://catalog.ldc.upenn.edu/LDC2013T19
<br><br>
You can learn more about the available spaCy models at these links:<br>
https://spacy.io/models<br>
https://spacy.io/usage/models

After loading the model, the _nlp_ variable now references a **Language** class instance which contains language-specific rules for various tasks (e.g. tokenization) and a processing pipeline.<br>
https://spacy.io/api/language

In [None]:
type(nlp) 

# Tokenization

Course module for this demo:
https://www.nlpdemystified.org/course/tokenization


### Tokenization with spaCy

We pass whatever text we want to process to _nlp_, which returns a **Doc** container object containing the tokenized text and a number of annotations for each token. These annotations are discussed in follow-up videos. You can learn more about the **Doc** object here:<br>
https://spacy.io/api/doc

In [None]:
# Sample sentence.
s = "He didn't want to pay $20 for this book."
doc = nlp(s)

We can iterate over this **Doc** object and view the tokens.

In [None]:
print([t.text for t in doc])

Note how
- "didn't" is separated into "did"  and "n't".
- the currency symbol and amount are separated.
- the period at the end of the sentence is its own token.

The **Doc** object can be indexed and sliced like a regular list. The **Doc** object contains **Token** and **Span** objects, which offer different views into the text.

In [None]:
# We can view an individual token by indexing into the Doc object.
print(doc[0])

In [None]:
# A Doc object is a container of other objects, namely Token and Span objects.
print(type(doc[0]))

In [None]:
# Slicing a Doc object returns a Span object.
print(doc[0:3])
print(type(doc[0:3]))

In [None]:
# Access a token's index within the Doc (not within its sentence).
print([(t.text, t.i) for t in doc])

spaCy's tokenization is _non-destructive_, which means the original input can be reconstructed from the tokens.

In [None]:
# You can view the original input like so:
print(doc.text)
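
To make the reconstruction concrete, here is a small sketch that rebuilds the input from the tokens themselves using each token's *text_with_ws* attribute. It uses a blank English pipeline (tokenizer only) under separate variable names (`tokenizer_only`, `text`, `tokens_doc`), which are chosen for illustration so the notebook's _nlp_ and _doc_ variables are left untouched.

```python
import spacy

# A blank English pipeline is enough here since only the tokenizer is needed.
tokenizer_only = spacy.blank('en')
text = "He didn't want to pay $20 for this book."
tokens_doc = tokenizer_only(text)

# text_with_ws is the token text plus any whitespace that followed it
# in the original input.
rebuilt = ''.join(t.text_with_ws for t in tokens_doc)
print(rebuilt == text)
```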

You can learn more about the **Token** and **Span** objects here:<br>
https://spacy.io/api/token<br>
https://spacy.io/api/span


We can also tokenize multiple sentences and access each sentence individually using the **Doc** object's _sents_ property.

In [None]:
s = """Either the well was very deep, or she fell very slowly, for she 
had plenty of time as she went down to look about her and to wonder what 
was going to happen next. First, she tried to look down and make out what 
she was coming to, but it was too dark to see anything; then she looked at 
the sides of the well, and noticed that they were filled with cupboards and 
book-shelves; here and there she saw maps and pictures hung upon pegs."""

doc = nlp(s)

# Look at individual sentences (there should be two 'Span' objects).
print([sent for sent in doc.sents])

### Tokenization Exercises

In [None]:
#
# EXERCISE:
# 1) Tokenize the following text
# 2) Iterate through the tokens to check whether there's a currency symbol.
# 3) If there is, and the currency symbol is followed by a number, print
#    both the symbol and the number.
# 
# Look through https://spacy.io/api/token#attributes on how to check whether
# a token is a currency symbol or a number.
#
# Expected output: "$20".
s = "He didn't want to pay $20 for this book."
doc = nlp(s)

In [None]:
#
# EXERCISE: Learn how the spaCy tokenizer works and how to customize it:
# https://spacy.io/usage/linguistic-features#tokenization
#

In [None]:
#
# EXERCISE: Read through spaCy-101 and if you're interested, check out their course
# on spaCy itself (link on the page).
# https://spacy.io/usage/spacy-101
#

In [None]:
#
# EXERCISE: Look up how to tokenize the sentence below using NLTK. The imports 
# are done for you. Does the NLTK tokenizer handle "N.Y.C." correctly?
#
import nltk
from nltk.tokenize import TreebankWordTokenizer
s = "Let's go to N.Y.C. for the weekend."

**NOTE**: Different tokenizers will give subtly different results based on the rules they use. Experiment with different tokenizers and use the one best suited for your project.
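
As a rough illustration of how much the rules matter, the sketch below compares naive whitespace splitting with spaCy's tokenizer on the sample sentence from earlier. The variable names (`tokenizer_only`, `text`) are illustrative, and a blank English pipeline is assumed to be sufficient because only tokenization is compared.

```python
import spacy

# Only tokenization is compared, so a blank English pipeline (tokenizer only)
# is used instead of the full statistical model.
tokenizer_only = spacy.blank('en')
text = "He didn't want to pay $20 for this book."

# Naive approach: split on whitespace.
whitespace_tokens = text.split()

# spaCy's rule-based tokenizer.
spacy_tokens = [t.text for t in tokenizer_only(text)]

print(whitespace_tokens)
print(spacy_tokens)
```

Whitespace splitting leaves "$20" and "book." glued together, while spaCy separates the currency symbol and the final period into their own tokens.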

# Basic Preprocessing
## Case-Folding, Stop Word Removal, Stemming, and Lemmatization.

Course module for this demo:
https://www.nlpdemystified.org/course/basic-preprocessing

**NOTE: If the notebook timed out, you may need to re-upgrade spaCy and re-install the language model as follows:**


In [None]:
!pip install -U spacy==3.*
!python -m spacy download en_core_web_sm
!python -m spacy info

spaCy performs all these preprocessing steps (except stemming) behind the scenes for you. In line with its non-destructive policy, the tokens aren't modified directly. Rather, each **Token** object has a number of attributes which can help you get views of your document with these preprocessing steps applied. The attributes a **Token** has can be found here:<br>
https://spacy.io/api/token#attributes
<br><br>
More information about spaCy's processing pipeline:<br>
https://spacy.io/usage/processing-pipelines

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
s = "He told Dr. Lovato that he was done with the tests and would post the results shortly."
doc = nlp(s)

### Case-Folding

View your document with case-folding using the *lower_* attribute.

In [None]:
print([t.lower_ for t in doc])

You can also apply conditions when generating these views. For example, we can skip case-folding if a token is the start of a sentence.

In [None]:
print([t.lower_ if not t.is_sent_start else t.text for t in doc])

### Stop Word Removal

spaCy comes with a default stop word list. To view your document with stop words removed, you can use the *is_stop* attribute.

In [None]:
# spaCy's default stop word list.
print(nlp.Defaults.stop_words)
print(len(nlp.Defaults.stop_words))

In [None]:
print([t for t in doc if not t.is_stop])

### Lemmatization

It's similar with lemmatization. You can view your document with lemmatization applied through the *lemma_* attribute.

In [None]:
[(t.text, t.lemma_) for t in doc]

### Basic Preprocessing Exercises

spaCy doesn't support stemming natively. But for completeness, we can stem using **NLTK**. Specifically, we can use the *Snowball stemmer* which is an improved version of the *Porter stemmer*.

In [None]:
#
# EXERCISE: Find out how to initialize the SnowballStemmer, then tokenize
# and stem the sentence below.
#
from nltk.stem.snowball import SnowballStemmer
s = 'He told Dr. Lovato that he was done with the tests and would post the results shortly.'

# Initialize the stemmer here.


# Tokenize, stem, and print the tokens.


In [None]:
#
# EXERCISE: Find out how to add and remove your own stop words in spaCy. Add the 
# word 'told' as a stop word, test that it works, then remove it from 
# the stop word list.
#

In [None]:
#
# EXERCISE: Read up on how to add your own custom attributes to Token objects
# and try adding one of your own.
# https://spacy.io/usage/processing-pipelines#custom-components-attributes
#

# Advanced Preprocessing

## Part-of-Speech Tagging, Named Entity Recognition, and Parsing.

Course module for this demo:
https://www.nlpdemystified.org/course/advanced-preprocessing

**NOTE: If the notebook timed out, you may need to re-upgrade spaCy and re-install the language model as follows:**


In [None]:
!pip install -U spacy==3.*
!python -m spacy download en_core_web_sm
!python -m spacy info

spaCy performs Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and parsing as part of its default pipeline in the *nlp* object.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
s = "John watched an old movie at the cinema."
doc = nlp(s)

### Part-of-Speech Tagging

POS tags can be accessed through the *pos_* attribute.

In [None]:
[(t.text, t.pos_) for t in doc]

To get a description for a POS tag, we can use _spacy.explain_.

In [None]:
spacy.explain('PROPN')

The POS tags above are called *coarse-grained* tags. You can also access *fine-grained* tags through the *tag_* attribute. Fine-grained tags provide more detailed information about a token such as its tense and, if a word is a pronoun, what specific type of pronoun it is.

In [None]:
[(t.text, t.tag_) for t in doc]

So **NNP** refers specifically to a _singular proper noun_, and **VBD** is a verb in *past tense*.

In [None]:
print(spacy.explain('NNP'))
print(spacy.explain('VBD'))

### Named Entity Recognition

There are multiple ways to access named entities. One way is through the *ent_type_* attribute.


In [None]:
s = "Volkswagen is developing an electric sedan which could potentially come to America next fall."
doc = nlp(s)

[(t.text, t.ent_type_) for t in doc]

You can view spaCy's named entities annotations here:<br>
https://spacy.io/api/annotation#named-entities

or use _spacy.explain_.

In [None]:
spacy.explain('GPE')

You can also check if a token is an entity before printing it by checking whether the _ent_type_ (note the lack of trailing underscore) attribute is non-zero.

In [None]:
print([(t.text, t.ent_type_) for t in doc if t.ent_type != 0])

Another way is through the _ents_ property of the **Doc** object. Here, we iterate through _ents_ and print the entity itself and its label.

In [None]:
print([(ent.text, ent.label_) for ent in doc.ents])

Note how "next fall" is output above as a single span when you use _ents_.
<br><br>
You can also access the positions of entities:

In [None]:
print([(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents])

spaCy is bundled with visualizers for both parsing and named entities.<br>
https://spacy.io/usage/visualizers
<br><br>
Here, we visualize the entities in our sample sentence.

In [None]:
from spacy import displacy

# We need to set the 'jupyter' argument to True in order to output
# the visualization directly. Otherwise, you'll get raw HTML.
displacy.render(doc, style='ent', jupyter=True)

For domain-specific corpora, an NER tagger may need to be further fine-tuned. Here, we may want _The Martian_ tagged as a "FILM" (assuming that's our goal).

In [None]:
s = "Ridley Scott directed The Martian."
doc = nlp(s)
displacy.render(doc, style='ent', jupyter=True)

### Parsing

Let's first visualize a parse to make it easier to follow.

In [None]:
s = "She enrolled in the course at the university."
doc = nlp(s)

# Note the 'style' argument is assigned a 'dep' flag this time around.
displacy.render(doc, style='dep', jupyter=True)

The visualization above is for a dependency parse (spaCy doesn't come with a constituency parser). For each pair of dependencies, spaCy visualizes the child (pointed to), the head (pointed from), and their relationship (the labelled arc). You can view the dependency annotations here:<br>
https://spacy.io/api/annotation#dependency-parsing

You can also use *spacy.explain* to get information on a particular annotation.

In [None]:
spacy.explain('nsubj')

The dependency labels themselves can be accessed through the *dep_* attribute.

In [None]:
[(t.text, t.dep_) for t in doc]

Note how the word 'enrolled' is the _ROOT_.
<br><br>
But the labels above don't show how the words are related to each other (the arcs). To get a better idea, you can print the head of each dependency.

In [None]:
[(t.text, t.dep_, t.head.text) for t in doc]

### Using spaCy's Matcher to find patterns
spaCy comes with a host of pattern-matching functionality. Beyond regex, spaCy can match on a variety of attributes such as POS tags, entity labels, lemmas, dependencies, entire phrases, and a lot more. You can learn more here:<br>
https://spacy.io/usage/rule-based-matching<br>
https://explosion.ai/demos/matcher
<br><br>
Here, we try to search for patterns that may be useful for a hospitality bot.

In [None]:
# The general Matcher is one of multiple matcher objects
# included with spaCy.
from spacy.matcher import Matcher

# We initialize the Matcher with the spaCy vocab object, which stores
# the vocabulary and lexical attributes shared across documents.
matcher = Matcher(nlp.vocab)

s = "I want to book a hotel room."
doc = nlp(s)

# Patterns are expressed as an ordered sequence. Here, we're looking
# to match occurrences starting with a 'book' string followed by
# a determiner (DET) POS tag, then a noun POS tag.
# The OP key is a quantifier controlling how many times a token pattern may match.

# Here, the DET POS (marked with '?') will match 0 or 1 times, and
# the NOUN POS (marked with '+') will match 1 or more times.
# See this link for more information:
# https://spacy.io/usage/rule-based-matching#quantifiers
pattern = [
  {'TEXT': 'book'},
  {'POS': 'DET', 'OP': '?'},
  {'POS': 'NOUN', 'OP': '+'},
]

# We give our pattern a label and pass it to the matcher.
matcher.add('USER_INTENT', [pattern])

# Run the matcher over the doc.
matches = matcher(doc)

# For each match, the matcher returns a tuple specifying a match id, start, 
# and end of the match.
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

The code above demonstrates the Matcher but is brittle.
- What if "book" is capitalized?
- What if a user types "reserve" instead of "book"?
- How can we match on "hotel room" as a compound noun?
- What if a user types "book a flight and hotel room"?

Can you think of how you would handle these cases?
<br><br>
We could come up with more rules to match different patterns, or perhaps just search for keywords based on POS and entities (e.g. a country), present the user with a list of possible intentions, and let them choose one. Another option is to have several interpretation functions submit answers and select the most likely one based on what was historically accepted most often. We can also ask clarifying questions to narrow things down.
<br><br>
For example, for the last sentence, you could have a function scan through the **Doc** object's *noun_chunks* (phrases that have a noun as their head) and isolate keywords there along with potential conjunctions (e.g. "and").<br>
https://spacy.io/usage/linguistic-features#noun-chunks


In [None]:
doc = nlp("I want to book a flight and hotel room in Berlin.")
for noun_phrase in doc.noun_chunks:
  print("phrase: {}, root head: {}".format(noun_phrase, noun_phrase.root.head))

Using pure rules is a good place to start or prototype (especially if the domain is narrow with a tight set of use cases) but as our requirements get more sophisticated, we'll need to blend in other approaches such as classical models or perhaps deep learning (at the very least, maybe tune existing neural networks). spaCy's models can be updated with more examples to fine-tune predictions.<br>
https://spacy.io/usage/training<br>
<br>
We'll keep learning more approaches as the course progresses.

### Talkin' like Yoda
Languages like English are built around the _subject-verb-object_ pattern. But if you're familiar with Yoda from Star Wars, he famously speaks in an _object-subject-verb pattern_. Using the information in a dependency parse, we can turn basic English sentences into Yoda-speak.

In [None]:
def yodize(s: str):
  doc = nlp(s)
  for t in doc:
    if t.dep_ == "ROOT":

      # Assuming our sentence is of the form subject-verb-object, we take 
      # everything after the root (likely verb) and put it in front, and 
      # likewise take everything before the root, and put it after.
      seq = [doc[t.i + 1: -1].text, doc[0: t.i].text, t.text + '.']
      seq[0] = seq[0].capitalize()
      print(' '.join(seq))

In [None]:
yodize("I will fly to Texas.")

This is OK for simple sentences but starts getting weird with longer, more convoluted sentences. What are some ways you would improve this?

### Advanced Preprocessing Exercises

In [None]:
#
# EXERCISE: Learn how to extend spaCy's NER models. Specifically, how to add new
# entity names and entity types. 
#

In [None]:
#
# EXERCISE: Using doc.ents, identify and print the dates in this sentence.
# Expected output: ['Feb 13th', 'Feb 24th']
#
s = "We'll be in Osaka on Feb 13th and leave on Feb 24th."
doc = nlp(s)



In [None]:
#
# EXERCISE: Read about spaCy's PhraseMatcher
# https://spacy.io/usage/rule-based-matching#phrasematcher
#
# Using the PhraseMatcher, find the start and end index of all occurrences 
# of 'Caesar Augustus' and 'Roman Empire' (case-insensitive).
#
# Expected output: [(0, 2), (15, 17)]
#
from spacy.matcher import PhraseMatcher
s = "Caesar Augustus was the founder of the Roman Principate (the first phase of the Roman Empire)."
doc = nlp(s)


# Additional Reading and Resources

Read through this page to learn more about spaCy's language processing pipeline including what's going on under the hood, how to create custom components, disable certain components (e.g. NER) when they're unneeded, optimization tips, and best practices:<br>
https://spacy.io/usage/processing-pipelines
<br><br>
Take the free and succinct spaCy course (available in multiple languages):<br>
https://course.spacy.io/
