# Module 5: Natural Language Processing With spaCy

## Table of Contents
<br>
<a href="#Module 5: Natural Language Processing With spaCy"><font size="+1">Module 5: Natural Language Processing With spaCy
</font></a>
<ol>
  <li>What is spaCy?</li>
  <li>Spacy - Tokenisation</li>
  <li>Spacy - Checks</li>
  <li>Rule-Based Matching using spaCy</li>
  <li>Spacy - Stopwords</li>
  <li>Spacy - Remove punctuation</li>
  <li>Spacy - Remove Numbers</li>
  <li>Spacy - Sentence Detection</li>
  <li>Spacy - Lemmatization</li>
  <li>Spacy - Part of Speech Tagging </li>
  <li>Spacy - Shallow Parsing</li>
  <li>Spacy - Named Entity Recognition</li>
  <li>spaCy - Dependency Parsing</li>
  <li>Processing Pipeline</li>
</ol>

The code below uses the patents dataset to demonstrate how to undertake key NLP tasks using spaCy. 

**Learning Outcomes:** 

Perform the following operations on text using the spaCy library:


* Execute tokenisation
* Remove stopwords
* Remove punctuation
* Remove numbers
* Identify sentences
* Execute Lemmatization
* Execute Part of speech tagging
* Execute Shallow Parsing (chunking)
* Execute Named Entity Recognition
* Execute Dependency Parsing

Additionally, you should be able to:

* Provide a brief description of the spaCy library
<br>


## 5.1 What is spaCy?
<br>

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

spaCy is designed specifically for **production use** and helps you build applications that process and “understand” large volumes of text. It can be used to build **information extraction** or natural language understanding systems, or to **pre-process** text for deep learning.

spaCy provides a variety of linguistic annotations to give you insights into a text’s grammatical structure. This includes the word types, like the parts of speech, and how the words are related to each other. For example, if you’re analyzing text, it makes a huge difference whether a noun is the subject of a sentence, or the object – or whether “google” is used as a verb, or refers to the website or company in a specific context. (from official spaCy documentation - link within the references section)
<br>
<br>

<img src="../pics/spacypipeline.svg" alt="Summary of NLP pipeline in spaCy" width=650>


<br>


When you call `nlp` on a text, spaCy first tokenizes the text to produce a **Doc** object. The Doc is then processed in several different steps, which together are referred to as the **processing pipeline**. The pipeline used by the trained English models includes a tagger, a parser and an entity recognizer, and version 3 models such as the one used in this course add further components such as a lemmatizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.


### Difference between NLTK and spaCy

We have until this point used `nltk` primarily, whereas in reality we will want to use the NLP toolkit that is most relevant for our task.

`nltk` was originally a tool for academic research. It contains a wide range of algorithms that can be combined, customised and used with great flexibility. In this way using `nltk` is similar to starting from scratch. A greater degree of customisation is achievable.

`spaCy` on the other hand is production oriented. Rather than giving a wide range of options, `spaCy` gives one, sometimes two, really good ways to solve an NLP problem.

The choice between `nltk` and `spaCy` depends largely on what the project/analytical requirements are.

* If developing new approaches, experimenting with techniques or learning about specific methods, `nltk` is likely your best bet.
* If building an application for a well-known task, wanting to focus on delivery, or just looking for high performance out of the box, then `spaCy` can be really useful.

`spaCy` tokenizes the text and runs the other processing steps in its pipeline by default, rather than requiring you to specify each approach and build complicated processing pipelines yourself.


In [None]:
# Import libraries
import spacy
import pandas as pd
from spacy.matcher import Matcher
from spacy import displacy

Let's load our practice data the same way we have in previous modules.

In [None]:
# Read in dataframe to process the abstract column
patents = pd.read_pickle('../data/Patent_Dataset.pkl')

# Fix index
patents = patents.reset_index(drop=True)

patents.head()

In this course we will use a model stored within the folders. This model has been selected as it is basic and small yet an effective learning tool. To find more powerful models please see [the spaCy documentation](https://spacy.io/usage/models).

You are welcome to install your own models to play around with, however, we will assume that you are using the local model loaded below. This model requires the version of `spacy` outlined in the pre-course instructions file.

If you get an error when loading the model below, it is likely that the directory structure differs from the one intended, or that the installed version of `spacy` is not the one the model requires.

In [None]:
# Load the language model instance in spaCy locally
nlp = spacy.load('../local_packages/spacy_local/small_practice_model/en_core_web_sm-3.1.0')

# Create a Doc object by calling nlp on the text we want to analyse
doc = nlp("Balo walked to school. She met several friends at the school gate")

In [None]:
# We can view the active pipeline components
# these will be explored later
# they tell us what we can do with our text in this pipeline
nlp.pipe_names

## 5.2 Spacy - Tokenisation

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. 

In [None]:
# helper function to apply spacy processing to our data frame
def token_spacy(pdoc):
    pdoc = nlp(pdoc)
    return [token.text for token in pdoc]

In [None]:
patents['token_spacy'] = patents['abstract'].apply(token_spacy)

In [None]:
patents['token_spacy'][0]
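
Calling `nlp` once per row through `apply` works well for small frames. For larger collections, `nlp.pipe` processes texts in batches. The block below is a minimal sketch of that idea: it assumes a blank English pipeline (tokenizer only) and two made-up texts in place of the course model and the patents data, so it runs without either.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) stands in for the course model
nlp = spacy.blank("en")

# Hypothetical stand-ins for the values of patents['abstract']
texts = [
    "A method for cooling a battery pack.",
    "The device comprises two sensors, one valve and a pump.",
]

# nlp.pipe yields one Doc per input text and processes the texts in batches
tokens_per_text = [
    [token.text for token in doc]
    for doc in nlp.pipe(texts, batch_size=50)
]

for tokens in tokens_per_text:
    print(tokens)
```

With the course data, the same list comprehension over `nlp.pipe(patents['abstract'])` should give the values for a `token_spacy` column, in row order.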

## 5.3 Spacy - Checks

`spacy` comes with a variety of checks, allowing us to find out characteristics of the text we are looking at. These often return boolean values.

<ul>
    <li><b>idx</b> character offset at which the token starts within the document text, starting with 0</li>
    <li><b>text_with_ws</b> prints token text with trailing space (if present)</li>
    <li><b>is_alpha</b> detects if the token consists of alphabetic characters or not</li>
    <li><b>is_punct</b> detects if the token is a punctuation symbol or not</li>
    <li><b>is_space</b> detects if the token is a space or not.</li>
    <li><b>shape_</b> prints out the orthographic shape of the word (X for an uppercase letter, x for a lowercase letter, d for a digit)</li>
    <li><b>is_stop</b> detects if the token is a stop word or not.</li>
</ul>

In [None]:
# inspect the properties of each token
text = nlp("There is a green hill far away. It is in a land I heard in a lullaby")

for token in text[:3]:
    
    print(token, # original token
          token.idx, # character offset of the token within the document text
          token.text_with_ws, # original token with whitespace included
          token.is_alpha, # boolean - is token alphabetical?
          token.is_punct, # boolean - is token punctuation?
          token.is_space, # boolean - is token a space (or multiple)?
          token.shape_, # orthographic shape: X = uppercase, x = lowercase, d = digit
          token.is_stop, # boolean - whether token is a stopword
          "\n")


For more information on the attributes and methods that can be used by a token object see the [spacy documentation](https://spacy.io/api/token#attributes).
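
As a small illustration of `idx` and `shape_`, the sketch below slices the document text at each token's `idx` to recover the token. It assumes a blank English pipeline, because these attributes come from the tokenizer and the vocabulary and do not need a trained model; the sentence is made up for the example.

```python
import spacy

# Assumption: tokenizer-only pipeline, no trained model needed for these attributes
nlp = spacy.blank("en")
doc = nlp("There is a green hill, built in 2019.")

for token in doc:
    # idx is a character offset into doc.text, so slicing the text recovers the token
    recovered = doc.text[token.idx : token.idx + len(token)]
    print(
        f"{token.text!r:<10} idx={token.idx:<3} shape={token.shape_:<6} "
        f"alpha={token.is_alpha} punct={token.is_punct} same={recovered == token.text}"
    )
```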

## 5.4 Rule-Based Matching using spaCy

Rule-based matching finds words and phrases in the text using user-defined rules. It is similar to regular expressions, but the rules operate on tokens and their attributes rather than on raw characters.

A pattern is a list of dictionaries, one per token.

The key in each dictionary is the token attribute you would like to match, such as the exact text, the lowercase form or the part of speech. The value in the dictionary is the value that attribute must have.

We wrap each pattern in a list, so we can pass several patterns at once. The result is a list that contains lists, which in turn contain dictionaries.

```{python}
# Match exact token texts
[{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Match lexical attributes
[{"LOWER": "iphone"}, {"LOWER": "x"}]

# Match any token attributes
[{"LEMMA": "buy"}, {"POS": "NOUN"}]
```


spaCy POS tags are shown here: https://spacy.io/api/annotation
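
To illustrate the "list of lists of dictionaries" structure, here is a minimal sketch that registers two patterns under one rule name. It assumes a blank English pipeline, which is enough because `LOWER` and `IS_DIGIT` are lexical attributes; the rule name `IPHONE_MODEL` and the sentence are made up for the example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: no trained model needed for lexical attributes
matcher = Matcher(nlp.vocab)

# One outer list holding two patterns; each pattern is a list of per-token dictionaries
patterns = [
    [{"LOWER": "iphone"}, {"LOWER": "x"}],      # "iPhone X", "iphone x", ...
    [{"LOWER": "iphone"}, {"IS_DIGIT": True}],  # "iPhone 11", "IPHONE 8", ...
]
matcher.add("IPHONE_MODEL", patterns)

doc = nlp("I compared the iPhone X with the IPHONE 11 yesterday.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```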


Below, we are going to create our own rule based matcher.

To do so we are going to create a matcher object, then add in patterns we want to match.

The pattern we want to match is instances of "iPhone X", so we create a pattern saying we need a token whose text is "iPhone", followed by a token whose text is "X".

In [None]:
# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('../local_packages/spacy_local/small_practice_model/en_core_web_sm-3.1.0')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [[{"TEXT": "iPhone"}, {"TEXT": "X"}]]
matcher.add("IPHONE_PATTERN", patterns=pattern)

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

In [None]:
print(matches)

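Each match in the output is a tuple of three elements.

The first element, ‘9528407286733565721’, is the match ID. This is the hash value of the pattern name and identifies which rule produced the match.

The second and third elements are the start and end token positions of the matched span. The end position is exclusive.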

Below we are going to create a new matcher which finds full names from text.

To do this we will define a function that creates the pattern needed, then scans the text, extracting the matches based on the pattern.

In [None]:
from spacy.matcher import Matcher

text = nlp('Gus Proto is a Python developer currently working for a London-based Fintech company. \
            He is interested in learning Natural Language Processing.')



def extract_full_name(nlp_doc):
    
    matcher = Matcher(nlp_doc.vocab)
    
    # Match a proper noun followed by another proper noun
    name_pattern = [[{'POS': 'PROPN'}, {'POS': 'PROPN'}]]
    matcher.add('FULL_NAME', patterns=name_pattern)
    matches = matcher(nlp_doc)
    allmatches = []
    
    for match_id, start, end in matches:
        # slice the doc to only the range we want
        span = nlp_doc[start:end]
        allmatches.append(span)
    
    return allmatches

full_names = extract_full_name(text)

print(full_names)

In [None]:
# Call the matcher on the doc
doc = nlp("Upcoming iPhone X release date leaked")
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    # slice the original text to get the wanted match
    matched_span = doc[start:end]
    print(matched_span.text)

**match_id**: hash value of the pattern name. It is the same for every match produced by that pattern, so it identifies which rule matched rather than the individual match. The original name can be recovered from the hash with `nlp.vocab.strings[match_id]`.

**start**: start index of matched span

**end**: end index of matched span
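
The three elements of a match can be unpacked as in the sketch below, which also looks the rule name up from the hash. It assumes a blank English pipeline and a made-up sentence containing the phrase twice, to show that both matches share one `match_id`.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: exact-text patterns need no trained model
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("The iPhone X replaced my old iPhone X last year")
results = []
for match_id, start, end in matcher(doc):
    # The string store maps the hash back to the name given in matcher.add
    rule_name = nlp.vocab.strings[match_id]
    results.append((match_id, rule_name, start, end))
    print(rule_name, start, end, doc[start:end].text)
```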

### Another example:

Consider the two sentences below:

*You read this book* <br>
*I will book my ticket* <br>
<br>
Both sentences contain the word “book”. 
Here we want to match “book” only when it is used in the sentence as a noun.
<br>


In [None]:
doc1 = nlp("You read this book")
doc2 = nlp("I will book my ticket")

# book which is a noun
pattern = [[{'TEXT': 'book', 'POS': 'NOUN'}]]

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
# we need to add rules to the matcher
# within the matcher object the arguments are:
# "rule_2" - name of match, an ID
# patterns - list of patterns
matcher.add('rule_2', patterns=pattern)


In [None]:
# find matches within document 1
matches = matcher(doc1)
matches

In [None]:
# select the match
doc1[matches[0][1]:matches[0][2]]

In [None]:
# find matches in document 2
matches = matcher(doc2)
matches

In [None]:
# Show parts of speech for each token
for token in doc2:
    print(token.text, token.pos_)

In the first sentence above, “book” has been used as a noun and in the second sentence, it has been used as a verb. So, the spaCy matcher should be able to extract the pattern from the first sentence only. 

Below we will create a more complex pattern, searching for adjective-noun pairs with an optional additional noun at the end. We can customize our patterns and make them as complex as we like to capture the phenomena we want.

In [None]:
nlp = spacy.load('../local_packages/spacy_local/small_practice_model/en_core_web_sm-3.1.0')
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
    )

# Write a pattern for adjective plus one or two nouns
pattern = [[{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]]


# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", patterns=pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)
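
Because the last token of the pattern is optional, the matcher reports both the shorter and the longer span when both fit. The sketch below shows this and one way to keep only the longest span, using the `greedy="LONGEST"` argument of `Matcher.add` in spaCy v3. It assumes a blank English pipeline, so `IS_ALPHA` and `LOWER` stand in for the POS attributes used above, and the rule name `PHRASE` is made up.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: no tagger, so lexical attributes replace POS
doc = nlp("The app has optional voice responses.")

# "optional", then one alphabetic token, then an optional second alphabetic token
pattern = [[{"LOWER": "optional"}, {"IS_ALPHA": True}, {"IS_ALPHA": True, "OP": "?"}]]

# Default behaviour: every span that satisfies the pattern is returned
all_matcher = Matcher(nlp.vocab)
all_matcher.add("PHRASE", pattern)
all_spans = {doc[start:end].text for _, start, end in all_matcher(doc)}

# greedy="LONGEST": overlapping candidates are filtered, keeping the longest
longest_matcher = Matcher(nlp.vocab)
longest_matcher.add("PHRASE", pattern, greedy="LONGEST")
longest_spans = {doc[start:end].text for _, start, end in longest_matcher(doc)}

print("All matches:    ", all_spans)
print("Longest matches:", longest_spans)
```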

## 5.5 Spacy - Stopwords

Stop words are the most common words in a language and carry little semantic meaning on their own.
Most sentences need to contain stop words in order to be full sentences that make sense.
Stop words are often removed because they dominate word frequency counts without telling us much about the content. 
spaCy has a list of stop words for the English language (Singh, 2019).

No stop word list is definitive: which words count as stop words depends on the task. It is therefore important to check which words each package treats as stop words and to adjust the list where needed. Below we look at the default `spacy` stopwords. We will likely want to specify our own stopwords in the future.

If we only use default stopwords we may remove text that is important to our task, or end up keeping unimportant words. [This stackoverflow example for updating a spacy stopword list may be useful](https://stackoverflow.com/a/51627002).

In [None]:
# access built in stopwords
from spacy.lang.en.stop_words import STOP_WORDS

spacy_stopwords = STOP_WORDS
print('Spacy has', len(spacy_stopwords), 'stopwords')

# Display 10 of the stopwords spaCy has (the collection is a set, so the order is arbitrary)
for stop_word in list(spacy_stopwords)[:10]:
    print(stop_word)

In [None]:
# helper function to apply spacy stopword removal to pandas frame
def remove_stopword_spacy(pdoc):
    # takes a string as input, creates a document
    pdoc = nlp(pdoc)
    # iterate through tokens, keep them if they are not stop words
    # join together resulting tokens with a space in between each
    text = " ".join([str(token) for token in pdoc if not token.is_stop])
    return text

In [None]:
# apply stopword removal
patents['preprocess_spacy'] = patents['abstract'].apply(remove_stopword_spacy)

# note our data has been tokenized, then rejoined as text
removed_stopwords_example = patents['preprocess_spacy'][0]
removed_stopwords_example
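
One possible way to adjust the stop words for a task is to set the `is_stop` flag on vocabulary entries. The sketch below is an illustration under assumptions, not the course's method: `set_stop_flag` is a hypothetical helper, the pipeline is a blank English one, and the words chosen ("invention" as a domain stop word, "not" as a word to keep) are examples only.

```python
import spacy

nlp = spacy.blank("en")  # assumption: stop word flags do not need a trained model


def set_stop_flag(nlp, word, is_stop):
    """Hypothetical helper: mark or unmark a word as a stop word in this vocabulary."""
    # The flag is stored per exact spelling, so common case variants are updated too
    for variant in {word.lower(), word.capitalize(), word.upper()}:
        nlp.vocab[variant].is_stop = is_stop


set_stop_flag(nlp, "invention", True)  # treat a frequent domain word as a stop word
set_stop_flag(nlp, "not", False)       # keep negation, which may matter for meaning

doc = nlp("The invention does not leak. Invention is hard.")
kept = [token.text for token in doc if not token.is_stop]
print(kept)
```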

## 5.6 Spacy - Remove punctuation

Below is a function that uses `spacy`'s token characteristics to remove punctuation. Note that we are applying it to the already processed text, so the tokenisation may be different to the previous step.

In [None]:
def punctuation_spacy(pdoc):
    pdoc = nlp(pdoc)
    text = ""
    for token in pdoc:
        if not token.is_punct:
            # update text with next token
            text = text + " " + token.text
    return text

In [None]:
patents['preprocess_spacy'] = patents['preprocess_spacy'].apply(punctuation_spacy)

In [None]:
# example of removed punctuation (and stopwords)
removed_punctuation_example = patents['preprocess_spacy'][0]

print("\t\tBefore punctuation removal:\n", removed_stopwords_example)
print("\t\tAfter punctuation removal:\n", removed_punctuation_example)

## 5.7 Spacy - Remove Numbers

Below we create a new function that keeps only tokens made up entirely of alphabetic characters. As we have already removed punctuation, this step mainly removes numbers, along with any token that mixes letters with digits or symbols.

We therefore are using the properties of the text, and what we know about it to achieve a desired outcome, rather than explicitly programming what we want.

In [None]:
def nonumbers_spacy(pdoc):
    pdoc = nlp(pdoc)
    text = ""
    for token in pdoc:
        # keep only alpha
        if token.is_alpha:
            text = text + " " + token.text
    return text

In [None]:
patents['preprocess_spacy'] = patents['preprocess_spacy'].apply(nonumbers_spacy)

In [None]:
# example of removed numbers (and stopwords, punctuation)
removed_numbers_example = patents['preprocess_spacy'][2]
print("\t\tExample text with numbers:\n", patents['abstract'][2])
print("\t\tExample text with numbers removed:\n", removed_numbers_example)
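
The three cleaning steps above each call `nlp` on the text again. As a sketch of an alternative, the checks can be combined so that each text is processed only once. `clean_text` is a hypothetical helper, and a blank English pipeline is assumed, since `is_stop`, `is_punct` and `is_alpha` do not need a trained model.

```python
import spacy

nlp = spacy.blank("en")  # assumption: lexical flags only, no trained model required


def clean_text(text):
    """Hypothetical helper: drop stop words, punctuation and non-alphabetic tokens in one pass."""
    doc = nlp(text)
    kept = [
        token.text
        for token in doc
        # is_alpha already excludes punctuation; is_punct is kept to mirror the steps above
        if not token.is_stop and not token.is_punct and token.is_alpha
    ]
    return " ".join(kept)


cleaned = clean_text("The 3 pumps, and 2 valves, are sealed!")
print(cleaned)
```

Note that tokenising the original text once can give slightly different results from re-tokenising the already cleaned string at each step.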

## 5.8 Spacy - Sentence Detection
<br>

Sentence Detection is the process of locating the start and end of sentences in a given text. This separates the text into linguistically meaningful units. In spaCy, the **sents** property is used to extract sentences. 

In [None]:
text = nlp("There is a green hill far away. It is in a land I heard in a lullaby")
sentences = list(text.sents)

for sentence in sentences:
    print(sentence, "\n\tType:", type(sentence).__name__)

In [None]:
def sentence_spacy(pdoc):
    pdoc = nlp(pdoc)
    return list(pdoc.sents)

In [None]:
patents['sentence_spacy'] = patents['abstract'].apply(sentence_spacy)

In [None]:
patents['abstract'][1]

In [None]:
# The text has been split into sentences
patents['sentence_spacy'][1]
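
The sentence boundaries above come from the loaded model. When only sentence splitting is needed, spaCy also provides a rule-based `sentencizer` component that splits on sentence-final punctuation. A minimal sketch, assuming a blank English pipeline:

```python
import spacy
from spacy.tokens import Span

# Assumption: blank pipeline with only the rule-based sentencizer added
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("There is a green hill far away. It is in a land I heard in a lullaby")

# Each sentence is a Span, i.e. a slice of the Doc
spans = list(doc.sents)
sentences = [sent.text for sent in spans]

for sentence in sentences:
    print(sentence)
```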

## 5.9 Spacy - Lemmatization


**Lemmatization** is the process of reducing the inflected forms of a word to a common base form, while ensuring that the base form is itself a valid word in the language. This base form is called a **lemma**.

Lemmatization is necessary because it helps you reduce the inflected forms of a word so that they can be analyzed as a single item. It can also help you normalize the text. (Singh, 2019)

Below we will lemmatize the text and rejoin each string together.

In [None]:
def lemmatization_spacy(pdoc):
    
    pdoc =  nlp(pdoc)
    text  = ""
    
    for token in pdoc:
        text = text + " " + str(token.lemma_)
            
    return text

In [None]:
patents['preprocess_spacy'] = patents['preprocess_spacy'].apply(lemmatization_spacy)

In [None]:
# Resulting lemmatized text
patents['preprocess_spacy'][0]

## 5.10 Spacy - Part of Speech Tagging 

Part of speech or POS is a grammatical role that explains how a particular word is used in a sentence.
<br>

**Part of speech tagging** is the process of **assigning a POS tag**  to each token depending on its usage in the sentence. POS tags are useful for assigning a syntactic category like noun or verb to each word.
<br>

In spaCy, POS tags are available as an attribute on the Token object

In [None]:
# in the text below the "\" at the end of a line is a line continuation, which allows us to write one string
# (or general code) across multiple lines; the leading spaces on the continued lines stay inside the string
text = nlp("Algebra can essentially be considered as doing computations\
            similar to those of arithmetic but with non-numerical mathematical objects. \
            However, until the 19th century, algebra consisted essentially of the theory \
            of equations")

In [None]:
for token in text[:10]:
    print(f"{token}, {token.tag_}, {token.pos_}, {spacy.explain(token.tag_)}", "\n")

In [None]:
# helper function to use spaCy part of speech tagging with pandas
def pos_spacy(pdoc):
    
    pdoc = nlp(pdoc)
    pos = []
    
    for token in pdoc:
        pos.append([token.text, "-->", token.pos_])
 
    return pos

In [None]:
patents['pos_spacy'] = patents['abstract'].apply(pos_spacy)

In [None]:
patents['pos_spacy'][0]
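
Once tokens carry POS tags, they can be filtered or counted by tag. The sketch below keeps only content words. To stay runnable without a trained model, it builds a `Doc` with POS tags supplied by hand; these tags are an assumption for illustration, whereas in the notebook they would come from the tagger in the loaded model.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Assumption: hand-written tags stand in for the output of a trained tagger
words = ["I", "will", "book", "my", "ticket"]
pos = ["PRON", "AUX", "VERB", "PRON", "NOUN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

# Keep only tokens whose coarse POS tag marks a content word
content_tags = {"NOUN", "PROPN", "VERB", "ADJ"}
content_words = [token.text for token in doc if token.pos_ in content_tags]

# Count how often each tag occurs
pos_counts = Counter(token.pos_ for token in doc)

print(content_words)
print(pos_counts)
```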

## 5.11 Spacy - Shallow Parsing

**Shallow parsing, or chunking**, is the process of extracting phrases from unstructured text. Chunking groups adjacent tokens into phrases on the basis of their POS tags. There are some standard well-known chunks such as **noun phrases, verb phrases, and prepositional phrases.**

A noun chunk example is shown below. For an example of verb phrase extraction, see 
https://realpython.com/natural-language-processing-spacy-python/

In [None]:
# Code below extracted from https://realpython.com/natural-language-processing-spacy-python/
text = ('There is a developer conference happening on 21 July 2019 in London.')
text = nlp(text)

# Extract Noun Phrases
for chunk in text.noun_chunks:
    print(chunk)

In [None]:
def nounchunk_spacy(pdoc):
    
    pdoc =  nlp(pdoc)
    noun_chunks  = []
    
    for chunk in pdoc.noun_chunks:
        noun_chunks.append(chunk)
        
    return noun_chunks

In [None]:
patents['noun_chunks_spacy'] = patents['abstract'].apply(nounchunk_spacy)

In [None]:
patents['noun_chunks_spacy'][0]

## 5.12 Spacy - Named Entity Recognition

**Named Entity Recognition (NER)** is the process of locating named entities in unstructured text and then classifying them into pre-defined categories, such as person names, organizations, locations, monetary values, percentages, time expressions, and so on.


spaCy exposes the named entities it finds through the **ents** property on Doc objects. You can use it to extract named entities:

In [None]:
# example taken from https://realpython.com/natural-language-processing-spacy-python/
text = ('Great Piano Academy is situated in Mayfair or the City of London and has world-class piano instructors.')

text = nlp(text)

for entity in text.ents:
    # show the text, start and end character index, entity label and explanation of said label
    print(entity.text, entity.start_char, entity.end_char, entity.label_, spacy.explain(entity.label_))


In [None]:
def ne_spacy(pdoc):
    
    pdoc =  nlp(pdoc)
    named_entities  = []
    
    for entity in pdoc.ents:
        named_entities.append([entity.text, "--->", entity.label_] )
        
    return named_entities

In [None]:
patents['ne_spacy'] = patents['abstract'].apply(ne_spacy)

In [None]:
# show results of named entity recognition
# have a look at other documents
patents['ne_spacy'][0]

In [None]:
patents['abstract'][0]
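
Each entity in `ents` is a span with a label and character offsets, which makes it easy to group entities by type. The sketch below illustrates this. The entities are set by hand on a blank pipeline so that the example runs without a trained model; the labels are an assumption for illustration, not the output of the course model.

```python
from collections import defaultdict

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Great Piano Academy is situated in London.")

# Assumption: entities assigned by hand instead of predicted by an entity recognizer
doc.ents = [Span(doc, 0, 3, label="ORG"), Span(doc, 6, 7, label="GPE")]

# Group entity texts by their label
by_label = defaultdict(list)
for ent in doc.ents:
    by_label[ent.label_].append(ent.text)
    # start_char and end_char index into doc.text
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

print(dict(by_label))
```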

## 5.13 spaCy - Dependency Parsing 

**Dependency parsing** is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationship between **headwords and their dependents**. The head of a sentence has no dependency and is called the **root** of the sentence. The verb is usually the head of the sentence. All other words are linked to the headword.

The dependencies can be mapped in a directed graph representation:
<br>
Words are the nodes.<br>
The grammatical relationships are the edges.<br>
Dependency parsing helps you know what role a word plays in the text and how different words relate to each other. It’s also used in shallow parsing and named entity recognition. (Singh, 2019)
<br>
Here’s how you can use dependency parsing to see the relationships between words:

spaCy comes with a built-in visualizer called displaCy. You can use it to visualize a dependency parse or named entities in a browser or a Jupyter notebook.

You can use displaCy to find POS tags for tokens:


In [None]:
about_interest_text = ('He is interested in learning natural Language Processing.')
about_interest_doc = nlp(about_interest_text)
# the line below renders a displaCy visualisation inline when run in a Jupyter notebook
# this may not display on all machines, depending on package and permissions
displacy.render(about_interest_doc, style='dep')

In [None]:
# Example from https://realpython.com/natural-language-processing-spacy-python/
text = 'Gus is learning piano'
text = nlp(text)
for token in text:
    # show the text, its corresponding tag, the head that the token 
    # points to and its dependency label
    print(token.text, token.tag_, token.head.text, token.dep_)

In [None]:
displacy.render(text, style="ent")

The dependency tag ROOT denotes the main verb or action in the sentence. <br>
The other words are directly or indirectly connected to the ROOT word of the sentence. <br>
You can find out what other tags stand for by executing the code below:

In [None]:
spacy.explain("nsubj"), spacy.explain("ROOT"), spacy.explain("aux"), spacy.explain("advcl"), spacy.explain("dobj")

In [None]:
# helper function to show dependencies in pandas text
def depend_parse_spacy(pdoc):
    
    pdoc =  nlp(pdoc)
    de_parse  = []
    
    for token in pdoc:
        de_parse.append([token.text, "--->", token.dep_])
        
    return de_parse

In [None]:
patents['de_parse_spacy'] = patents['abstract'].apply(depend_parse_spacy)

In [None]:
patents['de_parse_spacy'][0]
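
Beyond listing the labels, the parse can be navigated as a tree through `token.head` and `token.children`. The sketch below finds the root and its direct dependents. The heads and labels are written by hand for the sentence "Gus is learning piano" so that the example runs without a trained parser; they are an assumption for illustration.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Assumption: hand-written parse in place of the output of a trained parser
words = ["Gus", "is", "learning", "piano"]
heads = [2, 2, 2, 2]  # index of each token's head; the root points to itself
deps = ["nsubj", "aux", "ROOT", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# The root is the token labelled ROOT
root = next(token for token in doc if token.dep_ == "ROOT")

# children yields the tokens that depend directly on a token
dependents = [(child.text, child.dep_) for child in root.children]

# Filter tokens by dependency label, e.g. to find the subject
subjects = [token.text for token in doc if token.dep_ == "nsubj"]

print("Root:", root.text)
print("Dependents:", dependents)
print("Subjects:", subjects)
```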

## 5.14 Processing Pipeline

So far we have looked at the individual techniques provided by `spaCy`. The power of `spaCy` comes from running these steps one after another in a pipeline.

When we load a model, certain steps are enabled in its pipeline by default.

Below we are going to explore some basic properties of the pipelines themselves and how to interact with them.

In [None]:
# Load the language model instance in spaCy locally
nlp = spacy.load('../local_packages/spacy_local/small_practice_model/en_core_web_sm-3.1.0')

In [None]:
# Retrieve the steps in the pipeline loaded
nlp.pipe_names

In [None]:
# Access the specific objects used for each step
nlp.pipeline

In [None]:
# We can remove the Named Entity Recognition step
nlp.remove_pipe("ner")

nlp.pipeline

Instead of using NER, we could use [merge_noun_chunks](https://spacy.io/api/pipeline-functions#merge_noun_chunks). This can be added to the pipeline after the existing steps.

In [None]:
# Add a new pipeline step by its registered name
# (in spaCy v3, add_pipe takes the component name and creates the component for us)
nlp.add_pipe("merge_noun_chunks")


In [None]:
# New step has been added
nlp.pipe_names

In [None]:
# With merge_noun_chunks in the pipeline, each noun chunk is merged into a single token
example_merged = nlp("I bought a blue car for my great grandad")

[token.text for token in example_merged]

`spaCy` pipelines are incredibly powerful, and already have pre-existing components for most NLP processing. 

For more information on what you can do with the pipelines [see the spaCy documentation on the topic](https://spacy.io/usage/processing-pipelines). You can add your own customized steps in a pipeline too if you wish. It is well worth a read to see what already exists before trying to implement a process from scratch.
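
As a sketch of a customized step, the block below registers a small function as a pipeline component and then temporarily disables it. The component name `token_counter` and what it stores are made up for illustration, and a blank English pipeline with a `sentencizer` is assumed in place of the course model.

```python
import spacy
from spacy.language import Language


# Hypothetical custom component: receives a Doc, must return a Doc
@Language.component("token_counter")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("token_counter", last=True)  # run after the existing steps
print(nlp.pipe_names)

doc = nlp("I bought a blue car. It is fast.")
n_sentences = len(list(doc.sents))
print(doc.user_data["n_tokens"], n_sentences)

# Components can be switched off temporarily without removing them
with nlp.select_pipes(disable=["token_counter"]):
    quick_doc = nlp("I bought a blue car.")
    print("n_tokens" in quick_doc.user_data)
```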

#### Exercises
<br>

<ol>
     <li>Import the Hep Dataset and, using spaCy, perform the steps listed below on the text column. Add new columns to hold the results of each operation</li>
    
            Tokenise
            Identify all phrases in the column that match the pattern adjective followed by noun.
            Remove all stopwords
            Remove all punctuation
            Remove all numbers
            Identify sentences
            Lemmatize the text
            Apply POS tagging
            Apply shallow parsing
            Apply Named Entity Recognition
            Apply Dependency Parsing
    
    
    
    
</ol>





#### References

https://spacy.io/usage/spacy-101 <br>
https://www.analyticsvidhya.com/blog/2020/03/spacy-tutorial-learn-natural-language-processing/ <br>
https://realpython.com/natural-language-processing-spacy-python/
