# Introducing spaCy For NLP
> This tutorial introduces spaCy and walks through its core NLP features, based on [spaCy 101](https://spacy.io/usage/spacy-101).
- toc: true 
- badges: true
- comments: true
- author: Zeyu Guan
- categories: [spaCy, Python, Machine Learning, Data Mining, NLP, RandomForest]
- annotations: true
- image: https://spacy.io/static/social_default-1d3b50b1eba4c2b06244425ff0c49570.jpg
- hide: false

## First, what is spaCy?
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It lets you analyze text in depth and build models that support document analysis, chatbot capabilities, and other types of text analysis.

This blog should cover a few features you need to know about spaCy, whether you're new to it or just want to brush up on some NLP basics and implementation specifics.

It's become one of the most extensively used natural language libraries in Python for corporate use cases, and it has a sizable community—and with it, a lot of support for commercializing research breakthroughs as this field evolves rapidly.

After installing spaCy and downloading a model, we can load spaCy and run some code like the example below. The **nlp** variable is loaded with **en_core_web_sm**, the small model for English.

In [6]:
import spacy
nlp = spacy.load('en_core_web_sm')

Next, let's run a small "document" through the natural language parser:

In [7]:
# Process a text with the nlp object to create a Doc
doc = nlp("He went to play basketball")

**nlp.pipe_names** shows spaCy’s processing pipeline. The NLP pipeline has multiple components, such as tokenizer, tagger, parser, ner, etc. So, the input text string has to go through all these components before we can work on it.

In [8]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


If you don't need every component, you can switch some of them off with **nlp.disable_pipes**. The tokenizer always runs; the example below disables the tagger and the parser:

In [9]:
nlp.disable_pipes('tagger', 'parser')
nlp.pipe_names

['tok2vec', 'attribute_ruler', 'lemmatizer', 'ner']
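To make the pipeline idea concrete, here is a minimal pure-Python sketch of an ordered list of named components that can be switched off. The `ToyPipeline` class and its components are hypothetical stand-ins written for this post; they are not spaCy's implementation.

```python
def make_component(name):
    """Return a toy component that records its own name on the doc."""
    def component(doc):
        doc["applied"].append(name)
        return doc
    return component


class ToyPipeline:
    """Hypothetical stand-in for a pipeline: an ordered list of named steps."""

    def __init__(self, names):
        self.components = [(name, make_component(name)) for name in names]
        self.disabled = set()

    @property
    def pipe_names(self):
        # Only the components that are still active, in pipeline order
        return [name for name, _ in self.components if name not in self.disabled]

    def disable_pipes(self, *names):
        self.disabled.update(names)

    def __call__(self, text):
        # Tokenization always happens first; here it is a plain whitespace split
        doc = {"tokens": text.split(), "applied": []}
        for name, component in self.components:
            if name not in self.disabled:
                doc = component(doc)
        return doc


toy_nlp = ToyPipeline(["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"])
toy_nlp.disable_pipes("tagger", "parser")
toy_doc = toy_nlp("He went to play basketball")

print(toy_nlp.pipe_names)
print(toy_doc["applied"])
```

Each active component receives the doc, adds its annotations, and hands the doc to the next one, which is why disabling a component simply removes one step from the chain.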

In the following section, you will learn to perform various NLP tasks using spaCy. We will start off with the popular NLP tasks of Part-of-Speech Tagging, Dependency Parsing, and Named Entity Recognition.

### 1. Part-of-Speech (POS) Tagging using spaCy
In English, nouns, pronouns, adjectives, verbs, and adverbs are some of the most prevalent parts of speech. The task of automatically assigning POS tags to all the words in a sentence is known as POS tagging. It's useful for a variety of downstream NLP applications, including feature engineering, language comprehension, and information extraction.

In [10]:
# Iterate over the tokens
for token in doc:
    # Print the token and its part-of-speech tag
    print(token.text, "-->", token.pos_)

He --> PRON
went --> VERB
to --> PART
play --> VERB
basketball --> NOUN
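As a small illustration of the feature-engineering use case, the sketch below turns the tags printed above into count and frequency features. The (token, tag) pairs are copied by hand from that output so the snippet runs without spaCy; the feature names are just an illustrative choice.

```python
from collections import Counter

# (token, POS tag) pairs copied from the output above
tagged = [
    ("He", "PRON"),
    ("went", "VERB"),
    ("to", "PART"),
    ("play", "VERB"),
    ("basketball", "NOUN"),
]

# Count how often each tag occurs in the sentence
pos_counts = Counter(pos for _, pos in tagged)

# Relative frequencies can serve as simple numeric features for a model
total = sum(pos_counts.values())
pos_features = {pos: count / total for pos, count in pos_counts.items()}

print(pos_counts.most_common())
print(pos_features)
```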


### 2. Dependency Parsing using spaCy

Every sentence has a grammatical structure, which can be extracted with dependency parsing. The result can be viewed as a directed graph, with nodes corresponding to the words in the sentence and edges corresponding to the dependencies between them. Performing dependency parsing in spaCy is simple as well. We'll use the same document we used for POS tagging:

In [11]:
# dependency parsing
for token in doc:
    print(token.text, "-->", token.dep_)

He --> nsubj
went --> ROOT
to --> aux
play --> advcl
basketball --> dobj
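The directed-graph view can be sketched in plain Python. The dependency labels below are copied from the output above; the head indices are an assumption added for illustration, since that output does not print each token's head (in spaCy you would read it from `token.head`).

```python
from collections import defaultdict

# (text, dependency label, index of the head token)
# Labels come from the output above; head indices are assumed for illustration.
parse = [
    ("He", "nsubj", 1),
    ("went", "ROOT", 1),   # by convention the root points at itself
    ("to", "aux", 3),
    ("play", "advcl", 1),
    ("basketball", "dobj", 3),
]

# Build the edges head -> children and remember where the root is
children = defaultdict(list)
root = None
for i, (text, dep, head) in enumerate(parse):
    if dep == "ROOT":
        root = i
    else:
        children[head].append(i)


def subtree(i):
    """Return the token indices below and including i, in sentence order."""
    nodes = [i]
    for child in children[i]:
        nodes.extend(subtree(child))
    return sorted(nodes)


print(parse[root][0])
print([parse[i][0] for i in subtree(3)])
```

Walking the subtree of "play" collects the phrase it governs, which is the kind of structure the dependency labels make available.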


### 3. Named Entity Recognition using spaCy

Entities are words or groups of words that refer to real-world objects with names, such as people, places, and organizations. In the example below, "Today" and "March 21" are recognized as dates, and "Chris" is recognized as a person.

In [12]:
doc = nlp("Today is March 21, and Chris has already spent a few hours on this project in the United States.")
 
for ent in doc.ents:
    print(ent.text, ent.label_)

Today DATE
March 21 DATE
Chris PERSON
a few hours TIME
the United States GPE




### 4. Rule-Based Matching using spaCy

Rule-based matching is easier to show than to explain, and the output of the following code makes the idea clear. With the spaCy matcher, you can find words and phrases in a text using user-defined rules.

So, in the code below:

- First, we import the spaCy matcher
- After that, we initialize the matcher object with the default spaCy vocabulary
- Then, we pass the input text to the nlp object as usual
- In the next step, we define the rule/pattern for what we want to extract from the text.

Let's imagine we're looking for the term "lemon water" in a text. Our goal is for the matcher to find this pattern whenever "lemon" is followed by the word "water." That's exactly what the pattern in the code below defines. Finally, the rule is added to the matcher object, and the matcher is applied to the document.

In [13]:
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the spaCy vocabulary
matcher = Matcher(nlp.vocab)

doc = nlp("Some people start their day with lemon water")

# Define rule
pattern = [{'TEXT': 'lemon'}, {'TEXT': 'water'}]

# Add rule
matcher.add('rule_1',[pattern])

# Run the matcher once and keep the result
matches = matcher(doc)
print(matches)

# Extract matched text
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

[(7604275899133490726, 6, 8)]
lemon water


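Each match in the output above is a tuple of three elements. The first element, 7604275899133490726, is the match ID. The second and third elements are the start and end token indices of the matched span. The end index is exclusive, so **doc[6:8]** covers "lemon water".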

Let's have a look at another application of the spaCy matcher. Consider the following two sentences:

- You read this book
- I will book my ticket

Suppose we want to determine whether a sentence contains the word "book." It sounds simple, doesn't it? But here's the catch: we only want to match "book" when it is used as a noun.

The word "book" was used as a noun in the first sentence and as a verb in the second sentence. As a result, the spaCy matcher should only be able to extract the pattern from the first sentence. Let's put it to the test:

In [16]:
doc1 = nlp("You read this book")
doc2 = nlp("I will book my ticket")

# Both the text and the POS tag must match
pattern = [{'TEXT': 'book', 'POS': 'NOUN'}]

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
matcher.add('rule_2', [pattern])

print("doc1: ", matcher(doc1),"\ndoc2: ", matcher(doc2))

doc1:  [(375134486054924901, 3, 4)] 
doc2:  []
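To show what such a rule does under the hood, here is a rough pure-Python sketch of attribute-based token matching. The `find_matches` helper is hypothetical and much simpler than spaCy's `Matcher` (no operators, no match IDs), and the POS tags are written by hand rather than predicted by a model.

```python
def find_matches(tokens, pattern):
    """Return (start, end) pairs where consecutive tokens satisfy the pattern.

    tokens:  list of dicts such as {"TEXT": "book", "POS": "NOUN"}
    pattern: list of dicts; every attribute listed must equal the token's value
    The end index is exclusive, like a Python slice.
    """
    matches = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(
            all(token.get(attr) == value for attr, value in spec.items())
            for token, spec in zip(window, pattern)
        ):
            matches.append((start, start + len(pattern)))
    return matches


# Hand-annotated tokens for the two example sentences
toy_doc1 = [
    {"TEXT": "You", "POS": "PRON"},
    {"TEXT": "read", "POS": "VERB"},
    {"TEXT": "this", "POS": "DET"},
    {"TEXT": "book", "POS": "NOUN"},
]
toy_doc2 = [
    {"TEXT": "I", "POS": "PRON"},
    {"TEXT": "will", "POS": "AUX"},
    {"TEXT": "book", "POS": "VERB"},
    {"TEXT": "my", "POS": "PRON"},
    {"TEXT": "ticket", "POS": "NOUN"},
]

noun_book = [{"TEXT": "book", "POS": "NOUN"}]

print(find_matches(toy_doc1, noun_book))
print(find_matches(toy_doc2, noun_book))
```

The second sentence produces no match because the text condition is met but the POS condition is not, which mirrors the behavior of the spaCy rule above.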


This was a quick overview to give you an idea of what spaCy can do. Trust me, you'll find yourself using spaCy frequently for NLP tasks. I encourage you to experiment with the code, download a dataset from DataHack, and try your hand at it with spaCy.

## Now, let’s get our hands dirty with spaCy.
### First, let's play around.

That **nlp** variable is now your gateway to all things spaCy and loaded with the **en_core_web_sm** small model for English. Next, let's run a small "document" through the natural language parser.

We start by creating a doc from the text; a doc is a container for a document and all of its annotations. Then we iterate over the document to see what spaCy has parsed.

In [17]:
# Same process as above.
import spacy
nlp = spacy.load("en_core_web_sm")

text = "The rain in Spain falls mainly on the plain."
doc = nlp(text)

# Print each token's text, lemma, POS tag, and stop-word flag
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop)

The the DET True
rain rain NOUN False
in in ADP True
Spain Spain PROPN False
falls fall VERB False
mainly mainly ADV False
on on ADP True
the the DET True
plain plain NOUN False
. . PUNCT False


It's good, but there's a lot of information and it's a little difficult to read. Let's load the spaCy parse of that sentence into a pandas DataFrame:

In [18]:
import pandas as pd
cols = ("text", "lemma", "POS", "explain", "stopword")
rows = []
for t in doc:
    row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
    rows.append(row)
df = pd.DataFrame(rows, columns=cols)
df

| | text | lemma | POS | explain | stopword |
|---|---|---|---|---|---|
| 0 | The | the | DET | determiner | True |
| 1 | rain | rain | NOUN | noun | False |
| 2 | in | in | ADP | adposition | True |
| 3 | Spain | Spain | PROPN | proper noun | False |
| 4 | falls | fall | VERB | verb | False |
| 5 | mainly | mainly | ADV | adverb | False |
| 6 | on | on | ADP | adposition | True |
| 7 | the | the | DET | determiner | True |
| 8 | plain | plain | NOUN | noun | False |
| 9 | . | . | PUNCT | punctuation | False |


Next, let's use the displaCy visualizer to render the [parse tree](https://en.wikipedia.org/wiki/Parse_tree#:~:text=A%20parse%20tree%20or%20parsing,to%20some%20context%2Dfree%20grammar.) for the sentence. In case you are not familiar with the term, a parse tree (also known as a parsing tree, derivation tree, or concrete syntax tree) is a rooted tree that represents the syntactic structure of a string according to some context-free grammar.

In [19]:
from spacy import displacy
displacy.render(doc, style="dep")

So far we have only looked at a single sentence. What if we have many sentences, as in a novel? spaCy supports sentence boundary detection (SBD), also known as sentence segmentation, through **doc.sents**:

In [20]:
text = "We were all out at the zoo one day, I was doing some acting, walking on the railing of the gorilla exhibit. I fell in.Everyone screamed and Tommy jumped in after me, forgetting that he had blueberries in his front pocket. The gorillas just went wild."
doc = nlp(text)
for sent in doc.sents:
    print(">", sent)

> We were all out at the zoo one day, I was doing some acting, walking on the railing of the gorilla exhibit.
> I fell in.
> Everyone screamed and Tommy jumped in after me, forgetting that he had blueberries in his front pocket.
> The gorillas just went wild.


When spaCy creates a document, it follows a non-destructive tokenization principle, which means that the tokens, sentences, and so on are simply indexes into a large array. In other words, spaCy does not cut the text stream up into small chunks. As a result, each sentence is a span with a start and end index into the document array:

In [21]:
for sent in doc.sents:
    print(">", sent.start, sent.end)

> 0 25
> 25 29
> 29 48
> 48 54
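The non-destructive idea can be sketched without spaCy: keep the original string untouched and store tokens only as character offsets into it. The regular expression below is a deliberately crude tokenizer chosen for illustration; it is not how spaCy tokenizes.

```python
import re

text = "I fell in.Everyone screamed."

# Each token is stored as (start, end) character offsets into the original text
token_spans = [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]

# Token strings are recovered by slicing, so nothing is copied or altered
tokens = [text[start:end] for start, end in token_spans]

# A sentence is just a (start, end) range over token indices, end exclusive
sentence = (0, 4)
first_token = token_spans[sentence[0]]
last_token = token_spans[sentence[1] - 1]
sentence_text = text[first_token[0]:last_token[1]]

print(tokens)
print(sentence_text)
```

Because every level is only an index range into the level below, the original text, including its spacing, can always be recovered exactly.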


We can index into the document array to pull out the tokens for one sentence:

In [22]:
doc[48:54]

The gorillas just went wild.

Or simply index into a specific token, such as the verb **went** in the last sentence:

In [23]:
token = doc[51]
print(token.text, token.lemma_, token.pos_)

went go VERB


At this point, we can parse a document, segment that document into sentences, then look at annotations about the tokens in each sentence. That's a good start.

### Acquiring Text

Using the internet as a quick source is one option for us to acquire texts. Of course, we get HTML when we download web pages and must subsequently extract text from them. For this, [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a popular package.

In [24]:
import sys
import warnings
warnings.filterwarnings("ignore")

In the following function, get_text(), we'll parse the HTML to find all of the **\<p>** tags, then extract the text for those:

In [25]:
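from bs4 import BeautifulSoup
import requests
import traceback

def get_text(url):
    """Download a page and return the text of all its <p> elements."""
    buf = []
    try:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for p in soup.find_all("p"):
            buf.append(p.get_text())
        return "".join(buf)
    except Exception:
        # Print the full stack trace and stop if the download or parse fails
        print(traceback.format_exc())
        sys.exit(-1)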
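To see what "find the paragraph tags and extract their text" amounts to, here is an offline sketch that uses only the standard library's `html.parser` instead of Beautiful Soup. The `ParagraphExtractor` class and the sample HTML string are made up for illustration, and the sketch assumes simple, well-formed markup.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside each <p> element of a simple HTML page."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still belongs to the paragraph
        if self.in_paragraph:
            self.paragraphs[-1] += data


# Made-up sample page; only the two <p> elements should be kept
sample_html = (
    "<html><body><h1>License</h1>"
    "<p>Permission is <b>hereby</b> granted.</p>"
    "<div>menu</div>"
    "<p>Second paragraph.</p>"
    "</body></html>"
)

extractor = ParagraphExtractor()
extractor.feed(sample_html)
print(extractor.paragraphs)
```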

Now let's grab some text from online sources. We can compare open-source licenses hosted on the [Open Source Initiative](https://opensource.org/licenses/) site:

In [26]:
lic = {}
lic["mit"] = nlp(get_text("https://opensource.org/licenses/MIT"))
lic["asl"] = nlp(get_text("https://opensource.org/licenses/Apache-2.0"))
lic["bsd"] = nlp(get_text("https://opensource.org/licenses/BSD-3-Clause"))

for sent in lic["bsd"].sents:
    print(">", sent)

> SPDX short identifier: BSD-3-Clause Note: This license has also been called the "New BSD License" or "Modified BSD License".
> See also the 2-clause BSD License.
> Copyright <YEAR> <COPYRIGHT HOLDER>Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:1.
> Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.3.
> Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
> THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
> CONTRIBUTORS
> "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING

Text comparison is a common application for natural language work. With those open-source licenses, for example, we can download their text, parse it, and then compare similarity metrics among them:

In [27]:
pairs = [
    ["mit", "asl"],
    ["asl", "bsd"],
    ["bsd", "mit"]]

for a, b in pairs:
    print(a, b, lic[a].similarity(lic[b]))

mit asl 0.7400606692738089
asl bsd 0.7629310829561079
bsd mit 0.976838684194463


This is intriguing because the BSD and MIT licenses appear to be the most similar. They are, in fact, closely related.

Due to the OSI disclaimer in the footer, some extra text was included in each document—but this provides a reasonable approximation for comparing the licenses.
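For intuition about what a similarity score measures, the sketch below computes cosine similarity between simple word-count vectors. This is a stand-in, not what produced the numbers above: spaCy's documentation describes the default `similarity` as a cosine similarity over averaged word vectors, and small models such as `en_core_web_sm` ship without static word vectors, so their scores are best read as rough indications. The three sample strings are invented for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented snippets standing in for license texts
x = bag_of_words("permission is hereby granted free of charge")
y = bag_of_words("permission is granted to use this software")
z = bag_of_words("redistributions must retain the copyright notice")

print(cosine_similarity(x, y))
print(cosine_similarity(x, z))
```

Texts that share more vocabulary score closer to 1, and texts with no words in common score 0; word vectors soften this by also rewarding words with related meanings.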

### Natural Language Understanding
Now, let's look at some of the spaCy features for NLU. Given a parse of a document, we can extract the noun chunks, i.e., each of the noun phrases:

In [28]:
text = "Steve Jobs and Steve Wozniak incorporated Apple Computer on January 3, 1977, in Cupertino, California."
doc = nlp(text)

for chunk in doc.noun_chunks:
    print(chunk.text)

Steve Jobs
Steve Wozniak
Apple Computer
January
Cupertino
California


Noun phrases generally carry much of a sentence's information content, so extracting them works as a simple filter that reduces a long document to a more "distilled" representation.

This approach can be extended by identifying named entities within the text, such as the names of people, organizations, and places, as well as dates:

In [29]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Steve Jobs PERSON
Steve Wozniak PERSON
Apple Computer ORG
January 3, 1977 DATE
Cupertino GPE
California GPE


The displaCy library provides an excellent way to visualize named entities:

In [30]:
displacy.render(doc, style="ent")

There's a spaCy integration for WordNet called [spacy-wordnet](https://github.com/recognai/spacy-wordnet) by [Daniel Vila Suero](https://twitter.com/dvilasuero), an expert in natural language and knowledge graph work.

First, we'll load the WordNet data via NLTK (these things happen):

In [31]:
import nltk
import ssl

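# Workaround for SSL certificate errors during the download.
# Note: this turns off HTTPS certificate verification, so use it with care.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context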

nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /Users/zeyu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

It should be noted that spaCy operates as a "pipeline" and provides means for customizing parts of the pipeline in use. This is fantastic for enabling really interesting workflow integrations in data science work. We'll include the WordnetAnnotator from the spacy-wordnet project here:

In [32]:
from spacy_wordnet.wordnet_annotator import WordnetAnnotator

print("before", nlp.pipe_names)

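# spaCy v3: components are added by their registered string name
if "spacy_wordnet" not in nlp.pipe_names:
    nlp.add_pipe("spacy_wordnet", after='tagger', config={'lang': nlp.lang})

print("after", nlp.pipe_names)

before ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
after ['tok2vec', 'tagger', 'spacy_wordnet', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']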


In [33]:
token = nlp("withdraw")[0]
token._.wordnet.synsets()

[Synset('withdraw.v.01'),
 Synset('retire.v.02'),
 Synset('disengage.v.01'),
 Synset('recall.v.07'),
 Synset('swallow.v.05'),
 Synset('seclude.v.01'),
 Synset('adjourn.v.02'),
 Synset('bow_out.v.02'),
 Synset('withdraw.v.09'),
 Synset('retire.v.08'),
 Synset('retreat.v.04'),
 Synset('remove.v.01')]

In [34]:
token._.wordnet.lemmas()

[Lemma('withdraw.v.01.withdraw'),
 Lemma('withdraw.v.01.retreat'),
 Lemma('withdraw.v.01.pull_away'),
 Lemma('withdraw.v.01.draw_back'),
 Lemma('withdraw.v.01.recede'),
 Lemma('withdraw.v.01.pull_back'),
 Lemma('withdraw.v.01.retire'),
 Lemma('withdraw.v.01.move_back'),
 Lemma('retire.v.02.retire'),
 Lemma('retire.v.02.withdraw'),
 Lemma('disengage.v.01.disengage'),
 Lemma('disengage.v.01.withdraw'),
 Lemma('recall.v.07.recall'),
 Lemma('recall.v.07.call_in'),
 Lemma('recall.v.07.call_back'),
 Lemma('recall.v.07.withdraw'),
 Lemma('swallow.v.05.swallow'),
 Lemma('swallow.v.05.take_back'),
 Lemma('swallow.v.05.unsay'),
 Lemma('swallow.v.05.withdraw'),
 Lemma('seclude.v.01.seclude'),
 Lemma('seclude.v.01.sequester'),
 Lemma('seclude.v.01.sequestrate'),
 Lemma('seclude.v.01.withdraw'),
 Lemma('adjourn.v.02.adjourn'),
 Lemma('adjourn.v.02.withdraw'),
 Lemma('adjourn.v.02.retire'),
 Lemma('bow_out.v.02.bow_out'),
 Lemma('bow_out.v.02.withdraw'),
 Lemma('withdraw.v.09.withdraw'),
 Lemma('wit

In [35]:
token._.wordnet.wordnet_domains()

['astronomy',
 'school',
 'telegraphy',
 'industry',
 'psychology',
 'ethnology',
 'ethnology',
 'administration',
 'school',
 'finance',
 'economy',
 'exchange',
 'banking',
 'commerce',
 'medicine',
 'ethnology',
 'university',
 'school',
 'buildings',
 'factotum',
 'agriculture',
 'mechanics',
 'gastronomy',
 'meteorology',
 'physics',
 'basketball',
 'anatomy',
 'skiing',
 'nautical',
 'engineering',
 'racing',
 'home',
 'drawing',
 'dentistry',
 'ethnology',
 'mathematics',
 'furniture',
 'animal_husbandry',
 'industry',
 'economy',
 'body_care',
 'chemistry',
 'medicine',
 'surgery',
 'vehicles',
 'transport',
 'atomic_physic',
 'archaeology',
 'hydraulics',
 'oceanography',
 'golf',
 'sculpture',
 'earth',
 'applied_science',
 'artisanship']

Again, if you're working with knowledge graphs, those WordNet "word sense" links could be used in conjunction with graph algorithms to help identify the meanings of specific words. The same links can also help produce summaries of longer sections of text, a technique known as summarization. That's beyond the scope of this tutorial, but it's an interesting application for natural language in the industry right now.

In the opposite direction, if you know ahead of time that a document is about a specific domain or set of topics, you can limit the meanings returned by WordNet. In the following example, we'll restrict the NLU results to the finance and banking domains:

In [36]:
domains = ["finance", "banking"]
sentence = nlp("I want to withdraw 5,000 euros.")

enriched_sent = []

for token in sentence:
    # Get the synsets within the desired domains
    synsets = token._.wordnet.wordnet_synsets_for_domain(domains)

    if synsets:
        lemmas_for_synset = []

        for s in synsets:
            # Collect the lemma names of every matching synset
            lemmas_for_synset.extend(s.lemma_names())

        # Add the alternatives once per token, after the loop
        enriched_sent.append("({})".format("|".join(set(lemmas_for_synset))))
    else:
        enriched_sent.append(token.text)
print(" ".join(enriched_sent))

I (deficiency|deprivation|privation|require|lack|want|need|neediness) to (withdraw|draw_off|draw|take_out) 5,000 euros .


That example may appear straightforward, but if you experiment with the **domains** list, you'll notice that the results exhibit a kind of combinatorial explosion when run without reasonable constraints. Consider a knowledge graph with millions of elements: you'd want to limit searches wherever possible to avoid having each query take days/weeks/months/years to compute.
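A back-of-the-envelope sketch shows where the explosion comes from: the number of possible readings of a sentence is the product of the number of senses per word. The 12 senses for "withdraw" match the synset list printed earlier; every other count below is made up for illustration.

```python
import math

# Candidate senses per token without any domain constraint.
# 12 matches the synsets listed for "withdraw"; the rest are illustrative.
sense_counts = {"I": 1, "want": 10, "to": 1, "withdraw": 12, "5,000": 1, "euros": 2}

# Every combination of senses is a possible reading of the sentence
unconstrained = math.prod(sense_counts.values())

# Hypothetical counts after restricting the senses to a couple of domains
filtered_counts = {**sense_counts, "want": 3, "withdraw": 1, "euros": 1}
constrained = math.prod(filtered_counts.values())

print(unconstrained, constrained)
```

Because the counts multiply, every extra ambiguous word scales the search space, which is why constraining the domains early pays off.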

### Text Comparison with spaCy and Scattertext

Sometimes the problems encountered when attempting to understand a text—or, better yet, attempting to understand a corpus (a dataset containing many related texts)—become so complex that they must first be visualized. Here's an interactive visualization to help you understand text: [Jason Kessler](https://twitter.com/jasonkessler)'s genius has resulted in [scattertext](https://spacy.io/universe/project/scattertext).

Let us examine text data from party conventions held during the 2012 US Presidential elections. Please keep in mind that this cell may take a few minutes to run, but the results of all that number-crunching are well worth the wait.

In [37]:
import scattertext as st

if "merge_entities" not in nlp.pipe_names:
    nlp.add_pipe('merge_entities')

if "merge_noun_chunks" not in nlp.pipe_names:
    nlp.add_pipe('merge_noun_chunks')

convention_df = st.SampleCorpora.ConventionData2012.get_data()
corpus = st.CorpusFromPandas(convention_df, category_col="party", text_col="text", nlp=nlp).build()

In [38]:
convention_df

| | party | text | speaker |
|---|---|---|---|
| 0 | democrat | Thank you. Thank you. Thank you. Thank you so ... | BARACK OBAMA |
| 1 | democrat | Thank you so much. Tonight, I am so thrilled a... | MICHELLE OBAMA |
| 2 | democrat | Thank you. It is a singular honor to be here t... | RICHARD DURBIN |
| 3 | democrat | Hey, Delaware. \nAnd my favorite Democrat, Jil... | JOSEPH BIDEN |
| 4 | democrat | Hello. \nThank you, Angie. I'm so proud of how... | JILL BIDEN |
| ... | ... | ... | ... |
| 184 | republican | As the elected leader of 250,000 College Repub... | ALEX SCHRIVER |
| 185 | republican | Good afternoon. I'm Pete Sessions, a congressm... | PETE SESSIONS |
| 186 | republican | To Chairman Priebus and to my fellow Americans... | BOB BUCKHORN |
| 187 | republican | \nAbsolutely. Thank you, Mr.Chairman.\nWelcome... | SHARON DAY |


Once you have the **corpus** ready, generate an interactive visualization in HTML:

In [39]:
html = st.produce_scattertext_explorer(corpus, category="democrat", 
                                       category_name="Democratic",
                                       not_category_name="Republican",
                                       width_in_pixels=1000,
                                       metadata=convention_df["speaker"])

Now we'll render the HTML—give it a minute or two to load, it's worth the wait:

In [40]:
from IPython.display import IFrame

file_name = "foo.html"

with open(file_name, "wb") as f:
    f.write(html.encode("utf-8"))

IFrame(src=file_name, width = 1200, height=700)

Imagine you had text from the last three years of customer support for a specific product in your company, and your team needed to understand how customers were discussing the product. This scattertext library has the potential to be extremely useful! You could cluster (k=2) on NPS (a customer evaluation metric) and then substitute the top two clustering components for the Democrat/Republican dimension.

## Conclusion

In this blog, we introduced some frequently used features of spaCy from an overall perspective. If you had asked data scientists five years ago about open-source Python libraries for natural language processing, the default answer would have been NLTK. That project includes just about everything but the kitchen sink, and its components are mostly academic.

There's so much more we can [do with spaCy](https://blog.dominodatalab.com/making-pyspark-work-spacy-overcoming-serialization-errors). Hopefully, this tutorial has provided a useful introduction. We wish you all the best in your natural language work.