# Introducing spaCy For NLP
> This tutorial introduces spaCy and applies its core NLP features, following [spaCy 101](https://spacy.io/usage/spacy-101).
- toc: true 
- badges: true
- comments: true
- author: Zeyu Guan
- categories: [spaCy, Python, Machine Learning, Data Mining, NLP, RandomForest]
- annotations: true
- image: https://spacy.io/static/social_default-1d3b50b1eba4c2b06244425ff0c49570.jpg
- hide: false

## First, what is spaCy?
spaCy is a free, open-source Python library for advanced Natural Language Processing (NLP). It lets you analyze text in depth and build models that support document analysis, chatbot capabilities, and other kinds of text analysis.

It has become one of the most widely used NLP libraries in Python for corporate use cases, and it has a sizable community. That community provides a lot of support for commercializing research breakthroughs as the field evolves rapidly.

After installing spaCy and downloading a model (for example with `python -m spacy download en_core_web_sm`), we can load it and run some code as shown below. The `nlp` variable holds the pipeline loaded from `en_core_web_sm`, the small model for English.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

Next, let's run a small "document" through the natural language parser:

In [None]:
# Run the text through the pipeline to create a Doc object
doc = nlp("He went to play basketball")

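In [None]:
# List the components in the processing pipeline
print(nlp.pipe_names)
# Temporarily disable components; they are restored when the block exits
with nlp.select_pipes(disable=["tagger", "parser"]):
    print(nlp.pipe_names)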

### 1. Part-of-Speech (POS) Tagging using spaCy

In [None]:
# Iterate over the tokens
for token in doc:
    # Print the token and its part-of-speech tag
    print(token.text, "-->", token.pos_)

### 2. Dependency Parsing using spaCy

In [None]:
# dependency parsing
for token in doc:
    print(token.text, "-->", token.dep_)

### 3. Named Entity Recognition using spaCy

In [None]:
doc = nlp("Today is March 21, and Chris has already spent a few hours on this project in the United States.")
 
for ent in doc.ents:
    print(ent.text, ent.label_)
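
Entity labels such as `GPE` are not always self-explanatory. As a small sketch, `spacy.explain` looks up a label in spaCy's built-in glossary and returns a short description; the particular labels chosen here are only examples, and a given model may emit others.

```python
import spacy

# Look up a few common entity labels in spaCy's glossary
labels = ("PERSON", "GPE", "DATE")
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(label, "-->", description)
```

No model needs to be loaded for this lookup, so it is a cheap way to decode labels while exploring output.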

### 4. Rule-Based Matching using spaCy

In [None]:
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the spaCy vocabulary
matcher = Matcher(nlp.vocab)

doc = nlp("Some people start their day with lemon water")

# Define rule
pattern = [{'TEXT': 'lemon'}, {'TEXT': 'water'}]

# Add rule
matcher.add('rule_1',[pattern])

# Find matches: a list of (match_id, start, end) tuples
matches = matcher(doc)
print(matches)

# Extract matched text
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

In [None]:
doc1 = nlp("You read this book")
doc2 = nlp("I will book my ticket")

# Match "book" only when both the text and the POS tag (noun) agree
pattern = [{'TEXT': 'book', 'POS': 'NOUN'}]

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
matcher.add('rule_2', [pattern])


print("doc1:", matcher(doc1))
print("doc2:", matcher(doc2))
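
The rules above match exact token text, so `Lemon` or `LEMON` would be missed. As a minimal sketch of a case-insensitive rule, the example below uses the `LOWER` token attribute. It runs on a blank English pipeline (`spacy.blank("en")`), which only tokenizes, so no model download is needed. The names `blank_nlp`, `ci_matcher`, and `lemon_water`, and the sample sentence, are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes, which is enough for text-based patterns
blank_nlp = spacy.blank("en")
ci_matcher = Matcher(blank_nlp.vocab)

# LOWER compares the lowercased token text, so capitalization is ignored
pattern = [{"LOWER": "lemon"}, {"LOWER": "water"}]
ci_matcher.add("lemon_water", [pattern])

ci_doc = blank_nlp("Lemon water is popular. I drink LEMON WATER daily.")
matched_texts = [ci_doc[start:end].text for match_id, start, end in ci_matcher(ci_doc)]
print(matched_texts)
```

Patterns that rely on `POS` or `DEP` still need a trained pipeline, because a blank pipeline does not assign those attributes.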

## Now, let’s get our hands dirty with spaCy.
### First, let's play around.

In [None]:
# Same process as above.
import spacy
nlp = spacy.load("en_core_web_sm")

text = "The rain in Spain falls mainly on the plain."
doc = nlp(text)

# Print each token's text, lemma, part of speech, and stop-word flag
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop)

The package gives us a lot of valuable information, but the raw output is a bit difficult to read. Let's reformat the spaCy parse of that sentence as a pandas DataFrame.

In [None]:
import pandas as pd
cols = ("text", "lemma", "POS", "explain", "stopword")
rows = []
for t in doc:
    row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
    rows.append(row)
df = pd.DataFrame(rows, columns=cols)
df

Next, let's use displaCy, spaCy's built-in visualizer, to draw the parse tree for the sentence.

In [None]:
from spacy import displacy
displacy.render(doc, style="dep")

So far we have only worked on a single sentence. What if we have many sentences, as in a novel? spaCy offers sentence boundary detection (SBD), also known as sentence segmentation. With a trained pipeline such as `en_core_web_sm`, the boundaries are derived from the dependency parse and exposed through `doc.sents`:

In [None]:
text = "We were all out at the zoo one day, I was doing some acting, walking on the railing of the gorilla exhibit. I fell in. Everyone screamed and Tommy jumped in after me, forgetting that he had blueberries in his front pocket. The gorillas just went wild."
doc = nlp(text)
for sent in doc.sents:
    print(">", sent)

In [None]:
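# Each sentence is a Span; start and end are token offsets into the doc
for sent in doc.sents:
    print(">", sent.start, sent.end)

In [None]:
# Slice the doc by token offsets to recover a span of tokens
doc[48:54]

In [None]:
# Index a single token and inspect its annotations
token = doc[51]
print(token.text, token.lemma_, token.pos_)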
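
When no trained pipeline is at hand, or full parsing is more than a task needs, spaCy's rule-based `sentencizer` component is a lighter option that splits on sentence-final punctuation. The sketch below adds it to a blank English pipeline; the names `blank_nlp` and `sample` and the shortened text are illustrative assumptions, not part of the walkthrough above.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components
blank_nlp = spacy.blank("en")
# The rule-based sentencizer marks boundaries after punctuation such as "."
blank_nlp.add_pipe("sentencizer")

sample = blank_nlp("I fell in. Everyone screamed. The gorillas just went wild.")
sentences = [sent.text for sent in sample.sents]

for sentence in sentences:
    print(">", sentence)
```

Because it relies only on punctuation, this approach can be fooled by abbreviations and similar cases that a parser-based segmenter often handles better.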

### Acquiring Text

One quick way to acquire text is to pull it from the internet. Downloaded web pages come as HTML, so we must then extract the text from the markup. Beautiful Soup is a popular package for this.

In [None]:
import sys
import warnings
warnings.filterwarnings("ignore")

In [None]:
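from bs4 import BeautifulSoup
import requests

def get_text(url):
    """Download a web page and return the text of its <p> elements."""
    response = requests.get(url, timeout=30)
    # Raise on HTTP errors instead of silently parsing an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Join paragraphs with newlines so words at paragraph edges don't run together
    return "\n".join(p.get_text() for p in soup.find_all("p"))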
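
To see what the paragraph extraction does without touching the network, here is a small sketch that runs Beautiful Soup on an inline HTML string. The snippet in `html_doc` is a made-up stand-in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page
html_doc = """
<html><body>
  <h1>License</h1>
  <p>Permission is hereby granted.</p>
  <div><p>Provided <b>as is</b>.</p></div>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
# find_all("p") returns every <p> element, however deeply nested;
# get_text() drops inline tags such as <b> and keeps their text
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)
```

The heading is skipped because only `<p>` elements are collected, which is usually what we want for body text but will miss content that a site puts in other tags.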

In [None]:
lic = {}
lic["mit"] = nlp(get_text("https://opensource.org/licenses/MIT"))
lic["asl"] = nlp(get_text("https://opensource.org/licenses/Apache-2.0"))
lic["bsd"] = nlp(get_text("https://opensource.org/licenses/BSD-3-Clause"))

for sent in lic["bsd"].sents:
    print(">", sent)

In [None]:
pairs = [
    ["mit", "asl"],
    ["asl", "bsd"],
    ["bsd", "mit"]]

for a, b in pairs:
    print(a, b, lic[a].similarity(lic[b]))

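This is interesting: the BSD and MIT licenses appear to be the most similar documents, and in fact they are closely related. Keep in mind that `en_core_web_sm` ships without word vectors, so its similarity scores are only a rough guide; a larger model such as `en_core_web_md` gives more reliable results. Document similarity has many real-life applications, such as comparing the opinions and bias expressed in different articles.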
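
By default, `Doc.similarity` estimates similarity as the cosine of the angle between the two documents' vectors. The sketch below implements that measure in plain Python to show what the number means. The function name `cosine_similarity` and the three toy vectors are invented for illustration and are not real license embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Treat an all-zero vector as having no similarity to anything
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented for illustration only
doc_a = [0.9, 0.1, 0.3]
doc_b = [0.8, 0.2, 0.3]
doc_c = [0.1, 0.9, 0.4]

print("a vs b:", round(cosine_similarity(doc_a, doc_b), 3))
print("a vs c:", round(cosine_similarity(doc_a, doc_c), 3))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0, regardless of vector length.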

### Natural Language Understanding

In [None]:
text = "Steve Jobs and Steve Wozniak incorporated Apple Computer on January 3, 1977, in Cupertino, California."
doc = nlp(text)

# Noun chunks are base noun phrases: a noun plus the words describing it
for chunk in doc.noun_chunks:
    print(chunk.text)

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

In [None]:
displacy.render(doc, style="ent")

In [None]:
import nltk
import ssl

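# Workaround for SSL certificate errors when downloading NLTK data.
# It disables certificate verification, so use it only if the download fails.
try: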
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download("wordnet")

In [None]:
from spacy_wordnet.wordnet_annotator import WordnetAnnotator

print("before", nlp.pipe_names)

# spaCy v3: components are added by their registered string name
if "spacy_wordnet" not in nlp.pipe_names:
    nlp.add_pipe("spacy_wordnet", after='tagger', config={'lang': nlp.lang})

print("after", nlp.pipe_names)

In [None]:
token = nlp("withdraw")[0]
token._.wordnet.synsets()

In [None]:
token._.wordnet.lemmas()

In [None]:
token._.wordnet.wordnet_domains()

In [None]:
domains = ["finance", "banking"]
sentence = nlp("I want to withdraw 5,000 euros.")

enriched_sent = []

for token in sentence:
    # Get the synsets for this token within the desired domains
    synsets = token._.wordnet.wordnet_synsets_for_domain(domains)

    if synsets:
        lemmas_for_synset = []

        for s in synsets:
            # Collect the lemma names of every matching synset
            lemmas_for_synset.extend(s.lemma_names())

        # Add the variants to the enriched sentence once per token
        enriched_sent.append("({})".format("|".join(sorted(set(lemmas_for_synset)))))
    else:
        enriched_sent.append(token.text)

print(" ".join(enriched_sent))

### Text Comparison with spaCy and Scattertext

In [None]:
import scattertext as st

if "merge_entities" not in nlp.pipe_names:
    nlp.add_pipe('merge_entities')

if "merge_noun_chunks" not in nlp.pipe_names:
    nlp.add_pipe('merge_noun_chunks')

convention_df = st.SampleCorpora.ConventionData2012.get_data()
corpus = st.CorpusFromPandas(convention_df, category_col="party", text_col="text", nlp=nlp).build()

In [None]:
convention_df

In [None]:
html = st.produce_scattertext_explorer(corpus, category="democrat", 
                                       category_name="Democratic",
                                       not_category_name="Republican",
                                       width_in_pixels=1000,
                                       metadata=convention_df["speaker"])

In [None]:
from IPython.display import IFrame

file_name = "foo.html"

# Write the generated HTML to disk, then embed it in the notebook
with open(file_name, "w", encoding="utf-8") as f:
    f.write(html)

IFrame(src=file_name, width=1200, height=700)