# SpaCy: NLP Pipeline

![](images/spacy_nlp_pipeline.svg)


**What we will do**
Reference: https://spacy.io/

* Tokenization
* POS Tagging
* NER
* Entity linking

![](images/spacy-training.svg)

## Setup

In [None]:
try:
    %tensorflow_version 2.x
    is_colab = True
    !pip install spacy
    !python -m spacy download en_core_web_md
    !pip install spacy-transformers
except Exception:  # the Colab-only magic above fails outside Google Colab
    is_colab = False

print(f'\033[00mUsing Google Colab = \033[93m{is_colab}\033[0m')
if is_colab: print("Dependencies installed")

# Spacy: Getting started

As discussed in the lecture portion, Python has two main libraries to help with NLP tasks: 

* [NLTK](https://www.nltk.org/)
* [spaCy](https://spacy.io/)

spaCy launched in 2015, has rapidly become an industry standard, and is the focus of this training. It is an industrial-grade, open-source library with community-driven integrations (see the [spaCy Universe](https://spacy.io/universe)).

spaCy requires you to download language resources (such as models). For English, you can use `python -m spacy download en_core_web_sm`. The suffix `_sm` indicates the "small" model, while `_md` and `_lg` indicate the medium and large models, respectively. The larger models add features such as word vectors, which this tutorial only needs for the similarity section later on.
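
The package names follow a pattern that the spaCy model documentation describes as language, type, genre, and size. As a minimal sketch, the hypothetical helper `parse_model_name` below (written for this tutorial, not part of spaCy) splits a name into those parts:

```python
# Hypothetical helper, not part of spaCy: split a model package name
# such as "en_core_web_sm" into its four naming components.
SIZES = {"sm": "small", "md": "medium", "lg": "large"}

def parse_model_name(name):
    lang, model_type, genre, size = name.split("_")
    return {
        "language": lang,
        "type": model_type,
        "genre": genre,
        "size": SIZES.get(size, size),  # fall back to the raw suffix
    }

for name in ["en_core_web_sm", "en_core_web_md", "en_core_web_lg"]:
    print(name, "->", parse_model_name(name))
```

The sketch assumes exactly four underscore-separated parts, which holds for the three English packages mentioned here.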


In [None]:
!python -m spacy download en_core_web_sm

In [None]:
!pip install urllib3==1.25.10

In [None]:
!pip show urllib3 | grep -i version

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

# Use if needed:
#spacy.util.get_data_path()

# Tokenization

_spaCy_ generates a [token](https://spacy.io/api/token) for each word in the sentence. The token fields show the raw text, the root of the word (lemma), the part of speech (POS), whether or not it's a stop word, and many other things.

In [None]:
import spacy
text = "this is a beautiful day"
doc = nlp(text)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
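
The `shape_` field can look cryptic at first. As a rough sketch of the idea, the function below is a simplified re-implementation written for this tutorial (not spaCy's own code): letters become `x` or `X`, digits become `d`, and runs of the same shape character are cut off after four.

```python
def word_shape(text):
    """Approximate a token's orthographic shape (illustrative only)."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            s = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            s = "d"
        else:
            s = ch  # punctuation is kept as-is
        # Count how long the current run of identical shape characters is
        run = run + 1 if s == last else 1
        last = s
        if run <= 4:
            shape.append(s)
    return "".join(shape)

for word in ["beautiful", "AirBnB", "IPO'd", "2015"]:
    print(word, "->", word_shape(word))
```

Shapes are useful as features because they generalize across words: every long lowercase word collapses to `xxxx`, while mixed-case names keep a distinctive pattern.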

In [None]:
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
text = "gimme that"
doc = nlp(text)  # phrase to tokenize
print([w.text for w in doc])  # ['gimme', 'that']

# Add special case rule
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

# Check new tokenization
print([w.text for w in nlp("gimme that")])  # ['gim', 'me', 'that']
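
To see what a special-case rule does conceptually, here is a toy whitespace tokenizer with an exception table. It is an illustration only: `toy_tokenize` and its dictionary are made up for this sketch, and spaCy's real tokenizer additionally handles prefixes, suffixes, and infixes.

```python
# Toy illustration only: whitespace tokenizer with special-case rules.
special_cases = {"gimme": ["gim", "me"]}

def toy_tokenize(text, special_cases):
    tokens = []
    for chunk in text.split():
        # An exact-match special case overrides the default behaviour
        tokens.extend(special_cases.get(chunk, [chunk]))
    return tokens

print(toy_tokenize("gimme that", {}))             # ['gimme', 'that']
print(toy_tokenize("gimme that", special_cases))  # ['gim', 'me', 'that']
```

Note that the pieces `gim` and `me` concatenate back to the original string `gimme`, so the rule changes where the text is cut without altering the text itself.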

In [None]:
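# explain() returns (rule name, token text) pairs showing which
# tokenizer rule produced each token
tok_exp = nlp.tokenizer.explain(text)
for t in tok_exp:
    print(t[1], '\t', t[0])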

In [None]:
[t.is_space for t in nlp('''"a gimme give me let's "''')]

# Numeric representation

Let's print the last token from the first tokenization loop (`token` still holds "day") and see its _numeric_ representation:

In [None]:
print(f'The token is from the raw text: \033[92m{token.text}\033[0m\nNumeric representation:\n')
print(token.vector)
print(f'\nThe length of the vector is {token.vector.shape}') # 96 length vector

## Part-of-speech tagging

Part-of-speech tagging requires a loaded model that includes tagging and parsing components, such as `en_core_web_sm`.

In [None]:
# Another example
import pandas as pd
doc2 = nlp("Doordash and AirBnB have IPO'd this week")
my_columns = ['Text', 'Lemma', 'POS', 'TAG', 'DEP', 'Shape', 'Alpha', 'Stop']

# Collect one row per token, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0)
rows = [
    [
        token.text,
        token.lemma_,
        token.pos_,
        token.tag_,
        token.dep_,
        token.shape_,
        token.is_alpha,
        token.is_stop,
    ]
    for token in doc2
]
df = pd.DataFrame(rows, columns=my_columns)
print(df)

# Display

Note: Outside of Jupyter, use `displacy.serve` instead of `displacy.render`.

In [None]:
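from spacy import displacy

# Re-parse the first example sentence; `doc` was overwritten by "gimme that" above
doc = nlp("this is a beautiful day")
displacy.render(doc, style="dep", jupyter=True)
displacy.render(doc, style="ent")
# "day" is shown as a recognized DATE entity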

### Exercise:

Explore different parts of speech & sentence structures. 
* Show PERSON 
* Show location

Some examples:
* "They met at a cafe in London last year"
* "Peter went to see his uncle in Brooklyn"
* "The chicken crossed the road because it was hungry"
* "The chicken crossed the road because it was narrow"

# Similarity of two sentences

Let's repeat the token analysis above, this time on two sentences with similar meaning.

In [None]:
sentence_list = ["this is a beautiful day", "today is bright and sunny"]

In [None]:
doc_list = list(map(nlp, sentence_list))

In [None]:
# Print each sentence's token attributes as a table using tabulate
from tabulate import tabulate

column_names = ['text', 'lemma', 'pos', 'tag', 'dep', 'shape', 'is_alpha', 'is_stop']
for doc in doc_list:
    print(f'\n\033[92mPrinting tokens for \033[91m"{doc}"\033[0m')
    # One row per token; rows are rebuilt per sentence so tables don't accumulate
    rows = [[token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
             token.shape_, token.is_alpha, token.is_stop] for token in doc]
    print(tabulate(rows, headers=column_names))

# Showing similarity between two sentences

1. "this is a beautiful day"
2. "today is bright and sunny"

Note: If you have loaded the small (`sm`) model, you will get the following warning:
> UserWarning: [W007] The model you're using has no word vectors loaded, so the result of the Token.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.

Try: 
* `python -m spacy download en_core_web_md`
* or: `python -m spacy download en_core_web_lg`

In [None]:
import warnings

# choose action = 'ignore' to suppress the small-model warning
warnings.filterwarnings(action = "ignore") # "default"

In [None]:
doc_list[0].similarity(doc_list[1])

In [None]:
nlp_md = spacy.load("en_core_web_md")

In [None]:
# try again
doc_md_list = list(map(nlp_md, sentence_list))
doc_md_list[0].similarity(doc_md_list[1])
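
According to the spaCy documentation, the default similarity estimate is the cosine similarity of the averaged word vectors. The sketch below reproduces that calculation in plain Python. The three-dimensional `toy_vectors` are invented purely for illustration (real model vectors have hundreds of dimensions), so the printed score says nothing about what `en_core_web_md` returns.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration
toy_vectors = {
    "beautiful": [0.9, 0.1, 0.3],
    "day":       [0.2, 0.8, 0.1],
    "bright":    [0.8, 0.2, 0.4],
    "sunny":     [0.7, 0.3, 0.2],
    "today":     [0.1, 0.9, 0.2],
}

def average_vector(words):
    # Average the vectors of all words we have a vector for
    vecs = [toy_vectors[w] for w in words if w in toy_vectors]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = average_vector("this is a beautiful day".split())
v2 = average_vector("today is bright and sunny".split())
print(round(cosine_similarity(v1, v2), 3))
```

Because word order is lost when averaging, two sentences that use similar words in different arrangements score as highly similar under this approach.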

# Paragraph

How do you deal with multiple sentences? A parsed `Doc` exposes its sentences through `doc.sents`.

In [None]:
text = """When we went out for ice-cream last summer, the place was 
packed. This year, however, things are eerily different. You can see that 
the stores are nearly deserted and roads empty like never before. It's a 
reality that we are all getting used to, albeit very slowly and reluctantly.
"""

doc = nlp(text)

for sent in doc.sents:
    print(">", sent)
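
By default, spaCy derives sentence boundaries from the dependency parse rather than from punctuation alone. For contrast, here is a naive punctuation-based splitter; `naive_sentences` is a hypothetical baseline written for this tutorial, and it shows why simple rules are not enough.

```python
import re

def naive_sentences(text):
    # Collapse line breaks, then split after ., ! or ? followed by whitespace
    flat = " ".join(text.split())
    return re.split(r"(?<=[.!?])\s+", flat)

print(naive_sentences("The place was packed. This year,\nthings are different. Is it permanent?"))
print(naive_sentences("Dr. Smith arrived."))  # wrongly splits after "Dr."
```

The second call breaks after the abbreviation, which is the kind of error a rule this simple cannot avoid.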

# Scattertext

The following is a nice demonstration of the power of spaCy, using text from the 2012 Democratic and Republican conventions. This demo was created by derwen.ai using the `scattertext` library.

In [None]:
# First, install scattertext
!pip install scattertext

In [None]:
?nlp.create_pipe

In [None]:
import scattertext as st

# By default, the English pipeline includes `tagger`, `parser`, and `ner`.
# spaCy v3 adds built-in components by name; on spaCy v2 use
# nlp.add_pipe(nlp.create_pipe("merge_entities")) instead.
if "merge_entities" not in nlp.pipe_names:
    nlp.add_pipe("merge_entities")

if "merge_noun_chunks" not in nlp.pipe_names:
    nlp.add_pipe("merge_noun_chunks")

convention_df = st.SampleCorpora.ConventionData2012.get_data() 
corpus = st.CorpusFromPandas(convention_df,
                             category_col="party",
                             text_col="text",
                             nlp=nlp).build()

Generate the interactive visualization once the corpus is ready:

In [None]:
html = st.produce_scattertext_explorer(
    corpus,
    category="democrat",
    category_name="Democratic",
    not_category_name="Republican",
    width_in_pixels=1000,
    metadata=convention_df["speaker"]
)

Render the visualization:

In [None]:
# IPython.core.display is deprecated as an import path; use IPython.display
from IPython.display import IFrame, display, HTML
import sys

IN_COLAB = "google.colab" in sys.modules
print(IN_COLAB)

**Use in Google Colab**

In [None]:
if IN_COLAB:
    display(HTML("<style>.container { width:98% !important; }</style>"))
    display(HTML(html))

**Use in Jupyter**

In [None]:
file_name = "foo.html"

with open(file_name, "wb") as f:
    f.write(html.encode("utf-8"))

IFrame(src=file_name, width = 1200, height=700)

# The SpaCy universe

That's the end of our introduction to spaCy. As discussed, spaCy is an open, collaborative project with a universe of plugins and datasets that extend it to many use cases. The following is a sampling of the [spaCy Universe](https://spacy.io/universe):
 - [Legal: Blackstone](https://spacy.io/universe/project/blackstone)
 - [Biomedical: Kindred](https://spacy.io/universe/project/kindred)
 - [Geographic: mordecai](https://spacy.io/universe/project/mordecai)
 - [Label: Prodigy](https://spacy.io/universe/project/prodigy)
 - [Edge: spacy-raspberry](https://spacy.io/universe/project/spacy-raspberry)
 - [Voice: Rasa NLU](https://spacy.io/universe/project/rasa)
 - [Transformers: spacy-transformers](https://explosion.ai/blog/spacy-pytorch-transformers)
 - [Conference: spaCy IRL 2019](https://irl.spacy.io/2019/)

_Credit: Derwen.ai_