In [None]:
# Uncomment to install or upgrade spaCy from inside the notebook
# !pip install -U spacy

In [None]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 15.6 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
!python -m spacy info

spaCy version    3.7.4
Location         /usr/local/lib/python3.10/dist-packages/spacy
Platform         Linux-6.1.85+-x86_64-with-glibc2.35
Python version   3.10.12                       
Pipelines        en_core_web_md (3.7.1), en_core_web_sm (3.7.1)



NLP Operations
Processing Pipeline
Tokenization - Lemmatization - Part-of-speech tagging - Syntactic dependency parsing - Named entity recognition

Tokenization


In [None]:
import spacy

nlp = spacy.load('en_core_web_md')

In [None]:
doc = nlp("I'm flying to Frisco")
print([w.text for w in doc])


['I', "'m", 'flying', 'to', 'Frisco']


Lemmatization


In [None]:
doc = nlp("this product integrates both libraries for downloading and applying patches")
for token in doc:
    print(token.text, token.lemma_)

this this
product product
integrates integrate
both both
libraries library
for for
downloading download
and and
applying apply
patches patch


Lemmatization for Meaning Recognition:

To determine what a user is asking about, an app can search the utterance for a word that matches one of the keywords in a predefined list. An easy way to simplify this search is to first convert all the words in the sentence being processed to their lemmas, so that a single keyword covers every inflected form. Another case is handling city nicknames, such as "Frisco" for San Francisco.
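The keyword search described above can be sketched without loading a model. This is a minimal illustration, not the app's actual code: the (text, lemma) pairs are hard-coded to mirror what a lemmatizer such as spaCy's reports, and `KEYWORDS` and `match_keywords` are hypothetical names introduced only for this example.

```python
# (text, lemma) pairs as a lemmatizer such as spaCy's would produce them;
# hard-coded here so the sketch runs without a language model.
TOKENS = [
    ("integrates", "integrate"),
    ("libraries", "library"),
    ("downloading", "download"),
    ("applying", "apply"),
    ("patches", "patch"),
]

# Hypothetical keyword list, written in lemma form.
KEYWORDS = {"download", "patch"}


def match_keywords(tokens, keywords):
    """Return the surface forms whose lemma is one of the keywords."""
    return [text for text, lemma in tokens if lemma in keywords]


matches = match_keywords(TOKENS, KEYWORDS)
print(matches)  # ['downloading', 'patches']
```

Because the comparison is made on lemmas, the keyword list stays short: "download" also covers "downloads", "downloaded", and "downloading".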

Define a special case

In [None]:
import spacy
from spacy.symbols import ORTH, NORM

nlp = spacy.load('en_core_web_md')
# Tokenizer special cases may only set ORTH and NORM
special_case = [{ORTH: 'Frisco', NORM: 'San Francisco'}]
nlp.tokenizer.add_special_case('Frisco', special_case)
# token.text is unchanged; the normalized form is stored in token.norm_
print([w.text for w in nlp("I am flying to Frisco")])

['I', 'am', 'flying', 'to', 'Frisco']


In [None]:
# import spacy
# Note: cache_disabled is not a parameter of spacy.load(), hence the TypeError in the output
nlp = spacy.load('en_core_web_md', cache_disabled=True)
special_case = [{'ORTH': 'Frisco', 'NORM': 'San Francisco'}]
nlp.tokenizer.add_special_case('Frisco', special_case)

print([w.text for w in nlp("I am flying to Frisco")])

TypeError: load() got an unexpected keyword argument 'cache_disabled'

In [None]:
import spacy
from spacy.symbols import ORTH, LEMMA, NORM

nlp = spacy.load('en_core_web_md')
# LEMMA cannot be set in a tokenizer special case; only ORTH and NORM are allowed
special_case = [{ORTH:'Frisco', LEMMA:'San Francisco'}]

nlp.tokenizer.add_special_case('Frisco', special_case)
doc = nlp('I am flying to Frisco')
print([(token.text, token.lemma_) for token in doc])

ValueError: [E1005] Unable to set attribute 'LEMMA' in tokenizer exception for 'Frisco'. Tokenizer exceptions are only allowed to specify ORTH and NORM.

In [None]:
import spacy
from spacy.language import Language

# Load the spaCy language model
nlp = spacy.load('en_core_web_md')

# Define and register a custom pipeline component
@Language.component("custom_lemma")
def custom_lemma(doc):
    for token in doc:
        if token.text == 'Frisco':
            token.lemma_ = 'San Francisco'
    return doc

# Add the custom component to the pipeline
nlp.add_pipe("custom_lemma", after='ner')

# Process a text that includes the special case
doc = nlp("I am flying to Frisco")

# Print the token text and lemmas
print([(token.text, token.lemma_) for token in doc])

[('I', 'I'), ('am', 'be'), ('flying', 'fly'), ('to', 'to'), ('Frisco', 'San Francisco')]


In [None]:
## other

import spacy
from spacy.language import Language

@Language.component("custom_lemma")
def custom_lemma(doc):
  for token in doc:
    if token.text == 'Frisco':
      # Bug: Token.lemma expects an integer hash ID; assign strings to token.lemma_ instead
      token.lemma = "San Francisco"
  return doc

# Load the spaCy language model
nlp = spacy.load('en_core_web_md')

# Add the custom component to the pipeline
nlp.add_pipe("custom_lemma", after='ner')

# Process a text that includes the special case
doc = nlp("I am flying to Frisco")

# Print the token text and lemmas
print([(token.text, token.lemma_) for token in doc])

TypeError: an integer is required

In [None]:

import spacy
from spacy.tokens import Doc

def fix_frisco_lemma(doc: Doc) -> Doc:
  """
  Post-processing function to modify lemma for "Frisco".
  """
  for token in doc:
      if token.text == "Frisco":
          token.lemma_ = "San Francisco"  # Modify lemma_ after tokenization
  return doc

nlp = spacy.load('en_core_web_md')
doc = nlp("I am flying to Frisco")
doc = fix_frisco_lemma(doc)  # Apply post-processing function

print([(token.text, token.lemma_) for token in doc])

[('I', 'I'), ('am', 'be'), ('flying', 'fly'), ('to', 'to'), ('Frisco', 'San Francisco')]


Importance of Pipeline Order
In spaCy, the processing pipeline consists of various components that operate in a sequence. The order of these components can significantly impact the processing results because each component relies on the output of the previous one. Here are some common components in a spaCy pipeline:

Tokenizer: Splits the raw text into individual tokens.

Tagger: Assigns part-of-speech tags to tokens.

Parser: Analyzes the syntactic structure of the sentence.

NER (Named Entity Recognizer): Identifies named entities in the text.

Custom Components: Any additional custom processing logic.

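Why Use after='ner'

The ner component identifies named entities in the text and assigns them specific labels (such as "ORG" for organizations, "PERSON" for people, etc.). If your custom component needs to modify token attributes based on whether they are part of a named entity, it is essential to ensure that the NER component has already run. In the custom_lemma example above, the component does not read entity labels; because ner is the last component in the default en_core_web_md pipeline, after='ner' mainly ensures that the component runs after the built-in lemmatizer, so the custom lemma is not overwritten.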
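The effect of component order can be illustrated with a toy pipeline in plain Python. This is a simplified model under stated assumptions, not spaCy's implementation: tokens are dictionaries, and `toy_ner`, `expand_nicknames`, and `run_pipeline` are hypothetical stand-ins for real pipeline components.

```python
def toy_ner(doc):
    """Stand-in for an entity recognizer: labels known place names."""
    for token in doc:
        if token["text"] in {"LA", "Frisco"}:
            token["ent_type"] = "GPE"
    return doc


def expand_nicknames(doc):
    """Custom step that relies on entity labels set by an earlier component."""
    for token in doc:
        if token["ent_type"] == "GPE" and token["text"] == "Frisco":
            token["lemma"] = "San Francisco"
    return doc


def run_pipeline(text, components):
    """Whitespace-tokenize, then apply each component in the given order."""
    doc = [{"text": w, "lemma": w, "ent_type": ""} for w in text.split()]
    for component in components:
        doc = component(doc)
    return [token["lemma"] for token in doc]


ner_first = run_pipeline("I am flying to Frisco", [toy_ner, expand_nicknames])
custom_first = run_pipeline("I am flying to Frisco", [expand_nicknames, toy_ner])
print(ner_first)     # ['I', 'am', 'flying', 'to', 'San Francisco']
print(custom_first)  # ['I', 'am', 'flying', 'to', 'Frisco']
```

When the custom step runs first, the entity label it depends on has not been set yet, so the nickname is left untouched. The same reasoning applies to any component that reads attributes written by an earlier one.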


Part of Speech Tagging

Verbs: tense, aspect (simple, progressive, or perfect), person, and number.

Categories such as noun, pronoun, and determiner are called coarse-grained parts of speech. They are available as a fixed set of tags through the Token.pos (int) and Token.pos_ (str) attributes.

spaCy also offers fine-grained part-of-speech tags that provide more detailed information about a token. They are available as the Token.tag (int) and Token.tag_ (str) attributes.
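To make the difference between the two tag sets concrete, the sketch below groups fine-grained tags under their coarse-grained category. The (text, pos_, tag_) triples are hard-coded for illustration, in the form spaCy reports them for the sentence pair used in this notebook, so no model is needed to run it.

```python
from collections import defaultdict

# (text, pos_, tag_) triples of the kind spaCy reports for
# "I have flown to LA. Now I am flying to Frisco"; hard-coded for illustration.
TAGGED = [
    ("I", "PRON", "PRP"), ("have", "AUX", "VBP"), ("flown", "VERB", "VBN"),
    ("to", "ADP", "IN"), ("LA", "PROPN", "NNP"), (".", "PUNCT", "."),
    ("Now", "ADV", "RB"), ("I", "PRON", "PRP"), ("am", "AUX", "VBP"),
    ("flying", "VERB", "VBG"), ("to", "ADP", "IN"), ("Frisco", "PROPN", "NNP"),
]

# Collect every fine-grained tag seen under each coarse-grained category.
fine_by_coarse = defaultdict(set)
for _text, pos, tag in TAGGED:
    fine_by_coarse[pos].add(tag)

print(sorted(fine_by_coarse["VERB"]))  # ['VBG', 'VBN']

# The coarse tag VERB covers both verbs; the fine tag tells them apart.
verbs_by_tag = {tag: text for text, pos, tag in TAGGED if pos == "VERB"}
print(verbs_by_tag)  # {'VBN': 'flown', 'VBG': 'flying'}
```

Both "flown" and "flying" are VERB at the coarse level, but the fine-grained tags separate the past participle (VBN) from the present participle (VBG), which is why a filter on VBG or VB matches only "flying".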


In [None]:
nlp = spacy.load('en_core_web_md')
doc = nlp('I have flown to LA. Now I am flying to Frisco')
print([w.text for w in doc if w.tag_ in ('VBG', 'VB')])

['flying']


Part of Speech
The pos_ property of a Token object contains its coarse-grained part of speech, while the tag_ property contains the fine-grained part-of-speech tag assigned to that token.

In [None]:
doc = nlp('I have flown to LA. Now I am flying to Frisco')
print([w.text for w in doc if w.pos_ == 'PROPN' ])

['LA', 'Frisco']


In [None]:
print([w.text for w in doc if w.pos == spacy.symbols.PROPN])

['LA', 'Frisco']


In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I am a runner")

for token in doc:
    print(token.text, token.pos, token.pos_)

I 95 PRON
am 87 AUX
a 90 DET
runner 92 NOUN


Context is important:

An utterance such as “I am flying to LA” might mean either “I'm already in the sky, flying to LA.” or “I'm going to fly to LA.”



Syntactic Relations

Constituent-Based Structure / Word-Based Structure

A phrase structure (constituent-based) tree breaks up the sentence into a noun phrase and a verb phrase at the second level of the hierarchy. A dependency (word-based) structure instead links individual words to one another, using the following terms:

Head: A word that governs or determines the properties of another word.
Child (or Dependent): A word that depends on the head and is governed by it.
Root: The topmost node in the tree, typically the main verb or predicate of the sentence.
Dependency Relation: The type of syntactic relationship between a head and its child.
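The terms above can be made concrete by rebuilding a small dependency tree from head and label information. This is a minimal sketch with hard-coded rows for "I have flown to LA." in the form a dependency parser reports them; `ROWS` and `render` are names introduced only for this illustration.

```python
from collections import defaultdict

# (index, text, head_index, dep) rows for "I have flown to LA.";
# hard-coded here so the sketch runs without a parser.
ROWS = [
    (0, "I", 2, "nsubj"),
    (1, "have", 2, "aux"),
    (2, "flown", 2, "ROOT"),  # the root is its own head
    (3, "to", 2, "prep"),
    (4, "LA", 3, "pobj"),
    (5, ".", 2, "punct"),
]

# Map each head index to the indices of its children.
children = defaultdict(list)
root = None
for index, text, head, dep in ROWS:
    if dep == "ROOT":
        root = index
    else:
        children[head].append(index)


def render(index, depth=0):
    """Return the subtree under `index` as indented 'dep: text' lines."""
    _, text, _, dep = ROWS[index]
    lines = ["  " * depth + f"{dep}: {text}"]
    for child in children[index]:
        lines.extend(render(child, depth + 1))
    return lines


tree = render(root)
print("\n".join(tree))
```

The printed tree starts at the root "flown"; "I", "have", "to", and the period are its children, and "LA" is a child of the preposition "to" through the pobj relation.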


In [None]:
import spacy
nlp = spacy.load('en_core_web_md')
doc = nlp('I have flown to LA. Now I am flying to Frisco')
for token in doc:
    print(token.text, token.pos_, token.tag_, ' ', token.dep_)


I PRON PRP   nsubj
have AUX VBP   aux
flown VERB VBN   ROOT
to ADP IN   prep
LA PROPN NNP   pobj
. PUNCT .   punct
Now ADV RB   advmod
I PRON PRP   nsubj
am AUX VBP   aux
flying VERB VBG   ROOT
to ADP IN   prep
Frisco PROPN NNP   pobj


In [None]:
for token in doc:
    print(token.head.text, token.dep_, token.text)

flown nsubj I
flown aux have
flown ROOT flown
flown prep to
to pobj LA
flown punct .
flying advmod Now
flying nsubj I
flying aux am
flying ROOT flying
flying prep to
to pobj Frisco


Let's figure out which dependency labels point to the tokens that best describe the customer's intent. The goal is to find a pair of tokens that alone adequately describes that intent.

We are interested in the tokens marked with the ROOT and pobj dependency labels, because in this example they are key to intent recognition. ROOT marks the main verb, and pobj marks the entity that, in conjunction with the verb, summarizes the meaning of the entire utterance.

Sentence-level indices


In [None]:
for sent in doc.sents:
  print([w.text for w in sent if w.dep_ == 'ROOT' or w.dep_ == 'pobj'])

['flown', 'LA']
['flying', 'Frisco']
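The ROOT/pobj pair can be combined with lemmas and a nickname table to produce a normalized intent. The sketch below is one possible approach, not a definitive implementation: the (text, lemma, dep) rows are hard-coded in the form spaCy reports them for "I am flying to Frisco", and `NICKNAMES` and `extract_intent` are hypothetical names.

```python
# (text, lemma, dep) rows for "I am flying to Frisco", hard-coded from
# values of the kind spaCy reports so the sketch runs without a model.
SENTENCE = [
    ("I", "I", "nsubj"),
    ("am", "be", "aux"),
    ("flying", "fly", "ROOT"),
    ("to", "to", "prep"),
    ("Frisco", "Frisco", "pobj"),
]

# Hypothetical nickname table; extend as needed.
NICKNAMES = {"Frisco": "San Francisco"}


def extract_intent(rows):
    """Pair the lemma of the ROOT verb with the normalized prepositional object."""
    action = next((lemma for _, lemma, dep in rows if dep == "ROOT"), None)
    target = next((text for text, _, dep in rows if dep == "pobj"), None)
    return action, NICKNAMES.get(target, target)


intent = extract_intent(SENTENCE)
print(intent)  # ('fly', 'San Francisco')
```

Using the lemma of the verb means "flown", "flying", and "fly" all map to the same action, while the nickname lookup maps "Frisco" to its canonical city name.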
