# SpaCy!

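- a large, full-featured NLP library
- the trained components of the `sm`/`md` pipelines (e.g., tagger, parser, NER) are CNN-based
- one of the most widely used general-purpose NLP libraries (alternatives include NLTK and gensim)
- HuggingFace is the usual choice when the focus is mostly on deep learning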

    `pip install spacy` or `pip install -U 'spacy[cuda-autodetect]'` (for GPU support)

    `python -m spacy download en_core_web_sm`   # small pipeline; CNN-based, no static word vectors

    `python -m spacy download en_core_web_md`   # medium pipeline; CNN-based, includes 300-dimensional word embeddings (GloVe)

    `python -m spacy download en_core_web_trf`  # all components are trained on top of a transformer

In [1]:
import spacy
spacy.__version__


'3.3.2'

## 1. Basics

### 1.1 Intro

In [2]:
#create a spaCy Language object (a pipeline) that can analyze text in many ways
#using a pretrained model

nlp = spacy.load('en_core_web_sm')

In [3]:
text = 'Thailand really like to eat naan and masala.  He also likes to eat sushi.'

In [4]:
doc = nlp(text)

In [5]:
type(doc)

spacy.tokens.doc.Doc

In [6]:
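#a Doc carries many annotations; iterating over it yields tokens
#note: despite its name, `tokens` holds a single Token (the first one) and is reused below
for tokens in doc[:10]:
    print(tokens)  #the Doc (spacy.tokens.doc.Doc) is already tokenized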
    break

Thailand


In [7]:
tokens

Thailand

In [8]:
for sent in doc.sents:
    print(sent)  #the Doc also provides sentence segmentation

Thailand really like to eat naan and masala.
 He also likes to eat sushi.
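
The sentence boundaries above come from the trained pipeline. As a rough alternative, spaCy also ships a rule-based `sentencizer` component that splits on punctuation and needs no model download. The sketch below assumes a blank English pipeline and a made-up example text (not the one used in this notebook); on harder inputs its output may differ from the trained pipeline's.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components
nlp_rules = spacy.blank("en")
# Rule-based sentence splitter driven by punctuation
nlp_rules.add_pipe("sentencizer")

doc_rules = nlp_rules("I like naan. He likes sushi.")
sentences = [sent.text for sent in doc_rules.sents]
for sentence_text in sentences:
    print(sentence_text)
```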


In [9]:
tokens

Thailand

In [10]:
tokens.ent_type #entity type ids

384

In [11]:
tokens.ent_type_ #geo political entity

'GPE'

In [12]:
spacy.explain('GPE')

'Countries, cities, states'

In [13]:
tokens.ent_iob_  #beginning of an entity

'B'
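
`ent_iob_` follows the IOB scheme: `B` marks the first token of an entity, `I` a token inside one, and `O` a token outside any entity. Here is a minimal pure-Python sketch of that labeling, assuming entities are given as token-index ranges; `iob_tags` is a hypothetical helper and the tokens are a made-up example. It fuses the tag and the label (`B-GPE`) for readability, whereas spaCy keeps them in separate attributes (`ent_iob_` and `ent_type_`).

```python
def iob_tags(n_tokens, entities):
    """Return one IOB tag per token.

    entities: list of (start, end, label) with token indices, end exclusive.
    """
    tags = ["O"] * n_tokens
    for start, end, label in entities:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


example_tokens = ["AIT", "is", "near", "New", "York", "."]
example_tags = iob_tags(len(example_tokens), [(3, 5, "GPE")])
for token_text, tag in zip(example_tokens, example_tags):
    print(token_text, tag)
```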

In [14]:
tokens.pos_  #proper noun

'PROPN'

In [15]:
tokens.dep_  #dependency relation to the head (nsubj = nominal subject)

'nsubj'

In [16]:
tokens.head  #the syntactic head (parent) of this token

like

In [17]:
sentence1 = list(doc.sents)[0]

In [18]:
sentence1

Thailand really like to eat naan and masala.

In [19]:
from spacy import displacy  #displaCy is spaCy's built-in visualizer
displacy.render(sentence1, style="dep")

In [20]:
displacy.render(sentence1, style="ent")

### 1.2 Word Vectors

In [21]:
nlp = spacy.load("en_core_web_md")

In [22]:
text = "Chaky likes to eat sushi."

In [23]:
doc = nlp(text)

In [24]:
sentence = list(doc.sents)[0]

In [25]:
sentence[1]

likes

In [26]:
len(sentence[1].vector)  #vector size --> 300-dimensional embedding (GloVe)

300

### 1.3 Similarity

In [27]:
#before similarity, let's look at nlp.vocab.strings
doc = nlp("I love coffee.")

In [28]:
nlp.vocab.strings['coffee']  #hash value

3197928453018144401

In [29]:
nlp.vocab.strings[3197928453018144401]

'coffee'
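
The two lookups above are the same mapping used in both directions: string to hash and hash back to string. A small sketch with a standalone `StringStore`, assuming spaCy 3.x; `store` and the other variable names are made up for illustration. The reverse lookup only works for strings the store has actually seen.

```python
from spacy.strings import StringStore

# A standalone store seeded with two strings
store = StringStore(["coffee", "dog"])

coffee_hash = store["coffee"]        # string -> 64-bit integer hash
coffee_text = store[coffee_hash]     # hash -> original string

print(coffee_hash, coffee_text)
print("coffee" in store, "tea" in store)
```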

In [30]:
#first, convert 'dog' to its integer hash ID
integer = nlp.vocab.strings['dog']
integer

7562983679033046312

In [31]:
#get the vector for this ID
vector = nlp.vocab.vectors[integer]
vector[:5] #first 5 of the 300 dimensions of the 'dog' vector

array([-0.72483 ,  0.42538 ,  0.025489, -0.39807 ,  0.037463],
      dtype=float32)

In [32]:
import numpy as np

#most_similar returns a tuple: (keys, best_rows, scores)
close_words = nlp.vocab.vectors.most_similar(np.asarray([vector]), n=10)
close_words

(array([[13192779106523156987,  4476338517347267351, 14199852958745354380,
          3615545391617869586,  6740239789784345073,  9120157979859245900,
          6189118356939658504, 17686863692678987895,  8330890959751529634,
          4295179733490603801]], dtype=uint64),
 array([[9980, 9979, 9981, 4791, 4792, 4793, 7916, 7918, 7917,  451]],
       dtype=int32),
 array([[1.    , 1.    , 1.    , 0.7044, 0.7044, 0.7044, 0.6588, 0.6588,
         0.6588, 0.6366]], dtype=float32))

In [33]:
close_words[0].shape

(1, 10)

In [34]:
nlp.vocab.strings[close_words[0][0][0]]  #convert the first returned key back to a string

'puppies'

### 1.4 Doc and span similarity

In [35]:
doc1 = nlp("Chaky likes french fries")
doc2 = nlp("Tonson likes sweet potato nuggets")

In [36]:
doc1.similarity(doc2)  #higher means more similar

0.7701631433949315
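
By default, `similarity` is the cosine similarity between the two objects' vectors, and a Doc's vector is the average of its token vectors. A minimal pure-Python sketch of that computation, using made-up 3-dimensional toy vectors (the real ones here have 300 dimensions); `mean_vector` and `cosine` are hypothetical helper names, not spaCy functions.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "token vectors" for two tiny documents (assumed values)
toy_doc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_doc_b = [[1.0, 0.0, 0.0]]

toy_score = cosine(mean_vector(toy_doc_a), mean_vector(toy_doc_b))
print(round(toy_score, 4))  # about 0.7071
```

Because the token vectors are averaged, word order is ignored, and documents that share common words can score high even when their meanings differ.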

In [37]:
#hierarchy: doc ---> sents ---> span ---> tokens

#compute span similarity
span1 = doc1[2:4]
span1

french fries

In [38]:
span2 = doc2[2:5]  #doc2 has 5 tokens, so the slice ends at 5
span2

sweet potato nuggets

In [39]:
span1.similarity(span2)

0.6381438374519348

## 2. Entity Ruler

Named entity recognition (NER) is one of spaCy's most prominent features.

spaCy can do NER in two ways:
1. rule-based (covered first)
2. neural network (covered only briefly)

NER is one of the most common NLP tasks in industry, e.g., for information extraction.

In [40]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [41]:
# #pipes --> any text you pass into nlp
# #goes through a sequential list of pipes, each of which does some processing
# analysis = nlp.analyze_pipes(pretty=True)
# analysis

In [42]:
#add an entity_ruler pipe with add_pipe
ruler = nlp.add_pipe('entity_ruler', before='ner') #placed before ner, so the ruler's entities are set first

#add patterns
patterns = [
                {"label": "LOC", "pattern": "Rangsit"}
            ]

ruler.add_patterns(patterns)
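
The position of a pipe can be checked with `nlp.pipe_names`. The sketch below shows the effect of `before=` on a blank pipeline so that it runs without a model download; the `sentencizer` only stands in for `ner` here, and `demo_nlp` / `demo_ruler` are made-up names.

```python
import spacy

demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("sentencizer")  # stand-in for an existing component such as ner

# Insert the ruler in front of the existing component
demo_ruler = demo_nlp.add_pipe("entity_ruler", before="sentencizer")
demo_ruler.add_patterns([{"label": "LOC", "pattern": "Rangsit"}])

pipe_order = demo_nlp.pipe_names
print(pipe_order)

demo_doc = demo_nlp("AIT is at Rangsit.")
demo_ents = [(ent.text, ent.label_) for ent in demo_doc.ents]
print(demo_ents)
```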

In [43]:
# nlp.analyze_pipes(pretty=True)

In [44]:
text = "AIT is at Rangsit."
doc = nlp(text)

In [45]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Rangsit LOC


### 2.1 More patterns!!!

In [46]:
import spacy

text = "My phone number is (555) 666-5555"

nlp = spacy.blank("en") #blank model (no pipes)

In [47]:
ruler = nlp.add_pipe('entity_ruler')

In [48]:
patterns = [
                {"label": "PHONE_NUMBER", "pattern": [{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "ddd"},
                {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"}]}
           ]

ruler.add_patterns(patterns)

In [49]:
doc = nlp(text)

In [50]:
for ent in doc.ents:
    print(ent.text, ent.label_)

(555) 666-5555 PHONE_NUMBER
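
Each dict in a token pattern describes exactly one token, so it helps to look at how the tokenizer splits the text and what `shape_` each token gets. A small sketch on a blank English pipeline (the variable names are made up):

```python
import spacy

tok_nlp = spacy.blank("en")
tok_doc = tok_nlp("My phone number is (555) 666-5555")

# One row per token: index, text, and orthographic shape (d = digit, x = lowercase letter)
token_rows = [(token.i, token.text, token.shape_) for token in tok_doc]
for row in token_rows:
    print(row)
```

The phone number occupies six tokens, which is why the pattern lists six entries: `(`, a `ddd` token, `)`, another `ddd` token, an optional `-`, and a `dddd` token.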


### 2.2 Matcher

The Matcher is an even more powerful pattern-matching tool.

In [51]:
from spacy.matcher import Matcher #helps us find token patterns

In [52]:
#Email
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

In [53]:
pattern = [{"LIKE_EMAIL": True}]
matcher.add("EMAIL", [pattern])

doc = nlp("Chaky email is chaklam@ait.asia.")
matches = matcher(doc)

In [54]:
matches

[(17587345535198158200, 3, 4)]

In [55]:
nlp.vocab[matches[0][0]].text  #look up the string name of the match ID

'EMAIL'

In [56]:
#proper nouns and longer phrases
with open("../data/wiki_king.txt", "r") as f:
    text = f.read()
    
text

'Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister and activist who became the most visible spokesman and leader in the American civil rights movement from 1955 until his assassination in 1968. King advanced civil rights through nonviolence and civil disobedience, inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi. He was the son of early civil rights activist and minister Martin Luther King Sr.\n\nKing participated in and led marches for blacks\' right to vote, desegregation, labor rights, and other basic civil rights.[1] King led the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, Alabama. King helped organize the 1963 March on Washington, where he delivered his fam

In [57]:
nlp = spacy.load("en_core_web_sm")

In [58]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN"}]  #pos ==> part of speech
matcher.add("PROPER_NOUN_CHAKY", [pattern])
doc = nlp(text)
matches = matcher(doc)
for match in matches[:10]:
    print(match, doc[match[1]:match[2]]) #match[1] start of the span, match[2] end of the span

(2015442650195688329, 0, 1) Martin
(2015442650195688329, 1, 2) Luther
(2015442650195688329, 2, 3) King
(2015442650195688329, 3, 4) Jr.
(2015442650195688329, 6, 7) Michael
(2015442650195688329, 7, 8) King
(2015442650195688329, 8, 9) Jr.
(2015442650195688329, 10, 11) January
(2015442650195688329, 15, 16) April
(2015442650195688329, 23, 24) Baptist


In [59]:
##multi-word proper nouns
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]  #pos ==> part of speech; + means 1 or more
matcher.add("PROPER_NOUN_CHAKY", [pattern])
doc = nlp(text)
matches = matcher(doc)
for match in matches[:10]:
    print(match, doc[match[1]:match[2]]) #match[1] start of the span, match[2] end of the span

(2015442650195688329, 0, 1) Martin
(2015442650195688329, 0, 2) Martin Luther
(2015442650195688329, 1, 2) Luther
(2015442650195688329, 0, 3) Martin Luther King
(2015442650195688329, 1, 3) Luther King
(2015442650195688329, 2, 3) King
(2015442650195688329, 0, 4) Martin Luther King Jr.
(2015442650195688329, 1, 4) Luther King Jr.
(2015442650195688329, 2, 4) King Jr.
(2015442650195688329, 3, 4) Jr.


In [60]:
##how do we keep only one match per phrase?
##greedy="LONGEST" keeps the longest of the overlapping matches
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]  #pos ==> part of speech; + means 1 or more
matcher.add("PROPER_NOUN_CHAKY", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
for match in matches[:10]:
    print(match, doc[match[1]:match[2]]) #match[1] start of the span, match[2] end of the span

(2015442650195688329, 0, 4) Martin Luther King Jr.
(2015442650195688329, 83, 87) Martin Luther King Sr
(2015442650195688329, 128, 132) Southern Christian Leadership Conference
(2015442650195688329, 6, 9) Michael King Jr.
(2015442650195688329, 69, 71) Mahatma Gandhi
(2015442650195688329, 146, 148) Albany Movement
(2015442650195688329, 193, 195) Lincoln Memorial
(2015442650195688329, 10, 11) January
(2015442650195688329, 15, 16) April
(2015442650195688329, 23, 24) Baptist
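
The output above is ordered by match length rather than by position in the text, which is why the next cell sorts it. The idea behind keeping the longest match can be sketched in pure Python: prefer longer spans and drop any span that overlaps one already kept. This is an illustration of the idea only, not spaCy's actual implementation; `keep_longest` is a hypothetical helper, and the input spans are the overlapping `(start, end)` pairs from the earlier output plus a few made-up ones.

```python
def keep_longest(spans):
    """Keep non-overlapping (start, end) spans, preferring longer ones."""
    # Longest first; ties are broken by the earlier start position
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept, used_tokens = [], set()
    for start, end in ordered:
        if not any(i in used_tokens for i in range(start, end)):
            kept.append((start, end))
            used_tokens.update(range(start, end))
    return kept


overlapping = [
    (0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3), (0, 4), (1, 4), (2, 4), (3, 4),
    (6, 7), (6, 8), (7, 8), (6, 9), (7, 9), (8, 9),
    (10, 11),
]
longest_only = keep_longest(overlapping)
print(longest_only)
```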


In [61]:
##sort the matches by start position
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]  #pos ==> part of speech; + means 1 or more
matcher.add("PROPER_NOUN_CHAKY", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)

matches.sort(key = lambda x: x[1])

for match in matches[:10]:
    print(match, doc[match[1]:match[2]]) #match[1] start of the span, match[2] end of the span

(2015442650195688329, 0, 4) Martin Luther King Jr.
(2015442650195688329, 6, 9) Michael King Jr.
(2015442650195688329, 10, 11) January
(2015442650195688329, 15, 16) April
(2015442650195688329, 23, 24) Baptist
(2015442650195688329, 49, 50) King
(2015442650195688329, 69, 71) Mahatma Gandhi
(2015442650195688329, 83, 87) Martin Luther King Sr
(2015442650195688329, 89, 90) King
(2015442650195688329, 113, 114) King


In [62]:
##require a verb right after the proper noun(s)
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}, {"POS": "VERB"}]  #pos ==> part of speech; + means 1 or more
matcher.add("PROPER_NOUN_CHAKY", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)

matches.sort(key = lambda x: x[1])

for match in matches[:10]:
    print(match, doc[match[1]:match[2]]) #match[1] start of the span, match[2] end of the span

(2015442650195688329, 49, 51) King advanced
(2015442650195688329, 89, 91) King participated
(2015442650195688329, 113, 115) King led


### 2.3 Regex - regular expression

In [63]:
import spacy

#Sample text
text = "This is a sample number 5555555."

#Start from a blank English pipeline (no trained components)
nlp = spacy.blank("en")

#add the pipe
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns (source: https://spacy.io/usage/rule-based-matching)
patterns = [
                {
                    "label": "PHONE_NUMBER", 
                    "pattern": [{"TEXT": {"REGEX": r"((\d){7})"}}]  #raw string, so \d is not treated as a string escape
                }
            ]
#add patterns to ruler
ruler.add_patterns(patterns)


#create the doc
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)

5555555 PHONE_NUMBER
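
One thing to watch with this pattern: to my understanding, spaCy applies `REGEX` to each token's text with search semantics, so an unanchored `\d{7}` would also accept a token with more than seven digits. The sketch below uses plain `re` to compare the unanchored pattern with an anchored variant; the sample strings and the names `loose` and `strict` are made up for illustration.

```python
import re

loose = re.compile(r"\d{7}")     # seven digits anywhere in the token
strict = re.compile(r"^\d{7}$")  # the whole token is exactly seven digits

samples = ["5555555", "555555555", "555-5555"]
regex_results = {
    text: (bool(loose.search(text)), bool(strict.search(text)))
    for text in samples
}
for text, (loose_hit, strict_hit) in regex_results.items():
    print(text, loose_hit, strict_hit)
```

If only seven-digit tokens should count as phone numbers, the anchored form is the safer choice for the `REGEX` value.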
