<a href="https://colab.research.google.com/github/cweiqiang/wq.github.io/blob/main/Cheatsheet_SpaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 0: spaCy

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It's designed specifically for production use and helps you build applications that process and "understand" large volumes of text.

Documentation: [spacy.io](https://spacy.io/api/doc)

`$ pip install spacy`

`import spacy`

In [1]:
!pip install spacy



In [2]:
import spacy

# Section 1: Statistical models

## Download statistical models

Statistical models predict part-of-speech tags, dependency labels, named entities and more. Available models are listed at `spacy.io/models`.


In [3]:
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
     |████████████████████████████████| 12.0 MB 7.5 MB/s
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


## Check that your installed models are up to date

In [4]:
!python -m spacy validate

✔ Loaded compatibility table

ℹ spaCy installation: /usr/local/lib/python3.7/dist-packages/spacy

TYPE      NAME             MODEL            VERSION
package   en-core-web-sm   en_core_web_sm   2.2.5   ✔
link      en               en_core_web_sm   2.2.5   ✔



## Loading statistical models

In [5]:
import spacy
nlp = spacy.load("en_core_web_sm") # Load the installed model "en_core_web_sm"
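
`spacy.load` raises an `OSError` when the requested model package is not installed. A minimal sketch of a defensive loader, assuming that falling back to a blank English pipeline (tokenizer only, via `spacy.blank`) is acceptable for your use case. The helper `load_or_blank` and the model name below are hypothetical, introduced only for illustration:

```python
import spacy

def load_or_blank(name, lang="en"):
    """Load an installed model, or fall back to a tokenizer-only pipeline."""
    try:
        return spacy.load(name)
    except OSError:
        # Model package not installed: no tagger, parser or NER available
        return spacy.blank(lang)

# Hypothetical package name that is not installed, to exercise the fallback
nlp_fallback = load_or_blank("en_hypothetical_missing_model")
print(nlp_fallback.lang, nlp_fallback.pipe_names)
```

A blank pipeline still tokenizes text, but attributes such as `token.pos_` or `doc.ents` stay empty until a trained model is loaded.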

# Section 2: Documents and tokens

## Processing text

Processing text with the `nlp` object returns a `Doc` object that holds all information about the tokens, their linguistic features and their relationships.

In [6]:
doc = nlp("This is a text")

## Accessing token attributes

In [7]:
doc = nlp("This is a text")
[token.text for token in doc] #Token texts

['This', 'is', 'a', 'text']
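
Besides `text`, tokens expose lexical attributes that do not depend on a statistical model. A short sketch, assuming a blank English pipeline (`nlp_blank` is a name chosen here so the loaded `nlp` object is not overwritten):

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only, no statistical model needed
doc = nlp_blank("It costs 5 dollars.")

for token in doc:
    # i: index in the Doc; is_alpha / is_punct / like_num: lexical flags
    print(token.i, token.text, token.is_alpha, token.is_punct, token.like_num)
```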

# Section 3: Label explanations

In [8]:
spacy.explain("RB")

'adverb'

In [9]:
spacy.explain("GPE")

'Countries, cities, states'
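
`spacy.explain` looks labels up in a built-in glossary, so it works for tag, dependency and entity labels alike, and it returns `None` for a label it does not know. A small sketch; `NOT_A_REAL_LABEL` is a made-up label used to show the miss case:

```python
import spacy

for label in ["NN", "nsubj", "ORG", "NOT_A_REAL_LABEL"]:
    # Returns a description string, or None if the label is not in the glossary
    print(label, "->", spacy.explain(label))
```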

# Section 4: Spans

## Accessing spans

Span end indices are exclusive, as in Python slicing. So `doc[2:4]` is a span starting at token 2, up to (but not including) token 4.

In [10]:
doc = nlp("This is a text")
span = doc[2:4]
span.text

'a text'
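
The slice bounds are kept on the span itself, which makes the exclusive end easy to check. A sketch assuming a blank English pipeline, since slicing only needs the tokenizer:

```python
import spacy

nlp_blank = spacy.blank("en")
doc = nlp_blank("This is a text")
span = doc[2:4]

print(span.start, span.end)          # token offsets into the Doc
print(len(span))                     # number of tokens: end - start
print([token.i for token in span])   # indices of the tokens inside the span
```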

## Creating a span manually

In [11]:
from spacy.tokens import Span #Import the Span object

doc = nlp("I live in New York") #Create a Doc object

span = Span(doc, 3, 5, label="GPE") #Span for "New York" with label GPE (geopolitical)

span.text

'New York'
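
A manually created span is not attached to the document until it is assigned. One way to do that, sketched here with a blank English pipeline, is to overwrite `doc.ents`, after which the label is also visible on the individual tokens:

```python
import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
doc = nlp_blank("I live in New York")
doc.ents = [Span(doc, 3, 5, label="GPE")]  # overwrite the Doc's entities

print([(ent.text, ent.label_) for ent in doc.ents])
# ent_iob_ marks the position in the entity: "B" begins it, "I" continues it
print([(token.text, token.ent_iob_, token.ent_type_) for token in doc[3:5]])
```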

# Section 5: Linguistic features

Attributes return label IDs. For string labels, use the attributes with an underscore, for example `token.pos_`.
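
The ID and string forms are linked through the vocabulary's string store. The sketch below uses the lexical attribute pair `lower` / `lower_` rather than `pos` / `pos_`, because it assumes a blank pipeline with no tagger; the same underscore convention applies to the predicted attributes:

```python
import spacy

nlp_blank = spacy.blank("en")
token = nlp_blank("Hello world")[0]

print(token.lower, token.lower_)  # integer hash ID, then the string form
# The StringStore maps in both directions
print(nlp_blank.vocab.strings[token.lower])
print(nlp_blank.vocab.strings["hello"] == token.lower)
```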

## Part-of-speech tags - Predicted by Statistical model

In [12]:
doc = nlp("This is a text.") 
[token.pos_ for token in doc] #Coarse-grained part-of-speech tags

['DET', 'AUX', 'DET', 'NOUN', 'PUNCT']

In [13]:
[token.tag_ for token in doc] #Fine-grained part-of-speech tags

['DT', 'VBZ', 'DT', 'NN', '.']

## Syntactic dependencies  - Predicted by Statistical model

In [14]:
doc = nlp("This is a text.")
[token.dep_ for token in doc] #Dependency labels

['nsubj', 'ROOT', 'det', 'attr', 'punct']

In [15]:
[token.head.text for token in doc] #Syntactic head token (governor)

['is', 'is', 'text', 'is', 'is']

## Named entities - Predicted by Statistical model

In [16]:
doc = nlp("Larry Page founded Google")
[(ent.text, ent.label_) for ent in doc.ents] #Text and label of named entity span

[('Larry Page', 'PERSON'), ('Google', 'ORG')]

# Section 6: Pipeline components

Functions that take a Doc object, modify it and return it.

Text --> nlp: (tokenizer) --> (tagger) --> (parser) --> (NER) --> ... --> Doc

## Pipeline information

In [17]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tagger', 'parser', 'ner']

In [18]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f6bf1103450>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f6bf114cd70>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f6bf114c9f0>)]

## Custom components

In [33]:
def custom_component(doc): #Function that modifies the doc and returns it
    print(" ")
    return doc

nlp.add_pipe(custom_component, first=True) #Add the component first in the pipeline

In [34]:
nlp.remove_pipe('custom_component') #Remove the component from the pipeline

('custom_component', <function __main__.custom_component>)

Components can be added first, last (default), or before or after an existing component.
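
The `add_pipe` call above passes the function directly, which is the spaCy v2 convention. In spaCy v3, components are registered under a string name first and then added by that name. A sketch that handles both, assuming a blank English pipeline; `token_counter` and `seen_lengths` are hypothetical names used only for this example:

```python
import spacy
from spacy.language import Language

seen_lengths = []

def token_counter(doc):
    # A component receives a Doc, may modify it, and must return it
    seen_lengths.append(len(doc))
    return doc

nlp_blank = spacy.blank("en")
if spacy.__version__.startswith("2."):
    # v2: pass the callable itself
    nlp_blank.add_pipe(token_counter, name="token_counter", first=True)
else:
    # v3: register the function under a name, then add it by that name
    Language.component("token_counter", func=token_counter)
    nlp_blank.add_pipe("token_counter", first=True)

nlp_blank("This is a text")
print(nlp_blank.pipe_names, seen_lengths)
```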

# Section 7: Visualizing

If you're in a Jupyter notebook, use `displacy.render`. Otherwise, use `displacy.serve` to start a web server and show the visualization in your browser.

In [20]:
from spacy import displacy

## Visualize dependencies

In [35]:
doc = nlp("This is a sentence")
# In a Jupyter notebook, displacy.render(doc, style="dep") is enough;
# in Google Colab, pass jupyter=True explicitly
displacy.render(doc, style="dep", jupyter=True, options={'distance': 90})

## Visualize named entities

In [36]:
doc = nlp("Larry Page founded Google")

displacy.render(doc, style="ent", jupyter=True, options={'distance': 90})
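
Outside a notebook, `displacy.render` can also return the markup as a string when `jupyter=False` is passed, which is useful for saving a visualization. A sketch assuming a blank English pipeline, with the entity set by hand because no NER model is loaded:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
doc = nlp_blank("I live in New York")
doc.ents = [Span(doc, 3, 5, label="GPE")]  # set manually: no NER model loaded

# jupyter=False returns the markup instead of displaying it;
# page=True wraps it in a complete HTML page
html = displacy.render(doc, style="ent", jupyter=False, page=True)
print(len(html))
```

The returned string can be written to an `.html` file and opened in any browser.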

# Section 8: Word vectors and similarity

To use word vectors, you need to install the larger models ending in `md` or `lg`, for example `en_core_web_lg`.

## Comparing similarity

In [37]:
doc1 = nlp("I like cats")
doc2 = nlp("I like dogs")
doc1.similarity(doc2) #Compare 2 documents

0.91780747964994

In [27]:
doc1[2].similarity(doc2[2]) #Compare 2 tokens

0.74924546

In [28]:
doc1[0].similarity(doc2[1:3]) #Compare tokens and spans

0.12487637
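
By default, `similarity` is the cosine similarity of the two objects' vectors, and a `Doc` or `Span` vector is the average of its token vectors. The arithmetic can be sketched in plain Python. The three-dimensional vectors below are made-up toy values, not real spaCy embeddings:

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))

def average(vectors):
    # Component-wise mean, as used for Doc and Span vectors
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Toy vectors (hypothetical values for illustration only)
toy = {"I": [1.0, 0.0, 0.0], "like": [0.0, 1.0, 0.0],
       "cats": [0.0, 0.6, 0.8], "dogs": [0.0, 0.8, 0.6]}

token_sim = cosine_similarity(toy["cats"], toy["dogs"])
doc_sim = cosine_similarity(average([toy[w] for w in ["I", "like", "cats"]]),
                            average([toy[w] for w in ["I", "like", "dogs"]]))
print(round(token_sim, 2), round(doc_sim, 2))
```

In this toy setup the shared words pull the two document vectors closer together than the two differing tokens are on their own.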

## Accessing word vectors

In [38]:
doc = nlp("I like cats")

doc[2].vector #Vector as a numpy array

doc[2].vector_norm #The L2 norm of the token's vector

21.84148

# Section 9: Syntax iterators

## Sentences - Usually needs the dependency parser

In [39]:
doc = nlp("This a sentence. This is another one.")
[sent.text for sent in doc.sents] #doc.sents is a generator that yields sentence spans

['This a sentence.', 'This is another one.']
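
When no dependency parser is available, spaCy's rule-based `sentencizer` component can set sentence boundaries from punctuation instead. A sketch assuming a blank English pipeline; the version check covers the different `add_pipe` conventions in spaCy v2 and v3:

```python
import spacy

nlp_blank = spacy.blank("en")
if spacy.__version__.startswith("2."):
    nlp_blank.add_pipe(nlp_blank.create_pipe("sentencizer"))
else:
    nlp_blank.add_pipe("sentencizer")

doc = nlp_blank("This is a sentence. This is another one.")
print([sent.text for sent in doc.sents])
```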

## Base noun phrases - Needs the tagger and parser

In [40]:
doc = nlp("I have a red car")
#doc.noun_chunks is a generator that yields spans
[chunk.text for chunk in doc.noun_chunks]

['I', 'a red car']

# Section 10: Extension attributes

Custom attributes that are registered on the global Doc, Token and Span classes and become available via the `._` property.

In [41]:
from spacy.tokens import Doc, Token, Span
doc = nlp("The sky over New York is blue")

## Attribute extensions - With default value

In [42]:
# Register custom attribute on Token class
Token.set_extension("is_color", default=False)

# Overwrite the default value of the extension attribute
doc[6]._.is_color = True

## Property extensions - With getter and setter

In [43]:
# Register custom attribute on Doc class
get_reversed = lambda doc: doc.text[::-1]
Doc.set_extension("reversed", getter=get_reversed)

# Compute value of extension attribute with getter
doc._.reversed


'eulb si kroY weN revo yks ehT'

## Method extensions - Callable Method

In [None]:
# Register custom attribute on Span class
has_label = lambda span, label: span.label_ == label
Span.set_extension("has_label", method=has_label)

# Call the method extension via ._ on a labeled span
# (a plain slice like doc[3:5] carries no label)
Span(doc, 3, 5, label="GPE")._.has_label("GPE")

`True`
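
Registering an extension name that already exists raises an error unless `force=True` is passed, which matters when a notebook cell is re-run. A sketch of a getter extension on `Token`, assuming a blank English pipeline; the `COLORS` set and the name `is_color_word` are hypothetical:

```python
import spacy
from spacy.tokens import Token

COLORS = {"red", "green", "blue"}  # hypothetical word list

# force=True overwrites an existing registration, so the cell can be re-run
Token.set_extension("is_color_word", getter=lambda t: t.lower_ in COLORS, force=True)

nlp_blank = spacy.blank("en")
doc = nlp_blank("The sky over New York is blue")
print([token.text for token in doc if token._.is_color_word])
print(Token.has_extension("is_color_word"))
```

Unlike the default-value extension, a getter is computed on access, so it needs no per-token assignment.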

# Section 11: Rule-based matching

## Using the matcher

In [45]:
# Matcher is initialized with the shared vocab
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Each dict represents one token and its attributes
pattern = [{"LOWER": "new"}, {"LOWER": "york"}]

# Add with ID, optional callback and pattern(s)
# (spaCy v2 signature; in spaCy v3 this is matcher.add("CITIES", [pattern]))
matcher.add("CITIES", None, pattern)

# Match by calling the matcher on a Doc object
doc = nlp("I live in New York")
matches = matcher(doc)

# Matches are (match_id, start, end) tuples
for match_id, start, end in matches:
    # Get the matched span by slicing the Doc
    span = doc[start:end]
    print(span.text)


New York
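
The `match_id` in each tuple is the hash of the rule name, and the string store turns it back into the name. A sketch assuming a blank English pipeline and the list-of-patterns form of `Matcher.add` (an assumption about the installed version: this form is taken to be available from spaCy v2.2.2 onward and is the only form in v3):

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")  # LOWER is a lexical attribute, so no model is needed
matcher = Matcher(nlp_blank.vocab)
matcher.add("CITIES", [[{"LOWER": "new"}, {"LOWER": "york"}]])

doc = nlp_blank("I live in New York")
found = [(nlp_blank.vocab.strings[match_id], start, end, doc[start:end].text)
         for match_id, start, end in matcher(doc)]
print(found)
```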


## Token patterns

In [46]:
#"love cats", "loving cats", "loved cats"

pattern1 = [{"LEMMA": "love"}, {"LOWER": "cats"}]

# "10 people", "twenty people"

pattern2 = [{"LIKE_NUM": True}, {"TEXT": "people"}]

# "book", "a cat", "the sea" (noun + optional article)

pattern3 = [{"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
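
Patterns on `LEMMA` or `POS` need a pipeline that predicts those attributes, while `LIKE_NUM` and `TEXT` are lexical and work with the tokenizer alone. A sketch running the second pattern on a blank English pipeline; the rule name `PEOPLE_COUNT` and the sentence are made up for illustration:

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")
matcher = Matcher(nlp_blank.vocab)
matcher.add("PEOPLE_COUNT", [[{"LIKE_NUM": True}, {"TEXT": "people"}]])

doc = nlp_blank("I saw 10 people and twenty people")
counts = [doc[start:end].text for _, start, end in matcher(doc)]
print(counts)
```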

## Operators and quantifiers

Operators and quantifiers can be added to a token dict as the `"OP"` key:

- `!`: Negate pattern and match exactly 0 times
- `?`: Make pattern optional and match 0 or 1 times
- `+`: Require pattern to match 1 or more times
- `*`: Allow pattern to match 0 or more times
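
As a sketch of the `?` operator, the pattern below makes "new" optional in front of "york", assuming a blank English pipeline. The rule name `YORK` and the sentence are made up for illustration:

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")
matcher = Matcher(nlp_blank.vocab)
# "new" is optional (0 or 1 times), "york" is required
matcher.add("YORK", [[{"LOWER": "new", "OP": "?"}, {"LOWER": "york"}]])

doc = nlp_blank("York is old , New York is newer")
texts = {doc[start:end].text for _, start, end in matcher(doc)}
print(texts)
```

The matched texts are collected into a set because the matcher can report overlapping candidates, so the same text may appear more than once.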

# Section 12: Glossary

## Tokenization

Segmenting text into words, punctuation, etc.

## Lemmatization

Assigning the base forms of words, for example:

"was" → "be" or "rats" → "rat".


## Sentence Boundary Detection

Finding and segmenting individual sentences.


## Part-of-speech (POS) Tagging

Assigning word types to tokens like verb or noun.

## Dependency Parsing

Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.

## Named Entity Recognition (NER)

Labeling named "real-world" objects, like persons, companies or locations.

## Text Classification

Assigning categories or labels to a whole document, or parts of a document.

## Statistical model

Process for making predictions based on examples.

## Training

Updating a statistical model with new examples.