# Introduction to spaCy

## Everything You Need to Know about spaCy

https://spacy.io/usage/spacy-101


spaCy uses neural network models, trained on standard NLP datasets, to predict linguistic annotations for a text. Different models suit different use cases: some are larger and more accurate, some are trained on different kinds of data, and some predict different annotations.

### spaCy Features

- Tokenization -- Segmenting text into words, punctuation marks, and so on.
- Part-of-Speech Tagging -- Assigning grammatical word types (noun, verb, ...) to individual tokens in a sentence.
- Dependency Parsing -- Assigning syntactic dependency labels that describe relationships between tokens.
- Lemmatization -- Assigning the base form of a word.
- Sentence Boundary Detection -- Finding and segmenting individual sentences.
- Named Entity Recognition -- Labeling named real-world objects such as people, companies, or locations.
- Similarity -- Comparing words, spans, or documents to determine how similar they are.
- Text Classification -- Assigning categories or labels to a document or part of a document.
- Rule-Based Matching -- Finding token sequences by their text and linguistic annotations, similar to regular expressions.
- Training -- Updating and improving a statistical model's predictions.
- Serialization -- Saving objects to files or byte strings.

## English Models
Downloadable statistical models that spaCy uses to predict and assign linguistic features. Most are CNNs with residual connections, layer normalization, and maxout non-linearity. They cover three tasks:

- tagging
- parsing
- entity recognition

### en_core_web_sm
English multitask CNN that assigns context-specific token vectors, part-of-speech tags, a dependency parse, and named entities. Size: 29 MB.

In [1]:
import spacy

# load the model
nlp = spacy.load('en_core_web_sm')

# assign the model's output to a variable
doc = nlp("My name is Harrison and I do not likely Apple Music.")
print(doc.text, '\n')
for token in doc:
    print(token.text, token.pos_, token.dep_)

My name is Harrison and I do not likely Apple Music. 

My ADJ poss
name NOUN nsubj
is VERB ROOT
Harrison PROPN attr
and CCONJ cc
I PRON nsubj
do VERB aux
not ADV neg
likely ADV conj
Apple PROPN compound
Music PROPN dobj
. PUNCT punct


## Linguistic Annotation 

Load a model with spacy.load(), which returns a Language object that is conventionally named nlp. Calling nlp on a string of text returns a processed Doc whose tokens carry annotations such as the part-of-speech tag and the dependency label.

In [2]:
doc = nlp("Sometimes I cry myself to sleep at night thinking about Donald Trump and Brexit.")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Sometimes ADV advmod
I PRON nsubj
cry VERB ROOT
myself PRON dobj
to PART aux
sleep VERB ccomp
at ADP prep
night NOUN pobj
thinking VERB advcl
about ADP prep
Donald PROPN compound
Trump PROPN pobj
and CCONJ cc
Brexit PROPN conj
. PUNCT punct


## Tokenization
Each document is tokenized using rules specific to its language.
First the raw text is split on whitespace; the tokenizer then processes the resulting substrings from left to right.

For each substring it performs two checks:
1. Does the substring match a tokenizer exception?
2. Can a prefix, suffix, or infix be split off?

        Prefix: Character(s) at the beginning, e.g. $, (, “, ¿.

        Suffix: Character(s) at the end, e.g. km, ), ”, !.

        Infix: Character(s) in between, e.g. -, --, /, ….

If either check matches, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings.
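
The loop described above can be sketched in plain Python. This is a toy illustration, not spaCy's implementation: the `EXCEPTIONS` table, the three regular expressions, and the `toy_tokenize` function are assumptions made up for this example, and the real rules are language-specific and far more extensive.

```python
import re

# Toy rules for illustration only (assumptions, not spaCy's real rule set).
EXCEPTIONS = {"don't": ["do", "n't"]}
PREFIX = re.compile(r'^[$(\["“¿]')                 # one leading character
SUFFIX = re.compile(r'((?<=\d)km|[)\]"”!.,?])$')   # a unit after a digit, or trailing punctuation
INFIX = re.compile(r'(--|[-/…])')                  # separators inside a word


def toy_tokenize(text):
    tokens = []
    for chunk in text.split():                     # split on whitespace first
        prefixes, suffixes = [], []
        while chunk:
            if chunk in EXCEPTIONS:                # check 1: tokenizer exception
                prefixes.extend(EXCEPTIONS[chunk])
                chunk = ""
                break
            match = PREFIX.search(chunk)           # check 2a: split off a prefix
            if match:
                prefixes.append(match.group())
                chunk = chunk[match.end():]
                continue
            match = SUFFIX.search(chunk)           # check 2b: split off a suffix
            if match:
                suffixes.insert(0, match.group())
                chunk = chunk[:match.start()]
                continue
            break                                  # nothing left to strip
        # check 2c: split what remains on infixes, keeping the separators
        middle = [part for part in INFIX.split(chunk) if part]
        tokens.extend(prefixes + middle + suffixes)
    return tokens


print(toy_tokenize("Chimpanzees drink boba-tea in the sunshine."))
print(toy_tokenize("(don't!)"))
```

With these toy rules the first sentence splits into the same nine tokens that the spaCy example in this section prints, but that agreement is a property of this one sentence, not a guarantee for other input.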

In [3]:
import pandas as pd
doc = nlp("Chimpanzees drink boba-tea in the sunshine.")
df = pd.DataFrame([token.text for token in doc], columns = ["Text"])
df

Unnamed: 0,Text
0,Chimpanzees
1,drink
2,boba
3,-
4,tea
5,in
6,the
7,sunshine
8,.


## Parts-of-Speech Tags and Dependencies

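After tokenization, spaCy can parse and tag the Doc. The statistical model predicts which tag or label most likely applies in this context.

Linguistic annotations are available as Token attributes. To reduce memory use, spaCy encodes all strings as hash values. The hash itself is a one-way conversion, so the original string is recovered by looking the hash up in the StringStore. Attribute names ending in an underscore (for example pos_) return the readable string, while the plain name (pos) returns the integer ID.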



In [4]:
def df_build(text):
    doc  = nlp(text)
    outr = []
    for token in doc:
        inr = [token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop]
        outr.append(inr)
    df = pd.DataFrame(outr, columns=["text", "lemma", "pos", "tag", "dep", "shape", "isalpha", "isstop"])
    return df

df = df_build("R is a strange coding language. Strange yet speedy.")
df

Unnamed: 0,text,lemma,pos,tag,dep,shape,isalpha,isstop
0,R,r,NOUN,NN,nsubj,X,True,False
1,is,be,VERB,VBZ,ROOT,xx,True,True
2,a,a,DET,DT,det,x,True,True
3,strange,strange,ADJ,JJ,amod,xxxx,True,False
4,coding,coding,NOUN,NN,compound,xxxx,True,False
5,language,language,NOUN,NN,attr,xxxx,True,False
6,.,.,PUNCT,.,punct,.,False,False
7,Strange,strange,PROPN,NNP,ROOT,Xxxxx,True,False
8,yet,yet,ADV,RB,cc,xxx,True,True
9,speedy,speedy,ADJ,JJ,conj,xxxx,True,False


In [5]:
print(spacy.explain("NN"))
print(spacy.explain("VBZ"))
print(spacy.explain("JJ"))

noun, singular or mass
verb, 3rd person singular present
adjective


## Named Entities

Named entities are available through the ents attribute of a Doc. Each entity is a Span with character offsets (start_char, end_char) and a label.

In [6]:
doc = nlp("Chicken nuggets are the best part of McDonald's found in the US. Harrison eats them everyday.")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

McDonald's 37 47 ORG
US 61 63 GPE
Harrison 65 73 PERSON
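
The two numbers printed for each entity are character offsets into the original string. As a quick check that needs no model, slicing the text with the offsets from the output above recovers each entity's text. The `entities` list below is copied by hand from that output and is not produced by spaCy in this snippet.

```python
text = ("Chicken nuggets are the best part of McDonald's found in the US. "
        "Harrison eats them everyday.")

# (start_char, end_char, label) copied by hand from the output above
entities = [(37, 47, "ORG"), (61, 63, "GPE"), (65, 73, "PERSON")]

# end_char is exclusive, so it works directly as a slice bound
recovered = [(text[start:end], label) for start, end, label in entities]
for span_text, label in recovered:
    print(span_text, label)
```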


## Word Vectors and Similarity

Similarity compares two objects and predicts how similar they are, which is useful for tasks such as recommendations or flagging duplicates. Each Doc, Span, and Token comes with a .similarity() method that compares it with another object.

Word embeddings can be generated with an algorithm such as word2vec. The small models (names ending in sm) do not ship with word vectors, so use a medium or large model for similarity.

By default, a Doc's vector is the average of its tokens' vectors.

- Text: The original token text.
- has vector: Does the token have a vector representation?
- Vector norm: The L2 norm of the token's vector (the square root of the sum of the squared values).
- OOV: Is the token out-of-vocabulary?
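
To make these quantities concrete, here is a minimal sketch in plain Python using made-up 3-dimensional vectors (the vectors printed later in this section have 300 dimensions). The helper names `l2_norm`, `cosine_similarity`, and `mean_vector` and the values in `toy_vectors` are hypothetical. Cosine similarity is used because it is spaCy's default similarity measure.

```python
import math


def l2_norm(vec):
    # Square root of the sum of the squared values.
    return math.sqrt(sum(v * v for v in vec))


def cosine_similarity(a, b):
    # Dot product divided by the product of the two L2 norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


def mean_vector(vectors):
    # Element-wise average, which is how a document vector is built by default.
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Made-up vectors for illustration; not taken from any spaCy model.
toy_vectors = {
    "chicken": [1.0, 2.0, 0.0],
    "banana": [0.0, 2.0, 1.0],
    "spoon": [2.0, 0.0, 1.0],
}

doc_vector = mean_vector(list(toy_vectors.values()))
print(l2_norm([3.0, 4.0]))
print(cosine_similarity(toy_vectors["chicken"], toy_vectors["banana"]))
print(doc_vector)
```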

In [14]:
nlp2 = spacy.load('en_core_web_md')  # make sure to use larger model!
doc = nlp2('chicken banana spoon')
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))
    



chicken chicken 1.0
chicken banana 0.50540304
chicken spoon 0.46489623
banana chicken 0.50540304
banana banana 1.0
banana spoon 0.44800603
spoon chicken 0.46489623
spoon banana 0.44800603
spoon spoon 1.0


In [24]:
doc = nlp2('chicken banana spoon')
for token1 in doc:
    #vectors are long, just showing shape
    print(token1.vector.shape)
print("Doc Vector: ")
print(doc.vector[:3], " ... ", doc.vector[-3:])

(300,)
(300,)
(300,)
Doc Vector: 
[-0.17136002 -0.18593933  0.42900333]  ...  [-0.69952327  0.03642799  0.31438032]


## Pipelines

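Calling nlp on a text first tokenizes it to produce a Doc, which is then passed through a pipeline of components such as the POS tagger, dependency parser, and entity recognizer. Each component is a separate statistical model that adds its predictions to the Doc and passes it on to the next component.

The pipeline can be customized: components can be disabled, added, or reordered.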
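
Conceptually, a pipeline is an ordered list of named components, each of which takes the document, adds annotations, and hands it on. The sketch below mimics that idea in plain Python with a dictionary standing in for the Doc. `toy_tokenizer`, `toy_tagger`, `toy_entity_marker`, and `run_pipeline` are hypothetical stand-ins rather than spaCy components, and the `disable` argument only imitates the idea of switching a component off.

```python
def toy_tokenizer(text):
    # Stand-in for the tokenizer: creates the "doc" that the components annotate.
    return {"text": text, "tokens": text.split(), "annotations": {}}


def toy_tagger(doc):
    # Pretend tagger: marks tokens that are purely numeric.
    doc["annotations"]["is_number"] = [tok.isdigit() for tok in doc["tokens"]]
    return doc


def toy_entity_marker(doc):
    # Pretend entity recognizer: flags capitalized tokens.
    doc["annotations"]["capitalized"] = [tok[:1].isupper() for tok in doc["tokens"]]
    return doc


PIPELINE = [("tagger", toy_tagger), ("ner", toy_entity_marker)]


def run_pipeline(text, pipeline=PIPELINE, disable=()):
    doc = toy_tokenizer(text)
    for name, component in pipeline:
        if name in disable:
            continue  # skip components that were switched off
        doc = component(doc)
    return doc


full = run_pipeline("Harrison ate 20 nuggets")
partial = run_pipeline("Harrison ate 20 nuggets", disable=("ner",))
print(sorted(full["annotations"]))
print(sorted(partial["annotations"]))
```

In spaCy itself, the component names of a loaded pipeline are listed in `nlp.pipe_names`.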


## Vocab, Hashes, and Lexemes

spaCy stores shared data in a vocabulary, the Vocab, to save memory. The Vocab is shared by multiple documents, and string values are encoded as hash values.

- Token: A word, punctuation mark etc. in context, including its attributes, tags and dependencies.
- Lexeme: A "word type" with no context. Includes the word shape and flags, e.g. if it's lowercase, a digit or punctuation.
- Doc: A processed container of tokens in context.
- Vocab: The collection of lexemes.
- StringStore: The dictionary mapping hash values to strings, for example 3197928453018144401 → "coffee".

A word used many times across multiple documents is stored only once. The StringStore converts between strings and hashes in both directions.
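
The idea of storing each string once and referring to it by hash can be sketched with a small two-way lookup table. This is an illustration only: `ToyStringStore` is a hypothetical class, and it uses `hashlib.blake2b` as a stand-in hash function, so its hash values will not match the ones spaCy produces.

```python
import hashlib


class ToyStringStore:
    """Hypothetical two-way table: string -> 64-bit hash, hash -> string."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_string(string):
        # Stand-in hash (an assumption for this sketch, not spaCy's hash function).
        digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, string):
        key = self.hash_string(string)
        self._by_hash[key] = string  # adding the same string again changes nothing
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.hash_string(key)  # string -> hash can always be computed
        return self._by_hash[key]         # hash -> string needs the stored entry

    def __len__(self):
        return len(self._by_hash)


store = ToyStringStore()
# "coffee" occurs in both texts but ends up stored only once.
for text in ["I love coffee", "coffee is hot"]:
    for word in text.split():
        store.add(word)

coffee_hash = store["coffee"]
print(coffee_hash, store[coffee_hash], len(store))
```

The asymmetry in `__getitem__` mirrors the one-way nature of hashing: a hash can be computed from any string, but getting the string back requires that it was stored.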

In [25]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I love coffee')
print(doc.vocab.strings[u'coffee'])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'

3197928453018144401
coffee


In [26]:
doc = nlp(u'I love coffee')
for word in doc:
    lexeme = doc.vocab[word.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
          lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)

I 4690420944186131903 X I I True False True en
love 3702023516439754181 xxxx l ove True False False en
coffee 3197928453018144401 xxxx c fee True False False en


- Text: The original text of the lexeme.
- Orth: The hash value of the lexeme.
- Shape: The abstract word shape of the lexeme.
- Prefix: By default, the first letter of the word string.
- Suffix: By default, the last three letters of the word string.
- is alpha: Does the lexeme consist of alphabetic characters?
- is digit: Does the lexeme consist of digits?
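
The shape, prefix, and suffix values in the output above follow simple string rules. The sketch below approximates them in plain Python. `toy_shape` is a hypothetical helper written to reproduce the shapes shown in this document (letters become `X` or `x`, digits become `d`, and a run of the same shape character is cut off after four); it is not spaCy's own function.

```python
def toy_shape(text):
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and other characters are kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        if run <= 4:           # keep at most four repeats of the same shape character
            shape.append(shape_char)
    return "".join(shape)


for word in ["I", "love", "coffee", "Strange"]:
    # prefix = first character, suffix = last three characters
    print(word, toy_shape(word), word[0], word[-3:])
```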