# Review spaCy

General package details.

References:
 * [spaCy Guides](https://spacy.io/usage)
 * [spaCy Examples](https://spacy.io/usage/examples)
 * [Real Python](https://realpython.com/natural-language-processing-spacy-python/)


In [1]:
import spacy

In [2]:
# Print the installed spaCy version
print(spacy.__version__)

2.2.3


## Linguistic Features

Pass raw text to a loaded pipeline and you get back a **Doc** object that carries a variety of linguistic annotations.
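As a minimal sketch of what a Doc looks like, the snippet below tokenizes a short sentence. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) instead of the `en_core_web_sm` model used in the rest of this notebook, so it runs without a model download; the sample sentence is made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) stands in for
# en_core_web_sm so the sketch runs without a model download.
nlp = spacy.blank("en")
text = "spaCy turns raw text into a Doc."
doc = nlp(text)

# A Doc is a sequence of Token objects.
tokens = [token.text for token in doc]
print(tokens)

# Tokenization is non-destructive: each token keeps its trailing
# whitespace, so the original string can be rebuilt from the tokens.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)

# Lexical attributes such as is_alpha and is_punct need no trained model.
print([(token.text, token.is_alpha, token.is_punct) for token in doc])
```

Annotations that depend on trained components (part-of-speech tags, dependencies, entities) stay empty with a blank pipeline; they are filled in once a model such as `en_core_web_sm` is loaded.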

## Part-of-speech tagging

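After tokenization, spaCy can parse and tag a given Doc. Like many NLP libraries, spaCy encodes all strings to [hash values](https://en.wikipedia.org/wiki/Hash_function) to reduce memory usage and improve efficiency. To get the readable string for an attribute, use its name with a trailing underscore (for example `token.pos_` rather than `token.pos`).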
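The string-to-hash mapping can be inspected through the vocabulary's string store. The sketch below is illustrative only: it assumes a blank English pipeline so it runs without a model download, and the sample sentence is invented.

```python
import spacy

# Assumption: a blank English pipeline keeps the sketch self-contained;
# the notebook itself uses en_core_web_sm.
nlp = spacy.blank("en")
doc = nlp("I love coffee")

# Strings seen by the pipeline are stored once in the shared StringStore,
# which can be queried in both directions.
coffee_hash = nlp.vocab.strings["coffee"]     # string -> integer hash
coffee_text = nlp.vocab.strings[coffee_hash]  # integer hash -> string

# Tokens expose both forms: `orth` is the hash, `orth_` the readable text.
token = doc[2]
print(token.orth, token.orth_)
print(token.orth == coffee_hash, coffee_text)
```

The same pattern holds for the other attributes printed later in this notebook: `lemma`/`lemma_`, `pos`/`pos_`, `dep`/`dep_`, and so on.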

In [3]:
nlp = spacy.load("en_core_web_sm")

In [4]:
with open("data/glossary_ML.txt") as f:
    text = f.read()

In [5]:
print(text)

A/B testing
A statistical way of comparing two (or more) techniques, typically an incumbent against a new rival. A/B testing aims to determine not only which technique performs better but also to understand whether the difference is statistically significant. A/B testing usually considers only two techniques using one measurement, but it can be applied to any finite number of techniques and measures.

accuracy
The fraction of predictions that a classification model got right. In multi-class classification, accuracy is defined as follows:

In binary classification, accuracy has the following definition:

See true positive and true negative.

action
#rl
In reinforcement learning, the mechanism by which the agent transitions between states of the environment. The agent chooses the action by using a policy.

activation function
A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous layer and then generates and passes an output value 

In [11]:
doc = nlp(text)

In [8]:
for token in doc:
    print(token.text, '--', token.lemma_, '--', token.pos_, '--', token.tag_, '--', token.dep_, '--',
          token.shape_, '--', token.is_alpha, '--', token.is_stop)

A -- a -- DET -- DT -- det -- X -- True -- True
/ -- / -- SYM -- SYM -- punct -- / -- False -- False
B -- b -- NOUN -- NN -- compound -- X -- True -- False
testing -- testing -- NOUN -- NN -- ROOT -- xxxx -- True -- False

 -- 
 -- SPACE -- _SP --  -- 
 -- False -- False
A -- a -- DET -- DT -- det -- X -- True -- True
statistical -- statistical -- ADJ -- JJ -- amod -- xxxx -- True -- False
way -- way -- NOUN -- NN -- npadvmod -- xxx -- True -- False
of -- of -- ADP -- IN -- prep -- xx -- True -- True
comparing -- compare -- VERB -- VBG -- pcomp -- xxxx -- True -- False
two -- two -- NUM -- CD -- nummod -- xxx -- True -- True
( -- ( -- PUNCT -- -LRB- -- punct -- ( -- False -- False
or -- or -- CCONJ -- CC -- cc -- xx -- True -- True
more -- more -- ADJ -- JJR -- conj -- xxxx -- True -- True
) -- ) -- PUNCT -- -RRB- -- punct -- ) -- False -- False
techniques -- technique -- NOUN -- NNS -- dobj -- xxxx -- True -- False
, -- , -- PUNCT -- , -- punct -- , -- False -- False
typically -- typi

In [10]:
for chunk in doc.noun_chunks:
    print(chunk.text, '--', chunk.root.text, '--', chunk.root.dep_, '--',
          chunk.root.head.text)

A/B testing -- testing -- ROOT -- testing
two (or more) techniques -- techniques -- dobj -- comparing
typically an incumbent -- incumbent -- appos -- techniques
a new rival -- rival -- pobj -- against
A/B testing -- testing -- nsubj -- aims
not only which technique -- technique -- nsubj -- performs
the difference -- difference -- nsubj -- is
A/B testing -- testing -- nsubj -- considers
only two techniques -- techniques -- dobj -- considers
one measurement -- measurement -- dobj -- using
it -- it -- nsubjpass -- applied
any finite number -- number -- pobj -- to
techniques -- techniques -- pobj -- of
measures -- measures -- conj -- techniques
accuracy -- accuracy -- ROOT -- accuracy
The fraction -- fraction -- ROOT -- fraction
predictions -- predictions -- pobj -- of
a classification model -- model -- nsubj -- got
multi-class classification -- classification -- pobj -- In
accuracy -- accuracy -- nsubjpass -- defined
binary classification -- classification -- pobj -- In
accuracy -- accura

In [13]:
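# Find verbs that have a subject by starting "from below": check each
# subject token and look at its head, instead of scanning every verb's children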

from spacy.symbols import nsubj, VERB


verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)
print(verbs)

{performs, appears, label, jumps, performing, offers, identifies, refer, makes, begun, solve, predicts, interact, learns, chop, determines, uses, increases, circumscribes, identifies, affect, needs, helps, captures, fall, translates, relies, finds, account, get, prohibits, translate, represents, permit, permit, enable, winning, include, takes, accounts, provide, quantify, predicts, learn, determine, dataset, chooses, include, enables, apply, used, favors, combines, requires, predicts, organizes, predicts, aims, learn, match, permit, detects, having, serve, creates, superimposes, evaluates, chooses, See, misclassfying, enable, rescales, achieve, help, add, seeks, considers, considers, got, outputs, appears, demonstrates}


## Disabling the parser

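The parser can be skipped when loading a model by name, when loading a pipeline from disk, or for a single call: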

```python
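import spacy
from spacy.lang.en import English

nlp = spacy.load("en_core_web_sm", disable=["parser"])   # skip the parser when loading by name
nlp = English().from_disk("/model", disable=["parser"])  # skip it when loading from a path
doc = nlp("I don't want parsed", disable=["parser"])     # skip it for a single call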
```

## Named Entity Recognition

In [15]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# token level
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
print(ent_san)  # ['San', 'B', 'GPE']
print(ent_francisco)  # ['Francisco', 'I', 'GPE']

[('San Francisco', 0, 13, 'GPE')]
['San', 'B', 'GPE']
['Francisco', 'I', 'GPE']


In [16]:
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# the model didn't recognise "fb" as an entity :(

fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity
doc.ents = list(doc.ents) + [fb_ent]

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('After', ents)
# [('fb', 0, 2, 'ORG')] 🎉

Before []
After [('fb', 0, 2, 'ORG')]
