# Review spaCy

General Package Details. 

References:
 * [spaCy Guides](https://spacy.io/usage)
 * [spaCy Examples](https://spacy.io/usage/examples)
 * [Real Python](https://realpython.com/natural-language-processing-spacy-python/)


In [2]:
import spacy

In [3]:
#Version
print(spacy.__version__)

2.2.3


## Linguistic Features

 * POS Tagging
 * Dependency Parse
 * Named Entities
 * Entity Linking
 * Tokenization
 * Merging and Splitting
 * Sentence Segmentation
 
Pass in raw text and get back a **Doc** object that comes with a variety of annotations.
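A minimal sketch of how a Doc behaves as a container, assuming a blank English pipeline (`spacy.blank("en")`, tokenizer only) rather than the `en_core_web_sm` model used in the rest of this notebook. The example sentence is made up for illustration.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# A blank pipeline contains only the tokenizer, so no model download is needed.
nlp = spacy.blank("en")
doc = nlp("Put in raw text and get back a Doc object.")

first_token = doc[0]   # indexing a Doc yields a Token
first_span = doc[0:3]  # slicing a Doc yields a Span (a view of the Doc)

print(type(doc).__name__, len(doc))
# token.i is the token index, token.idx the character offset in the text
print(first_token.text, first_token.i, first_token.idx)
print(first_span.text)
```

Annotations such as `pos_` or `dep_` would stay empty here, because a blank pipeline has no tagger or parser.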

## Part-of-speech tagging

After tokenization, spaCy can parse and tag a given Doc. Like many NLP libraries, spaCy encodes all strings to [hash values](https://en.wikipedia.org/wiki/Hash_function) to reduce memory usage and improve efficiency. 
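The string-to-hash mapping can be inspected through the vocabulary's string store. The sketch below assumes a blank English pipeline and an illustrative sentence; it only shows that the lookup works in both directions.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I love coffee")

coffee_hash = nlp.vocab.strings["coffee"]     # string -> 64-bit hash
coffee_text = nlp.vocab.strings[coffee_hash]  # hash -> string

print(coffee_hash, coffee_text)
# Attributes without a trailing underscore hold the hash,
# attributes with an underscore hold the readable string.
print(doc[2].orth == coffee_hash, doc[2].orth_ == coffee_text)
```

This is why the loop in the next cells prints `token.lemma_`, `token.pos_` and so on: the underscore versions return text instead of integer IDs.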

In [4]:
nlp=spacy.load("en_core_web_sm")

In [5]:
text=open("data/GM_News.txt").read()

In [6]:
print(text)

GM said Monday it will invest $2.2 billion into its Detroit-Hamtramck assembly plant to produce all-electric trucks and SUVs as well as a self-driving vehicle unveiled by its subsidiary Cruise. The automaker will invest an additional $800 million in supplier tooling and other projects related to the launch of the new electric trucks.

GM will kick off this new program with an all-electric pickup truck that will go into production in late 2021. The Cruise Origin, the electric self-driving shuttle designed for ride sharing, will be the second vehicle to go into production at the Detroit area plant.

Detroit-Hamtramck will be GM’s first fully-dedicated electric vehicle assembly plant. When fully operational, the plant will create more than 2,200 jobs, according to GM.


In [7]:
#Definition of Doc
doc=nlp(text)

In [8]:
for token in doc:
    print(token.text,'--',token.lemma_,'--', token.pos_,'--', token.tag_,'--', token.dep_,'--',
            token.shape_,'--', token.is_alpha,'--', token.is_stop)

GM -- GM -- PROPN -- NNP -- nsubj -- XX -- True -- False
said -- say -- VERB -- VBD -- ROOT -- xxxx -- True -- False
Monday -- Monday -- PROPN -- NNP -- npadvmod -- Xxxxx -- True -- False
it -- -PRON- -- PRON -- PRP -- nsubj -- xx -- True -- True
will -- will -- VERB -- MD -- aux -- xxxx -- True -- True
invest -- invest -- VERB -- VB -- ccomp -- xxxx -- True -- False
$ -- $ -- SYM -- $ -- quantmod -- $ -- False -- False
2.2 -- 2.2 -- NUM -- CD -- compound -- d.d -- False -- False
billion -- billion -- NUM -- CD -- dobj -- xxxx -- True -- False
into -- into -- ADP -- IN -- prep -- xxxx -- True -- True
its -- -PRON- -- DET -- PRP$ -- poss -- xxx -- True -- True
Detroit -- Detroit -- PROPN -- NNP -- compound -- Xxxxx -- True -- False
- -- - -- PUNCT -- HYPH -- punct -- - -- False -- False
Hamtramck -- Hamtramck -- PROPN -- NNP -- compound -- Xxxxx -- True -- False
assembly -- assembly -- NOUN -- NN -- compound -- xxxx -- True -- False
plant -- plant -- NOUN -- NN -- pobj -- xxxx -- True -

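This output is worth a closer look. For each token, the spaCy model reports its linguistic properties, for example:
```
said -- say -- VERB -- VBD -- ROOT -- xxxx -- True -- False
```

The columns are, in order: the token text, its lemma, the coarse part-of-speech tag, the fine-grained tag, the dependency label, the word shape, whether the token is alphabetic, and whether it is a stop word.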
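The tag and dependency labels are abbreviations. As a small sketch, `spacy.explain` can be used to look up a short description for a label; the labels listed here are copied from the token output above.

```python
import spacy

# Labels taken from the token table printed above.
labels = ["PROPN", "NNP", "nsubj", "VERB", "VBD"]

# spacy.explain returns a short description, or None for unknown labels.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(label, "--", description)
```

No model is needed for this lookup, since the descriptions come from a glossary shipped with the library.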

In [9]:
# Noun chunks: Noun chunks are “base noun phrases”
for chunk in doc.noun_chunks:
    print(chunk.text,'--', chunk.root.text,'--', chunk.root.dep_,'--',
            chunk.root.head.text)

GM -- GM -- nsubj -- said
it -- it -- nsubj -- invest
its Detroit-Hamtramck assembly plant -- plant -- pobj -- into
all-electric trucks -- trucks -- dobj -- produce
SUVs -- SUVs -- conj -- trucks
a self-driving vehicle -- vehicle -- dobj -- produce
its subsidiary -- subsidiary -- pobj -- by
Cruise -- Cruise -- appos -- subsidiary
The automaker -- automaker -- nsubj -- invest
supplier tooling -- tooling -- pobj -- in
other projects -- projects -- conj -- tooling
the launch -- launch -- pobj -- to
the new electric trucks -- trucks -- pobj -- of
GM -- GM -- nsubj -- kick
this new program -- program -- dobj -- kick
an all-electric pickup truck -- truck -- pobj -- with
production -- production -- pobj -- into
The Cruise Origin -- Origin -- nsubj -- be
the electric self-driving shuttle -- shuttle -- appos -- Origin
ride sharing -- sharing -- pobj -- for
the second vehicle -- vehicle -- attr -- be
production -- production -- pobj -- into
the Detroit area plant -- plant -- pobj -- at
Detroit-H

In [10]:
# Visualizing dependencies

from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
# Since this is an interactive Jupyter environment, we can use displacy.render here
displacy.render(doc, style='dep')

In [11]:
# Finding a verb with a subject from below — good

from spacy.symbols import nsubj, VERB


verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)
print(verbs)

{shift}


## Disabling the parser

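Examples:

```python
from spacy.lang.en import English

nlp = spacy.load("en_core_web_sm", disable=["parser"])   # when loading a model by name
nlp = English().from_disk("/model", disable=["parser"])  # when loading a model from disk
doc = nlp("I don't want parsed", disable=["parser"])     # for a single call
```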

## Named Entity Recognition

spaCy features a fast statistical **entity recognition system** that assigns labels to contiguous spans of tokens. The default model identifies a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and **update the model with new examples**.

In [12]:
# Reload the news text for the entity examples
text=open("data/GM_News.txt").read()
print(text)

GM said Monday it will invest $2.2 billion into its Detroit-Hamtramck assembly plant to produce all-electric trucks and SUVs as well as a self-driving vehicle unveiled by its subsidiary Cruise. The automaker will invest an additional $800 million in supplier tooling and other projects related to the launch of the new electric trucks.

GM will kick off this new program with an all-electric pickup truck that will go into production in late 2021. The Cruise Origin, the electric self-driving shuttle designed for ride sharing, will be the second vehicle to go into production at the Detroit area plant.

Detroit-Hamtramck will be GM’s first fully-dedicated electric vehicle assembly plant. When fully operational, the plant will create more than 2,200 jobs, according to GM.


In [15]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc=nlp(text)


for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)


GM 0 2 ORG
Monday 8 14 DATE
$2.2 billion 30 42 MONEY
Detroit-Hamtramck 52 69 ORG
an additional $800 million 220 246 MONEY
GM 337 339 ORG
late 2021 437 446 DATE
second 540 546 ORDINAL
Detroit 584 591 GPE
Detroit-Hamtramck 605 622 ORG
GM 631 633 ORG
first 636 641 ORDINAL
more than 2,200 737 752 CARDINAL
GM 772 774 ORG


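The spaCy model detected these entities, but the predictions can still be improved. For example, Detroit-Hamtramck is labelled ORG even though the text uses it as the name of a plant.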

In [17]:
displacy.render(doc, style="ent")

## Setting entity annotations

In [18]:
# Example: Setting entity annotations
from spacy.tokens import Span

doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('#'*50)
print('Before', ents)
print('#'*50)
# the model didn't recognise "fb" as an entity :(

fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity
doc.ents = list(doc.ents) + [fb_ent]

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('After', ents)
print('#'*50)
# [('fb', 0, 2, 'ORG')] 🎉

##################################################
Before []
##################################################
After [('fb', 0, 2, 'ORG')]
##################################################


In [19]:
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

#nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.render(doc, style="ent")

## Entity Linking

## Tokenization

Tokenization is the task of splitting a text into meaningful segments, called tokens. The input to the tokenizer is a unicode text, and the output is a Doc object. To construct a Doc object, you need a Vocab instance, a sequence of word strings, and optionally a sequence of spaces booleans, which allow you to maintain alignment of the tokens into the original string.

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on.
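As a sketch of the manual construction described above, a Doc can be built directly from a Vocab, a list of words and a list of `spaces` booleans. The words are adapted from the news text, and the empty `Vocab()` is an assumption made to keep the example self-contained.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["GM", "will", "invest", "$", "2.2", "billion", "."]
# One boolean per word: True if the token is followed by a space.
spaces = [True, True, True, False, True, False, False]

doc = Doc(Vocab(), words=words, spaces=spaces)

print(doc.text)
# token.idx is the character offset of each token in doc.text
print([(token.text, token.idx) for token in doc])
```

Because the `spaces` flags are stored with the tokens, `doc.text` reproduces the original string and every token can be mapped back to its character offset.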

In [21]:
text=open("data/GM_News.txt").read()
doc=nlp(text)
for token in doc:
    print(token.text)

GM
said
Monday
it
will
invest
$
2.2
billion
into
its
Detroit
-
Hamtramck
assembly
plant
to
produce
all
-
electric
trucks
and
SUVs
as
well
as
a
self
-
driving
vehicle
unveiled
by
its
subsidiary
Cruise
.
The
automaker
will
invest
an
additional
$
800
million
in
supplier
tooling
and
other
projects
related
to
the
launch
of
the
new
electric
trucks
.



GM
will
kick
off
this
new
program
with
an
all
-
electric
pickup
truck
that
will
go
into
production
in
late
2021
.
The
Cruise
Origin
,
the
electric
self
-
driving
shuttle
designed
for
ride
sharing
,
will
be
the
second
vehicle
to
go
into
production
at
the
Detroit
area
plant
.



Detroit
-
Hamtramck
will
be
GM
’s
first
fully
-
dedicated
electric
vehicle
assembly
plant
.
When
fully
operational
,
the
plant
will
create
more
than
2,200
jobs
,
according
to
GM
.


## Merging and splitting

In [22]:
doc = nlp("I live in New York")
print("Before:", [token.text for token in doc])

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5], attrs={"LEMMA": "new york"})
print("After:", [token.text for token in doc])

Before: ['I', 'live', 'in', 'New', 'York']
After: ['I', 'live', 'in', 'New York']
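The cell above covers merging. For the opposite direction, the following is a hedged sketch of `retokenizer.split`, assuming a blank English pipeline and an input where "NewYork" arrives as a single token. The head assignments are illustrative choices, not model predictions.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I live in NewYork")
before = [token.text for token in doc]

with doc.retokenize() as retokenizer:
    # Each new subtoken needs a head: "New" attaches to the second
    # subtoken ("York"), and "York" attaches to the existing token "in".
    heads = [(doc[3], 1), doc[2]]
    retokenizer.split(doc[3], ["New", "York"], heads=heads)

after = [token.text for token in doc]
print("Before:", before)
print("After:", after)
```

The subtoken texts have to join back into the original token text ("New" + "York" == "NewYork"), otherwise spaCy raises an error.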


## Sentence Segmentation

A Doc object’s sentences are available via the Doc.sents property. Unlike other libraries, spaCy uses the dependency parse to **determine sentence boundaries**. This is usually more accurate than a rule-based approach, but it also means you’ll need a statistical model and accurate predictions. If your **texts are closer to general-purpose news or web text, this should work well out-of-the-box**. For **social media or conversational text** that doesn’t follow the same rules, your application may benefit from a **custom rule-based implementation.** You can either use the built-in Sentencizer or plug an entirely custom rule-based function into your processing pipeline.
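A minimal sketch of the rule-based alternative, using the built-in Sentencizer on a blank pipeline with no parser. The version check is an assumption added so the snippet runs on both spaCy 2.x (the version used in this notebook) and 3.x, where `add_pipe` takes a component name. The two sentences are made up for illustration.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only: no parser, so no sentence boundaries yet

# add_pipe expects a component object in spaCy 2.x and a factory name in 3.x.
if spacy.__version__.startswith("2."):
    nlp.add_pipe(nlp.create_pipe("sentencizer"))
else:
    nlp.add_pipe("sentencizer")

doc = nlp("GM will build electric trucks. Production starts in late 2021.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

The Sentencizer splits on sentence-final punctuation only, so it is fast and needs no statistical model, but it does not use the syntactic cues that the parser-based approach relies on.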

In [24]:
text=open("data/GM_News.txt").read()
doc=nlp(text)

for sent in doc.sents:
    print(sent.text)
    print('\n')

GM said Monday it will invest $2.2 billion into its Detroit-Hamtramck assembly plant to produce all-electric trucks and SUVs as well as a self-driving vehicle unveiled by its subsidiary Cruise.


The automaker will invest an additional $800 million in supplier tooling and other projects related to the launch of the new electric trucks.




GM will kick off this new program with an all-electric pickup truck that will go into production in late 2021.


The Cruise Origin, the electric self-driving shuttle designed for ride sharing, will be the second vehicle to go into production at the Detroit area plant.




Detroit-Hamtramck will be GM’s first fully-dedicated electric vehicle assembly plant.


When fully operational, the plant will create more than 2,200 jobs, according to GM.




## Custom rule-based strategy

In [27]:
text = "this is a sentence...hello...and another sentence."
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("#"*50)
print("Before:", [sent.text for sent in doc.sents])

def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i+1].is_sent_start = True
    return doc

nlp.add_pipe(set_custom_boundaries, before="parser")
doc = nlp(text)
print("#"*50)
print("After:", [sent.text for sent in doc.sents])

##################################################
Before: ['this is a sentence...', 'hello...and another sentence.']
##################################################
After: ['this is a sentence...', 'hello...', 'and another sentence.']
