## spaCy Basics

**spaCy** (https://spacy.io/) is an open-source Python library that parses and "understands" large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.).

spaCy is designed to handle NLP tasks efficiently. For many tasks it implements only one method, chosen as the most efficient algorithm currently available, which means you often cannot choose an alternative algorithm.

**There are a few key steps to working with spaCy:**

* Loading the language library
* Building a pipeline object
* Using tokens
* Part-of-speech tagging
* Understanding token attributes

In [1]:
# Import spaCy and load the small English language model
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

# Create a Doc object
doc = nlp(u"Tesla is looking at buying U.S. startup for $6 million")

# Print each token separately
for token in doc:
    print(token.text, token.pos_, token.dep_)  # pos_ is the part of speech, dep_ is the syntactic dependency


Tesla PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj


The `nlp()` function from spaCy takes raw text and automatically performs a series of operations to tag, parse, and describe it.

___
## Pipeline
When we run `nlp`, our text enters a *processing pipeline* that first breaks the text into tokens and then performs a series of operations to tag, parse, and describe the data.

Image source: https://spacy.io/usage/spacy-101#pipelines

<img src="..\pipeline1.png" width="600">

In [17]:
nlp.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x182a98cf6c8>),
 ('parser', <spacy.pipeline.DependencyParser at 0x182aa5aa3a8>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x182aa5aa948>)]

In [18]:
nlp.pipe_names

['tagger', 'parser', 'ner']
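
To make the idea of a pipeline concrete, here is a toy sketch in plain Python. It is not spaCy's implementation: the functions `tokenize`, `flag_title_case`, `flag_digits`, and `run_pipeline`, as well as the dict-based tokens, are hypothetical stand-ins. The sketch only mirrors the shape of `nlp.pipeline`, a list of `(name, component)` pairs that are applied in order after tokenization.

```python
# Toy model of a processing pipeline; NOT spaCy's implementation.
# All function names below are hypothetical.

def tokenize(text):
    """Break raw text into a list of token dicts (naive whitespace split)."""
    return [{"text": word} for word in text.split()]

def flag_title_case(tokens):
    """Component 1: mark tokens that start with an uppercase letter."""
    for token in tokens:
        token["is_title"] = token["text"][:1].isupper()
    return tokens

def flag_digits(tokens):
    """Component 2: mark tokens made up only of digits."""
    for token in tokens:
        token["is_digit"] = token["text"].isdigit()
    return tokens

# Same shape as nlp.pipeline: a list of (name, component) pairs
pipeline = [("title_flagger", flag_title_case), ("digit_flagger", flag_digits)]

def run_pipeline(text):
    tokens = tokenize(text)            # tokenization always happens first
    for _name, component in pipeline:  # then each component runs in order
        tokens = component(tokens)
    return tokens

pipe_names = [name for name, _ in pipeline]  # analogous to nlp.pipe_names
result = run_pipeline("Tesla paid 6 million")

print(pipe_names)
for token in result:
    print(token)
```

Each component receives the output of the previous one and adds its own annotations, which is why the order of the entries in the list matters.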

## Tokenization 
Tokenization is the process of breaking up the original text into component pieces (tokens).

In [19]:
# Let's look at another example

In [20]:
doc2=nlp(u"Tesla isn't looking   into startups anymore")
for token in doc2:
    print(token.text,token.pos_,token.dep_)

Tesla PROPN nsubj
is VERB aux
n't ADV neg
looking VERB ROOT
   SPACE 
into ADP prep
startups NOUN pobj
anymore ADV advmod


In [21]:
# Notice that spaCy treats the run of extra spaces as its own token and splits "isn't" into "is" and "n't"

In [22]:
doc2

Tesla isn't looking   into startups anymore

In [23]:
doc2[0]

Tesla

In [24]:
doc2[1]

is

In [25]:
type(doc2)

spacy.tokens.doc.Doc
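
The tokenization step shown above can also be tried on its own. As a minimal sketch, assuming spaCy is installed, `spacy.blank("en")` builds an English pipeline with no tagger, parser, or entity recognizer, so no trained model is needed. The example also illustrates that tokenization does not lose any of the input: joining each token's `text_with_ws` reproduces the original string.

```python
import spacy

# A blank English pipeline: tokenizer only, no tagger/parser/ner
blank_nlp = spacy.blank("en")

text = "Tesla isn't looking   into startups anymore"
tokens_doc = blank_nlp(text)

# Token texts, shown with repr() so whitespace tokens are visible
token_texts = [token.text for token in tokens_doc]
print([repr(t) for t in token_texts])

# Each token remembers its trailing whitespace, so the text can be rebuilt
rebuilt = "".join(token.text_with_ws for token in tokens_doc)
print(rebuilt == text)
```

Because only the tokenizer runs, attributes such as `pos_` and `dep_` are not filled in for these tokens.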

___
## Part-of-Speech Tagging (POS)
The next step after splitting the text into tokens is to assign parts of speech. In the example above, `Tesla` was recognized as a ***proper noun***. This step requires some statistical modeling: for example, words that follow "the" are typically nouns.

For a full list of POS tags visit https://spacy.io/api/annotation#pos-tagging

In [26]:
doc2[0].pos

95

In [27]:
doc2[0].pos_

'PROPN'

___
## Dependencies
We also looked at the syntactic dependencies assigned to each token. `Tesla` is identified as an `nsubj` or the ***nominal subject*** of the sentence.

For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing
<br>A good explanation of typed dependencies can be found [here](https://nlp.stanford.edu/software/dependencies_manual.pdf)

In [28]:
doc2[0].dep_

'nsubj'

In [29]:
spacy.explain('nsubj')

'nominal subject'

In [33]:
spacy.explain("PROPN")

'proper noun'
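
`spacy.explain` can also be applied to several labels at once. This small sketch needs only spaCy itself, not a loaded model. The list of labels is copied by hand from the token output of the first example, and `spacy.explain` returns `None` for any label it does not know.

```python
import spacy

# Labels copied by hand from the token output earlier in this notebook
labels = ["PROPN", "ADP", "NUM", "nsubj", "prep", "pobj", "dobj"]

# Build a small glossary: label -> human-readable description (or None)
glossary = {label: spacy.explain(label) for label in labels}

for label, description in glossary.items():
    print(f"{label:>6}: {description}")
```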

___
## Additional Token Attributes

|Attribute|Description|Value for `doc2[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape: capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [48]:
# Lemma: the base form of the word
print(doc2[3])
print(doc2[3].lemma_)


looking
look


In [53]:
# Word Shapes:
print(doc2[0].text+': ' + doc2[0].shape_)

Tesla: Xxxxx
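
The rule behind `.shape_` can be approximated in a few lines of plain Python. This is a sketch under assumptions, not spaCy's code: the `word_shape` helper and its `max_run` parameter are hypothetical. It maps uppercase letters to `X`, lowercase letters to `x`, and digits to `d`, keeps all other characters, and (as an assumption about the intended behavior) truncates runs of the same shape character after four.

```python
def word_shape(text, max_run=4):
    """Approximate a word shape: X = upper, x = lower, d = digit, others kept."""
    shape = []
    last_char = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        # Track how long the current run of identical shape characters is
        if shape_char == last_char:
            run_length += 1
        else:
            run_length = 1
            last_char = shape_char
        # Assumption: runs longer than max_run are truncated
        if run_length <= max_run:
            shape.append(shape_char)
    return "".join(shape)

for word in ["Tesla", "U.S.", "$6", "startups"]:
    print(word, "->", word_shape(word))
```

Shapes like these are useful as features because they group words with the same orthographic pattern, regardless of the actual letters.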


In [59]:
# Boolean attributes
print(doc2[0].is_alpha)
print(doc2[0].is_stop)

True
False
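
Attributes such as `.is_alpha` and `.is_stop` depend only on the token text, so they can be checked without a trained model. A minimal sketch, assuming spaCy is installed and using `spacy.blank("en")` (tokenizer only) on the sentence from the first example:

```python
import spacy

lex_nlp = spacy.blank("en")  # tokenizer only; no model download required
lex_doc = lex_nlp("Tesla is looking at buying U.S. startup for $6 million")

# Collect (is_alpha, is_stop) for each token text
flags = {token.text: (token.is_alpha, token.is_stop) for token in lex_doc}

for text, (is_alpha, is_stop) in flags.items():
    print(f"{text:<8} is_alpha={is_alpha!s:<5} is_stop={is_stop}")
```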


## Spans
Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object in the form `Doc[start:stop]`.

In [60]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [62]:
life_quote=doc3[16:30]
life_quote

"Life is what happens to us while we are making other plans"

In [63]:
type(life_quote)

spacy.tokens.span.Span
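
A `Span` keeps track of where it sits in its parent `Doc`. The sketch below uses a shorter, made-up sentence and `spacy.blank("en")` (an assumption made so that no trained model is needed) to show the `start` and `end` token offsets, and that indexing into a span is relative to the span rather than to the whole document.

```python
import spacy

span_nlp = spacy.blank("en")  # tokenizer only
span_doc = span_nlp('The phrase "Life is what happens" is famous.')

# Tokens 2..7: opening quote through closing quote (stop index is exclusive)
quote = span_doc[2:8]

print(quote.text)
print(quote.start, quote.end, len(quote))
print(quote[1].text)  # index 1 inside the span is token 3 of the Doc
```

Note that the quotation marks are tokens of their own, which is why the slice has to start at the opening quote rather than at `Life`.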

## Sentences
Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. Later we'll write our own segmentation rules.

In [65]:
doc4=nlp(u"This is the first sentence. This is the second sentence. this is the third sentence.")

In [66]:
for sen in doc4.sents:
    print(sen)

This is the first sentence.
This is the second sentence.
this is the third sentence.


In [68]:
doc4[6].is_sent_start  # Is the token at index 6 the start of a sentence?

True
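
The relationship between per-token "start of sentence" flags and `Doc.sents` can be sketched in plain Python. This is not how spaCy decides sentence boundaries (the loaded model relies on its dependency parser). The two helper functions are hypothetical and use a naive rule: a token starts a sentence if it is the first token or follows `.`, `!`, or `?`.

```python
def mark_sentence_starts(tokens):
    """Return one boolean per token: True if the token starts a sentence."""
    starts = []
    start_next = True  # the very first token always starts a sentence
    for token in tokens:
        starts.append(start_next)
        start_next = token in {".", "!", "?"}
    return starts

def iter_sentences(tokens, starts):
    """Lazily group tokens into sentences using the start flags."""
    current = []
    for token, is_start in zip(tokens, starts):
        if is_start and current:
            yield current
            current = []
        current.append(token)
    if current:
        yield current

text = "This is the first sentence. This is the second sentence. this is the third sentence."
tokens = text.replace(".", " .").split()  # naive tokenization for the sketch
starts = mark_sentence_starts(tokens)
sentences = list(iter_sentences(tokens, starts))

print(starts[6])  # token 6 is the second "This"
for sentence in sentences:
    print(" ".join(sentence))
```

As with `Doc.sents`, the sentences are only produced when the generator is consumed; the flags themselves are just one boolean per token.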