#### Install spaCy

In [None]:
# Install spacy
#!pip install spacy

In [6]:
# Download the spacy model
#!python -m spacy download en_core_web_sm

#### Import spaCy and Load an NLP Model

In [3]:
# Import the spaCy library to get started
import spacy

In [4]:
# Load the spaCy model to create an nlp object
nlp = spacy.load("en_core_web_sm")

#### Load Data

In [5]:
# Load the dataset; an explicit encoding avoids platform-dependent decoding
with open("dataset/nlp_wiki.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [13]:
#print(text)

One of the first things required for natural language processing (NLP) tasks is a corpus. In linguistics and NLP, corpus (literally Latin for body) refers to a collection of texts. Such collections may be formed of a single language of texts, or can span multiple languages -- there are numerous reasons for which multilingual corpora (the plural of corpus) may be useful. Corpora may also consist of themed texts (historical, Biblical, etc.). Corpora are generally solely used for statistical linguistic analysis and hypothesis testing.
The good thing is that the internet is filled with text, and in many cases this text is collected and well oganized, even if it requires some finessing into a more usable, precisely-defined format. Wikipedia, in particular, is a rich source of well-organized textual data. It's also a vast collection of knowledge, and the unhampered mind can dream up all sorts of uses for just such a body of text.
What we will do here is build a corpus from the set of English

#### Make a Doc Object
*What is a Doc object for? That will become clear as we go.*
*To create a Doc container, call the nlp object and pass the text to it as a single argument.*

In [6]:
# Create a doc object
doc = nlp(text)
#print(doc)

*Printing text and doc shows no visible difference, but the two are quite different behind the scenes. Unlike the plain text string, the Doc object carries a lot of metadata in its attributes. To see this, let's compare them:*

In [7]:
# Let's check how the text and the doc object differ
print("Text Length:",len(text))
print("Doc Length:",len(doc))

Text Length: 3655
Doc Length: 745


*Same text, but different lengths. To find out why, let's iterate over each of them separately.*

In [8]:
# Iteration over the text
for token in text[0:10]:
    print(token)

O
n
e
 
o
f
 
t
h
e


In [9]:
# Iteration over the doc
for token in doc[:15]:
    print(token)

One
of
the
first
things
required
for
natural
language
processing
(
NLP
)
tasks
is


*Iterating over the text string yields single characters, including whitespace, so len(text) counts 3655 characters. Iterating over the Doc container yields words, and on closer inspection it treats the opening and closing parentheses as separate items. These items are called tokens, so len(doc) counts 745 tokens.*
*> Tokens are the fundamental building blocks of spaCy and of any NLP framework. They can be words or punctuation marks.*

Problems with the split() method

*> Python strings have a split() method that separates text word by word. So why do we need spaCy? Let's find out.*

In [10]:
# Splitting with the split() method
for token in text.split()[0:15]:
    print(token)

One
of
the
first
things
required
for
natural
language
processing
(NLP)
tasks
is
a
corpus.


*Here we see that split() cannot separate punctuation from words, but spaCy does.*
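To make the difference concrete without loading a model, here is a minimal sketch that compares split() with a simple regular expression. The regex is only a rough approximation chosen for illustration. It is not spaCy's tokenizer, which applies language-specific rules for cases such as contractions and abbreviations.

```python
import re

text = "One of the first things required for natural language processing (NLP) tasks is a corpus."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = text.split()

# Rough approximation of a tokenizer: a run of word characters,
# or any single character that is neither a word character nor whitespace.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(len(whitespace_tokens), whitespace_tokens[10], whitespace_tokens[-1])
print(len(regex_tokens), regex_tokens[10:13], regex_tokens[-2:])
```

The whitespace version keeps "(NLP)" and "corpus." as single items, while the regex version separates the parentheses and the final period, which is closer to the token list spaCy produced for the same sentence.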

#### Sentence Boundary Detection

In [11]:
# Split the doc into sentences using spaCy
for sent in doc.sents:
    print(sent)

One of the first things required for natural language processing (NLP) tasks is a corpus.
In linguistics and NLP, corpus (literally Latin for body) refers to a collection of texts.
Such collections may be formed of a single language of texts, or can span multiple languages -- there are numerous reasons for which multilingual corpora (the plural of corpus) may be useful.
Corpora may also consist of themed texts (historical, Biblical, etc.).
Corpora are generally solely used for statistical linguistic analysis and hypothesis testing.

The good thing is that the internet is filled with text, and in many cases this text is collected and well oganized, even if it requires some finessing into a more usable, precisely-defined format.
Wikipedia, in particular, is a rich source of well-organized textual data.
It's also a vast collection of knowledge, and the unhampered mind can dream up all sorts of uses for just such a body of text.

What we will do here is build a corpus from the set of Engli

*Sentence boundary detection is the identification of sentences in a text. We might think of using split(".") to separate sentences, but in English the period also marks abbreviations. spaCy handles such cases well: in the output above, the sentence ending in "etc.)." stays in one piece.*
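The problem with splitting on periods can be seen on two sentences taken from the dataset. This is a small sketch using only plain Python. It shows the naive approach failing and does not reproduce how spaCy finds sentence boundaries.

```python
text = (
    "Corpora may also consist of themed texts (historical, Biblical, etc.). "
    "Corpora are generally solely used for statistical linguistic analysis "
    "and hypothesis testing."
)

# Naive approach: treat every period as a sentence boundary.
fragments = [piece.strip() for piece in text.split(".") if piece.strip()]

for fragment in fragments:
    print(repr(fragment))
```

The text contains two sentences, but the naive split returns three fragments, because the period after "etc" is treated as a boundary and the closing parenthesis ends up as a fragment of its own.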

In [19]:
sentence1 = doc.sents[0]
print(sentence1)

TypeError: 'generator' object is not subscriptable

*Here we find that doc.sents is a generator. A generator can be iterated over, but it cannot be indexed. So we convert it to a list first, and then we can pick the sentence we want.*

In [20]:
sentence1 = list(doc.sents)[0]
print(sentence1)

One of the first things required for natural language processing (NLP) tasks is a corpus.
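Building a full list works, but it materializes every sentence just to read the first one. The sketch below shows two lazy alternatives on a plain Python generator. The sentences() function is a hypothetical stand-in for doc.sents so that the example runs without a model.

```python
from itertools import islice


def sentences():
    """Hypothetical stand-in for doc.sents: yields sentences one at a time."""
    yield "One of the first things required for natural language processing (NLP) tasks is a corpus."
    yield "In linguistics and NLP, corpus (literally Latin for body) refers to a collection of texts."
    yield "Corpora may also consist of themed texts (historical, Biblical, etc.)."


# A generator can be iterated, but it cannot be indexed.
try:
    sentences()[0]
except TypeError as error:
    print("TypeError:", error)

# next() pulls only the first item, without building a full list.
first = next(sentences())

# islice() takes a bounded slice lazily; here, the first two sentences.
first_two = list(islice(sentences(), 2))

print(first)
print(len(first_two))
```

With a real Doc, the same idea would be written as next(iter(doc.sents)) for the first sentence.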


#### Token Attributes

Let's look at tokens in detail.
Token Attributes:
----------------
> .text
> .head
> .left_edge
> .right_edge
> .ent_type_    (Entity Type)
> .ent_iob_
> .lemma_
> .morph
> .pos_
> .dep_     (Syntactic Dependency)
> .lang_    (Language)

Let's try all of these...

In [10]:
# Specify a word from the first sentence
token3 = sentence1[3]
print(token3)

first


In [11]:
# Apply .text
token3.text

'first'

In [12]:
# Apply .head
token3.head     # The syntactic parent, or governor, of the token

things

In [13]:
# Apply .left_edge
token3.left_edge    # The leftmost token of the token's syntactic descendants

first

In [14]:
# Apply .right_edge
token3.right_edge   # The rightmost token of the token's syntactic descendants

first
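Both edges come back as "first" because this token has no syntactic descendants of its own. To show how the edges follow from the head relation, here is a small sketch in plain Python. The parse of the phrase "the first things" is written by hand as an assumption for illustration, and the subtree(), left_edge() and right_edge() helpers are hypothetical, not spaCy functions.

```python
# Hand-written parse of the phrase "the first things" (an assumption for
# illustration, not model output): each token stores the index of its head.
tokens = ["the", "first", "things"]
heads = [2, 2, 2]  # "things" is its own head here, i.e. the root of the phrase


def subtree(index):
    """Return the sorted indices of a token and all of its descendants."""
    members = {index}
    changed = True
    while changed:
        changed = False
        for child, head in enumerate(heads):
            if head in members and child not in members:
                members.add(child)
                changed = True
    return sorted(members)


def left_edge(index):
    """Leftmost token among the token and its descendants."""
    return tokens[subtree(index)[0]]


def right_edge(index):
    """Rightmost token among the token and its descendants."""
    return tokens[subtree(index)[-1]]


print(left_edge(1), right_edge(1))  # "first" has no descendants
print(left_edge(2), right_edge(2))  # "things" governs "the" and "first"
```

In this toy phrase the edges of "things" span the whole phrase, while a leaf such as "first" is its own left and right edge.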

In [15]:
# Apply .ent_type
token3.ent_type

396

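*What does 396 mean as the entity type of token3? It is the integer ID that spaCy stores internally for the label. The attribute with a trailing underscore, .ent_type_, returns the readable string.*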

In [16]:
# Apply .ent_type_
token3.ent_type_

'ORDINAL'

In [47]:
# Apply .ent_iob_
# IOB code of the named entity tag: "B" means the token begins an entity, "I" means it is inside an entity,
# "O" means it is outside an entity, and "" means no entity tag is set
token3.ent_iob_

'B'
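The IOB codes matter mostly for entities that span several tokens. The following sketch shows how B and I codes can be merged back into entity spans. The tagged triples are written by hand as an assumption for illustration, and group_entities() is a hypothetical helper, not part of spaCy.

```python
# Hand-written (token, IOB code, entity type) triples for illustration;
# these are assumptions, not output copied from the model.
tagged = [
    ("the", "O", ""),
    ("first", "B", "ORDINAL"),
    ("dump", "O", ""),
    ("of", "O", ""),
    ("English", "B", "ORG"),
    ("Wikipedia", "I", "ORG"),
]


def group_entities(triples):
    """Merge B/I runs into (entity text, label) pairs."""
    entities = []
    current_words, current_label = [], None
    for text, iob, label in triples:
        if iob == "B":
            # A new entity starts; close the previous one if it is still open.
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [text], label
        elif iob == "I" and current_words:
            # Continue the entity that is currently open.
            current_words.append(text)
        else:
            # "O" or "" closes any open entity.
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities


print(group_entities(tagged))
```

A single-token entity such as "first" only ever carries a "B", while "English Wikipedia" needs a "B" followed by an "I".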

In [52]:
# Apply .lemma_
token3.lemma_       # lemma is the base form of a token, with no inflectional suffixes

'first'

In [50]:
# Let's test another word...
sentence1[5].lemma_

'require'

In [51]:
# Apply .morph
token3.morph

Degree=Pos

In [53]:
sentence1[5].morph

Aspect=Perf|Tense=Past|VerbForm=Part

In [54]:
# Apply .pos_
token3.pos_

'ADJ'

In [55]:
sentence1[5].pos_

'VERB'

In [56]:
# Apply .dep_   (Syntactic Dependency)
token3.dep_

'amod'

In [57]:
# Apply .lang_    (Language)
token3.lang_

'en'

#### Part-of-Speech Tagging (POS Tagging)
*spaCy offers an easy way to parse a text and identify its parts of speech.*

In [14]:
for token in sentence1:
    print(token.text, token.pos_, token.dep_)

One NUM nsubj
of ADP prep
the DET det
first ADJ amod
things NOUN pobj
required VERB acl
for ADP prep
natural ADJ amod
language NOUN compound
processing NOUN nmod
( PUNCT punct
NLP PROPN nmod
) PUNCT punct
tasks NOUN pobj
is AUX ROOT
a DET det
corpus NOUN attr
. PUNCT punct
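Once every token has a tag, a common next step is to summarize the tags. The sketch below counts them with collections.Counter. To stay runnable without a model, the (token, tag) pairs are copied by hand from the output above rather than read from sentence1.

```python
from collections import Counter

# (token, POS tag) pairs copied from the printed output of sentence1.
tagged = [
    ("One", "NUM"), ("of", "ADP"), ("the", "DET"), ("first", "ADJ"),
    ("things", "NOUN"), ("required", "VERB"), ("for", "ADP"),
    ("natural", "ADJ"), ("language", "NOUN"), ("processing", "NOUN"),
    ("(", "PUNCT"), ("NLP", "PROPN"), (")", "PUNCT"), ("tasks", "NOUN"),
    ("is", "AUX"), ("a", "DET"), ("corpus", "NOUN"), (".", "PUNCT"),
]

# Count how often each part-of-speech tag occurs in the sentence.
pos_counts = Counter(pos for _, pos in tagged)

for pos, count in pos_counts.most_common():
    print(pos, count)
```

In the notebook itself, Counter(token.pos_ for token in sentence1) should give the same counts directly from the Doc.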


*We can visualize the sentence's dependency parse as a diagram using displaCy, spaCy's built-in visualizer, which renders inline in a notebook.*

In [15]:
from spacy import displacy
displacy.render(sentence1, style="dep")

#### Named Entity Recognition (NER)
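*NER finds spans of text that name real-world entities, such as organizations, languages, or numbers, and assigns each span a label. The predictions are not always correct: in the output below, for example, "Corpora" is labeled PERSON.*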

In [17]:
for ent in doc.ents:
    print(ent.text, ent.label_)

One CARDINAL
first ORDINAL
NLP ORG
NLP ORG
Latin NORP
Corpora PERSON
Biblical ORG
Corpora PERSON
Wikipedia ORG
English Wikipedia ORG
Wikipedia ORG
Wikipedia ORG
MediaWiki ORG
Wikipedia ORG
English LANGUAGE
GB GPE
Wikipedia ORG
WikiCorpus ORG
Wikipedia ORG
get_texts WORK_OF_ART
WikiCorpus ORG
Wikipedia ORG
second ORDINAL
Wikipedia ORG
one CARDINAL
50 CARDINAL
50 CARDINAL
first ORDINAL


*We can also display the NER annotations using displaCy.*

In [18]:
displacy.render(doc, style="ent")