'''Tokenization breaks raw text into smaller units called tokens, typically words or sentences. These tokens help in understanding the context or in developing an NLP model. ... Splitting text into words is called word tokenization, and splitting it into sentences is called sentence tokenization.'''
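
As a rough illustration of the two definitions above, the following standard-library sketch splits a made-up sample string into word tokens and sentence tokens with regular expressions. The helper names `word_tokenize` and `sentence_tokenize` and the sample text are assumptions for illustration; this shows only the general idea, not how spaCy tokenizes.

```python
import re

def word_tokenize(text):
    # A token is either a run of word characters or one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_tokenize(text):
    # Split after '.', '!' or '?' when whitespace follows.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Tokens are small units. They can be words! Or sentences?"
words = word_tokenize(sample)
sentences = sentence_tokenize(sample)
print(words)
print(sentences)
```

Both helpers are deliberately naive: they know nothing about contractions, abbreviations, URLs or units, which are the cases the spaCy cells below exercise.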

In [1]:
import spacy 

In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
text1 = ' "We\'re moving to Bangalore!" '

In [4]:
print(text1)

 "We're moving to Bangalore!" 


In [5]:
doc = nlp(text1)  # convert the text into a spaCy Doc

In [6]:
## split into tokens 
for token in doc:
    print(token.text)

 
"
We
're
moving
to
Bangalore
!
"
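
The output above shows quotes and the exclamation mark split off as separate tokens, and `We're` split into `We` and `'re`. A much-simplified sketch of that behaviour, assuming a hand-picked set of prefix and suffix characters and a short list of English contractions (all hypothetical, not spaCy's actual rules), could look like this:

```python
import re

# Assumed punctuation sets and contraction endings, for illustration only.
PREFIXES = '"\'(['
SUFFIXES = '"\')].,!?'
CONTRACTIONS = re.compile(r"^(\w+)('re|'s|'ll|'ve|'d|'m)$", re.IGNORECASE)

def simple_tokenize(text):
    tokens = []
    for chunk in text.split():
        prefix_tokens, suffix_tokens = [], []
        # Peel leading punctuation off one character at a time.
        while chunk and chunk[0] in PREFIXES:
            prefix_tokens.append(chunk[0])
            chunk = chunk[1:]
        # Peel trailing punctuation off, keeping the original order.
        while chunk and chunk[-1] in SUFFIXES:
            suffix_tokens.insert(0, chunk[-1])
            chunk = chunk[:-1]
        # Split a contraction into its stem and its clitic.
        match = CONTRACTIONS.match(chunk)
        core = list(match.groups()) if match else ([chunk] if chunk else [])
        tokens.extend(prefix_tokens + core + suffix_tokens)
    return tokens

tokens = simple_tokenize(' "We\'re moving to Bangalore!" ')
print(tokens)
```

Unlike spaCy, this sketch discards the leading space instead of emitting it as a whitespace token.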


In [7]:
doc2 = nlp("Want to launch your career into Data Science - We are here to help you, Contact us via email - 'anuragjoshi70@gmail.com' and, visit our github account - 'https://github.com/anurag4646'")

In [8]:
doc2

Want to launch your career into Data Science - We are here to help you, Contact us via email - 'anuragjoshi70@gmail.com' and, visit our github account - 'https://github.com/anurag4646'

In [9]:
for token in doc2:
    print(token.text)

Want
to
launch
your
career
into
Data
Science
-
We
are
here
to
help
you
,
Contact
us
via
email
-
'
anuragjoshi70@gmail.com
'
and
,
visit
our
github
account
-
'
https://github.com/anurag4646
'
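
Note that the email address and the URL each stay as one token even though they contain punctuation. spaCy exposes `token.like_email` and `token.like_url` flags for such tokens. The sketch below imitates that kind of check with two deliberately loose regular expressions; the patterns and the `classify` helper are assumptions for illustration and are far less thorough than a real validator.

```python
import re

# Loose, illustrative patterns (assumed, not spaCy's own).
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$")
URL_RE = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)

def classify(token_text):
    """Label a token string as 'email', 'url' or 'other'."""
    if EMAIL_RE.match(token_text):
        return "email"
    if URL_RE.match(token_text):
        return "url"
    return "other"

for text in ["email", "anuragjoshi70@gmail.com", "https://github.com/anurag4646"]:
    print(text, "->", classify(text))
```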


In [10]:
doc3 = nlp("with $20 we can visit 10km.")

In [11]:
for token in doc3:
    print(token.text)

with
$
20
we
can
visit
10
km
.
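
Here the currency symbol is separated from the amount and the unit from the number. One way to reproduce this particular split with the standard library is a single alternation pattern; the pattern below is an assumption that happens to cover this sentence, not a general tokenizer.

```python
import re

TOKEN_RE = re.compile(
    r"[$€£]"            # a currency symbol is its own token
    r"|\d+(?:\.\d+)?"   # an integer or decimal number
    r"|[A-Za-z]+"       # a run of letters (a word, or a unit such as km)
    r"|[^\w\s]"         # any other single punctuation mark
)

tokens = TOKEN_RE.findall("with $20 we can visit 10km.")
print(tokens)
```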


In [12]:
doc4 = nlp("Let's visit St. Louis in the U.S. next year.")

In [13]:
for token in doc4:
    print(token)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.
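
`St.` and `U.S.` keep their periods, while the sentence-final period becomes its own token. Tokenizers typically achieve this with an exception list that is consulted before punctuation is stripped. A minimal sketch of that idea, using a tiny hypothetical abbreviation set (spaCy ships a much larger, language-specific one):

```python
# Hypothetical exception list, for illustration only.
ABBREVIATIONS = {"St.", "U.S.", "Mr.", "Dr."}

def split_final_period(chunk):
    """Split a trailing period off unless the chunk is a known abbreviation."""
    if chunk.endswith(".") and chunk not in ABBREVIATIONS and len(chunk) > 1:
        return [chunk[:-1], "."]
    return [chunk]

tokens = []
for chunk in "Let's visit St. Louis in the U.S. next year.".split():
    tokens.extend(split_final_period(chunk))
print(tokens)
```

This sketch only handles periods, so `Let's` stays whole here, whereas spaCy also splits it into `Let` and `'s`. In spaCy itself, additional exceptions can be registered with `nlp.tokenizer.add_special_case`.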


In [14]:
## number of lexemes currently stored in the shared vocabulary
len(doc4.vocab)

795

In [15]:
doc5 = nlp("Random forest is a good choice when the model is suffering from high variance because it reduces the variance without an increase in Bias.")

In [16]:
print(doc5)

Random forest is a good choice when the model is suffering from high variance because it reduces the variance without an increase in Bias.


### Sentence Tokenizer

In [33]:
for sent in doc5.sents:
    print(sent.text)

Random forest is a good choice when the model is suffering from high variance because it reduces the variance without an increase in Bias.


In [34]:
docn = nlp("This is the first sentence. This is the second Sentence. This is the last sentence")

In [35]:
for sent in docn.sents:
    print(sent.text)

This is the first sentence.
This is the second Sentence.
This is the last sentence


### Indexing 

In [17]:
## indexing with an integer returns a single Token
doc5[0]

Random

In [18]:
## slicing returns a Span (the end index is exclusive)
doc5[1:10]

forest is a good choice when the model is
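
Indexing a `Doc` follows Python's usual sequence conventions, so the same positions can be checked against a plain list. The sketch below uses `str.split` as a stand-in for the tokenizer, which is an assumption that only holds for the first part of this sentence because no punctuation is attached to those words.

```python
text = ("Random forest is a good choice when the model is suffering from "
        "high variance because it reduces the variance without an increase in Bias.")

tokens = text.split()    # whitespace split, not spaCy's tokenizer
first = tokens[0]        # same position as doc5[0]
window = tokens[1:10]    # same positions as doc5[1:10]; index 10 is excluded
last = tokens[-1]        # negative indices count from the end

print(first)
print(" ".join(window))
print(last)
```

The last item shows where the stand-in breaks down: a whitespace split keeps the final period glued to `Bias.`, while spaCy would emit the period as its own token.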

In [19]:
doc6 = nlp("Tesla is going to make their office in Bangalore worth $1M")

In [20]:
for token in doc6:
    print(token.text, end = " | ")

Tesla | is | going | to | make | their | office | in | Bangalore | worth | $ | 1 | M | 

### Split into Named Entities

In [21]:
for entity in doc6.ents:
    print(entity)

Tesla
Bangalore
$1M


In [22]:
for entity in doc6.ents:
    print(entity, entity.label_)
    print(str(spacy.explain(entity.label_)))
    print("\n")

Tesla ORG
Companies, agencies, institutions, etc.


Bangalore GPE
Countries, cities, states


$1M MONEY
Monetary values, including unit




In [23]:
doc7 = nlp('Autonomous cars swift Insurance liability towards manufacturers')

In [24]:
for chunk in doc7.noun_chunks:
    print(chunk)

Autonomous cars
manufacturers


In [25]:
from spacy import displacy

In [26]:
doc8 = nlp("Tesla is going to make their office in Bangalore worth $1M")

In [None]:
### Visualization of tokens and relations

In [27]:
displacy.render(doc8, style='dep', jupyter=True, options={'distance':100})

In [28]:
doc9 = nlp("Over a last quarter Apple sold nearly 200k kg profit of $6M.")

In [None]:
### Entity visualization

In [29]:
displacy.render(doc9, style='ent', jupyter=True)