- Tokenizing
- Vocab
- Entities
- Chunking
- Display

## Tokenizing

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. spaCy's tokenizer keeps punctuation marks as separate tokens rather than discarding them.
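To make the definition concrete before turning to spaCy, here is a minimal sketch of a tokenizer built on the standard library only. The pattern and the `tokenize` helper are assumptions made for illustration; they are not spaCy's tokenization rules, which are considerably richer.

```python
import re

# Hypothetical pattern (not spaCy's rules): a token is either a run of
# word characters or a single character that is neither word nor space.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text, keep_punctuation=True):
    """Split text into tokens, optionally throwing away punctuation."""
    tokens = TOKEN_PATTERN.findall(text)
    if keep_punctuation:
        return tokens
    # Keep only the tokens that start with a word character.
    return [tok for tok in tokens if re.match(r"\w", tok)]


sample = "I am Ajay Goswami. If you want a playlist, mail me."
print(tokenize(sample))
print(tokenize(sample, keep_punctuation=False))
```

A pattern this simple would split an email address or a URL at every dot and slash, whereas spaCy keeps each of them as a single token, as its output further down shows.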

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [3]:
txt = """I am Ajay Goswami. If you want a playlist of any new technology, mail me at mail - ajay.goswami0532@gmail.com.
        There is learning cost INR - 00 for learning now a days. Go to my channel https://www.youtube.com/channel/UCuP52iVtoAbwUjo6M5DHkDA?view_as=subscriber """

In [4]:
txt

'I am Ajay Goswami. If you want a playlist of any new technology, mail me at mail - ajay.goswami0532@gmail.com.\n        There is learning cost INR - 00 for learning now a days. Go to my channel https://www.youtube.com/channel/UCuP52iVtoAbwUjo6M5DHkDA?view_as=subscriber '

In [5]:
# Run the pipeline; from here on txt is a spaCy Doc, not a plain string
txt = nlp(txt)

In [6]:
# Print each token of the Doc on its own line
for token in txt:
    print(token.text)

I
am
Ajay
Goswami
.
If
you
want
a
playlist
of
any
new
technology
,
mail
me
at
mail
-
ajay.goswami0532@gmail.com
.

        
There
is
learning
cost
INR
-
00
for
learning
now
a
days
.
Go
to
my
channel
https://www.youtube.com/channel/UCuP52iVtoAbwUjo6M5DHkDA?view_as=subscriber


In [7]:
for token in txt:
    print(token.text,end=" | ")

I | am | Ajay | Goswami | . | If | you | want | a | playlist | of | any | new | technology | , | mail | me | at | mail | - | ajay.goswami0532@gmail.com | . | 
         | There | is | learning | cost | INR | - | 00 | for | learning | now | a | days | . | Go | to | my | channel | https://www.youtube.com/channel/UCuP52iVtoAbwUjo6M5DHkDA?view_as=subscriber | 

In [8]:
len(txt)  # number of tokens in the Doc

41

## Vocab

In [10]:
list(nlp.vocab.strings)

['""',
 '#',
 '$',
 "''",
 ',',
 '-LRB-',
 '-RRB-',
 '.',
 ':',
 'ADD',
 'AFX',
 'BES',
 'CC',
 'CD',
 'DT',
 'EX',
 'FW',
 'GW',
 'HVS',
 'HYPH',
 'IN',
 'JJ',
 'JJR',
 'JJS',
 'LS',
 'MD',
 'NFP',
 'NIL',
 'NN',
 'NNP',
 'NNPS',
 'NNS',
 'PDT',
 'PRP',
 'PRP$',
 'RB',
 'RBR',
 'RBS',
 'RP',
 'SP',
 'TO',
 'UH',
 'VB',
 'VBD',
 'VBG',
 'VBN',
 'VBP',
 'VBZ',
 'WDT',
 'WP',
 'WP$',
 'WRB',
 'XX',
 '_SP',
 '``',
 'that',
 'if',
 'as',
 'because',
 'while',
 'since',
 'like',
 'so',
 'than',
 'whether',
 'although',
 'though',
 'unless',
 'once',
 'cause',
 'upon',
 'till',
 'whereas',
 'whilst',
 'except',
 'despite',
 'wether',
 'but',
 'becuse',
 'whie',
 'it',
 'w/out',
 'albeit',
 'save',
 'besides',
 'becouse',
 'coz',
 'til',
 'ask',
 "i'd",
 'out',
 'near',
 'seince',
 'tho',
 'sice',
 'will',
 'That',
 'If',
 'As',
 'Because',
 'While',
 'Since',
 'Like',
 'So',
 'Than',
 'Whether',
 'Although',
 'Though',
 'Unless',
 'Once',
 'Cause',
 'Upon',
 'Till',
 'Whereas',
 'Whilst',
 ...]

In [12]:
len(list(nlp.vocab.strings))

1183

![Diagram of the Vocab and StringStore](attachment:vocab_stringstore-1d1c9ccd7a1cf4d168bfe4ca791e6eed.svg)

- Doc: A processed container of tokens in context.
- Lexeme: A “word type” with no context. Includes the word shape and flags, e.g. if it’s lowercase, a digit or punctuation.
- Vocab: The collection of lexemes.
- StringStore: The dictionary mapping hash values to strings, for example 3197928453018144401 → “coffee”.
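The two-way lookup described for the StringStore can be sketched with a toy class. Everything here is an assumption made for illustration: `ToyStringStore` is a hypothetical name, and BLAKE2b truncated to 64 bits stands in for spaCy's own hash function, so the IDs it produces will not match spaCy's values.

```python
import hashlib


class ToyStringStore:
    """Toy two-way table between strings and 64-bit integer IDs."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_string(text):
        # Assumption: BLAKE2b cut to 8 bytes replaces spaCy's hash function.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, text):
        """Store the string and return its integer ID."""
        key = self.hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        # A string maps to its hash; a hash maps back to a stored string.
        if isinstance(key, str):
            return self.hash_string(key)
        return self._by_hash[key]


store = ToyStringStore()
coffee_id = store.add("coffee")
print(coffee_id, "->", store[coffee_id])
```

In this sketch the hash can always be computed from a string, but going from a hash back to text only works for strings that were added earlier, which is why the table has to be kept alongside the hashes.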

In [22]:
doc = nlp("I love Coffee")
for word in doc:
    # Look up the context-free lexeme for each token's text
    lexeme = doc.vocab[word.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
          lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)

I 4690420944186131903 X I I True False True en
love 3702023516439754181 xxxx l ove True False False en
Coffee 3474706295102377020 Xxxxx C fee True False True en


In [23]:
txt

I am Ajay Goswami. If you want a playlist of any new technology, mail me at mail - ajay.goswami0532@gmail.com.
        There is learning cost INR - 00 for learning now a days. Go to my channel https://www.youtube.com/channel/UCuP52iVtoAbwUjo6M5DHkDA?view_as=subscriber 

In [24]:
txt[0]

I

In [25]:
txt[3]

Goswami

In [26]:
txt[5:22]

If you want a playlist of any new technology, mail me at mail - ajay.goswami0532@gmail.com.

In [27]:
txt[37:]

to my channel https://www.youtube.com/channel/UCuP52iVtoAbwUjo6M5DHkDA?view_as=subscriber

In [28]:
txt[:37]

I am Ajay Goswami. If you want a playlist of any new technology, mail me at mail - ajay.goswami0532@gmail.com.
        There is learning cost INR - 00 for learning now a days. Go

In [30]:
# A Doc does not support item assignment; uncommenting the next line raises a TypeError
# txt[0] = "you"

## Finding Entities

- A named entity is a real-world object, such as a person, location, organization, or product, that can be denoted with a proper name. It can be abstract or have a physical existence. Examples of named entities include Barack Obama, New York City, Volkswagen Golf, or anything else that can be named. Named entities can simply be viewed as instances of an entity type.
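The idea of a named entity as an instance of an entity type can be illustrated with a small lookup table. This is a sketch under stated assumptions: the `GAZETTEER` table and the `tag_entities` helper are hypothetical, and spaCy itself predicts entities with a trained statistical model rather than a fixed list.

```python
# Hypothetical gazetteer: each known name (as a tuple of words) maps to
# the entity type it is an instance of.
GAZETTEER = {
    ("Barack", "Obama"): "PERSON",
    ("New", "York", "City"): "GPE",
    ("Volkswagen", "Golf"): "PRODUCT",
}


def tag_entities(tokens):
    """Return (entity_text, label) pairs, preferring the longest match."""
    max_len = max(len(name) for name in GAZETTEER)
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in GAZETTEER:
                entities.append((" ".join(candidate), GAZETTEER[candidate]))
                i += n
                break
        else:
            # No name starts at this position; move on by one token.
            i += 1
    return entities


sentence = "Barack Obama drove a Volkswagen Golf through New York City"
print(tag_entities(sentence.split()))
```

A fixed list like this cannot recognize names it has never seen, which is the main reason to prefer a trained model for real text.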

In [32]:
for entity in txt.ents:
    print(entity)

Ajay Goswami
ajay.goswami0532@gmail.com


In [36]:
txt2 = nlp("Carey from Delhi has 20 million followers in YouTube")

In [37]:
for entity in txt2.ents:
    print(entity)

Carey
Delhi
20 million
YouTube


#### Entity Labeling

In [40]:
for entity in txt2.ents:
    print(entity,entity.label_)

Carey PERSON
Delhi GPE
20 million CARDINAL
YouTube ORG


In [42]:
# Print a human-readable description of each entity label
for entity in txt2.ents:
    print(spacy.explain(entity.label_))

People, including fictional
Countries, cities, states
Numerals that do not fall under another type
Companies, agencies, institutions, etc.


## Chunking


- Chunking is the process of extracting noun phrases from text. spaCy can identify these noun phrases, which it calls noun chunks. You can think of a noun chunk as a noun plus the words describing it, for example "any new technology". It is also possible to identify and extract the base noun of a given chunk through the chunk's `root` attribute.
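The description of a noun chunk as a noun plus its describing words can be sketched with a simple rule over part-of-speech tags. This is only an illustration under assumptions: the tags in `TAGGED` are supplied by hand, the `noun_chunks` helper is hypothetical, and spaCy derives its noun chunks from the dependency parse rather than from a rule like this one.

```python
# Hand-tagged input (hypothetical tags, written by hand for this sketch).
TAGGED = [
    ("you", "PRON"), ("want", "VERB"), ("a", "DET"), ("playlist", "NOUN"),
    ("of", "ADP"), ("any", "DET"), ("new", "ADJ"), ("technology", "NOUN"),
]

MODIFIER_TAGS = {"DET", "ADJ"}
NOUN_TAGS = {"NOUN", "PROPN"}


def noun_chunks(tagged):
    """Return (chunk_text, base_noun) pairs from (word, tag) pairs."""
    chunks, current = [], []

    def flush():
        # Keep the run only if it ends in a noun; a dangling "the" is dropped.
        if current and current[-1][1] in NOUN_TAGS:
            chunks.append(list(current))
        current.clear()

    for word, tag in tagged:
        if tag in NOUN_TAGS:
            current.append((word, tag))
        elif tag in MODIFIER_TAGS:
            # A modifier that follows a noun starts a new chunk.
            if current and current[-1][1] in NOUN_TAGS:
                flush()
            current.append((word, tag))
        else:
            flush()
            if tag == "PRON":
                # A pronoun forms a chunk on its own.
                chunks.append([(word, tag)])
    flush()
    # The base noun is taken to be the last word of each chunk.
    return [(" ".join(w for w, _ in chunk), chunk[-1][0]) for chunk in chunks]


for text, base in noun_chunks(TAGGED):
    print(text, "| base noun:", base)
```

For this input the rule produces "you", "a playlist", and "any new technology", the same chunks spaCy reports for that part of the sample text.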

In [43]:
txt

I am Ajay Goswami. If you want a playlist of any new technology, mail me at mail - ajay.goswami0532@gmail.com.
        There is learning cost INR - 00 for learning now a days. Go to my channel https://www.youtube.com/channel/UCuP52iVtoAbwUjo6M5DHkDA?view_as=subscriber 

In [44]:
for chunk in txt.noun_chunks:
    print(chunk)

I
Ajay Goswami
you
a playlist
any new technology
me
mail - ajay.goswami0532@gmail.com
cost
my channel


## Display

In [48]:
from spacy import displacy

In [49]:
txt2

Carey from Delhi has 20 million followers in YouTube

In [53]:
displacy.render(txt2, style="ent", jupyter=True)

In [54]:
spacy.explain("CARDINAL")

'Numerals that do not fall under another type'

In [55]:
spacy.explain("GPE")

'Countries, cities, states'

In [57]:
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

In [60]:
displacy.render(txt2, style="dep", jupyter=True, options={"distance": 120})