<a href="https://colab.research.google.com/github/deoprakash/NLP_Tutorial/blob/main/TokenizationInSpacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import spacy

Create a blank language object and tokenize the words in a sentence

In [13]:
nlp = spacy.blank("en")
doc = nlp("Dr. Lalit is a dentist. His wife is surgeon.")

for token in doc:
  print(token)

Dr.
Lalit
is
a
dentist
.
His
wife
is
surgeon
.


In [3]:
type(doc)

spacy.tokens.doc.Doc


Using an index to grab tokens / a Span object

In [14]:
doc[0:3]

Dr. Lalit is
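
To make the difference between indexing and slicing concrete, here is a minimal sketch that assumes the same blank English pipeline as above. A single index returns a `Token`, while a slice returns a `Span`; the variable names `first` and `span` are only illustrative.

```python
import spacy
from spacy.tokens import Span, Token

nlp = spacy.blank("en")
doc = nlp("Dr. Lalit is a dentist. His wife is surgeon.")

first = doc[0]    # a single index gives a Token
span = doc[0:3]   # a slice gives a Span (a view over the Doc)

print(type(first).__name__, "->", first.text)
print(type(span).__name__, "->", span.text)
```

A Span keeps a reference to its parent Doc, so slicing does not copy the underlying text.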

Collecting email IDs

In [16]:
text = '''Dear Mr. John Doe,

I hope you’re doing well. Please reach out to me at **john.doe@example.com** or contact my assistant at **assistant@company.org**.

For urgent matters, you can call my office at **+1-202-555-0173** or my personal number **(415) 555-1234**.
Our support team is also available at **support@business.net** and can be reached via WhatsApp at **+44 7911 123456**.

Best regards,
Michael Scott
Regional Manager, Dunder Mifflin
Email: **michael.scott@dmifflin.com**
Phone: **+1 646-555-5678**
Website: www.dundermifflin.com  '''

doc = nlp(text)
emails = []
for token in doc:
  # like_email is True when the token text looks like an email address
  if token.like_email:
    emails.append(token)

print(emails)


[john.doe@example.com, assistant@company.org, support@business.net, michael.scott@dmifflin.com]
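
The list printed above holds `Token` objects rather than strings. If plain strings are needed (for example to write them to a file), a short sketch along these lines should work; the sample sentence is a shortened, hypothetical variant of the letter above.

```python
import spacy

nlp = spacy.blank("en")
sample = ("Reach me at john.doe@example.com or write to "
          "assistant@company.org for help")
doc = nlp(sample)

# token.text converts each matching Token into a plain Python string
emails = [token.text for token in doc if token.like_email]
print(emails)
```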


Support in other languages

In [22]:
nlp = spacy.blank('hi')

doc = nlp("मेरे दोस्त ने मुझे एक उपहार दिया.")

for token in doc:
  print(token)

मेरे
दोस्त
ने
मुझे
एक
उपहार
दिया
.




In [24]:
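# Note: nlp is still the blank Hindi pipeline created in the previous cell.
# Its rule-based tokenizer also splits English text on whitespace and punctuation.
doc = nlp("gimme double cheese extra large healthy pizza.")

tokens = [token.text for token in doc]
tokens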

['gimme', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza', '.']

Customizing tokenizer

In [34]:
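from spacy.symbols import ORTH

# Tell the tokenizer to split "gimme" into two tokens.
# The ORTH pieces must join back to the original string.
nlp.tokenizer.add_special_case("gimme", [
    {ORTH: 'gim'},
    {ORTH: 'me'}
])

# Re-run the tokenizer: the earlier doc and tokens list do not change on their own
doc = nlp("gimme double cheese extra large healthy pizza.")
tokens = [token.text for token in doc]
tokens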

['gim', 'me', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza', '.']
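
As a self-contained illustration of the special-case rule, the sketch below tokenizes the same sentence before and after registering the rule. It assumes a fresh blank English pipeline; `before` and `after` are illustrative names.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
sentence = "gimme double cheese extra large healthy pizza."

before = [token.text for token in nlp(sentence)]

# Register the rule; the ORTH pieces must concatenate to "gimme"
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

after = [token.text for token in nlp(sentence)]

print(before)
print(after)
```

A special case only changes how the text is segmented. It cannot rewrite the text itself, which is why the pieces have to add up to the original string.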

Sentence Tokenization or Segmentation

In [27]:
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x7a467af76e50>

In [28]:
nlp.pipe_names

['sentencizer']

The blank pipeline was empty before; it now contains the built-in 'sentencizer' component.
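
The change in the pipeline can be checked directly. This is a small sketch on a fresh blank pipeline (English is assumed here), comparing `pipe_names` before and after the call to `add_pipe`.

```python
import spacy

nlp = spacy.blank("en")
names_before = list(nlp.pipe_names)   # a blank pipeline has no components

nlp.add_pipe("sentencizer")           # rule-based sentence boundary detection
names_after = list(nlp.pipe_names)

print(names_before)
print(names_after)
```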

In [29]:
text = '''Dear Mr. John Doe,

I hope you’re doing well. Please reach out to me at **john.doe@example.com** or contact my assistant at **assistant@company.org**.

For urgent matters, you can call my office at **+1-202-555-0173** or my personal number **(415) 555-1234**.
Our support team is also available at **support@business.net** and can be reached via WhatsApp at **+44 7911 123456**.

Best regards,
Michael Scott
Regional Manager, Dunder Mifflin
Email: **michael.scott@dmifflin.com**
Phone: **+1 646-555-5678**
Website: www.dundermifflin.com  '''

doc = nlp(text)
for sentence in doc.sents:
  print(sentence)

Dear Mr.
John Doe,  

I hope you’re doing well.
Please reach out to me at **john.doe@example.com** or contact my assistant at **assistant@company.org**.
 

For urgent matters, you can call my office at **+1-202-555-0173** or my personal number **(415) 555-1234**.
 
Our support team is also available at **support@business.net** and can be reached via WhatsApp at **+44 7911 123456**.
 

Best regards,  
Michael Scott  
Regional Manager, Dunder Mifflin  
Email: **michael.scott@dmifflin.com**  
Phone: **+1 646-555-5678**  
Website: www.dundermifflin.com  
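
Note that "Dear Mr." was split off as its own sentence. A likely explanation is that `nlp` here is still the blank Hindi pipeline, which has no tokenizer exception for "Mr.", so the period becomes a separate token and the sentencizer treats it as a sentence end. The sketch below compares blank Hindi and blank English pipelines on a shortened, hypothetical sample; `split_sentences` is a helper defined only for this illustration.

```python
import spacy

def split_sentences(lang_code, text):
    """Tokenize with a blank pipeline plus the rule-based sentencizer."""
    nlp = spacy.blank(lang_code)
    nlp.add_pipe("sentencizer")
    return [sent.text for sent in nlp(text).sents]

sample = "Dear Mr. John Doe, I hope you are doing well. Please reach out."

hi_sents = split_sentences("hi", sample)  # no "Mr." exception
en_sents = split_sentences("en", sample)  # "Mr." is kept as one token

print(hi_sents)
print(en_sents)
```

The sentencizer only looks at punctuation tokens, so its results depend on how the tokenizer handled abbreviations first.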


## **Assignments**

Q1. Detect URLs

In [36]:
text='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/,
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''
doc = nlp(text)

for token in doc:
  if token.like_url:
    print(token)


http://www.data.gov/
http://www.science
http://data.gov.uk/.
http://www3.norc.org/gss+website/
http://www.europeansocialsurvey.org/.
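
Some of the URLs above keep a trailing period because it directly follows a slash and is not split off by the tokenizer. One possible cleanup, shown here as a sketch on a shortened sample and assuming a blank English pipeline, is to strip trailing sentence punctuation from each URL token.

```python
import spacy

nlp = spacy.blank("en")
sample = ("Good places to start include http://www.data.gov/, "
          "and in the United Kingdom, http://data.gov.uk/.")
doc = nlp(sample)

# rstrip removes a trailing "." or "," that stayed attached to the URL token
urls = [token.text.rstrip(".,") for token in doc if token.like_url]
print(urls)
```

This is a simple heuristic: it would also trim a period that genuinely belongs to a URL, which is rare in practice.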


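Q2. Extract money transactions with their currency symbols

In [49]:
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"
doc = nlp(transactions)
for token in doc:
    # Skip index 0 so that token.i - 1 never wraps around to the last token
    if token.is_currency and token.i > 0:
        print(doc[token.i - 1].text, token.text)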

two $
500 €
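
The same idea can be written to collect the results instead of printing them. This sketch assumes the amount always appears immediately before the currency symbol, as in the sentence above; `pairs` is an illustrative name.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Tony gave two $ to Peter, Bruce gave 500 € to Steve")

pairs = []
for token in doc:
    # is_currency is True for symbols such as $ and €
    if token.is_currency and token.i > 0:
        amount = doc[token.i - 1]
        pairs.append((amount.text, token.text))

print(pairs)
```

Text written as "$20" would need the token after the symbol rather than the one before it, so the lookup direction depends on how the amounts are formatted.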
