# Tokenization

#### Tokenization breaks raw text into small chunks called tokens, typically words or sentences. These tokens help in understanding the context of the text and in building NLP models. Splitting text into words is called word tokenization; splitting it into sentences is called sentence tokenization.

## Why use it?
#### Tokenization is important because the meaning of a text is interpreted by analyzing the words in it and the order in which they appear.

#### At its core, tokenization splits text, so why not use the conventional `split` method? The main issue is that a plain split cannot tell a full stop that ends a sentence from one that is part of an abbreviation such as "Dr.". That is why we use the dedicated tokenizers below.
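
To make the limitation concrete, here is a minimal sketch using only Python's built-in `str.split`. The sample sentence is an illustrative assumption, not data from this notebook.

```python
# Illustrative sample text (an assumption, not taken from the notebook's data).
text = "Dr. Smith paid $5. He left."

# Splitting on whitespace leaves punctuation glued to the neighbouring word.
words = text.split()
print(words)

# Splitting on "." treats the period in the abbreviation "Dr." as a sentence end.
sentences = [part.strip() for part in text.split(".") if part.strip()]
print(sentences)
```

The whitespace split keeps `$5.` and `left.` as single chunks, and the period split cuts `Dr` off as if it were a sentence of its own.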

## spaCy

In [9]:
import spacy

In [10]:
nlp = spacy.blank("en")

In [23]:
doc = nlp("this sentence will turn into tokens now, Dr.")
for token in doc:
    print(token, end= " -> ")

this -> sentence -> will -> turn -> into -> tokens -> now -> , -> Dr. -> 

In [24]:
doc1 = nlp("let's go to N.Y!")
for token in doc1:
    print(token, end= " -> ")

let -> 's -> go -> to -> N.Y -> ! -> 

In [25]:
doc[1:5]

sentence will turn into

### Operations on tokens

In [28]:
token = doc[0]
type(token)

spacy.tokens.token.Token

In [29]:
token

this

In [27]:
token.like_num

False

In [30]:
token.text

'this'

In [31]:
example = "$"
doc = nlp(example)

In [36]:
token2 = doc[0]
token2.is_currency

True

## Example: suppose you have data about your students and want to find their email addresses to send them an important announcement. How can you do that?

In [45]:
with open("student.txt") as f :
    text = f.readlines()

text

['Dayton high school, 8th grade students information\n',
 '\n',
 'Name\tbirth day   \temail\n',
 '-----\t------------\t------\n',
 'Virat   5 June, 1882    virat@kohli.com\n',
 'Maria\t12 April, 2001  maria@sharapova.com\n',
 'Serena  24 June, 1998   serena@williams.com \n',
 'Joe      1 May, 1997    joe@root.com\n',
 '\n']

In [46]:
text = ' '.join(text)
text



## How can you get the emails?

In [47]:
doc = nlp(text)
for token in doc: 
    if token.like_email:  # like_email is one of several token attributes (e.g. like_url, like_num) for checking what a token looks like
        print(token)

virat@kohli.com
maria@sharapova.com
serena@williams.com
joe@root.com


## Customizing tokens

In [48]:
from spacy.symbols import ORTH

In [49]:
nlp.tokenizer.add_special_case("gimme",
                               [
                                   {ORTH:"gim"},
                                   {ORTH:"me"}
                               ])
doc = nlp("gimme double cheese extra large healthy pizza")
tokens = [token.text for token in doc]
tokens

['gim', 'me', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']

# Exercise

###### (1) This book has references to many websites from which you can download free datasets. You are an NLP engineer working for a company and you want to collect all the dataset websites from this book. To keep the exercise simple, you are given a paragraph from the book and you want to grab all the URLs from it using spaCy.



In [50]:
text='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, 
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''

In [51]:
doc = nlp(text)

In [52]:
for token in doc:
    if token.like_url:
        print(token)

http://www.data.gov/
http://www.science
http://data.gov.uk/.
http://www3.norc.org/gss+website/
http://www.europeansocialsurvey.org/.
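
Some of the tokens printed above keep a trailing period because the URL ended a sentence. One possible cleanup, sketched here with plain string methods, strips trailing sentence punctuation. The list simply copies the printed output by hand, and `clean_url` is a hypothetical helper name, not part of spaCy.

```python
# Tokens as printed by the like_url loop above (copied by hand).
raw_urls = [
    "http://www.data.gov/",
    "http://www.science",
    "http://data.gov.uk/.",
    "http://www3.norc.org/gss+website/",
    "http://www.europeansocialsurvey.org/.",
]


def clean_url(url: str) -> str:
    """Drop trailing sentence punctuation that is not part of the URL."""
    return url.rstrip(".,;")


cleaned = [clean_url(u) for u in raw_urls]
print(cleaned)
```

This does not repair `http://www.science`: that URL is broken across two lines in the paragraph itself, so the tokenizer never sees it as one piece.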


##### (2) Extract all money transactions from the sentence below along with the currency. The output should be:

two $

500 €

In [56]:
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"
transaction = nlp(transactions)

In [65]:
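for token in transaction:
    # token.i is the token's index in the doc, so token.i - 1 is the preceding token (the amount).
    # The i > 0 check avoids wrapping around to the last token if a currency symbol comes first.
    if token.is_currency and token.i > 0:
        print(transaction[token.i-1], token)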

two $
500 €


# NLTK

In [1]:
import nltk

In [2]:
from nltk.tokenize import word_tokenize

In [3]:
s = '''Good muffins cost $3.88\nin New York.  Please buy me
two of them.\n\nThanks.'''

In [4]:
word_tokenize(s)

['Good',
 'muffins',
 'cost',
 '$',
 '3.88',
 'in',
 'New',
 'York',
 '.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']

In [5]:
from nltk.tokenize import wordpunct_tokenize
wordpunct_tokenize(s)

['Good',
 'muffins',
 'cost',
 '$',
 '3',
 '.',
 '88',
 'in',
 'New',
 'York',
 '.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']

### `wordpunct_tokenize` is based on simple regular-expression tokenization: it uses the pattern `\w+|[^\w\s]+` to split the input, which is why `3.88` above is broken into `3`, `.` and `88`.
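
The pattern can be tried directly with Python's `re` module. This is a sketch of the regular expression only, not NLTK's actual implementation; it reuses the sample string `s` from the cells above.

```python
import re

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

# \w+       : a run of word characters (letters, digits, underscore)
# [^\w\s]+  : a run of characters that are neither word characters nor whitespace
tokens = re.findall(r"\w+|[^\w\s]+", s)
print(tokens)
```

Since `.` is not a word character, the price is matched as three separate pieces, the same split as in the `wordpunct_tokenize` output above.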

### `word_tokenize`, on the other hand, is based on a Treebank-style tokenizer: it tokenizes text following the Penn Treebank conventions, so `3.88` above stays a single token.
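
One way to see the Treebank rules in isolation is to call NLTK's `TreebankWordTokenizer` directly. This is a minimal sketch, assuming NLTK is installed. Unlike `word_tokenize`, it does no sentence splitting first.

```python
from nltk.tokenize import TreebankWordTokenizer

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

# Applies the Penn Treebank rules to the whole string at once,
# without splitting it into sentences beforehand.
tokens = TreebankWordTokenizer().tokenize(s)
print(tokens)
```

Because the tokenizer treats the input as a single sentence, `York.` and `them.` keep their periods and only the final period is split off. `word_tokenize` avoids this by running a sentence tokenizer first.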

In [6]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [7]:
sent_tokenize(s)

['Good muffins cost $3.88\nin New York.',
 'Please buy me\ntwo of them.',
 'Thanks.']

In [8]:
[word_tokenize(t) for t in sent_tokenize(s)]

[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
 ['Please', 'buy', 'me', 'two', 'of', 'them', '.'],
 ['Thanks', '.']]