# Tokenization


Tokenization is a process that splits an input sequence into so-called tokens.

- You can think of a token as a useful unit for semantic processing.
- A token can be a word, a sentence, a paragraph, etc.

Resource: http://text-processing.com/demo/tokenize/

## Examples of different tokenizers

In [1]:
import nltk

text = "This is Paradox's text, isn't it?"

# An example of a simple whitespace tokenizer

tokenizer = nltk.tokenize.WhitespaceTokenizer()
tokenizer.tokenize(text)




['This', 'is', "Paradox's", 'text,', "isn't", 'it?']

- Problem: here "it" and "it?" are different tokens with the same meaning.
- Let's try splitting on punctuation.

In [2]:
tokenizer = nltk.tokenize.WordPunctTokenizer()
tokenizer.tokenize(text)


['This', 'is', 'Paradox', "'", 's', 'text', ',', 'isn', "'", 't', 'it', '?']
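
As a rough, standard-library-only sketch of what the two tokenizers above do: whitespace tokenization behaves much like `str.split()`, and WordPunctTokenizer is, to our understanding, built on the regular expression `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation). The variable names below are illustrative, and this is not a replacement for the NLTK classes.

```python
import re

text = "This is Paradox's text, isn't it?"

# Whitespace tokenization: split on runs of whitespace.
whitespace_tokens = text.split()

# Word/punctuation tokenization: runs of word characters OR runs of
# non-word, non-space characters (assumed to mirror WordPunctTokenizer).
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_tokens)
print(wordpunct_tokens)
```

Seeing the pattern makes it clear why "isn't" falls apart into "isn", "'" and "t": the apostrophe is punctuation, so it always ends a run of word characters.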

- Problem: "s", "isn", and "t" are not very meaningful on their own.

- So we can use a set of rules instead, as implemented by TreebankWordTokenizer.

In [3]:
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokenizer.tokenize(text)

['This', 'is', 'Paradox', "'s", 'text', ',', 'is', "n't", 'it', '?']

Here "'s" and "n't" are more meaningful for processing.
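
To give a feel for what "a set of rules" means, here is a minimal sketch that handles only two kinds of rule: separating punctuation, and splitting off a few English clitics such as "n't" and "'s". The function name `split_clitics` and the short clitic list are assumptions made for illustration; the real TreebankWordTokenizer applies a much larger rule set.

```python
import re

def split_clitics(text):
    """Tokenize with a few Treebank-style rules (illustrative subset only)."""
    # Rule 1: put spaces around punctuation, except apostrophes.
    text = re.sub(r"([^\w\s'])", r" \1 ", text)
    # Rule 2: split the negation clitic, e.g. "isn't" -> "is n't".
    text = re.sub(r"n't\b", " n't", text)
    # Rule 3: split a few common clitics, e.g. "Paradox's" -> "Paradox 's".
    text = re.sub(r"'(s|re|ve|ll|d|m)\b", r" '\1", text)
    return text.split()

tokens = split_clitics("This is Paradox's text, isn't it?")
print(tokens)
```

On this one sentence the sketch yields the same tokens as the output above, but it should not be expected to agree with NLTK on arbitrary text.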

# Token Normalization

Text normalization is the process of transforming text into a single canonical form that it
might not have had before. Normalizing text before storing or processing it allows for separation of 
concerns, since input is guaranteed to be consistent before operations are performed on it.

For example, we may want the same token for different forms of a word:
- wolf, wolves --> wolf
- talk, talks --> talk

## Stemming

- A process of removing and replacing suffixes to get the root form of a word, which is called the <b>stem</b>.
- Usually refers to heuristics that chop off suffixes.
 
 
### Stemming Example:

Porter's Stemmer

- Five heuristic phases of word reduction, applied sequentially
- Example of phase 1 rules:

    Rule                     Example
    
    SSES --> SS              caresses --> caress
    IES  --> I               ponies   --> poni
    SS   --> SS              caress   --> caress
    S    -->                 cats     --> cat
    
- nltk.stem.PorterStemmer
- Examples:
   - feet --> feet, cats --> cat
   - wolves --> wolv, talked --> talk

- Problem: fails on irregular forms (feet stays feet) and produces non-words (wolv)
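
The four phase 1 rules listed above can be sketched as a small suffix-rewriting function. This is a simplified illustration covering only those four rules: the name `porter_phase_1_sketch` is made up here, and nltk.stem.PorterStemmer applies further phases and conditions on top of this.

```python
def porter_phase_1_sketch(word):
    """Apply the first matching suffix rule; longer suffixes are tried first."""
    rules = [
        ("sses", "ss"),  # caresses -> caress
        ("ies", "i"),    # ponies   -> poni
        ("ss", "ss"),    # caress   -> caress (protects "ss" from the next rule)
        ("s", ""),       # cats     -> cat
    ]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule matched

stems = [porter_phase_1_sketch(w) for w in ["caresses", "ponies", "caress", "cats"]]
print(stems)
```

The order of the rules matters: without the "ss" rule ahead of the plain "s" rule, "caress" would be wrongly shortened to "cares".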






In [4]:
import nltk

text = "feet cats wolves talked"
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)

stemmer = nltk.stem.PorterStemmer()
" ".join(stemmer.stem(token) for token in tokens)

'feet cat wolv talk'

## Lemmatization

- Usually refers to doing things properly with the use of a vocabulary and morphological analysis.
- Returns the base or dictionary form of a word, which is known as the <b>lemma</b>.
    
### Lemmatization Example:

WordNet Lemmatizer

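- Uses the WordNet database to look up lemmas
- nltk.stem.WordNetLemmatizer
- Examples:
    - feet --> foot, cats --> cat
    - wolves --> wolf, talked --> talked
- Problem: not all forms are reduced. lemmatize() treats each word as a noun by default, so a verb form such as "talked" is left unchanged unless a part of speech (e.g. pos="v") is passed.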

In [5]:
lemmatizer = nltk.stem.WordNetLemmatizer()
" ".join(lemmatizer.lemmatize(token) for token in tokens)

'foot cat wolf talked'
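
Why does "talked" survive? A dictionary-based lemmatizer looks words up together with a part of speech, and the lookup above treats every token as a noun. The toy sketch below imitates that behaviour with a hand-written lexicon; `TOY_LEXICON` and `toy_lemmatize` are hypothetical names with made-up data, not part of NLTK or WordNet.

```python
# Toy vocabulary: (surface form, part of speech) -> lemma.
# Hand-written for illustration only; a real lemmatizer uses WordNet.
TOY_LEXICON = {
    ("feet", "n"): "foot",
    ("cats", "n"): "cat",
    ("wolves", "n"): "wolf",
    ("talked", "v"): "talk",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return TOY_LEXICON.get((word, pos), word)

tokens = ["feet", "cats", "wolves", "talked"]
as_nouns = [toy_lemmatize(t) for t in tokens]          # default: noun lookup
talked_as_verb = toy_lemmatize("talked", pos="v")      # verb lookup

print(as_nouns)
print(talked_as_verb)
```

Looked up as a noun, "talked" has no entry and comes back unchanged; looked up as a verb, it reduces to "talk". In practice this usually means running a part-of-speech tagger before lemmatizing.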

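## Stopwords

Stopwords are very common words (such as "and", "no", "a") that carry little meaning on their own and are often filtered out before further processing.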

In [6]:
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
txt = nltk.tokenize.word_tokenize(data)

In [7]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# Keep only the tokens that are not stopwords (the comparison is case-sensitive)
txt = [word for word in txt if word not in stop_words]


In [8]:
print(txt)

['All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.']
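
Note that 'All' is still present in the output above. This is most likely because the stopword list is lowercase while the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows; the small `stop_words` set is a hand-picked subset used here as an assumption, so that the example runs without downloading the NLTK corpus.

```python
# Hand-picked subset of English stopwords (assumption for this sketch).
stop_words = {"all", "and", "no", "a"}

tokens = ["All", "work", "and", "no", "play", "makes",
          "jack", "a", "dull", "boy", "."]

# Lowercase each token before the membership test so "All" matches "all".
filtered = [t for t in tokens if t.lower() not in stop_words]

print(filtered)
```

Whether to lowercase before filtering depends on the task, since capitalization can carry information (for example, in named entities).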


### Takeaway: We need to try both stemming and lemmatization and choose whichever works best for our task.