## Tokenization

Tokenization is the process of splitting an input sequence into units called tokens.

![white_token.png](pics/white_token.png)
![more_tokens.png](pics/more_tokens.png)


In [2]:
import nltk

text = "This is Andrew's text, isn't it?"

In [3]:
tokenizer = nltk.tokenize.WhitespaceTokenizer()
tokenizer.tokenize(text)

['This', 'is', "Andrew's", 'text,', "isn't", 'it?']

In [5]:
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokenizer.tokenize(text)

['This', 'is', 'Andrew', "'s", 'text', ',', 'is', "n't", 'it', '?']

In [6]:
tokenizer = nltk.tokenize.WordPunctTokenizer()
tokenizer.tokenize(text)

['This', 'is', 'Andrew', "'", 's', 'text', ',', 'isn', "'", 't', 'it', '?']
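To make the difference between these strategies concrete, here is a rough standard-library sketch of the whitespace and word-punctuation splits. The regular expression is the pattern the NLTK documentation gives for `WordPunctTokenizer`; the variable names are illustrative, and this is an approximation rather than a replacement for the NLTK classes. The Treebank rules are more involved and are not reproduced here.

```python
import re

text = "This is Andrew's text, isn't it?"

# Whitespace splitting: punctuation stays attached to the neighbouring word.
whitespace_tokens = text.split()

# Word-punctuation splitting: a token is either a run of word characters
# or a run of characters that are neither word characters nor whitespace.
word_punct_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_tokens)
print(word_punct_tokens)
```

The first list keeps `text,` and `it?` as single tokens, while the second separates every apostrophe and punctuation mark, which is why `isn't` falls apart into three pieces.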

## Token Normalization

We may want the same token for different forms of the word:
* wolf, wolves -> wolf
* talk, talks -> talk

### Stemming 
* The process of removing and replacing suffixes to reach the root form of the word, which is called the **stem**.
* Usually refers to heuristics that chop off suffixes
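The suffix-chopping idea can be sketched with a handful of ordered rewrite rules. This is a toy illustration, not the Porter algorithm: `SUFFIX_RULES` and `toy_stem` are hypothetical names, and the rule list is a small assumption chosen for the example.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
SUFFIX_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # glass    -> glass (protects "ss" from the "s" rule)
    ("ed", ""),      # talked   -> talk
    ("ing", ""),     # talking  -> talk
    ("s", ""),       # cats     -> cat
]


def toy_stem(word):
    """Apply the first matching suffix rule, leaving very short words alone."""
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two characters to remain before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word


stems = [toy_stem(w) for w in "feet cats wolves talked".split()]
print(stems)
```

Because the rules only look at surface suffixes, `wolves` becomes the non-word `wolve` and the irregular plural `feet` is left untouched, which is the typical trade-off of heuristic stemming.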

### Lemmatization
* Usually refers to doing things properly with the use of a vocabulary and morphological analysis
* Returns the base or dictionary form of a word, which is known as the **lemma**

![lemma_1.png](pics/lemma_1.png)

![lemma_2.png](pics/lemma_2.png)
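The vocabulary-based approach can be illustrated with a toy lookup lemmatizer. `IRREGULAR`, `VOCAB`, and `toy_lemmatize` are hypothetical names, and the word lists are made up for this example; a real lemmatizer relies on a full lexicon such as WordNet and usually on part-of-speech information as well.

```python
# Tiny, made-up lexicon for illustration only.
VOCAB = {"cat", "wolf", "foot", "talk"}
IRREGULAR = {"feet": "foot", "wolves": "wolf"}


def toy_lemmatize(word):
    """Return the dictionary form of a noun if the toy lexicon knows it."""
    # Irregular forms are resolved by direct lookup.
    if word in IRREGULAR:
        return IRREGULAR[word]
    # Regular plural: strip "s" only if the result is a known word.
    if word.endswith("s") and word[:-1] in VOCAB:
        return word[:-1]
    # Unknown form: return the word unchanged rather than guessing.
    return word


lemmas = [toy_lemmatize(w) for w in "feet cats wolves talked".split()]
print(lemmas)
```

Unlike suffix chopping, the lookup always returns a real word, but it can only handle forms the lexicon covers: the verb form `talked` is left as is because this sketch only knows noun rules.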



In [8]:
text = "feet cats wolves talked"
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)


In [9]:
stemmer = nltk.stem.PorterStemmer()
" ".join(stemmer.stem(token) for token in tokens)

'feet cat wolv talk'

In [13]:
lemmatizer = nltk.stem.WordNetLemmatizer()
" ".join(lemmatizer.lemmatize(token) for token in tokens)

'foot cat wolf talked'

![norm_problems.png](pics/norm_problems.png)

