# Tokenization

#### Definition
- __Tokenization__ or __word segmentation__ is the process of separating (or tokenizing) words in raw text.
- It is typically the first pre-processing step in NLP applications.
- Nearly all NLP and ML libraries and toolkits include highly optimized tokenizers, so in practice writing a new tokenizer is rarely necessary.
- In this notebook I will implement a simple tokenization algorithm based on regular expressions and compare it to one of the tokenizers available in the Natural Language Toolkit (NLTK).

#### Challenges
- How do we define a word? Whitespace is not always a reliable boundary.
    - Sometimes we want to keep several words together as a single unit, because they only carry their meaning in combination.
    - Not all languages separate words with whitespace.
    - Splitting on punctuation marks does not generalize: what about apostrophes, acronyms, or number expressions?
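
The punctuation problem in the last bullet can be made concrete with a minimal sketch. The helper name `naive_split` and the example sentence are illustrative assumptions, not part of this notebook's data.

```python
import re

def naive_split(text):
    """Split on runs of whitespace and common punctuation (hypothetical helper)."""
    parts = re.split(r"[\s.,!?']+", text)
    return [p for p in parts if p]  # drop empty strings left at the edges

# The acronym "U.S." and the amount "$3.50" are both broken apart.
print(naive_split("The U.S. paid $3.50."))
```

Both the acronym and the number lose their internal periods, so the original expressions can no longer be recovered from the tokens.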
    

### Import data

We can import data from the Natural Language Toolkit (NLTK); see https://www.nltk.org/book_1ed/ch02.html for how to access its corpora and how to import data from the web.

We will extract the first line of the chosen text and use it as our example sentence.

In [9]:
import re, nltk
from nltk.corpus import webtext

#nltk.download('webtext')

# show list of .txt files in the corpus
webtext.fileids()

['firefox.txt',
 'grail.txt',
 'overheard.txt',
 'pirates.txt',
 'singles.txt',
 'wine.txt']

In [8]:
# load the raw text of pirates.txt from the webtext corpus

full_text = webtext.raw("pirates.txt")
#print(full_text)

In [14]:
# extract the first line of the text as our example sentence
sentence = full_text.split("\n")[0]
print(sentence)

PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terry Rossio


### Simple tokenization algorithm

The most obvious way of separating words is to split the text on whitespace:

In [18]:
# str.split(" ") splits the string at every single space character

print(sentence.split(" "))

# re.split(r"\s+", str) splits the string on the regular expression \s+,
# which matches one or more consecutive whitespace characters (spaces, tabs, newlines).

#print(re.split(r"\s+", sentence))

['PIRATES', 'OF', 'THE', 'CARRIBEAN:', 'DEAD', "MAN'S", 'CHEST,', 'by', 'Ted', 'Elliott', '&', 'Terry', 'Rossio']


This approach leaves punctuation marks attached to the tokens and does not generalize.

Let's also split on punctuation marks, using pattern matching instead:
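
The whitespace-based variants also differ from each other once the input contains more than single spaces. The short sketch below compares them on a made-up string (the input `text` is an assumption chosen to show the differences):

```python
import re

text = "a  b\tc\n"  # two spaces, a tab and a trailing newline (made-up input)

by_space = text.split(" ")         # splits at every single space
by_default = text.split()          # splits on runs of any whitespace and trims the ends
by_regex = re.split(r"\s+", text)  # runs of whitespace, but keeps a trailing empty string

print(by_space)
print(by_default)
print(by_regex)
```

Only `str.split()` with no argument returns the three tokens without empty strings or embedded whitespace.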

In [19]:
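# split at any single character in the class: whitespace, parentheses, quotes and common punctuation
# (inside [...] the + is a literal plus sign, not a quantifier)
tokens = re.split(r'[(\s+.,:;!?\'\")]', sentence)
#print(tokens)

# drop the empty strings left between consecutive separators, plus any stray punctuation tokens
tokenized = [x for x in tokens if x != '' and x not in '- \t\n.,;:!?[]']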
print(tokenized)

['PIRATES', 'OF', 'THE', 'CARRIBEAN', 'DEAD', 'MAN', 'S', 'CHEST', 'by', 'Ted', 'Elliott', '&', 'Terry', 'Rossio']


This seems to work well. However, when applied to the sentence below, it drops the quotation marks, which give the phrase "very positive" an important subjective tone.

It also removes the apostrophe at the end of "Containers'", which changes the meaning of the sentence.

Finally, this simple tokenization algorithm cannot handle abbreviations, so it strips the "." from "Mr.", which loses some meaning.

Mr. Sherwood said reaction to Sea Containers' proposal has been "very positive."

In [20]:
sentence_ = "Mr. Sherwood said reaction to Sea Containers' proposal has been \"very positive.\""

tokens_ = re.split(r'[(\s+.,:;!?\'\")]', sentence_)
#print(tokens_)

tokenized_ = [x for x in tokens_ if x != '' and x not in '- \t\n.,;:!?[]']
print(tokenized_)

['Mr', 'Sherwood', 'said', 'reaction', 'to', 'Sea', 'Containers', 'proposal', 'has', 'been', 'very', 'positive']
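
One way to address these three failures without a library is to match tokens instead of splitting on separators. The sketch below is only an illustration under stated assumptions: the `ABBREVIATIONS` tuple and the helper `match_tokens` are hypothetical, and the pattern is not how NLTK tokenizes.

```python
import re

# Hypothetical, hand-picked abbreviation list; a real tokenizer needs a far larger one.
ABBREVIATIONS = ("Mr.", "Mrs.", "Dr.")
_abbrev = "|".join(re.escape(a) for a in ABBREVIATIONS)

# Try, in order: a known abbreviation, a word with optional inner or trailing
# apostrophes, or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(rf"(?:{_abbrev})|\w+(?:'\w+)*'?|[^\w\s]")

def match_tokens(text):
    """Return the tokens matched by TOKEN_PATTERN (illustrative helper)."""
    return TOKEN_PATTERN.findall(text)

example = "Mr. Sherwood said reaction to Sea Containers' proposal has been \"very positive.\""
print(match_tokens(example))
```

On this sentence the sketch keeps "Mr.", "Containers'" and both quotation marks as tokens. It still has clear limits: a word wrapped in single quotes would keep its closing quote attached, and any abbreviation missing from the list is split.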


### Apply NLTK's tokenizer

See https://www.nltk.org/book/ch03.html

To solve the problem of our algorithm not recognizing abbreviations, we can use __punkt__, a tokenizer that splits a text into sentences while handling special cases such as abbreviations, collocations and words that start sentences (see nltk.org/_modules/nltk/tokenize/punkt.html). NLTK's word_tokenize relies on it to split the text into sentences before tokenizing each sentence into words.
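
To see why sentence splitting needs this kind of abbreviation handling, consider a naive splitter that breaks after every sentence-final punctuation mark. This is a minimal sketch; `naive_sentences` and the example text are assumptions made for illustration and are not part of NLTK.

```python
import re

def naive_sentences(text):
    """Split wherever '.', '!' or '?' is followed by whitespace (hypothetical helper)."""
    return re.split(r"(?<=[.!?])\s+", text)

# The abbreviation "Mr." is wrongly treated as the end of a sentence.
print(naive_sentences("Mr. Sherwood arrived. He left."))
```

The text contains two sentences, but the naive rule returns three pieces because it cannot tell an abbreviation from a sentence boundary.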

In [27]:
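from nltk.tokenize import word_tokenize
# sentence-splitting data used by word_tokenize; recent NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')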

tokens = word_tokenize(sentence)
print(tokens)

['PIRATES', 'OF', 'THE', 'CARRIBEAN', ':', 'DEAD', 'MAN', "'S", 'CHEST', ',', 'by', 'Ted', 'Elliott', '&', 'Terry', 'Rossio']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\win10\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### More pre-processing: standardize and sort all the data

Let's convert all tokens to lower case, then sort the unique ones alphabetically to form our vocabulary, the set of distinct tokens we will be working with:

In [30]:
words = [w.lower() for w in tokens]
print(words)

vocab = sorted(set(words))
print(vocab)

['pirates', 'of', 'the', 'carribean', ':', 'dead', 'man', "'s", 'chest', ',', 'by', 'ted', 'elliott', '&', 'terry', 'rossio']
['&', "'s", ',', ':', 'by', 'carribean', 'chest', 'dead', 'elliott', 'man', 'of', 'pirates', 'rossio', 'ted', 'terry', 'the']
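
Note that the vocabulary above still contains punctuation tokens such as '&' and ','. If only word-like tokens are wanted, one possible filter keeps the tokens that contain at least one alphanumeric character. This is a sketch of one option, not a step the notebook performs; the `words` list is copied from the output above.

```python
# Lower-cased tokens copied from the output above.
words = ['pirates', 'of', 'the', 'carribean', ':', 'dead', 'man', "'s",
         'chest', ',', 'by', 'ted', 'elliott', '&', 'terry', 'rossio']

# Keep a token only if at least one of its characters is a letter or digit.
word_vocab = sorted({w for w in words if any(c.isalnum() for c in w)})
print(word_vocab)
```

The clitic "'s" survives this filter because it contains a letter; whether it should count as a word depends on the application.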


While I could keep developing my own tokenizer to accommodate more cases, I will rely on the tokenization algorithms available in NLP toolkits. These are highly optimized and handle special cases such as contractions consistently, as in the following examples taken from "Getting Started with Natural Language Processing" by Ekaterina Kochmar.

In [29]:
text1 = "What's the best way to cook a pizza?"
text2 = "We're going to use a baking stone"
text3 = "I haven't used a baking stone before"


print(word_tokenize(text1))
print(word_tokenize(text2))
print(word_tokenize(text3))


['What', "'s", 'the', 'best', 'way', 'to', 'cook', 'a', 'pizza', '?']
['We', "'re", 'going', 'to', 'use', 'a', 'baking', 'stone']
['I', 'have', "n't", 'used', 'a', 'baking', 'stone', 'before']
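
The contraction handling seen above can be approximated with two regular expressions. The sketch below is a simplified illustration and not NLTK's implementation; the names `TOKEN`, `SUFFIX` and `split_contractions`, and the short suffix list, are assumptions.

```python
import re

# A word with an optional apostrophe part, or a single punctuation character.
TOKEN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
# A small, hand-picked set of English clitics to detach from the end of a token.
SUFFIX = re.compile(r"^(.+?)(n't|'s|'re|'ve|'ll|'d|'m)$", re.IGNORECASE)

def split_contractions(text):
    """Tokenize text and detach common clitics (illustrative helper)."""
    tokens = []
    for tok in TOKEN.findall(text):
        m = SUFFIX.match(tok)
        if m:
            tokens.extend(m.groups())  # e.g. "haven't" becomes "have" + "n't"
        else:
            tokens.append(tok)
    return tokens

print(split_contractions("I haven't used a baking stone before"))
```

On the three example sentences this sketch yields the same tokens as the output above, but it covers far fewer cases than a library tokenizer.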


# Frequency Analysis