# Tokenisation

The notebook contains four types of tokenisation techniques:
1. Word tokenisation
2. Sentence tokenisation
3. Tweet tokenisation
4. Custom tokenisation using regular expressions

In [11]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/amar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### 1. Word tokenisation

In [12]:
document = "At nine o'clock I visited him myself. It looks like religious mania, and he'll soon can't think that he himself is God."
print(document)

At nine o'clock I visited him myself. It looks like religious mania, and he'll soon can't think that he himself is God.


Tokenising on whitespace using Python's built-in `split()`

In [13]:
print(document.split())

['At', 'nine', "o'clock", 'I', 'visited', 'him', 'myself.', 'It', 'looks', 'like', 'religious', 'mania,', 'and', "he'll", 'soon', "can't", 'think', 'that', 'he', 'himself', 'is', 'God.']


Tokenising using NLTK's word tokeniser

In [14]:
from nltk.tokenize import word_tokenize
words = word_tokenize(document)

In [15]:
print(words)

['At', 'nine', "o'clock", 'I', 'visited', 'him', 'myself', '.', 'It', 'looks', 'like', 'religious', 'mania', ',', 'and', 'he', "'ll", 'soon', 'ca', "n't", 'think', 'that', 'he', 'himself', 'is', 'God', '.']


NLTK's word tokeniser not only splits on whitespace but also separates punctuation from words and breaks contractions such as "he'll" into "he" and "'ll". On the other hand, it doesn't break "o'clock" and treats it as a single token.
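To make the difference between the two approaches concrete, here is a minimal sketch that uses only Python's `re` module. The pattern is an assumption chosen for illustration and is not what NLTK uses internally: it splits punctuation off from words, as `word_tokenize` does, but keeps contractions such as "he'll" whole.

```python
import re

document = ("At nine o'clock I visited him myself. It looks like religious mania, "
            "and he'll soon can't think that he himself is God.")

# Illustrative pattern (an assumption, not NLTK's rule set):
#   \w+(?:'\w+)?  a word, optionally followed by an apostrophe and more word characters
#   [^\w\s]       any single character that is neither a word character nor whitespace
pattern = r"\w+(?:'\w+)?|[^\w\s]"

tokens = re.findall(pattern, document)
print(tokens)
```

Whether contractions should be kept whole or split depends on the downstream task, which is why it helps to know what each tokeniser does with them.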

### 2. Sentence tokenisation

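Tokenising a document into sentences means finding sentence boundaries. Splitting on the period ('.') alone is not reliable, because periods also appear in abbreviations and numbers. Let's use NLTK's sentence tokeniser.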

In [16]:
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(document)

In [17]:
print(sentences)

["At nine o'clock I visited him myself.", "It looks like religious mania, and he'll soon can't think that he himself is God."]
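The following sketch shows why a plain "split after a period" rule is fragile. The example text and the regular expression are assumptions made up for illustration and are not part of the notebook.

```python
import re

# Hypothetical example text, not taken from the notebook
text = "Dr. Seward visited him at nine. He looked unwell."

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
print(naive_sentences)
# ['Dr.', 'Seward visited him at nine.', 'He looked unwell.']
```

The abbreviation "Dr." is wrongly treated as a sentence of its own. NLTK's `sent_tokenize` relies on the pre-trained Punkt model, which is designed to recognise common abbreviations, so it is usually more robust than a rule like this.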


### 3. Tweet tokenisation

A problem with the word tokeniser is that it fails to correctly tokenise text emoticons such as ':)' and '<3' and other special constructs such as hashtags. Emojis and emoticons are common these days and people use them all the time.

In [18]:
message = "i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎"

In [19]:
print(word_tokenize(message))

['i', 'recently', 'watched', 'this', 'show', 'called', 'mindhunters', ':', ')', '.', 'i', 'totally', 'loved', 'it', '😍', '.', 'it', 'was', 'gr8', '<', '3', '.', '#', 'bingewatching', '#', 'nothingtodo', '😎']


The word tokeniser breaks the emoticon '<3' into '<' and '3', which is something that we don't want. Emojis and emoticons have their own significance in areas like sentiment analysis, where a happy face or a sad face alone can prove to be a really good predictor of the sentiment. Similarly, each hashtag is broken into two tokens. A hashtag is used for searching specific topics or photos in social media apps such as Instagram and Facebook, so there you want to keep the hashtag as is.

Let's use the tweet tokeniser of nltk to tokenise this message.

In [20]:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()

In [21]:
tknzr.tokenize(message)

['i',
 'recently',
 'watched',
 'this',
 'show',
 'called',
 'mindhunters',
 ':)',
 '.',
 'i',
 'totally',
 'loved',
 'it',
 '😍',
 '.',
 'it',
 'was',
 'gr8',
 '<3',
 '.',
 '#bingewatching',
 '#nothingtodo',
 '😎']

As you can see, it handles all the emojis and the hashtags pretty well.
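To give a rough idea of how a tweet-aware tokeniser can be built, here is a simplified sketch using only Python's `re` module. The list of alternatives is an assumption written for this one message; NLTK's actual pattern is considerably more thorough.

```python
import re

message = "i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎"

# Alternatives are tried left to right, so the special cases come first.
alternatives = [
    r"[:;]-?[()DP]",  # a few text emoticons such as :) ;-) :D
    r"<3",            # the heart emoticon
    r"[#@]\w+",       # hashtags and @mentions
    r"\w+",           # ordinary words and numbers
    r"[^\w\s]",       # any other single non-space character (punctuation, emoji)
]
tweet_pattern = "|".join(alternatives)

tokens = re.findall(tweet_pattern, message)
print(tokens)
```

Because the emoticon and hashtag alternatives are tried before the generic punctuation rule, ':)', '<3' and '#bingewatching' each come out as a single token.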

### 4. Custom tokenisation using regular expressions

NLTK also provides a tokeniser that takes a regular expression and returns the tokens that match that pattern.

Let's look at how you can use the regular expression tokeniser.

In [22]:
from nltk.tokenize import regexp_tokenize
message = "i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎"
pattern = r"#\w+"  # raw string, so the backslash in \w is not treated as a string escape

In [23]:
regexp_tokenize(message, pattern)

['#bingewatching', '#nothingtodo']
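A regular expression can describe either the tokens themselves or the separators between them. The sketch below shows both modes with Python's `re` module; `regexp_tokenize` also accepts a `gaps` argument for the second mode, though the code here deliberately sticks to the standard library.

```python
import re

message = "i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎"

# Mode 1: the pattern describes the tokens to keep (here, hashtags)
hashtags = re.findall(r"#\w+", message)
print(hashtags)

# Mode 2: the pattern describes the gaps between tokens (here, runs of whitespace)
pieces = re.split(r"\s+", message)
print(pieces)
```

The first mode suits extracting specific items such as hashtags or mentions, while the second behaves like `str.split()` with a separator of your choosing.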

In [31]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

sentence = 'Education is the most powerful weapon that you can use to change the world'

# change sentence to lowercase
sentence = sentence.lower()

# tokenise sentence into words
words = word_tokenize(sentence)

# extract nltk stop word list (a separate name avoids shadowing the imported stopwords module)
stop_words = set(stopwords.words('english'))

# remove stop words
no_stops = [w for w in words if w not in stop_words]

# print length - don't change the following piece of code
print(len(no_stops))

6


In [32]:
from nltk.tokenize import regexp_tokenize

text = 'So excited to be a part of machine learning and artificial intelligence program made by @upgrad and @iiitb'

# change text to lowercase
text = text.lower()

# pattern to extract mentions
pattern = '@[a-z0-9_]+'

# extract mentions by using regex tokeniser
mentions = regexp_tokenize(text, pattern)

# print length - don't change the following piece of code
print(len(mentions))

2


# Bag of words

In [34]:
from nltk import word_tokenize

In [33]:
d1 = 'there was a place on my ankle that was itching'
d2 = 'but I did not scratch it'
d3 = 'and then my ear began to itch'
d4 = 'and next my back'

In [35]:
wt1 = word_tokenize(d1)
wt2 = word_tokenize(d2)
wt3 = word_tokenize(d3)
wt4 = word_tokenize(d4)

print(wt1)
print(wt2)
print(wt3)
print(wt4)

['there', 'was', 'a', 'place', 'on', 'my', 'ankle', 'that', 'was', 'itching']
['but', 'I', 'did', 'not', 'scratch', 'it']
['and', 'then', 'my', 'ear', 'began', 'to', 'itch']
['and', 'next', 'my', 'back']


In [48]:
# for each document, keep only the words that are not already covered by the other documents
# note: repeats within a single document (e.g. 'was' in d1) are kept
tot_wrd1 = [wrd for wrd in wt1 if wrd not in wt2+wt3+wt4]
tot_wrd2 = [wrd for wrd in wt2 if wrd not in tot_wrd1+wt3+wt4]
tot_wrd3 = [wrd for wrd in wt3 if wrd not in tot_wrd1+tot_wrd2+wt4]
tot_wrd4 = [wrd for wrd in wt4 if wrd not in tot_wrd1+tot_wrd2+tot_wrd3]

l = len(tot_wrd1) + len(tot_wrd2) + len(tot_wrd3) + len(tot_wrd4)

print(tot_wrd1)
print(tot_wrd2)
print(tot_wrd3)
print(tot_wrd4)
print(l)

['there', 'was', 'a', 'place', 'on', 'ankle', 'that', 'was', 'itching']
['but', 'I', 'did', 'not', 'scratch', 'it']
['then', 'ear', 'began', 'to', 'itch']
['and', 'next', 'my', 'back']
24
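Note that the first list above still contains 'was' twice, so the total of 24 is one more than the number of distinct words. A more direct way to build the vocabulary is to use a set, and the bag-of-words vector of each document is then the count of every vocabulary word in it. The sketch below is one possible way to do this; it uses `str.split()` instead of `word_tokenize` (these documents contain no punctuation, so the tokens are the same), and the names `vocabulary` and `to_vector` are illustrative.

```python
from collections import Counter

docs = [
    'there was a place on my ankle that was itching',
    'but I did not scratch it',
    'and then my ear began to itch',
    'and next my back',
]

# Whitespace splitting is enough here because the documents have no punctuation
tokenised = [doc.split() for doc in docs]

# The vocabulary is the set of distinct tokens across all documents
vocabulary = sorted({tok for toks in tokenised for tok in toks})
print(len(vocabulary))  # 23

def to_vector(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vectors = [to_vector(toks) for toks in tokenised]
for vec in vectors:
    print(vec)
```

Each vector has one entry per vocabulary word, so all four documents are represented in the same 23-dimensional space regardless of their length.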
