# Tokenising Text

In this notebook:

1. **Tokenising strings** - splitting them into tokens (words etc.)
2. **Alternative tokenizers**: ``regexp_tokenize, wordpunct_tokenize, blankline_tokenize, TweetTokenizer``

#### Questions & Objectives:

- What is tokenisation?
- How can a string of raw text be tokenised?

#### Key Points

- Tokenisation means splitting a string into separate words and punctuation marks, for example to be able to count them.
- Text can be tokenised using a tokeniser, e.g. the punkt tokeniser in NLTK.

In order to process text we need to break it down into tokens. As we explained at the start, a token is a letter, word, number, or punctuation mark contained in a string.

To tokenise, we first need to import the `word_tokenize` function from NLTK's `tokenize` module, which allows us to do this without writing the code ourselves.

We will also download a specific tokeniser that NLTK uses by default. There are different ways of tokenising text, and today we will use NLTK's built-in punkt tokeniser. The cell below imports what we need and downloads it:

In [3]:
# run this cell. (it's fine if you see some pink warnings underneath it)
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\asantos2\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

Let's tokenise (split into tokens) the nursery rhyme "Humpty Dumpty".

We will save the tokenised output as a list in the `humpty_tokens` variable so that we can inspect it.

In [4]:
humpty_string = "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall; All the king's horses and all the king's men couldn't put Humpty together again."
humpty_tokens = word_tokenize(humpty_string)
print(humpty_tokens) # print all tokens

['Humpty', 'Dumpty', 'sat', 'on', 'a', 'wall', ',', 'Humpty', 'Dumpty', 'had', 'a', 'great', 'fall', ';', 'All', 'the', 'king', "'s", 'horses', 'and', 'all', 'the', 'king', "'s", 'men', 'could', "n't", 'put', 'Humpty', 'together', 'again', '.']


In [5]:
# let's print just a few of them to have a closer look:
print(humpty_tokens[0:10])

['Humpty', 'Dumpty', 'sat', 'on', 'a', 'wall', ',', 'Humpty', 'Dumpty', 'had']
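
As the key points mention, one reason for tokenising is to be able to count tokens. The following is a minimal sketch using `collections.Counter` from the Python standard library. The token list is copied from the output above so that the cell runs without NLTK, and the name `token_counts` is our own choice for illustration.

```python
from collections import Counter

# Tokens copied from the word_tokenize output above
humpty_tokens = ['Humpty', 'Dumpty', 'sat', 'on', 'a', 'wall', ',',
                 'Humpty', 'Dumpty', 'had', 'a', 'great', 'fall', ';',
                 'All', 'the', 'king', "'s", 'horses', 'and', 'all',
                 'the', 'king', "'s", 'men', 'could', "n't", 'put',
                 'Humpty', 'together', 'again', '.']

# Counter maps each distinct token to the number of times it occurs
token_counts = Counter(humpty_tokens)

print(len(humpty_tokens))          # total number of tokens: 32
print(token_counts['Humpty'])      # occurrences of 'Humpty': 3
print(token_counts.most_common(3)) # the three most frequent tokens
```

Note that punctuation marks such as `,` and `.` are counted as tokens too.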


### Unifying and cleaning up the text

To further analyse the data, we'll first learn how to perform some cleanup tasks.

As you can see in the above example, some of the words are uppercase and some are lowercase. But Python is case-sensitive, which means that 'hope' and 'Hope' are considered to be two completely different strings.

For example, when searching for a word or counting its occurrences, we will most likely want to find both the lowercase and the capitalised version of the word (e.g. `company` and `Company`). That's why, to simplify the analysis, we often normalise the data and make it all lowercase. This way both of the above words simply become `company`, which makes the text easier to search and count.

Since our list of tokens is basically a list of strings (words), we can apply the list comprehension we learned about before to transform our list of mixed-case words into a list of lowercase words.

As you might remember, the syntax for such a loop is `[output_format for item in items]` where:

- 'output_format' is some operation we perform on item, like `item.lower()` or `len(item)`
- 'items' is the List with all the elements we want to transform
- 'item' is a temporary name we give to each element of items, for the purposes of using that name inside of output_format
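
To make the three parts concrete, here is a small sketch in which `output_format` is `len(item)`. The list `words` and the names `lengths` and `lengths_loop` are made up for this illustration. The ordinary `for` loop underneath builds the same result, which may help if the one-line form still feels unfamiliar.

```python
# 'items' is this list; 'item' takes each element in turn
words = ['Humpty', 'Dumpty', 'sat', 'on', 'a', 'wall']

# List comprehension: output_format is len(item)
lengths = [len(item) for item in words]
print(lengths)   # [6, 6, 3, 2, 1, 4]

# The same transformation written as an ordinary loop
lengths_loop = []
for item in words:
    lengths_loop.append(len(item))

print(lengths == lengths_loop)   # True
```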

Let's modify the nursery rhyme example so that we only get lowercase tokens:

In [6]:
humpty_string = "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall; All the king's horses and all the king's men couldn't put Humpty together again."
humpty_tokens = word_tokenize(humpty_string)
lowercase_tokens = [token.lower() for token in humpty_tokens]
print(lowercase_tokens)

['humpty', 'dumpty', 'sat', 'on', 'a', 'wall', ',', 'humpty', 'dumpty', 'had', 'a', 'great', 'fall', ';', 'all', 'the', 'king', "'s", 'horses', 'and', 'all', 'the', 'king', "'s", 'men', 'could', "n't", 'put', 'humpty', 'together', 'again', '.']
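
To see why this matters for counting, the sketch below counts the word `all` before and after lowercasing, using the built-in `list.count` method. The short token list is an excerpt copied from the output above so that the cell runs on its own; the variable names are our own.

```python
# An excerpt of the mixed-case tokens from the nursery rhyme
mixed_case = ['All', 'the', 'king', "'s", 'horses', 'and',
              'all', 'the', 'king', "'s", 'men']

# Normalise every token to lowercase
lowercase = [token.lower() for token in mixed_case]

# Python compares strings case-sensitively, so 'All' is not counted here
print(mixed_case.count('all'))   # 1

# After normalisation both occurrences are found
print(lowercase.count('all'))    # 2
```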


### Alternative tokenizers

Here we only show some simple examples for popular tokenizers.

For more detailed use of tokenizers, please check the `nltk.tokenize` documentation website:
    https://www.nltk.org/api/nltk.tokenize.html

In [7]:
from nltk.tokenize import RegexpTokenizer
s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
# use a raw string (r'...') so that backslashes reach the regex engine unchanged
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
print(tokenizer.tokenize(s))

['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']


In [8]:
from nltk.tokenize import regexp_tokenize, wordpunct_tokenize, blankline_tokenize, TweetTokenizer

A `RegexpTokenizer` splits a string into substrings using a regular expression; `regexp_tokenize` is the function form of the same tokenizer. For example, the following call forms tokens out of word-character sequences (letters, digits and underscores), money expressions, and any other non-whitespace sequences:

In [9]:
print(regexp_tokenize(s, pattern=r'\w+|\$[\d\.]+|\S+'))

['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']


`wordpunct_tokenize` splits a text into runs of word characters and runs of punctuation characters, using the regexp `\w+|[^\w\s]+`. Note how this breaks `$3.88` into four separate tokens.

In [10]:
print(wordpunct_tokenize(s))

['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
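
The difference between the last two outputs comes down to the patterns alone. As a rough sketch, the same two patterns can be tried with `re.findall` from the Python standard library, which should reproduce the NLTK results for this sentence. This is an illustration of how the regular expressions behave, not a description of NLTK's internals, and the names `money_aware` and `word_punct` are our own.

```python
import re

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

# Pattern used with regexp_tokenize above. Alternatives are tried left to
# right, so '$3.88' is caught whole by the money alternative \$[\d\.]+
money_aware = re.findall(r'\w+|\$[\d\.]+|\S+', s)

# Pattern given for wordpunct_tokenize: a run of word characters,
# or a run of characters that are neither word characters nor whitespace
word_punct = re.findall(r'\w+|[^\w\s]+', s)

print(money_aware[:4])   # ['Good', 'muffins', 'cost', '$3.88']
print(word_punct[:7])    # ['Good', 'muffins', 'cost', '$', '3', '.', '88']
```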


`blankline_tokenize` tokenizes a string, treating any sequence of blank lines as a delimiter. Blank lines are defined as lines containing no characters, except for space or tab characters.

In [11]:
print(blankline_tokenize(s))

['Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.', 'Thanks.']
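
The same idea can be sketched with `re.split` from the standard library. The pattern below is our assumption about what counts as a run of blank lines (two newlines, with optional whitespace around them); NLTK's own implementation may differ in detail. Single newlines are left alone, so the first paragraph keeps its line breaks.

```python
import re

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

# Split wherever two newlines occur with only whitespace around or between
# them; drop any empty pieces that the split leaves behind
paragraphs = [part for part in re.split(r'\s*\n\s*\n\s*', s) if part]

print(len(paragraphs))   # 2
print(paragraphs)
```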


`TweetTokenizer`: Twitter-aware tokenizer, designed to be flexible and easy to adapt to new domains and tasks.

In [12]:
tknzr = TweetTokenizer()
s = '@Xyd_ thank you for always taking care of @xyz. Take care of yourself too! Merry Christmas and Happy New Year Snowman Christmas treeSparkles. #MerryChristmas #HappyNewYear'
print(tknzr.tokenize(s))
print(s.split())  # plain whitespace split, for comparison: punctuation stays attached to words

['@Xyd_', 'thank', 'you', 'for', 'always', 'taking', 'care', 'of', '@xyz', '.', 'Take', 'care', 'of', 'yourself', 'too', '!', 'Merry', 'Christmas', 'and', 'Happy', 'New', 'Year', 'Snowman', 'Christmas', 'treeSparkles', '.', '#MerryChristmas', '#HappyNewYear']
['@Xyd_', 'thank', 'you', 'for', 'always', 'taking', 'care', 'of', '@xyz.', 'Take', 'care', 'of', 'yourself', 'too!', 'Merry', 'Christmas', 'and', 'Happy', 'New', 'Year', 'Snowman', 'Christmas', 'treeSparkles.', '#MerryChristmas', '#HappyNewYear']


source: ls_Text and Data Mining Bootcamp