# Choosing a tokenizer

Given the following string, which of the patterns below is the best tokenizer? Ideally, you want to keep sentence punctuation as separate tokens while keeping `'#1'` as a single token.

`my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"`

The string is available in your workspace as `my_string`, and the patterns have been pre-loaded as `pattern1`, `pattern2`, `pattern3`, and `pattern4`, respectively. In the cells below, the same string is bound to the name `string`.

Additionally, `regexp_tokenize` has been imported from `nltk.tokenize`. Call `regexp_tokenize(string, pattern)` with the string and one of the patterns as arguments to experiment and see which pattern is the best tokenizer.

In [1]:
string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"
pattern1 = r'(\w+|\?|!)'
pattern2 = r'(\w+|#\d|\?|!)'
pattern3 = r'(#\d\w+\?!)'
pattern4 = r'\s+'

In [2]:
from nltk.tokenize import regexp_tokenize

regexp_tokenize(string, pattern1)

['SOLDIER',
 '1',
 'Found',
 'them',
 '?',
 'In',
 'Mercea',
 '?',
 'The',
 'coconut',
 's',
 'tropical',
 '!']

In [3]:
regexp_tokenize(string, pattern2)  # BEST: keeps '#1' whole and '?' / '!' as separate tokens

['SOLDIER',
 '#1',
 'Found',
 'them',
 '?',
 'In',
 'Mercea',
 '?',
 'The',
 'coconut',
 's',
 'tropical',
 '!']

In [4]:
regexp_tokenize(string, pattern3)

[]

In [5]:
regexp_tokenize(string, pattern4)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
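
The four results can be reproduced with the standard library alone. This is a minimal sketch, assuming that `regexp_tokenize` with its default settings returns the same matches as `re.findall` (which agrees with the outputs above). The names `text`, `patterns`, and `results` are illustrative and not part of the exercise.

```python
import re

text = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

patterns = {
    "pattern1": r"(\w+|\?|!)",      # words, '?' or '!'; nothing matches '#'
    "pattern2": r"(\w+|#\d|\?|!)",  # also matches '#' followed by one digit
    "pattern3": r"(#\d\w+\?!)",     # one fixed sequence: '#', digit, word chars, '?', '!'
    "pattern4": r"\s+",             # runs of whitespace only
}

# Collect every non-overlapping match of each pattern
results = {name: re.findall(p, text) for name, p in patterns.items()}

for name, tokens in results.items():
    print(name, tokens)
```

`pattern3` finds nothing because it has no alternation: the whole sequence must appear back to back, and the string never contains it. `pattern4` describes the separators rather than the tokens, so it returns only the spaces.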

# Regex with NLTK tokenization

Twitter is a frequently used source for NLP text and tasks. In this exercise, you'll build a more complex tokenizer for tweets with hashtags and mentions using `nltk` and regex. The `nltk.tokenize.TweetTokenizer` class gives you some extra methods and attributes for parsing tweets.

Here, you're given some example tweets to parse using both `TweetTokenizer` and `regexp_tokenize` from the `nltk.tokenize` module. These example tweets have been pre-loaded into the variable `tweets`. Feel free to explore it in the IPython Shell!

*Unlike the functions in the `re` library, which take the pattern first, `regexp_tokenize()` takes the pattern as the **second** argument.*
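
The difference in argument order can be seen side by side. This is a small sketch with an illustrative tweet and pattern; the NLTK import is guarded so the snippet also runs where NLTK is not installed, in which case only the `re` call is shown.

```python
import re

tweet = "Thanks @datacamp :) #nlp #python"
pattern = r"[@#]\w+"

# re: pattern first, string second
re_matches = re.findall(pattern, tweet)
print(re_matches)

# nltk: string first, pattern second
try:
    from nltk.tokenize import regexp_tokenize
    nltk_matches = regexp_tokenize(tweet, pattern)
    print(nltk_matches)
except ImportError:
    nltk_matches = None  # NLTK not available in this environment
```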

In [7]:
# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

In [8]:
tweets = [
    'This is the best #nlp exercise ive found online! #python',
    '#NLP is super fun! <3 #learning',
    'Thanks @datacamp :) #nlp #python']

In [9]:
# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"
# Use the pattern on the first tweet in the tweets list
hashtags = regexp_tokenize(tweets[0], pattern1)
print(hashtags)

['#nlp', '#python']


In [10]:
# Write a pattern that matches both mentions (@) and hashtags
pattern2 = r"([@#]\w+)"
# Use the pattern on the last tweet in the tweets list
mentions_hashtags = regexp_tokenize(tweets[-1], pattern2)
print(mentions_hashtags)

['@datacamp', '#nlp', '#python']


In [11]:
# Use the TweetTokenizer to tokenize all tweets (one token list per tweet)
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)

[['This', 'is', 'the', 'best', '#nlp', 'exercise', 'ive', 'found', 'online', '!', '#python'], ['#NLP', 'is', 'super', 'fun', '!', '<3', '#learning'], ['Thanks', '@datacamp', ':)', '#nlp', '#python']]


# Non-ASCII tokenization

In this exercise, you'll practice advanced tokenization by tokenizing some non-ASCII text. You'll be using German with emoji!

Here, you have access to a string called `german_text`, which has been printed for you in the Shell. Notice the emoji and the German characters!

The following functions have been pre-imported from `nltk.tokenize`: `regexp_tokenize` and `word_tokenize`.

Unicode ranges for emoji are:

`('\U0001F300'-'\U0001F5FF')`, `('\U0001F600'-'\U0001F64F')`, `('\U0001F680'-'\U0001F6FF')`, `('\u2600'-'\u26FF')`, and `('\u2700'-'\u27BF')`.
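
As a rough illustration of how these ranges fit into a regex, the sketch below joins them into one character class and prints the code point of each match. It uses only the standard library; `EMOJI_PATTERN` and `sample` are illustrative names and data, not part of the exercise.

```python
import re

# One character class covering the ranges listed above.
# Inside [...] the ranges are simply written one after another.
EMOJI_PATTERN = (
    "["
    "\U0001F300-\U0001F5FF"
    "\U0001F600-\U0001F64F"
    "\U0001F680-\U0001F6FF"
    "\u2600-\u26FF"
    "\u2700-\u27BF"
    "]"
)

# Hypothetical sample text: a pizza slice and a taxi emoji
sample = "Pizza \U0001F355 und Taxi \U0001F695"

found = re.findall(EMOJI_PATTERN, sample)
for ch in found:
    # hex(ord(ch)) shows which of the ranges the character falls into
    print(ch, hex(ord(ch)))
```

The pizza emoji (`0x1f355`) falls in the first range and the taxi (`0x1f695`) in the third.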

In [12]:
german_text = "Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕"

In [15]:
import nltk
nltk.download('punkt')
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [16]:
# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)

['Wann', 'gehen', 'wir', 'Pizza', 'essen', '?', '🍕', 'Und', 'fährst', 'du', 'mit', 'Über', '?', '🚕']


In [17]:
# Tokenize and print only capitalized words (Ü is listed explicitly because [A-Z] does not include umlauts)
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

['Wann', 'Pizza', 'Und', 'Über']
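
The class `[A-ZÜ]` covers only the uppercase letters it lists, so a word starting with `Ä` or `Ö` would be missed. One possible alternative, sketched here with the standard library only, is to match all words and keep those whose first character is uppercase according to `str.isupper()`. The variable names are illustrative.

```python
import re

text = "Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕"

# \w is Unicode-aware for str patterns in Python 3, so it matches ä and Ü
words = re.findall(r"\w+", text)

# Keep the words whose first character is an uppercase letter
capitalized = [w for w in words if w[0].isupper()]
print(capitalized)
```

Unlike the regex above, this also keeps one-letter capitalized words, since `\w+` does not require a second character.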


In [18]:
# Tokenize and print only emoji
# All ranges go in one character class; quotes or | inside [...] would be matched literally
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(german_text, emoji))

['🍕', '🚕']
