# Twitter Tokenizer

Tokenization is the process of separating out or segmenting words in a given text. This is often one of the first steps while working with text. In English, words are often separated from each other by whitespace, but whitespace is not always sufficient:

- 'Los Angeles' and 'self sustaining' in "a company in Los Angeles is building a self sustaining city on Mars" should each be treated as a single word even though they contain spaces.
- Sometimes we need to separate *I'm* into the two words 'I' and 'am.'
- Sometimes hyphenated words should be separated, e.g. "the hold-him-back-and-drag-him-away maneuver"
- Sometimes sentences carry words with missing or spurious spaces, e.g. "wer an after him qui ckly" which should yield {we, ran, after, him, quickly}
- Sometimes unusual tokens should be recognized as single tokens, e.g., the TV series 'M\*A\*S\*H.'

The goal of this tutorial is to introduce a tokenizer developed for tweets, which was released as part of the [TweetNLP](https://www.ark.cs.cmu.edu/TweetNLP/) toolkit. The original code is written in Java; the Python version of the tokenizer used here comes from [this](https://github.com/myleott/ark-twokenize-py) GitHub repository, with slight modifications.

## What's the Big Deal?
Tokenizing the tweets by splitting on spaces alone, without separating out punctuation, emoticons, etc., gives something like this. Note that most of the punctuation and special characters stay attached to neighboring words, which poses a problem during tagging.

In [1]:
raw_text = 'tweets.txt'
inp_file = open(raw_text)
for line in inp_file:
    print('o: ' + line.strip())
    tokenized_tweet = line.split(' ')
    print('t: ', tokenized_tweet, '\n')
inp_file.close()

o: I won't win a single          game I bet on!! Got Mr. Cliff Lee, if he loses its on me U.S.A!
t:  ['I', "won't", 'win', 'a', 'single', '', '', '', '', '', '', '', '', '', 'game', 'I', 'bet', 'on!!', 'Got', 'Mr.', 'Cliff', 'Lee,', 'if', 'he', 'loses', 'its', 'on', 'me', 'U.S.A!\n'] 

o: RT @eye_e: this poster-print costs $12.40, which is 40% of the normal price! http://tl.gd/6meogh
t:  ['RT', '@eye_e:', 'this', 'poster-print', 'costs', '$12.40,', 'which', 'is', '40%', 'of', 'the', 'normal', 'price!', 'http://tl.gd/6meogh\n'] 

o: I ❤ Biebs & want to hang out with him!!
t:  ['I', '❤', 'Biebs', '&', 'want', 'to', 'hang', 'out', 'with', 'him!!\n'] 

o: @thecamion I like monkeys, but I still hate COSTCO parking lots.. oO o.O #COSTCO 2:15PM
t:  ['@thecamion', 'I', 'like', 'monkeys,', 'but', 'I', 'still', 'hate', 'COSTCO', 'parking', 'lots..', 'oO', 'o.O', '#COSTCO', '2:15PM\n'] 

o: Texas Rangers are in the World Series!  Go Rangers!!!!!!!!! :> <3 ♥❤♡ http://fb.me/D2LsXBJx
t:  ['Texas', '

### Background on Regular Expressions 

Regular expressions (regex) are used for specifying patterns to search for in text strings.
- For an in-depth introduction to regex, see the following resource: https://web.stanford.edu/~jurafsky/slp3/2.pdf
- See https://docs.python.org/2/library/re.html for syntax regarding the different operations.

#### Important Notes:
- Placing an 'r' before a string literal makes it a raw string. So, r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions use the backslash to escape special characters, which collides with Python's use of the backslash for the same purpose in string literals; raw strings avoid having to double every backslash.
- A set of characters to match is written inside brackets []. For example, [amk] will match a, m, or k.
- (?iLmsux) (One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose), for the entire regular expression. This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function. 
- \u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff refer to different white spaces. For a detailed description of what these unicode values mean you can check out this resource: http://www.endmemo.com/unicode/ascii.php
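
The notes above can be checked with a few lines of Python. This is only a small sketch; the sample strings are made up for illustration and are not part of the tokenizer.

```python
import re

# A raw string keeps the backslash, so r"\n" has two characters and "\n" has one.
raw_len, cooked_len = len(r"\n"), len("\n")

# A character set matches any single character listed inside the brackets.
set_matches = re.findall(r"[amk]", "mask")

# The inline flag (?i) behaves like passing re.I as a flag argument.
inline = re.search(r"(?i)costco", "I hate COSTCO")
flagged = re.search(r"costco", "I hate COSTCO", re.I)

# \u00a0 is a no-break space: splitting on an ordinary space misses it,
# so it has to be listed explicitly in the character set.
nbsp_text = "a\u00a0b"
naive_pieces = nbsp_text.split(" ")
pieces = re.split("[ \u00a0]+", nbsp_text)

print(raw_len, cooked_len, set_matches, inline.group(), pieces)
```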


### Import required packages

In [2]:
from __future__ import unicode_literals

import operator
import re
import sys

import subprocess
import shlex

import tokenize
try:
    from html.parser import HTMLParser
except ImportError:
    from HTMLParser import HTMLParser
    

try:
    import html
except ImportError:
    pass  

### Whitespace and contractions

- Contractions are shortened versions of words or syllables, e.g., expressions such as n't, I'd, etc. See the list of English contractions [here](https://en.wikipedia.org/wiki/Wikipedia%3aList_of_English_contractions).


In [3]:
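# Raw string, so that \w reaches the regex engine without an invalid-escape warning
Contractions = re.compile(r"(?i)(\w+)(n['’′]t|['’′]ve|['’′]ll|['’′]d|['’′]re|['’′]s|['’′]m)$", re.UNICODE)
# \\s is escaped explicitly; the \uXXXX escapes are still expanded by Python
Whitespace = re.compile(u"[\\s\u0020\u00a0\u1680\u180e\u202f\u205f\u3000\u2000-\u200a]+", re.UNICODE)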
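
As a quick check of what these two patterns do, the sketch below reuses the same expressions on a couple of sample strings (the sample strings are made up for illustration). Note how the contraction pattern backtracks so that "won't" splits into "wo" and "n't".

```python
import re

Contractions = re.compile(r"(?i)(\w+)(n['’′]t|['’′]ve|['’′]ll|['’′]d|['’′]re|['’′]s|['’′]m)$", re.UNICODE)
Whitespace = re.compile("[\\s\u0020\u00a0\u1680\u180e\u202f\u205f\u3000\u2000-\u200a]+", re.UNICODE)

# The two groups are the word stem and the contraction suffix.
wont = Contractions.search("won't").groups()
im = Contractions.search("I'm").groups()

# Runs of ordinary and Unicode whitespace collapse into a single space.
squeezed = Whitespace.sub(" ", "a single          game\u00a0I bet on ").strip()

print(wont, im, repr(squeezed))
```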

### Punctuation characters and sequences


In [4]:
punctChars = r"['\"“”‘’.?!…,:;]"
punctSeq   = r"['\"“”‘’]+|[.?!,…]+|[:;]+"
entity     = r"&(?:amp|lt|gt|quot);" # see more here https://www.w3schools.com/html/html_entities.asp

### Joining multiple expressions
Let's define the regex_or function to join multiple regular expressions.

In [5]:
def regex_or(*items):
    return '(?:' + '|'.join(items) + ')'
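
A small sketch of why the surrounding non-capturing group matters: without it, anything appended after the alternation binds only to the last alternative. The sub-patterns here are made up for illustration.

```python
import re

def regex_or(*items):
    return "(?:" + "|".join(items) + ")"

# With the group, the trailing "!" applies to both alternatives.
grouped = regex_or(r"\d+", r"[a-z]+") + "!"
# Without the group, "!" belongs only to the second alternative.
ungrouped = r"\d+|[a-z]+" + "!"

grouped_match = re.fullmatch(grouped, "42!")
ungrouped_match = re.fullmatch(ungrouped, "42!")

print(grouped, grouped_match is not None, ungrouped_match is not None)
```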

### URLs & country domains
- (?: pattern ): Occasionally we might want to use parentheses for grouping, but don’t want to capture the resulting pattern in a register. In that case we use a non-capturing group, which is specified by putting the commands ?: after the open paren, in the form (?: pattern ). It matches the pattern inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern. For example, *(?:some|a few) (people|cats) like some \1* will match "some cats like some cats" but not "some cats like some a few".
- (?= pattern): This lookahead operator is true if pattern occurs, but is zero-width, i.e. the match pointer doesn’t advance.
- (?! pattern): This negative lookahead operator returns true only if a pattern does not match, but again is zero-width and doesn’t advance the cursor. For example, suppose we want to match, at the beginning of a line, any single word that doesn’t start with “Volcano”. We can use negative lookahead to do this: *^(?!Volcano)[A-Za-z]+*
- (?(id/name)yes-pattern|no-pattern): will try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn’t. no-pattern is optional and can be omitted. For example, (<)?(\w+@\w+(?:\.\w+)+)(?(1)>) is a poor email matching pattern, which will match with '\<user@host.com\>' as well as 'user@host.com', but not with '\<user@host.com'.
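
The four constructs above can be tried out directly with Python's `re` module. The sketch below uses the examples from the list, plus one made-up string for the positive lookahead.

```python
import re

# Non-capturing group: (?:some|a few) is not numbered, so \1 refers to (people|cats).
nc = re.compile(r"(?:some|a few) (people|cats) like some \1")
m1 = nc.search("some cats like some cats")
m2 = nc.search("some cats like some a few")

# Positive lookahead: match digits only when a "%" follows, without consuming it.
percent = re.search(r"\d+(?=%)", "40% of the normal price")

# Negative lookahead: a word at the start of the line that does not begin with "Volcano".
not_volcano = re.compile(r"^(?!Volcano)[A-Za-z]+")
rejected = not_volcano.match("Volcano erupts")
accepted = not_volcano.match("Mountain view")

# Conditional: require ">" only if the optional "<" (group 1) matched.
email = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)")
bracketed = email.match("<user@host.com>")
bare = email.match("user@host.com")
unbalanced = email.match("<user@host.com")

print(m1.group(1), percent.group(), accepted.group(), bracketed.group(), bare.group())
```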

In [6]:
urlStart1  = r"(?:https?://|\bwww\.)"
commonTLDs = r"(?:com|org|edu|gov|net|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|pro|tel|travel|xxx)"
ccTLDs = r"(?:ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|" + \
r"bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|" + \
r"er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|" + \
r"hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|" + \
r"lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|" + \
r"nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|" + \
r"sl|sm|sn|so|sr|ss|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|" + \
r"va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|za|zm|zw)"	#TODO: remove obscure country domains?
urlStart2  = r"\b(?:[A-Za-z\d-])+(?:\.[A-Za-z0-9]+){0,3}\." + regex_or(commonTLDs, ccTLDs) + r"(?:\."+ccTLDs+r")?(?=\W|$)"
urlBody    = r"(?:[^\.\s<>][^\s<>]*?)?"
urlExtraCrapBeforeEnd = regex_or(punctChars, entity) + "+?"
urlEnd     = r"(?:\.\.+|[<>]|\s|$)"
url        = regex_or(urlStart1, urlStart2) + urlBody + "(?=(?:"+urlExtraCrapBeforeEnd+")?"+urlEnd+")"
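
To see how the lazy URL body and the trailing lookahead work together, here is a reduced sketch. It is an assumption-laden simplification: it keeps only the `urlStart1` branch (explicit `http://`, `https://` or `www.`) and drops the bare-domain branch `urlStart2`, so `simple_url` is not the tokenizer's full `url` pattern. The sample sentences are made up.

```python
import re

def regex_or(*items):
    return "(?:" + "|".join(items) + ")"

punctChars = r"['\"“”‘’.?!…,:;]"
entity = r"&(?:amp|lt|gt|quot);"
urlStart1 = r"(?:https?://|\bwww\.)"
urlBody = r"(?:[^\.\s<>][^\s<>]*?)?"
urlExtraCrapBeforeEnd = regex_or(punctChars, entity) + "+?"
urlEnd = r"(?:\.\.+|[<>]|\s|$)"

# Reduced variant (hypothetical name): urlStart1 only, no TLD-based detection.
simple_url = re.compile(
    urlStart1 + urlBody + "(?=(?:" + urlExtraCrapBeforeEnd + ")?" + urlEnd + ")"
)

# The lazy body grows until the lookahead sees optional punctuation
# followed by whitespace, "<", ">", "..", or the end of the string.
found = simple_url.findall("normal price! http://tl.gd/6meogh")
trimmed = simple_url.findall("see http://example.com/page, then www.example.org.")

print(found, trimmed)
```

Because the punctuation sits inside a lookahead, the trailing comma and period are left outside the URL token.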

### Numeric
Regular expressions for monetary amounts, time-like strings such as 12:53, numbers with comma separators, and the currency symbols used in number combinations.

In [14]:
monetary = r"\$([0-9]+)?\.?([0-9]+)?"
timeLike   = r"\d+(?::\d+){1,2}"
numberWithCommas = r"(?:(?<!\d)\d{1,3},)+?\d{3}" + r"(?=(?:[^,\d]|$))"
numComb = u"[\u0024\u058f\u060b\u09f2\u09f3\u09fb\u0af1\u0bf9\u0e3f\u17db\ua838\ufdfc\ufe69\uff04\uffe0\uffe1\uffe5\uffe6\u00a2-\u00a5\u20a0-\u20b9]?"
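
A short sketch of what the first three patterns pick out, using fragments of the example tweets plus one made-up sentence for the comma-separated number:

```python
import re

monetary = r"\$([0-9]+)?\.?([0-9]+)?"
timeLike = r"\d+(?::\d+){1,2}"
numberWithCommas = r"(?:(?<!\d)\d{1,3},)+?\d{3}" + r"(?=(?:[^,\d]|$))"

# The trailing comma is not part of the monetary amount.
price = re.search(monetary, "costs $12.40, which").group()
# Only the digits and colon are time-like; "PM" is left for a separate token.
clock = re.search(timeLike, "#COSTCO 2:15PM").group()
# The lookahead requires that the number is not followed by another comma or digit.
big = re.search(numberWithCommas, "about 1,234,567 tweets").group()

print(price, clock, big)
```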

### Abbreviations
Regular expressions for abbreviations such as Mr., Mrs., and U.S., along with separators, decorative symbols, and word-internal apostrophes.

In [15]:
boundaryNotDot = regex_or("$", r"\s", r"[“\"?!,:;]", entity)
aa1  = r"(?:[A-Za-z]\.){2,}(?=" + boundaryNotDot + ")"
aa2  = r"[^A-Za-z](?:[A-Za-z]\.){1,}[A-Za-z](?=" + boundaryNotDot + ")"
standardAbbreviations = r"\b(?:[Mm]r|[Mm]rs|[Mm]s|[Dd]r|[Ss]r|[Jj]r|[Rr]ep|[Ss]en|[Ss]t)\."
arbitraryAbbrev = regex_or(aa1, aa2, standardAbbreviations)
separators  = "(?:--+|―|—|~|–|=)"
decorations = u"(?:[♫♪]+|[★☆]+|[♥❤♡]+|[\u2639-\u263b]+|[\ue001-\uebbb]+)"
thingsThatSplitWords = r"[^\s\.,?\"]"
embeddedApostrophe = thingsThatSplitWords+r"+['’′]" + thingsThatSplitWords + "*"
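
The sketch below exercises three of these patterns in isolation. The test sentences are made up, apart from the fragments taken from the example tweets.

```python
import re

def regex_or(*items):
    return "(?:" + "|".join(items) + ")"

entity = r"&(?:amp|lt|gt|quot);"
boundaryNotDot = regex_or("$", r"\s", r"[“\"?!,:;]", entity)
aa1 = r"(?:[A-Za-z]\.){2,}(?=" + boundaryNotDot + ")"
standardAbbreviations = r"\b(?:[Mm]r|[Mm]rs|[Mm]s|[Dd]r|[Ss]r|[Jj]r|[Rr]ep|[Ss]en|[Ss]t)\."
thingsThatSplitWords = r"[^\s\.,?\"]"
embeddedApostrophe = thingsThatSplitWords + r"+['’′]" + thingsThatSplitWords + "*"

# Titles keep their trailing period.
title = re.search(standardAbbreviations, "Got Mr. Cliff Lee").group()
# Dotted acronyms are kept whole when followed by a boundary such as whitespace.
acronym = re.search(aa1, "made in the U.S. today").group()
# A word with an internal apostrophe is protected from being split.
word = re.search(embeddedApostrophe, "I won't win").group()

print(title, acronym, word)
```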

### Emoticons
Regular expressions for emoticons, including Western-style faces such as :) and :-(, East Asian style faces, hearts, and arrows.

In [16]:
normalEyes = "[:=]" # 8 and x are eyes but cause problems
wink = "[;]"
noseArea = "(?:|-|[^a-zA-Z0-9 ])" # doesn't get :'-(
happyMouths = r"[D\)\]\}]+"
sadMouths = r"[\(\[\{]+"
tongue = "[pPd3]+"
otherMouths = r"(?:[oO]+|[/\\]+|[vV]+|[Ss]+|[|]+)" # remove forward slash if http://'s aren't cleaned

# mouth repetition examples:
# @aliciakeys Put it in a love song :-))
# @hellocalyclops =))=))=)) Oh well

# myleott: try to be as case insensitive as possible, but still not perfect, e.g., o.O fails
#bfLeft = u"(♥|0|o|°|v|\\$|t|x|;|\u0ca0|@|ʘ|•|・|◕|\\^|¬|\\*)".encode('utf-8')
bfLeft = u"(♥|0|[oO]|°|[vV]|\\$|[tT]|[xX]|;|\u0ca0|@|ʘ|•|・|◕|\\^|¬|\\*)"
bfCenter = r"(?:[\.]|[_-]+)"
bfRight = r"\2"
s3 = r"(?:--['\"])"
s4 = r"(?:<|&lt;|>|&gt;)[\._-]+(?:<|&lt;|>|&gt;)"
s5 = "(?:[.][_]+[.])"
# myleott: in Python the (?i) flag affects the whole expression
#basicface = "(?:(?i)" +bfLeft+bfCenter+bfRight+ ")|" +s3+ "|" +s4+ "|" + s5
basicface = "(?:" +bfLeft+bfCenter+bfRight+ ")|" +s3+ "|" +s4+ "|" + s5

eeLeft = r"[＼\\ƪԄ\(（<>;ヽ\-=~\*]+"
eeRight= u"[\\-=\\);'\u0022<>ʃ）/／ノﾉ丿╯σっµ~\\*]+"
eeSymbol = r"[^A-Za-z0-9\s\(\)\*:=-]"
eastEmote = eeLeft + "(?:"+basicface+"|" +eeSymbol+")+" + eeRight

oOEmote = r"(?:[oO]" + bfCenter + r"[oO])"

emoticon = regex_or(
        # Standard version  :) :( :] :D :P
        "(?:>|&gt;)?" + regex_or(normalEyes, wink) + regex_or(noseArea,"[Oo]") + regex_or(tongue+r"(?=\W|$|RT|rt|Rt)", otherMouths+r"(?=\W|$|RT|rt|Rt)", sadMouths, happyMouths),

        # reversed version (: D:  use positive lookbehind to remove "(word):"
        # because eyes on the right side is more ambiguous with the standard usage of : ;
        regex_or("(?<=(?: ))", "(?<=(?:^))") + regex_or(sadMouths,happyMouths,otherMouths) + noseArea + regex_or(normalEyes, wink) + "(?:<|&lt;)?",

        #inspired by http://en.wikipedia.org/wiki/User:Scapler/emoticons#East_Asian_style
        eastEmote.replace("2", "1", 1), basicface,
        # iOS 'emoji' characters (some smileys, some symbols) [\ue001-\uebbb]
        # TODO should try a big precompiled lexicon from Wikipedia, Dan Ramage told me (BTO) he does this

        # myleott: o.O and O.o are two of the biggest sources of differences
        #          between this and the Java version. One little hack won't hurt...
        oOEmote
)

Hearts = "(?:<+/?3+)+" #the other hearts are in decorations

Arrows = regex_or(r"(?:<*[-―—=]*>+|<+[-―—=]*>*)", u"[\u2190-\u21ff]+")
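
A reduced sketch of the "standard version" branch and the hearts pattern, using the same building blocks as above. The name `standard_emoticon` is introduced here only for illustration, and the reversed, East Asian, and `o.O` branches are left out.

```python
import re

def regex_or(*items):
    return "(?:" + "|".join(items) + ")"

normalEyes = "[:=]"
wink = "[;]"
noseArea = "(?:|-|[^a-zA-Z0-9 ])"
happyMouths = r"[D\)\]\}]+"
sadMouths = r"[\(\[\{]+"
tongue = "[pPd3]+"
otherMouths = r"(?:[oO]+|[/\\]+|[vV]+|[Ss]+|[|]+)"

# Eyes, optional nose, then one of the mouth types (illustrative subset only).
standard_emoticon = (
    "(?:>|&gt;)?" + regex_or(normalEyes, wink) + regex_or(noseArea, "[Oo]")
    + regex_or(tongue + r"(?=\W|$|RT|rt|Rt)", otherMouths + r"(?=\W|$|RT|rt|Rt)",
               sadMouths, happyMouths)
)
Hearts = "(?:<+/?3+)+"

faces = re.findall(standard_emoticon, "Put it in a love song :-)) but :( and :P")
hearts = re.findall(Hearts, "Go Rangers <3 <333")

print(faces, hearts)
```

Repeated mouths such as `:-))` stay in one token because the mouth classes are followed by `+`.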

### Twitter Specific Characters

Regular expressions for Twitter-specific tokens such as hashtags, @-mentions, and email addresses. Details on regular expressions for emails can be found here: http://www.regular-expressions.info/email.html
Note:
- When the LOCALE and UNICODE flags are not specified, [\w] matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

In [17]:
Hashtag = "#[a-zA-Z0-9_]+"
AtMention = "[@＠][a-zA-Z0-9_]+"

Bound = r"(?:\W|^|$)"
Email = regex_or(r"(?<=(?:\W))", "(?<=(?:^))") + r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}(?=" +Bound+")"
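
The following sketch runs the three patterns separately over one made-up tweet. It also shows why the email pattern needs to be tried before the @-mention pattern when they are combined: on its own, `AtMention` also fires inside the email address.

```python
import re

def regex_or(*items):
    return "(?:" + "|".join(items) + ")"

Hashtag = "#[a-zA-Z0-9_]+"
AtMention = "[@＠][a-zA-Z0-9_]+"
Bound = r"(?:\W|^|$)"
Email = (regex_or(r"(?<=(?:\W))", "(?<=(?:^))")
         + r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}(?=" + Bound + ")")

tweet = "RT @eye_e: mail user@host.com about #COSTCO"

hashtags = re.findall(Hashtag, tweet)
mentions = re.findall(AtMention, tweet)   # also picks up "@host" from the email
emails = re.findall(Email, tweet)

print(hashtags, mentions, emails)
```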

### All Together 

Create a regex object from all of the search patterns above, and identify edge punctuation: punctuation at the edges of words is split off (e.g., 'foo' becomes ' foo '), but punctuation in the middle of words, as in don't, is left alone.

In [18]:
# We will be tokenizing using these regexps as delimiters
# Additionally, these things are "protected", meaning they shouldn't be further split themselves.
Protected  = re.compile(
    regex_or(
        Hearts,
        url,
        Email,
        timeLike,
        monetary,
        numberWithCommas,
        numComb,
        emoticon,
        Arrows,
        entity,
        punctSeq,
        arbitraryAbbrev,
        separators,
        decorations,
        embeddedApostrophe,
        Hashtag,
        AtMention), re.UNICODE)

# Edge punctuation
# Want: 'foo' => ' foo '
# While also:   don't => don't
# the first is considered "edge punctuation".
# the second is word-internal punctuation -- don't want to mess with it.
# BTO (2011-06): the edgepunct system seems to be the #1 source of problems these days.
# I remember it causing lots of trouble in the past as well.  Would be good to revisit or eliminate.

# Note the 'smart quotes' (http://en.wikipedia.org/wiki/Smart_quotes)
#edgePunctChars    = r"'\"“”‘’«»{}\(\)\[\]\*&" #add \\p{So}? (symbols)
edgePunctChars    = u"'\"“”‘’«»{}\\(\\)\\[\\]\\*&" #add \\p{So}? (symbols)
edgePunct    = "[" + edgePunctChars + "]"
notEdgePunct = "[a-zA-Z0-9]" # content characters
offEdge = r"(^|$|:|;|\s|\.|,)"  # colon here gets "(hello):" ==> "( hello ):"
EdgePunctLeft  = re.compile(offEdge + "("+edgePunct+"+)("+notEdgePunct+")", re.UNICODE)
EdgePunctRight = re.compile("("+notEdgePunct+")("+edgePunct+"+)" + offEdge, re.UNICODE)
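
To make the edge-punctuation rule concrete, the sketch below applies the two patterns with the same substitutions the tokenizer uses. The helper name `split_edge_punct` and the sample string are chosen here for illustration.

```python
import re

edgePunctChars = "'\"“”‘’«»{}\\(\\)\\[\\]\\*&"
edgePunct = "[" + edgePunctChars + "]"
notEdgePunct = "[a-zA-Z0-9]"
offEdge = r"(^|$|:|;|\s|\.|,)"
EdgePunctLeft = re.compile(offEdge + "(" + edgePunct + "+)(" + notEdgePunct + ")", re.UNICODE)
EdgePunctRight = re.compile("(" + notEdgePunct + ")(" + edgePunct + "+)" + offEdge, re.UNICODE)

def split_edge_punct(text):
    # Insert a space between leading punctuation and the word that follows it ...
    text = EdgePunctLeft.sub(r"\1\2 \3", text)
    # ... and between a word and the punctuation that trails it.
    return EdgePunctRight.sub(r"\1 \2\3", text)

result = split_edge_punct("'foo' and don't (hello):")
print(result)
```

The apostrophe in don't is untouched because it has content characters on both sides, so neither pattern sees an edge there.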


## Tokenization 
The functions here help in the tokenization of the input tweets, where each token needs to be separated.

In [19]:
def splitEdgePunct(input):
    input = EdgePunctLeft.sub(r"\1\2 \3", input)
    input = EdgePunctRight.sub(r"\1 \2\3", input)
    return input

# The main work of tokenizing a tweet.
def simpleTokenize(text):

    # Do the no-brainers first
    splitPunctText = splitEdgePunct(text)

    textLength = len(splitPunctText)

    # BTO: the logic here got quite convoluted via the Scala porting detour
    # It would be good to switch back to a nice simple procedural style like in the Python version
    # ... Scala is such a pain.  Never again.

    # Find the matches for subsequences that should be protected,
    # e.g. URLs, 1.0, U.N.K.L.E., 12:53
    bads = []
    badSpans = []
    for match in Protected.finditer(splitPunctText):
        # The spans of the "bads" should not be split.
        if (match.start() != match.end()): #unnecessary?
            bads.append( [splitPunctText[match.start():match.end()]] )
            badSpans.append( (match.start(), match.end()) )

    # Create a list of indices to create the "goods", which can be
    # split. We are taking "bad" spans like
    #     List((2,5), (8,10))
    # to create
    #     List(0, 2, 5, 8, 10, 12)
    # where, e.g., "12" here would be the textLength
    # has an even length and no indices are the same
    indices = [0]
    for (first, second) in badSpans:
        indices.append(first)
        indices.append(second)
    indices.append(textLength)

    # Group the indices and map them to their respective portion of the string
    splitGoods = []
    for i in range(0, len(indices), 2):
        goodstr = splitPunctText[indices[i]:indices[i+1]]
        splitstr = goodstr.strip().split(" ")
        splitGoods.append(splitstr)

    #  Reinterpolate the 'good' and 'bad' Lists, ensuring that
    #  additonal tokens from last good item get included
    zippedStr = []
    for i in range(len(bads)):
        zippedStr = addAllnonempty(zippedStr, splitGoods[i])
        zippedStr = addAllnonempty(zippedStr, bads[i])
    zippedStr = addAllnonempty(zippedStr, splitGoods[len(bads)])

    # BTO: our POS tagger wants "ur" and "you're" to both be one token.
    # Uncomment to get "you 're"
    #splitStr = []
    #for tok in zippedStr:
    #    splitStr.extend(splitToken(tok))
    #zippedStr = splitStr

    return zippedStr


def addAllnonempty(master, smaller):
    for s in smaller:
        strim = s.strip()
        if (len(strim) > 0):
            master.append(strim)
    return master

# "foo   bar " => "foo bar"
def squeezeWhitespace(input):
    return Whitespace.sub(" ", input).strip()

# Final pass tokenization based on special patterns
def splitToken(token):
    m = Contractions.search(token)
    if m:
        return [m.group(1), m.group(2)]
    return [token]

# Assume 'text' has no HTML escaping.
def tokenize(text):
    return simpleTokenize(squeezeWhitespace(text))

# Twitter text comes HTML-escaped, so unescape it.
# We also first unescape &amp;'s, in case the text has been buggily double-escaped.
def normalizeTextForTagger(text):
    # html.unescape() was added in Python 3.4; compare the version as a tuple
    assert sys.version_info >= (3, 4), 'Python 3.4 or later required'
    text = text.replace("&amp;", "&")
    text = html.unescape(text)
    return text

# This is intended for raw tweet text -- we do some HTML entity unescaping before running the tagger.
#
# This function normalizes the input text BEFORE calling the tokenizer.
# So the tokens you get back may not exactly correspond to
# substrings of the original text.
def tokenizeRawTweetText(text):
    tokens = tokenize(normalizeTextForTagger(text))
    return tokens
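
The index bookkeeping in simpleTokenize can be hard to follow, so here is a stripped-down sketch of the same idea: find the protected ("bad") spans, turn them into a flat list of boundaries, then alternate between splitting the unprotected ("good") stretches on whitespace and copying the protected spans verbatim. The function `interleave` and the toy pattern `toy_protected` are hypothetical stand-ins, not the tokenizer's own code or its full `Protected` regex.

```python
import re

def interleave(text, protected):
    # Non-empty spans matched by the protected pattern must not be split.
    spans = [m.span() for m in protected.finditer(text) if m.start() != m.end()]

    # Spans [(9, 14), (14, 15)] in an 18-character text become
    # the boundaries [0, 9, 14, 14, 15, 18].
    indices = [0]
    for start, end in spans:
        indices.extend((start, end))
    indices.append(len(text))

    tokens = []
    for k in range(0, len(indices), 2):
        # "Good" stretch: ordinary text, split on whitespace.
        tokens.extend(text[indices[k]:indices[k + 1]].split())
        # "Bad" stretch that follows it: kept as a single token.
        if k // 2 < len(spans):
            start, end = spans[k // 2]
            tokens.append(text[start:end])
    return tokens

# Toy pattern: time-like strings and punctuation runs only.
toy_protected = re.compile(r"\d+:\d+|[.?!,]+")
tokens = interleave("lunch at 12:53, ok", toy_protected)
print(tokens)
```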

## Test Run
The folder comes with tweets.txt, which contains the raw tweets; these need to be tokenized before attempting POS tagging. Once the tweets are tokenized, we store them in the file tweets_tokenized.txt so that the POS tagger can read directly from it.

In [20]:
raw_text = 'tweets.txt'

inp_file = open(raw_text)
oup_file = open("tweets_tokenized.txt", "w") 
for line in inp_file:
    print('o: ' + line.strip())
    tokenized_tweet = ' '.join(tokenizeRawTweetText(line))    
    print('t: ' + tokenized_tweet + '\n')
    oup_file.write(tokenized_tweet + '\n')
inp_file.close()
oup_file.close()

o: I won't win a single          game I bet on!! Got Mr. Cliff Lee, if he loses its on me U.S.A!
t: I won't win a single game I bet on !! Got Mr. Cliff Lee , if he loses its on me U.S.A !

o: RT @eye_e: this poster-print costs $12.40, which is 40% of the normal price! http://tl.gd/6meogh
t: RT @eye_e : this poster-print costs $12.40 , which is 40% of the normal price ! http://tl.gd/6meogh

o: I ❤ Biebs & want to hang out with him!!
t: I ❤ Biebs & want to hang out with him !!

o: @thecamion I like monkeys, but I still hate COSTCO parking lots.. oO o.O #COSTCO 2:15PM
t: @thecamion I like monkeys , but I still hate COSTCO parking lots .. oO o.O #COSTCO 2:15 PM

o: Texas Rangers are in the World Series!  Go Rangers!!!!!!!!! :> <3 ♥❤♡ http://fb.me/D2LsXBJx
t: Texas Rangers are in the World Series ! Go Rangers !!!!!!!!! : > <3 ♥❤♡ http://fb.me/D2LsXBJx

