## Tokenize Tweets in Python


I'll use the text below as my sample tweet:

```
"https://t.co/9z2J3P33Uc FB needs to hurry up and add a laugh/cry button 😬😭😓🤢🙄😱 Since eating my feelings has not fixed the world's problems, I guess I'll try to sleep... HOLY CRAP: DeVos questionnaire appears to include passages from uncited sources https://t.co/FNRoOlfw9s well played, Senator Murray Keep the pressure on: https://t.co/4hfOsmdk0l @datageneral thx Mr Taussig It's interesting how many people contact me about applying for a PhD and don't spell my name right."
```

## Install the Necessary Library

In [1]:
! pip install nltk



In [2]:
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import WordPunctTokenizer

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/fm-pc-
[nltk_data]     lt-227/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
compare_list = ['https://t.co/9z2J3P33Uc',
                'laugh/cry',
                '😬😭😓🤢🙄😱',
                "world's problems",
                "@datageneral",
                "It's interesting",
                "don't spell my name right",
                'all-nighter']

## word_tokenize

In [5]:
word_tokens = []
for sent in compare_list:
    print(word_tokenize(sent))
    word_tokens.append(word_tokenize(sent))

['https', ':', '//t.co/9z2J3P33Uc']
['laugh/cry']
['😬😭😓🤢🙄😱']
['world', "'s", 'problems']
['@', 'datageneral']
['It', "'s", 'interesting']
['do', "n't", 'spell', 'my', 'name', 'right']
['all-nighter']


## WordPunctTokenizer

In [6]:
punct_tokenizer = WordPunctTokenizer()
punct_tokens = []
for sent in compare_list:
    print(punct_tokenizer.tokenize(sent))
    punct_tokens.append(punct_tokenizer.tokenize(sent))

['https', '://', 't', '.', 'co', '/', '9z2J3P33Uc']
['laugh', '/', 'cry']
['😬😭😓🤢🙄😱']
['world', "'", 's', 'problems']
['@', 'datageneral']
['It', "'", 's', 'interesting']
['don', "'", 't', 'spell', 'my', 'name', 'right']
['all', '-', 'nighter']
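
The splits above follow from the pattern this tokenizer applies. As far as I can tell, `WordPunctTokenizer` is a regular-expression tokenizer built on `\w+|[^\w\s]+`, that is, runs of word characters or runs of other non-space characters. Below is a minimal standard-library sketch of the same idea; the helper name `word_punct_split` is made up for illustration and is not part of NLTK.

```python
import re

# Runs of word characters, or runs of characters that are neither word
# characters nor whitespace. Assumed to mirror WordPunctTokenizer's pattern.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_split(text):
    """Hypothetical helper: split text into word runs and punctuation runs."""
    return WORD_PUNCT.findall(text)


print(word_punct_split("don't spell my name right"))
print(word_punct_split("all-nighter"))
```

This is why contractions and hyphenated words are broken into three pieces: the apostrophe and the hyphen are not word characters, so they form their own tokens.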


## RegexpTokenizer (Match on the Tokens)

In [7]:
# Raw string so that \w reaches the regex engine without an invalid-escape warning
match_tokenizer = RegexpTokenizer(r"[\w']+")
match_tokens = []
for sent in compare_list:
    print(match_tokenizer.tokenize(sent))
    match_tokens.append(match_tokenizer.tokenize(sent))

['https', 't', 'co', '9z2J3P33Uc']
['laugh', 'cry']
[]
["world's", 'problems']
['datageneral']
["It's", 'interesting']
["don't", 'spell', 'my', 'name', 'right']
['all', 'nighter']
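
When the pattern describes the tokens themselves, the result is essentially the list of regex matches, so anything the pattern cannot match is silently dropped. The sketch below reproduces that behaviour with the standard `re` module only; `match_tokens_sketch` is a hypothetical name, and this is an approximation of what `RegexpTokenizer` does rather than its actual implementation.

```python
import re


def match_tokens_sketch(text, pattern=r"[\w']+"):
    """Hypothetical helper: keep only the substrings that match the pattern."""
    return re.findall(pattern, text)


# Emoji are not word characters, so nothing matches and the list is empty.
print(match_tokens_sketch("😬😭😓"))
# The '@' is not matched either, so the handle loses its prefix.
print(match_tokens_sketch("@datageneral"))
```

This explains the empty list for the emoji string and the missing `@`, `/`, `-` and URL punctuation in the output above.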


## Match on Whitespace

In [8]:
# gaps=True means the pattern describes the separators, not the tokens
space_tokenizer = RegexpTokenizer(r"\s+", gaps=True)
space_tokens = []
for sent in compare_list:
    print(space_tokenizer.tokenize(sent))
    space_tokens.append(space_tokenizer.tokenize(sent))

['https://t.co/9z2J3P33Uc']
['laugh/cry']
['😬😭😓🤢🙄😱']
["world's", 'problems']
['@datageneral']
["It's", 'interesting']
["don't", 'spell', 'my', 'name', 'right']
['all-nighter']
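
With `gaps=True` the pattern marks the separators, so the tokens are whatever lies between them. A rough standard-library equivalent, assuming empty strings are discarded, looks like this (`gap_tokens_sketch` is an illustrative name, not an NLTK function):

```python
import re


def gap_tokens_sketch(text, gap_pattern=r"\s+"):
    """Hypothetical helper: split on the separator pattern, dropping empties."""
    return [tok for tok in re.split(gap_pattern, text) if tok]


text = "  don't spell  my name right "
print(gap_tokens_sketch(text))
# For a plain whitespace pattern this agrees with str.split().
print(gap_tokens_sketch(text) == text.split())
```

Because nothing inside a token is ever inspected, URLs, handles and emoji runs survive intact, but trailing punctuation would stay glued to the neighbouring word.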


## TweetTokenizer

In [9]:
tweet_tokenizer = TweetTokenizer()
tweet_tokens = []
for sent in compare_list:
    print(tweet_tokenizer.tokenize(sent))
    tweet_tokens.append(tweet_tokenizer.tokenize(sent))

['https://t.co/9z2J3P33Uc']
['laugh', '/', 'cry']
['😬', '😭', '😓', '🤢', '🙄', '😱']
["world's", 'problems']
['@datageneral']
["It's", 'interesting']
["don't", 'spell', 'my', 'name', 'right']
['all-nighter']
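
The output above suggests a tokenizer that tries tweet-specific shapes (URLs, handles) before falling back to words and single symbols. The sketch below illustrates that ordering idea with a small hand-written pattern. It is a simplified assumption for illustration, not NLTK's actual pattern, which covers many more cases; `simple_tweet_tokenize` is a made-up name.

```python
import re

# Alternatives are tried left to right, so the specific shapes come first.
TWEET_PATTERN = re.compile(
    r"""
    https?://\S+          # URLs are kept whole
    | @\w+                # @handles
    | \#\w+               # hashtags
    | \w+(?:['-]\w+)*     # words, allowing internal apostrophes and hyphens
    | [^\w\s]             # any other single symbol (punctuation, one emoji)
    """,
    re.VERBOSE,
)


def simple_tweet_tokenize(text):
    """Hypothetical helper: a tiny tweet-aware tokenizer built on one regex."""
    return TWEET_PATTERN.findall(text)


for sample in ["https://t.co/9z2J3P33Uc", "laugh/cry", "😬😭😓", "@datageneral"]:
    print(simple_tweet_tokenize(sample))
```

Matching one symbol at a time in the last alternative is what separates consecutive emoji into individual tokens, while the earlier alternatives keep URLs and handles in one piece.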


## Put Everything Together

In [10]:
tokenizers = {'word_tokenize': word_tokens,
              'WordPunctTokenizer': punct_tokens,
              'RegexpTokenizer for matching': match_tokens,
              'RegexpTokenizer for white space': space_tokens,
              'TweetTokenizer': tweet_tokens}
df = pd.DataFrame.from_dict(tokenizers)
df

|   | word_tokenize | WordPunctTokenizer | RegexpTokenizer for matching | RegexpTokenizer for white space | TweetTokenizer |
|---|---|---|---|---|---|
| 0 | [https, :, //t.co/9z2J3P33Uc] | [https, ://, t, ., co, /, 9z2J3P33Uc] | [https, t, co, 9z2J3P33Uc] | [https://t.co/9z2J3P33Uc] | [https://t.co/9z2J3P33Uc] |
| 1 | [laugh/cry] | [laugh, /, cry] | [laugh, cry] | [laugh/cry] | [laugh, /, cry] |
| 2 | [😬😭😓🤢🙄😱] | [😬😭😓🤢🙄😱] | [] | [😬😭😓🤢🙄😱] | [😬, 😭, 😓, 🤢, 🙄, 😱] |
| 3 | [world, 's, problems] | [world, ', s, problems] | [world's, problems] | [world's, problems] | [world's, problems] |
| 4 | [@, datageneral] | [@, datageneral] | [datageneral] | [@datageneral] | [@datageneral] |
| 5 | [It, 's, interesting] | [It, ', s, interesting] | [It's, interesting] | [It's, interesting] | [It's, interesting] |
| 6 | [do, n't, spell, my, name, right] | [don, ', t, spell, my, name, right] | [don't, spell, my, name, right] | [don't, spell, my, name, right] | [don't, spell, my, name, right] |
| 7 | [all-nighter] | [all, -, nighter] | [all, nighter] | [all-nighter] | [all-nighter] |
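
One way to read the comparison is to ask which tokenizers only split the text and which ones also discard characters. The sketch below checks this for a few rows copied from the outputs above; `drops_characters` and the short labels in `samples` are names chosen here for illustration.

```python
def drops_characters(text, tokens):
    """Return True if joining the tokens does not give back the text
    (whitespace ignored), i.e. the tokenizer removed or altered characters."""
    return "".join(tokens) != "".join(text.split())


# Token lists copied from the notebook outputs above.
samples = {
    "https://t.co/9z2J3P33Uc": {
        "word_tokenize": ['https', ':', '//t.co/9z2J3P33Uc'],
        "regex match": ['https', 't', 'co', '9z2J3P33Uc'],
        "TweetTokenizer": ['https://t.co/9z2J3P33Uc'],
    },
    "@datageneral": {
        "word_tokenize": ['@', 'datageneral'],
        "regex match": ['datageneral'],
        "TweetTokenizer": ['@datageneral'],
    },
}

for text, results in samples.items():
    for name, tokens in results.items():
        status = "drops characters" if drops_characters(text, tokens) else "lossless"
        print(f"{text!r:28} {name:15} {status}")
```

On these samples only the match-based regex tokenizer loses information; the others differ in where they cut, not in what they keep.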
