# Data cleaning

Data cleaning steps:
- replace @ handles with: hdl (short for "handle"; a word with no punctuation, so no extra processing step is needed)
- replace URLs with: url
- replace emoticons with the corresponding emojis


For all of these, first check how prevalent the pattern is and whether handling it is worth the effort.
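A prevalence check can be sketched as below. This is a minimal illustration on plain Python strings rather than the DataFrame used later; the helper name `prevalence` and the three sample strings are made up for the example.

```python
import re

def prevalence(texts, pattern):
    """Return the fraction of texts that contain at least one match of `pattern`."""
    regex = re.compile(pattern)
    texts = list(texts)
    if not texts:
        return 0.0
    hits = sum(1 for text in texts if regex.search(text))
    return hits / len(texts)

# Hypothetical sample, only to show the call.
sample = ["@bob hello", "no handle here", "thanks @alice_1 and @bob"]
print(prevalence(sample, r"@[a-zA-Z0-9_]{1,15}"))
```

A pattern that touches only a tiny fraction of tweets is probably not worth a dedicated cleaning step.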

Ideas:
- split hashtags into words
    - Some people separate words in hashtags with capitals, which makes them much easier to split.
- replace contractions
- spellchecking
- what do quoted messages mean? Is it someone quoting somebody else? Should we treat this as text or remove it?
- some smileys are still written with punctuation rather than Unicode, e.g. :-P, :-)
- tweets not in English
- character n-grams will probably work better, given the very low quality of the writing
- Retweets have two formats:
    - either they end with RT &lt;content of retweet>
    - or "&lt;content of retweet>" &lt;content of tweet>
    - they can also have multiple levels of embedding, with “ for the second level. E.g.:
"@letwerkaaaaa: “@Palmira_0: HAHA WHEN I WAS LITTLE I WAS FAT AS FUCK:joy:” Same I was so fat that they thought my vagina was a dick" LMFAO
    - For now we will leave them in. During training we might have to consider whether a piece of text is retweeted content or original content. This actually affects the single-response rate, i.e. when people just add an emoji to a tweet (the emoji being the sole content).
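The hashtag-splitting idea could look something like the sketch below. It only handles hashtags that use capitals as word boundaries; all-lowercase hashtags are returned as a single word. The function name `split_hashtag` is an assumption for illustration, not something defined elsewhere in this notebook.

```python
import re

# Word boundaries: lower-case letter followed by a capital ("tweetYour"),
# or the end of a run of capitals followed by a capitalised word ("GTAPhone").
_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

def split_hashtag(tag):
    """Split a hashtag into words at capital-letter boundaries."""
    body = tag.lstrip("#")
    # Insert a space at every boundary, then split on whitespace.
    return _BOUNDARY.sub(" ", body).split()

print(split_hashtag("#TweetYourWeakness"))
print(split_hashtag("#augustuswaterswannabe"))
```

Hashtags without capitals would need a dictionary-based segmentation, which is a separate problem.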

In [57]:
import json
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd
pd.options.display.max_colwidth = 140
import nltk
import re

In [58]:
with open('./data/tweets_1M.json','r') as f:
    tweets_df = pd.DataFrame(json.load(f))
clean_tweets_df = pd.DataFrame(tweets_df["text"])
# del tweets_df

### Replace @ handles with hdl

The token "hdl" does not otherwise appear in the text, so it is a safe replacement name.

It is best to replace handles with " hdl " (padded with a space on each side), so that tokenization can more easily identify it as a separate word.
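To illustrate why the padding matters, here is a small sketch on a plain string, using whitespace tokenization as a stand-in for whatever tokenizer is used later. The helper `replace_handles` and the example string are hypothetical.

```python
import re

HANDLE = re.compile(r"@[a-zA-Z0-9_]{1,15}")

def replace_handles(text, token="hdl"):
    """Replace every handle with the token, padded with spaces."""
    return HANDLE.sub(" {} ".format(token), text)

# With padding the token is isolated, without it the token fuses with its neighbours.
padded = replace_handles("thanks@bob!").split()
unpadded = HANDLE.sub("hdl", "thanks@bob!").split()
print(padded)
print(unpadded)
```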

In [59]:
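pattern = r"@[a-zA-Z0-9_]{1,15}" #from http://kagan.mactane.org/blog/2009/09/22/what-characters-are-allowed-in-twitter-usernames/
# Note: this counts tweets containing at least one handle, not individual handles.
print("{} handles replaced".format(np.sum(clean_tweets_df.text.str.contains(pattern).values)))
# regex=True is needed in pandas >= 2.0, where str.replace defaults to literal matching.
clean_tweets_df.text = clean_tweets_df.text.str.replace(pattern, " hdl ", regex=True)
clean_tweets_df.sample(10)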

367031 handles replaced


Unnamed: 0,text
10757,work in 6 hours god daym
396412,"Wow ! You just liked them physically , and that's just wrong"
511881,I want to read TFIOS again. #augustuswaterswannabe
837173,​ hdl I wish I was far away from Congress as you are.
920737,I've been wanting it for a while now. 😏
652375,"hdl he said ""I'm drinking"" lol"
156578,hdl its good tell them Niggas to see me in 7on7
50306,When I hear babydoll I think of Danny... So it's weird if someone else says it. Haha
80700,The crowd sang happy birthday 👍 thanks guys!
694633,What is social life?


### Replace URLs with url

URLs are quite well formed and generally appear at the end of tweets, so there is no risk of the cleaning swallowing text that follows the URL.

The keyword "url" is used only 4 times in the dataset, so there is no risk of confusion.

In [60]:
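pattern = r"http://\S+"
# Note: this counts tweets containing at least one URL, not individual URLs.
print("{} urls replaced".format(np.sum(clean_tweets_df.text.str.contains(pattern).values)))
# regex=True is needed in pandas >= 2.0, where str.replace defaults to literal matching.
clean_tweets_df.text = clean_tweets_df.text.str.replace(pattern, " url ", regex=True)
clean_tweets_df.sample(10)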

227446 urls replaced
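The pattern used in this section only matches `http://` links. Whether the dataset contains `https://` links was not checked here; if it does, they would be left in place, and their `:/` could later be picked up by the `:-*[/\\]` emoticon pattern. A hedged variant covering both schemes, shown on plain strings with a hypothetical helper `replace_urls`:

```python
import re

# "s?" makes the pattern accept both http:// and https://
URL = re.compile(r"https?://\S+")

def replace_urls(text, token="url"):
    """Replace every http(s) URL with the token, padded with spaces."""
    return URL.sub(" {} ".format(token), text)

print(replace_urls("see https://t.co/abc now"))
print(replace_urls("see http://t.co/abc"))
```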


Unnamed: 0,text
761455,"hdl Stomping Grounds with Foxtails Brigade and Feed Me Jack at The Crepe Place (October 4, 2013):... url"
576156,Take me back to the other night &lt;3
214573,“ hdl : #TweetYourWeakness Nice tips”
509600,I can't deal with Whiny babies 😭🙅
489798,I'm still thinking about that poodle from yesterday
549081,#aceandjig #thisjustin #newnewnew @ mira mira url
810982,Ain't nothing like riding on the bridge smoking on some good weed #1700block @ San Francisco-Oakland… url
674863,“ hdl : hdl oh princess court”💁
287468,I'm tweeting this from the GTA V phone👌😏👌
399297,"💕 ""LOVERS IN THE PARKING LOT"" video 💕 @ The Westin St. Francis San Francisco on Union Square url"


### Convert emoticons to emojis

In [61]:
# Based on:
# https://slack.zendesk.com/hc/en-us/articles/202931348-Emoji-and-emoticons
# http://unicodey.com/emoji-data/table.htm
# http://www.unicode.org/emoji/charts/emoji-list.html

emoticon2emoji = {
    r"<3": "\u2764",
    r"</3": "\U0001F494",
    r"8\)": "\U0001F60E",
    r"D:": "\U0001F627",
    r":'\(": "\U0001F622",
    r":o\)": "\U0001F435",
    r":-*\*": "\U0001F48B",
    r"=-*\)": "\U0001F600",
    r":-*D": "\U0001F600",
    r";-*\)": "\U0001F609",
    r":-*>": "\U0001F606",
    r":-*\|": "\U0001F610",
    r":-*[Oo]": "\U0001F62E",
    r">:-*\(": "\U0001F620",
    r":-*\)|\(:": "\U0001F603",
    r":-*\(|\):": "\U0001F61E",
    r":-*[/\\]": "\U0001F615",
    r":-*[PpbB]": "\U0001F61B",
    r";-*[PpbB]": "\U0001F61C"
}

In [62]:
for emoticon in emoticon2emoji:
    print("{:10} -> {:>5}".format(emoticon, emoticon2emoji[emoticon]))

:-*\*      ->     💋
:-*D       ->     😀
</3        ->     💔
:-*\(|\):  ->     😞
:'\(       ->     😢
;-*[PpbB]  ->     😜
:-*[PpbB]  ->     😛
:-*\|      ->     😐
:-*[Oo]    ->     😮
D:         ->     😧
>:-*\(     ->     😠
:-*[/\\]   ->     😕
8\)        ->     😎
:-*>       ->     😆
:-*\)|\(:  ->     😃
;-*\)      ->     😉
=-*\)      ->     😀
<3         ->     ❤
:o\)       ->     🐵


In [63]:
total = 0
for emoticon in emoticon2emoji:
    # Counts tweets containing the emoticon, not individual occurrences.
    nreplacements = np.sum(clean_tweets_df.text.str.contains(emoticon).values)
    total += nreplacements
    print("{:3} replaced {:6} times".format(emoticon2emoji[emoticon], nreplacements))
    # The keys are regular expressions, so regex=True is needed in pandas >= 2.0.
    clean_tweets_df.text = clean_tweets_df.text.str.replace(emoticon, emoticon2emoji[emoticon], regex=True)
print("{:3} replaced {} times".format("ALL", total))

💋   replaced    287 times
😀   replaced   1533 times
💔   replaced      0 times
😞   replaced   5763 times
😢   replaced    341 times
😜   replaced    329 times
😛   replaced   1167 times
😐   replaced     92 times
😮   replaced    364 times
😧   replaced    205 times
😠   replaced      0 times
😕   replaced   7436 times
😎   replaced    109 times
😆   replaced      0 times
😃   replaced  15655 times
😉   replaced   3357 times
😀   replaced    209 times
❤   replaced      0 times
🐵   replaced      0 times
ALL replaced 36847 times
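Several patterns above report 0 replacements. Two possible causes, neither verified against the full dataset here:

- The sampled tweets contain `&lt;3`, which suggests the text is HTML-escaped, so patterns containing `<` or `>` cannot match.
- The loop ran in a different order than the dict literal (plain dict order is only guaranteed from Python 3.7 on). A general pattern such as `:-*[Oo]` then consumes `:o` before the more specific `:o\)` is tried.

The sketch below unescapes the text first and applies the patterns in an explicit order. It uses only a subset of the mapping, and the names `ordered_emoticons` and `convert_emoticons` are hypothetical.

```python
import html
import re
from collections import OrderedDict

# Subset of the mapping, more specific patterns first (order is an assumption).
ordered_emoticons = OrderedDict([
    (r"</3", "\U0001F494"),
    (r"<3", "\u2764"),
    (r">:-*\(", "\U0001F620"),
    (r":o\)", "\U0001F435"),
    (r":-*\(", "\U0001F61E"),
    (r":-*[Oo]", "\U0001F62E"),
])

def convert_emoticons(text, mapping=ordered_emoticons):
    """Unescape HTML entities, then replace emoticons in mapping order."""
    text = html.unescape(text)  # e.g. "&lt;3" becomes "<3"
    for pattern, emoji in mapping.items():
        text = re.sub(pattern, emoji, text)
    return text

print(convert_emoticons("Take me back to the other night &lt;3"))
```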


### Explore retweets

In [70]:
pattern = r"""(?:\W|^)RT(?:[ \":“]|$)| # Retweets with RT keyword
            [\"“]\s*hdl                # Retweets with quotes"""
temp = clean_tweets_df.text.str.contains(pattern, flags=re.X)
print("There are {} retweets".format(np.sum(temp.values)))
clean_tweets_df[temp].sample(10)

There are 50202 retweets


Unnamed: 0,text
197770,“ hdl : hdl hdl like I said immature” you act like I care though? 💁 Id rather be immature than basic like you
387539,"Baaaaby! 😢 RT "" hdl : Yes, I can't stop thinking about you."""
978011,Can't wait 😍“ hdl : fishing tomorrow with hdl 🎣☀️”
169658,“ hdl : OMFG WHEN YOU CLICK ON IT... 😳 url
146390,Long day bro “ hdl : hdl me too”
210819,“ hdl : I wish you guys could smell how appetizing my weed smells. Girl Scout cookies is my favorite.”animal cookies 🍪
143131,"😂😂😂 RT hdl : My girlfriend got have some type of booty, I can't mess with that hank hill ass."
530529,“ hdl : TWEET PA GUYS!!! Hihihi. 😀)) #G2BWhatDoWeMeanToEachOther -jemina”
72077,“ hdl : I feel so alone on a Friday night🎶”
838312,Wtf😂😂😂 im not asian and i dont have a 6 pack“ hdl : dayum!!! hdl is that u?! (O.o) lol url


### Twitter hashtags