# Data cleaning

Data cleaning steps:
- replace @ handles with: hdl (short for handle; a word with no punctuation, so no extra processing step is needed)
- replace URLs with: url
- replace emoticons with the corresponding emojis


For each of these, first check how prevalent the issue is in the data and whether handling it is worth the effort.
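One way to run such a check is to measure the share of tweets that match a candidate pattern. The following is a minimal sketch on plain Python strings; the `prevalence` helper and the small sample list are illustrative assumptions, not part of the notebook.

```python
import re

def prevalence(texts, pattern):
    """Return the fraction of texts containing at least one match of pattern."""
    if not texts:
        return 0.0
    regex = re.compile(pattern)
    hits = sum(1 for text in texts if regex.search(text))
    return hits / len(texts)

# Hypothetical mini-sample; in the notebook this would be the tweet text column
sample = [
    "I just saw it :/",
    "Game time #FightOn",
    "Closing apps in the new iOS is super neat!",
    "All the move in signs are up... #ucsc #bananaslugs",
]
hashtag_share = prevalence(sample, r"#\w+")
print("{:.0%} of sample tweets contain a hashtag".format(hashtag_share))
```

A low share suggests the corresponding cleaning step can probably be skipped.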

Ideas:
- split hashtags into words
    - Some people separate the words in hashtags with capitals, which makes them much easier to split.
- replace contractions
- spellchecking
- what do quoted messages mean? Is someone quoting somebody else? Should we treat this as text or remove it?
- some smileys are still written as punctuation rather than Unicode, e.g. :-P, :-)
- Tweets not in English
- Character n-grams will probably work better, given the very low quality of the writing
- Retweets have two formats:
    - Either end with RT &lt;content of retweet>
    - Or "&lt;content of retweet>" &lt;content of tweet>
    - Can also have multiple levels of embedding, with “ for the second level, e.g.:
"@letwerkaaaaa: “@Palmira_0: HAHA WHEN I WAS LITTLE I WAS FAT AS FUCK:joy:” Same I was so fat that they thought my vagina was a dick" LMFAO
    - For now we will leave them in. During training we might have to consider whether a tweet is retweeted content or original content. This actually impacts the single-response rate, i.e. the cases where people just add an emoji to a tweet (the emoji being the sole content).
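The idea of splitting hashtags on capitals could be sketched as follows. This is only an illustration, assuming hashtags made of plain ASCII letters and digits; `split_hashtag` is a hypothetical helper, and all-lowercase hashtags come back unsplit.

```python
import re

# Matches, in order of preference: an uppercase run not followed by a lowercase
# letter (acronym), an optionally capitalised lowercase word, or a run of digits.
WORD_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")

def split_hashtag(tag):
    """Split a hashtag such as '#FightOn' into its capitalised words."""
    return WORD_RE.findall(tag.lstrip("#"))

for tag in ["#CapitolHill", "#FightOn", "#ucsc"]:
    print(tag, "->", split_hashtag(tag))
```

Hashtags without capitals, such as #bananaslugs, would need a dictionary-based approach instead, which this sketch does not attempt.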

In [1]:
import json
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd
pd.options.display.max_colwidth = 140
import nltk
import re

In [15]:
with open('./data/tweets_training.json','r') as f:
    tweets_df = pd.DataFrame(json.load(f))
clean_tweets_df = pd.DataFrame(tweets_df["text"])
# del tweets_df

### Replace @ handles with hdl

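The text contains no existing "hdl" words that could cause confusion, so it is a good replacement keyword.

It is best to replace handles with " hdl " (padded with a space on each side), so that the keyword is easier to identify as a separate word during tokenization.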
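A small illustration of why the padding helps, assuming a simple whitespace tokenizer (the tokenizer actually used later in the notebook may behave differently) and a made-up example tweet:

```python
import re

HANDLE_RE = re.compile(r"@[a-zA-Z0-9_]{1,15}")

tweet = "thanks @some_user!"  # hypothetical tweet, not taken from the dataset

# Without padding, the keyword stays glued to the neighbouring punctuation
unpadded = HANDLE_RE.sub("hdl", tweet).split()
# With padding, the keyword becomes a token of its own
padded = HANDLE_RE.sub(" hdl ", tweet).split()

print(unpadded)
print(padded)
```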

In [16]:
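def cleanHandle(df):
    """ Replace in-place handles with hdl keyword
    
    Returns None
    """
    pattern = r"@[a-zA-Z0-9_]{1,15}" #from http://kagan.mactane.org/blog/2009/09/22/what-characters-are-allowed-in-twitter-usernames/
    # Note: this counts tweets containing at least one handle, not individual handles
    print("{} handles replaced".format(np.sum(df.text.str.contains(pattern).values)))
    # regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
    df.text = df.text.str.replace(pattern, " hdl ", regex=True)
    return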

cleanHandle(clean_tweets_df)
clean_tweets_df.sample(10)

293672 handles replaced


Unnamed: 0,text
478549,"Nice use of the round about haha ""Eyewitness video of the #CapitolHill car chase that ended in gunfire: http://t.co/LbNO5PAeo1"""
215754,Oh my shit hdl ... Goodbye Goodbye video is... Omg. Unf. Perf. Isisksnsjosidka
30306,I just saw it :/
342311,Closing apps in the new iOS is super neat!
358082,"I'm never gonna change , I'll always be right there 💛"
341272,All the move in signs are up... I guess it's that time of year again. #ucsc hdl #bananaslugs
167839,Game time #FightOn
452020,hdl thankyou
287165,Attending the service experience conference from hdl (@ Omni San Francisco Hotel - hdl ) [pic]: http://t.co/fIR1FIEHM6
123900,RIDE OR DIE #N.E. #Noevidence #Bear-Chi #DameSmash #E-Time #Cheddar #Hollywood #2-TAll #G-Money #RY


### Replace URLs with url

URLs are quite well formed and generally appear at the end of tweets, so there is little risk of the pattern swallowing text that follows the URL.

The keyword "url" is used only 4 times in the dataset, so there is little risk of confusion.
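As an illustration of the replacement on plain strings, here is a minimal sketch. It assumes we would also want to catch https links, which is an extension not taken from the notebook; `replace_urls` is a hypothetical helper.

```python
import re

# \S+ stops at the first whitespace, so text after the URL is preserved
URL_RE = re.compile(r"https?://\S+")

def replace_urls(text):
    """Replace every http(s) URL in text with the padded keyword ' url '."""
    return URL_RE.sub(" url ", text)

print(replace_urls("Eyewitness video: http://t.co/LbNO5PAeo1 wow"))
```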

In [17]:
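def cleanURL(df):
    """ Replace in-place URLs with url keyword
    
    Returns None
    """
    # Only http:// links are matched; https:// links are left untouched
    pattern = r"http://\S+"
    # Count on the df argument (not the global clean_tweets_df); this counts tweets, not URLs
    print("{} urls replaced".format(np.sum(df.text.str.contains(pattern).values)))
    # regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
    df.text = df.text.str.replace(pattern, " url ", regex=True)
    return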

cleanURL(clean_tweets_df)
clean_tweets_df.sample(10)

182254 urls replaced


Unnamed: 0,text
294459,I'm hella bored at home
653921,"It was nice to share today's Tweets now its time to lay down and sleep. May angels sing you all to rest,good night Tweetie pies"
489917,Sparkly rainbow fiber optics @ Computer History Museum url
689649,Dayger ready \n#JelloShots url
770700,😂😂 Odaly and Kaylind are hilarious
453521,hdl wat lol
786294,"This is probably the 4th time this week she went to work late. Like, her work etiquette makes me mad."
687653,hdl Jons # is 510 277 2352
388198,hdl we could eat BK in the shower.
43002,hdl you saw that too?😨


### Convert emoticons to emojis

In [22]:
# Based on:
# https://slack.zendesk.com/hc/en-us/articles/202931348-Emoji-and-emoticons
# http://unicodey.com/emoji-data/table.htm
# http://www.unicode.org/emoji/charts/emoji-list.html

def convertEmoticon(df):
    """ Replace in-place common emoticons to emojis.
    
    Returns None
    """
    emoticon2emoji = {
        r"<3": "\u2764",
        r"</3": "\U0001F494",
        r"8\)": "\U0001F60E",
        r"D:": "\U0001F627",
        r":'\(": "\U0001F622",
        r":o\)": "\U0001F435",
        r":-*\*": "\U0001F48B",
        r"=-*\)": "\U0001F600",
        r":-*D": "\U0001F600",
        r";-*\)": "\U0001F609",
        r":-*>": "\U0001F606",
        r":-*\|": "\U0001F610",
        r":-*[Oo]": "\U0001F62E",
        r">:-*\(": "\U0001F620",
        r":-*\)|\(:": "\U0001F603",
        r":-*\(|\):": "\U0001F61E",
        r":-*[/\\]": "\U0001F615",
        r":-*[PpbB]": "\U0001F61B",
        r";-*[PpbB]": "\U0001F61C"
    }
    
#     for emoticon in emoticon2emoji:
#         print("{:10} -> {:>5}".format(emoticon, emoticon2emoji[emoticon]))
    
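    total = 0
    # Patterns are applied in dict order (insertion order on Python 3.7+), so more
    # specific patterns such as >:-*\( must come before the ones they contain.
    for emoticon, emoji in emoticon2emoji.items():
        # Counts tweets containing the emoticon, not individual occurrences
        nreplacements = np.sum(df.text.str.contains(emoticon).values)
        total += nreplacements
        print("{:10} -> {:>5} replaced {:6} times".format(emoticon, emoji, nreplacements))
        # regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
        df.text = df.text.str.replace(emoticon, emoji, regex=True)
    print("{:3} replaced {} times".format("ALL", total))
    return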

convertEmoticon(clean_tweets_df)

:-*\)|\(:  ->     😃 replaced      0 times
:-*[Oo]    ->     😮 replaced      0 times
:-*\(|\):  ->     😞 replaced      0 times
:o\)       ->     🐵 replaced      0 times
:-*[PpbB]  ->     😛 replaced      0 times
;-*\)      ->     😉 replaced      0 times
:-*\*      ->     💋 replaced      0 times
:'\(       ->     😢 replaced      0 times
:-*D       ->     😀 replaced      0 times
=-*\)      ->     😀 replaced      0 times
8\)        ->     😎 replaced      0 times
;-*[PpbB]  ->     😜 replaced      0 times
</3        ->     💔 replaced      0 times
:-*>       ->     😆 replaced      0 times
D:         ->     😧 replaced      0 times
<3         ->     ❤ replaced      0 times
:-*\|      ->     😐 replaced      0 times
>:-*\(     ->     😠 replaced      0 times
:-*[/\\]   ->     😕 replaced      0 times
ALL replaced 0 times
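Because the patterns are applied one after another, their order matters: a general pattern such as `:-*\(` applied first would consume part of `>:(`. The sketch below shows this on a plain string. It uses an ordered list of pairs rather than the dict above (an assumption of this sketch), and `convert` is a hypothetical helper.

```python
import re

# Most specific pattern first
SPECIFIC_FIRST = [
    (r">:-*\(", "\U0001F620"),  # angry face
    (r":-*\(", "\U0001F61E"),   # disappointed face
]

def convert(text, rules):
    """Apply each (pattern, emoji) rule in sequence."""
    for pattern, emoji in rules:
        text = re.sub(pattern, emoji, text)
    return text

good = convert(">:( and :(", SPECIFIC_FIRST)
# With the order reversed, the general rule fires first and leaves a stray ">"
bad = convert(">:( and :(", list(reversed(SPECIFIC_FIRST)))
print(good)
print(bad)
```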


### Explore retweets

In [10]:
pattern = r"""(?:\W|^)RT(?:[ \":“]|$)| # Retweets with RT keyword
            [\"“]\s*hdl                # Retweets with quotes"""
temp = clean_tweets_df.text.str.contains(pattern, flags=re.X)
print("There are {} retweets".format(np.sum(temp.values)))
clean_tweets_df[temp].sample(10)

There are 40126 retweets


Unnamed: 0,text
518928,😍“ hdl : PERFECT!!! url
711525,RT hdl : Crochet Storage Basket Pattern: Free and Easy url via hdl
273890,"LIFE RT hdl : Gator boots👢 with a pimped out Gucci suit👔 ain't got no jawb💼, but I stay sharp🔪 #StillFly"
486359,""" hdl : “ hdl : CONCORD!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!” Yeeeeeee"" is full of RAIDER FANS, CUHZ THEY KNO WSSUP!! OAKLAND!!!!! Cuhah"
205832,“ hdl : I love how LAURA MARAVILLA can tweet but not reply” 👏👏👏👏👏👏👏👏
97106,Lmao at Melos RT rn
606491,“ hdl : I fainted upon entry to the new pink store.” It's time for an intervention
509515,!! 💃RT hdl The amount of time I've spent discussing the Phillip Lim for Target collection with hdl is sick.
468119,“ hdl : I was welcomed home by the best!💜 ΑΦ #latepost #lovemynewhome url you're the cutest 😘 AOE
727469,""" hdl : #InitialsOfSomeoneSpecial .. url \nSO Fucking true amen"
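To tell retweeted content apart from the user's own comment, the two formats could be split roughly as follows. This is a minimal sketch, not the notebook's method: `split_retweet` is a hypothetical helper, it assumes handles have already been replaced with hdl, and it only handles the quoted format when the quote opens the tweet and has a closing quote.

```python
import re

# Format 1: <own comment> RT <retweeted content>
RT_KEYWORD = re.compile(r'(?:^|\W)(RT)(?:[ ":“]|$)')
# Format 2: "<retweeted content>" <own comment>; greedy, so nested quotes stay together
QUOTED = re.compile(r'^\s*["“](\s*hdl\b.*)["”](.*)$', flags=re.S)

def split_retweet(text):
    """Return (own_comment, quoted_content); quoted_content is None if no format matches."""
    match = RT_KEYWORD.search(text)
    if match:
        return text[:match.start(1)].strip(), text[match.end(1):].strip()
    match = QUOTED.match(text)
    if match:
        return match.group(2).strip(), match.group(1).strip()
    return text.strip(), None

print(split_retweet("LIFE RT hdl : Gator boots"))
print(split_retweet("“ hdl : I fainted.” It's time for an intervention"))
```

An empty or emoji-only own comment would then correspond to the single-response case mentioned earlier. Note that the keyword rule also fires on tweets such as "Lmao at Melos RT rn", where RT is not followed by retweeted content, so the split should be treated as a heuristic.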


### Twitter hashtags

# Functions to extract only emoji or only text from input

In [1]:
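# Note: tok (a tokenizer exposing .tokenize) and is_emoji (apparently returning 1 for
# emoji tokens and 0 otherwise) are not defined in this section; they must be defined elsewhere.
def emojiExtract(sent):
    """Return the emoji tokens of sent."""
    return [word for word in tok.tokenize(sent) if is_emoji(word) == 1]

def textExtract(sent):
    """Return the non-emoji tokens of sent."""
    return [word for word in tok.tokenize(sent) if is_emoji(word) == 0]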

## Functions to Create Two New Columns [Text Only] & [Emoji Only]

In [2]:
def textEmojiOnly(df):
    """ Add in-place the 'Emoji' and 'only_Text' columns, computed from df.text """
    df['Emoji'] = [emojiExtract(sent) for sent in df.text]
    df['only_Text'] = [textExtract(sent) for sent in df.text]
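Since `tok` and `is_emoji` are not shown here, the following self-contained sketch uses hypothetical stand-ins to illustrate the same split. The Unicode ranges are an assumption that covers only common emoji blocks and ignores multi-code-point sequences (skin tones, ZWJ sequences); the names ending in `_sketch` are not part of the notebook.

```python
import re

# Assumed emoji ranges: main pictograph blocks plus miscellaneous symbols and dingbats
EMOJI_RANGES = "\U0001F300-\U0001FAFF\u2600-\u27BF"
EMOJI_RE = re.compile("[" + EMOJI_RANGES + "]")
# A token is either a single emoji or a run of non-space, non-emoji characters
TOKEN_RE = re.compile("[" + EMOJI_RANGES + "]|[^\\s" + EMOJI_RANGES + "]+")

def is_emoji_sketch(token):
    """Return 1 if token is a single emoji from the assumed ranges, else 0."""
    return 1 if EMOJI_RE.fullmatch(token) else 0

def emoji_extract_sketch(sent):
    return [tok for tok in TOKEN_RE.findall(sent) if is_emoji_sketch(tok) == 1]

def text_extract_sketch(sent):
    return [tok for tok in TOKEN_RE.findall(sent) if is_emoji_sketch(tok) == 0]

tweet = "hdl you saw that too?\U0001F628"
print(emoji_extract_sketch(tweet))
print(text_extract_sketch(tweet))
```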