# Text Pre-processing for tweets

<b>Below I will develop and demonstrate the functions I have created to preprocess the text in my tweets </b>

<i>Importing and installing dependencies </i>

In [1]:
!pip install wordsegment

#Wheel file below is in path
!pip install demoji-0.1.5-py3-none-any.whl



In [2]:
import wordsegment as ws
import demoji
from wordsegment import load, segment
import pandas as pd
import numpy as np
import html
import re
import json
import string
import nltk

<b> Loading in data </b>

In [3]:
train = pd.read_csv('Raw_Data/hateval2019/hateval2019_en_train.csv', sep=',',  index_col = False, encoding = 'utf-8')
train.rename(columns={'text': 'tweet', 'HS': 'label'}, inplace=True)
pd.set_option('display.max_colwidth', None)  # None = no truncation (-1 is deprecated in newer pandas)
print("Out of {} tweets in this database, {} are not hate, {} are hate".format(len(train.index), 
                                                      len(train[train['label']==0]),
                                                      len(train[train['label']==1])))
train.head(10)

Out of 9000 tweets in this database, 5217 are not hate, 3783 are hate


Unnamed: 0,id,tweet,label,TR,AG
0,201,"Hurray, saving us $$$ in so many ways @potus @realDonaldTrump #LockThemUp #BuildTheWall #EndDACA #BoycottNFL #BoycottNike",1,0,0
1,202,"Why would young fighting age men be the vast majority of the ones escaping a war &amp; not those who cannot fight like women, children, and the elderly?It's because the majority of the refugees are not actually refugees they are economic migrants trying to get into Europe.... https://t.co/Ks0SHbtYqn",1,0,0
2,203,"@KamalaHarris Illegals Dump their Kids at the border like Road Kill and Refuse to Unite! They Hope they get Amnesty, Free Education and Welfare Illegal #FamilesBelongTogether in their Country not on the Taxpayer Dime Its a SCAM #NoDACA #NoAmnesty #SendThe",1,0,0
3,204,NY Times: 'Nearly All White' States Pose 'an Array of Problems' for Immigrants https://t.co/ACZKLhdMV9 https://t.co/CJAlSXCzR6,0,0,0
4,205,"Orban in Brussels: European leaders are ignoring the will of the people, they do not want migrants https://t.co/NeYFyqvYlX",0,0,0
5,206,@KurtSchlichter LEGAL is. Not illegal. #BuildThatWall,1,0,0
6,207,"@RitaPanahi @826Maureen @RealCandaceO Antifa are just a pack of druggie misfits that no one loves, being the violent thugs they are is their cry for attention and their hit of self importance.#JuvenileDelinquents",0,0,0
7,208,Ex-Teacher Pleads Not guilty To Rape Charges https://t.co/D2mGu3VT5G,0,0,0
8,209,still places on our Bengali (Sylheti) class! it's London's 2nd language! know anyone interested @SBSisters @refugeecouncil @DocsNotCops https://t.co/sOx6shjvMx,0,0,0
9,210,DFID Africa Regional Profile: July 2018 https://t.co/npfZCriW0w,0,0,0


## Basic tweet text pre-processing function

In [4]:
def preprocess(text_string):
    """
    Accepts a text string and:
    1) Removes URLs
    2) Removes the retweet marker (e.g. "RT @user") if present - knowing the
       tweet is a retweet does not help towards our classification task
    3) Removes mentions
    4) Replaces &amp; with and
    5) Removes non-ASCII characters and decimal numeric HTML entities
    6) Uses html.unescape() to convert remaining HTML entities to text
    7) Replaces runs of whitespace with a single space
    8) Lowercases and strips the result
    """
    space_pattern = r'\s+'
    giant_url_regex = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[#$-_@.&+]|'
        r'[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    mention_regex = r'@[\w\-]+:'
    mention_regex1 = r'@[\w\-]+'
    RT_regex = r'(RT|rt)[ ]*@[ ]*[\S]+'
    
    # Remove URLs, plus any literal 'URL' placeholder already in the text
    parsed_text = re.sub(giant_url_regex, '', text_string)
    parsed_text = re.sub('URL', '', parsed_text)
    
    # Remove the fact the tweet is a retweet. 
    # (we're only interested in the language of the tweet here)
    parsed_text = re.sub(RT_regex, ' ', parsed_text) 
    
    # Remove mentions followed by a colon first - this form seems to come up often...
    parsed_text = re.sub(mention_regex, '',  parsed_text)
    # ...then the remaining plain mentions, as they're redundant information
    parsed_text = re.sub(mention_regex1, '',  parsed_text)

    #Replace &amp; with and
    parsed_text = re.sub('&amp;', 'and', parsed_text)

    # Remove non-ASCII characters and decimal numeric HTML entities (e.g. &#128514;)
    parsed_text = re.sub(r'[^\x00-\x7F]','', parsed_text)
    parsed_text = re.sub(r'&#[0-9]+;', '', parsed_text)

    # Convert any remaining HTML entities (e.g. &gt;) to text
    parsed_text = html.unescape(parsed_text)

    # Collapse runs of whitespace into a single space
    parsed_text = re.sub(space_pattern, ' ', parsed_text)
    
    # Set text to lowercase and strip
    parsed_text = parsed_text.lower()
    parsed_text = parsed_text.strip()
    
    return parsed_text
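
The order of the entity-handling steps matters. A minimal sketch, using only the standard library and a made-up example string, of why `&amp;` is replaced before `html.unescape()` is called:

```python
import html
import re

sample = "women &amp; children &gt; 100"  # made-up string for illustration

# Order used in preprocess(): replace &amp; first, then unescape the rest.
replaced_first = html.unescape(re.sub("&amp;", "and", sample))

# Reversed order: unescape turns &amp; into a bare "&", so the
# substitution no longer finds anything to replace.
unescaped_first = re.sub("&amp;", "and", html.unescape(sample))

print(replaced_first)   # women and children > 100
print(unescaped_first)  # women & children > 100
```

With the reversed order, a stray `&` would reach the tokenizer instead of the word "and".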

### Demonstrating basic tweet text preprocessing

On Mentions and urls:

Removes <i>@CanBorder @rcmpgrcpolice </i> as it's redundant information

It also removes the URL at the end: it is not only redundant information, but would be harmful if our tokenizer were to parse it, as it would produce many useless tokens.

In [5]:
testtweet = train['tweet'][1234]

print("Original:\n", testtweet)
print("\nPreprocessed:\n", preprocess(testtweet))

Original:
 @CanBorder @rcmpgrcpolice And meanwhile as they bicker our lettuce strainer of a border sees well dressed and outfitted "refugees" (code for country shoppers), other illegals and undocumented crossing the border daily... https://t.co/ghVM5IZviU

Preprocessed:
 and meanwhile as they bicker our lettuce strainer of a border sees well dressed and outfitted "refugees" (code for country shoppers), other illegals and undocumented crossing the border daily...


Removal of non-ASCII characters is demonstrated below: <i>‘˜big ideas’</i> becomes plain <i>big ideas</i>

In [6]:
testtweet = train['tweet'][310]

print("Original:\n", testtweet)
print("\nPreprocessed:\n", preprocess(testtweet))

Original:
 Why we need to protect refugees from the ‘˜big ideas’ designed to save them https://t.co/nvvpIGyr2f @Refugees @RCKDirector @UNHCR_Kenya @NRC_HoA @drchorn_africaY @tyrusmaina @AmnestyKenya

Preprocessed:
 why we need to protect refugees from the big ideas designed to save them


<b>Removing non-ASCII characters also removes emojis, though, which could be vital in understanding the context of the tweet </b>

In [7]:
testtweet = train['tweet'][317]

print("Original:\n", testtweet)
print("\nPreprocessed:\n", preprocess(testtweet))

Original:
 Chi-town is a 💩⚫ https://t.co/X2QMidmAH9

Preprocessed:
 chi-town is a


## Emoji Translation

If this function is used in the text preprocessing pipeline, it must be used before the basic tweet pre-processing demonstrated above. The function above strips every non-ASCII character, which in turn wipes out all emojis.
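
A minimal sketch of why the ordering matters. `strip_non_ascii` reuses the same character class as `preprocess`, and the `str.replace` call is only a stand-in for the emoji translation step, so no third-party package is needed to run it:

```python
import re

NON_ASCII = re.compile(r"[^\x00-\x7F]")

def strip_non_ascii(text):
    """Same character class that preprocess() uses to drop non-ASCII text."""
    return NON_ASCII.sub("", text)

tweet = "Chi-town is a \U0001F4A9"  # ends with the pile-of-poo emoji
# Stand-in for the emoji translation step (an assumption for this sketch).
translated = tweet.replace("\U0001F4A9", " pile of poo")

print(repr(strip_non_ascii(tweet)))       # 'Chi-town is a '
print(repr(strip_non_ascii(translated)))  # 'Chi-town is a  pile of poo'
```

Once the emoji has been translated into ASCII words, the non-ASCII filter leaves it alone.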

In [8]:
demoji.download_codes()  # Download most recent emoji codes

def emojiReplace(text_string):
    emoji_dict = demoji.findall(text_string)
    for emoji in emoji_dict.keys():
        text_string = text_string.replace(emoji, ' '+  emoji_dict[emoji])
    return text_string



Downloading emoji data ...
... OK (Got response in 0.67 seconds)
Writing emoji data to C:\Users\fionn\.demoji/codes.json ...
... OK


### Demonstrating emojiReplace

In the sequences below, we obtain what may be vital context for what is being said in the tweets

In [9]:
testtweet = train['tweet'][317]

print("Original:\n", testtweet)
print("\nReplacing emojis:\n", emojiReplace(testtweet))

Original:
 Chi-town is a 💩⚫ https://t.co/X2QMidmAH9

Replacing emojis:
 Chi-town is a  pile of poo black circle https://t.co/X2QMidmAH9


In [10]:
testtweet = train['tweet'][300]

print("Original:\n", testtweet)
print("\nReplacing Emojis:\n", emojiReplace(testtweet))

Original:
 No seriously. It has 😂🤣 https://t.co/4k4jlLTDUj

Replacing Emojis:
 No seriously. It has  face with tears of joy rolling on the floor laughing https://t.co/4k4jlLTDUj


If we apply the basic pre-processing function after this function, it will eliminate the excess whitespace

In [17]:
testtweet = train['tweet'][300]

print("Original:\n", testtweet)
print("\nReplacing Emojis and Basic Preprocessing:\n", preprocess(emojiReplace(testtweet)))


Original:
 No seriously. It has 😂🤣 https://t.co/4k4jlLTDUj

Replacing Emojis and Basic Preprocessing:
 no seriously. it has face with tears of joy rolling on the floor laughing


In [48]:
testtweet1 = train['tweet'][7436]

print("Original:\n", testtweet1)
print("\nReplacing emojis:\n", emojiReplace(testtweet1))

Original:
 Same. We really are soulmates... Dumb AF but soulmates nonetheless 🙃🙃🙃🙃🙃🙃🙃🙃🙃🙃🙃🙃🙃🙃 https://t.co/ZwXTny02jj

Replacing emojis:
 Same. We really are soulmates... Dumb AF but soulmates nonetheless  upside-down face upside-down face upside-down face upside-down face upside-down face upside-down face upside-down face upside-down face upside-down face upside-down face upside-down face upside-down face upside-down face upside-down face https://t.co/ZwXTny02jj


## EmojiReplace_v2

The above `emojiReplace` function is quite useful for converting emojis into interpretable text made of words that the pre-trained BERT model has seen before and so already have contextually informed weights.

Whilst this may often lead to a relatively accurate portrayal of sentiment, it can have its drawbacks. What about when there are several consecutive emojis, often of the same type? This can lead to unnecessarily long sequences. It can also draw importance away from the rest of the sequence, which may hold the information that better tells us whether a tweet is hate speech or not.

Furthermore, giving each of these emojis its own single token could be beneficial for our classifier. We can replace the "[unusedX]" tokens, which already have randomly initialized weights. These weights will be updated in the fine-tuning stage, thus giving a contextual representation to these tokens.

The BERT vocab file is altered in the notebook `New_Vocab_File_for_Emojis.ipynb`. Below is the function that converts emojis into unique tokens to work with the new vocab file
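
To illustrate the idea of one token per emoji plus a marker for repeats, here is a minimal standalone sketch. The `EMOJI_NAMES` table is a hypothetical stand-in for the dictionary returned by `demoji.findall`, and the helper names are made up for this example; the `mult` prefix follows the convention used in this notebook.

```python
import re

# Hypothetical stand-in for demoji's emoji -> description mapping.
EMOJI_NAMES = {
    "\U0001F643": "upside-down face",
    "\U0001F602": "face with tears of joy",
}

def emoji_to_token(description):
    """Join the words of a description with 'x' so it forms one word-like token."""
    return "x".join(re.findall(r"\w+", description))

def replace_emojis(text):
    for emoji, description in EMOJI_NAMES.items():
        token = emoji_to_token(description)
        # A run of two or more identical emojis becomes a single 'mult' token.
        run = "(?:" + re.escape(emoji) + "){2,}"
        text = re.sub(run, " mult" + token + " ", text)
        # Any remaining single occurrence becomes the plain token.
        text = text.replace(emoji, " " + token + " ")
    # Collapse the padding spaces introduced above.
    return re.sub(r"\s+", " ", text).strip()

print(replace_emojis("soulmates \U0001F643\U0001F643\U0001F643 ok \U0001F602"))
# soulmates multupsidexdownxface ok facexwithxtearsxofxjoy
```

Matching the run on the raw emoji, before any text substitution, avoids having to build a regular expression out of the generated token.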

In [45]:
def emojiReplace_v2(text_string):
    emoji_dict = demoji.findall(text_string)    
    for emoji in emoji_dict.keys():
        #Making the connecting token between words a normal letter 'x' because BERT's tokenizer
        #splits on special tokens like '%' and '$'
        emoji_token = 'x'.join(re.split(r'\W+', emoji_dict[emoji])) + ' '
        text_string = text_string.replace(emoji, emoji_token)
        
        #Controlling for multiple emojis in a row
        pattern = '(' + emoji_token + ')' + '{2,}'
        text_string = re.sub(pattern, 'mult' + emoji_token + ' ', text_string)
    return text_string

### Demonstrating emojiReplace_v2

In [46]:
testtweet = train['tweet'][300]

print("Original:\n", testtweet)
print("\nReplacing emojis:\n", emojiReplace_v2(testtweet))

Original:
 No seriously. It has 😂🤣 https://t.co/4k4jlLTDUj

Replacing emojis:
 No seriously. It has facexwithxtearsxofxjoy rollingxonxthexfloorxlaughing  https://t.co/4k4jlLTDUj


In [47]:
testtweet1 = train['tweet'][7436]

print("Original:\n", testtweet1)
print("\nReplacing emojis:\n", emojiReplace_v2(testtweet1))

Original:
 Same. We really are soulmates... Dumb AF but soulmates nonetheless 🙃🙃🙃🙃🙃🙃🙃🙃🙃🙃🙃🙃🙃🙃 https://t.co/ZwXTny02jj

Replacing emojis:
 Same. We really are soulmates... Dumb AF but soulmates nonetheless multupsidexdownxface   https://t.co/ZwXTny02jj


## Hashtag Segmentation

In [52]:
load() # Loading wordsegment

#The counts below are the approximate number of Google search results for each term
ws.BIGRAMS['alt right'] = 1.17e8 # update wordsegment dict so 
                                #it recognises altright as "alt right" rather than salt right
ws.BIGRAMS['white supremacists'] = 3.86e6
ws.UNIGRAMS['tweets'] = 6.26e10 # a single word, so it belongs in the unigram table
ws.BIGRAMS['independence day'] = 6.21e7

def hashtagSegment(text_string):
    
    #We target hashtags so that we only segment the hashtag strings.
    #Otherwise the segment function may operate on misspelled words also; which
    #often appear in hate speech tweets owing to the ill education of those spewing it
    temp_str = []
    for word in text_string.split(' '):
        if not word.startswith('#'):
            temp_str.append(word)
        else:
            temp_str = temp_str + segment(word)
            
    text_string = ' '.join(temp_str)       

    return text_string
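
The hashtag-only targeting can be tried without installing `wordsegment`. The sketch below passes the segmenter in as an argument; `toy_segment` and its lookup table are hypothetical stand-ins for `wordsegment.segment`, included only so the example runs on its own.

```python
def segment_hashtags(text, segmenter):
    """Apply `segmenter` to words starting with '#'; leave other words untouched."""
    words = []
    for word in text.split(" "):
        if word.startswith("#"):
            words.extend(segmenter(word))
        else:
            words.append(word)
    return " ".join(words)

# Hypothetical lookup table standing in for a real statistical segmenter.
TOY_SPLITS = {
    "buildthewall": ["build", "the", "wall"],
    "enddaca": ["end", "daca"],
}

def toy_segment(hashtag):
    key = hashtag.lstrip("#").lower()
    return TOY_SPLITS.get(key, [key])

print(segment_hashtags("great speec #buildthewall #enddaca", toy_segment))
# great speec build the wall end daca
```

The misspelled word "speec" passes through unchanged because only words that start with `#` ever reach the segmenter.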

### Demonstrating Hashtag Segmentation

<b>Converts hashtags like #lockthemup and #enddaca into interpretable words which can be converted into features by BERT</b>

In [91]:
testtweet = preprocess(train['tweet'][0])

print("Original:\n", testtweet)
print("\nHashtag Segmented:\n", hashtagSegment(testtweet))

Original:
 hurray, saving us $$$ in so many ways #lockthemup #buildthewall #enddaca #boycottnfl #boycottnike

Hashtag Segmented:
 hurray, saving us $$$ in so many ways lock them up build the wall end daca boycott nfl boycott nike


In [92]:
testtweet = preprocess(train['tweet'][1029])

print("Original:\n", testtweet)
print("\nHashtag Segmented:\n", hashtagSegment(testtweet))

Original:
 great speec mr. president. i would suggest change to keep america great after u #buildthatwall that u promised and to make mexico pay for it. please keep the #promisesmade and #dowhatyousaid.

Hashtag Segmented:
 great speec mr. president. i would suggest change to keep america great after u build that wall that u promised and to make mexico pay for it. please keep the promises made and do what you said


## Removing punctuation

Effects are self-explanatory, although one must be careful to do this after hashtag segmentation, otherwise the '#' is removed and hashtags can no longer be identified.

When BERT tokenizes sequences, it treats each punctuation mark as a separate token, which could prove harmful to the model if these tokens carry no useful information.

Perhaps not, though: BERT will have already seen punctuation in its pre-training stage and already learned appropriate weights for these punctuation symbols. We shall test the effect of this technique anyway.
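
The ordering constraint can be checked directly, since `#` is one of the characters in `string.punctuation`. A minimal sketch with a made-up tweet; `strip_punctuation` is a name chosen for this example and mirrors the character-filtering approach used in this notebook:

```python
import string

def strip_punctuation(text):
    """Drop every character listed in string.punctuation (which includes '#')."""
    return "".join(ch for ch in text if ch not in string.punctuation)

tweet = "keep the #promisesmade"  # made-up example

# Stripping first removes the '#', so the hashtag is no longer detectable.
stripped = strip_punctuation(tweet)
hashtags_before = [w for w in tweet.split() if w.startswith("#")]
hashtags_after = [w for w in stripped.split() if w.startswith("#")]

print(hashtags_before)  # ['#promisesmade']
print(hashtags_after)   # []
```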

In [99]:
def remove_punct(text):
    
    #Keep each character as long as it's not punctuation
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct

### Demonstrating Removal of Punctuation

Below we remove full stops from this tweet

In [102]:
testtweet = preprocess(train['tweet'][921])

print("Original:\n", testtweet)
print("\nRemoving Punctuation:\n", remove_punct(testtweet))

Original:
 we have to demand justice and mercy for immigrants. it is unacceptable that anyone would die in detention. congress has to act now with welcome and compassion.

Removing Punctuation:
 we have to demand justice and mercy for immigrants it is unacceptable that anyone would die in detention congress has to act now with welcome and compassion


Here we remove commas, question marks, dashes and even hashtags.

In [103]:
testtweet = preprocess(train['tweet'][910])

print("Original:\n", testtweet)
print("\nRemoving Punctuation:\n", remove_punct(testtweet))

Original:
 why #nodaca ?illegals est. between 11 - 30 millionillegal vote 80% dem or 600,000 dem vote advantage / 1,000,000 illegal voters70,000 votes gave 2016 electionwhy do you think dems pushing no walls, no borders, no voter ids?#potus #maga #kag #trump #news #votered

Removing Punctuation:
 why nodaca illegals est between 11  30 millionillegal vote 80 dem or 600000 dem vote advantage  1000000 illegal voters70000 votes gave 2016 electionwhy do you think dems pushing no walls no borders no voter idspotus maga kag trump news votered
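
The output above glues some words together (for example `ids?#potus` becomes `idspotus`) because each punctuation character is deleted rather than replaced. One possible variant, not used elsewhere in this notebook and named `remove_punct_spaced` only for this sketch, maps punctuation to a space and then collapses whitespace:

```python
import re
import string

# Translation table mapping every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def remove_punct_spaced(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    spaced = text.translate(PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", spaced).strip()

print(remove_punct_spaced("no voter ids?#potus #maga"))
# no voter ids potus maga
```

The trade-off is that numbers and contractions are split as well: `600,000` becomes `600 000`, so which variant works better would need to be tested.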


## Lemmatizing text

Lemmatizing is similar to stemming but more sophisticated, and therefore slower. Stemming heuristically chops off the end of a word without taking into account the context in which it is used, whereas lemmatizing returns words that are actually in the dictionary.

It doesn't just cut off -ing or -ed because it sees it. The result has to be a real word: the dictionary base form (lemma) of the original word.

This can be useful to our model because it reduces the vocabulary that the model is exposed to by mapping inflected forms of a word to a single form.

It can be problematic with abbreviations, which are common on Twitter, so caution is advised with this method
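
The contrast with stemming can be sketched without NLTK. Both functions below are toy illustrations: `naive_stem` chops a few suffixes blindly, and `toy_lemmatize` looks words up in a tiny hand-written table that stands in for a real dictionary such as WordNet.

```python
SUFFIXES = ("ies", "ing", "ed", "s")

def naive_stem(word):
    """Chop the first matching suffix, with no check that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMAS = {"atrocities": "atrocity", "families": "family", "trials": "trial"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise leave the word unchanged."""
    return LEMMAS.get(word, word)

for w in ["atrocities", "families", "trials"]:
    print(w, "->", naive_stem(w), "|", toy_lemmatize(w))
# atrocities -> atrocit | atrocity
# families -> famil | family
# trials -> trial | trial
```

The stemmer produces fragments such as `atrocit` that are not words, while the lookup always returns a dictionary form or leaves the word alone.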

In [26]:
nltk.download('wordnet')
wn = nltk.WordNetLemmatizer()

def lemmatizing(text):
    word_list = re.split(r'\W+', text)
    text = " ".join([wn.lemmatize(word) for word in word_list])
    return text

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\fionn\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Demonstrating Lemmatizing

Small changes: replaces atrocities with atrocity and families with family. The sentence does not lose much of its meaning with these changes, unlike stemming, which would lose much more of the meaning.

In [89]:
testtweet = preprocess(train['tweet'][21])

print("Original:\n", testtweet)
print("\nLemmatized:\n", lemmatizing(testtweet))

Original:
 hitler left a stain on germany for the atrocities he committed against the jews. trump will leave a stain on america for the atrocities he's commiting against these immigrant families. i only hope there is a reenactment of the nuremberg trials at the end of his reign. #inhumane

Lemmatized:
 hitler left a stain on germany for the atrocity he committed against the jew trump will leave a stain on america for the atrocity he s commiting against these immigrant family i only hope there is a reenactment of the nuremberg trial at the end of his reign inhumane


## Removing stopwords
Stopwords are words which have been deemed not to give much useful information: words like 'the' and 'with', which are sentiment-neutral and don't tell us a lot about the intent of a sentence.

Caution is advised with this pre-processing step yet again, though, as removing stopwords can completely transform a sentence. This technique has proven benefits for basic NLP tasks like sentiment classification and spam detection. However, hate speech detection is a whole different matter.
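
A small sketch of how stopword removal can change meaning. The stopword set here is a short hand-written list for illustration, not NLTK's; like many English stopword lists it contains negations such as `not`. The `keep` parameter is a hypothetical addition that lets selected words survive.

```python
import re

# Hand-written illustrative list; a real list is much longer.
TOY_STOPWORDS = {"they", "are", "the", "to", "and", "not", "no", "in", "our"}

def drop_stopwords(text, stopwords=TOY_STOPWORDS, keep=frozenset()):
    """Remove stopwords, except any listed in `keep`."""
    words = [w for w in re.split(r"\W+", text) if w]
    return " ".join(w for w in words if w not in stopwords or w in keep)

sentence = "they are not welcome in our country"  # made-up example
print(drop_stopwords(sentence))                # welcome country
print(drop_stopwords(sentence, keep={"not"}))  # not welcome country
```

Dropping `not` reverses the apparent stance of the sentence, which is exactly the kind of signal a hate speech classifier may depend on.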

In [87]:
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

def remove_stopwords(text):
    word_list = re.split(r'\W+', text)
    text = " ".join([word for word in word_list if word not in stopwords])
    return text

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fionn\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Demonstrating removing stopwords

As shown below, words like <b><i>'and', 'to', 'them' </i></b> and so on are removed. This sort of text pre-processing is a blunt tool and should again be treated with caution

In [97]:
testtweet = preprocess(train['tweet'][291])

print("Original:\n", testtweet)
print("\nRemoving stopwords:\n", remove_stopwords(testtweet))

Original:
 angry italian officials refuse to let this italian commercial ship disembark 66 refugees and migrants because they think it should have let libyan coastguards intercept them and return them to inhumane detention centers instead

Removing stopwords:
 angry italian officials refuse let italian commercial ship disembark 66 refugees migrants think let libyan coastguards intercept return inhumane detention centers instead
