## Adding tokens to vocab file to represent emojis

We create a function that reads in the ~1500 most common emojis, whose usage is tracked in real time by the website http://www.emojistats.org/, and stores them as a dataframe. We do this by saving the website as an HTML file, then parsing it with BeautifulSoup.

We also prepend the `mult` keyword to the 500 most common of these, so that we have representations for multiple occurrences of the same emoji in a row.

This creates a dataframe containing the most common emojis and their 'mult' counterparts. We can store these tokens in our vocab file for BERT, where they will all have randomly initialized weights at the initial BERT checkpoint. These weights will be updated during fine-tuning, and perhaps the tokens will pick up useful contextual information this way.
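
To see why these tokens need to be in the vocab file at all, here is a minimal sketch of greedy longest-match-first WordPiece splitting, the scheme BERT's tokenizer applies to each word. It is a simplified stand-in written for illustration, not BERT's own tokenizer code; the function name `wordpiece_split` and the toy vocabularies are made up for this example.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedily split `word` into the longest pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a '##' prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabularies (assumptions for this example only)
base_vocab = {"face", "##x", "##joy"}
extended_vocab = base_vocab | {"facexjoy"}

print(wordpiece_split("facexjoy", base_vocab))
print(wordpiece_split("facexjoy", extended_vocab))
```

When the token is missing from the vocab, the emoji name is broken into several pieces. Once it is added, it maps to a single vocab entry whose embedding can be tuned on its own.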

In [39]:
import bs4
from bs4 import BeautifulSoup as BSHTML
import pandas as pd
import re

#Wheel file below is in path
!pip install demoji-0.1.5-py3-none-any.whl
import demoji
demoji.download_codes()

Downloading emoji data ...
... OK (Got response in 1.01 seconds)
... OK (Got response in 0.86 seconds)
Downloading emoji data ...
... OK (Got response in 1.41 seconds)
Writing emoji data to C:\Users\fionn\.demoji/codes.json ...
... OK


In [14]:
INPUT_FILE = 'Emoji Stats - Realtime Emoji Use on iOS.html'
OUTPUT_FILE = 'emojiCounts.csv'

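#Open both files in one 'with' block so they are closed even if parsing fails
with open(INPUT_FILE) as texts, open(OUTPUT_FILE, 'w') as f:
    f.write('emoji,count\n') # write headers

    #Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BSHTML(texts, 'html.parser')
    lis = soup.find_all('ul', attrs = {'class' : 'emojilist'})
    for li in lis:
        for span in li.find_all('span'):
            #The span id holds the emoji name: strip the 'value_' part and restore the spaces
            emoji = span['id'].replace('value_', '').replace('_', ' ')
            #The node right after the opening tag holds the count; drop the thousands separators
            count = span.next_element.replace(',', '')
            f.write(emoji+','+count+'\n') # write to file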
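
The clean-up applied to each entry can be checked in isolation, without the saved HTML file. The sketch below assumes a span id of the form `value_face_with_tears_of_joy` and a count string with thousands separators, which is what the replacements above imply. The helper names `parse_entry` and `write_counts` are hypothetical, and the `csv` module is swapped in for manual string concatenation so that a name containing a comma would still be quoted correctly.

```python
import csv
import io


def parse_entry(span_id, count_text):
    """Turn a raw span id and count string into (emoji name, integer count)."""
    name = span_id.replace("value_", "").replace("_", " ")
    count = int(count_text.replace(",", "").strip())
    return name, count


def write_counts(entries, out):
    """Write a header row plus one row per (span_id, count_text) pair."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["emoji", "count"])
    for span_id, count_text in entries:
        writer.writerow(parse_entry(span_id, count_text))


# Example input is an assumption about what the page contains
buffer = io.StringIO()
write_counts([("value_face_with_tears_of_joy", "332,059,305")], buffer)
print(buffer.getvalue())
```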

We're trying to create token representations identical to those produced by our `emojiReplace_v2` function, demonstrated below.

In [42]:
def emojiReplace_v2(text_string):
    emoji_dict = demoji.findall(text_string)    
    for emoji in emoji_dict.keys():
        #Making the connecting token between words a normal letter 'x' because BERT's tokenizer
        #splits on special tokens like '%' and '$'
        emoji_token = 'x'.join(re.split(r'\W+', emoji_dict[emoji])) + ' '
        text_string = text_string.replace(emoji, emoji_token)
        
        #Controlling for multiple emojis in a row
        pattern = '(' + emoji_token + ')' + '{2,}'
        text_string = re.sub(pattern, 'mult' + emoji_token + ' ', text_string)
    return text_string

#Load in HatEval data
df = pd.read_csv('Raw_Data/hateval2019/hateval2019_en_train.csv', sep=',',  index_col = False, encoding = 'utf-8')
df.rename(columns={'text': 'tweet', 'HS': 'label'}, inplace=True)

testtweet = df['tweet'][300]

print("Original:\n", testtweet)
print("\nReplacing Emojis:\n", emojiReplace_v2(testtweet))

testtweet1 = df['tweet'][7436]

print("\n\nOriginal:\n", testtweet1)
print("\nReplacing emojis:\n", emojiReplace_v2(testtweet1))

Original:
 No seriously. It has 😂🤣 https://t.co/4k4jlLTDUj

Replacing Emojis:
 No seriously. It has facexwithxtearsxofxjoy rollingxonxthexfloorxlaughing  https://t.co/4k4jlLTDUj


Original:
 Same. We really are soulmates... Dumb AF but soulmates nonetheless 🙃🙃🙃🙃🙃🙃🙃🙃🙃🙃🙃🙃🙃🙃 https://t.co/ZwXTny02jj

Replacing emojis:
 Same. We really are soulmates... Dumb AF but soulmates nonetheless multupsidexdownxface   https://t.co/ZwXTny02jj
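
The same replace-then-collapse logic can be tried without demoji installed. The sketch below assumes a small hand-written `EMOJI_NAMES` dictionary in place of `demoji.findall` (its two entries are illustrative). It also escapes the token before building the pattern and emits `mult` plus the token without an extra trailing space, so it is a variant of `emojiReplace_v2` rather than a copy of it.

```python
import re

# Hypothetical stand-in for demoji.findall(): maps each emoji to its description
EMOJI_NAMES = {
    "🙃": "upside-down face",
    "😂": "face with tears of joy",
}


def replace_emojis(text, emoji_names=EMOJI_NAMES):
    """Replace emojis with 'x'-joined tokens and collapse repeats into a 'mult' token."""
    for emoji, name in emoji_names.items():
        if emoji not in text:
            continue
        # Join the words of the description with 'x' and add a separating space
        token = "x".join(re.split(r"\W+", name)) + " "
        text = text.replace(emoji, token)
        # Two or more copies of the token in a row become a single 'mult' token
        pattern = "(?:" + re.escape(token) + "){2,}"
        text = re.sub(pattern, "mult" + token, text)
    return text


print(replace_emojis("ok 🙃🙃🙃 bye").split())
print(replace_emojis("hi 😂").split())
```

Splitting the result on whitespace, as in the two `print` calls, hides any doubled spaces left behind by the replacement.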


In [43]:
#Converting emoji descriptions to the token representation we will create in emojiReplace_v2
def makeToken(text):
    text = 'x'.join(re.split(r'\W+', text))
    return text

#Add 'mult' to most popular 500 emojis so we have token representation
#for consecutive emojis as well as individual ones
def addMult(text):
    text = 'mult' + text
    return text

In [46]:
#Read in csv created in 2nd cell and sort
emojiCounts = pd.read_csv(OUTPUT_FILE, sep = ',', header = 0) 
emojiCounts.sort_values('count', inplace = True, ascending = False)

emojiCounts['tokens'] = emojiCounts['emoji'].apply(makeToken)

emojiCounts['mult'] = emojiCounts['tokens'][0:500].apply(addMult)

emojiCounts.to_csv('emojiCounts.csv', sep = ',', index = False)
emojiCounts.reset_index(inplace = True, drop = True)
emojiCounts.head()

Unnamed: 0,emoji,count,tokens,mult
0,face with tears of joy,332059305,facexwithxtearsxofxjoy,multfacexwithxtearsxofxjoy
1,heavy black heart,171566133,heavyxblackxheart,multheavyxblackxheart
2,face throwing a kiss,122353058,facexthrowingxaxkiss,multfacexthrowingxaxkiss
3,smiling face with heart shaped eyes,87122195,smilingxfacexwithxheartxshapedxeyes,multsmilingxfacexwithxheartxshapedxeyes
4,rolling on the floor laughing,52003805,rollingxonxthexfloorxlaughing,multrollingxonxthexfloorxlaughing


Create a new vocab file with the additional emoji representations. Hopefully these emoji tokens will learn useful weights during fine-tuning and have a positive effect on our performance.

In [35]:
with open('vocab.txt', encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()

for i in range(1,100):
    lines[i] = emojiCounts['tokens'][i-1] + '\n'
    
#Tokens 100, 101, 102 and 103 are the VITAL [UNK], [CLS], [SEP] and [MASK] tokens respectively.
#We must skip these lines. Lines 1-99 hold emoji tokens 0-98, so line 104
#continues with emoji token 99, which is an offset of 5
for i in range(104, 500):
    lines[i] = emojiCounts['tokens'][i-5] + '\n'
    
for j in range(500, 997):
    lines[j] = emojiCounts['mult'][j-500] + '\n'
    
#Also add in words 'user' and 'multuser' to vocab file which will come up often
#in tweets after preprocessing. No trailing space, or the entry would never match a token
lines[997] = 'user' + '\n'
lines[998] = 'multuser' + '\n'
    
with open('vocab1.txt', 'w', encoding = 'utf-8', errors = 'ignore') as f:
    f.writelines(lines[:])
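
The hard-coded line ranges above depend on where the placeholder entries sit in the vocab file. A less brittle alternative, sketched here under the assumption that the placeholder lines look like `[unused0]`, `[unused1]`, and so on, is to overwrite only the lines matching that pattern. The helper name `fill_unused_slots` and the miniature vocab are hypothetical.

```python
import re

# Assumed format of the placeholder entries in the vocab file
UNUSED_SLOT = re.compile(r"^\[unused\d+\]$")


def fill_unused_slots(vocab_lines, new_tokens):
    """Replace placeholder lines, in order, with new tokens.

    Returns the new list of lines and any tokens that did not fit.
    """
    out = list(vocab_lines)
    remaining = iter(new_tokens)
    for i, line in enumerate(out):
        if UNUSED_SLOT.match(line.strip()):
            token = next(remaining, None)
            if token is None:
                break  # every new token has been placed
            out[i] = token + "\n"
    return out, list(remaining)


# Miniature vocab for illustration only
toy_vocab = ["[PAD]\n", "[unused0]\n", "[unused1]\n", "[UNK]\n", "[CLS]\n",
             "[SEP]\n", "[MASK]\n", "[unused2]\n", "the\n"]
new_lines, leftover = fill_unused_slots(toy_vocab, ["a", "b", "c", "d"])
print(new_lines)
print(leftover)
```

Because only placeholder lines are touched, the special tokens and the real word pieces keep their positions, and the file length never changes.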

<b>We'll store this new vocab file in the GCS bucket containing the pre-trained BERT model</b>