# Data Augmentation Through Backtranslation
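The backtranslation method we use is a neural network model whose checkpoint must be downloaded from Google, and we run it on a GPU. The cell below contains a check for an available GPU (currently commented out) and tests access to the Google Cloud Storage bucket.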

In [6]:
#We need the gpu compatible version in the environment
#!pip install tensorflow-gpu==1.14
import tensorflow as tf
import pandas as pd
import re
import html
import os
#!tensorflow_version 1.x

"""device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
"""

#Get GCS Bucket Access
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]= "C:/Users/fionn/Downloads/storageCreds.json"
from google.cloud import storage
storage_client = storage.Client()
buckets = list(storage_client.list_buckets())
print("If access has been granted, below will print GCS project name:")
print(buckets) # Testing if access to GCS has been granted

pd.set_option('display.max_colwidth', -1)

print("Tensorflow version must be 1.x")
print("Tensorflow version", tf.__version__)

If access has been granted, below will print GCS project name:
[<Bucket: csc3002>]
Tensorflow version must be 1.x
Tensorflow version 1.5.0
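
The GPU check in the cell above is wrapped in a string, so it never runs. A minimal sketch of a non-fatal variant follows. The helper names `gpu_status` and `check_gpu` are hypothetical and not part of the original notebook; `tf.test.gpu_device_name()` is the same call the commented-out code uses.

```python
def gpu_status(device_name):
    """Turn a TensorFlow device name string into a readable status message."""
    if device_name.startswith('/device:GPU:'):
        return 'Found GPU at: {}'.format(device_name)
    return 'GPU device not found'


def check_gpu():
    """Ask TensorFlow for a GPU device name (requires TensorFlow to be installed)."""
    import tensorflow as tf  # imported lazily so the helper above works without it
    return gpu_status(tf.test.gpu_device_name())


# Formatting check that does not need TensorFlow:
print(gpu_status('/device:GPU:0'))
```

Calling `check_gpu()` inside the notebook would report the status instead of raising `SystemError`, which may be preferable when a CPU-only run is acceptable.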


## Data
I've stored the dataset in my Google Drive for ease of access, so authentication will have to be provided.

In [7]:
DATASET = 'HatEval'

if DATASET == 'AnalyticsVidhya':
    data = pd.read_csv('./Raw_Data/AnalyticsVidhya/train_E6oV3lV.csv',
                       sep=',', index_col=False, encoding='utf-8')
else:
    # For HatEval
    train = pd.read_csv('./Raw_Data/hateval2019/hateval2019_en_train.csv',
                        sep=',', index_col=False, encoding='utf-8')

    dev = pd.read_csv('./Raw_Data/hateval2019/hateval2019_en_dev.csv',
                      sep=',', index_col=False, encoding='utf-8')

    data = pd.concat([train, dev], axis=0)

data.drop_duplicates(inplace=True)
data.rename(columns={'text': 'tweet', 'HS': 'label'}, inplace=True)


data.head()

Unnamed: 0,id,tweet,label,TR,AG
0,201,"Hurray, saving us $$$ in so many ways @potus @realDonaldTrump #LockThemUp #BuildTheWall #EndDACA #BoycottNFL #BoycottNike",1,0,0
1,202,"Why would young fighting age men be the vast majority of the ones escaping a war &amp; not those who cannot fight like women, children, and the elderly?It's because the majority of the refugees are not actually refugees they are economic migrants trying to get into Europe.... https://t.co/Ks0SHbtYqn",1,0,0
2,203,"@KamalaHarris Illegals Dump their Kids at the border like Road Kill and Refuse to Unite! They Hope they get Amnesty, Free Education and Welfare Illegal #FamilesBelongTogether in their Country not on the Taxpayer Dime Its a SCAM #NoDACA #NoAmnesty #SendThe",1,0,0
3,204,NY Times: 'Nearly All White' States Pose 'an Array of Problems' for Immigrants https://t.co/ACZKLhdMV9 https://t.co/CJAlSXCzR6,0,0,0
4,205,"Orban in Brussels: European leaders are ignoring the will of the people, they do not want migrants https://t.co/NeYFyqvYlX",0,0,0


In [8]:
print(data.label.value_counts(), "\n")
data.info()

0    5790
1    4210
Name: label, dtype: int64 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 999
Data columns (total 5 columns):
id       10000 non-null int64
tweet    10000 non-null object
label    10000 non-null int64
TR       10000 non-null int64
AG       10000 non-null int64
dtypes: int64(4), object(1)
memory usage: 468.8+ KB


# Text Pre-Processing
We preprocess the text with the same techniques we'd use in training and pre-training, with the addition of removing punctuation. This ensures our sentences are parsed correctly by the backtranslation model, which uses quotation marks to separate sequences; any quotation marks already present in a sequence would break that separation.
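
To illustrate the punctuation step, here is a minimal sketch using only the standard library. It is not the project's own `pre.remove_punct`, whose implementation is not shown in this notebook; `strip_punctuation` is a hypothetical stand-in.

```python
import string

# Translation table that deletes every ASCII punctuation character,
# including the double and single quotes that would confuse the parser.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def strip_punctuation(text):
    """Return `text` with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)


print(strip_punctuation('he said "build the wall", didn\'t he?'))
# he said build the wall didnt he
```

Note that `string.punctuation` also contains `#` and `@`, so a step like this should probably run after any hashtag or mention handling.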

In [13]:
!cd 'Unsupervised_Data_Augmentation'
!dir

The system cannot find the path specified.


 Volume in drive C is Windows
 Volume Serial Number is 8610-A966

 Directory of C:\Users\fionn\Documents\CSC3002_Project\CSC3002_Detecting_Hate_Speech_On_Social_Media\csc3002_detecting_hate_speech

04/04/2020  12:39    <DIR>          .
04/04/2020  12:39    <DIR>          ..
04/04/2020  12:05    <DIR>          .ipynb_checkpoints
31/03/2020  19:14    <DIR>          bert
31/03/2020  18:40    <DIR>          Fine_Tuning
31/03/2020  18:38    <DIR>          Pre_Training
04/04/2020  12:22    <DIR>          Raw_Data
12/02/2020  14:09             2,268 README.md
31/03/2020  18:38    <DIR>          Report
31/03/2020  18:45    <DIR>          Text_Preprocessing
04/04/2020  12:06    <DIR>          Unsupervised_Data_Augmentation
04/04/2020  12:39         2,072,756 Unsupervised_Data_Augmentation.ipynb
04/04/2020  12:11         5,023,017 Visualisations.ipynb
               3 File(s)      7,098,041 bytes
              10 Dir(s)  333,341,782,016 bytes free


In [9]:
!cd ../Text_Preprocessing/
#!ls
import preprocessing as pre
#Return to original workspace
!cd ..
!cd Unsupervised_Data_Augmentation


#Function caller can optionally load two dataframes and combine them
def loadData(data1, data2 = None):
    if data2 is not None:
        
        frames = [data1,data2]
        data = pd.concat(frames)
    else:
        data = data1

    # Emoji replacement must be done before the basic preprocessing, otherwise the
    # unicode will be wiped out and these functions will be ineffective
    if EMOJI_REPLACEMENT == 'Replace_Emoji_v1':
        data['tweet'] = data['tweet'].apply(pre.emojiReplace)

    if EMOJI_REPLACEMENT == 'Replace_Emoji_v2':
        data['tweet'] = data['tweet'].apply(pre.emojiReplace_v2)

    #Must be performed after emoji translation
    data['tweet'] = data['tweet'].apply(pre.preprocess)

    if HASHTAG_SEGMENTATION:
        data['tweet'] = data['tweet'].apply(pre.hashtagSegment)

    if REMOVE_PUNCTUATION:
        data['tweet'] = data['tweet'].apply(pre.remove_punct)

    if REMOVE_STOPWORDS:
        data['tweet'] = data['tweet'].apply(pre.remove_stopwords)

    if LEMMATIZE:
        data['tweet'] = data['tweet'].apply(pre.lemmatizing)


    data.dropna(inplace = True)
    data.reset_index(drop = True, inplace = True) 
    if DATASET == "AnalyticsVidhya"  and len(data.index) < 20000:
        return data
    else:
        #We don't shuffle data when it is the analytics vidhya test set
        data = data.sample(frac = 1, random_state=SEED) # Shuffle data 
        return data

#Testing function
data = loadData(data)

ModuleNotFoundError: No module named 'preprocessing'
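
The `ModuleNotFoundError` above is most likely caused by the `!cd` lines: each `!` command runs in its own subshell, so the kernel's working directory and import path never change. A minimal sketch of one workaround is to add the module's folder to `sys.path` directly. The helper name `add_to_import_path` is hypothetical, and the relative path assumes the notebook runs from the project root shown in the directory listing.

```python
import os
import sys


def add_to_import_path(folder):
    """Put `folder` at the front of sys.path (only once) and return its absolute path."""
    folder = os.path.abspath(folder)
    if folder not in sys.path:
        sys.path.insert(0, folder)
    return folder


# Assumes preprocessing.py lives in ./Text_Preprocessing relative to the notebook.
add_to_import_path('./Text_Preprocessing')
# import preprocessing as pre   # should now resolve if the assumption above holds
```

The `%cd` magic is an alternative, since unlike `!cd` it changes the kernel's working directory, but modifying `sys.path` leaves the working directory untouched.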

In [10]:
!dir

 Volume in drive C is Windows
 Volume Serial Number is 8610-A966

 Directory of C:\Users\fionn\Documents\CSC3002_Project\CSC3002_Detecting_Hate_Speech_On_Social_Media\csc3002_detecting_hate_speech

04/04/2020  12:37    <DIR>          .
04/04/2020  12:37    <DIR>          ..
04/04/2020  12:05    <DIR>          .ipynb_checkpoints
31/03/2020  19:14    <DIR>          bert
31/03/2020  18:40    <DIR>          Fine_Tuning
31/03/2020  18:38    <DIR>          Pre_Training
04/04/2020  12:22    <DIR>          Raw_Data
12/02/2020  14:09             2,268 README.md
31/03/2020  18:38    <DIR>          Report
31/03/2020  18:45    <DIR>          Text_Preprocessing
04/04/2020  12:06    <DIR>          Unsupervised_Data_Augmentation
04/04/2020  12:37         2,070,380 Unsupervised_Data_Augmentation.ipynb
04/04/2020  12:11         5,023,017 Visualisations.ipynb
               3 File(s)      7,095,665 bytes
              10 Dir(s)  333,342,019,584 bytes free


# Advanced pre-processing - Unsupervised Data Augmentation
The code below is where I implement Unsupervised Data Augmentation on the tweets annotated as hate speech in my dataset, as there is an imbalance of them compared to the benign tweets (hate speech tweets make up 7% of the AnalyticsVidhya dataset).
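
The 7% figure refers to AnalyticsVidhya. For the HatEval data loaded in this notebook, the share can be read off the label counts printed earlier (5790 benign, 4210 hate speech). A minimal sketch of that calculation, rebuilding the label column from those counts rather than from the real file:

```python
import pandas as pd

# Label column reconstructed from the value_counts output shown earlier.
labels = pd.Series([0] * 5790 + [1] * 4210, name='label')

# normalize=True returns proportions instead of raw counts.
share = labels.value_counts(normalize=True)
print(round(share.loc[1], 3))  # 0.421
```

So HatEval is far less imbalanced than AnalyticsVidhya, which is worth keeping in mind when judging how much augmentation is needed.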

This git project (https://github.com/google-research/uda) provides back-translation data augmentation, a paraphrasing tool that promises boosts in performance. I have previously cloned the git project into my Google bucket and will retrieve the functionality from there.

We can git clone the project and run the commands necessary to retrieve the backtranslation model and run it on our hate speech tweets below.


In [0]:
#Already done the below
"""original = '/content'
%cd '/content/drive/My Drive'
!git clone https://github.com/google-research/uda.git"""

"original = '/content'\n%cd '/content/drive/My Drive'\n!git clone https://github.com/google-research/uda.git"

**Save our hate speech tweets into the back_translate subdirectory of the git project**

We must also manually edit the run.sh and download.sh shell scripts so that they run on our tweet text file instead of the example_file in the original git project.

Download the existing scripts and edit them locally, then return them to the Google Drive folder, because Google Drive won't let us edit them in place at the moment (it reports that authentication is required).

In [5]:
hatetweets = data.loc[data['label'] == 1]
hatetweets.tweet = hatetweets.tweet.apply(preprocess)
hatetweets = hatetweets.tweet
hatetweets.to_csv('/content/drive/My Drive/uda/back_translate/hateEvaltweettxt.txt', header=None, index=None, sep=' ', mode='a')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value
  after removing the cwd from sys.path.
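
The message above is pandas' `SettingWithCopyWarning`: `hatetweets` is a slice of `data`, so assigning to its `tweet` column may or may not write through to the original frame. A minimal sketch of the usual fix is to take an explicit copy before modifying it. The toy frame and the `to_lower` function are hypothetical stand-ins for the real data and for `preprocess`.

```python
import pandas as pd

# Toy frame standing in for the real dataset.
data = pd.DataFrame({'tweet': ['Some TEXT', 'Other text', 'More Text'],
                     'label': [1, 0, 1]})


def to_lower(text):
    """Hypothetical stand-in for the notebook's preprocess function."""
    return text.lower()


# .copy() makes hatetweets an independent frame, so the assignment is unambiguous.
hatetweets = data.loc[data['label'] == 1].copy()
hatetweets['tweet'] = hatetweets['tweet'].apply(to_lower)

print(list(hatetweets['tweet']))  # ['some text', 'more text']
```

The original `data` frame is left unchanged, which is usually what is wanted when exporting a derived subset.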


In [9]:
pd.set_option('display.max_colwidth', -1)
print(len(hatetweets))
hatetweets.head(50)

4210


0      hurray, saving us $$$ in so many ways #lockthemup #buildthewall #enddaca #boycottnfl #boycottnike                                                                                                                                                                                 
1      why would young fighting age men be the vast majority of the ones escaping a war and not those who cannot fight like women, children, and the elderly?it's because the majority of the refugees are not actually refugees they are economic migrants trying to get into europe....
2      illegals dump their kids at the border like road kill and refuse to unite! they hope they get amnesty, free education and welfare illegal #familesbelongtogether in their country not on the taxpayer dime its a scam #nodaca #noamnesty #sendthe                                 
5      legal is. not illegal. #buildthatwall                                                                                                              

Install dependencies and navigate into the subdirectory containing the shell scripts that perform unsupervised data augmentation on our tweets.

In [11]:
%cd '/content/drive/My Drive/uda/back_translate'

%pip install --user absl-py
%pip install --user nltk
!python -c "import nltk; nltk.download('punkt')"
%pip install tensor2tensor==1.14.0

/content/drive/My Drive/uda/back_translate
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Collecting tensor2tensor==1.14.0
  Downloading https://files.pythonhosted.org/packages/64/75/e8fab6e46fcfaf278998b9d0a182361eaa1a9b5a9a7ecb58a0796d9e5229/tensor2tensor-1.14.0-py2.py3-none-any.whl (1.6MB)
     |████████████████████████████████| 1.6MB 4.7MB/s 
Installing collected packages: tensor2tensor
  Found existing installation: tensor2tensor 1.14.1
    Uninstalling tensor2tensor-1.14.1:
      Successfully uninstalled tensor2tensor-1.14.1
Successfully installed tensor2tensor-1.14.0


Run the modified download.sh and run.sh shell scripts. They automatically split paragraphs into sentences, translate the English sentences to French, and then translate them back into English. Finally, they reassemble the paraphrased sentences into paragraphs.

**This may take a while.**
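
The split, translate and reassemble flow described above can be sketched in a few lines. This is only an illustration of the data flow, not the UDA scripts themselves: the two translation models are replaced by caller-supplied functions, and the regex sentence splitter is a naive stand-in for the NLTK `punkt` tokenizer installed earlier.

```python
import re


def split_sentences(paragraph):
    """Naively split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    return [p for p in parts if p]


def back_translate(paragraph, forward, backward):
    """Translate each sentence to a pivot language and back, then rejoin."""
    sentences = split_sentences(paragraph)
    paraphrased = [backward(forward(s)) for s in sentences]
    return ' '.join(paraphrased)


# Toy stand-ins for the English->French and French->English models.
print(back_translate('Legal is. Not illegal.', str.upper, str.lower))
# legal is. not illegal.
```

In the real pipeline `forward` and `backward` would be the two translation checkpoints fetched by download.sh, which is where the paraphrasing actually comes from.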

In [0]:
!bash download.sh
!bash run.sh

--2019-12-19 23:39:50--  https://storage.googleapis.com/uda_model/text/back_trans_checkpoints.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.195.128, 2607:f8b0:400e:c08::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.195.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4587552274 (4.3G) [application/zip]
Saving to: ‘back_trans_checkpoints.zip’


2019-12-19 23:41:16 (51.2 MB/s) - ‘back_trans_checkpoints.zip’ saved [4587552274/4587552274]

Archive:  back_trans_checkpoints.zip
replace checkpoints/enfr/model.ckpt-500000.index? [y]es, [n]o, [A]ll, [N]one, [r]ename: a
error:  invalid response [a]
replace checkpoints/enfr/model.ckpt-500000.index? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: checkpoints/enfr/model.ckpt-500000.index  
  inflating: checkpoints/enfr/model.ckpt-500000.data-00001-of-00002  y
y

  inflating: checkpoints/enfr/model.ckpt-500000.meta  
  inflating: checkpoints/enfr/model.ckpt-500000.

For some reason the script does not store the result in a directory. I believe it may be because I did not ensure tensorflow-gpu was installed in the environment, or perhaps my Google Drive was out of space on this run.

As a workaround I'll copy all of the backtranslated tweets into a text file outside of this code. I'll check that they've been backtranslated by comparing them to the original file, then delete the paraphrase + number indicator at the start of each tweet, so that we get a clean tweet-only dataset which we can insert seamlessly into the original tweet file.
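
One way to do the comparison against the original file is to measure how many tweets actually changed. A minimal sketch follows, assuming both files list the tweets in the same order, one per line, and that the paraphrase prefix has already been stripped; `changed_fraction` is a hypothetical helper.

```python
def changed_fraction(originals, paraphrases):
    """Fraction of aligned pairs whose text differs, ignoring case and outer whitespace."""
    pairs = list(zip(originals, paraphrases))
    if not pairs:
        return 0.0
    changed = sum(1 for o, p in pairs if o.strip().lower() != p.strip().lower())
    return changed / len(pairs)


originals = ['legal is. not illegal.', 'build that wall']
paraphrases = ['it is legal. not illegal.', 'Build that wall ']
print(changed_fraction(originals, paraphrases))  # 0.5
```

A value near zero would suggest the backtranslation step silently returned the input unchanged.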

In [0]:
!bash backwardgentopara.bash

In [0]:
#I've copied the backtranslated file into my google drive, now to preprocess it so it's fit for use


dat = '/content/drive/My Drive/paraphrased1_tweets.txt' 
dat = pd.read_csv(dat, sep = '\t', header = None, names = ['tweet'])
pd.set_option('max_colwidth', 800)
dat.head()

The tweets above don't seem like proper English - not that the people who most often engage in hate speech are the most educated bunch!

But still, the way the sentences are phrased seems peculiar - which is exactly what we want! This shows backtranslation has worked, and it has doubled the amount of hate speech in our dataset. Manual inspection of the tweets confirmed this as well.

**Now to remove that paraphrase stuff at the start of each tweet**

In [0]:
def removePara(text_string):
    # Strip the "paraphrase <number>:" prefix that the backtranslation output
    # prepends to each tweet; [0-9]+ matches an index with any number of digits.
    parsed_text = re.sub(r'paraphrase [0-9]+:', '', text_string)

    return parsed_text

In [0]:
dat.tweet = dat.tweet.apply(removePara)
dat.tail()

In [0]:
dat.to_csv('/content/drive/My Drive/trial/backtranslated_tweets.csv', index = False)