# Further Pre-Training BERT Embeddings

<b>Open in colab for full functionality (if not already open in colab) </b>

<a href="https://colab.research.google.com/drive/1iLhgDQ5aFlrLqyx5772zA2zYmFs5BNn0target" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> 

We'll clone the Git repository for BERT to access the scripts that preprocess our data for pre-training, along with a script that runs further pre-training on a provided text corpus. The original BERT repo has been forked to my GitHub, where it was edited to remove the <i>next sentence prediction</i> task, which is redundant when processing tweets.

In [1]:
!git clone https://gitlab2.eeecs.qub.ac.uk/csc3002_fionn/csc3002_detecting_hate_speech.git
%cd csc3002_detecting_hate_speech


Cloning into 'csc3002_detecting_hate_speech'...
remote: Enumerating objects: 473, done.
remote: Counting objects: 100% (473/473), done.
remote: Compressing objects: 100% (293/293), done.
remote: Total 473 (delta 258), reused 362 (delta 170)
Receiving objects: 100% (473/473), 1.66 GiB | 23.99 MiB/s, done.
Resolving deltas: 100% (258/258), done.
Checking out files: 100% (76/76), done.
/content/csc3002_detecting_hate_speech


Importing dependencies...

In [2]:
import os
import pandas as pd
%tensorflow_version 1.x
import tensorflow as tf
import pprint
import re
import json
import html

TensorFlow 1.x selected.


**We'll use a TPU provided by Google Colab to run our model**

In [3]:
assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)

from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)

USE_TPU=True
try:
  # This address identifies the TPU we'll use when configuring TensorFlow.
  TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
  tf.config.experimental_connect_to_host(TPU_WORKER)
except Exception as ex:
  print(ex)
  USE_TPU=False

print("        USE_TPU:", USE_TPU)
print("Eager Execution:", tf.executing_eagerly())

assert not tf.executing_eagerly(), "Eager execution on TPUs currently has issues"

TPU address is grpc://10.100.173.10:8470
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 14936117022277820564),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 354911848024985374),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 16436239899307180411),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 12442358817677048772),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 14068518871292887260),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:

Set the random seed and check the TensorFlow version

In [4]:
tf.set_random_seed(3060)
print("Tensorflow Version:", tf.__version__)

Tensorflow Version: 1.15.2


We'll put the further pre-trained model in a subdirectory inside our BERT model directory on our Google Cloud Storage bucket. The cell below can also delete and recreate that subdirectory, in case a previous run went wrong.

In [5]:
#Large whole word masking BERT
bert_model_name = 'wwm_uncased_L-24_H-1024_A-16' 

output_dir = \
os.path.join(bert_model_name, 'further_pretrained_model')

#@markdown Whether or not to clear/delete the directory and create a new one
DO_DELETE = True #@param {type:"boolean"}
#@markdown Set USE_BUCKET and BUCKET if you want to (optionally) store model output on GCP bucket.
USE_BUCKET = True #@param {type:"boolean"}
BUCKET = 'csc3002' #@param {type:"string"}

if USE_BUCKET:
  OUTPUT_DIR = 'gs://{}/{}'.format(BUCKET, output_dir)
  auth.authenticate_user()
else:
  #Fall back to a local directory so OUTPUT_DIR is always defined
  OUTPUT_DIR = output_dir

if DO_DELETE:
  try:
    tf.gfile.DeleteRecursively(OUTPUT_DIR)
  except tf.errors.NotFoundError:
    # Doesn't matter if the directory didn't exist
    pass
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))


***** Model output directory: gs://csc3002/wwm_uncased_L-24_H-1024_A-16/further_pretrained_model *****


From my research, a favourable approach to further pre-training is within-task or in-domain pre-training, with around 120,000 steps reported to work well. I've already collected around 150,000 tweets categorised as hate speech, offensive or benign, so this roughly satisfies the requirement that the pre-training data is within-task or in-domain.

In [6]:
!gcloud config set project 'my-project-csc3002'
!pip install gcsfs
data = pd.read_csv('gs://csc3002/Raw_Data/final.csv' , sep=',',  index_col = False, encoding = 'utf-8')
data.rename(columns={'Tweet': 'tweet'}, inplace = True)
print("\nThe amount of tweets in this dataframe is", len(data.index))
data.head()

Updated property [core/project].


To take a quick anonymous survey, run:
  $ gcloud survey

Collecting gcsfs
  Downloading https://files.pythonhosted.org/packages/18/3b/454be7c97d05e15eb20a0099f425f0ed6b7552e352c77adb923c3872ba14/gcsfs-0.6.1-py2.py3-none-any.whl
Installing collected packages: gcsfs
Successfully installed gcsfs-0.6.1

The amount of tweets in this dataframe is 146047


| Unnamed: 0 | Hate_Speech | Offensive | tweet |
|---|---|---|---|
| 0 | 0 | 1 | @USER She should ask a few native Americans wh... |
| 1 | 0 | 0 | Amazon is investigating Chinese employees who ... |
| 2 | 0 | 1 | @USER Someone should'veTaken" this piece of sh... |
| 3 | 0 | 0 | @USER @USER Obama wanted liberals &amp; illega... |
| 4 | 0 | 1 | @USER Liberals are all Kookoo !!! |


We only want the tweets from this file. At this stage of pre-training the labels are meaningless, as we're only performing the masked language modelling task and possibly next sentence prediction (although I'm not sure next sentence prediction is strictly necessary; we'll see).
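To make the masked language task concrete, here is a minimal sketch of the idea in plain Python. It is a simplification and not the logic of `create_pretraining_data.py`: the real script works on WordPiece tokens, sometimes keeps or randomly swaps a selected token instead of masking it, and with whole word masking hides every piece of a chosen word together. The function name `mask_tokens` and the sample sentence are made up for illustration.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=3060):
    """Simplified masking: hide roughly mask_prob of the tokens behind [MASK]."""
    rng = random.Random(seed)
    # Always mask at least one token so there is something to predict.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]  # the model is trained to recover this token
        masked[pos] = "[MASK]"
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog again".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

No class labels are involved: the training signal comes entirely from the hidden tokens, which is why the `Hate_Speech` and `Offensive` columns can be dropped here.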

**Let's preprocess the pre-training data with the same settings that worked best for our fine-tuning data**

In [8]:
#@title Text Pre-Processing Options
HASHTAG_SEGMENTATION = True #@param {type:"boolean"}
EMOJI_REPLACEMENT = "Replace_Emoji_v1" #@param ["None", "Replace_Emoji_v1", "Replace_Emoji_v2"]
LEMMATIZE = False #@param {type:"boolean"}
REMOVE_STOPWORDS = False #@param {type:"boolean"}
REMOVE_PUNCTUATION = True #@param {type:"boolean"}

options = [HASHTAG_SEGMENTATION, EMOJI_REPLACEMENT, LEMMATIZE, REMOVE_STOPWORDS, REMOVE_PUNCTUATION]

%cd Text_Preprocessing/
import preprocessing as pre
#Return to original workspace
%cd ..

data = pre.loadData(data, options = options)

/content/csc3002_detecting_hate_speech/Text_Preprocessing
/content/csc3002_detecting_hate_speech


Load the BERT model as an initial checkpoint to train from, and also load the vocab and config files

In [9]:
bucket_dir = 'gs://csc3002'

bert_ckpt_dir = os.path.join(bucket_dir, bert_model_name) 

bert_ckpt_file   = os.path.join(bert_ckpt_dir, "bert_model.ckpt")
bert_config_file = os.path.join(bert_ckpt_dir, "bert_config.json")
vocab_file = os.path.join(bert_ckpt_dir, "vocab1.txt")

print("Using BERT checkpoint from:", bert_ckpt_dir)

Using BERT checkpoint from: gs://csc3002/wwm_uncased_L-24_H-1024_A-16


### Loading in the rest of the pretraining tweets

See RetrievingPretrainingData.ipynb for more details on the sources of these tweets and how they were pre-processed

Whilst we did remove duplicates in RetrievingPretrainingData.ipynb, we did not remove duplicate tweets that were retweets. Our basic preprocessing function removes the RT symbol, so after preprocessing we should be able to remove many more duplicate tweets.
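The sketch below illustrates why stripping the RT marker exposes extra duplicates. The helper `strip_retweet_marker` is a hypothetical stand-in for whatever `preprocessing.loadData` does internally, and the sample tweets are invented.

```python
import re

def strip_retweet_marker(text):
    """Hypothetical stand-in for the RT removal done by the preprocessing module."""
    return re.sub(r"^RT\s+", "", text).strip()

raw = [
    "Great thread on immigration policy",
    "RT Great thread on immigration policy",
    "A different tweet",
]
cleaned = [strip_retweet_marker(t) for t in raw]
# dict.fromkeys keeps the first occurrence of each tweet and preserves order.
unique = list(dict.fromkeys(cleaned))
print(len(raw) - len(unique), "duplicate removed")
```

Before cleaning, the first two strings differ and both would survive a text-based duplicate check; after cleaning they are identical.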

In [12]:
imTweets = pd.read_csv('gs://csc3002/pretrain_data/tweetText/immigrationTweets.csv',\
                       sep=',',  index_col = False, names = ['id', 'tweet'])
length = len(imTweets.index)
imTweets.tweet = imTweets.tweet.astype(str) #Making sure text column is type string so preprocessing works
imTweets = pre.loadData(imTweets, options = options)

imTweets.drop_duplicates(subset='id',inplace =True ) #Important to drop duplicates
imTweets.drop_duplicates(subset='tweet',inplace =True )
imlength = length - len(imTweets) 
print("Removed", imlength, "duplicate retweets from the immigration dataset")



womTweets = pd.read_csv('gs://csc3002/pretrain_data/tweetText/YesAllWomenTweets.csv',\
                        sep=',',  index_col = False, names =['id', 'tweet'])
length = len(womTweets.index)
womTweets.tweet = womTweets.tweet.astype(str) #Making sure text column is type string so preprocessing works
womTweets = pre.loadData(womTweets, options = options)

womTweets.drop_duplicates(subset='id',inplace =True ) #Important to drop duplicates
womTweets.drop_duplicates(subset='tweet',inplace =True ) 
womlength= length - len(womTweets)
print("Removed",womlength , "duplicate retweets from the #YesAllWomen dataset")



sexismTweets = pd.read_csv('gs://csc3002/pretrain_data/tweetText/sexismTweets.csv',\
                           sep=',',  index_col = False, names =['id', 'tweet'])
length = len(sexismTweets.index)
sexismTweets.tweet = sexismTweets.tweet.astype(str) #Making sure text column is type string so preprocessing works
sexismTweets = pre.loadData(sexismTweets, options = options)

sexismTweets.drop_duplicates(subset='id',inplace =True ) #Important to drop duplicates
sexismTweets.drop_duplicates(subset='tweet',inplace =True ) 
sexismlength = length - len(sexismTweets)
print("Removed ", sexismlength, "duplicate retweets from the #notallmen and #notallwomen datasets")


chalk = pd.read_csv('gs://csc3002/pretrain_data/tweetText/chalkTweets.csv', \
                    sep=',',  index_col = False, names =['id', 'tweet'])
length = len(chalk.index)
chalk.tweet = chalk.tweet.astype(str) #Making sure text column is type string so preprocessing works
chalk = pre.loadData(chalk, options = options)

chalk.drop_duplicates(subset='id',inplace =True ) #Important to drop duplicates
chalk.drop_duplicates(subset='tweet',inplace =True ) 
chalklength = length - len(chalk)
print("Removed", chalklength, "duplicate retweets from the #thechalkening dataset")



dat = pd.read_csv('gs://csc3002/pretrain_data/tweetText/full.csv',\
                  sep=',',  index_col = False, names =['id', 'tweet'])
length = len(dat.index)
dat.tweet = dat.tweet.astype(str) #Making sure text column is type string so preprocessing works
dat = pre.loadData(dat, options = options)

dat.drop_duplicates(subset='id',inplace =True ) #Important to drop duplicates
dat.drop_duplicates(subset='tweet',inplace =True ) 
datlength = length - len(dat)
print("Removed", datlength, "duplicate retweets from the combined dataset")


dfs = [imTweets, womTweets, sexismTweets, chalk, dat, data]
full = pd.concat(dfs, axis =0)
length = len(full.index)
full.drop(columns = {'Hate_Speech', 'Offensive'}, inplace=True)

#Pre-process so we can remove retweets
full.tweet = full.tweet.astype(str) #Making sure text column is type string so preprocessing works

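#Below removes a LOT of duplicates - maybe fishy?
#Rows from `data` have no id column, so their id is NaN after the concat,
#and drop_duplicates treats all NaN ids as duplicates of each other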
full.drop_duplicates(subset='id',inplace =True ) #Important to drop duplicates
full.drop_duplicates(subset='tweet',inplace =True ) 

newlength = len(full.index)
duplicates = (length - newlength) + chalklength + datlength + sexismlength + womlength + imlength
print("Removed", duplicates , "duplicate retweets\n")

#Shuffle Data
full = full.sample(frac=1)
full.reset_index(drop = True, inplace = True)

full.info()

Removed 282267 duplicate retweets from the immigration dataset
Removed 102411 duplicate retweets from the #YesAllWomen dataset
Removed  3875 duplicate retweets from the #notallmen and #notallwomen datasets
Removed 3699 duplicate retweets from the #thechalkening dataset




Removed 55680 duplicate retweets from the combined dataset
Removed 604122 duplicate retweets

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1253403 entries, 0 to 1253402
Data columns (total 2 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   id      1253402 non-null  object
 1   tweet   1253403 non-null  object
dtypes: object(2)
memory usage: 19.1+ MB
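The `info()` output above shows exactly one row with a missing `id`, which is what you would expect if the labelled tweets (which carry no `id`) were collapsed into a single row by `drop_duplicates(subset='id')`. The toy example below is a sketch of that pandas behaviour and of one possible NaN-aware alternative; the frame `toy` and its values are invented, and this is not a change that was applied to the run recorded here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "id": ["1", "1", "2", np.nan, np.nan, np.nan],
    "tweet": ["a", "a", "b", "c", "d", "e"],
})

# Plain drop_duplicates treats every missing id as the same value,
# so only the first id-less row survives.
naive = toy.drop_duplicates(subset="id")

# Alternative: only treat rows as id-duplicates when the id is present,
# then deduplicate on the tweet text as before.
is_dup_id = toy["id"].notna() & toy.duplicated(subset="id")
careful = toy[~is_dup_id].drop_duplicates(subset="tweet")

print(len(naive), len(careful))
```

With the NaN-aware version, id-less rows are only removed when their text repeats.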


BERT can be trained with a maximum sequence length of 512 tokens. The memory needed for self-attention grows quadratically with sequence length, so a larger maximum sequence length increases both memory use and training time.

<i>(Instead of just removing tweets over our defined max sequence length, we could train on the longer portion of the dataset during the last tenth of the steps. Further investigation is needed here, though.)</i>
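As a rough illustration of how sequence length drives memory, the sketch below counts the entries in the attention score matrices for a single example. It assumes 24 layers and 16 heads, read off the `L-24` and `A-16` parts of the model name, and ignores every other activation, so treat it as back-of-the-envelope arithmetic only. The function name `attention_scores` is made up.

```python
def attention_scores(seq_len, num_heads=16, num_layers=24):
    """Entries in the attention score matrices for one example.

    Each head in each layer holds a seq_len x seq_len matrix.
    num_heads and num_layers are assumptions taken from the model name.
    """
    return seq_len * seq_len * num_heads * num_layers

for n in (128, 256, 512):
    print(n, attention_scores(n))
```

Doubling the sequence length quadruples this count, which is the reason for trying 256 before 512.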

In [13]:
#We'll see if we can get away with max seq length of 256 first
#Note: len(x) counts characters, which is only a rough proxy for the token count
MAX_SEQ_LEN = 256

print("There will be", len(full[full['tweet'].apply(lambda x: len(x) < MAX_SEQ_LEN)]), "tweets")

#The tweets are from many different sources and have been grouped together 
#so we'll shuffle the data
full = full.sample(frac=1)
full.reset_index(drop = True, inplace=True)
full = full[full['tweet'].apply(lambda x: len(x) <= MAX_SEQ_LEN)]

There will be 1250219 tweets


### Let's save our data in a text file and put it in the BERT repo directory we cloned

In [14]:
# Go to cloned bert repo in workspace
%cd bert/
tweets = full.tweet
tweets.to_csv('./text_file.txt', sep=',', index = True)
%ls

/content/csc3002_detecting_hate_speech/bert
CONTRIBUTING.md
create_pretraining_data.py
extract_features.py
__init__.py
LICENSE
modeling.py
modeling_test.py
multilingual.md
optimization.py
optimization_test.py
predicting_movie_reviews_with_bert_on_tf_hub.ipynb
README.md
requirements.txt
run_classifier.py
run_classifier_with_tfhub.py
run_pretraining.py
run_squad.py
sample_text.txt
text_file.txt
tokenization.py
tokenization_test.py


The same string and integer values (for example max_seq_length and max_predictions_per_seq) have to be passed to both scripts

In [0]:
#It's advised to set max predictions per sequence to around max_seq_length * 0.15
max_preds_per_seq = MAX_SEQ_LEN * 0.15
print("max_predictions_per_seq:", max_preds_per_seq)

print("vocab file location:", vocab_file)

output_file = os.path.join(OUTPUT_DIR, 'pretrainingdata.tfrecord' )
print("output file", output_file)

max_predictions_per_seq: 38.4
vocab file location: gs://csc3002/wwm_uncased_L-24_H-1024_A-16/vocab1.txt
output file gs://csc3002/wwm_uncased_L-24_H-1024_A-16/further_pretrained_model/pretrainingdata.tfrecord
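The printed value of 38.4 is not an integer, while the commands below pass 39. A small sketch of deriving that integer directly, assuming rounding up is the intended behaviour (the constant name `MASKED_LM_PROB` is introduced here for illustration):

```python
import math

MAX_SEQ_LEN = 256
MASKED_LM_PROB = 0.15

# Round up so the flag is an integer large enough for every masked position.
max_preds_per_seq = math.ceil(MAX_SEQ_LEN * MASKED_LM_PROB)
print("max_predictions_per_seq:", max_preds_per_seq)
```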


## Creating Pre-Training Data

In [0]:
!python create_pretraining_data.py \
  --input_file='./text_file.txt' \
  --output_file='gs://csc3002/wwm_uncased_L-24_H-1024_A-16/further_pretrained_model/pretrainingdata.tfrecord' \
  --vocab_file='gs://csc3002/wwm_uncased_L-24_H-1024_A-16/vocab.txt' \
  --do_lower_case=True \
  --do_whole_word_mask=True \
  --max_seq_length=256 \
  --max_predictions_per_seq=39 \
  --masked_lm_prob=0.15 \
  --random_seed=3060 \
  --dupe_factor=5 \
  --short_seq_prob=0.25

  #The output is a set of tf.train.Examples serialized into TFRecord file format



W0318 21:26:17.290036 140664900233088 module_wrapper.py:139] From create_pretraining_data.py:440: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W0318 21:26:17.290290 140664900233088 module_wrapper.py:139] From create_pretraining_data.py:440: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W0318 21:26:17.290468 140664900233088 module_wrapper.py:139] From /content/bert/tokenization.py:125: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W0318 21:26:18.852150 140664900233088 module_wrapper.py:139] From create_pretraining_data.py:447: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.


W0318 21:26:18.853678 140664900233088 module_wrapper.py:139] From create_pretraining_data.py:449: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:*** Reading from input files ***
I0318 21:26:18.8538

## Run Further Pre-Training

Now we run the main pretraining script, which creates the further pretrained model

In [0]:
print("Input file:", OUTPUT_DIR )
print("Bert checkpoint file location:", bert_ckpt_file )
print("Bert config file location:", bert_config_file )
print("TPU ADDRESS: ", TPU_ADDRESS)

Input file: gs://csc3002/wwm_uncased_L-24_H-1024_A-16/further_pretrained_model
Bert checkpoint file location: gs://csc3002/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt
Bert config file location: gs://csc3002/wwm_uncased_L-24_H-1024_A-16/bert_config.json
TPU ADDRESS:  grpc://10.76.79.202:8470


In [0]:
#The TPU address changes with each new runtime. Set --tpu_name to the value printed above so the run succeeds
!python run_pretraining.py \
  --input_file='gs://csc3002/wwm_uncased_L-24_H-1024_A-16/further_pretrained_model/pretrainingdata.tfrecord' \
  --output_dir='gs://csc3002/wwm_uncased_L-24_H-1024_A-16/further_pretrained_model' \
  --do_train=True \
  --do_eval=True \
  --bert_config_file='gs://csc3002/wwm_uncased_L-24_H-1024_A-16/bert_config.json' \
  --init_checkpoint='gs://csc3002/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt' \
  --train_batch_size=32 \
  --max_seq_length=256 \
  --save_checkpoints_steps=20000 \
  --max_predictions_per_seq=39 \
  --num_train_steps=120000 \
  --num_warmup_steps=12000 \
  --learning_rate=5e-5 \
  --use_tpu=True \
  --tpu_name='grpc://10.76.79.202:8470' \
  --num_tpu_cores=8




W0318 22:16:16.286983 140652062574464 module_wrapper.py:139] From run_pretraining.py:409: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W0318 22:16:16.287181 140652062574464 module_wrapper.py:139] From run_pretraining.py:409: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W0318 22:16:16.287331 140652062574464 module_wrapper.py:139] From /content/bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W0318 22:16:17.639919 140652062574464 module_wrapper.py:139] From run_pretraining.py:416: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.


W0318 22:16:17.918688 140652062574464 module_wrapper.py:139] From run_pretraining.py:420: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.


W0318 22:16:18.058937 140652062574464 module_wrapper.py:139] From run_pretraining.py:422: The name tf.logg
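To see what the `--learning_rate`, `--num_train_steps` and `--num_warmup_steps` flags above imply, here is a sketch of the schedule as I understand it from the upstream BERT `optimization.py`: a linear warmup followed by a linear decay to zero. The forked script may differ, and the function below is an illustration in plain Python, not the TensorFlow code that actually runs.

```python
def learning_rate(step, init_lr=5e-5, num_train_steps=120000, num_warmup_steps=12000):
    """Linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    return init_lr * (1.0 - step / num_train_steps)

for step in (0, 6000, 12000, 60000, 120000):
    print(step, learning_rate(step))
```

With these flags the first 10% of training is spent ramping up, and the rate reaches zero at the final step.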

### We can use TensorBoard to visualise the performance of our model

We can observe loss relative to how many steps of pre-training the model underwent

In [0]:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip   #Extracts the ngrok binary into the current directory
def get_tensorboard(path_to_event_file):
  get_ipython().system_raw('tensorboard --logdir {} --host 0.0.0.0 --port 6006 --reload_multifile=true &'
.format(path_to_event_file))
  
  get_ipython().system_raw('./ngrok http 6006 &')

  !curl -s http://localhost:4040/api/tunnels | python3 -c \
      "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

get_tensorboard('gs://csc3002/wwm_uncased_L-24_H-1024_A-16/further_pretrained_model')

--2020-03-18 22:16:04--  https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 34.202.138.174, 34.225.254.242, 34.201.246.51, ...
Connecting to bin.equinox.io (bin.equinox.io)|34.202.138.174|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13773305 (13M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip’


2020-03-18 22:16:04 (37.8 MB/s) - ‘ngrok-stable-linux-amd64.zip’ saved [13773305/13773305]

Archive:  ngrok-stable-linux-amd64.zip
  inflating: ngrok                   
http://b748d4b0.ngrok.io
