We'll have to clone the BERT git repository to access its scripts: one that preprocesses our data for pre-training, and one that runs further pre-training on a provided text corpus. The repo is provided by Google Research here: https://github.com/google-research/bert

In [0]:
!git clone https://github.com/google-research/bert.git

fatal: destination path 'bert' already exists and is not an empty directory.


Importing dependencies...

In [0]:
import os
import pandas as pd
%tensorflow_version 1.x
import tensorflow as tf
import pprint
import re
import json
import tweepy


**We'll use a TPU provided by Google Colab to run our model**

In [0]:
assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)

from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)

USE_TPU=True
try:
  # This address identifies the TPU we'll use when configuring TensorFlow.
  TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
  tf.config.experimental_connect_to_host(TPU_WORKER)
except Exception as ex:
  print(ex)
  USE_TPU=False

print("        USE_TPU:", USE_TPU)
print("Eager Execution:", tf.executing_eagerly())

assert not tf.executing_eagerly(), "Eager execution on TPUs currently has issues"

TPU address is grpc://10.1.55.50:8470
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 11238158767859933477),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 16930398522813338690),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 6350689635013358654),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 17091014494587278884),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 3708746930935667203),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/d

Set the random seed and check the TensorFlow version

In [0]:
tf.set_random_seed(3060)
print("Tensorflow Version:", tf.__version__)

Tensorflow Version: 1.15.0


We'll put the further pre-trained model in a subdirectory inside our BERT model directory in our Google Cloud Storage bucket. The cell below can optionally delete and recreate that subdirectory, in case a previous run went wrong.

In [0]:
#Bert uncased Large 
#bert_model_name = 'uncased_L-24_H-1024_A-16' 

#Large whole word masking
bert_model_name = 'wwm_uncased_L-24_H-1024_A-16' 

output_dir = \
os.path.join(bert_model_name, 'further_pretrained_model')

#@markdown Whether or not to clear/delete the directory and create a new one
DO_DELETE = True #@param {type:"boolean"}
#@markdown Set USE_BUCKET and BUCKET if you want to (optionally) store model output on GCP bucket.
USE_BUCKET = True #@param {type:"boolean"}
BUCKET = 'csc3002' #@param {type:"string"}

if USE_BUCKET:
  OUTPUT_DIR = 'gs://{}/{}'.format(BUCKET, output_dir)
  auth.authenticate_user()
else:
  # Fall back to a local directory so OUTPUT_DIR is always defined
  OUTPUT_DIR = output_dir

if DO_DELETE:
  try:
    tf.gfile.DeleteRecursively(OUTPUT_DIR)
  except tf.errors.OpError:
    # Doesn't matter if the directory didn't exist
    pass
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))


***** Model output directory: gs://csc3002/wwm_uncased_L-24_H-1024_A-16/further_pretrained_model *****


From my reading, a promising approach to further pre-training is within-task or in-domain pre-training, with around 120,000 steps reported to work well. I've already collected around 150,000 tweets labelled as hate speech, offensive or benign, so this corpus roughly satisfies the requirement that the pre-training data be within-task or in-domain.

In [0]:
!gcloud config set project 'my-project-csc3002'
!pip install gcsfs
data = 'gs://csc3002/Raw_Data/final.csv' 
data = pd.read_csv(data, sep=',',  index_col = False, encoding = 'utf-8')

print("\nThe number of tweets in this dataframe is", len(data.index))
print("Hate speech label count:\n ", data.Hate_Speech.value_counts(), "\n")
print("Offensive language label count:\n ", data.Offensive.value_counts(), "\n")
data.head()

Updated property [core/project].


To take a quick anonymous survey, run:
  $ gcloud survey

Collecting gcsfs
  Downloading https://files.pythonhosted.org/packages/3e/9f/864a9ff497ed4ba12502c4037db8c66fde0049d9dd0388bd55b67e5c4249/gcsfs-0.6.0-py2.py3-none-any.whl
Installing collected packages: gcsfs
Successfully installed gcsfs-0.6.0

The number of tweets in this dataframe is 147977
Hate speech label count:
  0    139946
1      8031
Name: Hate_Speech, dtype: int64 

Offensive language label count:
  0    80127
1    52483
-    15367
Name: Offensive, dtype: int64 



Unnamed: 0,Hate_Speech,Offensive,Tweet
0,0,1,@USER She should ask a few native Americans wh...
1,0,1,@USER @USER Go home you’re drunk!!! @USER #MAG...
2,0,0,Amazon is investigating Chinese employees who ...
3,0,1,"@USER Someone should'veTaken"" this piece of sh..."
4,0,0,@USER @USER Obama wanted liberals &amp; illega...


We only want the tweets from this file. At this stage of pre-training the labels are irrelevant, because we're only performing masked language modelling and possibly next sentence prediction (although I'm not sure next sentence prediction is strictly necessary; we'll see).
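To make the masked language task concrete, here is a minimal sketch of the masking step. It is a simplification and not the code in create_pretraining_data.py: it works on whitespace tokens, ignores [CLS]/[SEP] and whole word masking, and the function name `mask_tokens` and its `vocab` argument are made up for illustration. The 80/10/10 replacement rule and the 15% masking rate follow the BERT recipe as I understand it.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, masked_lm_prob=0.15, max_predictions=39, seed=3060):
    """Return (masked_tokens, labels); labels maps position -> original token."""
    rng = random.Random(seed)
    # Predict roughly 15% of the tokens: at least 1, at most max_predictions.
    num_to_predict = min(max_predictions,
                         max(1, int(round(len(tokens) * masked_lm_prob))))
    positions = list(range(len(tokens)))
    rng.shuffle(positions)
    chosen = sorted(positions[:num_to_predict])

    masked = list(tokens)
    labels = {}
    for pos in chosen:
        labels[pos] = tokens[pos]
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[pos] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

tokens = "this is a short example tweet about nothing in particular".split()
masked, labels = mask_tokens(tokens, vocab=tokens)
print(masked)
print(labels)
```

The model is then trained to recover the tokens in `labels` from `masked`, which is why the hate speech and offensive labels play no part at this stage.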

**Let's preprocess the pretraining data, like we will for our fine-tuning data**

In [0]:
def preprocess(text_string):
    """
    Accepts a text string and:
    1) Removes URLs, plus the literal placeholder 'URL'
    2) Removes the retweet marker (e.g. 'RT @user') - knowing the tweet is
       a retweet does not help towards our classification task
    3) Removes mentions, with or without a trailing colon
    4) Removes non-ASCII characters and numeric HTML entities (e.g. &#128514;)
    5) Replaces runs of whitespace with a single space
    6) Lowercases the text and strips leading/trailing whitespace
    Named HTML entities such as &amp; are left as-is because the
    html.unescape() step is currently disabled.
    """
    space_pattern = r'\s+'
    giant_url_regex = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[#$-_@.&+]|'
        r'[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    mention_regex = r'@[\w\-]+:'
    mention_regex1 = r'@[\w\-]+'
    RT_regex = r'(RT|rt)[ ]*@[ ]*[\S]+'

    # Remove urls and the literal placeholder 'URL'
    parsed_text = re.sub(giant_url_regex, '', text_string)
    parsed_text = re.sub('URL', '', parsed_text)

    # Remove the fact the tweet is a retweet.
    # (we're only interested in the language of the tweet here)
    parsed_text = re.sub(RT_regex, ' ', parsed_text)

    # Remove mentions followed by a colon first - this seems to come up often
    parsed_text = re.sub(mention_regex, '', parsed_text)
    # Then remove any remaining mentions as they're redundant information
    parsed_text = re.sub(mention_regex1, '', parsed_text)

    # Remove non-ASCII characters and numeric HTML entities
    parsed_text = re.sub(r'[^\x00-\x7F]', '', parsed_text)
    parsed_text = re.sub(r'&#[0-9]+;', '', parsed_text)

    # Disabled: convert remaining HTML entities to text
    # parsed_text = html.unescape(parsed_text)

    # Collapse runs of whitespace into a single space
    parsed_text = re.sub(space_pattern, ' ', parsed_text)

    # Set text to lowercase and strip
    parsed_text = parsed_text.lower()
    parsed_text = parsed_text.strip()

    return parsed_text

BERT supports a maximum sequence length of 512 tokens. Self-attention cost grows quadratically with sequence length, so a large sequence length sharply increases the memory required and therefore the training time.
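Self-attention compares every token with every other token, so the attention-score matrices grow with the square of the sequence length. The rough arithmetic below counts only those score entries for one batch, assuming the 24 layers and 16 heads implied by the model name and the batch size of 32 used later in this notebook. The function name is made up, and other activations (which grow roughly linearly) are ignored.

```python
def attention_score_elements(seq_len, batch_size=32, num_heads=16, num_layers=24):
    """Attention-score entries across all layers for one batch (a rough count)."""
    return batch_size * num_heads * num_layers * seq_len * seq_len

base = attention_score_elements(128)
for seq_len in (128, 256, 512):
    n = attention_score_elements(seq_len)
    # Third column is the growth relative to a sequence length of 128.
    print(seq_len, n, n / base)
```

Doubling the sequence length quadruples this count, which is the main reason to try 256 before 512.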

Instead of simply removing tweets longer than our chosen max sequence length, we could train on the long-sequence portion of the dataset during the last tenth of the steps. Further investigation is needed here though.
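As a rough illustration of that idea, the sketch below splits a step budget into a short-sequence phase and a final long-sequence phase. The helper `split_steps` is hypothetical; as far as I understand the BERT README, each phase would also need its own TFRecord file generated with the matching max_seq_length, which this sketch does not do.

```python
def split_steps(total_steps, long_fraction=0.1):
    """Return (short_phase_steps, long_phase_steps) for a two-phase schedule."""
    long_steps = round(total_steps * long_fraction)
    return total_steps - long_steps, long_steps

short_steps, long_steps = split_steps(120000)
print(short_steps, long_steps)  # 108000 12000
```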

In [0]:
#We'll see if we can get away with a max seq length of 256 first
MAX_SEQ_LEN = 256
data['Tweet'] = data['Tweet'].apply(preprocess)

#Note: this counts characters, not WordPiece tokens, so it is only a
#rough proxy for the sequence length BERT will actually see
within_limit = data['Tweet'].str.len() <= MAX_SEQ_LEN
print("There will be", within_limit.sum(), "tweets")

#The tweets are from many different sources and have been grouped together 
#so we'll shuffle the data
data = data[within_limit].sample(frac=1)

There will be 146636 tweets


In [0]:
bucket_dir = 'gs://csc3002'

bert_ckpt_dir = os.path.join(bucket_dir, bert_model_name) 

bert_ckpt_file   = os.path.join(bert_ckpt_dir, "bert_model.ckpt")
bert_config_file = os.path.join(bert_ckpt_dir, "bert_config.json")
vocab_file = os.path.join(bert_ckpt_dir, "vocab.txt")

print("Using BERT checkpoint from:", bert_ckpt_dir)

Using BERT checkpoint from: gs://csc3002/wwm_uncased_L-24_H-1024_A-16


Let's save our data in a text file and put it in the BERT repo directory we cloned

In [0]:
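%cd bert/
tweets = data.Tweet
# Write plain text, one tweet per line, without the index or header
# (to_csv would also write the row index and quote tweets containing commas).
# Empty tweets are skipped because create_pretraining_data.py reads blank
# lines as document boundaries.
with open('./text_file.txt', 'w', encoding='utf-8') as f:
    for tweet in tweets:
        if tweet:
            f.write(tweet + '\n')
%ls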

/content/bert/bert



CONTRIBUTING.md
create_pretraining_data.py
extract_features.py
__init__.py
LICENSE
modeling.py
modeling_test.py
multilingual.md
optimization.py
optimization_test.py
predicting_movie_reviews_with_bert_on_tf_hub.ipynb
README.md
requirements.txt
run_classifier.py
run_classifier_with_tfhub.py
run_pretraining.py
run_squad.py
sample_text.txt
text_file.txt
tokenization.py
tokenization_test.py
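As far as I understand the BERT README, create_pretraining_data.py expects plain text with one sentence per line and an empty line between documents. The sketch below shows one way to write the corpus in that layout, treating each tweet as its own document. The helper `write_corpus` and its `one_document_per_tweet` switch are made up for illustration, and whether tweets should count as separate documents is an open choice that affects next sentence prediction.

```python
import os
import tempfile

def write_corpus(tweets, path, one_document_per_tweet=True):
    """Write tweets as plain text; a blank line after each tweet marks a document end."""
    separator = "\n\n" if one_document_per_tweet else "\n"
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for tweet in tweets:
            tweet = tweet.strip()
            if not tweet:
                continue  # skip tweets that became empty after preprocessing
            f.write(tweet + separator)
            written += 1
    return written

sample = ["first tweet, with a comma", "", "second tweet"]
path = os.path.join(tempfile.mkdtemp(), "text_file.txt")
count = write_corpus(sample, path)
with open(path, encoding="utf-8") as f:
    content = f.read()
print(count)
print(repr(content))
```

Writing the lines directly also avoids CSV quoting, so a tweet containing a comma is stored exactly as it was preprocessed.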


I believe max_seq_length and max_predictions_per_seq must be given exactly the same values in both scripts (create_pretraining_data.py and run_pretraining.py).
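One way to keep the two scripts in sync is to define the shared values once and build the flag string from them. This is only a sketch: `shared_flags` and `to_cli` are hypothetical names, and the only flags included are max_seq_length and max_predictions_per_seq.

```python
import math

MAX_SEQ_LEN = 256
MASKED_LM_PROB = 0.15

# Values that both scripts must agree on, defined in a single place.
shared_flags = {
    "max_seq_length": MAX_SEQ_LEN,
    # The flag must be an integer, so round max_seq_length * masked_lm_prob up.
    "max_predictions_per_seq": math.ceil(MAX_SEQ_LEN * MASKED_LM_PROB),
}

def to_cli(flags):
    """Render a dict of flags as a '--name=value ...' string."""
    return " ".join("--{}={}".format(name, value) for name, value in flags.items())

shared_cli = to_cli(shared_flags)
print(shared_cli)  # --max_seq_length=256 --max_predictions_per_seq=39
```

In a Colab cell the string could then be passed to both commands through IPython's variable expansion, for example `!python create_pretraining_data.py {shared_cli} ...`, instead of typing the numbers twice.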

In [0]:
import math

#It's advised to set max predictions per sequence to around max_seq_length * 0.15.
#The flag must be an integer, so we round up
max_preds_per_seq = math.ceil(MAX_SEQ_LEN * 0.15)
print("max_predictions_per_seq:", max_preds_per_seq)

print("vocab file location: ", vocab_file)

output_file = os.path.join(OUTPUT_DIR, 'pretrainingdata.tfrecord' )
print("output file", output_file)

max_predictions_per_seq: 39
vocab file location:  gs://csc3002/wwm_uncased_L-24_H-1024_A-16/vocab.txt
output file gs://csc3002/wwm_uncased_L-24_H-1024_A-16/further_pretrained_model/pretrainingdata.tfrecord


In [0]:
!python create_pretraining_data.py \
  --input_file='./text_file.txt' \
  --output_file='gs://csc3002/wwm_uncased_L-24_H-1024_A-16/further_pretrained_model/pretrainingdata.tfrecord' \
  --vocab_file='gs://csc3002/wwm_uncased_L-24_H-1024_A-16/vocab.txt' \
  --do_lower_case=True \
  --do_whole_word_mask=True \
  --max_seq_length=256 \
  --max_predictions_per_seq=39 \
  --masked_lm_prob=0.15 \
  --random_seed=3060 \
  --dupe_factor=5

  #The output is a set of tf.train.Examples serialized into TFRecord file format



W0107 20:59:45.717371 139879971035008 module_wrapper.py:139] From create_pretraining_data.py:437: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W0107 20:59:45.717634 139879971035008 module_wrapper.py:139] From create_pretraining_data.py:437: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W0107 20:59:45.717828 139879971035008 module_wrapper.py:139] From /content/bert/tokenization.py:125: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W0107 20:59:47.599260 139879971035008 module_wrapper.py:139] From create_pretraining_data.py:444: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.


W0107 20:59:47.601530 139879971035008 module_wrapper.py:139] From create_pretraining_data.py:446: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:*** Reading from input files ***
I0107 20:59:47.6017

Now we run the main pretraining script, which creates the further pretrained model

In [0]:
target = 32 * 5 * 120000
print(target)

19200000


In [0]:
print("input file:", OUTPUT_DIR )
print("Bert checkpoint file location:", bert_ckpt_file )
print("Bert config file location:", bert_config_file )
print("TPU ADDRESS", TPU_ADDRESS)
# Step counts passed to run_pretraining.py below; warmup is 10% of training
num_train_steps = 120000
num_warmup_steps = num_train_steps // 10

input file: gs://csc3002/wwm_uncased_L-24_H-1024_A-16/further_pretrained_model
Bert checkpoint file location: gs://csc3002/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt
Bert config file location: gs://csc3002/wwm_uncased_L-24_H-1024_A-16/bert_config.json
grpc://10.27.148.2:8470
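To see what the warmup and step counts mean for the learning rate, here is a plain-Python sketch of the schedule as I understand it from optimization.py in the cloned repo: linear warmup from zero, then linear decay to zero over the full run. The function `learning_rate_at` is my own illustration and not part of the repo. The defaults are the values passed to run_pretraining.py in this notebook.

```python
def learning_rate_at(step, init_lr=2e-5, num_train_steps=120000, num_warmup_steps=12000):
    """Approximate learning rate at a given global step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 towards init_lr.
        return init_lr * step / num_warmup_steps
    # Linear decay to 0, measured over the whole run rather than from the end of warmup.
    remaining = 1.0 - min(step, num_train_steps) / num_train_steps
    return init_lr * remaining

for step in (0, 6000, 12000, 60000, 120000):
    print(step, learning_rate_at(step))
```

Under this reading, the rate drops slightly when warmup ends (to about 1.8e-5 at step 12000) because the decay is measured from step 0.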


In [0]:
!python run_pretraining.py \
  --input_file='gs://csc3002/wwm_uncased_L-24_H-1024_A-16/further_pretrained_model/pretrainingdata.tfrecord' \
  --output_dir='gs://csc3002/wwm_uncased_L-24_H-1024_A-16/further_pretrained_model' \
  --do_train=True \
  --do_eval=True \
  --bert_config_file='gs://csc3002/wwm_uncased_L-24_H-1024_A-16/bert_config.json' \
  --init_checkpoint='gs://csc3002/wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt' \
  --train_batch_size=32 \
  --max_seq_length=256 \
  --save_checkpoints_steps=20000 \
  --max_predictions_per_seq=39 \
  --num_train_steps=120000 \
  --num_warmup_steps=12000 \
  --learning_rate=2e-5 \
  --use_tpu=True \
  --tpu_name='grpc://10.27.148.2:8470' \
  --num_tpu_cores=8




W1226 23:23:40.095652 140396150970240 module_wrapper.py:139] From run_pretraining.py:407: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W1226 23:23:40.095869 140396150970240 module_wrapper.py:139] From run_pretraining.py:407: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W1226 23:23:40.096047 140396150970240 module_wrapper.py:139] From /content/bert/bert/bert/bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W1226 23:23:41.688500 140396150970240 module_wrapper.py:139] From run_pretraining.py:414: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.


W1226 23:23:41.967213 140396150970240 module_wrapper.py:139] From run_pretraining.py:418: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.


W1226 23:23:42.119373 140396150970240 module_wrapper.py:139] From run_pretraining.py:420: T