<a href="https://colab.research.google.com/github/digitalepidemiologylab/covid-twitter-bert/blob/master/Finetune_COVID_Twitter_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="right" width="450px" src="https://github.com/digitalepidemiologylab/covid-twitter-bert/raw/master/images/COVID-Twitter-BERT-medium.png">

# Finetuning COVID-Twitter-BERT
This notebook is inspired by [BERT End to End (Fine-tuning + Predicting) with Cloud TPU](https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb).

That notebook also describes in detail how to set up a GCP (Google Cloud Platform) account and a GCS (Google Cloud Storage) bucket.

## Before proceeding
* Create a copy of this notebook by going to "File - Save a Copy in Drive".
* Create the bucket as described in the notebook linked above. You will need full write access to this GCS bucket, and you will need to enter its address in the field below.



## Set Bucket Name
Please provide the name of your private GCS bucket below. You need full access to this bucket, both for storing the data and for storing the models.

In [None]:
BUCKET_NAME = 'gs://' #@param {type:"string"}
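
As a quick sanity check before running the rest of the notebook, the bucket address can be validated and the paths derived from it previewed. The sketch below is not part of the original notebook: `check_bucket_name` and the example bucket `gs://my-ct-bert-bucket` are hypothetical, and the path layout only mirrors how `BUCKET_NAME` and `TASK_NAME` are combined in later cells.

```python
def check_bucket_name(bucket_name):
    """Return the bucket address without a trailing slash, or raise ValueError.

    Hypothetical helper: it only checks the shape of the string,
    not whether the bucket exists or is writable.
    """
    prefix = 'gs://'
    if not bucket_name.startswith(prefix):
        raise ValueError(f"Bucket address must start with '{prefix}': {bucket_name!r}")
    name = bucket_name[len(prefix):].rstrip('/')
    if not name:
        raise ValueError('Bucket address is missing the bucket name after gs://')
    return prefix + name


# Example with a made-up bucket name
example_bucket = check_bucket_name('gs://my-ct-bert-bucket/')
example_paths = {
    'tfhub_cache': f'{example_bucket}/tmp',
    'train_records': f'{example_bucket}/data/SST-2/train.tf_record',
    'runs': f'{example_bucket}/runs',
}
for key, path in example_paths.items():
    print(f'{key}: {path}')
```

Leaving the field at its default value `'gs://'` would pass the `assert BUCKET_NAME` check in the next cell, so a shape check like this one can catch that mistake earlier.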

## Set up your TPU environment
This section performs a few tasks like importing some libraries, authenticating and setting up logging.

It then activates the TPU runtime. 


In [None]:
import os, sys, datetime, time, json, math, logging
import tensorflow as tf
from google.colab import auth
import tensorflow_gcs_config

#Authenticate to be able to store objects in bucket
auth.authenticate_user()

assert BUCKET_NAME

# set up logging
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s [%(levelname)-5.5s] [%(name)-12.12s]: %(message)s')
logger = logging.getLogger(__name__)

# remove duplicate logger
tf_logger = tf.get_logger()
if len(tf_logger.handlers) > 1 :
  tf_logger.handlers.pop()

# set up TPU strategy
logger.info(f'Running tensorflow version {tf.__version__}')
assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please select a TPU under "Runtime - Change runtime type"'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
logger.info(f'TPU address is {TPU_ADDRESS}')
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)

# authenticate cloud bucket
tensorflow_gcs_config.configure_gcs_from_colab_auth()

# set TF hub caching to bucket
os.environ['TFHUB_CACHE_DIR'] = os.path.join(f'{BUCKET_NAME}', 'tmp')

## Clone Repository
The following step clones the CT-BERT repository, which also contains a TensorFlow 2.2 compatible version of the official tensorflow/models. It then installs SentencePiece and imports the necessary dependencies from the repository.





In [None]:
!test -d covid-twitter-bert || git clone https://github.com/digitalepidemiologylab/covid-twitter-bert.git --recursive
%cd covid-twitter-bert

# Needed for tokenization
!pip install sentencepiece

sys.path.append('tensorflow_models')
from official.nlp.bert import bert_models
from official.utils.misc import distribution_utils
from official.nlp.bert import configs as bert_configs
from official.modeling import performance
from official.nlp.bert import input_pipeline
from official.utils.misc import keras_utils
from official.nlp import optimization
from config import PRETRAINED_MODELS

# Download GLUE Dataset
This downloads the SST-2 task from the GLUE benchmark. The data consists of simple TSV (tab-separated values) files. You can examine them and download them from the Colab disk.

You might want to test CT-BERT on your own data. All you need to do is to rewrite this cell. Prepare your annotated data in the format: 

`text \t label \n`

After this is done, split it into a training file (train.tsv) and a development file (dev.tsv) and upload both to /content/DATA_DIR/TASK_NAME/.
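
One possible way to produce these two files is sketched below. It is an illustration rather than part of the original notebook: the `write_tsv_split` helper, the example rows, the 80/20 split and the `sentence`/`label` header row (chosen to resemble the GLUE SST-2 files) are all assumptions, so compare the result with the downloaded SST-2 files before training.

```python
import os
import random
import tempfile


def write_tsv_split(rows, out_dir, dev_fraction=0.2, seed=42):
    """Shuffle (text, label) rows and write train.tsv and dev.tsv to out_dir."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    num_dev = max(1, int(len(rows) * dev_fraction))
    splits = {'dev.tsv': rows[:num_dev], 'train.tsv': rows[num_dev:]}
    os.makedirs(out_dir, exist_ok=True)
    for filename, split_rows in splits.items():
        with open(os.path.join(out_dir, filename), 'w', encoding='utf-8') as f:
            f.write('sentence\tlabel\n')  # assumed header row, as in GLUE SST-2
            for text, label in split_rows:
                # Collapse tabs and newlines so each example stays on one line
                clean_text = ' '.join(text.split())
                f.write(f'{clean_text}\t{label}\n')
    return {name: len(split_rows) for name, split_rows in splits.items()}


# Made-up annotated examples (1 = positive, 0 = negative)
example_rows = [
    ('masks really do help', 1),
    ('this lockdown is exhausting', 0),
    ('so grateful for the nurses', 1),
    ('worst week ever', 0),
    ('finally got my test result back, all good', 1),
]
out_dir = os.path.join(tempfile.mkdtemp(), 'MY-TASK')
counts = write_tsv_split(example_rows, out_dir)
print(counts)
```

The fixed seed keeps the split reproducible between runs. For unbalanced datasets a stratified split may be preferable to the plain shuffle used here.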

In [None]:
TASK_NAME = "SST-2" #This is used as directory names both locally and in the bucket
DATA_DIR = '/content/glue_data' #The actual data is stored in the subfolder called TASK_NAME

!test -f download_glue_data.py || wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
!python download_glue_data.py --tasks SST --data_dir {DATA_DIR}

print(f"The following datafiles are downloaded to {DATA_DIR}/{TASK_NAME}:")
!ls {DATA_DIR}/{TASK_NAME}

# Choose Model 
You will most likely want to leave this as covid-twitter-bert. However, you can change it to compare with other pretrained models. The choice is made at this point because the dataset is prepared for a specific model.

In [None]:
MODEL_CLASS = 'covid-twitter-bert' #@param ["covid-twitter-bert", "bert_large_uncased_wwwm", "bert_large_uncased"]





# Create TFRecord Dataset
This converts your tab-separated files to binary TFRecord files. These files are uploaded to the bucket you created earlier.

In [None]:
assert MODEL_CLASS

#Get path for vocab-file and config file.
VOCAB_FILE = os.path.join('.', 'vocabs', PRETRAINED_MODELS[MODEL_CLASS]['vocab_file'])
CONFIG_FILE = os.path.join('.', 'configs', PRETRAINED_MODELS[MODEL_CLASS]['config'])

!PYTHONPATH=tensorflow_models python tensorflow_models/official/nlp/data/create_finetuning_data.py \
 --input_data_dir={DATA_DIR}/{TASK_NAME}/ \
 --vocab_file={VOCAB_FILE} \
 --train_data_output_path={BUCKET_NAME}/data/{TASK_NAME}/train.tf_record \
 --eval_data_output_path={BUCKET_NAME}/data/{TASK_NAME}/eval.tf_record \
 --meta_data_file_path={BUCKET_NAME}/data/{TASK_NAME}/meta.json \
 --fine_tuning_task_type=classification \
 --max_seq_length=96 \
 --classification_task_name=SST-2 #Even when using your own data, leave this as SST-2. It will tell the script to create a standard classification task dataset

## Set Hyperparameters for Training
The default settings are good in most cases. Adjust the number of epochs to a value between 3 and 10. Smaller and more unbalanced datasets will require more epochs to finetune.

In [None]:
# You might want to change some of these parameters
TRAIN_BATCH_SIZE = 32
EVAL_BATCH_SIZE = 8
PREDICT_BATCH_SIZE = 8
LEARNING_RATE = 2e-5
WARMUP_PROPORTION = 0.1
NUM_EPOCHS = 1
TIME_HISTORY_LOG_STEPS = 10
MAX_SEQ_LENGTH = 96   # CT-BERT is optimised for a sequence length of 96. This is sufficient for Twitter
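
To see how these settings translate into training steps, the arithmetic used later in the notebook can be tried on its own. The dataset sizes below are assumed example values; the notebook reads the real ones from `meta.json`.

```python
import math

# Hyperparameters as set above
TRAIN_BATCH_SIZE = 32
EVAL_BATCH_SIZE = 8
WARMUP_PROPORTION = 0.1
NUM_EPOCHS = 1

# Assumed example sizes (the real values come from meta.json)
train_data_size = 67349
eval_data_size = 872

# Training steps are rounded down, evaluation steps are rounded up
steps_per_epoch = int(train_data_size / TRAIN_BATCH_SIZE)
warmup_steps = int(NUM_EPOCHS * train_data_size * WARMUP_PROPORTION / TRAIN_BATCH_SIZE)
eval_steps = int(math.ceil(eval_data_size / EVAL_BATCH_SIZE))

print(steps_per_epoch, warmup_steps, eval_steps)  # 2104 210 109
```

With these example numbers, rounding down leaves a remainder of 21 examples per epoch (67349 - 2104 * 32), while rounding up on the evaluation side ensures that every evaluation example is covered.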

# Helper functions
Defines a few extra helper functions that are needed for training. Since most users will not need to modify this, the code is hidden by default. You will still have to run it.

In [None]:
#@title
def get_model_config():
    """Reads BERT config"""
    config = bert_configs.BertConfig.from_json_file(CONFIG_FILE)
    return config
  
def get_input_meta_data():
    """Reads meta data file (containing information about finetune data)"""
    with tf.io.gfile.GFile(os.path.join(DATA_DIR, 'meta.json'), 'rb') as reader:
        input_meta_data = json.loads(reader.read().decode('utf-8'))
    return input_meta_data

def get_model(model_config, steps_per_epoch, warmup_steps, num_labels, max_seq_length, is_hub_module=False):
    """Get classifier and core model (used to initialize from checkpoint)"""
    if PRETRAINED_MODELS[MODEL_CLASS]['is_tfhub_model']:
        hub_module_url = f"https://tfhub.dev/{PRETRAINED_MODELS[MODEL_CLASS]['hub_url']}"
        hub_module_trainable = True
    else:
        hub_module_url = None
        hub_module_trainable = False
    # Build BERT classifier model (uses the max_seq_length argument rather than the global)
    classifier_model, core_model = bert_models.classifier_model(
            model_config,
            num_labels,
            max_seq_length,
            hub_module_url=hub_module_url,
            hub_module_trainable=hub_module_trainable)
    # Create optimizer with linear decay learning rate + warmup
    optimizer = optimization.create_optimizer(
            LEARNING_RATE,
            steps_per_epoch * NUM_EPOCHS,
            warmup_steps)
    classifier_model.optimizer = optimizer
    return classifier_model, core_model

def get_loss_fn(num_classes):
    """Gets the classification loss function."""
    def classification_loss_fn(labels, logits):
        """Classification loss."""
        labels = tf.squeeze(labels)
        log_probs = tf.nn.log_softmax(logits, axis=-1)
        one_hot_labels = tf.one_hot(tf.cast(labels, dtype=tf.int32), depth=num_classes, dtype=tf.float32)
        per_example_loss = -tf.reduce_sum(tf.cast(one_hot_labels, dtype=tf.float32) * log_probs, axis=-1)
        return tf.reduce_mean(per_example_loss)
    return classification_loss_fn

def get_dataset_fn(input_file_pattern, max_seq_length, global_batch_size, is_training=True):
  """Gets a closure to create a dataset."""
  def _dataset_fn(ctx=None):
    """Returns tf.data.Dataset for distributed BERT finetuning."""
    batch_size = ctx.get_per_replica_batch_size(
        global_batch_size) if ctx else global_batch_size
    dataset = input_pipeline.create_classifier_dataset(
        input_file_pattern,
        max_seq_length,
        batch_size,
        is_training=is_training,
        input_pipeline_context=ctx)
    return dataset
  return _dataset_fn

def get_metrics():
    return [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]
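
The loss defined in `get_loss_fn` is the usual softmax cross-entropy: the negative log-probability assigned to the correct class, averaged over the batch. A dependency-free sketch of the same computation, with made-up logits and labels, may help when checking values by hand. The function name `softmax_cross_entropy` is hypothetical and not part of the notebook.

```python
import math


def softmax_cross_entropy(logits_batch, labels):
    """Mean negative log-probability of the true class (pure-Python sketch)."""
    losses = []
    for logits, label in zip(logits_batch, labels):
        # Log-softmax, with the maximum subtracted for numerical stability
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        log_probs = [x - log_sum_exp for x in logits]
        # Picking the true-class entry is equivalent to the one-hot multiplication
        losses.append(-log_probs[label])
    return sum(losses) / len(losses)


# Made-up logits for two examples of a two-class task
example_logits = [[2.0, 0.0], [0.0, 0.0]]
example_labels = [0, 1]
loss = softmax_cross_entropy(example_logits, example_labels)
print(round(loss, 4))
```

The first example is classified fairly confidently and contributes a small loss, while the second has equal logits and therefore contributes `log(2)`.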


## Prepare the Model for Training
A few more steps are needed to prepare the model for training, mainly loading and calculating the configs.

At the end the model is loaded and compiled.

In [None]:
from config import PRETRAINED_MODELS

# Get configs
model_config = get_model_config()

# Setting some variables automatically
RUN_NAME = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S_%f')
OUTPUT_DIR = f'{BUCKET_NAME}/runs/{RUN_NAME}'
DATA_DIR = f'{BUCKET_NAME}/data/{TASK_NAME}'
os.environ['TFHUB_CACHE_DIR'] = OUTPUT_DIR

# Setting callbacks
summary_dir = os.path.join(OUTPUT_DIR, 'summaries')
summary_callback = tf.keras.callbacks.TensorBoard(summary_dir, profile_batch=0)
checkpoint_path = os.path.join(OUTPUT_DIR, 'checkpoint')
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_path, save_weights_only=True)
time_history_callback = keras_utils.TimeHistory(
    batch_size=TRAIN_BATCH_SIZE,
    log_steps=TIME_HISTORY_LOG_STEPS,
    logdir=summary_dir)
custom_callbacks = [summary_callback, checkpoint_callback, time_history_callback]

# Meta data/label mapping
input_meta_data = get_input_meta_data()
label_mapping = None
logger.info(f'Loaded training data meta.json file: {input_meta_data}')

# Calculate steps, warmup steps and eval steps
train_data_size = input_meta_data['train_data_size']
num_labels = input_meta_data['num_labels']
max_seq_length = input_meta_data['max_seq_length']
steps_per_epoch = int(train_data_size / TRAIN_BATCH_SIZE)
warmup_steps = int(NUM_EPOCHS * train_data_size * WARMUP_PROPORTION / TRAIN_BATCH_SIZE)
eval_steps = int(math.ceil(input_meta_data['eval_data_size'] / EVAL_BATCH_SIZE))

logger.info(f'Running {NUM_EPOCHS} epochs with {steps_per_epoch:,} steps per epoch')
logger.info(f'Using warmup proportion of {WARMUP_PROPORTION}, resulting in {warmup_steps:,} warmup steps')
logger.info(f'Using learning rate: {LEARNING_RATE}, training batch size: {TRAIN_BATCH_SIZE}, num_epochs: {NUM_EPOCHS}')

# Generate dataset functions
with tpu_strategy.scope():
  train_input_fn = get_dataset_fn(
      os.path.join(DATA_DIR, 'train.tf_record'),
      MAX_SEQ_LENGTH,
      TRAIN_BATCH_SIZE,
      is_training=True)
  eval_input_fn = get_dataset_fn(
      os.path.join(DATA_DIR, 'eval.tf_record'),
      MAX_SEQ_LENGTH,
      EVAL_BATCH_SIZE,
      is_training=False)
 
# Get model
with tpu_strategy.scope():
  classifier_model, core_model = get_model(model_config, steps_per_epoch, warmup_steps, num_labels, max_seq_length)
  optimizer = classifier_model.optimizer
  loss_fn = get_loss_fn(num_labels)
logger.info('The model is loaded!')

#Compile the model
logger.info(f'Compiling keras model...')
with tpu_strategy.scope():
  classifier_model.compile(
      optimizer=optimizer,
      loss=loss_fn,
      metrics=get_metrics())
logger.info(f'The model is compiled')
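
The optimizer created in `get_model` is described in the helper code as using a linear decay learning rate with warmup. The sketch below illustrates one common form of such a schedule in plain Python. It is an assumption about the shape of the curve, not a copy of `optimization.create_optimizer`, and the step counts are example values.

```python
def linear_warmup_decay(step, init_lr, total_steps, warmup_steps):
    """Learning rate at a given step: linear ramp-up, then linear decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup phase: scale the rate up from 0 towards init_lr
        return init_lr * step / warmup_steps
    # Decay phase: the rate falls linearly and reaches 0 at total_steps
    remaining = max(0, total_steps - step)
    return init_lr * remaining / total_steps


# LEARNING_RATE from above; step counts assumed for illustration
init_lr, total_steps, warmup_steps = 2e-5, 2104, 210
for step in (0, 105, 210, 1052, 2104):
    print(step, linear_warmup_decay(step, init_lr, total_steps, warmup_steps))
```

In this variant the decay is measured from step 0, so the rate right after warmup is already slightly below the initial value. Other implementations start the decay only once warmup has finished.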

# Train the Model
Finally we are ready to train the model. How long this takes depends on the size of your dataset and the number of epochs. Typical training times are 10-30 minutes on a TPU.

In [None]:
time_start = time.time()
logger.info('Run training...')
with tpu_strategy.scope():
  history = classifier_model.fit(
      x=train_input_fn(),
      validation_data=eval_input_fn(),
      steps_per_epoch=steps_per_epoch,
      epochs=NUM_EPOCHS,
      validation_steps=eval_steps,
      callbacks=custom_callbacks,
      verbose=1)
time_end = time.time()
training_time_min = (time_end-time_start)/60
logger.info(f'Finished training after {training_time_min:.1f} min')

# show final results
logger.info(f'Final results {history.history}')


# The checkpoints should be stored in your Google bucket
# Let us also save the final model
classifier_model.save(os.path.join(OUTPUT_DIR,"my_classifier_model"))


logger.info(f'You have now finetuned a COVID-Twitter-BERT-model on your dataset. This finetuned model can be found in {OUTPUT_DIR}, and contains these files:')
!gsutil ls {OUTPUT_DIR}

# Prediction
Let's load the trained model to run some inference. Since the model was saved at the end of training, this part of the notebook can be run separately.



## Load the saved model

In [None]:
from official.nlp.bert import tokenization
model_name = "my_classifier_model"
label_mapping = {0: 'negative', 1: 'positive'}  # label names for SST-2

model_dir = os.path.join(OUTPUT_DIR,model_name)
vocab_file = os.path.join(model_dir, "assets",tf.io.gfile.listdir(os.path.join(model_dir,"assets"))[0])

##Load model and weights
classifier_model_saved = tf.keras.models.load_model(model_dir, compile=False)

##Create a tokenizer
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)

max_seq_length = classifier_model_saved.input_shape['input_mask'][1] #Needed for creating examples of correct length
num_labels = len(label_mapping)



## Load the model from a checkpoint
It is also possible to load the model from the latest checkpoint. We then need to first build the model before loading the weights from the checkpoint. Uncomment the code to run it, and change the last cell to use this model for running the predictions. Note that `get_tokenizer` is defined in the helper-function cell further down, so that cell has to be run first.

In [None]:
'''
from official.nlp.bert import tokenization
from official.nlp.bert import configs as bert_configs
assert MODEL_CLASS
label_mapping = {0: 'negative', 1: 'positive'}  # label names for SST-2

VOCAB_PATH = os.path.join('.', 'vocabs', PRETRAINED_MODELS[MODEL_CLASS]['vocab_file'])
CONFIG_FILE = os.path.join('.', 'configs', PRETRAINED_MODELS[MODEL_CLASS]['config'])
checkpoint_path = os.path.join(OUTPUT_DIR, 'checkpoint')  # Points to the previously trained model checkpoint

##Build model
model_config = get_model_config()
classifier_model_checkpoint = tf.keras.models.load_model(model_dir, compile=False)

##Load weights
checkpoint_load_status = classifier_model_checkpoint.load_weights(checkpoint_path)
checkpoint_load_status.expect_partial()

##Create a tokenizer
tokenizer = get_tokenizer(MODEL_CLASS)

max_seq_length = 96 #Needed for creating examples of correct length. Set manually or automatically as above
num_labels = len(label_mapping)
'''



## Predict helper functions

In [None]:
#Only needed when restoring checkpoints
def get_tokenizer(model_class):
    model = PRETRAINED_MODELS[model_class]
    tokenizer = tokenization.FullTokenizer(vocab_file=VOCAB_PATH, do_lower_case=model['lower_case'])
    return tokenizer

#Only needed when restoring checkpoints
def get_model_config():
    """Reads BERT config"""
    config = bert_configs.BertConfig.from_json_file(CONFIG_FILE)
    return config

def create_example(text, tokenizer, max_seq_length):
    tokens = ['[CLS]']
    input_tokenized = tokenizer.tokenize(text)
    if len(input_tokenized) + 2 > max_seq_length:
        # truncate, leaving room for [CLS] and [SEP]
        input_tokenized = input_tokenized[:(max_seq_length - 2)]
    tokens.extend(input_tokenized)
    tokens.append('[SEP]')
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    num_tokens = len(input_ids)
    input_mask = num_tokens * [1]
    # pad
    input_ids += (max_seq_length - num_tokens) * [0]
    input_mask += (max_seq_length - num_tokens) * [0]
    segment_ids = max_seq_length * [0]
    return tf.constant(input_ids, dtype=tf.int32), tf.constant(input_mask, dtype=tf.int32), tf.constant(segment_ids, dtype=tf.int32)

def generate_single_example(text, tokenizer, max_seq_length):
    example = create_example(text, tokenizer, max_seq_length)
    example_features = {
        'input_word_ids': example[0][None, :],
        'input_mask': example[1][None, :],
        'input_type_ids': example[2][None, :]
    }
    return example_features

def format_prediction(preds, label_mapping, label_name):
    preds = tf.nn.softmax(preds, axis=1)
    formatted_preds = []
    for pred in preds.numpy():
        # convert to Python types and sort
        pred = {label: float(probability) for label, probability in zip(label_mapping.values(), pred)}
        pred = {k: v for k, v in sorted(pred.items(), key=lambda item: item[1], reverse=True)}
        formatted_preds.append({label_name: list(pred.keys())[0], f'{label_name}_probabilities': pred})
    return formatted_preds
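
To make the input layout built by `create_example` concrete without TensorFlow or a vocabulary file, here is a sketch with a toy whitespace tokenizer and a made-up vocabulary. `toy_vocab` and `encode_toy` are hypothetical names; the notebook itself uses `tokenization.FullTokenizer`. The text is truncated so that `[CLS]` and `[SEP]` still fit within `max_seq_length`.

```python
# Made-up vocabulary; id 0 is used for padding
toy_vocab = {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3,
             'the': 4, 'colors': 5, 'of': 6, 'rainbow': 7}


def encode_toy(text, max_seq_length):
    """Return (input_ids, input_mask, segment_ids), each of length max_seq_length."""
    # Keep at most max_seq_length - 2 words to leave room for [CLS] and [SEP]
    words = text.lower().split()[:max_seq_length - 2]
    tokens = ['[CLS]'] + words + ['[SEP]']
    input_ids = [toy_vocab.get(token, toy_vocab['[UNK]']) for token in tokens]
    input_mask = [1] * len(input_ids)  # 1 marks real tokens, 0 marks padding
    padding = [0] * (max_seq_length - len(input_ids))
    segment_ids = [0] * max_seq_length  # single-sentence input: all zeros
    return input_ids + padding, input_mask + padding, segment_ids


ids, mask, segments = encode_toy('The colors of the rainbow', max_seq_length=8)
print(ids)   # [2, 4, 5, 6, 4, 7, 3, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 0]
```

All three lists always have exactly `max_seq_length` entries, which is what the model expects as input shape.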



## Run inference
Change the text and rerun the cell

In [None]:
input_text = 'The colors of the rainbow'  #@param {type: "string"}

example = generate_single_example(input_text, tokenizer, max_seq_length)
preds = classifier_model_saved.predict(example)
#preds = classifier_model_checkpoint.predict(example) #Uncomment (and comment out the line above) to use the checkpoint-restored model

formatted_preds = format_prediction(preds, label_mapping, 'sentiment')

print('Logits:')
print(preds)
print('\nProbabilities:')
print(json.dumps(formatted_preds, indent=4))
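
The step from logits to the printed probabilities can also be followed by hand. The sketch below reproduces the idea of `format_prediction` in plain Python for a single example; the logits are made up, and `rank_labels` and `example_label_names` are hypothetical names.

```python
import math


def rank_labels(logits, label_names):
    """Softmax the logits and return (label, probability) pairs, most likely first."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probabilities = [e / total for e in exps]
    pairs = zip(label_names.values(), probabilities)
    return sorted(pairs, key=lambda item: item[1], reverse=True)


example_label_names = {0: 'negative', 1: 'positive'}
ranked = rank_labels([-1.2, 2.3], example_label_names)  # made-up logits
print(ranked[0][0])  # positive
```

As in `format_prediction`, this relies on the order of the label mapping matching the order of the model's output logits.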

##### Copyright 2020 Per Egil Kummervold and Martin Müller
