# CS236 Final Project

# Do Generative Transformers Read Like Bidirectional Models (BERT)?

Last edit: 12-7-2021

By: Amil Merchant

This notebook reproduces the main results of the paper, specifically those for edge probing. Additional functionality can be obtained by modifying the Jiant files, but be careful: all progress is lost once a Colab session is deleted.

Notes:
- This notebook was written and tested on a GPU runtime in Colab. An equivalent setup for a local Jupyter notebook could be built by following the same steps.
- Based on the example notebook from Jiant NLP.

## Install necessary libraries

In [1]:
# Please relaunch the runtime once the installation completes
# Remove any previous checkout; -f avoids an error when the directory does not exist yet
!rm -rf jiant
!git clone --branch mybranch https://github.com/amil5/jiant.git
!cd jiant && pip install -r requirements.txt
!pip install allennlp==0.8.4
!pip install --upgrade google-cloud-storage

Cloning into 'jiant'...
remote: Enumerating objects: 13844, done.
remote: Counting objects: 100% (144/144), done.
remote: Compressing objects: 100% (105/105), done.
remote: Total 13844 (delta 50), reused 90 (delta 33), pack-reused 13700
Receiving objects: 100% (13844/13844), 5.03 MiB | 14.52 MiB/s, done.
Resolving deltas: 100% (9658/9658), done.


In [2]:
import os
import sys
sys.path.insert(0, "/content/jiant")

# Import the installed Jiant library
import jiant.proj.main.tokenize_and_cache as tokenize_and_cache
import jiant.proj.main.export_model as export_model
import jiant.proj.main.scripts.configurator as configurator
import jiant.proj.main.runscript as main_runscript
import jiant.shared.caching as caching
import jiant.utils.python.io as py_io
import jiant.utils.display as display
import json

In [3]:
model_name = 'bert-base-cased' #@param ['bert-base-cased', 'gpt2']

In [4]:
# Example data provided to illustrate the structure of the input:
# Recreated from https://github.com/nyu-mll/jiant/blob/master/examples/notebooks/jiant_EdgeProbing_Example.ipynb

# example = {
#   "text": "The current view is that the chronic inflammation in the distal part of the stomach caused by Helicobacter pylori infection results in an increased acid production from the non-infected upper corpus region of the stomach.",
#   "info": {"id": 7},
#   "targets": [
#     {
#       "label": "Cause-Effect(e2,e1)",
#       "span1": [7,8],
#       "span2": [19, 20],
#       "info": {"comment": ""}
#     }
#   ]
# }
# # Simulate a training set of 1000 examples
# train_data = [example] * 1000
# # Simulate a validation set of 100 examples
# val_data = [example] * 100

# py_io.write_jsonl(
#     data=train_data,
#     path="/content/jiant/content/tasks/data/semeval/train.all.json",
# )
# py_io.write_jsonl(
#     data=val_data,
#     path="/content/jiant/content/tasks/data/semeval/val.jsonl",
# )
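
To make the record layout concrete, the sketch below recovers the text covered by each span in the example above. It assumes that `span1` and `span2` are end-exclusive indices into the whitespace-tokenized `text`. That reading is consistent with this example (the spans then pick out the two entities named in the label) but is not stated in the notebook. `span_text` is a hypothetical helper, not part of Jiant.

```python
# Hypothetical helper (not part of Jiant): show which words each span covers.
# Assumption: spans are [start, end) indices over whitespace-separated tokens.
example = {
    "text": "The current view is that the chronic inflammation in the distal "
            "part of the stomach caused by Helicobacter pylori infection results "
            "in an increased acid production from the non-infected upper corpus "
            "region of the stomach.",
    "info": {"id": 7},
    "targets": [
        {
            "label": "Cause-Effect(e2,e1)",
            "span1": [7, 8],
            "span2": [19, 20],
            "info": {"comment": ""},
        }
    ],
}


def span_text(text, span):
    """Return the whitespace tokens of `text` covered by the end-exclusive `span`."""
    tokens = text.split()
    start, end = span
    return " ".join(tokens[start:end])


target = example["targets"][0]
span1 = span_text(example["text"], target["span1"])
span2 = span_text(example["text"], target["span2"])
print(target["label"], "| e1:", span1, "| e2:", span2)
```

Under this assumption, the first span is the effect and the second is the cause, matching the `Cause-Effect(e2,e1)` label.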

## Utility for uploading files to Colab

Please ensure that the data files have either been placed in the appropriate locations or uploaded via this utility.

This code path blocks until a file is uploaded; comment it out if you are supplying the data in any other way.

In [None]:
# from google.colab import files
# uploaded = files.upload()
# for fn in uploaded.keys():
#   print('User uploaded file "{name}" with length {length} bytes'.format(
#       name=fn, length=len(uploaded[fn])))

## Configure data paths

In [6]:
os.makedirs("/content/tasks/configs/", exist_ok=True)
os.makedirs("/content/tasks/data/semeval", exist_ok=True)
os.makedirs("/content/tasks/data/dep", exist_ok=True)

## Ensure that the path to the Relations data is correct

In [7]:
# Configure the SemEval-2010 Task 8 relation classification task
py_io.write_json({
  "task": "semeval",
  "paths": {
    "train": "/content/jiant/content/tasks/data/semeval/train.all.json",
    "val": "/content/jiant/content/tasks/data/semeval/test.json",
  },
  "name": "semeval"
}, "/content/tasks/configs/semeval_config.json")

In [8]:
# Recreate the smaller version of the test set used in our experiments.
# This is commented out because the smaller version is provided in the GitHub repository,
# but it can easily be re-created from the full test file:
# !head -250 en_ewt-ud-test.json > en_ewt-ud-test-small.json 
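
If a shell is not available, the same truncation can be sketched in Python. This assumes, as the `head -250` command implies, that the file holds one record per line. `write_head` is a hypothetical helper name; the file names in the usage comment are the ones from the command above.

```python
from itertools import islice


def write_head(src_path, dst_path, num_lines=250):
    """Copy the first `num_lines` lines of src_path to dst_path; return how many were written."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        lines = list(islice(src, num_lines))
        dst.writelines(lines)
    return len(lines)


# Usage (the source file must already be present in the working directory):
# write_head("en_ewt-ud-test.json", "en_ewt-ud-test-small.json", 250)
```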

## Ensure that the path to the Dependencies data is correct

In [9]:
# Due to training limitations, we use the dev set for training and the smaller 
# test set for evaluation
py_io.write_json({
  "task": "dep",
  "paths": {
    "train": "/content/en_ewt-ud-dev.json",
    "val": "/content/en_ewt-ud-test-small.json",
  },
  "name": "dep"
}, "/content/tasks/configs/dep_config.json")
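
Since a wrong path only surfaces later, during tokenization, it may help to check the configured files up front. The following is a minimal sketch using only the standard library; `missing_data_paths` is a hypothetical helper, and the dictionary simply repeats the `dep` configuration written above.

```python
import os


def missing_data_paths(task_config):
    """Return the (phase, path) pairs from task_config["paths"] that are not files on disk."""
    return [
        (phase, path)
        for phase, path in task_config["paths"].items()
        if not os.path.isfile(path)
    ]


dep_config = {
    "task": "dep",
    "paths": {
        "train": "/content/en_ewt-ud-dev.json",
        "val": "/content/en_ewt-ud-test-small.json",
    },
    "name": "dep",
}

for phase, path in missing_data_paths(dep_config):
    print(f"Missing {phase} file: {path}")
```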

In [10]:
# Download the desired model (i.e. BERT or GPT-2)
export_model.export_model(
    hf_pretrained_model_name_or_path=f"{model_name}",
    output_base_path=f"./models/{model_name}",
)

Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

In [11]:
%%capture

# For a given task, tokenize and cache the required data
# so that the text does not have to be re-processed every time it is fed to the model
task_name = "semeval" #@param ['semeval', 'dep']
tokenize_and_cache.main(tokenize_and_cache.RunConfiguration(
    task_config_path=f"./tasks/configs/{task_name}_config.json",
    hf_pretrained_model_name_or_path=f"{model_name}",
    output_dir=f"./cache/{task_name}",
    phases=["train", "val"],
))

In [12]:
# Examine a row of the input
row = caching.ChunkedFilesDataCache(f"./cache/{task_name}/train").load_chunk(0)[0]["data_row"]
print(row.input_ids)
print(row.tokens)
print(row.tokens[row.spans[0][0]: row.spans[0][1]+1])
print(row.tokens[row.spans[1][0]: row.spans[1][1]+1])

[  101  1109  1449  1112  1758  1807  1144  1157  4459  4048  1107  1126
  9245  1174  9566  1104 14843  3050   119   102     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0]
['[CLS]', 'The', 'system', 'as', 'described', 'above', 'has', 'its', 'greatest', 'application', 'in', 'an', 'array', '##ed', 'configuration', 'of', 'antenna', 'elements', '.', '[SEP]']
['configuration']
['elements', '.']
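
Note that the cached spans are sliced with `+1` above, i.e. they are treated as inclusive on both ends and index into the wordpiece tokens (including `[CLS]`). The sketch below replays that slicing on the printed row. The span values are inferred from the printed output rather than read from the cache, and `span_tokens` is a hypothetical helper.

```python
# Tokens copied from the printed row above.
tokens = ['[CLS]', 'The', 'system', 'as', 'described', 'above', 'has', 'its',
          'greatest', 'application', 'in', 'an', 'array', '##ed',
          'configuration', 'of', 'antenna', 'elements', '.', '[SEP]']

# Assumption: inferred from the two printed slices, not loaded from the cache.
spans = [[14, 14], [17, 18]]


def span_tokens(tokens, span):
    """Return the tokens covered by an inclusive [start, end] span."""
    start, end = span
    return tokens[start:end + 1]


for span in spans:
    print(span, span_tokens(tokens, span))
```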

In [13]:
# Configuration for the edge probing run
jiant_run_config = configurator.SimpleAPIMultiTaskConfigurator(
    task_config_base_path="./tasks/configs",
    task_cache_base_path="./cache",
    train_task_name_list=[task_name],
    val_task_name_list=[task_name],
    train_batch_size=8,
    eval_batch_size=16,
    epochs=3,
    num_gpus=1,
).create_config()
os.makedirs("./run_configs/", exist_ok=True)
py_io.write_json(jiant_run_config, f"./run_configs/{task_name}_run_config.json")
display.show_json(jiant_run_config)

{
  "task_config_path_dict": {
    "semeval": "./tasks/configs/semeval_config.json"
  },
  "task_cache_config_dict": {
    "semeval": {
      "train": "./cache/semeval/train",
      "val": "./cache/semeval/val",
      "val_labels": "./cache/semeval/val_labels"
    }
  },
  "sampler_config": {
    "sampler_type": "ProportionalMultiTaskSampler"
  },
  "global_train_config": {
    "max_steps": 3000,
    "warmup_steps": 300
  },
  "task_specific_configs_dict": {
    "semeval": {
      "train_batch_size": 8,
      "eval_batch_size": 16,
      "gradient_accumulation_steps": 1,
      "eval_subset_num": 500
    }
  },
  "taskmodels_config": {
    "task_to_taskmodel_map": {
      "semeval": "semeval"
    },
    "taskmodel_config_map": {
      "semeval": null
    }
  },
  "task_run_config": {
    "train_task_list": [
      "semeval"
    ],
    "train_val_task_list": [
      "semeval"
    ],
    "val_task_list": [
      "semeval"
    ],
    "test_task_list": []
  },
  "metric_aggregator_config": 
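
The `max_steps` and `warmup_steps` values above follow from the `epochs` and `train_batch_size` arguments. The arithmetic below is a sketch under two assumptions that the notebook does not state: that the SemEval training file holds 8,000 examples, and that the configurator uses one step per batch with a 10% warmup. Both are consistent with the printed numbers. `training_schedule` is a hypothetical helper, not a Jiant function.

```python
import math


def training_schedule(num_examples, batch_size, epochs, warmup_proportion=0.1):
    """Return (max_steps, warmup_steps) for a fixed number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    max_steps = epochs * steps_per_epoch
    warmup_steps = round(max_steps * warmup_proportion)
    return max_steps, warmup_steps


# Assumed training-set size: 8000 examples; batch size and epochs as configured above.
max_steps, warmup_steps = training_schedule(num_examples=8000, batch_size=8, epochs=3)
print(max_steps, warmup_steps)
```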

In [14]:
# Run the edge probe
run_args = main_runscript.RunConfiguration(
    jiant_task_container_config_path=f"./run_configs/{task_name}_run_config.json",
    output_dir=f"./runs/{task_name}",
    hf_pretrained_model_name_or_path=f"{model_name}",
    model_path=f"./models/{model_name}/model/model.p",
    model_config_path=f"./models/{model_name}/model/config.json",
    learning_rate=1e-4,
    eval_every_steps=500,
    do_train=True,
    do_val=True,
    do_save=True,
    force_overwrite=True,
)
main_runscript.run_loop(run_args)

  jiant_task_container_config_path: ./run_configs/semeval_run_config.json
  output_dir: ./runs/semeval
  hf_pretrained_model_name_or_path: bert-base-cased
  model_path: ./models/bert-base-cased/model/model.p
  model_config_path: ./models/bert-base-cased/model/config.json
  model_load_mode: from_transformers
  do_train: True
  do_val: True
  do_save: True
  do_save_last: False
  do_save_best: False
  write_val_preds: False
  write_test_preds: False
  eval_every_steps: 500
  save_every_steps: 0
  save_checkpoint_every_steps: 0
  no_improvements_for_n_evals: 0
  keep_checkpoint_when_done: False
  force_overwrite: True
  seed: -1
  learning_rate: 0.0001
  adam_epsilon: 1e-08
  max_grad_norm: 1.0
  optimizer_type: adam
  no_cuda: False
  fp16: False
  fp16_opt_level: O1
  local_rank: -1
  server_ip: 
  server_port: 
device: cuda n_gpu: 1, distributed training: False, 16-bits training: False
Using seed: 1503425537
{
  "jiant_task_container_config_path": "./run_configs/semeval_run_config.json

  "The following weights were not loaded: {}".format(remainder_weights_dict.keys())


No optimizer decay for:
  encoder.embeddings.LayerNorm.weight
  encoder.embeddings.LayerNorm.bias
  encoder.encoder.layer.0.attention.self.query.bias
  encoder.encoder.layer.0.attention.self.key.bias
  encoder.encoder.layer.0.attention.self.value.bias
  encoder.encoder.layer.0.attention.output.dense.bias
  encoder.encoder.layer.0.attention.output.LayerNorm.weight
  encoder.encoder.layer.0.attention.output.LayerNorm.bias
  encoder.encoder.layer.0.intermediate.dense.bias
  encoder.encoder.layer.0.output.dense.bias
  encoder.encoder.layer.0.output.LayerNorm.weight
  encoder.encoder.layer.0.output.LayerNorm.bias
  encoder.encoder.layer.1.attention.self.query.bias
  encoder.encoder.layer.1.attention.self.key.bias
  encoder.encoder.layer.1.attention.self.value.bias
  encoder.encoder.layer.1.attention.output.dense.bias
  encoder.encoder.layer.1.attention.output.LayerNorm.weight
  encoder.encoder.layer.1.attention.output.LayerNorm.bias
  encoder.encoder.layer.1.intermediate.dense.bias
  encode

Loading Best

{
  "aggregated": 0.9083235162823498,
  "semeval": {
    "loss": 0.05574246243957211,
    "metrics": {
      "major": 0.9083235162823498,
      "minor": {
        "acc": 0.9831857892799721,
        "f1_micro": 0.8334612432847275,
        "acc_and_f1_micro": 0.9083235162823498
      }
    }
  }
}
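
The `major` score reported above equals `acc_and_f1_micro`, which, judging by the printed numbers, is the unweighted mean of `acc` and `f1_micro`. A quick check using only the values from this output:

```python
import math

# Values copied from the evaluation output above.
metrics = {
    "acc": 0.9831857892799721,
    "f1_micro": 0.8334612432847275,
    "acc_and_f1_micro": 0.9083235162823498,
}

# Unweighted mean of accuracy and micro-averaged F1.
combined = (metrics["acc"] + metrics["f1_micro"]) / 2
print(combined, math.isclose(combined, metrics["acc_and_f1_micro"], rel_tol=1e-12))
```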
