<a href="https://colab.research.google.com/github/WGLab/mutformer/blob/main/mutformer_finetuning/mutformer_finetuning_benchmark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Finetuning Script

This notebook performs finetuning with varying models, batch sizes, and sequence lengths in order to find the best model. 

# Configure settings

In [22]:
#@markdown ## General Config
#@markdown If preferred, a GCP TPU/runtime can be used to run this notebook (instructions below)
USE_GCP_RUNTIME = False #@param {type:"boolean"}
#@markdown Which task to perform: options are "MRPC" for paired sequence method, "MRPC_w_ex_data" for paired sequence method with external data, "RE" for single sequence method, or "NER" for single sequence per residue prediction (you can add more modes by editing the model code files downloaded from github, but if you add more modes make sure to change the corresponding code segments)
MODE = "MRPC_w_ex_data" #@param {type:"string"}
#@markdown Whether or not external data is being used
USING_EX_DATA = True #@param {type:"boolean"}
#@markdown If using external data, how many pieces of external data are included in total
PRED_NUM =   27#@param {type:"integer"}
MAX_SEQ_LENGTH =  1024#@param {type:"integer"}
BUCKET_NAME = "theodore_jiang" #@param {type:"string"}
#@markdown Folder for where to save the finetuned model (if you want to save into GCS directly, leave this entry blank)
OUTPUT_MODEL_DIR = "bert_model_mrpc_adding_preds" #@param {type:"string"}
if OUTPUT_MODEL_DIR=="":
  OUTPUT_MODEL_DIR = ":::::"
#@markdown Folder in GCS where the pretrained model needs to be loaded from (if saved directly into GCS, leave blank) (the specific model folders will be specified later)
INIT_MODEL_DIR = "" #@param {type:"string"}
if INIT_MODEL_DIR=="":
  INIT_MODEL_DIR = ":::::"
#@markdown Where in GCS the data needs to be loaded from (should be the same as the DATA_DIR variable in the data generation script)
DATA_DIR = "MRPC_w_preds_all_loaded" #@param {type:"string"}
#@markdown Which folder to store the logs in (the LOGGING_DIR variable can be the same across all finetuning notebooks)
LOGGING_DIR = "mrpc_loss_spam_model_comparison_final" #@param {type:"string"}

#@markdown ### Training procedure config
INIT_LEARNING_RATE =  1e-5 #@param {type:"number"}
END_LEARNING_RATE = 5e-7 #@param {type:"number"}
#@markdown Save a checkpoint every this amount of steps
SAVE_CHECKPOINTS_STEPS =  1000 #@param {type:"integer"}
#@markdown ###### TPUEstimator will keep this number of checkpoints; older checkpoints will all be deleted
KEEP_N_CHECKPOINTS_AT_A_TIME =  10#@param {type:"integer"}
#@markdown If using colab, NUM_TPU_CORES is 8
NUM_TPU_CORES = 8 #@param {type:"number"}
#@markdown How many sequences should the model train on before stopping
PLANNED_TOTAL_SEQUENCES_SEEN =  2e5 #@param {type:"number"}
#@markdown How many steps should the model train for before stopping (number of total sequences seen will depend on the batch size used). NOTE: PLANNED_TOTAL_STEPS will override PLANNED_TOTAL_SEQUENCES_SEEN; if using PLANNED_TOTAL_SEQUENCES_SEEN, set PLANNED_TOTAL_STEPS to -1 (PLANNED_TOTAL_STEPS will then be computed from the train batch size used, which can be specified later)
PLANNED_TOTAL_STEPS = 8000 #@param {type:"number"}
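
The interaction between PLANNED_TOTAL_STEPS, PLANNED_TOTAL_SEQUENCES_SEEN, and the two learning rates can be sketched as follows. This is only an illustration that mirrors the expressions used in the training cells further down; `resolve_schedule` is a hypothetical helper and not part of the MutFormer code, and how the model code applies the per-step decay is not shown in this notebook.

```python
def resolve_schedule(init_lr, end_lr, planned_total_steps,
                     planned_total_sequences_seen, batch_size):
    """Return (total_steps, decay_per_step) for the given settings.

    Hypothetical helper for illustration only.
    """
    if planned_total_steps != -1:
        # An explicit step count overrides the sequence budget.
        total_steps = int(planned_total_steps)
    else:
        # Otherwise derive the step count from the sequence budget.
        total_steps = int(planned_total_sequences_seen // batch_size)
    # Per-step change that moves the learning rate from init_lr to end_lr
    # over total_steps steps (negative when end_lr < init_lr).
    decay_per_step = (end_lr - init_lr) / total_steps
    return total_steps, decay_per_step


# Values from the config cell above, with a batch size of 32.
print(resolve_schedule(1e-5, 5e-7, 8000, 2e5, 32))
# Same settings, but with PLANNED_TOTAL_STEPS set to -1.
print(resolve_schedule(1e-5, 5e-7, -1, 2e5, 32))
```

With the defaults above the step count stays at 8000; with PLANNED_TOTAL_STEPS set to -1 and a batch size of 32 it becomes 6250.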


#If running on a GCP runtime, follow these instructions to set it up:

###1) Create a VM from the GCP website
###2) Open a command prompt on your computer and perform the following steps:
To ssh into the VM:

```
gcloud beta compute ssh --zone <COMPUTE ZONE> <VM NAME> --project <PROJECT NAME> -- -L 8888:localhost:8888
```

Note: Make sure the port above matches the port below (in this case it's 8888)
\
\
Run each of these commands individually, or copy and paste the one command below:
```
sudo apt-get update
sudo apt-get -y install python3 python3-pip
sudo apt-get install pkg-config
sudo apt-get install libhdf5-serial-dev
sudo apt-get install libffi6 libffi-dev
sudo -H pip3 install jupyter tensorflow==1.14 google-api-python-client tqdm
sudo -H pip3 install jupyter_http_over_ws
jupyter serverextension enable --py jupyter_http_over_ws
jupyter notebook   --NotebookApp.allow_origin='https://colab.research.google.com'   --port=8888   --NotebookApp.port_retries=0   --no-browser
```
One command:
```
sudo apt-get update ; sudo apt-get -y install python3 python3-pip ; sudo apt-get install pkg-config ; sudo apt-get -y install libhdf5-serial-dev ; sudo apt-get install libffi6 libffi-dev; sudo -H pip3 install jupyter tensorflow==1.14 google-api-python-client tqdm ; sudo -H pip3 install jupyter_http_over_ws ; jupyter serverextension enable --py jupyter_http_over_ws ; jupyter notebook   --NotebookApp.allow_origin='https://colab.research.google.com'   --port=8888   --NotebookApp.port_retries=0   --no-browser
```
###3) In this notebook, to connect to this runtime, click the "connect to local runtime" option under the connect button, and copy and paste the outputted link with "localhost: ..."
###4) Finally, run this code segment, which creates a TPU


In [None]:
GCE_PROJECT_NAME = "genome-project-319100" #@param {type:"string"}
TPU_ZONE = "us-central1-f" #@param {type:"string"}
TPU_NAME = "mutformer-tpu" #@param {type:"string"}

!gcloud alpha compute tpus create $TPU_NAME --accelerator-type=tpu-v2 --version=1.15.5 --zone=$TPU_ZONE ##create new TPU

!gsutil iam ch serviceAccount:`gcloud alpha compute tpus describe $TPU_NAME | grep serviceAccount | cut -d' ' -f2`:admin gs://$BUCKET_NAME && echo 'Successfully set permissions!' ##give TPU access to the GCS bucket set in BUCKET_NAME

#Clone the MutFormer repo

In [17]:
if USE_GCP_RUNTIME:
  !sudo apt-get -y install git
#@markdown ######Where to clone the repo into (only value that it can't be is "mutformer"):
REPO_DESTINATION_PATH = "code/mutformer" #@param {type:"string"}
import os,shutil
if not os.path.exists(REPO_DESTINATION_PATH):
  os.makedirs(REPO_DESTINATION_PATH)
else:
  shutil.rmtree(REPO_DESTINATION_PATH)
  os.makedirs(REPO_DESTINATION_PATH)
cmd = "git clone https://github.com/WGLab/mutformer.git \"" + REPO_DESTINATION_PATH + "\""
!{cmd}

Cloning into 'code/mutformer'...
remote: Enumerating objects: 462, done.
remote: Counting objects: 100% (263/263), done.
remote: Compressing objects: 100% (229/229), done.
remote: Total 462 (delta 188), reused 38 (delta 33), pack-reused 199
Receiving objects: 100% (462/462), 2.08 MiB | 12.34 MiB/s, done.
Resolving deltas: 100% (299/299), done.


#Imports

In [18]:
if not USE_GCP_RUNTIME:
  %tensorflow_version 1.x
  from google.colab import auth
  print("Authorize for GCS:")
  auth.authenticate_user()
  print("Authorize done")

import sys
import json
import random
import logging
import tensorflow as tf
import time
import importlib

if not os.path.exists("mutformer"):
  shutil.copytree(REPO_DESTINATION_PATH+"/mutformer_model_code","mutformer")
else:
  shutil.rmtree("mutformer")
  shutil.copytree(REPO_DESTINATION_PATH+"/mutformer_model_code","mutformer")
if "mutformer" in sys.path:
  sys.path.remove("mutformer")
sys.path.append("mutformer")

from mutformer import modeling, optimization, tokenization,run_classifier,run_ner_for_pathogenic
from mutformer.modeling import BertModel,BertModelModified
from mutformer.run_classifier import MrpcProcessor,REProcessor,MrpcWithPredsProcessor ##change this part if you add more modes--
from mutformer.run_ner_for_pathogenic import NERProcessor      ##--

##reload modules in case that's needed
modules2reload = [modeling, 
                  optimization, 
                  tokenization,
                  run_classifier,
                  run_ner_for_pathogenic]
for module in modules2reload:
    importlib.reload(module)

# configure logging
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

log.handlers = []

formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')

#@markdown ###### Whether or not to write logs to a file
DO_FILE_LOGGING = True #@param {type:"boolean"}
if DO_FILE_LOGGING:
  #@markdown ###### If using file logging, what path to write logs to
  FILE_LOGGING_PATH = 'file_logging/spam.log' #@param {type:"string"}
  if not os.path.exists("/".join(FILE_LOGGING_PATH.split("/")[:-1])):
    os.makedirs("/".join(FILE_LOGGING_PATH.split("/")[:-1]))
  fh = logging.FileHandler(FILE_LOGGING_PATH)
  fh.setLevel(logging.INFO)
  fh.setFormatter(formatter)
  log.addHandler(fh)

# create the console handler and add it to the logger
# (the file handler, if enabled, was already added above)
ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
ch.setFormatter(formatter)
log.addHandler(ch)

if 'COLAB_TPU_ADDR' in os.environ:
  log.info("Using TPU runtime")
  TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']

  with tf.Session(TPU_ADDRESS) as session:
    log.info('TPU address is ' + TPU_ADDRESS)
    # Upload credentials to TPU.
    with open('/content/adc.json', 'r') as f:
      auth_info = json.load(f)
    tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
    
else:
  log.warning('Not connected to TPU runtime')


if MODE=="MRPC": ##change this part if you added more modes
  processor = MrpcProcessor()
  script = run_classifier
elif MODE=="MRPC_w_ex_data":
  processor = MrpcWithPredsProcessor()
  script = run_classifier
elif MODE=="RE":
  processor = REProcessor()
  script = run_classifier
elif MODE=="NER":
  processor = NERProcessor()
  script = run_ner_for_pathogenic
else:
  raise Exception("The mode specified was not one of the available modes: [\"MRPC\",\"MRPC_w_ex_data\",\"RE\",\"NER\"].")
label_list = processor.get_labels()

2021-12-21 07:28:19,208 - tensorflow - INFO - Using TPU runtime
2021-12-21 07:28:19,211 - tensorflow - INFO - TPU address is grpc://10.87.144.74:8470


Authorize for GCS:
Authorize done


#Select preference for communication with eval script/Mount drive if necessary

In [19]:
import os
import shutil

#@markdown ###### Note: for all of these, if using USE_GCP_RUNTIME, all of these parameters must use GCS, because a GCP TPU can't access google drive
#@markdown \
DRIVE_PATH = "/content/drive/My Drive"
BUCKET_PATH = "gs://{}".format(BUCKET_NAME)
#@markdown Whether to use GCS for communicating with the evaluation script (to inform it of which model to evaluate during parallel training/eval); if set to False, communication will default to drive. If using USE_GCP_RUNTIME, communication will always use GCS
GCS_COMS = True #@param {type:"boolean"}

COMS_PATH = BUCKET_PATH if GCS_COMS or USE_GCP_RUNTIME else DRIVE_PATH

if COMS_PATH==DRIVE_PATH:
  from google.colab import drive,auth
  !fusermount -u /content/drive
  drive.flush_and_unmount()
  drive.mount('/content/drive', force_remount=True)
  


# Run Finetuning

The following section runs finetuning tests that compare the performance of different models under different parameters.

###General definitions

In [20]:
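def latest_checkpoint(dir):
  ##fallback used when tf.train.latest_checkpoint fails: list the folder with
  ##gsutil and return the checkpoint prefix with the highest step number
  import re
  cmd = "gsutil ls "+dir
  files = !{cmd}
  steps = {}
  for file in files:
    match = re.search(r"model\.ckpt-(\d+)", file)
    if match:
      steps[int(match.group(1))] = file[:match.end()]
  return steps[max(steps)] if steps else None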

def correct_path(path):
  return path.replace("/:::::","")

def training_loop(BATCH_SIZE,
                  RESUMING,
                  PLANNED_TOTAL_STEPS,
                  DECAY_PER_STEP,
                  DATA_SEQ_LENGTH,
                  MODEL_NAME,
                  MODEL,
                  INIT_CHECKPOINT_DIR,
                  BERT_GCS_DIR,
                  DATA_GCS_DIR,
                  USING_SHARDS,
                  START_SHARD,
                  USING_EX_DATA,
                  PRED_NUM,
                  GCS_LOGGING_DIR,
                  CONFIG_FILE):
  
  RESTORE_CHECKPOINT = None if not RESUMING else tf.train.latest_checkpoint(BERT_GCS_DIR)
  if not RESUMING:
    cmd = "gsutil -m rm -r "+BERT_GCS_DIR
    !{cmd}

  try: 
    INIT_CHECKPOINT = tf.train.latest_checkpoint(INIT_CHECKPOINT_DIR)
  except:
    INIT_CHECKPOINT = latest_checkpoint(INIT_CHECKPOINT_DIR)
  print("init checkpoint:",INIT_CHECKPOINT,"restore/save checkpoint:",RESTORE_CHECKPOINT)

  config = modeling.BertConfig.from_json_file(CONFIG_FILE)
  config.hidden_dropout_prob = 0.1
  config.attention_probs_dropout_prob = 0.1

  model_fn = script.model_fn_builder(
      bert_config=config,
      logging_dir=GCS_LOGGING_DIR,
      num_labels=len(label_list),
      init_checkpoint=INIT_CHECKPOINT,
      restore_checkpoint=RESTORE_CHECKPOINT,
      init_learning_rate=INIT_LEARNING_RATE,
      decay_per_step=DECAY_PER_STEP,
      num_warmup_steps=10,
      use_tpu=True,
      use_one_hot_embeddings=True,
      bert=MODEL,
      weight_decay=0.01,
      epsilon=1e-6,
      clip_grads=False,
      using_preds=USING_EX_DATA)

  tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)

  run_config = tf.contrib.tpu.RunConfig(
      cluster=tpu_cluster_resolver,
      model_dir=BERT_GCS_DIR,
      save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
      keep_checkpoint_max=KEEP_N_CHECKPOINTS_AT_A_TIME,
      tpu_config=tf.contrib.tpu.TPUConfig(
          iterations_per_loop=SAVE_CHECKPOINTS_STEPS,
          num_shards=NUM_TPU_CORES,
          per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

  estimator = tf.contrib.tpu.TPUEstimator(
      use_tpu=True,
      model_fn=model_fn,
      config=run_config,
      train_batch_size=BATCH_SIZE)
  
  train_file_name = "train.tf_record"
  train_file = os.path.join(DATA_GCS_DIR, train_file_name)

  if USING_SHARDS:
    shards_folder = DATA_GCS_DIR
    input_file = os.path.join(DATA_GCS_DIR, train_file_name)
    import re
    file_name = input_file.split("/")[-1]
    shards = [shards_folder + "/" + file for file in tf.io.gfile.listdir(shards_folder) if
              re.match(re.escape(file_name) + r"_\d+", file)]
    shards = sorted(shards,key=lambda shard:int(shard.split("_")[-1]))[START_SHARD:]
  else:
    shards = [train_file]

  if USING_SHARDS:
    print("\nUSING SHARDs:")
    for shard in shards:
      print(shard)
    print("\n")

  tf.logging.info("***** Running training *****")
  tf.logging.info("  Batch size = %d", BATCH_SIZE)
  for n,shard in enumerate(shards):
      train_input_fn = script.file_based_input_fn_builder(
          input_file=shard,
          seq_length=DATA_SEQ_LENGTH,
          is_training=True,
          drop_remainder=True,
          pred_num=PRED_NUM if USING_EX_DATA else None)
      try:
        tf.gfile.Open(COMS_PATH+"/finetuning_run_paired_model.txt","w+").write(MODEL_NAME)
        tf.gfile.Open(COMS_PATH+"/finetuning_run_paired_seq_length.txt","w+").write(str(DATA_SEQ_LENGTH))
        tf.gfile.Open(COMS_PATH+"/finetuning_run_paired_batch_size.txt","w+").write(str(BATCH_SIZE))
      except:
        pass
      estimator.train(input_fn=train_input_fn, max_steps=PLANNED_TOTAL_STEPS)
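
The shard handling inside `training_loop` can be tried out without TensorFlow or GCS. The sketch below reproduces the same filter and sort on a plain list of file names; `select_shards` and the example listing are hypothetical and only stand in for the result of `tf.io.gfile.listdir`.

```python
import re


def select_shards(file_names, base_name="train.tf_record", start_shard=0):
    """Keep files named <base_name>_<number>, sort them numerically,
    and drop the first start_shard entries (hypothetical helper)."""
    pattern = re.compile(re.escape(base_name) + r"_\d+")
    shards = [name for name in file_names if pattern.match(name)]
    # Sort by the integer suffix so that shard 10 comes after shard 2.
    shards.sort(key=lambda name: int(name.split("_")[-1]))
    return shards[start_shard:]


# Made-up folder listing, for illustration only.
listing = ["train.tf_record_10", "train.tf_record_2",
           "eval.tf_record", "train.tf_record_0"]
print(select_shards(listing))
# ['train.tf_record_0', 'train.tf_record_2', 'train.tf_record_10']
print(select_shards(listing, start_shard=1))
# ['train.tf_record_2', 'train.tf_record_10']
```

Note that START_SHARD is a position in the sorted list rather than a shard number: if shard files are missing from the folder, the two no longer line up.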



###Training Loops

Following are three code segments to run. These options are:
1. Model/sequence length: different model architectures will be tested using a fixed batch size on data of varying sequence lengths \
2. Sequence length/batch size: one model architecture will be tested using varying batch sizes on data of varying sequence lengths\
3. One model: one model architecture will be tested using a fixed batch size on a fixed set of data of a given sequence length

Note: During training, evaluation results on the training dataset will be written into GCS. To view these results, use the colab notebook titled "mutformer processing and viewing finetuning results."
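
To see which runs options 1 and 2 produce, and under which folder names their checkpoints and logs end up, the naming scheme used in the cells below can be reproduced in isolation. This is a sketch only: `run_suffix` is a hypothetical helper, and the lists are the default values from the form fields below.

```python
from itertools import product


def run_suffix(model_name, seq_length, batch_size=None):
    """Build the folder suffix used for a run (hypothetical helper)."""
    suffix = "mn_" + model_name + "_sl_" + str(seq_length)
    if batch_size is not None:
        # Only the batch size/sequence length option adds this part.
        suffix += "_bs_" + str(batch_size)
    return suffix


models = ["bert_model_modified_medium", "bert_model_modified_large"]
lengths = [256, 512, 1024]

# Option 1: outer loop over sequence lengths, inner loop over models.
option1 = [run_suffix(m, sl) for sl, m in product(lengths, models)]

# Option 2: outer loop over sequence lengths, inner loop over batch sizes.
option2 = [run_suffix("bert_model_modified_large", sl, bs)
           for sl, bs in product([1024], [64])]

for name in option1 + option2:
    print(name)
```

Option 3 trains a single run and writes directly into OUTPUT_MODEL_DIR and LOGGING_DIR/RUN_NAME without a suffix.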

####Model/sequence length

In [None]:
#@markdown Train batch size to use
BATCH_SIZE=16 #@param
#@markdown List of folder names for models to test
MODELS = ["bert_model_modified_medium","bert_model_modified_large"] #@param
#@markdown List of model architectures for the variable "MODELS" defined in the entry above. NOTE: each position in this list must correspond correctly with each position in "MODELS." BertModel indicates the original BERT, BertModelModified indicates MutFormer's architecture
MODEL_ARCHITECTURES = [BertModelModified,BertModelModified] #@param
#@markdown List of maximum sequence lengths to test
lengths = [256,512,1024] #@param
#@markdown Which folder inside of LOGGING_DIR to store the logs in
RUN_NAME = "MRPC_adding_preds_mn_sl_testing" #@param {type:"string"}
#@markdown Whether or not to resume training from a previous finetuned checkpoint; if no, always train from pretrained model
RESUMING = False #@param {type:"boolean"}
#@markdown whether or not training data was generated in shards (for really large databases)
USING_SHARDS = True #@param {type:"boolean"}
#@markdown if using shards, which shard index to start at (default 0 for first shard)
START_SHARD =   0#@param {type:"integer"}

##cast to int because PLANNED_TOTAL_SEQUENCES_SEEN is a float (e.g. 2e5)
PLANNED_TOTAL_STEPS = PLANNED_TOTAL_STEPS if PLANNED_TOTAL_STEPS != -1 else int(PLANNED_TOTAL_SEQUENCES_SEEN//BATCH_SIZE)
DECAY_PER_STEP = (END_LEARNING_RATE-INIT_LEARNING_RATE)/PLANNED_TOTAL_STEPS

for DATA_SEQ_LENGTH in lengths:
  for m,MODEL_NAME in enumerate(MODELS):
    print("\n\n\nMODEL NAME:",MODEL_NAME,
          "\nINPUT MAX SEQ LENGTH:",DATA_SEQ_LENGTH,
          "\nTRAIN_BATCH_SIZE:",BATCH_SIZE,"\n\n\n")

    MODEL = MODEL_ARCHITECTURES[m]
    INIT_CHECKPOINT_DIR = correct_path(BUCKET_PATH+"/"+INIT_MODEL_DIR+"/"+MODEL_NAME)
    BERT_GCS_DIR = correct_path(BUCKET_PATH+"/"+OUTPUT_MODEL_DIR+"/mn_"+MODEL_NAME+"_sl_"+str(DATA_SEQ_LENGTH))
    DATA_GCS_DIR = BUCKET_PATH+"/"+DATA_DIR+"/"+str(DATA_SEQ_LENGTH)
    
    GCS_LOGGING_DIR = BUCKET_PATH+"/"+LOGGING_DIR+"/"+RUN_NAME+"/mn_"+MODEL_NAME+"_sl_"+str(DATA_SEQ_LENGTH)

    CONFIG_FILE = correct_path(BUCKET_PATH+"/"+INIT_MODEL_DIR+"/"+MODEL_NAME+"/config.json")

    training_loop(BATCH_SIZE,
                  RESUMING,
                  PLANNED_TOTAL_STEPS,
                  DECAY_PER_STEP,
                  DATA_SEQ_LENGTH,
                  MODEL_NAME,
                  MODEL,
                  INIT_CHECKPOINT_DIR,
                  BERT_GCS_DIR,
                  DATA_GCS_DIR,
                  USING_SHARDS,
                  START_SHARD,
                  USING_EX_DATA,
                  PRED_NUM,
                  GCS_LOGGING_DIR,
                  CONFIG_FILE)
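
The `:::::` placeholder set in the config cell is what lets a blank folder entry resolve to the bucket root: the path is joined first and the placeholder segment is stripped afterwards by `correct_path`. A minimal sketch of that mechanism, where `build_init_checkpoint_dir` and the bucket name `my_bucket` are hypothetical:

```python
PLACEHOLDER = ":::::"


def correct_path(path):
    # Same behavior as the helper defined under "General definitions".
    return path.replace("/" + PLACEHOLDER, "")


def build_init_checkpoint_dir(bucket_name, init_model_dir, model_name):
    """Join the path the way the training cells do (hypothetical helper)."""
    folder = init_model_dir if init_model_dir != "" else PLACEHOLDER
    return correct_path("gs://" + bucket_name + "/" + folder + "/" + model_name)


print(build_init_checkpoint_dir("my_bucket", "", "bert_model_modified_large"))
# gs://my_bucket/bert_model_modified_large
print(build_init_checkpoint_dir("my_bucket", "pretrained", "bert_model_modified_large"))
# gs://my_bucket/pretrained/bert_model_modified_large
```

In the cells of this notebook only the init checkpoint, output model, and config paths go through `correct_path`; DATA_DIR and LOGGING_DIR are joined directly, so those two entries should not be left blank.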
  
  

####Batch size/sequence length

In [None]:
#@markdown list of batch sizes to test
batch_sizes = [64] #@param
#@markdown list of maximum sequence lengths to test
lengths = [1024] #@param
#@markdown model to load from inside the specified INIT_MODEL_DIR
MODEL_NAME="bert_model_modified_large" #@param {type:"string"}
#@markdown model architecture to use BertModel indicates the original BERT, BertModelModified indicates MutFormer's architecture
MODEL_ARCHITECTURE = BertModelModified #@param
#@markdown Which folder inside of LOGGING_DIR to store the logs in
RUN_NAME = "MRPC_adding_preds_bs_sl_testing" #@param {type:"string"}
#@markdown whether or not to resume training from a previous finetuned checkpoint; if no, always train from pretrained model
RESUMING = False #@param {type:"boolean"}
#@markdown whether or not training data was generated in shards (for really large databases)
USING_SHARDS = True #@param {type:"boolean"}
#@markdown if using shards, which shard index to start at (default 0 for first shard)
START_SHARD =   0#@param {type:"integer"}

BUCKET_PATH = "gs://"+BUCKET_NAME
for DATA_SEQ_LENGTH in lengths:
    for BATCH_SIZE in batch_sizes:
        print("\n\n\nMODEL NAME:",MODEL_NAME,
              "\nINPUT MAX SEQ LENGTH:",DATA_SEQ_LENGTH,
              "\nTRAIN_BATCH_SIZE:",BATCH_SIZE,"\n\n\n")

        ##the step count and the learning rate decay depend on the batch size,
        ##so they are computed for each run inside the loop
        TOTAL_STEPS = PLANNED_TOTAL_STEPS if PLANNED_TOTAL_STEPS != -1 else int(PLANNED_TOTAL_SEQUENCES_SEEN//BATCH_SIZE)
        DECAY_PER_STEP = (END_LEARNING_RATE-INIT_LEARNING_RATE)/TOTAL_STEPS

        MODEL = MODEL_ARCHITECTURE
        INIT_CHECKPOINT_DIR = correct_path(BUCKET_PATH+"/"+INIT_MODEL_DIR+"/"+MODEL_NAME)
        BERT_GCS_DIR = correct_path(BUCKET_PATH+"/"+OUTPUT_MODEL_DIR+"/mn_"+MODEL_NAME+"_sl_"+str(DATA_SEQ_LENGTH)+"_bs_"+str(BATCH_SIZE))
        DATA_GCS_DIR = BUCKET_PATH+"/"+DATA_DIR+"/"+str(DATA_SEQ_LENGTH)

        GCS_LOGGING_DIR = BUCKET_PATH+"/"+LOGGING_DIR+"/"+RUN_NAME+"/mn_"+MODEL_NAME+"_sl_"+str(DATA_SEQ_LENGTH)+"_bs_"+str(BATCH_SIZE)

        CONFIG_FILE = correct_path(BUCKET_PATH+"/"+INIT_MODEL_DIR+"/"+MODEL_NAME+"/config.json")

        training_loop(BATCH_SIZE,
                      RESUMING,
                      TOTAL_STEPS,
                      DECAY_PER_STEP,
                      DATA_SEQ_LENGTH,
                      MODEL_NAME,
                      MODEL,
                      INIT_CHECKPOINT_DIR,
                      BERT_GCS_DIR,
                      DATA_GCS_DIR,
                      USING_SHARDS,
                      START_SHARD,
                      USING_EX_DATA,
                      PRED_NUM,
                      GCS_LOGGING_DIR,
                      CONFIG_FILE)

####One model

In [None]:
#@markdown batch size to use
BATCH_SIZE = 32 #@param
#@markdown maximum sequence length to use
DATA_SEQ_LENGTH = 512 #@param
#@markdown model to load from inside the specified INIT_MODEL_DIR
MODEL_NAME="bert_model_modified_large" #@param {type:"string"}
#@markdown model architecture to use BertModel indicates the original BERT, BertModelModified indicates MutFormer's architecture
MODEL_ARCHITECTURE = BertModelModified #@param
#@markdown Which folder inside of LOGGING_DIR to store the logs in
RUN_NAME = "MRPC_adding_preds_w_mutformer12L" #@param {type:"string"}
#@markdown whether or not to resume training from a previous finetuned checkpoint; if no, always train from pretrained model
RESUMING = False #@param {type:"boolean"}
#@markdown whether or not training data was generated in shards (for really large databases)
USING_SHARDS = False #@param {type:"boolean"}
#@markdown if using shards, which shard index to start at (default 0 for first shard)
START_SHARD =   0#@param {type:"integer"}


##cast to int because PLANNED_TOTAL_SEQUENCES_SEEN is a float (e.g. 2e5)
PLANNED_TOTAL_STEPS = PLANNED_TOTAL_STEPS if PLANNED_TOTAL_STEPS != -1 else int(PLANNED_TOTAL_SEQUENCES_SEEN//BATCH_SIZE)
DECAY_PER_STEP = (END_LEARNING_RATE-INIT_LEARNING_RATE)/PLANNED_TOTAL_STEPS


print("\n\n\nMODEL NAME:",MODEL_NAME,
      "\nINPUT MAX SEQ LENGTH:",DATA_SEQ_LENGTH,
      "\nTRAIN_BATCH_SIZE:",BATCH_SIZE,"\n\n\n")

MODEL = MODEL_ARCHITECTURE
INIT_CHECKPOINT_DIR = correct_path(BUCKET_PATH+"/"+INIT_MODEL_DIR+"/"+MODEL_NAME)
BERT_GCS_DIR = correct_path(BUCKET_PATH+"/"+OUTPUT_MODEL_DIR)
DATA_GCS_DIR = BUCKET_PATH+"/"+DATA_DIR+"/"+str(DATA_SEQ_LENGTH)

GCS_LOGGING_DIR = BUCKET_PATH+"/"+LOGGING_DIR+"/"+RUN_NAME

CONFIG_FILE = correct_path(BUCKET_PATH+"/"+INIT_MODEL_DIR+"/"+MODEL_NAME+"/config.json")


training_loop(BATCH_SIZE,
              RESUMING,
              PLANNED_TOTAL_STEPS,
              DECAY_PER_STEP,
              DATA_SEQ_LENGTH,
              MODEL_NAME,
              MODEL,
              INIT_CHECKPOINT_DIR,
              BERT_GCS_DIR,
              DATA_GCS_DIR,
              USING_SHARDS,
              START_SHARD,
              USING_EX_DATA,
              PRED_NUM,
              GCS_LOGGING_DIR,
              CONFIG_FILE)




MODEL NAME: bert_model_modified_large 
INPUT MAX SEQ LENGTH: 512 
TRAIN_BATCH_SIZE: 32 



CommandException: 1 files/objects could not be removed.
init checkpoint: gs://theodore_jiang/bert_model_modified_large/model.ckpt-2002192 restore/save checkpont: None


2021-12-21 07:30:05,524 - tensorflow - INFO - Using config: {'_model_dir': 'gs://theodore_jiang/bert_model_mrpc_adding_preds', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.87.144.74:8470"
    }
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 10, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fefadcee8d0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.87.144.74:8470', '_evaluation_master': 'grpc://10.87.14

step 1


Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Instructions for updating:
Use keras.layers.Dense instead.
Instructions for updating:
Please use `layer.__call__` method instead.



shape1 (4, 768)
shape2 (4, 768)


2021-12-21 07:30:10,606 - tensorflow - INFO - **** Trainable Variables ****
2021-12-21 07:30:10,609 - tensorflow - INFO -   name = bert/embeddings/word_embeddings:0, shape = (27, 768), *INIT_FROM_CKPT*
2021-12-21 07:30:10,610 - tensorflow - INFO -   name = bert/embeddings/token_type_embeddings:0, shape = (2, 768), *INIT_FROM_CKPT*
2021-12-21 07:30:10,612 - tensorflow - INFO -   name = bert/embeddings/position_embeddings:0, shape = (1024, 768), *INIT_FROM_CKPT*
2021-12-21 07:30:10,614 - tensorflow - INFO -   name = bert/embeddings/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
2021-12-21 07:30:10,616 - tensorflow - INFO -   name = bert/embeddings/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
2021-12-21 07:30:10,617 - tensorflow - INFO -   name = bert/embeddings/conv1d/kernel:0, shape = (3, 768, 768), *INIT_FROM_CKPT*
2021-12-21 07:30:10,619 - tensorflow - INFO -   name = bert/embeddings/conv1d/bias:0, shape = (768,), *INIT_FROM_CKPT*
2021-12-21 07:30:10,621 - tensorflow - INFO

(4, 2) (4, 2) (4, 1)
acctot: Tensor("Sum_13:0", shape=(), dtype=float32)



2021-12-21 07:30:22,796 - tensorflow - INFO - Create CheckpointSaverHook.
2021-12-21 07:30:23,149 - tensorflow - INFO - Done calling model_fn.
2021-12-21 07:30:26,113 - tensorflow - INFO - TPU job name worker
2021-12-21 07:30:27,636 - tensorflow - INFO - Graph was finalized.
2021-12-21 07:30:33,198 - tensorflow - INFO - Restoring parameters from gs://theodore_jiang/bert_model_modified_large/model.ckpt-2002192
2021-12-21 07:30:45,436 - tensorflow - INFO - Running local_init_op.
2021-12-21 07:30:45,891 - tensorflow - INFO - Done running local_init_op.
2021-12-21 07:30:54,147 - tensorflow - INFO - Saving checkpoints for 0 into gs://theodore_jiang/bert_model_mrpc_adding_preds/model.ckpt.
Instructions for updating:
Prefer Variable.assign which has equivalent behavior in 2.X.
2021-12-21 07:31:11,312 - tensorflow - INFO - Initialized dataset iterators in 0 seconds
2021-12-21 07:31:11,313 - tensorflow - INFO - Installing graceful shutdown hook.
2021-12-21 07:31:11,319 - tensorflow - INFO - Cr