#Pretraining Script

This script pretrains a transformer model on protein sequences.

Note: If using a TPU from Google Cloud (not the Colab TPU), run this notebook on a VM with access to all GCP APIs, and make sure TPUs are enabled for the GCP project.

Note: To train multiple models in parallel, run a separate copy of this notebook on each of several VMs.

#Downgrade Python and Tensorflow 

(The default Python version in Colab does not support TensorFlow 1.15.)

* **Note** that because the Python interpreter used in this notebook is not on the default path, syntax highlighting most likely will not function.

####1. First, download and install Python version 3.7:

In [None]:
!wget -O mini.sh https://repo.anaconda.com/miniconda/Miniconda3-py37_22.11.1-1-Linux-x86_64.sh
!chmod +x mini.sh
!bash ./mini.sh -b -f -p /usr/local
!conda install -q -y jupyter
!conda install -q -y google-colab -c conda-forge
!python -m ipykernel install --name "py37" --user

####2. Then, reload the web page (do not restart the runtime) so that Colab recognizes the newly installed Python
####3. Finally, run the following command to install TensorFlow 1.15:

In [None]:
!python3 -m pip install tensorflow==1.15
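
Before running the install, it can help to confirm that the kernel is actually using the Miniconda interpreter rather than Colab's default. The following is a minimal sketch, assuming TensorFlow 1.15 wheels for Python 3 are only available for versions 3.5 through 3.7; the helper name `supports_tf115` is made up for illustration and is not part of the notebook.

```python
import sys


def supports_tf115(version_info=sys.version_info):
    """Return True if the interpreter is in the assumed 3.5-3.7 range."""
    major, minor = version_info[0], version_info[1]
    return major == 3 and 5 <= minor <= 7


if __name__ == "__main__":
    # Report the running interpreter and whether the install is expected to work.
    print("Python", sys.version.split()[0],
          "-> TensorFlow 1.15 installable:", supports_tf115())
```

If this reports `False` after step 2, the page reload most likely did not pick up the new kernel.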

# Configure settings

In [None]:
#@markdown ## General Config
#@markdown If preferred, a GCP TPU/runtime can be used to run this notebook (instructions below):
GCP_RUNTIME = False #@param {type:"boolean"}
#@markdown How many TPU cores the TPU has (if using Colab, NUM_TPU_CORES is 8):
NUM_TPU_CORES = 8 #@param {type:"number"}
#@markdown Name of the GCS bucket to use (Make sure to set this to the name of your own GCS  bucket):
BUCKET_NAME = "" #@param {type:"string"}
BUCKET_PATH = "gs://"+BUCKET_NAME
#@markdown ## IO Config
OUTPUT_MODEL_DIR = "bert_model_embedded_mutformer_12L" #@param {type:"string"}
#@markdown Folder in GCS where data was stored:
DATA_DIR = "pretraining_data_1024_embedded_mutformer" #@param {type:"string"}
LOGGING_DIR = "mutformer2_0_pretraining_logs" #@param {type:"string"}
RUN_NAME = "bert_model_embedded_mutformer_12L" #@param {type:"string"}


#### Vocabulary for the model (MutFormer uses the vocabulary below).
#### [PAD], [UNK], [CLS], [SEP], and [MASK] are required default tokens;
#### B and J mark the beginning and end of a protein sequence,
#### respectively; the rest are the possible amino acids, ranked
#### approximately by frequency of occurrence in the human population.
#### vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
vocab = "\n".join("[PAD] [UNK] [CLS] [SEP] [MASK] L S B J E A P T G V K R D Q I N F H Y C M W".split(" "))
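
To make the vocabulary layout concrete, here is a minimal sketch of how such a newline-separated vocabulary could map tokens to integer ids. It assumes ids follow line order, as in BERT-style vocab files, and that a sequence is wrapped in the `B` and `J` markers; the `encode` helper is hypothetical, and how the repository's `tokenization` module actually does this is not shown here.

```python
vocab = "\n".join("[PAD] [UNK] [CLS] [SEP] [MASK] L S B J E A P T G V K R D Q I N F H Y C M W".split(" "))

# Assumption: each token's id is its line index in the vocabulary.
token_to_id = {token: idx for idx, token in enumerate(vocab.split("\n"))}


def encode(sequence):
    """Wrap a protein sequence in B/J markers and map each residue to an id.

    Residues missing from the vocabulary fall back to the [UNK] id.
    """
    tokens = ["B"] + list(sequence) + ["J"]
    return [token_to_id.get(tok, token_to_id["[UNK]"]) for tok in tokens]


if __name__ == "__main__":
    print("vocabulary size:", len(token_to_id))
    print("encoded MKT:", encode("MKT"))
```

The vocabulary size computed this way is the same quantity the notebook later stores as `vocab_size` in the model config.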

#If running on a GCP TPU, use these commands prior to running this notebook

To ssh into the VM:

```
gcloud beta compute ssh --zone <COMPUTE ZONE> <VM NAME> --project <PROJECT NAME> -- -L 8888:localhost:8888
```

Make sure the port above matches the port below (in this case it's 8888)

```
sudo apt-get update
sudo apt-get -y install python3 python3-pip
sudo apt-get install pkg-config
sudo apt-get install libhdf5-serial-dev
sudo apt-get install libffi6 libffi-dev
sudo -H pip3 install jupyter tensorflow==1.14 google-api-python-client tqdm
sudo -H pip3 install jupyter_http_over_ws
jupyter serverextension enable --py jupyter_http_over_ws
jupyter notebook   --NotebookApp.allow_origin='https://colab.research.google.com'   --port=8888   --NotebookApp.port_retries=0   --no-browser

(one command):sudo apt-get update ; sudo apt-get -y install python3 python3-pip ; sudo apt-get install pkg-config ; sudo apt-get -y install libhdf5-serial-dev ; sudo apt-get install libffi6 libffi-dev; sudo -H pip3 install jupyter tensorflow==1.14 google-api-python-client tqdm ; sudo -H pip3 install jupyter_http_over_ws ; jupyter serverextension enable --py jupyter_http_over_ws ; jupyter notebook   --NotebookApp.allow_origin='https://colab.research.google.com'   --port=8888   --NotebookApp.port_retries=0   --no-browser
```
Then copy the printed link containing "localhost: ..." and paste it into Colab's "Connect to a local runtime" option.


####Also run this code segment, which creates a TPU

In [None]:
GCE_PROJECT_NAME = "" #@param {type:"string"}
TPU_ZONE = "us-central1-f" #@param {type:"string"}
TPU_NAME = "mutformer-tpu" #@param {type:"string"}

!gcloud alpha compute tpus create $TPU_NAME --accelerator-type=tpu-v2 --version=1.15.5 --zone=$TPU_ZONE ##create new TPU

!gsutil iam ch serviceAccount:`gcloud alpha compute tpus describe $TPU_NAME | grep serviceAccount | cut -d' ' -f2`:admin $BUCKET_PATH && echo 'Successfully set permissions!' ##give TPU access to GCS

#Clone the repo

In [None]:
if GCP_RUNTIME:
  !sudo apt-get -y install git
#@markdown ######Where to clone the repo into (only value that it can't be is "mutformer"):
REPO_DESTINATION_PATH = "code/mutformer" #@param {type:"string"}
import os,shutil
if not os.path.exists(REPO_DESTINATION_PATH):
  os.makedirs(REPO_DESTINATION_PATH)
else:
  shutil.rmtree(REPO_DESTINATION_PATH)
  os.makedirs(REPO_DESTINATION_PATH)
cmd = "git clone https://github.com/WGLab/mutformer.git \"" + REPO_DESTINATION_PATH + "\""
!{cmd}

Cloning into 'code/mutformer'...
remote: Enumerating objects: 1506, done.
remote: Counting objects: 100% (386/386), done.
remote: Compressing objects: 100% (197/197), done.
remote: Total 1506 (delta 265), reused 257 (delta 184), pack-reused 1120
Receiving objects: 100% (1506/1506), 6.02 MiB | 17.60 MiB/s, done.
Resolving deltas: 100% (1054/1054), done.


#Imports/Authenticate for GCP

In [None]:
if not GCP_RUNTIME:
  def authenticate_user(): ##authentication function that uses link authentication instead of popup
    if os.path.exists("/content/.config/application_default_credentials.json"): 
      return
    print("Authorize for runtime GCS:")
    !gcloud auth login --no-launch-browser
    print("Authorize for TPU GCS:")
    !gcloud auth application-default login  --no-launch-browser
  authenticate_user()

import sys
import json
import random
import logging
import tensorflow as tf
import time
import os
import shutil
import importlib

if REPO_DESTINATION_PATH == "mutformer":
  shutil.copytree(REPO_DESTINATION_PATH,"mutformer_code")
  REPO_DESTINATION_PATH = "mutformer_code"
if not os.path.exists("mutformer"):
  shutil.copytree(REPO_DESTINATION_PATH+"/mutformer_model_code","mutformer")
else:
  shutil.rmtree("mutformer")
  shutil.copytree(REPO_DESTINATION_PATH+"/mutformer_model_code","mutformer")
if "mutformer" in sys.path:
  sys.path.remove("mutformer")
sys.path.append("mutformer")

from mutformer import modeling, optimization, tokenization, run_pretraining

##reload modules so that you don't need to restart the runtime to reload modules in case that's needed
modules2reload = [modeling, 
                  optimization, 
                  tokenization,
                  run_pretraining]
for module in modules2reload:
    importlib.reload(module)

from modeling import *

##configure logging
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

log.handlers = []

formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')

#@markdown Whether or not to write logs to a file
DO_FILE_LOGGING = True #@param {type:"boolean"}
if DO_FILE_LOGGING:
  #@markdown If using file logging, what path to write logs to
  FILE_LOGGING_PATH = 'file_logging/spam.log' #@param {type:"string"}
  if not os.path.exists("/".join(FILE_LOGGING_PATH.split("/")[:-1])):
    os.makedirs("/".join(FILE_LOGGING_PATH.split("/")[:-1]))
  fh = logging.FileHandler(FILE_LOGGING_PATH)
  fh.setLevel(logging.INFO)
  fh.setFormatter(formatter)
  log.addHandler(fh)

ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
ch.setFormatter(formatter)
log.addHandler(ch)

if GCP_RUNTIME:
  tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_NAME, zone=TPU_ZONE, project=GCE_PROJECT_NAME)
  TPU_ADDRESS = tpu_cluster_resolver.get_master()
  with tf.Session(TPU_ADDRESS) as session:
      log.info('TPU address is ' + TPU_ADDRESS)
      # Upload credentials to TPU.
      tf.contrib.cloud.configure_gcs(session)
else:
  if 'COLAB_TPU_ADDR' in os.environ:
    log.info("Using TPU runtime")
    TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']

    with tf.Session(TPU_ADDRESS) as session:
      log.info('TPU address is ' + TPU_ADDRESS)
      # Upload credentials to TPU.
      with tf.gfile.Open("/content/.config/application_default_credentials.json", 'r') as f:
        auth_info = json.load(f)
      tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
      
  else:
    raise Exception('Not connected to TPU runtime, TPU required to run mutformer')


#Auto-detect the number of sequences per epoch

In [None]:
#@markdown If GCP_RUNTIME is false and the data was stored in Drive, set this to the folder where the original data was stored (it should match the "INPUT_DATA_FOLDER" variable in the data generation script). It is used to detect the number of steps per epoch while limiting interaction with GCS; it can also be left blank, in which case steps are detected automatically from the tfrecords stored in GCS.
#@markdown 
#@markdown Note: if data was originally stored in GCS or GCP_RUNTIME is true, leave this item blank and steps per epoch will be autodetected from tfrecords:
ORIG_DATA_FOLDER = "" #@param {type: "string"}

if not GCP_RUNTIME and "/content/drive" in ORIG_DATA_FOLDER:
  from google.colab import drive
  !fusermount -u /content/drive
  drive.flush_and_unmount()
  drive.mount('/content/drive', force_remount=True)
  DRIVE_PATH = "/content/drive/My Drive"

  data_path_train = ORIG_DATA_FOLDER+"/train.txt" 

  lines = tf.gfile.Open(data_path_train).read().split("\n")
  SEQUENCES_PER_EPOCH = len(lines)

  print("sequences per epoch:",SEQUENCES_PER_EPOCH)
else:
  from tqdm import tqdm
  def steps_getter(input_files):
    tot_sequences = 0
    for input_file in input_files:
      print("reading:",input_file)

      d = tf.data.TFRecordDataset(input_file)

      with tf.Session() as sess:
        tot_sequences+=sess.run(d.reduce(0, lambda x,_: x+1))

    return tot_sequences

  BUCKET_PATH = "gs://{}".format(BUCKET_NAME)
  got_data = False
  while not got_data: ##tries each data bin once; the raise at the end of the loop body fires if none is readable
    try:
      for f in tf.io.gfile.listdir(BUCKET_PATH+"/"+DATA_DIR+"/train"): ##try to access any of the data bins
          print("trying to access training data from saved copy number "+str(f))
          DATA_GCS_DIR = BUCKET_PATH+"/"+DATA_DIR+"/train/"+str(f)
          train_input_files = tf.gfile.Glob(os.path.join(DATA_GCS_DIR,'*tfrecord'))
          print("Using:",train_input_files)
          if len(train_input_files)>0:
            got_data = True
            try:
              SEQUENCES_PER_EPOCH = steps_getter(train_input_files)
              print("sequences per epoch:",SEQUENCES_PER_EPOCH)
              if not SEQUENCES_PER_EPOCH:
                for file in train_input_files:
                  tf.io.gfile.remove(file)
                raise
              break
            except:
              got_data=False
    except:
      pass
    if got_data:
      break
    raise Exception("Could not find data, wait for data generation to create another epoch of data and try again.")
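
The Drive branch above counts every element of `split("\n")`, which includes an empty string when the file ends with a newline. The following is a minimal sketch of a count that skips blank lines and of the conversion to steps per epoch, assuming one sequence per line in `train.txt`; both helper names are made up for illustration.

```python
def count_sequences(text):
    """Count non-empty lines so a trailing newline adds no phantom sequence."""
    return sum(1 for line in text.split("\n") if line.strip())


def steps_per_epoch(num_sequences, batch_size):
    """Whole batches per epoch; a final partial batch is not counted."""
    return num_sequences // batch_size


if __name__ == "__main__":
    sample = "MKTAYIAK\nMSLLTEVE\nMADEEKLP\n"
    print("naive count:", len(sample.split("\n")))
    print("non-empty count:", count_sequences(sample))
    print("steps per epoch:", steps_per_epoch(1000, 64))
```

For a large training file the off-by-one is negligible, but it can matter when testing the pipeline on a handful of sequences.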

# Run Training

Run the pretraining loop (should run this in parallel with the dynamic masking data generation loop).
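
The training cell derives its step budget, per-step learning-rate decay, and per-pass batch size from the settings it defines. The following is a minimal sketch of that arithmetic using the cell's default values. It assumes the optimizer applies `decay_per_step` linearly (how `run_pretraining` actually consumes it is not shown here) and that the per-pass batch must divide evenly across TPU cores; all three function names are hypothetical.

```python
def plan_training(total_sequences, batch_size, init_lr, end_lr, total_steps=-1):
    """Return (total_steps, decay_per_step); -1 means derive steps from sequences."""
    if total_steps == -1:
        total_steps = total_sequences / batch_size
    decay_per_step = (end_lr - init_lr) / total_steps
    return total_steps, decay_per_step


def learning_rate_at(step, init_lr, decay_per_step):
    """Assumed linear schedule: the (negative) decay is added once per step."""
    return init_lr + decay_per_step * step


def split_batch(batch_size, grad_accum_multiplier, num_cores):
    """Return (batch per forward/backward pass, examples per TPU core)."""
    if batch_size % grad_accum_multiplier:
        raise ValueError("batch size must be divisible by the accumulation multiplier")
    micro_batch = batch_size // grad_accum_multiplier
    # Assumption: the per-pass batch is sharded evenly across the cores.
    if micro_batch % num_cores:
        raise ValueError("per-pass batch must be divisible by the number of cores")
    return micro_batch, micro_batch // num_cores


if __name__ == "__main__":
    total_steps, decay = plan_training(1e9, 64, 2e-5, 1e-9)
    print("planned steps:", total_steps)
    print("learning rate at the last step:", learning_rate_at(total_steps, 2e-5, decay))
    print("per-pass batch, per-core batch:", split_batch(64, 2, 8))
```

With the defaults, a batch of 64 and a multiplier of 2 give 32 sequences per pass, or 4 per core on an 8-core TPU.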

In [None]:
#@markdown ## Model Config:
#@markdown Model architecture to use (BertModel indicates the original BERT, BertModelModified indicates MutFormer's architecture without integrated convs, MutFormer_embedded_convs indicates MutFormer with integrated convolutions):
MODEL_ARCHITECTURE = MutFormer_embedded_convs #@param
#@markdown Maximum sequence length the model should be able to handle (the internal attention mechanisms and embeddings will be created to only account for sequences up to this length) (larger maximum sequence length will take more memory and time to train):
model_max_seq_length = 1024 #@param
#@markdown Other miscellaneous config entries:
hidden_size =   768 #@param {type:"integer"}
num_hidden_layers =   12#@param {type:"integer"}
tf_variables_intializer_value_stdev = 0.02 #@param {type:"number"}
hidden_layers_dropout_probability = 0.1 #@param {type:"number"}
intermediate_size = 3072 #@param {type:"integer"}
self_attention_dropout_probability = 0.1 #@param {type:"number"}


bert_config = {
  "hidden_size": hidden_size,
  "hidden_act": "gelu",
  "initializer_range": tf_variables_intializer_value_stdev,
  "hidden_dropout_prob": hidden_layers_dropout_probability,
  "num_attention_heads": num_hidden_layers, ##note: the head count is tied to the layer count here (12 by default)
  "type_vocab_size": 2, 
  "max_position_embeddings": model_max_seq_length, 
  "num_hidden_layers": num_hidden_layers, 
  "intermediate_size": intermediate_size, 
  "attention_probs_dropout_prob": self_attention_dropout_probability
}

##upload config
bert_config["vocab_size"] = len(vocab.split("\n"))

if not os.path.exists(OUTPUT_MODEL_DIR):
  os.makedirs(OUTPUT_MODEL_DIR)
with tf.gfile.Open(OUTPUT_MODEL_DIR+"/config.json", "w") as fo:
  json.dump(bert_config, fo, indent=2)

!gsutil -m cp -r $OUTPUT_MODEL_DIR gs://$BUCKET_NAME


#@markdown \
#@markdown 
#@markdown 
#@markdown ## Training procedure config
#@markdown When checking for dynamically generated data, how long to wait between each check (to minimize interaction with GCS, should be around the same time it takes for the data generation script to generate 1 epoch worth of data):
CHECK_DATA_EVERY_N_SECS = 1200 #@param {type:"integer"}
INIT_LEARNING_RATE =  2e-5 #@param {type:"number"}
END_LEARNING_RATE = 1e-9 #@param {type:"number"}
#@markdown How many checkpoints to keep at a time (older checkpoints will be deleted):
KEEP_N_CHECKPOINTS_AT_A_TIME = 20 #@param {type:"integer"}
#@markdown The stopping condition for training can be set either as a number of sequences or as a number of steps. PLANNED_TOTAL_STEPS overrides PLANNED_TOTAL_SEQUENCES_SEEN; to use PLANNED_TOTAL_SEQUENCES_SEEN, set PLANNED_TOTAL_STEPS to -1.
#@markdown 
#@markdown * Option 1: How many sequences the model should train on before stopping:
PLANNED_TOTAL_SEQUENCES_SEEN =  1e9 #@param {type:"number"}
#@markdown * Option 2: How many steps the model should train for before stopping (number of total sequences trained on will depend on the batch size used):
PLANNED_TOTAL_STEPS =  -1#@param {type:"number"}
TRAIN_BATCH_SIZE =   64#@param {type:"integer"}
#@markdown If using gradient accumulation (to save memory), what multiplier to use (memory usage and training speed will both be divided by this value) (Note: batch size must be divisible by this number):
GRADIENT_ACCUMULATION_MULTIPLIER = 2 #@param {type:"integer"}


#@markdown How many steps to wait between saves (note that if SAVE_CHECKPOINTS_STEPS is larger than the steps per epoch, the model will be saved every "steps per epoch" number of steps):
SAVE_CHECKPOINTS_STEPS = 1000 #@param {type:"integer"}
#@markdown When writing out training logs, how often to write them out:
SAVE_LOGS_EVERY_N_STEPS = 500 #@param {type:"integer"}

PLANNED_TOTAL_STEPS = PLANNED_TOTAL_SEQUENCES_SEEN/TRAIN_BATCH_SIZE if PLANNED_TOTAL_STEPS==-1 else PLANNED_TOTAL_STEPS
DECAY_PER_STEP = (END_LEARNING_RATE-INIT_LEARNING_RATE)/PLANNED_TOTAL_STEPS


BERT_GCS_DIR = BUCKET_PATH+"/"+OUTPUT_MODEL_DIR
GCS_LOGGING_DIR = BUCKET_PATH+"/"+LOGGING_DIR+"/"+RUN_NAME

CONFIG_FILE = BERT_GCS_DIR+"/config.json"

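STEPS_PER_EPOCH = int(SEQUENCES_PER_EPOCH/TRAIN_BATCH_SIZE) ##defined before the loop so the epoch estimate below works on the first pass

while True: ##training loop
  INIT_CHECKPOINT = tf.train.latest_checkpoint(BERT_GCS_DIR)
  try:
    INIT_CHECKPOINT_STEP = int(INIT_CHECKPOINT.split("-")[-1])
    current_epoch = int(INIT_CHECKPOINT_STEP/STEPS_PER_EPOCH)
    print("CURRENT STEP:",INIT_CHECKPOINT_STEP)
    if int(INIT_CHECKPOINT_STEP)>=2000000:#PLANNED_TOTAL_STEPS: ##stop once the hard-coded step limit is reached
      break
  except:
    current_epoch = 0 ##no usable checkpoint yet (latest_checkpoint returns None when none exists)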
  try: ###wrap entire training loop into try and except loop so glitches don't kill training
    print("\n\n\n\n\nEPOCH:"+str(current_epoch)+"\n")
    STEPS_PER_EPOCH = int(SEQUENCES_PER_EPOCH/TRAIN_BATCH_SIZE)
    print("Steps per epoch:",STEPS_PER_EPOCH)
    print("\n\n\n\n\n")

    got_data = False
    while not got_data:
      try:
        for f in tf.io.gfile.listdir(BUCKET_PATH+"/"+DATA_DIR+"/train"): ##try to access any of the data bins
          print("trying to access training data from saved copy number "+str(f))
          DATA_GCS_DIR = BUCKET_PATH+"/"+DATA_DIR+"/train/"+str(f)
          train_input_files = tf.gfile.Glob(os.path.join(DATA_GCS_DIR,'*tfrecord'))
          print("train_input_files:",train_input_files)
          if len(train_input_files)>0:
            got_data = True
            break
      except:
          pass
      if not got_data:
        print("Could not find data, waiting for data generation...trying again in another "+str(CHECK_DATA_EVERY_N_SECS)+" seconds.")
        time.sleep(CHECK_DATA_EVERY_N_SECS)

    config = modeling.BertConfig.from_json_file(CONFIG_FILE)

    log.info(f"Using checkpoint: {INIT_CHECKPOINT}")
    log.info(f"Using {len(train_input_files)} data shards for training")
    model_fn = run_pretraining.model_fn_builder(
        bert_config=config,
        logging_dir=GCS_LOGGING_DIR,
        save_logs_every_n_steps=SAVE_LOGS_EVERY_N_STEPS,
        init_checkpoint=INIT_CHECKPOINT,
        init_learning_rate=INIT_LEARNING_RATE,
        decay_per_step=DECAY_PER_STEP,
        num_warmup_steps=10,
        use_tpu=True,
        use_one_hot_embeddings=True,
        bert=MODEL_ARCHITECTURE,
        grad_accum_mul=GRADIENT_ACCUMULATION_MULTIPLIER)

    tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)

    run_config = tf.contrib.tpu.RunConfig(
        cluster=tpu_cluster_resolver,
        model_dir=BERT_GCS_DIR,
        save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
        keep_checkpoint_max=KEEP_N_CHECKPOINTS_AT_A_TIME,
        tpu_config=tf.contrib.tpu.TPUConfig(
            iterations_per_loop=SAVE_CHECKPOINTS_STEPS,
            num_shards=NUM_TPU_CORES,
            per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

    estimator = tf.contrib.tpu.TPUEstimator(
        use_tpu=True,
        model_fn=model_fn,
        config=run_config,
        train_batch_size=TRAIN_BATCH_SIZE//GRADIENT_ACCUMULATION_MULTIPLIER,
        eval_batch_size=1)
      
    
    DATA_INFO = json.load(tf.gfile.Open(DATA_GCS_DIR+"info.json"))
    MAX_SEQ_LENGTH = DATA_INFO["sequence_length"]
    MAX_PREDICTIONS = DATA_INFO["max_num_predictions"]
    
    train_input_fn = run_pretraining.input_fn_builder(
            input_files=train_input_files,
            max_seq_length=MAX_SEQ_LENGTH,
            max_predictions_per_seq=MAX_PREDICTIONS,
            is_training=True)
  except Exception as e:
    log.info(f"Training load failed. error: {e}")
    continue
  try:
    estimator.train(input_fn=train_input_fn, steps=STEPS_PER_EPOCH)
    # For dynamic masking, a parallel data generation is used. This portion deletes the current dataset.
    cmd = "gsutil -m rm -r "+DATA_GCS_DIR
    !{cmd}
  except Exception as e:
    log.info(f"Training loop failed. error: {e}")
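
The loop above resumes by parsing the global step from the latest checkpoint path and turning it into an epoch number. The following is a minimal sketch of that logic without the bare `except`, assuming checkpoint paths end in `-<step>` (for example `model.ckpt-12000`) and that a missing checkpoint is reported as `None`; the function names and the example bucket path are hypothetical.

```python
def checkpoint_step(checkpoint_path):
    """Parse the trailing global step from a checkpoint path; None means step 0."""
    if checkpoint_path is None:
        return 0
    return int(checkpoint_path.rsplit("-", 1)[-1])


def epoch_for_step(step, steps_per_epoch):
    """Number of complete epochs already trained at the given step."""
    return step // steps_per_epoch


if __name__ == "__main__":
    path = "gs://example-bucket/bert_model/model.ckpt-12000"  # made-up path
    step = checkpoint_step(path)
    print("resuming from step", step, "in epoch", epoch_for_step(step, 5000))
    print("fresh run starts at step", checkpoint_step(None))
```

Handling the `None` case explicitly keeps unrelated errors, such as an undefined variable, from being silently treated as "no checkpoint yet".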




  
