#Pretraining Evaluation script

This script evaluates pretrained models on protein sequences.

Note: If using a TPU from Google Cloud (not the Colab TPU), make sure to run this notebook on a VM with access to all GCP APIs, and make sure TPUs are enabled for the GCP project.

This notebook can evaluate multiple models in a single run. However, if more frequent evaluations of more models are desired, run multiple copies of this notebook on multiple VMs.

#Downgrade Python and Tensorflow 

(the default python version in Colab does not support Tensorflow 1.15)

* **Note** that because the Python used in this notebook is not installed at Colab's default path, syntax highlighting most likely will not work.

####1. First, download and install Python version 3.7:

In [None]:
!wget -O mini.sh https://repo.anaconda.com/miniconda/Miniconda3-py37_22.11.1-1-Linux-x86_64.sh
!chmod +x mini.sh
!bash ./mini.sh -b -f -p /usr/local
!conda install -q -y jupyter
!conda install -q -y google-colab -c conda-forge
!python -m ipykernel install --name "py37" --user

####2. Then, reload the webpage (not restart runtime) to allow Colab to recognize the newly installed python
####3. Finally, run the following commands to install tensorflow 1.15:

In [None]:
!python3 -m pip install tensorflow==1.15
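
After the install finishes, it can help to confirm that the active interpreter and TensorFlow match what this notebook expects. The following is a minimal sketch, assuming a hypothetical helper `version_problems` (not part of the MutFormer repo) that compares version values passed in by the caller:

```python
def version_problems(python_version, tf_version,
                     want_python=(3, 7), want_tf="1.15"):
    """Return a list of human-readable mismatches (empty if all is well).

    Hypothetical helper for illustration; not part of the MutFormer repo.
    """
    problems = []
    if tuple(python_version[:2]) != tuple(want_python):
        problems.append("Python {}.{} found, expected {}.{}".format(
            python_version[0], python_version[1], *want_python))
    # Compare only the major.minor prefix, so 1.15.0 and 1.15.5 both pass.
    if ".".join(tf_version.split(".")[:2]) != want_tf:
        problems.append(
            "TensorFlow {} found, expected {}.x".format(tf_version, want_tf))
    return problems

# Demo with explicit values; inside the notebook one would pass
# sys.version_info and tf.__version__ instead.
ok = version_problems((3, 7, 15), "1.15.0")
bad = version_problems((3, 10, 12), "2.15.0")
print(ok)   # []
print(bad)
```

An empty list means both versions match; anything else suggests the kernel did not pick up the newly installed Python.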

# Configure settings

In [None]:
#@markdown ### General Config
GCP_RUNTIME = False #@param {type:"boolean"}
PROCESSES = 2 #@param {type:"integer"}
NUM_TPU_CORES = 8 #@param {type:"integer"}
#@markdown Name of the GCS bucket to use (Make sure to set this to the name of your own GCS  bucket):
BUCKET_NAME = "" #@param {type:"string"}
#@markdown Evaluation and testing data location:
DATA_DIR = "pretraining_data_1024_embedded_mutformer" #@param {type:"string"}
#@markdown What folder to write evaluation results into:
EVALUATIONS_LOGS_DIR = "mutformer2_0_pretraining_logs" #@param {type:"string"}
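
How these settings combine into storage paths is spread across later cells. The sketch below collects that logic in one place, assuming results are written to GCS; `resolve_paths` is a hypothetical helper and `my-bucket` is a made-up bucket name, while the `gs://` prefix and the `test`/`train` subfolders follow the later cells of this notebook.

```python
def resolve_paths(bucket_name, data_dir, logs_dir, run_name, dataset="dev"):
    """Hypothetical helper mirroring how later cells build their GCS paths."""
    if not bucket_name:
        raise ValueError("BUCKET_NAME is empty; set it to your own GCS bucket")
    # The notebook maps "test" -> <DATA_DIR>/test and "dev" -> <DATA_DIR>/train.
    subfolder = {"test": "test", "dev": "train"}.get(dataset)
    if subfolder is None:
        raise ValueError("only datasets supported are dev and test")
    bucket_path = "gs://{}".format(bucket_name)
    return {
        "data": "{}/{}/{}".format(bucket_path, data_dir, subfolder),
        "logs": "{}/{}/{}".format(bucket_path, logs_dir, run_name),
    }

# "my-bucket" is a placeholder; the other values are this notebook's defaults.
paths = resolve_paths("my-bucket",
                      "pretraining_data_1024_embedded_mutformer",
                      "mutformer2_0_pretraining_logs",
                      "bert_model_embedded_mutformer_8L")
print(paths["data"])
print(paths["logs"])
```

Failing early on an empty bucket name avoids discovering the problem only when the first GCS read fails.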

#If running on a GCP TPU, use these commands prior to running this notebook

To ssh into the VM:

```
gcloud beta compute ssh --zone <COMPUTE ZONE> <VM NAME> --project <PROJECT NAME> -- -L 8888:localhost:8888
```

Make sure the port above matches the port below (in this case it's 8888)

```
sudo apt-get update
sudo apt-get -y install python3 python3-pip
sudo apt-get install pkg-config
sudo apt-get install libhdf5-serial-dev
sudo apt-get install libffi6 libffi-dev
sudo -H pip3 install jupyter tensorflow==1.14 google-api-python-client tqdm
sudo -H pip3 install jupyter_http_over_ws
jupyter serverextension enable --py jupyter_http_over_ws
jupyter notebook   --NotebookApp.allow_origin='https://colab.research.google.com'   --port=8888   --NotebookApp.port_retries=0   --no-browser

# Or, as one command:
sudo apt-get update ; sudo apt-get -y install python3 python3-pip ; sudo apt-get install pkg-config ; sudo apt-get -y install libhdf5-serial-dev ; sudo apt-get install libffi6 libffi-dev; sudo -H pip3 install jupyter tensorflow==1.14 google-api-python-client tqdm ; sudo -H pip3 install jupyter_http_over_ws ; jupyter serverextension enable --py jupyter_http_over_ws ; jupyter notebook   --NotebookApp.allow_origin='https://colab.research.google.com'   --port=8888   --NotebookApp.port_retries=0   --no-browser
```
Then copy the printed link containing "localhost: ..." and paste it into Colab's "Connect to a local runtime" option.
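
Before connecting from Colab, one way to confirm that something is listening on the forwarded port is a plain socket check. This is a sketch using only the Python standard library; `port_is_open` is a hypothetical helper, and 8888 is the port used in the commands above.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False

print("Jupyter reachable on 8888:", port_is_open("localhost", 8888))
```

If this prints `False` on the local machine, the ssh tunnel or the Jupyter server on the VM is probably not running.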


###Also run this code segment, which creates a TPU

In [None]:
GCE_PROJECT_NAME = "" #@param {type:"string"}
TPU_ZONE = "us-central1-f" #@param {type:"string"}
TPU_NAME = "mutformer-tpu" #@param {type:"string"}

!gcloud alpha compute tpus create $TPU_NAME --accelerator-type=tpu-v2 --version=1.15.5 --zone=$TPU_ZONE ##create new TPU

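!gsutil iam ch serviceAccount:`gcloud alpha compute tpus describe $TPU_NAME | grep serviceAccount | cut -d' ' -f2`:admin gs://$BUCKET_NAME && echo 'Successfully set permissions!' ##give TPU access to GCS (BUCKET_PATH is only defined in a later cell, so the bucket URL is built from BUCKET_NAME here)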

#Clone the repo

In [None]:
if GCP_RUNTIME:
  !sudo apt-get -y install git
#@markdown Where to clone the repo into:
REPO_DESTINATION_PATH = "mutformer" #@param {type:"string"}
import os,shutil
if not os.path.exists(REPO_DESTINATION_PATH):
  os.makedirs(REPO_DESTINATION_PATH)
else:
  shutil.rmtree(REPO_DESTINATION_PATH)
  os.makedirs(REPO_DESTINATION_PATH)
cmd = "git clone https://github.com/WGLab/mutformer.git \"" + REPO_DESTINATION_PATH + "\""
!{cmd}

Cloning into 'mutformer'...
remote: Enumerating objects: 1217, done.
remote: Counting objects: 100% (97/97), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 1217 (delta 78), reused 56 (delta 46), pack-reused 1120
Receiving objects: 100% (1217/1217), 2.30 MiB | 13.95 MiB/s, done.
Resolving deltas: 100% (867/867), done.


#Imports/Authenticate for GCP

In [None]:
if not GCP_RUNTIME:
  def authenticate_user(): ##authentication function that uses link authentication instead of popup
    if os.path.exists("/content/.config/application_default_credentials.json"): 
      return
    print("Authorize for runtime GCS:")
    !gcloud auth login --no-launch-browser
    print("Authorize for TPU GCS:")
    !gcloud auth application-default login  --no-launch-browser
  authenticate_user()

import sys
import json
import random
import logging
import tensorflow as tf
import time
import importlib
import os
import shutil

if REPO_DESTINATION_PATH == "mutformer":
  if os.path.exists("mutformer_code"):
    shutil.rmtree("mutformer_code")
  shutil.copytree(REPO_DESTINATION_PATH,"mutformer_code")
  REPO_DESTINATION_PATH = "mutformer_code"
if not os.path.exists("mutformer"):
  shutil.copytree(REPO_DESTINATION_PATH+"/mutformer_model_code","mutformer")
else:
  shutil.rmtree("mutformer")
  shutil.copytree(REPO_DESTINATION_PATH+"/mutformer_model_code","mutformer")
if "mutformer" in sys.path:
  sys.path.remove("mutformer")
sys.path.append("mutformer")

from mutformer import modeling, optimization, tokenization, run_pretraining

##reload modules so that you don't need to restart the runtime to reload modules in case that's needed
modules2reload = [modeling, 
                  optimization, 
                  tokenization,
                  run_pretraining]
for module in modules2reload:
    importlib.reload(module)

from modeling import *

# configure logging
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

log.handlers = []

formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')

#@markdown Whether or not to write logs to a file
DO_FILE_LOGGING = False #@param {type:"boolean"}
if DO_FILE_LOGGING:
  #@markdown * If using file logging, what path to write logs to
  FILE_LOGGING_PATH = 'file_logging/spam.log' #@param {type:"string"}
  if not os.path.exists("/".join(FILE_LOGGING_PATH.split("/")[:-1])):
    os.makedirs("/".join(FILE_LOGGING_PATH.split("/")[:-1]))
  fh = logging.FileHandler(FILE_LOGGING_PATH)
  fh.setLevel(logging.INFO)
  fh.setFormatter(formatter)
  log.addHandler(fh)

ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
ch.setFormatter(formatter)
log.addHandler(ch)


if 'COLAB_TPU_ADDR' in os.environ:
  log.info("Using TPU runtime")
  TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']

  with tf.Session(TPU_ADDRESS) as session:
    log.info('TPU address is ' + TPU_ADDRESS)
    ##upload credentials to TPU.
    with open("/content/.config/application_default_credentials.json", 'r') as f:
      auth_info = json.load(f)
    tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
    
else:
  log.warning('Not connected to TPU runtime')

#Specify original data location for detection of steps per epoch (optional)/specify GCS or Drive preference for saving evaluation results

In [None]:
#@markdown Optional: folder where the original data (tsv format) was stored, used to detect the number of steps per epoch without reading from GCS. Only applies when the data was stored in Drive and GCP_RUNTIME is false; the value should match the "INPUT_DATA_FOLDER" variable in the data generation script. Leave this blank if the data was stored in GCS or if running on a GCP TPU; in that case (or whenever the tsv file cannot be read) steps are detected from the tfrecords on GCS.
orig_data_folder = "" #@param {type:"string"}
BUCKET_PATH = "gs://{}".format(BUCKET_NAME)
DRIVE_PATH = "/content/drive/My Drive"

#@markdown whether to use GCS for writing eval results, if not, defaults to drive
GCS_EVAL = True #@param {type:"boolean"}
EVALS_PATH = BUCKET_PATH if GCS_EVAL else DRIVE_PATH
if EVALS_PATH==DRIVE_PATH:
  from google.colab import drive
  !fusermount -u /content/drive
  drive.flush_and_unmount()
  drive.mount('/content/drive', force_remount=True)

#Evaluation

###General Setup and definitions

In [None]:
def write_metrics(metrics,dir):
  gs = metrics["global_step"]
  print("global step",gs)

  tf.compat.v1.disable_eager_execution()
  tf.reset_default_graph()  
  if os.path.exists(dir):
    shutil.rmtree(dir)
  for key,value in metrics.items():
    if key=="global_step":
      continue
    print(key,value)
    x_scalar = tf.constant(value)
    first_summary = tf.summary.scalar(name=f"eval_{key}", tensor=x_scalar)

    init = tf.global_variables_initializer()
   
    with tf.Session() as sess:
        writer = tf.summary.FileWriter(dir)
        sess.run(init)
        summary = sess.run(first_summary)
        writer.add_summary(summary, gs)
        writer.flush()
        print('Done with writing the scalar summary')
    time.sleep(1)
  
  if "gs:" in EVALS_PATH:
    cmd = "gsutil -m cp -r \""+dir+"/.\" \""+EVALS_PATH+"/"+dir+"\""
    !{cmd}
  else:
    ##copy file by file: shutil.copytree fails when the destination already exists
    dest = EVALS_PATH+"/"+dir
    os.makedirs(dest,exist_ok=True)
    for fname in os.listdir(dir):
      shutil.copy(os.path.join(dir,fname),dest)


def reload_ckpt(model_dir,current_ckpt,model,data_dir):
  BERT_GCS_DIR = f"{BUCKET_PATH}/{model_dir}"


  CONFIG_FILE = os.path.join(BERT_GCS_DIR, "config.json")

  INIT_CHECKPOINT = tf.train.latest_checkpoint(BERT_GCS_DIR)
  log.info(f"init chkpt: {INIT_CHECKPOINT}")
  log.info(f"current chkpt: {current_ckpt}")
  if INIT_CHECKPOINT != current_ckpt:
    log.info(f"Using data from {data_dir}")
    config = modeling.BertConfig.from_json_file(CONFIG_FILE)
    test_input_files = tf.gfile.Glob(os.path.join(data_dir,'*tfrecord'))
    log.info(f"Using {len(test_input_files)} data shards for testing")
    model_fn = run_pretraining.model_fn_builder(
          bert_config=config,
          init_checkpoint=INIT_CHECKPOINT,
          init_learning_rate=0,
          decay_per_step=0,
          num_warmup_steps=10,
          use_tpu=True,
          use_one_hot_embeddings=True,
          bert=model)

    
    tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)

    run_config = tf.contrib.tpu.RunConfig(
        cluster=tpu_cluster_resolver,
        model_dir=BERT_GCS_DIR,
        tpu_config=tf.contrib.tpu.TPUConfig(
            num_shards=NUM_TPU_CORES,
            per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

    estimator = tf.contrib.tpu.TPUEstimator(
        use_tpu=True,
        model_fn=model_fn,
        config=run_config,
        train_batch_size=1,
        eval_batch_size=EVAL_BATCH_SIZE)

    DATA_INFO = json.load(tf.gfile.Open(data_dir+"/info.json")) 
    MAX_SEQ_LENGTH = DATA_INFO["sequence_length"]
    MAX_PREDICTIONS = DATA_INFO["max_num_predictions"]

    
    input_fn = run_pretraining.input_fn_builder(
        input_files=test_input_files,
        max_seq_length=MAX_SEQ_LENGTH,
        max_predictions_per_seq=MAX_PREDICTIONS,
        is_training=False)
    return INIT_CHECKPOINT,estimator,input_fn,True
  else:
    return None,None,None,False
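
`reload_ckpt` only rebuilds the estimator when the newest checkpoint differs from the one evaluated last. The bookkeeping behind that idea can be sketched without TensorFlow; here `find_latest` is a hypothetical stand-in for `tf.train.latest_checkpoint`, and the model and checkpoint names are made up.

```python
def checkpoints_to_evaluate(models, last_seen, find_latest):
    """Return {model: checkpoint} for models whose newest checkpoint changed.

    `last_seen` is updated in place, but only for models that have a new
    checkpoint, so an unchanged model is skipped on the next poll as well.
    """
    todo = {}
    for model in models:
        latest = find_latest(model)
        if latest is not None and latest != last_seen.get(model):
            last_seen[model] = latest
            todo[model] = latest
    return todo

# Fake checkpoint store standing in for GCS; model_b has no checkpoint yet.
store = {"model_a": "model.ckpt-1000", "model_b": None}
seen = {}
first = checkpoints_to_evaluate(["model_a", "model_b"], seen, store.get)
second = checkpoints_to_evaluate(["model_a", "model_b"], seen, store.get)
store["model_a"] = "model.ckpt-2000"  # training saved a newer checkpoint
third = checkpoints_to_evaluate(["model_a", "model_b"], seen, store.get)
print(first, second, third)
```

Keeping the last evaluated checkpoint per model is what lets a polling loop sleep between rounds without repeating work.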

###Run Eval

Run evaluation for the pretraining task.

Note: All evaluation results will be written into the previously specified logging directory, either under Google Drive or GCS, depending on the value of GCS_EVAL specified before. To view the results, use the Colab notebook titled "mutformer processing and viewing pretraining results."
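
The cell below pairs each entry of `MODELS` with the entry at the same position in `MODEL_ARCHITECTURES` and `RUN_NAMES`, so the three lists need equal lengths. A small sanity check along these lines can catch a mismatch before any TPU time is spent; `pair_models` is a hypothetical helper, not part of the repo.

```python
def pair_models(models, architectures, run_names):
    """Zip the three parallel lists, failing loudly on a length mismatch."""
    lengths = {len(models), len(architectures), len(run_names)}
    if len(lengths) != 1:
        raise ValueError(
            "MODELS, MODEL_ARCHITECTURES and RUN_NAMES must have the same "
            "length, got {}, {} and {}".format(
                len(models), len(architectures), len(run_names)))
    return list(zip(models, architectures, run_names))

# Values taken from this notebook's defaults.
jobs = pair_models(["bert_model_embedded_mutformer_8L"],
                   ["MutFormer_embedded_convs"],
                   ["bert_model_embedded_mutformer_8L"])
print(jobs)
```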

In [None]:
#@markdown ###IO config
#@markdown Whether to evaluate on the test set or the dev set (value can be "test" or "dev")
dataset = "dev" #@param{type:"string"}
#@markdown Whether to continuously evaluate in a while loop
REPEAT_EVAL = False #@param{type:"boolean"}
#@markdown * List of pretrained models to evaluate (should indicate the names of the pretrained model folders inside GCS):
MODELS = ["bert_model_embedded_mutformer_8L"] #@param
#@markdown * List of model architectures for each model in the "MODELS" list defined in the entry above: each position in this list must correctly indicate the model architecture of its corresponding model folder in the list "MODELS" (BertModel indicates the original BERT, BertModelModified indicates MutFormer's architecture).
MODEL_ARCHITECTURES = ["MutFormer_embedded_convs"] #@param
#@markdown Folders within EVALUATIONS_LOGS_DIR for where evaluation logs should be written to (each run name should correspond to a model and model architecture)
RUN_NAMES = ["bert_model_embedded_mutformer_8L"] #@param
#@markdown \
#@markdown ### Evaluation procedure config
EVAL_BATCH_SIZE = 64 #@param {type:"integer"}
#@markdown How many seconds to wait in between each evaluation loop (to minimize interaction with GCS, should be around the same time it takes for the training script to train and save 1 checkpoint)
SECS_BETWEEN_EVALS = 1200 #@param {type:"integer"}


if dataset=="test":
  using_DATA_DIR = f"{DATA_DIR}/test"
elif dataset=="dev":
  using_DATA_DIR = f"{DATA_DIR}/train"
else:
  raise Exception("only datasets supported are dev and test")

current_ckpts = ["N/A" for i in range(len(MODELS))]

total_metrics = {}

while True:
  try:
    for n,model in enumerate(MODELS):
      MODEL_ARCHITECTURE = getattr(modeling, MODEL_ARCHITECTURES[n])
      RUN_NAME=  RUN_NAMES[n]
      LOCAL_EVALUATIONS_LOGS_DIR = f"{EVALUATIONS_LOGS_DIR}/{RUN_NAME}"
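      current_ckpt = current_ckpts[n]
      latest_ckpt,estimator,test_input_fn,new = reload_ckpt(model,current_ckpt,MODEL_ARCHITECTURE,BUCKET_PATH+"/"+using_DATA_DIR)
      if new:
        ##only record the checkpoint when it is new; otherwise the stored value would be overwritten with None and the same checkpoint would be re-evaluated on the next loop
        current_ckpt = latest_ckpt
        current_ckpts[n] = current_ckpt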
        print("\n\nEVALUATING "+model+"\n\n")
        log.info(f"Using checkpoint: {current_ckpt}")
        def steps_getter(input_files):
          tot_sequences = 0
          for input_file in input_files:
            tf.logging.info(f"reading: {input_file} for steps")

            d = tf.data.TFRecordDataset(input_file)

            with tf.Session() as sess:
              tot_sequences+=sess.run(d.reduce(0, lambda x,_: x+1))

          return tot_sequences
    
        try:
          if dataset=="dev":
            data_path_eval = orig_data_folder+"/train.tsv"
          else: ##dataset == "test"
            data_path_eval = orig_data_folder+"/test.tsv"
          lines = open(data_path_eval).read().split("\n")
          EVAL_STEPS = int(len(lines)/EVAL_BATCH_SIZE)
        except Exception:
          DATA_GCS_DIR_train = f"{BUCKET_PATH}/{using_DATA_DIR}"
          eval_input_files = tf.gfile.Glob(os.path.join(DATA_GCS_DIR_train,'*tfrecord'))
          SEQUENCES_PER_EPOCH = steps_getter(eval_input_files)
          EVAL_STEPS = int(SEQUENCES_PER_EPOCH/EVAL_BATCH_SIZE)

        tf.logging.info("eval steps:"+str(EVAL_STEPS))
        metrics = estimator.evaluate(input_fn=test_input_fn, steps=EVAL_STEPS)
        if REPEAT_EVAL:
          write_metrics(metrics,LOCAL_EVALUATIONS_LOGS_DIR)
        else:
          total_metrics[LOCAL_EVALUATIONS_LOGS_DIR] = metrics
      else:
        log.info(f"\n\nNo new checkpoints were found for evaluation. Checking again in {SECS_BETWEEN_EVALS} seconds.\n\n")
    print("finished 1 eval loop")
    if not REPEAT_EVAL:
      break
  except Exception as e:
    log.error(f"\n\nEvaluation failed. error: {e}\n\n")
  if not REPEAT_EVAL:
    break
  time.sleep(SECS_BETWEEN_EVALS)
if dataset == "test":
  for logging_dir,metrics in total_metrics.items():
    print("Printing metrics for:",logging_dir,"\n")
    for key,metric in metrics.items():
      print(key+":",metric)
    print("\n")