#Finetuning Data Generation Script

This notebook processes TSV data and uploads the processed data to GCS, where it is used for finetuning MutFormer.

# Configure settings

In [1]:
#@markdown ## General Config
BUCKET_NAME = "theodore_jiang" #@param {type:"string"}
#@markdown Name of the folder to upload data into on GCS. When generating multiple datasets (e.g. for different sequence lengths), each dataset is written to its own subfolder inside this folder
OUTPUT_DATA_DIR = "MRPC_w_ex_data_all_loaded" #@param {type:"string"}
#@markdown Whether or not this script is being run in a GCP runtime (if more memory is required for large databases)
USE_GCP_RUNTIME = False #@param {type:"boolean"}

#@markdown Which task to perform: options are "MRPC" for the paired sequence method, "MRPC_w_ex_data" for the paired sequence method with external data, "RE" for the single sequence method, or "NER" for single sequence per-residue prediction (if you add more modes, make sure to change the corresponding code segments)
MODE = "MRPC_w_ex_data" #@param {type:"string"}
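
Because `MODE` is a free-text field, a typo only surfaces later, when the processor is selected. A minimal sketch of an early check follows; `VALID_MODES` and `check_mode` are hypothetical names that are not part of the MutFormer repo, and the list simply mirrors the four options described above.

```python
# Hypothetical helper, not part of the MutFormer repo.
# The tuple mirrors the MODE options listed in the config cell above.
VALID_MODES = ("MRPC", "MRPC_w_ex_data", "RE", "NER")

def check_mode(mode):
    """Return the mode unchanged, or raise ValueError listing the valid options."""
    if mode not in VALID_MODES:
        raise ValueError(
            "Unknown MODE %r; expected one of: %s" % (mode, ", ".join(VALID_MODES)))
    return mode

check_mode("MRPC_w_ex_data")  # passes silently for a valid mode
```

If more modes are added, the tuple would need to be extended alongside the processor selection code.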


#If running on a GCP runtime, follow these instructions to set it up:

###1) Create a VM from the GCP website
###2) Open a command prompt on your computer and perform the following steps
To ssh into the VM:

```
gcloud beta compute ssh --zone <COMPUTE ZONE> <VM NAME> --project <PROJECT NAME> -- -L 8888:localhost:8888
```

Note: Make sure the port above matches the port below (in this case it's 8888)
\
\
Run each of these commands individually, or copy and paste the one command below:
```
sudo apt-get update
sudo apt-get -y install python3 python3-pip
sudo apt-get -y install pkg-config
sudo apt-get -y install libhdf5-serial-dev
sudo apt-get -y install libffi6 libffi-dev
sudo -H pip3 install jupyter tensorflow==1.14 google-api-python-client tqdm
sudo -H pip3 install jupyter_http_over_ws
jupyter serverextension enable --py jupyter_http_over_ws
jupyter notebook   --NotebookApp.allow_origin='https://colab.research.google.com'   --port=8888   --NotebookApp.port_retries=0   --no-browser
```
One command:
```
sudo apt-get update ; sudo apt-get -y install python3 python3-pip ; sudo apt-get -y install pkg-config ; sudo apt-get -y install libhdf5-serial-dev ; sudo apt-get -y install libffi6 libffi-dev ; sudo -H pip3 install jupyter tensorflow==1.14 google-api-python-client tqdm ; sudo -H pip3 install jupyter_http_over_ws ; jupyter serverextension enable --py jupyter_http_over_ws ; jupyter notebook   --NotebookApp.allow_origin='https://colab.research.google.com'   --port=8888   --NotebookApp.port_retries=0   --no-browser
```
###3) In this notebook, to connect to this runtime, click the "connect to local runtime" option under the connect button, then copy and paste the link printed by the Jupyter command (the one containing "localhost: ...")
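
Before connecting, it can help to confirm from your own computer that the forwarded port is actually reachable. The following is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper, and the default port 8888 is taken from the ssh and Jupyter commands above.

```python
import socket

def port_is_open(host="localhost", port=8888, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False

print("Tunnel reachable:", port_is_open())
```

A `False` result usually means either the ssh session with the `-L` forward is not running or the Jupyter server on the VM was started on a different port.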

#Clone the repo

In [2]:
if USE_GCP_RUNTIME:
  !sudo apt-get -y install git-all
#@markdown ######Where to clone the repo into (any path is allowed except "mutformer"):
REPO_DESTINATION_PATH = "code/mutformer" #@param {type:"string"}
import os,shutil
if not os.path.exists(REPO_DESTINATION_PATH):
  os.makedirs(REPO_DESTINATION_PATH)
else:
  shutil.rmtree(REPO_DESTINATION_PATH)
  os.makedirs(REPO_DESTINATION_PATH)
cmd = "git clone https://github.com/WGLab/mutformer.git \"" + REPO_DESTINATION_PATH + "\""
!{cmd}

Cloning into 'code/mutformer'...
remote: Enumerating objects: 450, done.
remote: Counting objects: 100% (251/251), done.
remote: Compressing objects: 100% (215/215), done.
remote: Total 450 (delta 176), reused 40 (delta 35), pack-reused 199
Receiving objects: 100% (450/450), 2.06 MiB | 15.99 MiB/s, done.
Resolving deltas: 100% (287/287), done.


#Authorize for GCS and Imports

In [3]:
if not USE_GCP_RUNTIME:
  from google.colab import auth
  print("Authorize for GCS:")
  auth.authenticate_user()
  print("Authorize done")

  %tensorflow_version 1.x
import sys
import json
import random
import logging
import tensorflow as tf
import time
import os
import shutil
import importlib

if not os.path.exists("mutformer"):
  shutil.copytree(REPO_DESTINATION_PATH+"/mutformer_model_code","mutformer")
else:
  shutil.rmtree("mutformer")
  shutil.copytree(REPO_DESTINATION_PATH+"/mutformer_model_code","mutformer")
if "mutformer" in sys.path:
  sys.path.remove("mutformer")
sys.path.append("mutformer")

from mutformer import modeling, optimization, tokenization,run_classifier,run_ner_for_pathogenic
from mutformer.modeling import BertModel,BertModelModified
from mutformer.run_classifier import MrpcProcessor,REProcessor,MrpcWithPredsProcessor  ##change this part if you add more modes--
from mutformer.run_ner_for_pathogenic import NERProcessor       ##--

##reload modules so that you don't need to restart the runtime to reload modules in case that's needed
modules2reload = [modeling, 
                  optimization, 
                  tokenization,
                  run_classifier,
                  run_ner_for_pathogenic]
for module in modules2reload:
    importlib.reload(module)

# configure logging
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

log.handlers = []

formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')

# create formatter and add it to the handlers
ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
ch.setFormatter(formatter)
log.addHandler(ch)

##Vocabulary for the model (B and J are markers for the beginning and ending of a protein sequence)
vocab = \
'''[PAD]
[UNK]
[CLS]
[SEP]
[MASK]
L
S
B
J
E
A
P
T
G
V
K
R
D
Q
I
N
F
H
Y
C
M
W'''

  
with open("vocab.txt", "w") as fo:
  for token in vocab.split("\n"):
    fo.write(token+"\n")


if MODE=="MRPC": ##change this part if you added more modes
  processor = MrpcProcessor()
  script = run_classifier
elif MODE=="MRPC_w_ex_data":
  processor = MrpcWithPredsProcessor()
  script = run_classifier
elif MODE=="RE":
  processor = REProcessor()
  script = run_classifier
elif MODE=="NER":
  processor = NERProcessor()
  script = run_ner_for_pathogenic
else:
  raise Exception("The mode specified was not one of the available modes: [\"MRPC\", \"MRPC_w_ex_data\", \"RE\", \"NER\"].")
label_list = processor.get_labels()
tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=False)

Authorize for GCS:
Authorize done
TensorFlow 1.x selected.
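
To make the vocabulary above concrete, here is a minimal sketch of how residue letters could map to integer ids. It assumes ids follow line order in `vocab.txt`, as is conventional for BERT-style vocabulary files; `VOCAB_TOKENS`, `token_to_id`, and `encode` are hypothetical names, and this is not the repo's `FullTokenizer`.

```python
# Same tokens, in the same order, as the vocab string written to vocab.txt above.
VOCAB_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + list("LSBJEAPTGVKRDQINFHYCMW")

# Assumption: ids are assigned by line order (BERT-style convention).
token_to_id = {tok: i for i, tok in enumerate(VOCAB_TOKENS)}

def encode(sequence):
    """Map each residue letter to its id; letters outside the vocabulary map to [UNK]."""
    unk = token_to_id["[UNK]"]
    return [token_to_id.get(ch, unk) for ch in sequence]

# B and J mark the beginning and end of a protein sequence.
print(encode("BMKVJ"))  # [7, 25, 15, 14, 8]
```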






#Specify Data location/Mount Drive if needed

In [4]:
if not USE_GCP_RUNTIME:
  from google.colab import drive,auth
import os
import shutil
#@markdown Input finetuning data folder: data is read from here, processed, and uploaded to GCS (can be a GCS path if needed for large databases; must be a GCS path if USE_GCP_RUNTIME is set). To process multiple sets (e.g. for multiple sequence lengths), store each set in a separate subfolder, named after its sequence length, inside the folder listed below. The script will save these sets into GCS with the same layout
data_folder = "gs://theodore_jiang/MRPC_w_ex_data_all" #@param {type: "string"}
if "/content/drive" in data_folder:
  if USE_GCP_RUNTIME:
    raise Exception("if USE_GCP_RUNTIME, a GCS path must be used, since Google's cloud TPUs can only communicate with GCS and not drive")
  !fusermount -u /content/drive
  drive.flush_and_unmount()
  drive.mount('/content/drive', force_remount=True)
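
The subfolder convention described for `data_folder` can be sketched as a mapping from each sequence length to an input folder and a GCS output folder. `per_length_paths` is a hypothetical helper, and the bucket and folder names in the usage line are placeholders rather than values from this notebook.

```python
def per_length_paths(data_folder, bucket_name, output_dir, lengths):
    """Pair each sequence length with its (input subfolder, GCS output folder)."""
    paths = {}
    for length in lengths:
        src = data_folder.rstrip("/") + "/" + str(length)
        dst = "gs://" + bucket_name + "/" + output_dir + "/" + str(length)
        paths[length] = (src, dst)
    return paths

# Placeholder names, for illustration only.
for length, (src, dst) in per_length_paths(
        "gs://my-bucket/raw_data", "my-bucket", "processed_data", [64, 128]).items():
    print(length, src, "->", dst)
```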

# Data Generation

###General setup and definitions

In [5]:
#@markdown Maximum batch size the training script can handle without OOM (must be divisible by NUM_TPU_CORES_WHEN_TESTING)
MAX_BATCH_SIZE =  1024 #@param {type:"integer"}
#@markdown If using PRECISE_TESTING, how many tpu cores will be used during testing (for colab runtimes, it's 8)
NUM_TPU_CORES_WHEN_TESTING = 8 #@param {type:"integer"}


BUCKET_PATH = "gs://"+BUCKET_NAME

def generate_data(MAX_SEQ_LENGTH,
                  data_folder_current,
                  DATA_GCS_DIR,
                  PRECISE_TESTING,
                  USING_SHARDS,
                  SHARD_SIZE):  

  try:
    print("\nGenerating train set...\n")
    if USING_SHARDS:
      rd_rg = [0,SHARD_SIZE]
      i=0
    else:
      rd_rg = None
    while True:
      train_examples = processor.get_train_examples(data_folder_current,read_range=rd_rg)
      if len(train_examples) == 0:
        break
      train_file = os.path.join(DATA_GCS_DIR, "train.tf_record")
      if USING_SHARDS:
        train_file+="_"+str(i)
      script.file_based_convert_examples_to_features(
          train_examples, label_list, MAX_SEQ_LENGTH, tokenizer, train_file)
      if not USING_SHARDS:
        break
      else:
        rd_rg = [pt+SHARD_SIZE for pt in rd_rg]
        i+=1
  except Exception as e:
    print("training data generation failed. Error:",e)

  try:
    print("\nGenerating eval set...\n")
    if USING_SHARDS:
      rd_rg = [0,SHARD_SIZE]
      i=0
    else:
      rd_rg = None
    while True:
      eval_examples = processor.get_dev_examples(data_folder_current,read_range=rd_rg)
      if len(eval_examples) == 0:
        break
      eval_file = os.path.join(DATA_GCS_DIR, "eval.tf_record")
      if USING_SHARDS:
        eval_file+="_"+str(i)
      script.file_based_convert_examples_to_features(
          eval_examples, label_list, MAX_SEQ_LENGTH, tokenizer, eval_file)
      if not USING_SHARDS:
        break
      else:
        rd_rg = [pt+SHARD_SIZE for pt in rd_rg]
        i+=1
  except Exception as e:
    print("eval data generation failed. Error:",e)

  try:
    print("\nGenerating test set...\n")
    if USING_SHARDS:
      rd_rg = [0,SHARD_SIZE]
      i=0
    else:
      rd_rg = None
    while True:
      test_examples = processor.get_test_examples(data_folder_current,read_range=rd_rg)
      if len(test_examples) == 0:
        break
      test_file = os.path.join(DATA_GCS_DIR, "test.tf_record")
      if USING_SHARDS:
        test_file+="_"+str(i)
      ## If using precise testing, the data is split into two sets:
      ## the head set holds the largest multiple of MAX_BATCH_SIZE examples
      ## and can be predicted at the maximum batch size, while the trailing
      ## set is predicted with a batch size of one, to ensure the fastest
      ## prediction without leaving out any datapoints
      if PRECISE_TESTING and len(test_examples)<SHARD_SIZE:
        test_file_trailing = os.path.join(DATA_GCS_DIR, "test_trailing.tf_record")
        def largest_multiple_under_max(limit,multiple_base):
          return (limit//multiple_base)*multiple_base

        split = largest_multiple_under_max(len(test_examples),MAX_BATCH_SIZE)
        test_examples_head = test_examples[:split]
        test_examples_trailing = test_examples[split:]
        script.file_based_convert_examples_to_features(
            test_examples_head, label_list, MAX_SEQ_LENGTH, tokenizer, test_file)
        script.file_based_convert_examples_to_features(
            test_examples_trailing, label_list, MAX_SEQ_LENGTH, tokenizer, test_file_trailing)
      else:
        script.file_based_convert_examples_to_features(
            test_examples, label_list, MAX_SEQ_LENGTH, tokenizer, test_file)
      if not USING_SHARDS:
        break
      else:
        rd_rg = [pt+SHARD_SIZE for pt in rd_rg]
        i+=1
  except Exception as e:
    print("testing data generation failed. Error:",e)
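
When `USING_SHARDS` is enabled, `generate_data` advances a `[start, end]` read range by `SHARD_SIZE` and appends the shard index to the output file name. The sketch below enumerates those ranges under the assumption that the total number of examples is known up front; the notebook itself does not assume this and instead loops until the processor returns no examples. `shard_ranges` is a hypothetical helper, and how `read_range` treats its endpoints is defined by the processors in the repo, not here.

```python
def shard_ranges(num_examples, shard_size):
    """Yield (shard_index, [start, end]) pairs covering num_examples rows."""
    start = 0
    index = 0
    while start < num_examples:
        yield index, [start, start + shard_size]
        start += shard_size
        index += 1

for index, read_range in shard_ranges(2500, 1024):
    # File naming follows the "<name>.tf_record_<i>" pattern used above.
    print("train.tf_record_" + str(index), read_range)
```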

###Varying sequence lengths

In [6]:
#@markdown list of maximum sequence lengths to generate data for
lengths = [64,128,256,512] #@param
#@markdown whether or not to ensure all datapoints are predicted
PRECISE_TESTING = False #@param {type:"boolean"}
#@markdown whether or not to split the data processing into shards (only needed for really large databases, since finetuning data typically isn't that large)
USING_SHARDS = False #@param {type:"boolean"}
#@markdown if USING_SHARDS, what shard size to use (MUST BE DIVISIBLE BY "MAX_BATCH_SIZE")
SHARD_SIZE = 1024000 #@param {type:"integer"}

for MAX_SEQ_LENGTH in lengths:
  print("Generating data for seq length:",MAX_SEQ_LENGTH)
  DATA_GCS_DIR = BUCKET_PATH+"/"+OUTPUT_DATA_DIR +"/"+ str(MAX_SEQ_LENGTH)
  data_folder_current= data_folder+"/"+str(MAX_SEQ_LENGTH)

  generate_data(MAX_SEQ_LENGTH,
                data_folder_current,
                DATA_GCS_DIR,
                PRECISE_TESTING,
                USING_SHARDS,
                SHARD_SIZE)
  




Generating data for seq length: 64

Generating train set...



reading tsv: 0it [00:01, ?it/s]
creating_examples: 0it [00:00, ?it/s]



Generating eval set...



reading tsv: 0it [00:00, ?it/s]
creating_examples: 0it [00:00, ?it/s]



Generating test set...



reading tsv: 0it [00:00, ?it/s]


KeyboardInterrupt: ignored

###Only one dataset

In [7]:
#@markdown maximum output data length (because using paired method, actual protein sequence length is half)
MAX_SEQ_LENGTH = 512 #@param {type:"integer"}
#@markdown whether or not to ensure all datapoints are predicted
PRECISE_TESTING = True #@param {type:"boolean"}
#@markdown whether or not to split the data processing into shards (only needed for really large databases, since finetuning data typically isn't that large)
USING_SHARDS = False #@param {type:"boolean"}
#@markdown if USING_SHARDS, what shard size to use (MUST BE DIVISIBLE BY "MAX_BATCH_SIZE")
SHARD_SIZE = 1024000 #@param {type:"integer"}

DATA_GCS_DIR = BUCKET_PATH+"/"+OUTPUT_DATA_DIR+"/"+str(MAX_SEQ_LENGTH)
data_folder_current = data_folder

generate_data(MAX_SEQ_LENGTH,
              data_folder_current,
              DATA_GCS_DIR,
              PRECISE_TESTING,
              USING_SHARDS,
              SHARD_SIZE)



Generating train set...



reading tsv: 115784it [00:13, 8843.09it/s]
creating_examples: 100%|██████████| 115784/115784 [00:00<00:00, 152175.51it/s]


2021-12-21 06:48:49,954 - tensorflow - INFO - Writing example 0 of 115784
2021-12-21 06:48:49,971 - tensorflow - INFO - *** Example ***
2021-12-21 06:48:49,976 - tensorflow - INFO - guid: train-0
2021-12-21 06:48:49,978 - tensorflow - INFO - tokens: [CLS] F G E D V A F G G V F R C T V G L R D K Y G K D R V F N T P L C E Q G I V G F G I G I A V T G A T A I A E I Q F A D Y I F P A F D Q I V N E A A K Y R Y R S G D L F N C G S L T I R S P W G C V G H G A L Y H S Q S P E A F F A H C P G I K V V I P R S P F Q A K G L L L S C I E D K N P C I F F E P K I L Y R A A A E E V P I E P Y N I P L S Q A E V I Q E G S D V T L V A W G T Q V H V I R E V A S M A K E K L G V S C E V I D L R T I I P W D V D T I C K S V I K T G R L L I S H E A P L T G G F A S E I S S T V Q E [SEP] F G E D V A F G G V F R C T V G L R D K Y G K D R V F N T P L C E Q G I V G F G I G I A V T G A T A I A E 


Generating eval set...



reading tsv: 14474it [00:02, 7060.39it/s] 
creating_examples: 100%|██████████| 14474/14474 [00:00<00:00, 282134.80it/s]
2021-12-21 06:57:02,596 - tensorflow - INFO - Writing example 0 of 14474
2021-12-21 06:57:02,605 - tensorflow - INFO - *** Example ***
2021-12-21 06:57:02,607 - tensorflow - INFO - guid: dev-0
2021-12-21 06:57:02,609 - tensorflow - INFO - tokens: [CLS] Q M V G M Y A S S Y M I L A M T L D R H R A I C R P M L A Y R H G S G A H W N R P V L V A W A F S L L L S L P Q L F I F A Q R N V E G G S G V T D C W A C F A E P W G R R T Y V T W I A L M V F V A P T L G I A A C Q V L I F R E I H A S L V P G P S E R P G G R R R G R R T G S P G E G A H V S A A V A K T V R M T L V I V V V Y V L C W A P F F L V Q L W A A W D P E A P L E G A P F V L L M L L A S L N S C T N P W I Y A S F S S S V S S E L R S L L C C A R G R T P P S L G P Q D E S C T T A S S S L A K D T S S J [SEP] Q M V G M Y A S S Y M I L A M T L D R H R A I C R P M L A Y R H G S G A H W N R P V L V A W A F S L L L S L P Q L


Generating test set...



reading tsv: 14472it [00:01, 9363.01it/s] 
creating_examples: 100%|██████████| 14472/14472 [00:00<00:00, 233162.79it/s]
2021-12-21 06:58:06,015 - tensorflow - INFO - Writing example 0 of 14336
2021-12-21 06:58:06,025 - tensorflow - INFO - *** Example ***
2021-12-21 06:58:06,026 - tensorflow - INFO - guid: test-0
2021-12-21 06:58:06,029 - tensorflow - INFO - tokens: [CLS] S F L C K C P P G Y S G T I C E T T I G S C G K N S C Q H G G I C H Q D P I Y P V C I C P A G Y A G R F C E I D H D E C A S S P C Q N G A V C Q D G I D G Y S C F C V P G Y Q G R H C D L E V D E C A S D P C K N E A T C L N E I G R Y T C I C P H N Y S G Y T G A Q C E I D L N E C N S N P C Q S N G E C V E L S S E K Q Y G R I T G L P S S F S Y H E A S G Y V C I C Q P G F T G I H C E E D V N E C S S N P C Q N G G T C E N L P G N Y T C H C P F D N L S R T F Y G G R D C S D I L L G C T H Q Q C L N N G T C I P [SEP] S F L C K C P P G Y S G T I C E T T I G S C G K N S C Q H G G I C H Q D P I Y P V C I C P A G Y A G R F C E I D 
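
The test log above reads 14472 examples but reports "Writing example 0 of 14336". This is consistent with `PRECISE_TESTING`: the head file receives the largest multiple of `MAX_BATCH_SIZE` (1024) examples, and the remainder goes to `test_trailing.tf_record`. A minimal sketch of that arithmetic, with `split_for_precise_testing` as a hypothetical helper name:

```python
def split_for_precise_testing(num_examples, max_batch_size):
    """Return (head_count, trailing_count) for the precise-testing split."""
    head = (num_examples // max_batch_size) * max_batch_size
    return head, num_examples - head

# 14472 test examples with MAX_BATCH_SIZE = 1024, as in the run above.
print(split_for_precise_testing(14472, 1024))  # (14336, 136)
```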