#Finetuning Data Generation Script

This notebook processes TSV data and uploads the processed data to GCS, where it is used for finetuning MutFormer.

# Configure settings/Mount Drive if needed

In [1]:
#@markdown ## General Config
#@markdown Whether or not this script is being run in a GCP runtime (if more memory is required for large databases)
GCP_RUNTIME = False #@param {type:"boolean"}
#@markdown Which mode to use (a different mode means a different finetuning task): options are:
#@markdown * "MRPC" - paired sequence method
#@markdown * "MRPC_w_ex_data" - paired sequence method with external data
#@markdown * "RE" - single sequence method
#@markdown * "NER" - single sequence per residue prediction
#@markdown 
#@markdown You can add more modes by creating a new processor and/or a new model_fn inside of the "mutformer_model_code" folder downloaded from github, then changing the corresponding code snippets in the code segment named "Authorize for GCS, Imports, and General Setup" (also edit the dropdown below).
MODE = "MRPC_w_ex_data" #@param   ["MRPC_w_ex_data", "MRPC", "RE", "NER"]   {type:"string"} 
            ####      ^^^^^ dropdown list for all modes ^^^^^

#@markdown Name of the GCS bucket to use:
BUCKET_NAME = "theodore_jiang" #@param {type:"string"}
BUCKET_PATH = "gs://"+BUCKET_NAME
#@markdown \
#@markdown 
#@markdown 
#@markdown ## IO Config
#@markdown Input finetuning data folder: data will be read from here to be processed and uploaded to GCS (can be a drive path, or a GCS path if needed for large databases; must be a GCS path if using GCP_RUNTIME):
#@markdown 
#@markdown * For processing multiple sets i.e. for multiple sequence lengths, simply store these sets into separate subfolders inside of the folder listed below, with each subfolder being named as specified in the following section.
#@markdown 
#@markdown * For processing a single set, this folder should directly contain one dataset.
#@markdown
INPUT_DATA_DIR = "gs://theodore_jiang/updated_all_snp_prediction_data" #@param {type: "string"}


if "/content/drive" in INPUT_DATA_DIR:   ##if INPUT_DATA_DIR is a drive path,
  if GCP_RUNTIME:                        ##it cannot be used from a GCP runtime
    raise Exception("if GCP_RUNTIME, a GCS path must be used, since Google's cloud TPUs can only communicate with GCS and not drive")
  from google.colab import drive         ##otherwise, mount google drive
  !fusermount -u /content/drive
  drive.flush_and_unmount()
  drive.mount('/content/drive', force_remount=True)


#@markdown Name of the folder in GCS to put processed data into: 
#@markdown * For generating multiple datasets i.e. for different sequence lengths, they will be written as individual subfolders inside of this folder.
OUTPUT_DATA_DIR = "all_snp_prediction_data_loaded" #@param {type:"string"}


DATA_INFO = {      ##dictionary that will be uploaded alongside 
    "mode":MODE    ##each dataset to indicate its parameters
}


#### Vocabulary for the model (MutFormer uses the vocabulary below): [PAD],
#### [UNK], [CLS], [SEP], and [MASK] are necessary default tokens; B and J
#### are markers for the beginning and ending of a protein sequence,
#### respectively; the rest are all possible amino acids, ranked
#### approximately by frequency of occurrence in the human population
#### vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
vocab = \
'''[PAD]
[UNK]
[CLS]
[SEP]
[MASK]
L
S
B
J
E
A
P
T
G
V
K
R
D
Q
I
N
F
H
Y
C
M
W'''
with open("vocab.txt", "w") as fo:
  for token in vocab.split("\n"):
    fo.write(token+"\n")
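
To make the role of the vocabulary concrete, here is a minimal sketch of how a tokenized sequence could map to integer ids. It assumes that ids follow line order in vocab.txt (the first line gets id 0), which matches the `input_ids` shown in the logs further down. The helper names `build_vocab_index` and `tokens_to_ids` are hypothetical and are not part of the MutFormer `tokenization` module.

```python
# Same tokens, in the same order, as the vocab string written to vocab.txt.
VOCAB_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
                "L", "S", "B", "J", "E", "A", "P", "T", "G", "V", "K", "R",
                "D", "Q", "I", "N", "F", "H", "Y", "C", "M", "W"]


def build_vocab_index(tokens):
    """Assign each token the index of its line (assumption: ids follow line order)."""
    return {token: idx for idx, token in enumerate(tokens)}


def tokens_to_ids(tokens, vocab_index):
    """Look up each token, falling back to [UNK] for anything not in the vocabulary."""
    unk_id = vocab_index["[UNK]"]
    return [vocab_index.get(token, unk_id) for token in tokens]


vocab_index = build_vocab_index(VOCAB_TOKENS)
# B and J mark the beginning and end of a protein sequence.
example_ids = tokens_to_ids(["[CLS]", "B", "M", "A", "E", "J", "[SEP]"], vocab_index)
print(example_ids)
```

Padding with `[PAD]` up to the maximum sequence length is left out of this sketch.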


#If running on a GCP runtime, follow these instructions to set it up:

###1) Create a VM from the GCP website
###2) Open a command prompt on your computer and perform the following steps:
To ssh into the VM, run:

```
gcloud beta compute ssh --zone <COMPUTE ZONE> <VM NAME> --project <PROJECT NAME> -- -L 8888:localhost:8888
```

Note: Make sure the port above matches the port below (in this case it's 8888)
\
\
In the new command prompt that opens, either run each of the commands below individually, or copy and paste the one-liner further down:
```
sudo apt-get update
sudo apt-get -y install python3 python3-pip
sudo apt-get install pkg-config
sudo apt-get install libhdf5-serial-dev
sudo apt-get install libffi6 libffi-dev
sudo -H pip3 install jupyter tensorflow==1.14 google-api-python-client tqdm
sudo -H pip3 install jupyter_http_over_ws
jupyter serverextension enable --py jupyter_http_over_ws
jupyter notebook   --NotebookApp.allow_origin='https://colab.research.google.com'   --port=8888   --NotebookApp.port_retries=0   --no-browser
```
One command:
```
sudo apt-get update ; sudo apt-get -y install python3 python3-pip ; sudo apt-get install pkg-config ; sudo apt-get -y install libhdf5-serial-dev ; sudo apt-get install libffi6 libffi-dev; sudo -H pip3 install jupyter tensorflow==1.14 google-api-python-client tqdm ; sudo -H pip3 install jupyter_http_over_ws ; jupyter serverextension enable --py jupyter_http_over_ws ; jupyter notebook   --NotebookApp.allow_origin='https://colab.research.google.com'   --port=8888   --NotebookApp.port_retries=0   --no-browser
```
###3) In this notebook, click the "connect to local runtime" option under the connect button, then copy and paste the link printed in the command prompt that contains "localhost: ..."

#Clone the MutFormer repo

In [2]:
if GCP_RUNTIME:
  !sudo apt-get -y install git-all
#@markdown Where to clone the repo into:
REPO_DESTINATION_PATH = "mutformer" #@param {type:"string"}
import os,shutil
if not os.path.exists(REPO_DESTINATION_PATH):
  os.makedirs(REPO_DESTINATION_PATH)
else:
  shutil.rmtree(REPO_DESTINATION_PATH)
  os.makedirs(REPO_DESTINATION_PATH)
cmd = "git clone https://github.com/WGLab/mutformer.git \"" + REPO_DESTINATION_PATH + "\""
!{cmd}

Cloning into 'mutformer'...
remote: Enumerating objects: 1358, done.
remote: Counting objects: 100% (238/238), done.
remote: Compressing objects: 100% (140/140), done.
remote: Total 1358 (delta 184), reused 142 (delta 98), pack-reused 1120
Receiving objects: 100% (1358/1358), 2.33 MiB | 9.67 MiB/s, done.
Resolving deltas: 100% (973/973), done.


#Authorize for GCS, Imports, and General Setup

In [3]:
#@markdown Whether to use link authorization for GCS. Link authorization allows connecting to a different account (though it is more cumbersome, since the login flow must be clicked through on each run), while normal authorization does not allow connecting to another account:
LINK_AUTHORIZATION = False #@param {type:"boolean"}

if not GCP_RUNTIME:
  from google.colab import auth
  print("Authorize for GCS:")
  if not LINK_AUTHORIZATION: 
    auth.authenticate_user()
  else: 
    !gcloud auth login --no-launch-browser
  print("Authorize done")

  %tensorflow_version 1.x
import sys
import json
import random
import logging
import tensorflow as tf
import time
import os
import shutil
import importlib
import re
from tqdm import tqdm

if REPO_DESTINATION_PATH == "mutformer":
  if os.path.exists("mutformer_code"):
    shutil.rmtree("mutformer_code")
  shutil.copytree(REPO_DESTINATION_PATH,"mutformer_code")
  REPO_DESTINATION_PATH = "mutformer_code"
if not os.path.exists("mutformer"):
  shutil.copytree(REPO_DESTINATION_PATH+"/mutformer_model_code","mutformer")
else:
  shutil.rmtree("mutformer")
  shutil.copytree(REPO_DESTINATION_PATH+"/mutformer_model_code","mutformer")
if "mutformer" in sys.path:
  sys.path.remove("mutformer")
sys.path.append("mutformer")

from mutformer import modeling, optimization, tokenization,run_classifier,run_ner_for_pathogenic  #### <<<<< if you added more modes, change these imports to import the correct processors         
from mutformer.run_classifier import MrpcProcessor,REProcessor,MrpcWithExDataProcessor            #### <<<<< and correct training scripts (i.e. run_classifier and run_ner_for_pathogenic)
from mutformer.run_ner_for_pathogenic import NERProcessor                                

##reload modules so that you don't need to restart the runtime to reload modules in case that's needed
modules2reload = [modeling, 
                  optimization, 
                  tokenization,
                  run_classifier,
                  run_ner_for_pathogenic]
for module in modules2reload:
    importlib.reload(module)

# configure logging
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

log.handlers = []

formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')

ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
ch.setFormatter(formatter)
log.addHandler(ch)


if MODE=="MRPC":      ####       vvvvv if you added more modes, change this part to set the processors and training scripts correctly vvvvv
  processor = run_classifier.MrpcProcessor()
  script = run_classifier
  USE_EX_DATA = False
elif MODE=="MRPC_w_ex_data":
  processor = run_classifier.MrpcWithExDataProcessor()
  script = run_classifier
  USE_EX_DATA = True
elif MODE=="RE":
  processor = run_classifier.REProcessor()
  script = run_classifier
  USE_EX_DATA = False
elif MODE=="NER":
  processor = run_ner_for_pathogenic.NERProcessor()
  script = run_ner_for_pathogenic
  USE_EX_DATA = False
else:
  raise Exception("The mode specified was not one of the available modes: [\"MRPC\", \"MRPC_w_ex_data\", \"RE\", \"NER\"].")
label_list = processor.get_labels()
tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=False)
                      ####       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
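
If more modes are added, the if/elif chain above grows by one branch per mode. One alternative, sketched below, is a lookup table keyed by mode name. This is only an illustration: `StubProcessor`, `MODE_REGISTRY`, and `resolve_mode` are hypothetical names, the stub stands in for the real processor classes so the sketch runs without the MutFormer repo, and the training scripts are represented by their module names as strings.

```python
class StubProcessor:
    """Hypothetical stand-in for a MutFormer processor class."""

    def __init__(self, name):
        self.name = name


# mode -> (processor factory, training script module name, uses external data)
MODE_REGISTRY = {
    "MRPC": (lambda: StubProcessor("MrpcProcessor"), "run_classifier", False),
    "MRPC_w_ex_data": (lambda: StubProcessor("MrpcWithExDataProcessor"), "run_classifier", True),
    "RE": (lambda: StubProcessor("REProcessor"), "run_classifier", False),
    "NER": (lambda: StubProcessor("NERProcessor"), "run_ner_for_pathogenic", False),
}


def resolve_mode(mode):
    """Return (processor, script name, use_ex_data) or fail with the full list of modes."""
    if mode not in MODE_REGISTRY:
        raise ValueError(f"Unknown mode {mode!r}; available modes: {sorted(MODE_REGISTRY)}")
    make_processor, script_name, use_ex_data = MODE_REGISTRY[mode]
    return make_processor(), script_name, use_ex_data


processor_example, script_example, use_ex_data_example = resolve_mode("MRPC_w_ex_data")
print(processor_example.name, script_example, use_ex_data_example)
```

With a table like this, the error message always lists every registered mode, so it cannot drift out of date when a mode is added.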


Authorize for GCS:
Authorize done
After that, `%tensorflow_version 1.x` will throw an error.

Your notebook should be updated to use Tensorflow 2.
See the guide at https://www.tensorflow.org/guide/migrate#migrate-from-tensorflow-1x-to-tensorflow-2.

TensorFlow 1.x selected.



# Data Generation

###General setup and definitions

In [4]:
#@markdown Maximum batch size the finetuning_benchmark script can handle without OOM (must be divisible by NUM_TPU_CORES_WHEN_TESTING):
MAX_BATCH_SIZE =  1024 #@param {type:"integer"}
#@markdown How many tpu cores will be used during evaluation and prediction (for colab runtimes, it's 8):
NUM_TPU_CORES_WHEN_TESTING = 8 #@param {type:"integer"}

def generate_data(MAX_SEQ_LENGTH,
                  data_folder_current,
                  DATA_GCS_DIR,
                  PRECISE_TESTING,
                  USING_SHARDS,
                  START_SHARD,
                  AUGMENT_COPIES_TRAIN,
                  SHARD_SIZE,
                  GENERATE_SETS):

  try:
    print("\nUpdating and uploading data info json...\n")
    DATA_INFO["sequence_length"] = MAX_SEQ_LENGTH    ##update data info with sequence length

    if USE_EX_DATA:                         ##if using external data, update data 
      def get_ex_data_num(file):            ##info with the # of external datapoints being used
        with tf.gfile.Open(file) as filein: 
          for line in filein:               ##use the first non-empty line (stops at end of file)
            line = line.strip()
            if line:
              ex_data = line.split("\t")[3].split()
              return len(ex_data)
        raise Exception("no data found in "+file)
      DATA_INFO["ex_data_num"] = get_ex_data_num(data_folder_current+"/"+tf.io.gfile.listdir(data_folder_current)[0])
    
    with tf.gfile.Open(DATA_GCS_DIR+"/info.json","w+") as out: ##writes out a dictionary containing
      json.dump(DATA_INFO,out,indent=2)                           ##the dataset's parameters
    print("Data info json uploaded successfully")
  except Exception as e:
    print("could not update and upload data info json. Error:",e)

              
  def get_or_create_shards(infile, SHARD_SIZE, START_SHARD, END_SHARD):
      shard_files = []
      with tf.gfile.Open(infile) as filein:
          shard_ind = START_SHARD
          read_ind = -1
          while True:
              current_start_line = shard_ind * SHARD_SIZE
              if shard_ind == END_SHARD: break
              shard_file = f"{infile}_(shardsize_{SHARD_SIZE})_shard_{shard_ind}"
              shard_files.append([shard_file, shard_ind])
              if not tf.io.gfile.exists(shard_file):
                  with tf.gfile.Open(shard_file, "w+") as shardout:
                      wroteout = 0
                      for line in tqdm(filein, f"creating shard number {shard_ind}"):
                          if not line.strip():
                              continue
                          read_ind += 1
                          if read_ind < current_start_line:
                              continue
                          shardout.write(line)
                          wroteout += 1
                          if wroteout == SHARD_SIZE:
                              break
                      if wroteout == 0:
                          shardout.close()
                          del shard_files[-1]
                          break
                      if wroteout < SHARD_SIZE:
                          break
              shard_ind += 1
      return shard_files
  

  DO_TRAIN, DO_DEV, DO_TEST = GENERATE_SETS
  

  if DO_TRAIN:
    try:
      print("\nGenerating train set...\n")
      train_data_input_file = processor.get_train_file(data_folder_current)
      if USING_SHARDS:
        shards = get_or_create_shards(train_data_input_file,SHARD_SIZE//(AUGMENT_COPIES_TRAIN+1),START_SHARD,END_SHARD)
      else:
        shards = [[train_data_input_file,None]]   ##one (file, shard index) pair
      for shard,shard_ind in shards:
        if USING_SHARDS: print(f"generating data for shard number {shard_ind}")
        train_examples = processor._create_examples(processor._read_tsv(shard),"train")
        if len(train_examples) == 0:
          raise Exception("no data present in the train dataset")
        train_file = os.path.join(DATA_GCS_DIR, "train.tf_record")
        if USING_SHARDS:
          train_file+="_"+str(shard_ind)
        script.file_based_convert_examples_to_features(train_examples, 
                                                      label_list, 
                                                      MAX_SEQ_LENGTH, 
                                                      tokenizer, 
                                                      train_file,
                                                      augmented_data_copies=AUGMENT_COPIES_TRAIN,
                                                      shuffle_data=True)
    except Exception as e:
      print("train dataset generation failed. Error:",e)

  if DO_DEV:
    try:
      print("\nGenerating dev set...\n")
      dev_data_input_file = processor.get_dev_file(data_folder_current)
      if USING_SHARDS:
        shards = get_or_create_shards(dev_data_input_file,SHARD_SIZE,START_SHARD,END_SHARD)
      else:
        shards = [[dev_data_input_file,None]]   ##one (file, shard index) pair
      for shard,shard_ind in shards:
        if USING_SHARDS: print(f"generating data for shard number {shard_ind}")
        dev_examples = processor._create_examples(processor._read_tsv(shard),"dev")
        if len(dev_examples) == 0:
          raise Exception("no data present in the dev dataset")
        dev_file = os.path.join(DATA_GCS_DIR, "dev.tf_record")
        if USING_SHARDS:
          dev_file+="_"+str(shard_ind)
        script.file_based_convert_examples_to_features(dev_examples, 
                                                      label_list, 
                                                      MAX_SEQ_LENGTH, 
                                                      tokenizer, 
                                                      dev_file)
    except Exception as e:
      print("dev dataset generation failed. Error:",e)

  if DO_TEST:
    try:
      print("\nGenerating test set...\n")
      ##collect the dataset names from all files named test_<dataset>.tsv
      datasets = [m.group(1) for m in (re.match(r"test_(\w+)\.tsv",file) for file in tf.io.gfile.listdir(data_folder_current)) if m]
      if not datasets:
        datasets = [None]
      for dataset in datasets:
        if dataset: print(f"Processing dataset: {dataset}")
        test_data_input_file = processor.get_test_file(data_folder_current,dataset=dataset)
        if USING_SHARDS:
          shards = get_or_create_shards(test_data_input_file,SHARD_SIZE,START_SHARD,END_SHARD)
        else:
          shards = [[test_data_input_file,None]]   ##one (file, shard index) pair
        for n,(shard,shard_ind) in enumerate(shards):
          if USING_SHARDS: print(f"generating data for shard number {shard_ind}")
          test_examples = processor._create_examples(processor._read_tsv(shard),"test")
          if len(test_examples) == 0:
            raise Exception("no data present in the test dataset")
          test_file = os.path.join(DATA_GCS_DIR, f"test_{dataset}.tf_record" if dataset else "test.tf_record")
          if USING_SHARDS:
            test_file+="_"+str(shard_ind)
          ## if using precise testing, the data will be split into two sets: 
          ## one set will be able to be predicted on the maximum possible batch 
          ## size, while the other will be predicted on a batch size of 1, to 
          ## ensure the fastest prediction without leaving out any datapoints
          if PRECISE_TESTING and n==len(shards)-1:
            test_file_trailing = os.path.join(DATA_GCS_DIR, f"test_trailing_{dataset}.tf_record" if dataset else "test_trailing.tf_record")
            def largest_multiple_under_max(max_value,multiple_base):
              ##integer division avoids float rounding and shadowing the builtin max
              return (max_value//multiple_base)*multiple_base

            split = largest_multiple_under_max(len(test_examples),MAX_BATCH_SIZE)
            test_examples_head = test_examples[:split]
            test_examples_trailing = test_examples[split:]
            script.file_based_convert_examples_to_features(test_examples_head, 
                                                           label_list, 
                                                           MAX_SEQ_LENGTH, 
                                                           tokenizer, 
                                                           test_file)
            if test_examples_trailing:
              script.file_based_convert_examples_to_features(test_examples_trailing, 
                                                            label_list, 
                                                            MAX_SEQ_LENGTH, 
                                                            tokenizer, 
                                                            test_file_trailing)
          else:
            script.file_based_convert_examples_to_features(test_examples, 
                                                           label_list, 
                                                           MAX_SEQ_LENGTH, 
                                                           tokenizer, 
                                                           test_file)
    except Exception as e:
      print("test dataset generation failed. Error:",e)
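
The precise testing split described in the comments above can be checked with a few lines of arithmetic. This is a minimal sketch, not the code the notebook runs: `split_for_precise_testing` is a hypothetical helper, and the example counts are the dataset sizes that appear in the logs below, with a batch size of 1024.

```python
def split_for_precise_testing(num_examples, max_batch_size):
    """Return (head, trailing): head is the largest multiple of max_batch_size
    that fits in num_examples; trailing is whatever is left over."""
    head = (num_examples // max_batch_size) * max_batch_size
    return head, num_examples - head


# Dataset sizes taken from the "reading tsv" lines in the logs below.
splits = {n: split_for_precise_testing(n, 1024) for n in (2324, 91, 9981)}
for n, (head, trailing) in splits.items():
    print(f"{n} examples -> {head} in the main file, {trailing} in the trailing file")
```

The head can be predicted at the full batch size, while the trailing examples are predicted with a batch size of 1, so no datapoint is dropped. A dataset smaller than one batch, such as the one with 91 examples, ends up entirely in the trailing file.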


###Generation ops

There are currently two data generation ops (more can be added):
1. Varying sequence lengths: multiple sets of different sequence lengths will be generated
  * Store multiple individual datasets as subfolders inside of Input finetuning data folder, with each folder named after its corresponding sequence length.
2. Only one dataset: a single dataset with a specified set of parameters will be generated 
  * Directly store only the files train.tsv, dev.tsv, and test.tsv for one dataset inside Input finetuning data folder
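
As a rough illustration of the two layouts, the sketch below builds the input and output folders each op would use. The function name `dataset_paths` and the bucket and folder names are hypothetical placeholders; only the way the paths are joined mirrors the cells that follow.

```python
def dataset_paths(input_dir, bucket_path, output_dir, seq_lengths=None):
    """Return a list of (sequence length, input folder, output folder) tuples.

    With seq_lengths, each length gets its own subfolder on both sides;
    without it, the input folder holds a single dataset directly.
    """
    if not seq_lengths:
        return [(None, input_dir, f"{bucket_path}/{output_dir}")]
    return [(length, f"{input_dir}/{length}", f"{bucket_path}/{output_dir}/{length}")
            for length in seq_lengths]


# Placeholder names, not real buckets.
varying = dataset_paths("gs://my_bucket/raw_data", "gs://my_bucket", "processed", [256, 1024])
single = dataset_paths("gs://my_bucket/raw_data", "gs://my_bucket", "processed")
for entry in varying + single:
    print(entry)
```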

####Varying sequence lengths

In [None]:
#@markdown List of maximum sequence lengths to generate data for
MAX_SEQ_LENGTHS = [1024] #@param
#@markdown Whether or not to ensure all datapoints are used during prediction by using an extra trailing test dataset so no datapoints will be skipped due to the batch size. (This option should be used unless an extra trailing test dataset is a large problem)
PRECISE_TESTING = True #@param {type:"boolean"}
#@markdown Whether or not to split the data processing into shards (only needed for really large databases, since finetuning data typically isn't that large)
USING_SHARDS = False #@param {type:"boolean"}
#@markdown If USING_SHARDS, what shard size to use (how many lines/datapoints should be in each shard) (MUST BE DIVISIBLE BY "MAX_BATCH_SIZE") (if using data augmentation, size indicates the size of augmented data)
SHARD_SIZE = 1024000 #@param {type:"integer"}
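#@markdown If USING_SHARDS, which shard to start at (typically should start at shard 0)
START_SHARD = 0 #@param {type:"integer"}
#@markdown If USING_SHARDS, which shard to process until (not inclusive) (-1 for last shard); generate_data reads this value as a global, so it must be defined before the call below
END_SHARD = -1 #@param {type:"integer"}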
#@markdown Which sets to generate out of train, dev, and test
TRAIN = False #@param {type:"boolean"}
DEV = False #@param {type:"boolean"}
TEST = True #@param {type:"boolean"}
#@markdown How many additional augmented copies to load:
AUGMENT_COPIES_TRAIN =  0#@param{type:"integer"}

for MAX_SEQ_LENGTH in MAX_SEQ_LENGTHS:
  print("\n\nGenerating data for seq length:",MAX_SEQ_LENGTH,"\n\n")
  DATA_GCS_DIR = BUCKET_PATH+"/"+OUTPUT_DATA_DIR +"/"+ str(MAX_SEQ_LENGTH)
  data_folder_current= INPUT_DATA_DIR+"/"+str(MAX_SEQ_LENGTH)

  generate_data(MAX_SEQ_LENGTH,
                data_folder_current,
                DATA_GCS_DIR,
                PRECISE_TESTING,
                USING_SHARDS,
                START_SHARD,
                AUGMENT_COPIES_TRAIN,
                SHARD_SIZE,
                [TRAIN,DEV,TEST])
  



Generating data for seq length: 1024 



Updating and uploading data info json...

Data info json uploaded successfully

Generating train set...



reading tsv: 85380it [00:14, 5891.11it/s]
creating_examples: 100%|██████████| 85380/85380 [00:00<00:00, 172598.20it/s]


2022-06-04 21:29:36,358 - tensorflow - INFO - Writing example 0 of 85380
2022-06-04 21:29:36,374 - tensorflow - INFO - *** Example ***
2022-06-04 21:29:36,375 - tensorflow - INFO - guid: train-52943
2022-06-04 21:29:36,378 - tensorflow - INFO - tokens (length = 1023): [CLS] Y L G R V R T T T I G E P E N K S K Q N E M L V A A A A V G V A T V F A A P F S G V L F S I E V M S S H F S V R D Y W R G F F A A T C G A F I F R L L A V F N S E Q E T I T S L Y K T S F R V D V P F D L P E I F F F V A L G G I C G V L S C A Y L F C Q R T F L S F I K T N R Y S S K L L A T S K P V Y S A L A T L L L A S I T Y P P G V G H F L A S R L S M K Q H L D S L F D N H S W A L M T Q N S S P P W P E E L D P Q H L W W E W Y H P R F T I F G T L A F F L V M K F W M L I L A T T I P M P A G Y F M P I F I L G A A I G R L L G E A L A V A F P E G I V T G G V T N P I M P G G Y A L A G A A A F S G A V T H 

shuffling examples...


2022-06-04 21:29:36,478 - tensorflow - INFO - input_ids (length = 1024): 2 7 25 10 9 25 13 6 15 13 14 12 10 13 15 19 10 6 20 14 18 15 15 5 12 16 10 18 9 15 14 5 18 15 5 13 15 10 17 9 12 15 17 9 18 21 9 18 24 14 18 20 21 20 15 18 5 12 9 13 12 16 5 18 15 17 5 16 12 23 5 10 6 14 15 10 25 22 9 10 6 15 15 5 20 9 24 5 18 9 14 23 9 11 17 26 11 13 16 17 9 10 20 15 19 10 9 20 20 17 5 5 26 25 17 23 22 18 15 5 14 17 18 10 5 5 12 25 17 12 23 5 13 18 21 11 17 19 15 6 16 19 10 15 16 13 16 15 5 14 17 23 17 6 10 16 22 22 23 9 6 5 18 12 10 15 15 15 17 9 10 15 19 10 15 10 9 9 9 5 19 15 10 18 15 14 21 9 9 25 20 14 17 5 18 9 9 5 11 6 5 26 20 6 16 14 13 21 23 14 20 12 21 18 6 19 10 13 5 9 9 20 21 22 15 9 25 6 15 5 20 18 20 5 20 17 14 5 14 13 5 9 15 18 22 13 6 20 12 21 12 14 15 10 18 11 16 15 15 6 15 5 21 6 16 5 16 16 15 15 20 6 17 20 10 11 10 15 13 20 15 6 11 6 11 11 17 13 6 11 10 10 12 11 9 19 16 14 20 22 9 11 9 11 10 13 13 10 12 11 13 10 12 5 11 15 6 11 6 18 6 6 5 11 10 14 14 14 9 12 21 11 10 12 14 20 13


Generating dev set...



reading tsv: 7702it [00:01, 4943.10it/s] 
creating_examples: 100%|██████████| 7702/7702 [00:00<00:00, 198092.50it/s]
2022-06-04 21:40:26,425 - tensorflow - INFO - Writing example 0 of 7702
2022-06-04 21:40:26,437 - tensorflow - INFO - *** Example ***
2022-06-04 21:40:26,440 - tensorflow - INFO - guid: dev-0
2022-06-04 21:40:26,445 - tensorflow - INFO - tokens (length = 929): [CLS] B M T P N S M T E N G L T A W D K P K H C P D R E H D W K L V G M S E A C L H R K S H S E R R S T L K N E Q S S P H L I Q T T W T S S I F H L D H D D V N D Q S V S S A Q T F Q T E E K K C K G Y I P S Y L D K D E L C V V C G D K A T G Y H Y R C I T C E G C K G F F R R T I Q K N L H P S Y S C K Y E G K C V I D K V T R N Q C Q E C R F K K C I Y V G M A T D L V L D D S K R L A K R K L I E E N R E K R R R E E L Q K S I G H K P E P T D E E W E L I K T V T E A H V A T N A Q G S H W K Q K R K F L P E D I G Q A P I V N A P E G G K V D L E A F S H F T K I I T P A I T R V V D F A K K L P M F C E L P C E D Q I I L L K G 


Generating test set...

Dataset: MVP_otherset


reading tsv: 2324it [00:01, 1943.81it/s]
creating_examples: 100%|██████████| 2324/2324 [00:00<00:00, 158242.22it/s]
2022-06-04 21:41:28,866 - tensorflow - INFO - Writing example 0 of 2048
2022-06-04 21:41:28,878 - tensorflow - INFO - *** Example ***
2022-06-04 21:41:28,881 - tensorflow - INFO - guid: test-0
2022-06-04 21:41:28,885 - tensorflow - INFO - tokens (length = 1023): [CLS] Q V N M E L A K I K Q K C P L Y E A N G Q A D T V K V P K E K D E M V E Q E F N R L L E A T S Y L S H Q L D F N V L N N K P V S L G Q A L E V V I Q L Q E K H V K D E Q I E H W K K I V K T Q E E L K E L L N K M V N L K E K I K E L H Q Q Y K E A S E V K P P R D I T A E F L V K S K H R D L T A L C K E Y D E L A E T Q G K L E E K L Q E L E A N P P S D V Y L S S R D R Q I L D W H F A N L E F A N A T P L S T L S L K H W D Q D D D F E F T G S H L T V R N G Y S C V P V A L A E G L D I K L N T A V R Q V R Y T A S G C E V I A V N T R S T S Q T F I Y K C D A V L C T L P L G V L K Q Q P P A V Q F V P P L P E W K T S A V

Dataset: MVP_wangpreviousset1


reading tsv: 91it [00:00, 138.24it/s]
creating_examples: 100%|██████████| 91/91 [00:00<00:00, 127397.08it/s]
2022-06-04 21:41:49,005 - tensorflow - INFO - Writing example 0 of 91
2022-06-04 21:41:49,014 - tensorflow - INFO - *** Example ***
2022-06-04 21:41:49,017 - tensorflow - INFO - guid: test-0
2022-06-04 21:41:49,024 - tensorflow - INFO - tokens (length = 385): [CLS] B M T E Y K L V V V G A G G V G K S A L T I Q L I Q N H F V D E Y D P T I E D S Y R K Q V V I D G E T C L L D I L D T A G Q E E Y S A M R D Q Y M R T G E G F L C V F A I N N T K S F E D I H Q Y R E Q I K R V K D S D D V P M V L V G N K C D L A A R T V E S R Q A Q D L A R S Y G I P Y I E T S A K T R Q G V E D A F Y T L V R E I R Q H K L R K L N P P D E S G P G C M S C K C V L S J [SEP] B M T E Y K L V V V G A G R V G K S A L T I Q L I Q N H F V D E Y D P T I E D S Y R K Q V V I D G E T C L L D I L D T A G Q E E Y S A M R D Q Y M R T G E G F L C V F A I N N T K S F E D I H Q Y R E Q I K R V K D S D D V P M V L V G N K C

Dataset: MVP_wangpreviousset2


reading tsv: 9981it [00:02, 4613.18it/s]
creating_examples: 100%|██████████| 9981/9981 [00:00<00:00, 184476.92it/s]
2022-06-04 21:41:52,771 - tensorflow - INFO - Writing example 0 of 9216
2022-06-04 21:41:52,790 - tensorflow - INFO - *** Example ***
2022-06-04 21:41:52,794 - tensorflow - INFO - guid: test-0
2022-06-04 21:41:52,798 - tensorflow - INFO - tokens (length = 1023): [CLS] G H S C L R A L S P F A E S S Q L K G Q T G V T T S F S L F I D K T T G H F L C M T S L A E G S W E D F Q A S V E G R G D G A R E G F L L S K A P E F E D S E E V R R I W N R A I P L W E L P D Q E E V Q L A D T M F G L T K V T D D T L K R F S V R Y L R P A R S L V F P W F S P G G S G L R G L K L L E A K C Q G D G V S Y E E T T I P R P S A Y H N L F G L P L I S R R D A E V V L T S R E L D S L A L N Q S T G L P T L T L P R G T T C L P P A L L P Y L E Q F R R I V F W L G D D L R S W E A A K L F A R K L N P K R C F L V R P G D Q Q P R P L E A L N G G F N L S R I L R T A L P A W H K S I V S F R Q L R E E V L G E L

Dataset: MVP_wangpreviousset3


reading tsv: 2422it [00:00, 2437.59it/s]
creating_examples: 100%|██████████| 2422/2422 [00:00<00:00, 113675.42it/s]
2022-06-04 21:43:07,252 - tensorflow - INFO - Writing example 0 of 2048
2022-06-04 21:43:07,268 - tensorflow - INFO - *** Example ***
2022-06-04 21:43:07,273 - tensorflow - INFO - guid: test-0
2022-06-04 21:43:07,275 - tensorflow - INFO - tokens (length = 1001): [CLS] B M A A A G E G T P S S R G P R R D P P R R P P R N G Y G V Y V Y P N S F F R Y E G E W K A G R K H G H G K L L F K D G S Y Y E G A F V D G E I T G E G R R H W A W S G D T F S G Q F V L G E P Q G Y G V M E Y K A G G C Y E G E V S H G M R E G H G F L V D R D G Q V Y Q G S F H D N K R H G P G Q M L F Q N G D K Y D G D W V R D R R Q G H G V L R C A D G S T Y K G Q W H S D V F S G L G S M A H C S G V T Y Y G L W I N G H P A E Q A T R I V I L G P E V M E V A Q G S P F S V N V Q L L Q D H G E I A K S E S G R V L Q I S A G V R Y V Q L S A Y S E V N F F K V D R D N Q E T L I Q T P F G F E C I P Y P V S S P A A G V P

Dataset: varibench_PPARG


reading tsv: 8099it [00:01, 4403.99it/s]
creating_examples: 100%|██████████| 8099/8099 [00:00<00:00, 176946.55it/s]
2022-06-04 21:43:28,103 - tensorflow - INFO - Writing example 0 of 7168
2022-06-04 21:43:28,120 - tensorflow - INFO - *** Example ***
2022-06-04 21:43:28,123 - tensorflow - INFO - guid: test-0
2022-06-04 21:43:28,124 - tensorflow - INFO - tokens (length = 1023): [CLS] B M D N D D F F S M D F K E V V E N L V T N D N S P N I P E A I D R L F S D I A N I N R E S M A E I T D I Q I E E M A V N L W N W A L T I G G G W L V N E E Q K I R L H Y V A C K L L S M C E A S F A S E Q S I Q R L I M M N M R I G K E W L D A G N F L I A D E C F Q A A V A S L E Q L Y V K L I Q R S S P E A D L T M E K I T V E S D H F R V L S Y Q A E S A V A Q G D F Q R A S M C V L Q C K D M L M R L P Q M T S S L H H L C Y N F G V E T Q K N N K Y E E S S F W L S Q S Y D I G K M D K K S T G P E M L A K V L R L L A T N Y L D W D D T K Y Y D K A L N A V N L A N K E H L S S P G L F L K M K I L L K G E T S N E E L L

Dataset: varibench_TP53


reading tsv: 7949it [00:01, 4536.31it/s]
creating_examples: 100%|██████████| 7949/7949 [00:00<00:00, 153560.17it/s]
2022-06-04 21:44:31,952 - tensorflow - INFO - Writing example 0 of 7168
2022-06-04 21:44:31,971 - tensorflow - INFO - *** Example ***
2022-06-04 21:44:31,973 - tensorflow - INFO - guid: test-0
2022-06-04 21:44:31,975 - tensorflow - INFO - tokens (length = 1023): [CLS] Q L H M Q L E I Q K K E S T T R L Q E L E Q E N K L F K D D M E K L G L A I K E S D A M S T Q D Q H V L F G K F A Q I I Q E K E V E I D Q L N E Q V T K L Q Q Q L K I T T D N K V I E E K N E L I R D L E T Q I E C L M S D Q E C V K R N R E E E I E Q L N E V I E K L Q Q E L A N I G Q K T S M N A H S L S E E A D S L K H Q L D V V I A E K L A L E Q Q V E T A N E E M T F M K N V L K E T N F K M N Q L T Q E L F S L K R E R E S V E K I Q S I P E N S V N V A I D H L S K D K P E L E V V L T E D A L K S L E N Q T Y F K S F E E N G K G S I I N L E T R L L Q L E S T V S A K D L E L T Q C Y K Q I K D M Q E Q G Q F E T E M

###Only one dataset

In [5]:
#@markdown Maximum output data length (when using paired method, actual protein sequence length is about half of this value):
MAX_SEQ_LENGTH = 1024 #@param {type:"integer"}
#@markdown Whether or not to ensure all datapoints are used during prediction by using an extra trailing test dataset so no datapoints will be skipped due to the batch size. (This option should be used most of the time unless an extra trailing test dataset is a large problem)
PRECISE_TESTING = True #@param {type:"boolean"}
#@markdown Whether or not to split the data processing into shards (only for really large databases, since finetuning data typically isn't that large)
USING_SHARDS = True #@param {type:"boolean"}
#@markdown If USING_SHARDS, what shard size to use (how many lines/datapoints should be in each shard) (MUST BE DIVISIBLE BY "MAX_BATCH_SIZE")
SHARD_SIZE = 1024000 #@param {type:"integer"}
#@markdown * If USING_SHARDS, set this value to indicate which shard to start processing at (default 0 for first shard)
START_SHARD = 53 #@param {type:"integer"}
#@markdown * If USING_SHARDS, set this value to indicate which shard to process until (not inclusive) (default -1 for last shard)
END_SHARD = 54 #@param {type:"integer"}
#@markdown Which sets to generate out of train, dev, and test
TRAIN = False #@param {type:"boolean"}
DEV = False #@param {type:"boolean"}
TEST = True #@param {type:"boolean"}
#@markdown How many additional augmented copies to load:
AUGMENT_COPIES_TRAIN =  0#@param{type:"integer"}

DATA_GCS_DIR = BUCKET_PATH+"/"+OUTPUT_DATA_DIR
data_folder_current = INPUT_DATA_DIR

generate_data(MAX_SEQ_LENGTH,
              data_folder_current,
              DATA_GCS_DIR,
              PRECISE_TESTING,
              USING_SHARDS,
              START_SHARD,
              AUGMENT_COPIES_TRAIN,
              SHARD_SIZE,
              [TRAIN,DEV,TEST])
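
To see which part of the input a given shard covers, the arithmetic can be sketched as below. The helpers `shard_line_range` and `source_lines_per_train_shard` are hypothetical; the shard file name follows the pattern used in `get_or_create_shards`, and line indices count non-empty lines starting at 0. The input path is a placeholder.

```python
def shard_line_range(shard_ind, shard_size):
    """Half-open range [start, end) of non-empty input lines covered by a shard."""
    start = shard_ind * shard_size
    return start, start + shard_size


def shard_file_name(infile, shard_size, shard_ind):
    """Name of the shard file written next to the input file."""
    return f"{infile}_(shardsize_{shard_size})_shard_{shard_ind}"


def source_lines_per_train_shard(shard_size, augment_copies):
    """For the train set, the shard size refers to the augmented data, so fewer
    source lines are read per shard when augmented copies are generated."""
    return shard_size // (augment_copies + 1)


line_range = shard_line_range(53, 1024000)
name = shard_file_name("gs://my_bucket/raw_data/test.tsv", 1024000, 53)
print(line_range)
print(name)
print(source_lines_per_train_shard(1024000, 3))
```

With START_SHARD = 53 and END_SHARD = 54, only shard 53 is processed, which covers the 1,024,000 lines starting at line 54,272,000.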



Updating and uploading data info json...

Data info json uploaded successfully

Generating test set...

generating data for shard number 53


reading tsv: 1024000it [01:27, 11670.80it/s]
creating_examples: 100%|██████████| 1024000/1024000 [00:05<00:00, 187778.34it/s]


2022-07-18 01:56:30,655 - tensorflow - INFO - Writing example 0 of 1024000
2022-07-18 01:56:30,671 - tensorflow - INFO - *** Example ***
2022-07-18 01:56:30,674 - tensorflow - INFO - guid: test-0
2022-07-18 01:56:30,678 - tensorflow - INFO - tokens (length = 1023): [CLS] E E E E E E D L I D G F A I A S F A T L E A L Q K D A S L Q P P E R L E H R L K H S G K R K R G G S S G A T G E P G D S S D R E P G R P P G D R A R K W P N K R R R K E A S S R H S L E A G Y I C D A E S D L D E R V S D D D L D P S F T V S T S K A S G P H G A F N G N C E A K L S V V P K V S G L E R S Q E Q P P G P D P L L V P F P P K E P P P P P V P R P P V S P P A P L P A T P S L P P P P Q P Q L Q L R V S P F G L R T S P Y G S S L D L S T G S S S R P P P K A P A P P V A Q P P P S S S S S S S S S S S A S S S S A Q L T H R P P T P S L P L P L S T H S F P P P G L R P P P P P H H P S L F S P G P T 