# Finetuning Data Generation Script

This notebook processes TSV data and uploads the processed data to GCS, where it is used for finetuning MutFormer.

# Downgrade Python and TensorFlow

(The default Python version in Colab does not support TensorFlow 1.15.)

* **Note:** because the Python interpreter used in this notebook is not the one on Colab's default path, syntax highlighting will most likely not work.

#### 1. First, download and install Python 3.7:

In [None]:
!wget -O mini.sh https://repo.anaconda.com/miniconda/Miniconda3-py37_22.11.1-1-Linux-x86_64.sh
!chmod +x mini.sh
!bash ./mini.sh -b -f -p /usr/local
!conda install -q -y jupyter
!conda install -q -y google-colab -c conda-forge
!python -m ipykernel install --name "py37" --user

#### 2. Then, reload the webpage (do not restart the runtime) so that Colab recognizes the newly installed Python
#### 3. Finally, run the following command to install TensorFlow 1.15:

In [None]:
!python3 -m pip install tensorflow==1.15

# Configure settings/Mount Drive if needed

In [None]:
#@markdown ## General Config
#@markdown Whether or not this script is being run in a GCP runtime (if more memory is required for large databases)
GCP_RUNTIME = False #@param {type:"boolean"}
#@markdown Which mode to use (a different mode means a different finetuning task): options are:
#@markdown * "MRPC" - paired sequence method
#@markdown * "MRPC_w_ex_data" - paired sequence method with external data
#@markdown * "RE" - single sequence method
#@markdown * "NER" - single sequence per residue prediction
#@markdown 
#@markdown You can add more modes by creating a new processor and/or a new model_fn inside of the "mutformer_model_code" folder downloaded from github, then changing the corresponding code snippets in the code segment named "Authorize for GCS, Imports, and General Setup" (also edit the dropdown below).
MODE = "MRPC_w_ex_data" #@param   ["MRPC_w_ex_data", "MRPC", "RE", "NER"]   {type:"string"} 
            ####      ^^^^^ dropdown list for all modes ^^^^^

#@markdown Name of the GCS bucket to use (Make sure to set this to the name of your own GCS  bucket):
BUCKET_NAME = "" #@param {type:"string"}
BUCKET_PATH = "gs://"+BUCKET_NAME
#@markdown \
#@markdown 
#@markdown 
#@markdown ## IO Config
#@markdown Input finetuning data folder: data will be read from here to be processed and uploaded to GCS (can be a drive path, or a GCS path if needed for large databases; must be a GCS path if using GCP_RUNTIME):
#@markdown 
#@markdown * For processing multiple sets i.e. for multiple sequence lengths, simply store these sets into separate subfolders inside of the folder listed below, with each subfolder being named as specified in the following section.
#@markdown 
#@markdown * For processing a single set, this folder should directly contain one dataset.
#@markdown
INPUT_DATA_DIR = "gs://theodore_jiang/updated_all_snp_prediction_data" #@param {type: "string"}


if "/content/drive" in INPUT_DATA_DIR:     ##if INPUT_DATA_DIR is a drive path,
  if GCP_RUNTIME:                          ##it cannot be used from a GCP runtime
    raise Exception("if GCP_RUNTIME, a GCS path must be used, since Google's cloud TPUs can only communicate with GCS and not drive")
  from google.colab import drive           ##otherwise, mount google drive
  !fusermount -u /content/drive
  drive.flush_and_unmount()
  drive.mount('/content/drive', force_remount=True)


#@markdown Name of the folder in GCS to put processed data into: 
#@markdown * For generating multiple datasets i.e. for different sequence lengths, they will be written as individual subfolders inside of this folder.
OUTPUT_DATA_DIR = "all_snp_prediction_data_loaded" #@param {type:"string"}


DATA_INFO = {      ##dictionary that will be uploaded alongside 
    "mode":MODE    ##each dataset to indicate its parameters
}


#### Vocabulary for the model (MutFormer uses the vocabulary below): [PAD],
#### [UNK], [CLS], [SEP], and [MASK] are necessary default tokens; B and J
#### are markers for the beginning and ending of a protein sequence,
#### respectively; the rest are all possible amino acids, ranked
#### approximately by frequency of occurrence in the human population
#### vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
vocab = \
'''[PAD]
[UNK]
[CLS]
[SEP]
[MASK]
L
S
B
J
E
A
P
T
G
V
K
R
D
Q
I
N
F
H
Y
C
M
W'''
with open("vocab.txt", "w") as fo:
  for token in vocab.split("\n"):
    fo.write(token+"\n")
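
As a rough illustration of what the vocabulary file encodes, the sketch below maps tokens to integer ids. It assumes that ids are the zero-based line positions in `vocab.txt` (the usual convention for BERT-style vocab files) and that sequences are whitespace-separated; the helper `encode` and the name `VOCAB_TOKENS` are hypothetical and are not part of the MutFormer code.

```python
# Same token order as the vocab string written to vocab.txt above.
VOCAB_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
                "L", "S", "B", "J", "E", "A", "P", "T", "G", "V", "K",
                "R", "D", "Q", "I", "N", "F", "H", "Y", "C", "M", "W"]

# Assumption: a token's id is its zero-based line index in the vocab file.
token_to_id = {token: idx for idx, token in enumerate(VOCAB_TOKENS)}


def encode(sequence):
    """Map a whitespace-separated sequence to ids; unknown tokens become [UNK]."""
    unk_id = token_to_id["[UNK]"]
    return [token_to_id.get(token, unk_id) for token in sequence.split()]


# "B" and "J" mark the start and end of the protein; "X" is not in the vocab.
example_ids = encode("B M K X J")
print(example_ids)  # [7, 25, 15, 1, 8]
```

The real tokenizer (`tokenization.FullTokenizer`) may split and normalize text differently; this only shows how the token order in the file could translate into ids.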


# If running on a GCP runtime, follow these instructions to set it up:

### 1) Create a VM from the GCP website
### 2) Open a command prompt on your computer and perform the following steps:
To ssh into the VM, run:

```
gcloud beta compute ssh --zone <COMPUTE ZONE> <VM NAME> --project <PROJECT NAME> -- -L 8888:localhost:8888
```

Note: Make sure the port above matches the port below (in this case it's 8888)
\
\
In the new command prompt that opens, either run each of the commands below individually or copy and paste the one-liner below:
```
sudo apt-get update
sudo apt-get -y install python3 python3-pip
sudo apt-get -y install pkg-config
sudo apt-get -y install libhdf5-serial-dev
sudo apt-get -y install libffi6 libffi-dev
sudo -H pip3 install jupyter tensorflow==1.14 google-api-python-client tqdm
sudo -H pip3 install jupyter_http_over_ws
jupyter serverextension enable --py jupyter_http_over_ws
jupyter notebook   --NotebookApp.allow_origin='https://colab.research.google.com'   --port=8888   --NotebookApp.port_retries=0   --no-browser
```
One command:
```
sudo apt-get update ; sudo apt-get -y install python3 python3-pip ; sudo apt-get -y install pkg-config ; sudo apt-get -y install libhdf5-serial-dev ; sudo apt-get -y install libffi6 libffi-dev ; sudo -H pip3 install jupyter tensorflow==1.14 google-api-python-client tqdm ; sudo -H pip3 install jupyter_http_over_ws ; jupyter serverextension enable --py jupyter_http_over_ws ; jupyter notebook   --NotebookApp.allow_origin='https://colab.research.google.com'   --port=8888   --NotebookApp.port_retries=0   --no-browser
```
### 3) In this notebook, click the "connect to local runtime" option under the connect button, then copy and paste the link printed by the command prompt that contains "localhost: ..."

# Clone the MutFormer repo

In [None]:
if GCP_RUNTIME:
  !sudo apt-get -y install git-all
#@markdown Where to clone the repo into:
REPO_DESTINATION_PATH = "mutformer" #@param {type:"string"}
import os,shutil
if os.path.exists(REPO_DESTINATION_PATH):   ##remove any previous clone so the
  shutil.rmtree(REPO_DESTINATION_PATH)      ##destination folder starts out empty
os.makedirs(REPO_DESTINATION_PATH)
cmd = "git clone https://github.com/WGLab/mutformer.git \"" + REPO_DESTINATION_PATH + "\""
!{cmd}

# Authorize for GCS, Imports, and General Setup

In [None]:
#@markdown Whether to use link authorization for GCS (link authorization allows connection to another account other than the one running the script, while normal authorization disables connecting to different accounts):
LINK_AUTHORIZATION = False #@param {type:"boolean"}

if not GCP_RUNTIME:
  from google.colab import auth
  print("Authorize for GCS:")
  if not LINK_AUTHORIZATION: 
    auth.authenticate_user()
  else: 
    !gcloud auth login --no-launch-browser
  print("Authorize done")
  
import sys
import json
import random
import logging
import tensorflow as tf
import time
import os
import shutil
import importlib
import re
from tqdm import tqdm

if REPO_DESTINATION_PATH == "mutformer":
  if os.path.exists("mutformer_code"):
    shutil.rmtree("mutformer_code")
  shutil.copytree(REPO_DESTINATION_PATH,"mutformer_code")
  REPO_DESTINATION_PATH = "mutformer_code"
if os.path.exists("mutformer"):             ##replace any stale copy of the model code
  shutil.rmtree("mutformer")
shutil.copytree(REPO_DESTINATION_PATH+"/mutformer_model_code","mutformer")
if "mutformer" in sys.path:
  sys.path.remove("mutformer")
sys.path.append("mutformer")

from mutformer import modeling, optimization, tokenization,run_classifier,run_ner_for_pathogenic  #### <<<<< if you added more modes, change these imports to import the correct processors         
from mutformer.run_classifier import MrpcProcessor,REProcessor,MrpcWithExDataProcessor            #### <<<<< and correct training scripts (i.e. run_classifier and run_ner_for_pathogenic)
from mutformer.run_ner_for_pathogenic import NERProcessor                                

##reload modules so that you don't need to restart the runtime to reload modules in case that's needed
modules2reload = [modeling, 
                  optimization, 
                  tokenization,
                  run_classifier,
                  run_ner_for_pathogenic]
for module in modules2reload:
    importlib.reload(module)

# configure logging
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

log.handlers = []

formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')

ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
ch.setFormatter(formatter)
log.addHandler(ch)


if MODE=="MRPC":      ####       vvvvv if you added more modes, change this part to set the processors and training scripts correctly vvvvv
  processor = run_classifier.MrpcProcessor()
  script = run_classifier
  USE_EX_DATA = False
elif MODE=="MRPC_w_ex_data":
  processor = run_classifier.MrpcWithExDataProcessor()
  script = run_classifier
  USE_EX_DATA = True
elif MODE=="RE":
  processor = run_classifier.REProcessor()
  script = run_classifier
  USE_EX_DATA = False
elif MODE=="NER":
  processor = run_ner_for_pathogenic.NERProcessor()
  script = run_ner_for_pathogenic
  USE_EX_DATA = False
else:
  raise Exception("The mode specified was not one of the available modes: [\"MRPC\", \"MRPC_w_ex_data\", \"RE\", \"NER\"].")
label_list = processor.get_labels()
tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=False)
                      ####       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
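
When adding more modes, the if/elif chain above could instead be written as a lookup table, so that a new mode needs only one new entry. The sketch below is one possible arrangement, not the notebook's actual code: the `*Stub` classes are hypothetical stand-ins for the real processors so that the example runs without the MutFormer repo, and `MODE_REGISTRY` and `select_mode` are made-up names.

```python
# Hypothetical stand-ins for the real processor classes; in the notebook these
# would be the processors imported from run_classifier / run_ner_for_pathogenic.
class MrpcProcessorStub:
    pass


class MrpcWithExDataProcessorStub:
    pass


class REProcessorStub:
    pass


class NERProcessorStub:
    pass


# mode name -> (processor class, whether external data is used)
MODE_REGISTRY = {
    "MRPC": (MrpcProcessorStub, False),
    "MRPC_w_ex_data": (MrpcWithExDataProcessorStub, True),
    "RE": (REProcessorStub, False),
    "NER": (NERProcessorStub, False),
}


def select_mode(mode):
    """Return (processor instance, use_ex_data) for a mode, or raise ValueError."""
    if mode not in MODE_REGISTRY:
        raise ValueError(
            f"Unknown mode {mode!r}; available modes: {sorted(MODE_REGISTRY)}")
    processor_cls, use_ex_data = MODE_REGISTRY[mode]
    return processor_cls(), use_ex_data


demo_processor, demo_use_ex_data = select_mode("MRPC_w_ex_data")
print(type(demo_processor).__name__, demo_use_ex_data)
```

Because the error message is built from the registry keys, it cannot fall out of sync with the list of supported modes.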


# Data Generation

### General setup and definitions

In [None]:
#@markdown Maximum batch size the finetuning_benchmark script can handle without OOM (must be divisible by NUM_TPU_CORES_WHEN_TESTING):
MAX_BATCH_SIZE =  1024 #@param {type:"integer"}
#@markdown How many tpu cores will be used during evaluation and prediction (for colab runtimes, it's 8):
NUM_TPU_CORES_WHEN_TESTING = 8 #@param {type:"integer"}

def generate_data(MAX_SEQ_LENGTH,
                  data_folder_current,
                  DATA_GCS_DIR,
                  PRECISE_TESTING,
                  USING_SHARDS,
                  START_SHARD,
                  AUGMENT_COPIES_TRAIN,
                  SHARD_SIZE,
                  GENERATE_SETS):

  try:
    print("\nUpdating and uploading data info json...\n")
    DATA_INFO["sequence_length"] = MAX_SEQ_LENGTH    ##update data info with sequence length

    if USE_EX_DATA:                         ##if using external data, update data 
      def get_ex_data_num(file):            ##info with the # of external datapoints being used
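        with tf.gfile.Open(file) as filein:
          for line in filein:               ##use the first non-empty line
            line = line.strip()
            if line:
              ex_data = line.split("\t")[3].split()
              return len(ex_data)
        raise Exception("no data found in "+file)  ##avoid looping forever on an empty file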
      DATA_INFO["ex_data_num"] = get_ex_data_num(data_folder_current+"/"+tf.io.gfile.listdir(data_folder_current)[0])
    
    with tf.gfile.Open(DATA_GCS_DIR+"/info.json","w+") as out: ##writes out a dictionary containing
      json.dump(DATA_INFO,out,indent=2)                           ##the dataset's parameters
    print("Data info json uploaded successfully")
  except Exception as e:
    print("could not update and upload data info json. Error:",e)

              
  def get_or_create_shards(infile, SHARD_SIZE, START_SHARD, END_SHARD):
      shard_files = []
      with tf.gfile.Open(infile) as filein:
          shard_ind = START_SHARD
          read_ind = -1
          while True:
              current_start_line = shard_ind * SHARD_SIZE
              if shard_ind == END_SHARD: break
              shard_file = f"{infile}_(shardsize_{SHARD_SIZE})_shard_{shard_ind}"
              shard_files.append([shard_file, shard_ind])
              if not tf.io.gfile.exists(shard_file):
                  with tf.gfile.Open(shard_file, "w+") as shardout:
                      wroteout = 0
                      for line in tqdm(filein, f"creating shard number {shard_ind}"):
                          if not line.strip():
                              continue
                          read_ind += 1
                          if read_ind < current_start_line:
                              continue
                          shardout.write(line)
                          wroteout += 1
                          if wroteout == SHARD_SIZE:
                              break
                      if wroteout == 0:
                          shardout.close()
                          del shard_files[-1]
                          break
                      if wroteout < SHARD_SIZE:
                          break
              shard_ind += 1
      return shard_files
  

  DO_TRAIN, DO_DEV, DO_TEST = GENERATE_SETS
  

  if DO_TRAIN:
    try:
      print("\nGenerating train set...\n")
      train_data_input_file = processor.get_train_file(data_folder_current)
      if USING_SHARDS:
        shards = get_or_create_shards(train_data_input_file,SHARD_SIZE//(AUGMENT_COPIES_TRAIN+1),START_SHARD,END_SHARD)
      else:
        shards = [[train_data_input_file,None]]   ##a single (file, shard index) pair
      for shard,shard_ind in shards:
        if USING_SHARDS: print(f"generating data for shard number {shard_ind}")
        train_examples = processor._create_examples(processor._read_tsv(shard),"train")
        if len(train_examples) == 0:
          raise Exception("no data present in the train dataset")
        train_file = os.path.join(DATA_GCS_DIR, "train.tf_record")
        if USING_SHARDS:
          train_file+="_"+str(shard_ind)
        script.file_based_convert_examples_to_features(train_examples, 
                                                      label_list, 
                                                      MAX_SEQ_LENGTH, 
                                                      tokenizer, 
                                                      train_file,
                                                      augmented_data_copies=AUGMENT_COPIES_TRAIN,
                                                      shuffle_data=True)
    except Exception as e:
      print("train dataset generation failed. Error:",e)

  if DO_DEV:
    try:
      print("\nGenerating dev set...\n")
      dev_data_input_file = processor.get_dev_file(data_folder_current)
      if USING_SHARDS:
        shards = get_or_create_shards(dev_data_input_file,SHARD_SIZE,START_SHARD,END_SHARD)
      else:
        shards = [[dev_data_input_file,None]]   ##a single (file, shard index) pair
      for shard,shard_ind in shards:
        if USING_SHARDS: print(f"generating data for shard number {shard_ind}")
        dev_examples = processor._create_examples(processor._read_tsv(shard),"dev")
        if len(dev_examples) == 0:
          raise Exception("no data present in the dev dataset")
        dev_file = os.path.join(DATA_GCS_DIR, "dev.tf_record")
        if USING_SHARDS:
          dev_file+="_"+str(shard_ind)
        script.file_based_convert_examples_to_features(dev_examples, 
                                                      label_list, 
                                                      MAX_SEQ_LENGTH, 
                                                      tokenizer, 
                                                      dev_file)
    except Exception as e:
      print("dev dataset generation failed. Error:",e)

  if DO_TEST:
    try:
      print("\nGenerating test set...\n")
      datasets = [re.match(r"test_(\w+)\.tsv",file).groups()[0] for file in tf.io.gfile.listdir(data_folder_current) if re.match(r"test_(\w+)\.tsv",file)]
      if not datasets:
        datasets = [None]
      for dataset in datasets:
        if dataset: print(f"Processing dataset: {dataset}")
        test_data_input_file = processor.get_test_file(data_folder_current,dataset=dataset)
        if USING_SHARDS:
          shards = get_or_create_shards(test_data_input_file,SHARD_SIZE,START_SHARD,END_SHARD)
        else:
          shards = [[test_data_input_file,None]]   ##a single (file, shard index) pair
        for n,(shard,shard_ind) in enumerate(shards):
          if USING_SHARDS: print(f"generating data for shard number {shard_ind}")
          test_examples = processor._create_examples(processor._read_tsv(shard),"test")
          if len(test_examples) == 0:
            raise Exception("no data present in the test dataset")
          test_file = os.path.join(DATA_GCS_DIR, f"test_{dataset}.tf_record" if dataset else "test.tf_record")
          if USING_SHARDS:
            test_file+="_"+str(shard_ind)
          ## if using precise testing, the data will be split into two sets: 
          ## one set will be able to be predicted on the maximum possible batch 
          ## size, while the other will be predicted on a batch size of 1, to 
          ## ensure the fastest prediction without leaving out any datapoints
          if PRECISE_TESTING and n==len(shards)-1:
            test_file_trailing = os.path.join(DATA_GCS_DIR, f"test_trailing_{dataset}.tf_record" if dataset else "test_trailing.tf_record")
            def largest_multiple_under_max(max_value,multiple_base):
              return (max_value//multiple_base)*multiple_base

            split = largest_multiple_under_max(len(test_examples),MAX_BATCH_SIZE)
            test_examples_head = test_examples[:split]
            test_examples_trailing = test_examples[split:]
            script.file_based_convert_examples_to_features(test_examples_head, 
                                                           label_list, 
                                                           MAX_SEQ_LENGTH, 
                                                           tokenizer, 
                                                           test_file)
            if test_examples_trailing:
              script.file_based_convert_examples_to_features(test_examples_trailing, 
                                                            label_list, 
                                                            MAX_SEQ_LENGTH, 
                                                            tokenizer, 
                                                            test_file_trailing)
          else:
            script.file_based_convert_examples_to_features(test_examples, 
                                                           label_list, 
                                                           MAX_SEQ_LENGTH, 
                                                           tokenizer, 
                                                           test_file)
    except Exception as e:
      print("test dataset generation failed. Error:",e)
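
The precise-testing split used for the test set can be seen in isolation with plain lists. This is a minimal sketch of the same arithmetic; `split_for_precise_testing` is a hypothetical helper name, and the list of integers stands in for real examples.

```python
def split_for_precise_testing(examples, max_batch_size):
    """Split examples into a head whose length is a multiple of max_batch_size
    and a trailing remainder that can be predicted with a batch size of 1."""
    split = (len(examples) // max_batch_size) * max_batch_size
    return examples[:split], examples[split:]


# 2500 placeholder examples with a maximum batch size of 1024.
head, trailing = split_for_precise_testing(list(range(2500)), 1024)
print(len(head), len(trailing))  # 2048 452
```

When the number of examples is already a multiple of the batch size, the trailing part is empty, which is why the trailing file is only written when it has content.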


### Data Generation ops

There are currently two data generation loops/ops (more can be added using a similar format to these two examples):
1. Varying sequence lengths: multiple sets of different sequence lengths will be generated
  * Store the individual datasets as subfolders inside the input finetuning data folder (INPUT_DATA_DIR), with each subfolder named after its corresponding sequence length.
2. Only one dataset: a single dataset with a specified set of parameters will be generated 
  * Store only the files train.tsv, dev.tsv, and test.tsv for one dataset directly inside the input finetuning data folder (INPUT_DATA_DIR).
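
The two layouts differ only in whether the sequence length is appended to the input and output paths. The sketch below mirrors the string concatenation used in the two cells that follow; `resolve_dirs` is a hypothetical helper, and the bucket and folder names in the example are placeholders.

```python
def resolve_dirs(bucket_name, input_dir, output_dir, seq_length=None):
    """Return (input folder, output GCS folder) for one dataset.

    With seq_length set, both paths get a subfolder named after the
    sequence length (the "varying sequence lengths" layout); with
    seq_length=None the folders are used directly (the single-dataset layout).
    """
    input_path = input_dir
    output_path = "gs://" + bucket_name + "/" + output_dir
    if seq_length is not None:
        input_path += "/" + str(seq_length)
        output_path += "/" + str(seq_length)
    return input_path, output_path


# Placeholder names, for illustration only.
print(resolve_dirs("my-bucket", "gs://my-bucket/raw_data", "processed", 512))
print(resolve_dirs("my-bucket", "gs://my-bucket/raw_data", "processed"))
```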

#### Varying sequence lengths

In [None]:
#@markdown List of maximum sequence lengths to generate data for
MAX_SEQ_LENGTHS = [1024] #@param
#@markdown Whether or not to ensure all datapoints are used during prediction by using an extra trailing test dataset so no datapoints will be skipped due to the batch size. (This option should be used unless an extra trailing test dataset is a large problem)
PRECISE_TESTING = True #@param {type:"boolean"}
#@markdown Whether or not to split the data processing into shards (only for really large databases, since finetuning data typically isn't that large)
USING_SHARDS = False #@param {type:"boolean"}
#@markdown If USING_SHARDS, what shard size to use (how many lines/datapoints should be in each shard) (MUST BE DIVISIBLE BY "MAX_BATCH_SIZE") (if using data augmentation, size indicates the size of augmented data)
SHARD_SIZE = 1024000 #@param {type:"integer"}
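#@markdown If USING_SHARDS, which shard to start at (default start at shard 0)
START_SHARD = 0 #@param {type:"integer"}
#@markdown If USING_SHARDS, which shard to process until (not inclusive) (default -1 for last shard); generate_data reads this value as a global
END_SHARD = -1 #@param {type:"integer"}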
#@markdown Which sets to generate out of train, dev, and test
TRAIN = False #@param {type:"boolean"}
DEV = False #@param {type:"boolean"}
TEST = True #@param {type:"boolean"}
#@markdown How many additional augmented copies to load (augmented copies refer to duplicates of the same sequence but with different clip locations. This parameter is defined by the "run_classifier.py" file in the "mutformer_model_code" folder):
AUGMENT_COPIES_TRAIN = 0 #@param {type:"integer"}

for MAX_SEQ_LENGTH in MAX_SEQ_LENGTHS:
  print("\n\nGenerating data for seq length:",MAX_SEQ_LENGTH,"\n\n")
  DATA_GCS_DIR = BUCKET_PATH+"/"+OUTPUT_DATA_DIR +"/"+ str(MAX_SEQ_LENGTH)
  data_folder_current= INPUT_DATA_DIR+"/"+str(MAX_SEQ_LENGTH)

  generate_data(MAX_SEQ_LENGTH,
                data_folder_current,
                DATA_GCS_DIR,
                PRECISE_TESTING,
                USING_SHARDS,
                START_SHARD,
                AUGMENT_COPIES_TRAIN,
                SHARD_SIZE,
                [TRAIN,DEV,TEST])
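
To make the shard parameters concrete, the sketch below lists which non-empty input lines would fall into each shard for a given shard size, start shard, and end shard. It is a simplified model of the sharding arithmetic, not the notebook's file-writing code; `shard_line_ranges` is a hypothetical helper, and an end shard of -1 is taken to mean "continue through the last shard".

```python
def shard_line_ranges(num_lines, shard_size, start_shard=0, end_shard=-1):
    """Return (shard index, first line, one-past-last line) for each shard.

    Lines are counted over non-empty input lines, starting at 0.
    end_shard is exclusive; -1 means continue until the data runs out.
    """
    ranges = []
    shard_ind = start_shard
    while shard_ind != end_shard:
        start = shard_ind * shard_size
        if start >= num_lines:  # nothing left to put in this shard
            break
        ranges.append((shard_ind, start, min(start + shard_size, num_lines)))
        shard_ind += 1
    return ranges


print(shard_line_ranges(2500, 1024))
# [(0, 0, 1024), (1, 1024, 2048), (2, 2048, 2500)]
print(shard_line_ranges(2500, 1024, start_shard=1, end_shard=2))
# [(1, 1024, 2048)]

# For the train set, the number of input lines per shard is reduced so that
# the shard holds SHARD_SIZE datapoints after augmentation:
shard_size, augment_copies = 1024000, 1
print(shard_size // (augment_copies + 1))  # 512000
```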
  

#### Only one dataset

In [None]:
#@markdown Maximum output data length (when using paired method, actual protein sequence length is about half of this value):
MAX_SEQ_LENGTH = 1024 #@param {type:"integer"}
#@markdown Whether or not to ensure all datapoints are used during prediction by using an extra trailing test dataset so no datapoints will be skipped due to the batch size. (This option should be used most of the time unless an extra trailing test dataset is a large problem)
PRECISE_TESTING = True #@param {type:"boolean"}
#@markdown Whether or not to split the data processing into shards (only for really large databases, since finetuning data typically isn't that large)
USING_SHARDS = True #@param {type:"boolean"}
#@markdown If USING_SHARDS, what shard size to use (how many lines/datapoints should be in each shard) (MUST BE DIVISIBLE BY "MAX_BATCH_SIZE")
SHARD_SIZE = 1024000 #@param {type:"integer"}
#@markdown * If USING_SHARDS, set this value to indicate which shard to start processing at (default 0 for first shard)
START_SHARD = 53 #@param {type:"integer"}
#@markdown * If USING_SHARDS, set this value to indicate which shard to process until (not inclusive) (default -1 for last shard)
END_SHARD = 54 #@param {type:"integer"}
#@markdown Which sets to generate out of train, dev, and test
TRAIN = False #@param {type:"boolean"}
DEV = False #@param {type:"boolean"}
TEST = True #@param {type:"boolean"}
#@markdown How many additional augmented copies to load (augmented copies refer to duplicates of the same sequence but with different clip locations. This parameter is defined by the "run_classifier.py" file in the "mutformer_model_code" folder):
AUGMENT_COPIES_TRAIN = 0 #@param {type:"integer"}

DATA_GCS_DIR = BUCKET_PATH+"/"+OUTPUT_DATA_DIR
data_folder_current = INPUT_DATA_DIR

generate_data(MAX_SEQ_LENGTH,
              data_folder_current,
              DATA_GCS_DIR,
              PRECISE_TESTING,
              USING_SHARDS,
              START_SHARD,
              AUGMENT_COPIES_TRAIN,
              SHARD_SIZE,
              [TRAIN,DEV,TEST])
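
Each run of `generate_data` also writes an `info.json` next to the generated records. The sketch below shows roughly what that file could contain, assuming the external data sits in the fourth tab-separated column as whitespace-separated values (which is how `get_ex_data_num` reads it). `build_data_info` is a hypothetical helper, and the sample row is made up; its first three columns are placeholders.

```python
import json


def build_data_info(mode, max_seq_length, first_data_line=None):
    """Assemble the dataset-info dictionary for one generated dataset."""
    info = {"mode": mode, "sequence_length": max_seq_length}
    if first_data_line is not None:
        # External data: whitespace-separated values in the 4th TSV column.
        ex_data = first_data_line.strip().split("\t")[3].split()
        info["ex_data_num"] = len(ex_data)
    return info


# Made-up row; only the 4th column (three external values) matters here.
sample_line = "col0\tcol1\tcol2\t0.1 0.2 0.3\n"
demo_info = build_data_info("MRPC_w_ex_data", 1024, sample_line)
print(json.dumps(demo_info, indent=2))
```

For modes without external data, the `ex_data_num` key is simply absent.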
