# Inference with METL models
This notebook shows how to run inference with METL models trained in this repository.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys

# define the name of the project root directory
project_root_dir_name = "metl"

# find the project root by checking each parent directory
current_dir = os.getcwd()
while os.path.basename(current_dir) != project_root_dir_name and current_dir != os.path.dirname(current_dir):
    current_dir = os.path.dirname(current_dir)

# change the current working directory to the project root directory
if os.path.basename(current_dir) == project_root_dir_name:
    os.chdir(current_dir)
else:
    print("project root directory not found")
    
# add the project code folder to the system path so imports work
module_path = os.path.abspath("code")
if module_path not in sys.path:
    sys.path.append(module_path)

# Using our inference framework

We provide the script [inference.py](../code/inference.py) for running inference with models trained in this repository. It supports arguments and datamodule capabilities similar to those used for training the models.

The arguments `--write_interval` and `--batch_write_mode` control how often predictions are saved and in what format.

The `--write_interval` argument can be set to "batch", "epoch", or "batch_and_epoch". When set to "batch", predictions are saved to disk after each batch. When set to "epoch", predictions are held in RAM until all data has been processed and are then written to disk. If your predictions might not fit in RAM, set `--write_interval` to "batch" (the default).

The `--batch_write_mode` argument can be set to "combined_csv", "separate_csv", or "separate_npy". When set to "combined_csv", there is a single output CSV file, which is appended to after each batch is processed. When set to "separate_csv" or "separate_npy", there is a separate output file for each batch, in .csv or .npy format respectively.
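
If you use "separate_csv", you may want to stitch the per-batch files back together afterwards. The sketch below is one way to do that with the standard library. It is not part of this repository: the function names are hypothetical, and it assumes that every per-batch file ends in `.csv`, starts with the same header row, and carries its batch index in the file name.

```python
import csv
import re
from pathlib import Path


def natural_key(path):
    """Sort key that orders 'batch_2.csv' before 'batch_10.csv'."""
    return [int(tok) if tok.isdigit() else tok
            for tok in re.split(r"(\d+)", path.name)]


def merge_batch_csvs(batch_dir, merged_path):
    """Concatenate per-batch CSV files into one file with a single header row.

    Assumption: each file in batch_dir has the same header as its first row.
    Returns the number of data rows written.
    """
    merged = Path(merged_path).resolve()
    # skip the merged file itself in case it lives inside batch_dir
    batch_files = sorted(
        (p for p in Path(batch_dir).glob("*.csv") if p.resolve() != merged),
        key=natural_key,
    )
    rows_written = 0
    header_written = False
    with open(merged, "w", newline="") as out_f:
        writer = csv.writer(out_f)
        for fn in batch_files:
            with open(fn, newline="") as in_f:
                reader = csv.reader(in_f)
                header = next(reader, None)
                if header is None:
                    continue  # empty file, nothing to copy
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Sorting with `natural_key` keeps the rows in batch order even when the batch indices are not zero-padded. Check the actual file names and header produced by your run before relying on these assumptions.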

## Source model example
This repository contains a sample GFP Rosetta dataset and a pretrained METL-Local GFP source model, which we can use as examples. 

We specify the following arguments:

| Argument               | Description                                     | Value                              |
|:-----------------------|:------------------------------------------------|:-----------------------------------|
| `pretrained_ckpt_path` | Path to the pretrained model checkpoint         | `pretrained_models/Hr4GNHws.pt`    |
| `dataset_type`         | Type of dataset being used (`rosetta` or `dms`) | `rosetta`                          |
| `ds_fn`                | Path to the database file for the dataset       | `data/rosetta_data/avgfp/avgfp.db` |
| `batch_size`           | Batch size used during inference                | `512`                              |

The inference script will automatically save output in the `output/inference` directory. There will be an output csv file for each processed batch.
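
If you prefer to launch the script from Python rather than with a `!` shell command, the arguments in the table above can be assembled from a dictionary. This is a minimal sketch, not part of the repository: `build_inference_command` and `run_inference` are hypothetical helpers, and the sketch assumes it is run from the project root so that `code/inference.py` resolves.

```python
import shlex
import subprocess
import sys


def build_inference_command(args, script="code/inference.py"):
    """Turn a dict of argument names and values into a command list."""
    cmd = [sys.executable, script]
    for name, value in args.items():
        cmd.append(f"--{name}={value}")
    return cmd


def run_inference(args):
    """Run the inference script and raise if it exits with an error."""
    subprocess.run(build_inference_command(args), check=True)


# the same values as in the table above
source_model_args = {
    "pretrained_ckpt_path": "pretrained_models/Hr4GNHws.pt",
    "dataset_type": "rosetta",
    "ds_fn": "data/rosetta_data/avgfp/avgfp.db",
    "batch_size": 512,
}

# print the command for inspection; call run_inference(source_model_args) to execute it
print(shlex.join(build_inference_command(source_model_args)))
```

Keeping the arguments in a dictionary makes it easy to reuse the same settings while changing a single value, such as the checkpoint path.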

In [3]:
!python code/inference.py --pretrained_ckpt_path=pretrained_models/Hr4GNHws.pt --dataset_type=rosetta --ds_fn=data/rosetta_data/avgfp/avgfp.db --batch_size=512

Using example_input_array with pdb_fn='1gfl_cm.pdb' and aa_seq_len=237
Output directory: output/inference/Hr4GNHws/rosetta_avgfp/full_dataset
Writing predictions to output/inference/Hr4GNHws/rosetta_avgfp/full_dataset/predictions.npy
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Predicting DataLoader 0: 100%|██████████████████| 20/20 [00:06<00:00,  2.88it/s]


By default, the script computes predictions for the full dataset. If you only need predictions for a particular train, validation, or test set, set the `--split_dir` and `--predict_mode` arguments. The command below computes predictions for the test set only.

In [4]:
!python code/inference.py --pretrained_ckpt_path=pretrained_models/Hr4GNHws.pt --dataset_type=rosetta --ds_fn=data/rosetta_data/avgfp/avgfp.db --batch_size=512 --split_dir=data/rosetta_data/avgfp/splits/standard_tr0.8_tu0.1_te0.1_w1aea30517f4f_r4991 --predict_mode=test

Using example_input_array with pdb_fn='1gfl_cm.pdb' and aa_seq_len=237
Output directory: output/inference/Hr4GNHws/rosetta_avgfp/standard_tr0.8_tu0.1_te0.1_w1aea30517f4f_r4991/test
Writing predictions to output/inference/Hr4GNHws/rosetta_avgfp/standard_tr0.8_tu0.1_te0.1_w1aea30517f4f_r4991/test/predictions.npy
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Predicting DataLoader 0: 100%|████████████████████| 2/2 [00:01<00:00,  1.18it/s]


## Target (finetuned) model example
We first need to finetune a model using experimental data. Run the command below, which finetunes the pretrained model above using the GFP experimental dataset. Note that we manually specify the UUID `examplemodel` for this model. See the [finetuning.ipynb](finetuning.ipynb) notebook for more details.

In [5]:
!python code/train_target_model.py @args/finetune_avgfp_local.txt --enable_progress_bar false --enable_simple_progress_messages --max_epochs 50 --unfreeze_backbone_at_epoch 25 --uuid examplemodel  

Random seed not specified, using: 233751893
Global seed set to 233751893
User gave model UUID: examplemodel
Did not find existing log directory corresponding to given UUID: examplemodel
Created log directory: output/training_logs/examplemodel
Final UUID: examplemodel
Final log directory: output/training_logs/examplemodel
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Loading `train_dataloader` to estimate number of stepping batches.
  rank_zero_warn(
Number of training steps is 50
Number of warmup steps is 0.5
Second warmup phase starts at step 25
total_steps 50
phase1_total_steps 25
phase2_total_steps 25

We can now run inference with this finetuned model using the [inference.py](../code/inference.py) script.

| Argument               | Description                                                                    | Value                                                                 |
|:-----------------------|:-------------------------------------------------------------------------------|:----------------------------------------------------------------------|
| `pretrained_ckpt_path` | Path to the pretrained model checkpoint                                        | `output/training_logs/examplemodel/checkpoints/epoch=49-step=50.ckpt` |
| `dataset_type`         | Type of dataset being used (`rosetta` or `dms`)                                | `dms`                                                                 |
| `ds_name`              | Name of the predefined dataset to use                                          | `avgfp`                                                               |
| `encoding`             | Input encoding method (should be `int_seqs` for transformer-based METL models) | `int_seqs`                                                            |
| `predict_mode`         | Prediction mode for inference                                                  | `full_dataset`                                                        |
| `batch_size`           | Batch size used during inference                                               | `512`                                                                 |
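
The checkpoint file name in the table encodes the epoch and step at which it was saved, so it changes if you finetune with different settings. The sketch below shows one way to look up the most recent checkpoint instead of hard-coding the name. `find_latest_checkpoint` is a hypothetical helper, and it assumes the checkpoints you care about follow the `epoch=<N>-step=<M>.ckpt` naming seen above; files with other names are ignored.

```python
import re
from pathlib import Path

# matches names such as "epoch=49-step=50.ckpt"
CKPT_PATTERN = re.compile(r"epoch=(\d+)-step=(\d+)\.ckpt$")


def find_latest_checkpoint(checkpoint_dir):
    """Return the checkpoint with the highest (epoch, step), or None if none match."""
    best_key, best_path = None, None
    for path in Path(checkpoint_dir).glob("*.ckpt"):
        match = CKPT_PATTERN.search(path.name)
        if match is None:
            continue  # skip files that do not follow the epoch/step naming
        key = (int(match.group(1)), int(match.group(2)))
        if best_key is None or key > best_key:
            best_key, best_path = key, path
    return best_path


ckpt = find_latest_checkpoint("output/training_logs/examplemodel/checkpoints")
print(ckpt)  # None if the finetuning command has not been run yet
```

Note that the most recent checkpoint is not necessarily the best-performing one, so confirm which checkpoint you want before passing it to `--pretrained_ckpt_path`.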

In [6]:
!python code/inference.py --pretrained_ckpt_path=output/training_logs/examplemodel/checkpoints/epoch=49-step=50.ckpt --dataset_type=dms --ds_name=avgfp --encoding=int_seqs --predict_mode full_dataset --batch_size 512 

Output directory: output/inference/examplemodel/dms_avgfp
Writing predictions to output/inference/examplemodel/dms_avgfp/predictions.npy
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Predicting DataLoader 0: 100%|████████████████| 102/102 [00:30<00:00,  3.30it/s]
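
The log above reports that predictions were written to a `predictions.npy` file. Assuming this is a standard NumPy array file, as the extension suggests, it can be read back with `np.load`. The helper name `load_predictions` is hypothetical, and the sketch makes no assumption about the array's shape or about which column corresponds to which output.

```python
import os

import numpy as np


def load_predictions(path):
    """Load a predictions array saved in .npy format and report its shape."""
    predictions = np.load(path)
    print(f"Loaded {path}: shape={predictions.shape}, dtype={predictions.dtype}")
    return predictions


# path taken from the log output above; only loaded if the file exists
pred_path = "output/inference/examplemodel/dms_avgfp/predictions.npy"
if os.path.exists(pred_path):
    preds = load_predictions(pred_path)
```

Inspect the printed shape to see how the array is laid out for your model before using the values downstream.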


# Using your own inference loop