## Complete run pipeline

This notebook contains all the steps required to pre-train and fine-tune the TCRP model, as well as to generate baseline models for evaluating drug response prediction. It requires GPUs and is currently set up to run on Google Colab.

Before running it, upload your "tcrp_model" folder (including the prepared labels and features and all scripts) to your Google Drive.

In [None]:
import os, sys
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [None]:
# change to root directory
%cd /content/drive/MyDrive/MBP1413H/
# install required packages and modules
!pip install -r requirements.txt

In [None]:
# run scripts from "pipelines" directory (otherwise you will get module import errors)
%cd /content/drive/MyDrive/MBP1413H/tcrp_model/pipelines

## Run Instructions

### Preparing the pipeline:

**Step 1.** Choose a dataset, tissue, and drug on which you would like to apply the TCRP framework

**Step 2.** Prepare scripts for generating fewshot samples, the TCRP model, and baseline models for your chosen dataset, tissues, and drugs. For this, run "prepare_complete_run.py" as follows:
```
usage: prepare_complete_run.py --dataset DATASET --tissue TISSUE

--dataset # data set used to generate labels
--tissue # target tissue to perform TCRP on
```

*Approx. run time:* **10s** per tissue (for all drugs specified in drug list)

*Note:* You must specify the drugs for which you want to construct models by adding them to the "priority drugs file" in the "pipelines" directory

In [None]:
# 1. start by running the script to prepare the TCRP run
# in this example, the dataset is "GDSC" (challenge 1b) and the tissue is "soft_tissue"
# you will have to do this for each tissue
!python prepare_complete_run.py --dataset GDSC --tissue soft_tissue
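
Because the preparation script is invoked once per tissue, a small helper can assemble the commands for several tissues. The following is a minimal sketch: `build_prepare_commands` and `run_prepare` are hypothetical helpers, not part of the repository, and they rely only on the `--dataset` and `--tissue` flags shown above. The tissue names are the ones used elsewhere in this notebook.

```python
import subprocess
import sys

def build_prepare_commands(dataset, tissues):
    """Return one prepare_complete_run.py command (as an argument list) per tissue."""
    return [
        [sys.executable, "prepare_complete_run.py",
         "--dataset", dataset, "--tissue", tissue]
        for tissue in tissues
    ]

def run_prepare(dataset, tissues, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_prepare_commands(dataset, tissues):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the loop at the first failing tissue
            subprocess.run(cmd, check=True)

# Dry run: prints the commands without executing them
run_prepare("GDSC", ["soft_tissue", "autonomic_ganglia"])
```

Call `run_prepare(..., dry_run=False)` from the "pipelines" directory to actually execute the commands.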

### Generating TCRP model:

**Step 3**. Generate fewshot samples and TCRP model by running MAML commands

*Approx. run time:* **40 min** *per tissue (for one drug)*

In [None]:
# 2. build your TCRP model by running the jobs generated above
# replace "GDSC_soft_tissue" with your "{dataset}_{tissue}" from above 
!bash ../data/output/runs/GDSC_soft_tissue/MAML_cmd/run_MAML.sh
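
To avoid typos when substituting the run name, the script path can be derived from the dataset and tissue. This is a sketch under the assumption that every run follows the `{dataset}_{tissue}` layout seen in the command above; `maml_script_path` is a hypothetical helper name.

```python
from pathlib import PurePosixPath

def maml_script_path(dataset, tissue, runs_dir="../data/output/runs"):
    """Build the path to run_MAML.sh for a given dataset and tissue."""
    run_name = f"{dataset}_{tissue}"  # same naming scheme as in the cell above
    return PurePosixPath(runs_dir) / run_name / "MAML_cmd" / "run_MAML.sh"

print(maml_script_path("GDSC", "soft_tissue"))
# ../data/output/runs/GDSC_soft_tissue/MAML_cmd/run_MAML.sh
```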


## Manually parsing through results to find the optimal TCRP run:

**Step 4.** For each dataset, tissue, and drug that TCRP was run on, use the *find_max_run.ipynb* notebook to find the optimal run. This notebook tells you which output file has the best TCRP correlation, retrieves the performance outputs of that run, and replaces the current performance results (which default to the last TCRP run) with the optimized result.

*Approx. run time:* **<5 min** to find optimal run and update performance metrics
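
The selection itself amounts to taking the maximum over per-run correlations. The sketch below illustrates the idea only and does not reproduce *find_max_run.ipynb*: it assumes the correlations have already been parsed into a dictionary keyed by output file name, and the file names and values are invented placeholders, not real results.

```python
import math

def find_best_run(correlations):
    """Return (file_name, correlation) for the highest non-NaN correlation."""
    # Skip runs whose correlation could not be computed
    valid = {name: r for name, r in correlations.items() if not math.isnan(r)}
    if not valid:
        raise ValueError("no run has a valid correlation")
    best = max(valid, key=valid.get)
    return best, valid[best]

# Placeholder values for illustration only
runs = {"run_0.txt": 0.21, "run_1.txt": 0.34, "run_2.txt": float("nan")}
best_file, best_r = find_best_run(runs)
print(best_file, best_r)
```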

## Generating baseline models:

**Step 5**. Generate baseline models by running baseline commands

*Approx. run time:* **25 min** *per tissue (for one drug)*:


*   <1 min for LR, NN
*   ~25 min for RF

**Note:** Because generating the RF model takes a long time, you can choose to omit this baseline by setting the --RF argument to "False"

In [None]:
# only run this after the TCRP model has been generated (because you will need to have fewshot samples in hand)
# replace "--dataset", "--tissue", "--drug", and run_name ("{dataset}_{tissue}") with the appropriate values
!python -m baselines.baseline_DRUG --dataset GDSC --tissue autonomic_ganglia --drug GSK429286A --K 10 --num_trials 20 --run_name GDSC_autonomic_ganglia --fewshot_data_path /content/drive/MyDrive/MBP1413H/tcrp_model/data/fewshot_data/GDSC
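
The baseline command is long, so it can help to build it from its parts. This is a minimal sketch: `build_baseline_command` is a hypothetical helper, the flags are the ones in the cell above, and passing the RF switch as `--RF False` is an assumption based on the note above rather than a confirmed detail of the script's argument parser.

```python
def build_baseline_command(dataset, tissue, drug, fewshot_data_path,
                           k=10, num_trials=20, include_rf=True):
    """Assemble the baseline_DRUG command as an argument list."""
    cmd = [
        "python", "-m", "baselines.baseline_DRUG",
        "--dataset", dataset,
        "--tissue", tissue,
        "--drug", drug,
        "--K", str(k),
        "--num_trials", str(num_trials),
        "--run_name", f"{dataset}_{tissue}",
        "--fewshot_data_path", fewshot_data_path,
    ]
    if not include_rf:
        # Assumed form of the switch that skips the slow RF baseline
        cmd += ["--RF", "False"]
    return cmd

cmd = build_baseline_command(
    "GDSC", "autonomic_ganglia", "GSK429286A",
    "/content/drive/MyDrive/MBP1413H/tcrp_model/data/fewshot_data/GDSC",
    include_rf=False,
)
print(" ".join(cmd))
```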

## Plotting results:

**Step 6.** For each dataset, use the *plot_results.ipynb* to visualize performance across models. This notebook will allow you to compare TCRP and baseline models (LR, NN, KNN, RF) based on the correlation between actual and predicted cellular response.
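
For reference, the comparison metric can be computed directly from paired actual and predicted responses. The text above does not say which correlation *plot_results.ipynb* uses; the sketch below assumes Pearson correlation, and `pearson_correlation` is a hypothetical helper written with the standard library only.

```python
import math

def pearson_correlation(actual, predicted):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_a * var_p)

# Perfectly linear toy inputs give a correlation of 1 (up to rounding)
print(pearson_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```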
