<a href="https://colab.research.google.com/github/dthorburn/rb_automation/blob/main/af2_small_batch/AlphaFold2_batch_RBGCP_current.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ColabFold v1.5.5: AlphaFold2 w/ MMseqs2 BATCH & GCP Bucket Access

<img src="https://raw.githubusercontent.com/sokrypton/ColabFold/main/.github/ColabFold_Marv_Logo_Small.png" height="256" align="right" style="height:256px">

# Instructions <a name="Instructions"></a>
**Quick start**
1. Upload your amino acid sequences in FASTA format to GCP (NB: file names must end in `.fasta` to be recognised as FASTA here).
2. Set the GCP project ID (`project_name`), the Cloud Storage bucket (`bucket_name`), and the sub-directory containing the FASTA files (`folder_name`). The output directory name is derived from `folder_name`.
3. Check the parameters are correctly set. Defaults are fine for most use cases.
4. Launch a GCP GPU VM with the Colab backend container using this [link](https://console.cloud.google.com/marketplace/product/colab-marketplace-image-public/colab) (ensure you are logged into our GCP account). An NVIDIA T4 with 4 vCPUs and 13-26 GB of memory should be sufficient for most use cases.
5. Select `Connect to a custom GCE VM` under connection options in the top right and follow instructions.
6. Once connected, press `Runtime` -> `Run all` (or select each block and run it individually).
7. While the first block runs, follow the link it prints to retrieve a temporary GCP access code, and paste the code into the provided field.
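
To check which uploaded objects will be picked up as inputs before launching a VM, the selection rule from step 1 can be sketched as a small helper. This is a minimal sketch that assumes inputs are chosen by a `.fasta` suffix on the last path segment of the object name; `select_fasta_names` and the example object names are hypothetical and not part of the notebook.

```python
def select_fasta_names(blob_names):
    """Return the local file names for objects whose name ends in .fasta."""
    selected = []
    for name in blob_names:
        # Keep only the last path segment, as the fetch cell does
        filename = name.split("/")[-1]
        if filename.endswith(".fasta"):
            selected.append(filename)
    return selected


# Hypothetical object names under a folder called batch_01
example = [
    "batch_01/",
    "batch_01/proteinA.fasta",
    "batch_01/proteinB.fa",
    "batch_01/notes.txt",
]
print(select_fasta_names(example))  # → ['proteinA.fasta']
```

Files such as `proteinB.fa` would be skipped, so rename them to `.fasta` before uploading.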

**Output**
1. Results are uploaded to a new directory in the same bucket, named `Completed_${folder_name}`.
2. For each prediction, 1 PDB file and 2 JSON files will be uploaded to the output directory.
3. This output directory can be used as input for the RB candidate genes protein-protein interaction scoring pipeline.  
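
As a rough check after a run, the uploaded object names can be tallied by extension and compared with the counts listed above. This is a minimal sketch under stated assumptions: `completed_blob_name` mirrors the `Completed_` prefix described in item 1, `count_by_extension` is a hypothetical helper, and the example file names are made up for illustration (real ColabFold output names are longer). In practice the names could come from listing the bucket with the `Completed_${folder_name}` prefix.

```python
from collections import Counter


def completed_blob_name(folder_name, filename):
    """Build the destination object name for one result file."""
    return "Completed_" + folder_name + "/" + filename


def count_by_extension(object_names):
    """Count object names by file extension, ignoring names without one."""
    counts = Counter()
    for name in object_names:
        filename = name.split("/")[-1]
        if "." in filename:
            counts[filename.rsplit(".", 1)[-1]] += 1
    return counts


# Hypothetical result files for a single prediction
results = ["proteinA_model.pdb", "proteinA_scores.json", "proteinA_pae.json"]
uploaded = [completed_blob_name("batch_01", f) for f in results]
counts = count_by_extension(uploaded)
print(uploaded[0])
print(counts["pdb"], counts["json"])
```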



In [None]:
#@title Access GCP files
!pip install --upgrade google-cloud-storage
!gcloud auth application-default login

from google.cloud import storage

In [None]:
#@title Fetch inputs from Google Cloud Storage

#@markdown ## Set the parameters below, run this cell to fetch the input files, and wait for the summary. Check that the listed files are correct before running AlphaFold.
#@markdown ## Note: Running this cell deletes the local input directory and its contents.

import os
from pathlib import Path
import shutil
import re

project_name = '' #@param {type:"string"}
bucket_name = '' #@param {type:"string"}
folder_name = '' #@param {type:"string"}

print("Ensuring download folder free")
input_local_dir = Path.home().resolve() / "input"

if os.path.isdir(input_local_dir):
  print("Removing existing data from input dir")
  shutil.rmtree(input_local_dir)

os.mkdir(input_local_dir)


print("Fetching data...")
storage_client = storage.Client(project_name)
bucket = storage_client.get_bucket(bucket_name)
blobs = bucket.list_blobs(prefix=folder_name)  # Get list of files
for blob in blobs:
  filename = blob.name.split("/")[-1]
  # Only download files whose name ends in .fasta, as described in the instructions
  if filename.endswith(".fasta"):
    blob.download_to_filename(input_local_dir.resolve() / filename)
#for blob in blobs:
#  blob.download_to_filename(input_local_dir.resolve() / blob.name.split("/")[-1])

print("Done, input directory state:")
print(os.listdir(input_local_dir))

In [4]:
#@title Settings
#input_dir = '/content/drive/MyDrive/input_fasta' #@param {type:"string"}
#result_dir = '/content/drive/MyDrive/result' #@param {type:"string"}

## Adding a temp directory for input files that have been run
input_dir = input_local_dir
result_dir = Path.home().resolve() / "output"
finished_dir = Path.home().resolve() / "finished"
uploaded_dir = Path.home().resolve() / "uploaded"
working_dir = Path.home().resolve() / "temp_work"

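# Create the directories used by the prediction loop; existing ones are left as they are
for directory in (result_dir, finished_dir, uploaded_dir, working_dir):
  os.makedirs(directory, exist_ok=True)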

# number of models to use

#@markdown ### Advanced settings
msa_mode = "MMseqs2 (UniRef+Environmental)" #@param ["MMseqs2 (UniRef+Environmental)", "MMseqs2 (UniRef only)","single_sequence","custom"]
num_models = 1 #@param [1,2,3,4,5] {type:"raw"}
num_recycles = 3 #@param {type:"raw"}
stop_at_score = 100 #@param {type:"string"}
use_custom_msa = False
num_relax = 0 #@param [0, 1, 5] {type:"raw"}
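use_amber = num_relax > 0
# Python version of this runtime (e.g. "3.10"); the install and prediction cells read it
from sys import version_info
python_version = f"{version_info.major}.{version_info.minor}"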
relax_max_iterations = 200 #@param [0,200,2000] {type:"raw"}
use_templates = False #@param {type:"boolean"}
do_not_overwrite_results = True #@param {type:"boolean"}
zip_results = False #@param {type:"boolean"}


In [5]:
#@title Install dependencies
%%bash -s $use_amber $use_templates $python_version

set -e

USE_AMBER=$1
USE_TEMPLATES=$2
PYTHON_VERSION=$3

if [ ! -f COLABFOLD_READY ]; then
  # install dependencies
  # We have to use "--no-warn-conflicts" because colab already has a lot preinstalled with requirements different to ours
  pip install -q --no-warn-conflicts "colabfold[alphafold-minus-jax] @ git+https://github.com/sokrypton/ColabFold"
  ln -s /usr/local/lib/python3.*/dist-packages/colabfold colabfold
  ln -s /usr/local/lib/python3.*/dist-packages/alphafold alphafold
  touch COLABFOLD_READY
fi

# Download params (~1min)
python -m colabfold.download

# setup conda
if [ ${USE_AMBER} == "True" ] || [ ${USE_TEMPLATES} == "True" ]; then
  if [ ! -f CONDA_READY ]; then
    wget -qnc https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh -bfp /usr/local 2>&1 1>/dev/null
    rm Miniconda3-latest-Linux-x86_64.sh
    conda config --set auto_update_conda false
    touch CONDA_READY
  fi
fi
# setup template search
if [ ${USE_TEMPLATES} == "True" ] && [ ! -f HH_READY ]; then
  conda install -y -q -c conda-forge -c bioconda kalign2=2.04 hhsuite=3.3.0 python="${PYTHON_VERSION}" 2>&1 1>/dev/null
  touch HH_READY
fi
# setup openmm for amber refinement
if [ ${USE_AMBER} == "True" ] && [ ! -f AMBER_READY ]; then
  conda install -y -q -c conda-forge openmm=7.7.0 python="${PYTHON_VERSION}" pdbfixer 2>&1 1>/dev/null
  touch AMBER_READY
fi

In [None]:
#@title Run Prediction

import sys

from colabfold.batch import get_queries, run
from colabfold.download import default_data_dir
from colabfold.utils import setup_logging
from pathlib import Path

# For some reason we need that to get pdbfixer to import
if use_amber and f"/usr/local/lib/python{python_version}/site-packages/" not in sys.path:
    sys.path.insert(0, f"/usr/local/lib/python{python_version}/site-packages/")

setup_logging(Path(result_dir).joinpath("log.txt"))

# Connect to the bucket that will receive the results
storage_client = storage.Client(project_name)
bucket = storage_client.get_bucket(bucket_name)

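for filename in os.listdir(input_dir):
    # Skip anything that is not a regular file (e.g. a sub-directory)
    if not os.path.isfile(input_dir.resolve() / filename):
        continue
    print("Starting: " + filename)
    # Stage one input at a time so get_queries only sees the current file
    shutil.move(input_dir.resolve() / filename, working_dir.resolve())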

    queries, is_complex = get_queries(working_dir)
    run(
      queries=queries,
      result_dir=result_dir,
      use_templates=use_templates,
      num_relax=num_relax,
      relax_max_iterations=relax_max_iterations,
      msa_mode=msa_mode,
      model_type="auto",
      num_models=num_models,
      num_recycles=num_recycles,
      model_order=[1, 2, 3, 4, 5],
      is_complex=is_complex,
      data_dir=default_data_dir,
      keep_existing_results=do_not_overwrite_results,
      rank_by="auto",
      pair_mode="unpaired+paired",
      stop_at_score=stop_at_score,
      zip_results=zip_results,
      user_agent="colabfold/google-colab-batch",
    )
    print('Moving input file...')
    shutil.move(working_dir.resolve()/ filename, finished_dir.resolve())

    print('Uploading results...')
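    # Use a separate loop variable so the input file name is not overwritten
    for result_name in os.listdir(result_dir):
      if os.path.isfile(result_dir.resolve() / result_name):
          # Upload to Completed_<folder_name>/ in the bucket, then move the local copy aside
          blob = bucket.blob("Completed_" + folder_name + "/" + result_name)
          blob.upload_from_filename(result_dir.resolve() / result_name)
          shutil.move(os.path.join(result_dir, result_name), os.path.join(uploaded_dir, result_name))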
    print('Done. Moving on to next sample...')
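
The prediction cell stages one input at a time: each file is moved into the working directory, processed on its own, and then moved to the finished directory. The pattern can be tried locally without ColabFold or GCP. The following is a minimal sketch under stated assumptions: `process_one_at_a_time` and the `handle` callback are hypothetical stand-ins for the `get_queries` and `run` calls, and the directories are temporary ones created only for the demonstration.

```python
import shutil
import tempfile
from pathlib import Path


def process_one_at_a_time(input_dir, working_dir, finished_dir, handle):
    """Stage each input file alone in working_dir, call handle, then archive it."""
    processed = []
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue  # ignore sub-directories
        staged = Path(working_dir) / path.name
        shutil.move(str(path), str(staged))
        handle(Path(working_dir))  # stand-in for get_queries(...) and run(...)
        shutil.move(str(staged), str(Path(finished_dir) / path.name))
        processed.append(path.name)
    return processed


# Build a throwaway directory layout with two tiny FASTA files
root = Path(tempfile.mkdtemp())
dirs = {name: root / name for name in ("input", "temp_work", "finished")}
for directory in dirs.values():
    directory.mkdir()
for name in ("a.fasta", "b.fasta"):
    (dirs["input"] / name).write_text(">seq\nMKT\n")

# Record what the working directory contains each time the handler runs
seen = []
done = process_one_at_a_time(
    dirs["input"],
    dirs["temp_work"],
    dirs["finished"],
    lambda work: seen.append(sorted(p.name for p in work.iterdir())),
)
print(done)  # → ['a.fasta', 'b.fasta']
print(seen)  # → [['a.fasta'], ['b.fasta']]
```

Because the working directory only ever holds the current file, an interrupted run leaves unprocessed inputs in the input directory and completed ones in the finished directory.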

**Limitations**

MSAs: MMseqs2 is very precise and sensitive but might find fewer hits than HHblits/HMMer searches against BFD or MGnify.

**License**

The source code of ColabFold is licensed under [MIT](https://raw.githubusercontent.com/sokrypton/ColabFold/main/LICENSE). Additionally, this notebook uses the AlphaFold2 source code and its parameters, licensed under [Apache 2.0](https://raw.githubusercontent.com/deepmind/alphafold/main/LICENSE) and [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) respectively. Read more about the AlphaFold license [here](https://github.com/deepmind/alphafold).

**Acknowledgments**
- We thank the AlphaFold team for developing an excellent model and open sourcing the software.
- Do-Yoon Kim for creating the ColabFold logo.
- A colab by Sergey Ovchinnikov ([@sokrypton](https://twitter.com/sokrypton)), Milot Mirdita ([@milot_mirdita](https://twitter.com/milot_mirdita)) and Martin Steinegger ([@thesteinegger](https://twitter.com/thesteinegger)).
