<a href="https://colab.research.google.com/github/gbouras13/phold/blob/main/run_phold.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## phold

[phold](https://github.com/gbouras13/phold) is a sensitive annotation tool for bacteriophage genomes and metagenomes using protein structural homology.

phold uses the [ProstT5](https://github.com/mheinzinger/ProstT5) protein language model to translate protein amino acid sequences into the 3Di token alphabet used by [Foldseek](https://github.com/steineggerlab/foldseek). Foldseek then searches these 3Di sequences against a database of 803k protein structures, most of which were predicted using [ColabFold](https://github.com/sokrypton/ColabFold).


* Please make sure you change the runtime to T4 GPU, as this will make phold run faster.
* To do this, go to the top toolbar, then to Runtime -> Change runtime type -> Hardware accelerator.
* To run a cell, press the play button on its left side.
* Cells 1 and 2 install phold and download the database.
* Once they have been run, you can re-run Cell 3 as many times as you like.
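
Before running the cells, it can help to confirm that the GPU runtime is actually active. The following is a minimal sketch and not part of phold: the helper name `gpu_available` is hypothetical, and it assumes that a usable NVIDIA GPU is indicated by `nvidia-smi` being on the PATH and listing at least one device.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi` exists and lists at least one GPU.

    Hypothetical helper (not part of phold): assumes a working NVIDIA
    driver is indicated by `nvidia-smi -L` exiting with status 0.
    """
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_available():
    print("GPU runtime detected.")
else:
    print("No GPU detected: switch the runtime to T4 GPU before running phold.")
```

If the second message appears, change the runtime type as described above and run the check again.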



In [1]:
#@title 1. Install phold

#@markdown This cell installs phold. It will take a few minutes. Please be patient.

%%time
import os
from sys import version_info
# Match the conda environment's Python to the Colab runtime's Python (e.g. "3.10")
PYTHON_VERSION = f"{version_info.major}.{version_info.minor}"
PHOLD_VERSION = "0.1.4"


if not os.path.isfile("MAMBA_READY"):
  print("installing mamba...")
  os.system("wget -qnc https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh")
  os.system("bash Mambaforge-Linux-x86_64.sh -bfp /usr/local")
  os.system("mamba config --set auto_update_conda false")
  os.system("touch MAMBA_READY")

if not os.path.isfile("PHOLD_READY"):
  print("installing phold...")
  os.system(f"mamba install -y -c conda-forge -c bioconda phold=={PHOLD_VERSION} python='{PYTHON_VERSION}'  pytorch=*=cuda* ")
  os.system("touch PHOLD_READY")



installing mamba...
installing phold...
CPU times: user 618 ms, sys: 95 ms, total: 713 ms
Wall time: 2min 47s
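
Note that cell 1 ignores the return value of `os.system`, so the `MAMBA_READY` and `PHOLD_READY` marker files are created even if an install command fails. One possible variant, sketched here under the assumption that a non-zero exit status means the step failed, only writes the marker on success. The helper name `run_once` is hypothetical and not part of phold.

```python
import os
import subprocess


def run_once(marker: str, command: list[str]) -> bool:
    """Run `command` unless `marker` exists; create `marker` only on success.

    Returns True if the step is complete (already done or just succeeded).
    Hypothetical helper for illustration; not part of phold.
    """
    if os.path.isfile(marker):
        return True
    result = subprocess.run(command)
    if result.returncode != 0:
        print(f"Command failed with exit code {result.returncode}: {command}")
        return False
    # Record success so that re-running the cell skips this step.
    with open(marker, "w"):
        pass
    return True
```

Each `os.system(...)` call in cell 1 could be passed to such a helper as an argument list, so that a failed install is retried the next time the cell is run instead of being silently skipped.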


In [2]:
#@title 2. Download phold databases

#@markdown This cell downloads the phold database. It will take a few minutes. Please be patient.


%%time
print("Downloading phold database. This will take a few minutes. Please be patient :)")
os.system("phold install")




Downloading phold database. This will take a few minutes. Please be patient :)
CPU times: user 1.15 s, sys: 179 ms, total: 1.33 s
Wall time: 6min 39s


0

In [4]:
#@title 3. Run phold

#@markdown Upload your phage(s) as a Pharokka GenBank file or a nucleotide FASTA file.

#@markdown To do this, click on the folder icon to the left and use the file upload button (the one with the arrow).

#@markdown Once it is uploaded, write the file name in the INPUT_FILE field on the right.

#@markdown This is required, otherwise phold will not run properly.

#@markdown After this, you can optionally provide a directory for phold's output.
#@markdown If you don't, it will default to 'output_phold'.

#@markdown You can also provide a prefix for your output files.
#@markdown If you provide nothing it will default to 'phold'.

#@markdown You can also click FORCE to overwrite the output directory.
#@markdown This may be useful if your earlier phold run has crashed for whatever reason.

#@markdown If your input has multiple phages, you can click SEPARATE.
#@markdown This will output separate GenBank files in the output directory.

#@markdown The phold results will be available in the file browser (the folder icon in the left-hand panel).

#@markdown Additionally, the output directory will be zipped so you can download the whole directory.
#@markdown The file to download is OUT_DIR.zip, where OUT_DIR is the output directory you provided.

#@markdown To download it, click on the 3 dots next to the file and press download.

#@markdown If you do not see the output directory,
#@markdown refresh the window by clicking the folder with the refresh icon below "Files".



import os
import subprocess
import zipfile
INPUT_FILE = '' #@param {type:"string"}
OUT_DIR = 'output_phold'  #@param {type:"string"}
PREFIX = 'phold'  #@param {type:"string"}
FORCE = False  #@param {type:"boolean"}
SEPARATE = False  #@param {type:"boolean"}

# Construct the command
command = f"phold run -i {INPUT_FILE} -t 4 -o {OUT_DIR} -p {PREFIX}"

# Append optional flags selected in the form
if FORCE:
  command = f"{command} -f"
if SEPARATE:
  command = f"{command} --separate"


# Execute the command
try:
    print("Running phold")
    subprocess.run(command, shell=True, check=True)
    print("phold completed successfully.")
    print(f"Your output is in {OUT_DIR}.")
    print("Zipping the output directory so you can download it all in one go.")

    zip_filename = f"{OUT_DIR}.zip"

    # Zip the contents of the output directory
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(OUT_DIR):
            for file in files:
                zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), OUT_DIR))
    print(f"Output directory has been zipped to {zip_filename}")


except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")







Running phold
phold completed successfully.
Your output is in output_phold.
Zipping the output directory so you can download it all in one go.
Output directory has been zipped to output_phold.zip
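
Cell 3 interpolates the form fields directly into a shell string, so an empty `INPUT_FILE` or a file name containing spaces only surfaces as a phold error. A minimal sketch of one alternative is shown below: it validates the input and builds an argument list instead. The helper name `build_phold_command` and the file name `my phage.gbk` are hypothetical; the flags are only those that already appear in cell 3.

```python
import shlex


def build_phold_command(input_file, out_dir="output_phold", prefix="phold",
                        force=False, separate=False, threads=4):
    """Build the `phold run` argument list used in cell 3.

    Hypothetical helper for illustration; it uses only the flags that
    appear in cell 3 (-i, -t, -o, -p, -f, --separate).
    """
    if not input_file.strip():
        raise ValueError("INPUT_FILE is empty: enter the name of your uploaded file.")
    command = ["phold", "run", "-i", input_file, "-t", str(threads),
               "-o", out_dir, "-p", prefix]
    if force:
        command.append("-f")
    if separate:
        command.append("--separate")
    return command


# "my phage.gbk" is a made-up file name used to show the quoting.
command = build_phold_command("my phage.gbk", force=True)
# shlex.join quotes the arguments for display only.
print(shlex.join(command))
```

Because the result is a list, it could be passed to `subprocess.run(command, check=True)` without `shell=True`, which avoids shell quoting problems altogether.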
