<a href="https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold_and_phynteny.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pharokka + Phold + Phynteny

[pharokka](https://github.com/gbouras13/pharokka) is a rapid standardised annotation tool for bacteriophage genomes and metagenomes. You can read more about pharokka in the [documentation](https://pharokka.readthedocs.io/).

[phold](https://github.com/gbouras13/phold) is a sensitive annotation tool for bacteriophage genomes and metagenomes using protein structural homology. You can read more about phold in the [documentation](https://phold.readthedocs.io/).

phold uses the [ProstT5](https://github.com/mheinzinger/ProstT5) protein language model to translate protein amino acid sequences to the 3Di token alphabet used by [Foldseek](https://github.com/steineggerlab/foldseek). Foldseek then searches these 3Di sequences against a database of 803k protein structures, mostly predicted using [ColabFold](https://github.com/sokrypton/ColabFold).

[phynteny](https://github.com/susiegriggo/Phynteny_transformer) uses phage synteny (the conserved gene order across phages) with a hybrid transformer/LSTM architecture to assign hypothetical phage proteins to a PHROG category.

The tools are best run sequentially, as Pharokka conducts extra annotation steps like tRNA, tmRNA, CRISPR and INPHARED searches that Phold lacks (for now at least). Pharokka will also (rarely) annotate CDS that Phold can miss. Phynteny can then help annotate remaining hypothetical proteins with a PHROG category.
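
The hand-off between the three tools can be sketched as follows. This is only an illustration, not part of the notebook itself: `build_pipeline_commands` is a hypothetical helper, the flags are a subset of those used in Cells 4 to 6, and the `.gbk` paths assume each tool writes `<prefix>.gbk` into its output directory, as those cells do.

```python
import shlex


def build_pipeline_commands(
    input_fasta,
    pharokka_out="output_pharokka",
    pharokka_prefix="pharokka",
    phold_out="output_phold",
    phold_prefix="phold",
    phynteny_out="output_phynteny",
):
    """Return the three shell commands in the order they should be run."""
    # Pharokka annotates the raw FASTA; its GenBank file is Phold's input
    pharokka_gbk = f"{pharokka_out}/{pharokka_prefix}.gbk"
    # Phold's GenBank file is in turn Phynteny's input
    phold_gbk = f"{phold_out}/{phold_prefix}.gbk"

    q = shlex.quote  # protects file names that contain spaces
    return [
        f"conda run -n pharokka pharokka.py -d pharokka_db -i {q(input_fasta)} "
        f"-t 4 -o {q(pharokka_out)} -p {q(pharokka_prefix)}",
        f"phold run -i {q(pharokka_gbk)} -t 4 -o {q(phold_out)} "
        f"-p {q(phold_prefix)} -d phold_db --foldseek_gpu",
        f"phynteny_transformer -m /content/phynteny_models/models "
        f"-o {q(phynteny_out)} {q(phold_gbk)}",
    ]


# Print the commands for a hypothetical input file
for cmd in build_pipeline_commands("my_phage.fasta"):
    print(cmd)
```

Each command consumes the GenBank file written by the previous one, which is why the cells need to be run in order.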

* **Before you start, please make sure you change the runtime to T4 GPU (or any other kind of GPU if you have $$$), otherwise Phold won't be installed properly**
* To do this, go to the top toolbar, then to Runtime -> Change runtime type -> Hardware accelerator
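
If you are unsure whether the runtime change took effect, a quick check along these lines may help. It is a minimal sketch that assumes the NVIDIA driver tools (`nvidia-smi`) are on the `PATH` of a GPU runtime; `gpu_available` is a hypothetical helper, not something provided by pharokka, phold or phynteny.

```python
import shutil
import subprocess


def gpu_available():
    """Return True if `nvidia-smi` is on PATH and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    # `nvidia-smi -L` prints one line per visible GPU
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_available():
    print("GPU runtime detected")
else:
    print("No GPU detected: change the runtime type before running the install cells")
```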

* To run the cells, press the play button on the left side
* Cells 1, 2 and 3 install miniforge, pharokka, phold and phynteny, and download the databases/models.
* Once they have been run, you can re-run Cell 4 (to run Pharokka), Cell 5 (to run Phold) and Cell 6 (to run Phynteny) as many times as you would like
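
The install cells are cheap to re-run because they guard the slow steps with flag files (`CONDA_READY` and `PHAROKKA_PHOLD_PHYNTENY_READY`). The pattern can be sketched as below; `run_once` is a hypothetical helper written for illustration, and the demonstration uses a temporary directory rather than the notebook's real flag files.

```python
from pathlib import Path
import tempfile


def run_once(flag_path, action):
    """Run `action` only if the flag file is missing, then create the flag.

    Returns True if the action ran, False if it was skipped.
    """
    flag = Path(flag_path)
    if flag.exists():
        return False
    action()      # the flag is only written if this does not raise
    flag.touch()
    return True


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    flag = Path(tmp) / "EXAMPLE_READY"
    calls = []
    first = run_once(flag, lambda: calls.append("install"))
    second = run_once(flag, lambda: calls.append("install"))
    print(first, second, len(calls))  # True False 1
```

Deleting a flag file forces the corresponding install step to run again.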



In [None]:

#@title 1. Install miniforge

#@markdown This cell installs miniforge

%%bash

set -e

PYTHON_VERSION="3.13"


echo "python version ${PYTHON_VERSION}"

if [ ! -f CONDA_READY ]; then
  echo "installing python"
  wget -qnc https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
  bash Miniforge3-Linux-x86_64.sh -bfp /usr/local 2>&1 1>/dev/null
  conda config --set auto_update_conda false
  touch CONDA_READY
fi

pip install --upgrade matplotlib matplotlib-inline



In [None]:
#@title 2. Install pharokka and phold and phynteny

#@markdown This cell installs pharokka, phold and phynteny. It will take a few minutes. Please be patient

# add paths
import sys
sys.path.append("/usr/local/bin")

# Update environment variables for shell usage
import os
os.environ["PATH"] = "/usr/local/bin:" + os.environ["PATH"]

# create envs
# pharokka isn't compatible with Python 3.13 (Google Colab default)
# so it needs a separate env
from pathlib import Path
flag_file = Path("PHAROKKA_PHOLD_PHYNTENY_READY")
if not flag_file.exists():
  !conda create -y -n pharokka -c bioconda pharokka==1.7.5
  !conda install -y -c bioconda phold==1.0.0 pytorch=*=cuda* phynteny_transformer==0.1.2
  # Touch the flag file
  flag_file.touch()

In [None]:
#@title 3. Download pharokka, phold and phynteny databases

#@markdown This cell downloads the pharokka database, then the phold database, then the phynteny models. It will take some time (probably 10-15 minutes, depending on Zenodo's traffic). Please be patient. Perhaps go for a walk or have a coffee or tea.


%%time

print("Downloading pharokka database. This will take a few minutes. Please be patient :)")
!conda run -n pharokka install_databases.py -o pharokka_db


print("Downloading phold database. This will take a few minutes. Please be patient :)")
!phold install -d phold_db -t 8 --foldseek_gpu


print("Downloading phynteny models. This will take a few minutes. Please be patient :)")
!install_models -o  phynteny_models


In [None]:
#@title 4. Run Pharokka

#@markdown First, upload your phage(s) as a nucleotide input FASTA file

#@markdown Click on the folder icon to the left and use the file upload button.

#@markdown Once it is uploaded, write the file name in the INPUT_FILE field on the right.

#@markdown Then provide a directory for pharokka's output using PHAROKKA_OUT_DIR.
#@markdown The default is 'output_pharokka'.

#@markdown Then type in a gene prediction tool for pharokka.
#@markdown Please choose either 'phanotate', 'prodigal', or 'prodigal-gv'.

#@markdown You can also provide a prefix for your output files with PHAROKKA_PREFIX.
#@markdown If you provide nothing it will default to 'pharokka'.

#@markdown You can also provide a locus tag for your output files.
#@markdown If you provide nothing it will generate a random locus tag.

#@markdown FAST is ticked by default, which passes --fast so that Pharokka runs faster in the Colab environment.
#@markdown You can untick FAST to turn off --fast.

#@markdown You can click META to turn on --meta if you have multiple phages in your input.

#@markdown You can click META_HMM to turn on --meta_hmm.

#@markdown You can click FORCE to overwrite the output directory.
#@markdown This may be useful if your earlier pharokka run has crashed for whatever reason.

#@markdown The results of Pharokka will be in the folder icon on the left hand panel.
#@markdown Additionally, it will be zipped so you can download the whole directory.

#@markdown The file to download is PHAROKKA_OUT_DIR.zip, where PHAROKKA_OUT_DIR is what you provided

#@markdown If you do not see the output directory,
#@markdown refresh the window by either clicking the folder with the refresh icon below "Files"
#@markdown or double click and select "Refresh".


%%time
import os
import sys
import subprocess
import zipfile
INPUT_FILE = '' #@param {type:"string"}

if os.path.exists(INPUT_FILE):
    print(f"Input file {INPUT_FILE} exists")
else:
    print(f"Error: File {INPUT_FILE} does not exist")
    print(f"Please check the spelling and that you have uploaded it correctly")
    sys.exit(1)

PHAROKKA_OUT_DIR = 'output_pharokka'  #@param {type:"string"}
GENE_PREDICTOR = 'phanotate'  #@param {type:"string"}
allowed_gene_predictors = ['phanotate', 'prodigal', 'prodigal-gv']
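# Normalise the case so the value passed to pharokka matches the value that was validated
GENE_PREDICTOR = GENE_PREDICTOR.lower()
# Check if the input parameter is valid
if GENE_PREDICTOR not in allowed_gene_predictors:
    raise ValueError("Invalid GENE_PREDICTOR. Please choose from: 'phanotate', 'prodigal', 'prodigal-gv'.")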

PHAROKKA_PREFIX = 'pharokka'  #@param {type:"string"}
LOCUS_TAG = 'Default'  #@param {type:"string"}
FAST = True  #@param {type:"boolean"}
META = False  #@param {type:"boolean"}
META_HMM = False  #@param {type:"boolean"}
FORCE = True  #@param {type:"boolean"}


# Construct the command
# need to suppress PYTHONWARNINGS for phanotate version handling
command = (
    f'PYTHONWARNINGS="ignore::UserWarning:phanotate_modules.file_handling" '
    f'conda run -n pharokka pharokka.py -d pharokka_db -i {INPUT_FILE} -t 4 '
    f'-o {PHAROKKA_OUT_DIR} -p {PHAROKKA_PREFIX} -l {LOCUS_TAG} -g {GENE_PREDICTOR}'
)

if FORCE is True:
  command = f"{command} -f"

if FAST is True:
  command = f"{command} --fast"

if META is True:
  command = f"{command} -m"

if META_HMM is True:
  command = f"{command} --meta_hmm"

# Execute the command
try:
    print("Running pharokka")
    subprocess.run(command, shell=True, check=True)
    print("pharokka completed successfully.")
    print(f"Your output is in {PHAROKKA_OUT_DIR}.")
    print(f"Zipping the output directory so you can download it all in one go.")

    zip_filename = f"{PHAROKKA_OUT_DIR}.zip"

    # Zip the contents of the output directory
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(PHAROKKA_OUT_DIR):
            for file in files:
                zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), PHAROKKA_OUT_DIR))
    print(f"Output directory has been zipped to {zip_filename}")


except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")







In [None]:
#@title 5. Run phold

#@markdown This cell will run phold on the output of cell 4's Pharokka run

#@markdown You do not need to provide any further input files

#@markdown You can now provide a directory for phold's output with PHOLD_OUT_DIR.
#@markdown The default is 'output_phold'.

#@markdown You can also provide a prefix for your output files with PHOLD_PREFIX.
#@markdown If you provide nothing it will default to 'phold'.

#@markdown You can click FORCE to overwrite the output directory.
#@markdown This may be useful if your earlier phold run has crashed for whatever reason.

#@markdown If your input has multiple phages, you can click SEPARATE.
#@markdown This will output separate GenBank files in the output directory.

#@markdown The results of Phold will be in the folder icon on the left hand panel.
#@markdown Additionally, it will be zipped so you can download the whole directory.

#@markdown The file to download is PHOLD_OUT_DIR.zip, where PHOLD_OUT_DIR is what you provided

#@markdown If you do not see the output directory,
#@markdown refresh the window by either clicking the folder with the refresh icon below "Files"
#@markdown or double click and select "Refresh".


%%time
import os
import subprocess
import zipfile

# phold input is pharokka output
PHOLD_INPUT = f"{PHAROKKA_OUT_DIR}/{PHAROKKA_PREFIX}.gbk"
PHOLD_OUT_DIR = 'output_phold'  #@param {type:"string"}
PHOLD_PREFIX = 'phold'  #@param {type:"string"}
FORCE = True  #@param {type:"boolean"}
SEPARATE = False  #@param {type:"boolean"}

# Construct the command
command = f"phold run -i {PHOLD_INPUT} -t 4 -o {PHOLD_OUT_DIR} -p {PHOLD_PREFIX} -d phold_db --foldseek_gpu"

if FORCE is True:
  command = f"{command} -f"
if SEPARATE is True:
  command = f"{command} --separate"


# Execute the command
try:
    print("Running phold")
    subprocess.run(command, shell=True, check=True)
    print("phold completed successfully.")
    print(f"Your output is in {PHOLD_OUT_DIR}.")
    print(f"Zipping the output directory so you can download it all in one go.")

    zip_filename = f"{PHOLD_OUT_DIR}.zip"

    # Zip the contents of the output directory
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(PHOLD_OUT_DIR):
            for file in files:
                zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), PHOLD_OUT_DIR))
    print(f"Output directory has been zipped to {zip_filename}")


except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")







In [None]:
#@title 6. Run Phynteny

#@markdown This cell will run phynteny on the output of cell 5's Phold run to predict the function of remaining hypothetical proteins

#@markdown You do not need to provide any further input files

#@markdown You can now provide a directory for phynteny's output with PHYNTENY_OUT_DIR.
#@markdown The default is 'output_phynteny'.

#@markdown You can click FORCE to overwrite the output directory.
#@markdown This may be useful if your phynteny run has crashed for whatever reason.

#@markdown The results of Phynteny will be in the folder icon on the left hand panel.
#@markdown Additionally, it will be zipped so you can download the whole directory.

#@markdown The file to download is PHYNTENY_OUT_DIR.zip, where PHYNTENY_OUT_DIR is what you provided

#@markdown If you do not see the output directory,
#@markdown refresh the window by either clicking the folder with the refresh icon below "Files"
#@markdown or double click and select "Refresh".


%%time
import os
import subprocess
import zipfile

# phynteny input is phold output
PHYNTENY_INPUT = f"{PHOLD_OUT_DIR}/{PHOLD_PREFIX}.gbk"
PHYNTENY_OUT_DIR = 'output_phynteny'  #@param {type:"string"}
FORCE = False  #@param {type:"boolean"}

# Construct the command
command = f"phynteny_transformer -m  /content/phynteny_models/models -o {PHYNTENY_OUT_DIR} {PHYNTENY_INPUT}"
if FORCE is True:
  command = f"{command} -f"


# Execute the command
try:
    print("Running phynteny")
    subprocess.run(command, shell=True, check=True)
    print("phynteny completed successfully.")
    print(f"Your output is in {PHYNTENY_OUT_DIR}.")
    print(f"Zipping the output directory so you can download it all in one go.")

    zip_filename = f"{PHYNTENY_OUT_DIR}.zip"

    # Zip the contents of the output directory
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(PHYNTENY_OUT_DIR):
            for file in files:
                zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), PHYNTENY_OUT_DIR))
    print(f"Output directory has been zipped to {zip_filename}")


except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")





