<a href="https://colab.research.google.com/github/gbouras13/phold/blob/main/run_pharokka_and_phold.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Hybracter

[hybracter](https://github.com/gbouras13/hybracter) is an automated long-read-first bacterial genome assembly tool. You can read more about hybracter in the [documentation](https://hybracter.readthedocs.io/).

**This notebook runs Hybracter hybrid and/or Hybracter long on a single isolate at a time. If you have more than one isolate, I'd recommend a local install of Hybracter.**

**To run the code cells, press the play buttons on the top left of each block**

Main Instructions

* Cells 1 and 2 install hybracter, download the databases and run tests. These must be run first.
* Once they have been run, you can run either Cell 3, Cell 4 or both as many times as you wish.
* To run hybracter hybrid-single, run Cell 3.
* To run hybracter long-single, run Cell 4 (you can skip Cell 3).

Other instructions

* Please make sure you change the runtime to CPU (GPU is not required).
* To do this, go to the top toolbar, then to Runtime -> Change runtime type -> Hardware accelerator
* You may want to upload your FASTQ files first, as this takes a while for large files.
* Click on the folder icon to the left and use the file upload button (with the upwards-facing arrow).
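
Large uploads can be interrupted, so it may be worth confirming that a FASTQ file is intact before starting a run. The following is a minimal sketch, assuming standard four-line FASTQ records. The helper name `count_fastq_reads` is hypothetical and is not part of hybracter.

```python
import gzip

def count_fastq_reads(path):
    """Count records in a plain or gzipped FASTQ file (assumes 4 lines per record)."""
    # Hypothetical helper for illustration; not part of hybracter
    opener = gzip.open if path.endswith(".gz") else open
    n_lines = 0
    with opener(path, "rt") as handle:
        for _ in handle:
            n_lines += 1
    if n_lines % 4 != 0:
        raise ValueError(
            f"{path} has {n_lines} lines, which is not a multiple of 4; "
            "the upload may be incomplete"
        )
    return n_lines // 4
```

A truncated `.gz` upload will usually raise an error while the file is being read, which is also a sign that the file should be uploaded again.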


In [1]:
#@title 1. Install hybracter

#@markdown This cell installs hybracter. It will take a few minutes. Please be patient.

%%time
import os
from sys import version_info
python_version = f"{version_info.major}.{version_info.minor}"
PYTHON_VERSION = python_version
HYBRACTER_VERSION = "0.7.3"


print(PYTHON_VERSION)

if not os.path.isfile("MAMBA_READY"):
  print("installing mamba...")
  os.system("wget -qnc https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh")
  os.system("bash Mambaforge-Linux-x86_64.sh -bfp /usr/local")
  os.system("mamba config --set auto_update_conda false")
  os.system("touch MAMBA_READY")

if not os.path.isfile("HYBRACTER_READY"):
  print("installing hybracter ...")
  os.system(f"mamba install -y -c conda-forge -c bioconda python='{PYTHON_VERSION}' hybracter=={HYBRACTER_VERSION}")
  os.system("touch HYBRACTER_READY")


3.10
installing mamba...
installing hybracter ...
CPU times: user 163 ms, sys: 24.4 ms, total: 188 ms
Wall time: 44 s
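
The install cell above writes its `MAMBA_READY` and `HYBRACTER_READY` marker files whether or not the commands succeeded, because the return value of `os.system` is never checked. A failed install would then be skipped on the next run. One possible way to guard against this is sketched below. The helper name `run_once` is hypothetical and is not part of hybracter.

```python
import os
import subprocess

def run_once(marker, command):
    """Run a shell command unless `marker` exists; create `marker` only on success."""
    # Hypothetical helper for illustration; not part of hybracter
    if os.path.isfile(marker):
        return True  # the step already succeeded in an earlier run
    result = subprocess.run(command, shell=True)
    if result.returncode != 0:
        print(f"Command failed with exit code {result.returncode}: {command}")
        return False
    # Record success so that re-running the cell skips this step
    with open(marker, "w"):
        pass
    return True
```

With this helper, a call such as `run_once("HYBRACTER_READY", install_command)` would leave no marker behind after a failure, so re-running the cell retries the install.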


In [5]:
#@title 2. Download hybracter database and run tests

#@markdown This cell downloads the hybracter database and runs tests to install all environments.
#@markdown It will take some time (20-30 mins). Please be patient.

%%time
os.environ["CUDA_VISIBLE_DEVICES"]=""
THREADS = "2"
print("Downloading hybracter database. This will take some time. Please be patient :)")
os.system("hybracter install")
print("Running hybracter tests to install all environments. This will take some time. Please be patient :)")
os.system(f"hybracter test-hybrid --threads {THREADS}")
os.system(f"hybracter test-long --threads {THREADS}")
os.system(f"rm -rf hybracter_out")


Downloading hybracter database. This will take some time. Please be patient :)
Running hybracter tests to install all environments. This will take some time. Please be patient :)
CPU times: user 114 ms, sys: 20.3 ms, total: 134 ms
Wall time: 28.1 s


0
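
Cell 2 hard-codes `THREADS = "2"`, and Cells 3 and 4 reuse that value. If the runtime has more CPUs available, the value could instead be derived from the machine. This is a minimal sketch; the helper name `default_threads` is hypothetical.

```python
import os

def default_threads(fallback=2):
    """Return the CPU count as a string, or `fallback` if it cannot be determined."""
    # Hypothetical helper for illustration; not part of hybracter
    n_cpus = os.cpu_count()  # may return None on some platforms
    return str(n_cpus if n_cpus else fallback)

# Kept as a string because the later cells interpolate it into a command
THREADS = default_threads()
print(f"Using {THREADS} threads")
```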

In [None]:
#@title 3. Run Hybracter Hybrid

#@markdown First, upload your long-reads as a single input .fastq or .fastq.gz file

#@markdown Click on the folder icon to the left and use file upload button.

#@markdown Once it is uploaded, write the file name in the LONG_FASTQ field on the right.

#@markdown Then, upload your short-reads as 2 input .fastq or .fastq.gz files

#@markdown Click on the folder icon to the left and use file upload button.

#@markdown Once they are uploaded, write the forward (R1) file name in the R1_FASTQ field on the right and the reverse (R2) file name in the R2_FASTQ field on the right.

#@markdown Then provide a directory for hybracter's output using HYBRACTER_OUT_DIR.
#@markdown The default is 'hybracter_hybrid_output'.

#@markdown Then provide a sample name using SAMPLE.
#@markdown The default is 'sample'.

#@markdown Then provide an estimated chromosome size (as an integer) using CHROMOSOME.
#@markdown The default is 1000000.

#@markdown You can also provide a min_length for QC filtering the long read data with MIN_LENGTH.
#@markdown If you provide nothing it will default to 1000.

#@markdown You can also provide a min_quality for QC filtering the long read data with MIN_QUALITY.
#@markdown If you provide nothing it will default to 9.

#@markdown You can also provide a flyeModel specifying the Flye model with FLYE_MODEL.
#@markdown If you provide nothing it will default to '--nano-hq'.

#@markdown You can also provide a medakaModel specifying the Medaka polishing model with MEDAKA_MODEL.
#@markdown If you provide nothing it will default to 'r1041_e82_400bps_sup_v4.2.0'.

#@markdown You can click NO_MEDAKA to turn off medaka polishing (recommended for the latest Q20+ SUP v4.3.0 Nanopore data).
#@markdown By default it is False.

#@markdown The results of hybracter hybrid-single will be in the folder icon on the left hand panel.
#@markdown Additionally, it will be zipped so you can download the whole directory.

#@markdown The file to download is HYBRACTER_OUT_DIR.zip, where HYBRACTER_OUT_DIR is what you provided

#@markdown If you do not see the output directory,
#@markdown refresh the window by either clicking the folder with the refresh icon below "Files"
#@markdown or double click and select "Refresh".


%%time
import os
import sys
import subprocess
import zipfile
LONG_FASTQ = '' #@param {type:"string"}
R1_FASTQ = '' #@param {type:"string"}
R2_FASTQ = '' #@param {type:"string"}

# Check that every input file exists before running hybracter
for fastq in (LONG_FASTQ, R1_FASTQ, R2_FASTQ):
    if os.path.exists(fastq):
        print(f"Input file {fastq} exists")
    else:
        print(f"Error: File {fastq} does not exist")
        print("Please check the spelling and that you have uploaded it correctly")
        sys.exit(1)

HYBRACTER_OUT_DIR = 'hybracter_hybrid_output'  #@param {type:"string"}
SAMPLE = 'sample'  #@param {type:"string"}
CHROMOSOME = 1000000  #@param {type:"integer"}
MIN_LENGTH = 1000  #@param {type:"integer"}
MIN_QUALITY = 9  #@param {type:"integer"}
FLYE_MODEL = '--nano-hq'  #@param {type:"string"}
MEDAKA_MODEL = 'r1041_e82_400bps_sup_v4.2.0'  #@param {type:"string"}
NO_MEDAKA = False  #@param {type:"boolean"}

# Construct the command
command = f"hybracter hybrid-single -s {SAMPLE} -c {CHROMOSOME} -l {LONG_FASTQ} -1 {R1_FASTQ} -2 {R2_FASTQ} -o {HYBRACTER_OUT_DIR} -t {THREADS} --min_length {MIN_LENGTH} --min_quality {MIN_QUALITY} --flyeModel {FLYE_MODEL}"

# If medaka polishing is enabled, pass the medaka model; otherwise disable polishing
if not NO_MEDAKA:
  command = f"{command} --medakaModel {MEDAKA_MODEL}"
else:
  command = f"{command} --no_medaka"

# Execute the command
try:
    print("Running hybracter hybrid-single")
    subprocess.run(command, shell=True, check=True)
    print("hybracter hybrid-single completed successfully.")
    print(f"Your output is in {HYBRACTER_OUT_DIR}.")
    print(f"Zipping the output directory so you can download it all in one go.")

    zip_filename = f"{HYBRACTER_OUT_DIR}.zip"

    # Zip the contents of the output directory
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(HYBRACTER_OUT_DIR):
            for file in files:
                zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), HYBRACTER_OUT_DIR))
    print(f"Output directory has been zipped to {zip_filename}")


except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")







Input file E_faecalis_CI_1043.fastq.gz exists
Input file 1043_S42_R1_001.fastq.gz exists
Input file 1043_S42_R2_001.fastq.gz exists
Running hybracter hybrid-single
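
Cell 3 builds the command as a single shell string, so a file name containing a space or other shell-special character would be split or misread. A possible alternative is to assemble the call as a list of arguments. The sketch below uses only the flags that already appear in Cell 3. The function name `build_hybrid_command` and the example file names are hypothetical.

```python
import shlex

def build_hybrid_command(sample, chromosome, long_fastq, r1_fastq, r2_fastq,
                         out_dir, threads, min_length=1000, min_quality=9,
                         flye_model="--nano-hq",
                         medaka_model="r1041_e82_400bps_sup_v4.2.0",
                         no_medaka=False):
    """Assemble the hybracter hybrid-single call as a list of arguments."""
    # Hypothetical helper for illustration; flags are copied from Cell 3
    command = [
        "hybracter", "hybrid-single",
        "-s", sample, "-c", str(chromosome),
        "-l", long_fastq, "-1", r1_fastq, "-2", r2_fastq,
        "-o", out_dir, "-t", str(threads),
        "--min_length", str(min_length),
        "--min_quality", str(min_quality),
        "--flyeModel", flye_model,
    ]
    if no_medaka:
        command.append("--no_medaka")
    else:
        command += ["--medakaModel", medaka_model]
    return command

# Placeholder file names for illustration only
example = build_hybrid_command("sample", 1000000, "long reads.fastq.gz",
                               "R1.fastq.gz", "R2.fastq.gz",
                               "hybracter_hybrid_output", 2)
# shlex.join shows the command with shell quoting applied where needed
print(shlex.join(example))
```

A list built this way could be passed to `subprocess.run(example, check=True)` without `shell=True`, so each file name reaches hybracter as exactly one argument.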


In [None]:
#@title 4. Run Hybracter Long

#@markdown First, upload your long-reads as a single input .fastq or .fastq.gz file

#@markdown Click on the folder icon to the left and use file upload button.

#@markdown Once it is uploaded, write the file name in the LONG_FASTQ field on the right.

#@markdown Then provide a directory for hybracter's output using HYBRACTER_OUT_DIR.
#@markdown The default is 'hybracter_long_output'.

#@markdown Then provide a sample name using SAMPLE.
#@markdown The default is 'sample'.

#@markdown Then provide an estimated chromosome size (as an integer) using CHROMOSOME.
#@markdown The default is 1000000.

#@markdown You can also provide a min_length for QC filtering the long read data with MIN_LENGTH.
#@markdown If you provide nothing it will default to 1000.

#@markdown You can also provide a min_quality for QC filtering the long read data with MIN_QUALITY.
#@markdown If you provide nothing it will default to 9.

#@markdown You can also provide a flyeModel specifying the Flye model with FLYE_MODEL.
#@markdown If you provide nothing it will default to '--nano-hq'.

#@markdown You can also provide a medakaModel specifying the Medaka polishing model with MEDAKA_MODEL.
#@markdown If you provide nothing it will default to 'r1041_e82_400bps_sup_v4.2.0'.

#@markdown You can click NO_MEDAKA to turn off medaka polishing (recommended for the latest Q20+ SUP v4.3.0 Nanopore data).
#@markdown By default it is False.

#@markdown The results of hybracter long-single will be in the folder icon on the left hand panel.
#@markdown Additionally, it will be zipped so you can download the whole directory.

#@markdown The file to download is HYBRACTER_OUT_DIR.zip, where HYBRACTER_OUT_DIR is what you provided

#@markdown If you do not see the output directory,
#@markdown refresh the window by either clicking the folder with the refresh icon below "Files"
#@markdown or double click and select "Refresh".


%%time
import os
import sys
import subprocess
import zipfile
LONG_FASTQ = '' #@param {type:"string"}

if os.path.exists(LONG_FASTQ):
    print(f"Input file {LONG_FASTQ} exists")
else:
    print(f"Error: File {LONG_FASTQ} does not exist")
    print(f"Please check the spelling and that you have uploaded it correctly")
    sys.exit(1)

HYBRACTER_OUT_DIR = 'hybracter_long_output'  #@param {type:"string"}
SAMPLE = 'sample'  #@param {type:"string"}
CHROMOSOME = 1000000  #@param {type:"integer"}
MIN_LENGTH = 1000  #@param {type:"integer"}
MIN_QUALITY = 9  #@param {type:"integer"}
FLYE_MODEL = '--nano-hq'  #@param {type:"string"}
MEDAKA_MODEL = 'r1041_e82_400bps_sup_v4.2.0'  #@param {type:"string"}
NO_MEDAKA = False  #@param {type:"boolean"}

# Construct the command
command = f"hybracter long-single -s {SAMPLE} -c {CHROMOSOME} -l {LONG_FASTQ} -o {HYBRACTER_OUT_DIR} -t {THREADS} --min_length {MIN_LENGTH} --min_quality {MIN_QUALITY} --flyeModel {FLYE_MODEL}"

# If medaka polishing is enabled, pass the medaka model; otherwise disable polishing
if not NO_MEDAKA:
  command = f"{command} --medakaModel {MEDAKA_MODEL}"
else:
  command = f"{command} --no_medaka"

# Execute the command
try:
    print("Running hybracter long-single")
    subprocess.run(command, shell=True, check=True)
    print("hybracter long-single completed successfully.")
    print(f"Your output is in {HYBRACTER_OUT_DIR}.")
    print(f"Zipping the output directory so you can download it all in one go.")

    zip_filename = f"{HYBRACTER_OUT_DIR}.zip"

    # Zip the contents of the output directory
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(HYBRACTER_OUT_DIR):
            for file in files:
                zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), HYBRACTER_OUT_DIR))
    print(f"Output directory has been zipped to {zip_filename}")


except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")







CPU times: user 1.25 s, sys: 153 ms, total: 1.4 s
Wall time: 4min 17s
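
Before downloading the zipped output, it can be useful to confirm that the archive is readable and contains what you expect. This is a minimal sketch using only the standard-library `zipfile` module. The helper name `summarise_zip` is hypothetical.

```python
import zipfile

def summarise_zip(zip_path):
    """Return (number of files, total uncompressed bytes) for a zip archive."""
    # Hypothetical helper for illustration; not part of hybracter
    with zipfile.ZipFile(zip_path) as zipf:
        bad_member = zipf.testzip()  # None when every member passes its CRC check
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in {zip_path}: {bad_member}")
        infos = [info for info in zipf.infolist() if not info.is_dir()]
    return len(infos), sum(info.file_size for info in infos)
```

For example, `summarise_zip("hybracter_long_output.zip")` would report how many files were archived and their combined size, assuming the default output directory name was kept.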
