<a href="https://colab.research.google.com/github/gbouras13/plassembler/blob/main/run_plassembler.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Plassembler

[plassembler](https://github.com/gbouras13/plassembler) is a program for fast, automated assembly of plasmids in bacterial genomes that have been hybrid sequenced with long reads and paired-end short reads. It was originally designed for Oxford Nanopore Technologies long reads, but it also works with PacBio reads. As of v1.3.0, it also works well for genomes sequenced with long reads only.

The full documentation for Plassembler can be found [here](https://plassembler.readthedocs.io/en/latest).

**To run the code cells, press the play buttons on the top left of each block**

Main Instructions

* Cells 1 and 2 install plassembler and download the database. These must be run first.
* Once they have been run, you can run either Cell 3, Cell 4 or both as many times as you wish.
* To run `plassembler run` (if you have both long- and short-reads), run Cell 3.
* To run `plassembler long` (if you have only long-reads), run Cell 4.

Other instructions

* Please make sure you change the runtime to CPU (GPU is not required).
* To do this, go to the top toolbar, then to Runtime -> Change runtime type -> Hardware accelerator
* You may want to upload your FASTQ files first, as this takes a while for large files.
* Click on the folder icon to the left and use the file upload button (with the upward-facing arrow).
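
Once the upload has finished, it can help to confirm that the files are where the later cells expect them. The following is a minimal sketch, not part of plassembler: `list_fastq_files` is a hypothetical helper, and it assumes the files were uploaded to the current working directory with a `.fastq` or `.fastq.gz` extension.

```python
from pathlib import Path

# Extensions accepted by the cells in this notebook.
FASTQ_SUFFIXES = (".fastq", ".fastq.gz")


def list_fastq_files(directory="."):
    """Return (file name, size in MB) for each FASTQ file in `directory`."""
    found = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.name.endswith(FASTQ_SUFFIXES):
            found.append((path.name, path.stat().st_size / 1e6))
    return found


# Print one line per uploaded FASTQ file (nothing is printed if none are found).
for name, size_mb in list_fastq_files("."):
    print(f"{name}\t{size_mb:.1f} MB")
```

A file whose size is still growing between two runs of this snippet has probably not finished uploading.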


In [7]:
#@title 1. Install plassembler

#@markdown This cell installs plassembler.

%%time
import os
from sys import version_info
python_version = f"{version_info.major}.{version_info.minor}"
PYTHON_VERSION = python_version
PLASSEMBLER_VERSION = "1.6.2"

print(PYTHON_VERSION)

if not os.path.isfile("CONDA_READY"):
  print("installing conda...")
  os.system("wget -qnc https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh")
  os.system("bash Miniconda3-latest-Linux-x86_64.sh -bfp /usr/local")
  # Miniconda ships conda, not mamba, so the setting is applied with conda
  os.system("conda config --set auto_update_conda false")
  os.system("touch CONDA_READY")

if not os.path.isfile("PLASSEMBLER_READY"):
  print("installing plassembler ...")
  os.system(f"conda create -n plassemblerENV -y -c conda-forge -c bioconda python=3.9 plassembler=={PLASSEMBLER_VERSION} unicycler==0.5.0")
  os.system("touch PLASSEMBLER_READY")



3.11
CPU times: user 2.61 ms, sys: 0 ns, total: 2.61 ms
Wall time: 4.48 ms


32512
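
The bare `32512` in the output above has the form of a raw wait status, which is what `os.system` returns on Linux. Assuming it did come from one of the `os.system` calls in Cell 1, the sketch below shows how such a value can be decoded. `decode_wait_status` is a hypothetical helper written for this illustration.

```python
def decode_wait_status(status):
    """Split a POSIX wait status (as returned by os.system on Linux)
    into (exit_code, signal_number)."""
    exit_code = (status >> 8) & 0xFF  # high byte: exit code of the command
    signal_number = status & 0x7F     # low 7 bits: terminating signal, 0 if none
    return exit_code, signal_number


exit_code, signal_number = decode_wait_status(32512)
print(exit_code, signal_number)  # prints "127 0"
```

Exit code 127 is what a shell reports when the command it was asked to run could not be found, so a value like this is worth checking before moving on to Cell 2.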

In [2]:
#@title 2. Download plassembler database

#@markdown This cell downloads the plassembler database.
#@markdown It will take some time (5-10 mins). Please be patient.
%%bash
source activate /usr/local/envs/plassemblerENV
python
import os
print("Downloading plassembler database. This will take some time. Please be patient :)")
os.system("/usr/local/envs/plassemblerENV/bin/plassembler download -d plassembler_db")


Downloading plassembler database. This will take some time. Please be patient :)


0
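
Before starting a long assembly, a quick check that the download left something in `plassembler_db` can save time. This is only a sketch: `database_looks_ready` is a hypothetical helper, and it checks that the directory exists and is non-empty, not that the files inside are the ones plassembler expects.

```python
import os


def database_looks_ready(db_dir="plassembler_db"):
    """Return (ready, files): ready is True if db_dir exists and holds at least one entry."""
    if not os.path.isdir(db_dir):
        return False, []
    files = sorted(os.listdir(db_dir))
    return len(files) > 0, files


ready, files = database_looks_ready("plassembler_db")
print("Database directory looks ready" if ready else "Database directory is missing or empty")
for name in files:
    print(" ", name)
```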

In [19]:
#@title 3. Plassembler Run (Hybrid reads)

#@markdown This will probably take a while (an hour or two, depending on your read sets), as the Colab environment has limited resources. It is best to start it over lunch.

#@markdown First, upload your long-reads as a single input .fastq or .fastq.gz file

#@markdown Click on the folder icon to the left and use file upload button.

#@markdown Once it is uploaded, write the file name in the LONG_FASTQ field on the right.

#@markdown Then, upload your short-reads as 2 input .fastq or .fastq.gz files

#@markdown Click on the folder icon to the left and use file upload button.

#@markdown Once they are uploaded, write the forward (R1) file name in the R1_FASTQ field on the right and the reverse (R2) file name in the R2_FASTQ field on the right.

#@markdown Then provide a directory for plassembler's output using PLASSEMBLER_OUT_DIR.
#@markdown The default is 'plassembler_output'.

#@markdown Then provide an estimated chromosome size (as an integer) using CHROMOSOME.
#@markdown The default is 1000000.

#@markdown You can also provide a min_length for QC filtering the long read data with MIN_LENGTH.
#@markdown If you provide nothing it will default to 1000.

#@markdown You can also provide a min_quality for QC filtering the long read data with MIN_QUALITY.
#@markdown If you provide nothing it will default to 9.

#@markdown You can click SKIP_QC to turn off qc (fastp and filtlong).
#@markdown By default it is False.

#@markdown You can click RAW to pass --nano-raw to Flye. This is designed for Guppy fast
#@markdown configuration reads. By default, Flye will assume
#@markdown SUP or HAC reads and use --nano-hq.

#@markdown If you have PacBio reads, please change PACBIO_MODEL
#@markdown from 'none' to one of 'pacbio-hifi', 'pacbio-corr', 'pacbio-raw'. Use pacbio-raw for
#@markdown regular PacBio CLR reads (<20 percent error),
#@markdown pacbio-corr for PacBio reads that were corrected with other methods (<3 percent error), or
#@markdown pacbio-hifi for PacBio HiFi reads (<1 percent error).

#@markdown You can click FORCE to overwrite the output directory.
#@markdown This may be useful if your earlier plassembler run has crashed for whatever reason.

#@markdown The results of `plassembler run` will appear in the file browser (the folder icon on the left-hand panel).
#@markdown Additionally, the output directory will be zipped so you can download it in one go.

#@markdown The file to download is PLASSEMBLER_OUT_DIR.zip, where PLASSEMBLER_OUT_DIR is the name you provided.

#@markdown If you do not see the output directory,
#@markdown refresh the window by either clicking the folder with the refresh icon below "Files"
#@markdown or double click and select "Refresh".
%%bash
source activate /usr/local/envs/plassemblerENV
python
import os
import sys
import subprocess
import zipfile

LONG_FASTQ = '' #@param {type:"string"}
R1_FASTQ = '' #@param {type:"string"}
R2_FASTQ = '' #@param {type:"string"}
THREADS = "2"

if os.path.exists(LONG_FASTQ):
    print(f"Input file {LONG_FASTQ} exists")
else:
    print(f"Error: File {LONG_FASTQ} does not exist")
    print(f"Please check the spelling and that you have uploaded it correctly")
    sys.exit(1)

if os.path.exists(R1_FASTQ):
    print(f"Input file {R1_FASTQ} exists")
else:
    print(f"Error: File {R1_FASTQ} does not exist")
    print(f"Please check the spelling and that you have uploaded it correctly")
    sys.exit(1)

if os.path.exists(R2_FASTQ):
    print(f"Input file {R2_FASTQ} exists")
else:
    print(f"Error: File {R2_FASTQ} does not exist")
    print(f"Please check the spelling and that you have uploaded it correctly")
    sys.exit(1)

PLASSEMBLER_OUT_DIR = 'plassembler_output'  #@param {type:"string"}
CHROMOSOME = 1000000  #@param {type:"integer"}
MIN_LENGTH = 1000  #@param {type:"integer"}
MIN_QUALITY = 9  #@param {type:"integer"}
SKIP_QC = False  #@param {type:"boolean"}
RAW = False  #@param {type:"boolean"}
PACBIO_MODEL = 'none'  #@param {type:"string"}
allowed_pacbio_models = ['none', 'pacbio-hifi', 'pacbio-corr', 'pacbio-raw']
# Check that PACBIO_MODEL is one of the supported values
if PACBIO_MODEL.lower() not in allowed_pacbio_models:
    raise ValueError("Invalid PACBIO_MODEL. Please choose from: 'none', 'pacbio-hifi', 'pacbio-corr', 'pacbio-raw'.")

FORCE = True  #@param {type:"boolean"}


# Construct the command
command = f"/usr/local/envs/plassemblerENV/bin/plassembler run -d plassembler_db -c {CHROMOSOME} -l {LONG_FASTQ} -1 {R1_FASTQ} -2 {R2_FASTQ} -o {PLASSEMBLER_OUT_DIR} -t {THREADS} --min_length {MIN_LENGTH} --min_quality {MIN_QUALITY}"

# Append the optional flags selected in the form
if SKIP_QC:
  command = f"{command} --skip_qc"

if RAW:
  command = f"{command} -r"

if FORCE:
  command = f"{command} -f"

if PACBIO_MODEL != 'none':
  command = f"{command} --pacbio_model {PACBIO_MODEL}"


# Execute the command
try:
    print("Running plassembler run")
    subprocess.run(command, shell=True, check=True)
    print("plassembler run completed successfully.")
    print(f"Your output is in {PLASSEMBLER_OUT_DIR}.")
    print(f"Zipping the output directory so you can download it all in one go.")

    zip_filename = f"{PLASSEMBLER_OUT_DIR}.zip"

    # Zip the contents of the output directory
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(PLASSEMBLER_OUT_DIR):
            for file in files:
                zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), PLASSEMBLER_OUT_DIR))
    print(f"Output directory has been zipped to {zip_filename}")


except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")





Input file input_fastq.gz exists
Input file input_R1.fastq.gz exists
Input file input_R2.fastq.gz exists
Running plassembler run
plassembler run completed successfully.
Your output is in plassembler_output.
Zipping the output directory so you can download it all in one go.
Output directory has been zipped to plassembler_output.zip


2025-01-16 06:09:00.342 | INFO     | plassembler:begin_plassembler:100 - You are using Plassembler version 1.6.2
2025-01-16 06:09:00.342 | INFO     | plassembler:begin_plassembler:101 - Repository homepage is https://github.com/gbouras13/plassembler
2025-01-16 06:09:00.342 | INFO     | plassembler:begin_plassembler:102 - Written by George Bouras: george.bouras@adelaide.edu.au
2025-01-16 06:09:00.343 | INFO     | plassembler:run:405 - Database directory is plassembler_db
2025-01-16 06:09:00.343 | INFO     | plassembler:run:406 - Longreads file is input_fastq.gz
2025-01-16 06:09:00.343 | INFO     | plassembler:run:407 - R1 fasta file is input_R1.fastq.gz
2025-01-16 06:09:00.343 | INFO     | plassembler:run:408 - R2 fasta file is input_R2.fastq.gz
2025-01-16 06:09:00.343 | INFO     | plassembler:run:409 - Chromosome length threshold is 50000
2025-01-16 06:09:00.344 | INFO     | plassembler:run:410 - Output directory is plassembler_output
2025-01-16 06:09:00.344 | INFO     | plassembler:ru
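
Cell 3 builds the command as a single string and runs it through the shell, which breaks if a file name contains spaces or shell metacharacters. The sketch below shows one alternative: assembling the same flags used in Cell 3 as an argument list. `build_run_command` is a hypothetical helper and the example file names are made up; the flags and the executable path are the ones that appear in the cell above.

```python
import shlex

PLASSEMBLER = "/usr/local/envs/plassemblerENV/bin/plassembler"


def build_run_command(long_fastq, r1_fastq, r2_fastq,
                      out_dir="plassembler_output", db_dir="plassembler_db",
                      chromosome=1000000, threads=2,
                      min_length=1000, min_quality=9,
                      skip_qc=False, raw=False, force=False,
                      pacbio_model="none"):
    """Return the `plassembler run` invocation as a list of arguments."""
    command = [
        PLASSEMBLER, "run",
        "-d", db_dir,
        "-c", str(chromosome),
        "-l", long_fastq,
        "-1", r1_fastq,
        "-2", r2_fastq,
        "-o", out_dir,
        "-t", str(threads),
        "--min_length", str(min_length),
        "--min_quality", str(min_quality),
    ]
    if skip_qc:
        command.append("--skip_qc")
    if raw:
        command.append("-r")
    if force:
        command.append("-f")
    if pacbio_model != "none":
        command += ["--pacbio_model", pacbio_model]
    return command


# Made-up file names, one of them containing spaces on purpose.
command = build_run_command("my long reads.fastq.gz", "R1.fastq.gz", "R2.fastq.gz", force=True)
print(shlex.join(command))  # shell-quoted form, for display only
```

A list built this way can be passed to `subprocess.run(command, check=True)` without `shell=True`, so each file name reaches plassembler as a single argument.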

In [20]:
#@title 4. Plassembler Long (long-reads only)

#@markdown This will probably take a while, as the Colab environment has limited resources. It is best to start it over lunch or overnight.

#@markdown First, upload your long-reads as a single input .fastq or .fastq.gz file

#@markdown Click on the folder icon to the left and use file upload button.

#@markdown Once it is uploaded, write the file name in the LONG_FASTQ field on the right.

#@markdown Then provide a directory for plassembler's output using PLASSEMBLER_OUT_DIR.
#@markdown The default is 'plassembler_output'.

#@markdown Then provide an estimated chromosome size (as an integer) using CHROMOSOME.
#@markdown The default is 1000000.

#@markdown You can also provide a min_length for QC filtering the long read data with MIN_LENGTH.
#@markdown If you provide nothing it will default to 1000.

#@markdown You can also provide a min_quality for QC filtering the long read data with MIN_QUALITY.
#@markdown If you provide nothing it will default to 9.

#@markdown You can click SKIP_QC to turn off qc (filtlong).
#@markdown By default it is False.

#@markdown You can click RAW to pass --nano-raw to Flye. This is designed for Guppy fast
#@markdown configuration reads. By default, Flye will assume
#@markdown SUP or HAC reads and use --nano-hq.

#@markdown If you have PacBio reads, please change PACBIO_MODEL
#@markdown from 'none' to one of 'pacbio-hifi', 'pacbio-corr', 'pacbio-raw'. Use pacbio-raw for
#@markdown regular PacBio CLR reads (<20 percent error),
#@markdown pacbio-corr for PacBio reads that were corrected with other methods (<3 percent error), or
#@markdown pacbio-hifi for PacBio HiFi reads (<1 percent error).

#@markdown You can click FORCE to overwrite the output directory.
#@markdown This may be useful if your earlier plassembler run has crashed for whatever reason.

#@markdown The results of `plassembler long` will appear in the file browser (the folder icon on the left-hand panel).
#@markdown Additionally, the output directory will be zipped so you can download it in one go.

#@markdown The file to download is PLASSEMBLER_OUT_DIR.zip, where PLASSEMBLER_OUT_DIR is the name you provided.

#@markdown If you do not see the output directory,
#@markdown refresh the window by either clicking the folder with the refresh icon below "Files"
#@markdown or double click and select "Refresh".

%%bash
source activate /usr/local/envs/plassemblerENV
python
import os
import sys
import subprocess
import zipfile
LONG_FASTQ = '' #@param {type:"string"}
THREADS = "2"

if os.path.exists(LONG_FASTQ):
    print(f"Input file {LONG_FASTQ} exists")
else:
    print(f"Error: File {LONG_FASTQ} does not exist")
    print(f"Please check the spelling and that you have uploaded it correctly")
    sys.exit(1)


PLASSEMBLER_OUT_DIR = 'plassembler_output'  #@param {type:"string"}
CHROMOSOME = 1000000  #@param {type:"integer"}
MIN_LENGTH = 1000  #@param {type:"integer"}
MIN_QUALITY = 9  #@param {type:"integer"}
SKIP_QC = False  #@param {type:"boolean"}
RAW = False  #@param {type:"boolean"}
PACBIO_MODEL = 'none'  #@param {type:"string"}
allowed_pacbio_models = ['none', 'pacbio-hifi', 'pacbio-corr', 'pacbio-raw']
# Check that PACBIO_MODEL is one of the supported values
if PACBIO_MODEL.lower() not in allowed_pacbio_models:
    raise ValueError("Invalid PACBIO_MODEL. Please choose from: 'none', 'pacbio-hifi', 'pacbio-corr', 'pacbio-raw'.")

FORCE = True  #@param {type:"boolean"}

# Construct the command
command = f"/usr/local/envs/plassemblerENV/bin/plassembler long -d plassembler_db -c {CHROMOSOME} -l {LONG_FASTQ} -o {PLASSEMBLER_OUT_DIR} -t {THREADS} --min_length {MIN_LENGTH} --min_quality {MIN_QUALITY}"

# Append the optional flags selected in the form
if SKIP_QC:
  command = f"{command} --skip_qc"

if RAW:
  command = f"{command} -r"

if FORCE:
  command = f"{command} -f"

if PACBIO_MODEL != 'none':
  command = f"{command} --pacbio_model {PACBIO_MODEL}"

# Execute the command
try:
    print("Running plassembler long")
    subprocess.run(command, shell=True, check=True)
    print("plassembler long completed successfully.")
    print(f"Your output is in {PLASSEMBLER_OUT_DIR}.")
    print(f"Zipping the output directory so you can download it all in one go.")

    zip_filename = f"{PLASSEMBLER_OUT_DIR}.zip"

    # Zip the contents of the output directory
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(PLASSEMBLER_OUT_DIR):
            for file in files:
                zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), PLASSEMBLER_OUT_DIR))
    print(f"Output directory has been zipped to {zip_filename}")


except subprocess.CalledProcessError as e:
    print(f"Error occurred: {e}")



Input file input_fastq.gz exists
Running plassembler long
plassembler long completed successfully.
Your output is in plassembler_output.
Zipping the output directory so you can download it all in one go.
Output directory has been zipped to plassembler_output.zip


2025-01-16 06:22:17.399 | INFO     | plassembler:begin_plassembler:100 - You are using Plassembler version 1.6.2
2025-01-16 06:22:17.400 | INFO     | plassembler:begin_plassembler:101 - Repository homepage is https://github.com/gbouras13/plassembler
2025-01-16 06:22:17.400 | INFO     | plassembler:begin_plassembler:102 - Written by George Bouras: george.bouras@adelaide.edu.au
2025-01-16 06:22:17.400 | INFO     | plassembler:long:1294 - Database directory is plassembler_db
2025-01-16 06:22:17.400 | INFO     | plassembler:long:1295 - Longreads file is input_fastq.gz
2025-01-16 06:22:17.400 | INFO     | plassembler:long:1296 - Chromosome length threshold is 50000
2025-01-16 06:22:17.401 | INFO     | plassembler:long:1297 - Output directory is plassembler_output
2025-01-16 06:22:17.401 | INFO     | plassembler:long:1298 - Min long read length is 1000
2025-01-16 06:22:17.401 | INFO     | plassembler:long:1299 - Min long read quality is 9
2025-01-16 06:22:17.401 | INFO     | plassembler:long
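
Cells 3 and 4 zip the output directory by walking it with `os.walk` and `zipfile`. A shorter route to a similar archive is `shutil.make_archive` from the standard library. The sketch below is an alternative under that assumption, not what the cells above do: `zip_output_dir` is a hypothetical helper, and the demonstration runs on a throwaway directory with a placeholder file so that it does not depend on a finished plassembler run.

```python
import os
import shutil
import tempfile
import zipfile


def zip_output_dir(out_dir):
    """Zip the contents of out_dir into <out_dir>.zip and return the archive path."""
    out_dir = os.path.abspath(out_dir)
    # Paths inside the archive are stored relative to root_dir.
    return shutil.make_archive(base_name=out_dir, format="zip", root_dir=out_dir)


# Demonstration on a temporary directory holding one placeholder file.
work_dir = tempfile.mkdtemp()
demo_out = os.path.join(work_dir, "plassembler_output")
os.makedirs(demo_out)
with open(os.path.join(demo_out, "placeholder.txt"), "w") as handle:
    handle.write("example\n")

archive_path = zip_output_dir(demo_out)
with zipfile.ZipFile(archive_path) as archive:
    print(archive.namelist())
```

As in the cells above, the archive is written next to the output directory, with the directory name plus a `.zip` extension.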