# Example usage of HyAsP

In this notebook, we show how to use HyAsP to identify plasmids, starting from either an assembly or FASTQ reads.

We provide the commands for each step twice: once for running HyAsP through a Singularity container and once for running it directly. The commands can be executed from this notebook, but they run rather slowly this way.


The following example uses the plasmids [CP004026](https://www.ncbi.nlm.nih.gov/nuccore/CP004026.1/) and [CP004028](https://www.ncbi.nlm.nih.gov/nuccore/CP004028.1/) (more precisely, their genes) and the reads [SRR1029665](https://www.ncbi.nlm.nih.gov/sra/SRR1029665) (from which the plasmids were derived).

In [12]:
example_dir = 'example_files'

## Using HyAsP through Singularity

Here, we will use HyAsP (or the pipeline) through [Singularity](https://www.sylabs.io/) containers.

We assume that Singularity is available through the command `singularity` and that the containers have been built from the definition files in `singularity/` as described in the main README of the repository.
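
Before running the cells below, it can be useful to confirm that the `singularity` command is actually visible from the notebook's environment. The following is a minimal sketch using only the Python standard library; the helper name `tool_version` is our own and not part of HyAsP.

```python
import shutil
import subprocess

def tool_version(name, flag='--version'):
    """Return the version string printed by `name flag`, or None if `name` is not on PATH."""
    executable = shutil.which(name)
    if executable is None:
        return None
    result = subprocess.run([executable, flag], capture_output=True, text=True)
    return result.stdout.strip()

version = tool_version('singularity')
print('singularity not found on PATH' if version is None else version)
```

If nothing is found, the `%%bash` cells in this section will fail in the same way, so it is worth fixing the environment first.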

We provide two containers: `hyasp_only.img` contains only HyAsP (and its direct dependencies) but not the pipeline, while `hyasp_pipeline.img` contains HyAsP and all the dependencies necessary to run the pipeline.

In [None]:
# adapt to your system
hyasp_only_path = '/path/to/hyasp_only.img'
hyasp_pipeline_path = '/path/to/hyasp_pipeline.img'

### HyAsP (from assembly to plasmids)

In the following commands, we use the HyAsP-only container. Alternatively, the commands can be executed with the pipeline container by replacing `singularity run hyasp_only.img` with `singularity run --app hyasp hyasp_pipeline.img`.
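
The substitution described above can be sketched as a small helper that builds the command line for either container. The function name `hyasp_command` and its `pipeline_image` parameter are hypothetical and only illustrate the two invocation forms mentioned in the text.

```python
import shlex

def hyasp_command(image_path, *args, pipeline_image=False):
    """Build the argument list for calling HyAsP through a Singularity container."""
    prefix = ['singularity', 'run']
    if pipeline_image:
        # The pipeline container exposes HyAsP as the app 'hyasp'.
        prefix += ['--app', 'hyasp']
    return prefix + [image_path] + list(args)

# Same HyAsP call, once per container type.
print(shlex.join(hyasp_command('hyasp_only.img', 'filter', 'genes.fasta', 'gcm.csv', 'filtered_gcm.csv')))
print(shlex.join(hyasp_command('hyasp_pipeline.img', 'filter', 'genes.fasta', 'gcm.csv', 'filtered_gcm.csv',
                               pipeline_image=True)))
```

The returned list can be passed to `subprocess.run` as an alternative to the `%%bash` cells used in this notebook.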

***1) Create a gene database from a collection of plasmids***

Here, we collect the genes from two plasmids given by their accession numbers and store them in `genes.fasta`.

In [None]:
%%bash -s "$example_dir" "$hyasp_only_path"
EXAMPLE_DIR=$1
HYASP_PATH=$2

cd ${EXAMPLE_DIR}
singularity run ${HYASP_PATH} create genes.fasta -a CP004026,CP004028 -c

The newly created `genes.fasta` should be equal to the provided `_genes.fasta`.
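
One way to check this claim is a byte-by-byte comparison of the two files. The sketch below uses the standard-library `filecmp` module; the helper `files_identical` is our own name, and the paths assume the `example_files` directory used throughout this notebook. Since the genes are retrieved by accession number, a mismatch could also mean that the underlying records have changed since the reference file was created.

```python
import filecmp
from pathlib import Path

def files_identical(path_a, path_b):
    """Return True if both files exist and have identical contents."""
    path_a, path_b = Path(path_a), Path(path_b)
    if not (path_a.is_file() and path_b.is_file()):
        return False
    # shallow=False forces a content comparison instead of relying on os.stat() data.
    return filecmp.cmp(path_a, path_b, shallow=False)

example_dir = Path('example_files')
print(files_identical(example_dir / 'genes.fasta', example_dir / '_genes.fasta'))
```

The same helper applies to the other single-file outputs of the following steps.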

***2) Map a collection of genes to the contigs of an assembly***

Next, we map the collected genes (in `genes.fasta`) to the contigs of an assembly (provided in `_assembly.gfa`) and store the resulting gene-contig mapping in `gcm.csv`.

Requires *BLAST+* (v2.6.0; makeblastdb, blastn).

In [None]:
%%bash -s "$example_dir" "$hyasp_only_path"
EXAMPLE_DIR=$1
HYASP_PATH=$2

cd ${EXAMPLE_DIR}
singularity run ${HYASP_PATH} map genes.fasta gcm.csv -g _assembly.gfa -c

The newly created `gcm.csv` should be equal to the provided `_gcm.csv`.
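
To get a feeling for the size of the two inputs of the mapping step, one can count the genes in the FASTA file and the contigs in the assembly graph. This is a rough sketch that relies only on general file-format conventions (FASTA headers start with `>`, GFA segment lines start with `S` followed by a tab); the helper names are our own.

```python
from pathlib import Path

def count_fasta_records(path):
    """Count FASTA records, i.e. header lines starting with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith('>'))

def count_gfa_segments(path):
    """Count segment ('S') lines of a GFA file; each one describes a contig."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith('S\t'))

example_dir = Path('example_files')
genes, assembly = example_dir / 'genes.fasta', example_dir / '_assembly.gfa'
if genes.is_file() and assembly.is_file():
    print(count_fasta_records(genes), 'genes;', count_gfa_segments(assembly), 'contigs')
```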

***3) Filter a gene-contig mapping***

Now, we remove short and low-quality hits from the gene-contig mapping (in `gcm.csv`, supported by the gene collection in `genes.fasta`) and store the filtered mapping in `filtered_gcm.csv`.

In [None]:
%%bash -s "$example_dir" "$hyasp_only_path"
EXAMPLE_DIR=$1
HYASP_PATH=$2

cd ${EXAMPLE_DIR}
singularity run ${HYASP_PATH} filter genes.fasta gcm.csv filtered_gcm.csv

The newly created `filtered_gcm.csv` should be equal to the provided `_filtered_gcm.csv`.
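
To see how many hits the filter removed, the two mappings can be compared by size. The sketch below assumes that each non-empty line of a mapping file describes exactly one hit; this is an assumption about the file layout, and the helper names are hypothetical.

```python
from pathlib import Path

def count_hits(path):
    """Count non-empty lines, assuming one hit per line."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())

def removed_hits(unfiltered, filtered):
    """Return how many hits are missing from the filtered mapping."""
    return count_hits(unfiltered) - count_hits(filtered)

example_dir = Path('example_files')
gcm, filtered = example_dir / 'gcm.csv', example_dir / 'filtered_gcm.csv'
if gcm.is_file() and filtered.is_file():
    print(removed_hits(gcm, filtered), 'hits removed by filtering')
```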

***4) Find plasmids in an assembly graph***

Finally, we use the gene collection (`genes.fasta`) and the filtered gene-contig mapping (`filtered_gcm.csv`) to find plasmids in the assembly graph (`_assembly.gfa`) and store the results in the directory `hyasp_outputs`.

In [None]:
%%bash -s "$example_dir" "$hyasp_only_path"
EXAMPLE_DIR=$1
HYASP_PATH=$2

cd ${EXAMPLE_DIR}
mkdir -p hyasp_outputs
singularity run ${HYASP_PATH} find _assembly.gfa genes.fasta filtered_gcm.csv hyasp_outputs

The newly created output files in `hyasp_outputs` should be equal to the provided `_hyasp_outputs`.
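
Because this step produces a whole directory, a recursive comparison is more convenient than checking files one by one. The following sketch uses `filecmp.dircmp` from the standard library; the helper `differing_files` is our own. Note that `dircmp` treats files with identical `os.stat()` signatures as equal without reading them, and that files containing timestamps (for example logs) may legitimately differ.

```python
import filecmp
from pathlib import Path

def differing_files(dir_a, dir_b):
    """Return sorted relative paths that differ or exist in only one directory."""
    problems = []

    def walk(comparison, prefix):
        names = (comparison.left_only + comparison.right_only + comparison.diff_files
                 + comparison.funny_files + comparison.common_funny)
        for name in names:
            problems.append(str(prefix / name))
        # Descend into subdirectories present on both sides.
        for name, sub in comparison.subdirs.items():
            walk(sub, prefix / name)

    walk(filecmp.dircmp(dir_a, dir_b), Path('.'))
    return sorted(problems)

out, ref = Path('example_files/hyasp_outputs'), Path('example_files/_hyasp_outputs')
if out.is_dir() and ref.is_dir():
    print(differing_files(out, ref) or 'directories match')
```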

### Pipeline (from FASTQ reads to plasmids)

Here, we find plasmids in an assembly (to be created from the provided files `_SRR1029665_1.fastq` and `_SRR1029665_2.fastq`) using the genes provided in `_genes.fasta`.

Requires *Pilon* (v1.22), *Trim Galore* (v0.4.5_dev), *SPAdes* (v3.12.0) and *Unicycler* (v0.4.5). 

In [None]:
%%bash -s "$example_dir" "$hyasp_pipeline_path"
EXAMPLE_DIR=$1
HYASP_PATH=$2

cd ${EXAMPLE_DIR}
mkdir -p pipeline_outputs
singularity run ${HYASP_PATH} pipeline_outputs _genes.fasta -1 _SRR1029665_1.fastq -2 _SRR1029665_2.fastq

The outputs in `pipeline_outputs` (the outputs of HyAsP are stored in `pipeline_outputs/plasmids/greedy/`) should be equal to the provided ones in `_pipeline_outputs`.

## Using HyAsP directly

Here, we will use HyAsP (or the pipeline) directly instead of running them through a Singularity container.

We assume that HyAsP has been installed, i.e. the commands `hyasp` and `hyasp_pipeline` are available. Alternatively, enter the directory containing the scripts `hyasp.py` and `fastq_to_plasmids.py` and replace `hyasp` (resp. `hyasp_pipeline`) with `python hyasp.py` (resp. `python fastq_to_plasmids.py`) in the following commands.

We further assume that the dependencies of HyAsP (or the pipeline) are satisfied. It might be necessary to specify the paths to the executables of the corresponding dependencies in the commands. Each command below lists its dependencies (if any).
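
A quick way to see which dependencies are reachable from the notebook's environment is to look them up on `PATH`. The sketch below checks the two BLAST+ programs named in this notebook (`makeblastdb`, `blastn`); the helper `missing_tools` is our own, and the executable names of the pipeline dependencies are not checked here because they may differ between installations.

```python
import shutil

def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]

missing = missing_tools(['makeblastdb', 'blastn'])
print('missing: ' + ', '.join(missing) if missing else 'all listed tools found')
```

Any tool reported as missing would have to be installed or passed to the corresponding command by its explicit path.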

### HyAsP (from assembly to plasmids)

***1) Create a gene database from a collection of plasmids***

Here, we collect the genes from two plasmids given by their accession numbers and store them in `genes.fasta`.

In [None]:
%%bash -s "$example_dir"
EXAMPLE_DIR=$1

cd ${EXAMPLE_DIR}
hyasp create genes.fasta -a CP004026,CP004028 -c

The newly created `genes.fasta` should be equal to the provided `_genes.fasta`.

***2) Map a collection of genes to the contigs of an assembly***

Next, we map the collected genes (in `genes.fasta`) to the contigs of an assembly (provided in `_assembly.gfa`) and store the resulting gene-contig mapping in `gcm.csv`.

Requires *BLAST+* (v2.6.0; makeblastdb, blastn).

In [None]:
%%bash -s "$example_dir"
EXAMPLE_DIR=$1

cd ${EXAMPLE_DIR}
hyasp map genes.fasta gcm.csv -g _assembly.gfa -c

The newly created `gcm.csv` should be equal to the provided `_gcm.csv`.

***3) Filter a gene-contig mapping***

Now, we remove short and low-quality hits from the gene-contig mapping (in `gcm.csv`, supported by the gene collection in `genes.fasta`) and store the filtered mapping in `filtered_gcm.csv`.

In [None]:
%%bash -s "$example_dir"
EXAMPLE_DIR=$1

cd ${EXAMPLE_DIR}
hyasp filter genes.fasta gcm.csv filtered_gcm.csv

The newly created `filtered_gcm.csv` should be equal to the provided `_filtered_gcm.csv`.

***4) Find plasmids in an assembly graph***

Finally, we use the gene collection (`genes.fasta`) and the filtered gene-contig mapping (`filtered_gcm.csv`) to find plasmids in the assembly graph (`_assembly.gfa`) and store the results in the directory `hyasp_outputs`.

In [None]:
%%bash -s "$example_dir"
EXAMPLE_DIR=$1

cd ${EXAMPLE_DIR}
mkdir -p hyasp_outputs
hyasp find _assembly.gfa genes.fasta filtered_gcm.csv hyasp_outputs

The newly created output files in `hyasp_outputs` should be equal to the provided `_hyasp_outputs`.

### Pipeline (from FASTQ reads to plasmids)

Here, we find plasmids in an assembly (to be created from the provided files `_SRR1029665_1.fastq` and `_SRR1029665_2.fastq`) using the genes provided in `_genes.fasta`.

Requires *Pilon* (v1.22), *Trim Galore* (v0.4.5_dev), *SPAdes* (v3.12.0) and *Unicycler* (v0.4.5). 

In [None]:
%%bash -s "$example_dir"
EXAMPLE_DIR=$1

cd ${EXAMPLE_DIR}
mkdir -p pipeline_outputs
hyasp_pipeline pipeline_outputs _genes.fasta -1 _SRR1029665_1.fastq -2 _SRR1029665_2.fastq

The outputs in `pipeline_outputs` (the outputs of HyAsP are stored in `pipeline_outputs/plasmids/greedy/`) should be equal to the provided ones in `_pipeline_outputs`.