2023_QuinonesOlvera-Owen

About

Code corresponding to the paper:

Diverse and abundant viruses exploit conjugative plasmids (2023)

[bioRxiv]

Natalia Quinones-Olvera*, Siân V. Owen*, Lucy M. McCully, Maximillian G. Marin, Eleanor A. Rand, Alice C. Fan, Oluremi J. Martins Dosumu, Kay Paul, Cleotilde E. Sanchez Castaño, Rachel Petherbridge, Jillian S. Paull, Michael Baym

Genomes

Code
- Snakemake pipeline for genome assembly and annotation
Data
- SRA BioProject: PRJNA954020
  - Sequencing runs: SRR24145707 - SRR24145772
- Metadata phage_metadata.tsv
- Assemblies genomes/data/assemblies_oriented
- Genbank annotations genomes/data/annotation/_gbks/

Figure 1

1a, 1c: Phage DisCo

Code
- Jupyter notebook with image processing: Fig1.ipynb
Data
- Image files: Fig1/data

Figure 2

2a: Tree

Code

Jupyter notebook: Whole genome alignment and tree building

Key commands:

# whole genome alignment
clustalo -i <alphatv.fasta> -o <alphatv.msa.fasta> --outfmt=fa

# tree building
iqtree -st DNA -m MFP -bb 1000 -alrt 1000 -s <alphatv.msa.trim.fasta>

Data
- Genomes used (unaligned fasta): alphatv.fasta
- Whole-genome alignment (trimmed aligned fasta): alphatv.msa.trim.fasta
- Newick tree file (iqtree output): alphatv.msa.trim.fasta.treefile

2b: Map

Code
- Jupyter notebook: Map figure
Data
- Coordinates and references for map: coordinates.tsv

2c: Nucleotide diversity

Code

Snakemake: Pipeline producing the alignments and nucleotide diversity calculation.

Key commands:

# align each assembly to reference
minimap2 -ax asm20 -B2 -O6,26 --end-bonus 100 --cs <NC_001421.fasta> <assembly> > <output.sam>

# calculate nucleotide diversity
vcftools --vcf <merged_vcf> --window-pi 100 --window-pi-step 1 --out <NucDiv.100bp.slideby1.windowed.pi>

Jupyter notebook: Plot, heatmap, and genome map.

Data
- Nucleotide diversity values for sliding window size 100 bp: NucDiv.100bp.slideby1.windowed.pi
- PRD1 reference annotation (curated version of NC_001421): PRD1_updated.gb
- Assemblies: genomes/data/assemblies_oriented
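The `--window-pi 100 --window-pi-step 1` options above ask vcftools for nucleotide diversity in 100 bp windows that slide by 1 bp. As a rough illustration of what that quantity measures, the sketch below computes the average pairwise difference per site directly from a set of aligned sequences. It is not the pipeline in this repository and not how vcftools works internally (vcftools operates on variant calls). The function names and the toy sequences are made up for the example, and gaps or ambiguous bases are treated like any other character.

```python
from itertools import combinations


def pairwise_pi(seqs):
    """Mean proportion of differing sites over all pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    length = len(seqs[0])
    # Count mismatching positions for every pair, then average per pair and per site.
    total_diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total_diffs / (len(pairs) * length)


def windowed_pi(seqs, window=100, step=1):
    """Return (start, end, pi) for each sliding window, using 1-based coordinates."""
    length = len(seqs[0])
    results = []
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        results.append((start + 1, start + window, pairwise_pi(chunk)))
    return results


# Toy alignment (made up), scanned with a 2 bp window sliding by 1 bp.
toy = ["AAAA", "AAAT", "AATT"]
for start, end, pi in windowed_pi(toy, window=2, step=1):
    print(start, end, pi)
```

On real data the window and step would be 100 and 1, matching the vcftools call.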

Figure 3

3b,c: Host-range heatmap

Code
- Jupyter notebook: Processing growth curves, calculating area under the curve and liquid assay score, producing heatmap.
- Jupyter notebook: Plotting sample curves and heatmap.
- Custom functions imported in notebooks: EOL_tools.py
  - Area under the curve calculation (line): $auc = \sum_{i=1}^{120}\frac{OD_{i+1} + OD_{i}}{2}$
  - Liquid assay score calculation (line): $las = \frac{(auc_{no\ phage} - auc_{phage})}{auc_{no\ phage}} \times 100$
Data
- Raw growth curve data: all_growthcurves.tsv
- Liquid assay score values: all_liquidasssayscores.tsv
- Phage tree: (see Figure 2)
- Strain 16S alignment: 16S.afa
- Strain tree: 16S.tree
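The two formulas listed under EOL_tools.py can be written out as a minimal Python sketch. This is not the code in EOL_tools.py: the function names and the toy OD readings are assumptions made for illustration. The area under the curve is the trapezoid sum with unit spacing between consecutive readings, so with 121 readings it matches the sum from i = 1 to 120 above.

```python
def area_under_curve(od):
    """Trapezoid sum with unit spacing: sum of (OD[i+1] + OD[i]) / 2."""
    return sum((od[i + 1] + od[i]) / 2 for i in range(len(od) - 1))


def liquid_assay_score(auc_no_phage, auc_phage):
    """Percent reduction in growth (area under the curve) caused by the phage."""
    return (auc_no_phage - auc_phage) / auc_no_phage * 100


# Toy OD readings (made up): the phage halves the growth of the culture.
no_phage = [0.0, 1.0, 2.0]
with_phage = [0.0, 0.5, 1.0]

score = liquid_assay_score(area_under_curve(no_phage), area_under_curve(with_phage))
print(score)  # 50.0
```

A score of 0 means the phage had no effect on growth, and a score near 100 means growth was almost completely suppressed.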

Figure 4

4a: Abundance

Code
- Plot abundance abundance.ipynb
Data
- Raw counts counts.tsv

4b: Genome maps of uncultivated tectiviruses

Data
- NCBI RefSeq/Genbank tectiviruses: NCBI/tectivirus_metadata.tsv
- JGI IMG/VR matches: JGI_IMGVR/JGI_metadata.tsv
- From Yutin et al. (2018) (paper): Yutin/yutin_metadata.tsv
- Genbank files of shown genomes: figures/Fig4/data/tecti_genomes/gb
- HMM models used for color annotations: Fig4/data/models/hmm

4c: Tree of the DNA packaging ATPase of tectiviruses

Code

Jupyter notebook: Build ATPase tree

Key commands:

# align ATPase sequences from all tectiviruses with ATPase hmm model
hmmalign --trim <IX.2.hmm> <P9.faa> | esl-reformat --gapsym='-' afa - > <P9.afa>

# build tree 
phyml -d aa -m LG -b -4 -v 0.0 -c 4 -a e -f e --no_memory_check -i <P9.phy>

Data
- ATPase hmm model: IX.2.hmm
- ATPase sequences used (unaligned fasta): P9.faa
- ATPase alignment (aligned fasta): P9.afa
- Newick tree: P9.phy_phyml_tree
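The key commands above write an aligned FASTA file (P9.afa), while phyml reads a PHYLIP file (P9.phy). The conversion between the two is not shown here and is presumably done in the notebook. As a hedged sketch of one way to do it, the functions below convert aligned FASTA text to sequential PHYLIP using only the standard library. The function names and the toy alignment are assumptions for illustration, not the authors' code.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records):
    """Format aligned records as sequential PHYLIP text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences are not aligned to the same length")
    width = max(len(name) for name, _ in records)
    lines = [f"{len(records)} {lengths.pop()}"]
    # Pad names to a common width and separate them from the sequence with spaces.
    lines += [f"{name.ljust(width)}  {seq}" for name, seq in records]
    return "\n".join(lines) + "\n"


# Toy alignment (made up) to show the output layout.
toy_afa = ">a\nAC-\nG\n>bb\nACTG\n"
print(to_phylip(read_fasta(toy_afa)))
```

In practice the input would be the contents of P9.afa and the output would be written to P9.phy.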

4d: Alphatectivirus metagenomic reads in wastewater datasets

Code
- Jupyter notebook: Build kraken database with viral database + tectiviruses from this study.
- Snakemake pipeline: To run kraken on metagenomic datasets.
  - Key commands:
```
kraken2 --paired --report <kraken_report> --db <custom_db> <fastq_1> <fastq_2> > <kraken_results>
```
- Jupyter notebook: Extract kraken results and produce plot.
Data
- Kraken results summary results.tsv
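To illustrate the kind of extraction the last notebook performs, here is a minimal sketch that pulls the number of reads assigned to one taxon out of a kraken2 report. It assumes the standard six-column, tab-separated report layout (percentage, clade reads, direct reads, rank code, taxid, indented name). The function name, the example report lines, the taxid 999999 and the read counts are all made up, and the exact taxon label in the custom database may differ.

```python
def clade_reads(report_lines, taxon_name):
    """Return the clade-level read count for taxon_name, or 0 if it is absent."""
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip malformed or empty lines
        # Column 6 holds the taxon name, indented by taxonomic depth.
        if fields[5].strip() == taxon_name:
            return int(fields[1])  # column 2: reads in the clade rooted here
    return 0


# Made-up report lines in the assumed six-column layout.
example_report = [
    " 99.90\t9990\t9990\tU\t0\tunclassified",
    "  0.10\t10\t0\tR\t1\troot",
    "  0.10\t10\t2\tG\t999999\t      Alphatectivirus",
]

print(clade_reads(example_report, "Alphatectivirus"))  # 10
```

Running this over the report of each dataset would give one count per sample, which could then be collected into a summary table.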

4e: Mapped alphatectivirus metagenomic reads

Code
- Jupyter notebook: Align metagenomic reads to reference PRD1 genome
Data
- SRA BioProject: PRJNA954020
  - Runs: SRR24211943 - SRR24211944
- Metagenomic reads classified as alphatectivirus
  - all_reads_r1.fastq
  - all_reads_r2.fastq
- Mapped reads
  - mm.p3.sam
  - mm.p2.sam

Figure 5

5a, b, c: Trees

Code
- Jupyter notebook: Produce trees
Data
- Genomes used (unaligned fasta)
  - Emesvirus emesvirus.fasta
  - Qubevirus qubevirus.fasta
  - Inovirus inovirus.fasta
- Alignments (trimmed aligned fasta)
  - Emesvirus emesvirus.trim.afa
  - Qubevirus qubevirus.trim.afa
  - Inovirus inovirus.trim.afa
- Newick trees
  - Emesvirus emesvirus.tree
  - Qubevirus qubevirus.tree
  - Inovirus inovirus.tree

5e: FtMidnight genome map

Code
- Jupyter notebook: Produce genome map graphic.
Data
- FtMidnight genbank file FtMidnight.rotated.gb

How to replicate these figures

Notebooks

Everything in the notebooks should be able to run after installing this conda environment.

conda env create -f envs/pdep.yml

I tried to include all the raw files in this repository, except for large files such as sequencing runs, which can be accessed through the SRA (see the accessions listed in the relevant section). Some intermediate files might also be absent, but everything should be obtainable by running the code in the notebooks.

Snakemake pipelines

The snakemake pipelines should also run from the same conda environment. Additional dependencies of each pipeline are included in the envs/ directory, next to the corresponding Snakefile, and are handled by snakemake. I've included a run_snakemake.sh and a run_snakemake.loc.sh file for each pipeline, which show how to run it on a computer cluster or locally (respectively).

Questions?

If you have trouble finding or running anything shown here, please get in touch. You can submit an issue or send me an email: nquinones@g.harvard.edu


baymlab/2023_QuinonesOlvera-Owen
