# Context

[Video tutorial](https://youtu.be/XhhzJDdsQG4?si=tHnWNAfxiKlOQ3B3)

The sample is the genome of an [endophyte](https://en.wikipedia.org/wiki/Endophyte): an organism, often a fungus or a bacterium, that lives inside a plant for at least part of its life cycle without causing apparent disease.

# 1. Data download

The reads are downloaded from the SRA (Sequence Read Archive), which is maintained by the NCBI.

We will download run `SRR9321164` from the SRA (see its [record](https://www.ncbi.nlm.nih.gov/sra/?term=SRR9321164)).

In [None]:
!prefetch SRR9321164

# 2. Data splitting

In [None]:
!fastq-dump --split-files SRR9321164.sra

This writes the forward (1) and reverse (2) reads to separate files: `SRR9321164_1.fastq` and `SRR9321164_2.fastq`. The commands in the following sections expect these files under `rawReads/`.
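
A quick sanity check after splitting is to confirm that both files hold the same number of reads. The sketch below is not part of the original workflow; the helper names `count_fastq_records` and `check_pairs` are hypothetical, and it assumes uncompressed FASTQ with exactly four lines per record and no trailing blank lines.

```python
from pathlib import Path


def count_fastq_records(path):
    """Count records in an uncompressed FASTQ file (assumes 4 lines per record)."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def check_pairs(forward, reverse):
    """Return the shared read count, or raise if the two files disagree."""
    n_fwd = count_fastq_records(forward)
    n_rev = count_fastq_records(reverse)
    if n_fwd != n_rev:
        raise ValueError(f"read counts differ: {n_fwd} forward vs {n_rev} reverse")
    return n_fwd


if __name__ == "__main__":
    # File names as produced by the fastq-dump step; adjust the directory if needed.
    fwd, rev = Path("SRR9321164_1.fastq"), Path("SRR9321164_2.fastq")
    if fwd.exists() and rev.exists():
        print(check_pairs(fwd, rev), "read pairs")
```

A mismatch would suggest an interrupted download or split, in which case rerunning the two steps above is the simplest fix.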

# 3. Data QC (Quality Control)

Uses [`fastqc`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and [`multiqc`](https://multiqc.info/).

`fastqc` generates an individual report for each input FASTQ file.

`multiqc` combines the individual `fastqc` reports into a single report.

In [None]:
!fastqc -o rawReads/ -t 2 rawReads/SRR9321164*.fastq

This command runs quality-control checks on the forward and reverse reads. It uses 2 threads (`-t 2`), takes the `fastq` files for both read directions as input, and writes the reports to `rawReads/` (`-o`).
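
The notebook describes `multiqc` but only shows the `fastqc` call. As a minimal sketch, assuming the `fastqc` reports were written to `rawReads/` as above, the combined report could be produced by pointing `multiqc` at that directory. The helper `multiqc_command` is hypothetical, and writing the combined report back into `rawReads/` is an assumption, not something the original specifies.

```python
import shutil
import subprocess
from pathlib import Path


def multiqc_command(report_dir="rawReads/", out_dir="rawReads/"):
    """Build a MultiQC call that scans report_dir and writes its report to out_dir."""
    return ["multiqc", report_dir, "-o", out_dir]


if __name__ == "__main__":
    cmd = multiqc_command()
    # Only run when MultiQC is installed and the FastQC output directory exists.
    if shutil.which("multiqc") and Path("rawReads/").is_dir():
        subprocess.run(cmd, check=True)
    else:
        print("skipping, would run:", " ".join(cmd))
```

In a notebook cell the same call can be written as a `!multiqc ...` shell escape, like the other commands here.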

# 4. Read Trimming

Uses [`trimmomatic`](http://www.usadellab.org/cms/index.php?page=trimmomatic), [version 0.39](https://github.com/usadellab/Trimmomatic/releases/tag/v0.39). Requires Java.

In [None]:
!java -jar /home/sanjuan/miniconda3/envs/genome-assembly-tutorial/share/Trimmomatic-0.39/trimmomatic-0.39.jar PE \
     -threads 8 rawReads/SRR9321164_1.fastq rawReads/SRR9321164_2.fastq \
     -baseout trimmedReads/SRR9321164.fastq \
      ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:8:True HEADCROP:15 SLIDINGWINDOW:4:25
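
With `-baseout trimmedReads/SRR9321164.fastq`, Trimmomatic in `PE` mode writes four files: paired survivors (`_1P`, `_2P`) and reads whose mate was discarded (`_1U`, `_2U`). The Unicycler step in the next section reads a single `trimmedReads/SRR9321164_unpaired.fastq`, which the command above does not create. The sketch below shows one plausible way to build it, by concatenating the two `U` files; this is an assumption, and the original workflow may have produced the file differently. The helper names are hypothetical, and the naming logic only covers a single-extension base name such as the one used here.

```python
import shutil
from pathlib import Path

TAGS = ("1P", "1U", "2P", "2U")


def baseout_names(baseout):
    """Map each tag to the file name derived from a -baseout value like 'dir/name.fastq'."""
    base = Path(baseout)
    return {tag: base.with_name(f"{base.stem}_{tag}{base.suffix}") for tag in TAGS}


def merge_unpaired(baseout, merged):
    """Concatenate the 1U and 2U files into one unpaired FASTQ file."""
    names = baseout_names(baseout)
    with open(merged, "wb") as out:
        for tag in ("1U", "2U"):
            with open(names[tag], "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(merged)


if __name__ == "__main__":
    baseout = "trimmedReads/SRR9321164.fastq"
    names = baseout_names(baseout)
    if names["1U"].exists() and names["2U"].exists():
        print(merge_unpaired(baseout, "trimmedReads/SRR9321164_unpaired.fastq"))
```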

# 5. Genome Assembly

Uses [`unicycler`](https://github.com/rrwick/Unicycler) (an assembly pipeline for bacterial genomes)

Dependencies

* [Spades](https://github.com/ablab/spades)

In [None]:
!unicycler -1 trimmedReads/SRR9321164_1P.fastq \
          -2 trimmedReads/SRR9321164_2P.fastq \
          -s trimmedReads/SRR9321164_unpaired.fastq \
          -o assembly --verbosity 2 --min_fasta_length 200 \
          -t 12 --spades_path spades/bin/spades.py

# 6. Genome Annotation

Uses [`prokka`](https://github.com/tseemann/prokka) (Rapid Prokaryotic Genome Annotation)

In [None]:
!prokka --outdir prokkaResults --genus 'Methylorubrum' --strain 'Q1' --cpus 12 assembly/assembly.fasta
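
By default Prokka names its output files with a date-based prefix, which is why the QUAST command in the next section refers to `PROKKA_09232024.gff`. On a different day the file name will differ. As a minimal sketch (the helper `find_prokka_gff` is hypothetical), the annotation file can be located without hard-coding the date, assuming the output directory holds exactly one `.gff` file:

```python
from pathlib import Path


def find_prokka_gff(outdir="prokkaResults"):
    """Return the single .gff file in a Prokka output directory."""
    matches = sorted(Path(outdir).glob("*.gff"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one .gff in {outdir}, found {len(matches)}"
        )
    return matches[0]


if __name__ == "__main__":
    if Path("prokkaResults").is_dir():
        print(find_prokka_gff())
```

An alternative is to pass Prokka's `--prefix` option so the file name is fixed up front.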

# 7. Genome QA (Quality Assessment)

Uses [`quast`](https://github.com/ablab/quast) (Quality Assessment Tool for Genome Assemblies)

In [None]:
!quast -o quastResults -g prokkaResults/PROKKA_09232024.gff -t 12 \
    -1 trimmedReads/SRR9321164_1P.fastq -2 trimmedReads/SRR9321164_2P.fastq \
    --single trimmedReads/SRR9321164_unpaired.fastq --gene-thresholds 0,1000 assembly/assembly.fasta \
    --glimmer
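
Among the metrics in the QUAST report is N50: the contig length at which contigs of that length or longer cover at least half of the assembly. The sketch below computes contig count, total length, and N50 directly from `assembly/assembly.fasta` to make the definition concrete. It is not QUAST's implementation, the function names are hypothetical, and the numbers may differ from the report because QUAST by default ignores contigs shorter than 500 bp.

```python
from pathlib import Path


def read_contig_lengths(fasta_path):
    """Return the sequence length of every record in a FASTA file."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


if __name__ == "__main__":
    fasta = Path("assembly/assembly.fasta")
    if fasta.exists():
        lengths = read_contig_lengths(fasta)
        print("contigs:", len(lengths))
        print("total length:", sum(lengths))
        print("N50:", n50(lengths))
```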