Skip to content

Methods

Bonita van Waardenburg edited this page May 17, 2023 · 2 revisions

General

This Wiki page provides an overview of the commands and options that are used in this Genome Analysis project. The analysis is performed on the computational cluster of Uppsala University, making use of sbatch scripts to be able to continue running the scripts while not being connected to the cluster.

Header:

#!/bin/bash -l
#SBATCH -A uppmax2023-2-8
#SBATCH -M snowy
#SBATCH -p core
#SBATCH -n 2
#SBATCH -t 01:00:00
#SBATCH -J trimming
#SBATCH -o /domus/h1/bonitavw/GenomeAnalysis/logs/trimming.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user vanwaardenburgbonita@gmail.com

Global variables:

ILLUM_DATA="/domus/h1/bonitavw/GenomeAnalysis/data/raw_data/genomics_data/Illumina"
TRIM_DATA="/domus/h1/bonitavw/GenomeAnalysis/data/trimmed_data/genomics_data/Illumina"
PACBIO_DATA="/domus/h1/bonitavw/GenomeAnalysis/data/raw_data/genomics_data/PacBio"
REFERENCE="/domus/h1/bonitavw/GenomeAnalysis/data/raw_data/GCF_009734005.1_ASM973400v2_genomic.fna.gz"
ANALYSIS_DIR="/domus/h1/bonitavw/GenomeAnalysis/analysis"

Modules that have to be loaded:

module load bioinfo-tools
module load FastQC
module load trimmomatic/0.39
module load spades
module load quast
module load MUMmer/3.23
module load prokka
module load bwa
module load samtools
module load htseq
module load R/4.1.1
module load R_packages/4.1.1 

1. Preprocessing

FastQC

Quality Control

To check the quality of the data, the program 'fastqc' is used both before and after trimming. It uses the basic settings, only adding the '-o' to specify in which directory the output file will be created. Once the program is finished, it has created an HTML file and zipped directory for each fastq file, displaying the quality of the reads.

fastqc -o ${ANALYSIS_DIR}/01_preprocessing/ ${ILLUM_DATA}/*

Trimmomatic

Trimming

Trimmomatic is used to trim the reads to prepare them for downstream analysis. It removes the adapters that were added during the sequencing process and trims the reads based on quality scores. To assess the quality of the reads, a sliding window method is applied. In this case, there was chosen for a sliding window of 4:20, meaning that it checks 4 bases (the window) and cuts when the average quality drops below 20. The option 'PE' specifies that the input data is paired-end.

trimmomatic PE -threads 2 ${ILLUM_DATA}/E745-1.L500_SZAXPI015146-56_1_clean.fq.gz ${ILLUM_DATA}/E745-1.L500_SZAXPI015146-56_2_clean.fq.gz \
              ${TRIM_DATA}/E745-1.L500_SZAXPI015146-56_1P_clean.fq.gz ${TRIM_DATA}/E745-1.L500_SZAXPI015146-56_1U_clean.fq.gz \
             ${TRIM_DATA}/E745-1.L500_SZAXPI015146-56_2P_clean.fq.gz ${TRIM_DATA}/E745-1.L500_SZAXPI015146-56_2U_clean.fq.gz \
           ILLUMINACLIP:/sw/apps/bioinfo/trimmomatic/0.39/rackham/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:30

FastQC

Quality Control

After trimming, the quality control step is repeated to check if the quality has improved.

fastqc -o ${ANALYSIS_DIR}/01_preprocessing/ ${TRIM_DATA}/*

2. Genome Assembly

SPAdes

Illumina data

SPAdes is a genome assembly algorithm designed for bacterial data. It works with Illumina reads and is able to work with single- and paired-end reads. As the sequenced genome of E. faecium E745 is a clinical isolate, the flag '--isolate' is used to improve the running time and quality of the assembly.

spades.py --pe1-1 ${TRIM_DATA}/E745-1.L500_SZAXPI015146-56_1P_clean.fq.gz --pe1-2 ${TRIM_DATA}/E745-1.L500_SZAXPI015146-56_2P_clean.fq.gz \
    --isolate --nanopore ${NANOPORE_DATA}/E745_all.fasta.gz -o ${ANALYSIS_DIR}/02_genome_assembly/spades_output 

Canu

PacBio data

Canu is designed to be used with high-noise single-molecule sequencing techniques, such as PacBio reads. Before performing the genome assembly, it corrects the errors in the long reads, trims them and at last, assembles them.

canu -p pacbio -d ${ANALYSIS_DIR}/02_genome_assembly/canu_output/ genomeSize=2.8m useGrid=true gridOptions="-A uppmax2023-2-8 -t 05:00:00" -pacbio-raw ${PACBIO_DATA}/*.subreads.fastq.gz &>> mylogcanu

3. Genome Evaluation

Quast

Quast can run both with and without a reference genome, but in this case, I chose to work with a reference. As a reference, I used the genome of Enterococcus faecium found on NCBI. I downloaded the genome in fasta format (link: https://www.ncbi.nlm.nih.gov/genome/?term=Enterococcus%20faecium[Organism]&cmd=DetailsSearch). The tool was run on both the contig.fasta files from Canu and Spades, produced in the previous step.

Canu input file:

quast.py ${ANALYSIS_DIR}/02_genome_assembly/canu_output/pacbio.contigs.fasta -r ${REFERENCE} -o ${ANALYSIS_DIR}/03_assembly_evaluation/quast_output_canu/

Spades input file:

quast.py ${ANALYSIS_DIR}/02_genome_assembly/spades_output/contigs.fasta -r ${REFERENCE} -o ${ANALYSIS_DIR}/03_assembly_evaluation/quast_output_spades/

MUMmerplot

Canu input file:

mummer -mum -b ${REFERENCE} ${ANALYSIS_DIR}/02_genome_assembly/canu_output/pacbio.contigs.fasta > ${ANALYSIS_DIR}/03_assembly_evaluation/mummerplot_output_canu/output_canu.mum
mummerplot -Q ${ANALYSIS_DIR}/02_genome_assembly/canu_output/pacbio.contigs.fasta -t png -p ${ANALYSIS_DIR}/03_assembly_evaluation/mummerplot_output_canu/mummerplot_output_canu \
${ANALYSIS_DIR}/03_assembly_evaluation/mummerplot_output_canu/output_canu.mum

SPAdes input file:

mummer -mum -b ${REFERENCE} ${ANALYSIS_DIR}/02_genome_assembly/spades_output/contigs.fasta > ${ANALYSIS_DIR}/03_assembly_evaluation/mummerplot_output_spades/output_spades.mum
mummerplot -Q ${ANALYSIS_DIR}/02_genome_assembly/spades_output/contigs.fasta -t png -p ${ANALYSIS_DIR}/03_assembly_evaluation/mummerplot_output_spades/mummerplot_output_spades \
${ANALYSIS_DIR}/03_assembly_evaluation/mummerplot_output_spades/output_spades.mum

4. Genome Annotation

Prokka

Prokka is a popular annotation tool for bacterial genomes. As input, it uses the assembled genome sequence produced by Canu. It provides both a structural and functional annotation, with a General Feature Format (.gff) annotation file, and FASTA files as a result.

prokka ${ANALYSIS_DIR}/02_genome_assembly/canu_output/pacbio.contigs.fasta --outdir ${ANALYSIS_DIR}/04_genome_annotation/

5. Alignment

BWA

The Burrows-Wheeler Alignment (BWA) tool is used for mapping sequencing reads to a reference genome. After the alignment, the output file is converted into BAM format and sorted. The index file is created to make data retrieval faster and more efficient. This alignment step is repeated multiple times, each time on a different set of paired-end reads.

DIR="Serum"
FILE="ERR1797969"

PASS1="/proj/genomeanalysis2023/Genome_Analysis/1_Zhang_2017/transcriptomics_data/RNA-Seq_${DIR}/trim_paired_${FILE}_pass_1.fastq.gz"
PASS2="/proj/genomeanalysis2023/Genome_Analysis/1_Zhang_2017/transcriptomics_data/RNA-Seq_${DIR}/trim_paired_${FILE}_pass_2.fastq.gz"

bwa mem ${ANALYSIS_DIR}/02_genome_assembly/canu_output/pacbio.contigs.fasta ${PASS1} ${PASS2} | samtools view -b | samtools sort \
-o ${ANALYSIS_DIR}/05_alignment/${DIR}/trim_paired_${FILE}_aligned.bam
samtools index ${ANALYSIS_DIR}/05_alignment/${DIR}/trim_paired_${FILE}_aligned.bam

6. Comparative Genomics

HTSeq

The tool HTSeq is used to count the number of reads that map on the genomic features. It requires the BAM alignment files to be sorted and the features file (.gff) stripped from nucleotide sequences. The features file contains information about genomic features, such as gene locations and annotations. It can be downloaded from NCBI, but in this case the .gff file was produced during annotation.

DIR="BH"
FILE="ERR1797974"

BAM_FILE="/domus/h1/bonitavw/GenomeAnalysis/analysis/05_alignment/${DIR}/trim_paired_${FILE}_aligned.bam"
GFF_FILE="/domus/h1/bonitavw/GenomeAnalysis/analysis/04_genome_annotation/GFF_file.gff"
COUNT_FILE="/domus/h1/bonitavw/GenomeAnalysis/analysis/06_comparative_genomics/htseq_counts_${DIR}_${FILE}.txt"

htseq-count --format=bam --i=locus_tag --type=CDS ${BAM_FILE} ${GFF_FILE} > ${COUNT_FILE}

DESeq2

The sbatch script calls the Rscript 'R_DESeq2_analysis.R', which performs the actual differential expression analysis. To handle the libraries and output files, the option -e is applied. In total, the R script generates 5 files:

  • topGenes.txt
  • upreg_genes.txt
  • downreg_genes.txt
  • MAplot.png
  • heatmap.png

The sbatch script:

Rscript -e 'library("DESeq2"); sink("/domus/h1/bonitavw/GenomeAnalysis/analysis/06_comparative_genomics/topGenes.txt"); source("R_DESeq2_analysis.R")'

Clone this wiki locally