This repo contains shell and python scripts used to assemble raw fastq sequencing files into full concatenated genomes. All of these scripts are optimized to run in Cornell's HPC environment, but can easily be transfered to other HPCs or AWS.
The following scripts are including in this repo:
- influenza_analysis.sh (De-novo assembly with Kraken binning and Trinity Assembly)
- influenza_analysis_no-kraken.sh (De-novo assembly without Kraken binning and Trinity Assembly)
- filter-genomic-segments.py (Filters out small contigs and keeps only the largest contig for each of the 8 genes)
- influenza_analysis_reference_based.sh (Reference-based assembly, with Bowtie2 and Samtools)
- influneza_analysis_reference_based_snippy.sh (Reference-based assembly, with the Snippy pipeline)
I tend to use these pipelines more often than De-novo because they don't require as much sequencing depth to obtain complete and full genomes
- Running Pipeline
./influenza_analysis_reference_based_snippy.sh
or
./influenza_analysis_reference_based.sh
- IMPORTANT - Analysis script needs to be in the same directory as the raw fastq files. Raw fastq files need to be gzip. No other files can be in the directory with the raw fastq files and analysis script. There is no limit to the number of samples run at a time, the script scales to number of samples inputed
- User Provided Positional Arguments
enter fasta genome PATH: <$PATH to reference fasta file>
- Output files
fastqc/ - Fastqc output from raw reads and post Q/C reads
lighter-output/ - Error corrected fastq files
trimmomatic-output/ - Error corrected and trimmed reads
trimmomatic-output/Sample_output/ - Snippy output files, contains vcf, bam, etc.
trimmomatic-output/consensus-fasta/final_consensus_fasta/ - Final fasta file containing called SNPs instantiated