<H2>Generate seqs for building a species tree by aligning reads or assemblies to a single reference genome, then calling variants</H2>

<H3>I. Generating consensus seqs from fastq sequencing reads</H3>

<B>Step 1 - Map fastq reads from ONT output to reference genome using minimap2</B>

inputs: reference fasta, sequencing reads in fastq format<br>
output: sam file of aligned reads

In [None]:
minimap2 -ax map-ont ref.fasta reads.fastq > strain.sam

<B>Step 2 - Convert SAM to BAM, sort, index</B>

inputs: SAM of aligned reads from minimap2<br>
output: sorted and indexed BAM file of aligned reads

In [None]:
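# convert SAM to BAM (-S is no longer needed; samtools auto-detects the input format)
samtools view -b strain.sam > strain.bam

# sort by coordinate
samtools sort -o strain.sorted.bam strain.bam

# index the sorted BAM
samtools index strain.sorted.bam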
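
The intermediate SAM and unsorted BAM can be skipped by piping minimap2 straight into samtools sort. A minimal sketch, assuming the same file names as above; the `map_and_sort` function name and the `DRY_RUN` switch are made up for illustration (with `DRY_RUN=1` the commands are only printed, not run).

```shell
# Hypothetical helper: map ONT reads and write a sorted, indexed BAM in one go.
# Usage: map_and_sort ref.fasta reads.fastq strain
map_and_sort() {
    local ref=$1 reads=$2 prefix=$3
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only show what would be run (illustrative switch, not a tool option)
        echo "minimap2 -ax map-ont $ref $reads | samtools sort -o $prefix.sorted.bam -"
        echo "samtools index $prefix.sorted.bam"
        return 0
    fi
    # samtools sort reads the SAM stream from stdin ("-") and writes a sorted BAM
    minimap2 -ax map-ont "$ref" "$reads" | samtools sort -o "$prefix.sorted.bam" - \
        && samtools index "$prefix.sorted.bam"
}

# Print the commands without running them
DRY_RUN=1 map_and_sort ref.fasta reads.fastq strain
```

Unset `DRY_RUN` (or set it to 0) to actually run the mapping.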

<B>Step 3 - bcftools mpileup to get genotype likelihoods</B>

inputs: reference fasta, sorted bam<br>
output: bcf pileup file<br>
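*notes: -Ou writes uncompressed BCF, which skips compression overhead for an intermediate file that is read straight back in by the next step. Without an explicit -O, bcftools defaults to uncompressed VCF text (recent versions may instead infer the type from the output file extension).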

In [None]:
bcftools mpileup -f ref.fasta strain.sorted.bam -Ou -o strain.mpileup.bcf

<B>Step 4 - Call variants</B>

inputs: bcf mpileup output<br>
output: zipped vcf file <br>
*notes: -mv (equivalent to -m -v): -m selects the multiallelic caller, which calls both SNPs and indels; -v outputs only variant sites, since non-variant positions are not needed; --ploidy 1 specifies a haploid genome; -Oz bgzip-compresses the output, since a compressed VCF is required by the indexing and consensus commands

In [None]:
bcftools call -mv --ploidy 1 -Oz -o strain.vcf.gz strain.mpileup.bcf

<B>Step 5 - Index vcf</B>

input: zipped vcf file<br>
output: indexed vcf file (still zipped)

In [None]:
bcftools index strain.vcf.gz

<B>Step 6 - Filter vcf file for variant quality and read depth</B>

input: indexed zipped vcf file<br>
output: filtered version of input file<br>
*notes: -e excludes sites matching the expression; the thresholds for variant quality (QUAL) and read depth (DP) can be changed, these seem reasonable

In [None]:
bcftools filter -e 'QUAL<30 || DP<10' -Oz -o strain.filtered.vcf.gz strain.vcf.gz
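
To make the exclusion expression concrete, the sketch below applies the same logic (`QUAL<30 || DP<10`) to a tiny hand-written VCF using awk. The three records are invented for illustration, and awk is only a stand-in to show which sites survive; the real filtering is done by `bcftools filter` above.

```shell
# Toy VCF with made-up records (tab-separated, as VCF requires)
printf '##fileformat=VCFv4.2\n' > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy.vcf
printf 'chr1\t100\t.\tA\tG\t50\t.\tDP=25\n' >> toy.vcf   # kept: QUAL>=30 and DP>=10
printf 'chr1\t200\t.\tC\tT\t12\t.\tDP=40\n' >> toy.vcf   # excluded: QUAL<30
printf 'chr1\t300\t.\tG\tA\t60\t.\tDP=4\n'  >> toy.vcf   # excluded: DP<10

# Keep header lines; drop records where QUAL<30 or INFO/DP<10
awk -F'\t' '
    /^#/ { print; next }
    {
        dp = 0
        n = split($8, info, ";")
        for (i = 1; i <= n; i++)
            if (info[i] ~ /^DP=/) dp = substr(info[i], 4) + 0
        if ($6 + 0 < 30 || dp < 10) next
        print
    }
' toy.vcf > toy.filtered.vcf

# Positions that passed the filter
grep -v '^#' toy.filtered.vcf | cut -f2
```

Only the record at position 100 passes both thresholds, so it is the only position listed.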

<B>Step 7 - Keep just snps and index again</B>

input: filtered vcf file (zipped)<br>
output: indexed vcf with only snps

In [None]:
bcftools view -v snps -Oz -o strain.snps.vcf.gz strain.filtered.vcf.gz

bcftools index strain.snps.vcf.gz

<B>Step 8 - Generate consensus sequence</B>

inputs: reference fasta, indexed snps.vcf file<br>
output: consensus sequence in fasta format

In [None]:
bcftools consensus -f ref.fasta strain.snps.vcf.gz > strain.consensus.fasta
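
Because only SNPs are applied, the consensus should come out exactly as long as the reference, which is what lets consensus sequences from different strains be stacked position by position later. A quick sanity check, sketched with awk; the `fasta_len` helper is hypothetical, and the toy files stand in for ref.fasta and strain.consensus.fasta.

```shell
# Hypothetical helper: total number of bases in a FASTA file (all records)
fasta_len() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Toy stand-ins for ref.fasta and strain.consensus.fasta (made-up sequences)
printf '>ref\nACGTACGT\nACGT\n' > toy_ref.fasta
printf '>ref\nACGTACTT\nACGA\n' > toy_consensus.fasta

if [ "$(fasta_len toy_ref.fasta)" -eq "$(fasta_len toy_consensus.fasta)" ]; then
    echo "lengths match"
else
    echo "length mismatch: check that no indels were applied" >&2
fi
```

On real data, the same comparison would be `fasta_len ref.fasta` against `fasta_len strain.consensus.fasta`.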

<B>Step 9 - Change ID on fasta header line</B>

This was necessary because, by default, bcftools consensus copies the ID from the reference header onto the new consensus seq

In [None]:
sed -i '1s/.*/>new_id/' strain.consensus.fasta
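
The `1s` address only rewrites the first line, so the command above fits a reference with a single sequence. Below is a small sketch of the same rename driven by a shell variable, demonstrated on a toy file; `strain_id` and the toy file name are placeholders. It writes to a temporary file instead of using `sed -i`, since the in-place flag behaves differently between GNU and BSD/macOS sed.

```shell
strain_id="strainA"   # placeholder ID

# Toy consensus file that still carries the reference header (made-up content)
printf '>ref_contig_1 some description\nACGTACGT\n' > toy.consensus.fasta

# Replace the whole first line with the new header, then swap the file in
sed "1s/.*/>${strain_id}/" toy.consensus.fasta > toy.consensus.tmp \
    && mv toy.consensus.tmp toy.consensus.fasta

head -n 1 toy.consensus.fasta
```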

<H3>II. Generating consensus seqs from already assembled genomes</H3>

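*mostly the same as when using sequencing reads, see above for specifics<br>
*note - for strains of the same species I used -ax asm5; for different species I used -ax asm10 (the minimap2 presets are tuned for increasing divergence from the reference: asm5 for roughly 0.1%, asm10 for roughly 1%)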

In [None]:
# step 1 - map assembled genome (or part of it) to the reference
minimap2 -ax asm5 ref.fasta strain.fasta > strain.sam

# step 2 - convert sam to bam, sort, and index
samtools view -b strain.sam > strain.bam

samtools sort -o strain.sorted.bam strain.bam

samtools index strain.sorted.bam

# step 3 - generate bcf pileup
bcftools mpileup -f ref.fasta strain.sorted.bam -Ou -o strain.mpileup.bcf

# step 4 - call variants
bcftools call -mv --ploidy 1 -Oz -o strain.vcf.gz strain.mpileup.bcf

# step 5 - index vcf file
bcftools index strain.vcf.gz

# step 6 - filter on variant quality and re-index vcf (no DP filter: an assembly gives roughly 1x depth)
bcftools filter -e 'QUAL<30' -Oz -o strain.filtered.vcf.gz strain.vcf.gz

bcftools index strain.filtered.vcf.gz

# step 7 - get snps only and re-index
bcftools view -v snps -Oz -o strain.snps.vcf.gz strain.filtered.vcf.gz

bcftools index strain.snps.vcf.gz

# step 8 - generate consensus sequence
bcftools consensus -f ref.fasta strain.snps.vcf.gz > strain.consensus.fasta

# step 9 - rename fasta header id
sed -i '1s/.*/>new_id/' strain.consensus.fasta
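
With several assemblies, the nine steps can be wrapped in a loop. This is a sketch only: it assumes each assembly is named `<strain>.fasta` in the working directory, the strain names are placeholders, and the `run` helper with its `DRY_RUN` switch is invented here so the commands can be printed for review before anything is executed. Output files are set with each tool's `-o` option rather than shell redirection so that every step can pass through `run`.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Steps 1-9 for one assembly; usage: process_assembly ref.fasta strain
process_assembly() {
    local ref=$1 s=$2
    run minimap2 -ax asm5 -o "$s.sam" "$ref" "$s.fasta" &&
    run samtools view -b -o "$s.bam" "$s.sam" &&
    run samtools sort -o "$s.sorted.bam" "$s.bam" &&
    run samtools index "$s.sorted.bam" &&
    run bcftools mpileup -f "$ref" -Ou -o "$s.mpileup.bcf" "$s.sorted.bam" &&
    run bcftools call -mv --ploidy 1 -Oz -o "$s.vcf.gz" "$s.mpileup.bcf" &&
    run bcftools index "$s.vcf.gz" &&
    run bcftools filter -e 'QUAL<30' -Oz -o "$s.filtered.vcf.gz" "$s.vcf.gz" &&
    run bcftools index "$s.filtered.vcf.gz" &&
    run bcftools view -v snps -Oz -o "$s.snps.vcf.gz" "$s.filtered.vcf.gz" &&
    run bcftools index "$s.snps.vcf.gz" &&
    run bcftools consensus -f "$ref" -o "$s.consensus.fasta" "$s.snps.vcf.gz" &&
    run sed -i "1s/.*/>$s/" "$s.consensus.fasta"
}

DRY_RUN=1   # set to 0 to actually execute the commands

# Placeholder strain names; each expects a matching <strain>.fasta
for s in strainA strainB; do
    process_assembly ref.fasta "$s"
done
```

In dry-run mode the printed commands lose their shell quoting, so they are meant for reading rather than copy-pasting. Swap `asm5` for `asm10` when the assemblies come from a different species than the reference.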