The goal of this project is to build a reproducible pipeline that takes whole genome sequence data from pangolins and identifies single nucleotide polymorphisms (SNPs) between two species.
Create a well-documented and reproducible pipeline that:
- Runs FASTQC to check read quality
- Aligns fastq files to a reference genome using BWA
- Identifies SNPs between species using FreeBayes and ANGSD
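The steps above can be sketched as an ordered list of commands for one sample. This is a minimal illustration, not the project's actual scripts: the sample name, the reference file name, single-end reads, and the use of samtools to produce sorted .bam files are all assumptions. ANGSD is left out because its options depend on the analysis being run.

```python
def pipeline_commands(sample, reference,
                      raw_dir="data/raw-data", out_dir="analyses"):
    """Return the commands for one sample, in run order, as argument lists.

    Directory names follow the layout described in this README; the file
    naming scheme (<sample>.fastq.gz, single-end) is an assumption.
    """
    fastq = f"{raw_dir}/{sample}.fastq.gz"
    sam = f"{out_dir}/aligned-files/{sample}.sam"
    bam = f"{out_dir}/aligned-files/{sample}.sorted.bam"
    return [
        # Quality report for the raw reads
        ["fastqc", "--outdir", f"{out_dir}/fastqc", fastq],
        # Align reads to a reference already indexed with `bwa index`
        ["bwa", "mem", "-o", sam, reference, fastq],
        # Sort and index the alignments (samtools is an assumption here)
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
        # Call variants; FreeBayes writes VCF to standard output by default
        ["freebayes", "-f", reference, bam],
    ]


if __name__ == "__main__":
    # Hypothetical sample and reference names, printed as a dry run
    for cmd in pipeline_commands("sample01", "data/reference-genome/reference.fa"):
        print(" ".join(cmd))
```

Printing the commands instead of executing them makes it easy to review the plan before submitting jobs.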
data
README and files containing information about data files. Raw data and reference genome files are too large to store on GitHub.
- raw-data: contains .fastq.gz files
- reference-genome: contains downloaded reference genome .fa and .gff.gz files
tutorials
Jupyter and R notebooks from tutorials in class.
- BLAST tutorial
notebooks
Jupyter notebooks used for analyses.
- Notebook containing the MD5 checksum check for the reference genome
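A checksum check of this kind can be sketched in a notebook cell as follows. This is an illustration rather than the notebook's actual contents; the function names are hypothetical, and the expected checksum would come from the site the reference genome was downloaded from.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large genome files are never fully in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Return True if the file's MD5 equals the published checksum."""
    return md5sum(path) == expected_md5.strip().lower()
```

A mismatch usually means the download was truncated or corrupted and should be repeated before indexing.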
scripts
Bash scripts used to run analyses on Mox.
analyses
Results and intermediate files from analysis.
- aligned-files: contains .sam and .bam files
- fastqc: contains FASTQC and MultiQC results
- genome: contains scaffold length text file
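One way a scaffold length text file could be produced from the reference .fa file is sketched below. This is an assumption about how such a file might be generated, not a record of how it was made here; the function names and the tab-separated output format are hypothetical.

```python
def scaffold_lengths(fasta_path):
    """Return {scaffold_name: length} for an uncompressed FASTA file."""
    lengths = {}
    name = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Use the first word of the header as the scaffold name
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def write_lengths(lengths, out_path):
    """Write one 'name<TAB>length' line per scaffold."""
    with open(out_path, "w") as out:
        for name, length in lengths.items():
            out.write(f"{name}\t{length}\n")
```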
Week 4: Set up project directory and organization for running analyses on Mox
Week 5: Run FASTQC on raw sequence files using GNU parallel to learn how to split up commands
Week 6: Check md5sum of the downloaded reference genome and index reference genome for BWA
Week 7: Run BWA on fastq files for all 10 individuals
Week 9: Run FreeBayes and ANGSD on aligned bam files
Week 10: Visualize the process and results of the project
- Filter identified SNPs using various quality filters and identify the most informative SNPs
- Use a genome-aware primer designing software to design primers around SNPs of interest
- Sequence museum samples and re-analyze the data with full dataset
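As a starting point for the first item, a simple quality filter on VCF records might look like the sketch below. It is only an illustration: the function name is hypothetical, the threshold of 30 is an arbitrary placeholder, and a real filter would likely also consider depth and other fields.

```python
def filter_vcf_by_qual(lines, min_qual=30.0):
    """Yield VCF header lines and records whose QUAL is at least min_qual.

    `lines` is any iterable of VCF text lines, such as an open file handle.
    The default threshold is an illustrative assumption.
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            yield line  # keep meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]  # QUAL is the sixth fixed VCF column
        # "." marks a missing quality value; such records are dropped
        if qual != "." and float(qual) >= min_qual:
            yield line
```

Because the function is a generator, it can stream a large VCF file line by line without loading it into memory.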
Adam Tusk / CC BY 2.0