project results and methods project presentation
- OS: macOS Big Sur Version 11.2 (20D64)
- RStudio Version 1.3.959
- kallisto 0.46.2
- FastQC v0.11.9 (Win/Linux)
- GitHub Desktop Version 2.5.4
- JupyterLab 3.0.6
Data courtesy of Chris Monson (UW) and Giles Goetz (NOAA). A more thorough description can be found in the data subdirectory's readme.
All RNA seq raw data files can be found here
- Only a subset of files was used for this project due to storage limitations, but all code is written so that it can be run on the full dataset if you have enough space on your computer or an external hard drive. The files in the subset are:
  - 17104-02RT-01-10_S18_L002_R1_001.fastq.gz
  - 17104-02RT-01-10_S18_L002_R2_001.fastq.gz
  - 17104-02RT-01-11_S19_L002_R1_001.fastq.gz
  - 17104-02RT-01-11_S19_L002_R2_001.fastq.gz
  - 17104-02RT-01-13_S21_L002_R2_001.fastq.gz
  - 17104-02RT-01-13_S21_L003_R1_001.fastq.gz
  - 17104-02RT-01-13_S21_L003_R2_001.fastq.gz
  - 17104-02RT-01-14_S22_L002_R1_001.fastq.gz
  - 17104-02RT-01-7_S15_L002_R2_001.fastq.gz
  - 17104-02RT-01-7_S15_L003_R1_001.fastq.gz
  - 17104-02RT-01-7_S15_L003_R2_001.fastq.gz
  - 17104-02RT-01-8_S16_L002_R1_001.fastq.gz
  - 17104-02RT-01-8_S16_L002_R2_001.fastq.gz
  - 17104-02RT-01-8_S16_L003_R1_001.fastq.gz
  - 17104-02RT-01-8_S16_L003_R2_001.fastq.gz
- The R1 or R2 in the file names corresponds to the read end.
- The sequences of the adapters used in library prep are R1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA and R2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT.
- The
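The file naming scheme can be unpacked with a short script. This is a minimal sketch, not part of the project code: the pattern is inferred from the file names listed above, and reading the L00# field as the sequencing lane follows the usual Illumina naming convention, which this readme does not state explicitly. The function names `parse_fastq_name` and `group_mates` are hypothetical.

```python
import re
from collections import defaultdict

# Pattern inferred from the listed file names (an assumption):
# <sample>_S<index>_L<lane>_R<read>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<index>\d+)_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq\.gz$"
)


def parse_fastq_name(name):
    """Split a FASTQ file name into sample, index, lane, and read end."""
    match = FASTQ_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return match.groupdict()


def group_mates(names):
    """Group files by (sample, lane) so that R1/R2 mates sit together."""
    groups = defaultdict(dict)
    for name in names:
        info = parse_fastq_name(name)
        groups[(info["sample"], info["lane"])]["R" + info["read"]] = name
    return dict(groups)
```

Running `group_mates` on the subset above is a quick way to spot files whose mate is not in the subset, such as lane 002 of 17104-02RT-01-13, for which only the R2 file is listed.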
As of March 3rd, 2021, Week 8
Below, the code files for this project are listed and described in order. File names for the final project are formatted step#-description. Old file names and draft code names are formatted MMDD-description, where MMDD is the month and day they were created.
Giles transferred the files from a NOAA server to Steven's ostrich server. Steven then transferred them to gannet, which is linked above. The md5sum text file, 0128-Giles-md5sums.txt, generated by Giles during the initial transfer, is located in the data/raw/ subdirectory.
Retrieves the data from gannet using wget. Recall that not all of the files were used in this project, but all files are available on gannet. Saves data to the data/raw/ subdirectory.
Compares the md5sums that Giles provided in 0128-Giles-md5sums.txt to the md5sums of the downloaded files.
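The checksum comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's own code: it assumes 0128-Giles-md5sums.txt uses the standard md5sum layout (one `<hash>  <file name>` pair per line), and the function names are hypothetical. Files listed in the checksum file but not downloaded are skipped, which fits the case where only a subset is present.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_md5sums(path):
    """Parse md5sum-style lines ('<hash>  <file name>') into {name: hash}."""
    expected = {}
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        checksum, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading '*'; keep only the base name.
        expected[Path(name.strip().lstrip("*")).name] = checksum.lower()
    return expected


def verify_downloads(md5_file, data_dir):
    """Return {file name: True/False} for every listed file found in data_dir."""
    results = {}
    for name, checksum in read_md5sums(md5_file).items():
        candidate = Path(data_dir) / name
        if candidate.exists():
            results[name] = md5_of_file(candidate) == checksum
    return results
```

A call such as `verify_downloads("data/raw/0128-Giles-md5sums.txt", "data/raw/")` would then report any file whose checksum does not match.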
Runs fastqc for all of our raw data files. Output directed to analyses/step3-fastqc/
Runs multiqc on all of the fastqc outputs in analyses/step3-fastqc/ in order to visualize the quality of all our sequences. Output directed to analyses/step4-multiqc/, and the html output is multiqc_report.html within this subdirectory. Multiqc showed that the first ~15 bp of all sequences needed to be trimmed.
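For orientation, the two quality-control steps could be driven from Python roughly as follows. This is a sketch rather than the project's own code: the helper names are hypothetical, the directories are the ones named above, and the default is a dry run that only prints the commands. It assumes `fastqc` and `multiqc` are on the PATH and that the output directories already exist.

```python
import subprocess


def fastqc_command(fastq_files, out_dir="analyses/step3-fastqc/"):
    """Build a fastqc call that writes its reports to out_dir."""
    return ["fastqc", "--outdir", out_dir, *map(str, fastq_files)]


def multiqc_command(fastqc_dir="analyses/step3-fastqc/",
                    out_dir="analyses/step4-multiqc/"):
    """Build a multiqc call that summarizes the fastqc reports."""
    return ["multiqc", fastqc_dir, "--outdir", out_dir]


def run(command, dry_run=True):
    """Print the command (dry run) or execute it and fail loudly on error."""
    if dry_run:
        print(" ".join(command))
        return None
    return subprocess.run(command, check=True)
```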
Skipped in this project for the sake of time
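Although this step was skipped, the idea behind it is simple. The sketch below shows what removing the first 15 bases from each read would look like; it is an illustration only, with hypothetical function names, and a dedicated trimming tool would normally be used instead and would also handle adapter removal. The default of 15 comes from the multiqc observation above.

```python
def trim_head(sequence, quality, n_bases=15):
    """Drop the first n_bases from a read and from its quality string."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    return sequence[n_bases:], quality[n_bases:]


def trim_fastq_records(lines, n_bases=15):
    """Yield head-trimmed (header, sequence, '+', quality) FASTQ records.

    Assumes the plain 4-line-per-record FASTQ layout.
    """
    lines = [line.rstrip("\n") for line in lines]
    for start in range(0, len(lines) - 3, 4):
        header, sequence, plus, quality = lines[start:start + 4]
        sequence, quality = trim_head(sequence, quality, n_bases)
        yield header, sequence, plus, quality
```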
Gene expression was quantified with kallisto and put into a Trinity matrix. The kallisto index was built using the Ensembl reference transcriptome for Oncorhynchus kisutch, located in data/Oncorhynchus_kisutch.Okis_V2.ncrna.fa. Outputs directed to analyses/step6-kallisto.idx and analyses/step6-output/.
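To make the matrix step concrete, here is a plain-Python stand-in that collects kallisto's estimated counts into one table. It is not the Trinity utility the project used, only a sketch of the same idea. It relies on the `abundance.tsv` file that `kallisto quant` writes (columns `target_id`, `length`, `eff_length`, `est_counts`, `tpm`); the one-directory-per-sample layout and the function names are assumptions.

```python
import csv
from pathlib import Path


def read_est_counts(abundance_tsv):
    """Read {target_id: est_counts} from a kallisto abundance.tsv file."""
    counts = {}
    with open(abundance_tsv, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            counts[row["target_id"]] = float(row["est_counts"])
    return counts


def build_count_matrix(sample_dirs):
    """Combine per-sample counts into (sample order, {target: [counts]}).

    sample_dirs maps a sample name to the directory holding its
    abundance.tsv (hypothetical layout).
    """
    per_sample = {
        sample: read_est_counts(Path(directory) / "abundance.tsv")
        for sample, directory in sample_dirs.items()
    }
    samples = list(per_sample)
    targets = sorted(set().union(*per_sample.values()))
    # Samples quantified against the same index share all targets;
    # 0.0 is only a fallback for a target missing from one file.
    matrix = {
        target: [per_sample[sample].get(target, 0.0) for sample in samples]
        for target in targets
    }
    return samples, matrix
```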
Used DESeq2 to identify differentially expressed genes (DEGs) and visualized them with a volcano plot and heatmap. Images of the volcano plot and heatmap are in images/
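As a rough illustration of how a volcano plot separates DEGs from the rest, the sketch below classifies a gene from the `log2FoldChange` and `padj` values that DESeq2 reports. The cutoffs (adjusted p-value below 0.05, absolute log2 fold change of at least 1) are assumptions chosen for the example, not the thresholds used in this project, and the function names are hypothetical.

```python
import math


def classify_gene(log2_fold_change, padj, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Label a gene 'up', 'down', or 'not significant'.

    padj may be missing (None or NaN), since DESeq2 reports NA for
    genes it filters out; those are treated as not significant.
    """
    if padj is None or math.isnan(padj) or padj >= padj_cutoff:
        return "not significant"
    if log2_fold_change >= lfc_cutoff:
        return "up"
    if log2_fold_change <= -lfc_cutoff:
        return "down"
    return "not significant"


def volcano_point(log2_fold_change, padj):
    """Return the (x, y) position of a gene: (log2FC, -log10(padj))."""
    return log2_fold_change, -math.log10(padj)
```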
Ran blastx for the reference transcriptome to identify the functions of the DEGs.
Note: only one of the 12 DEGs had a match to the reference transcriptome. The remaining 11 were identified using the web version of blastn with default settings and the fasta file data/Oncorhynchus_kisutch.Okis_V2.ncrna_11-DEGs.fa
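A fasta file holding only selected sequences, like the 11-DEG file above, can be produced by filtering the reference fasta by sequence ID. How that file was actually made is not described here, so the following is only one possible approach, with hypothetical function names; it treats the first word of each header line as the sequence ID.

```python
def read_fasta(lines):
    """Parse FASTA lines into {sequence id: sequence}."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first word after '>'; the rest is description.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def subset_fasta(records, wanted_ids):
    """Keep only the records whose ID is in wanted_ids."""
    return {seq_id: records[seq_id] for seq_id in wanted_ids if seq_id in records}


def format_fasta(records, width=60):
    """Write records back out as FASTA text, wrapping sequence lines."""
    lines = []
    for seq_id, sequence in records.items():
        lines.append(f">{seq_id}")
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"
```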
Joins the DEG statistics generated using DESeq2 with the expression levels of the 12 DEGs. Output is analyses/step9-DEGandBlastTable.tab
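The join itself amounts to matching rows on the gene or transcript ID. A minimal sketch, assuming both tables have already been read into dictionaries keyed by that ID (the column names and function names here are hypothetical):

```python
def join_tables(deg_stats, expression):
    """Inner-join two {gene_id: {column: value}} tables on gene_id."""
    joined = {}
    for gene_id, stats in deg_stats.items():
        if gene_id in expression:
            joined[gene_id] = {**stats, **expression[gene_id]}
    return joined


def to_tab(joined, columns):
    """Render the joined table as tab-delimited text with a header row."""
    rows = ["\t".join(["gene_id", *columns])]
    for gene_id, values in joined.items():
        rows.append("\t".join([gene_id, *(str(values[c]) for c in columns)]))
    return "\n".join(rows) + "\n"
```

Genes present in only one of the two tables are dropped, so the output has one row per DEG with both statistics and expression values.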