fohebert/align_on_reference

RNA-seq - Bern-Cichlids team <°)))o><

General Description & Disclaimer

This pipeline cleans raw read files and maps them to a reference transcriptome. Use the scripts with care: they are written for the specific requirements of the KATAK cluster at Laval University. Follow the instructions step by step to produce the desired output. If the pipeline is run on another cluster, there is no guarantee that it will work properly, but you can always edit the script files to suit your needs.

Step 1 - Copying and expanding raw files

Description
First, copy the files from the external hard-drive to your account on KATAK, into the raw data folder of the pipeline (02_raw_data).

Copy the files with the "rsync" utility. Open a terminal window and type the following command:

rsync -avzP /Volumes/$HD/*.fastq.tar.gz ckavoek1@ibis.ulaval.ca:/home/ckavoek1/align_on_reference/02_raw_data

NOTE: $HD = name of your external hard-drive

1.1 First, go to the pipeline main folder:

cd /home/ckavoek1/align_on_reference

1.2 Submit the expand job:

qsub 01_scripts/jobs/00_expand.job.sh
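
The contents of 00_expand.job.sh are not shown here. As a rough sketch of what "expanding" the archives means, assuming each *.fastq.tar.gz archive simply contains FASTQ files, the following extracts every archive found in a folder. The function name expand_archives and the sample name in the demonstration are hypothetical, and the real job script may work differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: extract every *.fastq.tar.gz archive in a folder.
# This is NOT the content of 00_expand.job.sh, only an illustration.
expand_archives() {
    local dir="$1"
    local archive
    for archive in "$dir"/*.fastq.tar.gz; do
        [ -e "$archive" ] || continue       # no archives found: nothing to do
        tar -xzf "$archive" -C "$dir"       # extract next to the archive
    done
}

# Small demonstration in a throwaway folder (made-up sample name).
demo_dir="$(mktemp -d)"
printf '@read1\nACGT\n+\nIIII\n' > "$demo_dir/sampleA.fastq"
tar -czf "$demo_dir/sampleA.fastq.tar.gz" -C "$demo_dir" sampleA.fastq
rm "$demo_dir/sampleA.fastq"
expand_archives "$demo_dir"
ls "$demo_dir"
```

On the cluster, the equivalent call would be expand_archives 02_raw_data from the main folder of the pipeline, but the submitted job remains the supported way to do this.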

Step 2 - Trimming reads (PHRED score > 2)

Description
This step cleans the reads according to a sequencing quality threshold, i.e. PHRED score greater than 2. Trimmomatic discards the bases at both ends of each read whose sequencing quality falls below the user-defined threshold. The threshold is deliberately low (PHRED > 2) because aggressive trimming leads to the loss of important information. Still from the main folder of the pipeline (i.e. /home/ckavoek1/align_on_reference), here's how you can perform this task:

2.1 Submit the trimming job:

qsub 01_scripts/jobs/01_trimming.job.sh

IMPORTANT NOTE - COMPLETING THE PIPELINE WITH A SUBSET OF THE DATA

  • The command above takes ALL of the FASTQ files found in '02_raw_data/', trims/cleans them and places the output files (i.e. trimmed read files) into the folder for the next step, i.e. 03_trimmed.
  • If you want to perform the analysis with only a subset of the data to practice or test the pipeline, here is what you could do:

From the main folder of the pipeline, create a folder and move into it all the read files except the few you want to work with:

mkdir 02_raw_data/temp_samples # Creates the folder
mv 02_raw_data/*.fastq 02_raw_data/temp_samples # Places all the read files in the temporary folder
mv 02_raw_data/temp_samples/*92S518*.fastq 02_raw_data/ # Moves sample number 92S518 back in the raw_data directory

And then submit the job:

qsub 01_scripts/jobs/01_trimming.job.sh

This will allow you to continue the pipeline with sample number 92S518 only.
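
Once the test run is finished, the samples that were set aside have to go back into 02_raw_data before the full pipeline is run. A minimal sketch of that reverse step, using the same folder names as the commands above; the function name restore_samples and the file names in the demonstration are hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move the set-aside read files back into the raw data folder.
restore_samples() {
    local raw_dir="$1"
    local tmp_dir="$raw_dir/temp_samples"
    local f
    [ -d "$tmp_dir" ] || return 0           # nothing was set aside
    for f in "$tmp_dir"/*.fastq; do
        [ -e "$f" ] || continue             # folder holds no read files
        mv "$f" "$raw_dir"/
    done
    rmdir "$tmp_dir"                        # only succeeds if the folder is now empty
}

# Demonstration in a throwaway folder, mirroring the subset commands above.
raw="$(mktemp -d)/02_raw_data"
mkdir -p "$raw"
touch "$raw/92S518_R1.fastq" "$raw/otherA_R1.fastq" "$raw/otherB_R1.fastq"
mkdir "$raw/temp_samples"
mv "$raw"/*.fastq "$raw/temp_samples"
mv "$raw"/temp_samples/*92S518*.fastq "$raw"/
restore_samples "$raw"
ls "$raw"
```

From the main folder of the pipeline, the call would be restore_samples 02_raw_data.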

Step 3 - Aligning the reads on the reference

Description
This step aligns the reads on the reference transcriptome (coding sequences, CDS), in this case that of the Nile tilapia (Oreochromis niloticus). Still from the main folder of the pipeline (i.e. /home/ckavoek1/align_on_reference), here's how you can perform this task:

qsub 01_scripts/jobs/02_align.job.sh
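
After the alignment job finishes, it can be useful to check how many alignment records actually mapped. The sketch below assumes the output is in SAM format and relies only on the SAM conventions that header lines start with "@", that column 2 is the FLAG, and that FLAG bit 0x4 marks an unmapped read. The function name count_mapped and the toy SAM file are hypothetical, and the location of the real SAM files is not assumed here.

```shell
#!/usr/bin/env bash
# Count mapped alignment records in a SAM file.
# Prints two tab-separated numbers: mapped records, total records.
count_mapped() {
    awk -F '\t' '
        /^@/ { next }                           # skip header lines
        { total++ }
        int($2 / 4) % 2 == 0 { mapped++ }       # FLAG bit 0x4 not set: mapped
        END { printf "%d\t%d\n", mapped, total }
    ' "$1"
}

# Made-up three-record SAM file, for demonstration only.
demo_sam="$(mktemp)"
{
    printf '@SQ\tSN:gene1\tLN:1000\n'
    printf 'r1\t0\tgene1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'r2\t16\tgene1\t50\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
} > "$demo_sam"
count_mapped "$demo_sam"
```

The count is per alignment record, so secondary or supplementary alignments, if present, are counted as separate records.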

Step 4 - Getting the count matrix

Description
This step will allow you to produce a count matrix based on the SAM files generated by BWA during the mapping step. Still from the main folder of the pipeline (i.e. /home/ckavoek1/align_on_reference), here's how you can perform this task:

qsub 01_scripts/jobs/03_read_counts.job.sh

This last script should produce an output file called samples.read-count.tsv in the folder 05_read_counts. This output file is the actual matrix that you need as the input file in edgeR or limma. ROWS = genes ("reference sequences"), COLUMNS = samples.
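
Before loading the matrix into edgeR or limma, a quick look at its dimensions can catch a missing sample early. The sketch below assumes the file is tab-separated, that its first line is a header of sample names, and that its first column holds the gene identifiers; none of these details is confirmed by the pipeline, so adjust if the real layout differs. The function name matrix_dims and the toy matrix are hypothetical.

```shell
#!/usr/bin/env bash
# Report the dimensions of a tab-separated count matrix.
# ASSUMPTION: first line is a header of sample names, first column holds gene IDs.
matrix_dims() {
    awk -F '\t' '
        NR == 1 { samples = NF - 1; next }      # header: every column but the first
        NF > 0  { genes++ }                     # each non-empty row is one gene
        END { printf "genes=%d samples=%d\n", genes, samples }
    ' "$1"
}

# Made-up matrix, for demonstration only.
demo_tsv="$(mktemp)"
printf 'gene\tsampleA\tsampleB\n'  > "$demo_tsv"
printf 'gene1\t12\t7\n'           >> "$demo_tsv"
printf 'gene2\t0\t3\n'            >> "$demo_tsv"
matrix_dims "$demo_tsv"
```

On the real output, the call would be matrix_dims 05_read_counts/samples.read-count.tsv from the main folder of the pipeline.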

Step 5 - Annotation

Description
There is no extensive annotation step in this pipeline since the tilapia genome is used as the reference. The cool thing is that an annotation file already exists. You can find it in:

02_raw_data/tilapia_genome/ensembl.annotation.txt
