A hands-on training course at Instituto Gulbenkian de Ciência (4 days)
Official course page of the Gulbenkian Training Programme in Bioinformatics - GTPB
http://gtpb.igc.gulbenkian.pt/bicourses/ADER17S/
High-throughput technologies allow us to detect transcripts present in a cell or tissue. This introductory course covers practical aspects of the analysis of differential gene expression by RNAseq. Participants will be presented with real-world examples and work with them in the training room, covering all the steps of RNAseq analysis, from planning the gathering of sequence data to the generation of tables of differentially expressed gene lists and visualization of results. We will also cover some of the initial steps of secondary analysis, such as functional enrichment of the obtained gene lists.
Life Scientists who want to be able to use NGS data to evaluate gene expression (RNAseq). Computational researchers who wish to get acquainted with the concepts and methodologies used in RNAseq are also welcome.
Familiarity with elementary statistics and a few basics of scripting in R will be helpful.
Please have a look at the following resources and gauge your ability to use R in statistics at the basic level: Coursera videos; Introduction to R
Basic Unix command line skills, such as being able to navigate in a directory tree and copy files. See, for example, "Session 1" of the Software Carpentry training for a Unix introduction.
Course participants will go through a series of experiences that ultimately lead to enhanced capabilities to:
- List broad characteristics of NGS technologies and choose adequate sequencing for your biological question
- Have a broad overview of the steps in the analysis of RNA-Seq differential expression experiments
- Assess the general quality of the raw data from the sequencing facility
- Do simple processing operations on the raw data to improve its quality
- Generate alignments against a reference genome
- Assess the general quality of the alignments and detect possible problems
- Generate tables of counts using the alignment and a reference gene annotation
- Generate lists of differentially expressed genes, at least for a simple pairwise comparison
- Perform simple functional enrichment analysis and understand the concepts behind them
For this, we are providing small example datasets and exercises that participants can use to learn.
What choices do you have when sending your samples to the sequencing facility
How do the sequencing choices influence the kind of questions you can answer
What are the steps in RNA-Seq data analysis
What information is in fastq files, and how is it organized
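As a rough illustration of how a fastq file is organized, the sketch below reads records of four lines each (header starting with "@", sequence, separator starting with "+", quality string) and converts the quality characters to Phred scores. It assumes the common Phred+33 encoding and unwrapped (single-line) sequences; the names FastqRecord and parse_fastq are hypothetical and not part of the course materials.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str             # header line without the leading "@"
    sequence: str         # base calls
    qualities: List[int]  # one Phred score per base


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per four-line fastq entry (assumes Phred+33 by default)."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated fastq record: " + header)
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed fastq record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        # Each quality character encodes a Phred score as ASCII code minus offset.
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


if __name__ == "__main__":
    example = ["@read1", "ACGT", "+", "II5#"]
    for record in parse_fastq(example):
        print(record.name, record.sequence, record.qualities)
```

A real file would be passed as an open file handle, for example parse_fastq(open("reads.fastq")), where reads.fastq is a placeholder name.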
LO 3.3 - Read QC reports of raw data to assess the general quality of data and presence of sequence bias
Detect low quality bases in the QC reports
Detect sequence bias and possible presence of adaptors and other contaminants
Use seqtk to remove a fixed number of bases from either end of the reads in a fastq file
Use seqtk to remove low quality bases from the ends of the reads in a fastq file
Use trimmomatic to filter/trim low quality bases using more complex approaches
LO 4.2 - Use tools such as cutadapt and trimmomatic to remove adaptors and other artefactual sequences from your reads
Remove Illumina adaptor from an example dataset using cutadapt
Remove PolyA from an example dataset using cutadapt
Check results using FastQC on filtered data
Are genomes constant?
Obtain genome fasta from Ensembl
What are the conditions for using Burrows-Wheeler approaches?
Prepare a reference genome to use with hisat2 and bwa
Run hisat2 / bwa mem on an example dataset
What is the SAM format; what is the BAM format
What is the GFF/GTF format
Obtain genome GTF from Ensembl
Interpret general alignment statistics such as percentage of aligned reads
Check the reports to assess RNA integrity and diversity
What parameters do we need to consider when counting
LO 8.1 - Using the R packages edgeR and DESeq2 to produce a pairwise differential expression analysis
Use Galaxy to produce differentially expressed genes with edgeR and DESeq2
Use edgeR and DESeq2 in R and RStudio
Produce PCA plots comparing all samples: outlier detection
Visualize expression profiles of top differentially expressed genes
Produce other plots such as volcano plots
Account for confounders using Generalized Linear Models
Performing ANOVA-like comparisons
What are functional annotations, what types exist, and where to get them
When and why do we need multiple test corrections
Using functional enrichment analysis with your lists of genes
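To make the link between functional enrichment and multiple test correction more concrete, here is a minimal sketch using only the Python standard library. It assumes a one-sided hypergeometric test for over-representation of a single annotation term, followed by a Benjamini-Hochberg adjustment across terms. This is one common choice, not necessarily the method used by the tools in the course, and the function names are hypothetical.

```python
from math import comb
from typing import List


def hypergeom_enrichment_p(population: int, annotated: int,
                           selected: int, overlap: int) -> float:
    """Probability of seeing at least `overlap` annotated genes when drawing
    `selected` genes without replacement from `population` genes, of which
    `annotated` carry the term."""
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    # Made-up numbers: 20000 genes in total, 200 differentially expressed,
    # and three terms given as (genes annotated, annotated genes in the list).
    terms = [(150, 10), (500, 6), (40, 1)]
    raw = [hypergeom_enrichment_p(20000, a, 200, k) for a, k in terms]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"p = {p:.3g}, adjusted p = {q:.3g}")
```

Testing many terms at once is what makes the adjustment necessary: with hundreds of terms, some small raw p-values are expected by chance alone.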
- 09:30 - 10:00 Introduction to the course and self-presentation of the participants
- 10:00 - 11:00 Possibilities and limitations of NGS sequencing technologies. Choose adequate sequencing for your biological question
- 11:00 - 11:30 Coffee Break
- 11:30 - 12:30 Steps in the analysis of RNA-Seq differential expression experiments
- 12:30 - 14:00 LUNCH BREAK
- 14:00 - 16:00 Understand what fastq files are and what they contain. Use software like FastQC to process fastq files and produce QC reports. Read QC reports of raw data to assess the general quality of data and presence of sequence bias. Use tools such as seqtk, cutadapt and trimmomatic to remove low quality bases, adaptors and other artefactual sequences from your reads.
- 16:00 - 16:30 Tea Break
- 16:30 - 18:00 What is a reference genome, versioning and where to obtain genomes. Alignment software: hisat2; bwa; salmon. Run an alignment: the SAM/BAM alignment format.
- 09:30 - 10:00 Morning Wrap-up (what have we done so far?)
- 10:00 - 11:00 What is a reference gene annotation, versioning and where to obtain one. Visualizing alignments in IGV for single genes.
- 11:00 - 11:30 Coffee Break
- 11:30 - 12:30 Use tools such as RSeQC and Qualimap to assess quality of alignments.
- 12:30 - 14:00 LUNCH BREAK
- 14:00 - 16:00 The process of generating gene counts from genome alignments. Use tools such as htseq-count and featureCounts to generate tables of gene counts. Use Salmon to generate counts using only the transcriptome.
- 16:00 - 16:30 Tea Break
- 16:30 - 18:00 Using the R packages edgeR and DESeq2 in Galaxy to produce a pairwise differential expression analysis
- 09:30 - 10:00 Morning Wrap-up (what have we done so far?)
- 10:00 - 11:00 Use edgeR and DESeq2 in R and RStudio.
- 11:00 - 11:30 Coffee Break
- 11:30 - 12:30 Interpretation and visualization of results.
- 12:30 - 14:00 LUNCH BREAK
- 14:00 - 16:00 Interpretation and visualization of results.
- 16:00 - 16:30 Tea Break
- 16:30 - 18:00 Use more complex settings: Generalized Linear Models.
- 09:30 - 10:00 Morning wrap-up (what have we done so far?)
- 10:00 - 11:00 Use more complex settings: Generalized Linear Models.
- 11:00 - 11:30 Coffee Break
- 11:30 - 12:30 How to extract meaning from a list of genes. Understand the concept of functional enrichment analysis, and the statistics involved.
- 12:30 - 14:00 LUNCH BREAK
- 14:00 - 16:00 Interpreting the results of functional enrichment analysis.
- 16:00 - 16:30 Tea Break
- 16:30 - 18:00 Final wrap-up Session.