Skip to content

Syllabus

boulderrinnlab edited this page May 12, 2020 · 160 revisions
Clone this wiki locally
Group 1 (mRNA): Kristen, Shelby, Ben, Soroya, Arpan
Group 2 (lncRNA): Savannah, Michael, Thao, Dan, Graycen 
Group 3 (TE): Alison, Devin, Tom, Kevin, Guilia

January 13: Introduction and overview of course

Lecture 1: Introduction and overview

Read Chapters 1-3 for Monday Jan 22

Bioinformatics Data Skills by Vince Buffalo

January 15: Set up tools and practice examples in Chapters 1-3

First we will install the following tools, and after that work through some command line exercises.

Terminals:

Mac:

PC:

SSH Client:

Mac:

  • iterm will work for this!

PC:

FTP Client:

Mac:

PC:

Text Editor:

Github:

Slack:

  • Download here
  • Sign up for the class slack here

Command line exercises

Exercises


January 17: ENCODE data reproducibility and Example datasets

Lecture 2: Data Reproducibility in Science / Intro to Transcriptional regulation

Overview of Encode / ChIP and Transposons as missing regulatory regions

What is a promoter and transcription factor?

Encode Data portal

Go through each category and get familiar — we will specifically be looking at:

DNA binding / TF-CHIPseq / K562 / paired-ended

Exploring metadata in unix

select samples, click columns add control, click on table and then download .tsv 

Use ls, head, tail, cat, awk / grep to explore this metadata table.

January 22: Start organizing metadata for data retrieval

Sorting hat / intro to bash scripts

Taking notes in Markdown

Organize input files for our ChIP-seq pipeline

  • In addition to the sample-level metadata you retrieved last time, download this table of file-level metadata from ENCODE

  • As a group, choose one transcription factors that you would like to analyze.

  • We're going to subset and organize our metadata file to include just those files that you would like to download and the columns that will be useful to you using awk and grep.

  • We'll also make a file which contains the URLs to retrieve the fastq files from Encode.

Read chapters 4-5 

Jan 24: Let’s go get data!

Lecture 3: Where does data live in Biology, how do we get it, and did we get the right file?

We will each go retrieve a ENCODE data file from our sample sheet.

SFTP, SSH, SCP
wget -i file.txt
md5sum

Jan 27: git & GitHub 

Lecture 4: git & gitting GitHub

Class Exercises:

  • Create a git repository and commit some changes
  • Create one GitHub repository per group and commit your sample sheet script

Beyond the samplesheet

We're going to create a file that matches the ChIP samples to their control samples. The format of this file is specified by the pipeline that we will be running.

THESE ARE THE REQUIRED COLUMNS FOR THE DESIGN FILE group,replicate,fastq_1,fastq_2,antibody,control (**** fastq1 and fastq2 URLS ****)

Make a design file by Friday January 31 for your TF
  • Hint: this maybe easiest in excel. Look up file accession number for YTF. Then look for "paired with" you will see a new File accession number -- that needs to be in your control column.
  • If your "paired with" identifier is not in the sample sheet (Jan 22 lecture notes) -- then go to encode portal and find it :)
  • Advanced exercise : Script this in bash (going to need a few greps & joins :)

Jan 29: IT lecture on Fiji 

Tour of Fiji data center

Please take notes on the key rules and regulations — to do and not to do’s !

Jan 31: Connect to Fiji 

  • Layout of class directories -- where will you be doing work?
  • Get a local git repo -- set up ssh key for fiji-GitHub
  • Moving files to and from fiji

HOME : Data you really want to keep and back up not intermediate analyses

/Users/<identikey>

Scratch: THe wild west no limits (within reason) here is where we will start doing analysis and set up git etc.

scratch/Users/<identikey>

Folder to submit final files/analysis -- more later

/Shares/rinn_class/students/<identikey>

The precooked class

/Shares/rinn_class/data
Design File presentations
rsync

Feb 3: Set up Fiji to get ready to run nextflow

  • SCREEN (screen -list / ctr-d + a/ screen -r)
  • Get fastq's for your TF
  • SLURM review (interactive & batch jobs)
  • md5sum -c
  • Go over class design file
exchange design files to have a total of 3 TFs (e.g., collaborate with another group)
`cp` design files. 
What happened? How can we solve this? 

Discuss and catch up on what we have learned about unix and commands etc

Feb 5: NextFlow / nf-core chipseq

Lecture 5: Flowing with NEXTFlow

Nextflow paper: Nextflow enables reproducible computational workflows

Nextflow

NF-CORE

Read basic documentation and install nextflow in your path!

Feb 7: NextFlow / nf-core chipseq

Set up design and sample files — folder structures — Run.sh  


GOAL -- Set up your project directory run for 3 TFs.

  • design.csv
  • nextflow.config
  • run.sh
  • blacklist
  • fastq directory w/ fastqs downloaded
  • checked by John or Michael
  • run pipeline

  sbatch run.sh

squeue -u X000

scancel jobid

Familiarize yourself and take notes on file types

https://www.encodeproject.org/help/file-formats/

Read next flow documentation and nextflow.out

Homework google the programs used in nextflow.out    
Fastqc
TimaGalore
BWAMem
SortBAM
MergeBAM
BigWig
MACSCallPeak
Peak QC

Feb 10: Regroup, debug and / or rerun pipeline

Feb 12: Lecture : Transcriptional regulation and ChiPseq analysis pipeline.

Lecture

Lets cover some of the basic statistics being used in the NF-Core Chip-Seq pipeline. Probability Distributions: Poisson, Binomial, negative binomial Scan Statistics

Recomended reading: Biometry Chapter 4

Feb 14: Class Presentations / IGV and UCSC Genome Browsing of your data!

Class exercise: each group presents a statistical principle and how it is used in NF-CORE ChIPseq 

Download IGV

IGV Resources

UCSC Genome Browser

Class UCSC Account: MyData > Sign in:
BCHM_5631
Pswd : will tell y'all in class

Feb 17: DNA Binding Proteins from Structures to "Meta-analysis"

Lecture 7: DNA Binding Proteins from Structures to "Meta-analysis"

repeat masker

UCSC Download page

Gencode annotations

Gencode Download Page

Pre-downloaded

#Genome file
/Shares/rinn_class/data/genomes/human/gencode/v32/GRCh38.p13.genome.fa \

#Annotation file
/Shares/rinn_class/data/genomes/human/gencode/v32/gencode.v32.annotation.gtf \

Required Reading

TE-DNA TE-RNA E-CLIP

Feb 19: Exploring the larger dataset.

/Shares/rinn_class/data/k562_chip/

Assess reproducibility of tracks

  • Your tracks
  • Encode tracks (go to portal download bigwigs)
  • Pre-baked tracks

Install x2go to use IGV on fiji.

Class exercise: download and view:
BigWig and BroadPeak Files from this run versus on of yours
Are the results similar?

Feb 21: Introduction to R: Part 1

First: a quick tour of UCSC table browser

  • How to load a RMSK track into IGV or R

Lecture 8: Introduction to R

Try loading a peak file into R

  • they are just tab seperated tables and can be loaded with read.table(sep = "\t)

Feb 24: Class discussion of questions

Each group presents three questions they would like to address based on the TE-DNA, TE-RBP, E-CLIP study designs.
Each person 3 questions.

Presentation outline:

  • Introduce yourself and your research
  • Present the question that you would like to pursue with the class dataset
  • Discuss how you'd like to use lessons from this class in your own research

Feb 26: R Part II

Lecture 9: Intro to R -- part II

  • Continuation of R data types
  • Introduction to ggplot2 and tidyverse
  • Exercise -- plotting gene profiles
  • Git from R
Good R tutorial:
https://www.youtube.com/watch?v=fDRa82lxzaU

Feb 28: R Part III : R for Genomics (GRanges / rtracklayer)

Lecture 10: R for Genomics -- part I

Install the following packages in fiji-viz/RStudio

install.packages("BiocManager")
BiocManager::install("GenomicRanges")
BiocManager::install("rtracklayer")
  • Review your solutions to the for loop/plotting exercise
  • Introduce GRanges and findOverlaps
  • Read in peak files, repeatMasker files, and find overlaps

March 2: Class on own -- peak plotting exercise

Exercise: Make some plots to characterize the overlap of ChIP-seq peaks with TEs.

Can be as simple as plotting the number of overlaps of one particular TF with a class of TEs - OR - since you have data for all the TFs, you can plot each protein's peaks and where they fall in relation to the center of the repeat -- i.e. a metaplot heatmap or profile plot.

If you get stuck, ask your group members for help and if you're still stuck, ask in the general slack channel. We will go over your plots and code on Wednesday.

March 4: Review TE / TF metaplots exercise

  • 3 Groups of 5
Group 1 (mRNA): Kristen, Shelby, Ben, Soroya, Arpan
Group 2 (lncRNA): Savannah, Michael, Tao, Dan, Graycen 
Group 3 (TE): Alison, Devon, Tom, Kevin, Guilia
  • Granges Gencode
  • Granges consensus.peak.file
  • Intersect Granges
  • Go over TE intersection plots and problems

March 6:

  • Fix RMarkdown with Jon
  • Introduction to RMarkdown and functions
  • Git structure -- how teams will be committing to class repository /scratch/Users/<identikey>
  • Discussion: Clustering

Exercise 1: Practicing with Git

Each person contributes commmits to the README.md in each group. Submit a pull request to the master branch.

Exercise 2: Creating a gold-standard peak set

Write a function that will require peaks to be present in all replicates per TF. Then iterate over all TFs to create peak sets (GRanges objects) that consist of peaks present in all replicates. Write these peaks to one bed file per TF. Copy these peak files to your class directory /Shares/rinn_class/students/<identikey>. We will be reviewing these files on Monday.

Bonus: Write the function such that the number or percent of replicates required is adjustable.

Considerations: Do you want to merge the peak regions? What is the minimum overlap required? How do the results change when this parameter is varied? How many peaks do we lose by doing this approach?

  • going remote as of Friday March 13.
  • Browsing / spot checking consensus peaks in UCSC (session example "consensus_peaks" in UCSC class session list -- Randomly sampled peaks to check out)

UCSC Resources

  • Peak files for each replicate /Shares/rinn_class/data/ucsc_peaks
  • Consensus peak files /Shares/rinn_class/data/k562_chip/analysis/00_consensus_peaks/ucsc_peak_tracks
  • BigWig file link bigWigs

Profile plot RMarkdown output

Class Exercise: 
Look through the profile plots and remake the plots for your 
favorite TF(s) or all of them.

Find two TFs that have different profile plots. 
Find examples of their consensus peaks with bigWig replicates. 
Present interesting aspects about these TFs from literature (NCBI Gene).
Prepare a presentation per group for Friday.

Slack a ppt or keynote to the general channel before class on Friday.

March 16: Remote class orientation (pre cooked .Rmd)

  • Welcome to Zoooooom !
  • Break out rooms for groups
  • Slack and zoom / trello
  • Presentations

March 18: Intersect annotation features in GRanges for mRNA, lncRNA and TE

Intersect_excercise.Rmd

Do intersects in class for your "biotype"

Class exercise (presentations Friday March 20): Find some interesting examples for your group (5 TFs).

Is there a trend with number of peaks and number of overlaps? How could we "shuffle" to understand if this is significant or happens by chance?

Which ones bind your biotype more than others? What is the most unique DNA binding protein for your group?  

##### March 30: Functions, Features and Fun and git organization for analyses

[Paper to read on mRNA and lncRNA promoter properties](https://www.dropbox.com/s/ux3e7xzl9lsflxz/Mele_et_al.pdf?dl=0)

[Second paper to read on promoter properties](https://www.dropbox.com/s/m4832lsedpt826f/Genome%20Res.-2019-Mattioli-gr.242222.118.pdf?dl=0)

##### April 1: No class : APRIL-FOOLs <- Present interesting promoters that have many DNA binding protein events. 
Clustering

##### April 3: Findings from clustering & paper figure presentations
- Present a figure and associated analysis/findings from each paper (Mele et al. & Mattioli et al.)
- Present findings from your clustering exercise:
    - What groupings make sense? 
    - Are there different clustering groupings when you compare all promoters vs your subset?
##### April 6: Expression comparisons -- recapitulate Mele et al finding that more TFs higher expression.
- Class excercise: are there promoters with lots of TFs that are not expressed? At what point would we say there are a lot of TFs bound :) ? Hint: histogram of co-occurrence matrix.

##### April 8: 
- Walk through results (ghosts)
- Other questions to analyze? Distribute analyses. 
- Prepare questions for Michael Snyder

##### April 10: Michael Snyder guest lecture/interview

##### April 13: Permutation test class exercise I
[Intuitive Statistics Lecture](https://www.dropbox.com/s/95iq9veg5e7qp1y/Permuation_false_discovery.pdf?dl=0)

Groups will work on `permutation_test_class.Rmd`

##### April 15: Permutation test class exercise II

##### April 17: Design manuscript outline 

##### April 20: Work through making figures -- clean code and figures in .Rmd

##### April 22: Work through making figures -- clean code and figures in .Rmd

##### April 24: Figure from each group due in .Rmd

##### April 27: Finalize Figures and git

#### April 29: Sweep up the workshop !

Can we use data standards and reproducibility to write a paper on our findings? Let's set up the Paper-Pository on Git


+++++++++++++++++++++++