- January 13: Introduction and overview of course
- January 15: Set up tools and practice examples in Chapters 1-3
- SSH Client:
- FTP Client:
- Text Editor:
- Command line exercises
- January 17: ENCODE data reproducibility and Example datasets
- Overview of Encode / ChIP and Transposons as missing regulatory regions
- Encode Data portal
- Go through each category and get familiar — we will specifically be looking at:
- DNA binding / TF-CHIPseq / K562 / paired-ended
- Exploring metadata in unix
- January 22: Start organizing metadata for data retrieval
- Sorting hat / intro to bash scripts
- Taking notes in Markdown
- Organize input files for our ChIP-seq pipeline
- Jan 24: Let’s go get data!
- We will each go retrieve a ENCODE data file from our sample sheet.
- Jan 27: git & GitHub
- Beyond the samplesheet
- Jan 29: IT lecture on Fiji
- Tour of Fiji data center
- Jan 31: Connect to Fiji
- HOME : Data you really want to keep and back up not intermediate analyses
- Scratch: THe wild west no limits (within reason) here is where we will start doing analysis and set up git etc.
- Folder to submit final files/analysis -- more later
- The precooked class
- Feb 3: Set up Fiji to get ready to run nextflow
- Feb 5: NextFlow / nf-core chipseq
- Feb 7: NextFlow / nf-core chipseq
- Set up design and sample files — folder structures — Run.sh
- GOAL -- Set up your project directory run for 3 TFs.
- Familiarize yourself and take notes on file types
- Feb 10: Regroup, debug and / or rerun pipeline
- Feb 12: Lecture : Transcriptional regulation and ChiPseq analysis pipeline.
- Feb 14: Class Presentations / IGV and UCSC Genome Browsing of your data!
- Feb 17: DNA Binding Proteins from Structures to "Meta-analysis"
- repeat masker
- Gencode annotations
- Feb 19: Exploring the larger dataset.
- Assess reproducibility of tracks
- Feb 21: Introduction to R: Part 1
- Feb 24: Class discussion of questions
- Feb 26: R Part II
- Feb 28: R Part III : R for Genomics (GRanges / rtracklayer)
- March 2: Class on own -- peak plotting exercise
- March 4: Review TE / TF metaplots exercise
- March 6:
- Exercise 1: Practicing with Git
- Exercise 2: Creating a gold-standard peak set
- UCSC Resources
- March 16: Remote class orientation (pre cooked .Rmd)
- March 18: Intersect annotation features in GRanges for mRNA, lncRNA and TE
Clone this wiki locally
Group 1 (mRNA): Kristen, Shelby, Ben, Soroya, Arpan Group 2 (lncRNA): Savannah, Michael, Thao, Dan, Graycen Group 3 (TE): Alison, Devin, Tom, Kevin, Guilia
Read Chapters 1-3 for Monday Jan 22
Bioinformatics Data Skills by Vince Buffalo
First we will install the following tools, and after that work through some command line exercises.
- iterm will work for this!
- preferred: putty
- preferred: FileZilla
Overview of Encode / ChIP and Transposons as missing regulatory regions
What is a promoter and transcription factor?
select samples, click columns add control, click on table and then download .tsv
Use ls, head, tail, cat, awk / grep to explore this metadata table.
Taking notes in Markdown
In addition to the sample-level metadata you retrieved last time, download this table of file-level metadata from ENCODE
As a group, choose one transcription factors that you would like to analyze.
We're going to subset and organize our metadata file to include just those files that you would like to download and the columns that will be useful to you using
We'll also make a file which contains the URLs to retrieve the fastq files from Encode.
Read chapters 4-5
SFTP, SSH, SCP
wget -i file.txt
- Create a git repository and commit some changes
- Create one GitHub repository per group and commit your sample sheet script
We're going to create a file that matches the ChIP samples to their control samples. The format of this file is specified by the pipeline that we will be running.
THESE ARE THE REQUIRED COLUMNS FOR THE DESIGN FILE
group,replicate,fastq_1,fastq_2,antibody,control (**** fastq1 and fastq2 URLS ****)
Make a design file by Friday January 31 for your TF
- Hint: this maybe easiest in excel. Look up file accession number for YTF. Then look for "paired with" you will see a new File accession number -- that needs to be in your control column.
- If your "paired with" identifier is not in the sample sheet (Jan 22 lecture notes) -- then go to encode portal and find it :)
- Advanced exercise : Script this in bash (going to need a few greps & joins :)
- Meet at Space Sciences
Please take notes on the key rules and regulations — to do and not to do’s !
- Layout of class directories -- where will you be doing work?
- Get a local git repo -- set up ssh key for fiji-GitHub
- Moving files to and from fiji
Scratch: THe wild west no limits (within reason) here is where we will start doing analysis and set up git etc.
Design File presentations
- SCREEN (screen -list / ctr-d + a/ screen -r)
- Get fastq's for your TF
- SLURM review (interactive & batch jobs)
- md5sum -c
- Go over class design file
exchange design files to have a total of 3 TFs (e.g., collaborate with another group) `cp` design files. What happened? How can we solve this?
Discuss and catch up on what we have learned about unix and commands etc
Nextflow paper: Nextflow enables reproducible computational workflows
Read basic documentation and install nextflow in your path!
- fastq directory w/ fastqs downloaded
- checked by John or Michael
- run pipeline
squeue -u X000
Read next flow documentation and nextflow.out
Homework google the programs used in nextflow.out Fastqc TimaGalore BWAMem SortBAM MergeBAM BigWig MACSCallPeak Peak QC
Lets cover some of the basic statistics being used in the NF-Core Chip-Seq pipeline. Probability Distributions: Poisson, Binomial, negative binomial Scan Statistics
Recomended reading: Biometry Chapter 4
Class exercise: each group presents a statistical principle and how it is used in NF-CORE ChIPseq
Class UCSC Account: MyData > Sign in: BCHM_5631 Pswd : will tell y'all in class
#Genome file /Shares/rinn_class/data/genomes/human/gencode/v32/GRCh38.p13.genome.fa \ #Annotation file /Shares/rinn_class/data/genomes/human/gencode/v32/gencode.v32.annotation.gtf \
- Your tracks
- Encode tracks (go to portal download bigwigs)
- Pre-baked tracks
Install x2go to use IGV on fiji.
Class exercise: download and view: BigWig and BroadPeak Files from this run versus on of yours Are the results similar?
First: a quick tour of UCSC table browser
- How to load a RMSK track into IGV or R
Try loading a peak file into R
- they are just tab seperated tables and can be loaded with read.table(sep = "\t)
Each group presents three questions they would like to address based on the TE-DNA, TE-RBP, E-CLIP study designs. Each person 3 questions.
- Introduce yourself and your research
- Present the question that you would like to pursue with the class dataset
- Discuss how you'd like to use lessons from this class in your own research
- Continuation of R data types
- Introduction to ggplot2 and tidyverse
- Exercise -- plotting gene profiles
- Git from R
Good R tutorial: https://www.youtube.com/watch?v=fDRa82lxzaU
Install the following packages in fiji-viz/RStudio
install.packages("BiocManager") BiocManager::install("GenomicRanges") BiocManager::install("rtracklayer")
- Review your solutions to the for loop/plotting exercise
- Introduce GRanges and findOverlaps
- Read in peak files, repeatMasker files, and find overlaps
Exercise: Make some plots to characterize the overlap of ChIP-seq peaks with TEs.
Can be as simple as plotting the number of overlaps of one particular TF with a class of TEs - OR - since you have data for all the TFs, you can plot each protein's peaks and where they fall in relation to the center of the repeat -- i.e. a metaplot heatmap or profile plot.
If you get stuck, ask your group members for help and if you're still stuck, ask in the general slack channel. We will go over your plots and code on Wednesday.
- 3 Groups of 5
Group 1 (mRNA): Kristen, Shelby, Ben, Soroya, Arpan Group 2 (lncRNA): Savannah, Michael, Tao, Dan, Graycen Group 3 (TE): Alison, Devon, Tom, Kevin, Guilia
- Granges Gencode
- Granges consensus.peak.file
- Intersect Granges
- Go over TE intersection plots and problems
- Fix RMarkdown with Jon
- Introduction to RMarkdown and functions
- Git structure -- how teams will be committing to class repository
- Discussion: Clustering
Each person contributes commmits to the README.md in each group. Submit a pull request to the master branch.
Write a function that will require peaks to be present in all replicates per TF. Then iterate over all TFs to create peak sets (GRanges objects) that consist of peaks present in all replicates. Write these peaks to one bed file per TF. Copy these peak files to your class directory
/Shares/rinn_class/students/<identikey>. We will be reviewing these files on Monday.
Bonus: Write the function such that the number or percent of replicates required is adjustable.
Considerations: Do you want to merge the peak regions? What is the minimum overlap required? How do the results change when this parameter is varied? How many peaks do we lose by doing this approach?
- going remote as of Friday March 13.
- Browsing / spot checking consensus peaks in UCSC (session example "consensus_peaks" in UCSC class session list -- Randomly sampled peaks to check out)
- Peak files for each replicate
- Consensus peak files
- BigWig file link bigWigs
Class Exercise: Look through the profile plots and remake the plots for your favorite TF(s) or all of them. Find two TFs that have different profile plots. Find examples of their consensus peaks with bigWig replicates. Present interesting aspects about these TFs from literature (NCBI Gene). Prepare a presentation per group for Friday. Slack a ppt or keynote to the general channel before class on Friday.
- Welcome to Zoooooom !
- Break out rooms for groups
- Slack and zoom / trello
Do intersects in class for your "biotype"
Class exercise (presentations Friday March 20): Find some interesting examples for your group (5 TFs). Is there a trend with number of peaks and number of overlaps? How could we "shuffle" to understand if this is significant or happens by chance? Which ones bind your biotype more than others? What is the most unique DNA binding protein for your group? ##### March 30: Functions, Features and Fun and git organization for analyses [Paper to read on mRNA and lncRNA promoter properties](https://www.dropbox.com/s/ux3e7xzl9lsflxz/Mele_et_al.pdf?dl=0) [Second paper to read on promoter properties](https://www.dropbox.com/s/m4832lsedpt826f/Genome%20Res.-2019-Mattioli-gr.242222.118.pdf?dl=0) ##### April 1: No class : APRIL-FOOLs <- Present interesting promoters that have many DNA binding protein events. Clustering ##### April 3: Findings from clustering & paper figure presentations - Present a figure and associated analysis/findings from each paper (Mele et al. & Mattioli et al.) - Present findings from your clustering exercise: - What groupings make sense? - Are there different clustering groupings when you compare all promoters vs your subset? ##### April 6: Expression comparisons -- recapitulate Mele et al finding that more TFs higher expression. - Class excercise: are there promoters with lots of TFs that are not expressed? At what point would we say there are a lot of TFs bound :) ? Hint: histogram of co-occurrence matrix. ##### April 8: - Walk through results (ghosts) - Other questions to analyze? Distribute analyses. - Prepare questions for Michael Snyder ##### April 10: Michael Snyder guest lecture/interview ##### April 13: Permutation test class exercise I [Intuitive Statistics Lecture](https://www.dropbox.com/s/95iq9veg5e7qp1y/Permuation_false_discovery.pdf?dl=0) Groups will work on `permutation_test_class.Rmd` ##### April 15: Permutation test class exercise II ##### April 17: Design manuscript outline ##### April 20: Work through making figures -- clean code and figures in .Rmd ##### April 22: Work through making figures -- clean code and figures in .Rmd ##### April 24: Figure from each group due in .Rmd ##### April 27: Finalize Figures and git #### April 29: Sweep up the workshop !
Can we use data standards and reproducibility to write a paper on our findings? Let's set up the Paper-Pository on Git