- Project 05: Cancer Methylome Analysis
Supervisors:
- Matthias Schlesner (m.schlesner@dkfz-heidelberg.de)
- Christian Heyer (c.heyer@dkfz-heidelberg.de)
Tutor:
- Valentina Giunchiglia (Giunchiglia@stud.uni-heidelberg.de)
DNA methylation is a key mechanism regulating transcriptional processes. In regulatory regions such as promoters in particular, DNA methylation is known to be a signal of transcriptional repression. In developmental processes such as hematopoiesis, DNA methylation is paramount in deciding cell fate. Both array-based and sequencing-based methods can provide a map of DNA methylation, with varying degrees of coverage. One term you will commonly find in publications comparing DNA methylation between sample groups is differentially methylated regions (DMRs).
Here, you are tasked with identifying DMRs between groups of samples and interpreting the sequence context of these regions. Reducing the dimensionality of the methylation data whilst enriching for functionally relevant regions is vital, especially when working with constrained computational resources.
- Load the data into R and inspect it. You will also be provided with a .csv file containing sample information.
- Reorganize the data into a more legible and manageable format.
- Report quality control measures. Decide which features you want to display and how you want to display these.
- Before the samples can be analyzed, they need to be normalized. What could cause problems in the downstream analysis or disturb it? How strict do you want to control for quality without losing too much information?
- Select the features you want to keep in the analysis. Choose a metric to reduce the number of features to a more manageable amount.
- Use dimensionality reduction techniques to extract the largest sources of variation in the data.
- Identify regions with differentially methylated loci between the two sample groups. How should this analysis be done? Do you want to run it on all loci or filter out certain regions first (this relates to the normalization step)?
- Test for differential methylation between the sample groups.
- Use logistic regression to find good predictors that distinguish healthy from diseased sample groups.
- Annotate your results and interpret their sequence context. Which genes and regulatory features are at your differentially methylated loci? Are the differentially methylated regions in gene bodies or promoter regions? Research what impact your candidate genes could have.
- Document your results using R Markdown to provide explanations of your code and the reasoning behind it. Add visualizations and their explanations. Remember that at the end of this project each member will be evaluated on the basis of their R Markdown code and needs to be able to explain any aspect of the project.
- Evaluate multiple approaches for defining differential methylation.
- Compare different dimensionality reduction techniques or testing methodologies.
- Advanced visualization using R Markdown (interactive plots using the Plotly or Shiny libraries).
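The differential methylation step in the task list above could be approached in many ways. The following is only a minimal sketch of one of them, assuming a per-region Welch t-test on M-values (log2 of beta / (1 - beta)) followed by Benjamini-Hochberg correction. The data are simulated, and all object names, group labels and thresholds are hypothetical; they are not taken from the project data.

```r
# Simulated example data (assumption: regions in rows, samples in columns).
set.seed(42)
n_regions <- 200
n_per_group <- 8
group <- factor(rep(c("healthy", "disease"), each = n_per_group),
                levels = c("healthy", "disease"))
is_disease <- group == "disease"

beta <- matrix(rbeta(n_regions * 2 * n_per_group, 5, 5), nrow = n_regions)
# The first 20 regions are simulated as hypermethylated in the disease group.
beta[1:20, is_disease] <- rbeta(20 * n_per_group, 18, 2)

# Transform beta values to M-values; clamp to avoid log2(0) and division by 0.
eps <- 1e-3
beta_c <- pmin(pmax(beta, eps), 1 - eps)
m <- log2(beta_c / (1 - beta_c))

# One Welch t-test per region, comparing the two sample groups.
p_values <- apply(m, 1, function(x) t.test(x ~ group)$p.value)

# Effect size on the beta scale, which is easier to interpret than M-values.
delta_beta <- rowMeans(beta[, is_disease]) - rowMeans(beta[, !is_disease])

# Correct for multiple testing (Benjamini-Hochberg false discovery rate).
p_adj <- p.adjust(p_values, method = "BH")

results <- data.frame(region = seq_len(n_regions), delta_beta, p_values, p_adj)
results <- results[order(results$p_adj), ]
print(head(results))
print(sum(results$p_adj < 0.05))  # number of regions called differentially methylated
```

Testing thousands of regions without a correction such as `p.adjust` would produce many false positives, so the adjusted p-value together with the size of `delta_beta` is usually a more useful basis for calling a region differentially methylated than the raw p-value alone.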
The Blueprint epigenome project provides resources on haematopoietic epigenomes from both healthy and diseased samples. Each sample provided here has been processed with Whole Genome Bisulfite Sequencing (WGBS). You will be given a matrix of methylation status of sites for each sample.
There are 195 samples in this cohort. We will provide 5 different comparisons on subgroups of these samples for you to perform. Each group will work on a part of the dataset.
- ALL vs. B-cells Download Link
- AML vs. granulocytes (Bone Marrow) Download Link
- AML vs. monocytes (Blood) Download Link
- CLL vs. B-cells Download Link
- Mantle cell lymphoma vs. B-cells (Blood) Download Link
Each download link contains a file archive with a sample annotation .csv and an RDS.gz object to load into R with readRDS.
This object is a list with 4 entries: tiling (5 kb windows), genes, promoters and cpgislands. Each matrix contains the corresponding genomic positions, GC content, methylation beta values and coverage data. Start off by taking one of these datasets and getting used to working with R.
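As a first orientation, loading and inspecting such an object might look like the sketch below. Because the actual file, entry and column names are not given here, the example builds a small mock object and writes it to a temporary file; every name in it (`tiling`, `chr`, `sample1_beta`, the file extension and so on) is an assumption to be replaced by what `names()` and `str()` show on the real data.

```r
# Build a mock object standing in for the downloaded RDS file.
# All entry and column names are hypothetical.
set.seed(1)
make_entry <- function(n_regions, n_samples = 4) {
  beta <- matrix(runif(n_regions * n_samples), nrow = n_regions,
                 dimnames = list(NULL,
                                 paste0("sample", seq_len(n_samples), "_beta")))
  data.frame(
    chr   = "chr1",
    start = seq(1, by = 5000, length.out = n_regions),
    end   = seq(5000, by = 5000, length.out = n_regions),
    gc    = runif(n_regions),
    beta
  )
}
mock <- list(tiling = make_entry(10), genes = make_entry(6),
             promoters = make_entry(6), cpgislands = make_entry(3))

# saveRDS() gzip-compresses by default; readRDS() reads compressed files directly.
rds_path <- tempfile(fileext = ".RDS.gz")
saveRDS(mock, rds_path)
input_data <- readRDS(rds_path)

# First look at the structure of the list and of one entry.
print(names(input_data))        # which region types are present
print(sapply(input_data, dim))  # rows (regions) and columns per entry
str(input_data$promoters, list.len = 5)
print(head(input_data$promoters[, 1:5]))
```

For the real data, only the `readRDS()` call and the inspection lines are needed, with `rds_path` pointing to the unpacked file from the archive.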
- Analysing and interpreting DNA methylation data
- Statistical and integrative system-level analysis of DNA methylation data
- The International Human Epigenome Consortium: A Blueprint for Scientific Collaboration and Discovery
- DNA Methylation Dynamics of Human Hematopoietic Stem Cell Differentiation
Each group is required to create a project proposal
- summary of literature on this dataset
- questions you want to address
- approximate timetable
These elements are to be present in your data analysis.
- descriptive statistics about the datasets
- graphical representations
- dimension reduction analysis (PCA, clustering or k-means)
- statistical tests (t-test, proportion tests, etc.)
- regression analysis
Clean, organize and preprocess the data for analysis
- Check Coverage at each position and remove low coverage regions (QC)
- Remove regions with zero variability
- Remove regions with a high proportion of missing values
- Consider transforming methylation beta values in an attempt to normalize the variance.
- Visualize some of the basic statistics of the dataset (Data distributions, mean, sd etc.)
- Try to test for differences between the sample groups.
- Reduce the Dimensionality of the dataset. Try to test various methods.
- Attempt to run clustering analysis on this data and visualize (scatterplot)
- Do your clusters coincide with the sample labels?
- Think about which predictions you want to test before you start testing regression between the variable groups.
- For testing the prediction of disease vs. healthy consider using logistic regression instead of linear regression.
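The steps above can be strung together as in the following sketch. It is one possible arrangement under stated assumptions, not a prescribed solution: the data are simulated, the thresholds `min_cov` and `max_na_frac` are arbitrary illustrative values, and `region31` is a hypothetical candidate region chosen only because it was simulated with a moderate group difference.

```r
set.seed(7)
n_regions <- 300
n_per_group <- 15
group <- factor(rep(c("healthy", "disease"), each = n_per_group),
                levels = c("healthy", "disease"))
n_samples <- length(group)
is_disease <- group == "disease"

# Simulated beta values and coverage (regions in rows, samples in columns).
beta <- matrix(rbeta(n_regions * n_samples, 5, 5), nrow = n_regions,
               dimnames = list(paste0("region", seq_len(n_regions)), NULL))
beta[1:30, is_disease] <- rbeta(30 * n_per_group, 18, 2)  # strong group shift
beta[31, is_disease] <- rbeta(n_per_group, 8, 3)          # moderate group shift
beta[271:280, ] <- 0.5                                    # no variability
coverage <- matrix(10 + rpois(n_regions * n_samples, 20), nrow = n_regions)
coverage[281:300, ] <- 2                                  # poorly covered regions

# 1. QC: mask low-coverage values, then drop regions with too many missing
#    values or with zero variability.
min_cov <- 10       # hypothetical threshold
max_na_frac <- 0.2  # hypothetical threshold
beta[coverage < min_cov] <- NA
na_frac <- rowMeans(is.na(beta))
region_sd <- apply(beta, 1, sd, na.rm = TRUE)
keep <- na_frac <= max_na_frac & !is.na(region_sd) & region_sd > 0
beta_f <- beta[keep, ]

# 2. Transform beta values to M-values to stabilize the variance.
eps <- 1e-3
beta_c <- pmin(pmax(beta_f, eps), 1 - eps)
m <- log2(beta_c / (1 - beta_c))
# prcomp() cannot handle NA; here incomplete regions are simply dropped
# (imputation would be an alternative).
m <- m[complete.cases(m), , drop = FALSE]

# 3. PCA on samples (prcomp expects samples in rows), then k-means on PC1/PC2.
pca <- prcomp(t(m), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
km <- kmeans(pca$x[, 1:2], centers = 2, nstart = 20)
print(round(var_explained[1:3], 3))
print(table(cluster = km$cluster, label = group))  # clusters vs. sample labels

# 4. Logistic regression: disease status as a function of one region's M-value.
model_data <- data.frame(disease = as.integer(is_disease),
                         m_value = m["region31", ])
fit <- glm(disease ~ m_value, data = model_data, family = binomial)
print(coef(summary(fit)))
```

A scatterplot of `pca$x[, 1]` against `pca$x[, 2]`, colored by `group` and shaped by `km$cluster`, shows directly whether the clusters follow the sample labels. Note that a predictor which separates the two groups perfectly makes `glm` fail to converge properly, which is why the sketch uses a region with overlapping group distributions rather than the first principal component.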
Here are some random links for R and git you might find useful. You don't need to go through all of these, but check them out if you want to know more about using R and git.
- R for Data Science (Combines R basics and introduction to statistical techniques in R/R Markdown)
- Style Guidelines for R (Try to keep your code in a consistent format. Here is a short style guide you can consider using.)
- Bioconductor (Large R bioinformatics software repository. While you are supposed to do the analysis yourself and not have a package do it for you, there are also many useful packages for working with sequencing data, e.g. for annotating the genome.)
- Long but comprehensive (and a little entertaining) step by step guide: git for userRs
- Git learning resources from Github
- Git "flight rules" (Compendium of git problems and how to solve them)
- Quick setup for using git with RStudio