Project 05: Cancer Methylome Analysis

Supervisors:

Matthias Schlesner (m.schlesner@dkfz-heidelberg.de)
Christian Heyer (c.heyer@dkfz-heidelberg.de)

Tutor:

Valentina Giunchiglia (Giunchiglia@stud.uni-heidelberg.de)

Introduction

DNA methylation is a key mechanism regulating transcriptional processes. In regulatory regions, especially promoters, DNA methylation is known to be a signal of transcriptional repression. In developmental processes such as hematopoiesis, DNA methylation is paramount in deciding cell fate. Both array-based and sequencing-based methods can provide a map of DNA methylation, with varying degrees of coverage. One term you will commonly find in publications comparing DNA methylation between sample groups is differentially methylated regions (DMRs).

Here, you are tasked with identifying DMRs between groups of samples and interpreting the sequence context of these regions. Reducing the dimensionality of the methylation data whilst enriching for functionally relevant regions is vital, especially when working with constrained computational resources.

Objective

Load the data into R and inspect it. You will also be provided with a .csv file containing sample information.
Reorganize the data into a more legible and manageable format.
Report quality control measures. Decide which features you want to display and how you want to display them.
Before the samples can be analyzed, they need to be normalized. What could cause problems in the downstream analysis or disturb it? How strictly do you want to control for quality without losing too much information?
Filter the features you want to keep in the analysis. Choose a metric to reduce the number of features to a more manageable amount.
Use dimensionality reduction techniques to extract the largest sources of variation in the data.
Identify regions with differentially methylated loci between the two sample groups. How should this analysis be done? Do you want to run it on all loci or filter out certain regions first (related to the normalization)?
Test for differential methylation between the sample groups.
Use logistic regression to find good predictors that distinguish the healthy and diseased sample groups.
Annotate your results and interpret their sequence context. Which genes and regulatory features are at your differentially methylated loci? Are the differentially methylated regions in the gene bodies or promoter regions? Research what impact your candidate genes could have.
Document your results using R Markdown to provide explanations of your code and the reasoning behind it. Add visualizations and their explanations. Remember that at the end of this project each member will be evaluated on the basis of their R Markdown code and needs to be able to explain any aspect of the project.

Additional Objectives

Evaluate multiple approaches for defining differential methylation.
Compare different dimensionality reduction techniques or testing methodologies.
Create advanced visualizations using R Markdown (interactive plots using the Plotly or Shiny libraries).

Dataset

The Blueprint epigenome project provides resources on haematopoietic epigenomes from both healthy and diseased samples. Each sample provided here has been processed with Whole Genome Bisulfite Sequencing (WGBS). You will be given a matrix of methylation status of sites for each sample.

This cohort contains 195 samples. We will provide five different comparisons on subgroups of these samples for you to perform. Each group will work on a part of the dataset.

ALL vs. B-cells Download Link
AML vs. granulocytes (Bone Marrow) Download Link
AML vs. monocytes (Blood) Download Link
CLL vs. B-cells Download Link
Mantle cell lymphoma vs. B-cells (Blood) Download Link

Each download link contains a file archive with a sample annotation .csv and an RDS.gz object to load into R with readRDS. This object is a list with four entries: Tiling (5 kb windows), genes, promoters, and cpgislands. Each matrix holds the corresponding genomic positions, GC content, beta-value methylation, and coverage data. Start off by taking one of these datasets and getting used to working with R.
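As a minimal sketch of the loading step, the following builds a small stand-in object, writes it with `saveRDS`, and reads it back with `readRDS`. The file name, the list entry names, and the column names are assumptions made for illustration only; check the real ones with `names()` and `str()` after loading your own archive.

```r
# Toy stand-in for the provided object: a named list of tables.
# Entry and column names are hypothetical, not the real ones.
set.seed(1)
toy_table <- function(n_regions, n_samples = 4) {
  beta <- matrix(runif(n_regions * n_samples), nrow = n_regions,
                 dimnames = list(NULL, paste0("sample", seq_len(n_samples))))
  data.frame(chr = "chr1",
             start = seq(1, by = 5000, length.out = n_regions),
             end = seq(5000, by = 5000, length.out = n_regions),
             beta)
}
toy_data <- list(tiling = toy_table(10), genes = toy_table(5),
                 promoters = toy_table(5), cpgislands = toy_table(3))

# Write and re-read the object; readRDS() reads gzip-compressed files directly.
rds_path <- tempfile(fileext = ".RDS.gz")
saveRDS(toy_data, rds_path, compress = "gzip")
methylation <- readRDS(rds_path)

# First look at the structure before any analysis.
names(methylation)
sapply(methylation, dim)
str(methylation$promoters, max.level = 1)
```

With the real data, replace `rds_path` with the path to the downloaded file and load the sample annotation with `read.csv()`.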

Literature

Reviews on DNA Methylation data

Reviews on Hematopoiesis

Literature on the Blueprint epigenome consortium

How to structure your project

Project proposal

Each group is required to create a project proposal that includes:

summary of literature on this dataset
questions you want to address
approximate timetable

Project

These elements must be present in your project.

descriptive statistics about the datasets
graphical representations
dimension reduction analysis (PCA, clustering or k-means)
statistical tests (t-test, proportion tests etc)
regression analysis
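One possible way to approach the statistical testing element is a per-region two-sample test followed by multiple-testing correction. The sketch below uses simulated beta values, a Welch t-test, and Benjamini-Hochberg adjustment; the group sizes, the simulated effect, and all variable names are assumptions, and other tests (for example a Wilcoxon test) may suit your data better.

```r
set.seed(42)
n_regions <- 200
group <- factor(rep(c("healthy", "disease"), each = 6),
                levels = c("healthy", "disease"))

# Simulated beta values; the first 20 regions are shifted in the disease group.
beta <- matrix(rbeta(n_regions * length(group), 5, 5), nrow = n_regions)
beta[1:20, group == "disease"] <- rbeta(20 * 6, 9, 2)

# Welch t-test per region (row), comparing the two groups.
p_values <- apply(beta, 1, function(x) t.test(x ~ group)$p.value)

# Correct for multiple testing with Benjamini-Hochberg.
p_adjusted <- p.adjust(p_values, method = "BH")

# Mean methylation difference as a simple effect size.
delta_beta <- rowMeans(beta[, group == "disease"]) -
  rowMeans(beta[, group == "healthy"])

results <- data.frame(region = seq_len(n_regions), delta_beta,
                      p_values, p_adjusted)
head(results[order(results$p_adjusted), ])
```

Reporting an effect size such as `delta_beta` next to the adjusted p-value helps separate statistically significant regions from those with a biologically meaningful difference.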

Data preprocessing

Clean, organize and preprocess the data for analysis

Check coverage at each position and remove low-coverage regions (QC).
Remove regions with zero variability.
Remove regions with a high proportion of missing values.
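The three filters above could be sketched as follows on a simulated region-by-sample matrix. The thresholds `min_coverage` and `max_missing_fraction` are arbitrary assumptions, not recommended values; choosing them is part of the task.

```r
set.seed(7)
n_regions <- 100
n_samples <- 8
beta <- matrix(runif(n_regions * n_samples), nrow = n_regions)
coverage <- matrix(rpois(n_regions * n_samples, lambda = 20), nrow = n_regions)

# Inject typical problems: regions with many missing values and a constant region.
beta[1:5, 1:6] <- NA
beta[6, ] <- 0

min_coverage <- 10           # assumed threshold
max_missing_fraction <- 0.2  # assumed threshold

# Treat low-coverage measurements as unreliable by masking them.
beta[coverage < min_coverage] <- NA

# Per-region missingness and variability.
missing_fraction <- rowMeans(is.na(beta))
region_sd <- apply(beta, 1, sd, na.rm = TRUE)

# Keep regions with few missing values and non-zero variability.
keep <- missing_fraction <= max_missing_fraction &
  !is.na(region_sd) & region_sd > 0
beta_filtered <- beta[keep, , drop = FALSE]

dim(beta)
dim(beta_filtered)
```

Masking single low-coverage measurements before counting missing values means that a region is only dropped when too many of its samples are unreliable.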

Normalize and Visualize

Consider transforming methylation beta values in an attempt to normalize the variance.
Visualize some of the basic statistics of the dataset (Data distributions, mean, sd etc.)
Try to test for differences between the sample groups.
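A common transformation for beta values is the logit-based M value, M = log2(beta / (1 - beta)), which spreads out values near 0 and 1. The sketch below is one way to write it; the `offset` used to avoid infinite values at exactly 0 or 1 is an assumed choice, and the function names are hypothetical.

```r
# Convert beta values (0..1) to M values; clamp to avoid -Inf/Inf at 0 and 1.
beta_to_m <- function(beta, offset = 0.001) {
  beta <- pmin(pmax(beta, offset), 1 - offset)
  log2(beta / (1 - beta))
}

# Inverse transformation, useful for reporting results on the beta scale.
m_to_beta <- function(m) 2^m / (2^m + 1)

beta <- c(0, 0.1, 0.5, 0.9, 1)
m <- beta_to_m(beta)
m

# Basic statistics before and after the transformation.
c(mean = mean(beta), sd = sd(beta))
c(mean = mean(m), sd = sd(m))
```

Both functions work element-wise, so they can be applied directly to a whole region-by-sample matrix; missing values stay missing.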

Data reduction

Reduce the dimensionality of the dataset. Try to test various methods.
Attempt to run a clustering analysis on this data and visualize it (scatterplot).
Do your clusters coincide with the sample labels?
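The steps above could look like the following sketch, which runs PCA with `prcomp`, clusters the samples on the first two components with `kmeans`, and cross-tabulates clusters against labels. The data are simulated and the choice of two components and two clusters is an assumption for illustration.

```r
set.seed(123)
n_regions <- 300
labels <- rep(c("healthy", "disease"), each = 10)

# Simulated M values; 50 regions differ between the groups.
m_values <- matrix(rnorm(n_regions * length(labels)), nrow = n_regions)
m_values[1:50, labels == "disease"] <- m_values[1:50, labels == "disease"] + 3

# prcomp() expects samples in rows, so transpose the region-by-sample matrix.
pca <- prcomp(t(m_values), center = TRUE, scale. = FALSE)
variance_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(variance_explained[1:3], 3)

# k-means on the first two principal components.
clusters <- kmeans(pca$x[, 1:2], centers = 2, nstart = 25)$cluster

# Do the clusters coincide with the sample labels?
cluster_vs_label <- table(cluster = clusters, label = labels)
cluster_vs_label
```

A scatterplot such as `plot(pca$x[, 1:2], col = clusters)` shows the same comparison visually. Cluster numbers are arbitrary, so a perfect match can appear on either diagonal of the table.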

Regression

Think about which predictions you want to test before you start testing regression between the variable groups.
For testing the prediction of disease vs. healthy consider using logistic regression instead of linear regression.
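As a minimal sketch of logistic regression for disease vs. healthy, the following fits `glm` with `family = binomial` on simulated data. The single predictor `m_region` (the M value of one hypothetical candidate region), the sample size, and the 0.5 classification cutoff are all assumptions.

```r
set.seed(2024)
n_samples <- 60
status <- rep(c(0, 1), each = n_samples / 2)  # 0 = healthy, 1 = disease

# Hypothetical predictor: M value of one candidate region per sample.
m_region <- rnorm(n_samples, mean = ifelse(status == 1, 1, -1), sd = 1.5)
samples <- data.frame(status, m_region)

# Logistic regression models the log-odds of disease as a linear function.
fit <- glm(status ~ m_region, data = samples, family = binomial)
summary(fit)$coefficients

# Predicted disease probabilities and a simple classification.
predicted_prob <- predict(fit, type = "response")
predicted_class <- as.integer(predicted_prob > 0.5)
accuracy <- mean(predicted_class == samples$status)
accuracy
```

Accuracy computed on the same samples used for fitting is optimistic; holding out samples or using cross-validation gives a fairer estimate, which matters with small cohorts.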

General Resources for Programming

Here are some random links for R and git you might find useful. You don't need to go through all of these, but check them out if you want to know more about using R and git.

General resources for R

R for Data Science (Combines R basics and introduction to statistical techniques in R/R Markdown)
Style Guidelines for R (Try to keep your code in a consistent format. Here is a short style guide you can consider using.)
Bioconductor (Large R bioinformatics software repository. While you are supposed to do the analysis yourself and not have a package do it for you, there are also many useful packages for working with sequencing data, e.g. for annotating the genome.)

General resources for Git

Long but comprehensive (and a little entertaining) step by step guide: git for userRs
Git learning resources from Github
Git "flight rules" (Compendium of git problems and how to solve them)
Quick setup for using git with RStudio
