MethylationToActivity: a deep-learning framework that reveals promoter activity landscapes from DNA methylomes in individual tumors
MethylationToActivity (M2A) is a machine learning framework that uses convolutional neural networks (CNNs) to infer histone modification (HM) enrichment from whole-genome bisulfite sequencing (WGBS) data. M2A currently supports prediction of H3K27ac and H3K4me3 enrichment from WGBS M-values supplied as a tab-delimited text file. Optionally, M2A also supports transfer learning for users who have matching H3K27ac or H3K4me3 ChIP-seq data, with appropriate controls, in addition to WGBS data.
Process | Description |
---|---|
1_ResponseVariable | Generate histone enrichment for each unique promoter region (transfer learning only) |
2_MethylationFeatures | Process WGBS features for model input |
3_CombineInput | Scale and recombine features and, for transfer learning, the calculated HM values |
4_TransferLearning | Train fully-connected layers of a particular model for increased performance in your domain of interest (Optional) |
5_RunModel | Using pre-generated input, get HM predictions for each unique promoter region |
Python 3.6.5 or greater:
- pyBigWig v0.3.13
- numpy v1.17.1
- pandas v0.25.1
- pandarallel v1.4.2
- scikit-learn v0.20.2
- h5py v2.9.0
- keras v2.2.4
- tensorflow v1.10.1
- scipy v1.3.1
- matplotlib v3.3.0
- cwltool v1.0
- psutil v5.6.1
Clone M2A from GitHub:
git clone https://github.com/chenlab-sj/M2A.git
M2A requires five inputs, defined in a YAML file as CWL inputs, e.g., inputs.yml:
chipBigwig:
  class: File
  path: sample.bw
inputBigwig:
  class: File
  path: input.bw
curated:
  class: File
  path: sites.txt
promoterDefinitions:
  class: File
  path: promoters.txt
model:
  class: File
  path: model.h5
Name | Description |
---|---|
Sample HM bigwig file (only if using M2A with Transfer) | HM ChIP-seq experiment bigwig track. |
Sample HM control (Input) bigwig (only if using M2A with Transfer) | ChIP-seq Experiment control (Input) bigwig track. |
WGBS data file | M-values by chromosome and position (non-standard format, see below). |
Promoter region definition file (provided, or user defined) | File describing promoter regions to be predicted. (non-standard format, see below) |
Model weights (provided, or user defined from transfer) | HDF5 model weights for either H3K27ac or H3K4me3 prediction |
A tab delimited file containing the unique promoter-regions for either:
- hg19-based data: 2_Promoter_Definitions_hg19.txt, or
- GRCh38-based data: 2_Promoter_Definitions_GRCh38.txt
Column | Description |
---|---|
EnsmblID_T | Ensembl transcript ID (unique) |
EnsmblID_G | Ensembl gene ID (not unique) |
Gene | Human-readable gene name (abbreviated, not unique) |
Strand | +, - |
Chr | chr1, chr2, ... chr22, etc. |
Start | Beginning of transcript definition |
End | End of transcript definition |
RStart | TSS - 1000bp |
REnd | TSS + 1000bp |
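The RStart and REnd columns can be derived from the other columns for a user-defined promoter file. The following is a minimal sketch, not M2A's own code: `promoter_window` is a hypothetical helper, and it assumes that the TSS is `Start` for transcripts on the + strand and `End` for transcripts on the - strand, and that a window running past the start of the chromosome is clamped to 0.

```python
def promoter_window(strand, start, end, flank=1000):
    """Return (RStart, REnd) = (TSS - flank, TSS + flank).

    Assumption: the TSS is `start` on the '+' strand and `end` on the '-' strand.
    """
    if strand == "+":
        tss = start
    elif strand == "-":
        tss = end
    else:
        raise ValueError("strand must be '+' or '-', got %r" % (strand,))
    # Clamp at 0 so the window never has a negative coordinate (assumption).
    return max(tss - flank, 0), tss + flank


# Hypothetical transcript coordinates, for illustration only.
print(promoter_window("+", 5000, 9000))
print(promoter_window("-", 5000, 9000))
```

Compare the result against a few rows of the provided promoter definition files before relying on it, since the coordinate convention used there may differ from the one assumed here.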
A BED-like file of genomic positions with corresponding M-values, tab delimited:
Column | Description |
---|---|
chrom | chromosome ID, e.g. 1,2,3 ...22 |
pos | position of 5' cytosine of a CpG on the positive strand |
mval | M-value of the CpG, typically calculated as M-value = log2(Beta / (1 - Beta)) |
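If the WGBS pipeline reports beta values (the methylated fraction per CpG), they can be converted to M-values before writing this file. The sketch below is illustrative only: `beta_to_mvalue` and `format_row` are hypothetical helpers, the clamping constant `eps` is an assumption used to keep the logarithm finite at Beta = 0 or 1, and the four-decimal output precision is arbitrary.

```python
import math


def beta_to_mvalue(beta, eps=1e-6):
    """Convert a beta value in [0, 1] to an M-value, log2(beta / (1 - beta))."""
    # Clamp away from 0 and 1 so the logarithm stays finite (eps is an assumption).
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def format_row(chrom, pos, beta):
    """Build one tab-delimited row with the columns chrom, pos, mval."""
    return "%s\t%d\t%.4f" % (chrom, pos, beta_to_mvalue(beta))


# Hypothetical CpG position, for illustration only.
print(format_row("1", 10469, 0.5))
```

A beta value of 0.5 maps to an M-value of 0, and values above or below 0.5 map to positive or negative M-values respectively.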
M2A uses CWL to describe its workflow. To run an example workflow, update sample_data/input_data/inputs.yml with the path to a promoter definitions file.
Then run the following:
$ mkdir results
$ cwltool --outdir results cwl/m2a.cwl sample_data/input_data/inputs.yml
M2A without transfer learning is provided as the CWL workflow cwl/m2a_without_transfer_learning.cwl. It requires the same inputs as the transfer-learning pipeline, except for the bigwig files.
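Because a CWL job file may be written as JSON as well as YAML, the two input variants can be sketched with a small Python helper. This is an illustration rather than part of M2A: `file_entry` and `build_job` are hypothetical names, the paths are the placeholders from the example inputs.yml, and the sketch assumes that the workflow without transfer learning uses the same input keys minus the two bigwig entries.

```python
import json


def file_entry(path):
    """Describe a path as a CWL File object."""
    return {"class": "File", "path": path}


def build_job(curated, promoter_definitions, model,
              chip_bigwig=None, input_bigwig=None):
    """Build a CWL job object for M2A (hypothetical helper, not part of M2A).

    The two bigwig tracks are added only when both are given, which
    corresponds to running with transfer learning.
    """
    if (chip_bigwig is None) != (input_bigwig is None):
        raise ValueError("provide both bigwig files or neither")
    job = {
        "curated": file_entry(curated),
        "promoterDefinitions": file_entry(promoter_definitions),
        "model": file_entry(model),
    }
    if chip_bigwig is not None:
        job["chipBigwig"] = file_entry(chip_bigwig)
        job["inputBigwig"] = file_entry(input_bigwig)
    return job


# Placeholder paths; replace them with real files before running cwltool.
without_transfer = build_job("sites.txt", "promoters.txt", "model.h5")
with_transfer = build_job("sites.txt", "promoters.txt", "model.h5",
                          chip_bigwig="sample.bw", input_bigwig="input.bw")

print(json.dumps(without_transfer, indent=2))
```

The printed JSON can be saved to a file and should be usable in place of inputs.yml, provided the key names match those declared by the chosen workflow.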
M2A provides a Dockerfile that builds an image containing all of the prerequisites listed above. The CWL workflow uses this image, so install Docker for your platform before running it.
In the M2A project directory, build the Docker image.
$ docker build --tag stjude/m2a:0.0.1 .
Currently, the M2A pipeline does not produce an interactive visualization. If M2A was run with transfer learning, the simplest measures of prediction accuracy on the training data are the squared Pearson correlation (R²) and the root mean square error (RMSE) between the measured and M2A-predicted values. In addition, comparing sample-to-sample consistency within the same or a similar cancer type (as measured by Pearson's R²) is a good starting point for understanding M2A's predictions in context.
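Both accuracy measures can be computed with a few lines of Python. The sketch below uses only the standard library and is not part of M2A: `pearson_r` and `rmse` are hypothetical helpers, and the two lists stand in for measured and predicted HM values that you would load from your own files.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def rmse(x, y):
    """Root mean square error between two equal-length sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty sequences of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Made-up values, for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]

r_squared = pearson_r(measured, predicted) ** 2
print("R^2 = %.3f, RMSE = %.3f" % (r_squared, rmse(measured, predicted)))
```

The made-up values also show why the two measures are worth reporting together: the predictions are perfectly correlated with the measurements, yet the RMSE is large because they are off by a constant factor.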
To run M2A in St. Jude Cloud, please follow the directions at https://university.stjude.cloud/docs/genomics-platform/workflow-guides/methylation-to-activity/
Copyright 2019 St. Jude Children's Research Hospital
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
For questions and bug reports, please open an issue on the GitHub project page.
All scripts describing the experiments and analyses in the M2A publication, including previous (unsupported) versions of M2A, can be found in the M2A_analyses directory.
(In submission) Justin Williams, Beisi Xu, Daniel Putnam, Andrew Thrasher, and Xiang Chen. MethylationToActivity: a deep-learning framework that reveals promoter activity landscapes from DNA methylomes in individual tumors.