# DeepCpG basics

This tutorial describes how to create the input data for DeepCpG, train models, and use the trained models for imputation.

## Variables

The following variables point to the data required for this tutorial.

In [4]:
data_dir="../data"
cpg_dir="${data_dir}/cpg"
dna_dir="${data_dir}/dna/mm10"



## Creating DeepCpG data files

We first store the known CpG methylation states of each cell in a tab-delimited file with the following columns:
* Chromosome (without chr)
* Position of the CpG site on the chromosome
* Binary methylation state of the CpG site (0=unmethylated, 1=methylated)

CpG sites with a methylation rate between zero and one should be binarized by rounding. The name of each file should correspond to the name of the cell.
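
As a minimal sketch of this binarization, the `awk` call below converts a three-column file of methylation rates into the format described above. The input file name, the example positions and rates, and the choice to map a rate of exactly 0.5 to 1 are assumptions made for illustration.

```bash
# Hypothetical input with columns: chromosome, position, methylation rate.
demo_dir=$(mktemp -d)
printf 'chr19\t3078909\t0.8\nchr19\t3078946\t0.25\nchr19\t3079012\t1\n' \
    > "$demo_dir/rates.tsv"

# Strip the "chr" prefix and round the rate to a binary state.
# Assumption: a rate of exactly 0.5 is rounded up to 1.
awk 'BEGIN { FS = OFS = "\t" }
     { sub(/^chr/, "", $1); $3 = ($3 >= 0.5) ? 1 : 0; print }' \
    "$demo_dir/rates.tsv" > "$demo_dir/cell1.tsv"

cat "$demo_dir/cell1.tsv"
```

The output file is named after the (hypothetical) cell `cell1`, following the convention that filenames correspond to cell names.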

For this tutorial we are using a subset of serum mouse embryonic stem cells from *Smallwood et al. (2014)*:

In [5]:
ls $cpg_dir

BS27_1_SER.tsv BS27_3_SER.tsv BS27_5_SER.tsv BS27_6_SER.tsv BS27_8_SER.tsv


We can have a look at the methylation profile of cell 'BS27_1_SER':

In [6]:
head "$cpg_dir/BS27_1_SER.tsv"


Since we are dealing with mouse cells, we are using the mm10 (GRCm38) mouse genome build:

In [7]:
ls $dna_dir

Mus_musculus.GRCm38.dna.chromosome.1.fa.gz
Mus_musculus.GRCm38.dna.chromosome.10.fa.gz
Mus_musculus.GRCm38.dna.chromosome.11.fa.gz
Mus_musculus.GRCm38.dna.chromosome.12.fa.gz
Mus_musculus.GRCm38.dna.chromosome.13.fa.gz
Mus_musculus.GRCm38.dna.chromosome.14.fa.gz
Mus_musculus.GRCm38.dna.chromosome.15.fa.gz
Mus_musculus.GRCm38.dna.chromosome.16.fa.gz
Mus_musculus.GRCm38.dna.chromosome.17.fa.gz
Mus_musculus.GRCm38.dna.chromosome.18.fa.gz
Mus_musculus.GRCm38.dna.chromosome.19.fa.gz
Mus_musculus.GRCm38.dna.chromosome.2.fa.gz
Mus_musculus.GRCm38.dna.chromosome.3.fa.gz
Mus_musculus.GRCm38.dna.chromosome.4.fa.gz
Mus_musculus.GRCm38.dna.chromosome.5.fa.gz
Mus_musculus.GRCm38.dna.chromosome.6.fa.gz
Mus_musculus.GRCm38.dna.chromosome.7.fa.gz
Mus_musculus.GRCm38.dna.chromosome.8.fa.gz
Mus_musculus.GRCm38.dna.chromosome.9.fa.gz
Mus_musculus.GRCm38.dna.chromosome.MT.fa.gz
Mus_musculus.GRCm38.dna.chromosome.X.fa.gz
Mus_musculus.GRCm38.dna.chromosome.Y.fa.gz


These files were downloaded by `setup.py`. Other genomes can be downloaded in the same way; for example, the human genome build hg38 (GRCh38) can be fetched with the following command:

```bash
wget 'ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.*.fa.gz'
```
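
Because the profile files name chromosomes without the `chr` prefix and the Ensembl FASTA files carry the chromosome name in the filename, a quick consistency check before building the data can be useful. The sketch below is one way to do it; the temporary directory, the stand-in files, and the filename pattern `*.chromosome.<name>.fa.gz` are assumptions based on the listing above, not something DeepCpG requires.

```bash
# Build a small stand-in layout: one profile and one (empty) FASTA file.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/cpg" "$demo_dir/dna"
printf '19\t3078909\t1\n19\t3078946\t0\nMT\t120\t1\n' > "$demo_dir/cpg/cell.tsv"
touch "$demo_dir/dna/Mus_musculus.GRCm38.dna.chromosome.19.fa.gz"

# Collect every chromosome named in the profile that has no FASTA file.
missing=""
for chromo in $(cut -f 1 "$demo_dir/cpg/cell.tsv" | sort -u); do
    if ! ls "$demo_dir/dna"/*.chromosome."$chromo".fa.gz > /dev/null 2>&1; then
        missing="$missing $chromo"
    fi
done

echo "Chromosomes without a FASTA file:${missing:- none}"
```

In this stand-in layout only chromosome `MT` is reported, since no FASTA file was created for it.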

Now we can run `dcpg_data.py` to create the input data for DeepCpG. For demonstration purposes, we only consider chromosome 19:

In [9]:
dcpg_data="./data"
cmd="dcpg_data.py
    --cpg_profiles $cpg_dir/*.tsv
    --dna_files $dna_dir
    --out_dir $dcpg_data
    --cpg_wlen 50
    --dna_wlen 1001
    --chromo 19
    "
eval $cmd

INFO (2017-01-25 14:46:52,096): Reading single-cell profiles ...
INFO (2017-01-25 14:47:02,613): 306801 samples
INFO (2017-01-25 14:47:02,651): --------------------------------------------------------------------------------
INFO (2017-01-25 14:47:02,651): Chromosome 19 ...
INFO (2017-01-25 14:47:03,172): 306801 / 306801 (100.0%) sites matched minimum coverage filter
INFO (2017-01-25 14:47:04,431): Chunk 	1 / 10
INFO (2017-01-25 14:47:04,463): Extracting DNA sequence windows ...
INFO (2017-01-25 14:47:10,437): Extracting CpG neighbors ...
INFO (2017-01-25 14:47:14,091): Chunk 	2 / 10
INFO (2017-01-25 14:47:14,099): Extracting DNA sequence windows ...
INFO (2017-01-25 14:47:20,189): Extracting CpG neighbors ...
INFO (2017-01-25 14:47:23,996): Chunk 	3 / 10
INFO (2017-01-25 14:47:24,004): Extracting DNA sequence windows ...
INFO (2017-01-25 14:47:30,030): Extracting CpG neighbors ...
INFO (2017-01-25 14:47:33,635): Chunk 	4 / 10
INFO (2017-01-25 14:47:33,643): Extracting D

For each CpG site that is observed in at least one cell, this command extracts the 50 neighboring CpG sites (25 to the left and 25 to the right) and the 1001 bp long DNA sequence window centered on the CpG site. The command creates multiple HDF5 files named `cX_FROM-TO.h5`, where `X` is the chromosome, and `FROM` and `TO` delimit the range of CpG site indices stored in the file:

In [10]:
ls $dcpg_data

c19_000000-032768.h5 c19_131072-163840.h5 c19_262144-294912.h5
c19_032768-065536.h5 c19_163840-196608.h5 c19_294912-306801.h5
c19_065536-098304.h5 c19_196608-229376.h5
c19_098304-131072.h5 c19_229376-262144.h5
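
The naming scheme can be reproduced with a few lines of shell arithmetic, which also explains why the log reports 10 chunks. This is only a sketch: the function name `chunk_names` is hypothetical, and the chunk size of 32768 is inferred from the file names above rather than taken from the `dcpg_data.py` documentation.

```bash
# Print the expected chunk file names for one chromosome.
# Arguments: number of CpG sites, chunk size, chromosome name.
chunk_names() {
    nb_site=$1
    chunk_size=$2
    chromo=$3
    start=0
    while [ "$start" -lt "$nb_site" ]; do
        end=$((start + chunk_size))
        # The last chunk is shorter and ends at the total number of sites.
        if [ "$end" -gt "$nb_site" ]; then
            end=$nb_site
        fi
        printf 'c%s_%06d-%06d.h5\n' "$chromo" "$start" "$end"
        start=$end
    done
}

# 306801 sites are reported in the log; 32768 is the assumed chunk size.
chunk_names 306801 32768 19
```

Nine full chunks cover 294912 sites, and the remaining 11889 sites end up in the tenth file, `c19_294912-306801.h5`.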


## Model training 

We can now train models on the created data files. 

First, we train a model that only uses the neighboring methylation states of all cells, denoted as *CpG module* in the publication. For demonstration purposes, we train for only one epoch on 1000 CpG sites. In practice, one would train on all data until learning converges, if possible on one or more GPUs.

In [11]:
models_dir="./models"
mkdir -p $models_dir



In [12]:
cmd="dcpg_train.py
    $dcpg_data/c*.h5
    --val_files $dcpg_data/c*.h5
    --cpg_model RnnL1
    --out_dir $models_dir/cpg
    --nb_epoch 1
    --nb_train_sample 1000
    --nb_val_sample 1000
    "
eval $cmd

Using TensorFlow backend.
INFO (2017-01-25 15:09:46,915): Building model ...
Replicate names:
BS27_1_SER, BS27_3_SER, BS27_5_SER, BS27_6_SER, BS27_8_SER

INFO (2017-01-25 15:09:46,922): Building CpG model ...
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
cpg/state/BS27_1_SER--BS27_3_SER (None, 5, 50)         0                                            
____________________________________________________________________________________________________
cpg/dist/BS27_1_SER--BS27_3_SER- (None, 5, 50)         0                                            
____________________________________________________________________________________________________
cpg/merge_1 (Merge)              (None, 5, 100)        0           cpg/state/BS27_1_SER--BS27_3_SER-
                                                                   cpg/

Although the model only uses neighboring CpG sites, it is already quite accurate in practice. To also make use of the DNA sequence, we train a *DNA module*:

In [13]:
cmd="dcpg_train.py
    $dcpg_data/c*.h5
    --val_files $dcpg_data/c*.h5
    --dna_model CnnL2h128
    --out_dir $models_dir/dna
    --nb_epoch 1
    --nb_train_sample 1000
    --nb_val_sample 1000
    "
eval $cmd

Using TensorFlow backend.
INFO (2017-01-25 15:10:16,585): Building model ...
INFO (2017-01-25 15:10:16,587): Building DNA model ...
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
dna (InputLayer)                 (None, 1001, 4)       0                                            
____________________________________________________________________________________________________
dna/convolution1d_1 (Convolution (None, 991, 128)      5760        dna[0][0]                        
____________________________________________________________________________________________________
dna/activation_1 (Activation)    (None, 991, 128)      0           dna/convolution1d_1[0][0]        
____________________________________________________________________________________________________
dna/maxpooling1d_1 (MaxPooling1D (None, 247, 128)

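Finally, we combine both modules by training a *joint module*. The previously trained *CpG* and *DNA* modules are loaded from their output directories, and `--train_models joint` restricts training to the joint module, so the two loaded modules are not trained further: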

In [14]:
cmd="dcpg_train.py
    $dcpg_data/c*.h5
    --val_files $dcpg_data/c*.h5
    --dna_model $models_dir/dna
    --cpg_model $models_dir/cpg
    --joint_model JointL2h512
    --train_models joint
    --out_dir $models_dir/joint
    --nb_epoch 1
    --nb_train_sample 1000
    --nb_val_sample 1000
    "
eval $cmd

Using TensorFlow backend.
INFO (2017-01-25 15:10:44,830): Building model ...
INFO (2017-01-25 15:10:44,832): Loading existing DNA model ...
Replicate names:
BS27_1_SER, BS27_3_SER, BS27_5_SER, BS27_6_SER, BS27_8_SER

INFO (2017-01-25 15:10:45,160): Loading existing CpG model ...
INFO (2017-01-25 15:10:45,909): Joining models ...
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
dna (InputLayer)                 (None, 1001, 4)       0                                            
____________________________________________________________________________________________________
dna/convolution1d_1 (Convolution (None, 991, 128)      5760        dna[0][0]                        
____________________________________________________________________________________________________
dna/activation_1 (Activation)    (None, 991, 128)
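
The training commands above pass the same files as training and validation data, which keeps the demonstration short. A minimal sketch of holding out separate chunks for validation follows. The 8/2 split and the empty stand-in files are assumptions, and the resulting command is only printed, not run; it uses only the options that appear in the commands above.

```bash
# Create empty stand-in files named like the chunks listed earlier.
demo_dir=$(mktemp -d)
for range in 000000-032768 032768-065536 065536-098304 098304-131072 \
             131072-163840 163840-196608 196608-229376 229376-262144 \
             262144-294912 294912-306801; do
    touch "$demo_dir/c19_$range.h5"
done

# Hold out the last nb_val chunks for validation (assumed split).
nb_val=2
set -- "$demo_dir"/c19_*.h5      # sorted list of all chunk files
nb_train=$(( $# - nb_val ))
train_files=""
val_files=""
i=0
for f in "$@"; do
    i=$((i + 1))
    if [ "$i" -le "$nb_train" ]; then
        train_files="$train_files $f"
    else
        val_files="$val_files $f"
    fi
done

# Print, rather than run, the resulting training command.
echo "dcpg_train.py$train_files --val_files$val_files --cpg_model RnnL1"
```

Splitting by chunk keeps validation sites out of the training data, at the cost of validating on a single genomic region.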

## Model evaluation 

Finally, we use `dcpg_eval.py` to impute the missing methylation states and to evaluate prediction performance on the partially observed methylation states. We use the trained joint module here, but the CpG or DNA module could be evaluated on its own in the same way.

In [16]:
eval_dir="./eval"
mkdir -p $eval_dir

cmd="dcpg_eval.py
    $dcpg_data/c*.h5
    --model_files $models_dir/joint
    --out_data $eval_dir/data.h5
    --out_report $eval_dir/report.tsv
    --nb_sample 1000
    "
eval $cmd

Using TensorFlow backend.
INFO (2017-01-25 15:14:21,705): Loading model ...
INFO (2017-01-25 15:14:22,717): Loading data ...
INFO (2017-01-25 15:14:22,718): Predicting ...
INFO (2017-01-25 15:14:22,751):  128/1000 (12.8%)
INFO (2017-01-25 15:14:23,228):  256/1000 (25.6%)
INFO (2017-01-25 15:14:23,730):  384/1000 (38.4%)
INFO (2017-01-25 15:14:24,167):  512/1000 (51.2%)
INFO (2017-01-25 15:14:24,626):  640/1000 (64.0%)
INFO (2017-01-25 15:14:25,124):  768/1000 (76.8%)
INFO (2017-01-25 15:14:25,624):  896/1000 (89.6%)
INFO (2017-01-25 15:14:26,128): 1000/1000 (100.0%)
  'precision', 'predicted', average, warn_for)
           output       auc       acc       tpr  tnr        f1       mcc      n
4  cpg/BS27_8_SER  0.704014  0.161765  0.002915  1.0  0.005814  0.021578  408.0
2  cpg/BS27_5_SER  0.546834  0.304147  0.000000  1.0  0.000000  0.000000  434.0
1  cpg/BS27_3_SER  0.502976  0.082759  0.000000  1.0  0.000000  0.000000  290.0
0  cpg/BS27_1_SER  0.465349  0.142857  0.00
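
The first row of this table can be reproduced from a small confusion matrix, which makes the low accuracy easier to interpret. The counts used below (1 true positive, 342 false negatives, 65 true negatives, 0 false positives) are not printed by `dcpg_eval.py`; they are inferred here from the reported metrics for `cpg/BS27_8_SER` under the standard metric definitions and should be read as an assumption.

```bash
metrics=$(awk 'BEGIN {
    # Inferred counts for cpg/BS27_8_SER (assumption, see text).
    tp = 1; fn = 342; tn = 65; fp = 0
    n   = tp + fn + tn + fp
    acc = (tp + tn) / n                    # accuracy
    tpr = tp / (tp + fn)                   # true positive rate
    tnr = tn / (tn + fp)                   # true negative rate
    f1  = 2 * tp / (2 * tp + fp + fn)      # F1 score
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    printf "n=%d acc=%.6f tpr=%.6f tnr=%.1f f1=%.6f mcc=%.6f", n, acc, tpr, tnr, f1, mcc
}')
echo "$metrics"
```

Under these counts the model labels nearly every site as unmethylated, which would account for a `tnr` of 1.0 together with a `tpr` close to zero after a single epoch on 1000 sites.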

The imputed methylation profiles of all cells are stored in `data.h5`, and the performance metrics in `report.tsv`:

In [17]:
cat $eval_dir/report.tsv

metric	output	value
acc	cpg/BS27_3_SER	0.08275862068965517
acc	cpg/BS27_1_SER	0.14285714285714285
acc	cpg/BS27_8_SER	0.16176470588235295
acc	cpg/BS27_5_SER	0.30414746543778803
acc	cpg/BS27_6_SER	0.3172690763052209
auc	cpg/BS27_6_SER	0.42576321667907674
auc	cpg/BS27_1_SER	0.4653486394557823
auc	cpg/BS27_3_SER	0.5029761904761905
auc	cpg/BS27_5_SER	0.5468342364037728
auc	cpg/BS27_8_SER	0.7040143529939448
f1	cpg/BS27_3_SER	0.0
f1	cpg/BS27_6_SER	0.0
f1	cpg/BS27_5_SER	0.0
f1	cpg/BS27_1_SER	0.0
f1	cpg/BS27_8_SER	0.005813953488372092
mcc	cpg/BS27_3_SER	0.0
mcc	cpg/BS27_6_SER	0.0
mcc	cpg/BS27_5_SER	0.0
mcc	cpg/BS27_1_SER	0.0
mcc	cpg/BS27_8_SER	0.02157806086076016
n	cpg/BS27_6_SER	249.0
n	cpg/BS27_3_SER	290.0
n	cpg/BS27_1_SER	392.0
n	cpg/BS27_8_SER	408.0
n	cpg/BS27_5_SER	434.0
tnr	cpg/BS27_3_SER	1.0
tnr	cpg/BS27_6_SER	1.0
tnr	cpg/BS27_5_SER	1.0
tnr	cpg/BS27_1_SER	1.0
tnr	cpg/BS27_8_SER	1.0
tpr	cpg/BS27_3_SER	0.0
tpr	cpg/BS27_6_SER	0.0
tpr	cpg/BS27_5_SER	0.0
tpr	
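
Because the report is stored in long format (one row per metric and output), a single metric can be summarized with a short `awk` filter. The sketch below averages the AUC over all cells; it writes a small copy of the relevant rows to a temporary file so that it runs on its own, and the variable names are hypothetical.

```bash
# Recreate a few rows of the report above in a temporary file.
demo_dir=$(mktemp -d)
report="$demo_dir/report.tsv"
{
    printf 'metric\toutput\tvalue\n'
    printf 'acc\tcpg/BS27_3_SER\t0.08275862068965517\n'
    printf 'auc\tcpg/BS27_6_SER\t0.42576321667907674\n'
    printf 'auc\tcpg/BS27_1_SER\t0.4653486394557823\n'
    printf 'auc\tcpg/BS27_3_SER\t0.5029761904761905\n'
    printf 'auc\tcpg/BS27_5_SER\t0.5468342364037728\n'
    printf 'auc\tcpg/BS27_8_SER\t0.7040143529939448\n'
} > "$report"

# Average the value column over all rows whose metric is "auc".
mean_auc=$(awk -F '\t' '$1 == "auc" { sum += $3; n++ }
                        END { if (n > 0) printf "%.4f", sum / n }' "$report")
echo "mean auc: $mean_auc"
```

Replacing `"auc"` with another metric name from the first column summarizes that metric instead.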

In [18]:
h5ls -r $eval_dir/data.h5

/                        Group
/chromo                  Dataset {1000}
/outputs                 Group
/outputs/cpg             Group
/outputs/cpg/BS27_1_SER  Dataset {1000}
/outputs/cpg/BS27_3_SER  Dataset {1000}
/outputs/cpg/BS27_5_SER  Dataset {1000}
/outputs/cpg/BS27_6_SER  Dataset {1000}
/outputs/cpg/BS27_8_SER  Dataset {1000}
/pos                     Dataset {1000}
/preds                   Group
/preds/cpg               Group
/preds/cpg/BS27_1_SER    Dataset {1000}
/preds/cpg/BS27_3_SER    Dataset {1000}
/preds/cpg/BS27_5_SER    Dataset {1000}
/preds/cpg/BS27_6_SER    Dataset {1000}
/preds/cpg/BS27_8_SER    Dataset {1000}
