# DeepCpG basics

This tutorial describes the basics of creating input data, training models, and evaluating models. More detailed information can be found in the [DeepCpG documentation](http://deepcpg.readthedocs.io/en/latest/).

## Initialization
We first initialize some variables that will be used throughout the tutorial. `test_mode=1` should be used for testing purposes, which speeds up computations by only using a subset of the data. For real applications, `test_mode=0` should be used.

In [1]:
# Print a command between separator lines, then execute it.
function run {
  local cmd="$*"
  echo
  echo "#################################"
  echo "$cmd"
  echo "#################################"
  eval "$cmd"
}

test_mode=1 # Set to 1 for testing and 0 otherwise.
example_dir="../../data" # Directory with example data.
cpg_dir="$example_dir/cpg" # Directory with CpG profiles.
dna_dir="$example_dir/dna/mm10" # Directory with DNA sequences.



## Creating DeepCpG data files
We first store the known CpG methylation states of each cell in a tab-delimited file with the following columns:
* Chromosome (without chr)
* Position of the CpG site on the chromosome
* Binary methylation state of the CpG site (0=unmethylated, 1=methylated)

CpG sites with a methylation rate between zero and one should be binarized by rounding. Filenames should correspond to cell names. 
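The binarization step can be sketched with `awk`. This is a minimal illustration, not part of DeepCpG: `binarize_profile` is a hypothetical helper, the input is assumed to have the three columns listed above, and rounding a rate of exactly 0.5 up to 1 is an assumption. The helper also strips a leading `chr` from the chromosome name.

```bash
# Hypothetical helper: round the methylation rate in column 3 to 0 or 1
# and drop a leading "chr" from the chromosome name in column 1.
# Reads tab-delimited records from the given files, or from stdin.
binarize_profile() {
  awk 'BEGIN { FS = OFS = "\t" }
       {
         sub(/^chr/, "", $1)           # "chr1" becomes "1"
         $3 = ($3 >= 0.5) ? 1 : 0      # assumption: 0.5 rounds up
         print
       }' "$@"
}

# Example with three inline records instead of a real profile.
printf 'chr1\t3000827\t0.8\n1\t3001007\t0.25\n1\t3001018\t1.0\n' | binarize_profile
```

To apply it to a file, pass the file name as an argument and redirect the output to a new file named after the cell.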

Each position must point to the cytosine residue of a CpG site (positions are enumerated from 1). Otherwise, `dcpg_data.py` will report warnings, e.g. if the wrong genome is used or CpG sites are not correctly aligned.
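The position convention can be illustrated on a toy sequence. The sketch below is not how `dcpg_data.py` performs its check; `is_cpg_site` is a hypothetical helper and the sequence is made up rather than taken from mm10. It treats a 1-based position as valid if the base at that position is a C and the next base is a G.

```bash
# Hypothetical helper: succeed if the 1-based position $2 in sequence $1
# is a C that is directly followed by a G (case-insensitive).
is_cpg_site() {
  local seq=$1 pos=$2
  local dinuc
  dinuc=$(printf '%s' "$seq" | cut -c "${pos}-$((pos + 1))" | tr 'a-z' 'A-Z')
  [ "$dinuc" = "CG" ]
}

seq="ATCGGACGT"   # toy sequence, not taken from mm10
for pos in 3 4 7; do
  if is_cpg_site "$seq" "$pos"; then
    echo "$pos: CpG"
  else
    echo "$pos: not a CpG"
  fi
done
```

Position 4 fails because it points to the guanine rather than the cytosine, which is the typical result of an off-by-one (0-based) coordinate.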

For this tutorial we are using a subset of serum mouse embryonic stem cells from *Smallwood et al. (2014)*:

In [2]:
ls $cpg_dir

BS27_1_SER.tsv BS27_3_SER.tsv BS27_5_SER.tsv BS27_6_SER.tsv BS27_8_SER.tsv


We can have a look at the methylation profile of cell 'BS27_1_SER':

In [3]:
head "$cpg_dir/BS27_1_SER.tsv"

1	3000827	1.0
1	3001007	1.0
1	3001018	1.0
1	3001277	1.0
1	3001629	1.0
1	3003226	1.0
1	3003339	1.0
1	3003379	1.0
1	3006416	1.0
1	3007580	1.0


Since we are dealing with mouse cells, we are using the mm10 (GRCm38) mouse genome build:

In [4]:
ls $dna_dir

Mus_musculus.GRCm38.dna.chromosome.1.fa.gz
Mus_musculus.GRCm38.dna.chromosome.10.fa.gz
Mus_musculus.GRCm38.dna.chromosome.11.fa.gz
Mus_musculus.GRCm38.dna.chromosome.12.fa.gz
Mus_musculus.GRCm38.dna.chromosome.13.fa.gz
Mus_musculus.GRCm38.dna.chromosome.14.fa.gz
Mus_musculus.GRCm38.dna.chromosome.15.fa.gz
Mus_musculus.GRCm38.dna.chromosome.16.fa.gz
Mus_musculus.GRCm38.dna.chromosome.17.fa.gz
Mus_musculus.GRCm38.dna.chromosome.18.fa.gz
Mus_musculus.GRCm38.dna.chromosome.19.fa.gz
Mus_musculus.GRCm38.dna.chromosome.2.fa.gz
Mus_musculus.GRCm38.dna.chromosome.3.fa.gz
Mus_musculus.GRCm38.dna.chromosome.4.fa.gz
Mus_musculus.GRCm38.dna.chromosome.5.fa.gz
Mus_musculus.GRCm38.dna.chromosome.6.fa.gz
Mus_musculus.GRCm38.dna.chromosome.7.fa.gz
Mus_musculus.GRCm38.dna.chromosome.8.fa.gz
Mus_musculus.GRCm38.dna.chromosome.9.fa.gz
Mus_musculus.GRCm38.dna.chromosome.MT.fa.gz
Mus_musculus.GRCm38.dna.chromosome.X.fa.gz
Mus_musculus.GRCm38.dna.chromosome.Y.fa.gz


These files were downloaded by `setup.py`. Other genomes, such as the human genome build hg38, can be downloaded with a command like the following:

```bash
wget "ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.*.fa.gz"
```

Now we can run `dcpg_data.py` to create the input data for DeepCpG. For testing purposes, we only consider a few CpG sites on chromosomes 1 and 13:

In [5]:
data_dir="./data"
cmd="dcpg_data.py
    --cpg_profiles $cpg_dir/*.tsv
    --dna_files $dna_dir
    --out_dir $data_dir
    --cpg_wlen 50
    --dna_wlen 1001
"
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --chromo 1 13
        --nb_sample_chromo 1000
        "
fi
run $cmd


#################################
dcpg_data.py --cpg_profiles ../../data/cpg/BS27_1_SER.tsv ../../data/cpg/BS27_3_SER.tsv ../../data/cpg/BS27_5_SER.tsv ../../data/cpg/BS27_6_SER.tsv ../../data/cpg/BS27_8_SER.tsv --dna_files ../../data/dna/mm10 --out_dir ./data --cpg_wlen 50 --dna_wlen 1001 --chromo 1 13 --nb_sample_chromo 1000
#################################
INFO (2017-03-05 18:37:02,011): Reading single-cell profiles ...
INFO (2017-03-05 18:37:44,082): 2000 samples
INFO (2017-03-05 18:37:44,082): --------------------------------------------------------------------------------
INFO (2017-03-05 18:37:44,082): Chromosome 1 ...
INFO (2017-03-05 18:37:44,100): 1000 / 1000 (100.0%) sites matched minimum coverage filter
INFO (2017-03-05 18:37:49,338): Chunk 	1 / 1
INFO (2017-03-05 18:37:49,395): Extracting DNA sequence windows ...
INFO (2017-03-05 18:37:49,941): Extracting CpG neighbors ...
INFO (2017-03-05 18:37:50,094): -------------------------------------------------------

For each CpG site that is observed in at least one cell, this command extracts the 50 neighboring CpG sites (25 to the left and 25 to the right) and the 1001 bp long DNA sequence window centered on the CpG site. In test mode, only 1000 CpG sites are randomly sampled from each of chromosomes 1 and 13. The command creates multiple HDF5 files named `cX_FROM-TO.h5`, where `X` is the chromosome, and `FROM` and `TO` give the index range of the CpG sites stored in the file:

In [6]:
ls $data_dir

c13_000000-001000.h5 c1_000000-001000.h5
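As a worked example of the window arithmetic, a 1001 bp window centered on a CpG site covers 500 bp on either side of it. The sketch below only illustrates the coordinates; `dna_window` is a hypothetical helper, and how `dcpg_data.py` treats windows that run past the end of a chromosome is not covered here.

```bash
# Hypothetical helper: print the 1-based inclusive start and end of a
# window of odd length $2 that is centered on position $1.
dna_window() {
  local pos=$1 wlen=$2
  local half=$(( (wlen - 1) / 2 ))
  echo "$((pos - half)) $((pos + half))"
}

# First CpG site of BS27_1_SER with the --dna_wlen value used above.
dna_window 3000827 1001   # prints "3000327 3001327"
```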


## Model training 

We can now train models on the created data. First, we need to split the data into a training set, a validation set, and a test set. The training set should contain at least 3 million CpG sites. We will use chromosomes 1, 3, 5, 7, and 9 as the training set, and chromosomes 13, 14, 15, 16, and 17 as the validation set:

In [8]:
train_files=$(ls $data_dir/c{1,3,5,7,9}_*.h5 2> /dev/null)
val_files=$(ls $data_dir/c{13,14,15,16,17}_*.h5 2> /dev/null)
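Because `2> /dev/null` hides the errors for chromosomes that have no data file, either list can end up empty without any message. A small check can catch this before training; `require_files` is a hypothetical helper and not part of DeepCpG.

```bash
# Hypothetical helper: report how many files were passed for a label,
# or fail with a message if there are none.
require_files() {
  local label=$1
  shift
  if [ "$#" -eq 0 ]; then
    echo "No $label files found" >&2
    return 1
  fi
  echo "$label: $# file(s)"
}

# The variables are left unquoted on purpose so that the shell splits
# them into one argument per file name.
require_files training ${train_files:-} || echo "Create the data files first." >&2
require_files validation ${val_files:-} || echo "Create the data files first." >&2
```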



We can count the number of CpG sites in the training set using `dcpg_data_stats.py`:

In [9]:
cmd="dcpg_data_stats.py $train_files"
run $cmd


#################################
dcpg_data_stats.py ./data/c1_000000-001000.h5
#################################
           output  nb_tot  nb_obs  frac_obs      mean       var
0  cpg/BS27_1_SER    1000     187     0.187  0.775401  0.174154
1  cpg/BS27_3_SER    1000     208     0.208  0.711538  0.205251
2  cpg/BS27_5_SER    1000     200     0.200  0.690000  0.213900
3  cpg/BS27_6_SER    1000     195     0.195  0.666667  0.222222
4  cpg/BS27_8_SER    1000     210     0.210  0.776190  0.173719


For each output cell, `nb_tot` is the total number of CpG sites, `nb_obs` the number of CpG sites with known methylation state, `frac_obs` the ratio between `nb_obs` and `nb_tot`, `mean` the mean methylation rate, and `var` the variance of the methylation rate.
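These statistics can be reproduced by hand. The sketch below is an illustration, not the implementation of `dcpg_data_stats.py`: `profile_stats` is a hypothetical helper, and the counts of 145 methylated and 42 unmethylated sites are inferred from the reported mean of BS27_1_SER rather than printed by the tool. For binary states, the reported variance is consistent with `mean * (1 - mean)`.

```bash
# Hypothetical helper: summarize column 3 of a tab-delimited profile of
# binary methylation states. nb_tot is passed as the first argument
# because unobserved sites do not appear in the profile.
profile_stats() {
  local nb_tot=$1
  shift
  awk -v nb_tot="$nb_tot" 'BEGIN { FS = "\t" }
       { nb_obs++; sum += $3 }
       END {
         if (nb_obs == 0) exit 1
         mean = sum / nb_obs
         printf "nb_obs=%d frac_obs=%.3f mean=%.6f var=%.6f\n",
                nb_obs, nb_obs / nb_tot, mean, mean * (1 - mean)
       }' "$@"
}

# Generate 187 observed sites, the first 145 of which are methylated.
awk 'BEGIN { for (i = 1; i <= 187; i++) printf "1\t%d\t%d\n", i, (i <= 145) }' |
  profile_stats 1000
```

With these inputs the helper prints `nb_obs=187 frac_obs=0.187 mean=0.775401 var=0.174154`, which agrees with the first row of the table.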

We can now train our model. DeepCpG consists of a *DNA*, *CpG*, and *Joint model*, which can be trained either jointly or separately. We will train them separately, starting with the CpG model:

In [11]:
models_dir="./models"
mkdir -p $models_dir



In [12]:
cmd="dcpg_train.py
    $train_files
    --val_files $val_files
    --cpg_model RnnL1
    --out_dir $models_dir/cpg
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_epoch 1
        --nb_train_sample 1000
        --nb_val_sample 1000
    "
else
    cmd="$cmd
        --nb_epoch 30
        "
fi
run $cmd


#################################
dcpg_train.py ./data/c1_000000-001000.h5 --val_files ./data/c13_000000-001000.h5 --cpg_model RnnL1 --out_dir ./models/cpg --nb_epoch 1 --nb_train_sample 1000 --nb_val_sample 1000
#################################
Using TensorFlow backend.
INFO (2017-03-05 18:39:19,813): Building model ...
Replicate names:
BS27_1_SER, BS27_3_SER, BS27_5_SER, BS27_6_SER, BS27_8_SER

INFO (2017-03-05 18:39:19,817): Building CpG model ...
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
cpg/state (InputLayer)           (None, 5, 50)         0                                            
____________________________________________________________________________________________________
cpg/dist (InputLayer)            (None, 5, 50)         0                                            
_______________________

`--cpg_model RnnL1` specifies the architecture of the CpG model and `--nb_epoch` the number of training epochs. These are hyper-parameters, which can be adapted depending on the size of the training set and the model complexity. For testing purposes, we decrease the number of samples using `--nb_train_sample` and `--nb_val_sample`.

The CpG model is often already quite accurate on its own. However, we can further boost performance by also training a DNA model, which leverages the DNA sequence:

In [13]:
cmd="dcpg_train.py
    $train_files
    --val_files $val_files
    --dna_model CnnL2h128
    --out_dir $models_dir/dna
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_epoch 1
        --nb_train_sample 1000
        --nb_val_sample 1000
    "
else
    cmd="$cmd
        --nb_epoch 30
        "
fi
run $cmd


#################################
dcpg_train.py ./data/c1_000000-001000.h5 --val_files ./data/c13_000000-001000.h5 --dna_model CnnL2h128 --out_dir ./models/dna --nb_epoch 1 --nb_train_sample 1000 --nb_val_sample 1000
#################################
Using TensorFlow backend.
INFO (2017-03-05 18:39:36,503): Building model ...
INFO (2017-03-05 18:39:36,506): Building DNA model ...
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
dna (InputLayer)                 (None, 1001, 4)       0                                            
____________________________________________________________________________________________________
dna/convolution1d_1 (Convolution (None, 991, 128)      5760        dna[0][0]                        
___________________________________________________________________________________________________

Finally, we combine both models by training a Joint model:

In [14]:
cmd="dcpg_train.py
    $train_files
    --val_files $val_files
    --dna_model $models_dir/dna
    --cpg_model $models_dir/cpg
    --joint_model JointL2h512
    --train_models joint
    --out_dir $models_dir/joint
"
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_epoch 1
        --nb_train_sample 1000
        --nb_val_sample 1000
    "
else
    cmd="$cmd
        --nb_epoch 10
        "
fi
run $cmd


#################################
dcpg_train.py ./data/c1_000000-001000.h5 --val_files ./data/c13_000000-001000.h5 --dna_model ./models/dna --cpg_model ./models/cpg --joint_model JointL2h512 --train_models joint --out_dir ./models/joint --nb_epoch 1 --nb_train_sample 1000 --nb_val_sample 1000
#################################
Using TensorFlow backend.
INFO (2017-03-05 18:40:04,081): Building model ...
INFO (2017-03-05 18:40:04,084): Loading existing DNA model ...
INFO (2017-03-05 18:40:04,084): Using model files ./models/dna/model.json ./models/dna/model_weights_val.h5
Replicate names:
BS27_1_SER, BS27_3_SER, BS27_5_SER, BS27_6_SER, BS27_8_SER

INFO (2017-03-05 18:40:04,403): Loading existing CpG model ...
INFO (2017-03-05 18:40:04,403): Using model files ./models/cpg/model.json ./models/cpg/model_weights_val.h5
INFO (2017-03-05 18:40:05,163): Joining models ...
____________________________________________________________________________________________________
Layer (t

You can find more information about [training](http://deepcpg.readthedocs.io/en/latest/train.html) and [model architectures](http://deepcpg.readthedocs.io/en/latest/models.html) in the DeepCpG documentation.

## Model evaluation 

Next, we use `dcpg_eval.py` to impute missing methylation states and to evaluate prediction performance on observed states. We will use the trained Joint model, but could also evaluate the CpG or DNA model.

In [15]:
eval_dir="./eval"
mkdir -p $eval_dir

cmd="dcpg_eval.py
    $data_dir/c*.h5
    --model_files $models_dir/joint
    --out_data $eval_dir/data.h5
    --out_report $eval_dir/report.csv
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_sample 1000
        "
fi
run $cmd


#################################
dcpg_eval.py ./data/c13_000000-001000.h5 ./data/c1_000000-001000.h5 --model_files ./models/joint --out_data ./eval/data.h5 --out_report ./eval/report.csv --nb_sample 1000
#################################
Using TensorFlow backend.
INFO (2017-03-05 18:40:30,156): Loading model ...
INFO (2017-03-05 18:40:31,124): Loading data ...
INFO (2017-03-05 18:40:31,131): Predicting ...
INFO (2017-03-05 18:40:31,148):  128/1000 (12.8%)
INFO (2017-03-05 18:40:31,635):  256/1000 (25.6%)
INFO (2017-03-05 18:40:32,078):  384/1000 (38.4%)
INFO (2017-03-05 18:40:32,554):  512/1000 (51.2%)
INFO (2017-03-05 18:40:33,020):  640/1000 (64.0%)
INFO (2017-03-05 18:40:33,498):  768/1000 (76.8%)
INFO (2017-03-05 18:40:33,973):  896/1000 (89.6%)
INFO (2017-03-05 18:40:34,447): 1000/1000 (100.0%)
           output       auc       acc       tpr       tnr        f1       mcc      n
4  cpg/BS27_8_SER  0.531582  0.297980  0.116883  0.931818  0.205714  0.065755  198.0

The imputed methylation profiles of all cells are stored in `data.h5`, and performance metrics in `report.csv`.
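The performance metrics can be related to confusion-matrix counts. The sketch below assumes the standard definitions of accuracy, true positive rate, true negative rate, F1 score, and Matthews correlation coefficient, which the output itself does not spell out. The counts (TP=18, FP=3, TN=41, FN=136) are inferred so that they are consistent with the BS27_8_SER row (n=198); `dcpg_eval.py` does not print them, and `confusion_metrics` is a hypothetical helper.

```bash
# Hypothetical helper: print metrics computed from the counts of true
# positives, false positives, true negatives, and false negatives.
confusion_metrics() {
  awk -v TP="$1" -v FP="$2" -v TN="$3" -v FN="$4" 'BEGIN {
    n = TP + FP + TN + FN
    printf "acc=%.6f\n", (TP + TN) / n
    printf "tpr=%.6f\n", TP / (TP + FN)
    printf "tnr=%.6f\n", TN / (TN + FP)
    printf "f1=%.6f\n", 2 * TP / (2 * TP + FP + FN)
    printf "mcc=%.6f\n", (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  }'
}

# Counts inferred for BS27_8_SER (an assumption, see above).
confusion_metrics 18 3 41 136
```

With these counts the accuracy is 59/198 and the F1 score is 36/175, matching the reported 0.297980 and 0.205714.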

In [16]:
cat $eval_dir/report.csv 

metric	output	value
acc	cpg/BS27_1_SER	0.7549019607843137
acc	cpg/BS27_3_SER	0.76
acc	cpg/BS27_5_SER	0.6735751295336787
acc	cpg/BS27_6_SER	0.5797101449275363
acc	cpg/BS27_8_SER	0.29797979797979796
auc	cpg/BS27_1_SER	0.5003048780487804
auc	cpg/BS27_3_SER	0.5191981931112366
auc	cpg/BS27_5_SER	0.4860865165696939
auc	cpg/BS27_6_SER	0.4439193446754883
auc	cpg/BS27_8_SER	0.531582054309327
f1	cpg/BS27_1_SER	0.857142857142857
f1	cpg/BS27_3_SER	0.8636363636363635
f1	cpg/BS27_5_SER	0.8012618296529967
f1	cpg/BS27_6_SER	0.7202572347266881
f1	cpg/BS27_8_SER	0.2057142857142857
mcc	cpg/BS27_1_SER	0.02048455758362385
mcc	cpg/BS27_3_SER	-0.05492890710151309
mcc	cpg/BS27_5_SER	-0.002891936384328604
mcc	cpg/BS27_6_SER	-0.09219821056073607
mcc	cpg/BS27_8_SER	0.06575532984988024
n	cpg/BS27_1_SER	204.0
n	cpg/BS27_3_SER	200.0
n	cpg/BS27_5_SER	193.0
n	cpg/BS27_6_SER	207.0
n	cpg/BS27_8_SER	198.0
tnr	cpg/BS27_1_SER	0.1
tnr	cpg/BS27_3_SER	0.0
tnr	cpg/BS27_5_SER	0.05084745762711865
tn
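Since the report, as displayed above, is tab-delimited with the columns `metric`, `output`, and `value`, it can be summarized with standard tools. The sketch below averages one metric over all cells; `mean_metric` is a hypothetical helper, and the inline sample reuses two accuracy rows from the report.

```bash
# Hypothetical helper: print the mean of one metric over all cells from
# a tab-delimited report with the columns metric, output, and value.
mean_metric() {
  local metric=$1
  shift
  awk -v metric="$metric" 'BEGIN { FS = "\t" }
       NR > 1 && $1 == metric { sum += $3; n++ }   # NR > 1 skips the header
       END { if (n > 0) printf "%s\t%.6f\n", metric, sum / n }' "$@"
}

# Inline sample with two accuracy rows from the report.
printf 'metric\toutput\tvalue\nacc\tcpg/BS27_1_SER\t0.7549019607843137\nacc\tcpg/BS27_3_SER\t0.76\n' |
  mean_metric acc
```

To summarize the full report, pass the file instead, for example `mean_metric auc "$eval_dir/report.csv"`.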

In [17]:
h5ls -r $eval_dir/data.h5

/                        Group
/chromo                  Dataset {1000}
/outputs                 Group
/outputs/cpg             Group
/outputs/cpg/BS27_1_SER  Dataset {1000}
/outputs/cpg/BS27_3_SER  Dataset {1000}
/outputs/cpg/BS27_5_SER  Dataset {1000}
/outputs/cpg/BS27_6_SER  Dataset {1000}
/outputs/cpg/BS27_8_SER  Dataset {1000}
/pos                     Dataset {1000}
/preds                   Group
/preds/cpg               Group
/preds/cpg/BS27_1_SER    Dataset {1000}
/preds/cpg/BS27_3_SER    Dataset {1000}
/preds/cpg/BS27_5_SER    Dataset {1000}
/preds/cpg/BS27_6_SER    Dataset {1000}
/preds/cpg/BS27_8_SER    Dataset {1000}


## Exporting methylation profiles

Finally, we export imputed methylation profiles to HDF5 files:

In [18]:
cmd="dcpg_eval_export.py
    $eval_dir/data.h5
    -o $eval_dir/hdf
    -f hdf
"
eval $cmd

INFO (2017-03-05 18:40:37,044): cpg/BS27_1_SER
INFO (2017-03-05 18:40:37,048): cpg/BS27_3_SER
INFO (2017-03-05 18:40:37,051): cpg/BS27_5_SER
INFO (2017-03-05 18:40:37,055): cpg/BS27_6_SER
INFO (2017-03-05 18:40:37,059): cpg/BS27_8_SER
INFO (2017-03-05 18:40:37,062): Done!


In [19]:
ls $eval_dir/hdf

BS27_1_SER.h5 BS27_3_SER.h5 BS27_5_SER.h5 BS27_6_SER.h5 BS27_8_SER.h5


You can use `-f bedGraph` to export profiles to gzip-compressed bedGraph files instead, although this takes longer.
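A bedGraph export could look like the following sketch, which mirrors the HDF5 export command and changes only the format and the output directory. The directory name `$eval_dir/bedGraph` is an assumption, and the command is executed only if `dcpg_eval_export.py` is available on the `PATH`.

```bash
eval_dir=${eval_dir:-./eval}   # reuse the tutorial value if it is set
cmd="dcpg_eval_export.py
    $eval_dir/data.h5
    -o $eval_dir/bedGraph
    -f bedGraph
"
if command -v dcpg_eval_export.py > /dev/null 2>&1; then
  eval $cmd
else
  # Print the command instead of failing when DeepCpG is not installed.
  echo "dcpg_eval_export.py not found; would run:" $cmd
fi
```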