# DeepCpG basics

This tutorial describes how to create the input data for DeepCpG, train models, and use the trained models for imputation.

## Variables

We first initialize some variables that will be used throughout the tutorial. `test_mode=1` should be used for testing purposes, which speeds up computations by only using a subset of the data. For real applications, `test_mode=0` should be used.

In [None]:
# Print a command inside a banner, then execute it.
function run {
  local cmd="$*"
  echo
  echo "#################################"
  echo "$cmd"
  echo "#################################"
  eval "$cmd"
}

test_mode=1 # set this variable to 0 for production
data_dir="../../data"
cpg_dir="$data_dir/cpg"
dna_dir="$data_dir/dna/mm10"
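
The pattern used throughout this tutorial (build a multi-line `cmd` string, append extra flags in test mode, then pass it to `run`) can be tried without DeepCpG installed. The following is a minimal sketch that uses `echo` as a stand-in for the real command. `demo_mode` and the flag values are placeholders chosen for illustration, and the helper is a portable copy of `run` so that the snippet runs on its own.

```bash
# Portable copy of the run helper: print the command in a banner, then execute it
run() {
  local cmd="$*"
  echo
  echo "#################################"
  echo "$cmd"
  echo "#################################"
  eval "$cmd"
}

demo_mode=1  # placeholder standing in for test_mode
cmd="echo
    --out_dir ./demo
"
if [ "$demo_mode" -eq 1 ]; then
    cmd="$cmd
        --nb_sample 10
    "
fi

# Unquoted $cmd is split on whitespace, so newlines and indentation collapse
last_line=$(run $cmd | tail -n 1)
echo "$last_line"
```

Because `$cmd` is passed unquoted, the shell also expands glob patterns such as `*.tsv` before `run` sees them.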

## Creating DeepCpG data files

We first store the known CpG methylation states of each cell in a tab-delimited file with the following columns:
* Chromosome (without the `chr` prefix)
* Position of the CpG site on the chromosome
* Binary methylation state of the CpG site (0=unmethylated, 1=methylated)

CpG sites with a methylation rate between zero and one should be binarized by rounding. Filenames should correspond to cell names. 
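
Binarization by rounding can be done with a short `awk` filter. The sketch below assumes a hypothetical input with a `chr`-prefixed chromosome, a position, and a methylation rate in [0, 1]. The sample rows are made up, and treating a rate of exactly 0.5 as methylated is an arbitrary choice for this example.

```bash
# Hypothetical input rows: chromosome, position, methylation rate
raw_profile=$(printf 'chr1\t3000827\t1.0\nchr1\t3001007\t0.25\nchr19\t3078942\t0.5\n')

# Strip a leading "chr" and round the rate to 0 or 1 (0.5 is rounded up here)
binarized=$(printf '%s\n' "$raw_profile" |
  awk 'BEGIN { FS = OFS = "\t" }
       { sub(/^chr/, "", $1); print $1, $2, ($3 >= 0.5 ? 1 : 0) }')

printf '%s\n' "$binarized"
```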

Each position must point to the cytosine residue of a CpG site (positions are enumerated from 1). Otherwise, `dcpg_data.py` will report warnings, for example if the wrong genome is used or CpG sites are not correctly aligned.

For this tutorial we are using a subset of serum mouse embryonic stem cells from *Smallwood et al. (2014)*:

In [None]:
ls $cpg_dir

We can have a look at the methylation profile of cell 'BS27_1_SER':

In [None]:
head "$cpg_dir/BS27_1_SER.tsv"
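
Before running DeepCpG, it can be useful to check that a profile follows the three-column layout described above. The following is a rough sketch rather than part of DeepCpG: `count_bad_rows` is a hypothetical helper, and the demo file contains made-up rows, one of which has a non-binary state.

```bash
# Hypothetical helper: count rows that are not "chromosome<TAB>position<TAB>0|1"
count_bad_rows() {
  awk -F'\t' '
    NF != 3 || $2 !~ /^[0-9]+$/ || $2 < 1 || ($3 != 0 && $3 != 1) { bad++ }
    END { print bad + 0 }' "$1"
}

# Demo file with two valid rows and one row that was not binarized
demo_profile=$(mktemp)
printf '1\t3000827\t1\n1\t3001007\t0\n1\t3001018\t0.4\n' > "$demo_profile"

bad_rows=$(count_bad_rows "$demo_profile")
echo "$bad_rows malformed row(s)"
rm -f "$demo_profile"
```

This check only covers the file layout. It cannot tell whether a position really points to the cytosine of a CpG site, which requires the reference genome.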

Since we are dealing with mouse cells, we are using the mm10 (GRCm38) mouse genome build:

In [None]:
ls $dna_dir

These files were downloaded by `setup.py`. Other genomes, such as the human genome hg38, can be downloaded with a command like the following:

```bash
# Quote the URL so that wget, not the shell, expands the wildcard
wget "ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.*.fa.gz"
```

Now we can run `dcpg_data.py` to create the input data for DeepCpG. For testing purposes, we only consider a few CpG sites on chromosome 19:

In [None]:
dcpg_data="./data"
cmd="dcpg_data.py
    --cpg_profiles $cpg_dir/*.tsv
    --dna_files $dna_dir
    --out_dir $dcpg_data
    --cpg_wlen 50
    --dna_wlen 1001
"
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --chromo 19
        --nb_sample 10000
        "
fi
run $cmd

For each CpG site that is observed in at least one cell, this command extracts the 50 neighboring CpG sites (25 to the left and 25 to the right) and the 1001 bp long DNA sequence window centered on the CpG site. The command creates multiple HDF5 files named `cX_FROM_TO.h5`, where `X` is the chromosome, and `FROM` and `TO` give the index range of the CpG sites stored in the file:

In [None]:
ls $dcpg_data
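
The window sizes passed to `dcpg_data.py` above translate into flank sizes as follows. This is only the arithmetic behind the description, assuming that both windows are centered on the target CpG site and that the DNA window includes the center position.

```bash
cpg_wlen=50     # value of --cpg_wlen used above
dna_wlen=1001   # value of --dna_wlen used above

cpg_flank=$(( cpg_wlen / 2 ))        # neighboring CpG sites on each side
dna_flank=$(( (dna_wlen - 1) / 2 ))  # bases on each side of the center position

echo "CpG neighbors per side: $cpg_flank"
echo "DNA bases per side: $dna_flank"
```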

## Model training 

We can now train models on the created data files. 

First, we train a model that only uses the neighboring methylation states of all cells, denoted as *CpG module* in the publication. 

For testing purposes, we use `--nb_train_sample` and `--nb_val_sample` to train and validate on only 1000 CpG sites each, and `--nb_epoch 1` to train for a single epoch. In practice, one would train on more data and for more epochs, e.g. 30, or use `--early_stopping 5` to stop training if validation performance does not improve over five epochs. These parameters depend on the size of the training set and the complexity of the chosen model.
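
The effect of an early-stopping patience can be illustrated with a few lines of `awk`. This is a generic sketch of the idea, not the logic used inside `dcpg_train.py`: the loss values are made up, a lower validation loss is assumed to be better, and a patience of 3 keeps the example short.

```bash
# Made-up validation losses, one per epoch
val_losses="0.70 0.62 0.60 0.61 0.63 0.60 0.62 0.64 0.59"
patience=3

stop_epoch=$(echo "$val_losses" | tr ' ' '\n' | awk -v patience="$patience" '
  # Improvement (or first epoch): remember the best loss and reset the counter
  NR == 1 || $1 < best { best = $1; stale = 0; next }
  # No improvement: count the epoch and stop once patience is exhausted
  { stale++ }
  stale >= patience { print NR; stopped = 1; exit }
  # Patience never ran out: training uses all epochs
  END { if (!stopped) print NR }')

echo "training would stop after epoch $stop_epoch"
```

With these values the best loss is reached in epoch 3, and epochs 4 to 6 bring no improvement, so the sketch stops after epoch 6 even though epoch 9 would have been slightly better. A larger patience trades longer training for a lower risk of stopping too early.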

In [None]:
models_dir="./models"
mkdir -p $models_dir

In [None]:
cmd="dcpg_train.py
    $dcpg_data/c*.h5
    --val_files $dcpg_data/c*.h5
    --cpg_model RnnL1
    --out_dir $models_dir/cpg
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_epoch 1
        --nb_train_sample 1000
        --nb_val_sample 1000
    "
else
    cmd="$cmd
        --nb_epoch 30
        --early_stopping 5
        "
fi
run $cmd

Although the model only uses neighboring CpG sites, it is already quite accurate in practice. To also make use of the DNA sequence, we train a *DNA module*:

In [None]:
cmd="dcpg_train.py
    $dcpg_data/c*.h5
    --val_files $dcpg_data/c*.h5
    --dna_model CnnL2h128
    --out_dir $models_dir/dna
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_epoch 1
        --nb_train_sample 1000
        --nb_val_sample 1000
    "
else
    cmd="$cmd
        --nb_epoch 30
        --early_stopping 5
        "
fi
run $cmd

Finally, we combine both models by training a *joint module*, without further training the *CpG* and *DNA* modules:

In [None]:
cmd="dcpg_train.py
    $dcpg_data/c*.h5
    --val_files $dcpg_data/c*.h5
    --dna_model $models_dir/dna
    --cpg_model $models_dir/cpg
    --joint_model JointL2h512
    --train_models joint
    --out_dir $models_dir/joint
"
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_epoch 1
        --nb_train_sample 1000
        --nb_val_sample 1000
    "
else
    cmd="$cmd
        --nb_epoch 10
        --early_stopping 5
        "
fi
run $cmd

## Model evaluation 

We now use `dcpg_eval.py` to impute the missing methylation states and to evaluate prediction performance on the partially observed methylation states. We use the trained joint module here, but the CpG or DNA module could be evaluated on its own in the same way.

In [None]:
eval_dir="./eval"
mkdir -p $eval_dir

cmd="dcpg_eval.py
    $dcpg_data/c*.h5
    --model_files $models_dir/joint
    --out_data $eval_dir/data.h5
    --out_report $eval_dir/report.tsv
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_sample 1000
        "
fi
run $cmd

The imputed methylation profiles of all cells are stored in `data.h5`, and the performance metrics in `report.tsv`:

In [None]:
cat $eval_dir/report.tsv

In [None]:
h5ls -r $eval_dir/data.h5

## Exporting methylation profiles

Finally, we export imputed methylation profiles to gzip-compressed bedGraph files:

In [None]:
cmd="dcpg_eval_export.py
    $eval_dir/data.h5
    -o ./eval
    -f bedGraph
"
eval $cmd

In [None]:
ls ./eval
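
Gzip-compressed bedGraph files can be inspected without unpacking them to disk. The sketch below assumes the usual four-column bedGraph layout (chromosome, start, end, value) and works on a made-up demo file. The file name `demo_cell.bedGraph.gz` and its rows are placeholders, not output of `dcpg_eval_export.py`.

```bash
# Create a small, made-up bedGraph file for demonstration
demo_dir=$(mktemp -d)
printf '19\t3078941\t3078942\t0.91\n19\t3079001\t3079002\t0.12\n19\t3080100\t3080101\t0.77\n' |
  gzip -c > "$demo_dir/demo_cell.bedGraph.gz"

# Stream the file and count sites with a predicted value above 0.5
n_methylated=$(gzip -dc "$demo_dir/demo_cell.bedGraph.gz" |
  awk -F'\t' '$4 > 0.5 { n++ } END { print n + 0 }')

echo "sites predicted as methylated: $n_methylated"
rm -rf "$demo_dir"
```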