# Predicting inter-cell statistics

This tutorial describes how to predict inter-cell statistics such as the mean methylation rate or variance across cells.

## Initialization
We first initialize some variables that will be used throughout the tutorial. Set `test_mode=1` for testing, which speeds up computations by using only a subset of the data. For real applications, set `test_mode=0`.

In [1]:
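function run {
  local cmd="$*"  # Join all arguments into a single command string.
  echo
  echo "#################################"
  echo "$cmd"
  echo "#################################"
  eval "$cmd"     # Re-parse the string so that quoted arguments stay intact.
}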

test_mode=1 # Set to 1 for testing and 0 otherwise.
example_dir="../../data" # Directory with example data.
cpg_dir="$example_dir/cpg" # Directory with CpG profiles.
dna_dir="$example_dir/dna/mm10" # Directory with DNA sequences.
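
The `run` helper prints a command and then executes it with `eval`. As a minimal sketch of why `eval` matters here, the following compares expanding a command string directly with re-parsing it through `eval`. The function `count_args` and the sample command are made up for illustration and are not part of DeepCpG.

```shell
# Hypothetical helper: report how many arguments a command receives.
count_args() { echo "$#"; }

# A command string that contains quoted arguments, one of them with a space.
cmd="count_args --output_names 'cpg_stats/.*' 'a b'"

# Direct expansion: the quotes stay literal and 'a b' is split into two words.
without_eval=$($cmd)

# eval re-parses the string, so each quoted argument is passed as one word.
with_eval=$(eval "$cmd")

echo "without eval: $without_eval arguments"
echo "with eval:    $with_eval arguments"
```

Direct expansion passes four arguments, whereas `eval` passes three. This is why quoted regular expressions in a command string can reach the program as single arguments when the string is executed through `eval`.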

## Creating DeepCpG data files
`dcpg_data.py` provides the arguments `--cpg_stats` and `--win_stats` to compute statistics across cells for single CpG sites or in windows of lengths `--win_stats_wlen` centred on CpG sites, respectively. Supported statistics are described in the [documentation](http://deepcpg.readthedocs.io/en/latest/data.html#predicting-statistics) and include the mean methylation rate (`mean`), the variance (`var`), and whether a CpG site is differentially methylated (`diff`). With `--cpg_stats_cov`, per-CpG statistics are computed only for CpG sites that are covered by at least the specified number of cells. If this number is too low, estimated statistics might be unreliable in lowly covered regions. We will compute the mean methylation rate and variance across cells in windows of different lengths, and whether CpG sites with at least three observations are differentially methylated.
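
To make the per-CpG statistics and the coverage threshold concrete, here is a small sketch on made-up data. It is not DeepCpG's implementation. The toy input, the `NA` marker for uncovered sites, the use of the population variance, and the stand-in definition of `diff` (both methylated and unmethylated states are observed) are all assumptions made for illustration.

```shell
# Hypothetical toy input: one line per CpG site, one column per cell.
# 1 = methylated, 0 = unmethylated, NA = site not covered in that cell.
sites='1 1 NA 0 1
NA NA 1 NA NA
0 0 0 NA 0'

min_cov=3  # Plays the role of --cpg_stats_cov in this sketch.

stats=$(printf '%s\n' "$sites" | awk -v min_cov="$min_cov" '
{
    # Count covering cells and sum their methylation states.
    n = 0; sum = 0
    for (i = 1; i <= NF; i++) {
        if ($i != "NA") { n++; sum += $i }
    }
    # Skip sites that are covered by too few cells.
    if (n < min_cov) {
        printf "site%d cov=%d skipped\n", NR, n
        next
    }
    mean = sum / n
    var = mean * (1 - mean)               # Population variance of 0/1 values.
    diff = (sum > 0 && sum < n) ? 1 : 0   # Assumed stand-in for "diff".
    printf "site%d cov=%d mean=%.2f var=%.4f diff=%d\n", NR, n, mean, var, diff
}')
printf '%s\n' "$stats"
```

The second site is covered by a single cell only, so it is skipped rather than assigned a mean or variance estimated from one observation.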

In [2]:
data_dir="./data"
cmd="dcpg_data.py
    --cpg_profiles $cpg_dir/*.tsv
    --dna_files $dna_dir
    --out_dir $data_dir
    --dna_wlen 1001
    --cpg_wlen 50
    --cpg_stats diff
    --cpg_stats_cov 3
    --win_stats mean var
    --win_stats_wlen 1001 2001 3001 4001 5001
"
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --chromo 1 13
        --nb_sample_chromo 10000
        "
fi
run $cmd


#################################
dcpg_data.py --cpg_profiles ../../data/cpg/BS27_1_SER.tsv ../../data/cpg/BS27_3_SER.tsv ../../data/cpg/BS27_5_SER.tsv ../../data/cpg/BS27_6_SER.tsv ../../data/cpg/BS27_8_SER.tsv --dna_files ../../data/dna/mm10 --out_dir ./data --dna_wlen 1001 --cpg_wlen 50 --cpg_stats diff --cpg_stats_cov 3 --win_stats mean var --win_stats_wlen 1001 2001 3001 4001 5001 --chromo 1 13 --nb_sample_chromo 10000
#################################
INFO (2017-05-01 09:00:48,895): Reading CpG profiles ...
INFO (2017-05-01 09:00:48,895): ../../data/cpg/BS27_1_SER.tsv
INFO (2017-05-01 09:00:55,405): ../../data/cpg/BS27_3_SER.tsv
INFO (2017-05-01 09:01:00,122): ../../data/cpg/BS27_5_SER.tsv
INFO (2017-05-01 09:01:07,260): ../../data/cpg/BS27_6_SER.tsv
INFO (2017-05-01 09:01:12,711): ../../data/cpg/BS27_8_SER.tsv
INFO (2017-05-01 09:01:18,234): 20000 samples
INFO (2017-05-01 09:01:18,235): --------------------------------------------------------------------------------
INFO (2017-

## Model training 
We will train a DNA model to predict mean methylation rates, cell-to-cell variance, and differentially methylated CpG sites from the DNA sequence alone. Alternatively, you could train a CpG model or a Joint model, which also use neighboring CpG sites to make predictions. To predict all per-CpG and window-based statistics computed by `dcpg_data.py` instead of methylation states, we run `dcpg_train.py` with `--output_names 'cpg_stats/.*' 'win_stats/.*'`. Use `--output_names '.*'` to predict both methylation states and statistics.
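
The effect of the two `--output_names` patterns can be sketched by filtering a list of output names with regular expressions. This only illustrates the selection idea: `grep -E` is used as a stand-in, and how `dcpg_train.py` matches patterns internally is not shown here. The `cpg/BS27_1_SER` entry is an assumed name for a per-cell methylation state output.

```shell
# Hypothetical output names; the cpg/ entry is an assumption for this sketch.
outputs='cpg/BS27_1_SER
cpg_stats/diff
win_stats/1001/mean
win_stats/1001/var'

# Keep names that match either pattern from the start of the name.
selected=$(printf '%s\n' "$outputs" | grep -E '^(cpg_stats/.*|win_stats/.*)')
printf '%s\n' "$selected"
```

In this sketch the per-cell output is dropped and the three statistics outputs are kept, whereas the single pattern `.*` would keep all four names.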

In [3]:
train_files=$(ls $data_dir/c{1,3,5,7,9}_*.h5 2> /dev/null)
echo "Training files:"
echo $train_files
echo

val_files=$(ls $data_dir/c{13,14,15,16,17}_*.h5 2> /dev/null)
echo "Validation files:"
echo $val_files

Training files:
./data/c1_000000-010000.h5

Validation files:
./data/c13_000000-010000.h5
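
Only one training and one validation file are listed because the data were created with `--chromo 1 13` in test mode, so the patterns for the other chromosomes match nothing. A minimal sketch of this behaviour, using placeholder files in a scratch directory (the helper `select_files` and the `*_demo` variables are hypothetical names):

```shell
# Scratch directory with empty placeholder files named like the tutorial's output.
tmp_dir=$(mktemp -d)
touch "$tmp_dir/c1_000000-010000.h5" "$tmp_dir/c13_000000-010000.h5"

# List files for the requested chromosomes. Chromosomes without files are
# skipped silently because error messages from ls are discarded.
select_files() {
    for chromo in "$@"; do
        ls "$tmp_dir"/c"${chromo}"_*.h5 2> /dev/null || true
    done
}

train_demo=$(select_files 1 3 5 7 9)
val_demo=$(select_files 13 14 15 16 17)

echo "Training files:   $(basename "$train_demo")"
echo "Validation files: $(basename "$val_demo")"
rm -r "$tmp_dir"
```

The underscore after the chromosome number keeps the pattern for chromosome 1 from also matching the file for chromosome 13.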


In [4]:
model_dir="./model"

cmd="dcpg_train.py
    $train_files
    --val_files $val_files
    --dna_model CnnL2h128
    --out_dir $model_dir
    --output_names 'cpg_stats/.*' 'win_stats/.*'
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_epoch 1
        --nb_train_sample 10000
        --nb_val_sample 10000
    "
else
    cmd="$cmd
        --nb_epoch 30
        "
fi
run $cmd


#################################
dcpg_train.py ./data/c1_000000-010000.h5 --val_files ./data/c13_000000-010000.h5 --dna_model CnnL2h128 --out_dir ./model --output_names 'cpg_stats/.*' 'win_stats/.*' --nb_epoch 1 --nb_train_sample 10000 --nb_val_sample 10000
#################################
Using TensorFlow backend.
INFO (2017-05-01 09:01:37,896): Building model ...
INFO (2017-05-01 09:01:37,901): Building DNA model ...
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
dna (InputLayer)                 (None, 1001, 4)       0                                            
____________________________________________________________________________________________________
dna/conv1d_1 (Conv1D)            (None, 991, 128)      5760        dna[0][0]                        
_____________________________________________________________________

## Model evaluation 

Finally, we use `dcpg_eval.py` for predicting statistics and evaluating predictions.

In [5]:
eval_dir="./eval"
mkdir -p $eval_dir

cmd="dcpg_eval.py
    $data_dir/c*.h5
    --model_files $model_dir
    --out_data $eval_dir/data.h5
    --out_report $eval_dir/report.csv
    "
run $cmd


#################################
dcpg_eval.py ./data/c13_000000-010000.h5 ./data/c1_000000-010000.h5 --model_files ./model --out_data ./eval/data.h5 --out_report ./eval/report.csv
#################################
Using TensorFlow backend.
INFO (2017-05-01 09:03:56,192): Loading model ...
INFO (2017-05-01 09:03:56,834): Loading data ...
INFO (2017-05-01 09:03:56,838): Predicting ...
INFO (2017-05-01 09:03:56,868):   128/20000 (0.6%)
INFO (2017-05-01 09:04:02,852):  2176/20000 (10.9%)
INFO (2017-05-01 09:04:08,858):  4224/20000 (21.1%)
INFO (2017-05-01 09:04:14,793):  6272/20000 (31.4%)
INFO (2017-05-01 09:04:20,718):  8320/20000 (41.6%)
INFO (2017-05-01 09:04:26,740): 10384/20000 (51.9%)
INFO (2017-05-01 09:04:32,829): 12432/20000 (62.2%)
INFO (2017-05-01 09:04:39,001): 14480/20000 (72.4%)
INFO (2017-05-01 09:04:45,036): 16528/20000 (82.6%)
INFO (2017-05-01 09:04:51,084): 18576/20000 (92.9%)
INFO (2017-05-01 09:04:55,628): 20000/20000 (100.0%)
  'precision', 'predicted', average, war

In [6]:
cat $eval_dir/report.csv 

metric	output	value
acc	cpg_stats/diff	0.8461538461538461
acc	win_stats/1001/mean	0.7111
acc	win_stats/2001/mean	0.71115
acc	win_stats/3001/mean	0.7136
acc	win_stats/4001/mean	0.7171
acc	win_stats/5001/mean	0.72255
auc	cpg_stats/diff	0.31818181818181823
auc	win_stats/1001/mean	0.7591639136300757
auc	win_stats/2001/mean	0.7476034296359877
auc	win_stats/3001/mean	0.7392403201486835
auc	win_stats/4001/mean	0.7333872428809352
auc	win_stats/5001/mean	0.7299328050362869
cor	win_stats/1001/mean	0.5665225871692309
cor	win_stats/1001/var	0.04710989485589752
cor	win_stats/2001/mean	0.5740795240820958
cor	win_stats/2001/var	0.04085099681709634
cor	win_stats/3001/mean	0.5758646227680748
cor	win_stats/3001/var	0.02788840114716501
cor	win_stats/4001/mean	0.5776600608207685
cor	win_stats/4001/var	0.03488458615307337
cor	win_stats/5001/mean	0.5763485749134751
cor	win_stats/5001/var	0.03616016798069942
f1	cpg_stats/diff	0.0
f1	win_stats/1001/mean	0.8311612413067616
f1	win_stats/2001/mean	0.831195394909