# Data pre-processing

In [1]:
%cd ./data

/Users/angermue/research/bs/dev/160919_deepcpg/examples/basics/data


We first store the known CpG methylation states of each cell in a tab-delimited file with the following columns:
* Chromosome (without the `chr` prefix)
* Position of the CpG site on the chromosome
* Binary methylation state of the CpG site (0=unmethylated, 1=methylated)
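
The three-column layout above can be checked with a few lines of Python before handing the files to DeepCpG. The following is a minimal sketch, not part of DeepCpG: the function name `parse_cpg_profile` and the example positions are hypothetical and chosen only for illustration.

```python
import csv
import io


def parse_cpg_profile(handle):
    """Parse a tab-delimited CpG profile into (chromosome, position, state) tuples.

    Hypothetical helper for illustration; it only enforces the three-column
    layout described in the text above.
    """
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        chromo, pos, state = row
        if chromo.lower().startswith("chr"):
            raise ValueError("chromosome must not carry a 'chr' prefix: %r" % chromo)
        state = int(state)
        if state not in (0, 1):
            raise ValueError("methylation state must be 0 or 1, got %r" % state)
        records.append((chromo, int(pos), state))
    return records


# Made-up example rows: chromosome, position, binary methylation state.
example = "1\t3000827\t1\n1\t3001007\t0\nX\t3001018\t1\n"
records = parse_cpg_profile(io.StringIO(example))
print(records)
```

To check a real file, the same function could be called on an open file handle, for example `parse_cpg_profile(open("cpg/BS27_1_SER.tsv"))`.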

In [2]:
%%bash
ln -s ../../data/cpg cpg
ls cpg

BS27_1_SER.tsv
BS27_3_SER.tsv
BS27_4_SER.tsv
BS27_5_SER.tsv
BS27_6_SER.tsv
BS27_7_SER.tsv
BS27_8_SER.tsv
cpg


Next, we download the mouse reference genome (GRCm38, Ensembl release 85) as one FASTA file per chromosome.

In [None]:
%%bash

mkdir -p ./dna
cd ./dna
# Quote the URL so that the shell passes the wildcard on to wget unchanged
wget "ftp://ftp.ensembl.org/pub/release-85/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.chromosome.*.fa.gz"
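
If wildcard downloads are not available, the per-chromosome URLs can be listed explicitly. The sketch below builds them from the URL pattern used above. The chromosome list (1 to 19, X, Y, MT) is the standard GRCm38 set, and the helper name `chromosome_urls` is hypothetical. Because the `dcpg_data.py` call in this example passes `--chromo 1`, fetching chromosome 1 alone is probably sufficient, although that is an assumption about how `--dna_db` is read.

```python
# Base URL taken from the wget command above.
BASE_URL = "ftp://ftp.ensembl.org/pub/release-85/fasta/mus_musculus/dna"

# Mouse GRCm38 chromosomes: 19 autosomes, X, Y, and the mitochondrial genome.
CHROMOSOMES = [str(i) for i in range(1, 20)] + ["X", "Y", "MT"]


def chromosome_urls(chromosomes=CHROMOSOMES):
    """Return one FASTA URL per chromosome (hypothetical helper)."""
    return [
        "%s/Mus_musculus.GRCm38.dna.chromosome.%s.fa.gz" % (BASE_URL, c)
        for c in chromosomes
    ]


urls = chromosome_urls()
print(len(urls))
print(urls[0])
```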

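We can now create the input data for DeepCpG using `dcpg_data.py`. To keep the example fast, we restrict the run to chromosome 1 (`--chromo 1`) and 1000 CpG sites (`--nb_sample 1000`).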

In [40]:
%%bash
cmd="dcpg_data.py
    --cpg_profiles ./cpg/*.tsv
    --dna_db ./dna
    --out_dir ./deepcpg
    --nb_sample 1000
    --chromo 1
    "
eval $cmd

INFO (2017-01-09 16:41:33,677): Reading single-cell profiles ...
INFO (2017-01-09 16:41:33,887): 4329 samples
INFO (2017-01-09 16:41:33,888): --------------------------------------------------------------------------------
INFO (2017-01-09 16:41:33,888): Chromosome 1 ...
INFO (2017-01-09 16:41:33,906): 1000 / 1000 (100.0%) sites matched minimum coverage filter
INFO (2017-01-09 16:41:37,544): Chunk 	1 / 1
INFO (2017-01-09 16:41:37,565): Extracting DNA sequence windows ...
INFO (2017-01-09 16:41:37,955): Extracting CpG neighbors ...
INFO (2017-01-09 16:41:38,120): Done!


In [41]:
%%bash
ls ./deepcpg

c1_000000-001000.h5


In [None]:
%cd ..

# Model training

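We first train a DNA model (`CnnL2h128`) on the DNA sequence windows. For this small example, the same files serve as training and validation data, and we train for a single epoch.

In [47]: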
%%bash

cmd="dcpg_train.py
    ./data/deepcpg/c*.h5
    --val_files ./data/deepcpg/c*.h5
    --dna_model CnnL2h128
    --out_dir ./dna_model
    --nb_epoch 1
    "
eval $cmd

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
dna (InputLayer)                 (None, 1001, 4)       0                                            
____________________________________________________________________________________________________
dna/convolution1d_1 (Convolution1(None, 991, 128)      5760        dna[0][0]                        
____________________________________________________________________________________________________
dna/activation_1 (Activation)    (None, 991, 128)      0           dna/convolution1d_1[0][0]        
____________________________________________________________________________________________________
dna/maxpooling1d_1 (MaxPooling1D)(None, 247, 128)      0           dna/activation_1[0][0]           
___________________________________________________________________________________________

Using TensorFlow backend.
INFO (2017-01-09 16:43:53,830): Building model ...
INFO (2017-01-09 16:43:53,830): Building DNA model ...
INFO (2017-01-09 16:43:54,175): Computing output statistics ...
INFO (2017-01-09 16:43:54,511): Loading data ...
INFO (2017-01-09 16:43:54,512): Initializing callbacks ...
INFO (2017-01-09 16:43:54,512): Training model ...
INFO (2017-01-09 16:44:36,204): Done!
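
The shapes and parameter count in the model summary above can be reproduced by hand. The sketch below assumes a convolution without padding and with stride 1, followed by non-overlapping max pooling. The filter length of 11 and the pool length of 4 are not stated in the output; they are inferred here because they are consistent with the printed numbers.

```python
def conv1d_output_length(input_length, filter_length):
    """Output length of a 1D convolution with no padding and stride 1."""
    return input_length - filter_length + 1


def conv1d_param_count(in_channels, filter_length, n_filters):
    """Weights (filter_length * in_channels per filter) plus one bias per filter."""
    return filter_length * in_channels * n_filters + n_filters


def pool1d_output_length(input_length, pool_length):
    """Output length of non-overlapping pooling (stride equal to pool length)."""
    return input_length // pool_length


FILTER_LENGTH = 11  # assumption, inferred from 1001 -> 991
POOL_LENGTH = 4     # assumption, inferred from 991 -> 247

conv_len = conv1d_output_length(1001, FILTER_LENGTH)
n_params = conv1d_param_count(4, FILTER_LENGTH, 128)
pool_len = pool1d_output_length(conv_len, POOL_LENGTH)
print(conv_len, n_params, pool_len)  # 991 5760 247
```

The input shape `(None, 1001, 4)` corresponds to a DNA window of 1001 positions with four channels, one per nucleotide.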


In [49]:
%%bash
ls ./dna_model

events.out.tfevents.1483980240.lawrence
lc_train.csv
lc_val.csv
model.h5
model.json
model_weights_train.h5
model_weights_val.h5


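Next, we train a CpG model (`RnnL1`), which takes the methylation states and distances of neighboring CpG sites in each cell as input.

In [51]: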
%%bash

cmd="dcpg_train.py
    ./data/deepcpg/c*.h5
    --val_files ./data/deepcpg/c*.h5
    --cpg_model RnnL1
    --out_dir ./cpg_model
    --nb_epoch 1
    "
eval $cmd

Replicate names:
BS27_1_SER, BS27_3_SER, BS27_4_SER, BS27_5_SER, BS27_6_SER, BS27_7_SER, BS27_8_SER

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
cpg/state/BS27_1_SER--BS27_3_SER-(None, 7, 50)         0                                            
____________________________________________________________________________________________________
cpg/dist/BS27_1_SER--BS27_3_SER--(None, 7, 50)         0                                            
____________________________________________________________________________________________________
cpg/merge_1 (Merge)              (None, 7, 100)        0           cpg/state/BS27_1_SER--BS27_3_SER-
                                                                   cpg/dist/BS27_1_SER--BS27_3_SER--
___________________________________________________________________________________________

Using TensorFlow backend.
INFO (2017-01-09 16:46:08,211): Building model ...
INFO (2017-01-09 16:46:08,211): Building CpG model ...
INFO (2017-01-09 16:46:09,321): Computing output statistics ...
INFO (2017-01-09 16:46:09,720): Loading data ...
INFO (2017-01-09 16:46:09,721): Initializing callbacks ...
INFO (2017-01-09 16:46:09,721): Training model ...
INFO (2017-01-09 16:46:28,530): Done!


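We then join the two pre-trained models with a joint model (`JointL2h512`). With `--train_models joint`, only the joint part is trained, while the DNA and CpG models are loaded from the files written in the previous steps.

In [53]: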
%%bash

cmd="dcpg_train.py
    ./data/deepcpg/c*.h5
    --val_files ./data/deepcpg/c*.h5
    --dna_model ./dna_model/model.json ./dna_model/model_weights_val.h5
    --cpg_model ./cpg_model/model.json ./cpg_model/model_weights_val.h5
    --joint_model JointL2h512
    --train_models joint
    --nb_epoch 1
    --out_dir ./joint_model
    "
eval $cmd

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
dna (InputLayer)                 (None, 1001, 4)       0                                            
____________________________________________________________________________________________________
dna/convolution1d_1 (Convolution1(None, 991, 128)      5760        dna[0][0]                        
____________________________________________________________________________________________________
dna/activation_1 (Activation)    (None, 991, 128)      0           dna/convolution1d_1[0][0]        
____________________________________________________________________________________________________
dna/maxpooling1d_1 (MaxPooling1D)(None, 247, 128)      0           dna/activation_1[0][0]           
___________________________________________________________________________________________

Using TensorFlow backend.
INFO (2017-01-09 16:52:45,036): Building model ...
INFO (2017-01-09 16:52:45,036): Loading existing DNA model ...
INFO (2017-01-09 16:52:45,454): Loading existing CpG model ...
INFO (2017-01-09 16:52:46,735): Joining models ...
INFO (2017-01-09 16:52:47,343): Computing output statistics ...
INFO (2017-01-09 16:52:47,788): Loading data ...
INFO (2017-01-09 16:52:47,789): Initializing callbacks ...
INFO (2017-01-09 16:52:47,790): Training model ...
INFO (2017-01-09 16:53:19,514): Done!


# Evaluation

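We evaluate the joint model with `dcpg_eval.py`, which writes its output data to `data.h5` and a per-cell performance report to `report.tsv`. Note that this example evaluates on the same files used for training, so the reported metrics do not reflect held-out performance.

In [55]: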
%%bash

eval_dir="./eval"
mkdir -p $eval_dir

cmd="dcpg_eval.py
    ./data/deepcpg/*.h5
    --model_files ./joint_model/model.json ./joint_model/model_weights_val.h5
    --out_data $eval_dir/data.h5
    --out_report $eval_dir/report.tsv
    "
eval $cmd

           output       auc       acc       tpr       tnr        f1       mcc      n
1  cpg/BS27_3_SER  0.659654  0.474138  0.397849  0.782609  0.548148  0.149707  116.0
2  cpg/BS27_4_SER  0.640138  0.496933  0.151351  0.950355  0.254545  0.162765  326.0
0  cpg/BS27_1_SER  0.599513  0.847973  0.932075  0.129032  0.916512  0.071337  296.0
5  cpg/BS27_7_SER  0.563526  0.606061  0.450704  0.663212  0.380952  0.104773  264.0
3  cpg/BS27_5_SER  0.558863  0.764706  0.861538  0.217391  0.861538  0.078930  153.0
6  cpg/BS27_8_SER  0.551657  0.919149  1.000000  0.000000  0.957871  0.000000  235.0
4  cpg/BS27_6_SER  0.535728  0.594470  0.696552  0.388889  0.696552  0.085441  217.0
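
To make the report columns concrete, the sketch below computes accuracy, true positive rate, true negative rate, F1, and Matthews correlation coefficient from confusion-matrix counts using their standard definitions. It is not DeepCpG's implementation. The counts are not taken from the output; they are inferred as one set that is consistent with the `cpg/BS27_8_SER` row (n=235, tpr=1, tnr=0), which corresponds to a model that predicts every site as methylated. MCC is set to 0 when its denominator is 0, a common convention.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    n = tp + fp + tn + fn
    acc = (tp + tn) / n
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is 0 when any marginal count is 0.
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {"acc": acc, "tpr": tpr, "tnr": tnr, "f1": f1, "mcc": mcc, "n": n}


# Inferred counts (assumption): 216 methylated and 19 unmethylated sites,
# all predicted as methylated.
m = binary_metrics(tp=216, fp=19, tn=0, fn=0)
print({k: round(v, 6) for k, v in m.items()})
```

The high accuracy and F1 in that row therefore reflect the class imbalance rather than discriminative power, which is why `auc` and `mcc` are the more informative columns here.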


Using TensorFlow backend.
INFO (2017-01-09 16:55:36,120): Loading model ...
INFO (2017-01-09 16:55:37,611): Loading data ...
INFO (2017-01-09 16:55:37,612): Predicting ...
INFO (2017-01-09 16:55:37,629):  128/1000 (12.8%)
INFO (2017-01-09 16:55:39,066):  256/1000 (25.6%)
INFO (2017-01-09 16:55:40,457):  384/1000 (38.4%)
INFO (2017-01-09 16:55:41,878):  512/1000 (51.2%)
INFO (2017-01-09 16:55:43,280):  640/1000 (64.0%)
INFO (2017-01-09 16:55:44,687):  768/1000 (76.8%)
INFO (2017-01-09 16:55:46,084):  896/1000 (89.6%)
INFO (2017-01-09 16:55:47,487): 1000/1000 (100.0%)
INFO (2017-01-09 16:55:48,746): Done!
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x11a9407f0>>
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 522, in __del__
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/py

In [None]:
%%bash
h5ls -r ./eval/data.h5