# Fine-tuning a pre-trained model

This tutorial describes how to fine-tune a pre-trained model from the [DeepCpG model zoo](https://github.com/cangermueller/deepcpg/blob/master/docs/models.md). Fine-tuning a model that has been pre-trained on cells similar to the cells of interest can considerably decrease training time.

## Variable initialization

We first initialize some variables that will be used throughout the tutorial. `test_mode=1` should be used for testing purposes, which speeds up computations by only using a subset of the data. For real applications, `test_mode=0` should be used.

In [1]:
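# Print a command between two banner lines, then execute it.
function run {
  local cmd="$*"   # join all arguments into one command string
  echo
  echo "#################################"
  echo "$cmd"
  echo "#################################"
  eval "$cmd"
}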

test_mode=1 # set this variable to 0 for production
data_dir="../../data"
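
The `run` helper above builds commands as strings and executes them with `eval`. As a hedged alternative sketch, the same pattern can be written with a bash array, which avoids `eval` and keeps each argument intact. The name `run_args` is hypothetical and not part of DeepCpG, and `echo` stands in for the real call so that the snippet runs without DeepCpG installed.

```bash
# Variant of `run` that receives the command as separate arguments,
# so no `eval` and no re-splitting of a string is needed.
run_args() {
  echo
  echo "#################################"
  printf '%q ' "$@"; echo
  echo "#################################"
  "$@"
}

test_mode=1
# `echo` is a stand-in for actually executing dcpg_data.py.
cmd=(echo dcpg_data.py --out_dir ./data --cpg_wlen 50)
if [[ $test_mode -eq 1 ]]; then
  cmd+=(--nb_sample 10000)   # extra option only in test mode
fi
run_args "${cmd[@]}"
```

Removing the leading `echo` from the array would execute the command itself. The rest of this tutorial keeps the string-based `run` helper.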



## Creating DeepCpG data files

First, we create DeepCpG data files using `dcpg_data.py`. Since we will fine-tune a CpG model, we do not extract sequence windows. Otherwise, `--dna_files` and `--dna_wlen` must be specified.

In [2]:
dcpg_data="./data"
cmd="dcpg_data.py
    --cpg_profiles $data_dir/cpg/*.tsv
    --out_dir $dcpg_data
    --cpg_wlen 50
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_sample 10000
        "
fi
eval $cmd

INFO (2017-02-05 21:03:37,367): Reading single-cell profiles ...
INFO (2017-02-05 21:03:37,844): 10000 samples
INFO (2017-02-05 21:03:37,845): --------------------------------------------------------------------------------
INFO (2017-02-05 21:03:37,845): Chromosome 1 ...
INFO (2017-02-05 21:03:37,882): 10000 / 10000 (100.0%) sites matched minimum coverage filter
INFO (2017-02-05 21:03:37,882): Chunk 	1 / 1
INFO (2017-02-05 21:03:37,939): Extracting CpG neighbors ...
INFO (2017-02-05 21:03:39,508): Done!
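
Before moving on, it can help to confirm that the output directory actually contains data files. The following is a minimal sketch, assuming the data files end in `.h5` as in the training command used later in this tutorial. The function name `count_h5` is hypothetical.

```bash
# Count the .h5 files in a directory without failing when there are none.
count_h5() {
  local dir=$1
  local files=()
  shopt -s nullglob            # an unmatched glob expands to nothing
  files=("$dir"/*.h5)
  shopt -u nullglob            # note: switches nullglob off again unconditionally
  echo "${#files[@]}"
}

dcpg_data="./data"
n=$(count_h5 "$dcpg_data")     # runs in a subshell, so shopt changes do not leak
if (( n == 0 )); then
  echo "no data files found in $dcpg_data" >&2
fi
```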


## Downloading a pre-trained model

`dcpg_download.py` downloads a pre-trained model from the DeepCpG model zoo. The available models and their descriptions are listed on the [model zoo website](https://github.com/cangermueller/deepcpg/blob/master/docs/models.md), and the model names can be retrieved with `dcpg_download.py --show`:

In [3]:
dcpg_download.py --show

Available models: https://github.com/cangermueller/deepcpg/blob/master/docs/models.md
Hou2016_HCC_cpg
Hou2016_HCC_dna
Hou2016_HCC_joint
Hou2016_HepG2_cpg
Hou2016_HepG2_dna
Hou2016_HepG2_joint
Hou2016_mESC_cpg
Hou2016_mESC_dna
Hou2016_mESC_joint
Smallwood2014_2i_cpg
Smallwood2014_2i_dna
Smallwood2014_2i_joint
Smallwood2014_serum_cpg
Smallwood2014_serum_dna
Smallwood2014_serum_joint


A model name consists of three parts separated by '_'. The first part corresponds to the publication, the second to the cell type, and the third to the model type (CpG, DNA, or joint model). Cells from 'Hou2016' were profiled using scRRBS-seq, cells from 'Smallwood2014' using scBS-seq. 'HCC' and 'HepG2' are human cancer cells; all others are mouse cells. Use the model whose cell type is most similar to the cell type of interest. More information about the available models can be found [here](https://github.com/cangermueller/deepcpg/blob/master/docs/models.md).
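
The naming scheme can be illustrated with a small bash sketch that splits a model name at '_'. The variable names are chosen here for illustration, and the snippet assumes that no part of the name contains an underscore itself.

```bash
# Split a model zoo name into its three '_'-separated parts.
model_name="Smallwood2014_2i_cpg"
IFS='_' read -r publication cell_type model_type <<< "$model_name"

echo "publication: $publication"
echo "cell type:   $cell_type"
echo "model type:  $model_type"
```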

Since we are dealing with 2i cells and want to train a CpG model, we will use 'Smallwood2014_2i_cpg':

In [4]:
pretrained_model="./models/Smallwood2014_2i_cpg"
cmd="dcpg_download.py
  $(basename $pretrained_model)
  -o $pretrained_model
  "
run $cmd


#################################
dcpg_download.py Smallwood2014_2i_cpg -o ./models/Smallwood2014_2i_cpg
#################################
INFO (2017-02-05 21:03:41,130): Downloading model ...
INFO (2017-02-05 21:03:41,131): Model URL: http://www.ebi.ac.uk/~angermue/deepcpg/alias/f89b2e8344012d73e95504da06bcf378
--2017-02-05 21:03:41--  http://www.ebi.ac.uk/~angermue/deepcpg/alias/f89b2e8344012d73e95504da06bcf378
Resolving www.ebi.ac.uk (www.ebi.ac.uk)... 193.62.193.80
Connecting to www.ebi.ac.uk (www.ebi.ac.uk)|193.62.193.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31068468 (30M) [text/plain]
Saving to: ‘./models/Smallwood2014_2i_cpg/model.zip’


2017-02-05 21:03:48 (3.84 MB/s) - ‘./models/Smallwood2014_2i_cpg/model.zip’ saved [31068468/31068468]

Archive:  ./models/Smallwood2014_2i_cpg/model.zip
  inflating: ./models/Smallwood2014_2i_cpg/model.h5  
  inflating: ./models/Smallwood2014_2i_cpg/model.json  
  inflating: ./models/Smallwo

The command downloads the model files and stores them in the output directory, including the weights and a JSON file with the model description:

In [5]:
ls $pretrained_model

model.h5               model_weights.h5       model_weights_val.h5
model.json             model_weights_train.h5


`model.json` is the model description, and `model_weights_train.h5` and `model_weights_val.h5` are the weights that yielded the highest performance on the training and validation set, respectively. `model.h5` combines `model.json` and `model_weights_val.h5`.
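
As a quick sanity check after the download, the presence of the files described above can be verified from the shell. This is a minimal sketch; `check_model_dir` is a hypothetical helper and not part of DeepCpG.

```bash
# Return 0 if the directory contains the model description and both
# weight files, and report every missing file on stderr otherwise.
check_model_dir() {
  local dir=$1 missing=0 f
  for f in model.json model_weights_train.h5 model_weights_val.h5; do
    if [[ ! -f "$dir/$f" ]]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return $missing
}

pretrained_model="./models/Smallwood2014_2i_cpg"
if check_model_dir "$pretrained_model"; then
  echo "all expected model files are present"
else
  echo "model directory is incomplete" >&2
fi
```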

## Fine-tuning the model

To fine-tune the downloaded model, we use `--cpg_model` followed by the model directory, and `--fine_tune` to only train the output layers.

`--cpg_model $pretrained_model` is equivalent to `--cpg_model $pretrained_model/model.json $pretrained_model/model_weights_val.h5`. To fine-tune the weights with the highest performance on the training set, use `model_weights_train.h5` as input.

Without `--fine_tune`, `dcpg_train.py` will train all weights, not only the output layers. This is recommended if the cells that were used for the pre-trained model are only distantly related to the cells of interest, e.g. if the cell types do not match. Training all weights can lead to higher prediction performance, but also increases training time.
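
The explicit two-file form of `--cpg_model` described above can be assembled as follows. This is a sketch only: the `weights` switch is a hypothetical convenience variable, and the snippet prints the arguments instead of calling `dcpg_train.py`.

```bash
pretrained_model="./models/Smallwood2014_2i_cpg"
weights="train"   # hypothetical switch: "train" or "val"

# Explicit form: model description followed by the chosen weights file.
cpg_model_args=(
  --cpg_model
  "$pretrained_model/model.json"
  "$pretrained_model/model_weights_${weights}.h5"
)
echo "${cpg_model_args[@]}"
```

The printed arguments would take the place of `--cpg_model $pretrained_model` in the training command.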

In [6]:
cmd="dcpg_train.py
   ./data/*.h5
  --val_files ./data/*.h5
  --cpg_model $pretrained_model
  --out_dir ./models/cpg
  --fine_tune
  "
if [[ $test_mode -eq 1 ]]; then
  cmd="$cmd
    --nb_epoch 2
    --nb_train_sample 1000
    --nb_val_sample 1000
    "
else
  cmd="$cmd
    --nb_epoch 25
    --early_stopping 5
    "
fi
run $cmd



#################################
dcpg_train.py ./data/c1_000000-010000.h5 --val_files ./data/c1_000000-010000.h5 --cpg_model ./models/Smallwood2014_2i_cpg --out_dir ./models/cpg --fine_tune --nb_epoch 2 --nb_train_sample 1000 --nb_val_sample 1000
#################################
Using TensorFlow backend.
INFO (2017-02-05 21:04:02,579): Building model ...
Replicate names:
BS27_1_SER, BS27_3_SER, BS27_5_SER, BS27_6_SER, BS27_8_SER

INFO (2017-02-05 21:04:02,620): Loading existing CpG model ...
INFO (2017-02-05 21:04:02,620): Using model files ./models/Smallwood2014_2i_cpg/model.json ./models/Smallwood2014_2i_cpg/model_weights.h5
INFO (2017-02-05 21:04:03,921): Replicate names differ: Copying weights to new model ...
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
cpg/state (InputLayer)           (None, 5, 50)         0 

## Model evaluation 

Finally, we evaluate our fine-tuned model and impute methylation profiles using `dcpg_eval.py`:

In [7]:
eval_dir="./eval"
mkdir -p $eval_dir

cmd="dcpg_eval.py
    $dcpg_data/*.h5
    --model_files ./models/cpg
    --out_data $eval_dir/data.h5
    --out_report $eval_dir/report.tsv
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_sample 1000
        "
fi
run $cmd


#################################
dcpg_eval.py ./data/c1_000000-010000.h5 --model_files ./models/cpg --out_data ./eval/data.h5 --out_report ./eval/report.tsv --nb_sample 1000
#################################
Using TensorFlow backend.
INFO (2017-02-05 21:04:35,849): Loading model ...
INFO (2017-02-05 21:04:36,671): Loading data ...
INFO (2017-02-05 21:04:36,682): Predicting ...
INFO (2017-02-05 21:04:36,697):  128/1000 (12.8%)
INFO (2017-02-05 21:04:36,823):  256/1000 (25.6%)
INFO (2017-02-05 21:04:36,943):  384/1000 (38.4%)
INFO (2017-02-05 21:04:37,063):  512/1000 (51.2%)
INFO (2017-02-05 21:04:37,194):  640/1000 (64.0%)
INFO (2017-02-05 21:04:37,322):  768/1000 (76.8%)
INFO (2017-02-05 21:04:37,432):  896/1000 (89.6%)
INFO (2017-02-05 21:04:37,530): 1000/1000 (100.0%)
  'precision', 'predicted', average, warn_for)
           output       auc       acc       tpr       tnr        f1       mcc      n
2  cpg/BS27_5_SER  0.614029  0.850000  0.989362  0.031250  0.918519
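
To pull a single metric out of a report from the shell, a small `awk` helper can select a column by name. This is a hedged sketch: it assumes a tab-separated file with a header row, and the sample input below only reuses the first columns of the table printed above. The actual layout of `report.tsv` may differ, and `tsv_column` is a hypothetical helper.

```bash
# Print the values of one named column from a tab-separated file
# whose first line is a header row.
tsv_column() {
  local file=$1 column=$2
  awk -F'\t' -v col="$column" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
    idx     { print $idx }
  ' "$file"
}

# Illustrative input only: header and values copied from the printed table.
sample=$(mktemp)
printf 'output\tauc\tacc\ncpg/BS27_5_SER\t0.614029\t0.850000\n' > "$sample"
tsv_column "$sample" auc
rm -f "$sample"
```

If the requested column name does not occur in the header, the helper prints nothing.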