### Import dependencies

In [1]:
%env HDF5_USE_FILE_LOCKING=FALSE

env: HDF5_USE_FILE_LOCKING=FALSE


In [19]:
import bpnet
from bpnet.cli.contrib import ContribFile
from bpnet.plot.tracks import plot_tracks, to_neg

import os
import uuid
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import clear_output, HTML
from pathlib import Path
import pandas as pd
import numpy as np
clear_output()

#### Optional: Set up wandb

In [3]:
import wandb

wandb.init(project='bpnet-training', entity='an1lam')

2020-08-31 20:32:44,693 [INFO] file/dir created: /home/stephenmalina/project/src/wandb/run-20200831_203242-lmpui8st/wandb-metadata.json
2020-08-31 20:32:44,750 [INFO] system metrics and metadata threads started
2020-08-31 20:32:44,751 [INFO] checking resume status, waiting at most 10 seconds
2020-08-31 20:32:44,865 [INFO] resuming run from id: UnVuOnYxOmxtcHVpOHN0OmJwbmV0LXRyYWluaW5nOmFuMWxhbQ==
2020-08-31 20:32:44,871 [INFO] upserting run before process can begin, waiting at most 10 seconds
2020-08-31 20:32:44,963 [INFO] saving pip packages
2020-08-31 20:32:44,965 [INFO] initializing streaming files api
2020-08-31 20:32:44,969 [INFO] unblocking file change observer, beginning sync with W&B servers


W&B Run: https://app.wandb.ai/an1lam/bpnet-training/runs/lmpui8st

2020-08-31 20:32:44,979 [INFO] shutting down system stats and metadata service
2020-08-31 20:32:45,692 [INFO] file/dir modified: /home/stephenmalina/project/src/wandb/run-20200831_203242-lmpui8st/config.yaml
2020-08-31 20:32:45,752 [INFO] stopping streaming files and file change observer
2020-08-31 20:32:45,801 [INFO] file/dir created: /home/stephenmalina/project/src/wandb/run-20200831_203242-lmpui8st/wandb-summary.json
2020-08-31 20:32:45,802 [INFO] file/dir created: /home/stephenmalina/project/src/wandb/run-20200831_203242-lmpui8st/requirements.txt
2020-08-31 20:32:45,808 [INFO] file/dir created: /home/stephenmalina/project/src/wandb/run-20200831_203242-lmpui8st/wandb-history.jsonl
2020-08-31 20:32:45,822 [INFO] file/dir created: /home/stephenmalina/project/src/wandb/run-20200831_203242-lmpui8st/wandb-events.jsonl
2020-08-31 20:32:45,833 [INFO] file/dir modified: /home/stephenmalina/project/src/wandb/run-20200831_203242-lmpui8st/wandb-metadata.json


In [4]:
# config variables
n_reps = 1

# file paths
config_dir = Path('./bpnet/') 

model_config_fname = 'ChIP-nexus-default.gin'
data_config_fname = 'ChIP-nexus.dataspec.yml'

# NOTE: this format has no day field (%d), so the stamp reads year-month-hour-minute-second
timestamp = datetime.now().strftime('%Y-%m-%H-%M-%S')
output_dir = f'../dat/res-bpnet-training-{timestamp}'
output_dir

'../dat/res-bpnet-training-2020-08-20-32-48'
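
The directory name printed above has five fields rather than six because the format string contains no `%d`. A minimal sketch of a variant that includes the day is shown here; `make_output_dir` and its `prefix` default are hypothetical names used only for illustration, and the fixed datetime is an example value rather than the actual run time.

```python
from datetime import datetime


def make_output_dir(now, prefix='../dat/res-bpnet-training'):
    """Return an output directory name stamped year-month-day-hour-minute-second."""
    stamp = now.strftime('%Y-%m-%d-%H-%M-%S')
    return f'{prefix}-{stamp}'


# Example value only; in the notebook this would be datetime.now().
fixed_now = datetime(2020, 8, 31, 20, 32, 48)

with_day = make_output_dir(fixed_now)
without_day = fixed_now.strftime('%Y-%m-%H-%M-%S')  # the format used above

print(with_day)
print(without_day)
```

Including the day keeps directory names unique across days and makes them sort chronologically.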

In [11]:
!cat {config_dir}/{data_config_fname}

task_specs:
  Oct4:
    pos_counts: /home/stephenmalina/project/dat/bpnet-manuscript-data/data/chip-nexus/Oct4/counts.pos.bw
    neg_counts: /home/stephenmalina/project/dat/bpnet-manuscript-data/data/chip-nexus/Oct4/counts.neg.bw
    peaks: /home/stephenmalina/project/dat/bpnet-manuscript-data/data/chip-nexus/Oct4/idr-optimal-set.summit.bed.gz
  Sox2:
    pos_counts: /home/stephenmalina/project/dat/bpnet-manuscript-data/data/chip-nexus/Sox2/counts.pos.bw
    neg_counts: /home/stephenmalina/project/dat/bpnet-manuscript-data/data/chip-nexus/Sox2/counts.neg.bw
    peaks: /home/stephenmalina/project/dat/bpnet-manuscript-data/data/chip-nexus/Sox2/idr-optimal-set.summit.bed.gz
  Nanog:
    pos_counts: /home/stephenmalina/project/dat/bpnet-manuscript-data/data/chip-nexus/Nanog/counts.pos.bw
    neg_counts: /home/stephenmalina/project/dat/bpnet-manuscript-data/data/chip-nexus/Nanog/counts.neg.bw
    peaks: /home/stephenmalina/project/dat/bpnet-manuscript-data/data/chip-nexus/Nanog/

### Data stats

In [13]:
# chromosome names present in the Sox2 peak set
!zcat /home/stephenmalina/project/dat/bpnet-manuscript-data/data/chip-nexus/Sox2/idr-optimal-set.summit.bed.gz \
    | cut -f 1 | sort -u

chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrX
chrY


Each task (or TF) can specify a set of peaks associated with it. Here is the number of peaks per TF that we will use in this tutorial:

In [16]:
tasks = ['Oct4', 'Sox2', 'Nanog', 'Klf4']
data_dir = '/home/stephenmalina/project/dat/bpnet-manuscript-data'

# number of peaks per task
for task in tasks:
    print(task)
    !zcat {data_dir}/data/chip-nexus/{task}/idr-optimal-set.summit.bed.gz | wc -l

Oct4
25849
Sox2
10999
Nanog
56459
Klf4
57601
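
The same statistics can be computed without shelling out to `zcat`. The following is a rough sketch using only the standard library; `peak_stats` is a hypothetical helper, and the toy file with made-up coordinates stands in for a real `idr-optimal-set.summit.bed.gz`. Unlike `wc -l`, it skips blank lines.

```python
import gzip
import os
import tempfile
from collections import Counter


def peak_stats(bed_gz_path):
    """Return (total peaks, peaks per chromosome) for a gzipped BED file."""
    per_chrom = Counter()
    with gzip.open(bed_gz_path, 'rt') as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            chrom = line.split('\t', 1)[0]  # first BED column is the chromosome
            per_chrom[chrom] += 1
    return sum(per_chrom.values()), per_chrom


# Toy stand-in for a summit file (coordinates are made up).
toy_rows = ['chr1\t100\t101', 'chr1\t500\t501', 'chrX\t42\t43']
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, 'toy.summit.bed.gz')
    with gzip.open(toy_path, 'wt') as handle:
        handle.write('\n'.join(toy_rows) + '\n')
    n_peaks, per_chrom = peak_stats(toy_path)

print(n_peaks, sorted(per_chrom))
```

Returning the per-chromosome counter also covers the chromosome listing from the earlier cell, since its keys are the unique chromosome names.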


## 2. Train the model

Having specified `dataspec.yml`, we are now ready to train the model with:

```
bpnet train <dataspec.yml> <output dir> [optional flags]
```
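
When the command is launched from Python rather than through `!`, it can help to assemble the arguments as a list. The sketch below is one way to do that; `build_train_command` is a hypothetical helper, and the only flags it knows about are the ones used in the training cell of this notebook (`--premade`, `--config`, `--run-id`, `--wandb`, `--in-memory`, `--override`). It makes no claim about other options of the CLI.

```python
import shlex


def build_train_command(dataspec, output_dir, premade=None, config=None,
                        run_id=None, wandb_project=None, in_memory=False,
                        overrides=None):
    """Assemble a `bpnet train` argument list suitable for subprocess.run."""
    cmd = ['bpnet', 'train', dataspec]
    if premade:
        cmd.append(f'--premade={premade}')
    if config:
        cmd.append(f'--config={config}')
    cmd.append(output_dir)
    if run_id:
        cmd += ['--run-id', run_id]
    if wandb_project:
        cmd.append(f'--wandb={wandb_project}')
    if in_memory:
        cmd.append('--in-memory')
    if overrides:
        # Join bindings with '; ' as done for --override in this notebook.
        bindings = '; '.join(f'{key}={value}' for key, value in overrides.items())
        cmd.append(f'--override={bindings}')
    return cmd


cmd = build_train_command(
    'ChIP-nexus.dataspec.yml', '../dat/example-output',  # example output path
    premade='bpnet9', config='ChIP-nexus-default.gin',
    in_memory=True, overrides={'train.epochs': 1, 'train.seed': 0},
)
# Quote each argument so the printed line can be pasted into a shell.
print(' '.join(shlex.quote(arg) for arg in cmd))
```

Passing a list to `subprocess.run(cmd, cwd=config_dir)` avoids shell quoting issues with the `--override` string, which contains spaces and semicolons.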


We will use a pre-made model [bpnet9](../bpnet/premade/bpnet9.gin) as a starting point and override a few of its parameters in the gin config file `ChIP-nexus-default.gin`. Specifically, the config printed below
- trains on four tasks (Oct4, Sox2, Nanog and Klf4)
- uses 9 layers of dilated convolutions (`n_dil_layers = 9`) with 64 filters
- uses an input sequence width of 1000 bp (`seq_width = 1000`)
- uses a batch size of 128 and a learning rate of 0.004

In [17]:
!cat {config_dir}/{model_config_fname} 
# NOTE: test_chr will be also excluded similar to 'exclude_chr'

b_loss_weight = 0
c_loss_weight = 10
p_loss_weight = 1
filters = 64
tconv_kernel_size = 25
lr = 0.004
n_dil_layers = 9
train.batch_size = 128
merge_profile_reg = False
dataspec = 'ChIP-nexus.dataspec.yml'

batchnorm = False

padding = 'same'
seq_width = 1000

tasks = ['Oct4', 'Sox2', 'Nanog', 'Klf4']


Have a look at the original gin file of bpnet9 here: https://github.com/kundajelab/bpnet/blob/master/bpnet/premade/bpnet9-ginspec.gin. For more information on using gin files see <https://github.com/google/gin-config>. 
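
To check a flat config like the one printed above against what the text says, the bindings can be read into a dictionary. The following is a minimal sketch and not the gin parser: `parse_simple_bindings` is a hypothetical helper that only handles `name = literal` lines, does not understand gin macros or references, and would misread a `#` inside a string value. The example text is an excerpt of the config shown above.

```python
import ast


def parse_simple_bindings(text):
    """Parse flat `name = literal` lines into a dict (not a full gin parser)."""
    bindings = {}
    for raw in text.splitlines():
        line = raw.split('#', 1)[0].strip()  # drop comments and whitespace
        if not line or '=' not in line:
            continue
        key, value = (part.strip() for part in line.split('=', 1))
        bindings[key] = ast.literal_eval(value)  # numbers, strings, lists, booleans
    return bindings


config_excerpt = """
# excerpt of the config printed above
n_dil_layers = 9
train.batch_size = 128
batchnorm = False
padding = 'same'
seq_width = 1000
tasks = ['Oct4', 'Sox2', 'Nanog', 'Klf4']
"""

bindings = parse_simple_bindings(config_excerpt)
print(bindings['n_dil_layers'], bindings['seq_width'], bindings['tasks'])
```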

To track model training and evaluation, we will use [wandb](http://wandb.com/) by adding `--wandb=an1lam/bpnet-training` to `bpnet train`. You can navigate to https://app.wandb.ai/an1lam/bpnet-training to see the training progress.

Let's train!

In [20]:
# path to the first model of the ensemble (replicate 0)
example_model_dir = os.path.join(output_dir, 'output_ensemble', '0')
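
`example_model_dir` assumes a layout in which `output_ensemble/<i>` points to the run directory of replicate `i`. A rough Python equivalent of creating such a relative symlink is sketched below; `link_ensemble_member` is a hypothetical helper, the directory names are examples, and it only replaces an existing link or file, not a real directory.

```python
import os
import tempfile
from pathlib import Path


def link_ensemble_member(output_dir, run_id, index):
    """Point output_ensemble/<index> at output_dir/<run_id> with a relative symlink."""
    output_dir = Path(output_dir)
    ensemble_dir = output_dir / 'output_ensemble'
    ensemble_dir.mkdir(parents=True, exist_ok=True)
    link = ensemble_dir / str(index)
    if link.is_symlink() or link.exists():
        link.unlink()  # replace a stale link, similar to `rm -f`
    # Relative target keeps the link valid if output_dir is moved as a whole.
    target = os.path.relpath(output_dir / run_id, start=ensemble_dir)
    link.symlink_to(target, target_is_directory=True)
    return link


with tempfile.TemporaryDirectory() as tmp:
    run_id = 'example-run'  # stands in for a timestamp + uuid run id
    (Path(tmp) / run_id).mkdir()
    link = link_ensemble_member(tmp, run_id, 0)
    link_target = os.readlink(link)
    resolved_ok = link.resolve() == (Path(tmp) / run_id).resolve()

print(link_target, resolved_ok)
```

Creating `output_ensemble` before linking matters because a symlink cannot be placed inside a directory that does not exist yet.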

In [None]:
# Train each replicate for 1 epoch (set via train.epochs in --override)
for i in range(n_reps):
    # set up a new run_id (could be done automatically, but then the output directory would change)
    run_id = datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + "_" + str(uuid.uuid4())
    !cd {config_dir} && bpnet train {data_config_fname} --premade=bpnet9 \
        --config={model_config_fname} {output_dir} \
        --run-id '{run_id}' --wandb=an1lam/bpnet-training --in-memory \
        --override='train.epochs=1; train.seed={i}'
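    # softlink the new output directory (output_ensemble must exist before ln can create the link)
    !mkdir -p {output_dir}/output_ensemble && rm -f {output_dir}/output_ensemble/{i} && ln -srf {output_dir}/{run_id} {output_dir}/output_ensemble/{i}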

Using TensorFlow backend.
OMP: Info #212: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #210: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0
OMP: Info #156: KMP_AFFINITY: 1 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #159: KMP_AFFINITY: 1 packages x 1 cores/pkg x 1 threads/core (1 total cores)
OMP: Info #214: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 
OMP: Info #250: KMP_AFFINITY: pid 3111 tid 3111 thread 0 bound to OS proc set 0


2020-08-31 20:47:57,033 [INFO] NumExpr defaulting to 2 threads.
INFO [08-31 20:47:59] Using wandb. Running wandb.init()
wandb: Tracking run with wandb version 0.9.6
wandb: Run data is saved locally in ../dat/res-bpnet-training-2020-08-20-32-48/run-20200831_204759-2020-08-31_20-47-52_37d02a08-a509-4445-998d-7f0bd29b07d1
wandb: Syncing run 2020-08-31_20-47-52_37d02a08-a509-4445-998d-7f0bd29

100%|█████████████████████████████████████████| 711/711 [09:37<00:00,  1.23it/s]
100%|█████████████████████████████████████████| 229/229 [02:33<00:00,  1.50it/s]
100%|█████████████████████████████████████████| 711/711 [07:55<00:00,  1.50it/s]
Epoch 1/1
OMP: Info #250: KMP_AFFINITY: pid 3111 tid 3199 thread 1 bound to OS proc set 0
OMP: Info #250: KMP_AFFINITY: pid 3111 tid 3200 thread 2 bound to OS proc set 0
OMP: Info #250: KMP_AFFINITY: pid 12046 tid 12046 thread 3 bound to OS proc set 0
OMP: Info #250: KMP_AFFINITY: pid 12180 tid 12180 thread 4 bound to OS proc set 0
 47%|███████████████████▍                     | 108/228 [06:04<06:42,  3.36s/it]

In [None]:
!ls -latr {example_model_dir}/