Required inputs for Akita are:
* binned Hi-C or Micro-C data stored in cooler format (https://github.com/mirnylab/cooler)
* Genome FASTA file

First, make sure you have a FASTA file that is consistent with the genome assembly used for the coolers. Either add a symlink to it in the data directory or download the simplified, machine-learning-friendly version in the next cell.

In [1]:
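import os, subprocess

# Download and unpack the simplified hg38 FASTA only if it is not already present.
if not os.path.isfile('./data/hg38.ml.fa'):
    os.makedirs('./data', exist_ok=True)  # curl -o fails if the directory is missing
    print('downloading hg38.ml.fa')
    subprocess.call('curl -o ./data/hg38.ml.fa.gz https://storage.googleapis.com/basenji_barnyard/hg38.ml.fa.gz', shell=True)
    subprocess.call('gunzip ./data/hg38.ml.fa.gz', shell=True)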
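
One way to check that the FASTA and the coolers agree is to compare chromosome names and lengths. The following is a minimal sketch, not part of the Akita tooling: `fasta_lengths` and `mismatched_chroms` are hypothetical helper names, and the FASTA is assumed to be an uncompressed plain-text file.

```python
def fasta_lengths(fasta_path):
    """Return {sequence name: length} by scanning a plain-text FASTA file."""
    lengths = {}
    name = None
    with open(fasta_path) as fasta:
        for line in fasta:
            line = line.strip()
            if line.startswith('>'):
                # Header line: use the first whitespace-delimited token as the name.
                tokens = line[1:].split()
                name = tokens[0] if tokens else None
                if name is not None:
                    lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def mismatched_chroms(fasta_sizes, cooler_sizes):
    """List cooler chromosomes that are missing from the FASTA or differ in length."""
    problems = []
    for chrom, size in cooler_sizes.items():
        if chrom not in fasta_sizes:
            problems.append((chrom, 'missing from FASTA'))
        elif fasta_sizes[chrom] != int(size):
            problems.append((chrom, 'length %d vs %d' % (fasta_sizes[chrom], int(size))))
    return problems
```

The `cooler_sizes` argument can be any mapping of chromosome name to length. With the cooler package installed, the `chromsizes` attribute of a `cooler.Cooler` object should provide one. Scanning a whole genome in pure Python is slow, so reading a FASTA index instead may be preferable for large files.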


Download a few Micro-C datasets, processed using distiller (https://github.com/mirnylab/distiller-nf), binned to 2048 bp, and iteratively corrected.

In [151]:
ls ./data/coolers/

[0m[01;36mH1hESC_hg38_4DNFI1O6IL1Q.mapq_30.2048.cool[0m@
[01;36mHFF_hg38_4DNFIP5EUOFX.mapq_30.2048.cool[0m@


In [152]:
### tbd where to store ##
# if not os.path.isfile('data/CNhs11760.bw'):
#     subprocess.call('curl -o data/CNhs11760.bw https://storage.googleapis.com/akita_tutorial_data/CNhs11760.bw', shell=True)
#     subprocess.call('curl -o data/CNhs12843.bw https://storage.googleapis.com/basenji_tutorial_data/CNhs12843.bw', shell=True)

Write out these cooler files and labels to a samples table.

In [18]:
# One header row, then one row per cooler file.
lines = [['index','identifier','file','clip','sum_stat','description']]
lines.append(['0', 'HFF', './data/coolers/HFF_hg38_4DNFIP5EUOFX.mapq_30.2048.cool', '2', 'sum', 'HFF'])
lines.append(['1', 'H1hESC', './data/coolers/H1hESC_hg38_4DNFI1O6IL1Q.mapq_30.2048.cool', '2', 'sum', 'H1hESC'])

# Write the table as tab-separated text; the with block closes the file.
with open('data/microc_cools.txt', 'w') as samples_out:
    for line in lines:
        print('\t'.join(line), file=samples_out)
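
Before running the data pipeline, it can help to read the samples table back and confirm that every listed cooler exists. This is a minimal sketch under the assumption that the table is tab-separated with a `file` column, as written above; `check_samples_table` is a hypothetical helper, not part of Akita.

```python
import csv
import os


def check_samples_table(table_path):
    """Return the cooler paths listed in the table that do not exist on disk."""
    missing = []
    with open(table_path, newline='') as table:
        for row in csv.DictReader(table, delimiter='\t'):
            if not os.path.isfile(row['file']):
                missing.append(row['file'])
    return missing
```

For example, `check_samples_table('data/microc_cools.txt')` returns an empty list when all paths resolve, including symlinked coolers whose targets exist.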

Next, we want to choose genomic sequences to form batches for stochastic gradient descent, divide them into training/validation/test sets, and construct TFRecords to provide to downstream programs.

The script [akita_data.py](https://github.com/calico/basenji/blob/master/bin/akita_data.py) implements this procedure.

The most relevant options here are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| --sample | 0.05 | Down-sample the genome to 5% to speed things up here. |
| -g | data/hg38_gaps_binsize2048_numconseq10.bed | Dodge large-scale unmappable regions determined from filtered cooler bins. |
| -l | 1048576 | Sequence length. |
| --crop | 65536 | Crop edges of matrix so loss is only computed over the central region. |
| --local | True | Run locally, as opposed to on a SLURM scheduler. |
| -o | data/1m | Output directory. |
| -p | 8 | Use multiple concurrent processes to read/write. |
| -t | .1 | Hold out 10% of sequences for testing. |
| -v | .1 | Hold out 10% of sequences for validation. |
| -w | 2048 | Pool the nucleotide-resolution values to 2048 bp bins. |
| fasta_file | data/hg38.ml.fa | FASTA file to extract sequences from. |
| targets_file | data/microc_cools.txt | Target table with cooler paths. |
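
The sequence length, bin size, and crop together determine the size of each target. The sketch below works through that arithmetic. It assumes `--crop` is removed from each edge of the map and that the targets are the flattened upper triangle with the first two diagonals excluded. The diagonal offset of 2 is an assumption not stated in this tutorial; it is chosen because it reproduces the `target_lengths [99681]` value reported in the training log later on. `target_vector_length` is a hypothetical helper.

```python
def target_vector_length(seq_length, bin_size, crop_bp, diagonal_offset):
    """Length of the flattened upper-triangular target for one sequence."""
    bins = seq_length // bin_size          # 1048576 / 2048 = 512 bins per sequence
    crop_bins = crop_bp // bin_size        # 65536 / 2048 = 32 bins cropped per edge
    cropped = bins - 2 * crop_bins         # 512 - 64 = 448 bins remain
    side = cropped - diagonal_offset       # ASSUMPTION: diagonals below the offset are dropped
    return side * (side + 1) // 2          # number of entries in the upper triangle


length = target_vector_length(1048576, 2048, 65536, diagonal_offset=2)
print(length)
```

Under these assumptions each 1 Mb sequence corresponds to a 448 x 448 contact map, of which only the entries at least two bins off the diagonal are used.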

In [33]:
! akita_data.py --sample 0.05 -g ./data/hg38_gaps_binsize2048_numconseq10.bed -l 1048576 --crop 65536 --local -o ./data/1m --as_obsexp -p 8 -t .1 -v .1 -w 2048 --snap 2048 --stride_train 262144 --stride_test 32768 ./data/hg38.ml.fa ./data/microc_cools.txt


Contigs divided into
 Train:   413 contigs, 2078450861 nt (0.8036)
 Valid:    47 contigs,  254228224 nt (0.0983)
 Test:     48 contigs,  253678336 nt (0.0981)
writing sequences to BED
akita_data_read.py --crop 65536 -k 0 -w 2048 --clip 2.000000 --as_obsexp ./data/coolers/HFF_hg38_4DNFIP5EUOFX.mapq_30.2048.cool ./data/1m/sequences.bed ./data/1m/seqs_cov/0.h5
akita_data_read.py --crop 65536 -k 0 -w 2048 --clip 2.000000 --as_obsexp ./data/coolers/H1hESC_hg38_4DNFI1O6IL1Q.mapq_30.2048.cool ./data/1m/sequences.bed ./data/1m/seqs_cov/1.h5
  val_cur = ar_cur / armask_cur
[0m[0mbasenji_data_write.py -s 0 -e 256 ./data/hg38.ml.fa ./data/1m/sequences.bed ./data/1m/seqs_cov ./data/1m/tfrecords/train-0.tfr
basenji_data_write.py -s 256 -e 350 ./data/hg38.ml.fa ./data/1m/sequen

The training data is now saved in data/1m as TFRecords (for training, validation, and testing). *contigs.bed* contains the original large contiguous regions from which the sequences were taken, and *sequences.bed* contains the train/valid/test sequences.

In [37]:
! cut -f4 data/1m/sequences.bed | sort | uniq -c

    314 test
    350 train
    316 valid


In [38]:
! head -n3 data/1m/sequences.bed

chr2	183353344	184401920	train
chr3	120852480	121901056	train
chr16	12914688	13963264	train
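
The same summary can be produced in Python, together with a sanity check that each sequence has the requested length and starts on a bin boundary. This is a minimal sketch: `summarize_sequences` is a hypothetical helper, and the check that starts are multiples of 2048 assumes that is the effect of `--snap 2048`.

```python
from collections import Counter


def summarize_sequences(bed_lines, seq_length=1048576, snap=2048):
    """Count sequences per split and collect rows with unexpected coordinates."""
    counts = Counter()
    unexpected = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, start, end, label = fields[0], int(fields[1]), int(fields[2]), fields[3]
        counts[label] += 1
        # Each row should span exactly seq_length and start on a multiple of snap.
        if end - start != seq_length or start % snap != 0:
            unexpected.append((chrom, start, end, label))
    return counts, unexpected


example = [
    'chr2\t183353344\t184401920\ttrain',
    'chr3\t120852480\t121901056\ttrain',
    'chr16\t12914688\t13963264\ttrain',
]
counts, unexpected = summarize_sequences(example)
print(counts, unexpected)
```

To run it on the full file, pass an open file handle, for example `summarize_sequences(open('data/1m/sequences.bed'))`.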


Now train a model!

In [43]:
import json

# Edit the model parameters JSON so the final layer predicts only two targets.
params_file = './params.json'
with open(params_file) as params_open:
    params_tutorial = json.load(params_open)
params_tutorial['model']['head_hic'][-1]['units'] = 2
with open('./data/1m/params_tutorial.json', 'w') as params_tutorial_file:
    json.dump(params_tutorial, params_tutorial_file)
    
    
# Note: training with default parameters requires a GPU with >12 GB of RAM.
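
Rather than hard-coding the number of targets, the output units could be derived from the samples table so the two stay in sync. This is a minimal sketch, assuming the table has one header row and one row per target, and that the parameters follow the `params['model']['head_hic'][-1]['units']` layout used above; `set_num_targets` is a hypothetical helper.

```python
import csv


def set_num_targets(params, targets_path):
    """Set the final head_hic layer's units to the number of rows in the targets table."""
    with open(targets_path, newline='') as table:
        # DictReader consumes the header row, so this counts data rows only.
        num_targets = sum(1 for _ in csv.DictReader(table, delimiter='\t'))
    params['model']['head_hic'][-1]['units'] = num_targets
    return params
```

With the two-row table written earlier, `set_num_targets(params_tutorial, 'data/microc_cools.txt')` would set `units` to 2, matching the value assigned by hand.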

In [148]:
! akita_train.py -o ./data/1m/train_out/  ./data/1m/params_tutorial.json ./data/1m/


2020-03-21 16:57:25.921319: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-03-21 16:57:25.924618: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-03-21 16:57:25.924648: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DFLUCD07279
2020-03-21 16:57:25.924655: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DFLUCD07279
2020-03-21 16:57:25.924706: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 418.67.0
2020-03-21 16:57:25.924730: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.67.0
2020-03-21 16:57:25.924736: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 418.67.0
2020-03-21 16:57:25.924946: I tensorflow/core/platform/cpu_f

Total params: 751,506
Trainable params: 746,002
Non-trainable params: 5,504
__________________________________________________________________________________________________
None
model_strides [2048]
target_lengths [99681]
target_crops [-49585]
Epoch 1/1000
W0321 16:58:01.144318 140360698246912 deprecation.py:323] From /home/gfudenberg/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/clip_ops.py:157: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
2020-03-21 16:58:15.146353: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 805306368 exceeds 10% of system memory.
2020-03-21 16:58:15.732145: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 805306368 exceeds 10% of system memory.
2020-03-21 16:58:16.085202: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allo

See explore_model.ipynb for tips on investigating the output of a trained model.