Required inputs for Akita are:
* binned Hi-C or Micro-C data stored in cooler format (https://github.com/mirnylab/cooler)
* Genome FASTA file

First, make sure you have a FASTA file available that is consistent with the genome used for the coolers. Either add a symlink to it in the data directory or download the machine-learning-friendly simplified version in the next cell.

In [None]:
import os, subprocess, json

if not os.path.isfile('./data/hg38.ml.fa'):
    print('downloading hg38.ml.fa')
    subprocess.call('curl -o ./data/hg38.ml.fa.gz https://storage.googleapis.com/basenji_barnyard/hg38.ml.fa.gz', shell=True)
    subprocess.call('gunzip ./data/hg38.ml.fa.gz', shell=True)


downloading hg38.ml.fa
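
One way to check that the FASTA is consistent with the coolers is to compare contig names. The sketch below is a minimal illustration using only the standard library; the helper names `fasta_contig_names` and `missing_contigs` are hypothetical and not part of basenji. It assumes the chromosome names of a cooler have already been collected into a plain list (for example from `cooler.Cooler(path).chromnames`).

```python
def fasta_contig_names(fasta_path):
    """Return contig names (first token of each '>' header line) in file order."""
    names = []
    with open(fasta_path) as fasta_in:
        for line in fasta_in:
            if line.startswith('>'):
                fields = line[1:].split()
                if fields:
                    names.append(fields[0])
    return names


def missing_contigs(cooler_chroms, fasta_path):
    """Return the cooler chromosome names that have no matching FASTA record."""
    fasta_names = set(fasta_contig_names(fasta_path))
    return [chrom for chrom in cooler_chroms if chrom not in fasta_names]
```

An empty list from `missing_contigs` only shows that the names match. It does not prove that both files come from the same assembly, and scanning a whole-genome FASTA line by line can take a while.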


Download a few Micro-C datasets, processed using distiller (https://github.com/mirnylab/distiller-nf), binned to 2048 bp, and iteratively corrected.

In [None]:
if not os.path.exists('./data/coolers'):
    os.mkdir('./data/coolers')
# download each cooler only if it is not already present
for cooler_file in ['HFF_hg38_4DNFIP5EUOFX.mapq_30.2048.cool',
                    'H1hESC_hg38_4DNFI1O6IL1Q.mapq_30.2048.cool']:
    if not os.path.isfile('./data/coolers/' + cooler_file):
        subprocess.call('curl -o ./data/coolers/' + cooler_file +
                ' https://storage.googleapis.com/basenji_hic/tutorials/coolers/' + cooler_file, shell=True)

In [None]:
ls ./data/coolers/

Write out these cooler files and labels to a samples table.

In [3]:
lines = [['index','identifier','file','clip','sum_stat','description']]
lines.append(['0', 'HFF', './data/coolers/HFF_hg38_4DNFIP5EUOFX.mapq_30.2048.cool', '2', 'sum', 'HFF'])
lines.append(['1', 'H1hESC', './data/coolers/H1hESC_hg38_4DNFI1O6IL1Q.mapq_30.2048.cool', '2', 'sum', 'H1hESC'])

# write the table as tab-separated text; the with-block closes the file
with open('data/microc_cools.txt', 'w') as samples_out:
    for line in lines:
        print('\t'.join(line), file=samples_out)
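
Before passing the samples table to downstream scripts, it can help to read it back and confirm that every listed cooler exists. The following is a minimal sketch using the standard-library `csv` module; the helper names `read_samples_table` and `missing_cooler_files` are hypothetical and not part of basenji. It assumes the tab-separated layout with a header row written above.

```python
import csv
import os


def read_samples_table(table_path):
    """Read a tab-separated samples table into a list of dicts keyed by header."""
    with open(table_path, newline='') as table_in:
        return list(csv.DictReader(table_in, delimiter='\t'))


def missing_cooler_files(rows):
    """Return the 'file' entries that do not exist on disk."""
    return [row['file'] for row in rows if not os.path.isfile(row['file'])]
```

For example, `missing_cooler_files(read_samples_table('data/microc_cools.txt'))` should return an empty list once both downloads have finished.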

Next, we want to choose genomic sequences to form batches for stochastic gradient descent, divide them into training/validation/test sets, and construct TFRecords to provide to downstream programs.

The script [akita_data.py](https://github.com/calico/basenji/blob/master/bin/akita_data.py) implements this procedure.

The most relevant options here are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| --sample | 0.05 | Down-sample the genome to 5% to speed things up here. |
| -g | data/hg38_gaps_binsize2048_numconseq10.bed | Dodge large-scale unmappable regions determined from filtered cooler bins. |
| -l | 1048576 | Sequence length. |
| --crop | 65536 | Crop edges of matrix so loss is only computed over the central region. |
| --local | True | Run locally, as opposed to on a SLURM scheduler. |
| -o | data/1m | Output directory. |
| -p | 8 | Use multiple concurrent processes to read/write. |
| -t | .1 | Hold out 10% of sequences for testing. |
| -v | .1 | Hold out 10% of sequences for validation. |
| -w | 2048 | Pool the nucleotide-resolution values to 2048 bp bins. |
| fasta_file| data/hg38.ml.fa | FASTA file to extract sequences from. |
| targets_file | data/microc_cools.txt | Target table with cooler paths. |
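
The `-l`, `-w`, and `--crop` values together determine the size of the target matrix for each sequence. The short calculation below works this out. It assumes that `--crop` is given in bp and removed from each edge of the binned matrix; the function name `target_matrix_side` is hypothetical and used only for illustration.

```python
def target_matrix_side(seq_length, bin_size, crop_bp):
    """Number of bins along one side of the target matrix after cropping.

    Assumes crop_bp is removed from both edges (an assumption, see above).
    """
    bins_per_seq = seq_length // bin_size
    crop_bins = crop_bp // bin_size
    return bins_per_seq - 2 * crop_bins


bins_per_seq = 1048576 // 2048                      # 512 bins per sequence
crop_bins = 65536 // 2048                           # 32 bins cropped per edge
side = target_matrix_side(1048576, 2048, 65536)     # 512 - 2 * 32 = 448
```

Under these assumptions, each 1048576 bp sequence yields a 448 x 448 region over which the loss is computed.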

Note: make sure to export BASENJIDIR as outlined in the basenji installation tips 
(https://github.com/calico/basenji/tree/master/#installation). 

In [4]:
! akita_data.py --sample 0.05 -g ./data/hg38_gaps_binsize2048_numconseq10.bed -l 1048576 --crop 65536 --local -o ./data/1m --as_obsexp -p 8 -t .1 -v .1 -w 2048 --snap 2048 --stride_train 262144 --stride_test 32768 ./data/hg38.ml.fa ./data/microc_cools.txt


Contigs divided into
 Train:   413 contigs, 2078450861 nt (0.8036)
 Valid:    47 contigs,  254228224 nt (0.0983)
 Test:     48 contigs,  253678336 nt (0.0981)
writing sequences to BED
akita_data_read.py --crop 65536 -k 0 -w 2048 --clip 2.000000 --as_obsexp ./data/coolers/HFF_hg38_4DNFIP5EUOFX.mapq_30.2048.cool ./data/1m/sequences.bed ./data/1m/seqs_cov/0.h5
akita_data_read.py --crop 65536 -k 0 -w 2048 --clip 2.000000 --as_obsexp ./data/coolers/H1hESC_hg38_4DNFI1O6IL1Q.mapq_30.2048.cool ./data/1m/sequences.bed ./data/1m/seqs_cov/1.h5
  val_cur = ar_cur / armask_cur
[0m[0mbasenji_data_write.py -s 0 -e 256 ./data/hg38.ml.fa ./data/1m/sequences.bed ./data/1m/seqs_cov ./data/1m/tfrecords/train-0.tfr
basenji_data_write.py -s 256 -e 350 ./data/hg38.ml.fa ./data/1m/sequen

The training data is now saved in data/1m as TFRecords (for training, validation, and testing). *contigs.bed* contains the original large contiguous regions from which training sequences were taken, and *sequences.bed* contains the train/valid/test sequences.

In [5]:
! cut -f4 data/1m/sequences.bed | sort | uniq -c

    314 test
    350 train
    316 valid


In [6]:
! head -n3 data/1m/sequences.bed

chr2	183353344	184401920	train
chr3	120852480	121901056	train
chr16	12914688	13963264	train
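
The same checks can be done in Python. The sketch below counts the set labels in *sequences.bed* and collects the distinct interval lengths, which should all equal the `-l` value (for instance, 184401920 - 183353344 = 1048576 for the first row above). The helper name `summarize_sequences_bed` is hypothetical, and the four-column tab-separated layout is assumed from the output shown here.

```python
from collections import Counter


def summarize_sequences_bed(bed_path):
    """Return (label counts, set of interval lengths) for a 4-column BED file."""
    label_counts = Counter()
    lengths = set()
    with open(bed_path) as bed_in:
        for line in bed_in:
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            start, end, label = int(fields[1]), int(fields[2]), fields[3]
            label_counts[label] += 1
            lengths.add(end - start)
    return label_counts, lengths
```

Calling `summarize_sequences_bed('data/1m/sequences.bed')` should give label counts that match the `uniq -c` output, and a single length if every sequence was extracted at the requested size.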


Now train a model!

(Note: for training production-level models, please remove the --sample option when generating tfrecords)

In [13]:
# specify model parameters json to have only two targets
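params_file = './params.json'
# use a separate name for the file handle so the path variable is not overwritten
with open(params_file) as params_open:
    params_tutorial = json.load(params_open)
# set the final head_hic layer to output one unit per target
params_tutorial['model']['head_hic'][-1]['units'] = 2
with open('./data/1m/params_tutorial.json', 'w') as params_tutorial_file:
    json.dump(params_tutorial, params_tutorial_file)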
    
### note that training with default parameters requires a GPU with >12 GB of memory ###
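
The value 2 above matches the two coolers listed in the samples table. If the table changes, the same edit can be written as a small function so the number of output units is set from the number of targets. This is a minimal sketch: the helper name `with_num_targets` is hypothetical, and the only structure it assumes is the `['model']['head_hic'][-1]['units']` path used in the cell above.

```python
import copy


def with_num_targets(params, num_targets):
    """Return a copy of params whose final head_hic layer has num_targets units."""
    new_params = copy.deepcopy(params)  # leave the caller's dict untouched
    new_params['model']['head_hic'][-1]['units'] = num_targets
    return new_params
```

For example, `with_num_targets(params_tutorial, len(lines) - 1)` would use the number of data rows in the samples table built earlier (the first entry of `lines` is the header).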

In [2]:
!akita_train.py -o ./data/1m/train_out/  ./data/1m/params_tutorial.json ./data/1m/


2020-03-30 13:48:48.316119: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2020-03-30 13:48:48.322761: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3492170000 Hz
2020-03-30 13:48:48.323194: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55f5a90bf860 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-30 13:48:48.323209: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-11
OMP: Info #213: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #276: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #156: KMP_AFFINIT

Train for 175 steps, validate for 158 steps
Epoch 1/1000
OMP: Info #251: KMP_AFFINITY: pid 11656 tid 12265 thread 2 bound to OS proc set 2
OMP: Info #251: KMP_AFFINITY: pid 11656 tid 12270 thread 3 bound to OS proc set 3
OMP: Info #251: KMP_AFFINITY: pid 11656 tid 11703 thread 4 bound to OS proc set 4
OMP: Info #251: KMP_AFFINITY: pid 11656 tid 12287 thread 5 bound to OS proc set 5
OMP: Info #251: KMP_AFFINITY: pid 11656 tid 12288 thread 6 bound to OS proc set 6
OMP: Info #251: KMP_AFFINITY: pid 11656 tid 12289 thread 7 bound to OS proc set 7
OMP: Info #251: KMP_AFFINITY: pid 11656 tid 12290 thread 8 bound to OS proc set 8
OMP: Info #251: KMP_AFFINITY: pid 11656 tid 12291 thread 9 bound to OS proc set 9
OMP: Info #251: KMP_AFFINITY: pid 11656 tid 12293 thread 11 bound to OS proc set 11
OMP: Info #251: KMP_AFFINITY: pid 11656 tid 12292 thread 10 bound to OS proc set 10
OMP: Info #251: KMP_AFFINITY: pid 11656 tid 12294 thread 12 bound to OS proc set 0
OMP: Info #251: KMP_AFFINITY: pid 11

See explore_model.ipynb for tips on investigating the output of a trained model. 