Required inputs for Akita are:
* binned Hi-C or Micro-C data stored in cooler format (https://github.com/mirnylab/cooler)
* Genome FASTA file

First, make sure you have a FASTA file available that is consistent with the genome used for the coolers. Either add a symlink to it in the data directory or download the machine-learning-friendly simplified version in the next cell.
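One way to check that the FASTA and the cooler agree is to compare contig names and lengths. The sketch below is a minimal, standard-library-only illustration; the helper names `fasta_contig_lengths` and `compare_contigs` are hypothetical and not part of basenji or cooler.

```python
def fasta_contig_lengths(lines):
    """Return {contig_name: sequence_length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ''
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def compare_contigs(fasta_lengths, cooler_lengths):
    """Return (contigs missing from the FASTA, contigs whose lengths differ)."""
    missing = sorted(c for c in cooler_lengths if c not in fasta_lengths)
    mismatched = sorted(
        c for c, n in cooler_lengths.items()
        if c in fasta_lengths and fasta_lengths[c] != int(n)
    )
    return missing, mismatched
```

To use it, pass an open FASTA file handle to `fasta_contig_lengths`. Assuming the `cooler` package is installed, the second argument could be built with `dict(cooler.Cooler(uri).chromsizes)`; both returned lists should be empty for a consistent pair.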

In [None]:
import json
import os
import shutil
import subprocess

Write out these cooler files and labels to a samples table.

In [None]:
# Header row followed by one row per cooler file.
lines = [['index','identifier','file','clip','sum_stat','description']]
lines.append(['0', 'Dplus', '/home1/yxiao977/sc1/train_akita/data/5000res_0.5thres_hic_filter_both_bin.mcool::resolutions/5000', '2', 'sum', 'Dplus'])

# Write the table as tab-separated text; the context manager closes the file.
with open('/home1/yxiao977/sc1/train_akita/data/dinoflagellate_cools.txt', 'w') as samples_out:
    for line in lines:
        print('\t'.join(line), file=samples_out)
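Before passing the table on, it can help to read it back and confirm the columns are what was written. This is a minimal sketch using only the standard library; `read_samples_table` and `EXPECTED_COLUMNS` are hypothetical names introduced here, with the column list copied from the header row above.

```python
import csv

# Column names copied from the header row written to the samples table.
EXPECTED_COLUMNS = ['index', 'identifier', 'file', 'clip', 'sum_stat', 'description']


def read_samples_table(lines):
    """Parse tab-separated sample rows and verify the header matches."""
    reader = csv.DictReader(lines, delimiter='\t')
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError('unexpected columns: %r' % (reader.fieldnames,))
    return list(reader)
```

For example, `read_samples_table(open(path))` on the file written above should return one dictionary whose `identifier` is `Dplus`.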

Next, we want to choose genomic sequences to form batches for stochastic gradient descent, divide them into training/validation/test sets, and construct TFRecords to provide to downstream programs.

The script [akita_data.py](https://github.com/calico/basenji/blob/master/bin/akita_data.py) implements this procedure.

The most relevant options here are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| --sample | 0.1 | Down-sample the genome to 10% to speed things up here. |
| -g | data/hg38_gaps_binsize2048_numconseq10.bed | Dodge large-scale unmappable regions determined from filtered cooler bins. |
| -l | 1048576 | Sequence length. |
| --crop | 65536 | Crop edges of matrix so loss is only computed over the central region. |
| --local | True | Run locally, as opposed to on a SLURM scheduler. |
| -o | data/1m | Output directory. |
| -p | 8 | Use multiple concurrent processes to read/write. |
| -t | .1 | Hold out 10% sequences for testing. |
| -v | .1 | Hold out 10% sequences for validation. |
| -w | 2048 | Pool the nucleotide-resolution values to 2048 bp bins. |
| fasta_file| data/hg38.ml.fa | FASTA file to extract sequences from. |
| targets_file | data/microc_cools.txt | Target table with cooler paths. |
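The sequence length, bin width, and crop together fix the size of the contact matrix. The sketch below works through that arithmetic, assuming that `--crop` removes the given number of base pairs from each edge of the matrix; `matrix_bins` is a hypothetical helper, not a basenji function.

```python
def matrix_bins(seq_length, bin_width, crop_bp=0):
    """Return (bins per sequence, bins remaining after cropping both edges)."""
    if seq_length % bin_width or crop_bp % bin_width:
        raise ValueError('seq_length and crop_bp must be multiples of bin_width')
    total = seq_length // bin_width
    # Assumption: the crop is applied to both edges of the matrix.
    cropped = total - 2 * (crop_bp // bin_width)
    return total, cropped


# Values from the table above: 1048576 bp sequences, 2048 bp bins, 65536 bp crop.
print(matrix_bins(1048576, 2048, 65536))  # (512, 448)
# A 250000 bp sequence with 5000 bp bins and no crop.
print(matrix_bins(250000, 5000))  # (50, 50)
```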

Note: make sure to export BASENJIDIR as outlined in the basenji installation tips 
(https://github.com/calico/basenji/tree/master/#installation). 

In [None]:
import sys
sys.path.insert(0, '/home1/yxiao977/labwork/train_akita/basenji/bin')

In [None]:
import os
import shutil
if os.path.isdir('data/2m'):
    shutil.rmtree('data/2m')

In [None]:
! ./akita_data.py --sample 1 -l 250000 --local -o ~/sc1/train_akita/data/3m --as_obsexp -p 8 -t .1 -v .1 -w 5000 --snap 5000 --stride_train 250000 --stride_test 50000 ~/sc1/train_akita/data/GSE152150_Smic1.1N.fa ~/sc1/train_akita/data/dinoflagellate_cools.txt

The training data is now saved in the output directory passed with -o (here ~/sc1/train_akita/data/3m) as TFRecords for training, validation, and testing. In that directory, *contigs.bed* contains the original large contiguous regions from which training sequences were taken, and *sequences.bed* contains the train/valid/test sequences.

In [None]:
! cut -f4 /home1/yxiao977/sc1/train_akita/data/3m/sequences.bed | sort | uniq -c

In [None]:
! head -n3 /home1/yxiao977/sc1/train_akita/data/3m/sequences.bed
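The same per-split tally can be computed in Python, which is convenient if the counts are needed later in the notebook. This is a minimal sketch that assumes, as the `cut -f4` command above does, that the fourth tab-separated column of *sequences.bed* holds the split label; `count_splits` is a hypothetical helper name.

```python
from collections import Counter


def count_splits(bed_lines):
    """Count sequences per split label (4th tab-separated BED column)."""
    counts = Counter()
    for line in bed_lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) >= 4:
            counts[fields[3]] += 1
    return counts
```

Calling `count_splits(open(path))` on *sequences.bed* should give the same numbers as the `sort | uniq -c` pipeline.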

Now train a model!

(Note: for training production-level models, please remove the --sample option when generating tfrecords)

In [None]:
import json

# Load the base model parameters and adapt them to a single target.
params_file = 'params.json'
with open(params_file) as params_in:
    params_dinof = json.load(params_in)
params_dinof['model']['head_hic'][-1]['units'] = 1
params_dinof['model']['seq_length'] = 250000
# 250000 bp sequence / 5000 bp bins = 50 bins
params_dinof['model']['target_length'] = 50

# Total pooling must match the 5000 bp bin width: 5 * 10**3 = 5000.
params_dinof['model']['trunk'][0]['pool_size'] = 5
params_dinof['model']['trunk'][1]['pool_size'] = 10
params_dinof['model']['trunk'][1]['repeat'] = 3

# Write the parameters to the path that the akita_train.py command reads.
with open('/home1/yxiao977/sc1/train_akita/data/3m/params_dinof.json', 'w') as params_out:
    json.dump(params_dinof, params_out)

# Note: training with default parameters requires a GPU with >12 GB of RAM.
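The edited parameters only make sense if the trunk's total pooling equals the bin width used for the targets. The sketch below checks this, assuming that only the first two trunk blocks pool and that the second block's pooling is applied `repeat` times; `check_pooling` is a hypothetical helper, not part of basenji.

```python
def check_pooling(params, bin_width):
    """Return True if trunk pooling and target_length agree with bin_width."""
    model = params['model']
    trunk = model['trunk']
    # Assumption: block 0 pools once, block 1 pools once per repeat.
    total_pool = trunk[0]['pool_size'] * trunk[1]['pool_size'] ** trunk[1]['repeat']
    expected_bins = model['seq_length'] // total_pool
    return total_pool == bin_width and expected_bins == model['target_length']


# Values set in the cell above: 5 * 10**3 = 5000 and 250000 / 5000 = 50.
example = {'model': {
    'seq_length': 250000,
    'target_length': 50,
    'trunk': [{'pool_size': 5}, {'pool_size': 10, 'repeat': 3}],
}}
print(check_pooling(example, 5000))  # True
```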

In [None]:
! ./akita_train.py -k -o /home1/yxiao977/sc1/train_akita/data/3m/train_out/ /home1/yxiao977/sc1/train_akita/data/3m/params_dinof.json /home1/yxiao977/sc1/train_akita/data/3m/

See explore_model.ipynb for tips on investigating the output of a trained model. 