### Hi-C Data Preparation

1. Convert the per-chromosome `anchor.to.anchor` files into a single whole-genome cooler (`.cool`) file
2. Rebin the cooler to uniform 10 kb bins
3. Convert the rebinned cooler to sparse matrix (`.npz`) format, one file per chromosome

In [None]:
! python src/cshark/preprocessing/convert_to_cooler.py \
    --loop_dir /mnt/jinstore/Archive01/LAB/Hi-C/ssz20_12122022_publicArima_AlphaBetaAcinar/processed/beta/Enhance_250M \
    --anchors /mnt/jinstore/JinLab03/xxl1432/Reference/HiC/enzyme/hg19_GATC_GANTC/anchor_bed \
    --col_names a1 a2 ratio \
    --out beta_5kb_deeploop.cool

In [None]:
! python src/cshark/preprocessing/cooler_uniform_bins.py beta_5kb_deeploop.cool beta_10kb_deeploop.cool 10000

In [None]:
! python src/cshark/preprocessing/cool2npy.py beta_10kb_deeploop.cool ../cshark_data/data/hg19/beta/hic_matrix --no-balance
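
To make step 2 more concrete, the sketch below shows one way contacts between anchors of varying size could be aggregated into uniform 10 kb bins: each anchor is assigned to the bin containing its midpoint, and values that land in the same bin pair are summed. This is an illustrative assumption, not necessarily what `cooler_uniform_bins.py` does; the function name `rebin_contacts` and the toy anchor coordinates are hypothetical.

```python
from collections import defaultdict


def rebin_contacts(anchors, contacts, bin_size=10_000):
    """Aggregate anchor-pair values into uniform bins for one chromosome.

    anchors  : list of (start, end) coordinates in bp, indexed by anchor id
    contacts : list of (a1, a2, value) tuples referring to anchor ids
    Returns a sparse dict {(bin_i, bin_j): summed value} with bin_i <= bin_j.
    """
    # Assumption: an anchor belongs to the bin that contains its midpoint.
    anchor_bin = [((start + end) // 2) // bin_size for start, end in anchors]

    binned = defaultdict(float)
    for a1, a2, value in contacts:
        # Store only the upper triangle so (i, j) and (j, i) are merged.
        i, j = sorted((anchor_bin[a1], anchor_bin[a2]))
        binned[(i, j)] += value
    return dict(binned)


# Toy data (hypothetical): five anchors of uneven size on one chromosome.
anchors = [(0, 4000), (4000, 9500), (9500, 16000), (16000, 21000), (21000, 30000)]
contacts = [(0, 2, 1.0), (1, 3, 2.0), (2, 4, 3.0), (0, 1, 0.5)]

binned = rebin_contacts(anchors, contacts)
for (i, j), value in sorted(binned.items()):
    print(f"bin {i} x bin {j}: {value}")
```

Midpoint assignment is only one possible policy; splitting a value proportionally across every bin an anchor overlaps is another, and the two give different matrices when anchors straddle bin boundaries.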

### Data Organization

Training expects data organized according to:

`<data_root>/<assembly>/<celltype>`

Each of these celltype directories should contain a `genomic_features` folder containing the training and target bigwigs, as well as a `hic_matrix` folder containing the output of the Hi-C preprocessing pipeline.

An example of a celltype directory looks like this:

In [7]:
! tree ../cshark_data/data/hg19/beta

../cshark_data/data/hg19/beta
├── genomic_features
│   ├── atac.bw
│   └── ctcf.bw
└── hic_matrix
    ├── chr10.npz
    ├── chr11.npz
    ├── chr12.npz
    ├── chr13.npz
    ├── chr14.npz
    ├── chr15.npz
    ├── chr16.npz
    ├── chr17.npz
    ├── chr18.npz
    ├── chr19.npz
    ├── chr1.npz
    ├── chr20.npz
    ├── chr21.npz
    ├── chr22.npz
    ├── chr2.npz
    ├── chr3.npz
    ├── chr4.npz
    ├── chr5.npz
    ├── chr6.npz
    ├── chr7.npz
    ├── chr8.npz
    ├── chr9.npz
    ├── chrX.npz
    └── chrY.npz

3 directories, 26 files
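
Before training, it can help to confirm that a celltype directory matches this layout. The following is a minimal sketch, not part of the repository: the helper name `check_celltype_dir` is hypothetical, and it assumes that each feature name maps to `<name>.bw` and that Hi-C matrices are named `chr*.npz`, as in the listing above.

```python
import tempfile
from pathlib import Path


def check_celltype_dir(data_root, assembly, celltype, features):
    """Return a list of layout problems; an empty list means nothing was flagged."""
    base = Path(data_root) / assembly / celltype
    problems = []

    # Both subfolders described above must exist.
    for sub in ("genomic_features", "hic_matrix"):
        if not (base / sub).is_dir():
            problems.append(f"missing directory: {base / sub}")

    # Assumption: a feature called "ctcf" is stored as genomic_features/ctcf.bw.
    for name in features:
        bigwig = base / "genomic_features" / f"{name}.bw"
        if not bigwig.is_file():
            problems.append(f"missing bigwig: {bigwig}")

    # Assumption: at least one per-chromosome matrix named chr*.npz is present.
    hic_dir = base / "hic_matrix"
    if hic_dir.is_dir() and not any(hic_dir.glob("chr*.npz")):
        problems.append(f"no chr*.npz files in: {hic_dir}")

    return problems


# Demo on a throwaway directory that deliberately lacks atac.bw.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "hg19" / "beta"
    (base / "genomic_features").mkdir(parents=True)
    (base / "hic_matrix").mkdir()
    (base / "genomic_features" / "ctcf.bw").touch()
    (base / "hic_matrix" / "chr1.npz").touch()
    problems = check_celltype_dir(tmp, "hg19", "beta", ["ctcf", "atac"])

for problem in problems:
    print(problem)
```

For the directory shown above, the equivalent call would be `check_celltype_dir("../cshark_data/data", "hg19", "beta", ["ctcf", "atac"])`.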


### Training

Once the data is organized, run the `train.py` script with the following arguments:

* `--data-root`: path to `<data_root>`, the top-level data directory

* `--assembly`: name of the `<assembly>` subdirectory (e.g. `hg19`)

* `--celltypes`: a single celltype folder name, or a list of them

* `--input-features`: names of the bigwig files to use as input, given without the `.bw` extension

Optionally, pass `--target-features` with the same or different bigwig names; the model will then try to reconstruct or predict those tracks.

In [None]:
! python3 ../src/cshark/training/train.py \
    --data-root ../cshark_data/data \
    --assembly hg19 \
    --celltypes beta \
    --input-features ctcf atac \
    --target-features ctcf atac \
    --latent-dim 128 \
    --num-gpu 1 \
    --batch-size 2 \
    --num-workers 1 \
    --use-wandb
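
The list-valued flags in the command above can be pictured with a small `argparse` sketch. It makes no claim about how `train.py` is actually implemented, and it covers only the flags described in this section; it simply shows how space-separated values such as `--input-features ctcf atac` could be collected into Python lists. The `build_parser` function is a hypothetical stand-in, and the `None` default for `--target-features` is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical stand-in for the data-related arguments of train.py."""
    parser = argparse.ArgumentParser(description="Sketch of list-valued training flags")
    parser.add_argument("--data-root", required=True)
    parser.add_argument("--assembly", required=True)
    # nargs="+" accepts one or more space-separated values and returns a list.
    parser.add_argument("--celltypes", nargs="+", required=True)
    parser.add_argument("--input-features", nargs="+", required=True)
    # Assumption: target features are optional and unset by default.
    parser.add_argument("--target-features", nargs="+", default=None)
    return parser


args = build_parser().parse_args(
    [
        "--data-root", "../cshark_data/data",
        "--assembly", "hg19",
        "--celltypes", "beta",
        "--input-features", "ctcf", "atac",
    ]
)
print(args.celltypes, args.input_features, args.target_features)
```

With `nargs="+"`, a single value still arrives as a one-element list, so downstream code can loop over `args.celltypes` without special-casing the single-celltype run.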