# Analyzing Breast Tissue Hi-C Data with `jointly-hic`

This notebook demonstrates how to use the `jointly-hic` toolkit to jointly embed Hi-C contact matrices generated from the breast tissue Hi-C data published in *Choppavarapu, Lavanya et al. Cell Reports*.

## Background

Hi-C is a genome-wide chromosome conformation capture method that quantifies 3D chromatin contacts. It provides a matrix of contact frequencies between genomic bins. By embedding these matrices into a low-dimensional space, we can compare 3D chromatin structure across different conditions and cell types.

In this demo, we:
- Map the Hi-C data of 12 breast tissue samples (2 normal, 5 primary tumors, 5 recurrent tumors) from *Choppavarapu L et al. Cell Rep. 2025* using [`distiller-nf`](https://github.com/open2c/distiller-nf)
- Prepare for downstream embedding with `jointly-hic`

## Setup

To run this notebook, you will need the following Python packages and command-line tools:

### Python Packages
- `jointly-hic`: For joint embedding and analysis of Hi-C data
- `cooler`: For handling `.cool` and `.mcool` Hi-C files
- `cooltools`: Additional utilities for Hi-C file operations and balancing
- `pandas`, `numpy`, etc. (installed as dependencies)

Install them using pip:

```bash
pip install jointly-hic cooler cooltools
```
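As a quick sanity check after installation, the sketch below reports which of the packages listed above cannot be found in the current environment. The helper name `missing_packages` is hypothetical, and the import names (`jointly_hic`, `cooler`, `cooltools`) are assumed to match the package list above.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Import names assumed from the package list above.
required = ["jointly_hic", "cooler", "cooltools"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```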

### Command-Line Tools -- distiller-nf
`distiller-nf` is a modular Hi-C mapping pipeline.
To set up a new project, run the following command in the project folder:
```
$ nextflow clone open2c/distiller-nf ./
```
After installation, follow the instructions at https://github.com/open2c/distiller-nf.
The folder of this demo also includes a `project.yml` file that can be passed to the pipeline:
```
$ nextflow run distiller.nf -params-file project.yml
```


In [4]:
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import pandas as pd

from jointly_hic.notebook_utils.encode_utils import EncodeFile

## Joint Embedding of Breast Tissue Hi-C Matrices

Now that we have obtained `.mcool` files for all 12 breast tissue samples, we can jointly embed these contact matrices into a low-dimensional vector space using `jointly-hic`.

The `jointly embed` command performs out-of-core matrix decomposition using PCA (default), NMF, or SVD. It stacks contact matrices from all samples at a specified resolution and learns a shared representation that preserves biological variation across samples.
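To make the idea of stacking and a shared representation concrete, here is a toy sketch in NumPy. It uses small random symmetric matrices in place of real Hi-C data, holds everything in memory, and applies plain PCA via SVD. It is an illustration under those assumptions, not a description of how `jointly-hic` is implemented (which works out-of-core and may normalize the data differently).

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_bins, n_components = 3, 40, 4

# Toy symmetric "contact matrices", one per sample (random, not real Hi-C data).
matrices = []
for _ in range(n_samples):
    m = rng.random((n_bins, n_bins))
    matrices.append((m + m.T) / 2)

# Stack rows from all samples: each row is the contact profile of one (sample, bin).
stacked = np.vstack(matrices)  # shape: (n_samples * n_bins, n_bins)

# PCA via SVD on the column-centered stack; the components are shared by all samples.
centered = stacked - stacked.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
embedding = centered @ vt[:n_components].T  # shape: (n_samples * n_bins, n_components)

# Split back into one (n_bins, n_components) block per sample.
per_sample = embedding.reshape(n_samples, n_bins, n_components)
print(per_sample.shape)  # (3, 40, 4)
```

Because every sample is projected onto the same components, the coordinates of a given bin can be compared directly across samples.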

We'll run `jointly embed` at a fairly coarse resolution of 320 kb. Higher resolutions such as 50 kb, or even 25 kb, are often preferable if the data can support them, but the embedding finishes faster at 320 kb.

This embedding can then be used for:
- Dimensionality reduction and visualization
- Clustering and trajectory inference
- Integration with RNA-seq, ATAC-seq, or ChIP-seq via JointDb

In this example, we run `jointly embed` with:
- Input: all `.mcool` files in `./data/`
- Resolution: 320 kb
- Assembly: `hg38`
- Method: PCA
- Components: 32
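Before launching the embedding, it can help to confirm that the requested resolution is actually stored in each `.mcool` file. The sketch below assumes the standard multi-resolution cooler layout, where each resolution lives under a group named `/resolutions/<bin size>`. The helper names `parse_resolutions` and `has_resolution` are hypothetical, and the group paths in the final line are illustrative values, not output from the demo files.

```python
import re


def parse_resolutions(group_paths):
    """Extract integer bin sizes from group paths such as '/resolutions/50000'."""
    sizes = []
    for path in group_paths:
        match = re.fullmatch(r"/?resolutions/(\d+)", path)
        if match:
            sizes.append(int(match.group(1)))
    return sorted(sizes)


def has_resolution(mcool_path, resolution):
    """Check whether an .mcool file stores the requested resolution (requires `cooler`)."""
    import cooler  # imported lazily so the parsing helper works without it

    groups = cooler.fileops.list_coolers(str(mcool_path))
    return resolution in parse_resolutions(groups)


# Illustrative group paths only.
print(parse_resolutions(["/resolutions/50000", "/resolutions/320000"]))  # [50000, 320000]
```

With real files, `has_resolution(path, 320000)` could be called on each `.mcool` file before it is passed to `jointly embed`.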

Because these are clinical samples from female donors, we exclude chrY and chrM when running `jointly embed` to reduce noise and improve the quality of the output.
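The exclusion relies on chromosome order. The sketch below assumes that hg38 chromosomes are ordered as chr1 to chr22, then chrX, chrY, chrM, and that a chromosome limit keeps the first N entries of that list. Both points are assumptions made for illustration, and `apply_chrom_limit` is a hypothetical helper rather than part of `jointly-hic`.

```python
# Assumed hg38 chromosome order: autosomes, then chrX, chrY, chrM.
HG38_CHROMS = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY", "chrM"]


def apply_chrom_limit(chroms, limit):
    """Keep only the first `limit` chromosomes of an ordered list."""
    return chroms[:limit]


kept = apply_chrom_limit(HG38_CHROMS, 23)
dropped = [c for c in HG38_CHROMS if c not in kept]
print(kept[-1], dropped)  # chrX ['chrY', 'chrM']
```

Under these assumptions, a limit of 23 retains the autosomes and chrX and drops exactly chrY and chrM.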

In [3]:
import subprocess
from pathlib import Path
import pandas as pd



In [None]:
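# Directory that holds the input .mcool files and receives the embedding output
data_dir = Path("./data")
data_dir.mkdir(parents=True, exist_ok=True)

# Gather all .mcool files in ./data/
mcool_files = sorted(data_dir.glob("*.mcool"))
if not mcool_files:
    raise FileNotFoundError(f"No .mcool files found in {data_dir}")

# Run `jointly embed`, excluding chrY and chrM by setting --chrom-limit to 23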

cmd = [
    "jointly",
    "embed",
    "--mcools",
    *map(str, mcool_files),
    "--resolution",
    "320000",  # 320 kb, the resolution used in this demo
    "--assembly",
    "hg38",
    "--method",
    "PCA",
    "--components",
    "32",
    "--chrom-limit", 
    "23",
    "--output",
    "data/breast-tissue-demo-output",
]

print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)