# Day 2 practice session: Interacting with Hi-C matrices

In this session, we will use Python to manipulate and extract information from Hi-C datasets.

## Data structures

Most Hi-C storage formats comprise 3 tables: the contact matrix, the bin coordinates and the chromosome sizes.
Since we are using the cool format, each dataset is a single file containing multiple tables.

We can access those tables using the cooler API as follows:


In [9]:
import cooler
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Open the cool file (the tables are not loaded into memory yet)
clr_g1 = cooler.Cooler('G1.cool')

# Get information on chromosomes: load the whole chroms table
chroms = clr_g1.chroms()[:]

chroms.head()


In [None]:
bins = clr_g1.bins()[:]
bins.loc[bins.chrom == 'chrom1', :]

Cooler can load tables from the HDF5 file using the corresponding methods `bins()`, `chroms()` and `matrix()`. Each method returns a view on the HDF5 table. To be explicitly loaded into memory, these views need to be queried. For example, to load the whole table we call `[:]` on the view.

Once loaded, the `bins` and `chroms` tables are exposed as pandas DataFrames. This means we can query and process them using standard pandas methods.
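As a minimal sketch of such queries, the snippet below uses a small hand-made DataFrame as a stand-in for the table returned by `bins()[:]`. The coordinates are invented for illustration; only the `chrom`, `start` and `end` column names follow the cooler bin table.

```python
import pandas as pd

# Toy stand-in for a bins table; the values are made up for illustration.
bins = pd.DataFrame({
    "chrom": ["chrom1", "chrom1", "chrom1", "chrom2", "chrom2"],
    "start": [0, 10000, 20000, 0, 10000],
    "end": [10000, 20000, 25000, 10000, 18000],
})

# Count the number of bins on each chromosome
bins_per_chrom = bins.groupby("chrom").size()

# Select the bins overlapping the interval chrom1:5000-15000
overlap = bins[
    (bins.chrom == "chrom1") & (bins.end > 5000) & (bins.start < 15000)
]
```

The same boolean indexing should work unchanged on a real bins table, since it is a regular DataFrame.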

Instead of loading the whole table from the view, we can use the `fetch` method and provide a genomic region in UCSC format (e.g. `chrom1:100000-500000`) to load only that region from the view.

In [5]:
mat_g1 = clr_g1.matrix(balance=True).fetch('chrom1:100000-500000')


By default, the matrix is returned in dense format as a numpy array. This means we can use all standard numpy methods on it. The issue with dense matrices is that their memory size scales with the square of the number of bins, and therefore with the square of the genome size at a given resolution.

Alternatively, we can specify `sparse=True` when calling the `matrix()` method. This will instead return a `scipy.sparse` matrix object. This is often necessary when working with large matrices. The main drawback of working with sparse matrices is that they cannot be visualized directly.
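To make the trade-off concrete, here is a small sketch comparing the memory footprint of a dense and a sparse representation. It uses a synthetic, mostly empty matrix rather than real Hi-C data, and shows one possible way to get a plottable array back: converting only a small window to dense.

```python
import numpy as np
import scipy.sparse as sp

# Synthetic 1000 x 1000 symmetric matrix with signal only near the diagonal
n = 1000
idx = np.arange(n)
dense = np.zeros((n, n))
dense[idx, idx] = 10.0
dense[idx[:-1], idx[1:]] = 3.0
dense[idx[1:], idx[:-1]] = 3.0

# COO format only stores the nonzero values and their coordinates
sparse = sp.coo_matrix(dense)
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.row.nbytes + sparse.col.nbytes
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")

# Convert a small window back to a dense array, e.g. for plotting
window = sparse.tocsr()[:50, :50].toarray()
```

Slicing requires a format such as CSR, which is why the matrix is converted with `tocsr()` before the window is extracted.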

## Metadata

All metadata accessible on the command line via `cooler info` can also be retrieved using the cooler API.
This metadata is stored as a dictionary in the `info` attribute. This includes relevant information such as the total coverage.

In [None]:
clr_g1.info

## Aggregating from multiple regions

In this exercise, we will compare the local contacts of a set of regions with a random background.

We have a BED file containing the binding sites of CTCF and want to investigate whether these sites are associated with a specific contact pattern. How would you proceed?
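One possible approach, sketched below on a synthetic matrix, is to extract a square window centred on each site, average the windows, and compare the result with the average obtained from random positions. The function `pileup`, the bin indices in `sites` and the window size are hypothetical choices for illustration. With real data, the BED coordinates would first need to be converted to bin indices, and the distance-dependent decay of contacts would typically need to be accounted for.

```python
import numpy as np

def pileup(mat, positions, flank):
    """Average square windows of half-width `flank` centred on the diagonal."""
    n = mat.shape[0]
    windows = np.array([
        mat[p - flank:p + flank + 1, p - flank:p + flank + 1]
        for p in positions
        # Skip positions whose window would run past the matrix edges
        if p - flank >= 0 and p + flank < n
    ])
    # nanmean ignores missing values, such as bins filtered out by balancing
    return np.nanmean(windows, axis=0)

rng = np.random.default_rng(0)
n = 200

# Synthetic symmetric matrix standing in for a Hi-C contact map
mat = rng.random((n, n))
mat = (mat + mat.T) / 2

sites = np.array([20, 75, 130, 180])        # hypothetical bin indices of sites
random_sites = rng.integers(0, n, size=50)  # random background positions

flank = 5
site_pileup = pileup(mat, sites, flank)
background = pileup(mat, random_sites, flank)

# Values above 1 indicate more contacts around sites than around random bins
ratio = site_pileup / background
```

Since the toy matrix is pure noise, no pattern is expected here; the point is only the windowing and averaging logic.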

## Computing the 4C-like profile of a region

Sometimes we are curious about the contacts of a specific region (or a few regions) with the rest of the genome. This is the information that would be generated by a 4C experiment, but we can emulate it with Hi-C data.

Can you think of a way to compute the 4C-like profile of a region from the Hi-C matrix using standard numpy methods?
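A minimal sketch of one answer: the rows of the matrix corresponding to the viewpoint hold its contacts with every other bin, so averaging those rows gives a 4C-like profile. The matrix below is synthetic and the viewpoint bin indices are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Synthetic symmetric matrix standing in for a Hi-C contact map
mat = rng.random((n, n))
mat = (mat + mat.T) / 2

# Hypothetical viewpoint spanning bins 40, 41 and 42
vp_start, vp_end = 40, 43

# Average the viewpoint rows: one contact value per bin of the matrix
profile = np.nanmean(mat[vp_start:vp_end, :], axis=0)
```

With a genome-wide matrix, `profile` would have one value per bin and could be plotted against the bin start coordinates.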

## Visualization

It is often more intuitive to show Hi-C matrices directly. However, when showing Hi-C matrices in figures, it is important that the data has been correctly preprocessed and that the different matrices are comparable.
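As an illustration of what making matrices comparable can involve, the sketch below scales two synthetic count matrices to the same total, log-transforms them and derives a shared colour scale limit. The helper `scale_to_total`, the target total and the 99th percentile cutoff are arbitrary assumptions, not a prescribed procedure.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two synthetic count matrices mimicking different sequencing depths
mat_a = rng.poisson(5, size=(50, 50)).astype(float)
mat_b = rng.poisson(20, size=(50, 50)).astype(float)

def scale_to_total(mat, total=1e6):
    """Rescale a matrix so that its contacts sum to `total`."""
    return mat * total / np.nansum(mat)

norm_a = scale_to_total(mat_a)
norm_b = scale_to_total(mat_b)

# Log-transform with a pseudocount to compress the dynamic range
log_a = np.log10(norm_a + 1)
log_b = np.log10(norm_b + 1)

# Shared upper limit so that both matrices use the same colour scale
vmax = np.nanpercentile(np.concatenate([log_a.ravel(), log_b.ravel()]), 99)
```

Both arrays can then be displayed with the same limits, for example by passing `vmin=0` and `vmax=vmax` to matplotlib's `imshow`.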