# Testing of imaging.py

In [1]:
import numpy as np
import sys
import os
# we are in alabtools/test/imaging_testing
# we have to import functions in alabtools/
sys.path.append(os.path.abspath('../..'))
from alabtools.utils import Genome, Index
from alabtools.imaging import CtFile

The data format standardized by 4DN for Chromatin Tracing is FOF-CT (FISH Omics Format for Chromatin Tracing).

I have downloaded a few datasets from different labs to test the code.

For quicker tests, I also created two smaller files, each taken from an original FOF-CT file and containing only about 1000 lines.

In [7]:
# Create two csv test files, taken from an original FOF-CT csv file but with only about 1000 lines each.
# Important: both files include the header (the first 15 lines of the original file).

in_filename = './data/takei_mesc_DNA-tracing_1Mb_rep1.csv'
with open(in_filename, 'r') as in_file:
    lines = in_file.readlines()

# Test file 1: the first 1000 lines of the original file (header included)
out_filename = './data/csv_test_1.csv'
with open(out_filename, 'w') as out_file:
    out_file.writelines(lines[:1000])

# Test file 2: the header followed by the next 1000 lines
out_filename = './data/csv_test_2.csv'
with open(out_filename, 'w') as out_file:
    out_file.writelines(lines[:15])
    out_file.writelines(lines[1000:2000])

First, I am going to read the data from one test file.

In [10]:
if os.path.exists('ct_test1.ct'):
    os.remove('ct_test1.ct')  # remove the file if it exists
# Reading the test csv file (1)
ct = CtFile('ct_test1.ct', 'w')  # initialize the ct file
ct.set_from_fofct('./data/csv_test_1.csv')  # read the FOFCT file

Assembly mm10 found in alabtools/genomes. Using this.

chroms or lengths not given, reading from genomes info file.


A CT file is very similar to an HSS file. It is an HDF5 file that stores a Genome object and an Index object as data.

There is an important difference, though. HDF5 files support only homogeneous arrays (i.e. hyperrectangles) as datasets, whereas Chromatin Tracing data is inherently heterogeneous: for example, the number of spots can vary by copy/domain/cell. This issue is solved here by max-padding: each dimension is enlarged to its maximum value, and the unused entries are set to NaN. Padding is applied to both the coordinates and the number of spots, and consequently there are two attributes (nspot_max and ncopy_max) that are used for max-padding only.

CT attributes:

    - ncell : number of cells
    - ndomain : number of domains (in Index object)
    - nspot_tot : total number of spots measured
    - ntrace_tot : total number of chromatin traces identified
    - nspot_max : maximum number of spots per copy/domain/cell
    - ncopy_max : maximum number of copies per domain/cell

CT datasets:

    - cell_labels : np.array(ncell, dtype='str'), contains cellIDs
    - coordinates : np.array(ncell, ndomain, ncopy_max, nspot_max, 3)
    - nspot : np.array(ncell, ndomain, ncopy_max), number of spots per copy/domain/cell
    - ncopy : np.array(ncell, ndomain), number of copies per domain/cell
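
The max-padding idea can be illustrated with a small NumPy sketch. This is not the code used in alabtools: the helper `pad_traces` and the toy traces are hypothetical, and the example pads along a single dimension (spots) only, whereas the CT file pads along both the spot and the copy dimensions.

```python
import numpy as np

def pad_traces(traces, fill=np.nan):
    """Pad a list of (n_i, 3) coordinate arrays to one (ntrace, nspot_max, 3) array.

    Hypothetical helper for illustration only; not part of alabtools.
    """
    nspot = np.array([len(t) for t in traces], dtype=int)
    nspot_max = int(nspot.max()) if len(nspot) else 0
    # Start from an array full of NaN, then copy each trace into its leading slots
    padded = np.full((len(traces), nspot_max, 3), fill, dtype=float)
    for i, t in enumerate(traces):
        padded[i, :len(t)] = t
    return padded, nspot

# Toy data: three traces with 2, 1 and 0 spots respectively
traces = [
    np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]),
    np.array([[2.0, 2.0, 2.0]]),
    np.empty((0, 3)),
]
padded, nspot = pad_traces(traces)

# A slot holds a real spot only if none of its three coordinates is NaN
valid = ~np.isnan(padded).any(axis=-1)
print(padded.shape)       # homogeneous shape, suitable for an HDF5 dataset
print(nspot)              # number of real spots per trace
print(valid.sum(axis=1))  # recovered from the NaN mask; should match nspot
```

Storing the spot counts next to the padded array means the real entries can be recovered either by slicing with the counts or by masking the NaN values.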

In [13]:
print(ct.ncell)
print(ct.ndomain)
print(ct.nspot_tot)
print(ct.ntrace_tot)
print(ct.nspot_max)
print(ct.ncopy_max)
print(ct.cell_labels.shape)
print(ct.coordinates.shape)
print(ct.nspot.shape)
print(ct.ncopy.shape)

print(ct.coordinates[0, 0, 0, 0, :])

197
4
985
197
6
1
(197,)
(197, 4, 1, 6, 3)
(197, 4, 1)
(197, 4)
[174.119  17.179   2.392]
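
As a quick sanity check on the printed values, the following sketch computes how much of the padded coordinates array holds real data. It only uses the numbers printed above and assumes that every measured spot occupies exactly one slot of the array.

```python
# Values copied from the output above
ncell, ndomain, ncopy_max, nspot_max = 197, 4, 1, 6
nspot_tot = 985

# Number of (cell, domain, copy, spot) slots in the padded array
slots = ncell * ndomain * ncopy_max * nspot_max

# Fraction of slots holding a measured spot (assumption: one spot per slot);
# the remaining slots are NaN padding
fill_fraction = nspot_tot / slots
print(slots, round(fill_fraction, 3))
```

The remaining slots are padding, which is the storage cost of keeping the dataset rectangular.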


There is also a merging feature, which is necessary when we have to merge two replicates, as in the case of Takei's data.

The code checks whether the cellIDs of the two files overlap; if they do, it requires tags to make them distinguishable.
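
The overlap check described above could look roughly like the sketch below. The function `merge_cell_labels` and its parameters are hypothetical and are not the alabtools implementation; in particular, appending the tags as suffixes to every label is an assumption.

```python
import numpy as np

def merge_cell_labels(labels_a, labels_b, tag_a=None, tag_b=None):
    """Concatenate two lists of cell labels, refusing to merge if they collide.

    Hypothetical helper for illustration only; not part of alabtools.
    """
    # Assumption: when tags are given, they are appended to every label
    if tag_a is not None:
        labels_a = [f"{label}{tag_a}" for label in labels_a]
    if tag_b is not None:
        labels_b = [f"{label}{tag_b}" for label in labels_b]
    # Any label present in both sets would be ambiguous after merging
    if set(labels_a) & set(labels_b):
        raise ValueError("There is an overlap in cell labels. "
                         "Please provide tags to distinguish them.")
    return np.array(list(labels_a) + list(labels_b))

labels_1 = ["cell_0", "cell_1"]
labels_2 = ["cell_0", "cell_1"]

try:
    merge_cell_labels(labels_1, labels_2)
except ValueError as err:
    print("ValueError:", err)

merged = merge_cell_labels(labels_1, labels_2, "_rep1", "_rep2")
print(merged)
```

Checking for overlap after the tags are applied also catches the case where both tags are identical and the labels would still collide.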

In [15]:
if os.path.exists('ct_test2.ct'):
    os.remove('ct_test2.ct')  # remove the file if it exists
# Reading the test csv file (1) again, into a second ct file
ct2 = CtFile('ct_test2.ct', 'w')
ct2.set_from_fofct('./data/csv_test_1.csv')

Assembly mm10 found in alabtools/genomes. Using this.

chroms or lengths not given, reading from genomes info file.


In [16]:
# Try merging the two ct files without tags
if os.path.exists('ct_test3.ct'):
    os.remove('ct_test3.ct')  # remove the file if it exists
ct3 = ct.merge(ct2, 'ct_test3.ct')  # merge the two ct files

ValueError: There is an overlap in cell labels. Please provide tags to distinguish them.

In [17]:
# Merge with tags
if os.path.exists('ct_test3.ct'):
    os.remove('ct_test3.ct')  # remove the file if it exists
ct3 = ct.merge(ct2, 'ct_test3.ct', '_rep1', '_rep2')  # merge the two ct files

In [18]:
# The number of cells should be the sum of the two
print(ct.ncell, ct2.ncell, ct3.ncell)

197 197 394


Now we can load the actual files and check that everything works.

In [19]:
if os.path.exists('ct_takei_rep1.ct'):
    os.remove('ct_takei_rep1.ct')  # remove the file if it exists
ct = CtFile('ct_takei_rep1.ct', 'w')
ct.set_from_fofct('./data/takei_mesc_DNA-tracing_1Mb_rep1.csv')

Assembly mm10 found in alabtools/genomes. Using this.

chroms or lengths not given, reading from genomes info file.


In [22]:
if os.path.exists('ct_takei_rep2.ct'):
    os.remove('ct_takei_rep2.ct')  # remove the file if it exists
ct2 = CtFile('ct_takei_rep2.ct', 'w')
ct2.set_from_fofct('./data/takei_mesc_DNA-tracing_1Mb_rep2.csv')

Assembly mm10 found in alabtools/genomes. Using this.

chroms or lengths not given, reading from genomes info file.


In [23]:
ct3 = ct.merge(ct2, 'ct_takei_comb.ct', '_rep1', '_rep2')  # merge the two ct files

In [25]:
print(ct3.ncell)
print(ct3.ndomain)
print(ct3.nspot_tot)
print(ct3.ntrace_tot)
print(ct3.nspot_max)
print(ct3.ncopy_max)
print(ct3.cell_labels.shape)
print(ct3.coordinates.shape)
print(ct3.nspot.shape)
print(ct3.ncopy.shape)


446
2460
1795228
8919
15
1
(446,)
(446, 2460, 1, 15, 3)
(446, 2460, 1)
(446, 2460)


In [24]:
# Check that the Index object is correctly created from the FOFCT file

i = Index('./domains_takei.bed', ct3.genome)

print(i == ct3.index)

True


In [2]:
# Try other data
if os.path.exists('ct_bingren.ct'):
    os.remove('ct_bingren.ct')  # remove the file if it exists
ct = CtFile('ct_bingren.ct', 'w')
ct.set_from_fofct('./data/bingren_4DNFIKPGMZJ8.csv')

print(ct.coordinates.shape)


Assembly grcm38 found in alabtools/genomes. Using this.


chroms or lengths not given, reading from genomes info file.


(752, 42, 2, 1, 3)


In [3]:
# Try other data
if os.path.exists('ct_boettiger.ct'):
    os.remove('ct_boettiger.ct')  # remove the file if it exists
ct = CtFile('ct_boettiger.ct', 'w')
ct.set_from_fofct('./data/boettiger_4DNFI5WLD5KM.csv')

print(ct.coordinates.shape)

ValueError: Assembly not found in alabtools/genomes. Need to include it.

In [None]:
# Try other data
if os.path.exists('ct_wang.ct'):
    os.remove('ct_wang.ct')  # remove the file if it exists
ct = CtFile('ct_wang.ct', 'w')
ct.set_from_fofct('./data/wang_4DNFI7PBQK6G.csv')

print(ct.coordinates.shape)