The cndb file uses the HDF5 (H5) file format. To handle it we use two main libraries:

h5py <br>
numpy

In [1]:
import h5py
import numpy as np

The "cndbf" variable (the opened file, created below) behaves like a dictionary that maps labels to values. In this example we have a Header label and chain labels:

"Header" -> All the information present in the header of NDB files  <br>

chain labels:<br>
If the file contains multiple chromosomes, each chain has its own label so that it can be handled separately <br>

Below I will show each label inside the cndb file

![CNDB file](cndb.png)

Open the cndb file using:

In [2]:
filename = 'multichain_C1C2.cndb'
mode = 'r'
cndbf = h5py.File(filename, mode)
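
A file opened this way stays open until it is closed explicitly. As a minimal sketch, the same read-only open can be written with a context manager, which closes the file when the block ends. The small file built here (`demo.cndb` in a temporary directory) is a hypothetical stand-in so the snippet runs on its own; it is not the real `multichain_C1C2.cndb`.

```python
import os
import tempfile

import h5py

# Build a tiny stand-in file (hypothetical content) so the sketch is runnable.
tmpdir = tempfile.mkdtemp()
demo_path = os.path.join(tmpdir, "demo.cndb")
with h5py.File(demo_path, "w") as f:
    f.create_group("Header")
    f.create_group("C1")

# Read-only open; the file is closed automatically when the block exits.
with h5py.File(demo_path, "r") as f:
    root_labels = sorted(f.keys())

print(root_labels)
```

Anything read inside the `with` block that is needed later should be copied out (for example into a list or a numpy array) before the block ends.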

In this example file we have the Header label and two chains, here called C1 and C2

In [4]:
#The labels in the root directory
cndbf.keys()

<KeysViewHDF5 ['C1', 'C2', 'Header']>

The Header contains the metadata <br>
Each piece of information is saved in a separate attribute. To go over all of them we can use these commands:

In [6]:
cndbf['Header'].attrs.keys()

<KeysViewHDF5 ['author', 'chains', 'cycle', 'date', 'expdta', 'info', 'title', 'version']>

In [7]:
#loop over all metadata
for inf in cndbf['Header'].attrs.keys():
    print('{} => {}'.format(inf, cndbf['Header'].attrs[inf]))

author => Antonio B Oliveira Junior
chains => C1,C2
cycle => INTERPHASE G2
date => 2022-11-01 14:15:58.071576
expdta => Simulation - MEGABASE - MiChroM
info => First cndb file created
title => The Nucleome Data Bank: Web-based Resources Simulate and Analyze the Three-Dimensional Genome
version => 1.0.0
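
If the metadata is needed after the file is closed, it can be copied into a plain dictionary. The sketch below is one possible helper, not part of the cndb tools: `header_to_dict` is a hypothetical name, and the `bytes` branch is an assumption for files whose attributes were written as byte strings. A small dict stands in for `cndbf['Header'].attrs` so the snippet runs without the file.

```python
def header_to_dict(attrs):
    """Copy header attributes into a plain dict, decoding bytes to str."""
    header = {}
    for key in attrs.keys():
        value = attrs[key]
        if isinstance(value, bytes):  # assumption: some files may store bytes
            value = value.decode("utf-8")
        header[key] = value
    return header

# Stand-in for cndbf['Header'].attrs; the bytes value is hypothetical.
example_attrs = {"chains": "C1,C2", "version": b"1.0.0"}
header = header_to_dict(example_attrs)

# The 'chains' attribute is a comma-separated string of chain labels.
chain_names = header["chains"].split(",")
print(chain_names)
```

With the real file, the call would be `header_to_dict(cndbf['Header'].attrs)`.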


Let's now access the C1 label and look at all the keys it contains

In [8]:
#use the variable C1 here just to avoid rewriting "cndbf['C1']" all the time
C1 = cndbf['C1']

In [9]:
C1.keys() #remember this command is the same as cndbf['C1'].keys()

<KeysViewHDF5 ['genomic_position', 'loops', 'spatial_position', 'time', 'types']>

Inside the C1 label (a chain label) we will always have these keys: <br>

'types' -> the sequence of bead types (**now in string format**) <br>
'loops' -> list of index pairs (i,j) for each loop, if any exist <br>
'time' -> list of times, one for each frame present in spatial_position <br>
'genomic_position' -> list of genomic positions (start and end positions) in bp <br>
'spatial_position' -> directory with all frames (or traces), each containing the xyz positions; frames are saved sequentially, so '1', '2', '3' and so on...<br>
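
The layout described above can be listed programmatically. The sketch below walks any nested, dictionary-like object and returns the path of every entry. `list_layout` is a hypothetical helper, and the nested dict is a reduced stand-in for an open cndb file so the snippet runs on its own. It relies on groups having a `keys()` method while datasets do not.

```python
def list_layout(node, prefix=""):
    """Return the paths of all entries below `node`, depth first."""
    paths = []
    for name in node.keys():
        path = prefix + "/" + name
        paths.append(path)
        child = node[name]
        # Groups (and dicts) have keys(); datasets (and lists) do not.
        if hasattr(child, "keys"):
            paths.extend(list_layout(child, prefix=path))
    return paths

# Nested dict standing in for an open cndb file (hypothetical, reduced content).
example = {
    "C1": {
        "types": ["NA", "NA"],
        "spatial_position": {"1": [[0.0, 0.0, 0.0]], "2": [[1.0, 1.0, 1.0]]},
    },
}
for path in list_layout(example):
    print(path)
```

On a real file, `list_layout(cndbf)` should give the same kind of listing; h5py groups also provide `visititems` for this purpose.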

Let's take a look at how to extract each piece of information:


Types

In [11]:
types = list(C1['types']) #get from C1 and convert to a list of strings
print(types[:10]) #show the first 10 elements

['NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA']


Loops


In [14]:
loops = list(C1['loops']) #get from C1 and convert to a list of index pairs
print(loops[:10]) #show the first 10 elements

[array([931., 933.]), array([4089., 4090.]), array([4065., 4067.]), array([624., 626.]), array([478., 480.]), array([244., 246.]), array([4105., 4106.]), array([3128., 3129.]), array([3652., 3656.]), array([1075., 1078.])]
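
The loop indices in the output above are stored as floats, so they need to be converted to integers before they can be used to index other arrays. A minimal sketch using the first three pairs shown above; whether the indices are 0-based or 1-based is not stated here and should be checked before using them.

```python
import numpy as np

# First three loop pairs from the output above, stored as floats.
loops = [np.array([931., 933.]), np.array([4089., 4090.]), np.array([4065., 4067.])]

# One integer array of shape (n_loops, 2): column 0 is i, column 1 is j.
loop_idx = np.asarray(loops).astype(int)
i_idx, j_idx = loop_idx[:, 0], loop_idx[:, 1]

print(loop_idx.tolist())
```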


Genomic Position

In [20]:
gen_pos = list(C1['genomic_position']) #get from C1 and convert to a list of start and end positions in genomic bp
print(gen_pos[:10]) #show the first 10 elements

[array([    1, 50000]), array([ 50001, 100000]), array([100001, 150000]), array([150001, 200000]), array([200001, 250000]), array([250001, 300000]), array([300001, 350000]), array([350001, 400000]), array([400001, 450000]), array([450001, 500000])]
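
The start and end positions can be used to find which bead covers a given genomic coordinate. The sketch below uses the first four intervals from the output above; `bead_index` is a hypothetical helper, and it assumes the start positions are sorted in ascending order, as they are in this example.

```python
import numpy as np

# First four (start, end) intervals from the output above, in bp.
gen_pos = np.array([[1, 50000], [50001, 100000], [100001, 150000], [150001, 200000]])
starts, ends = gen_pos[:, 0], gen_pos[:, 1]

# Number of base pairs covered by each bead.
bead_sizes = ends - starts + 1

def bead_index(position_bp):
    """Return the 0-based list index of the bead covering position_bp."""
    idx = int(np.searchsorted(starts, position_bp, side="right")) - 1
    if idx < 0 or position_bp > ends[idx]:
        raise ValueError("position {} bp is outside the listed beads".format(position_bp))
    return idx

print(bead_sizes.tolist())
print(bead_index(120000))
```

The returned value is an index into the `gen_pos` list, so the same index can be used for `types`.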


Time

*Note: the time here is not necessarily sequential. This label is important for linking the sequential traces (from spatial_position) with the real time (or the saved computational time)* <br>

In this case, for example, I saved one frame every 100 simulation timesteps

In [21]:
time = list(C1['time']) #get from C1 and convert to a list of strings
print(time[:10]) #show the first 10 elements

['1', '101', '201', '301', '401', '501', '601', '701', '801', '901']
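
Since the times are stored as strings, a small lookup table makes the link between frame labels and timesteps explicit. This sketch assumes that the i-th entry of `time` belongs to the frame labelled `str(i + 1)` in spatial_position (frame labels start at '1'); that pairing is an assumption and should be checked against the file.

```python
# First five entries from the output above.
time = ['1', '101', '201', '301', '401']

# Convert the strings to integer timesteps.
time_steps = [int(t) for t in time]

# Map each frame label ('1', '2', ...) to its timestep (assumed pairing).
frame_time = {str(i + 1): t for i, t in enumerate(time_steps)}

print(frame_time['3'])
```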


Spatial Position

In [22]:
#these are the 3 ways to access directories in HDF5 files; again, I am using 'pos'
#just to avoid rewriting the whole path all the time.
pos = cndbf['C1/spatial_position'] #or
#pos = cndbf['C1']['spatial_position'] or
#pos = C1['spatial_position']


In [25]:
#look for the number of traces:
print(len(pos))

1000


To access the (x,y,z) positions of a frame, save them in a numpy array:

In [26]:
xyz_1 = np.array(pos["1"])
print(xyz_1)

[[-11.285787   -16.510454    -9.630638  ]
 [-11.101784   -16.334913    -8.64791   ]
 [-10.572278   -15.668607    -9.003843  ]
 ...
 [-12.695717    -4.2778034    0.10639028]
 [-12.547007    -5.2102995    0.1069468 ]
 [-11.645896    -5.278418    -0.16999087]]


To extract all frames, just loop over all labels:

In [31]:
xyz_all = []
Ntraces = len(pos)
for i in range(1, Ntraces + 1):
    xyz_all.append(np.array(pos[str(i)])) #load each frame as a numpy array

print("In this file we have {:} frames of {:} beads each".format(len(xyz_all), len(xyz_all[0])))

In this file we have 1000 frames of 4989 beads each
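
For analysis it is often convenient to hold the whole trajectory in a single array of shape (frames, beads, 3). The sketch below is one way to do that. `stack_frames` is a hypothetical helper; it assumes every frame has the same number of beads and that all labels are integer strings. A small dict of arrays stands in for `pos` so the snippet runs without the file.

```python
import numpy as np

def stack_frames(pos):
    """Stack all frames into one array of shape (n_frames, n_beads, 3)."""
    # Sort labels numerically so that '10' comes after '2'.
    labels = sorted(pos.keys(), key=int)
    return np.stack([np.array(pos[label]) for label in labels])

# Stand-in for cndbf['C1/spatial_position']: 3 frames of 4 beads (made-up values).
example_pos = {
    "10": np.full((4, 3), 10.0),
    "2": np.full((4, 3), 2.0),
    "1": np.full((4, 3), 1.0),
}
traj = stack_frames(example_pos)
print(traj.shape)
```

Sorting with `key=int` matters because plain string sorting would place '10' before '2'.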


All the examples above can be repeated for chain 2 by changing the directory to C2:

In [32]:
C2 = cndbf['C2']
print(C2.keys())

<KeysViewHDF5 ['genomic_position', 'loops', 'spatial_position', 'time', 'types']>
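
Because every chain label has the same keys, the same code can be applied to all chains in a loop. A minimal sketch, assuming the chain names come from the comma-separated 'chains' attribute of the Header; `chain_summary` is a hypothetical helper, and the nested dict is a made-up stand-in for the open file.

```python
def chain_summary(cndb, chain_names):
    """Return {chain: (number of beads, number of frames)} for each chain."""
    summary = {}
    for name in chain_names:
        chain = cndb[name]
        n_beads = len(chain["types"])
        n_frames = len(chain["spatial_position"])
        summary[name] = (n_beads, n_frames)
    return summary

# Stand-in for the open file (hypothetical sizes, placeholder frame contents).
example_cndb = {
    "C1": {"types": ["NA"] * 5, "spatial_position": {"1": None, "2": None}},
    "C2": {"types": ["NA"] * 3, "spatial_position": {"1": None}},
}
chain_names = "C1,C2".split(",")  # same format as the Header 'chains' attribute
summary = chain_summary(example_cndb, chain_names)
print(summary)
```

With the real file, the call would be `chain_summary(cndbf, cndbf['Header'].attrs['chains'].split(','))`.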
