# Feature Tutorial

Besides datatracks, features are another aspect of chromatin, genome and structural biology that we routinely need to investigate. The term can be used very broadly, but to a first approximation a genome feature is just a named region of the genome which doesn't change. This is distinct from a datatrack, which defines some function over genomic regions (see the Datatrack tutorial).

Of course, we could just encode features as numpy arrays on some chromosome:

In [1]:
import numpy as np
myregion = np.array([10,20])
mychrom = '4'

myfeature = (mychrom, myregion)
print(myfeature)

('4', array([10, 20]))


However, this isn't very flexible, and once we scale to large numbers of features it may not be very descriptive or useful either. Firstly, we often need a way to link features together. For example:

- A given gene may be associated with a given promoter.
- A topologically associating domain (TAD) may contain multiple promoters within it.

Secondly, features may have condition-specific attributes which we might want to assign to them (while still having access to the same condition-independent feature). For example:

- A promoter might have active marks in naive mESCs but bivalent marks in primed mESCs
- A TAD may be within the A compartment in naive mESCs but move to the B compartment in primed mESCs

Finally, we may want to keep track of datatracks within specific classes of features. For example:

- We may want to know average rates of DNA methylation at active vs. bivalent promoters in naive mESCs

In order to do all of these things, we'll make use of the Datatrack classes and the Feature classes. A bit of work will be needed to create our feature classes but once done it will provide some flexible analysis tools.
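
To make the last motivating example concrete, here is a minimal NumPy-only sketch of averaging a signal over two classes of promoters. The signal values, the region coordinates and the helper `mean_signal` are all made up for illustration, and regions are assumed to be half-open `[start, end)` intervals; none of this uses GenTools.

```python
import numpy as np

# Hypothetical per-base methylation signal along one chromosome (values invented).
signal = np.array([0.9, 0.8, 0.1, 0.0, 0.2, 0.7, 0.6, 0.5, 0.1, 0.1])

# Hypothetical promoter regions, one row per promoter, as [start, end) pairs.
promoters = {
    "active": np.array([[2, 5], [8, 10]]),
    "bivalent": np.array([[0, 2], [5, 8]]),
}

def mean_signal(signal, regions):
    """Mean of `signal` over all positions covered by the rows of `regions`."""
    values = np.concatenate([signal[start:end] for start, end in regions])
    return values.mean()

for label, regions in promoters.items():
    print(label, round(float(mean_signal(signal, regions)), 3))
```

Bookkeeping like this is manageable for two classes on one chromosome, but it becomes unwieldy once features gain relationships and condition-specific labels.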

## Modules

In [2]:
from GenTools.tools import Feature as F
from GenTools.tools import Datatrack as Dt
from GenTools.utils import dtrack_utils as dtu

import pandas as pd
import numpy as np

## The Feature Class
The main class which we'll use to encode general features is the Feature class. At its base, all a feature needs is a type and a unique ID which allows us to keep track of that feature:

In [3]:
myfeat1 = F.Feature('type1', 'type1_UID1')

print("This is a feature of type:\t{}\nand with ID:\t{}".format(myfeat1.type, myfeat1.id))

This is a feature of type:	type1
and with ID:	type1_UID1


### Feature Attributes
One optional input when we construct our feature is attrs (attributes). Essentially this is extra info beyond the feature ID and the feature type which we might want to associate with the feature. Let's add a 'chromosome' attribute to our feature myfeat1:

In [4]:
myfeat1.attrs['chromosome'] = '1'

print(myfeat1.attrs)

{'chromosome': '1'}


Feature attributes can really be anything:

In [5]:
myfeat1.attrs['region'] = np.array([[10,20]]).astype('int32')
print(myfeat1.attrs)

{'chromosome': '1', 'region': array([[10, 20]], dtype=int32)}
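
One practical reason to store a region as a two-dimensional array, even for a single interval, is that further intervals can be added as extra rows later. The sketch below is plain NumPy and assumes (the tutorial does not state it) that each row is a `[start, end)` pair; the `attrs` dictionary is an ordinary dict standing in for `myfeat1.attrs`, and the second interval is invented.

```python
import numpy as np

# An ordinary dict standing in for a feature's attrs (not a GenTools object).
attrs = {"chromosome": "1",
         "region": np.array([[10, 20]]).astype("int32")}

# Append a second, hypothetical interval as a new row.
attrs["region"] = np.vstack([attrs["region"],
                             np.array([[30, 45]], dtype="int32")])

# Interval widths under the assumed [start, end) convention.
widths = attrs["region"][:, 1] - attrs["region"][:, 0]
print(attrs["region"].shape, widths.tolist())
```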


### Feature Children
Features can also have children:

In [6]:
#Let's make a second feature of type 'type2'
myfeat2 = F.Feature('type2',
                    'type2_UID1',
                    attrs = {'chromosome': '1',
                             'region': np.array([[11, 13]]).astype('int32')
                            }
                   )


#Now let's add it to our first feature as a child
myfeat1.add_child(myfeat2, #The child feature we want to add
                  'type2' #The type of child feature we're adding
                 )

The children of myfeat1 are stored in a dictionary in myfeat1.children. In particular, since we have specified our child type as 'type2', myfeat2 will appear in myfeat1.children['type2']:

In [7]:
print("Children of {}:".format(myfeat1.id))
print(myfeat1.children)

#We can also access the attributes of children
print("Attributes of {}".format(myfeat1.children['type2'][0].id))
print(myfeat1.children['type2'][0].attrs)

Children of type1_UID1:
{'type2': [<GenTools.tools.Feature.Feature object at 0x7f5705374690>]}
Attributes of type2_UID1
{'chromosome': '1', 'region': array([[11, 13]], dtype=int32)}


Note that at the moment there is nothing stopping the same feature from being added twice:

In [8]:
#Now let's add the same feature to our first feature a second time
myfeat1.add_child(myfeat2, #The child feature we want to add
                  'type2' #The type of child feature we're adding
                 )

print(myfeat1.children)

{'type2': [<GenTools.tools.Feature.Feature object at 0x7f5705374690>, <GenTools.tools.Feature.Feature object at 0x7f5705374690>]}


However, since we have added the same feature (myfeat2) twice, note that both items in the list myfeat1.children['type2'] have the same memory address, i.e. they're the same object.

Let's see what happens if we modify an attribute of myfeat2:

In [9]:
print("Before editing the region the attributes of {} are:".format(myfeat2.id))
print(myfeat2.attrs)
print("\n")
print("###################################################")
#Changing the region of myfeat2 to be longer
myfeat2.attrs['region'][0,1] = 17

print("After editing the region of the attributes of {}, accessing the attrs via {} we get:".format(myfeat2.id, myfeat2.id))
print(myfeat2.attrs)
print("\n")
print("###################################################")

print("After editing the region of the attributes of {}, accessing the attrs via {} we get:".format(myfeat2.id, myfeat1.id))
print("Child type2 number 0:")
print(myfeat1.children['type2'][0].attrs)
print("Child type2 number 1:")
print(myfeat1.children['type2'][1].attrs)

Before editing the region the attributes of type2_UID1 are:
{'chromosome': '1', 'region': array([[11, 13]], dtype=int32)}


###################################################
After editing the region of the attributes of type2_UID1, accessing the attrs via type2_UID1 we get:
{'chromosome': '1', 'region': array([[11, 17]], dtype=int32)}


###################################################
After editing the region of the attributes of type2_UID1, accessing the attrs via type1_UID1 we get:
Child type2 number 0:
{'chromosome': '1', 'region': array([[11, 17]], dtype=int32)}
Child type2 number 1:
{'chromosome': '1', 'region': array([[11, 17]], dtype=int32)}


Because the two added children (both myfeat2) are actually the same object, any edit to that object is visible through both entries in myfeat1.children.
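
This is ordinary Python reference semantics rather than anything specific to GenTools. The sketch below reproduces the effect with plain dictionaries standing in for features, and shows that `copy.deepcopy` gives an independent copy if shared edits are not wanted. The dictionary layout is illustrative only.

```python
import copy
import numpy as np

# A plain dict standing in for a child feature (illustrative layout).
child = {"id": "type2_UID1",
         "attrs": {"region": np.array([[11, 13]], dtype="int32")}}

# Adding the same object twice stores two references to one object.
children = {"type2": [child, child]}
print(children["type2"][0] is children["type2"][1])

# A deep copy is a separate object with its own region array.
snapshot = copy.deepcopy(child)

# Editing the original is visible through both references, but not the copy.
child["attrs"]["region"][0, 1] = 17
print(children["type2"][1]["attrs"]["region"])
print(snapshot["attrs"]["region"])
```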

Finally, the 'type2' argument to the add_child method is optional, since the child type can be inferred from myfeat2:

In [10]:
myfeat1 = F.Feature('type1', 'type1_UID1')
myfeat2 = F.Feature('type2', 'type2_UID1')
myfeat1.add_child(myfeat2)

print(myfeat1.children)

{'type2': [<GenTools.tools.Feature.Feature object at 0x7f577c136b10>]}


However, we may want to sort children of the same feature type into different groups when we add them to our features. To do this we can specify the child type explicitly:

In [11]:
myfeat1 = F.Feature('type1', 'type1_UID1')
myfeat2 = F.Feature('type2', 'type2_UID1')
myfeat3 = F.Feature('type2', 'type2_UID2') #Third feature of the same type as myfeat2 but with a different UID
myfeat1.add_child(myfeat2, 'type2a')
myfeat1.add_child(myfeat3, 'type2b')

print(myfeat1.children)

{'type2a': [<GenTools.tools.Feature.Feature object at 0x7f5705374c90>], 'type2b': [<GenTools.tools.Feature.Feature object at 0x7f577c136ad0>]}
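
To summarise the behaviour seen so far, here is a small stand-in class that mimics it. It is not the GenTools implementation: `SketchFeature` and its internals are hypothetical, written only from what the outputs above show (children held in a dictionary of lists, keyed by a child type that defaults to the child's own type, with no duplicate check).

```python
from collections import defaultdict

class SketchFeature:
    """Hypothetical stand-in for a feature; not the GenTools class."""

    def __init__(self, ftype, fid, attrs=None):
        self.type = ftype
        self.id = fid
        # Use a fresh dict per instance rather than a shared mutable default.
        self.attrs = dict(attrs) if attrs else {}
        self.children = defaultdict(list)

    def add_child(self, child, child_type=None):
        # Fall back to the child's own type when no group name is given.
        key = child.type if child_type is None else child_type
        self.children[key].append(child)

parent = SketchFeature("type1", "type1_UID1")
a = SketchFeature("type2", "type2_UID1")
b = SketchFeature("type2", "type2_UID2")

parent.add_child(a)             # grouped under 'type2'
parent.add_child(b, "type2b")   # grouped under a custom name
parent.add_child(a)             # duplicates are not rejected

print({key: [c.id for c in kids] for key, kids in parent.children.items()})
```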


### Feature Parents
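Features can also have parents. These work in just the same way as children, with parents stored in a dictionary keyed by parent type. In the example below the link is added explicitly in both directions: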

In [15]:
myfeat1 = F.Feature('type1', 'type1_UID1')
myfeat2 = F.Feature('type2', 'type2_UID1')
myfeat1.add_child(myfeat2)
myfeat2.add_parent(myfeat1)

print("Child information for {}:".format(myfeat1.id))
print(myfeat1.children['type2'][0].id)
print(myfeat1.children['type2'][0].type)
print(myfeat1.children['type2'][0].attrs)
print("\n")
print("Parent information for {}:".format(myfeat2.id))
print(myfeat2.parents['type1'][0].id)
print(myfeat2.parents['type1'][0].type)
print(myfeat2.parents['type1'][0].attrs)

Child information for type1_UID1:
type2_UID1
type2
{}


Parent information for type2_UID1:
type1_UID1
type1
{}
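
The example above records the relationship twice, once from each side. If that pattern comes up often, a small helper can set both links in one call. The sketch below uses a hypothetical stand-in class `Node` and a hypothetical `link` function; whether GenTools offers something equivalent is not shown in this tutorial.

```python
from collections import defaultdict

class Node:
    """Hypothetical stand-in with the same children/parents layout."""

    def __init__(self, ftype, fid):
        self.type = ftype
        self.id = fid
        self.children = defaultdict(list)
        self.parents = defaultdict(list)

def link(parent, child):
    """Record the relationship on both objects, keyed by feature type."""
    parent.children[child.type].append(child)
    child.parents[parent.type].append(parent)

tad = Node("tad", "tad_UID1")
promoter = Node("promoter", "promoter_UID1")
link(tad, promoter)

print(tad.children["promoter"][0].id)
print(promoter.parents["tad"][0].id)
```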


## Single Condition Features
Sometimes we might want to associate a feature with a single condition, and for this there is the Feature_single_condition class. The only difference between this and the standard Feature class is that a condition must be specified at initialisation; it is then available as attrs['condition']. For example, we might specify a TAD which seems to be unique to the naive mESC timepoint:

In [18]:
mESC_TAD = F.Feature_single_condition('tad',       #Feature type
                                      'tad_UID1',  #Feature UID
                                      'naive',      #Condition
                                      attrs = {'region': np.array([[10,20]]).astype('int32'),
                                               'chromosome': '2'
                                              }
                                     )
print(mESC_TAD.type)
print(mESC_TAD.id)
print(mESC_TAD.attrs['condition'])
print(mESC_TAD.attrs['chromosome'])
print(mESC_TAD.attrs['region'])

tad
tad_UID1
naive
2
[[10 20]]


## Single Cell Features
Sometimes we might want to associate a feature with a single cell, and for this there is the Feature_single_cell class, a subclass of single-condition features. The only difference between this and the Feature_single_condition class is that a cell identifier must also be specified at initialisation. For example, we might specify a loop which seems to be unique to a single cell in the naive mESC timepoint:

In [19]:
mESC_loop = F.Feature_single_cell('loop',       #Feature type
                                  'loop_UID1',  #Feature UID
                                  'cell_1',     #Cell ID
                                  'naive',      #Condition
                                  attrs = {'contacts': np.array([[10,20],[50,60]]).astype('int32'),
                                           'chromosome': '2'
                                          }
                                 )
print(mESC_loop.type)
print(mESC_loop.id)
print(mESC_loop.attrs['condition'])
print(mESC_loop.attrs['chromosome'])
print(mESC_loop.attrs['contacts'])

loop
loop_UID1
naive
2
[[10 20]
 [50 60]]
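
The relationship between the three classes can be pictured as a chain of subclasses, each adding one required argument. The sketch below is a guess at that shape using hypothetical class names; the outputs above confirm only that the condition ends up in `attrs['condition']`, so storing the cell ID under `attrs['cell']` is purely an assumption.

```python
class Base:
    """Hypothetical stand-in for the general feature class."""

    def __init__(self, ftype, fid, attrs=None):
        self.type = ftype
        self.id = fid
        self.attrs = dict(attrs) if attrs else {}

class SingleCondition(Base):
    """Adds a required condition, stored alongside the other attributes."""

    def __init__(self, ftype, fid, condition, attrs=None):
        super().__init__(ftype, fid, attrs)
        self.attrs["condition"] = condition

class SingleCell(SingleCondition):
    """Adds a required cell ID; the 'cell' key is an assumed name."""

    def __init__(self, ftype, fid, cell, condition, attrs=None):
        super().__init__(ftype, fid, condition, attrs)
        self.attrs["cell"] = cell

loop = SingleCell("loop", "loop_UID1", "cell_1", "naive",
                  attrs={"chromosome": "2"})
print(loop.attrs)
print(isinstance(loop, SingleCondition))
```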
