## Counting and math

Apart from introducing non-binary trees, the power of NBNode comes from its included counting and math mechanisms. 
Each ``NBNode`` has a ``math_node_attribute`` which is used to calculate math on. This is usually set to ``counter``. 

In this example, we will use a small test dataset coming with the package. It comes from a flow cytometry experiment with 13 features (columns) of 999 cells (rows). 
Each cell can further be classified into cell types, which we defined with prior biological knowledge as a tree given in ``nbtree``.


#### Counting

I start by introducing how to count. 

In [1]:
import os
import re
import pandas as pd

print(os.getcwd())
cellmat = pd.read_csv(
    os.path.join(
        os.pardir,
        os.pardir,
        "tests",
        "testdata",
        "flowcytometry",
        "gated_cells",
        "cellmat.csv",
    )
)
# FS TOF (against FS INT which is "FS")
cellmat.rename(columns={"FS_TOF": "FS.0"}, inplace=True)
cellmat.columns = [re.sub("_.*", "", x) for x in cellmat.columns]
print(cellmat)


/home/gugl/clonedgit/ccc_verse/nbnode/docs/notebooks
         FS  FS.0      SS  CD45RA  CCR7  CD28   PD1  CD27   CD4   CD8   CD3   
0    197657    94  186372    3.90  6.34  4.97 -1.98  7.51  5.87  3.55  5.83  \
1    180716    92  135447    6.48  6.63  5.17  3.07  7.38  5.49  2.64  5.83   
2    134129    90  168268    5.92  6.53  5.39  2.60  7.57  5.70  2.54  5.74   
3    239241    94   79262    5.47  6.57  4.68  3.30  7.36  5.75  2.76  6.06   
4    246527    89   97635    6.12  6.26  5.22  3.05  7.40  5.70  2.66  6.29   
..      ...   ...     ...     ...   ...   ...   ...   ...   ...   ...   ...   
994  176236    90  149982    6.48 -1.11  2.85 -1.55  2.28  0.59  1.70  0.39   
995  191863    99  115406    6.30  5.19  3.01  2.07 -1.58  0.62  1.02  0.73   
996  217752    93  124675    6.35  4.75  0.42  1.89  2.02  0.52  1.48  0.53   
997  334174    97  210458    1.90  1.36  1.22  2.52 -0.72  0.59  1.03  0.75   
998  308089   103  219747    6.48 -0.42  1.23  2.64  7.07  0.57  1.82  1.72   

In [2]:
import nbnode.nbnode_trees as nbtree
cell_tree = nbtree.tree_complete_aligned_trunk()
cell_tree.pretty_print("__long__")

AllCells (counter:0, decision_name:None, decision_value:None)
├── DN (counter:0, decision_name:['CD4', 'CD8'], decision_value:[-1, -1])
├── DP (counter:0, decision_name:['CD4', 'CD8'], decision_value:[1, 1])
├── CD4-/CD8+ (counter:0, decision_name:['CD4', 'CD8'], decision_value:[-1, 1])
│   ├── naive (counter:0, decision_name:['CCR7', 'CD45RA'], decision_value:[1, 1])
│   ├── Tcm (counter:0, decision_name:['CCR7', 'CD45RA'], decision_value:[1, -1])
│   ├── Temra (counter:0, decision_name:['CCR7', 'CD45RA'], decision_value:[-1, 1])
│   └── Tem (counter:0, decision_name:['CCR7', 'CD45RA'], decision_value:[-1, -1])
└── CD4+/CD8- (counter:0, decision_name:['CD4', 'CD8'], decision_value:[1, -1])
    ├── naive (counter:0, decision_name:['CCR7', 'CD45RA'], decision_value:[1, 1])
    ├── Tcm (counter:0, decision_name:['CCR7', 'CD45RA'], decision_value:[1, -1])
    ├── Temra (counter:0, decision_name:['CCR7', 'CD45RA'], decision_value:[-1, 1])
    └── Tem (counter:0, decision_name:['CCR7', 'CD4

Let's predict the cell type of all cells. This returns a pandas Series of 999 predicted nodes, one per cell!

In [3]:
cell_preds = cell_tree.predict(cellmat)
print(cell_preds)

0      (((NBNode('/AllCells/DP', counter=0, decision_...
1      (((NBNode('/AllCells/DP', counter=0, decision_...
2      (((NBNode('/AllCells/DP', counter=0, decision_...
3      (((NBNode('/AllCells/DP', counter=0, decision_...
4      (((NBNode('/AllCells/DP', counter=0, decision_...
                             ...                        
994    (((NBNode('/AllCells/DP', counter=0, decision_...
995    (((NBNode('/AllCells/DP', counter=0, decision_...
996    (((NBNode('/AllCells/DP', counter=0, decision_...
997    (((NBNode('/AllCells/DP', counter=0, decision_...
998    (((NBNode('/AllCells/DP', counter=0, decision_...
Length: 999, dtype: object


This by itself did not change anything in the tree. 

I will introduce another NBNode attribute: ``NBNode.ids``. This is a list of numerical indices indicating which _predicted_ nodes are "contained" in a specific node. 
Naturally, ``root.ids`` should contain _ALL_ ids, and every other node only the ids of the predictions which end in that node or passed through it on the way to an end node. 

Even after predicting, no ids are set, so this is still an empty list. 

In [4]:
print(cell_tree.ids)
print(cell_tree["/AllCells/DP"].ids)

[]
[]


To set the ids, you have to actively use the predicted nodes and identify their ids. ``cell_tree.id_preds`` takes a list of nodes and sorts them into the tree. The numerical index refers to the order in which the predicted nodes occurred!



In [5]:
cell_tree.id_preds(cell_preds)
print(cell_tree.ids[0:10])
print(len(cell_tree.ids))

# With this here we see that nodes [69, 74, 443, 972, 973] are all in /AllCells/CD4-/CD8+
# or a node below!
print(cell_tree["/AllCells/CD4-/CD8+"].ids[0:10])
print(cell_tree["/AllCells/CD4-/CD8+"].ids[0:10])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
999
[69, 74, 443, 972, 973]
[69, 74, 443, 972, 973]




However, it would be interesting to know how many cells are in each node. For this, we can use ``cell_tree.count(cell_preds)``.

In [6]:
cell_tree.count(cell_preds)

If ``NBNode.ids`` is already set, we do not have to recount: we can directly use ``len(every_node.ids)``, which saves a lot of computation. 

In [7]:
cell_tree.count(cell_preds, use_ids=True)

Internally, this iterates over every predicted node and walks through the tree until reaching that node. 
The (numerical) index of the predicted node is appended to the ``node.ids`` of every node passed on the way. 
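
The bookkeeping described above can be sketched without the package. The following is only a minimal illustration, assuming a hypothetical ``ToyNode`` class with a ``parent`` link and hypothetical helper functions; it is not NBNode's implementation, just the idea of appending each prediction's index to every node on its path and then counting via ``len(ids)``.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a tree node (NOT the NBNode class)."""

    name: str
    parent: Optional["ToyNode"] = None
    ids: List[int] = field(default_factory=list)
    counter: int = 0

    def path_to_root(self):
        """Yield this node and all of its ancestors up to the root."""
        node = self
        while node is not None:
            yield node
            node = node.parent


def toy_id_preds(predicted):
    """Append the position of each prediction to every node on its path."""
    for index, node in enumerate(predicted):
        for passed in node.path_to_root():
            passed.ids.append(index)


def toy_count_from_ids(nodes):
    """Set each counter to the number of ids stored in the node."""
    for node in nodes:
        node.counter = len(node.ids)


root = ToyNode("AllCells")
dp = ToyNode("DP", parent=root)
cd8 = ToyNode("CD4-/CD8+", parent=root)
naive = ToyNode("naive", parent=cd8)

# Four toy "cells": three end in DP, one ends in CD4-/CD8+/naive.
predictions = [dp, dp, naive, dp]
toy_id_preds(predictions)
toy_count_from_ids([root, dp, cd8, naive])

print(root.ids, root.counter)  # the root contains every index
print(cd8.ids, naive.ids)      # the intermediate node shares the leaf's index
```

In this sketch the intermediate node ``CD4-/CD8+`` receives the same index as its leaf, which mirrors why the intermediate counts in the real tree are the sums of their children.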

In [8]:
cell_tree.pretty_print()

AllCells (counter:999)
├── DN (counter:0)
├── DP (counter:973)
├── CD4-/CD8+ (counter:5)
│   ├── naive (counter:5)
│   ├── Tcm (counter:0)
│   ├── Temra (counter:0)
│   └── Tem (counter:0)
└── CD4+/CD8- (counter:21)
    ├── naive (counter:20)
    ├── Tcm (counter:0)
    ├── Temra (counter:1)
    └── Tem (counter:0)


We see that the printed ``counter`` is now filled, and the majority of cells are ``/AllCells/DP`` (which are double positive T-cells, but that does not matter for our examples). 


Finally, we can export the counts per node. For this we should set ``.data`` first; see another Jupyter notebook for further explanation. 

In [9]:
cell_tree.data = cellmat
print("\nCounts for every sample, only leaf (=end) nodes:")
print(cell_tree.export_counts(only_leafnodes=True).transpose())
print("\n\nCounts for every sample, leaf AND intermediate nodes:")
print(cell_tree.export_counts(only_leafnodes=False).transpose())


Counts for every sample, only leaf (=end) nodes:
Sample                       0
/AllCells/DN                 0
/AllCells/DP               973
/AllCells/CD4-/CD8+/naive    5
/AllCells/CD4-/CD8+/Tcm      0
/AllCells/CD4-/CD8+/Temra    0
/AllCells/CD4-/CD8+/Tem      0
/AllCells/CD4+/CD8-/naive   20
/AllCells/CD4+/CD8-/Tcm      0
/AllCells/CD4+/CD8-/Temra    1
/AllCells/CD4+/CD8-/Tem      0


Counts for every sample, leaf AND intermediate nodes:
Sample                       0
/AllCells/DN                 0
/AllCells/DP               973
/AllCells/CD4-/CD8+/naive    5
/AllCells/CD4-/CD8+/Tcm      0
/AllCells/CD4-/CD8+/Temra    0
/AllCells/CD4-/CD8+/Tem      0
/AllCells/CD4-/CD8+          5
/AllCells/CD4+/CD8-/naive   20
/AllCells/CD4+/CD8-/Tcm      0
/AllCells/CD4+/CD8-/Temra    1
/AllCells/CD4+/CD8-/Tem      0
/AllCells/CD4+/CD8-         21
/AllCells                  999




#### Math on one NBNode

Now that we have a useful number assigned to each node, we can do quite a bit of math. 
Each ``NBNode`` has a ``math_node_attribute`` which is used to calculate math on. This is usually set to ``counter``. 

We can then use usual math to add, subtract, multiply, etc. nodes with numerics.  

Note that this is then not backed up by ``NBNode.ids`` anymore!

In [10]:
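# Math operations do not happen inplace
added_tree = cell_tree + 100
# pretty_print() prints the tree itself and returns None,
# which is why wrapping it in print() also shows "None" in the output below
print(added_tree.pretty_print())

print(cell_tree.pretty_print())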

AllCells (counter:1099)
├── DN (counter:100)
├── DP (counter:1073)
├── CD4-/CD8+ (counter:105)
│   ├── naive (counter:105)
│   ├── Tcm (counter:100)
│   ├── Temra (counter:100)
│   └── Tem (counter:100)
└── CD4+/CD8- (counter:121)
    ├── naive (counter:120)
    ├── Tcm (counter:100)
    ├── Temra (counter:101)
    └── Tem (counter:100)
None
AllCells (counter:999)
├── DN (counter:0)
├── DP (counter:973)
├── CD4-/CD8+ (counter:5)
│   ├── naive (counter:5)
│   ├── Tcm (counter:0)
│   ├── Temra (counter:0)
│   └── Tem (counter:0)
└── CD4+/CD8- (counter:21)
    ├── naive (counter:20)
    ├── Tcm (counter:0)
    ├── Temra (counter:1)
    └── Tem (counter:0)
None


Re-counting by using the ids RESETS all math operations and overwrites the counter with the length of the ids!

In [11]:
added_tree.count(use_ids=True)
added_tree.pretty_print()

AllCells (counter:999)
├── DN (counter:0)
├── DP (counter:973)
├── CD4-/CD8+ (counter:5)
│   ├── naive (counter:5)
│   ├── Tcm (counter:0)
│   ├── Temra (counter:0)
│   └── Tem (counter:0)
└── CD4+/CD8- (counter:21)
    ├── naive (counter:20)
    ├── Tcm (counter:0)
    ├── Temra (counter:1)
    └── Tem (counter:0)
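
The two behaviours shown so far (math returns a new tree, and recounting from the ids discards the math) can be sketched with a toy class. This is a hedged illustration only: ``ToyNode``, ``iter_nodes`` and ``count_from_ids`` are hypothetical names, and the deep copy is an assumption about one way to avoid changing the original tree, not a statement about NBNode's internals.

```python
import copy


class ToyNode:
    """Hypothetical minimal node, only for illustration (NOT NBNode)."""

    def __init__(self, name, ids=None, children=None):
        self.name = name
        self.ids = list(ids or [])
        self.children = list(children or [])
        self.counter = len(self.ids)

    def iter_nodes(self):
        """Yield this node and all nodes below it."""
        yield self
        for child in self.children:
            yield from child.iter_nodes()

    def __add__(self, number):
        # Work on a deep copy so the original tree keeps its counters.
        result = copy.deepcopy(self)
        for node in result.iter_nodes():
            node.counter = node.counter + number
        return result

    def count_from_ids(self):
        """Overwrite every counter with the number of stored ids."""
        for node in self.iter_nodes():
            node.counter = len(node.ids)


leaf = ToyNode("naive", ids=[2, 5])
root = ToyNode("AllCells", ids=[0, 1, 2, 3, 4, 5], children=[leaf])

shifted = root + 100
before = (shifted.counter, shifted.children[0].counter)

# The ids were copied unchanged, so recounting throws the "+ 100" away.
shifted.count_from_ids()
after = (shifted.counter, shifted.children[0].counter)

print(before, after, root.counter)
```

The counter and the ids are two separate pieces of state: math only touches the counter, so any recount from the ids restores the original numbers.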


To focus on the important math, we will only print the root node from now on, but we could use ``pretty_print()`` every time to show that the operations happen on every node. 

In [12]:
new_tree = cell_tree + 100
print(new_tree)

new_tree = new_tree - 10
print(new_tree)

new_tree = new_tree *2
print(new_tree)


NBNode('/AllCells', counter=1099, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=1089, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=2178, decision_name=None, decision_value=None)


Sometimes the type of the counter matters for a math operation. There are two options to deal with that:

 1. Change the math operation such that it is appropriate
 2. Modify the type of the tree
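
For orientation, the error message quoted in the next cell is the kind of message CPython emits when a method taken from the ``float`` type is applied to an ``int``. The sketch below reproduces it in plain Python and shows both options from the list above on a bare number. Whether NBNode applies operators in exactly this way internally is an assumption made for illustration.

```python
import operator

counter = 2178  # an integer counter, as in the tree above

# Applying float's division method to an int raises a TypeError.
try:
    float.__truediv__(counter, 3)
    message = None
except TypeError as err:
    message = str(err)
print(message)

# Option 1: use an operation that works for the current type.
option_1 = operator.truediv(counter, 3)

# Option 2: convert the value first, then apply the float method.
option_2 = float.__truediv__(float(counter), 3.0)

print(option_1, option_2)
```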

In [13]:
try:
    new_tree = new_tree / 3
except TypeError:
    print("TypeError: descriptor '__truediv__' requires a 'float' object but received a 'int'")

new_tree = new_tree / 3.0
# Note that the error does NOT happen in the root node, so some math
# operations might already have been done when it is raised!
print(new_tree)

# With astype_math_node_attribute we can change the type of the math node attribute
# from all nodes in the tree
print(new_tree.astype_math_node_attribute(float))
print(new_tree.astype_math_node_attribute(int))

NBNode('/AllCells', counter=242.0, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=242.0, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=242, decision_name=None, decision_value=None)


In [14]:
print(new_tree / 15)
print(new_tree // 15)
print(new_tree % 15)
print(new_tree << 2)
print(new_tree >> 2)


NBNode('/AllCells', counter=16.133333333333333, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=16, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=2, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=968, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=60, decision_name=None, decision_value=None)


##### Equalities

When introducing counters, suddenly the same "trees" are not identical anymore: 



In [15]:
print(cell_tree == cell_tree)
print(cell_tree == cell_tree + 100)

True
False


Therefore we introduce the difference between "structural" and "complete" identity. 
Two trees are structurally equal if their ``node.name``, ``node.decision_name`` and ``node.decision_value`` are equal; everything else can be different. 

In [16]:
from nbnode.nbnode import NBNode
import nbnode.nbnode_trees as nbtree

original_tree = nbtree.tree_simple()
new_tree = nbtree.tree_simple()
print(new_tree == new_tree  + 100)
print(new_tree.eq_structure(new_tree + 100))

NBNode("ADDITIONAL_NODE", parent=new_tree)

new_tree.pretty_print()
original_tree.pretty_print()

print(original_tree == original_tree + 100)
print(original_tree == new_tree)

# You can generate a new tree by only copying the structure, then counts and data are not copied: 
new_tree = original_tree.copy_structure()

False
True
a (counter:0)
├── a0 (counter:0)
├── a1 (counter:0)
│   └── a1a (counter:0)
├── a2 (counter:0)
└── ADDITIONAL_NODE (counter:0)
a (counter:0)
├── a0 (counter:0)
├── a1 (counter:0)
│   └── a1a (counter:0)
└── a2 (counter:0)
False
False
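
The difference between structural and complete equality can also be sketched independently of the package. The code below is a minimal illustration with a hypothetical ``ToyNode`` dataclass and two hypothetical functions; it assumes that children are compared in order, which may differ from what ``eq_structure`` does in NBNode.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ToyNode:
    """Hypothetical node used only for this sketch (NOT NBNode)."""

    name: str
    decision_name: Optional[Tuple[str, ...]] = None
    decision_value: Optional[Tuple[int, ...]] = None
    counter: int = 0
    children: List["ToyNode"] = field(default_factory=list)


def toy_eq_structure(a, b):
    """Compare name, decision_name and decision_value of all nodes."""
    same_node = (a.name, a.decision_name, a.decision_value) == (
        b.name,
        b.decision_name,
        b.decision_value,
    )
    return (
        same_node
        and len(a.children) == len(b.children)
        and all(toy_eq_structure(x, y) for x, y in zip(a.children, b.children))
    )


def toy_eq_complete(a, b):
    """Structural equality plus identical counters on every node."""
    return (
        toy_eq_structure(a, b)
        and a.counter == b.counter
        and all(toy_eq_complete(x, y) for x, y in zip(a.children, b.children))
    )


tree_a = ToyNode("a", children=[ToyNode("a0"), ToyNode("a1")])
tree_b = ToyNode("a", children=[ToyNode("a0", counter=100), ToyNode("a1")])
tree_c = ToyNode("a", children=[ToyNode("a0"), ToyNode("a1"), ToyNode("extra")])

print(toy_eq_structure(tree_a, tree_b), toy_eq_complete(tree_a, tree_b))
print(toy_eq_structure(tree_a, tree_c))
```

A changed counter breaks only complete equality, while an additional node breaks structural equality as well.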




#### Math with multiple NBNodes

We can also use usual math to add, subtract, multiply, etc. nodes with each other. Explicitly, this traverses all nodes in both trees simultaneously and applies the mathematical operation to the ``math_node_attribute`` of both. The result is then saved in the ``math_node_attribute`` of a new tree; neither of the two trees is changed inplace. 

Note that this is then not backed up by ``NBNode.ids`` anymore!
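
The simultaneous traversal described above can be sketched as follows. This is a hedged toy version: ``ToyNode`` and ``combine`` are hypothetical names, and the requirement that both trees have the same structure is an assumption of the sketch. The numbers are a reduced version of the example used in the cells below.

```python
import operator


class ToyNode:
    """Hypothetical minimal node (NOT NBNode)."""

    def __init__(self, name, counter=0, children=None):
        self.name = name
        self.counter = counter
        self.children = list(children or [])


def combine(left, right, op):
    """Walk both trees in lockstep and return a NEW tree with op(left, right)."""
    if left.name != right.name or len(left.children) != len(right.children):
        raise ValueError("this sketch assumes structurally equal trees")
    return ToyNode(
        left.name,
        counter=op(left.counter, right.counter),
        children=[
            combine(l_child, r_child, op)
            for l_child, r_child in zip(left.children, right.children)
        ],
    )


tree_1 = ToyNode(
    "AllCells", 999, [ToyNode("DP", 973), ToyNode("CD4-/CD8+", 5)]
)
tree_2 = ToyNode(
    "AllCells", 1, [ToyNode("DP", 1), ToyNode("CD4-/CD8+", 1000)]
)

added = combine(tree_1, tree_2, operator.add)
subtracted = combine(tree_1, tree_2, operator.sub)

print([added.counter] + [child.counter for child in added.children])
print([subtracted.counter] + [child.counter for child in subtracted.children])
print(tree_1.counter, tree_2.counter)  # the inputs are unchanged
```

Because the operation is applied node by node, the result of an intermediate node is not necessarily the sum of its children anymore.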

In [17]:
import copy
import nbnode.nbnode_trees as nbtree
cell_tree = nbtree.tree_complete_aligned_trunk()
cell_tree.id_preds(cell_tree.predict(cellmat))
cell_tree.count(use_ids=True)
cell_tree.pretty_print()

cell_tree_2 = copy.deepcopy(cell_tree)
# Reset the counts of the nodes
cell_tree_2.reset_counts()
cell_tree_2 = cell_tree_2 + 1

# You can set the counter values manually. 
# Keep in mind that setting an intermediate node (like this one)
#  might not make any sense biologically as every cell must reach a leaf node
cell_tree_2["/AllCells/CD4-/CD8+"].counter = 1000
cell_tree_2.pretty_print()


AllCells (counter:999)
├── DN (counter:0)
├── DP (counter:973)
├── CD4-/CD8+ (counter:5)
│   ├── naive (counter:5)
│   ├── Tcm (counter:0)
│   ├── Temra (counter:0)
│   └── Tem (counter:0)
└── CD4+/CD8- (counter:21)
    ├── naive (counter:20)
    ├── Tcm (counter:0)
    ├── Temra (counter:1)
    └── Tem (counter:0)
AllCells (counter:1)
├── DN (counter:1)
├── DP (counter:1)
├── CD4-/CD8+ (counter:1000)
│   ├── naive (counter:1)
│   ├── Tcm (counter:1)
│   ├── Temra (counter:1)
│   └── Tem (counter:1)
└── CD4+/CD8- (counter:1)
    ├── naive (counter:1)
    ├── Tcm (counter:1)
    ├── Temra (counter:1)
    └── Tem (counter:1)


In [18]:
# Add the two trees
(cell_tree + cell_tree_2).pretty_print()
print(cell_tree)
print(cell_tree_2)

AllCells (counter:1000)
├── DN (counter:1)
├── DP (counter:974)
├── CD4-/CD8+ (counter:1005)
│   ├── naive (counter:6)
│   ├── Tcm (counter:1)
│   ├── Temra (counter:1)
│   └── Tem (counter:1)
└── CD4+/CD8- (counter:22)
    ├── naive (counter:21)
    ├── Tcm (counter:1)
    ├── Temra (counter:2)
    └── Tem (counter:1)
NBNode('/AllCells', counter=999, decision_name=None, decision_value=None)
NBNode('/AllCells', counter=1, decision_name=None, decision_value=None)


In [19]:
(cell_tree - cell_tree_2).pretty_print()

AllCells (counter:998)
├── DN (counter:-1)
├── DP (counter:972)
├── CD4-/CD8+ (counter:-995)
│   ├── naive (counter:4)
│   ├── Tcm (counter:-1)
│   ├── Temra (counter:-1)
│   └── Tem (counter:-1)
└── CD4+/CD8- (counter:20)
    ├── naive (counter:19)
    ├── Tcm (counter:-1)
    ├── Temra (counter:0)
    └── Tem (counter:-1)


In [20]:
(cell_tree * cell_tree_2).pretty_print()

AllCells (counter:999)
├── DN (counter:0)
├── DP (counter:973)
├── CD4-/CD8+ (counter:5000)
│   ├── naive (counter:5)
│   ├── Tcm (counter:0)
│   ├── Temra (counter:0)
│   └── Tem (counter:0)
└── CD4+/CD8- (counter:21)
    ├── naive (counter:20)
    ├── Tcm (counter:0)
    ├── Temra (counter:1)
    └── Tem (counter:0)


In [21]:

(cell_tree / cell_tree_2).pretty_print()

AllCells (counter:999.0)
├── DN (counter:0.0)
├── DP (counter:973.0)
├── CD4-/CD8+ (counter:0.005)
│   ├── naive (counter:5.0)
│   ├── Tcm (counter:0.0)
│   ├── Temra (counter:0.0)
│   └── Tem (counter:0.0)
└── CD4+/CD8- (counter:21.0)
    ├── naive (counter:20.0)
    ├── Tcm (counter:0.0)
    ├── Temra (counter:1.0)
    └── Tem (counter:0.0)


In [22]:

(cell_tree % cell_tree_2).pretty_print()

AllCells (counter:0)
├── DN (counter:0)
├── DP (counter:0)
├── CD4-/CD8+ (counter:5)
│   ├── naive (counter:0)
│   ├── Tcm (counter:0)
│   ├── Temra (counter:0)
│   └── Tem (counter:0)
└── CD4+/CD8- (counter:0)
    ├── naive (counter:0)
    ├── Tcm (counter:0)
    ├── Temra (counter:0)
    └── Tem (counter:0)
