# Details of the BioDendro pipeline

**This is a technical manual for each step in BioDendro and the information you can extract from it. Please use the quick start for an example of execution and data interrogation of results.**

## Running the pipeline manually

BioDendro essentially performs the following steps:

1. Parse the MGF and components files
   - Optionally scale and filter Ions from the MGF with a minimum intensity threshold.
2. Find Ions matching components
3. Bin mz values by proximity
4. Use binned mz values to cluster components into a tree

In [1]:
# Load required modules

import os
import copy
import plotly
from BioDendro import preprocess
from BioDendro import cluster

Loading the MGF file is relatively straightforward using Python.

The pipeline changes the MGF record titles in a way that won't always be portable.
If you encounter issues with record titles, you may wish to exclude the `split_msms_title` step.

In [2]:
# Load the MSMS records

with open("Fireflies_MSMS.mgf") as handle:
    mgf = preprocess.MGF.parse(handle)

mgf

MGF(records=[
MGFRecord(title='Ppyr_hemolymph_extract.173.173.1 File:"Ppyr_hemolymph_extract.raw", NativeID:"controllerType=0 controllerNumber=1 scan=173"', retention=58.5319032, pepmass=Ion(mz=118.08655413452, intensity=7337849.5), charge='1+', ions=[Ion(mz=55.05483998, intensity=1038985.0), Ion(mz=58.06570893, intensity=599806.5), Ion(mz=59.07383792, intensity=617962.375), Ion(mz=72.08117517, intensity=28507728.0), Ion(mz=118.0867653, intensity=13105900.0)]),
MGFRecord(title='Ppyr_hemolymph_extract.337.337.1 File:"Ppyr_hemolymph_extract.raw", NativeID:"controllerType=0 controllerNumber=1 scan=337"', retention=112.213572, pepmass=Ion(mz=120.081038697928, intensity=11558862.0), charge='1+', ions=[Ion(mz=56.05011642, intensity=144165.71875), Ion(mz=61.04036552, intensity=31103.697265625), Ion(mz=74.06097629, intensity=415685.0), Ion(mz=91.05498543, intensity=128501.1953125), Ion(mz=93.07017855, intensity=519245.96875), Ion(mz=102.0552512, intensity=153893.03125), Ion(mz=103.0548364, int

At this point we might decide to scale and/or filter the Ions for each MGFRecord.

In [3]:
with open("Fireflies_MSMS.mgf") as handle:
    mgf_scaled = preprocess.MGF.parse(handle, scaling=True, filtering=True, eps=0.2)
mgf_scaled

MGF(records=[
MGFRecord(title='Ppyr_hemolymph_extract.173.173.1 File:"Ppyr_hemolymph_extract.raw", NativeID:"controllerType=0 controllerNumber=1 scan=173"', retention=58.5319032, pepmass=Ion(mz=118.08655413452, intensity=7337849.5), charge='1+', ions=[Ion(mz=72.08117517, intensity=1.0), Ion(mz=118.0867653, intensity=0.4597314805304723)]),
MGFRecord(title='Ppyr_hemolymph_extract.337.337.1 File:"Ppyr_hemolymph_extract.raw", NativeID:"controllerType=0 controllerNumber=1 scan=337"', retention=112.213572, pepmass=Ion(mz=120.081038697928, intensity=11558862.0), charge='1+', ions=[Ion(mz=103.0548364, intensity=0.20429361231041457), Ion(mz=120.0804451, intensity=1.0)]),
MGFRecord(title='Ppyr_hemolymph_extract.673.673. File:"Ppyr_hemolymph_extract.raw", NativeID:"controllerType=0 controllerNumber=1 scan=673"', retention=223.69026, pepmass=Ion(mz=130.050003051758, intensity=4432127.0), charge='None', ions=[Ion(mz=84.04458615, intensity=1.0), Ion(mz=87.00419849, intensity=0.8858643370218536), Ion

You'll notice in the `ions` field of each MGF record that the intensity values have been scaled to numbers between 0 and 1, and that any ions with a scaled intensity below 0.2 have been filtered out.
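
The scaled record above is consistent with dividing each intensity by the largest intensity in the record and then dropping ions that fall below the threshold. The following is a minimal sketch of that idea using the five ions of the first unscaled record. The function name `scale_and_filter` is hypothetical, the choice of `>=` rather than `>` at the threshold is an assumption, and this is not BioDendro's own implementation.

```python
# Hypothetical sketch of max-scaling and threshold filtering; not BioDendro's own code.

def scale_and_filter(ions, eps=0.2):
    """Scale (mz, intensity) pairs by the largest intensity, then drop weak ions."""
    if not ions:
        return []
    max_intensity = max(intensity for _, intensity in ions)
    scaled = [(mz, intensity / max_intensity) for mz, intensity in ions]
    # Assumption: ions exactly at the threshold are kept.
    return [(mz, intensity) for mz, intensity in scaled if intensity >= eps]


# Ions copied from the first record of the unscaled MGF output.
ions = [
    (55.05483998, 1038985.0),
    (58.06570893, 599806.5),
    (59.07383792, 617962.375),
    (72.08117517, 28507728.0),
    (118.0867653, 13105900.0),
]

kept = scale_and_filter(ions, eps=0.2)
print(kept)
```

Only the ions at mz 72.08 and 118.09 survive, in line with the first scaled record shown earlier.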

Loading components files is a similar process.

In [4]:
# Load the list of components to compare to the MSMS file

with open("./Fireflies_component_list.txt") as handle:
    components = preprocess.SampleRecord.parse(handle)

components[:5]

[SampleRecord(mz=171.076431274414, retention=47.8025634, original='Ppyr_hemolymph_extract_171.076431274414_0.79670939'),
 SampleRecord(mz=143.081619262695, retention=73.88694600000001, original='Ppyr_hemolymph_extract_143.081619262695_1.2314491'),
 SampleRecord(mz=234.098587036132, retention=786.9699, original='Ppyr_hemolymph_extract_234.098587036132_13.116165'),
 SampleRecord(mz=234.098495483398, retention=538.098924, original='Ppyr_hemolymph_extract_234.098495483398_8.9683154'),
 SampleRecord(mz=313.150604248046, retention=49.100565, original='Ppyr_hemolymph_extract_313.150604248046_0.81834275')]

Next we find the ions for each component.

In [5]:
# Remove redundant records with mass and retention time tolerance 

df = preprocess.remove_redundancy(components, mgf, mz_tol=0.002, retention_tol=5, neutral=False)
df.head()

Unnamed: 0,component,sample,mz
0,Ppyr_hemolymph_extract_501.270843505859_3.6794211,"Ppyr_hemolymph_extract.653.653.1 File:""Ppyr_he...",50.002351
1,Ppyr_hemolymph_extract_185.023376464843_20.739846,"Ppyr_hemolymph_extract.3689.3689. File:""Ppyr_h...",50.006396
2,Ppyr_hemolymph_extract_491.227676391601_10.204906,"Ppyr_hemolymph_extract.1829.1829.1 File:""Ppyr_...",50.015094
3,Ppyr_hemolymph_extract_431.178741455078_25.372757,"Ppyr_hemolymph_extract.4529.4529.1 File:""Ppyr_...",50.24259
4,Ppyr_hemolymph_extract_332.279632568359_12.443386,"Ppyr_hemolymph_extract.2229.2229.1 File:""Ppyr_...",50.334601
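
The `mz_tol` and `retention_tol` arguments suggest that a component is paired with an MSMS record when both its mz and its retention time agree within the given tolerances. The sketch below illustrates that kind of matching on made-up values. The `matches` helper and the dictionary layout are hypothetical, the `neutral` option is not modelled, and BioDendro's actual matching logic may differ.

```python
# Hypothetical sketch of tolerance matching; not BioDendro's own code.

def matches(component, record, mz_tol=0.002, retention_tol=5):
    """Return True when both mz and retention time agree within tolerance."""
    close_mz = abs(component["mz"] - record["mz"]) <= mz_tol
    close_rt = abs(component["retention"] - record["retention"]) <= retention_tol
    return close_mz and close_rt


# Illustrative values only; these are not taken from the firefly data.
component = {"mz": 118.0866, "retention": 60.0}
records = [
    {"title": "scan_a", "mz": 118.08655, "retention": 58.5},   # within both tolerances
    {"title": "scan_b", "mz": 118.08655, "retention": 112.2},  # retention too far away
    {"title": "scan_c", "mz": 120.08104, "retention": 58.5},   # mz too far away
]

matched = [record["title"] for record in records if matches(component, record)]
print(matched)
```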


And finally, we can cluster the data using a `Tree` object.

The interface to `Tree` is similar to `scikit-learn` objects, where first you initialise a model with parameters, then you fit it with some data.

In [6]:
# Using the non-redundant dataframe, bin analytes on threshold=0.004 and return a data matrix
# Cluster the data matrix using clustering_method="braycurtis"
# Set threshold to color dendrogram and output clusters at cutoff=0.6

tree = cluster.Tree(threshold=0.004, clustering_method="braycurtis", cutoff=0.6)
tree.fit(df)
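
The names `"jaccard"` and `"braycurtis"` are also names of distance metrics in SciPy. Assuming `clustering_method` selects a metric of that kind (an assumption, not something confirmed here), the two metrics score the same pair of presence/absence vectors differently, which in turn changes where a given cutoff falls. A small sketch with made-up vectors:

```python
import numpy as np
from scipy.spatial.distance import braycurtis, jaccard

# Two made-up presence/absence vectors over four bins.
a = np.array([True, True, False, False])
b = np.array([True, True, True, False])

# Jaccard: 1 - (bins shared / bins present in either) = 1 - 2/3
d_jaccard = jaccard(a, b)

# Bray-Curtis: sum|a - b| / sum|a + b| = 1 / 5
d_braycurtis = braycurtis(a.astype(float), b.astype(float))

print("jaccard:", d_jaccard)
print("braycurtis:", d_braycurtis)
```

Because the distances are on different scales, a cutoff that suits one metric will not necessarily suit the other.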

The threshold is the parameter that affects binning.
Increasing the threshold will reduce the number of bins, decreasing the threshold will make bins more granular.
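
One simple way to bin values by proximity is to sort them and start a new bin whenever the gap to the previous value exceeds the threshold. The sketch below uses that rule on made-up mz values to show how the threshold changes the number of bins. The function `bin_by_proximity` is hypothetical and BioDendro's binning rule may differ in its details.

```python
# Hypothetical sketch of proximity binning; not BioDendro's own code.

def bin_by_proximity(values, threshold):
    """Group sorted values, starting a new bin when the gap exceeds the threshold."""
    ordered = sorted(values)
    if not ordered:
        return []
    bins = [[ordered[0]]]
    for value in ordered[1:]:
        if value - bins[-1][-1] <= threshold:
            bins[-1].append(value)   # close enough to the previous value
        else:
            bins.append([value])     # gap too large, open a new bin
    return bins


# Made-up mz values for illustration.
mz_values = [100.0224, 100.0244, 100.0751, 100.0766, 100.1127]

fine = bin_by_proximity(mz_values, threshold=0.004)
coarse = bin_by_proximity(mz_values, threshold=0.1)

print(len(fine), "bins at threshold 0.004")
print(len(coarse), "bins at threshold 0.1")
```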

The tree object now contains all of our intermediate data.

We can look at which samples have which bins in a [Pandas](https://pandas.pydata.org/) dataframe.
This is the matrix used to compute distances based on presence/absence.

In [7]:
# One hot encoded matrix output from mz binning.
tree.onehot_df.head()

bins,100.0237_100.0224_100.0244,100.0760_100.0751_100.0766,100.1127_100.1127_100.1127,100.8624_100.8624_100.8624,100.9321_100.9317_100.9326,100.9383_100.9383_100.9383,100.9522_100.9521_100.9523,101.0235_101.0235_101.0235,101.0597_101.0590_101.0605,101.0963_101.0963_101.0963,...,99.0805_99.0799_99.0808,99.0919_99.0918_99.0919,99.1171_99.1170_99.1171,99.9242_99.9242_99.9242,99.9485_99.9485_99.9485,99.9691_99.9691_99.9691,99.9874_99.9874_99.9874,990.1105_990.1105_990.1105,992.7504_992.7504_992.7504,992.7560_992.7560_992.7560
component,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Ppyr_hemolymph_extract_1061.80700683593_24.011575,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Ppyr_hemolymph_extract_1065.4677734375_15.366083,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Ppyr_hemolymph_extract_1078.8335571289_24.011575,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Ppyr_hemolymph_extract_1082.49493408203_21.307386,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Ppyr_hemolymph_extract_1082.49499511718_15.123002,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


NB. "One-hot" encoding is a common technique in ML/statistics where 1 (or True) denotes presence and 0 (or False) denotes absence.
It's related to the concept of dummy coding.
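
To illustrate the encoding itself, here is a minimal pandas sketch that turns a long table of (component, bin) observations into a boolean presence/absence matrix. The component and bin names are made up, and this is only one way to build such a matrix, not necessarily how BioDendro does it.

```python
import pandas as pd

# Made-up long-format table: one row per (component, bin) observation.
long_df = pd.DataFrame({
    "component": ["A", "A", "B", "C", "C"],
    "bin": ["bin_1", "bin_2", "bin_2", "bin_1", "bin_3"],
})

# Count co-occurrences, then convert the counts to presence/absence.
onehot = pd.crosstab(long_df["component"], long_df["bin"]) > 0
print(onehot)
```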

We can also look at the clusters derived from the computed tree.

In [8]:
# Clusters derived from the tree using cutoff=0.6
tree.clusters

array([126,   8, 126,   4,   7,   8,   3,   4,   3,   3, 132,   3,   7,
        11,   5,   5,  56,  79, 147, 125, 146, 149, 148, 211, 129, 261,
       261, 236, 181, 164, 164, 215, 216, 268, 269, 173, 178,  80, 183,
       228, 251, 196, 249, 182,  94, 173, 267, 267, 255,  65, 264, 220,
       260, 258, 258, 258, 233, 183, 266, 266, 253, 197, 262, 185, 150,
        70, 127, 127, 128,  97,  69, 229, 179, 248,  92, 263, 263, 263,
       263, 263, 265, 260, 259, 259, 259,  78,  69, 257, 257, 256, 225,
       188, 188, 188, 188, 188, 138,  96, 142, 244, 270, 206, 176,  78,
        74, 100, 161,  87, 212, 218,  82, 138, 242, 226,  95, 171, 231,
       140, 188,  22, 103, 240, 241, 254,  74,  88, 212, 246, 210, 169,
       169, 252,  68, 137,  19, 232,  49, 104,  35, 250, 160, 204, 271,
       238, 106, 146, 209, 218, 217, 247,  89, 194, 210, 174, 245, 272,
        32, 175, 243,  85, 146, 237, 237, 237, 165, 204, 201, 239,  87,
       221, 192, 212, 148, 210, 193,  52, 235, 235,  34, 188,  3

This is a [numpy](https://www.numpy.org/) array.
The order of these is the same as the row index in the one-hot dataframe.

So we can obtain clusters for each sample like so...

In [9]:
dict(zip(tree.onehot_df.index, tree.clusters))['Ppyr_hemolymph_extract_533.238464355468_15.101331']

7

In fact, we already store this dictionary in the tree object.

In [10]:
# Return the cluster number to which the queried analyte belongs
tree.cluster_map["Ppyr_hemolymph_extract_533.238464355468_15.101331"]

7

As with the Python pipeline in the quick-start notebook, we can easily choose a new cutoff for clustering.
In fact, the pipeline function returns a tree.

Note that `cut_tree` mutates the tree object, so if you want to retain the original clusters, take a copy of the tree first.

In [11]:
# Show the number of clusters before adjustment
print("BEFORE: Cutoff:", tree.cutoff, "n clusters:", len(set(tree.clusters)))

# Re-set a new cutoff for clusters
new_tree=copy.deepcopy(tree)
new_tree.cut_tree(cutoff=0.8)

# Show number of clusters after adjustment
print("AFTER: Cutoff:", new_tree.cutoff, "n clusters:", len(set(new_tree.clusters)))

BEFORE: Cutoff: 0.6 n clusters: 274
AFTER: Cutoff: 0.8 n clusters: 166
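
The drop in cluster count at a higher cutoff is what you would expect from cutting a hierarchical tree at a greater distance. The sketch below reproduces the effect on a tiny made-up presence/absence matrix using SciPy's `linkage` and `fcluster`. The complete-linkage method, the Jaccard metric and the use of `fcluster` are assumptions chosen for illustration, not a description of what `cut_tree` does internally.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Made-up presence/absence matrix: four components, four bins.
onehot = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=bool)

# Build the tree once, then cut it at different distances.
Z = linkage(onehot, method="complete", metric="jaccard")

n_clusters = {}
for cutoff in (0.4, 0.6):
    labels = fcluster(Z, t=cutoff, criterion="distance")
    n_clusters[cutoff] = len(set(labels))

print(n_clusters)
```

The tree is computed once and only the cut changes, so comparing cutoffs is cheap.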


The `tree` and `new_tree` objects can now be compared side by side as per the quick start example. For now we will continue with the `new_tree`.

To get the cluster numbers alongside the presence/absence data we can simply add a new column to the dataframe,
since it's already in the correct order.

In [12]:
# Make sure you take a copy to avoid editing the data.
oh = tree.onehot_df.copy()
oh["cluster"] = tree.clusters

# reorder columns
oh = oh[["cluster"] + [col for col in oh.columns if col != "cluster"]]

# Sort values by cluster
oh.sort_values(by="cluster", inplace=True)

oh.head()

bins,cluster,100.0237_100.0224_100.0244,100.0760_100.0751_100.0766,100.1127_100.1127_100.1127,100.8624_100.8624_100.8624,100.9321_100.9317_100.9326,100.9383_100.9383_100.9383,100.9522_100.9521_100.9523,101.0235_101.0235_101.0235,101.0597_101.0590_101.0605,...,99.0805_99.0799_99.0808,99.0919_99.0918_99.0919,99.1171_99.1170_99.1171,99.9242_99.9242_99.9242,99.9485_99.9485_99.9485,99.9691_99.9691_99.9691,99.9874_99.9874_99.9874,990.1105_990.1105_990.1105,992.7504_992.7504_992.7504,992.7560_992.7560_992.7560
component,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Ppyr_hemolymph_extract_436.269622802734_13.538295,1,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Ppyr_hemolymph_extract_453.296081542968_13.538295,1,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Ppyr_hemolymph_extract_452.264404296875_8.0908171,2,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Ppyr_hemolymph_extract_539.332946777343_15.255389,2,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Ppyr_hemolymph_extract_469.291122436523_8.0908171,2,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


We can quickly extract the subtable for clusters or samples using the `cluster_table` method.

In [13]:
# for visualising the ion table of your cluster of interest

# NB .head just shows the first few rows of the table, used here to keep things tidy.
# Remove .head() if you want to see the whole table.
new_tree.cluster_table(cluster=3).head()

bins,105.0701_105.0698_105.0707,107.0492_107.0489_107.0498,107.0860_107.0857_107.0862,1076.8733_1076.8733_1076.8733,117.0703_117.0701_117.0706,119.0859_119.0851_119.0864,121.0649_121.0644_121.0653,123.0440_123.0437_123.0448,129.0701_129.0701_129.0701,131.0858_131.0847_131.0865,...,81.0705_81.0699_81.0708,83.0496_83.0490_83.0498,89.0601_89.0593_89.0603,91.0556_91.0541_91.0584,92.7556_92.7526_92.7585,93.0703_93.0701_93.0708,95.0493_95.0490_95.0501,95.0861_95.0852_95.0864,960.8495_960.8495_960.8495,97.0652_97.0648_97.0659
component,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Ppyr_hemolymph_extract_1065.4677734375_15.366083,True,True,False,True,False,True,True,True,False,True,...,True,False,False,True,True,False,True,False,True,False
Ppyr_hemolymph_extract_1082.49499511718_15.123002,True,True,False,False,False,True,True,False,False,True,...,False,False,False,True,True,False,False,False,False,False
Ppyr_hemolymph_extract_1082.49499511718_15.366083,True,False,False,False,False,True,True,True,False,True,...,True,False,False,True,False,True,True,False,False,False
Ppyr_hemolymph_extract_1110.52667236328_17.431269,True,True,False,False,False,True,True,False,False,True,...,False,False,False,False,False,False,True,False,False,False
Ppyr_hemolymph_extract_1138.55786132812_19.784291,True,False,False,False,False,False,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False


In [14]:
new_tree.cluster_table(sample="Ppyr_hemolymph_extract_533.238464355468_15.101331").head()

bins,105.0701_105.0698_105.0707,107.0492_107.0489_107.0498,107.0860_107.0857_107.0862,1076.8733_1076.8733_1076.8733,117.0703_117.0701_117.0706,119.0859_119.0851_119.0864,121.0649_121.0644_121.0653,123.0440_123.0437_123.0448,129.0701_129.0701_129.0701,131.0858_131.0847_131.0865,...,81.0705_81.0699_81.0708,83.0496_83.0490_83.0498,89.0601_89.0593_89.0603,91.0556_91.0541_91.0584,92.7556_92.7526_92.7585,93.0703_93.0701_93.0708,95.0493_95.0490_95.0501,95.0861_95.0852_95.0864,960.8495_960.8495_960.8495,97.0652_97.0648_97.0659
component,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Ppyr_hemolymph_extract_1065.4677734375_15.366083,True,True,False,True,False,True,True,True,False,True,...,True,False,False,True,True,False,True,False,True,False
Ppyr_hemolymph_extract_1082.49499511718_15.123002,True,True,False,False,False,True,True,False,False,True,...,False,False,False,True,True,False,False,False,False,False
Ppyr_hemolymph_extract_1082.49499511718_15.366083,True,False,False,False,False,True,True,True,False,True,...,True,False,False,True,False,True,True,False,False,False
Ppyr_hemolymph_extract_1110.52667236328_17.431269,True,True,False,False,False,True,True,False,False,True,...,False,False,False,False,False,False,True,False,False,False
Ppyr_hemolymph_extract_1138.55786132812_19.784291,True,False,False,False,False,False,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False


We can write out the results of the clustering as we saw in the quick-start notebook.
Note that the `path` will be a directory where multiple files are written, and it must already exist.

Also, any results already in the directory will be overwritten.
The default path will use a combination of the date and time to avoid overwriting data.
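
If you prefer to choose the directory yourself while still avoiding accidental overwrites, a timestamped name is one option. The sketch below is a hypothetical naming scheme (BioDendro's default directory name may look different) and creates the directory inside a temporary location so that it is safe to run anywhere.

```python
import os
import tempfile
from datetime import datetime

# Hypothetical naming scheme; BioDendro's default directory name may differ.
stamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Scratch location for this demo only; use your own project folder in practice.
base = tempfile.mkdtemp()
results_dir = os.path.join(base, f"results_{stamp}")

# exist_ok=False raises FileExistsError instead of silently reusing a directory.
os.makedirs(results_dir, exist_ok=False)
print(results_dir)
```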

In [15]:
# Create the directory to store results.
# Note that this will fail if the directory already exists.
# You can change exist_ok to True, but the files from old results will be overwritten or remain unchanged.
os.makedirs("results-cutoff-08", exist_ok=False)

# Generate the plots of clusters.
new_tree.write_summaries(path="results-cutoff-08")

To remove a directory and its files, you can run:

```python
import shutil
shutil.rmtree("results-cutoff-08")
```

To list the created files...

In [16]:
os.listdir("results-cutoff-08")[:10]

['clusters.xlsx',
 'cluster_100_3.png',
 'cluster_100_3.xlsx',
 'cluster_101_2.png',
 'cluster_101_2.xlsx',
 'cluster_102_2.png',
 'cluster_102_2.xlsx',
 'cluster_103_2.png',
 'cluster_103_2.xlsx',
 'cluster_104_1.png']

We can inspect the hierarchical clustering linkage matrix output from SciPy.
Unfortunately, it is not very intuitive to read.
Each row describes one merge: the first two columns refer to the two nodes in the tree that are joined.
The third column gives the distance between the two nodes.
And the fourth column gives the number of leaves that are in or beneath the node formed by the merge.

In [17]:
# Linkage tree from scipy.
new_tree.tree[:6]

array([[5.80000000e+01, 5.90000000e+01, 0.00000000e+00, 2.00000000e+00],
       [2.60000000e+02, 2.61000000e+02, 4.00000000e-02, 2.00000000e+00],
       [2.84000000e+02, 2.85000000e+02, 5.26315789e-02, 2.00000000e+00],
       [4.71000000e+02, 4.72000000e+02, 7.69230769e-02, 2.00000000e+00],
       [4.46000000e+02, 4.57000000e+02, 7.69230769e-02, 2.00000000e+00],
       [7.70000000e+01, 7.80000000e+01, 8.10810811e-02, 2.00000000e+00]])
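
To make the columns more concrete, the sketch below builds a linkage matrix for a tiny made-up presence/absence matrix and prints each row as a sentence. In SciPy's format, node indices below the number of samples are the original leaves, and the merge recorded in row `i` creates a new node with index `n_samples + i`. The complete-linkage method and Jaccard metric are assumptions chosen for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Made-up presence/absence matrix: four components, four bins.
onehot = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=bool)

n_samples = len(onehot)
Z = linkage(onehot, method="complete", metric="jaccard")

# Each row of Z is one merge: (node, node, distance, leaves in the new node).
for i, (left, right, distance, n_leaves) in enumerate(Z):
    print(
        f"step {i}: merge nodes {int(left)} and {int(right)} "
        f"at distance {distance:.3f} -> node {n_samples + i} "
        f"with {int(n_leaves)} leaves"
    )
```

In the final row, both node indices are at least `n_samples`, meaning that two previously merged clusters are being joined rather than two original samples.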

A plot of the tree is probably more useful to you.
You can write an interactive tree to a file using the `plot` method on the `Tree` object.

In [18]:
# To write out the tree.
# NB the results folder already exists from the previous step where we wrote summaries.
new_tree.plot(filename="results-cutoff-08/simple_dendrogram.html", width=1000, height=1000);

Note that the width and height values are in pixels.
If your sample names are fairly long like ours are, you might like to fiddle with the height.
You can navigate to the .html file and open it in your favourite browser.

Since you're already using a Jupyter notebook you might like to plot the tree inline with your other analysis.
We can do that using some plotly functionality.

The `plot` method returns a Plotly `Figure` object which you can use with normal plotly functions.
So we can use the `notebook_mode` to show it inline.

In [19]:
# Create the figure without writing it to a file
iplot = new_tree.plot(width=1000, height=1000)

# for visualising plot inline
# You only need to set connected to True once per notebook,
# but doing it multiple times won't hurt.
plotly.offline.init_notebook_mode(connected=True)

plotly.offline.iplot(iplot)

Rolling over the leaf nodes with your mouse you'll be able to see which cluster each component belongs to, so you can then track it down in your results folder.

The `plot` method is just a wrapper around the `dendrogram` function in `BioDendro.plot.dendrogram`.
If you want to customise your plot e.g. with custom titles, you can do that there.

The `dendrogram` function is a modified version of Plotly's [`figure_factory.create_dendrogram`](https://plot.ly/python/dendrogram/) function.
It is modified to reduce complexity, to handle text-rollover and to allow us to provide precomputed trees rather than having to recompute the tree each time (possibly yielding different trees and different clusters).

In [21]:
from BioDendro.plot import dendrogram

iplot = dendrogram(tree, width=800, height=1000, title="TreesRCool", xlabel="Samples", ylabel="Distance")
plotly.offline.iplot(iplot)

There are some features that we're working on.

In `dendrogram`, the `hovertext` parameter allows you to specify arbitrary text to show when you hover over the data.
However, the current tree drawing algorithm reorders the internal nodes, and we can't find the correct order for internal node data from the output of that function.
For now, know that it exists, and if you figure it out, please let us know!