# BioDendro Quick Start Example

The BioDendro pipeline automates the binning and hierarchical clustering of
MS/MS components and spectra.

This notebook shows basic use of the pipeline via the:
- [Python API](#Python-API)
- [Command line API](#Command-line-API)

\* API = Application programming interface

## Python API

The Python interface allows you to make small changes easily and view intermediate results.
A more detailed explanation of the Python API/pipeline is available in the `longer_example.ipynb` notebook in the BioDendro repository.

In [1]:
# Load modules

import os
import plotly
import BioDendro

The main `pipeline` function runs the full pipeline (i.e. reading files, clustering, and plotting).
At a minimum the function needs an MGF file and a component list.
Note that by default, the results are saved to a folder in your current working directory named `results_<datetime>`, where `<datetime>` is the current date and time in `YYYYMMDDhhmmss` format (e.g. `results_20190417165427`).
This avoids overwriting data across multiple runs.
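
As a rough illustration of this naming scheme, the sketch below builds a directory name of the same shape as the ones printed in the run logs in this notebook. It is not BioDendro's own code: the function `timestamped_results_dir` and its parameters are hypothetical, and the format string is inferred from the directory names in the logs.

```python
from datetime import datetime


def timestamped_results_dir(prefix="results", now=None):
    """Return a name such as 'results_20190417165427' (hypothetical helper)."""
    # `now` can be passed in so the result is reproducible; by default the
    # current local date and time are used.
    now = now or datetime.now()
    # Year, month, day, hour, minute, second with no separators.
    return f"{prefix}_{now.strftime('%Y%m%d%H%M%S')}"


print(timestamped_results_dir(now=datetime(2019, 4, 17, 16, 54, 27)))
# prints: results_20190417165427
```

Because the name includes the time down to the second, two runs started at different moments end up in different folders.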

To get a list of possible parameters and defaults you can run `help(BioDendro.pipeline)`.

In [2]:
help(BioDendro.pipeline)

Help on function pipeline in module BioDendro:

pipeline(mgf_path, components_path, neutral=False, cutoff=0.6, bin_threshold=0.0008, clustering_method='jaccard', processed='processed.xlsx', results_dir=None, out_html='simple_dendrogram.html', width=900, height=800, quiet=False, scaling=False, filtering=False, eps=0.6, mz_tol=0.002, retention_tol=5, **kwargs)
    Runs the default BioDendro pipeline.



In [3]:
# Run the complete BioDendro pipeline

tree = BioDendro.pipeline("MSMS.mgf", "component_list.txt", clustering_method="braycurtis")

Running BioDendro v0.0.1

- input mgf file = MSMS.mgf
- input components file = component_list.txt
- neutral = False
- cutoff = 0.6
- bin_threshold = 0.0008
- clustering_method = braycurtis
- output processed file = results_20190417165427/processed.xlsx
- output results directory = results_20190417165427
- output html dendrogram = results_20190417165427/simple_dendrogram.html
- dendrogram figure width = 900
- dendrogram figure height = 800
- scaling = False
- filtering = False
- eps = 0.6
- mz_tolerance = 0.002
- retention_tolerance = 5


Processing inputs
Binning and clustering
This may take some time...
Writing per-cluster summaries
Writing output html dendrogram
Finished


The pipeline function prints out running information for you, including the directory where you can find your results.
Note that the dendrograms and processed files are also saved into this directory, even if you don't specify that in the filename.
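
The run above used `clustering_method="braycurtis"` instead of the default `"jaccard"`. Assuming these names refer to the standard Jaccard and Bray-Curtis distances, the sketch below shows how the two differ on a pair of toy binned spectra. The vectors and function names are made up for illustration, and this is not how BioDendro computes distances internally.

```python
def jaccard_distance(u, v):
    """Jaccard distance, treating any non-zero value as 'present'."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    # Two all-zero vectors are treated as identical.
    return 0.0 if either == 0 else 1.0 - both / either


def braycurtis_distance(u, v):
    """Bray-Curtis distance: sum|u - v| / sum|u + v|."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(abs(a + b) for a, b in zip(u, v))
    return 0.0 if den == 0 else num / den


# Hypothetical intensities in five m/z bins for two spectra.
spectrum_a = [10, 5, 0, 2, 0]
spectrum_b = [8, 0, 0, 2, 4]

print(jaccard_distance(spectrum_a, spectrum_b))               # prints: 0.5
print(round(braycurtis_distance(spectrum_a, spectrum_b), 3))  # prints: 0.355
```

Jaccard only looks at which bins are occupied, while Bray-Curtis also takes the magnitude of the values into account, so the two metrics can rank the same pair of spectra differently.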

The pipeline also returns a `Tree` object, which stores most of the results.
We can change the number of clusters by specifying a new distance threshold.

In [4]:
# Show the number of clusters before adjustment
print("BEFORE: Cutoff:", tree.cutoff, "n clusters:", len(set(tree.clusters)))

# Re-set a new cutoff for clusters
tree.cut_tree(cutoff=0.8)

# Show number of clusters after adjustment
print("AFTER: Cutoff:", tree.cutoff, "n clusters:", len(set(tree.clusters)))

BEFORE: Cutoff: 0.6 n clusters: 79
AFTER: Cutoff: 0.8 n clusters: 46


So our cluster results are updated to use a higher threshold (i.e. fewer clusters, with more members).
Note that these updated results are not automatically saved to the results folder.
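
To see why a higher cutoff gives fewer, larger clusters, here is a minimal sketch on toy one-dimensional data. It groups points whose neighbouring gaps are at most `cutoff`, which matches cutting a single-linkage dendrogram at that height. The linkage used by BioDendro is not stated in this notebook, so single linkage is an assumption chosen only because it is easy to check by hand; `clusters_at_cutoff` and the data are hypothetical.

```python
def clusters_at_cutoff(points, cutoff):
    """Label 1-D points; points chained by gaps <= cutoff share a label."""
    # Visit the points in sorted order and start a new cluster whenever
    # the gap to the previous point exceeds the cutoff.
    order = sorted(range(len(points)), key=lambda i: points[i])
    labels = [0] * len(points)
    current = 0
    for prev, nxt in zip(order, order[1:]):
        if points[nxt] - points[prev] > cutoff:
            current += 1
        labels[nxt] = current
    return labels


points = [0.0, 0.5, 1.2, 1.9, 3.0]

for cutoff in (0.6, 0.8):
    labels = clusters_at_cutoff(points, cutoff)
    print("cutoff:", cutoff, "n clusters:", len(set(labels)))
# prints: cutoff: 0.6 n clusters: 4
# prints: cutoff: 0.8 n clusters: 2
```

Raising the cutoff lets larger gaps be bridged, so neighbouring groups merge and the cluster count drops, which is the same direction of change seen in the `tree.cut_tree` output above.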

To save these updated results, we need to create a new folder and write the summaries ourselves.
Previously the pipeline function handled this all for us.

NOTE: this will overwrite any contents in the specified directory so be careful!

In [5]:
# Generate the output plots and tables for the new clusters.
# exist_ok tells Python not to raise an error if the directory already exists.
os.makedirs("results", exist_ok=True)
tree.write_summaries(path="results")

We can view the interactive tree directly in an IPython notebook using the plotly `notebook_mode`.

In [6]:
# View the new dendrogram cutoff inline

plotly.offline.init_notebook_mode(connected=True) # for visualising plot inline
iplot = tree.plot(width=900, height=800)
plotly.offline.iplot(iplot)

## Command line API

The pipeline can also be run from a bash or bash-like terminal.
This is useful if you're not planning on tweaking the parameters much and just want to run the darn thing.

For these examples, we're using the IPython magic command `%%bash` to run the commands in bash.
You can omit the `%%bash` line if you're running the commands directly in a terminal.

To get a list of all options available use the `--help` (or `-h`) flag.

In [7]:
%%bash
BioDendro --help

usage: BioDendro [-h] [-n] [-c CUTOFF] [-b BIN_THRESHOLD]
                 [-d {jaccard,braycurtis}] [-p PROCESSED] [-o OUT_HTML]
                 [-r RESULTS_DIR] [-x WIDTH_PX] [-y HEIGHT_PX] [-q] [-s] [-f]
                 [-e EPS] [-mz_tol MZ_TOL] [-retention_tol RETENTION_TOL]
                 mgf components

Run the BioDendro pipeline.

positional arguments:
  mgf                   MGF input file.
  components            Listed components file.

optional arguments:
  -h, --help            show this help message and exit
  -n, --neutral         Apply neutral loss.
  -c CUTOFF, --cutoff CUTOFF
                        Distance threshold for selecting clusters from tree.
  -b BIN_THRESHOLD, --bin-threshold BIN_THRESHOLD
                        Threshold for binning m/z values prior to clustering.
  -d {jaccard,braycurtis}, --cluster-method {jaccard,braycurtis}
                        The distance metric used during tree construction.
  -p PROCESSED, --processed PROCESSED
             

The minimum options to run the pipeline are the MGF file and a components list.

Using the example data in the BioDendro repo we could run...

In [8]:
%%bash

BioDendro MSMS.mgf component_list.txt

Running BioDendro v0.0.1

- input mgf file = MSMS.mgf
- input components file = component_list.txt
- neutral = False
- cutoff = 0.6
- bin_threshold = 0.0008
- clustering_method = jaccard
- output processed file = results_20190417165621/processed.xlsx
- output results directory = results_20190417165621
- output html dendrogram = results_20190417165621/simple_dendrogram.html
- dendrogram figure width = 900
- dendrogram figure height = 800
- scaling = False
- filtering = False
- eps = 0.6
- mz_tolerance = 0.002
- retention_tolerance = 5


Processing inputs
Binning and clustering
This may take some time...
Writing per-cluster summaries
Writing output html dendrogram
Finished


As before, the results will be stored in a directory with the current date and the current time added to the end of it.

You can change the parameters by supplying additional flags. However, this runs the whole pipeline again, so if you just need to adjust the cutoff or switch from Jaccard to Bray-Curtis distances, you might be better off using the Python API.

In [9]:
%%bash

BioDendro --scaling --cluster-method braycurtis --cutoff 0.5 MSMS.mgf component_list.txt

Running BioDendro v0.0.1

- input mgf file = MSMS.mgf
- input components file = component_list.txt
- neutral = False
- cutoff = 0.5
- bin_threshold = 0.0008
- clustering_method = braycurtis
- output processed file = results_20190417165736/processed.xlsx
- output results directory = results_20190417165736
- output html dendrogram = results_20190417165736/simple_dendrogram.html
- dendrogram figure width = 900
- dendrogram figure height = 800
- scaling = True
- filtering = False
- eps = 0.6
- mz_tolerance = 0.002
- retention_tolerance = 5


Processing inputs
Binning and clustering
This may take some time...
Writing per-cluster summaries
Writing output html dendrogram
Finished


The command above is equivalent to running the following in Python:

In [10]:
tree = BioDendro.pipeline("MSMS.mgf", "component_list.txt", clustering_method="braycurtis", scaling=True, cutoff=0.5)

Running BioDendro v0.0.1

- input mgf file = MSMS.mgf
- input components file = component_list.txt
- neutral = False
- cutoff = 0.5
- bin_threshold = 0.0008
- clustering_method = braycurtis
- output processed file = results_20190417165856/processed.xlsx
- output results directory = results_20190417165856
- output html dendrogram = results_20190417165856/simple_dendrogram.html
- dendrogram figure width = 900
- dendrogram figure height = 800
- scaling = True
- filtering = False
- eps = 0.6
- mz_tolerance = 0.002
- retention_tolerance = 5


Processing inputs
Binning and clustering
This may take some time...
Writing per-cluster summaries
Writing output html dendrogram
Finished
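
The correspondence between command-line flags and Python keyword arguments can be sketched with `argparse`. This is an illustration only, not BioDendro's actual parser: it covers just the flags that appear in this notebook, takes its defaults from the `help(BioDendro.pipeline)` signature, and the names `build_parser` and `to_pipeline_args` are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for a subset of the flags shown in `BioDendro --help`."""
    parser = argparse.ArgumentParser(prog="BioDendro-sketch")
    parser.add_argument("mgf", help="MGF input file.")
    parser.add_argument("components", help="Listed components file.")
    parser.add_argument("-n", "--neutral", action="store_true")
    parser.add_argument("-c", "--cutoff", type=float, default=0.6)
    parser.add_argument("-b", "--bin-threshold", type=float, default=0.0008)
    # The flag is --cluster-method, but the Python keyword is clustering_method.
    parser.add_argument("-d", "--cluster-method", dest="clustering_method",
                        choices=["jaccard", "braycurtis"], default="jaccard")
    parser.add_argument("--scaling", action="store_true")
    return parser


def to_pipeline_args(argv):
    """Split parsed options into positional and keyword arguments."""
    options = vars(build_parser().parse_args(argv))
    positional = (options.pop("mgf"), options.pop("components"))
    return positional, options


positional, kwargs = to_pipeline_args(
    ["--scaling", "--cluster-method", "braycurtis", "--cutoff", "0.5",
     "MSMS.mgf", "component_list.txt"]
)
print(positional)
print(kwargs)
# The result could then be passed on as BioDendro.pipeline(*positional, **kwargs).
```

Flags that are not given on the command line keep their defaults, which is why the run logs above list every parameter even when only a few were set.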
