Skip to content

Commit

Permalink
Merge 0d5fbb3 into f608ea6
Browse files Browse the repository at this point in the history
  • Loading branch information
anupriyatripathi committed Apr 16, 2019
2 parents f608ea6 + 0d5fbb3 commit d0ce5eb
Show file tree
Hide file tree
Showing 69 changed files with 88,155 additions and 126 deletions.
36 changes: 27 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ A tool to build a tree of MS1 features to compare chemical composition of sample

## Installation

Once QIIME 2 is [installed](https://docs.qiime2.org/2018.11/install/), activate your QIIME 2 environment and install q2-chemistree following the steps below:
Once QIIME 2 is [installed](https://docs.qiime2.org/2019.1/install/), activate your QIIME 2 environment and install q2-chemistree following the steps below:

```bash
git clone https://github.com/biocore/q2-chemistree.git
Expand Down Expand Up @@ -102,15 +102,33 @@ Now, we use these predicted molecular substructures to generate a hierarchy of m

```bash
qiime chemistree make-hierarchy \
--i-csi-result fingerprints.qza \
--i-feature-table feature-table.qza \
--o-tree demo-chemisTree.qza \
--o-matched-feature-table filtered-feature-table.qza
--i-csi-results fingerprints.qza \
--i-feature-tables feature-table.qza \
--o-tree demo-chemistree.qza \
--o-merged-feature-table filtered-feature-table.qza
--o-merged-feature-data feature-data.qza
```

This method performs two tasks:
1. Generates a tree relating the MS1 features in these data based on molecular substructures predicted for MS1 features. This is of type `Phylogeny[Rooted]`. By default, we only use PubChem fingerprints (total 489 molecular properties). Adding `--p-no-qc-properties` retains all (2936) the molecular properties in the contingency table.
To support meta-analyses, this method is capable of handling one or more datasets i.e pairs of CSI results and feature tables. Below is an example for two datasets:

```bash
qiime chemistree make-hierarchy \
--i-csi-results fingerprints.qza \
--i-csi-results fingerprints2.qza \
--i-feature-tables feature-table.qza \
--i-feature-tables feature-table2.qza
--o-tree merged-chemistree.qza \
--o-merged-feature-table merged-feature-table.qza \
--o-merged-feature-data merged-feature-data.qza
```

**Note:** The input CSI results and feature tables should have a one-to-one correspondance i.e csi results and feature tables from all datasets should be provided in the same order.

This method generates the following:
1. A combined feature table by merging all the input feature tables; MS1 features without fingerprints are filtered out of this feature table. This is done because SIRIUS predicts molecular substructures for a subset of features (typically for 70-90% of all MS1 features) in an experiment (based on factors such as sample type, the quality MS2 spectra, and user-defined tolerances such as `--p-ppm-max`, `--p-zodiac-threshold`). This output is of type `FeatureTable[Frequency]`.
2. A tree relating the MS1 features in these data based on molecular substructures predicted for MS1 features. This is of type `Phylogeny[Rooted]`. By default, we only use PubChem fingerprints (total 489 molecular properties). Adding `--p-no-qc-properties` retains all (2936) the molecular properties in the contingency table.
**Note**: The latest release of [SIRIUS](https://www.nature.com/articles/s41592-019-0344-8) uses PubChem version downloaded on 13 August 2017.
2. Filters the MS1 features without fingerprints from the feature table such that the feature IDs in the feature table and the tree match. This is done because SIRIUS predicts molecular substructures for a subset of features (typically for 70-90% of all MS1 features) in an experiment (based on factors such as sample type, the quality MS2 spectra, and user-defined tolerances such as `--p-ppm-max`, `--p-zodiac-threshold`). The resulting feature table is also of type `FeatureTable[Frequency]`.
3. A combined feature data file that contains unique identifiers of each feature, their corresponding original feature identifier, and feature tables that each feature was detected in. This is of type `FeatureData[Molecules]`. (The renaming of features needs to be done to avoid overlapping, non-unique feature identifiers in the original feature table)


Thus, using these steps, we can generate a tree (`demo-chemisTree.qza`) relating MS1 features in a mass-spectrometry dataset along with a matched feature table (`filtered-feature-table.qza`). These can be used as inputs to perform chemical phylogeny-based [alpha-diversity](https://docs.qiime2.org/2018.11/plugins/available/diversity/alpha-phylogenetic/) and [beta-diversity](https://docs.qiime2.org/2018.11/plugins/available/diversity/beta-phylogenetic/) analyses.
Thus, using these steps, we can generate a tree relating MS1 features in a mass-spectrometry dataset along with a matched feature table. These can be used as inputs to perform chemical phylogeny-based [alpha-diversity](https://docs.qiime2.org/2019.1/plugins/available/diversity/alpha-phylogenetic/) and [beta-diversity](https://docs.qiime2.org/2019.1/plugins/available/diversity/beta-phylogenetic/) analyses.
69 changes: 54 additions & 15 deletions q2_chemistree/_hierarchy.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage
from skbio import TreeNode
from q2_feature_table import merge

from ._collate_fingerprint import collate_fingerprint
from ._match import match_label
Expand All @@ -32,9 +33,33 @@ def build_tree(relabeled_fingerprints: pd.DataFrame) -> TreeNode:
return tree


def make_hierarchy(csi_result: CSIDirFmt,
feature_table: biom.Table,
qc_properties: bool = True) -> (TreeNode, biom.Table):
def merge_feature_data(fdata: pd.DataFrame):
'''
This function merges feature data from multiple feature tables. The
resulting table is indexed by MD5 hash mapped to unique feature
identifiers in the original feature tables.
'''
for idx, data in enumerate(fdata):
data['table_number'] = str(idx+1)
merged_fdata = pd.concat(fdata)
dupl_bool = merged_fdata.index.duplicated(keep='first')
duplicates = merged_fdata[dupl_bool].index.unique()
if len(duplicates) == 0:
return merged_fdata
else:
for idx in duplicates:
merged_fdata.loc[idx, '#featureID'] = (',').join(
list(merged_fdata.loc[idx, '#featureID']))
merged_fdata.loc[idx, 'table_number'] = (',').join(
list(merged_fdata.loc[idx, 'table_number']))
merged_fdata = merged_fdata[~dupl_bool]
return merged_fdata


def make_hierarchy(csi_results: CSIDirFmt,
feature_tables: biom.Table,
qc_properties: bool = True) -> (TreeNode, biom.Table,
pd.DataFrame):
'''
This function generates a hierarchy of mass-spec features based on
predicted chemical fingerprints. It filters the feature table to
Expand All @@ -43,10 +68,10 @@ def make_hierarchy(csi_result: CSIDirFmt,
Parameters
----------
csi_result : CSIDirFmt
CSI:FingerID output folder
csi_results : CSIDirFmt
one or more CSI:FingerID output folder
feature_table : biom.Table
feature table with mass-spec feature intensity per sample
one or more feature tables with mass-spec feature intensity per sample
qc_properties : bool, default True
flag to filter molecular properties to keep only PUBCHEM fingerprints
Expand All @@ -64,15 +89,29 @@ def make_hierarchy(csi_result: CSIDirFmt,
skbio.TreeNode
a tree of relatedness of molecules
biom.Table
filtered feature table that contains only the features present in
the tree
merged feature table that is filtered to contain only the
features present in the tree
pd.DataFrame
merged feature data
'''

if feature_table.is_empty():
raise ValueError("Cannot have empty feature table")
fingerprints = collate_fingerprint(csi_result, qc_properties)
relabeled_fingerprints, matched_feature_table = match_label(fingerprints,
feature_table)
tree = build_tree(relabeled_fingerprints)
fps, fts, fdata = [], [], []
if len(feature_tables) != len(csi_results):
raise ValueError("The feature tables and CSI results should have a "
"one-to-one correspondance.")
for feature_table, csi_result in zip(feature_tables, csi_results):
if feature_table.is_empty():
raise ValueError("Cannot have empty feature table")
fingerprints = collate_fingerprint(csi_result, qc_properties)
relabeled_fp, matched_ft, feature_data = match_label(fingerprints,
feature_table)
fps.append(relabeled_fp)
fts.append(matched_ft)
fdata.append(feature_data)
merged_fdata = merge_feature_data(fdata)
merged_fps = pd.concat(fps)
merged_fps = merged_fps[~merged_fps.index.duplicated(keep='first')]
merged_fts = merge(fts, overlap_method='error_on_overlapping_sample')
tree = build_tree(merged_fps)

return tree, matched_feature_table
return tree, merged_fts, merged_fdata
49 changes: 42 additions & 7 deletions q2_chemistree/_match.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,24 +17,59 @@ def match_label(collated_fingerprints: pd.DataFrame,
This function filters the feature table to retain only features with
fingerprints. It also relabels features with MD5 hash of its
binary fingerprint vector.
Parameters
----------
collated_fingerprints : pd.DataFrame
table containing mass-spec molecular substructures (columns) for each
mass-spec feature (index)
feature_table : biom.Table
feature tables with mass-spec feature intensity per sample.
Raises
------
ValueError
If features in collated fingerprint table are not a subset of
features in ``feature_table``
Returns
-------
pd.DataFrame
fingerprint table with features relabeled with MD5 hash of
its binary fingerprint vector
biom.Table
feature table that is filtered to contain only the
features with predicted fingerprints. Features are labeled by MD5 hash
of its binary fingerprint vector
pd.DataFrame
table that maps MD5 hash of a feature to the original feature ID in
the input feature table
'''

fps = collated_fingerprints.copy()
allfps = set(fps.index)
allfps = list(fps.index)
if fps.empty:
raise ValueError("Cannot have empty fingerprint table")
table = feature_table.to_dataframe(dense=True)
allfeatrs = set(table.index)
if not allfps.issubset(allfeatrs):
extra_tips = allfps-allfps.intersection(allfeatrs)
if not set(allfps).issubset(allfeatrs):
extra_tips = set(allfps) - set(allfps).intersection(allfeatrs)
raise ValueError('The following tips were not '
'found in the feature table:\n' +
', '.join([str(i) for i in extra_tips]))
filtered_table = table.reindex(allfps)
fps = fps.apply(pd.to_numeric)
fps = (fps > 0.5).astype(int)
for fid in fps.index:
list_md5 = []
for fid in allfps:
md5 = str(hashlib.sha256(fps.loc[fid].values.tobytes()).hexdigest())
fps.loc[fid, 'label'] = md5
filtered_table.loc[fid, 'label'] = md5
list_md5.append(md5)
fps['label'] = list_md5
filtered_table['label'] = list_md5
feature_data = pd.DataFrame(columns=['label', '#featureID'])
feature_data['label'] = list_md5
feature_data['#featureID'] = allfps
feature_data.set_index('label', inplace=True)
relabel_fps = fps.groupby('label').first()
matched_table = filtered_table.groupby('label').sum()
# biom requires that ids be strings
Expand All @@ -43,4 +78,4 @@ def match_label(collated_fingerprints: pd.DataFrame,
data=npfeatures, observation_ids=matched_table.index.astype(str),
sample_ids=matched_table.columns.astype(str))

return relabel_fps, matched_table
return relabel_fps, matched_table, feature_data
14 changes: 14 additions & 0 deletions q2_chemistree/_semantics.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@

import qiime2.plugin.model as model
from qiime2.plugin import SemanticType
from q2_types.feature_data import FeatureData
import os


Expand All @@ -21,6 +22,19 @@ def sniff(self):
MassSpectrometryFeatures = SemanticType('MassSpectrometryFeatures')


# TODO: define this class
# TSVMoleculesFormat (mimic MGFFile)
class TSVMolecules(model.TextFileFormat):
def sniff(self):
return True


TSVMoleculesFormat = model.SingleFileDirectoryFormat('TSVMoleculesFormat',
'feature_data.tsv',
TSVMolecules)
Molecules = SemanticType('Molecules', variant_of=FeatureData.field['type'])


class OutputDirs(model.DirectoryFormat):

def get_folder_name(self):
Expand Down
27 changes: 27 additions & 0 deletions q2_chemistree/_transformer.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
from .plugin_setup import plugin
from ._semantics import TSVMolecules
import pandas as pd


def _read_dataframe(fh):
# Using `dtype=object` and `set_index` to avoid type casting/inference
# of any columns or the index.
df = pd.read_csv(fh, sep='\t', header=0, dtype='str')
df.set_index(df.columns[0], drop=True, append=False, inplace=True)
df.index.name = 'id'
return df

# define a transformer from pd.DataFrame to -> TSVMolecules
@plugin.register_transformer
def _1(data: pd.DataFrame) -> TSVMolecules:
ff = TSVMolecules()
with ff.open() as fh:
data.to_csv(fh, sep='\t', header=True)
return ff

# define a transformer from TSVMolecules -> pd.DataFrame
@plugin.register_transformer
def _2(ff: TSVMolecules) -> pd.DataFrame:
with ff.open() as fh:
df = _read_dataframe(fh)
return df
41 changes: 28 additions & 13 deletions q2_chemistree/plugin_setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,16 +5,18 @@
#
# The full license is in the file LICENSE, distributed with this software.
# ----------------------------------------------------------------------------
import importlib
from ._fingerprint import (compute_fragmentation_trees,
rerank_molecular_formulas,
predict_fingerprints)
from ._hierarchy import make_hierarchy
from ._semantics import (MassSpectrometryFeatures, MGFDirFmt,
SiriusFolder, SiriusDirFmt,
ZodiacFolder, ZodiacDirFmt,
CSIFolder, CSIDirFmt)
CSIFolder, CSIDirFmt,
FeatureData, TSVMoleculesFormat, Molecules)

from qiime2.plugin import Plugin, Str, Range, Choices, Float, Int, Bool
from qiime2.plugin import Plugin, Str, Range, Choices, Float, Int, Bool, List
from q2_types.feature_table import FeatureTable, Frequency
from q2_types.tree import Phylogeny, Rooted

Expand Down Expand Up @@ -48,6 +50,11 @@
plugin.register_semantic_type_to_format(CSIFolder,
artifact_format=CSIDirFmt)

plugin.register_views(TSVMoleculesFormat)
plugin.register_semantic_types(Molecules)
plugin.register_semantic_type_to_format(FeatureData[Molecules],
artifact_format=TSVMoleculesFormat)

PARAMS = {
'ionization_mode': Str % Choices(['positive', 'negative', 'auto']),
'database': Str % Choices(['all', 'pubchem']),
Expand Down Expand Up @@ -141,22 +148,30 @@
function=make_hierarchy,
name='Create a molecular tree',
description='Build a phylogeny based on molecular substructures',
inputs={'csi_result': CSIFolder,
'feature_table': FeatureTable[Frequency]},
inputs={'csi_results': List[CSIFolder],
'feature_tables': List[FeatureTable[Frequency]]},
parameters={'qc_properties': Bool},
input_descriptions={'csi_result': 'CSI:FingerID output folder',
'feature_table': 'Feature table that will be filtered '
'based on the features of the '
'phylogenetic tree'},
input_descriptions={'csi_results': 'one or more CSI:FingerID '
'output folders',
'feature_tables': 'one or more feature tables with '
'mass-spec feature intensity '
'per sample'},
parameter_descriptions={'qc_properties': 'filters molecular properties to '
'retain PUBCHEM fingerprints'},
outputs=[('tree', Phylogeny[Rooted]),
('matched_feature_table', FeatureTable[Frequency])],
('merged_feature_table', FeatureTable[Frequency]),
('merged_feature_data', FeatureData[Molecules])],
output_descriptions={'tree': 'Tree of relatedness between mass '
'spectrometry features based on the chemical '
'substructures within those features',
'matched_feature_table': 'filtered feature table '
'that contains only the '
'features present in '
'the tree'}
'merged_feature_table': 'filtered feature table '
'that contains only the '
'features present in '
'the tree',
'merged_feature_data': 'mapping of unique feature '
'identifiers in input '
'feature tables to MD5 hash '
'of feature fingerprints'}
)

importlib.import_module('q2_chemistree._transformer')
Binary file added q2_chemistree/tests/data/csiFolder2.qza
Binary file not shown.

0 comments on commit d0ce5eb

Please sign in to comment.