Merge 0d5fbb3 into f608ea6

biocore · Apr 16, 2019 · d0ce5eb · d0ce5eb
2 parents f608ea6 + 0d5fbb3
commit d0ce5eb
Show file tree

Hide file tree

Showing 69 changed files with 88,155 additions and 126 deletions.
diff --git a/README.md b/README.md
@@ -6,7 +6,7 @@ A tool to build a tree of MS1 features to compare chemical composition of sample
 
 ## Installation
 
-Once QIIME 2 is [installed](https://docs.qiime2.org/2018.11/install/), activate your QIIME 2 environment and install q2-chemistree following the steps below:
+Once QIIME 2 is [installed](https://docs.qiime2.org/2019.1/install/), activate your QIIME 2 environment and install q2-chemistree following the steps below:
 
 ```bash
 git clone https://github.com/biocore/q2-chemistree.git
@@ -102,15 +102,33 @@ Now, we use these predicted molecular substructures to generate a hierarchy of m
 
 ```bash
 qiime chemistree make-hierarchy \
-  --i-csi-result fingerprints.qza \
-  --i-feature-table feature-table.qza \
-  --o-tree demo-chemisTree.qza \
-  --o-matched-feature-table filtered-feature-table.qza
+  --i-csi-results fingerprints.qza \
+  --i-feature-tables feature-table.qza \
+  --o-tree demo-chemistree.qza \
+  --o-merged-feature-table filtered-feature-table.qza
+  --o-merged-feature-data feature-data.qza
 ```
 
-This method performs two tasks:
-1. Generates a tree relating the MS1 features in these data based on molecular substructures predicted for MS1 features. This is of type `Phylogeny[Rooted]`. By default, we only use PubChem fingerprints (total 489 molecular properties). Adding `--p-no-qc-properties` retains all (2936) the molecular properties in the contingency table.
+To support meta-analyses, this method is capable of handling one or more datasets i.e pairs of CSI results and feature tables. Below is an example for two datasets:
+
+```bash
+qiime chemistree make-hierarchy \
+  --i-csi-results fingerprints.qza \
+  --i-csi-results fingerprints2.qza \
+  --i-feature-tables feature-table.qza \
+  --i-feature-tables feature-table2.qza
+  --o-tree merged-chemistree.qza \
+  --o-merged-feature-table merged-feature-table.qza \
+  --o-merged-feature-data merged-feature-data.qza
+```
+
+**Note:** The input CSI results and feature tables should have a one-to-one correspondance i.e csi results and feature tables from all datasets should be provided in the same order.
+
+This method generates the following:
+1. A combined feature table by merging all the input feature tables; MS1 features without fingerprints are filtered out of this feature table. This is done because SIRIUS predicts molecular substructures for a subset of features (typically for 70-90% of all MS1 features) in an experiment (based on factors such as sample type, the quality MS2 spectra, and user-defined tolerances such as `--p-ppm-max`, `--p-zodiac-threshold`). This output is of type `FeatureTable[Frequency]`.
+2. A tree relating the MS1 features in these data based on molecular substructures predicted for MS1 features. This is of type `Phylogeny[Rooted]`. By default, we only use PubChem fingerprints (total 489 molecular properties). Adding `--p-no-qc-properties` retains all (2936) the molecular properties in the contingency table.
 **Note**: The latest release of [SIRIUS](https://www.nature.com/articles/s41592-019-0344-8) uses PubChem version downloaded on 13 August 2017.
-2. Filters the MS1 features without fingerprints from the feature table such that the feature IDs in the feature table and the tree match. This is done because SIRIUS predicts molecular substructures for a subset of features (typically for 70-90% of all MS1 features) in an experiment (based on factors such as sample type, the quality MS2 spectra, and user-defined tolerances such as `--p-ppm-max`, `--p-zodiac-threshold`). The resulting feature table is also of type `FeatureTable[Frequency]`.
+3. A combined feature data file that contains unique identifiers of each feature, their corresponding original feature identifier, and feature tables that each feature was detected in. This is of type `FeatureData[Molecules]`. (The renaming of features needs to be done to avoid overlapping, non-unique feature identifiers in the original feature table)
+
 
-Thus, using these steps, we can generate a tree (`demo-chemisTree.qza`) relating MS1 features in a mass-spectrometry dataset along with a matched feature table (`filtered-feature-table.qza`). These can be used as inputs to perform chemical phylogeny-based [alpha-diversity](https://docs.qiime2.org/2018.11/plugins/available/diversity/alpha-phylogenetic/) and [beta-diversity](https://docs.qiime2.org/2018.11/plugins/available/diversity/beta-phylogenetic/) analyses.
+Thus, using these steps, we can generate a tree relating MS1 features in a mass-spectrometry dataset along with a matched feature table. These can be used as inputs to perform chemical phylogeny-based [alpha-diversity](https://docs.qiime2.org/2019.1/plugins/available/diversity/alpha-phylogenetic/) and [beta-diversity](https://docs.qiime2.org/2019.1/plugins/available/diversity/beta-phylogenetic/) analyses.
diff --git a/q2_chemistree/_hierarchy.py b/q2_chemistree/_hierarchy.py
@@ -12,6 +12,7 @@
 from scipy.spatial.distance import squareform
 from scipy.cluster.hierarchy import linkage
 from skbio import TreeNode
+from q2_feature_table import merge
 
 from ._collate_fingerprint import collate_fingerprint
 from ._match import match_label
@@ -32,9 +33,33 @@ def build_tree(relabeled_fingerprints: pd.DataFrame) -> TreeNode:
     return tree
 
 
-def make_hierarchy(csi_result: CSIDirFmt,
-                   feature_table: biom.Table,
-                   qc_properties: bool = True) -> (TreeNode, biom.Table):
+def merge_feature_data(fdata: pd.DataFrame):
+    '''
+    This function merges feature data from multiple feature tables. The
+    resulting table is indexed by MD5 hash mapped to unique feature
+    identifiers in the original feature tables.
+    '''
+    for idx, data in enumerate(fdata):
+        data['table_number'] = str(idx+1)
+    merged_fdata = pd.concat(fdata)
+    dupl_bool = merged_fdata.index.duplicated(keep='first')
+    duplicates = merged_fdata[dupl_bool].index.unique()
+    if len(duplicates) == 0:
+        return merged_fdata
+    else:
+        for idx in duplicates:
+            merged_fdata.loc[idx, '#featureID'] = (',').join(
+                list(merged_fdata.loc[idx, '#featureID']))
+            merged_fdata.loc[idx, 'table_number'] = (',').join(
+                list(merged_fdata.loc[idx, 'table_number']))
+        merged_fdata = merged_fdata[~dupl_bool]
+        return merged_fdata
+
+
+def make_hierarchy(csi_results: CSIDirFmt,
+                   feature_tables: biom.Table,
+                   qc_properties: bool = True) -> (TreeNode, biom.Table,
+                                                   pd.DataFrame):
     '''
     This function generates a hierarchy of mass-spec features based on
     predicted chemical fingerprints. It filters the feature table to
@@ -43,10 +68,10 @@ def make_hierarchy(csi_result: CSIDirFmt,
 
     Parameters
     ----------
-    csi_result : CSIDirFmt
-        CSI:FingerID output folder
+    csi_results : CSIDirFmt
+        one or more CSI:FingerID output folder
     feature_table : biom.Table
-        feature table with mass-spec feature intensity per sample
+        one or more feature tables with mass-spec feature intensity per sample
     qc_properties : bool, default True
         flag to filter molecular properties to keep only PUBCHEM fingerprints
 
@@ -64,15 +89,29 @@ def make_hierarchy(csi_result: CSIDirFmt,
     skbio.TreeNode
         a tree of relatedness of molecules
     biom.Table
-        filtered feature table that contains only the features present in
-        the tree
+        merged feature table that is filtered to contain only the
+        features present in the tree
+    pd.DataFrame
+        merged feature data
     '''
 
-    if feature_table.is_empty():
-        raise ValueError("Cannot have empty feature table")
-    fingerprints = collate_fingerprint(csi_result, qc_properties)
-    relabeled_fingerprints, matched_feature_table = match_label(fingerprints,
-                                                                feature_table)
-    tree = build_tree(relabeled_fingerprints)
+    fps, fts, fdata = [], [], []
+    if len(feature_tables) != len(csi_results):
+        raise ValueError("The feature tables and CSI results should have a "
+                         "one-to-one correspondance.")
+    for feature_table, csi_result in zip(feature_tables, csi_results):
+        if feature_table.is_empty():
+            raise ValueError("Cannot have empty feature table")
+        fingerprints = collate_fingerprint(csi_result, qc_properties)
+        relabeled_fp, matched_ft, feature_data = match_label(fingerprints,
+                                                             feature_table)
+        fps.append(relabeled_fp)
+        fts.append(matched_ft)
+        fdata.append(feature_data)
+    merged_fdata = merge_feature_data(fdata)
+    merged_fps = pd.concat(fps)
+    merged_fps = merged_fps[~merged_fps.index.duplicated(keep='first')]
+    merged_fts = merge(fts, overlap_method='error_on_overlapping_sample')
+    tree = build_tree(merged_fps)
 
-    return tree, matched_feature_table
+    return tree, merged_fts, merged_fdata
diff --git a/q2_chemistree/_match.py b/q2_chemistree/_match.py
@@ -17,24 +17,59 @@ def match_label(collated_fingerprints: pd.DataFrame,
     This function filters the feature table to retain only features with
     fingerprints. It also relabels features with MD5 hash of its
     binary fingerprint vector.
+
+    Parameters
+    ----------
+    collated_fingerprints : pd.DataFrame
+        table containing mass-spec molecular substructures (columns) for each
+        mass-spec feature (index)
+    feature_table : biom.Table
+        feature tables with mass-spec feature intensity per sample.
+
+    Raises
+    ------
+    ValueError
+        If features in collated fingerprint table are not a subset of
+        features in ``feature_table``
+
+    Returns
+    -------
+    pd.DataFrame
+        fingerprint table with features relabeled with MD5 hash of
+        its binary fingerprint vector
+    biom.Table
+        feature table that is filtered to contain only the
+        features with predicted fingerprints. Features are labeled by MD5 hash
+        of its binary fingerprint vector
+    pd.DataFrame
+        table that maps MD5 hash of a feature to the original feature ID in
+        the input feature table
     '''
+
     fps = collated_fingerprints.copy()
-    allfps = set(fps.index)
+    allfps = list(fps.index)
     if fps.empty:
         raise ValueError("Cannot have empty fingerprint table")
     table = feature_table.to_dataframe(dense=True)
     allfeatrs = set(table.index)
-    if not allfps.issubset(allfeatrs):
-        extra_tips = allfps-allfps.intersection(allfeatrs)
+    if not set(allfps).issubset(allfeatrs):
+        extra_tips = set(allfps) - set(allfps).intersection(allfeatrs)
         raise ValueError('The following tips were not '
                          'found in the feature table:\n' +
                          ', '.join([str(i) for i in extra_tips]))
     filtered_table = table.reindex(allfps)
+    fps = fps.apply(pd.to_numeric)
     fps = (fps > 0.5).astype(int)
-    for fid in fps.index:
+    list_md5 = []
+    for fid in allfps:
         md5 = str(hashlib.sha256(fps.loc[fid].values.tobytes()).hexdigest())
-        fps.loc[fid, 'label'] = md5
-        filtered_table.loc[fid, 'label'] = md5
+        list_md5.append(md5)
+    fps['label'] = list_md5
+    filtered_table['label'] = list_md5
+    feature_data = pd.DataFrame(columns=['label', '#featureID'])
+    feature_data['label'] = list_md5
+    feature_data['#featureID'] = allfps
+    feature_data.set_index('label', inplace=True)
     relabel_fps = fps.groupby('label').first()
     matched_table = filtered_table.groupby('label').sum()
     # biom requires that ids be strings
@@ -43,4 +78,4 @@ def match_label(collated_fingerprints: pd.DataFrame,
         data=npfeatures, observation_ids=matched_table.index.astype(str),
         sample_ids=matched_table.columns.astype(str))
 
-    return relabel_fps, matched_table
+    return relabel_fps, matched_table, feature_data
diff --git a/q2_chemistree/_semantics.py b/q2_chemistree/_semantics.py
@@ -8,6 +8,7 @@
 
 import qiime2.plugin.model as model
 from qiime2.plugin import SemanticType
+from q2_types.feature_data import FeatureData
 import os
 
 
@@ -21,6 +22,19 @@ def sniff(self):
 MassSpectrometryFeatures = SemanticType('MassSpectrometryFeatures')
 
 
+# TODO: define this class
+# TSVMoleculesFormat (mimic MGFFile)
+class TSVMolecules(model.TextFileFormat):
+    def sniff(self):
+        return True
+
+
+TSVMoleculesFormat = model.SingleFileDirectoryFormat('TSVMoleculesFormat',
+                                                     'feature_data.tsv',
+                                                     TSVMolecules)
+Molecules = SemanticType('Molecules', variant_of=FeatureData.field['type'])
+
+
 class OutputDirs(model.DirectoryFormat):
 
     def get_folder_name(self):

diff --git a/q2_chemistree/_transformer.py b/q2_chemistree/_transformer.py
@@ -0,0 +1,27 @@
+from .plugin_setup import plugin
+from ._semantics import TSVMolecules
+import pandas as pd
+
+
+def _read_dataframe(fh):
+    # Using `dtype=object` and `set_index` to avoid type casting/inference
+    # of any columns or the index.
+    df = pd.read_csv(fh, sep='\t', header=0, dtype='str')
+    df.set_index(df.columns[0], drop=True, append=False, inplace=True)
+    df.index.name = 'id'
+    return df
+
+# define a transformer from pd.DataFrame to -> TSVMolecules
+@plugin.register_transformer
+def _1(data: pd.DataFrame) -> TSVMolecules:
+    ff = TSVMolecules()
+    with ff.open() as fh:
+        data.to_csv(fh, sep='\t', header=True)
+    return ff
+
+# define a transformer from TSVMolecules -> pd.DataFrame
+@plugin.register_transformer
+def _2(ff: TSVMolecules) -> pd.DataFrame:
+    with ff.open() as fh:
+        df = _read_dataframe(fh)
+        return df
diff --git a/q2_chemistree/plugin_setup.py b/q2_chemistree/plugin_setup.py
@@ -5,16 +5,18 @@
 #
 # The full license is in the file LICENSE, distributed with this software.
 # ----------------------------------------------------------------------------
+import importlib
 from ._fingerprint import (compute_fragmentation_trees,
                            rerank_molecular_formulas,
                            predict_fingerprints)
 from ._hierarchy import make_hierarchy
 from ._semantics import (MassSpectrometryFeatures, MGFDirFmt,
                          SiriusFolder, SiriusDirFmt,
                          ZodiacFolder, ZodiacDirFmt,
-                         CSIFolder, CSIDirFmt)
+                         CSIFolder, CSIDirFmt,
+                         FeatureData, TSVMoleculesFormat, Molecules)
 
-from qiime2.plugin import Plugin, Str, Range, Choices, Float, Int, Bool
+from qiime2.plugin import Plugin, Str, Range, Choices, Float, Int, Bool, List
 from q2_types.feature_table import FeatureTable, Frequency
 from q2_types.tree import Phylogeny, Rooted
 
@@ -48,6 +50,11 @@
 plugin.register_semantic_type_to_format(CSIFolder,
                                         artifact_format=CSIDirFmt)
 
+plugin.register_views(TSVMoleculesFormat)
+plugin.register_semantic_types(Molecules)
+plugin.register_semantic_type_to_format(FeatureData[Molecules],
+                                        artifact_format=TSVMoleculesFormat)
+
 PARAMS = {
     'ionization_mode': Str % Choices(['positive', 'negative', 'auto']),
     'database': Str % Choices(['all', 'pubchem']),
@@ -141,22 +148,30 @@
     function=make_hierarchy,
     name='Create a molecular tree',
     description='Build a phylogeny based on molecular substructures',
-    inputs={'csi_result': CSIFolder,
-            'feature_table': FeatureTable[Frequency]},
+    inputs={'csi_results': List[CSIFolder],
+            'feature_tables': List[FeatureTable[Frequency]]},
     parameters={'qc_properties': Bool},
-    input_descriptions={'csi_result': 'CSI:FingerID output folder',
-                        'feature_table': 'Feature table that will be filtered '
-                                         'based on the features of the '
-                                         'phylogenetic tree'},
+    input_descriptions={'csi_results': 'one or more CSI:FingerID '
+                                       'output folders',
+                        'feature_tables': 'one or more feature tables with '
+                                          'mass-spec feature intensity '
+                                          'per sample'},
     parameter_descriptions={'qc_properties': 'filters molecular properties to '
                                              'retain PUBCHEM fingerprints'},
     outputs=[('tree', Phylogeny[Rooted]),
-             ('matched_feature_table', FeatureTable[Frequency])],
+             ('merged_feature_table', FeatureTable[Frequency]),
+             ('merged_feature_data', FeatureData[Molecules])],
     output_descriptions={'tree': 'Tree of relatedness between mass '
                                  'spectrometry features based on the chemical '
                                  'substructures within those features',
-                         'matched_feature_table': 'filtered feature table '
-                                                  'that contains only the '
-                                                  'features present in '
-                                                  'the tree'}
+                         'merged_feature_table': 'filtered feature table '
+                                                 'that contains only the '
+                                                 'features present in '
+                                                 'the tree',
+                         'merged_feature_data': 'mapping of unique feature '
+                                                'identifiers in input '
+                                                'feature tables to MD5 hash '
+                                                'of feature fingerprints'}
 )
+
+importlib.import_module('q2_chemistree._transformer')
diff --git a/q2_chemistree/tests/data/csiFolder2.qza b/q2_chemistree/tests/data/csiFolder2.qza