Ms1 experiment #213

amnona · 2020-07-12T20:52:30Z

Add some utility functions to ms1experiment:

filter_mz() - for approximate m/z filtering
get_bad_features() - for detection of feature-selection artifacts
improve the heatmap() for nice ms1 defaults (show mz/rt on y-axis)

RNAer · 2020-07-13T12:10:20Z

calour/ms1_experiment.py

+
+    def filter_mz(self, mz, tolerance=0.001, inplace=False, negate=False):
+        '''Filter metabolites based on m/z
+


can you describe how the filtering is performed? users have to look at the code to understand what it is doing.

fixed. is that clear enough now?

RNAer · 2020-07-13T12:12:18Z

calour/ms1_experiment.py

+        return self.reorder(sorted(list(keep)), axis='f', inplace=inplace)
+
+    def get_bad_features(self, mz_tolerance=0.001, rt_tolerance=2, corr_thresh=0.8, inplace=False, negate=False):
+        '''Get metabolites that have similar m/z and rt, and are correlated/anti-correlated.


add a blank line.

RNAer · 2020-07-13T12:29:26Z

calour/ms1_experiment.py

+        Returns
+        -------
+        calour.MS1Experiment
+            features filtered and ordered based on m/z and rt similarity and correlation


what do you do here? remove (semi-) duplicate features? or combine them?

you need a better function name!

any ideas? this is the best name i could come up with :)

so these are spurious duplicate features? if so, how about get_spurious_duplicates?

what's the workflow here? what do you do after you get these features?

feature selection in MS is complicated, with lots of parameters.
The workflow: after you set your parameters and run feature selection, you load the resulting table and inspect it, including with get_spurious_duplicates(). If feature selection is too sensitive, you get many metabolites with similar mz/rt that are correlated or anti-correlated (depending on the exact problem).
If there are too many, you go back and redo feature selection with different parameters.
(You will always get some bad features; it is a sensitivity/specificity tradeoff. What matters is how many there are, how severe they are, and how they affect your downstream analysis.)
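
For illustration, the check described above can be sketched in plain numpy. This is not the calour implementation of get_spurious_duplicates(); the function name `find_spurious_pairs`, the samples-by-features layout, and the toy table are assumptions made for this example. The default tolerances mirror the signature quoted in this thread.

```python
import numpy as np


def find_spurious_pairs(data, mz, rt, mz_tolerance=0.001, rt_tolerance=2, corr_thresh=0.8):
    """Return (i, j) index pairs of features that are close in m/z and rt
    and whose abundances are strongly correlated or anti-correlated.

    data: 2-D array, samples x features (assumed layout for this sketch)
    mz, rt: 1-D list-likes with one value per feature
    """
    data = np.asarray(data, dtype=float)
    mz = np.asarray(mz, dtype=float)
    rt = np.asarray(rt, dtype=float)
    pairs = []
    for i in range(len(mz)):
        # candidate partners: features within both tolerances of feature i
        close = (np.abs(mz - mz[i]) <= mz_tolerance) & (np.abs(rt - rt[i]) <= rt_tolerance)
        for j in np.where(close)[0]:
            if j <= i:
                continue  # report each pair once, and skip the feature itself
            if data[:, i].std() == 0 or data[:, j].std() == 0:
                continue  # correlation is undefined for a constant feature
            corr = np.corrcoef(data[:, i], data[:, j])[0, 1]
            # the absolute value catches both correlation and anti-correlation
            if np.abs(corr) >= corr_thresh:
                pairs.append((int(i), int(j)))
    return pairs


# toy table: 4 samples x 3 features; features 0 and 1 share m/z and rt
# within tolerance and are strongly anti-correlated
data = np.array([[10, 0, 5],
                 [0, 10, 6],
                 [8, 1, 5],
                 [1, 9, 7]])
mz = [100.0000, 100.0005, 250.0]
rt = [30.0, 30.5, 30.0]
print(find_spurious_pairs(data, mz, rt))
```

A long list of pairs from a real table would be the signal, per the workflow above, to revisit the feature-selection parameters.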

amnona · 2020-07-14T08:50:22Z

ready for merge?

RNAer · 2020-07-14T10:03:22Z

calour/ms1_experiment.py

+            raise ValueError('The Experiment does not contain the column "MZ". cannot filter by mz')
+        mz = _to_list(mz)
+        keep = set()
+        for cmz in mz:


better to use a boolean mask? like select in filtering.py

changed to filter_mz_rt() to enable filtering based on mz, rt, or both

RNAer · 2020-07-14T10:05:49Z

calour/ms1_experiment.py

+        Returns
+        -------
+        calour.MS1Experiment
+            features filtered and ordered based on m/z and rt similarity and correlation


so these are spurious duplicate features? if so, how about get_spurious_duplicates?

what's the workflow here? what do you do after you get these features?

amnona · 2020-07-14T21:06:18Z

fixed. also added 2 new functions: merge_similar_features(), sort_mz_rt()
and changed filter_mz() to a more general filter_mz_rt()
no more new functions - i promise.
ready for merge :)

RNAer · 2020-07-15T02:12:00Z

calour/ms1_experiment.py

+    def merge_similar_features(self, mz_tolerance=0.001, rt_tolerance=0.5):
+        '''Merge metabolites with similar mz/rt to a single metabolite
+
+        metabolites are initially sorted by frequency and a greefy clustering algorithm (starting from the highest freq.) is used to join together


pls write proper english sentences (eg use capital letter).

greefy -> greedy?

does this function really do clustering? maybe I missed it but I don't see it.

the description is still not clear. You sum up the features with similar mz and rt? pls be more specific than join together.

1,2,4 fixed
3. It is greedy clustering, similar to open-reference OTU picking (the highest freq. feature is the center, add all metabolites close enough to the center...)
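
A minimal sketch of the greedy clustering described in point 3, assuming plain numpy arrays instead of a calour experiment. The function name `greedy_mz_rt_clusters` and the toy values are hypothetical; only the idea (visit features from most to least abundant, let each unassigned one absorb its unassigned neighbors) comes from the comment above.

```python
import numpy as np


def greedy_mz_rt_clusters(mz, rt, abundance, mz_tolerance=0.001, rt_tolerance=0.5):
    """Assign each feature a cluster id using greedy centroid clustering."""
    mz = np.asarray(mz, dtype=float)
    rt = np.asarray(rt, dtype=float)
    abundance = np.asarray(abundance, dtype=float)
    group = np.full(len(mz), -1)  # -1 means "not assigned yet"
    # visit features from the most to the least abundant
    order = np.argsort(-abundance, kind='stable')
    next_id = 0
    for center in order:
        if group[center] != -1:
            continue  # already absorbed by a more abundant center
        close = (
            (np.abs(mz - mz[center]) <= mz_tolerance)
            & (np.abs(rt - rt[center]) <= rt_tolerance)
            & (group == -1)
        )
        # the center is within tolerance of itself, so it joins its own cluster
        group[close] = next_id
        next_id += 1
    return group


mz = [100.0000, 100.0005, 100.0012, 300.0]
rt = [10.0, 10.2, 10.1, 10.0]
abundance = [5, 50, 20, 1]
print(greedy_mz_rt_clusters(mz, rt, abundance))
```

In the toy data, features 0 and 2 are farther apart than `mz_tolerance` but still land in the same cluster, because each is within tolerance of the most abundant feature (index 1). That is the expected behavior of a greedy, center-based scheme.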

RNAer · 2020-07-15T02:12:25Z

calour/ms1_experiment.py

+
+        Returns
+        -------
+        calour.MS1Experiment with  close metabolites joined to a single metabolite.


pls follow the docstring format.

RNAer · 2020-07-15T02:12:48Z

calour/ms1_experiment.py

+        The m/z and rt of the new metabolite are the m/z and rt of the highest freq. metabolite.
+        new feature_metadata fields: _calour_merge_number, _calour_merge_ids are added listing the number and ids of the metabolites joined for each new metabolite
+        '''
+        exp = self.sort_abundance(reverse=False)


why sorting?

We want the center of each cluster to be the highest frequency metabolite and to join to it all the close metabolites.

RNAer · 2020-07-15T02:16:19Z

calour/ms1_experiment.py

+        '''
+        exp = self.sort_abundance(reverse=False)
+        features = exp.feature_metadata
+        features['_metabolite_group'] = np.zeros(len(features)) - 1


instead of cluttering the feature metadata data frame, could you modify aggregate_by_metadata to accept a list-like (pd.Series, tuple, etc.) in addition to a column name for group-then-apply manipulation? it is a very light change and avoids adding an intermediate column to the metadata data frame.

do you mean making the field parameter of aggregate_by_metadata accept a list/pd.Series etc.?
I think this would make the function's API confusing (field could be a tuple...)
adding it as a separate parameter instead of field would also make the API confusing...
The new column is deleted from the final result, so the user never sees it.
i think keeping the API clean is more important than keeping the code clean...
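
To make the suggestion concrete, here is a rough sketch of aggregating features by an external label array, with no temporary metadata column. It is written in plain numpy and does not use or describe calour's aggregate_by_metadata; the helper name `sum_features_by_labels` and the samples-by-features layout are assumptions for the example.

```python
import numpy as np


def sum_features_by_labels(data, labels):
    """Sum the feature columns of `data` that share a label.

    data: 2-D array, samples x features
    labels: list-like with one group label per feature
    Returns (unique_labels, merged), where merged is samples x groups.
    """
    data = np.asarray(data)
    labels = np.asarray(labels)
    if labels.shape[0] != data.shape[1]:
        raise ValueError('need exactly one label per feature')
    unique_labels = np.unique(labels)  # sorted group ids
    merged = np.column_stack(
        [data[:, labels == label].sum(axis=1) for label in unique_labels])
    return unique_labels, merged


data = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])
groups, merged = sum_features_by_labels(data, [0, 0, 1, 0])
print(groups)  # one entry per group
print(merged)  # one column per group
```

The labels never touch the metadata table here, which is the benefit the reviewer is after. The cost, as the reply notes, is that a public function accepting either a column name or a list-like has a less obvious signature.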

RNAer · 2020-07-15T02:18:14Z

calour/ms1_experiment.py

+            mzdist = np.abs(features['MZ'] - cfeature['MZ'])
+            rtdist = np.abs(features['RT'] - cfeature['RT'])
+            ok = np.logical_and(mzdist <= mz_tolerance, rtdist <= rt_tolerance)
+            ok = np.logical_and(ok, features['_metabolite_group'] == -1)


you can use a & b & c for multiple logic and

cool. fixed
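
For reference, a small self-contained comparison of the two spellings on toy arrays (the values are made up). One thing to watch with `&`: it binds more tightly than `<=` and `==`, so each comparison needs its own parentheses.

```python
import numpy as np

mzdist = np.array([0.0002, 0.0005, 0.0100, 0.0001])
rtdist = np.array([0.1, 3.0, 0.2, 0.3])
group = np.array([-1, -1, -1, 2])
mz_tolerance, rt_tolerance = 0.001, 0.5

# nested calls, as in the diff above
nested = np.logical_and(
    np.logical_and(mzdist <= mz_tolerance, rtdist <= rt_tolerance),
    group == -1)

# chained element-wise AND; the parentheses are required
chained = (mzdist <= mz_tolerance) & (rtdist <= rt_tolerance) & (group == -1)

print(np.array_equal(nested, chained))
print(chained)
```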

RNAer · 2020-07-15T02:24:58Z

calour/ms1_experiment.py

+        Returns
+        -------
+        calour.MS1Experiment
+        Sorted according to m/z and retention time


pls follow the format

RNAer · 2020-07-15T02:26:46Z

calour/ms1_experiment.py

+        calour.MS1Experiment
+        Sorted according to m/z and retention time
+        '''
+        return self.sort_by_metadata('mz_rt', axis='f', inplace=inplace)


i don't understand why we need this. this function adds code maintenance burden but doesn't provide any benefit beyond sort_by_metadata.

I think it helps make the analysis notebook easier to understand, which is important for non-expert calour users.
similar to sort_samples()

RNAer · 2020-07-15T02:33:09Z

calour/ms1_experiment.py

+        if len(mz) != len(rt):
+            raise ValueError('mz and rt must have same length')
+
+        for cmz, crt in zip(mz, rt):


mz and rt should be matched if they are list-like? pls document this in the docstring.

RNAer · 2020-07-15T02:35:01Z

calour/ms1_experiment.py

+            bothok = np.logical_and(keepmz, keeprt)
+            if bothok.sum() == 0:
+                notfound += 1
+            keep = keep.union(set(np.where(bothok)[0]))


as for this for loop, could you use a boolean mask array (as I mentioned on your original code in this PR)? you can take a look at the select variable in filtering.py. it will make the code cleaner and a little more efficient.
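
A sketch of what the boolean-mask version might look like, assuming plain numpy arrays for the feature m/z and rt values. The function name `select_mz_rt` and its arguments are hypothetical and this is not the code that was merged; the pairing of `mz[i]` with `rt[i]` follows the `zip(mz, rt)` loop in the diff above.

```python
import numpy as np


def select_mz_rt(feature_mz, feature_rt, mz, rt, mz_tolerance=0.001, rt_tolerance=0.5):
    """Return (select, notfound).

    select: boolean mask over features, True where a feature is within
        tolerance of at least one (mz[i], rt[i]) pair
    notfound: number of (mz[i], rt[i]) pairs that matched no feature
    """
    feature_mz = np.asarray(feature_mz, dtype=float)
    feature_rt = np.asarray(feature_rt, dtype=float)
    mz = np.atleast_1d(mz)
    rt = np.atleast_1d(rt)
    if len(mz) != len(rt):
        raise ValueError('mz and rt must have same length')
    select = np.zeros(len(feature_mz), dtype=bool)
    notfound = 0
    for cmz, crt in zip(mz, rt):
        # mz[i] is matched with rt[i]; a feature must be close to both
        bothok = ((np.abs(feature_mz - cmz) <= mz_tolerance)
                  & (np.abs(feature_rt - crt) <= rt_tolerance))
        if not bothok.any():
            notfound += 1
        select |= bothok  # accumulate matches in the mask, no set needed
    return select, notfound


select, notfound = select_mz_rt(
    feature_mz=[100.0, 100.0004, 200.0],
    feature_rt=[10.0, 12.0, 20.0],
    mz=[100.0, 500.0],
    rt=[10.1, 1.0])
print(select, notfound)
```

With a mask, a `negate` option reduces to `~select`, and the mask can be handed to a reorder step directly instead of converting a set of indices to a sorted list.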

RNAer · 2020-07-15T02:40:22Z

calour/ms1_experiment.py

        super().heatmap(*args, **kwargs)

    def __repr__(self):
        '''Return a string representation of this object.'''
        return 'MS1Experiment %s with %d samples, %d features' % (
            self.description, self.data.shape[0], self.data.shape[1])
+
+    def get_spurious_duplicates(self, mz_tolerance=0.001, rt_tolerance=2, corr_thresh=0.8, inplace=False, negate=False):


this function should use your new filter_mz_rt function? you have redundant code in the 2 functions.

amnona · 2020-07-15T16:48:25Z

ok
addressed comments :)
go merge go

RNAer · 2020-07-16T04:47:28Z

calour/transforming.py

@@ -370,3 +370,17 @@ def subsample_count(exp: Experiment, total, replace=False, inplace=False, random
    exp.reorder([i not in drops for i in range(exp.data.shape[0])], inplace=True)
    exp.normalized = total
    return exp
+
+
+def _subsample(data, depth):


do you want this in this PR, or is it just an accidental commit? if you want it, could you add a test?

oops wrong branch... deleted

RNAer · 2020-07-16T04:48:43Z

calour/transforming.py

+        new_reads = np.random.permutation(reads)[: depth]
+        # res = np.unique(new_reads, return_counts=True)
+        res = np.bincount(new_reads)
+    return res


res is 1-dimensional? isn't this function supposed to accept a 2-d and also return a 2-d?
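
To illustrate the shape concern, here is a sketch of a row-wise subsampler that takes a 2-D count matrix and returns a 2-D matrix of the same shape. It is not the `_subsample` function from the diff (only a fragment of that is quoted here); the name `subsample_rows`, the `seed` argument, and the choice to raise on shallow rows are assumptions for this example.

```python
import numpy as np


def subsample_rows(data, depth, seed=None):
    """Subsample each row of a 2-D integer count matrix to `depth` reads
    without replacement. Rows are samples, columns are features."""
    data = np.asarray(data, dtype=int)
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    res = np.zeros_like(data)
    for i, counts in enumerate(data):
        if counts.sum() < depth:
            raise ValueError('row %d has fewer than %d reads' % (i, depth))
        # expand counts into one entry per read, labeled by feature index
        reads = np.repeat(np.arange(n_features), counts)
        new_reads = rng.permutation(reads)[:depth]
        # minlength keeps every row at n_features entries, even when the
        # highest-index features receive no reads
        res[i] = np.bincount(new_reads, minlength=n_features)
    return res


counts = np.array([[5, 0, 5],
                   [1, 2, 7]])
out = subsample_rows(counts, depth=4, seed=0)
print(out.shape)
print(out.sum(axis=1))
```

Without `minlength`, `np.bincount` returns an array whose length depends on the largest feature index drawn, so rows could not be stacked back into a matrix reliably.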

amnona · 2020-07-16T13:49:19Z

@RNAer ready for merge :)

amnona added 5 commits July 11, 2020 23:17

add filter_mz

a3fd298

get_bad_features() and better heatmap

a923f53

move mz_rt to read_ms()

9d94cef

add ms1 experiment unit testing

a437253

pep8 and fix std0 warnning

9736fb8

RNAer requested changes Jul 13, 2020

pr fixes

87a664e

RNAer requested changes Jul 14, 2020

amnona added 2 commits July 14, 2020 23:47

change filter_mz to filter_mz_rt

2c0f99a

pr fixes

df37c5d

RNAer requested changes Jul 15, 2020

amnona added 2 commits July 15, 2020 15:35

pr fixes

97d6530

chenge filter_mz_rt to select mask

28dea54

RNAer reviewed Jul 16, 2020

pr fixes

eac3b60

RNAer self-requested a review July 17, 2020 02:25

RNAer self-assigned this Jul 17, 2020

RNAer merged commit e10bf68 into biocore:master Jul 17, 2020

