Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ms1 experiment #213

Merged
merged 11 commits into from
Jul 17, 2020
Merged

Ms1 experiment #213

merged 11 commits into from
Jul 17, 2020

Conversation

amnona
Copy link
Collaborator

@amnona amnona commented Jul 12, 2020

Add some utility functions to ms1experiment:

  • filter_mz() - for approximate m/z filtering
  • get_bad_features() - for detection of feature-selection artifacts
  • improve the heatmap() for nice ms1 defaults (show mz/rt on y-axis)


def filter_mz(self, mz, tolerance=0.001, inplace=False, negate=False):
'''Filter metabolites based on m/z

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can describe how the filtering is performed? users have to look at code to understand what it is doing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. is that clear enough now?

return self.reorder(sorted(list(keep)), axis='f', inplace=inplace)

def get_bad_features(self, mz_tolerance=0.001, rt_tolerance=2, corr_thresh=0.8, inplace=False, negate=False):
'''Get metabolites that have similar m/z and rt, and are correlated/anti-correlated.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a blank line.

Returns
-------
calour.MS1Experiment
features filtered and ordered basen on m/z and rt similarity and correlation
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you do here? remove (semi-) duplicate features? or combine them?

you need a better function name!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any ideas? this is the best name i could come up with :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so these are spurious duplicate features? if so, how about get_spurious_duplicates?

what's the workfflow here? what do u do after you get these features?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feature selection in ms is complicated... and lots of parameters
the workflow is that after you set your parameters and do feature selection, you load the resulting table and look at it also using get_spurious_duplicates(). If your feature selection is too sensitive, you get lots of metabolites with similar mz/rt that show anti-correlation or correlation (depending on the exact problem).
Then, if you get too many, you go back and redo feature selection with different params...
(but you will always get some bad features - it's a sensitivity/specificity payoff... it depends on how many and how severe... and how it affects your downstream analysis)

@amnona
Copy link
Collaborator Author

amnona commented Jul 14, 2020

ready for merge?

raise ValueError('The Experiment does not contain the column "MZ". cannot filter by mz')
mz = _to_list(mz)
keep = set()
for cmz in mz:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use a boolean mask? like select in filtering.py

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to filter_mz_rt() to enable filtering based on mz / rt/ or both

Returns
-------
calour.MS1Experiment
features filtered and ordered basen on m/z and rt similarity and correlation
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so these are spurious duplicate features? if so, how about get_spurious_duplicates?

what's the workfflow here? what do u do after you get these features?

@amnona
Copy link
Collaborator Author

amnona commented Jul 14, 2020

fixed. also added 2 new functions: merge_similar_features(), sort_mz_rt()
and changed filter_mz() to a more general filter_mz_rt()
no more new function - i promise.
ready for merge :)

def merge_similar_features(self, mz_tolerance=0.001, rt_tolerance=0.5):
'''Merge metabolites with similar mz/rt to a single metabolite

metabolites are initially sorted by frequency and a greefy clustering algorithm (starting from the highest freq.) is used to join together
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. pls write proper english sentences (eg use capital letter).

  2. greefy -> greedy?

  3. does this function really do clustering? maybe I missed it but I don't see it.

  4. the description is still not clear. You sum up the features with similar mz and rt? pls be more specific than join together.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1,2,4 fixed
3. It is greedy clustering, similar to open-reference OTU picking (the highest freq. feature is the center, add all metabolites close enough to the center...)


Returns
-------
calour.MS1Experiment with close metabolites joined to a single metabolite.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls follow the docstring format.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

The m/z and rt of the new metabolite are the m/z and rt of the highest freq. metabolite.
new feature_metadata fields: _calour_merge_number, _calour_merge_ids are added listing the number and ids of the metabolites joined for each new metabolite
'''
exp = self.sort_abundance(reverse=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why sorting?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want the center of each cluster to be the highest frequency metabolite and to join to it all the close metabolites.

'''
exp = self.sort_abundance(reverse=False)
features = exp.feature_metadata
features['_metabolite_group'] = np.zeros(len(features)) - 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of of messing and clutter feature metadata data frame, could you modify aggregate_by_metadata to accept a list-like (pd.series, tuple, etc.) besides a column name for group-then-apply manipulation? it is a very light change and avoids adding intermediary column to metadata data frame.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean to make the field var in agg_be metadata can be a list/pd.series etc?
I think this will be confusing for the API of the function (field can be a tuple...)
adding it as a different param instead of field will also be confusing for the api...
The new column is deleted from the final result, so the user does not feel anything.
i think making the API clean is more important than making the code clean...

mzdist = np.abs(features['MZ'] - cfeature['MZ'])
rtdist = np.abs(features['RT'] - cfeature['RT'])
ok = np.logical_and(mzdist <= mz_tolerance, rtdist <= rt_tolerance)
ok = np.logical_and(ok, features['_metabolite_group'] == -1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use a & b & c for multiple logic and

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool. fixed

Returns
-------
calour.MS1Experiment
Sorted according to m/z and retention time
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls follow the format

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

calour.MS1Experiment
Sorted according to m/z and retention time
'''
return self.sort_by_metadata('mz_rt', axis='f', inplace=inplace)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't understand why we need this. this function add more code maintainance burden but doesn't provide any benefits beyond sort_by_metadata.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it helps make the analysis notebook more easy to understand. Important for non-expert calour users.
similar to sort_samples()

if len(mz) != len(rt):
raise ValueError('mz and rt must have same length')

for cmz, crt in zip(mz, rt):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mz and rt should be matched if they are list-like? pls document this in the docstring.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

bothok = np.logical_and(keepmz, keeprt)
if bothok.sum() == 0:
notfound += 1
keep = keep.union(set(np.where(bothok)[0]))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as on this for loop, could you use boolean mask array (as I mentioned in your original code of this PR)? you can take a look at select variable in filtering.py. It will make the code cleaner and a little more efficient?

super().heatmap(*args, **kwargs)

def __repr__(self):
'''Return a string representation of this object.'''
return 'MS1Experiment %s with %d samples, %d features' % (
self.description, self.data.shape[0], self.data.shape[1])

def get_spurious_duplicates(self, mz_tolerance=0.001, rt_tolerance=2, corr_thresh=0.8, inplace=False, negate=False):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function should use your new filter_mz_rt function? you have redundant code in the 2 functions.

@amnona
Copy link
Collaborator Author

amnona commented Jul 15, 2020

ok
addressed comments :)
go merge go

@@ -370,3 +370,17 @@ def subsample_count(exp: Experiment, total, replace=False, inplace=False, random
exp.reorder([i not in drops for i in range(exp.data.shape[0])], inplace=True)
exp.normalized = total
return exp


def _subsample(data, depth):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want this in this PR or it is just an accidental commit? if so, could you add a test?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops wrong branch... deleted

new_reads = np.random.permutation(reads)[: depth]
# res = np.unique(new_reads, return_counts=True)
res = np.bincount(new_reads)
return res
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

res is 1-dimensional? isn't this function supposed to accept a 2-d and also return a 2-d?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted

@amnona
Copy link
Collaborator Author

amnona commented Jul 16, 2020

@RNAer ready for merge :)

@RNAer RNAer self-requested a review July 17, 2020 02:25
@RNAer RNAer self-assigned this Jul 17, 2020
@RNAer RNAer merged commit e10bf68 into biocore:master Jul 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants