-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ms1 experiment #213
Ms1 experiment #213
Conversation
|
||
def filter_mz(self, mz, tolerance=0.001, inplace=False, negate=False): | ||
'''Filter metabolites based on m/z | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can describe how the filtering is performed? users have to look at code to understand what it is doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. is that clear enough now?
calour/ms1_experiment.py
Outdated
return self.reorder(sorted(list(keep)), axis='f', inplace=inplace) | ||
|
||
def get_bad_features(self, mz_tolerance=0.001, rt_tolerance=2, corr_thresh=0.8, inplace=False, negate=False): | ||
'''Get metabolites that have similar m/z and rt, and are correlated/anti-correlated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a blank line.
Returns | ||
------- | ||
calour.MS1Experiment | ||
features filtered and ordered basen on m/z and rt similarity and correlation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you do here? remove (semi-) duplicate features? or combine them?
you need a better function name!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any ideas? this is the best name i could come up with :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so these are spurious duplicate features? if so, how about get_spurious_duplicates?
what's the workfflow here? what do u do after you get these features?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feature selection in ms is complicated... and lots of parameters
the workflow is that after you set your parameters and do feature selection, you load the resulting table and look at it also using get_spurious_duplicates(). If your feature selection is too sensitive, you get lots of metabolites with similar mz/rt that show anti-correlation or correlation (depending on the exact problem).
Then, if you get too many, you go back and redo feature selection with different params...
(but you will always get some bad features - it's a sensitivity/specificity payoff... it depends on how many and how severe... and how it affects your downstream analysis)
ready for merge? |
calour/ms1_experiment.py
Outdated
raise ValueError('The Experiment does not contain the column "MZ". cannot filter by mz') | ||
mz = _to_list(mz) | ||
keep = set() | ||
for cmz in mz: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to use a boolean mask? like select
in filtering.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to filter_mz_rt() to enable filtering based on mz / rt/ or both
Returns | ||
------- | ||
calour.MS1Experiment | ||
features filtered and ordered basen on m/z and rt similarity and correlation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so these are spurious duplicate features? if so, how about get_spurious_duplicates?
what's the workfflow here? what do u do after you get these features?
fixed. also added 2 new functions: merge_similar_features(), sort_mz_rt() |
calour/ms1_experiment.py
Outdated
def merge_similar_features(self, mz_tolerance=0.001, rt_tolerance=0.5): | ||
'''Merge metabolites with similar mz/rt to a single metabolite | ||
|
||
metabolites are initially sorted by frequency and a greefy clustering algorithm (starting from the highest freq.) is used to join together |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
pls write proper english sentences (eg use capital letter).
-
greefy -> greedy?
-
does this function really do clustering? maybe I missed it but I don't see it.
-
the description is still not clear. You sum up the features with similar mz and rt? pls be more specific than
join together
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1,2,4 fixed
3. It is greedy clustering, similar to open-reference OTU picking (the highest freq. feature is the center, add all metabolites close enough to the center...)
calour/ms1_experiment.py
Outdated
|
||
Returns | ||
------- | ||
calour.MS1Experiment with close metabolites joined to a single metabolite. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls follow the docstring format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
The m/z and rt of the new metabolite are the m/z and rt of the highest freq. metabolite. | ||
new feature_metadata fields: _calour_merge_number, _calour_merge_ids are added listing the number and ids of the metabolites joined for each new metabolite | ||
''' | ||
exp = self.sort_abundance(reverse=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why sorting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want the center of each cluster to be the highest frequency metabolite and to join to it all the close metabolites.
''' | ||
exp = self.sort_abundance(reverse=False) | ||
features = exp.feature_metadata | ||
features['_metabolite_group'] = np.zeros(len(features)) - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of of messing and clutter feature metadata data frame, could you modify aggregate_by_metadata
to accept a list-like (pd.series, tuple, etc.) besides a column name for group-then-apply manipulation? it is a very light change and avoids adding intermediary column to metadata data frame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean to make the field var in agg_be metadata can be a list/pd.series etc?
I think this will be confusing for the API of the function (field can be a tuple...)
adding it as a different param instead of field will also be confusing for the api...
The new column is deleted from the final result, so the user does not feel anything.
i think making the API clean is more important than making the code clean...
calour/ms1_experiment.py
Outdated
mzdist = np.abs(features['MZ'] - cfeature['MZ']) | ||
rtdist = np.abs(features['RT'] - cfeature['RT']) | ||
ok = np.logical_and(mzdist <= mz_tolerance, rtdist <= rt_tolerance) | ||
ok = np.logical_and(ok, features['_metabolite_group'] == -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use a & b & c
for multiple logic and
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool. fixed
calour/ms1_experiment.py
Outdated
Returns | ||
------- | ||
calour.MS1Experiment | ||
Sorted according to m/z and retention time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls follow the format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
calour.MS1Experiment | ||
Sorted according to m/z and retention time | ||
''' | ||
return self.sort_by_metadata('mz_rt', axis='f', inplace=inplace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't understand why we need this. this function add more code maintainance burden but doesn't provide any benefits beyond sort_by_metadata
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it helps make the analysis notebook more easy to understand. Important for non-expert calour users.
similar to sort_samples()
if len(mz) != len(rt): | ||
raise ValueError('mz and rt must have same length') | ||
|
||
for cmz, crt in zip(mz, rt): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mz and rt should be matched if they are list-like? pls document this in the docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
calour/ms1_experiment.py
Outdated
bothok = np.logical_and(keepmz, keeprt) | ||
if bothok.sum() == 0: | ||
notfound += 1 | ||
keep = keep.union(set(np.where(bothok)[0])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as on this for loop, could you use boolean mask array (as I mentioned in your original code of this PR)? you can take a look at select
variable in filtering.py. It will make the code cleaner and a little more efficient?
super().heatmap(*args, **kwargs) | ||
|
||
def __repr__(self): | ||
'''Return a string representation of this object.''' | ||
return 'MS1Experiment %s with %d samples, %d features' % ( | ||
self.description, self.data.shape[0], self.data.shape[1]) | ||
|
||
def get_spurious_duplicates(self, mz_tolerance=0.001, rt_tolerance=2, corr_thresh=0.8, inplace=False, negate=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function should use your new filter_mz_rt
function? you have redundant code in the 2 functions.
ok |
calour/transforming.py
Outdated
@@ -370,3 +370,17 @@ def subsample_count(exp: Experiment, total, replace=False, inplace=False, random | |||
exp.reorder([i not in drops for i in range(exp.data.shape[0])], inplace=True) | |||
exp.normalized = total | |||
return exp | |||
|
|||
|
|||
def _subsample(data, depth): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you want this in this PR or it is just an accidental commit? if so, could you add a test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops wrong branch... deleted
calour/transforming.py
Outdated
new_reads = np.random.permutation(reads)[: depth] | ||
# res = np.unique(new_reads, return_counts=True) | ||
res = np.bincount(new_reads) | ||
return res |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
res
is 1-dimensional? isn't this function supposed to accept a 2-d and also return a 2-d?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted
@RNAer ready for merge :) |
Add some utility functions to ms1experiment: