-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ms1 experiment #213
Ms1 experiment #213
Changes from 2 commits
a3fd298
a923f53
9d94cef
a437253
9736fb8
87a664e
2c0f99a
df37c5d
97d6530
28dea54
eac3b60
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -106,42 +106,7 @@ def __repr__(self): | |
return 'MS1Experiment %s with %d samples, %d features' % ( | ||
self.description, self.data.shape[0], self.data.shape[1]) | ||
|
||
def filter_mz(self, mz, tolerance=0.001, inplace=False, negate=False): | ||
'''Filter metabolites based on m/z | ||
|
||
Keep (or remove if negate=True) metabolites that have an m/z close (up to tolerance) | ||
to the requested mz (or list of mz). | ||
|
||
Parameters | ||
---------- | ||
mz: float or list of float | ||
the M/Z to filter | ||
tolerance: float, optional | ||
the M/Z tolerance. filter metabolites with abs(metabolite_mz - mz) <= tolerance | ||
inplace: bool, optional | ||
True to replace current experiment, False to create new experiment with results | ||
negate: bool, optional | ||
If False, keep only metabolites matching mz | ||
If True, remove metabolites matching mz | ||
|
||
Returns | ||
------- | ||
calour.MS1Experiment | ||
features filtered based on mz | ||
''' | ||
if 'MZ' not in self.feature_metadata.columns: | ||
raise ValueError('The Experiment does not contain the column "MZ". cannot filter by mz') | ||
mz = _to_list(mz) | ||
keep = set() | ||
for cmz in mz: | ||
mzdiff = np.abs(self.feature_metadata['MZ'] - cmz) | ||
keep = keep.union(set(np.where(mzdiff <= tolerance)[0])) | ||
if negate: | ||
keep = set(np.arange(len(self.feature_metadata))).difference(keep) | ||
logger.info('after filtering %d features remaining' % len(keep)) | ||
return self.reorder(sorted(list(keep)), axis='f', inplace=inplace) | ||
|
||
def get_bad_features(self, mz_tolerance=0.001, rt_tolerance=2, corr_thresh=0.8, inplace=False, negate=False): | ||
def get_spurious_duplicates(self, mz_tolerance=0.001, rt_tolerance=2, corr_thresh=0.8, inplace=False, negate=False): | ||
'''Get subgroups of metabolites that are suspected ms1 alignment artifacts. | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can describe how the filtering is performed? users have to look at code to understand what it is doing. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed. is that clear enough now? |
||
The function returns a calour.MS1Experiment with groups of metabolites that (within each group) have similar m/z and rt, and are highly | ||
|
@@ -196,3 +161,129 @@ def get_bad_features(self, mz_tolerance=0.001, rt_tolerance=2, corr_thresh=0.8, | |
keep_features.extend(ckeep) | ||
features.drop(index=ckeep, inplace=True) | ||
return self.filter_ids(keep_features, negate=negate, inplace=inplace) | ||
|
||
def merge_similar_features(self, mz_tolerance=0.001, rt_tolerance=0.5): | ||
'''Merge metabolites with similar mz/rt to a single metabolite | ||
|
||
metabolites are initially sorted by frequency and a greefy clustering algorithm (starting from the highest freq.) is used to join together | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 1,2,4 fixed |
||
metabolites that are close in m/z and r/t. | ||
|
||
Parameters | ||
---------- | ||
mz_tolerance: float, optional | ||
metabolites with abs(metabolite_mz - mz) <= mz_tolerance are joined | ||
rt_tolerance: float, optional | ||
metabolites with abs(metabolite_rt - rt) <= rt_tolerance are joined | ||
|
||
Returns | ||
------- | ||
calour.MS1Experiment with close metabolites joined to a single metabolite. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. pls follow the docstring format. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed |
||
The m/z and rt of the new metabolite are the m/z and rt of the highest freq. metabolite. | ||
new feature_metadata fields: _calour_merge_number, _calour_merge_ids are added listing the number and ids of the metabolites joined for each new metabolite | ||
''' | ||
exp = self.sort_abundance(reverse=False) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why sorting? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We want the center of each cluster to be the highest frequency metabolite and to join to it all the close metabolites. |
||
features = exp.feature_metadata | ||
features['_metabolite_group'] = np.zeros(len(features)) - 1 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. instead of of messing and clutter feature metadata data frame, could you modify There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do you mean to make the field var in agg_be metadata can be a list/pd.series etc? |
||
gpos = list(features.columns).index('_metabolite_group') | ||
cgroup = 0 | ||
for cgroup, cfeature in features.iterrows(): | ||
mzdist = np.abs(features['MZ'] - cfeature['MZ']) | ||
rtdist = np.abs(features['RT'] - cfeature['RT']) | ||
ok = np.logical_and(mzdist <= mz_tolerance, rtdist <= rt_tolerance) | ||
ok = np.logical_and(ok, features['_metabolite_group'] == -1) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you can use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. cool. fixed |
||
okpos = np.where(ok)[0] | ||
for cpos in okpos: | ||
features.iat[cpos, gpos] = cgroup | ||
exp = exp.aggregate_by_metadata('_metabolite_group', agg='sum', axis='f') | ||
exp.feature_metadata.drop('_metabolite_group', axis='columns', inplace=True) | ||
logger.info('%d metabolites remaining after merge' % len(exp.feature_metadata)) | ||
return exp | ||
|
||
def filter_mz_rt(self, mz=None, rt=None, mz_tolerance=0.05, rt_tolerance=0.2, inplace=False, negate=False): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. by the way, we should consistent with naming - is this more appropriate to name it "filter_by_mz_rt"? |
||
'''Filter metabolites based on m/z and/or retention time | ||
|
||
Keep (or remove if negate=True) metabolites that have an m/z and/or retention time close (up to tolerance) | ||
to the requested mz and/or rt (or list of mz and/or rt). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. requested -> input |
||
|
||
Parameters | ||
---------- | ||
mz: float or list of float or None, optional | ||
the M/Z to filter | ||
if None, do not filter based on M/Z | ||
rt: float or list of float or None, optional | ||
the retention time to filter | ||
if None, do not filter based on rt | ||
mz_tolerance: float, optional | ||
the M/Z tolerance. filter metabolites with abs(metabolite_mz - mz) <= mz_tolerance | ||
rt_tolerance: float, optional | ||
the rt tolerance. filter metabolites with abs(metabolite_rt - rt) <= rt_tolerance | ||
inplace: bool, optional | ||
True to replace current experiment, False to create new experiment with results | ||
negate: bool, optional | ||
If False, keep only metabolites matching mz | ||
If True, remove metabolites matching mz | ||
|
||
Returns | ||
------- | ||
calour.MS1Experiment | ||
features filtered based on mz | ||
''' | ||
if mz is None and rt is None: | ||
raise ValueError('at least one of "mz" and "rt" must not be None') | ||
if mz is not None: | ||
if 'MZ' not in self.feature_metadata.columns: | ||
raise ValueError('The Experiment does not contain the column "MZ". cannot filter by mz') | ||
else: | ||
mz = _to_list(mz) | ||
if rt is not None: | ||
if 'RT' not in self.feature_metadata.columns: | ||
raise ValueError('The Experiment does not contain the column "RT". cannot filter by rt') | ||
else: | ||
rt = _to_list(rt) | ||
|
||
keep = set() | ||
notfound = 0 | ||
if mz is None: | ||
mz = [None] * len(rt) | ||
if rt is None: | ||
rt = [None] * len(mz) | ||
if len(mz) != len(rt): | ||
raise ValueError('mz and rt must have same length') | ||
|
||
for cmz, crt in zip(mz, rt): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. mz and rt should be matched if they are list-like? pls document this in the docstring. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed |
||
if cmz is not None: | ||
mzdiff = np.abs(self.feature_metadata['MZ'] - cmz) | ||
keepmz = mzdiff <= mz_tolerance | ||
else: | ||
keepmz = np.full([len(self.feature_metadata)], True) | ||
if crt is not None: | ||
rtdiff = np.abs(self.feature_metadata['RT'] - crt) | ||
keeprt = rtdiff <= rt_tolerance | ||
else: | ||
keeprt = np.full([len(self.feature_metadata)], True) | ||
bothok = np.logical_and(keepmz, keeprt) | ||
if bothok.sum() == 0: | ||
notfound += 1 | ||
keep = keep.union(set(np.where(bothok)[0])) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. as on this for loop, could you use boolean mask array (as I mentioned in your original code of this PR)? you can take a look at |
||
|
||
logger.info('total from mz/rt list not found: %d' % notfound) | ||
if negate: | ||
keep = set(np.arange(len(self.feature_metadata))).difference(keep) | ||
return self.reorder(sorted(list(keep)), axis='f', inplace=inplace) | ||
|
||
def sort_mz_rt(self, inplace=False): | ||
'''Sort features according to m/z and retention time. | ||
|
||
This is a convenience function wrapping calour.sort_by_metadata() | ||
|
||
Parameters | ||
---------- | ||
inplace: bool, optional | ||
True to replace current experiment, False to create new experiment with results | ||
|
||
Returns | ||
------- | ||
calour.MS1Experiment | ||
Sorted according to m/z and retention time | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. pls follow the format There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed |
||
''' | ||
return self.sort_by_metadata('mz_rt', axis='f', inplace=inplace) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i don't understand why we need this. this function add more code maintainance burden but doesn't provide any benefits beyond There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it helps make the analysis notebook more easy to understand. Important for non-expert calour users. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function should use your new
filter_mz_rt
function? you have redundant code in the 2 functions.