Ms1 experiment #213
```diff
@@ -131,7 +131,7 @@ def get_spurious_duplicates(self, mz_tolerance=0.001, rt_tolerance=2, corr_thres

         Returns
         -------
-        calour.MS1Experiment
+        MS1Experiment
            features filtered and ordered based on m/z and rt similarity and correlation
        '''
        features = self.feature_metadata.copy()
```
```diff
@@ -165,8 +165,8 @@ def get_spurious_duplicates(self, mz_tolerance=0.001, rt_tolerance=2, corr_thres
    def merge_similar_features(self, mz_tolerance=0.001, rt_tolerance=0.5):
        '''Merge metabolites with similar mz/rt to a single metabolite

-        metabolites are initially sorted by frequency and a greefy clustering algorithm (starting from the highest freq.) is used to join together
-        metabolites that are close in m/z and r/t.
+        Metabolites are initially sorted by frequency and a greedy clustering algorithm (starting from the highest freq.) is used to join together
+        metabolites that are close in m/z and r/t, combining them to a single metabolite with freq=sum(freq) of all metabolites in the cluster.

        Parameters
        ----------
```
```diff
@@ -177,9 +177,11 @@ def merge_similar_features(self, mz_tolerance=0.001, rt_tolerance=0.5):

        Returns
        -------
-        calour.MS1Experiment with close metabolites joined to a single metabolite.
-        The m/z and rt of the new metabolite are the m/z and rt of the highest freq. metabolite.
-        new feature_metadata fields: _calour_merge_number, _calour_merge_ids are added listing the number and ids of the metabolites joined for each new metabolite
+        MS1Experiment
+            With close metabolites joined to a single metabolite.
+            The m/z and rt of the new metabolite are the m/z and rt of the highest freq. metabolite. Frequency of the new metabolite is the sum of frequencies
+            of all joined metabolites.
+            New feature_metadata fields: _calour_merge_number, _calour_merge_ids are added listing the number and ids of the metabolites joined for each new metabolite
        '''
        exp = self.sort_abundance(reverse=False)
        features = exp.feature_metadata
```

**Comment** (on `exp = self.sort_abundance(reverse=False)`): why sorting?

**Reply:** We want the center of each cluster to be the highest frequency metabolite, and to join all the close metabolites to it.
```diff
@@ -189,8 +191,7 @@ def merge_similar_features(self, mz_tolerance=0.001, rt_tolerance=0.5):
        for cgroup, cfeature in features.iterrows():
            mzdist = np.abs(features['MZ'] - cfeature['MZ'])
            rtdist = np.abs(features['RT'] - cfeature['RT'])
-            ok = np.logical_and(mzdist <= mz_tolerance, rtdist <= rt_tolerance)
-            ok = np.logical_and(ok, features['_metabolite_group'] == -1)
+            ok = (mzdist <= mz_tolerance) & (rtdist <= rt_tolerance) & (features['_metabolite_group'] == -1)
            okpos = np.where(ok)[0]
            for cpos in okpos:
                features.iat[cpos, gpos] = cgroup
```
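To see what the combined boolean test does, here is a minimal, runnable sketch of the same greedy pass on a toy feature table (the toy data is invented; only the loop mirrors the diff):

```python
import numpy as np
import pandas as pd

# Toy feature table, assumed already sorted by decreasing frequency,
# as merge_similar_features does via sort_abundance().
features = pd.DataFrame({'MZ': [100.0001, 100.0003, 250.5],
                         'RT': [12.30, 12.31, 80.0]})
features['_metabolite_group'] = -1          # -1 means "not yet clustered"
mz_tolerance, rt_tolerance = 0.001, 0.5
gpos = features.columns.get_loc('_metabolite_group')

# Greedy pass: each still-unassigned feature becomes a cluster center and
# grabs every unassigned feature within both tolerances (including itself).
for cgroup, cfeature in features.iterrows():
    mzdist = np.abs(features['MZ'] - cfeature['MZ'])
    rtdist = np.abs(features['RT'] - cfeature['RT'])
    ok = (mzdist <= mz_tolerance) & (rtdist <= rt_tolerance) & (features['_metabolite_group'] == -1)
    for cpos in np.where(ok)[0]:
        features.iat[cpos, gpos] = cgroup

print(features['_metabolite_group'].tolist())   # -> [0, 0, 2]: the first two features merge
```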
```diff
@@ -204,6 +205,7 @@ def filter_mz_rt(self, mz=None, rt=None, mz_tolerance=0.05, rt_tolerance=0.2, in

        Keep (or remove if negate=True) metabolites that have an m/z and/or retention time close (up to tolerance)
+        to the requested mz and/or rt (or list of mz and/or rt).

        If both mz and rt are provided, they should be matched (i.e. filtering is performed using each mz and rt pair with same index)

        Parameters
        ----------
```

**Comment** (on the added line): requested -> input
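A hypothetical call illustrating the matched-pairs behavior described above (the mz/rt values are invented; `exp` is assumed to be an existing MS1Experiment):

```python
# Keep metabolites close to (mz=300.1, rt=5.2) OR (mz=400.2, rt=7.8);
# mz[i] is always paired with rt[i], the two lists are not crossed.
filtered = exp.filter_mz_rt(mz=[300.1, 400.2], rt=[5.2, 7.8],
                            mz_tolerance=0.05, rt_tolerance=0.2)

# Remove everything near mz=300.1 regardless of retention time:
cleaned = exp.filter_mz_rt(mz=300.1, negate=True)
```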
```diff
@@ -225,7 +227,7 @@ def filter_mz_rt(self, mz=None, rt=None, mz_tolerance=0.05, rt_tolerance=0.2, in

        Returns
        -------
-        calour.MS1Experiment
+        MS1Experiment
            features filtered based on mz
        '''
        if mz is None and rt is None:
```
```diff
@@ -241,7 +243,7 @@ def filter_mz_rt(self, mz=None, rt=None, mz_tolerance=0.05, rt_tolerance=0.2, in
        else:
            rt = _to_list(rt)

-        keep = set()
+        select = np.zeros(len(self.feature_metadata), dtype='?')
        notfound = 0
        if mz is None:
            mz = [None] * len(rt)
```
```diff
@@ -264,12 +266,12 @@ def filter_mz_rt(self, mz=None, rt=None, mz_tolerance=0.05, rt_tolerance=0.2, in
            bothok = np.logical_and(keepmz, keeprt)
            if bothok.sum() == 0:
                notfound += 1
-            keep = keep.union(set(np.where(bothok)[0]))
+            select = np.logical_or(select, bothok)

-        logger.info('total from mz/rt list not found: %d' % notfound)
+        logger.info('Total from mz/rt list with no match: %d' % notfound)
        if negate:
-            keep = set(np.arange(len(self.feature_metadata))).difference(keep)
-        return self.reorder(sorted(list(keep)), axis='f', inplace=inplace)
+            select = np.logical_not(select)
+        return self.reorder(select, axis='f', inplace=inplace)

    def sort_mz_rt(self, inplace=False):
        '''Sort features according to m/z and retention time.
```
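The refactor above replaces a Python set of kept indices with a numpy boolean mask, which `reorder` can consume directly and which turns `negate` into a single `logical_not`. A toy comparison of the two approaches (made-up masks, not calour code):

```python
import numpy as np

n_features = 6
bothok_per_query = [np.array([True, False, True, False, False, False]),
                    np.array([False, False, False, True, False, False])]

# Old approach: accumulate matching positions in a set.
keep = set()
for bothok in bothok_per_query:
    keep = keep.union(set(np.where(bothok)[0]))

# New approach: OR the boolean masks together.
select = np.zeros(n_features, dtype='?')   # '?' is the numpy bool dtype
for bothok in bothok_per_query:
    select = np.logical_or(select, bothok)

assert sorted(keep) == list(np.where(select)[0])   # same features selected
# negate becomes a one-liner instead of a set difference:
assert set(range(n_features)) - keep == set(np.where(np.logical_not(select))[0])
```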
```diff
@@ -283,7 +285,7 @@ def sort_mz_rt(self, inplace=False):

        Returns
        -------
-        calour.MS1Experiment
-            Sorted according to m/z and retention time
+        MS1Experiment
+            Sorted according to m/z and retention time
        '''
        return self.sort_by_metadata('mz_rt', axis='f', inplace=inplace)
```
**Comment** (on `sort_mz_rt`): i don't understand why we need this. this function adds more code maintenance burden but doesn't provide any benefits beyond `sort_by_metadata`.

**Reply:** I think it helps make the analysis notebook easier to understand. Important for non-expert calour users.
```diff
@@ -370,3 +370,17 @@ def subsample_count(exp: Experiment, total, replace=False, inplace=False, random
    exp.reorder([i not in drops for i in range(exp.data.shape[0])], inplace=True)
    exp.normalized = total
    return exp
+
+
+def _subsample(data, depth):
+    for csamp in range(data.shape[0]):
+        totreads = data[csamp, :].sum()
+        reads = np.zeros([totreads], dtype=int)
+        cpos = 0
+        for idx in range(data.shape[1]):
+            reads[cpos:cpos + data[csamp, idx]] = idx
+            cpos += data[csamp, idx]
+        new_reads = np.random.permutation(reads)[: depth]
+        # res = np.unique(new_reads, return_counts=True)
+        res = np.bincount(new_reads)
+    return res
```

**Comment** (on `_subsample`): do you want this in this PR or is it just an accidental commit? if so, could you add a test?

**Reply:** oops, wrong branch... deleted.
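Even though the helper was dropped, two issues in it are worth noting for whichever branch it lands on: `return res` sits outside the loop, so only the last sample's counts survive, and `np.bincount` without `minlength` can return fewer bins than `data.shape[1]`. A corrected sketch (hypothetical, not calour code):

```python
import numpy as np

def _subsample(data, depth, seed=None):
    '''Rarefy every row of an integer count matrix to `depth` reads,
    sampled without replacement. Returns a new matrix of the same shape.'''
    rng = np.random.default_rng(seed)
    out = np.zeros_like(data)
    for csamp in range(data.shape[0]):
        # Expand counts to one entry per read, e.g. [2, 0, 1] -> [0, 0, 2].
        reads = np.repeat(np.arange(data.shape[1]), data[csamp, :])
        new_reads = rng.permutation(reads)[:depth]
        # minlength keeps the feature axis aligned even when the
        # trailing features draw zero reads.
        out[csamp, :] = np.bincount(new_reads, minlength=data.shape[1])
    return out
```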
**Comment** (on `get_spurious_duplicates`): what do you do here? remove (semi-)duplicate features? or combine them? you need a better function name!

**Reply:** any ideas? this is the best name I could come up with :)

**Comment:** so these are spurious duplicate features? if so, how about get_spurious_duplicates? what's the workflow here? what do you do after you get these features?

**Reply:** feature selection in MS is complicated... and lots of parameters. The workflow is that after you set your parameters and do feature selection, you load the resulting table and also look at it using get_spurious_duplicates(). If your feature selection is too sensitive, you get lots of metabolites with similar mz/rt that show anti-correlation or correlation (depending on the exact problem). Then, if you get too many, you go back and redo feature selection with different params... (but you will always get some bad features - it's a sensitivity/specificity trade-off... it depends on how many and how severe, and how it affects your downstream analysis)
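Putting that workflow into code, roughly (a sketch, assuming `exp` is an MS1Experiment already loaded from the feature-selection output table; parameter values are the defaults from this PR):

```python
# 1. Inspect suspicious features: near-identical m/z and RT pairs,
#    ordered so correlated/anti-correlated duplicates stand out.
suspects = exp.get_spurious_duplicates(mz_tolerance=0.001, rt_tolerance=2)
# ... if there are many, go back and redo feature selection with stricter params ...

# 2. Once the table looks reasonable, collapse the remaining near-duplicates.
merged = exp.merge_similar_features(mz_tolerance=0.001, rt_tolerance=0.5)

# 3. Optionally order features for easier browsing in the analysis notebook.
merged = merged.sort_mz_rt()
```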