-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve xml parsing for comet and msgfplus #11
Conversation
spec_records.append(psm_level_dict) | ||
return pd.DataFrame(spec_records) | ||
|
||
def _peptide_lookup(entry, entry_tag, sequence, modifications, peptide_lookup): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally verbs in function name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually - must start with as defined in our coding standards - now valentin - you have never seen those so sorry :) :-P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i surely will have to take another look at my docstrings...
} | ||
if len(attribs["modifications"]) != 0: | ||
for mod in attribs["modifications"]: | ||
monoisotopicMassDelta = mod["monoisotopicMassDelta"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8
f"{modification_mass_map[child.attrib['monoisotopicMassDelta']]}:{child.attrib['location']}" | ||
) | ||
lookup[id]["modifications"] = ";".join(lookup[id]["modifications"]) | ||
for pep_sequence, attribs in peptide_lookup.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe worth looking into getting mass_to_mod maps before reading mods from xml.
would allow us to assemble the mods correctly during the xml iteration and could be done as string instead of list.
unsure though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i will look into doing a separate xml iteration for that purpose.
won't work in one iteration, since get_peptide_lookup takes place before get_modifciation_mass_map.
mod_name = "" | ||
fixed_mods = {} | ||
|
||
for i in results: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
entry should be element and it should be for element in (cv_param, search_modification)
imho
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all comments in comet can probably be applied to msgf plus aswell
elif entry_tag == (f"{element_tag_prefix}AnalysisSoftware"): | ||
version = "comet_" + "_".join( | ||
re.findall(r"([/d]*\d+)", entry.attrib["version"]) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you do not expect to find multiple versions then re.search might be more appropriate than final
if i in mapping_dict: | ||
spec_results.update({mapping_dict[i]: entry.attrib[i]}) | ||
spec_ident_items.append(spec_results) | ||
spec_results = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you update in the loop above just to reset it here? hmmm ... not sure I get it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe I need to look at the full file without the comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well yes the thing is, there can be multiple SpectrumIdentificationItem - so i need to reset spec_results for the next one.
Also there are some general cvParams under SpectrumIdentificationResult, which belong to all SpectrumIdentificationItems (see for loop line 184). I reused the variable spec_results for these cvParam aswell.
@@ -110,22 +171,9 @@ def check_parser_compatibility(cls, file): | |||
contains_engine = "Comet" in head | |||
return is_mzid and contains_engine | |||
|
|||
def _map_mods_and_sequences(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cannot comment on line 105 but no need for file.as_posix()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check on whether or not that file is a comet file should
a) be done on the headers only (no need to iter 10 lines)
b) if iter over the file use for line in f - no need to call next
c) no need to reset the head variable in python as it exists only in scope and is recreated every time
I would suggest to use
with open(file) as i:
header = i.readline()
if "Comet:EValue" in header
to be perfectly explicit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont think the header is in the first line (xml yada yada), and the try except is in place because you would wanna avoid crashes in empty files (e.g. a 0 PSM MS Amanda file)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as posix is still in there :)
with open(file.as_posix()) as f:
will crash on windows - why add as_posix to path lib object? Also Doc string say file is str so str.as_posix won't work at all :)
f"{modification_mass_map[monoisotopicMassDelta]}:{location}" | ||
) | ||
lookup[pep_sequence]["modifications"] = ";".join( | ||
lookup[pep_sequence]["modifications"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might need sorting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes and no, would be nice but is resorted during clean up anyways
|
||
# TODO: check mod left strip | ||
seq_mods = pd.DataFrame(self.df["sequence"].map(lookup).to_list()) | ||
self.df.loc[:, "modifications"] = ( | ||
seq_mods["modifications"].str.cat(fixed_mod_strings, sep=";").str.strip(";") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that feels awkward ... isn't it just seq_mods["modifications"].str + fixed_mod_strings
also sorting missing, I guess - maybe comes later in a general function, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i didn't really find a better way to do this.
seq_mods["modifications"].str + fixed_mod_strings doesn't do it.
and yes, i always tried to avoid sorting, when it gets sorted later anyways.
spec_records, | ||
modification_mass_map, | ||
fixed_mods, | ||
) = _iterator_xml(self.input_file, self.mapping_dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better name required :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true! changed it to get_xml_data()
return pd.DataFrame(spec_records) | ||
peptide_lookup[entry.attrib["id"]].update(sequence) | ||
cv_param_modifications = "" | ||
sequence = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to reset - I feel this looks partial very similar to comet parser (dooo - both xml, right ;)) but maybe worth refactoring to avoid code duplication. a mantra worth singing every morning - "Never copy code" :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:) good one for the first one ! :)
from tqdm import tqdm | ||
|
||
import xml.etree.ElementTree as etree | ||
import warnings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use loguru warnings
def get_modification_mass_map( | ||
entry, entry_tag, modification_mass_map, mod_name, fixed_mods | ||
): | ||
"""Take one entry at a time to return Modification name with massDelta. Also check if Modification is fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure why modification is capitalized sometimes
Returns: | ||
modification_mass_map (dict): contains one more modification | ||
mod_name (str): contains the name of the Modification | ||
fixed_mods (dict): contains mods where fixedMod = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
contains fixed modifications
|
||
Returns: | ||
modification_mass_map (dict): contains one more modification |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
First of all, thanks for the valuable feedback from all of you! I try to implement all suggestions and improvements as best as i can :) I will have another look at my docstrings before the next commit, but the code itself should be nearly finished... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering how the format of the mzIdentML files is defined? Is it bound to the respective output formats of our engines? Just saw that the comet parser is made for version 1.2, while the msgfplus is made for 1.1. That being said - could an existing msgfplus engine report a .mzid in version 1.2 format? Or the other way around - could a newer version of msgfplus use the 1.2 format, where we then could re-use the comet parser for?
@@ -110,22 +171,9 @@ def check_parser_compatibility(cls, file): | |||
contains_engine = "Comet" in head | |||
return is_mzid and contains_engine | |||
|
|||
def _map_mods_and_sequences(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as posix is still in there :)
with open(file.as_posix()) as f:
will crash on windows - why add as_posix to path lib object? Also Doc string say file is str so str.as_posix won't work at all :)
|
||
Operations are performed inplace. | ||
Returns: | ||
version (str): file version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
function does not return version as str but None if elif entry_tag.endswith("AnalysisSoftware"):
is not True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also say dont put return in conditionals but set a value in the conditional and return it in the end.
You can always set None as default in the beginning if this is desired behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've now set it up exactly as you described.
Returning None is actually never desired - should always return the version, nothing else...
mass_name = self.mod_mass_map[mass] | ||
modifications += mass_name + ":" + location + ";" | ||
elif entry_tag.endswith("Peptide"): | ||
modifications = modifications.rstrip(";").lstrip(";") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if modifications is a list and one uses ";".join(modifications) at the end (i.e. l.137), the rstrip, lstrip, + ";" elements can be spared (readability 💯 )
"value" | ||
] | ||
} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why create a dict to update a dict? Why not set it directly ?
if attribute in self.mapping_dict.keys(): | ||
spec_results.update( | ||
{self.mapping_dict[attribute]: entry.attrib[attribute]} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above
something like that maybe:
_key = self.mapping_dict[attribute]
spec_results[_key] = entry.attrib[attribute]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed all update calls where this was possible to exactly this solution :)
np.cumsum(list(map(len, l[:-1]))) + range(1, len(l)) | ||
).astype(str) | ||
] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neary copy-paste from omssa_2_1_9:translate_mods - refactor?
peptide_lookup[entry.attrib["id"]] = sequence | ||
peptide_lookup[entry.attrib["id"]].update( | ||
{"modifications": cv_param_modifications.rstrip(";")} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above - why create a dict with one value :)
if entry.attrib["fixedMod"] == "true": | ||
fixed_mods.update({entry.attrib["residues"]: mod_name}) | ||
elif entry_tag.endswith("ModificationParams"): | ||
return fixed_mods, mod_mass_map |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd break here and then return fixed_mods and mod_mass_map at the end
eliminated all unnecessary update calls on dicts changed the file.as_posix in check_parser_compatibility
@@ -117,43 +37,126 @@ def check_parser_compatibility(cls, file): | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is file actually a str or a Path object?
If both is possible, adapt doc string and cast to Path if its a str
cleanup on xtandem
Hi there,
I changed the way msgfplus and comet parsers parse the xml_file.
Now they iterate through using iterparse.
The msgfplus parser takes 20% less time while using significantly less memory: About 10Gb on a 1Gb testfile
I would appreciate some feedback.
Best,
Valo