Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve xml parsing for comet and msgfplus #11

Merged
merged 14 commits into from
Dec 19, 2022

Conversation

valentin-petzold
Copy link
Contributor

Hi there,
I changed the way msgfplus and comet parsers parse the xml_file.
Now they iterate through using iterparse.
The msgfplus parser takes 20% less time while using significantly less memory: About 10Gb on a 1Gb testfile
I would appreciate some feedback.
Best,
Valo

spec_records.append(psm_level_dict)
return pd.DataFrame(spec_records)

def _peptide_lookup(entry, entry_tag, sequence, modifications, peptide_lookup):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally verbs in function name

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually - must start with as defined in our coding standards - now valentin - you have never seen those so sorry :) :-P

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i surely will have to take another look at my docstrings...

}
if len(attribs["modifications"]) != 0:
for mod in attribs["modifications"]:
monoisotopicMassDelta = mod["monoisotopicMassDelta"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8

f"{modification_mass_map[child.attrib['monoisotopicMassDelta']]}:{child.attrib['location']}"
)
lookup[id]["modifications"] = ";".join(lookup[id]["modifications"])
for pep_sequence, attribs in peptide_lookup.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe worth looking into getting mass_to_mod maps before reading mods from xml.
would allow us to assemble the mods correctly during the xml iteration and could be done as string instead of list.
unsure though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will look into doing a separate xml iteration for that purpose.
won't work in one iteration, since get_peptide_lookup takes place before get_modifciation_mass_map.

mod_name = ""
fixed_mods = {}

for i in results:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

entry should be element and it should be for element in (cv_param, search_modification) imho

Copy link
Contributor

@tristan-ranff tristan-ranff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all comments in comet can probably be applied to msgf plus aswell

elif entry_tag == (f"{element_tag_prefix}AnalysisSoftware"):
version = "comet_" + "_".join(
re.findall(r"([/d]*\d+)", entry.attrib["version"])
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you do not expect to find multiple versions then re.search might be more appropriate than final

if i in mapping_dict:
spec_results.update({mapping_dict[i]: entry.attrib[i]})
spec_ident_items.append(spec_results)
spec_results = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you update in the loop above just to reset it here? hmmm ... not sure I get it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I need to look at the full file without the comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well yes the thing is, there can be multiple SpectrumIdentificationItem - so i need to reset spec_results for the next one.
Also there are some general cvParams under SpectrumIdentificationResult, which belong to all SpectrumIdentificationItems (see for loop line 184). I reused the variable spec_results for these cvParam aswell.

@@ -110,22 +171,9 @@ def check_parser_compatibility(cls, file):
contains_engine = "Comet" in head
return is_mzid and contains_engine

def _map_mods_and_sequences(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cannot comment on line 105 but no need for file.as_posix()

Copy link
Collaborator

@fu fu Nov 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check on whether or not that file is a comet file should
a) be done on the headers only (no need to iter 10 lines)
b) if iter over the file use for line in f - no need to call next
c) no need to reset the head variable in python as it exists only in scope and is recreated every time

I would suggest to use

 with open(file) as i:
      header = i.readline()
if "Comet:EValue" in header

to be perfectly explicit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont think the header is in the first line (xml yada yada), and the try except is in place because you would wanna avoid crashes in empty files (e.g. a 0 PSM MS Amanda file)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as posix is still in there :)
with open(file.as_posix()) as f: will crash on windows - why add as_posix to path lib object? Also Doc string say file is str so str.as_posix won't work at all :)

f"{modification_mass_map[monoisotopicMassDelta]}:{location}"
)
lookup[pep_sequence]["modifications"] = ";".join(
lookup[pep_sequence]["modifications"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might need sorting?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes and no, would be nice but is resorted during clean up anyways


# TODO: check mod left strip
seq_mods = pd.DataFrame(self.df["sequence"].map(lookup).to_list())
self.df.loc[:, "modifications"] = (
seq_mods["modifications"].str.cat(fixed_mod_strings, sep=";").str.strip(";")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that feels awkward ... isn't it just seq_mods["modifications"].str + fixed_mod_strings

also sorting missing, I guess - maybe comes later in a general function, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i didn't really find a better way to do this.
seq_mods["modifications"].str + fixed_mod_strings doesn't do it.

and yes, i always tried to avoid sorting, when it gets sorted later anyways.

spec_records,
modification_mass_map,
fixed_mods,
) = _iterator_xml(self.input_file, self.mapping_dict)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better name required :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true! changed it to get_xml_data()

return pd.DataFrame(spec_records)
peptide_lookup[entry.attrib["id"]].update(sequence)
cv_param_modifications = ""
sequence = ""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to reset - I feel this looks partial very similar to comet parser (dooo - both xml, right ;)) but maybe worth refactoring to avoid code duplication. a mantra worth singing every morning - "Never copy code" :).

Copy link
Collaborator

@fu fu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:) good one for the first one ! :)

from tqdm import tqdm

import xml.etree.ElementTree as etree
import warnings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use loguru warnings

def get_modification_mass_map(
entry, entry_tag, modification_mass_map, mod_name, fixed_mods
):
"""Take one entry at a time to return Modification name with massDelta. Also check if Modification is fixed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why modification is capitalized sometimes

Returns:
modification_mass_map (dict): contains one more modification
mod_name (str): contains the name of the Modification
fixed_mods (dict): contains mods where fixedMod = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

contains fixed modifications


Returns:
modification_mass_map (dict): contains one more modification
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

@valentin-petzold
Copy link
Contributor Author

First of all, thanks for the valuable feedback from all of you! I try to implement all suggestions and improvements as best as i can :)

I will have another look at my docstrings before the next commit, but the code itself should be nearly finished...
Again, if you have any suggestions, i am happy to receive your feedback :)

Copy link
Contributor

@ArtiVlasov ArtiVlasov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering how the format of the mzIdentML files is defined? Is it bound to the respective output formats of our engines? Just saw that the comet parser is made for version 1.2, while the msgfplus is made for 1.1. That being said - could an existing msgfplus engine report a .mzid in version 1.2 format? Or the other way around - could a newer version of msgfplus use the 1.2 format, where we then could re-use the comet parser for?

@@ -110,22 +171,9 @@ def check_parser_compatibility(cls, file):
contains_engine = "Comet" in head
return is_mzid and contains_engine

def _map_mods_and_sequences(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as posix is still in there :)
with open(file.as_posix()) as f: will crash on windows - why add as_posix to path lib object? Also Doc string say file is str so str.as_posix won't work at all :)


Operations are performed inplace.
Returns:
version (str): file version
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function does not return version as str but None if elif entry_tag.endswith("AnalysisSoftware"): is not True

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also say dont put return in conditionals but set a value in the conditional and return it in the end.
You can always set None as default in the beginning if this is desired behavior

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now set it up exactly as you described.
Returning None is actually never desired - should always return the version, nothing else...

mass_name = self.mod_mass_map[mass]
modifications += mass_name + ":" + location + ";"
elif entry_tag.endswith("Peptide"):
modifications = modifications.rstrip(";").lstrip(";")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if modifications is a list and one uses ";".join(modifications) at the end (i.e. l.137), the rstrip, lstrip, + ";" elements can be spared (readability 💯 )

"value"
]
}
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why create a dict to update a dict? Why not set it directly ?

if attribute in self.mapping_dict.keys():
spec_results.update(
{self.mapping_dict[attribute]: entry.attrib[attribute]}
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above
something like that maybe:

_key = self.mapping_dict[attribute]
spec_results[_key] = entry.attrib[attribute]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed all update calls where this was possible to exactly this solution :)

np.cumsum(list(map(len, l[:-1]))) + range(1, len(l))
).astype(str)
]
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neary copy-paste from omssa_2_1_9:translate_mods - refactor?

peptide_lookup[entry.attrib["id"]] = sequence
peptide_lookup[entry.attrib["id"]].update(
{"modifications": cv_param_modifications.rstrip(";")}
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above - why create a dict with one value :)

if entry.attrib["fixedMod"] == "true":
fixed_mods.update({entry.attrib["residues"]: mod_name})
elif entry_tag.endswith("ModificationParams"):
return fixed_mods, mod_mass_map
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd break here and then return fixed_mods and mod_mass_map at the end

eliminated all unnecessary update calls on dicts
changed the file.as_posix in check_parser_compatibility
@@ -117,43 +37,126 @@ def check_parser_compatibility(cls, file):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is file actually a str or a Path object?
If both is possible, adapt doc string and cast to Path if its a str

@fu fu merged commit 3917a25 into computational-ms:dev Dec 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants