improve xml parsing for comet and msgfplus #11

valentin-petzold · 2022-11-03T21:32:18Z

Hi there,
I changed how the msgfplus and comet parsers read the XML file.
They now iterate through it using iterparse.
The msgfplus parser takes 20% less time while using significantly less memory: about 10 GB on a 1 GB test file.
I would appreciate some feedback.
Best,
Valo
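
For readers unfamiliar with the approach, here is a minimal sketch of streaming XML with iterparse from the standard library. The toy document, the stream_peptides helper and the un-namespaced tags are assumptions for illustration; this is not the parser code from this PR.

```python
import io
import xml.etree.ElementTree as etree

# Tiny stand-in document; real mzIdentML files are namespaced and far larger.
XML = b"""<root>
  <Peptide id="pep_1"><PeptideSequence>PEPTIDEK</PeptideSequence></Peptide>
  <Peptide id="pep_2"><PeptideSequence>ELVISK</PeptideSequence></Peptide>
</root>"""


def stream_peptides(source):
    """Hypothetical helper: yield (id, sequence) pairs one element at a time."""
    for _event, element in etree.iterparse(source, events=("end",)):
        # An "end" event means the element and all of its children are complete.
        if element.tag.endswith("Peptide"):
            yield element.attrib["id"], element.findtext("PeptideSequence")
            # Drop the text, attributes and children of the processed element.
            element.clear()


peptides = dict(stream_peptides(io.BytesIO(XML)))
```

Clearing each element after use is what keeps memory from growing with file size; building the full tree first would hold every element at once.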

pyprotista/parsers/ident/comet_2020_01_4_parser.py

tristan-ranff · 2022-11-10T11:47:44Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

-        spec_records.append(psm_level_dict)
-    return pd.DataFrame(spec_records)
+
+def _peptide_lookup(entry, entry_tag, sequence, modifications, peptide_lookup):


ideally verbs in function name

actually - must start with as defined in our coding standards - now valentin - you have never seen those so sorry :) :-P

i surely will have to take another look at my docstrings...

pyprotista/parsers/ident/comet_2020_01_4_parser.py

tristan-ranff · 2022-11-10T11:52:37Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

+            }
+            if len(attribs["modifications"]) != 0:
+                for mod in attribs["modifications"]:
+                    monoisotopicMassDelta = mod["monoisotopicMassDelta"]


tristan-ranff · 2022-11-10T11:56:32Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

-                    f"{modification_mass_map[child.attrib['monoisotopicMassDelta']]}:{child.attrib['location']}"
-                )
-            lookup[id]["modifications"] = ";".join(lookup[id]["modifications"])
+        for pep_sequence, attribs in peptide_lookup.items():


maybe worth looking into getting mass_to_mod maps before reading mods from xml.
would allow us to assemble the mods correctly during the xml iteration and could be done as string instead of list.
unsure though.

i will look into doing a separate xml iteration for that purpose.
won't work in one iteration, since get_peptide_lookup takes place before get_modification_mass_map.
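
A rough sketch of the two-pass idea discussed here: a first iteration collects the mass-to-name map, a second one assembles the modification strings directly. The toy XML, the un-namespaced tags and both helper names are assumptions, not the code that ended up in the parser.

```python
import io
import xml.etree.ElementTree as etree

# Simplified stand-in: peptides appear before the search modifications,
# which is why a single pass cannot resolve the names on the fly.
XML = b"""<root>
  <Peptide id="p1">
    <PeptideSequence>MCK</PeptideSequence>
    <Modification location="1" monoisotopicMassDelta="15.994915"/>
    <Modification location="2" monoisotopicMassDelta="57.021464"/>
  </Peptide>
  <SearchModification massDelta="15.994915" residues="M" fixedMod="false">
    <cvParam name="Oxidation"/>
  </SearchModification>
  <SearchModification massDelta="57.021464" residues="C" fixedMod="true">
    <cvParam name="Carbamidomethyl"/>
  </SearchModification>
</root>"""


def build_mass_map(source):
    """Pass 1 (hypothetical): map each mass delta string to a modification name."""
    mass_map = {}
    for _event, element in etree.iterparse(source, events=("end",)):
        if element.tag == "SearchModification":
            mass_map[element.attrib["massDelta"]] = element.find("cvParam").attrib["name"]
    return mass_map


def build_peptide_lookup(source, mass_map):
    """Pass 2 (hypothetical): assemble 'name:location' strings while iterating."""
    lookup = {}
    for _event, element in etree.iterparse(source, events=("end",)):
        if element.tag == "Peptide":
            mods = [
                f"{mass_map[mod.attrib['monoisotopicMassDelta']]}:{mod.attrib['location']}"
                for mod in element.findall("Modification")
            ]
            lookup[element.attrib["id"]] = {
                "sequence": element.findtext("PeptideSequence"),
                "modifications": ";".join(mods),
            }
            element.clear()
    return lookup


mass_map = build_mass_map(io.BytesIO(XML))
peptide_lookup = build_peptide_lookup(io.BytesIO(XML), mass_map)
```

The cost is reading the file twice; the gain is that the modification string is final as soon as the peptide is seen.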

tests/ident/test_engine_parser_comet_2020_01_04.py

tristan-ranff · 2022-11-10T11:59:41Z

tests/ident/test_engine_parser_comet_2020_01_04.py

+    mod_name = ""
+    fixed_mods = {}
+
+    for i in results:


entry should be element and it should be for element in (cv_param, search_modification) imho

tristan-ranff

all comments in comet can probably be applied to msgfplus as well

pyprotista/parsers/ident/comet_2020_01_4_parser.py

fu · 2022-11-16T07:34:54Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

+        elif entry_tag == (f"{element_tag_prefix}AnalysisSoftware"):
+            version = "comet_" + "_".join(
+                re.findall(r"([/d]*\d+)", entry.attrib["version"])
+            )


if you do not expect to find multiple versions then re.search might be more appropriate than findall
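
To make the difference concrete, here is a small comparison. The version string "2020.01 rev. 4" is an assumed example value and the search pattern is only one possible choice.

```python
import re

raw_version = "2020.01 rev. 4"  # assumed example of an AnalysisSoftware version attribute

# findall returns every non-overlapping match, here all runs of digits.
parts = re.findall(r"\d+", raw_version)
version_from_findall = "comet_" + "_".join(parts)

# search returns only the first match; capture groups spell out the expected layout.
match = re.search(r"(\d+)\.(\d+) rev\. (\d+)", raw_version)
version_from_search = "comet_" + "_".join(match.groups()) if match else None
```

With search the expected layout is explicit and an unexpected string yields None instead of a silently odd version name; findall is more forgiving but also less strict.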

pyprotista/parsers/ident/comet_2020_01_4_parser.py

fu · 2022-11-16T07:40:16Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

+            if i in mapping_dict:
+                spec_results.update({mapping_dict[i]: entry.attrib[i]})
+        spec_ident_items.append(spec_results)
+        spec_results = {}


you update in the loop above just to reset it here? hmmm ... not sure I get it

maybe I need to look at the full file without the comments

Well yes, the thing is, there can be multiple SpectrumIdentificationItem elements, so i need to reset spec_results for the next one.
Also, there are some general cvParams under SpectrumIdentificationResult which belong to all SpectrumIdentificationItems (see the for loop at line 184). I reused the variable spec_results for these cvParams as well.
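
One way to picture the structure described here, as a sketch only: each item gets a fresh dict, the shared cvParams are collected separately and merged in when the enclosing result ends. The toy XML is an assumption; in particular it has no cvParams nested inside the items, which a real file would have and which would need separate handling.

```python
import io
import xml.etree.ElementTree as etree

XML = b"""<root>
  <SpectrumIdentificationResult spectrumID="scan=1">
    <SpectrumIdentificationItem rank="1" chargeState="2"/>
    <SpectrumIdentificationItem rank="2" chargeState="2"/>
    <cvParam name="scan start time" value="12.5"/>
  </SpectrumIdentificationResult>
</root>"""

records = []  # finished rows, one per SpectrumIdentificationItem
items = []    # items of the result currently being read
shared = {}   # cvParams that belong to every item of that result

for _event, element in etree.iterparse(io.BytesIO(XML), events=("end",)):
    if element.tag == "SpectrumIdentificationItem":
        # A new dict per item, so there is nothing to reset afterwards.
        items.append(dict(element.attrib))
    elif element.tag == "cvParam":
        shared[element.attrib["name"]] = element.attrib["value"]
    elif element.tag == "SpectrumIdentificationResult":
        # The result is complete: attach the shared values to each of its items.
        records.extend([{**item, **shared} for item in items])
        items, shared = [], {}
        element.clear()
```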

fu · 2022-11-16T07:48:49Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

@@ -110,22 +171,9 @@ def check_parser_compatibility(cls, file):
        contains_engine = "Comet" in head
        return is_mzid and contains_engine

-    def _map_mods_and_sequences(self):


cannot comment on line 105 but no need for file.as_posix()

The check on whether or not that file is a comet file should
a) be done on the headers only (no need to iter 10 lines)
b) if iter over the file use for line in f - no need to call next
c) no need to reset the head variable in python as it exists only in scope and is recreated every time

I would suggest to use

    with open(file) as i:
        header = i.readline()
        if "Comet:EValue" in header

to be perfectly explicit

i don't think the header is in the first line (xml yada yada), and the try/except is in place because you would want to avoid crashes on empty files (e.g. a 0 PSM MS Amanda file)

as_posix is still in there :)
with open(file.as_posix()) as f: will crash on windows - why add as_posix to a pathlib object? Also the docstring says file is a str, so str.as_posix won't work at all :)
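
The points raised in this thread (read only the head of the file, avoid calling next, survive empty files, no as_posix) could be sketched roughly as follows. looks_like_comet_mzid and the ten-line limit are hypothetical; this is not the actual check_parser_compatibility.

```python
import tempfile
from itertools import islice
from pathlib import Path


def looks_like_comet_mzid(file, n_lines=10):
    """Hypothetical check: do the first n_lines mention mzIdentML and Comet?"""
    with open(file) as handle:  # open() accepts both str and Path
        # islice simply stops on short or empty files, so no next() and no StopIteration.
        head = "".join(islice(handle, n_lines))
    return "mzidentml" in head.lower() and "Comet" in head


with tempfile.TemporaryDirectory() as tmp:
    comet_file = Path(tmp) / "comet.mzid"
    comet_file.write_text('<?xml version="1.0"?>\n<MzIdentML>\n<AnalysisSoftware name="Comet"/>\n')
    empty_file = Path(tmp) / "empty.mzid"
    empty_file.write_text("")
    comet_ok = looks_like_comet_mzid(comet_file)       # Path input
    empty_ok = looks_like_comet_mzid(str(empty_file))  # str input, empty file
```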

pyprotista/parsers/ident/comet_2020_01_4_parser.py

fu · 2022-11-16T07:56:29Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

+                        f"{modification_mass_map[monoisotopicMassDelta]}:{location}"
+                    )
+            lookup[pep_sequence]["modifications"] = ";".join(
+                lookup[pep_sequence]["modifications"]


might need sorting?

yes and no, would be nice but is resorted during clean up anyways

fu · 2022-11-16T07:58:40Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py


-        # TODO: check mod left strip
        seq_mods = pd.DataFrame(self.df["sequence"].map(lookup).to_list())
        self.df.loc[:, "modifications"] = (
            seq_mods["modifications"].str.cat(fixed_mod_strings, sep=";").str.strip(";")


that feels awkward ... isn't it just seq_mods["modifications"].str + fixed_mod_strings

also sorting missing, I guess - maybe comes later in a general function, right?

i didn't really find a better way to do this.
seq_mods["modifications"].str + fixed_mod_strings doesn't do it.

and yes, i always tried to avoid sorting, when it gets sorted later anyways.

fu · 2022-11-16T07:59:23Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

+            spec_records,
+            modification_mass_map,
+            fixed_mods,
+        ) = _iterator_xml(self.input_file, self.mapping_dict)


better name required :)

true! changed it to get_xml_data()

fu · 2022-11-16T08:01:47Z

pyprotista/parsers/ident/msgfplus_2021_03_22_parser.py

-    return pd.DataFrame(spec_records)
+        peptide_lookup[entry.attrib["id"]].update(sequence)
+        cv_param_modifications = ""
+        sequence = ""


no need to reset - I feel this looks partially very similar to the comet parser (dooo - both xml, right ;)) but maybe worth refactoring to avoid code duplication. a mantra worth singing every morning - "Never copy code" :).

tests/ident/test_engine_parser_comet_2020_01_04.py

fu

:) good one for the first one ! :)


tristan-ranff · 2022-11-21T12:35:51Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

-from tqdm import tqdm
-
+import xml.etree.ElementTree as etree
+import warnings


use loguru warnings

pyprotista/parsers/ident/comet_2020_01_4_parser.py

tristan-ranff · 2022-11-21T12:37:25Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

+def get_modification_mass_map(
+    entry, entry_tag, modification_mass_map, mod_name, fixed_mods
+):
+    """Take one entry at a time to return Modification name with massDelta. Also check if Modification is fixed.


not sure why modification is capitalized sometimes

tristan-ranff · 2022-11-21T12:37:51Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

+    Returns:
+        modification_mass_map (dict): contains one more modification
+        mod_name (str): contains the name of the Modification
+        fixed_mods (dict): contains mods where fixedMod = true


contains fixed modifications

tristan-ranff · 2022-11-21T12:38:00Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py


+    Returns:
+        modification_mass_map (dict): contains one more modification


valentin-petzold · 2022-11-21T14:08:30Z

First of all, thanks to all of you for the valuable feedback! I'll try to implement all suggestions and improvements as best as i can :)

I will have another look at my docstrings before the next commit, but the code itself should be nearly finished...
Again, if you have any suggestions, i am happy to receive your feedback :)


ArtiVlasov

I was wondering how the format of the mzIdentML files is defined. Is it bound to the respective output formats of our engines? I just saw that the comet parser is made for version 1.2, while the msgfplus parser is made for 1.1. That being said, could an existing msgfplus engine report a .mzid in version 1.2 format? Or the other way around: could a newer version of msgfplus use the 1.2 format, so that we could then re-use the comet parser for it?

fu · 2022-12-19T10:17:10Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

@@ -110,22 +171,9 @@ def check_parser_compatibility(cls, file):
        contains_engine = "Comet" in head
        return is_mzid and contains_engine

-    def _map_mods_and_sequences(self):


as_posix is still in there :)
with open(file.as_posix()) as f: will crash on windows - why add as_posix to a pathlib object? Also the docstring says file is a str, so str.as_posix won't work at all :)

fu · 2022-12-19T10:19:59Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py


-        Operations are performed inplace.
+        Returns:
+            version (str): file version


the function does not always return version as a str: it returns None if elif entry_tag.endswith("AnalysisSoftware"): is never True

I'd also say don't put return in conditionals, but set a value in the conditional and return it at the end.
You can always set None as the default at the beginning if this is the desired behavior

I've now set it up exactly as you described.
Returning None is actually never desired - should always return the version, nothing else...
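
A minimal sketch of the pattern agreed on here: set a default, assign inside the conditional, break, and return once at the end. The get_version helper and the choice to raise ValueError when nothing is found are assumptions, based only on the remark that None is never desired.

```python
import io
import xml.etree.ElementTree as etree


def get_version(source):
    """Hypothetical helper: return the version attribute of AnalysisSoftware."""
    version = None  # explicit default
    for _event, element in etree.iterparse(source, events=("end",)):
        if element.tag.endswith("AnalysisSoftware"):
            version = element.attrib["version"]
            break  # stop reading; the single return is below
    if version is None:
        # Assumed policy: fail loudly instead of handing None to the caller.
        raise ValueError("no AnalysisSoftware element found")
    return version


found = get_version(io.BytesIO(b'<root><AnalysisSoftware version="2020.01 rev. 4"/></root>'))

try:
    get_version(io.BytesIO(b"<root><Other/></root>"))
    missing_raises = False
except ValueError:
    missing_raises = True
```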

fu · 2022-12-19T10:23:44Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

+                    mass_name = self.mod_mass_map[mass]
+                    modifications += mass_name + ":" + location + ";"
+                elif entry_tag.endswith("Peptide"):
+                    modifications = modifications.rstrip(";").lstrip(";")


if modifications is a list and one uses ";".join(modifications) at the end (i.e. l.137), the rstrip, lstrip, + ";" elements can be spared (readability 💯 )
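
As a small illustration of this suggestion, with made-up masses and locations: collecting the parts in a list and joining once leaves no leading or trailing separator to strip.

```python
# Assumed toy values; in the parser these come from the XML and the mass map.
mod_mass_map = {"15.994915": "Oxidation", "57.021464": "Carbamidomethyl"}
observed = [("15.994915", "1"), ("57.021464", "2")]

modifications = []
for mass, location in observed:
    modifications.append(f"{mod_mass_map[mass]}:{location}")

# join only puts ";" between entries, so no rstrip/lstrip is needed.
mod_string = ";".join(modifications)
empty_string = ";".join([])  # a peptide without modifications
```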

fu · 2022-12-19T10:26:03Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

+                                    "value"
+                                ]
+                            }
+                        )


why create a dict to update a dict? Why not set it directly ?

fu · 2022-12-19T10:26:54Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

+                        if attribute in self.mapping_dict.keys():
+                            spec_results.update(
+                                {self.mapping_dict[attribute]: entry.attrib[attribute]}
+                            )


see above
something like that maybe:

    _key = self.mapping_dict[attribute]
    spec_results[_key] = entry.attrib[attribute]

I changed all update calls where this was possible to exactly this solution :)

fu · 2022-12-19T10:34:14Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

+                                np.cumsum(list(map(len, l[:-1]))) + range(1, len(l))
+                            ).astype(str)
+                        ]
+                    )


nearly copy-paste from omssa_2_1_9:translate_mods - refactor?

fu · 2022-12-19T10:40:46Z

pyprotista/parsers/ident/msgfplus_2021_03_22_parser.py

+                    peptide_lookup[entry.attrib["id"]] = sequence
+                    peptide_lookup[entry.attrib["id"]].update(
+                        {"modifications": cv_param_modifications.rstrip(";")}
+                    )


see above - why create a dict with one value :)

MKoesters · 2022-12-19T12:36:53Z

pyprotista/parsers/ident/comet_2020_01_4_parser.py

+                    if entry.attrib["fixedMod"] == "true":
+                        fixed_mods.update({entry.attrib["residues"]: mod_name})
+                elif entry_tag.endswith("ModificationParams"):
+                    return fixed_mods, mod_mass_map


I'd break here and then return fixed_mods and mod_mass_map at the end


MKoesters · 2022-12-19T13:41:48Z

pyprotista/parsers/ident/msgfplus_2021_03_22_parser.py

@@ -117,43 +37,126 @@ def check_parser_compatibility(cls, file):



Is file actually a str or a Path object?
If both are possible, adapt the docstring and cast to Path if it's a str
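
A short sketch of the cast suggested here. The as_path helper is hypothetical; the point is only that Path() accepts both a str and an existing Path, so a single call normalizes the input.

```python
from pathlib import Path


def as_path(file):
    """Hypothetical helper.

    Args:
        file (str or Path): path to the input file

    Returns:
        Path: the same location as a pathlib.Path
    """
    return Path(file)  # no-op for Path input, conversion for str input


from_str = as_path("results/sample.mzid")
from_path = as_path(Path("results/sample.mzid"))
```

After the cast, attributes such as suffix or name are available regardless of what the caller passed in.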


improve xml parsing for comet and msgfplus (2af4603)

MKoesters reviewed Nov 8, 2022: pyprotista/parsers/ident/comet_2020_01_4_parser.py

fix unknown modification mapping for comet (206e47e)

tristan-ranff reviewed Nov 10, 2022: pyprotista/parsers/ident/comet_2020_01_4_parser.py
tristan-ranff reviewed Nov 10, 2022: pyprotista/parsers/ident/comet_2020_01_4_parser.py
tristan-ranff reviewed Nov 10, 2022: tests/ident/test_engine_parser_comet_2020_01_04.py
tristan-ranff reviewed Nov 10, 2022
tristan-ranff requested changes Nov 10, 2022
tristan-ranff reviewed Nov 10, 2022: pyprotista/parsers/ident/comet_2020_01_4_parser.py

tristan-ranff requested a review from fu November 15, 2022 13:53

fu reviewed Nov 16, 2022: pyprotista/parsers/ident/comet_2020_01_4_parser.py
fu reviewed Nov 16, 2022: pyprotista/parsers/ident/comet_2020_01_4_parser.py
fu reviewed Nov 16, 2022: pyprotista/parsers/ident/comet_2020_01_4_parser.py
fu reviewed Nov 16, 2022: pyprotista/parsers/ident/comet_2020_01_4_parser.py
fu reviewed Nov 16, 2022: pyprotista/parsers/ident/comet_2020_01_4_parser.py
fu reviewed Nov 16, 2022: tests/ident/test_engine_parser_comet_2020_01_04.py
fu requested changes Nov 16, 2022

valentin-petzold added 3 commits November 20, 2022 22:20

Merge branch 'dev' of https://github.com/computational-ms/pyProtista …into dev (f62ff2e)
improve xml parsing for comet (84cb6c5)
improve xml parsing for msgfplus (81a9765)

valentin-petzold added 2 commits November 21, 2022 11:32

improve xml parsing for xtandem (cfe49e6)
pytest approx removed for strings (dacc872)

tristan-ranff reviewed Nov 21, 2022: pyprotista/parsers/ident/comet_2020_01_4_parser.py
tristan-ranff reviewed Nov 21, 2022

valentin-petzold and others added 5 commits November 24, 2022 14:44

mod_mapping now during iteration in comet (e5f7142)
lxml no longer required new docstrings
Merge branch 'dev' into dev (c7b2347)
most functions now take file as input (dafef56)
Merge remote-tracking branch 'origin/dev' into dev (882233e)
functions are now methods (6dc1dce)

tristan-ranff requested review from MKoesters, fu and tristan-ranff December 5, 2022 10:20

ArtiVlasov approved these changes Dec 5, 2022
fu approved these changes Dec 19, 2022
MKoesters reviewed Dec 19, 2022

for loops now break, to return in the end (016cce7)
eliminated all unnecessary update calls on dicts changed the file.as_posix in check_parser_compatibility

MKoesters reviewed Dec 19, 2022
MKoesters approved these changes Dec 19, 2022

changed docstring file (path object) (5d2211b)
cleanup on xtandem

fu merged commit 3917a25 into computational-ms:dev Dec 19, 2022

