[py2py3] fix unittests from WMCore/Datastructs in py3 #10562
Conversation
Making proposed changes

This is the implementation of Run sorting ([EDIT]: reflects the latest version):

def __lt__(self, rhs):
"""
Compare on run # first, then by lumis as a list is compared
"""
if self.run != rhs.run:
return self.run < rhs.run
if sorted(self.eventsPerLumi.keys()) != sorted(rhs.eventsPerLumi.keys()):
return sorted(self.eventsPerLumi.keys()) < sorted(rhs.eventsPerLumi.keys())
- return self.eventsPerLumi < rhs.eventsPerLumi # this line breaks in py3
+ for self_key, rhs_key in zip(sorted(self.eventsPerLumi), sorted(rhs.eventsPerLumi)):
+ if self.eventsPerLumi[self_key] == rhs.eventsPerLumi[rhs_key]:
+ continue
+ else:
+ return self.eventsPerLumi[self_key] < rhs.eventsPerLumi[rhs_key]
+ return False

py2 dictionaries comparisons

I paste here my findings for future reference:

import sys
print sys.version_info # sys.version_info(major=2, minor=7, micro=16, releaselevel='final', serial=0)
d0 = {1: 11, 2: 22, 3: 33}
# equal
d1 = {1: 11, 2: 22, 3: 33}
# same keys, diff content
d2 = {1: 11, 2: 22, 3: 32}
d3 = {1: 11, 2: 22, 3: 34}
d4 = {1: 11, 2: 21, 3: 33}
d5 = {1: 11, 2: 21, 3: 34}
d6 = {1: 11, 2: 23, 3: 32}
d7 = {1: 11, 2: 23, 3: 34}
# one added key OR one missing key
d8 = {1: 11, 2: 22, 3: 33, 4: 44}
d9 = {1: 11, 2: 21, 3: 33, 4: 44}
d10 = {1: 11, 2: 23, 3: 33, 4: 44}
d11 = {1: 11, 2: 22}
d12 = {1: 11, 2: 21}
d13 = {1: 11, 2: 23}
# two missing keys AND one added key
d14 = {1: 11, 4: 22}
d15 = {1: 11, 4: 21}
d16 = {1: 11, 4: 23}
# one missing key AND one added key
d17 = {1: 11, 2: 22, 4: 32}
d18 = {1: 11, 2: 22, 4: 33}
d19 = {1: 11, 2: 22, 4: 44}
d20 = {1: 11, 2: 21, 4: 44}
d21 = {1: 11, 2: 23, 4: 44}
def cmp_py2(d1):
cmppy2 = [d0 < d1, d0 <= d1, d0 == d1, d0 != d1, d0 > d1, d0 >= d1]
return cmppy2 == cmp_key(d1), cmppy2, cmp_key(d1)
def cmp_key(d1):
k0 = sorted(d0.keys())
k1 = sorted(d1.keys())
return [k0 < k1, k0 <= k1, k0 == k1, k0 != k1, k0 > k1, k0 >= k1]
print
print cmp_py2(d1) # (True, [False, True, True, False, False, True], [False, True, True, False, False, True])
print
print cmp_py2(d2) # (False, [False, False, False, True, True, True], [False, True, True, False, False, True])
print cmp_py2(d3) # (False, [True, True, False, True, False, False], [False, True, True, False, False, True])
print cmp_py2(d4) # (False, [False, False, False, True, True, True], [False, True, True, False, False, True])
print cmp_py2(d5) # (False, [False, False, False, True, True, True], [False, True, True, False, False, True])
print cmp_py2(d6) # (False, [True, True, False, True, False, False], [False, True, True, False, False, True])
print cmp_py2(d7) # (False, [True, True, False, True, False, False], [False, True, True, False, False, True])
print
print cmp_py2(d8) # (True, [True, True, False, True, False, False], [True, True, False, True, False, False])
print cmp_py2(d9) # (True, [True, True, False, True, False, False], [True, True, False, True, False, False])
print cmp_py2(d10) # (True, [True, True, False, True, False, False], [True, True, False, True, False, False])
print cmp_py2(d11) # (True, [False, False, False, True, True, True], [False, False, False, True, True, True])
print cmp_py2(d12) # (True, [False, False, False, True, True, True], [False, False, False, True, True, True])
print cmp_py2(d13) # (True, [False, False, False, True, True, True], [False, False, False, True, True, True])
print
print cmp_py2(d14) # (False, [False, False, False, True, True, True], [True, True, False, True, False, False])
print cmp_py2(d15) # (False, [False, False, False, True, True, True], [True, True, False, True, False, False])
print cmp_py2(d16) # (False, [False, False, False, True, True, True], [True, True, False, True, False, False])
print
print cmp_py2(d17) # (True, [True, True, False, True, False, False], [True, True, False, True, False, False])
print cmp_py2(d18) # (True, [True, True, False, True, False, False], [True, True, False, True, False, False])
print cmp_py2(d19) # (True, [True, True, False, True, False, False], [True, True, False, True, False, False])
print cmp_py2(d20) # (False, [False, False, False, True, True, True], [True, True, False, True, False, False])
print cmp_py2(d21) # (True, [True, True, False, True, False, False], [True, True, False, True, False, False])

from d2 to d7

In particular, we should notice that when the two dicts have the same keys but different values, the py2 comparison also looks at the values, so comparing sorted keys alone does not reproduce it (cmp_py2 returns False for d2 through d7). This is tested in the snippet above.

from d8 to d21

I explored what happens to dictionary sorting in py2 when the dictionaries have different keys. Some weird stuff happens, for example with d14, d15 and d16. However, if we consider only what happens when comparing the keys of the dictionaries, then nothing out of the ordinary comes up. This is tested in the snippet above.
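For the record, the root cause is that py2 defines a total ordering on dicts, while py3 raises TypeError when ordering them with `<`. The sketch below (my own minimal reconstruction, not code from the PR) shows the py3 failure and a key-then-value comparison along the lines of the new `__lt__`:

```python
# py3: dicts support == and !=, but not ordering comparisons.
d0 = {1: 11, 2: 22, 3: 33}
d2 = {1: 11, 2: 22, 3: 32}

try:
    d0 < d2
except TypeError as exc:
    print("py3 refuses to order dicts:", exc)

def dict_lt(lhs, rhs):
    # Compare sorted key lists first, then values in key order,
    # mimicking the cases exercised by the findings above.
    if sorted(lhs) != sorted(rhs):
        return sorted(lhs) < sorted(rhs)
    for key in sorted(lhs):
        if lhs[key] != rhs[key]:
            return lhs[key] < rhs[key]
    return False

print(dict_lt(d0, d2))  # False: at key 3, 33 is not < 32
print(dict_lt(d2, d0))  # True
```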
@mapellidario two issues here:
@vkuznet I did not make any benchmark, for a couple of reasons.

1: is this really a problem?

These are the changes to be applied here:

def __lt__(self, rhs):
if self.run != rhs.run:
return self.run < rhs.run
if sorted(self.eventsPerLumi.keys()) != sorted(rhs.eventsPerLumi.keys()):
return sorted(self.eventsPerLumi.keys()) < sorted(rhs.eventsPerLumi.keys())
- return self.eventsPerLumi < rhs.eventsPerLumi
+ for (_, self_events), (_, rhs_events) in zip(sorted(viewitems(self.eventsPerLumi)), sorted(viewitems(rhs.eventsPerLumi))):
+ if self_events == rhs_events:
+ continue
+ else:
+ return self_events < rhs_events
+ return False

What I am adding is an explicit value-by-value comparison of the two dictionaries.

I do not consider this to be terrible, since a couple of lines above we already sort the keys of the dictionaries, twice per dictionary, without even caching the result. If it were paramount to keep the CPU time low, I would expect that we at least cache the list of sorted keys. In any case, if we hit the part that I added, we have already sorted each dictionary twice, so in the worst case scenario it would be a +100% [edited] in CPU time, assuming the py2 implementation of dict comparison costs no more than those sorts.

However, how often does it happen that we compare two Runs with the same run number and the same set of lumis?

2: dmwm/WMCore priorities

Our priority is to make dmwm/WMCore work in py3, even if this comes with performance drawbacks. If you are worried about the performances, we can open an issue and have a second look at these very changes at a later stage.
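For what it's worth, once the sorted-keys check has passed, the value loop is equivalent to comparing the value sequences in key order as plain lists. A sketch of that equivalence (my own illustration, not the code that was merged):

```python
def values_lt(lhs, rhs):
    # Assumes lhs and rhs have identical key sets, which is guaranteed
    # by the sorted-keys check a few lines earlier in __lt__.
    lhs_vals = [lhs[k] for k in sorted(lhs)]
    rhs_vals = [rhs[k] for k in sorted(rhs)]
    # List comparison is lexicographic: first differing value decides.
    return lhs_vals < rhs_vals

a = {1: 10, 2: 20}
b = {1: 10, 2: 21}
print(values_lt(a, b))  # True: [10, 20] < [10, 21]
print(values_lt(b, a))  # False
```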
Valentin, you planted a seed in the back of my mind and by the end of the workday it grew into a full plant. I had to write a synthetic benchmark to measure this. The benchmark is in my branch; it compares two dictionaries ten times.
You were right, my approach is significantly slower 😬. I will go back to the drawing board and think about something new. Thanks for pointing this out!
I doubt that you'll achieve decent performance with Python itself; what you'll realize in the end is the necessity of writing a C wrapper to battle such problems. I dealt with many dict issues many years ago when DAS was written in Python. I even wrote a few C-wrappers, which you may find here, along with the associated C-wrapper. In the end, I gave up on Python and non-structured (no schema) CMS data and switched to GoLang for everything. Since WMCore deals with many dicts and there is a lot of non-structured data (quite often there is no schema in the dicts, since keys are not defined but dynamically assigned), you'll always struggle with this and other issues because of the quite large size of the dicts. And the only solution for you would be to start learning and writing C wrappers for Python functions (which is itself far from trivial). My suggestion still stands: try to create a hash of the dict and compare the hashes. To speed things up you may look at the DAS code, which provides genkey and a das_hash C-wrapper. You may generate a hash of each dict using this method and compare the two.
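A pure-Python sketch of the hash-and-compare idea (genkey and das_hash are the DAS C-wrappers mentioned above; the md5-over-sorted-JSON below is only my assumption of what such a helper does). Note that a hash only answers equality, not ordering, so it could replace the key/value equality checks but not the `<` logic:

```python
import hashlib
import json

def dict_hash(d):
    # Serialize with sorted keys so logically-equal dicts hash equally,
    # regardless of insertion order.
    payload = json.dumps(d, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()

d0 = {1: 11, 2: 22, 3: 33}
d1 = {3: 33, 2: 22, 1: 11}  # same content, different insertion order
d2 = {1: 11, 2: 22, 3: 32}

print(dict_hash(d0) == dict_hash(d1))  # True
print(dict_hash(d0) == dict_hash(d2))  # False
```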
With the new approach [1] (let's call it the new approach), the numbers look better.

This is in line with the +100% worst-case scenario I had in mind earlier. (The problem was that ...) However, the performance drop appears in py2 only! The new approach in py3 is faster than the old approach in py2, so I would thank the python core developers and call this a win.

[1]

def __lt__(self, rhs):
....
for self_key, rhs_key in zip(sorted(self.eventsPerLumi), sorted(rhs.eventsPerLumi)):
if self.eventsPerLumi[self_key] == rhs.eventsPerLumi[rhs_key]:
continue
else:
return self.eventsPerLumi[self_key] < rhs.eventsPerLumi[rhs_key]
...
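The actual benchmark lives in the author's branch; a minimal timeit sketch along the same lines (the dict size, seed, and repeat count below are made up for illustration):

```python
import random
import timeit

random.seed(42)
keys = random.sample(range(10**6), 1000)
lhs = {k: random.randint(0, 100) for k in keys}
rhs = dict(lhs)
rhs[keys[-1]] += 1  # make the dicts differ in exactly one value

def new_lt(a, b):
    # Same shape as the new __lt__ body: keys first, then values.
    if sorted(a.keys()) != sorted(b.keys()):
        return sorted(a.keys()) < sorted(b.keys())
    for key in sorted(a):
        if a[key] != b[key]:
            return a[key] < b[key]
    return False

elapsed = timeit.timeit(lambda: new_lt(lhs, rhs), number=1000)
print("1000 comparisons took %.3fs" % elapsed)
```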
Thank you Valentin for the suggestion! I do agree with everything you just said; however, given the time scale that we are dealing with, I think we just have to live with this for the moment. I would leave to Kevin or Alan the decision of whether we should invest time in improving the performance of Run comparison.
@mapellidario, I would agree with both your focus on completing an implementation before optimizing, and with your conclusion that, since the py3 version is comparable to (even slightly better than) the existing py2 version, it's good enough to move forward and FINISH the py3 migration. Once we complete the py3 migration, we can add to the brainstorming list an analysis of code performance, to identify in which components the benefits of optimization would outweigh the costs.
Jenkins results:
Thanks for the very detailed description and (performance) checks, Dario. This code looks good to me and it's ready to go. But I left a couple of comments along the code in case you want to have one last look into it.
I'm happy to merge it and move forward as well, just let me know.
I updated the code with @amaltaro's suggestions. I already squashed the commits, so this is ready to be merged (I promise I haven't added anything that wasn't suggested/approved 😏). With Alan's suggestion (...)
Thanks @vkuznet for raising the concern in the first place, and thanks Alan for providing the final implementation!
Nicely done, Dario! Thanks
Fixes #10531
Status
In development
Description
Changes
- Change how WMCore.DataStructs.Run.Runs are compared (I know that the motivations behind these changes may not be self-evident; I made a comment below with some investigations to justify these changes)
- pickle.dump() and pickle.load() require the file object to be opened in bytes mode also in py3 (JobPackage.py)
- assertItemsEqual -> assertCountEqual
))Is it backward compatible (if not, which system it affects?)
yes
Related PRs
This is a follow-up of #10012
External dependencies / deployment changes
nope