Deal with partial block in the parentage fix - change to API #11757
Conversation
Jenkins results:
I should have requested the PR review yesterday evening, but I forgot. So here we go. The unit test failure is flaky and unrelated.
src/python/Utils/IteratorTools.py (outdated)

```diff
@@ -50,3 +51,15 @@ def convertFromUnicodeToBytes(data):
         return type(data)(list(map(convertFromUnicodeToBytes, data)))
     else:
         return data
+
+
+def noDupListOfLists(listObj):
```
I'm not sure the logic of this function covers all use cases. For example, what if you have overlapping slices like `[[1,10], [4,12]]`? In fact, here is what Python says about it:

```python
>>> from itertools import islice, chain, groupby
>>> obj = [[1,10], [4,12]]
>>> obj.sort()
>>> list(k for k, _ in groupby(obj))
[[1, 10], [4, 12]]
```

Therefore, you should either change the docstring to describe the use case correctly, or change the logic of the function if we need to cover such use cases.
I also suggest changing the function name to something like `removeDuplicateListRanges`, which is what it is doing right now, and changing its docstring accordingly to reflect its logic: it removes duplicate list ranges, but it allows duplicate list entries.
Valentin, perhaps I could rename it to something like `renameDuplicateListElements`?
I don't understand your first comment though. The use case is clear and it's described in the docstring (and covered by unit tests). I don't understand what "overlap" you mean with `[[1,10], [4,12]]`, please clarify.
The use case is simple: you have a list where each element is another list (a list of lists). The goal of this function is to make the elements of the list unique (so maybe saying "remove duplicates" isn't great either). E.g., if you have:
`[[1,10], [4,12], [1,10]]`
the output of this function should be (regardless of the order of the elements):
`[[1,10], [4,12]]`
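For illustration, here is a minimal sketch of a function with the behavior described above, assuming the inner elements can be converted to tuples. This is only a sketch following the discussion, not the actual WMCore implementation:

```python
def makeListElementsUnique(listObj):
    """
    Return the unique elements of a list of lists (or list of tuples).

    Output order is not guaranteed. Note this removes only exact
    duplicates; overlapping ranges such as [1, 10] and [4, 12] are
    kept as distinct elements.
    """
    # tuples are hashable, so a set can deduplicate the elements
    return [list(item) for item in {tuple(item) for item in listObj}]
```

For example, `makeListElementsUnique([[1, 10], [4, 12], [1, 10]])` yields `[[1, 10], [4, 12]]` in some order.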
Maybe naming it `makeListElementsUnique` would be better?
A few things here:
- if you call them tuples, then why do you use lists? The list of tuples would be `[(1,10), (4,12)]`
- if your lists represent ranges, e.g. `[1,10]` is a shortcut for `[1,2,3,4,5,6,7,8,9,10]`, then you have overlap in the ranges

Either way, I suggest making a proper clarification, either in the docstring or in the logic.
That said, `makeListElementsUnique` seems more appropriate if you treat them as pairs, i.e. tuples, but in that case it is better to call them pairs or tuples and use the tuple data type.
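The tuple suggestion matters in practice because tuples are hashable while lists are not, so a tuple representation allows straightforward set- or dict-based deduplication. A small illustrative sketch (not the PR's code):

```python
pairs = [(1, 10), (4, 12), (1, 10)]

# dict.fromkeys deduplicates while preserving first-seen order
unique = list(dict.fromkeys(pairs))

# the same trick fails for lists, which are unhashable
try:
    dict.fromkeys([[1, 10], [4, 12]])
except TypeError:
    unique_error = "lists are unhashable"
```

This is why a list-of-lists version has to convert each element to a tuple first before it can use a set or dict.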
```diff
@@ -916,29 +924,11 @@ def fixMissingParentageDatasets(self, childDataset, insertFlag=True):
         self.logger.info("Found %d blocks without parentage information", len(blocks))
         for blockName in blocks:
```
This is sequential logic whose runtime will grow significantly with the number of processed blocks. My advice is to make it concurrent and process the blocks in parallel, but that will require changing the logic to use async, or relying on the queue and multiprocessing modules.
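A hedged sketch of the parallel approach suggested here, using `concurrent.futures` from the standard library. All names and data are hypothetical; in the real code the per-block work would be an HTTP call to the DBS server, which is I/O-bound and therefore a good fit for threads:

```python
from concurrent.futures import ThreadPoolExecutor

def fixBlockParentage(blockName):
    """Hypothetical per-block fix; in reality this would call the DBS server."""
    return blockName, True  # pretend the parentage fix succeeded

# invented block names, just to exercise the pool
blocks = ["/Primary/Processed/TIER#block1", "/Primary/Processed/TIER#block2"]

# process blocks concurrently instead of one at a time
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(fixBlockParentage, blocks))
```

An async or multiprocessing variant would follow the same shape: fan the block names out, collect per-block results.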
This is beyond the scope of these changes, but it can definitely be changed at some point. Feel free to open a GH issue for that.
```python
self.assertListEqual(expRes, noDupListOfLists(inputTest))

# trying a different data type
expRes = [[2, 20], [4, 40], [4, 41], [5, 40]]
```
I do not understand why you have overlapping lumi ranges: the `[4, 40]`, `[4, 41]` and `[5, 40]`. Is it a valid use case? If so, then `noDupListOfLists` should not be called a "no duplicate" list, since the lists are overlapping and contain duplicates once you unfold them.
These are not lumi ranges. Each element is a tuple of 2 integers. I guess by "overlap" you mean tuples with the same values.
If they are tuples, then why not use the tuple data type, i.e. change from `[[1,10], ...]` to `[(1,10), ...]`, and then call the items pairs.
```python
parentFileId = parentRunLumi[runLumi]
listChildParent.append([childFileId, parentFileId])

# then add all run lumi pairs which are missing at the parent Dataset by appending None as parentage information
```
This comment is misleading, as the logic below does not add `None`; instead it adds `-1`.
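For context, a minimal sketch of the pairing logic being discussed. The variable names mirror the snippet above, but the run/lumi maps and file ids are invented toy data:

```python
# hypothetical maps of (run, lumi) -> file id
parentRunLumi = {(1, 100): 11, (1, 101): 12}
childRunLumi = {(1, 100): 21, (1, 102): 22}

listChildParent = []
for runLumi, childFileId in childRunLumi.items():
    if runLumi in parentRunLumi:
        # parent file exists for this run/lumi pair
        listChildParent.append([childFileId, parentRunLumi[runLumi]])
    else:
        # run/lumi missing at the parent dataset: record -1, not None
        listChildParent.append([childFileId, -1])
```

The `-1` sentinel is what the child/parent pair carries when no parent file covers that run/lumi.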
fixed, thanks for catching it!
Commits:
- Identify missing files and use the new missing_files dbs parameter
- Do not append file pair when it is missing parent; make list unique
- Provide function to remove duplicates in a list of lists
- Add tuple with -1 parent file id
- Convert AMR debugging records to logger.debug level
- apply Valentins change
- rename noDupListOfLists to makeListElementsUnique
- Update DBS mocked data
- update IteratorTools unit test
- fix unit tests
- use new function name in unit tests
Given that I updated the code we were discussing, I can no longer comment in thread. So here goes my reply to your points, @vkuznet:
Nonetheless, I updated the docstring saying that it expects a list of lists OR a list of tuples. There are examples in the function docstring as well, in addition to extra tests in the unit tests. I hope you don't mind, but I already squashed all those changes into their respective commits. Please have another look still today, if possible :-D
Alan, thanks for addressing my questions and clarifications. I have no further comments.
Awesome! Much appreciated, Valentin.
Fixes #11715
Status
ready
Description
The following is provided with this PR:
- insert a `-1` parent file id for files with a missing parent
- change to the `insertFileParents` API
- a new function to make list elements unique (`makeListElementsUnique`)

In addition, a couple of extra data records are mocked in DBS and the JSON dump has been updated as well.
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
None
External dependencies / deployment changes
Changes are being made to the DBS Server: dmwm/dbs2go#101