Script to create a HipPy input dataset file #15886
Conversation
- filter the file list when possible to avoid unnecessarily opening and closing files
- add a function for the HipPy file list
(the root file has never been called that as far back as I can tell, and it's copied later anyway https://github.com/cms-sw/cmssw/blob/813e5a/Alignment/OfflineValidation/python/TkAlAllInOneTool/configTemplates.py#L104)
A new Pull Request was created by @hroskes (Heshy Roskes) for CMSSW_8_1_X. It involves the following packages: Alignment/HIPAlignmentAlgorithm. @ghellwig, @cerminar, @cmsbuild, @franzoni, @mmusich, @davidlange6 can you please review it and eventually sign? Thanks. cms-bot commands are listed here #13028
please test
The tests are being triggered in jenkins.
@hroskes looks fine to me, except for the minor comments I made
In general, I think it would be good to migrate the Dataset class to Alignment/CommonAlignment/python/tools. But this could be done in the next round of updates. On the MillePede side there is a script that does a similar job, but with some code duplication. The goal of this script (mps_create_file_lists.py) is to create statistically independent datasets for alignment and validation. I think it would be good if we could unify this somehow, to get common validation datasets and create file lists for the remaining data in each of the formats required by the two alignment algorithms.
What do you think?
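The independent-split idea described above can be sketched in a few lines. This is a hedged illustration, not code from either tool: `split_file_list` and its arguments are hypothetical names.

```python
import random

def split_file_list(files, validation_fraction=0.25, seed=1):
    """Split a list of input files into two statistically independent
    subsets: one for validation, the rest for alignment.
    Hypothetical sketch; not the mps_create_file_lists.py implementation."""
    files = sorted(files)        # fixed order so the split is reproducible
    rng = random.Random(seed)    # seeded RNG, again for reproducibility
    rng.shuffle(files)
    n_validation = int(round(validation_fraction * len(files)))
    # first chunk goes to validation, remainder to alignment
    return files[n_validation:], files[:n_validation]
```

Each algorithm could then format its own share of the files (MillePede file lists, HipPy dataset files) from a common split.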
def getrunnumberfromfilename(filename):
    parts = filename.split("/")
    result = error = None
    if parts[0] != "" or parts[1] != "store":
@hroskes shouldn't it be `and` instead of `or`?
I don't think so... this way catches "something/store" and also "/something"; `and` would not.
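To see why `or` is the right connective here, consider the positive condition it negates. This is a standalone illustration, with `starts_with_store` a hypothetical helper rather than code from the PR:

```python
def starts_with_store(filename):
    # A valid path looks like "/store/...": after splitting on "/",
    # the first element must be empty (leading slash) and the second "store".
    parts = filename.split("/")
    return len(parts) > 2 and parts[0] == "" and parts[1] == "store"

# The rejection condition in the PR is the De Morgan dual of this "and":
# reject when parts[0] != "" OR parts[1] != "store", which catches both
# "something/store/..." (missing leading slash) and "/something/..." .
```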
you are right!
        error = "does not start with /store"
    elif parts[2] in ["mc", "relval"]:
        result = 1
    elif parts[-2] != "00000" or not parts[-1].endswith(".root"):
@hroskes is this nomenclature for file names defined or required somewhere?
Good question... I didn't know about it until @usarica told me. It's true for many datasets but not for this one. Actually Ulascan mentioned that in that case there are multiple runs in the same data file.
Basically at this point I was trying to be as strict as possible, and not remove the filename if it's not in the exact pattern that seems to be satisfied for most datasets.
If you have a better way of figuring out the run number from the filename that would be great. This way is more efficient than using a lumi filter which requires opening and closing all the files. It's particularly important for HipPy because we loop through the files multiple times, but since I was implementing it anyway I figured we might as well use it for validation too.
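For illustration, this kind of filename-based guessing can be sketched as below. `run_number_from_path` is a hypothetical helper, and the assumed layout (the run number split over three three-digit directories, followed by a 00000 directory) is exactly the convention discussed above, which holds for many prompt datasets but, as noted, not for all:

```python
def run_number_from_path(filename):
    """Guess the run number from a prompt-reco style path such as
    /store/data/.../000/283/408/00000/ABCD.root (run 283408 here).
    Returns None when the path does not match the assumed pattern.
    Sketch only; the directory layout is an assumption, not a guarantee."""
    parts = filename.split("/")
    if parts[:2] != ["", "store"] or len(parts) < 7:
        return None
    if parts[-2] != "00000" or not parts[-1].endswith(".root"):
        return None
    digits = parts[-5:-2]  # e.g. ["000", "283", "408"]
    if not all(len(d) == 3 and d.isdigit() for d in digits):
        return None
    return int("".join(digits))
```

The payoff is that no file has to be opened to decide whether it lies in the selected run range.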
I think your way works in the case of prompt data, but not for rereco (like the dataset that you linked). Since we usually run alignments on prompt reco, it is fine this way.
I was just curious whether you know of a place where this convention is defined.
In MillePede, we have a similar guessing mechanism, but I would not call it "a better way".
@@ -252,9 +252,6 @@ def createScript(self, path):
         resultingFile = os.path.expandvars( resultingFile )
         resultingFile = os.path.abspath( resultingFile )
         resultingFile = "root://eoscms//eos/cms" + resultingFile #needs to be AFTER abspath so that it doesn't eat the //
-        repMap["runComparisonScripts"] += \
-            ("xrdcp -f OUTPUT_comparison.root %s\n"
-             %resultingFile)
@hroskes can you elaborate what the exact effect of this fix is?
The effect is basically to remove a bash error because OUTPUT_comparison.root does not exist.
I think this line was supposed to copy the output of makeArrowPlots("comparison.root", "..."), but actually the first argument to makeArrowPlots is something else, so this file is never created and it just gives a warning.
The actual output file contains .oO[name]Oo., so it gets copied in this step.
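For context, the .oO[name]Oo. placeholders in the template files work roughly like this; `replace_placeholders` is a hypothetical minimal re-implementation for illustration, not the all-in-one tool's actual substitution code:

```python
import re

def replace_placeholders(template, repmap):
    """Substitute .oO[key]Oo. placeholders with values from repmap.
    Minimal sketch of the all-in-one tool's template convention."""
    def sub(match):
        # match.group(1) is the key between .oO[ and ]Oo.
        return str(repmap[match.group(1)])
    return re.sub(r"\.oO\[(\w+)\]Oo\.", sub, template)
```

A file whose name still contains .oO[name]Oo. in the template thus gets its real name filled in before the xrdcp step runs, which is why the copy of the actual output succeeds while the hard-coded OUTPUT_comparison.root never exists.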
ok, thanks for the explanation!
@ghellwig Unifying sounds great. Where is
@hroskes I just realized that the PR containing this script is not yet merged, but I linked the code in one of my comments.
                fileList.remove(filename)
        except AllInOneError as e:
            if forcerunselection: raise
            print e.message
@hroskes just one question: in the All-in-One tool, `forcerunselection` is set to `False`, right?
But in case one runs a validation using rereco data (for whatever reason), one gets quite a lot of stdout from this line, right?
Yes, you're right. I've fixed this and pushed now.
(I guess it's still a lot of output in the case where most but not all of the files fall into this category, but I would be surprised if that ever happens, and in that case you would want to know what's going on.)
@hroskes looks good now!
please test
The tests are being triggered in jenkins.
+1
This pull request is fully signed and it will be integrated in one of the next CMSSW_8_1_X IBs (tests are also fine). This pull request requires discussion in the ORP meeting before it's merged. @slava77, @davidlange6, @smuzaffar
+1
Can also add a similar function for MP if that's needed, but I'm not familiar with the syntax.
Also make the AIO tool dataset class a bit more general by splitting up the biggest function. No point in rewriting another module to do the same thing.
Also, remove some inefficiency in validation by skipping files whose filenames encode a run number outside the selected run range.
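The skipping described here can be sketched as follows. Everything in this snippet is hypothetical: `filter_by_run_range`, its arguments, and the use of ValueError standing in for the tool's AllInOneError.

```python
def filter_by_run_range(file_list, firstrun, lastrun, forcerunselection=False,
                        getrun=lambda f: None):
    """Drop files whose run number (guessed from the filename) lies outside
    [firstrun, lastrun]. If the run cannot be determined, either raise
    (forcerunselection=True) or keep the file and warn once.
    Hypothetical sketch of the behavior discussed in the review."""
    kept, warned = [], False
    for filename in file_list:
        run = getrun(filename)
        if run is None:
            if forcerunselection:
                raise ValueError("cannot determine run number for " + filename)
            if not warned:  # warn once, not once per file
                print("warning: could not determine run numbers for some files; keeping them")
                warned = True
            kept.append(filename)
        elif firstrun <= run <= lastrun:
            kept.append(filename)
    return kept
```

Filtering by filename before any file is opened is what saves the repeated open/close cycles of a lumi-based filter, which matters most for HipPy since it loops over the files multiple times.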
And a little fix in the geometry comparison.