Implement ResourceInformationService #37831
Conversation
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37831/29750
|
A new Pull Request was created by @wddgit (W. David Dagenhart) for master. It involves the following packages:
@cmsbuild, @smuzaffar, @Dr15Jones, @makortel, @fwyzard can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
please test |
@wddgit can we discuss the kind of information that should be gathered? For example, some global information that would be useful to track includes:
While some additional per-GPU information that could be useful includes:
|
-1 Failed Tests: RelVals-INPUT
Comparison Summary:
|
Yes, this is definitely open for discussion. One reason I submitted this PR, which only partially resolves this issue, is that I wanted more discussion to make sure I was headed in the right direction before I put in more time. Matti is the expert directing my work on this. My experience with GPU issues is very limited; I am just starting up that learning curve. FYI, I have a week of vacation scheduled next week. Feel free to continue the discussion in my absence and I'll continue working on this when I return. |
please test
The test errors look unrelated to this PR. Try running the tests again. |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-02f619/24499/summary.html Comparison Summary:
|
Thanks David. I think it would be better to distinguish the CPU and the GPU models in this Service. I think all the considered consumers of this information (JobReport, CondorStatusService, "provenance") would want to report those separately (i.e. the CPU model is this, and the GPU model is that). One feature of the consumers that I didn't consider before is that they all seem to need different levels of information. E.g. for CPU:
One way would be to evolve the ResourceInformationService towards a key-value store, into which the CPU Service, CUDAService, etc. can push information, and the consumers would use what they consider relevant. That would require some level of standardization of the keys between the producers and consumers, at least for the consumers that want only a (small) subset of the information (e.g. JobReport could just dump everything). Or maybe a 2-level hierarchical key-value store? E.g. expressed as JSON, something along the lines of [
{
"Type" : "CPU",
"Model" : "Intel ...",
...
},
{
"Type" : "GPU",
"Model" : "NVIDIA ...",
...
},
{
"Type" : "GPU",
"Model" : "NVIDIA something else...",
...
}
] ?
@fwyzard We can certainly add more information to be passed around, but we should also have some understanding of where that information would be consumed. E.g. for the consumers above, I'd think of the following as an overall guideline:
Do you have any other consumer in mind for this kind of information? |
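As a purely illustrative sketch (all names here are hypothetical and not part of this PR), such a two-level store could be little more than a list of per-resource key-value records that producer services append to and consumers filter by the "Type" key, mirroring the JSON above:

#include <map>
#include <string>
#include <utility>
#include <vector>

// One record per resource (CPU, GPU, ...), keyed by field name.
using ResourceRecord = std::map<std::string, std::string>;

class ResourceInfoStore {
public:
  // Producer services (CPU service, CUDAService, ...) push one record
  // per device they describe.
  void addRecord(ResourceRecord record) { records_.push_back(std::move(record)); }

  // Consumers pick out only the records they care about, e.g. all GPUs.
  std::vector<ResourceRecord> recordsOfType(std::string const& type) const {
    std::vector<ResourceRecord> result;
    for (auto const& record : records_) {
      auto it = record.find("Type");
      if (it != record.end() && it->second == type) {
        result.push_back(record);
      }
    }
    return result;
  }

  // A consumer like JobReport could simply dump everything.
  std::vector<ResourceRecord> const& allRecords() const { return records_; }

private:
  std::vector<ResourceRecord> records_;
};

A producer would then call something like addRecord({{"Type", "GPU"}, {"Model", "NVIDIA ..."}}), which corresponds one-to-one to the JSON objects sketched above.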
Hi Matti, so this could be similar to the impact of the glibc version (do we store and consume that anywhere?). Information about available memory and exclusive use should not affect the physics output, but could be useful for debugging problems based on the reports. Other details like core counts, total memory, clock speed, etc. could be useful mostly for monitoring, and maybe for scaling the reconstruction time. |
Thanks @fwyzard.
I don't think we store the glibc version explicitly anywhere, but to a large degree that is governed by the production
Including the driver version in the "file provenance" makes sense (it can be expected to vary a lot more than e.g. the actual glibc binary). Maybe the driver version is generic enough between vendors that we could call the field just "GPU driver version" without explicitly specifying CUDA/ROCm/etc., since the vendor should be clear from the model name record.
Would the GPU model name be sufficient for deducing the available memory, at least for the "file provenance"? How would we know if a process has exclusive use of a GPU (without constantly or periodically monitoring possible other processes accessing the GPU, in which case we would know it too late for the "file provenance")? |
Pull request #37831 was updated. @cmsbuild, @smuzaffar, @Dr15Jones, @makortel, @fwyzard can you please check and sign again. |
please test
I just rebased and pushed responses to recent review comments. |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-02f619/29493/summary.html Comparison Summary:
|
// ---------- member functions ---------------------------
/// CPU information - the models present and average speed.
virtual bool cpuInfo(std::string &models, double &avgSpeed) = 0;
Just to note that we could probably remove this (practically) abstract base class in a future PR.
ResourceInformation const& operator=(ResourceInformation const&) = delete;
virtual ~ResourceInformation();
|
enum class AcceleratorType { GPU }; |
#39402 made me wonder whether knowing the GPU vendor (or "software stack"?) would be useful as well. In principle the information is available via gpuModels() (needs string parsing) or e.g. nvidiaDriverVersion() (check if the string is empty), so it could be checked already with the present interface. So maybe the present interface is sufficient, and we adjust later if it seems useful?
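As a sketch only (not code from this PR), a consumer could already infer the vendor from the existing interface along these lines. It assumes gpuModels() returns one model string per device and that nvidiaDriverVersion() is an empty string when no NVIDIA driver is in use, and it deliberately leaves out the exact class and namespace of the service:

#include <string>

enum class GpuVendor { None, Nvidia, Amd, Other };

template <typename ResourceInfo>
GpuVendor guessGpuVendor(ResourceInfo const& info) {
  auto const& models = info.gpuModels();
  if (models.empty()) {
    return GpuVendor::None;
  }
  // Assumption: the NVIDIA driver version string is only filled when CUDA is in use.
  if (!info.nvidiaDriverVersion().empty()) {
    return GpuVendor::Nvidia;
  }
  // Otherwise fall back to string parsing of the model name.
  if (models.front().find("AMD") != std::string::npos) {
    return GpuVendor::Amd;
  }
  return GpuVendor::Other;
}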
@cmsbuild, please test
Just to check that nothing broke. After that I'll sign (my last comments are meant as things that could be improved in the future, if needed). |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-02f619/29664/summary.html Comparison Summary:
|
Comparison differences are in 11634.7 and thus spurious (#39803). |
+core |
+heterogeneous |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2) |
+1 |
PR description:
Initial implementation of ResourceInformationService. The function acceleratorsTypes() will return a container holding enumeration values. Currently it will always be empty or contain a value for "GPU" if any item in "@selected_accelerators" starts with the substring "gpu-". We expect that additional enumeration values may be added in the future.
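As a rough sketch of that check (a hypothetical helper, not the actual code in this PR), the logic amounts to scanning the selected accelerator labels for the "gpu-" prefix:

#include <string>
#include <vector>

enum class AcceleratorType { GPU };

// Return the accelerator types implied by the "@selected_accelerators"
// labels: currently just GPU, when any label starts with "gpu-".
std::vector<AcceleratorType> acceleratorTypesFrom(std::vector<std::string> const& selectedAccelerators) {
  std::vector<AcceleratorType> types;
  for (auto const& label : selectedAccelerators) {
    if (label.rfind("gpu-", 0) == 0) {  // equivalent to label.starts_with("gpu-")
      types.push_back(AcceleratorType::GPU);
      break;  // a single GPU entry is enough
    }
  }
  return types;
}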
The following additional functions are added:
The service contains data members to hold the data returned by those functions. Other services (CPU and CUDAService) must be configured for these to be filled. We expect other accelerator devices may be added to this in the future.
This does not include changes that let anything retrieve information from ResourceInformationService. Those changes will come in a future PR.
This also does not include changes to store this information persistently. That will also come in a future PR.
If the service has its verbose parameter set to true, then it will print out some information at begin job.
PR validation:
There is a unit test that checks the information printed out when the service is set to be verbose. Nothing uses this service yet, so I do not expect this to have any immediate effect on RelVals or production executables.