Implement ResourceInformationService #37831

Merged
merged 2 commits into from
Dec 19, 2022

wddgit
Contributor

@wddgit wddgit commented May 5, 2022

PR description:

Initial implementation of ResourceInformationService. The function acceleratorsTypes() returns a container holding enumeration values. Currently the container is either empty or holds a single value for "GPU", which is present if any item in "@selected_accelerators" starts with the substring "gpu-". We expect that additional enumeration values may be added in the future.
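
As an illustration only, the prefix rule described above could be sketched as follows. The free function `acceleratorTypesFrom` and its parameter name are hypothetical and are not taken from this PR; only the `AcceleratorType` enum mirrors the snippet shown in the review further down.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Mirrors the enum shown in the review snippet; GPU is currently the only value.
enum class AcceleratorType { GPU };

// Hypothetical helper, not the PR's actual code: derives the accelerator types
// from the list of selected accelerator names.
std::vector<AcceleratorType> acceleratorTypesFrom(std::vector<std::string> const& selectedAccelerators) {
  std::vector<AcceleratorType> types;
  for (auto const& name : selectedAccelerators) {
    // rfind(prefix, 0) == 0 is a pre-C++20 way to test "starts with".
    if (name.rfind("gpu-", 0) == 0) {
      types.push_back(AcceleratorType::GPU);
      break;  // one GPU entry is enough, however many names match
    }
  }
  // With a single enumerator the result is either empty or exactly {GPU}.
  assert(types.size() <= 1);
  return types;
}
```

Stopping at the first match keeps the result free of duplicates; if more enumeration values are added later, a set-like container might be a better fit.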

The following additional functions are added:

  • cpuModels()
  • gpuModels()
  • nvidiaDriverVersion()
  • cudaDriverVersion()
  • cudaRuntimeVersion()
  • cpuModelsFormatted()
  • cpuAverageSpeed()

The service contains data members to hold the data returned by those functions. Other services (CPU and CUDAService) must be configured for these to be filled. We expect other accelerator devices may be added to this in the future.

This does not include changes for anything to get information out of ResourceInformationService. Those changes will come in a future PR.

This also does not include changes to store this information persistently. That will also come in a future PR.

If the service's verbose parameter is set to true, it prints some information at begin job.

PR validation:

There is a unit test that checks the information printed out when the service is set to be verbose. Nothing uses this service yet so I do not expect this will have any immediate effect on RelVals or production executables.

@cmsbuild
Contributor

cmsbuild commented May 5, 2022

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37831/29750

  • This PR adds an extra 36KB to repository

@cmsbuild
Contributor

cmsbuild commented May 5, 2022

A new Pull Request was created by @wddgit (W. David Dagenhart) for master.

It involves the following packages:

  • FWCore/Framework (core)
  • FWCore/Services (core)
  • FWCore/Utilities (core)
  • HeterogeneousCore/CUDAServices (heterogeneous)

@cmsbuild, @smuzaffar, @Dr15Jones, @makortel, @fwyzard can you please review it and eventually sign? Thanks.
@makortel, @felicepantaleo, @rovere this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@wddgit
Contributor Author

wddgit commented May 5, 2022

please test

@fwyzard
Contributor

fwyzard commented May 5, 2022

@wddgit can we discuss the kind of information that should be gathered ?
I think it would be useful to have more details than just the number of accelerators and their models.

For example, some global information that would be useful to track is:

  • the NVIDIA driver version being used (which depends on the local installation);
  • the CUDA driver version being used (which usually matches the NVIDIA driver version, but it could also be the compatibility library we ship with CMSSW);
  • the CUDA runtime version being used (which should be the version we ship with CMSSW, but better check).

Some additional per-GPU information that could be useful:

  • if the GPU usage is exclusive or shared with other jobs
  • the amount of total and free GPU memory when the job starts
  • etc.

@cmsbuild
Contributor

cmsbuild commented May 6, 2022

-1

Failed Tests: RelVals-INPUT
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-02f619/24488/summary.html
COMMIT: b4e04a9
CMSSW: CMSSW_12_4_X_2022-05-05-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/37831/24488/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-INPUT

  • 136.803: 136.803_RunNoBPTX2017C+RunNoBPTX2017C+HLTDR2_2017+RECODR2_2017reHLTAlCaTkCosmics_Prompt+HARVEST2017/step2_RunNoBPTX2017C+RunNoBPTX2017C+HLTDR2_2017+RECODR2_2017reHLTAlCaTkCosmics_Prompt+HARVEST2017.log
  • 136.8: 136.8_RunSinglePh2017C+RunSinglePh2017C+HLTDR2_2017+RECODR2_2017reHLT_skimSinglePh_Prompt+HARVEST2017_skimSinglePh/step2_RunSinglePh2017C+RunSinglePh2017C+HLTDR2_2017+RECODR2_2017reHLT_skimSinglePh_Prompt+HARVEST2017_skimSinglePh.log
  • 136.802: 136.802_RunMuOnia2017C+RunMuOnia2017C+HLTDR2_2017+RECODR2_2017reHLT_skimMuOnia_Prompt+HARVEST2017_skimMuOnia/step2_RunMuOnia2017C+RunMuOnia2017C+HLTDR2_2017+RECODR2_2017reHLT_skimMuOnia_Prompt+HARVEST2017_skimMuOnia.log
  • 136.801
  • 136.7952
  • 136.804
  • 136.805
  • 136.806
  • 136.807
  • 136.808
  • 136.809
  • 136.81
  • 136.811
  • 136.812
  • 136.813
  • 136.814
  • 136.815
  • 136.816
  • 136.817
  • 136.818
  • 136.819
  • 136.82
  • 136.821
  • 136.822
  • 136.823
  • 136.824
  • 136.825
  • 136.826
  • 136.827
  • 136.828
  • 136.829
  • 136.83
  • 136.831
  • 136.8311
  • 136.83111
  • 136.832
  • 136.833
  • 136.834
  • 136.835
  • 136.836
  • 136.837
  • 136.838
  • 136.839
  • 136.8391
  • 136.84
  • 136.841
  • 136.842
  • 136.843
  • 136.844
  • 136.845
  • 136.846
  • 136.847
  • 136.848
  • 136.849
  • 136.85
  • 136.8501
  • 136.851
  • 136.852
  • 136.8521
  • 136.8522
  • 136.8523
  • 136.853
  • 136.854
  • 136.855
  • 136.856
  • 136.8561
  • 136.8562
  • 136.857
  • 136.858
  • 136.859
  • 136.86
  • 136.861
  • 136.862
  • 136.863
  • 136.864
  • 136.8642
  • 136.865
  • 136.866
  • 136.867
  • 136.868
  • 136.869
  • 136.87
  • 136.871
  • 136.872
  • 136.873
  • 136.874
  • 136.875
  • 136.876
  • 136.877
  • 136.878
  • 136.879
  • 136.88
  • 136.881
  • 136.882
  • 136.883
  • 136.884
  • 136.885
  • 136.8855
  • 136.885501
  • 136.885511
  • 136.885521
  • 136.886
  • 136.8861
  • 136.8862
  • 136.887
  • 136.888
  • 136.88811
  • 136.8885
  • 136.888501
  • 136.888511
  • 136.888521
  • 136.889
  • 136.89
  • 136.891
  • 136.892
  • 136.893
  • 136.894
  • 136.895
  • 136.896
  • 136.897
  • 136.898
  • 136.899
  • 137.8
  • 138.1
  • 138.2
  • 138.3
  • 138.4
  • 138.5
  • 139.001
  • 139.002
  • 139.003
  • 139.004
  • 139.005
  • 140.51
  • 140.52
  • 140.53
  • 140.54
  • 140.55
  • 140.56
  • 140.5611
  • 140.57
  • 158.01
  • 158.1
  • 158.2
  • 158.3
  • 1306.0
  • 1307.0
  • 1308.0
  • 1309.0
  • 1310.0
  • 1311.0
  • 1312.0
  • 1313.0
  • 1314.0
  • 1315.0
  • 1316.0
  • 1317.0
  • 1318.0
  • 1319.0
  • 1320.0
  • 1321.0
  • 1322.0
  • 1323.0
  • 1324.0
  • 1325.0
  • 1325.1
  • 1325.2
  • 1325.3
  • 1325.4
  • 1325.5
  • 1325.51
  • 1325.516
  • 1325.5161
  • 1325.517
  • 1325.518
  • 1325.6
  • 1325.61
  • 1325.7
  • 1325.8
  • 1325.81
  • 1325.9
  • 1325.91
  • 1326.0
  • 1327.0
  • 1328.0
  • 1329.0
  • 1329.1
  • 1330.0
  • 1331.0
  • 1332.0
  • 1333.0
  • 1334.0
  • 1335.0
  • 1336.0
  • 1337.0
  • 1338.0
  • 1339.0
  • 1340.0
  • 1341.0
  • 1343.0
  • 1344.0
  • 1345.0
  • 1347.0
  • 1348.0
  • 1349.0
  • 1350.0
  • 1351.0
  • 1352.0
  • 1353.0
  • 1354.0
  • 1355.0
  • 1364.0
  • 1365.0
  • 1366.0
  • 134.0
  • 134.99601
  • 134.99602
  • 134.99603
  • 134.99901
  • 144.6
  • 139901.0
  • 139902.0
  • 13992501.0
  • 13992502.0
  • 200.0
  • 202.0
  • 203.0
  • 205.0
  • 11024.2
  • 25200.0
  • 25202.0
  • 25202.1
  • 25202.2
  • 25203.0
  • 25204.0
  • 25205.0
  • 25206.0
  • 25207.0
  • 25208.0
  • 25209.0
  • 25214.0
  • 50200.0
  • 50202.0
  • 50203.0
  • 50204.0
  • 50205.0
  • 50206.0
  • 50207.0
  • 50208.0
  • 1000.0
  • 1001.0
  • 1001.2
  • 1002.0
  • 1003.0
  • 1004.0
  • 1010.0
  • 1020.0
  • 1030.0
  • 1040.0
  • 1040.1
  • 1041.0
  • 1042.0
  • 1102.0
  • 4000.0
  • 4001.0
  • 4002.0
  • 4003.0
  • 10001.0
  • 10002.0
  • 10003.0
  • 10004.0
  • 10005.0
  • 10006.0
  • 10007.0
  • 10008.0
  • 10009.0
  • 10023.0
  • 10024.0
  • 10024.1
  • 10024.2
  • 10024.3
  • 10024.4
  • 10024.5
  • 10025.0
  • 10026.0
  • 10042.0
  • 10059.0
  • 10071.0
  • 10224.0
  • 10225.0
  • 10424.0
  • 10801.0
  • 10802.0
  • 10803.0
  • 10804.0
  • 10804.31
  • 10805.0
  • 10805.31
  • 10806.0
  • 10807.0
  • 10808.0
  • 10809.0
  • 10823.0
  • 10824.0
  • 10824.1
  • 10824.5
  • 10824.501
  • 10824.505
  • 10824.511
  • 10824.521
  • 10824.6
  • 10824.8
  • 10825.0
  • 10826.0
  • 10842.0
  • 10842.501
  • 10842.505
  • 10859.0
  • 10871.0
  • 11024.0
  • 11024.6
  • 11025.0
  • 11224.0
  • 11224.6
  • 11601.0
  • 11602.0
  • 11603.0
  • 11604.0
  • 11605.0
  • 11606.0
  • 11607.0
  • 11608.0
  • 11609.0
  • 11630.0
  • 11634.0
  • 11634.1
  • 11634.24
  • 11634.5
  • 11634.501
  • 11634.505
  • 11634.511
  • 11634.521
  • 11634.601
  • 11634.7
  • 11634.91
  • 11640.0
  • 11643.0
  • 11646.0
  • 11650.0
  • 11650.501
  • 11650.505
  • 11723.17
  • 11725.0
  • 11834.0
  • 11834.13
  • 11834.21
  • 11834.24
  • 11834.99
  • 11846.0
  • 11925.0
  • 12034.0
  • 12434.0
  • 12634.0
  • 12634.99
  • 12834.0
  • 13034.0
  • 13034.99
  • 23234.0
  • 23234.21
  • 23434.21
  • 23434.99
  • 23434.9921
  • 23434.999
  • 34634.0
  • 35034.0
  • 39434.0
  • 39434.103
  • 39434.21
  • 39434.5
  • 39434.501
  • 39434.502
  • 39434.75
  • 39434.9
  • 39496.0
  • 39500.0
  • 39634.114
  • 39634.21
  • 39634.99
  • 39634.9921
  • 39634.999
  • 250200.0
  • 250200.17
  • 250200.18
  • 250202.0
  • 250202.1
  • 250202.17
  • 250202.171
  • 250202.172
  • 250202.18
  • 250202.181
  • 250202.2
  • 250202.3
  • 250202.4
  • 250202.5
  • 250203.0
  • 250203.17
  • 250203.18
  • 250204.0
  • 250204.17
  • 250204.18
  • 250205.0
  • 250205.17
  • 250205.18
  • 250206.0
  • 250206.17
  • 250206.18
  • 250206.181
  • 250207.0
  • 250207.17
  • 250207.18
  • 250208.17
  • 250208.18
  • 500200.0
  • 500202.0
  • 500203.0
  • 500204.0
  • 500205.0
  • 500206.0
  • 500207.0

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3700704
  • DQMHistoTests: Total failures: 19
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3700662
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 48 files compared)
  • DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
  • Checked 206 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@wddgit
Contributor Author

wddgit commented May 6, 2022

Yes, this is definitely open for discussion. One reason I submitted this PR, which only partially resolves this issue, is that I wanted more discussion to make sure I was headed in the right direction before I put in more time. Matti is the expert on this and is directing my work. My experience with GPU issues is very limited; I am just starting up that learning curve.

FYI. I have a week of vacation scheduled next week. Feel free to continue discussions in my absence and I'll continue working on this when I return.

@wddgit
Contributor Author

wddgit commented May 6, 2022

please test

The test errors look unrelated to this PR. Try running the tests again.

@cmsbuild
Contributor

cmsbuild commented May 6, 2022

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-02f619/24499/summary.html
COMMIT: b4e04a9
CMSSW: CMSSW_12_4_X_2022-05-06-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/37831/24499/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3700704
  • DQMHistoTests: Total failures: 2
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3700680
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 206 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor

makortel commented May 7, 2022

Thanks David.

I think it would be better to distinguish the CPU and the GPU models in this Service. I think all the considered consumers of this information (JobReport, CondorStatusService, "provenance") would want to report those separately (i.e. CPU model is this, and GPU model is that).

One feature of the consumers that I didn't consider before is that they all seem to need different levels of information. E.g. for CPU

  • for "provenance" in the ROOT file I'd imagine only the CPU model to be relevant
  • CondorStatusService reports "average speed" in addition to the model
  • JobReport adds even more information

One way would be to evolve the ResourceInformationService towards a key-value store, into which the CPU Service, CUDAService, etc, can push information, and the consumers would use what they consider relevant. That would require some level of standardization of the keys between the producers and consumers, at least for the consumers that want only a (small) subset of the information (e.g. JobReport could just dump everything).

Or maybe a 2-level hierarchical key-value store? E.g. expressed as JSON, something along the lines of

[
  {
    "Type" : "CPU",
    "Model" :  "Intel ...",
    ...
  },
  {
    "Type" : "GPU",
    "Model" : "NVIDIA ...",
    ...
  },
  {
    "Type" : "GPU",
    "Model" : "NVIDIA something else...",
    ...
  }
]

?
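
A minimal sketch of what such a two-level store might look like in C++, assuming string-valued entries and a mandatory "Type" key per record. The names `ResourceRecord`, `ResourceStore`, `add` and `values` are hypothetical and do not appear in this PR.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One record per device: the inner level of the hierarchy is a flat key-value map.
using ResourceRecord = std::map<std::string, std::string>;

// Hypothetical store: producers push records, consumers query the keys they need.
class ResourceStore {
public:
  void add(ResourceRecord record) {
    // Assumed convention: every record states what kind of resource it describes.
    assert(record.count("Type") == 1);
    records_.push_back(std::move(record));
  }

  // Returns the values stored under `key` for all records of the given type,
  // skipping records that do not provide that key.
  std::vector<std::string> values(std::string const& type, std::string const& key) const {
    std::vector<std::string> result;
    for (auto const& record : records_) {
      auto typeIt = record.find("Type");
      if (typeIt == record.end() || typeIt->second != type)
        continue;
      auto keyIt = record.find(key);
      if (keyIt != record.end())
        result.push_back(keyIt->second);
    }
    return result;
  }

private:
  std::vector<ResourceRecord> records_;
};
```

In this sketch a consumer that only needs the models would call `values("GPU", "Model")`, while a consumer such as the job report could iterate over everything. The key names would still need to be agreed between producers and consumers, as noted above.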

@fwyzard We can certainly add more information to be delivered around, but we should also have some understanding of where that information would be consumed. E.g. for the consumers above, I'd think as an overall guideline

  • file "provenance" would be limited to information that can affect physics results
  • CondorStatusService would be limited to information useful for monitoring currently/recently running grid jobs
  • Framework job report could contain almost anything (as it is quite large already)

Do you have any other consumer in mind for this kind of information?

@fwyzard
Contributor

fwyzard commented May 7, 2022

Hi Matti,
the various driver versions can affect the physics results, due to bug fixes within them, and the possible use of runtime version checks to enable or disable optimisations.
(e.g.: the 510.xx driver series fixes a bug in cooperative groups)

So, this could be similar to the impact of the glibc version (do we store and consume that anywhere?).

Information about available memory and exclusive use should not affect the physics output, but could be useful for debugging problems based on the reports.

Other details like core counts, total memory, clock speed, etc. could be useful mostly for monitoring, and maybe for scaling the reconstruction time.

@makortel
Contributor

Thanks @fwyzard.

> So, this could be similar to the impact of the glibc version (do we store and consume that anywhere?).

I don't think we store the glibc version explicitly anywhere, but to a large degree that is governed by the production SCRAM_ARCH of a given release (and if one uses a non-production arch, one is expected to know what one is doing).

> the various driver versions can affect the physics results, due to bug fixes within them, and the possible use of runtime version checks to enable or disable optimisations.

Including the driver version in the "file provenance" makes sense (it can be expected to vary a lot more than e.g. the actual glibc binary). Maybe the driver version could be generic enough between vendors that we could call the field something along the lines of "GPU driver version" without explicitly specifying CUDA/ROCm/etc, since the vendor should be clear from the model name record.

> Information about available memory and exclusive use should not affect the physics output, but could be useful for debugging problems based on the reports.

Would the GPU model name be sufficient towards available memory, at least for the "file provenance"?

How would we know if a process has exclusive use of a GPU? (without constantly/periodically monitoring possible other processes accessing the GPU, in which case we would know it too late for the "file provenance")

@cmsbuild
Contributor

cmsbuild commented Dec 6, 2022

Pull request #37831 was updated. @cmsbuild, @smuzaffar, @Dr15Jones, @makortel, @fwyzard can you please check and sign again.

@wddgit
Contributor Author

wddgit commented Dec 6, 2022

please test

I just rebased and pushed responses to recent review comments

@cmsbuild
Contributor

cmsbuild commented Dec 7, 2022

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-02f619/29493/summary.html
COMMIT: 033c52c
CMSSW: CMSSW_13_0_X_2022-12-06-1100/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/37831/29493/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 27 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3421337
  • DQMHistoTests: Total failures: 1173
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3420142
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 206 log files, 158 edm output root files, 48 DQM output files
  • TriggerResults: no differences found


// ---------- member functions ---------------------------
///CPU information - the models present and average speed.
virtual bool cpuInfo(std::string &models, double &avgSpeed) = 0;
Contributor

Just to note that we could probably remove this (practically) abstract base class in a future PR.

ResourceInformation const& operator=(ResourceInformation const&) = delete;
virtual ~ResourceInformation();

enum class AcceleratorType { GPU };
Contributor

#39402 made me wonder whether knowing the GPU vendor (or "software stack"?) would be useful as well. In principle the information is available via gpuModels() (needs string parsing) or e.g. nvidiaDriverVersion() (check whether the string is empty), so it can already be checked with the present interface. So maybe the present interface is sufficient, and we adjust later if it seems useful?
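
To make the two options in this comment concrete, here is a rough sketch of how a consumer might guess the vendor. The `ResourceSnapshot` struct is a hypothetical stand-in for the values returned by gpuModels() and nvidiaDriverVersion(), and the enum `GpuVendor` and both functions are assumptions for illustration, not part of this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the values the service would return.
struct ResourceSnapshot {
  std::vector<std::string> gpuModels;
  std::string nvidiaDriverVersion;
};

enum class GpuVendor { None, NVIDIA, Unknown };

// Heuristic only: a non-empty NVIDIA driver version is taken as proof of the
// NVIDIA stack; otherwise fall back to searching the model names.
GpuVendor guessGpuVendor(ResourceSnapshot const& info) {
  if (info.gpuModels.empty())
    return GpuVendor::None;
  if (!info.nvidiaDriverVersion.empty())
    return GpuVendor::NVIDIA;
  bool anyNvidia = std::any_of(info.gpuModels.begin(), info.gpuModels.end(), [](std::string const& model) {
    return model.find("NVIDIA") != std::string::npos;
  });
  return anyNvidia ? GpuVendor::NVIDIA : GpuVendor::Unknown;
}

// Human-readable name, e.g. for a verbose printout.
char const* vendorName(GpuVendor vendor) {
  switch (vendor) {
    case GpuVendor::None:
      return "none";
    case GpuVendor::NVIDIA:
      return "NVIDIA";
    case GpuVendor::Unknown:
      return "unknown";
  }
  assert(false && "unhandled GpuVendor");
  return "";
}
```

The string search is fragile, since model names are not guaranteed to contain the vendor name, which is one argument for an explicit vendor field if this turns out to be needed.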

@makortel
Contributor

@cmsbuild, please test

Just to check that nothing broke. After that I'll sign (with my last comments meant for something to be improved in the future, if needed).

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-02f619/29664/summary.html
COMMIT: 033c52c
CMSSW: CMSSW_13_0_X_2022-12-16-1100/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/37831/29664/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 31 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3557521
  • DQMHistoTests: Total failures: 1191
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3556308
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 211 log files, 162 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor

Comparison differences are in 11634.7 and thus spurious (#39803).

@makortel
Contributor

+core

@makortel
Contributor

+heterogeneous

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@rappoccio
Contributor

+1

7 participants