Implement ResourceInformationService #37831
Conversation
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37831/29750
|
A new Pull Request was created by @wddgit (W. David Dagenhart) for master. It involves the following packages:
@cmsbuild, @smuzaffar, @Dr15Jones, @makortel, @fwyzard can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
please test |
@wddgit can we discuss the kind of information that should be gathered? For example, some global information that would be useful to track includes:
While some additional per-GPU information that could be useful includes:
|
-1 Failed Tests: RelVals-INPUT
Comparison Summary:
|
Yes, this is definitely open for discussion. One reason I submitted this PR, which only partially resolves this issue, is that I wanted more discussion to make sure I was headed in the right direction before I put in more time. Matti is the expert directing my work on this. My experience with GPU issues is very limited; I am just starting up that learning curve. FYI, I have a week of vacation scheduled next week. Feel free to continue the discussion in my absence and I'll continue working on this when I return. |
please test
The test errors look unrelated to this PR. Try running the tests again. |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-02f619/24499/summary.html Comparison Summary:
|
Thanks David. I think it would be better to distinguish the CPU and the GPU models in this Service. I think all the considered consumers of this information (JobReport, CondorStatusService, "provenance") would want to report those separately (i.e. the CPU model is this, and the GPU model is that). One feature of the consumers that I didn't consider before is that they all seem to need different levels of information. E.g. for CPU:
One way would be to evolve the ResourceInformationService towards a key-value store, into which the CPU Service, CUDAService, etc. can push information, and the consumers would use what they consider relevant. That would require some level of standardization of the keys between the producers and consumers, at least for the consumers that want only a (small) subset of the information (e.g. JobReport could just dump everything). Or maybe a 2-level hierarchical key-value store? E.g. expressed as JSON, something along the lines of [
{
"Type" : "CPU",
"Model" : "Intel ...",
...
},
{
"Type" : "GPU",
"Model" : "NVIDIA ...",
...
},
{
"Type" : "GPU",
"Model" : "NVIDIA something else...",
...
}
] ?
@fwyzard We can certainly add more information to be passed around, but we should also have some understanding of where that information would be consumed. E.g. for the consumers above, I'd think of the following as an overall guideline:
Do you have any other consumer in mind for this kind of information? |
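As a purely illustrative sketch (all names here are hypothetical and not part of this PR), such a two-level store could be little more than a list of per-resource key-value records that producer services append to and consumers filter by the "Type" key, mirroring the JSON above:

#include <map>
#include <string>
#include <utility>
#include <vector>

// One record per resource (CPU, GPU, ...), keyed by field name.
using ResourceRecord = std::map<std::string, std::string>;

class ResourceInfoStore {
public:
  // Producer services (CPU service, CUDAService, ...) push one record
  // per device they describe.
  void addRecord(ResourceRecord record) { records_.push_back(std::move(record)); }

  // Consumers pick out only the records they care about, e.g. all GPUs.
  std::vector<ResourceRecord> recordsOfType(std::string const& type) const {
    std::vector<ResourceRecord> result;
    for (auto const& record : records_) {
      auto it = record.find("Type");
      if (it != record.end() && it->second == type) {
        result.push_back(record);
      }
    }
    return result;
  }

  // A consumer like JobReport could simply dump everything.
  std::vector<ResourceRecord> const& allRecords() const { return records_; }

private:
  std::vector<ResourceRecord> records_;
};

A producer would then call something like addRecord({{"Type", "GPU"}, {"Model", "NVIDIA ..."}}), which corresponds one-to-one to the JSON objects sketched above.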
Hi Matti, so this could be similar to the impact of the glibc version (do we store and consume that anywhere?). Information about available memory and exclusive use should not affect the physics output, but could be useful for debugging problems based on the reports. Other details like core counts, total memory, clock speed, etc. could be useful mostly for monitoring, and maybe for scaling the reconstruction time. |
Thanks @fwyzard.
I don't think we store the glibc version explicitly anywhere, but to a large degree that is governed by the production
Including the driver version in the "file provenance" makes sense (it can be expected to vary a lot more than e.g. the actual glibc binary). Maybe the driver version is generic enough between vendors that we could call the field just "GPU driver version" without explicitly specifying CUDA/ROCm/etc., since the vendor should be clear from the model name record.
Would the GPU model name be sufficient for deducing the available memory, at least for the "file provenance"? How would we know if a process has exclusive use of a GPU (without constantly or periodically monitoring possible other processes accessing the GPU, in which case we would know it too late for the "file provenance")? |
Pull request #37831 was updated. @cmsbuild, @smuzaffar, @Dr15Jones, @makortel, @fwyzard can you please check and sign again. |
please test
I just rebased and pushed responses to recent review comments. |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-02f619/29493/summary.html Comparison Summary:
|
// ---------- member functions ---------------------------
/// CPU information - the models present and average speed.
virtual bool cpuInfo(std::string &models, double &avgSpeed) = 0;
Just to note that we could probably remove this (practically) abstract base class in a future PR.
ResourceInformation const& operator=(ResourceInformation const&) = delete;
virtual ~ResourceInformation();
|
enum class AcceleratorType { GPU }; |
#39402 made me wonder whether knowing the GPU vendor (or "software stack"?) would be useful as well. In principle the information is available via gpuModels() (needs string parsing) or e.g. nvidiaDriverVersion() (check if the string is empty), so it could be checked already with the present interface. So maybe the present interface is sufficient, and we adjust later if it seems useful?
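As a sketch only (not code from this PR), a consumer could already infer the vendor from the existing interface along these lines. It assumes gpuModels() returns one model string per device and that nvidiaDriverVersion() is an empty string when no NVIDIA driver is in use, and it deliberately leaves out the exact class and namespace of the service:

#include <string>

enum class GpuVendor { None, Nvidia, Amd, Other };

template <typename ResourceInfo>
GpuVendor guessGpuVendor(ResourceInfo const& info) {
  auto const& models = info.gpuModels();
  if (models.empty()) {
    return GpuVendor::None;
  }
  // Assumption: the NVIDIA driver version string is only filled when CUDA is in use.
  if (!info.nvidiaDriverVersion().empty()) {
    return GpuVendor::Nvidia;
  }
  // Otherwise fall back to string parsing of the model name.
  if (models.front().find("AMD") != std::string::npos) {
    return GpuVendor::Amd;
  }
  return GpuVendor::Other;
}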
@cmsbuild, please test
Just to check that nothing broke. After that I'll sign (my last comments are meant as things that could be improved in the future, if needed). |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-02f619/29664/summary.html Comparison Summary:
|
Comparison differences are in 11634.7 and thus spurious (#39803). |
+core |
+heterogeneous |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2) |
+1 |
PR description:
Initial implementation of ResourceInformationService. The function acceleratorsTypes() will return a container holding enumeration values. Currently it will always be empty or contain a value for "GPU" if any item in "@selected_accelerators" starts with the substring "gpu-". We expect that additional enumeration values may be added in the future.
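As a rough sketch of that check (a hypothetical helper, not the actual code in this PR), the logic amounts to scanning the selected accelerator labels for the "gpu-" prefix:

#include <string>
#include <vector>

enum class AcceleratorType { GPU };

// Return the accelerator types implied by the "@selected_accelerators"
// labels: currently just GPU, when any label starts with "gpu-".
std::vector<AcceleratorType> acceleratorTypesFrom(std::vector<std::string> const& selectedAccelerators) {
  std::vector<AcceleratorType> types;
  for (auto const& label : selectedAccelerators) {
    if (label.rfind("gpu-", 0) == 0) {  // equivalent to label.starts_with("gpu-")
      types.push_back(AcceleratorType::GPU);
      break;  // a single GPU entry is enough
    }
  }
  return types;
}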
The following additional functions are added:
The service contains data members to hold the data returned by those functions. Other services (CPU and CUDAService) must be configured for these to be filled. We expect other accelerator devices may be added to this in the future.
This does not include changes that let anything retrieve information from ResourceInformationService. Those changes will come in a future PR.
This also does not include changes to store this information persistently. That will also come in a future PR.
If the service has its verbose parameter set to true, then it will print out some information at begin job.
PR validation:
There is a unit test that checks the information printed out when the service is set to be verbose. Nothing uses this service yet, so I do not expect this to have any immediate effect on RelVals or production executables.