Report CPU info to HTCondor monitoring #15767

Merged
cmsbuild merged 5 commits into cms-sw:CMSSW_8_1_X on Oct 9, 2016

@bbockelm bbockelm commented Sep 7, 2016

This PR fixes #15763 and #15764 by adding CPU information to the HTCondor monitoring. In particular, it adds:

  • CPU time usage.
  • CPU model information.
  • CPU time usage from child processes.

This adds CPU usage information to the HTCondor service reporting;
it reports the entire CPU (user + system) usage for this process.
As the "job CPU" information reported by the framework excludes the
framework startup costs, the "total CPU" is slightly more than what
the Timing service prints on stdout.

This commit has the framework forward the CPU model information to the
HTCondor monitoring.

With this commit, getCPU() additionally returns all usage information
from child processes.
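
To illustrate the "user + system, including child processes" accounting described in these commits, here is a minimal sketch based on POSIX `getrusage`. It is an assumption that the framework's getCPU() works along these lines; `totalCpuSeconds` and `toSeconds` are hypothetical names and not part of CMSSW.

```cpp
#include <sys/resource.h>
#include <sys/time.h>

// Convert a timeval (seconds + microseconds) to seconds as a double.
static double toSeconds(const timeval& tv) {
  return static_cast<double>(tv.tv_sec) + static_cast<double>(tv.tv_usec) * 1e-6;
}

// Hypothetical helper, not the CMSSW getCPU(): returns user + system CPU
// seconds for this process plus its terminated, waited-for children.
double totalCpuSeconds() {
  struct rusage self {};
  struct rusage children {};
  double total = 0.0;
  if (getrusage(RUSAGE_SELF, &self) == 0) {
    total += toSeconds(self.ru_utime) + toSeconds(self.ru_stime);
  }
  // RUSAGE_CHILDREN only covers children that have exited and been waited for.
  if (getrusage(RUSAGE_CHILDREN, &children) == 0) {
    total += toSeconds(children.ru_utime) + toSeconds(children.ru_stime);
  }
  return total;
}
```

Because the value is sampled for the whole process lifetime, it also covers startup work, which is consistent with the note that "total CPU" is slightly larger than what the Timing service prints.
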
@cmsbuild

cmsbuild commented Sep 7, 2016

A new Pull Request was created by @bbockelm (Brian Bockelman) for CMSSW_8_1_X.

It involves the following packages:

FWCore/Services
FWCore/Utilities

@cmsbuild, @smuzaffar, @Dr15Jones, @davidlange6 can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @makortel, @wddgit, @wmtan this is something you requested to watch as well.
@slava77, @smuzaffar you are the release manager for this.

cms-bot commands are listed here #13028

@davidlange6

please test

@cmsbuild

cmsbuild commented Sep 14, 2016

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/15144/console

private:
int totalNumberCPUs_;
double averageCoreSpeed_;
bool reportCPUProperties_;

Contributor

Updating the member data in the cpuInfo call is not thread safe. As far as I can tell, there is no need for totalNumberCPUs_ or averageCoreSpeed_ to be member data. It also looks like reportCPUProperties_ could be const.

Contributor Author

Done.
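
The review point above can be sketched as follows: the configuration flag becomes a `const` member set once at construction, and the computed values live in locals returned by value, so concurrent calls share no mutable state. This is only an illustration under assumptions; `CpuReporter`, `CpuSummary`, and `summarize` are hypothetical names and do not reflect the actual CPU service code.

```cpp
#include <vector>

// Hypothetical result type; not a CMSSW class.
struct CpuSummary {
  int totalCPUs = 0;
  double averageSpeedMHz = 0.0;
};

class CpuReporter {
public:
  explicit CpuReporter(bool reportProperties) : reportCPUProperties_(reportProperties) {}

  // const method: all results are locals, so no member data is written
  // and concurrent callers cannot race on shared state.
  CpuSummary summarize(const std::vector<double>& coreSpeedsMHz) const {
    CpuSummary summary;
    if (!reportCPUProperties_) {
      return summary;
    }
    double sum = 0.0;
    for (double speed : coreSpeedsMHz) {
      sum += speed;
    }
    summary.totalCPUs = static_cast<int>(coreSpeedsMHz.size());
    if (summary.totalCPUs > 0) {
      summary.averageSpeedMHz = sum / summary.totalCPUs;
    }
    return summary;
  }

private:
  const bool reportCPUProperties_;  // set once at construction, never modified
};
```
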

return cpuInfoImpl(models, avgSpeed, nullptr);
}

bool CPU::cpuInfoImpl(std::string &result_models, double &result_avg_speed, Service<JobReport>* reportSvc)

Contributor

This seems to do a lot more than just the work needed to return result_models and result_avg_speed. I think the function should be better factored so that it doesn't know about the JobReport.

Contributor Author

Done.

Now, one function parses the cpuinfo file; each statistic we want
is computed from that parsed information in a separate method.
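
A rough sketch of the "parse once, derive each statistic separately" structure described in this commit is shown below. It assumes the x86 Linux `/proc/cpuinfo` layout (blank-line-separated blocks with `model name` and `cpu MHz` keys); `parseCpuInfo`, `averageSpeedMHz`, and `modelNames` are hypothetical names, not the functions in FWCore/Services/plugins/CPU.cc.

```cpp
#include <exception>
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

using CpuEntry = std::map<std::string, std::string>;

// Strip leading and trailing spaces, tabs, and carriage returns.
static std::string trim(const std::string& s) {
  const auto first = s.find_first_not_of(" \t\r");
  if (first == std::string::npos) return "";
  const auto last = s.find_last_not_of(" \t\r");
  return s.substr(first, last - first + 1);
}

// Parse once: one key/value map per processor block.
std::vector<CpuEntry> parseCpuInfo(std::istream& in) {
  std::vector<CpuEntry> cpus;
  CpuEntry current;
  std::string line;
  while (std::getline(in, line)) {
    if (trim(line).empty()) {  // a blank line ends the current block
      if (!current.empty()) {
        cpus.push_back(current);
        current.clear();
      }
      continue;
    }
    const auto colon = line.find(':');
    if (colon == std::string::npos) continue;  // skip lines without "key : value"
    current[trim(line.substr(0, colon))] = trim(line.substr(colon + 1));
  }
  if (!current.empty()) cpus.push_back(current);
  return cpus;
}

// Statistic 1: mean of the "cpu MHz" values that parse as numbers.
double averageSpeedMHz(const std::vector<CpuEntry>& cpus) {
  double sum = 0.0;
  int count = 0;
  for (const auto& cpu : cpus) {
    const auto it = cpu.find("cpu MHz");
    if (it == cpu.end()) continue;
    try {
      sum += std::stod(it->second);
      ++count;
    } catch (const std::exception&) {
      // ignore entries that are not numeric
    }
  }
  return count > 0 ? sum / count : 0.0;
}

// Statistic 2: distinct "model name" values, comma-separated.
std::string modelNames(const std::vector<CpuEntry>& cpus) {
  std::set<std::string> models;
  for (const auto& cpu : cpus) {
    const auto it = cpu.find("model name");
    if (it != cpu.end()) models.insert(it->second);
  }
  std::ostringstream out;
  bool first = true;
  for (const auto& model : models) {
    if (!first) out << ", ";
    out << model;
    first = false;
  }
  return out.str();
}
```

Neither statistic function knows about the JobReport here; a caller would read `/proc/cpuinfo` through a `std::ifstream`, parse it once, and pass the results wherever they need to be reported.
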
@cmsbuild

Pull request #15767 was updated. @cmsbuild, @smuzaffar, @Dr15Jones, @davidlange6 can you please check and sign again.

…_cpu_info

Conflicts:
	FWCore/Services/plugins/CPU.cc
@cmsbuild

Pull request #15767 was updated. @cmsbuild, @smuzaffar, @Dr15Jones, @davidlange6 can you please check and sign again.

@Dr15Jones

please test

@cmsbuild

cmsbuild commented Oct 6, 2016

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/15579/console

@Dr15Jones

@bbockelm how do we know if this is working?

@bbockelm

bbockelm commented Oct 9, 2016

Hi Chris,

Add the following two lines to your pset:

process.CPU = cms.Service("CPU", reportCPUProperties=cms.untracked.bool(True))
process.CondorStatusService = cms.Service("CondorStatusService", tag=cms.untracked.string("DIGI"), debug=cms.untracked.bool(True))

You should see the condor invocations echoed to stdout and the CPU information in the FJR.

Brian

@Dr15Jones

+1

@cmsbuild

cmsbuild commented Oct 9, 2016

This pull request is fully signed and it will be integrated in one of the next CMSSW_8_1_X IBs (tests are also fine). This pull request requires discussion in the ORP meeting before it's merged. @slava77, @davidlange6, @smuzaffar

@davidlange6

+1

@cmsbuild cmsbuild merged commit 5e56097 into cms-sw:CMSSW_8_1_X Oct 9, 2016
@dan131riley

Hi @bbockelm

If I add

process.Timing = cms.Service("Timing")
process.CPU = cms.Service("CPU", reportCPUProperties=cms.untracked.bool(True))
process.CondorStatusService = cms.Service("CondorStatusService", tag=cms.untracked.string("DIGI"), debug=cms.untracked.bool(True))

to a job in CMSSW_8_1_0_pre13 (that has this PR) I get a failure

----- Begin Fatal Exception 24-Oct-2016 12:55:33 EDT-----------------------
An exception of category 'Configuration' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing service of type CondorStatusService
Exception Message:
MissingParameter: The required parameter 'summaryOnly' was not specified.
----- End Fatal Exception -------------------------------------------------

which looks like the TimingService is getting constructed before fillDescriptions has been called...this does not happen if the TimingService isn't included in the config. Found by a CRAB user, see this HN and the upstream thread:

https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/2205/2/1.html

Since this happens while constructing CondorStatusService, I strongly suspect this PR is responsible.

@dan131riley

Never mind, I had missed that this was fixed by #16311
