Resource summary file on BU #59

Closed
smorovic opened this issue Jan 21, 2015 · 1 comment
@smorovic

Currently, the state of CPU resource usage is available through the box info files that each FU updates on the ramdisk. It was proposed that hltd on the BU should instead summarize this into a number of available resources and provide it to consumers (the BU application).

In the updated version, a JSON file, /fff/ramdisk/appliance/resource_summary, is written. It also contains other summarized information, taken only from box files updated within the last 10 s. For example:
{
"ramdisk_occupancy": 0.32000000000000001,
"active_resources": 1,
"activeFURun": 127042,
"activeRunNumQueuedLS": 0,
"broken": 0,
"idle": 0,
"used": 1,
"cloud": 0
}

  • ramdisk_occupancy: ratio of used to total size of the ramdisk partition
  • active_resources: sum of idle and used resources on the FUs
  • activeFURun: most recent run found in all active_runs boxinfo files
  • activeRunNumQueuedLS: worst-case number of lumisections sitting in the anelastic.py queue on the FUs.
    This is the number of EoLS files found in the anelastic.py queue, which stores inotify file events before the script handles them. A high value can indicate problems with disk I/O or with NFS file copying to the BU. The value is -1 if no FU has an active run or the script is not initialized yet. The value is taken only from FUs whose last active run matches the one indicated in the summary.
  • broken, idle, used, cloud: summarize core resources in more detail
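
For consumers of the file, the fields above could be read along these lines. This is a minimal sketch, not the BU application's actual code: the helper names (`load_summary`, `read_summary_file`, `is_consistent`, `queue_status`) and the `warn_threshold` value are hypothetical, and only the path and the field names come from the description above.

```python
import json

# Path as given in this issue.
SUMMARY_PATH = "/fff/ramdisk/appliance/resource_summary"

# The example summary from this issue, used here so the sketch runs anywhere.
EXAMPLE = """{
"ramdisk_occupancy": 0.32000000000000001,
"active_resources": 1,
"activeFURun": 127042,
"activeRunNumQueuedLS": 0,
"broken": 0,
"idle": 0,
"used": 1,
"cloud": 0
}"""


def load_summary(text):
    """Parse the JSON text of a resource_summary file into a dict."""
    return json.loads(text)


def read_summary_file(path=SUMMARY_PATH):
    """Read and parse the summary from disk (hypothetical helper)."""
    with open(path) as f:
        return load_summary(f.read())


def is_consistent(summary):
    """Check that active_resources is the sum of idle and used, as described."""
    return summary["active_resources"] == summary["idle"] + summary["used"]


def queue_status(summary, warn_threshold=10):
    """Classify activeRunNumQueuedLS; warn_threshold is an assumed value."""
    queued = summary["activeRunNumQueuedLS"]
    if queued == -1:
        # No FU active run, or anelastic.py is not initialized yet.
        return "no-active-run"
    if queued >= warn_threshold:
        # Possible disk I/O or NFS copy problems.
        return "backlog"
    return "ok"


summary = load_summary(EXAMPLE)
```

On a BU, `read_summary_file()` would replace the embedded example; any threshold for treating the queue as a backlog would have to be chosen from operational experience.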
@smorovic smorovic self-assigned this Jan 23, 2015
@smorovic

We have already switched to using the resource_summary file.

The current version contains
"activeRunCMSSWMaxLS" - integer, initial value: -1
(once initialized, this shows the max LS seen by any CMSSW; needs version >=7_3_2_patch5)

hltd 1.7 will add:
"stale_resources" - integer, default value: 0

  • this will be 0 unless we detect lag or problems with updating files on the ramdisk via the data network (assuming box files are updated through the control network).
    In case of problems, resources will be counted here instead of in active_resources and the counters accounted in it (idle, used, broken). "cloud" resources are never counted as stale.
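
One possible reading of the stale accounting, as a sketch only: the `summarize` function, the per-FU dict layout, and the boolean `stale` flag are assumptions made for illustration, and how hltd actually detects staleness is not shown here.

```python
def summarize(boxes):
    """Aggregate per-FU resource counts into summary counters.

    Each box is a dict with integer keys idle, used, broken, cloud and a
    boolean key stale (hypothetical layout, not the real boxinfo format).
    """
    summary = {
        "active_resources": 0,
        "idle": 0,
        "used": 0,
        "broken": 0,
        "cloud": 0,
        "stale_resources": 0,
    }
    for box in boxes:
        # Cloud resources are never counted as stale.
        summary["cloud"] += box["cloud"]
        if box["stale"]:
            # Assumed reading: idle, used and broken all move to the stale counter.
            summary["stale_resources"] += box["idle"] + box["used"] + box["broken"]
            continue
        summary["idle"] += box["idle"]
        summary["used"] += box["used"]
        summary["broken"] += box["broken"]
    # active_resources is the sum of idle and used resources.
    summary["active_resources"] = summary["idle"] + summary["used"]
    return summary


boxes = [
    {"idle": 2, "used": 6, "broken": 0, "cloud": 0, "stale": False},
    {"idle": 1, "used": 3, "broken": 1, "cloud": 4, "stale": True},
]
result = summarize(boxes)
```

With these two made-up FUs, the stale one contributes nothing to idle, used, broken, or active_resources, while its cloud count is still reported.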
