Add hwloc whitelist support #467
This PR adds support for an hwloc whitelist, which lets resource-query and resource-match load only those hwloc resources whose types are in the whitelist.
This alleviates two problems: 1) the graph data store created with hwloc is relatively large with no clear benefit to scheduling; 2) because hwloc topology oftentimes exhibits variations between different platforms, one jobspec that works on one platform doesn't necessarily work on another platform.
This also includes fixes for a couple of bugs: Issue #466.
@@ Coverage Diff @@
## master #467 +/- ##
==========================================
+ Coverage 76.23% 76.31% +0.08%
==========================================
Files 45 45
Lines 5478 5535 +57
==========================================
+ Hits 4176 4224 +48
- Misses 1302 1311 +9
@SteVwonder: OK. Travis is green. This is ready for your review. This is kind of important for supporting a minimal portable jobspec.
One caveat is that this may still not support the complete portability of a jobspec. Even though you will whitelist and downselect the resources that populate the resource graph data store, the logic preserves the relative hierarchy of those resources. So if the relative hierarchy of a resource as reported by hwloc differs between two platforms, one jobspec may not work across both platforms. I suspect GPUs could be in that category. I can imagine that on one platform, hwloc may report this as:
And in another platform:
Although I haven't seen this yet.
From what I'm seeing, it looks like the overall coverage stayed basically the same, but of the code that changed in this PR, a slightly smaller percentage was covered than the percentage covered in the repo as a whole. Peeking at the coverage diff breakdown, it looks like the only changes that aren't covered by testing are error paths. So I'm fine with this coverage as-is.
Problem: 1) the graph data store created with hwloc is relatively large with no clear benefit to scheduling; 2) because hwloc topology oftentimes exhibits variations between different platforms, one jobspec that works on one platform doesn't necessarily work on another platform.
Introduce hwloc whitelist support to mitigate these problems. For resource-query, specifying --hwloc-whitelist=node,socket,core will add only these resource types (if detected from hwloc) into the graph data store. For the resource-match service, hwloc-whitelist=... as a module load option will have the same effect.
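To make the whitelist behavior concrete, here is a minimal Python sketch of the filtering idea, not the actual flux-sched reader code. The `Vertex` class, `apply_whitelist`, and `chain` are hypothetical names invented for illustration, and the sketch assumes that a dropped vertex hands its surviving descendants up to the nearest kept ancestor, which is one way to preserve the relative hierarchy described in the review comment.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """Hypothetical stand-in for one object in an hwloc-style topology tree."""
    type: str
    children: list = field(default_factory=list)


def apply_whitelist(vertex, whitelist):
    """Return the filtered subtrees that replace `vertex`.

    A whitelisted vertex is kept. A non-whitelisted vertex is dropped and its
    surviving descendants are handed up to the nearest kept ancestor, so the
    relative order of the remaining types is unchanged.
    """
    kept = []
    for child in vertex.children:
        kept.extend(apply_whitelist(child, whitelist))
    if vertex.type in whitelist:
        return [Vertex(vertex.type, kept)]
    return kept


def chain(vertex):
    """Follow first children and return the type names along that path."""
    names = [vertex.type]
    while vertex.children:
        vertex = vertex.children[0]
        names.append(vertex.type)
    return names


# Illustrative topology (an assumption): node -> socket -> numanode -> 2 cores
topo = Vertex("node", [
    Vertex("socket", [
        Vertex("numanode", [Vertex("core"), Vertex("core")]),
    ]),
])

filtered = apply_whitelist(topo, {"node", "socket", "core"})
print(chain(filtered[0]))  # ['node', 'socket', 'core']
```

With numanode filtered out, a jobspec written as node->socket->core lines up with the stored hierarchy regardless of whether the platform reports NUMA nodes.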
Problem: We had partial match logic to allow a jobspec to omit the prefix of hierarchical resource requests. It turned out that this partial match logic accidentally, and thus incorrectly, attempts partial matching for non-prefix, lower-level components as well. For example,
- with resource graph: cluster->node->socket->numanode->core,
- and jobspec: node->socket->core
This logic doesn't fail matching when it cannot find a qualified resource at the socket level. Notice that at this level, the resource graph has numanode while the jobspec looks for core. This resulted in a successful match with only partially matched R ("node->socket" but no "core"!).
Fix this by differentiating the no-match case for prefix omission support from other types of no-match. Enumerate the former condition as PRESTINE_NONE_MATCH and the latter as NONE_MATCH: the traverser returns an unsuccessful match when it detects NONE_MATCH.
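The distinction between the two no-match cases can be sketched in Python as follows. This is an illustration of the idea under simplifying assumptions, not the traverser code in this PR: the graph is a nested `(type, children)` tuple, the jobspec is a flat list of type names, and `classify` and `traverse` are hypothetical helpers. Only the enumerator names PRESTINE_NONE_MATCH and NONE_MATCH come from the commit message.

```python
from enum import Enum, auto


class MatchKind(Enum):
    RESOURCE_MATCH = auto()
    PRESTINE_NONE_MATCH = auto()  # no match, and nothing has matched yet (prefix may be omitted)
    NONE_MATCH = auto()           # no match below a level that already matched


def classify(vertex_type, wanted_type, pristine):
    """Classify one vertex against the next requested type."""
    if vertex_type == wanted_type:
        return MatchKind.RESOURCE_MATCH
    return MatchKind.PRESTINE_NONE_MATCH if pristine else MatchKind.NONE_MATCH


def traverse(graph, request, pristine=True):
    """Return True if `request` matches along some path of `graph`.

    graph:   (type, [children]) tuple (hypothetical representation)
    request: list of type names, outermost first
    """
    vtype, children = graph
    kind = classify(vtype, request[0], pristine)
    if kind is MatchKind.NONE_MATCH:
        return False  # mismatch after a real match: fail the match
    if kind is MatchKind.PRESTINE_NONE_MATCH:
        # Still in the omitted prefix: skip this vertex, keep the full request.
        return any(traverse(c, request, True) for c in children)
    rest = request[1:]
    if not rest:
        return True
    return any(traverse(c, rest, False) for c in children)


full = ("cluster", [("node", [("socket", [("numanode", [("core", [])])])])])

print(traverse(full, ["node", "socket", "core"]))      # False: numanode sits where core is expected
print(traverse(full, ["socket", "numanode", "core"]))  # True: only the prefix was omitted
```

The first call reproduces the example from the commit message: once node and socket have matched, encountering numanode where the jobspec asks for core is a NONE_MATCH, so the whole match fails instead of returning a partial R.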