Add Support for On-Demand Capacity Reservations (ODCRs) in Cluster Configuration File #4295
06633b1 to bbd702b
          Codecov Report
 @@             Coverage Diff             @@
##           develop    #4295      +/-   ##
===========================================
+ Coverage    88.29%   88.37%   +0.07%     
===========================================
  Files          158      159       +1     
  Lines        13349    13472     +123     
===========================================
+ Hits         11787    11906     +119     
- Misses        1562     1566       +4     
    
instance_type=compute_resource.instance_type,
instance_type_data=instance_types_data[compute_resource.instance_type],
)
# The validation below has to be in cluster config class instead of queue class
Can line 2382 and these two validators be added under CommonSchedulerClusterConfig? From the schema, it seems this field is also supported by the scheduler plugin schema.
Which validators? Sorry, the line numbers changed after a rebase.
CapacityReservationValidator and CapacityReservationResourceGroupValidator.
I think the field capacity_reservation_target applies to both Slurm and the scheduler plugin in the schema.
Why are the validators added only to class SlurmClusterConfig instead of its parent class?
class SlurmClusterConfig(CommonSchedulerClusterConfig)
After discussion with chenwany, ODCR should be supported by both Slurm and Scheduler Plugin. I made relevant changes. Thank you!
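The outcome of this thread, registering the capacity reservation validators on the shared parent class so that both Slurm and scheduler plugin configurations run them, could look roughly like the sketch below. This is a simplified illustration and not the ParallelCluster implementation: the class bodies, the `validators()` method, the string stand-ins for validators, and the `SchedulerPluginClusterConfig` and `SlurmOnlyValidator` names are assumptions. Only `CommonSchedulerClusterConfig`, `SlurmClusterConfig`, and the two validator names come from the discussion.

```python
class CommonSchedulerClusterConfig:
    """Hypothetical stand-in for the shared parent config class."""

    def validators(self):
        # Registered once on the parent, so every subclass inherits them.
        return [
            "CapacityReservationValidator",
            "CapacityReservationResourceGroupValidator",
        ]


class SlurmClusterConfig(CommonSchedulerClusterConfig):
    def validators(self):
        # A subclass extends the shared list instead of replacing it.
        return super().validators() + ["SlurmOnlyValidator"]  # hypothetical name


class SchedulerPluginClusterConfig(CommonSchedulerClusterConfig):
    """Assumed sibling class; it inherits the shared validators unchanged."""
```

With this layout, adding a validator for a field that both schedulers share requires a change in one place only.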
bbd702b to bb6f133
resources = self._client.list_group_resources(Group=group)["Resources"]
for resource in resources:
    if resource["Identifier"]["ResourceType"] == "AWS::EC2::CapacityReservation":
        capacity_reservation_ids.append(re.match("(.*)(cr-.*)", resource["Identifier"]["ResourceArn"]).group(2))
is there a way to match exactly so there is only one result?
It might also be good to clarify what the matches are by using the named-group syntax (?P<match_name>...):
>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> m.group('last_name')
'Reynolds'
> is there a way to match exactly so there is only one result?
I made the regex stricter, but I am not sure whether an exact match is possible.
The second comment is addressed.
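To make the two suggestions in this thread concrete, here is a minimal sketch of a stricter pattern with a named group. It is not the regex that was merged: the pattern, the helper name `extract_capacity_reservation_id`, and the sample ARNs are assumptions, and the ARN shape in the comment is assumed as well. Unlike calling `.group(2)` directly on the result of `re.match`, the helper returns `None` for a non-matching ARN instead of raising `AttributeError`.

```python
import re

# Assumed ARN shape: arn:aws:ec2:<region>:<account>:capacity-reservation/cr-<hex>
_CR_ARN_PATTERN = re.compile(r"capacity-reservation/(?P<reservation_id>cr-[0-9a-f]+)$")


def extract_capacity_reservation_id(resource_arn):
    """Return the reservation ID from an ARN, or None if the ARN does not match."""
    match = _CR_ARN_PATTERN.search(resource_arn)
    return match.group("reservation_id") if match else None


arn = "arn:aws:ec2:us-east-1:123456789012:capacity-reservation/cr-0123456789abcdef0"
print(extract_capacity_reservation_id(arn))
```

Anchoring the pattern with `$` and restricting the ID to hexadecimal characters means trailing text after the ID makes the match fail rather than being captured.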
24897bd to dd4a866
e8a8225 to f148263
f70c841 to ba6e1a7
conditional_template_properties.update({"instance_type": compute_resource.instance_type})

capacity_reservation_specification = None
if isinstance(queue, (SlurmQueue, SchedulerPluginQueue)):
Is the check if isinstance(queue, (SlurmQueue, SchedulerPluginQueue)) necessary?
The function _add_compute_resource_launch_template belongs to class ComputeFleetConstruct, and ComputeFleetConstruct is not used for the awsbatch scheduler.
See code:
        if not self._condition_is_batch():
            self.compute_fleet_resources = ComputeFleetConstruct(
Done
    ]
)

if self._config.scheduling.scheduler == "slurm":
if self._config.scheduling.scheduler != "awsbatch"?
Done
1. The section can appear at both the queue level and the compute resource level. The section at the compute resource level takes precedence over the one at the queue level.
2. The capacity reservation specification is passed into the launch templates used by compute nodes. This approach is forward compatible if ParallelCluster uses EC2 Fleet.
Signed-off-by: Hanwen <hanwenli@amazon.com>
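The precedence rule and the launch template hand-off described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: the function names and the sample IDs and ARNs are hypothetical, and the dictionary keys follow the EC2 launch template `CapacityReservationSpecification` structure as I understand it.

```python
def resolve_capacity_reservation_target(queue_target, compute_resource_target):
    """The compute resource setting wins; otherwise fall back to the queue."""
    if compute_resource_target is not None:
        return compute_resource_target
    return queue_target


def build_capacity_reservation_specification(target):
    """Build the launch template fragment, or None when no target is set."""
    if target is None:
        return None
    return {"CapacityReservationTarget": dict(target)}


# Hypothetical sample values for illustration only.
queue_target = {"CapacityReservationId": "cr-0123456789abcdef0"}
compute_target = {
    "CapacityReservationResourceGroupArn": (
        "arn:aws:resource-groups:us-east-1:123456789012:group/example-group"
    )
}

effective = resolve_capacity_reservation_target(queue_target, compute_target)
spec = build_capacity_reservation_specification(effective)
```

Resolving the target before building the specification keeps the launch template code unaware of where the setting came from.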
ba6e1a7 to 96ad66b
This commit also adds a validation to prevent the two parameters from coexisting in the ODCR section.
Signed-off-by: Hanwen <hanwenli@amazon.com>
…availability zone
Signed-off-by: Hanwen <hanwenli@amazon.com>
Signed-off-by: Hanwen <hanwenli@amazon.com>
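A validation that keeps two parameters from coexisting in the ODCR section might look like the sketch below. The commit message does not name the two parameters; this example assumes they are the capacity reservation ID and the capacity reservation resource group ARN. The function name and the error wording are hypothetical, and the real validator in ParallelCluster may be structured differently.

```python
def validate_capacity_reservation_target(capacity_reservation_id=None, resource_group_arn=None):
    """Return a list of error messages; an empty list means the section is valid."""
    errors = []
    # Assumption: the two mutually exclusive parameters are the ID and the group ARN.
    if capacity_reservation_id and resource_group_arn:
        errors.append(
            "Specify either a capacity reservation ID or a capacity reservation "
            "resource group ARN, not both."
        )
    return errors
```

Returning a list of messages rather than raising lets a caller collect every configuration problem in one pass.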
96ad66b to 249ecbb
This policy is required to run validators for `CapacityReservationResourceGroupArn`. The policy for describing capacity reservations is not added in this commit because it is already covered by the existing policy `ec2:Describe*`. The same policy is added to integration tests.
Signed-off-by: Hanwen <hanwenli@amazon.com>
249ecbb to 2b1d3fd
test_scheduler_plugin checks that the cluster configuration file matches the expected one. We added an ODCR section in aws#4295 and a Placement Group section in Networking in aws#4330 to the cluster configuration file. Therefore, the expected cluster configuration file should be changed accordingly.
Signed-off-by: Hanwen <hanwenli@amazon.com>
test_scheduler_plugin checks that the cluster configuration file matches the expected one. We added an ODCR section in aws#4295 and a Placement Group section in Compute Resource in aws#4330 to the cluster configuration file. Therefore, the expected cluster configuration file should be changed accordingly.
Signed-off-by: Hanwen <hanwenli@amazon.com>
test_scheduler_plugin checks that the cluster configuration file matches the expected one. We added an ODCR section in #4295 and a Placement Group section in Compute Resource in #4330 to the cluster configuration file. Therefore, the expected cluster configuration file should be changed accordingly.
Signed-off-by: Hanwen <hanwenli@amazon.com>
…fied on compute resource or queue
This was a bug from aws#4295. The validation was only executed if the capacity reservation target was specified on the compute resource.
Signed-off-by: Hanwen <hanwenli@amazon.com>
…fied on compute resource or queue
This was a bug from #4295. The validation was only executed if the capacity reservation target was specified on the compute resource.
Signed-off-by: Hanwen <hanwenli@amazon.com>
Description of changes
Add ODCR configuration at the queue and compute resource levels. See the commit descriptions for details.
To Do
In a separate PR, we will create a dedicated integration test to verify the target ODCR.
Tests
A unit test is created for the new validator. The integration test (test_efa) is changed to use the ODCR from the configuration file. Manually verified that the run_instances override has higher priority than the parameters in the cluster config.
Checklist
Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.