feat(api,ui,sdk): Make CPU limits configurable #586

deadlycoconuts · 2024-05-17T03:21:57Z

Description

Currently, users cannot configure the CPU limits of the pods in which Merlin models and transformers are deployed; the limits are instead determined automatically at the platform level (by the Merlin API server). Depending on how the API server has been configured, one of the following happens:

- the CPU limit of a model is set to its CPU request value multiplied by a scaling factor (e.g. 2 CPU * 1.5), or
  - Note that this is how memory limits are already set automatically by the Merlin API server
- the CPU limit is left unset
  - Note that because KServe does not currently allow CPU limits to be completely unset, the Merlin API server sets an arbitrary (ideally very large) value as the CPU limit instead

This PR introduces a new workflow that lets users override the platform-level CPU limits (described in the paragraph above) set on a model. This workflow is available via the UI, the SDK and, by extension, direct calls to the API server's endpoint.
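The order of precedence described above can be sketched roughly as follows. This is only an illustration in Python, not the Go code in the API server: the function names, the `fallback_limit` parameter and its default of `"100"` are assumptions made for the example, and the quantity parser handles only plain core counts and millicore (`m`) values.

```python
from decimal import Decimal
from typing import Optional


def parse_cpu(quantity: str) -> Decimal:
    """Parse a CPU quantity such as "2", "0.5" or "500m" into a number of cores."""
    if quantity.endswith("m"):
        # Millicores: "500m" is half a core.
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def resolve_cpu_limit(
    cpu_request: str,
    user_cpu_limit: Optional[str] = None,
    scaling_factor: Decimal = Decimal("0"),
    fallback_limit: str = "100",  # hypothetical "very large" placeholder value
) -> Decimal:
    """Return the CPU limit (in cores) that would be applied to a deployment."""
    # 1. A limit supplied by the user overrides the platform-level behaviour.
    if user_cpu_limit is not None:
        return parse_cpu(user_cpu_limit)
    # 2. Otherwise, scale the CPU request if a non-zero factor is configured.
    if scaling_factor > 0:
        return parse_cpu(cpu_request) * scaling_factor
    # 3. Otherwise the limit is treated as "unset"; since it cannot be omitted
    #    entirely, a large placeholder value is used instead.
    return parse_cpu(fallback_limit)
```

With a request of 2 CPU and a scaling factor of 1.5 this yields 3 cores, matching the example in the list above; passing `user_cpu_limit` bypasses both platform-level branches.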

UI:

SDK:

merlin.deploy(
    version_1,
    resource_request=merlin.ResourceRequest(
        min_replica=0,
        max_replica=0,
        cpu_request="0.5",
        cpu_limit="2",  # overrides the platform-level CPU limit
        memory_request="1Gi",
    ),
)

In addition, this PR adds a new configuration, DefaultEnvVarsWithoutCPULimits, which is a list of env vars that automatically get added to all Merlin models and transformers when CPU limits are not set. This allows the Merlin API server's operators to set env vars platform-wide that can potentially improve these deployments' performance, e.g. env vars involving concurrency.
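As a rough sketch of the behaviour described above, the default env vars could be merged into a deployment's env vars only when no CPU limit applies. The function names, the use of plain dictionaries, and the rule that user-specified values take precedence over platform defaults are all assumptions made for illustration; the PR description does not state which set wins on a name clash.

```python
from typing import Dict, Optional


def merge_env_vars(*env_var_sets: Dict[str, str]) -> Dict[str, str]:
    """Merge any number of env var sets; later sets override earlier ones."""
    merged: Dict[str, str] = {}
    for env_vars in env_var_sets:
        merged.update(env_vars)
    return merged


def build_env_vars(
    user_env_vars: Dict[str, str],
    default_env_vars_without_cpu_limits: Dict[str, str],
    user_cpu_limit: Optional[str],
    cpu_limit_scaling_factor: float,
) -> Dict[str, str]:
    """Add the platform-wide defaults only when the CPU limit is left unset."""
    # The limit counts as unset when the user gave none and the platform
    # scaling factor is 0 (so no limit is derived from the CPU request).
    cpu_limit_unset = user_cpu_limit is None and cpu_limit_scaling_factor == 0
    if cpu_limit_unset:
        # Assumption: user-specified values win over platform defaults.
        return merge_env_vars(default_env_vars_without_cpu_limits, user_env_vars)
    return dict(user_env_vars)
```

For instance, with a hypothetical default of `{"WORKERS": "4"}`, a deployment without a CPU limit would receive `WORKERS=4` unless it defines `WORKERS` itself, while a deployment with `cpu_limit="2"` would be left untouched.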

Modifications

api/cluster/resource/templater.go - Refactoring of templater methods to set default env vars when CPU limits are not explicitly set and the CPU limit scaling factor is set to 0
api/config/config.go - Addition of the new field DefaultEnvVarsWithoutCPULimits
api/config/config_test.go - Addition of a new unit test to test the parsing of configs from .yaml files
docs/user/templates/model_deployment/01_deploying_a_model_version.md - Addition of docs to demonstrate how the platform-level CPU limits can be overridden
python/sdk/merlin/resource_request.py - Addition of a new CPU limit field to the resource request class
ui/src/pages/version/components/forms/components/CPULimitsFormGroup.js - Addition of a new form group to allow CPU limits to be specified on the UI

Tests

Deploying existing models (and transformers) with and without CPU limits set

Checklist

Added PR label
Added unit test, integration, and/or e2e tests
Tested locally
Updated documentation
Updated Swagger spec if the PR introduces API changes
Regenerated Golang and Python client if the PR introduces API changes

Release Notes

NONE

codecov · 2024-05-17T03:23:18Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.50%. Comparing base (cfd27a2) to head (6f1f13b).
Report is 8 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #586      +/-   ##
==========================================
+ Coverage   60.31%   60.50%   +0.19%     
==========================================
  Files         274      274              
  Lines       21930    22066     +136     
==========================================
+ Hits        13226    13351     +125     
- Misses       7855     7859       +4     
- Partials      849      856       +7

Flag             Coverage Δ
api-test         58.46% <ø> (+0.22%) ⬆️
sdk-test-3.10    75.43% <ø> (-0.05%) ⬇️
sdk-test-3.8     75.35% <ø> (-0.05%) ⬇️
sdk-test-3.9     75.35% <ø> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.

api/cluster/resource/templater.go

api/cluster/resource/templater_test.go

api/config/config_test.go

api/models/env_var.go

ui/src/pages/version/components/forms/components/AutoscalingPolicyFormGroup.js

ui/src/pages/version/components/forms/components/ImageBuilderSection.js

.github/workflows/merlin.yml

api/cluster/resource/templater.go

ui/src/pages/version/components/forms/components/CPULimitsFormGroup.js

ui/src/components/ResourcesConfigTable.js

leonlnj

thanks, LGTM.

api/cluster/resource/templater_test.go

deadlycoconuts added 4 commits May 17, 2024 11:11

Add new unit test helper method to sort env vars in isvc specs

03baa77

Extend merge env vars helper function to merge more than 2 sets of env vars

ffd4e9a

Add new configs to set default env vars when cpu limits are not specified

3c9306b

Add unit tests to test parsing of default env vars when cpu limits are not set

fb3669d

deadlycoconuts added the enhancement (New feature or request) label May 17, 2024

deadlycoconuts self-assigned this May 17, 2024

deadlycoconuts added 2 commits May 17, 2024 12:06

Fix environment service unit tests

7fe8aa8

Fix model service deployment unit tests

8efd86f

deadlycoconuts force-pushed the make_cpu_limits_configurable branch from 7dd10af to 8efd86f Compare May 17, 2024 05:53

deadlycoconuts added 3 commits May 17, 2024 17:09

Add config parsing tests

7f2e7c4

Make env var setters return errors that will then be checked

c6ff954

Add component to make cpu limits configurable in ui

1d4661d

deadlycoconuts force-pushed the make_cpu_limits_configurable branch from 397ed22 to 1d4661d Compare May 20, 2024 05:12

deadlycoconuts added 12 commits May 20, 2024 13:48

Update swagger docs and autogenerated openapi files

58a6b84

Make model page display cpu limits if configured

b22ab30

Update SDK to expose cpu limit field

9c82fd2

Fix sdk deploy method

b69c772

Revert redundant changes to cpu limit of default configs in sdk tests

8a49a81

Fix incorrect change to sdk integration test

42c751d

Fix integration tests

80248a1

Update tool tip and form group descriptions for cpu limit form

ad7d653

Update docs

450bd94

Refactor default env vars to use a different struct

eff5126

Add cpu limit regex check

a255f43

Refactor cpu limit as nullable field

ac9cb63

deadlycoconuts force-pushed the make_cpu_limits_configurable branch from 8535c51 to ac9cb63 Compare May 24, 2024 03:08

Add cpu limits form group to transformer step and cleanup code

54b2a2d