Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: explicit GPU runner mappings #2862

Merged
merged 16 commits into from
Aug 12, 2022
Merged

Conversation

jjmachan
Copy link
Contributor

@jjmachan jjmachan commented Aug 4, 2022

What does this PR address?

Fixes #2770
With this change, the NvidiaGpuResource will accept and return a list that specifies the exact GPUs to use. This enables the user to configure exactly which GPU to map to each runner worker.

Runner currently supports:

runners:
  resources: 
    nvidia.com/gpu: 4 => still valid but will be automatically converted to use GPUs [0, 1, 2, 3] internally.

This PR introduces:

  resources: 
    nvidia.com/gpu: [2, 4] => only use GPU [2, 4] for runner worker 1 and 2

Before submitting:

Who can help review?

Feel free to tag members/contributors who can help review your PR.

@jjmachan jjmachan requested a review from a team as a code owner August 4, 2022 19:01
@jjmachan jjmachan requested review from sauyon and bojiang and removed request for a team August 4, 2022 19:01
@jjmachan jjmachan mentioned this pull request Aug 4, 2022
5 tasks
@codecov
Copy link

codecov bot commented Aug 4, 2022

Codecov Report

Merging #2862 (9439bc3) into main (9305132) will increase coverage by 0.16%.
The diff coverage is 95.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2862      +/-   ##
==========================================
+ Coverage   70.10%   70.27%   +0.16%     
==========================================
  Files         114      120       +6     
  Lines        9811     9782      -29     
==========================================
- Hits         6878     6874       -4     
+ Misses       2933     2908      -25     
Impacted Files Coverage Δ
bentoml/_internal/resource.py 76.96% <95.45%> (+0.86%) ⬆️
bentoml/_internal/runner/strategy.py 93.10% <100.00%> (+18.96%) ⬆️
bentoml/_internal/utils/formparser.py 20.00% <0.00%> (-57.94%) ⬇️
bentoml/_internal/runner/utils.py 90.16% <0.00%> (-1.70%) ⬇️
bentoml/_internal/bento/build_config.py 67.62% <0.00%> (-0.60%) ⬇️
bentoml/_internal/runner/runner_handle/remote.py 88.17% <0.00%> (-0.13%) ⬇️
bentoml/_internal/server/service_app.py 87.94% <0.00%> (-0.09%) ⬇️
bentoml/_internal/cli/bento_management.py
bentoml/_internal/server/cli/dev_api_server.py
bentoml/_internal/service/openapi.py
... and 50 more

bentoml/_internal/resource.py Outdated Show resolved Hide resolved
bentoml/_internal/resource.py Outdated Show resolved Hide resolved
bentoml/_internal/resource.py Outdated Show resolved Hide resolved
def from_spec(cls, spec: t.Union[int, str, t.List[int | str]]) -> t.List[int]:
if not isinstance(spec, (int, str, t.List)):
raise TypeError(
"NVidia GPU resource limit must be int, str or a list specifing the exact GPUs to use."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "GPU device IDs".

Copy link
Contributor Author

@jjmachan jjmachan Aug 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This area is my only concern with this PR. is this a bit too confusing?
if the user gives
nvidia.com/gpu: 3 then it is the number of resources
``nvidia.com/gpu: [3]` is the specific GPU number

I feel like if it is documented properly it should be fine, what do you think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add more info in the doc section and this type error can point to that section of our doc

bentoml/_internal/resource.py Show resolved Hide resolved
bentoml/_internal/resource.py Outdated Show resolved Hide resolved
@jjmachan
Copy link
Contributor Author

jjmachan commented Aug 5, 2022

Let me know if the change to the resource config is okay and if it is, I'll update the configuration guide too, talking about how we can configure each runner specifically

jjmachan and others added 4 commits August 7, 2022 16:43
Co-authored-by: Aaron Pham <29749331+aarnphm@users.noreply.github.com>
Co-authored-by: Aaron Pham <29749331+aarnphm@users.noreply.github.com>
Co-authored-by: Aaron Pham <29749331+aarnphm@users.noreply.github.com>
sauyon
sauyon previously approved these changes Aug 7, 2022
Copy link
Contributor

@sauyon sauyon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this pretty much looks good to me minus the one comment.

bentoml/_internal/resource.py Outdated Show resolved Hide resolved
@pep8speaks
Copy link

pep8speaks commented Aug 8, 2022

Hello @jjmachan, Thanks for updating this PR.

There are currently no PEP 8 issues detected in this PR. Cheers! 🍻

Comment last updated at 2022-08-10 19:00:29 UTC

def test_default_gpu_strategy(monkeypatch):
monkeypatch.setattr(strategy, "get_resource", unvalidated_get_resource)
assert DefaultStrategy.get_worker_count(GPURunnable, {"nvidia.com/gpu": 2}) == 2
assert DefaultStrategy.get_worker_count(GPURunnable, {"nvidia.com/gpu": 0}) == 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was working as expected @sauyon . Now if resources are not specified it will default to a worker count of 1 and give a warning saying "no resource found, falling back to using a single worker" which was the logic earlier

Copy link
Member

@bojiang bojiang Aug 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it is the behavior caused by login in DefaultStrategy. I wonder if it's correct but we may solve that in another PR: #2894

Copy link
Member

@aarnphm aarnphm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some very very style details. 😄

tests/unit/_internal/test_configuration.py Outdated Show resolved Hide resolved
tests/unit/_internal/test_configuration.py Outdated Show resolved Hide resolved
tests/unit/_internal/runner/test_strategy.py Outdated Show resolved Hide resolved
tests/unit/_internal/runner/test_strategy.py Show resolved Hide resolved
return resource_get_resource(x, y, validate=False)


def test_default_gpu_strategy(monkeypatch):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Types of monkeypatch is

if TYPE_CHECKING:
    from _pytest.monkeypatch import MonkeyPatch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!
1 Q: the if TYPE_CHECKING is a check to improve perf type checker right?

@ssheng ssheng merged commit 5860906 into bentoml:main Aug 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: Runner to GPUs mapping
6 participants