
feat: translate gateway ports to proxied tasks #9398

Merged: hamidzr merged 51 commits into notebook_proxy_feature_branch from nbp-ports on May 28, 2024


hamidzr (Contributor) commented May 20, 2024

Ticket

https://hpe-aiatscale.atlassian.net/issues/RM-271

Description

Test Plan

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

cla-bot added the cla-signed label May 20, 2024

codecov bot commented May 20, 2024

Codecov Report

Attention: Patch coverage is 57.84314%, with 86 lines in your changes missing coverage. Please review.

Project coverage is 40.74%. Comparing base (bc80a2a) to head (afe848c).

Additional details and impacted files
@@                        Coverage Diff                        @@
##           notebook_proxy_feature_branch    #9398      +/-   ##
=================================================================
- Coverage                          46.49%   40.74%   -5.76%     
=================================================================
  Files                                743      665      -78     
  Lines                             106596    77679   -28917     
  Branches                            2405        0    -2405     
=================================================================
- Hits                               49567    31652   -17915     
+ Misses                             56834    46027   -10807     
+ Partials                             195        0     -195     
Flag      Coverage Δ
harness   37.94% <ø> (-26.14%) ⬇️
web       ?

Flags with carried forward coverage won't be shown.

Files Coverage Δ
master/internal/rm/kubernetesrm/gateway_spec.go 100.00% <100.00%> (ø)
master/internal/rm/kubernetesrm/request_queue.go 85.71% <100.00%> (ø)
master/internal/task/allocation.go 76.39% <100.00%> (ø)
master/internal/rm/kubernetesrm/spec.go 76.34% <98.21%> (ø)
master/internal/rm/kubernetesrm/request_workers.go 81.14% <78.57%> (ø)
master/internal/rm/kubernetesrm/pods.go 20.86% <0.00%> (ø)
master/internal/rm/kubernetesrm/gateway_service.go 56.71% <50.00%> (ø)
master/internal/rm/kubernetesrm/pod.go 74.20% <30.18%> (ø)

... and 1150 files with indirect coverage changes

hamidzr marked this pull request as ready for review May 28, 2024 07:00
hamidzr requested review from a team as code owners May 28, 2024 07:00
hamidzr requested review from NicholasBlaskey, azhou-determined and loksonarius and removed request for a team May 28, 2024 07:00
master/internal/rm/kubernetesrm/gateway_service.go (outdated, resolved)
@@ -3171,11 +3171,11 @@ jobs:
       - run:
           name: Start defaultrm minikube
           command: |
-            source tools/k8s/launch-minikube-with-gateway.sh defaultrm
+            K8S_VERSION=1.29.5 source tools/k8s/launch-minikube-with-gateway.sh defaultrm
Contributor:

Do we need a specific Kubernetes version?

hamidzr (Contributor Author), May 28, 2024:

This is the version I've been running locally, and I think it's also the lowest we want to support. It'd be a good idea to have CI run the oldest supported version as well. What do you think?

Contributor:

Totally agree.

We need some way to update it as new versions come out. Previously the pinned version would just fall out of date and the tests would start failing, which is why we unpinned it.

var started *sproto.ResourcesStarted
// PERF: call once for all pods
gwPortMap, err := p.gatewayService.getDeployedPortMap()
Contributor:

Nice job. I think this is a good start for reattach.
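The `// PERF: call once for all pods` note in the snippet above points at hoisting the `getDeployedPortMap()` lookup out of per-pod handling so that a single call serves every pod. A minimal Go sketch of that pattern follows. It assumes the port map is keyed by pod name with one gateway port per pod; `deployedPortMap`, `fakeGatewayService`, and `resolveGatewayPorts` are hypothetical names used for illustration, not the actual kubernetesrm types or API.

```go
package main

import (
	"fmt"
	"sort"
)

// deployedPortMap is a HYPOTHETICAL stand-in for whatever getDeployedPortMap
// returns in kubernetesrm; here it maps pod name -> exposed gateway port.
type deployedPortMap map[string]int

// fakeGatewayService is a HYPOTHETICAL stub that counts how many times the
// (presumably expensive) port-map lookup runs.
type fakeGatewayService struct {
	ports deployedPortMap
	calls int
}

func (g *fakeGatewayService) getDeployedPortMap() (deployedPortMap, error) {
	g.calls++
	// Return a copy so callers cannot mutate the stub's internal state.
	out := make(deployedPortMap, len(g.ports))
	for k, v := range g.ports {
		out[k] = v
	}
	return out, nil
}

// resolveGatewayPorts fetches the port map once and reuses it for every pod,
// rather than calling getDeployedPortMap inside the per-pod loop.
// Pods with no deployed gateway port are skipped.
func resolveGatewayPorts(g *fakeGatewayService, pods []string) (map[string]int, error) {
	portMap, err := g.getDeployedPortMap() // single lookup shared by all pods
	if err != nil {
		return nil, err
	}
	resolved := make(map[string]int, len(pods))
	for _, pod := range pods {
		if port, ok := portMap[pod]; ok {
			resolved[pod] = port
		}
	}
	return resolved, nil
}

func main() {
	// Illustrative values only.
	gw := &fakeGatewayService{ports: deployedPortMap{"pod-a": 30000, "pod-b": 30001}}
	resolved, err := resolveGatewayPorts(gw, []string{"pod-a", "pod-b", "pod-c"})
	if err != nil {
		panic(err)
	}
	// Sort names so the output order is deterministic.
	names := make([]string, 0, len(resolved))
	for name := range resolved {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s -> %d\n", name, resolved[name])
	}
	fmt.Println("lookups:", gw.calls)
}
```

Running it prints the two resolved pods (pod-c has no deployed gateway port and is skipped) followed by `lookups: 1`, showing that the map is fetched once regardless of how many pods are processed.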

hamidzr self-assigned this May 28, 2024

hamidzr (Contributor Author) commented May 28, 2024

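I've seen some priority-setting integration tests fail intermittently here. I also see them fail on the target branch, so I'm ignoring them for this PR. The Makefile change below narrows `test-intg` to the `SetGroupPrio` tests in `./internal/rm/kubernetesrm`, with `-count=1` to bypass Go's test cache: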

diff --git a/master/Makefile b/master/Makefile
index 8514e7067c..7dc869195b 100644
--- a/master/Makefile
+++ b/master/Makefile
@@ -195,7 +195,7 @@ test-intg: export DET_INTEGRATION_POSTGRES_URL ?= postgres://postgres:postgres@l
 test-intg: export DET_INTEGRATION_ES_HOST ?= localhost
 test-intg: export DET_INTEGRATION_ES_PORT ?= 9200
 test-intg: build/mock_gen.stamp
-	gotestsum --junitfile test-intg.junit.xml -- -tags=integration -race -coverprofile=coverage.out -covermode atomic -cover ./...
+	gotestsum --junitfile test-intg.junit.xml -- -count=1 -tags=integration -race -coverprofile=coverage.out -covermode atomic -cover ./internal/rm/kubernetesrm -run SetGroupPrio
 
 .PHONY: pre-package
 pre-package:

hamidzr changed the title from "translate gateway ports to proxied tasks" to "feat: translate gateway ports to proxied tasks" May 28, 2024
hamidzr merged commit 4c2cc08 into notebook_proxy_feature_branch May 28, 2024
77 of 95 checks passed
hamidzr deleted the nbp-ports branch May 28, 2024 17:40
hamidzr pushed a commit that referenced this pull request Jun 12, 2024
)

chore: gateway startup without pwdless sudo (#9382)

feat: translate gateway ports to proxied tasks (#9398)

chore: add a no-dependency multi-port multi-trial test exp (#9432)

feat: make gw port range configurable; add validation (#9458)

https://hpe-aiatscale.atlassian.net/browse/RM-267
https://hpe-aiatscale.atlassian.net/browse/RM-288

roughly rebased

building

tests running

newline

progress?

tests passing

working

cleanup

hopefully this fixes this

chore: bump up default gw listeners to 128 (#9474)

test: cherry pick port registry tests (#9471)

bump up slots per trial to 2 on a new set of tests

Reattach gateways (#9481)

docs: gateway docs part 1 (#9488)

also rename internal exposeProxyConfig var

helm values for bugbash

cpu slot types in helm values

more limited tests

better use test parametrize

some doc updates

document min k8s version

update the mtls notice

reset helm values

publish uncompressed docs

takeout test changes
2 participants