feat: add /health endpoint [RM-114] #9062

Merged
merged 7 commits into from
Mar 29, 2024

Conversation

@NicholasBlaskey (Contributor) commented Mar 27, 2024

Description

Adds a /health endpoint.

Test Plan

Integration and unit tests should cover it.

For manual testing, run det dev curl /health.

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

@cla-bot cla-bot bot added the cla-signed label Mar 27, 2024
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Mar 27, 2024
@determined-ci determined-ci requested a review from a team March 27, 2024 18:58

netlify bot commented Mar 27, 2024

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit 789b984
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/6606bd56f55c270008ad93d2
😎 Deploy Preview https://deploy-preview-9062--determined-ui.netlify.app


codecov bot commented Mar 27, 2024

Codecov Report

Attention: Patch coverage is 52.00000% with 48 lines in your changes are missing coverage. Please review.

Project coverage is 47.06%. Comparing base (8217508) to head (789b984).

Additional details and impacted files
@@           Coverage Diff            @@
##             main    #9062    +/-   ##
========================================
  Coverage   47.05%   47.06%            
========================================
  Files        1154     1154            
  Lines      142272   142372   +100     
  Branches     2423     2423            
========================================
+ Hits        66953    67005    +52     
- Misses      75129    75177    +48     
  Partials      190      190            
Flag Coverage Δ
backend 42.90% <91.66%> (+0.04%) ⬆️
harness 64.07% <29.68%> (-0.06%) ⬇️
web 38.95% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...ster/internal/rm/agentrm/agent_resource_manager.go 47.14% <100.00%> (+0.45%) ⬆️
...nal/rm/kubernetesrm/kubernetes_resource_manager.go 21.42% <100.00%> (+0.94%) ⬆️
master/internal/rm/resource_manager_iface.go 100.00% <ø> (ø)
master/internal/core.go 4.33% <94.73%> (+2.17%) ⬆️
master/internal/rm/kubernetesrm/pods.go 21.06% <77.77%> (+0.54%) ⬆️
harness/determined/common/api/bindings.py 40.01% <29.68%> (-0.07%) ⬇️

@NicholasBlaskey NicholasBlaskey marked this pull request as ready for review March 27, 2024 20:33
@NicholasBlaskey NicholasBlaskey requested review from a team as code owners March 27, 2024 20:33
@kkunapuli (Contributor) left a comment

LGTM - nice work!

All the questions in my comments are for my own understanding / curiosity.

hc.Database = model.Healthy
_, err := db.Bun().NewSelect().Table("cluster_id").Exists(ctx)
if err != nil {
	hc.Database = model.Unhealthy

Contributor

curiosity question: why not return as soon as one element is not healthy?

Contributor Author

To avoid returning partial responses

func TestHealthCheck(t *testing.T) {
	api, _, _ := setupAPITest(t, nil)

	assertHealthCheck := func(t *testing.T, expectedCode int, expectedHealth model.HealthCheck) {

Contributor

Nice - love how this is organized.
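
The helper in the snippet above takes an expected status code and an expected health payload. A rough sketch of that pattern, using only the standard library, might look like the following. The handler, the JSON field names, and the choice of 503 for an unhealthy result are assumptions for illustration, not details confirmed by this PR.

```go
package main

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
)

// Hypothetical stand-ins for the model types used by the real test.
type HealthStatus string

const (
	Healthy   HealthStatus = "healthy"
	Unhealthy HealthStatus = "unhealthy"
)

type HealthCheck struct {
	Status   HealthStatus `json:"status"`
	Database HealthStatus `json:"database"`
}

// healthHandler is a hypothetical /health handler. Returning 503 when the
// overall status is not healthy is an assumption made for this sketch.
func healthHandler(check func() HealthCheck) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		hc := check()
		code := http.StatusOK
		if hc.Status != Healthy {
			code = http.StatusServiceUnavailable
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(hc)
	})
}

// fetchHealth issues GET /health against the handler in-process and returns
// the status code together with the decoded body, so a test can compare both.
func fetchHealth(h http.Handler) (int, HealthCheck, error) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/health", nil)
	h.ServeHTTP(rec, req)
	var hc HealthCheck
	err := json.NewDecoder(rec.Body).Decode(&hc)
	return rec.Code, hc, err
}
```

A test can then call `fetchHealth` once per scenario and compare the code and payload against the expected values, which is the shape the `assertHealthCheck` closure suggests.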

func (p *pods) HealthStatus() model.HealthStatus {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, podInterface := range p.podInterfaces {

Contributor

for my understanding: would range p.podInterfaces iterate over all pods in the cluster? I see this loop exits early based on the first response; I'm curious about the general intent of podInterfaces.

Contributor Author

p.podInterfaces contains a mapping from namespaces to Kubernetes clients.

https://github.com/kubernetes/client-go/blob/1518fca9f06c6a73fc091535b8966c71704e657b/kubernetes/typed/core/v1/pod.go#L43

We used to use namespace all to avoid needing a client per namespace, but this caused permission issues for some customers.
https://github.com/determined-ai/determined/blob/85bb3c8881bbd23c34e07359db096fd337e90a51/master/.golangci.yml#L80C10-L80C16

#
# This is a syntax error. This is a small hack to get around this just
# by removing the "model." prefix.
merge_dict(spec, json.loads(f.read().replace("model.", "")))

Contributor

What's going on here? I'm not familiar with swagger.py.

Contributor Author

We have a swagger spec that we create and host for our API documentation. We also generate bindings.py and api.ts from this swagger spec.

`swagger.py` is used to combine swagger specs, since we generate two of them, and to apply some processing. The first comes from our gRPC API and the second comes from our echo API, generated by swag (https://github.com/swaggo/swag).

The process is here:

$(echo_swagger_patch_dir):
	mkdir -p $(echo_swagger_patch_dir)
$(echo_swagger_patch): $(echo_swagger_patch_dir) $(echo_swagger_source_files)
	swag init -g ../master/cmd/determined-master/main.go -d ../master/. -o $(echo_swagger_patch_dir) -ot json
	jq 'del(.swagger, .info)' $(echo_swagger_patch) > $(echo_swagger_patch).tmp
	mv $(echo_swagger_patch).tmp $(echo_swagger_patch)
build/swagger:
	mkdir -p build/swagger
$(swagger_out): $(source_files) build/swagger $(swagger_patch) $(echo_swagger_patch)
	protoc -I src $(swagger_in) --swagger_out=logtostderr=true,allow_delete_body=true,json_names_for_fields=true:build/swagger
	python3 scripts/swagger.py $@ $(swagger_patch)
	python3 scripts/swagger.py $@ $(echo_swagger_patch)
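
The merge step that `swagger.py` performs is written in Python. Purely to illustrate the idea in the same language as the rest of this PR, here is a Go sketch of "strip the `model.` prefix, then merge the patch into the spec". The names `mergeDict` and `applyPatch` are hypothetical, and the exact merge semantics of the real `merge_dict` (for example, how it treats lists) are an assumption.

```go
package main

import (
	"encoding/json"
	"strings"
)

// mergeDict recursively merges src into dst. Nested objects are merged key
// by key; any other value in src overwrites the value in dst.
func mergeDict(dst, src map[string]any) {
	for k, v := range src {
		srcMap, srcIsMap := v.(map[string]any)
		dstMap, dstIsMap := dst[k].(map[string]any)
		if srcIsMap && dstIsMap {
			mergeDict(dstMap, srcMap)
			continue
		}
		dst[k] = v
	}
}

// applyPatch removes every "model." occurrence from the raw patch text,
// mirroring the small hack in the snippet above, then merges the parsed
// patch into spec.
func applyPatch(spec map[string]any, patch []byte) error {
	cleaned := strings.ReplaceAll(string(patch), "model.", "")
	var parsed map[string]any
	if err := json.Unmarshal([]byte(cleaned), &parsed); err != nil {
		return err
	}
	mergeDict(spec, parsed)
	return nil
}
```

Note that a plain text replacement like this would also rewrite any unrelated string that happens to contain `model.`, which is presumably why the original comment calls it a hack.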

@NicholasBlaskey NicholasBlaskey enabled auto-merge (squash) March 29, 2024 13:10
@NicholasBlaskey NicholasBlaskey merged commit 1a38f0c into main Mar 29, 2024
69 of 82 checks passed
@NicholasBlaskey NicholasBlaskey deleted the health_check_det branch March 29, 2024 13:25
Labels
cla-signed documentation Improvements or additions to documentation
4 participants