
Retrieving logs on pod ready #5229

Closed
manusa opened this issue Jun 10, 2023 · 16 comments · Fixed by #5245

@manusa (Member) commented Jun 10, 2023

Description

In #4695 / #4637 we introduced some waiting logic for regular operations. One of these waits was related to log retrieval.
Although this seems like a good idea, it doesn't fit all purposes. Some Pods might have started but not be ready yet, or might have failed, which makes them unready. However, they might have logged something that can be crucial for detecting bugs and so on.

For all of these use cases, we should remove the readiness check and replace it with something else that simply detects whether the Pod is live and can be queried for logs.

/cc @shawkins

@shawkins (Contributor)

> In #4695 / #4637 we introduced some waiting logic for regular operations.

Yes, previously it was only on watchLog.

> For all of these use cases, we should remove the readiness check and replace it with something else that simply detects whether the Pod is live and can be queried for logs.

This was touched on a little in #4741, but there wasn't any follow-up at that time. The default retry logic should now be responsible for 500 errors. That just leaves a narrow possibility of some 400 errors that we're taking responsibility to wait for, but with a check on pod status that looks for ready or succeeded. That existing check, even for watchLog, is problematic for determining an early exit, as you're noting.

Options include:

  • I think that kubectl doesn't bother with any log/pod-operation-specific retry check, so we could consider doing the same.
  • Expand the checks in waitUntilReadyOrSucceded to exit early (if the pod is null or terminated in any way, then exit).
  • Change the retry logic to be based upon attempting the operation(s) and retrying based upon the status code (at least 400, but I'm not sure about what others).
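The third option could be sketched as a status-code-driven policy. This is a minimal, hypothetical illustration in plain Java, not the client's actual code; the class name and the specific codes that warrant a retry are assumptions:

```java
// Illustrative sketch (hypothetical helper, not the client's actual code):
// attempt the log request first, then decide from the HTTP status code
// whether a retry could help, instead of pre-checking pod readiness.
final class LogRetryPolicy {

    /** Returns true when retrying the log request might succeed. */
    static boolean shouldRetry(int statusCode) {
        if (statusCode >= 500) {
            // Server-side errors: the generic retry logic already covers these.
            return true;
        }
        // 400 may be returned while the container is still starting, so a
        // bounded retry can help; other 4xx codes are treated as permanent.
        return statusCode == 400;
    }
}
```

The appeal of this approach is that it makes no assumption about pod state at all; only the server's actual answer drives the decision.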

@manusa (Member, Author) commented Jun 12, 2023

I'm not sure what "I think that kubectl doesn't bother with any log/pod operation specific retry check, so we could consider doing the same" means. As usual, our behavior should resemble as much as possible that provided by kubectl or client-go.

IMO, from the context of #4741 and from what I see in the described use-case, if we get an error, I think we should just fail fast. Once the connection has been established, there shouldn't be any recovery attempt since there's no way to determine what was already processed (the purpose of bookmarks in watches). If the connection has not yet been established, then we should fail fast with an Exception providing the reason (Pod ceased to exist, Pod is not accepting connections, and so on).

@shawkins (Contributor)

> I'm not sure what "I think that kubectl doesn't bother with any log/pod operation specific retry check, so we could consider doing the same" means. As usual, our behavior should resemble as much as possible that provided by kubectl or client-go.

It means there's no wait or retry for ready or any other condition when performing pod operations.

@manusa (Member, Author) commented Jun 12, 2023

I think that for Pod log retrieval we should basically do the same as kubectl (don't bother), and fail fast with an exception.
Any UX improvement (retry, wait for the next restart, etc.) might actually break many of the other use cases (as previously discussed).

@shawkins (Contributor)

> might actually break many of the other use cases (as previously discussed)

Just to make sure there's no confusion about what's there currently: the timeout has soft enforcement and defaults to 5 seconds. If the pod does not become ready or succeeded in that amount of time, the operation will still proceed. Granted, even 5 seconds could be too long an artificial wait if expecting an early exit.
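The "soft enforcement" described above can be sketched in plain Java, independent of the client. This is a hypothetical helper for illustration only: it polls a condition up to a timeout but lets the caller proceed with the log request either way, rather than throwing on timeout:

```java
import java.util.function.BooleanSupplier;

// Minimal sketch of soft enforcement (hypothetical helper, not the
// client's actual code): wait up to a timeout for a condition, but
// never fail; the caller proceeds with the operation in both cases.
final class SoftWait {

    /** Returns true if the condition held before the timeout, false otherwise. */
    static boolean waitSoftly(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                return false; // timed out: the caller still proceeds
            }
            Thread.onSpinWait();
        }
        return true;
    }
}
```

The return value only signals whether the wait paid off; nothing is thrown, which is exactly why the behavior is easy to miss from the call site.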

@manusa (Member, Author) commented Jun 12, 2023

OK, I clearly missed this.

Then we should improve the Javadoc to reflect this; both watchLog and getLog are subject to it.

We should add a clarifying paragraph to all entries stating that by default the client waits 5 seconds for the Pod to be ready, but that this can be changed with .withReadyWaitTimeout($timeoutInMs).watchLog().

@manusa (Member, Author) commented Jun 12, 2023

BTW, the time units seem to be wrong too.
In some places they're expressed in ms:

And in others they're used as seconds:

    PodOperationUtil.waitUntilReadyOrSucceded(this,
        getContext().getReadyWaitTimeout() != null ? getContext().getReadyWaitTimeout() : DEFAULT_POD_READY_WAIT_TIMEOUT);

@shawkins (Contributor)

> In some places they're expressed in ms:

That came from the existing Javadoc on withLogWaitTimeout, so that's always been wrong too :(

@manusa (Member, Author) commented Jun 12, 2023

Since ms is the publicly exposed unit (and is more fine-grained), I think it should be easy to change everything internally to match it.

@cmdjulian commented Jun 14, 2023

I had the exact same problem because of this: when the pod failed very quickly, no logs could be retrieved, as the readiness condition no longer evaluated to true. A pod is log-streamable if it's in Failed, Succeeded, or Running, so it doesn't have to be ready.

I fixed that by waiting until the pod was ready to stream logs with the following code:

fun Pod.ready(): Boolean = when (status?.phase) {
    "Running", "Succeeded", "Failed" -> true
    else -> false
}

client.pods()
    .withLabel(JOB_ID_LABEL, "$jobId")
    .waitUntilCondition({ pod: Pod? -> pod?.ready() == true }, 150, TimeUnit.SECONDS)

Checking the status works very reliably.
withReadyWaitTimeout would wait indefinitely for the pod to become ready when it's already failed. However, it's still log-streamable.

@shawkins (Contributor)

> withReadyWaitTimeout would wait indefinitely for the pod to become ready when it's already failed. However, it's still log-streamable.

Can you elaborate on that? Were you explicitly setting withReadyWaitTimeout? If so, the likely problem was the mismatch between the Javadocs (ms) and the logic actually expecting seconds, compounded by the lack of fail-fast behavior.

This issue should address that mismatch and improve the fail-fast nature of the check being performed.

@cmdjulian

Okay, picture the following: I create a Pod that runs, for instance, a Python script. This Python script has a syntax error in its first line, so the pod fails and is now in the Failed phase.
After creating the Pod, I immediately call client.batch().v1().jobs().withName("$jobId").withReadyWaitTimeout(150).watchLog().

The problem now arises from PodOperationUtil.waitUntilReadyOrSucceded(), which is called as part of the withReadyWaitTimeout() method. It is defined as:

    try {
      // Wait for Pod to become ready or succeeded
      podOperation.waitUntilCondition(p -> {
        podRef.set(p);
        return p != null && (Readiness.isPodReady(p) || Readiness.isPodSucceeded(p));
      },
          logWaitTimeout,
          TimeUnit.SECONDS);
    }

If the Pod has already terminated, with Failed for instance, neither of these conditions matches it, and we get stuck inside podOperation.waitUntilCondition() until the timeout is reached.

Even kubectl allows querying a Pod for logs if it's dead.

The logic above only works if the Pod has not started yet, finished successfully, or is currently Running.
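A fail-fast variant of that exit check could also treat Failed (or a missing pod) as a reason to stop waiting. A minimal sketch on the phase string alone; the class and method names are illustrative, not the library's actual API:

```java
// Hypothetical reworked exit check (illustrative, not the library's API):
// stop waiting as soon as logs can be requested, or waiting cannot help.
final class PodLogGate {

    /** Logs can be requested once the pod is running or has terminated. */
    static boolean isStreamable(String phase) {
        return "Running".equals(phase)
            || "Succeeded".equals(phase)
            || "Failed".equals(phase);
    }

    /** Exit the wait loop when the pod is gone or already streamable. */
    static boolean shouldStopWaiting(String phase) {
        return phase == null || isStreamable(phase);
    }
}
```

With a check like this, a pod that crashes immediately (the Failed phase above) exits the wait loop at once instead of burning the whole timeout.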

@shawkins (Contributor)

> The logic above only works if the Pod has not started yet, finished successfully, or is currently Running.

Yes, we're on the same page, that is what I'm referring to as better fail fast behavior.

@cmdjulian commented Jun 14, 2023

If I understood you correctly, this means that as a user of the client I have to make sure on my own that the pod is queryable for logs, right? The lib no longer does any conditional waiting until some ready state is reached?

kubectl doesn't have this feature currently, but there are a few discussions about it, and it's a request from multiple users:

Can't we just adjust the existing .withReadyWaitTimeout() to do the right thing here, also including Failed pods as an allowed state to break out of the waiting? Calling logWatch(), on the other hand, skips all that waiting.

shawkins added a commit to shawkins/kubernetes-client that referenced this issue Jun 14, 2023
also fixing the timeout units to match the javadocs
@shawkins (Contributor)

> If I understood you correctly, this means that as a user of the client I have to make sure on my own that the pod is queryable for logs, right?

Please review #5245

@cmdjulian

> > If I understood you correctly, this means that as a user of the client I have to make sure on my own that the pod is queryable for logs, right?
>
> Please review #5245

Sorry for the buzz, looks good 😊

manusa added this to the 6.7.2 milestone Jun 15, 2023
manusa pushed a commit to shawkins/kubernetes-client that referenced this issue Jun 15, 2023
also fixing the timeout units to match the javadocs
manusa pushed a commit that referenced this issue Jun 15, 2023
also fixing the timeout units to match the javadocs