
Retrieving logs on pod ready #5229

Closed
manusa opened this issue Jun 10, 2023 · 16 comments · Fixed by #5245

@manusa (Member) commented Jun 10, 2023

Description

In #4695 / #4637 we introduced some waiting logic for regular operations. One of these waits was related to log retrieval.
Although this seems like a good idea, it doesn't fit all purposes. Some Pods might have started but not be ready yet, or might have failed, which makes them unready. However, they might have logged something that can be crucial for detecting bugs and so on.

For all of these use cases, we should remove the readiness check and replace it with something else that simply detects whether the Pod is live and can be queried for logs.

/cc @shawkins

@shawkins (Contributor)

> In #4695 / #4637 we introduced some waiting logic for regular operations.

Yes, previously it was only on watchLog.

> For all of these use cases, we should remove the readiness check and replace it with something else that simply detects whether the Pod is live and can be queried for logs.

This was touched on a little in #4741, but there wasn't any follow-up at that time. The default retry logic should now be responsible for 500 errors. That just leaves a narrow possibility of some 400 errors that we're taking responsibility to wait for, but with a check on pod status that looks for ready or succeeded. That existing check, even for watchLog, is problematic for determining an early exit, as you're noting.

Options include:

  • I think that kubectl doesn't bother with any log/pod-operation-specific retry check, so we could consider doing the same.
  • Expand the checks in waitUntilReadyOrSucceded to exit early (if the pod is null or terminated in any way, then exit).
  • Change the retry logic to be based upon attempting the operation(s) and retrying based upon the status code (at least 400, but I'm not sure about what others).
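The third option could be sketched as a status-code-driven policy. This is a minimal, hypothetical illustration in plain Java, not the client's actual code; the class name and the specific codes that warrant a retry are assumptions:

```java
// Illustrative sketch (hypothetical helper, not the client's actual code):
// attempt the log request first, then decide from the HTTP status code
// whether a retry could help, instead of pre-checking pod readiness.
final class LogRetryPolicy {

    /** Returns true when retrying the log request might succeed. */
    static boolean shouldRetry(int statusCode) {
        if (statusCode >= 500) {
            // Server-side errors: the generic retry logic already covers these.
            return true;
        }
        // 400 may be returned while the container is still starting, so a
        // bounded retry can help; other 4xx codes are treated as permanent.
        return statusCode == 400;
    }
}
```

The appeal of this approach is that it makes no assumption about pod state at all; only the server's actual answer drives the decision.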

@manusa (Member, Author) commented Jun 12, 2023

I'm not sure what "I think that kubectl doesn't bother with any log/pod operation specific retry check, so we could consider doing the same" means. As usual, our behavior should resemble as much as possible that provided by kubectl or client-go.

IMO, from the context of #4741 and from what I see in the described use-case, if we get an error, I think we should just fail fast. Once the connection has been established, there shouldn't be any recovery attempt since there's no way to determine what was already processed (the purpose of bookmarks in watches). If the connection has not yet been established, then we should fail fast with an Exception providing the reason (Pod ceased to exist, Pod is not accepting connections, and so on).

@shawkins (Contributor)

> I'm not sure what "I think that kubectl doesn't bother with any log/pod operation specific retry check, so we could consider doing the same" means. As usual, our behavior should resemble as much as possible that provided by kubectl or client-go.

It means there's no wait or retry for ready or any other condition when performing pod operations.

@manusa (Member, Author) commented Jun 12, 2023

I think that for Pod log retrieval we should basically do the same as kubectl (don't bother), and fail fast with an exception.
Any UX improvement (retry, wait for the next restart, etc.) might actually break many of the other use cases (as previously discussed).

@shawkins (Contributor)

> might actually break many of the other use cases (as previously discussed)

Just to make sure there's no confusion about what's there currently: the timeout has soft enforcement and defaults to 5 seconds. If the pod does not become ready or succeeded in that amount of time, the operation will still proceed. Granted, even 5 seconds could be too long an artificial wait if expecting an early exit.
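The "soft enforcement" described above can be sketched in plain Java, independent of the client. This is a hypothetical helper for illustration only: it polls a condition up to a timeout but lets the caller proceed with the log request either way, rather than throwing on timeout:

```java
import java.util.function.BooleanSupplier;

// Minimal sketch of soft enforcement (hypothetical helper, not the
// client's actual code): wait up to a timeout for a condition, but
// never fail; the caller proceeds with the operation in both cases.
final class SoftWait {

    /** Returns true if the condition held before the timeout, false otherwise. */
    static boolean waitSoftly(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                return false; // timed out: the caller still proceeds
            }
            Thread.onSpinWait();
        }
        return true;
    }
}
```

The return value only signals whether the wait paid off; nothing is thrown, which is exactly why the behavior is easy to miss from the call site.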

@manusa (Member, Author) commented Jun 12, 2023

OK, I clearly missed this.

Then we should improve the Javadoc to reflect this; both watchLog and getLog are subject to it.

We should add a clarifying paragraph to all entries stating that by default the client waits 5 seconds for the Pod to be ready, but that this can be changed with .withReadyWaitTimeout($timeoutInMs).watchLog().

@manusa (Member, Author) commented Jun 12, 2023

BTW, the time units seem to be wrong too.
In some places they're expressed in ms:

And in others they're used as seconds:

    PodOperationUtil.waitUntilReadyOrSucceded(this,
        getContext().getReadyWaitTimeout() != null ? getContext().getReadyWaitTimeout() : DEFAULT_POD_READY_WAIT_TIMEOUT);

@shawkins (Contributor)

> In some places they're expressed in ms:

That came from the existing Javadoc on withLogWaitTimeout, so that's always been wrong too :(

@manusa (Member, Author) commented Jun 12, 2023

Since ms is the publicly exposed unit (and is more fine-grained), I think it should be easy to change everything internally to match it.

@cmdjulian commented Jun 14, 2023

I had the exact same problem because of this: when the pod failed very quickly, no logs could be retrieved, as the readiness condition no longer evaluated to true. A pod is log-streamable if it's in Failed, Succeeded, or Running, so it doesn't have to be ready.

I fixed that by waiting until the pod was ready to stream logs with the following code:

fun Pod.ready(): Boolean = when (status?.phase) {
    "Running", "Succeeded", "Failed" -> true
    else -> false
}

client.pods()
    .withLabel(JOB_ID_LABEL, "$jobId")
    .waitUntilCondition({ pod: Pod? -> pod?.ready() == true }, 150, TimeUnit.SECONDS)

Checking the status works very reliably.
withReadyWaitTimeout would wait indefinitely for the pod to become ready when it's already failed. However, it's still log-streamable.

@shawkins (Contributor)

> withReadyWaitTimeout would wait indefinitely for the pod to become ready when it's already failed. However, it's still log-streamable.

Can you elaborate on that? Were you explicitly setting withReadyWaitTimeout? If so, the likely problem was the mismatch between the Javadocs (ms) and the logic actually expecting seconds, compounded by the lack of fail-fast behavior.

This issue should address that mismatch and improve the fail-fast nature of the check being performed.

@cmdjulian

Okay, picture the following: I create a Pod that runs, for instance, a Python script. This Python script has a syntax error in its first line, so the pod fails and is now in the Failed phase.
After creating the Pod, I immediately call client.batch().v1().jobs().withName("$jobId").withReadyWaitTimeout(150).watchLog().

The problem now arises from PodOperationUtil.waitUntilReadyOrSucceded(), which is called as part of the withReadyWaitTimeout() method. It is defined as:

    try {
      // Wait for Pod to become ready or succeeded
      podOperation.waitUntilCondition(p -> {
        podRef.set(p);
        return p != null && (Readiness.isPodReady(p) || Readiness.isPodSucceeded(p));
      },
          logWaitTimeout,
          TimeUnit.SECONDS);
    }

If the Pod has already terminated, with Failed for instance, neither of these conditions matches it, and we get stuck inside podOperation.waitUntilCondition() until the timeout is reached.

Even kubectl allows querying a Pod for logs if it's dead.

The logic above only works if the Pod has not started yet, finished successfully, or is currently Running.
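A fail-fast variant of that exit check could also treat Failed (or a missing pod) as a reason to stop waiting. A minimal sketch on the phase string alone; the class and method names are illustrative, not the library's actual API:

```java
// Hypothetical reworked exit check (illustrative, not the library's API):
// stop waiting as soon as logs can be requested, or waiting cannot help.
final class PodLogGate {

    /** Logs can be requested once the pod is running or has terminated. */
    static boolean isStreamable(String phase) {
        return "Running".equals(phase)
            || "Succeeded".equals(phase)
            || "Failed".equals(phase);
    }

    /** Exit the wait loop when the pod is gone or already streamable. */
    static boolean shouldStopWaiting(String phase) {
        return phase == null || isStreamable(phase);
    }
}
```

With a check like this, a pod that crashes immediately (the Failed phase above) exits the wait loop at once instead of burning the whole timeout.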

@shawkins (Contributor)

> The logic above only works if the Pod has not started yet, finished successfully, or is currently Running.

Yes, we're on the same page, that is what I'm referring to as better fail fast behavior.

@cmdjulian commented Jun 14, 2023

If I understood you correctly, this means that as a user of the client I have to make sure on my own that the pod is queryable for logs, right? The lib no longer does any conditional waiting until some ready state is reached?

kubectl doesn't have this feature currently, but there are a few discussions about it, and it's a request from multiple users:

Can't we just adjust the existing .withReadyWaitTimeout() to do the right thing here, also including Failed pods as an allowed state to break out of the waiting? Calling logWatch(), on the other hand, skips all that waiting.

shawkins added a commit to shawkins/kubernetes-client that referenced this issue Jun 14, 2023
also fixing the timeout units to match the javadocs
@shawkins (Contributor)

> If I understood you correctly, this means that as a user of the client I have to make sure on my own that the pod is queryable for logs, right?

Please review #5245

@cmdjulian

> > If I understood you correctly, this means that as a user of the client I have to make sure on my own that the pod is queryable for logs, right?
>
> Please review #5245

Sorry for the buzz, looks good 😊

manusa added this to the 6.7.2 milestone Jun 15, 2023
manusa pushed a commit to shawkins/kubernetes-client that referenced this issue Jun 15, 2023
also fixing the timeout units to match the javadocs
manusa pushed a commit that referenced this issue Jun 15, 2023
also fixing the timeout units to match the javadocs