Skip to content

Conversation

andmat900
Copy link
Contributor

Applicable Issues

Description of the Change

This enables termination log retrieval from an arbitrary pod of a job.

Alternate Designs

The initial proposal includes two alternatives:

  • a modified version of the existing terminationLog() function which takes pod name prefix as argument
  • and a more generic terminationLogs() function that returns a map of termination logs for all pods/containers of a job, allowing the caller to process the output

Possible Drawbacks

Sign-off

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or

(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or

(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.

(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.

Signed-off-by: Andrei Matveyeu, andrei.matveyeu@axis.com

@andmat900 andmat900 requested a review from a team as a code owner January 7, 2025 11:24
@andmat900 andmat900 requested review from t-persson and fredjn and removed request for a team January 7, 2025 11:24
@andmat900 andmat900 requested review from t-persson and fredjn January 20, 2025 15:29
Comment on lines 127 to 136
// JobResults describes the status and result of an ETOS job which consists of one or more pods
type JobResults struct {
PodResults []PodResults
}

// PodResults describes the status and result of a single pod of an ETOS job which consists of one or more containers
type PodResults struct {
Name string
ContainerResults []Result
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need two structs instead of just having a single Results struct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be able to determine conclusions and verdicts on different levels: container/pod/job/job group.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?
Don't we just want to get the termination log from the relevant Job. The callers would then need to do "Hey, I have this job, what's the result of the container with name=?"
What happens internally in the terminationLogs function should not matter to the caller.
Internally in the terminationLogs we would need to check which is the "current" pod that we should check result from, and then get the result of the proper container in that one.

So the API for callers would be the same:

result, err := terminationLog(ctx, r, environmentProvider, environmentrequest.Name)
if err != nil {
  result.Description = err.Error()
}
if result.Description == "" {
  result.Description = "Failed to provision an environment - Unknown error"
}
description = fmt.Sprintf("%s; %s: %s", description, environmentProvider.Name, result.Description)

where the terminationLogs is something similar to this:

func terminationLog(...) (*Result, error) {
  var pods corev1.PodList
  if err := c.List(ctx, &pods, ...); err != nil {
    ...
    return
  }
  if len(pods.Items) == 0 {
    return &Result{...}
  )

  pod := getLatestPodCreated(pods)
  for _, status := pod.Status.ContainerStatues {
     ...
  }
}

or if we want to get results from all pods

type Results struct {
  Result  // The result of the last created pod (the one that should be relevant)
  Results []Result
}

func terminationLog(...) (*Results, error) {
  var pods corev1.PodList
  if err := c.List(ctx, &pods, ...); err != nil {
    ...
    return
  }
  if len(pods.Items) == 0 {
    return &Result{...}
  )
  for _, pod := range pods {
    ...
    if isLatest() {
      results.Description = result.Description
      results.Verdict = result.Verdict
      results.Conclusion = result.Conclusion
    }
    results.Results = append(results.Results, result)
    ...
  }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Why getLatestPodCreated()? The original issue requires multiple pods: We should be able to get termination logs from multiple pods #268
  2. In the last example, why is isLatest() determining the Verdict and Conclusion for all pods? Isn't it the combined result of all pods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using single Result struct.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Why getLatestPodCreated()? The original issue requires multiple pods: We should be able to get termination logs from multiple pods #268

That means that there may be multiple pods for a job, and we should get the results from the pod that is currently "active". We still only get the results from one pod, but we need to iterate over all pods to find the one whose results we should be interested in.

  1. In the last example, why is isLatest() determining the Verdict and Conclusion for all pods? Isn't it the combined result of all pods?

As I wrote above: Our jobs will have at-most a single relevant pod, but they may launch multiple pods if we implement retries or they get rolled over to a new node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified solution starting from commit 1f8fa80.

@andmat900 andmat900 requested a review from t-persson January 21, 2025 15:24
@andmat900 andmat900 force-pushed the 20250107_termination_log_multiple branch from 94be94c to 46446cd Compare January 27, 2025 15:21
@andmat900 andmat900 requested a review from fredjn January 27, 2025 15:31
Change-Id: Ief2cd8748054cda8b07593b8db8f4592e26344cb
Change-Id: Ic78f6007359660c83520910f86cf5f6bd075abb1
@andmat900 andmat900 merged commit 9096806 into eiffel-community:main Feb 3, 2025
andmat900 added a commit to andmat900/etos that referenced this pull request Feb 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants