Operation failures corrupt moving average buckets #1634

Open
werkt opened this issue Feb 15, 2024 · 1 comment
werkt (Collaborator) commented Feb 15, 2024

The average decorations in PutOperationStage are not resilient to the state an operation may be in at every source of putOperation, particularly when an error occurs early in the pipeline. The following is logged continuously when, for example, the disk is full and an operation errors out of fetchInputs:

2024-02-14 10:53:23.532 Feb 14, 2024 3:53:23 PM build.buildfarm.worker.InputFetcher failOperation
SEVERE: Cannot report failed operation shard/operations/074d0fcf-e77e-4806-9918-a60045fbaae1
java.lang.IllegalArgumentException: Duration is not valid. See proto definition for valid values. Seconds (-315956351598) must be in range [-315,576,000,000, +315,576,000,000]. Nanos (-237000000) must be in range [-999,999,999, +999,999,999]. Nanos must have the same sign as seconds
    at com.google.protobuf.util.Durations.checkValid(Durations.java:190)
    at com.google.protobuf.util.Durations.normalizedDuration(Durations.java:479)
    at com.google.protobuf.util.Durations.add(Durations.java:452)
    at build.buildfarm.worker.PutOperationStage$OperationStageDurations.addOperations(PutOperationStage.java:200)
    at build.buildfarm.worker.PutOperationStage$AverageTimeCostOfLastPeriod.addOperation(PutOperationStage.java:135)
    at build.buildfarm.worker.PutOperationStage.put(PutOperationStage.java:49)
    at build.buildfarm.worker.InputFetcher.failOperation(InputFetcher.java:334)
    at build.buildfarm.worker.InputFetcher.fetchPolled(InputFetcher.java:211)
    at build.buildfarm.worker.InputFetcher.runInterruptibly(InputFetcher.java:106)
    at build.buildfarm.worker.InputFetcher.run(InputFetcher.java:293)
    at java.lang.Thread.run(Thread.java:748)
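One way the out-of-range value in the exception could arise is sketched below. This is not the buildfarm code: it uses `java.time` in place of the protobuf types, and the class `ZeroTimestampDemo` and its methods are hypothetical names. It assumes a stage end timestamp left at 0 is subtracted from a real start timestamp, and that each resulting duration is added to a running total, as `Durations.add` in the trace suggests.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration only; not taken from PutOperationStage.
final class ZeroTimestampDemo {
  // Bound on seconds quoted in the exception message.
  static final Duration MAX = Duration.ofSeconds(315_576_000_000L);
  static final Duration MIN = MAX.negated();

  // Elapsed time computed by plain timestamp subtraction (end - start).
  static Duration elapsed(Instant start, Instant end) {
    return Duration.between(start, end);
  }

  // True if the duration lies within the range the exception describes.
  static boolean inProtoRange(Duration d) {
    return d.compareTo(MIN) >= 0 && d.compareTo(MAX) <= 0;
  }

  // Number of additions of perOperation before a running total
  // (starting at zero) leaves the valid range.
  static int additionsUntilInvalid(Duration perOperation) {
    if (perOperation.isZero()) {
      throw new IllegalArgumentException("a zero duration never leaves the range");
    }
    Duration total = Duration.ZERO;
    int count = 0;
    while (inProtoRange(total)) {
      total = total.plus(perOperation);
      count++;
    }
    return count;
  }
}
```

In this sketch, `elapsed(Instant.parse("2024-02-14T15:53:23Z"), Instant.EPOCH)` is about -1.7 billion seconds, which is still a valid duration on its own. The running total only leaves the valid range on the 185th addition, which would be consistent with the error appearing once enough failed operations have been accumulated.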

@jacobmou Please have a look at this trace and figure out how we can avoid subtracting invalid (likely 0) timestamps from the worker timestamp. I recommend using only Stopwatch outputs to determine durations, rather than subtracting timestamps whose endpoints may be 0.
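A minimal sketch of the Stopwatch-style approach follows. To stay dependency-free it uses `System.nanoTime()` rather than an actual Stopwatch class, and `StageTimer` is a hypothetical name, not something that exists in buildfarm. The point it illustrates is that a stage which never started reports a zero duration instead of a difference against an unset timestamp.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

// Hypothetical stand-in for a Stopwatch-style timer; not buildfarm code.
final class StageTimer {
  private final LongSupplier nanoClock;
  private long startNanos;
  private boolean running;
  private Duration elapsed = Duration.ZERO;

  // The clock is injectable so the timer can be tested deterministically.
  StageTimer(LongSupplier nanoClock) {
    this.nanoClock = nanoClock;
  }

  // Timer backed by the JVM's monotonic clock.
  static StageTimer system() {
    return new StageTimer(System::nanoTime);
  }

  void start() {
    startNanos = nanoClock.getAsLong();
    running = true;
  }

  // Stopping a timer that was never started is a no-op.
  void stop() {
    if (running) {
      elapsed = elapsed.plusNanos(nanoClock.getAsLong() - startNanos);
      running = false;
    }
  }

  // Time accumulated over completed start/stop intervals;
  // zero if the stage was never reached.
  Duration elapsed() {
    return elapsed;
  }
}
```

With a timer like this per stage, an operation that fails in fetchInputs would contribute zero for the later stages it never entered, so the values fed into the averages cannot be negative. Whether such zero samples should be included in the average or skipped is a separate design choice.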

@RituRajSingh878

I'm also seeing much the same issue. Are there any updates or a workaround for this?

CC @werkt
