Fix duplicated logs and memory issue with S3 log handler #67144

Merged
ferruzzi merged 1 commit into apache:main from jvstein:fix_duplicate_s3_logs_with_memory_usage on May 21, 2026
Conversation

Contributor

@jvstein jvstein commented May 19, 2026

In our Kubernetes-based Celery executor deployment, we ran into runaway memory usage with a sensor that used
mode="reschedule" and was repeatedly scheduled onto the same worker. In this environment we have
a dedicated set of workers for sensors, and the task was rescheduled to the same worker
every time the poke executed. As a result, the local log file persisted between pokes and kept growing,
and on each poke its contents were appended to the S3 log file again.

Over time, this caused a large memory spike: the supervisor process loaded the logs from S3, appended
another copy of the local logs, uploaded the result, and then repeated the cycle on the next poke. Memory usage
eventually grew until the worker crashed with an out-of-memory (OOM) error.
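
To make the growth pattern concrete, here is a toy model of the cycle described above. It is not the handler's code: `remote_log_after_pokes` and its in-memory lists are hypothetical stand-ins for the local log file and the S3 object, and the `truncate_after_upload` flag only mirrors the idea of emptying the local file once an upload succeeds.

```python
def remote_log_after_pokes(poke_count: int, truncate_after_upload: bool) -> list[str]:
    """Toy model: a rescheduled sensor that lands on the same worker every poke."""
    local_log: list[str] = []   # stands in for the worker's local log file
    remote_log: list[str] = []  # stands in for the log object in S3

    for poke in range(1, poke_count + 1):
        # Each poke appends its new lines to the local file that is still on disk.
        local_log.append(f"poke {poke}")

        # Append-style upload: read the remote log, add the whole local file, re-upload.
        remote_log = remote_log + local_log

        if truncate_after_upload:
            # Empty the local file so the next poke uploads only new lines.
            local_log.clear()

    return remote_log


without_truncate = remote_log_after_pokes(4, truncate_after_upload=False)
with_truncate = remote_log_after_pokes(4, truncate_after_upload=True)
print(len(without_truncate), len(with_truncate))
```

In this model the remote log holds n(n+1)/2 lines after n pokes when the local file is never emptied, and n lines when it is, which is why a long-running sensor eventually loads a very large object into memory.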


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

Generated-by: Claude Code following the guidelines

@jvstein jvstein requested a review from o-nikolas as a code owner May 19, 2026 00:57
@jvstein jvstein force-pushed the fix_duplicate_s3_logs_with_memory_usage branch from b22fbc9 to 3c53d5c Compare May 19, 2026 00:58
@eladkal eladkal requested review from ferruzzi and vincbeck May 21, 2026 17:17
Contributor

@ferruzzi ferruzzi left a comment


Approved with a non-blocking thought

Comment on lines +65 to +66
    elif has_uploaded:
        local_loc.write_text("")
Contributor


Non-blocking idea: This is definitely an improvement in the happy case, but if you want to take it further in another PR, consider what happens if the upload to S3 fails for some reason. I haven't thought this through all the way, so I may be wrong. I think has_uploaded will remain False, so the local copy doesn't get truncated, and on the next pass we still write with duplication. If that's the case, you could tail the S3 log before uploading to check for duplicates and trim those before writing?

Just an idea. Thanks for fixing this.
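
A minimal sketch of the tail-and-trim idea, assuming both logs are available as lists of lines. The helper name `trim_already_uploaded` is hypothetical and nothing like it is part of this PR; it drops the longest prefix of the local lines that already appears at the end of the remote lines.

```python
def trim_already_uploaded(remote_lines: list[str], local_lines: list[str]) -> list[str]:
    """Return the local lines that are not already at the tail of the remote log."""
    max_overlap = min(len(remote_lines), len(local_lines))

    # Try the largest possible overlap first so a full duplicate is removed entirely.
    for size in range(max_overlap, 0, -1):
        if remote_lines[-size:] == local_lines[:size]:
            return local_lines[size:]

    # No overlap found: everything in the local file is new.
    return local_lines


remote = ["poke 1", "poke 2"]
local = ["poke 1", "poke 2", "poke 3"]
new_lines = trim_already_uploaded(remote, local)
print(new_lines)
```

Two caveats apply. Lines that legitimately repeat (for example, identical messages without timestamps) could be trimmed by mistake, and comparing against the whole remote log does not help with memory, so a real version would probably read only the end of the object, for example with a ranged read.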

Contributor Author


I think that if an upload fails on a given worker, we still want to append logs on subsequent runs that land on the same worker. For example, if the task lands on different workers between pokes, we would still want to re-attempt the upload of the local log files, even if they arrive out-of-order in the final file.

In any scenario where the log upload fails and never re-executes on a worker, we lose logs, but that's a reality with the current code anyway and I'm not sure what the right solution is.

Maybe the worse problem is around a transient read failure of the log file from S3, which could trample the log file entirely.
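
One way to guard against that, sketched under assumptions: treat "the remote log does not exist yet" and "the remote read failed" as different outcomes, and skip the upload in the second case. `RemoteReadError`, `append_to_remote`, and the two callables are hypothetical names for illustration, not the handler's API.

```python
from typing import Callable, Optional


class RemoteReadError(Exception):
    """Hypothetical error: the remote read failed for a reason other than 'not found'."""


def append_to_remote(
    read_remote: Callable[[], Optional[str]],
    write_remote: Callable[[str], None],
    new_text: str,
) -> bool:
    """Append new_text to the remote log; return False if the upload was skipped."""
    try:
        existing = read_remote()  # None means the object does not exist yet
    except RemoteReadError:
        # The remote state is unknown, so do not overwrite it with only the new text.
        return False

    combined = new_text if existing is None else existing + "\n" + new_text
    write_remote(combined)
    return True


store = {"log": "old line"}


def failing_read() -> Optional[str]:
    raise RemoteReadError("simulated transient failure")


uploaded = append_to_remote(failing_read, lambda text: store.update(log=text), "new line")
print(uploaded, store["log"])
```

Returning False leaves the local file in place, so the lines can be retried on a later run that lands on the same worker.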

Contributor

@ferruzzi ferruzzi May 21, 2026


I'm going to resolve this convo and merge; we're just talking. Maybe I'm looking at it wrong. I think right now, if the S3 upload fails, the local log doesn't get cleaned up. That means if there is a hiccup and it works on the next pass, it will upload duplicate logs. But that's what's already happening right now, so it's not a big deal.

Contributor Author


> Maybe I'm looking at it wrong. I think right now if the S3 upload fails, then the local log doesn't get cleaned up. Which means if there is a hiccup and it works the next pass, it will upload duplicate logs.

I think in that case, that means the lines in the local log file are not present at all in the S3 file. So when it gets appended, there is no duplication. But maybe I'm the one thinking about it wrong.
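
The failed-upload case can be checked with a small simulation. This is a toy model, not the handler: `remote_after_uploads` is a hypothetical helper, and it assumes a failed upload leaves the remote log unchanged and that the local file is emptied only after a successful upload.

```python
def remote_after_uploads(upload_results: list[bool]) -> list[str]:
    """Toy model: one poke per entry; True means that poke's upload succeeded."""
    local_log: list[str] = []   # stands in for the worker's local log file
    remote_log: list[str] = []  # stands in for the log object in S3

    for poke, upload_ok in enumerate(upload_results, start=1):
        local_log.append(f"poke {poke}")

        if upload_ok:
            # Append the local lines to the remote log, then empty the local file.
            remote_log = remote_log + local_log
            local_log = []
        # On failure the local lines stay on disk and are retried on the next poke.

    return remote_log


# The second upload fails; the third poke uploads both pending lines.
result = remote_after_uploads([True, False, True])
print(result)
```

Under those assumptions the retried lines appear once. Duplication would require an upload that actually reached S3 but was reported as failed, which this model does not cover.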

@ferruzzi ferruzzi merged commit c3bf97d into apache:main May 21, 2026
108 checks passed

3 participants