Fix duplicated logs and memory issue with S3 log handler#67144
Conversation
b22fbc9 to
3c53d5c
Compare
ferruzzi
left a comment
There was a problem hiding this comment.
Approved with a non-blocking thought
| elif has_uploaded: | ||
| local_loc.write_text("") |
There was a problem hiding this comment.
Non-blocking idea: This is definitely an improvement in the happy case, but if you want to take it further in another PR, consider what happens if the upload to S3 fails for some reason. I haven't thought this through all the way so I may be wrong. I think has_uploaded will remain false, so the local copy doesn't get truncated, then on the next pass we still write with duplication. If that's the case, you could tail the s3 log before uploading to check for duplicates and trim those before writing?
Just an idea. Thanks for fixing this.
There was a problem hiding this comment.
I think that if an upload fails on a given worker, we still want to append logs on subsequent runs that land on the same worker. For example, if the task lands on different workers between pokes, we would still want to re-attempt the upload of the local log files, even if they arrive out-of-order in the final file.
In any scenario where the log upload fails and never re-executes on a worker, we lose logs, but that's a reality with the current code anyway and I'm not sure what the right solution is.
Maybe the worse problem is around a transient read failure of the log file from S3, which could trample the log file entirely.
There was a problem hiding this comment.
I'm going to resolve this convo and merge, we're just talking. Maybe I'm looking at it wrong. I think right now if the S3 upload fails, then the local log doesn't get cleaned up. Which means if there is a hiccup and it works the next pass, it will upload duplicate logs. But that's whats already happening right now, so it's not a big deal.
There was a problem hiding this comment.
Maybe I'm looking at it wrong. I think right now if the S3 upload fails, then the local log doesn't get cleaned up. Which means if there is a hiccup and it works the next pass, it will upload duplicate logs.
I think in that case, that means the lines in the local log file are not present at all in the S3 file. So when it gets appended, there is no duplication. But maybe I'm the one thinking about it wrong.
In our Kubernetes based celery executor, we ran into a runaway memory issue with a sensor that used
mode="reschedule"and kept scheduling to the same worker repeatedly. In this environment we havea dedicated worker set devoted to sensors and the task was getting rescheduled to the same worker
every time the poke was executed. As such, the local log file existed and was getting appended to and
then the S3 log file was also getting appended to each time.
Over time, this caused a large memory spike as the supervisor process loaded the logs from S3, attempted
to append a copy of the logs again, upload the result, and then repeat. The memory usage eventually crashed
the worker due to OOM.
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code following the guidelines