-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH CLI live log tail #359
Conversation
* Like `tail -f` for job run logs, closely mirroring the output of run logs on the web UI
* sort log lines by id as a fallback if createdAts are equal
…rm API is different without one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some concerns about the control flow in the workhorse function here-- I see a few points where it looks possible to get a hard error that doesn't make sense, or to get stuck in an infinite loop. Some additional docstring/inline documentation would also help, since it's not obvious to me what all of the variables are and what different elements of the response mean.
while continue_polling: | ||
response = client.jobs.list_runs_logs(id, run_id, | ||
last_id=seen_max_log_id) | ||
if 'civis-max-id' in response.headers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this header mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the maximum log ID that has been seen on Platform. It's used to decide if we should keep retrieving logs immediately instead of waiting for the next 3 second poll interval, and as a double-check to see if the logs are fully done and we can exit.
Maybe I could rename the variable here to max_log_id_seen_on_platform
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of the issue is that I don't know how platform stores log information. Are you saying that each log line has its own id?
Currently the people who maintain the API clients are not people who work on the API itself, so an inline comment mentioning that we can check for new platform logs by looking for new ids would help clarify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, okay. Yeah, each log message (I'm not calling it a line because I think a single message could technically have multiple newlines) has an ID that should be unique within each run and is in chronological order based on when the message was originally sent.
Here's what one of the log entries from civis jobs list-runs-logs
looks like in YAML format, since that's the same endpoint gets used here:
- createdAt: '2019-11-04T15:56:44.492Z'
id: 492602116
level: info
message: Queued
response = client.jobs.list_runs_logs(id, run_id, | ||
last_id=seen_max_log_id) | ||
if 'civis-max-id' in response.headers: | ||
remote_max_log_id = int(response.headers['civis-max-id']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable is only set if the conditional is met. If the conditional is not met, the user will get a confusing error on line 210. What should happen if the header is not present? Maybe raise an informative error rather than continuing on in the function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, good point. There won't be a civis-max-id if there are no logs at all yet, so I can just set this to None if it doesn't exist.
In that case, I'll have to change the logic on line 210 to account for this edge case: the first iteration through, if Platform hasn't seen any logs for this run yet, seen_max_log_id = 0
and remote_max_log_id = None
. Then line 210 would be false and we'd miss that sleep
and keep trying to hit Platform as fast as possible until we get some logs. That case is unlikely but possible.
civis/cli/_cli_commands.py
Outdated
last_id=seen_max_log_id) | ||
if 'civis-max-id' in response.headers: | ||
remote_max_log_id = int(response.headers['civis-max-id']) | ||
log_finished = response.headers['civis-cache-control'] != 'no-store' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of these items being pulled out of headers are mysterious to me. It would be helpful to have a comment saying what is being pulled out of the header and what the value you're expecting means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the actual definition of the value here doesn't really matter as much as the fact that the 'no-store' means Platform is no longer expecting to get any more logs for this run.
Is there a variable name that would make it clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable name makes sense to me, but I find the names of things in the Platform response confusing (civis-cache-control
, no-store
). I can take your word for it that this variable holds a bool of whether the log is done or not, but I wondered if there was any context you could add about what Platform is passing. This could provide extra context if, for example, I needed to modify the code in the future for some reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently it's either max-age=120
or no-store
, where the latter means that Platform has decided it might still be getting more logs. The max-age=120
value is currently not used directly by the web UI either. It's in the same format as a regular Cache-Control header to tell the client how long it can cache the results, but right now it's just used to tell the client when to stop polling for results.
civis/cli/_cli_commands.py
Outdated
print(' '.join((log['createdAt'], log['message'].rstrip()))) | ||
# if output is a pipe, write the buffered output immediately: | ||
sys.stdout.flush() | ||
if seen_max_log_id == remote_max_log_id: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean if this condition is met vs not? If it's not met, won't you be stuck in an infinite loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's not met, we'll continue polling.
remote_max_log_id = int(response.headers['civis-max-id']) | ||
log_finished = response.headers['civis-cache-control'] != 'no-store' | ||
logs = response.json() | ||
if logs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if if logs
is not met? Will you be stuck in an infinite loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's okay and normal if there are no more log results; we should normally just keep polling.
civis/cli/_cli_commands.py
Outdated
if log_finished: | ||
continue_polling = False | ||
else: | ||
time.sleep(_FOLLOW_POLL_INTERVAL_SEC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like it should be easier to DRY the sleep logic so we only have to do it in one place, but I wasn't able to come up with anything. Suggestions welcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something like this to both document and DRY? It's more code, but might be clearer?
if remote_max_log_id is None:
remote_has_more_logs_to_retrieve_now = False
elif local_max_log_id == remote_max_log_id:
remote_has_more_logs_to_retrieve_now = False
if log_finished:
continue_polling = False
else:
remote_has_more_logs_to_retrieve_now = True
if continue_polling and not remote_has_more_logs_to_retrieve_now:
time.sleep(_FOLLOW_POLL_INTERVAL_SEC)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think this is clearer.
@aripollak I left a couple notes about places where inline comments would be helpful. Otherwise this looks good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for submitting this feature!
civis jobs follow-log
andcivis jobs follow-run-log
tail -f
for job run logs, closely mirroring the output of run logs on the web UIDemo GIFs