
Feature - stdout live reporting #16975

Open
gecage952 wants to merge 22 commits into dev from feature_stdout_live_reporting

Conversation

@gecage952 (Contributor) commented Nov 3, 2023

Hi,
As part of our work at Oak Ridge National Lab, we've been using Galaxy for quite a while (some of us have attended GCC as well). We've also been doing some internal development on features that users have requested here. One of the most common requests was the ability to see live console output while jobs are running. This has been brought up before, for example in #2332, but since our users wanted it now, I took a stab at an implementation. My main goal was to minimize the impact on how Galaxy works today, so that there aren't any compatibility issues. I'll try to provide details about each part that was touched.

Overview

The overall idea here was to add logic that lets the job manager read the tool_stdout and tool_stderr files saved in the job directory and return them as part of a status response. I put this into the status because the UI already polls job status regularly from the JobInformation page, so I wouldn't have to add a new thread or anything, and it also just made sense to me that you might want this as part of the status of the job. To facilitate this, the API endpoint for getting job status was adjusted to accept parameters that select which part of the stdout/stderr you want (both work the same way, so I'll just refer to stdout from here on). There's stdout_position, which is the starting character index in the stdout file, and stdout_length, which is how many characters of stdout you want. Because stdout could potentially be a relatively large file, I didn't want to force clients to read the whole file every time status is called.
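To illustrate the intended polling pattern, here is a minimal client-side sketch that reads the tool stdout incrementally through the job status endpoint. The parameter names follow the PR diff (stdout_position/stdout_length); the response field name, instance URL, API key, and job id are placeholders and assumptions, not the final API.

```python
import time

import requests

GALAXY_URL = "https://galaxy.example.org"  # placeholder instance
API_KEY = "your-api-key"                   # placeholder key
JOB_ID = "encoded-job-id"                  # placeholder encoded job id

seen = 0  # characters of tool stdout already fetched
while True:
    # Ask the job status endpoint only for stdout we have not seen yet.
    resp = requests.get(
        f"{GALAXY_URL}/api/jobs/{JOB_ID}",
        params={
            "key": API_KEY,
            "stdout_position": seen,  # starting character index (per the PR diff)
            "stdout_length": 4096,    # read at most this many characters per poll
        },
    )
    resp.raise_for_status()
    status = resp.json()
    chunk = status.get("tool_stdout", "")  # assumed response field name
    if chunk:
        print(chunk, end="")
        seen += len(chunk)
    if status["state"] in ("ok", "error", "deleted"):
        break
    time.sleep(5)
```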

I then adjusted the UI for the job information view in a few ways. First, I made the code blocks scrollable and set a max height for them. Then I moved the expand-on-click functionality to the expand icons only, rather than the whole table row (previously, if users tried to highlight part of the stdout or clicked the scroll bar, the view would collapse). Lastly, I added an autoscroll feature that automatically scrolls the code blocks while the user is at the bottom of the stdout; if the user scrolls up, autoscroll is disabled, and if they scroll back to the bottom, it starts again.

The last thing I want to note is that, as far as compatibility with job runners goes, we almost exclusively use Pulsar for running jobs. As is, this PR will only work for job runners that save their stdout to the job directory inside Galaxy. Internally, we've added functionality for Pulsar to do this (the purpose of the lib/galaxy/webapps/galaxy/api/job_files.py changes). I did not include those changes here, because that would require an additional PR to the Pulsar repository; let me know if there's interest in seeing that, however. It would also be nice to get some feedback on testing this.

I understand this is a pretty big change, and I imagine there are a lot of areas for improvement. Please let me know if this is something people are interested in helping with, or if this is a terrible way to approach it, or whatever. It has worked for us internally so far, but I would love feedback from here.

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. The best way to test is to start a tool that will run for some time while writing to stdout (a minimal example script is sketched after this list).
    2. Start the tool and go to the JobInformation page for the job.
    3. Expand the Tool stdout by clicking on the expand icon to the right.
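
If you don't have a convenient long-running tool at hand, a hypothetical test tool could wrap a trivial script like the following, which just prints a line every few seconds so there is always fresh stdout to watch:

```python
#!/usr/bin/env python
# Emits a line of stdout every few seconds so there is always new output to watch.
import time

for i in range(60):
    print(f"progress line {i}", flush=True)  # flush so the output file grows promptly
    time.sleep(5)
print("done")
```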

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@dannon (Member) commented Nov 6, 2023

@gecage952 Very cool! I'll defer to others for feedback on the API design, but I went ahead and pushed a minor change fixing up the linting and client test failures, and updating the client API schema.

@dannon dannon force-pushed the feature_stdout_live_reporting branch from e5b887e to 201f556 Compare November 6, 2023 14:03
@dannon dannon force-pushed the feature_stdout_live_reporting branch from 201f556 to 24e7d15 Compare November 6, 2023 14:05
@bgruening (Member) commented

That is very cool and a feature we also get asked about every now and then. As a Pulsar deployer, I'm also interested in the Pulsar part and how this works in practice. Are you using the MQ deployment of Pulsar?

Comment on lines 192 to 195
- stdout_position: The index of the character to begin reading stdout from
- stdout_length: How many characters of stdout to read
- stderr_position: The index of the character to begin reading stderr from
- stderr_length: How many characters of stderr to read
Member commented:

Can you add two separate endpoints for the job's stdout and stderr?

@gecage952 (Contributor, Author) commented

That is very cool and a feature we also get asked about every now and then. As a Pulsar deployer, I'm also interested in the Pulsar part and how this works in practice. Are you using the MQ deployment of Pulsar?

Yeah, rabbitmq.
The general idea of the Pulsar piece is to send the stdout/stderr files to the job_files API endpoint periodically. To make sure we aren't sending the entire file every time, we keep track of the last position sent in a map. When it's time to send the next chunk, we seek to that position in the files and post the new data to the job_files endpoint, which appends it to the file in the Galaxy job directory. We also considered using the message queue for this instead of the API, but ended up not going in that direction.
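
This is not the code from the Pulsar PR, but a minimal sketch of that idea, under the assumptions that the streams live under outputs/ in the job working directory and that the job_files endpoint accepts the appended chunk as a file upload; the exact URL, parameter names, and file layout may differ from the real implementation.

```python
import os

import requests

# job_id -> bytes of each stream already sent to Galaxy
sent_offsets = {}


def send_new_output(galaxy_url, job_key, job_id, working_directory):
    """Send any stdout/stderr written since the last call to the job_files endpoint."""
    offsets = sent_offsets.setdefault(job_id, {"tool_stdout": 0, "tool_stderr": 0})
    for name in ("tool_stdout", "tool_stderr"):
        path = os.path.join(working_directory, "outputs", name)  # assumed layout
        if not os.path.exists(path):
            continue
        with open(path, "rb") as fh:
            fh.seek(offsets[name])  # skip what was already sent
            chunk = fh.read()
        if not chunk:
            continue
        # Assumed call shape: post the new chunk so Galaxy appends it to the
        # corresponding file in the job directory.
        requests.post(
            f"{galaxy_url}/api/jobs/{job_id}/files",
            params={"job_key": job_key, "path": name},
            files={"file": chunk},
        )
        offsets[name] += len(chunk)
```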

@mvdbeek (Member) commented Nov 7, 2023

This is very cool, thanks a lot!

Let me know if there's interest in seeing that however.

that'd be great!

It would also be nice to get some feedback on testing this.

The integration tests are going to run with the local job runner by default (as opposed to the API tests that could run against external Galaxy servers where the stdio streams may not be available). What you can do is submit a tool job against such an instance that prints something to stdout, then sleeps and prints something at the end, in your test you can then assert that you saw the first message but not the second. Take a look at https://github.com/galaxyproject/galaxy/blob/dev/test/integration/test_job_recovery.py#L30-L36 for running tools in integration tests. Let me know if you need any help with this.
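
A sketch of what such an integration test might look like, loosely following the existing integration-test helpers; the tool id, the console-output endpoint path, and the response field names here are assumptions for illustration, not the PR's final API.

```python
# Names below follow existing Galaxy integration tests, but the tool id,
# endpoint path, and response fields are assumptions for illustration.
from galaxy_test.base.populators import DatasetPopulator
from galaxy_test.driver import integration_util


class TestLiveConsoleOutput(integration_util.IntegrationTestCase):
    def setUp(self):
        super().setUp()
        self.dataset_populator = DatasetPopulator(self.galaxy_interactor)

    def test_stdout_visible_while_running(self):
        history_id = self.dataset_populator.new_history()
        # Hypothetical test tool that prints "first", sleeps, then prints "second".
        run_response = self.dataset_populator.run_tool("slow_echo", {}, history_id)
        job_id = run_response["jobs"][0]["id"]
        # Poll the (hypothetical) console output endpoint while the job is still running.
        response = self._get(f"jobs/{job_id}/console_output?stdout_position=0&stdout_length=1024")
        response.raise_for_status()
        stdout = response.json().get("stdout", "")
        assert "first" in stdout
        assert "second" not in stdout
```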

@gecage952 (Contributor, Author) commented Nov 17, 2023

So, I opened a PR in the Pulsar repo with that code: galaxyproject/pulsar#345
I'll try to get to the suggestions here in the coming weeks (busy time of year).

@martenson martenson marked this pull request as draft November 17, 2023 20:02
@gecage952 (Contributor, Author) commented

Ok, I added a new endpoint for getting stdout and stderr and updated everything to use it. I see it was mentioned to have them separate, and I can still do that if necessary; I just combined them here for my own testing purposes.

@gecage952 (Contributor, Author) commented

I'll also take a look at the merge conflicts.

@gecage952 (Contributor, Author) commented

Ok, sorry for taking a while to get back to this. I've tried to bring everything up to date with the latest changes in Galaxy, including updating the code to Vue 3, updating the API endpoint, and fixing merge conflicts. I'll try to fix the rest of these checks once they finish.

@gecage952 (Contributor, Author) commented

Just to update on this: it's done enough to start being looked at, just FYI. I'm sure there are areas for improvement, of course.

@gecage952 gecage952 changed the title [WIP] Feature - stdout live reporting Feature - stdout live reporting May 3, 2024
@martenson martenson added this to the 24.1 milestone May 3, 2024
@martenson martenson marked this pull request as ready for review May 3, 2024 21:24
@mvdbeek mvdbeek requested a review from jmchilton May 7, 2024 14:31
@@ -212,6 +212,12 @@ class JobOutput(Model):
value: EncodedDataItemSourceId = Field(default=..., title="Dataset", description="The associated dataset.")


class JobConsoleOutput(Model):
state: Optional[JobState] = Field(None, title="Job State", description="The current job's state")
Member commented:

This is going to sound weird, but internally we track both a job_stderr/job_stdout and a tool_stderr/tool_stdout. Pulsar is worse at this than the other runners, unfortunately. But I think what you're returning here is the tool_stdout, i.e. the output of the running tool; the job_stdout would be things like messages printed by SLURM and such. Can you update this description to say Tool STDOUT and Tool STDERR? There are other places in comments where you refer to it as the job's stdout, but I think it should be the job's tool stdout and so on. It seems like there probably shouldn't be a distinction, I imagine, but we have actually disentangled a few problems this way, and we shield users from non-tool stuff they don't care about this way too.
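
For illustration, one way the response model might look after adopting that naming; Model and JobState are the schema classes already used in this diff hunk, and the titles and descriptions here are placeholder wording.

```python
# Sketch only: Model and JobState are the schema classes already used in this
# module; titles/descriptions are placeholder wording.
from typing import Optional

from pydantic import Field


class JobConsoleOutput(Model):
    state: Optional[JobState] = Field(None, title="Job State", description="The current state of the job.")
    tool_stdout: Optional[str] = Field(
        None, title="Tool STDOUT", description="Chunk of standard output produced by the tool itself."
    )
    tool_stderr: Optional[str] = Field(
        None, title="Tool STDERR", description="Chunk of standard error produced by the tool itself."
    )
```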

):
if job is None:
raise ObjectNotFound()

Member commented:

There are a bunch of reasons this might not be possible, I think: older Pulsar, remote execution outside of Pulsar, a Pulsar modality that doesn't allow this, runners that don't handle working directory locations in this fashion. It feels like the right thing to do here is to throw a galaxy.exceptions.ConfigDoesNotAllowException and then catch it on the client and not poll. Does that make sense to you?

Ideally, I think we would make all of this opt-in based on the job destination until we've hardened it and seen it work well in production without performance issues or a confusing user experience. The job object has a job.job_destination YAML thing we could use to configure this per environment/destination.
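
A rough sketch of what that opt-in guard could look like; ConfigDoesNotAllowException is the exception named above, while the destination parameter name and the helper function are hypothetical.

```python
# Hypothetical guard: refuse live console output unless the destination opts in,
# so the client can catch the error and stop polling. The destination parameter
# name "live_tool_output" is made up for illustration.
from galaxy import exceptions


def ensure_live_output_allowed(job):
    destination_params = job.job_destination.params or {}
    if not destination_params.get("live_tool_output", False):
        raise exceptions.ConfigDoesNotAllowException(
            "Live tool output is not enabled for this job destination."
        )
```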

Projects: Triage/Discuss
Linked issues: None yet
6 participants