After upgrade to release_16.01: jobs are no longer run #1789
Following up on rekado's post, below is the output from the paster.log file when an upload job is executed:
A quick note about the line:
It is a bit misleading in retrospect. It just means that the API request to run the tool has been executed; the job is probably sitting in the NEW state, waiting for a job handler to pick it up and execute it.
This is our
I don't know what you mean by the second question (multiple processes of what?). We just start Galaxy with
@rekado It would be nice if you could provide a gist of your galaxy.ini file. Do you have nginx or Apache running? Can you post that config as well?
@bgruening This is our configuration file: https://gist.github.com/rekado/726e7d34033cde9f83d8. I was not involved in the upgrade, but it seems that the config files were not changed after the upgrade. Are there release notes that tell me which config keys should be added or changed? There are no changes in our git checkout. We use an unmodified checkout of the
@bgruening Hi Bjoern, we are still stuck on the same problem. The issue now affects both our dev and production servers. Our last resort would be to install v16.01 from scratch. This could solve the problem, but it would cause data loss for our users. Do you have any further suggestions?
I don't really see any obvious problem with your config files. Are there any jobs listed as running/new in the admin section? Can you kill those?
We have plenty of jobs in "new" status; they are just never executed.
@rekado can you run this query against your DB:
On our dev system this query returns no rows. They are all in the "scheduled" state.
@rekado can you try something like this:
I tried that just now, restarted Galaxy, and submitted a new upload job, but there's no change in behaviour. We use the following systemd unit to provide the Galaxy service:
I should also note that |
What do you see in the logs during tool execution? This is really weird: I have now migrated five Galaxy instances in Germany and have not seen this issue at all.
This is what I see in
Are there other logs that would give me more information?
If you run multiple handlers and workers, you will have one log file for each handler and worker.
Then I'd have to start with |
I have never used systemd; maybe you can first test without it.
With
It just spins up more HTTP handlers. But our problem seems to be that no jobs are actually run in the background. The web interface works just fine.
The jobs should be distributed to the handlers, which pass them on to the scheduler. What do you see in the logs when you start a job? I'm wondering how this system ever worked in production if you never set up handlers.
I see nothing interesting at all in the logs. I added the above snippet below the existing
Nothing at all is output beyond that. The rest of the galaxy.ini has not been changed (here's the link to our config again: https://gist.github.com/rekado/726e7d34033cde9f83d8).
okay, so there is no communication with the handlers, which leads me to think that your job_conf.xml
? Also, there is no reference to the job_config_file in the galaxy.ini,
I added both these things, but I don't see anything in the handler logs. How is Galaxy supposed to communicate with the workers? I see that the handlers listen on local ports; does communication go over the network? What's the messaging mechanism here?
Messaging happens by polling the database: all the handlers should be watching the database for jobs assigned to them.
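To make the polling model concrete, here is a minimal sketch, assuming a drastically simplified job table with only id, state and handler columns (Galaxy's real schema has many more). Each handler periodically asks the database for jobs in the new state that are assigned to it; nothing is pushed to it over the network. The table layout, the handler names and the make_demo_db/poll_once helpers are hypothetical, and this is not Galaxy's actual handler code.

```python
import sqlite3
from typing import List


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory DB with a hypothetical, simplified job table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT, handler TEXT)")
    conn.executemany(
        "INSERT INTO job (id, state, handler) VALUES (?, ?, ?)",
        [
            (1, "new", "handler0"),
            (2, "new", "handler1"),
            (3, "running", "handler0"),
        ],
    )
    conn.commit()
    return conn


def poll_once(conn: sqlite3.Connection, handler_id: str) -> List[int]:
    """One polling pass: find new jobs assigned to this handler and claim them."""
    rows = conn.execute(
        "SELECT id FROM job WHERE state = 'new' AND handler = ? ORDER BY id",
        (handler_id,),
    ).fetchall()
    job_ids = [row[0] for row in rows]
    # Move claimed jobs out of 'new' so the next pass does not pick them up again.
    conn.executemany("UPDATE job SET state = 'queued' WHERE id = ?", [(j,) for j in job_ids])
    conn.commit()
    return job_ids


if __name__ == "__main__":
    db = make_demo_db()
    print(poll_once(db, "handler0"))  # [1]
    print(poll_once(db, "handler0"))  # [] (job 1 was already claimed)
```

In this model, a job that has a handler assigned but stays in new indicates that the handler is alive but its query is excluding that job, so the filter conditions are the next thing to check.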
Please check the job record in the database. It should have a handler assigned, e.g.:
galaxy_test=> SELECT id, create_time, update_time, tool_id, tool_version, state, command_line, runner_name, handler, destination_id, destination_params FROM JOB ORDER BY ID DESC LIMIT 1;
id | create_time | update_time | tool_id | tool_version | state | command_line | runner_name | handler | destination_id | destination_params
--------+----------------------------+----------------------------+--------------------+--------------+-------+--------------+-------------+---------------+----------------+--------------------
520626 | 2016-02-21 14:59:33.100898 | 2016-02-21 14:59:33.100923 | ucsc_table_direct1 | 1.0.0 | new | | | test_handler1 | |
(1 row)
The latest job does in fact have a handler assigned to it:
Can you also provide the contents of |
Here's the full content of handler3.log since restarting it.
It is starting as a handler:
Can you set |
That's a lot of output, but it does seem to perform the queries. I cannot see from the output whether the queries were successful:
Here's the record for that last job:
I notice that the timestamp is off by one hour.
Ahh, a clue in that output: by any chance, are the users missing the activated flag on their record in the
And even if they have the correct flag, please turn off user activation and see if anything changes.
Yes!! The users'
There's another error I'm getting now (encoding related?) that is making the jobs fail, but at least the jobs are finally being executed! Thank you so much for your support!
@rekado When we introduced the activation feature (~2 years ago?), part of the DB migration script set all existing users to
Special thanks to @jmchilton, @mvdbeek, @bgruening and @natefoo! P.S. @dyusuf, does the fix work for you too?
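For admins who hit the same symptom, here is a minimal diagnostic sketch: list jobs stuck in new whose owner is not active, then (after reviewing that list) mark those owners active. It assumes a galaxy_user table with a boolean active column and a job table whose user_id references it; these names, the heavily simplified schema, and the helper functions are assumptions for illustration. It runs against in-memory SQLite; on a real PostgreSQL instance you would adapt the SQL and back up the database first.

```python
import sqlite3
from typing import List, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Hypothetical, heavily simplified user and job tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE galaxy_user (id INTEGER PRIMARY KEY, email TEXT, active BOOLEAN);
        CREATE TABLE job (id INTEGER PRIMARY KEY, user_id INTEGER, state TEXT);
        INSERT INTO galaxy_user VALUES (1, 'old@example.org', 0), (2, 'new@example.org', 1);
        -- job 12 has no owner (an anonymous job)
        INSERT INTO job VALUES (10, 1, 'new'), (11, 2, 'new'), (12, NULL, 'new');
        """
    )
    return conn


def stuck_jobs_of_inactive_users(conn: sqlite3.Connection) -> List[Tuple[int, str]]:
    """Return (job id, owner email) for 'new' jobs whose owner is not active."""
    # The inner JOIN skips anonymous jobs, whose user_id is NULL.
    return conn.execute(
        """
        SELECT job.id, galaxy_user.email
        FROM job JOIN galaxy_user ON job.user_id = galaxy_user.id
        WHERE job.state = 'new' AND NOT galaxy_user.active
        ORDER BY job.id
        """
    ).fetchall()


def activate_inactive_owners_of_new_jobs(conn: sqlite3.Connection) -> int:
    """Set active = true for inactive owners of 'new' jobs; return rows changed."""
    cur = conn.execute(
        """
        UPDATE galaxy_user SET active = 1
        WHERE NOT active
          AND id IN (SELECT user_id FROM job WHERE state = 'new' AND user_id IS NOT NULL)
        """
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    db = make_demo_db()
    print(stuck_jobs_of_inactive_users(db))  # [(10, 'old@example.org')]
    print(activate_inactive_owners_of_new_jobs(db))  # 1
    print(stuck_jobs_of_inactive_users(db))  # []
```

Rather than bulk-activating everyone, an admin may prefer to review the listed accounts first, or to turn user activation off as suggested earlier in the thread.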
@rekado As a quick workaround, try setting your locale to UTF-8.
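A quick way to see which encoding a Python process picks up from its locale is sketched below; it only inspects the environment and changes nothing. The uses_utf8 helper is hypothetical, and the sketch is written for Python 3 (Galaxy 16.01 itself ran on Python 2.7, where locale.getpreferredencoding also exists). The actual workaround is to export a UTF-8 locale, for example via LANG or LC_ALL, in the environment that starts Galaxy, such as the systemd unit mentioned earlier.

```python
import codecs
import locale


def uses_utf8(encoding_name: str) -> bool:
    """True if the codec name is an alias of UTF-8 (e.g. 'UTF8', 'utf-8')."""
    try:
        # codecs.lookup normalizes aliases to a canonical name.
        return codecs.lookup(encoding_name).name == "utf-8"
    except LookupError:
        # Unknown codec names cannot be UTF-8.
        return False


if __name__ == "__main__":
    current = locale.getpreferredencoding(False)
    if uses_utf8(current):
        print("Locale encoding is UTF-8:", current)
    else:
        print("Locale encoding is", current, "; consider a UTF-8 LANG/LC_ALL")
```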
@martenson @jmchilton @mvdbeek @bgruening @natefoo Thank you very much for the troubleshooting; it saved the day for us. Just a quick comment: it is hard to debug this type of error without experts like you. Besides admin issues, I often run into problems while developing Galaxy tools, and I always end up sending an SOS to @bgruening because it is not easy for me to find solutions in the Galaxy documentation. To make Galaxy admins and tool developers more independent in troubleshooting, would it be possible to improve the Galaxy documentation in some way, or to provide some other means of help? If you think it is needed and can provide support, I would certainly like to invest some effort in it.
All documentation efforts are very welcome. There is https://wiki.galaxyproject.org/Admin, as well as some code-specific documentation at https://github.com/galaxyproject/galaxy/tree/dev/doc. I think this particular issue is very tricky, but perhaps we could log a warning if a job is submitted by a user who hasn't been activated.
@martenson Thanks for the info!
@dyusuf There should be a banner at the top of Galaxy indicating that the account has not been activated. Was this not the case for these users?
@natefoo No banner is shown for deactivated users. We just set the "active" field back to "f" for one user, and we do not see any hint in the UI. The account is marked as inactive only in the admin interface.
Oh, I misunderstood. This has just been added by merging @mvdbeek's pull request.
@martenson Did something get broken here with the masthead message?
I may be misunderstanding things, but my PR introduces a warning message that is printed to the console when user_activation is on, a job is submitted, and the user is not active or anonymous. I believe @natefoo is talking about pre-existing web interface functionality. (I have not used the user activation feature, so I can't help any further.)
@natefoo I assume they just do not have a message set up (according to their config, see #1789 (comment)).
Check out https://wiki.galaxyproject.org/Admin/UserAccounts for details.
@martenson I think |
@nsoranzo Agreed.
Hi,
We've upgraded our Galaxy instances to release 16.01 and have since found that none of our tools work any longer. One test case is uploading a text file with the "upload1" tool. In the logs we see that a job is created, and we also see it in the database, but the job stays in the "new" state and never changes.
Upon starting Galaxy, we see that the main Galaxy Queue Worker is initialized to run on our PostgreSQL database (which contains the job records), and that 4 LocalRunner workers are started.
The uploaded files are in fact created in `new_file_path` as `upload_file_data_0Sf_99` (and have the expected contents), but in the `job_working_directory` they only appear as zero-sized files.
Could you please give us a hint as to what's going on here?