Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix starting runs #459

Merged
merged 4 commits into from
Nov 4, 2019
Merged

Fix starting runs #459

merged 4 commits into from
Nov 4, 2019

Conversation

wilko77
Copy link
Collaborator

@wilko77 wilko77 commented Nov 1, 2019

It all started with this strange failure in a clkhash test:

State: queued
Stage (1/3): waiting for CLKs
Progress: 100.00%
State: queued
Stage (1/3): waiting for CLKs
Progress: 100.00%
State: queued
Stage (2/3): compute similarity scores
State: running
Stage (3/3): compute output
State: running
Stage (4/3): there is no description for this stage

It turns out, that the entity service is naughty.

The problem is around the detection of the race condition to start a run twice.
The idea is that a run's state is set to created after creation, and only once all data provider have uploaded the necessary clks, the run state gets promoted to queued and, consequently, the run execution gets started.

However, some naughty code in views/run/list.py:post did some premature clk checking and run state changing. After all that, it would still call the check_for_executable_runs task, which will then do exactly the same again.
Now, as we previously "fixed" the get_created_runs_and_queue by allowing to return runs that are either created OR queued, together with the fact that check_for_executable_runs can be called twice almost in parallel, led to double queuing of the run and double stage progressing.

To stop all this nonsense, I propose to not include run logic in the views of ngnix. If we always call the same task as entrypoint (check_for_executable_runs), it will, most likely, keep as saner for longer. Hopefully.

wilko added 2 commits November 1, 2019 17:02
did half of the work of `check_for_executable_runs`. And as a side effect,
run stages were increased too many times, broke the race condition disabler
code.
@wilko77 wilko77 requested a review from gusmith November 1, 2019 06:28
Copy link
Contributor

@gusmith gusmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely spotted.
Only things to close this PR is to update the changelog.

@@ -263,7 +252,7 @@ def get_created_runs_and_queue(db, project_id):
UPDATE runs SET
state = 'queued'
WHERE
state IN ('created', 'queued') AND project = %s
state = 'created' AND project = %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -20,7 +20,7 @@ pytest==5.1.3
pytest-xdist==1.29.0
PyYAML==5.1
redis==3.2.1
requests==2.21.0
requests==2.22.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add this dependency update in the changelog.

@wilko77 wilko77 merged commit 7f2dad0 into develop Nov 4, 2019
@wilko77 wilko77 deleted the fix_starting_runs branch November 4, 2019 10:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants