Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 462 #463

Merged
merged 17 commits into from
Nov 12, 2019
Merged

Fix 462 #463

merged 17 commits into from
Nov 12, 2019

Conversation

gusmith
Copy link
Contributor

@gusmith gusmith commented Nov 11, 2019

Closes #462 which describes the fact that a dataprovider was able to upload clks multiple time to the same project using the same token.
In this PR, we solve this issue by modifying the uploaded value in the table dataproviders in the database from a boolean to an enum which can have the following values:

  • not_started
  • in_progress
  • done
  • error

When posting clks to a project, we check that the token is not already in use (i.e. uploaded set to in_progress or done), and post if possible. In an error case, uploaded will be set to error, which will enable the owner of the token to try uploading their data again, not to block a project if an upload failed.

Main reason to have a single method to do both is against race condition: we are
relying on postgres for the lock, ensuring that we cannot update twice this value to true.
 - not_started when we haven't started uploading the clks
 - in_progress when we start uploading them (lock the resource)
 - done when completed
 - error when an error occured.
One checking that re-using a token fails.
One checking that uploading failing data does not block the project (can re-upload).
…t them or lock the resource.

If an error occurs, free back the resource for later upload.
@gusmith gusmith requested a review from wilko77 November 11, 2019 05:24
Copy link
Collaborator

@wilko77 wilko77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like that we have UPLOADSTATE and UPLOADEDSTATE in the db now.
It is really hard to understand what they are for without spending quite some time reading through code. I made a suggestion how we can resolve that in the comments.


def get_and_set_dataprovider_upload_state_in_progress(db, dp_id):
"""
This method returns true if it was able to update the uploaded status of this dataprovider from false to true.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the status of the dataproviders is not true/false any more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well seen. I started by using the boolean value, but added more states to have an in_progress one. But I forgot the comments...

This method returns true if it was able to update the uploaded status of this dataprovider from false to true.
It return false otherwise (i.e. the state was already set to true).
"""
logger.debug("Setting dataprovider {} upload state to True".format(dp_id))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-phrased the whole docstring.

@@ -121,57 +121,69 @@ def project_clks_post(project_id):
with DBConn() as conn:
dp_id = db.get_dataprovider_id(conn, token)
project_encoding_size = db.get_project_schema_encoding_size(conn, project_id)
upload_state_updated = db.get_and_set_dataprovider_upload_state_in_progress(conn, dp_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit cryptic. The function get_and_set_dataprovider_upload_state_in_progress mixes two tasks.
How about if we split this to make it clearer:

if not db.dataprovider_allowed_to_upload(conn, project_id, dp_id, ...):
  return safe_fail_request(403, "This token has already been used to upload clks.")
db.set_dp_upload_state_in_progress(...)
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main point is to do both actions at once not to have race conditions. Otherwise we would first assess it, and then setting it which should return an error if already set. I thus rename the method is_dataprovider_allowed_to_upload_and_lock.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see...

@@ -85,6 +85,12 @@ RETURNS bool AS $$
SELECT $1.state = 'completed'
$$ STABLE LANGUAGE SQL;

CREATE TYPE UPLOADEDSTATE AS ENUM (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, now we have UPLOADEDSTATE and UPLOADSTATE. This is very confusing.
After looking at those for a bit, I think we should rename UPLOADSTATE to PROCESSEDSTATE.
It would also be nice to add a bit of an explanation what all those states are for.
In my view,
UPLOADSTATE describes the state of the processing of the uploaded clks. This is only set after CLKs have been uploaded. 'pending' means we are waiting for some processing to be done, 'ready' means done and 'error' if something fails.
UPLOADEDSTATE describes the state of the CLK upload.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was indeed a pickle. I will add comments on the UPLOADEDSTATE.
I didn't want to touch the UPLOADSTATE as I wasn't sure what it was made for (as you mentioned, this is not well documented). But I'll also add some documentation around it and rename it. In which case I will also rename UPLOADEDSTATE to UPLOADSTATE which sounds better :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll finally keep UPLOADEDSTATE instead of renaming it because the value in the table is uploaded. But renamed the UPLOADSTATE to PROCESSEDSTATE, adding some documentation.

@@ -235,3 +235,62 @@ def test_project_single_party_empty_data_upload(
)
assert r.status_code == 400


def test_project_upload_wrong_authentication(requests, valid_project_params):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can be more specific here. You are testing that you cannot upload twice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

small_file_path, new_project_data['project_id'], token_to_reuse, 1000, expected_status_code=403)


def test_project_upload_fail_then_works(requests, valid_project_params):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a great name for this test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was inspired :)

@gusmith gusmith requested a review from wilko77 November 12, 2019 03:22
def is_dataprovider_allowed_to_upload_and_lock(db, dp_id):
"""
This method returns true if the dataprovider is allowed to upload her clks.
A data provider is allowed to upload clks if they have not yet been uploaded or if the upload is already in progress.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean something different with the second if.
A DP is not allowed to upload if data has been uploaded, or if an upload is currently underway.

if length < 1:
return False
elif length > 1:
raise ValueError("Houston, we have a problem!!! This dataprovider can upload multiple times her clks.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be good to write a message to the logger at high severity level as well here.

'in_progress', -- the upload is in progress, so no-one else should be able to upload at the same time
'done', -- the clks have been uploaded, it should stay this way
'error' -- the dataprovider has tried to upload the clks but an error occurred, having this state allows a dataprovider to try again.
);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@@ -121,57 +121,69 @@ def project_clks_post(project_id):
with DBConn() as conn:
dp_id = db.get_dataprovider_id(conn, token)
project_encoding_size = db.get_project_schema_encoding_size(conn, project_id)
upload_state_updated = db.get_and_set_dataprovider_upload_state_in_progress(conn, dp_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dataproviders can upload their clks mutilple times with the same upload_token
2 participants