Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datasetRw: add dataset upload file capabilities to domino-python (multi-part, multi-thread) #185

Merged
merged 15 commits into from
Jan 16, 2024

Conversation

ddl-giuliocapolino
Copy link
Contributor

@ddl-giuliocapolino ddl-giuliocapolino commented Jan 3, 2024

Link to JIRA

DOM-52840

What issue does this pull request solve?

Needed a new endpoint to handle dataset upload

What is the solution?

Add endpoint as well as full client implementation + tests for single file, multi-part, multi-thread uploads

Testing

Briefly describe how the change was tested. The purpose of this section is that a reviewer can identify a test gap, if any.

e.g. "I ran an upgrade from 4.2 to 4.6".

  • Unit test(s)

Pull Request Reminders

References (optional)

@ddl-giuliocapolino ddl-giuliocapolino force-pushed the ddl-giuliocapolino.dataset-file-upload branch from c76f0fa to e48bef8 Compare January 4, 2024 19:22
@ddl-giuliocapolino ddl-giuliocapolino changed the title datasetRw: add dataset upload file capabilities to domino-python datasetRw: add dataset upload file capabilities to domino-python (multi-part, multi-thread) Jan 4, 2024
@ddl-giuliocapolino ddl-giuliocapolino marked this pull request as ready for review January 4, 2024 19:24
@ddl-giuliocapolino ddl-giuliocapolino requested a review from a team January 4, 2024 19:25
UPLOAD_READ_TIMEOUT_IN_SEC = 30


class UploadChunk:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems to be a good use case for a dataclass https://docs.python.org/3/library/dataclasses.html

url = self.routes.datasets_end_upload(self.dataset_id, self.upload_key, self.target_relative_path)
return self.request_manager.get(url)

def _create_chunk_queue(self) -> [UploadChunk]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have to use list[UploadChunk] for proper type hinting here

domino/domino.py Outdated
)

try:
uploader.start_upload_session()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like you could easily make this a context manager https://www.pythonmorsels.com/creating-a-context-manager/ to avoid having the user of this class know which method to call first (since there are no arguments once created)

Comment on lines 105 to 106
return self._create_chunk(self.local_path_file_or_directory)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python like early return, no need for else here

# append chunk to queue
nested_q.append(self._create_chunk(relative_path_to_file))
# flattening list of chunks
return [chunk for chunks in nested_q for chunk in chunks]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAT is happening here 😆 . Can you name things differently? is there a way to flatten this in the previous for loop?

if should_upload:
upload_try = 1
uploaded = False
while upload_try <= MAX_UPLOAD_ATTEMPTS and not uploaded:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

big fan of making good use a decorator for this kind of logic https://pypi.org/project/retry/. This allow clearer code by making the function do only what it is supposed to do

self.upload_key = upload_key


class Uploader:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comment about making this a context manager class


assert response.status_code == 200
assert "filecache" in response.json()
q
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this q supposed to be here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha no!! thanks :)

except Exception:
if upload_try > MAX_UPLOAD_ATTEMPTS:
raise Exception(f"Uploading chunk {chunk.chunk_number} of {chunk.total_chunks} "
f"for {chunk.file_name} failed. Please try again")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can add period after "Please try again"

def datasets_end_upload(self, dataset_id, upload_key, target_relative_path=None):
url = self.host + f"/v4/datasetrw/datasets/{dataset_id}/snapshot/file/end/{upload_key}"
if target_relative_path:
url += f"?targetRelativePath={target_relative_path}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

self.relative_path = relative_path
self.target_chunk_size = target_chunk_size
self.total_chunks = total_chunks
self.upload_key = upload_key
Copy link
Contributor

@ddl-eric-jin ddl-eric-jin Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit: Might be clearer to match the order or the arguments of __init__ with the order the variables are assigned.

except Exception:
if upload_try > MAX_UPLOAD_ATTEMPTS:
raise Exception(f"Uploading chunk {chunk.chunk_number} of {chunk.total_chunks} "
f"for {chunk.file_name} failed. Please try again")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Could add period at the end of message.

Comment on lines 37 to 49
def __init__(
self,
absolute_path: str,
chunk_number: int,
dataset_id: str,
file_name: str,
file_size: int,
identifier: str,
relative_path: str,
target_chunk_size: int,
total_chunks: int,
upload_key: str,
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the __init__ in a dataclass (that was the goal of my comment)

raise RuntimeError(f"upload key for {self.dataset_id} not found. Session could not start.")
return self

def __exit__(self, exc_type, exc_val, exc_tb):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might want to check on the exc_type to cancel the upload session in the event an exception was raised? This way you handle the whole lifecycle of the upload here instead of leaving that to the user

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh i see!

domino/domino.py Outdated
Comment on lines 1087 to 1092
except ValueError:
self.log.error(f"Erroneous input parameter. Please fix the parameters and try again")
except Exception as e:
self.log.error(f"Upload for dataset {dataset_id} and file {local_path_to_file_or_directory} failed, "
f"canceling session. Please try again.")
uploader.cancel_upload_session()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see previous comment, want to avoid that here

@pytest.mark.skipif(
not domino_is_reachable(), reason="No access to a live Domino deployment"
)
def test_datasets_upload(default_domino_client):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some more test cases.

Comment on lines 80 to 82
except Exception as e:
self.log.error("Upload failed. See error for details.")
raise e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to use try..catch here since it gets captured in the exit

Comment on lines 86 to 87
if not self.upload_key:
raise RuntimeError(f"upload key for {self.dataset_id} not found. Could not end session.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might want to check the exception before this since an error can happen before the upload_key is set

# catching errors
if not self.upload_key:
raise RuntimeError(f"upload key for {self.dataset_id} not found. Could not end session.")
if exc_type != None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if exc_type is not None is the python way of checking null values

f"attempting to cancel session. Please try again.")
self.log.error(f"Error type: {exc_val}. Error message: {exc_tb}.")
if not isinstance(exc_type, ValueError):
self._cancel_upload_session() # is it is a ValueError, canceling session would fail
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happen if this raise an exception? afaik __exit__ must return True or False https://docs.python.org/3/library/contextlib.html#contextlib.ContextDecorator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it raises it! modifying below.

self.log.info("Upload session cancelled successfully.")
except Exception as e:
self.log.info("Cancelling upload session failed. See error for details.")
raise e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably want to prevent the exception from being re-raise (see previous comment)

domino/domino.py Outdated
target_chunk_size=target_chunk_size,
target_relative_path=target_relative_path
) as path:
if path:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in what scenario is path not evaluating to true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah previously I was logging “Upload complete” here and was trying to find a way around the possibility of the ending failing - this is a remnant of that. good catch!

Copy link
Contributor

@ddl-gabrielhaim ddl-gabrielhaim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving but couple of comments that we might want to iterate on a future PR.

self.upload_key = self.request_manager.post(start_upload_url, json=start_upload_body).json()
if not self.upload_key:
raise RuntimeError(f"upload key for {self.dataset_id} not found. Session could not start.")
return self._upload()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure whether you should let the user (user of the context manager) initiate the upload or not, seems more correct to let them do it in case they want to do some logging before and after.

except Exception as e:
self.log.error("Ending snapshot upload failed. See error for details. Attempting to cancel "
"upload session.")
self._cancel_upload_session()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see previous comment, the exit method should return true/false

@ddl-giuliocapolino ddl-giuliocapolino merged commit 6f501c0 into master Jan 16, 2024
@ZeroCool2u
Copy link

Big hype, this seems like it closes #171! Thank you!

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants