-
Notifications
You must be signed in to change notification settings - Fork 232
Refactor DBFS CLI Put to support new backend. #371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
823e844
dd9c750
df78ca1
6b1c654
ff1c4f1
d396ec9
24da3ad
270d6d1
46dd267
c6216da
52437ac
08f451b
09d61d2
947389e
dcd51a7
1abe4ff
a62e262
de5cbb7
9687312
c724a1e
712e7d9
1dad15e
28451f0
7d3fe8e
23d7f51
5efe799
40af579
795ba9f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -23,6 +23,9 @@ | |
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| # | ||
| import os | ||
|
|
||
|
|
||
| class JobsService(object): | ||
| def __init__(self, client): | ||
| self.client = client | ||
|
|
@@ -519,25 +522,35 @@ def list_test(self, path, headers=None): | |
| _data['path'] = path | ||
| return self.client.perform_query('GET', '/dbfs-testing/list', data=_data, headers=headers) | ||
|
|
||
| def put(self, path, contents=None, overwrite=None, headers=None): | ||
| def put(self, path, contents=None, overwrite=None, headers=None, src_path=None): | ||
| _data = {} | ||
| _files = None | ||
| if path is not None: | ||
| _data['path'] = path | ||
| if contents is not None: | ||
| _data['contents'] = contents | ||
| if overwrite is not None: | ||
| _data['overwrite'] = overwrite | ||
| return self.client.perform_query('POST', '/dbfs/put', data=_data, headers=headers) | ||
| if src_path is not None: | ||
| headers = {'Content-Type': None} | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This doesn't seem right, The content Type is expected to be something of this format See how we check whether a request is a multipart upload or not // https://livegrep.dev.databricks.com/view/databricks/universe/daemon/data/daemon/src/main/scala/com/databricks/backend/daemon/data/server/meta/DbfsFileUploadDownloadBackend.scala#L305
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have checked this quite a lot. If we set the A lot of answers on stack I checked were against setting 'content-type' in this case.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you add a detailed comment explaining this and assert that in a test to make sure that is true? |
||
| filename = os.path.basename(src_path) | ||
| _files = {'file': (filename, open(src_path, 'rb'), 'multipart/form-data')} | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's this struct/tuple you're passing for files? Is this format defined somewhere or did you create it? How does the request know to send these files as a multipart upload?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If you go to |
||
| return self.client.perform_query('POST', '/dbfs/put', data=_data, headers=headers, files=_files) | ||
|
|
||
| def put_test(self, path, contents=None, overwrite=None, headers=None): | ||
| def put_test(self, path, contents=None, overwrite=None, headers=None, src_path=None): | ||
| _data = {} | ||
| _files = None | ||
| if path is not None: | ||
| _data['path'] = path | ||
| if contents is not None: | ||
| _data['contents'] = contents | ||
| if overwrite is not None: | ||
| _data['overwrite'] = overwrite | ||
| return self.client.perform_query('POST', '/dbfs-testing/put', data=_data, headers=headers) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What path is this exactly? I don't see a mention of this in the codebase except in
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for letting me know about this. @bogdanghita-db should I edit the |
||
| if src_path is not None: | ||
| headers = {'Content-Type': None} | ||
| filename = os.path.basename(src_path) | ||
| _files = {'file': (filename, open(src_path, 'rb'), 'multipart/form-data')} | ||
| return self.client.perform_query('POST', '/dbfs/put', data=_data, headers=headers, files=_files) | ||
|
|
||
| def mkdirs(self, path, headers=None): | ||
| _data = {} | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -135,6 +135,20 @@ def test_put_file(self, dbfs_api, tmpdir): | |
| api_mock.create.return_value = {'handle': test_handle} | ||
| dbfs_api.put_file(test_file_path, TEST_DBFS_PATH, True) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you also add a test for both ways of doing a put? with contents and with a file? |
||
|
|
||
| # Should not call add-block since file is < 2GB | ||
| assert api_mock.add_block.call_count == 0 | ||
|
|
||
| # Files >= 2GB should use create, add_block, close stream upload. | ||
| def test_put_large_file(self, dbfs_api, tmpdir): | ||
| test_file_path = os.path.join(tmpdir.strpath, 'test') | ||
| with open(test_file_path, 'wt') as f: | ||
| f.write('test') | ||
| api_mock = dbfs_api.client | ||
| # Make streaming upload threshold 2 bytes for testing. | ||
| dbfs_api.MULTIPART_UPLOAD_LIMIT = 2 | ||
| test_handle = 0 | ||
| api_mock.create.return_value = {'handle': test_handle} | ||
| dbfs_api.put_file(test_file_path, TEST_DBFS_PATH, True) | ||
| assert api_mock.add_block.call_count == 1 | ||
|
stormwindy marked this conversation as resolved.
|
||
| assert test_handle == api_mock.add_block.call_args[0][0] | ||
| assert b64encode(b'test').decode() == api_mock.add_block.call_args[0][1] | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see for this case we're passing
datadirectly instead ofjson.dumps(data)like we do above. I'm just curious to know if this is what's expected for multi-part upload. Isdataactually used in this case?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With
json.dumpsit used to fail creating the correct request. I saw somewhere the solution to the error was to pass the data object directly and request library handles it itself. I will try to find the thread about it.