Refactor DBFS CLI Put to support new backend. #371
stormwindy merged 28 commits into databricks:master from stormwindy:dbfs_put_backend_mig
Conversation
…fs_put_backend_mig
Codecov Report

@@            Coverage Diff             @@
##           master     #371      +/-   ##
==========================================
+ Coverage   84.83%   84.85%   +0.01%
==========================================
  Files          39       39
  Lines        2724     2727       +3
==========================================
+ Hits         2311     2314       +3
  Misses        413      413
==========================================

Continue to review full report at Codecov.
| # @self.client sets Content-Type 'text/json' by default.
| # For multipart/form-data POST Content-Type should be set automatically
| # to decode 'Boundary' parameter.
| headers = {'Content-Type': None}
This doesn't seem right. The Content-Type is expected to be something of this format:
Content-Type: multipart/form-data; boundary=something
See https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type
See how we check whether a request is a multipart upload or not: https://livegrep.dev.databricks.com/view/databricks/universe/daemon/data/daemon/src/main/scala/com/databricks/backend/daemon/data/server/meta/DbfsFileUploadDownloadBackend.scala#L305
I have checked this quite a lot. If we set the Content-Type manually, the requests library forces the programmer to define the other required fields, such as the boundary. If the Content-Type is not set (or is None), the requests library fills them in automatically. If a files parameter is passed to the call, it will automatically generate Content-Type: multipart/form-data; boundary=something.
A lot of the Stack Overflow answers I checked were against setting Content-Type manually in this case.
Could you add a detailed comment explaining this and assert that in a test to make sure that is true?
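For reference, a minimal sketch of such a test, using only the requests library itself: preparing the request through a Session (the same path session.request() takes) exposes the generated header without sending anything. The URL, file name and test name here are placeholders, not code from this PR.

```python
import requests


def test_content_type_is_autogenerated(tmp_path):
    src = tmp_path / 'data.bin'
    src.write_bytes(b'test')

    session = requests.Session()
    request = requests.Request(
        'POST', 'https://example.invalid/dbfs/put',  # placeholder URL
        files={'file': (src.name, src.open('rb'), 'multipart/form-data')},
        data={'path': 'dbfs:/data.bin'},
        # None makes the session drop the header entirely, letting
        # requests regenerate it, boundary included, from `files`.
        headers={'Content-Type': None})
    prepared = session.prepare_request(request)

    assert prepared.headers['Content-Type'].startswith(
        'multipart/form-data; boundary=')
```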
| # to decode 'Boundary' parameter.
| headers = {'Content-Type': None}
| filename = os.path.basename(src_path)
| _files = {'file': (filename, open(src_path, 'rb'), 'multipart/form-data')}
What's this struct/tuple you're passing for files? Is this format defined somewhere or did you create it? How does the request know to send these files as a multipart upload?
If you go to perform_query, it passes a files= argument to the request. When that is the case, the POST request becomes a multipart upload. It is explained in the requests docs: https://docs.python-requests.org/en/master/user/quickstart/#post-a-multipart-encoded-file.
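For context, the 3-tuple is requests' own documented files format, {field_name: (filename, file_object, content_type)}. A minimal sketch with placeholder names:

```python
import requests

# files format defined by requests: {field: (filename, fileobj, content_type)}
with open('local.bin', 'rb') as f:
    files = {'file': ('local.bin', f, 'multipart/form-data')}
    # Passing files= is what switches the POST to a multipart upload.
    requests.post('https://example.invalid/dbfs/put', files=files)
```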
| _data['contents'] = encoded_contents.decode("utf-8")
| if overwrite is not None:
|     _data['overwrite'] = overwrite
| return self.client.perform_query('POST', '/dbfs-testing/put', data=_data, headers=headers)
What path is this exactly? I don't see a mention of this in the codebase except in service.proto. I don't think these are used. We should create a ticket to clean this up from universe. cc @bogdanghita-db
rpc putTest(Put) returns (Put.Response) {
  option (rpc) = {
    endpoints: {
      method: "POST",
      path: "/dbfs-testing/put",
      since: { major: 2, minor: 0 },
    },
    visibility: PUBLIC,
  };
}
Thanks for letting me know about this. @bogdanghita-db, should I edit the service.proto file to add the new parameters? I will have to add src_path if we want to keep the current design choices (the ones I made) on how to implement the new put.
| _files = {'file': (filename, open(src_path, 'rb'), 'multipart/form-data')}
| return self.client.perform_query('POST', '/dbfs/put', data=_data, files=_files, headers=headers)
|
| def put_test(self, path, src_path=None, contents=None, overwrite=None, headers=None):
What is the use of this function? I can't see a difference. Do we need it?
| api_mock = dbfs_api.client
| test_handle = 0
| api_mock.create.return_value = {'handle': test_handle}
| # Should succeed.
This comment isn't really helpful.
Can you assert that it succeeded by using other APIs like list and matching the content?
I don't think this is possible since the API is a mock. It would only be possible if I explicitly defined the return values of the other APIs, which does not help us with testing. Please correct me if I am wrong.
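What a mocked client can verify is the interaction itself rather than the resulting file. A sketch, assuming the dbfs_api fixture and TEST_DBFS_PATH constant from the surrounding test file, and that the mocked service records a put call as in the diff:

```python
def test_put_file_uses_put_backend(dbfs_api, tmp_path):
    src = tmp_path / 'test'
    src.write_bytes(b'test')
    api_mock = dbfs_api.client  # mocked DbfsService, as in the test above

    dbfs_api.put_file(str(src), TEST_DBFS_PATH, True)

    # We cannot list the file back on a mock, but we can assert that the
    # call that would have performed the upload happened exactly once.
    assert api_mock.put.call_count == 1
```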
| test_handle = 0
| api_mock.create.return_value = {'handle': test_handle}
| # Should succeed.
| dbfs_api.put_file(test_file_path, TEST_DBFS_PATH, True)
Could you also add a test for both ways of doing a put, with contents and with a file?
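A sketch of the two requested cases, assuming put mirrors the put_test signature shown above (path, src_path=None, contents=None, overwrite=None, headers=None) and that DbfsService wraps a low-level client whose perform_query can be mocked; the import path is an assumption:

```python
from unittest import mock

from databricks_cli.sdk.service import DbfsService  # assumed import path


def test_put_with_contents():
    client = mock.MagicMock()
    DbfsService(client).put('dbfs:/test', contents='dGVzdA==', overwrite=True)
    _, kwargs = client.perform_query.call_args
    # Contents travel in the request body, not as a multipart file.
    assert not kwargs.get('files')


def test_put_with_file(tmp_path):
    src = tmp_path / 'test'
    src.write_bytes(b'test')
    client = mock.MagicMock()
    DbfsService(client).put('dbfs:/test', src_path=str(src), overwrite=True)
    _, kwargs = client.perform_query.call_args
    # The file is attached as the 'file' field of a multipart upload.
    assert 'file' in kwargs['files']
```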
gotibhai
left a comment
Took a first pass, will take another look once the comments are addressed. Could you add a note about testing to the description?
… to check files argument.
bogdanghita-db
left a comment
LGTM overall, but I see the PR tests are failing.
Thanks for the description of how you tested. I understand from it that you made some changes to the code in order to test. It would be good to also test the final code end-to-end with databricks fs cp directly against a test shard, if you haven't already.
It would be good to get a review from @andrewmchen as well.
| self.client.add_block(handle, b64encode(contents).decode(), headers=headers)
| self.client.close(handle, headers=headers)
| # If file size is >2Gb use streaming upload.
| if os.path.getsize(src_path) <= 2147483648:
Did you check that the limit is enforced in the backend with <= as well? If not, let's make this < instead of <=, just to be sure. @gotibhai, do you happen to know where this limit is enforced in the backend code, so that we can check?
I will make it < just to be safe. Sometimes the dummy files I generate vary by 1 byte for some reason, so tests might not be a great indicator in fine-grained cases.
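As a sketch, the agreed-on check with a strict comparison against a named constant (the constant and function names are mine, not from the PR):

```python
import os

PUT_SIZE_LIMIT_BYTES = 2147483648  # 2Gb; hypothetical constant name


def use_multipart_put(src_path):
    # Strict '<' so a file of exactly 2Gb also takes the streaming path.
    return os.path.getsize(src_path) < PUT_SIZE_LIMIT_BYTES
```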
| verify = self.verify, headers = headers)
| else:
|     # Multipart file upload
|     resp = self.session.request(method, self.url + path, files = files, data = data,
I see for this case we're passing data directly instead of json.dumps(data) like we do above. I'm just curious to know if this is what's expected for multi-part upload. Is data actually used in this case?
With json.dumps it used to fail to create the correct request. I saw somewhere that the solution to the error was to pass the data object directly and let the requests library handle it itself. I will try to find the thread about it.
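This matches requests' behaviour: when files= is present, data must be a mapping of form fields, and requests raises ValueError("Data must not be a string.") for a string body, whereas the plain JSON path above needs the serialized string. A sketch with placeholder URLs and file names:

```python
import json
import requests

payload = {'path': 'dbfs:/test', 'overwrite': 'true'}

# JSON request: the body is the serialized dict.
requests.post('https://example.invalid/json-endpoint',
              data=json.dumps(payload),
              headers={'Content-Type': 'text/json'})

# Multipart request: pass the dict itself so each key becomes its own
# form field next to the file part; json.dumps(payload) here would
# raise "Data must not be a string."
with open('local.bin', 'rb') as f:
    requests.post('https://example.invalid/dbfs/put', data=payload,
                  files={'file': ('local.bin', f, 'multipart/form-data')})
```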
This reverts commit 1dad15e.
…fs_put_backend_mig
| assert test_handle == api_mock.add_block.call_args[0][0]
| assert b64encode(b'test').decode() == api_mock.add_block.call_args[0][1]
| assert api_mock.close.call_count == 1
| assert test_handle == api_mock.close.call_args[0][0]
We could keep these asserts as well in test_put_large_file, right? And we can keep f.write('test') instead of f.write('\0' * 2); it's still larger than 2B.
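One way the combined test could look, assuming the fixtures from the surrounding test file and patching os.path.getsize to force the streaming branch instead of writing a real 2Gb file (the actual test may patch the size limit itself instead):

```python
from base64 import b64encode
from unittest import mock


def test_put_large_file(dbfs_api, tmp_path):
    src = tmp_path / 'test'
    src.write_text('test')
    api_mock = dbfs_api.client
    api_mock.create.return_value = {'handle': 0}

    # Pretend the file is over the 2Gb limit.
    with mock.patch('os.path.getsize', return_value=2147483649):
        dbfs_api.put_file(str(src), TEST_DBFS_PATH, True)

    assert api_mock.add_block.call_args[0][0] == 0
    assert api_mock.add_block.call_args[0][1] == b64encode(b'test').decode()
    assert api_mock.close.call_count == 1
    assert api_mock.close.call_args[0][0] == 0
```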
Co-authored-by: Bogdan Ghita <57367018+bogdanghita-db@users.noreply.github.com>
Refactors the PUT methods in the CLI (without creating user-facing APIs) to use the new `put` backend of DBFS rather than the `create`, `add_block` and `close` methods to achieve the same result. In short, the changes build a multipart/form-data request and send it to the `/dbfs/put` backend. Files `>= 2Gb` fall back to using `create`, `add_block`, `close` (streaming upload) so as not to break any pipelines. The version number is not increased for this change since there will not be a new release specific to it; it will be piggy-backed onto the next release.
Changes were tested on a staging shard. File sizes of `1kb`, `1mb`, `10mb`, `500mb`, `750mb`, `1gb`, `1.5gb` and `3gb` were tested by exposing an interface for `put_file` (not included in the PR) and initiating an upload command `dbfs put_file <file-path> <dbfs-path>`. Afterwards the files on DBFS and the local machine were compared. The `3gb` file upload returned an error as expected, since the `put` API supports up to 2Gb. (For this testing the streaming-upload fallback was removed.)

Moreover, the changes were tested by using the `cp` command of the CLI, which internally uses `put`. Files were copied from one directory to another with sizes `1mb`, `100mb`, `1gb` and `3gb`.

After the testing, fall-back logic was added to the `put_file` method so that file uploads larger than 2Gb automatically use streaming upload with `create`, `add_block` and `close` instead of the `put` API.
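Putting the description together, a condensed sketch of the resulting flow; names follow the diff excerpts above (`client.put`, `create`/`add_block`/`close`), but the chunk size and constant name are illustrative assumptions rather than the merged code:

```python
import os
from base64 import b64encode

BUFFER_SIZE = 1024 * 1024     # assumed streaming chunk size
PUT_SIZE_LIMIT = 2147483648   # the 2Gb cutoff discussed above


def put_file(self, src_path, dbfs_path, overwrite, headers=None):
    if os.path.getsize(src_path) < PUT_SIZE_LIMIT:
        # New backend: one multipart/form-data POST to /dbfs/put.
        self.client.put(dbfs_path.absolute_path, src_path=src_path,
                        overwrite=overwrite, headers=headers)
        return
    # Fallback for large files: streaming upload via create/add_block/close.
    handle = self.client.create(dbfs_path.absolute_path, overwrite,
                                headers=headers)['handle']
    try:
        with open(src_path, 'rb') as local_file:
            while True:
                contents = local_file.read(BUFFER_SIZE)
                if not contents:
                    break
                self.client.add_block(handle, b64encode(contents).decode(),
                                      headers=headers)
    finally:
        self.client.close(handle, headers=headers)
```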