Refactor DBFS CLI Put to support new backend.#371

Merged
stormwindy merged 28 commits into databricks:master from stormwindy:dbfs_put_backend_mig
May 28, 2021

Conversation

@stormwindy
Contributor

@stormwindy stormwindy commented May 7, 2021

Refactors the PUT methods in the CLI (without creating user-facing APIs) so that the CLI uses the new put backend of DBFS rather than the create, add_block, and close methods to achieve the same result. In short, the change builds a multipart/form-data request and sends it to the /dbfs/put backend. Files of 2 GB or larger fall back to create, add_block, close (streaming upload) so that no pipelines break.

The version number is not increased for this change, since there will not be a new release specific to it; it will be piggybacked onto the next release.

The changes are tested on a staging shard. File sizes of 1 KB, 1 MB, 10 MB, 500 MB, 750 MB, 1 GB, 1.5 GB, and 3 GB have been tested by exposing an interface for put_file (not included in the PR) and initiating an upload command dbfs put_file <file-path> <dbfs-path>. Afterwards, the files on DBFS and on local disk are compared. The 3 GB upload returned an error as expected, since the put API supports files up to 2 GB. (For this testing, the streaming-upload fallback was removed.)

Moreover, the changes are tested using the cp command of the CLI, which internally uses put. Files of 1 MB, 100 MB, 1 GB, and 3 GB have been copied from one directory to another.

After this testing, fallback logic was added to the put_file method so that uploads larger than 2 GB automatically use streaming uploads with create, add_block, and close instead of the put API.
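The fallback decision described above can be sketched as follows. This is a minimal illustration; the function and constant names are hypothetical, not the actual databricks-cli code.

```python
# Illustrative sketch of the size-based fallback; names are hypothetical.
PUT_MAX_BYTES = 2 * 1024 ** 3  # the /dbfs/put API accepts files below ~2 GB

def choose_upload_strategy(file_size):
    """Return 'put' for the multipart /dbfs/put endpoint, or 'streaming'
    for the create/add_block/close fallback used for large files."""
    if file_size < PUT_MAX_BYTES:
        return 'put'
    return 'streaming'
```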

@stormwindy stormwindy added the feature New enhancement or feature request label May 7, 2021
@stormwindy stormwindy self-assigned this May 7, 2021
@codecov-commenter

codecov-commenter commented May 10, 2021

Codecov Report

Merging #371 (795ba9f) into master (6c48761) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #371      +/-   ##
==========================================
+ Coverage   84.83%   84.85%   +0.01%     
==========================================
  Files          39       39              
  Lines        2724     2727       +3     
==========================================
+ Hits         2311     2314       +3     
  Misses        413      413              
Impacted Files Coverage Δ
databricks_cli/dbfs/api.py 65.17% <100.00%> (+0.47%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6c48761...795ba9f. Read the comment docs.

Comment thread databricks_cli/dbfs/cli.py Outdated
Comment thread databricks_cli/dbfs/api.py Outdated
# @self.client sets Content-Type 'text/json' by default.
# For multipart/form-data POST Content-Type should be set automatically
# to decode 'Boundary' parameter.
headers = {'Content-Type': None}
Contributor

This doesn't seem right. The Content-Type is expected to be something of this format:
Content-Type: multipart/form-data; boundary=something
See https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type

See how we check whether a request is a multipart upload or not // https://livegrep.dev.databricks.com/view/databricks/universe/daemon/data/daemon/src/main/scala/com/databricks/backend/daemon/data/server/meta/DbfsFileUploadDownloadBackend.scala#L305

Contributor Author

I have checked this quite a lot. If we set the Content-Type manually, the requests library forces the programmer to define other required fields such as the boundary. If the Content-Type is not set (or is None), the requests library fills it in automatically. If a files parameter is passed to the call, it will automatically generate Content-Type: multipart/form-data; boundary=something.

A lot of the answers I checked on Stack Overflow were against setting 'Content-Type' manually in this case.
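This behavior can be checked offline with a prepared request, without hitting any endpoint. A sketch, with a placeholder URL:

```python
import requests

# Leaving Content-Type unset while passing files= lets requests generate
# the multipart Content-Type header, including the boundary parameter.
prepared = requests.Request(
    'POST', 'https://example.com/dbfs/put',
    files={'file': ('data.bin', b'hello', 'multipart/form-data')},
).prepare()

print(prepared.headers['Content-Type'])
# multipart/form-data; boundary=<generated>
```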

Contributor

Could you add a detailed comment explaining this, and assert it in a test to make sure it holds?

# to decode 'Boundary' parameter.
headers = {'Content-Type': None}
filename = os.path.basename(src_path)
_files = {'file': (filename, open(src_path, 'rb'), 'multipart/form-data')}
Contributor

What's this struct/tuple you're passing for files? Is this format defined somewhere, or did you create it? How does the request know to send these files as a multipart upload?

Contributor Author

If you go to perform_query, it passes a files= argument to the request. When that is the case, the POST request becomes a multipart upload. It is explained in the requests docs: https://docs.python-requests.org/en/master/user/quickstart/#post-a-multipart-encoded-file.
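For reference, the tuple is (filename, file object or bytes, content type), and requests turns it into a form-data part of the multipart body. This can be verified without any network call; the URL below is a placeholder:

```python
import requests

# The files tuple is (filename, file object or bytes, content type).
_files = {'file': ('data.bin', b'payload', 'application/octet-stream')}
prepared = requests.Request(
    'POST', 'https://example.com/dbfs/put', files=_files,
).prepare()

# The filename and contents appear as a form-data part in the body.
assert b'filename="data.bin"' in prepared.body
assert b'payload' in prepared.body
```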

_data['contents'] = encoded_contents.decode("utf-8")
if overwrite is not None:
_data['overwrite'] = overwrite
return self.client.perform_query('POST', '/dbfs-testing/put', data=_data, headers=headers)
Contributor

What path is this exactly? I don't see a mention of it in the codebase except in service.proto, and I don't think these are used. We should create a ticket to clean this up from universe. cc @bogdanghita-db

  rpc putTest(Put) returns (Put.Response) {
    option (rpc) = {
      endpoints: {
        method: "POST",
        path: "/dbfs-testing/put",
        since: { major: 2, minor: 0 },
      },
      visibility: PUBLIC,
    };
  }

Collaborator

This file is generated based on the proto definitions in universe. It's not intended to be edited manually.

@gotibhai The dbfs-testing/... definitions will be deleted from service.proto as part of SC-50539.

Contributor Author

Thanks for letting me know about this. @bogdanghita-db, should I edit the service.proto file to add the new parameters? I will have to add src_path if we want to keep the current design choices on how to implement the new put (the ones I made).

Comment thread databricks_cli/sdk/service.py Outdated
_files = {'file': (filename, open(src_path, 'rb'), 'multipart/form-data')}
return self.client.perform_query('POST', '/dbfs/put', data=_data, files=_files, headers=headers)

def put_test(self, path, src_path=None, contents=None, overwrite=None, headers=None):
Contributor

What is the use of this function? I can't see a difference. Do we need it?

Comment thread tests/dbfs/test_api.py Outdated
api_mock = dbfs_api.client
test_handle = 0
api_mock.create.return_value = {'handle': test_handle}
# Should succeed.
Contributor

This comment isn't really helpful.
Can you assert that it succeeded by using other APIs, such as list, and matching the content?

Contributor Author

I don't think this is possible, since the API is a mock. It would only be possible if I explicitly defined the return values of the other APIs, which would not help us with testing. Please correct me if I am wrong.
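With a mocked client, the closest the tests can get is asserting on the recorded calls rather than on real DBFS content, roughly like this. A sketch using unittest.mock, not the actual test code; the paths and arguments are illustrative.

```python
from unittest import mock

# Stand-in for the mocked DBFS client used in the tests.
api_mock = mock.MagicMock()
api_mock.create.return_value = {'handle': 0}

# Drive the streaming-upload call sequence directly, for illustration.
handle = api_mock.create('dbfs:/tmp/f', True)['handle']
api_mock.add_block(handle, 'dGVzdA==')
api_mock.close(handle)

# Only the calls can be verified, not the uploaded content itself.
api_mock.create.assert_called_once_with('dbfs:/tmp/f', True)
assert api_mock.add_block.call_args[0][0] == handle
api_mock.close.assert_called_once_with(handle)
```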

Comment thread tests/dbfs/test_api.py
test_handle = 0
api_mock.create.return_value = {'handle': test_handle}
# Should succeed.
dbfs_api.put_file(test_file_path, TEST_DBFS_PATH, True)
Contributor

Could you also add a test for both ways of doing a put: with contents and with a file?

Contributor

@gotibhai gotibhai left a comment

Took a first pass; will take another look after the comments are addressed. Could you add a note about testing to the description?

@stormwindy stormwindy requested a review from bogdanghita-db May 24, 2021 11:12
Collaborator

@bogdanghita-db bogdanghita-db left a comment

LGTM overall, but I see the PR tests are failing.

Thanks for the description on how you tested. I understand from it that you made some changes to the code to test. It would be good to also test the final code end-to-end with databricks fs cp directly against a test shard, if you didn't do it already.

It would be good to get a review from @andrewmchen as well.

Comment thread databricks_cli/dbfs/api.py Outdated
self.client.add_block(handle, b64encode(contents).decode(), headers=headers)
self.client.close(handle, headers=headers)
# If file size is >2Gb use streaming upload.
if os.path.getsize(src_path) <= 2147483648:
Collaborator

Did you check that the limit is enforced in the backend with <= as well? If not, let's make this < instead of <=, just to be sure. @gotibhai, do you happen to know where this limit is enforced in the backend code, so that we can check?

Contributor Author

I will make it < just to be safe. Sometimes the dummy files I generate vary by 1 byte for some reason, so tests might not be a great indicator in fine-grained cases.

verify = self.verify, headers = headers)
else:
# Multipart file upload
resp = self.session.request(method, self.url + path, files = files, data = data,
Collaborator

I see that for this case we're passing data directly instead of json.dumps(data) like we do above. I'm just curious whether this is what's expected for a multipart upload. Is data actually used in this case?

Contributor Author

With json.dumps it used to fail to create the correct request. I saw somewhere that the solution was to pass the data object directly and let the requests library handle the encoding itself. I will try to find the thread about it.
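This matches how requests behaves when files= is present: a dict passed as data is folded into the multipart body as extra form-data fields, while a string (such as json.dumps(data)) is rejected with a ValueError. A quick offline check, with a placeholder URL:

```python
import json
import requests

files = {'file': ('data.bin', b'hello', 'application/octet-stream')}
data = {'path': 'dbfs:/tmp/data.bin', 'overwrite': 'true'}

# A dict is encoded as additional form-data fields in the multipart body.
ok = requests.Request('POST', 'https://example.com/dbfs/put',
                      files=files, data=data).prepare()
assert b'name="overwrite"' in ok.body

# A JSON string alongside files= raises ValueError inside requests.
try:
    requests.Request('POST', 'https://example.com/dbfs/put',
                     files=files, data=json.dumps(data)).prepare()
    raised = False
except ValueError:
    raised = True
assert raised
```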

Comment thread tests/dbfs/test_api.py
Comment thread tests/dbfs/test_api.py Outdated
Comment thread databricks_cli/dbfs/api.py Outdated
Comment thread databricks_cli/dbfs/api.py Outdated
Comment thread tests/dbfs/test_api.py Outdated
Comment thread tests/dbfs/test_api.py
Comment on lines -139 to -142
assert test_handle == api_mock.add_block.call_args[0][0]
assert b64encode(b'test').decode() == api_mock.add_block.call_args[0][1]
assert api_mock.close.call_count == 1
assert test_handle == api_mock.close.call_args[0][0]
Collaborator

We could keep these asserts as well in test_put_large_file, right? And we can keep f.write('test') instead of f.write('\0' * 2). It's still larger than 2 bytes.

stormwindy and others added 4 commits May 27, 2021 15:53
@stormwindy stormwindy merged commit 21258ee into databricks:master May 28, 2021