Refactor DBFS CLI Put to support new backend.#371

Merged
stormwindy merged 28 commits into databricks:master from stormwindy:dbfs_put_backend_mig
May 28, 2021

Conversation

@stormwindy
Contributor

@stormwindy stormwindy commented May 7, 2021

Refactors the PUT methods in the CLI (without creating user-facing APIs) so that the CLI uses the new put backend of DBFS rather than the create, add_block, and close methods to achieve the same result. In short, the change builds a multipart/form-data request and sends it to the /dbfs/put backend. Files of 2 GB or larger fall back to create, add_block, close (streaming upload) so that no pipelines break.

The version number is not increased for this change, since there will not be a new release specific to it; it will be piggybacked onto the next release.

The changes are tested on a staging shard. File sizes of 1 KB, 1 MB, 10 MB, 500 MB, 750 MB, 1 GB, 1.5 GB, and 3 GB have been tested by exposing an interface for put_file (not included in the PR) and initiating an upload command dbfs put_file <file-path> <dbfs-path>. Afterwards, the files on DBFS and on local disk are compared. The 3 GB upload returned an error as expected, since the put API supports files up to 2 GB. (For this testing, the streaming-upload fallback was removed.)

Moreover, the changes are tested using the cp command of the CLI, which internally uses put. Files of 1 MB, 100 MB, 1 GB, and 3 GB have been copied from one directory to another.

After this testing, fallback logic was added to the put_file method so that uploads larger than 2 GB automatically use streaming uploads with create, add_block, and close instead of the put API.
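The fallback decision described above can be sketched as follows. This is a minimal illustration; the function and constant names are hypothetical, not the actual databricks-cli code.

```python
# Illustrative sketch of the size-based fallback; names are hypothetical.
PUT_MAX_BYTES = 2 * 1024 ** 3  # the /dbfs/put API accepts files below ~2 GB

def choose_upload_strategy(file_size):
    """Return 'put' for the multipart /dbfs/put endpoint, or 'streaming'
    for the create/add_block/close fallback used for large files."""
    if file_size < PUT_MAX_BYTES:
        return 'put'
    return 'streaming'
```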

@stormwindy stormwindy added the feature New enhancement or feature request label May 7, 2021
@stormwindy stormwindy self-assigned this May 7, 2021
@codecov-commenter

codecov-commenter commented May 10, 2021

Codecov Report

Merging #371 (795ba9f) into master (6c48761) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #371      +/-   ##
==========================================
+ Coverage   84.83%   84.85%   +0.01%     
==========================================
  Files          39       39              
  Lines        2724     2727       +3     
==========================================
+ Hits         2311     2314       +3     
  Misses        413      413              
Impacted Files Coverage Δ
databricks_cli/dbfs/api.py 65.17% <100.00%> (+0.47%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6c48761...795ba9f. Read the comment docs.

Comment thread databricks_cli/dbfs/cli.py Outdated
Comment thread databricks_cli/dbfs/api.py Outdated
# @self.client sets Content-Type 'text/json' by default.
# For multipart/form-data POST Content-Type should be set automatically
# to decode 'Boundary' parameter.
headers = {'Content-Type': None}
Contributor

This doesn't seem right. The Content-Type is expected to be something of this format:
Content-Type: multipart/form-data; boundary=something
See https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type

See how we check whether a request is a multipart upload or not // https://livegrep.dev.databricks.com/view/databricks/universe/daemon/data/daemon/src/main/scala/com/databricks/backend/daemon/data/server/meta/DbfsFileUploadDownloadBackend.scala#L305

Contributor Author

I have checked this quite a lot. If we set the Content-Type manually, the requests library forces the programmer to define other required fields such as the boundary. If the Content-Type is not set (or is None), the requests library fills it in automatically. If a files parameter is passed to the call, it will automatically generate Content-Type: multipart/form-data; boundary=something.

A lot of the answers I checked on Stack Overflow were against setting 'Content-Type' manually in this case.
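This behavior can be checked offline with a prepared request, without hitting any endpoint. A sketch, with a placeholder URL:

```python
import requests

# Leaving Content-Type unset while passing files= lets requests generate
# the multipart Content-Type header, including the boundary parameter.
prepared = requests.Request(
    'POST', 'https://example.com/dbfs/put',
    files={'file': ('data.bin', b'hello', 'multipart/form-data')},
).prepare()

print(prepared.headers['Content-Type'])
# multipart/form-data; boundary=<generated>
```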

Contributor

Could you add a detailed comment explaining this, and assert it in a test to make sure it holds?

# to decode 'Boundary' parameter.
headers = {'Content-Type': None}
filename = os.path.basename(src_path)
_files = {'file': (filename, open(src_path, 'rb'), 'multipart/form-data')}
Contributor

What's this struct/tuple you're passing for files? Is this format defined somewhere, or did you create it? How does the request know to send these files as a multipart upload?

Contributor Author

If you go to perform_query, it passes a files= argument to the request. When that is the case, the POST request becomes a multipart upload. It is explained in the requests docs: https://docs.python-requests.org/en/master/user/quickstart/#post-a-multipart-encoded-file.
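For reference, the tuple is (filename, file object or bytes, content type), and requests turns it into a form-data part of the multipart body. This can be verified without any network call; the URL below is a placeholder:

```python
import requests

# The files tuple is (filename, file object or bytes, content type).
_files = {'file': ('data.bin', b'payload', 'application/octet-stream')}
prepared = requests.Request(
    'POST', 'https://example.com/dbfs/put', files=_files,
).prepare()

# The filename and contents appear as a form-data part in the body.
assert b'filename="data.bin"' in prepared.body
assert b'payload' in prepared.body
```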

_data['contents'] = encoded_contents.decode("utf-8")
if overwrite is not None:
_data['overwrite'] = overwrite
return self.client.perform_query('POST', '/dbfs-testing/put', data=_data, headers=headers)
Contributor

What path is this exactly? I don't see a mention of it in the codebase except in service.proto, and I don't think these are used. We should create a ticket to clean this up from universe. cc @bogdanghita-db

  rpc putTest(Put) returns (Put.Response) {
    option (rpc) = {
      endpoints: {
        method: "POST",
        path: "/dbfs-testing/put",
        since: { major: 2, minor: 0 },
      },
      visibility: PUBLIC,
    };
  }

Collaborator

This file is generated based on the proto definitions in universe. It's not intended to be edited manually.

@gotibhai The dbfs-testing/... definitions will be deleted from service.proto as part of SC-50539.

Contributor Author

Thanks for letting me know about this. @bogdanghita-db, should I edit the service.proto file to add the new parameters? I will have to add src_path if we want to keep the current design choices on how to implement the new put (the ones I made).

Comment thread databricks_cli/sdk/service.py Outdated
_files = {'file': (filename, open(src_path, 'rb'), 'multipart/form-data')}
return self.client.perform_query('POST', '/dbfs/put', data=_data, files=_files, headers=headers)

def put_test(self, path, src_path=None, contents=None, overwrite=None, headers=None):
Contributor

What is the use of this function? I can't see a difference. Do we need it?

Comment thread tests/dbfs/test_api.py Outdated
api_mock = dbfs_api.client
test_handle = 0
api_mock.create.return_value = {'handle': test_handle}
# Should succeed.
Contributor

This comment isn't really helpful.
Can you assert that it succeeded by using other APIs, such as list, and matching the content?

Contributor Author

I don't think this is possible, since the API is a mock. It would only be possible if I explicitly defined the return values of the other APIs, which would not help us with testing. Please correct me if I am wrong.
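With a mocked client, the closest the tests can get is asserting on the recorded calls rather than on real DBFS content, roughly like this. A sketch using unittest.mock, not the actual test code; the paths and arguments are illustrative.

```python
from unittest import mock

# Stand-in for the mocked DBFS client used in the tests.
api_mock = mock.MagicMock()
api_mock.create.return_value = {'handle': 0}

# Drive the streaming-upload call sequence directly, for illustration.
handle = api_mock.create('dbfs:/tmp/f', True)['handle']
api_mock.add_block(handle, 'dGVzdA==')
api_mock.close(handle)

# Only the calls can be verified, not the uploaded content itself.
api_mock.create.assert_called_once_with('dbfs:/tmp/f', True)
assert api_mock.add_block.call_args[0][0] == handle
api_mock.close.assert_called_once_with(handle)
```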

Comment thread tests/dbfs/test_api.py
test_handle = 0
api_mock.create.return_value = {'handle': test_handle}
# Should succeed.
dbfs_api.put_file(test_file_path, TEST_DBFS_PATH, True)
Contributor

Could you also add a test for both ways of doing a put: with contents and with a file?

Contributor

@gotibhai gotibhai left a comment

Took a first pass; will take another look after the comments are addressed. Could you add a note about testing to the description?

@stormwindy stormwindy requested a review from bogdanghita-db May 24, 2021 11:12
Collaborator

@bogdanghita-db bogdanghita-db left a comment

LGTM overall, but I see the PR tests are failing.

Thanks for the description on how you tested. I understand from it that you made some changes to the code to test. It would be good to also test the final code end-to-end with databricks fs cp directly against a test shard, if you didn't do it already.

It would be good to get a review from @andrewmchen as well.

Comment thread databricks_cli/dbfs/api.py Outdated
self.client.add_block(handle, b64encode(contents).decode(), headers=headers)
self.client.close(handle, headers=headers)
# If file size is >2Gb use streaming upload.
if os.path.getsize(src_path) <= 2147483648:
Collaborator

Did you check that the limit is enforced in the backend with <= as well? If not, let's make this < instead of <=, just to be sure. @gotibhai, do you happen to know where this limit is enforced in the backend code, so that we can check?

Contributor Author

I will make it < just to be safe. Sometimes the dummy files I generate vary by 1 byte for some reason, so tests might not be a great indicator in fine-grained cases.

verify = self.verify, headers = headers)
else:
# Multipart file upload
resp = self.session.request(method, self.url + path, files = files, data = data,
Collaborator

I see that for this case we're passing data directly instead of json.dumps(data) like we do above. I'm just curious whether this is what's expected for a multipart upload. Is data actually used in this case?

Contributor Author

With json.dumps it used to fail to create the correct request. I saw somewhere that the solution was to pass the data object directly and let the requests library handle the encoding itself. I will try to find the thread about it.
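This matches how requests behaves when files= is present: a dict passed as data is folded into the multipart body as extra form-data fields, while a string (such as json.dumps(data)) is rejected with a ValueError. A quick offline check, with a placeholder URL:

```python
import json
import requests

files = {'file': ('data.bin', b'hello', 'application/octet-stream')}
data = {'path': 'dbfs:/tmp/data.bin', 'overwrite': 'true'}

# A dict is encoded as additional form-data fields in the multipart body.
ok = requests.Request('POST', 'https://example.com/dbfs/put',
                      files=files, data=data).prepare()
assert b'name="overwrite"' in ok.body

# A JSON string alongside files= raises ValueError inside requests.
try:
    requests.Request('POST', 'https://example.com/dbfs/put',
                     files=files, data=json.dumps(data)).prepare()
    raised = False
except ValueError:
    raised = True
assert raised
```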

Comment thread tests/dbfs/test_api.py
Comment thread tests/dbfs/test_api.py Outdated
Comment thread databricks_cli/dbfs/api.py Outdated
Comment thread databricks_cli/dbfs/api.py Outdated
Comment thread tests/dbfs/test_api.py Outdated
Comment thread tests/dbfs/test_api.py
Comment on lines -139 to -142
assert test_handle == api_mock.add_block.call_args[0][0]
assert b64encode(b'test').decode() == api_mock.add_block.call_args[0][1]
assert api_mock.close.call_count == 1
assert test_handle == api_mock.close.call_args[0][0]
Collaborator

We could keep these asserts as well in test_put_large_file, right? And we can keep f.write('test') instead of f.write('\0' * 2). It's still larger than 2 bytes.

stormwindy and others added 4 commits May 27, 2021 15:53
@stormwindy stormwindy merged commit 21258ee into databricks:master May 28, 2021