[Fix] Do not buffer files in memory when downloading #1599

Merged
merged 1 commit into from
Jul 17, 2024

Conversation

renaudhartert-db
Contributor

@renaudhartert-db renaudhartert-db commented Jul 16, 2024

Changes

This PR fixes a performance bug that caused downloaded files (e.g. with `databricks fs cp dbfs:/Volumes/.../somefile .`) to be buffered entirely in memory before being written.

Results from profiling the download of a ~100MB file:

Before:

Type: alloc_space
Showing nodes accounting for 374.02MB, 98.50% of 379.74MB total

After:

Type: alloc_space
Showing nodes accounting for 3748.67kB, 100% of 3748.67kB total

Note that this fix is temporary. A longer-term solution would be to use the API provided by the Go SDK rather than making an HTTP request directly from the CLI.

fix #1575

Tests

Verified that the CLI downloaded the file correctly while profiling.

Contributor

@pietern pietern left a comment

Thanks for debugging this!

@renaudhartert-db renaudhartert-db added this pull request to the merge queue Jul 17, 2024
Merged via the queue into main with commit 235973e Jul 17, 2024
5 checks passed
@renaudhartert-db renaudhartert-db deleted the rh/stream-download branch July 17, 2024 07:21
andrewnester added a commit that referenced this pull request Jul 18, 2024
CLI:
 * [Fix] Do not buffer files in memory when downloading ([#1599](#1599)).

Bundles:
 * Allow artifacts (JARs, wheels) to be uploaded to UC Volumes ([#1591](#1591)).
 * Upgrade TF provider to 1.48.3 ([#1600](#1600)).
 * Fixed job name normalisation for bundle generate ([#1601](#1601)).

Internal:
 * Add UUID to uniquely identify a deployment state ([#1595](#1595)).
 * Track multiple locations associated with a `dyn.Value` ([#1510](#1510)).
 * Attribute Terraform API requests the CLI ([#1598](#1598)).
 * Use local Terraform state only when lineage match ([#1588](#1588)).
 * Implement readahead cache for Workspace API calls ([#1582](#1582)).

Dependency updates:
 * Bump github.com/databricks/databricks-sdk-go from 0.43.0 to 0.43.2 ([#1594](#1594)).
@andrewnester andrewnester mentioned this pull request Jul 18, 2024
github-merge-queue bot pushed a commit that referenced this pull request Jul 18, 2024
CLI:
* Do not buffer files in memory when downloading
([#1599](#1599)).

Bundles:
* Allow artifacts (JARs, wheels) to be uploaded to UC Volumes
([#1591](#1591)).
* Upgrade TF provider to 1.48.3
([#1600](#1600)).
* Fixed job name normalisation for bundle generate
([#1601](#1601)).

Internal:
* Add UUID to uniquely identify a deployment state
([#1595](#1595)).
* Track multiple locations associated with a `dyn.Value`
([#1510](#1510)).
* Attribute Terraform API requests the CLI
([#1598](#1598)).
* Implement readahead cache for Workspace API calls
([#1582](#1582)).
* Use local Terraform state only when lineage match
([#1588](#1588)).
* Add read-only mode for extension aware workspace filer
([#1609](#1609)).


Dependency updates:
* Bump github.com/databricks/databricks-sdk-go from 0.43.0 to 0.43.2
([#1594](#1594)).
Development

Successfully merging this pull request may close these issues.

Copying file from /Volumes seems to copy it entirely in memory
2 participants