
Implement readahead cache for Workspace API calls #1582

Merged 10 commits into main from wsfs-caching-filer on Jul 18, 2024
Conversation

@pietern (Contributor) commented on Jul 8, 2024

Changes

The reason this readahead cache exists is that we frequently need to recursively find all files in the bundle root directory, especially for sync include and exclude processing. By caching the response for every file/directory and frontloading the latency cost of these calls, we significantly improve performance and eliminate redundant operations.

Tests

  • Working on unit tests

Base automatically changed from notebook-file-info to main July 10, 2024 06:43
@pietern pietern marked this pull request as ready for review July 15, 2024 12:30
@shreyas-goenka (Contributor) left a comment

Awesome! The code looks good to me!

We should add some performance regression tests here to guard the performance characteristics this cache introduces. Ideas:

  1. The cache is used when present; the API call is not repeated for the same path.
  2. ReadDir on a directory leads to recursive queueing / querying of its subdirectories.
  3. An error during a recursive ReadDir call, while walking the file tree, propagates upwards; a user of the cache is not stuck indefinitely.
  4. No additional Stat API calls are made for files during a ReadDir file tree walk; the DirEntry is reused for the stat results. Additional calls are made only for notebooks.
  5. ReadDir / Stat calls are immediately added to the queue; adding elements to the queue does not depend on the API calls actually returning.
  6. Stat API calls defer to a queued "parent" ReadDir call when one is already in flight.


// If the parent directory is in the cache (or queued to be read),
// wait for it to complete to avoid redundant stat calls.
dir := path.Dir(name)

Do you know if this is the primary reason for the performance gain from this cache, i.e. that we avoid the redundant Stat API calls?

@pietern (Contributor, Author):

Not the primary reason, but it helps. The primary gain is the caching itself; the extras are a cherry on top.

libs/filer/workspace_files_cache.go — resolved comment
@shreyas-goenka (Contributor) commented:
Given that we'll likely migrate over to FUSE on a reasonable timeline, we can skip extensive performance coverage for this PR.

@renaudhartert-db (Contributor) left a comment

LGTM, it was an interesting PR to review!

[no action required]

If my understanding is correct, the workers remain alive with the cache (until the Cleanup function is called) whether or not there is work to do. This bothered me a little, and I was wondering if we could create the workers on an as-needed basis.

For the record, here's what I came up with:

func newWorkspaceFilesReadaheadCache(f Filer) *cache {
	c := &cache{ ... }

	go c.work(context.Background())

	return c
}

func (c *cache) work(ctx context.Context) {
	semaphore := make(chan struct{}, 10) // limit the creation of goroutines
	wg := sync.WaitGroup{}

	for e := range c.queue { // stops when Close is called
		semaphore <- struct{}{} // acquire a spot in the semaphore
		wg.Add(1)
		go func(e executable) {
			defer func() { <-semaphore }() // release the spot in the semaphore
			defer wg.Done()
			e.execute(ctx, c)
		}(e)
	}

	wg.Wait()
}

func (c *cache) Close() { // previously Cleanup
	close(c.queue)
}

The main difference is that we now have a single long-lived worker that spawns up to 10 sub-workers to help it empty the queue. I like that it moves most of the concurrency management into c.work.

libs/filer/workspace_files_cache.go — outdated, resolved comment
libs/filer/workspace_files_cache.go — outdated, resolved comment
@pietern (Contributor, Author) commented on Jul 18, 2024

@shreyas-goenka Thanks for the test suggestions. I have a couple in a local copy and will submit a separate PR for them.

@renaudhartert-db Varying the number of goroutines could work, but it doesn't buy much if at least one remains active (i.e. it doesn't remove the need to call Cleanup when done). Alternatively, workers could be spawned and terminated dynamically as work is added to the queue.

I scaled the number of workers down to 1 to remove concerns about parallelism. The net benefit of the cache is still high, even with only a single background worker doing readahead (I suspect parallel requests were hitting the SDK's rate limiter).

@pietern pietern added this pull request to the merge queue Jul 18, 2024
Merged via the queue into main with commit af0114a Jul 18, 2024
5 checks passed
@pietern pietern deleted the wsfs-caching-filer branch July 18, 2024 09:50
andrewnester added a commit that referenced this pull request Jul 18, 2024
CLI:
 * [Fix] Do not buffer files in memory when downloading ([#1599](#1599)).

Bundles:
 * Allow artifacts (JARs, wheels) to be uploaded to UC Volumes ([#1591](#1591)).
 * Upgrade TF provider to 1.48.3 ([#1600](#1600)).
 * Fixed job name normalisation for bundle generate ([#1601](#1601)).

Internal:
 * Add UUID to uniquely identify a deployment state ([#1595](#1595)).
 * Track multiple locations associated with a `dyn.Value` ([#1510](#1510)).
 * Attribute Terraform API requests to the CLI ([#1598](#1598)).
 * Use local Terraform state only when lineage matches ([#1588](#1588)).
 * Implement readahead cache for Workspace API calls ([#1582](#1582)).

Dependency updates:
 * Bump github.com/databricks/databricks-sdk-go from 0.43.0 to 0.43.2 ([#1594](#1594)).
github-merge-queue bot pushed a commit that referenced this pull request Jul 19, 2024
## Changes

Backfill unit tests for #1582.

## Tests

New tests pass.