New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Selective cache invalidation #700

Merged
merged 59 commits into from May 3, 2017

Conversation

Projects
None yet
2 participants
@smashwilson
Member

smashwilson commented Apr 24, 2017

Selectively invalidate the cached Repository state when (a) performing write operations ourselves and (b) receiving filesystem events. The goal here is to reduce the number of times that we need to shell out to git at all to avoid the performance penalty of launching a new process.

Fixes #671. Related to #201 and #627.

Approach

The setup for the actual caching of operations was done back in #654, where both the Cache object and @invalidate() decoration were introduced. This pull request expands on that foundation by:

  • Adding an argument to @invalidate() that's a function that's expected to return an array of cache keys that should be evicted after the Promise returned by that function completes (successfully or unsuccessfully).
  • Reworking the Cache to track "groups" of keyed items for easy eviction of subsets of the cache without needing an O(n) key scan.
  • Pass filesystem events to the Repository rather than just a "hey something changed somewhere" event. Used the modified files to invalidate parts of the cache.
  • Writing a shit-ton of tests to guard against any operations caching data that we shouldn't. More on that later.

Along the way:

  • I tinkered with getCurrentBranch(), because the filesystem event tests revealed a race condition. If an external process changes HEAD between the file content check and the git symbolic-ref call, it would fail with an error.
  • I moved .isPartiallyStaged() from Present to Repository and rewrote it in terms of .getStatusesForChangedFiles(), so that it would take advantage of cached status results. Also, it was annoying to test for with the test harness.

Cache implementation

All of the actual Cache implementation lives in lib/models/repository-states/present.js. It uses a pair of Maps: one that maps a unique string to the Promise generated the last time an operation was invoked, and one that maps a group name to a set of CacheKey objects that belong to that group.

Externally, callers populate the cache by calling .getOrSet() with a CacheKey. Each CacheKey contains a primary (globally-unique) key, often generated from a filename, and zero or more groups to which that key can belong. The primary key is used to do the lookup for an existing result; if none is present, the actual operation proceeds.

To evict items from the cache, @invalidate() passes it a collection of either CacheKey or GroupKey instances. A CacheKey evicts an item that matches its primary key exactly; a GroupKey evicts all keys that belong to the group it names.

To avoid leaving "magic strings" around (and having things fall out of sync), cache keys and groups can be generated by functions and constants in the Keys object. Key functions are either (a) a single constant (Keys.changedFiles, Keys.filePatch.all), (b) a function that generates a single key based on some parameter (naming convention: ".oneWith()", Keys.index.oneWith(fileName)), (c) a function that generates many keys (naming convention: ".eachWithXyz()", Keys.filePatch.eachWithOpts(...)), or (d) a function that generates an array of related keys that are commonly evicted together (Keys.headOperationKeys()).

Test harness

The highest risk is not evicting a cache key when the underlying operation has changed, while what we want to optimize is evicting the smallest set of cache keys that could change as the result of a specific operation. To make it easier to capture these, I wrote the assertCorrectInvalidation harness, which:

  • Uses a Map that gives friendly names to example calls of every Repository accessor that contains a this.cache.getOrSet() call.
  • Executes every call in the map once and remembers the Promises returned by each (before).
  • Executes the code under test. Ideally, this executes a single repository method with arguments and pre-existing state set up to result in as many changed Repository accessors as possible.
  • Executes every accessor in the map again and stores the Promises returned this time. If the call was not evicted from the cache by the operation, the Promise returned will be === to the ones collected before. (cached)
  • Clears the Repository's cache.
  • Executes the accessors one more time to determine what accessors actually changed output during the operation. (after)

To perform the actual assertion, the harness computes (a) after and before to generate the set of accessors that were actually modified by the operation and (b) cached and before to generate the set of accessors that were evicted from the cache.

If an accessor should have been evicted, but was not, the harness throws an error and fails the testcase; that's the bad case. If it should not have been evicted but was, that's okay, it just means that we might have been able to keep something in the cache, so it only causes a failure if {strict: true} is specified. If {verbose: true} is specified, both accessor sets are dumped to the console to make it easier to see what's going on and fine-tune the eviction.

I have test cases in this PR that cover every Repository action method that is marked with the @invalidate() decorator, and for the filesystem events generated by the underlying git operations that they use.

Left to do

  • Add arguments to the @invalidate() decorator
  • Invalidate state based on filesystem events
  • Use cached status results for isPartiallyStaged()
  • More efficient wildcard eviction from the cache
  • Rename Keys. properties to clarify at a glance which are constants, which are functions that generate single keys, which are functions that generate multiple keys...
  • Fill in a lot of tests
    • Pending tests for writing a merge conflict to the index.
  • Docs and writeup

smashwilson added some commits Apr 24, 2017

@smashwilson

This comment has been minimized.

Show comment
Hide comment
@smashwilson

smashwilson Apr 24, 2017

Member

For reference, here's a waterfall I collected before starting this work:

screen shot 2017-04-24 at 10 58 21 am

waterfall data

Member

smashwilson commented Apr 24, 2017

For reference, here's a waterfall I collected before starting this work:

screen shot 2017-04-24 at 10 58 21 am

waterfall data

smashwilson added some commits Apr 24, 2017

@smashwilson

This comment has been minimized.

Show comment
Hide comment
@smashwilson

smashwilson Apr 24, 2017

Member

Here's how it looks with some fairly aggressive caching in place:

screen shot 2017-04-24 at 4 18 21 pm

waterfall data

I still have a bunch left to do -- mostly related to putting tests in place that ensure we're invalidating and not invalidating the correct state for each operation. But I think that's a promising start.

Member

smashwilson commented Apr 24, 2017

Here's how it looks with some fairly aggressive caching in place:

screen shot 2017-04-24 at 4 18 21 pm

waterfall data

I still have a bunch left to do -- mostly related to putting tests in place that ensure we're invalidating and not invalidating the correct state for each operation. But I think that's a promising start.

smashwilson added some commits Apr 24, 2017

Show outdated Hide outdated test/models/repository.test.js
});
});
it('when writing a merge conflict to the index');

This comment has been minimized.

@smashwilson

smashwilson Apr 27, 2017

Member

@kuychaco When you get a minute... is there an easy way to set up a call to repository.writeMergeConflictToIndex()?

Ideally, one that results in as many changes to Repository getter results (getStatusesForChangedFiles(), ...) as possible, to exercise the cache invalidation thoroughly.

@smashwilson

smashwilson Apr 27, 2017

Member

@kuychaco When you get a minute... is there an easy way to set up a call to repository.writeMergeConflictToIndex()?

Ideally, one that results in as many changes to Repository getter results (getStatusesForChangedFiles(), ...) as possible, to exercise the cache invalidation thoroughly.

This comment has been minimized.

@kuychaco

kuychaco May 3, 2017

Member

@smashwilson sorry for the delay. Here's how I set up tests for writeMergeConflictToIndex - https://github.com/atom/github/blob/ku-dont-status-in-background/test/git-strategies.test.js#L864-L882. That method only affects the index and I basically just specify a few random shas to write to it. Hope that helps!

@kuychaco

kuychaco May 3, 2017

Member

@smashwilson sorry for the delay. Here's how I set up tests for writeMergeConflictToIndex - https://github.com/atom/github/blob/ku-dont-status-in-background/test/git-strategies.test.js#L864-L882. That method only affects the index and I basically just specify a few random shas to write to it. Hope that helps!

This comment has been minimized.

@smashwilson

smashwilson May 3, 2017

Member

Ah! Perfect, thanks. 🙇

@smashwilson

smashwilson May 3, 2017

Member

Ah! Perfect, thanks. 🙇

smashwilson added some commits Apr 27, 2017

smashwilson added some commits Apr 27, 2017

@smashwilson smashwilson requested a review from kuychaco Apr 27, 2017

@smashwilson

This comment has been minimized.

Show comment
Hide comment
@smashwilson

smashwilson Apr 27, 2017

Member

Haha, yep. The render from the filesystem event does trample the one from the autorefresh, but both finish right around the 100ms mark:

screen shot 2017-04-27 at 3 46 09 pm

waterfall data

I might actually implement the "expected filesystem update" feature I talked about with @BinaryMuse back when we were first talking about autorefresh. In another PR, though, this is already under the 100ms mark 😁

Member

smashwilson commented Apr 27, 2017

Haha, yep. The render from the filesystem event does trample the one from the autorefresh, but both finish right around the 100ms mark:

screen shot 2017-04-27 at 3 46 09 pm

waterfall data

I might actually implement the "expected filesystem update" feature I talked about with @BinaryMuse back when we were first talking about autorefresh. In another PR, though, this is already under the 100ms mark 😁

smashwilson added some commits Apr 27, 2017

@smashwilson

This comment has been minimized.

Show comment
Hide comment
@smashwilson

smashwilson Apr 28, 2017

Member

On reflection: expected updates might be more important to this than I'd thought, because git status touches the index, which creates an index filesystem event which invalidates the status cache... you can see the uncached git status call in the second waterfall. 🤔

Member

smashwilson commented Apr 28, 2017

On reflection: expected updates might be more important to this than I'd thought, because git status touches the index, which creates an index filesystem event which invalidates the status cache... you can see the uncached git status call in the second waterfall. 🤔

smashwilson added some commits May 2, 2017

@smashwilson smashwilson merged commit 149bb27 into master May 3, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@smashwilson smashwilson deleted the aw-selective-invalidation branch May 3, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment