Implement ProcessingTimeTimeout for RocksDBStateStore #1
To extend the capabilities of this state store, I propose adding state expiration (timeout) based on processing time. This feature is available when you do stateful aggregations using Spark's flatMapGroupsWithState (FMGWS) API.
FMGWS is not very flexible in terms of options and usage, and its timeouts require the query to progress in order to time out keys. With this feature added to the RocksDB State Store, key expiration is truly decoupled from the query engine in Structured Streaming, with the DB itself taking care of TTLs.
Will raise a pull request for this shortly, along with certain test scenarios.
@chermenin Apologies for the late response. I implemented this feature using Rocks'
So, you see, the expiration is bound to compaction. To maintain throughput and performance, Rocks creates multiple files, which are compacted based on certain constraints. Even though a compaction filter may be defined manually, entries will not be evicted until the DB compacts, and over-compaction degrades performance. This makes the features of TTLDB only partially usable and the testing of the process non-deterministic.
An alternative I thought of is to keep an in-memory collection of keys with their respective deadlines, which is referenced and updated on every get and set. This collection takes precedence over the compaction-based expiration provided by TTLDB, which ensures that no expired entries are returned while maintaining the performance of state store operations.
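To make the idea concrete, here is a minimal sketch of such a deadline index in plain Java. It is not the code from the pull request: the class name DeadlineIndexedStore, the backingContains helper, and the injected clock are all hypothetical, and a HashMap stands in for the RocksDB-backed store. It also assumes that a read does not extend a key's deadline, which the description above leaves open.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Toy sketch: an in-memory deadline index that takes precedence over the backing store. */
final class DeadlineIndexedStore {
    // Stand-in for the RocksDB-backed store (assumption: a real version would call the DB here).
    private final Map<String, String> backing = new HashMap<>();
    // Key -> absolute processing-time deadline in milliseconds.
    private final Map<String, Long> deadlines = new HashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;

    DeadlineIndexedStore(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Writes the value and (re)sets the key's deadline to now + TTL. */
    void put(String key, String value) {
        backing.put(key, value);
        deadlines.put(key, clock.getAsLong() + ttlMillis);
    }

    /** Returns the value, or null if the key is unknown to the index or its deadline has passed. */
    String get(String key) {
        Long deadline = deadlines.get(key);
        if (deadline == null) {
            return null; // the index takes precedence, even if the backing store still holds the entry
        }
        if (clock.getAsLong() >= deadline) {
            deadlines.remove(key); // physical removal is left to the backing store (e.g. compaction)
            return null;
        }
        return backing.get(key);
    }

    /** Hypothetical helper: shows whether the raw entry is still physically present. */
    boolean backingContains(String key) {
        return backing.containsKey(key);
    }

    public static void main(String[] args) {
        long[] now = {0L}; // manually advanced clock, so the demo is deterministic
        DeadlineIndexedStore store = new DeadlineIndexedStore(100L, () -> now[0]);
        store.put("user-1", "state");
        System.out.println("before deadline: " + store.get("user-1"));
        now[0] = 100L;
        System.out.println("after deadline: " + store.get("user-1"));
        System.out.println("still in backing store: " + store.backingContains("user-1"));
    }
}
```

Injecting the clock is what makes the expiration testable deterministically, in contrast to waiting for a compaction. The cost is the one noted below: every live key has an entry on the heap.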
The caveat of the alternative approach is that it moves back closer to the default implementation provided by Spark, i.e. the HDFSStateStore, which maintains in-memory maps.
I have the alternative approach ready, but do let me know your thoughts. Thanks.
@chitralverma So, I've looked at your comment and I'd like to say that I like your approach. Well, even if we keep the keys in memory, it will be better than keeping everything there as implemented in