Custom state store providers for Apache Spark


This repository contains classes that implement the state store functionality used to keep data between micro-batches for stateful Structured Streaming processing with Apache Spark.

Motivation

Out of the box, Apache Spark ships with only one state store provider implementation: HDFSBackedStateStoreProvider, which keeps all of the state data in memory, a very memory-consuming approach. This repository provides custom state store providers to help avoid OutOfMemory errors.

Usage

To use a custom state store provider in your pipelines, add the following configuration to your submit script:

--conf spark.sql.streaming.stateStore.providerClass="ru.chermenin.spark.sql.execution.streaming.state.RocksDbStateStoreProvider"

Here is some more information about it: https://docs.databricks.com/spark/latest/structured-streaming/production.html
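As an alternative to the submit-script flag, the provider class can also be set on the SparkSession directly. The sketch below shows a minimal stateful streaming query with this configuration; only the provider class name comes from this README, while the socket source, console sink, and checkpoint path are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

object RocksDbStateStoreExample {
  def main(args: Array[String]): Unit = {
    // Configure the custom state store provider instead of the default
    // HDFSBackedStateStoreProvider (same setting as the --conf flag above).
    val spark = SparkSession.builder()
      .appName("rocksdb-state-store-example")
      .config(
        "spark.sql.streaming.stateStore.providerClass",
        "ru.chermenin.spark.sql.execution.streaming.state.RocksDbStateStoreProvider")
      .getOrCreate()

    import spark.implicits._

    // A simple stateful aggregation: running word counts keep state between
    // micro-batches, so the configured state store provider is used on every trigger.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    counts.writeStream
      .outputMode(OutputMode.Complete())
      .option("checkpointLocation", "/tmp/rocksdb-state-store-example") // hypothetical path
      .format("console")
      .start()
      .awaitTermination()
  }
}
```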

Contributing

You're welcome to submit pull requests with any changes to this repository at any time. I'll be very glad to see any contributions.

License

This project is licensed under the standard Apache License 2.0.