[SPARK-19478][SS] JDBC Sink #17190
Conversation
@GaalDornick @AmplabJenkins Appreciate the excellent work on bringing JDBC support to streams. We really want this feature completed. Is it going to be back-ported to 2.1.x? Could you please provide an update on this?
@GaalDornick hi, I think
should be
(quoted in backticks instead of double quotes)?
Anybody responsible for bringing this one to life? I think a JDBC sink (or database writing in general) is an essential feature for Structured Streaming. The manual workaround of implementing it via a custom ForeachWriter is not practical because of the many connection opens and terminations in its open/close methods - see also: https://stackoverflow.com/questions/47130229/spark-2-2-struct-streaming-foreach-writer-jdbc-sink-lag
Sorry for abandoning this. Michael Armbrust had indicated to me that this should really be a Spark package, and not part of Spark itself. So, it is unlikely that this will get merged, and I haven't had the time to create a package. Also, Spark 2.3 contains a rewrite of the Data Source/Sink interface, so this component will probably need a rewrite to the new interface.
Can one of the admins verify this patch?
@GaalDornick
@GaalDornick are you still working on this? It's been open for 2 years...
@HyukjinKwon or somebody, can we just close this?
Thanks @HyukjinKwon
What changes were proposed in this pull request?
Implementation of a Sink that supports storing Structured Streaming data in a JDBC-compliant RDBMS. It supports Overwrite and Append modes. By default it provides at-least-once semantics, and it can be configured to support exactly-once semantics.
To keep track of batches that have been written to a table, the sink creates a log table named $_SINK_LOG. This table has 2 columns: the batch id and the batch status. The status is either COMMITTED or UNCOMMITTED. When the JDBC Sink receives a batch, it checks whether the sink log table has an entry for that batch with status = COMMITTED. If so, it ignores the batch; otherwise it attempts the append/overwrite operation.
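The commit-log protocol above can be sketched outside Spark. The following is a minimal illustration using Python's sqlite3 in place of a real JDBC connection; the table names (`events`, `events_SINK_LOG`) and the `write_batch` helper are hypothetical, not part of this PR's code.

```python
import sqlite3

# In-memory stand-in for the target RDBMS (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (value TEXT)")
# Sink log table: one row per batch, status COMMITTED or UNCOMMITTED.
conn.execute("CREATE TABLE events_SINK_LOG (batchId INTEGER PRIMARY KEY, status TEXT)")

def write_batch(batch_id, rows):
    """Append `rows`, skipping batches already marked COMMITTED (at-least-once)."""
    cur = conn.execute(
        "SELECT status FROM events_SINK_LOG WHERE batchId = ?", (batch_id,))
    row = cur.fetchone()
    if row and row[0] == "COMMITTED":
        return False  # batch already written: ignore the replay
    conn.execute(
        "INSERT OR REPLACE INTO events_SINK_LOG VALUES (?, 'UNCOMMITTED')",
        (batch_id,))
    conn.executemany("INSERT INTO events VALUES (?)", [(r,) for r in rows])
    conn.execute(
        "UPDATE events_SINK_LOG SET status = 'COMMITTED' WHERE batchId = ?",
        (batch_id,))
    conn.commit()
    return True

write_batch(0, ["a", "b"])
write_batch(0, ["a", "b"])  # replayed batch is detected and skipped
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # → 2
```

Note that a crash between the data insert and the COMMITTED update leaves the batch UNCOMMITTED, so a replay would append the rows again; this is why the default mode is only at-least-once.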
To enable exactly-once mode, the client should add a column of LongType to the original table to store the batch id, and pass the column's name in the options under the key batchIdCol. If the JDBC Sink finds that this option is set, it uses exactly-once mode: it sets the batchIdCol of each record to the id of the batch that inserts or overwrites it. Also, at the beginning of a batch, if the sink finds a batch with status = UNCOMMITTED, it deletes the records in the original table that match that batch id.
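The exactly-once recovery path can likewise be sketched in isolation. This is a hypothetical illustration using sqlite3, with made-up names (`events`, `events_SINK_LOG`, `write_batch_exactly_once`) and a simulated crash; it is not the PR's implementation, only the deletion-on-retry idea it describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Data table carries the batch id column (the role of batchIdCol).
conn.execute("CREATE TABLE events (value TEXT, batchId INTEGER)")
conn.execute("CREATE TABLE events_SINK_LOG (batchId INTEGER PRIMARY KEY, status TEXT)")

def write_batch_exactly_once(batch_id, rows, fail_before_commit=False):
    cur = conn.execute(
        "SELECT status FROM events_SINK_LOG WHERE batchId = ?", (batch_id,))
    row = cur.fetchone()
    if row and row[0] == "COMMITTED":
        return  # already durable: skip the replay entirely
    if row and row[0] == "UNCOMMITTED":
        # A previous attempt died mid-write: remove its partial output
        # before rewriting, so the retry produces no duplicates.
        conn.execute("DELETE FROM events WHERE batchId = ?", (batch_id,))
    conn.execute(
        "INSERT OR REPLACE INTO events_SINK_LOG VALUES (?, 'UNCOMMITTED')",
        (batch_id,))
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)", [(r, batch_id) for r in rows])
    if fail_before_commit:
        conn.commit()
        raise RuntimeError("simulated crash after partial write")
    conn.execute(
        "UPDATE events_SINK_LOG SET status = 'COMMITTED' WHERE batchId = ?",
        (batch_id,))
    conn.commit()

try:
    write_batch_exactly_once(0, ["a", "b"], fail_before_commit=True)
except RuntimeError:
    pass
write_batch_exactly_once(0, ["a", "b"])  # retry deletes partial rows first
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # → 2, no duplicates
```

Tagging every record with its batch id is what makes the partial output of a failed batch identifiable and deletable, turning replays into idempotent writes.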
How was this patch tested?
A JDBCSinkSuite was implemented, modeled along the lines of the other Sink test suites.