[SPARK-19478][SS] JDBC Sink #17190
Conversation
@GaalDornick @AmplabJenkins Appreciate the excellent work on bringing JDBC support to streams. We really want this feature completed. Is it going to be back-ported to 2.1.x? Could you please provide an update on this?
@GaalDornick hi, I think
should be
(quoted in backticks instead of double quotes)?
Anybody responsible for bringing this one to life? I think a JDBC sink (or database writing in general) is an essential feature for Structured Streaming. The manual workaround of implementing it via a custom ForeachWriter is not practical because of the many connection opens and terminations in its open/close methods - see also: https://stackoverflow.com/questions/47130229/spark-2-2-struct-streaming-foreach-writer-jdbc-sink-lag
Sorry for abandoning this. Michael Armbrust had indicated to me that this should really be a Spark package, and not part of Spark itself. So, it is unlikely that this will get merged, and I haven't had the time to create a package. Also, Spark 2.3 contains a rewrite of the Data Source/Sink interface, so this component will probably need a rewrite to the new interface.
Can one of the admins verify this patch?
@GaalDornick
@GaalDornick are you still working on this? It's been open for 2 years...
@HyukjinKwon or somebody, can we just close this?
Thanks @HyukjinKwon
What changes were proposed in this pull request?
Implementation of a Sink that supports storing Structured Streaming data in a JDBC-compliant RDBMS. It supports Overwrite and Append modes. By default it provides at-least-once semantics, and it can be configured to support exactly-once semantics.
To keep track of batches that have been written to a table, the sink creates a log table named $_SINK_LOG. This table has 2 columns: the batch id and the batch status. The status is either COMMITTED or UNCOMMITTED. When the JDBC Sink receives a batch, it checks whether the sink log table has an entry for that batch with status = COMMITTED. If so, it ignores the batch; otherwise it attempts the append/overwrite operation.
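The commit-log protocol above can be sketched outside Spark. The following is a minimal illustration using Python's sqlite3 in place of a real JDBC connection; the table names (`events`, `events_SINK_LOG`) and the `write_batch` helper are hypothetical, not part of this PR's code.

```python
import sqlite3

# In-memory stand-in for the target RDBMS (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (value TEXT)")
# Sink log table: one row per batch, status COMMITTED or UNCOMMITTED.
conn.execute("CREATE TABLE events_SINK_LOG (batchId INTEGER PRIMARY KEY, status TEXT)")

def write_batch(batch_id, rows):
    """Append `rows`, skipping batches already marked COMMITTED (at-least-once)."""
    cur = conn.execute(
        "SELECT status FROM events_SINK_LOG WHERE batchId = ?", (batch_id,))
    row = cur.fetchone()
    if row and row[0] == "COMMITTED":
        return False  # batch already written: ignore the replay
    conn.execute(
        "INSERT OR REPLACE INTO events_SINK_LOG VALUES (?, 'UNCOMMITTED')",
        (batch_id,))
    conn.executemany("INSERT INTO events VALUES (?)", [(r,) for r in rows])
    conn.execute(
        "UPDATE events_SINK_LOG SET status = 'COMMITTED' WHERE batchId = ?",
        (batch_id,))
    conn.commit()
    return True

write_batch(0, ["a", "b"])
write_batch(0, ["a", "b"])  # replayed batch is detected and skipped
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # → 2
```

Note that a crash between the data insert and the COMMITTED update leaves the batch UNCOMMITTED, so a replay would append the rows again; this is why the default mode is only at-least-once.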
To enable exactly-once mode, the client should add a column of LongType to the original table to store the batch id, and pass the column's name in the options under the key batchIdCol. If the JDBC Sink finds that this option is set, it uses exactly-once mode: it sets the batchIdCol of each record to the id of the batch that inserts or overwrites it. Also, at the beginning of a batch, if the sink finds a batch with status = UNCOMMITTED, it deletes the records in the original table that match that batch id.
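The exactly-once recovery path can likewise be sketched in isolation. This is a hypothetical illustration using sqlite3, with made-up names (`events`, `events_SINK_LOG`, `write_batch_exactly_once`) and a simulated crash; it is not the PR's implementation, only the deletion-on-retry idea it describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Data table carries the batch id column (the role of batchIdCol).
conn.execute("CREATE TABLE events (value TEXT, batchId INTEGER)")
conn.execute("CREATE TABLE events_SINK_LOG (batchId INTEGER PRIMARY KEY, status TEXT)")

def write_batch_exactly_once(batch_id, rows, fail_before_commit=False):
    cur = conn.execute(
        "SELECT status FROM events_SINK_LOG WHERE batchId = ?", (batch_id,))
    row = cur.fetchone()
    if row and row[0] == "COMMITTED":
        return  # already durable: skip the replay entirely
    if row and row[0] == "UNCOMMITTED":
        # A previous attempt died mid-write: remove its partial output
        # before rewriting, so the retry produces no duplicates.
        conn.execute("DELETE FROM events WHERE batchId = ?", (batch_id,))
    conn.execute(
        "INSERT OR REPLACE INTO events_SINK_LOG VALUES (?, 'UNCOMMITTED')",
        (batch_id,))
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)", [(r, batch_id) for r in rows])
    if fail_before_commit:
        conn.commit()
        raise RuntimeError("simulated crash after partial write")
    conn.execute(
        "UPDATE events_SINK_LOG SET status = 'COMMITTED' WHERE batchId = ?",
        (batch_id,))
    conn.commit()

try:
    write_batch_exactly_once(0, ["a", "b"], fail_before_commit=True)
except RuntimeError:
    pass
write_batch_exactly_once(0, ["a", "b"])  # retry deletes partial rows first
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # → 2, no duplicates
```

Tagging every record with its batch id is what makes the partial output of a failed batch identifiable and deletable, turning replays into idempotent writes.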
How was this patch tested?
A JDBCSinkSuite was implemented, modeled along the lines of the other Sink test suites.