[SPARK-16909][Spark Core] - Streaming for PostgreSQL JDBC driver #14502
princejwesley wants to merge 1 commit into apache:master
Conversation
The regex is wrong, as it matches zero or more ':' at the end. Actually, why don't we just use startsWith in both cases?
Seems reasonable even if 10 is arbitrary. Is that low? But then again MySQL above is asked to retrieve row by row, and I'm not actually sure that's a good idea. I wonder if we should dispense with this and just set it to something moderate like 1000 for all drivers? CC @koeninger
Can you update the MySQL link above while you're here to https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html, as the existing one doesn't work.
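For context, the line under review in core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala is url.matches("jdbc:postgresql:*"). Since String.matches anchors the whole string and ":*" only permits trailing colons, that check would accept "jdbc:postgresql" or "jdbc:postgresql::" but never a real URL such as "jdbc:postgresql://localhost/test". A minimal sketch of the startsWith alternative, keeping the fetch sizes the PR uses at this point (stmt and url as in the surrounding code):

    // Prefix checks instead of the broken regex match.
    if (url.startsWith("jdbc:mysql:")) {
      // MySQL's Connector/J streams row by row only for Integer.MIN_VALUE.
      stmt.setFetchSize(Integer.MIN_VALUE)
    } else if (url.startsWith("jdbc:postgresql:")) {
      // The PostgreSQL driver honors a positive fetch size and streams in
      // batches of that many rows (when autocommit is off).
      stmt.setFetchSize(10)
    }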
I'll update the PR tonight IST time
As I recall, the issue there is that otherwise the mysql driver will attempt to materialize the entire result set in memory at once, regardless of how big it is.
Yeah, that's why the fetch size shouldn't be effectively infinite, but I think this mode means fetch one at a time, which is the other extreme. What if this were, say, fetching 100 records at once? If that strikes you as OK, maybe that's better for efficiency.
What I'm saying is that, at least at the time, the mysql driver ignored the actual number set there unless it was min value. It only used it to toggle between "stream results" and "fetch all results", with nothing in between.
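For reference, the streaming recipe from the Connector/J implementation notes linked above looks roughly like this (driver-level usage, not Spark code):

    // Per the Connector/J 5.1 docs: streaming requires a forward-only,
    // read-only statement whose fetch size is exactly Integer.MIN_VALUE.
    val stmt = conn.createStatement(
      java.sql.ResultSet.TYPE_FORWARD_ONLY,
      java.sql.ResultSet.CONCUR_READ_ONLY)
    stmt.setFetchSize(Integer.MIN_VALUE) // stream one row at a time
    // Any other fetch size was ignored, and the driver retrieved the
    // complete result set into memory, with nothing in between.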
OK, got it. We'll leave that, but perhaps setFetchSize(100) for everything else. @princejwesley
Force-pushed from 7efc807 to fa5fb8d
Nit: you could use string interpolation in both statements. Also, it's not really streamed a record at a time in this case. Maybe just log the statement fetch size.
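A sketch of what the nit asks for (the message text is illustrative):

    // String interpolation instead of concatenation; the message just
    // reports the fetch size and no longer claims row-at-a-time streaming.
    logInfo(s"statement fetch size set to: ${stmt.getFetchSize}")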
Force-pushed from 88ca39f to 99528c2
Jenkins test this please
Test build #63311 has finished for PR 14502 at commit
(This line is too long, fails style checks)
Force-pushed from 99528c2 to bc1b318
Jenkins retest this please
Test build #63323 has finished for PR 14502 at commit
Merged to master
As per the PostgreSQL JDBC driver implementation, the default record fetch size is 0 (which means it caches all records).
This fix enforces a default record fetch size of 10 to enable streaming of data.
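Taken together with the review discussion, the resulting logic is roughly the following sketch (a reconstruction, not the merged diff: the description above says 10 for PostgreSQL, while the review settled on a moderate fetch size such as 100 for all non-MySQL drivers). Note that the PostgreSQL driver only honors the fetch size when autocommit is disabled on the connection.

    // Hypothetical reconstruction; JDBC URL taken from connection metadata.
    val url = conn.getMetaData.getURL
    if (url.startsWith("jdbc:mysql:")) {
      // Connector/J streams only when the fetch size is exactly MIN_VALUE;
      // any other value makes it cache the entire result set.
      stmt.setFetchSize(Integer.MIN_VALUE)
    } else {
      // A moderate positive fetch size lets drivers that honor it (such as
      // PostgreSQL's) stream in batches instead of caching every record.
      stmt.setFetchSize(100)
    }
    logInfo(s"statement fetch size set to: ${stmt.getFetchSize}")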