
[SPARK-16909][Spark Core] - Streaming for postgreSQL JDBC driver #14502

Closed · princejwesley wants to merge 1 commit into apache:master from princejwesley:spark-postgres

Conversation

@princejwesley (Contributor)

As per the PostgreSQL JDBC driver implementation, the default record fetch size is 0, which means the driver caches the entire result set in memory.

This fix enforces a default record fetch size of 10 to enable streaming of data.
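
For context, a minimal standalone sketch of the driver behavior described above, not the Spark code itself; the connection URL and table name are hypothetical. The PostgreSQL driver only uses a cursor (and thus streams) when autocommit is off and the fetch size is positive:

```scala
import java.sql.DriverManager

object PostgresStreamingSketch extends App {
  // Hypothetical URL and table, for illustration only.
  val conn = DriverManager.getConnection("jdbc:postgresql://localhost/test")
  conn.setAutoCommit(false)                       // the driver only uses a cursor when autocommit is off
  val stmt = conn.prepareStatement("SELECT * FROM big_table")
  stmt.setFetchSize(10)                           // fetch 10 rows per round trip instead of the whole result set
  val rs = stmt.executeQuery()
  while (rs.next()) {
    // process one row at a time without materializing the whole result set in memory
  }
  rs.close(); stmt.close(); conn.close()
}
```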

@srowen (Member) commented on core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala:

      stmt.setFetchSize(Integer.MIN_VALUE)
      logInfo("statement fetch size set to: " + stmt.getFetchSize + " to force MySQL streaming ")
    } else if (url.matches("jdbc:postgresql:*")) {

The regex is wrong, as it matches 0 or more : at the end. Actually, why don't we just use startsWith in both cases?

Seems reasonable even if 10 is arbitrary. Is that too low? But then again, MySQL above is asked to retrieve row by row, and I'm not actually sure that's a good idea. I wonder if we should dispense with this and just set it to something moderate like 1000 for all drivers? CC @koeninger

Can you update the MySQL link above while you're here to https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html, as the existing one doesn't work.
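
A small standalone sketch of the matches-versus-startsWith point; the URL is hypothetical:

```scala
object RegexVsStartsWith extends App {
  // Hypothetical connection URL, for illustration only.
  val url = "jdbc:postgresql://localhost/test"

  // String.matches anchors the whole string, and "jdbc:postgresql:*" means
  // "jdbc:postgresql" followed by zero or more ':' characters, so a real URL
  // with a host and database never matches.
  println(url.matches("jdbc:postgresql:*"))  // false
  // A plain prefix check does what was intended.
  println(url.startsWith("jdbc:postgresql")) // true
}
```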

@princejwesley (Contributor, Author)

I'll update the PR tonight, IST time.


@koeninger (Contributor)

As I recall, the issue there is that otherwise the MySQL driver will attempt to materialize the entire result set in memory at once, regardless of how big it is.

@srowen (Member)

Yeah, that's why the fetch size shouldn't be effectively infinite, but I think this mode means fetching one row at a time, which is the other extreme. What if this were, say, fetching 100 records at once? If that strikes you as OK, maybe that's better for efficiency.

@koeninger (Contributor)

What I'm saying is that, at least at the time, the MySQL driver ignored the actual number set there unless it was Integer.MIN_VALUE. It only used it to toggle between "stream results" and "fetch all results", with nothing in between.
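
A minimal standalone sketch of the Connector/J behavior described here; the URL and table are hypothetical. Row-by-row streaming requires a forward-only, read-only statement with fetch size Integer.MIN_VALUE:

```scala
import java.sql.{DriverManager, ResultSet}

object MySqlStreamingSketch extends App {
  // Hypothetical URL and table, for illustration only.
  val conn = DriverManager.getConnection("jdbc:mysql://localhost/test")
  val stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
  stmt.setFetchSize(Integer.MIN_VALUE)   // the only value that switches Connector/J to row-by-row streaming
  // Any other value, e.g. setFetchSize(100), did not stream: the driver still fetched the full result set.
  val rs = stmt.executeQuery("SELECT * FROM big_table")
  while (rs.next()) {
    // one row at a time
  }
  rs.close(); stmt.close(); conn.close()
}
```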


@srowen (Member)

OK got it. We'll leave that, but perhaps setFetchSize(100) for everything else. @princejwesley
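
A hedged sketch of the fetch-size policy the discussion converges on; the helper name and the constant 100 are illustrative, not the merged patch:

```scala
import java.sql.PreparedStatement

// Sketch only: keep Integer.MIN_VALUE for MySQL's streaming mode and use a
// moderate batch size for other drivers, which honor a normal fetch size.
object FetchSizePolicySketch {
  def applyFetchSize(url: String, stmt: PreparedStatement): Unit = {
    if (url.startsWith("jdbc:mysql:")) {
      stmt.setFetchSize(Integer.MIN_VALUE) // only MIN_VALUE enables row-by-row streaming in Connector/J
    } else {
      stmt.setFetchSize(100)               // other drivers fetch 100 rows per round trip
    }
  }
}
```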

@princejwesley (Contributor, Author)

@srowen Updated!

@princejwesley force-pushed the spark-postgres branch 4 times, most recently from 7efc807 to fa5fb8d on August 5, 2016 at 16:18
@srowen (Member)

Nit: you could use string interpolation in both statements. Also, it's not really streamed one record at a time in this case here; maybe just log the statement fetch size.
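
For reference, a tiny sketch of the interpolation nit; `fetchSize` here stands in for the statement's actual fetch size in the surrounding code:

```scala
object InterpolationNitSketch extends App {
  val fetchSize = 100
  // '+' concatenation versus Scala string interpolation.
  val before = "statement fetch size set to: " + fetchSize
  val after  = s"statement fetch size set to: $fetchSize"
  println(before == after) // true: same message, tidier syntax
}
```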

@princejwesley (Contributor, Author)

@srowen Addressed

@princejwesley force-pushed the spark-postgres branch 2 times, most recently from 88ca39f to 99528c2 on August 6, 2016 at 05:40
@srowen (Member) commented Aug 6, 2016

Jenkins test this please

@SparkQA commented Aug 6, 2016

Test build #63311 has finished for PR 14502 at commit 99528c2.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member)

(This line is too long, fails style checks)

@princejwesley (Contributor, Author)

@srowen Updated.

@srowen (Member) commented Aug 7, 2016

Jenkins retest this please

@SparkQA commented Aug 7, 2016

Test build #63323 has finished for PR 14502 at commit bc1b318.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Aug 7, 2016

Merged to master

asfgit closed this in bdfab9f on Aug 7, 2016