[SPARK-16909][Spark Core] - Streaming for PostgreSQL JDBC driver #14502
princejwesley wants to merge 1 commit into apache:master
Conversation
The regex is wrong, as it matches zero or more ':' at the end. Actually, why don't we just use startsWith in both cases?
Seems reasonable even if 10 is arbitrary. Is that low? But then again MySQL above is asked to retrieve row by row, and I'm not actually sure that's a good idea. I wonder if we should dispense with this and just set it to something moderate like 1000 for all drivers? CC @koeninger
Can you update the MySQL link above while you're here to https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html, as the existing one doesn't work.
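For context, the line under review in core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala is url.matches("jdbc:postgresql:*"). Since String.matches anchors the whole string and ":*" only permits trailing colons, that check would accept "jdbc:postgresql" or "jdbc:postgresql::" but never a real URL such as "jdbc:postgresql://localhost/test". A minimal sketch of the startsWith alternative, keeping the fetch sizes the PR uses at this point (stmt and url as in the surrounding code):

    // Prefix checks instead of the broken regex match.
    if (url.startsWith("jdbc:mysql:")) {
      // MySQL's Connector/J streams row by row only for Integer.MIN_VALUE.
      stmt.setFetchSize(Integer.MIN_VALUE)
    } else if (url.startsWith("jdbc:postgresql:")) {
      // The PostgreSQL driver honors a positive fetch size and streams in
      // batches of that many rows (when autocommit is off).
      stmt.setFetchSize(10)
    }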
I'll update the PR tonight IST time
As I recall, the issue there is that otherwise the mysql driver will attempt to materialize the entire result set in memory at once, regardless of how big it is.
Yeah, that's why the fetch size shouldn't be effectively infinite, but I think this mode means fetch one at a time, which is the other extreme. What if this were, say, fetching 100 records at once? If that strikes you as OK, maybe that's better for efficiency.
What I'm saying is that, at least at the time, the mysql driver ignored the actual number set there unless it was min value. It only used it to toggle between "stream results" and "fetch all results", with nothing in between.
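For reference, the streaming recipe from the Connector/J implementation notes linked above looks roughly like this (driver-level usage, not Spark code):

    // Per the Connector/J 5.1 docs: streaming requires a forward-only,
    // read-only statement whose fetch size is exactly Integer.MIN_VALUE.
    val stmt = conn.createStatement(
      java.sql.ResultSet.TYPE_FORWARD_ONLY,
      java.sql.ResultSet.CONCUR_READ_ONLY)
    stmt.setFetchSize(Integer.MIN_VALUE) // stream one row at a time
    // Any other fetch size was ignored, and the driver retrieved the
    // complete result set into memory, with nothing in between.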
OK, got it. We'll leave that, but perhaps setFetchSize(100) for everything else. @princejwesley
Force-pushed from 7efc807 to fa5fb8d
Nit: you could use string interpolation in both statements. Also, it's not really streamed a record at a time in this case. Maybe just log the statement fetch size.
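A sketch of what the nit asks for (the message text is illustrative):

    // String interpolation instead of concatenation; the message just
    // reports the fetch size and no longer claims row-at-a-time streaming.
    logInfo(s"statement fetch size set to: ${stmt.getFetchSize}")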
Force-pushed from 88ca39f to 99528c2
Jenkins test this please
Test build #63311 has finished for PR 14502 at commit
(This line is too long, fails style checks)
Force-pushed from 99528c2 to bc1b318
Jenkins retest this please
Test build #63323 has finished for PR 14502 at commit
Merged to master
As per the PostgreSQL JDBC driver implementation, the default record fetch size is 0 (which means it caches all records).
This fix enforces a default record fetch size of 10 to enable streaming of data.
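Taken together with the review discussion, the resulting logic is roughly the following sketch (a reconstruction, not the merged diff: the description above says 10 for PostgreSQL, while the review settled on a moderate fetch size such as 100 for all non-MySQL drivers). Note that the PostgreSQL driver only honors the fetch size when autocommit is disabled on the connection.

    // Hypothetical reconstruction; JDBC URL taken from connection metadata.
    val url = conn.getMetaData.getURL
    if (url.startsWith("jdbc:mysql:")) {
      // Connector/J streams only when the fetch size is exactly MIN_VALUE;
      // any other value makes it cache the entire result set.
      stmt.setFetchSize(Integer.MIN_VALUE)
    } else {
      // A moderate positive fetch size lets drivers that honor it (such as
      // PostgreSQL's) stream in batches instead of caching every record.
      stmt.setFetchSize(100)
    }
    logInfo(s"statement fetch size set to: ${stmt.getFetchSize}")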