[SPARK-19239][PySpark] Check parameters whether equals None when specify the column in jdbc API #16599
Conversation
…ual None in jdbc API

The `jdbc` API does not check `lowerBound` and `upperBound` when we specify the `column`, and just throws the following exception:

```
int() argument must be a string or a number, not 'NoneType'
```

If we check the parameters, we can give a friendlier suggestion.
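For context, a minimal way to hit the error before this change (the JDBC URL and table name below are placeholders for illustration; any reachable database would do):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Partitioned read: ``column`` is given, but lowerBound/upperBound are not.
# Before this patch PySpark went straight to int(lowerBound) and raised:
#   TypeError: int() argument must be a string or a number, not 'NoneType'
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost/testdb",  # placeholder URL
    table="people",                            # placeholder table name
    column="id",
    numPartitions=4)
```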
@zsxwing can you take a look?
```diff
@@ -431,6 +432,8 @@ def jdbc(self, url, table, column=None, lowerBound=None, upperBound=None, numPar
         if column is not None:
             if numPartitions is None:
                 numPartitions = self._spark._sc.defaultParallelism
+            assert lowerBound != None, "lowerBound can not be None when ``column`` is specified"
+            assert upperBound != None, "upperBound can not be None when ``column`` is specified"
```
Should we mirror the condition here?
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala (lines 100 to 103 in 55d528f):

```scala
require(partitionColumn.isEmpty ||
  (lowerBound.isDefined && upperBound.isDefined && numPartitions.isDefined),
  s"If '$JDBC_PARTITION_COLUMN' is specified then '$JDBC_LOWER_BOUND', '$JDBC_UPPER_BOUND'," +
  s" and '$JDBC_NUM_PARTITIONS' are required.")
```
Yes, the Scala code does check this, but the PySpark code fails at `int(lowerBound)` first, so the user only sees a confusing error.
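For reference, the confusing error comes straight from `int(None)`; no Spark is needed to reproduce it (the exact message wording varies by Python version):

```python
>>> int(None)
Traceback (most recent call last):
  ...
TypeError: int() argument must be a string or a number, not 'NoneType'
```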
ok to test
Test build #71472 has finished for PR 16599 at commit
Test build #71476 has finished for PR 16599 at commit
```diff
@@ -431,6 +432,8 @@ def jdbc(self, url, table, column=None, lowerBound=None, upperBound=None, numPar
         if column is not None:
             if numPartitions is None:
                 numPartitions = self._spark._sc.defaultParallelism
```
This contradicts the Scala version. Could you also change it to the following code?

```python
assert numPartitions is not None, "numPartitions can not be None when ``column`` is specified"
```
I am a little worried that this change may break the API. If a user only specifies `column`, `lowerBound`, and `upperBound` in some Spark version, their program will fail after upgrading, even though probably very few people rely on the default parallelism.

Personally, I prefer to make the change and keep the API consistent. If your opinion is to add the assert on `numPartitions`, I will update the PR soon.
I think we should make the Scala API and Python API consistent. The existing Python API does not follow the documentation:

> These options must all be specified if any of them is specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading.
Have you manually tested your code changes?
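As an aside on the quoted documentation, here is a simplified sketch of how `lowerBound`/`upperBound` drive the partition stride (a hypothetical helper for illustration, not Spark's actual `JDBCRelation.columnPartition`):

```python
def column_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Predicates a stride-based partitioned JDBC read would generate.

    The bounds only pick the stride; the first and last partitions are
    left open-ended, so no rows of the table are filtered out.
    """
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    for i in range(num_partitions):
        low = lower_bound + i * stride
        high = low + stride
        if i == 0:
            predicates.append("%s < %d OR %s IS NULL" % (column, high, column))
        elif i == num_partitions - 1:
            predicates.append("%s >= %d" % (column, low))
        else:
            predicates.append("%s >= %d AND %s < %d" % (column, low, high))
    return predicates

# 4 partitions over id in [0, 100) -> stride 25:
#   ['id < 25 OR id IS NULL', 'id >= 25 AND id < 50',
#    'id >= 50 AND id < 75', 'id >= 75']
print(column_partition_predicates("id", 0, 100, 4))
```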
In the Scala API, `numPartitions` is required when we specify the `column`, so I removed the default parallelism to keep the two APIs consistent.
I updated the PR and tested the change in the pyspark shell.
```python
assert lowerBound is not None, "lowerBound can not be None when ``column`` is specified"
assert upperBound is not None, "upperBound can not be None when ``column`` is specified"
assert numPartitions is not None, "numPartitions can not be None " \
                                  "when ``column`` is specified"
```
```python
assert numPartitions is not None, \
    "numPartitions can not be None when ``column`` is specified"
```
Test build #71488 has finished for PR 16599 at commit
Test build #71490 has finished for PR 16599 at commit
Test build #71496 has finished for PR 16599 at commit
LGTM
Thanks! Merging to master
…ify the column in jdbc API

## What changes were proposed in this pull request?

The `jdbc` API does not check `lowerBound` and `upperBound` when we specify the `column`, and just throws the following exception:

> ```int() argument must be a string or a number, not 'NoneType'```

If we check the parameters, we can give a friendlier suggestion.

## How was this patch tested?

Test using the pyspark shell, without the lowerBound and upperBound parameters.

Author: DjvuLee <lihu@bytedance.com>

Closes apache#16599 from djvulee/pysparkFix.
What changes were proposed in this pull request?

The `jdbc` API does not check `lowerBound` and `upperBound` when we specify the `column`, and just throws the following exception:

```
int() argument must be a string or a number, not 'NoneType'
```

If we check the parameters, we can give a friendlier suggestion.

How was this patch tested?

Test using the pyspark shell, without the lowerBound and upperBound parameters.