[SPARK-19239][PySpark] Check parameters whether equals None when specify the column in jdbc API #16599
Conversation
…ual None in jdbc API

The `jdbc` API does not check `lowerBound` and `upperBound` when we specify the `column`, and just throws the following exception:

```
int() argument must be a string or a number, not 'NoneType'
```

If we check the parameters, we can give a friendlier suggestion.
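For context, a minimal way to hit the error before this change (the JDBC URL and table name below are placeholders for illustration; any reachable database would do):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Partitioned read: ``column`` is given, but lowerBound/upperBound are not.
# Before this patch PySpark went straight to int(lowerBound) and raised:
#   TypeError: int() argument must be a string or a number, not 'NoneType'
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost/testdb",  # placeholder URL
    table="people",                            # placeholder table name
    column="id",
    numPartitions=4)
```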
@zsxwing can you take a look?
```diff
@@ -431,6 +432,8 @@ def jdbc(self, url, table, column=None, lowerBound=None, upperBound=None, numPar
         if column is not None:
             if numPartitions is None:
                 numPartitions = self._spark._sc.defaultParallelism
+            assert lowerBound != None, "lowerBound can not be None when ``column`` is specified"
+            assert upperBound != None, "upperBound can not be None when ``column`` is specified"
```
Should we mirror the condition here?
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala (lines 100 to 103 in 55d528f):

```scala
require(partitionColumn.isEmpty ||
  (lowerBound.isDefined && upperBound.isDefined && numPartitions.isDefined),
  s"If '$JDBC_PARTITION_COLUMN' is specified then '$JDBC_LOWER_BOUND', '$JDBC_UPPER_BOUND'," +
  s" and '$JDBC_NUM_PARTITIONS' are required.")
```
Yes, the Scala code does check this, but the PySpark code fails at `int(lowerBound)` first, so the user only sees a confusing error.
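For reference, the confusing error comes straight from `int(None)`; no Spark is needed to reproduce it (the exact message wording varies by Python version):

```python
>>> int(None)
Traceback (most recent call last):
  ...
TypeError: int() argument must be a string or a number, not 'NoneType'
```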
ok to test
Test build #71472 has finished for PR 16599 at commit
Test build #71476 has finished for PR 16599 at commit
```diff
@@ -431,6 +432,8 @@ def jdbc(self, url, table, column=None, lowerBound=None, upperBound=None, numPar
         if column is not None:
             if numPartitions is None:
                 numPartitions = self._spark._sc.defaultParallelism
```
This contradicts the Scala version. Could you also change it to the following code?

```python
assert numPartitions is not None, "numPartitions can not be None when ``column`` is specified"
```
I am a little worried that this change may break the API. If a user only specifies `column`, `lowerBound`, and `upperBound` in some Spark version, their program will fail after upgrading, even though probably very few people rely on the default parallelism.

Personally, I prefer to make the change and keep the API consistent. If your opinion is to add the assert on `numPartitions`, I will update the PR soon.
I think we should make the Scala API and Python API consistent. The existing Python API does not follow the documentation:

> These options must all be specified if any of them is specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading.
Have you manually tested your code changes?
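As an aside on the quoted documentation, here is a simplified sketch of how `lowerBound`/`upperBound` drive the partition stride (a hypothetical helper for illustration, not Spark's actual `JDBCRelation.columnPartition`):

```python
def column_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Predicates a stride-based partitioned JDBC read would generate.

    The bounds only pick the stride; the first and last partitions are
    left open-ended, so no rows of the table are filtered out.
    """
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    for i in range(num_partitions):
        low = lower_bound + i * stride
        high = low + stride
        if i == 0:
            predicates.append("%s < %d OR %s IS NULL" % (column, high, column))
        elif i == num_partitions - 1:
            predicates.append("%s >= %d" % (column, low))
        else:
            predicates.append("%s >= %d AND %s < %d" % (column, low, high))
    return predicates

# 4 partitions over id in [0, 100) -> stride 25:
#   ['id < 25 OR id IS NULL', 'id >= 25 AND id < 50',
#    'id >= 50 AND id < 75', 'id >= 75']
print(column_partition_predicates("id", 0, 100, 4))
```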
In the Scala API, `numPartitions` is required when we specify the `column`, so I removed the default parallelism to keep the two APIs consistent.
I updated the PR and tested the change in the pyspark shell.
```python
assert lowerBound is not None, "lowerBound can not be None when ``column`` is specified"
assert upperBound is not None, "upperBound can not be None when ``column`` is specified"
assert numPartitions is not None, "numPartitions can not be None " \
                                  "when ``column`` is specified"
```
```python
assert numPartitions is not None, \
    "numPartitions can not be None when ``column`` is specified"
```
Test build #71488 has finished for PR 16599 at commit
Test build #71490 has finished for PR 16599 at commit
Test build #71496 has finished for PR 16599 at commit
LGTM
Thanks! Merging to master
…ify the column in jdbc API

## What changes were proposed in this pull request?

The `jdbc` API does not check `lowerBound` and `upperBound` when we specify the `column`, and just throws the following exception:

> ```int() argument must be a string or a number, not 'NoneType'```

If we check the parameters, we can give a friendlier suggestion.

## How was this patch tested?

Test using the pyspark shell, without the lowerBound and upperBound parameters.

Author: DjvuLee <lihu@bytedance.com>

Closes apache#16599 from djvulee/pysparkFix.
What changes were proposed in this pull request?

The `jdbc` API does not check `lowerBound` and `upperBound` when we specify the `column`, and just throws the following exception:

```
int() argument must be a string or a number, not 'NoneType'
```

If we check the parameters, we can give a friendlier suggestion.

How was this patch tested?

Test using the pyspark shell, without the lowerBound and upperBound parameters.