
[SPARK-19239][PySpark] Check parameters whether equals None when specify the column in jdbc API #16599

Closed
wants to merge 5 commits

Conversation

libratiger

What changes were proposed in this pull request?

The jdbc API does not check lowerBound and upperBound when column is specified, and just throws the following exception:

int() argument must be a string or a number, not 'NoneType'

If we check the parameters, we can give a friendlier error message.

How was this patch tested?

Tested using the pyspark shell, calling jdbc without the lowerBound and upperBound parameters.
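For illustration, a minimal sketch of the kind of call that triggers the issue (the connection URL, table, and column names are hypothetical):

```python
# Hypothetical reproduction: column is given but lowerBound/upperBound are not.
# Before this patch, PySpark fails with
#   TypeError: int() argument must be a string or a number, not 'NoneType'
# After this patch, it raises an AssertionError with a clearer message.
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/testdb",  # assumed example URL
    table="people",                                  # assumed example table
    column="id",                                     # partition column
    numPartitions=4,
    # lowerBound and upperBound intentionally omitted
)
```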

…ual None in jdbc API

The ``jdbc`` API does not check ``lowerBound`` and ``upperBound`` when ``column`` is specified, and just throws the following exception:
```int() argument must be a string or a number, not 'NoneType'```
If we check the parameters, we can give a friendlier error message.
@libratiger
Author

@zsxwing can you take a look?

@@ -431,6 +432,8 @@ def jdbc(self, url, table, column=None, lowerBound=None, upperBound=None, numPar
if column is not None:
if numPartitions is None:
numPartitions = self._spark._sc.defaultParallelism
assert lowerBound != None, "lowerBound can not be None when ``column`` is specified"
assert upperBound != None, "upperBound can not be None when ``column`` is specified"
Member

Should we resemble the condition here -

require(partitionColumn.isEmpty ||
(lowerBound.isDefined && upperBound.isDefined && numPartitions.isDefined),
s"If '$JDBC_PARTITION_COLUMN' is specified then '$JDBC_LOWER_BOUND', '$JDBC_UPPER_BOUND'," +
s" and '$JDBC_NUM_PARTITIONS' are required.")
?

Author

Yes, the Scala code would catch this, but the PySpark code fails at int(lowerBound) first, so the user only sees a confusing TypeError.
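For context, a rough sketch of the pre-existing PySpark code path (simplified; details may differ from the actual source):

```python
# Simplified sketch of DataFrameReader.jdbc before the patch. The int()
# conversions run on the Python side, so they raise a TypeError before the
# Scala-side require(...) validation is ever reached.
if column is not None:
    if numPartitions is None:
        numPartitions = self._spark._sc.defaultParallelism
    return self._df(self._jreader.jdbc(
        url, table, column,
        int(lowerBound), int(upperBound),   # TypeError here if either is None
        int(numPartitions), jprop))          # jprop: connection properties (omitted)
```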

@gatorsmile
Member

ok to test

@SparkQA

SparkQA commented Jan 17, 2017

Test build #71472 has finished for PR 16599 at commit 94c44ba.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 17, 2017

Test build #71476 has finished for PR 16599 at commit 43602b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -431,6 +432,8 @@ def jdbc(self, url, table, column=None, lowerBound=None, upperBound=None, numPar
if column is not None:
if numPartitions is None:
numPartitions = self._spark._sc.defaultParallelism
Member

This contradicts the Scala version. Could you also change it to the following code?

assert numPartitions is not None, "numPartitions can not be None when ``column`` is specified"

Author

I am a little worried that this change might break the API. If some users only specify column, lowerBound, and upperBound in an earlier Spark version, their programs will fail after upgrading, even though very few people rely on the default parallelism.

Personally, I lean toward making the change and keeping the API consistent.

If you think we should add the assert on numPartitions, I will update the PR soon.

Member

I think we should make the Scala API and Python API consistent. The existing Python API does not follow the documentation:

These options must all be specified if any of them is specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading.
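To make the stride behavior concrete, a hypothetical example (the URL, table, and values are made up, not from this PR):

```python
# With column="id", lowerBound=0, upperBound=100, numPartitions=4,
# Spark splits the read into ranges of stride (100 - 0) / 4 = 25, roughly:
#   WHERE id < 25 OR id IS NULL
#   WHERE id >= 25 AND id < 50
#   WHERE id >= 50 AND id < 75
#   WHERE id >= 75
# Rows outside [0, 100) are still returned by the first and last partitions;
# the bounds only control the stride, not filtering.
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/testdb",  # assumed example URL
    table="people",
    column="id",
    lowerBound=0,
    upperBound=100,
    numPartitions=4,
)
```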

@gatorsmile
Member

Have you manually tested your code changes?

DjvuLee added 2 commits January 17, 2017 16:14
In the Scala API, `numPartitions` is required when `column` is specified, so we remove the default parallelism to keep the APIs consistent.
@libratiger
Author

I updated the PR and tested the change in the pyspark shell.

@libratiger libratiger changed the title [SPARK-19239][PySpark] Check the lowerBound and upperBound whether equals None in jdbc API [SPARK-19239][PySpark] Check parameters whether equals None when specify the column in jdbc API Jan 17, 2017
assert lowerBound is not None, "lowerBound can not be None when ``column`` is specified"
assert upperBound is not None, "upperBound can not be None when ``column`` is specified"
assert numPartitions is not None, "numPartitions can not be None " \
"when ``column`` is specified"
Member

        assert numPartitions is not None, \
            "numPartitions can not be None when ``column`` is specified"

@SparkQA

SparkQA commented Jan 17, 2017

Test build #71488 has finished for PR 16599 at commit f2dad9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 17, 2017

Test build #71490 has finished for PR 16599 at commit e9a1bca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 17, 2017

Test build #71496 has finished for PR 16599 at commit d851382.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

@gatorsmile
Member

Thanks! Merging to master

@asfgit asfgit closed this in 843ec8e Jan 17, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…ify the column in jdbc API

## What changes were proposed in this pull request?

The `jdbc` API does not check `lowerBound` and `upperBound` when ``column`` is specified, and just throws the following exception:

>```int() argument must be a string or a number, not 'NoneType'```

If we check the parameters, we can give a friendlier error message.

## How was this patch tested?
Tested using the pyspark shell, calling jdbc without the lowerBound and upperBound parameters.

Author: DjvuLee <lihu@bytedance.com>

Closes apache#16599 from djvulee/pysparkFix.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017