New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-43389][SQL] Added a null check for lineSep option #41904
Conversation
"with 2 characters due to the limitation of supporting multi-char 'lineSep' within quotes.") | ||
sep | ||
val lineSeparator: Option[String] = parameters.get(LINE_SEP) match { | ||
case Some(sep) if sep != null => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a null line separator even valid? I'd imagine that should be an error, if an empty one is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as my understanding, None
when passed through python (lineSep: None)
ends up being null
in this function. Hence, I have a check for null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None isn't the default, right? if a user passes None, it feels like that should be an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srowen I can see that it should be an error just like an empty value ''
.
However, one of the use cases where this might be helpful is where someone might need to dynamically switch the value for lineSep
from None
to something else. For example, code generation is one example that comes to mind, where a base template with lineSep: None
is used that can be replaced with appropriate values as desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, but generated code would not need to set 'None' right? that isn't meaningful at the point you actually use Spark
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. I guess we could also use another placeholder value instead of None
in that scenario.
I can revert the changes and add a require
clause for null check instead of letting it through. Or would you rather have it throw an NullPointerException
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just require() it for consistency, unless there's an argument for other handling. (If there is, it'd probably apply to the "" case too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
and I actually think we might need to check every option. In some options, |
Merged to master |
@gdhuper what's your JIRA handle? I can assign it to you |
gdhuper |
### What changes were proposed in this pull request? ### Why are the changes needed? - `spark.read.csv` throws `NullPointerException` when lineSep is set to None - More details about the issue here: https://issues.apache.org/jira/browse/SPARK-43389 ### Does this PR introduce _any_ user-facing change? ~~Users now should be able to explicitly set `lineSep` as `None` without getting an exception~~ After some discussion, it was decided to add a `require` check for `null` instead of letting it through. ### How was this patch tested? Tested the changes with a python script that explicitly sets `lineSep` to `None` ```python from pyspark.sql import SparkSession # Create a SparkSession spark = SparkSession.builder.appName("HelloWorld").getOrCreate() # Read CSV into a DataFrame df = spark.read.csv("/tmp/hello.csv", header=True, inferSchema=True, lineSep=None) # Also tested the following case when options are passed before invoking .csv #df = spark.read.option("lineSep", None).csv("/Users/gdhuper/Documents/tmp/hello.csv", header=True, inferSchema=True) # Show the DataFrame df.show() # Stop the SparkSession spark.stop() ``` Closes apache#41904 from gdhuper/gdhuper/SPARK-43389. Authored-by: Gurpreet Singh <gdhuper@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>
What changes were proposed in this pull request?
Why are the changes needed?
spark.read.csv
throwsNullPointerException
when lineSep is set to NoneDoes this PR introduce any user-facing change?
Users now should be able to explicitly setlineSep
asNone
without getting an exceptionAfter some discussion, it was decided to add a
require
check fornull
instead of letting it through.How was this patch tested?
Tested the changes with a python script that explicitly sets
lineSep
toNone