
[SPARK-43389][SQL] Added a null check for lineSep option #41904

Closed · wants to merge 4 commits

Conversation

@gdhuper (Contributor) commented Jul 9, 2023

What changes were proposed in this pull request?

Added a `require` check that rejects a null `lineSep` value in the CSV options instead of letting it through.

Why are the changes needed?

`spark.read.csv` throws `NullPointerException` when `lineSep` is set to `None`. More details about the issue here: https://issues.apache.org/jira/browse/SPARK-43389

Does this PR introduce any user-facing change?

~~Users now should be able to explicitly set `lineSep` as `None` without getting an exception~~
After some discussion, it was decided to add a `require` check for `null` instead of letting it through.

How was this patch tested?

Tested the changes with a Python script that explicitly sets `lineSep` to `None`:

```python
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("HelloWorld").getOrCreate()

# Read CSV into a DataFrame
df = spark.read.csv("/tmp/hello.csv", header=True, inferSchema=True, lineSep=None)

# Also tested the following case when options are passed before invoking .csv
# df = spark.read.option("lineSep", None).csv("/Users/gdhuper/Documents/tmp/hello.csv", header=True, inferSchema=True)

# Show the DataFrame
df.show()

# Stop the SparkSession
spark.stop()
```

@github-actions github-actions bot added the SQL label Jul 9, 2023
@gdhuper gdhuper marked this pull request as ready for review July 9, 2023 09:22
The diff context under review (Scala excerpt, lines reordered from the review view):

```scala
val lineSeparator: Option[String] = parameters.get(LINE_SEP) match {
  case Some(sep) if sep != null =>
    // ... (require checks elided in the diff, ending with:
    // "with 2 characters due to the limitation of supporting multi-char 'lineSep' within quotes.")
    sep
```
srowen (Member):
Is a null line separator even valid? I'd imagine that should be an error, if an empty one is

gdhuper (Contributor, Author):
As far as I understand, None, when passed through Python (`lineSep: None`), ends up being `null` in this function. Hence, I have a check for `null`.
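The mapping described here can be sketched in plain Python. This is a minimal, hypothetical analogue of how PySpark normalizes option values before handing them to the JVM; the function name and exact behavior are assumptions, not the real PySpark helper:

```python
def to_str(value):
    """Normalize an option value PySpark-style: booleans become
    'true'/'false', other values are stringified, and None is passed
    through unchanged (arriving on the JVM side as null)."""
    if value is None:
        return None  # crosses the py4j bridge as JVM null
    if isinstance(value, bool):
        return str(value).lower()
    return str(value)

print(to_str(None))   # None
print(to_str(True))   # true
print(to_str("\n"))
```

A `None` that survives this normalization is exactly the `null` the Scala-side pattern match has to guard against.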

srowen (Member):
None isn't the default, right? If a user passes None, it feels like that should be an error

gdhuper (Contributor, Author):
@srowen I can see that it should be an error just like an empty value ''.
However, one of the use cases where this might be helpful is where someone might need to dynamically switch the value for lineSep from None to something else. For example, code generation is one example that comes to mind, where a base template with lineSep: None is used that can be replaced with appropriate values as desired.

srowen (Member):
Hm, but generated code would not need to set 'None', right? That isn't meaningful at the point you actually use Spark

gdhuper (Contributor, Author):
I see your point. I guess we could also use another placeholder value instead of None in that scenario.
I can revert the changes and add a require clause for the null check instead of letting it through. Or would you rather have it throw a NullPointerException?

srowen (Member):
I'd just require() it for consistency, unless there's an argument for other handling. (If there is, it'd probably apply to the "" case too)

gdhuper (Contributor, Author):
Done.
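The resolution the thread converges on — reject null the same way the empty string is rejected — can be sketched as a Python analogue of the Scala `require()` checks. The function name and error messages below are illustrative assumptions, not the actual Spark code:

```python
def validate_line_sep(sep):
    """Hypothetical analogue of the require() checks discussed above:
    reject None/null outright, consistent with the existing
    empty-string rejection, instead of failing later with an NPE."""
    if sep is None:
        raise ValueError("'lineSep' cannot be null; omit the option to use the default.")
    if len(sep) == 0:
        raise ValueError("'lineSep' cannot be an empty string.")
    return sep

validate_line_sep("\n")   # valid separator passes through
```

The design point is to fail fast with a descriptive message at option-parsing time, rather than letting a null propagate into the parser.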

@HyukjinKwon HyukjinKwon changed the title [SPARK-43389] [PySpark, SQL] Added a null check for lineSep option [SPARK-43389][PYTHON][SQL] Added a null check for lineSep option Jul 10, 2023
@HyukjinKwon HyukjinKwon changed the title [SPARK-43389][PYTHON][SQL] Added a null check for lineSep option [SPARK-43389][SQL] Added a null check for lineSep option Jul 10, 2023
@HyukjinKwon (Member):

And I actually think we might need to check every option. For some options, None might make sense as a valid value, but for others it does not.
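The option-wide audit suggested here could start with something as simple as collecting which options were explicitly set to None, then deciding per key whether that is legal. A minimal sketch (the function and the validation policy are hypothetical, not part of Spark):

```python
def find_none_options(options):
    """Return, in sorted order, the keys of options explicitly set to
    None -- the values that would arrive on the JVM side as null and
    may need a per-option validity decision."""
    return sorted(k for k, v in options.items() if v is None)

opts = {"lineSep": None, "header": True, "inferSchema": True}
print(find_none_options(opts))  # ['lineSep']
```

A validation layer could then consult a per-option allowlist: keys where None is meaningful pass through, the rest raise the same kind of `require`-style error added for `lineSep`.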


srowen commented Jul 13, 2023

Merged to master

@srowen srowen closed this in 9f07e4a Jul 13, 2023

srowen commented Jul 13, 2023

@gdhuper what's your JIRA handle? I can assign it to you


gdhuper commented Jul 13, 2023

> @gdhuper what's your JIRA handle? I can assign it to you

gdhuper

ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024

Closes apache#41904 from gdhuper/gdhuper/SPARK-43389.

Authored-by: Gurpreet Singh <gdhuper@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>