[SPARK-37604][SQL] Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files #34860
ok to test
HyukjinKwon left a comment
Hm,
emptyValue in read path reads the value as "".
emptyValue in write path writes "" as the value.
But the current change will change the behaviour of emptyValue in read to:
emptyValue in read path reads the value as "" as the value.
I think either way makes sense, so I wouldn't introduce an unnecessary breaking change here.
Kubernetes integration test starting
Kubernetes integration test status failure
Yes, I agree with you that it would be a breaking change. The parameter What confuses me is whether there is any better way to read csv files that were written with a self-defined emptyValue. If emptyValue is set to "EMPTY" when writing, then when reading we cannot recognize the empty strings (written as "EMPTY"): we get the string "EMPTY" rather than "". We have to handle the "EMPTY" strings with hard-coded logic rather than by setting an option when reading.
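The round trip described in this comment can be modeled in a few lines of plain Python. This is only a sketch of the semantics discussed in the thread, not Spark code: the helper names write_line and read_line_current are hypothetical, fields are assumed to contain no commas or quotes, and the marker "EMPTY" is the one used in the comment.

```python
def write_line(values, empty_value):
    # Write-path model: an empty string is emitted as empty_value.
    return ",".join(empty_value if v == "" else v for v in values)


def read_line_current(line, empty_value):
    # Current read-path model: only a quoted empty token ("") is replaced
    # by empty_value; every other token is returned unchanged.
    return [empty_value if tok == '""' else tok for tok in line.split(",")]


original = ["1", "", "Spark"]
line = write_line(original, "EMPTY")
restored = read_line_current(line, "EMPTY")

# Hard-coded workaround the comment refers to: map the marker back by hand.
fixed = ["" if v == "EMPTY" else v for v in restored]

print(line)               # 1,EMPTY,Spark
print(restored)           # ['1', 'EMPTY', 'Spark']
print(fixed == original)  # True
```

In this model the marker survives the read unchanged, so the original empty string is only recovered by the manual post-processing step.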
Test build #146105 has finished for PR 34860 at commit
Test build #146238 has finished for PR 34860 at commit
Maybe one possible approach to avoid a breaking change is to add another boolean option in CSVOptions, such as And with this option, we can change the With this option, the default behavior is the same as before, but users such as me can parse emptyValue strings as "" when reading csv files.
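A minimal sketch of such an opt-in switch, again as a plain-Python model rather than Spark code. The flag name match_empty_value_on_read is hypothetical (the option name proposed in the comment did not survive in this text), and treating a quoted empty token as "" when the flag is on is an assumption of this sketch.

```python
def read_token(token, empty_value, match_empty_value_on_read=False):
    # match_empty_value_on_read is a hypothetical flag name for illustration.
    quoted_empty = token == '""'
    if match_empty_value_on_read:
        # Opt-in behavior: a token equal to empty_value is read as "".
        # Assumption: a quoted empty token is also read as "".
        if quoted_empty or token == empty_value:
            return ""
        return token
    # Default behavior, unchanged: a quoted empty token becomes empty_value.
    return empty_value if quoted_empty else token


print(read_token("EMPTY", "EMPTY"))        # EMPTY
print(read_token("EMPTY", "EMPTY", True) == "")  # True
print(read_token('""', "EMPTY"))           # EMPTY
```

Because the flag defaults to off, existing callers would see no change in this model; only users who enable it get the marker mapped back to an empty string.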
Currently, maybe it is not widely required.
What changes were proposed in this pull request?
We change the effect of the emptyValueInRead option in CSVOptions so that any field matching this string is set to "" when reading csv files. Before this change, its effect was to convert quoted empty strings "\"\"" to the defined emptyValueInRead string.
Why are the changes needed?
After this change, the effect of the emptyValueInRead option is more in line with how users expect it to work, and it behaves similarly to the nullValue option when reading csv files. When writing and reading with the same emptyValue setting (emptyValueInWrite == emptyValueInRead), the dataframe parsed from the csv files is the same as the original one.
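The analogy with nullValue can be sketched as a small plain-Python model of the proposed read behavior. It is not Spark code: the function name read_token_proposed is hypothetical, and checking nullValue before emptyValue is an assumption of this sketch.

```python
from typing import Optional


def read_token_proposed(token: str, null_value: str, empty_value: str) -> Optional[str]:
    # nullValue on read: a token matching null_value becomes null (None here).
    if token == null_value:
        return None
    # Proposed emptyValue on read: a token matching empty_value becomes "".
    if token == empty_value:
        return ""
    return token


tokens = ["a", "NULL", "EMPTY"]
print([read_token_proposed(t, "NULL", "EMPTY") for t in tokens])
# ['a', None, '']
```

Both options then act as "match this marker in the file and map it back", which is what lets a write followed by a read with the same settings reproduce the original values in this model.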
Before this change, for code as follows:
the dataframe shows as:
the csv data shows as:
If we read this csv file with the same emptyValue value:
We finally get DataFrame data that shows as:
After this change, we can parse the csv file to get the same dataframe.
Does this PR introduce any user-facing change?
Yes, users should be clearly aware of this change and be careful when using the emptyValue parameter with a self-defined value.
How was this patch tested?
Updated the related test cases that already exist in CSVSuite.