[SPARK-48807][SQL] Binary Support for CSV datasource #47212
The only thing from me is that we won't be able to round-trip reads and writes. Can we do this with the newer binary string format?
For IO roundtrip, the UTF8 output style can be read back directly. Other styles can play with/ functions, or we can add an extra read option to help.
If we specify the schema as binary, can we read it back as binary?
I remember we do similar things in thriftserver (cc @wangyum), so I am fine with this, but I just want to make sure we can read it back.
Yes, I have added the above tests to verify read-as-raw-string and read-with-binary-schema.
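The roundtrip concern above can be illustrated without Spark. The following is a minimal sketch using only the JDK, not the code from this PR: the object and method names are hypothetical, and Base64 stands in for "a non-UTF8 style". It shows that a cell written as UTF-8 text can be turned back into the same bytes by taking the string's bytes, while an encoded cell needs an explicit decode step.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Base64

// Hypothetical helpers for illustration only; none of these names come from Spark.
object RoundTripSketch {
  // UTF-8 style: the cell text is the bytes decoded as UTF-8.
  def writeUtf8(bytes: Array[Byte]): String = new String(bytes, UTF_8)
  // Reading back as binary is then just the string's UTF-8 bytes.
  def readUtf8(cell: String): Array[Byte] = cell.getBytes(UTF_8)

  // A non-UTF-8 style (Base64 is used here as an example) needs a decode on read.
  def writeBase64(bytes: Array[Byte]): String = Base64.getEncoder.encodeToString(bytes)
  def readBase64(cell: String): Array[Byte] = Base64.getDecoder.decode(cell)

  def main(args: Array[String]): Unit = {
    val original = "Spark SQL".getBytes(UTF_8)

    // UTF-8 style round-trips directly (for input that is valid UTF-8).
    println(readUtf8(writeUtf8(original)).sameElements(original)) // true

    // Base64 style: the raw cell bytes differ from the original ...
    val cell = writeBase64(original)
    println(cell)                                                  // U3BhcmsgU1FM
    println(readUtf8(cell).sameElements(original))                 // false
    // ... until the matching decoder is applied.
    println(readBase64(cell).sameElements(original))               // true
  }
}
```

Note that the UTF-8 path is only lossless when the bytes are valid UTF-8 to begin with; arbitrary binary data would need one of the encoded styles plus a decode step.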
.option("ds_option", "value")
.format(dataSourceFormat)
.save(path.getCanonicalPath)
val expectedStr = ToStringBase.getBinaryFormatter("Spark SQL".getBytes())
Can we change the value to a non-UTF8 output instead?
This helper method gets a binary formatter based on BINARY_OUTPUT_STYLE and converts the raw bytes here to both UTF8 and non-UTF8 outputs
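As a rough picture of what "a formatter chosen by output style" means, here is a standalone sketch. It is not Spark's `ToStringBase.getBinaryFormatter`; the function `formatterFor` and the style names `"UTF-8"` and `"HEX"` are assumptions made for this example only.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical stand-in for a style-driven binary formatter.
object BinaryStyleSketch {
  // Returns a function that renders bytes as text for the given style name.
  def formatterFor(style: String): Array[Byte] => String = style match {
    case "UTF-8" => (bytes: Array[Byte]) => new String(bytes, UTF_8)
    // Mask with 0xff so negative bytes print as two hex digits.
    case "HEX"   => (bytes: Array[Byte]) => bytes.map(b => f"${b & 0xff}%02X").mkString
    case other   => throw new IllegalArgumentException(s"unknown style: $other")
  }

  def main(args: Array[String]): Unit = {
    val bytes = "Spark SQL".getBytes(UTF_8)
    println(formatterFor("UTF-8")(bytes)) // Spark SQL
    println(formatterFor("HEX")(bytes))   // 537061726B2053514C
  }
}
```

A test that derives its expected string from the same formatter, as in the snippet under review, stays valid whichever style is configured, which is the point made in the reply above.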
Thank you @HyukjinKwon @dongjoon-hyun. Merged to master.
### What changes were proposed in this pull request?

SPARK-42237 disabled binary output for CSV because the binary values use `java.lang.Object.toString` for outputting. Now we have meaningful binary string representations support in UnivocityGenerator, we can support it now.

### Why are the changes needed?

improve csv with spark sql types

### Does this PR introduce _any_ user-facing change?

Yes, but it's from failures to success with binary csv tables

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#47212 from yaooqinn/SPARK-48807.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
What changes were proposed in this pull request?

SPARK-42237 disabled binary output for CSV because the binary values use `java.lang.Object.toString` for outputting. Now we have meaningful binary string representations support in UnivocityGenerator, we can support it now.

Why are the changes needed?

improve csv with spark sql types

Does this PR introduce any user-facing change?

Yes, but it's from failures to success with binary csv tables

How was this patch tested?

new tests

Was this patch authored or co-authored using generative AI tooling?

no
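
To make the `java.lang.Object.toString` problem from the description concrete, here is a small JDK-only sketch (the object and method names are hypothetical). Calling `toString` on a byte array yields the array's type tag and an identity hash rather than its contents, so the text written to a file cannot be turned back into the data.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical names; illustrates default array rendering versus a content-based one.
object ToStringSketch {
  // What you get if a writer simply calls toString on the value.
  def defaultRendering(bytes: Array[Byte]): String = bytes.toString
  // A content-based rendering, here decoding the bytes as UTF-8 text.
  def utf8Rendering(bytes: Array[Byte]): String = new String(bytes, UTF_8)

  def main(args: Array[String]): Unit = {
    val bytes = "Spark SQL".getBytes(UTF_8)
    // Prints "[B@" followed by a hash that varies between runs and arrays.
    println(defaultRendering(bytes))
    // Prints the actual content.
    println(utf8Rendering(bytes)) // Spark SQL
  }
}
```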