-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48219][CORE] StreamReader Charset fix with UTF8 #46509
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think you can provide a test coverage to protect your contribution from potential future regression, @xuzifu666 ?
Not need
@dongjoon-hyun Thanks for your attentions,In my option this code change not need to provide tests for it's a specification for ReadStream usage,if not set utf8 charset may occur error when system default charset not contains Chinese Chars. You can refer it in other framework such as Calcite,Hive,all set utf8 when InputStreamReader constructor method be called. |
@dongjoon-hyun Could you give a final review? Thanks |
Sorry but I'll leave this to the other reviewers, @xuzifu666 . |
@HyukjinKwon could you help to give a review? Thanks |
The change itself looks reasonable to me. I also agree with @dongjoon-hyun that we shall add a simple test, maybe in BTW, the PR is tagged as CORE but the changes belong to XML of sql datasource and hive thriftserver |
@@ -171,7 +172,7 @@ protected BufferedReader loadFile(String fileName) throws IOException { | |||
FileInputStream initStream = null; | |||
BufferedReader bufferedReader = null; | |||
initStream = new FileInputStream(fileName); | |||
bufferedReader = new BufferedReader(new InputStreamReader(initStream)); | |||
bufferedReader = new BufferedReader(new InputStreamReader(initStream, StandardCharsets.UTF_8)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code here is copied from Hive, do we have the same issue in the upstream repo? Better to cite here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK,I would address it @yaooqinn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I wouldn't touch here.
@@ -48,7 +49,7 @@ object XSDToSchema extends Logging{ | |||
val in = ValidatorUtil.openSchemaFile(xsdPath) | |||
val xmlSchemaCollection = new XmlSchemaCollection() | |||
xmlSchemaCollection.setBaseUri(xsdPath.toString) | |||
val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in)) | |||
val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in, StandardCharsets.UTF_8)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is arguable because if you don't specify the encoding, then it will pick the system default encoding up. If your XSD file is written in other encodings, but here tries to read UTF-8, it will fail.
Other places they have to be UTF-8 because Spark encodes so explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks,I cite hive firstly,than feedback to this pr? @HyukjinKwon XSD not do the change,only change hive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As described above, I wouldn't touch Hive side.
XSDtoSchema would not modify it, than HiveImpl had also changed can refer recent pr: apache/hive#5243 so I Think it is nesscery to change it? @yaooqinn @HyukjinKwon |
Thank you @xuzifu666. Merged to master |
What changes were proposed in this pull request?
Fix some StreamReader not set with UTF8,if we actually default charset not support Chinese chars such as latin and conf contain Chinese chars,it would not resolve success,so we need set it as utf8 in StreamReader,we can find all StreamReader with utf8 charset in other compute framework,such as Calcite、Hive、Hudi and so on.
Why are the changes needed?
May cause string decode not as expected
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
Not need
Was this patch authored or co-authored using generative AI tooling?
No