Description
Is your feature request related to a problem or challenge?
Currently, DataFusion doesn't appear to support reading CSV files that use a non-UTF-8 character encoding, such as the common ISO-8859-1.
While CSV may be a terrible data format, it's also ubiquitous in the wild, and many CSV files use encodings other than UTF-8. It would be useful to have an option for reading such files.
Describe the solution you'd like
Add an option to CsvOptions or elsewhere to specify the encoding used by the input file, defaulting to UTF-8. DataFusion could then use encoding_rs internally to decode chunks of incoming data to UTF-8 before they are parsed.
Describe alternatives you've considered
An alternative to depending on encoding_rs directly would be to expose an option that allows users to provide their own decoding logic, which they would then likely delegate to encoding_rs. This might be desirable if the added dependency is deemed too heavy (though it could easily be put behind a feature flag).
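To make the user-provided decoding idea more concrete, here is a rough sketch of what such a hook could look like. It uses only the standard library. The names `ChunkDecoder`, `Latin1Decoder`, and `decode_reader` are hypothetical and are not part of DataFusion or encoding_rs. The ISO-8859-1 decoder relies on the fact that every ISO-8859-1 byte maps to the Unicode code point with the same value; a real implementation would more likely delegate to encoding_rs.

```rust
use std::io::{self, Read};

/// Hypothetical hook (not a DataFusion API): turns raw input bytes into
/// UTF-8 text, one chunk at a time.
trait ChunkDecoder {
    /// Decode `chunk` and append the result to `out`.
    /// `last` is true for the final call so that a stateful decoder
    /// (for example, one for a multi-byte encoding) can flush pending bytes.
    fn decode_chunk(&mut self, chunk: &[u8], last: bool, out: &mut String);
}

/// ISO-8859-1 decoder: each byte is the Unicode code point of the same value,
/// so no state has to be carried across chunk boundaries.
struct Latin1Decoder;

impl ChunkDecoder for Latin1Decoder {
    fn decode_chunk(&mut self, chunk: &[u8], _last: bool, out: &mut String) {
        out.extend(chunk.iter().map(|&b| b as char));
    }
}

/// Read `reader` in chunks of at most `chunk_size` bytes and decode each
/// chunk with `decoder`, returning the whole input as a UTF-8 `String`.
fn decode_reader<R: Read, D: ChunkDecoder>(
    mut reader: R,
    decoder: &mut D,
    chunk_size: usize,
) -> io::Result<String> {
    // Guard against a zero-sized buffer, which would look like end of input.
    let mut buf = vec![0u8; chunk_size.max(1)];
    let mut out = String::new();
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            // End of input: give the decoder a chance to flush.
            decoder.decode_chunk(&[], true, &mut out);
            return Ok(out);
        }
        decoder.decode_chunk(&buf[..n], false, &mut out);
    }
}
```

With this shape, calling `decode_reader(bytes, &mut Latin1Decoder, 4096)` on ISO-8859-1 input would yield UTF-8 text that could be handed to the existing CSV reader unchanged. A production version would presumably stream decoded chunks instead of collecting the whole input into one `String`.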
Additional context
No response