Support non-UTF-8 encoded CSV files #20473

@Rafferty97

Is your feature request related to a problem or challenge?

Currently, DataFusion doesn't appear to support reading CSV files that use an encoding other than UTF-8, such as the common ISO-8859-1.

While CSV may be a terrible data format, it's also ubiquitous in the wild, and many CSV files use alternative character encodings. It would be useful to have an option for reading CSV files that use an encoding other than UTF-8.

Describe the solution you'd like

Add an option to CsvOptions or elsewhere to specify the encoding used by the input file, defaulting to UTF-8. DataFusion could then use encoding_rs internally to decode chunks of incoming data into UTF-8 before they reach the CSV parser.
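To make the shape of this concrete, here is a minimal sketch using only the standard library. Everything in it is hypothetical: `CsvEncoding`, `ChunkDecoder`, and `decode_all` are names invented for illustration and are not part of DataFusion. The hand-rolled decoder is only a stand-in for the streaming decoder that encoding_rs would provide. The point it illustrates is that decoding has to be stateful, because a multi-byte sequence can be split across two incoming chunks.

```rust
/// Hypothetical option; not part of DataFusion's `CsvOptions` today.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
enum CsvEncoding {
    /// Default, matching current behaviour.
    #[default]
    Utf8,
    /// ISO-8859-1: each byte maps to the code point of the same value.
    Latin1,
}

/// Decodes a byte stream chunk by chunk into UTF-8 text.
struct ChunkDecoder {
    encoding: CsvEncoding,
    /// Bytes of a UTF-8 sequence that was cut off at a chunk boundary.
    pending: Vec<u8>,
}

impl ChunkDecoder {
    fn new(encoding: CsvEncoding) -> Self {
        Self { encoding, pending: Vec::new() }
    }

    /// Appends the decoded text of `chunk` to `out`.
    fn decode(&mut self, chunk: &[u8], out: &mut String) -> Result<(), String> {
        match self.encoding {
            CsvEncoding::Latin1 => {
                // Single-byte encoding: no state is carried between chunks.
                out.extend(chunk.iter().map(|&b| b as char));
                Ok(())
            }
            CsvEncoding::Utf8 => {
                self.pending.extend_from_slice(chunk);
                let valid_len = match std::str::from_utf8(&self.pending) {
                    Ok(s) => s.len(),
                    // `error_len() == None` means the input ended mid-sequence,
                    // so the tail is kept until the next chunk arrives.
                    Err(e) if e.error_len().is_none() => e.valid_up_to(),
                    Err(e) => {
                        return Err(format!("invalid UTF-8 at byte {}", e.valid_up_to()))
                    }
                };
                let text = std::str::from_utf8(&self.pending[..valid_len])
                    .expect("prefix was validated above");
                out.push_str(text);
                self.pending.drain(..valid_len);
                Ok(())
            }
        }
    }

    /// Call after the last chunk; fails if a sequence was left unfinished.
    fn finish(self) -> Result<(), String> {
        if self.pending.is_empty() {
            Ok(())
        } else {
            Err("input ended in the middle of a UTF-8 sequence".to_string())
        }
    }
}

/// Convenience driver: decodes all chunks with the chosen encoding.
fn decode_all(chunks: &[&[u8]], encoding: CsvEncoding) -> Result<String, String> {
    let mut decoder = ChunkDecoder::new(encoding);
    let mut out = String::new();
    for chunk in chunks {
        decoder.decode(chunk, &mut out)?;
    }
    decoder.finish()?;
    Ok(out)
}
```

With encoding_rs, the `ChunkDecoder` role would presumably be filled by its own stateful decoder type, which already handles sequences split across chunks. Note that encoding_rs follows the WHATWG Encoding Standard, where the `iso-8859-1` label resolves to windows-1252, so results for bytes 0x80 to 0x9F would differ from the strict Latin-1 mapping shown here.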

Describe alternatives you've considered

An alternative to depending on encoding_rs directly would be to expose an option that allows users to provide their own decoding logic, which they would then likely delegate to encoding_rs. This might be desirable if the added dependency is deemed too heavy (though it could easily be put behind a feature flag).
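A rough sketch of what such a hook might look like, again with hypothetical names (`CsvByteDecoder`, `LossyAscii`, and `decode_with` do not exist in DataFusion). The example decoder is deliberately trivial; a real one would most likely wrap an encoding_rs decoder.

```rust
/// Hypothetical user-facing hook; DataFusion has no such trait today.
trait CsvByteDecoder {
    /// Decodes `chunk` and appends the resulting text to `out`.
    /// `last` is true for the final chunk so stateful decoders can flush.
    fn decode_chunk(&mut self, chunk: &[u8], last: bool, out: &mut String);
}

/// Example user-supplied decoder: ASCII passes through unchanged and
/// every other byte becomes U+FFFD (the replacement character).
struct LossyAscii;

impl CsvByteDecoder for LossyAscii {
    fn decode_chunk(&mut self, chunk: &[u8], _last: bool, out: &mut String) {
        out.extend(
            chunk
                .iter()
                .map(|&b| if b.is_ascii() { b as char } else { '\u{FFFD}' }),
        );
    }
}

/// Stand-in for the reader side: feeds raw chunks through whichever
/// decoder the caller supplied and collects the UTF-8 text.
fn decode_with(decoder: &mut dyn CsvByteDecoder, chunks: &[&[u8]]) -> String {
    let mut out = String::new();
    for (i, chunk) in chunks.iter().enumerate() {
        let last = i + 1 == chunks.len();
        decoder.decode_chunk(chunk, last, &mut out);
    }
    out
}
```

The trade-off is that the core crate takes on no new dependency, at the cost of every user who needs a non-UTF-8 encoding writing (or copying) a small adapter like this one.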

Additional context

No response

Labels: enhancement