read_csv: read file as binary when encoding_errors is set to ignore #1723
Would you add a test case covering this, please?
Added a test case where a UnicodeDecodeError is raised by default, and not raised when adding
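For anyone following along, a test of this shape could look roughly like the sketch below. It is not the test added in this PR: it calls `pd.read_csv` directly on an in-memory `io.BytesIO` buffer instead of reading from S3, the sample bytes are made up, and the helper names `read_strict` and `read_ignoring_errors` are hypothetical. It assumes pandas 1.3 or newer, where the `encoding_errors` argument is available.

```python
import io

import pandas as pd

# Made-up sample data: 0xE9 is "é" in Latin-1 but is not valid UTF-8 on its own.
BAD_CSV = b"col\ncaf\xe9\n"


def read_strict() -> pd.DataFrame:
    """Read with pandas' default (strict) handling of encoding errors."""
    return pd.read_csv(io.BytesIO(BAD_CSV), encoding="utf-8")


def read_ignoring_errors() -> pd.DataFrame:
    """Read the same bytes, asking pandas to drop undecodable bytes."""
    return pd.read_csv(
        io.BytesIO(BAD_CSV), encoding="utf-8", encoding_errors="ignore"
    )


# Default behaviour: the invalid byte surfaces as a UnicodeDecodeError.
try:
    read_strict()
    raised = False
except UnicodeDecodeError:
    raised = True

# With encoding_errors="ignore" the read succeeds and the bad byte is skipped.
df = read_ignoring_errors()
```

The key point is that pandas receives a binary buffer in both calls, so it is pandas that decides how decoding errors are handled.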
Nice! Test worked for me.
Awesome, thanks!
Feature or Bugfix
Detail
`read_csv` chokes on encoding errors even when passing `encoding_errors='ignore'`. This happens because we cast the S3 object to a `TextIOWrapper` after retrieving it and pass that to `pd.read_csv`. With `encoding_errors='ignore'`, we now keep the object as raw bytes (`mode=rb`). In this case pandas is responsible for wrapping it in a `TextIOWrapper` and handles the encoding and any encoding errors.
I'm actually thinking we should never wrap the S3 object into a `TextIOWrapper` ourselves - as far as I can tell there is no advantage in doing that and pandas will take care of it anyway.
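To illustrate the point about wrapping without touching S3, here is a minimal sketch that uses only the standard library. `io.BytesIO` stands in for the S3 object, and the helper names `read_wrapped_early` and `read_bytes_late` are hypothetical, not functions from this codebase. It shows that once a byte stream is wrapped in a `TextIOWrapper` with the default strict error handling, the consumer can no longer opt into `errors='ignore'`, whereas handing over raw bytes leaves that choice to the consumer.

```python
import io

# Bytes that are not valid UTF-8: 0xE9 is "é" in Latin-1.
RAW = b"name\ncaf\xe9\n"


def read_wrapped_early(raw: bytes) -> str:
    """Wrap the bytes in a strict TextIOWrapper before handing them over."""
    # errors defaults to "strict", so invalid bytes raise on read.
    wrapped = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
    return wrapped.read()


def read_bytes_late(raw: bytes, errors: str) -> str:
    """Keep the bytes as-is and let the consumer pick the error handling."""
    binary = io.BytesIO(raw)  # stand-in for an S3 object opened with mode="rb"
    wrapped = io.TextIOWrapper(binary, encoding="utf-8", errors=errors)
    return wrapped.read()


# Early wrapping: decoding fails regardless of what the caller wanted.
try:
    read_wrapped_early(RAW)
    early_failed = False
except UnicodeDecodeError:
    early_failed = True

# Late wrapping: the caller's errors="ignore" is honoured.
late_text = read_bytes_late(RAW, errors="ignore")
```

In this sketch the second helper plays the role pandas takes on when it is given a binary handle together with `encoding_errors`.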
`mode` should always be set to `rb` in our code... but I'm curious about others' opinion!
Relates
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.