Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_csv: read file as binary when encoding_errors is set to ignore #1723

Merged
merged 2 commits into from
Oct 28, 2022

Conversation

cnfait
Copy link
Contributor

@cnfait cnfait commented Oct 27, 2022

Feature or Bugfix

  • Bugfix

Detail

  • read_csv chokes on encoding errors even when passing encoding_errors='ignore'. This happens due to ours casting the S3 object to TextIOWrapper after retrieving it and passing that to pd.read_csv.
  • When specifying encoding_errors='ignore' we now keep the object as a set of bytes (mode=rb). In this case pandas is now responsible for wrapping this in a TextIOWrapper and deals with encoding and encoding errors.

I'm actually thinking we should never wrap the S3 object into a TextIOWrapper ourselves - as far as I can tell there is no advantage doing that and pandas will take care of it anyway. mode should always be set to rb in our code... but I'm curious about others' opinion!

Relates

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@cnfait cnfait added the bug Something isn't working label Oct 27, 2022
@cnfait cnfait self-assigned this Oct 27, 2022
@cnfait cnfait force-pushed the read-csv-encoding-errors-param branch from 16410de to 0c3de78 Compare October 27, 2022 10:15
@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 16410de
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@cnfait cnfait marked this pull request as ready for review October 27, 2022 10:27
@kukushking
Copy link
Contributor

Would you add a test case covering this, please?

@cnfait cnfait force-pushed the read-csv-encoding-errors-param branch from 39bcd61 to f9cb833 Compare October 27, 2022 16:10
@malachi-constant
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 39bcd61
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@cnfait
Copy link
Contributor Author

cnfait commented Oct 27, 2022

Would you add a test case covering this, please?

added a test case where a UnicodeDecodeError is raised by default, and not raised when adding encoding_errors=ignore.

Copy link
Contributor

@malachi-constant malachi-constant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Test worked for me.

Copy link
Contributor

@kukushking kukushking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks!

@cnfait cnfait merged commit 7b1d252 into main Oct 28, 2022
@cnfait cnfait deleted the read-csv-encoding-errors-param branch October 28, 2022 09:49
@kukushking kukushking added this to the 2.18.0 milestone Dec 2, 2022
@kukushking kukushking moved this from In Review to To Do in AWS SDK for pandas roadmap Dec 2, 2022
@kukushking kukushking moved this from To Do to Done in AWS SDK for pandas roadmap Dec 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

None yet

4 participants