Distributed: Add names parameter support to PyArrow reading #2008

Merged
merged 5 commits into from
Feb 10, 2023

Conversation

LeonLuttenberger
Contributor

@LeonLuttenberger commented Feb 10, 2023

Feature or Bugfix

  • Feature

Detail

When a list of column names is provided for a CSV file, our function falls back to the Pandas read method rather than using PyArrow. This PR adds support for forwarding the names parameter to PyArrow.

Unfortunately, this uncovered an error in how we load compressed CSV data. I marked one of the test cases as xfail and referenced the relevant GitHub issue: #2005

This change will allow us to more efficiently load the dataset with 1 million small CSV files (aws-sdk-pandas-list-us-east-1-658066294590). Because those files do not contain a header, we need to be able to use the names parameter.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@LeonLuttenberger added the enhancement (New feature or request) label Feb 10, 2023
@LeonLuttenberger marked this pull request as ready for review February 10, 2023 17:22
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
  • Commit ID: 776ce83
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 965bf93
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)


@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
  • Commit ID: 965bf93
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)


@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-ATYtnXPE7MOa
  • Commit ID: 965bf93
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

