Distributed: Add `names` parameter support to PyArrow reading #2008

LeonLuttenberger · 2023-02-10T17:19:39Z

Feature or Bugfix

Feature

Detail

When a list of column names is provided for a CSV file, our function currently falls back to the Pandas read method rather than PyArrow. This PR forwards the `names` parameter to PyArrow, so the PyArrow reader can be used in this case as well.
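As a rough illustration of the dispatch described above, here is a minimal sketch, not the library's actual implementation. The helper `choose_reader` and the set `ARROW_SUPPORTED_KWARGS` are hypothetical names introduced only for this example; the one real detail assumed is that PyArrow's CSV `ReadOptions` takes the column list as `column_names`.

```python
from typing import Any, Dict, Tuple

# Hypothetical: pandas-style keyword arguments the PyArrow path can translate.
ARROW_SUPPORTED_KWARGS = {"names"}


def choose_reader(pandas_kwargs: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Pick a reader and return the keyword arguments to hand to it.

    Returns ("pyarrow", read_options_kwargs) when every argument can be
    translated, otherwise ("pandas", original_kwargs) as the fallback.
    """
    unsupported = set(pandas_kwargs) - ARROW_SUPPORTED_KWARGS
    if unsupported:
        # At least one argument has no PyArrow equivalent in this sketch.
        return "pandas", dict(pandas_kwargs)

    arrow_kwargs: Dict[str, Any] = {}
    if pandas_kwargs.get("names") is not None:
        # pandas calls it `names`; PyArrow's ReadOptions calls it `column_names`.
        arrow_kwargs["column_names"] = list(pandas_kwargs["names"])
    return "pyarrow", arrow_kwargs
```

With this shape, passing only `names` keeps the read on the PyArrow path, while any argument outside the supported set still falls back to Pandas.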

Unfortunately, this uncovered an error in how we load compressed CSV data. I added an xfail to one of the test cases and referenced the corresponding GitHub issue: #2005

This change lets us load the dataset of 1 million small CSV files (aws-sdk-pandas-list-us-east-1-658066294590) more efficiently. Because those files have no header row, we need to be able to pass the `names` parameter.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

malachi-constant · 2023-02-10T18:18:46Z

AWS CodeBuild CI Report

CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
Commit ID: 776ce83
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

malachi-constant · 2023-02-10T18:22:15Z

AWS CodeBuild CI Report

CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
Commit ID: 965bf93
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2023-02-10T18:23:57Z

AWS CodeBuild CI Report

CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
Commit ID: 965bf93
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2023-02-10T18:34:59Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-ATYtnXPE7MOa
Commit ID: 965bf93
Result: SUCCEEDED
Build Logs (available for 30 days)

LeonLuttenberger added 3 commits February 9, 2023 10:22

Expand arrow reading to support the names parameter

db47c48

Merge branch 'release-3.0.0' into dist/expand-arrow-read-text

ed71d5e

Add xfail and link to GitHub issue

52be959

LeonLuttenberger requested review from jaidisido, cnfait, kukushking and malachi-constant February 10, 2023 17:21

LeonLuttenberger added the enhancement (New feature or request) label Feb 10, 2023

LeonLuttenberger marked this pull request as ready for review February 10, 2023 17:22

Add xfail to test_csv_write

776ce83

Fix missing annotations

965bf93

malachi-constant approved these changes Feb 10, 2023

LeonLuttenberger merged commit 193923d into release-3.0.0 Feb 10, 2023

LeonLuttenberger deleted the dist/expand-arrow-read-text branch February 10, 2023 18:57

LeonLuttenberger mentioned this pull request Apr 21, 2023

Create ADR for switching between PyArrow and Pandas based datasources #2218

Closed
