[SPARK-4386] Improve performance when writing Parquet files #3843

Closed
wants to merge 1 commit into from

MickDavies
Contributor

Convert type of RowWriteSupport.attributes to Array.

Profiling writes of very wide tables shows that most of the time is spent in the apply method of the attributes var. The type of attributes was previously a LinearSeqOptimized, whose apply is O(N), so looking up every column by index made each row write O(N squared) in the number of columns.

Measurements on a 575-column table showed a 6x improvement in write times with this change.
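
To illustrate the access pattern described above, here is a minimal sketch rather than the actual RowWriteSupport code. `Attribute`, `totalNameLength`, `timeMillis`, and the row count are hypothetical names and values chosen for illustration; only the 575-column figure comes from the measurement above. It looks up each column by index, once through a List (a linear sequence, where `apply(i)` walks i nodes) and once through an Array (constant-time `apply`).

```scala
object IndexedAccessSketch {
  // Hypothetical stand-in for a column attribute.
  final case class Attribute(name: String)

  // Visits columns 0 until n by index, mirroring a per-row write loop.
  // The cost of each lookup(i) depends on the backing collection.
  def totalNameLength(n: Int, lookup: Int => Attribute): Int = {
    var total = 0
    var i = 0
    while (i < n) {
      total += lookup(i).name.length
      i += 1
    }
    total
  }

  // Runs body once and returns its result with the elapsed milliseconds.
  def timeMillis[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    val n = 575      // column count from the measurement above
    val rows = 2000  // hypothetical number of rows to "write"
    val asList: List[Attribute] = List.tabulate(n)(i => Attribute(s"col$i"))
    val asArray: Array[Attribute] = asList.toArray

    // List.apply(i) is O(i), so one pass over a row is O(N^2).
    val (listSum, listMs) = timeMillis {
      var s = 0L
      var r = 0
      while (r < rows) { s += totalNameLength(n, i => asList(i)); r += 1 }
      s
    }

    // Array.apply(i) is O(1), so one pass over a row is O(N).
    val (arraySum, arrayMs) = timeMillis {
      var s = 0L
      var r = 0
      while (r < rows) { s += totalNameLength(n, i => asArray(i)); r += 1 }
      s
    }

    assert(listSum == arraySum) // same work, different lookup cost
    println(s"List: $listMs ms, Array: $arrayMs ms")
  }
}
```

Timings will vary by machine and JVM warm-up, so the sketch prints them rather than claiming a specific ratio.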

@AmplabJenkins

Can one of the admins verify this patch?

@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Dec 30, 2014

Test build #24902 has started for PR 3843 at commit 892519d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 30, 2014

Test build #24902 has finished for PR 3843 at commit 892519d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24902/

@marmbrus
Contributor

Thanks! I'm merging this to master and branch-1.2.

asfgit pushed a commit that referenced this pull request Dec 30, 2014
Convert type of RowWriteSupport.attributes to Array.

Profiling writes of very wide tables shows that most of the time is spent in the apply method of the attributes var. The type of attributes was previously a LinearSeqOptimized, whose apply is O(N), so looking up every column by index made each row write O(N squared) in the number of columns.

Measurements on a 575-column table showed a 6x improvement in write times with this change.

Author: Michael Davies <Michael.BellDavies@gmail.com>

Closes #3843 from MickDavies/SPARK-4386 and squashes the following commits:

892519d [Michael Davies] [SPARK-4386] Improve performance when writing Parquet files

(cherry picked from commit 7425bec)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit asfgit closed this in 7425bec Dec 30, 2014
@MickDavies MickDavies deleted the SPARK-4386 branch December 31, 2014 10:10