
[SPARK-20773][SQL] ParquetWriteSupport.writeFields is quadratic in number of fields #18005

Closed
wants to merge 3 commits into apache:master from tpoterba:tpoterba-patch-1

Conversation

tpoterba
Contributor

Fix quadratic List indexing in ParquetWriteSupport.

I noticed this function while profiling some code today. It showed up as a significant factor in a table with twenty columns; with hundreds of columns, it could dominate every other function call.

What changes were proposed in this pull request?

The writeFields method iterates from 0 until the number of fields, indexing into rootFieldWriters for each element. rootFieldWriters is a List, so each lookup is a linear operation, and the complexity of the writeFields method is thus quadratic in the number of fields.

Solution: explicitly convert rootFieldWriters to Array (implicitly converted to WrappedArray) for constant-time indexing.
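
To make the complexity argument concrete, here is a self-contained sketch (not the actual Spark source; IndexingSketch, the ValueWriter alias, and the timing helper are invented for this illustration) contrasting positional lookup on a List with an Array passed as an IndexedSeq:

object IndexingSketch {
  // Stand-in for the private ParquetWriteSupport.ValueWriter type.
  type ValueWriter = Int => Unit

  // Quadratic: writers(i) on a List walks i elements on every lookup.
  def writeAllLinear(writers: List[ValueWriter], numFields: Int): Unit = {
    var i = 0
    while (i < numFields) { writers(i)(i); i += 1 }
  }

  // Linear: an Array (implicitly wrapped as a WrappedArray, an IndexedSeq
  // in Scala 2.11/2.12) indexes in constant time.
  def writeAllIndexed(writers: IndexedSeq[ValueWriter], numFields: Int): Unit = {
    var i = 0
    while (i < numFields) { writers(i)(i); i += 1 }
  }

  private def time(label: String)(body: => Unit): Unit = {
    val start = System.nanoTime()
    body
    println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  }

  def main(args: Array[String]): Unit = {
    val n = 10000
    val writers = Seq.fill(n)((_: Int) => ())
    time("List ")(writeAllLinear(writers.toList, n))
    time("Array")(writeAllIndexed(writers.toArray[ValueWriter], n))
  }
}

The List version performs on the order of n^2/2 element traversals for n fields, while the Array-backed version performs n constant-time lookups.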

How was this patch tested?

This is a one-line change for performance reasons.

Fix quadratic List indexing in ParquetWriteSupport.

The minimal solution is to convert rootFieldWriters to a WrappedArray, which has O(1) indexing and restores the method's complexity to linear.
@hvanhovell
Contributor

ok to test

@hvanhovell
Contributor

Can you also make sure that we do not use a Seq for struct writing?

@SparkQA

SparkQA commented May 16, 2017

Test build #76983 has finished for PR 18005 at commit e13ce9f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

It might be nice to explicitly use the type IndexedSeq[ValueWriter] for rootFieldWriters (up on line 61 of this file) since that would capture the intent behind using an Array and would maybe help prevent regression during refactoring.

@tpoterba
Contributor Author

Yeah, I can change that - I do hate the standard IndexedSeq implementation (Vector) though, and want to make sure that the collection is actually a WrappedArray.

I've now done more than the one-line change made through the GitHub UI, and will update with performance benchmarks and a successful build.
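
For reference, a minimal stand-alone sketch of what the suggested declaration could look like (hypothetical example, not the ParquetWriteSupport source; WrappedArrayCheck and the ValueWriter alias are stand-ins), assuming Scala 2.11/2.12, where an Array assigned to an IndexedSeq-typed field is wrapped as a WrappedArray rather than copied into a Vector:

object WrappedArrayCheck {
  // Stand-in for the private ParquetWriteSupport.ValueWriter type.
  type ValueWriter = (Int, Int) => Unit

  // Declaring the field as IndexedSeq documents the O(1)-indexing intent, while
  // assigning from .toArray keeps it array-backed: in Scala 2.11/2.12 the Array
  // is implicitly wrapped as a scala.collection.mutable.WrappedArray.
  private var rootFieldWriters: IndexedSeq[ValueWriter] = _

  def main(args: Array[String]): Unit = {
    val makeWriter: Int => ValueWriter = _ => (_, _) => ()
    rootFieldWriters = (0 until 5).map(makeWriter).toArray[ValueWriter]
    println(rootFieldWriters.getClass)  // e.g. scala.collection.mutable.WrappedArray$ofRef
  }
}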

@@ -90,7 +90,7 @@ private[parquet] class ParquetWriteSupport extends WriteSupport[InternalRow] wit
   }

-    this.rootFieldWriters = schema.map(_.dataType).map(makeWriter).toArray
+    this.rootFieldWriters = schema.map(_.dataType).map(makeWriter).toArray[ValueWriter]
Contributor


Either call toIndexedSeq or make the rootFieldWriters an Array. Both are fine.

@@ -90,7 +90,7 @@ private[parquet] class ParquetWriteSupport extends WriteSupport[InternalRow] wit
   }

-    this.rootFieldWriters = schema.map(_.dataType).map(makeWriter)
+    this.rootFieldWriters = schema.map(_.dataType).map(makeWriter).toArray[ValueWriter]
Contributor

@hvanhovell May 17, 2017


Either call toIndexedSeq or make the rootFieldWriters an Array. Both are fine.

@hvanhovell
Contributor

LGTM pending jenkins

@tpoterba
Contributor Author

Addressed comments.

I tried to get some benchmark stats for this code:

spark.read.csv(text_file).write.mode('overwrite').parquet(parquet_path)

I wanted to see the performance improvement for files with various numbers of columns/rows that were all 1.5G. However, I didn't see much of a difference with <30 columns, and Catalyst blew up when I tried ~50 columns (I wanted to go up to several hundred).

@hvanhovell
Contributor

What do you mean by catalyst blew up?

@tpoterba
Contributor Author

I used this script to generate random CSV files:

import uuid
import sys

try:
    print('args = ' + str(sys.argv))
    filename = sys.argv[1]
    cols = int(sys.argv[2])
    rows = int(sys.argv[3])
    if len(sys.argv) != 4 or cols <= 0 or rows <= 0:
        raise RuntimeError()
except Exception as e:
    raise RuntimeError('Usage: gen_text_file.py <filename> <cols> <rows>')

# Each uuid4().hex string yields eight 4-character chunks, so generate enough
# of them to cover all columns (integer division, so this also works on Python 3).
rand_to_gen = (cols + 7) // 8


with open(filename, 'w') as f:
    f.write(','.join('col%d' % i for i in range(cols)))
    f.write('\n')
    for i in range(rows):
        if (i % 10000 == 0):
            print('wrote %d lines' % i)
        # Split each random 32-character hex string into eight 4-character chunks.
        hexes = [uuid.uuid4().hex for _ in range(rand_to_gen)]
        rands = [x[4 * j:4 * j + 4] for x in hexes for j in range(8)]
        f.write(','.join(rands[:cols]))
        f.write('\n')

I generated files that were all the same size on disk with different dimensions (cols x rows):
10x18M
20x9M
30x6M
60x3M
150x1200K
300x600K

Here's what I tried to do to them:

>>> spark.read.csv(text_file).write.mode('overwrite').parquet(parquet_path)

The 10-, 20-, and 30-column files all took between 40s and 1m to complete on 2 cores of my laptop. 60 columns and up never completed, and actually crashed the Java process; I had to kill it with kill -9.

At one point for the 60-column table, I got a "GC overhead limit exceeded" OOM from the parquet writer (the error suggested that parquet was doing something silly trying to use dictionary encoding for random values, but I haven't figured out how to turn that off). I could be conflating this crash with one we encountered a few months ago, where Spark crashed because Catalyst generated bytecode larger than 64k for dataframes with a large schema.
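
A hedged aside, not part of the original thread: Parquet exposes a standard configuration key for dictionary encoding, so one way to try turning it off is via the Hadoop configuration used by the writer. The snippet below is a sketch only and was not tested against this OOM; spark, df, and parquetPath are assumed to be in scope.

// Sketch only (not from this PR): disable Parquet dictionary encoding for writes
// by setting Parquet's "parquet.enable.dictionary" key before writing.
// Assumes a SparkSession `spark`, a DataFrame `df`, and an output path `parquetPath`.
spark.sparkContext.hadoopConfiguration.set("parquet.enable.dictionary", "false")
df.write.mode("overwrite").parquet(parquetPath)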

@tpoterba
Contributor Author

Others on my team suggest that the >64k bytecode issue has been fixed already (and ported to a 2.1 maintenance release as well)

@SparkQA

SparkQA commented May 17, 2017

Test build #77018 has finished for PR 18005 at commit 72c5487.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

LGTM - merging to master/2.2. Thanks!

asfgit pushed a commit that referenced this pull request May 19, 2017
[SPARK-20773][SQL] ParquetWriteSupport.writeFields is quadratic in number of fields

Fix quadratic List indexing in ParquetWriteSupport.

I noticed this function while profiling some code today. It showed up as a significant factor in a table with twenty columns; with hundreds of columns, it could dominate every other function call.

## What changes were proposed in this pull request?

The writeFields method iterates from 0 until the number of fields, indexing into rootFieldWriters for each element. rootFieldWriters is a List, so each lookup is a linear operation, and the complexity of the writeFields method is thus quadratic in the number of fields.

Solution: explicitly convert rootFieldWriters to Array (implicitly converted to WrappedArray) for constant-time indexing.

## How was this patch tested?

This is a one-line change for performance reasons.

Author: tpoterba <tpoterba@broadinstitute.org>
Author: Tim Poterba <tpoterba@gmail.com>

Closes #18005 from tpoterba/tpoterba-patch-1.

(cherry picked from commit 3f2cd51)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
@asfgit closed this in 3f2cd51 May 19, 2017
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
[SPARK-20773][SQL] ParquetWriteSupport.writeFields is quadratic in number of fields
