[SPARK-20773][SQL] ParquetWriteSupport.writeFields is quadratic in number of fields #18005
Conversation
Fix quadratic List indexing in ParquetWriteSupport. The minimal solution is to convert rootFieldWriters to a WrappedArray, which has O(1) indexing and restores linear complexity.
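As a rough illustration, here is a hypothetical sketch of the access pattern involved (not the actual ParquetWriteSupport code):

```scala
// Hypothetical sketch of the writeFields access pattern. On a List,
// fieldWriters(i) walks i links, so the loop is O(n^2) overall; on an
// Array (or WrappedArray) each lookup is O(1) and the loop is O(n).
def writeFieldsSketch(fieldWriters: Seq[Any => Unit], row: Array[Any]): Unit = {
  var i = 0
  while (i < row.length) {
    fieldWriters(i)(row(i)) // O(i) per lookup if fieldWriters is a List
    i += 1
  }
}
```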
ok to test
Can you also make sure that we do not use a List?
Test build #76983 has finished for PR 18005 at commit.
It might be nice to explicitly use the type IndexedSeq.
Yeah, I can change that - I do hate the standard IndexedSeq implementation (Vector), though, and want to make sure that the collection is actually a WrappedArray. I've now gone beyond the original one-line change made through the GitHub UI, and will update with performance benchmarks plus a successful build.
```diff
@@ -90,7 +90,7 @@ private[parquet] class ParquetWriteSupport extends WriteSupport[InternalRow] wit
-    this.rootFieldWriters = schema.map(_.dataType).map(makeWriter).toArray
+    this.rootFieldWriters = schema.map(_.dataType).map(makeWriter).toArray[ValueWriter]
```
Either call toIndexedSeq or make the rootFieldWriters an Array. Both are fine.
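For illustration, a minimal sketch (generic Scala, not the PR's actual code) of the two options the reviewer mentions:

```scala
// Both suggested forms give (effectively) constant-time positional lookup,
// unlike List, whose apply(i) walks i links.
val writers: List[Int => String] = List(_ => "a", _ => "b", _ => "c")

val asIndexedSeq: IndexedSeq[Int => String] = writers.toIndexedSeq // Vector: effectively O(1) apply
val asArray: Array[Int => String] = writers.toArray                // JVM array: O(1) apply
```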
LGTM pending jenkins
Addressed comments. I tried to get some benchmark stats for this code:

```python
spark.read.csv(text_file).write.mode('overwrite').parquet(parquet_path)
```

I wanted to see the performance improvement for files with various numbers of columns/rows that were all 1.5G. However, I didn't see much of a difference with <30 columns, and Catalyst blew up when I tried ~50 columns (I wanted to go up to several hundred).
What do you mean by "Catalyst blew up"?
I used this script to generate random CSV files:

```python
import sys
import uuid

try:
    print('args = ' + str(sys.argv))
    filename = sys.argv[1]
    cols = int(sys.argv[2])
    rows = int(sys.argv[3])
    if len(sys.argv) != 4 or cols <= 0 or rows <= 0:
        raise RuntimeError()
except Exception:
    raise RuntimeError('Usage: gen_text_file.py <filename> <cols> <rows>')

# Each uuid4 hex string is 32 characters, i.e. eight 4-character slices,
# so generate enough UUIDs per row to cover the requested column count.
rand_to_gen = (cols + 7) // 8  # integer division so range() below gets an int

with open(filename, 'w') as f:
    # header row: col0,col1,...
    f.write(','.join('col%d' % i for i in range(cols)))
    f.write('\n')
    for i in range(rows):
        if i % 10000 == 0:
            print('wrote %d lines' % i)
        # slice each random hex string into 4-character cells
        rands = [x[j:j + 4]
                 for j in range(8)
                 for x in [uuid.uuid4().hex for _ in range(rand_to_gen)]]
        f.write(','.join(rands[:cols]))
        f.write('\n')
```

I generated files that were all the same size on disk with different dimensions (cols x rows). Here's what I tried to do to them:

```python
>>> spark.read.csv(text_file).write.mode('overwrite').parquet(parquet_path)
```

The 10, 20, and 30-column files all took between 40s and 1m to complete on 2 cores of my laptop. 60 and up never completed, and actually crashed the Java process -- I had to kill it manually.

At one point for the 60-column table, I got a "GC overhead limit exceeded" OOM from the Parquet writer (the error suggested that Parquet was doing something silly trying to use dictionary encoding for random values, but I haven't figured out how to turn that off). I could be conflating this crash with one we encountered a few months ago, where Spark crashed because Catalyst generated bytecode larger than 64k for dataframes with a large schema.
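For rough timing of the same round-trip from the Scala spark-shell, something like the following would give comparable numbers (a sketch only; the paths are placeholders, not from the PR):

```scala
// spark-shell sketch; `spark` is the shell's SparkSession.
// textFile and parquetPath are placeholder paths.
val textFile = "/tmp/gen_100x500000.csv"
val parquetPath = "/tmp/bench.parquet"

val start = System.nanoTime()
spark.read.csv(textFile).write.mode("overwrite").parquet(parquetPath)
println(s"write took ${(System.nanoTime() - start) / 1e9} s")
```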
Others on my team suggest that the >64k bytecode issue has been fixed already (and ported to a 2.1 maintenance release as well).
Test build #77018 has finished for PR 18005 at commit.
LGTM - merging to master/2.2. Thanks!
[SPARK-20773][SQL] ParquetWriteSupport.writeFields is quadratic in number of fields

Closes #18005 from tpoterba/tpoterba-patch-1.
(cherry picked from commit 3f2cd51)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
Fix quadratic List indexing in ParquetWriteSupport.
I noticed this function while profiling some code today. It showed up as a significant factor in a table with twenty columns; with hundreds of columns, it could dominate any other function call.
What changes were proposed in this pull request?
The writeFields method iterates from 0 until the number of fields, indexing into rootFieldWriters for each element. rootFieldWriters is a List, so each lookup is a linear operation; the complexity of the writeFields method is thus quadratic in the number of fields.
Solution: explicitly convert rootFieldWriters to Array (implicitly converted to WrappedArray) for constant-time indexing.
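To illustrate the implicit wrapping mentioned above (a generic Scala 2.x example, not the Spark code itself):

```scala
// Assigning an Array where a Seq is expected wraps it in a WrappedArray
// via an implicit conversion in Predef; indexing delegates to the
// underlying JVM array, so it stays O(1).
val arr: Array[String] = Array("a", "b", "c")
val seq: Seq[String] = arr        // wrapped as scala.collection.mutable.WrappedArray
println(seq.getClass.getName)     // e.g. scala.collection.mutable.WrappedArray$ofRef
println(seq(2))                   // constant-time lookup: prints "c"
```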
How was this patch tested?
This is a one-line change for performance reasons.