[SPARK-4386] Improve performance when writing Parquet files
Convert type of RowWriteSupport.attributes to Array.

Profiling writes of very wide tables shows that most of the time is spent in the apply method on the attributes var. The type of attributes was previously a LinearSeqOptimized, whose apply is O(N), which made writes O(N squared) in the number of columns.

Measurements on a 575-column table showed that this change gave a 6x improvement in write times.
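The cost argument above can be illustrated with a small standalone sketch. It is not the Spark code: `IndexedAccessSketch`, `sumByIndex`, `linearSeqSteps`, `timeMs`, and the row count are hypothetical names and values chosen for illustration, and a plain `List` stands in for the linear sequence that `attributes` used to be. It visits every column by position, once through a `List` and once through an `Array`, and counts how many nodes a linked list has to skip to serve those lookups.

```scala
object IndexedAccessSketch {
  // Sum n values fetched by position, mimicking a per-column write loop.
  def sumByIndex(n: Int, at: Int => Int): Long = {
    var total = 0L
    var i = 0
    while (i < n) {
      total += at(i)
      i += 1
    }
    total
  }

  // Nodes a linked list skips to serve apply(0) through apply(n - 1):
  // 0 + 1 + ... + (n - 1) = n(n - 1) / 2.
  def linearSeqSteps(n: Int): Long = n.toLong * (n - 1) / 2

  // Rough wall-clock time of a block in milliseconds (no JIT warm-up).
  def timeMs(body: => Unit): Double = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1e6
  }

  def main(args: Array[String]): Unit = {
    val columns = 575 // table width reported in the commit message
    val rows = 10000  // hypothetical row count, for illustration only

    val asList: Seq[Int] = List.range(0, columns) // apply(i) walks i nodes
    val asArray: Array[Int] = asList.toArray      // apply(i) is constant time

    var sink = 0L // keeps the loops from being optimized away
    val listMs = timeMs {
      var r = 0
      while (r < rows) { sink += sumByIndex(columns, i => asList(i)); r += 1 }
    }
    val arrayMs = timeMs {
      var r = 0
      while (r < rows) { sink += sumByIndex(columns, i => asArray(i)); r += 1 }
    }

    println(s"list steps per row:  ${linearSeqSteps(columns)}")
    println(s"array steps per row: $columns")
    println(f"list:  $listMs%.1f ms")
    println(f"array: $arrayMs%.1f ms (checksum $sink)")
  }
}
```

For 575 columns the list version skips 575 * 574 / 2 = 165025 nodes per row, against 575 direct lookups for the array. The timings depend on the machine and JVM and will not necessarily reproduce the 6x figure, since a real write does more work per column than this loop.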
MickDavies committed Dec 30, 2014
1 parent 0e532cc commit 892519d
Showing 1 changed file with 2 additions and 2 deletions.
@@ -130,15 +130,15 @@ private[parquet] object RowReadSupport {
 private[parquet] class RowWriteSupport extends WriteSupport[Row] with Logging {

   private[parquet] var writer: RecordConsumer = null
-  private[parquet] var attributes: Seq[Attribute] = null
+  private[parquet] var attributes: Array[Attribute] = null

   override def init(configuration: Configuration): WriteSupport.WriteContext = {
     val origAttributesStr: String = configuration.get(RowWriteSupport.SPARK_ROW_SCHEMA)
     val metadata = new JHashMap[String, String]()
     metadata.put(RowReadSupport.SPARK_METADATA_KEY, origAttributesStr)

     if (attributes == null) {
-      attributes = ParquetTypesConverter.convertFromString(origAttributesStr)
+      attributes = ParquetTypesConverter.convertFromString(origAttributesStr).toArray
     }

     log.debug(s"write support initialized for requested schema $attributes")