[SPARK-35783][SQL] Set the list of read columns in the task configuration to reduce reading of ORC data

### What changes were proposed in this pull request?
Set the list of read columns in the task configuration to reduce reading of ORC data.
### Why are the changes needed?
Currently, the ORC reader reads all columns of an ORC table when the task configuration does not set the list of read columns. Therefore, we should set the list of read columns in the task configuration to reduce the amount of ORC data read.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests.

Closes #32923 from weixiuli/SPARK-35783.

Authored-by: weixiuli <weixiuli@jd.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
weixiuli authored and dongjoon-hyun committed Jun 17, 2021
1 parent 94bdbec commit 947c7ea
Showing 1 changed file with 2 additions and 0 deletions.
@@ -215,6 +215,8 @@ class OrcFileFormat
"[BUG] requested column IDs do not match required schema")
val taskConf = new Configuration(conf)

val includeColumns = requestedColIds.filter(_ != -1).sorted.mkString(",")
taskConf.set(OrcConf.INCLUDE_COLUMNS.getAttribute, includeColumns)
val fileSplit = new FileSplit(filePath, file.start, file.length, Array.empty)
val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
val taskAttemptContext = new TaskAttemptContextImpl(taskConf, attemptId)
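
The two added lines (`val includeColumns = ...` and `taskConf.set(...)`) can be tried in isolation. The following is a minimal sketch, not the patch itself. It uses a plain mutable map as a stand-in for Hadoop's `Configuration`. It assumes that `-1` marks a requested column that is absent from the file, and that `OrcConf.INCLUDE_COLUMNS.getAttribute` resolves to the key `orc.include.columns`. The object name, helper name, and sample IDs are hypothetical.

```scala
import scala.collection.mutable

object IncludeColumnsSketch {
  // Assumed value of OrcConf.INCLUDE_COLUMNS.getAttribute; used here only as a map key.
  val IncludeColumnsKey: String = "orc.include.columns"

  /** Builds the comma-separated column ID list the same way the patch does:
    * drop the -1 placeholders, sort ascending, join with commas.
    */
  def includeColumns(requestedColIds: Seq[Int]): String =
    requestedColIds.filter(_ != -1).sorted.mkString(",")

  def main(args: Array[String]): Unit = {
    // Hypothetical request: columns 3 and 0 exist in the file, one column does not (-1).
    val requestedColIds = Seq(3, -1, 0)

    // Stand-in for `new Configuration(conf)`; the real code calls taskConf.set(key, value).
    val taskConf = mutable.Map.empty[String, String]
    taskConf(IncludeColumnsKey) = includeColumns(requestedColIds)

    println(taskConf(IncludeColumnsKey)) // prints 0,3
  }
}
```

Note that when every requested ID is `-1`, the resulting string is empty; how the reader treats an empty include list is outside the scope of this sketch.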