
Rework Spark integration similar to presto #35

Closed
vinothchandar opened this issue Jan 6, 2017 · 4 comments
@vinothchandar

Approach used in prestodb/presto#7002

Actual changes will be on Apache Spark; this issue is just for tracking.

@vinothchandar vinothchandar self-assigned this Jan 6, 2017
@vinothchandar

After digging deep into the Spark 2.x code line, it seems like the changes ought to be in

  • CatalogFileIndex.scala

Need more clarity.

@vinothchandar

vinothchandar commented Jan 24, 2017

There are broadly three approaches we can take here.

Approach 1 : Setting Path filters

This only works on Spark 2.x; we have to do something like the following.

```scala
spark.sparkContext.hadoopConfiguration.setClass(
  "mapreduce.input.pathFilter.class",
  classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
  classOf[org.apache.hadoop.fs.PathFilter])
```

Tested basic counts, and a three-way join. Seems to work.
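To illustrate the idea behind the path filter, here is a minimal, self-contained sketch of the filtering a read-optimized view needs: for each file group, expose only the file written by the latest commit. The object name, `accept` method, and the `<fileId>_<commitTime>.parquet` naming convention are all simplifications assumed for illustration, not the real HoodieROTablePathFilter implementation.

```scala
// Hypothetical sketch of read-optimized file filtering: keep only the
// latest-commit file per file group. Naming convention assumed for
// illustration: <fileId>_<commitTime>.parquet
object LatestFileFilter {
  private val Pattern = """(.+)_(\d+)\.parquet""".r

  // Given all data-file names in a partition, return the ones a
  // read-optimized query should actually read.
  def accept(fileNames: Seq[String]): Set[String] =
    fileNames
      .collect { case name @ Pattern(fileId, commit) => (fileId, commit.toLong, name) }
      .groupBy(_._1)          // group files by file group (fileId)
      .values
      .map(_.maxBy(_._2)._3)  // pick the file with the latest commit time
      .toSet
}
```

Setting the `mapreduce.input.pathFilter.class` property, as above, is just a way to get Spark's file listing to apply this kind of filtering without any Spark code changes.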

Approach 2 : Make Spark also work with @useFileSplitsFromInputFormat

This is doable, and would probably ensure a smoother path forward. To do this, we need to change PartitionAwareFileIndex to call ipf.getSplits based on the annotation. Doable, but it needs some munging of interfaces, and is better done with direct feedback from the Spark community:

https://issues.apache.org/jira/browse/SPARK-19351
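The shape of that change can be sketched as an opt-in delegation inside the file index. Everything here is hypothetical: `UsesOwnFileSplits` and `SplitPlanner` are made-up names, and a marker trait stands in for the @useFileSplitsFromInputFormat annotation just to keep the sketch self-contained and runnable.

```scala
// Hypothetical opt-in: input formats carrying this trait want their own
// split planning consulted, instead of Spark's raw directory listing.
// (In the actual proposal the opt-in would be an annotation.)
trait UsesOwnFileSplits {
  def getSplits(rootPath: String): Seq[String] // paths to actually read
}

object SplitPlanner {
  // Sketch of the delegation a PartitionAwareFileIndex-style class would do:
  // use the format's own splits when it opts in, else the plain listing.
  def planFiles(format: AnyRef, rootPath: String, listedFiles: Seq[String]): Seq[String] =
    format match {
      case f: UsesOwnFileSplits => f.getSplits(rootPath)
      case _                    => listedFiles
    }
}
```

The appeal of this approach is that the filtering logic stays inside the input format, and Spark only needs a small, generic hook.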

@vinothchandar

For now, will get Approach 1 ready, so we have one more option in the bag. Will add unit tests around joins etc.

@prazanna how is our path towards Spark 2.0 looking?

@vinothchandar

Closing this. The current path-filter-based approach is verified on the Spark ticket.
