
Rework Spark integration similar to presto #35

Closed
vinothchandar opened this issue Jan 6, 2017 · 4 comments
@vinothchandar

Approach used in prestodb/presto#7002

Actual changes will be on Apache Spark; this issue is just for tracking.

@vinothchandar vinothchandar self-assigned this Jan 6, 2017
@vinothchandar

After digging deep into the Spark 2.x code line, it seems like the changes ought to be in

  • CatalogFileIndex.scala

Need more clarity.

@vinothchandar

vinothchandar commented Jan 24, 2017

There are broadly three approaches we can take here.

Approach 1 : Setting Path filters

This only works on Spark 2.x; we have to do something like the following.

```scala
spark.sparkContext.hadoopConfiguration.setClass(
  "mapreduce.input.pathFilter.class",
  classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
  classOf[org.apache.hadoop.fs.PathFilter])
```

Tested basic counts, and a three-way join. Seems to work.
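To illustrate the idea behind the path filter, here is a minimal, self-contained sketch of the filtering a read-optimized view needs: for each file group, expose only the file written by the latest commit. The object name, `accept` method, and the `<fileId>_<commitTime>.parquet` naming convention are all simplifications assumed for illustration, not the real HoodieROTablePathFilter implementation.

```scala
// Hypothetical sketch of read-optimized file filtering: keep only the
// latest-commit file per file group. Naming convention assumed for
// illustration: <fileId>_<commitTime>.parquet
object LatestFileFilter {
  private val Pattern = """(.+)_(\d+)\.parquet""".r

  // Given all data-file names in a partition, return the ones a
  // read-optimized query should actually read.
  def accept(fileNames: Seq[String]): Set[String] =
    fileNames
      .collect { case name @ Pattern(fileId, commit) => (fileId, commit.toLong, name) }
      .groupBy(_._1)          // group files by file group (fileId)
      .values
      .map(_.maxBy(_._2)._3)  // pick the file with the latest commit time
      .toSet
}
```

Setting the `mapreduce.input.pathFilter.class` property, as above, is just a way to get Spark's file listing to apply this kind of filtering without any Spark code changes.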

Approach 2 : Make Spark also work with @useFileSplitsFromInputFormat

This is doable, and would probably ensure a smoother path forward. To do this, we need to change PartitionAwareFileIndex to call ipf.getSplits based on the annotation. Doable, but it needs some munging of interfaces, and is better done with direct feedback from the Spark community:

https://issues.apache.org/jira/browse/SPARK-19351
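The shape of that change can be sketched as an opt-in delegation inside the file index. Everything here is hypothetical: `UsesOwnFileSplits` and `SplitPlanner` are made-up names, and a marker trait stands in for the @useFileSplitsFromInputFormat annotation just to keep the sketch self-contained and runnable.

```scala
// Hypothetical opt-in: input formats carrying this trait want their own
// split planning consulted, instead of Spark's raw directory listing.
// (In the actual proposal the opt-in would be an annotation.)
trait UsesOwnFileSplits {
  def getSplits(rootPath: String): Seq[String] // paths to actually read
}

object SplitPlanner {
  // Sketch of the delegation a PartitionAwareFileIndex-style class would do:
  // use the format's own splits when it opts in, else the plain listing.
  def planFiles(format: AnyRef, rootPath: String, listedFiles: Seq[String]): Seq[String] =
    format match {
      case f: UsesOwnFileSplits => f.getSplits(rootPath)
      case _                    => listedFiles
    }
}
```

The appeal of this approach is that the filtering logic stays inside the input format, and Spark only needs a small, generic hook.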

@vinothchandar

For now, will get Approach 1 ready, so we have one more option in the bag. Will add unit tests around joins etc.

@prazanna how is our path towards Spark 2.0 looking?

@vinothchandar

Closing this. The current path-filter-based approach is verified on the Spark ticket.
