[SPARK-3247][SQL] An API for adding data sources to Spark SQL #2475
Conversation
QA tests have started for PR 2475 at commit
QA tests have finished for PR 2475 at commit
QA tests have started for PR 2475 at commit
QA tests have finished for PR 2475 at commit
Test FAILed.
@@ -209,6 +209,7 @@ object Catalyst {

object SQL {
  lazy val settings = Seq(
    libraryDependencies += "org.apache.avro" % "avro" % "1.7.7",
we should check whether this conflicts with anything we already have ...
This is only here as an illustration. This library will likely be its own package, and I'll remove this before merging.
Moving some of the existing data sources (Parquet or JSON) over to this API would be a great way to test its design.
From a user's perspective, mixing all kinds of data sources is cool, but correspond to ...
QA tests have started for PR 2475 at commit
QA tests have finished for PR 2475 at commit
QA tests have started for PR 2475 at commit
QA tests have finished for PR 2475 at commit
@@ -254,6 +254,8 @@ private[sql] abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
  def numPartitions = self.numPartitions

  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case l @ foreign.LogicalRelation(t: foreign.TableScan) =>
      ExistingRdd(l.output, t.buildScan()) :: Nil
Will we add rules for `PrunedScan` and `FilteredScan` in this PR?
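For context, a rough sketch of what the scan interfaces under discussion might look like. Names and signatures are illustrative, not the exact code from this PR; each `Scan` variant gives the planner a progressively richer optimization hook:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.catalyst.types.StructType

// A relation describes its schema; subclasses produce the actual rows.
abstract class BaseRelation {
  def sqlContext: SQLContext
  def schema: StructType
}

// Full scan: no optimizations, every column of every row is produced.
abstract class TableScan extends BaseRelation {
  def buildScan(): RDD[Row]
}

// Column pruning: only the required columns need to be produced.
abstract class PrunedScan extends BaseRelation {
  def buildScan(requiredColumns: Array[String]): RDD[Row]
}

// Column pruning plus filter push-down.
abstract class FilteredScan extends BaseRelation {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}
```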
This PR is a follow-up of #2590, and tries to introduce a top-level SQL parser entry point for all SQL dialects supported by Spark SQL. A top-level parser `SparkSQLParser` is introduced to handle the syntaxes that all SQL dialects should recognize (e.g. `CACHE TABLE`, `UNCACHE TABLE` and `SET`, etc.). For all the syntaxes this parser doesn't recognize directly, it falls back to a specified function that tries to parse arbitrary input to a `LogicalPlan`. This function is typically another parser combinator like `SqlParser`. DDL syntaxes introduced in #2475 can be moved here. The `ExtendedHiveQlParser` now only handles Hive-specific extensions. Also took the chance to refactor/reformat `SqlParser` for better readability.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2698 from liancheng/gen-sql-parser and squashes the following commits:

ceada76 [Cheng Lian] Minor styling fixes
9738934 [Cheng Lian] Minor refactoring, removes optional trailing ";" in the parser
bb2ab12 [Cheng Lian] SET property value can be empty string
ce8860b [Cheng Lian] Passes test suites
e86968e [Cheng Lian] Removes debugging code
8bcace5 [Cheng Lian] Replaces digit.+ with rep1(digit) (Scala style checking doesn't like it)
d15d54f [Cheng Lian] Unifies SQL and HiveQL parsers
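As a rough illustration of the fallback design described above. The class name, keyword check, and helper are hypothetical stand-ins, not the actual `SparkSQLParser` implementation (which uses parser combinators):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical sketch: the top-level parser recognizes dialect-agnostic
// commands itself and delegates everything else to the dialect parser.
class TopLevelSqlParser(fallback: String => LogicalPlan) {
  // Commands every dialect should recognize.
  private val commonCommands = Seq("CACHE TABLE", "UNCACHE TABLE", "SET")

  def parse(input: String): LogicalPlan = {
    val normalized = input.trim.toUpperCase
    if (commonCommands.exists(normalized.startsWith)) {
      parseCommonCommand(input) // handled by the shared grammar
    } else {
      fallback(input) // e.g. SqlParser or the HiveQL parser
    }
  }

  private def parseCommonCommand(input: String): LogicalPlan =
    sys.error("sketch only: shared-command grammar elided")
}
```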
Test build #22509 has finished for PR 2475 at commit
Test build #22641 has finished for PR 2475 at commit
Okay, thanks for all the comments. I've done the following:
Test build #22646 has finished for PR 2475 at commit
Test build #22653 has finished for PR 2475 at commit
package org.apache.spark.sql.sources

abstract sealed class Filter
If you make it sealed, will this lead to compatibility issues across Spark versions (if someone has code compiled when there was only one subclass of this, but they link to some version of Spark with 2 of them)? At one point I thought sealed classes got their own numeric ID to avoid isInstanceOf checks, but I'm not sure that really happens. @heathermiller do you know how this works?
Removed `sealed` to be safe.
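To illustrate the concern with a hypothetical example: user code compiled against a sealed hierarchy can break when a later release adds subclasses:

```scala
// Compiled when EqualTo was the only Filter subclass; the compiler
// accepted this match as exhaustive because Filter was sealed.
def describe(f: Filter): String = f match {
  case EqualTo(attribute, value) => s"$attribute = $value"
}

// Linked against a newer Spark that adds GreaterThan etc., this
// already-compiled match throws scala.MatchError at runtime instead
// of failing at compile time.
```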
The new API for sources looks good to me, thanks for making the changes. It will be easy to plug in a lot of neat data sources here.
abstract sealed class Filter

case class EqualTo(attribute: String, value: Any) extends Filter
+1 on this change to decouple it from the expression hierarchy. Presumably we would want > and < as well?
Yeah, I wanted to sketch it out first to make sure everyone was on board. I've added more types now.
I like the new API a lot. Java support. Long-term binary compatibility!
Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala
	sql/core/src/main/scala/org/apache/spark/sql/package.scala
Test build #22711 has finished for PR 2475 at commit
Test build #22713 has finished for PR 2475 at commit
Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
Test build #22731 has finished for PR 2475 at commit
LGTM. You'd want to update the PR description before merging it.
Jenkins, retest this please.
Test build #22765 has finished for PR 2475 at commit
case class GreaterThan(attribute: String, value: Any) extends Filter
case class GreaterThanOrEqual(attribute: String, value: Any) extends Filter
case class LessThan(attribute: String, value: Any) extends Filter
case class LessThanOrEqual(attribute: String, value: Any) extends Filter
Why do we only support pushing down filters that compare an attribute with a literal? It looks to me that filters over two attributes are also pushable, like 'a < 'b.
cc @liancheng
This was only intended to be an initial set of things that we could push down. We can always add more here.
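As a hedged sketch of how a source might consume these pushed-down filters in a filtered scan. The translation logic and the `scanWithPredicates` helper are hypothetical, not from this PR:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Hypothetical helper standing in for the source's native reader.
def scanWithPredicates(columns: Array[String], predicates: Seq[String]): RDD[Row] = ???

// Illustrative only: translate the filters this source understands into
// a native predicate form and ignore the rest (Spark can still
// re-apply all filters to the rows the source returns, so dropping
// a filter here is safe, just less efficient).
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
  val pushed = filters.collect {
    case EqualTo(attr, v)         => s"$attr = $v"
    case GreaterThan(attr, v)     => s"$attr > $v"
    case LessThanOrEqual(attr, v) => s"$attr <= $v"
  }
  scanWithPredicates(requiredColumns, pushed)
}
```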
This PR introduces a new set of APIs to Spark SQL to allow other developers to add support for reading data from new sources in `org.apache.spark.sql.sources`.

New sources must implement the interface `BaseRelation`, which is responsible for describing the schema of the data. `BaseRelation`s have three `Scan` subclasses, which are responsible for producing an RDD containing row objects. The various `Scan` interfaces allow for optimizations such as column pruning and filter push-down when the underlying data source can handle these operations.

By implementing a class that inherits from `RelationProvider`, these data sources can be accessed using pure SQL. I've used this functionality to update the JSON support so it can now be used in this way, as follows:
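The original example is not preserved in this extract; a sketch of what such usage looks like with the `CREATE TEMPORARY TABLE ... USING ... OPTIONS (...)` DDL, where the table name and path are placeholders and a `SQLContext` named `sqlContext` is assumed to be in scope:

```scala
// Register a JSON file as a table via pure SQL, then query it.
sqlContext.sql(
  """
    |CREATE TEMPORARY TABLE jsonTable
    |USING org.apache.spark.sql.json
    |OPTIONS (
    |  path '/path/to/data.json'
    |)
  """.stripMargin)

sqlContext.sql("SELECT * FROM jsonTable").collect().foreach(println)
```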
Further example usage can be found in the test cases: https://github.com/marmbrus/spark/tree/foreign/sql/core/src/test/scala/org/apache/spark/sql/sources
There is also a library that uses this new API to read Avro data, available here:
https://github.com/marmbrus/sql-avro