
[SPARK-45597][PYTHON][SQL] Support creating table using a Python data source in SQL (single wrapper) #44233

Conversation

@HyukjinKwon (Member) commented Dec 7, 2023

What changes were proposed in this pull request?

This PR is an alternative approach to #43784. It proposes to support using a Python Data Source with SQL (in favour of #43949), SparkR, and all other existing combinations by wrapping the Python Data Source in the DSv2 interface (while still using the V1Table interface underneath).

The approach: a single Python Data Source wrapper implements the DSv2 interface and looks up registered Python Data Sources by their short name.

Self-contained working example:

from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition

class TestDataSourceReader(DataSourceReader):
    def __init__(self, options):
        self.options = options
    def partitions(self):
        return [InputPartition(i) for i in range(3)]
    def read(self, partition):
        yield partition.value, str(partition.value)

class TestDataSource(DataSource):
    @classmethod
    def name(cls):
        return "test"
    def schema(self):
        return "x INT, y STRING"
    def reader(self, schema) -> "DataSourceReader":
        return TestDataSourceReader(self.options)

spark.dataSource.register(TestDataSource)
spark.sql("CREATE TABLE tblA USING test")
spark.sql("SELECT * FROM tblA").show()

results in:

+---+---+
|  x|  y|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
+---+---+
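The wrapper-driven read path boils down to the reader protocol in the example above: `partitions()` enumerates splits, then `read(partition)` yields tuples for each split. A minimal pure-Python sketch of that flow, using a stand-in `InputPartition` so it runs without a SparkSession (illustrative only, not the actual wrapper code):

```python
# Stand-in for pyspark.sql.datasource.InputPartition; illustration only.
class InputPartition:
    def __init__(self, value):
        self.value = value

class TestDataSourceReader:
    def __init__(self, options):
        self.options = options

    def partitions(self):
        # One split per partition index, as in the example above.
        return [InputPartition(i) for i in range(3)]

    def read(self, partition):
        # Yield (x, y) tuples for this split.
        yield partition.value, str(partition.value)

reader = TestDataSourceReader(options={})
rows = [row for part in reader.partitions() for row in reader.read(part)]
print(rows)  # [(0, '0'), (1, '1'), (2, '2')]
```

The rows match the `show()` output above once Spark applies the `x INT, y STRING` schema.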

There are limitations and follow-ups to address:

  1. Statically loading Python Data Sources is still not supported (SPARK-45917)

Why are the changes needed?

So that Python Data Sources can be used everywhere else as well, including SparkR and Scala.

Does this PR introduce any user-facing change?

Yes. Users can register their Python Data Sources and use them in SQL, SparkR, etc.

How was this patch tested?

Unit tests were added, and the change was tested manually.

Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43784

@HyukjinKwon (Member Author)

cc @allisonwang-db and @cloud-fan

@HyukjinKwon HyukjinKwon force-pushed the sql-register-pydatasource-nocodegen branch from 6558767 to ff2798a Compare December 7, 2023 07:58
@HyukjinKwon HyukjinKwon changed the title [SPARK-45597][PYTHON][SQL] Support creating table using a Python data source in SQL (single Python wrapper) [SPARK-45597][PYTHON][SQL] Support creating table using a Python data source in SQL (single wrapper) Dec 7, 2023
@allisonwang-db (Contributor) left a comment


The approach looks good!

private def getOrCreateSourceDataFrame(
    options: CaseInsensitiveStringMap, maybeSchema: Option[StructType]): DataFrame = {
  if (sourceDataFrame != null) return sourceDataFrame
  // TODO(SPARK-45600): should be session-based.
Contributor

This one should be fixed?

@HyukjinKwon (Member Author) Dec 7, 2023

For basic support, I think so. The thing is that we should take a look into session inheritance, test cases, etc., so I'm leaving this as a TODO for now.

// TODO(SPARK-45600): should be session-based.
val builder = SparkSession.active.sessionState.dataSourceManager.lookupDataSource(shortName)
val plan = builder(
  SparkSession.active,
Contributor

Does it get the correct session for Spark Connect?

Member Author

yes

/**
* Data Source V2 wrapper for Python Data Source.
*/
class PythonTableProvider(shortName: String) extends TableProvider {
Contributor

Should we support external metadata for this data source? I.e., users can create a table using a Python data source with a user-defined table schema.

Member Author

I believe it already does (?).
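If external metadata is indeed supported, a user-defined schema would be specified the usual DDL way. An illustrative sketch (`tblB` is a hypothetical table name; `test` is the registered data source from the example above):

```sql
-- Hypothetical: user-supplied schema for a table backed by the
-- registered Python data source "test".
CREATE TABLE tblB (x INT, y STRING) USING test
```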

@dongjoon-hyun (Member)

Just a question: which one is the final decision, this PR or #43784? Do we need to collect more opinions?

@HyukjinKwon (Member Author)

I was investigating the pros and cons. For now, this one is more likely the one, but I'm waiting for @cloud-fan's sign-off :-).

@HyukjinKwon marked this pull request as draft December 9, 2023
@HyukjinKwon (Member Author)

Update: we're discussing offline. I will make another POC PR, and we will write up the summary in the final PR description.
