
[SPARK-45597][PYTHON][SQL] Support creating table using a Python data source in SQL (single wrapper) #44233

Conversation

@HyukjinKwon (Member) commented Dec 7, 2023

What changes were proposed in this pull request?

This PR is an alternative approach to #43784. It proposes to support using a Python Data Source with SQL (in favour of #43949), SparkR, and all other existing combinations by wrapping the Python Data Source in the DSv2 interface (while still using the V1Table interface underneath).

The approach: a single Python Data Source wrapper implements the DSv2 interface and looks up registered Python Data Sources by their short name.

Self-contained working example:

from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition

class TestDataSourceReader(DataSourceReader):
    def __init__(self, options):
        self.options = options
    def partitions(self):
        return [InputPartition(i) for i in range(3)]
    def read(self, partition):
        yield partition.value, str(partition.value)

class TestDataSource(DataSource):
    @classmethod
    def name(cls):
        return "test"
    def schema(self):
        return "x INT, y STRING"
    def reader(self, schema) -> "DataSourceReader":
        return TestDataSourceReader(self.options)

spark.dataSource.register(TestDataSource)
spark.sql("CREATE TABLE tblA USING test")
spark.sql("SELECT * FROM tblA").show()

results in:

+---+---+
|  x|  y|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
+---+---+
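The wrapper-driven read path boils down to the reader protocol in the example above: `partitions()` enumerates splits, then `read(partition)` yields tuples for each split. A minimal pure-Python sketch of that flow, using a stand-in `InputPartition` so it runs without a SparkSession (illustrative only, not the actual wrapper code):

```python
# Stand-in for pyspark.sql.datasource.InputPartition; illustration only.
class InputPartition:
    def __init__(self, value):
        self.value = value

class TestDataSourceReader:
    def __init__(self, options):
        self.options = options

    def partitions(self):
        # One split per partition index, as in the example above.
        return [InputPartition(i) for i in range(3)]

    def read(self, partition):
        # Yield (x, y) tuples for this split.
        yield partition.value, str(partition.value)

reader = TestDataSourceReader(options={})
rows = [row for part in reader.partitions() for row in reader.read(part)]
print(rows)  # [(0, '0'), (1, '1'), (2, '2')]
```

The rows match the `show()` output above once Spark applies the `x INT, y STRING` schema.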

There are limitations and follow-ups to address:

  1. Statically loading Python Data Sources is still not supported (SPARK-45917)

Why are the changes needed?

So that Python Data Sources can be used everywhere else as well, including SparkR and Scala.

Does this PR introduce any user-facing change?

Yes. Users can register their Python Data Sources and use them in SQL, SparkR, etc.

How was this patch tested?

Unit tests were added, and the change was tested manually.

Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43784

@HyukjinKwon (Member Author)

cc @allisonwang-db and @cloud-fan

@HyukjinKwon HyukjinKwon force-pushed the sql-register-pydatasource-nocodegen branch from 6558767 to ff2798a Compare December 7, 2023 07:58
@HyukjinKwon HyukjinKwon changed the title [SPARK-45597][PYTHON][SQL] Support creating table using a Python data source in SQL (single Python wrapper) [SPARK-45597][PYTHON][SQL] Support creating table using a Python data source in SQL (single wrapper) Dec 7, 2023
@allisonwang-db (Contributor) left a comment


The approach looks good!

private def getOrCreateSourceDataFrame(
    options: CaseInsensitiveStringMap, maybeSchema: Option[StructType]): DataFrame = {
  if (sourceDataFrame != null) return sourceDataFrame
  // TODO(SPARK-45600): should be session-based.
Contributor

This one should be fixed?

@HyukjinKwon (Member Author) Dec 7, 2023

For basic support, I think so. The thing is that we should take a look into session inheritance, test cases, etc., so I'm leaving this as a TODO for now.

// TODO(SPARK-45600): should be session-based.
val builder = SparkSession.active.sessionState.dataSourceManager.lookupDataSource(shortName)
val plan = builder(
  SparkSession.active,
Contributor

Does it get the correct session for Spark Connect?

Member Author

yes

/**
* Data Source V2 wrapper for Python Data Source.
*/
class PythonTableProvider(shortName: String) extends TableProvider {
Contributor

Should we support external metadata for this data source? I.e., users can create a table using a Python data source with a user-defined table schema.

Member Author

I believe it already does (?).
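If external metadata is indeed supported, a user-defined schema would be specified the usual DDL way. An illustrative sketch (`tblB` is a hypothetical table name; `test` is the registered data source from the example above):

```sql
-- Hypothetical: user-supplied schema for a table backed by the
-- registered Python data source "test".
CREATE TABLE tblB (x INT, y STRING) USING test
```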

@dongjoon-hyun (Member)

Just a question: which one is the final decision, this PR or #43784? Do we need to collect more opinions?

@HyukjinKwon (Member Author)

I was investigating the pros and cons. For now, this one is more likely the one, but I'm waiting for @cloud-fan's sign-off :-).

@HyukjinKwon marked this pull request as draft December 9, 2023
@HyukjinKwon (Member Author)

Update: we're discussing offline. I will make another POC PR, and we will write up the summary in the final PR description.
