[SPARK-45597][PYTHON][SQL] Support creating table using a Python data source in SQL (single wrapper) #44233
Conversation
cc @allisonwang-db and @cloud-fan
Force-pushed from 6558767 to ff2798a.
The approach looks good!
private def getOrCreateSourceDataFrame(
    options: CaseInsensitiveStringMap, maybeSchema: Option[StructType]): DataFrame = {
  if (sourceDataFrame != null) return sourceDataFrame
  // TODO(SPARK-45600): should be session-based.
This one should be fixed?
For basic support, I think so. The thing is that we should take a look at session inheritance, test cases, etc., so I'm leaving this as a TODO for now.
// TODO(SPARK-45600): should be session-based.
val builder = SparkSession.active.sessionState.dataSourceManager.lookupDataSource(shortName)
val plan = builder(
  SparkSession.active,
Does it get the correct session for spark connect?
yes
/**
 * Data Source V2 wrapper for Python Data Source.
 */
class PythonTableProvider(shortName: String) extends TableProvider {
Should we support external metadata for this data source? I.e., users can create a table using a Python data source with a user-defined table schema.
I believe it already does (?).
Force-pushed from 7a70593 to ea895b9.
Just a question: which one is the final decision, this PR or #43784? Do we need to collect more opinions?
I was investigating the pros and cons. For now, this one is more likely it, but I'm waiting for @cloud-fan's sign-off :-).
Update: we're discussing offline. I will make another POC PR, and we will write up the summary in the final PR description.
What changes were proposed in this pull request?
This PR is an alternative approach to #43784. It proposes to support using a Python Data Source from SQL (in favour of #43949), SparkR, and all other existing combinations by wrapping the Python Data Source in the DSv2 interface (while still using the V1Table interface). The approach is: a single Python Data Source wrapper looks up registered Python data sources by their short name.
Self-contained working example:
results in:
There are limitations and follow-ups to address:
Why are the changes needed?
So that the Python Data Source can be used everywhere else, including SparkR and Scala.
Does this PR introduce any user-facing change?
Yes. Users can register their Python Data Sources and use them in SQL, SparkR, etc.
How was this patch tested?
Unit tests were added, and the change was manually tested.
Was this patch authored or co-authored using generative AI tooling?
No.
Closes #43784