-
Notifications
You must be signed in to change notification settings - Fork 123
Enable Dataframe to be converted into views which can be used in register_table #1016
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…comments and assertions
…ions for DataFrame results
python/tests/test_view.py
Outdated
view = df_filtered.into_view() | ||
|
||
assert view.kind == "view" | ||
|
||
ctx.register_table("view1", view) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is modelled after how into_view is used in datafusion:
async fn with_column_renamed_ambiguous() -> Result<()> {
let df = test_table().await?.select_columns(&["c1", "c2", "c3"])?;
let ctx = SessionContext::new();
let table = df.into_view();
ctx.register_table("t1", table.clone())?;
ctx.register_table("t2", table)?;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a valuable addition. Thank you!
I do have a few concerns below, and I'm happy to help us come to a common solution.
#[pyclass(name = "TableProvider", module = "datafusion")] | ||
pub struct PyTableProvider { | ||
provider: Arc<dyn TableProvider>, | ||
} | ||
|
||
impl PyTableProvider { | ||
pub fn new(provider: Arc<dyn TableProvider>) -> Self { | ||
Self { provider } | ||
} | ||
|
||
pub fn as_table(&self) -> PyTable { | ||
let table_provider: Arc<dyn TableProvider> = self.provider.clone(); | ||
PyTable::new(table_provider) | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I think this is a good idea, but I'm worried about causing confusion with a table provider created from a view and a table provider that is passed from an external source using pycapsule. I can imagine a user would think that a table provider object from one place can be used with another. That is, if I create a table provider with into_view I should be able to register it with the session context. Now, I don't think that operation is strictly necssary but I do expect it would cause some confusion.
What I think we want to do is to have a single common PyTableProvider that can be created either via a pycapsule or into_view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a single common PyTableProvider that can be created either via a pycapsule or into_view.
Do you mean a constructor that takes a pycapsule argument, then extract provider to use in
PyTableProvider::new(provider)?
Can I check how I can obtain the provider from pub struct PyCapsule(PyAny)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any chance you can give me some code points or reference PRs that would help with implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we just skip the whole creating a view as a table provider and instead go straight to registering a view on the session context?
We could do something like register_view(df: DataFrame)
which would under the hood do exactly what you've got except not expose it back as a PyTableProvider
and eliminate any possible confusion. Then we'd also save the user a step.
@matko would that solve your needs or do you need that view table provider exposed for other use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise I think we have to plan for how we can have a common concept around two ways of creating table providers in python code. Also we would want to think about how we would handle the return type of a udtf, which we haven't even addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
skip the whole creating a view as a table provider and instead go straight to registering a view on the session context?
Sounds good.
Implemented.
src/dataframe.rs
Outdated
@@ -156,6 +174,20 @@ impl PyDataFrame { | |||
PyArrowType(self.df.schema().into()) | |||
} | |||
|
|||
/// Convert this DataFrame into a Table that can be used in register_table | |||
fn _into_view(&self) -> PyDataFusionResult<PyTable> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend disabling that specific clippy warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
README.md
Outdated
## Registering a DataFrame as a View | ||
|
||
You can use the `into_view` method to convert a DataFrame into a view and register it with the context. | ||
|
||
```python | ||
from datafusion import SessionContext, col, literal | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is very good, but would be more helpful if moved into the appropriate docs section so it goes into the online documentation rather than the readme.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created a view.rst for this.
…r DataFrame registration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. I will try to run it later today.
src/dataframe.rs
Outdated
/// | ||
/// # Methods | ||
/// | ||
/// - `new`: Creates a new PyDataFrame. | ||
/// - `__getitem__`: Enable selection for `df[col]`, `df[col1, col2, col3]`, and `df[[col1, col2, col3]]`. | ||
/// - `__repr__`: Returns a string representation of the DataFrame. | ||
/// - `_repr_html_`: Returns an HTML representation of the DataFrame. | ||
/// - `describe`: Calculate summary statistics for a DataFrame. | ||
/// - `schema`: Returns the schema from the logical plan. | ||
/// - `into_view`: Convert this DataFrame into a Table that can be used in register_table. We have not finalized on PyTableProvider approach yet. | ||
/// - `select_columns`: Select columns from the DataFrame. | ||
/// - `select`: Select expressions from the DataFrame. | ||
/// - `drop`: Drop columns from the DataFrame. | ||
/// - `filter`: Filter the DataFrame based on a predicate. | ||
/// - `with_column`: Add a new column to the DataFrame. | ||
/// - `with_columns`: Add multiple new columns to the DataFrame. | ||
/// - `with_column_renamed`: Rename a column in the DataFrame. | ||
/// - `aggregate`: Aggregate the DataFrame based on group by and aggregation expressions. | ||
/// - `sort`: Sort the DataFrame based on expressions. | ||
/// - `limit`: Limit the number of rows in the DataFrame. | ||
/// - `collect`: Executes the plan, returning a list of `RecordBatch`es. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this whole section. The methods of the class should auto populate into the help
and online documentation. Do you think this is necessary? It's another thing we'd have to manually maintain over time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's not necessary.
I deleted it.
Which issue does this PR close?
Closes #1004.
Rationale for this change
Currently, Datafusion supports views via ViewTable, allowing logical plans to be registered as tables. However, this feature is not exposed in the Python bindings. This PR addresses that gap by enabling Dataframes to be converted to view, enabling users to create views programmatically without relying on raw SQL strings.
What changes are included in this PR?
Adds DataFrame.into_view which can then be used in
register_table.
Are there any user-facing changes?
Yes, Python users will now be able to convert dataframes into view, then register_table with the view.