This repository has been archived by the owner on Jul 25, 2022. It is now read-only.

Support custom TableProvider #45

Closed
jychen7 opened this issue Apr 2, 2022 · 4 comments

Comments

@jychen7
Member

jychen7 commented Apr 2, 2022

Background

I would like to use datafusion-python to query Bigtable. In Rust, datafusion-bigtable implements BigtableDataSource as a custom TableProvider.

Problem

I tried adding register_table in #46 and exposing a Python BigtableTable in datafusion-bigtable at datafusion-contrib/datafusion-bigtable#3.

The problem is how to convert a Python BigtableTable into a Python Table. Put differently: how can a Rust TableProvider be serialized/deserialized to some Python object?

classDiagram
    BigtableTable_Python <|-- PyBigtableTable_Rust
    Table_Python <|-- PyTable_Rust
    TableProvider_Rust <|-- BigtableDatasource_Rust
    TableProvider_Rust <|-- ListingTable_Rust
    ListingTable_Rust <|-- CSV
    ListingTable_Rust <|-- Parquet
    ListingTable_Rust <|-- JSON
    ListingTable_Rust <|-- Avro
    class BigtableTable_Python{
    }
    class PyBigtableTable_Rust{
        table: TableProvider_Rust
    }
    class Table_Python{
    }
    class PyTable_Rust{
        table: TableProvider_Rust
    }
            

The following is a non-working example: bigtable.table() returns a Rust TableProvider, which has no corresponding Python object.

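import pytest

from datafusion import ExecutionContext
from datafusion._internal import Table as DatafusionTable
from datafusion_bigtable import BigtableTable

@pytest.fixture
def df_table():
    bigtable = BigtableTable(
        project="emulator",
        xxx  # remaining arguments elided in the original
    )
    # Fails: bigtable.table() is a Rust TableProvider with no Python counterpart
    return DatafusionTable(bigtable.table())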
@jychen7
Member Author

jychen7 commented Apr 3, 2022

I believe this can work using PyO3 class inheritance (https://pyo3.rs/v0.15.1/class.html?highlight=inheri#inheritance), so I am closing this for now.

@jychen7 jychen7 closed this as completed Apr 3, 2022
@jychen7
Member Author

jychen7 commented Apr 4, 2022

I tried both with and without inheritance. Both variants compile, but pytest still reports an error.

With inheritance, ctx.register_table("weather_balloons", bigtable_table) raises TypeError: argument 'table': 'BigtableTable' object cannot be converted to 'Table'
https://github.com/datafusion-contrib/datafusion-bigtable/blob/014d02f26800402d37638113948d07197fb7b201/python/src/datasource.rs#L11-L12

Without inheritance, ctx.register_table("weather_balloons", bigtable_table.to_pytable()) raises TypeError: argument 'table': 'Table' object cannot be converted to 'Table'
https://github.com/datafusion-contrib/datafusion-bigtable/blob/fb2c794a33b5ee9234f7a9e24f2afebc7e17a7fb/python/src/datasource.rs#L56-L58


I also tried register_csv and then passed the resulting PyTable to register_table as t1, which works. The odd thing is that in the following log, t1 and t2 report the same class/type, yet register_table fails for t2.

(Pdb) ctx.register_csv("temp", "/path/to/temp.csv")

(Pdb) t1 = ctx.catalog().database("public").table("temp")
(Pdb) t1
<datafusion.Table object at 0x1055086f0>
(Pdb) ctx.register_table("t1", t1)
(Pdb) ctx.tables()
{'t1', 'temp'}

(Pdb) t2 = bigtable_table.to_pytable()
(Pdb) t2
<datafusion.Table object at 0x1055085a0>
(Pdb) ctx.register_table("t2", t2)
*** TypeError: argument 'table': 'Table' object cannot be converted to 'Table'
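
A quick way to check whether two objects that print the same class really share a type is to compare the type objects themselves rather than their names. The snippet below is only a sketch in plain Python: `type_fingerprint` and `same_name_different_type` are hypothetical helpers, and `TableA` / `TableB` are stand-ins that merely claim to be `datafusion.Table`; they are not the real datafusion classes.

```python
def type_fingerprint(obj):
    """Return (module, qualified name, identity) of obj's class."""
    cls = type(obj)
    return cls.__module__, cls.__qualname__, id(cls)


def same_name_different_type(a, b):
    """True when both classes print identically but are distinct type objects."""
    mod_a, name_a, id_a = type_fingerprint(a)
    mod_b, name_b, id_b = type_fingerprint(b)
    return (mod_a, name_a) == (mod_b, name_b) and id_a != id_b


# Stand-ins (assumption): two separately created classes that both
# present themselves as datafusion.Table.
TableA = type("Table", (), {"__module__": "datafusion"})
TableB = type("Table", (), {"__module__": "datafusion"})

t1, t2 = TableA(), TableB()

print(type(t1), type(t2))               # both print as datafusion.Table
print(same_name_different_type(t1, t2))  # names match, type objects differ
print(isinstance(t2, type(t1)))          # the check a type conversion relies on
```

In the pdb session above, the equivalent check would be `type(t1) is type(t2)`; if that is false, the matching repr is only a coincidence of naming.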

@jychen7 jychen7 reopened this Apr 4, 2022
@jychen7
Member Author

jychen7 commented Apr 4, 2022

@jimexist, sorry to bother you. Do you have any idea how to resolve the type conversion error in #45 (comment)?
(I am not sure whether it is a limitation of pyo3 or whether I am missing something; it seems almost there.)

jimexist pushed a commit that referenced this issue Apr 4, 2022
* Add register_table and deregister_table

* expose public module and method for PyTable inheritant
@jychen7
Member Author

jychen7 commented Apr 5, 2022

It looks like this is not supported in PyO3. According to PyO3/pyo3#1444, even though datafusion-bigtable uses PyTable from datafusion-python, after compilation PyO3 treats the two PyTable classes as different types:

The key issue is that #[pyclass] stores the pyclass type object in static storage. This means that (if Rust's usual rlib linkage is used) packages A and B will have their own copies of the MyClass type object, and Python will think that they're actually different types coming from the two packages.
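
The mechanism described in that quote can be modeled in plain Python. This is only an analogy, not PyO3's actual implementation: the hypothetical `build_extension` function stands in for one compiled package, and the `Table` class it creates stands in for that package's private copy of the type object held in static storage.

```python
def build_extension():
    """Model one compiled extension module with its own copy of the Table type."""
    # Stand-in for the per-library static slot that holds the type object.
    Table = type("Table", (), {"__module__": "datafusion"})
    registered = {}

    def register_table(name, table):
        # Argument extraction checks against THIS library's copy of the type.
        if not isinstance(table, Table):
            raise TypeError(
                f"argument 'table': '{type(table).__name__}' object "
                "cannot be converted to 'Table'"
            )
        registered[name] = table

    return Table, register_table, registered


# Two packages, each built with its own copy of the same class definition.
DfTable, df_register_table, df_tables = build_extension()  # models datafusion-python
BtTable, _bt_register, _bt_tables = build_extension()      # models datafusion-bigtable

df_register_table("t1", DfTable())  # accepted: same type object

message = None
try:
    df_register_table("t2", BtTable())  # rejected: same name, different type object
except TypeError as err:
    message = str(err)
    print(message)
```

In this model, a table created by the second "package" is rejected by the first package's `register_table` even though both classes are named `Table`, which matches the shape of the error seen in the log earlier in this thread.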
