python: errors using pyarrow Dataset with adbc_ingest() for adbc_driver_postgres() #1310
Just for reference, what kind of errors are you getting?
`test_csv_dataset_table()`

```
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
/Users/---/projects/---/pipeline/scratch.ipynb Cell 14 line 1
----> 1 test_csv_dataset_table(base_path)
/Users/---/projects/---/pipeline/scratch.ipynb Cell 14 line 1
9 print(dst.schema)
10 table = dst.to_table()
---> 11 ingest_data(conn_uri, table, mode="create_append")
/Users/---/projects/---/pipeline/scratch.ipynb Cell 14 line 4
2 with dbapi.connect(conn_uri) as conn:
3 with conn.cursor() as cursor:
----> 4 cursor.adbc_ingest(
5 db_schema_name="public",
6 table_name="test_table",
7 data=data,
8 mode=mode,
9 )
File ~/miniforge3/envs/---/lib/python3.11/site-packages/adbc_driver_manager/dbapi.py:894, in Cursor.adbc_ingest(self, table_name, data, mode, catalog_name, db_schema_name, temporary)
891 self._stmt.bind_stream(handle)
893 self._last_query = None
--> 894 return self._stmt.execute_update()
File ~/miniforge3/envs/---/lib/python3.11/site-packages/adbc_driver_manager/_lib.pyx:1184, in adbc_driver_manager._lib.AdbcStatement.execute_update()
File ~/miniforge3/envs/---/lib/python3.11/site-packages/adbc_driver_manager/_lib.pyx:227, in adbc_driver_manager._lib.check_error()
OperationalError: IO: Error writing tuple field data: no COPY in progress
```

`test_csv_dataset_batch()`

```
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/Users/---/projects/---/pipeline/scratch.ipynb Cell 16 line 1
----> 1 test_csv_dataset_batch(base_path)
/Users/---/projects/---/pipeline/scratch.ipynb Cell 16 line 1
2 dst = ds.dataset(
3 base_path,
4 format=ds.CsvFileFormat(read_options=csv.ReadOptions(use_threads=False, block_size=CHUNK_SIZE)),
(...)
7 ),
8 )
9 record_batch = dst.to_batches()
---> 10 ingest_data(conn_uri, record_batch, mode="create_append")
/Users/---/projects/---/pipeline/scratch.ipynb Cell 16 line 4
2 with dbapi.connect(conn_uri) as conn:
3 with conn.cursor() as cursor:
----> 4 cursor.adbc_ingest(
5 db_schema_name="public",
6 table_name="test_table",
7 data=data,
8 mode=mode,
9 )
File ~/miniforge3/envs/---/lib/python3.11/site-packages/adbc_driver_manager/dbapi.py:890, in Cursor.adbc_ingest(self, table_name, data, mode, catalog_name, db_schema_name, temporary)
888 data = data.to_reader()
889 handle = _lib.ArrowArrayStreamHandle()
--> 890 data._export_to_c(handle.address)
891 self._stmt.bind_stream(handle)
893 self._last_query = None
AttributeError: '_cython_3_0_5.generator' object has no attribute '_export_to_c'
```

`test_csv_dataset_reader()`

```
---------------------------------------------------------------------------
OperationalError Traceback (most recent call last)
/Users/---/projects/---/pipeline/scratch.ipynb Cell 18 line 1
----> 1 test_csv_dataset_reader(base_path)
/Users/---/projects/---/pipeline/scratch.ipynb Cell 18 line 1
2 dst = ds.dataset(
3 base_path,
4 format=ds.CsvFileFormat(read_options=csv.ReadOptions(use_threads=False, block_size=CHUNK_SIZE)),
(...)
7 ),
8 )
9 scanner = dst.scanner()
---> 10 ingest_data(conn_uri, scanner.to_reader(), mode="create_append")
/Users/---/projects/---/pipeline/scratch.ipynb Cell 18 line 4
2 with dbapi.connect(conn_uri) as conn:
3 with conn.cursor() as cursor:
----> 4 cursor.adbc_ingest(
5 db_schema_name="public",
6 table_name="test_table",
7 data=data,
8 mode=mode,
9 )
File ~/miniforge3/envs/---/lib/python3.11/site-packages/adbc_driver_manager/dbapi.py:894, in Cursor.adbc_ingest(self, table_name, data, mode, catalog_name, db_schema_name, temporary)
891 self._stmt.bind_stream(handle)
893 self._last_query = None
--> 894 return self._stmt.execute_update()
File ~/miniforge3/envs/---/lib/python3.11/site-packages/adbc_driver_manager/_lib.pyx:1184, in adbc_driver_manager._lib.AdbcStatement.execute_update()
File ~/miniforge3/envs/---/lib/python3.11/site-packages/adbc_driver_manager/_lib.pyx:227, in adbc_driver_manager._lib.check_error()
OperationalError: IO: Error writing tuple field data: no COPY in progress
```
As a workaround for now, I think you should be able to do:

```python
import pyarrow.dataset
from adbc_driver_postgresql import dbapi  # assumes the PostgreSQL ADBC driver's DBAPI layer

# CONNECTION_URI is assumed to be defined elsewhere in the calling module.


def ingest_dataset(dst: pyarrow.dataset.Dataset, table_name: str):
    """Ingest a pyarrow Dataset into Postgres batch by batch with adbc_ingest()."""
    record_batches = dst.to_batches()
    schema_name = "public"
    with dbapi.connect(uri=CONNECTION_URI) as conn:
        with conn.cursor() as cursor:
            # Recreate the table based on the first batch
            cursor.adbc_ingest(
                db_schema_name=schema_name,
                table_name=table_name,
                data=next(record_batches),
                mode="replace",
            )
            # Append the remaining batches
            for record_batch in record_batches:
                cursor.adbc_ingest(
                    db_schema_name=schema_name,
                    table_name=table_name,
                    data=record_batch,
                    mode="append",
                )
        conn.commit()
```

Though I'd agree that it would be preferable (and probably slightly more performant) if you could just pass the Dataset (or RecordBatchReader, or Scanner) to the driver.
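For illustration, calling the workaround might look like the following; the dataset path, format, and table name are placeholders rather than anything taken from this thread:

```python
import pyarrow.dataset as ds

# Hypothetical usage of ingest_dataset() above; path and table name are placeholders.
dst = ds.dataset("/path/to/partitioned/csv", format="csv")
ingest_dataset(dst, table_name="test_table")
```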
We can make RecordBatchReader work; for Dataset there is a proposal to turn it into a protocol/interface we can recognize as well.
OK. It looks like the actual problem is multiple batches of data. The COPY loop we do is ending the copy too early. |
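To make the multi-batch case concrete, here is a minimal sketch (not taken from the thread) that pushes a two-batch stream through `adbc_ingest()`; the connection URI and table name are placeholders, and `adbc_driver_postgresql` is assumed as the driver package:

```python
import pyarrow as pa
from adbc_driver_postgresql import dbapi

conn_uri = "postgresql://user:pass@localhost:5432/postgres"  # placeholder

# Split a tiny table into two record batches; the failure only shows up
# once the COPY loop has to handle a second batch.
table = pa.table({"id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]})
reader = pa.RecordBatchReader.from_batches(table.schema, table.to_batches(max_chunksize=2))

with dbapi.connect(conn_uri) as conn:
    with conn.cursor() as cursor:
        # On affected driver versions this fails with "no COPY in progress".
        cursor.adbc_ingest(table_name="copy_repro", data=reader, mode="create_append")
    conn.commit()
```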
OK, should have a PR up later today hopefully...the question is what to do with |
The COPY writer was ending the COPY command after each batch, so any dataset with more than one batch would fail. Instead, write the header once and don't end the command until we've written all batches. Fixes #1310.
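For intuition, a rough Python-flavoured sketch of the loop-shape change described above; the real COPY writer lives inside the driver and is not Python, and `write_header`, `write_batch`, and `end_copy` are invented names used only to contrast the two shapes:

```python
def copy_per_batch(writer, batches):
    # Old shape (buggy, per the fix description): the COPY stream is finalized
    # after each batch, so writing the next batch hits "no COPY in progress".
    for batch in batches:
        writer.write_header()
        writer.write_batch(batch)
        writer.end_copy()


def copy_once(writer, batches):
    # New shape (fixed): header once, every batch, end COPY once at the end.
    writer.write_header()
    for batch in batches:
        writer.write_batch(batch)
    writer.end_copy()
```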
I'm running into errors when trying to bulk load data into Postgres using `adbc_driver_postgres` with a `pyarrow.dataset`. The dataset is composed of partitioned CSV files. I have verified that using `pyarrow.csv.open_csv()` works with `adbc_ingest()`. Additionally, it appears that loading the dataset into a `pyarrow.Table`, passing it to Polars, and calling `df.to_arrow()` works with `adbc_ingest()`. The following is a script to reproduce the errors I have been getting:
**Python version**

3.11.6

**Dependencies**

**MRE**
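The MRE body is not preserved in this excerpt; the sketch below is reconstructed from the tracebacks above. `base_path`, `conn_uri`, and `CHUNK_SIZE` are placeholders, the partitioning options shown only as `(...)` in the tracebacks are omitted, and `adbc_driver_postgresql` is assumed as the driver package:

```python
import pyarrow.csv as csv
import pyarrow.dataset as ds
from adbc_driver_postgresql import dbapi

CHUNK_SIZE = 1_048_576  # placeholder block size
conn_uri = "postgresql://user:pass@localhost:5432/postgres"  # placeholder
base_path = "/path/to/partitioned/csv"  # placeholder


def ingest_data(uri, data, mode):
    with dbapi.connect(uri) as conn:
        with conn.cursor() as cursor:
            cursor.adbc_ingest(
                db_schema_name="public",
                table_name="test_table",
                data=data,
                mode=mode,
            )
        conn.commit()


def make_dataset(path):
    # CSV dataset read single-threaded in small blocks so it yields multiple batches.
    return ds.dataset(
        path,
        format=ds.CsvFileFormat(
            read_options=csv.ReadOptions(use_threads=False, block_size=CHUNK_SIZE)
        ),
        # partitioning options elided in the tracebacks are omitted here
    )


def test_csv_dataset_table(path):
    # OperationalError: IO: Error writing tuple field data: no COPY in progress
    ingest_data(conn_uri, make_dataset(path).to_table(), mode="create_append")


def test_csv_dataset_batch(path):
    # AttributeError: generator object has no attribute '_export_to_c'
    ingest_data(conn_uri, make_dataset(path).to_batches(), mode="create_append")


def test_csv_dataset_reader(path):
    # OperationalError: IO: Error writing tuple field data: no COPY in progress
    ingest_data(conn_uri, make_dataset(path).scanner().to_reader(), mode="create_append")
```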