Description
Hi there,
I'm using this client library to fetch a lot of data from our DBX environment. The version I'm using is 3.3.0.
The library keeps crashing when it attempts to concatenate the current and partial results. I cannot attach the full trace because it contains some of our internal schemas, but here is the gist of it:
creation_date
First Schema: creation_date: timestamp[us, tz=Etc/UTC]
Second Schema: creation_date: timestamp[us, tz=Etc/UTC] not null
status
First Schema: status: string
Second Schema: status: string not null
sender_id
First Schema: sender_id: string
Second Schema: sender_id: string not null
...and a few other fields with the exact same discrepancy.
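For illustration, here is a minimal, hypothetical reproduction (not taken from our actual tables) of how a schema difference that is only about nullability makes pyarrow.concat_tables fail with the same error:
import pyarrow as pa

# Two schemas that differ only in nullability, mirroring the discrepancy above
nullable = pa.schema([pa.field("status", pa.string(), nullable=True)])
non_nullable = pa.schema([pa.field("status", pa.string(), nullable=False)])

t1 = pa.Table.from_pydict({"status": ["a"]}, schema=nullable)
t2 = pa.Table.from_pydict({"status": ["b"]}, schema=non_nullable)

# With the default options, concat_tables requires identical schemas, so this raises
# pyarrow.lib.ArrowInvalid: Schema at index 1 was different
pa.concat_tables([t1, t2])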
The exact stack trace is:
Traceback (most recent call last):
File "/Users/X/work/scripts/raw_order.py", line 34, in <module>
for r in tqdm(cursor, total=max_items):
File "/Users/X/work/.venv/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/Users/X/work//.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 422, in __iter__
for row in self.active_result_set:
File "/Users/X/work//.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 1112, in __iter__
row = self.fetchone()
File "/Users/X/work/.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 1217, in fetchone
res = self._convert_arrow_table(self.fetchmany_arrow(1))
File "/Users/X/work/.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 1193, in fetchmany_arrow
results = pyarrow.concat_tables([results, partial_results])
File "pyarrow/table.pxi", line 5962, in pyarrow.lib.concat_tables
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
GOES INTO THE DETAILS SPECIFIED ABOVE AS TO WHAT IS DIFFERENT
The data is coming through a cursor like this:
from databricks import sql
from tqdm import tqdm

connection = sql.connect(
    server_hostname="X",
    http_path="B",
    access_token=app_settings.dbx_access_token,
)
cursor = connection.cursor()
max_items = 100000
batch_size = 10000
cursor.execute(
    f"SELECT * from X where creation_date between '2024-06-01' and '2024-09-01' limit {max_items}"
)
# The crash happens while iterating over the cursor (see the traceback above)
for r in tqdm(cursor, total=max_items):
    ...
The source table is created through a CTAS statement, so all fields are nullable by default. I have found two ways to resolve the issue:
- either set promote_options to "permissive" so pyarrow can merge the two schemas (see the sketch below):
  results = pyarrow.concat_tables([results, partial_results], promote_options="permissive")
- or downgrade to the latest previous major version, 2.9.6.
I checked the 2.9.6 source code and it does not seem to be using permissive schema casting, so this looks like a regression.
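For reference, a minimal sketch of the first workaround, using the same hypothetical tables as above (requires a pyarrow version that supports promote_options): with permissive promotion pyarrow unifies the two schemas instead of rejecting them, and the column simply ends up nullable.
import pyarrow as pa

nullable = pa.schema([pa.field("status", pa.string(), nullable=True)])
non_nullable = pa.schema([pa.field("status", pa.string(), nullable=False)])
t1 = pa.Table.from_pydict({"status": ["a"]}, schema=nullable)
t2 = pa.Table.from_pydict({"status": ["b"]}, schema=non_nullable)

# Permissive promotion unifies the schemas instead of comparing them strictly
merged = pa.concat_tables([t1, t2], promote_options="permissive")
print(merged.schema)  # status: string (nullable again)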
I'm not sure if I can add anything else beyond that, but do let me know.
And to be clear, I request around 100k records at a time, can iterate through roughly 95k of them, and then it fails, so I'm not sure there is a reliable way to reproduce this.
If the cluster runtime matters, it's 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12).
Thanks!