Description
Hi there,
I'm using this client library to fetch a lot of data from our DBX environment. The version I'm using is 3.3.0.
The library keeps crashing when it attempts to concatenate the current and partial results. I cannot attach the full trace because it contains some of our internal schemas, but here is the gist of it:
creation_date
First Schema: creation_date: timestamp[us, tz=Etc/UTC]
Second Schema: creation_date: timestamp[us, tz=Etc/UTC] not null
status
First Schema: status: string
Second Schema: status: string not null
sender_id
First Schema: sender_id: string
Second Schema: sender_id: string not null
...and a few other fields with the exact same discrepancy.
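For illustration, here is a minimal, hypothetical reproduction (not taken from our actual tables) of how a schema difference that is only about nullability makes pyarrow.concat_tables fail with the same error:
import pyarrow as pa

# Two schemas that differ only in nullability, mirroring the discrepancy above
nullable = pa.schema([pa.field("status", pa.string(), nullable=True)])
non_nullable = pa.schema([pa.field("status", pa.string(), nullable=False)])

t1 = pa.Table.from_pydict({"status": ["a"]}, schema=nullable)
t2 = pa.Table.from_pydict({"status": ["b"]}, schema=non_nullable)

# With the default options, concat_tables requires identical schemas, so this raises
# pyarrow.lib.ArrowInvalid: Schema at index 1 was different
pa.concat_tables([t1, t2])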
The exact stack trace is:
Traceback (most recent call last):
File "/Users/X/work/scripts/raw_order.py", line 34, in <module>
for r in tqdm(cursor, total=max_items):
File "/Users/X/work/.venv/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/Users/X/work//.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 422, in __iter__
for row in self.active_result_set:
File "/Users/X/work//.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 1112, in __iter__
row = self.fetchone()
File "/Users/X/work/.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 1217, in fetchone
res = self._convert_arrow_table(self.fetchmany_arrow(1))
File "/Users/X/work/.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 1193, in fetchmany_arrow
results = pyarrow.concat_tables([results, partial_results])
File "pyarrow/table.pxi", line 5962, in pyarrow.lib.concat_tables
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
GOES INTO THE DETAILS SPECIFIED ABOVE AS TO WHAT IS DIFFERENT
The data is coming through a cursor like this:
from databricks import sql
from tqdm import tqdm

connection = sql.connect(
    server_hostname="X",
    http_path="B",
    access_token=app_settings.dbx_access_token,
)
cursor = connection.cursor()
max_items = 100000
batch_size = 10000
cursor.execute(
    f"SELECT * from X where creation_date between '2024-06-01' and '2024-09-01' limit {max_items}"
)
# The crash happens while iterating over the cursor (see the traceback above)
for r in tqdm(cursor, total=max_items):
    ...
The source table is created through a CTAS statement, so all fields are nullable by default. I have found two ways to resolve the issue:
- either set promote_options to "permissive" so pyarrow can merge the two schemas (see the sketch below):
  results = pyarrow.concat_tables([results, partial_results], promote_options="permissive")
- or downgrade to the latest previous major version, 2.9.6.
I checked the 2.9.6 source code and it does not seem to be using permissive schema casting, so this looks like a regression.
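For reference, a minimal sketch of the first workaround, using the same hypothetical tables as above (requires a pyarrow version that supports promote_options): with permissive promotion pyarrow unifies the two schemas instead of rejecting them, and the column simply ends up nullable.
import pyarrow as pa

nullable = pa.schema([pa.field("status", pa.string(), nullable=True)])
non_nullable = pa.schema([pa.field("status", pa.string(), nullable=False)])
t1 = pa.Table.from_pydict({"status": ["a"]}, schema=nullable)
t2 = pa.Table.from_pydict({"status": ["b"]}, schema=non_nullable)

# Permissive promotion unifies the schemas instead of comparing them strictly
merged = pa.concat_tables([t1, t2], promote_options="permissive")
print(merged.schema)  # status: string (nullable again)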
I'm not sure if I can add anything else beyond that, but do let me know.
And to be clear, I request around 100k records at a time, can iterate through roughly 95k of them, and then it fails, so I'm not sure there is a reliable way to reproduce this.
If the cluster runtime matters, it's 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12).
Thanks!