Columns with Null Types when converted to Arrow return an int type with nulls #7149

pdet · 2023-04-19T15:36:29Z

What happens?

Examples:

import duckdb 
duckdb.sql('select null')

Result:

┌───────┐
│ NULL  │
│ int32 │
├───────┤
│  NULL │
└───────┘

duckdb.sql('select null').arrow()

pyarrow.Table
NULL: int32
----
NULL: [[null]]

duckdb.sql('select typeof(i) from (select null) tbl(i)')

Result:

┌───────────┐
│ typeof(i) │
│  varchar  │
├───────────┤
│ NULL      │
└───────────┘

To Reproduce

See Above

OS:

macOS

DuckDB Version:

Master

DuckDB Client:

Python

Full Name:

Pedro

Affiliation:

DuckDB

Have you tried this on the latest `master` branch?

I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

I agree

pdet · 2023-04-20T08:24:29Z

@cpcloud, I chatted with @Mytherin after our meeting and confirmed that this is actually intended behavior.

We could check from the stats whether a column contains only nulls and output it as the Arrow null type. But then SELECT NULL::INT would also be output as the Arrow null type. Worse, this could change the schema of a table simply by inserting data.

On the other hand, I can see an argument that arrow results are not necessarily a representation of a table schema but rather of a query result.

If you believe this is crucial, I think we could enable it with an option (e.g., output_null_type).

cpcloud · 2023-04-20T12:46:49Z

@pdet Interesting!

I guess it doesn't completely make sense to me why int32 was chosen for `select null` rather than any other type.

I would expect the columns `a`, `b`, `c` in `select null as a, null::int as b, null::string[] as c` to each have a different column type, and choosing anything other than SQLNULL for `a` seems like it would be incorrect.

> And worse, this could affect the schema of tables by simply inserting data.

Isn't this true already?

In [5]: import duckdb

In [6]: con = duckdb.connect()

In [7]: con.execute("select null as a").fetch_arrow_table().schema
Out[7]: a: int32

In [8]: con.execute("select null as a union select 'xyz' as a").fetch_arrow_table().schema
Out[8]: a: string

> On the other hand, I can see an argument that arrow results are not necessarily a representation of a table schema but rather of a query result.

I'm not sure I follow 😅! How are a table's schema and a query result's schema different things?

Mytherin · 2023-04-20T12:55:17Z

What is happening is that we support NULL only as an internal type during the bind phase. We don't support the NULL type (1) for storing data in tables, (2) as an intermediate vector type, or (3) as the result of a query. When NULL types are found at these boundaries, they are converted into integers instead. This applies not only to query results, but also to our own tables and to exporting data to Parquet or similar formats, e.g.:

D CREATE TABLE t AS SELECT NULL;
D DESCRIBE t;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│ column_name │ column_type │  null   │   key   │ default │ extra │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ int32 │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ NULL        │ INTEGER     │ YES     │ NULL    │ NULL    │  NULL │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴───────┘

The idea is that NULL is not a particularly useful type, since it really only comes from a query that involves a constant scalar NULL. Forcing all clients to deal with NULL types at the boundaries is a lot of work for essentially no gain. Many external data representations also can't deal with a NULL type and would have to do this conversion anyway.

We could disable this behavior for the Arrow conversion, as Arrow does support the NULL type, but that would be inconsistent with the rest of the external interfaces of the system.

cpcloud · 2023-04-20T13:03:44Z

I see, thanks for the explanation. It's probably not worth special casing this at the moment.

github-actions · 2023-07-29T00:27:34Z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions · 2023-08-28T00:30:45Z

This issue was closed because it has been stale for 30 days with no activity.

pdet self-assigned this Apr 20, 2023

cpcloud mentioned this issue Jul 18, 2023

feat: Uint dtypes from duckdb engine ibis-project/ibis#6632

Closed

1 task

github-actions bot added the stale label Jul 29, 2023

github-actions bot closed this as not planned Aug 28, 2023
