Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Columns with Null Types when converted to Arrow return an int type with nulls #7149

Closed
2 tasks done
pdet opened this issue Apr 19, 2023 · 6 comments
Closed
2 tasks done
Assignees
Labels

Comments

@pdet
Copy link
Member

pdet commented Apr 19, 2023

What happens?

Examples:

import duckdb 
duckdb.sql('select null') 

Result:

┌───────┐
│ NULL  │
│ int32 │
├───────┤
│  NULL │
└───────┘
duckdb.sql('select null').arrow()
pyarrow.Table
NULL: int32
----
NULL: [[null]]
duckdb.sql('select typeof(i) from (select null) tbl(i)')

Result:

┌───────────┐
│ typeof(i) │
│  varchar  │
├───────────┤
│ NULL      │
└───────────┘

To Reproduce

See Above

OS:

Mac Os

DuckDB Version:

Master

DuckDB Client:

Python

Full Name:

Pedro

Affiliation:

DuckDB

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@pdet
Copy link
Member Author

pdet commented Apr 20, 2023

@cpcloud , I've chatted with @Mytherin after our meeting and confirmed that it is actually intended behavior.

We could check if columns are null from the stats and output as the arrow null type. But then SELECT NULL::INT will also output the arrow null. And worse, this could affect the schema of tables by simply inserting data.

On the other hand, I can see an argument that arrow results are not necessarily a representation of a table schema but rather of a query result.

I think, if this is something you believe it is crucial, we could enable it with an option. (e.g., output_null_type)

@pdet pdet self-assigned this Apr 20, 2023
@cpcloud
Copy link
Contributor

cpcloud commented Apr 20, 2023

@pdet Interesting!

I guess it doesn't make complete sense to me why the choice of int32 for select null versus any other type.

I would expect the columns a, b, c in select null as a, null::int as b, null::string[] as c to each have a different column type and choosing anything other than SQLNULL for a seems like it would be incorrect.

And worse, this could affect the schema of tables by simply inserting data.

Isn't this true already?

In [5]: import duckdb

In [6]: con = duckdb.connect()

In [7]: con.execute("select null as a").fetch_arrow_table().schema
Out[7]: a: int32

In [8]: con.execute("select null as a union select 'xyz' as a").fetch_arrow_table().schema
Out[8]: a: string

On the other hand, I can see an argument that arrow results are not necessarily a representation of a table schema but rather of a query result.

I'm not sure I follow 😅! How are a table's schema and a query result's schema different things?

@Mytherin
Copy link
Collaborator

What is happening is that we support NULL only as an internal type during the bind phase - we don't support the NULL type for either (1) storing data in tables, (2) as an intermediate vector type, or (3) as the result of a query. When NULL types are found at these boundaries, they are converted into integers instead. This applies not only to query results, but also e.g. for our own tables, or when exporting data to Parquet or similar, e.g.:

D CREATE TABLE t AS SELECT NULL;
D DESCRIBE t;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│ column_name │ column_type │  null   │   key   │ default │ extra │
│   varcharvarcharvarcharvarcharvarchar │ int32 │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ NULLINTEGER     │ YES     │ NULLNULLNULL │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴───────┘

The idea is that NULL is not a particularly useful type, since it really only comes from a query that involves a constant scalar NULL. Forcing all clients to deal with NULL types at the boundaries is a lot of work for essentially no gain. Many external data representations also can't deal with a NULL type and would have to do this conversion anyway.

We could disable this behavior for the Arrow conversion, as Arrow does support the NULL type, but it is inconsistent with the rest of the external interfaces of the system.

@cpcloud
Copy link
Contributor

cpcloud commented Apr 20, 2023

I see, thanks for the explanation. It's probably not worth special casing this at the moment.

@github-actions
Copy link

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale label Jul 29, 2023
@github-actions
Copy link

This issue was closed because it has been stale for 30 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants