List all the columns names and types for arrow connection #3623

oscar-defelice · 2022-05-12T09:10:39Z

I have the following setup:

import duckdb
import pyarrow as pa
import pyarrow.dataset as ds

# Open dataset 
db = ds.dataset('path_to_parquets/') # Many parquet files
con = duckdb.connect()
c = con.cursor()

This works, in the sense that I can query my dataset, for example with

c.execute("SELECT COUNT(value) from db;").fetchall()

However, db is of course not a normal table.
If I list the tables with

c.execute("PRAGMA show_tables;").fetchall()

I get an empty list. How can I run the equivalent of the query

c.execute("PRAGMA table_info('db');").fetchall()

so that I get the name and type of every column in db?

Alex-Monahan · 2022-05-12T12:22:05Z

Collaborator

I think the information schema is what you are looking for!
https://duckdb.org/docs/sql/information_schema

select * from information_schema.columns

1 reply

oscar-defelice May 12, 2022
Author

No, this also returns an empty list.

Alex-Monahan · 2022-05-12T13:10:43Z

Collaborator

Ah, you want to query the metadata of Arrow datasets, my mistake. One workaround might be to create a view on top of your dataset. Otherwise, DuckDB doesn't track your current Python variables, so I think some kind of extra step would be needed.

It may also be best to handle this with a Python loop. You can loop through your Python local variables and look for ones of type Arrow, etc.

5 replies

pdet May 16, 2022
Collaborator

Just to add to what Alex said: your code relies on "replacement scans". These are mainly for when you want to quickly query a Python object (pandas, Arrow) without registering it with DuckDB, so DuckDB holds no metadata about that object.
If you want to query it as a table, registering it as a view is the way to go.

oscar-defelice May 18, 2022
Author

Thank you very much. However, when I try to create a view I get nothing back (or rather, I get an empty view). On the other hand, if I load everything into a .db file I run out of memory: I have a folder containing 100 GB of Parquet data, and I would be copying all of it into a ~100 GB (actually bigger) .db file.

Alex-Monahan May 18, 2022
Collaborator

Would you mind posting your view creation code? I wouldn't expect an empty view! I would have thought a view would somewhat solve your issue. Thanks!

oscar-defelice May 19, 2022
Author

Here it is

import duckdb
import pyarrow as pa
import pyarrow.dataset as ds

# Open dataset 
db = ds.dataset('database_source/parquet_files/')
con = duckdb.connect()
c = con.cursor()

c.execute("SELECT * FROM db LIMIT 10").fetchall() # this works fine

c.execute("CREATE OR REPLACE VIEW v1 AS SELECT * FROM db;").fetchall() # this gives empty

Alex-Monahan May 19, 2022
Collaborator

Thank you for posting!
That is actually expected behavior. View creation in SQL does not return the contents of the view; it only defines it. To consume the view, execute another SELECT statement on it once it has been created. You can also query the view's metadata after it has been created.

c.execute("SELECT * FROM v1").fetchall()

As a note, we tend to recommend pulling your output as either Pandas or Arrow, since it will be faster. Have a look at our how-to guides:
https://duckdb.org/docs/guides/index

orzom411 · 2022-05-20T17:08:15Z


Another suggestion, from the Parquet file perspective. I'm not sure whether this will help, but I've found it useful for detecting schema changes (clients sometimes make things interesting). The essence:

select * from parquet_schema('*.parquet') as s;

From here it isn't difficult to work through the results. I found this useful because I can compute an MD5 hash over the "name" values for each "file_name" to detect differing schemas, e.g.:

select s.file_name, md5( string_agg( s.name ) ) from parquet_schema('*.parquet') as s group by s.file_name;

-- All of the above via the CLI... just an old SQL dog --
