-
Upon reading https://duckdb.org/2021/05/14/sql-on-pandas.html, I'm very interested in running queries through DuckDB, specifically replacing pd.merge() with a SQL join (for now), hoping for both multi-threaded execution and lower memory usage. Currently, I'm seeing neither improvement: CPU usage during the query stays at 100% or lower, indicating single-threaded execution, and the query uses more memory and runs longer than pd.merge(). Also, it appears I need to reset the index on the DataFrame, or else I get an error that the columns don't exist; it would be nice if DuckDB could take advantage of an existing index. Please let me know what I am missing. Thanks for your work, and I'm eagerly looking forward to future updates!
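To make the setup concrete, here is a minimal sketch of the swap being described, using illustrative data (the connection, DataFrames, and values are not from the original post; the real workload joins on loan_id and act_dte):

```python
import duckdb
import pandas as pd

# illustrative DataFrames, not the poster's real data
df1 = pd.DataFrame({"loan_id": [1, 2, 3], "amount": [100.0, 250.0, 80.0]})
df2 = pd.DataFrame({"loan_id": [1, 2, 4], "some_code": ["a", "b", "c"]})

# the pandas way
merged = pd.merge(df1, df2, on="loan_id")

# the DuckDB way: replacement scans find df1/df2 by variable name
con = duckdb.connect()
merged_sql = con.execute(
    "select df1.loan_id, df1.amount, df2.some_code "
    "from df1 join df2 on df1.loan_id = df2.loan_id"
).df()

# if loan_id lives in the DataFrame's index rather than in a column, DuckDB
# will not see it; resetting the index avoids the "column doesn't exist" error
df1 = df1.reset_index()
```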
-
Upon reading further, I registered the DataFrames with con.register(). So far, I don't see any improvement, and I'd love to see some of the improvements referenced in the article. We make heavy use of the Python-native stack of pandas/dask/sklearn/numba, and I am beginning to feel that DuckDB will be a welcome addition to it. For some reason, we have had issues with dask.merge(), though we are a big beneficiary of its other features (there may be a bit of a learning curve left there too).
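For reference, a sketch of what that registration looks like, assuming the placeholder file names and join columns used elsewhere in the thread:

```python
import duckdb
import pandas as pd

df1 = pd.read_parquet("<filename1>")  # placeholder paths from the thread
df2 = pd.read_parquet("<filename2>")

con = duckdb.connect()
con.register("df1", df1)  # expose the DataFrames to DuckDB as views
con.register("df2", df2)

df = con.execute(
    "select df1.loan_id, df2.some_code "
    "from df1 join df2 "
    "on df1.loan_id = df2.loan_id and df1.act_dte = df2.act_dte"
).df()
```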
-
The current implementation is limited in what can and cannot be parallelized, as we are incrementally adding parallelism support to the different operators in the engine. While joins can be fully parallelized, materializing to a Pandas DataFrame cannot be parallelized yet, so the final pipeline will run on a single thread for now. This is something we will fix in the future. If you want to see parallelism today, you would need to add e.g. an aggregate or an ORDER BY clause at the end of the query (a sketch follows the examples below).

One important note is that DuckDB should be used in a slightly different manner than Pandas. In your example, I see that you are first reading the Parquet files into Pandas and then handing the DataFrames to DuckDB. Especially if you are reading from Parquet, I recommend using DuckDB's Parquet reader directly, rather than first going through Pandas, e.g.:

```python
# how to do it in pandas
df1 = pd.read_parquet("<filename1>")
df2 = pd.read_parquet("<filename2>")
df = pd.merge(df1, df2, on=['loan_id', 'act_dte'])
```

```python
# how to do it in DuckDB: query the Parquet files directly
df = con.execute("select df1.loan_id, df2.some_code from read_parquet('<filename1>') df1, read_parquet('<filename2>') df2 where df1.loan_id = df2.loan_id and df1.act_dte = df2.act_dte").to_df()

# or by creating views over the Parquet files
con.execute("create view df1 as select * from read_parquet('<filename1>')")
con.execute("create view df2 as select * from read_parquet('<filename2>')")
df = con.execute("select df1.loan_id, df2.some_code from df1, df2 where df1.loan_id = df2.loan_id and df1.act_dte = df2.act_dte").to_df()
```

See also our Parquet blog.
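To illustrate the parallelism point above, a sketch of putting an aggregate on top of the join so that everything before the final materialization can run on multiple threads (the GROUP BY column is illustrative):

```python
# the join and the aggregation run in parallel; only the final, much smaller
# materialization to a Pandas DataFrame is single-threaded
df = con.execute(
    "select df1.loan_id, count(*) as n "
    "from read_parquet('<filename1>') df1 "
    "join read_parquet('<filename2>') df2 "
    "on df1.loan_id = df2.loan_id and df1.act_dte = df2.act_dte "
    "group by df1.loan_id"
).df()
```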
That sounds like a bug. Could you open an issue with a reproducible example? Thanks!