-
Upon reading https://duckdb.org/2021/05/14/sql-on-pandas.html, I'm very interested in running queries through DuckDB, specifically replacing pd.merge() with a SQL join (for now), hoping for both multi-threaded execution and lower memory usage. Currently, I'm seeing neither improvement: CPU usage during the query stays at 100% or lower, indicating single-threaded execution, and the query uses more memory and runs longer than pd.merge(). Also, it appears I need to reset the index on the DataFrame, or else I get an error that the columns don't exist; it would be nice if DuckDB could take advantage of an existing index. Please let me know what I am missing. Thanks for your work, and I'm eagerly looking forward to future updates!
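To make the setup concrete, here is a minimal sketch of the swap being described, using illustrative data (the connection, DataFrames, and values are not from the original post; the real workload joins on loan_id and act_dte):

```python
import duckdb
import pandas as pd

# illustrative DataFrames, not the poster's real data
df1 = pd.DataFrame({"loan_id": [1, 2, 3], "amount": [100.0, 250.0, 80.0]})
df2 = pd.DataFrame({"loan_id": [1, 2, 4], "some_code": ["a", "b", "c"]})

# the pandas way
merged = pd.merge(df1, df2, on="loan_id")

# the DuckDB way: replacement scans find df1/df2 by variable name
con = duckdb.connect()
merged_sql = con.execute(
    "select df1.loan_id, df1.amount, df2.some_code "
    "from df1 join df2 on df1.loan_id = df2.loan_id"
).df()

# if loan_id lives in the DataFrame's index rather than in a column, DuckDB
# will not see it; resetting the index avoids the "column doesn't exist" error
df1 = df1.reset_index()
```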
-
Upon reading further, I registered the DataFrames with con.register(). So far, I don't see any improvement, and I'd love to see some of the improvements referenced in the article. We make heavy use of the Python-native stack of pandas/dask/sklearn/numba, and I am beginning to feel that DuckDB will be a welcome addition to it. For some reason, we have had issues with dask.merge(), though we are a big beneficiary of its other features (there may be a bit of a learning curve left there too).
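For reference, a sketch of what that registration looks like, assuming the placeholder file names and join columns used elsewhere in the thread:

```python
import duckdb
import pandas as pd

df1 = pd.read_parquet("<filename1>")  # placeholder paths from the thread
df2 = pd.read_parquet("<filename2>")

con = duckdb.connect()
con.register("df1", df1)  # expose the DataFrames to DuckDB as views
con.register("df2", df2)

df = con.execute(
    "select df1.loan_id, df2.some_code "
    "from df1 join df2 "
    "on df1.loan_id = df2.loan_id and df1.act_dte = df2.act_dte"
).df()
```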
-
The current implementation is limited in what can and cannot be parallelized, as we are incrementally adding parallelism support to the different operators in the engine. While joins can be fully parallelized, materializing to a Pandas DataFrame cannot be parallelized yet, so the final pipeline will run on a single thread for now. This is something we will fix in the future. If you want to see parallelism today, you would need to add e.g. an aggregate or an ORDER BY clause at the end of the query (a sketch follows the examples below).

One important note is that DuckDB should be used in a slightly different manner than Pandas. In your example, I see that you are first reading the Parquet files into Pandas and then handing the DataFrames to DuckDB. Especially if you are reading from Parquet, I recommend using DuckDB's Parquet reader directly, rather than first going through Pandas, e.g.:

```python
# how to do it in pandas
df1 = pd.read_parquet("<filename1>")
df2 = pd.read_parquet("<filename2>")
df = pd.merge(df1, df2, on=['loan_id', 'act_dte'])
```

```python
# how to do it in DuckDB: query the Parquet files directly
df = con.execute("select df1.loan_id, df2.some_code from read_parquet('<filename1>') df1, read_parquet('<filename2>') df2 where df1.loan_id = df2.loan_id and df1.act_dte = df2.act_dte").to_df()

# or by creating views over the Parquet files
con.execute("create view df1 as select * from read_parquet('<filename1>')")
con.execute("create view df2 as select * from read_parquet('<filename2>')")
df = con.execute("select df1.loan_id, df2.some_code from df1, df2 where df1.loan_id = df2.loan_id and df1.act_dte = df2.act_dte").to_df()
```

See also our Parquet blog.
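To illustrate the parallelism point above, a sketch of putting an aggregate on top of the join so that everything before the final materialization can run on multiple threads (the GROUP BY column is illustrative):

```python
# the join and the aggregation run in parallel; only the final, much smaller
# materialization to a Pandas DataFrame is single-threaded
df = con.execute(
    "select df1.loan_id, count(*) as n "
    "from read_parquet('<filename1>') df1 "
    "join read_parquet('<filename2>') df2 "
    "on df1.loan_id = df2.loan_id and df1.act_dte = df2.act_dte "
    "group by df1.loan_id"
).df()
```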
That sounds like a bug. Could you open an issue with a reproducible example? Thanks!