What happens?
The first query is executed twice: once by .execute() and once more when its result is referenced in a second query.
I would have thought that .query().execute() (unlike plain .query()) behaves the same as .query().arrow(), the only difference being that the former doesn't require the intermediate Arrow representation. If this expectation is unfounded, I apologize, but I would appreciate it if you could tell me whether there is a way to achieve the run-only-once behavior without third-party intermediate representations.
To Reproduce
Using duckdb version 0.10.2 on Ubuntu 20.04.6 LTS with an 8-core Intel i5 and 16 GB RAM, run
import duckdb
import pandas as pd
import numpy as np
import timeit


class Timer:
    def __init__(self, name):
        self.name = name

    def __enter__(self):
        self.t1 = timeit.default_timer()

    def __exit__(self, *args, **kwargs):
        self.stopped = True
        t2 = timeit.default_timer()
        print(f"{self.name} took {t2-self.t1:.3g}s")


def fake_data(m, n) -> pd.DataFrame:
    rng = np.random.default_rng(0)
    key = rng.integers(0, m, n)
    match = rng.integers(0, m, n)
    df = pd.DataFrame({'key': key, 'match': match})
    return df


m = 500
n = 200_000
for arrow in [False, True]:
    print(f"{arrow=}")
    df1 = fake_data(m, n)
    df2 = fake_data(m, n)
    with Timer("First query"):
        q = duckdb.query("""
            SELECT df1.key AS key1, df2.key AS key2, count(*) AS c
            FROM df1 JOIN df2 USING (match)
            GROUP BY ALL
        """)
        res = q.arrow() if arrow else q.execute()
    with Timer("Second query"):
        n_rows = duckdb.query("SELECT sum(c) FROM res").execute()
arrow=False
First query took 2.81s
Second query took 3.01s
arrow=True
First query took 2.7s
Second query took 0.00261s
OS:
Linux
DuckDB Version:
0.10.2
DuckDB Client:
Python
Full Name:
Soeren Wolfers
Affiliation:
G-Research
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a nightly build
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
Yes, I have
Hmm, that is expected behavior, but I see how it could be surprising after having explicitly called execute on the relation.
To explain a bit of why this is happening:
When the replacement scan resolves a name to a Relation, the Relation's parsed query is inserted as a subquery, so the outer query runs it again.
I think we can detect that the Relation has been executed and insert the materialized result instead.
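The mechanism described in this comment can be illustrated with a toy model in plain Python. This is only a sketch: LazyRelation, scan_as_subquery, and scan_materialized are hypothetical names invented for the illustration and do not correspond to DuckDB's internals or API. The counter shows that inlining the relation's query runs it a second time, whereas reusing already computed rows does not.

```python
class LazyRelation:
    """Toy stand-in for a lazily evaluated relation (not DuckDB's real class)."""

    def __init__(self, compute):
        self._compute = compute  # plays the role of the parsed query
        self.runs = 0            # how many times the query body has run

    def execute(self):
        self.runs += 1
        return self._compute()


def scan_as_subquery(rel):
    # Models the current behavior: the relation's query is inlined,
    # so the outer query evaluates it again.
    return sum(rel.execute())


def scan_materialized(rows):
    # Models the proposed behavior: reuse rows that were already computed.
    return sum(rows)


rel = LazyRelation(lambda: [1, 2, 3])
rows = rel.execute()                    # explicit execute: first run
total_inlined = scan_as_subquery(rel)   # second run of the same query
runs_after_inline = rel.runs
total_cached = scan_materialized(rows)  # no further run
print(runs_after_inline, rel.runs, total_inlined, total_cached)
```

Both scans return the same total; only the number of times the underlying query runs differs.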
That'd be great. I always feel like converting back and forth to other formats to avoid double execution is not only less efficient but also will eventually cause me subtle problems.