Executed query is being run again #11753

soerenwolfers · 2024-04-21T15:57:47Z

What happens?

In pseudocode,

a = duckdb.query("SELECT MYFIRSTQUERY").execute()
b = duckdb.query("SELECT sum(x) FROM a").execute()

executes MYFIRSTQUERY twice.

I would have expected .query().execute() (unlike plain .query()) to behave like .query().arrow(), the only difference being that the former does not need the intermediate Arrow representation. If this expectation is unfounded, I apologize, but I would appreciate it if you could tell me whether there is a way to get the run-only-once behavior without third-party intermediate representations.

To Reproduce

Using duckdb version 0.10.2 on Ubuntu 20.04.6 LTS with an 8-core Intel i5 and 16GB RAM, run

import duckdb
import pandas as pd
import numpy as np
import timeit

class Timer:
    def __init__(self, name):
        self.name = name
        
    def __enter__(self):
        self.t1 = timeit.default_timer()

    def __exit__(self, *args, **kwargs):
        self.stopped = True
        t2 = timeit.default_timer()
        print(f"{self.name} took {t2 - self.t1:.3g}s")

def fake_data(m, n) -> pd.DataFrame:
    rng = np.random.default_rng(0)
    key = rng.integers(0, m, n)
    match = rng.integers(0, m, n)
    df = pd.DataFrame({'key': key, 'match': match})
    return df 

m = 500
n = 200_000
for arrow in [False, True]:
    print(f"{arrow=}")
    df1 = fake_data(m, n)
    df2 = fake_data(m, n)
    with Timer("First query"):
        q = duckdb.query("""
            SELECT 
                df1.key AS key1,
                df2.key AS key2,
                count(*) AS c
            FROM df1
            JOIN df2 
            USING (match)
            GROUP BY ALL
            """)
        res = q.arrow() if arrow else q.execute()
    with Timer("Second query"):
        n_rows = duckdb.query("SELECT sum(c) FROM res").execute()

This prints:

arrow=False
First query took 2.81s
Second query took 3.01s
arrow=True
First query took 2.7s
Second query took 0.00261s

OS:

Linux

DuckDB Version:

0.10.2

DuckDB Client:

Python

Full Name:

Soeren Wolfers

Affiliation:

G-Research

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

Yes, I have


Tishj · 2024-04-21T16:12:42Z

Hmm that is expected behavior, but I see how that could be unexpected after having explicitly called execute on the relation.

To explain a bit of why this is happening:
The replacement scan that resolves the variable a inserts the Relation's parsed query as a subquery, so that query is executed again as part of the second one.
I think we can detect that the Relation has already been executed and insert the materialized result instead.
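
To make the difference concrete, here is a toy model in plain Python. It is only a sketch of the idea described above, not DuckDB's implementation: the class LazyRelation and its methods scan_as_subquery and scan_materialized are hypothetical names invented for this illustration.

```python
class LazyRelation:
    """Toy stand-in for a lazily evaluated relation (hypothetical, not a DuckDB class)."""

    def __init__(self, compute):
        self._compute = compute   # callable that produces the rows
        self._executed = False
        self._result = None
        self.runs = 0             # how often the underlying query was run

    def execute(self):
        # Run the query once and keep the rows around.
        self.runs += 1
        self._result = self._compute()
        self._executed = True
        return self

    def scan_as_subquery(self):
        # Behaviour described in this thread: the defining query is inlined
        # into the outer query, so it runs again even after execute().
        self.runs += 1
        return self._compute()

    def scan_materialized(self):
        # Proposed behaviour: reuse the stored rows if execute() already ran.
        if not self._executed:
            self.execute()
        return self._result


rel = LazyRelation(lambda: [1, 2, 3]).execute()

total_inlined = sum(rel.scan_as_subquery())
runs_after_inline = rel.runs      # the query has now run twice

total_cached = sum(rel.scan_materialized())
runs_after_cached = rel.runs      # unchanged, the stored rows were reused
```

In this model the first scan costs a second run of the query, while the second scan reads the stored rows, which is the behaviour the issue asks for.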

soerenwolfers · 2024-04-21T16:22:48Z

That'd be great. Converting back and forth to other formats to avoid double execution always feels not only less efficient but also likely to cause me subtle problems eventually.

soerenwolfers added the needs triage label Apr 21, 2024

szarnyasg added the reproduced label Apr 22, 2024

duckdblabs-bot removed the needs triage label Apr 22, 2024
