Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Executed query is being run again #11753

Open
2 tasks done
soerenwolfers opened this issue Apr 21, 2024 · 2 comments
Open
2 tasks done

Executed query is being run again #11753

soerenwolfers opened this issue Apr 21, 2024 · 2 comments

Comments

@soerenwolfers
Copy link

soerenwolfers commented Apr 21, 2024

What happens?

In pseudocode,

a = duckdb.query("SELECT MYFIRSTQUERY").execute()
b = duckdb.query("SELECT sum(a)").execute()

executes MYFIRSTQUERY twice.

I would have thought that .query().execute() (unlike just .query()) behaves the same as .query().arrow() with the only difference that the former doesn't require the intermediate Arrow representation. If this expectation is unfounded, I apologize but would appreciate if you could tell me if there is a way I can achieve the run-only-once behavior without third-party intermediate representations.

To Reproduce

Using duckdb version 0.10.2 on a Ubuntu 20.04.6LTS with an 8-core Intel i5 and 16GB RAM, run

import duckdb
import pandas as pd
import numpy as np
import timeit

class Timer:
    def __init__(self, name):
        self.name = name
        
    def __enter__(self):
        self.t1 = timeit.default_timer()

    def __exit__(self, *args, **kwargs):
        self.stopped = True
        t2 = timeit.default_timer()
        print(f"{self.name} took {t2 - self.t1:.3g}s")

def fake_data(m, n) -> pd.DataFrame:
    rng = np.random.default_rng(0)
    key = rng.integers(0, m, n)
    match = rng.integers(0, m, n)
    df = pd.DataFrame({'key': key, 'match': match})
    return df 

m = 500
n = 200_000
for arrow in [False, True]:
    print(f"{arrow=}")
    df1 = fake_data(m, n)
    df2 = fake_data(m, n)
    with Timer("First query"):
        q = duckdb.query("""
            SELECT 
                df1.key AS key1,
                df2.key AS key2,
                count(*) AS c
            FROM df1
            JOIN df2 
            USING (match)
            GROUP BY ALL
            """)
        res = q.arrow() if arrow else q.execute()
    with Timer("Second query"):
        n_rows = duckdb.query("SELECT sum(c) FROM res").execute()
arrow=False
First query took 2.81s
Second query took 3.01s
arrow=True
First query took 2.7s
Second query took 0.00261s

OS:

Linux

DuckDB Version:

0.10.2

DuckDB Client:

Python

Full Name:

Soeren Wolfers

Affiliation:

G-Research

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@Tishj
Copy link
Contributor

Tishj commented Apr 21, 2024

Hmm that is expected behavior, but I see how that could be unexpected after having explicitly called execute on the relation.

To explain a bit of why this is happening:
For the replacement scan that happens for a the parsed query of the a Relation is inserted as a subquery.
I think we can detect that the Relation has been executed and insert the materialized result instead

@soerenwolfers
Copy link
Author

soerenwolfers commented Apr 21, 2024

That'd be great. I always feel like converting back and forth to other formats to avoid double execution is not only less efficient but also will eventually cause me subtle problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants