
Inconsistent Timing CLI vs. Python #12490

Closed
2 tasks done
rootjalex opened this issue Jun 11, 2024 · 2 comments

Comments

@rootjalex

What happens?

I've noticed on a number of queries that the CLI runs slowly on the first iteration and much faster on subsequent iterations. Using the Python API, however, the runtime is always at least as slow as the first CLI iteration. I suspect the CLI is doing some form of caching, but the Python API is not.

To Reproduce

INSTALL tpch;
LOAD tpch;
CALL dbgen(sf = 15);
.timer on
-- set threads here

SELECT L_SHIPINSTRUCT FROM lineitem WHERE extract('month' FROM L_SHIPDATE) = 6;
-- repeat multiple times ...

On my machine, three single-threaded runs (SET threads TO 1;) via Python take 3.493s, 3.344s, and 3.287s; via the CLI, 3.613s, 1.366s, and 1.340s.
Multi-threaded (SET threads TO 8;), Python takes 2.149s, 2.146s, and 2.144s; via the CLI, 0.427s, 0.209s, and 0.231s.

I installed duckdb via pip. I am using Python 3.12.3.

For the Python measurements, I use the following loop:

duckdb.sql("CREATE TABLE lineitem AS FROM\'tpc_h_lineitem_sf15.csv\'")
duckdb.sql("SET threads TO 1")
results = list()
for _ in range(ITER_COUNT):
    start = time.time()
    res = duckdb.sql("SELECT L_SHIPINSTRUCT FROM lineitem WHERE extract('month' FROM L_SHIPDATE) = 6").fetchall()
    end = time.time()
    print(f"   took: {end - start}s")
    print(f"   COUNT(*) = {len(res)}")
    results.append(end - start)
print(f"duckdb filter (threads=1) min = {min(results)}s  avg = {sum(results) / len(results)}s  max = {max(results)}s")

(and the same loop with threads=8)
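
A small aside on the measurement itself (not part of the original report): for timing elapsed intervals, Python's time.perf_counter() is generally preferable to time.time(), since it is monotonic and has higher resolution. The timed portion of the loop would then look like this:

import time
import duckdb

# perf_counter() is a monotonic, high-resolution clock intended for
# measuring elapsed intervals; time.time() can jump if the wall clock
# is adjusted.
query = "SELECT L_SHIPINSTRUCT FROM lineitem WHERE extract('month' FROM L_SHIPDATE) = 6"
start = time.perf_counter()
res = duckdb.sql(query).fetchall()
elapsed = time.perf_counter() - start
print(f"   took: {elapsed}s")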

OS:

aarch64 (Mac M1)

DuckDB Version:

v0.10.1 (for CLI)

DuckDB Client:

CLI and Python

Full Name:

Alexander Root

Affiliation:

Stanford University

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a source build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@Mytherin
Collaborator

This seems unrelated to caching to me. This query generates a large result set, and .fetchall() fetches the entire result as Python objects, which is slow. Try fetching the result using e.g. .arrow() instead.
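
For reference, a minimal sketch of the reporter's timing loop adapted to .arrow(), which returns the result as a pyarrow Table rather than per-row Python objects (the query and iteration count are taken from the reproduction above):

import time
import duckdb

ITER_COUNT = 3

duckdb.sql("SET threads TO 1")
results = list()
for _ in range(ITER_COUNT):
    start = time.time()
    # .arrow() hands the result over in bulk as a pyarrow Table instead of
    # converting every row to Python objects, avoiding the fetchall() cost.
    tbl = duckdb.sql("SELECT L_SHIPINSTRUCT FROM lineitem WHERE extract('month' FROM L_SHIPDATE) = 6").arrow()
    end = time.time()
    print(f"   took: {end - start}s  rows = {tbl.num_rows}")
    results.append(end - start)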

@rootjalex
Author

Ah fantastic, that produces behavior consistent with the CLI in the single-threaded case. Not quite for multi-threaded (it's 0.341s on average), but that discrepancy is likely just Python overhead. Thank you!
