
Inconsistent Timing CLI vs. Python #12490

Closed
2 tasks done
rootjalex opened this issue Jun 11, 2024 · 2 comments

Comments

@rootjalex

What happens?

I've noticed on a number of queries that the CLI runs slowly on the first iteration and much faster on subsequent iterations. Using the Python API, however, the runtime is always at least as slow as the first CLI iteration. I suspect the CLI is doing some form of caching, but the Python API is not.

To Reproduce

INSTALL tpch;
LOAD tpch;
CALL dbgen(sf = 15);
.timer on
-- set threads here

SELECT L_SHIPINSTRUCT FROM lineitem WHERE extract('month' FROM L_SHIPDATE) = 6;
-- repeat multiple times ...

On my machine, three single-threaded runs (SET threads TO 1;) via Python take 3.493s, 3.344s, and 3.287s; via the CLI, 3.613s, 1.366s, and 1.340s.
Multi-threaded (SET threads TO 8;), Python takes 2.149s, 2.146s, and 2.144s; via the CLI, 0.427s, 0.209s, and 0.231s.

I installed duckdb via pip. I am using Python 3.12.3.

For the Python measurements, I use the following loop:

duckdb.sql("CREATE TABLE lineitem AS FROM\'tpc_h_lineitem_sf15.csv\'")
duckdb.sql("SET threads TO 1")
results = list()
for _ in range(ITER_COUNT):
    start = time.time()
    res = duckdb.sql("SELECT L_SHIPINSTRUCT FROM lineitem WHERE extract('month' FROM L_SHIPDATE) = 6").fetchall()
    end = time.time()
    print(f"   took: {end - start}s")
    print(f"   COUNT(*) = {len(res)}")
    results.append(end - start)
print(f"duckdb filter (threads=1) min = {min(results)}s  avg = {sum(results) / len(results)}s  max = {max(results)}s")

(and the same loop with threads=8)
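
A small aside on the measurement itself (not part of the original report): for timing elapsed intervals, Python's time.perf_counter() is generally preferable to time.time(), since it is monotonic and has higher resolution. The timed portion of the loop would then look like this:

import time
import duckdb

# perf_counter() is a monotonic, high-resolution clock intended for
# measuring elapsed intervals; time.time() can jump if the wall clock
# is adjusted.
query = "SELECT L_SHIPINSTRUCT FROM lineitem WHERE extract('month' FROM L_SHIPDATE) = 6"
start = time.perf_counter()
res = duckdb.sql(query).fetchall()
elapsed = time.perf_counter() - start
print(f"   took: {elapsed}s")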

OS:

aarch64 (Mac M1)

DuckDB Version:

v0.10.1 (for CLI)

DuckDB Client:

CLI and Python

Full Name:

Alexander Root

Affiliation:

Stanford University

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a source build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@Mytherin
Collaborator

This seems unrelated to caching to me. This query generates a large result set, and .fetchall() fetches the entire result as Python objects, which is slow. Try fetching the result using e.g. .arrow() instead.
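
For reference, a minimal sketch of the reporter's timing loop adapted to .arrow(), which returns the result as a pyarrow Table rather than per-row Python objects (the query and iteration count are taken from the reproduction above):

import time
import duckdb

ITER_COUNT = 3

duckdb.sql("SET threads TO 1")
results = list()
for _ in range(ITER_COUNT):
    start = time.time()
    # .arrow() hands the result over in bulk as a pyarrow Table instead of
    # converting every row to Python objects, avoiding the fetchall() cost.
    tbl = duckdb.sql("SELECT L_SHIPINSTRUCT FROM lineitem WHERE extract('month' FROM L_SHIPDATE) = 6").arrow()
    end = time.time()
    print(f"   took: {end - start}s  rows = {tbl.num_rows}")
    results.append(end - start)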

@rootjalex
Author

Ah fantastic, that produces behavior consistent with the CLI in the single-threaded case. Not quite for multi-threaded (it's 0.341s on average), but that discrepancy is likely just Python overhead. Thank you!
