What happens?
I've noticed on a number of queries that the CLI runs slowly on the first iteration and much faster on subsequent iterations. Using the Python API, however, the runtime is always as slow as (or slower than) the first CLI iteration. I suspect the CLI is doing some form of caching that the Python API is not.
To Reproduce
INSTALL tpch;
LOAD tpch;
CALL dbgen(sf = 15);
.timer on
# set threads here
SELECT L_SHIPINSTRUCT FROM lineitem WHERE extract('month' FROM L_SHIPDATE) = 6;
# repeat multiple times ...
On my machine, three single-threaded runs (SET threads TO 1;) via Python take 3.493s, 3.344s, 3.287s; via the CLI, 3.613s, 1.366s, 1.340s.
Multi-threaded (e.g. SET threads TO 8;), Python takes 2.149s, 2.146s, 2.144s; via the CLI, 0.427s, 0.209s, 0.231s.
I installed duckdb via pip, and I am using Python 3.12.3.
For the Python measurements, I use the following loop:
import time
import duckdb

ITER_COUNT = 3

duckdb.sql("CREATE TABLE lineitem AS FROM 'tpc_h_lineitem_sf15.csv'")
duckdb.sql("SET threads TO 1")
results = []
for _ in range(ITER_COUNT):
    start = time.time()
    res = duckdb.sql("SELECT L_SHIPINSTRUCT FROM lineitem WHERE extract('month' FROM L_SHIPDATE) = 6").fetchall()
    end = time.time()
    print(f"  took: {end - start}s")
    print(f"  COUNT(*) = {len(res)}")
    results.append(end - start)
print(f"duckdb filter (threads=1) min = {min(results)}s avg = {sum(results) / len(results)}s max = {max(results)}s")
(and the same loop with threads=8)
OS:
aarch64 (Mac M1)
DuckDB Version:
v0.10.1 (for CLI)
DuckDB Client:
CLI and Python
Full Name:
Alexander Root
Affiliation:
Stanford University
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a source build
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
Yes, I have
This seems unrelated to caching to me. The query generates a large result set, and .fetchall() converts the entire result into Python objects, which is slow. Try fetching the result with e.g. .arrow() instead.
Ah fantastic, that produces behavior consistent with the CLI for the single-threaded case. Not quite for multi-threaded (it averages 0.341s), but that discrepancy is likely Python overhead. Thank you!