Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ANY_VALUE(<UUID>) is 10x slower than ANY_VALUE(<UUID>::TEXT) #11837

Open
2 tasks done
exaroj opened this issue Apr 25, 2024 · 0 comments
Open
2 tasks done

ANY_VALUE(<UUID>) is 10x slower than ANY_VALUE(<UUID>::TEXT) #11837

exaroj opened this issue Apr 25, 2024 · 0 comments

Comments

@exaroj
Copy link

exaroj commented Apr 25, 2024

What happens?

Switching from VARCHAR to UUID causes a 10x slowdown when using ANY_VALUE()

To Reproduce

Performance change was observed with production data, synthetic data here reproduces.

import duckdb
import os
import tempfile
import time

def uuid_perf() -> None:
    QUERY = """
        WITH
        event_names AS (
            SELECT UNNEST(['event1', 'event2', 'event3', 'event5']) AS eventName
        ),
        event_timestamps AS (
            SELECT CAST('2024-01-01 00:00:00+00:00' AS TIMESTAMPTZ)
                + INTERVAL (i) HOUR AS eventTime
            FROM GENERATE_SERIES(0, 1000000) AS s(i)
        ),
        log_records AS (
            SELECT eventName, eventTime, gen_random_uuid() AS eventID
            FROM event_names
            CROSS JOIN event_timestamps
            ORDER BY eventId
        ),
        sample AS (
            SELECT
                eventName,
                DATE_TRUNC('hour', eventTime) AS eventHour,
                ANY_VALUE(eventID) AS sampleEvent
            FROM log_records
            GROUP BY 1, 2
            LIMIT 10
        )
        SELECT * FROM sample
        ;
    """

    def time_query(q:str) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            with duckdb.connect(os.path.join(tmpdir, "test.db")) as con:
                print(con.execute("SELECT * FROM pragma_version();").fetchall())
                
                start: int = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
                con.execute(q).fetchall()
                stop: int = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
                
                elapsed = stop - start
                print(f"elapsed {elapsed/1.0e9}s")
    
    print("Original Query")
    time_query(QUERY)
    
    print("Query with TEXT CAST")
    time_query(QUERY.replace("ANY_VALUE(eventID) AS sampleEvent",
                             "ANY_VALUE(eventID::TEXT) AS sampleEvent"))
    
uuid_perf()

# [('v0.10.1', '4a89d97db8')]

# Original Query
# elapsed 17.178034s
# Query with TEXT CAST
# elapsed 1.861703s


# [('v0.10.3-dev388', '86fee9ed38')]

# Original Query
# elapsed 15.525825s
# Query with TEXT CAST
# elapsed 1.678268s

OS:

macOS 14.4.1 Apple M3 Max

DuckDB Version:

0.10.1

DuckDB Client:

Python

Full Name:

Rob Jackson

Affiliation:

exaforce.com

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a source build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants