Skip to content

Potential invalid ClickBench results #18982

@nuno-faria

Description

@nuno-faria

Describe the bug

I think some ClickBench queries are not being correctly executed, as EventDate is being treated as a string.

For example, in query 36:

SELECT "URL", COUNT(*) AS PageViews
FROM hits
WHERE "CounterID" = 62
    AND "EventDate" >= '2013-07-01'
    AND "EventDate" <= '2013-07-31'
    AND "DontCountHits" = 0
    AND "IsRefresh" = 0
    AND "URL" <> ''
GROUP BY "URL"
ORDER BY PageViews DESC
LIMIT 10;

When executing it, EventDate (originally a UInt16) will be casted to Utf8:

DataSourceExec: ...
  predicate=CounterID@1 = 62
    AND CAST(EventDate@0 AS Utf8) >= 2013-07-01
    AND CAST(EventDate@0 AS Utf8) <= 2013-07-31
    AND DontCountHits@4 = 0
    AND IsRefresh@3 = 0
    AND URL@2 != , ...

This in turn makes the query return nothing. If we instead cast the EventDate first to date then it will work as expected:

-- current
SELECT "URL", COUNT(*) AS PageViews
FROM hits
WHERE "CounterID" = 62
    AND "EventDate" >= '2013-07-01'
    AND "EventDate" <= '2013-07-31'
    AND "DontCountHits" = 0
    AND "IsRefresh" = 0
    AND "URL" <> ''
GROUP BY "URL"
ORDER BY PageViews DESC
LIMIT 10;

+-----+-----------+
| URL | pageviews |
+-----+-----------+
+-----+-----------+
0 row(s) fetched. 
Elapsed 0.120 seconds.

-- with cast
SELECT "URL", COUNT(*) AS PageViews
FROM hits
WHERE "CounterID" = 62
    AND "EventDate"::int::date >= '2013-07-01'
    AND "EventDate"::int::date <= '2013-07-31'
    AND "DontCountHits" = 0
    AND "IsRefresh" = 0
    AND "URL" <> ''
GROUP BY "URL"
ORDER BY PageViews DESC
LIMIT 10;

+--------------------------------------------------------------------------+-----------+
| URL                                                                      | pageviews |
+--------------------------------------------------------------------------+-----------+
| http://irr.ru/index.php?showalbum/login-leniya7777294,938303130          | 102341    |
| http://komme%2F27.0.1453.116                                             | 51218     |
| http://irr.ru/index.php?showalbum/login-kapusta-advert2668]=0&order_by=0 | 18315     |
| http://irr.ru/index.php?showalbum/login-kapustic/product_name            | 16461     |
| http://irr.ru/index.php                                                  | 12577     |
| http://irr.ru/index.php?showalbum/login                                  | 10880     |
| http://komme%2F27.0.1453.116 Safari%2F5.0 (compatible; MSIE 9.0;         | 7627      |
| http://irr.ru/index.php?showalbum/login-kupalnik                         | 4369      |
| http://irr.ru/index.php?showalbum/login-kapusta-advert27256.html_params  | 4058      |
| http://komme%2F27.0.1453.116 Safari                                      | 3021      |
+--------------------------------------------------------------------------+-----------+
10 row(s) fetched. 
Elapsed 0.189 seconds.

I tracked it down to this issue #15509 and this PR #15574, where it has been removed by mistake and the existing tests were not able to caught it. It includes queries 36 to 42. I think the confusion came from the fact that DuckDB also does not do the cast when executing the query, but it casts when creating the view:

CREATE VIEW hits AS
SELECT *
    REPLACE (make_date(EventDate) AS EventDate)
FROM read_parquet('hits.parquet', binary_as_string=True);

To Reproduce

ClickBench.

Expected behavior

Return data in queries 36-42.

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions