Aggregation's "where" clause not working when querying Parquet in vector runtime #5559

@philrz

tl;dr

With this test data in Parquet form:

{log_time:2012-10-01T00:00:02Z,client_ip:99.85.61.193,request:"/courses/cs132/2012/",status_code:304(uint16),object_size:213(uint64)}(=bench2)
{log_time:2012-01-01T00:00:00Z,client_ip:25.152.171.147,request:"/books/Six_Easy_Pieces.html",status_code:404(uint16),object_size:271(uint64)}(=bench2)

The where clause in the following aggregation is not applied in the vector runtime: the entry with client_ip:25.152.171.147 shows a count of 1 when it should be 0, because its log_time of 2012-01-01T00:00:00Z is earlier than the cutoff.

$ SUPER_VAM=1 super -c 'from data.parquet | count() where log_time >= 2012-10-01T00:00:00Z by client_ip'
{client_ip:"99.85.61.193",count:1(uint64)}
{client_ip:"25.152.171.147",count:1(uint64)}

Details

Repro is with super commit fc8ab65. This is a simplification of the mgbench bench2/q4 query.

Starting with the data.zson.gz test data shown above, in the sequential runtime the record with client_ip:25.152.171.147 shows a count of 0, as we'd expect given the filter where log_time >= 2012-10-01T00:00:00Z.

$ super -version
Version: v1.18.0-213-gfc8ab655

$ super -c 'from data.zson.gz | count() where log_time >= 2012-10-01T00:00:00Z by client_ip'
{client_ip:99.85.61.193,count:1(uint64)}
{client_ip:25.152.171.147,count:0(uint64)}
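
As an independent sanity check of the expected counts, here is a minimal sketch that models the semantics of count() where ... by client_ip with awk instead of super. The function name expected_counts and the whitespace-separated "log_time client_ip" input layout are assumptions made for illustration only and are not part of super. It relies on the fact that UTC timestamps written in this fixed-width ISO 8601 form compare correctly as plain strings.

```shell
# Hypothetical reference check, independent of super: for each client_ip,
# count the rows whose log_time is at or after the cutoff, and keep groups
# that have zero matching rows (as the sequential runtime does).
expected_counts() {
  awk -v cutoff="2012-10-01T00:00:00Z" '
    {
      # $1 = log_time, $2 = client_ip (assumed whitespace-separated input)
      if (!($2 in count)) { count[$2] = 0; order[++n] = $2 }
      # Fixed-width ISO 8601 UTC timestamps sort lexicographically,
      # so a string comparison is enough here.
      if ($1 >= cutoff) count[$2]++
    }
    END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
  '
}

# The two records from the test data, reduced to the fields the query uses.
printf '%s\n' \
  '2012-10-01T00:00:02Z 99.85.61.193' \
  '2012-01-01T00:00:00Z 25.152.171.147' | expected_counts
```

For the two test records this prints a count of 1 for 99.85.61.193 and 0 for 25.152.171.147, matching the sequential runtime output above.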

However, the problem surfaces if we turn the data into Parquet and execute the query in the vector runtime.

$ super -f parquet -o data.parquet data.zson.gz 

$ SUPER_VAM=1 super -c 'from data.parquet | count() where log_time >= 2012-10-01T00:00:00Z by client_ip'
{client_ip:"99.85.61.193",count:1(uint64)}
{client_ip:"25.152.171.147",count:1(uint64)}

But the problem doesn't happen if I query the same Parquet file using the sequential runtime, or query the data as CSUP in vector runtime.

$ super -c 'from data.parquet | count() where log_time >= 2012-10-01T00:00:00Z by client_ip'
{client_ip:"99.85.61.193",count:1(uint64)}
{client_ip:"25.152.171.147",count:0(uint64)}

$ super -f csup -o data.csup data.zson.gz 

$ SUPER_VAM=1 super -c 'from data.csup | count() where log_time >= 2012-10-01T00:00:00Z by client_ip'
{client_ip:99.85.61.193,count:1(uint64)}
{client_ip:25.152.171.147,count:0(uint64)}

Labels: bug
