Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect results in 0.7.1 and 0.7.2-dev2675 vs 0.4.0 (Issue with IEJoin/CTE's/?) #7278

Closed
2 tasks done
Alex-Monahan-Intel opened this issue Apr 27, 2023 · 4 comments
Closed
2 tasks done

Comments

@Alex-Monahan-Intel
Copy link

Alex-Monahan-Intel commented Apr 27, 2023

What happens?

When I run the below query, I receive a null value for the average computation and 0 rows for the count(*) aggregation using 0.7.1 or 0.7.2-dev2675. In 0.4.0, I receive non-zero results for both columns.

The query is an IE join that connects a shiftly calendar with data that is recorded at specific timestamps. I am using the IEJoin to filter the data and assign a shift number to the data. This query is with random data, but I encountered this in a real-world query of similar structure.

Please forgive me on the ugly IN statements... I was just hacking something together and figured that the calendar table was small enough to handle my ugly SQL... ;-)

To Reproduce

I am running this query in the Python client, but it is all encapsulated in a single SQL statement:

SELECT SETSEED(0.8675309);
create temp table calendar as 
	SELECT 
		start_ts,
		start_ts + interval '12 hours' - interval '1 second' as end_ts,
		date_part('isodow',start_ts)::bigint as dow,
		date_part('yearweek',start_ts)::bigint as yyyyww,
		((date_part('isodow',start_ts)/2)*2)::varchar as shift_estimate
	FROM range(TIMESTAMP '2023-01-01 06:00:00', TIMESTAMP '2023-06-01 00:00:00', INTERVAL '12 hours') tbl(start_ts)
;

create temp table snapshot_data as 
	select 
		TIMESTAMP '2023-03-01 06:00:00' + to_seconds(((90*24*60*60)*1*random())::int) as snapshot_ts,
		'woot' || mod(i,3) as varchar_column,
		(random() * i * 10000)::bigint as snapshot_value,
		(random() * i * 10000)::bigint as snapshot_value2,
		(random() * i * 10000)::bigint as snapshot_value3,
		
	from generate_series(1000000) t(i)
;

with last_13_wws as (
	SELECT distinct yyyyww
	FROM calendar
	WHERE
		start_ts >= (current_timestamp - (interval 1 day * (7 * 13)))
		and end_ts <= current_timestamp
		--Not the current ww
		and yyyyww != (select yyyyww from calendar where start_ts <= current_timestamp and end_ts >= current_timestamp limit 1)
), cal_last_13 as (
	select * from calendar where yyyyww in (select yyyyww from last_13_wws)  
)
  
select 
	avg(snapshot_value),
	count(*),
from snapshot_data data
join cal_last_13 cal
	on data.snapshot_ts >= cal.start_ts
	and data.snapshot_ts <= cal.end_ts
-- where
-- 	varchar_column = 'woot1'

Results in 0.7.1 and 0.7.2dev:

avg(snapshot_value) count_star()
NULL 0

Results in 0.4.0:

avg(snapshot_value) count_star()
2501231162.132651 599654.0

If I remove the CTE's, the query does return correct results:

SELECT SETSEED(0.8675309);
create temp table calendar as 
	SELECT 
		start_ts,
		start_ts + interval '12 hours' - interval '1 second' as end_ts,
		date_part('isodow',start_ts)::bigint as dow,
		date_part('yearweek',start_ts)::bigint as yyyyww,
		((date_part('isodow',start_ts)/2)*2)::varchar as shift_estimate
	FROM range(TIMESTAMP '2023-01-01 06:00:00', TIMESTAMP '2023-06-01 00:00:00', INTERVAL '12 hours') tbl(start_ts)
;

create temp table snapshot_data as 
	select 
		TIMESTAMP '2023-03-01 06:00:00' + to_seconds(((90*24*60*60)*1*random())::int) as snapshot_ts,
		'woot' || mod(i,3) as varchar_column,
		(random() * i * 10000)::bigint as snapshot_value,
		(random() * i * 10000)::bigint as snapshot_value2,
		(random() * i * 10000)::bigint as snapshot_value3,
		
	from generate_series(1000000) t(i)
;
  
select 
	avg(snapshot_value),
	count(*),
from snapshot_data data
join calendar cal
	on data.snapshot_ts >= cal.start_ts
	and data.snapshot_ts <= cal.end_ts
avg(snapshot_value) count_star()
2502216150.542991 1000001.0

OS:

Windows

DuckDB Version:

0.7.1 and 0.7.2-dev2675

DuckDB Client:

Python

Full Name:

Alex Monahan

Affiliation:

Intel and DuckDB Labs

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@hawkfish
Copy link
Contributor

SELECT SETSEED(0.8675309);

Will make the random numbers deterministic.

@Alex-Monahan-Intel
Copy link
Author

SELECT SETSEED(0.8675309);

Will make the random numbers deterministic.

I'll make that change!

@Alex-Monahan-Intel
Copy link
Author

Updated to be deterministic using SETSEED. Thanks!

@paultiq
Copy link

paultiq commented Apr 29, 2023

In case it helps, a more minimal reproduction (I've been interested in understanding duckdb's IEJoin implementation so this seemed like a good chance to try to dig in a little... didn't get far in code yet, but at least narrowed down the reproduction):

After removing some extraneous factors, like random(), I observed that the number of elements in snapshot data + the nested subquery in the CTE are both factors

Edit: to add a better explanation of the cases: Case 1 is a minimized reproduction of the original issue. Case 2 only differs by the sequence length (1000 vs 100). Case 3 is a modification of Case 1 that passes that differs from Case 1 by the removal of the nested subquery in the CTE.

Case 1: fails (returns 0 count) w/ generateseries(1000)

create temp table calendar as 
	SELECT 
		start_ts,
		start_ts + interval '12 hours' as end_ts,
		date_part('yearweek',start_ts)::bigint as yyyyww
	FROM range(TIMESTAMP '2023-01-01 06:00:00', TIMESTAMP '2023-06-01 00:00:00', INTERVAL '12 hours') tbl(start_ts)
;

create temp table snapshot_data as 
	select 
		TIMESTAMP '2023-03-01 08:00:00' as snapshot_ts,
		1 as snapshot_value
	from generate_series(1000) t(i)
;
 
with cal_last_13 as (
	select * from calendar where yyyyww in (SELECT yyyyww
	FROM calendar)  
)
select 
	count(*),
from snapshot_data data
join cal_last_13 cal
	on data.snapshot_ts >= cal.start_ts
	and data.snapshot_ts <= cal.end_ts

Case 2: Passes (101 count) with generate_series(100)

create temp table calendar as 
	SELECT 
		start_ts,
		start_ts + interval '12 hours' as end_ts,
		date_part('yearweek',start_ts)::bigint as yyyyww
	FROM range(TIMESTAMP '2023-01-01 06:00:00', TIMESTAMP '2023-06-01 00:00:00', INTERVAL '12 hours') tbl(start_ts)
;
create temp table snapshot_data as 
	select 
		TIMESTAMP '2023-03-01 08:00:00' as snapshot_ts,
		1 as snapshot_value
	from generate_series(100) t(i)
;
with cal_last_13 as (
	select * from calendar where yyyyww in (SELECT yyyyww
	FROM calendar)  
)
select 
	count(*),
from snapshot_data data
join cal_last_13 cal
	on data.snapshot_ts >= cal.start_ts
	and data.snapshot_ts <= cal.end_ts

Case 3: Passes after removing the nested query from the CTE

with cal_last_13 as (
	select * from calendar
)
select 
	count(*),
from snapshot_data data
join cal_last_13 cal
	on data.snapshot_ts >= cal.start_ts
	and data.snapshot_ts <= cal.end_ts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants