
Document non-deterministic ordering of joining, group by, etc. #7373

@caseykneale

Describe the bug

I am using the SQL interface to query Parquet data, registering each file in a DataFusion context. The query contains some joins and GROUP BYs (which is where I suspect the trouble is). In roughly 1 out of 5 runs of the compiled binary I get the correct answer (0 records), but in the other 4 out of 5 runs many erroneous records appear.

To be clear, the correct answer is 0 records, and no records should ever be returned (unless DataFusion's GROUP BY operations are nondeterministic or non-sequential?). Yet I only get that result in a minority of runs.

I feel like I might be missing something here (do I need to sort first?), but this looks like a bug to me.
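To make the "do I need to sort first?" question concrete, here is a minimal sketch. It uses Python's built-in `sqlite3` purely as a stand-in engine, not DataFusion, and the `events` table and its columns are hypothetical. The general SQL rule it illustrates is that a query without `ORDER BY` carries no guarantee about row order, so results should either be sorted explicitly or compared in an order-insensitive way. Note that ordering alone changes the order of rows, not their count, unless the query logic itself depends on row order (for example via a window function or `LIMIT`).

```python
import sqlite3


def rows_unordered(conn, sql):
    """Run a query and return its rows sorted, for order-insensitive comparison."""
    return sorted(conn.execute(sql).fetchall())


# Hypothetical in-memory table standing in for the registered Parquet files.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (user_id INTEGER, amount INTEGER);
    INSERT INTO events VALUES (2, 10), (1, 5), (2, 7), (1, 3);
    """
)

# Without ORDER BY, the SQL standard leaves the order of result rows unspecified.
grouped = "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id"

# With ORDER BY on a unique key, the row order is fully determined.
ordered = grouped + " ORDER BY user_id"

unordered_rows = rows_unordered(conn, grouped)  # sorted on the client side
ordered_rows = conn.execute(ordered).fetchall()  # sorted by the engine
```

Both approaches yield the same set of rows; the difference is only whether the order is pinned down by the query or normalized after the fact.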

To Reproduce

I can't share the data, but the query looks like the last two queries in this SQL fiddle (they're the same). I borrowed the fiddle from someone on Stack Overflow and added the query I care about.

https://dbfiddle.uk/hA8-ejaw

In this example some rows are returned, but in my actual use case there should be none.

Expected behavior

Either the correct records are returned, or there is documentation (ideally with an example) explaining why this doesn't happen.

Additional context

No response


Labels: documentation (Improvements or additions to documentation)
