Skip to content

Python bindings create duplicated qualified fields after joining #64

@matthewmturner

Description

@matthewmturner

Describe the bug
im working on getting datafusion added to db-benchmark (#147). while putting the benchmarks together i came across an error while doing the join benchmark that i wasnt expecting. specifically the error is:

Traceback (most recent call last):
  File "datafusion/join-datafusion.py", line 72, in <module>
    df = ctx.create_dataframe([ans])
Exception: DataFusion error: Plan("Schema contains duplicate qualified field name 'ce9f0daee780e4f2796b9953bd267457c.id1'")

The test code that produced that is here:

question = "small inner on int" # q1
gc.collect()
t_start = timeit.default_timer()
ans = ctx.sql("SELECT * FROM x INNER JOIN small ON x.id1 = small.id1").collect()
shape = ans_shape(ans)
print(shape)
t = timeit.default_timer() - t_start
t_start = timeit.default_timer()
df = ctx.create_dataframe([ans])
chk = df.aggregate([], [f.sum(col("v1"))]).collect()[0].column(0)[0]
chkt = timeit.default_timer() - t_start
m = memory_usage()
write_log(task=task, data=data_name, in_rows=x_data.num_rows, question=question, out_rows=shape[0], out_cols=shape[1], solution=solution, version=ver, git=git, fun=fun, run=1, time_sec=t, mem_gb=m, cache=cache, chk=make_chk([chk]), chk_time_sec=chkt, on_disk=on_disk)
del ans
gc.collect()

if i update the sql to:

SELECT x.id1, x.id2, x.id3, x.id4, small.id4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1

I get:

Traceback (most recent call last):
  File "datafusion/join-datafusion.py", line 73, in <module>
    df = ctx.create_dataframe([ans])
Exception: DataFusion error: Plan("Schema contains duplicate qualified field name 'cb53bcf8886f449c3bd2651571df185d4.id4'")

to me this looks like a bug as i think i should be able to write the query without having to alias the overlapping columns (when i alias the overlapping columns it works). for example, below is the equivalent spark query.

select * from x join small using (id1)

To Reproduce
Steps to reproduce the behavior:

Expected behavior
A clear and concise description of what you expected to happen.
I should be able to run either of the following

ans = ctx.sql("SELECT x.id1, x.id2, x.id3, x.id4, small.id4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1").collect()
df = ctx.create_dataframe([ans])
ans = ctx.sql("SELECT * FROM x INNER JOIN small ON x.id1 = small.id1").collect()
df = ctx.create_dataframe([ans])

Additional context
Add any other context about the problem here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions