Description
Describe the bug
I'm working on getting DataFusion added to db-benchmark (#147). While putting the benchmarks together, I hit an error in the join benchmark that I wasn't expecting. Specifically, the error is:
Traceback (most recent call last):
File "datafusion/join-datafusion.py", line 72, in <module>
df = ctx.create_dataframe([ans])
Exception: DataFusion error: Plan("Schema contains duplicate qualified field name 'ce9f0daee780e4f2796b9953bd267457c.id1'")
The test code that produced the error is here:
question = "small inner on int" # q1
gc.collect()
t_start = timeit.default_timer()
ans = ctx.sql("SELECT * FROM x INNER JOIN small ON x.id1 = small.id1").collect()
shape = ans_shape(ans)
print(shape)
t = timeit.default_timer() - t_start
t_start = timeit.default_timer()
df = ctx.create_dataframe([ans])
chk = df.aggregate([], [f.sum(col("v1"))]).collect()[0].column(0)[0]
chkt = timeit.default_timer() - t_start
m = memory_usage()
write_log(task=task, data=data_name, in_rows=x_data.num_rows, question=question, out_rows=shape[0], out_cols=shape[1], solution=solution, version=ver, git=git, fun=fun, run=1, time_sec=t, mem_gb=m, cache=cache, chk=make_chk([chk]), chk_time_sec=chkt, on_disk=on_disk)
del ans
gc.collect()
If I update the SQL to:
SELECT x.id1, x.id2, x.id3, x.id4, small.id4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1
I get:
Traceback (most recent call last):
File "datafusion/join-datafusion.py", line 73, in <module>
df = ctx.create_dataframe([ans])
Exception: DataFusion error: Plan("Schema contains duplicate qualified field name 'cb53bcf8886f449c3bd2651571df185d4.id4'")
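A quick way to see which names collide in a result is to count the bare column names. The sketch below is only illustrative: `duplicate_names` is a hypothetical helper, and the name list assumes that the table qualifiers (`x.`, `small.`) are dropped from the output columns of the second query, which is what the error message suggests but which I have not confirmed in the DataFusion source.

```python
from collections import Counter


def duplicate_names(names):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(names)
    seen = []
    for name in names:
        if counts[name] > 1 and name not in seen:
            seen.append(name)
    return seen


# Bare output names of the second query, ASSUMING qualifiers are dropped.
second_query_names = ["id1", "id2", "id3", "id4", "id4", "id5", "id6", "v1", "v2"]
print(duplicate_names(second_query_names))
```

With real results, the input could presumably be taken from `ans[0].schema.names`, assuming `ans` is a non-empty list of pyarrow record batches.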
To me this looks like a bug: I think I should be able to write the query without having to alias the overlapping columns (when I alias the overlapping columns, it works). For example, below is the equivalent Spark query.
select * from x join small using (id1)
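For reference, the aliasing workaround mentioned above can be sketched as a small helper that builds the SELECT list and aliases any right-hand column whose name also exists on the left. The helper `aliased_select_list`, the `small_` alias prefix, and the column lists are assumptions made for illustration (the lists are inferred from the queries in this report, not read from the actual tables).

```python
def aliased_select_list(left, left_cols, right, right_cols, skip=()):
    """Build a SELECT list, aliasing right-hand columns that clash with the left."""
    left_names = set(left_cols)
    items = [f"{left}.{c}" for c in left_cols]
    for c in right_cols:
        if c in skip:
            # Omit columns we do not want twice, e.g. the join key.
            continue
        if c in left_names:
            # Overlapping name: give it a unique alias.
            items.append(f"{right}.{c} AS {right}_{c}")
        else:
            items.append(f"{right}.{c}")
    return ", ".join(items)


# Column lists ASSUMED from the queries in this report.
x_cols = ["id1", "id2", "id3", "id4", "id5", "id6", "v1"]
small_cols = ["id1", "id4", "v2"]

select_list = aliased_select_list("x", x_cols, "small", small_cols, skip=("id1",))
query = f"SELECT {select_list} FROM x INNER JOIN small ON x.id1 = small.id1"
print(query)
```

The resulting string can be passed to `ctx.sql(...)` as in the test code; every output column then has a unique name, which is the case that currently works.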
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I should be able to run either of the following without an error:
ans = ctx.sql("SELECT x.id1, x.id2, x.id3, x.id4, small.id4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1").collect()
df = ctx.create_dataframe([ans])
ans = ctx.sql("SELECT * FROM x INNER JOIN small ON x.id1 = small.id1").collect()
df = ctx.create_dataframe([ans])