[BUG] Projections map cast operations to original column name #948

Open
charlesbluca opened this issue Dec 1, 2022 · 0 comments
Labels
bug Something isn't working datafusion Related to work in DataFusion

@charlesbluca
Collaborator

What happened:
When projecting a column that has been cast to a different dtype, unexpected behavior can occur because DataFusion seems to map cast operations to the name of the original column (e.g. the key for cast(df.a to date) would be df.a).

In particular, this causes significant issues when projecting both a cast column and the original: the two names collide in our named projections, so we end up using the same column for both projections.
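To make the collision concrete, here is a minimal pure-Python sketch. It is not dask-sql's or DataFusion's actual code: `build_named_projections`, `original_column_name`, and the tuple encoding of expressions are all hypothetical stand-ins, and the only assumption carried over from the report is that a cast gets keyed by the column it wraps.

```python
# Hypothetical stand-ins; not dask-sql's or DataFusion's real data structures.
def build_named_projections(exprs, name_of):
    """Map each projected expression's name to the expression itself."""
    named = {}
    for expr in exprs:
        # A later expression with the same name silently replaces the earlier one.
        named[name_of(expr)] = expr
    return named


def original_column_name(expr):
    """Mimic the reported behavior: a cast is keyed by the column it wraps."""
    column = expr[1]
    return f"df.{column}"


# select a, cast(a as date) from df
exprs = [("column", "a"), ("cast", "a", "date")]
named = build_named_projections(exprs, original_column_name)
print(named)
```

With this naming, both expressions land on the key `df.a`, so only one of them survives in the mapping. That would be consistent with both output columns ending up with the same dtype, though the real lookup path may differ.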

What you expected to happen:
I would've expected cast operations to be mapped to some alias that would distinguish them from the original column, such that collisions wouldn't occur here.
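As a rough sketch of that expectation (again hypothetical, not the actual dask-sql or DataFusion implementation), a naming function could fold the cast into the key. The `expression_name` helper and the tuple encoding below are assumptions for illustration; the `CAST(df.a AS Date32)` format is borrowed from the error message in the example further down.

```python
# Hypothetical helper; not part of dask-sql or DataFusion.
def expression_name(expr):
    """Return a key that distinguishes a cast from the column it wraps."""
    kind = expr[0]
    if kind == "column":
        return f"df.{expr[1]}"
    if kind == "cast":
        # Include the target type so the cast never shares a key with its input.
        return f"CAST(df.{expr[1]} AS {expr[2]})"
    raise ValueError(f"unsupported expression kind: {kind}")


# select a, cast(a as date) from df
exprs = [("column", "a"), ("cast", "a", "Date32")]
names = [expression_name(expr) for expr in exprs]
named = dict(zip(names, exprs))
print(names)
```

With distinct keys, both expressions survive in the mapping and each output column can be resolved to its own expression.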

Minimal Complete Verifiable Example:

We get a parsing error when trying to project the cast and original column without an alias:

import pandas as pd
from dask_sql import Context

df = pd.DataFrame({"a": ["1999-06-21"]})

c = Context()
c.create_table("df", df)

c.sql("""
    select
        a,
        cast(a as date)
    from df 
""")

# ParsingException: Plan("Projections require unique expression names but the expression \"df.a\" at position 0 and \"CAST(df.a AS Date32)\" at position 1 have the same name. Consider aliasing (\"AS\") one of them.")

When using an alias, we see that one column is used for both projections:

c.sql("""
    select
        a,
        cast(a as date) as b
    from df 
""")

# Dask DataFrame Structure:
#                             a               b
# npartitions=1                                
# 0              datetime64[ns]  datetime64[ns]
# 0                         ...             ...
# Dask Name: rename, 15 graph layers

Anything else we need to know?:
I'm fairly sure this is the underlying issue behind failures we were seeing in q21 and q40 before merging in #924, as the failures seemed to indicate that a cast column wasn't the expected dtype (cc @ayushdg).

Environment:

  • dask-sql version: latest
  • Python version: 3.9
  • Operating System: ubuntu
  • Install method (conda, pip, source): source