Skip to content

Add new error message content when columns are in a different order#5927

Merged
quasiben merged 2 commits intodask:masterfrom
jsignell:5886
Feb 19, 2020
Merged

Add new error message content when columns are in a different order#5927
quasiben merged 2 commits intodask:masterfrom
jsignell:5886

Conversation

@jsignell
Copy link
Copy Markdown
Member

Closes: #5886

  • Tests added / passed
  • Passes black dask / flake8 dask

@mrocklin
Copy link
Copy Markdown
Member

mrocklin commented Feb 19, 2020 via email

@quasiben
Copy link
Copy Markdown
Member

Thanks @jsignell

@KrishanBhasin
Copy link
Copy Markdown
Contributor

Can I ask about the intent for this? I've been running into an issue on dask==2.11.0 using the read_sql() connector; I try to use a HELP COLUMN ... query to obtain column metadata (after some translation between my table types and pandas types) to pass in directly to the function.

It seems that the ordering of the data is not guaranteed so dask throws an error.

I was hoping to propose a PR that meant that check_matching_columns() allowed mis-ordering through without bother, but after seeing this PR think I might be missing some context/intent behind why the error is needed.

@TomAugspurger
Copy link
Copy Markdown
Member

Just to be clear: Dask is checking the ordering of the columns here, not the rows. We call this in many places, and it's an existing check (this PR improved the error message). Do you have an example of this failing?

@KrishanBhasin
Copy link
Copy Markdown
Contributor

The ordering of the SELECT columns statement generated by SQLAlchemy doesn't appear to be consistent; on occasions it lines up with the ordering meta object we create from the result of our HELP COLUMN ... query and the code passes, but on other occasions we get the error:

E   ValueError: The columns in the computed data do not match the columns in the provided metadata
E     Extra:   []
E     Missing: []

which will become
The order of the columns do not match
in the next Dask release

@TomAugspurger
Copy link
Copy Markdown
Member

Do you have an example? I'm not familiar with HELP COLUMN.

@KrishanBhasin
Copy link
Copy Markdown
Contributor

I think that's unnecessary detail - its just how we shortcut obtaining column metadata on behalf of Dask to save it doing the TOP query it does by default.

Essentially, we create a Dask meta object and pass it in to read_sql(), along with a SQLAlchemy query object.

The query object appears to generate SELECT statements to pass on to pd.read_sql() that have different ordering of the columns on each run through.

This is an issue for SQLAlchemy to fix the ordering of the columns in the generated SQL, and/or an issue for Dask to handle the mis-ordering of columns here (as, as far as I can tell, the column ordering does not matter)

@TomAugspurger
Copy link
Copy Markdown
Member

Sorry, I'm having trouble understanding the issue :/ DataFrame's have a well-defined ordering of the columns. Perhaps you can can ensure that your meta object has the correct order prior to the read_sql..

@KrishanBhasin
Copy link
Copy Markdown
Contributor

KrishanBhasin commented Mar 4, 2020

A SQLAlchemy query object is the main thing passed to dd.read_sql_table() to define what data to retrieve.
This query object generates a literal string of SQL which is actually submitted to the database.
The query object appears to generate the this statement with inconsistent column orderings each time (I think there might be an issue/PR to SQLAlchemy to fix this behaviour). This means that the returned dataframe, also has this inconsistent ordering of columns.

Because of this, I am unable to predict how to order the columns in my meta object.

I was hoping that Dask didn't worry too much about column ordering (as it references columns within the underlying Pandas dataframes by label), so we could remove this part of the check.
I'm still pretty new to Dask so I imagine I've misunderstood large parts about how Dask works though.

Thanks for stepping through this with me, sorry for not being clear enough at the start!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unclear error message when meta column order does not match

5 participants