add output field name rfc #422

houqp · 2021-05-25T06:31:35Z

Which issue does this PR close?

Keeping the process light since the project is still in a rapid development phase.

Closes #302.
Relates to #55.

Once this is reviewed and accepted, I will send the updated invariant doc in a follow-up PR.

Rationale for this change

Unblock PR #55.

What changes are included in this PR?

First RFC for field name semantic and RFC process.

alamb

Thanks @houqp looks great to me

For anyone else following along, here is a Google Doc link to drafts of this content.

https://docs.google.com/document/d/1uviWavwEGD3qxwMk2AGkOgp6ENrvKGiMWQhHNbqPwhg/edit

alamb · 2021-05-25T16:28:33Z

docs/rfcs/output-field-name-semantic.md

+* Literal string MUST not be wrapped with quotes or double quotes.
+  * `SELECT 'foo'` SHOULD result in field name: `foo`
+* Operator expressions MUST be wrapped with parentheses.
+  * `SELECT -2` SHOULD result in field name: `(- 2)`


Should there really be a space between - and 2?

Suggested change

* `SELECT -2` SHOULD result in field name: `(- 2)`

* `SELECT -2` SHOULD result in field name: `(-2)`

This came from the rule below: "Operator and operand MUST be separated by spaces." I don't have a strong opinion on this. I picked it because this is what Spark does today. We could also change the rule below to require no spaces between operator and operands. Which one do you prefer?
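For readers following the thread, the rules quoted in the diff above (string literals are not quoted, operator expressions are wrapped in parentheses, operator and operand are separated by spaces) can be illustrated with a minimal Python sketch. The `Literal`, `UnaryOp`, and `BinaryOp` classes and the `field_name` function are hypothetical names made up for this illustration and are not DataFusion's API; the binary-operator case is an assumption extrapolated from the same two rules.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical expression nodes, for illustration only (not DataFusion types).
@dataclass
class Literal:
    value: Union[str, int, float]


@dataclass
class UnaryOp:
    op: str
    operand: "Expr"


@dataclass
class BinaryOp:
    left: "Expr"
    op: str
    right: "Expr"


Expr = Union[Literal, UnaryOp, BinaryOp]


def field_name(expr: Expr) -> str:
    """Derive an output field name following the rules quoted in the diff."""
    if isinstance(expr, Literal):
        # String literals are emitted without surrounding quotes.
        return str(expr.value)
    if isinstance(expr, UnaryOp):
        # Wrapped in parentheses; operator and operand separated by a space.
        return f"({expr.op} {field_name(expr.operand)})"
    if isinstance(expr, BinaryOp):
        # Assumption: binary operators follow the same wrapping and spacing.
        return f"({field_name(expr.left)} {expr.op} {field_name(expr.right)})"
    raise TypeError(f"unsupported expression: {expr!r}")
```

With this sketch, `field_name(UnaryOp("-", Literal(2)))` yields `(- 2)`, the form under discussion. The alternative `(-2)` would amount to dropping the space in the `UnaryOp` branch.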

docs/rfcs/output-field-name-semantic.md

jorgecarleitao · 2021-05-25T16:42:39Z

So, a question that I would like to bring before committing to RFCs. I see two ways of approaching this:

* the delta approach: each RFC may change previous RFCs, becoming the "latest" or a mixture of
* the specification approach: each change is a PR, and the combined result is the new "specification"

Both have benefits and downsides.

My opinion is that we should not use the RFC approach, and instead work under the "specification" model.

The reason is that, in my opinion, RFCs are hard to follow because they were made in a specific moment in time and no longer updated. If updated, there is a new RFC that does that, which requires some form of consolidation (e.g. RFCs have the term "amends", "superseded by", "revoked", PEP also).

My opinion is that, with PRs, git history and git blame, there is no need to store the "deltas" (RFCs) in the repository itself, and we should instead offer the consolidated, up-to-date picture of the specification.

This was the rationale of the original issue, at least.

TL;DR: an RFC is a "delta", the specification is the "state". An RFC/PR brings the state from the previous state to a new state. IMO we should have the specification on the repo and use PRs to track deltas (as opposed to having the deltas themselves in the repo)

Dandandan · 2021-05-25T16:43:11Z

docs/rfcs/output-field-name-semantic.md

+* All field names MUST not contain relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`


One concern I had when implementing something very similar to this RFC here https://github.com/apache/arrow-datafusion/pull/280 is that introducing an alias makes the column available under that name in the whole query. IMO that can be confusing if you do use this implicit convention. Any idea what other engines / databases are doing here?

(So it looks like this is a common practice). Confusing - yes - but maybe not a huge concern, as users might not often rely on this as it is quite unconventional.

@Dandandan I documented a survey of behavior from mysql/sqlite/postgres/spark in this doc as well, for example: https://github.com/houqp/arrow-datafusion/blob/qp_rfc/docs/rfcs/output-field-name-semantic.md#function-with-operators. Basically, mysql and sqlite use the raw user query as the column name, postgres throws in the towel and just uses ?column? for everything, while spark SQL constructs the column name based on the expression.

I picked Spark's behavior because it's the one that is the closest to what we had at the time. But since you have implemented mysql and sqlite's behavior since then, I am happy to update the doc to account for that. In that case, we need two sets of rules: one for SQL queries, which just reuses what's provided in the query, and another for dataframe queries, which is what I outlined in this doc.

UPDATE: after taking a second look at #280, turns out the PR is closed due to the issue you mentioned above. I am now leaning back towards not preserving user provided names from query to keep things simple. It's one less thing to worry about and keeps the rules simple so we can apply the same set of rules for outputs produced from both SQL queries and dataframe queries. Let me know if you have a strong opinion on this though.
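As an illustration of the two rules quoted in the diff for this thread (no relation qualifier in field names, function names converted to lowercase), here is a minimal Python sketch. `Column`, `FunctionCall`, and `output_field_name` are hypothetical names invented for this example, not DataFusion's API, and the `", "` separator between multiple arguments is an assumption that the thread does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


# Hypothetical expression nodes, for illustration only (not DataFusion types).
@dataclass
class Column:
    name: str
    relation: Optional[str] = None  # e.g. "t1" in `t1.id`


@dataclass
class FunctionCall:
    name: str
    args: List["Expr"] = field(default_factory=list)


Expr = Union[Column, FunctionCall]


def output_field_name(expr: Expr) -> str:
    """Build a field name from the expression, ignoring the raw query text."""
    if isinstance(expr, Column):
        # The relation qualifier is dropped: `t1.id` and `id` both map to `id`.
        return expr.name
    if isinstance(expr, FunctionCall):
        # Function names are lowercased; the ", " separator is an assumption.
        rendered_args = ", ".join(output_field_name(a) for a in expr.args)
        return f"{expr.name.lower()}({rendered_args})"
    raise TypeError(f"unsupported expression: {expr!r}")
```

Because the name is derived from the expression tree rather than from the user's query text, the same function could serve both SQL and dataframe inputs, which is the simplification argued for above.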

Dandandan · 2021-05-25T16:54:05Z

docs/rfcs/output-field-name-semantic.md

+in many cases, which makes output less readable than other designs.
+
+MySQL and SQLite preserve user query input as the field name. This adds extra
+implementation and runtime overhead with little gain for end uers.


What is the extra implementation / runtime overhead? I would say just printing the original (representation) of expressions should be less work than getting rid of qualified names? Preserving casing is currently not something done by the parser - but qualified names are maintained.

When I wrote this, I was mostly thinking about attaching the extra alias expression node and having to traverse this node multiple times during plan optimization. But honestly, I don't think this overhead matters in practice. It's just one more thing we need to do compared to not preserving the raw query input.

Getting rid of the qualifier during physical planning is just a one-line change in the function that returns the physical field name for a logical column.

jorgecarleitao

The field semantic makes sense to me, and formalizes something that we already do today.

I would just try to write it in a way that does not feel like a proposal, but rather as something that "is", i.e. that users can expect when using DataFusion.

houqp · 2021-05-26T07:53:56Z

@jorgecarleitao I agree with you, I also don't want this to get too formal and create unnecessary friction whenever we need to update it. My original intent was a hybrid model of RFCs and specs, where we allow minor and backwards-compatible changes to an RFC directly (tracked through git) but require new RFCs for bigger changes. I will convert it to specs so people can walk in with the right expectation. We can come back to the RFC model in the future if we want to get serious about proposing changes.

jorgecarleitao · 2021-05-26T09:13:07Z

Thanks @houqp ! I am 100% with you here.

Wrt the content itself, I am a big +1 here. We could discuss the finer print, but my approval reads as: this is a fantastic first start; we may need some more iterations here; I would like to avoid committing to the exact wording at this point and instead allow further iterations on the existing text (i.e. without requiring another proposal / RFC, long text, and/or a way to consolidate texts).

alamb · 2021-05-26T20:09:31Z

Looks like we just need a RAT (Apache copyright statement) to get a clean CI run and merge it. Perhaps we can add a section to the developer's guide once implemented, along the lines of "here is how output names are created", with a link to this RFC for the rationale.

houqp · 2021-05-27T05:48:48Z

@alamb @jorgecarleitao @Dandandan I reorganized everything to better align with the specification model. Could you take another look to see if there is anything you would like me to change?

jorgecarleitao

Beautiful. Thank you so much for your patience and work here. 💯

alamb

Thanks @houqp -- looks good to me

alamb · 2021-05-27T10:19:04Z

DEVELOPERS.md

+
+Here is the list current active specifications:
+
+* [Output field name semantic](docs/specification/output-field-name-semantic.md)


codecov-commenter · 2021-05-27T16:16:39Z

Codecov Report

Merging #422 (6fcf0b5) into master (9c0ad7b) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #422   +/-   ##
=======================================
  Coverage   75.27%   75.27%           
=======================================
  Files         147      147           
  Lines       24834    24834           
=======================================
  Hits        18694    18694           
  Misses       6140     6140

alamb · 2021-05-28T11:01:06Z

windows test failure is unrelated

add output field name rfc

a74460a

alamb approved these changes May 25, 2021

alamb added datafusion documentation Improvements or additions to documentation labels May 25, 2021

Dandandan reviewed May 25, 2021

Dandandan reviewed May 25, 2021

jorgecarleitao approved these changes May 25, 2021

houqp requested a review from Dandandan May 26, 2021 07:54

houqp added 2 commits May 26, 2021 22:19

move to spec model

78ca11d

add link to developers docs & add ASF header

3a0193c

houqp force-pushed the qp_rfc branch from 992984c to 3a0193c Compare May 27, 2021 05:44

houqp requested review from alamb and jorgecarleitao May 27, 2021 05:44

Merge remote-tracking branch 'upstream/master' into qp_rfc

65270de

jorgecarleitao approved these changes May 27, 2021

alamb approved these changes May 27, 2021

Merge remote-tracking branch 'upstream/master' into qp_rfc

6fcf0b5

alamb merged commit c9ed34c into apache:master May 28, 2021

houqp deleted the qp_rfc branch May 28, 2021 17:24

houqp mentioned this pull request Dec 4, 2021

Add RFCS for datafusion #1397

Closed
