Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SQLite] Discussion: how to handle duplicate column names in result set? #696

Closed
wants to merge 2 commits into from

Conversation

geelen
Copy link
Collaborator

@geelen geelen commented May 24, 2023

If you have the following SQL:

CREATE TABLE abc (a INT, b INT, c INT);
CREATE TABLE cde (c INT, d INT, e INT);
INSERT INTO abc VALUES (1,2,3),(4,5,6);
INSERT INTO cde VALUES (7,8,9),(1,2,3);

SELECT * FROM abc, cde;

You get (in real sqlite in 'table' output mode):

+---+---+---+---+---+---+
| a | b | c | c | d | e |
+---+---+---+---+---+---+
| 1 | 2 | 3 | 7 | 8 | 9 |
| 1 | 2 | 3 | 1 | 2 | 3 |
| 4 | 5 | 6 | 7 | 8 | 9 |
| 4 | 5 | 6 | 1 | 2 | 3 |
+---+---+---+---+---+---+

(Note the repeated column name c)

Doing that in workerd's SQLite:

Array.from(sql.prepare(`SELECT * FROM abc, cde`)())
// =>
[
  {"a":1,"b":2,"c":7,"d":8,"e":9},
  {"a":1,"b":2,"c":1,"d":2,"e":3},
  {"a":4,"b":5,"c":7,"d":8,"e":9},
  {"a":4,"b":5,"c":1,"d":2,"e":3}
]

Note that the second value of c, 7, has overwritten the first value 3.

Using .raw() gets the right data:

Array.from(sql.prepare(`SELECT * FROM abc, cde`)().raw())
// =>
[
  [1,2,3,7,8,9],
  [1,2,3,1,2,3],
  [4,5,6,7,8,9],
  [4,5,6,1,2,3]
]

But currently, the D1 shim assumes .raw() can always be retrieved from the Object.values of the "object" response type. We'll be changing the shim but, for backwards compatibility, should we endeavour to make the default response format never drop data, by preferring to instead mangle the column names?

[
  {"a":1,"b":2,"c":3,"c_1":7,"d":8,"e":9},
]

Note the c_1 key meaning "first duplicate of c" column. We could also use . as a separator, or even ~ if we wanted to channel DOS filenames... :)

Alternatively, we could throw an exception if this case occurs, though that might be painful for people who do something like: SELECT * FROM users, projects WHERE projects.user_id = users.user_id. There's a duplicate column name but the data is always the same, so is that really bad? The fact that the raw() and normal responses have different lengths of responses is still not great though...

As an aside, it's a shame the full_column_names pragma is deprecated (and has no effect). It would have solved this nicely for anyone who relied on select * a bunch in their app...

@rozenmd
Copy link
Contributor

rozenmd commented May 24, 2023

I also tried this test case, but didn't get far: #691

@geelen
Copy link
Collaborator Author

geelen commented May 24, 2023

Ah, apologies @rozenmd I missed that you'd already pushed something to demonstrate the issue.

Another option that just occurred to me is making .raw() the default for all D1 queries, which we can then stitch together into JS objects either in the DO or even in newer versions of the shim, and implement renaming there. But we'd require a way to also get the column names out during the RawIterator's loop. I had a quick go at adding a getCachedColumnNames() to SqlStorage::Cursor but couldn't make it work. I'm sure I'm missing something simple though...

But then it's probably no good if D1 introduces its own column-collision-renaming behaviour. It'd be better to do it in workerd so it's standard across all users of the API.

@kentonv
Copy link
Member

kentonv commented May 24, 2023

IMO, it's fine to say to the app: "If you're going to request rows as objects, it's your responsibility to make sure the column names don't collide by using AS where needeed."

But this doesn't work for D1 specifically because D1 doesn't know at the time of the query whether the application is requesting objects vs. raw arrays, right?

So I think a columnNames property on Cursor would make sense, to let D1 solve this. This would also save bytes on the wire between the eyeball and the database DO, since D1's protocol could be designed to send only one copy of the column names instead of repeating them for every row.

@kentonv
Copy link
Member

kentonv commented May 24, 2023

I actually think maybe the DO API should throw an exception if rows are requested in object form and it turns out there are duplicate column names. If the app doesn't realize it has duplicate column names, renaming one of them isn't going to make the code work correctly. If the app does know, then it can use AS. So adding a _1 suffix or whatever doesn't seem helpful.

@elithrar
Copy link
Contributor

elithrar commented May 24, 2023 via email

@geelen
Copy link
Collaborator Author

geelen commented Jan 30, 2024

FYI, this will be fixed D1 once #1586 is merged and rolls out

@geelen
Copy link
Collaborator Author

geelen commented Feb 13, 2024

This is now live. .raw() returns the full results, regardless of column name collisions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants