Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENUMs treat empty strings "" as if they were <nil> on CSV import, resulting in error #5807

Closed
alecstein opened this issue Apr 26, 2023 · 3 comments · Fixed by #5856
Closed
Labels
bug Something isn't working

Comments

@alecstein
Copy link
Contributor

alecstein commented Apr 26, 2023

Empty strings "" are distinct from null, but when an ENUM has an empty string "" as a possible value, providing that empty string results in it being interpreted as null.

a.sql

create table alphabet (
        letter enum('', 'a', 'b')
  )

data.csv

letter
a
""

Do:

dolt sql < a.sql
dolt table import -u alphabet data.csv

Result:

> select * from alphabet
+--------+
| letter |
+--------+
| NULL   |
| a      |
+--------+

This is a problem when the "" empty string is part of a primary key with a blank value.

Right now the workaround is to use the index of the ENUM. Starting with this file,

data_workaround.csv

letter
a
1

do:

dolt sql -q "drop table alphabet"
dolt sql < a.sql
dolt table import -u alphabet data_workaround.csv

Then you'll get correctly:

> select * from alphabet;
+--------+
| letter |
+--------+
| a      |
|        |
+--------+
@alecstein alecstein added the bug Something isn't working label Apr 26, 2023
@jennifersp jennifersp linked a pull request May 2, 2023 that will close this issue
@jennifersp jennifersp self-assigned this May 2, 2023
@fulghum
Copy link
Contributor

fulghum commented May 2, 2023

We debated this one for a while today, and @jennifersp even coded up a nice PR for this. It's a good issue, but I'm not comfortable yet with some of the effects of the change. Since there is a workaround currently (using enum integer values instead of the string values, like you mentioned), I think we shouldn't rush into a change that might affect other uses of dolt table import...

Here's my reasoning:

  • Using the empty string in enums/sets does not seem like a common use case (I know in this case we were just trying to hack around a PK non-null constraint)
  • There is a workaround today, so customers aren't blocked if they need to run imports that import empty string enum values.
  • If we did start interpreting empty string enum/set values in imports as the enum empty string, then there wouldn't be a workaround for people who do want the NULL value, so it feels like we might be creating a new problem. I don't have data for this, but my gut says this is probably the more common use case. (We could try to be more clever and look at the schema to see if an enum/set even has an empty string member, but that feels like it's getting a little too clever and might be unintuitive behavior.)

I'm happy to leave this one open if we want to see if other customers hit this issue, but for right now, I think we should not make this change. I'm open to other ideas or compromises of course. My favorite idea that Jennifer I discussed was having a command line flag that switches this behavior. That felt like the cleanest solution to me, but I was hesitant to add that until we had some more customer feedback on needing this.

@fulghum fulghum removed the bug Something isn't working label May 2, 2023
@alecstein
Copy link
Contributor Author

alecstein commented May 3, 2023

It's a bug only in the sense the the behavior of dolt table import regarding CSV rows like

value,"",""

is not the same across the board. In any non-ENUM field, the value "" is considered the empty string, and distinct from NULL. The counterintuitive behavior is treating "" like NULL only in the context of ENUMs.

If we did start interpreting empty string enum/set values in imports as the enum empty string, then there wouldn't be a workaround for people who do want the NULL value

the NULL value would be just ,, as before, instead of "".

But in the spirit of what you say: yes, it's not widespread, and I also agree that now that we've figured out another solution for our use case, it's not high priority.

@fulghum
Copy link
Contributor

fulghum commented May 9, 2023

I was thinking about this some more and I tested how our import and export commands work – you're totally right that they do already distinguish between null and empty string via ,, and ,"",. This is not a standard part of CSV formatting though, and there seems to be a fair bit of debate on whether people agree this is a good approach (but lots of agreement that this is a frustrating deficiency in CSV files!)

Ultimately, if we're already making this distinction in the import/export code, then I'm totally fine with continuing it for enums. It cleanly handles my earlier concern about needing to treat all ,, as the empty string for enums (i.e. they would have to be ""). It also means that we can correctly roundtrip data from export back into import without corrupting the null/empty values.

Apologies that I didn't catch that we were already making that distinction in the import/export code! Given that we are, you've convinced me that this is a bug and we should give enum values the same handling to differentiate between ,, and ,"",.

@fulghum fulghum added the bug Something isn't working label May 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants