Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed type validation with support for NULL #72

Merged
merged 1 commit into from Jan 21, 2022
Merged

Fixed type validation with support for NULL #72

merged 1 commit into from Jan 21, 2022

Conversation

pvieito
Copy link
Contributor

@pvieito pvieito commented Nov 23, 2021

Without this fix when reading a boolean column with a NULL value ("" in the input CSV) it would fail to parse as JSON and in numeric columns the NULL value would be coerced to a 0. Now when an empty string is detected it is correctly handled as a NULL value.

Without this fix when reading a boolean column with a NULL value ("" in the input CSV) it would fail to parse as JSON and in numeric columns the NULL value would be coerced to a 0. Now when an empty string is detected it is correctly handled as a NULL value.
@ghdna
Copy link
Owner

ghdna commented Jan 17, 2022

Sorry, this took a while getting back to. I've been testing this and had a few observations.

  1. On Athena itself, when you run a query on data that contains empty and null numbers and boolean, the result shown is always empty on the console. See below. The same is reflected in the CSV file that Athena produces and stores in S3. Those fields simply don't have any value.

Screen Shot 2022-01-17 at 10 28 33

  1. When queuing this data using our library, athena-express, such fields are hidden from those rows, by default. If you set ignoreEmpty as false, then you see the value 0 for number.

But there is no way to know whether the original uploaded data value was empty or NULL. And since Athena itself supports both for all fields, including number (which is an anti pattern), then I think Athena-express shouldn't (and cannot accurately) correct that and therefore present that data as-is.

Let me know what you think

@pvieito
Copy link
Contributor Author

pvieito commented Jan 17, 2022

If we leave the code "as is" it crashes with queries like this:

SELECT TRUE AS "BoolColumn"
UNION ALL
SELECT NULL
UNION ALL
SELECT FALSE

If the column is of Bool or Numeric it should parse an empty string as a NULL value (it cannot be an empty string).

@ghdna
Copy link
Owner

ghdna commented Jan 17, 2022

How will we distinguish if the original uploaded value was meant to be empty or null, given Athena doesn't distinguish between the two? Feel like we might be making a leap here with polluting the response since Empty and Null are not the same values.

@pvieito
Copy link
Contributor Author

pvieito commented Jan 17, 2022

If the column has been marked by Athena as boolean it cannot contains empty values, only true, false or null. The same for numeric columns. Only varchar columns have this ambiguity, but typically data libraries like Pandas infer null for empty values when reading CSV files.

@ghdna
Copy link
Owner

ghdna commented Jan 17, 2022

That's what I was showing with the screenshot I posted above. Athena allows empty values for Boolean and Numeric columns. Upload a sample dataset with empty and null values for Athena and you will notice both are allowed and both show up as empty.

@pvieito
Copy link
Contributor Author

pvieito commented Jan 17, 2022

I does not. It is interpreting them as NULL values and showing them as NULL. Try this in your example dataset:

SELECT *
FROM table
WHERE boolean is NULL

And obviously this is allowed:

SELECT TRUE
UNION ALL 
SELECT NULL
UNION ALL 
SELECT FALSE

But this is not:

SELECT TRUE
UNION ALL 
SELECT ''
UNION ALL 
SELECT FALSE

-- SYNTAX_ERROR: line 4:5: column 2 in UNION query has incompatible types: boolean, varchar(4)

@ghdna
Copy link
Owner

ghdna commented Jan 17, 2022

You're right in that Athena interprets uploaded empty values for Boolean/Number as NULL.

@ghdna
Copy link
Owner

ghdna commented Jan 17, 2022

With your PR, all fields (not just Boolean and number) that are empty will become null. And that will ignore the genuine empty values

@pvieito
Copy link
Contributor Author

pvieito commented Jan 17, 2022

Yes, but note that this is the expected behavior in CSV parsers. Even Athena does this, if you upload a dataset with some empty values in a text column I will infer NULL values:

SELECT *
FROM table
WHERE text is NULL

@ghdna
Copy link
Owner

ghdna commented Jan 17, 2022

So you're suggesting replacing all empty values with NULL. Is there no use case where empty values could stay as empty values?

@pvieito
Copy link
Contributor Author

pvieito commented Jan 20, 2022

Yes, that is the typical behavior of CSV/Athena parsers. At least boolean and numeric values are currently being incorrectly handled (they cannot contain empty strings). While varchar values can contain empty strings there is no way to distinguish between it and a NULL value so one should "win" (I personally prefer NULL but we could change the PR to return '' by default for varchar values).

@ghdna
Copy link
Owner

ghdna commented Jan 20, 2022

ok I'll accept the PR

@ghdna ghdna merged commit 6bc00f0 into ghdna:master Jan 21, 2022
ghdna added a commit that referenced this pull request Jan 21, 2022
@ghdna
Copy link
Owner

ghdna commented Jan 22, 2022 via email

@alfaproject
Copy link

@ghdna no, sorry. I just misread something, nvm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants