[Proposal] Improve Handling of Nulls in Druid #4349
Comments
I thought this day would never come. @nishantmonu51 you didn't mention motivation in the description, but I guess it's so Druid's runtime would be able to treat nulls in a more SQL standard way. Some initial thoughts:
If the goal is more SQL-like behavior, you don't want this, since it's not how SQL works. SQL aggregators like
You're missing a big one, which is backwards compatibility in the runtime: preserving current behavior where
In SQL if you have this data
and do
Which is not necessarily a good thing. Here's an example using DERBY embedded:
DROP TABLE FIRSTTABLE;
CREATE TABLE FIRSTTABLE
(ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
NAME VARCHAR(12), ATTR VARCHAR(12) DEFAULT NULL);
INSERT INTO FIRSTTABLE (NAME, ATTR) VALUES ('A', 'z');
INSERT INTO FIRSTTABLE (NAME, ATTR) VALUES ('B', 'z');
INSERT INTO FIRSTTABLE (NAME, ATTR) VALUES ('C', 'z');
INSERT INTO FIRSTTABLE (NAME) VALUES ('D');
SELECT * FROM FIRSTTABLE;
SELECT * FROM FIRSTTABLE
WHERE ATTR NOT LIKE 'z';
SELECT * FROM FIRSTTABLE
WHERE ATTR NOT LIKE 'bannana pajamas';
This gives the following:
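Under standard SQL three-valued logic, the row 'D' (where ATTR is NULL) drops out of both NOT LIKE queries, because the predicate evaluates to UNKNOWN rather than TRUE:
SELECT * FROM FIRSTTABLE;                          -- all four rows: A, B, C, D
SELECT * FROM FIRSTTABLE WHERE ATTR NOT LIKE 'z';  -- no rows: A/B/C evaluate to FALSE, D to UNKNOWN
SELECT * FROM FIRSTTABLE
WHERE ATTR NOT LIKE 'bannana pajamas';             -- A, B, C only: D is still excluded, since NULL NOT LIKE '...' is UNKNOWN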
IMHO this behavior is super annoying and detrimental to the flexible schema approach that Druid uses. So I don't think just blindly taking SQL null handling is a good idea.
@drcrallen I think the idea behind following a standard is to make interoperability between Druid and other systems easy. Is there a better standard?
@nishantmonu51 nice proposal! I think three-valued logic also needs to be considered for comparing nulls and non-null values. @drcrallen @gianm I vote for making our null handling compatible with the SQL standard. I think choosing a SQL-like but non-standard language is like reinventing another wheel, which brings a lot of pain. We would eventually need to follow another, better standard or make our own standard, and that means we would need to worry about the ecosystem around Druid's special language.
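To make the three-valued logic point concrete, here is a generic standard-SQL illustration using the FIRSTTABLE example above (row 'D' has a NULL ATTR):
SELECT * FROM FIRSTTABLE WHERE ATTR = 'z';      -- the NULL row is not returned: NULL = 'z' is UNKNOWN, not TRUE or FALSE
SELECT * FROM FIRSTTABLE WHERE ATTR <> 'z';     -- the NULL row is not returned here either, for the same reason
SELECT * FROM FIRSTTABLE WHERE ATTR IS NULL;    -- IS NULL is the predicate that actually matches the missing value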
@gianm @jihoonson: Thanks for the comments, I have updated the proposal to include them.
DROP TABLE FIRSTTABLE;
CREATE TABLE FIRSTTABLE
(ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
NAME VARCHAR(12),
ATTR VARCHAR(12) DEFAULT NULL,
ATTR2 VARCHAR(12) DEFAULT NULL
);
INSERT INTO FIRSTTABLE (NAME, ATTR, ATTR2) VALUES ('A', 'z', 'y');
INSERT INTO FIRSTTABLE (NAME, ATTR2) VALUES ('B', 'y');
INSERT INTO FIRSTTABLE (NAME, ATTR) VALUES ('C', 'z');
INSERT INTO FIRSTTABLE (NAME) VALUES ('D');
SELECT NAME FROM FIRSTTABLE
WHERE (ATTR IS NULL OR ATTR NOT LIKE 'z') AND (ATTR2 IS NULL OR ATTR2 NOT LIKE 'y');
This gives the expected result. As long as there isn't a performance hit for doing it this way instead of just the Druid not filter and selector, it should be OK. Can you give more information on the backwards compatibility / migration plan for "already indexed" data at query time?
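For what it's worth, evaluating that predicate row by row under standard SQL logic leaves only 'D':
-- A ('z', 'y')   : FALSE AND FALSE -> excluded
-- B (NULL, 'y')  : TRUE  AND FALSE -> excluded
-- C ('z', NULL)  : FALSE AND TRUE  -> excluded
-- D (NULL, NULL) : TRUE  AND TRUE  -> returned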
I don't think the replacement at ingestion time is enough to preserve the current behavior. There are a lot of places in the query runtime where nulls and empty strings are treated the same, like comparisons in filters (consider
@gianm @drcrallen, I see. Yeah, to be fully backwards compatible, the places at query time which treat empty strings and nulls as the same will need two separate code paths, selected by a Druid runtime property, e.g. druid.null.behavior: by default it will treat null and empty strings as the same and preserve the current behavior, and users can set it to 'sql' to get the new SQL behavior. I also feel that having two separate code paths and behaviors in lots of places may not be good and will lead to confusion for users in the long run, so I would propose moving to the SQL behavior in a future major release.
Also updated the backwards compatibility section in the description.
This would be awesome. Thank you for proposing it. @drcrallen Your statement "most people expect 4" is debatable. I mean that in the best sense of "debatable". In the Druid community, you are probably right that most people expect 4. In the broader data community, people have gotten used to SQL semantics, and so they would expect 3. Suppose I am using a database I've not used before -- say MongoDB or Elasticsearch -- and I write their equivalent of
Can you generalize "Math expressions" to "All operators and functions unless stated otherwise"? Almost all the built-in functions and operators in SQL generate NULL if any of their arguments are NULL. (Notable exceptions are the
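An illustration of that null-propagation rule, reusing FIRSTTABLE from earlier in the thread; the concatenation and COALESCE calls are generic standard SQL chosen for the example, with COALESCE shown only as one well-known construct that does accept NULL arguments:
SELECT NAME,
       UPPER(ATTR)            AS up,     -- NULL wherever ATTR is NULL: most functions propagate NULL arguments
       ATTR || '-suffix'      AS joined, -- NULL wherever ATTR is NULL: operators propagate NULL as well
       COALESCE(ATTR, 'none') AS filled  -- 'none' wherever ATTR is NULL: COALESCE is defined to handle NULLs
FROM FIRSTTABLE;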
In my opinion, the ability to store and retrieve null values is more important than 3-valued logic. And 3-valued logic will take a lot of time & effort to engineer. So, could you consider doing 3-valued logic in a second phase? In phase 1 you could do all of the other features but still retain 2-valued logic. Don't get me wrong: we need 3-valued logic to be SQL compliant. But if Druid can simply store and retrieve null values, and has consistent semantics, we (by which I mean us folks building a SQL adapter using Calcite and Hive) can work with that in the short to medium term.
I thought that this proposal was for the long-term goal of making Druid SQL-compliant. I agree on going forward step by step.
I'm not sure where to bring this up, so I'll do it here. I was reading through the null handling code and started thinking that it doesn't make sense for
But it should not affect indexing at all. That should be driven by the indexing spec. The reason is that it should be possible to specify a column as nullable or non-nullable on a per-column basis. You shouldn't have to be locked into all your columns being one way or the other via a systemwide property. It's similar to how a relational db would let you say column
/cc @nishantmonu51
@gianm: IMO, the system-wide config is more of a backwards-compatibility thing, which will be removed in future releases when we fully move to SQL-compatible behavior and support ONLY that. Users who want to replace nulls with default values can do that in a pre-indexing step, or use CASE statements in SQL to replace nulls with some values.
I think if a column is marked as non-null then it makes sense for Druid to replace nulls in inputs with default values. It's mostly used for analytics and not as a source of truth, so it doesn't make sense for it to be too rigid about what it accepts. Source-of-truth DBs do that in order to maintain data integrity. But a database meant for analytics on already-existing data has to deal with what is already known, so at least it should have a choice between a mode where it is more accepting and a mode where it is more rigid, rather than be rigid all the time.
I don't think this plan is going to work. One reason in particular is that Druid doesn't require users to know each and every column in their data before ingesting it. It can detect them dynamically at ingestion time. And for dynamically detected columns, users don't have the opportunity to specify CASE statements or anything like that. So, there has to be some way to control how the dynamically detected columns are parsed and stored without knowing what their names are ahead of time. I think it should at least be per-datasource rather than a systemwide property (or even totally unconfigurable as I think you are proposing).
@gianm: Sorry, I misunderstood your comment. For the ingestion part, I think the per-column settings outlined in the proposal, where the user can specify the list of columns along with a default per column, plus similar per-column-type defaults on the index task for schemaless ingestion, would be enough. Do you think we still need to infer and replace defaults based on NOT NULL / NULL semantics? I think the ability to specify default values during ingestion would be enough. For the query part, there are two parts
Note the second PR for null handling does not have the ability to specify defaults during ingestion. Will add that in a follow-up PR if this sounds reasonable to you.
The mechanism in the section "Ability to specify default values for dimensions/metrics while ingestion" is not sufficient, due to auto-detected columns. IMO, the best way to handle autodetected columns is not to ask the user to specify a default value per type, but instead to have a catch-all config like "replace missing values with defaults rather than nulls". It would be like the null handling config in the current patchsets, but only affects indexing time. My personal opinion is that in the long run, by default, Druid should replace missing values with defaults at indexing time rather than storing them as nulls. But it should support nulls properly at query time if they happen to be stored in segments. (Or if an expression function returns a null from a nonnull input.) The main reason for this is concern about the performance overhead of nullable columns. Imagine a longSum over a nonnull column vs a nullable column. The nullable one has extra work to check the bitmap, an extra branch… it (in theory) adds up.
It could be handled pretty easily at query time, by checking if the null bitmap has a cardinality of 0 and then executing a special version of the code.
I was concerned about something different: columns where there actually are nulls in the source data. If the default behavior is to store them as nulls in the columns, then this optimization isn't available, because the null bitmap will be nonempty. So my concern was that by setting the default to 'nullable' rather than 'not nullable', we will incur needless performance hits when the user doesn't really care if the missing values get stored as nulls or zeroes. I was thinking that it's better to set the default to replace-missing-values-with-default, since then the user would need to opt into the potentially slower behavior of store-missing-values-as-nulls. Btw, I noticed that the isNull methods are usually implemented like:
public boolean isNull()
{
  return nullValueBitmap.get(offset.getOffset());
}
From what I recall,
In this case, during the segment preparation phase, the default value could be written in place of null positions. Then at the query execution phase, if the query strategy is to differentiate default values from nulls, the nullBitmap is taken into consideration, but if the strategy is to replace nulls with default values, the bitmap is just ignored: default values are "redundantly" written at all needed positions. I think @nishantmonu51's series of patches already does exactly this in the first part (during the segment preparation phase), but there are no optimisations yet at query time.
Please make this comment in one of the patches as a review on the specific line(s) of code.
@gianm I agree, having a default catch-all config to replace missing values during ingestion would be useful. I was thinking that instead of having this as a global config, it would be part of the indexing spec in the task json. Will do this in a follow-up PR.
I haven't deeply investigated the performance impact; at least in the current patchset this is called only if null handling is enabled. Will add some performance benchmarks in a follow-up PR.
It's definitely good to write default values in the "null" columns, although what I was hoping is that we could do a default configuration where:
This allows the default configuration to be speedy (no nulls), while also allowing people to take advantage of more powerful null handling without too much work (they just need to change the storage-side config). Importantly, people don't need to set a cluster-wide config that changes the behavior for all datasources and all columns.
Haha, I tried but GitHub cannot load #5278 anymore:
Instead, I raised an issue: #5569
This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
Reopening because this is not a completely finished project.
I notice when ingesting CSVs, where I want to allow for null values, I see:
exceptions: [could not convert value [] to double,could not convert value [] to double,could not convert value [] to double,could not convert value [] to double,could not convert value [] to double,could not convert value [] to double,]
I'd like to set maxParseExceptions to 0, but then these will halt the ingestion. How can we ingest null values for a double dimension without creating parse exceptions?
I think the bug is with https://github.com/apache/incubator-druid/blob/82b248cc1778d24b6f37864e113d8c77250d0cdf/core/src/main/java/org/apache/druid/common/config/NullHandling.java#L82. I'm thinking that should be !replaceWithDefault(), as when we aren't using default values, we need to convert the object to null.
@WilliamWhispell there is nothing wrong with this line.
I just learned of the existence of the config field. And if the flag is implemented and released, why is this issue still open? Sorry, I don't have time to read through all the replies here. But could you please document the config field?
Has this proposal been completely finished now?
I guess proposal no. 6 means to preserve column metadata on all-null columns, right? In my case (1 of 2 cases): I have a column which only contains null values (for now). After ingesting it to Druid, the column is missing (the datasource doesn't have that column), so my query from some app to Druid using that column ended up as an error (no column). And when some other segment is ingested with the column containing non-null values, the datasource finally has that column (even though only 1 segment has that non-null value). The other case is with Turnilo, which I assume only uses the last segment's metadata as the datasource structure when Turnilo refreshes it. So if the last segment ingested contains an all-null column, it results in an error.
Hey @noskcire11, how did you overcome this? All segments have null values for the dimension, so the dimension ends up being removed from the column metadata for that datasource.
From my understanding and testing, it appears that if Druid has no values in any of the rows (all nulls), it will automatically drop that column at the end when completing the ingestion.
This is a SQL layer issue: it cannot make a query plan with a missing column. The Druid query engine treats a missing column and a column of all null values the same.
This issue has been marked as stale due to 280 days of inactivity. |
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
"Float columns: Nulls are considered equivalent to 0." Is this still true?
Motivation
This proposal is to improve the handling of null values in Druid by treating nulls as missing values which don't actually exist. This will make it more compatible with the SQL standard and help with integration with existing BI systems which support ODBC/JDBC.
Current Behavior
Current null handling in Druid is as follows -
Proposed Behavior Changes for Null Handling
Null semantics after the changes described in proposal -
e.g
Implementation Changes -
Ability to specify default values for dimensions/metrics while ingestion
The idea is to allow users to specify a default value for each column that can be used to replace nulls during ingestion, e.g.
In this case, any null value in column1 will be replaced with String “abc” and any null value for long column will be replaced with 5. The final segment created will store these default values instead of nulls.
Storage Format changes
The current storage format is not sufficient to differentiate whether a value is null or not. We need to change the storage format to also store information about the presence of a value. We can choose from the following options -
For String columns, it seems approach (a) will be better, as it will also simplify the handling of multi-value columns.
We can also skip creating a presence bitmap for the case when all the values in a column are known to be non-null.
After the changes the schema for the columns will be as follows -
Nullability for Aggregators/Metrics
As per the current implementation, most aggregators are coded to replace null values with defaults, e.g. sum treats them as zeroes, min treats them as positive infinity, etc.
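For contrast, standard SQL aggregates skip NULL inputs instead of substituting defaults; a generic sketch with hypothetical values (not from the proposal):
SELECT SUM(m), MIN(m), COUNT(m), COUNT(*)
FROM (VALUES (1), (2), (CAST(NULL AS INTEGER))) AS t (m);
-- Standard SQL: SUM(m) = 3 and MIN(m) = 1, because NULLs are skipped rather than treated
-- as 0 or +infinity; COUNT(m) = 2 counts only non-NULL values while COUNT(*) = 3;
-- and SUM(m) / MIN(m) return NULL (not 0) when every input value is NULL.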
To match the new semantics we need to make the following changes -
Math Expressions Null Handling
Null handling for math expressions will be similar to the aggregators. For general calculations like sum, the full expression will be considered null if any of its components is null. Specifying a default value for nulls will be supported via an NVL or IF clause to assign default values at query time.
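A sketch of the query-time defaulting described above, written with the standard COALESCE form; the table and columns here are hypothetical, and the exact NVL/IF spellings available in Druid's expression language may differ:
SELECT
  SUM(m + n)                           AS strict_sum,   -- any row where m or n is NULL contributes nothing: m + n is NULL
  SUM(COALESCE(m, 0) + COALESCE(n, 0)) AS defaulted_sum -- defaults assigned explicitly at query time
FROM some_table;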
Filtering on Null values
SelectorDimFilter currently allows filtering on null values, but the implementation treats null and empty strings as equivalent. The implementation will be changed to treat null and empty string differently.
Generation of the cache key for SelectorDimFilter also needs to be modified to support null.
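In SQL terms, the distinction this enables looks like the following (reusing the FIRSTTABLE example from the discussion above):
SELECT COUNT(*) FROM FIRSTTABLE WHERE ATTR = '';       -- matches only genuine empty strings
SELECT COUNT(*) FROM FIRSTTABLE WHERE ATTR IS NULL;    -- matches only missing/null values
SELECT COUNT(*) FROM FIRSTTABLE WHERE ATTR <> 'z';     -- no longer matches the null row: the comparison is UNKNOWN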
Changes to Druid built-in SQL layer
Misc Changes
Above are the major changes in the user-facing APIs and behavior. Other than these, there are multiple places in the code where we convert empty Strings to null and vice versa. They will be changed accordingly in order to treat null and empty String values separately.
Backwards Compatibility