[Proposal] Improve Handling of Nulls in Druid #4349

Closed
nishantmonu51 opened this issue Jun 1, 2017 · 36 comments

Comments

@nishantmonu51
Member

nishantmonu51 commented Jun 1, 2017

Motivation

This proposal is to improve handling of null values in Druid by treating nulls as missing values that don't actually exist. This will make Druid more compatible with the SQL standard and help with integration with existing BI systems that support ODBC/JDBC.

Current Behavior

Current null handling in Druid is as follows -

  1. String dimensions - nulls and empty strings are considered equivalent.
  2. Long/Float columns - nulls are considered equivalent to 0.

Proposed Behavior Changes for Null Handling

Null semantics after the changes described in this proposal -

  1. Null and empty strings will be handled differently.
  2. For every column type, a null value will be considered equivalent to an absent/missing value.
  3. Aggregate functions will ignore null values. If the data set contains no rows, or contains only rows with nulls as arguments to the aggregate function, then the function returns null, except count, which will return 0 instead of null.
    e.g.
    1. A min aggregation over null and 5 will give a result of 5.
    2. longSum/doubleSum over values (null, 1, null) will return 1.
    3. Count over values (null, 1, null) will return 1.
    4. Count over values (null, null) will return 0.
  4. Rollup of metrics - if a null and a non-null value are rolled up using an aggregator at ingestion time, the result will be defined by the aggregator behavior specified above.
  5. Math expressions - the result value will be null if any of the components is null.
  6. A missing column will be considered equivalent to a column having null values for all the rows.
  7. When sorting, nulls will be sorted as the lowest value by default. Users can specify sorting nulls as the highest value as part of the query.
  8. Users will be able to specify default values to replace null values at ingestion time.
  9. Comparisons between null and non-null values will follow three-valued logic: the result of a comparison involving null is not a boolean value, it is null (see the sketch after this list).
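
To make item 9 concrete, here is a minimal, purely illustrative Java sketch of three-valued logic, where a boxed Boolean null stands in for UNKNOWN. None of this is Druid code; the class and method names are made up for the example.

// Minimal three-valued-logic sketch: null stands for UNKNOWN. A row passes a
// filter only when the comparison is definitely TRUE, which is why a null
// value fails both attr = 'z' and attr <> 'z'.
class ThreeValuedLogic
{
  // A comparison involving null yields null (UNKNOWN), not false.
  static Boolean eq(String a, String b)
  {
    if (a == null || b == null) {
      return null;
    }
    return a.equals(b);
  }

  static Boolean not(Boolean v)
  {
    return v == null ? null : !v;
  }

  // A WHERE clause keeps the row only when the predicate is TRUE.
  static boolean passesFilter(Boolean predicate)
  {
    return Boolean.TRUE.equals(predicate);
  }

  public static void main(String[] args)
  {
    System.out.println(passesFilter(eq("z", "z")));       // true
    System.out.println(passesFilter(not(eq(null, "z")))); // false: null <> 'z' is UNKNOWN, so the row is dropped
  }
}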

Implementation Changes -

Ability to specify default values for dimensions/metrics during ingestion
The idea is to allow users to specify a default value for each column that can be used to replace nulls during ingestion, e.g.

[
  {
    "type" : "String",
    "name" : "column1",
    "replaceMissingValuesWith" : "abc"
  },
  {
    "type" : "Long",
    "name" : "column2",
    "replaceMissingValuesWith" : 5
  }
]

In this case, any null value in column1 will be replaced with the string "abc" and any null value in the long column2 will be replaced with 5. The final segment created will store these default values instead of nulls.

Storage Format changes

The current storage format is not sufficient to differentiate whether a value is null or not. We need to change the storage format to also store information about the presence of a value. We can choose among the following options -

  1. For String columns - we can choose one of the following -
    1. In the dictionary encoding, reserve 0 as the id mapped to the null value; any other value will get a dictionary id > 0. This will also work for multi-value columns, where an array of values can store 0 as an id. This will need incrementing the segment version for backwards compatibility.
    2. Add a presence bitmap which tells whether the value is present or absent in the row. For a row with multiple values, it is possible that the presence bit is not set while the row still has some values, which indicates a multi-value row that includes null.
  2. For Long/Float columns - add a presence bitmap which tells whether the value is present or not.

For String columns, the first approach (reserving dictionary id 0) seems better, as it also simplifies the handling of multi-value columns.
We can also skip creating a presence bitmap when all the values in a column are known to be non-null. A rough sketch of reading a nullable numeric column follows the schema outline below.

After the changes the schema for the columns will be as follows -

  1. String Column -
    1. Dictionary encoding starting with 1; 0 will be reserved for mapping the null value.
    2. Column Data
    3. Bitmap Indexes
  2. Numeric Column -
    1. Column Data
    2. Presence Bitmap
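
As a rough illustration of the numeric layout above (column data plus presence bitmap), here is a hypothetical reader sketch. The class is made up for this example; real Druid columns use compressed storage and Roaring bitmaps rather than a plain array and BitSet.

import java.util.BitSet;

// Hypothetical reader for a nullable long column: raw values plus a presence
// bitmap marking which rows actually have a value. A default (e.g. 0) sits in
// the slots of missing rows so the value array stays dense.
class NullableLongColumn
{
  private final long[] values;   // column data
  private final BitSet present;  // bit i set => row i has a real value

  NullableLongColumn(long[] values, BitSet present)
  {
    this.values = values;
    this.present = present;
  }

  boolean isNull(int row)
  {
    return !present.get(row);
  }

  // Returns null for missing rows, the stored value otherwise.
  Long get(int row)
  {
    return present.get(row) ? values[row] : null;
  }
}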

Nullability for Aggregators/Metrics

As per the current implementation, most aggregators are coded to replace null values with defaults, e.g. sum treats them as zeroes, min treats them as positive infinity, etc.
To match the new semantics we need to make the following changes -

  1. Make aggregators nullable: if an aggregator encounters only null values, the result will be null (a rough sketch follows this list).
  2. The behavior will be the same for max, min and sum, except that they will return null in case all the encountered values are null.
  3. The count aggregator will ignore null values and only count non-null values. If all the encountered values are null, the result will be 0. This is different from the current behavior, where null values are also counted.
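
To show the shape of the nullable-aggregator idea from item 1, here is a hypothetical sum aggregator that tracks whether it has seen any non-null input. This is not Druid's actual Aggregator interface, just a sketch of the proposed behavior.

// Hypothetical nullable-sum aggregator: ignores nulls while aggregating and
// reports null only if it never saw a non-null input.
class NullableLongSumAggregator
{
  private long sum = 0L;
  private boolean hasValue = false;

  void aggregate(Long input)
  {
    if (input != null) {
      sum += input;
      hasValue = true;
    }
  }

  // (null, 1, null) => 1; (null, null) => null
  Long get()
  {
    return hasValue ? sum : null;
  }
}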

Math Expressions Null Handling

Null handling for math expressions will be similar to the aggregators. For general calculations like sum, the full expression will be considered null if any of its components is null. Specifying a default value for null will be supported through NVL or IF clauses that assign default values at query time.
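
A minimal sketch of these expression semantics, using hypothetical helper names: null propagates through the arithmetic, and an NVL-style helper substitutes a default explicitly.

// Illustrative only, not Druid's expression engine.
class ExpressionNullPropagation
{
  // Any null operand makes the whole expression null.
  static Double plus(Double a, Double b)
  {
    return (a == null || b == null) ? null : a + b;
  }

  // nvl(x, dflt): replace null with a default value.
  static Double nvl(Double x, double dflt)
  {
    return x == null ? dflt : x;
  }

  public static void main(String[] args)
  {
    System.out.println(plus(1.0, null));           // null
    System.out.println(plus(nvl(null, 0.0), 1.0)); // 1.0
  }
}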

Filtering on Null values

SelectorDimFilter currently supports filtering on null values, but the implementation assumes null and empty strings are equivalent. The implementation will be changed to treat null and empty string differently.
Generation of the cache key for SelectorDimFilter also needs to be modified to support null.
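
A sketch of what the changed matching could look like once null and "" are distinguished; the factory method below is hypothetical and not the actual SelectorDimFilter implementation.

import java.util.function.Predicate;

// Hypothetical selector predicate factory: a null match value matches only
// true nulls, and "" matches only the empty string.
class SelectorPredicates
{
  static Predicate<String> forValue(String matchValue)
  {
    if (matchValue == null) {
      return v -> v == null;            // IS NULL: matches only true nulls
    }
    return v -> matchValue.equals(v);   // "" no longer matches null
  }

  public static void main(String[] args)
  {
    System.out.println(forValue(null).test(""));  // false under the new semantics
    System.out.println(forValue("").test(null));  // false
    System.out.println(forValue("").test(""));    // true
  }
}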

Changes to Druid's built-in SQL layer

  1. IS NULL and IS NOT NULL will be mapped to SelectorDimFilter with a null value.
  2. NVL, IFNULL and COALESCE will be mapped to the DimensionSpec replaceMissingValueWith property.

Misc Changes

Above are the major changes to the user-facing APIs and behavior. Beyond these, there are multiple places in the code where we convert empty strings to null and vice versa. They will be changed accordingly in order to treat null and empty strings separately.

Backwards Compatibility

  1. For backwards compatibility we can add a new runtime property to preserve the old behavior, e.g. 'druid.null.behavior'. If this property is set to 'sql' we will follow the SQL behavior; if it is not set we will preserve backwards compatibility.
  2. Query execution backwards compatibility - there are lots of places in the code that rely on current Druid null handling, e.g. the count aggregator. They will need to be made backwards compatible in order to preserve the current behavior where null and "" are treated the same. This would mean having two branches of the code at all the places which currently assume null and empty strings to be equivalent during query execution, e.g. filters, extraction functions, lookups, the count aggregator and more.
  3. Ingestion time - at ingestion we replace null with empty strings; the same behavior will be preserved by having multiple code paths based on the 'druid.null.behavior' property.
  4. Increment the column versions to support backwards compatibility for the storage changes and remain able to read columns with the older version.
@gianm
Contributor

gianm commented Jun 2, 2017

I thought this day would never come.

@nishantmonu51 you didn't mention motivation in the description, but I guess it's so Druid's runtime would be able to treat nulls in a more SQL standard way. Some initial thoughts:

To match the new semantics we need to make aggregators nullable, where if an aggregator encounters a null value the result will be null value.

If the goal is more SQL-like behavior, you don't want this, since it's not how SQL works. SQL aggregators like SUM(expr) skip NULLs but still aggregate all non-NULLs. It's still different from what Druid does though (especially for MIN/MAX where Druid would treat missing as 0 and standard SQL behavior would skip it).

Backwards Compatibility

You're missing a big one, which is backwards compatibility in the runtime: preserving the current behavior where null and "" are treated the same. There's a lot of code out there that expects Druid's null handling to work the way it currently does, and we don't want to break it.

@drcrallen
Contributor

In SQL if you have this data

NAME ATTR VALUE
A z 1
B z 1
C z 1
D (null) 1

and do

SELECT SUM(VALUE) FROM that table WHERE `ATTR` != 'bannana pajamas' (a value that matches no row), you will get 3 if I recall correctly because the NULL row is excluded, where most people expect 4.

Which is not necessarily a good thing.

Here's an example using DERBY embedded:

DROP TABLE FIRSTTABLE;
CREATE TABLE FIRSTTABLE
    (ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
    NAME VARCHAR(12), ATTR VARCHAR(12) DEFAULT NULL);

INSERT INTO FIRSTTABLE (NAME, ATTR) VALUES ('A', 'z');
INSERT INTO FIRSTTABLE (NAME, ATTR) VALUES ('B', 'z');
INSERT INTO FIRSTTABLE (NAME, ATTR) VALUES ('C', 'z');
INSERT INTO FIRSTTABLE (NAME) VALUES ('D');

SELECT * FROM FIRSTTABLE;

SELECT * FROM FIRSTTABLE
WHERE ATTR NOT LIKE 'z';

SELECT * FROM FIRSTTABLE
WHERE ATTR NOT LIKE 'bannana pajamas';

this gives the following:

ID           NAME          ATTR          
-----------------------------------------
1            A             z             
2            B             z             
3            C             z             
4            D             <null>        
ID           NAME          ATTR          
-----------------------------------------

ID           NAME          ATTR          
-----------------------------------------
1            A             z             
2            B             z             
3            C             z             

IMHO this behavior is super annoying and detrimental to the flexible schema approach that druid uses. So I don't think just blindly taking SQL null handling is a good idea.

@b-slim
Contributor

b-slim commented Jun 2, 2017

@drcrallen I think the idea behind following a standard is to make interoperability between Druid and other systems easy; is there a better standard?
Also, adding this support will actually give more flexibility: for instance, you can elect to replace null with empty, and then your SELECT SUM(VALUE) query above will return 4 instead of 3, and vice versa. Anyway, I agree with you that SQL is not god's word and we can elect not to follow it as-is.

@jihoonson
Contributor

@nishantmonu51 nice proposal! I think three-valued logic also needs to be considered for comparing nulls and non-null values.

@drcrallen @gianm I vote for making our null handling compatible with the SQL standard. I think choosing a SQL-like but non-standard language is like reinventing another wheel, which brings a lot of pain: we would eventually need to follow another, better standard or make our own standard, and we would need to worry about the ecosystem around Druid's special language.
Instead, if Druid speaks a language compatible with the SQL standard, we get a lot of advantages, since SQL is one of the most popular query languages: for example, easy integration with many existing BI tools which support ODBC/JDBC.

@nishantmonu51
Member Author

nishantmonu51 commented Jun 5, 2017

@gianm @jihoonson: Thanks for the comments; I have updated the proposal to address them.
@drcrallen: As @b-slim and @jihoonson pointed out, following the SQL standard makes it easier to make Druid work with other existing tools. For the case you mentioned, with the above proposal, the user has the ability to replace null with any default value at ingestion time and achieve the desired behavior.

@drcrallen
Contributor

DROP TABLE FIRSTTABLE;
CREATE TABLE FIRSTTABLE
    (ID INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
    NAME VARCHAR(12),
    ATTR VARCHAR(12) DEFAULT NULL,
    ATTR2 VARCHAR(12) DEFAULT NULL
    );
INSERT INTO FIRSTTABLE (NAME, ATTR, ATTR2) VALUES ('A', 'z', 'y');
INSERT INTO FIRSTTABLE (NAME, ATTR2) VALUES ('B', 'y');
INSERT INTO FIRSTTABLE (NAME, ATTR) VALUES ('C', 'z');
INSERT INTO FIRSTTABLE (NAME) VALUES ('D');

SELECT NAME FROM FIRSTTABLE
WHERE (ATTR IS NULL OR ATTR NOT LIKE 'z') AND (ATTR2 IS NULL OR ATTR2 NOT LIKE 'y');

gives the expected result of D.

As long as there isn't a performance hit for doing it this way instead of just the druid not filter and selector, it should be ok.

Can you give more information on the backwards compatibility / migration plan for "already indexed" data at query time?

@gianm
Contributor

gianm commented Jun 6, 2017

@nishantmonu51,

with the above proposal, the user has the ability to replace null with any default value at ingestion time and achieve the desired behavior.

I don't think the replacement at ingestion time is enough to preserve the current behavior. There are a lot of places in the query runtime where nulls and empty strings are treated the same, like comparisons in filters (consider "filter":{"type":"selector","dimension":"foo", "value":null} and "filter":{"type":"selector","dimension":"foo", "value":""}, which currently have identical behavior). Other examples are args to extractionFns, return values from extractionFns, and the behavior of keys into lookups. It seems to me that if we want to maintain compatibility then a lot of places in the runtime will need to have two code paths.

@nishantmonu51
Member Author

@gianm @drcrallen, I see. To be fully backwards compatible, the places at query time which treat empty string and null as the same will need two separate code paths based on a Druid runtime property, e.g. druid.null.behavior. By default it will treat null and empty strings as the same and preserve the current behavior, and users can set it to 'sql' to get the new SQL behavior.

I also feel that having two separate code paths and behaviors in lots of places may not be good and will lead to confusion for users in the long run, so I would propose moving to the SQL behavior in a future major release.

@nishantmonu51
Member Author

nishantmonu51 commented Jun 6, 2017

Also updated the backwards compatibility section in the description

@julianhyde

This would be awesome. Thank you for proposing it.

@drcrallen Your statement "most people expect 4" is debatable. I mean that in the best sense of "debatable". In the Druid community, you are probably right that most people expect 4. In the broader data community, people have gotten used to SQL semantics, and so they would expect 3.

Suppose I am using a database I've not used before -- say MongoDB or Elasticsearch -- and I write their equivalent of where attr <> 3. I don't want to have to go to their manual to see how they handle null values. I want to be able to assume that their behavior is standard. Whatever their behavior, it's pretty easy to achieve the effect I want (e.g. where attr <> 3 or attr is null), but they're wasting my time if they violate the principle of least surprise.

Math expressions - the result value will be null if any of the components is null.

Can you generalize "Math expressions" to "All operators and functions unless stated otherwise"? Almost all the built-in functions and operators in SQL generate NULL if any of their arguments are NULL. (Notable exceptions are the IS operators like IS NULL, the boolean operators OR, AND and NOT, and COALESCE.) But even `SUBSTRING` and `CHAR_LENGTH` adhere to this rule. (Remember, the empty string is not null.)

Comparisons between null and non-null values will follow three-valued logic: the result of a comparison involving null is not a boolean value, it is null.

In my opinion, the ability to store and retrieve null values is more important than 3-valued logic. And 3-valued logic will take a lot of time & effort to engineer. So, could you consider doing 3-valued logic in a second phase?

In phase 1 you could do all of the other features but still retain 2-valued logic (with x = y returning false if one of them is null).

Don't get me wrong: we need 3-valued logic to be SQL compliant. But if Druid can simply store and retrieve null values, and has a consistent semantics, we (by which I mean us folks building a SQL adapter using Calcite and Hive) can work with that in the short to medium term.

@jihoonson
Contributor

In my opinion, the ability to store and retrieve null values is more important than 3-valued logic. And 3-valued logic will take a lot of time & effort to engineer. So, could you consider doing 3-valued logic in a second phase?

I thought this proposal was a long-term goal of making Druid SQL-compliant. I agree with going forward step by step.

@gianm
Contributor

gianm commented Mar 23, 2018

I'm not sure where to bring this up, so I'll do it here.

I was reading through the null handling code and started thinking that it doesn't make sense for druid.generic.useDefaultValueForNull to determine indexing time behavior. IMO the system-wide setting should really only control query-time behavior. And what it should do is simple: if it's true (the default) then:

  1. Nulls and empty strings should be treated the same in string context, as they are today
  2. Nulls in numeric context should be treated like zeroes, as they are today

But it should not affect indexing at all. That should be driven by the indexing spec.

The reason is that it should be possible to specify a column as nullable or non-nullable on a per column basis. You shouldn't have to be locked into all your columns being one way or the other via a systemwide property. It's similar to how a relational db would let you say column x is BIGINT NOT NULL but column y is BIGINT NULL in the same table.

/cc @nishantmonu51

@nishantmonu51
Member Author

@gianm :
If the user says x is INT NOT NULL and tries to insert null values, then I think in SQL the insertion would fail the NOT NULL constraint. However, in Druid, nulls are replaced with default values silently.

IMO, the system-wide config is more of a backwards-compatibility thing, which will be removed in future releases when we fully move to SQL-compatible behavior and support ONLY that.

For users who want to insert default values for nulls or replace nulls with default values, they can do that in a pre-indexing step or use CASE statements in SQL to replace nulls with some values.

@gianm
Contributor

gianm commented Mar 26, 2018

If the user says x is INT NOT NULL and tries to insert null values, then I think in SQL the insertion would fail the NOT NULL constraint. However, in Druid, nulls are replaced with default values silently.

I think if a column is marked as non-null then it makes sense for Druid to replace nulls in inputs with default values. It's mostly used for analytics and not as a source of truth, so it doesn't make sense for it to be too rigid about what it accepts. Source-of-truth DBs do that in order to maintain data integrity. But a database meant for analytics on already-existing data has to deal with what is already known, so at least it should have a choice between a mode where it is more accepting and a mode where it is more rigid, rather than be rigid all the time.

IMO, the system-wide config is more of a backwards-compatibility thing, which will be removed in future releases when we fully move to SQL-compatible behavior and support ONLY that.

For users who want to insert default values for nulls or replace nulls with default values, they can do that in a pre-indexing step or use CASE statements in SQL to replace nulls with some values.

I don't think this plan is going to work. One reason in particular is that Druid doesn't require users to know each and every column in their data before ingesting it. It can detect them dynamically at ingestion time. And for dynamically detected columns, users don't have the opportunity to specify CASE statements or anything like that. So, there has to be some way to control how the dynamically detected columns are parsed and stored without knowing what their names are ahead of time. I think it should at least be per-datasource rather than a systemwide property (or even totally unconfigurable as I think you are proposing).

@nishantmonu51
Member Author

@gianm: Sorry I misunderstood your comment,
The proposal also has a section, "Ability to specify default values for dimensions/metrics during ingestion", which proposes adding the ability to specify default values at ingestion time.

I think that for the ingestion part, the per-column settings outlined in the proposal, where the user can specify a list of columns along with a default per column, plus similar per-column-type defaults on the index task for schemaless ingestion, would be enough.

Do you think we still need to infer and replace defaults based on NOT NULL/NULL semantics? I think the ability to specify default values during ingestion would be enough.

For the query part, there are two aspects -

  1. Backwards compatibility - maintained by the global setting during the transition phase.
  2. Null handling going forward - it should be done the SQL-compatible way. If users want to treat nulls as 0 or empty strings, they can specify that at ingestion time. The difference from the current behavior would be that, when nulls are replaced with defaults, queries return those default values instead of nulls.

Note that the second PR for null handling does not have the ability to specify defaults during ingestion. Will add that in a follow-up PR if this sounds reasonable to you.

@gianm
Contributor

gianm commented Apr 2, 2018

@nishantmonu51,

The mechanism in the section "Ability to specify default values for dimensions/metrics during ingestion" is not sufficient, due to auto-detected columns. IMO, the best way to handle auto-detected columns is not to ask the user to specify a default value per type, but instead to have a catch-all config like "replace missing values with defaults rather than nulls". It would be like the null handling config in the current patchsets, but only affect indexing time.

My personal opinion is that in the long run, by default, Druid should replace missing values with defaults at indexing time rather than storing them as nulls. But it should support nulls properly at query time if they happen to be stored in segments. (Or if an expression function returns a null from a nonnull input.)

The main reason for this is concern about the performance overhead of nullable columns. Imagine a longSum over a nonnull column vs a nullable column. The nullable one has extra work to check the bitmap, an extra branch… it (in theory) adds up.

@leventov
Member

leventov commented Apr 2, 2018

The main reason for this is concern about the performance overhead of nullable columns. Imagine a longSum over a nonnull column vs a nullable column. The nullable one has extra work to check the bitmap, an extra branch… it (in theory) adds up.

It could be handled pretty easily at query time, by checking whether the null bitmap has a cardinality of 0 and, if so, executing a specialized version of the code.
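
For illustration, a sketch of that dispatch using hypothetical types (a plain array and BitSet rather than Druid's actual column and bitmap classes): the specialized loop runs whenever the null bitmap is empty.

import java.util.BitSet;

// Hypothetical long-sum runner: when the null bitmap has cardinality 0, run
// the plain loop with no per-row null checks; otherwise consult the bitmap.
class LongSumRunner
{
  long sum(long[] values, BitSet nullBitmap)
  {
    if (nullBitmap == null || nullBitmap.cardinality() == 0) {
      // Fast path: no nulls anywhere, no per-row branch needed.
      long total = 0;
      for (long v : values) {
        total += v;
      }
      return total;
    }
    // Slow path: skip rows marked null in the bitmap.
    long total = 0;
    for (int i = 0; i < values.length; i++) {
      if (!nullBitmap.get(i)) {
        total += values[i];
      }
    }
    return total;
  }
}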

@gianm
Contributor

gianm commented Apr 2, 2018

It could be handled pretty easily at query time, by checking whether the null bitmap has a cardinality of 0 and, if so, executing a specialized version of the code.

I was concerned about something different: columns where there actually are nulls in the source data. If the default behavior is to store them as nulls in the columns, then this optimization isn't available, because the null bitmap will be nonempty. So my concern was that by setting the default to 'nullable' rather than 'not nullable', we will incur needless performance hits when the user doesn't really care if the missing values get stored as nulls or zeroes.

I was thinking that it's better to set the default to replace-missing-values-with-default, since then the user would need to opt into the potentially slower behavior of store-missing-values-as-nulls.

Btw, I noticed that the isNull methods are usually implemented like,

        public boolean isNull()
        {
          return nullValueBitmap.get(offset.getOffset());
        }

From what I recall, get on compressed bitmaps is quite slow compared to walking an iterator.

@leventov
Member

leventov commented Apr 2, 2018

I was concerned about something different: columns where there actually are nulls in the source data. If the default behavior is to store them as nulls in the columns, then this optimization isn't available, because the null bitmap will be nonempty. So my concern was that by setting the default to 'nullable' rather than 'not nullable', we will incur needless performance hits when the user doesn't really care if the missing values get stored as nulls or zeroes.

In this case, during the segment preparation phase, the default value could be written in place of null positions. Then at the query execution phase, if the query strategy is to differentiate default values from nulls, then the nullBitmap is taken into consideration, but if the strategy is to replace nulls with default values, the bitmap is just ignored: default values are "redundantly" written at all needed positions.

I think @nishantmonu51's series of patches already does exactly this in the first part (during the segment preparation phase), but there are no optimisations yet at query time.

From what I recall, get on compressed bitmaps is quite slow compared to walking an iterator.

Please make this comment in one of the patches as a review to a specific line(s) of code

@nishantmonu51
Member Author

@gianm I agree, having a default catch-all config to replace missing values during ingestion would be useful. I was thinking that instead of having this as a global config, it would be part of the indexing spec in the task JSON. Will do this in a follow-up PR.

From what I recall, get on compressed bitmaps is quite slow compared to walking an iterator.

I haven't deeply investigated the performance impact; at least in the current patchset this is called only if null handling is enabled. Will add some performance benchmarks in a follow-up PR.

@gianm
Contributor

gianm commented Apr 3, 2018

In this case, during the segment preparation phase, the default value could be written in place of null positions. Then at the query execution phase, if the query strategy is to differentiate default values from nulls, then the nullBitmap is taken into consideration, but if the strategy is to replace nulls with default values, the bitmap is just ignored: default values are "redundantly" written at all needed positions.

It's definitely good to write default values in the "null" columns, although what I was hoping is that we could do a default configuration where:

  1. Ingestion-side, we replace nulls with default values.
  2. Query-side, we respect nulls if they are present in storage.

This allows the default configuration to be speedy (no nulls), while also allowing people to take advantage of more powerful null handling without too much work (they just need to change the storage-side config). Importantly, people don't need to set a cluster-wide config that changes the behavior for all datasources and all columns.

Please make this comment in one of the patches as a review to a specific line(s) of code

Haha, I tried but GitHub cannot load #5278 anymore:

This page is taking way too long to load.

Sorry about that. Please try refreshing and contact us if the problem persists.

Instead, I raised an issue: #5569

@stale

stale bot commented Jun 21, 2019

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Jun 21, 2019
@stale

stale bot commented Jul 5, 2019

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

@stale stale bot closed this as completed Jul 5, 2019
@leventov leventov removed the stale label Jul 26, 2019
@leventov
Member

Reopening because this is not a completely finished project.

@leventov leventov reopened this Jul 26, 2019
@WilliamWhispell

I notice that when ingesting CSVs where I want to allow for null values, I see:

exceptions: [could not convert value [] to double,could not convert value [] to double,could not convert value [] to double,could not convert value [] to double,could not convert value [] to double,could not convert value [] to double,]
at org.apache.druid.segment.incremental.IncrementalIndex.getCombinedParseException(IncrementalIndex.java:757) ~[druid-processing-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.segment.incremental.IncrementalIndex.add(IncrementalIndex.java:611) ~[druid-processing-0.15.0-incubating.jar:0.15.0-incubating]
at org.apache.druid.segment.realtime.plumber.Sink.add(Sink.java:180) ~[druid-server-0.15.0-incubating.jar:0.15.0-incubating]

I'd like to set maxParseExceptions to 0, but then these will halt the ingestion. How can we ingest null values for a double dimension without creating parse exceptions?

@WilliamWhispell

I think the bug is at https://github.com/apache/incubator-druid/blob/82b248cc1778d24b6f37864e113d8c77250d0cdf/core/src/main/java/org/apache/druid/common/config/NullHandling.java#L82. I'm thinking that should be !replaceWithDefault(), since when we aren't using default values, we need to convert the object to null.

@leventov
Member

leventov commented Aug 1, 2019

@WilliamWhispell there is nothing wrong with this line. The emptyToNullIfNeeded() method may be misused somewhere, but the method itself is fine.

@pantaoran

I just learned of the existence of the config field druid.generic.useDefaultValueForNull from the 0.13.0 release notes, but I can't find any mention of the field in the general configuration reference at https://druid.apache.org/docs/latest/configuration/index.html

And if the flag is implemented and released, why is this issue still open? Sorry I don't have time to read through all the replies here. But could you please document the config field?

@pan3793
Member

pan3793 commented May 27, 2020

Reopening because this is not a completely finished project.

Has this proposal been completely finished now?

@noskcire11

I guess proposal item 6 means to preserve column metadata for all-null columns, right?

In my case (1 of 2 cases): I have a column which only contains null values (for now). After ingesting it into Druid, the column is missing (the datasource doesn't have that column), so my query from some app to Druid using that column ended up as an error (no column). And when some other segment is ingested with the column containing non-null values, the datasource finally has that column (even though only 1 segment has a non-null value).

The other case is with Turnilo, which I assume only uses the last segment's metadata as the datasource structure when Turnilo refreshes it. So if the last segment ingested contains an all-null column, it results in an error.

@bharadwajrembar

bharadwajrembar commented Dec 16, 2020

Hey @noskcire11, how did you overcome this? All segments have null values for the dimension, so the dimension ends up being removed from the column metadata for that datasource.

In my case (1 of 2 cases): I have a column which only contains null values (for now). After ingesting it into Druid, the column is missing (the datasource doesn't have that column), so my query from some app to Druid using that column ended up as an error (no column). And when some other segment is ingested with the column containing non-null values, the datasource finally has that column (even though only 1 segment has a non-null value).

@steliosph

In my case (1 of 2 cases): I have a column which only contains null values (for now). After ingesting it into Druid, the column is missing (the datasource doesn't have that column), so my query from some app to Druid using that column ended up as an error (no column). And when some other segment is ingested with the column containing non-null values, the datasource finally has that column (even though only 1 segment has a non-null value).

From my understanding and testing, it appears that if Druid has no values in any of the rows of a column (all nulls), it will automatically drop that column when completing the ingestion.
I don't know if there is any handling to prevent this.

@jihoonson
Contributor

In my case (1 of 2 cases): I have a column which only contains null values (for now). After ingesting it into Druid, the column is missing (the datasource doesn't have that column), so my query from some app to Druid using that column ended up as an error (no column). And when some other segment is ingested with the column containing non-null values, the datasource finally has that column (even though only 1 segment has a non-null value).

This is a SQL layer issue: it cannot make a query plan with a missing column. The Druid query engine treats a missing column and a column of all null values the same.

@github-actions

github-actions bot commented Jun 1, 2023

This issue has been marked as stale due to 280 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If this issue is still
relevant, please simply write any comment. Even if closed, you can still revive the
issue at any time or discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Jun 1, 2023
@github-actions

This issue has been closed due to lack of activity. If you think that
is incorrect, or the issue requires additional review, you can revive the issue at
any time.

@github-actions github-actions bot closed this as not planned Jun 30, 2023
@jmitterh

"Float columns - nulls are considered equivalent to 0." Is this still true?
