NULL value support for all data types #4230
Here is the high-level idea
One open question is: should we restrict generating the presence vector to the segment creation phase, or allow generating it on the fly when we load the segment (similar to the inverted index)? Generating it on the fly means we need to derive the presence vector from the forward index. We can't do this as-is, because we represent NULL using the defaultValue in the fieldSpec. One workaround is to treat the default value in the forward index as NULL. I think this might work but I'm not 100% sure. Does anyone see any issues with treating the default value in the forward index as implying NULL? For the first phase, I think restricting it to the segment generation phase should be good enough. Thoughts?
A few thoughts:
@sunithabeeram Changed the bullets to a numbered list.
A few questions:
Good questions.
For 4: Is there a reason users would not want to filter out NULL values for metrics, given the default value of 0 for metrics? Also +1 to building the presence vector in the segment generation phase.
Can't think of a case where users would NOT want to filter out NULL for metrics. I feel our first goal should be to let users get the behavior they want without compromising much on performance. So for HyperLogLog one can do one of the following.
But, as you mentioned, for aggregation functions such as min, max, avg the user can't really get the desired behavior.
Interacting through the issue is a bit limiting as it's hard to reference comments appropriately. Will try my best. For #1 in the questions I asked, as I think more through this: isNullable seems to just be an optimization hint, correct? All columns are currently nullable, so adding that explicitly could be confusing: what does it mean when a user says isNullable=false and we encounter null values? Should an exception be thrown? (i.e., if the column is not set as nullable OR a defaultValue is not specified, throw an exception. This is probably a better thing to do in my mind, because currently our default defaultNullValues could really backfire for certain aggregations; but this is introducing a behavior change, and I'm not sure if you had that in mind.)
@kishoreg for 4: If it is safe to assume that users would always want to filter out NULLs for metrics, why not implicitly filter out NULLs for metrics in the first implementation (as opposed to eventually)? @sunithabeeram Re-iterating our offline discussion (to share with all): unless there's a functional reason for adding isNullable in the fieldSpec, I'd recommend not adding it. We should auto-derive it (and put it in metadata) during segment generation. The only issue would be consuming segments, and in that case we can always build the presence vector for all columns. I see the following issues with requiring this information in fieldSpec:
Good suggestions @mayankshriv @sunithabeeram. Yes, we can filter out NULLs for metrics automatically if we see the presence vector. @mayankshriv what is the issue with consuming segments?
Consuming segments will need a mutable presence vector. While committing the segment to the persistent store, the presence vector should also be translated. The default value for a metric is 0, and a metric can very likely have an actual value of 0, so translating 0 to a null value (i.e. building the presence vector on the server while loading) is not an option, I would think.
@mcvsubbu is there an issue if the presence vector is always built? If I understand things correctly, I believe the current state of handling nullables is a bit unfortunate. The scenario you mentioned is not just an issue for building the presence vector; as @mayankshriv pointed out, it can actually result in incorrect aggregations. Is the defaultNullValue of 0 just for metrics? Or is it based on datatype?
@mcvsubbu The presence vector is nothing but a single roaring bitmap for each column. When we ingest a row, we set the corresponding bit to 1 (not null) / 0 (null). In the case of real-time it will be a mutable roaring bitmap, and in the case of offline an immutable roaring bitmap.
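The per-column bitmap described above can be sketched as follows. This is a hypothetical illustration using java.util.BitSet as a stand-in for the mutable/immutable roaring bitmap Pinot would actually use; the class and method names are not Pinot APIs.

```java
import java.util.BitSet;

// Hypothetical sketch of a per-column presence vector. In Pinot this would be
// a mutable roaring bitmap for consuming (real-time) segments and an immutable
// one for offline segments; BitSet is used here only to keep the sketch
// self-contained.
public class PresenceVector {
    private final BitSet present = new BitSet();

    // Called once per ingested row, in docId order: bit = 1 means "not null".
    public void add(int docId, boolean isNull) {
        present.set(docId, !isNull);
    }

    public boolean isNull(int docId) {
        return !present.get(docId);
    }
}
```

Because the bitmap only records presence, the forward index can keep storing the defaultNullValue unchanged, which preserves backward compatibility.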
Exactly what I was thinking. However, don't we need a switch to decide whether to use the presence vector or retain the current behavior? Or are you thinking that the current behavior will be obsoleted (i.e. null becomes the default value)?
@mcvsubbu there is absolutely no change in the existing behavior of what gets stored in the forward index in the segment. Null values will be auto-populated with the default value. The only change I am proposing is to have an additional presence vector that will allow users to skip rows where x is NULL by adding a predicate x != NULL. This will make sure that there is no change in existing behavior (even though it's incorrect in some cases). Mayank is suggesting that we add this predicate automatically if x is a metric. This will fix the incorrect behavior. We can control this behavior via a feature flag if needed.
Just adding the predicate x != NULL is not good enough. We need to plug the NULL handling logic into all predicates as well. E.g.
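The concern above can be sketched as follows: a predicate's raw matches over the forward index (where nulls are physically stored as the default value) must be intersected with the column's presence vector, so rows that are actually NULL never match predicates such as x != 10. This is an illustrative sketch, not Pinot's predicate evaluation code, and it again uses java.util.BitSet in place of a roaring bitmap.

```java
import java.util.BitSet;

// Hypothetical sketch: make any predicate null-aware by intersecting its raw
// matches (computed over the forward index, where nulls are stored as the
// default value) with the column's presence vector (bit = 1 means not null).
public class NullAwarePredicate {
    public static BitSet apply(BitSet rawMatches, BitSet presenceVector) {
        BitSet result = (BitSet) rawMatches.clone();
        result.and(presenceVector);  // drop docIds whose value is actually NULL
        return result;
    }
}
```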
@Jackie-Jiang yes. Also, you logged an issue to document the behavior of default values: #4188. Having that will definitely help. Is anyone working on that?
@kishoreg bumping up one point which I would like to get explicit acknowledgement/agreement on. Currently, all columns are nullable. If the user doesn't specify a defaultNullValue, we use "default" values (as mentioned in the description of this issue as well as confirmed by @mcvsubbu in one of the comments). The problem with the "default" defaultNullValues is that they can result in incorrect behavior for metrics or dimensions. I think it is important that this be corrected as we attempt to build presence vectors. Here is the behavior I would like considered going forward:
If the above is not acceptable, it would be good to clarify the reasons we think it's OK to continue with the existing default defaultNullValues in the face of incorrectness. And we should update the user-facing documentation so it's clear to users what they can expect. Let me know what you think.
We should avoid asking users to upgrade the schema, to maintain backward compatibility; it will be challenging operationally. If I understand you correctly, what you are suggesting is basically the same as my original proposal of adding an isNullable field: instead of adding isNullable explicitly in the fieldSpec, you are proposing deriving it based on the presence/absence of defaultNullValue. I prefer having isNullable explicit instead of deriving it from defaultNullValue. It might be easier if we break this into two parts and look at the options on the table: segment generation and query execution. Segment generation (creating the presence vector): options
Query Execution: Options
Note: we are not going to change what's stored in the forward index. We will always use defaultNullValue (it can be the default defaultNullValue based on dataType, or something the user specifies in the schema). I prefer starting with option 2 for segment generation and option 1 for query execution. This is backward compatible with current behavior and gives users the flexibility to achieve the behavior they want by changing the fieldSpec and the query as needed. We can always move to option 1 for segment generation by flipping the default to createPresenceVector=true in phase 2. We can also start to move to option 2 for query execution over time.
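As a sketch, option 2 for segment generation could surface in the schema roughly like this. This is a hypothetical fieldSpec fragment: createPresenceVector is the flag name used in this discussion (not a committed Pinot schema property), and the column name is made up.

```json
{
  "dimensionFieldSpecs": [
    {
      "name": "campaignId",
      "dataType": "STRING",
      "defaultNullValue": "null",
      "createPresenceVector": true
    }
  ]
}
```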
I would prefer to always assume all columns are nullable unless the user explicitly marks them nonNull (which is very rare, but might be useful for the boolean type; a default null value is required).
In the long term, maybe we can consider leveraging Apache Arrow's in-memory columnar and wire format for our execution engine; then we would get support for nullability as part of moving to Arrow (not suggesting that we should move to Arrow just to support nullability, but bringing up Arrow as something we may want to consider). An Arrow vector is an off-heap data structure for a column of any particular type (it supports both primitive and complex types). The vector has 2 buffers in direct memory: a data buffer for storing column values, and a nullability buffer (a bitmap) indicating whether the corresponding column value is NULL.
@siddharthteotia Yes, at some point we would love to support various formats such as Parquet, ORC etc., and Arrow will definitely be something that we can consider. For now, Pinot execution code is tightly coupled with the storage format, and bringing in Arrow would require a lot of changes.
@kishoreg One potential issue with query execution option 1 (explicitly adding x != NULL) is when there are multiple aggregations, e.g.: select sum(col1), avg(col2) where ... col1 != NULL AND/OR ... col2 != NULL. Here we need to come up with the right filtering condition. Do you think making it implicit might be better?
@icefury71 Good point. I think the challenge is to come up with the right filtering condition. Whether the filtering is explicit or implicit is just a syntactic convenience.
Updating this thread with some of the offline discussions I had with a few folks (Devesh and @fx19880617). This proposal is to filter out null values in the form of predicates (leaf nodes of query execution). This has 2 potential issues:
Looks like a better approach might be to "tag" docIDs as null or non-null and let the aggregation function (or another intermediate node) process them in the right manner. However, this will affect the entire query execution flow and is a much bigger effort than what's proposed in this issue. Thoughts?
@icefury71 Good questions. I don't have a good answer. Let's create a separate issue on how to use presence vectors in the query execution layer. At a high level, there are 3 possible options.
For now, let us focus on supporting option 1. We can start the design discussion for 2/3.
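The "tag docIDs and let the aggregation handle it" direction could look roughly like this: each aggregation function consults its own column's presence vector, so a query like select sum(col1), avg(col2) needs no single combined filter condition. This is a hypothetical sketch, again with java.util.BitSet standing in for a roaring bitmap, and is not Pinot's aggregation code.

```java
import java.util.BitSet;

// Hypothetical sketch: a null-aware sum that skips docIds whose bit is not set
// in the column's presence vector (i.e. rows where the value is NULL and the
// forward index only holds the default value).
public class NullAwareSum {
    public static long sum(long[] forwardIndex, BitSet presence) {
        long total = 0;
        // Iterate only over docIds marked "present" (not null).
        for (int docId = presence.nextSetBit(0); docId >= 0; docId = presence.nextSetBit(docId + 1)) {
            total += forwardIndex[docId];
        }
        return total;
    }
}
```

Because each aggregation filters against its own column's bitmap, sum(col1) and avg(col2) can each skip a different set of rows within the same query.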
Opened PR: #4585
This PR adds support for a presence vector inside mutable and immutable segments. This will enable the query layer to ignore null values in the corresponding columns. Please see this issue for more details: #4230. The high-level gist is as follows: create a presence vector per column per segment by default. This presence vector keeps track of which document IDs have a null value in the corresponding column. Expose this through the DataSource interface to enable the caller to ignore such document IDs. The presence vector is not actually used anywhere as part of this PR; subsequent PRs will use it to enable filtering out columns in predicates (e.g. select ... from table where column_name != null).
Adding null predicate support in reference to Issue #4230 This PR adds limited support for "IS NULL" and "IS NOT NULL" filter predicates. Currently this only works for leaf filter predicates.
Adding an end-end integration test for creating a Pinot table with Kafka records including null values and running queries against this table (with the IS NOT NULL predicate). Related to #4230
I'm trying to plot a timeseries with gaps showing null/missing data. Trying to avoid doing:
Can this be done in the query?
I'm not sure I understand your question. I'm assuming you want to filter out null values in your query response; does the IS NOT NULL predicate work for you? Or do you want this to happen automatically without specifying such predicates?
No, I want to include null values as null and not as the defaultNullValue, so I can distinguish the two cases: if the defaultNullValue is zero and nullHandling is enabled, some points may actually be zero. (Sample data points elided.) So I can make a plot with gaps.
Unfortunately that's not possible with the current implementation. The current implementation will ignore nulls or treat them as default values in the filter operator; everything else in query processing currently does not care about nulls. It is definitely in the scope of this issue to ensure all query processing layers can handle null values; however, that hasn't been done yet.
Hi, just to summarize: if a column is of INT type and we would like to check IS NOT NULL, does it mean we need to do
@monicaluodialpad If the purpose is just to exclude the null values, with null handling enabled,
Currently in Pinot we don't have real NULL value support, but use some special default values for NULL. For dimensions, the default value is the minimum value for numeric types, "null" for STRING, empty byte array for BYTES; for metrics, the default value is 0 for numeric types, empty byte array for BYTES. #4214 adds support to treat empty byte array as NULL for BYTES, but it's not the general way to support NULL for all data types.
We should add another type of index to mark if the value is null, and apply that while filtering or fetching the values to skip all NULL values.