Schema evolution: Default values for newly added columns #161
Conversation
Solution 1: High-level idea: Pinot already mandates registering a schema for a realtime table. We can easily extend this to offline tables as well and make it mandatory if clients need support for default values. On loading a segment, we can simply look at the columns in the schema and in the segment. If we find columns missing, we can create a sorted forward index column. This column will have only one value, derived from the default value in the schema. Once we generate this, the rest of the query execution code should work as usual. Things to consider:
Thoughts?
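A minimal sketch of the idea in Solution 1, assuming hypothetical class and method names (this is not Pinot's actual API): compute the columns present in the schema but absent from the segment, and for each one build a one-entry dictionary plus a forward index in which every document points at dictionary id 0, i.e. a trivially sorted column holding only the schema default.

```java
import java.util.*;

// Illustrative sketch only; class and method names are hypothetical.
public class DefaultColumnSketch {

    // Columns to synthesize = schema columns minus segment columns.
    public static Set<String> missingColumns(Set<String> schemaColumns,
                                             Set<String> segmentColumns) {
        Set<String> missing = new TreeSet<>(schemaColumns);
        missing.removeAll(segmentColumns);
        return missing;
    }

    // Forward index for a synthesized column: every docId maps to
    // dictionary id 0, whose value is the schema default. A single
    // dictionary entry makes the column trivially sorted.
    public static int[] buildConstantForwardIndex(int numDocs) {
        int[] fwd = new int[numDocs];
        Arrays.fill(fwd, 0);
        return fwd;
    }

    public static void main(String[] args) {
        Set<String> missing = missingColumns(
                new HashSet<>(Arrays.asList("d1", "m1", "c_new")),
                new HashSet<>(Arrays.asList("d1", "m1")));
        System.out.println(missing); // [c_new]
        System.out.println(buildConstantForwardIndex(3).length); // 3
    }
}
```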
The biggest issue is if the default value changes: if we wrote a new column to disk, then we can't really fix it with a new default value. What we could do instead is have the column load generate a new column on the fly, based on the schema values, if it doesn't exist, with nothing written to disk. Since the column has only one value, it's pretty trivial to generate and shouldn't have any significant impact on performance. We'd also need to intercept calls to metadata/dictionaries for columns that don't exist and synthesize metadata/dictionaries for them. Of course, when we support nulls properly, the synthetic column should just contain nulls.
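The "nothing written to disk" alternative could be sketched as below, with purely illustrative names (this is not Pinot's actual API): a column reader synthesized at segment load time that answers every docId with the schema default, so a later change to the default only changes this in-memory object.

```java
// Hypothetical sketch of an in-memory synthesized column.
public class VirtualDefaultColumn {
    private final Object defaultValue;
    private final int numDocs;

    public VirtualDefaultColumn(Object defaultValue, int numDocs) {
        this.defaultValue = defaultValue;
        this.numDocs = numDocs;
    }

    // Every document carries the same (default) value.
    public Object getValue(int docId) {
        if (docId < 0 || docId >= numDocs) {
            throw new IndexOutOfBoundsException("docId " + docId);
        }
        return defaultValue;
    }

    // Synthesized metadata: one distinct value, so cardinality is 1.
    public int getCardinality() {
        return 1;
    }
}
```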
If we need to change the default value (which would be very rare), then with approach 1 we can compare the default values, regenerate the column with the new default, and delete the previously generated default column. That way there is no additional in-memory complexity.
@jfim: It should be possible to change the persisted default value column.
@antumbde Good point about MV columns. I am guessing it's OK to have a sorted forward index even for MV columns, as long as the default value is just one element. Adwait, I am not sure if I get 4. Are you saying we should use the default value even if we don't generate a new sorted index file for new columns? If we have to do that, then we can completely avoid creating the file, as @jfim suggested, and create them when needed in memory.
If we write a column that has only default null values, there's no way to determine that the column is synthetic unless you write a flag in the metadata, which is just another thing that we need to keep track of when we load segments. This means that now instead of just making a column at load time, we need to:
If there's no "synthetic column" flag in the metadata, this gets even worse, because we have to have some weird logic somewhere that checks whether it's a column with a single value. Of course, you can't just check if it's a column with a single value, because any segment that has a sorted time column would likely have a single value too, and you don't want to replace that with default null values. So then we need to make this logic smarter and figure out a reasonable set of assumptions as to when we can assume the materialized column was created for default null values. If we get it wrong, then we just destroyed some user data and we need to ask the users to fix the segments, and at that point they would have been better off just backfilling on their own.

I don't remember offhand the on-disk format for multivalues, but if I remember correctly, it's just indices into a linear array of values. If that's the case, we can just have all of them index to the same value, or make a special multivalue column type for synthetic columns that always returns empty arrays.
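The multi-value point above can be sketched as follows; this is illustrative only, not the actual on-disk layout. If an MV column stores, per document, indices into one linear value array, a synthetic MV column can either point every document at the same single default value, or simply return empty arrays on every read.

```java
// Hypothetical sketch of the two synthetic multi-value options.
public class SyntheticMultiValueColumn {
    // Option 1: every document references index 0 of a one-element
    // value array holding the default.
    public static int[][] constantIndices(int numDocs) {
        int[][] indices = new int[numDocs][];
        for (int i = 0; i < numDocs; i++) {
            indices[i] = new int[]{0};
        }
        return indices;
    }

    // Option 2: a synthetic MV column whose reads always yield an
    // empty array.
    public static int[] emptyRead() {
        return new int[0];
    }
}
```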
@kishoreg Regarding point 4 - your original things to consider mentioned -
@antumbde Got it. No, I don't intend to keep the previous behavior; we will process both s1 and s2 as you suggested. I was more concerned about performance. Let's say the query is select sum(foo) from myTable. Earlier, we would skip segments that don't have column 'foo', but now we will end up scanning all segments to compute the sum. The answers should match, given that the default value for foo is 0.
Thanks |
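The correctness claim above can be checked with a tiny worked example; the numbers are made up for illustration. For select sum(foo), scanning a segment whose foo was synthesized with default 0 adds 0 per row, so the result matches what skipping that segment would have produced.

```java
// Made-up data illustrating sum(foo) over a real and a synthesized column.
public class DefaultSumExample {
    public static long sum(long[] values) {
        long s = 0;
        for (long v : values) {
            s += v;
        }
        return s;
    }

    public static void main(String[] args) {
        long[] s2New = {3, 4};      // segment that physically has foo
        long[] s1Old = {0, 0, 0};   // synthesized foo: default 0 per doc
        // Scanning both segments gives the same sum as skipping S1_OLD.
        System.out.println(sum(s2New) + sum(s1Old)); // 7
    }
}
```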
Current coverage is 46.55% (diff: 80.67%)

@@             master     #161    diff @@
==========================================
  Files           707      712      +5
  Lines         33231    33447    +216
  Methods           0        0
  Messages        0         0
  Branches       4976     5008     +32
==========================================
+ Hits          15335    15572    +237
+ Misses        16678    16632     -46
- Partials       1218     1243     +25
Force-pushed from 58b4972 to aa26a07.
Updated the PR. Tested locally for v1/v2 format. Also added support for updating the default value for v1/v2 format (v3 does not support index removal as of now). Added a new class, schemaEvolutionHandler, that does the bulk of the work.
TESTING TBD
segmentWriter.removeIndex(column, ColumnIndexType.DICTIONARY);
segmentWriter.removeIndex(column, ColumnIndexType.FORWARD_INDEX);
} else {
LOGGER.warn("Updating default value is not supported in {} segment version",
Error out here. It will be difficult to track warnings afterwards.
We will need to restart servers if the schema changes. This is increasingly difficult for shared clusters with large datasets, like the UMP cluster. We should have the controller send an alter-table request to servers over HTTP. That's easier to manage (self-serviceable), causes no service disruption, and is more reliable if there are errors (we know those upfront). The controller will not have to handle failed servers, which keeps it simple. The server already runs an HTTP server now.
We could also do messages over Helix?
One difference with messages over Helix is status feedback. Over HTTP the user will immediately know if there are failures. For Helix, IIUC, we may not know. Operation completion is more deterministic over HTTP.
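The status-feedback point above can be sketched as follows; the transport is stubbed out and all names are hypothetical, not Pinot's actual API. Over HTTP, the controller can fan a reload request out to every server and aggregate per-server results synchronously, so the operator sees failures upfront.

```java
import java.util.*;

// Illustrative fan-out sketch; ReloadClient stands in for an HTTP call.
public class ReloadFanOut {
    interface ReloadClient {
        boolean reload(String server, String table);
    }

    // Returns the servers whose reload failed, so errors are known
    // immediately rather than discovered later in logs.
    public static List<String> reloadAll(List<String> servers, String table,
                                         ReloadClient client) {
        List<String> failed = new ArrayList<>();
        for (String server : servers) {
            if (!client.reload(server, table)) {
                failed.add(server);
            }
        }
        return failed;
    }
}
```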
return loadSegment(indexDir, readMode, indexLoadingConfigMetadata, null);
}

public static IndexSegment loadSegment(File indexDir, ReadMode readMode, IndexLoadingConfigMetadata indexLoadingConfigMetadata, Schema schema) throws Exception {
return Loaders.IndexSegment.load(indexDir, readMode, indexLoadingConfigMetadata);
return Loaders.IndexSegment.load(indexDir, readMode, indexLoadingConfigMetadata, schema); ?
Should I take this change and work on all the issues, or add unit tests after you fix the comments and check in this PR?
I will respond to the comments. Please start working on the unit tests.
For operation, I prefer one method to trigger re-load, either refresh-message based or HTTP based. While HTTP based feels simpler and easier, it's much harder to control and to handle faults. For example, with refresh messages we can control how many segments are getting re-loaded at once across the cluster, per table, etc. With HTTP we have to re-implement throttling. For long-running tasks, we need to handle faults: what if the controller goes down, what if a server goes down or is not responding, etc. Helix provides all these features already. If the end goal is for the user to know the result, then we can have a call to the controller that waits until schema updates are processed before responding. Note that Helix messages can also carry status updates; the current implementation of the message handler in Pinot does not use that feature. Another option is to have nodes automatically watch for schema updates and trigger segment refresh. It makes sense to do this once we have this feature well tested in production.
// delete the temporary directory
FileUtils.deleteDirectory(tmpIndexDir);
return new SegmentMetadataImpl(indexDir);
delete backup metadata
Regarding the HTTP vs. Helix discussion: my main point is for the operator to know
For this specific operation, failed-node handling is trivial: restarting servers will see the changed schema and apply it. I agree that Helix is a better mechanism in general. We can do it over Helix, or a hybrid of a controller HTTP request plus Helix. I'm not sure if we can get error reasons over Helix, but without that, having to go through logs impacts manageability.
Force-pushed from 80c7856 to cc42e00.
Force-pushed from bd32f42 to c586793.
LGTM. I tested it by loading a sample table, posting a schema with a new field, and disabling/enabling the segments. I saw that the new column was generated for the v1 format.
Added functionality for adding new columns with default values in DefaultColumnHandler. DefaultColumnHandler takes a schema argument to control the newly added columns; the schema is passed down from the Helix property store. Inside SegmentMetadataImpl, added methods to modify and store the segment metadata. Supports ADD, UPDATE, and REMOVE operations on newly added columns. Supports segment formats v1 and v3. The DefaultColumnHandler is constructed inside SegmentPreProcessor. Added unit tests and an integration test for all types of newly added columns.
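The ADD/UPDATE/REMOVE dispatch described above might be classified roughly as below; the enum and method names are illustrative, not the PR's actual code. Each default column is compared against the schema propagated from the Helix property store.

```java
// Hypothetical sketch of classifying default-column operations.
public class DefaultColumnActionSketch {
    enum Action { ADD, UPDATE, REMOVE, NONE }

    public static Action classify(boolean inSchema, boolean inSegment,
                                  boolean defaultChanged) {
        if (inSchema && !inSegment) {
            return Action.ADD;      // newly added column
        }
        if (!inSchema && inSegment) {
            return Action.REMOVE;   // previously generated default column dropped
        }
        if (inSchema && defaultChanged) {
            return Action.UPDATE;   // default value changed in the schema
        }
        return Action.NONE;
    }
}
```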
Force-pushed from c586793 to a3514c3.
One of the most common requests from users is: "When we add new columns, can the old segments have default values for these columns?" Currently, Pinot forces users to re-bootstrap old data when new columns are added. Most users are OK with using default values in old segments for newly added columns.
For example, let's say a user added a new column c_new to a table that currently has the columns d1 and m1.
Let's say we have two segments: S1_OLD (contains d1, m1) and S2_NEW (contains d1, m1, c_new).
Let's go over the queries and their current behavior in Pinot:
If the query does not mention new column, results will be correct and as expected.
If the query specifies the new column, then we will not consider old data (without the new column) while processing this query. In this case, whether the result is right or wrong depends on the other filters in the query.
For example, let's look at the current behavior in Pinot when a new column c_new is added at time T1
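The S1_OLD / S2_NEW scenario above can be simulated with made-up rows (all values here are illustrative, not from the issue): a query that only touches m1 is correct either way, but a query that references c_new and therefore skips S1_OLD silently drops S1_OLD's rows, which is the behavior the default-value feature addresses.

```java
// Made-up data for the two-segment scenario described above.
public class SchemaEvolutionExample {
    public static long sum(long[] values) {
        long s = 0;
        for (long v : values) {
            s += v;
        }
        return s;
    }

    public static void main(String[] args) {
        long[] m1Old = {1, 2};  // S1_OLD: has d1, m1 only
        long[] m1New = {3};     // S2_NEW: has d1, m1, c_new

        // select sum(m1): no new column referenced -> correct either way.
        System.out.println(sum(m1Old) + sum(m1New)); // 6

        // A query referencing c_new that skips S1_OLD only counts
        // S2_NEW's rows, under-reporting the result.
        System.out.println(sum(m1New)); // 3, not 6
    }
}
```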