Schema evolution: Default values for newly added columns #161

Merged — 1 commit merged into master from schema_evolution_default_value on Aug 15, 2016

Conversation

@kishoreg (Member) commented Jul 1, 2016

One of the most common requests from users is: "when we add new columns, can the old segments have default values for these columns?" Currently, Pinot forces users to re-bootstrap old data when new columns are added. Most users are fine with using default values in old segments for newly added columns.

For example, let's say a user adds a new column c_new to a table that currently has columns d1 and m1.
Let's say we have two segments: S1_OLD (contains d1, m1) and S2_NEW (contains d1, m1, c_new).

Let's go over the queries and their current behavior in Pinot.
If the query does not mention the new column, results will be correct and as expected.

If the query references the new column, then old data (without the new column) is not considered while processing the query. In this case, whether the result is right or wrong depends on the other filters in the query.

For example, let's look at the current behavior in Pinot when a new column c_new is added at time T1:

select sum(m1) from Table // this will work in all cases
select sum(m1) from Table where c_new = x // this will be incorrect, as Pinot considers only new data
select sum(m1) from Table where c_new = x and time > T1 // this will be correct, since all data after T1 has c_new
select sum(c_new) from Table // this works and returns the correct result
select sum(c1) from Table // this also works as expected
but
select sum(c_new), sum(m1) from Table // THIS WON'T WORK AS EXPECTED

@jfim (Member) commented Jun 8, 2016

select * from Table is also broken; it will return a non-deterministic set of columns.

@kishoreg (Member, Author) commented Jun 9, 2016

Solution 1:

High level idea: Pinot already mandates registering a schema for a realtime table. We can easily extend this to offline tables as well, and make it mandatory if clients need support for default values. On loading a segment, we simply compare the columns in the schema against those in the segment. If we find missing columns, we create a sorted forward index for each of them. The column will have only one value, derived from the default value in the schema. Once we generate this, the rest of the query execution code should work as usual.
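
A minimal sketch of this load-time fill, assuming hypothetical SchemaColumn and SegmentDir types (illustrative stand-ins, not actual Pinot APIs):

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch only: SchemaColumn and SegmentDir are hypothetical
// stand-ins, not actual Pinot APIs.
public class MissingColumnFiller {

  // On segment load, create a single-value sorted column for every schema
  // column that the segment does not already contain.
  public static void fillMissingColumns(Set<String> segmentColumns,
      List<SchemaColumn> schemaColumns, SegmentDir segmentDir) {
    for (SchemaColumn column : schemaColumns) {
      if (!segmentColumns.contains(column.name())) {
        // The dictionary gets exactly one entry (the schema default) and the
        // forward index maps every docId to dictionary id 0, so the column is
        // trivially sorted and the rest of query execution works unchanged.
        segmentDir.writeSortedSingleValueColumn(column.name(), column.defaultValue());
      }
    }
  }

  interface SchemaColumn {
    String name();
    Object defaultValue();
  }

  interface SegmentDir {
    void writeSortedSingleValueColumn(String name, Object defaultValue);
  }
}
```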

Things to consider

  • Should we update the segment metadata when we add new columns? Until now we have always avoided modifying a segment.
  • What happens to already loaded segments? Should we force a server restart, or use the segment refresh command to force a reload of the segment?
  • What happens if the default value changes? With the new segment format version, we don't support index removal yet.
  • Some existing use cases might rely on the current behavior where we skip segments that don't contain the columns in the query. We need to make sure performance is not degraded for such use cases.

Thoughts?

@jfim (Member) commented Jun 9, 2016

The biggest issue is the default value changing: if we wrote a new column to disk, we can't really fix it with a new default value.

What we could do instead is have the column load generate a new column on the fly, based on the schema values, if it doesn't exist, with nothing written to disk. Since the column has only one value, it's trivial to generate and shouldn't have any significant impact on performance. We'd also need to intercept calls to metadata/dictionaries for columns that don't exist and synthesize metadata/dictionaries for them.

Of course, once we support nulls properly, the synthetic column should just contain nulls.
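
A minimal sketch of this in-memory alternative, assuming a hypothetical ColumnReader interface (not Pinot's actual column abstraction):

```java
// Illustrative sketch: ColumnReader is a hypothetical interface, not Pinot's
// actual column abstraction.
public class SyntheticColumnReader implements ColumnReader {
  private final Object defaultValue;

  public SyntheticColumnReader(Object defaultValue) {
    this.defaultValue = defaultValue;
  }

  @Override
  public Object getValue(int docId) {
    // Nothing is materialized on disk; every docId maps to the schema default.
    // If the default changes, the next segment load simply picks up the new one.
    return defaultValue;
  }

  @Override
  public int cardinality() {
    return 1; // the synthesized dictionary has a single entry
  }
}

interface ColumnReader {
  Object getValue(int docId);
  int cardinality();
}
```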

@raviarin (Contributor) commented Jun 9, 2016

If we need to change the default value (which should be very rare), then with approach 1 we can compare the default values, regenerate the column with the new default, and delete the previously generated default column. That way there is no additional in-memory complexity.

@antumbde (Contributor) commented

" We will have only one value in this column..."
What is your thinking on multi-value columns ? I guess, you mean single
array as one value for multi-valued columns

  1. We should persist these columns for server scalability, as we can easily mmap those indexes. Otherwise, on shared multi-tenant clusters with, say, 2k segments and 3 additional columns, we need 6k indexes all held in memory. That can quickly offset some of the gains of the new file format.
  2. We should avoid server restarts for segment-level changes. We already require them for inverted indexes, but this can again be an issue in a shared cluster. Instead, we can have servers 'force refresh' a segment using the http admin endpoint (guaranteeing the action is an issue with an http call).
  3. Removal is possible but very costly with the new segment format. We can support it if really needed. If default value changes are rare, we can model them as the addition of yet another new column; the older one sticks around as the default. The downside is that applications need to change for the changed column name. One option is to logically delete the old column by marking its range invalid; no major performance impact, but there is some storage and memory scale impact.
  4. Regarding users relying on Pinot skipping segments: what will select count(*) from myTable where new_column = new_column_default_value return? Zero, if we skip segments. That won't be correct, and it will confuse many new users. We should avoid this and rely on simple behavior: existing segments are handled as if the column has the default value.

@jfim: It should be possible to change a persisted default value column. What am I missing?


@kishoreg (Member, Author) commented

@antumbde Good point about MV columns. I'm guessing it's OK to have a sorted forward index even for MV columns, as long as the default value is just one element.

Adwait, I am not sure I get point 4. Are you saying we should use the default value even if we don't generate a new sorted index file for new columns?

If we have to do that, then we can completely avoid creating the file, as @jfim suggested, and create the columns in memory when needed.

@jfim (Member) commented Jun 10, 2016

If we write a column that has only default null values, there's no way to determine that the column is synthetic unless we write a flag in the metadata, which is just another thing we need to keep track of when we load segments. This means that instead of just making a column at load time, we need to:

  1. Check if the column exists; if it doesn't, generate one.
  2. Check if the column was a generated default value column (either via a synthetic column flag in the metadata or by checking whether it's a single-value column), and if it doesn't match the current default value, regenerate it.

If there's no "synthetic column" flag in the metadata, this gets even worse, because we have to have some odd logic somewhere that checks whether a column has a single value. Of course, you can't just check whether a column has a single value, because a sorted time column, for example, could legitimately have a single value too, and you don't want to replace that with default null values. So then we need to make this logic smarter and figure out a reasonable set of assumptions for when we can assume the column was materialized for default null values. If we get it wrong, we've just destroyed some user data and have to ask the user to fix the segments, and at that point they would have been better off just backfilling on their own.
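
The two checks above might look like the following at load time; this is a sketch assuming a hypothetical isAutoGenerated metadata flag, not an actual Pinot field:

```java
// Sketch only: SegmentMeta, ColumnSchema and the isAutoGenerated flag are
// illustrative assumptions, not actual Pinot metadata fields.
public class DefaultColumnReconciler {

  void reconcile(String column, SegmentMeta meta, ColumnSchema schema) {
    if (!meta.hasColumn(column)) {
      // Step 1: the column is missing entirely; generate it from the schema default.
      meta.addAutoGeneratedColumn(column, schema.defaultValue(column));
    } else if (meta.isAutoGenerated(column)
        && !meta.storedDefault(column).equals(schema.defaultValue(column))) {
      // Step 2: the column was synthesized earlier but the default has changed;
      // drop it and regenerate with the new default. Without the explicit flag
      // we would have to guess, and a wrong guess destroys real user data.
      meta.removeColumn(column);
      meta.addAutoGeneratedColumn(column, schema.defaultValue(column));
    }
  }

  interface SegmentMeta {
    boolean hasColumn(String column);
    boolean isAutoGenerated(String column);
    Object storedDefault(String column);
    void removeColumn(String column);
    void addAutoGeneratedColumn(String column, Object defaultValue);
  }

  interface ColumnSchema {
    Object defaultValue(String column);
  }
}
```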

I don't remember offhand the on-disk format for multivalues, but if I remember correctly, it's just indices into a linear array of values. If that's the case, we can just have all of them index to the same value or make a special multivalue column type for synthetic columns that always returns empty arrays.
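
For multi-value columns, a synthetic reader along these lines could return the same array for every docId (sketch; MultiValueReader is a hypothetical interface, not Pinot's actual abstraction):

```java
// Sketch: a synthetic multi-value column returning the same single-element
// array (or an empty array) for every docId. MultiValueReader is a
// hypothetical interface, not Pinot's actual multi-value abstraction.
public class SyntheticMultiValueReader implements MultiValueReader {
  private final Object[] defaultRow;

  public SyntheticMultiValueReader(Object defaultValue) {
    // One shared row: a single-element array holding the default, or an
    // empty array once nulls are properly supported.
    this.defaultRow = defaultValue == null ? new Object[0] : new Object[] {defaultValue};
  }

  @Override
  public Object[] getValues(int docId) {
    return defaultRow; // every docId indexes to the same value array
  }
}

interface MultiValueReader {
  Object[] getValues(int docId);
}
```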

@antumbde (Contributor) commented

@kishoreg Regarding point 4: your original "things to consider" mentioned "Some existing use cases might rely on the current behavior where we skip segments that don't contain the columns in the query."
Say segments s1 and s2 don't have column 'foo', and a user adds the new column 'foo'; s1 and s2 will then have column foo with the default value. Now, when you say users rely on Pinot skipping segments, do you mean that s1 and s2, which did not have 'foo' initially, are completely ignored in query processing? How will we process queries like:

  • select count(*) from myTable where foo = 'default_value'
  • select avg(foo) from myTable // say, foo were a metric

Results will differ depending on whether we skip s1 and s2 or process them as if 'foo' has the default value. I vote for processing s1 and s2.

@kishoreg (Member, Author) commented

@antumbde Got it. No, I don't intend to keep the previous behavior; we will process both s1 and s2 as you suggested. I was more concerned about performance. Let's say the query is select sum(foo) from myTable. Earlier we would skip segments that don't have column 'foo', but now we will end up scanning all segments to compute the sum. The answers should match, given that the default value for foo is 0.

kishoreg self-assigned this Jun 10, 2016
@antumbde (Contributor) commented

Thanks

@codecov-io commented Jul 1, 2016

Current coverage is 46.55% (diff: 80.67%)

Merging #161 into master will increase coverage by 0.41%

@@             master       #161   diff @@
==========================================
  Files           707        712     +5   
  Lines         33231      33447   +216   
  Methods           0          0          
  Messages          0          0          
  Branches       4976       5008    +32   
==========================================
+ Hits          15335      15572   +237   
+ Misses        16678      16632    -46   
- Partials       1218       1243    +25   

Powered by Codecov. Last update 123e562...c586793

kishoreg force-pushed the schema_evolution_default_value branch from 58b4972 to aa26a07 on July 3, 2016 20:43
@kishoreg (Member, Author) commented Jul 3, 2016

Updated the PR. Tested locally for the v1/v2 format. Also added support for updating the default value for the v1/v2 format (v3 does not support index removal as of now).

Added a new class, schemaEvolutionHandler, that does the bulk of the work.
Implementation details (a sketch of this flow follows the list):

  • looks at the schema from ZK and the segment metadata
  • computes the action required on a per-column basis: NEW_DIMENSION, UPDATE_DIMENSION, etc.
  • creates the new columns in a temp directory
  • computes the new metadata based on the columns added
  • once all files are generated in the temp directory, updates the original segment
  • returns the new segmentMetadata
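
A sketch of this flow, with hypothetical stand-ins for the actual classes (ZkSchema, SegmentMetadata and the abstract helpers are assumptions; only the NEW_DIMENSION/UPDATE_DIMENSION action names come from the description above):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Map;

// Sketch of the flow described above; ZkSchema, SegmentMetadata and the
// abstract helpers are hypothetical stand-ins, not the actual Pinot classes.
public abstract class SchemaEvolutionSketch {

  // Per-column actions named in the PR description.
  enum ColumnAction { NEW_DIMENSION, UPDATE_DIMENSION, NEW_METRIC, UPDATE_METRIC }

  interface ZkSchema {}
  interface SegmentMetadata {}

  SegmentMetadata process(ZkSchema schema, SegmentMetadata metadata, File indexDir)
      throws IOException {
    // 1. Diff the ZK schema against the segment metadata, column by column.
    Map<String, ColumnAction> actions = computeActions(schema, metadata);

    // 2. Generate new column files in a temp directory so a failure midway
    //    never leaves the original segment half-updated.
    File tmpDir = Files.createTempDirectory("schema-evolution").toFile();
    for (Map.Entry<String, ColumnAction> entry : actions.entrySet()) {
      writeColumn(tmpDir, entry.getKey(), entry.getValue(), schema);
    }

    // 3. Only once everything is generated, swap the files into the original
    //    segment and rewrite its metadata.
    updateSegment(indexDir, tmpDir);
    return reloadMetadata(indexDir);
  }

  abstract Map<String, ColumnAction> computeActions(ZkSchema schema, SegmentMetadata metadata);
  abstract void writeColumn(File tmpDir, String column, ColumnAction action, ZkSchema schema);
  abstract void updateSegment(File indexDir, File tmpDir) throws IOException;
  abstract SegmentMetadata reloadMetadata(File indexDir);
}
```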

TESTING
I did some manual testing where I ran schemaEvolutionHandler on a segment and started the server, and verified that we are able to retrieve the newly added column.

TBD

  • write a unit test case covering all data types (I manually tested a STRING dimension; we need tests for all data types for both dimensions and metrics)
  • write an integration test that creates a segment with 5 columns, uploads a schema with 5 + additional columns (one for each data type and dimension/metric combination), and then starts the server
  • refresh test case: similar to the previous test case, but we change the schema after the server has started and loaded the segment. We need to either use a refresh message or change the segment state to OFFLINE and back to ONLINE to force a reload.

segmentWriter.removeIndex(column, ColumnIndexType.DICTIONARY);
segmentWriter.removeIndex(column, ColumnIndexType.FORWARD_INDEX);
} else {
LOGGER.warn("Updating default value is not supported in {} segment version",
Review comment (Contributor):

Error out here. It will be difficult to track warnings afterwards.
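
For example, the reviewed branch could throw instead of warn (a sketch only; the exception type and the segmentVersion variable are assumptions, not necessarily what the PR ended up doing):

```java
} else {
  // Fail fast instead of logging: a warning is easy to miss, and the segment
  // would silently keep serving the stale default value.
  throw new UnsupportedOperationException(
      "Updating default value is not supported in segment version " + segmentVersion);
}
```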

@atumbde (Contributor) commented Jul 11, 2016

We will need to restart servers if the schema changes. This is increasingly difficult for shared clusters with large datasets, like the UMP cluster.

We should instead have the controller send an alter-table request to servers over http. That is easier to manage (self-serviceable), causes no service disruption, and is more reliable if there are errors (we know about them upfront).

The controller will not have to handle failed servers, which keeps it simple.

Server already runs an http server now.

@mcvsubbu (Contributor) commented

We could also do messages over Helix?

@atumbde (Contributor) commented Jul 11, 2016

One difference with messages over Helix is status feedback. Over http, the user will immediately know if there are failures; with Helix, IIUC, we may not. Operation completion is also more deterministic over http.

return loadSegment(indexDir, readMode, indexLoadingConfigMetadata, null);
}

public static IndexSegment loadSegment(File indexDir, ReadMode readMode, IndexLoadingConfigMetadata indexLoadingConfigMetadata, Schema schema) throws Exception {
return Loaders.IndexSegment.load(indexDir, readMode, indexLoadingConfigMetadata);
Review comment (Contributor):

return Loaders.IndexSegment.load(indexDir, readMode, indexLoadingConfigMetadata, schema); ?

@Jackie-Jiang (Contributor) commented

Should I take this change and work on all the issues, or add unit tests after you fix the comments and check in this PR?

@kishoreg (Member, Author) commented

I will respond to the comments. Please start working on the unit tests.


@kishoreg (Member, Author) commented

For operations, I prefer one mechanism to trigger a reload, either a refresh message or an http call. While the http-based approach feels simple and easy, it is much harder to control and to handle faults with. For example, with refresh messages we can control how many segments are being reloaded at once across the cluster, per table, etc.; with http we would have to re-implement throttling. For long-running tasks we also need to handle faults: what if the controller goes down, what if a server goes down or stops responding, etc. Helix already provides all these features.

If the end goal is for the user to know the result, then we can have a call to the controller that waits until the schema updates are processed before responding. Note that a Helix message can also carry status updates; the current implementation of the message handler in Pinot just does not use that feature.

Another option is to have nodes automatically watch for schema updates and trigger a segment refresh. It makes sense to do this once we have this feature well tested in production.


// delete the temporary directory
FileUtils.deleteDirectory(tmpIndexDir);
return new SegmentMetadataImpl(indexDir);
Review comment by @antumbde (Contributor), Jul 12, 2016:

delete backup metadata

@antumbde (Contributor) commented

Regarding the http vs. Helix discussion: my main point is for the operator to know

  1. success vs. failure
  2. the reason for a failure

Architecturally, we should also plan for self-serviceability and proper tooling, so we should avoid the need for server restarts. (If you've planned for these, then ignore my comment.)

For this specific operation, failed-node handling is trivial: restarting servers will see the changed schema and apply it. I agree that Helix is the better mechanism in general.

We can do it over Helix, or a hybrid of a controller http request plus Helix. I'm not sure whether we can get error reasons over Helix, but without that, having to go through logs impacts manageability.

Jackie-Jiang force-pushed the schema_evolution_default_value branch 4 times, most recently from 80c7856 to cc42e00 on August 8, 2016 19:52
Jackie-Jiang force-pushed the schema_evolution_default_value branch 2 times, most recently from bd32f42 to c586793 on August 11, 2016 22:46
@kishoreg (Member, Author) commented

LGTM. I tested it by loading a sample table, posting a schema with a new field, and disabling/enabling the segments. I saw that the new column was generated for the v1 format.

Commit message:
Added functionality for adding new columns with default values in DefaultColumnHandler.
Let DefaultColumnHandler take a schema argument to control the newly added columns.
The schema is passed down from the Helix property store.
Inside SegmentMetadataImpl, added methods to modify and store the segment metadata.
Support ADD, UPDATE, REMOVE operations on newly added columns.
Support segment formats v1 and v3.
The DefaultColumnHandler is constructed inside SegmentPreProcessor.
Added unit tests and an integration test for all types of newly added columns.
Jackie-Jiang merged commit 2aa0408 into master on Aug 15, 2016
Jackie-Jiang deleted the schema_evolution_default_value branch on August 15, 2016 22:54