Schema evolution: Default values for newly added columns #161

Merged — 1 commit merged into master from schema_evolution_default_value on Aug 15, 2016

Conversation

@kishoreg (Member) commented Jul 1, 2016

One of the most common requests from users is: "when we add new columns, can the old segments have default values for these columns?" Currently, Pinot forces users to re-bootstrap old data when new columns are added. Most users are fine with using default values in old segments for newly added columns.

For example, let's say a user adds a new column c_new to a table that currently has columns d1 and m1.
Let's say we have two segments: S1_OLD (contains d1, m1) and S2_NEW (contains d1, m1, c_new).

Let's go over the queries and their current behavior in Pinot.
If the query does not mention the new column, results will be correct and as expected.

If the query references the new column, then old data (without the new column) is not considered while processing the query. In this case, whether the result is right or wrong depends on the other filters in the query.

For example, let's look at the current behavior in Pinot when a new column c_new is added at time T1:

select sum(m1) from Table // this will work in all cases
select sum(m1) from Table where c_new = x // this will be incorrect, as Pinot considers only new data
select sum(m1) from Table where c_new = x and time > T1 // this will be correct, since all data after T1 has c_new
select sum(c_new) from Table // this works and returns the correct result
select sum(c1) from Table // this also works as expected
but
select sum(c_new), sum(m1) from Table // THIS WON'T WORK AS EXPECTED

@jfim (Member) commented Jun 8, 2016

select * from Table is also broken; it will return a non-deterministic set of columns.

@kishoreg (Member, Author) commented Jun 9, 2016

Solution 1:

High level idea: Pinot already mandates registering a schema for a realtime table. We can easily extend this to offline tables as well, and make it mandatory if clients need support for default values. On loading a segment, we simply compare the columns in the schema against those in the segment. If we find missing columns, we create a sorted forward index for each of them. The column will have only one value, derived from the default value in the schema. Once we generate this, the rest of the query execution code should work as usual.
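
A minimal sketch of this load-time fill, assuming hypothetical SchemaColumn and SegmentDir types (illustrative stand-ins, not actual Pinot APIs):

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch only: SchemaColumn and SegmentDir are hypothetical
// stand-ins, not actual Pinot APIs.
public class MissingColumnFiller {

  // On segment load, create a single-value sorted column for every schema
  // column that the segment does not already contain.
  public static void fillMissingColumns(Set<String> segmentColumns,
      List<SchemaColumn> schemaColumns, SegmentDir segmentDir) {
    for (SchemaColumn column : schemaColumns) {
      if (!segmentColumns.contains(column.name())) {
        // The dictionary gets exactly one entry (the schema default) and the
        // forward index maps every docId to dictionary id 0, so the column is
        // trivially sorted and the rest of query execution works unchanged.
        segmentDir.writeSortedSingleValueColumn(column.name(), column.defaultValue());
      }
    }
  }

  interface SchemaColumn {
    String name();
    Object defaultValue();
  }

  interface SegmentDir {
    void writeSortedSingleValueColumn(String name, Object defaultValue);
  }
}
```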

Things to consider

  • Should we update the segment metadata when we add new columns? Until now we have always avoided modifying a segment.
  • What happens to already loaded segments? Should we force a server restart, or use the segment refresh command to force a reload of the segment?
  • What happens if the default value changes? With the new segment format version, we don't support index removal yet.
  • Some existing use cases might rely on the current behavior where we skip segments that don't contain the columns in the query. We need to make sure performance is not degraded for such use cases.

Thoughts?

@jfim (Member) commented Jun 9, 2016

The biggest issue is the default value changing: if we wrote a new column to disk, we can't really fix it with a new default value.

What we could do instead is have the column load generate a new column on the fly, based on the schema values, if it doesn't exist, with nothing written to disk. Since the column has only one value, it's trivial to generate and shouldn't have any significant impact on performance. We'd also need to intercept calls to metadata/dictionaries for columns that don't exist and synthesize metadata/dictionaries for them.

Of course, once we support nulls properly, the synthetic column should just contain nulls.
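
A minimal sketch of this in-memory alternative, assuming a hypothetical ColumnReader interface (not Pinot's actual column abstraction):

```java
// Illustrative sketch: ColumnReader is a hypothetical interface, not Pinot's
// actual column abstraction.
public class SyntheticColumnReader implements ColumnReader {
  private final Object defaultValue;

  public SyntheticColumnReader(Object defaultValue) {
    this.defaultValue = defaultValue;
  }

  @Override
  public Object getValue(int docId) {
    // Nothing is materialized on disk; every docId maps to the schema default.
    // If the default changes, the next segment load simply picks up the new one.
    return defaultValue;
  }

  @Override
  public int cardinality() {
    return 1; // the synthesized dictionary has a single entry
  }
}

interface ColumnReader {
  Object getValue(int docId);
  int cardinality();
}
```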

@raviarin (Contributor) commented Jun 9, 2016

If we need to change the default value (which should be very rare), then with approach 1 we can compare the default values, regenerate the column with the new default, and delete the previously generated default column. That way there is no additional in-memory complexity.

@antumbde (Contributor) commented

" We will have only one value in this column..."
What is your thinking on multi-value columns ? I guess, you mean single
array as one value for multi-valued columns

  1. We should persist these columns for server scalability, as we can easily mmap those indexes. Otherwise, on shared multi-tenant clusters with, say, 2k segments and 3 additional columns, we need 6k indexes all held in memory. That can quickly offset some of the gains of the new file format.
  2. We should avoid server restarts for segment-level changes. We already require them for inverted indexes, but this can again be an issue in a shared cluster. Instead, we can have servers 'force refresh' a segment using the http admin endpoint (guaranteeing the action is an issue with an http call).
  3. Removal is possible but very costly with the new segment format. We can support it if really needed. If default value changes are rare, we can model them as the addition of yet another new column; the older one sticks around as the default. The downside is that applications need to change for the changed column name. One option is to logically delete the old column by marking its range invalid; no major performance impact, but there is some storage and memory scale impact.
  4. Regarding users relying on Pinot skipping segments: what will select count(*) from myTable where new_column = new_column_default_value return? Zero, if we skip segments. That won't be correct, and it will confuse many new users. We should avoid this and rely on simple behavior: existing segments are handled as if the column has the default value.

@jfim: It should be possible to change a persisted default value column. What am I missing?


@kishoreg (Member, Author) commented

@antumbde Good point about MV columns. I'm guessing it's OK to have a sorted forward index even for MV columns, as long as the default value is just one element.

Adwait, I am not sure I get point 4. Are you saying we should use the default value even if we don't generate a new sorted index file for new columns?

If we have to do that, then we can completely avoid creating the file, as @jfim suggested, and create the columns in memory when needed.

@jfim (Member) commented Jun 10, 2016

If we write a column that has only default null values, there's no way to determine that the column is synthetic unless we write a flag in the metadata, which is just another thing we need to keep track of when we load segments. This means that instead of just making a column at load time, we need to:

  1. Check if the column exists; if it doesn't, generate one.
  2. Check if the column was a generated default value column (either via a synthetic column flag in the metadata or by checking whether it's a single-value column), and if it doesn't match the current default value, regenerate it.

If there's no "synthetic column" flag in the metadata, this gets even worse, because we have to have some odd logic somewhere that checks whether a column has a single value. Of course, you can't just check whether a column has a single value, because a sorted time column, for example, could legitimately have a single value too, and you don't want to replace that with default null values. So then we need to make this logic smarter and figure out a reasonable set of assumptions for when we can assume the column was materialized for default null values. If we get it wrong, we've just destroyed some user data and have to ask the user to fix the segments, and at that point they would have been better off just backfilling on their own.
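
The two checks above might look like the following at load time; this is a sketch assuming a hypothetical isAutoGenerated metadata flag, not an actual Pinot field:

```java
// Sketch only: SegmentMeta, ColumnSchema and the isAutoGenerated flag are
// illustrative assumptions, not actual Pinot metadata fields.
public class DefaultColumnReconciler {

  void reconcile(String column, SegmentMeta meta, ColumnSchema schema) {
    if (!meta.hasColumn(column)) {
      // Step 1: the column is missing entirely; generate it from the schema default.
      meta.addAutoGeneratedColumn(column, schema.defaultValue(column));
    } else if (meta.isAutoGenerated(column)
        && !meta.storedDefault(column).equals(schema.defaultValue(column))) {
      // Step 2: the column was synthesized earlier but the default has changed;
      // drop it and regenerate with the new default. Without the explicit flag
      // we would have to guess, and a wrong guess destroys real user data.
      meta.removeColumn(column);
      meta.addAutoGeneratedColumn(column, schema.defaultValue(column));
    }
  }

  interface SegmentMeta {
    boolean hasColumn(String column);
    boolean isAutoGenerated(String column);
    Object storedDefault(String column);
    void removeColumn(String column);
    void addAutoGeneratedColumn(String column, Object defaultValue);
  }

  interface ColumnSchema {
    Object defaultValue(String column);
  }
}
```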

I don't remember offhand the on-disk format for multivalues, but if I remember correctly, it's just indices into a linear array of values. If that's the case, we can just have all of them index to the same value or make a special multivalue column type for synthetic columns that always returns empty arrays.
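
For multi-value columns, a synthetic reader along these lines could return the same array for every docId (sketch; MultiValueReader is a hypothetical interface, not Pinot's actual abstraction):

```java
// Sketch: a synthetic multi-value column returning the same single-element
// array (or an empty array) for every docId. MultiValueReader is a
// hypothetical interface, not Pinot's actual multi-value abstraction.
public class SyntheticMultiValueReader implements MultiValueReader {
  private final Object[] defaultRow;

  public SyntheticMultiValueReader(Object defaultValue) {
    // One shared row: a single-element array holding the default, or an
    // empty array once nulls are properly supported.
    this.defaultRow = defaultValue == null ? new Object[0] : new Object[] {defaultValue};
  }

  @Override
  public Object[] getValues(int docId) {
    return defaultRow; // every docId indexes to the same value array
  }
}

interface MultiValueReader {
  Object[] getValues(int docId);
}
```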

@antumbde (Contributor) commented

@kishoreg Regarding point 4: your original "things to consider" mentioned "Some existing use cases might rely on the current behavior where we skip segments that don't contain the columns in the query."
Say segments s1 and s2 don't have column 'foo', and a user adds the new column 'foo'; s1 and s2 will then have column foo with the default value. Now, when you say users rely on Pinot skipping segments, do you mean that s1 and s2, which did not have 'foo' initially, are completely ignored in query processing? How will we process queries like:

  • select count(*) from myTable where foo = 'default_value'
  • select avg(foo) from myTable // say, foo were a metric

Results will differ depending on whether we skip s1 and s2 or process them as if 'foo' has the default value. I vote for processing s1 and s2.

@kishoreg (Member, Author) commented

@antumbde Got it. No, I don't intend to keep the previous behavior; we will process both s1 and s2 as you suggested. I was more concerned about performance. Let's say the query is select sum(foo) from myTable. Earlier we would skip segments that don't have column 'foo', but now we will end up scanning all segments to compute the sum. The answers should match, given that the default value for foo is 0.

kishoreg self-assigned this Jun 10, 2016
@antumbde (Contributor) commented

Thanks

@codecov-io commented Jul 1, 2016

Current coverage is 46.55% (diff: 80.67%)

Merging #161 into master will increase coverage by 0.41%

@@             master       #161   diff @@
==========================================
  Files           707        712     +5   
  Lines         33231      33447   +216   
  Methods           0          0          
  Messages          0          0          
  Branches       4976       5008    +32   
==========================================
+ Hits          15335      15572   +237   
+ Misses        16678      16632    -46   
- Partials       1218       1243    +25   

Powered by Codecov. Last update 123e562...c586793

kishoreg force-pushed the schema_evolution_default_value branch from 58b4972 to aa26a07 on July 3, 2016 20:43
@kishoreg (Member, Author) commented Jul 3, 2016

Updated the PR. Tested locally for the v1/v2 format. Also added support for updating the default value for the v1/v2 format (v3 does not support index removal as of now).

Added a new class, schemaEvolutionHandler, that does the bulk of the work.
Implementation details (a sketch of this flow follows the list):

  • looks at the schema from ZK and the segment metadata
  • computes the action required on a per-column basis: NEW_DIMENSION, UPDATE_DIMENSION, etc.
  • creates the new columns in a temp directory
  • computes the new metadata based on the columns added
  • once all files are generated in the temp directory, updates the original segment
  • returns the new segmentMetadata
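
A sketch of this flow, with hypothetical stand-ins for the actual classes (ZkSchema, SegmentMetadata and the abstract helpers are assumptions; only the NEW_DIMENSION/UPDATE_DIMENSION action names come from the description above):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Map;

// Sketch of the flow described above; ZkSchema, SegmentMetadata and the
// abstract helpers are hypothetical stand-ins, not the actual Pinot classes.
public abstract class SchemaEvolutionSketch {

  // Per-column actions named in the PR description.
  enum ColumnAction { NEW_DIMENSION, UPDATE_DIMENSION, NEW_METRIC, UPDATE_METRIC }

  interface ZkSchema {}
  interface SegmentMetadata {}

  SegmentMetadata process(ZkSchema schema, SegmentMetadata metadata, File indexDir)
      throws IOException {
    // 1. Diff the ZK schema against the segment metadata, column by column.
    Map<String, ColumnAction> actions = computeActions(schema, metadata);

    // 2. Generate new column files in a temp directory so a failure midway
    //    never leaves the original segment half-updated.
    File tmpDir = Files.createTempDirectory("schema-evolution").toFile();
    for (Map.Entry<String, ColumnAction> entry : actions.entrySet()) {
      writeColumn(tmpDir, entry.getKey(), entry.getValue(), schema);
    }

    // 3. Only once everything is generated, swap the files into the original
    //    segment and rewrite its metadata.
    updateSegment(indexDir, tmpDir);
    return reloadMetadata(indexDir);
  }

  abstract Map<String, ColumnAction> computeActions(ZkSchema schema, SegmentMetadata metadata);
  abstract void writeColumn(File tmpDir, String column, ColumnAction action, ZkSchema schema);
  abstract void updateSegment(File indexDir, File tmpDir) throws IOException;
  abstract SegmentMetadata reloadMetadata(File indexDir);
}
```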

TESTING
I did some manual testing where I ran schemaEvolutionHandler on a segment and started the server, and verified that we are able to retrieve the newly added column.

TBD

  • write a unit test case covering all data types (I manually tested a STRING dimension; we need tests for all data types for both dimensions and metrics)
  • write an integration test that creates a segment with 5 columns, uploads a schema with 5 + additional columns (one for each data type and dimension/metric combination), and then starts the server
  • refresh test case: similar to the previous test case, but we change the schema after the server has started and loaded the segment. We need to either use a refresh message or change the segment state to OFFLINE and back to ONLINE to force a reload.

segmentWriter.removeIndex(column, ColumnIndexType.DICTIONARY);
segmentWriter.removeIndex(column, ColumnIndexType.FORWARD_INDEX);
} else {
LOGGER.warn("Updating default value is not supported in {} segment version",
Review comment (Contributor):

Error out here. It will be difficult to track warnings afterwards.
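
For example, the reviewed branch could throw instead of warn (a sketch only; the exception type and the segmentVersion variable are assumptions, not necessarily what the PR ended up doing):

```java
} else {
  // Fail fast instead of logging: a warning is easy to miss, and the segment
  // would silently keep serving the stale default value.
  throw new UnsupportedOperationException(
      "Updating default value is not supported in segment version " + segmentVersion);
}
```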

@atumbde (Contributor) commented Jul 11, 2016

We will need to restart servers if the schema changes. This is increasingly difficult for shared clusters with large datasets, like the UMP cluster.

We should instead have the controller send an alter-table request to servers over http. That is easier to manage (self-serviceable), causes no service disruption, and is more reliable if there are errors (we know about them upfront).

The controller will not have to handle failed servers, which keeps it simple.

Server already runs an http server now.

@mcvsubbu (Contributor) commented

We could also do messages over Helix?

@atumbde (Contributor) commented Jul 11, 2016

One difference with messages over Helix is status feedback. Over http, the user will immediately know if there are failures; with Helix, IIUC, we may not. Operation completion is also more deterministic over http.

return loadSegment(indexDir, readMode, indexLoadingConfigMetadata, null);
}

public static IndexSegment loadSegment(File indexDir, ReadMode readMode, IndexLoadingConfigMetadata indexLoadingConfigMetadata, Schema schema) throws Exception {
return Loaders.IndexSegment.load(indexDir, readMode, indexLoadingConfigMetadata);
Review comment (Contributor):

return Loaders.IndexSegment.load(indexDir, readMode, indexLoadingConfigMetadata, schema); ?

@Jackie-Jiang (Contributor) commented

Should I take this change and work on all the issues, or add unit tests after you fix the comments and check in this PR?

@kishoreg (Member, Author) commented

I will respond to the comments. Please start working on the unit tests.


@kishoreg (Member, Author) commented

For operations, I prefer one mechanism to trigger a reload, either a refresh message or an http call. While the http-based approach feels simple and easy, it is much harder to control and to handle faults with. For example, with refresh messages we can control how many segments are being reloaded at once across the cluster, per table, etc.; with http we would have to re-implement throttling. For long-running tasks we also need to handle faults: what if the controller goes down, what if a server goes down or stops responding, etc. Helix already provides all these features.

If the end goal is for the user to know the result, then we can have a call to the controller that waits until the schema updates are processed before responding. Note that a Helix message can also carry status updates; the current implementation of the message handler in Pinot just does not use that feature.

Another option is to have nodes automatically watch for schema updates and trigger a segment refresh. It makes sense to do this once we have this feature well tested in production.


// delete the temporary directory
FileUtils.deleteDirectory(tmpIndexDir);
return new SegmentMetadataImpl(indexDir);
Review comment by @antumbde (Contributor), Jul 12, 2016:

delete backup metadata

@antumbde (Contributor) commented

Regarding the http vs. Helix discussion: my main point is for the operator to know

  1. success vs. failure
  2. the reason for a failure

Architecturally, we should also plan for self-serviceability and proper tooling, so we should avoid the need for server restarts. (If you've planned for these, then ignore my comment.)

For this specific operation, failed-node handling is trivial: restarting servers will see the changed schema and apply it. I agree that Helix is the better mechanism in general.

We can do it over Helix, or a hybrid of a controller http request plus Helix. I'm not sure whether we can get error reasons over Helix, but without that, having to go through logs impacts manageability.

Jackie-Jiang force-pushed the schema_evolution_default_value branch 4 times, most recently from 80c7856 to cc42e00 on August 8, 2016 19:52
Jackie-Jiang force-pushed the schema_evolution_default_value branch 2 times, most recently from bd32f42 to c586793 on August 11, 2016 22:46
@kishoreg (Member, Author) commented

LGTM. I tested it by loading a sample table, posting a schema with a new field, and disabling/enabling the segments. I saw that the new column was generated for the v1 format.

Commit message:
Added functionality for adding new columns with default values in DefaultColumnHandler.
Let DefaultColumnHandler take a schema argument to control the newly added columns.
The schema is passed down from the Helix property store.
Inside SegmentMetadataImpl, added methods to modify and store the segment metadata.
Support ADD, UPDATE, REMOVE operations on newly added columns.
Support segment formats v1 and v3.
The DefaultColumnHandler is constructed inside SegmentPreProcessor.
Added unit tests and an integration test for all types of newly added columns.
Jackie-Jiang merged commit 2aa0408 into master on Aug 15, 2016
Jackie-Jiang deleted the schema_evolution_default_value branch on August 15, 2016 22:54