New operator state interfaces #747

gyfora · 2015-05-29T10:42:10Z

This PR contains the proposed changes for the streaming operator state interfaces as described in

https://docs.google.com/document/d/1nTn4Tpafsnt-TCT6L1vlHtGGgRevU90yRsUQEmkRMjk/edit?usp=sharing

Highlights:

Added OperatorState interface accessible from RuntimeContext
System managed checkpointing/restore of the states (no required implementations for Serializable states)
confirmCheckpoint(..) method in the CheckpointCommitter interface has been changed to return the StateHandle for the checkpointed state
Updated Persistent Kafka source to work on the new interfaces
Updated Tests and IT cases
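To make the first two highlights concrete, here is a minimal, self-contained sketch of a named operator state handed out by a runtime context, where the system (not the user function) takes snapshots and restores them. `OperatorState`, `SketchRuntimeContext` and their methods are hypothetical stand-ins written for illustration, not the interfaces added by this PR; the sketch assumes every state is `Serializable` and snapshots the states with plain Java serialization.

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a named, system-managed operator state.
interface OperatorState<T extends Serializable> {
    T value();
    void update(T newValue);
}

// Hypothetical runtime context: hands out named states and checkpoints them for the user.
class SketchRuntimeContext {
    private final Map<String, Serializable> states = new HashMap<>();

    // Returns a view on the state called `name`; the default is used only if no value exists yet.
    @SuppressWarnings("unchecked")
    <T extends Serializable> OperatorState<T> getOperatorState(String name, T defaultValue) {
        states.putIfAbsent(name, defaultValue);
        return new OperatorState<T>() {
            public T value() { return (T) states.get(name); }
            public void update(T newValue) { states.put(name, newValue); }
        };
    }

    // System-managed checkpoint: serialize all named states, no user code required.
    byte[] snapshot() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new HashMap<>(states));
        }
        return bytes.toByteArray();
    }

    // System-managed restore: replace all named states with the checkpointed ones.
    @SuppressWarnings("unchecked")
    void restore(byte[] checkpoint) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(checkpoint))) {
            states.clear();
            states.putAll((Map<String, Serializable>) in.readObject());
        }
    }
}
```

Because the states live in the context rather than in user fields, a function only reads and updates them; snapshotting needs no per-function code as long as the values are `Serializable`.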

mbalassi · 2015-05-29T10:50:38Z

...time/src/main/java/org/apache/flink/runtime/jobgraph/tasks/CheckpointCommittingOperator.java

 public interface CheckpointCommittingOperator {

-	void confirmCheckpoint(long checkpointId, long timestamp) throws Exception;
+	void confirmCheckpoint(long checkpointId, SerializedValue<StateHandle<?>> state) throws Exception;
 }


Is the checkpointId still needed?

Good point. We could potentially omit it, but we kept it for now because, at the very least, it makes it easier in tests to associate confirmations with specific checkpoints.
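The testing argument for keeping the id can be illustrated with a test double that records which state was confirmed for which checkpoint. `StateHandle` and `RecordingOperator` are simplified hypothetical stand-ins; the real signature wraps the handle in Flink's `SerializedValue`, which this sketch leaves out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for a handle to checkpointed state (hypothetical).
interface StateHandle<T> {
    T getState() throws Exception;
}

// Simplified version of the changed interface, without the SerializedValue wrapper.
interface CheckpointCommittingOperator {
    void confirmCheckpoint(long checkpointId, StateHandle<?> state) throws Exception;
}

// Test double: since the id travels with the state, a test can assert exactly
// which checkpoint each confirmed state belongs to.
class RecordingOperator implements CheckpointCommittingOperator {
    final Map<Long, Object> confirmed = new LinkedHashMap<>();

    @Override
    public void confirmCheckpoint(long checkpointId, StateHandle<?> state) throws Exception {
        // A null handle (no state) is recorded as null rather than dereferenced.
        confirmed.put(checkpointId, state == null ? null : state.getState());
    }
}
```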

mbalassi · 2015-05-29T10:55:52Z

Generally looks good. I was wondering whether we could use Flink's serialization instead of having Serializable as a bound. Do we always have Flink code on every side of the serialization?
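One way to drop the `Serializable` bound is to let each state carry an explicit serializer instead of relying on Java serialization. The `StateSerializer` interface below is a hypothetical minimal stand-in for that idea (Flink's own abstraction, `TypeSerializer`, has a much larger contract), and `Counter` deliberately does not implement `Serializable`.

```java
import java.io.*;

// Hypothetical minimal serializer abstraction, a stand-in for the idea only.
interface StateSerializer<T> {
    void serialize(T value, DataOutput out) throws IOException;
    T deserialize(DataInput in) throws IOException;
}

// A state type that does NOT implement java.io.Serializable.
class Counter {
    final long count;
    Counter(long count) { this.count = count; }
}

class CounterSerializer implements StateSerializer<Counter> {
    public void serialize(Counter value, DataOutput out) throws IOException {
        out.writeLong(value.count);
    }

    public Counter deserialize(DataInput in) throws IOException {
        return new Counter(in.readLong());
    }
}

final class SerializerSnapshots {
    // Snapshot any state type given a serializer for it; no Serializable bound on T.
    static <T> byte[] toBytes(T state, StateSerializer<T> serializer) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        serializer.serialize(state, out);
        out.flush();
        return bytes.toByteArray();
    }

    static <T> T fromBytes(byte[] data, StateSerializer<T> serializer) throws IOException {
        return serializer.deserialize(new DataInputStream(new ByteArrayInputStream(data)));
    }
}
```

The trade-off is the one raised in the question: whoever deserializes the state must also have the serializer class available.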

rmetzger · 2015-06-01T12:30:46Z

...n/java/org/apache/flink/streaming/connectors/kafka/api/persistent/PersistentKafkaSource.java

+
+		this.lastOffsets = getRuntimeContext().getOperatorState("offset", defaultOffset);
+
+		//TODO: commit fetched offset to ZK if not default


We cannot merge the PR with this TODO open.
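A possible shape for the open TODO (commit the restored offsets only when they differ from the default) is sketched below. `OffsetStore` is a hypothetical interface standing in for the ZooKeeper client, and the per-partition `long[]` state with a `-1` sentinel is an assumed layout; this is not the code that was eventually merged.

```java
// Hypothetical stand-in for whatever persists consumer offsets (ZooKeeper in the PR).
interface OffsetStore {
    void commit(int partition, long offset);
}

final class OffsetRestore {
    // Assumed sentinel meaning "no offset has been checkpointed for this partition yet".
    static final long DEFAULT_OFFSET = -1L;

    // Commits only partitions whose restored offset is not the default; returns how many were committed.
    static int commitRestoredOffsets(long[] restoredOffsets, OffsetStore store) {
        int committed = 0;
        for (int partition = 0; partition < restoredOffsets.length; partition++) {
            if (restoredOffsets[partition] != DEFAULT_OFFSET) {
                store.commit(partition, restoredOffsets[partition]);
                committed++;
            }
        }
        return committed;
    }
}
```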

mbalassi · 2015-06-09T07:30:32Z

release-0.9 has been forked off, so this pull request is no longer blocked by it. Did you manage to fix the Kafka source, or is some help needed there?

gyfora · 2015-06-09T07:34:24Z

I could not fix the Kafka test yet. I also need to fix some recently added tests that are hardcoded against the previous interfaces :p

This will probably happen later this week. At some point I might ask for help getting Kafka to work.

gyfora · 2015-06-17T11:46:08Z

Some issues that we need to fix before merge:

Committing the checkpoint doesn't work properly. We need to commit the states one by one and also pass the names of the states.

StateHandles are not properly discarded. I need to add a wrapper StateHandle that also discards the handles it wraps.
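The two issues above (committing states individually under their names, and discarding wrapped handles) can be sketched with a wrapper that keeps one handle per state name. `StateHandle` and `NamedStateHandles` are hypothetical simplified types for illustration, not the classes added in the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical simplified state handle: can fetch and discard the state it points to.
interface StateHandle<T> {
    T getState() throws Exception;
    void discardState() throws Exception;
}

// Wrapper holding one handle per named operator state. Discarding the wrapper
// discards every wrapped handle, so no backing state is leaked.
class NamedStateHandles implements StateHandle<Map<String, Object>> {
    private final Map<String, StateHandle<?>> handles = new LinkedHashMap<>();

    void put(String stateName, StateHandle<?> handle) {
        handles.put(stateName, handle);
    }

    @Override
    public Map<String, Object> getState() throws Exception {
        Map<String, Object> states = new LinkedHashMap<>();
        for (Map.Entry<String, StateHandle<?>> e : handles.entrySet()) {
            states.put(e.getKey(), e.getValue().getState());
        }
        return states;
    }

    @Override
    public void discardState() throws Exception {
        Exception first = null;
        // Try to discard every wrapped handle even if one of them fails.
        for (StateHandle<?> handle : handles.values()) {
            try {
                handle.discardState();
            } catch (Exception e) {
                if (first == null) first = e; else first.addSuppressed(e);
            }
        }
        handles.clear();
        if (first != null) throw first;
    }
}
```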

gyfora · 2015-06-17T12:58:03Z

Also, we need to add the partitioning setting to the operators, as it is currently not exposed through the API.

gyfora · 2015-06-18T08:42:39Z

Alright, I think I fixed the issues; now the only thing that remains is to add the partitioning setting to the API.

State partitioning should be a property of the operator, so it should be set on the operator after it is created, like parallelism.

For example: stream.map(Mapper).setStatePartitioner(...)

This is quite tricky, however, as the state partitioner should affect the partitioning scheme of the input streams (otherwise it makes no sense). I see two approaches here:

1. Simply overwrite the partitioning without warning.
2. Only overwrite it if it is not defined (forward); otherwise, throw an exception stating that the partitioning cannot differ from the state partitioning.
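Option 2 can be sketched as a small validation rule. This is a toy model: the `Partitioning` enum and `OperatorConfig` class are hypothetical, and a real state partitioner would be a key selector rather than an enum value.

```java
// Hypothetical toy model of reconciling an operator's input partitioning
// with a state partitioner set on that operator (option 2).
enum Partitioning { FORWARD, HASH, SHUFFLE, BROADCAST }

class OperatorConfig {
    private Partitioning inputPartitioning = Partitioning.FORWARD;
    private Partitioning statePartitioning;

    void setInputPartitioning(Partitioning p) { this.inputPartitioning = p; }
    Partitioning getInputPartitioning() { return inputPartitioning; }
    Partitioning getStatePartitioning() { return statePartitioning; }

    // Option 2: only overwrite an undefined (forward) input partitioning; reject conflicts.
    void setStatePartitioner(Partitioning p) {
        if (inputPartitioning != Partitioning.FORWARD && inputPartitioning != p) {
            throw new IllegalStateException("Input partitioning " + inputPartitioning
                    + " cannot differ from the state partitioning " + p);
        }
        this.statePartitioning = p;
        this.inputPartitioning = p;
    }
}
```

Failing fast keeps user-specified partitioning authoritative instead of silently replacing it, which is the behavior the vote below favors.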

mbalassi · 2015-06-18T09:52:33Z

I vote for the second option; it is cleaner and more in line with the current behavior of the API.

rmetzger · 2015-06-18T22:08:24Z

flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/SuccessfulCheckpoint.java

+	public StateForTask getState(JobVertexID jobVertexID)
+	{
+		if(vertexToState.containsKey(jobVertexID)) {
+			return vertexToState.get(jobVertexID);


I think you can save one map lookup by calling get() immediately. It will return null on a missing key.
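The suggested single lookup, sketched with generic stand-ins (`K` for `JobVertexID`, `V` for `StateForTask`); `SuccessfulCheckpointSketch` is a hypothetical name for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class SuccessfulCheckpointSketch<K, V> {
    private final Map<K, V> vertexToState = new HashMap<>();

    void putState(K vertexId, V state) {
        vertexToState.put(vertexId, state);
    }

    // One lookup instead of containsKey() followed by get():
    // HashMap.get() already returns null for a missing key.
    // Caveat: this treats "absent" and "mapped to null" the same,
    // which is fine as long as null states are never stored.
    V getState(K vertexId) {
        return vertexToState.get(vertexId);
    }
}
```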

StephanEwen · 2015-06-19T04:23:58Z

Let's keep the current interfaces Checkpointed and AsynchronouslyCheckpointed so we don't completely break current programs. They are actually used in examples that have been published.

gyfora · 2015-06-19T06:26:34Z

That's a good point, Stephan. I fixed it.
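For contrast with system-managed state, here is a user function in the style of `Checkpointed`, where the function itself draws the snapshot. The interface is re-declared locally so the sketch compiles on its own; its two methods are modeled on Flink's `Checkpointed` (`snapshotState(long checkpointId, long checkpointTimestamp)` and `restoreState(T state)`), so treat the exact signatures as an assumption. `CountingFunction` is hypothetical.

```java
import java.io.Serializable;

// Local re-declaration modeled on Flink's Checkpointed interface (signatures assumed).
interface Checkpointed<T extends Serializable> {
    T snapshotState(long checkpointId, long checkpointTimestamp) throws Exception;
    void restoreState(T state);
}

// A counting function that draws its own snapshots, as the published examples do.
class CountingFunction implements Checkpointed<Long> {
    private long count;

    long map(String value) {
        return ++count;
    }

    @Override
    public Long snapshotState(long checkpointId, long checkpointTimestamp) {
        return count;
    }

    @Override
    public void restoreState(Long state) {
        count = state;
    }
}
```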

gyfora · 2015-06-19T07:25:19Z

I still get these random KafkaITCase failures on Travis. This might be a timeout issue of some sort. @rmetzger, do you have any tips on debugging that?

rmetzger · 2015-06-20T22:59:30Z

Mh. Is it always the same test failing with the same message?
What is the failure?

aljoscha · 2015-06-24T09:11:00Z

...flink-streaming-core/src/main/java/org/apache/flink/streaming/api/datastream/DataStream.java

 	 * 
+	 * It creates a new {@link KeyedDataStream} that uses the provided key for partitioning
+	 * its operator states. Mind that keyBy does not affect the partitioning of the {@link DataStream} 


But it seems to affect the partitioning of the stream, since the constructor of KeyedDataStream calls partitionByHash() on the DataStream. (This also applies to the other keyBy() methods.)

Indeed. I removed that part from the Javadoc.
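Why keying the state implies partitioning the stream: every record must reach the parallel instance that owns the state for its key. A toy version of hash partitioning follows; `HashRouting.channelFor` is a hypothetical helper, and modulo-of-`hashCode` is a common scheme assumed here for illustration rather than a description of Flink's exact partitioner.

```java
import java.util.function.Function;

final class HashRouting {
    // Route a record to one of `parallelism` channels based on its key's hash.
    // Math.floorMod keeps the result non-negative even for negative hash codes.
    static <T, K> int channelFor(T record, Function<T, K> keySelector, int parallelism) {
        K key = keySelector.apply(record);
        return Math.floorMod(key.hashCode(), parallelism);
    }
}
```

Records with equal keys always land on the same channel, which is exactly the property keyed operator state relies on.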

aljoscha · 2015-06-24T09:14:02Z

@senorcarbone You mentioned that the KeyedDataStream disallows calls to shuffle()/repartition()/groupBy() and so on. But KeyedDataStream doesn't seem to have this functionality, unless I'm mistaken.

gyfora · 2015-06-24T09:26:34Z

The setConnectionType method is overridden, which is what enforces this.

aljoscha · 2015-06-24T09:28:39Z

Ahh I see, my bad. 😅
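The mechanism described in this exchange can be shown with a toy model: the keyed stream fixes hash partitioning in its constructor and overrides the single method through which shuffle(), broadcast() and similar calls would change it. `ToyStream` and `ToyKeyedStream` are hypothetical classes, and real DataStream methods return new streams rather than mutating one.

```java
// Hypothetical toy model of the override described above.
class ToyStream {
    protected String connectionType = "forward";

    // All repartitioning calls funnel through this method.
    ToyStream setConnectionType(String type) {
        this.connectionType = type;
        return this;
    }

    ToyStream shuffle() { return setConnectionType("shuffle"); }
    ToyStream broadcast() { return setConnectionType("broadcast"); }
    String getConnectionType() { return connectionType; }
}

class ToyKeyedStream extends ToyStream {
    ToyKeyedStream() {
        // Use the parent implementation once to fix hash partitioning.
        super.setConnectionType("hash");
    }

    // Overriding the single entry point disables every repartitioning method at once.
    @Override
    ToyStream setConnectionType(String type) {
        throw new UnsupportedOperationException(
                "Cannot change the partitioning of a keyed stream (requested: " + type + ")");
    }
}
```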

gyfora · 2015-06-25T14:03:41Z

Should we merge this?

StephanEwen · 2015-06-25T14:25:44Z

I think we have no real blocker here. I would prefer that the exception issue be addressed (the message for the wrapping exception).

Everything else will probably show best when we implement sample jobs and sample backends for this new functionality.
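The exception issue concerns the message of a wrapping exception. A small sketch of the general pattern: when restoring a named state fails, rethrow with a message naming the state and the checkpoint, and keep the original exception as the cause. `StateRestorer` and `StateRestoreUtil.restoreNamedState` are hypothetical names, not code from the PR.

```java
// Hypothetical callback that restores one named state.
interface StateRestorer {
    void restore() throws Exception;
}

final class StateRestoreUtil {
    // Wrap failures with a message saying which state and checkpoint failed,
    // keeping the original exception as the cause so its stack trace is not lost.
    static void restoreNamedState(String stateName, long checkpointId, StateRestorer restorer) {
        try {
            restorer.restore();
        } catch (Exception e) {
            throw new RuntimeException("Failed to restore operator state '" + stateName
                    + "' from checkpoint " + checkpointId, e);
        }
    }
}
```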

gyfora · 2015-06-25T14:32:49Z

Okay, I will fix the exceptions and merge it afterwards.


rmetzger · 2015-06-30T12:37:08Z

Please create JIRAs for changes in the future.


mbalassi reviewed May 29, 2015

gyfora force-pushed the new_state branch from ac2f638 to 08e2153, May 31, 2015 20:48

rmetzger reviewed Jun 1, 2015

gyfora force-pushed the new_state branch 3 times, most recently from 3392117 to 8306ddf, June 8, 2015 16:06

gyfora force-pushed the new_state branch 2 times, most recently from 288ae6f to 5767ae6, June 17, 2015 08:51

gyfora force-pushed the new_state branch 3 times, most recently from 0fe84fd to 450d5f5, June 18, 2015 07:42

rmetzger reviewed Jun 18, 2015

gyfora force-pushed the new_state branch from 450d5f5 to e184d06, June 19, 2015 06:24

gyfora force-pushed the new_state branch 3 times, most recently from 8c999ba to cc50603, June 20, 2015 18:30

aljoscha reviewed Jun 24, 2015

senorcarbone force-pushed the new_state branch 2 times, most recently from 4d694af to a024a2e, June 24, 2015 16:24

gyfora force-pushed the new_state branch 2 times, most recently from 4cf7971 to ce4d2ff, June 25, 2015 14:37

gyfora and others added 10 commits June 25, 2015 16:38

[streaming] Initial rework of the operator state interfaces

a7e2458

[streaming] Allow multiple operator states + stateful function test rework

ef11e63

[streaming] Add stateHandle to checkpointed message

f27c3f1

[streaming] Return null for empty state instead of empty hashmap

e2e73ad

[streaming] fix for null state in ConfirmCheckpoint messages

0ecab82

[streaming] KafkaSource checkpointing rework for new interfaces

5ddd232

[streaming] Fix shallow discard + proper checkpoint commit

56ae08e

[streaming] [docs] Updated streaming guide for new state interfaces

0a4144e

[streaming] Allow using both partitioned and non-partitioned state in an operator + refactor

474ff4d

[streaming] Re-enable Checkpointed interface for drawing snapshots

0ae1758

gyfora force-pushed the new_state branch from ce4d2ff to bde6e46, June 25, 2015 14:38

[streaming] Add KeyedDataStream abstraction and integrate it with the rest of the refactoring Closes apache#747

cad8510

gyfora force-pushed the new_state branch from bde6e46 to cad8510, June 25, 2015 17:20

asfgit merged commit cad8510 into apache:master Jun 25, 2015

gyfora deleted the new_state branch June 27, 2015 18:46

nikste pushed a commit to nikste/flink that referenced this pull request Sep 29, 2015

[streaming] Add KeyedDataStream abstraction and integrate it with the rest of the refactoring Closes apache#747

0aa167c

nltran pushed a commit to nltran/flink that referenced this pull request Jan 8, 2016

[streaming] Add KeyedDataStream abstraction and integrate it with the rest of the refactoring Closes apache#747

e9e08f2
