binary-signal/flink-iteration

Flink Iteration API for Apache Flink 2.0

An iteration API implementation for Apache Flink 2.0 that replaces the DataSet and DataStream iteration APIs deprecated in Flink 1.x. The implementation follows FLIP-176 (Unified Iteration to Support Algorithms).

Overview

The API supports both bounded and unbounded iterations and targets machine learning and other iterative algorithms.

Features

  • Unified Iteration Model: Support for both bounded and unbounded iterations
  • Synchronous/Asynchronous Execution: Execution modes for different algorithm requirements (asynchronous execution is still listed under Future Enhancements)
  • Per-round/All-round Operators: Configurable operator lifecycle management
  • Epoch Tracking: Built-in epoch/round tracking for iteration progress
  • Termination Control: Flexible termination conditions including max rounds and convergence criteria
  • Checkpoint Support: Integration with Flink's checkpointing mechanism
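
The termination rules listed above (a cap on the number of rounds plus a convergence criterion) can be modelled without Flink. The following is a minimal plain-Java sketch, not this project's API: `TerminationSketch`, `iterate`, and `Result` are hypothetical names introduced only for illustration.

```java
import java.util.function.DoubleUnaryOperator;

/** Plain-Java model of a bounded iteration: stop on convergence or after maxRounds. */
final class TerminationSketch {

    /** Final value, number of rounds executed, and whether the tolerance was met. */
    record Result(double value, int rounds, boolean converged) {}

    static Result iterate(double initial, DoubleUnaryOperator step, int maxRounds, double tolerance) {
        double current = initial;
        for (int round = 1; round <= maxRounds; round++) {
            double next = step.applyAsDouble(current);
            // Convergence criterion: the update changed the value by less than the tolerance.
            boolean converged = Math.abs(next - current) < tolerance;
            current = next;
            if (converged) {
                return new Result(current, round, true);
            }
        }
        // The round cap was reached before the criterion was met.
        return new Result(current, maxRounds, false);
    }

    public static void main(String[] args) {
        // Newton's method for sqrt(2): x <- (x + 2 / x) / 2
        DoubleUnaryOperator newton = x -> 0.5 * (x + 2.0 / x);
        System.out.println(iterate(1.0, newton, 10, 1e-9)); // stops on convergence
        System.out.println(iterate(1.0, newton, 2, 1e-9));  // stops on the round cap
    }
}
```

Whichever condition fires first ends the loop, which is the behaviour one would expect from combining a maximum round count with a convergence check.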

Architecture

The iteration API consists of several key components:

  • IterationBody: Interface for defining iteration computation logic
  • IterationListener: Callbacks for epoch watermarks and termination events
  • Iterations: Main entry point for creating iterations
  • HeadOperator/TailOperator: Internal operators managing the feedback loop
  • DataStreamList: Helper for managing multiple typed streams
  • IterationConfig: Configuration for iteration behavior
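
To make the roles of the head, the body, and the tail concrete, here is a single-threaded plain-Java sketch of a feedback loop with epoch tagging. It is an assumption about what the component names imply, not a description of this project's internals; `FeedbackLoopSketch` and `EpochRecord` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.IntUnaryOperator;

/** Single-threaded model of a feedback loop with epoch-tagged records. */
final class FeedbackLoopSketch {

    /** A value tagged with the round (epoch) in which it is processed. */
    record EpochRecord(int value, int epoch) {}

    /** Runs the body for maxRounds rounds and returns every record it emitted. */
    static List<EpochRecord> run(List<Integer> initial, IntUnaryOperator body, int maxRounds) {
        Deque<EpochRecord> feedbackChannel = new ArrayDeque<>();
        for (int v : initial) {
            feedbackChannel.add(new EpochRecord(v, 0)); // initial records enter at epoch 0
        }
        List<EpochRecord> output = new ArrayList<>();
        while (!feedbackChannel.isEmpty()) {
            EpochRecord in = feedbackChannel.poll();            // "head": next record
            int result = body.applyAsInt(in.value());           // iteration body
            output.add(new EpochRecord(result, in.epoch()));    // emitted to the output
            if (in.epoch() + 1 < maxRounds) {
                // "tail": feed the result back; it is processed in the next epoch
                feedbackChannel.add(new EpochRecord(result, in.epoch() + 1));
            }
        }
        return output;
    }
}
```

With `run(List.of(0, 10), v -> v + 1, 3)` the sketch emits six records, two per epoch, ending with the value 13 at epoch 2.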

Requirements

  • Java 17 or higher
  • Apache Flink 2.0+
  • Maven 3.6 or higher

Building

mvn clean package

Usage

Basic Unbounded Iteration

// Assumes the Flink DataStream API and this project's iteration classes are imported.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Initial value of the iteration variable, and the data fed into the loop.
DataStream<Integer> initValues = env.fromElements(0);
DataStream<Integer> data = env.fromElements(1, 2, 3, 4, 5);

DataStreamList result = Iterations.iterateUnboundedStreams(
    DataStreamList.of(initValues),
    DataStreamList.of(data),
    (variableStreams, dataStreams) -> {
        DataStream<Integer> variable = variableStreams.get(0);
        DataStream<Integer> input = dataStreams.get(0);

        // Merge the fed-back variable with the input and increment every element.
        DataStream<Integer> updated = variable
            .union(input)
            .map(value -> value + 1);

        return new IterationBodyResult(
            DataStreamList.of(updated),  // feedback: sent back into the loop
            DataStreamList.of(updated)   // output: leaves the iteration
        );
    }
);

Bounded Iteration with Configuration

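// initParameters, dataset, and iterationBody are assumed to be defined elsewhere.
IterationConfig config = IterationConfig.newBuilder()
    .setMaxRounds(10)  // run at most 10 rounds
    .setOperatorLifeCycle(IterationConfig.OperatorLifeCycle.ALL_ROUND)
    .build();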

DataStreamList result = Iterations.iterateBoundedStreamsUntilTermination(
    DataStreamList.of(initParameters),
    ReplayableDataStreamList.notReplay(dataset),
    config,
    iterationBody
);
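
The operator lifecycle setting decides whether operator state survives from one round to the next. The plain-Java sketch below illustrates the difference under two assumptions: that a per-round counterpart to `ALL_ROUND` exists (called `PER_ROUND` here), and that per-round operators are re-created at the start of each round. `LifeCycleSketch` and `CountingOperator` are hypothetical names and do not use Flink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Models how operator state behaves under an all-round vs. a per-round lifecycle. */
final class LifeCycleSketch {

    enum LifeCycle { ALL_ROUND, PER_ROUND }

    /** A tiny stateful "operator" that counts the records it has seen. */
    static final class CountingOperator {
        private int seen;

        int process(int recordsInRound) {
            seen += recordsInRound;
            return seen;
        }
    }

    /** Returns the operator's record count as observed at the end of each round. */
    static List<Integer> run(LifeCycle lifeCycle, int rounds, int recordsPerRound) {
        Supplier<CountingOperator> factory = CountingOperator::new;
        CountingOperator operator = factory.get();
        List<Integer> countsPerRound = new ArrayList<>();
        for (int round = 0; round < rounds; round++) {
            if (lifeCycle == LifeCycle.PER_ROUND && round > 0) {
                // Fresh instance: state from earlier rounds is discarded.
                operator = factory.get();
            }
            countsPerRound.add(operator.process(recordsPerRound));
        }
        return countsPerRound;
    }
}
```

For three rounds of five records, the all-round variant reports `[5, 10, 15]` while the per-round variant reports `[5, 5, 5]`. Algorithms that cache data or accumulate statistics across rounds need the first behaviour.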

Linear Regression Example

See src/main/java/org/apache/flink/iteration/examples/LinearRegressionExample.java for a complete example of implementing linear regression with SGD using the iteration API.
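
For readers who want the algorithm before opening that file, the following plain-Java sketch shows SGD for a one-feature linear model with a fixed number of rounds. It is not the content of `LinearRegressionExample.java` and does not use Flink. The class name, the fixed sample order, and the constant learning rate are assumptions made to keep the example deterministic.

```java
/** Plain-Java SGD for y ≈ w * x + b with a bounded number of rounds. */
final class SgdLinearRegressionSketch {

    /** Trains the model and returns {w, b}. */
    static double[] train(double[] xs, double[] ys, int maxRounds, double learningRate) {
        // Model parameters: the state that a bounded iteration would feed back each round.
        double w = 0.0;
        double b = 0.0;
        for (int round = 0; round < maxRounds; round++) {   // one round = one pass over the data
            for (int i = 0; i < xs.length; i++) {
                double error = (w * xs[i] + b) - ys[i];     // prediction minus target
                // Gradient step for the squared loss 0.5 * error^2.
                w -= learningRate * error * xs[i];
                b -= learningRate * error;
            }
        }
        return new double[] {w, b};
    }

    public static void main(String[] args) {
        double[] xs = {0, 1, 2, 3, 4};
        double[] ys = {1, 3, 5, 7, 9};                      // generated from y = 2x + 1
        double[] model = train(xs, ys, 500, 0.05);
        System.out.printf("w=%.4f b=%.4f%n", model[0], model[1]);
    }
}
```

In iteration terms, `w` and `b` play the role of the variable stream, the training samples play the role of the data stream, and `maxRounds` is the termination condition.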

API Components

IterationBody

The core interface for defining iteration logic:

public interface IterationBody {
    IterationBodyResult process(
        DataStreamList variableStreams,
        DataStreamList dataStreams
    );
}

IterationListener

For operators that need epoch notifications:

public interface IterationListener<T> {
    void onEpochWatermarkIncremented(
        int epochWatermark,
        Context context,
        Collector<T> collector
    );

    void onIterationTerminated(
        Context context,
        Collector<T> collector
    );
}
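
A typical use of these callbacks is to buffer or aggregate records during a round and emit a result once the round is known to be complete. The sketch below models that pattern in plain Java. `EpochListener` is a simplified stand-in for the interface above: it drops the `Context` parameter and uses `java.util.function.Consumer` in place of Flink's `Collector`. The driver loop is likewise an assumption about when the callbacks fire.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Models an operator that reacts to epoch-watermark and termination callbacks. */
final class EpochListenerSketch {

    /** Simplified stand-in for IterationListener (no Context, Consumer instead of Collector). */
    interface EpochListener<T> {
        void onEpochWatermarkIncremented(int epochWatermark, Consumer<T> collector);

        void onIterationTerminated(Consumer<T> collector);
    }

    /** Sums the records of each round and emits the sum once the round is complete. */
    static final class RoundSum implements EpochListener<String> {
        private long roundSum;
        private long total;

        void processElement(int value) {
            roundSum += value;
        }

        @Override
        public void onEpochWatermarkIncremented(int epochWatermark, Consumer<String> collector) {
            collector.accept("epoch " + epochWatermark + " sum=" + roundSum);
            total += roundSum;
            roundSum = 0; // start the next round from a clean per-round state
        }

        @Override
        public void onIterationTerminated(Consumer<String> collector) {
            collector.accept("total=" + total);
        }
    }

    /** Drives the operator; rounds.get(r) holds the records of round r. */
    static List<String> drive(List<List<Integer>> rounds) {
        RoundSum operator = new RoundSum();
        List<String> out = new ArrayList<>();
        for (int epoch = 0; epoch < rounds.size(); epoch++) {
            rounds.get(epoch).forEach(operator::processElement);
            operator.onEpochWatermarkIncremented(epoch, out::add);
        }
        operator.onIterationTerminated(out::add);
        return out;
    }
}
```

Calling `drive(List.of(List.of(1, 2), List.of(3)))` yields one line per completed epoch followed by a final `total=6` line.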

DataStreamList

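Container for multiple streams, which may have different element types. get(index) returns the stream at that position with the element type chosen by the caller: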

DataStreamList streams = DataStreamList.of(stream1, stream2, stream3);
DataStream<MyType> first = streams.get(0);
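
The `get(0)` call above returns a typed stream from an untyped container, which suggests a generic accessor with an unchecked cast. The plain-Java sketch below shows that pattern with `java.util.List` standing in for `DataStream`. `TypedListSketch` is a hypothetical name, and whether this project's `DataStreamList` is implemented the same way is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** A heterogeneous container whose accessor lets the caller choose the element type. */
final class TypedListSketch {
    private final List<List<?>> streams;

    private TypedListSketch(List<List<?>> streams) {
        this.streams = streams;
    }

    static TypedListSketch of(List<?>... streams) {
        return new TypedListSketch(new ArrayList<>(Arrays.asList(streams)));
    }

    int size() {
        return streams.size();
    }

    /**
     * Returns the entry at the given index typed as List<T>.
     * The cast is unchecked: a wrong T is only detected when an element is read.
     */
    @SuppressWarnings("unchecked")
    <T> List<T> get(int index) {
        return (List<T>) streams.get(index);
    }
}
```

The convenience comes at a cost: the compiler cannot verify the requested type, so the caller has to keep track of which index holds which element type.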

ReplayableDataStreamList

Specify which streams should be replayed in bounded iteration:

ReplayableDataStreamList.replay(stream1, stream2)
    .andNotReplay(stream3);
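
The difference between replayed and non-replayed inputs can be illustrated with a small plain-Java model. The semantics shown here are an assumption based on the method names: a replayed input is delivered again in every round, while a non-replayed input is delivered only in the first round, so the body has to cache it if later rounds need it. `ReplaySketch` and `Input` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Models which input elements an iteration body receives in each round. */
final class ReplaySketch {

    /** An input plus a flag saying whether it is re-delivered in every round. */
    record Input(List<Integer> elements, boolean replay) {}

    /** Returns, for each round, the elements delivered to the iteration body. */
    static List<List<Integer>> deliveries(List<Input> inputs, int rounds) {
        List<List<Integer>> perRound = new ArrayList<>();
        for (int round = 0; round < rounds; round++) {
            List<Integer> delivered = new ArrayList<>();
            for (Input input : inputs) {
                // Every input is delivered in round 0; only replayed inputs come back later.
                if (round == 0 || input.replay()) {
                    delivered.addAll(input.elements());
                }
            }
            perRound.add(delivered);
        }
        return perRound;
    }
}
```

With a replayed input `[1, 2]` and a non-replayed input `[9]`, three rounds deliver `[1, 2, 9]`, then `[1, 2]`, then `[1, 2]`.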

Testing

Run tests with:

mvn test

Examples

  • Linear Regression: Demonstrates synchronous bounded iteration for ML training
  • More examples coming soon...

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Contributing

Contributions are welcome! Please ensure:

  • All tests pass
  • Code follows Flink coding conventions
  • New features include tests
  • Documentation is updated

Future Enhancements

  • Support for nested iterations
  • Advanced termination criteria
  • More efficient serialization for IterationRecord and related types
  • Performance optimizations for large-scale iterations
  • Additional ML algorithm examples
  • Asynchronous Execution
  • Integration with Flink ML library

Contact

For questions and support, please open an issue on GitHub.
