[FLINK-40064][python][e2e] Migrate PyFlink DataStream e2e test from Kafka to FileSource/FileSink #28629
…afka to FileSource/FileSink Rewrite the job on the bounded FileSource and unified FileSink and re-enable the case. Generated-by: Claude Code (Fable 5)
- watermark_strategy = WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(5))\
-     .with_timestamp_assigner(KafkaRowTimestampAssigner())
+ watermark_strategy = WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(5)) \
Does setting the auto watermark interval above 0 with a watermark strategy that has a 5-second out-of-orderness bound do the same thing as WatermarkStrategy.no_watermarks()? If so, that could be a simplification, and we wouldn't need to set the env config.
Almost: with the interval above 0 they are not quite equivalent, because the bounded-out-of-orderness generator emits periodically, so on a slow read the watermark could advance mid-stream and change which timestamps the timers register at. But you are right that no_watermarks() is the cleaner way to get exactly the interval-0 behavior: it never emits during processing, the timestamp assigner still chains onto it (needed for the ctx.timestamp() assertions), and the end-of-input MAX_WATERMARK is forwarded by TimestampsAndWatermarksOperator regardless of the strategy, so the timers still fire. Applied in fe0a7c4 and re-verified locally (same 16 lines, identical across runs).
…eriodic watermarks Generated-by: Claude Code (Fable 5)
What is the purpose of the change
FLINK-40048 removed the flink-sql-connector-kafka dependency and its maven-dependency-plugin copy from flink-sql-client-test, so no Kafka sql-jar is staged into target/sql-jars anymore. test_pyflink.sh still executed KAFKA_SQL_JAR=$(find "$SQL_JARS_DIR" | grep "kafka") under set -Eeuo pipefail, which now exits 1 and aborts the "PyFlink end-to-end test" nightly leg (e2e_4_ci) before any test runs (first seen in build 76685).

The jar and all Kafka scaffolding in the script only served the "Test PyFlink DataStream job" case, disabled since FLINK-36185 because data_stream_job.py used the legacy FlinkKafkaConsumer/FlinkKafkaProducer. This PR rewrites that job to the in-repo filesystem connector and re-enables the case, fixing the nightly leg and restoring PyFlink DataStream coverage (KeyedProcessFunction + event-time timers) dormant since September 2024. No active Kafka e2e coverage is lost; PyFlink-Kafka e2e coverage belongs in the externalized flink-connector-kafka repository.
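The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not taken from test_pyflink.sh: it assumes bash is available and uses an empty temporary directory (the hypothetical demo_dir) as a stand-in for target/sql-jars. The pattern runs in a child bash process so that its errexit abort can be observed without ending the demo itself.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for target/sql-jars: an empty directory with no Kafka jar.
demo_dir=$(mktemp -d)

# Run the pattern in a child bash process so that its `set -e` exit
# does not terminate this demo script.
status=0
bash -c '
  set -Eeuo pipefail
  # grep finds no match and returns 1; the assignment takes that status,
  # so errexit aborts the child script on this line.
  KAFKA_SQL_JAR=$(find "$1" | grep "kafka")
  echo "not reached"
' _ "$demo_dir" || status=$?

echo "exit status: $status"
rm -rf "$demo_dir"
```

With no matching jar, the child process should stop at the assignment and report exit status 1, without printing "not reached". This mirrors why the nightly leg aborted before any test case ran.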
Brief change log
Verifying this change
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
@Public(Evolving): no
Documentation
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code (Fable 5)