[Bug] [checkpoint-storage] Checkpoints are only written on the first node and sometimes time out. #7106

WarsenLiu · 2024-07-04T07:06:52Z

Search before asking

I searched the issues and found no similar ones.

What happened

I built a SeaTunnel Engine cluster on three 8u32g servers (8 CPU cores and 32 GB of memory each). Checkpoints sometimes time out, and checkpoint files are always written to the first node only, so that node ends up using too many inodes.
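
To put a number on the inode growth, a small script like the following can be run on each node. This is only a sketch: the helper name `count_entries` is made up for illustration, the default path is the `namespace` value from the config below, and it assumes each file and directory under that path costs roughly one inode.

```python
import os

# Checkpoint namespace from seatunnel.yaml; adjust if yours differs.
CHECKPOINT_ROOT = "/data/apache-seatunnel-2.3.5/checkpoint"


def count_entries(root):
    """Return (directories, files) found under root, not counting root itself."""
    dirs = 0
    files = 0
    # os.walk does not follow symlinks by default and yields nothing
    # if root does not exist, so a missing directory simply reports zero.
    for _, dirnames, filenames in os.walk(root):
        dirs += len(dirnames)
        files += len(filenames)
    return dirs, files


if __name__ == "__main__":
    d, f = count_entries(CHECKPOINT_ROOT)
    print(f"{CHECKPOINT_ROOT}: {d} directories, {f} files")
```

Comparing the output across the three servers should show whether only the first node accumulates checkpoint files.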

SeaTunnel Version

2.3.5

SeaTunnel Config

seatunnel:
  engine:
    classloader-cache-mode: true
    history-job-expire-minutes: 180
    backup-count: 1
    queue-type: blockingqueue
    print-execution-info-interval: 60
    print-job-metrics-info-interval: 60
    slot-service:
      dynamic-slot: true
    checkpoint:
      interval: 300000
      timeout: 600000
      storage:
        type: hdfs
        max-retained: 3
        plugin-config:
          namespace: /data/apache-seatunnel-2.3.5/checkpoint
          # namespace: /tmp/seatunnel/checkpoint_snapshot
          storage.type: hdfs
          fs.defaultFS: file:///data/apache-seatunnel-2.3.5/
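
Note on the config above: `fs.defaultFS` uses the `file://` scheme, which Hadoop resolves to the local filesystem, so the checkpoint path points at a single machine's own disk and not at storage shared by the three servers. Whether this is the cause of the single-node writes is an assumption and has not been confirmed. The sketch below only shows how such a value can be checked; the function name `is_node_local` and the `hdfs://namenode:8020` address are hypothetical.

```python
from urllib.parse import urlparse

# URI schemes that point at a node's own disk, not at shared storage.
LOCAL_SCHEMES = {"file"}


def is_node_local(default_fs):
    """Return True if the fs.defaultFS value refers to the local filesystem."""
    return urlparse(default_fs).scheme in LOCAL_SCHEMES


# Value from this issue's seatunnel.yaml.
print(is_node_local("file:///data/apache-seatunnel-2.3.5/"))  # True
# Hypothetical shared HDFS address, for comparison only.
print(is_node_local("hdfs://namenode:8020"))  # False
```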

Running Command

Submitted through DolphinScheduler (ds) with this job config:
env {
  parallelism = 1
  job.mode = "STREAMING"
  checkpoint.interval = 60000
  job.name = "z003"
}

source {
  MySQL-CDC {
    base-url = "jdbc:mysql://xxx:3306/xxx?autoReconnect=true"
    username = "root"
    password = "xxx"
    table-names = ["xxx.xxx"]
    startup.mode = "initial"
    result_table_name = "source_table_2"
    query = "select xxx from xxx"
  }
}

transform {
  Sql {
    source_table_name = "source_table_2"
    result_table_name = "target_table_2"
    query = "select xxx from source_table_2"
  }
  Sql {
    source_table_name = "target_table_2"
    result_table_name = "target_table_log_2"
    query = "select xxx from target_table_2"
  }
}

sink {
  Jdbc {
    url = "jdbc:mysql://xxx:3306/xxx?autoReconnect=true"
    driver= "com.mysql.cj.jdbc.Driver"
    user = "root"
    password = "xxx"
    database = "xxx"
    source_table_name = "target_table_2"
    generate_sink_sql = true
    table = "xxx"
    batch_size = 10
    primary_keys = ["xxx"]
  }
  Jdbc {
    url = "jdbc:mysql://xxx:3306/xxx?autoReconnect=true"
    driver= "com.mysql.cj.jdbc.Driver"
    user = "root"
    password = "xxx"
    database = "xxx"
    source_table_name = "target_table_log_2"
    batch_size = 10
    query = "insert into xxx(xxx) values(?) ON DUPLICATE KEY UPDATE field= VALUES(field);"
  }
}

Error Exception

[INFO] 2024-07-04 13:37:16.704 +0800 -  -> 
	2024-07-04 13:37:15,926 INFO  org.apache.seatunnel.engine.client.job.ClientJobProxy - Job (861093691492663301) end with state FAILED
	2024-07-04 13:37:15,927 INFO  com.hazelcast.core.LifecycleService - hz.client_1 [seatunnel] [5.1] HazelcastClient 5.1 (20220228 - 21f20e7) is SHUTTING_DOWN
	2024-07-04 13:37:15,936 INFO  com.hazelcast.client.impl.connection.ClientConnectionManager - hz.client_1 [seatunnel] [5.1] Removed connection to endpoint: [10.60.162.14]:5801:aed286d6-1625-4ff4-91b6-34028db07da3, connection: ClientConnection{alive=false, connectionId=2, channel=NioChannel{/10.60.162.35:52531->/10.60.162.14:5801}, remoteAddress=[10.60.162.14]:5801, lastReadTime=2024-07-04 13:37:08.482, lastWriteTime=2024-07-04 13:37:08.481, closedTime=2024-07-04 13:37:15.932, connected server version=5.1}
	2024-07-04 13:37:15,940 INFO  com.hazelcast.client.impl.connection.ClientConnectionManager - hz.client_1 [seatunnel] [5.1] Removed connection to endpoint: [10.60.162.16]:5801:a3bbbfda-b0bc-4738-a148-a17337bdb588, connection: ClientConnection{alive=false, connectionId=3, channel=NioChannel{/10.60.162.35:47319->/10.60.162.16:5801}, remoteAddress=[10.60.162.16]:5801, lastReadTime=2024-07-04 13:37:13.484, lastWriteTime=2024-07-04 13:37:13.482, closedTime=2024-07-04 13:37:15.937, connected server version=5.1}
	2024-07-04 13:37:15,942 INFO  com.hazelcast.client.impl.connection.ClientConnectionManager - hz.client_1 [seatunnel] [5.1] Removed connection to endpoint: [10.60.162.31]:5801:b9abbcd8-ac93-41d8-9d85-25664ce23716, connection: ClientConnection{alive=false, connectionId=1, channel=NioChannel{/10.60.162.35:49005->/10.60.162.31:5801}, remoteAddress=[10.60.162.31]:5801, lastReadTime=2024-07-04 13:37:15.906, lastWriteTime=2024-07-04 13:37:13.328, closedTime=2024-07-04 13:37:15.940, connected server version=5.1}
	2024-07-04 13:37:15,942 INFO  com.hazelcast.core.LifecycleService - hz.client_1 [seatunnel] [5.1] HazelcastClient 5.1 (20220228 - 21f20e7) is CLIENT_DISCONNECTED
	2024-07-04 13:37:15,946 INFO  com.hazelcast.core.LifecycleService - hz.client_1 [seatunnel] [5.1] HazelcastClient 5.1 (20220228 - 21f20e7) is SHUTDOWN
	2024-07-04 13:37:15,946 INFO  org.apache.seatunnel.core.starter.seatunnel.command.ClientExecuteCommand - Closed SeaTunnel client......
	2024-07-04 13:37:15,946 INFO  org.apache.seatunnel.core.starter.seatunnel.command.ClientExecuteCommand - Closed metrics executor service ......
	2024-07-04 13:37:15,946 ERROR org.apache.seatunnel.core.starter.SeaTunnel - 
	
	===============================================================================
	
	
	2024-07-04 13:37:15,947 ERROR org.apache.seatunnel.core.starter.SeaTunnel - Fatal Error, 
	
	2024-07-04 13:37:15,947 ERROR org.apache.seatunnel.core.starter.SeaTunnel - Please submit bug report in https://github.com/apache/seatunnel/issues
	
	2024-07-04 13:37:15,947 ERROR org.apache.seatunnel.core.starter.SeaTunnel - Reason:SeaTunnel job executed failed 
	
	2024-07-04 13:37:15,949 ERROR org.apache.seatunnel.core.starter.SeaTunnel - Exception StackTrace:org.apache.seatunnel.core.starter.exception.CommandExecuteException: SeaTunnel job executed failed
		at org.apache.seatunnel.core.starter.seatunnel.command.ClientExecuteCommand.execute(ClientExecuteCommand.java:202)
		at org.apache.seatunnel.core.starter.SeaTunnel.run(SeaTunnel.java:40)
		at org.apache.seatunnel.core.starter.seatunnel.SeaTunnelClient.main(SeaTunnelClient.java:34)
	Caused by: org.apache.seatunnel.engine.common.exception.SeaTunnelEngineException: org.apache.seatunnel.engine.server.checkpoint.CheckpointException: Checkpoint expired before completing. Please increase checkpoint timeout in the seatunnel.yaml or jobConfig env.
		at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.handleCoordinatorError(CheckpointCoordinator.java:274)
		at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.lambda$null$9(CheckpointCoordinator.java:590)
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
		at java.lang.Thread.run(Thread.java:750)
	
		at org.apache.seatunnel.core.starter.seatunnel.command.ClientExecuteCommand.execute(ClientExecuteCommand.java:194)
		... 2 more
	 
	2024-07-04 13:37:15,949 ERROR org.apache.seatunnel.core.starter.SeaTunnel - 
	===============================================================================
	
	
	
	Exception in thread "main" org.apache.seatunnel.core.starter.exception.CommandExecuteException: SeaTunnel job executed failed
		at org.apache.seatunnel.core.starter.seatunnel.command.ClientExecuteCommand.execute(ClientExecuteCommand.java:202)
		at org.apache.seatunnel.core.starter.SeaTunnel.run(SeaTunnel.java:40)
		at org.apache.seatunnel.core.starter.seatunnel.SeaTunnelClient.main(SeaTunnelClient.java:34)
	Caused by: org.apache.seatunnel.engine.common.exception.SeaTunnelEngineException: org.apache.seatunnel.engine.server.checkpoint.CheckpointException: Checkpoint expired before completing. Please increase checkpoint timeout in the seatunnel.yaml or jobConfig env.
		at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.handleCoordinatorError(CheckpointCoordinator.java:274)
		at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.lambda$null$9(CheckpointCoordinator.java:590)
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
		at java.lang.Thread.run(Thread.java:750)
	
		at org.apache.seatunnel.core.starter.seatunnel.command.ClientExecuteCommand.execute(ClientExecuteCommand.java:194)
		... 2 more
	2024-07-04 13:37:15,951 INFO  org.apache.seatunnel.core.starter.seatunnel.command.ClientExecuteCommand - run shutdown hook because get close signal
[INFO] 2024-07-04 13:37:16.706 +0800 - process has exited. execute path:/data/hubs/dolphinscheduler/tmp/exec/process/default/13823066283424/13831054200608_19/164485/311103, processId:560021 ,exitStatusCode:1 ,processWaitForStatus:true ,processExitValue:1
[INFO] 2024-07-04 13:37:16.707 +0800 - ***********************************************************************************************
[INFO] 2024-07-04 13:37:16.707 +0800 - *********************************  Finalize task instance  ************************************
[INFO] 2024-07-04 13:37:16.707 +0800 - ***********************************************************************************************
[INFO] 2024-07-04 13:37:16.707 +0800 - Upload output files: [] successfully
[INFO] 2024-07-04 13:37:16.707 +0800 - Send task execute status: FAILURE to master : 10.60.162.35:1234
[INFO] 2024-07-04 13:37:16.708 +0800 - Remove the current task execute context from worker cache
[INFO] 2024-07-04 13:37:16.708 +0800 - The current execute mode isn't develop mode, will clear the task execute file: /data/hubs/dolphinscheduler/tmp/exec/process/default/13823066283424/13831054200608_19/164485/311103
[INFO] 2024-07-04 13:37:16.708 +0800 - Success clear the task execute file: /data/hubs/dolphinscheduler/tmp/exec/process/default/13823066283424/13831054200608_19/164485/311103
[INFO] 2024-07-04 13:37:16.708 +0800 - FINALIZE_SESSION

Zeta or Flink or Spark Version

Zeta

Java or Scala Version

java version "1.8.0_401"
Java(TM) SE Runtime Environment (build 1.8.0_401-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.401-b10, mixed mode)

Screenshots

No response

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct


WarsenLiu added the bug label Jul 4, 2024
