[Bug] [checkpoint-storage] Checkpoints are only written to the first node and sometimes time out. #7106

Open
2 of 3 tasks
WarsenLiu opened this issue Jul 4, 2024 · 0 comments

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

I built a SeaTunnel Engine cluster on three 8u32g servers (8 CPU cores, 32 GB memory each). Checkpoints sometimes time out, and checkpoint files are always written to the first node only, which leaves too many inodes in use on that node.
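
To check which node actually receives the checkpoint files and how many inodes they take up, the files under the checkpoint `namespace` directory can be counted on each server. The following is a minimal sketch and not part of SeaTunnel: the function name `count_checkpoint_files` is hypothetical, the default path is the `namespace` value from the SeaTunnel Config section, and the script makes no assumption about how SeaTunnel names the subdirectories; it simply groups files by whatever top-level subdirectory they sit in.

```python
import os
import sys
from collections import Counter


def count_checkpoint_files(namespace_dir):
    """Count regular files under namespace_dir, grouped by top-level subdirectory.

    Files that sit directly in namespace_dir are counted under the key ".".
    A missing directory yields an empty Counter.
    """
    counts = Counter()
    for root, _dirs, files in os.walk(namespace_dir):
        rel = os.path.relpath(root, namespace_dir)
        top = "." if rel == "." else rel.split(os.sep)[0]
        counts[top] += len(files)
    return counts


if __name__ == "__main__":
    # Default path taken from the `namespace` entry in seatunnel.yaml (adjust as needed).
    default_path = "/data/apache-seatunnel-2.3.5/checkpoint"
    path = sys.argv[1] if len(sys.argv) > 1 else default_path
    for name, n in sorted(count_checkpoint_files(path).items()):
        print(f"{name}\t{n}")
```

Running the script on all three servers and comparing the output should show whether only the first node accumulates files, as described above.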

SeaTunnel Version

2.3.5

SeaTunnel Config

seatunnel:
  engine:
    classloader-cache-mode: true
    history-job-expire-minutes: 180
    backup-count: 1
    queue-type: blockingqueue
    print-execution-info-interval: 60
    print-job-metrics-info-interval: 60
    slot-service:
      dynamic-slot: true
    checkpoint:
      interval: 300000
      timeout: 600000
      storage:
        type: hdfs
        max-retained: 3
        plugin-config:
          namespace: /data/apache-seatunnel-2.3.5/checkpoint
          # namespace: /tmp/seatunnel/checkpoint_snapshot
          storage.type: hdfs
          fs.defaultFS: file:///data/apache-seatunnel-2.3.5/

Running Command

Submitted through DolphinScheduler (ds) with the following job config:
env {
  parallelism = 1
  job.mode = "STREAMING"
  checkpoint.interval = 60000
  job.name = "z003"
}

source {
  MySQL-CDC {
    base-url = "jdbc:mysql://xxx:3306/xxx?autoReconnect=true"
    username = "root"
    password = "xxx"
    table-names = ["xxx.xxx"]
    startup.mode = "initial"
    result_table_name = "source_table_2"
    query = "select xxx from xxx"
  }
}

transform {
  Sql {
    source_table_name = "source_table_2"
    result_table_name = "target_table_2"
    query = "select xxx from source_table_2"
  }
  Sql {
    source_table_name = "target_table_2"
    result_table_name = "target_table_log_2"
    query = "select xxx from target_table_2"
  }
}

sink {
  Jdbc {
    url = "jdbc:mysql://xxx:3306/xxx?autoReconnect=true"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "root"
    password = "xxx"
    database = "xxx"
    source_table_name = "target_table_2"
    generate_sink_sql = true
    table = "xxx"
    batch_size = 10
    primary_keys = ["xxx"]
  }
  Jdbc {
    url = "jdbc:mysql://xxx:3306/xxx?autoReconnect=true"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "root"
    password = "xxx"
    database = "xxx"
    source_table_name = "target_table_log_2"
    batch_size = 10
    query = "insert into xxx(xxx) values(?) ON DUPLICATE KEY UPDATE field= VALUES(field);"
  }
}

Error Exception

[INFO] 2024-07-04 13:37:16.704 +0800 -  -> 
	2024-07-04 13:37:15,926 INFO  org.apache.seatunnel.engine.client.job.ClientJobProxy - Job (861093691492663301) end with state FAILED
	2024-07-04 13:37:15,927 INFO  com.hazelcast.core.LifecycleService - hz.client_1 [seatunnel] [5.1] HazelcastClient 5.1 (20220228 - 21f20e7) is SHUTTING_DOWN
	2024-07-04 13:37:15,936 INFO  com.hazelcast.client.impl.connection.ClientConnectionManager - hz.client_1 [seatunnel] [5.1] Removed connection to endpoint: [10.60.162.14]:5801:aed286d6-1625-4ff4-91b6-34028db07da3, connection: ClientConnection{alive=false, connectionId=2, channel=NioChannel{/10.60.162.35:52531->/10.60.162.14:5801}, remoteAddress=[10.60.162.14]:5801, lastReadTime=2024-07-04 13:37:08.482, lastWriteTime=2024-07-04 13:37:08.481, closedTime=2024-07-04 13:37:15.932, connected server version=5.1}
	2024-07-04 13:37:15,940 INFO  com.hazelcast.client.impl.connection.ClientConnectionManager - hz.client_1 [seatunnel] [5.1] Removed connection to endpoint: [10.60.162.16]:5801:a3bbbfda-b0bc-4738-a148-a17337bdb588, connection: ClientConnection{alive=false, connectionId=3, channel=NioChannel{/10.60.162.35:47319->/10.60.162.16:5801}, remoteAddress=[10.60.162.16]:5801, lastReadTime=2024-07-04 13:37:13.484, lastWriteTime=2024-07-04 13:37:13.482, closedTime=2024-07-04 13:37:15.937, connected server version=5.1}
	2024-07-04 13:37:15,942 INFO  com.hazelcast.client.impl.connection.ClientConnectionManager - hz.client_1 [seatunnel] [5.1] Removed connection to endpoint: [10.60.162.31]:5801:b9abbcd8-ac93-41d8-9d85-25664ce23716, connection: ClientConnection{alive=false, connectionId=1, channel=NioChannel{/10.60.162.35:49005->/10.60.162.31:5801}, remoteAddress=[10.60.162.31]:5801, lastReadTime=2024-07-04 13:37:15.906, lastWriteTime=2024-07-04 13:37:13.328, closedTime=2024-07-04 13:37:15.940, connected server version=5.1}
	2024-07-04 13:37:15,942 INFO  com.hazelcast.core.LifecycleService - hz.client_1 [seatunnel] [5.1] HazelcastClient 5.1 (20220228 - 21f20e7) is CLIENT_DISCONNECTED
	2024-07-04 13:37:15,946 INFO  com.hazelcast.core.LifecycleService - hz.client_1 [seatunnel] [5.1] HazelcastClient 5.1 (20220228 - 21f20e7) is SHUTDOWN
	2024-07-04 13:37:15,946 INFO  org.apache.seatunnel.core.starter.seatunnel.command.ClientExecuteCommand - Closed SeaTunnel client......
	2024-07-04 13:37:15,946 INFO  org.apache.seatunnel.core.starter.seatunnel.command.ClientExecuteCommand - Closed metrics executor service ......
	2024-07-04 13:37:15,946 ERROR org.apache.seatunnel.core.starter.SeaTunnel - 
	
	===============================================================================
	
	
	2024-07-04 13:37:15,947 ERROR org.apache.seatunnel.core.starter.SeaTunnel - Fatal Error, 
	
	2024-07-04 13:37:15,947 ERROR org.apache.seatunnel.core.starter.SeaTunnel - Please submit bug report in https://github.com/apache/seatunnel/issues
	
	2024-07-04 13:37:15,947 ERROR org.apache.seatunnel.core.starter.SeaTunnel - Reason:SeaTunnel job executed failed 
	
	2024-07-04 13:37:15,949 ERROR org.apache.seatunnel.core.starter.SeaTunnel - Exception StackTrace:org.apache.seatunnel.core.starter.exception.CommandExecuteException: SeaTunnel job executed failed
		at org.apache.seatunnel.core.starter.seatunnel.command.ClientExecuteCommand.execute(ClientExecuteCommand.java:202)
		at org.apache.seatunnel.core.starter.SeaTunnel.run(SeaTunnel.java:40)
		at org.apache.seatunnel.core.starter.seatunnel.SeaTunnelClient.main(SeaTunnelClient.java:34)
	Caused by: org.apache.seatunnel.engine.common.exception.SeaTunnelEngineException: org.apache.seatunnel.engine.server.checkpoint.CheckpointException: Checkpoint expired before completing. Please increase checkpoint timeout in the seatunnel.yaml or jobConfig env.
		at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.handleCoordinatorError(CheckpointCoordinator.java:274)
		at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.lambda$null$9(CheckpointCoordinator.java:590)
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
		at java.lang.Thread.run(Thread.java:750)
	
		at org.apache.seatunnel.core.starter.seatunnel.command.ClientExecuteCommand.execute(ClientExecuteCommand.java:194)
		... 2 more
	 
	2024-07-04 13:37:15,949 ERROR org.apache.seatunnel.core.starter.SeaTunnel - 
	===============================================================================
	
	
	
	Exception in thread "main" org.apache.seatunnel.core.starter.exception.CommandExecuteException: SeaTunnel job executed failed
		at org.apache.seatunnel.core.starter.seatunnel.command.ClientExecuteCommand.execute(ClientExecuteCommand.java:202)
		at org.apache.seatunnel.core.starter.SeaTunnel.run(SeaTunnel.java:40)
		at org.apache.seatunnel.core.starter.seatunnel.SeaTunnelClient.main(SeaTunnelClient.java:34)
	Caused by: org.apache.seatunnel.engine.common.exception.SeaTunnelEngineException: org.apache.seatunnel.engine.server.checkpoint.CheckpointException: Checkpoint expired before completing. Please increase checkpoint timeout in the seatunnel.yaml or jobConfig env.
		at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.handleCoordinatorError(CheckpointCoordinator.java:274)
		at org.apache.seatunnel.engine.server.checkpoint.CheckpointCoordinator.lambda$null$9(CheckpointCoordinator.java:590)
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
		at java.lang.Thread.run(Thread.java:750)
	
		at org.apache.seatunnel.core.starter.seatunnel.command.ClientExecuteCommand.execute(ClientExecuteCommand.java:194)
		... 2 more
	2024-07-04 13:37:15,951 INFO  org.apache.seatunnel.core.starter.seatunnel.command.ClientExecuteCommand - run shutdown hook because get close signal
[INFO] 2024-07-04 13:37:16.706 +0800 - process has exited. execute path:/data/hubs/dolphinscheduler/tmp/exec/process/default/13823066283424/13831054200608_19/164485/311103, processId:560021 ,exitStatusCode:1 ,processWaitForStatus:true ,processExitValue:1
[INFO] 2024-07-04 13:37:16.707 +0800 - ***********************************************************************************************
[INFO] 2024-07-04 13:37:16.707 +0800 - *********************************  Finalize task instance  ************************************
[INFO] 2024-07-04 13:37:16.707 +0800 - ***********************************************************************************************
[INFO] 2024-07-04 13:37:16.707 +0800 - Upload output files: [] successfully
[INFO] 2024-07-04 13:37:16.707 +0800 - Send task execute status: FAILURE to master : 10.60.162.35:1234
[INFO] 2024-07-04 13:37:16.708 +0800 - Remove the current task execute context from worker cache
[INFO] 2024-07-04 13:37:16.708 +0800 - The current execute mode isn't develop mode, will clear the task execute file: /data/hubs/dolphinscheduler/tmp/exec/process/default/13823066283424/13831054200608_19/164485/311103
[INFO] 2024-07-04 13:37:16.708 +0800 - Success clear the task execute file: /data/hubs/dolphinscheduler/tmp/exec/process/default/13823066283424/13831054200608_19/164485/311103
[INFO] 2024-07-04 13:37:16.708 +0800 - FINALIZE_SESSION

Zeta or Flink or Spark Version

Zeta

Java or Scala Version

java version "1.8.0_401"
Java(TM) SE Runtime Environment (build 1.8.0_401-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.401-b10, mixed mode)

Screenshots

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

WarsenLiu added the bug label Jul 4, 2024