Sync job hanging after source job container has completed #8218
Original comment from @lmossman: Hello, I took a look at your issue but I couldn't find any obvious answer as to why your job is hanging. It seems like you have narrowed down the problem to show that the process is hanging on the `messageIterator.hasNext()` call in the DefaultAirbyteSource, which in turn is calling BufferedReader#readLine. I tried looking this up and it appears that this is an issue others have faced. @Dracyr Does it always hang after the same record? Also, I am going to reassign this ticket to @benmoriceau as he is the current Platform OC engineer.
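For background on why this blocks: `BufferedReader#readLine` returns `null` only when it sees EOF, which for a socket or pipe means the write end actually closing. A minimal, self-contained sketch (plain Java, not Airbyte code) of the same hang:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;

// Demo: readLine() returns null only on EOF, i.e. when the writing side is
// closed. If the writer stops producing without closing the stream
// (analogous to the source container exiting while its stdout socket stays
// open), the reader blocks forever on the next readLine().
public class ReadLineHangDemo {

  public static void main(String[] args) throws IOException {
    final PipedOutputStream writerSide = new PipedOutputStream();
    final PipedInputStream readerSide = new PipedInputStream(writerSide);

    final Thread writer = new Thread(() -> {
      try {
        writerSide.write("record 1\nrecord 2\n".getBytes(StandardCharsets.UTF_8));
        writerSide.flush();
        Thread.sleep(Long.MAX_VALUE); // keep the write end open, like a lingering socket
      } catch (final IOException | InterruptedException ignored) {
      }
    });
    writer.setDaemon(true);
    writer.start();

    final BufferedReader reader = new BufferedReader(
        new InputStreamReader(readerSide, StandardCharsets.UTF_8));
    System.out.println(reader.readLine()); // "record 1"
    System.out.println(reader.readLine()); // "record 2"
    System.out.println(reader.readLine()); // blocks forever: no EOF ever arrives
  }
}
```

If some relay (for example a sidecar holding the stdout socket) keeps the stream open after the connector's main process exits, the worker would wait exactly like this; that is one hypothesis consistent with the behavior described in this issue.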
Hello @Dracyr, we accidentally transferred your original ticket to another project, which shouldn't have happened. Let's restore it here. I am trying to figure out which person would be the most relevant to address the issue you reported; @lmossman had some questions that will help us figure it out. Thanks,
Here was the response that @Dracyr provided to my question over Slack, since they could not post on the ticket after it was transferred:

Do you think that it could be a size limit on DB2?
Size limit on what exactly? If I generate all the records in the container:

```sh
# Source container running in the cluster, running 'sleep infinity',
# with the same source_config.json/source_catalog.json copied in
pv@localhost$ kubectl -n airbyte exec -it airbyte-source-db2-worker-pv /bin/bash
airbyte@airbyte-source-db2-worker-pv:/config$ eval "$AIRBYTE_ENTRYPOINT read --config source_config.json --catalog source_catalog.json" > records.txt
```

and copy the resulting file back to my local machine for inspection, that file looks alright. It contains all the expected records.
I added a socket read timeout, which throws a `SocketTimeoutException` if nothing is read for 5 seconds:

```diff
diff --git a/airbyte-workers/src/main/java/io/airbyte/workers/process/KubePodProcess.java b/airbyte-workers/src/main/java/io/airbyte/workers/process/KubePodProcess.java
index 8cb4f4a3f..315eccb11 100644
--- a/airbyte-workers/src/main/java/io/airbyte/workers/process/KubePodProcess.java
+++ b/airbyte-workers/src/main/java/io/airbyte/workers/process/KubePodProcess.java
@@ -37,6 +37,7 @@ import java.io.OutputStream;
 import java.lang.ProcessHandle.Info;
 import java.net.ServerSocket;
 import java.net.Socket;
+import java.net.SocketTimeoutException;
 import java.nio.charset.StandardCharsets;
 import java.nio.file.Path;
 import java.util.AbstractMap;
@@ -448,6 +449,8 @@ public class KubePodProcess extends Process {
       LOGGER.info("Creating stdout socket server...");
       final var socket = stdoutServerSocket.accept(); // blocks until connected
       LOGGER.info("Setting stdout...");
+      socket.setSoTimeout(5000);
+
       this.stdout = socket.getInputStream();
     } catch (final IOException e) {
       e.printStackTrace(); // todo: propagate exception / join at the end of constructor
```

Full logs from run (collapsed). I removed one of the streams, so there are 10k fewer records than before. At least I don't have to manually cancel the job now, but it still fails after a different number of records read each time.
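The 5s `setSoTimeout` converts the silent hang into an exception, but as just noted it can also fire on a merely slow source. A sketch of a gentler variant (not the Airbyte implementation; the `Process` handle is assumed to be reachable at the read site) that treats the timeout as fatal only once the source process is actually gone:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.net.SocketTimeoutException;

public final class TimeoutTolerantReader {

  // Sketch: poll-style read that distinguishes a slow source from a dead one.
  // 'reader' wraps the stdout socket with setSoTimeout(5000) applied;
  // 'process' is the pod-backed Process handle (assumed available here).
  // Caveat: a timeout can discard a partially buffered line, so this is
  // illustrative rather than production-ready.
  public static String readLineOrFail(final BufferedReader reader, final Process process)
      throws IOException {
    while (true) {
      try {
        return reader.readLine(); // null on EOF, i.e. the socket closed cleanly
      } catch (final SocketTimeoutException e) {
        // No data within the timeout: fatal only if the source is gone.
        if (!process.isAlive()) {
          throw new IOException("source exited but its stdout socket never closed", e);
        }
        // Otherwise the source is just slow; retry the read.
      }
    }
  }
}
```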
Helpful context from @Dracyr:
This confirms that the source connector can successfully run using exactly the same settings as the source connector in the failing sync. So the issue is probably in how the worker consumes records from the connector.
Also tried to verify or rule out the network doing stuff now: started up a destination pod, piped the data through to the other side's stdout, and printed it in the shell.

```sh
# Destination pod: open socat, print to stdout
$ socat -d TCP-L:9001 STDOUT

# Source pod, socat container; IP of 'airbyte-destination-postgres-worker-pv'
$ cat /pipes/stdout | socat -d - TCP:192.168.130.40:9001

# Source pod: pipe to the stdout pipe
airbyte@airbyte-source-db2-worker-pv:/config$ eval "$AIRBYTE_ENTRYPOINT read --config source_config.json --catalog source_catalog.json" > /pipes/stdout
```

And all records seem to have made it through to the destination pod successfully as well, which further points to the issue being in the worker consuming the records.
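Since the records make it through socat intact, a natural next step would be to reproduce just the worker's read path in isolation. Below is a minimal stand-in (a hypothetical `PipeReadProbe` class; the pipe path is an assumption) that consumes the stdout pipe line by line, the way the worker's reader does, and reports progress, to see whether plain `readLine` over the pipe also stalls:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Stand-in for the worker's consumption loop: read the stdout pipe line by
// line and report progress, to see whether the stall reproduces without the
// rest of the worker machinery. The pipe path is an assumption.
public class PipeReadProbe {

  public static void main(String[] args) throws IOException {
    final String pipePath = args.length > 0 ? args[0] : "/pipes/stdout";
    long count = 0;
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(pipePath), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        count++;
        if (count % 10_000 == 0) {
          System.err.println("read " + count + " lines so far");
        }
      }
    }
    System.err.println("EOF after " + count + " lines");
  }
}
```

One could compile this and run it in the source pod while the entrypoint writes to `/pipes/stdout`; if it reads everything and reaches EOF, the problem sits above the raw read path.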
Another update: I was experimenting a bit with other tables in the source database, to see if the behavior changed, and for some combinations I can get successful syncs.
So it doesn't look like anything in the source data, since the tables can be synced on their own, but not together. I also tried connecting it to another postgres (so source postgres, destination postgres) and it hung in the same way, so I'm assuming something in the k8s cluster or networking is responsible somehow. If you have some code you'd like me to run in order to test something, I'd be able to compile and run it as well.
I'm a bit at the end of my abilities to look into this further; if you have any idea of something to try out or run, let me know.
FYI, I still experience this issue with Docker on EC2 as well.
@Dracyr, sorry for the delay on a response here. If you are still able to, could you …
Hey! Sorry, that's going to be difficult right now. I was trying to set up an ELT pipeline for my client as a Christmas gift, as one of the final things before my project ended, and now I'm moving on to another one. From memory, however:
@grishick is this intended to be in the destinations in-progress column?

Closing since this is old and doesn't seem relevant.
Environment
dev
worker

Current Behavior
With both source and destination set up, I start a manual sync job.
Both the source and destination containers start up in the cluster, and I see records being read and created on the destination side.
The source container completes successfully, but the job appears to hang. Some records are extracted successfully and inserted on the postgres side, but not all of them.
After this, the destination container and job keep running, and the logs stop.
When I've created and exec'd into the source-db2 container and run the extraction manually, all records/JSON lines are output successfully, so the issue doesn't seem to be on the source. I don't get any error logs from the destination either. What I see is that the

```java
while (!cancelled.get() && !source.isFinished()) {
```

call seems to block forever.

To try and debug this further I've also added some more logging into the `DefaultReplicationWorker.java` and `DefaultAirbyteSource.java` files, where it seems the call to

```java
final var isEmpty = !messageIterator.hasNext();
```

is blocking. I added an overloaded method to `DefaultAirbyteSource.java` just to log a bit more there too.
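The patched snippets themselves are collapsed above; a hypothetical reconstruction of that kind of instrumentation (class and method names assumed, not the actual patch) might look like:

```java
import java.util.Iterator;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch of the extra logging described above: wrap the
// blocking hasNext() call and time it, to show exactly where the worker
// stalls and for how long.
public final class LoggingIteratorProbe {

  private static final Logger LOGGER = LoggerFactory.getLogger(LoggingIteratorProbe.class);

  public static <T> boolean hasNextWithLogging(final Iterator<T> messageIterator) {
    LOGGER.info("calling messageIterator.hasNext()...");
    final long start = System.currentTimeMillis();
    final boolean hasNext = messageIterator.hasNext(); // this is the call observed to block
    LOGGER.info("hasNext() returned {} after {} ms", hasNext, System.currentTimeMillis() - start);
    return hasNext;
  }
}
```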
I'd hoped this was resolved by #8036 or the other related PRs, but that doesn't seem to be the case.
Note: I'm running all of the containers as non-root by wrapping the Dockerfiles, see #7872. I have also deployed k8s NetworkPolicies to make communication possible between the containers in our cluster.
Expected Behavior
The job should complete successfully and not hang.
Logs
Here's one instance where I cancelled the job after 5 minutes of inactivity. I have also waited longer (1h+) to see if it resolves, and have the logs for those runs as well.
In rare cases I get a log from the destination saying something similar (see the LOG below), but for most runs there is nothing.
LOG (collapsed)
Steps to Reproduce
Are you willing to submit a PR?
I would, but I have reached the limit of my debugging abilities.