connector proxy - source connectors #409
89007ba
to
c00b6f0
Compare
pub trait Interceptor<T: FlowOperation> {
    fn get_converters() -> RequestResponseConverterPair<T> {
ProxyCommand::ProxyFlowMaterialize(m) => {
    proxy_flow_materialize(m, image_inspect_json_path).await
}
ProxyCommand::DelayedExecute(ba) => delayed_execute(ba.config_file_path).await,
crates/connector_proxy/src/interceptors/airbyte_capture_interceptor.rs
Outdated
crates/connector_proxy/src/main.rs
Outdated
async fn delayed_execute(command_config_path: String) -> Result<(), Error> {
    // Sleep for some time to allow parent process to stop the current process.
    std::thread::sleep(std::time::Duration::from_millis(100));
@jgraettinger @psFried
Thanks @jixij, I'm not able to review this yet but did at least address your comments below.
Reviewed 1 of 10 files at r1, all commit messages.
Reviewable status: 1 of 16 files reviewed, 5 unresolved discussions (waiting on @jixij and @psFried)
crates/connector_proxy/src/apis.rs, line 49 at r6 (raw file):
Previously, jixij (Jixiang Jiang) wrote…
Pass in `pid` in the `convert_request`, so that the connector process can be started after the interceptors. The reason is that the arguments used to start the connector process are constructed by the interceptor, so it is easier if the interceptor does not depend on the connector process having been started.
As we've discussed, I think this can all be simplified by removing the Interceptor trait and ComposedInterceptor, and building the concrete stacks as needed, and as composed literals. Interceptor is a leaky abstraction (since pid is only needed for airbyte) that doesn't buy much.
Instead have a top-level switch over the protocol cases that constructs an impl InterceptorStream by composing its component pieces:
let request_stack = airbyte_request_adapter(
pid,
other_arg_specific_to_airbyte,
some_wrapped_adapter(
some,
stuff,
convert_my_stdin_to_stream(stdin())));
let response_stack = airbyte_response_adapter( ... )
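To make the suggestion above concrete, here is a minimal sketch of a top-level switch that composes a request stack from plain functions. Everything in it is an assumption for illustration: `Frame`, `Protocol`, `lines_to_frames`, `skip_empty`, and `build_request_stack` are hypothetical names, a synchronous `Iterator` stands in for the async `InterceptorStream`, and the adapter bodies are placeholders rather than the real Airbyte translation.

```rust
// Hypothetical stand-in for InterceptorStream: a synchronous iterator of
// byte frames. The real code would presumably use an async stream.
type Frame = Vec<u8>;

/// Innermost layer: split raw input into newline-delimited frames.
fn lines_to_frames(input: &str) -> impl Iterator<Item = Frame> + '_ {
    input.lines().map(|line| line.as_bytes().to_vec())
}

/// Middle layer (placeholder logic): drop empty frames.
fn skip_empty(inner: impl Iterator<Item = Frame>) -> impl Iterator<Item = Frame> {
    inner.filter(|frame| !frame.is_empty())
}

/// Outer layer: receives the Airbyte-only `pid` as an ordinary argument,
/// so no shared trait has to know about it. The tagging is a placeholder.
fn airbyte_request_adapter(
    pid: u32,
    inner: impl Iterator<Item = Frame>,
) -> impl Iterator<Item = Frame> {
    inner.map(move |frame| {
        let mut out = format!("pid={} ", pid).into_bytes();
        out.extend_from_slice(&frame);
        out
    })
}

/// Hypothetical protocol cases for the top-level switch.
enum Protocol {
    Airbyte { pid: u32 },
    Passthrough,
}

/// Each arm builds its own concrete stack as a composed literal.
fn build_request_stack<'a>(
    protocol: Protocol,
    input: &'a str,
) -> Box<dyn Iterator<Item = Frame> + 'a> {
    match protocol {
        Protocol::Airbyte { pid } => {
            Box::new(airbyte_request_adapter(pid, skip_empty(lines_to_frames(input))))
        }
        Protocol::Passthrough => Box::new(lines_to_frames(input)),
    }
}
```

Because each arm names exactly the pieces it needs, an argument such as `pid` only appears in the Airbyte arm.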
crates/connector_proxy/src/main.rs, line 140 at r6 (raw file):
Previously, jixij (Jixiang Jiang) wrote…
I am not using a bash script to start the `proxy-process`, because we cannot assume shell scripts work in all docker images (e.g. `distroless`). So I structured it as this additional sub-command of the `connector-proxy`, which is basically a separate process that starts the real connector after receiving some signals from the `connector-proxy` that starts it.
The specific "call ourselves" approach is reasonable, but I do want to raise a latent concern that we need to take a bigger step back and reconsider the engineering constraints we're setting for ourselves ("it must run in distroless", "we can't assume there's an `ssh` binary we can shell to", etc).
We don't need to hit these constraints to have meaningful functionality that works in most contexts, and as we raise the bar for how self-sufficient we're trying to be, it adds exponentially to development time and maintenance.
crates/connector_proxy/src/main.rs, line 194 at r6 (raw file):
Previously, jixij (Jixiang Jiang) wrote…
This might look a bit hacky, and here is the reason.
The approach I first tried was to let the `delayed_execute` process pause itself and wait for a `sigcont` signal from its parent (the connector proxy) to continue. The issue was that the `sigcont` from the parent might arrive before `delayed_execute` had stopped and was prepared to listen for it. If `delayed_execute` stops after missing the `sigcont` signal, it will be stuck forever.
The main issue is that splitting the stop/resume actions across two processes is not good, because it is difficult to coordinate them. So in this implementation both the `stop` and `start` signals come from the parent process, so that they can be ordered, and I added the sleep logic in the delayed process to give the parent enough time to stop it (it might be messy if the real connector were started before the process was stopped).
Not sure if this is OK. Alternatives I was considering:
- Using a lock file. The parent creates a lock file before starting the child process, and the child process blocks and polls the lock file until it is deleted.
- Building a binary in another language and injecting it into the docker image together with `connector-proxy` (or calling into the other language from Rust). The process API in Rust does not allow us to do much; hopefully the other language could.
A generalized pattern to do this is to have the child write to stderr when it's ready (has installed signal handler in this case).
See flow/js_worker.go for an example.
The child writes "READY\n" to stderr when it's started. The parent reads this, and then passes through the remainder of child stderr to its own stderr.
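A minimal sketch of the parent side of that handshake, assuming the child's stderr has been wrapped in something that implements `BufRead` (for example a `BufReader` around a piped `ChildStderr`). The function name `wait_for_ready` is hypothetical and is not taken from `flow/js_worker.go` or from this PR.

```rust
use std::io::{self, BufRead};

/// Blocks until the first line of the child's stderr has been read, and
/// succeeds only if that line is exactly "READY\n". After this returns
/// Ok, the parent knows the child has finished its setup (for example,
/// installed its signal handler) and can safely signal it.
fn wait_for_ready(child_stderr: &mut impl BufRead) -> io::Result<()> {
    let mut first_line = String::new();
    child_stderr.read_line(&mut first_line)?;
    if first_line != "READY\n" {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("expected READY from child, got {:?}", first_line),
        ));
    }
    Ok(())
}
```

Once `wait_for_ready` returns, the remainder of the child's stderr can be passed through with `std::io::copy(&mut child_stderr, &mut std::io::stderr())`, typically on its own thread so the parent is not blocked.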
crates/connector_proxy/src/interceptors/airbyte_capture_interceptor.rs, line 44 at r6 (raw file):
Previously, jixij (Jixiang Jiang) wrote…
I just came across `tmpfs`, which stores files in host memory, as another way of storing secure information in docker. Shall we look more into that?
No. KISS. The connector-proxy also shouldn't assume anything about the filesystem it's running on (from its perspective, it's not even "aware" it's running within a container).
go/capture/driver/airbyte/driver.go, line 304 at r6 (raw file):
Previously, jixij (Jixiang Jiang) wrote…
The translation from airbyte messages to proto messages is done in the Rust connector. However, staging / combining the messages is done on the Go side. Is this OK? Or should the boundary between Rust and Go be shifted?
Let's not worry about it for now. connector-proxy should eventually do coalescing but that's an optimization we can leave for later.
This can be simplified to pass-through |pullResp| without inspecting it -- the runtime already has error cases over unexpected message types.
Sounds great. Thanks a lot @jgraettinger for the confirmations, it is clearer to me now.
ffef2ac
to
c49b3c0
Compare
crates/connector_proxy/src/apis.rs, line 49 at r6 (raw file): Previously, jgraettinger (Johnny Graettinger) wrote…
Yes, I've removed these traits and implemented three runner functions for the different types of connector. They construct their own interceptors (or req/resp stream adaptors), and the common logic shared by them is extracted into separate functions.
crates/connector_proxy/src/main.rs, line 140 at r6 (raw file): Previously, jgraettinger (Johnny Graettinger) wrote…
Got it. Thanks!
crates/connector_proxy/src/main.rs, line 194 at r6 (raw file): Previously, jgraettinger (Johnny Graettinger) wrote…
cool. this functionality was implemented in |
07e6cde
to
141f8e0
Compare
go/capture/driver/airbyte/driver.go
Outdated
//if resp == nil {
//	return nil // Connector flushed prior to exiting. All done.
//}

// Write a final commit, followed by EOF.
// This happens only when a connector writes output and exits _without_
This is something that I am not very sure about.
It looks to me like the Go logic appends an `EOF` to the stream if there are pending messages that have not been committed. This logic is missing in Rust; should it be brought back?
It should be in Rust, because there are airbyte connectors that write only records with no checkpoint at all.
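One way the Rust side could cover this case, sketched under assumptions: `Message`, `CommitOnEof`, and `commit_on_eof` are hypothetical names, and a synchronous iterator stands in for the connector's output stream. The adapter remembers whether any record has been emitted since the last checkpoint and, if so, appends one final checkpoint when the connector's output ends.

```rust
/// Hypothetical, simplified view of connector output.
#[derive(Debug, PartialEq)]
enum Message {
    Record(String),
    Checkpoint,
}

/// Wraps a message stream and appends a final checkpoint at EOF when
/// records were emitted after the last checkpoint.
struct CommitOnEof<I> {
    inner: I,
    pending: bool, // true when records follow the last checkpoint
    done: bool,    // true once the inner stream has ended
}

fn commit_on_eof<I: Iterator<Item = Message>>(inner: I) -> CommitOnEof<I> {
    CommitOnEof { inner, pending: false, done: false }
}

impl<I: Iterator<Item = Message>> Iterator for CommitOnEof<I> {
    type Item = Message;

    fn next(&mut self) -> Option<Message> {
        if self.done {
            return None;
        }
        match self.inner.next() {
            Some(Message::Record(data)) => {
                self.pending = true;
                Some(Message::Record(data))
            }
            Some(Message::Checkpoint) => {
                self.pending = false;
                Some(Message::Checkpoint)
            }
            None => {
                self.done = true;
                if self.pending {
                    // The connector exited without checkpointing its last records.
                    self.pending = false;
                    Some(Message::Checkpoint)
                } else {
                    None
                }
            }
        }
    }
}
```

A connector that already ends with a checkpoint passes through unchanged, so the adapter is only visible for record-only connectors.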
pub cursor_field: Option<Vec<String>>,
pub primary_key: Option<Vec<Vec<String>>>,

// TODO: might be broken if both 'projection' and its alias is present in the JSON data.
I see the pattern here: a JSON object could specify two fields, `field` and `estuary.dev/field`, and their values are merged.
With `serde`, `alias` is used here to specify the fields. However, in this implementation, deserialization will break if a JSON object has both fields present.
I am not very sure about the use cases of these fields. If there are use cases that need both fields to co-exist in the same object, I'll switch to a customized deserializer.
this is fine, they never co-exist. We're the only ones who use this and afaik it will always come in as estuary.dev/projections
stream_to_binding: Arc::new(Mutex::new(HashMap::new())),
tmp_dir: Builder::new()
    .prefix("airbyte-source-")
    .tempdir_in("/connector-tmpfs")
I spent some time on the file location. The struggle here is that it is difficult to find a path that the connector can write to.
- The `tmp` dir is mounted to the local machine, so it is not safe.
- The proxy is not running as the root user, so it does not have permission to create a dir in most places in the file system. The available locations could also be image-dependent and complex.
On the other hand, while researching, I found that `tmpfs` is pretty easy to set up, with just a few lines of code change. So I added this as a placeholder. If storing data as files in host machine memory does not work well for other considerations, especially in a k8s env, I'll look into this more.
How about `/var/tmp` for now?
Later we can also add docker inspect info to the *Request / Open messages, which would be the one remaining file passed in, I believe, and altogether remove the current `/tmp` binding we do to pass through files.
Yes, `/var/tmp` works fine. Thank you! We can plan the next steps after this. Removing the `/tmp` binding would be ideal.
57e5ef0
to
26b3e95
Compare
98f499e
to
681956f
Compare
9ccd79b
to
f302e28
Compare
crates/connector_proxy/Cargo.toml, line 23 at r10 (raw file): Previously, jgraettinger (Johnny Graettinger) wrote…
Yes, I was using The |
go/capture/driver/airbyte/driver.go, line 101 at r10 (raw file): Previously, jgraettinger (Johnny Graettinger) wrote…
Good point. I missed the differences between the FlowCapture and AirbyteSource.
f302e28
to
85915af
Compare
Thank you again Johnny for the review. I've made changes accordingly. PTAL! FYI, yes, the timeout issue in CI is real and, as always, it is most likely caused by some race conditions that cause the
As I read about inter-process communication, I felt a pipe-based communication mechanism would be more robust, and I have switched to that: the delayed process waits for a "READY" message from the parent process on stdin before it starts the connector process. It seems to be working fine for the CI test runs so far. If there are any other thoughts, please let me know.
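For readers following along, the child side of such a pipe-based gate might look roughly like the sketch below. It is not the code from this PR: `start_after_ready` is a hypothetical name, the control channel is any `BufRead` (stdin in the described design), and the connector launch is abstracted as a closure so the gating logic stays independent of how the process is spawned.

```rust
use std::io::{self, BufRead};

/// Runs `start` only after a "READY" line arrives on the control channel.
/// Unlike a signal, a line written to a pipe is buffered by the OS, so it
/// cannot be missed if the parent sends it before the child begins reading.
fn start_after_ready<T>(
    mut control: impl BufRead,
    start: impl FnOnce() -> T,
) -> io::Result<T> {
    let mut line = String::new();
    // read_line returns Ok(0) at EOF, meaning the parent closed the pipe.
    if control.read_line(&mut line)? == 0 {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "parent closed the pipe before sending READY",
        ));
    }
    if line.trim_end() != "READY" {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("unexpected control message: {:?}", line),
        ));
    }
    Ok(start())
}
```

In the delayed process this could be called with `std::io::stdin().lock()` as the control channel and a closure that launches the real connector.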
crates/connector_proxy/src/libs/airbyte_catalog.rs, line 48 at r9 (raw file): Previously, jgraettinger (Johnny Graettinger) wrote…
Great!
Will continue this work here: #425
Description:
This is the second half of the connector proxy work, which enables it to work with capture connectors that currently speak the airbyte source protocol. `airbyte_source_interceptor` was implemented to translate airbyte messages into the Flow capture protocol.
Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:
Once this change is merged, the network proxy configuration that is available to materialize connectors will, in theory, be available to all capture connectors.
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
(anything that might help someone review this PR)