Add SourceFile integration #716
Conversation
Just a random thought: I wonder if we could generate more than one connector from this single one. We could have more than one Dockerfile and spec file, and separate by source type (s3/azure/gcs...)
Nice!
I came into this thinking we'd likely need to split this into multiple integrations, but based on reading it, I think my original inclination was wrong. I think the key here is whether we can make the spec.json easy for a user to understand. That should dictate whether this integration gets split into multiple. Based on a first read, I think it's doable to keep it as one integration and just massage the spec.
The only thing I'm not certain about is how to make it easy for a user to get all of the file URLs right. Those are different depending on which cloud you are using, etc. I think if we just put it all in one and make sure the docs on GitBook are very clean at walking through pulling data from each place, that will be fine and preferable. What do you think? Let me know if you have any questions about my comments.
@@ -0,0 +1 @@
../../bases/base-python/airbyte_protocol
what is this file achieving?
This symlink is copied from the template:
https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connector-templates/python-source/airbyte_protocol
It is there so that when you open your IDE in the connector-templates/python-source root directory, references to the airbyte_protocol library are properly resolved, and highlighting, code navigation, etc. work nicely.
So it's mostly for DX purposes
},
"reader": {
"type": "string",
"description": "The reader from pandas library to use"
can we obscure pandas? the idea is we'd like a fairly non-technical person to be able to fill out this configuration. so asking them if they have csv versus json or something like that is okay, but i think exposing pandas might be going too technical. it also locks us into a specific implementation forever.
looks like this should be pretty easy to do though. you can just have an enum that maps from human readable names to pandas names. as long as we provide an enum with human readable names, i think we are okay.
"pickle" => "read_pickle"
"json" => "read_json"
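A minimal sketch of such a mapping, assuming hypothetical enum values and names (`READER_BY_FORMAT` and `resolve_reader_name` are not from the PR):

```python
# Hypothetical mapping from human-readable format names (shown to the user
# in spec.json) to the corresponding pandas reader function names.
READER_BY_FORMAT = {
    "csv": "read_csv",
    "json": "read_json",
    "html": "read_html",
    "excel": "read_excel",
    "pickle": "read_pickle",
}

def resolve_reader_name(format_name: str) -> str:
    """Translate a user-facing format name into a pandas reader name."""
    # source.py could then call getattr(pandas, reader_name)(file_url, **options)
    return READER_BY_FORMAT[format_name]
```

This keeps the spec free of pandas jargon while leaving the implementation free to swap libraries later.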
Yes, I've corrected it as you suggested.
hdfs:///path/file
hdfs://path/file
webhdfs://host:port/path/file
./local/path/file
curious if the local paths actually work. see some description around the complexity of the local fs stuff https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-csv/src/main/resources/spec.json#L11. basically we only mount one directory so we need to provide the user with some help to understand how they can pull info from a local filesystem.
It works in a Python script, but you are right, I haven't tried it through the docker images yet...
Would it be unacceptable to add extra mount arguments to the docker run command?
If you can mount with `-v "destination_path:destination_path"`, the path is accessible inside the docker container as it is on the host.
i think that doesn't work well because there are limitations on which dirs can be mounted. that's why we set up one local mount. I would suggest going this route: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-csv/src/main/resources/spec.json#L11, where we just tell the user: if you're using your local filesystem, it better start with /local and use the local mount. if you can find something better though, i'm down.
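A minimal sketch of the suggested guard, assuming the `/local` convention from the linked destination-csv spec (the function name is hypothetical):

```python
# Hypothetical validation for local paths: only files under the single
# /local directory mounted into the container are reachable from inside it.
def validate_local_path(path: str) -> str:
    if not path.startswith("/local"):
        raise ValueError(
            "Local files must live under /local, the one directory mounted into the container"
        )
    return path
```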
I can look at this as part of a new issue?
Out of curiosity, does it work with compressed files? (gzip, zip, bzip2...)
Exciting
"storage": {
"type": "string",
"enum": [
Shouldn't the enum be the name of the service?
HTTP
S3
GCS
...
Yes, I wasn't sure how we were going to display things in the UI; I can update it to more general names.
],
"default": "csv"
},
"reader_options": {
What is this?
You can add documentation for these properties
Yes, I'm missing a lot of documentation; I just had enough to ship the integration so far.
I will write up more details about it now.
"service_account_json": {
"type": "string"
},
"reader_impl": {
Why do we need that?
There are actually 2 ways of accessing both the GCS and AWS APIs with this connector:
- Using the `smart_open` library, which also opens the door to accessing multiple other back-ends. Unfortunately I've seen feedback from some users worrying about performance and reliability (maybe due to its wider scope and younger project?)
- Using the more specialized libraries `gcsfs` and `s3fs`, which are a bit harder to use but might be slightly faster at transferring files.

I haven't run any benchmarks yet of course, but since we are able to support both APIs, users would have the freedom to switch back and forth if needed and consider the impacts of doing so...
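A rough sketch of how the two access paths could be dispatched; the function name and the `reader_impl` values are assumptions, only the library names (`smart_open`, `s3fs`, `gcsfs`) come from the comment above:

```python
def open_remote(url: str, reader_impl: str = "smart_open"):
    """Open a remote file with whichever implementation the user picked."""
    if reader_impl == "smart_open":
        # Generic multi-backend opener (HTTP, S3, GCS, HDFS, ...)
        from smart_open import open as smart_open_file
        return smart_open_file(url, "rb")
    if reader_impl == "s3fs":
        # Specialized S3 filesystem library; a bit more setup, possibly faster
        import s3fs
        return s3fs.S3FileSystem(anon=True).open(url.removeprefix("s3://"), "rb")
    if reader_impl == "gcsfs":
        # Specialized GCS counterpart
        import gcsfs
        return gcsfs.GCSFileSystem(token="anon").open(url.removeprefix("gs://"), "rb")
    raise ValueError(f"unknown reader_impl: {reader_impl}")
```

Keeping the choice behind one function would make it easy to benchmark or drop an implementation later without touching the spec.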
{
"properties": {
"storage": {
"enum": ["scp://"]
In the future this one will very likely require a key/passphrase option as well
# integration tests but not the main package go in integration_tests. Deps required by both should go in
# install_requires.
"main": [],
"integration_tests": ["airbyte_python_test", "boto3", "pytest"],
should this be `airbyte_python_test` or `airbyte-python-test`? i thought the `-` was more idiomatic here.
In order to run integration tests in this connector, you need to:
1. Testing Google Cloud Service Storage
1. Download and store your Google [Service Account](https://console.cloud.google.com/iam-admin/serviceaccounts) JSON file in `secrets/gcs.json`, it should look something like this:
maybe i missed it, but how does gcs.json get imported into config.json?
It's used in: airbyte-integrations/connectors/source-file/integration_tests/integration_source_test.py
which are the "custom" integration tests, not the standard_tests.
The content of the JSON is copied into the configuration as a string.
In the UI, you would need to do the same and copy/paste the content of the JSON.
Then in the source.py of this connector, we can either manipulate the dict object directly once we parse that string, or produce a temporary file with the content of the JSON (depending on the Google API we are using).
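For instance, a small helper along these lines could turn the pasted string into a credentials file (the helper name is hypothetical; only the `service_account_json` key comes from the spec above):

```python
import json
import tempfile

def service_account_file(config: dict) -> str:
    """Parse the pasted service-account JSON string from the connector config
    and write it to a temporary file, for Google clients that need a path."""
    credentials = json.loads(config["service_account_json"])
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(credentials, f)
        return f.name
```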
@@ -0,0 +1,7 @@
{ |
Shouldn't this be in `STANDARD_SOURCE_DEFINITION` and not `STANDARD_SOURCE`?
I am not sure I understand your comment... I don't see folders named `STANDARD_SOURCE_DEFINITION` in `airbyte-config/init/src/main/resources/config/`? Where is the `STANDARD_SOURCE_DEFINITION`?
What
Starting from the idea of the CSV connector, I ended up leveraging the capabilities of the libraries I am using to be slightly more generic and handle a wider range of formats and locations.
There's actually a thread on singer's slack about this here:
https://singer-io.slack.com/archives/C2TGFCZEV/p1590702306463600
How
Using: