
Add SourceFile integration #716

Merged: 25 commits merged into master from chris/source-pandas on Nov 2, 2020
Conversation

ChristopheDuong
Contributor

What

Starting from the idea of a CSV connector, I ended up leveraging the capabilities of the libraries I am using to be slightly more generic and handle a wider range of formats and locations.

There's actually a thread on Singer's Slack about this here:
https://singer-io.slack.com/archives/C2TGFCZEV/p1590702306463600

How

Using:
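A hedged sketch of the general idea, assuming the smart_open and pandas libraries that come up later in this thread; the function and format names below are illustrative, not the connector's actual code:

```python
import pandas as pd
from smart_open import open as smart_open_file  # opens local, HTTP, S3, GCS, ... URLs


def read_file(url: str, file_format: str) -> pd.DataFrame:
    # smart_open resolves the location (s3://, https://, local path, ...);
    # pandas parses the bytes into a DataFrame according to the chosen format.
    readers = {"csv": pd.read_csv, "json": pd.read_json}
    with smart_open_file(url, "rb") as f:
        return readers[file_format](f)
```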

@michel-tricot
Contributor

Just a random thought: I wonder if we could generate more than one connector from this single one.

We could have more than one Dockerfile and spec file, and separate them by source type (s3/azure/gcs...).

Contributor

@cgardens left a comment

Nice!

I came into this thinking we'd likely need to split this into multiple integrations, but after reading it I think my original inclination was wrong. The key here is whether we can make the spec.json easy for a user to understand; that should dictate whether this integration gets split into multiple ones. Based on a first read, I think it's doable to keep it as one integration and just massage the spec.

The only thing I'm not certain about is how to make it easy for a user to get all of the file URLs right, since those differ depending on which cloud you are using. I think if we keep it all in one and make sure the docs on GitBook walk cleanly through pulling data from each place, that will be fine and preferable. What do you think? Let me know if you have any questions about my comments.

@@ -0,0 +1 @@
../../bases/base-python/airbyte_protocol
Contributor

what is this file achieving?

Contributor Author

This symlink is copied from the template:
https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connector-templates/python-source/airbyte_protocol

It is there so that when you open your IDE in one of the connector-templates/python-source root directories, references to the airbyte_protocol library are properly resolved, and highlighting, code navigation, etc. work nicely.

So it's mostly for DX purposes.

},
"reader": {
"type": "string",
"description": "The reader from pandas library to use"
Contributor

Can we obscure pandas? The idea is that we'd like a fairly non-technical person to be able to fill out this configuration. Asking them whether they have CSV versus JSON or something like that is okay, but I think exposing pandas might be going too technical. It also locks us into a specific implementation forever.

Contributor

Looks like this should be pretty easy to do, though: you can just have an enum that maps from human readable names to pandas names. As long as we provide an enum with human readable names, I think we are okay.

"pickle" => "read_pickle"
"json" => "read_json"

Contributor Author

Yes, I've corrected it as you suggested.

hdfs:///path/file
hdfs://path/file
webhdfs://host:port/path/file
./local/path/file
Contributor

Curious whether the local paths actually work. See some description of the complexity around the local filesystem here: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-csv/src/main/resources/spec.json#L11. Basically we only mount one directory, so we need to give the user some help to understand how they can pull info from a local filesystem.

Contributor Author

It works in a Python script, but you are right, I haven't tried it through the Docker images yet...

Would it be unacceptable to add extra mount arguments to the docker run command? If we can mount -v "destination_path:destination_path", the path would be accessible inside the Docker container just as it is on the host.

Contributor

I think that doesn't work well on the backend because there are limitations on which dirs can be mounted; that's why we set one local mount. I would suggest going this route: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-csv/src/main/resources/spec.json#L11, where we just tell the user: if you're using your local filesystem, it had better start with /local and use the local mount. If you can find something better though, I'm down.

Contributor Author

I can look at this as part of a new issue?

@michel-tricot
Contributor

Out of curiosity, does it work with compressed files? (gzip, zip, bzip2... )

@ChristopheDuong
Contributor Author

> Out of curiosity, does it work with compressed files? (gzip, zip, bzip2...)

smart_open seems to support it, but I haven't been able to test it yet...

Supported Compression Formats

smart_open allows reading and writing gzip and bzip2 files. They are transparently handled over HTTP, S3, and other protocols, too, based on the extension of the file being opened. You can easily add support for other file extensions and compression formats. 
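A hedged example of that extension-based decompression, per the smart_open docs quoted above; the bucket and key here are placeholders:

```python
import pandas as pd
from smart_open import open as smart_open_file

# smart_open infers gzip from the .gz extension and decompresses on the fly,
# whether the file lives on S3, HTTP, or the local filesystem.
with smart_open_file("s3://my-bucket/data.csv.gz", "rb") as f:
    df = pd.read_csv(f)
```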

@ChristopheDuong ChristopheDuong marked this pull request as ready for review October 30, 2020 22:05
Contributor

@michel-tricot left a comment

Exciting


"storage": {
"type": "string",
"enum": [
Contributor

Shouldn't the enum be the name of the service?

HTTP
S3
GCS
...

Contributor Author

Yes, I wasn't sure how we were going to display things in the UI. I can update it to more general names.

],
"default": "csv"
},
"reader_options": {
Contributor

What is this?

You can add documentation for these properties

Contributor Author

Yes, I'm missing a lot of documentation; I just had enough to ship the integration so far.
I will write up more detailed documentation about it now.
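For illustration, a hedged guess at how such an option could flow through to the reader, assuming reader_options is a JSON string of keyword arguments forwarded to the pandas reader; this is a sketch, not the documented behavior:

```python
import json

import pandas as pd


def read_with_options(fp, reader_options: str = "") -> pd.DataFrame:
    # e.g. reader_options = '{"sep": ";", "header": 0}'
    kwargs = json.loads(reader_options) if reader_options else {}
    return pd.read_csv(fp, **kwargs)
```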

"service_account_json": {
"type": "string"
},
"reader_impl": {
Contributor

Why do we need that?

Contributor Author

There are actually two ways of accessing both the GCS and AWS APIs with this connector:

  1. Using the smart_open library, which also opens the door to accessing multiple back-ends. Unfortunately, I've seen feedback from some users worrying about performance and reliability (maybe due to its wider scope and younger age as a project?).
  2. Using the more specialized libraries gcsfs and s3fs, which are a bit harder to use but might be slightly faster at transferring files.

I haven't run any benchmarks yet, of course, but since we are able to support both APIs, users would have the freedom to switch back and forth if needed and weigh the impact of doing so...
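A rough sketch of what that reader_impl switch could look like for S3, assuming both smart_open and s3fs are installed; the function and option names here are mine, not the connector's:

```python
def open_s3_file(bucket_and_key: str, reader_impl: str):
    if reader_impl == "smart_open":
        # Generic, multi-backend library.
        from smart_open import open as smart_open_file
        return smart_open_file(f"s3://{bucket_and_key}", "rb")
    else:
        # More specialized S3 filesystem client.
        import s3fs
        fs = s3fs.S3FileSystem(anon=True)  # public bucket; pass credentials otherwise
        return fs.open(bucket_and_key, "rb")
```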

{
"properties": {
"storage": {
"enum": ["scp://"]
Contributor

In the future this one will very likely require a key/passphrase option as well

# integration tests but not the main package go in integration_tests. Deps required by both should go in
# install_requires.
"main": [],
"integration_tests": ["airbyte_python_test", "boto3", "pytest"],
Contributor

Should this be airbyte_python_test or airbyte-python-test? I thought the - was more idiomatic here.


In order to run integration tests in this connector, you need to:
1. Testing Google Cloud Storage:
   1. Download and store your Google [Service Account](https://console.cloud.google.com/iam-admin/serviceaccounts) JSON file in `secrets/gcs.json`; it should look something like this:
Contributor

Maybe I missed it, but how does gcs.json get imported into config.json?

Contributor Author

It's used in airbyte-integrations/connectors/source-file/integration_tests/integration_source_test.py, which contains the "custom" integration tests, not the standard_tests.

The content of the JSON is copied into the configuration as a string.
In the UI, you would need to do the same and copy/paste the content of the JSON.

Then, in the source.py of this connector, we either manipulate the dict object directly once we parse that string, or we have to produce a temporary file with the content of the JSON (depending on which Google API we are using).
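A hedged sketch of those two consumption paths for the pasted service account JSON string; the config key matches the spec fragment above, the rest is illustrative:

```python
import json
import tempfile


def gcs_credentials(config: dict):
    creds_str = config["service_account_json"]
    # Path 1: parse the string into a dict for APIs that accept token info directly.
    token_info = json.loads(creds_str)
    # Path 2: write a temporary file for APIs that expect a credentials file path.
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False)
    tmp.write(creds_str)
    tmp.close()
    return token_info, tmp.name
```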

@ChristopheDuong ChristopheDuong merged commit daf58b2 into master Nov 2, 2020
@@ -0,0 +1,7 @@
{
Contributor

Shouldn't this be in STANDARD_SOURCE_DEFINITION and not STANDARD_SOURCE?

Contributor Author

I am not sure I understand your comment...
I don't see folders named STANDARD_SOURCE_DEFINITION in airbyte-config/init/src/main/resources/config/?

Where is the STANDARD_SOURCE_DEFINITION?

@ChristopheDuong ChristopheDuong deleted the chris/source-pandas branch November 9, 2020 09:16