
Add SourceFile integration #716

Merged: 25 commits merged into master from chris/source-pandas on Nov 2, 2020
Conversation

ChristopheDuong
Contributor

What

Starting from the idea of a CSV connector, I ended up leveraging the capabilities of the libraries I am using to be slightly more generic and handle a wider range of formats and locations.

There's actually a thread on Singer's Slack about this here:
https://singer-io.slack.com/archives/C2TGFCZEV/p1590702306463600

How

Using:
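A hedged sketch of the general idea, assuming the smart_open and pandas libraries that come up later in this thread; the function and format names below are illustrative, not the connector's actual code:

```python
import pandas as pd
from smart_open import open as smart_open_file  # opens local, HTTP, S3, GCS, ... URLs


def read_file(url: str, file_format: str) -> pd.DataFrame:
    # smart_open resolves the location (s3://, https://, local path, ...);
    # pandas parses the bytes into a DataFrame according to the chosen format.
    readers = {"csv": pd.read_csv, "json": pd.read_json}
    with smart_open_file(url, "rb") as f:
        return readers[file_format](f)
```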

@michel-tricot
Contributor

Just a random thought: I wonder if we could generate more than one connector from this single one.

We could have more than one Dockerfile and spec file, and separate them by source type (s3/azure/gcs...).

Contributor

@cgardens left a comment

Nice!

I came into this thinking we'd likely need to split this into multiple integrations, but after reading it I think my original inclination was wrong. The key here is whether we can make the spec.json easy for a user to understand; that should dictate whether this integration gets split into multiple ones. Based on a first read, I think it's doable to keep it as one integration and just massage the spec.

The only thing I'm not certain about is how to make it easy for a user to get all of the file URLs right, since those differ depending on which cloud you are using. I think if we keep it all in one and make sure the docs on GitBook walk cleanly through pulling data from each place, that will be fine and preferable. What do you think? Let me know if you have any questions about my comments.

@@ -0,0 +1 @@
../../bases/base-python/airbyte_protocol
Contributor

what is this file achieving?

Contributor Author

This symlink is copied from the template:
https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connector-templates/python-source/airbyte_protocol

It is there so that when you open your IDE in one of the connector-templates/python-source root directories, references to the airbyte_protocol library are properly resolved, and highlighting, code navigation, etc. work nicely.

So it's mostly for DX purposes.

},
"reader": {
"type": "string",
"description": "The reader from pandas library to use"
Contributor

Can we obscure pandas? The idea is that we'd like a fairly non-technical person to be able to fill out this configuration. Asking them whether they have CSV versus JSON or something like that is okay, but I think exposing pandas might be going too technical. It also locks us into a specific implementation forever.

Contributor

Looks like this should be pretty easy to do, though: you can just have an enum that maps from human readable names to pandas names. As long as we provide an enum with human readable names, I think we are okay.

"pickle" => "read_pickle"
"json" => "read_json"

Contributor Author

Yes, I've corrected it as you suggested.

hdfs:///path/file
hdfs://path/file
webhdfs://host:port/path/file
./local/path/file
Contributor

Curious whether the local paths actually work. See some description of the complexity around the local filesystem here: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-csv/src/main/resources/spec.json#L11. Basically we only mount one directory, so we need to give the user some help to understand how they can pull info from a local filesystem.

Contributor Author

It works in a Python script, but you are right, I haven't tried it through the Docker images yet...

Would it be unacceptable to add extra mount arguments to the docker run command? If we can mount -v "destination_path:destination_path", the path would be accessible inside the Docker container just as it is on the host.

Contributor

I think that doesn't work well on the backend because there are limitations on which dirs can be mounted; that's why we set one local mount. I would suggest going this route: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-csv/src/main/resources/spec.json#L11, where we just tell the user: if you're using your local filesystem, it had better start with /local and use the local mount. If you can find something better though, I'm down.

Contributor Author

I can look at this as part of a new issue?

@michel-tricot
Contributor

Out of curiosity, does it work with compressed files? (gzip, zip, bzip2... )

@ChristopheDuong
Contributor Author

> Out of curiosity, does it work with compressed files? (gzip, zip, bzip2...)

smart_open seems to support it, but I haven't been able to test it yet...

Supported Compression Formats

smart_open allows reading and writing gzip and bzip2 files. They are transparently handled over HTTP, S3, and other protocols, too, based on the extension of the file being opened. You can easily add support for other file extensions and compression formats. 
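A hedged example of that extension-based decompression, per the smart_open docs quoted above; the bucket and key here are placeholders:

```python
import pandas as pd
from smart_open import open as smart_open_file

# smart_open infers gzip from the .gz extension and decompresses on the fly,
# whether the file lives on S3, HTTP, or the local filesystem.
with smart_open_file("s3://my-bucket/data.csv.gz", "rb") as f:
    df = pd.read_csv(f)
```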

@ChristopheDuong ChristopheDuong marked this pull request as ready for review October 30, 2020 22:05
Contributor

@michel-tricot left a comment

Exciting


"storage": {
"type": "string",
"enum": [
Contributor

Shouldn't the enum be the name of the service?

HTTP
S3
GCS
...

Contributor Author

Yes, I wasn't sure how we were going to display things in the UI. I can update it to more general names.

],
"default": "csv"
},
"reader_options": {
Contributor

What is this?

You can add documentation for these properties

Contributor Author

Yes, I'm missing a lot of documentation; I just had enough to ship the integration so far.
I will write up more detailed documentation about it now.
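For illustration, a hedged guess at how such an option could flow through to the reader, assuming reader_options is a JSON string of keyword arguments forwarded to the pandas reader; this is a sketch, not the documented behavior:

```python
import json

import pandas as pd


def read_with_options(fp, reader_options: str = "") -> pd.DataFrame:
    # e.g. reader_options = '{"sep": ";", "header": 0}'
    kwargs = json.loads(reader_options) if reader_options else {}
    return pd.read_csv(fp, **kwargs)
```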

"service_account_json": {
"type": "string"
},
"reader_impl": {
Contributor

Why do we need that?

Contributor Author

There are actually two ways of accessing both the GCS and AWS APIs with this connector:

  1. Using the smart_open library, which also opens the door to accessing multiple back-ends. Unfortunately, I've seen feedback from some users worrying about performance and reliability (maybe due to its wider scope and younger age as a project?).
  2. Using the more specialized libraries gcsfs and s3fs, which are a bit harder to use but might be slightly faster at transferring files.

I haven't run any benchmarks yet, of course, but since we are able to support both APIs, users would have the freedom to switch back and forth if needed and weigh the impact of doing so...
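A rough sketch of what that reader_impl switch could look like for S3, assuming both smart_open and s3fs are installed; the function and option names here are mine, not the connector's:

```python
def open_s3_file(bucket_and_key: str, reader_impl: str):
    if reader_impl == "smart_open":
        # Generic, multi-backend library.
        from smart_open import open as smart_open_file
        return smart_open_file(f"s3://{bucket_and_key}", "rb")
    else:
        # More specialized S3 filesystem client.
        import s3fs
        fs = s3fs.S3FileSystem(anon=True)  # public bucket; pass credentials otherwise
        return fs.open(bucket_and_key, "rb")
```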

{
"properties": {
"storage": {
"enum": ["scp://"]
Contributor

In the future this one will very likely require a key/passphrase option as well

# integration tests but not the main package go in integration_tests. Deps required by both should go in
# install_requires.
"main": [],
"integration_tests": ["airbyte_python_test", "boto3", "pytest"],
Contributor

Should this be airbyte_python_test or airbyte-python-test? I thought the - was more idiomatic here.


In order to run integration tests in this connector, you need to:
1. Testing Google Cloud Storage:
   1. Download and store your Google [Service Account](https://console.cloud.google.com/iam-admin/serviceaccounts) JSON file in `secrets/gcs.json`; it should look something like this:
Contributor

Maybe I missed it, but how does gcs.json get imported into config.json?

Contributor Author

It's used in airbyte-integrations/connectors/source-file/integration_tests/integration_source_test.py, which contains the "custom" integration tests, not the standard_tests.

The content of the JSON is copied into the configuration as a string.
In the UI, you would need to do the same and copy/paste the content of the JSON.

Then, in the source.py of this connector, we either manipulate the dict object directly once we parse that string, or we have to produce a temporary file with the content of the JSON (depending on which Google API we are using).
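A hedged sketch of those two consumption paths for the pasted service account JSON string; the config key matches the spec fragment above, the rest is illustrative:

```python
import json
import tempfile


def gcs_credentials(config: dict):
    creds_str = config["service_account_json"]
    # Path 1: parse the string into a dict for APIs that accept token info directly.
    token_info = json.loads(creds_str)
    # Path 2: write a temporary file for APIs that expect a credentials file path.
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False)
    tmp.write(creds_str)
    tmp.close()
    return token_info, tmp.name
```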

@ChristopheDuong ChristopheDuong merged commit daf58b2 into master Nov 2, 2020
@@ -0,0 +1,7 @@
{
Contributor

Shouldn't this be in STANDARD_SOURCE_DEFINITION and not STANDARD_SOURCE?

Contributor Author

I am not sure I understand your comment...
I don't see folders named STANDARD_SOURCE_DEFINITION in airbyte-config/init/src/main/resources/config/?

Where is the STANDARD_SOURCE_DEFINITION?

@ChristopheDuong ChristopheDuong deleted the chris/source-pandas branch November 9, 2020 09:16