
[BEAM-9722] added SnowflakeIO with Read operation #11360

Merged: 16 commits into apache:master on May 21, 2020

Conversation

@DariuszAniszewski (Contributor) commented Apr 9, 2020

This PR adds SnowflakeIO with a Read operation as part of BEAM-9722.

Snowflake is an analytic data warehouse provided as Software-as-a-Service (SaaS). It uses a new SQL database engine with a unique architecture designed for the cloud. More details are available in Snowflake's documentation.

SnowflakeIO.Read uses Snowflake's JDBC driver to run a COPY INTO statement that stages data on GCS as CSV files, which are then read via FileIO.

SnowflakeIO supports three authentication methods against Snowflake:

  • username and password
  • key-pair
  • pre-obtained OAuth token

This PR is the first of a series; once it is merged, subsequent PRs will follow with the Write operation, integration tests, etc. We're also working on cross-language support for the Python SDK. A usage sketch follows.
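
A hedged usage sketch of reading with this IO. The builder names (DataSourceConfiguration, fromTable, withStagingBucketName, withStorageIntegrationName, withCsvMapper) follow the SnowflakeIO javadoc that this work led to and may differ in detail from this first PR; the credentials setup is likewise an assumption (this PR carries its own SnowflakeCredentials classes).

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.snowflake.SnowflakeIO;
    import org.apache.beam.sdk.values.PCollection;

    Pipeline pipeline = Pipeline.create();

    // Connection settings; one of username/password, key-pair, or OAuth auth.
    SnowflakeIO.DataSourceConfiguration dc =
        SnowflakeIO.DataSourceConfiguration.create()
            .withUsernamePasswordAuth("USERNAME", "PASSWORD")
            .withServerName("account.snowflakecomputing.com")
            .withDatabase("DATABASE")
            .withSchema("PUBLIC")
            .withWarehouse("WAREHOUSE");

    // COPY INTO stages CSV files under the staging bucket; FileIO then reads
    // them, and the CsvMapper turns each parsed CSV line into a user type.
    PCollection<String> rows =
        pipeline.apply(
            SnowflakeIO.<String>read()
                .withDataSourceConfiguration(dc)
                .fromTable("MY_TABLE") // or .fromQuery("SELECT ...")
                .withStagingBucketName("gs://my-bucket/tmp") // hypothetical bucket
                .withStorageIntegrationName("my_integration") // hypothetical name
                .withCsvMapper(parts -> String.join(",", parts))
                .withCoder(StringUtf8Coder.of()));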




@DariuszAniszewski DariuszAniszewski changed the title added SnowflakeIO with Read operation [WIP] added SnowflakeIO with Read operation Apr 9, 2020
@DariuszAniszewski DariuszAniszewski marked this pull request as draft April 9, 2020 09:51
@DariuszAniszewski DariuszAniszewski changed the title [WIP] added SnowflakeIO with Read operation [WIP][BEAM-9722] added SnowflakeIO with Read operation Apr 9, 2020
@DariuszAniszewski DariuszAniszewski changed the title [WIP][BEAM-9722] added SnowflakeIO with Read operation [BEAM-9722] added SnowflakeIO with Read operation Apr 10, 2020
@DariuszAniszewski DariuszAniszewski marked this pull request as ready for review April 10, 2020 14:45
@DariuszAniszewski (Contributor, Author) commented:

R: @aromanenko-dev
On the dev list you mentioned you were interested in reviewing PRs related to SnowflakeIO; could you please take a look, or point to another reviewer? Thanks!

@aromanenko-dev (Contributor) commented:

@DariuszAniszewski Thank you for the contribution! I'm a bit busy right now, but I'll try to take a look at this as soon as possible.

@DariuszAniszewski (Contributor, Author) commented:

@aromanenko-dev I don't want to push you, but do you have an ETA for the review? I'm OK with waiting, but it would be nice to know roughly how long it may take ;)

I also have a general question: as mentioned in the description of this PR, it's the first of a series. We have more pieces of SnowflakeIO done, but we wanted to make the PRs as small and atomic as possible to ease the review process and merge them one by one. But maybe this approach misfired, and it would be better to put all the pieces of the IO (in atomic commits, of course) into this one PR?

WDYT?

@aromanenko-dev (Contributor) commented:

@DariuszAniszewski I'm sorry for the delay with the review (quite a busy time at work). I'll try to do a first round by the end of this week.

Regarding the second part of your message: AFAIK there is no common rule for PR size in Beam, but personally I'd prefer the approach you proposed and move forward in logical, atomic steps, since I believe that will be easier for everyone. Also, I'd suggest creating separate Jiras in that case.

@aromanenko-dev (Contributor) commented:

@iemejia Would you have some time to take a look as well?

@aromanenko-dev (Contributor) left a review comment:

Thank you for the contribution again! I think this will be a useful and in-demand IO for Beam users.

I did a first round of review (only the main business logic, not the tests yet); please take a look at my comments.
Also, please fix the commit message: it has to start with the Jira ID, as the PR title does.

Review thread on:
    .apply(Wait.on(output))
        .apply(
            ParDo.of(
                new CleanTmpFilesFromGcsFn(
Contributor:

Will the temp directory be cleaned if the pipeline failed earlier?

Contributor Author:

It seems that it isn't at the moment. @kkucharc will share more info here.

Contributor:

Yes, we checked and it isn't. @aromanenko-dev, do you think it should be provided? In the case of testing, the tests themselves should probably take care of cleanup.

Contributor:

I'm afraid that many failed pipelines could lead to a waste of disk space in this case. It would be better to avoid such behavior, if possible.

Contributor Author:

While I generally agree that failing pipelines will eventually leave lots of garbage on GCS (or other storage in the future), I'm not sure how to ensure those files are deleted regardless of pipeline status.

AFAIK there is nothing like @After or @Teardown for the PTransform we have in our IO. Using the Wait transform gives us the ability to remove files once the data is read, and it seems a reasonable choice (see the sketch after this comment).

I've been checking how e.g. BigQueryIO handles that case, as it also needs to clean up; it also has a cleanup transform that is called once all rows are read. I assume that in its case, cleanup also won't run if something in between fails.

How about filing a JIRA issue for this case and looking for a solution in parallel with delivering the other pieces of the Snowflake connector? @aromanenko-dev @kkucharc WDYT?
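
A minimal sketch of the Wait.on cleanup pattern discussed here, assuming the staged file paths travel through the pipeline as a PCollection of strings (the helper name and wiring are illustrative, not this PR's actual code):

    import java.util.Collections;
    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Wait;
    import org.apache.beam.sdk.values.PCollection;

    /** Hypothetical helper: deletes tmp files only after mainOutput is computed. */
    static PCollection<Void> cleanupAfter(
        PCollection<String> tmpFilePaths, PCollection<?> mainOutput) {
      return tmpFilePaths
          // Wait.on holds each window of tmpFilePaths back until the matching
          // window of mainOutput is done, so files are removed only after all
          // rows have been read from them.
          .apply(Wait.on(mainOutput))
          .apply(ParDo.of(new DoFn<String, Void>() {
            @ProcessElement
            public void process(@Element String path) throws Exception {
              FileSystems.delete(Collections.singletonList(
                  FileSystems.matchSingleFileSpec(path).resourceId()));
            }
          }));
    }

Note that if a step upstream of mainOutput fails, the cleanup ParDo never runs, which is exactly the gap discussed in this thread.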

Contributor:

This is a known issue with BQ source as well. Failed pipelines can leave temporary files behind. I'm afraid there is no good solution today. I think we need to introduce some sort of a generalized cleanup step to address this.

Contributor:

In this case, I'd suggest adding this to the IO class's Javadoc to make users aware that such a situation is possible and that cleaning up the temp dirs will require a manual procedure.

Contributor Author:

I've added a note here; would that be enough?


Review thread on:
    /** Interface which defines common methods for cloud providers. */
    public interface SnowflakeCloudProvider {
      void removeFiles(String bucketName, String pathOnBucket);
Member:

Why not use Apache Beam's notion of a FileSystem?

Contributor:

Thanks for the suggestion, @lukecwik. I changed it to GCSFileSystem.

I also tried changing removeFiles in the fake implementation to use LocalFileSystem, but I am a little concerned: LocalFileSystem doesn't match nested directories and fails when deleting a non-empty directory. That could mean the testing directory won't be cleaned and the tests will become flaky.
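
For reference, a hedged sketch of deleting staged files through Beam's FileSystems abstraction; the glob pattern and helper name are assumptions, and, as noted above, matching semantics differ per FileSystem implementation:

    import java.io.IOException;
    import java.util.List;
    import java.util.stream.Collectors;
    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.io.fs.MatchResult;
    import org.apache.beam.sdk.io.fs.MoveOptions;
    import org.apache.beam.sdk.io.fs.ResourceId;

    // Hypothetical helper: matches staged CSVs under a temp prefix and deletes
    // them via whichever FileSystem the spec's scheme (gs://, file:/, ...) selects.
    static void removeTmpFiles(String tmpDirSpec) throws IOException {
      MatchResult match = FileSystems.match(tmpDirSpec + "/*.csv");
      List<ResourceId> staged =
          match.metadata().stream()
              .map(MatchResult.Metadata::resourceId)
              .collect(Collectors.toList());
      FileSystems.delete(staged,
          MoveOptions.StandardMoveOptions.IGNORE_MISSING_FILES);
    }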

Review thread on:
    public class GCSProvider implements SnowflakeCloudProvider, Serializable {

      @Override
      public void removeFiles(String bucketName, String pathOnBucket) {
Member:

Why not use the Apache Beam GCS filesystem?

@DariuszAniszewski (Contributor, Author) commented:

Thanks @aromanenko-dev and @lukecwik for the review.
I rebased onto (current) master and already applied the simplest changes; I'll be back in a few days with the rest of them.

@DariuszAniszewski (Contributor, Author) commented:

I'm dealing with some personal matters this week and will be unavailable.
@kkucharc and @purbanow will continue working on this PR.

@DariuszAniszewski (Contributor, Author) commented:

@aromanenko-dev @lukecwik thanks again for the review. We've applied the changes, and all your comments/questions have been addressed: either changed or answered.

Can you please re-review? :)

@chamikaramj (Contributor) left a review comment:

Thanks. Just one comment.

Review thread on:
    emptyCollection
        .apply(
            ParDo.of(
                new CopyIntoStageFn(
Contributor:

It's better, but I don't think it's a good user experience to fail the pipeline whenever a bundle is retried. Bundle retries are a normal part of runner execution, and source/sink transforms should be able to handle them.

What is usually done in this case is to write data to a new temporary location each time the copy step runs. You then add a Reshuffle right after the copy operation to checkpoint the results. This way you can guarantee that the set of files handed to the next step comes from a single execution, and the job does not have to fail simply because a bundle is retried.

When you clean up, you have to clean all temporary locations, including those from retries.

I'm fine if you don't want to handle this on the first try, but let's at least add a TODO and a JIRA to fix this later. A sketch of the pattern follows.
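
A hedged sketch of this checkpoint pattern; Reshuffle.viaRandomKey() is Beam's own, while the copy step and temp-prefix scheme here are hypothetical stand-ins for the PR's CopyIntoStageFn:

    import java.util.UUID;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Reshuffle;
    import org.apache.beam.sdk.values.PCollection;

    // Each (possibly retried) bundle copies into its own unique temp prefix.
    PCollection<String> filePatterns =
        emptyCollection // the single-element trigger collection from the snippet above
            .apply("Copy into unique stage dir", ParDo.of(new DoFn<Void, String>() {
              @ProcessElement
              public void process(ProcessContext c) {
                String tmpPrefix = "gs://my-bucket/tmp/run-" + UUID.randomUUID();
                // run COPY INTO <tmpPrefix> via JDBC here (details omitted)
                c.output(tmpPrefix + "/*");
              }
            }))
            // Reshuffle checkpoints the emitted patterns: downstream steps see
            // the output of exactly one successful copy attempt, so a retried
            // bundle cannot leak a second set of files into the read.
            .apply(Reshuffle.viaRandomKey());
    // Cleanup must later remove all run-* prefixes, including abandoned retries.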

@chamikaramj (Contributor) commented:

LGTM. Thanks.

@chamikaramj (Contributor) commented:

Retest this please

@chamikaramj (Contributor) commented:

Retest this please

@chamikaramj (Contributor) commented:

I can "squash and merge" after tests pass.
Feel free to squash/fixup commits if you need more than one commit.

@chamikaramj (Contributor) commented:

Retest this please

@DariuszAniszewski (Contributor, Author) commented:

Thanks!
I'm OK with squash-and-merge; it was intended to go in as a single commit.

@chamikaramj (Contributor) commented:

Retest this please


@chamikaramj (Contributor) commented:

Having trouble re-triggering tests. Can someone else try?

@chamikaramj (Contributor) commented:

Retest this please

@chamikaramj (Contributor) commented:

Run Python PreCommit


@chamikaramj (Contributor) commented:

Run Python2_PVR_Flink PreCommit


@kkucharc (Contributor) commented:

Run Python2_PVR_Flink PreCommit

@DariuszAniszewski (Contributor, Author) commented:

Just a small comment about the force-push above: it was done by mistake, then reverted. The HEAD of this branch is still at 3ba192a, and the comment is a leftover.

@kkucharc (Contributor) commented:

I retested the failing test; the previous run was probably timing out.

@chamikaramj merged commit 73fa135 into apache:master on May 21, 2020.
@chamikaramj (Contributor) commented:

It seems this was not included in the Beam 2.22.0 cut, so I'll remove it from the CHANGES.md entry for Beam 2.22.0.

yirutang pushed a commit to yirutang/beam that referenced this pull request Jul 23, 2020
* [BEAM-9722] added SnowflakeIO with Read operation

* [BEAM-9722] Added SnowflakeCloudProvider to enable use various clouds with Snowflake

* [BEAM-9722] added docstrings for public methods

* [BEAM-9722] Added changed cleanup staged GCS files to Beam FileSystems

* [BEAM-9722] Added javadocs for public methods in DataSourceConfiguration

* add testing p8 file to RAT exclude
refactor SnowflakeCredentials
add information about possibly left files on cloud storage
small docs changes

* documentation changes

* [BEAM-9722] Added TestRule and changed Unit tests to use pipeline.run

* [BEAM-9722] Renamed Snowflake Read unit test and applied spotless

* [BEAM-9722] remove SnowflakeCloudProvider interface

* [BEAM-9722] doc changes

* [BEAM-9722] add `withoutValidation` to disable verifying connection to Snowflake during pipeline construction

* [BEAM-9722] added MoveOption and removed leftover file

* [BEAM-9722] fixed tests. Add tests for `withQuery`

* [BEAM-9722] make `CopyIntoStageFn` retryable

* [BEAM-9722] added `Reshuffle` step after `CopyIntoStageFn`

Co-authored-by: Kasia Kucharczyk <katarzyna.kucharczyk@polidea.com>
Co-authored-by: pawel.urbanowicz <pawel.urbanowicz@polidea.com>