[BEAM-9722] added SnowflakeIO with Read operation #11360
Conversation
sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java (outdated; review thread resolved)
Force-pushed 6365c6e to 70518c2
R: @aromanenko-dev
@DariuszAniszewski Thank you for the contribution! I'm a bit busy right now, but I'll try to take a look at this ASAP.
@aromanenko-dev I don't want to push you, but do you have any ETA for the review? I'm OK with waiting, but it would be nice to roughly know how long it may take ;) I also have a generic question: as mentioned in the description of this very PR, it's the first of a series. We have more pieces of SnowflakeIO done, but we wanted to make the PRs as small and atomic as possible to ease the review process and merge them one by one. But maybe this approach misfired and it's better to put all pieces of the IO (in atomic commits, of course) into this one PR? WDYT?
@DariuszAniszewski I'm sorry for the delay with the review (quite a busy time at work). I'll try to do a first round by the end of this week. Regarding the second part of your message: AFAIK there is no common rule for PR size in Beam, but personally I'd prefer the approach you proposed and moving forward in logical atomic steps, since I believe that will be easier for everyone. Also, I'd suggest creating separate Jiras in this case.
@iemejia Would you have some time to take a look as well? |
Thank you for the contribution again! I think it will be a useful and in-demand IO for Beam users.
I did a first round of review (only the main business logic, not the tests yet); please take a look at my comments.
Also, please fix the commit message: it has to start with the Jira ID, as does the PR name.
sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java (outdated; review thread resolved)
sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java (review thread resolved)
sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java (outdated; review thread resolved)
```java
.apply(Wait.on(output))
    .apply(
        ParDo.of(
            new CleanTmpFilesFromGcsFn(
```
Will the temp directory be cleaned up if the pipeline failed earlier?
It seems that it doesn't at the moment. @kkucharc will share more info here
Yes, we checked and it doesn't. @aromanenko-dev do you think it should be handled? For testing, the tests themselves should probably take care of cleanup.
I'm afraid that many failed pipelines could waste a lot of disk space in this case. It would be better to avoid such behavior, if possible.
While I generally agree that failing pipelines will eventually leave lots of garbage on GCS (or other storage in the future), I'm not sure how to ensure those files are deleted regardless of pipeline status. AFAIK there is nothing like @After or @Teardown for the PTransform we have in our IO. Using the Wait transform gives us the ability to remove the files once the data is read, and seems a reasonable choice.
I've been checking how e.g. BigQueryIO handles this case, as it also needs to clean up, and it also has a cleanup transform that is called once all rows are read. I assume that in their case cleanup also won't run if something in between fails.
How about filing a JIRA issue for this case and looking for a solution in parallel with delivering the other pieces of the Snowflake connector? @aromanenko-dev @kkucharc WDYT?
This is a known issue with the BQ source as well: failed pipelines can leave temporary files behind. I'm afraid there is no good solution today. I think we need to introduce some sort of generalized cleanup step to address this.
In this case, I'd suggest adding this to the IO class Javadoc to make users aware that such a situation is possible and that it will require a manual procedure to clean up the temp dirs.
I've added a note here. Would that be enough?
sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeServiceImpl.java (outdated; review thread resolved)
...va/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/credentials/package-info.java (review thread resolved)
sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/package-info.java (review thread resolved)
.../io/snowflake/src/test/java/org/apache/beam/sdk/io/snowflake/test/FakeSnowflakeDatabase.java (review thread resolved)
sdks/java/io/snowflake/src/test/java/org/apache/beam/sdk/io/snowflake/test/package-info.java (review thread resolved)
```java
/** Interface which defines common methods for cloud providers. */
public interface SnowflakeCloudProvider {
  void removeFiles(String bucketName, String pathOnBucket);
```
Why not use Apache Beam's notion of a FileSystem?
Thanks for the suggestion @lukecwik. I changed it to GCSFileSystem.
I also tried to change removeFiles in the Fake implementation to use LocalFileSystem, but I am a little concerned: LocalFileSystem doesn't match nested directories and fails when deleting a non-empty directory. That could mean the testing directory won't be cleaned and the tests will become flaky.
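A common way around the non-empty-directory limitation mentioned above is to delete the tree depth-first, so children are removed before their parents. The following is a minimal, JDK-only sketch of that idea (independent of Beam's FileSystem API; the class and method names are invented for the example, not taken from the connector):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class RecursiveDelete {

    // Walks the tree and deletes in reverse (deepest-first) order,
    // so every directory is already empty when its turn comes.
    public static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return; // nothing to clean up; stay idempotent
        }
        try (Stream<Path> paths = Files.walk(root)) {
            paths.sorted(Comparator.reverseOrder())
                 .forEach(p -> {
                     try {
                         Files.delete(p);
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        }
    }
}
```

With something like this, a fake/local implementation can mirror the recursive semantics of GCS object deletion and keep test directories clean between runs.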
```java
public class GCSProvider implements SnowflakeCloudProvider, Serializable {

  @Override
  public void removeFiles(String bucketName, String pathOnBucket) {
```
Why not use the Apache Beam GCS filesystem?
Force-pushed 70518c2 to c981338
Force-pushed c981338 to 189d45d
Thanks @aromanenko-dev and @lukecwik for the review.
@aromanenko-dev @lukecwik thanks again for the review. We've applied the changes and all your comments/questions have been addressed: either changed or answered. Could you please re-review? :)
Thanks. Just one comment.
```java
emptyCollection
    .apply(
        ParDo.of(
            new CopyIntoStageFn(
```
It's better, but I don't think it's a good user experience to fail the pipeline whenever a bundle is retried. Bundle retries are a normal part of runner execution, and source/sink transforms should be able to handle them.
What is usually done in this case is to write data to a new temporary location each time the copy step runs. Then you add a Reshuffle right after the copy operation to checkpoint the results. This way you can guarantee that the set of files output to the next step comes from a single execution, and the job does not have to fail simply because a bundle was retried.
When you clean up, you have to clean all temporary locations, including those from retries.
I'm fine if you don't want to handle this in the first pass, but let's at least add a TODO and a JIRA to fix it later.
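The retry-safe staging pattern described above (each attempt writes to a fresh temporary location, so retried bundles never collide, and cleanup later sweeps all of them) can be sketched with plain JDK file APIs. This is an illustrative sketch only, not the connector's actual code; the names `StagingDirs` and `newAttemptDir` are invented for the example:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

public class StagingDirs {

    // Every (re)execution of the copy step gets its own unique directory,
    // so a retried bundle never collides with files left behind by a
    // previous, failed attempt of the same work.
    public static Path newAttemptDir(Path base) throws IOException {
        Path dir = base.resolve("attempt-" + UUID.randomUUID());
        return Files.createDirectories(dir);
    }
}
```

Downstream, only the files from the attempt that actually won (checkpointed via the Reshuffle) are passed on; the cleanup step then deletes every `attempt-*` directory under `base`, which covers the retries as well.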
Force-pushed ac38c8b to 3f7f3bd
LGTM. Thanks.
Retest this please
Retest this please
I can "squash and merge" after tests pass.
Retest this please
Thanks!
Retest this please
Retest this please
Having trouble re-triggering tests.
Retest this please
Run Python PreCommit
Run Python PreCommit
Run Python2_PVR_Flink PreCommit
Run Python2_PVR_Flink PreCommit
Run Python2_PVR_Flink PreCommit
Run Python2_PVR_Flink PreCommit
Just a small note about the force-push above: it was done by mistake and then reverted. The HEAD of this branch is still 3ba192a, and the comment is a leftover.
I retested the failing test; the previous run was probably timing out.
Seems like this was not included in the Beam 2.22.0 cut, so I'll remove its entry from CHANGES.md for Beam 2.22.0.
* [BEAM-9722] added SnowflakeIO with Read operation
* [BEAM-9722] Added SnowflakeCloudProvider to enable use of various clouds with Snowflake
* [BEAM-9722] added docstrings for public methods
* [BEAM-9722] changed cleanup of staged GCS files to Beam FileSystems
* [BEAM-9722] Added javadocs for public methods in DataSourceConfiguration
* add testing p8 file to RAT exclude; refactor SnowflakeCredentials; add information about files possibly left on cloud storage; small docs changes
* documentation changes
* [BEAM-9722] Added TestRule and changed unit tests to use pipeline.run
* [BEAM-9722] Renamed Snowflake Read unit test and applied spotless
* [BEAM-9722] remove SnowflakeCloudProvider interface
* [BEAM-9722] doc changes
* [BEAM-9722] add `withoutValidation` to disable verifying the connection to Snowflake during pipeline construction
* [BEAM-9722] added MoveOption and removed leftover file
* [BEAM-9722] fixed tests; add tests for `withQuery`
* [BEAM-9722] make `CopyIntoStageFn` retryable
* [BEAM-9722] added `Reshuffle` step after `CopyIntoStageFn`

Co-authored-by: Kasia Kucharczyk <katarzyna.kucharczyk@polidea.com>
Co-authored-by: pawel.urbanowicz <pawel.urbanowicz@polidea.com>
This PR adds SnowflakeIO with a Read operation, as part of BEAM-9722.

Snowflake is an analytic data warehouse provided as Software-as-a-Service (SaaS). It uses a new SQL database engine with a unique architecture designed for the cloud. For more details, please check here and here.

SnowflakeIO.Read uses Snowflake's JDBC driver to run a COPY INTO statement that moves data to GCS as CSV files, which are then read via FileIO.

SnowflakeIO allows three authentication methods against Snowflake:

This PR is the first of a series; once merged, subsequent PRs will follow, with Write, integration tests, etc. We're also working on cross-language support for the Python SDK.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

- Choose reviewer(s) and mention them in a comment (R: @username).
- Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
- Update CHANGES.md with noteworthy changes.

See the Contributor Guide for more tips on how to make the review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for the trigger phrase, status, and link of each Jenkins job.