[FLINK-8801][yarn/s3] Fix jars downloading issues due to inconsistent timestamp in S3 Filesystem #8215

@yanyan300300 yanyan300300 commented Apr 18, 2019

What is the purpose of the change

This change reverts the original fix for FLINK-8801 because it is insufficient in scenarios where S3AFileSystem does not implement the setTimes method.
Instead, it retries until the remote file is available and then overwrites the local file timestamp.

Brief change log

  • Revert PR 5602
  • Add retry at FileNotFoundException
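
The retry described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class `RemoteTimestampRetrySketch`, the `ModificationTimeLookup` interface, and the parameters `maxAttempts`, `sleepMillis`, and `fallback` are all hypothetical names, and falling back to a caller-supplied timestamp (for example the local file's) is an assumption drawn from the review discussion.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

/** Hypothetical sketch; names and structure are assumptions, not the PR's code. */
class RemoteTimestampRetrySketch {

    /** Hypothetical stand-in for a remote lookup of a file's modification time. */
    interface ModificationTimeLookup {
        long fetch() throws IOException;
    }

    /**
     * Retries the lookup while it throws FileNotFoundException. Returns the
     * remote modification time, or the fallback if the file never became visible.
     */
    static long fetchWithRetry(ModificationTimeLookup lookup, int maxAttempts,
            long sleepMillis, long fallback) throws IOException, InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return lookup.fetch();
            } catch (FileNotFoundException e) {
                // The upload may not be visible yet; wait before the next attempt.
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated remote store: the file becomes visible on the third lookup.
        long timestamp = fetchWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new FileNotFoundException("not visible yet");
            }
            return 1555555555000L;
        }, 5, 10L, -1L);
        System.out.println(timestamp + " after " + calls[0] + " attempts");
    }
}
```

Only `FileNotFoundException` is retried in this sketch; any other `IOException` propagates immediately, so unrelated failures are not masked by the retry loop.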

Verifying this change

This change is already covered by existing tests, such as YarnFileStageTestS3ITCase for the upload path via S3.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: Yarn
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot
Collaborator

flinkbot commented Apr 18, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@klion26
Member

klion26 commented Apr 21, 2019

@yanyan300300 thanks for your contribution. Would you mind creating an issue for this and updating the title and description of this patch?

@yanyan300300 yanyan300300 changed the title Fix jars downloading issues due to inconsistent timestamp in S3 Filesystem [FLINK-8801][yarn/s3] Fix jars downloading issues due to inconsistent timestamp in S3 Filesystem Apr 22, 2019
Contributor

@NicoK NicoK left a comment

Indeed, it looks like this is the better approach. Aside from some minor details in the code, I would only have one concern which I'm not sure about:

  • Are we allowed to change the local resource's time? What are the implications?

I guess, this is the local resource file that was uploaded to the JM and there are no further implications but I'm not 100% sure.

flink-yarn/src/main/java/org/apache/flink/yarn/Utils.java (3 outdated review threads, resolved)

@yanyan300300
Author

yanyan300300 commented Apr 30, 2019

Indeed, it looks like this is the better approach. Aside from some minor details in the code, I would only have one concern which I'm not sure about:

  • Are we allowed to change the local resource's time? What are the implications?

I guess, this is the local resource file that was uploaded to the JM and there are no further implications but I'm not 100% sure.

Yes, Flink will need to set the local resource's time explicitly. This is also the case before your change.

@yanyan300300
Author

@tillrohrmann @aljoscha Could you kindly review this change? Thanks!

@yanyan300300
Author

@tillrohrmann @aljoscha @NicoK Could I get a review for this PR? Thanks!

@NicoK
Contributor

NicoK commented Jun 27, 2019

Indeed, it looks like this is the better approach. Aside from some minor details in the code, I would only have one concern which I'm not sure about:

  • Are we allowed to change the local resource's time? What are the implications?

I guess, this is the local resource file that was uploaded to the JM and there are no further implications but I'm not 100% sure.

Yes, Flink will need to set the local resource's time explicitly. This is also the case before your change.

Can you clarify a bit on this? Where was this done?

My problem is only that Utils#setupLocalResource is not only used for the JobManager uploading resources to remote paths but also by Utils#createTaskExecutorContext() to copy configuration files into the user's home directory. If we now change the timestamp of that config file, I don't know whether this would impact YARN or our own code regarding the integrity of that file.

Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for this change @yanyan300300. LGTM +1 for merging.

Concerning @NicoK's comment, I think we don't change the timestamp of the local config file; we set the timestamp of the LocalResource to make it consistent with the file we just uploaded to the user's home directory. I think this should be fine since we did the same before c90a757. Correct me if I understood your question incorrectly.


// now create the resource instance
LocalResource resource = registerLocalResource(dst, localFile.length(), dstModificationTime > 0 ? dstModificationTime
: localFile.lastModified());
Contributor

nit: actually, this line is quite long and then broken strangely - how about instead integrating that distinction in the variable itself, when it is created? Also, while we're at it, we could log that we are using the local file's timestamp instead:

		final long dstModificationTime;
		if (fss != null && fss.length >  0) {
			dstModificationTime = fss[0].getModificationTime();
			LOG.debug("Got modification time {} from remote path {}", dstModificationTime, dst);
		} else {
			dstModificationTime = localFile.lastModified();
			LOG.debug("Failed to fetch remote modification time from {}, using local timestamp {}", dst, dstModificationTime);
		}

Author

Thanks for the suggestion. When a variable can take different values, I personally prefer to assign a default and then override it for the other scenarios. This has two benefits: 1) fewer if/else branches are required, and 2) line coverage is higher. In this relatively simple case, yes, what you suggest also fits.

Contributor

Hmm, I'm not convinced by your arguments @yanyan300300. The ternary operator implicitly introduces if/else branches which need to be checked by test cases separately (as you would do it if you have explicit if/else branches). Moreover, the higher line coverage only results because one and the same line can result in two values.

In fact, I think these arguments are window dressing and even a bit dangerous because they only aim to improve some static test criteria which, if handled this way, give you a false impression that your program is correctly tested. I mean what does it help you if the line coverage says you've covered all lines but not the case where the ternary operator returns the third argument?
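
To make the coverage point concrete, here is a minimal sketch. The class `TimestampChoiceSketch` and its methods are hypothetical names for illustration, not code from this PR. It contrasts a ternary with the equivalent if/else: a single input executes every line of the ternary version while exercising only one of its two outcomes.

```java
/** Hypothetical illustration of line coverage versus branch coverage. */
class TimestampChoiceSketch {

    // Ternary form: one line, two possible outcomes.
    static long chooseTernary(long remote, long local) {
        return remote > 0 ? remote : local;
    }

    // Equivalent if/else form: each outcome sits on its own line.
    static long chooseIfElse(long remote, long local) {
        if (remote > 0) {
            return remote;
        } else {
            return local;
        }
    }

    public static void main(String[] args) {
        // This one call executes every line of chooseTernary,
        // yet the fallback outcome (local) is never produced.
        System.out.println(chooseTernary(100L, 5L));

        // The same input leaves the else line of chooseIfElse unexecuted,
        // so a line-coverage report would show the gap.
        System.out.println(chooseIfElse(100L, 5L));

        // In both forms a second input is needed to exercise the fallback.
        System.out.println(chooseTernary(0L, 5L));
        System.out.println(chooseIfElse(0L, 5L));
    }
}
```

Both forms need the same two test inputs to exercise both outcomes; the ternary only hides the missing case from a line-based report.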

Author

Sorry, I should have made it clearer. For 1), I meant that it requires one less level of indentation than an else branch, which I personally find easier to read. For 2), branch coverage is usually looked at along with line coverage to check how many conditions are tested.

For this case, sure I can update according to @NicoK 's suggestion. Could you confirm if that is also what you are suggesting?

Contributor

I think there is some value in the improved debug logging statements. Apart from that I don't have a strong opinion.

Author

Thanks. I updated it as @NicoK suggested.

@NicoK
Contributor

NicoK commented Jul 2, 2019

Thanks @tillrohrmann, yes, I must have mixed things up there - you are right about setting the timestamp of the LocalResource. I added one more small comment and after addressing this, I would also be fine with merging this.

@yanyan300300 yanyan300300 reopened this Jul 2, 2019
@flinkbot
Collaborator

flinkbot commented Jul 9, 2019

CI report for commit 11fc41c: SUCCESS Build

@yanyan300300
Author

@NicoK Could you approve this PR if you don't have further concerns?

@flinkbot
Collaborator

flinkbot commented Jul 14, 2019

CI report:

Contributor

@NicoK NicoK left a comment

thanks @yanyan300300
lgtm

@NicoK
Contributor

NicoK commented Jul 15, 2019

@yanyan300300 can you squash your commits so that we'll have one commit for reverting the old implementation and one with your changes? Please also tag your commit with "[FLINK-8801]" then.

I'll merge it afterwards.

@yanyan300300 yanyan300300 deleted the fix_s3_download_resource_changed branch July 15, 2019 18:29
@yanyan300300 yanyan300300 restored the fix_s3_download_resource_changed branch July 15, 2019 18:44
@yanyan300300 yanyan300300 reopened this Jul 15, 2019
@yanyan300300
Author

@NicoK Thanks for the approval!

If I squash in my feature branch, this PR will be closed due to altered commit history. Could you do it at your end? https://help.github.com/en/articles/about-pull-request-merges

@NicoK
Contributor

NicoK commented Jul 16, 2019

@yanyan300300 strange - I do this quite frequently in my PRs - squashing work-in-progress git history, then force-push to my branch and the PR gets updated. But I can also do this on my end.

@NicoK
Contributor

NicoK commented Jul 16, 2019

merged via 770a404

@NicoK NicoK closed this Jul 16, 2019
@yanyan300300
Author

@yanyan300300 strange - I do this quite frequently in my PRs - squashing work-in-progress git history, then force-push to my branch and the PR gets updated. But I can also do this on my end.

Thanks @NicoK. Will do this next time.
