[FLINK-12378][docs] Consolidate FileSystem Documentation #8326

Closed
wants to merge 5 commits into from

sjwiesman
Contributor

What is the purpose of the change

Currently, Flink's filesystem documentation is spread across a number of pages without any clear connection. A non-exhaustive list of issues includes:

  • S3 documentation is spread across many pages
  • The OSS filesystem is listed under deployments when it is an object store
  • deployments/filesystem.md has a lot of unrelated information

We should create a filesystem subsection under deployments with multiple pages containing all relevant information about Flink's filesystem abstraction.

This PR also resolves FLINK-8513 and FLINK-10249, which were minor additions to the S3 documentation.

Verifying this change

N/A

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)
    This does not touch S3 file system code but does touch the documentation.

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@sjwiesman changed the title from "Flink 12378" to "[Flink-12378][docs] Consolidate FileSystem Documentation" on Apr 30, 2019
@sjwiesman changed the title from "[Flink-12378][docs] Consolidate FileSystem Documentation" to "[FLINK-12378][docs] Consolidate FileSystem Documentation" on Apr 30, 2019
@flinkbot
Collaborator

flinkbot commented Apr 30, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❗ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

  • `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
  • `@flinkbot approve all` to approve all aspects
  • `@flinkbot approve-until architecture` to approve everything until `architecture`
  • `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
  • `@flinkbot disapprove architecture` to remove an approval you gave earlier

@sjwiesman
Contributor Author

@flinkbot attention @fhueske

Contributor

@StephanEwen StephanEwen left a comment

Big +1 for reorganizing these docs.

Looks good in general. Here are some suggestions to polish it off:

  • The spelling is sometimes "filesystem" and sometimes "file system". Would be nice to consolidate all occurrences to one of these spellings.
  • For the S3 docs, it might be worth pointing out more prominently that the Hadoop S3 FS supports the streaming file sink, but the Presto one does not.
  • It may also be worth pointing out that for checkpoints, we typically recommend the Presto FS.


{% panel **Note:** You don't have to configure this manually if you are running [Flink on EMR](#emr-elastic-mapreduce). %}

This setup is a bit more complex and we recommend using our shaded Hadoop/Presto file systems
instead (see above) unless required otherwise, e.g. for using S3 as YARN's resource storage dir
Apache Flink provides native [S3 FileSystem's](../filesystems/s3.html) out of the box and we recomend using them unless required otherwise, e.g. for using S3 as YARN's resource storage dir
Contributor

recomend --> recommend

Contributor

"native" --> "built-in" ?

Contributor

I am also wondering if we should simply drop the section below, about Hadoop's S3 file systems.
We can mention that Flink also supports Hadoop's file systems and refer to Hadoop docs for details.

Contributor Author

I think that's reasonable. I believe most users are using the built-in S3 filesystems at this point.

under the License.
-->

Apache Flink uses to consume and persistently store data, both for results of applications and for fault tolerance and recovery.
Contributor

Apache Flink uses file systems ?


Note that these examples are *not* exhaustive and you can use S3 in other places as well, including your [high availability setup](../jobmanager_high_availability.html) or the [RocksDBStateBackend]({{ site.baseurl }}/ops/state/state_backends.html#the-rocksdbstatebackend); everywhere that Flink expects a FileSystem URI.

For most use cases, you may use one of our shaded `flink-s3-fs-hadoop` and `flink-s3-fs-presto` S3filesystem wrappers which are self-contained and easy to set up.
Contributor

S3 filesystem

You can use S3 objects like regular files by specifying paths in the following format:

{% highlight plain %}
s3://<your-bucket>/<endpoint>
Contributor

--> ?

@sjwiesman
Contributor Author

Thanks for the review. I consolidated on "file systems", removed the Hadoop references from the AWS and OSS pages, and replaced them with an extra section about configuring Hadoop in filesystems/index.md.

@asfgit asfgit closed this in 4c0bbc4 May 10, 2019
4 participants