
[FLINK-20180][fs-connector][translation] Translate FileSink document into Chinese #14077

Closed
wants to merge 5 commits

Conversation

gaoyunhaii
Contributor

What is the purpose of the change

Translating the FileSink document into Chinese.

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@gaoyunhaii gaoyunhaii changed the title Pr14061 add doc zh [FLINK-20141][fs-connector] Translate FileSink document into Chinese Nov 16, 2020
@flinkbot
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 605e600 (Mon Nov 16 06:36:55 UTC 2020)

✅ no warnings

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Nov 16, 2020

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build


- 桶目录中的实际输出数据会被划分为多个部分文件(part file),每一个接收桶数据的 Sink Subtask ,至少包含一个部分文件(part file)。额外的部分文件(part file)将根据滚动策略创建,滚动策略是可以配置的。默认的策略是根据文件大小和超时时间来滚动文件。超时时间指打开文件的最长持续时间,以及文件关闭前的最长非活动时间。
+ 桶目录中的实际输出数据会被划分为多个部分文件(part file),每一个接收桶数据的 Sink Subtask ,至少包含一个部分文件(part file)。额外的部分文件(part file)将根据滚动策略创建,滚动策略是可以配置的。对于行编码格式(参考 [File Formats](#file-formats) )默认的策略是根据文件大小和超时时间来滚动文件。超时时间指打开文件的最长持续时间,以及文件关闭前的最长非活动时间。对于批量编码格式我们需要在每次 Checkpoint 时切割文件,但是用户也可以指定额外的基于文件大小和超时时间的条件
Contributor

Maybe we could change "我们" to something more specific. For example: "批量编码格式的默认策略是每次在 checkpoint 时滚动文件。"

Contributor Author

A slight difference is that the English version specifies "must roll on checkpoint", hence it is translated as "必须切割文件"; the other part has been changed.


Contributor

@guoweiM guoweiM Nov 16, 2020

"时间的条件"

Maybe we could change "条件" to "策略" to keep the terminology consistent.
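For context, the row-encoded rolling behavior discussed above can be sketched in Java roughly as follows; the output path and the use of SimpleStringEncoder are illustrative assumptions, and the exact builder methods may differ between Flink versions:

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

import java.util.concurrent.TimeUnit;

// Row-encoded FileSink whose part files roll on size, maximum open duration, or inactivity.
final FileSink<String> sink = FileSink
    .forRowFormat(new Path("/base/output-path"), new SimpleStringEncoder<String>("UTF-8"))
    .withRollingPolicy(
        DefaultRollingPolicy.builder()
            .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))  // longest time a file may stay open
            .withInactivityInterval(TimeUnit.MINUTES.toMillis(5)) // longest idle time before closing
            .withMaxPartSize(1024 * 1024 * 1024)                  // roll once a part file reaches 1 GB
            .build())
    .build();
```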

@kl0u
Contributor

kl0u commented Nov 16, 2020

Hi @gaoyunhaii, just to avoid doing redundant work, it may make sense to wait for the English docs to be finalised before merging this one. I hope this will be done soon (#14061).

@@ -603,7 +604,7 @@ Flink 有两个内置的 BucketAssigners :

## 滚动策略

- 滚动策略 [RollingPolicy]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/RollingPolicy.html) 定义了指定的文件在何时关闭(closed)并将其变为 Pending 状态,随后变为 Finished 状态。处于 Pending 状态的文件会在下一次 Checkpoint 时变为 Finished 状态,通过设置 Checkpoint 间隔时间,可以控制部分文件(part file)对下游读取者可用的速度、大小和数量。
+ 在流模式下,滚动策略 [RollingPolicy]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/RollingPolicy.html) 定义了指定的文件在何时关闭(closed)并将其变为 Pending 状态,随后变为 Finished 状态。处于 Pending 状态的文件会在下一次 Checkpoint 时变为 Finished 状态,通过设置 Checkpoint 间隔时间,可以控制部分文件(part file)对下游读取者可用的速度、大小和数量。在批模式下,临时文件只会在作业处理完所有输入数据后提交,此时滚动策略可以用来控制每个文件的大小
Contributor

@guoweiM guoweiM Nov 16, 2020

Maybe we should not introduce the new concept "临时文件". So maybe we could change it to the following:
在批模式下,所有文件只会在作业处理完所有输入数据后才变为 Finished 状态
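For the bulk-encoded side of this discussion, a hedged sketch: bulk formats roll at least on every checkpoint, which corresponds to OnCheckpointRollingPolicy. The Person type and `parquetWriterFactory` below are hypothetical, pre-built assumptions.

```java
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;

// Bulk-encoded FileSink: part files are rolled on every checkpoint.
final FileSink<Person> bulkSink = FileSink
    .forBulkFormat(new Path("/base/output-path"), parquetWriterFactory) // hypothetical BulkWriter factory
    .withRollingPolicy(OnCheckpointRollingPolicy.build())
    .build();
```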


Users who want to add user metadata to the ORC files can do so by calling `addUserMetadata(...)` inside the overriding
`vectorize(...)` method.
给 ORC 文件添加自定义元数据可以通过在覆盖的 `vectorize(...)` 方法中调用 `addUserMetadata(...)` 实现:
Contributor

"覆盖" -> "重载"

Contributor Author

This should be 覆盖 (override) here, not 重载 (overload), right?

Contributor Author

Changed it to 实现, same as below.
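To ground the wording discussion above, a hedged sketch of the pattern the sentence describes, calling addUserMetadata(...) inside the overriding vectorize(...); the Person type, metadata key, and value are illustrative assumptions:

```java
import org.apache.flink.orc.vector.Vectorizer;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

import java.io.IOException;
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class PersonVectorizer extends Vectorizer<Person> implements Serializable {

  public PersonVectorizer(String schema) {
    super(schema);
  }

  @Override
  public void vectorize(Person element, VectorizedRowBatch batch) throws IOException {
    // Populate the column vectors for this element (omitted here; see the later sketch in this thread).

    // Attach custom user metadata to the ORC file currently being written.
    String metadataKey = "writer.origin";                                    // assumed key
    ByteBuffer metadataValue =
        ByteBuffer.wrap("flink-file-sink".getBytes(StandardCharsets.UTF_8)); // assumed value
    this.addUserMetadata(metadataKey, metadataValue);
  }
}
```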

@@ -454,8 +449,7 @@ input.sinkTo(sink)
</div>
</div>

OrcBulkWriterFactory can also take Hadoop `Configuration` and `Properties` so that a custom Hadoop configuration and ORC
writer properties can be provided.
用户还可以通过 Hadoop `Configuration` 和 `Properties` 来设置 OrcBulkWriterFactory 中涉及的 Hadoop 属性和 Writer 属性:
Contributor

Writer -> ORC Writer
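Regarding the sentence under review, a hedged Java sketch of passing a custom Hadoop Configuration and ORC writer Properties to OrcBulkWriterFactory; the schema string and the chosen property value are placeholders:

```java
import org.apache.flink.orc.writer.OrcBulkWriterFactory;
import org.apache.hadoop.conf.Configuration;

import java.util.Properties;

String schema = "struct<_col0:string,_col1:int>";    // placeholder ORC schema

Configuration hadoopConf = new Configuration();      // custom Hadoop configuration
Properties writerProperties = new Properties();      // ORC writer properties
writerProperties.setProperty("orc.compress", "LZ4");

// Both the Hadoop configuration and the ORC writer properties are handed to the factory.
final OrcBulkWriterFactory<Person> writerFactory =
    new OrcBulkWriterFactory<>(new PersonVectorizer(schema), writerProperties, hadoopConf);
```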

@@ -404,7 +399,7 @@ class PersonVectorizer(schema: String) extends Vectorizer[Person](schema) {
</div>
</div>

To use the ORC bulk encoder in an application, users need to add the following dependency:
为了在应用使用 ORC 批量编码,用户需要添加如下依赖:
Contributor

为了在应用


Like any other columnar format that encodes data in bulk fashion, Flink's `OrcBulkWriter` writes the input elements in batches. It uses
ORC's `VectorizedRowBatch` to achieve this.
和其它基于列式存储的批量编码格式类似,Flink中的 `OrcBulkWriter` 将数据按批写出,它通过 ORC 的 VectorizedRowBatch 来实现这一点。
Contributor

"," -> "。"

class and override the `vectorize(T element, VectorizedRowBatch batch)` method. As you can see, the method provides an
instance of `VectorizedRowBatch` to be used directly by the users so users just have to write the logic to transform the
input `element` to `ColumnVectors` and set them in the provided `VectorizedRowBatch` instance.
由于输入数据必须先缓存为一个完整的 `VectorizedRowBatch` ,用户需要继承 `Vectorizer` 抽像类并且覆盖其中的 `vectorize(T element, VectorizedRowBatch batch)` 方法。方法参数中传入的 `VectorizedRowBatch` 使用户只需将输入 `element` 转化为 `ColumnVectors` 并将它存储到所提供的 `VectorizedRowBatch` 实例中。
Contributor

覆盖 -> 实现
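To make the vectorize(...) contract described above concrete, a hedged sketch of a Vectorizer that converts a hypothetical Person(name, age) element into ColumnVectors of the provided VectorizedRowBatch:

```java
import org.apache.flink.orc.vector.Vectorizer;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

import java.io.IOException;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class PersonVectorizer extends Vectorizer<Person> implements Serializable {

  public PersonVectorizer(String schema) {
    super(schema);
  }

  @Override
  public void vectorize(Person element, VectorizedRowBatch batch) throws IOException {
    BytesColumnVector nameColVector = (BytesColumnVector) batch.cols[0];
    LongColumnVector ageColVector = (LongColumnVector) batch.cols[1];

    // Write this element into the next free row of the buffered batch.
    int row = batch.size++;
    nameColVector.setVal(row, element.getName().getBytes(StandardCharsets.UTF_8));
    ageColVector.vector[row] = element.getAge();
  }
}
```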

@gaoyunhaii
Contributor Author

Hi @gaoyunhaii, just to avoid doing redundant work, it may make sense to wait for the English docs to be finalised before merging this one. I hope this will be done soon (#14061).

+1 for waiting till the English doc gets merged; then I would also modify the translation to reflect the changes.

@gaoyunhaii
Contributor Author

Many thanks @guoweiM for the review, I will update the PR~

@kl0u
Contributor

kl0u commented Nov 16, 2020

@gaoyunhaii I merged the English version.

@gaoyunhaii gaoyunhaii changed the title [FLINK-20141][fs-connector] Translate FileSink document into Chinese [FLINK-20180][fs-connector][translation] Translate FileSink document into Chinese Nov 17, 2020
@gaoyunhaii
Contributor Author

@guoweiM @kl0u Many thanks, I have rebased and updated the PR, and modified it according to the latest English version and the review comments.

Contributor

@guoweiM guoweiM left a comment

Thanks @gaoyunhaii for resolving the comments. All parts LGTM except the one I commented on below.


File Sink 会将数据写入到桶中。由于输入流可能是无界的,因此每个桶中的数据被划分为多个有限大小的文件。如何分桶是可以配置的,默认使用基于时间的分桶策略,这种策略每个小时创建一个新的桶,桶中包含的文件将记录所有该小时内从流中接收到的数据。

- 桶目录中的实际输出数据会被划分为多个部分文件(part file),每一个接收桶数据的 Sink Subtask ,至少包含一个部分文件(part file)。额外的部分文件(part file)将根据滚动策略创建,滚动策略是可以配置的。对于行编码格式(参考 [File Formats](#file-formats) )默认的策略是根据文件大小和超时时间来滚动文件。超时时间指打开文件的最长持续时间,以及文件关闭前的最长非活动时间。对于批量编码格式我们需要在每次 Checkpoint 时切割文件,但是用户也可以指定额外的基于文件大小和超时时间的条件
+ 桶目录中的实际输出数据会被划分为多个部分文件(part file),每一个接收桶数据的 Sink Subtask ,至少包含一个部分文件(part file)。额外的部分文件(part file)将根据滚动策略创建,滚动策略是可以配置的。对于行编码格式(参考 [File Formats](#file-formats) )默认的策略是根据文件大小和超时时间来滚动文件。超时时间指打开文件的最长持续时间,以及文件关闭前的最长非活动时间。批量编码格式必须在每次 Checkpoint 时切割文件,但是用户也可以指定额外的基于文件大小和超时时间的策略
Contributor

Why do we use "切割文件" rather than "滚动文件", which would be consistent with the rest of this section?
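As for the bucketing paragraph quoted at the top of this hunk (one bucket per hour by default), a hedged sketch of selecting the equivalent time-based assigner explicitly; the path, encoder, and date-time pattern are illustrative assumptions:

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

// Hourly, time-based bucketing: bucket directories are named like "2020-11-16--06".
final FileSink<String> sink = FileSink
    .forRowFormat(new Path("/base/output-path"), new SimpleStringEncoder<String>("UTF-8"))
    .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd--HH"))
    .build();
```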

Contributor

@guoweiM guoweiM left a comment

Thanks for resolving the comments. Looks good to me. +1 for merging.

@gaoyunhaii
Contributor Author

Many thanks @guoweiM!
