[FLINK-20180][fs-connector][translation] Translate FileSink document into Chinese #14077
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community review your pull request.

Automated Checks: last check on commit 605e600 (Mon Nov 16 06:36:55 UTC 2020) ✅ no warnings. Mention the bot in a comment to re-run the automated checks.

Review Progress: please see the Pull Request Review Guide for a full explanation of the review process. The bot tracks review progress through labels, which are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands: the @flinkbot bot supports the following commands:
docs/dev/connectors/file_sink.zh.md
Outdated
- 桶目录中的实际输出数据会被划分为多个部分文件(part file),每一个接收桶数据的 Sink Subtask ,至少包含一个部分文件(part file)。额外的部分文件(part file)将根据滚动策略创建,滚动策略是可以配置的。默认的策略是根据文件大小和超时时间来滚动文件。超时时间指打开文件的最长持续时间,以及文件关闭前的最长非活动时间。
+ 桶目录中的实际输出数据会被划分为多个部分文件(part file),每一个接收桶数据的 Sink Subtask ,至少包含一个部分文件(part file)。额外的部分文件(part file)将根据滚动策略创建,滚动策略是可以配置的。对于行编码格式(参考 [File Formats](#file-formats) )默认的策略是根据文件大小和超时时间来滚动文件。超时时间指打开文件的最长持续时间,以及文件关闭前的最长非活动时间。对于批量编码格式我们需要在每次 Checkpoint 时切割文件,但是用户也可以指定额外的基于文件大小和超时时间的条件。
Maybe we could change "我们" to something more specific, for example: “批量编码格式的默认策略是每次在 checkpoint 时滚动文件。”
A slight difference is that the English version specifies "must roll on checkpoint", so I translated it as "必须切割文件"; the other part is changed as suggested.
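The rolling behavior this thread discusses can be sketched against Flink's `FileSink` API. A minimal sketch for a row-encoded sink, assuming the Flink 1.12-era method signatures; the output path and the size/time thresholds are illustrative assumptions, not the actual defaults:

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

public class RowFormatSinkSketch {

    public static FileSink<String> buildSink() {
        return FileSink
            // Row-encoded format: each record is serialized and appended individually.
            .forRowFormat(new Path("/tmp/output"), new SimpleStringEncoder<String>("UTF-8"))
            // Roll part files on size and on timeouts, as described in the paragraph under review.
            .withRollingPolicy(
                DefaultRollingPolicy.builder()
                    // roll once a part file reaches ~128 MiB (illustrative threshold)
                    .withMaxPartSize(128L * 1024 * 1024)
                    // maximum duration a part file may stay open
                    .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
                    // maximum inactivity before the file is closed
                    .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
                    .build())
            .build();
    }
}
```
The resulting sink would be attached with `stream.sinkTo(buildSink())` in a job that has checkpointing enabled.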
docs/dev/connectors/file_sink.zh.md
Outdated
- 桶目录中的实际输出数据会被划分为多个部分文件(part file),每一个接收桶数据的 Sink Subtask ,至少包含一个部分文件(part file)。额外的部分文件(part file)将根据滚动策略创建,滚动策略是可以配置的。默认的策略是根据文件大小和超时时间来滚动文件。超时时间指打开文件的最长持续时间,以及文件关闭前的最长非活动时间。
+ 桶目录中的实际输出数据会被划分为多个部分文件(part file),每一个接收桶数据的 Sink Subtask ,至少包含一个部分文件(part file)。额外的部分文件(part file)将根据滚动策略创建,滚动策略是可以配置的。对于行编码格式(参考 [File Formats](#file-formats) )默认的策略是根据文件大小和超时时间来滚动文件。超时时间指打开文件的最长持续时间,以及文件关闭前的最长非活动时间。对于批量编码格式我们需要在每次 Checkpoint 时切割文件,但是用户也可以指定额外的基于文件大小和超时时间的条件。
Regarding “时间的条件”: maybe we could change "条件" to "策略" to keep the terminology consistent.
Hi @gaoyunhaii, just to avoid doing redundant work, it may make sense to wait for the English docs to be finalized before merging this one. I hope this will be done soon: #14061.
docs/dev/connectors/file_sink.zh.md
Outdated
@@ -603,7 +604,7 @@ Flink 有两个内置的 BucketAssigners :

## 滚动策略
- 滚动策略 [RollingPolicy]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/RollingPolicy.html) 定义了指定的文件在何时关闭(closed)并将其变为 Pending 状态,随后变为 Finished 状态。处于 Pending 状态的文件会在下一次 Checkpoint 时变为 Finished 状态,通过设置 Checkpoint 间隔时间,可以控制部分文件(part file)对下游读取者可用的速度、大小和数量。
+ 在流模式下,滚动策略 [RollingPolicy]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/streaming/api/functions/sink/filesystem/RollingPolicy.html) 定义了指定的文件在何时关闭(closed)并将其变为 Pending 状态,随后变为 Finished 状态。处于 Pending 状态的文件会在下一次 Checkpoint 时变为 Finished 状态,通过设置 Checkpoint 间隔时间,可以控制部分文件(part file)对下游读取者可用的速度、大小和数量。在批模式下,临时文件只会在作业处理完所有输入数据后提交,此时滚动策略可以用来控制每个文件的大小。
Maybe we should not introduce the new concept "临时文件". We could change the sentence to the following:
在批模式下,所有文件只会在作业处理完所有输入数据后才变为 Finished 状态
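For bulk-encoded formats, only rolling policies that extend `CheckpointRollingPolicy` can be used, since these formats must roll on checkpoint. A minimal sketch, assuming the Flink 1.12-era API; the Parquet/Avro writer factory and output path are illustrative choices:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;

public class BulkFormatSinkSketch {

    public static FileSink<GenericRecord> buildSink(Schema schema) {
        return FileSink
            // Bulk-encoded format: records are buffered and written out in batches.
            .forBulkFormat(new Path("/tmp/output"),
                           ParquetAvroWriters.forGenericRecord(schema))
            // Bulk formats roll on every checkpoint; this policy makes that explicit.
            .withRollingPolicy(OnCheckpointRollingPolicy.build())
            .build();
    }
}
```
In streaming mode the part files become Finished at the next checkpoint; in batch mode, as the comment above suggests, all files become Finished only after the job has processed all input.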
docs/dev/connectors/file_sink.zh.md
Outdated
- Users who want to add user metadata to the ORC files can do so by calling `addUserMetadata(...)` inside the overriding `vectorize(...)` method.
+ 给 ORC 文件添加自定义元数据可以通过在覆盖的 `vectorize(...)` 方法中调用 `addUserMetadata(...)` 实现:
"覆盖" → "重载"
This should be 覆盖 (override), not 重载 (overload), right?
Changed it to 实现, same as below.
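The `addUserMetadata(...)` call under discussion can be sketched as follows. This assumes a hypothetical `Person` POJO with a matching ORC schema; the metadata key and value are illustrative:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.apache.flink.orc.vector.Vectorizer;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

public class MetadataVectorizer extends Vectorizer<Person> {

    public MetadataVectorizer(String schema) {
        super(schema);
    }

    @Override
    public void vectorize(Person element, VectorizedRowBatch batch) throws IOException {
        // ... fill the column vectors of `batch` from `element` here ...

        // Attach custom user metadata to the ORC file being written.
        // Key and value below are hypothetical examples.
        addUserMetadata("doc.lang",
                        ByteBuffer.wrap("zh".getBytes(StandardCharsets.UTF_8)));
    }
}
```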
docs/dev/connectors/file_sink.zh.md
Outdated
@@ -454,8 +449,7 @@ input.sinkTo(sink)
</div>
</div>

- OrcBulkWriterFactory can also take Hadoop `Configuration` and `Properties` so that a custom Hadoop configuration and ORC writer properties can be provided.
+ 用户还可以通过 Hadoop `Configuration` 和 `Properties` 来设置 OrcBulkWriterFactory 中涉及的 Hadoop 属性和 Writer 属性:
"Writer" → "ORC Writer"
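Passing a Hadoop `Configuration` and ORC writer `Properties` to `OrcBulkWriterFactory` can be sketched like this. The `PersonVectorizer` and the schema string are illustrative assumptions; `orc.compress` is a standard ORC writer property:

```java
import java.util.Properties;

import org.apache.flink.orc.writer.OrcBulkWriterFactory;
import org.apache.hadoop.conf.Configuration;

public class FactorySketch {

    public static OrcBulkWriterFactory<Person> create() {
        String schema = "struct<_col0:string,_col1:int>"; // assumed schema for Person

        // Custom Hadoop configuration (e.g. filesystem settings).
        Configuration conf = new Configuration();

        // ORC writer properties, e.g. the compression codec.
        Properties writerProps = new Properties();
        writerProps.setProperty("orc.compress", "LZ4");

        return new OrcBulkWriterFactory<>(
            new PersonVectorizer(schema), writerProps, conf);
    }
}
```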
docs/dev/connectors/file_sink.zh.md
Outdated
@@ -404,7 +399,7 @@ class PersonVectorizer(schema: String) extends Vectorizer[Person](schema) {
</div>
</div>

- To use the ORC bulk encoder in an application, users need to add the following dependency:
+ 为了在应用使用 ORC 批量编码,用户需要添加如下依赖:
"为了在应用" → "为了在应用中"
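The dependency referred to here is the `flink-orc` module. A sketch of the Maven coordinates, assuming the Scala 2.11 build and the Flink version current at the time of this PR (both are assumptions; adjust to your setup):

```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-orc_2.11</artifactId>
  <version>1.12.0</version>
</dependency>
```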
docs/dev/connectors/file_sink.zh.md
Outdated
- Like any other columnar format that encodes data in bulk fashion, Flink's `OrcBulkWriter` writes the input elements in batches. It uses ORC's `VectorizedRowBatch` to achieve this.
+ 和其它基于列式存储的批量编码格式类似,Flink中的 `OrcBulkWriter` 将数据按批写出,它通过 ORC 的 VectorizedRowBatch 来实现这一点。
"," → "。"
docs/dev/connectors/file_sink.zh.md
Outdated
- class and override the `vectorize(T element, VectorizedRowBatch batch)` method. As you can see, the method provides an instance of `VectorizedRowBatch` to be used directly by the users so users just have to write the logic to transform the input `element` to `ColumnVectors` and set them in the provided `VectorizedRowBatch` instance.
+ 由于输入数据必须先缓存为一个完整的 `VectorizedRowBatch` ,用户需要继承 `Vectorizer` 抽像类并且覆盖其中的 `vectorize(T element, VectorizedRowBatch batch)` 方法。方法参数中传入的 `VectorizedRowBatch` 使用户只需将输入 `element` 转化为 `ColumnVectors` 并将它存储到所提供的 `VectorizedRowBatch` 实例中。
"覆盖" → "实现"
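The `vectorize(...)` method being discussed can be sketched as below, assuming a hypothetical `Person` POJO with `getName()`/`getAge()` and an ORC schema such as `struct<_col0:string,_col1:int>`:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flink.orc.vector.Vectorizer;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

public class PersonVectorizer extends Vectorizer<Person> {

    public PersonVectorizer(String schema) {
        super(schema);
    }

    @Override
    public void vectorize(Person element, VectorizedRowBatch batch) throws IOException {
        BytesColumnVector nameColVector = (BytesColumnVector) batch.cols[0];
        LongColumnVector ageColVector = (LongColumnVector) batch.cols[1];
        // `batch.size` is the next free row slot; increment it for each element written.
        int row = batch.size++;
        // Transform the input element into ColumnVectors of the provided batch.
        nameColVector.setVal(row, element.getName().getBytes(StandardCharsets.UTF_8));
        ageColVector.vector[row] = element.getAge();
    }
}
```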
+1 for waiting until the English doc gets merged; then I will also update the translation to reflect the changes.
Many thanks @guoweiM for the review; I will update the PR~
@gaoyunhaii I merged the English version.
Force-pushed from 605e600 to ca8f49b, then from ca8f49b to 58a2d81.
Thanks @gaoyunhaii for resolving the comments. All parts LGTM except the one I commented on below.
docs/dev/connectors/file_sink.zh.md
Outdated
  File Sink 会将数据写入到桶中。由于输入流可能是无界的,因此每个桶中的数据被划分为多个有限大小的文件。如何分桶是可以配置的,默认使用基于时间的分桶策略,这种策略每个小时创建一个新的桶,桶中包含的文件将记录所有该小时内从流中接收到的数据。

- 桶目录中的实际输出数据会被划分为多个部分文件(part file),每一个接收桶数据的 Sink Subtask ,至少包含一个部分文件(part file)。额外的部分文件(part file)将根据滚动策略创建,滚动策略是可以配置的。对于行编码格式(参考 [File Formats](#file-formats) )默认的策略是根据文件大小和超时时间来滚动文件。超时时间指打开文件的最长持续时间,以及文件关闭前的最长非活动时间。对于批量编码格式我们需要在每次 Checkpoint 时切割文件,但是用户也可以指定额外的基于文件大小和超时时间的条件。
+ 桶目录中的实际输出数据会被划分为多个部分文件(part file),每一个接收桶数据的 Sink Subtask ,至少包含一个部分文件(part file)。额外的部分文件(part file)将根据滚动策略创建,滚动策略是可以配置的。对于行编码格式(参考 [File Formats](#file-formats) )默认的策略是根据文件大小和超时时间来滚动文件。超时时间指打开文件的最长持续时间,以及文件关闭前的最长非活动时间。批量编码格式必须在每次 Checkpoint 时切割文件,但是用户也可以指定额外的基于文件大小和超时时间的策略。
Why do we use "切割文件" rather than "滚动文件", which would be consistent with the other parts of this section?
Thanks for resolving the comments. Looks good to me. +1 for merging.
Many thanks @guoweiM!
What is the purpose of the change

Translating the `FileSink` document into Chinese.

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

- The public API, i.e., any class annotated with `@Public(Evolving)`: no

Documentation