Pull request overview
This PR updates the parameter descriptions in the HugeGraph Loader documentation for both English and Chinese versions. The changes expand the parameter table to include many additional configuration options that were previously undocumented.
Changes:
- Removed the `-gs` short form for the `--graphspace` parameter
- Updated the `--help` parameter to show `-help` as an alternative
- Added 25 new parameters covering features like parallel processing, CDC integration, HBase configuration, and graph creation options
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| content/en/docs/quickstart/toolchain/hugegraph-loader.md | Expanded parameter documentation table with 25 new parameters and formatting adjustments |
| content/cn/docs/quickstart/toolchain/hugegraph-loader.md | Chinese translation of the same parameter documentation updates |
| | `--print-progress` | true | | 是否在控制台实时打印导入条数 | | ||
| | `--dry-run` | false | | 打开该模式,只解析不导入,通常用于测试 | | ||
| | `--help` 或 `-help` | false | | 打印帮助信息 | | ||
| | `--parallel-count` 或 `--parallel-count` | max(2,CPUS) | | 并行读取数据文件最大线程数 | |
The parameter name is duplicated. The entry shows `--parallel-count` 或 `--parallel-count`, where both alternatives are identical. It should either list two distinct names (like `--parallel-count` or `-pc`) or a single name without the 或 ("or") connector.
| | `--parallel-count` 或 `--parallel-count` | max(2,CPUS) | | 并行读取数据文件最大线程数 | | |
| | `--parallel-count` | max(2,CPUS) | | 并行读取数据文件最大线程数 | |
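This kind of duplicate can be caught mechanically before review. The following is a minimal sketch, not part of the PR: the helper name `duplicated_flags` and the rule that the first non-empty cell holds the parameter names are assumptions made for illustration.

```python
import re

# Matches flag names wrapped in backticks, e.g. `--parallel-count` or `-g`.
FLAG_PATTERN = re.compile(r"`(-{1,2}[A-Za-z][\w-]*)`")


def duplicated_flags(row: str) -> list[str]:
    """Return flags listed more than once in the name cell of a table row.

    Assumption: the first non-empty cell of the row holds the parameter names.
    """
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    name_cell = next((cell for cell in cells if cell), "")
    seen, repeated = set(), []
    for flag in FLAG_PATTERN.findall(name_cell):
        if flag in seen and flag not in repeated:
            repeated.append(flag)
        seen.add(flag)
    return repeated


# Rows copied from the diff under review.
bad_row = "| `--parallel-count` 或 `--parallel-count` | max(2,CPUS) | | 并行读取数据文件最大线程数 |"
good_row = "| `-f` 或 `--file` | | Y | 配置脚本的路径 |"

print(duplicated_flags(bad_row))   # ['--parallel-count']
print(duplicated_flags(good_row))  # []
```

The check is independent of the connector word, so it applies to both the English ("or") and Chinese ("或") tables.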
| | `--backend` | hstore | | The backend store type when creating graph if not exists | | ||
| | `--serializer` | binary | | The serializer type when creating graph if not exists | | ||
| | `--scheduler-type` | distributed | | The task scheduler type when creating graph if not exists | | ||
| | `--batch-failure-fallback` | true | | Whether to fallback to single insert when batch insert fails |##### 3.4.2 Breakpoint Continuation Mode |
Line 861 contains text that appears to be a duplicate of the following section header. The line ends with "Whether to fallback to single insert when batch insert fails" followed immediately by "##### 3.4.2 Breakpoint Continuation Mode" on the same line. The section header should be on line 862 by itself, and line 861 should end properly with just the parameter description.
| | `--batch-failure-fallback` | true | | Whether to fallback to single insert when batch insert fails |##### 3.4.2 Breakpoint Continuation Mode | |
| | `--batch-failure-fallback` | true | | Whether to fallback to single insert when batch insert fails | |
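A fused row like this one can also be repaired automatically. The sketch below assumes that a heading marker directly after a closing table pipe always signals a lost line break. The helper name `split_fused_heading` is hypothetical.

```python
import re

# A Markdown ATX heading (1-6 '#' followed by a space) that starts right
# after a table pipe at the end of a line indicates a missing line break.
FUSED_HEADING = re.compile(r"\|\s*(#{1,6} .+)$")


def split_fused_heading(line: str) -> list[str]:
    """Split a table row from a heading fused onto its end.

    Returns the line unchanged (as a one-element list) when nothing is fused.
    """
    match = FUSED_HEADING.search(line)
    if match is None or not line.lstrip().startswith("|"):
        return [line]
    row = line[: match.start()] + "|"          # keep the row's closing pipe
    return [row.rstrip(), "", match.group(1).strip()]


fused = (
    "| `--batch-failure-fallback` | true | | Whether to fallback to single "
    "insert when batch insert fails |##### 3.4.2 Breakpoint Continuation Mode"
)
for fixed_line in split_fused_heading(fused):
    print(fixed_line)
```

The blank line between the row and the heading matters: without it, some Markdown renderers keep treating the heading text as part of the table.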
| |-----------------------------------------|---------------|-----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| | `-f` or `--file` | | Y | Path to configure script | | ||
| | `-g` or `--graph` | | Y | Graph name | | ||
| | `--graphspace` | DEFAULT | | Graph space name | |
The parameter --graphspace is listed without the -gs short form alternative, but this is inconsistent with the PR description which states this is an update to parameter descriptions. The original version had -gs or --graphspace. If -gs is being removed intentionally, this should be clearly documented. If it's an error, the short form should be restored.
| | `--graphspace` | DEFAULT | | Graph space name | | |
| | `-gs` or `--graphspace` | DEFAULT | | Graph space name | |
| |-----------------------------------------|-------------|------|-------------------------------------------------------------------| | ||
| | `-f` 或 `--file` | | Y | 配置脚本的路径 | | ||
| | `-g` 或 `--graph` | | Y | 图名称 | | ||
| | `--graphspace` | DEFAULT | | 图空间 | |
The parameter --graphspace is listed without the -gs short form alternative, but the original version had -gs 或 --graphspace. This is inconsistent with other parameters that maintain their short forms. If -gs is being removed intentionally, this should be clearly documented. If it's an error, the short form should be restored to maintain consistency with the English version.
| | `--graphspace` | DEFAULT | | 图空间 | | |
| | `-gs` 或 `--graphspace` | DEFAULT | | 图空间 | |
| | `--check-vertex` | false | | Whether to check if the vertices connected by the edge exist when inserting the edge | | ||
| | `--print-progress` | true | | Whether to print the number of imported items in real time on the console | | ||
| | `--dry-run` | false | | Enable this mode to only parse data without importing; usually used for testing | | ||
| | `--help` or `-help` | false | | Print help information | |
The --help parameter now shows --help or -help as alternatives. However, the standard convention for help flags is typically --help or -h. The short form -help is unusual and may not work as expected in many CLI frameworks. This should be verified against the actual implementation, and if -h is the correct short form, it should be used instead of -help.
| | `--help` or `-help` | false | | Print help information | | |
| | `--help` or `-h` | false | | Print help information | |
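Before adopting `-h` as the help short form, it is worth checking that no other row in the table already claims it. A later comment in this thread lists `-h` among the basic parameters. The sketch below shows one way to detect such collisions. The helper name `flag_collisions` is hypothetical, and the `--host` row is an assumed example rather than a row quoted from this diff.

```python
import re
from collections import defaultdict

# Matches flag names wrapped in backticks, e.g. `--help` or `-h`.
FLAG_PATTERN = re.compile(r"`(-{1,2}[A-Za-z][\w-]*)`")


def flag_collisions(rows: list[str]) -> dict[str, list[int]]:
    """Map each flag claimed by more than one row to the indexes of those rows."""
    owners = defaultdict(list)
    for index, row in enumerate(rows):
        cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
        name_cell = next((cell for cell in cells if cell), "")
        for flag in set(FLAG_PATTERN.findall(name_cell)):
            owners[flag].append(index)
    return {flag: indexes for flag, indexes in owners.items() if len(indexes) > 1}


rows = [
    # Assumed row for illustration; not quoted from the diff under review.
    "| `-h` or `--host` | localhost | | Server address |",
    # Row as it would read after applying the suggestion above.
    "| `--help` or `-h` | false | | Print help information |",
    "| `-g` or `--graph` | | Y | Graph name |",
]

print(flag_collisions(rows))  # {'-h': [0, 1]}
```

An empty result only shows that the documentation table is consistent with itself; the short form still has to be confirmed against the loader's actual option definitions.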
| | `--check-vertex` | false | | 插入边时是否检查边所连接的顶点是否存在 | | ||
| | `--print-progress` | true | | 是否在控制台实时打印导入条数 | | ||
| | `--dry-run` | false | | 打开该模式,只解析不导入,通常用于测试 | | ||
| | `--help` 或 `-help` | false | | 打印帮助信息 | |
The --help parameter now shows --help 或 -help as alternatives. However, the standard convention for help flags is typically --help or -h. The short form -help is unusual and may not work as expected in many CLI frameworks. This should be verified against the actual implementation, and if -h is the correct short form, it should be used instead of -help.
| | `--help` 或 `-help` | false | | 打印帮助信息 | | |
| | `--help` 或 `-h` | false | | 打印帮助信息 | |
| | `--print-progress` | true | | Whether to print the number of imported items in real time on the console | | ||
| | `--dry-run` | false | | Enable this mode to only parse data without importing; usually used for testing | | ||
| | `--help` or `-help` | false | | Print help information | | ||
| | `--parallel-count` 或 `--parallel-count` | max(2,CPUS) | | Parallel read pipelines for data files | |
The parameter name appears to be duplicated in both Chinese and English. The entry shows --parallel-count 或 --parallel-count (or in Chinese). This should likely be a different alternative parameter name or just a single parameter name without alternatives.
| | `--parallel-count` 或 `--parallel-count` | max(2,CPUS) | | Parallel read pipelines for data files | | |
| | `--parallel-count` | max(2,CPUS) | | Parallel read pipelines for data files | |
| | `--parallel-count` 或 `--parallel-count` | max(2,CPUS) | | 并行读取数据文件最大线程数 | | ||
| | `--start-file` | 0 | | 用于部分(分片)导入的起始文件索引 | | ||
| | `--end-file` | -1 | | 用于部分导入的截止文件索引 | | ||
| | `--scatter-sources` | false | | 分散(并行)读取多个数据源以优化 I/O 性能 | |
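The description of `--scatter-sources`, "分散(并行)读取多个数据源以优化 I/O 性能" ("read multiple data sources in a scattered, parallel fashion to optimize I/O performance"), is not intuitive enough for new users.
Suggested additions:
- In which scenarios should this parameter be enabled?
- How it differs from `--parallel-count` and how the two are used together
- Its concrete effect on performance (for example: "suitable for importing several large data files at the same time")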
| | `--use-prefilter` | false | | 是否预先过滤顶点 | | ||
| | `--short-id` | | | 将自定义 ID 映射为更短的 ID | | ||
| | `--vertex-edge-limit` | -1L | | 单个顶点的最大边数限制 | | ||
| | `--sink-type` | true | | 数据接收端(Sink)存储类型开关 | |
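`--sink-type` defaults to `true`, but the description "数据接收端(Sink)存储类型开关" ("switch for the sink storage type") leaves users unclear about:
- Which storage type does `true` stand for, and which does `false`?
- Which storage types are available?
Suggest improving the description, for example: "Whether to use [storage type A]; uses [storage type B] when false", or listing all available storage types.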
| | `--short-id` | | | 将自定义 ID 映射为更短的 ID | | ||
| | `--vertex-edge-limit` | -1L | | 单个顶点的最大边数限制 | | ||
| | `--sink-type` | true | | 数据接收端(Sink)存储类型开关 | | ||
| | `--vertex-partitions` | 64 | | HBase 顶点表的预分区数量 | |
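Newly added HBase-related parameters (lines 863-869):
- `--vertex-partitions`
- `--edge-partitions`
- `--vertex-table-name`
- `--edge-table-name`
- `--hbase-zk-quorum`
- `--hbase-zk-port`
- `--hbase-zk-parent`
Suggest adding the following before the parameter table or at the start of the "3.4.1 Parameter Description" (参数说明) section:
- These parameters apply to the HBase backend storage scenario
- They need to be used together with the `--backend` parameter
- Alternatively, add a separate subsection dedicated to the parameter configuration of each backend store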
| | `--dry-run` | false | | 打开该模式,只解析不导入,通常用于测试 | | ||
| | `--help` | false | | 打印帮助信息 | | ||
| | 参数 | 默认值 | 是否必传 | 描述信息 | |
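🧹 Parameter grouping suggestion - improve usability
The parameter table currently contains 49 parameters, which is too much information for new users. Suggest grouping them by function, for example:
Basic parameters (required/commonly used):
- `-f`, `-g`, `-s`, `-h`, `-p`, etc.
Performance tuning parameters:
- `--batch-insert-threads`, `--max-conn`, `--batch-size`, `--parallel-count`, etc.
Advanced feature parameters:
- `--incremental-mode`, `--failure-mode`, `--dry-run`, etc.
HBase backend parameters:
- `--vertex-partitions`, `--hbase-zk-quorum`, etc.
CDC-related parameters:
- `--cdc-flush-interval`, `--cdc-sink-parallelism`, etc.
This would help users quickly find the parameters they need.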
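The grouping could be generated from a single mapping rather than maintained by hand. The sketch below takes the group names and flags from the suggestion above and two descriptions from the English table in this PR. The function name `render_grouped_tables`, the two-column layout, and the bold group titles are assumptions, not the documentation's actual format.

```python
# Group names and member flags follow the reviewer's proposal; the lists are
# deliberately incomplete ("etc." in the review comment).
PARAMETER_GROUPS = {
    "Basic": ["-f", "-g", "-s", "-h", "-p"],
    "Performance tuning": ["--batch-insert-threads", "--max-conn",
                           "--batch-size", "--parallel-count"],
    "Advanced features": ["--incremental-mode", "--failure-mode", "--dry-run"],
    "HBase backend": ["--vertex-partitions", "--hbase-zk-quorum"],
    "CDC": ["--cdc-flush-interval", "--cdc-sink-parallelism"],
}

# Descriptions copied from the English table in this PR; others left blank.
DESCRIPTIONS = {
    "--dry-run": "Enable this mode to only parse data without importing; "
                 "usually used for testing",
    "--parallel-count": "Parallel read pipelines for data files",
}


def render_grouped_tables(groups: dict, descriptions: dict) -> str:
    """Render one small Markdown table per parameter group."""
    lines = []
    for title, flags in groups.items():
        lines += [f"**{title}**", "", "| Parameter | Description |", "|---|---|"]
        for flag in flags:
            lines.append(f"| `{flag}` | {descriptions.get(flag, '')} |")
        lines.append("")  # blank line terminates the table
    return "\n".join(lines).rstrip() + "\n"


print(render_grouped_tables(PARAMETER_GROUPS, DESCRIPTIONS))
```

Keeping the mapping in one place would also let the English and Chinese pages share the same grouping while differing only in the description text.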
| | `--max-read-lines` | -1L | | 最大读取行数限制;一旦达到此行数,导入任务将停止 | | ||
| | `--test-mode` | false | | 是否开启测试模式 | | ||
| | `--use-prefilter` | false | | 是否预先过滤顶点 | | ||
| | `--short-id` | | | 将自定义 ID 映射为更短的 ID | |
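The default value of `--short-id` is empty, but the description is "将自定义 ID 映射为更短的 ID" ("map custom IDs to shorter IDs"), which leaves users unclear about:
- What is the parameter's data type? (Boolean? String?)
- How is the parameter used? (Does it take a mapping rule, or is it just a switch?)
Suggestions:
- Add a note on the parameter type
- Provide a usage example, such as `--short-id=true` or `--short-id=mapping-file.json`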
| | `--backend` | hstore | | 自动创建图(如果不存在)时的后端存储类型 | | ||
| | `--serializer` | binary | | 自动创建图(如果不存在)时的序列化器类型 | | ||
| | `--scheduler-type` | distributed | | 自动创建图(如果不存在)时的任务调度器类型 | | ||
| | `--batch-failure-fallback` | true | | 批量插入失败时是否回退至单条插入模式 | |
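None of the 24 newly added parameters has a practical usage example. Suggest adding the following to the "3.4.4 Execute Command" (执行命令) or "4 Complete Example" (完整示例) section of the document:
- A high-performance import configuration example (including `--parallel-count`, `--scatter-sources`, etc.)
- An HBase backend import example
- A CDC streaming import example
- A sharded import example (`--start-file`, `--end-file`)
This would significantly improve the practical value of the documentation.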
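As a starting point for the sharded import example, the sketch below assembles a command line from the flags named in this PR. Several details are assumptions not confirmed in this thread: the script path `bin/hugegraph-loader.sh`, the file and graph names, and passing `--scatter-sources` as a bare flag (whether boolean options take an explicit value is not stated here). The meaning of the file indexes is also left to the loader; the values are simply passed through.

```python
import shlex


def build_loader_command(graph, struct_file, start_file, end_file,
                         parallel_count=None, scatter_sources=False):
    """Assemble an argv list for a partial (sharded) import.

    Assumptions: the script path and the bare-flag form of --scatter-sources
    are illustrative and must be checked against the loader itself.
    """
    command = [
        "bin/hugegraph-loader.sh",      # assumed script location
        "-g", graph,                    # required: graph name
        "-f", struct_file,              # required: path to configure script
        "--start-file", str(start_file),
        "--end-file", str(end_file),
    ]
    if parallel_count is not None:
        command += ["--parallel-count", str(parallel_count)]
    if scatter_sources:
        command.append("--scatter-sources")
    return command


# Hypothetical graph name and struct file, used only for illustration.
command = build_loader_command("hugegraph", "example/struct.json",
                               start_file=0, end_file=3,
                               parallel_count=4, scatter_sources=True)
print(shlex.join(command))
```

Building the argument list first and joining it with `shlex.join` keeps quoting correct if a path contains spaces.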
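PR review comments - Loader documentation usability
Hello! I have reviewed the Loader documentation update in this PR. Overall, it adds documentation for 24 new parameters, which is a lot of information. From a usability perspective, I found the following issues that need improvement: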
https://github.com/apache/incubator-hugegraph-toolchain/pull/704
Purpose of the PR