doc: update loader doc #445

Merged
imbajin merged 3 commits into apache:master from kenssa4eedfd:update-loader-doc
Jan 22, 2026

@kenssa4eedfd kenssa4eedfd commented Jan 22, 2026

https://github.com/apache/incubator-hugegraph-toolchain/pull/704

Purpose of the PR

  • Updated the parameter descriptions in the loader documentation.

@dosubot dosubot bot added the size:L label (This PR changes 100-499 lines, ignoring generated files.) Jan 22, 2026

Copilot AI left a comment

Pull request overview

This PR updates the parameter descriptions in the HugeGraph Loader documentation for both English and Chinese versions. The changes expand the parameter table to include many additional configuration options that were previously undocumented.

Changes:

  • Removed the -gs short form for the --graphspace parameter
  • Updated the --help parameter to show -help as an alternative
  • Added 25 new parameters covering features like parallel processing, CDC integration, HBase configuration, and graph creation options

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| content/en/docs/quickstart/toolchain/hugegraph-loader.md | Expanded parameter documentation table with 25 new parameters and formatting adjustments |
| content/cn/docs/quickstart/toolchain/hugegraph-loader.md | Chinese translation of the same parameter documentation updates |

| `--print-progress` | true | | 是否在控制台实时打印导入条数 |
| `--dry-run` | false | | 打开该模式,只解析不导入,通常用于测试 |
| `--help` 或 `-help` | false | | 打印帮助信息 |
| `--parallel-count` 或 `--parallel-count` | max(2,CPUS) | | 并行读取数据文件最大线程数 |

Copilot AI Jan 22, 2026

The parameter name is duplicated. The entry shows `--parallel-count` 或 `--parallel-count`, where both alternatives are identical. This should either show different parameter names (like `--parallel-count` or `-pc`) or just show a single parameter name without the 或 (or) connector.

Suggested change
- | `--parallel-count` 或 `--parallel-count` | max(2,CPUS) | | 并行读取数据文件最大线程数 |
+ | `--parallel-count` | max(2,CPUS) | | 并行读取数据文件最大线程数 |
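
A duplicated alternative like this one could be caught mechanically before review. The following is a minimal sketch, not part of the PR: the helper name `duplicated_flags` is hypothetical, and it assumes each flag in the first table cell is wrapped in backticks, as in the rows quoted above.

```python
import re


def duplicated_flags(row: str) -> list:
    """Return flags that appear more than once in the first cell of a Markdown table row."""
    # Drop the outer pipes, then take the first cell (the parameter-name column).
    first_cell = row.strip().strip("|").split("|")[0]
    # Flags are assumed to be the backtick-quoted spans in that cell.
    flags = re.findall(r"`([^`]+)`", first_cell)
    seen, dupes = set(), []
    for flag in flags:
        if flag in seen and flag not in dupes:
            dupes.append(flag)
        seen.add(flag)
    return dupes


bad = "| `--parallel-count` 或 `--parallel-count` | max(2,CPUS) | | 并行读取数据文件最大线程数 |"
good = "| `-f` 或 `--file` | | Y | 配置脚本的路径 |"
print(duplicated_flags(bad))   # ['--parallel-count']
print(duplicated_flags(good))  # []
```

The check only looks at the first column, so a flag mentioned again in a description cell would not be reported.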

| `--backend` | hstore | | The backend store type when creating graph if not exists |
| `--serializer` | binary | | The serializer type when creating graph if not exists |
| `--scheduler-type` | distributed | | The task scheduler type when creating graph if not exists |
| `--batch-failure-fallback` | true | | Whether to fallback to single insert when batch insert fails |##### 3.4.2 Breakpoint Continuation Mode

Copilot AI Jan 22, 2026

Line 861 contains text that appears to be a duplicate of the following section header. The line ends with "Whether to fallback to single insert when batch insert fails" followed immediately by "##### 3.4.2 Breakpoint Continuation Mode" on the same line. The section header should be on line 862 by itself, and line 861 should end properly with just the parameter description.

Suggested change
- | `--batch-failure-fallback` | true | | Whether to fallback to single insert when batch insert fails |##### 3.4.2 Breakpoint Continuation Mode
+ | `--batch-failure-fallback` | true | | Whether to fallback to single insert when batch insert fails |
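
A heading fused onto the end of a table row, as in the line above, can be detected and split with a small script. This is a sketch under assumptions: the helper name `split_fused_heading` is hypothetical, and the pattern treats any closing `|` followed by one to six `#` characters and a space as a fused heading, so a cell whose text begins with `# ` would be a false positive.

```python
import re

# A pipe, optional whitespace, then a Markdown heading (1-6 '#' and a space) running to end of line.
FUSED_HEADING = re.compile(r"\|\s*(#{1,6} .+)$")


def split_fused_heading(line: str) -> list:
    """Split 'table row|##### Heading' into row, blank line, heading; otherwise return [line]."""
    if not line.lstrip().startswith("|"):
        return [line]  # not a table row
    match = FUSED_HEADING.search(line)
    if match is None:
        return [line]
    row = line[: match.start() + 1].rstrip()  # keep the closing pipe
    return [row, "", match.group(1)]


fused = ("| `--batch-failure-fallback` | true | | Whether to fallback to single insert "
         "when batch insert fails |##### 3.4.2 Breakpoint Continuation Mode")
for part in split_fused_heading(fused):
    print(part)
```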

|-----------------------------------------|---------------|-----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `-f` or `--file` | | Y | Path to configure script |
| `-g` or `--graph` | | Y | Graph name |
| `--graphspace` | DEFAULT | | Graph space name |

Copilot AI Jan 22, 2026

The parameter --graphspace is listed without the -gs short form alternative, but this is inconsistent with the PR description which states this is an update to parameter descriptions. The original version had -gs or --graphspace. If -gs is being removed intentionally, this should be clearly documented. If it's an error, the short form should be restored.

Suggested change
- | `--graphspace` | DEFAULT | | Graph space name |
+ | `-gs` or `--graphspace` | DEFAULT | | Graph space name |

|-----------------------------------------|-------------|------|-------------------------------------------------------------------|
| `-f` 或 `--file` | | Y | 配置脚本的路径 |
| `-g` 或 `--graph` | | Y | 图名称 |
| `--graphspace` | DEFAULT | | 图空间 |

Copilot AI Jan 22, 2026

The parameter `--graphspace` is listed without the `-gs` short form alternative, but the original version had `-gs` 或 `--graphspace`. This is inconsistent with other parameters that maintain their short forms. If `-gs` is being removed intentionally, this should be clearly documented. If it's an error, the short form should be restored to maintain consistency with the English version.

Suggested change
- | `--graphspace` | DEFAULT | | 图空间 |
+ | `-gs` 或 `--graphspace` | DEFAULT | | 图空间 |

| `--check-vertex` | false | | Whether to check if the vertices connected by the edge exist when inserting the edge |
| `--print-progress` | true | | Whether to print the number of imported items in real time on the console |
| `--dry-run` | false | | Enable this mode to only parse data without importing; usually used for testing |
| `--help` or `-help` | false | | Print help information |

Copilot AI Jan 22, 2026

The --help parameter now shows --help or -help as alternatives. However, the standard convention for help flags is typically --help or -h. The short form -help is unusual and may not work as expected in many CLI frameworks. This should be verified against the actual implementation, and if -h is the correct short form, it should be used instead of -help.

Suggested change
- | `--help` or `-help` | false | | Print help information |
+ | `--help` or `-h` | false | | Print help information |

| `--check-vertex` | false | | 插入边时是否检查边所连接的顶点是否存在 |
| `--print-progress` | true | | 是否在控制台实时打印导入条数 |
| `--dry-run` | false | | 打开该模式,只解析不导入,通常用于测试 |
| `--help` 或 `-help` | false | | 打印帮助信息 |

Copilot AI Jan 22, 2026

The `--help` parameter now shows `--help` 或 `-help` as alternatives. However, the standard convention for help flags is typically `--help` or `-h`. The short form `-help` is unusual and may not work as expected in many CLI frameworks. This should be verified against the actual implementation, and if `-h` is the correct short form, it should be used instead of `-help`.

Suggested change
- | `--help` 或 `-help` | false | | 打印帮助信息 |
+ | `--help` 或 `-h` | false | | 打印帮助信息 |

| `--print-progress` | true | | Whether to print the number of imported items in real time on the console |
| `--dry-run` | false | | Enable this mode to only parse data without importing; usually used for testing |
| `--help` or `-help` | false | | Print help information |
| `--parallel-count` 或 `--parallel-count` | max(2,CPUS) | | Parallel read pipelines for data files |

Copilot AI Jan 22, 2026

The parameter name appears to be duplicated, and the connector between the two copies is Chinese in an English document. The entry shows `--parallel-count` 或 `--parallel-count` (或 is "or" in Chinese). This should likely be a different alternative parameter name or just a single parameter name without alternatives.

Suggested change
- | `--parallel-count` 或 `--parallel-count` | max(2,CPUS) | | Parallel read pipelines for data files |
+ | `--parallel-count` | max(2,CPUS) | | Parallel read pipelines for data files |
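
Stray Chinese characters in the English page are easy to miss by eye. The sketch below is one possible check, not something the PR includes: the helper name `cjk_lines` is hypothetical, and the character ranges cover only the basic CJK ideograph block, CJK punctuation, and fullwidth forms, which is an assumption about what is likely to leak in from the Chinese page.

```python
import re

# Basic CJK ideographs plus CJK symbols/punctuation and fullwidth forms.
CJK = re.compile(r"[\u3000-\u303f\u4e00-\u9fff\uff00-\uffef]")


def cjk_lines(text: str) -> list:
    """Return (1-based line number, offending characters) for every line containing CJK text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        found = "".join(CJK.findall(line))
        if found:
            hits.append((number, found))
    return hits


sample = (
    "| `--help` or `-help` | false | | Print help information |\n"
    "| `--parallel-count` 或 `--parallel-count` | max(2,CPUS) | | Parallel read pipelines for data files |\n"
)
print(cjk_lines(sample))  # [(2, '或')]
```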

| `--parallel-count` 或 `--parallel-count` | max(2,CPUS) | | 并行读取数据文件最大线程数 |
| `--start-file` | 0 | | 用于部分(分片)导入的起始文件索引 |
| `--end-file` | -1 | | 用于部分导入的截止文件索引 |
| `--scatter-sources` | false | | 分散(并行)读取多个数据源以优化 I/O 性能 |
Member

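⚠️ Parameter description is not clear enough - usability issue

The description of `--scatter-sources`, "分散(并行)读取多个数据源以优化 I/O 性能" ("scatter (parallel) reads across multiple data sources to optimize I/O performance"), is not intuitive for new users.

Suggested additions:

  1. In which scenarios should this parameter be enabled?
  2. How it differs from `--parallel-count`, and how the two are used together
  3. The concrete performance impact (for example: "suitable for importing several large data files at the same time")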

| `--use-prefilter` | false | | 是否预先过滤顶点 |
| `--short-id` | | | 将自定义 ID 映射为更短的 ID |
| `--vertex-edge-limit` | -1L | | 单个顶点的最大边数限制 |
| `--sink-type` | true | | 数据接收端(Sink)存储类型开关 |
Member

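⚠️ Vague parameter description - usability issue

`--sink-type` defaults to true, but the description "数据接收端(Sink)存储类型开关" ("switch for the data sink storage type") leaves users unsure:

  1. Which storage type does true stand for?
  2. Which storage type does false stand for?
  3. Which storage types are available?

Suggest improving the description, for example: "whether to use [storage type A]; when false, [storage type B] is used", or listing all available storage types.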

| `--short-id` | | | 将自定义 ID 映射为更短的 ID |
| `--vertex-edge-limit` | -1L | | 单个顶点的最大边数限制 |
| `--sink-type` | true | | 数据接收端(Sink)存储类型开关 |
| `--vertex-partitions` | 64 | | HBase 顶点表的预分区数量 |
Member

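⚠️ Missing usage-scenario notes - usability issue

The newly added HBase-related parameters (lines 863-869):

  • --vertex-partitions
  • --edge-partitions
  • --vertex-table-name
  • --edge-table-name
  • --hbase-zk-quorum
  • --hbase-zk-port
  • --hbase-zk-parent

Suggest adding, before the parameter table or at the start of the "3.4.1 参数说明" (parameter description) section:

  1. These parameters apply when HBase is the backend store
  2. They need to be used together with the `--backend` parameter
  3. Or add a separate subsection that explains the parameter configuration for each backend store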

| `--dry-run` | false | | 打开该模式,只解析不导入,通常用于测试 |
| `--help` | false | | 打印帮助信息 |

| 参数 | 默认值 | 是否必传 | 描述信息 |
Member

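🧹 Parameter grouping suggestion - improves usability

The parameter table currently holds 49 parameters, which is too much information for new users. Suggest grouping by function, for example:

Basic parameters (required/common):

  • -f, -g, -s, -h, -p

Performance tuning parameters:

  • --batch-insert-threads, --max-conn, --batch-size, --parallel-count

Advanced feature parameters:

  • --incremental-mode, --failure-mode, --dry-run

HBase backend parameters:

  • --vertex-partitions, --hbase-zk-quorum

CDC-related parameters:

  • --cdc-flush-interval, --cdc-sink-parallelism

This would help users quickly find the parameters they need.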
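
One way to try out the grouping suggested above is to regenerate the table from a flag-to-row mapping. The sketch below is illustrative only: the group names and members come from the comment above, while the function name `render_grouped`, the English header text, and the bold group labels are assumptions rather than the layout the docs actually use.

```python
# Group names and members follow the review comment; they are a proposal, not the final layout.
GROUPS = {
    "Basic": ["-f", "-g", "-s", "-h", "-p"],
    "Performance tuning": ["--batch-insert-threads", "--max-conn", "--batch-size", "--parallel-count"],
    "Advanced features": ["--incremental-mode", "--failure-mode", "--dry-run"],
    "HBase backend": ["--vertex-partitions", "--hbase-zk-quorum"],
    "CDC": ["--cdc-flush-interval", "--cdc-sink-parallelism"],
}

# Assumed header; the real page may word its columns differently.
HEADER = "| Parameter | Default | Required | Description |\n|---|---|---|---|"


def render_grouped(rows: dict, groups: dict) -> str:
    """Render one Markdown table per group; rows that match no group go under 'Other'."""
    remaining = dict(rows)  # copy so the caller's mapping is not modified
    sections = []
    for name, flags in groups.items():
        picked = [remaining.pop(flag) for flag in flags if flag in remaining]
        if picked:
            sections.append(f"**{name}**\n\n{HEADER}\n" + "\n".join(picked))
    if remaining:
        sections.append(f"**Other**\n\n{HEADER}\n" + "\n".join(remaining.values()))
    return "\n\n".join(sections)


rows = {
    "--dry-run": "| `--dry-run` | false | | Enable this mode to only parse data without importing; usually used for testing |",
    "--parallel-count": "| `--parallel-count` | max(2,CPUS) | | Parallel read pipelines for data files |",
    "--print-progress": "| `--print-progress` | true | | Whether to print the number of imported items in real time on the console |",
}
print(render_grouped(rows, GROUPS))
```

Keeping an explicit "Other" section means a newly added parameter still shows up in the output even if nobody has assigned it to a group yet.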

| `--max-read-lines` | -1L | | 最大读取行数限制;一旦达到此行数,导入任务将停止 |
| `--test-mode` | false | | 是否开启测试模式 |
| `--use-prefilter` | false | | 是否预先过滤顶点 |
| `--short-id` | | | 将自定义 ID 映射为更短的 ID |
Member

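⚠️ Default value does not match the parameter type

The default value of `--short-id` is empty, while the description is "将自定义 ID 映射为更短的 ID" ("map custom IDs to shorter IDs"), which leaves users unsure:

  1. What is the parameter's data type? (Boolean? String?)
  2. How is it used? (Does it take a mapping rule, or is it just a switch?)

Suggestions:

  1. Document the parameter type
  2. Provide a usage example, e.g. `--short-id=true` or `--short-id=mapping-file.json`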

| `--backend` | hstore | | 自动创建图(如果不存在)时的后端存储类型 |
| `--serializer` | binary | | 自动创建图(如果不存在)时的序列化器类型 |
| `--scheduler-type` | distributed | | 自动创建图(如果不存在)时的任务调度器类型 |
| `--batch-failure-fallback` | true | | 批量插入失败时是否回退至单条插入模式 |
Member

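⚠️ Missing parameter usage examples - usability issue

None of the 24 new parameters has a real usage example. Suggest adding the following to the "3.4.4 执行命令" (run command) or "4 完整示例" (complete example) section of the document:

  1. A high-performance import configuration example (including --parallel-count, --scatter-sources, etc.)
  2. An HBase backend import example
  3. A CDC streaming import example
  4. A sharded import example (--start-file, --end-file)

This would significantly improve the practical value of the document.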
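
As a starting point for the sharded-import example requested above, the sketch below only assembles a command line; it does not run the loader. Several things here are assumptions and should be checked against the loader's `--help` output: the script path `bin/hugegraph-loader.sh`, the graph and file names, passing boolean flags with an explicit `true`, and the exact meaning of the `--start-file` / `--end-file` indexes. The flag names themselves are taken from the tables in this PR.

```python
import shlex
from typing import List, Optional


def loader_command(graph: str, struct_file: str, start_file: int = 0,
                   end_file: int = -1, parallel_count: Optional[int] = None,
                   scatter_sources: bool = False,
                   script: str = "bin/hugegraph-loader.sh") -> List[str]:
    """Build an argv list for a partial (sharded) import; the script path is an assumption."""
    argv = [script, "-g", graph, "-f", struct_file,
            "--start-file", str(start_file), "--end-file", str(end_file)]
    if parallel_count is not None:
        argv += ["--parallel-count", str(parallel_count)]
    if scatter_sources:
        # Assumption: boolean options take an explicit value.
        argv += ["--scatter-sources", "true"]
    return argv


# Hypothetical graph name and mapping file, for illustration only.
cmd = loader_command("hugegraph", "example/struct.json",
                     start_file=0, end_file=9, parallel_count=4, scatter_sources=True)
print(shlex.join(cmd))
```

Building the argument list first and quoting it with `shlex.join` keeps paths with spaces safe when the command is pasted into documentation or a shell.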

imbajin commented Jan 22, 2026

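PR review - Loader documentation usability

Hello! I have reviewed the Loader documentation update in this PR. Overall it documents 24 additional parameters, which is a lot of new information. From a usability standpoint, I found the following issues that need improvement: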

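‼️ Critical issues

1. `--graphspace` lost its short form

Location: content/cn/docs/quickstart/toolchain/hugegraph-loader.md:54

The entry used to be `-gs` or `--graphspace`; now it is only `--graphspace`. As a result:

  • The documentation no longer matches the actual command-line parameters
  • Users cannot tell that the short form `-gs` is available
  • The format is inconsistent with other parameters

Suggestion: restore it to `-gs` or `--graphspace`

2. `--parallel-count` parameter name is duplicated

Location: content/cn/docs/quickstart/toolchain/hugegraph-loader.md:850

The table says `--parallel-count` 或 `--parallel-count`, so the parameter name is repeated. Please confirm:

  • Should it be `--parallel-count` or `-pc`?
  • Or does this parameter simply have no short form?

3. The English version contains a Chinese character

Location: content/en/docs/quickstart/toolchain/hugegraph-loader.md:207

The Chinese character 或 appears in the English document: `--parallel-count` 或 `--parallel-count`

It should be changed to `--parallel-count` or `--parallel-count`, or the duplicate should be removed.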

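⚠️ Important issues

4. New parameters do not say whether a short form exists

None of the 24 new parameters has a short form (such as -f, -g), so the first column of the table is formatted inconsistently.

Suggestions:

  • If there really is no short form, say so in the description
  • Or mark them uniformly as "(no short form)"

5. `--sink-type` description is vague

Location: content/cn/docs/quickstart/toolchain/hugegraph-loader.md:862

The default value is true, but the description "数据接收端(Sink)存储类型开关" ("switch for the data sink storage type") leaves users unsure:

  • Which storage type does true stand for?
  • Which does false stand for?

Suggestion: state the available storage types explicitly, for example: "whether to use HBase storage; when false, the default backend is used"

6. HBase-related parameters lack usage-scenario notes

Seven HBase-related parameters were added (--vertex-partitions, --hbase-zk-quorum, etc.), but nothing explains that:

  • These parameters apply only to the HBase backend
  • They need to be used together with --backend=hbase

Suggestion: explain the parameter configuration for each backend at the start of section 3.4.1 or in a new subsection

7. `--short-id` parameter type is unclear

Location: content/cn/docs/quickstart/toolchain/hugegraph-loader.md:860

The default value is empty, but it is unclear:

  • What is the parameter type? (Boolean / string / file path?)
  • How is it used?

Suggestion: add the parameter type and a usage example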

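🧹 Minor suggestions

8. Too many parameters; suggest grouping

The table currently holds 49 parameters, which is a lot of information. Suggest grouping by function:

  • Basic parameters: -f, -g, -s
  • Performance tuning: --batch-insert-threads, --parallel-count
  • Advanced features: --incremental-mode, --dry-run
  • HBase backend: --vertex-partitions
  • CDC-related: --cdc-flush-interval

9. Missing real usage examples

None of the 24 new parameters has a real usage example.

Suggest adding to section 3.4.4 or chapter 4:

  1. A high-performance import example (with --parallel-count, --scatter-sources)
  2. An HBase backend import example
  3. A CDC streaming import example
  4. A sharded import example (--start-file, --end-file)

10. `--scatter-sources` description could be more detailed

"分散(并行)读取多个数据源以优化 I/O 性能" ("scatter (parallel) reads across multiple data sources to optimize I/O performance") is not intuitive for new users.

Suggested additions:

  • Applicable scenarios (e.g. "suitable for importing several large files at the same time")
  • How it differs from `--parallel-count`, and how the two work together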

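Overall: this update adds a large amount of parameter documentation, which is very valuable! From a usability standpoint, though, I suggest:

  1. Fix the 3 critical issues (`--graphspace`, the duplicated `--parallel-count`, and the Chinese character in the English version)
  2. Improve the parameter descriptions, especially the usage scenarios for the HBase/CDC parameters
  3. Add real usage examples to help users get started quickly

Looking forward to the update! 👍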

@dosubot dosubot bot added the lgtm label (This PR has been approved by a maintainer) Jan 22, 2026
@imbajin imbajin merged commit 2832781 into apache:master Jan 22, 2026
1 check passed
github-actions bot pushed a commit that referenced this pull request Jan 22, 2026