@mengnankkkk
Contributor

@mengnankkkk mengnankkkk commented Sep 28, 2025

What's changed?

Incorrect UTF-8 validation logic: the validation in OnlineParser.java (lines 419-421) incorrectly identifies valid Chinese strings as invalid.
Byte stream processing issue: the original parser reads the InputStream byte by byte and cannot correctly handle UTF-8 multi-byte characters (most Chinese characters occupy 3 bytes).
Close #3791

Change the InputStream-based byte-by-byte parsing to UTF-8 string parsing
Add a test for this change
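
To illustrate the byte stream problem described above, here is a minimal sketch. It is not the code from OnlineParser.java; the class and method names are hypothetical. It contrasts a decoder that treats every byte as one character with one that decodes the whole buffer as UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of HertzBeat.
final class ByteVsUtf8Demo {

    // Mimics a parser that turns every single byte into its own character.
    static String decodeByteByByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) (b & 0xFF)); // one byte -> one char, multi-byte sequences are split
        }
        return sb.toString();
    }

    // Decodes the buffer as UTF-8, so multi-byte sequences become single characters.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character Chinese string for "Chinese".
        byte[] raw = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        System.out.println(raw.length);                      // 6 (3 bytes per character)
        System.out.println(decodeByteByByte(raw).length());  // 6 garbled characters
        System.out.println(decodeAsUtf8(raw).length());      // 2 correct characters
    }
}
```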

Checklist

  • I have read the Contributing Guide
  • I have written the necessary doc or comment.
  • I have added the necessary unit tests and all cases have passed.

Add or update API

  • I have added the necessary e2e tests and all cases have passed.

@Duansg
Member

Duansg commented Sep 28, 2025

Hi @mengnankkkk, thank you very much for your revisions for this issue. I believe the minimal change required should only address the label value.

After reviewing the official documentation, it appears that only label values currently exhibit this issue. I believe we can use UTF-8 parsing only when necessary, to avoid unnecessary encoding conversion overhead.

For example, the adopted solution: fast ASCII validation + UTF-8 fallback mechanism
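
One way the "fast ASCII validation + UTF-8 fallback" idea could look is sketched below. This is an assumption-laden illustration, not the implementation merged in this PR: `LabelValueDecoder`, `isAscii`, and `decode` are hypothetical names, and the sketch works on a complete byte array rather than on the parser's stream.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HertzBeat.
final class LabelValueDecoder {

    // Fast path check: true when every byte is 7-bit ASCII (0-127).
    static boolean isAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) { // a negative Java byte has its high bit set, so it is non-ASCII
                return false;
            }
        }
        return true;
    }

    // Decodes a label value, rejecting byte sequences that are not valid UTF-8.
    static String decode(byte[] bytes) {
        if (isAscii(bytes)) {
            // No strict UTF-8 validation is needed for pure ASCII input.
            return new String(bytes, StandardCharsets.US_ASCII);
        }
        // Fallback: strict UTF-8 decoding that reports malformed input instead of replacing it.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Label value is not valid UTF-8", e);
        }
    }
}
```

With this shape, ASCII-only values skip the strict decoder entirely, and only values containing a byte above 127 pay for full validation.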

@mengnankkkk
Contributor Author

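Hello, thank you very much for your changes for this issue. I think the minimal change required should only involve the label value.

After consulting the official documentation, it seems that only label values currently have this problem. I think we can use UTF-8 parsing when necessary, to avoid unnecessary encoding conversion overhead.

For example, the adopted solution: fast ASCII validation + UTF-8 fallback mechanism

Thank you for your suggestion; I will make the changes accordingly.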

- Replace strict UTF-8 validation with performance-optimized approach
- ASCII characters (0-127) use fast path without UTF-8 conversion overhead
- Non-ASCII characters use UTF-8 fallback mechanism for proper validation
- Support Chinese and other Unicode characters in Prometheus label values
- Add comprehensive UTF-8 multi-byte character parsing in parseLabelValue
- Add test case for Chinese label values validation
- Maintain full backward compatibility with existing functionality

Resolves issue with Chinese characters in Prometheus metrics label values
Performance improvement: zero overhead for ASCII-only label values
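
As a rough illustration of the multi-byte parsing mentioned in the list above, the following sketch reads one code point at a time from an InputStream. It is not the PR's parseLabelValue code; `Utf8StreamReader`, `readCodePoint`, and `readAll` are hypothetical names, and the sketch is simplified (it does not reject overlong encodings or surrogate code points).

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of HertzBeat.
final class Utf8StreamReader {

    // Reads one Unicode code point from the stream, or returns -1 at end of stream.
    static int readCodePoint(InputStream in) throws IOException {
        int first = in.read();
        if (first < 0) {
            return -1;
        }
        if (first < 0x80) {
            return first; // ASCII fast path: a single byte is the code point
        }
        int extra; // number of continuation bytes expected
        int cp;    // code point accumulated so far
        if ((first & 0xE0) == 0xC0) {
            extra = 1;
            cp = first & 0x1F;
        } else if ((first & 0xF0) == 0xE0) {
            extra = 2; // typical for Chinese characters
            cp = first & 0x0F;
        } else if ((first & 0xF8) == 0xF0) {
            extra = 3;
            cp = first & 0x07;
        } else {
            throw new IOException("Invalid UTF-8 lead byte: " + first);
        }
        for (int i = 0; i < extra; i++) {
            int next = in.read();
            if (next < 0 || (next & 0xC0) != 0x80) {
                throw new IOException("Invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (next & 0x3F);
        }
        return cp;
    }

    // Reads the whole stream into a String, one code point at a time.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int cp;
        while ((cp = readCodePoint(in)) >= 0) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }
}
```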
@mengxin523

Yes, I believe using UTF-8 can fully meet the requirement of supporting all languages worldwide for label values! Moreover, I suggest that this setting should not be limited only to label values; in this way, non-English content can be used in many other scenarios, achieving full universality and getting it right in one go.

@Duansg
Member

Duansg commented Sep 29, 2025

The Prometheus specification explicitly stipulates that metric names must match [a-zA-Z_:][a-zA-Z0-9_:]* and label names must match [a-zA-Z_][a-zA-Z0-9_]*, so neither supports Chinese characters. If we were to support Chinese metric names in HertzBeat, the exported data would become unrecognizable to ecosystem tools like Prometheus and Grafana, breaking compatibility.

A more reasonable approach is to stay standards-compliant while providing Chinese-friendly features through label values, metadata mapping, or the frontend presentation layer. This adheres to the Prometheus specification while also meeting users' need for Chinese readability.
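
The naming restriction discussed in this comment can be sketched as a pair of checks. This assumes the classic Prometheus data model patterns for metric and label names; `PrometheusNames` and its methods are hypothetical names, not HertzBeat or Prometheus APIs. Label values are deliberately not checked here, since the restriction applies only to names.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of HertzBeat or any Prometheus client library.
final class PrometheusNames {

    // Metric names: letters, digits, underscores, and colons; must not start with a digit.
    private static final Pattern METRIC_NAME = Pattern.compile("[a-zA-Z_:][a-zA-Z0-9_:]*");

    // Label names: letters, digits, and underscores; must not start with a digit.
    private static final Pattern LABEL_NAME = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

    static boolean isValidMetricName(String name) {
        return METRIC_NAME.matcher(name).matches();
    }

    static boolean isValidLabelName(String name) {
        return LABEL_NAME.matcher(name).matches();
    }
}
```

Under these patterns a name such as `http_requests_total` passes, while a Chinese metric or label name is rejected, which is why the Chinese text belongs in the label value instead.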

@tomsun28 tomsun28 added the good first pull request Good for newcomers label Oct 5, 2025
@tomsun28 tomsun28 added the bugfix label Oct 5, 2025
@tomsun28 tomsun28 added this to the 1.8.0 milestone Oct 5, 2025
Member

@tomsun28 tomsun28 left a comment

LGTM! 👍

@tomsun28 tomsun28 merged commit 32d784e into apache:master Oct 5, 2025
3 checks passed
@github-project-automation github-project-automation bot moved this from To do to Done in Apache HertzBeat Oct 5, 2025
@mengnankkkk mengnankkkk deleted the feat-mengnankkbug branch October 5, 2025 01:18
@tomsun28
Member

tomsun28 commented Oct 8, 2025

Sorry, I missed this. Do we need to discuss this PR again?

@Duansg
Member

Duansg commented Oct 9, 2025

Hi @tomsun28, yes, I believe there are still some issues in this PR that require priority attention, such as:

  1. Excessive object creation and unnecessary boxing/unboxing.
  2. Incorrect exception fallback strategy.
  3. Redundant inspections and existing performance issues.

I've addressed them in #3810. Please review.


Successfully merging this pull request may close these issues.

Regarding Prometheus monitoring, it seems that it does not support Chinese characters as label values.

4 participants