Can't json decode Chinese character on arm platform #3142

Closed
scuzhanglei opened this issue Mar 1, 2021 · 14 comments

@scuzhanglei
Contributor

Bug Report

Describe the bug

On the arm platform, a regex parser that uses Decode_Field_As does not work correctly.
This is the fluent-bit error log:

[2021/03/01 19:48:34] [error] Crossing over string boundary
[2021/03/01 19:48:34] [error] Not at boundary but still NULL terminating : 9 - '�呀"}'

To Reproduce

  • Rubular link if applicable: https://rubular.com/r/p3ryl22EbVRU00

  • Steps to reproduce the problem:
    Using the parser above with an example log that contains Chinese characters, or other special characters such as '《》', reproduces it.

Expected behavior

Screenshots

Your Environment

  • Version used:

Versions v1.2.0 through v1.7.1, built on my arm machine with the default configuration (cmake3 ../ && make), all suffer from this issue.

With v1.1.x, fluent-bit can't start at all; it exits immediately.

v1.0.x works correctly.

  • Configuration:

parser.conf

[PARSER]
    Name db_center
    Format regex
    Regex ^\[(?<time>.+): (?<levelname>.+)\/(?<processName>.+)\] \[(?<name>.+):(?<result>.+)\] \[(?<source>.+):(?<userrole>.+):(?<username>.+)\] (?<resources>{.*}) (?<data>{.+}) (?<messages>{.+})
    Time_Key time
    Decode_Field_As json resources
    Decode_Field_As json data
    Decode_Field_As json messages
  • Environment name and version (e.g. Kubernetes? What version?):
  • Server type and version: aarch64
  • Operating System and version: CentOS Linux release 7.8.2003 (AltArch)
  • Filters and plugins:

Additional context

I also tested it on an x86 machine, where it works fine; the issue only occurs on an arm machine.

@tlamr

tlamr commented Mar 23, 2021

Hello!

We are facing the same issue. It blocks our migration to arm64. It works OK with the latest version on amd64, though.

Cheers,
Tomas

@PettitWesley
Contributor

@edsiper Is this a known limitation of the arm64 build?

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Apr 26, 2021
@tlamr

tlamr commented Apr 26, 2021

ping

@agup006
Member

agup006 commented Apr 26, 2021

@nokute78 Would this be something you could help with? I recall you helping with some localization issues before.

@nokute78
Collaborator

nokute78 commented Apr 28, 2021

I excerpted the relevant code from flb_unescape.c and tried printing Japanese text.
https://gist.github.com/nokute78/9fbf670d657277d8645ad11ad0413ce8

I can reproduce it on Ubuntu 20.04 (x86_64) with qemu emulating aarch64.

$ sudo apt install g++-aarch64-linux-gnu qemu-user-binfmt 
$ aarch64-linux-gnu-gcc -o aaa --static arm_string.c
$ ./aaa
Crossing over string boundaryNot at boundary but still NULL terminating : 6 - '��'あぃ

ã

On x86_64, the output is correct.
Hmm.

$ gcc --static arm_string.c
$ ./a.out 
あぃ

あぃ

@nokute78
Collaborator

nokute78 commented Apr 28, 2021

The result of the cast (signed char -> uint32_t) differs between the two platforms.
This affects the following line:

ch = (uint32_t) *in_buf;

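Example code:

#include <stdio.h>
#include <stdint.h>

int main() {
  char a = -1;  /* plain char: its signedness depends on the platform */
  printf("a           = %0x\n", a);            /* a is promoted to int first */
  printf("a(uint32_t) = %0x\n", (uint32_t)a);  /* explicit widening cast */
  return 0;
}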

On x86_64

a           = ffffffff
a(uint32_t) = ffffffff

On aarch64

a           = ff
a(uint32_t) = ff

@nokute78
Collaborator

Hmm, the output of
printf("%x", (char)-1);
differs between x86_64 and aarch64.
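
The two outputs can be reproduced on a single machine by making the signedness of the byte explicit instead of relying on plain char. The following is only a sketch: the helper names are hypothetical, and it assumes a 32-bit int and a two's complement target. It writes out by hand the promotion to int that a char argument undergoes when passed to printf.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats a byte as "%x" would if char were signed. */
void format_as_signed(unsigned char byte, char *out, size_t len)
{
    signed char sc = (signed char) byte;   /* 0xff becomes -1 on two's complement targets */
    int promoted = sc;                     /* promotion to int sign-extends */
    snprintf(out, len, "%x", (unsigned int) promoted);
}

/* Hypothetical helper: formats a byte as "%x" would if char were unsigned. */
void format_as_unsigned(unsigned char byte, char *out, size_t len)
{
    int promoted = byte;                   /* promotion to int zero-extends: 255 */
    snprintf(out, len, "%x", (unsigned int) promoted);
}

/* Checks both variants against the outputs reported above (assumes 32-bit int). */
void self_check_format(void)
{
    char buf[16];

    format_as_signed(0xff, buf, sizeof buf);
    assert(strcmp(buf, "ffffffff") == 0);

    format_as_unsigned(0xff, buf, sizeof buf);
    assert(strcmp(buf, "ff") == 0);
}
```

Calling self_check_format() from main should pass on both x86_64 and aarch64, because neither helper depends on the signedness of plain char.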

@nokute78
Collaborator

%x expects an unsigned int argument, so (char) -1 ends up being interpreted as an unsigned int.
https://linux.die.net/man/3/printf

o, u, x, X
The unsigned int argument is converted to

According to the draft of ISO/IEC 9899:201x, (char) -1 is converted to (uint32_t) UINT32_MAX, as I understand it.

6.3.1.3 Signed and unsigned integers
(snip)
2 Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or
subtracting one more than the maximum value that can be represented in the new type
until the value is in the range of the new type.
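
As a worked illustration of the quoted rule, the sketch below applies it by hand for uint32_t: it adds or subtracts 2^32 (one more than UINT32_MAX) until the value fits. The function name convert_by_rule is hypothetical, and the loop is only meant for small inputs such as single bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: applies 6.3.1.3 paragraph 2 by hand for uint32_t. */
uint32_t convert_by_rule(int64_t value)
{
    const int64_t modulus = (int64_t) UINT32_MAX + 1;   /* 4294967296 */

    while (value < 0) {
        value += modulus;                  /* e.g. -1 + 4294967296 = 4294967295 */
    }
    while (value > (int64_t) UINT32_MAX) {
        value -= modulus;
    }
    return (uint32_t) value;
}

/* Compares the hand-applied rule with the compiler's own conversion. */
void self_check_rule(void)
{
    signed char minus_one = -1;
    signed char minus_24 = -24;            /* same bit pattern as the byte 0xe8 */

    assert(convert_by_rule(-1) == UINT32_MAX);
    assert(convert_by_rule(-1) == (uint32_t) minus_one);
    assert(convert_by_rule(-24) == 4294967272u);        /* 0xffffffe8 */
    assert(convert_by_rule(-24) == (uint32_t) minus_24);
    assert(convert_by_rule(232) == 232u);  /* already in range: unchanged */
}
```

The rule only produces a large value when the source value is negative, so the result depends on whether the byte is treated as signed before the conversion.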

@YukkiKaze

YukkiKaze commented May 19, 2021

I excerpted flb_unescape.c from v1.0.4, a version in which fluent-bit works correctly on both arm and x86, but I didn't get the expected output:

root in localhost in ~/test 
❯ gcc -o flb_104 flb_104.c 

root in localhost in ~/test 
❯ ./flb_104 
被拔

So I added some conditional statements taken from v1.7.4:
https://gist.github.com/YukkiKaze/1f0210b207ab92146dcb830fbc3c6e7a
Now I get the expected result on x86:

root in localhost in ~/test 
❯ gcc -o flb_104 flb_104.c 

root in localhost in ~/test 
❯ ./flb_104 
被拔

被拔

But I still face the same problem on arm:

[root@nestedcluster 18:54:36 test]$gcc -o flb_104 flb_104.c
[root@nestedcluster 18:54:44 test]$./flb_104
Not at boundary but still NULL terminating : 6 - '��'
被拔

被

Hmm...
I also compared flb_unescape_string_utf8 between v1.0.4 and v1.7.4, and the overall logic seems to be the same.
If the cast is the culprit, why didn't v1.0.4 expose it?

@YukkiKaze

@nokute78 @edsiper Hi, I think I found a solution: ch = (uint32_t) (signed char) *in_buf; solves the problem.
Plain char is unsigned by default on arm but signed on x86. So when a char is cast to uint32_t, the value is sign-extended on x86 and zero-extended on arm.
For example:

char a = 0xe8;
(uint32_t) a == 4294967272 // 0xffffffe8 on x86
(uint32_t) a == 232 // 0x000000e8 on arm
(uint32_t) (signed char) a == 4294967272 // 0xffffffe8 on both x86 and arm
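
The difference can be checked in a small standalone program. This is a minimal sketch, not the Fluent Bit code: the function names are hypothetical, and it assumes a two's complement target where storing 0xe8 in a signed char yields -24. CHAR_MIN from <limits.h> reports whether plain char is signed on the current target.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Returns 1 if plain char is signed on this target, 0 if it is unsigned. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Platform-dependent: the result follows the signedness of plain char. */
uint32_t widen_plain(char c)
{
    return (uint32_t) c;
}

/* The cast proposed above: sign-extends on every platform. */
uint32_t widen_via_signed(char c)
{
    return (uint32_t) (signed char) c;
}

/* The opposite choice, for comparison: zero-extends on every platform. */
uint32_t widen_via_unsigned(char c)
{
    return (uint32_t) (unsigned char) c;
}

/* Checks the three casts against the values listed above. */
void self_check_widen(void)
{
    char a = (char) 0xe8;

    assert(widen_via_signed(a) == 0xffffffe8u);   /* 4294967272 everywhere */
    assert(widen_via_unsigned(a) == 0x000000e8u); /* 232 everywhere */
    assert(widen_plain(a) ==
           (plain_char_is_signed() ? 0xffffffe8u : 0x000000e8u));
}
```

Only widen_plain changes its result between x86 and arm. GCC and Clang also accept -fsigned-char and -funsigned-char to force the signedness of plain char for a whole build, which is a convenient way to try the other behavior without an arm machine.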

@edsiper
Member

edsiper commented May 20, 2021

@nokute78 ^

@nokute78
Collaborator

nokute78 commented May 21, 2021

Patch #3522 has been merged.
Closing this issue.
