Unicode characters seem to encode incorrectly? #88

Open
polyfractal opened this issue Dec 15, 2015 · 0 comments

From: https://discuss.elastic.co/t/smart-chinese-analysis-returns-unicodes-instead-of-chinese-tokens

Unicode is far from my expertise, so I may be very wrong about this. It seems that Sense is URL-encoding Unicode characters in a way that prevents Elasticsearch from decoding them properly.

For example, if we set up a smartcn analyzer and analyze some Chinese characters:

PUT /test_chinese
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "smartcn"
          }
        }
      }
    }
  }
}

GET /test_chinese/_analyze?text='我说世界好!'

The tokens are incorrect:

{
  "tokens": [
    {
      "token": "\u0011",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "\u0016",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 3
    },
    {
      "token": "l",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 4
    }
  ]
}

If we look at what gets sent over the wire:

http://localhost:5601/api/sense/proxy?uri=http%3A%2F%2Flocalhost%3A9200%2Ftest_chinese%2F_analyze%3Ftext%3D%27%E6%88%91%E8%AF%B4%E4%B8%96%E7%95%8C%E5%A5%BD!%27&_=1450180681591


Decoded:  http://localhost:5601/api/sense/proxy?uri=http://localhost:9200/test_chinese/_analyze?text='我说世界好!'&_=1450180681591
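
The decoding step above can be reproduced with a short Python sketch. It only uses the standard library and the URL copied from this issue; the variable names are mine. It shows that the browser-to-proxy request carries the text as valid percent-encoded UTF-8.

```python
from urllib.parse import urlsplit, parse_qs

# The request the browser sends to the Sense proxy, copied from above.
proxy_url = (
    "http://localhost:5601/api/sense/proxy?uri=http%3A%2F%2Flocalhost%3A9200"
    "%2Ftest_chinese%2F_analyze%3Ftext%3D%27%E6%88%91%E8%AF%B4%E4%B8%96"
    "%E7%95%8C%E5%A5%BD!%27&_=1450180681591"
)

# parse_qs percent-decodes each parameter value as UTF-8 by default.
params = parse_qs(urlsplit(proxy_url).query)
target_uri = params["uri"][0]

# Prints the Elasticsearch URL the proxy is being asked to call.
print(target_uri)
```

If the printed `uri` value contains the original Chinese text, the browser side of Sense is encoding correctly and the problem has to be further downstream.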

So that part looks OK (unlike previous versions of Sense), so I suspect the proxy is what's encoding it incorrectly. I pulled out a packet sniffer, and this is what the proxy sends to ES:

GET /test_chinese/_analyze?text=%27���L}!%27 HTTP/1.1
connection: keep-alive
x-forwarded-proto: http
accept: text/plain, */*; q=0.01
referer: http://localhost:5601/app/sense
kbn-xsrf-token: 959b10246601e4bc85e7f57d254ea23c31800cd60b36ee50627d0b6ef84f52f7
accept-encoding: gzip, deflate, sdch
x-forwarded-for: 127.0.0.1
accept-language: en-US,en;q=0.8
x-forwarded-port: 59597
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.73 Safari/537.36
Host: localhost:9200

Full packet dump:

0000   02 00 00 00 45 00 02 53 b2 0a 40 00 40 06 00 00  ....E..S..@.@...
0010   7f 00 00 01 7f 00 00 01 e8 ce 23 f0 b9 86 e9 fd  ..........#.....
0020   ce 74 fb d2 80 18 31 d7 00 48 00 00 01 01 08 0a  .t....1..H......
0030   1b bd 77 d1 1b bd 77 d1 47 45 54 20 2f 74 65 73  ..w...w.GET /tes
0040   74 5f 63 68 69 6e 65 73 65 2f 5f 61 6e 61 6c 79  t_chinese/_analy
0050   7a 65 3f 74 65 78 74 3d 25 32 37 11 f4 16 4c 7d  ze?text=%27...L}
0060   21 25 32 37 20 48 54 54 50 2f 31 2e 31 0d 0a 63  !%27 HTTP/1.1..c
0070   6f 6e 6e 65 63 74 69 6f 6e 3a 20 6b 65 65 70 2d  onnection: keep-
0080   61 6c 69 76 65 0d 0a 78 2d 66 6f 72 77 61 72 64  alive..x-forward
0090   65 64 2d 70 72 6f 74 6f 3a 20 68 74 74 70 0d 0a  ed-proto: http..
00a0   61 63 63 65 70 74 3a 20 74 65 78 74 2f 70 6c 61  accept: text/pla
00b0   69 6e 2c 20 2a 2f 2a 3b 20 71 3d 30 2e 30 31 0d  in, */*; q=0.01.
00c0   0a 72 65 66 65 72 65 72 3a 20 68 74 74 70 3a 2f  .referer: http:/
00d0   2f 6c 6f 63 61 6c 68 6f 73 74 3a 35 36 30 31 2f  /localhost:5601/
00e0   61 70 70 2f 73 65 6e 73 65 0d 0a 6b 62 6e 2d 78  app/sense..kbn-x
00f0   73 72 66 2d 74 6f 6b 65 6e 3a 20 39 35 39 62 31  srf-token: 959b1
0100   30 32 34 36 36 30 31 65 34 62 63 38 35 65 37 66  0246601e4bc85e7f
0110   35 37 64 32 35 34 65 61 32 33 63 33 31 38 30 30  57d254ea23c31800
0120   63 64 36 30 62 33 36 65 65 35 30 36 32 37 64 30  cd60b36ee50627d0
0130   62 36 65 66 38 34 66 35 32 66 37 0d 0a 61 63 63  b6ef84f52f7..acc
0140   65 70 74 2d 65 6e 63 6f 64 69 6e 67 3a 20 67 7a  ept-encoding: gz
0150   69 70 2c 20 64 65 66 6c 61 74 65 2c 20 73 64 63  ip, deflate, sdc
0160   68 0d 0a 78 2d 66 6f 72 77 61 72 64 65 64 2d 66  h..x-forwarded-f
0170   6f 72 3a 20 31 32 37 2e 30 2e 30 2e 31 0d 0a 61  or: 127.0.0.1..a
0180   63 63 65 70 74 2d 6c 61 6e 67 75 61 67 65 3a 20  ccept-language: 
0190   65 6e 2d 55 53 2c 65 6e 3b 71 3d 30 2e 38 0d 0a  en-US,en;q=0.8..
01a0   78 2d 66 6f 72 77 61 72 64 65 64 2d 70 6f 72 74  x-forwarded-port
01b0   3a 20 35 39 35 39 37 0d 0a 75 73 65 72 2d 61 67  : 59597..user-ag
01c0   65 6e 74 3a 20 4d 6f 7a 69 6c 6c 61 2f 35 2e 30  ent: Mozilla/5.0
01d0   20 28 4d 61 63 69 6e 74 6f 73 68 3b 20 49 6e 74   (Macintosh; Int
01e0   65 6c 20 4d 61 63 20 4f 53 20 58 20 31 30 5f 31  el Mac OS X 10_1
01f0   31 5f 31 29 20 41 70 70 6c 65 57 65 62 4b 69 74  1_1) AppleWebKit
0200   2f 35 33 37 2e 33 36 20 28 4b 48 54 4d 4c 2c 20  /537.36 (KHTML, 
0210   6c 69 6b 65 20 47 65 63 6b 6f 29 20 43 68 72 6f  like Gecko) Chro
0220   6d 65 2f 34 37 2e 30 2e 32 35 32 36 2e 37 33 20  me/47.0.2526.73 
0230   53 61 66 61 72 69 2f 35 33 37 2e 33 36 0d 0a 48  Safari/537.36..H
0240   6f 73 74 3a 20 6c 6f 63 61 6c 68 6f 73 74 3a 39  ost: localhost:9
0250   32 30 30 0d 0a 0d 0a                             200....
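
One possible reading of the dump, offered as an assumption rather than a confirmed diagnosis: the five bytes between `%27` and `!` (`11 f4 16 4c 7d`) are exactly the low 8 bits of each character's Unicode code point. That is what you get when a string is written with a single-byte encoding that truncates instead of encoding as UTF-8 (Node.js documents this truncation for its `latin1`/`binary` buffer encoding). The issue itself does not establish where in the proxy this would happen. A minimal Python sketch to check the arithmetic:

```python
text = "我说世界好"

# Bytes between "%27" and "!" in the packet dump (offsets 0x5b-0x5f).
on_the_wire = bytes.fromhex("11 f4 16 4c 7d")

# What a UTF-8 encoder would have produced (15 bytes, 3 per character).
utf8_bytes = text.encode("utf-8")

# Keep only the low 8 bits of each code point, one byte per character.
low_bytes = bytes(ord(ch) & 0xFF for ch in text)

print("utf-8    :", utf8_bytes.hex(" "))
print("low bytes:", low_bytes.hex(" "))
print("matches packet dump:", low_bytes == on_the_wire)
```

The last line prints `True`. This would also be consistent with the tokens Sense shows: `\u0011` and `\u0016` are the low bytes of 我 and 世, and `l` is `0x4c` (`L`, the low byte of 界) after lowercasing.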

For comparison, if you run the command via curl, you get the proper tokens back:

$ curl -XGET -v "http://127.0.0.1:9200/test_chinese/_analyze?text='我说世界好!'&pretty"

* Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 9200 (#0)
> GET /test_chinese/_analyze?text='我说世界好!'&pretty HTTP/1.1
> Host: 127.0.0.1:9200
> User-Agent: curl/7.43.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=UTF-8
< Content-Length: 1749
< 
{
  "tokens" : [ {
    "token" : "",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "word",
    "position" : 6
  }, {
    "token" : "",
    "start_offset" : 7,
    "end_offset" : 8,
    "type" : "word",
    "position" : 7
  }, {
    "token" : "",
    "start_offset" : 8,
    "end_offset" : 9,
    "type" : "word",
    "position" : 8
  }, {
    "token" : "",
    "start_offset" : 9,
    "end_offset" : 10,
    "type" : "word",
    "position" : 9
  }, {
    "token" : "",
    "start_offset" : 10,
    "end_offset" : 11,
    "type" : "word",
    "position" : 10
  }, {
    "token" : "",
    "start_offset" : 11,
    "end_offset" : 12,
    "type" : "word",
    "position" : 11
  }, {
    "token" : "",
    "start_offset" : 12,
    "end_offset" : 13,
    "type" : "word",
    "position" : 12
  }, {
    "token" : "",
    "start_offset" : 13,
    "end_offset" : 14,
    "type" : "word",
    "position" : 13
  }, {
    "token" : "",
    "start_offset" : 14,
    "end_offset" : 15,
    "type" : "word",
    "position" : 14
  }, {
    "token" : "",
    "start_offset" : 15,
    "end_offset" : 16,
    "type" : "word",
    "position" : 15
  } ]
}
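
As a hedged sketch of what an ASCII-only request line for the same query could look like, the text can be percent-encoded as UTF-8 before it is placed in the path. This is an illustration using Python's standard library, not the Sense proxy's code, and whether the proxy should do this itself is a question for the maintainers.

```python
from urllib.parse import quote, unquote

text = "'我说世界好!'"

# Percent-encode as UTF-8; safe="" also escapes reserved characters
# such as the quotes and the exclamation mark.
encoded = quote(text, safe="")
path = "/test_chinese/_analyze?text=" + encoded

# The resulting request path contains only ASCII characters.
print(path)
print("ascii only:", path.isascii())
print("round-trips:", unquote(encoded) == text)
```

Because every non-ASCII character is sent as `%XX` escapes, no intermediate layer has to pick a byte encoding for the request line.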