
Unexpected whitespace causes errors #28

Closed
Trey314159 opened this issue Aug 4, 2017 · 1 comment
Comments

@Trey314159

I've found three cases where the unexpected presence or absence of whitespace causes an offset error or a string-index-out-of-range error. I'm reporting all three together since I suspect they are related.

  1. Two newlines in a row cause an error:
curl -s localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "vi", "text" : "x\n\ny" }'

{
  "error" : {
    "root_cause" : [
      {
        "type" : "remote_transport_exception",
        "reason" : "[K5DTwrD][127.0.0.1:9300][indices:admin/analyze[s]]"
      }
    ],
    "type" : "string_index_out_of_bounds_exception",
    "reason" : "String index out of range: -1"
  },
  "status" : 500
}
  2. Two spaces between elements that should tokenize together cause an error. Normally "không gian" is indexed as a single token, but with two spaces between "không" and "gian" it triggers an error:
curl -s localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "vi", "text" : "không  gian"}'

{
  "error" : {
    "root_cause" : [
      {
        "type" : "remote_transport_exception",
        "reason" : "[K5DTwrD][127.0.0.1:9300][indices:admin/analyze[s]]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-1,endOffset=9"
  },
  "status" : 400
}
  3. No space between elements that should tokenize separately causes an error. "năm 6" usually gets tokenized together; without the space, I believe it still gets split into two tokens, but the missing space between them causes an error:
curl -s localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "vi", "text" : "năm6"}'

{
  "error" : {
    "root_cause" : [
      {
        "type" : "remote_transport_exception",
        "reason" : "[K5DTwrD][127.0.0.1:9300][indices:admin/analyze[s]]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-1,endOffset=4"
  },
  "status" : 400
}
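A guess at the common mechanism behind all three cases (this is an illustration, not the plugin's actual code): if the tokenizer computes token offsets by searching for each token's surface form in the original input, then any mismatch between the segmenter's output and the raw text, such as collapsed whitespace, makes `indexOf` return -1. That -1 would then surface either as `String index out of range: -1` or as `startOffset=-1` in the errors above.

```java
// Hypothetical sketch of indexOf-based offset computation going wrong.
// The class and method names are made up for illustration.
public class OffsetBug {
    // Locate a token's start offset by searching the raw input.
    static int startOffset(String input, String token, int from) {
        // Returns -1 when the token text is not found verbatim in the input,
        // e.g. because the segmenter normalized whitespace inside the token.
        return input.indexOf(token, from);
    }

    public static void main(String[] args) {
        String raw = "không  gian";   // raw input with two spaces (case 2)
        String token = "không gian";  // segmenter output with a single space
        System.out.println(startOffset(raw, token, 0)); // prints -1
    }
}
```

Lucene rejects a negative start offset with exactly the `startOffset must be non-negative` message shown in cases 2 and 3, which is consistent with this kind of lookup failure.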
@duydo
Owner

duydo commented Aug 16, 2017

I'm closing this issue; it will be fixed in the new tokenizer (#37).

@duydo duydo closed this as completed Aug 16, 2017