Offset Bug #61

sylvae · 2019-08-01T10:09:14Z

Hi, I hit a bug when using elasticsearch-analysis-vietnamese.
My index mappings:
    PUT /test
    {
        "mappings": {
            "properties": {
                "content": {
                    "type": "text",
                    "analyzer": "vi_analyzer"
                }
            }
        }
    }
And when I index the data:
    POST test/_doc/1
    {
        "content": "Phụ tùng xe Mazda bán tải dưới 7 chỗ: ống dẫn gió tới két làm mát khí nạp- cao su lưu hóa, mới 100%, phục vụ BHBD. Ms:1D0013246A"
    }
ES returns an error:

    {
        "error": {
            "root_cause": [
                {
                    "type": "illegal_argument_exception",
                    "reason": "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=95,endOffset=96,lastStartOffset=115 for field 'content'"
                }
            ],
            "type": "illegal_argument_exception",
            "reason": "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=95,endOffset=96,lastStartOffset=115 for field 'content'"
        },
        "status": 400
    }
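The error message names three conditions and then prints the three values it compared. As a rough illustration (not Lucene's actual code), the following sketch pulls those values out of the `reason` string and reports which of the stated conditions fails. The function name `explain_offset_error` and the regular expression are assumptions made for this example only.

```python
import re

# Hypothetical helper: matches the "startOffset=..,endOffset=..,lastStartOffset=.."
# tail of the error message quoted above.
OFFSET_RE = re.compile(
    r"startOffset=(-?\d+),endOffset=(-?\d+),lastStartOffset=(-?\d+)"
)


def explain_offset_error(reason):
    """Return the conditions from the message that the reported values violate."""
    match = OFFSET_RE.search(reason)
    if match is None:
        return []
    start, end, last_start = (int(g) for g in match.groups())

    problems = []
    if start < 0:
        problems.append(f"startOffset {start} is negative")
    if end < start:
        problems.append(f"endOffset {end} is less than startOffset {start}")
    if start < last_start:
        problems.append(
            f"startOffset {start} is less than lastStartOffset {last_start}"
        )
    return problems


reason = (
    "startOffset must be non-negative, and endOffset must be >= startOffset, "
    "and offsets must not go backwards "
    "startOffset=95,endOffset=96,lastStartOffset=115 for field 'content'"
)
problems = explain_offset_error(reason)
print(problems)
```

For the message in this report, only the "offsets must not go backwards" condition fails: a token starting at 95 arrived after a token starting at 115.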
I also ran the same text through the analyze API:

    POST _analyze
    {
        "analyzer": "vi_analyzer",
        "text": "Phụ tùng xe Mazda bán tải dưới 7 chỗ: ống dẫn gió tới két làm mát khí nạp- cao su lưu hóa, mới 100%, phục vụ BHBD. Ms:1D0013246A"
    }
ES returns:
    {
        "tokens": [
            { "token": "phụ tùng", "start_offset": 0, "end_offset": 8, "type": "<PHRASE>", "position": 0 },
            { "token": "xe", "start_offset": 9, "end_offset": 11, "type": "<PHRASE>", "position": 1 },
            { "token": "mazda", "start_offset": 12, "end_offset": 17, "type": "<PHRASE>", "position": 2 },
            { "token": "bán", "start_offset": 18, "end_offset": 21, "type": "<PHRASE>", "position": 3 },
            { "token": "tải", "start_offset": 22, "end_offset": 25, "type": "<PHRASE>", "position": 4 },
            { "token": "7", "start_offset": 31, "end_offset": 32, "type": "<NUMBER>", "position": 6 },
            { "token": "chỗ", "start_offset": 33, "end_offset": 36, "type": "<PHRASE>", "position": 7 },
            { "token": "ống", "start_offset": 38, "end_offset": 41, "type": "<PHRASE>", "position": 8 },
            { "token": "dẫn", "start_offset": 42, "end_offset": 45, "type": "<PHRASE>", "position": 9 },
            { "token": "gió", "start_offset": 46, "end_offset": 49, "type": "<PHRASE>", "position": 10 },
            { "token": "tới", "start_offset": 50, "end_offset": 53, "type": "<PHRASE>", "position": 11 },
            { "token": "két", "start_offset": 54, "end_offset": 57, "type": "<PHRASE>", "position": 12 },
            { "token": "làm", "start_offset": 58, "end_offset": 61, "type": "<PHRASE>", "position": 13 },
            { "token": "mát", "start_offset": 62, "end_offset": 65, "type": "<PHRASE>", "position": 14 },
            { "token": "khí", "start_offset": 66, "end_offset": 69, "type": "<PHRASE>", "position": 15 },
            { "token": "nạp", "start_offset": 70, "end_offset": 73, "type": "<PHRASE>", "position": 16 },
            { "token": "cao su", "start_offset": 75, "end_offset": 81, "type": "<PHRASE>", "position": 17 },
            { "token": "lưu hóa", "start_offset": 82, "end_offset": 89, "type": "<PHRASE>", "position": 18 },
            { "token": "mới", "start_offset": 91, "end_offset": 94, "type": "<PHRASE>", "position": 19 },
            { "token": "100%", "start_offset": 95, "end_offset": 99, "type": "<PERCENTAGE>", "position": 20 },
            { "token": "phục vụ", "start_offset": 101, "end_offset": 108, "type": "<PHRASE>", "position": 21 },
            { "token": "bhbd", "start_offset": 109, "end_offset": 113, "type": "<PHRASE>", "position": 22 },
            { "token": "ms", "start_offset": 115, "end_offset": 117, "type": "<PHRASE>", "position": 23 },
            { "token": "1", "start_offset": 95, "end_offset": 96, "type": "<NUMBER>", "position": 24 },
            { "token": "d0", "start_offset": 119, "end_offset": 121, "type": "<ALLCAPS>", "position": 25 },
            { "token": "013246", "start_offset": 121, "end_offset": 127, "type": "<NUMBER>", "position": 26 },
            { "token": "a", "start_offset": 127, "end_offset": 128, "type": "<PHRASE>", "position": 27 }
        ]
    }

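The tokens "100%" (position 20) and "1" (position 24) have the same start_offset, 95, even though the "1" comes from "1D0013246A" at the end of the text, after "ms" at offset 115. So this may be a bug in how the plugin computes offsets.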
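To spot this kind of problem without reading the whole token list by eye, a small script can scan an `_analyze` response for tokens whose `start_offset` is lower than that of the token before them. This is a minimal sketch: the helper name `find_backward_offsets` is made up for this example, and the token list is a hand-copied excerpt of the response above, not a live call to Elasticsearch.

```python
def find_backward_offsets(tokens):
    """Return (previous, current) pairs where start_offset goes backwards."""
    violations = []
    previous = None
    for token in tokens:
        if previous is not None and token["start_offset"] < previous["start_offset"]:
            violations.append((previous, token))
        previous = token
    return violations


# Excerpt of the "tokens" array from the _analyze response above.
tokens = [
    {"token": "100%", "start_offset": 95, "end_offset": 99, "position": 20},
    {"token": "phục vụ", "start_offset": 101, "end_offset": 108, "position": 21},
    {"token": "bhbd", "start_offset": 109, "end_offset": 113, "position": 22},
    {"token": "ms", "start_offset": 115, "end_offset": 117, "position": 23},
    {"token": "1", "start_offset": 95, "end_offset": 96, "position": 24},
    {"token": "d0", "start_offset": 119, "end_offset": 121, "position": 25},
]

bad = find_backward_offsets(tokens)
for prev, cur in bad:
    print(
        f'token "{cur["token"]}" starts at {cur["start_offset"]}, '
        f'but the previous token "{prev["token"]}" starts at {prev["start_offset"]}'
    )
```

On this excerpt the only pair flagged is "ms" followed by "1", which matches the `startOffset=95` and `lastStartOffset=115` values in the indexing error.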

I built the Elasticsearch Vietnamese Analysis Plugin myself for Elasticsearch 7.3.0.

Thank you for your help!

duydo · 2019-08-02T09:29:40Z

Thanks for your report @sylvae, I'll investigate.

tienthanh2509 · 2019-08-23T04:21:06Z

Hi,

ES 7.0.0 still has the same issue.

Mapping:

    "mappings": {
        "properties": {
            "description": {
                "type": "text",
                "analyzer": "vi_analyzer"
            },
            "title": {
                "type": "text",
                "analyzer": "vi_analyzer"
            }
        }
    }

elasticsearch_1          | {"type": "server", "timestamp": "2019-08-23T04:15:34,198+0000", "level": "DEBUG", "component": "o.e.a.b.TransportShardBulkAction", "cluster.name": "knl-cluster", "node.name": "es01", "cluster.uuid": "9y8OyVWsStGCx5RXbaxxTg", "node.id": "A0dRe1RQTaObA9cOgjc9HA",  "message": "[knl][0] failed to execute bulk item (index) index {[knl][_doc][HqTIpqcKLA], source[n/a, actual length: [3.6kb], max length: 2kb]}" , 
elasticsearch_1          | "stacktrace": ["java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=54,endOffset=56,lastStartOffset=60 for field 'description'",
elasticsearch_1          | "at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:842) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]",
elasticsearch_1          | "at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:441) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]",
elasticsearch_1          | "at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:405) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]",
elasticsearch_1          | "at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:251) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]",
elasticsearch_1          | "at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]",
elasticsearch_1          | "at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]",
elasticsearch_1          | "at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1213) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]",
elasticsearch_1          | "at org.elasticsearch.index.engine.InternalEngine.addDocs(InternalEngine.java:1092) ~[elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:1037) ~[elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:864) ~[elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:789) ~[elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:762) ~[elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:719) ~[elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.bulk.TransportShardBulkAction.lambda$executeIndexRequestOnPrimary$3(TransportShardBulkAction.java:452) ~[elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.bulk.TransportShardBulkAction.executeOnPrimaryWhileHandlingMappingUpdates(TransportShardBulkAction.java:475) ~[elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.bulk.TransportShardBulkAction.executeIndexRequestOnPrimary(TransportShardBulkAction.java:450) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:218) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:161) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:153) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:141) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:79) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1033) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1011) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:105) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:413) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:359) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:269) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:236) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2512) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:970) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:358) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:313) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:305) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:251) [x-pack-security-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:309) [x-pack-security-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:687) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.0.0.jar:7.0.0]",
elasticsearch_1          | "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
elasticsearch_1          | "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
elasticsearch_1          | "at java.lang.Thread.run(Thread.java:835) [?:?]"] }

The tokenizer works fine:

{
    "tokens": [
        {
            "token": "Người dân",
            "start_offset": 0,
            "end_offset": 9,
            "type": "<PHRASE>",
            "position": 0
        },
        {
            "token": "ở",
            "start_offset": 10,
            "end_offset": 11,
            "type": "<PHRASE>",
            "position": 1
        },
        {
            "token": "quận",
            "start_offset": 12,
            "end_offset": 16,
            "type": "<PHRASE>",
            "position": 2
        },
        {
            "token": "4",
            "start_offset": 17,
            "end_offset": 18,
            "type": "<NUMBER>",
            "position": 3
        },
        {
            "token": "TP.HCM",
            "start_offset": 20,
            "end_offset": 26,
            "type": "<ALLCAPS>",
            "position": 4
        },
        {
            "token": "nói",
            "start_offset": 28,
            "end_offset": 31,
            "type": "<PHRASE>",
            "position": 5
        },
        {
            "token": "một",
            "start_offset": 32,
            "end_offset": 35,
            "type": "<PHRASE>",
            "position": 6
        },
        {
            "token": "bị cáo",
            "start_offset": 36,
            "end_offset": 42,
            "type": "<PHRASE>",
            "position": 7
        },
        {
            "token": "được",
            "start_offset": 43,
            "end_offset": 47,
            "type": "<PHRASE>",
            "position": 8
        },
        {
            "token": "đưa",
            "start_offset": 48,
            "end_offset": 51,
            "type": "<PHRASE>",
            "position": 9
        },
        {
            "token": "ra",
            "start_offset": 52,
            "end_offset": 54,
            "type": "<PHRASE>",
            "position": 10
        },
        {
            "token": "xét xử",
            "start_offset": 55,
            "end_offset": 61,
            "type": "<PHRASE>",
            "position": 11
        },
        {
            "token": "với",
            "start_offset": 62,
            "end_offset": 65,
            "type": "<PHRASE>",
            "position": 12
        },
        {
            "token": "tội danh",
            "start_offset": 66,
            "end_offset": 74,
            "type": "<PHRASE>",
            "position": 13
        },
        {
            "token": "dâm ô",
            "start_offset": 75,
            "end_offset": 80,
            "type": "<PHRASE>",
            "position": 14
        },
        {
            "token": "trẻ",
            "start_offset": 81,
            "end_offset": 84,
            "type": "<PHRASE>",
            "position": 15
        },
        {
            "token": "nhỏ",
            "start_offset": 85,
            "end_offset": 88,
            "type": "<PHRASE>",
            "position": 16
        },
        {
            "token": "như",
            "start_offset": 89,
            "end_offset": 92,
            "type": "<PHRASE>",
            "position": 17
        },
        {
            "token": "ông",
            "start_offset": 93,
            "end_offset": 96,
            "type": "<PHRASE>",
            "position": 18
        },
        {
            "token": "Linh",
            "start_offset": 97,
            "end_offset": 101,
            "type": "<PHRASE>",
            "position": 19
        },
        {
            "token": "mà",
            "start_offset": 102,
            "end_offset": 104,
            "type": "<PHRASE>",
            "position": 20
        },
        {
            "token": "điều động",
            "start_offset": 105,
            "end_offset": 114,
            "type": "<PHRASE>",
            "position": 21
        },
        {
            "token": "lực lượng",
            "start_offset": 115,
            "end_offset": 124,
            "type": "<PHRASE>",
            "position": 22
        },
        {
            "token": "an ninh",
            "start_offset": 125,
            "end_offset": 132,
            "type": "<PHRASE>",
            "position": 23
        },
        {
            "token": "hùng hậu",
            "start_offset": 133,
            "end_offset": 141,
            "type": "<PHRASE>",
            "position": 24
        },
        {
            "token": "như vậy",
            "start_offset": 142,
            "end_offset": 149,
            "type": "<PHRASE>",
            "position": 25
        },
        {
            "token": "là",
            "start_offset": 150,
            "end_offset": 152,
            "type": "<PHRASE>",
            "position": 26
        },
        {
            "token": "không",
            "start_offset": 153,
            "end_offset": 158,
            "type": "<PHRASE>",
            "position": 27
        },
        {
            "token": "đáng",
            "start_offset": 159,
            "end_offset": 163,
            "type": "<PHRASE>",
            "position": 28
        }
    ]
}

Thank you for your help.

duydo · 2019-08-26T14:50:23Z

@sylvae @tienthanh2509 this bug has been fixed in the new release, v7.3.1.

duydo added the bug label Aug 2, 2019

duydo closed this as completed Aug 26, 2019
