Possible to index duplicate documents with same id and routing id. #31976

Closed
kylelyk opened this Issue Jul 11, 2018 · 12 comments

kylelyk commented Jul 11, 2018

Elasticsearch version: 6.2.4

Plugins installed: []

JVM version: 1.8.0_172

OS version: MacOS (Darwin Kernel Version 15.6.0)

Description of the problem including expected versus actual behavior:
Over the past few months, we've been seeing completely identical documents appear that have the same id, type, and routing id. We use custom routing to get parent-child joins working correctly, and we make sure to delete the existing documents when re-indexing them, to avoid having two copies of the same document on the same shard. We use Bulk API calls to delete and index the documents.

The indexTime field below is set by the service that indexes the document into ES, and as you can see, the two copies were indexed about 1 second apart. This problem only seems to happen on our production server, which has more traffic and 1 read replica, and it is only ever 2 copies of the same document, duplicated on what I believe to be a single shard.
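For reference, the delete-then-index bulk pattern described above can be sketched as follows. This is a minimal illustration, not the poster's actual client code; the index, type, id, and routing values are taken from the example document in this thread, and the metadata keys follow the 6.x Bulk API.

```python
import json

def build_bulk_body(index, doc_type, doc_id, routing, source):
    """Build an NDJSON bulk body that deletes a document and then re-indexes
    it with an explicit routing value (metadata keys per the 6.x Bulk API)."""
    lines = [
        json.dumps({"delete": {"_index": index, "_type": doc_type,
                               "_id": doc_id, "routing": routing}}),
        json.dumps({"index": {"_index": index, "_type": doc_type,
                              "_id": doc_id, "routing": routing}}),
        json.dumps(source),  # the new document follows its index action
    ]
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

body = build_bulk_body("my-index", "ce",
                       "746004ff8168bbe5672605fad34704a5",
                       "746004ff8168bbe5672605fad34704a5",
                       {"indexTime": 1531249623788})
```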

The problem can be fixed by deleting the existing documents with that id and re-indexing them, which is strange, since that is exactly what the indexing service does in the first place.

Queries:
GET /my-index/_search

{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "id",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}

Response:
{
  "took": 2588,
  "timed_out": false,
  "_shards": {
    "total": 4,
    "successful": 4,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 15430904,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "duplicateCount": {
      "doc_count_error_upper_bound": 4,
      "sum_other_doc_count": 15430801,
      "buckets": [
        {
          "key": "746004ff8168bbe5672605fad34704a5",
          "doc_count": 2,
          "duplicateDocuments": {
            "hits": {
              "total": 2,
              "max_score": 1,
              "hits": [
                {
                  "_index": "my-index",
                  "_type": "ce",
                  "_id": "746004ff8168bbe5672605fad34704a5",
                  "_score": 1,
                  "_routing": "746004ff8168bbe5672605fad34704a5",
                  "_source": {
                    "indexTime": 1531249623788
                  }
                },
                {
                  "_index": "my-index",
                  "_type": "ce",
                  "_id": "746004ff8168bbe5672605fad34704a5",
                  "_score": 1,
                  "_routing": "746004ff8168bbe5672605fad34704a5",
                  "_source": {
                    "indexTime": 1531249622605
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}

kylelyk commented Jul 11, 2018

The description of this problem seems similar to #10511; however, I have double-checked that all of the documents are of the type "ce".


ywelsch commented Jul 12, 2018

@kylelyk Can you provide more info on the bulk indexing process? Are you setting the routing value on the bulk request? Are you using auto-generated IDs?

To check that these documents are indeed on the same shard, can you run the search again, this time using a preference (_shards:0, then _shards:1, etc.)? See https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-preference.html

Do the duplicates only show up when you hit the primary or the replica shards? Can you try the search with preference _primary, and then again with preference _replica? Thanks.
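The per-copy probes suggested here are easy to enumerate; a small sketch (the shard count of 4 comes from the search response above, and the _primary/_replica preferences are still supported on 6.2):

```python
# Enumerate the search preferences suggested above: one per shard,
# plus _primary and _replica (both valid on Elasticsearch 6.2).
def shard_preferences(num_shards):
    return [f"_shards:{n}" for n in range(num_shards)] + ["_primary", "_replica"]

# The index in this thread has 4 shards ("_shards": {"total": 4} in the
# search response), so probing every copy means six searches:
urls = [f"/my-index/_search?preference={p}" for p in shard_preferences(4)]
```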


kylelyk commented Jul 12, 2018

We are setting a routing value for each document indexed during a bulk request, and we use external GUIDs from a DB for the ids.

Searching with the preferences you specified, I can see that there are two documents on the shard 1 primary with the same id, type, and routing id, and 1 document on the shard 1 replica.


dnhatn commented Jul 12, 2018

> I can see that there are two documents on shard 1 primary with same id, type, and routing id, and 1 document on shard 1 replica.

Did you mean the duplicate occurs on the primary? Can you also provide the _version number of these documents (on both primary and replica)? Thank you!


kylelyk commented Jul 12, 2018

Yes, the duplicate occurs on the primary shard. When I search with _version as documented here, I get two documents, with versions 60 and 59. I only get 1 document when I additionally specify preference=_shards:X, where X is any shard number. Maybe _version doesn't play well with preferences?
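For reference, the version-aware search might look like the following sketch (the field name and id are taken from the aggregation earlier in the thread):

```python
# Body for GET /my-index/_search: "version": true asks Elasticsearch to
# return _version with every hit; the term query narrows the search to the
# duplicated id. Add ?preference=_shards:1 to the URL to probe one shard.
search_body = {
    "version": True,
    "query": {
        "term": {"id": "746004ff8168bbe5672605fad34704a5"}
    },
}
```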


dnhatn commented Jul 12, 2018

@kylelyk Thanks a lot for the info. Which version type did you use for these documents?


kylelyk commented Jul 12, 2018

I am not using any kind of versioning when indexing, so the defaults should apply: no version checking and automatic version incrementing.


dnhatn commented Jul 13, 2018

> We use Bulk Index API calls to delete and index the documents.

@kylelyk You don't have to delete before re-indexing a document; indexing with an existing id replaces the old copy. However, can you confirm whether you always use a bulk of delete and index when updating documents, or only sometimes? Thank you!


dnhatn commented Jul 13, 2018

@ywelsch found that this issue is related to and fixed by #29619.

Given the way we deleted/updated these documents and their versions, this issue can be explained as follows:

  1. Suppose we have a document with version 57

  2. A bulk of delete and re-index will remove the index-57 doc, increase the version to 58 (for the delete operation), then put a new doc with version 59. While the engine is placing index-59 into the version map, the safe-access flag is flipped (due to a concurrent refresh), so the engine does not put that index entry into the version map, but it also leaves the delete-58 tombstone behind. The delete-58 tombstone is now stale, because the latest operation on that document is index-59.

  3. Another bulk of delete and re-index will increase the version to 59 (for the delete) but won't remove the doc from Lucene, because of the existing (stale) delete-58 tombstone. The index operation will then append the document (version 60) to Lucene instead of overwriting it. At this point, we have two documents with the same id.

Our formal model uncovered this problem, and we have already fixed it in 6.3.0 via #29619.
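The failure mode in steps 1-3 can be illustrated with a toy model. This is a deliberately simplified sketch, not the real engine code (InternalEngine in the Elasticsearch source is far more involved); ToyEngine and its fields are invented for illustration.

```python
# Toy model: "Lucene" is a list of live (id, version) docs; the version map
# holds the latest operation per id. The bug: a concurrent refresh flips the
# safe-access flag, so a new index entry is never recorded in the version
# map, leaving an older delete tombstone behind (stale).
class ToyEngine:
    def __init__(self, doc_id, version):
        self.lucene = [(doc_id, version)]  # live (id, version) docs
        self.version_map = {}              # id -> ("index"|"delete", version)

    def _resolve_version(self, doc_id):
        # The engine consults the version map first, then falls back to Lucene.
        if doc_id in self.version_map:
            return self.version_map[doc_id][1]
        live = [v for d, v in self.lucene if d == doc_id]
        return max(live) if live else 0

    def delete(self, doc_id):
        tombstoned = self.version_map.get(doc_id, ("",))[0] == "delete"
        version = self._resolve_version(doc_id) + 1
        if not tombstoned:
            # No tombstone: actually remove the live copy from Lucene.
            self.lucene = [(d, v) for d, v in self.lucene if d != doc_id]
        # If a (possibly stale) tombstone says the doc is already gone, the
        # Lucene delete is skipped -- the heart of the bug.
        self.version_map[doc_id] = ("delete", version)

    def index(self, doc_id, refresh_races=False):
        tombstoned = self.version_map.get(doc_id, ("",))[0] == "delete"
        version = self._resolve_version(doc_id) + 1
        if not tombstoned:
            # Engine thinks a live copy may exist: overwrite (delete + add).
            self.lucene = [(d, v) for d, v in self.lucene if d != doc_id]
        self.lucene.append((doc_id, version))
        if not refresh_races:
            self.version_map[doc_id] = ("index", version)
        # else: the safe-access flag was flipped by a concurrent refresh, so
        # index-<version> is never recorded and the stale tombstone survives.

doc = "746004ff8168bbe5672605fad34704a5"
engine = ToyEngine(doc, 57)          # step 1: a doc with version 57
engine.delete(doc)                   # step 2: delete-58 tombstone
engine.index(doc, refresh_races=True)  # index-59 lost from the version map
engine.delete(doc)                   # step 3: stale tombstone, no Lucene delete
engine.index(doc)                    # appends index-60 next to index-59
# engine.lucene now holds two live copies, versions 59 and 60.
```

Without the race (refresh_races=False on the first index), the second delete removes the live copy and only one document survives.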

@kylelyk I really appreciate your helpfulness here.


ywelsch commented Jul 13, 2018

@kylelyk can you update to the latest ES version (6.3.1 as of this reply) and check if this still happens?


kylelyk commented Jul 13, 2018

Unfortunately, we're using the AWS-hosted version of Elasticsearch, so it might take some time for Amazon to update it to 6.3.x. I'll close this issue and re-open it if the problem persists after the update.
