Possible to index duplicate documents with same id and routing id. #31976

Closed
kylelyk opened this issue Jul 11, 2018 · 14 comments

Comments


@kylelyk kylelyk commented Jul 11, 2018

Elasticsearch version: 6.2.4

Plugins installed: []

JVM version: 1.8.0_172

OS version: MacOS (Darwin Kernel Version 15.6.0)

Description of the problem including expected versus actual behavior:
Over the past few months, we've been seeing completely identical documents pop up which have the same id, type and routing id. We're using custom routing to get parent-child joins working correctly and we make sure to delete the existing documents when re-indexing them to avoid two copies of the same document on the same shard. We use Bulk Index API calls to delete and index the documents. The indexTime field below is set by the service that indexes the document into ES and as you can see, the documents were indexed about 1 second apart from each other. This problem only seems to happen on our production server which has more traffic and 1 read replica, and it's only ever 2 documents that are duplicated on what I believe to be a single shard.

The problem can be fixed by deleting the existing documents with that id and re-indexing the document, which is odd since that is exactly what the indexing service does in the first place.
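For readers who want to picture the delete-then-index pair described above, a bulk body for it could be assembled roughly as follows. This is only a sketch: the helper name `build_reindex_bulk_body` is hypothetical, the values come from the search output in this report, and the `routing` key in the action metadata is an assumption about the 6.x bulk format (older releases spell it `_routing`).

```python
import json


def build_reindex_bulk_body(index, doc_type, doc_id, routing, source):
    """Return an NDJSON bulk body that deletes a document and then indexes it again.

    Hypothetical helper; the metadata keys are an assumption about the 6.x bulk API.
    """
    meta = {"_index": index, "_type": doc_type, "_id": doc_id, "routing": routing}
    lines = [
        json.dumps({"delete": meta}),  # action line: delete the existing copy
        json.dumps({"index": meta}),   # action line: index the new copy
        json.dumps(source),            # source line that belongs to the index action
    ]
    # The bulk API expects newline-delimited JSON with a trailing newline.
    return "\n".join(lines) + "\n"


doc_id = "746004ff8168bbe5672605fad34704a5"
body = build_reindex_bulk_body(
    "my-index", "ce", doc_id, doc_id, {"indexTime": 1531249623788}
)
print(body)
```

Both actions carry the same id and routing value, so they are expected to land on the same shard.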

Queries:
GET /my-index/_search

{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "id",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}

Response:

{
  "took": 2588,
  "timed_out": false,
  "_shards": {
    "total": 4,
    "successful": 4,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 15430904,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "duplicateCount": {
      "doc_count_error_upper_bound": 4,
      "sum_other_doc_count": 15430801,
      "buckets": [
        {
          "key": "746004ff8168bbe5672605fad34704a5",
          "doc_count": 2,
          "duplicateDocuments": {
            "hits": {
              "total": 2,
              "max_score": 1,
              "hits": [
                {
                  "_index": "my-index",
                  "_type": "ce",
                  "_id": "746004ff8168bbe5672605fad34704a5",
                  "_score": 1,
                  "_routing": "746004ff8168bbe5672605fad34704a5",
                  "_source": {
                    "indexTime": 1531249623788
                  }
                },
                {
                  "_index": "my-index",
                  "_type": "ce",
                  "_id": "746004ff8168bbe5672605fad34704a5",
                  "_score": 1,
                  "_routing": "746004ff8168bbe5672605fad34704a5",
                  "_source": {
                    "indexTime": 1531249622605
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}
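
A response like the one above can be summarised with a few lines of client-side code. The sketch below assumes the aggregation names used in the query (`duplicateCount`, `duplicateDocuments`); the function names are hypothetical, and the response is trimmed to the fields the code reads. Note that terms aggregation counts can be approximate across shards, so this is a way to find candidates, not a proof.

```python
def duplicate_ids(response, agg_name="duplicateCount"):
    """Map each duplicated id to the number of copies reported by the terms aggregation."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets if b["doc_count"] >= 2}


def index_time_gap_ms(bucket, hits_name="duplicateDocuments"):
    """Difference between the newest and oldest indexTime among a bucket's top hits."""
    times = [h["_source"]["indexTime"] for h in bucket[hits_name]["hits"]["hits"]]
    return max(times) - min(times)


# Trimmed copy of the response shown above.
response = {
    "aggregations": {
        "duplicateCount": {
            "buckets": [
                {
                    "key": "746004ff8168bbe5672605fad34704a5",
                    "doc_count": 2,
                    "duplicateDocuments": {
                        "hits": {
                            "hits": [
                                {"_source": {"indexTime": 1531249623788}},
                                {"_source": {"indexTime": 1531249622605}},
                            ]
                        }
                    },
                }
            ]
        }
    }
}

print(duplicate_ids(response))
bucket = response["aggregations"]["duplicateCount"]["buckets"][0]
print(index_time_gap_ms(bucket))  # 1183, i.e. about 1 second apart
```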
Author

@kylelyk kylelyk commented Jul 11, 2018

The description of this problem seems similar to #10511; however, I have double-checked that all of the documents are of the type "ce".

Collaborator

@elasticmachine elasticmachine commented Jul 11, 2018

Contributor

@ywelsch ywelsch commented Jul 12, 2018

@kylelyk Can you provide more info on the bulk indexing process? Are you setting the routing value on the bulk request? Are you using auto-generated IDs?
In order to check that these documents are indeed on the same shard, can you do the search again, this time using a preference (_shards:0, and then check with _shards:1 etc.), see https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-preference.html
Are these duplicates only showing when you hit the primary or the replica shards? Can you try the search with preference _primary, and then again using preference _replica. Thanks.
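
The per-shard and per-copy checks suggested here amount to repeating the same search with different `preference` values. A minimal sketch of generating those request paths, assuming 4 shards (as `_shards.total` in the response above suggests) and a hypothetical helper name; `_primary` and `_replica` are the 6.x-era values named in this comment and may not be accepted by other major versions:

```python
from urllib.parse import urlencode


def preference_search_paths(index, num_shards):
    """Build one _search path per shard, plus one each for primaries and replicas."""
    prefs = [f"_shards:{n}" for n in range(num_shards)] + ["_primary", "_replica"]
    # safe=":" keeps the colon readable instead of percent-encoding it.
    return [f"/{index}/_search?{urlencode({'preference': p}, safe=':')}" for p in prefs]


paths = preference_search_paths("my-index", 4)
for path in paths:
    print("GET", path)
```

Running the duplicate query against each path shows which shard, and which copy of it, holds the extra document.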

Author

@kylelyk kylelyk commented Jul 12, 2018

We are using routing values for each document indexed during a bulk request and we are using external GUIDs from a DB for the id.

Searching using the preferences you specified, I can see that there are two documents on shard 1 primary with same id, type, and routing id, and 1 document on shard 1 replica.

Member

@dnhatn dnhatn commented Jul 12, 2018

> I can see that there are two documents on shard 1 primary with same id, type, and routing id, and 1 document on shard 1 replica.

Did you mean the duplicate occurs on the primary? Can you also provide the _version number of these documents (on both primary and replica)? Thank you!

Author

@kylelyk kylelyk commented Jul 12, 2018

Yes, the duplicate occurs on the primary shard. When I try to search using _version as documented here, I get two documents with version 60 and 59. I get 1 document when I then specify preference=_shards:X, where X is any shard number. Maybe _version doesn't play well with preferences?

Member

@dnhatn dnhatn commented Jul 12, 2018

@kylelyk Thanks a lot for the info. Which version type did you use for these documents?

Author

@kylelyk kylelyk commented Jul 12, 2018

I am not using any kind of versioning when indexing so the default should be no version checking and automatic version incrementing.

Member

@dnhatn dnhatn commented Jul 13, 2018

> We use Bulk Index API calls to delete and index the documents.

@kylelyk You don't have to delete a document before reindexing it. However, can you confirm whether you always send a delete and an index in the same bulk request when updating documents, or only sometimes? Thank you!

Member

@dnhatn dnhatn commented Jul 13, 2018

@ywelsch found that this issue is related to and fixed by #29619.

Given the way we deleted/updated these documents and their versions, this issue can be explained as follows:

  1. Suppose we have a document with version 57

  2. A bulk of delete and reindex will remove index-57, increase the version to 58 (for the delete operation), then put a new doc with version 59. If the safe-access flag is flipped (due to a concurrent refresh) while the engine is placing index-59 into the version map, the engine won't put that index entry into the version map, but it will also leave the delete-58 tombstone there. The delete-58 tombstone is now stale because the latest version of that document is index-59.

  3. Another bulk of delete and reindex will increase the version to 59 (for the delete) but won't remove the doc from Lucene because of the existing (stale) delete-58 tombstone. The index operation will then append a new document (version 60) to Lucene instead of overwriting the existing one. At this point, we have two documents with the same id.

Our formal model uncovered this problem and we already fixed this in 6.3.0 by #29619.
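
The three steps above can be replayed with a toy model. To be clear, this is not Elasticsearch engine code: `ToyEngine`, its methods and the `lose_map_entry` flag are hypothetical names, and the model reduces the engine to a list of live documents plus a version map. It only illustrates how a stale tombstone leads to an append instead of an overwrite.

```python
class ToyEngine:
    """Toy model of one shard: a Lucene-like doc list plus a version map."""

    def __init__(self, docs):
        self.lucene = list(docs)  # live (doc_id, version) pairs
        self.version_map = {}     # doc_id -> (version, is_tombstone)

    def _lookup(self, doc_id):
        # The version map is consulted first; the doc list is only a fallback.
        if doc_id in self.version_map:
            return self.version_map[doc_id]
        versions = [v for d, v in self.lucene if d == doc_id]
        return (max(versions), False) if versions else (0, True)

    def _remove(self, doc_id):
        self.lucene = [(d, v) for d, v in self.lucene if d != doc_id]

    def delete(self, doc_id):
        version, is_tombstone = self._lookup(doc_id)
        if not is_tombstone:  # a tombstone means "nothing left to remove"
            self._remove(doc_id)
        self.version_map[doc_id] = (version + 1, True)

    def index(self, doc_id, lose_map_entry=False):
        version, is_tombstone = self._lookup(doc_id)
        if not is_tombstone:  # live doc: overwrite it
            self._remove(doc_id)
        self.lucene.append((doc_id, version + 1))  # after a tombstone: plain append
        if not lose_map_entry:  # True simulates the refresh race from step 2
            self.version_map[doc_id] = (version + 1, False)


def run(lose_map_entry):
    engine = ToyEngine([("doc", 57)])    # step 1: document at version 57
    engine.delete("doc")                 # step 2: delete-58 ...
    engine.index("doc", lose_map_entry)  # ... then index-59
    engine.delete("doc")                 # step 3: second bulk, delete
    engine.index("doc")                  # step 3: second bulk, index
    return engine.lucene


print(run(lose_map_entry=True))   # [('doc', 59), ('doc', 60)]
print(run(lose_map_entry=False))  # [('doc', 61)]
```

With the map entry lost, the model ends up with copies at versions 59 and 60, which matches the versions reported earlier in this thread. Without the race, each delete sees the live document and only one copy survives.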

@kylelyk I really appreciate your helpfulness here.

Contributor

@ywelsch ywelsch commented Jul 13, 2018

@kylelyk can you update to the latest ES version (6.3.1 as of this reply) and check if this still happens?

Author

@kylelyk kylelyk commented Jul 13, 2018

Unfortunately, we're using the AWS hosted version of Elasticsearch so it might take some time for Amazon to update it to 6.3.x. I'll close this issue and re-open it if the problem persists after the update.


@HJK181 HJK181 commented May 13, 2019

@ywelsch I'm having the same issue which I can reproduce with the following commands:

PUT test/_doc/q_38d9770d8c80f2f3b8e3bb0b93d30b5f372cf5bf?routing=c_76d2af16e8b01927f16e6cbd0d91caac3b87cbdd&op_type=create
{
    "query": "forestier",
    "joinType": {
      "name": "q",
      "parent": "c_76d2af16e8b01927f16e6cbd0d91caac3b87cbdd"
    }
}

Followed by:

PUT test/_doc/q_38d9770d8c80f2f3b8e3bb0b93d30b5f372cf5bf?routing=c_38d9770d8c80f2f3b8e3bb0b93d30b5f372cf5bff&op_type=create
{
    "query": "forestier",
    "joinType": {
      "name": "q",
      "parent": "c_38d9770d8c80f2f3b8e3bb0b93d30b5f372cf5bf"
    }
}

Which will result in:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "q_38d9770d8c80f2f3b8e3bb0b93d30b5f372cf5bf",
        "_score": 1,
        "_routing": "c_76d2af16e8b01927f16e6cbd0d91caac3b87cbdd",
        "_source": {
          "query": "forestier",
          "joinType": {
            "name": "q",
            "parent": "c_76d2af16e8b01927f16e6cbd0d91caac3b87cbdd"
          }
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "q_38d9770d8c80f2f3b8e3bb0b93d30b5f372cf5bf",
        "_score": 1,
        "_routing": "c_38d9770d8c80f2f3b8e3bb0b93d30b5f372cf5bff",
        "_source": {
          "query": "forestier",
          "joinType": {
            "name": "q",
            "parent": "c_38d9770d8c80f2f3b8e3bb0b93d30b5f372cf5bf"
          }
        }
      }
    ]
  }
}

The same commands issued against an index without joinType do not produce duplicate documents. My template looks like:

"mappings": {
  ...
  "properties": {
    "joinType": {
      "type": "join",
      "relations": {
        "c": "q"
      }
    },
    ...
}

I'm on Elasticsearch 6.3.2.

Member

@imotov imotov commented May 13, 2019

@HJK181 you have different routing keys, so your documents most likely go to different shards. This is expected behaviour: uniqueness of an id is only enforced within a shard. If you use routing values, you need to ensure that two documents with the same id never have different routing keys. If you have any further questions or need help with elasticsearch, please don't hesitate to ask on our discussion forum.
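
Why different routing keys allow two copies of the same id can be shown with a small model. The sketch below is an illustration only: `ToyIndex` is a hypothetical class, and its default hash (`zlib.crc32`) is a stand-in rather than the hash function Elasticsearch actually uses. The point it makes is that an existence check for `op_type=create` can only look at the shard the routing value selects.

```python
import zlib


class ToyIndex:
    """Toy index: each shard tracks its own ids, so uniqueness is per shard."""

    def __init__(self, num_shards, hash_fn=None):
        self.shards = [dict() for _ in range(num_shards)]
        # Stand-in hash, NOT the one Elasticsearch uses.
        self.hash_fn = hash_fn or (lambda routing: zlib.crc32(routing.encode("utf-8")))

    def shard_for(self, routing):
        return self.hash_fn(routing) % len(self.shards)

    def create(self, doc_id, routing, source):
        shard = self.shards[self.shard_for(routing)]
        if doc_id in shard:  # only this one shard is checked
            raise ValueError("version conflict: document already exists on this shard")
        shard[doc_id] = source

    def count(self, doc_id):
        return sum(1 for shard in self.shards if doc_id in shard)


# Force the two routing values onto different shards to mirror the report.
index = ToyIndex(num_shards=2, hash_fn=lambda r: {"parent-a": 0, "parent-b": 1}[r])
index.create("q_1", "parent-a", {"query": "forestier"})
index.create("q_1", "parent-b", {"query": "forestier"})
print(index.count("q_1"))  # 2
```

Issuing the second `create` with the same routing value as the first would raise a conflict instead, because both requests would then check the same shard.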
