Description
Hi! We're re-indexing a 7GB index, and noticed that performance starts out fast and then logarithmically degrades over time. We're using `elasticsearch` v1.7.3 and `elasticsearch-py` v1.9.0.
We're following all the recommendations for increasing indexing performance, e.g. the following index settings (applied roughly as sketched below):

- `index.refresh_interval: -1`
- `index.store.throttle.type: none`
- `index.translog.flush_threshold_size: 1g`
- `index.number_of_replicas: 0`
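For reference, this is approximately how we push those settings through the Python client before the bulk load; the host and index name here are placeholders, not our real cluster details:

```python
from elasticsearch import Elasticsearch

# Placeholder connection details, for illustration only.
es = Elasticsearch(["http://client-node:9200"])

# Relax refresh, store throttling, translog flushing and replicas
# for the duration of the bulk re-index.
es.indices.put_settings(
    index="target-index",
    body={
        "index": {
            "refresh_interval": "-1",
            "store.throttle.type": "none",
            "translog.flush_threshold_size": "1g",
            "number_of_replicas": 0,
        }
    },
)
```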
Our cluster is at AWS and consists of the following:

- (5) `m4.xlarge` data nodes
- (3) `m3.medium` master nodes
- (1) `m4.large` client node
This cluster should be plenty beefy for indexing a paltry 7GB of data. The original indexing only took a couple of hours to complete, but this re-indexing has been going for nearly 24 hours and is only 70% done. And it only seems to be getting slower as time goes on. At this rate, the re-index will never finish.
We've tried various chunk sizes in the `reindex()` call (invoked roughly as shown below), but it doesn't seem to affect performance, so we're using the default of 500.
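The call itself is essentially the stock helper; the index names and host below are placeholders:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import reindex

es = Elasticsearch(["http://client-node:9200"])

# Scan the source index and bulk-index into the target; chunk_size controls
# how many docs are pulled per scroll request and sent per bulk request.
reindex(
    es,
    source_index="old-index",
    target_index="new-index",
    chunk_size=500,  # the default; other values haven't changed the behavior
)
```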
The Python script is relatively chill, while ES is blasting away at the CPU.
Any ideas on what would cause this behavior? And how to get past it? I'm suspecting that there's an issue with scan/scroll. It's almost like the client needs to seek through all the previous chunks to get to the next chunk, so everything is getting slower the further it gets. But that's just a wild guess.
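For what it's worth, my mental model of the scan/scroll flow is roughly the sketch below (a simplified illustration, not the actual helper code), which is why the slowdown surprises me: each request hands back a `scroll_id`, so the server should resume from where the previous batch ended rather than seeking past everything already read.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://client-node:9200"])  # placeholder host

# Start a scan-type scroll; the initial response carries only a scroll_id.
resp = es.search(index="old-index", search_type="scan", scroll="5m", size=500)
scroll_id = resp["_scroll_id"]

while True:
    # Each scroll call should pick up exactly where the last one left off.
    resp = es.scroll(scroll_id=scroll_id, scroll="5m")
    hits = resp["hits"]["hits"]
    if not hits:
        break
    scroll_id = resp["_scroll_id"]
    # ...bulk-index `hits` into the new index here...
```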
Fixing this is essential for completing our upgrade to ES 2.3, especially since we have indices 10x the size of this 7GB one that we will need to reindex as well. Thanks!