
Add filter by size #1612

Merged · 5 commits · Aug 10, 2021

Conversation

IndraGunawan
Contributor

Fixes #1151

Proposed Changes

  • Add `size` filter type to filter indices based on either their primary-shard size or their total size.
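A sketch of how the new filter might appear in a Curator action file. The parameter names (`size_threshold`, `size_behavior`, `threshold_behavior`) and the GB unit reflect this PR's implementation as later documented, but verify against the released docs for your Curator version:

```yaml
filters:
- filtertype: size
  size_threshold: 1.0          # threshold in gigabytes
  size_behavior: primary       # compare primary-shard size only; "total" includes replicas
  threshold_behavior: less_than  # match indices smaller than the threshold
```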

@untergeek
Member

Thank you for the pull request. I'm curious. This is an unusual feature, at least to me. What's the use case? Why filter by size? I can see some value in an edge case where I have a ton of tiny indices and I have re-indexed everything to a much larger index and want to purge the tiny indices. But outside of an edge case, I (me personally) can't see where this is useful. That doesn't mean I won't merge this PR, I'm just curious.

@IndraGunawan
Contributor Author

IndraGunawan commented Jul 22, 2021

Here is my use case. I'm using daily indices to store logs, and I set `number_of_shards` in the index template to a fixed value. On some days there are not many logs, so we can shrink the previous day's (H+1) indices whose size is below N GB down to a single shard. This helps reduce the number of shards that store only small amounts of data.

ES7's shard limit defaults to 1,000 per node (I know this value can be changed).

TIP: Small shards result in small segments, which increases overhead. Aim to keep the average shard size between at least a few GB and a few tens of GB. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size. (https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster)
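For reference, the per-node shard limit mentioned above is controlled by a cluster setting. My understanding is that it is a dynamic setting (adjustable via the `_cluster/settings` API); the default shown is the ES7 value:

```yaml
# elasticsearch.yml equivalent of the ES7 default (1,000 shards per data node);
# also settable dynamically through the _cluster/settings API
cluster.max_shards_per_node: 1000
```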

@untergeek
Member

Clever! I get it now! That's a clever way to address your need.

So, to follow up on my curiosity: why not use rollover indices and just let them grow to a size? Is there a compelling reason to keep strictly daily indices rather than letting them fill shards until they're between 20GB and 40GB?

@IndraGunawan
Contributor Author

IndraGunawan commented Jul 22, 2021

It's related to data retention. For example, we want to keep the last month of indices in the hot tier. After one month plus one day, we take a daily snapshot of each index, say `production-logs-20210622`, put it in a repository, and a snapshot retention policy removes snapshots older than one year.

Someday we may want to restore a snapshot (or specific indices) for a specific date, say 2021/05/01, so we restore the `production-logs-20210501` snapshot. By keeping daily indices we can be sure the data will be there, because each snapshot is filtered to that specific date's indices. I don't know how to do that with rollover indices, or maybe I'm missing something?

By adding this capability we can reduce the number of shards in hot-tier storage.
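Putting the pieces of this use case together, a Curator action file could combine the new size filter with the existing shrink action. This is a sketch under stated assumptions: the index prefix `production-logs-`, the 10 GB threshold, and the one-day age are illustrative values, and the shrink options shown (`shrink_node`, `number_of_shards`, `delete_after`) should be checked against the Curator shrink-action docs:

```yaml
actions:
  1:
    action: shrink
    description: >-
      Shrink yesterday's small daily log indices down to a single shard.
    options:
      shrink_node: DETERMINISTIC   # let Curator pick the target node
      number_of_shards: 1
      delete_after: True           # remove the source index once shrunk
    filters:
    - filtertype: pattern
      kind: prefix
      value: production-logs-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y%m%d'
      unit: days
      unit_count: 1
    - filtertype: size             # the filter added by this PR
      size_threshold: 10.0         # gigabytes
      size_behavior: primary
      threshold_behavior: less_than
```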

@IndraGunawan IndraGunawan changed the title Filter by size Add filter by size Jul 22, 2021
@IndraGunawan
Contributor Author

@untergeek any updates on this?

@untergeek
Member

Apologies. I had to be away from work for an unexpected funeral and an expected wedding. I should hopefully be able to address this shortly.

@untergeek untergeek merged commit c99abbf into elastic:master Aug 10, 2021
@IndraGunawan IndraGunawan deleted the filter_by_size branch August 10, 2021 18:02
TinLe pushed a commit to TinLe/curator that referenced this pull request Nov 16, 2021
* add size filtertype

* fix SyntaxWarning, revert untouched file

* fix wrong dictionary key

* add tests

* fix tests
untergeek added a commit that referenced this pull request Jan 31, 2023
7.x branch updates

  - This is a simplified release for `pip` and Docker only. It only works
    with Elasticsearch 7.x and is functionally identical to 5.8.4

  - Curator is now version locked. Curator v7.x will only work with Elasticsearch v7.x
  - Going forward, Curator will only be released as a tarball via GitHub, as an `sdist` or
    `wheel` via `pip` on PyPI, and to Docker Hub. There will no longer be RPM, DEB, or Windows
    ZIP releases. I am sorry if this is inconvenient, but one of the reasons the development and
    release cycle was delayed so long is because of how painfully difficult it was to do releases.
  - Curator will only work with Python 3.8+, and will more tightly follow the Python version releases.

  - Python 3.11.1 is fully supported, and all versions of Python 3.8+ should be fully supported.
  - Use `hatch` and `hatchling` for package building & publishing
  - Because of `hatch` and `pyproject.toml`, the release version still only needs to be tracked
    in `curator/_version.py`.
  - Maintain the barest `setup.py` for building a binary version of Curator for Docker using
    `cx_Freeze`.
  - Remove `setup.cfg`, `requirements.txt`, `MANIFEST.in`, and other files as functionality
    is now handled by `pyproject.toml` and doing `pip install .` to grab dependencies and
    install them. YAY! Only one place to track dependencies now!!!
  - Preliminarily updated the docs.
  - Migrate towards `pytest` and away from `nose` tests.
  - Scripts provided now that aid in producing and destroying Docker containers for testing. See
    `docker_test/scripts/create.sh`. To spin up a numbered version release of Elasticsearch, run
    `docker_test/scripts/create.sh 7.17.8`. It will download any necessary images, launch them,
    and tell you when it's ready, as well as provide `REMOTE_ES_SERVER` environment variables for
    testing the `reindex` action, e.g.
    `REMOTE_ES_SERVER="172.16.0.1:9201" pytest --cov=curator`. These tests are skipped
    if this value is not provided. To clean up afterwards, run `docker_test/scripts/destroy.sh`
  - Add filter by size feature. #1612 (IndraGunawan)
  - Update Elasticsearch client to 7.17.8
Successfully merging this pull request may close these issues.

[FEATURE] Add per index_space based filtertype