@pswaao88

Description

This PR adds tier_preference and node.role columns to the _cat/shards API to facilitate troubleshooting of ILM allocation issues, as requested in #136895.

Implementation Details

  • tier_preference (tp): Retrieves the index.routing.allocation.include._tier_preference setting from IndexMetadata.
    • Used Metadata#findIndex(Index) instead of the deprecated index(String) or getProject() methods to safely handle index lookup in the current architecture.
  • node.role (r): Retrieves the node role abbreviation using DiscoveryNode#getRoleAbbreviationString(), ensuring consistency with the _cat/nodes API.
  • Safe Access: Implemented using getOrNull to prevent NullPointerException when shards are unassigned or metadata is missing.
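
The null-safe lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Node`, `Shard`, `IndexMeta`, and the two `*Cell` helpers are hypothetical stand-ins rather than the real Elasticsearch classes, and `getOrNull` is modeled on the call shape visible in the diff, not copied from the actual source.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-ins for the Elasticsearch types; not the real classes.
class CatShardsColumnsSketch {

    // Setting key named in the PR description.
    static final String TIER_PREFERENCE = "index.routing.allocation.include._tier_preference";

    record Node(String roleAbbreviation) {}
    record Shard(Node assignedNode) {}            // assignedNode is null for an unassigned shard
    record IndexMeta(Map<String, String> settings) {}

    // Null-safe two-step accessor: returns null as soon as any link in the chain is missing.
    static <S, T> Object getOrNull(S source, Function<S, T> accessor, Function<T, Object> mapper) {
        if (source == null) {
            return null;
        }
        T intermediate = accessor.apply(source);
        return intermediate == null ? null : mapper.apply(intermediate);
    }

    // Value for the node.role column: null when the shard or its node is missing.
    static Object nodeRoleCell(Shard shard) {
        return getOrNull(shard, Shard::assignedNode, Node::roleAbbreviation);
    }

    // Value for the tier_preference column: null when metadata or the setting is missing.
    static Object tierPreferenceCell(IndexMeta meta) {
        return getOrNull(meta, IndexMeta::settings, s -> s.get(TIER_PREFERENCE));
    }

    public static void main(String[] args) {
        System.out.println(nodeRoleCell(new Shard(new Node("dhim"))));
        System.out.println(nodeRoleCell(new Shard(null)));
        System.out.println(tierPreferenceCell(new IndexMeta(Map.of(TIER_PREFERENCE, "data_warm,data_hot"))));
        System.out.println(tierPreferenceCell(null));
    }
}
```

The point of the helper is that every column can be filled with one expression, and a missing shard, node, or index entry yields `null` instead of throwing.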

Related Issues

Closes #136895


Note to Reviewers

This is my first contribution to Elasticsearch! 🚀
As a non-native English speaker and a first-time contributor, I apologize in advance if I missed any conventions or used awkward phrasing. If there is anything I overlooked or need to improve, please let me know, and I will address it immediately. Thank you for the opportunity to contribute.

@cla-checker-service

cla-checker-service bot commented Nov 21, 2025

💚 CLA has been signed

@elasticsearchmachine added the needs:triage, v9.3.0, and external-contributor labels Nov 21, 2025
@pswaao88 force-pushed the feature/136895-cat-shards-columns branch from d559f89 to 2bf3173 on November 21, 2025 09:56
@github-actions
Contributor

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.

When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level

@pswaao88 force-pushed the feature/136895-cat-shards-columns branch from 2bf3173 to 49dcfbe on November 21, 2025 11:17
@szybia added the :Data Management/CAT APIs label and removed the needs:triage label Nov 21, 2025
@elasticsearchmachine added the Team:Data Management label Nov 21, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@pswaao88 force-pushed the feature/136895-cat-shards-columns branch from 49dcfbe to fc1aa0c on November 21, 2025 11:22
@szybia
Contributor

szybia commented Nov 21, 2025

hi @pswaao88, thank you for your interest in elasticsearch!

a few preliminary things before reviewing:

  1. please refrain from force-pushing or rebasing and use merge commits instead, if needed. rewriting history makes the changes more difficult to review and reason about
  2. regarding testing: i've run the CI for you. if there are any failures, please dig into them and investigate. whether or not there are failures, we should add some tests here that cover the changes you're making

@pswaao88
Author

Hi @szybia,

Thank you very much for running the CI and for the clear feedback. As this is my first contribution to Elasticsearch, I wasn't fully aware of these processes. Thank you for the guidance!

Regarding the preliminary items:

Git History:
I sincerely apologize for the force-pushes; they were unfortunately necessary to correct the failed CLA signature and the Changelog filename. I fully understand the policy and will use standard merging going forward to ensure a clean history.

Testing:
I am currently reviewing the CI results, and based on the feedback, I will add the required unit and integration tests. Thank you again for pointing this out.

Thank you again for your time!

Contributor

@szybia left a comment

i'll run the CI for you again, but you'll probably still get a bunch of failures because the existing tests that assert the body of the response haven't been adjusted (have a scan through all the different failures in buildkite/CI)

helpful suggestion: once you expand a job that failed, ctrl-f for "reproduce with". that will highlight all the tests that failed within that job. run the gradle command locally, and then you can start figuring out why it failed and how to fix it

table.addCell(getOrNull(commonStats, CommonStats::getSparseVectorStats, SparseVectorStats::getValueCount));

table.addCell(
Optional.ofNullable(getOrNull(
Contributor

without having a deeper look, it surprises me that we need a null check here when the other cells/fields above seem to be fine with null

mind helping me understand why this is needed? 🙏

Author

Hi @szybia,

Thank you for the guidance. As a university student who has recently started studying this field (and is new to contributing to complex systems), I view each CI failure as a valuable learning opportunity.

I've reviewed the failure logs and have made the following attempt to fix the issue:

  1. Issue Identification: I found that the widespread CI failures were related to a NullPointerException (NPE) occurring during Backward Compatibility (BWC) tests.
  2. Hypothesis: I suspect the issue is that the newly added String fields receive a raw null value from older nodes, causing the system to crash later.
  3. Attempted Solution: To fix this, I applied the Optional.ofNullable().orElse("") pattern to ensure a safe String is returned instead of null.
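
As a minimal sketch of the pattern named in step 3, assuming a hypothetical `cellText` helper (not part of the PR or of Elasticsearch), the conversion from a possibly-null value to a safe String looks like this. Whether this is the preferred convention for the project is exactly the open question in this thread; the sketch only shows what the pattern does.

```java
import java.util.Optional;

class NullToEmptySketch {

    // Hypothetical helper: maps a possibly-null cell value to a non-null String.
    static String cellText(Object rawValue) {
        return Optional.ofNullable(rawValue)   // empty Optional when rawValue is null
            .map(Object::toString)             // applied only when a value is present
            .orElse("");                       // fallback used for the null case
    }

    public static void main(String[] args) {
        System.out.println("[" + cellText("data_hot") + "]");
        System.out.println("[" + cellText(null) + "]");
    }
}
```

The standard library's `Objects.toString(rawValue, "")` gives the same result without allocating an `Optional`.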

I would be very grateful if you could confirm my understanding.

Could you please confirm if my diagnosis of the root cause and the need for a null check is fundamentally correct? Also, if my approach is missing the preferred project convention, could you kindly provide a small hint or gentle guidance on the correct direction? I want to ensure I'm adopting the best practices for future contributions. 🙏

Additionally, is it okay for me to manually trigger the tests by commenting 'buildkite test this' if they don't start automatically?

Thank you for your patience and guidance!

Contributor

Hi @pswaao88 and thanks for your contribution here.

You should be running these tests locally as part of your development process - see these docs. To be fair ./gradlew check takes a while and mostly will be running tests that are not germane to your change, but the ones that you've seen fail in CI are the important ones and you should be re-running them yourself before asking for another CI run and code review. Looking at the recent failures that means you need to make sure that at least the following command completes successfully on your local machine first:

./gradlew :server:test :rest-api-spec:yamlRestTest :qa:smoke-test-multinode:yamlRestTest

> Additionally, is it okay for me to manually trigger the tests by commenting 'buildkite test this' if they don't start automatically?

Unfortunately no, for security reasons we can't allow external contributors to trigger their own test runs in CI. We have to check there's at least nothing obviously malicious in the changes we're about to test before running anything.

Contributor

> I would be very grateful if you could confirm my understanding.

This (and more) will come out in the code review, once the tests are all passing and you've added some more tests to support your own change. Please bear in mind that our capacity for reviewing contributions like this is bounded, so please try not to exhaust your share of this capacity on relatively minor questions like this. I'd love it if we could welcome contributions of all levels but unfortunately we do not have the infinite time that this would require.

Please also carefully read this section of the contributing guide noting particularly (emphasis mine):

We sometimes reject contributions due to the low quality of the submission since low-quality submissions tend to take unreasonable effort to review properly. Quality is rather subjective so it is hard to describe exactly how to avoid this, but there are some basic steps you can take to reduce the chances of rejection. Follow the guidelines listed above when preparing your changes. You should add tests that correspond with your changes, and your PR should pass affected test suites too. It makes it much easier to review if your code is formatted correctly and does not include unnecessary extra changes.

@szybia
Contributor

szybia commented Nov 21, 2025

buildkite test this

@pswaao88
Author

Hi @DaveCTurner and @szybia,

Thank you both for the guidance and patience. As a student new to contributing, I really appreciate your help in getting the process right.

Following @DaveCTurner's instructions, I have successfully run and passed the local tests (:server:test, :rest-api-spec:yamlRestTest, :qa:smoke-test-multinode:yamlRestTest) on my machine.

I have pushed the commits that resolve the BWC NPE (using Optional for safety) and adjusted the existing tests to account for the new columns. I would appreciate it if you could take a look.

Thanks!

@szybia self-assigned this Nov 27, 2025

Labels

:Data Management/CAT APIs, external-contributor, Team:Data Management, v9.3.0

Development

Successfully merging this pull request may close these issues.

Add columns CAT Shards for _tier_preference and node.roles

4 participants