Conversation

Member

@snazy snazy commented Sep 30, 2025

This PR provides a mechanism to assign a Polaris-cluster-wide unique node-ID to each Polaris instance, which is then used when generating Polaris-cluster-wide unique Snowflake-IDs.

The change is fundamental for the NoSQL work, but is also needed for the existing relational JDBC persistence.

This PR does not include any persistence-specific implementation.

dimas-b
dimas-b previously approved these changes Oct 2, 2025
@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Oct 2, 2025
Contributor

@flyrain flyrain left a comment

Thanks for working on it! Left some comments. Given that this is a big change (23 new files and 3 new modules), is it worth having a dev list discussion, so that people are aware of the changes and can contribute their ideas?

Some ID generation mechanisms,
like [Snowflake-IDs](https://medium.com/@jitenderkmr/demystifying-snowflake-ids-a-unique-identifier-in-distributed-computing-72796a827c9d),
require unique integer IDs for each running node. This framework provides a mechanism to assign each running node a
unique integer ID.
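
For readers wondering why a Snowflake-style generator needs a per-node integer at all, here is a minimal sketch. It assumes the classic 41/10/12-bit layout (timestamp, node ID, sequence); the class name `SnowflakeIdSketch`, the bit widths, and the explicit timestamp parameter are illustrative assumptions and are not taken from the Polaris code.

```java
/** Illustrative only: classic Snowflake bit layout, NOT the Polaris implementation. */
final class SnowflakeIdSketch {
    static final int NODE_BITS = 10;      // assumed width of the node-ID field
    static final int SEQUENCE_BITS = 12;  // assumed width of the per-millisecond counter
    static final long MAX_NODE_ID = (1L << NODE_BITS) - 1;
    static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long nodeId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    SnowflakeIdSketch(long nodeId) {
        if (nodeId < 0 || nodeId > MAX_NODE_ID) {
            throw new IllegalArgumentException("node ID out of range: " + nodeId);
        }
        this.nodeId = nodeId;
    }

    /**
     * Builds the next ID. The timestamp (millis since some custom epoch) is passed in
     * so the sketch stays deterministic; a real generator would read a clock.
     */
    synchronized long nextId(long timestampMillis) {
        if (timestampMillis < lastTimestamp) {
            throw new IllegalStateException("clock moved backwards");
        }
        if (timestampMillis == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                throw new IllegalStateException("sequence exhausted for this millisecond");
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = timestampMillis;
        // timestamp | node ID | sequence, packed into one 64-bit value
        return (timestampMillis << (NODE_BITS + SEQUENCE_BITS))
            | (nodeId << SEQUENCE_BITS)
            | sequence;
    }

    /** Extracts the node-ID field from an ID produced by this sketch. */
    static long nodeIdOf(long id) {
        return (id >>> SEQUENCE_BITS) & MAX_NODE_ID;
    }

    public static void main(String[] args) {
        SnowflakeIdSketch node1 = new SnowflakeIdSketch(1);
        SnowflakeIdSketch node2 = new SnowflakeIdSketch(2);
        // Same millisecond on both nodes: the IDs differ only because the node IDs differ.
        System.out.println(node1.nextId(1000L) != node2.nextId(1000L));
    }
}
```

Two generators that were handed the same node ID would emit identical values for the same millisecond and sequence, which is why the node ID has to be unique across the cluster.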
Contributor

If the Snowflake ID generator requires such a complex node ID generator, maybe we should consider other options. Would it be possible to use other ID generators? Since we are already in the persistence module, why can't we use something like ObjectId in MongoDB, or Java's UUID?

Contributor

The snowflake id generator is already used by the NoSQL persistence impl., of which this PR is just a sub-component.

Comment on lines +31 to +34
* `polaris-nodes-api` provides the necessary Java interfaces and immutable types.
* `polaris-nodes-impl` provides the storage agnostic implementation.
* `polaris-nodes-spi` provides the necessary interfaces to provide a storage specific implementation.
* `polaris-nodes-store-nosql` provides the storage implementation based on `polaris-persistence-nosql-api`.
Contributor

Where is the module?

Contributor

Currently it's in the end-to-end NoSQL PR: #1189 ... to be made available for review later (to allow for smaller, easier-to-review PRs, as discussed)

* specific language governing permissions and limitations
* under the License.
*/
package org.apache.polaris.nodes.api;
Contributor

I don't think anywhere else in Polaris needs this. Can we rename it to org.apache.polaris.nosql.nodes.api or org.apache.polaris.nosql.snowflakeid.nodes.api?

Contributor

It is true that this PR adds code that is meant to support Snowflake ID generators.

Proposal: Align with existing ID Gen code in main.

  • package org.apache.polaris.ids.nodes.*
  • Location: persistence/nosql/idgen/nodes/...

@snazy @flyrain @dennishuo WDYT?

Member Author

While the main API interface mentions "API to lease node IDs...", the use cases are not limited to that one. Other use cases, interesting IMHO, include getting an overview of the active processes (= nodes) of a single Polaris cluster. Adding overly specific use cases, or even specific call sites, to the package name(s) feels like restricting the use cases.

I'd prefer to keep the current package names. I'm okay with renaming the packages to org.apache.polaris.nodeleases.* or org.apache.polaris.nodeids.* though. But the whole effort isn't user-facing at all, so later renames are possible w/o the risk of breaking anything.

Contributor

+1 to nodeids

Contributor

> to get an overview of the active processes (= nodes) of a single Polaris cluster.

Hi @snazy, could you elaborate on the use cases of the node ID generator beyond the Snowflake ID generator?

Member Author

Hm, not sure I understand your citation of the "get an overview of the active processes" use case and the question about another use case.

Contributor

Hi @flyrain : the concrete use case ATM is feeding node IDs into the Snowflake ID generator. That is required for the NoSQL persistence to work end-to-end (#1189).

As a side benefit of maintaining a list of active node IDs, one can use that information to report the status of Polaris JVMs that allocate those node IDs. However, this is completely at the discretion of downstream projects that include Polaris libraries.
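
To make the "lease a node ID, then list the active ones" idea concrete, here is a rough sketch. Everything in it is hypothetical: `NodeLeaseRegistrySketch`, its method names, and the in-memory map standing in for shared storage are assumptions for illustration, not the API added by this PR. A real implementation would need an atomic, persistent store shared by all nodes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for a shared store of node-ID leases. */
final class NodeLeaseRegistrySketch {
    private final int maxNodes;
    private final long leaseMillis;
    // node ID -> lease expiry timestamp in millis; absent means never leased
    private final Map<Integer, Long> expiryByNodeId = new HashMap<>();

    NodeLeaseRegistrySketch(int maxNodes, long leaseMillis) {
        this.maxNodes = maxNodes;
        this.leaseMillis = leaseMillis;
    }

    /** Leases the lowest node ID that is free or whose lease has expired. */
    synchronized int lease(long nowMillis) {
        for (int id = 0; id < maxNodes; id++) {
            if (!isActive(id, nowMillis)) {
                expiryByNodeId.put(id, nowMillis + leaseMillis);
                return id;
            }
        }
        throw new IllegalStateException("all node IDs are currently leased");
    }

    /** Extends a still-valid lease; returns false if it has already expired. */
    synchronized boolean renew(int nodeId, long nowMillis) {
        if (!isActive(nodeId, nowMillis)) {
            return false;
        }
        expiryByNodeId.put(nodeId, nowMillis + leaseMillis);
        return true;
    }

    /** The "side benefit": a list of node IDs with unexpired leases. */
    synchronized List<Integer> activeNodeIds(long nowMillis) {
        List<Integer> active = new ArrayList<>();
        for (int id = 0; id < maxNodes; id++) {
            if (isActive(id, nowMillis)) {
                active.add(id);
            }
        }
        return active;
    }

    private boolean isActive(int nodeId, long nowMillis) {
        Long expiry = expiryByNodeId.get(nodeId);
        return expiry != null && expiry > nowMillis;
    }

    public static void main(String[] args) {
        NodeLeaseRegistrySketch registry = new NodeLeaseRegistrySketch(4, 1_000L);
        int first = registry.lease(0L);
        int second = registry.lease(0L);
        System.out.println(first + ", " + second + ", active=" + registry.activeNodeIds(500L));
    }
}
```

In this sketch a node that stops renewing simply drops out of `activeNodeIds` once its lease expires, and its ID becomes available to the next caller of `lease`.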

Contributor

> Hm, not sure I understand your citation of the "get an overview of the active processes" use case and the question about another use case.

The citation probably doesn't quite matter. I was trying to understand the node ID generator use cases beyond Snowflake IDs.

> As a side benefit of maintaining a list of active node IDs, one can use that information to report the status of Polaris JVMs that allocate those node IDs. However, this is completely at the discretion of downstream projects that include Polaris libraries.

Thanks, Dmitri! This feels more like a K8s-level concern rather than something at the application level (referring to the Polaris service). Could you shed some light on how downstream projects make use of these node IDs?

Contributor

@dimas-b dimas-b Oct 20, 2025

The only use case that I know of is the Snowflake IDs. I mentioned downstream together with "can"; I did not mean to imply that such downstream projects already exist ATM :)

Contributor

Generally I agree with @flyrain about starting with more constrained package names where possible, not because we're necessarily implying that the concepts within the package can't be useful in other use cases, but because it's best to be more "deliberate" when adopting the libraries into those other use cases, where we'll be able to better assess the suitability of which aspects constitute a stable SPI, whether there are pitfalls to document better, etc.

I do think the nodeids package name is at least an improvement over the more general nodes at this stage though, so maybe that's enough for now.

Comment on lines +31 to +33
* `polaris-nodes-api` provides the necessary Java interfaces and immutable types.
* `polaris-nodes-impl` provides the storage agnostic implementation.
* `polaris-nodes-spi` provides the necessary interfaces to provide a storage specific implementation.
Contributor

These modules are used only by the Snowflake ID generator. Can we merge them into the modules holding the Snowflake ID generators, so that the Snowflake ID generator is more consistent and self-contained?

Contributor

That is a valid point 👍 I made a specific renaming proposal in the thread about the package name (above).

snazy added 4 commits October 20, 2025 10:50
This PR provides a mechanism to assign a Polaris-cluster-wide unique node-ID to each Polaris instance, which is then used when generating Polaris-cluster-wide unique Snowflake-IDs.

The change is fundamental for the NoSQL work, but is also needed for the existing relational JDBC persistence.

This PR does not include any persistence-specific implementation.
Also move the expensive part to a `@PostConstruct` so that CDI initialization is not blocked entirely.
Contributor

@dennishuo dennishuo left a comment

I agree nodeids is an improvement over nodes as a package name, and I'm okay with moving forward with this PR as-is for now to unblock further work, though my top preference would've still been to constrain to a nosql package name initially, then if there are non-nosql use cases we can always move into a more general package name along with discussion about deeper documentation preferences as it comes.

We can maybe better come up with standard guidance within our related "SPI" discussion - to me, package names constitute some degree of "prescriptive" scoping of shared code, in contrast to the separation of compilation modules being more "descriptive" in nature. So it's more about what we're communicating to (especially, new) developers trying to find their way around the codebase than any pure technical consideration.

And in that vein it's always easier to start more constrained and make it more open as needed rather than the other way around.

The two sides of the coin for commitment to SPIs are that we can provide better stability and broad generalization of usage of core SPI packages by being selective in avoiding premature generalization.

@dimas-b dimas-b merged commit 3dd46b9 into apache:main Oct 28, 2025
20 of 23 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Oct 28, 2025
dimas-b added a commit to dimas-b/polaris that referenced this pull request Oct 28, 2025
Following up on apache#2728 this change moves "nodeids" code to the
`org.apache.polaris.nosql.nodeids` package.
Contributor

dimas-b commented Oct 28, 2025

@dennishuo @flyrain : Follow-up package rename PR: #2931

@snazy snazy deleted the nosql-nodes-1 branch October 29, 2025 08:18
snazy pushed a commit to snazy/polaris that referenced this pull request Oct 29, 2025
Following up on apache#2728 this change moves "nodeids" code to the
`org.apache.polaris.nosql.nodeids` package.

4 participants