Tenant Collection Kafka Topics

Summary

With this RFC, Kafka topics can be created either for a collection of tenants or for each individual tenant, configurable per deployed module instance. Using a collection of tenants will reduce the number of Kafka topics and, subsequently, the number of partitions.

Motivation

Before this RFC, FOLIO created Kafka topics and partitions on a per-tenant basis for each application concern. This has scaling and cost implications as the number of tenants in a FOLIO cluster increases.

  • Performance: While Kafka, like a database with its rows/tables, has no theoretical limit on partition count, a major difference is that a partition is far "heavier" than a row or table. Partitions must be rebalanced and replicated, and brokers keep open server files for each partition, so partitions carry an ongoing administrative cost while online. Creating partitions as freely as database rows/tables will spell disaster for an under-provisioned Kafka installation, which is why cloud providers enforce limits on partition counts. Provisioning a Kafka installation to support such a high partition count would be wasteful for FOLIO hosting providers.

    On the client side, a buffer is created in memory for each partition, so a producer that interacts with many thousands of partitions consumes more memory. To subscribe to partitions with a regex subscription pattern, Kafka sends the full list of topics to each consumer, which then filters the list to determine which topics match the pattern. Producing that full list requires consensus from all brokers, and therefore overhead, and consumers re-request it at regular intervals to catch newly added topics.

  • Cost: Kafka's use in FOLIO is still limited, but more FOLIO modules will continue to employ event-driven techniques, causing an explosion of partitions. Hosting providers that self-host Kafka will need more brokers or broker capacity to service the load generated by the high number of partitions. For hosting providers who utilize cloud-managed Kafka, cloud providers typically charge by partition or set partition-count limits for a particular broker size. For example, Confluent Cloud charges $13/partition/year with a maximum of 4096 partitions per cluster, and AWS charges $3679/year for a broker instance with 4 vCPUs and 16 GB of memory and a soft limit of 1000 partitions per broker. This does not include the extra brokers in other availability zones needed to support high availability.

Out of Scope

  • Enabling tenant separation within a module. Tenant separation means answering "How do we prevent a tenant's data from being processed in place of another's?" and "How can we prevent a module (specific to a tenant) from seeing data from other tenants?"
  • Moving forward with the Temporary Kafka Security Solution. More details are included in the Related Concerns section of this document.
  • Countering module multi-versioning, i.e. more than one version of a module being installed in FOLIO with tenants able to target specific versions of a module. Module multi-versioning is supported by OKAPI & mod-pubsub but not by modules that interact directly via Kafka. This is an existing deficiency that predates the changes documented by this RFC. More details are included in the Related Concerns section of this document.

Detailed Explanation/Design

For a module instance the KAFKA_PRODUCER_TENANT_COLLECTION environment variable

  • is set to configure Kafka topics for a tenant collection, or
  • is unset to configure Kafka topics for each tenant. The naming scheme for Kafka topics is like so:

[environment].[namespace].[tenant].[eventtype]

[tenant] is either a tenant collection like ALL, or a tenant id like diku.

An example: folio.Default.ALL.DI_COMPLETED or folio.Default.diku.DI_COMPLETED.

Module Configuration

The tenant collection option involves creating another set of topics, similar to those of any other tenant. They will accept messages from a list of tenants in the FOLIO cluster and will be named using the same scheme as regular tenant topics. The name of a tenant collection must match [A-Z][A-Z0-9]{0,30}.

A FOLIO module will check for the existence of an environment variable called KAFKA_PRODUCER_TENANT_COLLECTION. With this variable unset, the module will produce messages to topics for each tenant. If the variable is set to a value like ALL, the module will produce messages to one topic belonging to the "ALL tenant collection" e.g. folio.Default.ALL.DI_COMPLETED.

In the simplest configuration, all modules use the ALL tenant collection, which contains all tenants.

Advanced configurations run multiple instances of a module, enable each tenant on exactly one of them, and use a different tenant collection (e.g. COLLECTIONA and COLLECTIONB) for each module instance.
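
The routing behaviour described above can be sketched as follows. TopicNames is a hypothetical helper used only for illustration, not the actual folio-kafka-wrapper implementation; it assumes the KAFKA_PRODUCER_TENANT_COLLECTION value is read once at startup and passed in:

```java
// Hypothetical sketch of tenant-collection topic naming; the real
// folio-kafka-wrapper implementation may differ.
public class TopicNames {

    // Mirrors KAFKA_PRODUCER_TENANT_COLLECTION; null means "unset".
    private final String tenantCollection;

    public TopicNames(String tenantCollection) {
        // Per this RFC, collection names must match [A-Z][A-Z0-9]{0,30}.
        if (tenantCollection != null && !tenantCollection.matches("[A-Z][A-Z0-9]{0,30}")) {
            throw new IllegalArgumentException("invalid tenant collection: " + tenantCollection);
        }
        this.tenantCollection = tenantCollection;
    }

    // Builds [environment].[namespace].[tenant].[eventtype]; when a tenant
    // collection is configured, every tenant routes to the collection topic.
    public String format(String env, String nameSpace, String tenant, String eventType) {
        String tenantPart = tenantCollection != null ? tenantCollection : tenant;
        return String.join(".", env, nameSpace, tenantPart, eventType);
    }
}
```

With the variable unset, `new TopicNames(null).format("folio", "Default", "diku", "DI_COMPLETED")` yields folio.Default.diku.DI_COMPLETED; with it set to ALL, the same call yields folio.Default.ALL.DI_COMPLETED.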

Consuming

In folio-kafka-wrapper, a sample subscription pattern is defined below:

folio.Default.\w{1,}.DI_COMPLETED

where folio is the cluster name and Default is the namespace, followed by a wildcard that captures tenant ids and then the event type (DI_COMPLETED) that the topic is supposed to contain. Subscriptions are created using this interface:

public static SubscriptionDefinition createSubscriptionDefinition(String env, String nameSpace, String eventType)

With the existing subscription pattern, consumers will be able to consume from single-tenant topics and tenant collection topics alike. It also allows a steady migration to tenant collection topics for each module without worrying whether a consuming module is listening to a tenant collection topic as well. Safeguards can be added to ensure that a message's ownership can always be determined. These include:

  • Enriching messages produced by folio-kafka-wrapper with a tenant id before the message is sent.
  • Highlighting consumed messages that don't have a tenant id. Orphan messages can be logged partially or with their full contents. They can also be placed in a dead-letter queue/topic for further review.
  • Some modules can be configured to listen to only some Kafka topics if the KAFKA_EVENTS_CONSUMER_PATTERN environment variable is set. This works with tenant id and tenant collection based topics.

It is imperative that every message has an owner.
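
The dual matching behaviour of the subscription pattern can be sketched as below. SubscriptionPatternDemo is a hypothetical illustration of the pattern shown above, not the folio-kafka-wrapper code itself:

```java
import java.util.regex.Pattern;

public class SubscriptionPatternDemo {
    // Sketch of the subscription pattern from this RFC: the \w{1,}
    // wildcard in the tenant position matches single tenant ids
    // (e.g. diku) and tenant collection names (e.g. ALL) alike.
    public static Pattern subscriptionPattern(String env, String nameSpace, String eventType) {
        return Pattern.compile(env + "\\." + nameSpace + "\\.\\w{1,}\\." + eventType);
    }
}
```

Both folio.Default.diku.DI_COMPLETED and folio.Default.ALL.DI_COMPLETED match the pattern produced for ("folio", "Default", "DI_COMPLETED"), which is what allows the gradual migration described above.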

Producing

In folio-kafka-wrapper, a topic name is generated by this interface:

public static String formatTopicName(String env, String nameSpace, String tenant, String eventType)

This generates a topic name like so:

folio.Default.tenant00001.DI_COMPLETED

When the KAFKA_PRODUCER_TENANT_COLLECTION environment variable is set, every value of the tenant argument in the interface routes to the one tenant collection, e.g.

folio.Default.ALL.DI_COMPLETED

A producer must ensure the message has a tenant id as one of its headers. It is imperative that every message has an owner.
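
The ownership safeguard can be sketched as below. The header name x-okapi-tenant is an assumption for illustration; the actual header key used by folio-kafka-wrapper may differ:

```java
import java.util.Map;

public class TenantHeaderCheck {
    // Hypothetical header key; the real folio-kafka-wrapper header name may differ.
    public static final String TENANT_HEADER = "x-okapi-tenant";

    // A message "has an owner" only when a non-blank tenant id header is present.
    public static boolean hasOwner(Map<String, String> headers) {
        String tenant = headers.get(TENANT_HEADER);
        return tenant != null && !tenant.isBlank();
    }

    // Producers call this before sending; consumers can use hasOwner to
    // route ownerless messages to a dead-letter topic for review.
    public static void requireOwner(Map<String, String> headers) {
        if (!hasOwner(headers)) {
            throw new IllegalStateException("message has no tenant id header");
        }
    }
}
```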

Implementation

  • Changes will need to be made in folio-kafka-wrapper to implement the safeguards and to exhibit the new behaviour when the new environment variable is set.
  • FOLIO modules will reference the latest version of folio-kafka-wrapper.

Migration To Tenant Collection Topics

  • FOLIO modules with the latest version of folio-kafka-wrapper and KAFKA_PRODUCER_TENANT_COLLECTION set are instantiated; the new topics are created by module business logic.
  • Existing tenant topics can be dropped.

Migration to New Module Version

When multiple module versions that are incompatible regarding Kafka messages run at the same time, they must be separated. One way is setting the ENV environment variable that populates the [environment] part of the Kafka topic, for example ENV=nolana and ENV=orchid for different FOLIO flower releases, yielding topics like nolana.Default.ALL.DI_COMPLETED and orchid.Default.ALL.DI_COMPLETED.

Risks and Drawbacks

  • A burst of messages from one tenant will inhibit processing of messages from other tenants in the same tenant collection.
  • Lack of support in Kafka interactions for FOLIO's multi-versioning scheme can cause functional issues. Rate of issue occurrence does not increase with the changes detailed in this RFC. Functional issue occurrence is dependent on the breaking changes implemented in the varying versions of a module. Multiple versions of a module will be consuming from the same topic and there are no guarantees about which module version will consume a message sourced from an incompatible version.
  • Insights like "how many messages have yet to be processed for a tenant" will be harder to derive.
  • Tenant specific topic settings will not be possible.
  • The Temporary Kafka Security Solution as described will no longer be possible. FOLIO is not able to enforce tenant security and isolation within its own boundaries: the other two parts are not defined within FOLIO code, and it is up to the hosting provider to implement them. It is unlikely that this temporary solution has been fully implemented by any hosting provider, because FOLIO modules are not able to accept and manage the multiple identities needed to consume from/produce into topics locked by ACLs.
    • Before this proposal, a Kafka administrator could set ACLs to Kafka topics so that a module can listen to only those tenants the module is enabled for in Okapi. When using topics with tenant collections, the ACLs can no longer be used on a tenant basis. They can be used on a tenant collection basis, though.

Rationale and Alternatives

Here are other alternatives that were considered:

  • Merging All Topics Into A Single Topic: This would be a maintenance nightmare. Messages could be consumed in error because so many consumer groups would be present, and application code would have to filter out irrelevant messages for any particular module.
  • Have A Single Topic For Each Module Area: Module areas being Inventory, Acquisitions, Circulation etc. This shares the same pain as the previous point but to a lesser extent.
  • Reduce Partition Count For Each Topic: Partition counts should be tailored by a Kafka admin to suit the Kafka cluster involved, adjusting for load patterns and special cases that running applications might require. FOLIO has no guarantee that its configuration will remain intact after its topics are initialized.

    Reducing partition counts for each topic would have a decent impact on the problem illustrated by this document, but there would still be a linear relationship between the number of tenants in a FOLIO cluster and the number of partitions in the Kafka cluster. Additional functionality, which requires the use of more topics, means that partition counts in the Kafka cluster will grow with the addition of new tenants and with release upgrades. If only a topic's partition count is reduced rather than the topic itself removed, partition counts will have to be revisited sooner rather than later. The scalability of FOLIO would still be bound by Kafka, only to a lesser extent.

    The partition count of a topic can only be increased, never decreased, so a topic would have to be dropped and recreated, which would require some application disruption.

    This alternative is not included in this RFC because administration of a Kafka installation owned by a hosting provider is out of FOLIO's control. Additionally, setting alternate default partition counts in folio-kafka-wrapper does not require the rigor of an RFC.

Frequently Asked Questions

  • Should we ensure that FOLIO's multi-versioning scheme is applied in Kafka as well?
    • Multi-versioning is not supported with direct interaction with Kafka currently. This is out of scope for this RFC to resolve.

Related Concerns

Why Finishing The Temporary Kafka Security Solution Will Have Cascading Effects

Attempting to implement the other two parts of the Temporary Kafka Security Solution would probably force a decision to have dedicated module instances to retain the distinct credentials needed to access Kafka, thereby raising FOLIO's cost-to-host. Having modules per tenant is cost prohibitive, causes inefficient use of some resources, and overwhelms other resources, e.g. by increasing database connections for each module instance.

At the time of writing, FOLIO's own release infrastructure does not have modules per tenant or Kafka security configurations as described by the Temporary Kafka Security Solution. Another consequence would be the need for a process to create Kafka users when a tenant is installed in OKAPI.

A representation of Kafka ACLs would also need to be created so that FOLIO developers can modify the definitions of which modules can access any particular topic. These items are not defined by the temporary solution and are left for a hosting provider to figure out.

Why Supporting Module Multi-Versioning Via Topic Versioning Is Not Desirable

FOLIO's multi-versioning scheme is still a gap not covered by this change. One approach could be to create topics for each module version. This would allow specific messages to be delivered to specific versions without causing a partition count explosion when a new tenant is installed. Topics belonging to older module versions could be removed safely.

Example:

If V1 of module A exists and V1 and V2 of module B exist, each will have its own versioned topics: Module A:V1, Module B:V1, Module B:V2.

There are edge cases with this approach that would have to be fleshed out. After these edge cases are covered, I believe a complex solution would have been built:

  • How will V1 of module A know all the versions of other modules that exist, and which topic is applicable for a tenant's message?
  • How will V1 of module A know that V2 of module B has been recently installed and should use V2 topic to send a message for a specific tenant?
  • Are we saying there is a one-to-one relationship between a versioned module and a versioned topic? Is the version of the topic independent from the version of the module? Modules don't "own" topics in every case: some modules are sole producers of a topic while other topics can have multiple producers.
  • Module dependencies currently described in module descriptors inform the system of other versions that are compatible when communicating via a REST interface. Are we going to define module dependencies when communicating via events? Does that undercut the loose coupling that Kafka provides?

Some hosting providers do not employ FOLIO's multi-versioning very often. FOLIO clusters usually stay within the scope of a flower release, with an upgrade event to shift to another flower release. Nonetheless, multi-versioning is available and there are no guarantees that it will not be used. All of this work would be applied to a feature that may not be used often.

Why Supporting Module Multi-Versioning Via Message Versioning Is Not Desirable

Another approach to counter FOLIO's multi-versioning is to version the messages produced into topics. Messages would have a special header denoting the version of the message. Logic could be applied in a module to discard messages that are not compatible with it. Modules of different versions would have to be in their own consumer groups so that each version can receive a copy of a versioned message. There are edge cases with this approach to think about:

  • FOLIO modules are not tenant-aware, i.e. a module does not know which tenants it can process, or which tenants are installed in Okapi for its version. Being tenant-aware would help in discarding messages easily, since the tenant id exists as a header in the Kafka message.
  • Modules need a mechanism to allow compatibility with compatible future versions of a message without a code change. For example, V1 of module A emits a V1 message that is consumed by V1 of module B. V2 of module A is introduced, which emits a V2 message. How will V1 of module B consume the new version (which is functionally correct to consume; imagine an extra property added to the message) if it is strictly coded to accept V1? Will semantic versioning be introduced?
  • It would be prudent to maintain an event type registry that records event type versions and their schemas. It is not a rule that there must be only one producer into a topic; for example, V1 of module A and V2 of module B can both produce V1 of a message into a topic. Since modules are meant to have independent development cycles, it is important to ensure that the next versions of modules A and B do not declare the same version of a message with different schemas.
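
To illustrate the semantic-versioning question raised above, a purely hypothetical compatibility check might look like the following; this RFC does not propose it:

```java
public class MessageVersionCompat {
    // Hypothetical sketch: with semantic message versioning, a consumer
    // could accept any message sharing its major version, on the assumption
    // that minor bumps are limited to additive, backward-compatible changes.
    public static boolean isCompatible(String consumerVersion, String messageVersion) {
        String consumerMajor = consumerVersion.split("\\.")[0];
        String messageMajor = messageVersion.split("\\.")[0];
        return consumerMajor.equals(messageMajor);
    }
}
```

Under this rule, V1 of module B would accept a 1.1 message carrying an extra property but discard a 2.0 message, which is exactly the behaviour the edge cases above would need to pin down.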