Propose a lightweight model for the message consumer service #19

Closed
houshengbo opened this issue Oct 19, 2016 · 4 comments

houshengbo commented Oct 19, 2016

OpenWhisk manages different components in different Docker containers, each of which is dedicated to hosting one service that handles the workload for all users. The current in-progress package messaging (Kafka) service, designed on the same model, launches one Docker container to manage all the consumers and triggers created by all the users. The major pitfall of this design is that it confines all the resources to one container, which leads directly to a single point of failure and a lack of scalability.

Digging into the code a little, we can see that an internal hashmap/dict is created to hold all the consumers and their mappings to the triggers, and each new consumer/trigger also creates a new record in the database for consistency and for recovery if the service needs to be restarted.
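To make this concrete, here is a minimal, purely illustrative sketch of that centralized registry model, assuming a Python dict keyed by topic and one database record per trigger (the class, method names, and record fields are hypothetical, not the actual implementation):

```python
# Purely illustrative sketch of the centralized registry model; the class,
# method names, and record fields are hypothetical, not the real code.
class ConsumerRegistry:
    def __init__(self, database):
        self.database = database
        # One shared in-memory map for all users:
        # topic name -> list of trigger records to fire for that topic.
        self.triggers_by_topic = {}

    def register(self, topic, trigger_record):
        # Every new consumer/trigger grows the in-memory map and also
        # writes a database record so it can be recovered after a restart.
        self.triggers_by_topic.setdefault(topic, []).append(trigger_record)
        self.database.save(trigger_record)

    def restore(self):
        # On restart, every trigger for every user has to be reloaded;
        # if the database is lost, the whole mapping is lost with it.
        self.triggers_by_topic.clear()
        for record in self.database.load_all():
            self.triggers_by_topic.setdefault(record["topic"], []).append(record)
```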

The following issues are predictable with the current design:

  • The container becomes a single point of failure. If this service container crashes, all the services are shut down. No service HA can be ensured.
  • If we have to restart the service container, we have to reload everything saved in the database to bring the service back online, e.g. each time the package messaging (Kafka) service restarts, all the triggers for every user have to be reloaded and reinitialized. What if the database information is lost? We lose all the information about the consumers and triggers. No service HA can be ensured.
  • No matter how much resource we allocate to this container, it is finite, so the container will be exhausted if too many triggers are created. We have already run into this issue badly: OpenWhisk developers have been asked to determine the maximum number of triggers that can be created with a given amount of resources. We are running a service with a hard cap, which is not a model suited to the cloud.

If the number of triggers exceeds this cap, what can we do without changing the implementation? A few options are possible, but none is ideal at this moment.

  • We can add a new consumer service to listen to Message Hub. The drawback is that we only have a limited set of topics to listen to for the Kafka messages. Even if we add one more consumer, we cannot make sure that the triggers beyond the first 1200, which are registered in the second consumer, pick up the message, because both consumers listen to the same topic. Also, each consumer service maintains a separate and different internal map, and each is still a single point of failure: if we lose one of them, we lose all the information in its internal map. Third, how can we make sure a message is consumed by the consumer service that holds the internal map we need? If the message needs to be consumed by trigger A, how do we know it will go to the consumer service with the mapping information for trigger A? Very difficult. We could send the message to every consumer service, but that wastes resources.
  • We can add a new Message Hub service. In this case, we end up managing multiple messaging services, and it becomes very complex to track which trigger/consumer maps to which Message Hub.

We can see that the in-memory registry map is really the issue: it limits the number of triggers we are able to handle and makes the service rather difficult to scale.

In order to resolve the above issues, we have to reconsider the design model for the message consumer service. I would like to propose a more lightweight model, in which the consumer service only takes care of listening to the messaging service, picking up messages from the specific topic and consuming them. We are able to launch additional consumer services in HA mode, since they all have the same responsibilities.

Where do we get the information about the trigger to be fired?
From OpenWhisk's perspective, we are already able to create the actions and triggers and associate them with a rule, so the only thing the consumer service still needs is the information about the trigger. Instead of maintaining the relationship between the consumer and the trigger in an internal map/dict and the database, we can offload the trigger information to the provider service. When the provider discovers a change from the event source, it publishes a message to the messaging service containing the trigger information, e.g. the trigger ID, trigger URI, etc., along with the credentials and the payload for the final action. When the consumer service receives it, it simply fires the trigger by its URI with the payload.
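A minimal sketch of what such a stateless consumer could look like, assuming the kafka-python and requests libraries and placeholder field names (the real message format is the subject of #20):

```python
# Minimal sketch of a stateless consumer; the topic name, group id, and
# message field names are placeholders, not the final design from #20.
import json

import requests
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "openwhisk-events",                   # hypothetical topic
    bootstrap_servers=["kafka:9092"],
    group_id="trigger-firing-consumers",  # identical consumers share the work
)

for record in consumer:
    message = json.loads(record.value.decode("utf-8"))
    # Everything needed to fire the trigger travels inside the message,
    # so the consumer keeps no internal map and touches no database.
    response = requests.post(
        message["trigger_uri"],
        json=message["payload"],
        auth=(message["api_key"], message["api_secret"]),
    )
    response.raise_for_status()
```

Because every consumer instance is identical and stateless, adding capacity is just a matter of starting more of them in the same consumer group.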

In this model, the consumer service is tailored down to a lightweight module that listens to the topic, receives the message, and fires the trigger with the payload and credentials, since all the information is prepared by the provider service and available in the message. There is no database involved and no state to be saved. The service is able to scale, since we can launch as many consumer services as we need, and as long as at least one consumer service is running, availability is guaranteed.

The only thing we need to define is the message format consumed by the consumer service. I will describe the message format in detail in the design of the provider service: #20.
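Purely as a placeholder until #20 is finalized, the provider side might publish a self-contained message along these lines (topic, field names, and URIs are illustrative only):

```python
# Placeholder illustration of the provider publishing a self-contained
# message; topic, field names, and URIs are illustrative only.
import json

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=["kafka:9092"])

event_message = {
    "trigger_id": "guest/my-trigger",
    "trigger_uri": "https://openwhisk.example.com/api/v1/namespaces/guest/triggers/my-trigger",
    "api_key": "user",
    "api_secret": "secret",
    "payload": {"change": "row inserted", "table": "orders"},
}

producer.send("openwhisk-events", json.dumps(event_message).encode("utf-8"))
producer.flush()
```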

houshengbo changed the title from "Propose a lightweight model for the message consumer" to "Propose a lightweight model for the message consumer service" on Oct 19, 2016
@mrutkows (Contributor)

Do we really want to propagate the Trigger ID and responsibility outside the domain of OpenWhisk? After all, the major goal of an event-driven paradigm is "fire and forget", not "fire and keep track of triggers". This seems like offloading the (retention, storage, recovery) problem, plus it creates a need to expose the Trigger as a top-level interaction, whereas we really only want the ecosystem to focus on event generation and Actions.

mrutkows (Contributor) commented Oct 19, 2016

To me, a trigger is essentially (from a functional perspective) a name that maps to a URI (effectively a DNS for Serverless); its job is to maintain that relationship so others do not have to. Saying "we can offload the information of the trigger to the provider service" to me says we expect provider services to provide pub/sub/registration, which is the ideal future, but not true in the majority of cases today. I am looking forward to seeing your design in issue #20...

@jberstler

I agree @mrutkows. Requiring the data providers to do OpenWhisk bookkeeping for us will preclude a lot of integration scenarios where the OpenWhisk user simply wants to fire triggers based on data that is already being produced from a variety of sources. This is really the bread and butter scenario for Kafka where any number of disparate entities produce messages without knowing or caring who will consume them.

@houshengbo (Author)

I have updated the design of the provider service in #20. The provider service will provide different kinds of templates, which listen for changes from different event sources. Each user can choose which service is their event source and launch an instance of the provider with the specific trigger information. One instance of the provider service serves only one user, so we do not need to register all the trigger mappings on the provider side either.
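For illustration only, launching a per-user provider instance could amount to handing it a small configuration like the following (every field name here is hypothetical; the actual interface belongs to the #20 design):

```python
# Hypothetical per-user provider configuration; all field names and values
# are illustrative, the real interface is described in #20.
provider_config = {
    "event_source": "cloudant",                  # template chosen by the user
    "event_source_credentials": {"url": "https://user.cloudant.example.com"},
    "kafka_topic": "openwhisk-events",
    "trigger_uri": "https://openwhisk.example.com/api/v1/namespaces/guest/triggers/my-trigger",
    "api_key": "user",
    "api_secret": "secret",
}
# One provider instance is launched with this configuration and serves only
# this user, so no global trigger-mapping registry is kept on the provider side.
```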

I am somewhat concerned about the current in-memory consumer and trigger registry, which is also a common model among the OpenWhisk service providers, such as the Cloudant and alarm providers. We end up hitting the memory cap by creating too many triggers or consumers, losing service availability if the single service suffers an outage, scaling vertically only with no way to add horizontal compute resources such as containers or VMs, and breaking everything if the in-memory map or the database records are corrupted. Any one of these issues can cause trouble for a service deployed in the cloud, not to mention all of them existing at the same time.
