Add events integration to Incremental Entity Providers #15154

Merged

dekoding
Contributor

@dekoding dekoding commented Dec 10, 2022

Hey, I just made a Pull Request!

This pull request adds a new feature to incremental entity providers: The ability to receive events and (optionally) trigger a delta update based on an event payload. A new, optional eventHandler property exposes methods for handling events and identifying event topics that should be handled by the provider:

export interface IncrementalEntityProvider<TCursor, TContext> {
  ...
  /**
   * If set, the IncrementalEntityProvider will receive and respond to
   * events.
   *
   * This system acts as a wrapper for the Backstage events bus, and
   * requires the events backend to function. It does not provide its
   * own events backend. See {@link https://github.com/backstage/backstage/tree/master/plugins/events-backend}.
   */
  eventHandler?: {
    /**
     * This method accepts an incoming event for the provider, and
     * optionally maps the payload to an object containing a delta
     * mutation.
     *
     * If a valid delta is returned by this method, it will be ingested
     * automatically by the provider.
     */
    onEvent: (params: EventParams) =>
      | undefined
      | {
          added: DeferredEntity[];
          removed: { entityRef: string }[];
        };

    /**
     * This method returns an array of topics for the IncrementalEntityProvider
     * to respond to.
     */
    supportsEventTopics: () => string[];
  };
}

This feature is optional and should not break any existing providers.
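As a rough illustration of how the two methods could fit together, here is a minimal sketch of an engine-side helper that decides whether an incoming event yields a delta. The `EventParams`, `DeferredEntity`, and `Delta` types below are simplified local stand-ins rather than the real Backstage types, and `deltaForEvent`, `exampleHandler`, and the `example.repo-created` topic are hypothetical names that are not part of this PR.

```typescript
// Simplified local stand-ins (assumptions); the real EventParams and
// DeferredEntity types are provided by Backstage packages.
interface EventParams {
  topic: string;
  eventPayload: unknown;
  metadata?: Record<string, string | string[] | undefined>;
}
interface DeferredEntity {
  entity: { kind: string; metadata: { name: string; namespace?: string } };
  locationKey?: string;
}
type Delta = { added: DeferredEntity[]; removed: { entityRef: string }[] };

interface EventHandler {
  onEvent: (params: EventParams) => undefined | Delta;
  supportsEventTopics: () => string[];
}

// Hypothetical helper: returns the delta to ingest, or undefined when the
// provider has no handler, the topic is not one it supports, or the handler
// chose not to produce a delta for this event.
function deltaForEvent(
  handler: EventHandler | undefined,
  params: EventParams,
): Delta | undefined {
  if (!handler) return undefined;
  if (!handler.supportsEventTopics().includes(params.topic)) return undefined;
  return handler.onEvent(params);
}

// Example handler for a made-up topic and payload shape.
const exampleHandler: EventHandler = {
  supportsEventTopics: () => ['example.repo-created'],
  onEvent: params => {
    const payload = params.eventPayload as { name?: string };
    if (!payload.name) return undefined;
    return {
      added: [
        { entity: { kind: 'Component', metadata: { name: payload.name } } },
      ],
      removed: [],
    };
  },
};
```

A provider without `eventHandler` simply never produces a delta here, which is consistent with the feature being optional.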

✔️ Checklist

  • A changeset describing the change and affected packages. (more info)
  • Added or updated documentation
  • Tests for new functionality and regression tests for bug fixes
  • Screenshots attached (for UI changes)
  • All your commits have a Signed-off-by line in the message. (more info)

Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>
@dekoding dekoding requested review from a team as code owners December 10, 2022 01:02
@github-actions github-actions bot added the area:catalog Related to the Catalog Project Area label Dec 10, 2022
@github-actions
Contributor

github-actions bot commented Dec 10, 2022

Changed Packages

| Package Name | Package Path | Changeset Bump | Current Version |
| --- | --- | --- | --- |
| @backstage/plugin-catalog-backend-module-incremental-ingestion | plugins/catalog-backend-module-incremental-ingestion | minor | v0.1.0-next.3 |

@dekoding dekoding changed the title from "Dekoding/deltas in incremental providers" to "Deltas in incremental entity providers" Dec 10, 2022
@freben
Member

freben commented Dec 12, 2022

I'm a bit confused by this one. It seems like this makes the engine do two completely distinct things - pulling data from external sources and pushing it inward, and also receiving events and purely translating them to additions/removals via a callback. What drove this change, and are the two more related than they seem at surface value?

If the added functionality is needed, I'd have thought it might be in a provider all of its own, and perhaps in a more targeted or system specific way, rather than a generic translation by callback to the raw entity provider interface 🤔

@dekoding
Contributor Author

I'm a bit confused by this one. It seems like this makes the engine do two completely distinct things - pull data from external sources and pushing them inward, and also receiving events and purely translating them to additions/removals via a callback. What drove this change, and are the two more related than they seem at surface value?

If the added functionality is needed, I'd have thought it might be in a provider all of its own, and perhaps in a more targeted or system specific way, rather than a generic translation by callback to the raw entity provider interface 🤔

Good question. The logic behind giving incremental providers the ability to perform on-the-fly delta updates like this is to allow for situations where new assets are added to a data source that is capable of performing notifications (i.e., webhooks) outside the ingestion window.

For example, let's say you're using an incremental provider to ingest data for fifty thousand repositories from GitHub Enterprise. Since most won't change over the course of a week, you only have it scheduled to perform incremental ingestion once a week. However, you are aware that a repo might be added or removed during the rest period for the incremental ingestion, and GHE offers webhooks you can subscribe to in order to get those atomic updates. If the provider also supports deltas, you can get those on-the-fly updates ingested practically instantly. And the next time regular incremental ingestion runs, it will pick up those changes and incorporate them into the ingestion_mark_entities table.
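To make the webhook scenario concrete, here is a minimal sketch of a mapper that turns a GitHub `repository` webhook payload into a delta. Only the `action` and `repository` fields of the payload are used. The entity shape, the `component:default/...` entity ref, the `url:` location key, and the names `onRepositoryEvent` and `isRepositoryEvent` are assumptions made for illustration; the types are simplified local stand-ins, not the real Backstage types.

```typescript
// Simplified local stand-ins (assumptions) for the Backstage types.
interface EventParams {
  topic: string;
  eventPayload: unknown;
}
interface DeferredEntity {
  entity: { apiVersion: string; kind: string; metadata: { name: string } };
  locationKey?: string;
}
type Delta = { added: DeferredEntity[]; removed: { entityRef: string }[] };

// Subset of the GitHub "repository" webhook payload used by this sketch.
interface RepositoryEvent {
  action: string;
  repository: { name: string; html_url: string };
}

function isRepositoryEvent(value: unknown): value is RepositoryEvent {
  const v = value as Partial<RepositoryEvent> | null;
  return (
    typeof v === 'object' &&
    v !== null &&
    typeof v.action === 'string' &&
    typeof v.repository?.name === 'string' &&
    typeof v.repository?.html_url === 'string'
  );
}

// Hypothetical onEvent implementation: repository creation becomes an
// addition, deletion becomes a removal, anything else produces no delta.
function onRepositoryEvent(params: EventParams): Delta | undefined {
  if (!isRepositoryEvent(params.eventPayload)) return undefined;
  const { action, repository } = params.eventPayload;
  const name = repository.name.toLowerCase();
  if (action === 'created') {
    return {
      added: [
        {
          entity: {
            apiVersion: 'backstage.io/v1alpha1',
            kind: 'Component',
            metadata: { name },
          },
          locationKey: `url:${repository.html_url}`,
        },
      ],
      removed: [],
    };
  }
  if (action === 'deleted') {
    return { added: [], removed: [{ entityRef: `component:default/${name}` }] };
  }
  return undefined;
}
```

The weekly incremental run would still be needed to catch anything the webhooks missed.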

As for reasons why not to use a separate entity provider, the main one is that if the entity is created by a separate EntityProvider, it will be managed by that provider, rather than the one at the core of the incremental provider. If the incremental provider subsequently attempts to ingest the same entity with the same entityref, you'll get collisions. Unless there's a way to "hand off" an entity from one provider to another that I'm not aware of?

@freben
Member

freben commented Dec 12, 2022

Regarding overwrites: providers can actually overwrite each other, as long as they do so using the exact same locationKey. Of course, it's important that they don't do so with completely different entity shapes each time because that would be very confusing and lead to churn :) But let's see if we keep these as one anyway. 👇

Going to ping @pjungermann here for the interesting event use case.

Regarding the pull + push duality:

Alright, thanks for the context! This is indeed something that will be important, but we've thought about it from the opposite angle so far. We'd have expected the world to gradually shift towards being primarily event based and centred around that, and then pull mechanisms will still linger around but at very slow rates and only basically for fault recovery (making sure that stuff that for some reason falls between the chairs eventually "heals" anyway). That makes them "secondary" in a sense.

This makes me think that maybe a provider that has this ability, and being fed primarily out of webhooks, might want to be an event handler first, and then maybe, but as a side concern, it might be using the incremental method of ingestion in the recovery mode. :) This makes me look at your original stab with incremental ingestion more as a "facility" than a provider in a different light.

Maybe it's a matter of just tweaking the interfaces here a little, to clarify this. The name deltaMapper is the main offender perhaps. It would be good if one could read the IncrementalEntityProvider interface from top to bottom and it were clear that there was an entity bus part that subscribes to a special topic, and then there was a scheduled static part.

The interface could have optional supportsEventTopic and onEvent methods or something. Maybe if there's an onEvent but no supportsEventTopic, it'll fall back to a topic with the same name as the provider? 🤷 maybe that's more confusing than helpful.

And then be prepared that maybe people will more commonly come from the angle of having written event based ingestion first elsewhere, and only later, when they feel that they need to perhaps have an incremental fallback, do they look at this system and try to rewrite their provider as an incremental one.

@dekoding
Contributor Author

dekoding commented Dec 12, 2022

Regarding overwrites: providers can actually overwrite each other, as long as they do so using the exact same locationKey. Of course, it's important that they don't do so with completely different entity shapes each time because that would be very confusing and lead to churn :) But let's see if we keep these as one anyway. 👇

Ah, I guess that would work as an alternative, but I'm (currently) still inclined to try to keep things in the same provider, because we can keep it simpler for adopters (a mapper function vs. an entirely new entity provider). It's also easier to implement the delta logic I mentioned above if I don't have to duplicate access to the incremental ingestion database manager across two entity providers. Anyway... onward!

Going to ping @pjungermann here for the interesting event use case.

I'm a big fan of event emitters as ways to communicate between otherwise unconnected layers of an application. Looking forward to what @pjungermann has to say on this one.

Regarding the pull + push duality:

Alright, thanks for the context! This is indeed something that will be important, but we've thought about it from the opposite angle so far. We'd have expected the world to gradually shift towards being primarily event based and centred around that, and then pull mechanisms will still linger around but at very slow rates and only basically for fault recovery (making sure that stuff that for some reason falls between the chairs eventually "heals" anyway). That makes them "secondary" in a sense.

We want to get there, too. We regard incremental ingestion as kind of a drop-in replacement for full mutations for those cases where full is impossible due to the size of the entity set, but we still want to reach the point where we're mostly doing deltas.

This makes me think that maybe a provider that has this ability, and being fed primarily out of webhooks, might want to be an event handler first, and then maybe, but as a side concern, it might be using the incremental method of ingestion in the recovery mode. :) This makes me look at your original stab with incremental ingestion more as a "facility" than a provider in a different light.

Yeah, that's where we want to be. It's a large amount of infrastructure and a fairly big feature, but it's all in service to what we hope will just ultimately be used intermittently. And of course, in large, old enterprises, there are often big data sources that don't offer webhooks or other means of notifying about changes, and for those, incremental ingestion would need to be the focus.

Maybe it's a matter of just tweaking the interfaces here a little, to clarify this. The name deltaMapper is the main offender perhaps. It would be good if one could read the IncrementalEntityProvider interface from top to bottom and it were clear that there was an entity bus part that subscribes to a special topic, and then there was a scheduled static part.

I'm open to suggestions on the name. :)

The interface could have optional supportsEventTopic and onEvent methods or something. Maybe if there's an onEvent but no supportsEventTopic, it'll fall back to a topic with the same name as the provider? 🤷 maybe that's more confusing than helpful.

I think I get it, but it would defeat the purpose of making deltas simple. The supportsEventTopic and onEvent methods are incorporated into the engine in this PR, and exposing them seems like it would just result in repetitious code. You'd still need a mapper to convert payloads to deferred entities, but requiring those be part of the incremental provider would mean the methods would need to be explicitly added to each one.

And then be prepared that maybe people will more commonly come from the angle of having written event based ingestion first elsewhere, and only later, when they feel that they need to perhaps have an incremental fallback, do they look at this system and try to rewrite their provider as an incremental one.

I think (hope) they will appreciate the simplicity. The system I've created just requires you to write two things: A type/interface for your payload - which you'd need anyway - and the mapper function. The engine handles all the rest.

@pjungermann
Contributor

I'm already on my Christmas / end of year vacation and hence, less active right now.

Disclaimer: I know more or less what the incremental entity provider is, however I have no knowledge about how it works in detail.

And I didn't look at the code changes yet. I will have a closer look later. Just did a super quick scroll through it.

We'd have expected the world to gradually shift towards being primarily event based and centered around that, and then pull mechanisms will still linger around but at very slow rates and only basically for fault recovery (making sure that stuff that for some reason falls between the chairs eventually "heals" anyway). That makes them "secondary" in a sense.

I fully agree with this statement.

We want to get there, too. We regard incremental ingestion as kind of a drop-in replacement for full mutations for those cases where full is impossible due to the size of the entity set [...]

Overall, I would see the incremental ingestion as a replacement for the full mutation, too.

"drop-in replacement" means for me that there is nearly no additional work, however I assume that it is not that simple. A level of encapsulation and abstraction which allows you to use the capability without detailed knowledge is likely what you have prepared already though.

In general, I would see this as a feature for full mutations which entity providers can utilize -- or not. Maybe it even becomes the default at some point, with small datasets being ingested as one batch.


scenario:

E.g., let's assume there is a FooEntityProvider to ingest entities based on the external system Foo.

Foo provides APIs which we can use to scan and discover information which we can convert to entities.

Additionally, Foo provides events for changes to this data and we can subscribe to these events to refresh affected entities. In order to do so, we need to be able to identify these "affected entities" from the information within the event as well as the information we have at the provider-owned entities (name, annotations, ...) -- including expected new state and previous state (add/update/remove).

Full mutations will be scheduled in very long intervals to recover from any potentially lost event, etc.

The identification of "affected entities" (old vs new state) requires a certain level of alignment with the full mutation and its outcome. ("A" resulted in entities "a", "b", "c"; event for changes at "A"; "A" will result in the new state "a", "c", "d"; outcome: remove "b", add "d").
(maybe "refresh keys", annotations, ...)
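The old-versus-new comparison described above amounts to a set difference over entity refs. A minimal sketch, assuming entities are identified by plain string refs and using the hypothetical helper name `diffEntityRefs`:

```typescript
// Computes which refs appear only in the new state (added) and which appear
// only in the previous state (removed). Refs present in both are unchanged
// as far as membership is concerned.
function diffEntityRefs(
  previous: string[],
  next: string[],
): { added: string[]; removed: string[] } {
  const previousSet = new Set(previous);
  const nextSet = new Set(next);
  return {
    added: next.filter(ref => !previousSet.has(ref)),
    removed: previous.filter(ref => !nextSet.has(ref)),
  };
}

// Previous state "a", "b", "c" and new state "a", "c", "d":
const delta = diffEntityRefs(['a', 'b', 'c'], ['a', 'c', 'd']);
// delta.added is ['d'] and delta.removed is ['b']
```

This only detects membership changes. Detecting an updated entity whose ref stays the same would need a content comparison on top of this.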

As the dataset and the resulting ingested set of entities is huge (e.g., filters for the API will not be sufficient), we need a more efficient full mutation/ingestion compared to scanning the whole dataset at once. Hence, we add incremental ingestion support to the entity provider.

The incremental ingestion facilitates going through the entire dataset in batches/increments. In order to do so, it needs to keep track of the dataset.

This might require being aware of the delta mutations issued as result of event-based updates.

You can replace Foo with GitHub, BitbucketCloud, etc. and consider their (repo) push events.


this.providerEventTopic = `${options.provider.getProviderName()}-delta`;

Not sure if events like this make sense though. I would rather expect to have something like github.push or bitbucketCloud.repo:push as consumed events.

Maybe the incremental ingestion engine just needs a way to hook into the delta mutations itself. (E.g. by wrapping the connection at connect(connection: EntityProviderConnection): Promise<void> and add pre and/or post actions to applyMutation in case of mutation.type delta).
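The wrapping idea could look roughly like the sketch below. The `EntityProviderConnection` and `EntityProviderMutation` types here are trimmed-down local stand-ins (the real connection has more to it than `applyMutation`), and `wrapConnection` and its `before`/`after` hooks are hypothetical names, not something this PR implements.

```typescript
// Trimmed-down local stand-ins (assumptions) for the catalog types.
type DeferredEntity = { entity: { kind: string; metadata: { name: string } } };
type EntityProviderMutation =
  | { type: 'full'; entities: DeferredEntity[] }
  | { type: 'delta'; added: DeferredEntity[]; removed: { entityRef: string }[] };
interface EntityProviderConnection {
  applyMutation(mutation: EntityProviderMutation): Promise<void>;
}
type DeltaMutation = Extract<EntityProviderMutation, { type: 'delta' }>;

interface DeltaHooks {
  before?: (delta: DeltaMutation) => void | Promise<void>;
  after?: (delta: DeltaMutation) => void | Promise<void>;
}

// Returns a connection that behaves like `inner`, but runs the optional
// hooks around every delta mutation. Full mutations pass through untouched.
function wrapConnection(
  inner: EntityProviderConnection,
  hooks: DeltaHooks,
): EntityProviderConnection {
  return {
    async applyMutation(mutation) {
      if (mutation.type !== 'delta') {
        await inner.applyMutation(mutation);
        return;
      }
      await hooks.before?.(mutation);
      await inner.applyMutation(mutation);
      await hooks.after?.(mutation);
    },
  };
}
```

An engine could use the `after` hook to record delta-added entities in its own bookkeeping so that the next incremental run knows about them.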

As written above: take these preliminary thoughts with a grain of salt. :-)

@backstage-goalie
Contributor

backstage-goalie bot commented Dec 21, 2022

Changed Packages

| Package Name | Package Path | Changeset Bump | Current Version |
| --- | --- | --- | --- |
| @backstage/plugin-catalog-backend-module-incremental-ingestion | plugins/catalog-backend-module-incremental-ingestion | minor | v0.1.1 |

@github-actions
Contributor

github-actions bot commented Dec 21, 2022

Uffizzi Preview deployment-9512 was deleted.

@github-actions
Contributor

This PR has been automatically marked as stale because it has not had recent activity from the author. It will be closed if no further activity occurs. If the PR was closed and you want it re-opened, let us know and we'll re-open the PR so that you can continue the contribution!

@github-actions github-actions bot added the stale label Dec 28, 2022
@github-actions github-actions bot removed the stale label Dec 30, 2022
@dekoding
Contributor Author

dekoding commented Jan 2, 2023

Hey, no worries. I think if you look at the existing incremental provider, you'll find a lot of your thoughts are already addressed. The purpose of this particular PR is just to expose a mechanism for doing deltas with the same provider, which was previously not possible.

The mechanism I created for it takes advantage of event emitters, and it's that particular functionality I was hoping you could review if you have time.

(One thing I also want to address is that the reason it doesn't use github.push or anything like it is that it's a generic entity provider, not designed for any specific source, like GitHub.)

- addIncrementalEntityProvider<TCursor, TContext>(
-   provider: IncrementalEntityProvider<TCursor, TContext>,
+ addIncrementalEntityProvider<TCursor, TContext, TInput>(
+   provider: IncrementalEntityProvider<TCursor, TContext, TInput>,
Member

Noting two things here.

First: We should really just skip the named types here. We shouldn't know or care what they are; the argument should be provider: IncrementalEntityProvider<unknown, unknown, unknown>. First thing we do in code (in the builder at least!) is to throw away the types anyway. Just like the IncrementalIngestionEngine already takes an unknown-typed provider. Would make sense to make this change as part of the PR.

Second: It's probably unfortunate that this is a single type argument as well. I suspect you'll quickly run into cases where you need to subscribe to several events in a single provider - not just repo pushes, but also repo creation and deletion and renaming and ... etc etc. See below

Contributor Author

I'll see how complicated it will be to switch to unknown for all of them. That would certainly make some things simpler.

I do really like the idea of making the entity provider able to handle creation/deletion/renaming. Is it appropriate for the entity provider to do that, though? It seems like it might be a little out of scope.

Member

I'm thinking of a GitHub discovery provider that you just point at an entire org. Whenever people make new repos in there, those should also be part of the discovery. And when repos are deleted, the corresponding entities should vanish.

Contributor Author

@dekoding dekoding Jan 11, 2023

That would be handled by the delta mechanism (theoretically, at least, via webhooks). Deltas can add, remove, and change entities. I thought you were suggesting that the provider actually do the work of add/remove/change, instead of processing deltas from the data source.

Contributor Author

I think this is fully addressed by the engine, actually. It accepts a payload of both emitted entities and of deleted entities.

@taras
Member

taras commented Jan 11, 2023

@dekoding what do you think about renaming the issue to "Add events integration to Incremental Entity Providers"?

@dekoding dekoding changed the title from "Deltas in incremental entity providers" to "Add events integration to Incremental Entity Providers" Jan 11, 2023
@dekoding
Contributor Author

@dekoding what do you think about renaming the issue to "Add events integration to Incremental Entity Providers"?

That is an excellent idea. Done!

This reverts commit bcb0ff4.

Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>
Member

@taras taras left a comment

Looks great

* If a delta is present, the incremental entity provider will apply
* it automatically.
*/
onEvent?: (payload: unknown) => {
Member

Again, I really think that this should be params: EventParams instead of the payload. The other fields in the event params are there for good reason :) You'll quickly run into needing those headers or other things (that are there today or get added in the future).

Contributor

I agree. It took me some iterations to end up at EventParams and originally, I only had the payload, too.
I think it is better to support EventParams and stay more aligned.

Contributor Author

Will do. It will require some changes to the engine as well, since the engine currently handles the actual event, but those should be minimal.

Contributor Author

And it occurs to me that passing in the entire event, not just the payload, means the onEvent method can do things besides just producing a delta, making it more flexible. So returning a delta should itself be optional. Maybe there's a reason to respond to an event without generating a delta from it.

Contributor Author

Done.

@@ -37,6 +41,7 @@ export class IncrementalIngestionEngine implements IterationEngine {
{ minutes: 30 },
{ hours: 3 },
];
this.providerEventTopic = `${options.provider.getProviderName()}-push`;
Member

Did we arrive at having this one being just a single one and static? Will people use event routers to fold the "actual" event streams into this topic?

Contributor

hmm.. I wonder a bit about this topic, too.

Could you describe a bit the flow?

E.g., let's assume I've received a webhook event via HTTP ingress from Bitbucket.org for the event type repo:push on the topic bitbucketCloud which then gets routed to the topic bitbucketCloud.repo:push.

Or from GitHub to topic github and then routed to github.push.

What would happen next? How would it get to this subscriber?
Would you use yet another event router to route it to this topic name?

Or couldn't we just configure the topic name to be used, e.g. as part of IterationEngineOptions?
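For readers unfamiliar with the routing step mentioned above, here is a minimal sketch of deriving a sub-topic from the ingress topic plus the event type carried in a request header. `X-GitHub-Event` and `X-Event-Key` are the headers GitHub and Bitbucket Cloud use for the event type; the assumption that they arrive lower-cased in `metadata`, the local `EventParams` stand-in, and the name `routeToSubTopic` are illustrative and not taken from this PR.

```typescript
// Simplified local stand-in (assumption) for the Backstage EventParams type.
interface EventParams {
  topic: string;
  eventPayload: unknown;
  metadata?: Record<string, string | string[] | undefined>;
}

// Header that carries the event type, keyed by ingress topic.
const eventTypeHeaders: Record<string, string> = {
  github: 'x-github-event',
  bitbucketCloud: 'x-event-key',
};

// Returns a copy of the event re-addressed to "<topic>.<eventType>", or
// undefined when the topic is unknown or the header is missing.
function routeToSubTopic(params: EventParams): EventParams | undefined {
  const header = eventTypeHeaders[params.topic];
  if (!header) return undefined;
  const value = params.metadata?.[header];
  const eventType = Array.isArray(value) ? value[0] : value;
  if (!eventType) return undefined;
  return { ...params, topic: `${params.topic}.${eventType}` };
}

const routed = routeToSubTopic({
  topic: 'bitbucketCloud',
  eventPayload: {},
  metadata: { 'x-event-key': 'repo:push' },
});
// routed?.topic is 'bitbucketCloud.repo:push'
```

A subscriber that lists `bitbucketCloud.repo:push` or `github.push` among its topics would then receive the routed event without a provider-specific topic name.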

Contributor Author

There is a route for each incremental entity provider for accepting payloads via HTTP ingress. For example, if you have an incremental provider for Bitbucket repositories called BitBucketEntityProvider, the route would look like https://[somehost]/api/incremental/providers/BitBucketEntityProvider/event. It grabs the body of the request and emits it as a payload with a topic of BitBucketEntityProvider-push.

Contributor Author

Actually, this route is gone now. Instead, the plugin will be reliant on the presence of the events backend.

Member

@freben freben left a comment

Let's go :)

@freben freben merged commit 27732dc into backstage:master Jan 20, 2023
@github-actions
Contributor

Thank you for contributing to Backstage! The changes in this pull request will be part of the 1.11.0 release, scheduled for Tue, 14 Feb 2023.
