-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add events integration to Incremental Entity Providers #15154
Add events integration to Incremental Entity Providers #15154
Conversation
Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>
Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>
Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>
Changed Packages
|
Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>
I'm a bit confused by this one. It seems like this makes the engine do two completely distinct things - pull data from external sources and pushing them inward, and also receiving events and purely translating them to additions/removals via a callback. What drove this change, and are the two more related than they seem at surface value? If the added functionality is needed, I'd have thought it might be in a provider all of its own, and perhaps in a more targeted or system specific way, rather than a generic translation by callback to the raw entity provider interface 🤔 |
Good question. The logic behind giving incremental providers the ability to perform on-the-fly delta updates like this is to allow for situations where new assets are added to a data source that is capable of performing notifications (i.e., webhooks) outside the ingestion window. For example, let's say you're using an incremental provider to ingest data for fifty thousand repositories from Github Enterprise. Since most won't change over the course of a week, you only have it scheduled to perform incremental ingestion once a week. However, you are aware that a repo might be added or removed during the rest period for the incremental ingestion, and GHE offers webhooks you can subscribe to in order to get those atomic updates. If the provider also supports deltas, you can get those on-the-fly updates ingested practically instantly. And the next time regular incremental ingestion runs, it will pick up those changes and incorporate them into the As for reasons why not to use a separate entity provider, the main one is that if the entity is created by a separate |
plugins/catalog-backend-module-incremental-ingestion/src/engine/IncrementalIngestionEngine.ts
Show resolved
Hide resolved
Regarding overwrites: providers can actually overwrite each other, as long as they do so using the exact same Going to ping @pjungermann here for the interesting event use case. Regarding the pull + push duality: Alright, thanks for the context! This is indeed something that will be important, but we've thought about it from the opposite angle so far. We'd have expected the world to gradually shift towards being primarily event based and centred around that, and then pull mechanisms will still linger around but at very slow rates and only basically for fault recovery (making sure that stuff that for some reason falls between the chairs eventually "heals" anyway). That makes them "secondary" in a sense. This makes me think that maybe a provider that has this ability, and being fed primarily out of webhooks, might want to be an event handler first, and then maybe, but as a side concern, it might be using the incremental method of ingestion in the recovery mode. :) This makes me look at your original stab with incremental ingestion more as a "facility" than a provider in a different light. Maybe it's a matter of just tweaking the interfaces here a little, to clarify this. The name The interface could have optional And then be prepared that maybe people will more commonly come from the angle of having written event based ingestion first elsewhere, and only later, when they feel that they need to perhaps have an incremental fallback, do they look at this system and try to rewrite their provider as an incremental one. |
Ah, I guess that would work as an alternative, but I'm (currently) still inclined to try to keep things in the same provider, because we can keep it simpler for adopters (a mapper function vs. an entirely new entity provider). It's also easier to implement the delta logic I mentioned above if I don't have to duplicate access to the incremental ingestion database manager across two entity providers. Anyway... onward!
I'm a big fan of event emitters as ways to communicate between otherwise unconnected layers of an application. Looking forward to what @pjungermann has to say on this one.
We want to get there, too. We regard incremental ingestion as kind of a drop-in replacement for
Yeah, that's where we want to be. It's a large amount of infrastructure and fairly big feature, but it's all in service to what we hope will just ultimately be used intermittently. And of course, in large, old enterprises, there are often big data sources that don't offer webhooks or other means of notifying about changes, and for those, incremental ingestion would need to be the focus.
I'm open to suggestions on the name. :)
I think I get it, but it would defeat the purpose of making deltas simple. The
I think (hope) they will appreciate the simplicity. The system I've created just requires you to write two things: A type/interface for your payload - which you'd need anyway - and the mapper function. The engine handles all the rest. |
Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>
Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>
I'm already on my Christmas / end of year vacation and hence, less active right now. Disclaimer: I know more or less what the incremental entity provider is, however I have no knowledge about how it works in detail. And I didn't look at the code changes yet. I will have a closer look later. Just did a super quick scroll through it.
I fully agree to this statement.
Overall, I would see the incremental ingestion as a replacement for the full mutation, too. "drop-in replacement" means for me that there is nearly no additional work, however I assume that it is not that simple. A level of encapsulation and abstraction which allows you to use the capability without detailed knowledge is likely what you have prepared already though. In general, I would see this as a feature for full mutations which entity providers can utilize -- or not. Maybe it even gets the default at some point with small datasets being ingested as one batch. scenario: E.g., let's assume there is a
Additionally, Full mutations will be scheduled in very long intervals to recover from any potentially lost event, etc. The identification of "affected entities" (old vs new state) requires a certain level of alignment with the full mutation and its outcome. ("A" resulted in entities "a", "b", "c"; event for changes at "A"; "A" will result in the new state "a", "c", "d"; outcome: remove "a", add "d"). As the dataset and the resulting ingested set of entities is huge (e.g., filters for the API will not be sufficient), we need a more efficient full mutation/ingestion compared to scanning the whole dataset at once. Hence, we add incremental ingestion support to the entity provider. The incremental ingestion facilitates going through the entire dataset in batches/increments. In order to do so, it needs to keep track of the dataset. This might require being aware of the delta mutations issued as result of event-based updates. You can replace this.providerEventTopic = `${options.provider.getProviderName()}-delta`; Not sure if events like this make sense though. I would rather expect to have something like Maybe the incremental ingestion engine just needs a way to hook into the delta mutations itself. (E.g. by wrapping the connection at As written above: take these preliminary thoughts with a grain of salt. :-) |
…/deltas-in-incremental-providers
Changed Packages
|
Uffizzi Preview |
This PR has been automatically marked as stale because it has not had recent activity from the author. It will be closed if no further activity occurs. If the PR was closed and you want it re-opened, let us know and we'll re-open the PR so that you can continue the contribution! |
…/deltas-in-incremental-providers
…/deltas-in-incremental-providers
Hey, no worries. I think if you look at the existing incremental provider, you'll find a lot of your thoughts are already addressed. The purpose of this particular PR is just to expose a mechanism for doing deltas with the same provider, which was previously not possible. The mechanism I created for it takes advantage of event emitters, and it's that particular functionality I was hoping you could review if you have time. (One thing I also want to address is that the reason it doesn't use |
…/deltas-in-incremental-providers
addIncrementalEntityProvider<TCursor, TContext>( | ||
provider: IncrementalEntityProvider<TCursor, TContext>, | ||
addIncrementalEntityProvider<TCursor, TContext, TInput>( | ||
provider: IncrementalEntityProvider<TCursor, TContext, TInput>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noting two things here.
First: We should really just skip the named types here. We shouldn't know or care what they are; the argument should be provider: IncrementalEntityProvider<unknown, unknown, unknown>
. First thing we do in code (in the builder at least!) is to throw away the types anyway. Just like the IncrementalIngestionEngine
already takes an unknown
-typed provider. Would make sense to make this change as part of the PR.
Second: It's probably unfortunate that this is a single type argument as well. I suspect you'll quickly run into cases where you need to subscribe to several events in a single provider - not just repo pushes, but also repo creation and deletion and renaming and ... etc etc. See below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll see how complicated it will be to switch to unknown
for all of them. That would certainly make some things simpler.
I do really like the idea of making the entity provider able to handle creation/deletion/renaming. Is it appropriate for the entity provider to do that, though? It seems like it might be a little out of scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking a github discovery provider that you just point at an entire org. Whenever people make new repos in there, those should also be part of the discovery. And when repos are deleted, the corresponding entities should vanish.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be handled by the delta mechanism (theoretically, at least, via webhooks). Deltas can add, remove, and change entities. I thought you were suggesting that the provider actually do the work of add/remove/change, instead of processing deltas from the data source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fully addressed by the engine, actually. It accepts a payload of both emitted entities and of deleted entities.
plugins/catalog-backend-module-incremental-ingestion/src/types.ts
Outdated
Show resolved
Hide resolved
plugins/catalog-backend-module-incremental-ingestion/src/types.ts
Outdated
Show resolved
Hide resolved
…/deltas-in-incremental-providers
@dekoding what do you think about renaming the issue to "Add events integration to Incremental Entity Providers"? |
That is an excellent idea. Done! |
…/deltas-in-incremental-providers
Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>
Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>
This reverts commit bcb0ff4. Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>
Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>
Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great
* If a delta is present, the incremental entity provider will apply | ||
* it automatically. | ||
*/ | ||
onEvent?: (payload: unknown) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I really think that this should be params: EventParams
instead of the payload. The other fields in the event params are there for good reason :) You'll quickly run into needing those headers or other things (that are there today or get added in the future).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. It took me some iterations to end up at EventParams and originally, I only had the payload, too.
I think it is better to support EventParams and stay more aligned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do. It will require some changes to the engine as well, since the engine currently handles the actual event, but those should be minimal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it occurs to me that passing in the entire event, not just the payload, means the onEvent
method can do things besides just producing a delta, making it more flexible. So returning a delta should itself be optional. Maybe there's a reason to respond to an event without generating a delta from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -37,6 +41,7 @@ export class IncrementalIngestionEngine implements IterationEngine { | |||
{ minutes: 30 }, | |||
{ hours: 3 }, | |||
]; | |||
this.providerEventTopic = `${options.provider.getProviderName()}-push`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we arrive at having this one being just a single one and static? Will people use event routers to fold the "actual" event streams into this topic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm.. I wonder a bit about this topic, too.
Could you describe a bit the flow?
E.g., let's assume I've received a webhook event via HTTP ingress from Bitbucket.org for the event type repo:push
on the topic bitbucketCloud
which then gets routed to the topic bitbucketCloud.repo:push
.
Or from GitHub to topic github
and then routed to github.push
.
What would happen next? How would it get to this subscriber?
Would you use yet another event router to route it to this topic name?
Or couldn't we not just configure the topic name to be used, e.g. as part of IterationEngineOptions
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a route for each incremental entity provider for accepting payloads via HTTP ingress. For example, if you have an incremental provider for BitBucket repositories called BitBucketEntityProvider
, the route would look like https://[somehost]/api/incremental/providers/BitBucketEntityProvider/event
. It grabs the body of the request and emits it as a payload with a topic of BitBucketEntityProvider-push
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this route is gone now. Instead, the plugin will be reliant on the presence of the events backend.
plugins/catalog-backend-module-incremental-ingestion/src/engine/IncrementalIngestionEngine.ts
Outdated
Show resolved
Hide resolved
plugins/catalog-backend-module-incremental-ingestion/src/engine/IncrementalIngestionEngine.ts
Outdated
Show resolved
Hide resolved
plugins/catalog-backend-module-incremental-ingestion/src/module/WrapperProviders.ts
Outdated
Show resolved
Hide resolved
plugins/catalog-backend-module-incremental-ingestion/src/router/routes.ts
Outdated
Show resolved
Hide resolved
Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>
…/deltas-in-incremental-providers
Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go :)
Thank you for contributing to Backstage! The changes in this pull request will be part of the |
Hey, I just made a Pull Request!
This pull request adds a new feature to incremental entity providers: The ability to receive events and (optionally) trigger a delta update based on an event payload. A new, optional
eventHandler
property exposes methods for handling events and identifying event topics that should be handled by the provider:This feature is optional and should not break any existing providers.
✔️ Checklist
Signed-off-by
line in the message. (more info)