Adding the incremental entity provider backend #14356

Merged
merged 54 commits into from
Nov 24, 2022

Conversation

dekoding
Contributor

Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>

Hey, I just made a Pull Request!

This PR introduces the concept of incremental entity providers to Backstage. An incremental entity provider is a form of entity provider that can ingest from very large data sources where a full mutation to achieve initial ingestion may be impossible.

Incremental providers split a large ingestion task into smaller chunks which can be streamed into the catalog (using deltas under the hood). The size and length of these chunks are configurable, as are the lengths of the pauses between them and the amount of time to rest after ingestion is complete before starting a fresh run.

The incremental entity provider tracks a cursor to indicate its current location in the data stream. This allows the provider's work to pick up where it left off after a burst of activity, on any replica running the provider's task.
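
As a rough illustration of the cursor idea described above, here is a minimal TypeScript sketch. The names `Page`, `NextPage`, and `runBurst` are hypothetical and are not the API added by this PR. The sketch also bounds a burst by a page count and reads synchronously from an in-memory list, whereas a real provider would more likely bound a burst by time and fetch pages asynchronously.

```typescript
// Hypothetical shapes for illustration only; not the module's actual API.
interface Page<TCursor> {
  entities: string[]; // entity refs discovered on this page
  cursor: TCursor; // where the next call should resume
  done: boolean; // true once the source is exhausted
}

type NextPage<TCursor> = (cursor: TCursor | undefined) => Page<TCursor>;

interface BurstResult<TCursor> {
  entities: string[];
  cursor: TCursor | undefined;
  done: boolean;
}

// Runs one burst: reads pages until the page budget is used up or the
// source reports that it is done. The returned cursor is the value that
// would be persisted so that any replica can resume from it.
function runBurst<TCursor>(
  next: NextPage<TCursor>,
  startCursor: TCursor | undefined,
  maxPagesPerBurst: number,
): BurstResult<TCursor> {
  const entities: string[] = [];
  let cursor = startCursor;
  let done = false;
  for (let i = 0; i < maxPagesPerBurst && !done; i++) {
    const page = next(cursor);
    entities.push(...page.entities);
    cursor = page.cursor;
    done = page.done;
  }
  return { entities, cursor, done };
}

// Example in-memory source, paged by a numeric offset.
const source = [
  'component:default/a',
  'component:default/b',
  'component:default/c',
  'component:default/d',
  'component:default/e',
];
const pageSize = 2;
const nextPage: NextPage<number> = cursor => {
  const offset = cursor ?? 0;
  const entities = source.slice(offset, offset + pageSize);
  const nextOffset = offset + entities.length;
  return { entities, cursor: nextOffset, done: nextOffset >= source.length };
};

// The first burst reads two pages; the second resumes from the saved cursor.
const first = runBurst(nextPage, undefined, 2);
const second = runBurst(nextPage, first.cursor, 2);
```

Because the second call only needs `first.cursor`, it could in principle run on a different replica than the first, provided the cursor is stored somewhere shared.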

In order to accomplish this, the incremental provider introduces a new schema called ingestion to the Postgres database, with three tables:

  • ingestions - Contains the status record for an incremental entity provider. Over time, there will always be two records per provider: One for the previous run, and one for the current run. Older ingestion records are purged automatically.
  • ingestion_marks - Contains the cursor records used by the incremental entity provider. They have a many-to-one relationship with the ingestions table records. Ingestion marks for old ingestion records are purged automatically as well.
  • ingestion_mark_entities - Contains an entry for each discovered entity, with a many-to-one relationship with records from the ingestion_marks table. Again, older records are purged automatically. This table is used for comparisons to the final_entities table, to allow entities for assets that have been removed to be deleted.
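
To make the relationships between the three tables concrete, the following TypeScript sketch models them in memory and shows one way the "keep two ingestions per provider" purge could behave. The field names (`providerName`, `createdAt`, `ingestionId`, `markId`, `ref`) and the function `purgeOldIngestions` are assumptions for illustration; they are not taken from the migration in this PR.

```typescript
// Hypothetical record shapes; field names are assumptions, not the real schema.
interface Ingestion {
  id: string;
  providerName: string;
  createdAt: number;
}
interface IngestionMark {
  id: string;
  ingestionId: string; // many marks belong to one ingestion
  cursor: unknown;
}
interface IngestionMarkEntity {
  id: string;
  markId: string; // many entities belong to one mark
  ref: string;
}

interface Tables {
  ingestions: Ingestion[];
  marks: IngestionMark[];
  markEntities: IngestionMarkEntity[];
}

// Keeps the two most recent ingestions for the given provider and drops the
// older ones, together with the marks and mark entities that hang off them.
function purgeOldIngestions(tables: Tables, providerName: string): Tables {
  const keep = new Set(
    tables.ingestions
      .filter(i => i.providerName === providerName)
      .sort((a, b) => b.createdAt - a.createdAt)
      .slice(0, 2)
      .map(i => i.id),
  );
  const ingestions = tables.ingestions.filter(
    i => i.providerName !== providerName || keep.has(i.id),
  );
  const liveIngestionIds = new Set(ingestions.map(i => i.id));
  const marks = tables.marks.filter(m => liveIngestionIds.has(m.ingestionId));
  const liveMarkIds = new Set(marks.map(m => m.id));
  const markEntities = tables.markEntities.filter(e =>
    liveMarkIds.has(e.markId),
  );
  return { ingestions, marks, markEntities };
}

const before: Tables = {
  ingestions: [
    { id: 'i1', providerName: 'p', createdAt: 1 },
    { id: 'i2', providerName: 'p', createdAt: 2 },
    { id: 'i3', providerName: 'p', createdAt: 3 },
  ],
  marks: [
    { id: 'm1', ingestionId: 'i1', cursor: 0 },
    { id: 'm2', ingestionId: 'i3', cursor: 0 },
  ],
  markEntities: [
    { id: 'e1', markId: 'm1', ref: 'component:default/old' },
    { id: 'e2', markId: 'm2', ref: 'component:default/new' },
  ],
};

// Ingestion i1 is the oldest of three, so it and its dependents are purged.
const after = purgeOldIngestions(before, 'p');
```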

Note that because any replica of Backstage can pick up the ingestion task, stateful protocols such as LDAP will not work. Sources like that should be ingested either via a proxy or with a standard entity provider.

✔️ Checklist

  • A changeset describing the change and affected packages. (more info)
  • Added or updated documentation
  • Tests for new functionality and regression tests for bug fixes
  • Screenshots attached (for UI changes)
  • All your commits have a Signed-off-by line in the message. (more info)

@github-actions
Contributor

github-actions bot commented Oct 26, 2022

Changed Packages

Package Name Package Path Changeset Bump Current Version
@backstage/plugin-catalog-backend-module-incremental-ingestion plugins/catalog-backend-module-incremental-ingestion minor v0.0.0

Member

@taras taras left a comment

Awesome work. I added a few comments for the readme.

@freben
Member

freben commented Oct 27, 2022

Whoa! Nice :) Review to be had later on. Regarding the DI error, note that all our in-repo dependencies (to other packages/plugins) are of the form "workspace:^" now, instead of actual versions. Way more convenient, I'm sure you'll find. I think that should let the compilation progress more.

@dekoding
Contributor Author

Whoa! Nice :) Review to be had later on. Regarding the DI error, note that all our in-repo dependencies (to other packages/plugins) are of the form "workspace:^" now, instead of actual versions. Way more convenient, I'm sure you'll find. I think that should let the compilation progress more.

Thanks for the heads-up! I also need to trim some other dependencies that are unneeded in this version.

Member

@freben freben left a comment

Alright! Sorry it took some time, we were away and I also spent some real focus time on this :) I hope the amount of comments doesn't feel overwhelming. Let me know if you prefer some form of arrangement with an eager merge, with the most important things addressed and the rest deferred.

packages/backend/package.json (outdated, resolved)
plugins/incremental-ingestion-backend/package.json (outdated, resolved)
plugins/incremental-ingestion-backend/package.json (outdated, resolved)
plugins/incremental-ingestion-backend/README.md (outdated, resolved)
} from '../types';

/** @public */
export class IncrementalIngestionDatabaseManager {
Member

There are no tests for this class (might be good for at least specifically this one since it might be the riskier one in future refactors)

Contributor Author

I'll look into writing one. Is this a showstopper?

Member

Nah. Would be good to get them in eventually of course.


const ingestionsDeleted = await tx('ingestion.ingestions')
  .delete()
  .where('provider_name', provider);
Member

Yeah, a lot of the cleanup would go away if you had ON DELETE CASCADE foreign key relations, as suggested in the migration.

Contributor Author

I couldn't find a way to do this without losing the information about how many of each type of record were deleted. I'm still trying to decide if I actually care about that data, though.
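
One possible way to reconcile the two comments above is to count the dependent rows first and then let a single parent delete cascade. The TypeScript sketch below simulates that ordering against in-memory arrays. The `Store` shape, field names, and `deleteProviderIngestions` are hypothetical; in a real transaction the counting step would be COUNT queries and the cascade would be performed by the database rather than by the two extra filters shown here.

```typescript
// Hypothetical in-memory stand-ins for the three tables; names are assumptions.
interface IngestionRow {
  id: string;
  providerName: string;
}
interface MarkRow {
  id: string;
  ingestionId: string;
}
interface MarkEntityRow {
  id: string;
  markId: string;
}

interface Store {
  ingestions: IngestionRow[];
  marks: MarkRow[];
  markEntities: MarkEntityRow[];
}

interface DeletedCounts {
  ingestions: number;
  marks: number;
  markEntities: number;
}

function deleteProviderIngestions(
  store: Store,
  providerName: string,
): DeletedCounts {
  // Step 1: work out what the cascade is about to remove, per table.
  const ingestionIds = new Set(
    store.ingestions
      .filter(i => i.providerName === providerName)
      .map(i => i.id),
  );
  const markIds = new Set(
    store.marks.filter(m => ingestionIds.has(m.ingestionId)).map(m => m.id),
  );
  const counts: DeletedCounts = {
    ingestions: ingestionIds.size,
    marks: markIds.size,
    markEntities: store.markEntities.filter(e => markIds.has(e.markId)).length,
  };

  // Step 2: delete the parent rows. The two filters after the first one
  // stand in for what ON DELETE CASCADE would do in the database.
  store.ingestions = store.ingestions.filter(i => !ingestionIds.has(i.id));
  store.marks = store.marks.filter(m => !markIds.has(m.id));
  store.markEntities = store.markEntities.filter(e => !markIds.has(e.markId));
  return counts;
}

const store: Store = {
  ingestions: [
    { id: 'i1', providerName: 'github' },
    { id: 'i2', providerName: 'other' },
  ],
  marks: [
    { id: 'm1', ingestionId: 'i1' },
    { id: 'm2', ingestionId: 'i1' },
    { id: 'm3', ingestionId: 'i2' },
  ],
  markEntities: [
    { id: 'e1', markId: 'm1' },
    { id: 'e2', markId: 'm2' },
    { id: 'e3', markId: 'm2' },
    { id: 'e4', markId: 'm3' },
  ],
};

const counts = deleteProviderIngestions(store, 'github');
```

The trade-off is a couple of extra read queries per cleanup in exchange for much less hand-written delete code, which only matters if the per-table counts are actually wanted.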

}

/**
* Performs a lookup of all providers that have duplicate active ingestion records.
Member

Should there be a UNIQUE constraint on the table instead?

Contributor Author

This method is left over from before we added the unique constraint that is present in the current migration. I'll have to think about whether this is still a concern.

@freben freben added the area:catalog Related to the Catalog Project Area label Nov 10, 2022
@github-actions
Contributor

This PR has been automatically marked as stale because it has not had recent activity from the author. It will be closed if no further activity occurs. If the PR was closed and you want it re-opened, let us know and we'll re-open the PR so that you can continue the contribution!

@github-actions github-actions bot added the stale label Nov 17, 2022
@freben freben removed the stale label Nov 17, 2022
@taras taras mentioned this pull request Nov 17, 2022
31 tasks
@github-actions github-actions bot added area:catalog Related to the Catalog Project Area and removed area:catalog Related to the Catalog Project Area labels Nov 17, 2022
@freben freben force-pushed the dekoding/incremental-entity-provider branch 3 times, most recently from d74979c to 227e882 Compare November 22, 2022 11:13
@github-actions github-actions bot removed the area:catalog Related to the Catalog Project Area label Nov 22, 2022
@freben freben closed this Nov 22, 2022
@freben freben reopened this Nov 22, 2022
@freben
Member

freben commented Nov 22, 2022

Just closing and opening to try to nudge the builds

Member

@freben freben left a comment

Alright, some follow-up. Do note that I pushed a PR to strip out all of the database-related code from the API report; you'll want to pull this branch before you continue work on it. I hope that's not a problem. 🙏

// before incremental builder migrations are executed
const { incrementalAdminRouter } = await incrementalBuilder.build();

router.use('/incremental', incrementalAdminRouter);
Member

Alright, let's leave it as is then for now. We will want to write a module for the new backend system for this eventually, and at that point I think we'll discover that some similar deferral mechanism as suggested above might come into play under the hood too. So we can worry about that then.

plugins/incremental-ingestion-backend/api-report.md (outdated, resolved)
plugins/incremental-ingestion-backend/api-report.md (outdated, resolved)
plugins/incremental-ingestion-backend/src/router/paths.ts (outdated, resolved)
dekoding and others added 18 commits November 23, 2022 15:05
Signed-off-by: Damon Kaswell <damon.kaswell1@hp.com>
Signed-off-by: Fredrik Adelöw <freben@gmail.com>
@freben freben force-pushed the dekoding/incremental-entity-provider branch from de21a79 to 3f5d620 Compare November 23, 2022 14:09
@github-actions github-actions bot added the area:catalog Related to the Catalog Project Area label Nov 23, 2022
Signed-off-by: Fredrik Adelöw <freben@gmail.com>
@freben freben force-pushed the dekoding/incremental-entity-provider branch from b198472 to 72b1847 Compare November 24, 2022 14:10
Signed-off-by: Fredrik Adelöw <freben@gmail.com>
@freben freben force-pushed the dekoding/incremental-entity-provider branch from 72b1847 to d7209f1 Compare November 24, 2022 14:16
Signed-off-by: Fredrik Adelöw <freben@gmail.com>
Member

@freben freben left a comment

Alright, I think this is in a good shape now to merge and iterate on!

One large remaining note that I'd like to have addressed in a follow-up is that the provider directly accesses catalog tables. That should ideally not be necessary. Since we already know from our own tables what the previous ingestion round returned, we should consult those to compute the deleted items. The catalog, especially since it goes via the search and final_entities tables, is not a good source of truth: problems and delays related to processing and stitching can mean that the final state of the db isn't what you think it is the next time that the burst starts over again.
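
The follow-up suggested above amounts to a set difference between two ingestion rounds. A minimal TypeScript sketch of that computation follows, assuming the entity refs of each round have already been loaded from the provider's own tables; `computeRemovedRefs` is a hypothetical helper, not something in this PR.

```typescript
// Returns the refs that were present in the previous ingestion round but
// are absent from the current one, in their original order and without
// duplicates. Both inputs are assumed to come from the provider's own
// tables rather than from the catalog's final_entities table.
function computeRemovedRefs(previous: string[], current: string[]): string[] {
  const currentSet = new Set(current);
  const seen = new Set<string>();
  return previous.filter(ref => {
    if (currentSet.has(ref) || seen.has(ref)) {
      return false;
    }
    seen.add(ref);
    return true;
  });
}

const previousRound = [
  'component:default/a',
  'component:default/b',
  'component:default/c',
  'component:default/b',
];
const currentRound = ['component:default/a', 'component:default/d'];

// b and c disappeared from the source, so they are candidates for deletion.
const removed = computeRemovedRefs(previousRound, currentRound);
```

Since both inputs are written by the provider itself, the result does not depend on how far catalog processing and stitching have progressed.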

@freben freben merged commit 6ecb455 into backstage:master Nov 24, 2022
@jhaals jhaals mentioned this pull request Dec 20, 2022
Labels
area:catalog Related to the Catalog Project Area

5 participants