Adds urnBasedPagination option to datahub-upgrade RestoreIndices #9232

Merged

@nmbryant nmbryant commented Nov 13, 2023

We found that the RestoreIndices job in datahub-upgrade performs poorly when the SQL database holds a large amount of data. Before this change, each batch took longer than the last as RestoreIndices ran, with SQL as the bottleneck. The cause is that queries using OFFSET slow down as the OFFSET value grows.

This solution uses keyset pagination, saving the most recent urn and aspect that have been processed. The former method is still part of RestoreIndices; set the flag urnBasedPagination to true to use the new keyset pagination method. In our internal testing, full restoration of ~5M records went from 4-5 hours to 40 minutes, with execution time scaling linearly instead of slowing down over time. The downside of this implementation is that I haven't yet added support for multiple threads. However, in most cases it should be much faster than before: in our comparison, the old method ran with 32 threads against the new implementation's single thread.
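
For readers unfamiliar with the two approaches, here is a minimal in-memory sketch of how OFFSET paging and keyset paging differ. It is not the code from this PR: the `Row` class, the method names, and the SQL shape in the comments are assumptions made for illustration, and the list filtering only mimics the semantics, not the database's index seek.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only; all names here are hypothetical.
final class KeysetPaginationSketch {

  // Hypothetical stand-in for one row of the aspect table.
  static final class Row {
    final String urn;
    final String aspect;

    Row(String urn, String aspect) {
      this.urn = urn;
      this.aspect = aspect;
    }
  }

  // Sort order shared by both approaches: urn first, then aspect.
  static final Comparator<Row> ORDER =
      Comparator.comparing((Row r) -> r.urn).thenComparing(r -> r.aspect);

  // OFFSET style: `offset` rows must be skipped before anything is returned,
  // so later pages cost more than earlier ones.
  static List<Row> offsetPage(List<Row> sorted, int offset, int limit) {
    return sorted.stream().skip(offset).limit(limit).collect(Collectors.toList());
  }

  // Keyset style: return rows strictly after the last (urn, aspect) seen.
  // Assumed SQL shape (not taken from the PR):
  //   WHERE urn > :lastUrn OR (urn = :lastUrn AND aspect > :lastAspect)
  //   ORDER BY urn, aspect LIMIT :limit
  static List<Row> keysetPage(List<Row> sorted, Row last, int limit) {
    return sorted.stream()
        .filter(r -> last == null || ORDER.compare(r, last) > 0)
        .limit(limit)
        .collect(Collectors.toList());
  }

  // Walk every row with keyset paging, remembering only the last key seen.
  static List<Row> readAll(List<Row> sorted, int limit) {
    List<Row> out = new ArrayList<>();
    Row last = null;
    while (true) {
      List<Row> page = keysetPage(sorted, last, limit);
      if (page.isEmpty()) {
        break;
      }
      out.addAll(page);
      last = page.get(page.size() - 1);
    }
    return out;
  }

  // Small sample, already sorted by (urn, aspect).
  static List<Row> sample() {
    return Arrays.asList(
        new Row("urn:li:dataset:a", "ownership"),
        new Row("urn:li:dataset:a", "status"),
        new Row("urn:li:dataset:b", "status"),
        new Row("urn:li:dataset:c", "ownership"),
        new Row("urn:li:dataset:c", "status"));
  }

  public static void main(String[] args) {
    for (Row r : readAll(sample(), 2)) {
      System.out.println(r.urn + " " + r.aspect);
    }
  }
}
```

Because each page depends on the last key of the previous page, this style is naturally sequential, which is consistent with the single-threaded limitation described above.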

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the product (PR or Issue related to the DataHub UI/UX) and devops (PR or Issue related to DataHub backend & deployment) labels Nov 13, 2023
@nmbryant nmbryant marked this pull request as ready for review November 15, 2023 18:32
@yoonhyejin yoonhyejin requested review from RyanHolstien and removed request for RyanHolstien November 16, 2023 09:14

@RyanHolstien RyanHolstien left a comment

Thanks for the contribution! This is great, a couple of minor comments

@RyanHolstien RyanHolstien self-assigned this Nov 16, 2023

nmbryant commented Nov 20, 2023

@RyanHolstien Thanks for the review. I addressed your comments in the latest commit. Right now urnBasedPagination defaults to false; should I change the default value to true?

@maggiehays maggiehays added the community-contribution (PR or Issue raised by member(s) of DataHub Community) label Nov 29, 2023

nmbryant commented Dec 7, 2023

@RyanHolstien Updated the query a bit based on our results. With this change, we were able to complete RestoreIndices on ~66 million aspects in ~5 hours.

@@ -163,7 +153,8 @@ public Function<UpgradeContext, UpgradeStepResult> executable() {
context.report().addLine(String.format("Rows processed this loop %d", rowsProcessed));
start += args.batchSize;
} catch (InterruptedException | ExecutionException e) {
e.printStackTrace();
context.report().addLine(String.format("Exception received while processing batch, exiting: %s", e.getMessage()));
break;
Collaborator

I'm confused by the use of break here. I was thinking this should be a failure condition that does:

return new DefaultUpgradeStepResult(id(), UpgradeStepResult.Result.FAILED);

What was the break intended to do?

Contributor Author

Just wasn't aware that's how failure conditions should be handled, I'll get that updated
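
To make the reviewer's point concrete, here is a small self-contained sketch of the difference. It uses hypothetical stand-ins (`Result`, `Batch`) rather than the real DataHub upgrade classes, and it assumes the step returns a success result once the loop ends, which is not shown in the diff context above.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; these types are stand-ins, not DataHub classes.
final class BatchFailureSketch {

  // Hypothetical stand-in for UpgradeStepResult.Result.
  enum Result { SUCCEEDED, FAILED }

  // Hypothetical stand-in for one batch of work; throws to simulate a failed batch.
  interface Batch {
    int process() throws Exception;
  }

  // With `break`, the loop stops early but the step still reports success
  // (assuming a success result is returned after the loop).
  static Result runWithBreak(List<Batch> batches) {
    for (Batch batch : batches) {
      try {
        batch.process();
      } catch (Exception e) {
        break;
      }
    }
    return Result.SUCCEEDED;
  }

  // Returning FAILED from the catch block surfaces the error to the caller.
  static Result runWithFailure(List<Batch> batches) {
    for (Batch batch : batches) {
      try {
        batch.process();
      } catch (Exception e) {
        return Result.FAILED;
      }
    }
    return Result.SUCCEEDED;
  }

  // Three batches, the second of which fails.
  static List<Batch> sampleBatches() {
    Batch ok = () -> 10;
    Batch failing = () -> {
      throw new IllegalStateException("simulated batch failure");
    };
    return Arrays.asList(ok, failing, ok);
  }

  public static void main(String[] args) {
    System.out.println("break:  " + runWithBreak(sampleBatches()));
    System.out.println("return: " + runWithFailure(sampleBatches()));
  }
}
```

In both variants the third batch is never processed; the only difference is whether the caller learns that the run was incomplete.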

validateConnection();

List<EbeanAspectV2.PrimaryKey> keys =
urnAspects.entrySet().stream()
Collaborator

There are a lot of reformatting changes that got put into this PR that make it hard to see what has actually been changed

Contributor Author

That's all from this PR: #9373
Do you know what was run to do the formatting for that PR so that I can run it on my changes?

Collaborator

I believe it's done through Spotless, which seems to be run with ./gradlew spotlessApply. It might need to be run on a per-module basis; I haven't used it personally yet, so I haven't worked through the kinks myself. It might also sort itself out when the branch is updated, which we can try.

Collaborator

Yeah merge did not fix it, it's showing that the lines got specifically changed from the new format to old format in this PR 😕

Collaborator

The build output gives the command; running ./gradlew :metadata-service:services:spotlessApply should fix it.

Contributor Author

Thanks, that fixed it. I think everything should be good now

@RyanHolstien RyanHolstien added the merge-pending-ci (A PR that has passed review and should be merged once CI is green.) label Dec 8, 2023
@sgomezvillamor
Contributor

I have suffered long upgrades because of RestoreIndices job in the past and this PR looks promising!
Is there any chance that this gets included in the next release @RyanHolstien?

@RyanHolstien RyanHolstien merged commit a29fce9 into datahub-project:master Dec 19, 2023
39 checks passed

sixmen commented Jan 17, 2024

I found that performance is much better after adding an index (useful before the release):

create index version_urn_aspect on metadata_aspect_v2 (version, urn, aspect)

(The restore query has the form: select urn, aspect, ... where version=0 order by urn, aspect limit n offset m)
