Adds urnBasedPagination option to datahub-upgrade RestoreIndices #9232

nmbryant · 2023-11-13T20:41:46Z

We found that the RestoreIndices job in datahub-upgrade performs poorly with large amounts of data in the SQL database. Before this change, each batch took longer and longer as RestoreIndices ran, with SQL as the bottleneck. The cause is OFFSET-based pagination: queries slow down as the OFFSET value grows, since the database generally has to step over all the skipped rows before it can return the batch.

This solution uses keyset pagination: it saves the most recent urn and aspect that were processed and starts the next batch after them. The former OFFSET-based method is still part of RestoreIndices; set the flag urnBasedPagination to true to use the new keyset pagination method. In our internal testing, full restoration of ~5M records went from 4-5 hours to 40 minutes, with execution time growing linearly instead of batches getting slower over time. The downside of this implementation is that I haven't yet added support for multiple threads. However, in most cases it should be much faster than before: in our comparison, the old method ran with 32 threads against the new implementation's single thread.
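As a rough illustration of the idea (not the code in this PR), the sketch below pages through an in-memory list sorted by (urn, aspect), carrying the last processed urn and aspect forward as a cursor instead of an offset. The class, the Row type, and the method names are hypothetical. In SQL the same cursor condition would typically be a predicate of the form urn > :lastUrn OR (urn = :lastUrn AND aspect > :lastAspect), which lets the database seek through an index rather than skip rows; the exact query used by RestoreIndices may differ.

```java
import java.util.ArrayList;
import java.util.List;

class KeysetPaginationSketch {
  /** Hypothetical stand-in for one row of the aspect table; only the sort key is modeled. */
  static final class Row {
    final String urn;
    final String aspect;

    Row(String urn, String aspect) {
      this.urn = urn;
      this.aspect = aspect;
    }
  }

  /**
   * Returns up to batchSize rows that sort strictly after the cursor (lastUrn, lastAspect).
   * Pass null for both to fetch the first batch. Assumes the list is already sorted by
   * (urn, aspect) and that batchSize is at least 1.
   */
  static List<Row> nextBatch(
      List<Row> sortedRows, String lastUrn, String lastAspect, int batchSize) {
    List<Row> batch = new ArrayList<>();
    for (Row row : sortedRows) {
      boolean afterCursor =
          lastUrn == null
              || row.urn.compareTo(lastUrn) > 0
              || (row.urn.equals(lastUrn) && row.aspect.compareTo(lastAspect) > 0);
      if (afterCursor) {
        batch.add(row);
        if (batch.size() == batchSize) {
          break;
        }
      }
    }
    return batch;
  }

  /** Walks the whole list batch by batch, carrying the cursor forward instead of an offset. */
  static List<Row> readAll(List<Row> sortedRows, int batchSize) {
    List<Row> seen = new ArrayList<>();
    String lastUrn = null;
    String lastAspect = null;
    while (true) {
      List<Row> batch = nextBatch(sortedRows, lastUrn, lastAspect, batchSize);
      if (batch.isEmpty()) {
        return seen;
      }
      seen.addAll(batch);
      // The last row of this batch becomes the cursor for the next one.
      Row last = batch.get(batch.size() - 1);
      lastUrn = last.urn;
      lastAspect = last.aspect;
    }
  }
}
```

The in-memory scan here is only a stand-in for the database. It shows the cursor logic, not the performance characteristics, which come from the database being able to seek to the cursor position through an index.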

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

RyanHolstien

Thanks for the contribution! This is great; just a couple of minor comments.

datahub-upgrade/src/main/java/com/linkedin/datahub/upgrade/restoreindices/SendMAEStep.java

nmbryant · 2023-11-20T15:20:27Z

@RyanHolstien Thanks for the review. I addressed your comments in the latest commit. Right now urnBasedPagination defaults to false. Should I change the default value to true?

nmbryant · 2023-12-07T16:40:44Z

@RyanHolstien Updated the query a bit based on our results. With this change, we were able to complete RestoreIndices on ~66 million aspects in ~5 hours.

RyanHolstien · 2023-12-07T17:43:53Z

datahub-upgrade/src/main/java/com/linkedin/datahub/upgrade/restoreindices/SendMAEStep.java

@@ -163,7 +153,8 @@ public Function<UpgradeContext, UpgradeStepResult> executable() {
            context.report().addLine(String.format("Rows processed this loop %d", rowsProcessed));
            start += args.batchSize;
          } catch (InterruptedException | ExecutionException e) {
-            e.printStackTrace();
+            context.report().addLine(String.format("Exception received while processing batch, exiting: %s", e.getMessage()));
+            break;


I'm confused by the use of break here. I was thinking this should be a failure condition that does:

return new DefaultUpgradeStepResult(id(), UpgradeStepResult.Result.FAILED);

What was the break intended for?

nmbryant · 2023-12-07

I just wasn't aware that's how failure conditions should be handled. I'll get that updated.
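To make the distinction in this thread concrete, here is a minimal sketch of a batch loop that returns a failure result when a batch throws, rather than breaking out of the loop and falling through to the success path. The class, the Result enum, and the runBatches method are hypothetical stand-ins and not DataHub's actual upgrade API; only the two report messages are taken from the diff above.

```java
import java.util.List;
import java.util.concurrent.Callable;

class BatchLoopSketch {
  /** Hypothetical stand-in for the step result; not DataHub's UpgradeStepResult. */
  enum Result {
    SUCCEEDED,
    FAILED
  }

  /**
   * Runs each batch in order and appends a line to the report for each one.
   * If any batch throws, the failure is reported and FAILED is returned immediately,
   * so the caller cannot mistake a partial run for a successful one.
   */
  static Result runBatches(List<Callable<Integer>> batches, List<String> report) {
    for (Callable<Integer> batch : batches) {
      try {
        int rowsProcessed = batch.call();
        report.add(String.format("Rows processed this loop %d", rowsProcessed));
      } catch (Exception e) {
        report.add(
            String.format(
                "Exception received while processing batch, exiting: %s", e.getMessage()));
        // A plain break here would exit the loop but still reach the SUCCEEDED return below.
        return Result.FAILED;
      }
    }
    return Result.SUCCEEDED;
  }
}
```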

RyanHolstien · 2023-12-07T17:45:43Z

metadata-io/src/main/java/com/linkedin/metadata/entity/ebean/EbeanAspectDao.java

    validateConnection();

-    List<EbeanAspectV2.PrimaryKey> keys =
-        urnAspects.entrySet().stream()


There are a lot of reformatting changes that got put into this PR that make it hard to see what has actually been changed

nmbryant · 2023-12-07

That's all from this PR: #9373
Do you know what was run to do the formatting for that PR so that I can run it on my changes?

RyanHolstien · 2023-12-07

I believe it's done through Spotless, which seems to be run with ./gradlew spotlessApply. It might need to be run on a per-module basis. I haven't used it personally yet, so I haven't worked through the kinks myself. It might also work itself out when updating the branch, which we can try.

RyanHolstien · 2023-12-07

Yeah, the merge did not fix it. It's showing that the lines were specifically changed from the new format to the old format in this PR 😕

RyanHolstien · 2023-12-07

The build output gives the command; ./gradlew :metadata-service:services:spotlessApply should fix it.

nmbryant · 2023-12-08

Thanks, that fixed it. I think everything should be good now.

sgomezvillamor · 2023-12-19T08:52:30Z

I have suffered long upgrades because of the RestoreIndices job in the past, and this PR looks promising!
Is there any chance that this gets included in the next release @RyanHolstien?

sixmen · 2024-01-17T05:29:36Z

I found that performance is much better after adding an index (useful until this change is released):

create index version_urn_aspect on metadata_aspect_v2 (version, urn, aspect)

(The query used for restoring has the form: select urn, aspect, ... where version=0 order by urn, aspect limit n offset m)

github-actions bot added the product and devops labels Nov 13, 2023

perf(datahub-upgrade): Adds urnBasedPagination option to datahub-upgrade RestoreIndices

d1d0ea5

nmbryant force-pushed the restore-indices-improvement branch from 0619c34 to d1d0ea5 Compare November 13, 2023 21:00

perf(datahub-upgrade): Fixes issue where lastAspect wasn't getting set on args

1edcb42

nmbryant force-pushed the restore-indices-improvement branch from 713101d to 1edcb42 Compare November 13, 2023 21:50

perf(datahub-upgrade): updates datahub-upgrade docker docs

327c4ba

nmbryant marked this pull request as ready for review November 15, 2023 18:32

yoonhyejin requested review from RyanHolstien and removed request for RyanHolstien November 16, 2023 09:14

RyanHolstien reviewed Nov 16, 2023

RyanHolstien self-assigned this Nov 16, 2023

perf(datahub-upgrade): print message and exit on exception in SendMAEStep when urnBasedPagination is true

2be9021

maggiehays added the community-contribution label Nov 29, 2023

nmbryant added 2 commits December 7, 2023 11:25

perf(datahub-upgrade): Removes string concatenation from query to improve performance

b2eae3d

perf(datahub-upgrade): Resolve merge conflicts from main

3cddb4f

RyanHolstien reviewed Dec 7, 2023

Merge branch 'master' into restore-indices-improvement

51106fd

nmbryant added 2 commits December 8, 2023 11:18

perf(datahub-upgrade): Updates error scenario, runs spotless apply

202440e

Merge branch 'restore-indices-improvement' of https://github.com/nmbryant/datahub into restore-indices-improvement

5c1377c

perf(datahub-upgrade): runs spotless apply on metadata-io

9cc60ef

RyanHolstien approved these changes Dec 8, 2023

Merge branch 'master' into restore-indices-improvement

66d2641

RyanHolstien added the merge-pending-ci label Dec 8, 2023

Merge branch 'master' into restore-indices-improvement

3ed6102

RyanHolstien merged commit a29fce9 into datahub-project:master Dec 19, 2023
39 checks passed
