De-duplicate embedded specialist edition change history #940
Conversation
Force-pushed: 3d0cc6f → b4e0a44 → bfeb586 → e5c9aee
FROM change_notes
WHERE edition_id IS NULL
GROUP BY document_id, note
HAVING count(*) > 1
I'm a bit confused by this query and the loop (maybe I'm totally missing something). I'm assuming we're pulling out one ID each time; if so, why not fetch them all in one query, and why the max one?
I struggled to find a way to get back all the change note ids where there was duplication (defined by grouping on document and note), as the GROUP BY clause won't allow the query to return every change note id. Using max(id) gives back one duplicate per group, so by looping over this query we delete the duplicates by reduction/repetition.
I attempted this by doing the grouping in a sub-query, but the outer SELECT could potentially return non-duplicates, at least in the various ways I tried.
If there's a handy way in SQL of determining the duplicates and then returning the ids of all but one in each group, that would improve this massively.
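The reduction-by-repetition idea can be sketched in plain Ruby (a hypothetical in-memory stand-in for the change_notes table; the real migration runs the SQL above in a loop):

```ruby
# Hypothetical in-memory stand-in for rows of the change_notes table.
ChangeNote = Struct.new(:id, :document_id, :note)

notes = [
  ChangeNote.new(1, "doc-a", "First published."),
  ChangeNote.new(2, "doc-a", "First published."),
  ChangeNote.new(3, "doc-a", "First published."),
  ChangeNote.new(4, "doc-b", "Updated guidance."),
]

# Each pass mirrors the SQL: group on (document_id, note), and for every
# group with more than one row, pull back max(id) and delete it.
# A triplicate needs two passes, hence the loop.
loop do
  dupe_ids = notes.group_by { |n| [n.document_id, n.note] }
                  .values
                  .select { |group| group.size > 1 }
                  .map { |group| group.map(&:id).max }
  break if dupe_ids.empty?
  notes.reject! { |n| dupe_ids.include?(n.id) }
end

notes.map(&:id) # => [1, 4]
```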
Ah! I understand it better now thanks. So it returns just the first id from each collision - I thought it returned just one row at a time.
You can do this to get them all back: ChangeNote.where(edition_id: nil).group(:document_id, :note).having("COUNT(*) > 1").pluck("array_agg(id)")
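With the array_agg approach every duplicate can be removed in one pass. A rough sketch of the idea in plain Ruby, assuming we keep the lowest id in each group (the data and names here are illustrative, not the actual migration code):

```ruby
# Hypothetical rows: [id, document_id, note].
rows = [
  [1, "doc-a", "First published."],
  [2, "doc-a", "First published."],
  [3, "doc-a", "First published."],
  [4, "doc-b", "Updated guidance."],
]

# Equivalent of .group(:document_id, :note).having("COUNT(*) > 1")
#               .pluck("array_agg(id)") — one array of ids per duplicated group.
id_groups = rows.group_by { |_id, doc, note| [doc, note] }
                .values
                .select { |group| group.size > 1 }
                .map { |group| group.map(&:first) }

# Keep one row per group (the minimum id) and delete the rest in one pass.
ids_to_delete = id_groups.flat_map { |ids| ids.sort.drop(1) }

ids_to_delete # => [2, 3]
```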
That's much better, thanks. Will update.
Force-pushed: f5ed8cc → 8448c09
This migration identifies editions with change history embedded in the details hash that contains duplicate notes. Some editions contain duplicate change history but are not associated with any change notes, so we repopulate the change notes from the change history, de-duplicating based on document and note text. Some of the duplicate change history may be legitimate, so we clear the entry in the details hash and represent the edition downstream; the presenter will regenerate this history correctly from the ChangeNote records, which were de-duplicated in an earlier migration.
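The embedded-history de-duplication described above can be sketched like this. A minimal illustration on a hypothetical details hash; the real migration operates on Edition records and also scopes the de-duplication by document:

```ruby
# Hypothetical details hash for a single edition with a duplicated entry.
details = {
  "change_history" => [
    { "note" => "First published.", "public_timestamp" => "2017-01-01" },
    { "note" => "First published.", "public_timestamp" => "2017-01-01" },
    { "note" => "Updated guidance.", "public_timestamp" => "2017-02-01" },
  ],
}

# Repopulate change notes from the embedded history, de-duplicating on the
# note text (standing in for the document + note grouping in the migration).
deduped_notes = details["change_history"]
                  .uniq { |entry| entry["note"] }
                  .map { |entry| entry["note"] }

# Clear the embedded entry so the presenter regenerates the history
# downstream from the ChangeNote records instead.
details.delete("change_history")

deduped_notes # => ["First published.", "Updated guidance."]
```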
Force-pushed: 8448c09 → df07517
@kevindew updated; I had to make one small change to this, but I'm happy it works now.
https://trello.com/c/NbATcan2/956-de-duplicate-specialist-publisher-embedded-change-history.-(2)
This migration identifies editions with change history embedded in the details hash which contain duplicate notes. Some of these may be legitimate entries, so we clear the entry in the details hash and represent the edition downstream. The presenter will regenerate this history correctly based on the ChangeNote records.
Migration took ~58 seconds on a dev VM