Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

De-duplicate embedded specialist edition change history #940

Merged
merged 1 commit into from
Jun 21, 2017

Conversation

steventux
Copy link
Contributor

@steventux steventux commented Jun 1, 2017

https://trello.com/c/NbATcan2/956-de-duplicate-specialist-publisher-embedded-change-history.-(2)

This migration identifies editions with change history embedded in the details hash
which contain duplicate notes. Some of these may be legitimate entries so we clear
the entry in the details hash and represent the edition downstream.
The presenter will regenerate this history correctly based on the ChangeNote records.
Migration took ~58 seconds on a dev VM

@steventux steventux force-pushed the dedupe-specialist-change-history branch 3 times, most recently from 3d0cc6f to b4e0a44 Compare June 6, 2017 08:24
@steventux steventux changed the title De-duplicate embedded specialist edition change history [Do not merge] De-duplicate embedded specialist edition change history Jun 8, 2017
@steventux steventux force-pushed the dedupe-specialist-change-history branch from b4e0a44 to bfeb586 Compare June 13, 2017 15:03
@steventux steventux changed the title [Do not merge] De-duplicate embedded specialist edition change history De-duplicate embedded specialist edition change history Jun 13, 2017
@steventux steventux force-pushed the dedupe-specialist-change-history branch from bfeb586 to e5c9aee Compare June 13, 2017 15:06
FROM change_notes
WHERE edition_id IS NULL
GROUP BY document_id, note
HAVING (count(*) > 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by this query and the loop - maybe I'm totally missing something. I'm assuming we're pulling out one ID each time, if so why not all in one, and/or why the max one?

Copy link
Contributor Author

@steventux steventux Jun 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I struggled to find a way to give me back all the change note ids where there was duplication (defined by grouping document and note) as the grouping clause won't allow the query to return all change note ids. Using max(id) gives me back one dupe only, by looping over this query we delete the dupes by reduction/repetition.
I attempted this by trying the grouping in a sub-query but the outer select could potentially return non-duplicates, at least the various ways I attempted did.
If there's a handy way in SQL of determining the dupes then giving back the ids for all but one in the series it would improve this massively.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! I understand it better now thanks. So it returns just the first id from each collision - I thought it returned just one row at a time.

You can do this to get them all back: ChangeNote.where(edition_id: nil).group(:document_id, :note).having("COUNT(*) > 1").pluck("array_agg(id)")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's much better - thanks will update

@steventux steventux force-pushed the dedupe-specialist-change-history branch 2 times, most recently from f5ed8cc to 8448c09 Compare June 16, 2017 12:54
kevindew
kevindew previously approved these changes Jun 19, 2017
This migration identifies editions with change history embedded in the details hash
which contain duplicate notes.
Some editions contain duplicate change history but are not associated with any change
notes, so we repopulate the change notes from the change history, deduplicating
based on document and note text.
Some of the duplicate change history may be legitimate entries so we clear
the entry in the details hash and represent the edition downstream.
The presenter will regenerate this history correctly based on the ChangeNote records.
These have been de-duplicated in an earlier migration.
@steventux
Copy link
Contributor Author

@kevindew updated, I had to make one small change to this. I'm happy it works now.

@steventux steventux merged commit ce67524 into master Jun 21, 2017
@steventux steventux deleted the dedupe-specialist-change-history branch June 21, 2017 09:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants