De-duplicate embedded specialist edition change history #940

steventux · 2017-06-01T11:20:55Z

https://trello.com/c/NbATcan2/956-de-duplicate-specialist-publisher-embedded-change-history.-(2)

This migration identifies editions whose change history, embedded in the details hash,
contains duplicate notes. Some of these may be legitimate entries, so we clear
the entry in the details hash and represent the edition downstream.
The presenter will then regenerate this history correctly from the ChangeNote records.
The migration took ~58 seconds on a dev VM.

kevindew · 2017-06-15T10:15:58Z

db/migrate/20170612115845_dedupe_embedded_specialist_change_history.rb

+      FROM change_notes
+      WHERE edition_id IS NULL
+      GROUP BY document_id, note
+      HAVING (count(*) > 1)


I'm a bit confused by this query and the loop - maybe I'm totally missing something. I'm assuming we're pulling out one ID each time; if so, why not all in one go, and/or why the max one?

I struggled to find a way to get back all the change note ids where there was duplication (defined by grouping on document and note), as the grouping clause won't allow the query to return all change note ids. Using max(id) gives me back only one dupe per group, so by looping over this query we delete the dupes by reduction/repetition.
I attempted this by putting the grouping in a sub-query, but the outer select could potentially return non-duplicates; at least it did in the various ways I attempted.
If there's a handy way in SQL of determining the dupes and then giving back the ids for all but one in the series, it would improve this massively.
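
To illustrate why the loop is needed, here is a minimal plain-Ruby sketch of the delete-by-reduction approach described above. It is not the migration code: `ChangeNoteRow`, `max_duplicate_ids` and `dedupe_by_reduction` are hypothetical names, and an in-memory array stands in for the change_notes table. Each pass removes only the highest id per duplicated group, so a note that appears three times takes two passes to clear.

```ruby
# Hypothetical in-memory stand-in for a row of the change_notes table.
ChangeNoteRow = Struct.new(:id, :document_id, :edition_id, :note)

# One pass of the query: for each (document_id, note) group that has no
# edition and more than one row, return only the highest id (max(id)).
def max_duplicate_ids(rows)
  rows
    .select { |row| row.edition_id.nil? }
    .group_by { |row| [row.document_id, row.note] }
    .values
    .select { |group| group.size > 1 }
    .map { |group| group.map(&:id).max }
end

# Repeat the pass until no group has duplicates left.
def dedupe_by_reduction(rows)
  remaining = rows.dup
  passes = 0
  loop do
    ids = max_duplicate_ids(remaining)
    break if ids.empty?

    remaining.reject! { |row| ids.include?(row.id) }
    passes += 1
  end
  [remaining, passes]
end

# Made-up sample data for illustration only.
rows = [
  ChangeNoteRow.new(1, 10, nil, "First published."),
  ChangeNoteRow.new(2, 10, nil, "First published."),
  ChangeNoteRow.new(3, 10, nil, "First published."),
  ChangeNoteRow.new(4, 10, nil, "Updated guidance."),
  ChangeNoteRow.new(5, 11, 7, "First published."),
  ChangeNoteRow.new(6, 11, 7, "First published."),
]

deduped, passes = dedupe_by_reduction(rows)
```

With the sample rows, the note that appears three times needs two passes, and the rows that have an edition_id are left alone.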

Ah! I understand it better now, thanks. So it returns just one id (the highest) from each collision; I thought it returned just one row at a time.

You can do this to get them all back: ChangeNote.where(edition_id: nil).group(:document_id, :note).having("COUNT(*) > 1").pluck("array_agg(id)")
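
As a rough sketch of how the grouped ids from that query might be used, the snippet below keeps the lowest id in each group and collects the rest for deletion. The `grouped_ids` values are made up and `surplus_ids` is a hypothetical helper; it assumes each plucked element arrives as a Ruby array of integer ids. Since `array_agg` without an `ORDER BY` does not guarantee element order, the sketch sorts each group before dropping the first id.

```ruby
# Stand-in for the result of the pluck("array_agg(id)") query:
# one array of change note ids per duplicated (document_id, note) pair.
# These values are made up for illustration.
grouped_ids = [[3, 1, 2], [8, 9]]

# Keep the lowest id in each group and collect the rest for deletion.
def surplus_ids(grouped_ids)
  grouped_ids.flat_map { |ids| ids.sort.drop(1) }
end

ids_to_delete = surplus_ids(grouped_ids)

# In a migration, this list could then be passed to something like
#   ChangeNote.where(id: ids_to_delete).delete_all
# which is an assumption here, not code taken from the PR.
```

This removes every duplicate in a single pass, which avoids looping over the query.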

That's much better - thanks will update

This migration identifies editions whose change history, embedded in the details hash, contains duplicate notes. Some editions contain duplicate change history but are not associated with any change notes, so we repopulate the change notes from the change history, deduplicating based on document and note text. Some of the duplicate change history may be legitimate entries, so we clear the entry in the details hash and represent the edition downstream. The presenter will regenerate this history correctly based on the ChangeNote records, which were de-duplicated in an earlier migration.

steventux · 2017-06-20T13:03:07Z

@kevindew updated, I had to make one small change to this. I'm happy it works now.

steventux force-pushed the dedupe-specialist-change-history branch 3 times, most recently from 3d0cc6f to b4e0a44 June 6, 2017 08:24

steventux changed the title ~~De-duplicate embedded specialist edition change history~~ [Do not merge] De-duplicate embedded specialist edition change history Jun 8, 2017

steventux force-pushed the dedupe-specialist-change-history branch from b4e0a44 to bfeb586 June 13, 2017 15:03

steventux changed the title ~~[Do not merge] De-duplicate embedded specialist edition change history~~ De-duplicate embedded specialist edition change history Jun 13, 2017

steventux force-pushed the dedupe-specialist-change-history branch from bfeb586 to e5c9aee June 13, 2017 15:06

kevindew reviewed Jun 15, 2017

steventux force-pushed the dedupe-specialist-change-history branch 2 times, most recently from f5ed8cc to 8448c09 June 16, 2017 12:54

kevindew previously approved these changes Jun 19, 2017

steventux dismissed kevindew’s stale review via df07517 June 20, 2017 13:01

steventux force-pushed the dedupe-specialist-change-history branch from 8448c09 to df07517 June 20, 2017 13:01

kevindew approved these changes Jun 21, 2017

steventux merged commit ce67524 into master Jun 21, 2017

steventux deleted the dedupe-specialist-change-history branch June 21, 2017 09:01
