
Update requeue task to requeue by document type #1009

Merged: 1 commit merged into master on Aug 22, 2017

Conversation

@MatMoore (Contributor) commented Aug 21, 2017

This task was added when the Finding Things team migrated all the links to the
publishing api, to sync everything back to Rummager.

We now need something similar to synchronise all existing content (not just links), so that
Rummager can be a direct downstream consumer of publishing api.

This PR includes a few implementation changes:

  1. Use non-persistent messages, because we want requeues to be as fast and
    low-overhead as possible. Requeued messages are less important than normal
    publishing messages and should be abandoned if RabbitMQ goes down.

  2. Use a separate routing key - `*.bulk.reindex` - to indicate the intent (see
    removed comment). This allows us to define a separate non-durable queue
    for these messages in Rummager; currently the `rummager_govuk_index` queue
    receives everything and is set up as `durable` to survive server restarts.

  3. Filter on `content_store` instead of `state`, because we care about
    unpublished (especially withdrawn) editions as well as published.

  4. Filter by document type for now, as we don't need the whole lot yet. We may
    add a 'requeue everything' task back in later.

Part of https://trello.com/c/5d4Qkt31/244-bulk-requeue-content-by-format / https://trello.com/c/VQsLP52d/246-create-rake-task-to-bulk-republish-by-format
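The non-persistence trade-off in point 1 maps onto AMQP's per-message `delivery_mode` property. A minimal sketch of the idea (the helper name is illustrative, not the actual publishing-api code):

```ruby
# delivery_mode 2 asks the broker to write the message to disk so it
# survives a RabbitMQ restart; delivery_mode 1 keeps it in memory only.
def message_properties(persistent:)
  { delivery_mode: persistent ? 2 : 1, content_type: "application/json" }
end

# Normal publishing messages: must survive a broker restart.
publish_props = message_properties(persistent: true)   # delivery_mode: 2

# Bulk requeue messages: cheap and disposable, dropped if RabbitMQ dies.
requeue_props = message_properties(persistent: false)  # delivery_mode: 1
```

Making the requeue messages transient avoids the disk write the broker would otherwise do for every one of potentially millions of editions.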

@thomasleese (Contributor) reviewed:
Seems fine to me, just had one comment.

Have you tried the task on integration?

```ruby
desc "Add published editions to the message queue by document type"
task :requeue_document_type, [:document_type] => :environment do |_, args|
  document_type = args[:document_type]
  raise ArgumentError, "expecting document_type" if document_type.nil?
```
A contributor commented on this line:

I would do `unless document_type.present?` because it catches empty strings too.
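The distinction the reviewer is drawing can be shown in plain Ruby. ActiveSupport's `present?` treats `nil`, `""`, and whitespace-only strings as blank; `nil?` only catches the first. A sketch with a hypothetical helper (not code from the PR):

```ruby
# Roughly what `!document_type.present?` means for strings: nil,
# empty, or whitespace-only values are all rejected, whereas a bare
# `document_type.nil?` check would let "" and "  " slip through.
def missing_document_type?(document_type)
  document_type.nil? || document_type.strip.empty?
end
```

So `rake requeue_document_type[""]` would still fail fast instead of silently requeueing nothing.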

@MatMoore (Author) replied:
I've only tried it on dev so far, but will test the branch on integration before merging as I also want to add in a new queue on the consumer side.

```ruby
queue_payload = Presenters::EditionPresenter.new(
  edition, draft: false,
).for_message_queue(version)

attr_reader :scope
```
A contributor commented:
This can be one line: `attr_reader :scope, :version`.

@MatMoore commented Aug 21, 2017:

@thomasleese I think there might be another problem, which is that when binding queues to the exchange, we can't easily distinguish between a routing key of `document_type.requeue` (these messages) and `document_type.major`/`minor`/`links`/`unpublish` - at least without setting up multiple bindings. And our puppet module doesn't seem to work with multiple bindings per queue (yet).

I'd like to route these messages to a separate transient queue, rather than having our existing queue handle them, and this would get in the way of that.

My suggestion is to change the routing key to 3 words instead of 2:
`document_type.requeue.bulk`. Then we could bind the main rummager queue to `*.*`. Do you see any problem with having a routing key like this?

@thomasleese replied:
@MatMoore I would say there's nothing wrong with a three word routing key for the moment.

In the future we're planning to do some clean up on the naming of the routing keys so they're more consistent (https://trello.com/c/OqyPl7pc/534-stop-messagequeueeventtype-from-using-update-type), do you think there is anything we should be aware of now in preparation for that? Perhaps when we get round to that, the puppet module will support multiple bindings per queue.

@MatMoore commented Aug 22, 2017:

@thomasleese Cool, I'll proceed with the 3 word key then for now.

Looking at the proposal in the card though - should I be distinguishing between `bulk_republish` and `bulk_reindex`? I've gone with `*.requeue.bulk` here but it could also be `*.reindex.bulk` or `*.bulk.reindex`.

A couple of thoughts on the proposed renames:

Rummager now needs to subscribe to the set of {`major_update`, `minor_update`, `links_update`, `unpublish`} - basically all publishing actions - but I'm assuming it shouldn't subscribe to {`bulk_reindex`, `bulk_republish`}. If we were ever to hook the content store up to the message queue, it would subscribe to the same things, so if we're renaming, it would help a lot to have something common in the routing key for all of those actions.

So if we had `guide.action.major_update` and `guide.action.links_update` vs `guide.bulk.reindex`, we'd bind to `*.action.*` and `*.bulk.reindex` to distinguish between user-initiated publishing actions and developer-initiated bulk republishing.

Multiple bindings would also do the job, if supported, but wildcard matching is simpler and we can definitely do it.
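The wildcard matching being discussed can be sketched with a toy matcher for RabbitMQ topic binding keys: `*` matches exactly one dot-separated word and `#` matches zero or more. This is just an illustration of the matching rule, not the broker's implementation:

```ruby
# Translate a topic binding key into an anchored regexp and test a
# routing key against it. `*` = one word, `#` = any number of words.
def topic_match?(binding_key, routing_key)
  pattern = binding_key.split(".").map do |part|
    case part
    when "*" then "[^.]+"
    when "#" then ".*"
    else Regexp.escape(part)
    end
  end.join("\\.")
  # Note: a real broker also lets `#` absorb an adjacent dot when it
  # matches zero words; that edge case doesn't arise for these keys.
  /\A#{pattern}\z/.match?(routing_key)
end

# A queue bound to "*.action.*" sees user-initiated publishing actions
topic_match?("*.action.*", "guide.action.major_update")  # => true
# ...but not developer-initiated bulk reindexing:
topic_match?("*.action.*", "guide.bulk.reindex")         # => false
topic_match?("*.bulk.reindex", "guide.bulk.reindex")     # => true
```

This is also why a `*.*` binding (rather than `#`) keeps three-word reindex keys out of the existing two-word queue.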

@thomasleese replied:
@MatMoore I would go for reindex actually, since that's what we called it in the proposal. I don't think it matters at the moment whether it's `reindex.bulk` or `bulk.reindex`.

I'll copy your notes into the card on renaming the routing keys as that's useful information to have, thanks!

@MatMoore (Author) replied:
@thomasleese Cool, I've made that change.

```ruby
# because we don't want to send additional email alerts to users.
service.send_message(
  queue_payload,
  routing_key: "#{edition.schema_name}.bulk.reindex",
```
A contributor commented on this line:

Do you know if it's possible to do `event_type: "bulk.reindex"` here? AFAIK it's just string interpolation.

@MatMoore (Author) replied:
I guess so but I thought this was less confusing as you can see what the key looks like all in one place.

The contributor replied:
Yeah, I see what you mean. It's annoying having to include the duplication of `edition.schema_name`, though. It's up to you; I'll approve the PR since it's not major.

MatMoore pushed a commit to alphagov/govuk-puppet that referenced this pull request on Aug 22, 2017:
In RabbitMQ binding keys the wildcard `#` matches any number of dot separated
words, whereas `*.*` ensures it's two words.

This allows us to keep matching `schema.major`, `schema.minor`, `schema.links`,
`schema.unpublish`, while not matching reindexing messages
(alphagov/publishing-api#1009)

I'm going to set up a separate queue to handle these messages.

Note that the puppet module for rabbitmq is not smart enough to update bindings
for an existing queue when the binding key changes, so this will need to be
manually changed after deployment.
@MatMoore merged commit ac0f5f2 into master on Aug 22, 2017, and deleted the requeue-by-type branch at 12:48.