
Update requeue task to requeue by document type #1009

Merged: 1 commit merged into master on Aug 22, 2017

Conversation

@MatMoore (Contributor) commented Aug 21, 2017

This task was added when the Finding Things team migrated all the links to the
publishing api, to sync everything back to Rummager.

We now need something similar to synchronise all existing content (not just links), so that
Rummager can be a direct downstream consumer of publishing api.

This PR includes a few implementation changes:

  1. Use non-persistent messages, because we want requeues to be as fast and
    low-overhead as possible. Requeued messages are less important than normal
    publishing messages and should be abandoned if RabbitMQ goes down.

  2. Use a separate routing key - `*.bulk.reindex` - to indicate the intent (see
    removed comment). This allows us to define a separate non-durable queue
    for these messages in Rummager; currently the `rummager_govuk_index` queue
    receives everything and is set up as `durable` to survive server restarts.

  3. Filter on `content_store` instead of `state`, because we care about
    unpublished (especially withdrawn) editions as well as published.

  4. Filter by document type for now, as we don't need the whole lot yet. We may
    add a 'requeue everything' task back in later.

Part of https://trello.com/c/5d4Qkt31/244-bulk-requeue-content-by-format / https://trello.com/c/VQsLP52d/246-create-rake-task-to-bulk-republish-by-format
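The non-persistence trade-off in point 1 maps onto AMQP's per-message `delivery_mode` property. A minimal sketch of the idea (the helper name is illustrative, not the actual publishing-api code):

```ruby
# delivery_mode 2 asks the broker to write the message to disk so it
# survives a RabbitMQ restart; delivery_mode 1 keeps it in memory only.
def message_properties(persistent:)
  { delivery_mode: persistent ? 2 : 1, content_type: "application/json" }
end

# Normal publishing messages: must survive a broker restart.
publish_props = message_properties(persistent: true)   # delivery_mode: 2

# Bulk requeue messages: cheap and disposable, dropped if RabbitMQ dies.
requeue_props = message_properties(persistent: false)  # delivery_mode: 1
```

Making the requeue messages transient avoids the disk write the broker would otherwise do for every one of potentially millions of editions.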

@thomasleese (Contributor) reviewed:
Seems fine to me, just had one comment.

Have you tried the task on integration?

```ruby
desc "Add published editions to the message queue by document type"
task :requeue_document_type, [:document_type] => :environment do |_, args|
  document_type = args[:document_type]
  raise ArgumentError, "expecting document_type" if document_type.nil?
```
A contributor commented on this line:

I would do `unless document_type.present?` because it catches empty strings too.
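The distinction the reviewer is drawing can be shown in plain Ruby. ActiveSupport's `present?` treats `nil`, `""`, and whitespace-only strings as blank; `nil?` only catches the first. A sketch with a hypothetical helper (not code from the PR):

```ruby
# Roughly what `!document_type.present?` means for strings: nil,
# empty, or whitespace-only values are all rejected, whereas a bare
# `document_type.nil?` check would let "" and "  " slip through.
def missing_document_type?(document_type)
  document_type.nil? || document_type.strip.empty?
end
```

So `rake requeue_document_type[""]` would still fail fast instead of silently requeueing nothing.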

@MatMoore (Author) replied:
I've only tried it on dev so far, but will test the branch on integration before merging as I also want to add in a new queue on the consumer side.

```ruby
queue_payload = Presenters::EditionPresenter.new(
  edition, draft: false,
).for_message_queue(version)

attr_reader :scope
```
A contributor commented:
This can be one line: `attr_reader :scope, :version`.

@MatMoore commented Aug 21, 2017:

@thomasleese I think there might be another problem, which is that when binding queues to the exchange, we can't easily distinguish between a routing key of `document_type.requeue` (these messages) and `document_type.major`/`minor`/`links`/`unpublish` - at least without setting up multiple bindings. And our puppet module doesn't seem to work with multiple bindings per queue (yet).

I'd like to route these messages to a separate transient queue, rather than having our existing queue handle them, and this would get in the way of that.

My suggestion is to change the routing key to 3 words instead of 2:
`document_type.requeue.bulk`. Then we could bind the main rummager queue to `*.*`. Do you see any problem with having a routing key like this?

@thomasleese replied:
@MatMoore I would say there's nothing wrong with a three word routing key for the moment.

In the future we're planning to do some clean up on the naming of the routing keys so they're more consistent (https://trello.com/c/OqyPl7pc/534-stop-messagequeueeventtype-from-using-update-type), do you think there is anything we should be aware of now in preparation for that? Perhaps when we get round to that, the puppet module will support multiple bindings per queue.

@MatMoore commented Aug 22, 2017:

@thomasleese Cool, I'll proceed with the 3 word key then for now.

Looking at the proposal in the card though - should I be distinguishing between `bulk_republish` and `bulk_reindex`? I've gone with `*.requeue.bulk` here but it could also be `*.reindex.bulk` or `*.bulk.reindex`.

A couple of thoughts on the proposed renames:

Rummager now needs to subscribe to the set of {`major_update`, `minor_update`, `links_update`, `unpublish`} - basically all publishing actions - but I'm assuming it shouldn't subscribe to {`bulk_reindex`, `bulk_republish`}. If we were ever to hook the content store up to the message queue, it would subscribe to the same things, so if we're renaming, it would help a lot to have something common in the routing key for all of those actions.

So if we had `guide.action.major_update` and `guide.action.links_update` vs `guide.bulk.reindex`, we'd bind to `*.action.*` and `*.bulk.reindex` to distinguish between user-initiated publishing actions and developer-initiated bulk republishing.

Multiple bindings would also do the job, if supported, but wildcard matching is simpler and we can definitely do it.
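The wildcard matching being discussed can be sketched with a toy matcher for RabbitMQ topic binding keys: `*` matches exactly one dot-separated word and `#` matches zero or more. This is just an illustration of the matching rule, not the broker's implementation:

```ruby
# Translate a topic binding key into an anchored regexp and test a
# routing key against it. `*` = one word, `#` = any number of words.
def topic_match?(binding_key, routing_key)
  pattern = binding_key.split(".").map do |part|
    case part
    when "*" then "[^.]+"
    when "#" then ".*"
    else Regexp.escape(part)
    end
  end.join("\\.")
  # Note: a real broker also lets `#` absorb an adjacent dot when it
  # matches zero words; that edge case doesn't arise for these keys.
  /\A#{pattern}\z/.match?(routing_key)
end

# A queue bound to "*.action.*" sees user-initiated publishing actions
topic_match?("*.action.*", "guide.action.major_update")  # => true
# ...but not developer-initiated bulk reindexing:
topic_match?("*.action.*", "guide.bulk.reindex")         # => false
topic_match?("*.bulk.reindex", "guide.bulk.reindex")     # => true
```

This is also why a `*.*` binding (rather than `#`) keeps three-word reindex keys out of the existing two-word queue.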

@thomasleese replied:
@MatMoore I would go for reindex actually, since that's what we called it in the proposal. I don't think it matters at the moment whether it's `reindex.bulk` or `bulk.reindex`.

I'll copy your notes into the card on renaming the routing keys as that's useful information to have, thanks!

@MatMoore (Author) replied:
@thomasleese Cool, I've made that change.

```ruby
# because we don't want to send additional email alerts to users.
service.send_message(
  queue_payload,
  routing_key: "#{edition.schema_name}.bulk.reindex",
```
A contributor commented on this line:

Do you know if it's possible to do `event_type: "bulk.reindex"` here? AFAIK it's just string interpolation.

@MatMoore (Author) replied:
I guess so but I thought this was less confusing as you can see what the key looks like all in one place.

The contributor replied:
Yeah, I see what you mean. It's annoying having to include the duplication of `edition.schema_name`, though. It's up to you; I'll approve the PR since it's not major.

MatMoore pushed a commit to alphagov/govuk-puppet that referenced this pull request on Aug 22, 2017:
In RabbitMQ binding keys the wildcard `#` matches any number of dot separated
words, whereas `*.*` ensures it's two words.

This allows us to keep matching `schema.major`, `schema.minor`, `schema.links`,
`schema.unpublish`, while not matching reindexing messages
(alphagov/publishing-api#1009)

I'm going to set up a separate queue to handle these messages.

Note that the puppet module for rabbitmq is not smart enough to update bindings
for an existing queue when the binding key changes, so this will need to be
manually changed after deployment.
@MatMoore merged commit ac0f5f2 into master on Aug 22, 2017, and deleted the requeue-by-type branch at 12:48.