
Bulk processing for workers #82

Closed
mauricio wants to merge 2 commits

Conversation

@mauricio (Collaborator) commented Feb 7, 2014

The rationale behind this is that sometimes you have workers that publish work in units but would be better served by taking the work in batches.

The best example of this is solr/elasticsearch integration: most of the time you will want to publish every change to them, but it's much more efficient to process those changes in batches, since both offer batch/bulk update interfaces. So, instead of incurring the request cost for every single document, you pay it once for every 10 documents.

Other than that, SQS itself has comparatively high latency, so taking in several messages per request is much better than taking in just one.

This is mostly a proof of concept, but I have hacked a solution to do this before and it would be nice to have the queue supporting this out of the box.
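
For context, fetching a batch from SQS is a single API call. Here is a minimal sketch using the current aws-sdk-sqs gem (which postdates this PR; the queue URL is a placeholder):

    require 'aws-sdk-sqs'

    sqs = Aws::SQS::Client.new(region: 'us-east-1')
    queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/example' # placeholder

    # One request returns up to 10 messages, so the SQS round-trip
    # latency is paid once per batch instead of once per message.
    resp = sqs.receive_message(
      queue_url: queue_url,
      max_number_of_messages: 10,
      wait_time_seconds: 10 # long polling also cuts down on empty receives
    )

    resp.messages.each do |message|
      # process/bulk-index the message body here, then acknowledge it
      sqs.delete_message(queue_url: queue_url, receipt_handle: message.receipt_handle)
    end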

mauricio added a commit to mauricio/qu that referenced this pull request Feb 7, 2014
This is already part of bkeepers#82, but given that it still needs a lot of work, it would be nice to have this in beforehand. If `true` is not returned here, it will never see that the services are actually running and won't make all specs run on `unattended_spec`.
@jnunemaker (Collaborator)

This one is going to take me a bit to review.

@mauricio (Collaborator, Author) commented Feb 8, 2014

The build will stay broken for a bit until iain/fake_sqs#9 goes in.

@mauricio (Collaborator, Author)

@jnunemaker ping

jnunemaker self-assigned this Feb 16, 2014

module Qu
  module Backend
    # Internal: Backend that wraps all backends with instrumentation.
    class Instrumented < Base
      extend Forwardable
      include Wrapper
Inline review comment from a collaborator on this diff:

I like this fine.

@jnunemaker (Collaborator)

Overall, I love the idea of making batching possible. I also love all the hard work you put in. It is far easier for me to critique than to fully work out the solution.

The thing this pull is really missing is a good abstraction for batches. A few spots are tied directly to SQS functionality rather than to batching in general.

As I think about it more, I almost wonder if the right abstraction is just a BatchPayload and a new backend for any queue that can handle batch operations. For example, here is some pseudo ruby for SQS and batching:

class SQSBatch
  class BatchPayload
    attr_reader :payloads

    def initialize(payloads)
      @payloads = payloads
    end

    def perform
      payloads.each(&:perform)
    end
  end

  def pop(queue_name = 'default')
    payloads = sqs.receive_messages(limit: 10).map { |message|
      create_payload(message)
    }

    BatchPayload.new(payloads)
  end
end

I don't doubt that this will require some thought on how to ensure that batch payloads have the same semantics as a single payload, but I think that is probably the route I would want to take. It avoids qu itself having any batch knowledge (though it may need a slimming-down of the stock payload interface), which I like because it means the abstraction in qu is correct: it is flexible enough to work with anything that can pop and push.

Does any of this make sense, or am I just blabbering?

@jnunemaker (Collaborator)

There are several places where qu assumes the payload has certain methods, such as `queue`, `klass`, etc. Those are what we would need to abstract a bit so qu can work with either single or batch payloads.
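
For illustration, one way to abstract that is to delegate the attributes qu reads to the first payload in the batch, so a batch duck-types as a single payload. This is a sketch under the assumption that all messages in a batch share a queue and job class; qu's actual payload surface may differ:

    require 'forwardable'

    # Sketch only: make a batch quack like a single payload by delegating
    # the attributes qu reads (queue, klass) to the first member. Assumes
    # every message in a batch shares the same queue and job class.
    class BatchPayload
      extend Forwardable

      def_delegators :first_payload, :queue, :klass

      attr_reader :payloads

      def initialize(payloads)
        @payloads = payloads
      end

      def perform
        payloads.each(&:perform)
      end

      private

      def first_payload
        payloads.first
      end
    end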

@mauricio (Collaborator, Author)

WOW.

@jnunemaker ok, gonna work on this feedback and thanks for taking a look at this!

It's important to have another set of eyes on it, since it introduces a somewhat different way of processing stuff.

@jnunemaker (Collaborator)

> WOW.

Hope that is a good wow!

@mauricio (Collaborator, Author)

@jnunemaker that's a much comments, very feedback wow, hahaha.

@jnunemaker (Collaborator)

Lol.

@mauricio (Collaborator, Author)

@jnunemaker the main problem I see with this:

    def perform
      payloads.each(&:perform)
    end

is that it wouldn't allow me to send the payloads as a single entity (in the case of batch indexing, for instance).

The job would have to receive all the items at once so it could process them in one shot rather than one by one.

@jnunemaker (Collaborator)

> is that it wouldn't allow me to send the payloads as a single entity (in the case of batch indexing, for instance).

Well, what I mean is more that the payload could be the abstraction. We could even allow people to swap in their own payload mechanism. Then you could have a batch payload that "does the right thing".
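
For illustration, "doing the right thing" could mean the batch payload hands every item to the job in one call when the job opts in. In this sketch, `perform_batch` and the `args` accessor are hypothetical, not part of qu's API:

    # Sketch only: perform_batch and args are assumed, not qu's actual API.
    class BatchPayload
      attr_reader :payloads

      def initialize(payloads)
        @payloads = payloads
      end

      def perform
        klass = payloads.first.klass

        if klass.respond_to?(:perform_batch)
          # Hand the whole batch to the job in one call, e.g. a single
          # bulk request to solr/elasticsearch.
          klass.perform_batch(payloads.map(&:args))
        else
          # Job doesn't opt in: fall back to one-at-a-time processing.
          payloads.each(&:perform)
        end
      end
    end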

@mauricio (Collaborator, Author)

Makes sense, gonna dig more into this and build something to look at how it would be used.

@mauricio (Collaborator, Author)

This is for another PR, but the current worker implementation could take more than one job at a time if the runner handled it. I need to write more tests for the forking runner to see this in action, and to flesh out a thread-pool-based runner as well; both would allow the main worker to run more than one job at a time. A rough sketch of the thread-pool idea follows below.
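
As a minimal standalone sketch of that thread-pool idea (the `run`/`stop` interface is an assumption, not qu's actual runner API):

    # Sketch of a thread-pool-based runner; the run/stop interface is
    # assumed, not qu's actual runner API.
    class ThreadPoolRunner
      def initialize(size = 4)
        @jobs    = Queue.new
        @threads = Array.new(size) do
          Thread.new do
            # Queue#pop blocks until work arrives; nil is the shutdown sentinel.
            while (payload = @jobs.pop)
              begin
                payload.perform
              rescue => e
                warn "job failed: #{e.message}"
              end
            end
          end
        end
      end

      # Hand a payload to the pool; a free thread picks it up.
      def run(payload)
        @jobs << payload
      end

      # Signal shutdown and wait for in-flight jobs to finish.
      def stop
        @threads.size.times { @jobs << nil }
        @threads.each(&:join)
      end
    end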

@jnunemaker (Collaborator)

👍 to small iterations on the worker and payload rather than a big PR. Big PRs are a lot harder to digest.

@jnunemaker (Collaborator)

Closing this due to staleness. Feel free to pick it back up someday.

jnunemaker closed this Jun 14, 2014
@mauricio (Collaborator, Author)

Sure, still haven't found a nice way to implement this :(
