Proposal: A scalable asynchronous Analytics platform #104
Comments
Until you get the NSQ plugin into Telegraf, why not put the InfluxDB publishing code into the logger?
Separation of concerns, mainly. Didn't want to worry about adding the InfluxDB publishing code to logger.
Good point, but it can be done relatively cleanly. Also, another thing to consider: I'm not too familiar with NSQ, but can it implement topics as well? If you would have both the logger and the "metrics consumer" pulling messages from there, you need a topic, not a queue.
Yeah, it has topics. I'll update the proposal to show that.
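For reference, NSQ's topic/channel model gives exactly the fan-out discussed above: every channel on a topic receives its own copy of each message, while consumers within a single channel split the work. Here is a minimal sketch using the go-nsq client; the topic name, channel names, and the 127.0.0.1:4150 address are illustrative assumptions, not details from this proposal:

```go
package main

import (
	"log"

	nsq "github.com/nsqio/go-nsq"
)

func main() {
	cfg := nsq.NewConfig()

	// Two channels on the same topic: each channel receives its own copy
	// of every message, so both "logger" and "metrics" see everything.
	for _, channel := range []string{"logger", "metrics"} {
		ch := channel
		consumer, err := nsq.NewConsumer("events", ch, cfg)
		if err != nil {
			log.Fatal(err)
		}
		consumer.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
			log.Printf("[%s] got: %s", ch, m.Body)
			return nil // returning nil marks the message as finished (FIN)
		}))
		if err := consumer.ConnectToNSQD("127.0.0.1:4150"); err != nil {
			log.Fatal(err)
		}
	}

	// Publish once; both channels above receive a copy.
	producer, err := nsq.NewProducer("127.0.0.1:4150", cfg)
	if err != nil {
		log.Fatal(err)
	}
	if err := producer.Publish("events", []byte("hello")); err != nil {
		log.Fatal(err)
	}

	select {} // block so the consumer handlers keep running
}
```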
Updated with repos that have the code for implementing this.
@jchauncey if it makes things simpler, you might consider running a container alongside fluentd (in the daemonset) that fluentd can send to, via syslog, on the loopback interface. This way, you don't have to change any fluentd plugins, and you can control all the enqueue and dequeue code yourself. I don't believe the extra container or its functionality is technically necessary, just that it could add flexibility.
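If someone did go that sidecar route, the container could be little more than a UDP listener on loopback that republishes each syslog line into NSQ. A rough sketch under those assumptions; the 5140 port, the "logs" topic, and the absence of real syslog parsing are all placeholders:

```go
package main

import (
	"log"
	"net"

	nsq "github.com/nsqio/go-nsq"
)

func main() {
	// fluentd's syslog output would point at this loopback address
	// (port 5140 is an arbitrary unprivileged choice for this sketch).
	conn, err := net.ListenPacket("udp", "127.0.0.1:5140")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	producer, err := nsq.NewProducer("127.0.0.1:4150", nsq.NewConfig())
	if err != nil {
		log.Fatal(err)
	}

	buf := make([]byte, 65536)
	for {
		n, _, err := conn.ReadFrom(buf)
		if err != nil {
			log.Printf("read: %v", err)
			continue
		}
		// Republish the raw syslog line; a real sidecar would parse
		// the priority/tag and route to the appropriate topic.
		msg := make([]byte, n)
		copy(msg, buf[:n])
		if err := producer.Publish("logs", msg); err != nil {
			log.Printf("publish: %v", err)
		}
	}
}
```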
Well, the plugin is great because we really only care about a very small portion of the overall data that fluentd is collecting. Edit: plus I got to write some Ruby =p
So I have a working telegraf plugin for fetching data directly from NSQ. That means we could eliminate stdout-metrics.
Oh, someone asked about the dip in the graph. That was me restarting my test with a change that @gerred helped me with.
We are moving forward with this proposal, so I am closing it.
Current Architecture
Problem 1: Point-to-point connections
Right now we have point-to-point connections from `N` fluentd daemons to `logger` and `stdout-metrics`. For every value that is received by fluentd, we immediately send that value to both of those components (even if it is not meant for that component).

Problem 2: Communication happens synchronously
Data is written over UDP one packet at a time to both `logger` and `stdout-metrics`.

Problem 3: Duplicate UDP packets
See kubernetes/kubernetes#25793.
Problem 4: Write speed of fluentd -> stdout-metrics
Right now we see a bottleneck in how fast we can send data to stdout-metrics: we cap out at roughly 80 requests per second on the cluster.
Proposed Solution
I would like to propose moving to an asynchronous system for delivering both log messages and metric data in the cluster. The architecture would look something like this:

(architecture diagram)
By pushing data from `fluentd` directly to NSQ, we allow the consumers of data to pull off the queue as fast or as slow as they desire. NSQ is written to be a fault-tolerant, high-throughput queue that can scale as you need it. In this architecture, however, we only have one NSQ instance for simplicity.

Metric data published to NSQ is read via the telegraf nsq-consumer plugin.
All data pushed onto NSQ is written to a topic (`logs` or `metrics`, in this case).
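To make that topic split concrete, the publish side just picks a topic per record. A small sketch of that routing; the publishRecord helper, the tag-prefix convention, and the address are hypothetical, invented for illustration:

```go
package main

import (
	"log"
	"strings"

	nsq "github.com/nsqio/go-nsq"
)

// publishRecord routes a single record onto the right NSQ topic.
// Treating any record whose tag starts with "metric." as metric data
// is an assumption made for this sketch, not the plugin's convention.
func publishRecord(p *nsq.Producer, tag string, body []byte) error {
	topic := "logs"
	if strings.HasPrefix(tag, "metric.") {
		topic = "metrics"
	}
	return p.Publish(topic, body)
}

func main() {
	producer, err := nsq.NewProducer("127.0.0.1:4150", nsq.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	if err := publishRecord(producer, "metric.cpu", []byte(`cpu usage=0.42`)); err != nil {
		log.Fatal(err)
	}
	if err := publishRecord(producer, "app.web", []byte("GET / 200")); err != nil {
		log.Fatal(err)
	}
}
```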
As far as write speed is concerned, here is a picture of me making ~800 requests per second from my laptop into the cluster.

(throughput graph showing ~800 requests per second)
Eventually we could scale out the single NSQ instance, make it more fault tolerant, and use persistent data, but that isn't a big requirement right now. Having this messaging platform will allow us to expand async communication to other components, which will let us scale out a cluster without fear of creating bottlenecks.
Example future use case: someone wants to scale an app to 500 pods. We fire off a message to a worker that does that for us and puts a message back on the queue when it's complete. The controller reads that message and can update the user (through websockets, the UI, or whatever). A sketch of that flow follows.
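Here is what that request/reply flow could look like over NSQ. The scale-requests/scale-results topic names, the ScaleRequest/ScaleResult message shapes, and the stubbed-out scaling step are all invented for illustration:

```go
package main

import (
	"encoding/json"
	"log"

	nsq "github.com/nsqio/go-nsq"
)

// ScaleRequest and ScaleResult are hypothetical message shapes for this sketch.
type ScaleRequest struct {
	App      string `json:"app"`
	Replicas int    `json:"replicas"`
}

type ScaleResult struct {
	App string `json:"app"`
	OK  bool   `json:"ok"`
}

func main() {
	cfg := nsq.NewConfig()

	producer, err := nsq.NewProducer("127.0.0.1:4150", cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Worker side: consume scale requests, do the work, and publish a
	// result message that the controller can relay to the user.
	worker, err := nsq.NewConsumer("scale-requests", "worker", cfg)
	if err != nil {
		log.Fatal(err)
	}
	worker.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
		var req ScaleRequest
		if err := json.Unmarshal(m.Body, &req); err != nil {
			return err
		}
		log.Printf("scaling %s to %d pods", req.App, req.Replicas)
		// ... actually scale the app here ...
		res, _ := json.Marshal(ScaleResult{App: req.App, OK: true})
		return producer.Publish("scale-results", res)
	}))
	if err := worker.ConnectToNSQD("127.0.0.1:4150"); err != nil {
		log.Fatal(err)
	}

	// Controller side: fire off the request and go do other things.
	req, _ := json.Marshal(ScaleRequest{App: "web", Replicas: 500})
	if err := producer.Publish("scale-requests", req); err != nil {
		log.Fatal(err)
	}

	select {}
}
```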
To see the code for this implementation, visit the following repos: