Proposal: A scalable asynchronous Analytics platform #104

jchauncey · 2016-06-07T23:19:19Z

Current Architecture

                          ┌────────┐                            
                          │ Router │                            
                          └────────┘                            
                               │                    ┌──────┐    
                           Log File         ┌──────▶│Logger│    
                               │            │       └──────┘    
                               ▼            │                   
┌────────┐                ┌─────────┐       │                   
│App Logs│───Log file────▶│ fluentd │──UDP/Syslog               
└────────┘                └─────────┘       │       ┌──────────┐
                                            │       │ stdout   │
                                            └──────▶│  metrics │
┌─────────────┐                                     └──────────┘
│ HOST        │          ┌───────────┐          Wire      │     
│  Telegraf   │────┬────▶│ InfluxDB  │◀───────Protocol────┘     
└─────────────┘    │     └───────────┘                          
                   │           │                                
┌─────────────┐    │           │                                
│ HOST        │    │           ▼                                
│  Telegraf   │────┤     ┌──────────┐                           
└─────────────┘    │     │ Grafana  │                           
                   │     └──────────┘                           
┌─────────────┐    │                                            
│ HOST        │    │                                            
│  Telegraf   │────┘                                            
└─────────────┘

Problem 1: Point to point connections

Right now we have point-to-point connections from N fluentd daemons to logger and stdout-metrics. Every value that fluentd receives is immediately sent to both of those components, even if it is not meant for that component.

Problem 2: Communication happens synchronously

Data is written over UDP one packet at a time to both logger and stdout-metrics.
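
To make Problems 1 and 2 concrete, here is a rough sketch of the fan-out described above. It is not the actual fluentd configuration: the `fan_out` helper and the two destination addresses are made up for illustration, and only the "one datagram per record per destination" behavior comes from the description.

```python
import socket

# Hypothetical addresses standing in for logger and stdout-metrics.
DESTINATIONS = [("127.0.0.1", 1514), ("127.0.0.1", 1515)]


def fan_out(sock, record, destinations=DESTINATIONS):
    """Send one record to every destination, one datagram each.

    Nothing here checks whether a destination actually wants the record,
    and the caller cannot move on to the next record until every sendto
    call has returned.
    """
    payload = record.encode("utf-8")
    sent = 0
    for address in destinations:
        sock.sendto(payload, address)
        sent += 1
    return sent
```

With a real socket this would be called as `fan_out(socket.socket(socket.AF_INET, socket.SOCK_DGRAM), line)` for every line, so N records always cost 2N datagrams no matter who needs them.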

Problem 3: Duplicate UDP packets

kubernetes/kubernetes#25793

Problem 4: Write speed of fluentd -> stdout-metrics

Right now the rate at which we can send data to stdout-metrics is a bottleneck: we cap out at roughly 80 requests per second on the cluster.

Proposed Solution

I would like to propose moving to an asynchronous system for delivering both log messages and metric data in the cluster. The architecture would look something like this:

                        ┌────────┐                            
                        │ Router │                  ┌────────┐
                        └────────┘                  │ Logger │
                            │                       └────────┘
                        Log file                        │    
                            │                           │    
                            ▼                           ▼    
┌────────┐             ┌─────────┐    logs/metrics   ┌─────┐ 
│App Logs│──Log File──▶│ fluentd │───────topics─────▶│ NSQ │ 
└────────┘             └─────────┘                   └─────┘ 
                                                        │    
                                                        │    
┌─────────────┐                                         │    
│ HOST        │                                         ▼    
│  Telegraf   │───┐                                 ┌────────┐
└─────────────┘   │                                 │Telegraf│
                  │                                 └────────┘
┌─────────────┐   │                                     │    
│ HOST        │   │    ┌───────────┐                    │    
│  Telegraf   │───┼───▶│ InfluxDB  │◀────Wire ──────────┘    
└─────────────┘   │    └───────────┘   Protocol       
                  │          ▲                        
┌─────────────┐   │          │                        
│ HOST        │   │          ▼                        
│  Telegraf   │───┘    ┌──────────┐                   
└─────────────┘        │ Grafana  │                   
                       └──────────┘

By pushing data from fluentd directly to NSQ we allow the consumers of that data to pull off the queue as fast or as slow as they desire. NSQ is written to be a fault-tolerant, high-throughput queue that can scale as you need it. In this architecture, however, we only have one NSQ instance for simplicity.

Metric data published to NSQ is read via the telegraf nsq-consumer plugin.

All data pushed onto NSQ is written to a topic (logs or metrics, in this case).
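
As a rough illustration of writing to a topic, the sketch below uses nsqd's HTTP `/pub` endpoint, which takes the topic as a query parameter and the message as the request body. This is only one way to publish; the fluentd plugin may well use NSQ's TCP protocol through a client library instead. The address (nsqd's default HTTP port 4151 on localhost) and the helper names are assumptions.

```python
import urllib.parse
import urllib.request

# Assumed address of the single nsqd instance (4151 is nsqd's default HTTP port).
NSQD_HTTP = "http://127.0.0.1:4151"


def build_pub_request(topic, message, base_url=NSQD_HTTP):
    """Build (but do not send) a POST to nsqd's /pub endpoint for one message."""
    query = urllib.parse.urlencode({"topic": topic})
    return urllib.request.Request(
        url=f"{base_url}/pub?{query}",
        data=message.encode("utf-8"),
        method="POST",
    )


def publish(topic, message, base_url=NSQD_HTTP, timeout=2.0):
    """Send one message to the given topic and return the HTTP status code."""
    request = build_pub_request(topic, message, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Against a running nsqd, `publish("logs", "some log line")` and `publish("metrics", "some metric line")` would land on the two topics from the diagram.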

As far as write speed is concerned, here is a picture of me making ~800 requests per second from my laptop into the cluster.

Eventually we could scale out the single NSQ instance, make it more fault tolerant, and use persistent storage, but that isn't a big requirement right now. Having this messaging platform will let us extend async communication to other components, so we can scale out a cluster without fear of creating bottlenecks.

Example future use case: someone wants to scale an app to 500 pods. We fire off a message to a worker that does that for us and puts a message back on the queue when it's complete. The controller reads that message and can update the user (through websockets, the UI, or whatever).

To see the code for this implementation, visit the following repos:

monitor - Mostly just graph changes.
logger - Updates how we receive log data to use NSQ instead of syslog.
metrics-consumer - Pulls data from the metrics topic and sends it to InfluxDB. Batches by 1000 metrics or 5 seconds, whichever comes first (configurable).
fluentd - Builds a new deis-output plugin that is responsible for sending data directly to NSQ. It filters out all data that it does not care about.
nsq - A simple Docker image for running NSQ on Kubernetes.
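
The "1000 metrics or 5 seconds" batching rule for metrics-consumer can be sketched roughly like this. It is not the code from the repo: the `Batcher` class, its parameters, and the injectable clock are all made up to show the rule, and `flush` stands in for whatever writes a batch to InfluxDB.

```python
import time


class Batcher:
    """Collect items and flush when the batch is full or too old."""

    def __init__(self, flush, max_size=1000, max_age=5.0, clock=time.monotonic):
        self._flush = flush        # callable that receives a list of items
        self._max_size = max_size  # flush once this many items are buffered
        self._max_age = max_age    # or once the oldest item is this old (seconds)
        self._clock = clock
        self._items = []
        self._started = None       # time the first item of the batch arrived

    def add(self, item):
        if not self._items:
            self._started = self._clock()
        self._items.append(item)
        self.maybe_flush()

    def maybe_flush(self):
        """Flush if either limit is reached; also call this when idle."""
        if not self._items:
            return False
        full = len(self._items) >= self._max_size
        stale = self._clock() - self._started >= self._max_age
        if full or stale:
            batch, self._items = self._items, []
            self._flush(batch)
            return True
        return False
```

A consumer loop would call `add` for each message it reads and `maybe_flush` on a timer, so a slow trickle of metrics still gets written within the age limit.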


arschles · 2016-06-07T23:23:34Z

Until you get the NSQ plugin into Telegraf, why not put the InfluxDB publishing code into the logger?

jchauncey · 2016-06-07T23:25:49Z

Separation of concerns mainly. Didn't want to worry about adding the influx code into logger.

krancour · 2016-06-07T23:28:33Z

> Separation of concerns mainly. Didn't want to worry about adding the influx code into logger.

Good point, but it can be done relatively cleanly if the Drain interface that used to be in there were resurrected.

Also, another thing to consider... I'm not too familiar with NSQ, but can it implement topics as well? If you would have both the logger and the "metrics consumer" pulling messages from there, you need a topic, not a queue.

jchauncey · 2016-06-07T23:33:49Z

Yeah it has topics. I'll update the proposal to show that.
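
For anyone not familiar with NSQ, here is a toy in-memory model of the topic idea: messages are published to a named topic, and each channel subscribed to that topic gets its own copy. This is a simplification for illustration only (it only delivers to channels that already exist, and there is no networking). The channel names `logger` and `telegraf` are made up.

```python
from collections import defaultdict, deque


class Topic:
    """Toy model of one NSQ topic: every channel gets a copy of each message."""

    def __init__(self):
        self.channels = defaultdict(deque)

    def subscribe(self, channel):
        """Return the queue for a channel, creating it if needed."""
        return self.channels[channel]

    def publish(self, message):
        # Fan out: append a copy of the message to every channel's queue.
        for queue in self.channels.values():
            queue.append(message)


topics = {"logs": Topic(), "metrics": Topic()}
logger_queue = topics["logs"].subscribe("logger")
telegraf_queue = topics["metrics"].subscribe("telegraf")

topics["logs"].publish("app log line")
topics["metrics"].publish("router request metric")
```

In this model the logger only ever sees what was published to `logs`, and the metrics consumer only sees `metrics`, which is the separation the proposal is after.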

jchauncey · 2016-06-08T15:39:32Z

Updated with repos that have the code for implementing this.

arschles · 2016-06-08T16:47:45Z

@jchauncey if it makes things simpler, you might consider running a container alongside fluentd (in the daemonset) that fluentd can send to, via syslog, over the loopback interface. This way, you don't have to change any fluentd plugins and you can control all the enqueue and dequeue code yourself. I don't believe the extra container or its functionality is technically necessary, just that it could add flexibility.

jchauncey · 2016-06-08T16:52:53Z

Well, the plugin is great because we really only care about a very small portion of the overall data that fluentd is collecting. The deis-output plugin lets us keep only the data we care about and also provides a nice interface for sending data to both NSQ topics. In either case I will need that fork, and it just made more sense to have it in fluentd rather than in another app that we have to build and manage.

edit: plus i got to write some ruby =p
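
The filter-then-route idea can be sketched like this. The real plugin is Ruby and its rules are not reproduced here; the record shape, the `app` label check, and the `deis-router` container name are all illustrative guesses, and `route` is a made-up helper.

```python
def route(record):
    """Return the NSQ topics a record should be published to.

    An empty list means the record is dropped. The rules below are
    illustrative guesses, not the plugin's real logic.
    """
    kubernetes = record.get("kubernetes", {})
    labels = kubernetes.get("labels", {})
    container = kubernetes.get("container_name", "")

    topics = []
    if "app" in labels:  # hypothetical: application pods carry an "app" label
        topics.append("logs")
    if container == "deis-router":  # hypothetical router container name
        topics.append("metrics")
    return topics
```

Anything that matches neither rule never leaves the node, which is the point of doing the filtering inside fluentd rather than downstream.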

jchauncey · 2016-06-09T20:54:57Z

So I have a working telegraf plugin for fetching data directly from NSQ. That means we could eliminate metrics-consumer from the diagram.

jchauncey · 2016-06-09T20:56:21Z

Oh, someone asked why the dip in the graph. That was me restarting my test with a change that @gerred helped me with.

jchauncey · 2016-06-24T17:49:10Z

We are moving forward with this proposal so I am closing it.

jchauncey added the proposal label Jun 7, 2016

jchauncey self-assigned this Jun 7, 2016

krancour mentioned this issue Jun 15, 2016

Proposal: Use Redis for storing logs deis/logger#87

Closed

jchauncey closed this as completed Jun 24, 2016
