Proposal: A scalable asynchronous Analytics platform #104

Closed
jchauncey opened this issue Jun 7, 2016 · 10 comments

@jchauncey
Member

jchauncey commented Jun 7, 2016

Current Architecture

                          ┌────────┐                            
                          │ Router │                            
                          └────────┘                            
                               │                    ┌──────┐    
                           Log File         ┌──────▶│Logger│    
                               │            │       └──────┘    
                               ▼            │                   
┌────────┐                ┌─────────┐       │                   
│App Logs│───Log file────▶│ fluentd │──UDP/Syslog               
└────────┘                └─────────┘       │       ┌──────────┐
                                            │       │ stdout   │
                                            └──────▶│  metrics │
┌─────────────┐                                     └──────────┘
│ HOST        │          ┌───────────┐          Wire      │     
│  Telegraf   │────┬────▶│ InfluxDB  │◀───────Protocol────┘     
└─────────────┘    │     └───────────┘                          
                   │           │                                
┌─────────────┐    │           │                                
│ HOST        │    │           ▼                                
│  Telegraf   │────┤     ┌──────────┐                           
└─────────────┘    │     │ Grafana  │                           
                   │     └──────────┘                           
┌─────────────┐    │                                            
│ HOST        │    │                                            
│  Telegraf   │────┘                                            
└─────────────┘                                                 

Problem 1: Point to point connections

Right now we have point-to-point connections from N fluentd daemons to logger and stdout-metrics. Every value fluentd receives is immediately sent to both of those components, even if it is not meant for that component.

Problem 2: Communication happens synchronously

Data is written over UDP one packet at a time to both logger and stdout-metrics.

Problem 3: Duplicate UDP packets

kubernetes/kubernetes#25793

Problem 4: Write speed of fluentd -> stdout-metrics

Right now the rate at which we can send data to stdout-metrics is a bottleneck: we cap out at roughly 80 requests per second on the cluster.

Proposed Solution

I would like to propose moving to an asynchronous system for delivering both log messages and metric data in the cluster. The architecture would look something like this:

                        ┌────────┐                            
                        │ Router │                  ┌────────┐
                        └────────┘                  │ Logger │
                            │                       └────────┘
                        Log file                        │    
                            │                           │    
                            ▼                           ▼    
┌────────┐             ┌─────────┐    logs/metrics   ┌─────┐ 
│App Logs│──Log File──▶│ fluentd │───────topics─────▶│ NSQ │ 
└────────┘             └─────────┘                   └─────┘ 
                                                        │    
                                                        │    
┌─────────────┐                                         │    
│ HOST        │                                         ▼    
│  Telegraf   │───┐                                 ┌────────┐
└─────────────┘   │                                 │Telegraf│
                  │                                 └────────┘
┌─────────────┐   │                                     │    
│ HOST        │   │    ┌───────────┐                    │    
│  Telegraf   │───┼───▶│ InfluxDB  │◀────Wire ──────────┘    
└─────────────┘   │    └───────────┘   Protocol       
                  │          ▲                        
┌─────────────┐   │          │                        
│ HOST        │   │          ▼                        
│  Telegraf   │───┘    ┌──────────┐                   
└─────────────┘        │ Grafana  │                   
                       └──────────┘                              

By pushing data from fluentd directly to NSQ, we allow consumers to pull data off the queue as fast or as slow as they like. NSQ is designed to be a fault-tolerant, high-throughput queue that can scale as needed. In this architecture, however, we only have one NSQ instance for simplicity.
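As a rough illustration of what "pushing data to NSQ" means on the wire, here is a minimal Python sketch that publishes one record through nsqd's HTTP `/pub` endpoint. This is not the deis-output plugin (that one is Ruby); the address, the JSON encoding, and the function names are assumptions made for the example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: nsqd is reachable on its default HTTP port on localhost.
NSQD_HTTP = "http://127.0.0.1:4151"


def build_pub_request(topic, record, base_url=NSQD_HTTP):
    """Build (but do not send) a POST to nsqd's /pub endpoint for one record."""
    query = urllib.parse.urlencode({"topic": topic})
    # Assumption: records are serialized as JSON; NSQ itself treats the body as opaque bytes.
    body = json.dumps(record, sort_keys=True).encode("utf-8")
    return urllib.request.Request(f"{base_url}/pub?{query}", data=body, method="POST")


def publish(topic, record, base_url=NSQD_HTTP, timeout=2.0):
    """Send one record to nsqd and return the HTTP status code."""
    request = build_pub_request(topic, record, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

With an nsqd running, `publish("logs", {"app": "demo", "log": "hello"})` would enqueue one message on the `logs` topic. The producer returns as soon as nsqd accepts the message, regardless of how quickly any consumer reads it.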

Metric data published to NSQ is read via the telegraf nsq-consumer plugin.

All data pushed onto NSQ is written to a topic (logs or metrics, in this case).
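To make the topic split concrete, here is a small Python sketch of how a producer might choose a topic per record and drop everything else. The `source` field and the routing rule (router records go to `metrics`, app records go to `logs`) are hypothetical and only loosely follow the diagram above; the real plugin may decide differently.

```python
from typing import Optional

LOGS_TOPIC = "logs"
METRICS_TOPIC = "metrics"


def topic_for(record: dict) -> Optional[str]:
    """Return the topic a record belongs on, or None if it should be dropped.

    Assumption: each record carries a hypothetical "source" field.
    """
    source = record.get("source")
    if source == "router":
        return METRICS_TOPIC
    if source == "app":
        return LOGS_TOPIC
    # Anything else is data we do not care about, so it is never published.
    return None
```

Filtering at the producer keeps unwanted records off the queue entirely, so consumers only see messages meant for them.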

As far as write speed is concerned, here is a screenshot of me making ~800 requests per second from my laptop into the cluster.

[Image: screen shot 2016-06-07 at 4 16 41 pm]

Eventually we could scale out the single NSQ instance, make it more fault tolerant, and have it persist its data, but that isn't a big requirement right now. Having this messaging platform will let us extend async communication to other components, which will let us scale out a cluster without fear of creating bottlenecks.

Example future use case: someone wants to scale an app to 500 pods. We fire off a message to a worker that does that for us and puts a message back on the queue when it's complete. The controller reads that message and can update the user (through websockets, the UI, or whatever).

To see the code for this implementation, visit the following repos:

  • monitor - Mostly just graph changes
  • logger - Updates how we receive log data to use nsq instead of syslog.
  • metrics-consumer - Pulls data from the metrics topic and sends it to influx. Batches in 1000 metrics or 5 seconds, whichever comes first (configurable)
  • fluentd - Builds a new deis-output plugin that is responsible for sending data directly to nsq. It filters out all data that it does not care about.
  • nsq - This is a simple docker image for running nsq on kubernetes
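
The "1000 metrics or 5 seconds, whichever comes first" batching described for metrics-consumer can be sketched roughly as follows. This is an illustrative Python version, not the code in the repo; the class name, the `flush` callback, and the injectable clock are assumptions made for the example.

```python
import time


class Batcher:
    """Buffer items and flush when max_size is reached or max_age seconds have passed."""

    def __init__(self, flush, max_size=1000, max_age=5.0, clock=time.monotonic):
        self.flush = flush        # callable that receives a list of items
        self.max_size = max_size
        self.max_age = max_age
        self.clock = clock        # injectable so the age check can be tested
        self.items = []
        self.started = None       # time the oldest buffered item arrived

    def add(self, item):
        if not self.items:
            self.started = self.clock()
        self.items.append(item)
        self.maybe_flush()

    def maybe_flush(self):
        """Flush if either threshold is met; also call this from a periodic timer."""
        if not self.items:
            return
        too_big = len(self.items) >= self.max_size
        too_old = self.clock() - self.started >= self.max_age
        if too_big or too_old:
            batch, self.items = self.items, []
            self.flush(batch)
```

A consumer would call `add` for every message it reads and `maybe_flush` on a timer, so a slow trickle of metrics still reaches the database within `max_age` seconds.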
@jchauncey jchauncey self-assigned this Jun 7, 2016
@arschles
Member

arschles commented Jun 7, 2016

Until you get the NSQ plugin into Telegraf, why not put the InfluxDB publishing code into the logger?

@jchauncey
Member Author

Separation of concerns mainly. Didn't want to worry about adding the influx code into logger.

@krancour
Contributor

krancour commented Jun 7, 2016

> Separation of concerns mainly. Didn't want to worry about adding the influx code into logger.

Good point, but it can be done relatively cleanly if the Drain interface that used to be in there were resurrected.

Also, another thing to consider... I'm not too familiar with NSQ, but can it implement topics as well? If you would have both the logger and the "metrics consumer" pulling messages from there, you need a topic, not a queue.

@jchauncey
Member Author

Yeah it has topics. I'll update the proposal to show that.
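
For readers unfamiliar with how NSQ topics behave, here is a toy in-memory model in Python. In NSQ, every channel on a topic receives its own copy of each message, while consumers sharing one channel split the messages between them. The `Topic` class and the channel names below are illustrative only and are not part of any NSQ client library.

```python
from collections import defaultdict, deque


class Topic:
    """Toy model of an NSQ topic: each channel gets its own copy of every message."""

    def __init__(self, name):
        self.name = name
        self.channels = defaultdict(deque)

    def subscribe(self, channel):
        """Create the channel if needed and return its queue."""
        return self.channels[channel]

    def publish(self, message):
        # Fan out: copy the message onto every channel of this topic.
        for queue in self.channels.values():
            queue.append(message)


logs = Topic("logs")
logger_queue = logs.subscribe("logger")      # hypothetical channel name
archive_queue = logs.subscribe("archiver")   # hypothetical second consumer group
logs.publish("line 1")
logs.publish("line 2")
```

Both queues now hold `line 1` and `line 2`. Two workers popping from `logger_queue` would each get one of the lines, which is the queue-like behavior within a single channel.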

@jchauncey
Member Author

Updated with repos that have the code for implementing this.

@arschles
Member

arschles commented Jun 8, 2016

@jchauncey if it makes things simpler, you might consider running a container alongside fluentd (in the daemonset) that fluentd can send to, via syslog, on the loopback interface. This way, you don't have to change any fluentd plugins and you can control all the enqueue and dequeue code yourself. I don't believe the extra container or its functionality is technically necessary, just that it could add flexibility.

@jchauncey
Member Author

jchauncey commented Jun 8, 2016

Well, the plugin is great because we really only care about a very small portion of the overall data that fluentd is collecting. So the deis-output plugin lets us filter down to only the data we care about and also provides a nice interface for sending data to both nsq topics. In either case I will need that fork, and it just made more sense to have it in fluentd rather than another app that we have to build and manage.

edit: plus i got to write some ruby =p

@jchauncey
Member Author

So I have a working telegraf plugin for fetching data directly from nsq. That means we could eliminate metrics-consumer from the diagram

@jchauncey
Member Author

Oh, someone asked why there is a dip in the graph. That was me restarting my test with a change that @gerred helped me with.

@jchauncey
Member Author

We are moving forward with this proposal, so I am closing it.
