# Suricate - Open Data Science Service/Platform

Suricate is a simple Python-based analytics service, originally designed to analyse data streams coming from DTrace in order to learn application/service behaviour in a data center or cloud. Models could be built from that data, and actions (for optimization, fault detection, ...) could then be triggered via AMQP based on the models and new incoming data.

With this first release it has become more of a general-purpose tool. The DTrace part is stripped out for now.

## Usage

The following steps walk through the basic usage of the service.

### Step 1 - get the data

Create a simple JSON file:

```json
{
  "server1": [10, 20, 30, 10, 20, 30, 50, 10],
  "server2": [5, 6, 3, 2, 1, 3, 10, 20],
  "server3": [80, 80, 80, 80, 90, 85, 80, 80]
}
```

Open a browser, navigate to http://localhost:8080 and click 'Data'. Select the file and upload it.

Other options are to stream data in via AMQP or to connect to a Database-as-a-Service.
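For the AMQP route, the payload is the same JSON content as the file above. The sketch below builds such a payload; the commented-out publish shows what streaming it in with the pika client might look like (the client choice and the queue name are assumptions, not part of Suricate itself):

```python
import json

# Build the same payload as in the file-upload example above.
payload = json.dumps({"server1": [10, 20, 30, 10, 20, 30, 50, 10]})

# Publishing it to the broker might look like this (requires a running
# RabbitMQ and the third-party pika client; queue name is an assumption):
# import pika
# conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
# channel = conn.channel()
# channel.queue_declare(queue='data')
# channel.basic_publish(exchange='', routing_key='data', body=payload)
# conn.close()
```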

### Step 2 - analyse it

Navigate to the 'Analytics' part and create a new project. Within the project, open the automatically generated file analytics.py. From here on it is just Python coding.

To list the objects and then retrieve the one just created, add:

```python
list_objects()
tmp = retrieve_object('<id>')
```

Now we can plot it:

```python
pyplot.bar(range(0, 8), tmp['server1'])
show()
```

Now we will do something very simple. You can add scikit-learn or pandas to the SDK to use those directly.

```python
tmp2 = np.asarray(tmp['server1'])
mean = tmp2.mean()
```

Now we will store that value:

```python
create_object({'meanserver': mean})
```

Note that the notebooks have edit and remove capabilities as well.
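For reference, the mean computed in this step can be checked by hand on the sample data from step 1, in plain Python without the SDK:

```python
# The server1 series from the sample JSON file in step 1.
data = {"server1": [10, 20, 30, 10, 20, 30, 50, 10]}

# Same computation as tmp2.mean() above, without numpy.
mean = sum(data["server1"]) / len(data["server1"])
# mean is 180 / 8 == 22.5 for the sample data
```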

### Step 3 - do something with it

Just like the analytics part, the processing is done in Python. Create a script called processing.py. Let's load the model we just learned:

```python
mean = retrieve_object('<id_of_mean_obj>')['meanserver']
```

Let's use the streaming data sources to get the latest usage percentage from server1:

```python
list_streams()
new_val = retrieve_from_stream('52225c4d17b1684044f86353')[0]['body']
```

Compare them and run an action when needed:

```python
if new_val > mean:
    run_ssh_command('server1', 'shutdown -k now')
```

We can now also update the object from step one, and learn a new mean when the analytics notebook is triggered again. This gives a continuously updating process.
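Put together, processing.py is only a handful of lines. In the sketch below the SDK helpers are replaced by stubs so the control flow can be followed outside a notebook; the IDs and values are placeholders:

```python
# Stubs standing in for the preloaded SDK helpers of a notebook.
def retrieve_object(obj_id):
    return {'meanserver': 22.5}       # the mean learned in step 2

def retrieve_from_stream(stream_id):
    return [{'body': 95}]             # latest usage sample from the stream

actions = []

def run_ssh_command(host, cmd):       # stub: record instead of SSH-ing
    actions.append((host, cmd))

# The actual processing logic from step 3.
mean = retrieve_object('<id_of_mean_obj>')['meanserver']
new_val = retrieve_from_stream('<stream_id>')[0]['body']
if new_val > mean:
    run_ssh_command('server1', 'shutdown -k now')
```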

The scripts for the analytics and/or processing part can be triggered externally via an API (by a cron job, for example). The clean split between learning (analytics) and acting (processing) makes it easy to decide when to trigger what.

## API

Currently the following features are available when coding notebooks (through preload_internal.py - a mini SDK, if you will):

- show() - show matplotlib output
- show_d3() - show matplotlib output interactively using D3
- list_streams() - list all streams
- retrieve_from_stream(id, interval=60) - retrieve messages from a stream
- list_objects() - list all data objects
- create_object(data) - create a new data object
- retrieve_object(id) - retrieve a data object
- update_object(id) - update a data object

Those features can easily be extended or altered by editing the preload scripts. Whatever is preloaded is automatically available in the notebooks as well.
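As an illustration, a hypothetical helper added to preload_internal.py could look like the sketch below; the helper name is made up, and retrieve_object is stubbed here so the snippet can be run outside a notebook (inside a notebook the real SDK function is already preloaded):

```python
def retrieve_object(obj_id):          # stub; in a notebook the SDK provides this
    return {'server1': [10, 20, 30], 'server2': [5, 6, 3]}

# Hypothetical helper: once defined in the preload script, it would be
# available in every notebook just like the built-in SDK functions.
def series_means(obj_id):
    """Return the mean of every series in a data object."""
    obj = retrieve_object(obj_id)
    return {name: sum(vals) / len(vals) for name, vals in obj.items()}
```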

## REST API

TBD.

## Running it

Currently a MongoDB is needed. Add an admin user:

```js
db = db.getSiblingDB('admin')
db.createUser( { user: "admin", pwd: "secret",
                 roles: [ "clusterAdmin",
                          "userAdminAnyDatabase",
                          "readWriteAnyDatabase",
                          "dbAdminAnyDatabase" ] } )
```

Then run mongod with authentication enabled:

```shell
mongod --dbpath <path> --auth
```

Also make sure RabbitMQ is running and configured.

### For Development & local

For local environments go to the bin directory and just run:

```shell
$ ./run_me.py
```

### Using Docker & MicroService

Have a look here for an example PoC development deployment.

### In Production

Suricate is a simple WSGI app which can be run with multiple servers. It can also be easily deployed on OpenShift or similar.

By adding OAuth2 support through a WSGI middleware such as [wsgi-oauth2](http://styleshare.github.io/wsgi-oauth2/), users can be authenticated. Just make sure that at some point in your middleware a username and an access token are set. Suricate expects these attributes to be called 'X_UID' and 'X_TOKEN'. If you deploy Suricate using the same authentication method as your Object Storage you should be safe. For example:

```python
environ['HTTP_X_UID'] = '123'
environ['HTTP_X_TOKEN'] = 'secret'
```
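A minimal sketch of such a middleware is shown below. The hard-coded values stand in for whatever your actual OAuth2 (or Object Storage) middleware derives from the request; only the two environ keys are what Suricate expects:

```python
class AuthMiddleware(object):
    """Sketch of a WSGI middleware setting the attributes Suricate expects."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # In a real deployment these would come from token validation.
        environ['HTTP_X_UID'] = '123'       # the authenticated username
        environ['HTTP_X_TOKEN'] = 'secret'  # e.g. an Object Storage token
        return self.app(environ, start_response)
```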

To get the app do the following:

```python
from web import wsgi_app
app = wsgi_app.AnalyticsApp(<Mongo URI>).get_wsgi_app()
```

Please note that the usage of TLS is highly recommended! It is also recommended to run each component as a MicroService in a container on e.g. CoreOS or similar.

Authentication/Authorization can be done in the WSGI Middleware.

### OpenShift

Installing numpy and matplotlib dependencies for the preload scripts:

```shell
$ rhc ssh suricate
$ cd $OPENSHIFT_DATA_DIR
$ source ~/python/bin/activate_virtenv
$ pip install numpy
```

## Configuration

The configuration file supports some simple configuration settings:

- Mongo
  - The uri for the MongoDB Server.
  - The admin user for the MongoDB Server.
  - The pwd for the MongoDB Server.
- Rabbit
  - The uri for the AMQP broker.
- Suricate
  - The python_sdk script which will be preloaded and therefore be available to each notebook.
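A configuration file covering those settings might look like the sketch below; the option names follow the list above, but the exact file syntax and section names are assumptions:

```ini
[mongo]
uri = mongodb://localhost:27017
admin = admin
pwd = secret

[rabbit]
uri = amqp://guest:guest@localhost:5672/

[suricate]
python_sdk = preload_internal.py
```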

## Architecture

The following ASCII art shows the rough architecture of Suricate:

```
        -----Web----                        -------------
        | -------- |                     -------------  |
        | |  UI  | | ----AMQP msgs.----> | Execution |  |
User -> | -------- |                     |   nodes   |---
        |          |                     -------------
Data -> | -------- |                           |
        | | REST | |                           |
        | | API  | |                     -------------
        | -------- | ------Mongo-------> |    DB     |
        ------------                     -------------
```

Some notes on the components:

- UI renders a user interface which can be displayed in a web browser.
- REST is a RESTful interface to the service.
- Data can be streamed or bulk-uploaded into the service. It is put directly into the MongoDB.
- Execution nodes run per tenant; they isolate the users and guarantee scalability. The interfaces talk to the nodes using AMQP messages. For maximum security, run an Execution node in a container (LXC, cgroups, Solaris zone, ...) and enforce capping rules on the Execution nodes wherever possible. Execution nodes talk to the MongoDB directly, so you might want to schedule them close to the data for maximum performance. You can also run Execution nodes with different environments (versions of packages etc.). Execution nodes make Suricate a distributed system itself.

## Security considerations

Run at your own risk, and please be aware that the users of the service get a full Python interpreter at their hands.

The Execution nodes act as isolation containers. All messages in the system carry a uid and a token, so even if a user in an execution node figures out how to communicate with the AMQP broker, he will still need the tokens of the other users to be successful.
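To illustrate the idea, a message on the bus and the corresponding check might look like the sketch below; the field names and the check are illustrative assumptions, only the fact that uid and token travel with every message comes from the design above:

```python
# Illustrative shape of a message on the AMQP bus (field names assumed).
message = {
    'uid': '123',
    'token': 'secret',
    'body': {'server1': [10, 20, 30]},
}

def accepted(msg, known_tokens):
    """Reject messages whose token does not match the claimed uid."""
    return known_tokens.get(msg['uid']) == msg['token']
```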

Note: Use encryption wherever possible & change default passwords! (AMQP, Mongo, ...)