Making linear algebra great (and persistent) again.

SUM


If you work with machine learning, you probably have a bunch of huge CSV files lying around that you keep using to train your models, run PCA on, or analyze in some other way. If so, you know the struggle of:

  • parsing and loading the file with NumPy, TensorFlow or whatever;
  • crossing your fingers that your laptop can actually hold those records in memory;
  • running your algorithm;
  • ... waiting ...

This project is an attempt to make these tedious tasks (and many others) simpler, if not completely automated. Sum is a database and high-performance gRPC service offering three main things:

  1. Persistence for your vectors.
  2. A simple CRUD system to create, read, update and delete them.
  3. Oracles.

An oracle is a piece of JavaScript logic you want to run on your data. A client sends this code to the Sum server, which compiles and stores it. It is then available for every client to use in order to "query" the data.

For instance, this is the findSimilar oracle definition:

// Given the vector with id=`id`, return the other vectors
// whose cosine similarity to the reference vector is
// greater than or equal to the threshold.
// Results are given as a dictionary of:
//      `vector_id => similarity`
function findSimilar(id, threshold) {
    var v = records.Find(id);
    if (v.IsNull()) {
        return ctx.Error("Vector " + id + " not found.");
    }

    var results = {};
    // Compare the reference vector against every other stored vector.
    records.AllBut(v).forEach(function (record) {
        var similarity = v.Cosine(record);
        if (similarity >= threshold) {
            results[record.ID] = similarity;
        }
    });

    return results;
}
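
If you want to see what this oracle computes without a running server, the following is a minimal local sketch. Everything except `findSimilar` itself is a hypothetical stand-in: `makeVector`, `NULL_VECTOR`, `store` and the mock `records` and `ctx` objects only imitate the method names the oracle uses (`Find`, `AllBut`, `IsNull`, `Cosine`, `Error`), and the real server-side objects may behave differently. `Cosine` is assumed to be the usual dot product divided by the product of the two norms.

```javascript
// Hypothetical stand-in for a stored vector. Only the names ID, IsNull and
// Cosine come from the oracle above; the `data` field is an assumption.
function makeVector(id, data) {
    return {
        ID: id,
        data: data,
        IsNull: function () { return false; },
        // Cosine similarity: dot(a, b) / (|a| * |b|).
        Cosine: function (other) {
            var dot = 0, na = 0, nb = 0;
            for (var i = 0; i < data.length; i++) {
                dot += data[i] * other.data[i];
                na += data[i] * data[i];
                nb += other.data[i] * other.data[i];
            }
            return dot / (Math.sqrt(na) * Math.sqrt(nb));
        }
    };
}

// Returned by the mock Find when no vector matches (assumed behavior).
var NULL_VECTOR = { IsNull: function () { return true; } };

var store = [
    makeVector("a", [1, 0]),
    makeVector("b", [2, 0]),   // same direction as "a"
    makeVector("c", [0, 1]),   // orthogonal to "a"
    makeVector("d", [1, 1])    // 45 degrees from "a", similarity ~0.707
];

// Mock of the `records` object the oracle reads from.
var records = {
    Find: function (id) {
        var hit = store.filter(function (r) { return r.ID === id; })[0];
        return hit || NULL_VECTOR;
    },
    AllBut: function (v) {
        return store.filter(function (r) { return r.ID !== v.ID; });
    }
};

// Mock of the `ctx` object; the shape of the error value is an assumption.
var ctx = {
    Error: function (msg) { return { error: msg }; }
};

// Same logic as the oracle above.
function findSimilar(id, threshold) {
    var v = records.Find(id);
    if (v.IsNull()) {
        return ctx.Error("Vector " + id + " not found.");
    }
    var results = {};
    records.AllBut(v).forEach(function (record) {
        var similarity = v.Cosine(record);
        if (similarity >= threshold) {
            results[record.ID] = similarity;
        }
    });
    return results;
}

console.log(findSimilar("a", 0.7));    // keeps "b" and "d", drops "c"
console.log(findSimilar("zzz", 0.7));  // error object from the mock ctx
```

With a threshold of 0.7, the orthogonal vector "c" is filtered out while "b" (same direction) and "d" (45 degrees away) are kept.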

Once the oracle is defined on the Sum server, any client can execute calls like findSimilar("some-vector-id-here", 0.9). Such calls are evaluated on data held in memory in order to be as fast as possible, while the same data is persisted on disk as binary protobuf-encoded files.
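
The "define once, call by name from any client" flow can be pictured with the toy registry below. This is a conceptual sketch only and is not how the Sum server is implemented: `defineOracle`, `callOracle`, the `oracles` map and the `countLongerThan` oracle are all hypothetical names made up for illustration, and plain arrays stand in for stored vectors.

```javascript
// Conceptual model only: names and mechanics here are assumptions.
var oracles = {};   // oracle name -> compiled factory function

// "Define": compile the source once and store it under its name.
function defineOracle(name, source) {
    // new Function compiles the source text; the generated body declares the
    // oracle and returns it, with `records` bound as a parameter.
    oracles[name] = new Function("records", source + "\nreturn " + name + ";");
}

// "Call": any caller can now run the stored oracle by name.
function callOracle(name, records, args) {
    if (!Object.prototype.hasOwnProperty.call(oracles, name)) {
        throw new Error("Oracle " + name + " not found.");
    }
    var fn = oracles[name](records);
    return fn.apply(null, args);
}

// One client defines a (made-up) oracle as source text...
defineOracle(
    "countLongerThan",
    "function countLongerThan(n) {" +
    "  return records.filter(function (r) { return r.length > n; }).length;" +
    "}"
);

// ...and any other caller can invoke it by name afterwards.
var data = [[1, 2, 3], [1], [1, 2]];
console.log(callOracle("countLongerThan", data, [1]));
```

The point of the sketch is the split between the two steps: the code is compiled once at definition time, and later calls only pass a name and arguments.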

To get a better idea of how this works, take a look at the example Python client code, which creates a few vectors on the server, defines an oracle, calls it for every vector, and prints the similarities the server returned.

Example Use Case

Clustering Android malware samples by behavioural similarities:
