# Opening the model, and preparing the target
We prepare a target, and open the test set. This should be hosted somewhere else, but 

In [1]:
import ember
import numpy as np
import json, msgpack, requests, zlib
import matplotlib.pyplot as plt
from lightgbm import Booster

datadir = "/Users/sven/localdata/ember2018/"

x_test, y_test  = ember.read_vectorized_features(datadir,"test")

model = Booster(model_file=datadir + "model.txt")



# Interacting with the server

To start the server, go to `serve_goko` and run `cargo run --example ember_server`.
## Status: /
Here we get the status of the tree:

In [2]:
r = requests.get('http://localhost:3030/')
print(json.dumps(json.loads(r.text), indent=2))

{
  "scale_base": 1.5,
  "leaf_cutoff": 100,
  "min_res_index": -20,
  "use_singletons": true,
  "partition_type": "Nearest",
  "verbosity": 2,
  "rng_seed": null
}


## KNN: /knn?k=N
Here we get the true (exact) k nearest neighbors of the query point.

In [3]:
sample_bytes = zlib.compress(msgpack.packb([float(f) for f in x_test[0]]))
r = requests.get('http://localhost:3030/knn', params = {"k": 5}, headers = {"Content-Type": "gzip"}, data=sample_bytes)
print(json.dumps(json.loads(r.text), indent=2))

{
  "knn": [
    {
      "name": "0",
      "distance": 0.0
    },
    {
      "name": "173337",
      "distance": 1.0
    },
    {
      "name": "58831",
      "distance": 1.0000011
    },
    {
      "name": "24412",
      "distance": 5.0
    },
    {
      "name": "42622",
      "distance": 5.0
    }
  ]
}
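The response is a list of `name`/`distance` records. A minimal sketch for turning it into sorted `(index, distance)` pairs follows. It assumes that each `name` is the integer index of a point in the data the tree was built on, which the output suggests but does not confirm. `parse_neighbors` is a hypothetical helper, not part of goko.

```python
import json

# Response text copied from the /knn output above.
response_text = """
{"knn": [
  {"name": "0", "distance": 0.0},
  {"name": "173337", "distance": 1.0},
  {"name": "58831", "distance": 1.0000011},
  {"name": "24412", "distance": 5.0},
  {"name": "42622", "distance": 5.0}
]}
"""


def parse_neighbors(text, key="knn"):
    """Return [(index, distance), ...] sorted by increasing distance.

    Assumption: every "name" field holds an integer index as a string.
    """
    neighbors = json.loads(text)[key]
    pairs = [(int(n["name"]), float(n["distance"])) for n in neighbors]
    return sorted(pairs, key=lambda pair: pair[1])


neighbors = parse_neighbors(response_text)
nearest_index, nearest_distance = neighbors[0]
```

Passing `key="routing_knn"` should let the same helper read the approximate query's response, since both outputs share the same record layout.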


## Routing KNN: /routing_knn?k=N

This is an approximate KNN query. It ignores most of the points, so the result is likely to be wrong unless you set up a very slow tree.

In [4]:
sample_bytes = zlib.compress(msgpack.packb([float(f) for f in x_test[0]]))

r = requests.get('http://localhost:3030/routing_knn', params = {"k": 5}, headers = {"Content-Type": "gzip"}, data=sample_bytes)

print(r)
print(json.dumps(json.loads(r.text), indent=2))

<Response [200]>
{
  "routing_knn": [
    {
      "name": "81300",
      "distance": 13.152946
    },
    {
      "name": "139576",
      "distance": 17.464249
    },
    {
      "name": "48288",
      "distance": 24.596748
    },
    {
      "name": "26837",
      "distance": 30.149628
    },
    {
      "name": "118135",
      "distance": 30.282007
    }
  ]
}
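One way to quantify how far the approximate answer is from the exact one is recall: the fraction of true neighbors that the routing query also returned. The sketch below uses the names from the two outputs shown in this notebook. `recall_at_k` is a hypothetical helper introduced for illustration.

```python
def recall_at_k(exact_names, approx_names):
    """Fraction of the exact neighbors that the approximate query also returned."""
    exact = set(exact_names)
    if not exact:
        return 0.0
    return len(exact & set(approx_names)) / len(exact)


# Names copied from the /knn and /routing_knn outputs for x_test[0].
exact = ["0", "173337", "58831", "24412", "42622"]
routed = ["81300", "139576", "48288", "26837", "118135"]

recall = recall_at_k(exact, routed)
print(f"recall@5 = {recall:.2f}")
```

For this query the two result sets share no names, so the recall is zero, which matches the warning that the routing query is likely to be wrong.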


## Path query: /path

This returns the path of the point you're interested in. We determine which node the point belongs to (treating the tree like a filesystem), then report the path from the root to that node.
Each element of the path also includes the label summary for that node, which we can use to determine many things about the region around the point.
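As one example of using the label summaries, they can be turned into per-node class fractions. The sketch below assumes that each entry of `label_summary.summary.items` is a `[label, count]` pair, which is how the response to this query appears to be laid out. `label_fractions` is a hypothetical helper, not part of goko, and the two sample elements are copied from the first two entries of the response for `x_test[0]`.

```python
def label_fractions(path_element):
    """Map each label to its share of the labelled points under this node.

    Assumption: items is a list of [label, count] pairs.
    """
    items = path_element["label_summary"]["summary"]["items"]
    total = sum(count for _, count in items)
    if total == 0:
        return {}
    return {label: count / total for label, count in items}


# First two path elements from the /path response for x_test[0].
path = [
    {"name": "199999", "layer": 58,
     "label_summary": {"summary": {"items": [[1, 100000], [0, 100000]]},
                       "nones": 0, "errors": 0}},
    {"name": "199999", "layer": 57,
     "label_summary": {"summary": {"items": [[1, 99991], [0, 99997]]},
                       "nones": 0, "errors": 0}},
]

for element in path:
    print(element["layer"], label_fractions(element))
```

Following these fractions down the path shows how the label mix changes as the nodes get smaller and more specific to the query point.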

In [5]:
sample_bytes = zlib.compress(msgpack.packb([float(f) for f in x_test[0]]))

r = requests.get('http://localhost:3030/path', headers = {"Content-Type": "gzip"}, data=sample_bytes)

print(r)
print(json.dumps(json.loads(r.text), indent=2))

<Response [200]>
{
  "path": [
    {
      "name": "199999",
      "layer": 58,
      "distance": 76001580.0,
      "label_summary": {
        "summary": {
          "items": [
            [
              1,
              100000
            ],
            [
              0,
              100000
            ]
          ]
        },
        "nones": 0,
        "errors": 0
      }
    },
    {
      "name": "199999",
      "layer": 57,
      "distance": 76001580.0,
      "label_summary": {
        "summary": {
          "items": [
            [
              1,
              99991
            ],
            [
              0,
              99997
            ]
          ]
        },
        "nones": 0,
        "errors": 0
      }
    },
    {
      "name": "199999",
      "layer": 56,
      "distance": 76001580.0,
      "label_summary": {
        "summary": {
          "items": [
            [
              1,
              99989
            ],
            [
              0,
              

## Configuring the tracker: /track/add?window_size=N&tracker_name=NAME

We set up trackers, which use the paths (from the previous query). In this case we have a window_size of 100, so we'll be tracking the last 100 queries. We omit the tracker name to add this window to the default tracker.

In [6]:
# Should get {'Unknown': [None, 100]}
r = requests.post('http://localhost:3030/track/add?window_size=100')

print(r)
print(json.dumps(json.loads(r.text), indent=2))

<Response [200]>
{
  "AddTracker": {
    "success": false
  }
}
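To make the idea of a window over the last N queries concrete, here is a small sketch of a sliding window that keeps per-node visit counts for the most recent paths. It is an illustration only: how the server's trackers are actually implemented is not shown in this notebook, and `WindowedCounts` is a hypothetical class. Nodes are identified by `(layer, name)` because the same name can appear at several layers of a path.

```python
from collections import Counter, deque


class WindowedCounts:
    """Keep visit counts for the nodes on the last `window_size` paths."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()

    def add_path(self, node_ids):
        """Record one path, evicting the oldest path if the window is full."""
        node_ids = tuple(node_ids)
        self.window.append(node_ids)
        self.counts.update(node_ids)
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            self.counts.subtract(expired)
            # Unary plus drops entries whose count fell to zero.
            self.counts = +self.counts


tracker = WindowedCounts(window_size=2)
tracker.add_path([(58, "199999"), (57, "199999")])
tracker.add_path([(58, "199999"), (57, "42")])
tracker.add_path([(58, "199999"), (57, "42")])  # evicts the first path
print(tracker.counts)
```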


## Tracking a point: /track/point?tracker_name=NAME

This adds the point to the default trackers. So far we only have one, with a window_size of `100`.

In [7]:
# Should return 200 with {'TrackPath': {'success': True}}
sample_bytes = zlib.compress(msgpack.packb([float(f) for f in x_test[0]]))
r = requests.post('http://localhost:3030/track/point', headers = {"Content-Type": "gzip"}, data=sample_bytes)
print(json.dumps(json.loads(r.text), indent=2))

{
  "TrackPath": {
    "success": true
  }
}


## Getting the stats back out: /track/stats?window_size=N&tracker_name=NAME

This retrieves the current stats for the tracker with the given window size.

In [8]:
# Should get a current stats object with a very small kl_div.
r = requests.get('http://localhost:3030/track/stats?window_size=100')
print(json.dumps(json.loads(r.text), indent=2))

{
  "CurrentStats": {
    "kl_div": 0.07742628264899665,
    "max": 0.0688584204759417,
    "min": 8.691216635270393e-10,
    "nz_count": 34,
    "moment1_nz": 0.07947733725128625,
    "moment2_nz": 0.004766578833762081,
    "sequence_len": 2
  }
}


## Normal Queries!

In [9]:
r = requests.post('http://localhost:3030/track/add?window_size=100&tracker_name=normal')

for i in range(100):
    sample_bytes = zlib.compress(msgpack.packb([float(f) for f in x_test[i]]))
    r = requests.post('http://localhost:3030/track/point?tracker_name=normal', headers = {"Content-Type": "gzip"}, data=sample_bytes)
    assert json.loads(r.text)["TrackPath"]["success"]
    
r = requests.get('http://localhost:3030/track/stats?window_size=100&tracker_name=normal')
print(json.dumps(json.loads(r.text), indent=2))

{
  "CurrentStats": {
    "kl_div": 1.6717167818221697,
    "max": 0.2658954744674986,
    "min": 4.765811478790738e-08,
    "nz_count": 625,
    "moment1_nz": 1.7472514390241187,
    "moment2_nz": 0.12248904008724106,
    "sequence_len": 100
  }
}


# Basic Attack Simulation

The test set attack is what 99% of malware authors do: try things until you get a bypass, then reuse that bypass until it stops working. The cell below simulates this by submitting the same sample 100 times.

All the blackbox attacks and all the malware attacks form hotspots: the same location gets queried over and over again. This happens either immediately (in the case of ToucanStrike) or after a very short exploration phase (in the case of CounterFit). Poisoning a location in the dataset works similarly, but doesn't form quite as high a maximum nodal KL-divergence.
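To see why a hotspot stands out, consider a generic KL divergence between the observed visit frequencies over a handful of nodes and a baseline. This is a toy illustration under stated assumptions: the four-node distributions are made up, and the way the server computes `kl_div` is not described in this notebook and may well differ.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same bins.

    Assumes q is strictly positive wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


baseline = [0.25, 0.25, 0.25, 0.25]  # queries spread evenly over four nodes
spread = [0.30, 0.20, 0.25, 0.25]    # ordinary traffic, close to the baseline
hotspot = [0.97, 0.01, 0.01, 0.01]   # nearly every query lands on one node

print(f"spread:  {kl_divergence(spread, baseline):.4f}")
print(f"hotspot: {kl_divergence(hotspot, baseline):.4f}")
```

The concentrated distribution diverges from the baseline by roughly two orders of magnitude more than the spread one, which is the qualitative effect the tracker statistics are meant to expose.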

In [10]:
r = requests.post('http://localhost:3030/track/add?window_size=100&tracker_name=attack')

sample_bytes = zlib.compress(msgpack.packb([float(f) for f in x_test[0]]))
for i in range(100):
    r = requests.post('http://localhost:3030/track/point?tracker_name=attack', headers = {"Content-Type": "gzip"}, data=sample_bytes)
    assert json.loads(r.text)["TrackPath"]["success"]
    
r = requests.get('http://localhost:3030/track/stats?window_size=100&tracker_name=attack')
print(json.dumps(json.loads(r.text), indent=2))

{
  "CurrentStats": {
    "kl_div": 60.14726813266543,
    "max": 79.49557177004016,
    "min": 1.0008556898810639e-06,
    "nz_count": 35,
    "moment1_nz": 102.36068891179305,
    "moment2_nz": 6429.589883111649,
    "sequence_len": 100
  }
}
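Comparing the two trackers side by side makes the difference explicit. The sketch below reuses the `kl_div`, `max`, and `nz_count` values printed for the `normal` and `attack` trackers. The threshold is a hypothetical value picked only to separate these two runs, not a recommended setting, and `looks_like_hotspot` is a helper introduced for illustration.

```python
# Values copied from the "normal" and "attack" tracker outputs above.
normal_stats = {"kl_div": 1.6717167818221697,
                "max": 0.2658954744674986,
                "nz_count": 625}
attack_stats = {"kl_div": 60.14726813266543,
                "max": 79.49557177004016,
                "nz_count": 35}

# Hypothetical threshold; a real deployment would need to calibrate it.
KL_DIV_THRESHOLD = 10.0


def looks_like_hotspot(stats, threshold=KL_DIV_THRESHOLD):
    """Flag a window whose total KL divergence exceeds the threshold."""
    return stats["kl_div"] > threshold


for name, stats in [("normal", normal_stats), ("attack", attack_stats)]:
    print(name, looks_like_hotspot(stats), stats["nz_count"])

kl_ratio = attack_stats["kl_div"] / normal_stats["kl_div"]
print(f"attack kl_div is about {kl_ratio:.0f}x the normal one")
```

The attack window also touches far fewer nodes (`nz_count` of 35 versus 625), which is consistent with the same location being queried repeatedly.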
