# Armis Data Hack Challenge - Solution Example

In this example we will detect anomalies on each network using Elliptic Envelope Algorithm. As this is only an example, our feature set will consist of only five features: unique hosts count, unique port count and total packets transferred. <br>
Let's start :)

# Imports and Consts

In [1]:
import pandas as pd
import datetime

In [2]:
DEVICES_PATH = "all_devices.csv"
SESSIONS_PATH = "all_sessions.csv"

## Read the data

In [3]:
devices = pd.read_csv(DEVICES_PATH)
devices.head()

Unnamed: 0.1,Unnamed: 0,network_id,device_id,type,model,manufacturer,operating_system,operating_system_version
0,113,0,1028623,MOBILE_PHONE,Galaxy S8,Samsung,Android,9
1,587,0,48047,MOBILE_PHONE,Galaxy Note 8,Samsung,Android,9
2,668,0,123568,MOBILE_PHONE,H918,LG Electronics,Android,8.0.0
3,830,0,95366,MOBILE_PHONE,iPhone 6,"Apple, Inc.",iOS,
4,886,0,1755023,TABLET,iPad,Apple,iOS,


In [4]:
sessions = pd.read_csv(SESSIONS_PATH)
sessions.head()

Unnamed: 0.1,Unnamed: 0,network_id,device_id,timestamp,host,host_ip,port_dst,transport_protocol,service_device_id,packets_count,...,outbound_packet_size_max,outbound_packet_size_min,outbound_packet_size_mean,outbound_packet_size_median,outbound_packet_size_stddev,inbound_packet_size_max,inbound_packet_size_min,inbound_packet_size_mean,inbound_packet_size_median,inbound_packet_size_stddev
0,0,0,35,1565074800,ecbb92cd941972b779d18451b6f96275587941e4cf07a1...,ecbb92cd941972b779d18451b6f96275587941e4cf07a1...,49152,TCP,790889.0,260,...,93.0,93.0,93.0,93.0,0.0,312.6,312.6,312.6,312.6,0.0
1,1,0,35,1565053200,90cf529b11c8f26efbb3936c7d10a5bf57c1a930603af0...,90cf529b11c8f26efbb3936c7d10a5bf57c1a930603af0...,49153,TCP,1604622.0,178,...,106.75,93.4,94.883333,93.0,4.317134,318.4,318.4,318.4,318.0,0.0
2,2,0,57,1565082000,e16257c983f2c35d41f39d425651972fa1905e46e968d7...,d43dad76e6cef2231d2efc743e498996b40f8b13fc120b...,443,TCP,,67,...,64.018182,41.0,44.288312,41.0,8.700055,,,,,
3,3,0,57,1565082000,1a4f860269acca6c264f00d84c4b63aad00b8f93a77250...,945e37dab8aee93dd4e650f8d17d76a3adfbc6aa70ebba...,443,TCP,,45,...,226.733333,133.571429,194.968236,209.0,41.777266,,,,,
4,4,0,57,1565082000,df106cbe1ba4a700c00ec8883490f40a8afdb75c15a9ea...,4b43e85e630c2d18a0afaa2a6366367c4fc52d32b4ba5b...,443,TCP,,9,...,185.666667,185.666667,185.666667,185.666667,,,,,,


# Generate Feature Set

As we mentioned before we are going to select only 5 features: unique hosts count, unique port count,  total packets transferred, total inbound bytes and total outbound bytes. Lets create the feature set for each network

In [5]:
def extract_features(df):
    return df.groupby(["device_id"]).aggregate({"host": "nunique", "port_dst": "nunique", 
                                                "packets_count": "sum", "inbound_bytes_count": "sum", 
                                                "outbound_bytes_count": "sum"})


In [6]:
networks_fs = []

In [7]:
for i in range(4):
    networks_fs.append(extract_features(sessions[sessions.network_id == i]))

# Modeling Fun

As we've mentioned we used the Elliptic Envelope model for our anomaly detection. We use the decision_function in order to receive the confidence and we normalize the values to values between 0 and 1.

In [8]:
from sklearn.covariance import EllipticEnvelope

In [9]:
# We use the simple min-max normalization in order to normalize the confidence values to 0-1 range.
# Higher score means that this device is probably more anomalous.
def calc_normalized_decision(decision_function_result):
    decision_function_result = -1 * decision_function_result
    minimum = decision_function_result.min()
    maximum = decision_function_result.max()
    return (decision_function_result - minimum) / (maximum - minimum)

In [10]:
def detect_anomalies(feature_set):
    ee = EllipticEnvelope(contamination=0.05).fit(feature_set.values)
    decision_function_result = ee.decision_function(feature_set.values)
    feature_set["confidence"] = calc_normalized_decision(decision_function_result)

In [11]:
for i in range(4):
    detect_anomalies(networks_fs[i])

In [12]:
# Remove the device_id from index and Add network id to each data set
for i in range(4):
    networks_fs[i].reset_index(level=0, inplace=True)
    networks_fs[i]["network_id"] = 0

In [13]:
df_to_submit = pd.concat(networks_fs)[["network_id", "device_id", "confidence"]] # The column order is important!
df_to_submit.head()

Unnamed: 0,network_id,device_id,confidence
0,0,33,2.149737e-11
1,0,35,3.652858e-05
2,0,40,0.01006346
3,0,41,2.479184e-10
4,0,53,3.912391e-11


# Submissions

In order to update the Leader Board you need to send a POST request to following url: "https://leaderboard.datahack.org.il/armis/api/". Our Leader Board receives your anomalies results in the json form of [["network_id", "device_id", "confidence"]] - The order is important!

In [14]:
arr_to_submit = df_to_submit.to_json(orient='values')

In [15]:
from urllib import request
import json

leaderboard_name = "armis"
host = "leaderboard.datahack.org.il"

# Name of the user
submitter = "Armis-test"

predictions = json.loads(arr_to_submit)

jsonStr = json.dumps({'submitter': submitter, 'predictions': predictions})
data = jsonStr.encode('utf-8')
req = request.Request(f"https://{host}/{leaderboard_name}/api/",
                      headers={'Content-Type': 'application/json'},
                      data=data)
resp = request.urlopen(req)
print(json.load(resp))

{'member': 'Armis-test', 'rank': 2, 'score': 0.5648161775508931}
