# Assignment 4 - Identifying IoT Devices using Traffic Meta-data

This assignment includes some prewritten code for you to work with. This code is a re-implementation of the classification methods described in the 2018 paper "Classifying IoT Devices in Smart Environments Using Network Traffic Characteristics."

Download and unzip the materials and IoT dataset (for Google Colab only) using the following code block:

In [None]:
!gdown 199U9EjMlDsqfOTxaDRMZkALzMhJ1Hmsd
!unzip A4_Materials.zip
!unzip A4_Materials/iot_data.zip

You should now have two new directories in your file lists.

`/content/iot_data/*` contains the feature data for the samples in the IoT dataset.

`/content/A4_Materials/*` contains additional resources:
*   *08440758.pdf* is the research paper that describes the following classification pipeline and dataset in detail.
*   *list_of_devices.txt* contains device name, MAC address, and connection type information for all devices in the dataset.
*   *classify.py* and *requirements.txt* contain the notebook code and the libraries needed to run the script locally.
*   *iot_data.zip* contains the compressed feature files for the dataset.


In [None]:
# All primary imports
import json
import argparse
import os
import numpy as np
import tqdm

# Suppress sklearn warnings
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

Below are configurable hard-coded variables that are used throughout the code.

You are welcome to adjust these values however you see fit.

In [None]:
# Seed Value
# (ensures consistent dataset splitting between runs)
SEED = 0

# Default path for IoT data after unzipping the dataset (in Colab)
ROOT = '/content/iot_data'

# Percentage of samples to use for testing (feel free to change)
SPLIT = 0.3

# Port count threshold (e.g., discard port feature values that appear in N or fewer samples)
PORT_COUNT = 10

# Maximum and minimum number of samples to allow when loading each IoT device
MAX_SAMPLES_PER_CLASS = 1000
MIN_SAMPLES_PER_CLASS = 20
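
To make the `PORT_COUNT` threshold concrete, the sketch below counts how many samples each port appears in and keeps only the ports seen in more than `threshold` samples. The helper name `prune_rare_ports` and the toy samples are made up for illustration; the sketch assumes the same "count once per sample, keep if strictly greater" rule that the loading code in this notebook uses.

```python
from collections import Counter

def prune_rare_ports(samples, threshold):
    """Keep ports that appear in more than `threshold` samples (hypothetical helper)."""
    counts = Counter()
    for ports in samples:
        counts.update(set(ports))  # count each port at most once per sample
    return {port for port, seen in counts.items() if seen > threshold}

# Toy data: each inner list is the ports observed in one sample
samples = [[443, 443, 53], [443, 53], [443, 8080], [443]]
kept = prune_rare_ports(samples, threshold=1)
print(sorted(kept))
```

Raising the threshold shrinks the port vocabulary, which in turn reduces the width of the port word-bag vectors.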

This starter code can be organized into four main sections: **(1)** *data loading & processing*, **(2)** *stage-0 classification of bags-of-words*, **(3)** *stage-1 classification using random forests*, and **(4)** *the main function* that glues everything together.

The code will currently run as-is and correctly perform multi-class IoT device identification using the classification pipeline described in the assignment document. Additionally, each function's purpose and behavior is briefly described in its accompanying docstring. I recommend reading through all blocks of code to develop a general understanding of how feature processing and classification are done.

For this assignment, you will only *need* to modify code section **(3)**, but you also are free to adjust, modify, re-write any of the remaining code as you see fit to complete your analysis or to integrate better with your random forest implementation.

In [None]:
def load_data(root, min_samples, max_samples):
    """
    Load json feature files produced from feature extraction.

    The device label (MAC) is identified from the directory in which the feature file was found.
    Returns x and y as separate multidimensional arrays.
    The instances in x contain only the first 6 features.
    The ports, domain, and cipher features are stored in separate arrays for easier processing in stage 0.

    Parameters
    ----------
    root : str
           Path to the directory containing samples.
    min_samples : int
                  The number of samples each class must have at minimum (else it is pruned).
    max_samples : int
                  Stop loading samples for a class when this number is reached.

    Returns
    -------
    features_misc : numpy array
                    Traffic statistical features
    features_ports : numpy array
                     Vectorized word-bags (e.g., counts) for ports
    features_domains : numpy array
                       Vectorized word-bags (e.g., counts) for domains
    features_ciphers : numpy array
                       Vectorized word-bags (e.g., counts) for ciphers
    labels : numpy array
             (numerical) Labels for all samples in the dataset
    """
    x = []
    x_p = []
    x_d = []
    x_c = []
    y = []

    port_dict = dict()
    domain_set = set()
    cipher_set = set()

    # Create paths and do instance count filtering
    f_paths = []
    f_counts = dict()
    for rt, dirs, files in os.walk(root):
        for f_name in files:
            path = os.path.join(rt, f_name)
            label = os.path.basename(os.path.dirname(path))
            name = os.path.basename(path)
            if name.startswith("features") and name.endswith(".json"):
                f_paths.append((path, label, name))
                f_counts[label] = 1 + f_counts.get(label, 0)

    # Load Samples
    processed_counts = {label: 0 for label in f_counts.keys()}
    for fpath in tqdm.tqdm(f_paths):  # Enumerate all sample files
        path = fpath[0]
        label = fpath[1]
        if f_counts[label] < min_samples:
            continue
        if processed_counts[label] >= max_samples:
            continue  # Limit
        processed_counts[label] += 1
        with open(path, "r") as fp:
            features = json.load(fp)
            instance = [features["flow_volume"],
                        features["flow_duration"],
                        features["flow_rate"],
                        features["sleep_time"],
                        features["dns_interval"],
                        features["ntp_interval"]]
            x.append(instance)
            x_p.append(list(features["ports"]))
            x_d.append(list(features["domains"]))
            x_c.append(list(features["ciphers"]))
            y.append(label)
            domain_set.update(list(features["domains"]))
            cipher_set.update(list(features["ciphers"]))
            for port in set(features["ports"]):
                port_dict[port] = 1 + port_dict.get(port, 0)

    # Prune rarely seen ports
    port_set = set()
    for port in port_dict.keys():
        if port_dict[port] > PORT_COUNT:  # Filter out ports that are rarely seen to reduce feature dimensionality
            port_set.add(port)

    # Map to word-bag
    print("Generating word-bags ... ")
    for i in tqdm.tqdm(range(len(y))):
        # Replace each raw token list with its count vector over the vocabulary
        x_p[i] = [x_p[i].count(port) for port in port_set]
        x_d[i] = [x_d[i].count(domain) for domain in domain_set]
        x_c[i] = [x_c[i].count(cipher) for cipher in cipher_set]

    return np.array(x).astype(float), np.array(x_p), np.array(x_d), np.array(x_c), np.array(y)
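
The word-bag step above turns each sample's list of tokens (ports, domains, or ciphers) into a vector of counts, one entry per vocabulary item. A minimal sketch, assuming a hypothetical helper `to_word_bag` and made-up domain names; it sorts the vocabulary to pin down the column order, which is a choice made here for readability rather than something the starter code does.

```python
def to_word_bag(tokens, vocabulary):
    """Count how often each vocabulary entry occurs in `tokens` (hypothetical helper)."""
    return [tokens.count(word) for word in vocabulary]

# Sorting the vocabulary fixes the column order; a plain set has no guaranteed order
vocabulary = sorted({"example.com", "time.example.org", "cdn.example.net"})
sample_domains = ["example.com", "cdn.example.net", "example.com"]
bag = to_word_bag(sample_domains, vocabulary)
print(vocabulary)
print(bag)
```

Every sample ends up with a vector of the same length, so the vectors can be stacked into one array per feature type.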

In [None]:
def classify_bayes(x_tr, y_tr, x_ts, y_ts):
    """
    Use a multinomial naive bayes classifier to analyze the 'bag of words' seen in the ports/domain/ciphers features.
    Returns the prediction results for the training and testing datasets as an array of tuples in which each row
    represents a data instance and each tuple is composed as the predicted class and the confidence of prediction.

    Parameters
    ----------
    x_tr : numpy array
           Array containing training samples.
    y_tr : numpy array
           Array containing training labels.
    x_ts : numpy array
           Array containing testing samples.
    y_ts : numpy array
           Array containing testing labels

    Returns
    -------
    c_tr : numpy array
           Prediction results for training samples.
    c_ts : numpy array
           Prediction results for testing samples.
    """
    classifier = MultinomialNB()
    classifier.fit(x_tr, y_tr)

    # Produce class and confidence for training samples
    c_tr = classifier.predict_proba(x_tr)
    c_tr = [(np.argmax(instance), max(instance)) for instance in c_tr]

    # Produce class and confidence for testing samples
    c_ts = classifier.predict_proba(x_ts)
    c_ts = [(np.argmax(instance), max(instance)) for instance in c_ts]

    return c_tr, c_ts


def do_stage_0(xp_tr, xp_ts, xd_tr, xd_ts, xc_tr, xc_ts, y_tr, y_ts):
    """
    Perform stage 0 of the classification procedure:
        process each multinomial feature using naive bayes
        return the class prediction and confidence score for each instance feature

    Parameters
    ----------
    xp_tr : numpy array
           Array containing training (port) samples.
    xp_ts : numpy array
           Array containing testing (port) samples.
    xd_tr : numpy array
           Array containing training (domain) samples.
    xd_ts : numpy array
           Array containing testing (domain) samples.
    xc_tr : numpy array
           Array containing training (cipher suite) samples.
    xc_ts : numpy array
           Array containing testing (cipher suite) samples.
    y_tr : numpy array
           Array containing training labels.
    y_ts : numpy array
           Array containing testing labels

    Returns
    -------
    res_p_tr : numpy array
               Prediction results for training (port) samples.
    res_p_ts : numpy array
               Prediction results for testing (port) samples.
    res_d_tr : numpy array
               Prediction results for training (domains) samples.
    res_d_ts : numpy array
               Prediction results for testing (domains) samples.
    res_c_tr : numpy array
               Prediction results for training (cipher suites) samples.
    res_c_ts : numpy array
               Prediction results for testing (cipher suites) samples.
    """
    # Perform multinomial classification on bag of ports
    res_p_tr, res_p_ts = classify_bayes(xp_tr, y_tr, xp_ts, y_ts)

    # Perform multinomial classification on domain names
    res_d_tr, res_d_ts = classify_bayes(xd_tr, y_tr, xd_ts, y_ts)

    # Perform multinomial classification on cipher suites
    res_c_tr, res_c_ts = classify_bayes(xc_tr, y_tr, xc_ts, y_ts)

    return res_p_tr, res_p_ts, res_d_tr, res_d_ts, res_c_tr, res_c_ts
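
Each stage-0 result is a pair of (most probable class index, probability of that class) per sample. The sketch below reproduces that reduction on made-up probability rows; `top_class_and_confidence` is a hypothetical name. Note that the index is a position in the classifier's class list (`classes_` in scikit-learn), which matches the encoded label only if every class occurs in the training data.

```python
def top_class_and_confidence(probabilities):
    """Return (index of the most probable class, its probability) for one sample."""
    best = max(range(len(probabilities)), key=lambda k: probabilities[k])
    return best, probabilities[best]

# Toy posterior rows for three samples over three classes
proba_rows = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.3, 0.4, 0.3]]
stage0 = [top_class_and_confidence(row) for row in proba_rows]
print(stage0)
```

Three such pairs (ports, domains, ciphers) contribute six columns to the stage-1 feature matrix.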

Your primary goal for this assignment is to implement your own versions of the decision tree and random forest algorithms to replace the scikit-learn implementation currently in use for stage-1 classification. 
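
One ingredient commonly found in random forest implementations is to consider only a random subset of the features at each split. The sketch below shows just that sampling step, assuming a hypothetical helper `sample_features`; it is an illustration of the general idea, not a required design for this assignment. The value 12 corresponds to the six flow statistics plus three (class, confidence) pairs from stage 0.

```python
import random

def sample_features(n_features, subset_size, rng):
    """Pick `subset_size` distinct feature indices out of `n_features` (hypothetical helper)."""
    return sorted(rng.sample(range(n_features), subset_size))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# With 12 features, a common heuristic considers about sqrt(12) ~ 3 features per split
candidates = sample_features(n_features=12, subset_size=3, rng=rng)
print(candidates)
```

A split search would then evaluate thresholds only for the sampled indices, drawing a fresh subset at every node.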



In [None]:
def gini(labels1, labels2, classnum):
  """
  Early draft: weighted one-vs-rest Gini impurity of a two-way split for one class.
  """
  totallen = len(labels1) + len(labels2)
  weighted = 0.0
  for labels in (labels1, labels2):
    if len(labels) == 0:
      continue  # an empty side contributes nothing (and avoids dividing by zero)
    prop = np.count_nonzero(labels == classnum) / len(labels)
    weighted += (1 - (prop*prop + (1 - prop)*(1 - prop))) * (len(labels) / totallen)
  return weighted

In [None]:
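def splitfinder(labels):
  """
  Early draft: try split fractions 0.1 through 0.9 of an ordered label array and
  return the fraction with the lowest summed one-vs-rest Gini score.
  """
  ginis = {}
  classes = np.unique(labels)
  # Iterate over integer tenths; repeatedly adding 0.1 accumulates float error
  # and runs one extra iteration just below 1.0
  for step in range(1, 10):
    frac = step / 10
    splitnum = int(len(labels) * frac)
    set1 = labels[0:splitnum]
    set2 = labels[splitnum:]
    ginis[frac] = sum(gini(set1, set2, c) for c in classes)
  return min(ginis, key=ginis.get)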

Working Code:

In [None]:
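def gini_impurity(groups, classes):
  """
  Weighted Gini impurity of a split.
  Each group is a list of rows whose last element is the class label.
  """
  n = sum(len(group) for group in groups)
  impurity = 0.0
  for group in groups:
    if len(group) == 0:
      continue  # an empty group contributes nothing
    labels = [row[-1] for row in group]
    g_score = 0.0
    for c in classes:
      p = labels.count(c) / len(group)
      g_score += p * p
    impurity += (1 - g_score) * (len(group) / n)
  return impurity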
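
A small worked example may help when checking an implementation of the weighted Gini impurity, which for a group is $1 - \sum_c p_c^2$ and for a split is the size-weighted average over the groups. The sketch below works on bare label lists rather than full rows, and the function names are hypothetical.

```python
def gini_of_labels(labels):
    """Gini impurity 1 - sum(p_c^2) of one group of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(left, right):
    """Size-weighted average impurity of a two-way split."""
    n = len(left) + len(right)
    return sum(gini_of_labels(g) * len(g) / n for g in (left, right) if g)

pure_split = weighted_gini([0, 0], [1, 1])    # both sides pure
mixed_split = weighted_gini([0, 1], [0, 1])   # both sides evenly mixed
print(pure_split, mixed_split)
```

A split that separates the classes perfectly scores 0, while a split that leaves two classes evenly mixed on both sides scores 0.5.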

In [None]:
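def split(value, feature, data):
  """
  Partition rows on one feature: rows with d[feature] < value go to the first
  (left) group, all other rows to the second (right) group.
  """
  left, right = [], []
  for d in data:
    if d[feature] < value:
      left.append(d)
    else:
      right.append(d)
  return left, right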

In [None]:
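def splitter(data, classes):
  """
  Exhaustively search every feature and every distinct feature value for the
  split with the lowest weighted Gini impurity.
  Returns (feature index, threshold, impurity, (left rows, right rows)).
  """
  index, value, score, groups = 0, 0, 2.0, None
  for i in range(len(data[0]) - 1):  # last column holds the class label
    checked_vals = set()  # thresholds already evaluated for this feature
    for d in data:
      if d[i] in checked_vals:
        continue
      checked_vals.add(d[i])
      # Keep the candidate separate so the best split's groups are not
      # overwritten by later, worse candidates
      candidate = split(d[i], i, data)
      impurity = gini_impurity(candidate, classes)
      if impurity < score:
        index, value, score, groups = i, d[i], impurity, candidate
  return (index, value, score, groups)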

In [None]:
class Node:
  def __init__(self, data):
    self.left = None
    self.right = None
    self.data = data
    self.spliton = None
    self.splitthreshold = None
    self.purity = 2.0
  
  def setchildren(self, feature, value, leftnode, rightnode):
    self.left = leftnode
    self.right = rightnode
    self.spliton = feature
    self.splitthreshold = value
  
  def getleft(self):
    return self.left

  def getright(self):
    return self.right

  def getpurity(self):
    return self.purity

  def printtree(self, layer=0):
    if self.data is not None:
      print("Layer #" + str(layer) + ": Feature " + str(self.spliton) + " < " + str(self.splitthreshold))
      layer = layer + 1
      if self.left is not None:
        print("left side of " + str(self.spliton) + " < " + str(self.splitthreshold))
        self.left.printtree(layer)
      if self.right is not None:
        print("right side of " + str(self.spliton) + " < " + str(self.splitthreshold))
        self.right.printtree(layer)

  def predict(self, sample):
    # Internal node: descend left when the sample's value is below the threshold
    if self.left is not None and self.right is not None:
      if sample[self.spliton] < self.splitthreshold:
        return self.left.predict(sample)
      return self.right.predict(sample)
    # Leaf: return the majority class of the rows stored in this node
    classcount = {}
    for d in self.data:
      classcount[d[-1]] = classcount.get(d[-1], 0) + 1
    return max(classcount, key=classcount.get)

In [None]:
def dec_tree(max_depth, min_node, data, classes, num_features, currentnode=None, featuresused=None):
  """
  Recursively grow a decision tree over rows whose last element is the class label.
  Growth stops at max_depth, when a child would hold fewer than min_node rows,
  or once num_features distinct features have been used anywhere in the tree.
  """
  # Create the node before any stopping check so every returned value is a Node
  # (returning None at the depth limit would discard that child's data)
  if currentnode is None:
    currentnode = Node(data)
  if max_depth == 0:
    return currentnode
  if featuresused is None:
    featuresused = []
  elif len(featuresused) >= num_features:
    return currentnode
  bestsplit = splitter(data, classes)
  if bestsplit[2] > currentnode.getpurity():
    return currentnode
  elif len(bestsplit[3][0]) == 0 or len(bestsplit[3][1]) == 0:
    return currentnode
  elif len(bestsplit[3][0]) < min_node or len(bestsplit[3][1]) < min_node:
    return currentnode
  leftnode = dec_tree(max_depth-1, min_node, bestsplit[3][0], classes, num_features, currentnode=currentnode.getleft(), featuresused=featuresused)
  rightnode = dec_tree(max_depth-1, min_node, bestsplit[3][1], classes, num_features, currentnode=currentnode.getright(), featuresused=featuresused)
  currentnode.setchildren(bestsplit[0], bestsplit[1], leftnode, rightnode)
  if bestsplit[0] not in featuresused:
    featuresused.append(bestsplit[0])
  return currentnode

In [None]:
def random_forest(n_trees, data_frac, feature_subcount, x_tr, y_tr):
  """
  Train n_trees decision trees, each on a random data_frac share of the training
  rows (drawn without replacement). Returns the list of root nodes.
  """
  # Append the label as the last column so each row carries its own class
  reformed_x = np.column_stack((x_tr, y_tr))
  classes_y = list(np.unique(y_tr))
  forest = []
  slicesize = int(len(reformed_x) * data_frac)
  for n in range(n_trees):
    sliceindex = np.random.choice(len(reformed_x), size=slicesize, replace=False)
    xsample = [reformed_x[si] for si in sliceindex]
    bigtree = dec_tree(3, 10, xsample, classes_y, feature_subcount)
    bigtree.printtree()
    forest.append(bigtree)
  return forest

In [None]:
def test(x_tr, y_tr, x_ts, y_ts):
  """
  Train a small forest, classify each test sample by majority vote over the trees,
  and return the fraction of test samples classified correctly.
  """
  numcorrect = 0
  trees = random_forest(2, .70, 1, x_tr, y_tr)
  print("len: " + str(len(trees)))
  # Pair each sample with its label directly; looking the label up with
  # np.where(x_ts == x) can match the wrong row when rows share feature values
  for x, actualclass in zip(x_ts, y_ts):
    treepred = {}
    for t in trees:
      xclass = t.predict(x)
      treepred[xclass] = treepred.get(xclass, 0) + 1
    predclass = max(treepred, key=treepred.get)
    if predclass == actualclass:
      numcorrect = numcorrect + 1
  return (numcorrect/len(x_ts))
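
The voting and scoring logic can be checked in isolation from tree training. A minimal sketch with made-up votes, assuming hypothetical helpers `majority_vote` and `accuracy`:

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common prediction among the trees' votes."""
    return Counter(votes).most_common(1)[0][0]

def accuracy(predictions, labels):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Toy votes: each inner list holds one prediction per tree for a single sample
votes_per_sample = [[2, 2, 1], [0, 1, 1], [3, 3, 3]]
true_labels = [2, 0, 3]
predictions = [majority_vote(v) for v in votes_per_sample]
print(predictions, accuracy(predictions, true_labels))
```

Using an odd number of trees avoids most ties; with an even number, the winner of a tie depends on how the vote counter breaks it.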

In [None]:
def do_stage_1(X_tr, X_ts, Y_tr, Y_ts):
    """
    Perform stage 1 of the classification procedure:
        train a random forest classifier using the NB prediction probabilities

    Parameters
    ----------
    X_tr : numpy array
           Array containing training samples.
    X_ts : numpy array
           Array containing testing samples.
    Y_tr : numpy array
           Array containing training labels.
    Y_ts : numpy array
           Array containing testing labels

    Returns
    -------
    accuracy : float
               Fraction of testing samples classified correctly.
    """

    accuracy = test(X_tr, Y_tr, X_ts, Y_ts)
    return accuracy


In [None]:
def main():
    """
    Load data, encode labels to numeric, and perform classification stages 
    """
    # load dataset
    print("Loading dataset ... ")
    X, X_p, X_d, X_c, Y = load_data(ROOT, min_samples=MIN_SAMPLES_PER_CLASS, 
                                          max_samples=MAX_SAMPLES_PER_CLASS)

    # encode labels into numerical values
    print("Encoding labels ... ")
    le = LabelEncoder()
    le.fit(Y)
    Y = le.transform(Y)

    print("Dataset statistics:")
    print("\t Classes: {}".format(len(le.classes_)))
    print("\t Samples: {}".format(len(Y)))
    print("\t Dimensions: ", X.shape, X_p.shape, X_d.shape, X_c.shape)

    # shuffle
    print("Shuffling dataset using seed {} ... ".format(SEED))
    s = np.arange(Y.shape[0])
    np.random.seed(SEED)
    np.random.shuffle(s)
    X, X_p, X_d, X_c, Y = X[s], X_p[s], X_d[s], X_c[s], Y[s]

    # split
    print("Splitting dataset using train:test ratio of {}:{} ... ".format(round((1-SPLIT)*10), round(SPLIT*10)))
    cut = int(len(Y) * SPLIT)
    X_tr, Xp_tr, Xd_tr, Xc_tr, Y_tr = X[cut:], X_p[cut:], X_d[cut:], X_c[cut:], Y[cut:]
    X_ts, Xp_ts, Xd_ts, Xc_ts, Y_ts = X[:cut], X_p[:cut], X_d[:cut], X_c[:cut], Y[:cut]

    # perform stage 0
    print("Performing Stage 0 classification ... ")
    p_tr, p_ts, d_tr, d_ts, c_tr, c_ts = \
        do_stage_0(Xp_tr, Xp_ts, Xd_tr, Xd_ts, Xc_tr, Xc_ts, Y_tr, Y_ts)

    # build stage 1 dataset using stage 0 results
    # NB predictions are concatenated to the statistical attributes processed from the flows
    X_tr_full = np.hstack((X_tr, p_tr, d_tr, c_tr))
    X_ts_full = np.hstack((X_ts, p_ts, d_ts, c_ts))

    # perform final classification
    print("Performing Stage 1 classification ... ")
    accuracy = do_stage_1(X_tr_full, X_ts_full, Y_tr, Y_ts)

    # print the accuracy returned by stage 1
    print("Accuracy: " + str(accuracy))
    # (a per-class report needs per-sample predictions, which do_stage_1 does not currently return)
    #print(classification_report(Y_ts, pred, target_names=le.classes_))


main()