# T4 - Identifying IoT Devices using Traffic Meta-data

Semester 2221, CSEC 520/620, Team 4\
Assignment 4 - IoT Classification\
Due by November 15, 2022 11:59 PM EST.\
Accounts for 12% of total grade.

This assignment includes some prewritten code for you to work with. This code is a re-implementation of the classification methods described in the 2018 paper ["Classifying IoT Devices in Smart Environments Using Network Traffic Characteristics."](https://doi.org/10.1109/TMC.2018.2866249)

## Requirements

- Python 3.7+
- Download and unzip the materials & IoT dataset (for Google Colab only) using the following code-block...

In [None]:
!gdown 199U9EjMlDsqfOTxaDRMZkALzMhJ1Hmsd
!unzip A4_Materials.zip
!unzip A4_Materials/iot_data.zip

You should now have two new directories in your file list.

`/content/iot_data/*` contains the feature data for the samples in the IoT dataset.

`/content/A4_Materials/*` contains additional resources...
*   *08440758.pdf* is the research paper that describes the following classification pipeline and dataset in detail.
*   *list_of_devices.txt* contains device name, MAC address, and connection type information for all devices in the dataset.
*   *classify.py* and *requirements.txt* contain the notebook code as a script and the libraries needed to run it locally.
*   *iot_data.zip* contains the compressed feature files for the dataset.


## 0. Preliminaries

Before we begin, we import the modules and libraries that the rest of the notebook depends on.

In [None]:
# All primary imports
import json
import argparse
import os
import numpy as np
import tqdm

# Suppress sklearn warnings
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

Below are configurable hard-coded variables that are used throughout the code.

You are welcome to adjust these values however you see fit.

In [None]:
# Seed Value
# (ensures consistent dataset splitting between runs)
SEED = 0

# Default path for IoT data after unzipping the dataset (in Colab)
ROOT = '/content/iot_data'

# Percentage of samples to use for testing (feel free to change)
SPLIT = 0.3

# Port count threshold (ports seen in PORT_COUNT or fewer samples are discarded)
PORT_COUNT = 10

# Maximum and minimum number of samples to allow when loading each IoT device
MAX_SAMPLES_PER_CLASS = 1000
MIN_SAMPLES_PER_CLASS = 20

## 1. Data Loading & Processing

This starter code is organized into four main sections: **(1)** *data loading & processing*, **(2)** *stage-0 classification of bags-of-words*, **(3)** *stage-1 classification using random forests*, and **(4)** *the main function* that glues everything together.

The code will currently run as-is and correctly perform multi-class IoT device identification using the classification pipeline described in the assignment document. Additionally, each function's purpose and behavior is briefly described in its accompanying docstring. I recommend reading through all blocks of code to develop a general understanding of the way feature processing and classification is done.

For this assignment, you will only *need* to modify code section **(3)**, but you are also free to adjust, modify, or rewrite any of the remaining code as you see fit to complete your analysis or to integrate better with your random forest implementation.

In [None]:
def load_data(root, min_samples, max_samples):
    """
    Load json feature files produced from feature extraction.

    The device label (MAC) is identified from the directory in which the feature file was found.
    Returns x and y as separate multidimensional arrays.
    The instances in x contain only the first 6 features.
    The ports, domains, and ciphers features are stored in separate arrays for easier processing in stage 0.

    Parameters
    ----------
    root : str
           Path to the directory containing samples.
    min_samples : int
                  The number of samples each class must have at minimum (else it is pruned).
    max_samples : int
                  Stop loading samples for a class when this number is reached.

    Returns
    -------
    features_misc : numpy array
                    Traffic statistical features
    features_ports : numpy array
                     Vectorized word-bags (e.g., counts) for ports
    features_domains : numpy array
                       Vectorized word-bags (e.g., counts) for domains
    features_ciphers : numpy array
                       Vectorized word-bags (e.g., counts) for ciphers
    labels : numpy array
             (numerical) Labels for all samples in the dataset
    """
    x = []
    x_p = []
    x_d = []
    x_c = []
    y = []

    port_dict = dict()
    domain_set = set()
    cipher_set = set()

    # Create paths and do instance count filtering
    f_paths = []
    f_counts = dict()
    for rt, dirs, files in os.walk(root):
        for f_name in files:
            path = os.path.join(rt, f_name)
            label = os.path.basename(os.path.dirname(path))
            name = os.path.basename(path)
            if name.startswith("features") and name.endswith(".json"):
                f_paths.append((path, label, name))
                f_counts[label] = 1 + f_counts.get(label, 0)

    # Load Samples
    processed_counts = {label: 0 for label in f_counts.keys()}
    for fpath in tqdm.tqdm(f_paths):  # Enumerate all sample files
        path = fpath[0]
        label = fpath[1]
        if f_counts[label] < min_samples:
            continue
        if processed_counts[label] >= max_samples:
            continue  # Limit
        processed_counts[label] += 1
        with open(path, "r") as fp:
            features = json.load(fp)
            instance = [features["flow_volume"],
                        features["flow_duration"],
                        features["flow_rate"],
                        features["sleep_time"],
                        features["dns_interval"],
                        features["ntp_interval"]]
            x.append(instance)
            x_p.append(list(features["ports"]))
            x_d.append(list(features["domains"]))
            x_c.append(list(features["ciphers"]))
            y.append(label)
            domain_set.update(list(features["domains"]))
            cipher_set.update(list(features["ciphers"]))
            for port in set(features["ports"]):
                port_dict[port] = 1 + port_dict.get(port, 0)

    # Prune rarely seen ports
    port_set = set()
    for port in port_dict.keys():
        if port_dict[port] > PORT_COUNT:  # Filter out ports that are rarely seen to reduce feature dimensionality
            port_set.add(port)

    # Map to word-bag
    print("Generating word-bags ... ")
    for i in tqdm.tqdm(range(len(y))):
        # Replace each raw token list with its count vector over the shared vocabulary
        x_p[i] = [x_p[i].count(port) for port in port_set]
        x_d[i] = [x_d[i].count(domain) for domain in domain_set]
        x_c[i] = [x_c[i].count(cipher) for cipher in cipher_set]

    return np.array(x).astype(float), np.array(x_p), np.array(x_d), np.array(x_c), np.array(y)
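
To illustrate the word-bag step at the end of `load_data`, here is a minimal sketch on made-up data. The helper name `to_word_bag` and the sample port lists are hypothetical, and sorting the vocabulary is an assumption made here so that the column order is reproducible (the notebook iterates over a set instead).

```python
from collections import Counter

def to_word_bag(tokens, vocabulary):
    """Count how often each vocabulary entry occurs in tokens (hypothetical helper)."""
    counts = Counter(tokens)
    # Tokens outside the vocabulary (e.g. pruned ports) are simply ignored
    return [counts[word] for word in vocabulary]

# Made-up port lists for two samples
samples = [[53, 443, 443], [53, 123, 8080]]
vocabulary = sorted({53, 123, 443})  # 8080 is treated as pruned in this illustration

bags = [to_word_bag(sample, vocabulary) for sample in samples]
print(bags)  # [[1, 0, 2], [1, 1, 0]]
```

Every sample ends up with a vector of the same length, which is what lets the three bags be stacked into NumPy arrays and passed to the stage-0 classifiers.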

## 2. Stage-0 Classification of Bags-of-Words

In [None]:
def classify_bayes(x_tr, y_tr, x_ts, y_ts):
    """
    Use a multinomial naive Bayes classifier to analyze the 'bag of words' seen in the ports/domains/ciphers features.
    Returns the prediction results for the training and testing datasets as lists of tuples, one tuple per
    data instance, each holding the predicted class index and the confidence (probability) of that prediction.

    Parameters
    ----------
    x_tr : numpy array
           Array containing training samples.
    y_tr : numpy array
           Array containing training labels.
    x_ts : numpy array
           Array containing testing samples.
    y_ts : numpy array
           Array containing testing labels

    Returns
    -------
    c_tr : numpy array
           Prediction results for training samples.
    c_ts : numpy array
           Prediction results for testing samples.
    """
    classifier = MultinomialNB()
    classifier.fit(x_tr, y_tr)

    # Produce class and confidence for training samples
    c_tr = classifier.predict_proba(x_tr)
    c_tr = [(np.argmax(instance), max(instance)) for instance in c_tr]

    # Produce class and confidence for testing samples
    c_ts = classifier.predict_proba(x_ts)
    c_ts = [(np.argmax(instance), max(instance)) for instance in c_ts]

    return c_tr, c_ts


def do_stage_0(xp_tr, xp_ts, xd_tr, xd_ts, xc_tr, xc_ts, y_tr, y_ts):
    """
    Perform stage 0 of the classification procedure:
        process each multinomial feature using naive bayes
        return the class prediction and confidence score for each instance feature

    Parameters
    ----------
    xp_tr : numpy array
           Array containing training (port) samples.
    xp_ts : numpy array
           Array containing testing (port) samples.
    xd_tr : numpy array
           Array containing training (domain) samples.
    xd_ts : numpy array
           Array containing testing (domain) samples.
    xc_tr : numpy array
           Array containing training (cipher suite) samples.
    xc_ts : numpy array
           Array containing testing (cipher suite) samples.
    y_tr : numpy array
           Array containing training labels.
    y_ts : numpy array
           Array containing testing labels

    Returns
    -------
    res_p_tr : numpy array
               Prediction results for training (port) samples.
    res_p_ts : numpy array
               Prediction results for testing (port) samples.
    res_d_tr : numpy array
               Prediction results for training (domains) samples.
    res_d_ts : numpy array
               Prediction results for testing (domains) samples.
    res_c_tr : numpy array
               Prediction results for training (cipher suites) samples.
    res_c_ts : numpy array
               Prediction results for testing (cipher suites) samples.
    """
    # Perform multinomial classification on bag of ports
    res_p_tr, res_p_ts = classify_bayes(xp_tr, y_tr, xp_ts, y_ts)

    # Perform multinomial classification on domain names
    res_d_tr, res_d_ts = classify_bayes(xd_tr, y_tr, xd_ts, y_ts)

    # Perform multinomial classification on cipher suites
    res_c_tr, res_c_ts = classify_bayes(xc_tr, y_tr, xc_ts, y_ts)

    return res_p_tr, res_p_ts, res_d_tr, res_d_ts, res_c_tr, res_c_ts
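
Each stage-0 classifier contributes two columns to the stage-1 input: the index of the most probable class and that class's probability. The sketch below shows the conversion on a made-up probability matrix standing in for `predict_proba` output; `to_class_confidence` is a hypothetical helper written in plain Python. Note that the index refers to a column of the probability matrix, so it equals the encoded label only when the classifier's classes are 0 through K-1 in order, as they are after `LabelEncoder` provided every class appears in the training split.

```python
def to_class_confidence(probabilities):
    """Turn each row of class probabilities into (best column index, best probability)."""
    pairs = []
    for row in probabilities:
        best = max(range(len(row)), key=lambda k: row[k])  # same role as np.argmax
        pairs.append((best, row[best]))
    return pairs

# Made-up probabilities for three samples over three classes
proba = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.25, 0.5, 0.25],
]
print(to_class_confidence(proba))  # [(0, 0.7), (2, 0.6), (1, 0.5)]
```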

## 3. Stage-1 Classification using Random Forests

Your primary goal for this assignment is to implement your own versions of the decision tree and random forest algorithms to replace the scikit-learn implementation currently in use for stage-1 classification.

In [None]:
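def gini_impurity(groups, classes):
    """Return the size-weighted Gini impurity of a candidate split (0.0 means every group is pure)."""
    n = sum(len(group) for group in groups)
    impurity = 0.0
    for group in groups:
        if len(group) == 0:
            continue  # an empty group contributes nothing
        labels = [row[-1] for row in group]  # the class label is stored in the last column
        g_score = sum((labels.count(c) / len(group)) ** 2 for c in classes)
        impurity += (1 - g_score) * (len(group) / n)
    return impurity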
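
As a worked example of the weighted Gini impurity, consider a split that sends labels `[0, 0, 1]` to one group and `[1, 1]` to the other. The first group has impurity 1 - ((2/3)^2 + (1/3)^2) = 4/9, the second is pure, and weighting by group size gives (3/5)(4/9) = 4/15. The sketch below reproduces this; `weighted_gini` is a hypothetical helper that takes plain label lists rather than the notebook's full rows.

```python
from collections import Counter

def weighted_gini(label_groups):
    """Size-weighted Gini impurity of a split, given one list of labels per group."""
    n = sum(len(labels) for labels in label_groups)
    total = 0.0
    for labels in label_groups:
        if not labels:
            continue  # empty groups carry no weight
        counts = Counter(labels)
        gini = 1.0 - sum((count / len(labels)) ** 2 for count in counts.values())
        total += gini * len(labels) / n
    return total

print(weighted_gini([[0, 0, 1], [1, 1]]))  # about 0.2667 (4/15)
print(weighted_gini([[0, 0], [1, 1]]))     # 0.0, a perfect split
print(weighted_gini([[0, 1], [0, 1]]))     # 0.5, both groups are evenly mixed
```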

In [None]:
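def split(value, feature, data):
    """Partition rows into (left, right): left has row[feature] < value, right holds the rest."""
    left, right = [], []
    for d in data:
        if d[feature] < value:
            left.append(d)
        else:
            right.append(d)
    return left, right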

In [None]:
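def splitter(data, classes):
    # Track the best (lowest-Gini) split seen so far; 2.0 exceeds any possible Gini score
    index, value, score, groups = 0, 0, 2.0, None
    for i in range(len(data[0]) - 1):  # the last column holds the class label
        checked_values = set()  # thresholds already evaluated for this feature
        for d in data:
            if d[i] not in checked_values:
                checked_values.add(d[i])
                candidate = split(d[i], i, data)
                gini = gini_impurity(candidate, classes)
                if gini < score:
                    # Keep the groups that belong to the best split, not the most recent one
                    index, value, score, groups = i, d[i], gini, candidate
    return index, value, score, groups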
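
The exhaustive search performed by `splitter` can be illustrated on a tiny made-up dataset. The sketch below is self-contained and uses hypothetical helper names (`gini`, `best_split`); it follows the same idea of trying every observed value of every feature as a threshold, but it is not a drop-in replacement for the notebook's functions.

```python
def gini(rows):
    """Gini impurity of one group of rows; the class label is the last column."""
    if not rows:
        return 0.0
    labels = [row[-1] for row in rows]
    return 1.0 - sum((labels.count(c) / len(rows)) ** 2 for c in set(labels))

def best_split(rows):
    """Try every observed value of every feature as a threshold; keep the lowest weighted Gini."""
    best = None  # (score, feature index, threshold)
    for feature in range(len(rows[0]) - 1):
        for threshold in sorted({row[feature] for row in rows}):
            left = [row for row in rows if row[feature] < threshold]
            right = [row for row in rows if row[feature] >= threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

# Made-up rows: [feature 0, feature 1, label]
toy = [
    [1.0, 7.0, 0],
    [2.0, 3.0, 0],
    [3.0, 8.0, 1],
    [4.0, 2.0, 1],
]
print(best_split(toy))  # (0.0, 0, 3.0): "feature 0 < 3.0" separates the classes
```

Feature 1 never separates the two classes cleanly in this toy data, so the search settles on feature 0 with a weighted impurity of zero.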

In [None]:
class Node:
    def __init__(self, data):
        self.left = None
        self.right = None
        self.data = data
        self.split_on = None
        self.split_threshold = None
        self.purity = 2.0

    def set_children(self, feature, value, left_node, right_node):
        self.left = left_node
        self.right = right_node
        self.split_on = feature
        self.split_threshold = value

    def get_left(self):
        return self.left

    def get_right(self):
        return self.right

    def get_purity(self):
        return self.purity

    def print_tree(self, layer=0):
        if self.data is not None:
            print(f"Layer #{layer}: Feature {self.split_on} < {self.split_threshold}")
            layer = layer + 1
            if self.left is not None:
                print(f"Left side of {self.split_on} < {self.split_threshold}")
                self.left.print_tree(layer)
            if self.right is not None:
                print(f"Right side of {self.split_on} < {self.split_threshold}")
                self.right.print_tree(layer)

    def predict(self, sample):
        if self.left is not None and self.right is not None:
            sample_val = sample[self.split_on]
            if sample_val < self.split_threshold:
                below = self.left.predict(sample)
                return below
            else:
                below = self.right.predict(sample)
                return below
        else:
            class_count = {}
            for d in self.data:
                if d[-1] not in class_count.keys():
                    class_count[d[-1]] = 1
                else:
                    class_count[d[-1]] = class_count[d[-1]] + 1
            return max(class_count, key=class_count.get)

In [None]:
def decision_tree(max_depth, min_node, data, classes, num_features, current_node=None, features_used=None):
    # Create the node before the depth check so a depth-limited call returns a leaf, not None
    if current_node is None:
        current_node = Node(data)
    if max_depth == 0:
        return current_node
    if features_used is None:
        features_used = []
    elif len(features_used) >= num_features:
        return current_node
    best_split = splitter(data, classes)
    if best_split[2] > current_node.get_purity():
        return current_node
    elif len(best_split[3][0]) == 0 or len(best_split[3][1]) == 0:
        return current_node
    elif len(best_split[3][0]) < min_node or len(best_split[3][1]) < min_node:
        return current_node
    left_node = decision_tree(max_depth - 1, min_node, best_split[3][0], classes, num_features,
                              current_node=current_node.get_left(), features_used=features_used)
    right_node = decision_tree(max_depth - 1, min_node, best_split[3][1], classes, num_features,
                               current_node=current_node.get_right(), features_used=features_used)
    current_node.set_children(best_split[0], best_split[1], left_node, right_node)
    if best_split[0] not in features_used:
        features_used.append(best_split[0])
    return current_node

In [None]:
def random_forest(n_trees, data_frac, feature_sub_count, x_tr, y_tr):
    reformed_x = np.zeros(shape=(len(x_tr), len(x_tr[0]) + 1))
    i = 0
    classes_y = []
    while i < len(x_tr):
        c = y_tr[i]
        reformed_x[i] = np.append(x_tr[i], [c], 0)
        if c not in classes_y:
            classes_y.append(c)
        i += 1
    forest = []
    slice_size = int(len(reformed_x) * data_frac)
    indices = np.arange(len(reformed_x))
    for n in range(n_trees):
        x_sample = []
        slice_index = np.random.choice(indices, size=slice_size, replace=False)
        for si in slice_index:
            x_sample.append(reformed_x[si])
        bigtree = decision_tree(5, 10, x_sample, classes_y, feature_sub_count)
        bigtree.print_tree()
        forest.append(bigtree)
    return forest
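
The `random_forest` function above gives each tree a random fraction of the rows, drawn without replacement (`replace=False`). A common alternative in the random forest literature is bootstrap sampling (drawing with replacement) combined with a random subset of features. The sketch below illustrates that alternative in plain Python; the helper names `draw_bootstrap` and `draw_feature_subset` and the sample rows are hypothetical, and this is not the method the notebook currently uses.

```python
import random

def draw_bootstrap(rows, rng):
    """Sample len(rows) rows with replacement (a bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]

def draw_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices out of n_features."""
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)  # fixed seed so the illustration is repeatable
rows = [[0.1, 5.0, 0], [0.4, 3.0, 0], [0.9, 1.0, 1], [0.7, 2.0, 1]]

sample = draw_bootstrap(rows, rng)
features = draw_feature_subset(n_features=2, k=1, rng=rng)

print(len(sample))  # 4: same size as the input, but rows may repeat
print(features)     # a single feature index, either [0] or [1]
```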

In [None]:
def rf_test(x_tr, y_tr, x_ts, y_ts):
    """
    Performs testing on the random forest.

    Parameters
    ----------
    x_tr : numpy array
           Array containing training samples.
    y_tr : numpy array
           Array containing training labels.
    x_ts : numpy array
           Array containing testing samples.
    y_ts : numpy array
           Array containing testing labels.

    Returns
    -------
    accuracy : float
           Fraction of test samples classified correctly (between 0.0 and 1.0).
    """
    # Train Random Forest
    correct_number = 0
    trees = random_forest(1, .7, 1, x_tr, y_tr)
    print(f'Length: {len(trees)}')

    # Generate predictions for each sample, pairing it with its true label by position
    for sample, actual_class in zip(x_ts, y_ts):
        tree_predictions = {}

        for tree in trees:
            x_class = tree.predict(sample)
            tree_predictions[x_class] = tree_predictions.get(x_class, 0) + 1

        # Majority vote across trees, then check whether the prediction was correct
        predicted_class = max(tree_predictions, key=tree_predictions.get)

        if predicted_class == actual_class:
            correct_number += 1

    return correct_number / len(x_ts)
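
The voting and scoring logic in `rf_test` can also be expressed compactly with `collections.Counter`. The sketch below uses made-up votes and two hypothetical helpers, `majority_vote` and `accuracy`; it assumes each tree's prediction for a sample has already been collected into a list.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent prediction; ties go to the value seen first."""
    return Counter(predictions).most_common(1)[0][0]

def accuracy(predicted, actual):
    """Fraction of positions where the prediction matches the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Made-up votes from three trees for four test samples
votes_per_sample = [[0, 0, 1], [2, 1, 2], [1, 1, 1], [0, 2, 2]]
y_true = [0, 2, 1, 0]

y_pred = [majority_vote(votes) for votes in votes_per_sample]
print(y_pred)                    # [0, 2, 1, 2]
print(accuracy(y_pred, y_true))  # 0.75
```

Keeping the per-sample predictions in a list like `y_pred`, rather than only a running count of correct answers, also makes it possible to compute per-class metrics afterwards.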

In [None]:
def do_stage_1(x_tr, x_ts, y_tr, y_ts):
    """
    Perform stage 1 of the classification procedure:
        train a random forest classifier using the NB prediction probabilities

    Parameters
    ----------
    x_tr : numpy array
           Array containing training samples.
    y_tr : numpy array
           Array containing training labels.
    x_ts : numpy array
           Array containing testing samples.
    y_ts : numpy array
           Array containing testing labels.

    Returns
    -------
    accuracy : float
           Fraction of testing samples classified correctly.
    """

    accuracy = rf_test(x_tr, y_tr, x_ts, y_ts)
    return accuracy

## 4. Main Function

In [None]:
def main():
    """
    Load data, encode labels to numeric, and perform classification stages.
    """
    # load dataset
    print("Loading dataset ...")
    x, x_p, x_d, x_c, y = load_data(ROOT, min_samples=MIN_SAMPLES_PER_CLASS,
                                    max_samples=MAX_SAMPLES_PER_CLASS)

    # Encode labels into numerical values
    print("\nEncoding labels ...")
    le = LabelEncoder()
    le.fit(y)
    y = le.transform(y)

    print("Dataset statistics:")
    print(f"\t Classes: {len(le.classes_)}")
    print(f"\t Samples: {len(y)}")
    print("\t Dimensions: ", x.shape, x_p.shape, x_d.shape, x_c.shape)

    # Shuffle
    print(f"\nShuffling dataset using seed {SEED} ...")
    s = np.arange(y.shape[0])
    np.random.seed(SEED)
    np.random.shuffle(s)
    x, x_p, x_d, x_c, y = x[s], x_p[s], x_d[s], x_c[s], y[s]

    # Split
    print(f"Splitting dataset using train:test ratio of {10 - round(SPLIT * 10)}:{round(SPLIT * 10)} ...")
    cut = int(len(y) * SPLIT)
    x_tr, xp_tr, xd_tr, xc_tr, y_tr = x[cut:], x_p[cut:], x_d[cut:], x_c[cut:], y[cut:]
    x_ts, xp_ts, xd_ts, xc_ts, y_ts = x[:cut], x_p[:cut], x_d[:cut], x_c[:cut], y[:cut]

    # Perform stage 0
    print("\nPerforming Stage 0 classification ...")
    p_tr, p_ts, d_tr, d_ts, c_tr, c_ts = do_stage_0(xp_tr, xp_ts, xd_tr, xd_ts, xc_tr, xc_ts, y_tr, y_ts)

    # Build stage 1 dataset using stage 0 results
    # NB predictions are concatenated to the statistical attributes processed from the flows
    x_tr_full = np.hstack((x_tr, p_tr, d_tr, c_tr))
    x_ts_full = np.hstack((x_ts, p_ts, d_ts, c_ts))

    # Perform final classification
    print("Performing Stage 1 classification ...")
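    accuracy = do_stage_1(x_tr_full, x_ts_full, y_tr, y_ts)

    # Print the stage-1 accuracy (fraction of test samples classified correctly)
    print(f"\nAccuracy: {accuracy:.4f}")
    # Note: do_stage_1 returns an accuracy score, not per-sample predictions, so
    # classification_report(y_ts, pred, target_names=le.classes_) needs predictions to be returned first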


main()
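
The shuffle-and-split step in `main` relies on applying one shared permutation to every array so that features and labels stay aligned. The sketch below shows the same idea in plain Python rather than NumPy; `shuffle_and_split` and the data are hypothetical, and the test portion is taken from the front of the shuffled order as in the notebook.

```python
import random

def shuffle_and_split(columns, test_fraction, seed):
    """Shuffle parallel lists with one shared permutation, then cut off the test portion."""
    n = len(columns[0])
    order = list(range(n))
    random.Random(seed).shuffle(order)  # a fixed seed gives the same split on every run
    cut = int(n * test_fraction)
    test_idx, train_idx = order[:cut], order[cut:]
    train = [[col[i] for i in train_idx] for col in columns]
    test = [[col[i] for i in test_idx] for col in columns]
    return train, test

features = [[i, i * 10] for i in range(10)]  # made-up feature rows
labels = [i % 2 for i in range(10)]          # made-up labels

(x_tr, y_tr), (x_ts, y_ts) = shuffle_and_split([features, labels], test_fraction=0.3, seed=0)
print(len(x_tr), len(x_ts))  # 7 3
```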