# Convolutional Neural Nets to Identify Botnet Traffic: Final Project
## Author: Bryce C Turner

## Important Links:
- Short Video (5 Min): tbd
- Long Video (15 Min): tbd
- Github Project :
- Link to Full Data: https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/botnet-and-ransomware-detection-datasets/



## Introduction

### Botnets
Today's highly networked world comes with many security vulnerabilities. One such vulnerability is a botnet attack, in which a malicious actor takes over one or many computers and uses them for their own purposes. Common uses for botnets include conducting Distributed Denial of Service (DDoS) attacks, spreading spam or misinformation, and conducting phishing attacks.

In some instances, botnets communicate with a central server which gives commands to the individual machines. This communication often happens via Domain Name System (DNS) traffic, because a DNS request may get bounced back and forth between multiple DNS servers.


Source:
- https://www.techtarget.com/searchsecurity/definition/botnet

### Packet Capture

Packet capture for DNS involves capturing and analyzing network traffic specifically related to the Domain Name System (DNS). It entails capturing individual data packets exchanged during DNS transactions to examine communication patterns, identify queries, and analyze responses between DNS clients and servers. For the purposes of this project, I explore whether we can detect any distinction between the DNS packets sent from different botnets and other normal DNS traffic.


## The Dataset

To explore the relationship between botnet and non-malicious DNS packets, I use a publicly available dataset from the University of Victoria. To build it, the researchers constructed virtual machines, deployed 9 different botnets, and captured all internet traffic inside their network. They also deployed a number of non-malicious applications, such as Skype, Windows Updater, Dropbox, and others. Helpfully for this work, they also identified the Command and Control servers for each of the botnets. The dataset contains all internet traffic observed in the virtual environment; however, I focus only on the DNS traffic for this analysis.


### Link to Dataset
https://onlineacademiccommunity.uvic.ca/isot/datasets/

## Overview of Solution

To explore the relationship between botnet DNS traffic and different applications, I construct multiple Convolutional Neural Networks (CNNs) and assess their accuracy on a testing set. Because an individual DNS connection consists of a number of packets, each containing strings of binary data, a CNN is a good model choice. CNNs learn many filters and apply them over an input tensor. Given that binary data quickly becomes uninterpretable to humans, it is better to let the CNN do the feature engineering on its own.
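
To make the idea of a filter concrete, the sketch below slides one hand-written kernel over a short byte sequence using NumPy. The byte values and the kernel are made up for illustration; in the models trained later, the kernel weights are learned by Keras rather than chosen by hand.

```python
import numpy as np

# Hypothetical packet fragment: six byte values scaled to [0, 1].
packet_bytes = np.array([0, 0, 255, 255, 0, 0], dtype=np.float32) / 255.0

# A hand-written kernel of size 3 that responds to rising values.
# In a trained CNN these weights are learned, not chosen by hand.
kernel = np.array([-1.0, 0.0, 1.0], dtype=np.float32)

def conv1d_valid(signal, kernel):
    """Slide the kernel over the signal without padding (cross-correlation)."""
    k = len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel)
                     for i in range(len(signal) - k + 1)])

# Apply the filter, then a ReLU activation that zeroes negative responses.
feature_map = np.maximum(conv1d_valid(packet_bytes, kernel), 0.0)
print(feature_map)
```

The feature map is shorter than the input by `kernel_size - 1` positions, which is why deeper stacks of convolutions need longer inputs.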

I explore various depths and kernel sizes. I do not have a strong prior about what the appropriate values for these should be in this CNN so I begin at a very low number and work my way up, incrementally adding another Convolutional Layer and increasing the kernel size.

## Preparing My Environment
### PYSHARK

This analysis notebook was largely built and run using my Google Colab premium account. However, the preliminary Python scripts I used to prepare the data are raw Python files (.py) and must be run outside of a standard notebook. This is because the pyshark module that I used to manipulate the packet-level data runs in an event loop. Google Colab notebooks already run inside an event loop, and Python will not allow nested event loops, so this preliminary step must be run outside of a notebook. Further, pyshark is a wrapper for the "tshark" application, which must also be installed on the machine; that is why the install command appears below.
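
The nested event loop restriction can be reproduced with the standard library alone. The sketch below assumes that the loop pyshark relies on behaves like a regular `asyncio` loop; `capture_packets` and `notebook_cell` are hypothetical stand-ins, not pyshark functions.

```python
import asyncio

async def capture_packets():
    """Stand-in for work that a library would run on its own event loop."""
    return "captured"

async def notebook_cell():
    """Simulates code running inside an already-active loop, as in a notebook."""
    coro = capture_packets()
    try:
        # Starting a second loop from inside a running one is rejected.
        asyncio.run(coro)
    except RuntimeError as err:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(err)
    return "no error"

message = asyncio.run(notebook_cell())
print(message)
```

Running the data preparation as a plain `.py` script avoids this, because no loop is active when the script starts.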


### GOOGLE DRIVE
The raw data and the intermediate files I save are stored in my Google Drive. I have a premium Google Drive subscription with 150 GB of storage available. This is important to note because the raw data and the tensors that I save are too large to fit within the standard Google Drive limits.


In [4]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
!sudo apt install tshark

In [None]:
%pip install pyshark

## Data Preparation

Getting the data into a usable state required a significant amount of prep work. The first step is not shown here and required a third-party tool called Wireshark, a very popular tool for exploring and manipulating PCAP data. I use Wireshark to apply a filter to the raw dataset so that I only need to work with the DNS data.

Next, I loop over all the files and extract the raw data from each packet. For this analysis, each observation is an individual packet. This step kept running into RAM limits on Colab, so I wrote out a number of intermediate tensor files, which I ultimately concatenate to make the final datasets.
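
The packet-to-tensor step itself lives in `prepare_all_tensors.py` and is not shown in this notebook. As a rough illustration of what such a step could look like, the sketch below pads or truncates raw packet bytes to a fixed length so that they can be stacked into one array. The function name `packet_to_vector`, the length of 128, and the sample bytes are all assumptions and are not taken from the actual script.

```python
import numpy as np

def packet_to_vector(raw: bytes, length: int = 128) -> np.ndarray:
    """Truncate or zero-pad raw packet bytes to a fixed-length uint8 vector."""
    clipped = raw[:length]
    vec = np.zeros(length, dtype=np.uint8)
    vec[:len(clipped)] = np.frombuffer(clipped, dtype=np.uint8)
    return vec

# Made-up byte strings standing in for two captured DNS packets:
# one shorter than the target length, one longer.
packets = [bytes.fromhex("abcd01000001"), bytes(range(200))]
batch = np.stack([packet_to_vector(p, length=128) for p in packets])
print(batch.shape, batch.dtype)
```

Storing the rows as `uint8` rather than a float type keeps the intermediate files small, which may help with the RAM limits mentioned above.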

THIS SECTION NEED ONLY BE RUN ONCE!
Google Colab has a time-out feature: if you do not type or click in Colab for a while, it automatically cancels the running command. This is separate from the resource timeout that occurs if you do not use any resources for more than two hours. The standalone script, "prepare_all_tensors.py", takes about 2 hours to execute, so I had to babysit this cell to ensure that it was not cancelled. To avoid burning through your Colab credits and wasting a ton of time (like I did), run this section once and then explore the following sections as often as you like.

In [None]:
!python /content/drive/MyDrive/DeepLearningForPcap/src/prepare_all_tensors.py

In [3]:
import os
tensor_dir = "/content/drive/MyDrive/DeepLearningForPcap/data/isot_app_and_botnet_dataset/file_tensors/"
all_files = os.listdir(tensor_dir)
all_files.sort()
pcaps = all_files[0::2]
traffic_data = all_files[1::2]

In [None]:
import numpy as np  # needed here because this cell runs before the main import cell

full_pcap = []
full_traffic = []

for pcap, traffic in zip(pcaps, traffic_data):

  with open(tensor_dir + pcap, 'rb') as f:
    full_pcap.append(np.load(f))

  with open(tensor_dir + traffic, 'rb') as f:
    full_traffic.append(np.load(f))


full_pcap = np.concatenate(full_pcap)
with open("/content/drive/MyDrive/DeepLearningForPcap/data/tensors/full_pcap.npy"  , 'wb') as f:
    np.save(f, full_pcap)


full_traffic = np.concatenate(full_traffic)
with open("/content/drive/MyDrive/DeepLearningForPcap/data/tensors/full_traffic.npy"  , 'wb') as f:
    np.save(f, full_traffic)

## Final Preparation Before Model

Before I am able to train the model, I must conduct a few more final data processing steps around the "labels" or Y data.

The raw PCAP data provides only the source IP and destination IP for a given connection, so I need to map those addresses to the application or botnet behind them in order to build the correct labels.

In [13]:
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd
import os
import json


import tensorflow as tf
import keras
from keras import layers


from numba import cuda

In [14]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [15]:
ip_metadata = {
    '192.168.50.14': {
        'name': 'zyklon',
        'is_bot': 'yes'},
    '192.168.50.15': {
        'name': 'blue',
        'is_bot': 'yes'},
    '192.168.50.16': {
        'name': 'liphyra',
        'is_bot': 'yes'},
    '192.168.50.17': {
        'name': 'gaudox',
        'is_bot': 'yes'},
    '192.168.50.18': {
        'name': 'blackout',
        'is_bot': 'yes'},
    '192.168.50.30': {
        'name': 'citadel',
        'is_bot': 'yes'},
    '192.168.50.31': {
        'name': 'citadel',
        'is_bot': 'yes'},
    '192.168.50.32': {
        'name': 'black-energy',
        'is_bot': 'yes'},
    '192.168.50.34': {
        'name': 'zeus',
        'is_bot': 'yes'},


    '192.168.50.19': {
        'name': 'dropbox',
        'is_bot': 'no'},
    '192.168.50.50': {
        'name':'avast',
        'is_bot': 'no'},
    '192.168.50.51': {
        'name': 'adobe-reader',
        'is_bot': 'no'},
    '192.168.50.52': {
        'name': 'adobe-suite',
        'is_bot': 'no'},
    '192.168.50.54': {
        'name': 'chrome',
        'is_bot': 'no'},
    '192.168.50.55': {
        'name': 'firefox',
        'is_bot': 'no'},
    '192.168.50.56': {
        'name': 'malwarebyte',
        'is_bot': 'no'},
    '192.168.50.57': {
        'name': 'wps-office',
        'is_bot': 'no'},
    '192.168.50.58': {
        'name': 'win-update',
        'is_bot': 'no'},
    '192.168.50.59': {
        'name': 'utorrent',
        'is_bot': 'no'},
    '192.168.50.60': {
        'name': 'fosshub',
        'is_bot': 'no'},
    '192.168.50.61': {
        'name': 'bytefence',
        'is_bot': 'no'},
    '192.168.50.63': {
        'name': 'thunderbird-mozila',
        'is_bot': 'no'},
    '192.168.50.64': {
        'name': 'avast',
        'is_bot': 'no'},
    '192.168.50.65': {
        'name': 'skype',
        'is_bot': 'no'},
    '192.168.50.66': {
        'name': 'facebook-messenger',
        'is_bot': 'no'},
    '192.168.50.67': {
        'name': 'ccleaner',
        'is_bot': 'no'},
    '192.168.50.68': {
        'name': 'win-update',
        'is_bot': 'no'},
    '192.168.50.69': {
        'name': 'hitmanpro',
        'is_bot': 'no'},


    '192.168.50.88': {
        'name': 'local-dns-server',
        'is_bot': 'no'},
    '8.8.4.4': {
        'name': 'google-public-dns',
        'is_bot': 'no'},
    '8.8.8.8': {
        'name': 'google-public-dns',
        'is_bot': 'no'}
}


In [16]:
with open("/content/drive/MyDrive/DeepLearningForPcap/data/tensors/full_pcap.npy", 'rb') as f:
    pcap_tensor = np.load(f)

with open("/content/drive/MyDrive/DeepLearningForPcap/data/tensors/full_traffic.npy", 'rb') as f:
    traffic_tensor = np.load(f)


In [17]:
traffic_pd = pd.DataFrame(traffic_tensor, columns = ["traffic_data"])
traffic_pd[['src', 'dst']] = traffic_pd.traffic_data.str.split(" -> ", expand=True)

meta_df = pd.DataFrame.from_dict(ip_metadata, orient="index")

traffic_final = (
    pd.merge(traffic_pd,
            meta_df.add_prefix('src_'),
            left_on='src',
            right_index=True, how='left')
      .fillna({'src_name': 'random_machine'})
      )

traffic_final = (
    pd.merge(traffic_final,
             meta_df.add_prefix('dst_'),
             left_on='dst',
             right_index=True,
             how='left')
    .fillna({'dst_name': 'random_machine'})
  )

traffic_final['series_of_interest'] = traffic_final.src_name + " -> " + traffic_final.dst_name

traffic_final.head()

| | traffic_data | src | dst | src_name | src_is_bot | dst_name | dst_is_bot | series_of_interest |
|---|---|---|---|---|---|---|---|---|
| 0 | 192.168.50.88 -> 192.168.50.51 | 192.168.50.88 | 192.168.50.51 | local-dns-server | no | adobe-reader | no | local-dns-server -> adobe-reader |
| 1 | 192.168.50.19 -> 192.168.50.88 | 192.168.50.19 | 192.168.50.88 | dropbox | no | local-dns-server | no | dropbox -> local-dns-server |
| 2 | 192.168.50.88 -> 8.8.4.4 | 192.168.50.88 | 8.8.4.4 | local-dns-server | no | google-public-dns | no | local-dns-server -> google-public-dns |
| 3 | 192.168.50.51 -> 192.168.50.88 | 192.168.50.51 | 192.168.50.88 | adobe-reader | no | local-dns-server | no | adobe-reader -> local-dns-server |
| 4 | 192.168.50.51 -> 192.168.50.88 | 192.168.50.51 | 192.168.50.88 | adobe-reader | no | local-dns-server | no | adobe-reader -> local-dns-server |


In [18]:
# traffic_final.groupby(['series_of_interest']).count()

In [19]:
def one_hot_from_list(the_list):
    """One-hot encodes the elements of a list. Returns the one-hot encoding and the indices."""
    # Sort the unique values so the label-to-index mapping is the same on every run
    all_options = sorted(set(the_list))
    num_options = len(all_options)

    # Create a mapping from string to numerical indices
    string_to_index = {string: index for index, string in enumerate(all_options)}

    # Convert the strings to numerical indices
    indices = [string_to_index[string] for string in the_list]

    # Perform one-hot encoding using tf.one_hot
    one_hot_encoded = tf.one_hot(indices, depth=num_options).numpy()

    return one_hot_encoded, indices

In [20]:
traffic_one_hot, traffic_indices = one_hot_from_list(traffic_final['series_of_interest'].tolist())
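
Because `one_hot_from_list` does not return the mapping from index to label, predicted class indices cannot be translated back into names afterwards. A possible variant, sketched here with NumPy only, keeps the sorted class names alongside the encoding. `encode_labels` is a hypothetical helper and the sample labels are illustrative.

```python
import numpy as np

def encode_labels(labels):
    """Return (one_hot, indices, class_names) with classes in sorted order."""
    # np.unique sorts the distinct labels and reports each label's position.
    class_names, indices = np.unique(np.asarray(labels), return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for class i.
    one_hot = np.eye(len(class_names), dtype=np.float32)[indices]
    return one_hot, indices, class_names

sample = ["dropbox -> local-dns-server",
          "zeus -> local-dns-server",
          "dropbox -> local-dns-server"]
one_hot, indices, class_names = encode_labels(sample)

# Translate an index back into its label name.
print(class_names[indices[1]])
```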

In [21]:
X_train, X_test, Y_train, Y_test = train_test_split(pcap_tensor, traffic_one_hot, test_size = 0.3, random_state=12)

del pcap_tensor
del traffic_tensor
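
A random split at the packet level may place byte-for-byte identical packets in both the training and the test set, which would inflate test accuracy. The sketch below is one possible way to measure that overlap. The tiny arrays are synthetic stand-ins for `X_train` and `X_test`, and `fraction_seen_in_train` is a hypothetical helper.

```python
import numpy as np

def fraction_seen_in_train(train, test):
    """Fraction of test rows whose exact bytes also occur in the training set."""
    train_rows = {row.tobytes() for row in np.ascontiguousarray(train)}
    hits = sum(row.tobytes() in train_rows for row in np.ascontiguousarray(test))
    return hits / len(test)

# Synthetic stand-ins for X_train and X_test.
train = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.uint8)
test = np.array([[4, 5, 6], [0, 0, 0]], dtype=np.uint8)

overlap = fraction_seen_in_train(train, test)
print(overlap)
```

A value close to 1 would suggest that the test set mostly repeats packets the model has already seen.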

## Define A Modular CNN Model
As described in the Overview of Solution, I train several CNN models on my data. The model definition is modular: the number of convolutional layers and the kernel size are passed in as parameters, so as many convolutional layers can be added as desired.



In [22]:
input_shape = X_train.shape[1]
output_shape = Y_train.shape[1]

n_training_epochs = 5
batch_size = 64

outdir = "/content/drive/MyDrive/DeepLearningForPcap/data/results/"
result_files = os.listdir(outdir)

In [23]:
def base_classifier_model(input_shape, output_shape, conv_kernel_size, n_conv_layers):
    """Defines the baseline classifier model with n_conv_layers + 1 convolutional layers and
    the paired max pool layers."""

    model = keras.models.Sequential()

    for _ in range(n_conv_layers):
        model.add(layers.Conv1D(32, (conv_kernel_size), activation='relu', input_shape=(input_shape, 1)))
        model.add(layers.MaxPooling1D(2))

    model.add(layers.Conv1D(64, conv_kernel_size, activation='relu'))

    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(64, activation='relu'))
    model.add(keras.layers.Dense(output_shape, activation='softmax'))

    return model
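
Each `Conv1D` layer with Keras' default `'valid'` padding shortens the sequence by `kernel_size - 1`, and each `MaxPooling1D(2)` halves it (rounding down), so the deepest configurations need a sufficiently long input. The hypothetical helper `conv_stack_output_length` sketches that arithmetic for the architecture above. The input length of 128 is an assumed example, not the dataset's actual packet length.

```python
def conv_stack_output_length(input_length, kernel_size, n_conv_layers):
    """Length after n (Conv1D + MaxPooling1D(2)) pairs plus one final Conv1D."""
    def conv(length):
        # 'valid' padding with stride 1 needs at least kernel_size positions.
        if length < kernel_size:
            raise ValueError("sequence is shorter than the kernel")
        return length - (kernel_size - 1)

    length = input_length
    for _ in range(n_conv_layers):
        length = conv(length) // 2  # convolution, then pooling with size 2
    return conv(length)             # final Conv1D(64, ...) layer

final_length = conv_stack_output_length(input_length=128, kernel_size=3, n_conv_layers=2)
print(final_length)
```

With these assumed numbers, five paired layers and a kernel size of 4 would already exhaust a 128-position input, so it may be worth checking the real `input_shape` against the deepest entry in the parameter grid.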

In [24]:
param_list = [
    {'kernel_size': 3, 'n_conv_layers': 2},
    {'kernel_size': 3, 'n_conv_layers': 3},
    {'kernel_size': 3, 'n_conv_layers': 4},
    {'kernel_size': 3, 'n_conv_layers': 5},

    {'kernel_size': 4, 'n_conv_layers': 2},
    {'kernel_size': 4, 'n_conv_layers': 3},
    {'kernel_size': 4, 'n_conv_layers': 4},
    {'kernel_size': 4, 'n_conv_layers': 5}
]


outdir = "/content/drive/MyDrive/DeepLearningForPcap/data/results/"
result_files = os.listdir(outdir)

results = []

for using_param in param_list:
    print(using_param)


    file_name = f"k{using_param['kernel_size']}_covs{using_param['n_conv_layers']}"
    outfile = outdir + file_name

    if file_name in result_files:
        continue

    # Define the model using the parameters supplied for this round
    model = base_classifier_model(input_shape,
                                  output_shape,
                                  using_param['kernel_size'],
                                  using_param['n_conv_layers'])

    model.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])


    model.fit(X_train, Y_train,
              epochs=n_training_epochs,
              batch_size=batch_size)

    # Evaluate the Model using the Test Set defined above
    test_loss, test_acc = model.evaluate(X_test, Y_test)

    using_param['test_lost'] = test_loss
    using_param['test_acc'] = test_acc

    with open(outfile, "w") as f:
        f.write(json.dumps(using_param, indent=4))

    print(test_loss); print(test_acc)


{'kernel_size': 3, 'n_conv_layers': 2}
{'kernel_size': 3, 'n_conv_layers': 3}
{'kernel_size': 3, 'n_conv_layers': 4}
{'kernel_size': 3, 'n_conv_layers': 5}
{'kernel_size': 4, 'n_conv_layers': 2}
{'kernel_size': 4, 'n_conv_layers': 3}
{'kernel_size': 4, 'n_conv_layers': 4}
{'kernel_size': 4, 'n_conv_layers': 5}
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
0.002026134170591831
0.9998342990875244


## Results

The results of my model training are confusing to me: every model specification reaches a test accuracy above 99.98%.

On the surface this may look promising, but I suspect that something strange is going on under the hood, and I have not yet been able to figure out what it is. One possibility is that the accuracy of the model is being calculated incorrectly in the presence of so many outcome classes.
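
One way to probe whether the accuracy figure is misleading is to compare it with a majority-class baseline and to look at recall per class, since overall accuracy can be dominated by a few very frequent classes. The sketch below uses NumPy only. The label arrays are synthetic; in this notebook they could come from `Y_test.argmax(axis=1)` and `model.predict(X_test).argmax(axis=1)`. Both helper functions are hypothetical.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = np.bincount(y_true)
    return counts.max() / len(y_true)

def per_class_recall(y_true, y_pred, n_classes):
    """Recall for each class; NaN where a class has no true examples."""
    recall = np.full(n_classes, np.nan)
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():
            recall[c] = np.mean(y_pred[mask] == c)
    return recall

# Synthetic labels: class 0 dominates, class 2 is rare and always missed.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

accuracy = np.mean(y_true == y_pred)
baseline = majority_baseline_accuracy(y_true)
recall = per_class_recall(y_true, y_pred, n_classes=3)
print(accuracy, baseline, recall)
```

If the baseline is already close to the reported accuracy, or if recall is poor for the rare classes, the headline number says little about how well botnet traffic is actually separated.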

In [77]:
result_files = os.listdir(outdir)
result_files = [outdir + f for f in result_files]


outtable = pd.DataFrame()

for path in result_files:
  # Each result file holds the parameters and test metrics for one model
  with open(path, 'r') as f:
    result = json.load(f)

  outtable = pd.concat([outtable, pd.DataFrame(result, index=[0])])

outtable

| | kernel_size | n_conv_layers | test_lost | test_acc |
|---|---|---|---|---|
| 0 | 3 | 2 | 0.000349 | 0.999959 |
| 0 | 3 | 3 | 0.000367 | 0.999957 |
| 0 | 3 | 4 | 0.001577 | 0.999875 |
| 0 | 3 | 5 | 0.001615 | 0.999874 |
| 0 | 4 | 2 | 0.000937 | 0.999934 |
| 0 | 4 | 3 | 0.000767 | 0.999915 |
| 0 | 4 | 4 | 0.001471 | 0.999879 |
| 0 | 4 | 5 | 0.002026 | 0.999834 |


## Conclusion

Although the results of the model training were improbably good, I still learned a great deal from this project. I spent a large portion of my time constructing and manipulating the data into a usable shape, which will make future endeavors in this field easier.

In terms of model performance, I believe that there is more work to be done but that this is an important first step. The DNS traffic clearly has some patterns to it, but given the very high accuracy observed, I suspect that this model would not perform well if it were deployed in the field.

I also want to thank my Professors and TAs for their time this semester. We covered a lot of different topics, and the homeworks were in depth and interesting. I know that they all put a lot of time and effort into constructing the course, and it does not go unnoticed or unappreciated. I look forward to seeing you all again soon.