# Extracting features from raw network data

Welcome to this notebook! We will first explore how to extract features that we can use in our machine
learning models from raw network data. Raw network data is usually stored in packet capture (PCAP) files.
PCAP files store every byte of the packets, which allows us to replay the network traffic at a later
time. Unfortunately, machine learning models cannot work directly on raw packet bytes. Instead, they
expect a numerical representation. After all, machine learning models are defined by a set of
mathematical operations.

Not all the information in a PCAP will be relevant to our machine learning models. A PCAP file can have
millions of packets, many of which carry little or no relevant information, such as
acknowledgements and keep-alive packets. Therefore, we focus on "packet flows" (i.e.,
TCP connections) instead of individual packets.

We know that each DNS-over-HTTPS (DoH) request is sent over a TLS connection, which in turn runs inside a TCP connection. By extracting features about the TCP connection instead of individual packets, we can significantly reduce the amount of data that we need to process with our machine learning models.

The set of example features that we choose for this notebook are as follows:

|Feature number | Feature |
|:---:|:---:|
|1|Number of sent bytes|
|2|Number of received bytes|
|3|Number of sent packets|
|4|Number of received packets|
|5|Ratio of received to sent bytes|
|6|Ratio of received to sent packets|
|7|Length of connection in seconds|
|8|Average received packet size|
|9|Median received packet size|
|10|Variance of received packet size|
|11|Average sent packet size|
|12|Median sent packet size|
|13|Variance of sent packet size|
|14|Minimum delay between received packets|
|15|Average delay between received packets|
|16|Maximum delay between received packets|
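
To make the table concrete, the sketch below computes all 16 features for a single connection from per-packet timestamps and sizes. It is only an illustration: the function name `flow_features`, the dictionary keys, and the input layout are hypothetical, and it assumes population variance and timestamps in seconds, neither of which the table specifies.

```python
from statistics import mean, median, pvariance


def flow_features(sent, received):
    """Compute the 16 example features for one TCP connection.

    `sent` and `received` are lists of (timestamp_seconds, size_bytes)
    tuples sorted by timestamp. `sent` must be non-empty and `received`
    needs at least two packets so that a delay can be computed.
    """
    sent_sizes = [size for _, size in sent]
    recv_sizes = [size for _, size in received]
    recv_times = [ts for ts, _ in received]
    all_times = [ts for ts, _ in sent] + recv_times
    # Delays between consecutive received packets
    delays = [later - earlier for earlier, later in zip(recv_times, recv_times[1:])]
    return {
        "sent_bytes": sum(sent_sizes),
        "received_bytes": sum(recv_sizes),
        "sent_packets": len(sent_sizes),
        "received_packets": len(recv_sizes),
        "bytes_ratio": sum(recv_sizes) / sum(sent_sizes),
        "packets_ratio": len(recv_sizes) / len(sent_sizes),
        "duration_s": max(all_times) - min(all_times),
        "recv_mean_size": mean(recv_sizes),
        "recv_median_size": median(recv_sizes),
        "recv_var_size": pvariance(recv_sizes),  # assumption: population variance
        "sent_mean_size": mean(sent_sizes),
        "sent_median_size": median(sent_sizes),
        "sent_var_size": pvariance(sent_sizes),  # assumption: population variance
        "recv_min_delay": min(delays),
        "recv_mean_delay": mean(delays),
        "recv_max_delay": max(delays),
    }


# A made-up connection: two sent packets and three received packets
sent = [(0.0, 100), (1.0, 300)]
received = [(0.5, 500), (1.5, 700), (3.5, 600)]
features = flow_features(sent, received)
print(features)
```

In the rest of the notebook, NFStream computes most of these values for us, so this function is only meant to show what each feature measures.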

We could also include IP addresses, port numbers, etc. However, we choose not to include them because it
is fairly easy to filter DoH traffic from the rest of the network traffic based on the IP address of the
server and the port number. That is, any TCP connection to port 443 on a known public DoH server can be
reasonably assumed to be a DoH connection. 
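
The rule in the last sentence can be sketched as a small predicate. The helper name `is_doh_flow` is hypothetical, the addresses are placeholders from the reserved documentation ranges rather than real DoH servers, and the keys `dst_ip` and `dst_port` are an assumption about how each connection record is laid out.

```python
DOH_PORT = 443


def is_doh_flow(flow, doh_servers):
    """Return True if the flow goes to a known DoH server on port 443."""
    return flow["dst_ip"] in doh_servers and flow["dst_port"] == DOH_PORT


# Placeholder address; replace it with the servers you actually know about
doh_servers = {"192.0.2.53"}

flows = [
    {"dst_ip": "192.0.2.53", "dst_port": 443},    # known server, HTTPS port
    {"dst_ip": "192.0.2.53", "dst_port": 80},     # known server, wrong port
    {"dst_ip": "198.51.100.7", "dst_port": 443},  # unknown server
]
doh_flows = [flow for flow in flows if is_doh_flow(flow, doh_servers)]
print(doh_flows)
```

Because this check already separates DoH from non-DoH connections, the IP address and port add nothing once we are looking only at DoH traffic.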

## The PCAP file

There are multiple ways of obtaining a PCAP file. For illustration purposes, we will be using a PCAP provided by [MontazerShatoori et al.](https://www.unb.ca/cic/datasets/dohbrw-2020.html).

This PCAP contains DoH traffic generated by a Chrome browser using the public AdGuard DoH server. An automated script used Chrome to open a list of websites and recorded the network traffic into a PCAP file. The PCAP also includes non-DoH traffic to servers using port 443.

In practice, the PCAP file can be obtained from firewalls. It is also possible to configure your firewall to calculate the features directly (beyond the scope of this notebook). 

## Finding the IP address of the server

Open the PCAP file in Wireshark and find the IP address of the AdGuard server. We will need it to filter out the non-DoH traffic.

The file is stored in ```pcaps/dump-small.pcap```.

### How should we track down the AdGuard IP address?
### Can you find additional DoH servers?


## Filtering out non-DoH traffic
Our objective is to find malicious DoH traffic, so we can ignore all non-DoH traffic. To this end, we can create a new PCAP file that contains only DoH traffic. We know that only one DoH server is being used, and we know its IP address.

Use the display filter ```ip.addr == [DoH Server IP address]``` to show only the traffic to and from that server in Wireshark.

Save the DoH traffic to a new file called ```pcaps/DoH-traffic.pcap```.

To save the file we can use ```File>Export Specified Packets...```. Make sure you select ```Displayed```.

## Extracting TCP Connection Features from a PCAP file

We use the NFStream library in Python to identify the TCP connections and calculate the features. We save the features into a CSV file.  

In [None]:
import csv
import pandas as pd
import numpy as np

from nfstream import NFStreamer

# Initialize the NFStreamer object
my_streamer = NFStreamer(
    source='/home/dsu/doh_workshop/pcaps/Chrome_Adguard_Traffic/DoH-traffic.pcap',  # the path to your PCAP file
    statistical_analysis=True,  # NFStream will generate the TCP connection stats
    active_timeout=25200,  # Seconds a flow may stay active before NFStream expires and exports it
)

# Convert the flows to a pandas DataFrame (no columns are anonymized)
features_df = my_streamer.to_pandas(columns_to_anonymize=[])
features_df

## Let's filter out the unneeded columns
You can see that NFStreamer actually calculates many more statistics than we need. It also includes additional columns that do not help us detect the malicious DoH traffic. For example, the destination port is 443 for all traffic.

In [None]:
# Show the columns. Select and copy the ones you want to keep.
features_df.columns

In [None]:
# Add the columns you selected to the cols list
cols = [] # paste your columns between the square brackets

features_df = features_df[cols]
features_df
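
Some features from our list, such as the ratios and the variances, may not appear as ready-made columns and have to be derived. The sketch below shows one way to do that for a single flow record. The helper `derive_features` and its output keys are hypothetical. The input column names follow NFStream's naming as we understand it, so check them against `features_df.columns` before relying on them. It also assumes that the client opened the connection, so `src2dst` means "sent" and `dst2src` means "received".

```python
def derive_features(row):
    """Derive ratio, duration, and variance features from one flow record.

    `row` can be anything indexable by column name, such as a dict or a
    pandas DataFrame row.
    """
    return {
        "bytes_ratio": row["dst2src_bytes"] / row["src2dst_bytes"],
        "packets_ratio": row["dst2src_packets"] / row["src2dst_packets"],
        # Convert the duration from milliseconds to seconds
        "duration_s": row["bidirectional_duration_ms"] / 1000,
        # Variance is the square of the standard deviation
        "recv_var_ps": row["dst2src_stddev_ps"] ** 2,
        "sent_var_ps": row["src2dst_stddev_ps"] ** 2,
    }


# A made-up flow record with assumed NFStream-style column names
example_row = {
    "src2dst_bytes": 400,
    "dst2src_bytes": 1800,
    "src2dst_packets": 2,
    "dst2src_packets": 3,
    "bidirectional_duration_ms": 3500,
    "src2dst_stddev_ps": 4.0,
    "dst2src_stddev_ps": 10.0,
}
derived = derive_features(example_row)
print(derived)
```

The median packet sizes are not covered here. As far as we know they are not among NFStream's default statistics, so they would need to be computed from the individual packets.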

## Don't forget to save the dataframe to a CSV file. 

In [None]:
features_df.to_csv('sample-features.csv', index=False)

## Final note

PCAP files are often very large. Filtering the PCAP and generating the statistics is better done directly through code. 
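
As a rough sketch of what filtering in code could look like, the example below builds a BPF capture filter and passes it to NFStream instead of exporting packets from Wireshark. The helper names `doh_bpf_filter` and `stream_doh_flows` are hypothetical, the IP address is a placeholder from the documentation range, and the example assumes that `NFStreamer` accepts a `bpf_filter` argument, which you should confirm for your installed NFStream version.

```python
def doh_bpf_filter(server_ip, port=443):
    """Build a BPF expression that keeps only TCP traffic to or from the server."""
    return f"tcp and host {server_ip} and port {port}"


def stream_doh_flows(pcap_path, server_ip):
    """Create an NFStreamer that reads only the DoH server's traffic."""
    # Imported here so the filter helper can be used without nfstream installed
    from nfstream import NFStreamer

    return NFStreamer(
        source=pcap_path,
        bpf_filter=doh_bpf_filter(server_ip),  # assumption: supported by your version
        statistical_analysis=True,
    )


# Placeholder address; use the DoH server IP you found in Wireshark
print(doh_bpf_filter("192.0.2.53"))
```

Calling `stream_doh_flows("pcaps/dump-small.pcap", "<server IP>").to_pandas()` would then give the same kind of DataFrame as above without the intermediate ```pcaps/DoH-traffic.pcap``` file.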

In the rest of this learning objective, we will use CSV files that have already been processed from more than 100 GB of network traffic.

The datasets in the workshop have been stripped of IP addresses. However, in practice, you will need to keep them so that you can track down which computers are generating the malicious traffic.