# Feature extraction and preprocessing 

In this laboratory, you will use Pyshark to extract a set of header fields from network packets and convert them into a suitable format using one of the techniques discussed in class (representations of traffic flows).
You will also assign a label to each packet: the text labels pre-assigned to the traffic categories are converted into numerical vectors using the one-hot encoding technique.

In [1]:
# Author: Roberto Doriguzzi-Corin
# Project: Course on Network Intrusion and Anomaly Detection with Machine Learning
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import pyshark
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# We need the following to get around “RuntimeError: This event loop is already running” when using Pyshark within Jupyter notebooks.
# Not needed in stand-alone Python projects
import nest_asyncio
nest_asyncio.apply()  

LABELS = ['BENIGN','SYN']
# Definition of malicious flows 
DOS2019_SYN_FLOWS = [('172.16.0.5','192.168.50.1'), ('172.16.0.5','192.168.50.4')]

# Bag of words
bow = ['arp','data','dhcp','dns','eth','ftp','http','icmp','ip','ssdp','ssl','tcp','telnet','tls','udp']
# Create a CountVectorizer instance
vectorizer = CountVectorizer()
# Fit the vectorizer on the bag of words so that its vocabulary is exactly the list above
# (the transformed output is not needed here, only the fitted vocabulary)
vectorizer.fit(bow)

# One-hot encoder
# Create an instance of OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
labels_reshaped = np.array(LABELS).reshape(-1, 1)
one_hot_encoded = encoder.fit_transform(labels_reshaped)


# Path to the capture file
capture_file = './PCAPs/benign-syn.pcap'
cap = pyshark.FileCapture(capture_file)

# Packet feature extraction and labelling
In this step, complete the code in the following cell as follows:
- make sure to keep only TCP packets
- extract the 5-tuple packet identifier composed by source and destination IP addresses, source and destination transport ports and transport protocol
- extract the list of protocols (string), IP flags (integer), the TCP length (integer) and TCP flags (integer)
- assign the label to the packet
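
The conversions and the bi-directional labelling check listed above can be sketched on plain Python values, without Pyshark. This is only an illustration under stated assumptions: the `raw_fields` dictionary, its key names, its values, and the helpers `is_malicious` and `build_features` are all hypothetical and are not part of the Pyshark API. The order of the elements in the 5-tuple and the use of the numeric protocol identifier are also assumptions.

```python
# Hypothetical raw header values, written as strings the way a packet
# dissector might report them (all values are made up for illustration).
raw_fields = {
    "src_ip": "172.16.0.5",
    "dst_ip": "192.168.50.1",
    "src_port": "51234",
    "dst_port": "80",
    "protocol": "6",        # IP protocol number 6 is TCP
    "ip_flags": "0x02",     # assumed to be a hexadecimal string
    "tcp_len": "0",
    "tcp_flags": "0x0002",  # assumed hexadecimal string; bit 0x002 is SYN
}

MALICIOUS_FLOWS = [("172.16.0.5", "192.168.50.1"), ("172.16.0.5", "192.168.50.4")]


def is_malicious(src_ip, dst_ip, malicious_flows):
    """Return True if the address pair matches a malicious flow in either direction."""
    return (src_ip, dst_ip) in malicious_flows or (dst_ip, src_ip) in malicious_flows


def build_features(raw, malicious_flows):
    """Turn raw string fields into a feature dictionary (hypothetical helper)."""
    features = {
        # A tuple is hashable, so it can later be used as a dictionary key.
        "ID": (raw["src_ip"], raw["dst_ip"],
               int(raw["src_port"]), int(raw["dst_port"]), int(raw["protocol"])),
        "IP_FLAGS": int(raw["ip_flags"], 16),   # base 16 accepts the "0x" prefix
        "TCP_LENGTH": int(raw["tcp_len"]),
        "TCP_FLAGS": int(raw["tcp_flags"], 16),
    }
    malicious = is_malicious(raw["src_ip"], raw["dst_ip"], malicious_flows)
    features["LABEL"] = "SYN" if malicious else "BENIGN"
    return features


features = build_features(raw_fields, MALICIOUS_FLOWS)
print(features)
```

Checking both `(src_ip, dst_ip)` and `(dst_ip, src_ip)` ensures that replies sent by the victim to the attacker receive the same label as the attack packets themselves.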

In [None]:
packet_list = []
for packet in cap:
    packet_features = {}
    ### ADD YOUR CODE HERE ###
    if '' in packet and '' in packet:  # Check if the packet has IP and TCP layers

        
        src_ip =   # Source IP address
        dst_ip =   # Destination IP address
        src_port =   # Source port
        dst_port =   # Destination port
        protocol =   # Transport layer protocol (TCP or UDP)

        # Fill the packet features dictionary
        packet_features['ID'] = # the 5-tuple: a Python tuple (hashable, so it can be used as a dictionary key) with the 5 features extracted above
        packet_features['Protocols'] = str(packet.frame_info.protocols)
        packet_features['IP_FLAGS'] = int( , 16) # IP flags need to be converted from a hexadecimal string to an integer
        packet_features['TCP_LENGTH'] = int()
        packet_features['TCP_FLAGS'] = int( , 16) # TCP flags need to be converted from a hexadecimal string to an integer
    ##########################

        ### ADD YOUR CODE HERE ###
        if (src_ip,dst_ip) in DOS2019_SYN_FLOWS or  in DOS2019_SYN_FLOWS: # check whether the packet belongs to a malicious communication (bi-directional)
        ##########################
            packet_features['LABEL'] = 'SYN'
        else:
            packet_features['LABEL'] = 'BENIGN'

        packet_list.append(packet_features)
        print(packet_features)

# Feature preprocessing and label encoding
The next step consists of converting the string features and labels into numerical vectors that can be used with ML algorithms. More precisely:
- convert the ```protocols``` string into a numerical vector using the bag-of-words technique
- convert the text labels into numerical vectors using the one-hot encoding technique
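
To make the two encodings concrete, here is a minimal hand-rolled sketch in plain Python. It is not how scikit-learn implements `CountVectorizer` or `OneHotEncoder`; it only illustrates the idea behind each technique. The helper names `bag_of_words` and `one_hot`, the tokenisation rule, and the example protocols string are assumptions made for illustration.

```python
import re

# Same vocabulary and label set as defined at the top of the notebook.
VOCABULARY = ['arp', 'data', 'dhcp', 'dns', 'eth', 'ftp', 'http', 'icmp',
              'ip', 'ssdp', 'ssl', 'tcp', 'telnet', 'tls', 'udp']
LABELS = ['BENIGN', 'SYN']


def bag_of_words(text, vocabulary):
    """Count how many times each vocabulary word occurs in the text.

    Tokens are runs of letters and digits (an assumed tokenisation rule);
    tokens that are not in the vocabulary are ignored.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tokens.count(word) for word in vocabulary]


def one_hot(label, labels):
    """Return a vector with a single 1 at the position of the given label."""
    if label not in labels:
        raise ValueError(f"unknown label: {label}")
    return [1 if label == candidate else 0 for candidate in labels]


protocols = "eth:ethertype:ip:tcp"  # hypothetical protocols string
print(bag_of_words(protocols, VOCABULARY))
# [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
print(one_hot("SYN", LABELS))
# [0, 1]
```

Note that `ethertype` does not appear in the vocabulary, so it contributes nothing to the vector: only `eth`, `ip` and `tcp` are counted.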

In [None]:
# Iterate through the list of packets
for packet in packet_list:
    ### Processing the list of protocols with the BoW technique

    ### ADD YOUR CODE HERE ###
    protocols_string =  # take the "protocols" feature from the packet's dictionary
    ##########################

    protocols_vector = vectorizer.transform([protocols_string])
    packet['Protocols'] = protocols_vector.toarray()[0]

    ### Convert the text label into a one-hot encoded label

    ### ADD YOUR CODE HERE ###
    label = # take the label from the packet's dictionary
    ##########################
    
    label_reshaped = np.array(label).reshape(-1, 1)
    one_hot_encoded_new_label = encoder.transform(label_reshaped)
    packet['LABEL'] = one_hot_encoded_new_label

    print(packet)
    

# Build the network traffic flows
Finally, we use the 5-tuple and its transposed version (source and destination swapped) to group the packets into bi-directional flows.
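
As a rough illustration of this grouping, the sketch below works on a few hand-written packets. The layout of the 5-tuple, `(src_ip, dst_ip, src_port, dst_port, protocol)`, is an assumption, as are the helper names `transpose` and `group_into_flows` and all the sample values; adapt them to the tuple you built earlier.

```python
def transpose(five_tuple):
    """Swap source and destination in an assumed
    (src_ip, dst_ip, src_port, dst_port, protocol) tuple."""
    src_ip, dst_ip, src_port, dst_port, protocol = five_tuple
    return (dst_ip, src_ip, dst_port, src_port, protocol)


def group_into_flows(packets):
    """Group packets so that both directions of a connection share one key."""
    flows = {}
    for packet in packets:
        key = packet["ID"]
        reverse_key = transpose(key)
        if key in flows:
            flows[key].append(packet)
        elif reverse_key in flows:
            # Reply direction: store it under the key of the first packet seen.
            flows[reverse_key].append(packet)
        else:
            flows[key] = [packet]
    return flows


# Hypothetical packets: a request, its reply, and an unrelated packet.
sample_packets = [
    {"ID": ("10.0.0.1", "10.0.0.2", 40000, 80, 6), "TCP_FLAGS": 2},
    {"ID": ("10.0.0.2", "10.0.0.1", 80, 40000, 6), "TCP_FLAGS": 18},
    {"ID": ("10.0.0.3", "10.0.0.2", 40001, 443, 6), "TCP_FLAGS": 2},
]

sample_flows = group_into_flows(sample_packets)
for flow_id, flow_packets in sample_flows.items():
    print(flow_id, len(flow_packets))
```

The flow key is always the 5-tuple of the first packet observed, so the direction of a flow is defined by whichever endpoint sent that first packet.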

In [None]:
# Iterate through the list of packets
flows = {}
for packet in packet_list:

    ### ADD YOUR CODE HERE ###
    packet_id = # take the 5-tuple
    packet_id_tr = () # fill this Python tuple with the transposed 5-tuple
    ##########################

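    if packet_id in flows:
        # The packet belongs to an existing flow (same direction as the first packet)
        flows[packet_id].append(packet)
    elif packet_id_tr in flows:
        # The packet belongs to an existing flow (opposite direction)
        flows[packet_id_tr].append(packet)
    else:
        # First packet of a new flow
        flows[packet_id] = [packet]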

# Print the first 10 packets of each flow
for flow_id, packets in flows.items():
    print("Flow ID:", flow_id)
    for packet in packets[:10]:
        print(packet)