## IMPORTANT!!

### From the root NetworkML directory, make sure you first run:
```
pip3 install .
pip3 install nest_asyncio
```

NOTE: `nest_asyncio` is a hack needed to run this in JupyterLab/Notebook. You will also need to install Wireshark/tshark to use this notebook, and make sure that `tshark` is on your PATH.
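
A quick way to confirm that `tshark` is reachable before running the cells is to look it up on the PATH. This is a minimal sketch using only the standard library; the helper name `find_tshark` is made up for illustration and is not part of NetworkML.

```python
import shutil


def find_tshark():
    """Return the full path to the tshark executable, or None if it is not on PATH."""
    return shutil.which('tshark')


tshark_path = find_tshark()
if tshark_path is None:
    print('tshark not found on PATH; install Wireshark/tshark first')
else:
    print(f'tshark found at {tshark_path}')
```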

In [2]:
# some initial setup
import sys

# let's set a path to a pcap (we'll use one included in the tests)
path = '../tests/trace_ab12_2001-01-01_02_03-client-ip-1-2-3-4.pcap'

# let's change the output so it's easy to find
output = 'trace_ab12_2001-01-01_02_03-client-ip-1-2-3-4.pcap.csv.gz'

# set arguments for arg parse
sys.argv = ['pcap_to_csv.py', f'-o{output}', path]
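
Setting `sys.argv` works because argparse reads its arguments from `sys.argv[1:]` by default, and a short option accepts its value attached directly (`-ovalue`). The sketch below shows that behavior with a hypothetical parser; the option names are assumptions for illustration, and the real `pcap_to_csv.py` parser may define more or different options.

```python
import argparse

# Hypothetical parser for illustration only; not the actual NetworkML parser.
parser = argparse.ArgumentParser(prog='pcap_to_csv.py')
parser.add_argument('-o', '--output', help='where to write the result')
parser.add_argument('path', help='input file')

# argparse would read sys.argv[1:] by default; here the list is passed explicitly
# so the sketch does not depend on how the interpreter was started.
argv = ['pcap_to_csv.py', '-oresult.csv.gz', 'input.pcap']
args = parser.parse_args(argv[1:])
print(args.output, args.path)
```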

In [3]:
# hack for jupyterlab since pyshark tries to take the main run loop
import nest_asyncio
nest_asyncio.apply()

# import the class for converting PCAPs to CSVs
from networkml.parsers.pcap_to_csv import PCAPToCSV
instance = PCAPToCSV()

# this will parse the args we specified above in sys.argv and run using them
instance.main()

# this will take a minute or so to run

INFO:networkml.parsers.pcap_to_csv:Including the following layers in CSV (if they exist): ['<IP Layer>', '<ETH Layer>', '<TCP Layer>', '<UDP Layer>', '<ICMP Layer>', '<ICMPv6 Layer>', '<DNS Layer>', '<DHCP Layer>', '<DHCPv6 Layer>', '<ARP Layer>', '<IP6 Layer>', '<TLS Layer>']
INFO:networkml.parsers.pcap_to_csv:GZipped CSV file(s) written out to: ['trace_ab12_2001-01-01_02_03-client-ip-1-2-3-4.pcap.csv.gz']


In [4]:
# the output is a gzipped csv, so let's quickly pop that open and see what we have
# we're going to use DictReader from the CSV lib, so we'll get back a list of dictionaries, where the keys in the dicts are the fieldnames
# each dictionary in the list is a record (packet)

from networkml.featurizers.csv_to_features import CSVToFeatures
instance = CSVToFeatures()
rows = instance.get_rows('trace_ab12_2001-01-01_02_03-client-ip-1-2-3-4.pcap.csv.gz', 'both')
print(rows[0])



{'udp.time_delta': '', 'udp.dstport_raw': '', 'frame.protocols': 'eth:ethertype:ip:tcp:http', 'tls.handshake': '', 'dhcp.flags.bc': '', 'tcp.flags.res_raw': "['0', 46, 1, 3584, 2]", 'tcp.options.sack_perm_tree': '', 'tcp.options': '', 'udp.length': '', 'arp.dst.proto_ipv4_raw': '', 'ip.ttl_tree': '', 'arp.dst.proto_ipv4': '', 'tls.alert_message_raw': '', 'tcp.checksum': '0x00005df8', 'eth.addr.oui': '4219270', 'tcp.window_size_scalefactor': '-1', 'eth.padding_raw': '', 'tcp.options.mss': '', 'dhcp.hw.type': '', 'http.host': '', 'dhcp.secs': '', 'tcp.options.sack_tree': '', 'tcp.payload': '47:45:54:20:2f:63:6f:6d:70:6c:65:74:65:2f:73:65:61:72:63:68:3f:63:6c:69:65:6e:74:3d:63:68:72:6f:6d:65:26:68:6c:3d:65:6e:2d:55:53:26:71:3d:63:72:20:48:54:54:50:2f:31:2e:31:0d:0a:48:6f:73:74:3a:20:63:6c:69:65:6e:74:73:31:2e:67:6f:6f:67:6c:65:2e:63:61:0d:0a:43:6f:6e:6e:65:63:74:69:6f:6e:3a:20:6b:65:65:70:2d:61:6c:69:76:65:0d:0a:55:73:65:72:2d:41:67:65:6e:74:3a:20:4d:6f:7a:69:6c:6c:61:2f:35:2e:30:20:28:57
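
The internals of `get_rows` are not shown in this notebook. As a rough sketch of the idea described above (a gzipped CSV read with `DictReader` into a list of dictionaries), the standard library alone is enough. The function name `read_gzipped_csv` and the sample data are assumptions made for illustration.

```python
import csv
import gzip
import os
import tempfile


def read_gzipped_csv(path):
    """Return every record in a gzipped CSV as a list of dicts keyed by field name."""
    with gzip.open(path, 'rt', newline='') as f:
        return list(csv.DictReader(f))


# Write a tiny two-record sample so the sketch runs without a capture file.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.csv.gz')
with gzip.open(sample_path, 'wt', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['ip.src', 'ip.dst', 'tcp.dstport'])
    writer.writeheader()
    writer.writerow({'ip.src': '10.0.0.1', 'ip.dst': '10.0.0.2', 'tcp.dstport': '80'})
    writer.writerow({'ip.src': '10.0.0.2', 'ip.dst': '10.0.0.1', 'tcp.dstport': ''})

sample_rows = read_gzipped_csv(sample_path)
print(sample_rows[0])
```

Every value comes back as a string, and fields that were missing for a record show up as empty strings, which matches the shape of the output above.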

In [5]:
# we have a gzipped csv with all of the fields we could extract at the 'packet' level
# (we could have supplied an arg above to do a different level, like 'flow')
# we can now take that file and reduce or change or add which fields should be included using the featurizer

# let's set a path to a gzipped csv (we'll use one we just made)
path = 'trace_ab12_2001-01-01_02_03-client-ip-1-2-3-4.pcap.csv.gz'

# let's change the output so it's easy to find
output = 'trace_ab12_2001-01-01_02_03-client-ip-1-2-3-4.features.csv.gz'

# we need to specify where the featurizer functions are
features_path = '../networkml/featurizers/funcs'

# set arguments for arg parse
sys.argv = ['csv_to_features.py', f'-o{output}', f'-p{features_path}', path]

from networkml.featurizers.csv_to_features import CSVToFeatures
instance = CSVToFeatures()
instance.main()


Importing class: Host
Importing class: Generic
Importing class: Flow
Importing class: Packet
Running method: Flow/default_tcp_5tuple
Running method: Flow/default_udp_5tuple


INFO:networkml.featurizers.csv_to_features:GZipped CSV file(s) written out to: ['trace_ab12_2001-01-01_02_03-client-ip-1-2-3-4.features.csv.gz']
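
The log lines above suggest that the featurizer imports each class and then runs its methods. How NetworkML actually does this is not shown in this notebook; the following is only a sketch of one way such a "run every method on a class" step could look, using `inspect` from the standard library. `ExampleFuncs` and `run_all_methods` are hypothetical names.

```python
import inspect


class ExampleFuncs:
    """Hypothetical class standing in for one found in the funcs directory."""

    def src_only(self, rows):
        return [{'ip.src': row.get('ip.src', '')} for row in rows]

    def dst_only(self, rows):
        return [{'ip.dst': row.get('ip.dst', '')} for row in rows]


def run_all_methods(instance, rows):
    """Call every public method on instance with rows and collect the results by name."""
    results = {}
    for name, method in inspect.getmembers(instance, predicate=inspect.ismethod):
        if name.startswith('_'):
            continue
        print(f'Running method: {type(instance).__name__}/{name}')
        results[name] = method(rows)
    return results


demo_rows = [{'ip.src': '10.0.0.1', 'ip.dst': '10.0.0.2'}]
results = run_all_methods(ExampleFuncs(), demo_rows)
```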


In [6]:
# the output is a gzipped csv again, so let's quickly pop that open and see what we have
# we're going to use DictReader from the CSV lib, so we'll get back a list of dictionaries, where the keys in the dicts are the fieldnames
# each dictionary in the list is a record (packet)

from networkml.featurizers.csv_to_features import CSVToFeatures
instance = CSVToFeatures()
rows = instance.get_rows('trace_ab12_2001-01-01_02_03-client-ip-1-2-3-4.features.csv.gz', 'both')
print(rows[0])

{'tcp.srcport': '57011', 'tcp.dstport': '80', 'ip.src_host': '192.168.3.131', 'frame.protocols': 'eth:ethertype:ip:tcp:http', 'udp.srcport': '', 'udp.dstport': '', 'ip.dst_host': '72.14.213.138'}


### Writing New Featurizer Functions
Now that we can see how to turn PCAPs into CSVs and then use the featurizer to change the columns and make a new CSV, let's look at how to write new featurizer functions and use them. Here's an example of an already included class with functions.

In [7]:
with open('../networkml/featurizers/funcs/flow.py', 'r') as f:
    for line in f:
        # each line already ends with a newline, so don't let print add another
        print(line, end='')

from networkml.featurizers.features import Features

class Flow(Features):

    def default_tcp_5tuple(self, rows):
        fields = ['ip.src_host', 'ip.dst_host', 'tcp.dstport', 'tcp.srcport', 'frame.protocols']
        return self.get_columns(fields, rows)

    def default_udp_5tuple(self, rows):
        fields = ['ip.src_host', 'ip.dst_host', 'udp.dstport', 'udp.srcport', 'frame.protocols']
        return self.get_columns(fields, rows)



In [8]:
# So simply use an existing Python file and class already in the `funcs` directory,
# or create a new one, and make sure the class subclasses `Features`
# and that the function signatures take in rows and return rows.
# The above example has a helper function `get_columns` that lets you provide a list of fields and the rows
# and returns the rows with only those fields.

# You don't need to use the helper function, but the returned rows should be a list of dictionaries,
# just like the input rows are (the same thing you get back from a CSV DictReader).

# Here's a sample to test:
from networkml.featurizers.csv_to_features import CSVToFeatures
instance = CSVToFeatures()
rows = instance.get_rows('trace_ab12_2001-01-01_02_03-client-ip-1-2-3-4.pcap.csv.gz', 'both')
print(rows[0])

{'udp.time_delta': '', 'udp.dstport_raw': '', 'frame.protocols': 'eth:ethertype:ip:tcp:http', 'tls.handshake': '', 'dhcp.flags.bc': '', 'tcp.flags.res_raw': "['0', 46, 1, 3584, 2]", 'tcp.options.sack_perm_tree': '', 'tcp.options': '', 'udp.length': '', 'arp.dst.proto_ipv4_raw': '', 'ip.ttl_tree': '', 'arp.dst.proto_ipv4': '', 'tls.alert_message_raw': '', 'tcp.checksum': '0x00005df8', 'eth.addr.oui': '4219270', 'tcp.window_size_scalefactor': '-1', 'eth.padding_raw': '', 'tcp.options.mss': '', 'dhcp.hw.type': '', 'http.host': '', 'dhcp.secs': '', 'tcp.options.sack_tree': '', 'tcp.payload': '47:45:54:20:2f:63:6f:6d:70:6c:65:74:65:2f:73:65:61:72:63:68:3f:63:6c:69:65:6e:74:3d:63:68:72:6f:6d:65:26:68:6c:3d:65:6e:2d:55:53:26:71:3d:63:72:20:48:54:54:50:2f:31:2e:31:0d:0a:48:6f:73:74:3a:20:63:6c:69:65:6e:74:73:31:2e:67:6f:6f:67:6c:65:2e:63:61:0d:0a:43:6f:6e:6e:65:63:74:69:6f:6e:3a:20:6b:65:65:70:2d:61:6c:69:76:65:0d:0a:55:73:65:72:2d:41:67:65:6e:74:3a:20:4d:6f:7a:69:6c:6c:61:2f:35:2e:30:20:28:57

In [9]:
from networkml.featurizers.features import Features

class Flow(Features):

    def example_simple(self, rows):
        fields = ['layers', 'eth.src.oui_resolved', 'eth.dst.oui_resolved']
        return self.get_columns(fields, rows)

flow = Flow()
new_rows = flow.example_simple(rows)
print(new_rows[0])

{'layers': '[<ETH Layer>, <IP Layer>, <TCP Layer>, <HTTP Layer>, <FRAME_RAW Layer>, <ETH_RAW Layer>, <IP_RAW Layer>, <TCP_RAW Layer>, <HTTP_RAW Layer>]', 'eth.src.oui_resolved': "Micro-Star Int'L Co.,Ltd", 'eth.dst.oui_resolved': 'Sophos Ltd'}
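
The source of `get_columns` is not shown in this notebook. Based only on the description above (it takes a list of fields and the rows and returns the rows with only those fields), a plain-Python equivalent might look like the sketch below. `select_columns` is a hypothetical name, and details such as how missing or empty fields are handled are assumptions that may differ from the real helper.

```python
def select_columns(fields, rows):
    """Return new rows that keep only the requested fields present in each row."""
    return [{field: row[field] for field in fields if field in row} for row in rows]


demo_rows = [
    {'ip.src': '10.0.0.1', 'ip.dst': '10.0.0.2', 'tcp.dstport': '80'},
    {'ip.src': '10.0.0.2', 'udp.dstport': '53'},
]
selected = select_columns(['ip.src', 'tcp.dstport'], demo_rows)
print(selected)
```

The input rows are left untouched, so the same `rows` list can be passed to several featurizer functions in turn.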


In [10]:
# great, but how did we know the fields?
rows[0].keys()

dict_keys(['udp.time_delta', 'udp.dstport_raw', 'frame.protocols', 'tls.handshake', 'dhcp.flags.bc', 'tcp.flags.res_raw', 'tcp.options.sack_perm_tree', 'tcp.options', 'udp.length', 'arp.dst.proto_ipv4_raw', 'ip.ttl_tree', 'arp.dst.proto_ipv4', 'tls.alert_message_raw', 'tcp.checksum', 'eth.addr.oui', 'tcp.window_size_scalefactor', 'eth.padding_raw', 'tcp.options.mss', 'dhcp.hw.type', 'http.host', 'dhcp.secs', 'tcp.options.sack_tree', 'tcp.payload', 'eth.dst.oui_raw', 'ip.ttl_raw', 'dns.time', 'eth.addr.oui_resolved_raw', 'dns.count.add_rr_raw', 'frame.encap_type', 'tls.handshake.ciphersuites', 'eth.src.oui_resolved_raw', 'tls.record.version_raw', 'tcp.window_size_raw', 'tcp.hdr_len', 'ip.dst', 'tcp.urgent_pointer_raw', 'tcp.options.mss_tree', 'data_raw', 'frame.offset_shift', 'tcp.dstport', 'ip.flags.mf', 'tcp.flags.urg_raw', 'frame.marked', 'dhcp.option.type_tree', 'eth.src_raw', 'ip.src_host_raw', 'tcp.stream', 'tcp.time_delta', 'dns.count.answers', 'ip.id', 'tcp.flags.push_raw', 'icm
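
With this many fields, it can help to narrow the list down before picking columns. The two helpers below are a small sketch with hypothetical names: one lists the fields for a given protocol prefix, the other lists the fields that actually hold a value in at least one record.

```python
def fields_with_prefix(rows, prefix):
    """Return the sorted field names starting with prefix that appear in any row."""
    return sorted({key for row in rows for key in row if key.startswith(prefix)})


def non_empty_fields(rows):
    """Return the sorted field names with a non-empty value in at least one row."""
    return sorted({key for row in rows for key, value in row.items() if value != ''})


demo_rows = [
    {'tcp.dstport': '80', 'tcp.srcport': '57011', 'udp.dstport': ''},
    {'tcp.dstport': '', 'tcp.srcport': '', 'udp.dstport': ''},
]
tcp_fields = fields_with_prefix(demo_rows, 'tcp.')
used_fields = non_empty_fields(demo_rows)
print(tcp_fields)
print(used_fields)
```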

In [11]:
# what if we want to create a new field/column or reduce the number of records/rows?
from networkml.featurizers.features import Features

class NewColumn(Features):

    def example_modify(self, rows):
        # reduce rows first as needed
        fields = ['layers', 'eth.src.oui_resolved', 'eth.dst.oui_resolved', 'ip.src', 'ip.dst']
        rows = self.get_columns(fields, rows)
        
        # create new columns using existing column info
        last_layer = 'Last Layer'
        combined_ips = 'Combined IPs'
        # each row is a dict
        for row in rows:
            # not all rows are guaranteed to have 'layers'
            if 'layers' in row:  
                # get the last element in the stringified list and clean it up
                row[last_layer] = row['layers'].split('<')[-1][:-2].split()[0]
            
            # not all rows are guaranteed to have 'ip.src' and 'ip.dst'
            if 'ip.src' in row and 'ip.dst' in row:
                # combine two fields with a colon, making a new field
                row[combined_ips] = row['ip.src']+':'+row['ip.dst']
                # remove ip.src and ip.dst now that we have combined them
                del row['ip.src']
                del row['ip.dst']
        
        # remove duplicate rows (going through a set does not preserve the original row order)
        rows = [dict(t) for t in {tuple(d.items()) for d in rows}]
        return rows

print(f'Number of original records: {len(rows)}\n')
ncol = NewColumn()
new_rows = ncol.example_modify(rows)
print(f'Fields/Values in first record (row): {new_rows[0]}\n')
print(f'Number of records after deduplication: {len(new_rows)}\n')

Number of original records: 14261

Fields/Values in first record (row): {'layers': '[<ETH Layer>, <IP Layer>, <TCP Layer>, <MSNMS Layer>, <FRAME_RAW Layer>, <ETH_RAW Layer>, <IP_RAW Layer>, <TCP_RAW Layer>, <MSNMS_RAW Layer>]', 'eth.src.oui_resolved': 'PCS Computer Systems GmbH', 'eth.dst.oui_resolved': 'Realtek (UpTech? also reported)', 'Last Layer': 'MSNMS_RAW', 'Combined IPs': '10.0.2.15:64.4.9.254'}

Number of records after deduplication: 708
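
The set-based deduplication in `example_modify` does not keep the original order, so the first record printed above is not necessarily the first packet in the capture and may differ between runs. If order matters, one alternative is to track the records already seen and keep the first occurrence of each. This is a sketch with a hypothetical helper name, and it assumes all keys and values are strings, as they are when read from a CSV.

```python
def dedupe_rows(rows):
    """Drop duplicate dicts while keeping the first occurrence of each, in order."""
    seen = set()
    unique = []
    for row in rows:
        # Sorting the items makes the key independent of each dict's insertion order.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


demo_rows = [
    {'Combined IPs': '10.0.0.1:10.0.0.2', 'Last Layer': 'TCP_RAW'},
    {'Last Layer': 'TCP_RAW', 'Combined IPs': '10.0.0.1:10.0.0.2'},
    {'Combined IPs': '10.0.0.2:10.0.0.1', 'Last Layer': 'UDP_RAW'},
]
unique_rows = dedupe_rows(demo_rows)
print(len(unique_rows))
```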

