# Automation Workshop: Analysing network captures

## I - Initializing your environment

### A) Setting up a virtual environment
(Optional but recommended)

```bash
virtualenv -p python3 venv
source venv/bin/activate
```
(Use `deactivate` to leave the virtual environment once you are done.)

Alternatively, you can prefix all your `python` and `pip` commands with `./venv/bin/` (e.g. `./venv/bin/pip3 install -U pip`).



### B) Setting up Jupyter

In order to follow along on your computer:

```bash
pip3 install notebook
jupyter-notebook
```

### C) Installation of PyMISP

#### 1. Make sure the submodules are up-to-date and cloned

```bash
git submodule update --init --recursive PyMISP/
```

#### 2. Install PyMISP with the developer options

```bash
cd PyMISP
pip3 install -e .
```

#### 3. To be able to use the additional PyMISP helpers

```bash
# Make sure the package required for pydeep is installed
sudo apt-get install -y libfuzzy-dev

pip3 install python-magic lief git+https://github.com/kbandla/pydeep.git
```

## II - Automate the collection of data from network captures

Network captures provide invaluable insights into network activity, enabling analysts to detect intrusions, malware communications, and other security threats. However, manually analyzing PCAP files can be extremely time-consuming, requiring the inspection of thousands—or even millions—of packets to extract relevant indicators of compromise (IOCs).

Automation is key to streamlining this process. By leveraging the appropriate tools to parse network captures, extract meaningful threat intelligence and directly ingest it into MISP, analysts can significantly reduce the time spent on manual review. This approach not only accelerates incident response but also ensures that threat data is consistently structured and shared efficiently within the community. In this exercise, we will illustrate how automation can transform network capture analysis from a tedious task into an efficient, repeatable workflow that enhances security operations.

### A) Introduction - Using the right tools

#### 1. Analysis Tools

This exercise focuses on the analysis of network captures rather than the capture process itself. If you want to learn more about packet capture, have a look at the documentation of tools such as `tcpdump` or `wireshark`/`tshark`.

For our analysis, we will be working with **PCAP files**.  
A wide range of command-line tools are available for analyzing network captures, including:
- capinfos (Wireshark) – Provides metadata about PCAP files (packet count, duration, etc.).
- mergecap (Wireshark) – Merges multiple PCAP files into one.
- editcap (Wireshark) – Edits and filters packets within a PCAP file.
- tcpdump – Displays and filters packet data from a capture file.
- ipsumdump – Summarizes network traffic for analysis.
- tshark – A powerful packet analyzer with extensive filtering and parsing capabilities.
- tcpflow – Reconstructs TCP flows from a capture (two versions exist with different capabilities).
- ngrep – A grep-like tool for searching packet data.
- yaf – Parses and processes network flows.

#### 2. PyMISP

Our ultimate goal is to **structure and share** the information we extract from network packets in MISP. But manual encoding of the extracted data into MISP would be tedious and error-prone, which is why automation is essential.

PyMISP, the official Python library for MISP, provides a powerful way to interact with the platform programmatically. It allows us to create, enrich, and query events, ensuring a seamless flow of extracted intelligence from our analysis tools into MISP. By leveraging PyMISP in a Python script, we can automate the entire encoding process, transforming raw network data into actionable threat intelligence with minimal manual effort.

In this exercise, we will explore how to use PyMISP to automate this workflow efficiently.

### B) Exercise description

We will use **`tshark`**, the command-line counterpart of Wireshark, which offers the same filtering capabilities as the graphical interface, and automate the packet parsing with some Python code.

#### 0. Preliminary step - Gather our dataset and declare some variables

Let's download PCAP files that are publicly available.

With your favourite browser, visit the latest *malware-traffic-analysis.net* blog posts from [2025](https://malware-traffic-analysis.net/2025/index.html) and download some of the most recent PCAP examples, such as:
- [2025-01-09-CVE-2017-0199-XLS-to-DBatLoader-or-GuLoader-for-AgentTesla-variant.pcap.zip](https://malware-traffic-analysis.net/2025/01/31/2025-01-09-CVE-2017-0199-XLS-to-DBatLoader-or-GuLoader-for-AgentTesla-variant.pcap.zip)
- [2025-01-13-KongTuke-leads-to-infection-abusing-BOINC.pcap.zip](https://malware-traffic-analysis.net/2025/02/10/2025-01-13-KongTuke-leads-to-infection-abusing-BOINC.pcap.zip)

Those zip files are password-protected. The password is *infected_YYYYMMDD*, where the date is the one mentioned in the file name.
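
As a small illustration of the password rule, the following sketch derives the password from a file name. The helper name `zip_password_for` is hypothetical and not part of the workshop scripts; it simply assumes that the file name starts with a `YYYY-MM-DD` date.

```python
import re


def zip_password_for(filename: str) -> str:
    """Build the `infected_YYYYMMDD` password from a dated file name.

    Hypothetical helper: assumes the name starts with `YYYY-MM-DD`.
    """
    match = re.match(r'(\d{4})-(\d{2})-(\d{2})', filename)
    if match is None:
        raise ValueError(f'No leading date in {filename!r}')
    return 'infected_' + ''.join(match.groups())


print(zip_password_for('2025-01-13-KongTuke-leads-to-infection-abusing-BOINC.pcap.zip'))
```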

**Alternatively**, you can execute the following Python script, which gathers some of the zip files from the website and extracts the PCAPs for you:

```bash
# bash
python download_samples.py
```

Now that we have our PCAP files, we can start our analysis and see which relevant information we can extract from the network packets and encode as MISP objects.

In order to store those objects, we start with the creation of a MISP Event which will be their container.

In [None]:
import os
from pathlib import Path
from pymisp import MISPEvent, MISPObject

data_path = Path(os.getcwd()).parent / 'exercises' / 'data'
pcap_file = data_path / '2025-01-09-CVE-2017-0199-XLS-to-DBatLoader-or-GuLoader-for-AgentTesla-variant.pcap'

misp_event = MISPEvent()
misp_event.info = 'AgentTesla variant with CVE-2017-0199'

#### 1. Extract information on the PCAP file

As a first step, we will describe the PCAP file to keep a reference to our source of information.

More specifically, we want to describe the file itself, using a `file` MISP object, as well as the PCAP metadata, using the `pcap-metadata` object. Both object templates are part of the [list of available object templates](https://www.github.com/MISP/misp-objects) on GitHub, where you can find their description.

Starting with the file object, we could create a MISPObject and the related Attributes by ourselves, but PyMISP has a pretty convenient helper for this: `FileObject`. Let's see how to use it in order to have a file object describing our PCAP file added to our MISP Event.

In [None]:
from pymisp.tools import FileObject

file_object = FileObject(filepath=pcap_file, standalone=False)
for attribute in file_object.attributes:
    print(attribute.object_relation, attribute.value)
misp_event.add_object(file_object)

Now let's see the information given by `capinfos`, the command-line tool to describe PCAP metadata:

In [None]:
import subprocess

proc = subprocess.Popen(f'capinfos {pcap_file}', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

for line in proc.stdout.readlines():
    print(line)

Now based on the `pcap-metadata` object template, we can extract some information and generate our MISP object to add it to our Event.

In [None]:
import re

PCAP_METADATA_OBJECT_MAPPING = {
    'Capture length': 'capture-length',
    'File encapsulation': 'protocol',
    'First packet time': 'first-packet-seen',
    'Last packet time': 'last-packet-seen'
}

def parse_pcap_info_line(line: str) -> list[str]:
    if ' = ' in line:
        return line.split(' = ')
    return re.split(r': +', line)

pcap_object = misp_event.add_object(name='pcap-metadata')
proc = subprocess.Popen(
    f'capinfos {pcap_file}', shell=True,
    stdout=subprocess.PIPE, stderr=subprocess.PIPE
)
for line in proc.stdout.readlines():
    decoded = line.decode().strip()
    try:
        key, value = parse_pcap_info_line(decoded)
    except ValueError:
        continue
    if key not in PCAP_METADATA_OBJECT_MAPPING:
        continue
    relation = PCAP_METADATA_OBJECT_MAPPING[key]
    pcap_object.add_attribute(
        relation,
        value.upper() if relation == 'protocol' else value
    )
pcap_object.add_reference(file_object.uuid, 'describes')
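
To make the parsing step concrete, here is a standalone sketch that applies the same splitting logic to a few hand-written lines shaped like `capinfos` output. The sample lines are made up for illustration and are not taken from the workshop PCAP; `split_info_line` and `MAPPING` are local stand-ins for the helper and mapping defined above.

```python
import re

# Hand-written sample lines shaped like `capinfos` output (illustrative only)
sample_lines = [
    'File encapsulation:  Ethernet',
    '                     Capture length = 262144',
    'Packet size limit:   file hdr: 262144 bytes',
]

MAPPING = {'Capture length': 'capture-length', 'File encapsulation': 'protocol'}


def split_info_line(line: str) -> list[str]:
    # Same logic as `parse_pcap_info_line`: ' = ' first, then ': ' plus padding
    if ' = ' in line:
        return line.split(' = ')
    return re.split(r': +', line)


extracted = {}
for line in sample_lines:
    parts = split_info_line(line.strip())
    # Lines that do not split into exactly a key and a value are ignored
    if len(parts) != 2 or parts[0] not in MAPPING:
        continue
    extracted[MAPPING[parts[0]]] = parts[1]

print(extracted)
```

The third sample line splits into three parts, which is why the notebook code catches the `ValueError` raised when unpacking and simply moves on to the next line.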


#### 2. Parsing packets from a network capture

After these few easy preliminary steps, it is time to recall the basics of network packet parsing and declare a helper that builds the `tshark` command we will use to extract different types of information from the packets in our capture file.

In [None]:
# Generic method used later to easily generate a tshark command
def define_command(input_file: Path, fields: tuple,
                   display_filter: str = '!(arp || dhcp)') -> str:
    param = '-o tcp.relative_sequence_numbers:FALSE -E separator="|"'
    fields_cmd = ' -e '.join(fields)
    tshark = f'tshark -T fields {param} -e {fields_cmd} -Y "{display_filter}"'
    return f'{tshark} -r {input_file}'


With `define_command`, we set a few parameters for our `tshark` command, including:
- `-o tcp.relative_sequence_numbers:FALSE` to display absolute sequence numbers rather than relative ones
- `-E separator="|"` to separate the output fields with `|`, since some extracted values may contain `,` and would otherwise be split incorrectly by our Python code
- `-Y "!(arp || dhcp)"` to exclude ARP and DHCP packets from the results (this is the default display filter, which each call can override)
- `-T fields` to print only selected fields, in association with `-e` to specify each of those fields
- `-r` followed by the PCAP file name
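
Because the command string is later run through a shell, file names containing spaces or quotes could cause trouble. A possible variant, shown here as a sketch rather than a replacement for `define_command`, builds the same options as an argument list. The function name `define_command_args` and the file name `capture.pcap` are assumptions made for this example.

```python
from pathlib import Path


def define_command_args(input_file: Path, fields: tuple,
                        display_filter: str = '!(arp || dhcp)') -> list[str]:
    """Hypothetical list-based counterpart of `define_command`."""
    args = [
        'tshark', '-T', 'fields',
        '-o', 'tcp.relative_sequence_numbers:FALSE',
        '-E', 'separator=|',
    ]
    for field in fields:
        # One `-e` option per field to print
        args.extend(['-e', field])
    args.extend(['-Y', display_filter, '-r', str(input_file)])
    return args


args = define_command_args(Path('capture.pcap'), ('ip.src', 'ip.dst'), 'http')
print(args)
```

Such a list can be handed to `subprocess.run(args, capture_output=True)` without `shell=True`, so no shell quoting is involved.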

#### 3. Extract DNS records

DNS records are an interesting kind of information to share in MISP from our network capture.

A Domain Name System (DNS) record is a set of instructions used to connect domain names with internet protocol (IP) addresses within DNS servers. DNS makes it possible for users to browse the internet with customizable domain names and URLs rather than complex numerical IP addresses.

MISP has a `dns-record` object template that could be used to describe the DNS information we extract from packets.

The following list gives the fields we want to look at in order to describe a DNS record:
- `dns`: Used to check whether the packet is a DNS request or response
- `dns.a`: A record - The record that holds the IPv4 address of a domain
- `dns.aaaa`: AAAA record - The record that contains the IPv6 address for a domain
- `dns.cname`: CNAME record - Forwards one domain or subdomain to another domain, does NOT provide an IP address
- `dns.mx.mail_exchange`: MX record - Directs mail to an email server
- `dns.ns`: NS record - Stores the name server for a DNS entry
- `dns.ptr.domain_name`: PTR record - Provides a domain name in reverse-lookups
- `dns.qry.name`: queried domain
- `dns.soa.rname`: SOA record - Stores admin information about a domain
- `dns.spf`: SPF record - Used to identify the mail servers that can send emails through your domain
- `dns.srv.name`: SRV record - Specifies a port for specific services

In [None]:
# Tshark fields to extract
dns_fields = (
    'dns', 'dns.qry.name', 'dns.a', 'dns.aaaa', 'dns.cname',
    'dns.mx.mail_exchange', 'dns.ns', 'dns.ptr.domain_name',
    'dns.soa.rname', 'dns.spf', 'dns.srv.name'
)

dns_cmd = define_command(pcap_file, dns_fields, display_filter='dns')
print(dns_cmd)

We can then run this command with `subprocess.Popen`, which spawns a child process as if we were executing the command directly from our terminal.

The idea is then to read from the standard output with `proc.stdout`, and to use `readlines` to get a list where each element is a line of that output:

In [None]:
proc = subprocess.Popen(dns_cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# We read the results of the tshark command from the standard output
for line in proc.stdout.readlines():
    print(line)

As you can see, the resulting lines returned from the terminal process are byte strings.

In order to use this information, we still need to process each line with:
- `decode`, to decode the byte string into a regular string
- `strip`, to remove the trailing newline character
- `split('|')`, to decompose our lines, based on the separator we chose previously (`|`)
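
These three steps can be tried on a single hand-written line, without running `tshark`. The sample below is fabricated for illustration (the domain and addresses are documentation values, not results from the workshop PCAP); it follows the field order of `dns_fields`.

```python
# Field order matches `dns_fields`: the `dns` summary first, then ten values
sample_fields = ['Domain Name System (response)', 'example.com',
                 '192.0.2.10,192.0.2.11'] + [''] * 8
raw_line = ('|'.join(sample_fields) + '\n').encode()

# decode -> strip -> split, applied to the byte string
dns_type, *values = raw_line.decode().strip().split('|')

print(dns_type)
print(values[0])
# A field that occurs several times in a packet is comma-separated
print(values[1].split(','))
```

Empty strings mark the fields that are absent from the packet, which is why the parsing code only adds an attribute when the value is not empty.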

In [None]:
DNS_RECORDS_OBJECT_RELATIONS = (
    'queried-domain', 'a-record', 'aaaa-record', 'cname-record', 'mx-record',
    'ns-record', 'ptr-record', 'soa-record', 'spf-record', 'srv-record'
)

proc = subprocess.Popen(dns_cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

for line in proc.stdout.readlines():
    # We decode each line and split it based on the separator
    dns_type, *fields = line.decode().strip().split('|')

    # We skip query packets to focus on responses
    if 'query' in dns_type:
        continue

    # creation of a new `dns-record` object
    dns_record = misp_event.add_object(name='dns-record')

    # We iterate over both the object relations and field values to add them as attributes
    for relation, values in zip(DNS_RECORDS_OBJECT_RELATIONS, fields):
        if values:
            if ',' in values:
                for value in values.split(','):
                    dns_record.add_attribute(relation, value)
                continue
            dns_record.add_attribute(relation, values)

    dns_record.add_reference(file_object.uuid, 'included-in')

    print(dns_record)
    for attribute in dns_record.attributes:
        print(f' - {attribute.object_relation}: {attribute.value}')

#### 4. Extract HTTP requests

We can then fetch some HTTP request information and generate some `http-request` MISP objects.

Based on the attributes defining the object template, we can have a look at the dedicated fields in tshark, to describe HTTP requests:
- `http.content_type`: MIME type of the body of the request
- `http.cookie`: HTTP cookie
- `http.host`: Domain name of the server
- `http.referer`: address of the previous web page from which a link to the currently requested page was followed
- `http.request.method`: HTTP Method invoked (one of GET, POST, PUT, HEAD, DELETE, OPTIONS, CONNECT)
- `http.request.full_uri`: request URL
- `http.request.uri`: request URI
- `http.user_agent`: characteristic string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent

In [None]:
http_fields = (
    'http.request.method', 'http.host', 'http.content_type', 'http.cookie',
    'http.referer', 'http.request.full_uri', 'http.request.uri', 'http.user_agent'
)

http_cmd = define_command(pcap_file, ('ip.src', 'ip.dst') + http_fields, display_filter='http')
print(http_cmd)

proc = subprocess.Popen(http_cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

for line in proc.stdout.readlines():
    print(line)

As before, we replicate the process of mapping those values to the `http-request` object template.

In [None]:
HTTP_REQUEST_OBJECT_RELATIONS = (
    'method', 'host', 'content-type', 'cookie', 'referer', 'url', 'uri',
    'user-agent'
)

def parse_http_request(ip_src: str, ip_dst: str, *fields: str) -> MISPObject:
    http_request = MISPObject('http-request')
    http_request.add_attribute('ip-src', ip_src)
    http_request.add_attribute('ip-dst', ip_dst)
    for relation, value in zip(HTTP_REQUEST_OBJECT_RELATIONS, fields):
        if value:
            http_request.add_attribute(relation, value)
    return http_request

proc = subprocess.Popen(http_cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

for line in proc.stdout.readlines():
    ip_src, ip_dst, *fields = line.decode().strip().split('|')
    http_request = parse_http_request(ip_src, ip_dst, *fields)
    print(http_request)
    for attribute in http_request.attributes:
        print(f' - {attribute.object_relation}: {attribute.value}')

#### 5. Extract payloads from the HTTP packets

To extend the HTTP request extraction method we just saw, we can pretty easily update the code to include payloads.

The field we are looking for here is `http.file_data`. Tshark gives a hex-encoded string value for this field, which means we have to use the `unhexlify` function from the `binascii` module of the Python standard library to get the raw content of the file.

PyMISP again comes with a little helper allowing us to skip the complete encoding procedure: `make_binary_objects`. As it takes either a file path or the payload itself as a pseudofile, we simply have to wrap the raw content of the file in a `BytesIO` object.
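
The conversion itself is short. The sketch below uses a made-up four-byte hex string standing in for a real `http.file_data` value, and only shows the decoding and wrapping steps, not the call to PyMISP.

```python
import binascii
from io import BytesIO

# Made-up hex string standing in for an `http.file_data` value
hex_payload = '4d5a9000'

# Hex string -> raw bytes -> in-memory file-like object
raw_bytes = binascii.unhexlify(hex_payload)
pseudofile = BytesIO(raw_bytes)

print(raw_bytes)
print(pseudofile.getbuffer().nbytes)
```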

In [None]:
import binascii
from io import BytesIO
from pymisp.tools import make_binary_objects

http_fields = (
    'http.request.method', 'http.host', 'http.content_type', 'http.cookie',
    'http.referer', 'http.request.full_uri', 'http.request.uri',
    'http.user_agent', 'http.file_data', 'frame.number'
)
# We use the `frame.number` field to generate a file name based on the URI or, failing that, the frame number
def set_payload_name(uri: str, frame_number: str) -> str:
    filename = uri.split('/')[-2 if uri.endswith('/') else -1]
    if filename:
        return filename
    return f'payload_from_packet_{frame_number}'

http_cmd = define_command(pcap_file, ('ip.src', 'ip.dst') + http_fields, display_filter='http')

proc = subprocess.Popen(http_cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

for line in proc.stdout.readlines():
    ip_src, ip_dst, *fields, file_data, frame_number = line.decode().strip().split('|')
    # As seen before, we create a new `http-request` object
    # We add this object to our event and we will be able to reuse the variable to add a reference to the file object
    http_request = misp_event.add_object(parse_http_request(ip_src, ip_dst, *fields))

    if file_data:
        response_uri = fields[-2]
        payload, executable, sections = make_binary_objects(
            pseudofile=BytesIO(binascii.unhexlify(file_data)),
            filename=set_payload_name(response_uri, frame_number),
            standalone=False
        )
        misp_event.add_object(payload)
        # We add a reference to the payload object
        http_request.add_reference(payload.uuid, 'drops')
        print(payload)
        for attribute in payload.attributes:
            print(f' - {attribute.object_relation}: {attribute.value}')

        # For executables recognised by `lief` (PE, ELF or Mach-O), a more detailed description of the executable is also created
        if executable is not None:
            misp_event.add_object(executable)
            if sections:
                for section in sections:
                    misp_event.add_object(section)


#### 6. Extract network connection information

As a first step, we will store the information for every connection between a source and a destination. For this kind of parsing, the fields which could be interesting are for instance:
- `frame.time_epoch` - timestamp of the packet
- `ip.src` - source IP address
- `ip.dst` - destination IP address
- `tcp.srcport` / `udp.srcport` - source port
- `tcp.dstport` / `udp.dstport` - destination port
- `frame.protocols` - list of protocols through different layers related to the packet
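
The `frame.protocols` value is a colon-separated list of the dissected layers. As a quick illustration on a hand-written value (not taken from the workshop PCAP), the layers can be picked out like this:

```python
# Hand-written example of a `frame.protocols` value
frame_protocols = 'eth:ethertype:ip:udp:dns'

layers = frame_protocols.split(':')
# First transport protocol found in the layer list, if any
transport = next((p for p in layers if p in ('tcp', 'udp')), None)

print(layers)
print(transport)
```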

In [None]:
# Tshark fields to extract
network_fields = (
    'frame.time_epoch', 'ip.src', 'ip.dst', 'tcp.srcport', 'tcp.dstport',
    'udp.srcport', 'udp.dstport', 'frame.protocols'
)

tshark_cmd = define_command(pcap_file, network_fields)
print(tshark_cmd)
proc = subprocess.Popen(tshark_cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# We read the results of the tshark command from the standard output
for line in proc.stdout.readlines():
    print(line)

We want to group connections in order to avoid duplicates.

So instead of generating a new MISP object for each packet, we store the connection information for later and increment a packet count as we loop through all the packets. We also want to keep track of the first seen and last seen timestamp values.

A few additional details to be careful with:
- a connection describes a potentially bi-directional communication, so we need to handle the permutation of source and destination IP addresses and ports in the packets we parse.
- a connection is first established through packets that only carry transport layer information. Data describing an application protocol only appears once the connection is confirmed between the two machines, yet all of these packets belong to the same connection.

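So we need to go further than a naive parsing of packets that only considers the fields mentioned above, like `ip.src`, `ip.dst`, `srcport`, `dstport` and `protocols`. In the code below, we additionally extract the `communityid` field, a Community ID flow hash that is identical for both directions of a flow, and use it as the key to group packets. Depending on your Wireshark version, this field may come back empty until the Community ID dissector is enabled (for instance with `--enable-protocol communityid`).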
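
One simple way to picture the direction problem is a key that is identical for both directions of a flow. The sketch below is a simplified illustration with made-up packets; it is not the Community ID algorithm behind the `communityid` field, and `flow_key` is a hypothetical helper.

```python
def flow_key(ip_src: str, src_port: str, ip_dst: str, dst_port: str,
             protocol: str) -> tuple:
    """Hypothetical helper: same key whichever endpoint sent the packet."""
    endpoints = sorted([(ip_src, src_port), (ip_dst, dst_port)])
    return (protocol, *endpoints)


# Made-up packets: (ip_src, src_port, ip_dst, dst_port, protocol)
packets = [
    ('10.0.0.5', '49152', '192.0.2.10', '80', 'tcp'),
    ('192.0.2.10', '80', '10.0.0.5', '49152', 'tcp'),
    ('10.0.0.5', '49152', '192.0.2.10', '80', 'tcp'),
]

# Count packets per flow, regardless of direction
counts = {}
for packet in packets:
    key = flow_key(*packet)
    counts[key] = counts.get(key, 0) + 1

print(counts)
```

All three packets end up under a single key, which is the behaviour we want from the grouping key before counting packets per direction.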

In [None]:
import json

network_fields = (
    'frame.time_epoch', 'ip.src', 'ip.dst', 'tcp.srcport', 'tcp.dstport',
    'udp.srcport', 'udp.dstport', 'frame.protocols', 'communityid'
)

layer3_protocols = {'ip': 'IP', 'ipv6': 'IP'}
layer4_protocols = {'tcp': 'TCP', 'udp': 'UDP'}
layer7_protocols = {
    'dhcp': 'DHCP', 'dns': 'DNS', 'ftp': 'FTP', 'http': 'HTTP',
    'ntp': 'NTP', 'smtp': 'SMTP', 'snmp': 'SNMP', 'ssdp': 'SSDP',
    'ssl': 'HTTPS', 'tftp': 'TFTP', 'tls': 'HTTPS'
}

def handle_protocols(
        frame_protocols: str, connection: dict | None = None) -> dict | None:
    if connection is None:
        connection = {'layer7-protocol': set()}
        for protocol in frame_protocols.split(':'):
            if layer3_protocols.get(protocol) is not None:
                connection['layer3-protocol'] = layer3_protocols[protocol]
                continue
            if layer4_protocols.get(protocol) is not None:
                connection['layer4-protocol'] = layer4_protocols[protocol]
                continue
            if layer7_protocols.get(protocol) is not None:
                connection['layer7-protocol'].add(layer7_protocols[protocol])
        return connection
    # If we already have a connection, we see if we have a new layer 7 protocol
    for protocol in frame_protocols.split(':'):
        if layer7_protocols.get(protocol) is not None:
            connection['layer7-protocol'].add(layer7_protocols[protocol])

def store_connections(cmd):
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    connections = {}
    for line in proc.stdout.readlines():
        # Decompose the line into variables
        (timestamp, ip_src, ip_dst, ts_port, td_port, us_port, ud_port,
         protocols, community_id) = line.decode().strip('\n').split('|')

        timestamp = float(timestamp)

        # We use the community ID as key and we store the other field values
        if community_id not in connections:
            connections[community_id] = {
                'community-id': community_id,
                'ip-src': ip_src, 'ip-dst': ip_dst,
                'src-port': ts_port if ts_port else us_port,
                'dst-port': td_port if td_port else ud_port,
                'first-packet-seen': timestamp,
                'last-packet-seen': timestamp,
                'dst-packets-count': 1, 'src-packets-count': 0,
                **handle_protocols(protocols)
            }
            continue

        # When the connection is already stored, we update the timestamps and packets count
        connection = connections[community_id]
        if timestamp < connection['first-packet-seen']:
            connection['first-packet-seen'] = timestamp
        if timestamp > connection['last-packet-seen']:
            connection['last-packet-seen'] = timestamp
        if ip_src == connection['ip-src'] and ip_dst == connection['ip-dst']:
            connection['dst-packets-count'] += 1
        else:
            connection['src-packets-count'] += 1
        handle_protocols(protocols, connection)

    return connections

network_cmd = define_command(pcap_file, network_fields)
connections = store_connections(network_cmd)
for connection, values in connections.items():
    print(f'{connection}:')
    for key, value in values.items():
        print(f' - {key}: {value}')

We have now extracted the connection information: source and destination IP addresses, source and destination ports, protocols, timestamps and packet counters. It is then straightforward to take all the values and generate `network-connection` MISP objects, as the keys we are storing match the object relations defined in the object template.
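
The `first-packet-seen` and `last-packet-seen` values stored so far are epoch floats, as returned by `frame.time_epoch`. If explicit datetimes are preferred before building the objects, one possible conversion is sketched below. The timestamp is made up, and `to_utc_datetime` is a hypothetical helper that is not used elsewhere in this notebook.

```python
from datetime import datetime, timezone


def to_utc_datetime(epoch: float) -> datetime:
    """Hypothetical helper: epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


# Made-up epoch value, for illustration only
print(to_utc_datetime(1736380800.0).isoformat())
```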


In [None]:
CONNECTION_OBJECT_RELATIONS = (
    'community-id', 'ip-src', 'ip-dst', 'src-port', 'dst-port',
    'first-packet-seen', 'last-packet-seen', 'dst-packets-count',
    'src-packets-count', 'layer3-protocol', 'layer4-protocol'
)

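for connection in connections.values():
    misp_object = misp_event.add_object(name='network-connection')
    for relation in CONNECTION_OBJECT_RELATIONS:
        # Some values may be missing or empty (e.g. no layer 4 protocol
        # or ports for ICMP packets), so we skip them
        value = connection.get(relation)
        if value is None or value == '':
            continue
        misp_object.add_attribute(relation, value)
    for protocol in connection['layer7-protocol']:
        misp_object.add_attribute('layer7-protocol', protocol)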

    print(misp_object)
    for attribute in misp_object.attributes:
        print(f' - {attribute.object_relation}: {attribute.value}')