# Automation Workshop: Analysing network captures

## I - Initializing your environment

### A) Setting up a virtual environment
(Optional but recommended)

```bash
virtualenv -p python3 venv
source venv/bin/activate
```
(Use `deactivate` to leave the virtual environment once you are done.)

Alternatively, you can prefix all your `python` and `pip` commands with `./venv/bin/` (e.g. `./venv/bin/pip3 install -U pip`).



### B) Setting up Jupyter

In order to follow along on your computer:

```bash
pip3 install notebook
jupyter-notebook
```

### C) Installation of PyMISP

#### 1. Make sure the submodules are up-to-date and cloned

```bash
git submodule update --init --recursive PyMISP/
```

#### 2. Install PyMISP with the developer options

```bash
cd PyMISP
pip3 install -e .
```

#### 3. To be able to use the additional PyMISP helpers

```bash
# Make sure the package required for pydeep is installed
sudo apt-get install -y libfuzzy-dev

pip3 install python-magic lief git+https://github.com/kbandla/pydeep.git
```

## II - Automate the collection of data from network captures

Network captures provide invaluable insights into network activity, enabling analysts to detect intrusions, malware communications, and other security threats. However, manually analyzing PCAP files can be extremely time-consuming, requiring the inspection of thousands—or even millions—of packets to extract relevant indicators of compromise (IOCs).

Automation is key to streamlining this process. By leveraging the appropriate tools to parse network captures, extract meaningful threat intelligence and directly ingest it into MISP, analysts can significantly reduce the time spent on manual review. This approach not only accelerates incident response but also ensures that threat data is consistently structured and shared efficiently within the community. In this exercise, we will illustrate how automation can transform network capture analysis from a tedious task into an efficient, repeatable workflow that enhances security operations.

### A) Introduction - Using the right tools

#### 1. Analysis Tools

In this exercise, we will focus on the analysis of network captures rather than the capture process itself. If you want to learn more about packet capture, have a look at the documentation of tools such as `tcpdump` or `wireshark`/`tshark`.

For our analysis, we will be working with **PCAP files**.  
A wide range of command-line tools are available for analyzing network captures, including:
- capinfos (Wireshark) – Provides metadata about PCAP files (packet count, duration, etc.).
- mergecap (Wireshark) – Merges multiple PCAP files into one.
- editcap (Wireshark) – Edits and filters packets within a PCAP file.
- tcpdump – Displays and filters packet data from a capture file.
- ipsumdump – Summarizes network traffic for analysis.
- tshark – A powerful packet analyzer with extensive filtering and parsing capabilities.
- tcpflow – Reconstructs TCP flows from a capture (two versions exist with different capabilities).
- ngrep – A grep-like tool for searching packet data.
- yaf – Parses and processes network flows.
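
Before going further, it can help to check which of these tools are actually installed on your machine. The following is a minimal sketch using only the standard library; the `TOOLS` tuple and the `find_tools` helper are names chosen for this illustration and are not part of the workshop code.

```python
import shutil

# Hypothetical helper for this workshop: the names come from the list above
TOOLS = ('capinfos', 'mergecap', 'editcap', 'tcpdump', 'ipsumdump',
         'tshark', 'tcpflow', 'ngrep', 'yaf')


def find_tools(names=TOOLS) -> dict:
    """Map each tool name to its executable path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_tools().items():
    print(f'{name}: {path or "not found"}')
```

Only `tshark` is required for the rest of this exercise.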

#### 2. PyMISP

Our ultimate goal is to **structure and share** the information we extract from network packets in MISP. But manual encoding of the extracted data into MISP would be tedious and error-prone, which is why automation is essential.

PyMISP, the official Python library for MISP, provides a powerful way to interact with the platform programmatically. It allows us to create, enrich, and query events, ensuring a seamless flow of extracted intelligence from our analysis tools into MISP. By leveraging PyMISP in a Python script, we can automate the entire encoding process, transforming raw network data into actionable threat intelligence with minimal manual effort.

In this exercise, we will explore how to use PyMISP to automate this workflow efficiently.

### B) Exercise description

We will use **`tshark`**, the command-line counterpart of Wireshark, which offers the same filtering capabilities as the graphical interface, and automate the packet parsing with some Python code.

#### 0. Preliminary step - Gather our dataset and declare some variables

Let's download PCAP files that are publicly available.

With your favourite browser, visit the latest *malware-traffic-analysis.net* blog posts from [2025](https://malware-traffic-analysis.net/2025/index.html) and download some of the latest example PCAP files, such as:
- [2025-01-31-VIP-Recovery-data-exfil-over-SMTP.pcap.zip](https://malware-traffic-analysis.net/2025/01/31/2025-01-31-VIP-Recovery-data-exfil-over-SMTP.pcap.zip)
- [2025-02-10-StrelaStealer-infection-traffic.pcap.zip](https://malware-traffic-analysis.net/2025/02/10/2025-02-10-StrelaStealer-infection-traffic.pcap.zip)

Those zip files are password-protected. The password is *infected_YYYYMMDD*, where the date is the one mentioned in the file name.
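
The password rule above can be turned into a small helper. This is a minimal sketch, assuming that file names always start with a `YYYY-MM-DD` date and that the archives use the legacy ZIP encryption that Python's `zipfile` module is able to decrypt (it cannot open AES-encrypted archives). `zip_password` and `extract_pcap` are hypothetical names used for illustration only.

```python
import re
import zipfile
from pathlib import Path


def zip_password(zip_name: str) -> str:
    """Build 'infected_YYYYMMDD' from a file name starting with 'YYYY-MM-DD'."""
    match = re.match(r'(\d{4})-(\d{2})-(\d{2})', Path(zip_name).name)
    if match is None:
        raise ValueError(f'No leading date in file name: {zip_name}')
    return 'infected_' + ''.join(match.groups())


def extract_pcap(zip_path: Path, target_dir: Path) -> None:
    """Extract a password-protected archive into target_dir."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir, pwd=zip_password(zip_path.name).encode())


print(zip_password('2025-02-10-StrelaStealer-infection-traffic.pcap.zip'))
# infected_20250210
```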

**Alternatively**, you can execute the following Python script, which will gather some of the zip files from the website and extract the PCAPs for you:

```bash
# bash
python download_samples.py
```

Now that we have our PCAP files, we can start our analysis and see which relevant information we can extract from the network packets.

```python
import shlex
from pathlib import Path
from pymisp import MISPEvent, MISPObject

data_path = Path.cwd().parent / 'exercises' / 'data'
pcap_file = data_path / '2025-02-10-StrelaStealer-infection-traffic.pcap'

# Generic helper used later to easily generate a tshark command
def define_command(input_file: Path, filters: tuple) -> str:
    param = '-o tcp.relative_sequence_numbers:FALSE -E separator="|"'
    # Each field to extract is passed with its own -e option
    filters_cmd = ' -e '.join(filters)
    tshark = f'tshark -T fields {param} -e {filters_cmd} -Y "!(arp || dhcp)"'
    # Quote the path so that file names containing spaces survive the shell
    return f'{tshark} -r {shlex.quote(str(input_file))}'

misp_event = MISPEvent()
misp_event.info = 'StrelaStealer infection traffic'
```

With `define_command`, we set a few parameters for our `tshark` command, including:
- `-o tcp.relative_sequence_numbers:FALSE` to display absolute sequence numbers rather than relative ones
- `-E separator="|"` to separate the output fields with `|`, so that extracted text containing `,` is not split incorrectly by our Python code
- `-Y "!(arp || dhcp)"` to exclude ARP and DHCP packets from the results
- `-T fields` to print only selected fields, each of them specified with its own `-e` option
- `-r` followed by the PCAP file name
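
The same options can also be assembled as a list of arguments instead of a single string, which avoids shell quoting issues altogether. The sketch below illustrates that alternative; `define_args` is a hypothetical variant and is not used in the rest of the workshop.

```python
from pathlib import Path


def define_args(input_file: Path, fields: tuple) -> list:
    """Build the tshark invocation as an argument list (no shell needed)."""
    args = [
        'tshark', '-T', 'fields',
        '-o', 'tcp.relative_sequence_numbers:FALSE',
        '-E', 'separator=|',
    ]
    for field in fields:
        args += ['-e', field]  # one -e option per field to print
    args += ['-Y', '!(arp || dhcp)', '-r', str(input_file)]
    return args


print(define_args(Path('capture.pcap'), ('ip.src', 'ip.dst')))
```

Such a list can be passed to `subprocess.run(args, capture_output=True)` without `shell=True`.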

#### 1. Extract network connection information

As a first step, we will store the information for every connection between a source and a destination. Useful fields for this kind of parsing include:
- `frame.time_epoch` - timestamp of the packet
- `ip.src` - source IP address
- `ip.dst` - destination IP address
- `tcp.srcport` / `udp.srcport` - source port
- `tcp.dstport` / `udp.dstport` - destination port
- `frame.protocols` - protocol stack of the packet, listed from the lowest layer up

```python
# tshark fields to extract (each one is passed with -e)
standard_filters = (
    'frame.time_epoch', 'ip.src', 'ip.dst', 'tcp.srcport', 'tcp.dstport',
    'udp.srcport', 'udp.dstport', 'frame.protocols'
)

tshark_cmd = define_command(pcap_file, standard_filters)
print(tshark_cmd)
```

tshark -T fields -o tcp.relative_sequence_numbers:FALSE -E separator="|" -e frame.time_epoch -e ip.src -e ip.dst -e tcp.srcport -e tcp.dstport -e udp.srcport -e udp.dstport -e frame.protocols -Y "!(arp || dhcp)" -r /Users/chrisr3d/git/MISP/automation4MISP/exercises/data/2025-02-10-StrelaStealer-infection-traffic.pcap


We can then run this command with `subprocess.Popen`, which starts it as a child process, as if we were executing the command directly from our terminal.

The idea is then to read the standard output through `proc.stdout`, where `readlines` returns a list in which each element is one line of that output:

```python
import subprocess

proc = subprocess.Popen(tshark_cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# We read the results of the tshark command from the standard output
for line in proc.stdout.readlines():
    print(line)
```

b'1739231378.249859000|10.2.10.101|193.143.1.205|50087|80|||eth:ethertype:ip:tcp\n'
b'1739231378.457569000|193.143.1.205|10.2.10.101|80|50087|||eth:ethertype:ip:tcp\n'
b'1739231378.457863000|10.2.10.101|193.143.1.205|50087|80|||eth:ethertype:ip:tcp\n'
b'1739231378.460886000|10.2.10.101|193.143.1.205|50087|80|||eth:ethertype:ip:tcp:http\n'
b'1739231378.460982000|193.143.1.205|10.2.10.101|80|50087|||eth:ethertype:ip:tcp\n'
b'1739231378.784770000|193.143.1.205|10.2.10.101|80|50087|||eth:ethertype:ip:tcp\n'
b'1739231378.784788000|193.143.1.205|10.2.10.101|80|50087|||eth:ethertype:ip:tcp\n'
b'1739231378.784816000|193.143.1.205|10.2.10.101|80|50087|||eth:ethertype:ip:tcp\n'
b'1739231378.784996000|10.2.10.101|193.143.1.205|50087|80|||eth:ethertype:ip:tcp\n'
b'1739231378.786239000|193.143.1.205|10.2.10.101|80|50087|||eth:ethertype:ip:tcp\n'
b'1739231378.786252000|193.143.1.205|10.2.10.101|80|50087|||eth:ethertype:ip:tcp\n'
b'1739231378.786262000|193.143.1.205|10.2.10.101|80|50087|||eth:etherty

As you can see, the resulting lines returned from the terminal process are byte strings.

To use this output, we still need to process each line with:
- `decode`, to decode the byte string into a regular string
- `strip('\n')`, to remove the trailing newline character
- `split('|')`, to decompose the line into fields, based on the separator we chose previously (`|`)
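
Applied to the first line of the output shown above, these three steps look as follows. This is a short illustration; the variable names are chosen for the example.

```python
raw_line = b'1739231378.249859000|10.2.10.101|193.143.1.205|50087|80|||eth:ethertype:ip:tcp\n'

# bytes -> str, drop the trailing newline, then split on our separator
fields = raw_line.decode().strip('\n').split('|')
timestamp, ip_src, ip_dst, ts_port, td_port, us_port, ud_port, protocols = fields

print(fields)
print(protocols.split(':'))
```

The two UDP port fields are empty strings here because this packet is a TCP packet.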

```python
layer3_protocols = (
    'arp', 'icmp', 'icmpv6', 'ip', 'ipv6'
)
layer4_protocols = (
    'tcp', 'udp'
)
layer7_protocols = (
    'dhcp', 'dns', 'ftp', 'http', 'ntp', 'smtp', 'snmp', 'ssdp', 'tftp'
)

# We define a function that keeps at most one protocol per layer (3, 4 and 7)
def handle_protocols(frame_protocols: str) -> list:
    protocols = set(frame_protocols.split(':'))
    protocol_key = []
    for layer_protocols in (layer3_protocols, layer4_protocols, layer7_protocols):
        for protocol in layer_protocols:
            if protocol in protocols:
                protocol_key.append(protocol)
                break
    return protocol_key

def store_connections(cmd: str) -> dict:
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    connections = {}
    for line in proc.stdout.readlines():
        # Decompose the line into variables
        timestamp, ip_src, ip_dst, ts_port, td_port, us_port, ud_port, protocols = line.decode().strip('\n').split('|')

        # We store the connection information in a tuple,
        # using the TCP ports when present and the UDP ports otherwise
        key = (
            ip_src, ip_dst,
            ts_port if ts_port else us_port,
            td_port if td_port else ud_port,
            *handle_protocols(protocols)
        )

        if key not in connections:
            connections[key] = {
                'first_seen': float('inf'),
                'counter': 0
            }
        # Keep the earliest timestamp and count the packets per connection
        timestamp = float(timestamp)
        if timestamp < connections[key]['first_seen']:
            connections[key]['first_seen'] = timestamp
        connections[key]['counter'] += 1

    return connections

connections = store_connections(tshark_cmd)
print(connections)
```


{('10.2.10.101', '193.143.1.205', '50087', '80', 'ip', 'tcp'): {'first_seen': 1739231378.249859, 'counter': 17}, ('193.143.1.205', '10.2.10.101', '80', '50087', 'ip', 'tcp'): {'first_seen': 1739231378.457569, 'counter': 91}, ('10.2.10.101', '193.143.1.205', '50087', '80', 'ip', 'tcp', 'http'): {'first_seen': 1739231378.460886, 'counter': 1}, ('193.143.1.205', '10.2.10.101', '80', '50087', 'ip', 'tcp', 'http'): {'first_seen': 1739231379.584436, 'counter': 1}, ('10.2.10.101', '193.143.1.205', '50088', '8888', 'ip', 'tcp'): {'first_seen': 1739231379.824056, 'counter': 4}, ('193.143.1.205', '10.2.10.101', '8888', '50088', 'ip', 'tcp'): {'first_seen': 1739231380.032305, 'counter': 2}, ('10.2.10.101', '193.143.1.205', '50088', '8888', 'ip', 'tcp', 'http'): {'first_seen': 1739231380.032618, 'counter': 1}, ('193.143.1.205', '10.2.10.101', '8888', '50088', 'ip', 'tcp', 'http'): {'first_seen': 1739231380.34757, 'counter': 1}, ('10.2.10.101', '193.143.1.205', '50089', '8888', 'ip', 'tcp'): {'firs

In the previous code snippet, we extracted connection information (source and destination IP addresses, source and destination ports, and protocols) and combined it in a tuple used as a key to "fingerprint" a connection. As the output shows, this key is directional: the two directions of the same exchange are stored as separate connections.

Every packet belonging to the same connection then updates the `first_seen` value if its timestamp is earlier, and increments a counter that tracks the number of packets exchanged over that connection.
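
The `first_seen` values are epoch timestamps stored as floats. When reviewing the results, it may be convenient to display them as dates. The following is a minimal sketch using the standard library; `epoch_to_utc` is a hypothetical helper, and the sample value is taken from the output above.

```python
from datetime import datetime, timezone


def epoch_to_utc(timestamp: float) -> datetime:
    """Convert an epoch timestamp (in seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


first_seen = epoch_to_utc(1739231378.249859)
print(first_seen.isoformat())
```

The resulting date falls on 2025-02-10 (UTC), which matches the date in the PCAP file name.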

Now, for each connection, we want to create a `network-connection` MISP object and add the different values we stored as attributes.

```python
CONNECTION_OBJECT_RELATIONS = ('ip-src', 'ip-dst', 'src-port', 'dst-port')

for connection, values in connections.items():
    misp_object = MISPObject('network-connection')
    # The first four elements of the key are the IP addresses and ports
    for value, relation in zip(connection[:4], CONNECTION_OBJECT_RELATIONS):
        if value:
            misp_object.add_attribute(relation, value)
    # The remaining elements are the protocols, one per layer at most
    for protocol in connection[4:]:
        layer = 3 if protocol in layer3_protocols else 4 if protocol in layer4_protocols else 7
        misp_object.add_attribute(f'layer{layer}-protocol', protocol.upper())
    misp_object.add_attribute('first-packet-seen', values['first_seen'])
    misp_object.add_attribute('count', values['counter'])

    print(misp_object)
    for attribute in misp_object.attributes:
        print(f' - {attribute.object_relation}: {attribute.value}')
```

<MISPObject(name=network-connection)
 - ip-src: 10.2.10.101
 - ip-dst: 193.143.1.205
 - src-port: 50087
 - dst-port: 80
 - layer3-protocol: IP
 - layer4-protocol: TCP
 - first-packet-seen: 1739231378.249859
 - count: 17
<MISPObject(name=network-connection)
 - ip-src: 193.143.1.205
 - ip-dst: 10.2.10.101
 - src-port: 80
 - dst-port: 50087
 - layer3-protocol: IP
 - layer4-protocol: TCP
 - first-packet-seen: 1739231378.457569
 - count: 91
<MISPObject(name=network-connection)
 - ip-src: 10.2.10.101
 - ip-dst: 193.143.1.205
 - src-port: 50087
 - dst-port: 80
 - layer3-protocol: IP
 - layer4-protocol: TCP
 - layer7-protocol: HTTP
 - first-packet-seen: 1739231378.460886
 - count: 1
<MISPObject(name=network-connection)
 - ip-src: 193.143.1.205
 - ip-dst: 10.2.10.101
 - src-port: 80
 - dst-port: 50087
 - layer3-protocol: IP
 - layer4-protocol: TCP
 - layer7-protocol: HTTP
 - first-packet-seen: 1739231379.584436
 - count: 1
<MISPObject(name=network-connection)
 - ip-src: 10.2.10.101
 - ip-dst: 1