# Detect the payload size for LSB steganography

This notebook presents several methods for detecting the payload size.
The payload size is the amount of data that was hidden in a stego image
using LSB steganography.

Three methods are presented; which one applies depends on the information that is available:
- Known message attack: If the original message is known, the payload size can be determined by locating the message in the extracted payload. This is the most reliable method, but it may take a while if the number of LSBs used is high.
- Known stego image attack: If the stego image is known, the payload size can be estimated by checking how the file size changes and using the correlation between the payload size and the file size.
- Statistical attack: With RS analysis, the payload size can be estimated by analyzing the distribution of the pixel values.
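
As a minimal sketch of the known message attack, the following toy example hides a message in the least significant bits of a random array, extracts the LSB stream again, and searches it for the known message. The helpers `embed_lsb` and `extract_lsb`, the 3-byte header, and the random cover are assumptions made for illustration; they are not the extraction functions from the Extract LSBs notebook.

```python
import numpy as np


def embed_lsb(pixels: np.ndarray, payload: bytes) -> np.ndarray:
    """Write the payload bits into the least significant bit of each value."""
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
    flat = pixels.flatten()  # flatten() returns a copy, so the cover stays unchanged
    if bits.size > flat.size:
        raise ValueError('payload does not fit into the cover')
    flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits
    return flat.reshape(pixels.shape)


def extract_lsb(pixels: np.ndarray) -> bytes:
    """Collect the least significant bit of every value and pack the bits into bytes."""
    return np.packbits(pixels.flatten() & 1).tobytes()


rng = np.random.default_rng(0)
cover = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)  # toy cover, not a real image

message = b'known plaintext'
payload = b'\x00\x00\x0f' + message  # hypothetical 3-byte header followed by the message
stego = embed_lsb(cover, payload)

extracted = extract_lsb(stego)
offset = extracted.find(message)
print(offset)  # 3
```

In this toy case, the offset plus the message length gives a lower bound of 18 bytes for the payload size. Note that `bytes.find` only matches when the message starts on a byte boundary of the extracted stream.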

## Initialization

First, we import the extraction functions from the [Extract LSBs](./extract-lsbs.ipynb) notebook.

In [1]:
from tqdm.notebook import tqdm
from pathlib import Path

In [2]:
%run extract-lsbs.ipynb

In [3]:
async def for_each_image(func):
    """Call func once per (method, bits, direction) combination and yield each result.

    func receives the stego images of the method and the combination tuple.
    """
    for method, stego_images in stego_images_by_method.items():
        for bits in [1, 2, 4]:
            for direction in ['msb', 'lsb']:
                yield await func(stego_images, (method, bits, direction))

## Known message attack

In [5]:
EMBEDDED_MESSAGES_DIR = Path('./data/embedded_messages')


async def save_embedded_messages(base_dir=EMBEDDED_MESSAGES_DIR):
    """Extract the payload of every stego image once and cache it as a file under base_dir."""
    async def handler(stego_images, path_parts):
        method, bits, direction = path_parts
        sub_dir = base_dir / method / f'ls{bits}b' / direction
        if sub_dir.exists():
            return

        sub_dir.mkdir(parents=True, exist_ok=True)
        async for stego_img, msg in extract_messages(stego_images, bits, direction, method):
            img_name = stego_img.stem
            msg_file = sub_dir / f'{img_name}.txt'
            if msg_file.exists():
                continue

            msg_file.write_bytes(msg.tobytes())

    _ = [_ async for _ in for_each_image(handler)]


await save_embedded_messages()

In [6]:
MESSAGE_DIR = Path('../datasets/StegoAppDB_stegos_20240309-030352/message_dictionary')


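def find_nth_substring(haystack, needle, n):
    """Return the index of the n-th (1-based) occurrence of needle, or -1 if there is none."""
    if n < 1:
        # Nothing to skip, e.g. a message starting on the first line
        # (assuming message_starting_index is 1-based).
        return -1
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start + len(needle))
        n -= 1
    return start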


async def get_original_message(stego_img):
    img_row = info_file[info_file['image_filename'] == stego_img.name]
    msg_name = img_row['message_dictionary'].values[0]
    starting_line_index = img_row['message_starting_index'].values[0]
    msg_len = img_row['message_length'].values[0]
    full_msg = (MESSAGE_DIR / msg_name).read_text()
    start_index = find_nth_substring(full_msg, '\n', starting_line_index - 1) + 1
    return full_msg[start_index:start_index + msg_len].encode('utf-8')


async def get_embedded_message(stego_img, method, bits, direction):
    return (EMBEDDED_MESSAGES_DIR / method / f'ls{bits}b' / direction / f'{stego_img.stem}.txt').read_bytes()


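async def detect_used_method_and_bits(stego_images, path_parts):
    """Search each cached payload for the original message of its stego image.

    Returns the share of images in which the message was found, together with the
    distinct (method, bits, direction, offset) matches: a single tuple, a set of
    tuples, or None if nothing matched.
    """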
    method, bits, direction = path_parts
    results = []
    for stego_img in tqdm(stego_images, desc=f'Cycling through {method} {bits} {direction}'):
        original_msg = await get_original_message(stego_img)
        extracted_msg = await get_embedded_message(stego_img, method, bits, direction)
        index = extracted_msg.find(original_msg)
        if index != -1:
            results.append((method, bits, direction, index))

    rate = len(results) / len(stego_images)
    results = set(results)
    if len(results) == 1:
        return rate, results.pop()
    elif len(results) > 1:
        return rate, results
    else:
        return rate, None


detected_used_method_and_bits = [(rate, values) async for rate, values in for_each_image(detect_used_method_and_bits)]
detected_used_method_and_bits

Progress output: one bar per combination of method, bit count, and direction, covering 3060 images each for MobiStego, PixelKnot, and PocketStego, 4800 for Pictograph, 3060 for SteganographyM, and 1530 for Passlok.

[(0.0, None),
 (0.0, None),
 (0.9993464052287582, ('MobiStego', 2, 'msb', 3)),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None),
 (0.0, None)]
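
Because the result list carries no labels for the entries without a match, it can help to pair every position with the combination that produced it. The sketch below reproduces the nesting order of `for_each_image` with `itertools.product`. The method order is taken from the notebook's progress output and is assumed to match the iteration order of `stego_images_by_method`; `label_results` and `toy_results` are hypothetical names used only for illustration.

```python
from itertools import product

# Method order as reported by the notebook's progress output (assumption:
# it matches the iteration order of stego_images_by_method).
methods = ['MobiStego', 'PixelKnot', 'PocketStego', 'Pictograph', 'SteganographyM', 'Passlok']

# for_each_image nests method -> bits -> direction, which product() reproduces.
combinations = list(product(methods, [1, 2, 4], ['msb', 'lsb']))


def label_results(results):
    """Pair each (rate, match) entry with the combination that produced it."""
    return [(combo, rate, match) for combo, (rate, match) in zip(combinations, results)]


# Toy stand-in for the printed list: only the third entry has a non-zero rate.
toy_results = [(0.0, None)] * 36
toy_results[2] = (0.9993464052287582, ('MobiStego', 2, 'msb', 3))

hits = [entry for entry in label_results(toy_results) if entry[1] > 0]
print(hits)
```

Read this way, the only non-zero entry belongs to the MobiStego images extracted with 2 bits in `msb` direction, and the last element of the match tuple is the byte offset at which the original message was found in the extracted payload.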

### Overlap the payloads

To detect a signature, we overlap the payloads of the stego images with a bitwise AND, so that only bits set in every payload survive.
This naive approach can only detect a leading signature in the payloads.
Detecting a signature at the end of the payloads requires the payload length,
which can be estimated with, e.g., RS analysis.

After collecting the payloads and combining them with the bitwise AND, we
strip all surrounding zeros to isolate the signature.

In [None]:
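import asyncstdlib as afn  # assumed alias; may already be provided by extract-lsbs.ipynb
import numpy as np


def _overlap_message(acc, msg):
    """AND two payload arrays after resizing the larger one to the shape of the smaller one."""
    if acc.shape != msg.shape:
        acc, msg = (acc, np.resize(msg, acc.shape)) if acc.size < msg.size else (np.resize(acc, msg.shape), msg)
    return np.bitwise_and(acc, msg)


async def extract_leading_sig(stego_images, bits: int = 1, direction='msb', embedding_method=None):
    # extract_messages yields (stego_img, msg) pairs, as in save_embedded_messages; only msg is needed.
    messages = (msg async for _, msg in extract_messages(stego_images, bits, direction, embedding_method))
    reduced_msg = await afn.reduce(_overlap_message, messages)
    return np.trim_zeros(reduced_msg.ravel()).tobytes()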


leading_signatures = {}
for method, stego_images in stego_images_by_method.items():
    leading_signatures[method] = {
        'MSB': {}, 'LSB': {}
    }
    for bits in [1, 2, 4]:
        leading_signatures[method]['MSB'][bits] = await extract_leading_sig(stego_images, bits, 'msb', method)
        leading_signatures[method]['LSB'][bits] = await extract_leading_sig(stego_images, bits, 'lsb', method)

leading_signatures
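
The overlap step can be illustrated on toy data without any images. The following sketch is a simplified, synchronous stand-in for the cell above: it uses `functools.reduce` instead of the async reduce and slicing instead of `np.resize`. The signature `b'SIG!'` and the payload bodies are made-up values, chosen so that the bodies cancel out under the AND.

```python
from functools import reduce

import numpy as np


def overlap(acc: np.ndarray, msg: np.ndarray) -> np.ndarray:
    """AND two byte arrays after cutting both to the shorter length."""
    n = min(acc.size, msg.size)
    return np.bitwise_and(acc[:n], msg[:n])


signature = b'SIG!'  # hypothetical leading signature shared by all payloads
bodies = [b'\x0f\xf0\xaa', b'\xf0\x0f\x55\x00', b'\x00\xff']  # bodies whose AND is all zeros
payloads = [np.frombuffer(signature + body, dtype=np.uint8) for body in bodies]

common = reduce(overlap, payloads)
leading_signature = np.trim_zeros(common).tobytes()
print(leading_signature)  # b'SIG!'
```

With real payloads, the bytes after the signature only tend towards zero as more messages are combined, so a few spurious bits may remain. Since `np.trim_zeros` removes zero bytes at both ends, a signature that itself starts or ends with a zero byte would be shortened by this approach.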