# Extract LSBs 

In this notebook, we extract the LSBs of stego images to look for similarities in their payloads.

In [1]:
%pip install pillow tqdm ipywidgets notebook asyncstdlib
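# On Notebook 7 and later the `nbextension` subcommand no longer exists, so this line can be skipped there.
!jupyter nbextension enable --py widgetsnbextension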
from PIL import Image, ImageChops
from pathlib import Path
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
from asyncstdlib import itertools as ait
from asyncstdlib import functools as afn

Note: you may need to restart the kernel to use updated packages.
Jupyter command `jupyter-nbextension` not found.



## Define constants

In [2]:
STEGOAPPDB_PATH = Path('../datasets/StegoAppDB_stegos_20240309-030352')
INFO_FILE = STEGOAPPDB_PATH / 'StegoAppDB_stegos_20240309-030352_stego_directory.csv'
COVERS_PATH = STEGOAPPDB_PATH / 'covers'
STEGOS_PATH = STEGOAPPDB_PATH / 'stegos'
METHOD_COLUMN = 'embedding_method'
STEGO_COLUMN = 'image_filename'
COVER_COLUMN = 'cover_image_filename'

## Collect stego images

We will read the info file to collect all embedding methods and the stego images that were generated using them.

### Gather embedding methods

The embedding methods can be found in the `METHOD_COLUMN` of the info file.

In [3]:
info_file = pd.read_csv(INFO_FILE)
embedding_methods = info_file[METHOD_COLUMN].unique()
print(f'Found the following embedding methods: {", ".join(embedding_methods)}')

Found the following embedding methods: MobiStego, PixelKnot, PocketStego, Pictograph, SteganographyM, Passlok


### Collect stego images

Now, for each embedding method, we will collect the stego images that were generated with it.

In [4]:
def collect_stego_images(embedding_method: str):
    return info_file[info_file[METHOD_COLUMN] == embedding_method][STEGO_COLUMN]


stego_images_by_method = {method: STEGOS_PATH / collect_stego_images(method) for method in embedding_methods}
{method: len(stego_images) for method, stego_images in stego_images_by_method.items()}

{'MobiStego': 3060,
 'PixelKnot': 3060,
 'PocketStego': 3060,
 'Pictograph': 4800,
 'SteganographyM': 3060,
 'Passlok': 1530}
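
The per-method counts can also be obtained with a single `groupby` instead of filtering once per method. The following is a minimal sketch on a toy DataFrame; `toy_info` and its file names are invented for illustration and only the column names mirror `METHOD_COLUMN` and `STEGO_COLUMN`.

```python
import pandas as pd

# Hypothetical stand-in for the info file; the file names are made up.
toy_info = pd.DataFrame({
    'embedding_method': ['MobiStego', 'PixelKnot', 'MobiStego'],
    'image_filename': ['a.png', 'b.jpg', 'c.png'],
})

# Count the stego images of every embedding method in one pass.
toy_counts = toy_info.groupby('embedding_method')['image_filename'].count().to_dict()
print(toy_counts)
```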

## Try to detect a signature

The next step is to try to detect a signature in the payloads corresponding to the embedding methods.
 
### Extract the payloads

First, we will extract the LSBs of the stego images.
Embedders differ in bit order: some fill each payload byte starting from its most significant bit (MSB), others from its least significant bit (LSB).
The extraction functions therefore come in two variants whose names end in `msb` and `lsb`, respectively.
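
To make the difference between the two bit orders concrete, here is a small worked example in plain Python. `pack_bits` is a hypothetical helper written for this illustration, not one of the notebook's functions; it packs the same eight extracted bits once MSB-first and once LSB-first.

```python
def pack_bits(bits, direction='msb'):
    """Pack a list of 0/1 values into bytes, 8 bits per byte (illustrative helper)."""
    out = []
    # Ignore trailing bits that do not fill a whole byte.
    for start in range(0, len(bits) - len(bits) % 8, 8):
        value = 0
        for i, bit in enumerate(bits[start:start + 8]):
            # MSB-first: the first bit lands in position 7; LSB-first: in position 0.
            shift = 7 - i if direction == 'msb' else i
            value |= bit << shift
        out.append(value)
    return bytes(out)


lsbs = [0, 1, 0, 0, 0, 0, 0, 1]  # LSBs read from eight consecutive samples
print(pack_bits(lsbs, 'msb'))    # b'A'     (0b01000001)
print(pack_bits(lsbs, 'lsb'))    # b'\x82'  (0b10000010)
```

The same bit stream yields a readable character in one order and an unrelated byte in the other, which is why both orders have to be tried when the embedder's convention is unknown.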

Furthermore, when the number of LSBs used per sample is 1 or another power of two (2, 4 or 8), it divides a byte evenly.
The payload can then be assembled with a few vectorized bitwise OR operations over the whole array instead of a per-sample loop, which is what the implementations with `opt` in their name do.
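
As a sanity check on the power-of-two case, the sketch below shows which bit counts tile a byte exactly and how, for a single LSB per sample, NumPy's `np.packbits` performs the same packing in one call. `divides_byte` and the `samples` array are assumptions made for this example.

```python
import numpy as np


def divides_byte(bits: int) -> bool:
    """True if `bits` LSBs per sample tile a byte exactly (1, 2, 4 or 8)."""
    return 1 <= bits <= 8 and 8 % bits == 0


# Hypothetical sample values; only their lowest bit matters here.
samples = np.array([0x10, 0x21, 0x30, 0x40, 0x50, 0x60, 0x70, 0x81], dtype=np.uint8)

# For one LSB per sample, packbits packs eight samples into one byte.
msb_first = np.packbits(samples & 1, bitorder='big')     # first sample -> bit 7
lsb_first = np.packbits(samples & 1, bitorder='little')  # first sample -> bit 0

print([b for b in range(1, 9) if divides_byte(b)])  # [1, 2, 4, 8]
print(msb_first, lsb_first)                         # [65] [130]
```

`np.packbits` can serve as an independent cross-check for the one-bit variants; for 2, 4 or 8 bits per sample a shift-and-OR loop is still needed.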

Finally, we operate asynchronously to speed up the extraction process.
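
One common way to run blocking per-image work concurrently is `asyncio.to_thread` combined with `asyncio.gather`. The sketch below illustrates only that pattern; `slow_square` is a hypothetical stand-in for a blocking extraction call, and whether threads give a real speed-up depends on how much of the work releases the GIL.

```python
import asyncio


def slow_square(x: int) -> int:
    # Hypothetical stand-in for a blocking per-image extraction.
    return x * x


async def run_all(values):
    # to_thread moves each blocking call to a worker thread;
    # gather awaits them concurrently and preserves the input order.
    return await asyncio.gather(*(asyncio.to_thread(slow_square, v) for v in values))


results = asyncio.run(run_all([1, 2, 3]))
print(results)  # [1, 4, 9]
```

Inside a notebook an event loop is already running, so `await run_all([1, 2, 3])` would be used in place of `asyncio.run(...)`.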

In [6]:
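def _extract_bits_opt_lsb(data, bits: int):
    # Vectorized path for bits in {1, 2, 4, 8}: `div` samples fill one payload byte.
    div = 8 // bits
    n = len(data) // div  # drop trailing samples that do not fill a whole byte
    message = np.zeros(n, dtype=np.uint8)
    mask = (1 << bits) - 1
    for i in range(div):
        shift = bits * i  # the first sample fills the lowest bits
        message |= (data[i::div][:n] & mask) << shift
    return message


def _extract_bits_opt_msb(data, bits: int):
    # Same as above, but the first sample fills the highest bits of each byte.
    div = 8 // bits
    n = len(data) // div  # drop trailing samples that do not fill a whole byte
    message = np.zeros(n, dtype=np.uint8)
    mask = (1 << bits) - 1
    for i in range(div):
        shift = 8 - bits - (bits * i)
        message |= (data[i::div][:n] & mask) << shift
    return message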


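def _extract_bits_lsb(data, bits: int):
    # Generic LSB-first path for bit counts that do not divide 8 (e.g. 3).
    msg_byte = 0
    shift = 0
    message = []
    mask = (1 << bits) - 1
    for byte in data:
        # int() avoids uint8 overflow when shifting NumPy scalars
        msg_byte |= (int(byte) & mask) << shift
        shift += bits
        if shift >= 8:
            tmp = msg_byte >> 8  # bits that spill into the next payload byte
            message.append(msg_byte & 0xFF)
            msg_byte = tmp
            shift -= 8
    return np.array(message, dtype=np.uint8)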


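def _extract_bits_msb(data, bits: int):
    # Generic MSB-first path for bit counts that do not divide 8 (e.g. 3).
    acc = 0    # bit accumulator; the newest bits enter at the low end
    nbits = 0  # number of valid bits currently held in acc
    message = []
    mask = (1 << bits) - 1
    for byte in data:
        acc = (acc << bits) | (int(byte) & mask)
        nbits += bits
        if nbits >= 8:
            nbits -= 8
            message.append((acc >> nbits) & 0xFF)  # emit the oldest 8 bits
            acc &= (1 << nbits) - 1                # keep the leftover bits
    return np.array(message, dtype=np.uint8)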


_COL_MAP = { 'R': 0, 'G': 1, 'B': 2, 'A': 3 }


def _load_image(img_path: Path, convert_mode='RGB', channels=None):
    # `channels` is an optional string or sequence such as 'RGB' or 'RB'.
    channels = [*channels] if channels else []
    if 'A' in channels:
        convert_mode = 'RGBA'

    with Image.open(img_path) as img:
        arr = np.array(img.convert(convert_mode))

    # Select a channel subset only when fewer channels than available were requested.
    if (convert_mode == 'RGB' and 0 < len(channels) < 3) or (convert_mode == 'RGBA' and 0 < len(channels) < 4):
        arr = arr[..., [_COL_MAP[c] for c in channels]]
    return arr.reshape(-1)

def extract_message(img_path: Path, bits: int, direction='msb', convert_mode='RGB', channels=None):
    data = _load_image(img_path, convert_mode, channels)
    if 8 % bits == 0:  # 1, 2, 4 or 8 bits tile a byte exactly
        if direction == 'msb':
            return _extract_bits_opt_msb(data, bits)
        else:
            return _extract_bits_opt_lsb(data, bits)
    else:
        if direction == 'msb':
            return _extract_bits_msb(data, bits)
        else:
            return _extract_bits_lsb(data, bits)


import asyncio


async def extract_messages(images, bits: int = 1, direction='msb', embedding_method=None, convert_mode='RGB', channels=None):
    if hasattr(images, '__aiter__'):
        images = [img async for img in images]
    elif hasattr(images, '__iter__'):
        images = list(images)
    else:
        raise ValueError('images must be an iterable or an async iterable')

    for img in tqdm(images,
                    desc=f'Extracting {bits}-LSBs' + (f' for {embedding_method}' if embedding_method else '') + (f' with {direction.upper()} direction' if direction else '')):
        # extract_message is blocking and returns an array, which cannot be awaited directly;
        # run it in a worker thread so the event loop stays responsive.
        yield img, await asyncio.to_thread(extract_message, img, bits, direction, convert_mode, channels)

b'\x00\x00\x00.\x9fb*\x89Mi?\xeb\xba2\xfe,m\xaf\xb9\xbb#\r\xacX\x11\x10`\xedv\x18\x9a\xbf\x8c\xde\x92\x8a\xfa\x90\xbc\xce\xf1\x15\x8e+n\xfe\xa9\xd5\xf7~\x8bwm+^\t)\xa3o?\xa7\x1c\x13<\xf7\xfe\xf6\xcer\xc9r\xf9 \xf9\xa24-{\xfc7 ,\x86\x9f\xba-E\x7f0_gu\xfd\x1f\x93\x0f\xd5\xbf\x96\xd5\x041\x9b\x00\xbb\x99\x0b\xa8\xb8 \x94\x7f\xae(4\xbd\x82\xc2}\x85\xfd\xa5\xa0\xe6\x9b\xcc7\x00\x7fo\x97\xecv\xf6\xed\x16\x80\xaa\xe8\x9b\xe6\xbfm{\xe11\x8c\x17\xec/\xd6\xb5\x03\xee\x8a\x91n\xee\x8bWi\x0f}\xa3\x89\xffM\x8f\x16\x9b\x10\xf9-Sf\xcd\xb9:MKc\xe5\xbem\xdd\x01\xa9[]6\x7f\xba\x00\x00\x10\x1e\xd2\x8a\xfd\xc5\x8e&U\xe6\xdb\xbf\x8f,7j#\xca\xa2\x8e\xbb\xf2\xd3\xb6\x98\xedZ\x04\xed\xcb\xd6\xe3\xab4\xbb\x93\xa1\x07\x90\x168\xadK\x18\\\\\x1d^E2}D67Y\xe9\xf5D\\\xd4\x95\xcaF4N\x12\x03\x989\x8b\x80\xc3\x8fm\xab\x91*\xe7\x0by\xc6\xbf\x9f\x9d8\x8f\x86\xfeg\x7f\x9dv\nU\xbfZ`Q\xd4\x18\xd0&\x17F\xb4m\xf7g\xdc\xae\xbf\xf8\xb4\xbaH;\x12\xd0\xd9\xd3\xc9\xa80\xb0\x17\xc4`^\x86\xef\xa6{esa\x9au\xa2\xeb\xf9\xd2\xb8\xdb\xec