# Prepare ImageNet mini

Do not run this notebook from top to bottom in one go. After step 1, the stego images must be created outside the notebook with the S-UNIWARD algorithm before continuing with step 3.

## Constants & imports

In [2]:
import glob
import os
import cv2
import numpy as np
import random

from shutil import copyfile
from PIL import Image, ImageOps

In [None]:
# This is the directory in which the final data set will be stored
OUTPUT_DIR = 'imagenet_0.4_256x256'

## 1. Get a subsample of the images

We need 14,000 images for the training set, 1,000 for the validation set and 5,000 for the test set (20,000 in total). These sizes are chosen to resemble the setup SRNet used for training and for comparison with other architectures.

It is recommended to use a subset of ImageNet, since the full data set is very large. The subset used here is [imagenetmini-1000](https://www.kaggle.com/datasets/ifigotin/imagenetmini-1000).

In [3]:
SOURCE_DIR = 'Path to the imagenet-mini folder'
COVER_DIR = 'Path to store the cover images'
TARGET_SIZE = (256, 256)
SEED = 42

# Set the seed
random.seed(SEED)
np.random.seed(SEED)

In [3]:
assert not os.path.isdir(COVER_DIR)
os.mkdir(COVER_DIR)

In [4]:
%%time
# Get the full filename of all the images in the source data set
image_full_filenames = glob.glob(os.path.join(SOURCE_DIR, '*', '*', '*'))

# Select 20000 random images without replacement
sample_full_filenames = list(np.random.choice(image_full_filenames, size=20000, replace=False))

# Resize the images and save them in PGM format.
for index, full_filename in enumerate(sample_full_filenames):
    # Read the image with PIL
    pil_image = Image.open(full_filename)
    
    # Transform the image to grayscale
    pil_image = ImageOps.grayscale(pil_image)

    # Transform the image into a numpy array, resize it and transform it back again to PIL format
    image_array = np.array(pil_image)
    resized_image = cv2.resize(image_array, TARGET_SIZE, interpolation=cv2.INTER_AREA)
    pil_resized_image = Image.fromarray(resized_image)

    # Save the image as PGM, named after its index in the sample
    pil_resized_image.save(os.path.join(COVER_DIR, f'{index}.pgm'))

Wall time: 1min 34s


## 2. Create the stego images

Once the preprocessed cover images are ready, the stego images must be created with a steganographic embedding algorithm (S-UNIWARD in this case). The implementations available on the [Binghamton University website](http://dde.binghamton.edu/download/stego_algorithms/) are recommended for their simplicity.
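
Step 3 copies each stego image under the same file name as its cover, so both directories should contain identical file names before splitting. A minimal sketch of that check, using only the standard library; the helper name `find_unpaired` is hypothetical and not part of the original notebook:

```python
import os


def find_unpaired(cover_dir, stego_dir):
    """Return (covers without a stego, stegos without a cover), both sorted."""
    covers = set(os.listdir(cover_dir))
    stegos = set(os.listdir(stego_dir))
    return sorted(covers - stegos), sorted(stegos - covers)
```

If the embedding step completed for every image, calling it with the cover and stego directories should return two empty lists.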

## 3. Split the whole data set

We now split the images into the set sizes given in section 1. The split is stored on disk so that it can be reused if another model needs to be trained.

In [33]:
# Constants
COVER_DIR = '../../dataset/ImageNet/imagenet_cover_256x256'
STEGO_DIR = '../../dataset/ImageNet/imagenet_stego_0.4_256x256'

# Set the seed again just in case.
random.seed(SEED)
np.random.seed(SEED)

In [29]:
# Get the filenames of the cover images (sorted, because os.listdir returns them in arbitrary order)
cover_filenames = sorted(os.listdir(COVER_DIR))
random.shuffle(cover_filenames)

# Select the images of each image set
train_filenames = cover_filenames[:14000]
val_filenames = cover_filenames[14000:15000]
test_filenames = cover_filenames[15000:]

In [30]:
assert set(test_filenames) & set(val_filenames) & set(train_filenames) == set()
assert set(test_filenames) & set(val_filenames) == set()
assert set(test_filenames) & set(train_filenames) == set()
assert set(val_filenames) & set(train_filenames) == set()
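
The slicing and the disjointness checks above can also be packaged into one function. The following is only a sketch: `split_filenames` is a hypothetical helper, and it uses its own `random.Random(seed)` instance instead of the global seed set earlier, so it will not reproduce the exact split of this notebook.

```python
import random


def split_filenames(filenames, n_train=14000, n_val=1000, seed=42):
    """Shuffle a sorted copy of filenames and cut it into train/val/test lists."""
    # Sorting first makes the result independent of the input order.
    names = sorted(filenames)
    random.Random(seed).shuffle(names)
    train = names[:n_train]
    val = names[n_train:n_train + n_val]
    test = names[n_train + n_val:]  # everything that is left
    return train, val, test


# Small demonstration with 20 made-up file names instead of 20,000.
demo = [f'{i}.pgm' for i in range(20)]
train, val, test = split_filenames(demo, n_train=14, n_val=1)
print(len(train), len(val), len(test))  # prints: 14 1 5
```

Because the three lists are consecutive slices of one shuffled list, they cannot overlap as long as the file names are unique.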

In [34]:
# Generate the folders necessary to train and test.
assert not os.path.isdir(OUTPUT_DIR)
os.mkdir(OUTPUT_DIR)

sets = ['train', 'val', 'test']
for set_name in sets:
    os.mkdir(os.path.join(OUTPUT_DIR, set_name))
    os.mkdir(os.path.join(OUTPUT_DIR, set_name, '0'))
    os.mkdir(os.path.join(OUTPUT_DIR, set_name, '1'))

In [35]:
# Copy the images into their respective folders
def copy_images_stego_cover_repetition_in_set(filenames, set_name, cover_dir, stego_dir):
    
    # Copy all the files into their respective folder
    for image_name in filenames:
        copyfile(os.path.join(cover_dir, image_name), 
                 os.path.join(OUTPUT_DIR, set_name, '0', image_name))
        
        copyfile(os.path.join(stego_dir, image_name), 
                 os.path.join(OUTPUT_DIR, set_name, '1', image_name))

# Execute the splitting
copy_images_stego_cover_repetition_in_set(train_filenames, 'train', COVER_DIR, STEGO_DIR)
copy_images_stego_cover_repetition_in_set(val_filenames, 'val', COVER_DIR, STEGO_DIR)
copy_images_stego_cover_repetition_in_set(test_filenames, 'test', COVER_DIR, STEGO_DIR)

## 4. Transform all the images to PNG

Once the data set is fully assembled, all the images must be converted to PNG so that they can be used with Keras.

In [40]:
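# Get all the pgm images in every set (png files from a previous run are skipped)
pgm_paths = glob.glob(os.path.join(OUTPUT_DIR, '*', '*', '*.pgm'))

# Transform all the images
for pgm_path in pgm_paths:
    # Read the pgm file unchanged (flag -1 is cv2.IMREAD_UNCHANGED)
    pgm_array = cv2.imread(pgm_path, -1)
    if pgm_array is None:
        raise IOError(f'Could not read {pgm_path}')

    # Write the array as a png file before touching the original
    png_path = os.path.splitext(pgm_path)[0] + '.png'
    if not cv2.imwrite(png_path, pgm_array):
        raise IOError(f'Could not write {png_path}')

    # Remove the old pgm file only after the png was written
    os.remove(pgm_path)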
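
After the conversion it can be worth confirming that no PGM files remain and that every set has the expected number of images. A minimal standard-library sketch, assuming the `<set>/<label>/<file>` layout created in step 3; `count_by_extension` is a hypothetical helper name:

```python
import os
from collections import Counter


def count_by_extension(output_dir):
    """Count files per (set, label, extension) under output_dir/<set>/<label>/."""
    counts = Counter()
    for set_name in sorted(os.listdir(output_dir)):
        set_path = os.path.join(output_dir, set_name)
        if not os.path.isdir(set_path):
            continue  # ignore stray files at the top level
        for label in sorted(os.listdir(set_path)):
            label_path = os.path.join(set_path, label)
            if not os.path.isdir(label_path):
                continue
            for name in os.listdir(label_path):
                extension = os.path.splitext(name)[1].lower()
                counts[(set_name, label, extension)] += 1
    return counts
```

With the split sizes from section 1, `count_by_extension(OUTPUT_DIR)` would be expected to report 14,000 `.png` files for each of `('train', '0')` and `('train', '1')`, 1,000 for each `val` label, 5,000 for each `test` label, and no `.pgm` entries.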