# Steganography and Steganalysis (with GPU)
Steganography is the art of hiding information (super secrets!) in an image or text, whereas steganalysis is the art of uncovering that information. To help explain steganography, I have embedded the following YouTube video from the Computerphile channel (very cool channel!).

In [13]:
from IPython.display import HTML

# Youtube
HTML('<iframe width="560" height="315" src="http://www.youtube.com/embed/TWEXCYQKyDc" frameborder="0" allowfullscreen></iframe>')

There are a variety of methods for embedding information/messages into text or images. For a JPEG image, one method is to embed information in the discrete cosine transform (DCT) coefficients.

1. First, the RGB color space is converted to the YUV/YCbCr color space (because human eyes are less sensitive to the chroma channels, Cb and Cr, than to luminance, Y).
2. After converting the color space, each channel is split into 8x8 blocks and each block is projected onto 64 (8x8) basis patterns. Each pattern is a product of a horizontal and a vertical cosine, $\cos\left[\frac{(2x+1)u\pi}{16}\right]\cos\left[\frac{(2y+1)v\pi}{16}\right]$, ranging from low to high frequency in the x-y direction. For a 512 x 512 x 3 image, the result is a 512 x 512 x 3 array of float coefficients.
3. An 8x8 quantization table is applied to each block of coefficients: the float values are divided by this "custom" integer table and rounded to integers. The message is hidden by making small changes to these quantized coefficients.
4. To reverse the process, the same "custom" quantization table is needed: the integers are multiplied by the table before the inverse DCT is applied.

There are more sophisticated methods for generating the quantization table (value distribution (usually Gaussian), dimensions, etc.).
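Steps 2 to 4 can be sketched on a single 8x8 block with NumPy alone. This is a minimal illustration, not the JPEG codec and not any of the competition's embedding algorithms: the quantization table `Q` is made up for the example, and `embed_bit` / `extract_bit` are hypothetical helpers that hide one bit in the least significant bit of one quantized coefficient.

```python
import numpy as np

N = 8
# Orthonormal DCT-II matrix: C[u, x] = a(u) * cos((2x + 1) * u * pi / (2N))
u = np.arange(N).reshape(-1, 1)
x = np.arange(N).reshape(1, -1)
C = np.sqrt(2.0 / N) * np.cos((2 * x + 1) * u * np.pi / (2 * N))
C[0, :] = np.sqrt(1.0 / N)

rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(N, N)).astype(np.float64)  # stand-in for one luma block

# Illustrative table (an assumption, not the JPEG standard table):
# larger divisors at higher frequencies
Q = 1 + 2 * np.add.outer(np.arange(N), np.arange(N))

coeffs = C @ (block - 128.0) @ C.T                  # 2D DCT of the level-shifted block
quantized = np.round(coeffs / Q).astype(np.int64)   # lossy step: divide by Q and round


def embed_bit(q, bit, pos=(0, 1)):
    """Return a copy of q whose coefficient at pos has its lowest bit set to `bit`."""
    out = q.copy()
    out[pos] = (out[pos] & ~1) | bit
    return out


def extract_bit(q, pos=(0, 1)):
    """Read the hidden bit back from the quantized coefficients."""
    return int(q[pos] & 1)


stego = embed_bit(quantized, 1)
# Decoding: multiply by Q (dequantize), invert the DCT, undo the level shift
decoded = C.T @ (stego * Q) @ C + 128.0
```

The embedding changes a coefficient by at most 1, which is why the altered image looks identical to the eye and why a detector has to learn very faint statistical traces.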

More information can be found here (https://www.kaggle.com/prashant111/alaska2-image-steganalysis-all-you-need-to-know)

The interest in steganography and steganalysis stems from an ongoing (as of May 7th 2020) Kaggle competition (https://www.kaggle.com/c/alaska2-image-steganalysis). The dataset can be downloaded from the competition website after agreeing to the terms and conditions.

## Step 1: Managing dataset
The provided dataset has 4 folders (corresponding to 1 negative and 3 positive labels), each containing 75,000 images. Each image is 512 x 512 x 3. The resulting dataset is approximately 35 GB. We start off by overfitting on a small sample, then gradually add more training samples to regularize the model and generalize its predictions.

In [3]:
import os
import shutil
import numpy as np

types = ["Cover","JMiPOD","JUNIWARD","UERD"]
PATH = r"C:\Users\Innovations\Desktop\AI\Steganography\Data"
data_paths = [os.listdir(os.path.join(PATH, alg)) for alg in types]

# Copy roughly 10% of each folder into a matching "mini<Name>" folder.
# Another approach is to shuffle os.listdir(directory) and slice it to get an exact sample size.
for index, folder in enumerate(data_paths):
    print(index)
    dst_dir = os.path.join(PATH, "mini" + types[index])
    os.makedirs(dst_dir, exist_ok=True)  # shutil.copyfile does not create missing folders
    for file in folder:
        if np.random.rand() < 0.1:
            shutil.copyfile(os.path.join(PATH, types[index], file), os.path.join(dst_dir, file))

0
1
2
3
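
The loop above keeps each file with probability 0.1, so the sample size varies from run to run. The shuffle-and-slice alternative mentioned in the code comment gives an exact size. A minimal sketch, where `copy_random_subset` is a hypothetical helper and not part of the original notebook:

```python
import os
import random
import shutil


def copy_random_subset(src_dir, dst_dir, n_samples, seed=0):
    """Copy exactly n_samples randomly chosen files from src_dir into dst_dir."""
    files = sorted(os.listdir(src_dir))  # sort first so a given seed repeats the same pick
    chosen = random.Random(seed).sample(files, min(n_samples, len(files)))
    os.makedirs(dst_dir, exist_ok=True)
    for name in chosen:
        shutil.copyfile(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return chosen
```

It could be called as `copy_random_subset(os.path.join(PATH, "Cover"), os.path.join(PATH, "miniCover"), 7500)`. If the four folders contain the same file names, reusing one seed for all of them should select matching cover/stego images.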


## Step 2: Model Definition
Here, we build a deep residual network based on the SRNet work by Boroumand et al. (2019) (http://www.ws.binghamton.edu/fridrich/Research/SRNet.pdf)

In [2]:
import tensorflow as tf
from tensorflow.keras.layers import Activation, BatchNormalization, Conv2D, Add, AveragePooling2D, GlobalAveragePooling2D, Input, Dense, Dropout
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K

In [3]:
def layer1(inputLayer, noFilter, filterSize=(8,8), strides=(1,1), padding = "same"):
    # Conv -> BatchNorm -> ReLU
    x = Conv2D(noFilter, filterSize, strides=strides, padding=padding)(inputLayer)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)

    return x

def layer2(inputLayer, noFilter, filterSize1=(8,8), filterSize2=(8,8), strides1=(1,1), strides2=(1,1), padding = "same"):
    x = layer1(inputLayer, noFilter, filterSize1, strides1, padding)
    x = Conv2D(noFilter, filterSize2, strides2, padding)(x)
    x = BatchNormalization()(x)
    x = Add()([x, inputLayer])

    return x

def layer3(inputLayer, noFilter, filterSize1=(8,8), filterSize3=(8,8), strides1=(1,1), strides3=(1,1), padding = "same"):
    x1 = layer1(inputLayer, noFilter, filterSize1, strides1, padding)
    x1 = Conv2D(noFilter, filterSize3, strides3, padding)(x1)
    x1 = BatchNormalization()(x1)
    x1 = AveragePooling2D(pool_size=(3,3), strides =(2,2))(x1)
    
    x2 = Conv2D(noFilter, (3,3), (2,2))(inputLayer)
    x2 = BatchNormalization()(x2)

    x = Add()([x1,x2])

    return x

def layer4(inputLayer, noFilter, filterSize1=(8,8), filterSize4=(8,8), strides1=(1,1), strides4=(1,1), padding = "same"):
    x = layer1(inputLayer, noFilter, filterSize1, strides1, padding)
    x = Conv2D(noFilter, filterSize4, strides4, padding)(x)
    x = BatchNormalization()(x)
    x = GlobalAveragePooling2D()(x)

    return x

def build(width, height, depth, noClasses):
    inputShape = (height, width, depth)
    chanDim = -1

    if K.image_data_format() == "channels_first":
        inputShape = (depth, height, width)
        chanDim = 1

    inputs = Input(shape = inputShape)

    x = layer1(inputs, 64)
    x = layer1(x,16)

    x = layer2(x,16)
    x = layer2(x,16)
    x = layer2(x,16)
    x = layer2(x,16)
    x = layer2(x,16)
    
    x = layer3(x,16)
    x = layer3(x,32)
    x = layer3(x,64)
    x = layer3(x,128)

#     x = layer4(x,512)
    x = GlobalAveragePooling2D()(x)
    x = Dense(32)(x)
    x = Dropout(0.5)(x)
    x = Dense(1, activation="sigmoid")(x)

    model = Model(inputs=inputs, outputs=x)
    
    return model
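
Each `layer3` block shrinks the feature map: both of its branches end in a 3x3 window with stride 2 and Keras's default `valid` padding, so the two outputs have the same size and can be added. A small sketch of that arithmetic (the helper name `valid_output_size` is made up for the example):

```python
def valid_output_size(size, window=3, stride=2):
    """Output length of a convolution or pooling layer with 'valid' padding."""
    return (size - window) // stride + 1


size = 512
sizes = [size]
for _ in range(4):  # build() stacks four layer3 blocks
    size = valid_output_size(size)
    sizes.append(size)

print(sizes)  # [512, 255, 127, 63, 31]
```

So the tensor entering `GlobalAveragePooling2D` should be 31 x 31 x 128 for a 512 x 512 input.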

In [4]:
model = build(512,512,3,2)

In [5]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 512, 512, 3) 0                                            
__________________________________________________________________________________________________
conv2d (Conv2D)                 (None, 512, 512, 64) 12352       input_1[0][0]                    
__________________________________________________________________________________________________
batch_normalization (BatchNorma (None, 512, 512, 64) 256         conv2d[0][0]                     
__________________________________________________________________________________________________
activation (Activation)         (None, 512, 512, 64) 0           batch_normalization[0][0]        
______________________________________________________________________________________________

Let's visualize the built model

In [None]:
from tensorflow.keras.utils import plot_model
plot_model(model, show_shapes=True, to_file="model.png")

## Step 3: Prepare dataset for Pipeline intake
There are two approaches to preparing the dataset for intake:
(a) load the images into memory (only a much smaller dataset can be ingested this way, because of CPU and memory limits)
(b) leverage the Keras `ImageDataGenerator` to build an input pipeline that streams images from a directory
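
A rough estimate shows why approach (a) only scales to a small sample. The helper `images_gib` is hypothetical and counts only the decoded pixel arrays, not Python or framework overhead:

```python
def images_gib(n_images, height=512, width=512, channels=3, bytes_per_value=1):
    """Approximate RAM needed to hold n_images decoded images, in GiB."""
    return n_images * height * width * channels * bytes_per_value / 2**30


print(images_gib(8_000))                     # 5.859375 (uint8, 4 folders x 2,000 images)
print(images_gib(8_000, bytes_per_value=4))  # 23.4375 (the same images as float32)
print(images_gib(300_000))                   # 219.7265625 (all 4 x 75,000 images as uint8)
```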

### Step 3a : In-Memory Dataset

In [1]:
import cv2
import numpy as np
import joblib
import os
from random import shuffle

IMG_SIZE = 512
PATH = r"C:\Users\Innovations\Desktop\AI\Steganography\Data"
# sorted() keeps the test images in a fixed, alphabetical order (os.listdir order is not guaranteed)
test_images_path = [os.path.join(PATH, "Test", i) for i in sorted(os.listdir(os.path.join(PATH, "Test")))]
ALGORITHMS = ['JMiPOD','JUNIWARD','UERD']

def load_image(data):
    i, j, img_path, labels = data
    
    img = cv2.imread(img_path)
    img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))
    label = labels[i][j]
    
    return [np.array(img), label]

def load_training_data_multi(n_images=100):
    train_data = []
    folders = ['Cover'] + ALGORITHMS
    data_paths = [os.listdir(os.path.join(PATH, alg)) for alg in folders]
    # NOTE: this prints every file name and can trip Jupyter's IOPub data-rate limit on large folders
    print(data_paths)
    # label 0 = cover image, label 1 = any of the three stego algorithms
    labels = [np.zeros(n_images), np.ones(n_images), np.ones(n_images), np.ones(n_images)]

    for i, image_path in enumerate(data_paths):
        # load the first n_images of each folder on 4 threads
        train_data_alg = joblib.Parallel(n_jobs=4, backend='threading')(
            joblib.delayed(load_image)([i, j, os.path.join(PATH, folders[i], img_p), labels])
            for j, img_p in enumerate(image_path[:n_images]))

        train_data.extend(train_data_alg)

    shuffle(train_data)
    return train_data

def load_test_data():
    test_data = []
    for img in test_images_path:
        img = cv2.imread(img)
        img = cv2.resize(img, (512, 512))
        test_data.append([np.array(img)])
            
    return test_data

In [7]:
from sklearn.model_selection import train_test_split
training_data = load_training_data_multi(n_images=2000)
trainImages = np.array([i[0] for i in training_data])
trainLabels = np.array([i[1] for i in training_data])
X_train, X_val, y_train, y_val = train_test_split(trainImages, trainLabels, random_state=42)
print(len(X_train), len(X_val), len(y_train), len(y_val))

6000 2000 6000 2000


### Step 3b : ImageDataGenerator

In [8]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# limited by GPU memory size
BATCH_SIZE=8

# assumption: hold out 25% per class (mirrors Step 3a); without validation_split the "validation" subset is empty
VAL_SPLIT = 0.25
train_datagen = ImageDataGenerator(rescale=1./255, rotation_range=90, validation_split=VAL_SPLIT)
valid_datagen = ImageDataGenerator(rescale=1./255, validation_split=VAL_SPLIT)
train_path = r"C:\Users\Innovations\Desktop\AI\Steganography\Data\train"
train_generator = train_datagen.flow_from_directory(train_path, target_size=(512,512), batch_size=BATCH_SIZE, class_mode="binary", classes=["0", "1"], subset="training")
# validation images come from valid_datagen so that they are not augmented
valid_generator = valid_datagen.flow_from_directory(train_path, target_size=(512,512), batch_size=BATCH_SIZE, class_mode="binary", classes=["0", "1"], subset="validation")
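
`flow_from_directory` with `classes=["0", "1"]` expects `train_path` to contain one sub-folder per class. The notebook does not show how that folder was built. One possible way, assuming cover images go to `0` and all three stego folders go to `1`, is sketched below. `build_binary_folders` is a hypothetical helper, and the file-name prefix is there on the assumption that the four source folders reuse the same file names.

```python
import os
import shutil


def build_binary_folders(data_dir, train_dir, stego_folders=("JMiPOD", "JUNIWARD", "UERD")):
    """Copy Cover images to train_dir/0 and stego images to train_dir/1."""
    mapping = {"Cover": "0"}
    mapping.update({alg: "1" for alg in stego_folders})
    counts = {"0": 0, "1": 0}
    for folder, label in mapping.items():
        dst = os.path.join(train_dir, label)
        os.makedirs(dst, exist_ok=True)
        for name in os.listdir(os.path.join(data_dir, folder)):
            # prefix with the folder name so equal file names from different folders do not collide
            shutil.copyfile(os.path.join(data_dir, folder, name),
                            os.path.join(dst, folder + "_" + name))
            counts[label] += 1
    return counts
```

With the full dataset this layout has three times as many images in class `1` as in class `0`, which is one reason to consider class weights during training.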

In [11]:
model.compile(loss="binary_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-2, momentum=0.9, decay=1e-2/10),  # `lr` is a deprecated alias
              metrics=["accuracy"])
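
The `decay` argument shrinks the learning rate as training proceeds. Assuming the time-based rule of the legacy Keras optimizers, where the rate after `t` batch updates is `lr / (1 + decay * t)`, its effect over the 10 epochs can be sketched as follows (`time_based_lr` is a hypothetical helper):

```python
def time_based_lr(initial_lr, decay, iteration):
    """Learning rate after `iteration` batch updates under time-based decay."""
    return initial_lr / (1.0 + decay * iteration)


initial_lr, decay = 1e-2, 1e-2 / 10
steps_per_epoch = 750  # 6000 training images / batch size 8

for epoch in (0, 1, 5, 10):
    lr = time_based_lr(initial_lr, decay, epoch * steps_per_epoch)
    print(epoch, round(lr, 6))
```

Under that assumption the rate falls from 0.01 to roughly 0.0012 by the end of epoch 10, because the decay is applied per batch rather than per epoch.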

In [None]:
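from sklearn.utils import class_weight
# newer scikit-learn versions require keyword arguments here
class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
# Keras expects a {class_index: weight} dict for the class_weight argument
class_weights = dict(enumerate(class_weights))
print(class_weights)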
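
For reference, the `'balanced'` heuristic documented by scikit-learn is `n_samples / (n_classes * count_per_class)`. A NumPy-only sketch of the same formula, using made-up labels with the dataset's 1 cover : 3 stego ratio (`balanced_class_weights` is a hypothetical helper):

```python
import numpy as np


def balanced_class_weights(y):
    """n_samples / (n_classes * count_per_class), returned as a {class: weight} dict."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


# Illustrative labels: 2,000 cover images (0) and 6,000 stego images (1)
y_example = np.array([0] * 2000 + [1] * 6000)
weights = balanced_class_weights(y_example)
print(weights)  # cover images get weight 2.0, stego images about 0.667
```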

Define the callbacks: a custom one that saves the model periodically so training can be resumed later, and a learning-rate scheduler driven by the validation loss.

In [1]:
import os

import tensorflow as tf
from tensorflow.keras.callbacks import Callback

class EpochCheckpoint(Callback):
    def __init__(self, outputPath, every=3, startAt=0):
        super().__init__()
        
        self.outputPath = outputPath  # folder that receives the checkpoints
        self.every = every            # save every `every` epochs
        self.intEpoch = startAt       # epoch counter, lets a resumed run continue the numbering
        
    def on_epoch_end(self, epoch, logs=None):
        # save when the running epoch count is a multiple of `every`
        if (self.intEpoch + 1) % self.every == 0:
            p = os.path.sep.join([self.outputPath, "SRNet_epoch_{}.hdf5".format(self.intEpoch+1)])
            self.model.save(p, overwrite=True)
            
        self.intEpoch +=1
        
# multiply the learning rate by 0.2 when val_loss has not improved for 3 epochs
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, min_lr=0.001)
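
To see which checkpoint files `EpochCheckpoint` writes, its counter logic can be replayed without Keras. `saved_epochs` below is a hypothetical stand-in that mirrors the `on_epoch_end` test:

```python
def saved_epochs(n_epochs, every=3, start_at=0):
    """1-based epoch numbers at which EpochCheckpoint would write a file."""
    saved = []
    int_epoch = start_at
    for _ in range(n_epochs):
        if (int_epoch + 1) % every == 0:
            saved.append(int_epoch + 1)
        int_epoch += 1
    return saved


print(saved_epochs(10, every=5))              # [5, 10]
print(saved_epochs(10, every=3, start_at=4))  # [6, 9, 12]
```

With `every=5` and 10 epochs, the run produces `SRNet_epoch_5.hdf5` and `SRNet_epoch_10.hdf5`.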

## Step 4: Train Model
### Step 4a: Train Model with a smaller but augmented dataset

In [12]:
# model.fit(train_datagen.flow(X_train, y_train, batch_size=8), validation_data = train_datagen.flow(X_val, y_val, batch_size=8), class_weight = class_weights, epochs = 10, shuffle=True, callbacks = [reduce_lr])
model.fit(train_datagen.flow(X_train, y_train, BATCH_SIZE),
          validation_data = valid_datagen.flow(X_val, y_val, BATCH_SIZE),
          steps_per_epoch = len(X_train) // BATCH_SIZE, validation_steps = len(X_val) // BATCH_SIZE,  # 750 and 250 steps
          epochs = 10, shuffle=True,
          callbacks = [EpochCheckpoint(r"C:\Users\Innovations\Desktop\AI\Steganography", every=5, startAt=0)])

Train for 750 steps, validate for 250 steps
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x16252f3ad48>

### Step 4b: Train Model with ImageDataGenerator, with data directly from folder

In [None]:
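# `samples` is the number of images each directory generator found
model.fit(train_generator, steps_per_epoch=train_generator.samples // BATCH_SIZE, epochs=10, validation_data=valid_generator, validation_steps=valid_generator.samples // BATCH_SIZE, verbose=1)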
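
The step counts are simply the number of batches per epoch. A small sketch of the two common conventions, using a hypothetical helper `steps_for` and the image counts from Step 4a:

```python
import math


def steps_for(n_samples, batch_size, drop_last=True):
    """Batches per epoch; drop_last=True ignores a final partial batch."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)


print(steps_for(6000, 8), steps_for(2000, 8))                   # 750 250
print(steps_for(6001, 8), steps_for(6001, 8, drop_last=False))  # 750 751
```

Floor division matches the `//` used in this notebook. Rounding up instead makes every image appear once per epoch when the count is not a multiple of the batch size.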

## Step 5: Make Predictions

In [2]:
import os
import pandas as pd
PATH = r"C:\Users\Innovations\Desktop\AI\Steganography\Data"

sample_sub = pd.read_csv(os.path.join(PATH, 'sample_submission.csv'))

In [None]:
import tensorflow as tf

test = load_test_data()
ckptmodel = tf.keras.models.load_model("SRNet_epoch_10.hdf5")
# float32 needs half the memory of float64, which is what "float" would give
test_images = np.array([i[0] for i in test]).astype("float32")
test_images /= 255.0
predict = ckptmodel.predict(test_images, batch_size=32)

In [4]:
sample_sub['Label'] = predict.ravel()  # flatten the (N, 1) predictions into one column
sample_sub.to_csv('submission (SRNet512).csv', index=False)
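
Assigning the predictions directly only gives correct results if the test images were loaded in the same order as the rows of the sample file. A more defensive variant matches them by file name. The sketch below uses plain lists and a hypothetical helper `align_predictions`; it assumes the sample file identifies each row by image file name (for example in an `Id` column, which is an assumption here).

```python
def align_predictions(sample_ids, file_names, scores):
    """Return scores reordered to match sample_ids; raises KeyError if an id is missing."""
    by_name = dict(zip(file_names, scores))
    return [by_name[i] for i in sample_ids]


sample_ids = ["0001.jpg", "0002.jpg", "0003.jpg"]  # row order of the submission file
file_names = ["0003.jpg", "0001.jpg", "0002.jpg"]  # order in which the images were loaded
scores = [0.9, 0.1, 0.4]

print(align_predictions(sample_ids, file_names, scores))  # [0.1, 0.4, 0.9]
```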

With SRNet on 512 x 512 inputs and a sample size of 8,000 images, the submission yields a score of **0.568**. Some authors have had more success with different network architectures (EfficientNet B7 and InceptionNetV2). Unfortunately, because of the number of network parameters and the image size, a large amount of computation is required. A Google Colab effort will be used moving forward.