Image steganalysis using state-of-the-art machine learning techniques

Aletheia

Aletheia is an open source image steganalysis tool for the detection of hidden messages in images. To achieve this, Aletheia uses state-of-the-art machine learning techniques. It can detect several different steganographic methods, such as LSB replacement, LSB matching and some adaptive schemes.

Install

First, clone the Git repository:

$ git clone https://github.com/daniellerch/aletheia.git

Inside the Aletheia directory you will find a requirements file for installing Python dependencies with pip:

$ sudo pip install -r requirements.txt 

Aletheia uses Octave, so you need to install it along with some dependencies, which are listed in the octave-requirements.txt file. On Debian-based Linux distributions you can install them with the following command. On other distributions, install the equivalent packages with your package manager.

$ sudo apt-get install octave octave-image octave-signal

After that, you can execute Aletheia with:

$ ./aletheia.py 

./aletheia.py <command>

COMMANDS:

  Attacks to LSB replacement:
  - spa:   Sample Pairs Analysis.
  - rs:    RS attack.

  ML-based detectors:
  - esvm-predict:  Predict using eSVM.
  - e4s-predict:   Predict using EC.

  Feature extractors:
  - srm:    Full Spatial Rich Models.
  - srmq1:  Spatial Rich Models with fixed quantization q=1c.
  - scrmq1: Spatial Color Rich Models with fixed quantization q=1c.

  Embedding simulators:
  - lsbr-sim:       Embedding using LSB replacement simulator.
  - lsbm-sim:       Embedding using LSB matching simulator.
  - hugo-sim:       Embedding using HUGO simulator.
  - wow-sim:        Embedding using WOW simulator.
  - s-uniward-sim:  Embedding using S-UNIWARD simulator.
  - j-uniward-sim:  Embedding using J-UNIWARD simulator.
  - hill-sim:       Embedding using HILL simulator.
  - ebs-sim:        Embedding using EBS simulator.
  - ued-sim:        Embedding using UED simulator.
  - nsf5-sim:       Embedding using nsF5 simulator.

  Model training:
  - esvm:     Ensemble of Support Vector Machines.
  - e4s:      Ensemble Classifiers for Steganalysis.
  - xu-net:   Convolutional Neural Network for Steganalysis.

  Unsupervised attacks:
  - ats:      Artificial Training Sets.


Statistical attacks on LSB replacement

LSB replacement steganographic methods, that is, methods that hide information by replacing the least significant bit of each pixel, are flawed. Aletheia implements two attacks on these methods: Sample Pairs Analysis (SPA) and the RS attack.
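
To make the flaw concrete, here is a minimal sketch of LSB replacement on a list of 8-bit pixel values. It is an illustration only, not Aletheia's code, and the function names are hypothetical. Note that an even value can only stay the same or increase by one, and an odd value can only stay the same or decrease by one. Structural attacks such as SPA and RS rely on this asymmetry.

```python
def embed_lsb_replacement(pixels, bits):
    """Return a copy of pixels whose first len(bits) LSBs carry the message bits."""
    if len(bits) > len(pixels):
        raise ValueError("message is longer than the cover")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        # Clear the least significant bit, then write the message bit into it.
        stego[i] = (stego[i] & ~1) | bit
    return stego


def extract_lsb(pixels, n_bits):
    """Read back the least significant bit of the first n_bits values."""
    return [p & 1 for p in pixels[:n_bits]]


cover = [100, 101, 102, 103, 254, 255]
message = [1, 0, 1, 0, 1, 0]
stego = embed_lsb_replacement(cover, message)
print(stego)  # [101, 100, 103, 102, 255, 254]
```

Every changed value stays inside its pair (2k, 2k+1): 100 becomes 101 but never 99.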

To run the SPA attack on an included image that contains data hidden with LSB replacement, use the following command:

$./aletheia.py spa sample_images/lena_lsbr.png 
Hiden data found in channel R 0.0930809062336
Hiden data found in channel G 0.0923858529528
Hiden data found in channel B 0.115466382367

The command used to perform the RS attack is similar:

$./aletheia.py rs sample_images/lena_lsbr.png 
Hiden data found in channel R 0.215602586771
Hiden data found in channel G 0.210351910548
Hiden data found in channel B 0.217878287806

In both cases the result is an estimate of the embedding rate.

Machine Learning based attacks

Most state-of-the-art steganographic methods use some kind of LSB matching. These methods are very difficult to detect, and simple statistical attacks are not enough. We need to use machine learning.
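
For comparison with LSB replacement, the following is a minimal sketch of plain LSB matching (also known as ±1 embedding). It is illustrative only: the function name and the `rng` parameter are hypothetical, and adaptive schemes such as HILL additionally decide where to embed, which is not shown here. When the least significant bit does not already match the message bit, the pixel is randomly incremented or decremented, so the change is not confined to a fixed pair of values.

```python
import random


def embed_lsb_matching(pixels, bits, rng=None):
    """Return a copy of 8-bit pixels whose LSBs carry bits, using +/-1 changes."""
    rng = rng or random.Random()
    if len(bits) > len(pixels):
        raise ValueError("message is longer than the cover")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        if (stego[i] & 1) == bit:
            continue  # the LSB already matches, nothing to change
        if stego[i] == 0:
            stego[i] += 1  # cannot go below 0
        elif stego[i] == 255:
            stego[i] -= 1  # cannot go above 255
        else:
            stego[i] += rng.choice((-1, 1))  # either direction flips the LSB
    return stego


cover = [0, 100, 101, 102, 103, 255]
message = [1, 1, 0, 1, 0, 0]
stego = embed_lsb_matching(cover, message, random.Random(1))
print([s & 1 for s in stego] == message)  # True
```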

To use machine learning, we first need to prepare a training dataset for our classifier. For this example we will use BOSSbase, a database of grayscale images.

$ wget http://dde.binghamton.edu/download/ImageDB/BOSSbase_1.01.zip
$ unzip BOSSbase_1.01.zip

We are going to build a detector for the HILL algorithm with a payload of 0.40, so we need to prepare a set of images with data hidden using this algorithm. The following command embeds information into all the downloaded images:

$ ./aletheia.py hill-sim bossbase 0.40 bossbase_hill040 

With all the images prepared, we need to extract features that can be processed by a machine learning algorithm. Aletheia provides different feature extractors; in this case we will use the well-known Rich Models. The following commands save the features into two files, one file for cover images and one file for stego images.

$ ./aletheia.py srm bossbase bossbase.fea 
$ ./aletheia.py srm bossbase_hill040 bossbase_hill040.fea

Now we can train the classifier. Aletheia provides different classifiers; in this case we will use Ensemble Classifiers:

$ ./aletheia.py e4s bossbase.fea bossbase_hill040.fea hill040.model
Validation score: 73.0

As a result, we obtain the score on a validation set (a small subset not used during training). The output is the file "hill040.model", which we can use for future classifications.

Finally, we can classify an image:

$ ./aletheia.py e4s-predict hill040.model srm my_test_image.png
my_test_image.png Stego
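
The embedding, feature extraction and training steps above can be chained from a small script. The sketch below only builds the command lines, copying the argument order from the commands shown above. The helper names and the file naming convention (`bossbase_hill040`, `hill040.model`) are assumptions taken from this example, not something Aletheia requires.

```python
import subprocess


def build_pipeline(cover_dir, simulator="hill-sim", payload="0.40",
                   extractor="srm", trainer="e4s", aletheia="./aletheia.py"):
    """Return the embed, extract and train commands as argument lists."""
    # "hill-sim" with payload "0.40" gives the tag "hill040" (naming assumption).
    name = simulator[:-len("-sim")] if simulator.endswith("-sim") else simulator
    tag = name + payload.replace(".", "")
    stego_dir = f"{cover_dir}_{tag}"
    cover_fea = f"{cover_dir}.fea"
    stego_fea = f"{stego_dir}.fea"
    return [
        [aletheia, simulator, cover_dir, payload, stego_dir],
        [aletheia, extractor, cover_dir, cover_fea],
        [aletheia, extractor, stego_dir, stego_fea],
        [aletheia, trainer, cover_fea, stego_fea, f"{tag}.model"],
    ]


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = build_pipeline("bossbase")
for cmd in commands:
    print(" ".join(cmd))
```

The printed lines match the four commands used in this section. Calling `run_pipeline(commands)` from the Aletheia directory would execute them.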

Using pre-built models

We provide some pre-built models to make Aletheia easier to use. You can find these models in the "models" folder. For example, you can use the model "e4s_srm_bossbase_lsbm0.10_gs.model" to classify an image with the following command:

$ ./aletheia.py e4s-predict e4s_srm_bossbase_lsbm0.10_gs.model srm my_test_image.png
my_test_image.png Stego

The file name gives some details about the model. First comes the classification algorithm, "e4s", which stands for Ensemble Classifiers for Steganalysis. Next comes the name of the feature extractor (srm, for Spatial Rich Models), followed by "bossbase", the name of the image database used to train the model. Then come the embedding algorithm (lsbm, for LSB matching) and the embedding rate (0.10 bits per pixel). Finally, the tag "gs" or "color" indicates the type of images used to train the model. Part of this information is needed to run the program: we have to tell Aletheia which classification algorithm to use for prediction (the e4s-predict command) and which feature extractor was used (srm).
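
As a sketch of how this naming scheme could be read programmatically, the hypothetical helper below splits a model file name into the fields described above. It assumes every name follows the five-part pattern of the example and that the embedding rate is written as one digit, a dot, and more digits. Other names in the "models" folder may not follow this pattern.

```python
import re


def parse_model_name(filename):
    """Split a model file name into the fields described in the text."""
    stem = filename[:-len(".model")] if filename.endswith(".model") else filename
    parts = stem.split("_")
    if len(parts) != 5:
        raise ValueError(f"unexpected model name: {filename}")
    classifier, extractor, database, embedding, image_type = parts
    # "lsbm0.10" -> algorithm "lsbm", rate 0.10
    match = re.fullmatch(r"(.+?)(\d\.\d+)", embedding)
    if match is None:
        raise ValueError(f"cannot read embedding rate from: {embedding}")
    return {
        "classifier": classifier,
        "predict_command": f"{classifier}-predict",
        "extractor": extractor,
        "database": database,
        "algorithm": match.group(1),
        "rate": float(match.group(2)),
        "image_type": image_type,
    }


info = parse_model_name("e4s_srm_bossbase_lsbm0.10_gs.model")
print(info["predict_command"], info["extractor"])  # e4s-predict srm
```

The two printed values are the ones that have to be passed on the command line together with the model file.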

Remember that the reliability of the prediction is highly dependent on the cover source. This means that if the images used for training are very different from the images we want to classify, the result may not be accurate.

You can find some information about the pre-built models here.

The ATS attack

The ATS attack provides a mechanism to deal with Cover Source Mismatch, a problem caused by training with an incomplete dataset. Our database, no matter how big it is, does not contain a representation of every type of image that exists. As a consequence, if the image we want to test is not well represented in our training set, the results are likely to be unreliable.

The ATS attack is unsupervised, which means that it does not need a training database. It uses the images we want to test (we need a number of images; the method cannot be applied to a single image) to create an artificial training set that is used to train the classifier. This method has an important drawback, however: we can only use it if we know that the set of images under test contains some cover and some stego images. The method does not work if all the images are cover or all the images are stego.

For example, we prepare a folder "test_ats" with 10 cover and 10 stego images. We can apply the ATS attack with the following command:

$ ./aletheia.py ats lsbm-sim 0.40 srm test_ats/

9903.pgm Stego
9993.pgm Stego
9909.pgm Cover
9996.pgm Stego
9904.pgm Cover
9905.pgm Stego
9907.pgm Cover
9998.pgm Stego
9900.pgm Cover
9999.pgm Stego
9990.pgm Stego
9995.pgm Stego
9991.pgm Stego
9994.pgm Stego
9901.pgm Cover
9906.pgm Cover
9997.pgm Stego
9908.pgm Cover
9992.pgm Stego
9902.pgm Cover

Note that we are assumed to know (or to have a strong hypothesis about) the embedding algorithm and bitrate used by the steganographer.
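
When testing a large folder, it can help to tally the verdicts. The sketch below assumes the output keeps the "file name, space, label" layout shown above. The helper name is hypothetical and the sample is the first four lines of the output above.

```python
from collections import Counter


def parse_verdicts(output):
    """Map each file name to its label ("Cover" or "Stego")."""
    verdicts = {}
    for line in output.splitlines():
        parts = line.strip().rsplit(None, 1)
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, label = parts
        verdicts[name] = label
    return verdicts


sample_output = """
9903.pgm Stego
9993.pgm Stego
9909.pgm Cover
9996.pgm Stego
"""
verdicts = parse_verdicts(sample_output)
counts = Counter(verdicts.values())
print(counts["Stego"], counts["Cover"])  # 3 1
```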