<a href="https://colab.research.google.com/github/danjethh/steg_analysis/blob/main/steg_analysis_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Run the notebook. It will:
1. Load and preprocess the dataset.
2. Train the Random Forest Classifier.
3. Report the classifier's accuracy and classification report on the training split.
4. Prompt you to enter the URL of an image to test.

When prompted, enter the URL of the image you want to test. The image must be a 512x512 grayscale image.

Workflow Summary

**Step 1: Load the Dataset**
1. Load the clean and stego datasets.
2. Add labels to distinguish between clean and stego images.
3. Combine them into a single DataFrame.

**Step 2: Preprocess the Data**
1. Remove rows with NaN values caused by overly uniform images.
2. Normalize the features using StandardScaler.
3. Reduce dimensionality using PCA, keeping the first 10 components.

In [4]:
!pip install PyWavelets

Collecting PyWavelets
  Downloading pywavelets-1.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.0 kB)
Downloading pywavelets-1.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.5 MB)
Installing collected packages: PyWavelets
Successfully installed PyWavelets-1.8.0


In [5]:
# Import Required Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

In [6]:
# Step 1: Load the Dataset Function
def load_data():
    """
    Loads the clean (cover) and stego image feature datasets, labels them,
    combines them, and displays preview outputs for students to understand.
    Returns the full combined dataset with labels.
    """

    # URLs for clean and stego datasets (CSV with 41 features each)
    url_clean = "https://raw.githubusercontent.com/Sourish1997/steganalysis/master/Datasets/steg_features.csv"
    url_stego = "https://raw.githubusercontent.com/Sourish1997/steganalysis/master/Datasets/steg_lsb_features.csv"

    # Load clean (cover) images feature dataset
    print("Loading clean (cover) dataset...")
    data_clean = pd.read_csv(url_clean, header=None)
    data_clean['label'] = 0  # Label '0' for clean images

    # Display first 4 rows for understanding
    print("\nFirst 4 rows from Clean (Cover) Dataset:")
    print(data_clean.head(4))

    # Load stego images feature dataset
    print("\nLoading stego dataset...")
    data_stego = pd.read_csv(url_stego, header=None)
    data_stego['label'] = 1  # Label '1' for stego images

    # Display first 4 rows from stego dataset
    print("\nFirst 4 rows from Stego Dataset:")
    print(data_stego.head(4))

    # Combine both datasets
    print("\nCombining clean and stego datasets into a single DataFrame...")
    data_combined = pd.concat([data_clean, data_stego], axis=0, ignore_index=True)

    # Display first 8 rows of the combined dataset with labels
    print("\nFirst 8 rows of the Combined Dataset (including labels):")
    print(data_combined.head(8))

    # Display the shape of the combined dataset
    print(f"\nCombined Dataset Shape: {data_combined.shape}")

    return data_combined  # Return full dataset (100%) without sampling

# Run the function
full_dataset = load_data()

Loading clean (cover) dataset...

First 4 rows from Clean (Cover) Dataset:
          0         1         2         3         4         5         6  \
0 -0.317327  0.827515  0.760605  0.740966  0.721418  0.910647  0.861356   
1       NaN       NaN       NaN       NaN       NaN       NaN       NaN   
2 -0.503111  0.862970  0.802899  0.775813  0.751000  0.927452  0.889261   
3 -0.182988  0.887022  0.835196  0.813357  0.789932  0.911072  0.861291   

          7         8         9  ...        32        33        34        35  \
0  0.835196  0.815543  0.818339  ... -0.004257 -0.000239 -0.266943 -0.106837   
1       NaN       NaN       NaN  ... -0.064528  0.015347  0.005049 -0.145678   
2  0.866067  0.848226  0.855546  ...  0.003529  0.009316 -0.248362 -0.107545   
3  0.824739  0.795830  0.856713  ... -0.024424  0.004261 -0.137704 -0.088573   

         36        37        38        39        40  label  
0 -0.059703 -0.015162 -0.006729 -0.004329  0.001190      0  
1 -0.189235  0.075486  0.0
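
The preview above shows a row of the clean dataset that contains NaN values. As a minimal sketch, using a small made-up DataFrame rather than the real CSV files, this is one way to count such rows per label before preprocessing. The helper name `count_nan_rows` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def count_nan_rows(df, label_col="label"):
    """Return the number of rows containing at least one NaN, per label."""
    has_nan = df.drop(columns=[label_col]).isna().any(axis=1)
    return has_nan.groupby(df[label_col]).sum().astype(int).to_dict()

# Toy stand-in for the combined feature table (values are made up)
toy = pd.DataFrame({
    0: [0.1, np.nan, 0.3, 0.4],
    1: [0.5, np.nan, 0.7, np.nan],
    "label": [0, 0, 1, 1],
})

nan_counts = count_nan_rows(toy)
print(nan_counts)
```

With the notebook's data, the same call could be made on `full_dataset` to see how many rows the NaN filter will drop from each class.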

In [7]:
# Step 2: Preprocess the Dataset
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

# Function to preprocess the data
def preprocess_data(data):
    """
    Preprocess the dataset by:
    1. Removing rows containing NaN values (invalid computations or uniform features).
    2. Normalizing features using StandardScaler (to ensure zero mean and unit variance).
    3. Reducing dimensionality using PCA (retains 10 most important components).

    Returns:
    - X: Preprocessed features
    - y: Labels (0 = clean, 1 = stego)
    - scaler: Fitted StandardScaler object
    - pca: Fitted PCA object
    """

    # Step 1: Split dataset into features and labels
    X = data.drop(columns=['label']).values  # Drop label column for features
    y = data['label'].values  # Extract label column

    # Step 2: Remove any row that contains NaN (can happen due to uniform images or divide by zero)
    print("\n🔍 Removing rows with NaN values...")
    nan_mask = ~np.isnan(X).any(axis=1)  # Mask where rows do not contain NaNs
    X = X[nan_mask]
    y = y[nan_mask]
    print(f"✅ Dataset shape after removing NaNs: {X.shape}")
    print("🧾 First 5 rows of X (after NaN removal):")
    print(X[:5])  # Show first 5 rows

    # Step 3: Normalize the features to have mean=0 and std=1
    print("\n⚙️ Normalizing features with StandardScaler...")
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    print("✅ First 5 rows after normalization:")
    print(X[:5])

    # Step 4: Apply PCA to reduce to top 10 most significant components
    print("\n📉 Applying PCA to reduce dimensionality to 10 components...")
    pca = PCA(n_components=10)
    X = pca.fit_transform(X)
    print("📊 Explained Variance Ratio of PCA:")
    print(pca.explained_variance_ratio_)
    print("✅ First 5 rows of transformed features after PCA:")
    print(X[:5])

    return X, y, scaler, pca

# Run the preprocessing function
X_processed, y_labels, scaler_model, pca_model = preprocess_data(full_dataset)


🔍 Removing rows with NaN values...
✅ Dataset shape after removing NaNs: (19358, 41)
🧾 First 5 rows of X (after NaN removal):
[[-3.17326879e-01  8.27515384e-01  7.60604845e-01  7.40965975e-01
   7.21417838e-01  9.10647001e-01  8.61356432e-01  8.35196209e-01
   8.15543437e-01  8.18339071e-01  7.58361024e-01  7.32744946e-01
   7.12313450e-01  8.06705401e-01  7.59407793e-01  1.60797637e-01
   2.14583946e-01  1.91990069e-01  1.82255482e-01  1.76242578e-01
  -2.84393560e-01 -1.14700224e-01 -5.35871230e-02 -8.57756000e-04
  -2.30281000e-04 -4.37930600e-03  3.93935000e-03 -2.54409068e-01
  -1.58545180e-01 -4.41293100e-02 -6.17937400e-03 -1.58812800e-03
  -4.25697300e-03 -2.38575000e-04 -2.66942911e-01 -1.06837230e-01
  -5.97025860e-02 -1.51620180e-02 -6.72895500e-03 -4.32901300e-03
   1.18994800e-03]
 [-5.03110538e-01  8.62969506e-01  8.02899468e-01  7.75813071e-01
   7.50999908e-01  9.27451720e-01  8.89261157e-01  8.66067459e-01
   8.48225700e-01  8.55546058e-01  8.02765823e-01  7.72049795e-
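
In the cell above, the scaler and PCA are fitted on the full dataset, and the train/test split happens afterwards, so the test rows influence the fitted transform. One possible alternative, sketched here on random synthetic data and not the notebook's own method, fits both steps on the training split only by wrapping them in a scikit-learn `Pipeline`. The variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 41-column feature matrix (values are random)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 41))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

preprocess = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
])
X_tr_red = preprocess.fit_transform(X_tr)  # fit on training rows only
X_te_red = preprocess.transform(X_te)      # reuse the fitted transform

print(X_tr_red.shape, X_te_red.shape)
```

The fitted `preprocess` object can then transform a single new feature vector in the same way, replacing the separate `scaler` and `pca` arguments.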

In [8]:
# Function to train the classifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Run the preprocessing function (from Step 2) to get X and y
X, y, scaler, pca = preprocess_data(full_dataset)

# Split the preprocessed dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

def train_classifier(X_train, y_train):
    """
    This function trains a Random Forest Classifier on the training data.
    Returns the trained classifier and prints training accuracy and classification report.
    """
    print("\nTraining Random Forest Classifier...")
    clf = RandomForestClassifier(
        n_estimators=100,  # Number of trees in the forest
        max_depth=10,      # Maximum depth of each tree
        random_state=42,   # For reproducibility
        n_jobs=-1          # Use all available CPU cores for faster training
    )
    clf.fit(X_train, y_train)

    # 1. Training Accuracy
    train_preds = clf.predict(X_train)
    train_accuracy = accuracy_score(y_train, train_preds)
    print(f"Training Accuracy: {train_accuracy:.4f}")

    # 2. Classification Report
    print("\nClassification Report on Training Data:")
    print(classification_report(y_train, train_preds, target_names=["Cover", "Stego"]))

    return clf

# Run the function
clf = train_classifier(X_train, y_train)



🔍 Removing rows with NaN values...
✅ Dataset shape after removing NaNs: (19358, 41)
🧾 First 5 rows of X (after NaN removal):
[[-3.17326879e-01  8.27515384e-01  7.60604845e-01  7.40965975e-01
   7.21417838e-01  9.10647001e-01  8.61356432e-01  8.35196209e-01
   8.15543437e-01  8.18339071e-01  7.58361024e-01  7.32744946e-01
   7.12313450e-01  8.06705401e-01  7.59407793e-01  1.60797637e-01
   2.14583946e-01  1.91990069e-01  1.82255482e-01  1.76242578e-01
  -2.84393560e-01 -1.14700224e-01 -5.35871230e-02 -8.57756000e-04
  -2.30281000e-04 -4.37930600e-03  3.93935000e-03 -2.54409068e-01
  -1.58545180e-01 -4.41293100e-02 -6.17937400e-03 -1.58812800e-03
  -4.25697300e-03 -2.38575000e-04 -2.66942911e-01 -1.06837230e-01
  -5.97025860e-02 -1.51620180e-02 -6.72895500e-03 -4.32901300e-03
   1.18994800e-03]
 [-5.03110538e-01  8.62969506e-01  8.02899468e-01  7.75813071e-01
   7.50999908e-01  9.27451720e-01  8.89261157e-01  8.66067459e-01
   8.48225700e-01  8.55546058e-01  8.02765823e-01  7.72049795e-
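
The cell above reports metrics on the training data only; `X_test` and `y_test` are created but not used. A minimal sketch of a held-out evaluation with `accuracy_score` and `f1_score` follows. It uses synthetic data from `make_classification` so that it runs on its own; with the notebook's variables one would pass `clf`, `X_test`, and `y_test` instead. The helper name `evaluate_classifier` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

def evaluate_classifier(clf, X_test, y_test):
    """Score a fitted classifier on held-out data (hypothetical helper)."""
    preds = clf.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, preds),
        "f1": f1_score(y_test, preds),
    }

# Synthetic two-class data standing in for the PCA-reduced features
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

demo_clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
demo_clf.fit(X_tr, y_tr)

metrics = evaluate_classifier(demo_clf, X_te, y_te)
print(f"Test accuracy: {metrics['accuracy']:.4f}, F1: {metrics['f1']:.4f}")
```

Comparing these numbers with the training accuracy gives a rough indication of how much the forest overfits.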

In [13]:
# Function to extract CF features from an image
import cv2
import numpy as np
from scipy import ndimage
from scipy.stats import pearsonr
import pywt
import requests
from io import BytesIO

# --- Helper Functions ---

def getPlaneBits(planeId, binary_image):
    return [int(b[planeId]) for b in binary_image]

def getBitPlanes(img):
    bin_image = []
    bit_planes = []
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            bin_image.append(np.binary_repr(img[i][j], width=8))
    for i in range(8):
        bit_planes.append(np.array(getPlaneBits(i, bin_image)).reshape(img.shape))
    return bit_planes

def autocor(A, k, l):
    Xk = A[0:A.shape[0] - k, 0:A.shape[1] - l]
    Xl = A[k:A.shape[0], l:A.shape[1]]
    return pearsonr(Xk.flatten(), Xl.flatten())

def getHl1(img_hist, l):
    return img_hist[0:256 - l]

def getHl2(img_hist, l):
    return img_hist[l:256]

def getCHl(img_hist, l):
    return pearsonr(getHl1(img_hist, l), getHl2(img_hist, l))

def getModifiedWavelet(C, t):
    for i, row in enumerate(C):
        for j, val in enumerate(row):
            if abs(val) < t:
                C[i][j] = 0
    return C

def getE(img, t):
    coeffs = pywt.dwt2(img, 'haar')
    LL, (LH, HL, HH) = coeffs
    LH = getModifiedWavelet(LH, t)
    HL = getModifiedWavelet(HL, t)
    HH = getModifiedWavelet(HH, t)
    img_denoised = pywt.idwt2((LL, (LH, HL, HH)), 'haar')
    E = img - img_denoised
    return E

def getCE(img, t, k, l):
    E = getE(img, t)
    return autocor(E, k, l)

# --- Main CF Extraction Function ---
def extract_cf_features(image_array, scaler, pca):
    """
    Extracts 41 correlation features (CF) from a given grayscale image array.
    Features are normalized and reduced using a pre-trained scaler and PCA model.
    """
    features = []
    bit_planes = getBitPlanes(image_array)

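    # Bit plane correlation
    # Note: np.binary_repr returns bits MSB-first, so index 0 is the most
    # significant bit plane and index 7 is the LSB. The indices are left
    # unchanged here to preserve the notebook's existing behavior.
    M1 = bit_planes[0]
    M2 = bit_planes[1]
    features.append(pearsonr(M1.flatten(), M2.flatten())[0])

    # Autocorrelation on M1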
    autocor_kl_pairs = [[1, 0], [2, 0], [3, 0], [4, 0], [0, 1], [0, 2], [0, 3], [0, 4],
                        [1, 1], [2, 2], [3, 3], [4, 4], [1, 2], [2, 1]]
    for k, l in autocor_kl_pairs:
        features.append(autocor(M1, k, l)[0])

    # Histogram even vs odd correlation
    img_hist, _ = np.histogram(image_array.flatten(), bins=256, density=True)
    He = [img_hist[i] for i in range(0, 256, 2)]
    Ho = [img_hist[i] for i in range(1, 256, 2)]
    features.append(pearsonr(He, Ho)[0])

    # Histogram shifts
    for i in range(1, 5):
        features.append(getCHl(img_hist, i)[0])

    # Wavelet residual correlations
    autocor_tkl_triplets = [[1.5, 0, 1], [1.5, 1, 0], [1.5, 1, 1], [1.5, 0, 2], [1.5, 2, 0], [1.5, 1, 2], [1.5, 2, 1],
                            [2, 0, 1], [2, 1, 0], [2, 1, 1], [2, 0, 2], [2, 2, 0], [2, 1, 2], [2, 2, 1],
                            [2.5, 0, 1], [2.5, 1, 0], [2.5, 1, 1], [2.5, 0, 2], [2.5, 2, 0], [2.5, 1, 2], [2.5, 2, 1]]
    for t, k, l in autocor_tkl_triplets:
        features.append(getCE(image_array, t, k, l)[0])

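    # Convert to a NumPy array, then apply the fitted scaler and PCA
    features = np.array(features)
    features = scaler.transform(features.reshape(1, -1))
    features = pca.transform(features)

    # Note: the vector printed below is the PCA-reduced output,
    # not the 41 raw feature values
    print("\nExtracted CF Feature Vector (41 values):")
    print(features)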

    return features
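
`getBitPlanes` indexes into the string returned by `np.binary_repr`, which lists bits from most significant to least significant. The sketch below checks that ordering on a tiny array and shows a vectorized equivalent built from bit shifts. `bit_planes_msb_first` is a hypothetical helper and not part of the notebook.

```python
import numpy as np

def bit_planes_msb_first(img):
    """Return 8 bit planes ordered like np.binary_repr: index 0 = MSB, index 7 = LSB."""
    img = np.asarray(img, dtype=np.uint8)
    return [(img >> (7 - i)) & 1 for i in range(8)]

tiny = np.array([[128, 1], [255, 0]], dtype=np.uint8)
planes = bit_planes_msb_first(tiny)

print(np.binary_repr(128, width=8))  # '10000000': the set bit is at string index 0
print(planes[0])  # most significant bit plane
print(planes[7])  # least significant bit plane
```

For a 512x512 image this avoids building 262,144 Python strings, and it makes explicit which index corresponds to which bit.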

In [15]:
# --- Main Prediction Function Using Extracted Features ---
def run_prediction(scaler, pca, clf):
    """
    Asks user for an image URL, downloads it, extracts features, and predicts stego/cover.
    """
    image_url = input("\nEnter the URL of the image to test (must be 512x512 grayscale .pgm): ")
    print("\nDownloading and processing image...")

    try:
        resp = requests.get(image_url, timeout=30)
        resp.raise_for_status()  # Stop early on HTTP errors such as 404
        image_array = np.asarray(bytearray(resp.content), dtype=np.uint8)
        image = cv2.imdecode(image_array, cv2.IMREAD_GRAYSCALE)

        if image is None or image.shape != (512, 512):
            raise ValueError("The input image must be a 512x512 grayscale image.")

        print("Extracting CF features and making prediction...")
        features = extract_cf_features(image, scaler, pca)
        prediction = clf.predict(features)
        result = "Steg Image (LSB Matching Detected)" if prediction[0] == 1 else "Cover Image (No LSB Matching)"
        print("\nPrediction Result:", result)
        return result

    except Exception as e:
        print("\nError processing the image:", e)
        return None

# Run the prediction (requires the fitted scaler, PCA, and classifier from the cells above)
run_prediction(scaler, pca, clf)


Enter the URL of the image to test (must be 512x512 grayscale .pgm): https://raw.githubusercontent.com/Sourish1997/steganalysis/master/bossbase_sample/10.pgm

Downloading and processing image...
Extracting CF features and making prediction...

Extracted CF Feature Vector (41 values):
[[ 0.64193957  2.3636459   0.58906108  0.89164574 -0.56640136 -1.21387322
   1.15298357 -0.15932857 -0.09841518 -0.30340162]]

Prediction Result: Steg Image (LSB Matching Detected)


'Steg Image (LSB Matching Detected)'


Explanation of the Process:
1. Input: The user provided the URL of the image to test.
2. Feature Extraction: The program extracted the 41 CF (Correlation Features) values described in the project report. These features capture spatial information from the image: bit-plane correlations, histogram correlations, and autocorrelations of the wavelet-denoising residual.
3. Preprocessing: The extracted features were transformed with the objects fitted during training so that they are compatible with the model. This includes:

  3.1 Normalization using StandardScaler.

  3.2 Dimensionality reduction using Principal Component Analysis (PCA), from 41 features to 10 components.
4. Prediction: The preprocessed features were passed to the trained Random Forest classifier.
5. Output: The model predicted that the image contains LSB matching steganography, classifying it as a Steg Image.

Key Points from the Output:
1. Prediction: The model classified the image as a Steg Image, meaning it detected signs of LSB matching steganography.
2. Confidence: The output does not include a confidence score. The accuracy of 75.52% and F-score of 79.30% reported in the project, which are significantly better than its benchmark Gaussian Naïve Bayes model, appear to describe that project's final voting ensemble (parameter-tuned MLP Classifier and AdaBoost models), not the Random Forest trained in this notebook, so they should not be read as the reliability of this prediction.
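
scikit-learn's `RandomForestClassifier` provides `predict_proba`, which averages the class probabilities predicted by the individual trees and can serve as a rough confidence score. A minimal sketch on synthetic data follows; in the notebook, the call would be made on `clf` with the vector returned by `extract_cf_features`. The variable names below are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data standing in for the PCA-reduced features
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
demo_clf.fit(X_demo, y_demo)

sample = X_demo[:1]                              # one feature vector, shape (1, 10)
proba = demo_clf.predict_proba(sample)[0]        # one probability per class
stego_index = list(demo_clf.classes_).index(1)   # column that corresponds to label 1
print(f"P(stego) = {proba[stego_index]:.3f}")
```

These probabilities are not calibrated by default, so they are better read as a relative score than as a true likelihood.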

Possible Scenarios:
1. True Positive: If the image does contain LSB matching steganography, the prediction is correct.
2. False Positive: If the image is actually a clean cover image but was classified as stego, this indicates a limitation of the model. The notebook does not measure how often this happens, because the classifier is only scored on its training data.

Limitations to Consider:
1. Image Size: The feature extraction process is designed for 512x512 grayscale images. If the decoded image does not have this shape, `run_prediction` reports an error and makes no prediction; the image is not cropped or resampled.
2. Overly Uniform Images: If the image is overly dark or bright, some CF features may result in NaN values, making it unsuitable for analysis. However, since the program completed the prediction, this issue likely did not occur here.
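
Pearson correlation is undefined when one of its inputs has zero variance, which is why a uniform image can produce NaN features. The sketch below computes the correlation with NumPy to show this and adds a guard that could run before `clf.predict`. `safe_corr` and `features_are_valid` are hypothetical names and not part of the notebook.

```python
import numpy as np

def safe_corr(a, b):
    """Pearson correlation that returns NaN for constant input instead of warning."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    if a.std() == 0 or b.std() == 0:
        return float("nan")
    return float(np.corrcoef(a, b)[0, 1])

def features_are_valid(features):
    """True only if every feature is a finite number (no NaN or inf)."""
    return bool(np.isfinite(np.asarray(features, dtype=float)).all())

uniform = np.zeros((4, 4))           # stands in for a completely flat image region
ramp = np.arange(16).reshape(4, 4)   # a region with variation

feats = [safe_corr(uniform, ramp), safe_corr(ramp, ramp)]
print(feats, features_are_valid(feats))
```

Running such a check on the raw 41-value vector before scaling would let the program report an unsuitable image explicitly instead of failing inside the PCA step.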