# Package Install

Installing the following packages provides the technical foundation for the analysis: `tensorflow` supplies deep learning functionality through the Keras API for building and training models; `numpy` provides efficient numerical array operations and general mathematical routines; `pandas` handles structured, tabular data; `matplotlib` and `seaborn` produce the statistical visualisations used in exploratory data analysis and in communicating results; `scikit-learn` offers utilities for data preprocessing, evaluation metrics, and dataset partitioning; and `pillow` loads and manipulates images, which is needed for vision tasks and for inspecting the data.

Uncomment the line that matches your kernel if the packages have not been installed yet (see the installation step in README.md).

In [None]:
#%conda install -c conda-forge tensorflow numpy pandas matplotlib seaborn pillow scikit-learn tqdm editdistance pyspark keras
%pip install --upgrade pip setuptools wheel
%pip install -r requirements.txt

# Import Libraries

The imports set up the analysis and modelling environment for the CAPTCHA recognition project. Core data handling relies on `numpy` for numerical array manipulation and `pandas` for structured DataFrame operations. These libraries support dataset preparation, statistical summaries, and tabular exploration.

Computer vision and deep learning capability comes from `tensorflow` and `tensorflow.keras`, which provide the framework for building, compiling, training, and optimizing a model. This includes convolutional layers, activation functions, callbacks that control training, and utilities for saving and loading models.

Visualization and diagnostic evaluation use `matplotlib.pyplot` for customizable plots and `seaborn` for statistical graphics, giving a clear presentation of distributions, model performance curves, and findings from exploratory analysis. `scikit-learn` supplies the model evaluation tools, such as confusion matrices, ROC and precision-recall curves, AUC scores, and other predictive performance metrics, together with `train_test_split` for dividing the dataset into training, validation, and test subsets.

Image processing uses the Python Imaging Library (`PIL`) for loading, converting, and inspecting the CAPTCHA images, which makes it possible to check image dimensions, color patterns, and artifacts that could affect model behavior. Supporting libraries include `tqdm` for progress monitoring and `glob` and `pathlib` for managing file paths, and random seeds are fixed for Python, NumPy, and TensorFlow for reproducibility. Together, these complete an environment for data preparation, exploratory data analysis, development of convolutional models, and empirical evaluation.

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
from pathlib import Path
from PIL import Image, ImageDraw, ImageFont
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, Callback
from tensorflow.keras.models import load_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_curve, auc, precision_recall_curve, average_precision_score
from tqdm import tqdm
import json
from datetime import datetime
from collections import Counter
import glob

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql.functions import col, regexp_extract, input_file_name, explode, split, count, desc
import pyspark.sql.functions as F
import threading

import warnings
warnings.filterwarnings('ignore')

random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

print("TensorFlow version:", tf.__version__)
print("GPU Available:", tf.config.list_physical_devices('GPU'))

In [None]:
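def build_spark():
    """Create a local SparkSession and store it in the global `spark` (None on failure)."""
    global spark
    try:
        spark = (
            SparkSession.builder
            .appName("CAPTCHA_Solver")
            .master("local[*]")
            .getOrCreate()
        )
        spark.sparkContext.setLogLevel("WARN")
    except Exception:
        spark = None

# Timeout in seconds
TIMEOUT = 20

# Default to "no Spark" so the name exists even if the thread never assigns it
spark = None

# Build the session in a background thread so a stalled start-up cannot block the notebook
thread = threading.Thread(target=build_spark, daemon=True)
thread.start()
thread.join(TIMEOUT)

if thread.is_alive():
    print(f"SparkSession creation exceeded {TIMEOUT} seconds. Skipping...")
    spark = None
elif spark is not None:
    print("SparkSession created successfully.")
else:
    print("SparkSession failed to create.")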


# Project Constants & Configuration

This section defines the foundational parameters and directory structure needed to manage and train on the large-scale CAPTCHA image dataset. The project uses a dataset of over one million images, available through the following links:
- **Kaggle Link:** [1M Big CAPTCHA Dataset](https://www.kaggle.com/datasets/muzzamalhameed/1m-big-captcha-dataset/data?select=147EAhiwL4CPT5ez8LDUakKsIO5BbiUNS)

- **Google Drive Link:** [1M Big CAPTCHA Dataset](https://drive.google.com/drive/folders/147EAhiwL4CPT5ez8LDUakKsIO5BbiUNS?usp=drive_link)

- **Dataset Link:** [Dataset OneDrive Folder](https://advtechonline-my.sharepoint.com/:f:/g/personal/st10204772_vcconnect_edu_za/IgCZ5bCXuRigRrgs4mNaeBcFAd4aw2KVIZfDxlmBo4CNgKw?e=Q3DNsh)

Other key configuration parameters are defined for reproducibility, including a fixed random state that yields identical data splits across runs. Image preprocessing settings (target height and width) and the batch size are standardized across all inputs to keep training consistent.

Training and evaluation parameters, including the number of epochs, the learning rate, and the proportions of data reserved for testing and validation, guide model development and performance assessment. Output directories are created up front so that results, figures, and metrics are stored systematically, which keeps experiments traceable and the project organized.

In [None]:
# Image preprocessing parameters
IMG_HEIGHT = 100
IMG_WIDTH = 200
BATCH_SIZE = 32

# Training parameters
EPOCHS = 50
LEARNING_RATE = 0.005
TEST_SPLIT = 0.2
VAL_SPLIT = 0.1
RANDOM_STATE = 42

# Results directory
BASE_DIR = Path().resolve()
DATA_DIR = BASE_DIR / "1M_Big_Captcha_Dataset"
RESULTS_DIR = "results/"
FIGURE_DIR = "results/figures/"
METRIC_DIR = "results/metrics/"
os.makedirs(RESULTS_DIR, exist_ok=True)
os.makedirs(FIGURE_DIR, exist_ok=True)
os.makedirs(METRIC_DIR, exist_ok=True)

# Data Loading and Preprocessing Classes

This section defines an extensible framework for loading, preprocessing, and preparing large-scale CAPTCHA image datasets for deep learning. At its core, the `CAPTCHADataLoader` class addresses the main dataset challenges: memory efficiency, label encoding, and full-colour image preprocessing for CNN training.

The loader provides two ways of discovering images on disk. If `PySpark` is available, it is used to distribute the file listing, which can speed up handling of very large datasets. If `PySpark` is unavailable or fails, the loader falls back to the `TensorFlow` loading path, which scans the directory recursively to identify images while keeping a structured mapping of paths to labels. The pipeline can therefore run both in large-scale environments and in more resource-constrained ones.
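
The recursive scan in the fallback path can be sketched with the standard library alone. This is a minimal illustration rather than the loader's implementation: `scan_images`, the placeholder file names, and the case-insensitive extension check are assumptions made for the example.

```python
import tempfile
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # same extensions the loader scans for


def scan_images(root):
    """Recursively collect path/label records; the label is the file name without extension."""
    records = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            records.append({"image_path": str(path), "label": path.stem})
    return records


# Build a throwaway directory tree with empty placeholder files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "batch_a").mkdir()
    (Path(tmp) / "batch_a" / "x7Kp2Q.png").touch()
    (Path(tmp) / "Ab3dE9.jpg").touch()
    (Path(tmp) / "notes.txt").touch()  # ignored: not an image extension
    found = scan_images(tmp)

found_labels = sorted(record["label"] for record in found)
print(found_labels)  # ['Ab3dE9', 'x7Kp2Q']
```

Each record has the same two fields (`image_path`, `label`) that the loader stores in its DataFrame, so the list could be passed straight to `pd.DataFrame`.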

Once the images are found, the loader extracts each label from the filename by removing the file extension. It then builds character-to-number and number-to-character mappings, which are needed to encode textual labels in a numerical form that a neural network can be trained on. Multi-character labels are one-hot encoded per character position, allowing the model to predict all characters of a sequence simultaneously. A decoding method translates model outputs back into readable text, making predictions easier to assess and verify.
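
The mapping, encoding, and decoding steps described above can be followed on a toy example. The sketch below is framework-free and uses hypothetical labels and helper names (`build_mappings`, `encode_label`, `decode`); it mirrors the idea of the loader's methods rather than reproducing them.

```python
def build_mappings(labels):
    """Sorted character vocabulary plus forward and reverse lookup tables."""
    characters = sorted(set("".join(labels)))
    char_to_num = {char: idx for idx, char in enumerate(characters)}
    num_to_char = {idx: char for char, idx in char_to_num.items()}
    return characters, char_to_num, num_to_char


def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def encode_label(label, char_to_num, max_length):
    """One one-hot vector per character position; missing positions use index 0."""
    num_classes = len(char_to_num)
    indices = [char_to_num.get(label[i], 0) if i < len(label) else 0
               for i in range(max_length)]
    return [one_hot(idx, num_classes) for idx in indices]


def decode(vectors, num_to_char):
    """Pick the highest-scoring class at each position and join the characters."""
    return "".join(num_to_char[max(range(len(v)), key=v.__getitem__)] for v in vectors)


toy_labels = ["ab1", "b2a"]  # hypothetical labels, not taken from the dataset
characters, char_to_num, num_to_char = build_mappings(toy_labels)
encoded = encode_label("b2a", char_to_num, max_length=3)

print(characters)                    # ['1', '2', 'a', 'b']
print(encoded[0])                    # [0.0, 0.0, 0.0, 1.0]
print(decode(encoded, num_to_char))  # b2a
```

Because index 0 doubles as the fill value for missing positions in this scheme, a padded position decodes to the first character of the vocabulary. This only matters when a label is shorter than the maximum length.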

Preprocessing is done on the fly to balance memory usage with training efficiency. Images are resized to consistent dimensions, normalized to the [0, 1] range, and kept in RGB to preserve the colour features that help distinguish characters in CAPTCHAs. During training, optional augmentation randomly adjusts brightness, contrast, and saturation and adds Gaussian noise to simulate the distortions and variability common in CAPTCHA images. Such augmentations help the model generalize to unseen CAPTCHAs.

The `create_dataset` function builds a TensorFlow `tf.data.Dataset` pipeline that covers the major steps: loading images, preprocessing, and encoding labels. It supports batching, shuffling, and prefetching, with images decoded lazily and in parallel, to limit I/O bottlenecks during training. The result is a reproducible and scalable data pipeline suited to CNN-based training on multi-character CAPTCHA recognition tasks.
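
The lazy loading, shuffling, and batching behaviour can be pictured with a small generator. This is a framework-free sketch and not how `tf.data` is implemented: `batched_pipeline`, its parameters, and the file names are hypothetical, and the parallel mapping and prefetching that `tf.data` adds are left out.

```python
import random


def batched_pipeline(samples, batch_size, shuffle=False, seed=42, transform=None):
    """Yield lists of processed samples, doing the work only as batches are requested."""
    order = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(order)  # shuffle cheap indices, not decoded images
    batch = []
    for idx in order:
        item = samples[idx]
        batch.append(transform(item) if transform else item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


paths = [f"img_{i}.png" for i in range(7)]  # hypothetical file names
batches = list(batched_pipeline(paths, batch_size=3, transform=str.upper))
print([len(b) for b in batches])  # [3, 3, 1]
print(batches[0])                 # ['IMG_0.PNG', 'IMG_1.PNG', 'IMG_2.PNG']
```

Here `transform` stands in for image decoding and augmentation: it runs only when a batch is actually consumed, which is what keeps memory usage flat for a dataset far larger than RAM.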

In [None]:
class CAPTCHADataLoader:
    """
    Handles loading of the CAPTCHA dataset.
    Falls back to TensorFlow-based loading if PySpark is unavailable or fails.
    """

    def __init__(self, image_dir, img_height=IMG_HEIGHT, img_width=IMG_WIDTH):
        # Initialise class attributes
        self.image_dir = image_dir
        self.img_height = img_height
        self.img_width = img_width
        self.characters = None
        self.char_to_num = None
        self.num_to_char = None
        self.data_df = None
        self.load_method = None

    # Loads dataset using Spark
    def load_with_spark(self):
        # The session cell may leave `spark` as None or an empty placeholder; treat both as "no Spark"
        if not spark:
            return self.load_with_tensorflow()

        try:
            image_extensions = ['jpg', 'jpeg', 'png']
            all_data = []

            for ext in image_extensions:
                pattern = os.path.join(self.image_dir, f"*.{ext}")
                files_rdd = spark.sparkContext.binaryFiles(pattern)
                paths = files_rdd.map(lambda x: x[0]).collect()

                for path in paths:
                    filename = os.path.basename(path)
                    label = os.path.splitext(filename)[0]
                    all_data.append((path, filename, label, len(label)))

            if not all_data:
                raise Exception("No images found with PySpark.")

            schema = StructType([
                StructField("image_path", StringType(), True),
                StructField("filename", StringType(), True),
                StructField("label", StringType(), True),
                StructField("label_length", IntegerType(), True)
            ])
            spark_df = spark.createDataFrame(all_data, schema)
            self.data_df = spark_df.select("image_path", "label").toPandas()
            self.load_method = "pyspark"
            return self.data_df

        # Calls function to Load images with Tensorflow if an error occurs
        except Exception:
            return self.load_with_tensorflow()

    # Loads dataset using Tensorflow
    def load_with_tensorflow(self):
        image_extensions = ['jpg', 'jpeg', 'png']
        all_paths = []

        for ext in image_extensions:
            pattern = os.path.join(self.image_dir, f"**/*.{ext}")
            all_paths.extend(glob.glob(pattern, recursive=True))

        if not all_paths:
            raise Exception(f"No images found in {self.image_dir}")

        data = [{'image_path': path, 'label': os.path.splitext(os.path.basename(path))[0]}
                for path in all_paths]

        self.data_df = pd.DataFrame(data)
        self.load_method = "tensorflow"
        return self.data_df

    def build_character_mappings(self, labels):
        all_chars = sorted(set(''.join(labels)))
        self.characters = all_chars
        self.char_to_num = {char: idx for idx, char in enumerate(all_chars)}
        self.num_to_char = {idx: char for idx, char in enumerate(all_chars)}

    # gets unprocessed images straight from dataset
    def get_data(self, limit=None, max_label_length=6):
        # If the dataset has not been loaded yet, load it now
        if self.data_df is None:
            self.load_with_spark()

        if self.data_df is None or self.data_df.empty:
            return [], []

        self.data_df = self.data_df[self.data_df['label'].str.len() <= max_label_length]

        if limit:
            subset = self.data_df.head(limit)
        else:
            subset = self.data_df

        image_paths = subset['image_path'].tolist()
        labels = subset['label'].tolist()

        if self.characters is None and labels:
            self.build_character_mappings(labels)

        return image_paths, labels

    # Image preprocessing function, per image
    def preprocess_image(self, img_path):
        """
        TensorFlow-native image loading and preprocessing for images.
        """
        img = tf.io.read_file(img_path)
        img = tf.image.decode_image(img, channels=3, expand_animations=False)
        img = tf.image.resize(img, [self.img_height, self.img_width])
        img = tf.cast(img, tf.float32) / 255.0
        return img

    # Label encoding function, per label
    def encode_single_label(self, label):
        encoded = [self.char_to_num.get(char, 0) for char in label]
        return encoded

    # Encoding label function for all labels
    def encode_labels(self, labels, max_length=6):
        if self.char_to_num is None:
            raise ValueError("Character mappings not built.")

        encoded_labels = []
        for i in range(max_length):
            position_labels = [
                self.char_to_num.get(label[i], 0) if i < len(label) else 0
                for label in labels
            ]
            one_hot = tf.keras.utils.to_categorical(
                position_labels, num_classes=len(self.characters))
            encoded_labels.append(one_hot)

        return encoded_labels

    # Decodes predictions
    def decode_predictions(self, predictions):
        if self.num_to_char is None:
            raise ValueError("Character mappings not built.")

        decoded_texts = []
        batch_size = predictions[0].shape[0]

        for i in range(batch_size):
            text = ""
            for char_pred in predictions:
                char_idx = np.argmax(char_pred[i])
                if char_idx < len(self.characters):
                    text += self.num_to_char[char_idx]
            decoded_texts.append(text)

        return decoded_texts


In [None]:
def load_and_preprocess_image(image_path, img_height, img_width, is_training=False):
    """Load and preprocess a single CAPTCHA image."""
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3) 
    image = tf.image.resize(image, [img_height, img_width])
    image = tf.cast(image, tf.float32) / 255.0

    if is_training:
        # Augmentations that help with CAPTCHA distortion
        image = tf.image.random_brightness(image, max_delta=0.25)
        image = tf.image.random_contrast(image, lower=0.8, upper=1.3)
        image = tf.image.random_saturation(image, lower=0.8, upper=1.2)

        # Slight Gaussian noise to improve robustness
        noise = tf.random.normal(tf.shape(image), mean=0.0, stddev=0.03)
        image = tf.clip_by_value(image + noise, 0.0, 1.0)

    return image


def create_dataset(
    image_paths, 
    labels, 
    data_loader, 
    max_label_length=6, 
    batch_size=BATCH_SIZE, 
    shuffle=True, 
    is_training=False
):
    """Creates an efficient, tf.data pipeline for CAPTCHA training."""
    
    num_classes = len(data_loader.characters)
    
    # Encode labels with padding
    encoded_labels = [
        [data_loader.char_to_num.get(char, data_loader.char_to_num.get(' ', 0)) 
         for char in label.ljust(max_label_length, ' ')]
        for label in labels
    ]
    encoded_labels_tensor = tf.convert_to_tensor(encoded_labels, dtype=tf.int32)

    def process_single_sample(image_path, encoded_chars):
        # Load and preprocess the image
        image = load_and_preprocess_image(
            image_path, 
            data_loader.img_height, 
            data_loader.img_width, 
            is_training=is_training
        )
        # Convert each character to one-hot encoded
        one_hot_labels = {
            f'char_{i+1}': tf.one_hot(encoded_chars[i], num_classes)
            for i in range(max_label_length)
        }
        return image, one_hot_labels

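    dataset = tf.data.Dataset.from_tensor_slices((image_paths, encoded_labels_tensor))

    if shuffle:
        # Shuffle the lightweight (path, label) pairs before decoding. Caching after
        # the map would replay the first epoch's random augmentations, so the
        # dataset is reshuffled and re-augmented on every epoch instead.
        dataset = dataset.shuffle(buffer_size=max(1, len(image_paths)))

    dataset = dataset.map(process_single_sample, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return dataset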

# Exploratory Data Analysis & Loading of Dataset

This section initializes the CAPTCHA dataset and explores it to understand the characteristics of the data before training a CNN model. The **data loader** is instantiated with the dataset directory and target image dimensions, giving access to image paths and labels. A subset of the first 50 000 images is loaded to keep the analysis efficient while remaining large enough for meaningful statistics. Labels longer than the loader's default **maximum label length** of six characters are filtered out, and this maximum determines how many character outputs the CNN needs.

To understand how often each character occurs in the dataset, character-level statistics are computed. A **Counter** object tallies the occurrences, and the 20 most frequent characters are shown in a bar plot that reveals potential class imbalances, which matter for model training and evaluation. Summary statistics, including the number of unique characters and the count of the most common character, inform the design of the output layer and loss functions.
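
One simple way to quantify the class imbalance mentioned above is the ratio between the most and least frequent characters. The sketch below uses hypothetical labels, and the `imbalance_ratio` measure is an illustrative choice rather than a statistic computed in this notebook.

```python
from collections import Counter

toy_labels = ["aab1", "ab22", "a1b2"]  # hypothetical labels for illustration
char_freq = Counter("".join(toy_labels))

total = sum(char_freq.values())
most_char, most_count = char_freq.most_common(1)[0]
least_count = min(char_freq.values())

# Ratio of the most to the least frequent class; 1.0 would mean perfectly balanced.
imbalance_ratio = most_count / least_count
# Share of all characters that each class accounts for.
share = {char: count / total for char, count in char_freq.items()}

print(most_char, most_count, imbalance_ratio)  # a 4 2.0
```

A ratio close to 1.0 suggests plain categorical cross-entropy is adequate, while a large ratio is a hint to look at per-character accuracy rather than overall accuracy alone.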

Sample CAPTCHA images are displayed to give a qualitative overview of the dataset and of the variability in character placement, fonts, and background patterns. They also allow visual inspection of problems the CNN might face, such as distorted or overlapping characters. Image dimensions and channel information are then collected over a random sample of images to check that sizes and channel depth are consistent. Unique channel counts and the most common channel configuration are printed to confirm that the images arrive in the format the model expects.

Pixel intensity distributions for the Red, Green, and Blue channels are analyzed by flattening pixel values and plotting histograms. This gives insight into brightness, contrast, and possible color biases in the dataset. Channel-wise statistics (mean, standard deviation, minimum, and maximum) quantify the variation in pixel intensity and guide preprocessing steps such as normalization or contrast adjustment. Together, the histograms and statistics help decide whether further augmentation or preprocessing is needed to improve model generalization.
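
To make the channel-wise statistics concrete, the sketch below computes them for a tiny hand-made image. The 2 × 2 image and the `channel_stats` helper are hypothetical; the notebook itself computes the same quantities with NumPy over sampled CAPTCHA images.

```python
from statistics import mean, pstdev

# A hypothetical 2x2 RGB image stored as rows of (R, G, B) tuples with 8-bit values.
image = [
    [(255, 0, 0), (255, 128, 0)],
    [(0, 128, 0), (0, 0, 0)],
]


def channel_stats(image):
    """Flatten each colour channel and summarise its intensities."""
    stats = {}
    for idx, name in enumerate(("Red", "Green", "Blue")):
        values = [pixel[idx] for row in image for pixel in row]
        stats[name] = {
            "mean": mean(values),
            "std": pstdev(values),  # population standard deviation, NumPy's default
            "min": min(values),
            "max": max(values),
        }
    return stats


stats = channel_stats(image)
for name, values in stats.items():
    print(name, values)
```

In this toy image the Blue channel is constant, so its standard deviation is zero. A channel like that carries no information, which is the kind of finding these statistics are meant to surface.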

In [None]:
# Initialize data loader
data_loader = CAPTCHADataLoader(DATA_DIR, IMG_HEIGHT, IMG_WIDTH)
# Gets 50 000 images from dataset
image_paths, labels = data_loader.get_data(limit=50000)

print(f"Total samples: {len(image_paths)}")
print(f"Label examples: {labels[:5]}")


In [None]:
# Calculates character frequency distribution
char_freq = Counter("".join(labels))

if char_freq:
    sorted_chars = char_freq.most_common()
    top_chars = sorted_chars[:20]

    if top_chars:  
        fig, ax = plt.subplots(figsize=(12, 5))
        chars, freqs = zip(*top_chars)
        ax.bar(range(len(chars)), freqs, color='#F18F01', edgecolor='black', alpha=0.8)
        ax.set_xticks(range(len(chars)))
        ax.set_xticklabels(chars, rotation=45, fontsize=10)
        ax.set_ylabel('Frequency', fontsize=11, fontweight='bold')
        ax.set_xlabel('Character', fontsize=11, fontweight='bold')
        ax.set_title('Top 20 Character Frequency Distribution', fontsize=13, fontweight='bold')
        ax.grid(alpha=0.3, axis='y')
        
        # Add value labels on bars
        for i, freq in enumerate(freqs):
            ax.text(i, freq + max(freqs)*0.01, str(freq), 
                   ha='center', va='bottom', fontsize=9, fontweight='bold')
        
        plt.tight_layout()
        plt.savefig(os.path.join(FIGURE_DIR, 'character_frequency.png'), dpi=300, bbox_inches='tight')
        plt.show()
        
        print(f"Total unique characters: {len(char_freq)}")
        print(f"Most common character: '{sorted_chars[0][0]}' ({sorted_chars[0][1]} occurrences)")
    else:
        print("No character frequency data to display")
else:
    print("No characters found in labels - check if data was loaded correctly")

In [None]:
# Displays sample CAPTCHA images
fig, axes = plt.subplots(3, 4, figsize=(15, 10))
# Selects 12 images from dataset
sample_indices = np.random.choice(len(image_paths), 12, replace=False)

for idx, ax in enumerate(axes.flatten()):
    img_path = image_paths[sample_indices[idx]]
    label = labels[sample_indices[idx]]
    
    img = Image.open(img_path)
    ax.imshow(img)
    ax.set_title(f"Label: {label}", fontsize=10, fontweight='bold')
    ax.axis('off')

plt.suptitle('Sample CAPTCHA Images from Dataset', fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.savefig(os.path.join(FIGURE_DIR, 'eda_sample_images.png'), dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Analyze image dimensions and color characteristics
dimensions = []
channels_count = []

# Sample subset of images for efficiency
sample_size = min(1000, len(image_paths))
sample_paths = np.random.choice(image_paths, sample_size, replace=False)
successful_reads = 0

for img_path in tqdm(sample_paths):
    try:
        with Image.open(img_path) as img:
            dimensions.append(img.size)
            channels_count.append(len(img.getbands()))
            successful_reads += 1
    except Exception as e:
        pass

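if successful_reads > 0:
    dimensions = np.array(dimensions)
    # PIL reports size as (width, height); summarise the sizes found in the sample
    unique_dims, dim_counts = np.unique(dimensions, axis=0, return_counts=True)
    common_w, common_h = unique_dims[dim_counts.argmax()]
    print("Image Dimensions:")
    print(f"  Unique sizes: {len(unique_dims)}")
    print(f"  Most common (width x height): {common_w} x {common_h}")
    print("Color Channels:")
    print(f"  Unique channel counts: {np.unique(channels_count)}")
    print(f"  Most common: {np.bincount(channels_count).argmax()} channels")
else:
    print("No images could be read - check the dataset path")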

In [None]:
# Analyze pixel intensity distributions
pixel_intensities = {'Red': [], 'Green': [], 'Blue': []}
sample_size = min(500, len(image_paths))
sample_paths = np.random.choice(image_paths, sample_size, replace=False)

successful_reads = 0

for img_path in tqdm(sample_paths):
    try:
        with Image.open(img_path) as img:
            img_rgb = img.convert('RGB')
            img_array = np.array(img_rgb)
            pixel_intensities['Red'].extend(img_array[:, :, 0].flatten())
            pixel_intensities['Green'].extend(img_array[:, :, 1].flatten())
            pixel_intensities['Blue'].extend(img_array[:, :, 2].flatten())
            successful_reads += 1
    except Exception as e:
        print(f"Error processing {img_path}: {e}")
        pass

print(f"Successfully processed {successful_reads}/{sample_size} images")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
colors = ['#E63946', '#06A77D', '#0077B6']
channel_names = ['Red', 'Green', 'Blue']

for idx, (channel, color) in enumerate(zip(channel_names, colors)):
    intensities = np.array(pixel_intensities[channel])
    if len(intensities) > 0:
        axes[idx].hist(intensities, bins=50, color=color, edgecolor='black', alpha=0.7)
        axes[idx].set_xlabel('Pixel Intensity', fontsize=11, fontweight='bold')
        axes[idx].set_ylabel('Frequency', fontsize=11, fontweight='bold')
        axes[idx].set_title(f'{channel} Channel Distribution', fontsize=12, fontweight='bold')
        axes[idx].grid(alpha=0.3, axis='y')
        
        print(f"\n{channel} Channel Statistics:")
        print(f"  Mean: {intensities.mean():.2f}")
        print(f"  Std: {intensities.std():.2f}")
        print(f"  Min: {intensities.min()}")
        print(f"  Max: {intensities.max()}")
    else:
        axes[idx].text(0.5, 0.5, 'No Data', ha='center', va='center', transform=axes[idx].transAxes)
        axes[idx].set_title(f'{channel} Channel (No Data)', fontsize=12, fontweight='bold')

plt.suptitle('Pixel Intensity Distributions Across Color Channels', fontsize=13, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig(os.path.join(FIGURE_DIR, 'eda_pixel_intensity.png'), dpi=300, bbox_inches='tight')
plt.show()

# Train-Test-Validation Split

The dataset is divided randomly into training, validation, and testing subsets containing 70%, 10%, and 20% of the data, respectively. No stratification is applied (`stratify=None`), so the balance of labels across the subsets relies on random sampling from a large dataset. The 70% `training set` provides enough examples for the model to learn the patterns, features, and distortions found in CAPTCHA images, which supports generalization across different character sequences, colors, and distortions.

The `validation set`, representing 10% of the data, works as an independent set that monitors the model's performance during training. It enables the tuning of hyperparameters, the evaluation of overfitting, and the assessment of training stability without exposing the model to the test set. Thus, using a separate validation subset allows controlled guidance and optimization of the model's learning process.

The `test set`, containing 20% of the data, is kept for final evaluation and hence provides an unbiased measure of the model's generalization ability to unseen CAPTCHAs. This split ensures that evaluation metrics reflect real-world performance and helps identify potential weaknesses or biases in the trained model.
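
Because the validation subset is carved out of what remains after the test split, the fraction requested in the second split has to be rescaled to reach 10% of the full dataset. The arithmetic can be checked with a short sketch. `split_counts` is a hypothetical helper, and it uses simple rounding, so exact counts may differ by one from a library's own rounding rules.

```python
def split_counts(n_samples, test_fraction, val_fraction):
    """Sizes of a two-stage split where both fractions refer to the full dataset."""
    n_test = round(n_samples * test_fraction)
    remainder = n_samples - n_test
    # Fraction to request from the remainder so validation is val_fraction of the whole.
    val_of_remainder = val_fraction / (1.0 - test_fraction)
    n_val = round(remainder * val_of_remainder)
    n_train = remainder - n_val
    return n_train, n_val, n_test, val_of_remainder


n_train, n_val, n_test, val_of_remainder = split_counts(50_000, 0.2, 0.1)
print(n_train, n_val, n_test)      # 35000 5000 10000
print(round(val_of_remainder, 3))  # 0.125
```

Requesting 0.1 of the remainder instead would give 4 000 validation and 36 000 training samples, that is, a 72% / 8% / 20% split.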

In [None]:
# Splits data into subsets
train_paths, test_paths, train_labels, test_labels = train_test_split(
    image_paths, labels,
    test_size=TEST_SPLIT,
    random_state=RANDOM_STATE,
    stratify=None
)

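# VAL_SPLIT is a share of the full dataset, so rescale it to the share of the
# data that remains after the test split (0.1 / 0.8 = 0.125)
train_paths, val_paths, train_labels, val_labels = train_test_split(
    train_paths, train_labels,
    test_size=VAL_SPLIT / (1 - TEST_SPLIT),
    random_state=RANDOM_STATE,
    stratify=None
)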

print(f"Dataset split:")
print(f"  Training samples: {len(train_paths)}")
print(f"  Validation samples: {len(val_paths)}")
print(f"  Test samples: {len(test_paths)}")

# Dataset Preparation & Creation

This section builds the data pipelines for the training, validation, and test subsets. Each dataset is generated with batching and optional shuffling to manage memory usage and training speed. The training dataset is shuffled and augmented to improve model generalization, while the validation and test datasets keep their original order so that evaluation is consistent.

The batch size comes from the predefined configuration, so the model processes several images at once during both training and evaluation. The pipeline also uses `tf.data.AUTOTUNE` to tune prefetching and parallel processing automatically. Image preprocessing, normalization, and label encoding happen on the fly, so all images are properly formatted when they reach the convolutional neural network.

In [None]:
train_dataset = create_dataset(
    train_paths, 
    train_labels, 
    data_loader,
    batch_size=BATCH_SIZE,
    shuffle=True,
    is_training=True
)

val_dataset = create_dataset(
    val_paths,
    val_labels,
    data_loader,
    batch_size=BATCH_SIZE,
    shuffle=False,
    is_training=False
)

test_dataset = create_dataset(
    test_paths,
    test_labels,
    data_loader,
    batch_size=BATCH_SIZE,
    shuffle=False,
    is_training=False
)


# Convolutional Neural Network Architecture for CAPTCHA Recognition

This section describes a deep convolutional neural network designed for CAPTCHA recognition, where the inputs are usually sets of distorted, rotated, or overlapping characters. CAPTCHA-solving CNNs need architectures that extract both local and global features, capture the order of the characters, and remain robust against noise, distortion, and variation in character position. Each design choice in this network addresses one of these challenges.

**Convolutional Blocks and Kernel Sizes**  
The network starts with a series of convolutional blocks. The first block uses large 7 × 7 kernels to capture broad, low-level features such as edges, strokes, and basic character shapes (Krizhevsky, Sutskever and Hinton, 2012). These features are critical for distinguishing between visually similar characters in distorted CAPTCHA images. Larger kernels provide a wider spatial context in the early stages, which is important for the initial recognition of character structures that may be partially occluded or warped (Goodfellow, Bengio and Courville, 2016). The following blocks use smaller 5 × 5 and 3 × 3 kernels with increasing filter counts to capture medium- and fine-grained features. These layers focus on subtle differences between characters, small local distortions, and patterns that recur across character positions (Rawat and Wang, 2017). Stacking convolutional layers with smaller kernels lets the network learn hierarchical feature representations by gradually combining simple features into more complex ones specific to characters and sequences (Lecun, Bengio and Hinton, 2015).
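
How much of the input a unit can "see" after several stacked layers is its receptive field, which can be computed with a short recurrence. The layer stack below is an assumption made for illustration (stride-1 convolutions, each followed by a 2 × 2 max pool with stride 2); it is not read from the model definition.

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel_size, stride) layers, in input pixels."""
    field, jump = 1, 1
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump  # each layer widens the field by (k - 1) steps
        jump *= stride                     # strides compound the step between outputs
    return field


# Assumed layout: 7x7, 5x5 and 3x3 convolutions (stride 1),
# each followed by a 2x2 max pool with stride 2.
stack = [(7, 1), (2, 2), (5, 1), (2, 2), (3, 1), (2, 2)]
print(receptive_field(stack))         # 30
print(receptive_field([(3, 1)] * 3))  # 7
```

The second call shows why stacking small kernels works: three 3 × 3 convolutions cover the same 7 × 7 window as a single 7 × 7 kernel, with fewer weights and more non-linearities in between.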

**Batch Normalization**  
Batch normalization is applied after nearly every convolutional and dense layer. It stabilizes learning by normalizing the activations, reducing internal covariate shift, and allowing higher learning rates (Goodfellow, Bengio and Courville, 2016). It also has a regularizing effect that helps the network generalize to previously unseen CAPTCHAs, which usually contain a wide variety of distortions and noise patterns (Rawat and Wang, 2017).
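
The normalization step itself is small enough to write out. The sketch below shows the training-time computation for a single feature across one batch, with hypothetical activation values; at inference time, frameworks replace the batch statistics with moving averages collected during training, which this sketch omits.

```python
from statistics import mean, pvariance


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a batch, then apply the learned scale and shift."""
    mu = mean(values)
    var = pvariance(values, mu)
    return [gamma * (v - mu) / (var + eps) ** 0.5 + beta for v in values]


activations = [2.0, 4.0, 6.0, 8.0]  # hypothetical activations of one channel
normalised = batch_norm(activations)
print([round(v, 3) for v in normalised])  # [-1.342, -0.447, 0.447, 1.342]
```

The output has zero mean and close to unit variance whatever the scale of the input, and `gamma` and `beta` let the network undo the normalization where that helps.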

**Pooling Layers and Feature Consolidation**  
Each convolutional block is followed by a max pooling layer that decreases the spatial dimensions while retaining the most salient features (Rawat and Wang, 2017). Pooling consolidates information, makes feature maps more robust to minor spatial variations, and reduces computational complexity. The last convolutional block is followed by a global average pooling layer that converts the spatial feature maps into a compact feature vector representing the overall content of the CAPTCHA image. This operation summarizes each feature map by its mean activation and significantly reduces the number of parameters, which helps prevent overfitting given the high variability in CAPTCHA datasets (Lecun, Bengio and Hinton, 2015).
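
Both pooling operations can be traced by hand on a small feature map. The sketch below uses a hypothetical 4 × 4 map and plain Python lists; the 2 × 2 window with stride 2 is an assumption, since the pool size is not stated in the text.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2-D list whose sides are even."""
    return [
        [max(feature_map[r][c], feature_map[r][c + 1],
             feature_map[r + 1][c], feature_map[r + 1][c + 1])
         for c in range(0, len(feature_map[0]), 2)]
        for r in range(0, len(feature_map), 2)
    ]


def global_average_pool(feature_maps):
    """Collapse each channel's 2-D map to its mean, giving one value per channel."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(fmap)
print(pooled)                               # [[4, 2], [2, 8]]
print(global_average_pool([fmap, pooled]))  # [2.6875, 4.0]
```

Max pooling halves each spatial dimension and keeps the strongest response per window. Global average pooling goes further and produces one number per channel, so the dense layers that follow need far fewer weights than they would after flattening.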

**Dropout and Regularization**  
Dropout is applied at numerous points in the network. In the convolutional part the rate increases with depth, from 0.2 after the first block to 0.5 after global average pooling; in the fully connected branches it is 0.4 after the first dense layer and 0.3 after the second. This acts as regularization for the network, ensuring that it does not come to rely on any one particular feature (Srivastava et al., 2014). The higher dropout in deeper convolutional layers balances the higher capacity of deeper feature representations against the risk of overfitting, which is very important when processing complex, high-dimensional CAPTCHA images (Rawat and Wang, 2017).
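As a minimal sketch of what a dropout layer does during training, the snippet below implements so-called inverted dropout: each activation is zeroed with probability `rate`, and the survivors are scaled by `1 / (1 - rate)` so the expected activation is unchanged. The function name and the seeded generator are illustrative; this is not the Keras implementation.

```python
import random


def inverted_dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale the rest by 1/(1-rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)                    # fixed seed for a repeatable illustration
activations = [1.0] * 10000
dropped = inverted_dropout(activations, rate=0.5, rng=rng)

kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(round(kept_fraction, 2), round(mean_after, 2))
```

Roughly half of the activations survive, each doubled, so the mean stays near 1.0. At inference time dropout is disabled and no rescaling is needed.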

**Multi-Branch Fully Connected Layers for Character Prediction**  
After feature extraction, the network splits into independent, fully connected branches, one for every character position in the CAPTCHA. Each branch consists of two dense layers with batch normalization and dropout, followed by a softmax output layer predicting the probability distribution over all possible characters (Lecun, Bengio and Hinton, 2015). This multi-branch design is characteristic of CAPTCHA-solving CNNs; it allows the network to make independent predictions for each character while all branches share the same convolutional feature representation. This avoids recurrent layers while still tying each output to a specific position in the sequence, up to the fixed maximum label length the model is built with (Rawat and Wang, 2017).
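The per-position outputs map to and from text labels in a straightforward way. The sketch below uses a hypothetical four-symbol alphabet and illustrative helper names; only the output naming scheme `char_1`, `char_2`, ... is taken from the model defined in this section.

```python
def encode_label(label, charset):
    """One-hot targets keyed by output name, e.g. {'char_1': [...], ...}."""
    targets = {}
    for position, char in enumerate(label, start=1):
        one_hot = [0.0] * len(charset)
        one_hot[charset.index(char)] = 1.0
        targets[f"char_{position}"] = one_hot
    return targets


def decode_prediction(per_position_probs, charset):
    """Pick the most probable character independently at each position."""
    chars = []
    for probs in per_position_probs:
        best_index = max(range(len(probs)), key=lambda i: probs[i])
        chars.append(charset[best_index])
    return "".join(chars)


charset = "abc1"                          # hypothetical alphabet
targets = encode_label("c1a", charset)
probs = [
    [0.1, 0.1, 0.7, 0.1],                 # position 1 -> 'c'
    [0.2, 0.1, 0.1, 0.6],                 # position 2 -> '1'
    [0.5, 0.3, 0.1, 0.1],                 # position 3 -> 'a'
]
decoded = decode_prediction(probs, charset)
print(decoded)                            # c1a
```

Because each position is decoded on its own, a single wrong branch is enough to make the whole CAPTCHA prediction wrong.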

**Design Rationale for CAPTCHA Solving**  
- Large initial kernels capture broad character structures, aiding early feature detection (Krizhevsky, Sutskever and Hinton, 2012). 
- Smaller kernels in deeper blocks allow the recognition of fine-grained patterns, which are critical in distinguishing similar characters (Rawat and Wang, 2017). 
- Increasing the filter counts allows the network to learn higher-level diverse feature representations (Lecun, Bengio and Hinton, 2015). 
- Batch normalization stabilizes training and speeds up convergence. It also regularizes the network (Goodfellow, Bengio and Courville, 2016). 
- Max and global average pooling consolidate spatial information while reducing the model parameters, hence improving generalization (Lecun, Bengio and Hinton, 2015).
- Dropout across convolutional and fully-connected layers helps mitigate overfitting from noisy and distorted CAPTCHA data (Srivastava et al., 2014). 
- The multi-branch output structure allows independent character predictions to be made while leveraging shared features, thus balancing efficiency and accuracy in sequence recognition (Rawat and Wang, 2017).

In [None]:
def build_cnn_model(img_height, img_width, num_characters, max_label_length):
    input_img = layers.Input(shape=(img_height, img_width, 3), name='image_input')
    
    # Initial convolution to capture character-level features
    x = layers.Conv2D(64, (7, 7), activation='relu', padding='same')(input_img)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(64, (7, 7), activation='relu', padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 2))(x) 
    x = layers.Dropout(0.2)(x)

    # Second block - optimized for horizontal features
    x = layers.Conv2D(128, (5, 5), activation='relu', padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(128, (5, 5), activation='relu', padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Dropout(0.25)(x)

    # Third block - character detail extraction
    x = layers.Conv2D(256, (3, 3), activation='relu', padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(256, (3, 3), activation='relu', padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 2))(x) 
    x = layers.Dropout(0.3)(x)

    # Fourth block - high-level feature consolidation
    x = layers.Conv2D(512, (3, 3), activation='relu', padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(512, (3, 3), activation='relu', padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Dropout(0.35)(x)

    # Final feature consolidation
    x = layers.Conv2D(512, (3, 3), activation='relu', padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.5)(x)
    
    # Separate branches per character
    outputs = []
    for i in range(max_label_length):
        char_branch = layers.Dense(512, activation='relu', name=f'char_{i+1}_dense1')(x)
        char_branch = layers.BatchNormalization(name=f'char_{i+1}_bn1')(char_branch)
        char_branch = layers.Dropout(0.4)(char_branch)
        char_branch = layers.Dense(256, activation='relu', name=f'char_{i+1}_dense2')(char_branch)
        char_branch = layers.BatchNormalization(name=f'char_{i+1}_bn2')(char_branch)
        char_branch = layers.Dropout(0.3)(char_branch)
        char_output = layers.Dense(num_characters, activation='softmax', name=f'char_{i+1}')(char_branch)
        outputs.append(char_output)

    return models.Model(inputs=input_img, outputs=outputs, name='CAPTCHA_CNN_Solver')

# Initialization and Compilation of Model

This section prepares the convolutional neural network for either training or evaluation. If a previously trained model exists at the given path, it is loaded so that training can be resumed or evaluation carried out without repeating earlier computation. If there is no saved model, a new CNN model is created from the architecture defined above for CAPTCHA recognition.

This network is then compiled with a separate categorical cross-entropy loss for each character position, reflecting the multi-output nature of the CAPTCHA-solving task. Each character branch reports its own accuracy metric, so learning can be monitored position by position. The Adam optimizer updates the network weights, with the learning rate set by `LEARNING_RATE` to trade off convergence speed against stability. This setup prepares the model for learning from the preprocessed CAPTCHA datasets, while appropriate metrics are available to measure the progress of learning. A summary of the model then lists the layers, output branches, and parameter counts, which is useful for verifying the design of the network before training.
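The quantity being minimized can be written out by hand. The following sketch assumes default (equal) loss weights, in which case the total loss is the sum of the per-position cross-entropies; the two-position example and the helper name are hypothetical.

```python
import math


def categorical_crossentropy(one_hot, probs, eps=1e-7):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


# Hypothetical targets and softmax outputs for a two-character CAPTCHA
targets = {"char_1": [1, 0, 0], "char_2": [0, 1, 0]}
predictions = {"char_1": [0.7, 0.2, 0.1], "char_2": [0.1, 0.8, 0.1]}

per_output_loss = {
    name: categorical_crossentropy(targets[name], predictions[name])
    for name in targets
}
total_loss = sum(per_output_loss.values())

print({k: round(v, 4) for k, v in per_output_loss.items()}, round(total_loss, 4))
```

Only the probability assigned to the true character contributes to each term, so the total here is `-(ln 0.7 + ln 0.8)`.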

In [None]:
model_path = os.path.join(METRIC_DIR, "best_model.keras")

# Either loads existing model or builds new model
if os.path.exists(model_path):
    model = load_model(model_path)
else:
    model = build_cnn_model(
        IMG_HEIGHT,
        IMG_WIDTH,
        len(data_loader.characters),
        max_label_length=6
    )

losses = {f'char_{i}': 'categorical_crossentropy' for i in range(1, 7)}
metrics = {f'char_{i}': 'accuracy' for i in range(1, 7)}

# Compiles model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
    loss=losses,
    metrics=metrics
)

model.summary()

# Training Configuration and Metrics Logging

This section details the callback mechanisms responsible for tracking, saving, and managing model performance throughout training. A custom callback logs the per-epoch training and validation losses, as well as all metrics for each character prediction. Three separate JSON files are maintained: one for the losses of every epoch, one for the epoch with the best monitored metric, and one for the complete set of per-character metrics. This structured logging allows detailed analysis of model behavior over time and provides persistent records for later reproducibility and evaluation.

The process uses three further callbacks to help convergence and avoid overfitting. Early stopping monitors the validation loss and stops training if there is no improvement for a given number of epochs, restoring the weights from the epoch with the lowest validation loss. Model checkpointing saves the model to disk whenever the validation loss improves, so the best-performing network is kept. Learning rate reduction automatically decreases the learning rate when the validation loss has plateaued, allowing finer adjustments of the model weights, which can lead to better convergence.

The model then trains on the prepared training dataset while validating on a separate validation dataset. Metrics logging, early stopping, checkpointing, and learning rate reduction run together, guiding the training process with real-time monitoring, automatic adjustments, and performance tracking for each character in the CAPTCHA sequences. This configuration supports efficient and reliable training while keeping detailed records for analysis and model evaluation.
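How early stopping and learning rate reduction interact can be illustrated by replaying a list of validation losses. The sketch below is a deliberately simplified model of the two callbacks, not the Keras implementation (it ignores options such as `min_delta` and `cooldown` and shares one "best" value between both mechanisms). The patience, factor, and minimum rate match the values used in the training cell; the starting rate of `1e-3` and the loss sequence are hypothetical.

```python
def simulate_callbacks(val_losses, lr=1e-3, factor=0.5, lr_patience=3,
                       min_lr=1e-5, stop_patience=7):
    """Simplified replay of early stopping and learning-rate reduction."""
    best, best_epoch = float("inf"), None
    lr_wait = stop_wait = 0
    lr_per_epoch = []
    stopped_epoch = None
    for epoch, loss in enumerate(val_losses, start=1):
        lr_per_epoch.append(lr)          # rate used during this epoch
        if loss < best:                  # improvement resets both counters
            best, best_epoch = loss, epoch
            lr_wait = stop_wait = 0
            continue
        lr_wait += 1
        stop_wait += 1
        if lr_wait >= lr_patience:       # plateau: shrink the learning rate
            lr = max(lr * factor, min_lr)
            lr_wait = 0
        if stop_wait >= stop_patience:   # no improvement for too long: stop
            stopped_epoch = epoch
            break
    return {"best_epoch": best_epoch, "stopped_epoch": stopped_epoch,
            "lr_per_epoch": lr_per_epoch}


# Hypothetical validation losses: three improving epochs, then a plateau
result = simulate_callbacks([1.0, 0.8, 0.7] + [0.75] * 7)
print(result["best_epoch"], result["stopped_epoch"], result["lr_per_epoch"][-1])
```

In this replay the learning rate is halved twice during the plateau before training stops at epoch 10, and the weights of epoch 3 would be the ones restored.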


## Metrics Logger

In [None]:
class MetricsLogger(Callback):
    """
    Logs per-epoch training and validation loss.
    - loss_metrics.json: both losses per epoch
    - best_metrics.json: only epoch with best monitored metric 
    - all_metrics.json: all metrics every epoch
    """
    def __init__(self, log_dir, monitor="val_loss", mode="min"):
        super().__init__()
        self.log_dir = log_dir
        self.monitor = monitor
        self.loss_metrics_path = os.path.join(log_dir, "loss_metrics.json")
        self.best_metrics_path = os.path.join(log_dir, "best_metrics.json")
        self.all_metrics_path = os.path.join(log_dir, "all_metrics.json")  
        self.loss_metrics = []
        self.all_metrics = []

        # Determine comparison function based on mode
        if mode == "min":
            self.monitor_op = lambda a, b: a < b
            self.best_value = float("inf")
        elif mode == "max":
            self.monitor_op = lambda a, b: a > b
            self.best_value = float("-inf")
        else:
            raise ValueError("mode must be 'min' or 'max'")

        self.best_metrics = {}

    # Function runs after every epoch
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}

        train_loss = logs.get("loss", None)
        val_loss = logs.get("val_loss", None)
        monitored_value = logs.get(self.monitor, None)

        epoch_metrics = {
            "epoch": epoch + 1,
            "loss": train_loss,
            "val_loss": val_loss
        }
        self.loss_metrics.append(epoch_metrics)

        with open(self.loss_metrics_path, "w") as f:
            json.dump(self.loss_metrics, f, indent=4)

        # Update best metrics if this epoch's monitored value is a new best
        if monitored_value is not None and self.monitor_op(monitored_value, self.best_value):
            self.best_value = monitored_value
            self.best_metrics = epoch_metrics
            with open(self.best_metrics_path, "w") as f:
                json.dump(self.best_metrics, f, indent=4)

        all_epoch_metrics = {"epoch": epoch + 1}
        for key, value in logs.items():
            all_epoch_metrics[key] = float(value) 
        self.all_metrics.append(all_epoch_metrics)

        with open(self.all_metrics_path, "w") as f:
            json.dump(self.all_metrics, f, indent=4)


## Model Training

In [None]:
# Metric logger initialization
metrics_logger = MetricsLogger(
    log_dir=METRIC_DIR,      
    monitor="val_loss",       
    mode="min"                 
)

callbacks = [
    EarlyStopping(
        monitor='val_loss',
        patience=7,
        restore_best_weights=True,
        verbose=1
    ),
    ModelCheckpoint(
        os.path.join(METRIC_DIR, 'best_model.keras'),
        monitor='val_loss',
        save_best_only=True,
        save_weights_only=False,
        verbose=1
    ),
    ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=3,
        min_lr=1e-5,
        verbose=1
    ),
    metrics_logger
]

# Trains Model
history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=EPOCHS,
    callbacks=callbacks,
    verbose=1
)

# Model Evaluation and Performance Metrics

This part covers the evaluation of the trained CNN model on both the validation and test datasets and the computation of key performance metrics. Previously saved model weights and metrics are loaded, if available, for reproducibility and comparison across training sessions. The validation loss (`val_loss`) reflects performance on the data used to guide training, while the test loss (`test_loss`) measures generalization to unseen data.

Predictions on the test dataset are generated and decoded to character sequences, allowing for comparison with the true labels. The true labels are reconstructed from their one-hot encodings, and if the number of predicted and true labels differs, both lists are truncated to the same length before evaluation. Full CAPTCHA accuracy is calculated as the proportion of test samples in which the entire predicted sequence matches the ground truth exactly, which reflects the network's ability to solve full CAPTCHAs rather than individual characters. Character-level accuracy is derived by averaging the per-character accuracy metrics across the sequence.
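The difference between the two accuracy measures is easiest to see on a toy example. The labels below are made up for illustration, and the helper names are not part of the notebook.

```python
def captcha_accuracy(true_labels, predicted_labels):
    """Fraction of samples whose whole predicted string matches exactly."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def character_accuracy(true_labels, predicted_labels):
    """Fraction of individual characters predicted correctly."""
    correct = total = 0
    for t, p in zip(true_labels, predicted_labels):
        total += len(t)
        correct += sum(a == b for a, b in zip(t, p))
    return correct / total


true_example = ["ab12cd", "xy34zz", "qwerty"]
pred_example = ["ab12cd", "xy84zz", "qwexty"]   # one wrong character in two samples

full_acc = captcha_accuracy(true_example, pred_example)
char_acc = character_accuracy(true_example, pred_example)
print(round(full_acc, 3), round(char_acc, 3))   # 0.333 0.889
```

Full CAPTCHA accuracy is always the stricter measure: if errors at the six positions were independent, a per-character accuracy of 0.95 would give a full-sequence accuracy of only about 0.95^6 ≈ 0.74.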


## Model Evaluation

In [None]:
BEST_MODEL_PATH = os.path.join(METRIC_DIR, "best_model.keras")
BEST_METRICS_PATH = os.path.join(METRIC_DIR, "best_metrics.json")
LOSS_METRICS_PATH = os.path.join(METRIC_DIR, "loss_metrics.json")
ALL_METRICS_PATH = os.path.join(METRIC_DIR, "all_metrics.json")

# Defaults, so later cells can test these names even when no metrics files exist
best_metrics = {}
loss_metrics = []
all_metrics = []

# Loads best metrics
if os.path.exists(BEST_METRICS_PATH):
    with open(BEST_METRICS_PATH, "r") as f:
        best_metrics = json.load(f)

# Loads loss metrics
if os.path.exists(LOSS_METRICS_PATH):
    with open(LOSS_METRICS_PATH, "r") as f:
        loss_metrics = json.load(f)

# Loads all metrics
if os.path.exists(ALL_METRICS_PATH):
    with open(ALL_METRICS_PATH, "r") as f:
        all_metrics = json.load(f)
       
if model is None and os.path.exists(BEST_MODEL_PATH):
    model = tf.keras.models.load_model(BEST_MODEL_PATH)
   
# Evaluate the model on the validation dataset   
val_results = model.evaluate(val_dataset, verbose=1)
val_loss = val_results[0]
        
# Evaluate the model on the test dataset
test_results = model.evaluate(test_dataset, verbose=1)
test_loss = test_results[0]


## Model Predictions

In [None]:
# Runs predictions
raw_predictions = model.predict(test_dataset, verbose=1)
predictions = np.stack(raw_predictions, axis=1) if isinstance(raw_predictions, list) else raw_predictions

max_label_length = max(len(label) for label in labels)

print(f"\nPredictions shape: {predictions.shape}") 

# Extract true labels
true_labels = []
for images, label_dict in test_dataset:
    batch_size = label_dict['char_1'].shape[0]
    for i in range(batch_size):
        char_indices = [int(np.argmax(label_dict[f'char_{pos+1}'][i].numpy()))
                        for pos in range(max_label_length)]
        decoded = ''.join(data_loader.num_to_char[idx] for idx in char_indices)
        true_labels.append(decoded)
true_labels = np.array(true_labels)

# Decode predicted labels
predicted_labels = []
num_samples = predictions.shape[0]
for i in range(num_samples):
    char_indices = np.argmax(predictions[i], axis=-1)
    decoded = ''.join(data_loader.num_to_char[int(idx)] for idx in char_indices)
    predicted_labels.append(decoded)
predicted_labels = np.array(predicted_labels)

# Align lengths if mismatch
if len(predicted_labels) != len(true_labels):
    print(f"Warning: Pred/true mismatch {len(predicted_labels)} vs {len(true_labels)}")
    m = min(len(predicted_labels), len(true_labels))
    predicted_labels = predicted_labels[:m]
    true_labels = true_labels[:m]

# Compute CAPTCHA accuracy
absolute_correct = np.sum(predicted_labels == true_labels)
final_accuracy = absolute_correct / len(true_labels)

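# Average the per-character accuracy metrics reported by model.evaluate
accuracy_metrics = [
    value for name, value in zip(model.metrics_names, test_results)
    if "accuracy" in name.lower() and not np.isnan(value)
]
avg_character_accuracy = float(np.mean(accuracy_metrics)) if accuracy_metrics else float("nan")

print(f" CAPTCHA solving Accuracy: {final_accuracy:.4f}")
print(f" Correct: {absolute_correct}/{len(true_labels)}")
print(f" Average character accuracy: {avg_character_accuracy:.4f}")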


# Performance Visualization and Detailed Metrics Analysis

This section visualizes and analyzes the final performance of the model using several complementary approaches. Final performance metrics such as `final_accuracy`, `test_loss`, and `val_loss` are presented in a color-coded bar chart, allowing for immediate comparison of key outcomes. Loss curves for training and validation across epochs are plotted to gauge convergence stability and to reveal trends related to overfitting or underfitting during training. Average character-level accuracy is calculated per epoch, giving a finer-grained look at the model's learning dynamics at the character level.

Confusion matrices are created at both the overall and per-character-position levels, allowing for the identification of systematic errors in character prediction and positions most prone to misclassification. This gives an indication of which specific characters or CAPTCHA positions the model struggles with and informs possible future improvements in either preprocessing or adjusting model architecture. Correct and incorrect predictions are visualized with sample images; the characters are highlighted using color coding to indicate accuracy, enabling qualitative inspection of typical failure modes versus successful predictions.

Accuracy at each character position of the CAPTCHA is calculated and visualized using a gradient-colored bar chart that shows which positions have higher or lower prediction reliability. Statistical summaries highlight the strongest and weakest positions within the sequence, indicating where the network could be refined. ROC and Precision-Recall curves are created from flattened one-hot true labels and predicted probabilities, which quantify the model's discriminative capability and the precision-recall trade-off across thresholds. These visualizations, combined with metrics including `AUC` and `average_precision`, provide a comprehensive evaluation of the effectiveness of the CNN in solving CAPTCHA tasks and its ability to correctly identify characters under complex distortions and noise.
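The flattening step used for the ROC analysis treats every (sample, class) pair as one binary decision. As a small sketch under that reading, the snippet below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive entry scores higher than a randomly chosen negative one, with ties counted as one half. The data and helper names are hypothetical, and the quadratic pairwise loop is only suitable for tiny examples.

```python
def flatten(rows):
    """Turn a list of per-sample vectors into one flat list."""
    return [value for row in rows for value in row]


def pairwise_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))


one_hot_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
predicted_probs = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.5, 0.3, 0.2]]

flat_labels = flatten(one_hot_true)
flat_scores = flatten(predicted_probs)
auc = pairwise_auc(flat_labels, flat_scores)
print(round(auc, 4))                      # 0.6944
```

Because negatives vastly outnumber positives after flattening (one positive per character position against all other classes), the Precision-Recall curve is a useful complement to the ROC curve here.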

## Final Performance Metrics

The **Final Performance Metrics** section visualizes the model's key quantitative outcomes: `final_accuracy`, `test_loss`, and `val_loss`. Each metric is shown in a color-coded bar chart with values annotated, making it easier to see at a glance the general model behavior, as well as the balance between accuracy and loss.

In [None]:
# Metric arrays
metrics_names = ['CAPTCHA Accuracy', 'Test Loss', 'Validation Loss']
metrics_values = [final_accuracy, test_loss, val_loss]
metrics_colors = ['#06A77D', '#E63946', '#F18F01']

metrics_values = [float(np.atleast_1d(v)[0]) for v in metrics_values]

plt.figure(figsize=(8, 4))
bars = plt.bar(metrics_names, metrics_values, color=metrics_colors, edgecolor='black', alpha=0.7)
plt.ylabel('Value')
plt.title('Final Performance Metrics')

plt.ylim([0, max(metrics_values)*1.1 if metrics_values else 1])

# Annotate bars
for bar, val in zip(bars, metrics_values):
    plt.text(bar.get_x() + bar.get_width()/2., val + 0.01, f'{val:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.savefig(os.path.join(FIGURE_DIR, 'final_metrics.png'), dpi=300, bbox_inches='tight')
plt.show()

## Training and Validation Curves & Average Character Accuracy

The learning dynamics are evaluated by plotting the **Training and Validation Loss** over epochs. Such curves allow for the identification of underfitting, overfitting, or stable convergence patterns. Overlaying the `val_loss` gives an insight into the generalization, whereas the `loss` is indicative of how well the model learns to capture the underlying patterns in the training data.

The **Average Character Accuracy Over Epochs** gives a finer-grained view of the model's performance at the single-character level. Averaging the accuracies across all positions shows the overall learning trend for individual characters; differences between positions, which may arise from the structural or visual complexity of the CAPTCHAs, are examined separately in the per-position analysis.

In [None]:
# Plot Loss curves
if history.history and len(history.history.get('loss', [])) > 0:
    epochs_range = range(1, len(history.history['loss']) + 1)
    plt.figure(figsize=(10, 4))
    plt.plot(epochs_range, history.history['loss'], label="Training Loss", color="#2E86AB", linewidth=2)
    if 'val_loss' in history.history:
        plt.plot(epochs_range, history.history['val_loss'], label="Validation Loss", color="#A23B72", linewidth=2)
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.title("Training and Validation Loss")
    plt.legend()
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.savefig(os.path.join(FIGURE_DIR, 'loss_curves.png'), dpi=300, bbox_inches='tight')
    plt.show()
    
    # Average Character Accuracy from history
    train_acc_keys = [k for k in history.history.keys() if k.endswith('_accuracy') and not k.startswith('val_')]
    val_acc_keys = [k for k in history.history.keys() if k.startswith('val_') and k.endswith('_accuracy')]

    if train_acc_keys:
        train_acc_arrays = [history.history[k] for k in train_acc_keys]
        avg_train_acc = np.mean(train_acc_arrays, axis=0)
        avg_character_accuracy = avg_train_acc[-1]

        plt.figure(figsize=(10, 4))
        plt.plot(epochs_range, avg_train_acc, label="Training Accuracy", color="#2E86AB", linewidth=2)

        if val_acc_keys:
            val_acc_arrays = [history.history[k] for k in val_acc_keys]
            avg_val_acc = np.mean(val_acc_arrays, axis=0)
            plt.plot(epochs_range, avg_val_acc, label="Validation Accuracy", color="#A23B72", linewidth=2)
        plt.xlabel("Epoch")
        plt.ylabel("Average Character Accuracy")
        plt.title("Average Character Accuracy Over Epochs")
        plt.legend()
        plt.grid(alpha=0.3)
        plt.tight_layout()
        plt.savefig(os.path.join(FIGURE_DIR, 'avg_char_accuracy.png'), dpi=300, bbox_inches='tight')
        plt.show()
        
# Version if history.history is empty
elif loss_metrics:
    epochs_range = range(1, len(loss_metrics) + 1)

    # Per-epoch training and validation loss from the logged loss metrics
    # (separate names so the scalar val_loss from the evaluation cell is not overwritten)
    train_loss_hist = [m.get("loss", None) for m in loss_metrics]
    val_loss_hist = [m.get("val_loss", None) for m in loss_metrics]

    plt.figure(figsize=(10, 4))
    plt.plot(epochs_range, train_loss_hist, label="Training Loss", color="#2E86AB", linewidth=2)
    if any(v is not None for v in val_loss_hist):
        plt.plot(epochs_range, val_loss_hist, label="Validation Loss", color="#A23B72", linewidth=2)
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.title("Training and Validation Loss")
    plt.legend()
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.savefig(os.path.join(FIGURE_DIR, 'loss_curves.png'), dpi=300, bbox_inches='tight')
    plt.show()

    # Average Character Accuracy using all_metrics
    if all_metrics:
        avg_char_acc_per_epoch = []

        for epoch_metrics in all_metrics:
            char_acc_values = [v for k, v in epoch_metrics.items() if k.endswith('_accuracy') and k.startswith('char_')]
            if char_acc_values:
                avg_char_acc_per_epoch.append(np.mean(char_acc_values))
            else:
                avg_char_acc_per_epoch.append(np.nan)  # keeps the list numeric so it can be plotted

        # Plot Average Character Accuracy
        plt.figure(figsize=(10, 4))
        plt.plot(epochs_range, avg_char_acc_per_epoch, label="Average Character Accuracy", color="#2E86AB", linewidth=2)
        plt.xlabel("Epoch")
        plt.ylabel("Average Character Accuracy")
        plt.title("Average Character Accuracy Over Epochs")
        plt.legend()
        plt.grid(alpha=0.3)
        plt.tight_layout()
        plt.savefig(os.path.join(FIGURE_DIR, 'avg_char_accuracy.png'), dpi=300, bbox_inches='tight')
        plt.show()

## Confusion Matrices

The **Overall Character-Level Confusion Matrix** aggregates the predictions across all positions to reveal systematic misclassification patterns, while the **Character-Level Confusion Matrix for Each Position** isolates performance for each position of the CAPTCHA sequence. These matrices are displayed as heatmaps in which rows correspond to true characters and columns to predicted characters. Colour intensity shows the number of predictions in each cell, so correct predictions lie on the diagonal and misclassifications off it, and the tick labels show the actual characters, enabling the detection of biases or frequently confused characters.
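The structure of such a matrix can be sketched without any library. The example below uses three hypothetical classes and follows the same convention as the heatmaps, with rows for true classes and columns for predicted classes; `confusion_counts` is an illustrative helper.

```python
def confusion_counts(true_idx, pred_idx, num_classes):
    """Count matrix with rows = true class and columns = predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_idx, pred_idx):
        matrix[t][p] += 1
    return matrix


true_idx = [0, 0, 1, 1, 2, 2]             # hypothetical class indices
pred_idx = [0, 1, 1, 1, 2, 0]

cm_example = confusion_counts(true_idx, pred_idx, num_classes=3)
diagonal = sum(cm_example[i][i] for i in range(3))
total = sum(sum(row) for row in cm_example)
accuracy_example = diagonal / total

for row in cm_example:
    print(row)
print(round(accuracy_example, 3))         # 0.667
```

The diagonal holds the correct predictions, so accuracy is the trace divided by the sum of all cells, which is how the per-position accuracy is computed from each matrix.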

In [None]:
all_true = []
all_pred = []

for i, label in enumerate(true_labels):
    for pos, true_char in enumerate(label):
        if pos >= predictions.shape[1]:
            continue
        true_idx = data_loader.char_to_num.get(true_char, -1)
        if true_idx == -1:
            continue
        pred_idx = np.argmax(predictions[i, pos, :])
        all_true.append(true_idx)
        all_pred.append(pred_idx)

all_true = np.array(all_true)
all_pred = np.array(all_pred)

cm = confusion_matrix(all_true, all_pred, labels=list(range(len(data_loader.characters))))

fig, ax = plt.subplots(figsize=(14, 12))
sns.heatmap(cm, annot=False, cmap='YlOrRd', cbar_kws={'label': 'Count'}, ax=ax)  # cells include correct predictions on the diagonal
ax.set_title('Overall Character-Level Confusion Matrix')
ax.set_xlabel('Predicted Character')
ax.set_ylabel('True Character')
tick_labels = [data_loader.num_to_char[i] for i in range(len(data_loader.characters))]
ax.set_xticks(np.arange(len(tick_labels)) + 0.5)
ax.set_yticks(np.arange(len(tick_labels)) + 0.5)
ax.set_xticklabels(tick_labels, rotation=45)
ax.set_yticklabels(tick_labels, rotation=0)
plt.tight_layout()
plt.savefig(os.path.join(FIGURE_DIR, 'confusion_matrix_overall.png'), dpi=300, bbox_inches='tight')
plt.show()


In [None]:
# Loop over each character position
for pos in range(max_label_length):
    if predictions.ndim == 3 and predictions.shape[1] > pos:
        # Extract predictions for the current position
        char_preds = predictions[:, pos, :]
        char_predicted = np.argmax(char_preds, axis=1)
        
        # Extract true labels for the current character position
        char_true = []
        for label in test_labels:
            if len(label) > pos:
                char = label[pos]
                if char in data_loader.char_to_num:
                    char_true.append(data_loader.char_to_num[char])
                else:
                    char_true.append(-1)  # handle unknown characters
            else:
                char_true.append(-1)
        
        # Filter out invalid labels
        valid_indices = [i for i, true_val in enumerate(char_true) if true_val != -1]
        if valid_indices:
            char_true_valid = [char_true[i] for i in valid_indices]
            char_predicted_valid = [char_predicted[i] for i in valid_indices]
            
            # Compute confusion matrix
            cm = confusion_matrix(
                char_true_valid, 
                char_predicted_valid, 
                labels=list(range(len(data_loader.characters)))
            )
            
            # Plot heatmap
            fig, ax = plt.subplots(figsize=(14, 12))
            sns.heatmap(cm, annot=False, cmap='YlOrRd', cbar_kws={'label': 'Count'}, ax=ax)  # cells include correct predictions on the diagonal
            ax.set_title(f'Character-Level Confusion Matrix (Position {pos + 1})', fontsize=13, fontweight='bold', pad=20)
            ax.set_ylabel('True Character', fontsize=11, fontweight='bold')
            ax.set_xlabel('Predicted Character', fontsize=11, fontweight='bold')
            # Set tick labels to actual characters
            tick_labels = [data_loader.num_to_char.get(i, '') for i in range(len(data_loader.characters))]
            ax.set_xticks(np.arange(len(tick_labels)) + 0.5)
            ax.set_yticks(np.arange(len(tick_labels)) + 0.5)
            ax.set_xticklabels(tick_labels, rotation=45)
            ax.set_yticklabels(tick_labels, rotation=0)
            
            plt.tight_layout()
            plt.savefig(os.path.join(FIGURE_DIR, f'confusion_matrix_position_{pos + 1}.png'), dpi=300, bbox_inches='tight')
            plt.show()
            
            accuracy = np.trace(cm) / np.sum(cm)
            print(f"Accuracy for character position {pos + 1}: {accuracy:.4f}")
        else:
            print(f"No valid labels found for confusion matrix at position {pos + 1}")
    else:
        print(f"Invalid predictions shape for confusion matrix at position {pos + 1}")


## Error and Correct Prediction Image Analysis with Per-Character Highlighting

**Sample Misclassified CAPTCHA Images with Character Highlighting** gives a qualitative view of model errors. The predicted characters are printed beneath each image, with incorrectly predicted characters in red and correct ones in green, which makes it easy to see where in the sequence the model goes wrong. **Sample Correctly Classified CAPTCHA Images with Character Highlighting** uses the same layout for fully correct predictions, with every character displayed in green.

In [None]:
# Identify correct and incorrect predictions
correct_indices = [i for i, (true, pred) in enumerate(zip(test_labels, predicted_labels)) if true == pred]
incorrect_indices = [i for i, (true, pred) in enumerate(zip(test_labels, predicted_labels)) if true != pred]

print(f"Correct predictions: {len(correct_indices)} ({len(correct_indices)/len(test_labels)*100:.2f}%)")
print(f"Incorrect predictions: {len(incorrect_indices)} ({len(incorrect_indices)/len(test_labels)*100:.2f}%)")

# Per-character errors for incorrect CAPTCHAs
if incorrect_indices:
    error_lengths = []
    for idx in incorrect_indices:
        true_label = test_labels[idx]
        pred_label = predicted_labels[idx]
        min_len = min(len(true_label), len(pred_label))
        errors = sum(1 for i in range(min_len) if true_label[i] != pred_label[i])
        error_lengths.append(errors)
    
    avg_errors = np.mean(error_lengths)
    most_common_errors = max(set(error_lengths), key=error_lengths.count)
    print(f"Average wrong characters per incorrect CAPTCHA: {avg_errors:.2f}")
    print(f"Most common error count: {most_common_errors}")

# Display incorrect predictions with per-character highlighting
if incorrect_indices:
    if 'test_paths' not in locals() and 'test_paths' not in globals():
        print("Warning: test_paths not found. Cannot display misclassified images.")
    else:
        fig, axes = plt.subplots(3, 4, figsize=(15, 10))
        np.random.seed(42)
        sample_incorrect = np.random.choice(incorrect_indices, min(12, len(incorrect_indices)), replace=False)
        
        for idx, ax in enumerate(axes.flatten()):
            if idx < len(sample_incorrect):
                img_idx = sample_incorrect[idx]
                img_path = test_paths[img_idx]
                true_label = test_labels[img_idx]
                pred_label = predicted_labels[img_idx]
                
                try:
                    img = Image.open(img_path).convert("RGB")
                    ax.imshow(img)
                    ax.axis('off')

                    # Highlight incorrect characters
                    for i, (t, p) in enumerate(zip(true_label, pred_label)):
                        color = 'red' if t != p else 'green'
                        ax.text(
                            (i + 0.5) * img.width / len(true_label),  # x-position
                            img.height + 5,                           # y-position
                            p, color=color, fontsize=12, ha='center', va='top', fontweight='bold'
                        )

                    ax.set_title(f"True: {true_label}\nPred: {pred_label}", fontsize=9, fontweight='bold')
                except Exception as e:
                    ax.text(0.5, 0.5, f"Error loading image\n{img_path}", ha='center', va='center', transform=ax.transAxes)
                    ax.axis('off')
            else:
                ax.axis('off')
        
        plt.suptitle('Sample Misclassified CAPTCHA Images with Character Highlighting', fontsize=13, fontweight='bold')
        plt.tight_layout()
        plt.savefig(os.path.join(FIGURE_DIR, 'misclassified_highlighted.png'), dpi=300, bbox_inches='tight')
        plt.show()


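# Display correct predictions with per-character highlighting
# (skipped when test_paths is unavailable, mirroring the guard used for misclassified images)
if correct_indices and 'test_paths' in globals():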
    fig, axes = plt.subplots(2, 4, figsize=(15, 6))
    np.random.seed(42)
    sample_correct = np.random.choice(correct_indices, min(8, len(correct_indices)), replace=False)
    
    for idx, ax in enumerate(axes.flatten()):
        if idx < len(sample_correct):
            img_idx = sample_correct[idx]
            img_path = test_paths[img_idx]
            true_label = test_labels[img_idx]
            pred_label = predicted_labels[img_idx]
            
            try:
                img = Image.open(img_path).convert("RGB")
                ax.imshow(img)
                ax.axis('off')

                # Highlight all characters in green
                for i, (t, p) in enumerate(zip(true_label, pred_label)):
                    ax.text(
                        (i + 0.5) * img.width / len(true_label),
                        img.height + 5,
                        p, color='green', fontsize=12, ha='center', va='top', fontweight='bold'
                    )

                ax.set_title(f"True: {true_label}\nPred: {pred_label}", fontsize=9, fontweight='bold', color='green')
            except Exception:
                ax.text(0.5, 0.5, f"Error loading image\n{img_path}", ha='center', va='center', transform=ax.transAxes)
                ax.axis('off')
        else:
            ax.axis('off')
    
    plt.suptitle('Sample Correctly Classified CAPTCHA Images with Character Highlighting', fontsize=13, fontweight='bold')
    plt.tight_layout()
    plt.savefig(os.path.join(FIGURE_DIR, 'correct_highlighted.png'), dpi=300, bbox_inches='tight')
    plt.show()

## Per-character Prediction Accuracy 

**Per-Position Character Prediction Accuracy** is the fraction of test CAPTCHAs whose character at a given position is predicted correctly. A position where only the true label or only the prediction has a character is counted as an error. The analysis reports the average accuracy across positions, identifies the strongest and weakest positions, and plots the results as a color-coded bar chart that highlights relative performance differences across the CAPTCHA sequence.
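
As a small illustration of the metric described above, the sketch below computes per-position accuracy on a handful of made-up labels. The helper name `per_position_accuracy` and the toy strings are hypothetical and are not part of the notebook; the only assumption carried over is that a missing character on either side is scored as an error.

```python
def per_position_accuracy(true_labels, pred_labels, max_len):
    """Return one accuracy value per character position."""
    accuracies = []
    for pos in range(max_len):
        correct = 0
        total = 0
        for true, pred in zip(true_labels, pred_labels):
            has_true = pos < len(true)
            has_pred = pos < len(pred)
            if not (has_true or has_pred):
                continue  # neither string reaches this position
            total += 1
            # A missing character on either side counts as an error.
            if has_true and has_pred and true[pos] == pred[pos]:
                correct += 1
        accuracies.append(correct / total if total else 0.0)
    return accuracies


# Hypothetical toy labels, not taken from the dataset.
toy_true = ["ab12", "cd34", "ef56"]
toy_pred = ["ab12", "cd3", "xf56"]

toy_acc = per_position_accuracy(toy_true, toy_pred, max_len=4)
for pos, acc in enumerate(toy_acc, start=1):
    print(f"Character {pos}: {acc:.3f}")
```

In this toy case the first position is penalised by a substitution (`x` for `e`) and the last position by a prediction that is one character too short.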

In [None]:
# Per-character prediction accuracy
char_accuracies = {}
char_counts = {} 

for position in range(max_label_length):
    correct_at_pos = 0
    total_at_pos = 0
    
    for true_label, pred_label in zip(test_labels, predicted_labels):
        has_true = position < len(true_label)
        has_pred = position < len(pred_label)
        if not (has_true or has_pred):
            continue  # neither label reaches this position
        # A position counts toward the total when either label has a character there;
        # a character missing on one side is scored as an error.
        total_at_pos += 1
        if has_true and has_pred and true_label[position] == pred_label[position]:
            correct_at_pos += 1
    
    if total_at_pos > 0:
        accuracy = correct_at_pos / total_at_pos
        char_accuracies[f'Character {position+1}'] = accuracy
        char_counts[f'Character {position+1}'] = total_at_pos
    else:
        char_accuracies[f'Character {position+1}'] = 0.0
        char_counts[f'Character {position+1}'] = 0

if char_accuracies:
    fig, ax = plt.subplots(figsize=(12, 5))
    positions = list(char_accuracies.keys())
    accuracies = list(char_accuracies.values())
    
    # Three-tier color scale (darker green for higher accuracy)
    colors = ['#006D32' if acc >= 0.7 else '#06A77D' if acc >= 0.5 else '#99D8C9' for acc in accuracies]
    
    bars = ax.bar(range(len(positions)), accuracies, color=colors, edgecolor='black', alpha=0.8)
    ax.set_xticks(range(len(positions)))
    ax.set_xticklabels(positions, rotation=45, fontsize=10)
    ax.set_ylabel('Accuracy', fontsize=11, fontweight='bold')
    ax.set_xlabel('Character Position in CAPTCHA', fontsize=11, fontweight='bold')
    ax.set_title('Per-Position Character Prediction Accuracy', fontsize=12, fontweight='bold')
    ax.set_ylim([0, 1.1])  # accuracy is at most 1; keep a little headroom for the value labels
    ax.grid(alpha=0.3, axis='y')
    
    # Add value labels (accuracy as a fraction, three decimals)
    for bar, acc in zip(bars, accuracies):
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height + 0.01,
               f'{acc:.3f}', ha='center', va='bottom', fontsize=9)
    
    plt.tight_layout()
    plt.savefig(os.path.join(FIGURE_DIR, 'per_position_accuracy.png'), dpi=300, bbox_inches='tight')
    plt.show()
    
    avg_position_accuracy = np.mean(accuracies)
    weakest_position = positions[np.argmin(accuracies)]
    strongest_position = positions[np.argmax(accuracies)]
    
    print(f"Average position accuracy: {avg_position_accuracy:.4f}")
    print(f"Strongest position: {strongest_position} ({np.max(accuracies):.4f})")
    print(f"Weakest position: {weakest_position} ({np.min(accuracies):.4f})")
else:
    print("No valid data for per-position accuracy analysis")

## ROC Curve, AUC and Precision-Recall Curve

Further analysis converts the character sequences into integer-class arrays so that standard classification metrics can be computed. Precision and recall are calculated for each character position (macro-averaged over the character classes), reflecting the model's ability to identify characters under the distortions typical of CAPTCHAs. One-hot encodings of the true labels, paired with the predicted probabilities, allow `ROC` and `Precision–Recall` curves to be constructed, so the model's discriminative power can be evaluated across decision thresholds. Summary metrics such as `AUC` and `average precision` give a single-number measure of overall quality for the CAPTCHA recognition model.
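
The encoding and per-position scoring steps described above can be sketched on toy data as follows. The four-symbol alphabet, the `toy_char_to_num` mapping, and the `encode` helper are hypothetical stand-ins for the notebook's own `data_loader.char_to_num`; the sketch also assumes every label has the same length, which the array conversion requires.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical alphabet and mapping, for illustration only.
alphabet = "abcd"
toy_char_to_num = {ch: i for i, ch in enumerate(alphabet)}

toy_true = ["abcd", "dcba", "aabb", "ccdd"]
toy_pred = ["abcd", "dcbb", "aabb", "acdd"]


def encode(labels, mapping):
    """Map equal-length strings to an integer array of shape (n_samples, seq_len)."""
    return np.array([[mapping[ch] for ch in label] for label in labels])


y_true = encode(toy_true, toy_char_to_num)
y_pred = encode(toy_pred, toy_char_to_num)

toy_precision, toy_recall = [], []
for pos in range(y_true.shape[1]):
    # Macro-average: each class present at this position gets equal weight.
    p = precision_score(y_true[:, pos], y_pred[:, pos], average="macro", zero_division=0)
    r = recall_score(y_true[:, pos], y_pred[:, pos], average="macro", zero_division=0)
    toy_precision.append(p)
    toy_recall.append(r)
    print(f"Position {pos + 1}: precision={p:.3f}, recall={r:.3f}")
```

With `zero_division=0`, a class that is never predicted at a position contributes a precision of 0 rather than raising a warning, which pulls the macro average down.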

The **ROC Curve** and **Precision–Recall Curve** are used here to evaluate the discriminative capability of the classifier. ROC curves show the trade-off between true positive rate and false positive rate across all thresholds, while Precision–Recall curves emphasize performance in imbalanced scenarios. Calculated `AUC` and `average_precision` give numeric measures of robustness, showing the model's effectiveness in distinguishing between characters, even when the images are complex and distorted as in a typical CAPTCHA.
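
To make the threshold-based metrics concrete, the following sketch computes a pooled ROC AUC and average precision on synthetic data. Flattening the one-hot labels and the probabilities treats every (sample, position, class) entry as one binary decision, which amounts to micro-averaging. All sizes, the random seed, and the synthetic scores are assumptions chosen for illustration; they are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, roc_curve

rng = np.random.default_rng(0)
n_samples, seq_len, num_classes = 200, 4, 5  # hypothetical sizes

# Synthetic integer labels of shape (n_samples, seq_len)
y_true_int = rng.integers(0, num_classes, size=(n_samples, seq_len))

# One-hot targets of shape (n_samples, seq_len, num_classes)
y_true_1hot = np.eye(num_classes)[y_true_int]

# Synthetic scores: random logits with a boost on the true class, then softmax
logits = rng.normal(size=(n_samples, seq_len, num_classes)) + 2.0 * y_true_1hot
probs = np.exp(logits)
probs /= probs.sum(axis=-1, keepdims=True)

# Pool all (sample, position, class) entries into one binary problem
y_true_flat = y_true_1hot.ravel()
y_score_flat = probs.ravel()

fpr, tpr, _ = roc_curve(y_true_flat, y_score_flat)
micro_auc = auc(fpr, tpr)
micro_ap = average_precision_score(y_true_flat, y_score_flat)

print(f"Micro-averaged ROC AUC: {micro_auc:.4f}")
print(f"Micro-averaged average precision: {micro_ap:.4f}")
```

Because only one in `num_classes` entries is positive after flattening, the pooled problem is imbalanced, which is one reason to read the Precision–Recall curve alongside the ROC curve.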

In [None]:
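# Convert true/pred labels → integer-class arrays
# (uses test_labels, the same ground truth as the cells above; all labels and
# predictions must have the same length for the 2-D array conversion to work)
y_true_int = np.array([
    [data_loader.char_to_num[ch] for ch in captcha]
    for captcha in test_labels
])

y_pred_int = np.array([
    [data_loader.char_to_num[ch] for ch in captcha]
    for captcha in predicted_labels
])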

y_pred_probs = predictions
num_classes = y_pred_probs.shape[-1]
seq_len = y_pred_probs.shape[1]

# Precision & Recall per character
precision_scores = []
recall_scores = []

for i in range(seq_len):
    p = precision_score(y_true_int[:, i], y_pred_int[:, i], average='macro', zero_division=0)
    r = recall_score(y_true_int[:, i], y_pred_int[:, i], average='macro', zero_division=0)
    precision_scores.append(p)
    recall_scores.append(r)

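# Build one-hot true labels for the ROC/PR curves: shape (n_samples, seq_len, num_classes)
y_true_1hot = np.eye(num_classes)[y_true_int]

# Flatten labels and predicted probabilities so every (sample, position, class)
# entry becomes one binary decision (micro-averaged curves)
y_true_flat = y_true_1hot.ravel()
y_pred_flat = y_pred_probs.ravel()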

In [None]:
# Plot ROC curve
fpr, tpr, _ = roc_curve(y_true_flat, y_pred_flat)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, lw=2, label=f"AUC = {roc_auc:.4f}")
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.tight_layout()
plt.savefig(os.path.join(FIGURE_DIR, 'roc_curve.png'), dpi=300, bbox_inches='tight')
plt.show()


In [None]:
# Plot PR curve
precisions, recalls, thresholds = precision_recall_curve(y_true_flat, y_pred_flat)
average_precision = average_precision_score(y_true_flat, y_pred_flat)

plt.figure(figsize=(6, 6))
plt.plot(recalls, precisions, lw=2, label=f"AP = {average_precision:.4f}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision–Recall Curve")
plt.legend(loc="lower left")
plt.grid(True, linestyle="--", alpha=0.4)
plt.tight_layout()
plt.savefig(os.path.join(FIGURE_DIR, 'pr_curve.png'), dpi=300, bbox_inches='tight')
plt.show()

# Reference List

- Goodfellow, I., Bengio, Y. and Courville, A., 2016. Deep Learning. [online] pp.1–23. Available at: <https://mitpress.mit.edu/9780262035613/deep-learning/> [Accessed 17 May 2025].
- He, H. and Garcia, E.A., 2009. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, [online] 21(9), pp.1263–1284. https://doi.org/10.1109/TKDE.2008.239.
- Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. ImageNet Classification with Deep Convolutional Neural Networks. [online] Available at: <http://code.google.com/p/cuda-convnet/>.
- LeCun, Y., Bengio, Y. and Hinton, G., 2015. Deep learning. Nature, 521(7553), pp.436–444. https://doi.org/10.1038/nature14539.
- Rawat, W. and Wang, Z., 2017. Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review. Neural Computation, [online] 29(9), pp.2352–2449. https://doi.org/10.1162/NECO_A_00990.
- Shorten, C. and Khoshgoftaar, T.M., 2019. A survey on Image Data Augmentation for Deep Learning. Journal of Big Data, [online] 6(1), p.60. https://doi.org/10.1186/S40537-019-0197-0.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(56), pp.1929–1958.
 
