# Image classifier - complete pipeline

## Approach:

`find relevant image characteristics` $\rightarrow$ `create feature vector creation pipeline` $\rightarrow$ `iterate through dataset` 

The pipeline we are implementing loads images with our `ImageProcessor` and passes them through our `FeatureExtractor`, which returns a feature vector composed of the results of our different analysis methods. This way we are not constrained to a rigid processing system that can only handle one fixed type of feature vector.

For displaying images we're using an `ImageVisualizer` tool.

Our feature vector is a `(n,) numpy array` (`n` is the number of features extracted) that can later be added to some database or hashed for quick lookup of similar images.
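As a minimal sketch of how per-feature results could be concatenated into such a `(n,)` vector and compared with another image's vector: the helper names `flatten_features` and `cosine` are hypothetical and not part of this notebook, and the feature values are made up for illustration.

```python
import numpy as np

def flatten_features(feature_dict, order):
    """Concatenate scalar and array features into one flat float vector.

    `order` fixes the feature order so vectors from different images line up.
    """
    parts = [np.atleast_1d(np.asarray(feature_dict[name], dtype=float)).ravel()
             for name in order]
    return np.concatenate(parts)

def cosine(a, b):
    """Cosine similarity of two 1-D vectors (assumes neither is all zeros)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative values only, not taken from the dataset.
order = ["mean_intensity", "edge_density", "color_histogram"]
img_a = {"mean_intensity": 120.0, "edge_density": 0.10,
         "color_histogram": np.array([0.2, 0.5, 0.3])}
img_b = {"mean_intensity": 118.0, "edge_density": 0.12,
         "color_histogram": np.array([0.25, 0.45, 0.3])}

vec_a = flatten_features(img_a, order)
vec_b = flatten_features(img_b, order)
print(vec_a.shape)  # (5,)
print(cosine(vec_a, vec_b))
```

Features on very different scales (here `mean_intensity` versus the histogram bins) will dominate a raw similarity score, so some per-feature scaling would likely be needed before comparing real images.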

In [1]:
import cv2
import os
import numpy as np
import matplotlib.pyplot as plt
import random
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity
import sqlite3
import json

# Loading and displaying images


In [2]:
class ImageProcessor:
    def __init__(self, directory):
        self.directory = directory
        self.images = []
        self.downscaled_images = []
        
    def load_images(self):
        self.images = []
        for filename in os.listdir(self.directory):
            filepath = os.path.join(self.directory, filename)
            if os.path.isfile(filepath):
                img = cv2.imread(filepath)
                if img is not None:
                    self.images.append((filename, img))
    
    def downscale_images(self, factor=0.01):
        self.downscaled_images = []
        for filename, img in self.images:
            height, width = img.shape[:2]
            # Clamp to at least 1 px so small images do not yield a zero-sized target
            new_size = (max(1, int(width * factor)), max(1, int(height * factor)))
            downscaled_img = cv2.resize(img, new_size, interpolation=cv2.INTER_AREA)
            self.downscaled_images.append((filename, downscaled_img))

class ImageVisualizer:
    @staticmethod
    def display_image(images, img_id=None):
        if img_id is None:
            img_id = random.randint(0, len(images)-1)
        filename, img = images[img_id]
        plt.figure()
        plt.title(filename)
        plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
        plt.axis('off')
        plt.show()
    
    @staticmethod
    def show_similar_images(original_images, similar_images_data, title="", show_plot=False, save_plot=False):
        image_dict = {filename: img for filename, img in original_images}
        fig, axes = plt.subplots(1, 5, figsize=(20, 5))
        if len(similar_images_data) < 5:
            axes = axes.flat[:len(similar_images_data)]
        
        for ax, (filename, _, _, similarity) in zip(axes, similar_images_data[:5]):
            img = image_dict[filename]
            ax.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
            ax.set_title(f"{filename}\nSimilarity: {similarity:.2f}")
            ax.axis('off')
        fig.suptitle(title)
        plt.tight_layout()
        if save_plot: plt.savefig(f"{title.replace(' ', '_')}_comparison.png")
        if show_plot: plt.show()


# Image Characteristics for feature vector

Our `FeatureExtractor` is designed to be modular: any image processing function can be added later without touching the pipeline built around the extraction of image features.
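The registry idea behind this can be sketched in isolation. The classes below are toy stand-ins that work on a plain list of pixel values, not the notebook's actual extractor; `MiniExtractor`, `ExtendedExtractor` and `intensity_range` are hypothetical names used only for illustration.

```python
class MiniExtractor:
    """Toy stand-in for the feature extractor; works on a flat list of pixel values."""

    def __init__(self, pixels):
        self.pixels = pixels
        # Name -> method registry; the extraction loop below never needs to change.
        self.feature_functions = {
            "mean_intensity": self._mean_intensity,
        }

    def generate(self, features):
        unknown = [name for name in features if name not in self.feature_functions]
        if unknown:
            raise ValueError(f"Features not implemented: {unknown}")
        return {name: self.feature_functions[name]() for name in features}

    def _mean_intensity(self):
        return sum(self.pixels) / len(self.pixels)


class ExtendedExtractor(MiniExtractor):
    """Adds one feature by registering it, without touching the base class."""

    def __init__(self, pixels):
        super().__init__(pixels)
        self.feature_functions["intensity_range"] = self._intensity_range

    def _intensity_range(self):
        return max(self.pixels) - min(self.pixels)


extractor = ExtendedExtractor([10, 20, 30, 40])
result = extractor.generate(["mean_intensity", "intensity_range"])
print(result)  # {'mean_intensity': 25.0, 'intensity_range': 30}
```

Because the loop only looks names up in the dictionary, a new feature costs one method plus one registry entry.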

In [3]:
class FeatureExtractor:
    """
    A modular, object-oriented feature extraction class for image analysis.

    This class allows dynamic and extensible extraction of a wide variety of image features
    from a given image. It supports selective feature computation, meaning you can
    request only the features you need, and the resulting feature vector will adapt accordingly,
    while maintaining a consistent ordering.

    Attributes:
    -----------
    filename : str
        The name or path of the image file (used for identification/logging purposes).
    image : np.ndarray
        The image data as a NumPy array (BGR format, typically from cv2.imread).
    feature_functions : dict
        A mapping of feature names to their corresponding extraction methods.

    Methods:
    --------
    generate_feature_vector(features: list[str]) -> dict
        Extracts the selected features from the image and returns a dictionary that maps
        each requested feature name to its value. Each feature can be a scalar or an array;
        the dictionary keeps the order in which the features were requested.

    Design Rationale:
    -----------------
    - **Flexibility**: Users can specify exactly which features to extract by passing a list of feature names.
    - **Extensibility**: New features can be added simply by writing a new method and registering it
      in the `feature_functions` dictionary. This avoids modifying core logic and encourages modular design.
    - **Consistency**: Regardless of the number or type of features requested, the output is always
      a dictionary keyed by feature name, so downstream code such as the database layer can handle any feature set the same way.
    - **Encapsulation**: Image handling and feature logic are neatly encapsulated within the class.

    Example Usage:
    --------------
    >>> import cv2
    >>> img = cv2.imread("example.jpg")
    >>> extractor = FeatureExtractor([("example.jpg", img)], 0)
    >>> features = extractor.generate_feature_vector(["mean_intensity", "edge_density", "color_histogram"])
    >>> print(list(features))  # ['mean_intensity', 'edge_density', 'color_histogram']

    Adding a New Feature:
    ---------------------
    1. Define a new method following the `_feature_name(self)` naming pattern.
       The method should return a scalar or 1D/2D array-like output.
       
       Example:
       >>> def _texture_entropy(self):
       ...     from skimage.measure import shannon_entropy
       ...     return shannon_entropy(self.image)

    2. Register the new method in `self.feature_functions` inside `__init__`:
       >>> self.feature_functions["texture_entropy"] = self._texture_entropy

    3. Now, you can request "texture_entropy" as part of your feature list.
    """

    def __init__(self, images, img_id):
        self.filename, self.image = images[img_id] # contains filename, image
        self.feature_functions = {
            "extract_dominant_colors": self._extract_dominant_colors_kmeans,
            "extract_dominant_colors_fingerprint": self._extract_dominant_colors_fingerprint,
            "mean_intensity": self._mean_intensity,
            "edge_density": self._edge_density,
            "color_histogram": self._color_histogram,
            "fft_fingerprint": self._fft_fingerprinting
        }

    def generate_feature_vector(self, features: list[str]) -> dict:
        extracted = {}

        for feature_name in features:
            if feature_name not in self.feature_functions:
                raise ValueError(f"Feature '{feature_name}' not implemented.")
            
            feature_value = self.feature_functions[feature_name]()
            extracted[feature_name] = feature_value

        return extracted

    # ===== feature extraction methods =====
    
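    def _extract_dominant_colors_kmeans(self):
        # Placeholder: fail loudly instead of silently returning None
        raise NotImplementedError("Dominant color extraction (k-means) is not implemented yet.")

    def _extract_dominant_colors_fingerprint(self):
        # Placeholder: fail loudly instead of silently returning None
        raise NotImplementedError("Dominant color fingerprint is not implemented yet.")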

    def _mean_intensity(self):
        return np.mean(self.image)

    def _edge_density(self):
        edges = cv2.Canny(self.image, 100, 200)
        return np.sum(edges > 0) / edges.size

    def _color_histogram(self, bins=8):
        """Returns histogram of number of bins for each color
        output: numpy array shape (bins*3, )
        """
        chans = cv2.split(self.image)
        features = []
        for chan in chans:
            hist = cv2.calcHist([chan], [0], None, [bins], [0, 256])
            hist = cv2.normalize(hist, hist).flatten()
            features.extend(hist)
        return np.array(features)
    
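    def _fft_fingerprinting(self):
        # Placeholder: fail loudly instead of silently returning None
        raise NotImplementedError("FFT fingerprinting is not implemented yet.")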

# Database management class

This version of the `DatabaseManager` infers column names and types from the features used. It determines the SQL data type of each column by checking the Python type of the corresponding value in the dictionary of extracted features and builds the queries accordingly.
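A stripped-down sketch of this type inference, using only the standard library: plain lists stand in for NumPy arrays, the sample values are made up, and `sql_type` is a hypothetical helper rather than the class method defined in this notebook.

```python
import sqlite3

def sql_type(value):
    """Map a Python value to an SQLite column type (lists stand in for arrays)."""
    if isinstance(value, (list, tuple)):
        return "TEXT"      # stored later as a JSON string
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "REAL"
    return "TEXT"          # fallback for anything else

# Illustrative sample of extracted features for one image.
sample = {
    "mean_intensity": 120.5,
    "edge_density": 0.1,
    "color_histogram": [0.2, 0.5, 0.3],
}

columns = ["id INTEGER PRIMARY KEY AUTOINCREMENT", "filename TEXT UNIQUE"]
columns += [f"{name} {sql_type(value)}" for name, value in sample.items()]
query = f"CREATE TABLE IF NOT EXISTS features ({', '.join(columns)})"

conn = sqlite3.connect(":memory:")
conn.execute(query)
# PRAGMA table_info returns one row per column: (cid, name, type, ...)
schema = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(features)")]
conn.close()
print(schema)
```

The printed schema lists `mean_intensity` and `edge_density` as `REAL` and `color_histogram` as `TEXT`, so a different feature list produces a different table without any change to the query-building code.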

In [4]:
class DatabaseManager:
    """
    Manages SQLite database operations for storing and retrieving image features.
    This class dynamically creates table columns based on the feature names used while extracting them.
    """
    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.cursor = self.conn.cursor()

    def _get_sql_type(self, value):
        """
        Infers an appropriate SQL data type from the Python type of a feature value.
        """
        if isinstance(value, (np.ndarray, list)):
            return "TEXT"  # arrays and lists are stored as JSON-serialized text
        if isinstance(value, (int, np.integer)):
            return "INTEGER" # checks for int, specifies INTEGER for db
        if isinstance(value, (float, np.floating)):
            return "REAL"
        return "TEXT" # default value if no other type is found

    def create_table(self, features_to_extract, sample_feature_dict):
        """
        Creates the 'features' table if it doesn't already exist.
        Uses the dictionary of extracted features to create the columns and set their data types.
        """
        columns_sql = ["id INTEGER PRIMARY KEY AUTOINCREMENT", "filename TEXT UNIQUE"]
        for feature_name in features_to_extract:
            if feature_name in sample_feature_dict:
                sample_value = sample_feature_dict[feature_name]
                sql_type = self._get_sql_type(sample_value)  # infer the column type from the sample value
                # Column name for SQL
                safe_feature_name = ''.join(c for c in feature_name if c.isalnum() or c == "_")
                columns_sql.append(f"{safe_feature_name} {sql_type}")
        
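        # Join first: reusing double quotes inside an f-string is a syntax error before Python 3.12
        columns_joined = ", ".join(columns_sql)
        create_table_query = f"CREATE TABLE IF NOT EXISTS features ({columns_joined})"

        self.cursor.execute(create_table_query)
        self.conn.commit()
        print("Database and table are ready")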

    def insert_feature_vector(self, filename, feature_dict):
        """
        Dynamically inserts any image's features into the database
        It handles column names and values serialization automatically.        
        """
        column_names = ["filename"] + list(feature_dict.keys())

        values = [filename]
        for value in feature_dict.values():
            if isinstance(value, np.ndarray):
                values.append(json.dumps(value.tolist())) # serialize numpy arrays
            else:
                values.append(value)
        
        placeholders = ", ".join(["?"] * len(column_names))
        # ensure column names are sanitized for safety
        safe_column_names = ", ".join("".join(c for c in name if c.isalnum() or c == "_") for name in column_names)

        insert_query = f"INSERT INTO features ({safe_column_names}) VALUES ({placeholders})"
        
        try:
            self.cursor.execute(insert_query, tuple(values))
            self.conn.commit()
        except sqlite3.IntegrityError:  # raised when the UNIQUE constraint on filename is violated
            print(f"Features for '{filename}' already exist in the database. Skipping.")
        
    def close(self):
        """Close the database connection."""
        if self.conn:
            self.conn.close()
            print("Database connection is closed.")

# Execution

The following code block creates an SQLite database from the images located in the specified directory.
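For orientation, the round trip this relies on (array to JSON text, into SQLite, and back) can be sketched on its own. This is a minimal example with an in-memory database and made-up values; the table layout is simplified and `example.jpg` is a placeholder filename.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE features (filename TEXT UNIQUE, mean_intensity REAL, color_histogram TEXT)"
)

histogram = [0.2, 0.5, 0.3]  # stands in for a NumPy array converted with .tolist()
row = ("example.jpg", 120.5, json.dumps(histogram))
conn.execute("INSERT INTO features VALUES (?, ?, ?)", row)

# A second insert with the same filename violates the UNIQUE constraint.
try:
    conn.execute("INSERT INTO features VALUES (?, ?, ?)", row)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

filename, mean_intensity, hist_text = conn.execute(
    "SELECT filename, mean_intensity, color_histogram FROM features"
).fetchone()
restored = json.loads(hist_text)  # back to a Python list
conn.close()

print(filename, mean_intensity, restored, duplicate_rejected)
# example.jpg 120.5 [0.2, 0.5, 0.3] True
```

The `UNIQUE` constraint on `filename` is what makes re-running the extraction skip images that are already stored.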

In [None]:
# GLOBAL CONSTANTS
DIR = r"C:\Users\anton\OneDrive\Documents\HSD\sem4\DAISY_2025_images_for_bigdata"
os.environ['LOKY_MAX_CPU_COUNT'] = '12'  # Set to your actual core count


testing_id = 1
features_to_extract = [
    "mean_intensity",
    "edge_density",
    "color_histogram"
]

# Build the database name from the selected features, e.g. "...with-mean_intensity-edge_density-..."
features_as_string = "-" + "-".join(features_to_extract)

db_name = f"image_features_with{features_as_string}.db"


processor = ImageProcessor(DIR)
db_manager = DatabaseManager(db_name)
visualizer = ImageVisualizer()


# load images and downscale for faster processing
processor.load_images()
processor.downscale_images(factor=0.01)
images_to_process = processor.downscaled_images
print(f"Found {len(images_to_process)} images to process")


# create database dynamically
if images_to_process:
    # generate a sample feature dictionary from the first image to define the DB schema
    print("Generating sample features to define database schema...")
    sample_extractor = FeatureExtractor(images_to_process, 0)
    sample_features = sample_extractor.generate_feature_vector(features_to_extract)

    # create table dynamically based on schema
    db_manager.create_table(features_to_extract, sample_features)
else:
    print("No images found to process")
    # no idea how one would debug this case, haha


# feature extraction and storage
try:
    if images_to_process:
        total_images = len(images_to_process)
        for i in range(total_images):
            filename, _ = images_to_process[i]

            # initialize extractor for the current image
            extractor = FeatureExtractor(images_to_process, i)

            # generate feature dict
            feature_dict = extractor.generate_feature_vector(features_to_extract)

            # insert the features into the db
            db_manager.insert_feature_vector(filename, feature_dict)

            print(f"Processed and stored features for image {i+1}/{total_images}: {filename}")
finally:
    db_manager.close()

print("\nFeature extraction and storage complete.")

Found 108 images to process
Generating sample features to define database schema...
Database and table are ready
Processed and stored features for image 1/108: 20250328_101537.jpg
Processed and stored features for image 2/108: 20250328_101618.jpg
Processed and stored features for image 3/108: 20250328_101653.jpg
Processed and stored features for image 4/108: 20250328_101657.jpg
Processed and stored features for image 5/108: 20250328_101820.jpg
Processed and stored features for image 6/108: 20250328_101852.jpg
Processed and stored features for image 7/108: 20250328_101935.jpg
Processed and stored features for image 8/108: 20250328_102018.jpg
Processed and stored features for image 9/108: 20250328_102144.jpg
Processed and stored features for image 10/108: 20250328_102238.jpg
Processed and stored features for image 11/108: 20250328_102309.jpg
Processed and stored features for image 12/108: 20250328_102425.jpg
Processed and stored features for image 13/108: 20250328_102530.jpg
Processed a
