# Team Members

* **Student ID:** 0001111416  
  **Full Name:** Alessio Pittiglio  
  **Institutional Email:** alessio.pittiglio@studio.unibo.it

* **Student ID:** 0001086355  
  **Full Name:** Parsa Mastouri Kashani  
  **Institutional Email:** parsa.mastouri@studio.unibo.it


# **Product Recognition of Books**

## Image Processing and Computer Vision - Assignment Module \#1


Contacts:

- Prof. Giuseppe Lisanti -> giuseppe.lisanti@unibo.it
- Prof. Samuele Salti -> samuele.salti@unibo.it
- Alex Costanzino -> alex.costanzino@unibo.it
- Francesco Ballerini -> francesco.ballerini4@unibo.it

Computer vision-based object detection techniques can be applied in library or bookstore settings to build a system that identifies books on shelves.

Such a system could assist in:
* Helping visually impaired users locate books by title/author;
* Automating inventory management (e.g., detecting misplaced or out-of-stock books);
* Enabling faster book retrieval by recognizing spine text or cover designs.

## Task
Develop a computer vision system that, given a reference image for each book, identifies that book in a single picture of a shelf.

<figure>
<a href="https://ibb.co/pvLVjbM5"><img src="https://i.ibb.co/svVx9bNz/example.png" alt="example" border="0"></a>
</figure>

For each type of product displayed on the shelf, the system should compute a bounding box aligned with the book spine or cover and report:
1. Number of instances;
1. Dimension of each instance (area, in pixels, of the bounding box that encloses it);
1. Position of each instance in the image reference system (the four corners of the bounding box that encloses it);
1. Overlay of the bounding boxes on the scene images.

<font color="red"><b>Each step of this assignment must be solved using traditional computer vision techniques.</b></font>

#### Example of expected output
```
Book 0 - 2 instance(s) found:
  Instance 1 {top_left: (100,200), top_right: (110, 220), bottom_left: (10, 202), bottom_right: (10, 208), area: 230px}
  Instance 2 {top_left: (90,310), top_right: (95, 340), bottom_left: (24, 205), bottom_right: (23, 234), area: 205px}
Book 1 - 1 instance(s) found:
.
.
.
```

## Data
Two folders of images are provided:
* **Models**: contains one reference image for each product that the system should be able to identify;
* **Scenes**: contains different shelf pictures to test the developed algorithm in different scenarios.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

# !cp -r /content/drive/MyDrive/AssignmentsIPCV/dataset.zip ./
# !unzip dataset.zip

## Evaluation criteria
1. **Clarity and conciseness**. Present your work in a readable way: format your code and comment every important step;

2. **Procedural correctness**. There are several ways to solve the assignment. Design your own sound approach and justify every decision you make;

3. **Correctness of results**. Try to solve as many instances as possible. You should be able to solve all the instances of the assignment; however, a thoroughly justified and sound procedure that solves fewer instances will be valued **more** than a poorly designed and poorly justified approach that solves more or all instances.

---

# Our solution

The computer vision system we designed is based on a variant of the **Generalized Hough Transform (GHT)** that uses local invariant features such as SIFT. This approach is also often referred to as the Star Model. The system operates in two phases: an offline phase and an online phase.

### Offline phase

The goal of the offline phase is to build a model of the object to be detected starting from a template image.

1. First, we extract the local invariant features from the template image.
2. The object model is defined as a collection of these features. For each feature, we store its position $(x,y)$, its canonical orientation $\theta$, its characteristic scale $s$, and its SIFT descriptor vector.
3. A reference point is chosen for the object, typically the barycenter (centroid) of all keypoints.
4. For each keypoint, a joining vector $r$ is computed. This vector points from the keypoint's position to the reference point and is stored as an additional attribute for each keypoint.

At the end of this phase, we obtain the **Star Model**. The name comes from the fact that visually, these vectors all point toward the center, giving the appearance of a star.
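
The four offline steps can be sketched as follows. This is a minimal illustration, not the code of `ght_sift_lib`: the function name `build_star_model` and the dictionary layout are hypothetical, and the keypoint attributes are passed in as plain arrays. With OpenCV they would typically come from the `pt`, `angle` and `size` fields of each `cv2.KeyPoint`.

```python
import numpy as np


def build_star_model(positions, angles_deg, scales, descriptors):
    """Build a star model from already-extracted keypoint attributes.

    Hypothetical helper for illustration: the layout of the returned
    dictionary is an assumption, not the authors' data structure.
    """
    positions = np.asarray(positions, dtype=np.float64)
    # Reference point: barycenter (centroid) of all keypoint positions.
    reference = positions.mean(axis=0)
    # Joining vector r: points from each keypoint to the reference point.
    joining_vectors = reference - positions
    return {
        "positions": positions,
        "angles_deg": np.asarray(angles_deg, dtype=np.float64),
        "scales": np.asarray(scales, dtype=np.float64),
        "descriptors": np.asarray(descriptors, dtype=np.float32),
        "reference": reference,
        "joining_vectors": joining_vectors,
    }


# Toy model: four keypoints on the corners of a 4x2 rectangle.
toy_model = build_star_model(
    positions=[(0, 0), (4, 0), (4, 2), (0, 2)],
    angles_deg=[0, 90, 180, 270],
    scales=[1.0, 1.0, 2.0, 2.0],
    descriptors=np.zeros((4, 128)),  # placeholder 128-D SIFT descriptors
)
```

Adding each joining vector to its keypoint position gives back the same reference point, which is the property the online voting step relies on.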

### Online phase

The actual detection takes place during the online phase. The Star Model is used to find instances of the object in new images.

1. The same type of local features is extracted from the target image.
2. The descriptors of the target image's keypoints are matched against the descriptors of the keypoints stored in the Star Model. Lowe's ratio test is typically used to keep only the reliable matches and discard ambiguous ones.
3. Voting is the core of the method. For each match, a vote is cast for the most probable position of the reference point in the target image. Each match also provides hypotheses for rotation and scale.
    - For rotation, the difference between the canonical orientations of the target and model keypoints gives an estimate of the object's rotation.
    - For scale, the ratio of the characteristic scales of the target and model keypoints gives an estimate of the object's scale change.
    
    Once these estimates are obtained, we transform the joining vector from the model keypoint and add it to the target keypoint's position to cast a vote for the reference point.

4. Votes are cast into a 2D accumulator. After all matches have voted, peaks in the accumulator array indicate the likely presence and position of the object.
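
For a single match, the vote can be written as $\hat{c} = p_t + \frac{s_t}{s_m} R(\theta_t - \theta_m)\, r$, where $p_t$ is the target keypoint position and $r$ the stored joining vector. The sketch below implements this formula and a plain 2D accumulator with NumPy. The function names are hypothetical, the angles are assumed to be in degrees in image coordinates (as OpenCV keypoints report them), and the peak is taken as a simple global maximum rather than the local-maximum search a full system would use.

```python
import numpy as np


def cast_vote(target_pt, target_angle_deg, target_scale,
              model_angle_deg, model_scale, joining_vector):
    """Predict the reference point position implied by one match."""
    scale = target_scale / model_scale                       # scale hypothesis
    theta = np.deg2rad(target_angle_deg - model_angle_deg)   # rotation hypothesis
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])
    # Rotate and rescale the joining vector, then add it to the target keypoint.
    return np.asarray(target_pt, float) + scale * rot @ np.asarray(joining_vector, float)


def accumulate(votes, image_shape, bin_size):
    """Cast (x, y) votes into a 2D accumulator indexed as [row, col]."""
    height, width = image_shape
    acc = np.zeros((height // bin_size + 1, width // bin_size + 1), dtype=np.int32)
    for x, y in votes:
        if 0 <= x < width and 0 <= y < height:   # ignore votes outside the image
            acc[int(y // bin_size), int(x // bin_size)] += 1
    return acc


# Toy data: a 4x2 rectangle, seen scaled by 2, rotated by 90 degrees and shifted.
joining = [(2, 1), (-2, 1), (-2, -1), (2, -1)]
model_angles, model_scales = [0, 90, 180, 270], [1.0, 1.0, 2.0, 2.0]
target_pts = [(51, 41), (51, 49), (47, 49), (47, 41)]
target_angles, target_scales = [90, 180, 270, 0], [2.0, 2.0, 4.0, 4.0]

votes = [
    cast_vote(p, ta, ts, ma, ms, r)
    for p, ta, ts, ma, ms, r in zip(
        target_pts, target_angles, target_scales, model_angles, model_scales, joining
    )
]
acc = accumulate(votes, image_shape=(100, 100), bin_size=6)
peak = np.unravel_index(np.argmax(acc), acc.shape)  # (row, col) of the strongest bin
```

In this toy case all four matches agree on the same reference point, so a single accumulator bin collects every vote. With real data, wrong matches scatter their votes across the accumulator and only consistent matches reinforce each other.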

## Bounding box estimation phase

To this already robust system, we added a third phase to estimate a bounding box for the object. Once the peak is found (using a local maximum detection algorithm on the grid), we introduce a data structure to keep track of which matches voted for that reference point. These matches are then used to estimate a homography.
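
One simple way to keep track of the voters is a dictionary from accumulator bin to the indices of the matches that voted there. The sketch below is only an assumption about how such bookkeeping could look (the name `group_matches_by_bin` is hypothetical). The matches supporting a peak can then be passed to a homography estimator such as OpenCV's `cv2.findHomography`.

```python
from collections import defaultdict


def group_matches_by_bin(votes, bin_size):
    """Map each accumulator bin (row, col) to the indices of the matches voting for it."""
    voters = defaultdict(list)
    for match_idx, (x, y) in enumerate(votes):
        voters[(int(y // bin_size), int(x // bin_size))].append(match_idx)
    return voters


# Toy votes: three matches agree on roughly the same point, one is an outlier.
toy_votes = [(49.0, 45.2), (50.1, 44.8), (10.0, 12.0), (48.7, 46.0)]
voters = group_matches_by_bin(toy_votes, bin_size=6)

# The bin with the most voters plays the role of the accumulator peak.
peak_bin = max(voters, key=lambda b: len(voters[b]))
supporting_matches = voters[peak_bin]
```

Only the matches in `supporting_matches` would be used for pose estimation, so the outlier at index 2 never reaches the homography fit.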

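To make the system more robust, we also added a backup **similarity pose estimation** computed via linear least squares (LLS), for two reasons: (1) it needs fewer correspondences than a homography (3 in our setup, versus the 4 required for a homography; strictly, a similarity has only 4 degrees of freedom, so 2 point matches already determine it), and (2) it is quite reliable in our case, since the objects we aim to detect are mostly rectangular.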
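
A similarity transform maps $(x, y)$ to $(a x - b y + t_x,\; b x + a y + t_y)$, which is linear in the four unknowns $(a, b, t_x, t_y)$, so it can be fitted with ordinary least squares. The sketch below shows one way to do this with `np.linalg.lstsq`. The function name is hypothetical and the code is not taken from `ght_sift_lib`.

```python
import numpy as np


def fit_similarity_lls(src, dst):
    """Least-squares similarity transform (2x3 matrix) mapping src points to dst points."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    n = len(src)
    A = np.zeros((2 * n, 4))
    # Even rows encode x' = a*x - b*y + tx, odd rows encode y' = b*x + a*y + ty.
    A[0::2] = np.column_stack([src[:, 0], -src[:, 1], np.ones(n), np.zeros(n)])
    A[1::2] = np.column_stack([src[:, 1], src[:, 0], np.zeros(n), np.ones(n)])
    rhs = dst.reshape(-1)  # interleaved [x0', y0', x1', y1', ...]
    (a, b, tx, ty), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return np.array([[a, -b, tx], [b, a, ty]])


# Toy correspondences: model corners and their positions in the scene.
model_corners = np.array([(0, 0), (4, 0), (4, 2), (0, 2)], dtype=np.float64)
scene_corners = np.array([(51, 41), (51, 49), (47, 49), (47, 41)], dtype=np.float64)

M = fit_similarity_lls(model_corners, scene_corners)
scale = np.hypot(M[0, 0], M[1, 0])                       # recovered scale factor
angle_deg = np.degrees(np.arctan2(M[1, 0], M[0, 0]))     # recovered rotation
# Project the model corners into the scene to obtain the bounding box.
box = (M[:, :2] @ model_corners.T).T + M[:, 2]
```

Because a similarity preserves angles, the projected box of a rectangular template is always a rectangle, which is one reason this fallback behaves well for books.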

## Additional improvements

Another improvement was the use of CLAHE for image pre-processing. This locally enhances the contrast of an image while limiting noise amplification through a “clip limit.” This resulted in an increased number of detected keypoints and, consequently, more votes in the accumulator.

CLAHE can be applied in two different ways, with very different results:
- Apply it directly to the grayscale image.
- Convert the image to the LAB color space, apply it only to the L channel, and then continue the pipeline using that L channel.

In our case, empirical evaluation showed that the second approach produced better results.

## Inverse matching

The standard SIFT matching procedure, which uses Lowe's ratio test, is designed to find one-to-one correspondences between keypoints in the template and keypoints in the target image. It is therefore poorly suited to scenes that contain multiple instances of the same object, where a model feature appears several times in the target image.

For example, if a model keypoint has three equally good matches in the scene, Lowe's ratio test keeps at most one of them, and often none, because the best and second-best distances are nearly equal. In our early experiments, sequentially masking detected objects with a black box and re-running SIFT detection produced good results but was extremely inefficient. To overcome this limitation, we changed the matching strategy. Instead of asking:

> "Which scene feature best matches the model feature?"

we ask:

> "For each scene feature, which model feature does it resemble the most?"
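
A minimal NumPy sketch of this inverse matching is given below, under the assumption that descriptors are compared with the Euclidean distance and that the ratio test is applied among *model* descriptors. The function name is hypothetical. With OpenCV, the same idea corresponds to calling `knnMatch` with the scene descriptors as the query set and the model descriptors as the train set.

```python
import numpy as np


def inverse_ratio_match(scene_desc, model_desc, ratio=0.75):
    """Match each scene descriptor to its nearest model descriptor.

    Returns (scene_index, model_index) pairs that pass the ratio test.
    Several scene features may map to the same model feature.
    Requires at least two model descriptors.
    """
    scene_desc = np.asarray(scene_desc, dtype=np.float64)
    model_desc = np.asarray(model_desc, dtype=np.float64)
    # Pairwise Euclidean distances, shape (n_scene, n_model).
    dists = np.linalg.norm(scene_desc[:, None, :] - model_desc[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    best, second = order[:, 0], order[:, 1]
    rows = np.arange(len(scene_desc))
    # Ratio test: the best model feature must be clearly better than the runner-up.
    keep = dists[rows, best] < ratio * dists[rows, second]
    return [(int(s), int(best[s])) for s in rows[keep]]


# Toy descriptors: three distinct model features.
model_desc = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
# The scene contains two copies of model features 0 and 1, plus one ambiguous feature.
scene_desc = [(0.9, 0.1, 0), (0.1, 0.9, 0), (1.0, 0.05, 0), (0, 1, 0.1), (0.5, 0.5, 0)]

matches = inverse_ratio_match(scene_desc, model_desc)
```

Here scene features 0 and 2 both match model feature 0, so both book instances contribute votes, while the ambiguous scene feature 4 is discarded by the ratio test.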

In [None]:
import os
import cv2
import matplotlib.pyplot as plt
import numpy as np
from ght_sift_lib import SiftGhtDetector, natural_sort_key, format_and_print_results

In [None]:
import logging
logging.basicConfig(level=logging.INFO)

In [None]:
MODELS_DIR = os.path.join("dataset", "models")
SCENES_DIR = os.path.join("dataset", "scenes")

In [None]:
models = {}
detector = SiftGhtDetector(
    num_octave_layers=5,
    bin_size=6,
    ratio_threshold=0.75,
    min_votes=3,
    nms_window_size=8,
    use_clahe=True,
)

In [None]:
model_files = sorted(
    [
        f
        for f in os.listdir(MODELS_DIR)
        if f.lower().endswith((".png", ".jpg", ".jpeg"))
    ],
    key=natural_sort_key,
)

In [None]:
for i, filename in enumerate(model_files):
    book_name = f"Book {i}"
    model_path = os.path.join(MODELS_DIR, filename)
    models[book_name] = cv2.imread(model_path)

In [None]:
scene_files = sorted(os.listdir(SCENES_DIR), key=natural_sort_key)

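# One BGR color per instance index, cycled when a book has more than four instances.
colors = [(0, 0, 255), (0, 255, 0), (255, 0, 0), (0, 255, 255)]

for scene_filename in scene_files:
    # Skip anything in the folder that is not an image.
    if not scene_filename.lower().endswith((".png", ".jpg", ".jpeg")):
        continue

    scene_path = os.path.join(SCENES_DIR, scene_filename)
    scene_image = cv2.imread(scene_path)
    if scene_image is None:
        # cv2.imread returns None instead of raising when a file cannot be read.
        logging.warning("Could not read %s, skipping", scene_path)
        continue

    overlay_image = scene_image.copy()
    all_detections = {}

    # Run every book model against the current scene.
    for book_name, model_image in models.items():
        peaks, accumulator, bounding_boxes = detector.detect(model_image, scene_image)

        # Keep only the detections for which a bounding box could be estimated.
        valid_boxes = [b for b in bounding_boxes if b.get("bounding_box") is not None]
        if valid_boxes:
            all_detections[book_name] = valid_boxes

            # Draw the box of each instance on the overlay image.
            for i, det in enumerate(valid_boxes):
                color = colors[i % len(colors)]
                corners = np.int32(det["bounding_box"])
                cv2.polylines(
                    overlay_image, [corners], isClosed=True, color=color, thickness=2
                )

    # Print the per-book report for this scene.
    format_and_print_results(all_detections, scene_filename)

    # OpenCV stores images as BGR, while matplotlib expects RGB.
    plt.imshow(cv2.cvtColor(overlay_image, cv2.COLOR_BGR2RGB))
    plt.title(scene_filename)
    plt.show()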

## Conclusion

In conclusion, after careful parameter tuning, the system detected 46 of the 53 object instances across all scenes (86.8%), which indicates that the approach is reasonably robust on this dataset.

One limitation remains, due to […]. Possible improvements could be explored, although this is always challenging when dealing with classical computer vision approaches. While reliable in industrial contexts, such approaches struggle to generalize to scenes that differ significantly from those for which the parameters were tuned.

## References

- Zuiderveld, K. (1994). Contrast Limited Adaptive Histogram Equalization. In *Graphics Gems IV* (pp. 474-485), Academic Press Professional.
- Lowe, D. G. (2004). Distinctive Image Features from Scale-Invariant Keypoints. *International Journal of Computer Vision*, 60(2), 91-110.
- Ballard, D. H. (1981). Generalizing the Hough Transform to detect arbitrary shapes. *Pattern Recognition*, 13(2), 111-122.
- Prof. Lisanti's lecture slides.