# Bag of Visual Words

## Notation for Bag of Visual Words:

- _<span style="color:red">Codeword = Cluster Center of a Specific Cluster = Codevector = Local Features = (Visual) Word</span>_
- _<span style="color:red">Codebook = Cluster Centers of All Clusters = (Visual) Vocabulary = (Visual) Dictionary</span>_
- _<span style="color:red">Image (i.e. as a whole, not patches) = Document</span>_

## What is Bag of Visual Words (BoVW)?

- In document classification, **a Bag of Words is a sparse histogram over the vocabulary** (or, equivalently, a sparse vector of word occurrence counts).
- Similarly, in computer vision, **a Bag of Visual Words represents an image just by its histogram of visual words** (or, equivalently, by a vector of occurrence counts over a vocabulary of local image features, i.e. the distribution of word occurrences).
- So, the general idea of Bag of Visual Words is simply to represent an image as **a set of features**. (Hence, it can also be referred to as a bag of features.)
- The Bag of Visual Words model can be applied to image classification, **by treating image features as visual words**. 
- Each visual word can be considered **a representative of several similar patches**, since it is the center of a cluster of similar local descriptors.
- We use **the keypoints and their descriptors** to construct vocabularies and represent each image as a frequency histogram of features that are in the image.
- From the frequency histogram, we can later **find other similar images** or **predict the category of the image**.
- Bag of Visual Words works very well for image-level classification and for recognizing object instances. It is especially useful in matching an image to a large database of object instances (like game covers).

![image](bovw1.jpeg)

## How to Build a Bag of Visual Words?

1. Detect features & extract descriptors from each image in the dataset

![image](bovw2.jpeg)
Features can be detected and their descriptors extracted with a feature extraction algorithm (for example, SIFT or KAZE).

2. Next, build a visual vocabulary out of the extracted features. To do so, we learn the vocabulary by clustering the descriptors (commonly with **K-Means**; other clustering algorithms such as DBSCAN can also be used, although DBSCAN does not produce cluster centers directly):
   - **The center of each cluster will be used as the visual dictionary’s visual words.** 
   - **The number of the clusters is the codebook size (analogous to the size of the word dictionary).**

![image](bovw3.jpeg)

3. Once the vocabulary is learned, each patch in an image is mapped to its nearest visual word (cluster center), so the image can be represented by a histogram that counts how many times each visual word occurs in it. This histogram is our Bag of Visual Words.
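
Steps 2 and 3 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the toy 2-D tuples stand in for real descriptors (SIFT descriptors, for instance, are 128-dimensional), K-Means is written out by hand, and the names `sq_dist`, `nearest_word`, `build_codebook`, and `bovw_histogram` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, codebook):
    """Index of the codeword (cluster center) closest to the descriptor."""
    return min(range(len(codebook)), key=lambda k: sq_dist(descriptor, codebook[k]))


def build_codebook(descriptors, num_words, iterations=20, seed=0):
    """Plain K-Means (Lloyd's algorithm); the final centers are the visual words."""
    rng = random.Random(seed)
    centers = [tuple(d) for d in rng.sample(descriptors, num_words)]
    for _ in range(iterations):
        clusters = [[] for _ in range(num_words)]
        for d in descriptors:
            clusters[nearest_word(d, centers)].append(d)
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers


def bovw_histogram(image_descriptors, codebook):
    """Count how many descriptors of one image map to each visual word."""
    hist = [0] * len(codebook)
    for d in image_descriptors:
        hist[nearest_word(d, codebook)] += 1
    return hist


# Toy data: two images, each with four 2-D "descriptors" (assumed values).
image_a = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1)]
image_b = [(5.1, 4.9), (4.9, 5.0), (5.2, 5.2), (0.1, 0.1)]

# The vocabulary is learned from the descriptors of all images together.
codebook = build_codebook(image_a + image_b, num_words=2)
hist_a = bovw_histogram(image_a, codebook)
hist_b = bovw_histogram(image_b, codebook)
print(hist_a, hist_b)
```

Here `num_words` plays the role of the codebook size. Which cluster ends up at which index depends on the random initialization, so only the counts in the histograms are meaningful, not the order of the bins across different runs.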

## Issues Regarding Visual Vocabulary Size:

- If too small, visual words do not represent all patches.
- If too large, quantization artifacts and overfitting occur.

## Image Retrieval with Bag of Visual Words:

Image retrieval with Bag of Visual Words works well for CD covers or movie posters, and real-time performance is possible. However, performance depends on the size of the database and degrades as the database grows.

### Building the Database:

1. Detect features and extract descriptors from the image 
2. Cluster the descriptors to construct/learn the vocabulary 
3. Compute weights for each visual word:
   - If a visual word occurs in too many images, its discriminative power is likely too weak to be useful for matching.
   - Therefore, calculate weights by Term Frequency - Inverse Document Frequency (TF - IDF) to give more weight to more discriminative visual words.
4. Create an inverted file mapping from visual words to images:
   - Because many features are obtained from a wide variety of images, the histogram (BoVW) of any single image is likely to be very sparse (mostly zeros).
   - An inverted file speeds up the similarity computation because it eliminates totally irrelevant images: only images whose non-zero bins overlap with those of the query image are considered.
   - The mapping looks like `feature1: image1, image5, ...` **(Also see the _Inverted File Index Example_ part below)**

[//]: # (- By using Bag of Visual Words representation from our dataset, we can compute this image’s nearest neighbors.) 
[//]: # (- We can do it by: )
  [//]: # (- **using a standard classifier: k-nearest neighbors algorithm or support vector machine**)
  [//]: # (- **clustering BoVW vectors over the image collection [discovering visual themes]** )
[//]: # (- Given a collection of visual words, for every document, or image, how many of each visual word is in each image. [Frequency histogram built by the count of each visual word [local features obtained from patches] in the image])

### Term Frequency - Inverse Document Frequency (TF-IDF) Weighting:

The number of times a term occurs in a document is called its Term Frequency (TF).

IDF of a word $w$:
- $\displaystyle \log\bigg(\frac{\text{# of documents}}{\text{# of documents in which $w$ appears}}\bigg)$

Compute the value of bin $w$ in image $I$:
- (TF of $w$ in $I$) $\times$ (IDF of $w$)
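
As a small worked example of the weighting above, the sketch below applies it to three toy histograms. It assumes raw counts as the TF and the natural logarithm for the IDF; the function name `tf_idf`, the histogram values, and the handling of words that never occur are choices made for this illustration.

```python
import math


def tf_idf(histograms):
    """Weight each bin as (TF of w in I) x (IDF of w), using raw counts as TF."""
    num_images = len(histograms)
    num_words = len(histograms[0])
    # Number of images in which each visual word appears at least once.
    doc_freq = [sum(1 for h in histograms if h[w] > 0) for w in range(num_words)]
    # Words that never occur get IDF 0 to avoid division by zero (a choice made here).
    idf = [math.log(num_images / df) if df else 0.0 for df in doc_freq]
    return [[h[w] * idf[w] for w in range(num_words)] for h in histograms]


# Three toy images over a vocabulary of three visual words (assumed values).
histograms = [
    [2, 1, 0],
    [1, 0, 3],
    [4, 0, 0],
]
weighted = tf_idf(histograms)
for row in weighted:
    print(row)
```

Visual word 0 appears in all three images, so its IDF is $\log(3/3) = 0$ and it contributes nothing to any image, however often it occurs. Words 1 and 2 each appear in a single image and receive the weight $\log 3$.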

### Inverted File Index Example:

- Finding every place where a book mentions Karl Marx by flipping through all of its pages would be very slow. Instead, we build an index: the index at the back of a book is an efficient way to find all the pages that contain a particular word.
- Using the same logic, after extracting the features of the query image, look up each of its visual words in the inverted file index to find all images in which that visual word occurs.
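
A minimal sketch of such an index, assuming each database image is already represented by its BoVW histogram. The image ids, histogram values, and the function names `build_inverted_index` and `candidate_images` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(histograms):
    """Map each visual word id to the set of image ids whose bin for it is non-zero."""
    index = defaultdict(set)
    for image_id, hist in histograms.items():
        for word, count in enumerate(hist):
            if count > 0:
                index[word].add(image_id)
    return index


def candidate_images(query_hist, index):
    """Images that share at least one visual word with the query."""
    candidates = set()
    for word, count in enumerate(query_hist):
        if count > 0:
            candidates |= index.get(word, set())
    return candidates


# Toy database over a vocabulary of four visual words (assumed values).
database = {
    "image1": [2, 0, 0, 1],
    "image2": [0, 3, 0, 0],
    "image5": [1, 0, 0, 0],
}
index = build_inverted_index(database)

query = [1, 0, 0, 0]  # the query image contains only visual word 0
print(sorted(candidate_images(query, index)))  # ['image1', 'image5']
```

Only the candidate images then need a full similarity computation against the query; `image2` shares no visual word with it and is never touched.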

## Uses of Bag of Visual Words Representation:

1. Treat the Bag of Visual Words representation of an image as the feature vector for a standard classifier (e.g. kNN or SVM).
2. Cluster the Bag of Visual Words representations of all images (including the query image) and find the cluster that the query image falls into.
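
The first use can be illustrated with a 1-nearest-neighbor classifier over BoVW histograms. This is only a sketch: cosine similarity is an assumed choice of metric (the notes above do not prescribe one), and the labels, histogram values, and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two histograms; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_label(query_hist, labeled_hists):
    """1-nearest-neighbor: return the label of the most similar training histogram."""
    _, best_label = max(
        labeled_hists, key=lambda item: cosine_similarity(query_hist, item[0])
    )
    return best_label


# Toy training set: (BoVW histogram, category) pairs with assumed values.
training = [
    ([5, 1, 0], "cover"),
    ([4, 0, 1], "cover"),
    ([0, 2, 6], "poster"),
]
print(predict_label([3, 1, 0], training))  # cover
```

Cosine similarity ignores the overall number of features in an image, so two images with the same distribution of visual words but different feature counts are treated as identical.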