# Semantic Image Search

## Goal

The goal of this week's capstone project is to combine techniques from computer vision and natural language processing (NLP) to allow querying for images using keywords. 

## Approach

We'll learn a mapping from image features (extracted using pre-trained computer vision models such as ResNet-18) to semantic features (based on word embeddings).

We'll create training data consisting of triples of the form: 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; `(text, good_image, bad_image)`

We want the similarity in semantic space between the `text` and `good_image` to be greater than the similarity between the `text` and `bad_image`, i.e.,

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; `sim(se(text), se(good_image)) > sim(se(text), se(bad_image))`

where `se()` stands for semantic embedding. We'll use cosine similarity to measure similarity. Note that if all vectors are normalized to unit length, cosine similarity is equal to the dot product of the two vectors.
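To make the unit-length remark concrete, here is a small pure-Python check. The vectors and the helper names (`cosine_similarity`, `normalize`, `dot`) are made up for illustration only.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

u = [3.0, 4.0]
v = [4.0, 3.0]

cos_uv = cosine_similarity(u, v)             # 24 / 25
dot_unit = dot(normalize(u), normalize(v))   # same value, no division needed
```

In practice this means you can normalize every embedding once and then compare with plain dot products (or a single matrix multiply).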

We'll embed text by essentially averaging the word embeddings (e.g., GloVe embeddings) for all words in the string.
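As a rough sketch of what "essentially averaging" could look like, the code below sums word vectors weighted by IDF and then normalizes, which is the variant the Tasks section asks for. The tokenizer, the IDF formula `log(N / df)`, the function names, and the 3-dimensional toy embeddings are all assumptions for illustration; real GloVe vectors would replace the toy dictionary.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters/digits (a simple assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def compute_idf(captions):
    """IDF per word: log(N / number of captions containing the word)."""
    doc_freq = Counter()
    for caption in captions:
        doc_freq.update(set(tokenize(caption)))
    n = len(captions)
    return {word: math.log(n / df) for word, df in doc_freq.items()}

def embed_text(text, embeddings, idf):
    """IDF-weighted sum of word vectors, normalized to unit length.

    Words missing from `embeddings` or `idf` are skipped. Returns None
    when the weighted sum is the zero vector (e.g. no known words).
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in tokenize(text):
        if word in embeddings and word in idf:
            weight = idf[word]
            total = [t + weight * x for t, x in zip(total, embeddings[word])]
    norm = math.sqrt(sum(t * t for t in total))
    if norm == 0.0:
        return None
    return [t / norm for t in total]

# Toy data standing in for the COCO captions and GloVe vectors.
captions = ["a dog runs", "a cat sleeps", "a dog sleeps"]
toy_embeddings = {"a": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0], "cat": [0.0, 0.0, 1.0]}

idf = compute_idf(captions)
vec = embed_text("a dog", toy_embeddings, idf)
```

Because "a" appears in every toy caption, its IDF is zero and it contributes nothing, so `vec` points entirely along the "dog" direction.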

We'll embed images with a linear projection from image feature space to the semantic space (same dimension as word embeddings):

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; `se(im) = im M`

where `im` has shape (1, 512) and `M` has shape (512, 50) assuming 512-dimensional image features and 50-dimensional word embedding features. The idea (hope) is that `im` will be mapped to a good region of the semantic space.

For example, if `im` contains a dog, we want multiplication by `M` to result in a vector `se(im)` that's close to where the word "dog" is in GloVe space (`glove["dog"]`).
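The shapes can be checked with a few lines of PyTorch. This is only an illustration: the tensors are random stand-ins for a real feature vector and a trained `M`, and using a bias-free `nn.Linear` layer to hold `M` is one possible design choice, not a requirement.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

im = torch.randn(1, 512)          # stand-in for one ResNet-18 feature vector
M = torch.randn(512, 50) * 0.01   # stand-in for the learned projection

se_im = im @ M                    # shape (1, 50)

# nn.Linear stores its weight as (out_features, in_features) and computes
# x @ weight.T, so a bias-free layer is the same projection with M = weight.T.
proj = nn.Linear(512, 50, bias=False)
with torch.no_grad():
    proj.weight.copy_(M.T)
se_im_linear = proj(im)
```

Holding `M` in an `nn.Linear` layer makes it easy to hand its parameters to a PyTorch optimizer.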

### Ranking Loss

To find a good `M`, we'll use PyTorch's `MarginRankingLoss`. This loss penalizes a pair of values whose ordering is wrong, and stops penalizing once the ordering is right "enough" (that is, right by at least the desired margin). The reasoning is that once the ordering between values is right, we don't need to waste effort trying to make it even more right.
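For reference, PyTorch documents this loss as `max(0, -y * (x1 - x2) + margin)`, where `y = 1` means the first input should rank higher. The sketch below applies it to three made-up pairs of similarities; the margin of `0.1` and the numbers are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

margin = 0.1  # arbitrary value for illustration
loss_fn = nn.MarginRankingLoss(margin=margin)

# Made-up similarities for three (text, good_image, bad_image) triples.
sim_good = torch.tensor([0.90, 0.55, 0.20])
sim_bad = torch.tensor([0.30, 0.50, 0.60])

# target = 1 means "the first input should be ranked above the second".
target = torch.ones_like(sim_good)

# Per-triple loss written out by hand for comparison.
per_triple = torch.clamp(margin - (sim_good - sim_bad), min=0.0)

loss = loss_fn(sim_good, sim_bad, target)  # mean over the three triples
```

The first triple is already ordered correctly by more than the margin and contributes zero; the second is ordered correctly but too narrowly and is penalized a little; the third is ordered wrongly and is penalized the most.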

## Dataset

We'll be using the COCO dataset. From the website (http://cocodataset.org/),

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; "COCO is a large-scale object detection, segmentation, and captioning dataset."

We've already extracted image features for `train2014` using the ResNet-18 model, so you don't need to worry about this step. We've put these, along with the captions, on Dropbox for you to download:

captions_train2014.json
<br/>
https://www.dropbox.com/s/h5u86wp9wfhtkz1/captions_train2014.json?dl=0

resnet18_features_train.pkl.gz
<br/>
https://www.dropbox.com/s/2g6m70ouitxftt9/resnet18_features_train.pkl.gz?dl=0

or in zip format:

resnet18_features_train.pkl.zip
<br/>
https://www.dropbox.com/s/83skvy9bub36pkl/resnet18_features_train.pkl.zip?dl=0

The features file contains the image features (in a dictionary of PyTorch tensors indexed by `image_id`).
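A minimal loading sketch follows. It assumes the gzipped feature file is an ordinary pickle (so `torch` must be installed for the tensors to unpickle) and relies on the standard COCO captions layout, with an `annotations` list (`image_id`, `caption`) and an `images` list (`id`, `coco_url`). The function names are hypothetical.

```python
import gzip
import json
import pickle
from collections import defaultdict

def load_features(path):
    """Load the gzipped pickle of image features (dict: image_id -> tensor)."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

def load_captions(path):
    """Return (captions_by_image, url_by_image) from a COCO captions JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    captions_by_image = defaultdict(list)
    for ann in data["annotations"]:
        captions_by_image[ann["image_id"]].append(ann["caption"])
    url_by_image = {img["id"]: img["coco_url"] for img in data["images"]}
    return dict(captions_by_image), url_by_image
```

Typical usage would be `features = load_features("resnet18_features_train.pkl.gz")` and `captions, urls = load_captions("captions_train2014.json")`, after downloading both files to the working directory.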

If, during the course of the project, you find you want to do things with the raw images, you can download them from the COCO dataset, but the files are pretty large. :)

## Tasks

These tasks are the minimum set of things that need to be completed as part of the capstone project. You should coordinate with your team about how to divide them up. (Note: Some might naturally be combined.)

* create capability to embed captions
  * add up word embeddings for all words in caption weighted by IDF of words across all captions (to down-weight common words); then normalize
* create training and validation sets of triples
* train model
  * embed caption
  * embed good image
  * embed bad image
  * compute similarities from caption to good and caption to bad
  * compute loss with margin ranking loss
  * take optimization step
* measure loss, accuracy (in terms of triples correct)
* create image "database" by mapping whole set of image features to semantic features with trained model
* create function to query database and return top k images
* create function to display set of images (given image ids)
  * note that the image metadata (contained in `captions_train2014.json`) includes a property called "coco_url" that can be used to download a particular image on demand for display
  * maybe display their captions, too (for debugging)
* create function that finds top k similar images to a query image
  * give option for doing similarity search in original image feature space or learned semantic space
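One possible shape for the training, accuracy, database, and query tasks is sketched below. It is not a reference solution: the data are random stand-ins, and the margin, learning rate, optimizer, epoch count, and function names are all arbitrary assumptions you should expect to change.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
IMG_DIM, SEM_DIM = 512, 50

# Random stand-ins: in the project these come from the caption embedder
# and the ResNet-18 feature file.
n_triples = 64
text_emb = F.normalize(torch.randn(n_triples, SEM_DIM), dim=1)
good_feat = torch.randn(n_triples, IMG_DIM)
bad_feat = torch.randn(n_triples, IMG_DIM)

model = nn.Linear(IMG_DIM, SEM_DIM, bias=False)   # holds the matrix M
loss_fn = nn.MarginRankingLoss(margin=0.1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def embed_images(feats):
    """Project image features into semantic space and unit-normalize."""
    return F.normalize(model(feats), dim=1)

def similarities(text, feats):
    """Row-wise dot product of unit vectors, i.e. cosine similarity."""
    return (text * embed_images(feats)).sum(dim=1)

losses = []
for epoch in range(20):
    sim_good = similarities(text_emb, good_feat)
    sim_bad = similarities(text_emb, bad_feat)
    # Target of 1 says the good image should score higher than the bad one.
    loss = loss_fn(sim_good, sim_bad, torch.ones(n_triples))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Accuracy: fraction of triples where the good image outscores the bad one.
with torch.no_grad():
    correct = similarities(text_emb, good_feat) > similarities(text_emb, bad_feat)
    accuracy = correct.float().mean().item()

def query(text_vec, database, image_ids, k=5):
    """Return the ids of the k database rows most similar to text_vec."""
    scores = database @ text_vec
    top = torch.topk(scores, k=min(k, len(image_ids))).indices
    return [image_ids[i] for i in top.tolist()]

# "Database": every image mapped into semantic space with the trained model.
with torch.no_grad():
    database = embed_images(good_feat)
image_ids = list(range(n_triples))   # stand-in for real COCO image ids
top_ids = query(text_emb[0], database, image_ids, k=5)
```

With real data you would iterate over mini-batches of triples, track the same loss and accuracy on a held-out validation set, and build the database from the full feature dictionary rather than from the training tensors.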

## Optional Tasks

* make a simple website (with bottle or flask); more details to follow
* embed some new images (e.g., from your phones) and add to database or create new separate database
* measure actual rank of good image during training (ask for details)
* try hard negative mining (ask for details)
  * e.g., see brief description here: http://www.robots.ox.ac.uk/~vgg/practicals/category-detection/index.html#part4
* check out "Hubs in Space" paper:
  * http://www.jmlr.org/papers/v11/radovanovic10a.html (ask for details)
* try different word embeddings
  * e.g., pre-trained word2vec embeddings that did "phrase detection" first
* try different image features (will require using PyTorch trained models; details to follow)
* try Rocchio relevance feedback:
  * https://en.wikipedia.org/wiki/Rocchio_algorithm
* try different way to embed captions (other than average)
## Extra Optional Tasks

* image captioning (à la "Show and Tell" paper)
  * https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Vinyals_Show_and_Tell_2015_CVPR_paper.html