# CLIP for selection suggestions

CLIP is a model for comparing natural language with images. In this notebook, I want to check whether it can be used as part of our selection UI. This is a qualitative study, so I only look at the matches by eye and report no metrics.

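The CLIP model outputs embeddings (e_text, e_image) for a (text, image) pair. The similarity score is the dot product between e_text and e_image after both are normalized to unit length (i.e. their cosine similarity), which CLIP multiplies by a learned scale to get logits. For a given text prompt and a graphic, I'll rasterize every node in the graphic's tree, score each raster against the prompt, and show the three best-matching nodes. I'll start out with ground truth annotations for graphics.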
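
To make the scoring concrete, here is a minimal NumPy sketch of how one text embedding could be ranked against several image embeddings. The embeddings are made-up 3-dimensional toy vectors, the helper names `cosine_scores` and `softmax` are hypothetical, and the scale of 100 is only a placeholder for CLIP's learned scale; none of this is taken from the CLIP code itself.

```python
import numpy as np

def cosine_scores(e_text, e_images, scale=100.0):
    """Scaled cosine similarity between one text embedding and N image embeddings.

    `scale` is a placeholder for CLIP's learned logit scale (an assumption here).
    """
    e_text = e_text / np.linalg.norm(e_text)
    e_images = e_images / np.linalg.norm(e_images, axis=1, keepdims=True)
    return scale * (e_images @ e_text)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy embeddings, invented for illustration only.
e_text = np.array([1.0, 0.0, 0.0])
e_images = np.array([
    [2.0, 0.0, 0.0],  # same direction as the text embedding
    [1.0, 1.0, 0.0],  # 45 degrees away from it
    [0.0, 3.0, 0.0],  # orthogonal to it
])

scores = cosine_scores(e_text, e_images)  # one score per candidate image
probs = softmax(scores)                   # candidates compete for this one prompt
best = int(np.argmax(probs))              # index of the best-matching candidate
```

Because the softmax runs over the candidate images for a single prompt, the resulting probabilities only rank the nodes relative to each other; they say nothing about whether any node is a good match in absolute terms.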

In [None]:
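# LOAD DATA and MODEL
# numpy is used as np below; import it explicitly rather than relying on the wildcard imports.
import numpy as np
import torch
import clip
from PIL import Image
from vectorrvnn.utils import *
from vectorrvnn.data import *
import matplotlib.pyplot as plt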

data = TripletDataset('../data/All/Test')

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

In [None]:
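def matchingNode (text, tree, k=3) :
    """Show the k nodes of tree whose rasters best match the text prompt."""
    pp_text = clip.tokenize([text]).to(device)
    # Rasterize the paths under each node as a separate 256x256 image.
    pathSets = [tree.nodes[n]['pathSet'] for n in tree.nodes]
    subdocs = [subsetSvg(tree.doc, ps) for ps in pathSets]
    rasters = [rasterize(sd, 256, 256) for sd in subdocs]
    # Min-max scale each raster to [0, 1]; the floor avoids dividing by zero on a constant raster.
    normalized = [(r - r.min()) / max(r.max() - r.min(), 1e-8) for r in rasters]
    bit8 = [(n * 255).astype(np.uint8) for n in normalized]
    images = [Image.fromarray(b, 'RGBA') for b in bit8]
    pp_image = torch.stack([preprocess(im) for im in images]).to(device)
    with torch.no_grad() :
        logits_per_image, _ = model(pp_image, pp_text)
        # Softmax over dim 0 makes the nodes compete with each other for this one prompt.
        probs = logits_per_image.softmax(dim=0).cpu().numpy()
        probs = probs.reshape(-1)
    # A tree may have fewer than k nodes, so clamp k before indexing.
    k = min(k, len(probs))
    topk = probs.argsort()[-k:][::-1]
    # squeeze=False keeps axes 2-D even when k == 1.
    fig, axes = plt.subplots(1, k, squeeze=False)
    print(f"Showing top {k} matches for prompt -", text)
    for i, ax in zip(topk, axes[0]) :
        ax.imshow(bit8[i])
    plt.show()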

In [None]:
dataId = 42

print("Showing whole graphic")
plt.imshow(rasterize(data[dataId].doc, 256, 256))
plt.show()
matchingNode("Scissors", data[dataId])