In [None]:
#default_exp visualizer

In [None]:
#hide
from nbdev.showdoc import *
%matplotlib inline

# Visualize

---
## Dimensionality Reduction

One use case for `memery` is exploring large image datasets for cleaning and curation. Sifting images by hand takes a long time, and it's nearly impossible to keep all the images in your mind at once.

Even with semantic search capabilities, it's hard to get an overview of all the images. CLIP sees things in many more dimensions than humans do, so no matter how many searches you run you can't be sure if you're missing some outliers you don't even know to search for.

The ideal overview would be a map of all the images along all the dimensions, but we don't know how to visualize or parse 512-dimensional spaces for human brains. So we have to do dimensionality reduction: find a mapping into some space with ≤ 3 dimensions that best preserves the structure of the 512-dim embeddings we have, and map that instead.
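
To make "a mapping into fewer dimensions" concrete, here is a minimal sketch using plain PCA in NumPy. This is a stand-in for illustration, not what `pymde` does internally; the helper name `pca_2d` and the random `fake_embeds` array are assumptions made up for the example.

```python
import numpy as np

def pca_2d(embeddings):
    """Project the rows of `embeddings` onto their top two principal components."""
    # Center the data so the components describe variance, not the mean offset
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
fake_embeds = rng.normal(size=(100, 512))  # hypothetical stand-in for CLIP vectors
coords = pca_2d(fake_embeds)
print(coords.shape)  # (100, 2)
```

PCA only finds a linear projection, so it tends to flatten the kind of cluster structure that neighbor-based methods try to keep.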

A recent advance in dimensionality reduction is Minimum Distortion Embedding (MDE), a general framework that covers many existing embedding methods, such as PCA and spectral embedding, and is related to neighbor-based methods like t-SNE. We can use the `pymde` library to compute the embedding and `matplotlib` to draw the images as their own markers on the graph. We'll also need `torch` to process the tensors, and `memery` functions to load the database.

In [None]:
import pymde
import torch
from pathlib import Path
from memery.loader import db_loader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


Let's get a database of embeddings from the local folder:

In [None]:
db = db_loader('images/memery.pt', device)

In [None]:
db[0].keys()

In [None]:
embeds = torch.stack([v['embed'] for v in db.values()], 0)

There are two methods to invoke with `pymde`: `preserve_neighbors` and `preserve_distances`. They create different textures in the final product. Let's see what each looks like on our sample dataset.

In [None]:
# Reuse the device chosen above so this also runs on CPU-only machines
mde_n = pymde.preserve_neighbors(embeds, verbose=False, device=str(device))
mde_d = pymde.preserve_distances(embeds, verbose=False, device=str(device))

In [None]:
embed_n = mde_n.embed(verbose=False, snapshot_every=1)
embed_d = mde_d.embed(verbose=False, snapshot_every=1)

In [None]:
pymde.plot(embed_n)

In [None]:
pymde.plot(embed_d)

In [None]:
mde_n.play(savepath='./graphs/mde_n.gif')

In [None]:
mde_d.play(savepath='./graphs/mde_d.gif')

In [None]:
# One 2-D coordinate per image
assert embed_n.shape == (len(embeds), 2)
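
One way to put a number on the difference between `preserve_neighbors` and `preserve_distances` is to check how many of each point's nearest neighbors survive the embedding. The sketch below is a rough diagnostic written in NumPy, not a `pymde` feature; `knn_indices` and `neighbor_overlap` are hypothetical helper names, and the pairwise-distance matrix needs O(n²) memory, so it assumes a few thousand images at most.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of each row, excluding the row itself."""
    # Squared Euclidean distances via the Gram matrix, shape (n, n)
    sq = (points ** 2).sum(axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]

def neighbor_overlap(original, embedded, k=5):
    """Mean fraction of each point's k nearest neighbors kept by the embedding."""
    nn_orig = knn_indices(original, k)
    nn_emb = knn_indices(embedded, k)
    kept = [len(set(a) & set(b)) / k for a, b in zip(nn_orig, nn_emb)]
    return float(np.mean(kept))

rng = np.random.default_rng(0)
high = rng.normal(size=(50, 16))                   # toy high-dimensional data
same = neighbor_overlap(high, high.copy(), k=5)    # identical geometry
crude = neighbor_overlap(high, high[:, :2], k=5)   # keep only two coordinates
```

On the real data this could be called as `neighbor_overlap(embeds.detach().cpu().numpy(), embed_n.detach().cpu().numpy())` and again with `embed_d`, to compare the two recipes on the same footing.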

---
Now I want to plot images as markers instead of little dots. I haven't figured out how to merge this with the `pymde.plot` functions yet, so I'm doing it directly in matplotlib.

If we just plot the images at their coordinates they will overlap (especially on the `preserve_neighbors` plot), so eventually I may normalize the x and y axes and plot things on a grid, at least a little bit.

In [None]:
import matplotlib.pyplot as plt
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
from tqdm import tqdm

In [None]:
def plot_images_from_tensors(coords, image_paths, dpi=600, savefile='default.jpg', zoom=0.03):
    """Draw each image as its own marker at its 2-D coordinate and save the figure."""
    fig, ax = plt.subplots()
    fig.dpi = dpi
    fig.set_size_inches(8, 8)

    ax.xaxis.set_visible(False)
    ax.yaxis.set_visible(False)

    # Work on the CPU so matplotlib can read the coordinates
    cc = coords.detach().cpu()

    # Use the same limits on both axes so the plot stays square
    low = float(cc.min())
    high = float(cc.max())
    ax.set_xlim(low, high)
    ax.set_ylim(low, high)

    for i, coord in tqdm(enumerate(cc), total=len(cc)):
        try:
            x, y = coord.tolist()

            # Pass the path itself so the format is inferred from the file extension
            path = str(image_paths[i])
            image = plt.imread(path)

            im = OffsetImage(image, zoom=zoom, resample=False)
            im.image.axes = ax
            ab = AnnotationBbox(im, (x, y), frameon=False, pad=0.0)
            ax.add_artist(ab)
        except (OSError, SyntaxError, ValueError):
            # Skip files that are missing or cannot be decoded as images
            continue
    print("Drawing images as markers...")
    fig.savefig(savefile)
    print(f'Saved image to {savefile}')


In [None]:
filenames = [v['fpath'] for v in db.values()]

In [None]:
savefile = 'graphs/embed_n.jpg'

plot_images_from_tensors(embed_n, filenames, savefile=savefile)

In [None]:
savefile = 'graphs/embed_d.jpg'

plot_images_from_tensors(embed_d, filenames, savefile=savefile)

I suppose it makes sense that `preserve_neighbors` clumps things together and `preserve_distances` spreads them out. It's nice to see the actual distances and texture of the data, for sure. But I'd also like to see the images bigger, keeping only relative information about where they sit with respect to each other. Let's see if we can implement a normalization function and plot them again.

Currently the embedding tensor is basically a list of pairs of floats. Can I convert those to integer ranks running from 0 up to the number of images? I don't know how to do this in matrix math, so I'll try it more simply first.

In [None]:
len(embed_n)

In [None]:
embed_list = [(float(x),float(y)) for x,y in embed_n]
embed_dict = {k: v for k, v in zip(filenames, embed_list)}
len(embed_dict)

In [None]:
def normalize_embeds(embed_dict):
    # Rank every image along x: the smallest x gets 0, the next gets 1, and so on
    sort_x = sorted(embed_dict.items(), key=lambda item: item[1][0])
    norm_x = {k: i for i, (k, _) in enumerate(sort_x)}

    # Same ranking along y
    sort_y = sorted(embed_dict.items(), key=lambda item: item[1][1])
    norm_y = {k: i for i, (k, _) in enumerate(sort_y)}

    # Each image's new coordinate is its (x rank, y rank)
    normalized_dict = {k: (norm_x[k], norm_y[k]) for k in embed_dict.keys()}
    return normalized_dict

In [None]:
norm_dict = normalize_embeds(embed_dict)

In [None]:
len(norm_dict)

I probably could do that all in torch but right now I'm just going to pipe it back into tensors and put it through my plotting function:

In [None]:
norms = torch.stack([torch.tensor([x, y]) for x, y in norm_dict.values()])


In [None]:
plot_images_from_tensors(norms, filenames, savefile='graphs/normalized.jpg')

It worked!! The clusters still exist, but their distances are relaxed so they display better on the graph. It's removing some information, for sure, but it's unclear whether that's information a human needs.

I wonder if it works on the `preserve_distances` method...

In [None]:
embed_list = [(float(x),float(y)) for x,y in embed_d]
embed_dict = {k: v for k, v in zip(filenames, embed_list)}
norm_dict = normalize_embeds(embed_dict)
norms = torch.stack([torch.tensor([x, y]) for x, y in norm_dict.values()])

In [None]:
plot_images_from_tensors(norms, filenames, savefile='graphs/normalized-d.jpg')

This looks okay. It throws away the absolute distances but keeps the relative ordering along each axis. I'm still not sure what the actionable difference between these two methods is.
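
To see exactly what the rank normalization keeps and what it drops, here is a tiny pure-Python sketch on one axis. The `ranks` helper and the three sample values are made up for illustration.

```python
def ranks(values):
    """Replace each value with its position in sorted order (0 = smallest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order):
        out[i] = rank
    return out

# Two images nearly on top of each other, plus one far-away outlier
xs = [0.0, 0.1, 5.0]
rx = ranks(xs)

gaps_before = [b - a for a, b in zip(xs, xs[1:])]  # a tiny gap, then a huge one
gaps_after = [b - a for a, b in zip(rx, rx[1:])]   # every gap becomes exactly 1
```

The left-to-right order survives, but the outlier ends up just as close to its neighbor as the two near-duplicates are to each other. That is why the normalized plot is easier to read and also why it can hide how isolated an outlier really is.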

Well, it works okay for now. The next question is, how to incorporate it into a working GUI?

I wonder how matplotlib performs natively on a much larger dataset. Let's see:

## Large dataset

In [None]:
def normalize_tensors(embdgs, names):
    """Rank-normalize an (n, 2) embedding tensor, keeping rows in the order of `names`."""
    embed_list = [(float(x), float(y)) for x, y in embdgs]
    embed_dict = {k: v for k, v in zip(names, embed_list)}
    norm_dict = normalize_embeds(embed_dict)
    norms = torch.stack([torch.tensor([x, y]) for x, y in norm_dict.values()])
    return norms
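
The round trip through Python dicts in `normalize_tensors` can probably be avoided. A minimal sketch of the "matrix math" version uses a double argsort, shown here in NumPy; `rank_normalize` is a hypothetical name, and the sample coordinates are invented. As far as I know `torch.Tensor.argsort` takes a `dim` argument and could be used the same way, though tie handling may differ.

```python
import numpy as np

def rank_normalize(coords):
    """Replace each column of an (n, 2) array by the rank of each entry."""
    # The first argsort answers "which row is the i-th smallest?";
    # the second inverts that into "what rank does row i have?"
    order = np.argsort(coords, axis=0, kind="stable")
    return np.argsort(order, axis=0, kind="stable")

coords = np.array([[0.3, -1.0],
                   [-2.0, 4.0],
                   [1.5, 0.5]])
print(rank_normalize(coords))
# [[1 0]
#  [0 2]
#  [2 1]]
```

Unlike the dict version, this does not need the filenames at all, and it does not silently drop rows when two images share a path.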

In [None]:
db = db_loader('/home/mage/Pictures/memes/memery.pt', device)

In [None]:
filenames = [v['fpath'] for v in db.values()]

In [None]:
clips = torch.stack([v['embed'] for v in db.values()])

In [None]:
filenames[:5]

In [None]:
mde_lg = pymde.preserve_neighbors(clips, verbose=False, device=str(device))

In [None]:
embed_lg = mde_lg.embed(verbose=False, snapshot_every=1)

In [None]:
norms_lg = normalize_tensors(embed_lg,filenames)
len(norms_lg)

In [None]:
plot_images_from_tensors(norms_lg, filenames, savefile='graphs/normalized-lg.jpg')

---

### Be careful here

It is possible to use embeddings as target coordinates to delete sections of the data:

In [None]:
to_delete = []
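# Assumes the raw (un-normalized) embedding `embed_lg`; swap in whichever embedding you plotted
for coord, img in zip(embed_lg.cpu(), filenames):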
    x, y = coord
    if x < -2 or y < -1:
        to_delete.append(img)

In [None]:
len(to_delete)

In [None]:
# Careful: unlink() permanently deletes the files from disk
for img in to_delete:
    imgpath = Path(img)
    imgpath.unlink()

It worked! A better distribution and fewer of the wrong things.
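
Since deleting by coordinates is easy to get wrong, a gentler variant is to move the selected files aside first and only delete them once the result looks right. The sketch below uses only the standard library; `quarantine` and its `dry_run` flag are hypothetical names, not part of `memery`, and files with identical names from different folders would overwrite each other in the quarantine directory.

```python
import shutil
import tempfile
from pathlib import Path

def quarantine(paths, quarantine_dir, dry_run=True):
    """Move files into `quarantine_dir` instead of deleting them.

    With dry_run=True nothing is touched; the function only returns
    the existing files that would be moved.
    """
    quarantine_dir = Path(quarantine_dir)
    moved = []
    for p in map(Path, paths):
        if not p.is_file():
            continue  # skip paths that are already gone
        if not dry_run:
            quarantine_dir.mkdir(parents=True, exist_ok=True)
            shutil.move(str(p), str(quarantine_dir / p.name))
        moved.append(p)
    return moved

# Tiny self-contained demo on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    victim = tmp / "meme.jpg"
    victim.write_bytes(b"not really a jpeg")

    planned = quarantine([victim], tmp / "trash", dry_run=True)
    still_there = victim.exists()  # the dry run leaves the file alone

    quarantine([victim], tmp / "trash", dry_run=False)
    moved_ok = (tmp / "trash" / "meme.jpg").exists() and not victim.exists()
```

With the notebook's variables this would be `quarantine(to_delete, 'quarantine/')` to preview, then the same call with `dry_run=False` to actually move the files.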