In [None]:
# default_exp core

In [None]:
#hide
from nbdev.showdoc import *

# Core

Find the meme you are looking for!



This is the core module. It loads the CLIP model and the encodings of all the images in the folder, encodes the search text or image, and returns a ranked list of filenames.

## Modular flow system

I'm using the Neural Search design pattern as described by Han Xiao in, for example, [Generic Neural Elastic Search: From bert-as-service and Go Way Beyond](https://hanxiao.io/2019/07/29/Generic-Neural-Elastic-Search-From-bert-as-service-and-Go-Way-Beyond).

This is a system designed to be scalable and distributed if necessary. Even for a single-machine scenario, I like the functional style of it: grab data, transform it and pass it downstream, all the way from the folder to the output widget.

There are two main types of operator in neural search: **flows** and **executors**.

**Flows** are specific patterns of data manipulation and storage. **Executors** are the operators that transform the data within the flow. 

Any search system has two core flows: indexing and querying. The plan here is to build executors that can be composed into flows, then compose the flows into a UI that supports querying and, to some extent, indexing as well.

The core executors for this use case are:
 - Loader
 - Crafter
 - Encoder
 - Indexer
 - Ranker
 - Gateway
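
As a rough illustration of how executors might be composed into a flow, here is a minimal sketch. The `compose_flow` helper and the `toy_*` executors are hypothetical stand-ins invented for this example, not part of memery; the real executors take extra arguments such as the device.

```python
from functools import reduce
from typing import Any, Callable, Dict, List, Sequence

Executor = Callable[[Any], Any]


def compose_flow(executors: Sequence[Executor]) -> Executor:
    """Chain executors left to right into one callable flow."""
    def flow(data: Any) -> Any:
        # Feed the output of each executor into the next one.
        return reduce(lambda acc, step: step(acc), executors, data)
    return flow


# Toy executors (hypothetical, for illustration only).
def toy_loader(folder: str) -> List[str]:
    return [f"{folder}/A.jpg", f"{folder}/b.png"]


def toy_crafter(paths: List[str]) -> List[str]:
    return [p.lower() for p in paths]


def toy_encoder(items: List[str]) -> Dict[str, int]:
    # Pretend the "embedding" is just the length of the path.
    return {item: len(item) for item in items}


toy_index_flow = compose_flow([toy_loader, toy_crafter, toy_encoder])
result = toy_index_flow("images")
print(result)
```

Each executor only needs to accept the previous executor's output, which is what makes the stages easy to swap or distribute later.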
 

**Gateway Process -- not yet implemented**

The gateway takes a request and routes it through either the indexing flow or the querying flow, passing along its arguments. It is the main entry point for each iteration of the index/query process.

The querying flow can technically handle either text or image search, because the CLIP encoder puts both into the same embedding space. So we might as well build in a method for each and make it available to the user, since it's impressive, useful and relatively easy to build.
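
One way the querying flow could accept either kind of query is to route on the query's type before encoding. The sketch below is an assumption about how that might look: `encode_query`, `IMAGE_SUFFIXES` and the two fake encoders are hypothetical names made up for this example, and the fake encoders return toy vectors rather than CLIP embeddings.

```python
from pathlib import Path
from typing import Callable, List, Union

Vector = List[float]

# Assumed set of image extensions; the real loader may use a different list.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def encode_query(query: Union[str, Path],
                 encode_text: Callable[[str], Vector],
                 encode_image: Callable[[Path], Vector]) -> Vector:
    """Send image paths to the image encoder and everything else to the text encoder."""
    if isinstance(query, Path) and query.suffix.lower() in IMAGE_SUFFIXES:
        return encode_image(query)
    return encode_text(str(query))


# Stub encoders standing in for the real CLIP encoders (hypothetical).
def fake_text_encoder(text: str) -> Vector:
    return [float(len(text)), 0.0]


def fake_image_encoder(path: Path) -> Vector:
    return [0.0, float(len(path.name))]


text_vec = encode_query("dog", fake_text_encoder, fake_image_encoder)
image_vec = encode_query(Path("images/dog.jpg"), fake_text_encoder, fake_image_encoder)
```

Because both branches return a vector in the same space, everything downstream of the encoder can stay unchanged.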

Eventually the Gateway process probably needs to be quite complicated, for serving all the different users and for delivering REST APIs to different clients. We'll need a way to accept text and images as HTTP requests and return JSON dictionaries (especially at the encoder, which will remain server-bound more than any other executor).
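
Since the gateway is not implemented yet, the following is only a sketch of the dispatch idea under stated assumptions: the request shape (`action`, `path`, `query` keys), the `gateway` function and the stub flows are all hypothetical, and the flows are passed in as plain callables so the example runs without the model.

```python
import json
from typing import Any, Callable, Dict, List, Tuple


def gateway(request: Dict[str, Any],
            index_flow: Callable[[str], Tuple[str, str]],
            query_flow: Callable[[str, str], List[str]]) -> str:
    """Dispatch a request dict to a flow and return the result as a JSON string."""
    action = request.get("action")
    path = request.get("path")
    if not path:
        return json.dumps({"ok": False, "error": "missing 'path'"})
    if action == "index":
        saved = index_flow(path)
        return json.dumps({"ok": True, "saved": list(saved)})
    if action == "query":
        results = query_flow(path, request.get("query", ""))
        return json.dumps({"ok": True, "results": results})
    return json.dumps({"ok": False, "error": f"unknown action: {action!r}"})


# Stub flows so the sketch is runnable on its own (hypothetical).
def stub_index_flow(path: str) -> Tuple[str, str]:
    return (f"{path}/memery.pt", f"{path}/memery.ann")


def stub_query_flow(path: str, query: str) -> List[str]:
    return [f"{path}/{query}-1.jpg"]


response = json.loads(
    gateway({"action": "query", "path": "images", "query": "dog"},
            stub_index_flow, stub_query_flow)
)
```

An HTTP layer would then only need to turn a request body into this dict and send the returned string back.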

### Usage

For now, calling the `queryFlow` process directly is the simplest gateway.

In [None]:
#export
import time
import torch
from pathlib import Path
from memery.loader import get_image_files, archive_loader, db_loader, treemap_loader 
from memery.crafter import crafter
from memery.encoder import image_encoder, text_encoder
from memery.indexer import join_all, build_treemap, save_archives
from memery.ranker import ranker, nns_to_files

## Flows

The `indexFlow` checks the local directory for files with image extensions, loads the archive, splits out any new images to be encoded and encodes them, then finally builds a treemap and saves it along with the new archive. It returns a tuple with the locations of the archive and treemap.

In [None]:
#export
def indexFlow(path):
    root = Path(path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Find every image under root, then split off the ones without a saved encoding
    filepaths = get_image_files(root)
    archive_db, new_files = archive_loader(filepaths, root, device)
    print(f"Loaded {len(archive_db)} encodings")
    print(f"Encoding {len(new_files)} new images")

    # Timing starts here, so it excludes the archive load above
    start_time = time.perf_counter()

    # Preprocess and encode only the new images
    crafted_files = crafter(new_files, device)
    new_embeddings = image_encoder(crafted_files, device)

    # Merge old and new encodings, then build the search tree over all of them
    db = join_all(archive_db, new_files, new_embeddings)
    print("Building treemap")
    t = build_treemap(db)

    print(f"Saving {len(db)} images")
    save_paths = save_archives(root, t, db)
    print(f"Done in {time.perf_counter() - start_time} seconds")

    return save_paths

In [None]:
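# delete the current savefiles for testing purposes
# (missing_ok=True needs Python 3.8+ and avoids an error on a fresh checkout)
Path('images/memery.pt').unlink(missing_ok=True)
Path('images/memery.ann').unlink(missing_ok=True)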

In [None]:
save_paths = indexFlow('./images')

  0%|          | 0/1 [00:00<?, ?it/s]

Loaded 0 encodings
Encoding 80 new images


100%|██████████| 1/1 [00:00<00:00,  1.02it/s]


Building treemap
Saving 79images
Done in 1.381443322999985 seconds


In [None]:
assert save_paths
save_paths


('images/memery.pt', 'images/memery.ann')
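
The "split out any new images" step of `indexFlow` can be sketched roughly as follows. This is not the actual `archive_loader`: `split_new_files` is a hypothetical helper, the archive is assumed to be a dict keyed by file path, and dropping entries for deleted files is an assumption too.

```python
from typing import Dict, Iterable, List, Tuple


def split_new_files(filepaths: Iterable[str],
                    archive: Dict[str, object]) -> Tuple[Dict[str, object], List[str]]:
    """Keep archived encodings for files still on disk; list files that need encoding."""
    current = set(filepaths)
    # Assumption: encodings for files that no longer exist are discarded.
    kept = {path: emb for path, emb in archive.items() if path in current}
    new_files = sorted(current.difference(kept))
    return kept, new_files


# Toy data (hypothetical): one archived file survives, one was deleted, one is new.
archive = {"images/a.jpg": [0.1, 0.2], "images/gone.jpg": [0.3, 0.4]}
on_disk = ["images/a.jpg", "images/b.png"]
kept, new_files = split_new_files(on_disk, archive)
```

Only `new_files` would then go through the crafter and encoder, which is why re-indexing an unchanged folder reports zero new images.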

The `queryFlow` takes a path and a query and looks for a saved index in that folder. If the index exists, it loads it and searches the treemap; if not, it calls `indexFlow` to build the index first. It returns the ranked list of filenames.

In [None]:
#export
def queryFlow(path, query):
    root = Path(path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the saved encodings and treemap, if there are any
    dbpath = root/'memery.pt'
    db = db_loader(dbpath, device)
    treepath = root/'memery.ann'
    treemap = treemap_loader(treepath)

    # No usable index yet: build one, then load what was saved
    if treemap is None or db == {}:
        dbpath, treepath = indexFlow(root)
        treemap = treemap_loader(Path(treepath))
        db = db_loader(dbpath, device)

    print(f"Searching {len(db)} images")
    start_time = time.perf_counter()

    # Encode the query, find its nearest neighbors, and map them back to filenames
    query_vec = text_encoder(query, device)
    indexes = ranker(query_vec, treemap)
    ranked_files = nns_to_files(db, indexes)

    print(f"Done in {time.perf_counter() - start_time} seconds")

    return ranked_files

In [None]:
#This is just a helper function to print images in a notebook:

from memery.gui import get_grid

In [None]:
root = './images'
query = 'dog'

In [None]:
ranked = queryFlow(root, query)

Searching 79 images
Done in 0.08653578300072695 seconds


In [None]:
ranked[:6]

['images/Wholesome-Meme-8.jpg',
 'images/Wholesome-Meme-5.jpg',
 'images/Wholesome-Meme-35.jpg',
 'images/Wholesome-Meme-67.png',
 'images/embarassed-dog-on-bed-SA2BDZW.jpg',
 'images/Wholesome-Meme-72.jpg']

In [None]:
get_grid(ranked, n=6)

[Widget output: a GridBox showing the six top-ranked images]
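
To make the ranking step concrete, here is a small sketch that swaps the treemap lookup for an exhaustive cosine-similarity scan. It is a stand-in, not the project's `ranker` or `nns_to_files`: `cosine`, `rank_files` and the two-dimensional toy vectors are invented for this example.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_files(query_vec: Sequence[float],
               embeddings: Dict[str, Sequence[float]]) -> List[str]:
    """Return filenames sorted from most to least similar to the query."""
    return sorted(embeddings,
                  key=lambda name: cosine(query_vec, embeddings[name]),
                  reverse=True)


# Toy 2-D "embeddings" (hypothetical); real CLIP vectors have many more dimensions.
toy_embeddings = {
    "images/cat.jpg": [0.0, 1.0],
    "images/dog.jpg": [1.0, 0.0],
    "images/puppy.png": [0.8, 0.6],
}
toy_ranked = rank_files([1.0, 0.1], toy_embeddings)
print(toy_ranked)
```

A scan like this touches every image for every query, which is the cost a prebuilt search tree is meant to avoid on large folders.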

---

_Working out the timing issues in flow_

In [None]:
%load_ext line_profiler

The line_profiler extension is already loaded. To reload it, use:
  %reload_ext line_profiler


In [None]:
root = '/home/mage/Pictures/memes/Pixel/Pictures/'
query = 'dog'

In [None]:
%lprun -f queryFlow queryFlow(root, query)

Searching 367 images
Done in 0.03215723000175785 seconds


Timer unit: 1e-06 s

Total time: 0.059621 s
File: <ipython-input-8-f0ea29459e86>
Function: queryFlow at line 2

Line #      Hits         Time  Per Hit   % Time  Line Contents
     2                                           def queryFlow(path, query): 
     3         1         83.0     83.0      0.1      root = Path(path)
     4         1         51.0     51.0      0.1      device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
     5                                               
     6         1         42.0     42.0      0.1      dbpath = root/'memery.pt'
     7         1      26706.0  26706.0     44.8      db = db_loader(dbpath, device)
     8         1         36.0     36.0      0.1      treepath = root/'memery.ann'
     9         1        413.0    413.0      0.7      treemap = treemap_loader(treepath)
    10                                               
    11         1          2.0      2.0      0.0      if treemap == None or db == {}:
    12                     

In [None]:
%lprun -f indexFlow indexFlow(Path(root))

0it [00:00, ?it/s]

Loaded 367 encodings
Encoding 0 new images


0it [00:00, ?it/s]

Building treemap





Saving 367images
Done in 1.8026166210001975 seconds


Timer unit: 1e-06 s

Total time: 1.85444 s
File: <ipython-input-4-3f277edb55d4>
Function: indexFlow at line 2

Line #      Hits         Time  Per Hit   % Time  Line Contents
     2                                           def indexFlow(path):
     3         1         38.0     38.0      0.0      root = Path(path)
     4         1         30.0     30.0      0.0      device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
     5                                               
     6         1      16144.0  16144.0      0.9      filepaths = get_image_files(root)
     7         1          2.0      2.0      0.0      archive_db = {}
     8                                               
     9         1      35290.0  35290.0      1.9      archive_db, new_files = archive_loader(filepaths, root, device)
    10         1        189.0    189.0      0.0      print(f"Loaded {len(archive_db)} encodings")
    11         1         55.0     55.0      0.0      print(f"Encoding {len(new_file

In [None]:
root = '/home/mage/Pictures/occult-imagery/'
query = 'dog'

In [None]:
%lprun -f queryFlow queryFlow(root, query)

Searching 26722 images
Done in 95.6256318759988 seconds


Timer unit: 1e-06 s

Total time: 96.3339 s
File: <ipython-input-8-f0ea29459e86>
Function: queryFlow at line 2

Line #      Hits         Time  Per Hit   % Time  Line Contents
     2                                           def queryFlow(path, query): 
     3         1         52.0     52.0      0.0      root = Path(path)
     4         1         37.0     37.0      0.0      device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
     5                                               
     6         1         28.0     28.0      0.0      dbpath = root/'memery.pt'
     7         1     707512.0 707512.0      0.7      db = db_loader(dbpath, device)
     8         1         30.0     30.0      0.0      treepath = root/'memery.ann'
     9         1        457.0    457.0      0.0      treemap = treemap_loader(treepath)
    10                                               
    11         1          1.0      1.0      0.0      if treemap == None or db == {}:
    12                      

In [None]:
%lprun -f indexFlow indexFlow(Path(root))

0it [00:00, ?it/s]

Loaded 26722 encodings
Encoding 0 new images


0it [00:00, ?it/s]

Building treemap





Saving 26722images
Done in 122.27596841900231 seconds


Timer unit: 1e-06 s

Total time: 132.016 s
File: <ipython-input-4-3f277edb55d4>
Function: indexFlow at line 2

Line #      Hits         Time  Per Hit   % Time  Line Contents
     2                                           def indexFlow(path):
     3         1         19.0     19.0      0.0      root = Path(path)
     4         1         17.0     17.0      0.0      device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
     5                                               
     6         1     844439.0 844439.0      0.6      filepaths = get_image_files(root)
     7         1          2.0      2.0      0.0      archive_db = {}
     8                                               
     9         1    8895508.0 8895508.0      6.7      archive_db, new_files = archive_loader(filepaths, root, device)
    10         1         94.0     94.0      0.0      print(f"Loaded {len(archive_db)} encodings")
    11         1         21.0     21.0      0.0      print(f"Encoding {len(new_fil
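
Outside a notebook, where the `%lprun` magic is not available, per-stage timings can be collected with a small context manager. This is only a sketch: `stage_timer` is a hypothetical helper, and the two timed stages below are placeholders for real flow steps such as loading the database or ranking.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def stage_timer(name: str, timings: Dict[str, float]) -> Iterator[None]:
    """Add the wall-clock time spent inside the block to timings[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start


timings: Dict[str, float] = {}

# Placeholder stages (hypothetical); wrap the real loader and ranker calls instead.
with stage_timer("load", timings):
    data = list(range(1000))
with stage_timer("rank", timings):
    ranked_demo = sorted(data, reverse=True)

slowest = max(timings, key=timings.get)
print(f"Slowest stage: {slowest} ({timings[slowest]:.6f} s)")
```

Wrapping each executor call this way gives the same per-stage breakdown as the profiler tables, at the granularity of whole stages rather than lines.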