In [None]:
# default_exp core

In [None]:
#hide
from nbdev.showdoc import *

# Core

> Index, query and save embeddings of images by folder

## Rationale

**Memery takes a folder of images and a search query, and returns a ranked list of images.**

The images and query are both projected into a high-dimensional semantic space, courtesy of OpenAI's [CLIP](https://openai.com/blog/clip/) ([source](https://github.com/openai/CLIP)). These embeddings are indexed and treemapped using the [Annoy](https://github.com/spotify/annoy) library, which provides nearest-neighbor results for the search query. These results are then passed to the user interface (currently as a list of file locations).
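
To make the ranking idea concrete, here is a minimal sketch that ranks a few toy vectors by cosine similarity to a query vector. It is a brute-force stand-in, not how Memery does it: the real embeddings come from CLIP, and the search goes through an Annoy index. The helper names and the 3-dimensional vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, embeddings):
    """Return file names sorted from most to least similar to the query."""
    return sorted(
        embeddings,
        key=lambda name: cosine_similarity(query_vec, embeddings[name]),
        reverse=True,
    )

# Toy 3-d "embeddings" (assumption); real CLIP vectors have hundreds of dimensions.
toy_embeddings = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
toy_query = [1.0, 0.2, 0.0]  # pretend this is the encoded text "dog"

ranked_toy = rank_by_similarity(toy_query, toy_embeddings)
```

Brute force compares the query against every image, which is fine for three files. An approximate nearest-neighbor index such as Annoy trades a little accuracy for much faster lookups on large folders.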

We provide various interfaces for the end user, all of which call the functions `query_flow` and `index_flow` below.


## Modular flow system

Memery uses the Neural Search design pattern, as described by Han Xiao in, for example, [Generic Neural Elastic Search: From bert-as-service and Go Way Beyond](https://hanxiao.io/2019/07/29/Generic-Neural-Elastic-Search-From-bert-as-service-and-Go-Way-Beyond).

This is a system designed to be scalable and distributed if necessary. Even for a single-machine scenario, I like the functional style of it: grab data, transform it and pass it downstream, all the way from the folder to the output widget.

There are two main types of operator in this pattern: **flows** and **executors**.

**Flows** are specific patterns of data manipulation and storage. **Executors** are the operators that transform the data within the flow. 

There are two core flows to any search system: indexing, and querying. The plan here is to make executors that can be composed into flows and then compose the flows into a UI that supports querying and, to some extent, indexing as well.

The core executors for this use case are:
 - Loader
 - Crafter
 - Encoder
 - Indexer
 - Ranker
 - Gateway
 

**NB: The executors are currently implemented as functions. A future upgrade will change the names to verbs to match, or change their implementation to classes if they're going to act as nouns.**

These executors are being implemented ad hoc in the flow functions, but should probably be given single entry points and have their specific logic happen within their own files. Deeper abstractions with less coupling.
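
As a rough sketch of what "executors composed into flows" could look like while the executors are still plain functions, the snippet below chains single-argument functions into one callable. `make_flow` and the `toy_*` executors are hypothetical names invented for this example, not part of Memery.

```python
from functools import reduce

def make_flow(*executors):
    """Compose executors left to right into a single callable flow."""
    def flow(data):
        # Feed the output of each executor into the next one
        return reduce(lambda value, executor: executor(value), executors, data)
    return flow

# Hypothetical stand-in executors; the real ones load, preprocess and encode images.
def toy_loader(folder):
    return [f"{folder}/a.jpg", f"{folder}/b.jpg"]

def toy_crafter(paths):
    return [path.upper() for path in paths]

def toy_encoder(items):
    return {item: len(item) for item in items}

toy_index_flow = make_flow(toy_loader, toy_crafter, toy_encoder)
toy_result = toy_index_flow("images")
```

Each executor only needs to agree with its neighbours on the shape of the data it passes along, which is one way to get the looser coupling mentioned above.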

## Flows

In [None]:
#export
import time
import torch

from pathlib import Path
# Import the executor modules; the flows below call them as module.function
from memery import loader, crafter, encoder, indexer, ranker
# These two loader functions are also called by their bare names in query_flow
from memery.loader import get_valid_images, treemap_loader

#### Indexing

In [None]:
#export
def index_flow(path):
    '''Indexes images in path, returns the location of save files'''
    root = Path(path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Loading
    filepaths = loader.get_image_files(root)
    archive_db, new_files = loader.archive_loader(filepaths, root, device)
    print(f"Loaded {len(archive_db)} encodings")
    print(f"Encoding {len(new_files)} new images")

    # Crafting and encoding
    crafted_files = crafter.crafter(new_files, device)
    new_embeddings = encoder.image_encoder(crafted_files, device)
    
    # Reindexing
    db = indexer.join_all(archive_db, new_files, new_embeddings)
    print("Building treemap")
    t = indexer.build_treemap(db)
    
    print(f"Saving {len(db)} encodings")
    save_paths = indexer.save_archives(root, t, db)

    return save_paths

In [None]:
show_doc(index_flow)
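
Judging by its variable names, `index_flow` reuses saved encodings and only encodes files it has not seen before. The sketch below illustrates that incremental idea with plain dictionaries. It makes no claim about how `archive_loader` or `join_all` are actually implemented: every helper here is hypothetical, and dropping entries for files that no longer exist is an assumption.

```python
def split_cached_and_new(filepaths, archive):
    """Keep cached embeddings for files still present; list the rest as new."""
    cached = {path: archive[path] for path in filepaths if path in archive}
    new_files = [path for path in filepaths if path not in archive]
    return cached, new_files

def toy_encode(path):
    """Hypothetical encoder: a fake one-number 'embedding' per file."""
    return [float(len(path))]

def incremental_index(filepaths, archive):
    """Encode only unseen files and merge them with the cached embeddings."""
    cached, new_files = split_cached_and_new(filepaths, archive)
    fresh = {path: toy_encode(path) for path in new_files}
    return {**cached, **fresh}, new_files

# "gone.jpg" is in the old archive but no longer on disk (assumed to be dropped)
old_archive = {"a.jpg": [5.0], "gone.jpg": [8.0]}
current_files = ["a.jpg", "b.jpeg"]
toy_db, toy_new = incremental_index(current_files, old_archive)
```

Only `b.jpeg` goes through the (expensive) encoder here, while `a.jpg` keeps its cached vector.
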

We can index the local `images` folder to test:

In [None]:

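# delete any existing savefiles for testing purposes
# (missing_ok needs Python 3.8+; it avoids an error when the files are absent)
Path('images/memery.pt').unlink(missing_ok=True)
Path('images/memery.ann').unlink(missing_ok=True)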

# run the index flow. returns the path
save_paths = index_flow('./images')

In [None]:
assert save_paths # Passes if index_flow returned a non-empty result; does not check that the files exist
save_paths

#### Querying

In [None]:
#export
def query_flow(path, query=None, image_query=None):
    '''
    Indexes a folder and returns file paths ranked by query.
    
    Parameters:
        path (str): Folder to search
        query (str): Search query text
        image_query (Tensor): Search query image(s)

    Returns:
        list of file paths ranked by query
    '''
    start_time = time.time()
    root = Path(path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Check if we should re-index the files
    print("Checking files")
    dbpath = root/'memery.pt'
    db = loader.db_loader(dbpath, device)
    treepath = root/'memery.ann'
    treemap = treemap_loader(treepath)
    filepaths = get_valid_images(root)

    # # Rebuild the tree if it doesn't exist or doesn't match the files on disk
    # if treemap == None or len(db) != len(filepaths):
    #     print('Indexing')
    #     dbpath, treepath = index_flow(root)
    #     treemap = loader.treemap_loader(Path(treepath))
    #     db = loader.db_loader(dbpath, device)
    
    # Convert queries to vector
    print('Converting query')
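    # Compare against None: the truth value of a multi-element Tensor is ambiguous
    if image_query is not None:
        img = crafter.preproc(image_query)
    if query and image_query is not None:
        text_vec = encoder.text_encoder(query, device)
        image_vec = encoder.image_query_encoder(img, device)
        query_vec = text_vec + image_vec
    elif query:
        query_vec = encoder.text_encoder(query, device)
    elif image_query is not None:
        query_vec = encoder.image_query_encoder(img, device)
    else:
        # Nothing to search for: stop here rather than hit an undefined query_vec below
        print('No query!')
        return []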

    # Rank db by query    
    print(f"Searching {len(db)} images")
    indexes = ranker.ranker(query_vec, treemap)
    ranked_files = ranker.nns_to_files(db, indexes)
    
    print(f"Done in {time.time() - start_time} seconds")
    
    return ranked_files

        

In [None]:
show_doc(query_flow)
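
When both `query` and `image_query` are given, `query_flow` simply adds the two encoded vectors. The toy example below shows why that can work: the sum points toward images that score well on both the text and the image. The 2-dimensional vectors, their "meanings" and the helper names are assumptions made for illustration only.

```python
def add_vectors(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]

def dot(a, b):
    """Dot product, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 2-d embeddings: axis 0 ~ "dog-ness", axis 1 ~ "beach-ness".
toy_text_vec = [1.0, 0.0]    # pretend encoding of the text "dog"
toy_image_vec = [0.0, 1.0]   # pretend encoding of a beach photo
toy_combined = add_vectors(toy_text_vec, toy_image_vec)

toy_images = {
    "dog_on_beach.jpg": [0.7, 0.7],
    "dog_indoors.jpg": [1.0, 0.0],
    "empty_beach.jpg": [0.0, 1.0],
}
best_match = max(toy_images, key=lambda name: dot(toy_combined, toy_images[name]))
```

With the combined query, the image that matches both halves scores highest, while each single-topic image only matches one half.
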

In [None]:
ranked = query_flow('./images', 'dog')

print(ranked[0])


In [None]:
assert ranked[0] == "images/memes/Wholesome-Meme-8.jpg"

![](images/memes/Wholesome-Meme-8.jpg)

*Then what?! What are the limitations of this system? What are its options? What configuration can I do if I'm a power user? Why did you organize things this way instead of a different way?*

*This, and probably each of the following notebooks, would benefit from a small recording session where I try to explain it to an imaginary audience. So that I can get the narrative of how it works, and then arrange the code around that.*
