In [None]:
# default_exp core

In [None]:
#hide
from nbdev.showdoc import *

# Core

Find the meme you are looking for! <~~ *this one i like*

## Rationale

This is the core module. It loads the CLIP model and the encodings of all the images in the folder, encodes the search text or image, and returns a list of filenames sorted by similarity to the query.

*Express this as a process and nouns, instead of a blob sentence. The next part doesn't count because it's a whole new heading and it's crufted with nerd talk. Say clearly what the core flow is here. Something like:*

*Memery takes a folder of images, and a search query, and returns a list of nearest neighbors to the query.*

*The query flow and index flow can be used separately, but by default the query flow calls the indexing flow on each search. Image encodings are saved to disk. Only new images will be encoded with each indexing.*
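
As a rough illustration of that core flow, here is a minimal sketch of nearest-neighbor ranking over a few made-up encodings. The vectors, the filenames and the `rank_files` helper are assumptions for illustration only. The real module gets its vectors from CLIP and searches a treemap instead of scanning every item.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_files(query_vec, encodings):
    """Return filenames sorted from most to least similar to the query."""
    return sorted(
        encodings,
        key=lambda name: cosine_similarity(query_vec, encodings[name]),
        reverse=True,
    )

# Toy 3-dimensional "encodings" (hypothetical; CLIP vectors are much longer)
toy_encodings = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}

# A query vector that points mostly in the "cat" direction
ranked = rank_files([1.0, 0.2, 0.0], toy_encodings)
print(ranked)
```

A brute-force scan like this is fine for a few files. It gets slow as the folder grows, which is presumably why the real flow builds a treemap.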


## Modular flow system

I'm using the Neural Search design pattern as described by Han Xiao in e.g. [Generic Neural Elastic Search: From bert-as-service and Go Way Beyond](https://hanxiao.io/2019/07/29/Generic-Neural-Elastic-Search-From-bert-as-service-and-Go-Way-Beyond) &c.

This is a system designed to be scalable and distributed if necessary. Even for a single-machine scenario, I like the functional style of it: grab data, transform it and pass it downstream, all the way from the folder to the output widget.

There are two main types of operator in neural search: **flows** and **executors**.

**Flows** are specific patterns of data manipulation and storage. **Executors** are the operators that transform the data within the flow. 

There are two core flows to any search system: indexing, and querying. The plan here is to make executors that can be composed into flows and then compose the flows into a UI that supports querying and, to some extent, indexing as well.

The core executors for this use case are:
 - Loader
 - Crafter
 - Encoder
 - Indexer
 - Ranker
 - Gateway
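
To make the idea of composing executors into a flow concrete, here is a minimal sketch. The `make_flow` helper and the three toy executors are hypothetical stand-ins, not the functions in this package. They only show data being grabbed, transformed and passed downstream.

```python
from functools import reduce

def make_flow(*executors):
    """Chain executors so that each one's output feeds the next."""
    def flow(data):
        return reduce(lambda value, executor: executor(value), executors, data)
    return flow

# Hypothetical stand-in executors; the real ones load images and run CLIP
def toy_loader(names):
    """Keep only names that look like image files."""
    return [n for n in names if n.endswith(".jpg")]

def toy_crafter(names):
    """Normalize the names (stands in for image preprocessing)."""
    return [n.lower() for n in names]

def toy_encoder(names):
    """Map each name to a fake one-number 'encoding'."""
    return {n: len(n) for n in names}

toy_index_flow = make_flow(toy_loader, toy_crafter, toy_encoder)
result = toy_index_flow(["A.jpg", "notes.txt", "Bee.jpg"])
print(result)
```

Because each executor is a plain function, the same pieces could be rearranged into a querying flow without changing them.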
 

**Gateway Process -- not yet implemented**

Takes a query and processes it through either Indexing Flow or Querying Flow, passing along arguments. The main entrypoint for each iteration of the index/query process.
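
Since the gateway is not implemented yet, the following is only a sketch of what such a dispatcher might look like. The `gateway` function, its `mode` argument and the two fake flows are all assumptions. In the package, the real flows would be passed in where the fakes are.

```python
def gateway(mode, path, query=None, index_fn=None, query_fn=None):
    """Route one request to the indexing or the querying flow (hypothetical)."""
    if mode == "index":
        return index_fn(path)
    if mode == "query":
        if query is None:
            raise ValueError("a query is required in 'query' mode")
        return query_fn(path, query)
    raise ValueError(f"unknown mode: {mode!r}")

# Fake flows so the sketch runs without the CLIP model
def fake_index(path):
    return (f"{path}/memery.pt", f"{path}/memery.ann")

def fake_query(path, query):
    return [f"{path}/{query}.jpg"]

saved = gateway("index", "images", index_fn=fake_index, query_fn=fake_query)
found = gateway("query", "images", query="dog", index_fn=fake_index, query_fn=fake_query)
print(saved, found)
```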

Querying Flow can technically process either text or image search, because the CLIP encoder will put them into the same embedding space. So we might as well build in a method for either, and make it available to the user, since it's impressive and useful and relatively easy to build. 
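
Because text and images share one embedding space, their vectors can be combined into a single query. A minimal sketch, using plain lists in place of tensors: the `combine_queries` helper is hypothetical, and rescaling the sum to unit length is an assumption added here so that a combined query has the same length as a single one.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def combine_queries(text_vec=None, image_vec=None):
    """Sum whichever query vectors are present, then rescale to unit length."""
    vecs = [v for v in (text_vec, image_vec) if v is not None]
    if not vecs:
        raise ValueError("need a text vector, an image vector, or both")
    summed = [sum(parts) for parts in zip(*vecs)]
    return normalize(summed)

# Toy 2-dimensional vectors (hypothetical; real encodings come from CLIP)
combined = combine_queries(text_vec=[1.0, 0.0], image_vec=[0.0, 1.0])
text_only = combine_queries(text_vec=[3.0, 4.0])
print(combined, text_only)
```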

Eventually the Gateway process probably needs to be quite complicated, for serving all the different users and for delivering REST APIs to different clients. We'll need a way to accept text and images as HTTP requests and return JSON dictionaries (especially at the encoder, which will remain server-bound more than any other executor).

*Do i actually need any of this? Perhaps if i'm going to implement it myself. But won't i just end up using Jina once it gets implemented fully? Or something similar? Well in any case this is meant to be a readable document, so i can explain how this is meant to work in the future without being so obtuse that it becomes an obstacle to reading further...*

### Usage

For now, calling `query_flow` directly is the simplest gateway: pass it a folder path plus a text query, an image query, or both.

*Meaningless. Explain how to use it closer to where it is implemented. Also show an example of it in use. Terrible sentence really.*

## Flows

The `index_flow` runs these steps in order:
 - checks the local directory for files with image extensions
 - loads the archive of saved encodings and splits out any images that are not in it yet
 - encodes the new images
 - joins the old and new encodings, builds a treemap, and saves it along with the updated archive

It returns a tuple with the locations of the archive and treemap.

*Split this up into small chunks with code tests. No blob sentences!*
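
The "only encode what is new" step can be sketched on its own. This is an assumption about the behavior, not the code of `archive_loader`: the `split_new_files` helper is hypothetical, the archive is modeled as a plain dict from filename to encoding, and dropping entries for deleted files is a choice made here.

```python
def split_new_files(filepaths, archive):
    """Separate files that already have encodings from those that need them."""
    current = set(filepaths)
    # Keep saved encodings only for files that still exist in the folder
    kept = {path: vec for path, vec in archive.items() if path in current}
    # Anything in the folder without a saved encoding must be encoded
    new_files = [path for path in filepaths if path not in archive]
    return kept, new_files

# Toy archive: "old.jpg" was deleted from the folder, "c.jpg" was added
toy_archive = {"a.jpg": [0.1], "b.jpg": [0.2], "old.jpg": [0.3]}
kept, new_files = split_new_files(["a.jpg", "b.jpg", "c.jpg"], toy_archive)
print(kept, new_files)
```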

In [None]:
#export
import time
import torch

from pathlib import Path
from memery.loader import get_image_files, archive_loader, db_loader, treemap_loader 
from memery.crafter import crafter, preproc
from memery.encoder import image_encoder, text_encoder, image_query_encoder
from memery.indexer import join_all, build_treemap, save_archives
from memery.ranker import ranker, nns_to_files

#### Indexing

In [None]:
#export
def index_flow(path):
    root = Path(path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    filepaths = get_image_files(root)
    
    archive_db, new_files = archive_loader(filepaths, root, device)
    print(f"Loaded {len(archive_db)} encodings")
    print(f"Encoding {len(new_files)} new images")

#     start_time = time.perf_counter()

    crafted_files = crafter(new_files, device)
    new_embeddings = image_encoder(crafted_files, device)
    
    db = join_all(archive_db, new_files, new_embeddings)
    print("Building treemap")
    t = build_treemap(db)
    
    print(f"Saving {len(db)} encodings")
    save_paths = save_archives(root, t, db)
#     print(f"Done in {time.perf_counter() - start_time} seconds")
    
    return save_paths

In [None]:
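# delete the current savefiles for testing purposes
# missing_ok=True (Python 3.8+) keeps this cell from failing on a fresh checkout
Path('images/memery.pt').unlink(missing_ok=True)
Path('images/memery.ann').unlink(missing_ok=True)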

save_paths = index_flow('./images')

In [None]:
assert save_paths
save_paths


('images/memery.pt', 'images/memery.ann')

#### Querying

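The `query_flow` takes a folder path and a text query, an image query, or both. It loads the saved encodings and the treemap. If the treemap is missing, or the number of saved encodings differs from the number of image files in the folder, it calls `index_flow` first and reloads both. It then encodes the query, searches the treemap, and returns the ranked filenames.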

*BLOB!*
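
The reindexing check in the querying flow compares counts only. The sketch below isolates that check and puts a stricter variant next to it. Both helper names are hypothetical, and `indexed_names` assumes the saved encodings can be listed by filename, which this notebook does not confirm.

```python
def needs_reindex_by_count(treemap, db, filepaths):
    """Reindex when there is no treemap or the number of items changed."""
    return treemap is None or len(db) != len(filepaths)

def needs_reindex_by_name(treemap, indexed_names, filepaths):
    """Stricter variant: reindex when the set of filenames changed at all."""
    return treemap is None or set(indexed_names) != set(filepaths)

# One file was swapped for another, so the count stays the same
indexed = ["a.jpg", "b.jpg"]
on_disk = ["a.jpg", "c.jpg"]
fake_treemap = object()  # stands in for a loaded treemap

count_says = needs_reindex_by_count(fake_treemap, indexed, on_disk)
name_says = needs_reindex_by_name(fake_treemap, indexed, on_disk)
print(count_says, name_says)
```

The count check is cheap, but it misses the case where one image is removed and another added between searches.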

In [None]:
#export
def query_flow(path, query=None, image_query=None):
    start_time = time.time()
    print('starting timer')
    root = Path(path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    print("Checking files")
    dbpath = root/'memery.pt'
    db = db_loader(dbpath, device)
    treepath = root/'memery.ann'
    treemap = treemap_loader(treepath)
    
    filepaths = get_image_files(root)
    if treemap is None or len(db) != len(filepaths):
        print('Indexing')
        dbpath, treepath = index_flow(root)
        treemap = treemap_loader(Path(treepath))
        db = db_loader(dbpath, device)
    
    print('Converting query')
    if image_query:
        img = preproc(image_query)
    if query and image_query:
        text_vec = text_encoder(query, device)
        image_vec = image_query_encoder(img, device)
        query_vec = text_vec + image_vec
    elif query:
        query_vec = text_encoder(query, device)
    elif image_query:
        query_vec = image_query_encoder(img, device)
    else:
        # Nothing to rank without a query; return early so query_vec is never used undefined
        print('No query!')
        return []
        
    print(f"Searching {len(db)} images")
    indexes = ranker(query_vec, treemap)
    ranked_files = nns_to_files(db, indexes)
    
    print(f"Done in {time.time() - start_time} seconds")
    
    return ranked_files

        

*Then what?! What are the limitations of this system? What are its options? What configuration can i do if i'm a power user? Why did you organize things this way instead of a different way?*

*This, and probably each of the following notebooks, would benefit from a small recording session where I try to explain it to an imaginary audience. So that I can get the narrative of how it works, and then arrange the code around that.*
