# Turn a working Jina example into a full-fledged memery pipeline

A lot of the code I wrote for memery was just glue code for moving files and tensors around. In the Jina framework, a lot of that is done for me. But I still need to build a Flow using Executors explicitly. What are the Jina analogies for the processes I'm using in memery?


- Index flow:
   - Get location of files
   - Get device (GPU/CPU)

   - Get image files recursively from root
   - Read saved files

   - Find new/changed files

   - Transform and batch new files
   - Encode new files

   - Update database w new files
   - Index all files
   - Save database to archive

   - Return savefile

- Query flow:
   - Start timer
   - Get location of files
   - Get device (GPU/CPU)
   
   - Read database
   - Load index
   - Check if files changed from database
     - Index database if changed
   
   - Check for image/text query
   - Encode query
     - Add query encodings if multiple
   
   - Index top_k by query
   - Get filenames from index
   
   - Return filenames
   
In hindsight, I should not have put the indexing flow inside the querying flow, or I should at least have put it behind a toggle. With Jina this shouldn't be an issue: different flows have their own endpoints, and the client and server apps will be separate concerns.

So a lot of this really has to do with tracking the files in the filesystem. That's good when you're dealing with one local folder, but it doesn't scale well. I want to use a known database system, either a file-based one or a full relational db server. Jina has Storage executors for this function, with CRUD endpoints built in.
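To make the file-based option concrete, here is a minimal sketch of the "find new/changed files" step backed by SQLite instead of a saved archive. It assumes Python's built-in `sqlite3` module and uses file modification time as the change signal; the table layout and the function names (`open_tracking_db`, `find_changed_files`, `mark_indexed`) are hypothetical, not part of memery or Jina.

```python
import sqlite3
from pathlib import Path


def open_tracking_db(db_path=":memory:"):
    """Open (or create) a small SQLite database that remembers indexed files."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, mtime REAL NOT NULL)"
    )
    return conn


def find_changed_files(conn, paths):
    """Return absolute paths that are new, or whose mtime differs from the stored one."""
    known = dict(conn.execute("SELECT path, mtime FROM files"))
    changed = []
    for p in paths:
        key = str(Path(p).resolve())
        if known.get(key) != Path(key).stat().st_mtime:
            changed.append(key)
    return changed


def mark_indexed(conn, paths):
    """Record the current mtime of each path once it has been encoded and indexed."""
    rows = [(str(Path(p).resolve()), Path(p).stat().st_mtime) for p in paths]
    conn.executemany("INSERT OR REPLACE INTO files (path, mtime) VALUES (?, ?)", rows)
    conn.commit()
```

The idea would be to call `find_changed_files` before indexing, send only those paths through the encoder, and call `mark_indexed` afterwards. Deleted files are not handled in this sketch.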

There are also a lot of steps dealing with transforming the inputs to a PyTorch format for CLIP to deal with. There are already CLIP preprocessing and encoding executors.

Finally there's indexing and tree-building. A good chunk of that is done within Indexer executors, so that should be pretty simple. After that it's all frontend design.

I'll build the index flow first, populate a test index, then build a query flow that only tests it. Tying the indexing into the query optionally can come later.

Before building the flow, I'll want to test each Executor and make sure their inputs and outputs line up.

In [None]:
from jina import Document, DocumentArray, Executor, Flow, requests
from pathlib import Path
import torch

## Index flow
   - Get location of files
   - Get device (GPU/CPU)
   
This is still pretty straightforward. Finding the path we're supposed to search is still necessary, even if the embeddings are actually kept in a database.

In [None]:
path = './images/memes'

In [None]:
root = Path(path)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


### Loader

- Get image files recursively from root
   - Read saved files
   - Find new/changed files

This is where the Storage executor will eventually come in. Right now I'm going to pull all the files anew.

In [None]:
from memery.loader import get_image_files, verify_image

In [None]:
from jina.types.document.generators import from_files

In [None]:
files_list = [str(o[0].resolve()) for o in get_image_files(root) if verify_image(o[0])]
# files_list

84it [00:00, 55024.45it/s]

Skipping bad file: images/memes/corrupted-file.jpeg
due to <class 'PIL.UnidentifiedImageError'>
Skipping bad file: images/memes/.ipynb_checkpoints/corrupted-file-checkpoint.jpeg
due to <class 'PIL.UnidentifiedImageError'>





In [None]:
docs = DocumentArray(from_files(files_list))

In [None]:
docs

<jina.types.arrays.document.DocumentArray length=81 at 140219451978320>

### Encoder
- Transform and batch new files
   - Encode new files
   
For this process we use the Flow to batch things through. I'm copying some code here from [AlexCG's meme search repo](https://github.com/alexcg1/jina-meme-search/blob/168e0f2ca6b4a34e3db730e2439cc61a0162e020/backend-image/app.py) to play with.

In [None]:
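class UriToBlob(Executor):
    """Load each image from its URI so the Document carries pixel data, not just a filename."""

    @requests
    def uri_to_blob(self, docs, **kwargs):
        for doc in docs:
            # Keep the original and absolute paths as tags so results can point back to the file
            doc.tags["uri"] = doc.uri
            doc.tags["uri_absolute"] = str(Path(doc.uri).resolve())
            # Read the image at doc.uri into doc.blob
            doc.convert_image_uri_to_blob()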

In [None]:
WORKSPACE_DIR = "workspace"

index_flow = (
    Flow()
    .add(
        uses=UriToBlob, 
        name="processor",
#         needs="gateway"
        ) # Embed image in doc, not just filename
    .add(
        uses="jinahub://ImageNormalizer",
        uses_with={"target_size": 96},
        name="image_normalizer",
    )
    .add(
        uses="jinahub+docker://CLIPImageEncoder",
        uses_metas={"workspace": WORKSPACE_DIR},
#         uses_with={'device': 'cuda'}, gpus='all',
        name="meme_image_encoder",
        volumes='./data:/encoder/data',
        
    )
    .add(
        uses="jinahub://SimpleIndexer/old",
        uses_with={"index_file_name": "index"},
        uses_metas={"workspace": WORKSPACE_DIR},
        name="meme_image_simple_indexer",
        volumes=f"./{WORKSPACE_DIR}:/workspace/workspace",
    )
)

In [None]:
index_flow

In [None]:
# # def index():
# if Path(WORKSPACE_DIR).exists():
#     print(f"'{WORKSPACE_DIR}' folder exists. Please delete")
#     sys.exit
    
docs = DocumentArray(from_files(files_list))

with index_flow:
    index_flow.index(inputs=docs, return_results=True)



            simple_indexer shadows one of built-in Python module name.
            It is imported as `user_module.simple_indexer`

            Affects:
            - Either, change your code from using `from simple_indexer import ...`
              to `from user_module.simple_indexer import ...`
            - Or, rename simple_indexer to another name
            [0m [1;30m(raised from /home/mage/.local/lib/python3.7/site-packages/jina/importer.py:120)[0m


4/5 waiting meme_image_encoder to be ready...

In [None]:
# index_flow.protocol = "http"
# index_flow.port_expose = 12345

# with index_flow:
#     index_flow.block()


   - Update database w new files
   - Index all files
   - Save database to archive

   - Return savefile

## Query flow

- Query flow:
   - Start timer
   - Get location of files
   - Get device (GPU/CPU)
   
   - Read database
   - Load index
   - Check if files changed from database
     - Index database if changed
   
   - Check for image/text query
   - Encode query
     - Add query encodings if multiple
   
   - Index top_k by query
   - Get filenames from index
   
   - Return filenames
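The encode-and-rank steps in the list above can be sketched in plain Python, just to pin down what the query flow has to reproduce. This assumes that multiple query encodings are combined by element-wise addition and that ranking is by cosine similarity; both are my assumptions for illustration, and the function names are hypothetical. In the Jina version the encoder and indexer executors should cover this.

```python
import math


def combine_queries(encodings):
    """Add several query vectors element-wise into a single query vector."""
    return [sum(values) for values in zip(*encodings)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, index, k=3):
    """Return the k filenames whose embeddings are most similar to the query.

    `index` maps filename -> embedding (a list of floats).
    """
    ranked = sorted(index.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```

With a toy index such as `{"cat.jpg": [1.0, 0.0], "dog.jpg": [0.0, 1.0]}`, `top_k([1.0, 0.0], index, k=1)` picks out `cat.jpg`.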
   

In [None]:
query_flow = (
    Flow()
    .add(
        uses="jinahub+docker://CLIPTextEncoder",
        uses_metas={"workspace": WORKSPACE_DIR},
#         uses_with={'device': 'cuda'},
#         volumes="./data:/encoder/data",
        volumes='~/.cache/huggingface:/root/.cache/huggingface',
        name="text_encoder",
    )
    .add(
        uses="jinahub://SimpleIndexer/old",
        uses_with={"index_file_name": "index"},
        uses_metas={"workspace": WORKSPACE_DIR},
        name="meme_image_simple_indexer",
        volumes=f"./{WORKSPACE_DIR}:/workspace/workspace",
    )
)

In [None]:
query_flow

In [None]:
# Sometimes get port errors from docker, restarting docker fixes this

query_flow.protocol = "http"
query_flow.port_expose = 12345
# Start the Flow
with query_flow:
#     query_flow.post(on="/index", inputs=docs) # Set the Flow to index
#     query_flow.search(
#         inputs = Document(text="dog"),
#         on_done = print)  # pass the function itself; print() would be called immediately and pass None
    query_flow.block() # Keep the Flow open, ready for user to search

            simple_indexer shadows one of built-in Python module name.
            It is imported as `user_module.simple_indexer`

            Affects:
            - Either, change your code from using `from simple_indexer import ...`
              to `from user_module.simple_indexer import ...`
            - Or, rename simple_indexer to another name
            [0m [1;30m(raised from /home/mage/.local/lib/python3.7/site-packages/jina/importer.py:120)[0m


2/3 waiting text_encoder to be ready...

Okay, that works! With some of Alex's code in `jina_frontend.py` run through Streamlit, I'm able to hook right into the running Jina process and send search requests. 


It still takes a long time to boot up CLIP, though. I'll want to make it easy to start the backend and leave it running. After that, I'll need a way to index folders and add them to the database, and also a way to filter results when you're only searching one folder of the memes stored in the total database.
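One simple way to do the folder filter would be to post-filter results by path. This is a minimal sketch, assuming each result carries an absolute file path (like the `uri_absolute` tag set in `UriToBlob`); the function name `filter_by_folder` is hypothetical, and filtering inside the indexer may turn out to be the better design.

```python
from pathlib import Path


def filter_by_folder(paths, folder):
    """Keep only the paths that live somewhere under `folder` (at any depth)."""
    folder = Path(folder).resolve()
    return [p for p in paths if folder in Path(p).resolve().parents]
```

This checks parent directories rather than comparing string prefixes, so `/memes/cats2/a.jpg` is not mistaken for a file under `/memes/cats`.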

Really this points to a philosophical issue: do I want this to be an app that's easy to use locally, with smart folder management? Something like VLC that can scan a Library of folders but keep the metadata separate from file organization? Or do I want more of a web-hosted interface that can keep its own database and not worry about files?

Well, perhaps it will be possible to do both. But right now I'm thinking of the "web share" functionality as central to the casual use of memery, i.e., the user "shares" an image to the database where it is saved and indexed, then they can search within the memery interface and "share" it back out. So I guess the single-update functionality will be as important as the batch, folder-driven functionality. 

But now I know I can set up the CRUD API and do a hosted backend service for sure. It's time to start thinking of what I really want in a frontend app...