# Jina Meme Search Workshop

![](http://examples.jina.ai/images/meme_search.gif)

In this workshop we're going to build a meme search engine using [Jina](https://github.com/jina-ai/jina/). It will search a dataset of memes and return URLs to the images as well as metadata.

For this workshop we'll just focus on searching **text**. For a more complete repo (including text/image search and frontend) you can check [this link](https://github.com/alexcg1/jina-meme-search).

You can play with a live example [here](https://examples.jina.ai).

# Prerequisites

- Check out [Jina's repo](https://github.com/jina-ai/jina/) to understand what Jina does
- Watch [Neural search using cute fuzzy animals](https://www.youtube.com/watch?v=3FyddFAFNPQ) to understand how neural search works
- Watch [Jina basics in under two minutes](https://www.youtube.com/watch?v=mnnC37ewQI8) to learn the fundamental components of Jina

# Terminology

Let's go through the words we'll be using in this workshop.

In our example, we'll search through all the <code>text</code> (which comes from a JSON file) and then display the image <code>uri</code> of each match.

<table>
    <tr>
        <td>
            <img src="https://raw.githubusercontent.com/jina-ai/workshops/main/memes/koala.png" width=300 align="left">
        </td>
        <td>
            <table>
                <tr>
                    <td>Template</td>
                    <td>Surprised Koala</td>
                </tr>
                <tr>
                    <td>Caption</td>
                    <td>This is poisonous. What</td>
                </tr>
                <tr>
                    <td><code>uri</code></td>
                    <td>https://i.imgflip.com/foo_bar.jpg</td>
                </tr>
                <tr>
                    <td><code>text</code></td>
                    <td>Surprised Koala - This is poisonous. What</td>
                </tr>
            </table>
        </td>
    </tr>
</table>
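
To make the relationship between these fields concrete, here is a minimal sketch in plain Python. The helper name `build_text` and its `sep` parameter are assumptions made for illustration (the default separator mirrors the one used by the loading function later in this workshop), and the `uri` is the placeholder from the table, not a real image.

```python
def build_text(template, caption, sep=" - "):
    """Join a meme's template name and caption into one searchable string."""
    return f"{template}{sep}{caption}"


# The example meme from the table above, as a plain dict
koala = {
    "template": "Surprised Koala",
    "caption": "This is poisonous. What",
    "uri": "https://i.imgflip.com/foo_bar.jpg",  # placeholder, not a real image
}

# `text` is what gets searched; `uri` is what gets displayed for a match
koala["text"] = build_text(koala["template"], koala["caption"])
print(koala["text"])
```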

# Set up basics

## Housekeeping

### Clean up from last time

In [5]:
# Each `!` command runs in its own subshell, so a `!cd` here would not persist
!rm -rf workspace images *.jpg

zsh:1: no matches found: *.jpg


### Set basic options

In [6]:
# Enable Jupyter widgets so we can see images
!jupyter nbextension enable --py widgetsnbextension

# Disable warnings
import warnings
warnings.filterwarnings('ignore')

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


## Set maximum images to index

Since this is just a workshop and not a real-world application we'll just search through 50 memes. This will save us time in processing.

In [7]:
max_docs = 50

## Install Prerequisites

In [8]:
!pip install jina==2.6.4
!pip install ipywidgets==7.6.5 # Get nice widgets in the notebook



# Process data

We use an open-source dataset from imgflip, originally downloaded from [Kaggle](https://www.kaggle.com/abhishtagatya/imgflipscraped-memes-caption-dataset).

## Why this dataset?

We chose this dataset because

- It has rich metadata (caption, template name)
- It has recognizable memes (many datasets were just random pics with overlaid Impact font)
- It doesn't *seem* to have too many racist/sexist/\*phobic memes 🤞

## Why does this dataset kinda suck?

It only contains so many memes, and new memes come out all the time. So it won't have the latest, dankest stuff.

## Download data

In [9]:
!mkdir -p data
!wget -O data/memes.json -nc https://jina-examples-datasets.s3.amazonaws.com/memes/memes.json -q

## Load data

I've written a function to help load the data from the JSON file we downloaded earlier.

In this function we:

- Create a `DocumentArray` to hold `Documents` (using [docarray package](https://docarray.jina.ai))
- Optionally shuffle the memes
- Create a `Document` for each meme
- Set `Document.text` to the template name (e.g. `"Surprised Koala"`) + meme caption (e.g. `"This is poisonous. What"`)
- Populate some `tags` for the `Document` (e.g. absolute URL for image)

In [10]:
from docarray import Document, DocumentArray

In [11]:
import json
import random

def prep_docs(input_file, num_docs=None, shuffle=True):
    """Load memes from a JSON file and return them as a DocumentArray."""
    docs = DocumentArray()
    memes = []
    print(f"Processing {input_file}")
    with open(input_file, "r") as file:
        raw_json = json.load(file)

    # Flatten the per-template lists, remembering each meme's template name
    for template in raw_json:
        for meme in template["generated_memes"]:
            meme["template"] = template["name"]
        memes.extend(template["generated_memes"])

    if shuffle:
        # Fixed seed so every run shuffles the memes into the same order
        random.seed(1337)
        random.shuffle(memes)

    # num_docs=None slices the whole list, i.e. no limit
    for meme in memes[:num_docs]:
        doctext = f"{meme['template']} - {meme['caption_text']}"
        doc = Document(text=doctext)
        doc.tags = meme
        # Prefix the scheme to get an absolute image URL
        doc.tags["uri_absolute"] = "http:" + doc.tags["image_url"]
        docs.append(doc)

    return docs
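
To make the expected input shape concrete, here is a sketch of the flattening step on a tiny hand-made sample, without `docarray`. Only the key names (`name`, `generated_memes`, `caption_text`, `image_url`) come from the function above. The values, the second template, the helper name `flatten_memes`, and the assumption that `image_url` is stored without a scheme (for example `//i.imgflip.com/...`) are all illustrative.

```python
import json

# A tiny hand-made sample in the shape prep_docs reads.
# Only the key names come from the workshop code; the values are invented.
sample_json = """
[
  {
    "name": "Surprised Koala",
    "generated_memes": [
      {"caption_text": "This is poisonous. What", "image_url": "//i.imgflip.com/foo_bar.jpg"}
    ]
  },
  {
    "name": "Example Template",
    "generated_memes": [
      {"caption_text": "first caption", "image_url": "//i.imgflip.com/example_1.jpg"},
      {"caption_text": "second caption", "image_url": "//i.imgflip.com/example_2.jpg"}
    ]
  }
]
"""


def flatten_memes(raw_json):
    """Copy each template's name onto its memes and return one flat list."""
    memes = []
    for template in raw_json:
        for meme in template["generated_memes"]:
            meme["template"] = template["name"]
        memes.extend(template["generated_memes"])
    return memes


memes = flatten_memes(json.loads(sample_json))
for meme in memes:
    text = f"{meme['template']} - {meme['caption_text']}"
    uri_absolute = "http:" + meme["image_url"]
    print(text, "->", uri_absolute)
```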

### Why shuffle?

The memes are arranged alphabetically by template. So if we don't shuffle, the memes we index will come from just the first few templates and be very similar to each other. This makes it more difficult to search for something interesting.
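
The effect is easy to see on a toy list. This is a sketch under stated assumptions: the list of strings is a hypothetical stand-in for the real memes, `first_n` is a made-up helper, and it uses a local `random.Random(seed)` instance on a copy rather than the global `random.seed` call and in-place shuffle used in `prep_docs`.

```python
import random

# Hypothetical stand-in for the flattened meme list: 3 templates x 4 memes each,
# sorted by template like the real file
memes = [f"Template {t} / meme {i}" for t in "ABC" for i in range(4)]


def first_n(memes, n, shuffle, seed=1337):
    """Return the first n items, optionally after a seeded shuffle of a copy."""
    items = list(memes)  # copy so the caller's list keeps its order
    if shuffle:
        random.Random(seed).shuffle(items)
    return items[:n]


unshuffled = first_n(memes, 4, shuffle=False)
shuffled = first_n(memes, 4, shuffle=True)

print(unshuffled)  # every item is from "Template A"
print(shuffled)    # repeatable across runs, because the seed is fixed
```

Fixing the seed keeps the workshop reproducible: everyone indexes the same 50 memes and sees the same search results.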

In [12]:
docs = prep_docs(
    input_file="data/memes.json", 
    num_docs=max_docs, 
    shuffle=False
)

Processing data/memes.json


In [13]:
!wget -nc https://raw.githubusercontent.com/jina-ai/workshops/main/memes/helper.py
from helper import show_images

File ‘helper.py’ already there; not retrieving.



In [14]:
show_images(docs)

./images/3xd5o0.jpg
./images/3xctnx.jpg
./images/3wu3bd.jpg
./images/3xb1dj.jpg
./images/3w8qn4.jpg



If we shuffle we get a more interesting mix:

In [15]:
docs = prep_docs(
    input_file="data/memes.json", 
    num_docs=max_docs, 
    shuffle=True
)

Processing data/memes.json


In [16]:
show_images(docs)

./images/2ggogb.jpg
./images/39nq93.jpg
./images/2l0pta.jpg
./images/ldben.jpg
./images/1k5o93.jpg



# Index and search data with Flow

## Set up Flow

Before we index or search, we need to create our Flow. Only then can we open it as a context manager and do stuff with it.

In [17]:
from jina import Flow

In [18]:
flow = (
    Flow()
    .add(
        name="meme_text_encoder",
        uses="jinahub://SpacyTextEncoder/",                 # Using Executors from Jina Hub means we don't need to write our own!
        uses_with={"model_name": "en_core_web_md"},
        install_requirements=True
    )
    .add(
        name="meme_text_indexer",
        uses="jinahub://SimpleIndexer",
        install_requirements=True
    )
)

## Index data with Flow

We use our Flow to build an index that records where each meme's text embedding lies in an n-dimensional vector space.
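
As a rough intuition for what "lying in a space" buys us, here is a toy sketch in plain Python. The three-dimensional vectors are made up, the `search` helper is hypothetical, and cosine similarity is just one common way to compare embeddings. This is not a description of how the encoder or indexer Executors work internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings"; a real encoder produces many more dimensions
index = {
    "Surprised Koala - This is poisonous. What": [0.9, 0.1, 0.0],
    "Example Template - homework is due tomorrow": [0.1, 0.9, 0.2],
    "Example Template - the teacher saw that": [0.2, 0.8, 0.1],
}


def search(query_vector, index, top_k=2):
    """Rank indexed texts by similarity to the query vector, best first."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index.items()]
    scored.sort(reverse=True)
    return scored[:top_k]


# Pretend this is the embedding of the query "school"
results = search([0.0, 1.0, 0.1], index)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Texts whose vectors point in a similar direction to the query rank highest, which is why a query can match memes that never contain the query word itself.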

In [20]:
with flow:
  flow.index(
      inputs=docs,
      request_size=64,
  )
print("DONE!")

⠸ 2/3 waiting meme_text_encoder to be ready... Collecting en-core-web-md==3.1.0
⠼ 2/3 waiting meme_text_encoder to be ready...   Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.1.0/en_core_web_md-3.1.0-py3-none-any.whl (45.4 MB)
⠦ 2/3 waiting meme_text_encoder to be ready... ✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
           Flow@1072746[I]:🎉 Flow is ready to use!
	🔗 Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:54415
	🔒 Private network:	192.168.1.68:54415
	🌐 Public address:	83.47.10.22:54415


            The syntax of traversal_path is changed to comma-separated string, '
            that means your need to change ('r',) into `r`. '
            The old list of string syntax will be deprecated soon
             (raised from /mnt/data/work/repos/workshops/memes/env/lib/python3.7/site-packages/docarray/array/mixins/traverse.py:28)


DONE!


## Search data with Flow

### Create query Document

A Document is the fundamental data type that Jina works with, so anything we pass in or get out needs to be a Document.

In [21]:
query_doc = Document(text="school")

### Pass query Document to Flow

This will search our index of 50 memes for Documents similar to our query Document.

In [22]:
with flow:
  response = flow.search(inputs=query_doc, return_results=True)
print("DONE!")

⠸ 2/3 waiting meme_text_encoder to be ready... Collecting en-core-web-md==3.1.0
⠹ 2/3 waiting meme_text_encoder to be ready...   Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.1.0/en_core_web_md-3.1.0-py3-none-any.whl (45.4 MB)
⠸ 2/3 waiting meme_text_encoder to be ready... ✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
           Flow@1072746[I]:🎉 Flow is ready to use!
	🔗 Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:56319
	🔒 Private network:	192.168.1.68:56319
	🌐 Public address:	83.47.10.22:56319


            The syntax of traversal_path is changed to comma-separated string, '
            that means your need to change ('r',) into `r`. '
            The old list of string syntax will be deprecated soon
             (raised from /mnt/data/work/repos/workshops/memes/env/lib/python3.7/site-packages/docarray/array/mixins/traverse.py:28)


DONE!


### Extract matches

A Jina response contains a lot of extra data. We just want the DocumentArray of matching Documents.

In [23]:
matches = response[0].docs[0].matches
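
If you would rather inspect the matches as text than as images, something like the sketch below could help. `summarize_matches` is a hypothetical helper, not part of Jina. It assumes each match exposes `.text` and a dict-like `.tags` containing the `uri_absolute` key set in `prep_docs`. The stand-in objects exist only so the sketch runs without a Flow.

```python
from types import SimpleNamespace


def summarize_matches(matches, limit=5):
    """Return (text, image URL) pairs for the first `limit` matches."""
    rows = []
    for match in list(matches)[:limit]:
        # uri_absolute is the tag set in prep_docs; fall back to None if absent
        rows.append((match.text, match.tags.get("uri_absolute")))
    return rows


# Stand-in objects so the sketch runs without a Flow; real matches are Documents
fake_matches = [
    SimpleNamespace(
        text="Example Template - back to school",
        tags={"uri_absolute": "http://i.imgflip.com/example_1.jpg"},
    ),
    SimpleNamespace(text="Example Template - no tags here", tags={}),
]

for text, uri in summarize_matches(fake_matches):
    print(text, "|", uri)
```

In the notebook you would call `summarize_matches(matches)` on the real matches instead of the stand-ins.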

In [24]:
show_images(matches)

./images/2xp1ne.jpg
./images/1dynut.jpg
./images/1xrl4d.jpg
./images/r6txm.jpg
./images/3bkpsm.jpg



## Using this IRL

How can we take this out of a notebook and build a real-world application with it?

### Use a better indexer

Jina's **[PQLiteIndexer](https://hub.jina.ai/executor/pn1qofsj)** offers powerful features like pre-filtering based on metadata. In our notebook we just use [SimpleIndexer](https://hub.jina.ai/executor/zb38xlt4) which is nice for demonstrations but lacks PQLite's power.

### Use a RESTful API

#### In notebook: gRPC

```python
with flow:
    flow.search(Document(text="foo"))
```

#### In Python: RESTful or gRPC

```python
with flow:
    flow.protocol = "http"
    flow.port_expose = 12345
    flow.block()
```
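
Once the Flow is serving over HTTP, any client that can send JSON can query it. The sketch below uses only the Python standard library. It rests on assumptions you should verify against the docs for your Jina version: that the gateway exposes a `/search` endpoint and that it accepts a body of the form `{"data": [{"text": ...}]}`. The helper names `build_search_request` and `send` are made up for illustration.

```python
import json
import urllib.request


def build_search_request(query, host="localhost", port=12345):
    """Build (but don't send) a POST request for the Flow's HTTP gateway.

    Assumptions: the gateway exposes a /search endpoint and accepts a JSON
    body of the form {"data": [{"text": ...}]}. Check your Jina version's docs.
    """
    body = json.dumps({"data": [{"text": query}]}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and decode the JSON response (needs a running Flow)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_search_request("school")
print(request.get_method(), request.full_url)
```

With the Flow running in another process, `send(request)` would return the decoded JSON response. The port matches the `port_expose` value used above.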

### Use Dockerized Executors...

This means having to install fewer requirements locally.


#### In notebook: `jinahub://foo`

```python
flow = (
    Flow()
    .add(...)
    .add(
        uses="jinahub://SimpleIndexer",
    )
)
```

#### In Python: `jinahub+docker://foo`

```python
flow = (
    Flow()
    .add(...)
    .add(
        uses="jinahub+docker://SimpleIndexer",
    )
)
```

### ...or wrap everything in Docker

See an example [docker-compose.yml](https://github.com/alexcg1/jina-meme-search/blob/main/docker-compose.yml).