# Multilingual Joint Image & Text Embeddings 

This example shows how [SentenceTransformers](https://www.sbert.net) can be used to map images and texts to the same vector space. 

As the model, we use the [OpenAI CLIP Model](https://github.com/openai/CLIP), which was trained on a large set of images and their alt texts.

The original CLIP model only works for English, so we used [Multilingual Knowledge Distillation](https://arxiv.org/abs/2004.09813) to make it work with 50+ languages.
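The idea behind the distillation step can be illustrated with a toy example. This is a minimal sketch, assuming the objective described in the linked paper: a student text encoder is trained so that its embeddings of an English sentence and of its translation both match the teacher's embedding of the English sentence, measured by mean squared error. The function names and all vectors below are made up for illustration; they are not real model outputs and not the training code used for this model.

```python
# Toy illustration of the multilingual knowledge distillation objective.
# All vectors are invented numbers, not outputs of a real model.

def mse(a, b):
    """Mean squared error between two equally sized vectors."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_src, student_src, student_tgt):
    """Loss for one (English sentence, translation) pair.

    teacher_src: teacher embedding of the English sentence
    student_src: student embedding of the same English sentence
    student_tgt: student embedding of the translated sentence
    """
    return mse(teacher_src, student_src) + mse(teacher_src, student_tgt)


teacher_en = [0.2, -0.5, 0.7]  # hypothetical teacher embedding (English)
student_en = [0.2, -0.5, 0.7]  # student already matches on English
student_de = [0.1, -0.4, 0.9]  # student is still off on the translation

loss = distillation_loss(teacher_en, student_en, student_de)
print(loss)
```

Minimizing this loss pulls the translated sentence towards the position the teacher assigns to the English sentence, which is why text in other languages ends up in the same vector space as the CLIP image embeddings.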

As a source for photos, we use the [Unsplash Dataset Lite](https://unsplash.com/data), which contains about 25k images. See the [License](https://unsplash.com/license) for the terms that apply to the Unsplash images.

Note: 25k images is a rather small collection. If you search for very specific terms, chances are high that no matching photo exists in it.

In [22]:
from sentence_transformers import SentenceTransformer, util
from PIL import Image
import glob
import torch
import pickle
import zipfile
from IPython.display import display
from IPython.display import Image as IPImage
import os
from tqdm.autonotebook import tqdm

# Here we load the multilingual CLIP model. Note, this model can only encode text.
# If you need embeddings for images, you must load the 'clip-ViT-B-32' model
model = SentenceTransformer('clip-ViT-B-32-multilingual-v1')


In [2]:
# Next, we get about 25k images from Unsplash 
img_folder = 'photos/'
if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
    os.makedirs(img_folder, exist_ok=True)
    
    photo_filename = 'unsplash-25k-photos.zip'
    if not os.path.exists(photo_filename):   # Download the dataset if it does not exist yet
        util.http_get('http://sbert.net/datasets/'+photo_filename, photo_filename)
        
    #Extract all images
    with zipfile.ZipFile(photo_filename, 'r') as zf:
        for member in tqdm(zf.infolist(), desc='Extracting'):
            zf.extract(member, img_folder)
        

In [3]:
# Now, we need to compute the embeddings
# To speed things up, we distribute pre-computed embeddings.
# Alternatively, you can encode the images yourself.
# The multilingual model cannot encode images, so this requires the original CLIP model:
# img_model = SentenceTransformer('clip-ViT-B-32')
# img_emb = img_model.encode(Image.open(filepath))

use_precomputed_embeddings = True

if use_precomputed_embeddings: 
    emb_filename = 'unsplash-25k-photos-embeddings.pkl'
    if not os.path.exists(emb_filename):   # Download the embeddings if they do not exist yet
        util.http_get('http://sbert.net/datasets/'+emb_filename, emb_filename)
        
    with open(emb_filename, 'rb') as fIn:
        img_names, img_emb = pickle.load(fIn)  
    print("Images:", len(img_names))
else:
    #For embedding images, we need the non-multilingual CLIP model
    img_model = SentenceTransformer('clip-ViT-B-32')

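    # Store bare file names (as in the pre-computed file) so that search() can join them with img_folder
    img_names = sorted(os.path.basename(filepath) for filepath in glob.glob(os.path.join(img_folder, '*.jpg')))
    print("Images:", len(img_names))
    img_emb = img_model.encode([Image.open(os.path.join(img_folder, name)) for name in img_names], batch_size=128, convert_to_tensor=True, show_progress_bar=True)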


Images: 24996
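
The cell above unpacks the downloaded pickle as an `(img_names, img_emb)` tuple. If you encode the images yourself, you may want to cache the result in the same layout so that later runs can skip the encoding step. The following is a minimal sketch of that idea; the helper names `save_embeddings` and `load_embeddings` and the file name `my-embeddings.pkl` are hypothetical and not part of SentenceTransformers. The demo uses plain lists as placeholder data, whereas real embeddings would be a tensor or array.

```python
import os
import pickle
import tempfile


def save_embeddings(path, names, embeddings):
    """Store names and embeddings as a 2-tuple, the layout unpacked above."""
    if len(names) != len(embeddings):
        raise ValueError("names and embeddings must have the same length")
    with open(path, "wb") as f_out:
        pickle.dump((names, embeddings), f_out)


def load_embeddings(path):
    """Load a (names, embeddings) tuple written by save_embeddings."""
    with open(path, "rb") as f_in:
        names, embeddings = pickle.load(f_in)
    return names, embeddings


# Round trip with placeholder data in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "my-embeddings.pkl")
save_embeddings(demo_path, ["a.jpg", "b.jpg"], [[0.1, 0.2], [0.3, 0.4]])
demo_names, demo_emb = load_embeddings(demo_path)
print(demo_names, demo_emb)
```

Keep in mind that `pickle.load` can execute arbitrary code, so only load pickle files from sources you trust.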


In [4]:
# Next, we define a search function.
def search(query, k=3):
    # First, we encode the query. The multilingual model can only encode text, so the query must be a string.
    query_emb = model.encode([query], convert_to_tensor=True, show_progress_bar=False)
    
    # Then, we use the util.semantic_search function, which computes the cosine-similarity
    # between the query embedding and all image embeddings.
    # It then returns the top_k highest ranked images, which we output
    hits = util.semantic_search(query_emb, img_emb, top_k=k)[0]
    
    print("Query:")
    display(query)
    for hit in hits:
        print(img_names[hit['corpus_id']])
        display(IPImage(os.path.join(img_folder, img_names[hit['corpus_id']]), width=200))


In [5]:
search("Two dogs playing in the snow")

Query:


'Two dogs playing in the snow'

lyStEjlKNSw.jpg


<IPython.core.display.Image object>

FAcSe7SjDUU.jpg


<IPython.core.display.Image object>

Hb6nGDgWztE.jpg


<IPython.core.display.Image object>
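
To make the ranking step in `search` more concrete, here is a rough pure-Python sketch of what a cosine-similarity top-k search does conceptually. It is not the library's implementation, which works on tensors and processes the corpus in chunks; the names `cosine_similarity` and `top_k_by_cosine` and the toy vectors are assumptions made for illustration. The result mimics the list of dictionaries with `corpus_id` and `score` keys that `search` reads from.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_emb, corpus_emb, k=3):
    """Score every corpus vector against the query and keep the k best."""
    scored = [
        {"corpus_id": idx, "score": cosine_similarity(query_emb, emb)}
        for idx, emb in enumerate(corpus_emb)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:k]


# Toy 2-dimensional "image embeddings" and a query close to the first one.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_hits = top_k_by_cosine([1.0, 0.1], toy_corpus, k=2)
print([hit["corpus_id"] for hit in toy_hits])  # prints [0, 2]
```

Because cosine similarity ignores vector length, only the direction of the text and image embeddings matters for the ranking.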

In [16]:
#German: A cat on a chair
search("Eine Katze auf einem Stuhl")

Query:


'Eine Katze auf einem Stuhl'

CgGDzMYdYw8.jpg


<IPython.core.display.Image object>

kjERLXaHjXc.jpg


<IPython.core.display.Image object>

I-YJ-gaJNaw.jpg


<IPython.core.display.Image object>

In [21]:
#Spanish: Many fish
search("Muchos peces")

Query:


'Muchos peces'

H22jcGTyrS4.jpg


<IPython.core.display.Image object>

CJ_9I6aXSnc.jpg


<IPython.core.display.Image object>

_MJKaRig1Ic.jpg


<IPython.core.display.Image object>

In [13]:
#Chinese: A beach with palm trees
search("棕榈树的沙滩")

Query:


'棕榈树的沙滩'

crIXKhUDpBI.jpg


<IPython.core.display.Image object>

_6iV1AJZ53s.jpg


<IPython.core.display.Image object>

rv63du1a79E.jpg


<IPython.core.display.Image object>

In [9]:
#Russian: A sunset on the beach
search("Закат на пляже")

Query:


'Закат на пляже'

JC5U3Eyiyr4.jpg


<IPython.core.display.Image object>

5z1QDcisnJ8.jpg


<IPython.core.display.Image object>

rdG4hRoyVR0.jpg


<IPython.core.display.Image object>

In [10]:
#Turkish: A dog in a park
search("Parkta bir köpek")

Query:


'Parkta bir köpek'

ROJLfAbL1Ig.jpg


<IPython.core.display.Image object>

0O9A0F_d1qA.jpg


<IPython.core.display.Image object>

4mdsPUtN0P0.jpg


<IPython.core.display.Image object>

In [12]:
# Japanese: New York at night
search("夜のニューヨーク")

Query:


'夜のニューヨーク'

FGjR4IGwP7U.jpg


<IPython.core.display.Image object>

8nCMOFYyXF4.jpg


<IPython.core.display.Image object>

ZAOEjcpdMkc.jpg


<IPython.core.display.Image object>
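
As noted at the top, the collection holds only about 25k images, so a very specific query may have no good match, yet a top-k search still returns k images. One possible way to handle this is to drop hits below a similarity cut-off. The sketch below assumes hits in the `corpus_id`/`score` format used by `search`; the helper `filter_hits`, the example scores, and the `MIN_SCORE` value are hypothetical, and a useful cut-off would have to be tuned on your own queries.

```python
def filter_hits(hits, min_score):
    """Keep only hits whose similarity score reaches min_score."""
    return [hit for hit in hits if hit["score"] >= min_score]


# Invented hits in the corpus_id/score format; not real search results.
example_hits = [
    {"corpus_id": 12, "score": 0.31},
    {"corpus_id": 7, "score": 0.24},
    {"corpus_id": 3, "score": 0.12},
]

MIN_SCORE = 0.2  # hypothetical cut-off, needs tuning for real data

good_hits = filter_hits(example_hits, MIN_SCORE)
if good_hits:
    print([hit["corpus_id"] for hit in good_hits])
else:
    print("No sufficiently similar photo found")
```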