<a href="https://colab.research.google.com/github/evergreenllc2020/clip/blob/main/Semantic_Search_on_images_with_OpenAI_CLIP_Unsplash.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#<h2 align=center> Semantic Keyword Search using CLIP </h2> ###
### This notebook shows how to fine-tune the ranking of image search results using CLIP models.
### Author: Evergreen Technologies LLC
### LinkedIn profile: https://www.linkedin.com/in/evergreen-technologies-usa-3a7422198/
### Link to the full Udemy course: https://www.udemy.com/course/build-movie-review-classification-with-bert-and-tensorflow/learn/lecture/24635958#overview


# Preparation for Colab

Make sure you're running a GPU runtime; if not, select "GPU" as the hardware accelerator in Runtime > Change Runtime Type in the menu. The next cells will print the CUDA version of the runtime if it has a GPU, and install PyTorch 1.7.1.

<div align="center">
    <img width="512px" src='https://docs.google.com/drawings/d/e/2PACX-1vSaom5DqkOZYNGZVoSaliUTYJf_OSz1xsRSRgAldk-TyfqgG5hWyOCjmsyra6z7uwsLCpK6WZZc-dFM/pub?w=960&h=720' />
    <p style="text-align: center;color:gray">Figure 1: Overall workflow</p>
</div>

In this [project](https://www.udemy.com/), you will learn how to fine-tune image search results using CLIP.

### Learning Objectives

By the time you complete this project, you will be able to:

- Tokenize and preprocess text for image search
- Use the Unsplash image search API to view results prior to fine-tuning
- Fine-tune image search results by encoding the search keywords and the images into embeddings using CLIP
- Compute the similarity between image and text embeddings and select the top 3

### Prerequisites

To be successful with this project, you should be:

- Comfortable with the Python programming language
- Familiar with the basics of deep learning
- Familiar with PyTorch

This project/notebook consists of several Tasks.

- **[Task 1]()**: Introduction to the project
- **[Task 2]()**: Set up your PyTorch and Colab runtime
- **[Task 3]()**: Download the pretrained CLIP model
- **[Task 4]()**: Download the recall image set from Unsplash
- **[Task 5]()**: Tokenize the search keywords and encode them into text embeddings
- **[Task 6]()**: Encode each image in the recall set into an image embedding
- **[Task 7]()**: Compute the similarity between image embeddings and text embeddings
- **[Task 8]()**: Sort the results in descending order of similarity score
- **[Task 9]()**: Display the top N results

### Task 1: Introduction to semantic keyword image search
<div align="center">
    <img width="512px" src='https://drive.google.com/uc?id=1fnJTeJs5HUpz7nix-F9E6EZdgUflqyEu' />
    <p style="text-align: center;color:gray">Figure 1: Overall workflow Model</p>
</div>

### Task 2: Set up PyTorch and Colab runtime

You will only be able to use the Colab notebook after you save it to your Google Drive folder. Click on the File menu and select “Save a copy in Drive…”

![Copy to Drive](https://drive.google.com/uc?id=1CH3eDmuJL8WR0AP1r3UE6sOPuqq8_Wl7)

### Check GPU Availability

Check whether your Colab notebook has a Graphics Processing Unit (GPU) attached. If the command below reports no GPU, enable one under Menu > Runtime > Change Runtime Type.

![Hardware Accelerator Settings](https://drive.google.com/uc?id=1qrihuuMtvzXJHiRV8M7RngbxFYipXKQx)

In [None]:
!nvidia-smi

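### Derive the PyTorch version that's compatible with the GPU runtime (CUDA)

The next cell reads the CUDA release from `nvcc --version` and picks the matching PyTorch wheel suffix. CUDA 10.2 needs no suffix, and any release other than 10.0, 10.1, or 10.2 falls back to the CUDA 11.0 build.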
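The mapping from CUDA release to wheel suffix can be sketched as two small functions. This is only an illustration: `parse_cuda_release` and `torch_suffix_for` are hypothetical helper names, and the sample string merely imitates one line of `nvcc --version` output so the sketch runs without a GPU.

```python
def parse_cuda_release(nvcc_output: str) -> str:
    """Extract the release number (for example "11.0") from nvcc --version text."""
    for part in nvcc_output.split(", "):
        if part.startswith("release"):
            return part.split(" ")[-1]
    raise ValueError("no 'release' field found in nvcc output")


def torch_suffix_for(cuda_version: str) -> str:
    """Map a CUDA release to the PyTorch 1.7.1 wheel suffix used in this notebook."""
    suffixes = {"10.0": "+cu100", "10.1": "+cu101", "10.2": ""}
    # Any other release falls back to the CUDA 11.0 build.
    return suffixes.get(cuda_version, "+cu110")


# Imitation of the relevant nvcc output line (an assumption, not captured output).
sample = "Cuda compilation tools, release 11.0, V11.0.221"
release = parse_cuda_release(sample)
print(release, torch_suffix_for(release))
```

Keeping the lookup in a dictionary makes it easy to see which releases are handled explicitly and which ones fall through to the default.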

In [None]:
import subprocess

CUDA_version = [s for s in subprocess.check_output(["nvcc", "--version"]).decode("UTF-8").split(", ") if s.startswith("release")][0].split(" ")[-1]
print("CUDA version:", CUDA_version)

if CUDA_version == "10.0":
    torch_version_suffix = "+cu100"
elif CUDA_version == "10.1":
    torch_version_suffix = "+cu101"
elif CUDA_version == "10.2":
    torch_version_suffix = ""
else:
    torch_version_suffix = "+cu110"

### Install Pytorch

In [None]:
! pip install torch==1.7.1{torch_version_suffix} torchvision==0.8.2{torch_version_suffix} -f https://download.pytorch.org/whl/torch_stable.html ftfy regex

### Import NumPy and PyTorch

In [None]:
import numpy as np
import torch

print("Torch version:", torch.__version__)

### Task 3: Downloading the CLIP model

CLIP models are distributed as TorchScript modules.
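The long hexadecimal component in the model URL below looks like a SHA-256 digest, and the official CLIP package treats that URL component as the expected checksum of the downloaded file. Assuming the same holds for this download, a minimal sketch of an integrity check could look like the following; `expected_sha256_from_url` and `sha256_of_file` are hypothetical helper names.

```python
import hashlib


def expected_sha256_from_url(url: str) -> str:
    """Return the second-to-last path component, assumed to be the SHA-256 digest."""
    return url.split("/")[-2]


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large checkpoint never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


url = "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt"
print(expected_sha256_from_url(url))
```

After the download finishes, comparing `sha256_of_file("model.pt")` with `expected_sha256_from_url(url)` would reveal a truncated or corrupted file before it is passed to `torch.jit.load`.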

In [None]:
MODELS = {
    "ViT-B/32":       "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt",
}

In [None]:
! wget {MODELS["ViT-B/32"]} -O model.pt

In [None]:
model = torch.jit.load("model.pt").cuda().eval()
input_resolution = model.input_resolution.item()
context_length = model.context_length.item()
vocab_size = model.vocab_size.item()

print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
print("Input resolution:", input_resolution)
print("Context length:", context_length)
print("Vocab size:", vocab_size)

# Image Preprocessing

We resize the input images and center-crop them to match the image resolution that the model expects. After converting them to tensors, we normalize the pixel intensities using the dataset mean and standard deviation; this normalization is applied later, just before the images are encoded.
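To make the crop and normalization steps concrete, here is a NumPy-only sketch that runs without the model. It skips the resize step, assumes a 224-pixel input resolution, and uses a random array in place of a real photo; `center_crop` and `normalize` are hypothetical helper names and are not guaranteed to match torchvision pixel for pixel.

```python
import numpy as np

# Per-channel statistics (RGB order), the same values this notebook uses.
IMAGE_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
IMAGE_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)


def center_crop(image_hwc: np.ndarray, size: int) -> np.ndarray:
    """Cut a size x size window from the middle of a height x width x channel array."""
    height, width = image_hwc.shape[:2]
    top = (height - size) // 2
    left = (width - size) // 2
    return image_hwc[top:top + size, left:left + size]


def normalize(image_chw: np.ndarray) -> np.ndarray:
    """Subtract the channel mean and divide by the channel std via broadcasting."""
    return (image_chw - IMAGE_MEAN[:, None, None]) / IMAGE_STD[:, None, None]


# A synthetic 300 x 400 RGB image with values in [0, 1] stands in for a real photo.
rng = np.random.default_rng(0)
photo = rng.random((300, 400, 3), dtype=np.float32)
cropped = center_crop(photo, 224)
model_ready = normalize(cropped.transpose(2, 0, 1))  # HWC -> CHW
print(model_ready.shape)
```

The `[:, None, None]` indexing turns the three statistics into a 3 x 1 x 1 array, so each channel is shifted and scaled by its own value.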



In [None]:
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
from PIL import Image

preprocess = Compose([
    Resize(input_resolution, interpolation=Image.BICUBIC),
    CenterCrop(input_resolution),
    ToTensor()
])

image_mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).cuda()
image_std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).cuda()

# Text Preprocessing

We use a case-insensitive tokenizer. The tokenizer code is hidden in the second cell below.
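Once a phrase has been turned into token ids, the text encoder needs a fixed-length sequence. The sketch below shows one way to wrap ids in start and end markers and zero-pad them. The constants are assumptions for the ViT-B/32 checkpoint (context length 77, marker ids 49406 and 49407), `pad_token_ids` is a hypothetical helper, and the example ids are arbitrary rather than a real encoding.

```python
import numpy as np

# Assumed values for ViT-B/32; the notebook reads the real context length
# from the model and the marker ids from the tokenizer's encoder table.
CONTEXT_LENGTH = 77
SOT_TOKEN = 49406  # id of <|startoftext|>
EOT_TOKEN = 49407  # id of <|endoftext|>


def pad_token_ids(token_ids, context_length=CONTEXT_LENGTH):
    """Wrap ids in start/end markers, truncate if needed, and zero-pad to a fixed length."""
    body = list(token_ids)[:context_length - 2]  # leave room for the two markers
    sequence = [SOT_TOKEN] + body + [EOT_TOKEN]
    padded = np.zeros(context_length, dtype=np.int64)
    padded[:len(sequence)] = sequence
    return padded


example = pad_token_ids([1929, 537, 2368])  # arbitrary ids, not a real encoding
print(example[:6])
```

Truncating before adding the markers guarantees that the end marker always survives, even for very long input.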

In [None]:
! pip install ftfy regex
! wget https://openaipublic.azureedge.net/clip/bpe_simple_vocab_16e6.txt.gz -O bpe_simple_vocab_16e6.txt.gz

In [None]:
#@title

import gzip
import html
import os
from functools import lru_cache

import ftfy
import regex as re


@lru_cache()
def bytes_to_unicode():
    """
    Returns list of utf-8 byte and a corresponding list of unicode strings.
    The reversible bpe codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    And avoids mapping to whitespace/control characters the bpe code barfs on.
    """
    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8+n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))


def get_pairs(word):
    """Return set of symbol pairs in a word.
    Word is represented as tuple of symbols (symbols being variable-length strings).
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs


def basic_clean(text):
    text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()


def whitespace_clean(text):
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text


class SimpleTokenizer(object):
    def __init__(self, bpe_path: str = "bpe_simple_vocab_16e6.txt.gz"):
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
        merges = gzip.open(bpe_path).read().decode("utf-8").split('\n')
        merges = merges[1:49152-256-2+1]
        merges = [tuple(merge.split()) for merge in merges]
        vocab = list(bytes_to_unicode().values())
        vocab = vocab + [v+'</w>' for v in vocab]
        for merge in merges:
            vocab.append(''.join(merge))
        vocab.extend(['<|startoftext|>', '<|endoftext|>'])
        self.encoder = dict(zip(vocab, range(len(vocab))))
        self.decoder = {v: k for k, v in self.encoder.items()}
        self.bpe_ranks = dict(zip(merges, range(len(merges))))
        self.cache = {'<|startoftext|>': '<|startoftext|>', '<|endoftext|>': '<|endoftext|>'}
        self.pat = re.compile(r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""", re.IGNORECASE)

    def bpe(self, token):
        if token in self.cache:
            return self.cache[token]
        word = tuple(token[:-1]) + ( token[-1] + '</w>',)
        pairs = get_pairs(word)

        if not pairs:
            return token+'</w>'

        while True:
            bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
            if bigram not in self.bpe_ranks:
                break
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):
                try:
                    j = word.index(first, i)
                    new_word.extend(word[i:j])
                    i = j
                except ValueError:  # `first` does not occur again in the rest of the word
                    new_word.extend(word[i:])
                    break

                if word[i] == first and i < len(word)-1 and word[i+1] == second:
                    new_word.append(first+second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            else:
                pairs = get_pairs(word)
        word = ' '.join(word)
        self.cache[token] = word
        return word

    def encode(self, text):
        bpe_tokens = []
        text = whitespace_clean(basic_clean(text)).lower()
        for token in re.findall(self.pat, text):
            token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
            bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
        return bpe_tokens

    def decode(self, tokens):
        text = ''.join([self.decoder[token] for token in tokens])
        text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors="replace").replace('</w>', ' ')
        return text


### Task 4: Semantic search on images retrieved from Unsplash
**Note:** I strongly encourage you to sign up at https://unsplash.com/developers, get your own access key, and replace the key in the cell below before running this demo.
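One way to keep the access key out of the notebook and to escape the search phrase safely is to read the key from an environment variable and build the URL with `urllib.parse.urlencode`. This is a sketch under assumptions: `UNSPLASH_ACCESS_KEY` is a hypothetical variable name, `"YOUR_ACCESS_KEY"` is a placeholder, and `build_search_url` is a hypothetical helper that mirrors the URL template used in this notebook.

```python
import os
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.unsplash.com/search/photos"


def build_search_url(query: str, per_page: int, access_key: str, page: int = 1) -> str:
    """Assemble a search URL with properly escaped query parameters."""
    params = {
        "client_id": access_key,
        "query": query,
        "per_page": per_page,
        "page": page,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# UNSPLASH_ACCESS_KEY is a hypothetical variable name; the fallback is a placeholder.
access_key = os.environ.get("UNSPLASH_ACCESS_KEY", "YOUR_ACCESS_KEY")
print(build_search_url("dog and cat", 30, access_key))
```

Passing the resulting URL to `requests.get` would work the same way as the formatted string in the cells that follow.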

In [None]:
unspash_original_api = "https://api.unsplash.com/search/photos?client_id={access_key}&query={query}&per_page={numresults}&page=1"
unsplash_access_key = "aRfK6aArC_8xh7upDyfwZdA5vQ1ciJceTDiJQp4zKWo"


### Install ipyplot to plot images

In [None]:
#  Necessary installations
!pip install ipyplot==1.1.0

### Search Unsplash with the keyword and retrieve top N images

In [None]:
import requests
import PIL
from io import BytesIO
import ipyplot

search_keyword = "dog and cat"

no_to_retrieve = 100
unsplash_api = unspash_original_api.format(access_key=unsplash_access_key, query=search_keyword, numresults=no_to_retrieve)
response = requests.get(unsplash_api)
output = response.json()

all_images =[]
for each in output["results"]:
  urls = each["urls"]
  imageurl = urls["full"]
  response = requests.get(imageurl)
  image = PIL.Image.open(BytesIO(response.content)).convert("RGB")
  all_images.append(image)

print("Total number of images retrieved:", len(all_images))




### Display top images

In [None]:
# Plot up to 50 of the retrieved images
ipyplot.plot_images(all_images,max_images =50,img_width=150)

- **[Task 5]()**: Tokenize the search keywords and encode them into text embeddings
- **[Task 6]()**: Encode each image in the recall set into an image embedding
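Both the image features and the text features in this notebook are divided by their L2 norm, which means a plain dot product between them is a cosine similarity. The sketch below illustrates that with toy 2-D vectors standing in for real CLIP embeddings; `l2_normalize` is a hypothetical helper name.

```python
import numpy as np


def l2_normalize(features: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so that a dot product equals cosine similarity."""
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    return features / norms


# Toy vectors; real embeddings have many more dimensions.
image_vectors = l2_normalize(np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]))
text_vector = l2_normalize(np.array([[6.0, 8.0]]))

# One row per text query, one column per image.
cosine = text_vector @ image_vectors.T
print(cosine)
```

The first image points in the same direction as the text vector, so it gets the highest possible score regardless of the original vector lengths.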


In [None]:


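# Resize, crop, and convert every retrieved image to a CHW tensor.
images = [preprocess(im) for im in all_images]
# Stack into one batch of shape (N, 3, H, W) and move it to the GPU.
image_input = torch.stack(images).cuda()
# Normalize each channel with the dataset mean and standard deviation.
image_input -= image_mean[:, None, None]
image_input /= image_std[:, None, None]
with torch.no_grad():
    image_features = model.encode_image(image_input).float()
# L2-normalize so that dot products with text features are cosine similarities.
image_features /= image_features.norm(dim=-1, keepdim=True)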

tokenizer = SimpleTokenizer()

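def get_text_features(sentence):
  # CLIP expects each sequence to be wrapped in <|startoftext|> ... <|endoftext|>.
  text_tokens = [tokenizer.encode("<|startoftext|> %s <|endoftext|>" % sentence)]
  # Zero-padded batch of token ids with the model's fixed context length.
  text_input = torch.zeros(len(text_tokens), context_length, dtype=torch.long)
  for i, tokens in enumerate(text_tokens):
    text_input[i, :len(tokens)] = torch.tensor(tokens)

  text_input = text_input.cuda()
  with torch.no_grad():
    text_features = model.encode_text(text_input).float()
    # L2-normalize so that dot products with image features are cosine similarities.
    text_features /= text_features.norm(dim=-1, keepdim=True)

  return text_features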

def get_top_N_semantic_similarity(similarity_list,N):
  results = zip(range(len(similarity_list)), similarity_list)
  results = sorted(results, key=lambda x: x[1],reverse= True)
  top_N_images = []
  scores=[]
  for index,score in results[:N]:
    scores.append(score)
    top_N_images.append(all_images[index])
  return scores,top_N_images

- **[Task 7]()**: Compute the similarity between image embeddings and text embeddings
- **[Task 8]()**: Sort the results in descending order of similarity score
- **[Task 9]()**: Display Top N results
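Tasks 7 to 9 amount to a similarity computation followed by a sort. As a compact alternative to sorting index and score pairs by hand, the ranking step can be sketched with `np.argsort`. The scores below are made up for illustration, and `top_n` is a hypothetical helper name.

```python
import numpy as np


def top_n(similarity: np.ndarray, n: int):
    """Return (indices, scores) of the n largest similarity values, best first."""
    order = np.argsort(similarity)[::-1][:n]
    return order, similarity[order]


# Made-up cosine similarities between one text query and five images.
scores = np.array([0.21, 0.30, 0.18, 0.27, 0.25])
indices, best = top_n(scores, 3)
print(indices, best)
```

The returned indices can be used to look up the matching images in the list of retrieved photos.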

In [None]:
#semantic_search_phrase = "dog staring at cat"
#semantic_search_phrase = "cat playing with dog"
#semantic_search_phrase = "dog with human"
#semantic_search_phrase = "dog sitting on a chair"
semantic_search_phrase = "dog with a mountain background"




text_features_extracted = get_text_features(semantic_search_phrase)
similarity = text_features_extracted.cpu().numpy() @ image_features.cpu().numpy().T

similarity = similarity[0]
scores,imgs= get_top_N_semantic_similarity(similarity,N=3)
print ("scores ",scores)
ipyplot.plot_images(imgs,img_width=300)


In [None]:
semantic_search_phrase = "kittens and puppies"
text_features_extracted = get_text_features(semantic_search_phrase)
similarity = text_features_extracted.cpu().numpy() @ image_features.cpu().numpy().T

similarity = similarity[0]
scores,imgs= get_top_N_semantic_similarity(similarity,N=3)
print ("scores ",scores)
ipyplot.plot_images(imgs,img_width=300)

In [None]:
semantic_search_phrase = "dog and cat playing in the snow"
text_features_extracted = get_text_features(semantic_search_phrase)
similarity = text_features_extracted.cpu().numpy() @ image_features.cpu().numpy().T

similarity = similarity[0]
scores,imgs= get_top_N_semantic_similarity(similarity,N=3)
print ("scores ",scores)
ipyplot.plot_images(imgs,img_width=300)