# Finetuner X CLIP Benchmark

@bo_wangbo
@fissoreg

Traditionally, searching images from text (text-to-image retrieval) relies on human annotations. The [OpenAI CLIP](https://github.com/openai/CLIP) model instead maps texts and images into dense vectors in the same semantic space, making it possible to directly measure the similarity of texts to images.
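To make "measuring the similarity of texts to images" concrete, here is a minimal sketch that ranks images against a text query by cosine similarity. The three-dimensional vectors and file names are made up for illustration; real CLIP embeddings are much higher-dimensional and come from the model's text and image encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", not produced by CLIP.
text_vec = [0.9, 0.1, 0.0]
image_vecs = {
    'dog.jpg': [0.8, 0.2, 0.1],
    'car.jpg': [0.0, 0.1, 0.9],
}

# Rank the images from most to least similar to the text query.
ranked = sorted(
    image_vecs,
    key=lambda name: cosine_similarity(text_vec, image_vecs[name]),
    reverse=True,
)
print(ranked)
```

Because both modalities live in the same space, text-to-image retrieval reduces to this kind of nearest-neighbour ranking.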

Following this, LAION AI has produced the [CLIP benchmark](https://github.com/LAION-AI/CLIP_benchmark) in order to evaluate the performance of CLIP models on various standard datasets, such as `Flickr8k`, `Flickr30k` and `MS-Coco Captions`. These datasets consist of a set of images and each image is linked to five text descriptions.

In this Colab notebook, we'll try to use [Finetuner](https://github.com/jina-ai/finetuner) to fine-tune the CLIP model on `Flickr8k`, and compare the retrieval metrics produced by the fine-tuned model against pre-trained zero-shot results produced from CLIP Benchmark.

*NOTE: Finetuner is a cloud-based training platform. You need to log in, after which Finetuner allocates computational resources for you automatically and for free.*

**Please consider [switching to a GPU runtime](https://medium.com/@oribarel/getting-the-most-out-of-your-google-colab-2b0585f82403) for faster evaluation!**

![flickr](https://miro.medium.com/max/1400/0*5WGNo6Ty72mdQhcK)

In [None]:
!pip install "finetuner[full]==0.6.3"
!pip install nest_asyncio
!pip install kaggle
# Our fork of CLIP Benchmark fixes some minor issues in the data builder and adjusts the evaluator so that it accepts two models.
# When fine-tuning CLIP, Finetuner unwraps the CLIP model into two models and saves them individually.
!pip install git+https://github.com/bwanglzu/CLIP_benchmark.git

## Prepare training data

CLIP Benchmark has done a lot of the work for us with its dataset `builder`. However, to feed the training data into Finetuner, we need to convert it into the Jina DocArray format. More specifically:

1. CLIP Benchmark contains a file named `captions.txt`, which lists all Flickr8k image URLs with their captions.
2. CLIP Benchmark reuses the Karpathy split, which divides `Flickr8k` into a training set and a test set. The test set includes 1,000 images, for a total of 5,000 image-caption annotations.

We will build our training set by loading all images and then excluding the test-set images of the Karpathy split.



In [None]:
import os
from clip_benchmark.datasets.builder import build_dataset

# Please fill in your Kaggle credentials here. You can find your Kaggle
# user name and key in your Kaggle account settings.
# CLIP Benchmark uses Kaggle to download the Flickr8k dataset.
os.environ['KAGGLE_USERNAME'] = ''
os.environ['KAGGLE_KEY'] = ''

build_dataset(dataset_name='flickr8k', annotation_file=None, download=True)

Dataset Flickr
    Number of datapoints: 1000
    Root location: root

In [None]:
root_dir = '/content/root/'
full_annotation = root_dir + 'captions.txt'
test_annotation = root_dir + 'flickr8k_test_karpathy.txt'

all_imgs = []
test_imgs = []
with open(full_annotation, 'r') as f:
    next(f) # exclude the header line
    for idx, item in enumerate(f.readlines()):
        all_imgs.append(item.split(',', 1)[0])

with open(test_annotation, 'r') as f:
    next(f) # exclude the header line
    for idx, item in enumerate(f.readlines()):
        test_imgs.append(item.split(',', 1)[0])

print(f'Size of the full image set is {len(all_imgs)}')
print(f'Size of the test image set is {len(test_imgs)}')

Size of the full image set is 40455
Size of the test image set is 5000


Now we will convert the downloaded images into `DocumentArray` format. The essentials of `DocumentArray` are as follows:

+ `docarray` is a dependency of Finetuner, so no need to install it separately.
+ A `Document` wraps a piece of data of any type, be it image, text or anything else.
+ A `DocumentArray` wraps a list of `Document`s.
+ When fine-tuning CLIP, our training data consists of pairs of one image and one text.

We organize a training set like this:

```python
from docarray import Document, DocumentArray

pairs = DocumentArray()
pair_1 = Document(chunks=[
    Document(uri='your-image.jpg', modality='image'),  # image chunk
    Document(content='the text descriptor', modality='text'),  # text chunk
])
# build pair_2, pair_3, ... the same way, then add them to the array
pairs.extend([pair_1])
```

In [None]:
from tqdm import tqdm
from docarray import Document, DocumentArray

train = DocumentArray()
with open(full_annotation, 'r') as f:
    next(f) # exclude the header line
    for idx, line in tqdm(enumerate(f.readlines())):
        url, txt = line.split(',', 1)
        if url in test_imgs:  # do not include test images into training set
            continue
        img_chunk = Document(uri=root_dir + url, modality='image')
        txt_chunk = Document(content=txt, modality='text')
        img_chunk.load_uri_to_image_tensor(224, 224)
        img_chunk.pop('uri')
        pair = Document(chunks=[img_chunk, txt_chunk])
        train.append(pair)
        if idx == 5000: # we only use a subset to train
            break

print(f'The size of the training data is {len(train)}')

5000it [00:30, 166.30it/s]


The size of the training data is 4376


The Flickr8k dataset contains 8,000 images, each with 5 descriptive texts, or 40,000 image-text pairs in total.

+ The training set has ~35000 image-text pairs.
+ The test set has ~5000 image-text pairs.

## Start Fine-tuning

Now that we have everything ready, the next step is to start the fine-tuning job using Finetuner. Finetuner takes a pre-trained model from a third-party library, such as `open_clip`, and then jointly optimizes the image encoder and the text encoder with the `CLIPLoss` function.

Finetuner will also reserve a cloud GPU for you for free.
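To give an intuition for what jointly optimizing the two encoders means, here is a minimal pure-Python sketch of a symmetric contrastive loss in the style of CLIP. It is not Finetuner's `CLIPLoss` implementation: the function names, the toy similarity matrices and the fixed temperature of 0.07 are assumptions made for illustration.

```python
import math

def cross_entropy(logits, target_index):
    """Cross-entropy of one row of logits against the index of the correct class."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[target_index]

def symmetric_contrastive_loss(similarity, temperature=0.07):
    """similarity[i][j] is the similarity of image i and text j; image i matches text i."""
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]
    # image -> text: each row should select its own column
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # text -> image: each column should select its own row
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

aligned = [[1.0, 0.0], [0.0, 1.0]]    # matching pairs score highest
confused = [[0.5, 0.5], [0.5, 0.5]]   # the encoders cannot tell pairs apart
print(symmetric_contrastive_loss(aligned), symmetric_contrastive_loss(confused))
```

The loss is low when every image is most similar to its own caption and vice versa, which is why gradients flow into both encoders at once.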

In [None]:
import finetuner
import nest_asyncio

nest_asyncio.apply() # to execute async function in the notebook
finetuner.login()
# Please copy the link below into a new tab in order to log in.

In [None]:
# Note: we have already pushed the training set below to the cloud and made the dataset public, so you don't have to push it again.
# train.push('finetuner-flickr8k-demo', public=True, show_progress=True)
# finetuner.delete_run('clip-run')

In [None]:
run = finetuner.fit(
    model='ViT-B-32#openai', # ViT-B-32 with weights trained by OpenAI, provided through OpenCLIP
    train_data='finetuner-flickr8k-demo', # the dataset we prepared, already pushed to the cloud in the previous section
    run_name='clip-run',
    loss='CLIPLoss', # use CLIPLoss for fine-tuning the CLIP model
    epochs=5,
    learning_rate=1e-6,
    cpu=False,
)

In [None]:
# takes around 10 minutes to finish
for log_entry in run.stream_logs():
    print(log_entry)

[15:42:58] INFO     Starting finetuner run ...                                                           __main__.py:112
           DEBUG    Found Hubble authentication token                                                    __main__.py:124
           DEBUG    Running in online mode                                                               __main__.py:125
           INFO     Reading config ...                                                                   __main__.py:132
           DEBUG    Reading config from stream                                                           __main__.py:144
           INFO     Parsing config ...                                                                   __main__.py:147
           INFO     Config loaded 📜                                                                     __main__.py:149
           INFO     Run name: clip-run                                                                   __main__.py:151
           INFO     Experiment na

## Inference

After fine-tuning is finished, your fine-tuned model is saved in the cloud as an `artifact`. An `artifact` contains the model weights, and some metadata such as evaluation metrics and hyper-parameters.

In order to download your artifact, call the method `run.save_artifact()`.

Since CLIP is actually two models that we fine-tune in parallel, the artifact contains two models: a text encoder and an image encoder. To encode data with them, call `finetuner.get_model()` with a `select_model` argument, either `clip-text` or `clip-vision`, to access CLIP's constituent models individually.

In [None]:
artifact = run.save_artifact('clip-model')

clip_txt_encoder = finetuner.get_model(artifact=artifact, select_model='clip-text')
clip_img_encoder = finetuner.get_model(artifact=artifact, select_model='clip-vision')

Output()

100%|████████████████████████████████████████| 354M/354M [00:01<00:00, 267MiB/s]


With these two models and Finetuner, you can encode your image and text data with:

```python
data = DocumentArray([Document(content='some text to encode')])
finetuner.encode(model=clip_txt_encoder, data=data)
```


In order to reuse the code from CLIP Benchmark, we need a small adjustment: we rebuild the models ourselves and load the fine-tuned weights into them.

In [None]:
'''
Note: these builders are installed along with the finetuner package,
although they are not intended to be public methods.
You can use finetuner.encode(model, data) to encode your DocumentArray.
In order to reuse CLIP Benchmark, we build the models in a "hacky" way, as shown below.
'''
!unzip clip-model/clip-run.zip # as mentioned, the artifact is saved as a zip file containing the weights and some metadata
import torch
from commons.models.builders import OpenCLIPVisionBuilder, OpenCLIPTextBuilder

clip_vision = OpenCLIPVisionBuilder(descriptor='ViT-B-32#openai').build()
clip_vision.load_state_dict(torch.load(f'/content/{run.name}/models/clip-vision/model.pt'))

clip_text = OpenCLIPTextBuilder(descriptor='ViT-B-32#openai').build()
clip_text.load_state_dict(torch.load(f'/content/{run.name}/models/clip-text/model.pt'))

<All keys matched successfully>

In [None]:
"""Console script for clip_benchmark. 
Code copied from CLIP Benchmark with minor adjustments to run in Colab.
"""
import sys
import json
import torch
import open_clip
from pprint import pprint

from clip_benchmark.datasets.builder import build_dataset, get_dataset_collate_fn
from clip_benchmark.metrics import  zeroshot_retrieval

from torch.utils.data import default_collate



device = "cuda" if torch.cuda.is_available() else "cpu"
image_encoder = clip_vision.to(device)
text_encoder = clip_text.to(device)
_, _, transform = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
dataset = build_dataset(
    dataset_name='flickr8k',
    root='root',
    transform=transform,
    split='test',
    annotation_file=None,
    download=True,
)
collate_fn = get_dataset_collate_fn('flickr8k')

dataloader = torch.utils.data.DataLoader(
    dataset, batch_size=64,
    shuffle=False, num_workers=4,
    collate_fn=collate_fn
)

metrics = zeroshot_retrieval.evaluate(
    image_encoder,
    text_encoder,
    dataloader,
    open_clip.tokenizer.tokenize,
    recall_k_list=[5],
    device=device,
    amp=True
)
dump = {
    "dataset": 'flickr8k',
    "model": 'ViT-B-32',
    "pretrained": 'openai',
    "task": 'finetuned',
    "metrics": metrics
}
pprint(dump)

16it [00:16,  1.01s/it]


{'dataset': 'flickr8k',
 'metrics': {'image_retrieval_recall@5': 0.8537999987602234,
             'text_retrieval_recall@5': 0.9100000262260437},
 'model': 'ViT-B-32',
 'pretrained': 'openai',
 'task': 'finetuned'}


## Zero-Shot VS Fine-Tuned


CLIP Benchmark has run many experiments and published the results in [this csv](https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/benchmark.csv).

For simplicity, we show the comparison below:

+ `image_retrieval_recall@5`: use text queries to find the top 5 most similar images.
+ `text_retrieval_recall@5`: use image queries to find the top 5 most similar texts.
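The two metrics above share the same recipe, which can be sketched as follows. This is a simplified illustration that counts a query as a hit when at least one correct item appears in its top-k results; the function name and the toy scores are hypothetical, and CLIP Benchmark's own evaluator may differ in details.

```python
def recall_at_k(scores, positives, k):
    """Fraction of queries with at least one correct item among their top-k results.

    scores[q][c]  similarity between query q and candidate c
    positives[q]  set of candidate indices that are correct for query q
    """
    hits = 0
    for query_scores, correct in zip(scores, positives):
        # indices of the k candidates with the highest similarity
        top_k = sorted(
            range(len(query_scores)),
            key=lambda c: query_scores[c],
            reverse=True,
        )[:k]
        if correct & set(top_k):
            hits += 1
    return hits / len(scores)

# Three queries scored against three candidates (toy numbers).
scores = [
    [0.9, 0.2, 0.1],
    [0.3, 0.1, 0.8],
    [0.2, 0.7, 0.4],
]
positives = [{0}, {1}, {1}]
print(recall_at_k(scores, positives, k=1))
print(recall_at_k(scores, positives, k=3))
```

In this toy example two of the three queries rank a correct candidate first, while every query finds one within the top 3. For image retrieval the queries are texts and the candidates are images; for text retrieval the roles are swapped.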


| model                            | dataset       | imageRecall@5(zero-shot) | textRecall@5(zero-shot) | imageRecall@5(fine-tuned) | textRecall@5(fine-tuned) |
|----------------------------------|---------------|-------------------|----------------------|---------|-------------|
| ViT-B-32#openai                  | flickr8k      |0.5319737792015076 | 0.6991719007492065   |0.8537999987602234| 0.9100000262260437 |

In addition, we ran more extensive experiments on three datasets, with the following results:


| model                            | dataset       | imageRecall@5(zero-shot) | textRecall@5(zero-shot) | imageRecall@5(fine-tuned) | textRecall@5(fine-tuned) |
|----------------------------------|---------------|-------------------|----------------------|---------|-------------|
| ViT-B-32#openai                  | flickr8k      |0.5319737792015076 | 0.6991719007492065   |0.8651999831199646| 0.9079999923706055 |
| ViT-B-16-plus-240                | flickr8k      |0.6441478133201599 | 0.7916203141212463   |0.8784000277519226| 0.9200000166893005 |
| ViT-B-32-quickgelu#laion400m_e32 | flickr8k      |0.5787171125411987 | 0.7392163872718811   |0.849399983882904 | 0.9020000100135803 |
| ViT-B-32#openai                  | flickr30k     |0.8338000178337097 | 0.9490000009536743   |0.9016000032424927| 0.9480000138282776 |
| ViT-B-16-plus-240                | flickr30k     |0.8894000053405762 | 0.9710000157356262   |0.9169999957084656| 0.9710000157356262 |
| ViT-B-32-quickgelu#laion400m_e32 | flickr30k     |0.8546000123023987 | 0.9409999847412109   |0.8715999722480774| 0.9290000200271606 |
| ViT-B-32#openai                  | coco captions |0.5584565997123718 | 0.748199999332428    |0.6546581387519836| 0.7454000115394592 |
| ViT-B-16-plus-240                | coco captions |0.6620951890945435 | 0.8101999759674072   |0.7120751738548279| 0.8136000037193298 |
| ViT-B-32-quickgelu#laion400m_e32 | coco captions |0.6084766387939453 | 0.7675999999046326   |0.6713714599609375| 0.7635999917984009 |

Default hyper-parameters are: `learning_rate: 1e-6`, `epochs: 5`, `optimizer: Adam`.
The Flickr models are fine-tuned on the 8k/30k images and evaluated on the Karpathy test set.
The `mscoco_captions` models are fine-tuned on a random subset (100k pairs) extracted from the `2014 train images`
and evaluated on the `2014 val images`.

## General Insights when fine-tuning CLIP

+ Use a small learning rate, such as 1e-5, 1e-6 or 1e-7.
+ You do not necessarily need a huge, complex model: the base ViT-B-32 is good enough after fine-tuning.
+ If your search case covers a closed domain, or a domain that differs from CLIP's pre-training data, fine-tuning might be a good idea; otherwise it may not be worth it.
+ If you only have keywords rather than text descriptions, use a prompt to turn each keyword into a sentence. For instance, for a cat image labeled `cat`, turn `cat` into `this is a photo of cat`.
+ Consider [wise-ft](https://github.com/mlfoundations/wise-ft) if you do not want to lose too much zero-shot capability.
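The idea behind wise-ft, interpolating in weight space between the zero-shot and the fine-tuned model, can be sketched in a few lines. This is an illustration under simplifying assumptions: parameters are plain Python lists rather than PyTorch tensors, and the function name and parameter names are hypothetical rather than taken from the wise-ft code base.

```python
def interpolate_weights(zero_shot, fine_tuned, alpha):
    """Return (1 - alpha) * zero_shot + alpha * fine_tuned, parameter by parameter."""
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError('both models must have the same parameter names')
    return {
        name: [
            (1 - alpha) * z + alpha * f
            for z, f in zip(zero_shot[name], fine_tuned[name])
        ]
        for name in zero_shot
    }

# Toy "state dicts" with made-up parameter names and values.
zero_shot = {'layer.weight': [0.0, 1.0], 'layer.bias': [0.5]}
fine_tuned = {'layer.weight': [1.0, 3.0], 'layer.bias': [1.5]}

# alpha=0 keeps the zero-shot model, alpha=1 keeps the fine-tuned one.
mixed = interpolate_weights(zero_shot, fine_tuned, alpha=0.5)
print(mixed)
```

Sweeping `alpha` between 0 and 1 lets you trade in-domain gains against retained zero-shot behaviour; the best value depends on your data and has to be found by evaluation.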

Again, if you are interested in how Finetuner can improve your representations for search, recommendation or deduplication tasks, you can find us at:

+ GitHub: https://github.com/jina-ai/finetuner
+ Documentation: https://finetuner.jina.ai/

Last but not least, huge thanks to all open source contributors:

+ CLIP Benchmark: https://github.com/LAION-AI/CLIP_benchmark
+ Open CLIP: https://github.com/mlfoundations/open_clip