[![Dataflowr](https://raw.githubusercontent.com/dataflowr/website/master/_assets/dataflowr_logo.png)](https://dataflowr.github.io/website/)

# CLIP

[CLIP](https://github.com/openai/CLIP) (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs by OpenAI. 

It can be instructed in natural language to predict the most relevant text snippet for a given image without being directly optimized for that task, similar to the zero-shot capabilities of GPT-2 and GPT-3. OpenAI reports that CLIP matches the performance of the original ResNet50 on ImageNet “zero-shot”, without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision.


![](https://raw.githubusercontent.com/openai/CLIP/main/CLIP.png)

In [None]:
# Uncomment the lines below if you are using Google Colab:
#%pip install git+https://github.com/openai/CLIP.git
#%mkdir data
#%cd data
#!wget https://raw.githubusercontent.com/dataflowr/notebooks/master/Module19/data/cat.jpg
#!wget https://raw.githubusercontent.com/dataflowr/notebooks/master/Module19/data/dog.png
#!wget https://raw.githubusercontent.com/dataflowr/notebooks/master/Module19/data/caltech101_full.json
#%cd ..

In [None]:
import torch
import clip
from PIL import Image
import numpy as np

In [None]:
dog_image = Image.open("data/dog.png")
cat_image = Image.open("data/cat.jpg")

In [None]:
dog_image

In [None]:
cat_image

# First use of CLIP

Use the [code snippets](https://github.com/openai/CLIP#usage) to get the correct labels for the two images above.

Note that the provided code computes the image and text features but never uses them. Check that the probabilities can be recovered directly from these features.
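As a rough illustration of what this check involves, here is a minimal sketch using NumPy and made-up feature vectors rather than real CLIP outputs. It assumes, as in the CLIP repository's model code, that the logits are cosine similarities between L2-normalized image and text features multiplied by a learned scale (`model.logit_scale.exp()`, about 100 for the released models). The function name `probs_from_features`, the array values and the default scale are placeholders.

```python
import numpy as np

def probs_from_features(image_features, text_features, logit_scale=100.0):
    """Softmax over scaled cosine similarities, one row per image."""
    # L2-normalize so that the dot product is a cosine similarity
    img = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    txt = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    logits = logit_scale * img @ txt.T                      # shape: (n_images, n_texts)
    logits = logits - logits.max(axis=-1, keepdims=True)    # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

# Placeholder features: 1 image and 2 text prompts in a 4-dimensional space
rng = np.random.default_rng(0)
image_features = rng.normal(size=(1, 4))
text_features = rng.normal(size=(2, 4))
probs = probs_from_features(image_features, text_features)
```

With real CLIP outputs, `image_features` and `text_features` would come from `model.encode_image` and `model.encode_text`, and the result can be compared with `logits_per_image.softmax(dim=-1)`.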

# Building a classifier from CLIP

Check that the classifier below works on the images above.

In [None]:
class Classifier_CLIP:
    def __init__(self, labels):
        self.labels = labels
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model, self.preprocess = clip.load("ViT-B/32", device=self.device)
        self.text = clip.tokenize(labels).to(self.device)
        
    def classify(self, image_pil, verbose=False):
        image = self.preprocess(image_pil).unsqueeze(0).to(self.device)
        with torch.no_grad():
            logits_per_image, logits_per_text = self.model(image, self.text)
            probs = logits_per_image.softmax(dim=-1).cpu().numpy()
            if verbose:
                print('predicted class: ', self.labels[np.argmax(probs)])
        return np.argmax(probs)

In [None]:
#classifier.classify(dog_image, verbose=True)

In [None]:
#classifier.classify(cat_image, verbose=True)

# Testing the classifier on Caltech 101 

Now we want to see how well this classifier performs on the [Caltech 101](https://data.caltech.edu/records/mzrjq-6wc02) dataset.

You first need to download the dataset with [torchvision](https://pytorch.org/vision/stable/generated/torchvision.datasets.Caltech101.html#torchvision.datasets.Caltech101).

In [None]:
import torchvision
#caltech_data = torchvision.datasets.Caltech101('data/', download=True)

In [None]:
caltech_data

In [None]:
k = 4578
caltech_data[k]

In [None]:
caltech_data[k][0]

In [None]:
caltech_data.categories[caltech_data[k][1]]

Now, you need to add methods to the `Classifier_CLIP` class. You can see on this [nice blogpost of Sean Osier](https://www.seanosier.com/2021/03/20/python-add-method-existing-class/) how to do it.

First, write a method that creates the texts corresponding to the labels and tokenizes them. Once this is done, check the prediction made by the classifier on the image above. Is it right?
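A minimal sketch of the pattern, shown on a toy class so that it runs without CLIP: a function taking `self` as its first argument is assigned as a class attribute and becomes a method of every instance. The class `ToyClassifier`, the method name `make_texts` and the template "a photo of a {label}" are illustrative assumptions, not part of the notebook. In the real method the resulting strings would then be passed to `clip.tokenize`.

```python
class ToyClassifier:
    """Stand-in for Classifier_CLIP that only stores the labels."""
    def __init__(self, labels):
        self.labels = labels

def make_texts(self, template="a photo of a {label}"):
    # Turn raw category names such as "sea_horse" into short captions
    self.texts = [template.format(label=label.replace("_", " ")) for label in self.labels]
    return self.texts

# Attach the function to the class: it is now available as a method
ToyClassifier.make_texts = make_texts

toy = ToyClassifier(["dragonfly", "sea_horse"])
texts = toy.make_texts()
# texts == ["a photo of a dragonfly", "a photo of a sea horse"]
```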

Now, add two methods: `predict` takes a batch of images and computes the corresponding probabilities and predictions, and `test` takes a dataloader as input and uses `predict` to compute the accuracy of the classifier on the dataset.

Hint: to create the dataloader you can use `from more_itertools import chunked`
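One possible shape for the evaluation loop, sketched without CLIP so that it runs anywhere. The helper `batches` is a standard-library stand-in that behaves like `more_itertools.chunked` for this purpose, and `dummy_predict` is a hypothetical placeholder for the `predict` method. The dataset is assumed to yield `(image, label)` pairs, as `caltech_data` does.

```python
from itertools import islice

def batches(iterable, n):
    """Yield successive lists of at most n items (like more_itertools.chunked)."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch

def accuracy(predict, dataset, batch_size=32):
    """Fraction of samples for which predict returns the true label."""
    correct, total = 0, 0
    for batch in batches(dataset, batch_size):
        images = [image for image, _ in batch]
        labels = [label for _, label in batch]
        predictions = predict(images)
        correct += sum(int(p == y) for p, y in zip(predictions, labels))
        total += len(labels)
    return correct / total

# Toy data: "images" are integers and the true label is their parity
toy_dataset = [(i, i % 2) for i in range(10)]

def dummy_predict(images):
    # Hypothetical predictor that always answers class 0
    return [0 for _ in images]

toy_accuracy = accuracy(dummy_predict, toy_dataset, batch_size=4)
```

Batching this way keeps memory use bounded, since only `batch_size` preprocessed images need to be on the device at once.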

# Better performance with GPT!

Using the idea of [Visual Classification via Description from Large Language Models](https://github.com/sachit-menon/classify_by_description_release/tree/master#visual-classification-via-description-from-large-language-models) by Sachit Menon and Carl Vondrick (ICLR 2023), try to get better performance!
![](https://raw.githubusercontent.com/sachit-menon/classify_by_description_release/master/figs/latent-points.png)

If you do not want to do prompt engineering, descriptors are provided in the file `caltech101_full.json`.
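Roughly, the idea in the paper is to score a class by averaging the image-text similarity over that class's descriptor prompts, instead of relying on a single prompt per class. Below is a minimal NumPy sketch under that reading, with placeholder similarity values rather than real CLIP outputs. The function names, the toy descriptors and the prompt template "{category}, which has {descriptor}" are illustrative assumptions; the descriptors are assumed to be stored as a mapping from class name to a list of strings.

```python
import numpy as np

# Toy descriptors, as a {class name: [descriptor, ...]} mapping
descriptors = {
    "dragonfly": ["long slender body", "two pairs of transparent wings"],
    "television": ["large rectangular screen", "power cord", "remote control"],
}

def build_prompts(descriptors, template="{category}, which has {descriptor}"):
    """One text prompt per (class, descriptor) pair, plus the owning class of each prompt."""
    prompts, owners = [], []
    for category, descr_list in descriptors.items():
        for descriptor in descr_list:
            prompts.append(template.format(category=category, descriptor=descriptor))
            owners.append(category)
    return prompts, owners

def class_scores(similarities, owners):
    """Average the per-prompt similarities of one image within each class."""
    sims = np.asarray(similarities, dtype=float)
    owner_array = np.asarray(owners)
    return {c: float(sims[owner_array == c].mean()) for c in dict.fromkeys(owners)}

prompts, owners = build_prompts(descriptors)
# Placeholder similarities between one image and the 5 prompts (not real CLIP outputs)
similarities = [0.31, 0.27, 0.18, 0.22, 0.20]
scores = class_scores(similarities, owners)
predicted = max(scores, key=scores.get)
```

Averaging rather than summing keeps classes with many descriptors from being favored simply because they have more prompts.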

In [None]:
import json

def load_descriptors(self, filename):
    # Load the descriptors from a JSON file; the context manager closes the file
    with open(filename) as f:
        self.descriptors = json.load(f)

In [None]:
Classifier_CLIP.load_descriptors = load_descriptors

In [None]:
classifier_caltech.load_descriptors('data/caltech101_full.json')

In [None]:
classifier_caltech.descriptors

# Prompt engineering for image classification

Try to get better descriptors!

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "<your key here>"  # replace with your own key

def stringtolist(description):
    return [descriptor[2:] for descriptor in description.split('\n') if (descriptor != '') and (descriptor.startswith('- '))]
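To see what `stringtolist` keeps, here is a small self-contained check on a hand-written reply (the reply text is made up, not an actual API response). Only lines starting with `- ` survive, with the bullet marker stripped.

```python
def stringtolist(description):
    # Same logic as above: keep bullet lines and drop the leading "- "
    return [line[2:] for line in description.split('\n') if line.startswith('- ')]

reply = """There are several useful visual features to tell there is a dragonfly in a photo:
- long, slender body
- two pairs of transparent wings

- large compound eyes"""

features = stringtolist(reply)
# features == ['long, slender body', 'two pairs of transparent wings', 'large compound eyes']
```

A reply that does not follow the bullet pattern yields an empty list, so it is worth checking for empty descriptor lists before using them.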

I am using the same [prompts as in the original paper](https://github.com/sachit-menon/classify_by_description_release/blob/master/generate_descriptors.py) adapted to the new API. It could be improved...

In [None]:
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are useful visual features for distinguishing a lemur in a photo?"},
    {"role": "assistant", "content": """There are several useful visual features to tell there is a lemur in a photo:
- four-limbed primate
- black, grey, white, brown, or red-brown
- wet and hairless nose with curved nostrils
- long tail
- large eyes
- furry bodies
- clawed hands and feet"""},
    {"role": "user", "content": "What are useful visual features for distinguishing a television in a photo?"},
    {"role": "assistant", "content": """There are several useful visual features to tell there is a television in a photo:
- electronic device
- black or grey
- a large, rectangular screen
- a stand or mount to support the screen
- one or more speakers
- a power cord
- input ports for connecting to other devices
- a remote control"""},
    {"role": "user", "content": "What are useful visual features for distinguishing a dragonfly in a photo? Provide an answer following the above pattern, give only the list of visual features."}
  ]
)

In [None]:
response.choices[0].message.content

In [None]:
def generate_prompt(category_name: str):
    # you can replace the examples with whatever you want; these were random and worked, could be improved
    return [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are useful visual features for distinguishing a lemur in a photo?"},
    {"role": "assistant", "content": """There are several useful visual features to tell there is a lemur in a photo:
- four-limbed primate
- black, grey, white, brown, or red-brown
- wet and hairless nose with curved nostrils
- long tail
- large eyes
- furry bodies
- clawed hands and feet"""},
    {"role": "user", "content": "What are useful visual features for distinguishing a television in a photo?"},
    {"role": "assistant", "content": """There are several useful visual features to tell there is a television in a photo:
- electronic device
- black or grey
- a large, rectangular screen
- a stand or mount to support the screen
- one or more speakers
- a power cord
- input ports for connecting to other devices
- a remote control"""},
    {"role": "user", "content": f"What are useful visual features for distinguishing a {category_name} in a photo? Provide an answer following the above pattern, give only the list of visual features."}
  ]

def obtain_descriptors_and_save(filename, class_list):
    # Build one few-shot prompt per category (underscores become spaces)
    prompts = [generate_prompt(category.replace('_', ' ')) for category in class_list]
    client = OpenAI()

    # One API call per category
    responses = [client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=prompt
    ) for prompt in prompts]
    
    response_texts = [resp.choices[0].message.content for resp in responses]
    descriptors_list = [stringtolist(response_text) for response_text in response_texts]
    descriptors = {cat: descr for cat, descr in zip(class_list, descriptors_list)}

    # save descriptors to json file
    if not filename.endswith('.json'):
        filename += '.json'
    with open(filename, 'w') as fp:
        json.dump(descriptors, fp)