# Introduction to CLIP

### Lab Table of Contents
* Part 1
    1. [1_imitation_learning.ipynb](https://github.com/abbykoneill/lerobot/blob/main/lab_part_1/1_imitation_learning.ipynb)
* **Part 2**
    1. [1_chatgpt.ipynb](https://github.com/abbykoneill/lerobot/blob/main/lab_part2/1_chatgpt.ipynb)
    2. **[2_CLIP.ipynb](https://github.com/abbykoneill/lerobot/blob/main/lab_part2/2_CLIP.ipynb)**
    3. [3_VLM_BLIP.ipynb](https://github.com/abbykoneill/lerobot/blob/main/lab_part2/3_VLM_BLIP.ipynb)
    4. [4_VLA.ipynb](https://github.com/abbykoneill/lerobot/blob/main/lab_part2/4_VLA.ipynb)
    5. [5_safety.ipynb](https://github.com/abbykoneill/lerobot/blob/main/lab_part2/5_safety.ipynb)
* [Lab Checkoff](https://github.com/abbykoneill/lerobot/blob/main/lab_part2/checkoff.md)

## CLIP (Contrastive Language–Image Pre-training)
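CLIP is a neural network trained on (image, text) pairs to match images with the natural-language text that describes them. It does not generate text. Pre-trained on 400M (image, text) pairs, CLIP learns visual concepts from raw text: an image encoder and a text encoder map images and captions into a shared embedding space, and contrastive training pulls matching pairs together while pushing mismatched pairs apart. The model was originally designed to address several problems with deep learning approaches to computer vision, such as the cost of building labeled datasets, models that handle only one narrow task, and benchmark performance that does not carry over to real-world images.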
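
To make the "contrastive" part of the name concrete, here is a minimal pure-Python sketch of a symmetric contrastive loss over a batch of matched (image, text) embeddings. It is an illustration under stated assumptions, not the training code of any library: the function names and the toy 2-D embeddings are made up for this example, and the temperature is fixed at 0.07 for simplicity, whereas CLIP learns its temperature during training.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Cross-entropy of one list of logits against the index of the correct entry."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image i and text i are assumed to be a matching pair."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # sim[i][j]: temperature-scaled cosine similarity between image i and text j
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
           for img in images]
    # Image -> text direction: row i should score highest at column i
    loss_i2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # Text -> image direction: column j should score highest at row j
    loss_t2i = sum(cross_entropy([sim[i][j] for i in range(n)], j)
                   for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy (hypothetical) embeddings: two images and their captions
toy_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]    # caption i matches image i
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]   # captions swapped

loss_aligned = clip_style_loss(toy_images, aligned_texts)
loss_shuffled = clip_style_loss(toy_images, shuffled_texts)
print(f"aligned pairs:  {loss_aligned:.4f}")
print(f"shuffled pairs: {loss_shuffled:.4f}")
```

The loss is small when each image is most similar to its own caption and large when the captions are swapped, which is the signal that pushes the two encoders toward a shared embedding space.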

In this lab, you will use the [HuggingFace implementation of CLIP](https://huggingface.co/docs/transformers/model_doc/clip). Walk through the two CLIP examples below.

Make sure to save your outputs and discuss answers to the questions with your lab partner.

#### Before Beginning with Code - Complete Environment Set-Up:
* `conda create -n <env_name> python=3.10`
* `conda activate <env_name>`

>**Note:** The CLIP model weights and their dependencies take up a fair amount of disk space. Before beginning with this notebook, it might be helpful to either free up local disk space or use Google Colab.
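
One quick way to check how much disk space is available before downloading anything is the standard-library call `shutil.disk_usage`. The helper name `free_space_gb` below is made up for this sketch. It checks your home directory on the assumption that HuggingFace caches downloads there by default; pass a different path if your cache lives elsewhere.

```python
import os
import shutil


def free_space_gb(path=None):
    """Return the free disk space, in gigabytes, on the drive that holds `path`."""
    if path is None:
        # Assumption: model downloads are cached somewhere under the home directory
        path = os.path.expanduser("~")
    return shutil.disk_usage(path).free / 1e9


print(f"Free space: {free_space_gb():.1f} GB")
```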

In [None]:
# Install dependencies

!pip install ipykernel
!pip install torch==2.9.1
!pip install transformers
!pip install datasets
!pip install pillow
!pip install matplotlib

## CLIP Image Classification

CLIP can perform zero-shot image classification: assigning an image to one of a set of categories without having been explicitly trained on those categories. This is possible because CLIP's large-scale pre-training on (image, text) pairs lets it compare an image against arbitrary text labels.

The three subsequent code cells classify the same image using three different model setups.
1. Run the three cells to test the three different models. Save the predicted label from each model.
2. How accurate is each of the predictions? Why do you think some perform better than others?
3. The `labels` list in the third model is a list of candidate labels that CLIP chooses from when classifying the image. Change some of the values and rerun this third model. Experiment with labels that are very specific to the image, labels that are clearly wrong for the image, and two labels that are similar to each other.
4. How does CLIP Image Classification differ from using an image as an input for ChatGPT in notebook 1?
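
To build intuition for question 3, the sketch below mimics how zero-shot scoring works: each candidate label and the image are represented as embeddings, the cosine similarity between them is scaled, and a softmax turns the scores into probabilities over the labels. Everything here is an assumption made for illustration: the 3-D embeddings are invented numbers rather than real CLIP outputs, `zero_shot_classify` is a hypothetical helper, and the fixed scale of 100 stands in for CLIP's learned logit scale.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Convert a list of scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, label_embs, scale=100.0):
    """Return the best label and a {label: probability} dict for one image."""
    names = list(label_embs)
    logits = [scale * cosine(image_emb, label_embs[name]) for name in names]
    probs = softmax(logits)
    best_index = max(range(len(names)), key=lambda i: probs[i])
    return names[best_index], dict(zip(names, probs))


# Invented embeddings: the two cat labels sit close to the image and to each other
image_emb = [0.9, 0.3, 0.1]
label_embs = {
    "tabby cat": [0.9, 0.35, 0.1],
    "black cat": [0.85, 0.2, 0.2],
    "golden retriever": [0.2, 0.9, 0.1],
    "email": [0.0, 0.1, 0.9],
}

best, probs = zero_shot_classify(image_emb, label_embs)
print("prediction:", best)
for name, p in probs.items():
    print(f"{name:>16}: {p:.3f}")
```

Because the softmax is taken over whatever labels you supply, two similar labels share the probability mass between them, and an image is always assigned to some label even when none of the candidates fits.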

In [None]:
# Image Classification with CLIP

# Image to Classify

import urllib.request

import matplotlib.pyplot as plt
import torch
from datasets import load_dataset
from transformers import (
    AutoImageProcessor,
    CLIPForImageClassification,
    CLIPModel,
    CLIPProcessor,
)

# Load a small sample dataset and take its first test image
dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]

plt.imshow(image)
plt.axis('off')
plt.title('Input Image')
plt.show()

In [None]:
# Image Classification with CLIP

# CLIP Model Variation 1

modelname = "openai/clip-vit-base-patch32"
image_processor = AutoImageProcessor.from_pretrained(modelname)
model = CLIPForImageClassification.from_pretrained(modelname)

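# Preprocess the image (resize, crop, normalize) into model-ready tensors
inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Index of the highest-scoring output class.
# Caution: this checkpoint does not ship a trained classification head, so
# transformers initializes one randomly (expect a warning when loading) and
# the index below is not a reliable ImageNet class.
predicted_label = logits.argmax(-1).item()
#print(model.config.id2label[predicted_label])

# Download the ImageNet class names (one name per line)
url = "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"
class_names = urllib.request.urlopen(url).read().decode("utf-8").splitlines()

# Look up the class name at the predicted index
pred_label = class_names[predicted_label]
print(pred_label)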

In [None]:
# Image Classification with CLIP

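# CLIP Model Variation 2

# Note: this checkpoint is a Vision Transformer (ViT) fine-tuned on ImageNet,
# not a CLIP checkpoint. Loading it through CLIPForImageClassification may
# produce a model-type mismatch warning, and any weights that do not match
# the CLIP architecture are not loaded.
modelname = "google/vit-base-patch16-224"
image_processor = AutoImageProcessor.from_pretrained(modelname)
model = CLIPForImageClassification.from_pretrained(modelname)

# Preprocess the image into model-ready tensors
inputs = image_processor(image, return_tensors="pt")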

with torch.no_grad():
    logits = model(**inputs).logits

# model predicts one of the 1000 ImageNet classes
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])

# Download imagenet class names
url = "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"
class_names = urllib.request.urlopen(url).read().decode("utf-8").split("\n")

pred_label = class_names[predicted_label]
print(pred_label)

In [None]:
# Image Classification with CLIP

# CLIP Model Variation 3

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

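# Candidate labels for zero-shot classification; edit this list to experiment
labels = ["tabby cat", "golden retriever", "goldfish", "email", "black cat"]
# Tokenize the labels and preprocess the image in a single call
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    # Image-text similarity scores, shape (num_images, num_labels)
    logits = outputs.logits_per_image

# Pick the label with the highest similarity to the image
pred = logits.argmax().item()
print(labels[pred])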

### Continue to
[3_VLM_BLIP.ipynb](https://github.com/abbykoneill/lerobot/blob/main/lab_part2/3_VLM_BLIP.ipynb)

## References

* [HuggingFace CLIP](https://huggingface.co/docs/transformers/model_doc/clip)
* [OpenAI CLIP GitHub Repository](https://github.com/openai/CLIP)
* [OpenAI CLIP Blog](https://openai.com/index/clip/)