# Introduction to CLIP

### Lab Table of Contents
1. [1_chatgpt.ipynb](https://github.com/abbykoneill/lerobot/blob/main/lab_part2/1_chatgpt.ipynb)
2. **[2_CLIP.ipynb](https://github.com/abbykoneill/lerobot/blob/main/lab_part2/2_CLIP.ipynb)**
3. [3_VLM_BLIP.ipynb](https://github.com/abbykoneill/lerobot/blob/main/lab_part2/3_VLM_BLIP.ipynb)
4. [4_VLA.ipynb](https://github.com/abbykoneill/lerobot/blob/main/lab_part2/4_VLA.ipynb)
5. [5_safety.ipynb](https://github.com/abbykoneill/lerobot/blob/main/lab_part2/5_safety.ipynb)
6. [Lab Checkoff](https://github.com/abbykoneill/lerobot/blob/main/lab_part2/checkoff.txt)

## CLIP (Contrastive Language–Image Pre-training)
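CLIP is a neural network trained on (image, text) pairs to match images with natural-language descriptions; it scores how well a piece of text describes an image rather than generating text. Pre-trained on 400M (image, text) pairs, CLIP learns visual concepts from raw text: an image encoder and a text encoder map their inputs into a shared embedding space, and contrastive training pulls matching pairs together while pushing mismatched pairs apart. The model was initially designed to address several problems with deep learning approaches for computer vision, such as the cost of labeled datasets, models that only handle one narrow task, and the gap between benchmark and real-world performance.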
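
To make the contrastive idea concrete, here is a minimal sketch of a CLIP-style symmetric loss on toy embeddings, written in plain Python. The function names, the 2-d vectors, and the fixed temperature of 0.07 are illustrative assumptions; the real model learns its temperature and operates on large batches of high-dimensional embeddings.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed in a numerically stable way."""
    largest = max(logits)
    log_sum = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum - logits[target]


def clip_style_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss: image i should match text i and vice versa."""
    images = [normalize(v) for v in image_embeds]
    texts = [normalize(v) for v in text_embeds]
    n = len(images)
    # logits[i][j] = cosine similarity of image i and text j, divided by the temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text direction: each row should pick out its own column
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should pick out its own row
    loss_t2i = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2


toy_images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_style_loss(toy_images, [[0.9, 0.1], [0.1, 0.9]])
shuffled = clip_style_loss(toy_images, [[0.1, 0.9], [0.9, 0.1]])
print(f"aligned pairs: {aligned:.4f}, shuffled pairs: {shuffled:.4f}")
```

The loss is low when each image sits closest to its own caption and high when the captions are swapped, which is the signal that drives both encoders during pre-training.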

In this lab, you will use the [HuggingFace implementation of CLIP](https://huggingface.co/docs/transformers/model_doc/clip). Walk through the two CLIP examples below.

Make sure to save your outputs and discuss answers to the questions with your lab partner.

#### Before Beginning with Code - Complete Environment Set-Up:
* `conda create -n <env_name> python=3.10`
* `conda activate <env_name>`

In [None]:
# Install dependencies

!pip install ipykernel
!pip install torch
!pip install transformers
!pip install datasets
!pip install pillow
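
After the installs finish, a quick check like the sketch below can confirm that the packages are visible from the notebook kernel. The helper name `installed_versions` is made up for illustration; it relies only on the standard-library `importlib.metadata` module.

```python
from importlib import metadata

# Packages installed in the cell above
REQUIRED = ["ipykernel", "torch", "transformers", "datasets", "pillow"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If a package shows up as missing, the notebook is probably running on a kernel outside the conda environment created above.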

## CLIP Vision Model with Projection

This model projects visual features from an input image into the same latent space as text features in order to compute a similarity score, enabling zero-shot image classification and text-based image retrieval.

1. Run the vision model code below. Save and comment on the output. How does this differ from using an image as an input for ChatGPT in notebook 1?

In [None]:
# CLIP Vision Model with Projection

import torch
from transformers import AutoProcessor, CLIPVisionModelWithProjection
from transformers.image_utils import load_image

model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(url)

inputs = processor(images=image, return_tensors="pt")

with torch.inference_mode():
    outputs = model(**inputs)
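image_embeds = outputs.image_embeds

# Inspect the result: one embedding per image, shape (batch_size, projection_dim)
print(image_embeds.shape)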
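
Once images and text live in the same embedding space, text-based image retrieval reduces to ranking images by their similarity to a text embedding. The sketch below shows that ranking step in plain Python. The 3-d vectors, file names, and query are made-up stand-ins for real CLIP embeddings, and the helper names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_embed, image_embeds):
    """Return (name, score) pairs sorted from most to least similar to the text."""
    scores = {
        name: cosine_similarity(text_embed, embed)
        for name, embed in image_embeds.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy embeddings; a real gallery would hold vectors produced by the vision model
gallery = {
    "cats.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
}
# Hypothetical embedding of the text query "two cats on a couch"
query = [0.8, 0.2, 0.1]

ranking = rank_images(query, gallery)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With real data, the gallery vectors would come from `image_embeds` and the query vector from a CLIP text encoder using the same checkpoint.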

## CLIP Image Classification

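CLIP is known for zero-shot image classification: assigning images to categories it was never explicitly trained on, which is made possible by its large-scale pre-training on (image, text) pairs. The `CLIPForImageClassification` class used below works differently: it places a linear classification head on top of the CLIP vision encoder, and with the base checkpoint that head is newly initialized, so it needs fine-tuning before its predictions are meaningful.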

2. Run the image classification model code below. Save and comment on the output. How does this differ from using an image as an input for ChatGPT in notebook 1? How does it differ from the vision model with projection above?

In [None]:
# Image Classification with CLIP

from transformers import AutoImageProcessor, CLIPForImageClassification
import torch
from datasets import load_dataset

dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]

image_processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPForImageClassification.from_pretrained("openai/clip-vit-base-patch32")

inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

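# Map the highest-scoring logit to its label name.
# The base checkpoint has no trained classification head, so expect a generic
# placeholder label (e.g. LABEL_0) rather than one of the 1000 ImageNet classes.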
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
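
For comparison, zero-shot classification with CLIP is commonly done by scoring an image against a set of text prompts using `CLIPModel`, which holds both encoders. The sketch below is one way to do that. The helper names, the prompt template, and the candidate labels in the usage note are illustrative assumptions, not part of the lab code.

```python
def build_prompts(labels, template="a photo of a {}"):
    """Turn class names into natural-language prompts for the text encoder."""
    return [template.format(label) for label in labels]


def zero_shot_classify(image, labels, checkpoint="openai/clip-vit-base-patch32"):
    """Return {label: probability} for a PIL image using image-text similarity."""
    # Imported here so that defining the helpers does not load the heavy libraries
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(checkpoint)
    processor = CLIPProcessor.from_pretrained(checkpoint)

    inputs = processor(
        text=build_prompts(labels),
        images=image,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (num_images, num_prompts); softmax over the prompts
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    return dict(zip(labels, probs.tolist()))
```

Calling `zero_shot_classify(image, ["cat", "dog", "car"])` with the `image` loaded in the cell above returns one probability per candidate label. Because the labels are plain text, the candidate set can be changed without any retraining.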

## References

* [HuggingFace CLIP](https://huggingface.co/docs/transformers/model_doc/clip)
* [OpenAI CLIP GitHub Repository](https://github.com/openai/CLIP)
* [OpenAI CLIP Blog](https://openai.com/index/clip/)