# Vision-Language Model Playground
This notebook outlines experiments with multimodal vision-language models.

Models to test:
- OpenAI CLIP
- Facebook TimeSformer
- VisualBERT
- Flamingo
- UniLM
- Large World Model


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import requests
import tqdm
import pprint

from PIL import Image

sns.set_theme()
%matplotlib inline

## Task Overview
- "How many people are wearing red in this picture?" <br>
(image: https://live.staticflickr.com/3019/4554964811_a96d51b67b_b.jpg)


## OpenAI CLIP

In [3]:
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
probs

tensor([[0.9949, 0.0051]], grad_fn=<SoftmaxBackward0>)
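
The similarity score above can be reproduced by hand, which helps clarify what `logits_per_image` represents. The sketch below is illustrative only: it uses random NumPy arrays as stand-ins for the image and text embeddings (an assumption; the real ones come from the CLIP encoders) and an illustrative `logit_scale` of 100.0 in place of the model's learned temperature. It L2-normalises both sides, takes scaled cosine similarities, and applies a softmax over the candidate captions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings (assumption): 1 image and 2 captions in a shared 512-d space.
image_embeds = rng.standard_normal((1, 512))
text_embeds = rng.standard_normal((2, 512))

# L2-normalise so that the dot product equals cosine similarity.
image_embeds /= np.linalg.norm(image_embeds, axis=-1, keepdims=True)
text_embeds /= np.linalg.norm(text_embeds, axis=-1, keepdims=True)

# Illustrative temperature; in a real model this is a learned parameter.
logit_scale = 100.0

# One row per image, one column per caption: shape (1, 2).
logits_per_image = logit_scale * image_embeds @ text_embeds.T

# Numerically stable softmax over the caption axis.
shifted = logits_per_image - logits_per_image.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
print(probs)
```

Because the softmax runs over captions, the probabilities are only meaningful relative to the other candidate captions supplied in the same call.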

In [None]:
inputs = processor(text=["How many cats are in the image?"], images=image, return_tensors="pt", padding=True)

model(**inputs)
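
CLIP scores how well a caption matches an image; it does not generate answers, so a single question prompt yields one similarity logit with nothing to compare it against. One possible workaround (an assumption, not something this notebook has evaluated) is to recast counting as zero-shot classification over one candidate caption per count. The helper names `build_count_prompts` and `pick_count` are hypothetical, and the probabilities in the usage lines are made-up placeholders rather than model output.

```python
def build_count_prompts(noun_plural, noun_singular, max_count):
    """Return one caption per candidate count, from 0 to max_count."""
    prompts = []
    for n in range(max_count + 1):
        if n == 0:
            prompts.append(f"a photo with no {noun_plural}")
        elif n == 1:
            prompts.append(f"a photo of 1 {noun_singular}")
        else:
            prompts.append(f"a photo of {n} {noun_plural}")
    return prompts


def pick_count(probs):
    """Return the index (equal to the count) of the highest-probability caption."""
    return max(range(len(probs)), key=lambda i: probs[i])


prompts = build_count_prompts("cats", "cat", max_count=3)
# These prompts would be passed as `text=prompts` to the CLIP processor.

placeholder_probs = [0.05, 0.15, 0.70, 0.10]  # made-up values, not model output
predicted = pick_count(placeholder_probs)
print(prompts[predicted])
```

With real model output, `placeholder_probs` would be replaced by the first row of `logits_per_image.softmax(dim=1)`.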

## TimeSformer

In [None]:
from transformers import AutoImageProcessor, TimesformerForVideoClassification

video = list(np.random.randn(8, 3, 224, 224))

processor = AutoImageProcessor.from_pretrained("facebook/timesformer-base-finetuned-k600")
model = TimesformerForVideoClassification.from_pretrained("facebook/timesformer-base-finetuned-k600")

inputs = processor(images=video, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
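
The random array above stands in for a clip of 8 frames. For a real video, a fixed number of frames would need to be selected first. The following is a minimal sketch, assuming the frames are already decoded into an array of shape `(num_frames, 3, H, W)`. The helper `sample_frame_indices` is hypothetical, and uniform sampling is just one simple strategy, not necessarily the scheme used to train the checkpoint.

```python
import numpy as np


def sample_frame_indices(num_frames, clip_len=8):
    """Return clip_len indices spread evenly over [0, num_frames - 1]."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return np.linspace(0, num_frames - 1, num=clip_len).round().astype(int)


# Placeholder for decoded video frames (assumption): 120 small blank frames.
decoded = np.zeros((120, 3, 32, 32), dtype=np.float32)

indices = sample_frame_indices(len(decoded), clip_len=8)
video = list(decoded[indices])  # list of 8 arrays, each of shape (3, 32, 32)
print(indices)
```

The resulting `video` list has the same structure as the random input used above, so it could be passed to the processor in the same way.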

## VisualBERT

## Large World Model

## MOO