<a href="https://colab.research.google.com/github/dvschultz/ml-art-colabs/blob/master/CLIP_Dataset_Filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to use CLIP Zero-Shot on your own classification dataset

This notebook provides an example of how to benchmark CLIP's zero-shot classification performance on your own classification dataset.

[CLIP](https://openai.com/blog/clip/) is a zero-shot image classifier released by OpenAI that was trained on 400 million text/image pairs from across the web. CLIP uses what it learned to make predictions over a flexible set of possible classification categories.

CLIP is zero-shot, which means **no training is required**.

This notebook is modified from a notebook by [Roboflow](https://colab.research.google.com/drive/1LXla2q9MCRRI_kTjpvag2Vz-7EGLnki5).


---

If you find this notebook useful, please consider signing up for my [Patreon](https://www.patreon.com/bustbright) or [YouTube channel](https://www.youtube.com/channel/UCaZuPdmZ380SFUMKHVsv_AA/join). You can also send me a one-time payment on [Venmo](https://venmo.com/Derrick-Schultz).


# Download and Install CLIP Dependencies

In [None]:
!nvidia-smi -L

In [None]:
#installing some dependencies, CLIP was released in PyTorch
import subprocess

CUDA_version = [s for s in subprocess.check_output(["nvcc", "--version"]).decode("UTF-8").split(", ") if s.startswith("release")][0].split(" ")[-1]
print("CUDA version:", CUDA_version)

if CUDA_version == "10.0":
    torch_version_suffix = "+cu100"
elif CUDA_version == "10.1":
    torch_version_suffix = "+cu101"
elif CUDA_version == "10.2":
    torch_version_suffix = ""
else:
    torch_version_suffix = "+cu110"

!pip install torch==1.7.1{torch_version_suffix} torchvision==0.8.2{torch_version_suffix} -f https://download.pytorch.org/whl/torch_stable.html ftfy regex

import numpy as np
import torch
import os

print("Torch version:", torch.__version__)
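#kill the current process so Colab restarts the runtime and picks up the newly installed torch version
os.kill(os.getpid(), 9)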

## Clone the CLIP repo

In [None]:
#clone the CLIP repository
!git clone https://github.com/openai/CLIP.git
%cd CLIP

In [None]:
import torch 
import clip

print(clip.available_models())

## Import Dataset

Next we need to import a dataset. You can upload a zip directly to Colab (drag and drop it into the Files tab on your left), sync your Drive, or import using gdown.
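
If you upload a zip through the Files tab, it can also be extracted from Python instead of with `!unzip`. The following is a minimal sketch using the standard-library `zipfile` module. The helper name `extract_dataset` and the paths in the usage comment are assumptions for illustration, not part of this notebook's dataset.

```python
import os
import zipfile

def extract_dataset(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the sorted list of extracted file paths."""
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        # Skip directory entries, which end with a slash inside the archive.
        names = [n for n in archive.namelist() if not n.endswith("/")]
    return sorted(os.path.join(dest_dir, n) for n in names)

# Hypothetical usage in Colab (both paths are placeholders):
# files = extract_dataset("/content/my-images.zip", "/content/my-images")
# print(len(files), "files extracted")
```

Returning the list of extracted files makes it easy to confirm where the images landed before pointing `glob` at them.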

In [None]:
#sync Drive account
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!gdown --id 1BWGeWn0LMLa0ZgfoXTqO2hhGu6BPVh2O -O /content/mineral-samples.zip
%cd /content/
!unzip mineral-samples.zip
%cd CLIP

## Create Tokens
CLIP compares your image against pieces of text. These captions are converted to tokens with `clip.tokenize` before being passed to the model. Below we create two captions to test with.

I recommend experimenting with the phrases you use. CLIP can respond better or worse to particular sentence structures.
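
One way to experiment systematically is to generate caption variants from a shared template, so that only the part you care about changes between captions. This is a small sketch; `build_captions` and its template string are hypothetical helpers, not part of the CLIP library.

```python
def build_captions(subject, variants, template="A photograph of {subject} {variant}"):
    """Return one caption per variant, all sharing the same sentence structure."""
    return [template.format(subject=subject, variant=v).strip() for v in variants]

# Two captions that differ only in the final phrase.
captions = build_captions("a gemstone", ["containing text", "without text"])
print(captions)
```

Swapping the `template` argument (for example to `"A photo of {subject} {variant}"`) lets you compare sentence structures without retyping every caption.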

In [None]:
captions = ['A photo containing text','A photograph without text' ]

# Single Image Scoring

Let’s start by looking at a single image and a pair of captions. CLIP takes the image and returns, for each caption, a probability for how likely the model thinks the caption and image match.
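
The probabilities come from applying a softmax to the model's per-caption similarity scores (logits). The sketch below reproduces that step in plain Python with made-up logits, so it runs without a GPU or the CLIP weights. The numbers are illustrative only.

```python
import math

def softmax(logits):
    """Convert raw similarity scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for one image against two captions.
logits = [24.0, 22.0]
probs = softmax(logits)
print(probs)

# The same image scored against three captions: the first probability drops
# even though its logit has not changed.
print(softmax([24.0, 22.0, 24.0]))
```

Because the softmax normalizes over the captions you supply, a probability only says how well a caption fits relative to the other captions in the list, not how well it fits in absolute terms.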

In [None]:
import torch
import clip
from PIL import Image
import glob
from IPython.display import Image as Img, display

def argmax(iterable):
    return max(enumerate(iterable), key=lambda x: x[1])[0]

captions = ['A photograph of a gemstone','A photograph of a gemstone held by a hand' ]
img = '/content/content/mineral-samples/121477819_181900606849103_3709047619295376368_n.jpg'

device = "cuda" if torch.cuda.is_available() else "cpu"
model, transform = clip.load("ViT-B/32", device=device)

#define our target classifications, you should experiment with these strings of text as you see fit
text = clip.tokenize(captions).to(device)

image = transform(Image.open(img)).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
    print(probs)
    pred = captions[argmax(list(probs)[0])]
    display(Img(filename=img, width=400))
    print(pred)

In [None]:
!unzip /content/mineral-samples.zip

## Sorting into two classes
Below we’ll extend the above example to look at every image in a folder and sort the images into two folders. We’ll use the probability scores and take the class that gets the higher probability. I recommend using tokens that express a binary choice.

As this process runs, the image, probability scores, and prediction will be displayed. Pay close attention to false positives and consider editing your tokens if you see too many.
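
The copy step can also be written in plain Python with `shutil`, which avoids relying on shell quoting for unusual filenames and runs outside a notebook. This is a sketch only; `caption_to_folder` and `copy_to_class` are hypothetical helper names, and the paths in the usage comment are placeholders.

```python
import os
import shutil

def caption_to_folder(root, caption):
    """Build a folder path from a caption by replacing spaces with underscores."""
    return os.path.join(root, caption.replace(" ", "_"))

def copy_to_class(img_path, root, caption):
    """Copy img_path into the folder for caption, creating the folder if needed."""
    folder = caption_to_folder(root, caption)
    os.makedirs(folder, exist_ok=True)
    # shutil.copy returns the path of the newly created file.
    return shutil.copy(img_path, folder)

# Hypothetical usage (paths are placeholders):
# copy_to_class("/content/images/sample.jpg", "/content/sorted", "A photograph of a gemstone")
```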

In [None]:
import os
import torch
import clip
from PIL import Image
import glob
from IPython.display import Image as Img, display

def argmax(iterable):
    return max(enumerate(iterable), key=lambda x: x[1])[0]

imgs = glob.glob('/content/tests/A_photograph_of_a_gemstone/*.*')
captions = ['A photograph of a gemstone','A photograph of a gemstone containing text' ]

fpaths = []
for f in captions:
    fpath = '/content/test2/'+f.replace(' ','_')
    fpaths.append(fpath)
    if not os.path.exists(fpath):
        os.makedirs(fpath)
print(fpaths)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, transform = clip.load("ViT-B/32", device=device)

#define our target classifications, you should experiment with these strings of text as you see fit
text = clip.tokenize(captions).to(device)

for img in imgs:
    image = transform(Image.open(img)).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        
        logits_per_image, logits_per_text = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()
        pred = captions[argmax(list(probs)[0])]
        display(Img(filename=img, width=400))
        print(probs)
        print(pred)

        img_name = img.split('/')[-1]
        path = fpaths[argmax(list(probs)[0])] + '/' + img_name
        !cp "{img}" "{path}"

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Sorting into multiple classes (Max probability)

So far we’ve only looked at two classes, but you can technically use any number of categories. This example will only sort by maximum probability, so each image will only end up in one class at a time.
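
After a run, it can help to tally how many images landed in each class; a very lopsided count may hint that one caption is too broad. The sketch below shows the max-probability assignment and a tally in plain Python. The image names and probabilities are made up, and `assign_by_max` is a hypothetical helper.

```python
from collections import Counter

def assign_by_max(captions, probs_per_image):
    """Map each image name to the caption with the highest probability."""
    assignments = {}
    for name, probs in probs_per_image.items():
        best = max(range(len(probs)), key=lambda i: probs[i])
        assignments[name] = captions[best]
    return assignments

captions = ["black background", "white background", "gradient background"]
# Made-up probabilities for three hypothetical images.
probs_per_image = {
    "a.jpg": [0.70, 0.10, 0.20],
    "b.jpg": [0.15, 0.25, 0.60],
    "c.jpg": [0.55, 0.05, 0.40],
}
assignments = assign_by_max(captions, probs_per_image)
counts = Counter(assignments.values())
print(counts)
```

Note that `c.jpg` goes entirely to the first class even though the third class scored 0.40; max-probability sorting discards that near miss.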

In [None]:
import os
import torch
import clip
from PIL import Image
import glob
from IPython.display import Image as Img, display

def argmax(iterable):
    return max(enumerate(iterable), key=lambda x: x[1])[0]

captions = ['A photograph of a gemstone on a black background','A photograph of a gemstone on a white background','A photograph of a gemstone on a gradient background' ]
imgs = glob.glob('/content/minerals-min1024/*.jpg')

fpaths = []
for f in captions:
    fpath = '/content/'+f.replace(' ','_')
    fpaths.append(fpath)
    if not os.path.exists(fpath):
        os.makedirs(fpath)
print(fpaths)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, transform = clip.load("ViT-B/32", device=device)

#define our target classifications, you should experiment with these strings of text as you see fit
text = clip.tokenize(captions).to(device)

for img in imgs:
    image = transform(Image.open(img)).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        
        logits_per_image, logits_per_text = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()
        pred = captions[argmax(list(probs)[0])]
        display(Img(filename=img, width=400))
        print(probs)
        print(pred)

        #saves images
        img_name = img.split('/')[-1]
        path = fpaths[argmax(list(probs)[0])] + '/' + img_name
        !cp "{img}" "{path}"

## Sorting into multiple classes (Greedy)

You might want to sort an image into multiple folders. To do this you’ll need to set a minimum probability score.

This can be pretty tricky. Because all probability scores add up to one, you’ll need to find a good value that will define "confidence, but not under- or over-confidence."
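
To get a feel for how the threshold behaves, here is a minimal sketch of the selection rule with made-up probabilities. `classes_above` is a hypothetical helper name.

```python
def classes_above(probs, min_prob):
    """Return the indices of every class whose probability is at least min_prob."""
    return [i for i, p in enumerate(probs) if p >= min_prob]

# Made-up probabilities for one image against three captions.
probs = [0.50, 0.35, 0.15]

low = classes_above(probs, 0.2)   # two classes clear the threshold
high = classes_above(probs, 0.6)  # no class clears the threshold
print(low, high)
```

Two bounds follow from the scores summing to one. With N captions, a threshold above 0.5 can select at most one class, and a threshold at or below 1/N always selects at least one, since the largest probability can never be smaller than the average. With two captions and a threshold of 0.2, any image whose probabilities fall between 0.2 and 0.8 is copied into both folders.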

In [None]:
%cd CLIP/

In [None]:
import os
import torch
import clip
from PIL import Image
import glob
import numpy as np
from IPython.display import Image as Img, display

def argmax(iterable):
    return max(enumerate(iterable), key=lambda x: x[1])[0]

imgs = glob.glob('/content/minerals/*.*')
# imgs = ['/content/26871485_568905930137798_4275492080328900608_n.jpg',
#       '/content/27579032_703652923356546_8926669318020661248_n.jpg',
#       '/content/28156709_2038566666361358_5503147325851172864_n.jpg',
#       '/content/34982795_174824020040310_6797853509149523968_n.jpg',
#       '/content/35459382_252143185338131_7462939386892517376_n.jpg',
#       '/content/35518589_666909173653781_4904083633842683904_n.jpg']
captions = ['A photo of a gemstone and no visible hands','A photo of gemstone and a visible hand' ]
min_prob = .2

fpaths = []
for f in captions:
    fpath = '/content/drive/MyDrive/CLIP-data/'+f.replace(' ','_')
    fpaths.append(fpath)
    if not os.path.exists(fpath):
        os.makedirs(fpath)
print(fpaths)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, transform = clip.load("ViT-B/32", device=device)

#define our target classifications, you should experiment with these strings of text as you see fit
text = clip.tokenize(captions).to(device)

for img in imgs:
    image = transform(Image.open(img)).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        
        logits_per_image, logits_per_text = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()
        pred = captions[argmax(list(probs)[0])]
        display(Img(filename=img, width=400))
        print(probs)

        #saves images
        img_name = img.split('/')[-1]
        for i in range(len(probs[0])):
            if(probs[0][i] >= min_prob):
                print(captions[i])
                path = fpaths[i] + '/' + img_name
                !cp "{img}" "{path}"

## Trying Something else

In [None]:
import os
import torch
import clip
from PIL import Image
import glob
from IPython.display import Image as Img, display

def argmax(iterable):
    return max(enumerate(iterable), key=lambda x: x[1])[0]

imgs = glob.glob('/content/mineral-samples/*.*')
# imgs = ['/content/75580670_179414003112257_70106094399349512_n.jpg', '/content/45345032_1060841144103369_8637295453703979280_n.jpg', '/content/121477819_181900606849103_3709047619295376368_n.jpg', '/content/121531725_973722799796652_2455931968095755240_n.jpg','/content/93375640_256338438825322_6148423300973295637_n.jpg','/content/64895378_2350623818592868_2370413862839102887_n.jpg']
#define our target classifications, you should experiment with these strings of text as you see fit
captions = ['An uncropped photo of a gemstone', 'An uncropped photo of a gemstone, contains a hand', 'A cropped photo of a gemstone', 'A photo of gemstone, contains text' ,'A photo of gemstone, contains a hand' ]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, transform = clip.load("ViT-B/32", device=device)
# model, transform = clip.load("RN101", device=device)


text = clip.tokenize(captions).to(device)

# print(text)

for img in imgs:
    image = transform(Image.open(img)).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        
        logits_per_image, logits_per_text = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()
        pred = captions[argmax(list(probs)[0])]
        display(Img(filename=img, width=400))
        print(probs)
        print(pred)

In [None]:
import os
import torch
import clip
from PIL import Image
import glob
from IPython.display import Image as Img, display

def argmax(iterable):
    return max(enumerate(iterable), key=lambda x: x[1])[0]

imgs = glob.glob('/content/mineral-samples/*.*')
# imgs = ['/content/75580670_179414003112257_70106094399349512_n.jpg', '/content/45345032_1060841144103369_8637295453703979280_n.jpg', '/content/121477819_181900606849103_3709047619295376368_n.jpg', '/content/121531725_973722799796652_2455931968095755240_n.jpg','/content/93375640_256338438825322_6148423300973295637_n.jpg','/content/64895378_2350623818592868_2370413862839102887_n.jpg']
captions = ['An uncropped photo of a gemstone','A cropped photo of a gemstone','A photo containing text', 'A photo that does not contain text', 'A photo with hands', 'A photo without hands']

device = "cuda" if torch.cuda.is_available() else "cpu"
model, transform = clip.load("ViT-B/32", device=device)
# model, transform = clip.load("RN101", device=device)

#define our target classifications, you should experiment with these strings of text as you see fit
text = clip.tokenize(captions).to(device)

for img in imgs:
    image = transform(Image.open(img)).unsqueeze(0).to(device)
    with torch.no_grad():
        # t = text - t_not
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        
        display(Img(filename=img, width=400))

        logits_per_image, logits_per_text = model(image, text[:2])
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()
        pred = captions[argmax(list(probs)[0])]
        
        print(probs)
        print(pred)

        logits_per_image, logits_per_text = model(image, text[2:4])
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()
        pred = captions[argmax(list(probs)[0])+2]
        
        print(probs)
        print(pred)

        logits_per_image, logits_per_text = model(image, text[4:6])
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()
        pred = captions[argmax(list(probs)[0])+4]
        
        print(probs)
        print(pred)

        # logits_per_image, logits_per_text = model(image, t_not)
        # print(logits_per_image)
