# Image Classification using LLM

In [1]:
!pip install transformers datasets pillow


In [2]:
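from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image

# Load the pretrained ViT checkpoint and its matching image processor.
# AutoImageProcessor is the current replacement for the deprecated vision feature extractors;
# the variable keeps the name `feature_extractor` because later cells use it.
model_name = "google/vit-base-patch16-224"
feature_extractor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModelForImageClassification.from_pretrained(model_name)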

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [4]:
image = Image.open("/content/n01443537_goldfish.JPEG")
image = image.convert("RGB")

# Preprocess image
inputs = feature_extractor(images=image, return_tensors="pt")
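
The checkpoint name `vit-base-patch16-224` encodes the patch size (16 pixels) and the input resolution (224 pixels). As a rough sketch of what that implies for the encoder, the hypothetical helper below counts the tokens the model sees per image, assuming a standard ViT layout with square patches and a single `[CLS]` token.

```python
# Hypothetical helper (not part of transformers): sequence length of a standard ViT input.
def vit_sequence_length(image_size: int, patch_size: int, cls_token: bool = True) -> int:
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size   # 224 // 16 = 14
    num_patches = patches_per_side ** 2           # 14 * 14 = 196
    return num_patches + (1 if cls_token else 0)  # plus one [CLS] token


print(vit_sequence_length(224, 16))  # 197
```

In the notebook, `inputs["pixel_values"].shape` can be inspected to confirm the resolution the processor actually produced; for a single RGB image with this checkpoint it would be expected to be `(1, 3, 224, 224)`.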

In [6]:
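import torch

# Perform inference; gradients are not needed, so disable autograd tracking
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits  # shape: (batch_size, num_labels)
predicted_class = logits.argmax(-1).item()  # index of the highest-scoring class
print("Predicted class:", predicted_class)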

Predicted class: 1


In [7]:
# Convert the predicted class to a readable format
class_names = model.config.id2label
predicted_class_name = class_names[predicted_class]
print("Predicted class name:", predicted_class_name)

Predicted class name: goldfish, Carassius auratus
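
The cells above report only the single best class. A minimal sketch of how the raw logits could be turned into ranked probabilities is shown here, using plain Python so it runs without the model. The function names `softmax` and `top_k_labels` and the toy logits are assumptions made for illustration, not part of the original notebook.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, id2label, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy logits and labels, invented for illustration only
toy_logits = [1.0, 3.0, 2.0]
toy_labels = {0: "tench", 1: "goldfish", 2: "great white shark"}
for label, prob in top_k_labels(toy_logits, toy_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

With the notebook's own variables, the call would look like `top_k_labels(logits[0].tolist(), model.config.id2label)`, which gives a sense of how confident the model is rather than just which class won.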


In [8]:

from transformers import pipeline

# Load a text-generation model (GPT-like)
llm = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

# Create a prompt. Note that it interpolates the numeric index `predicted_class`,
# not the readable label `predicted_class_name`, so the LLM only sees the class ID.
explanation_prompt = f"The Vision Transformer (ViT) model classified the given image as class '{predicted_class}'. "
explanation_prompt += "Explain why this classification is reasonable based on the image's features."

# Generate explanation (by default the returned text begins with the prompt itself)
llm_explanation = llm(explanation_prompt, max_length=150)[0]["generated_text"]

print("\n🔍 LLM Explanation:\n", llm_explanation)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



🔍 LLM Explanation:
 The Vision Transformer (ViT) model classified the given image as class '1'. Explain why this classification is reasonable based on the image's features. Explain how the transformation can be obtained using the ViT model.

In the context of image restoration, the image is viewed in 3D space (say by point cloud reconstruction). Thus, in order to visualize the object in the image, we need to perform 3D mapping to get a 3D point cloud. In the case of image restoration there typically exists multiple image patches corresponding to 3D points, and each patch is a different object. For example, these 3D points may correspond to faces, and thus be used for surface reconstruction. This paper investigates the problem of mapping each of
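
The recorded output starts by echoing the prompt, and the prompt itself names only class '1', so the language model never learns that the label is a goldfish. It also never sees the image; it only continues the text it is given. One possible adjustment, sketched below with the hypothetical helpers `build_prompt` and `strip_prompt`, is to put the readable label into the prompt and drop the echoed prompt from the result. The example string standing in for model output is made up.

```python
def build_prompt(label: str) -> str:
    """Build a prompt that names the readable class label instead of its index."""
    return (
        f"A Vision Transformer (ViT) classified an image as '{label}'. "
        "Describe the visual features that are typical of this class."
    )


def strip_prompt(generated_text: str, prompt: str) -> str:
    """Remove the echoed prompt from the start of the generated text, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text


prompt = build_prompt("goldfish, Carassius auratus")
# Stand-in for llm(prompt)[0]["generated_text"]; this is not real model output
fake_output = prompt + " Goldfish are usually orange."
print(strip_prompt(fake_output, prompt))
```

In the notebook this would be used as `llm(build_prompt(predicted_class_name), ...)`. The text-generation pipeline also accepts `return_full_text=False`, which omits the prompt from the result, and `max_new_tokens`, which limits only the generated continuation rather than prompt plus continuation.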
