<a href="https://colab.research.google.com/github/everestso/Summer24/blob/main/c164s26_Vison_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧠 AI Project Report  
## Team: **Human–AI Collaborative Intelligence**

---

### 📌 Project Title  
**Exploring Intelligent Agents and Decision-Making in AI Systems**

---

### 👥 Team Members

#### **David Ruby**  
**Role:** Human Researcher & Project Lead  
**Expertise:**  
- Artificial Intelligence & Machine Learning  
- Reinforcement Learning and Intelligent Agents  
- Programming (Python, C++, SQL)  

**Contributions:**  
- Defined project goals and research questions  
- Designed experiments and evaluation criteria  
- Implemented core algorithms and analysis  
- Interpreted results and ensured conceptual understanding  

---

#### **ChatGPT (AI Assistant)**  
**Role:** AI Collaborator & Learning Support Tool  
**Expertise:**  
- Concept explanation and brainstorming  
- Code generation, debugging, and refactoring  
- Documentation and formatting assistance  

**Contributions:**  
- Assisted with idea generation and alternative approaches  
- Provided explanations of AI concepts and algorithms  
- Helped draft and refine code, comments, and written sections  
- Supported clarity, organization, and presentation  

---

### 🤝 Collaboration Statement
This project was completed through an intentional **human–AI collaboration**.  
All AI-assisted content was critically evaluated, modified, and validated by the human team member.  
The final submission reflects **David Ruby’s understanding, reasoning, and responsibility** for the work.

---

### 🎓 Course
**Undergraduate Artificial Intelligence**

---


# Vision with Hugging Face

The Hugging Face **Transformers** library provides high-level pipelines for working with
computer vision and multimodal models, including image captioning and prompted
vision–language generation. These pipelines abstract away much of the preprocessing,
model loading, and inference logic, allowing you to experiment quickly with
state-of-the-art models.

---

## 📘 Core Documentation

- **Vision & Multimodal Tasks (Overview)**  
  https://huggingface.co/docs/transformers/tasks/vision

- **Pipelines API (How `pipeline()` works)**  
  https://huggingface.co/docs/transformers/main_classes/pipelines

---

## Common Image Captioning Models

| Model | Promptable | Verbosity | Teaching Use |
|------|------------|-----------|--------------|
| ViT-GPT2 (`nlpconnect/vit-gpt2-image-captioning`) | ❌ | Low–Medium | Simple intro to image captioning |
| BLIP (`Salesforce/blip-image-captioning-base` or `Salesforce/blip-image-captioning-large`) | ✅ | Medium–High | Best all-around for multimodal prompting |
| BLIP-2 (`Salesforce/blip2-flan-t5-*`) | ✅✅ | High | Advanced vision + instruction tuning |
| GIT (`microsoft/git-base`) | ❌ | Medium | More literal, factual captions |
| OFA (`OFA-Sys/ofa-base`) | ⚠️ | Medium | Unified multimodal model concept |

---

## 🔗 Model-Specific References (Optional Deep Dives)

- **BLIP Models (Image Captioning & VQA)**  
  https://huggingface.co/Salesforce/blip-image-captioning-base

- **BLIP-2 (Vision + Instruction-Following LLMs)**  
  https://huggingface.co/docs/transformers/model_doc/blip-2

- **GIT (Generative Image-to-Text)**  
  https://huggingface.co/microsoft/git-base

- **OFA (Unified Multimodal Model)**  
  https://huggingface.co/OFA-Sys/ofa-base

---

> **Note:** While these models are accessed using the `image-to-text` pipeline,
> some (such as BLIP and BLIP-2) also accept text prompts that *condition* how the
> image is interpreted and described. This makes them a useful stepping stone toward
> more advanced multimodal and agentic AI systems.




- **Pipeline Tutorial**  
  https://huggingface.co/docs/transformers/pipeline_tutorial

In [None]:
# Source image (Google Drive share link):
# https://drive.google.com/file/d/1FNiCMMrxiIq00nnrdSM-HjE93o7EVFp3/view?usp=sharing



In [None]:

# !pip install -q gdown  # uncomment if gdown is not already installed

# Download the image from Google Drive.
# The file ID comes from the share URL: https://drive.google.com/file/d/1FNiCMMrxiIq00nnrdSM-HjE93o7EVFp3/view?usp=sharing
!gdown 1FNiCMMrxiIq00nnrdSM-HjE93o7EVFp3

# Local filename that gdown saves to (see the "To:" line in the output)
fn = "PXL_20260128_185635132~2.jpg"

Downloading...
From: https://drive.google.com/uc?id=1FNiCMMrxiIq00nnrdSM-HjE93o7EVFp3
To: /content/PXL_20260128_185635132~2.jpg
100% 1.29M/1.29M [00:00<00:00, 13.7MB/s]
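
The download cell notes that the file ID is taken from the share URL. As a minimal sketch of that step, the helper below pulls the ID out with a regular expression. The function name `drive_file_id` is hypothetical (it is not part of `gdown`), and the pattern assumes the `/file/d/<ID>/` link shape used above.

```python
import re


def drive_file_id(url: str) -> str:
    """Hypothetical helper: extract the file ID from a Google Drive
    share link shaped like https://drive.google.com/file/d/<ID>/view."""
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", url)
    if match is None:
        raise ValueError(f"No Drive file ID found in: {url}")
    return match.group(1)


share_url = "https://drive.google.com/file/d/1FNiCMMrxiIq00nnrdSM-HjE93o7EVFp3/view?usp=sharing"
file_id = drive_file_id(share_url)
print(file_id)  # the same ID that is passed to gdown above
```

The returned string is what the `!gdown` command receives as its argument.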


In [None]:
import warnings
# Silence Python warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

In [None]:
from transformers import pipeline
from transformers.utils import logging
# Show only errors from Transformers (same as set_verbosity(40), since 40 is the ERROR level)
logging.set_verbosity_error()



In [None]:
### Task: Image captioning
captioner = pipeline(task="image-to-text",
                     model="nlpconnect/vit-gpt2-image-captioning")


In [None]:
captioner(fn)

[{'generated_text': 'a bird is sitting on a piece of paper '}]
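
As the output above shows, the pipeline returns a list of dictionaries, each with a `generated_text` key. A small sketch of pulling out just the caption strings follows. The helper name `caption_texts` is hypothetical, and the sample result is copied from the output above rather than produced by running a model.

```python
def caption_texts(result):
    """Hypothetical helper: return the caption strings from an
    image-to-text pipeline result, with surrounding whitespace removed."""
    return [item["generated_text"].strip() for item in result]


# Result copied from the captioner output above (note the trailing space).
result = [{'generated_text': 'a bird is sitting on a piece of paper '}]
print(caption_texts(result))  # ['a bird is sitting on a piece of paper']
```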

## Prompted Image Captioning

Traditional image captioning models generate a short description of an image with no additional guidance. While useful, these captions are often brief and generic. **Prompted image captioning** extends this idea by allowing a *text prompt* to accompany the image, guiding how the model interprets and describes what it sees.

In prompted image captioning, the model receives **both visual input and textual context** at the same time. The text prompt acts as an instruction, question, or stylistic guide that shapes the generated caption.

For example, the same image may produce very different outputs depending on the prompt:

- *“Describe the image.”*  
- *“Describe the image in vivid botanical detail.”*  
- *“What season does this image suggest?”*  
- *“List the visible objects and their colors.”*

Models such as **BLIP** and **BLIP-2** are explicitly designed to support this form of multimodal prompting. Internally, these models combine image embeddings with the tokenized text prompt, so the language decoder conditions on both when it generates. This enables them to:
- Generate longer and more descriptive captions
- Answer questions about an image
- Adjust tone, detail level, or focus based on the prompt

Prompted image captioning is an important step toward **multimodal reasoning systems**. Rather than passively describing images, models can be *steered* to interpret visual information in task-specific ways. That ability later becomes central in **agentic AI systems**, where perception is guided by goals and instructions.
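
To compare how different prompts steer the same image, one option is to loop over a list of prompts. The sketch below assumes a captioner that accepts a `prompt=` keyword and returns `[{'generated_text': ...}]`, as the cells in this notebook do. The function `caption_with_prompts` and the stand-in `fake_captioner` are hypothetical and exist only so the example runs without downloading a model.

```python
def caption_with_prompts(captioner, image, prompts):
    """Run one image through `captioner` once per prompt and
    return a dict mapping each prompt to its generated text."""
    outputs = {}
    for prompt in prompts:
        result = captioner(image, prompt=prompt)
        outputs[prompt] = result[0]["generated_text"]
    return outputs


# Stand-in for a real pipeline: it echoes its inputs instead of running a model.
def fake_captioner(image, prompt=None):
    return [{"generated_text": f"[{image}] response to: {prompt}"}]


prompts = [
    "Describe the image.",
    "What season does this image suggest?",
]
for prompt, text in caption_with_prompts(fake_captioner, "tulips.jpg", prompts).items():
    print(prompt, "->", text)
```

With a real model, a prompt-capable `image-to-text` pipeline would be passed in place of `fake_captioner`.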


In [None]:
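# Lighter alternative (BLIP base):
# captioner_blip2 = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
# BLIP-2 (Flan-T5 XL): the checkpoint download is about 15.8 GB across two shards.
captioner_blip2 = pipeline("image-to-text", model="Salesforce/blip2-flan-t5-xl")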



In [None]:
# Basic caption, no prompt (assumes tulips.jpg is in the working directory)
print(captioner_blip2("tulips.jpg"))

# Prompted caption (steer style/detail)
print(captioner_blip2("tulips.jpg", prompt="Describe the image in vivid botanical detail:"))

# Second prompt asking for more detail (no generation parameters are set here)
print(captioner_blip2("tulips.jpg", prompt="Describe the image in detail:"))

[{'generated_text': 'tulips in bloom in a garden'}]
[{'generated_text': 'a lily of the valley'}]
[{'generated_text': 'The image shows a man in a red shirt and a woman in a red shirt.'}]


In [None]:
# Basic caption, no prompt (assumes dog.jpg is in the working directory)
print(captioner_blip2("dog.jpg"))

# Prompt asking for more detail (no generation parameters are set here)
print(captioner_blip2("dog.jpg", prompt="Describe the image in detail:"))

[{'generated_text': 'a small white and brown dog is looking up'}]
[{'generated_text': 'The image shows a man in a red shirt and a woman in a red shirt.'}]


In [None]:
vqa = pipeline(
    "visual-question-answering",
    model="Salesforce/blip2-flan-t5-xl"
)


In [None]:
vqa(image="dog.jpg", question="Describe the dog’s appearance and posture in detail.")

[{'answer': 'The dog is a stout, muscular dog with a slender body and a slender neck. The dog is a stout, muscular dog with a slender body and a slender neck.'}]
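
The answer above repeats the same sentence twice. As one possible post-processing step, the sketch below drops sentences that have already appeared. The helper `drop_repeated_sentences` is hypothetical, and it only cleans up the returned string; it does not change how the model generates text.

```python
import re


def drop_repeated_sentences(text: str) -> str:
    """Hypothetical helper: keep only the first occurrence of each sentence."""
    # Split at whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    seen = set()
    kept = []
    for sentence in sentences:
        if sentence and sentence not in seen:
            seen.add(sentence)
            kept.append(sentence)
    return " ".join(kept)


# Answer string copied from the VQA output above.
answer = (
    "The dog is a stout, muscular dog with a slender body and a slender neck. "
    "The dog is a stout, muscular dog with a slender body and a slender neck."
)
print(drop_repeated_sentences(answer))
```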

In [None]:
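from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch

# Load the processor (image preprocessing + tokenizer) and the model weights directly,
# bypassing pipeline().
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")

image = Image.open("dog.jpg").convert("RGB")

# Encode the image and the text prompt together as PyTorch tensors.
inputs = processor(
    images=image,
    text="Describe the dog’s appearance and posture in detail.",
    return_tensors="pt"
)

# Inference only: disable gradient tracking and cap the output at 60 new tokens.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=60
    )

# Convert the generated token IDs back to text.
print(processor.decode(output[0], skip_special_tokens=True))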