# Image(s) to Text

## Artworks 

### Fito Segrera: 1 & N Chairs (2017)

[Website Exhibition Frankfurter Kunstverein](https://www.fkv.de/fito-segrera/)

```{margin}
Fito Segrera: 1 & N Chairs (2017)<br>
LCD screens, micro-computers, Internet connection, cognitive computation engine and custom software.
```
![Fito Segrera: 1 & N Chairs (2017)](img/FitoSegrera_1&NChairs_2017.jpg)

<br>

### Philipp Schmitt: Computed Curation (2016-17/18)

[Website Exhibition Frankfurter Kunstverein](https://www.fkv.de/fito-segrera/)

[Project Page](https://philippschmitt.com/work/computed-curation)  
[Book (Video)](https://vimeo.com/266039697)  
[Production Process](https://vimeo.com/225081193)  

```{margin}
Philipp Schmitt: Computed Curation (2016-17/18)
```
![Philipp Schmitt: Computed Curation (2016-17/18)](img/PhilippSchmitt_ComputedCuration-1.jpg)

<br>

```{margin}
Philipp Schmitt: Computed Curation (2016-17/18)
```
![Philipp Schmitt: Computed Curation (2016-17/18)](img/PhilippSchmitt_ComputedCuration-3.jpg)

<br>

### Excavating AI - Kate Crawford and Trevor Paglen (2018)

[Excavating AI - Kate Crawford and Trevor Paglen (2018)](https://excavating.ai/)

```{margin}
Excavating AI - Kate Crawford and Trevor Paglen (2018)
```
![Excavating AI](img/ex.jpg)

<br>

### From ‘Apple’ to ‘Anomaly’ - Trevor Paglen (2019)

[From ‘Apple’ to ‘Anomaly’ - Trevor Paglen (2019)](https://paglen.studio/2020/04/09/from-apple-to-anomaly-pictures-and-labels-selections-from-the-imagenet-dataset-for-object-recognition/)

```{margin}
From ‘Apple’ to ‘Anomaly’ - Trevor Paglen (2019)
```
![From ‘Apple’ to ‘Anomaly’](img/Trevor.jpg)


### Captioning


```python
import gradio as gr
from PIL import Image
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

# 1. Load BLIP processor and model
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
model.eval()

# 2. Define a captioning function
def generate_caption(image: Image.Image) -> str:
    """
    Takes a PIL Image from Gradio’s camera input and returns a text caption.
    """
    # Preprocess the image
    inputs = processor(images=image, return_tensors="pt")

    # Generate caption (you can adjust num_beams, max_length, etc. as needed)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            num_beams=5,
            max_length=20
        )
    caption = processor.decode(out[0], skip_special_tokens=True)
    return caption

# 3. Set up Gradio interface with a webcam (camera) input
camera_input = gr.Image(type="pil")

iface = gr.Interface(
    fn=generate_caption,
    inputs=camera_input,
    outputs="text",
    title="BLIP Image Captioning (Webcam)",
    description="Point your webcam at something, click 'Submit', and BLIP will describe what it sees."
)

# 4. Launch the app
if __name__ == "__main__":
    iface.launch()
```
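
The captioning code passes `num_beams=5` and `max_length=20` to `model.generate`. As a rough illustration of what these two parameters control, here is a minimal beam search over a hypothetical toy next-token table. `TOY_MODEL` and `beam_search` are invented for this sketch and are not part of BLIP or `transformers`; the real implementation also applies length penalties and other stopping rules that are left out here.

```python
import math

# Hypothetical toy "decoder": maps the previous token to next-token
# probabilities. It stands in for BLIP's text decoder (an assumption
# made purely for illustration).
TOY_MODEL = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.3, "</s>": 0.2},
    "the": {"dog": 0.9, "</s>": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}


def beam_search(num_beams: int, max_length: int) -> str:
    """Return the best sequence found while keeping `num_beams` candidates per step."""
    beams = [(0.0, ["<s>"])]  # each beam: (log-probability, tokens so far)
    for _ in range(max_length):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":
                # Finished sequences are carried over unchanged.
                candidates.append((score, tokens))
                continue
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the `num_beams` highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(tokens[-1] == "</s>" for _, tokens in beams):
            break
    best_tokens = beams[0][1]
    return " ".join(t for t in best_tokens if t not in ("<s>", "</s>"))


print(beam_search(num_beams=1, max_length=10))  # greedy decoding
print(beam_search(num_beams=2, max_length=10))  # two beams
```

With one beam the search commits to the locally most likely first word ("a") and ends at "a dog" (probability 0.30). With two beams it also keeps "the" alive and finds "the dog" (probability 0.36). Larger `num_beams` values explore more alternatives at a higher compute cost, while `max_length` caps how many decoding steps are taken.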



### Visual Question Answering


```python
import gradio as gr
from PIL import Image
import torch
from transformers import BlipProcessor, BlipForQuestionAnswering

# 1. Load BLIP VQA processor and model
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
model.eval()

# 2. Define a VQA-style function
def answer_question(image: Image.Image, question: str) -> str:
    """
    Takes a PIL Image and a text question,
    returns BLIP’s predicted answer.
    """
    # Preprocess both image and text
    inputs = processor(images=image, text=question, return_tensors="pt")

    # Generate answer (you can tune num_beams, max_length, etc.)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            num_beams=5,
            max_length=10
        )
    answer = processor.decode(out[0], skip_special_tokens=True)
    return answer

# 3. Set up Gradio interface with a camera input + text box
camera_input = gr.Image(type="pil", label="Webcam Image")
text_input = gr.Textbox(lines=2, placeholder="Ask a question about the image…", label="Question")

iface = gr.Interface(
    fn=answer_question,
    inputs=[camera_input, text_input],
    outputs="text",
    title="BLIP VQA (Webcam)",
    description="Point your webcam at something, type a question, click 'Submit', and BLIP will answer."
)

# 4. Launch the app
if __name__ == "__main__":
    iface.launch()
```
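
If 'Submit' is clicked before an image has been captured or a question typed, Gradio typically passes `None` for the image and an empty string for the text, and the BLIP processor would then raise an error. One possible way to handle this is a small guard around the model call. The helper `guarded_answer` and the stand-in `fake_answer` below are hypothetical names introduced for this sketch only; `fake_answer` replaces the real model so the example runs without downloading BLIP.

```python
from typing import Any, Callable, Optional


def guarded_answer(
    image: Optional[Any],
    question: Optional[str],
    answer_fn: Callable[[Any, str], str],
) -> str:
    """Check the inputs before handing them to the (expensive) model function."""
    if image is None:
        return "Please provide an image first."
    if question is None or not question.strip():
        return "Please type a question."
    return answer_fn(image, question.strip())


# Stand-in for the real VQA function (assumption: same two-argument signature).
def fake_answer(image: Any, question: str) -> str:
    return f"answer to: {question}"


print(guarded_answer(None, "What is this?", fake_answer))
print(guarded_answer("image", "   ", fake_answer))
print(guarded_answer("image", " What is this? ", fake_answer))
```

In the app itself, the guard could be wired in with something like `fn=lambda img, q: guarded_answer(img, q, answer_question)`, so the user sees a readable message in place of a stack trace.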


