# Gai/Gen: Image-to-Text (ITT)

## 1. Note

The following examples have been tested in this environment:

-   NVIDIA GeForce RTX 2060 6GB
-   Windows 11 + WSL2
-   Ubuntu 22.04
-   Python 3.10
-   CUDA Toolkit 11.8
-   openai 1.6.1
-   llava 1.1.3
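
To compare your own setup against the list above, a small check like the following may help. It is only a sketch: the `installed_versions` helper is hypothetical, and `torch` and `transformers` are assumed package names that do not appear in the list.

```python
import platform
from importlib import metadata


def installed_versions(packages):
    """Return a dict of the Python version plus each package's installed version (or None)."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the active environment.
            versions[name] = None
    return versions


# "openai" and "llava" come from the list above; "torch" and "transformers" are assumptions.
for name, version in installed_versions(["openai", "llava", "torch", "transformers"]).items():
    print(f"{name}: {version or 'not installed'}")
```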

## 2. Create Virtual Environment and Install Dependencies

We will create a separate virtual environment for this, to avoid conflicts between the dependencies that each underlying model requires.

```sh
sudo apt update -y && sudo apt install ffmpeg git git-lfs -y
conda create -n ITT python=3.10.10 -y
conda activate ITT
pip install ".[ITT]"
```



## 3. Install Model

In [None]:
%%bash
huggingface-cli download liuhaotian/llava-v1.5-7b \
        --local-dir ~/gai/models/llava-v1.5-7b \
        --local-dir-use-symlinks False


## 4. Example

In [None]:
## 6.17 OpenAI Vision Image-to-Text Generation

from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('openai-vision')


import base64
encoded_string = ""
with open("../tests/gen/itt/buses.jpeg", "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode('utf-8')

print("GENERATE:")
response = gen.create(
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{encoded_string}",
          },
        },
      ],
    }
  ],
  max_tokens=300
  )
print(response.choices[0].message.content)
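
Every example in this section builds the same message structure by hand: one user message with a text part and a base64 `image_url` part. A small helper could remove that repetition. This is a sketch, not part of the `gai` package: `build_vision_messages` is a hypothetical name, and the JPEG fallback for unknown file extensions is an assumption.

```python
import base64
import mimetypes


def build_vision_messages(prompt, image_path):
    """Return a one-message chat payload pairing a text prompt with an image."""
    # Guess the MIME type from the file extension; fall back to JPEG (assumption).
    mime_type, _ = mimetypes.guess_type(image_path)
    mime_type = mime_type or "image/jpeg"

    # Encode the raw image bytes as base64 text for the data URL.
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")

    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
                },
            ],
        }
    ]
```

The result can be passed as the `messages` argument of `gen.create(...)` in any of the cells here.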


In [None]:
## 6.18 OpenAI Vision Image-to-Text Streaming

from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('openai-vision')

import base64
encoded_string = ""
with open("../tests/buses.jpeg", "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode('utf-8')

print("STREAMING:")
response = gen.create(
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{encoded_string}",
          },
        },
      ],
    }
  ],
  max_tokens=300,
  stream=True
  )
for chunk in response:
    if chunk.choices[0].delta.content:
      print(chunk.choices[0].delta.content,end="",flush=True)
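
If you want the streamed text as a single string rather than printed piece by piece, the loop above can be wrapped in a function. The sketch below assumes only the `chunk.choices[0].delta.content` attribute path used in the loop; `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks stand in for a real response so the example runs without a model.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of a streamed response into one string."""
    parts = []
    for chunk in chunks:
        content = chunk.choices[0].delta.content
        if content:  # skip chunks whose delta carries no text
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in chunk with the same attribute path (illustration only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [fake_chunk("Two "), fake_chunk(None), fake_chunk("buses")]
print(collect_stream(demo))  # prints "Two buses"
```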


The following demos use the LLaVA model.

In [None]:
## 6.19 LLaVA Image-to-Text Generation
import base64
encoded_string = ""
with open("./buses.jpeg", "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode('utf-8')

from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('llava-transformers')

print("GENERATE:")
response = gen.create(
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{encoded_string}",
          },
        },
      ],
    }
  ],
  do_sample=True,
  temperature=10e-9,
  max_new_tokens=300
  )
print(response.choices[0].message.content)


In [None]:
## 6.20 LLaVA Image-to-Text Streaming

from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('llava-transformers')

import base64
encoded_string = ""
with open("./buses.jpeg", "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode('utf-8')

print("STREAMING:")
response = gen.create(
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{encoded_string}",
          },
        },
      ],
    }
  ],
  max_new_tokens=300,
  do_sample=True,
  temperature=10e-9,
  stream=True
  )
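for chunk in response:
    # Skip chunks whose delta carries no text, as in the OpenAI streaming example
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content,end="",flush=True)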


The next examples use another LLaVA model, this one in Hugging Face format, which can be loaded directly with the pure Transformers API. Pro: more consistent with the rest of the codebase. Con: might be slightly less accurate.

In [None]:
%%bash
# Install the model
huggingface-cli download llava-hf/llava-1.5-7b-hf \
        --local-dir ~/gai/models/llava-1.5-7b-hf \
        --local-dir-use-symlinks False


In [None]:
%%bash
# This model requires the latest transformers version, so activate the ITT2 environment and install its dependencies
pip install -e ".[ITT2]"
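
For orientation, loading a Hugging Face format LLaVA checkpoint with the plain Transformers API might look roughly like the sketch below. It is not necessarily how `Transformers_ITT` is implemented. The helper names are hypothetical. The sketch assumes `torch`, `Pillow`, and a `transformers` release that includes `LlavaForConditionalGeneration`, and it loads the model in float16 on CUDA without quantization, which is unlikely to fit on a 6 GB card.

```python
def build_llava_prompt(question):
    """Format a question in the USER/ASSISTANT template used by LLaVA 1.5 HF checkpoints."""
    return f"USER: <image>\n{question} ASSISTANT:"


def describe_image(model_path, image_path, question, max_new_tokens=300):
    """Run one image-plus-question generation and return the decoded text."""
    # Heavy imports stay local so build_llava_prompt works without them installed.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_path)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_path,
        torch_dtype=torch.float16,  # assumption: half precision on a CUDA device
        low_cpu_mem_usage=True,
    ).to("cuda")

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=build_llava_prompt(question), images=image, return_tensors="pt")
    inputs = inputs.to("cuda", torch.float16)

    # Greedy decoding; the decoded string includes the prompt text as well.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return processor.decode(output_ids[0], skip_special_tokens=True)


# Example call (needs the downloaded model and a GPU with enough memory):
# import os
# print(describe_image(os.path.expanduser("~/gai/models/llava-1.5-7b-hf"),
#                      "./buses.jpeg", "Describe what you see in this image."))
```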

In [None]:
## Generating Text from Images
from gai.gen.itt.Transformers_ITT import Transformers_ITT
config = {
    "type": "itt",
    "engine": "Transformers_ITT",
    "model_path": "models/llava-1.5-7b-hf",
    "model_name": "llava-1.5-7b",
}
import base64
encoded_string = ""
with open("./buses.jpeg", "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode('utf-8')

itt = Transformers_ITT(config)
response = itt.create(
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe what you see in this image."},
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{encoded_string}",
          },
        },
      ],
    }
  ],
  do_sample=True,
  temperature=10e-9,
  stream=False
  )
print(response.choices[0].message.content)

In [None]:
## Streaming Text from Images
from gai.gen.itt.Transformers_ITT import Transformers_ITT
config = {
    "type": "itt",
    "engine": "Transformers_ITT",
    "model_path": "models/llava-1.5-7b-hf",
    "model_name": "llava-1.5-7b",
}
import base64
encoded_string = ""
with open("./buses.jpeg", "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode('utf-8')

itt = Transformers_ITT(config)
response = itt.create(
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe what you see in this image."},
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{encoded_string}",
          },
        },
      ],
    }
  ],
  do_sample=True,
  temperature=10e-9,
  stream=True
  )
for chunk in response:
    # Skip chunks whose delta carries no text
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content,end="",flush=True)

## 5. Running as a Service

### Step 1: Start Docker container

```bash
docker run -d \
    --name gai-itt \
    -p 12031:12031 \
    --gpus all \
    -v ~/gai/models:/app/models \
    kakkoii1337/gai-itt:latest
```

### Step 2: Wait for model to load

```bash
docker logs gai-itt
```

When the loading is completed, the logs should show this:

```text
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:12031 (Press CTRL+C to quit)
```
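
Instead of checking the logs by hand, you could poll them until the startup message shown above appears. The following is a sketch under a few assumptions: the `docker` CLI is on the `PATH`, the container is named `gai-itt` as in Step 1, and the timeout and polling interval are arbitrary values. `is_ready` and `wait_until_ready` are hypothetical helper names.

```python
import subprocess
import time

# The log line that marks a finished startup, taken from the output shown above.
READY_MARKER = "Application startup complete."


def is_ready(log_text):
    """Return True once the startup marker appears in the container logs."""
    return READY_MARKER in log_text


def wait_until_ready(container="gai-itt", timeout=600.0, interval=5.0):
    """Poll `docker logs` until the service reports startup or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["docker", "logs", container],
            capture_output=True,
            text=True,
        )
        # Check both streams, since server logs may be written to stderr.
        if is_ready(result.stdout + result.stderr):
            return True
        time.sleep(interval)
    return False


# Example call (needs the running container from Step 1):
# print("ready" if wait_until_ready() else "timed out")
```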

### Step 3: Test

In [None]:
import base64
import mimetypes

def file_to_data_url(filename):
    # Get the mime type of the file
    mime_type, _ = mimetypes.guess_type(filename)

    # Read the file in binary mode
    with open(filename, 'rb') as f:
        encoded_string = base64.b64encode(f.read()).decode('utf-8')

    # Format the data url
    data_url = f"data:{mime_type};base64,{encoded_string}"

    return data_url

# Load the image file and convert it to a data url
data_url = file_to_data_url('../tests/mr-lion-sketch.png')

prompt = {
    "max_new_tokens":1000,
    "stream":True,
    "messages":[{"role":"user","content":[{"type": "text", "text": "What’s in this image?"},{"type":"image_url","image_url":{"url":data_url}}]}]
    }
import json
import requests

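# stream=True lets requests yield lines as they arrive instead of buffering the whole body
response = requests.post("http://localhost:12031/gen/v1/vision/completions", json=prompt, stream=True)
for line in response.iter_lines():
    if not line:
        continue  # iter_lines() can yield empty keep-alive lines
    chunk = json.loads(line.decode())
    content = chunk["choices"][0]["delta"]["content"]
    if content:
        print(content,end="",flush=True)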
