### Document Parsing with Qwen2.5-VL

Welcome to this notebook, which showcases the powerful document parsing capabilities of our model. It can process any image and output its content in various formats such as HTML, JSON, MD, and LaTeX. Notably, we introduce a unique Qwenvl HTML format that includes positional information for each component, enabling precise document reconstruction and manipulation.

#### \[Setup\]

Load visualization utils.

In [1]:
!pip install git+https://github.com/huggingface/transformers
!pip install qwen-vl-utils
!pip install qwen_agent
!pip install openai

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-wx28s9nl
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-wx28s9nl
  Resolved https://github.com/huggingface/transformers to commit 5f8d17268ced2ca5f51b0216782356b16be0d6f4
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... [?25l[?25hdone
  Created wheel for transformers: filename=transformers-4.52.0.dev0-py3-none-any.whl size=11586091 sha256=e99225819d559ffa9ce1ca35588063a4adbf9898604131abc7a7421fdea6cfb7
  Stored in directory: /tmp/pip-ephem-wheel-cache-dr_4ekez/wheels/04/a3/f1/b88775f8e1665827525b19ac7590250f1038d947067beba9fb
Successfully built transformer

In [7]:
# Get Noto font
!apt-get install fonts-noto-cjk

import os
from PIL import Image, ImageDraw, ImageFont
import requests
from io import BytesIO
from bs4 import BeautifulSoup, Tag
from pathlib import Path
import re

# Function to draw bounding boxes and text on images based on HTML content
def draw_bbox(image_path, resized_width, resized_height, full_predict):
    if image_path.startswith("http"):
        response = requests.get(image_path)
        image = Image.open(BytesIO(response.content))
    else:
        image = Image.open(image_path)
    original_width = image.width
    original_height = image.height

    # Parse the provided HTML content
    soup = BeautifulSoup(full_predict, 'html.parser')
    # Extract all elements that have a 'data-bbox' attribute
    elements_with_bbox = soup.find_all(attrs={'data-bbox': True})

    filtered_elements = []
    for el in elements_with_bbox:
        if el.name == 'ol':
            continue  # Skip <ol> tags
        elif el.name == 'li' and el.parent.name == 'ol':
            filtered_elements.append(el)  # Include <li> tags within <ol>
        else:
            filtered_elements.append(el)  # Include all other elements

    font = ImageFont.truetype("NotoSansCJK-Regular.ttc", 20)
    draw = ImageDraw.Draw(image)

    # Draw bounding boxes and text for each element
    for element in filtered_elements:
        bbox_str = element['data-bbox']
        text = element.get_text(strip=True)
        x1, y1, x2, y2 = map(int, bbox_str.split())

        # Calculate scaling factors
        scale_x = resized_width / original_width
        scale_y = resized_height / original_height

        # Scale coordinates accordingly
        x1_resized = int(x1 / scale_x)
        y1_resized = int(y1 / scale_y)
        x2_resized = int(x2 / scale_x)
        y2_resized = int(y2 / scale_y)

        if x1_resized > x2_resized:
            x1_resized, x2_resized = x2_resized, x1_resized
        if y1_resized > y2_resized:
            y1_resized, y2_resized = y2_resized, y1_resized

        # Draw bounding box
        draw.rectangle([x1_resized, y1_resized, x2_resized, y2_resized], outline='red', width=2)
        # Draw associated text
        draw.text((x1_resized, y2_resized), text, fill='black', font=font)

    # Display the image
    image.show()

# Function to clean and format HTML content
def clean_and_format_html(full_predict):
    soup = BeautifulSoup(full_predict, 'html.parser')

    # Regular expression pattern to match 'color' styles in style attributes
    color_pattern = re.compile(r'\bcolor:[^;]+;?')

    # Find all tags with style attributes and remove 'color' styles
    for tag in soup.find_all(style=True):
        original_style = tag.get('style', '')
        new_style = color_pattern.sub('', original_style)
        if not new_style.strip():
            del tag['style']
        else:
            new_style = new_style.rstrip(';')
            tag['style'] = new_style

    # Remove 'data-bbox' and 'data-polygon' attributes from all tags
    for attr in ["data-bbox", "data-polygon"]:
        for tag in soup.find_all(attrs={attr: True}):
            del tag[attr]

    classes_to_update = ['formula.machine_printed', 'formula.handwritten']
    # Update specific class names in div tags
    for tag in soup.find_all(class_=True):
        if isinstance(tag, Tag) and 'class' in tag.attrs:
            new_classes = [cls if cls not in classes_to_update else 'formula' for cls in tag.get('class', [])]
            tag['class'] = list(dict.fromkeys(new_classes))  # Deduplicate and update class names

    # Clear contents of divs with specific class names and rename their classes
    for div in soup.find_all('div', class_='image caption'):
        div.clear()
        div['class'] = ['image']

    classes_to_clean = ['music sheet', 'chemical formula', 'chart']
    # Clear contents and remove 'format' attributes of tags with specific class names
    for class_name in classes_to_clean:
        for tag in soup.find_all(class_=class_name):
            if isinstance(tag, Tag):
                tag.clear()
                if 'format' in tag.attrs:
                    del tag['format']

    # Manually build the output string
    output = []
    for child in soup.body.children:
        if isinstance(child, Tag):
            output.append(str(child))
            output.append('\n')  # Add newline after each top-level element
        elif isinstance(child, str) and not child.strip():
            continue  # Ignore whitespace text nodes
    complete_html = f"""```html\n<html><body>\n{" ".join(output)}</body></html>\n```"""
    return complete_html

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Suggested packages:
  fonts-noto-cjk-extra
The following NEW packages will be installed:
  fonts-noto-cjk
0 upgraded, 1 newly installed, 0 to remove and 34 not upgraded.
Need to get 61.2 MB of archives.
After this operation, 93.2 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 fonts-noto-cjk all 1:20220127+repack1-1 [61.2 MB]
Fetched 61.2 MB in 4s (14.2 MB/s)
Selecting previously unselected package fonts-noto-cjk.
(Reading database ... 126101 files and directories currently installed.)
Preparing to unpack .../fonts-noto-cjk_1%3a20220127+repack1-1_all.deb ...
Unpacking fonts-noto-cjk (1:20220127+repack1-1) ...
Setting up fonts-noto-cjk (1:20220127+repack1-1) ...
Processing triggers for fontconfig (2.13.1-4.2ubuntu5) ...


Load model and processors.

In [4]:
#If accessing via API, you can skip this step and directly use the inference_with_api function.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
model_path = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.float16,device_map="auto")
processor = AutoProcessor.from_pretrained(model_path)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.37k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/65.4k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.53G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/3.98G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/216 [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


tokenizer_config.json:   0%|          | 0.00/5.70k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

chat_template.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

Load inference

In [5]:
def inference(img_url, prompt, system_prompt="You are a helpful assistant", max_new_tokens=32000):
  image = Image.open(img_url)
  messages = [
    {
      "role": "system",
      "content": system_prompt
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": prompt
        },
        {
          "image": img_url
        }
      ]
    }
  ]
  text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  print("input:\n",text)
  inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to('cuda')

  output_ids = model.generate(**inputs, max_new_tokens=1024)
  generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
  output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
  print("output:\n",output_text[0])

  input_height = inputs['image_grid_thw'][0][1]*14
  input_width = inputs['image_grid_thw'][0][2]*14

  return output_text[0], input_height, input_width

#### 1. Document Parsing in QwenVL HTML Format

Here shows how to generate and process HTML content using Qwen2.5-VL. The generated HTML follows the QwenVL Document Parser format with bounding boxes.

In [9]:
img_url = "/content/example.png"
image = Image.open(img_url)

system_prompt="Вы — ИИ, специализирующийся на распознавании и извлечении текста из изображений. Ваша миссия — проанализировать документ изображения и сгенерировать результат в формате HTML QwenVL Document Parser, используя указанные теги, сохраняя при этом конфиденциальность пользователя и целостность данных."
prompt =  "QwenVL HTML "

## Use a local HuggingFace model to inference.
output, input_height, input_width = inference(img_url, prompt)




## Use an API-based approach to inference. Apply API key here: https://bailian.console.alibabacloud.com/?apiKey=1
# from qwen_vl_utils import smart_resize
# os.environ['DASHSCOPE_API_KEY'] = 'your_api_key_here'
# min_pixels = 512*28*28
# max_pixels = 2048*28*28
# image = Image.open(img_url)
# width, height = image.size
# input_height,input_width = smart_resize(height,width,min_pixels=min_pixels, max_pixels=max_pixels)
# output = inference_with_api(img_url, prompt, min_pixels=min_pixels, max_pixels=max_pixels)


# Visualization
print(input_height, input_width)
print(output)
draw_bbox(img_url, input_width, input_height, output)

ordinary_html = clean_and_format_html(output)
print(ordinary_html)

input:
 <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
QwenVL HTML <|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

output:
 The image you provided appears to be a Russian employment application form from VTB Bank (ВТБ). Here is a summary of the key information filled in:

1. **Personal Information:**
   - **Name:** Евгений Аркадьевич
   - **Date of Birth:** 13 September 1987
   - **Gender:** М (Male)
   - **Place of Birth:** Санкт-Петербург

2. **Citizenship and Nationality:**
   - **Primary Citizenship:** Russian Federation
   - **Secondary Citizenship:** Kazakhstan

3. **Work Experience:**
   - **Employment Status:** Нет (None)

4. **Residency and Green Card Status:**
   - **Residence in the USA:** Да (Yes)
   - **Green Card Status:** Нет (None)

5. **Registration Address:**
   - **Primary Address:** Республика Татарстан, ул. Кремлевская, 13
   - **Temporary Registration Address:** Алтайский край, деревня Скрытая в горах

6. **A

AttributeError: 'NoneType' object has no attribute 'children'

#### 2. Generate ordinary html with Qwen2.5-VL

In [None]:
# generate ordinary html

img_url = "assets/document_parsing/docparsing_example5.png"
prompt = "图片解析成html"
image = Image.open(img_url)

## Use a local HuggingFace model to inference.
output, input_height, input_width = inference(img_url, prompt)
print(output)

## Use an API-based approach to inference. Apply API key here: https://bailian.console.alibabacloud.com/?apiKey=1
# from qwen_vl_utils import smart_resize
# os.environ['DASHSCOPE_API_KEY'] = 'your_api_key_here'
# min_pixels = 512*28*28
# max_pixels = 2048*28*28
# image = Image.open(img_url)
# width, height = image.size
# input_height,input_width = smart_resize(height,width,min_pixels=min_pixels, max_pixels=max_pixels)
# output = inference_with_api(img_url, prompt, min_pixels=min_pixels, max_pixels=max_pixels)
# print(output)

input:
 <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
图片解析成html<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

output:
 将图片内容解析成HTML格式如下：

```html
<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="UTF-8">
    <title>百度搜索结果</title>
</head>
<body>
    <header>
        <div class="logo">
            <img src="https://www.baidu.com/img/baidu_jgylogo3.gif" alt="百度Logo">
        </div>
        <nav>
            <ul>
                <li><a href="#">网页</a></li>
                <li><a href="#">AI搜</a></li>
                <li><a href="#">图片</a></li>
                <li><a href="#">资讯</a></li>
                <li><a href="#">视频</a></li>
                <li><a href="#">笔记</a></li>
                <li><a href="#">地图</a></li>
                <li><a href="#">贴吧</a></li>
                <li><a href="#">文库</a></li>
                <li><a href="#">更多</a></li>
            </ul>
        </nav>
    </header>
    <main>
        <section c