<a href="https://colab.research.google.com/github/dodamm111/PDF_GQ/blob/main/PDF2Json2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Install required libraries
- Why Colab: it provides more memory and GPU capacity than a laptop environment
- It also sidesteps library version differences between environments
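
Since version differences are one reason for using Colab, it can help to record which versions a given session actually has. The following is a minimal sketch that uses only the standard library; the helper name `report_versions` and the `PACKAGES` list are illustrative assumptions that mirror the install cell, not part of the original notebook.

```python
from importlib import metadata

# Package names mirror the pip install cell; adjust the list as needed.
PACKAGES = ["ultralytics", "doclayout-yolo", "pdf2image", "easyocr", "torch", "torchvision"]


def report_versions(packages):
    """Return {package: version}, using "not installed" for missing packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


for name, version in report_versions(PACKAGES).items():
    print(f"{name}=={version}")
```

Saving this output next to the results makes it easier to reproduce a run later in a local environment.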

In [None]:
!pip install huggingface_hub ultralytics
!pip install doclayout-yolo
!pip install pdf2image
!pip install easyocr
!pip3 install torch
!pip3 install torchvision

!apt-get update
!apt-get install -y poppler-utils

In [None]:
import os
import json
import cv2
import numpy as np
import easyocr
import pandas as pd
from pdf2image import convert_from_path
from huggingface_hub import hf_hub_download
from ultralytics import YOLO
from doclayout_yolo import YOLOv10
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, AutoProcessor
import torch


In [None]:
# Global device setting: "cuda" when CUDA is available, otherwise "cpu"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {DEVICE}")

Using device: cuda


## 2. Page layout extraction
1. Convert the PDF into per-page images (via poppler)
2. Detect regions with the YOLOv10 model
3. Compute IoU to resolve overlapping detections
4. Crop each detected region and store it together with its metadata
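
Step 3 can be illustrated with a small standalone IoU calculation. This is a sketch only: the helper name `iou` and the sample boxes are made up for illustration, and boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples as in the notebook.

```python
def iou(box_a, box_b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes; 0.0 when the union is empty."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero for disjoint boxes
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes overlapping in a 5x5 square: 25 / (100 + 100 - 25)
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
# Disjoint boxes do not overlap at all
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))
```

With the notebook's threshold of 0.5, the first pair would be treated as two separate regions, since its IoU is 25/175.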

In [None]:
# -------------------------------------------------------------------------
# Convert a PDF file into per-page images (PIL Image)
def pdf_to_images(pdf_path, dpi=300):
    images = convert_from_path(pdf_path, dpi=dpi)
    print(f"Converted PDF to {len(images)} page objects.")
    return images

# -------------------------------------------------------------------------
# Load the YOLOv10 model (DocLayout-YOLO)
def load_yolo_model():
    filepath = hf_hub_download(
        repo_id="juliozhao/DocLayout-YOLO-DocStructBench",
        filename="doclayout_yolo_docstructbench_imgsz1024.pt"
    )
    return YOLOv10(filepath)

# -------------------------------------------------------------------------
# Compute the IoU (intersection over union) of two boxes
def calculate_iou(box1, box2):
    x1, y1, x2, y2 = box1
    x3, y3, x4, y4 = box2
    inter_x1 = max(x1, x3)
    inter_y1 = max(y1, y3)
    inter_x2 = min(x2, x4)
    inter_y2 = min(y2, y4)
    inter_area = max(0, inter_x2 - inter_x1) * max(0, inter_y2 - inter_y1)
    box1_area = (x2 - x1) * (y2 - y1)
    box2_area = (x4 - x3) * (y4 - y3)
    union_area = box1_area + box2_area - inter_area
    # Guard against degenerate (zero-area) boxes to avoid division by zero
    return inter_area / union_area if union_area > 0 else 0.0

# -------------------------------------------------------------------------
# Remove duplicate boxes (IoU-based), keeping the higher-confidence box of each overlapping group
def filter_duplicate_boxes(bounding_boxes, iou_threshold=0.5):
    filtered_boxes = []
    # Visit boxes from most to least confident, so a kept box never has to be evicted later
    # and every candidate is checked against all kept boxes, not only the first overlap
    for box in sorted(bounding_boxes, key=lambda b: b["confidence"], reverse=True):
        box_xyxy = (box["x_min"], box["y_min"], box["x_max"], box["y_max"])
        overlaps_kept = any(
            calculate_iou(
                box_xyxy,
                (fbox["x_min"], fbox["y_min"], fbox["x_max"], fbox["y_max"])
            ) > iou_threshold
            for fbox in filtered_boxes
        )
        if not overlaps_kept:
            filtered_boxes.append(box)
    return filtered_boxes

# -------------------------------------------------------------------------
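# Generate a unique suffix (used in region IDs)
def generate_unique_suffix(index):
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    # Spreadsheet-style suffix (a..z, aa, ab, ...) so IDs stay unique past 26 boxes per page
    suffix = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, len(alphabet))
        suffix = alphabet[remainder] + suffix
    return suffix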

# -------------------------------------------------------------------------
# Detect regions (boxes) in a single page image
def process_image(image, model, page_number):
    image_array = np.array(image)
    # Run on GPU or CPU according to DEVICE
    det_res = model.predict(image_array, imgsz=1024, conf=0.2, device=DEVICE)
    bounding_boxes = []
    for i, box in enumerate(det_res[0].boxes):
        class_name = model.names[int(box.cls)]
        class_number = int(box.cls)
        unique_suffix = generate_unique_suffix(i)
        bounding_boxes.append({
            "class": class_name,
            "confidence": float(box.conf),
            "x_min": float(box.xyxy[0][0]),
            "y_min": float(box.xyxy[0][1]),
            "x_max": float(box.xyxy[0][2]),
            "y_max": float(box.xyxy[0][3]),
            "unique_id": f"page{page_number}_class{class_number}_{unique_suffix}",
            "page_number": page_number
        })
    filtered_boxes = filter_duplicate_boxes(bounding_boxes, iou_threshold=0.5)
    return filtered_boxes

# -------------------------------------------------------------------------
# Convert the whole PDF to per-page images and detect regions on each page
def process_pdf(pdf_path, model, dpi=300):
    images = pdf_to_images(pdf_path, dpi=dpi)
    all_detections = []
    for page_number, image in enumerate(images, start=1):
        detections = process_image(image, model, page_number)
        all_detections.append(detections)
    return images, all_detections

# -------------------------------------------------------------------------
# Crop each detected region from the page image and group it by class, keeping bounding_box and page_number
def crop_detections(images, all_detections):
    cropped_results = {"table": [], "plain text": [], "figure": []}
    for detections in all_detections:
        if not detections:
            continue
        page_number = detections[0]["page_number"]
        image = np.array(images[page_number - 1])
        for box in detections:
            x_min = int(box["x_min"])
            y_min = int(box["y_min"])
            x_max = int(box["x_max"])
            y_max = int(box["y_max"])
            cropped_img = image[y_min:y_max, x_min:x_max]
            category = box["class"]
            region_dict = {
                "unique_id": box["unique_id"],
                "image": cropped_img,
                "page_number": page_number,
                "bounding_box": {
                    "x_min": box["x_min"],
                    "y_min": box["y_min"],
                    "x_max": box["x_max"],
                    "y_max": box["y_max"]
                }
            }
            if category in cropped_results:
                cropped_results[category].append(region_dict)
            else:
                if "other" not in cropped_results:
                    cropped_results["other"] = []
                cropped_results["other"].append(region_dict)
    return cropped_results

### 3-1. Text extraction from plain-text regions
- Uses EasyOCR
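
With `detail=0`, EasyOCR's `readtext` returns a list of text fragments, which the notebook joins and whitespace-normalizes. The sketch below isolates that clean-up step so it can be tried without loading an OCR model; the helper name `normalize_ocr_text` and the sample fragments are illustrative assumptions.

```python
def normalize_ocr_text(fragments):
    """Join OCR fragments and collapse every run of whitespace into one space."""
    return " ".join(" ".join(fragments).split())


# Hand-written fragments standing in for OCR output
sample = ["  Hello ", "world\n", "\t2024  report "]
print(normalize_ocr_text(sample))  # prints "Hello world 2024 report"
```

Calling `split()` without arguments also removes leading and trailing whitespace, so no separate `strip()` is needed.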

In [None]:
# -------------------------------------------------------------------------
# Extract text from plain-text regions with EasyOCR
class TextExtractorFromMemory:
    def __init__(self):
        self.reader = easyocr.Reader(['ko', 'en'], gpu=torch.cuda.is_available())
    def extract_text(self, image):
        text_result = self.reader.readtext(image, detail=0)
        text = " ".join(text_result).strip()
        text = " ".join(text.split())
        return text

# -------------------------------------------------------------------------
# Process plain-text regions; each output record has the structure
# {data_id, page_number, region_type, content, meta:{bounding_box: ...}}
def process_plain_text_regions(plain_text_regions):
    extractor = TextExtractorFromMemory()
    results = []
    for region in plain_text_regions:
        unique_id = region["unique_id"]
        text = extractor.extract_text(region["image"])
        results.append({
            "data_id": unique_id,
            "page_number": region["page_number"],
            "region_type": "평문",
            "content": text,
            "meta": {
                "bounding_box": region["bounding_box"]
            }
        })
    return results

### 3-2. Text extraction from table regions
- Based on OpenCV and EasyOCR
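
The OpenCV part works by sorting detected line segments into horizontal and vertical rulings and then cutting the table into a grid of cells. The sketch below shows those two ideas in plain Python, without OpenCV. The function names `classify_segment` and `grid_cells` and the sample coordinates are assumptions made for illustration; the 10-degree tolerance matches the thresholds used in the notebook.

```python
import math


def classify_segment(x1, y1, x2, y2, tolerance_deg=10):
    """Label a segment as horizontal, vertical, or other by its angle."""
    angle = abs(math.degrees(math.atan2(y2 - y1, x2 - x1)))
    if angle < tolerance_deg or angle > 180 - tolerance_deg:
        return "horizontal"
    if 90 - tolerance_deg < angle < 90 + tolerance_deg:
        return "vertical"
    return "other"


def grid_cells(x_coords, y_coords):
    """Return (row, col, left, top, right, bottom) for every cell of the grid."""
    xs, ys = sorted(set(x_coords)), sorted(set(y_coords))
    return [
        (i + 1, j + 1, xs[j], ys[i], xs[j + 1], ys[i + 1])
        for i in range(len(ys) - 1)
        for j in range(len(xs) - 1)
    ]


print(classify_segment(0, 0, 100, 3))     # nearly flat segment
print(classify_segment(0, 0, 2, 100))     # nearly upright segment
print(grid_cells([0, 50, 100], [0, 40]))  # one row, two columns
```

Each grid cell is then cropped from the table image and passed to OCR, with rows and columns numbered from 1.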

In [None]:
# -------------------------------------------------------------------------
# Table regions: join the text of the extracted cells into one string
def extract_text_from_cells(cells_data):
    extracted_text = []
    for cell in cells_data:
        if 'text' in cell:
            extracted_text.append(cell['text'])
    return ' '.join(extracted_text)

# -------------------------------------------------------------------------
# Extract the table structure and cell text from a table region (OpenCV and EasyOCR)
class TableExtractor:
    def __init__(self):
        self.reader = easyocr.Reader(['ko', 'en'], gpu=torch.cuda.is_available())
    def process_image(self, image):
        if isinstance(image, str):
            self.image = cv2.imread(image)
        else:
            self.image = image
        self.result = self.image.copy()
        self.detect_lines()
        self.classify_lines_and_find_intersections()
        self.remove_duplicate_points()
        data, extracted_cells = self.extract_text_from_cells()
        df = pd.DataFrame(data)
        df = df.replace(r'^\s*$', np.nan, regex=True)
        # Drop fully empty rows/columns but keep the original labels, so each
        # surviving cell can still be matched to the coordinates it came from
        df = df.dropna(how='all', axis=0).dropna(how='all', axis=1)
        df = df.fillna('')
        cell_lookup = {(cell['row'], cell['col']): cell for cell in extracted_cells}
        processed_cells = []
        for i, orig_row in enumerate(df.index):
            for j, orig_col in enumerate(df.columns):
                original_cell = cell_lookup.get((orig_row + 1, orig_col + 1))
                if original_cell:
                    processed_cells.append({
                        'row': i + 1,
                        'col': j + 1,
                        'text': df.iat[i, j],
                        'coordinates': original_cell['coordinates']
                    })
        final_result = {'cells': processed_cells, 'grid_info': {'rows': len(df), 'cols': len(df.columns)}}
        return final_result
    def detect_lines(self):
        self.edges = cv2.Canny(self.image, 50, 150, apertureSize=3)
        self.lines = cv2.HoughLinesP(self.edges, 1, np.pi/180, threshold=100, minLineLength=100, maxLineGap=10)
        return self.lines
    def classify_lines_and_find_intersections(self):
        self.intersection_points = []
        self.horizontal_lines = []
        self.vertical_lines = []
        if self.lines is not None:
            for line in self.lines:
                x1, y1, x2, y2 = line[0]
                angle = np.abs(np.arctan2(y2 - y1, x2 - x1) * 180.0 / np.pi)
                if angle < 10 or angle > 170:
                    self.horizontal_lines.append(line[0])
                elif 80 < angle < 100:
                    self.vertical_lines.append(line[0])
        # Always add the image border as table edges, even when no lines were detected,
        # so the later steps have all_points defined and a borderless table yields one cell
        height, width = self.image.shape[:2]
        margin = 10
        self.horizontal_lines.append([margin, margin, width - margin, margin])
        self.horizontal_lines.append([margin, height - margin, width - margin, height - margin])
        self.vertical_lines.append([margin, margin, margin, height - margin])
        self.vertical_lines.append([width - margin, margin, width - margin, height - margin])
        self._find_intersection_points()
        self._process_end_points()
    def _find_intersection_points(self):
        for h_line in self.horizontal_lines:
            for v_line in self.vertical_lines:
                x1, y1, x2, y2 = h_line
                x3, y3, x4, y4 = v_line
                denominator = ((x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4))
                if denominator != 0:
                    t = ((x1 - x3) * (y3 - y4) - (y1 - y3) * (x3 - x4)) / denominator
                    u = -((x1 - x2) * (y1 - y3) - (y1 - y2) * (x1 - x3)) / denominator
                    if 0 <= t <= 1 and 0 <= u <= 1:
                        x = int(x1 + t * (x2 - x1))
                        y = int(y1 + t * (y2 - y1))
                        self.intersection_points.append((x, y))
        self.intersection_points = sorted(set(self.intersection_points), key=lambda p: (p[1], p[0]))
    def _process_end_points(self):
        end_points = []
        for line in self.horizontal_lines + self.vertical_lines:
            x1, y1, x2, y2 = line
            end_points.append((x1, y1))
            end_points.append((x2, y2))
        x_values = [point[0] for point in end_points]
        y_values = [point[1] for point in end_points]
        x_min, x_max = min(x_values), max(x_values)
        y_min, y_max = min(y_values), max(y_values)
        self.filtered_end_points = [(x, y) for (x, y) in end_points if (x_min <= x <= x_min + 10 or x_max - 10 <= x <= x_max) or (y_min <= y <= y_min + 10 or y_max - 10 <= y <= y_max)]
        self.all_points = self.intersection_points + self.filtered_end_points
    def remove_duplicate_points(self, distance_threshold=15):
        self.unique_points = []
        for point in self.all_points:
            is_unique = True
            for unique_point in self.unique_points:
                distance = np.linalg.norm(np.array(point) - np.array(unique_point))
                if distance <= distance_threshold:
                    is_unique = False
                    break
            if is_unique:
                self.unique_points.append(point)
    def extract_text_from_cells(self, min_height=30, min_width=30):
        self.x_coords = sorted(list(set([point[0] for point in self.intersection_points])))
        self.y_coords = sorted(list(set([point[1] for point in self.intersection_points])))
        data = []
        extracted_cells = []
        for i in range(len(self.y_coords) - 1):
            row = []
            for j in range(len(self.x_coords) - 1):
                top_left_x = self.x_coords[j]
                top_left_y = self.y_coords[i]
                bottom_right_x = self.x_coords[j + 1]
                bottom_right_y = self.y_coords[i + 1]
                tile = self.image[top_left_y:bottom_right_y, top_left_x:bottom_right_x]
                cell_info = {'row': i + 1, 'col': j + 1, 'coordinates': {'top_left': (top_left_x, top_left_y), 'bottom_right': (bottom_right_x, bottom_right_y)}}
                if tile.shape[0] < min_height or tile.shape[1] < min_width:
                    row.append("")
                    cell_info['text'] = ""
                    extracted_cells.append(cell_info)
                    continue
                text_result = self.reader.readtext(tile, detail=0)
                text = "\n".join(text_result).strip()
                row.append(text)
                cell_info['text'] = text
                extracted_cells.append(cell_info)
            data.append(row)
        return data, extracted_cells

# -------------------------------------------------------------------------
# Process table regions; each output record contains data_id, page_number, region_type, content, meta
def process_table_regions(table_regions):
    table_extractor = TableExtractor()
    results = []
    for region in table_regions:
        unique_id = region["unique_id"]
        try:
            table_result = table_extractor.process_image(region["image"])
        except Exception as e:
            print(f"Table extraction failed for {unique_id}: {e}")
            continue
        table_text = extract_text_from_cells(table_result["cells"])
        results.append({
            "data_id": unique_id,
            "page_number": region["page_number"],
            "region_type": "일반표",
            "content": table_text,
            "meta": {
                "bounding_box": region["bounding_box"],
                "cells": table_result.get("cells", []),
                "grid": table_result.get("grid_info", {})
            }
        })
    return results

### 3-3. Figure (chart) regions
- Loads and uses brainventures/deplot_kr, a pretrained model based on Pix2Struct

In [None]:
# -------------------------------------------------------------------------
# Figure regions: generate a textual description of the figure with Pix2Struct
class FigureExtractor:
    def __init__(self):
        self.processor = AutoProcessor.from_pretrained("brainventures/deplot_kr")
        self.model = Pix2StructForConditionalGeneration.from_pretrained("brainventures/deplot_kr")
        self.model.to(DEVICE)
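    def extract_figure_info(self, image):
        if not isinstance(image, Image.Image):
            # Crops come from pdf2image (PIL) pages, so the array is already RGB;
            # applying cv2.COLOR_BGR2RGB here would swap the red and blue channels
            image = Image.fromarray(image)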
        inputs = self.processor(images=image, return_tensors="pt").to(DEVICE)
        outputs = self.model.generate(**inputs, max_length=1024)
        result = self.processor.batch_decode(outputs, skip_special_tokens=True)[0]
        return result

# -------------------------------------------------------------------------
# Process figure regions; each output record contains data_id, page_number, region_type, content, meta
def process_figure_regions(figure_regions):
    figure_extractor = FigureExtractor()
    results = []
    for region in figure_regions:
        unique_id = region["unique_id"]
        try:
            figure_text = figure_extractor.extract_figure_info(region["image"])
        except Exception as e:
            print(f"Figure extraction failed for {unique_id}: {e}")
            continue
        results.append({
            "data_id": unique_id,
            "page_number": region["page_number"],
            "region_type": "도표",
            "content": figure_text,
            "meta": {
                "bounding_box": region["bounding_box"]
            }
        })
    return results

### 3-4. Alternative using the NAVER CLOVA API
- Free up to 100 calls; calls beyond that are billed
- For local runs this is probably the option to use (Colab is currently on the paid tier)
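
The API-based variant flattens each recognized table column by column. The sketch below reproduces only that ordering step on a hand-written dictionary. The field names (`cells`, `columnIndex`, `rowIndex`, `cellTextLines`, `cellWords`, `inferText`) are taken from the notebook's own parsing code, while `sample_table` is invented sample data, not real API output, and `flatten_table_column_major` is a hypothetical helper name.

```python
def flatten_table_column_major(table):
    """Join cell texts column by column, top to bottom within each column."""
    columns = {}
    for cell in table.get("cells", []):
        words = [
            word["inferText"]
            for line in cell.get("cellTextLines", [])
            for word in line.get("cellWords", [])
            if "inferText" in word
        ]
        columns.setdefault(cell["columnIndex"], []).append((cell["rowIndex"], " ".join(words)))
    parts = []
    for col in sorted(columns):
        parts.extend(text for _, text in sorted(columns[col], key=lambda rc: rc[0]))
    return " ".join(parts)


def _cell(row, col, text):
    # Build one cell in the shape the parsing code expects (illustrative only)
    return {"rowIndex": row, "columnIndex": col,
            "cellTextLines": [{"cellWords": [{"inferText": text}]}]}


# Invented 2x2 table, deliberately listed out of order
sample_table = {"cells": [_cell(1, 1, "90"), _cell(0, 0, "Name"),
                          _cell(0, 1, "Score"), _cell(1, 0, "Kim")]}
print(flatten_table_column_major(sample_table))  # prints "Name Kim Score 90"
```

Column-major order keeps each header next to its values; a row-major order would instead read like the printed table.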

In [None]:
'''
import requests
import time
import uuid

# Table regions: run table OCR through the NAVER CLOVA OCR API (column-major ordering)
class TextExtractorAPI:
    def __init__(self, api_url, secret_key):
        self.api_url = api_url
        self.secret_key = secret_key

    def extract_text(self, image_path):
        request_json = {
            'images': [{
                'format': image_path.split('.')[-1],
                'name': os.path.basename(image_path)
            }],
            'requestId': str(uuid.uuid4()),
            'version': 'V2',
            'timestamp': int(round(time.time() * 1000)),
            'enableTableDetection': True
        }
        payload = {'message': json.dumps(request_json).encode('UTF-8')}
        with open(image_path, 'rb') as f:
            files = {'file': f}
            headers = {'X-OCR-SECRET': self.secret_key}
            response = requests.post(self.api_url, headers=headers, data=payload, files=files)
            response.raise_for_status()
        return response.json()

def extract_table_data(api_url, secret_key, image_path):
    extractor = TextExtractorAPI(api_url, secret_key)
    result = extractor.extract_text(image_path)
    texts = []
    for img in result.get("images", []):
        for table in img.get("tables", []):
            columns = {}
            for cell in table.get("cells", []):
                col = cell["columnIndex"]
                row = cell["rowIndex"]
                cell_text = " ".join(word["inferText"] for line in cell.get("cellTextLines", []) for word in line.get("cellWords", []) if "inferText" in word)
                columns.setdefault(col, []).append((row, cell_text))
            # Sort by column, then by row within each column, and join the text
            col_texts = []
            for col in sorted(columns.keys()):
                col_texts.append(" ".join(text for _, text in sorted(columns[col], key=lambda x: x[0])))
            texts.append(" ".join(col_texts))
    return " ".join(texts)

def process_table_regions(table_regions, api_url, secret_key, temp_dir):
    results = []
    os.makedirs(temp_dir, exist_ok=True)
    for region in table_regions:
        unique_id = region["unique_id"]
        temp_path = os.path.join(temp_dir, f"{unique_id}.png")
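        # Crops are RGB (from PIL) while OpenCV writes BGR, so convert before saving
        cv2.imwrite(temp_path, cv2.cvtColor(region["image"], cv2.COLOR_RGB2BGR))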
        table_text = extract_table_data(api_url, secret_key, temp_path)
        results.append({
            "data_id": unique_id,
            "page_number": region["page_number"],
            "region_type": "일반표",
            "content": table_text,
            "meta": {"bounding_box": region["bounding_box"]}
        })
    return results
    '''

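## 4. Final run code
- Set pdf_path to the path of the input PDF
- The final results are written as both JSON and CSV (the CSV is there for easier viewing)
- The run may stall or crash when memory is short or when it runs on CPU
- Running locally is still being attempted; getting OpenCV to work locally is somewhat tricky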
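
Before the results are saved, the pipeline sorts the records into reading order by page, then top edge, then left edge. The sketch below shows that sort key on a few hand-written records; the `data_id` values and coordinates are invented for illustration, and `reading_order_key` is a hypothetical helper name.

```python
def reading_order_key(record):
    """Sort key: page first, then top-to-bottom, then left-to-right."""
    box = record["meta"]["bounding_box"]
    return (record["page_number"], box["y_min"], box["x_min"])


def make_record(data_id, page, x_min, y_min):
    # Minimal record with only the fields the sort key needs (illustrative)
    return {"data_id": data_id, "page_number": page,
            "meta": {"bounding_box": {"x_min": x_min, "y_min": y_min}}}


records = [
    make_record("p2_top", 2, 50.0, 100.0),
    make_record("p1_bottom", 1, 50.0, 400.0),
    make_record("p1_top_right", 1, 300.0, 100.0),
    make_record("p1_top_left", 1, 50.0, 100.0),
]
ordered_ids = [r["data_id"] for r in sorted(records, key=reading_order_key)]
print(ordered_ids)
# prints ['p1_top_left', 'p1_top_right', 'p1_bottom', 'p2_top']
```

This key reads straight across the page, so on a two-column layout it interleaves the columns; such pages would need a column-aware key.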

In [None]:
# -------------------------------------------------------------------------
# Run the full pipeline and save the results
pdf_path = "/content/샘플_간단.pdf"  # Path to the PDF file (change to the actual file path)

# 1. Load the model, convert the PDF to images, and detect regions
model = load_yolo_model()  # Load the DocLayout-YOLO model
images, all_detections = process_pdf(pdf_path, model, dpi=300)
cropped_results = crop_detections(images, all_detections)

# 2. Split the regions by type (plain text, table, figure)
plain_text_regions = cropped_results.get("plain text", [])
table_regions = cropped_results.get("table", [])
figure_regions = cropped_results.get("figure", [])

print(f"Total plain text regions: {len(plain_text_regions)}")
print(f"Total table regions: {len(table_regions)}")
print(f"Total figure regions: {len(figure_regions)}")

# 3. Extract and process text per region type (output structure: data_id, page_number, region_type, content, meta)
plain_text_extraction_results = process_plain_text_regions(plain_text_regions)

table_extraction_results = process_table_regions(table_regions)

'''
Tried the API route, but the current results are good enough, so it is skipped.
API_URL = "https://zs86y5dbis.apigw.ntruss.com/custom/v1/38296/f1f32ef77bd50be9fac754ee87f6da4ea19c110cbb1a2ebeda56cec30a6594e3/general"
SECRET_KEY = "___"
TEMP_DIR = "/content/temp_tables"
table_extraction_results = process_table_regions(table_regions, API_URL, SECRET_KEY, TEMP_DIR)
'''

figure_extraction_results = process_figure_regions(figure_regions)

# 4. Merge the results of the three region types
combined_results = plain_text_extraction_results + table_extraction_results + figure_extraction_results

# 5. Sort into human reading order (by page, then bounding_box y_min, then x_min)
combined_results.sort(key=lambda x: (x["page_number"],
                                       x["meta"]["bounding_box"]["y_min"],
                                       x["meta"]["bounding_box"]["x_min"]))

# 6. Save the final results as a JSON file
RESULTS_DIR = "Output"
os.makedirs(RESULTS_DIR, exist_ok=True)
final_combined_json = os.path.join(RESULTS_DIR, "result.json")
with open(final_combined_json, "w", encoding="utf-8") as f:
    json.dump(combined_results, f, ensure_ascii=False, indent=4)
print(f"Combined JSON extraction results saved to: {final_combined_json}")

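# 7. Save the final results as a CSV file
# In the CSV, the meta field is stored as a JSON string
df = pd.DataFrame(combined_results)
df["meta"] = df["meta"].apply(lambda m: json.dumps(m, ensure_ascii=False))
final_combined_csv = os.path.join(RESULTS_DIR, "result.csv")
# utf-8-sig can encode every OCR character and still opens correctly in Excel;
# 'euc-kr' raises UnicodeEncodeError on characters outside that charset
df.to_csv(final_combined_csv, index=False, encoding='utf-8-sig')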
print(f"Combined CSV extraction results saved to: {final_combined_csv}")

Converted PDF to 12 page objects.

0: 576x1024 1 figure, 340.0ms
Speed: 75.7ms preprocess, 340.0ms inference, 698.8ms postprocess per image at shape (1, 3, 576, 1024)

0: 576x1024 2 titles, 5 plain texts, 106.0ms
Speed: 8.5ms preprocess, 106.0ms inference, 2.0ms postprocess per image at shape (1, 3, 576, 1024)

0: 576x1024 1 title, 2 plain texts, 83.3ms
Speed: 11.4ms preprocess, 83.3ms inference, 22.4ms postprocess per image at shape (1, 3, 576, 1024)

0: 576x1024 1 title, 5 plain texts, 92.0ms
Speed: 35.5ms preprocess, 92.0ms inference, 1.5ms postprocess per image at shape (1, 3, 576, 1024)

0: 576x1024 1 title, 1 plain text, 81.1ms
Speed: 10.1ms preprocess, 81.1ms inference, 8.5ms postprocess per image at shape (1, 3, 576, 1024)

0: 576x1024 1 title, 6 plain texts, 177.3ms
Speed: 38.5ms preprocess, 177.3ms inference, 1.3ms postprocess per image at shape (1, 3, 576, 1024)

0: 576x1024 1 figure, 1 figure_caption, 79.9ms
Speed: 8.4ms preprocess, 79.9ms inference, 1.6ms postprocess per i



Total plain text regions: 32
Total table regions: 1
Total figure regions: 2

Combined JSON extraction results saved to: Output/result.json
Combined CSV extraction results saved to: Output/result.csv
