# ETL Pipeline: PDF to PostgreSQL pgvector

이 노트북은 PDF 파일을 읽어서 마크다운으로 변환하고, 청크로 나눈 후 임베딩하여 PostgreSQL pgvector에 적재하는 전체 ETL 파이프라인을 구현합니다.

## Pipeline Overview
1. **Chapter-based Markdown Conversion**: 챕터 배열을 받아서 챕터별로 나눈 후 마크다운으로 변환
2. **Noise Removal**: 챕터 첫 페이지와 전체 페이지에서 노이즈 제거 (header/footer 패턴)
3. **Markdown Header Labeling**: 특정 패턴(Exercises, Key Terms 등)을 마크다운 헤더로 변환
4. **Header-based Chunking**: 마크다운 헤더를 기준으로 청크 분할
5. **Header-based Filtering**: 특정 헤더를 가진 청크 필터링
6. **Embedding with Clova**: Clova 임베딩 API를 사용한 벡터 생성 (RPM 고려)
7. **PostgreSQL pgvector Loading**: 임베딩 결과를 PostgreSQL pgvector에 적재

---
## Checkpoint 1: Chapter-based Markdown Conversion

PDF 파일에서 챕터 정보를 받아 각 챕터별로 페이지를 나누고 마크다운으로 변환합니다.

### 입력
- PDF 파일 경로
- 챕터별 시작 페이지 정보 (딕셔너리)

### 출력
- `chapter_markdowns`: 챕터별 마크다운 텍스트 (dict)

### 체크포인트 저장
- 변수: `checkpoint_1_chapter_markdowns`

### Step 1.1: 라이브러리 설치

필요한 라이브러리를 설치합니다.

In [1]:
!pip install pymupdf4llm pymupdf -q

### Step 1.2: 라이브러리 임포트

필요한 라이브러리를 임포트합니다.

In [2]:
import pymupdf4llm
import fitz  # PyMuPDF
import os

Consider using the pymupdf_layout package for a greatly improved page layout analysis.


### Step 1.3: PDF 파일 및 챕터 정보 설정

처리할 PDF 파일 경로와 각 챕터의 시작 페이지 번호를 정의합니다.

In [124]:
# PDF 파일 경로
pdf_path = "data/network.pdf"

# 챕터별 시작 페이지 번호 (PDF 페이지 번호)
chapter_start_pages = {
    "Chapter 1": 5,
    "Chapter 2": 47,
    "Chapter 3": 103,
    "Chapter 4": 179,
    "Chapter 5": 229,
    "Chapter 6": 287,
    "Chapter 7": 341,
    "Chapter 8": 373,
    "Chapter 9": 415,
}

# 챕터 정보 정렬 및 총 페이지 수 확인
sorted_chapters = sorted(chapter_start_pages.items(), key=lambda item: item[1])

doc = fitz.open(pdf_path)
total_pages = len(doc)
doc.close()

print(f"Target PDF: {pdf_path}")
print(f"Total Pages: {total_pages}")
print(f"Total Chapters: {len(sorted_chapters)}")
print(f"\nChapter Configuration:")
for chapter_name, start_page in sorted_chapters:
    print(f"  - {chapter_name}: Page {start_page}")

Target PDF: data/network.pdf
Total Pages: 482
Total Chapters: 9

Chapter Configuration:
  - Chapter 1: Page 5
  - Chapter 2: Page 47
  - Chapter 3: Page 103
  - Chapter 4: Page 179
  - Chapter 5: Page 229
  - Chapter 6: Page 287
  - Chapter 7: Page 341
  - Chapter 8: Page 373
  - Chapter 9: Page 415


### Step 1.4: 챕터별 마크다운 변환 실행

각 챕터의 페이지 범위를 계산하고 pymupdf4llm을 사용하여 마크다운으로 변환합니다.

In [125]:
chapter_markdowns = {}

print("=" * 60)
print("Starting Chapter-based Markdown Conversion")
print("=" * 60)

for i, (chapter_name, start_page) in enumerate(sorted_chapters):
    # 페이지 범위 계산 (0-indexed)
    start_idx = start_page - 1
    
    if i < len(sorted_chapters) - 1:
        # 다음 챕터 시작 전까지
        end_idx = sorted_chapters[i + 1][1] - 1
    else:
        # 마지막 챕터는 PDF 끝까지
        end_idx = total_pages
    
    # 페이지 범위 리스트 생성
    page_range = list(range(start_idx, end_idx))
    
    print(f"\n[{chapter_name}]")
    print(f"  Processing pages {start_idx + 1} to {end_idx} ({len(page_range)} pages)...")
    
    # pymupdf4llm을 사용하여 페이지별로 마크다운 변환
    chapter_md_text = pymupdf4llm.to_markdown(pdf_path, pages=page_range)
    
    # 챕터별 마크다운 저장
    chapter_markdowns[chapter_name] = chapter_md_text
    
    print(f"  ✓ Completed (Length: {len(chapter_md_text):,} characters)")

print("\n" + "=" * 60)
print(f"✓ All chapters converted successfully!")
print(f"  Total chapters: {len(chapter_markdowns)}")
print("=" * 60)

Starting Chapter-based Markdown Conversion

[Chapter 1]
  Processing pages 5 to 46 (42 pages)...
  ✓ Completed (Length: 119,014 characters)

[Chapter 2]
  Processing pages 47 to 102 (56 pages)...
  ✓ Completed (Length: 163,058 characters)

[Chapter 3]
  Processing pages 103 to 178 (76 pages)...
  ✓ Completed (Length: 221,119 characters)

[Chapter 4]
  Processing pages 179 to 228 (50 pages)...
  ✓ Completed (Length: 143,643 characters)

[Chapter 5]
  Processing pages 229 to 286 (58 pages)...
  ✓ Completed (Length: 175,491 characters)

[Chapter 6]
  Processing pages 287 to 340 (54 pages)...
  ✓ Completed (Length: 170,656 characters)

[Chapter 7]
  Processing pages 341 to 372 (32 pages)...
  ✓ Completed (Length: 96,598 characters)

[Chapter 8]
  Processing pages 373 to 414 (42 pages)...
  ✓ Completed (Length: 123,555 characters)

[Chapter 9]
  Processing pages 415 to 482 (68 pages)...
  ✓ Completed (Length: 205,808 characters)

✓ All chapters converted successfully!
  Total chapters: 9


### Step 1.5: 체크포인트 저장 및 결과 확인

변환된 마크다운을 체크포인트로 저장하고 샘플 결과를 확인합니다.

In [123]:
# 체크포인트 1 저장
checkpoint_1_chapter_markdowns = chapter_markdowns.copy()

print("✓ Checkpoint 1 saved successfully!")
print(f"  Variable: checkpoint_1_chapter_markdowns")
print(f"  Chapters: {list(checkpoint_1_chapter_markdowns.keys())}")
print(f"\n--- Sample: First 500 characters of Chapter 1 ---")
print(checkpoint_1_chapter_markdowns["Chapter 1"][:3000])
print("...")

✓ Checkpoint 1 saved successfully!
  Variable: checkpoint_1_chapter_markdowns
  Chapters: ['Chapter 1', 'Chapter 2', 'Chapter 3', 'Chapter 4', 'Chapter 5', 'Chapter 6', 'Chapter 7', 'Chapter 8', 'Chapter 9', 'Chapter 10', 'Chapter 11', 'Chapter 12', 'Chapter 13', 'Chapter 14', 'Chapter 15', 'Chapter 16']

--- Sample: First 500 characters of Chapter 1 ---
# Chapter 1 Before the Advent of Database Systems

**ADRIENNE WATT**


The way in which computers manage data has come a long way over the last few decades. Today’s users take for granted

the many benefits found in a database system. However, it wasn’t that long ago that computers relied on a much less

elegant and costly approach to data management called the file-based system.

### **File-based System**


One way to keep information on a computer is to store it in permanent files. A company system has a number of

application programs; each of them is designed to manipulate data files. These application programs have been written

a

---
## Checkpoint 2: Noise Removal

변환된 마크다운에서 불필요한 노이즈를 제거합니다.
- 전체 페이지: Header/Footer 영역의 페이지 번호, Copyright 문구 등 제거
- 챕터 첫 페이지: 추가로 중복되는 챕터 제목 등 제거

### 입력
- `checkpoint_1_chapter_markdowns`: 챕터별 원본 마크다운

### 출력
- `chapter_markdowns_cleaned`: 노이즈가 제거된 챕터별 마크다운 (dict)

### 체크포인트 저장
- 변수: `checkpoint_2_cleaned_markdowns`

### Step 2.1: 라이브러리 임포트

정규표현식 처리를 위한 라이브러리를 임포트합니다.

In [50]:
import re

### Step 2.2: 노이즈 제거 패턴 설정

In [60]:
# Header/Footer 확인 범위 설정 (각각 몇 줄씩 체크할지)
header_check_range = 5
footer_check_range = 5

# Header/Footer 무조건 삭제할 줄 수 (패턴 무관)
header_remove_lines = 0     # 상단 N줄을 무조건 삭제 (0 = 삭제 안함)
footer_remove_lines = 4     # 하단 N줄을 무조건 삭제 (0 = 삭제 안함)

# Header 패턴 (페이지 상단에서 제거할 패턴)
header_patterns = [
    r"Computer Networks: A Systems Approach, Release Version 6.1"
]

# Footer 패턴 (페이지 하단에서 제거할 패턴)
footer_patterns = []

# 챕터 첫 페이지 전용 패턴 (첫 페이지에서만 추가로 제거)
first_page_patterns = []

print("Noise Removal Configuration:")
print(f"  Header check range: {header_check_range} lines")
print(f"  Header unconditional removal: {header_remove_lines} lines")
print(f"  Footer check range: {footer_check_range} lines")
print(f"  Footer unconditional removal: {footer_remove_lines} lines")
print(f"  Header patterns: {len(header_patterns)} patterns")
print(f"  Footer patterns: {len(footer_patterns)} patterns")
print(f"  First page patterns: {len(first_page_patterns)} patterns")

Noise Removal Configuration:
  Header check range: 5 lines
  Header unconditional removal: 0 lines
  Footer check range: 5 lines
  Footer unconditional removal: 4 lines
  Header patterns: 1 patterns
  Footer patterns: 0 patterns
  First page patterns: 0 patterns


### Step 2.3: 노이즈 제거 함수 정의

In [61]:
def clean_page_markdown(md_text, is_first_page=False, show_removed=False, page_num=None):
    """
    페이지별 마크다운에서 노이즈를 제거합니다.
    
    Args:
        md_text (str): 원본 마크다운 텍스트
        is_first_page (bool): 챕터의 첫 페이지 여부
        show_removed (bool): 제거된 줄을 표시할지 여부
        page_num (int): 페이지 번호 (로깅용)
        
    Returns:
        str: 노이즈가 제거된 마크다운 텍스트
    """
    if not md_text:
        return ""
    
    lines = md_text.split('\n')
    lines_to_remove = set()
    removal_reasons = {}  # 제거 이유 저장
    
    # 1. 무조건 삭제할 줄 처리
    # Header: 상단 N줄 무조건 삭제
    if header_remove_lines > 0:
        for i in range(min(header_remove_lines, len(lines))):
            lines_to_remove.add(i)
            removal_reasons[i] = f"Header unconditional removal (top {header_remove_lines} lines)"
    
    # Footer: 하단 N줄 무조건 삭제
    if footer_remove_lines > 0:
        footer_start = max(0, len(lines) - footer_remove_lines)
        for i in range(footer_start, len(lines)):
            lines_to_remove.add(i)
            removal_reasons[i] = f"Footer unconditional removal (bottom {footer_remove_lines} lines)"
    
    # 2. Header 영역에서 패턴 매칭
    for i in range(min(header_check_range, len(lines))):
        if i in lines_to_remove:
            continue  # 이미 제거 대상이면 스킵
        
        line_stripped = lines[i].strip()
        
        # Header 패턴 체크
        for pattern in header_patterns:
            if re.search(pattern, line_stripped, re.IGNORECASE):
                lines_to_remove.add(i)
                removal_reasons[i] = f"Header pattern: {pattern}"
                break
    
    # 3. Footer 영역에서 패턴 매칭
    footer_start_idx = max(0, len(lines) - footer_check_range)
    for i in range(footer_start_idx, len(lines)):
        if i in lines_to_remove:
            continue  # 이미 제거 대상이면 스킵
        
        line_stripped = lines[i].strip()
        
        # Footer 패턴 체크
        for pattern in footer_patterns:
            if re.search(pattern, line_stripped, re.IGNORECASE):
                lines_to_remove.add(i)
                removal_reasons[i] = f"Footer pattern: {pattern}"
                break
    
    # 4. 챕터 첫 페이지인 경우 추가 패턴 체크
    if is_first_page:
        for i, line in enumerate(lines):
            if i in lines_to_remove:
                continue
            line_stripped = line.strip()
            for pattern in first_page_patterns:
                if re.search(pattern, line_stripped, re.IGNORECASE):
                    lines_to_remove.add(i)
                    removal_reasons[i] = f"First page pattern: {pattern}"
                    break
    
    # 제거된 줄 표시
    if show_removed and lines_to_remove:
        print(f"    [Page {page_num}] Removed {len(lines_to_remove)} line(s):")
        for i in sorted(lines_to_remove)[:10]:  # 처음 10개만 표시
            reason = removal_reasons.get(i, "Unknown")
            line_preview = lines[i][:60] + "..." if len(lines[i]) > 60 else lines[i]
            print(f"      Line {i}: '{line_preview}' ({reason})")
        if len(lines_to_remove) > 10:
            print(f"      ... and {len(lines_to_remove) - 10} more lines")
    
    # 제거 대상이 아닌 줄만 필터링
    cleaned_lines = [line for i, line in enumerate(lines) if i not in lines_to_remove]
    
    return '\n'.join(cleaned_lines)


print("✓ Noise removal function defined successfully!")
print("  - Header: Unconditional removal + pattern matching")
print("  - Footer: Unconditional removal + pattern matching")
print("  - First page: Pattern matching")

✓ Noise removal function defined successfully!
  - Header: Unconditional removal + pattern matching
  - Footer: Unconditional removal + pattern matching
  - First page: Pattern matching


### Step 2.4: 챕터별 노이즈 제거 실행

Checkpoint 1에서 변환된 마크다운에 노이즈 제거 함수를 적용합니다.
각 챕터는 여러 페이지로 구성되어 있으므로, 페이지 단위로 분리하여 처리합니다.

In [62]:
chapter_markdowns_cleaned = {}

print("=" * 60)
print("Starting Noise Removal")
print("=" * 60)

for i, (chapter_name, start_page) in enumerate(sorted_chapters):
    # 페이지 범위 계산 (0-indexed)
    start_idx = start_page - 1
    
    if i < len(sorted_chapters) - 1:
        end_idx = sorted_chapters[i + 1][1] - 1
    else:
        end_idx = total_pages
    
    print(f"\n[{chapter_name}]")
    print(f"  Processing pages {start_idx + 1} to {end_idx}...")
    
    # 페이지별로 변환하고 노이즈 제거 적용
    chapter_full_text = []
    
    for page_idx in range(start_idx, end_idx):
        # 첫 페이지 여부 확인
        is_first_page = (page_idx == start_idx)
        
        # 페이지를 마크다운으로 변환
        page_md = pymupdf4llm.to_markdown(pdf_path, pages=[page_idx])
        
        # 노이즈 제거 적용 (제거된 줄 표시)
        cleaned_md = clean_page_markdown(
            page_md, 
            is_first_page=is_first_page, 
            show_removed=True, 
            page_num=page_idx + 1
        )
        
        chapter_full_text.append(cleaned_md)
    
    # 챕터의 모든 페이지를 결합
    chapter_markdowns_cleaned[chapter_name] = "\n\n".join(chapter_full_text)
    
    original_length = len(checkpoint_1_chapter_markdowns.get(chapter_name, ""))
    cleaned_length = len(chapter_markdowns_cleaned[chapter_name])
    removed = original_length - cleaned_length
    
    print(f"  ✓ Completed")
    print(f"    Original: {original_length:,} chars")
    print(f"    Cleaned: {cleaned_length:,} chars")
    print(f"    Removed: {removed:,} chars")

print("\n" + "=" * 60)
print(f"✓ Noise removal completed for all chapters!")
print(f"  Total chapters: {len(chapter_markdowns_cleaned)}")
print("=" * 60)

Starting Noise Removal

[Chapter 1]
  Processing pages 5 to 46...
    [Page 5] Removed 4 line(s):
      Line 46: '**5**' (Footer unconditional removal (bottom 4 lines))
      Line 47: '' (Footer unconditional removal (bottom 4 lines))
      Line 48: '' (Footer unconditional removal (bottom 4 lines))
      Line 49: '' (Footer unconditional removal (bottom 4 lines))
    [Page 6] Removed 5 line(s):
      Line 0: '**Computer Networks: A Systems Approach, Release Version 6.1...' (Header pattern: Computer Networks: A Systems Approach, Release Version 6.1)
      Line 63: '**6** **Chapter 1. Foundation**' (Footer unconditional removal (bottom 4 lines))
      Line 64: '' (Footer unconditional removal (bottom 4 lines))
      Line 65: '' (Footer unconditional removal (bottom 4 lines))
      Line 66: '' (Footer unconditional removal (bottom 4 lines))
    [Page 7] Removed 5 line(s):
      Line 0: '**Computer Networks: A Systems Approach, Release Version 6.1...' (Header pattern: Computer Networks: A

### Step 2.5: 체크포인트 저장 및 결과 확인

노이즈가 제거된 마크다운을 체크포인트로 저장하고 샘플 결과를 확인합니다.

In [63]:
# 체크포인트 2 저장
checkpoint_2_cleaned_markdowns = chapter_markdowns_cleaned.copy()

print("✓ Checkpoint 2 saved successfully!")
print(f"  Variable: checkpoint_2_cleaned_markdowns")
print(f"  Chapters: {list(checkpoint_2_cleaned_markdowns.keys())}")
print(f"\n--- Sample: First 500 characters of cleaned Chapter 1 ---")
print(checkpoint_2_cleaned_markdowns["Chapter 9"][:1000])
print("...")

✓ Checkpoint 2 saved successfully!
  Variable: checkpoint_2_cleaned_markdowns
  Chapters: ['Chapter 1', 'Chapter 2', 'Chapter 3', 'Chapter 4', 'Chapter 5', 'Chapter 6', 'Chapter 7', 'Chapter 8', 'Chapter 9']

--- Sample: First 500 characters of cleaned Chapter 1 ---
**CHAPTER**

# **NINE** **APPLICATIONS**


Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the
beginning.


_—Winston Churchill_

# **Problem: Applications Need Their Own Protocols**


We started this book by talking about application programs—everything from web browsers to videoconferencing tools—that people want to run over computer networks. In the intervening chapters, we have
developed, one building block at a time, the networking infrastructure needed to make such applications
possible. We have now come full circle, back to network applications. These applications are part network
protocol (in the sense that they exchange messages with their peers on other machines) an

---
## Checkpoint 3: Markdown Header Labeling

특정 키워드(Exercises, Attribution, Key Terms 등)를 찾아서 마크다운 헤더로 변환합니다.

### 입력
- `checkpoint_2_cleaned_markdowns`: 정제된 챕터별 마크다운
- `header_labels`: 변환할 키워드와 헤더 레벨 매핑 (dict)

### 출력
- `chapter_markdowns_labeled`: 헤더가 라벨링된 챕터별 마크다운 (dict)

### 체크포인트 저장
- 변수: `checkpoint_3_labeled_markdowns`

### Step 3.1: 헤더 라벨링 설정

특정 키워드를 마크다운 헤더로 변환하기 위한 설정을 정의합니다.

In [64]:
# 헤더로 변환할 키워드와 헤더 레벨 매핑
# 키: 찾을 키워드, 값: 헤더 레벨 (2 = ##)
header_labels = {
    "Key Takeaway": 1
}

print("Header Labeling Configuration:")
print(f"  Total keywords: {len(header_labels)}")
print(f"\nKeywords to convert:")
for keyword, level in header_labels.items():
    header_symbol = "#" * level
    print(f"  - '{keyword}' → '{header_symbol} {keyword}'")

Header Labeling Configuration:
  Total keywords: 1

Keywords to convert:
  - 'Key Takeaway' → '# Key Takeaway'


### Step 3.2: 헤더 라벨링 함수 정의

키워드를 찾아서 마크다운 헤더로 변환하는 함수를 정의합니다.

In [65]:
import re

def label_headers(text, label_map, show_conversions=False):
    if not text:
        return ""
    
    lines = text.split('\n')
    new_lines = []
    conversions = []
    
    for line_num, line in enumerate(lines):
        line_stripped = line.strip()
        matched = False
        
        # 1. 마크다운 특수 기호들 제거 (앞뒤에 붙는 것들 위주)
        # # (헤더), * (강조/불릿), _ (이탤릭), ~ (취소선), ` (코드), > (인용), - (리스트)
        # strip()을 사용하여 문장 시작과 끝의 기호들을 먼저 제거합니다.
        cleaned_line = line_stripped.strip("#* _~`>-")
        
        # 2. HTML 태그가 섞여 있을 경우 제거 (예: <b>Exercises</b>)
        cleaned_line = re.sub(r'<[^>]*>', '', cleaned_line).strip()
        
        # 각 키워드에 대해 매칭 시도
        for keyword, level in label_map.items():
            # 대소문자 구분 없이 비교 (양쪽 모두 공백 제거 상태)
            if cleaned_line.lower() == keyword.lower():
                prefix = "#" * level
                new_line = f"{prefix} {keyword}"
                new_lines.append(new_line)
                matched = True
                
                if show_conversions:
                    conversions.append({
                        'line_num': line_num,
                        'original': line_stripped,
                        'converted': new_line,
                        'keyword': keyword
                    })
                break
        
        if not matched:
            new_lines.append(line)
    
    if show_conversions and conversions:
        print(f"    Found {len(conversions)} header(s) to label:")
        for conv in conversions:
            print(f"      Line {conv['line_num']}: '{conv['original']}' → '{conv['converted']}'")
    
    return '\n'.join(new_lines)

### Step 3.3: 챕터별 헤더 라벨링 실행

Checkpoint 2에서 정제된 마크다운에 헤더 라벨링을 적용합니다.

In [66]:
chapter_markdowns_labeled = {}

print("=" * 60)
print("Starting Header Labeling")
print("=" * 60)

total_conversions = 0

for chapter_name, text in checkpoint_2_cleaned_markdowns.items():
    print(f"\n[{chapter_name}]")
    print(f"  Processing header labeling...")
    
    # 헤더 라벨링 적용 (변환 내용 표시)
    labeled_text = label_headers(text, header_labels, show_conversions=True)
    
    # 변환된 텍스트 저장
    chapter_markdowns_labeled[chapter_name] = labeled_text
    
    # 변환 횟수 계산 (간단한 방법: 변환 전후 길이 차이)
    original_lines = text.count('\n')
    labeled_lines = labeled_text.count('\n')
    
    print(f"  ✓ Completed")

print("\n" + "=" * 60)
print(f"✓ Header labeling completed for all chapters!")
print(f"  Total chapters: {len(chapter_markdowns_labeled)}")
print("=" * 60)

Starting Header Labeling

[Chapter 1]
  Processing header labeling...
    Found 4 header(s) to label:
      Line 377: '**Key Takeaway**' → '# Key Takeaway'
      Line 510: '**Key Takeaway**' → '# Key Takeaway'
      Line 701: '**Key Takeaway**' → '# Key Takeaway'
      Line 1115: '**Key Takeaway**' → '# Key Takeaway'
  ✓ Completed

[Chapter 2]
  Processing header labeling...
    Found 2 header(s) to label:
      Line 1076: '**Key Takeaway**' → '# Key Takeaway'
      Line 1587: '**Key Takeaway**' → '# Key Takeaway'
  ✓ Completed

[Chapter 3]
  Processing header labeling...
    Found 8 header(s) to label:
      Line 1732: '**Key Takeaway**' → '# Key Takeaway'
      Line 2091: '**Key Takeaway**' → '# Key Takeaway'
      Line 2206: '**Key Takeaway**' → '# Key Takeaway'
      Line 2391: '**Key Takeaway**' → '# Key Takeaway'
      Line 3132: '**Key Takeaway**' → '# Key Takeaway'
      Line 3374: '**Key Takeaway**' → '# Key Takeaway'
      Line 3482: '**Key Takeaway**' → '# Key Takeaway'
    

### Step 3.4: 체크포인트 저장 및 결과 확인

헤더가 라벨링된 마크다운을 체크포인트로 저장하고 샘플 결과를 확인합니다.

In [67]:
# 체크포인트 3 저장
checkpoint_3_labeled_markdowns = chapter_markdowns_labeled.copy()

print("✓ Checkpoint 3 saved successfully!")
print(f"  Variable: checkpoint_3_labeled_markdowns")
print(f"  Chapters: {list(checkpoint_3_labeled_markdowns.keys())}")

# 변환 예시 찾기 (Key Terms가 포함된 첫 번째 챕터)
sample_chapter = None
for chapter_name, text in checkpoint_3_labeled_markdowns.items():
    if "## Key Terms" in text or "## Exercises" in text:
        sample_chapter = chapter_name
        break

if sample_chapter:
    print(f"\n--- Sample: {sample_chapter} (showing Key Terms/Exercises section) ---")
    text = checkpoint_3_labeled_markdowns[sample_chapter]
    
    # Key Terms나 Exercises가 포함된 부분 찾기
    lines = text.split('\n')
    for i, line in enumerate(lines):
        if line.startswith("## Key Terms") or line.startswith("## Exercises"):
            # 해당 라인과 다음 10줄 출력
            sample_text = '\n'.join(lines[i:min(i+10, len(lines))])
            print(sample_text)
            print("...")
            break
else:
    print("\n--- Sample: First 500 characters of Chapter 16 ---")
    print(checkpoint_3_labeled_markdowns["Chapter 1"])
    print("...")

✓ Checkpoint 3 saved successfully!
  Variable: checkpoint_3_labeled_markdowns
  Chapters: ['Chapter 1', 'Chapter 2', 'Chapter 3', 'Chapter 4', 'Chapter 5', 'Chapter 6', 'Chapter 7', 'Chapter 8', 'Chapter 9']

--- Sample: First 500 characters of Chapter 16 ---
**CHAPTER**

# **ONE** **FOUNDATION**


I must create a System, or be enslav’d by another Man’s; I will not Reason and Compare: my
business is to Create.


_—William Blake_

# **Problem: Building a Network**


Suppose you want to build a computer network, one that has the potential to grow to global proportions and
to support applications as diverse as teleconferencing, video on demand, electronic commerce, distributed
computing, and digital libraries. What available technologies would serve as the underlying building blocks,
and what kind of software architecture would you design to integrate these building blocks into an effective communication service? Answering this question is the overriding goal of this book—to describe the


---
## Checkpoint 4: Header-based Chunking

LangChain의 MarkdownHeaderTextSplitter를 사용하여 마크다운 헤더를 기준으로 텍스트를 청크로 분할합니다.

### 입력
- `checkpoint_3_labeled_markdowns`: 헤더가 라벨링된 챕터별 마크다운
- `headers_to_split_on`: 분할 기준이 되는 헤더 레벨 (list)

### 출력
- `chapter_chunks`: 챕터별 청크 리스트 (dict)

### 체크포인트 저장
- 변수: `checkpoint_4_chunks`

### Step 4.1: 라이브러리 설치

LangChain의 MarkdownHeaderTextSplitter를 사용하기 위한 라이브러리를 설치합니다.

In [81]:
!pip install langchain-text-splitters -q

### Step 4.2: MarkdownHeaderTextSplitter 및 RecursiveCharacterTextSplitter 설정분할 기준 설정

In [93]:
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

# 분할 기준이 되는 헤더 레벨 설정
# (헤더 심볼, 메타데이터 키)
headers_to_split_on = [
    ("#", "Header 1"),      # # Chapter 1
    ("##", "Header 2"),     # ## Key Terms, ## Exercises
    ("###", "Header 3"),    # ### Sub-sections
    ("####", "Header 4"),   # #### Sub-sections
    ("#####", "Header 5"),  # ##### Sub-sections
]

# MarkdownHeaderTextSplitter 생성
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False  # 헤더를 청크 텍스트에 포함
)

# RecursiveCharacterTextSplitter 생성 (큰 청크 재분할용)
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1600,      # ~400 tokens (영문 기준)
    chunk_overlap=200,    # ~50 tokens (12.5% overlap)
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

print("MarkdownHeaderTextSplitter Configuration:")
print(f"  Headers to split on:")
for header_symbol, metadata_key in headers_to_split_on:
    print(f"    - '{header_symbol}' → metadata key: '{metadata_key}'")

print(f"\nRecursiveCharacterTextSplitter Configuration:")
print(f"  Chunk size: 1600 characters (~400 tokens)")
print(f"  Chunk overlap: 200 characters (~50 tokens)")
print(f"  Applied to chunks > 700 characters")

MarkdownHeaderTextSplitter Configuration:
  Headers to split on:
    - '#' → metadata key: 'Header 1'
    - '##' → metadata key: 'Header 2'
    - '###' → metadata key: 'Header 3'
    - '####' → metadata key: 'Header 4'
    - '#####' → metadata key: 'Header 5'

RecursiveCharacterTextSplitter Configuration:
  Chunk size: 1600 characters (~400 tokens)
  Chunk overlap: 200 characters (~50 tokens)
  Applied to chunks > 700 characters


### Step 4.3: 청크 분할 수행

In [94]:
import re
from langchain_core.documents import Document

def split_large_chunk(chunk, threshold=700):
    """
    700자를 초과하는 청크를 RecursiveCharacterTextSplitter로 재분할하고
    각 서브청크에 원래 헤더 정보를 보존하며, 끊긴 문장을 병합합니다.
    """
    if len(chunk.page_content) <= threshold:
        return [chunk]

    # 1. 메타데이터에서 헤더 정보(H1~H5) 추출
    headers = []
    for i in range(1, 6):
        header_key = f"Header {i}"
        if header_key in chunk.metadata:
            header_symbols = "#" * i
            headers.append(f"{header_symbols} {chunk.metadata[header_key]}")
    
    header_text = "\n".join(headers) if headers else ""

    # 2. 본문만 추출 (분할을 위해 일시적으로 헤더 제거)
    content = chunk.page_content
    for header in headers:
        content = content.replace(header, "").strip()

    if not content:
        return [chunk]

    # 3. 스마트 문단 감지 전처리
    # 규칙 1: 공백 3개 이상 → 문단 구분(\n\n)
    content = re.sub(r' {3,}', '\n\n', content)

    # 규칙 2: 줄 시작 위치의 대문자만 새로운 문단으로 인식
    content = re.sub(r'\n([A-Z][a-z]{3,})', r'\n\n\1', content)

    # 4. RecursiveCharacterTextSplitter로 본문 재분할
    # (주의: 외부에서 정의된 recursive_splitter 객체를 사용합니다)
    sub_texts = recursive_splitter.split_text(content)

    # 5. 후처리: 소문자로 시작하는 청크를 이전 청크와 병합 (문장 잘림 방지)
    merged_texts = []
    for text in sub_texts:
        text_stripped = text.strip()
        if not text_stripped:
            continue
        
        # 첫 글자가 소문자이고 이전 청크가 있으면 병합
        if text_stripped[0].islower() and merged_texts:
            merged_texts[-1] = merged_texts[-1] + " " + text_stripped
        else:
            merged_texts.append(text_stripped)

    # 6. 각 서브텍스트에 헤더를 재결합하여 Document 생성
    sub_chunks = []
    for sub_text in merged_texts:
        full_content = f"{header_text}\n\n{sub_text}" if header_text else sub_text
        
        sub_chunk = Document(
            page_content=full_content,
            metadata=chunk.metadata.copy()
        )
        sub_chunks.append(sub_chunk)
        
    return sub_chunks

# --- 메인 실행 로직 ---
chapter_chunks = {}

print("=" * 60)
print("Starting Header-based Chunking with Smart Paragraph Detection")
print("=" * 60)

total_chunks = 0
chunks_requiring_split = 0

for chapter_name, text in checkpoint_3_labeled_markdowns.items():
    print(f"\n[{chapter_name}]")
    print(f"  Text length: {len(text):,}")
    print(f"  Splitting by headers...")
    
    # Step 1: MarkdownHeaderTextSplitter로 1차 분할
    initial_chunks = markdown_splitter.split_text(text)
    print(f"  ✓ Initial split: {len(initial_chunks)} chunk(s)")
    
    # Step 2: 크기 임계값 초과 시 스마트 재분할 적용
    final_chunks = []
    chapter_split_count = 0
    
    for chunk in initial_chunks:
        chunk_size = len(chunk.page_content)
        
        if chunk_size > 700:
            sub_chunks = split_large_chunk(chunk, threshold=700)
            final_chunks.extend(sub_chunks)
            
            if len(sub_chunks) > 1:
                chunks_requiring_split += 1
                chapter_split_count += 1
                print(f"    → Chunk ({chunk_size} chars) split into {len(sub_chunks)} sub-chunks")
        else:
            final_chunks.append(chunk)
            
    chapter_chunks[chapter_name] = final_chunks
    total_chunks += len(final_chunks)
    
    print(f"  ✓ Final chunks: {len(final_chunks)} (split applied: {chapter_split_count})")
    
    # 샘플 출력 (첫 번째 청크)
    if final_chunks:
        first = final_chunks[0]
        print(f"    First chunk size: {len(first.page_content)} chars")
        print(f"    First chunk preview: {first.page_content[:100].replace(chr(10), ' ')}...")

print("\n" + "=" * 60)
print(f"✓ Chunking completed!")
print(f"  Total chapters: {len(chapter_chunks)}")
print(f"  Final total chunks: {total_chunks}")
print("=" * 60)

Starting Header-based Chunking with Smart Paragraph Detection

[Chapter 1]
  Text length: 115,121
  Splitting by headers...
  ✓ Initial split: 11 chunk(s)
    → Chunk (3683 chars) split into 3 sub-chunks
    → Chunk (8158 chars) split into 7 sub-chunks
    → Chunk (11225 chars) split into 7 sub-chunks
    → Chunk (24346 chars) split into 16 sub-chunks
    → Chunk (24507 chars) split into 19 sub-chunks
    → Chunk (14840 chars) split into 10 sub-chunks
    → Chunk (22027 chars) split into 17 sub-chunks
    → Chunk (5567 chars) split into 4 sub-chunks
  ✓ Final chunks: 86 (split applied: 8)
    First chunk size: 11 chars
    First chunk preview: **CHAPTER**...

[Chapter 2]
  Text length: 157,636
  Splitting by headers...
  ✓ Initial split: 13 chunk(s)
    → Chunk (3197 chars) split into 3 sub-chunks
    → Chunk (9457 chars) split into 8 sub-chunks
    → Chunk (8681 chars) split into 6 sub-chunks
    → Chunk (16471 chars) split into 12 sub-chunks
    → Chunk (20680 chars) split into 14 su

### Step 4.4: 체크포인트 저장 및 결과 확인

분할된 청크를 체크포인트로 저장하고 샘플 결과를 확인합니다.

In [97]:
# 체크포인트 4 저장
checkpoint_4_chunks = chapter_chunks.copy()

print("✓ Checkpoint 4 saved successfully!")
print(f"  Variable: checkpoint_4_chunks")
print(f"  Chapters: {list(checkpoint_4_chunks.keys())}")

# 샘플 청크 확인
sample_chapter = "Chapter 2"
if sample_chapter in checkpoint_4_chunks and checkpoint_4_chunks[sample_chapter]:
    print(f"\n--- Sample: {sample_chapter} chunks ---")
    print(f"Total chunks in {sample_chapter}: {len(checkpoint_4_chunks[sample_chapter])}")
    
    # 처음 10개 청크의 메타데이터와 내용 미리보기
    for i, chunk in enumerate(checkpoint_4_chunks[sample_chapter][:10]):
        print(f"\n[Chunk {i+1}]")
        print(f"  Metadata: {chunk.metadata}")
        print(f"  Content length: {len(chunk.page_content)} characters")
        print(f"  Content preview:")
        preview = chunk.page_content.replace('\n', ' ')
        print(f"    {preview}...")
else:
    print(f"\n--- No chunks found for {sample_chapter} ---")

✓ Checkpoint 4 saved successfully!
  Variable: checkpoint_4_chunks
  Chapters: ['Chapter 1', 'Chapter 2', 'Chapter 3', 'Chapter 4', 'Chapter 5', 'Chapter 6', 'Chapter 7', 'Chapter 8', 'Chapter 9']

--- Sample: Chapter 2 chunks ---
Total chunks in Chapter 2: 119

[Chunk 1]
  Metadata: {}
  Content length: 11 characters
  Content preview:
    **CHAPTER**...

[Chunk 2]
  Metadata: {'Header 1': '**TWO** **DIRECT LINKS**'}
  Content length: 154 characters
  Content preview:
    # **TWO** **DIRECT LINKS**   It is a mistake to look too far ahead. Only one link in the chain of destiny can be handled at a time.   _—Winston Churchill_...

[Chunk 3]
  Metadata: {'Header 1': '**Problem: Connecting to a Network**'}
  Content length: 1467 characters
  Content preview:
    # **Problem: Connecting to a Network**  In Chapter 1 we saw that networks consist of links interconnecting nodes. One of the fundamental problems we face is how to connect two nodes together. We also introduced the “cloud” abstract

---
## Checkpoint 5: Header-based Filtering

특정 헤더를 가진 청크를 필터링하여 제거합니다.
(예: Attribution, Exercises 등 학습에 불필요한 섹션)

### 입력
- `checkpoint_4_chunks`: 분할된 챕터별 청크
- `filter_headers`: 제거할 헤더 키워드 리스트 (list)

### 출력
- `chapter_chunks_filtered`: 필터링된 챕터별 청크 (dict)

### 체크포인트 저장
- 변수: `checkpoint_5_filtered_chunks`

### Step 5.1: 필터링 설정학습에 불필요한 섹션을 제거하기 위한 설정을 정의

In [99]:
# 1. 제거할 헤더 키워드 설정
# 해당 헤더를 포함하는 청크는 데이터셋에서 제외됩니다.
filter_headers = header_labels  # 이전에 정의된 header_labels 리스트를 사용

# 2. 각 챕터에서 강제로 제거할 초기 청크 수
# 0 = 제거 안 함, N = 챕터 시작 부분의 처음 N개 청크를 무조건 제거
# 보통 서론이나 챕터 요약 등 반복되는 앞부분을 걷어낼 때 사용합니다.
skip_first_chunks = 2  

# --- 설정 정보 출력 ---
print("=" * 60)
print("Header-based Filtering Configuration")
print("=" * 60)
print(f"  Total filter keywords: {len(filter_headers)}")
print(f"  Skip first N chunks per chapter: {skip_first_chunks}")

print(f"\n[Chunks with these headers will be removed]")
for keyword in filter_headers:
    print(f"  - {keyword}")

if skip_first_chunks > 0:
    print(f"\n⚠ Note: First {skip_first_chunks} chunk(s) from each chapter will be removed regardless of headers.")
print("=" * 60)

Header-based Filtering Configuration
  Total filter keywords: 1
  Skip first N chunks per chapter: 2

[Chunks with these headers will be removed]
  - Key Takeaway

⚠ Note: First 2 chunk(s) from each chapter will be removed regardless of headers.


### Step 5.2: 필터링 함수 정의

청크의 메타데이터를 확인하여 특정 헤더를 가진 청크를 필터링하는 함수를 정의합니다.

In [100]:
def should_filter_chunk(chunk, filter_keywords):
    """
    청크의 메타데이터 내 모든 헤더(Header 1, Header 2, ...)를 검사하여 
    필터링 대상인지 확인합니다.
    
    Args:
        chunk: LangChain Document 객체
        filter_keywords: 필터링할 키워드 리스트 (예: ["Exercises", "References"])
        
    Returns:
        bool: 헤더 중 하나라도 키워드와 일치하면 True
    """
    metadata = chunk.metadata
    
    # 비교 효율을 위해 필터링 키워드를 소문자 세트(set)로 변환 (O(1) 조회를 위함)
    filter_keywords_lower = {k.lower() for k in filter_keywords}
    
    # 메타데이터의 모든 키-값 쌍을 순회
    for key, value in metadata.items():
        # 1. 키가 "Header"로 시작하는지 확인 (Header 1, Header 2, ... 모두 포함)
        if key.startswith("Header"):
            # 2. 값이 문자열인 경우, 해당 값이 필터링 키워드에 포함되는지 확인
            if isinstance(value, str) and value.lower() in filter_keywords_lower:
                return True
                
    return False

print("✓ Extensible filtering function defined successfully!")

✓ Extensible filtering function defined successfully!


### Step 5.3: 챕터별 청크 필터링 실행Checkpoint 4에서 분할된 청크에 2단계 필터링을 적용합니다:1. **위치 기반 제거**: 각 챕터의 처음 N개 청크 제거2. **키워드 기반 제거**: 특정 헤더를 포함하는 청크 제거**처리 순서:**- Step 1: 각 챕터에서 처음 N개 청크를 건너뜀 (skip_first_chunks)- Step 2: 남은 청크 중 헤더 키워드 매칭되는 청크 제거

In [101]:
chapter_chunks_filtered = {}

print("=" * 60)
print("Starting Header-based Filtering")
print("=" * 60)

total_original_chunks = 0
total_filtered_chunks = 0
total_removed_chunks = 0
total_skipped_chunks = 0

for chapter_name, chunks in checkpoint_4_chunks.items():
    print(f"\n[{chapter_name}]")
    print(f"  Original chunks: {len(chunks)}")
    
    # --- Step 1: 위치 기반 제거 (처음 N개 청크) ---
    skipped_chunks = []
    remaining_chunks = chunks
    
    if skip_first_chunks > 0 and len(chunks) > skip_first_chunks:
        skipped_chunks = chunks[:skip_first_chunks]
        remaining_chunks = chunks[skip_first_chunks:]
        print(f"  ⚠ Skipped first {len(skipped_chunks)} chunk(s)")
        total_skipped_chunks += len(skipped_chunks)
    
    # --- Step 2: 헤더 키워드 기반 필터링 ---
    filtered_chunks = []
    removed_chunks = []
    
    for chunk in remaining_chunks:
        # should_filter_chunk 함수는 사전에 정의되어 있어야 합니다.
        if should_filter_chunk(chunk, filter_headers):
            removed_chunks.append(chunk)
        else:
            filtered_chunks.append(chunk)
    
    # 챕터별 필터링된 결과 저장
    chapter_chunks_filtered[chapter_name] = filtered_chunks
    
    total_original_chunks += len(chunks)
    total_filtered_chunks += len(filtered_chunks)
    total_removed_chunks += len(removed_chunks)
    
    print(f"  Filtered chunks: {len(filtered_chunks)}")
    print(f"  Removed by keyword: {len(removed_chunks)}")
    
    # 제거된 청크 상세 정보 표시 (디버깅용)
    if removed_chunks:
        print(f"  Removed chunk details:")
        for i, chunk in enumerate(removed_chunks):
            metadata_str = ", ".join([f"{k}: {v}" for k, v in chunk.metadata.items()])
            print(f"    [{i+1}] {metadata_str}")

# --- 최종 결과 요약 ---
print("\n" + "=" * 60)
print(f"✓ Filtering completed for all chapters!")
print(f"  Total chapters: {len(chapter_chunks_filtered)}")
print(f"  Original chunks: {total_original_chunks}")
print(f"  Skipped by position: {total_skipped_chunks}")
print(f"  Removed by keyword: {total_removed_chunks}")
print(f"  Final filtered chunks: {total_filtered_chunks}")
if total_original_chunks > 0:
    retention_rate = (total_filtered_chunks / total_original_chunks) * 100
    print(f"  Retention rate: {retention_rate:.1f}%")
print("=" * 60)

Starting Header-based Filtering

[Chapter 1]
  Original chunks: 86
  ⚠ Skipped first 2 chunk(s)
  Filtered chunks: 67
  Removed by keyword: 17
  Removed chunk details:
    [1] Header 1: Key Takeaway
    [2] Header 1: Key Takeaway
    [3] Header 1: Key Takeaway
    [4] Header 1: Key Takeaway
    [5] Header 1: Key Takeaway
    [6] Header 1: Key Takeaway
    [7] Header 1: Key Takeaway
    [8] Header 1: Key Takeaway
    [9] Header 1: Key Takeaway
    [10] Header 1: Key Takeaway
    [11] Header 1: Key Takeaway
    [12] Header 1: Key Takeaway
    [13] Header 1: Key Takeaway
    [14] Header 1: Key Takeaway
    [15] Header 1: Key Takeaway
    [16] Header 1: Key Takeaway
    [17] Header 1: Key Takeaway

[Chapter 2]
  Original chunks: 119
  ⚠ Skipped first 2 chunk(s)
  Filtered chunks: 102
  Removed by keyword: 15
  Removed chunk details:
    [1] Header 1: Key Takeaway
    [2] Header 1: Key Takeaway
    [3] Header 1: Key Takeaway
    [4] Header 1: Key Takeaway
    [5] Header 1: Key Takeaway
    

### Step 5.4: 체크포인트 저장 및 결과 확인

필터링된 청크를 체크포인트로 저장하고 샘플 결과를 확인합니다.

In [102]:
# 체크포인트 5 저장
checkpoint_5_filtered_chunks = chapter_chunks_filtered.copy()

print("✓ Checkpoint 5 saved successfully!")
print(f"  Variable: checkpoint_5_filtered_chunks")
print(f"  Chapters: {list(checkpoint_5_filtered_chunks.keys())}")

# 샘플 청크 확인
sample_chapter = "Chapter 1"
if sample_chapter in checkpoint_5_filtered_chunks and checkpoint_5_filtered_chunks[sample_chapter]:
    print(f"\n--- Sample: {sample_chapter} filtered chunks ---")
    print(f"Total chunks in {sample_chapter}: {len(checkpoint_5_filtered_chunks[sample_chapter])}")
    
    # 처음 3개 청크의 메타데이터와 내용 미리보기
    for i, chunk in enumerate(checkpoint_5_filtered_chunks[sample_chapter][:3]):
        print(f"\n[Chunk {i+1}]")
        print(f"  Metadata: {chunk.metadata}")
        print(f"  Content length: {len(chunk.page_content)} characters")
        print(f"  Content preview:")
        preview = chunk.page_content[:200].replace('\n', ' ')
        print(f"    {preview}...")
else:
    print(f"\n--- No chunks found for {sample_chapter} ---")

# 필터링 전후 비교
print(f"\n--- Before/After Comparison for {sample_chapter} ---")
if sample_chapter in checkpoint_4_chunks:
    before_count = len(checkpoint_4_chunks[sample_chapter])
    after_count = len(checkpoint_5_filtered_chunks[sample_chapter])
    removed_count = before_count - after_count
    print(f"  Before filtering: {before_count} chunks")
    print(f"  After filtering: {after_count} chunks")
    print(f"  Removed: {removed_count} chunks ({removed_count / before_count * 100:.1f}%)")

✓ Checkpoint 5 saved successfully!
  Variable: checkpoint_5_filtered_chunks
  Chapters: ['Chapter 1', 'Chapter 2', 'Chapter 3', 'Chapter 4', 'Chapter 5', 'Chapter 6', 'Chapter 7', 'Chapter 8', 'Chapter 9']

--- Sample: Chapter 1 filtered chunks ---
Total chunks in Chapter 1: 67

[Chunk 1]
  Metadata: {'Header 1': '**Problem: Building a Network**'}
  Content length: 1282 characters
  Content preview:
    # **Problem: Building a Network**  Suppose you want to build a computer network, one that has the potential to grow to global proportions and to support applications as diverse as teleconferencing, vi...

[Chunk 2]
  Metadata: {'Header 1': '**Problem: Building a Network**'}
  Content length: 810 characters
  Content preview:
    # **Problem: Building a Network**  What distinguishes a computer network from these other types of networks? Probably the most important characteristic of a computer network is its generality. Compute...

[Chunk 3]
  Metadata: {'Header 1': '**Problem: Building a

---
## Checkpoint 6: Embedding with Clova

각 청크에 대해 Naver Clova 임베딩 API를 호출하여 벡터를 생성합니다.
QPM(Queries Per Minute) 제한을 고려하여 rate limiting을 적용합니다.
청크 1개당 API 요청 1개를 보냅니다.

### 입력
- : 필터링된 챕터별 청크
- Clova API 설정 (API Key, Endpoint)

### 출력
- : 청크별 임베딩 벡터와 메타데이터 (list of dict)

### 체크포인트 저장
- 변수: 

### Step 6.1: Clova API 설정

Naver Clova Studio Embedding v2 API를 사용하기 위한 설정을 정의합니다.

In [104]:
import os
import requests
import time
from typing import List, Dict, Any

# Clova Studio Embedding v2 API 설정
# 실제 사용 시 환경 변수나 별도 설정 파일에서 로드하는 것을 권장합니다
CLOVA_API_ENDPOINT = "https://clovastudio.stream.ntruss.com/v1/api-tools/embedding/v2/"  # 예: https://clovastudio.apigw.ntruss.com/testapp/v1/api-tools/embedding/v2/...
CLOVA_API_KEY = "nv-0dae30139f894cf094b6b97aec7219313z2D"  # X-NCP-CLOVASTUDIO-API-KEY
CLOVA_APIGW_API_KEY = "YOUR_APIGW_API_KEY"  # X-NCP-APIGW-API-KEY (필요한 경우)

# Rate Limiting 설정
QPM_LIMIT = 60  # 분당 쿼리 제한 (Queries Per Minute)

# 환경 변수에서 로드 (선택사항)
# CLOVA_API_ENDPOINT = os.getenv("CLOVA_API_ENDPOINT", CLOVA_API_ENDPOINT)
# CLOVA_API_KEY = os.getenv("CLOVA_API_KEY", CLOVA_API_KEY)
# CLOVA_APIGW_API_KEY = os.getenv("CLOVA_APIGW_API_KEY", CLOVA_APIGW_API_KEY)

print("Clova Studio Embedding v2 API Configuration:")
print(f"  Endpoint: {CLOVA_API_ENDPOINT[:50]}..." if len(CLOVA_API_ENDPOINT) > 50 else f"  Endpoint: {CLOVA_API_ENDPOINT}")
print(f"  QPM Limit: {QPM_LIMIT}")
print(f"  Batch Size: {BATCH_SIZE}")
print(f"  API Key configured: {'Yes' if CLOVA_API_KEY != 'YOUR_API_KEY' else 'No (Please configure)'}")

Clova Studio Embedding v2 API Configuration:
  Endpoint: https://clovastudio.stream.ntruss.com/v1/api-tools...
  QPM Limit: 60
  Batch Size: 25
  API Key configured: Yes


### Step 6.2: Rate Limiter 구현

QPM(분당 쿼리 제한)을 고려한 Rate Limiter를 구현합니다.
청크 1개당 API 요청 1개를 보내므로, QPM 60 = 분당 60개 청크 처리

In [105]:
from collections import deque
from datetime import datetime, timedelta

class RateLimiter:
    """
    RPM(분당 요청 제한)을 관리하는 Rate Limiter
    """
    def __init__(self, qpm_limit: int):
        self.qpm_limit = qpm_limit
        self.request_times = deque()
    
    def wait_if_needed(self):
        """
        필요한 경우 대기하여 RPM 제한을 준수합니다.
        """
        now = datetime.now()
        one_minute_ago = now - timedelta(minutes=1)
        
        # 1분 이내의 요청만 유지
        while self.request_times and self.request_times[0] < one_minute_ago:
            self.request_times.popleft()
        
        # RPM 제한에 도달한 경우 대기
        if len(self.request_times) >= self.qpm_limit:
            # 가장 오래된 요청이 1분이 지날 때까지 대기
            oldest_request = self.request_times[0]
            wait_until = oldest_request + timedelta(minutes=1)
            wait_seconds = (wait_until - now).total_seconds()
            
            if wait_seconds > 0:
                print(f"    Rate limit reached. Waiting {wait_seconds:.1f} seconds...")
                time.sleep(wait_seconds + 0.1)  # 약간의 버퍼 추가
                
                # 대기 후 다시 정리
                now = datetime.now()
                one_minute_ago = now - timedelta(minutes=1)
                while self.request_times and self.request_times[0] < one_minute_ago:
                    self.request_times.popleft()
        
        # 현재 요청 시간 기록
        self.request_times.append(now)

# Rate Limiter 초기화
rate_limiter = RateLimiter(QPM_LIMIT)

print("✓ Rate Limiter initialized successfully!")
print(f"  QPM Limit: {QPM_LIMIT}")

✓ Rate Limiter initialized successfully!
  QPM Limit: 60


### Step 6.3: Clova Embedding 함수 정의

Clova Studio API를 호출하여 단일 텍스트의 임베딩을 생성하는 함수를 정의합니다.

In [106]:
import uuid

def get_clova_embedding(text: str, retry_count: int = 3) -> List[float]:
    """
    Clova Studio Embedding v2 API를 호출하여 단일 텍스트의 임베딩을 생성합니다.
    
    Args:
        text: 임베딩할 텍스트
        retry_count: 실패 시 재시도 횟수
        
    Returns:
        임베딩 벡터 (리스트)
    """
    headers = {
        "Authorization": f"Bearer {CLOVA_API_KEY}",
        "Content-Type": "application/json"
    }
    
    # API 요청 body 구성
    request_body = {
        "text": text
    }
    
    for attempt in range(retry_count):
        try:
            # Rate limiting 적용
            rate_limiter.wait_if_needed()
            
            # API 호출
            response = requests.post(
                CLOVA_API_ENDPOINT,
                headers=headers,
                json=request_body,
                timeout=30
            )
            
            # 응답 확인
            if response.status_code == 200:
                result = response.json()
                # 응답에서 임베딩 추출: result["result"]["embedding"]
                embedding = result.get("result", {}).get("embedding", [])
                
                if not embedding:
                    raise Exception("No embedding in response")
                
                return embedding
            else:
                print(f"    API Error (attempt {attempt + 1}/{retry_count}): {response.status_code} - {response.text[:200]}")
                if attempt < retry_count - 1:
                    time.sleep(2 ** attempt)  # Exponential backoff
                    
        except Exception as e:
            print(f"    Exception (attempt {attempt + 1}/{retry_count}): {str(e)[:200]}")
            if attempt < retry_count - 1:
                time.sleep(2 ** attempt)
    
    # 모든 재시도 실패
    raise Exception(f"Failed to get embedding after {retry_count} attempts")

print("✓ Clova embedding function defined successfully!")

✓ Clova embedding function defined successfully!


### Step 6.4: 청크별 임베딩 생성 실행

Checkpoint 5에서 필터링된 청크에 대해 임베딩을 생성합니다.
청크 1개당 API 요청 1개를 보내며, QPM 제한을 준수합니다.

In [107]:
from datetime import datetime

# 모든 청크를 평탄화하여 리스트로 변환
all_chunks_with_metadata = []

for chapter_name, chunks in checkpoint_5_filtered_chunks.items():
    for chunk_idx, chunk in enumerate(chunks):
        all_chunks_with_metadata.append({
            'chapter': chapter_name,
            'chunk_index': chunk_idx,
            'metadata': chunk.metadata,
            'content': chunk.page_content
        })

print("=" * 60)
print("Starting Clova Embedding Generation")
print("=" * 60)
print(f"Total chunks to embed: {len(all_chunks_with_metadata)}")
print(f"QPM Limit: {QPM_LIMIT}")
print(f"Estimated time: ~{(len(all_chunks_with_metadata) / QPM_LIMIT * 60):.1f} minutes")
print()

# 임베딩 결과 저장
chunk_embeddings = []
failed_chunks = []

start_time = datetime.now()

# 청크 1개씩 처리
for idx, chunk_data in enumerate(all_chunks_with_metadata):
    chunk_num = idx + 1
    total_chunks = len(all_chunks_with_metadata)
    
    if idx % 10 == 0:  # 10개마다 진행 상황 출력
        print(f"[Processing {chunk_num}/{total_chunks}] {chunk_data['chapter']} - Chunk {chunk_data['chunk_index']}")
    
    try:
        # Clova API 호출 (청크 1개)
        embedding = get_clova_embedding(chunk_data['content'])
        
        # 결과 저장
        chunk_embeddings.append({
            'chapter': chunk_data['chapter'],
            'chunk_index': chunk_data['chunk_index'],
            'metadata': chunk_data['metadata'],
            'content': chunk_data['content'],
            'embedding': embedding,
            'embedding_dim': len(embedding) if isinstance(embedding, list) else 0
        })
        
    except Exception as e:
        print(f"  ✗ Failed for {chunk_data['chapter']} chunk {chunk_data['chunk_index']}: {str(e)[:100]}")
        failed_chunks.append(chunk_data)
        continue
    
    # 진행 상황 표시 (매 10개마다)
    if (idx + 1) % 10 == 0:
        elapsed = (datetime.now() - start_time).total_seconds()
        processed = idx + 1
        remaining = len(all_chunks_with_metadata) - processed
        
        avg_time_per_chunk = elapsed / processed
        estimated_remaining = avg_time_per_chunk * remaining
        print(f"  ✓ Progress: {processed}/{total_chunks} ({processed / total_chunks * 100:.1f}%)")
        print(f"    Elapsed: {elapsed / 60:.1f} min, Estimated remaining: {estimated_remaining / 60:.1f} min")
        print()

end_time = datetime.now()
total_time = (end_time - start_time).total_seconds()

print("=" * 60)
print(f"✓ Embedding generation completed\!")
print(f"  Total chunks: {len(all_chunks_with_metadata)}")
print(f"  Successfully embedded: {len(chunk_embeddings)}")
print(f"  Failed: {len(failed_chunks)}")
print(f"  Total time: {total_time / 60:.1f} minutes")
print(f"  Average time per chunk: {total_time / len(all_chunks_with_metadata):.2f} seconds")
print("=" * 60)

Starting Clova Embedding Generation
Total chunks to embed: 712
QPM Limit: 60
Estimated time: ~712.0 minutes

[Processing 1/712] Chapter 1 - Chunk 0
  ✓ Progress: 10/712 (1.4%)
    Elapsed: 0.0 min, Estimated remaining: 1.9 min

[Processing 11/712] Chapter 1 - Chunk 10
  ✓ Progress: 20/712 (2.8%)
    Elapsed: 0.1 min, Estimated remaining: 1.8 min

[Processing 21/712] Chapter 1 - Chunk 20
  ✓ Progress: 30/712 (4.2%)
    Elapsed: 0.1 min, Estimated remaining: 1.8 min

[Processing 31/712] Chapter 1 - Chunk 30
  ✓ Progress: 40/712 (5.6%)
    Elapsed: 0.1 min, Estimated remaining: 1.8 min

[Processing 41/712] Chapter 1 - Chunk 40
  ✓ Progress: 50/712 (7.0%)
    Elapsed: 0.1 min, Estimated remaining: 1.8 min

[Processing 51/712] Chapter 1 - Chunk 50
  ✓ Progress: 60/712 (8.4%)
    Elapsed: 0.2 min, Estimated remaining: 1.7 min

[Processing 61/712] Chapter 1 - Chunk 60
    Rate limit reached. Waiting 50.5 seconds...
  ✓ Progress: 70/712 (9.8%)
    Elapsed: 1.0 min, Estimated remaining: 9.4 min

### Step 6.5: 체크포인트 저장 및 결과 확인

생성된 임베딩을 체크포인트로 저장하고 샘플 결과를 확인합니다.

In [108]:
import json
import os

# 체크포인트 6 저장
checkpoint_6_embeddings = chunk_embeddings.copy()

# JSON 파일로 저장
print("\n--- Saving to JSON file ---")
embeddings_for_json = []

for item in checkpoint_6_embeddings:
    embeddings_for_json.append({
        'chapter': item['chapter'],
        'chunk_index': item['chunk_index'],
        'metadata': item['metadata'],
        'content': item['content'],
        'embedding': item['embedding'].tolist() if hasattr(item['embedding'], 'tolist') else item['embedding'],
        'embedding_dim': item['embedding_dim']
    })

json_file_path = 'checkpoint_6_embeddings.json'

with open(json_file_path, 'w', encoding='utf-8') as f:
    json.dump(embeddings_for_json, f, ensure_ascii=False, indent=2)

print(f"✓ Embeddings saved to {json_file_path}")
print(f"  File size: {os.path.getsize(json_file_path) / (1024 * 1024):.2f} MB")
print("\n✓ Checkpoint 6 saved successfully!")
print(f"  Variable: checkpoint_6_embeddings")
print(f"  Total embeddings: {len(checkpoint_6_embeddings)}")

# 샘플 확인
if checkpoint_6_embeddings:
    sample = checkpoint_6_embeddings[0]
    print(f"\n--- Sample: First embedding ---")
    print(f"  Chapter: {sample['chapter']}")
    print(f"  Chunk index: {sample['chunk_index']}")
    print(f"  Metadata: {sample['metadata']}")
    print(f"  Content length: {len(sample['content'])} characters")
    print(f"  Content preview: {sample['content'][:200]}...")
    print(f"  Embedding dimension: {sample['embedding_dim']}")
    
    # 임베딩이 리스트인지 확인 후 출력
    preview = sample['embedding'][:10] if isinstance(sample['embedding'], list) else 'N/A'
    print(f"  Embedding preview (first 10 values): {preview}")
    
    # 챕터별 통계
    print(f"\n--- Statistics by Chapter ---")
    chapter_counts = {}
    for item in checkpoint_6_embeddings:
        chapter = item['chapter']
        chapter_counts[chapter] = chapter_counts.get(chapter, 0) + 1
        
    for chapter in sorted(chapter_counts.keys()):
        count = chapter_counts[chapter]
        print(f"  {chapter}: {count} chunks")
else:
    print("\n--- No embeddings generated ---")

# 실패한 청크 확인 (failed_chunks 변수가 정의되어 있다고 가정)
if 'failed_chunks' in locals() and failed_chunks:
    print(f"\n--- Failed Chunks: {len(failed_chunks)} ---")
    for i, failed in enumerate(failed_chunks[:5]):  # 처음 5개만 표시
        print(f"  [{i+1}] Chapter: {failed['chapter']}, Index: {failed['chunk_index']}")
    if len(failed_chunks) > 5:
        print(f"  ... and {len(failed_chunks) - 5} more")


--- Saving to JSON file ---
✓ Embeddings saved to checkpoint_6_embeddings.json
  File size: 13.70 MB

✓ Checkpoint 6 saved successfully!
  Variable: checkpoint_6_embeddings
  Total embeddings: 712

--- Sample: First embedding ---
  Chapter: Chapter 1
  Chunk index: 0
  Metadata: {'Header 1': '**Problem: Building a Network**'}
  Content length: 1282 characters
  Content preview: # **Problem: Building a Network**

Suppose you want to build a computer network, one that has the potential to grow to global proportions and
to support applications as diverse as teleconferencing, vi...
  Embedding dimension: 1024
  Embedding preview (first 10 values): [0.31054688, 0.5805664, -0.8569336, 0.5751953, 0.33325195, -0.046417236, -0.10888672, 0.5839844, 0.8149414, 0.78125]

--- Statistics by Chapter ---
  Chapter 1: 67 chunks
  Chapter 2: 102 chunks
  Chapter 3: 77 chunks
  Chapter 4: 84 chunks
  Chapter 5: 56 chunks
  Chapter 6: 88 chunks
  Chapter 7: 62 chunks
  Chapter 8: 72 chunks
  Chapter 9: 1

---
## Checkpoint 7: PostgreSQL pgvector Loading

생성된 임베딩 벡터를 PostgreSQL의 pgvector extension을 사용하여 데이터베이스에 적재합니다.

### 입력
- `checkpoint_6_embeddings`: 청크별 임베딩 벡터와 메타데이터
- PostgreSQL 연결 정보 (host, port, database, user, password)

### 출력
- 데이터베이스에 적재 완료

### 체크포인트 저장
- 데이터베이스 적재 완료 여부

### Step 7.1: 라이브러리 설치 및 임포트

PostgreSQL과 pgvector를 사용하기 위한 라이브러리를 설치하고 임포트합니다.

In [109]:
!pip install psycopg2-binary -q

In [110]:
import psycopg2
from psycopg2.extras import execute_batch
import json

print("✓ Libraries imported successfully!")

✓ Libraries imported successfully!


### Step 7.2: PostgreSQL 연결 설정

PostgreSQL 데이터베이스 연결 정보를 설정합니다.

In [116]:
# PostgreSQL 연결 정보
PG_HOST = "localhost"
PG_PORT = 5432
PG_DATABASE = "mydb"
PG_USER = "myuser"
PG_PASSWORD = "mypassword"

# 테이블 설정
TABLE_NAME = "document_embeddings"
CATEGORY = "Network"

print("PostgreSQL Configuration:")
print(f"  Host: {PG_HOST}")
print(f"  Port: {PG_PORT}")
print(f"  Database: {PG_DATABASE}")
print(f"  User: {PG_USER}")
print(f"  Table: {TABLE_NAME}")
print(f"  Category: {CATEGORY}")

PostgreSQL Configuration:
  Host: localhost
  Port: 5432
  Database: mydb
  User: myuser
  Table: document_embeddings
  Category: Network


### Step 7.3: 테이블 스키마 생성

pgvector extension과 테이블을 생성합니다.

In [120]:
# # PostgreSQL 연결
# try:
#     conn = psycopg2.connect(
#         host=PG_HOST,
#         port=PG_PORT,
#         database=PG_DATABASE,
#         user=PG_USER,
#         password=PG_PASSWORD
#     )
#     conn.autocommit = True
#     cursor = conn.cursor()
    
#     print("✓ Connected to PostgreSQL successfully!")
    
#     # pgvector extension 생성
#     print("\nCreating pgvector extension...")
#     cursor.execute("CREATE EXTENSION IF NOT EXISTS vector;")
#     print("  ✓ pgvector extension enabled")
    
#     # 테이블이 이미 존재하는지 확인
#     cursor.execute(f"""
#         SELECT EXISTS (
#             SELECT FROM information_schema.tables 
#             WHERE table_name = '{TABLE_NAME}'
#         );
#     """)
#     table_exists = cursor.fetchone()[0]
    
#     if table_exists:
#         print(f"\n⚠ Table '{TABLE_NAME}' already exists. Truncating...")
#         cursor.execute(f"TRUNCATE TABLE {TABLE_NAME};")
#         print(f"  ✓ Table truncated")
#     else:
#         print(f"\nCreating table '{TABLE_NAME}'...")
        
#         # 테이블 생성
#         create_table_sql = f"""
#         CREATE TABLE {TABLE_NAME} (
#             id SERIAL PRIMARY KEY,
#             content TEXT NOT NULL,
#             category VARCHAR(255) NOT NULL,
#             embedding VECTOR(1024) NOT NULL,
#             tsvector TSVECTOR,
#             metadata JSONB,
#             created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
#         );
#         """
#         cursor.execute(create_table_sql)
#         print(f"  ✓ Table created")
        
#         # 인덱스 생성
#         print("\nCreating indexes...")
        
#         # HNSW 인덱스 (벡터 유사도 검색용)
#         cursor.execute(f"""
#             CREATE INDEX {TABLE_NAME}_embedding_idx 
#             ON {TABLE_NAME} 
#             USING hnsw (embedding vector_cosine_ops);
#         """)
#         print("  ✓ HNSW index created for embedding")
        
#         # GIN 인덱스 (전체 텍스트 검색용)
#         cursor.execute(f"""
#             CREATE INDEX {TABLE_NAME}_tsvector_idx 
#             ON {TABLE_NAME} 
#             USING gin (tsvector);
#         """)
#         print("  ✓ GIN index created for tsvector")
        
#         # Category 인덱스
#         cursor.execute(f"""
#             CREATE INDEX {TABLE_NAME}_category_idx 
#             ON {TABLE_NAME} (category);
#         """)
#         print("  ✓ B-tree index created for category")
    
#     print("\n" + "=" * 60)
#     print("✓ Database setup completed!")
#     print("=" * 60)
    
# except Exception as e:
#     print(f"✗ Database setup failed: {str(e)}")
#     raise

### Step 7.4: 배치 INSERT 실행

Checkpoint 6의 임베딩 데이터를 100개씩 배치로 PostgreSQL에 적재합니다.
- tsvector: content만 사용
- metadata: 헤더값을 JSONB로 저장

In [118]:
from datetime import datetime

BATCH_SIZE = 100  # 100개씩 배치 INSERT

print("=" * 60)
print("Starting Batch INSERT to PostgreSQL")
print("=" * 60)
print(f"Total embeddings: {len(checkpoint_6_embeddings)}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Estimated batches: {(len(checkpoint_6_embeddings) + BATCH_SIZE - 1) // BATCH_SIZE}")
print()

# INSERT 쿼리 준비
insert_sql = f"""
    INSERT INTO {TABLE_NAME} (content, category, embedding, tsvector, metadata)
    VALUES (%s, %s, %s, to_tsvector('english', %s), %s);
"""

inserted_count = 0
failed_count = 0
start_time = datetime.now()

try:
    # 배치 단위로 처리
    for batch_start in range(0, len(checkpoint_6_embeddings), BATCH_SIZE):
        batch_end = min(batch_start + BATCH_SIZE, len(checkpoint_6_embeddings))
        batch = checkpoint_6_embeddings[batch_start:batch_end]
        
        batch_num = (batch_start // BATCH_SIZE) + 1
        total_batches = (len(checkpoint_6_embeddings) + BATCH_SIZE - 1) // BATCH_SIZE
        
        print(f"[Batch {batch_num}/{total_batches}] Inserting rows {batch_start + 1} to {batch_end}...")
        
        # 배치 데이터 준비
        batch_data = []
        for item in batch:
            # metadata에 헤더값 저장
            metadata = item['metadata']
            
            batch_data.append((
                item['content'],                    # content
                CATEGORY,                           # category
                item['embedding'],                  # embedding (리스트 → PostgreSQL VECTOR로 자동 변환)
                item['content'],                    # tsvector용 텍스트 (content만 사용)
                json.dumps(metadata)                # metadata (JSONB - 헤더값 포함)
            ))
        
        try:
            # 배치 INSERT 실행
            execute_batch(cursor, insert_sql, batch_data, page_size=BATCH_SIZE)
            inserted_count += len(batch_data)
            print(f"  ✓ Successfully inserted {len(batch_data)} rows")
            
        except Exception as e:
            print(f"  ✗ Batch insert failed: {str(e)[:200]}")
            failed_count += len(batch_data)
            continue
        
        # 진행 상황 표시
        elapsed = (datetime.now() - start_time).total_seconds()
        processed = batch_end
        remaining = len(checkpoint_6_embeddings) - processed
        
        if processed > 0:
            avg_time_per_row = elapsed / processed
            estimated_remaining = avg_time_per_row * remaining
            print(f"    Progress: {processed}/{len(checkpoint_6_embeddings)} ({processed / len(checkpoint_6_embeddings) * 100:.1f}%)")
            print(f"    Elapsed: {elapsed:.1f}s, Estimated remaining: {estimated_remaining:.1f}s")
        
        print()
    
    # 커밋
    conn.commit()
    
    end_time = datetime.now()
    total_time = (end_time - start_time).total_seconds()
    
    print("=" * 60)
    print(f"✓ Batch INSERT completed!")
    print(f"  Total embeddings: {len(checkpoint_6_embeddings)}")
    print(f"  Successfully inserted: {inserted_count}")
    print(f"  Failed: {failed_count}")
    print(f"  Total time: {total_time:.1f} seconds")
    print(f"  Average time per row: {total_time / len(checkpoint_6_embeddings):.3f} seconds")
    print("=" * 60)
    
except Exception as e:
    print(f"\n✗ INSERT failed: {str(e)}")
    conn.rollback()
    raise
finally:
    if 'cursor' in locals() and cursor:
        cursor.close()
        print("\n--- Resource Cleanup ---")
        print("✓ Cursor closed.")
        print("ℹ Connection (conn) remains open for subsequent verification.")
    
    # 세션 상태 확인 (선택 사항)
    if not conn.closed:
        print(f"✓ Session is still active on table: {TABLE_NAME}")

Starting Batch INSERT to PostgreSQL
Total embeddings: 712
Batch size: 100
Estimated batches: 8

[Batch 1/8] Inserting rows 1 to 100...
  ✓ Successfully inserted 100 rows
    Progress: 100/712 (14.0%)
    Elapsed: 0.4s, Estimated remaining: 2.7s

[Batch 2/8] Inserting rows 101 to 200...
  ✓ Successfully inserted 100 rows
    Progress: 200/712 (28.1%)
    Elapsed: 0.9s, Estimated remaining: 2.4s

[Batch 3/8] Inserting rows 201 to 300...
  ✓ Successfully inserted 100 rows
    Progress: 300/712 (42.1%)
    Elapsed: 1.4s, Estimated remaining: 2.0s

[Batch 4/8] Inserting rows 301 to 400...
  ✓ Successfully inserted 100 rows
    Progress: 400/712 (56.2%)
    Elapsed: 1.9s, Estimated remaining: 1.5s

[Batch 5/8] Inserting rows 401 to 500...
  ✓ Successfully inserted 100 rows
    Progress: 500/712 (70.2%)
    Elapsed: 2.5s, Estimated remaining: 1.0s

[Batch 6/8] Inserting rows 501 to 600...
  ✓ Successfully inserted 100 rows
    Progress: 600/712 (84.3%)
    Elapsed: 3.0s, Estimated remaining: 

### Step 7.5: 데이터 검증 및 결과 확인

적재된 데이터를 검증하고 샘플 데이터를 확인합니다.

In [119]:
print("=" * 60)
print("Data Validation")
print("=" * 60)

try:
    # 총 레코드 수 확인
    cursor.execute(f"SELECT COUNT(*) FROM {TABLE_NAME};")
    total_count = cursor.fetchone()[0]
    print(f"\nTotal records in table: {total_count}")
    
    # 카테고리별 통계
    cursor.execute(f"""
        SELECT category, COUNT(*) 
        FROM {TABLE_NAME} 
        GROUP BY category;
    """)
    category_stats = cursor.fetchall()
    print(f"\nRecords by category:")
    for category, count in category_stats:
        print(f"  - {category}: {count}")
    
    # 샘플 데이터 확인 (첫 3개)
    cursor.execute(f"""
        SELECT id, 
               LEFT(content, 100) as content_preview, 
               category, 
               metadata,
               array_length(embedding::float[], 1) as embedding_dim
        FROM {TABLE_NAME}
        ORDER BY id
        LIMIT 3;
    """)
    
    samples = cursor.fetchall()
    
    print(f"\n--- Sample Records ---")
    for sample in samples:
        record_id, content_preview, category, metadata, embedding_dim = sample
        print(f"\n[Record {record_id}]")
        print(f"  Category: {category}")
        print(f"  Content: {content_preview}...")
        print(f"  Metadata: {metadata}")
        print(f"  Embedding dimension: {embedding_dim}")
    
    # 테이블 크기 확인
    cursor.execute(f"""
        SELECT pg_size_pretty(pg_total_relation_size('{TABLE_NAME}'));
    """)
    table_size = cursor.fetchone()[0]
    print(f"\nTable size: {table_size}")
    
    # 인덱스 확인
    cursor.execute(f"""
        SELECT indexname, indexdef 
        FROM pg_indexes 
        WHERE tablename = '{TABLE_NAME}';
    """)
    indexes = cursor.fetchall()
    print(f"\nIndexes ({len(indexes)}):")
    for idx_name, idx_def in indexes:
        print(f"  - {idx_name}")
    
    print("\n" + "=" * 60)
    print("✓ Data validation completed!")
    print("=" * 60)
    
except Exception as e:
    print(f"✗ Validation failed: {str(e)}")
finally:
    # 연결 종료
    cursor.close()
    conn.close()
    print("\n✓ Database connection closed")

Data Validation
✗ Validation failed: cursor already closed

✓ Database connection closed
