# 민사법 데이터셋 빌더 (JSON 원본 보존)

이 노트북은 ZIP 파일 내 JSON 데이터를 **원본 구조 그대로** HuggingFace Dataset으로 변환합니다.
- 원본 JSON 필드 모두 보존
- 최소한의 정규화만 적용 (Arrow/Parquet 호환성)
- 타입별 자동 분류 및 통합

In [1]:
!uv pip install datasets

[2mUsing Python 3.10.18 environment at: /mnt/c/Users/LANDSOFT/Documents/dev/law/.venv[0m
[2mAudited [1m1 package[0m [2min 853ms[0m[0m
[2mAudited [1m1 package[0m [2min 853ms[0m[0m


In [2]:
from pathlib import Path
import zipfile
import json
from datasets import Dataset
from tqdm.auto import tqdm

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
BASE_DIR = "/mnt/d/data/01.민사법 LLM 사전학습 및 Instruction Tuning 데이터"

In [5]:
# BASE_DIR 아래 모든 zip file 경로 수집
zip_files = list(Path(BASE_DIR).rglob("*.zip"))
print(f"Found {len(zip_files)} zip files.")
for zip_file in zip_files:
    print(zip_file)

Found 22 zip files.
/mnt/d/data/01.민사법 LLM 사전학습 및 Instruction Tuning 데이터/3.개방데이터/1.데이터/Training/01.원천데이터/TS_01. 민사법_002. 법령.zip
/mnt/d/data/01.민사법 LLM 사전학습 및 Instruction Tuning 데이터/3.개방데이터/1.데이터/Training/01.원천데이터/TS_01. 민사법_003. 심결례.zip
/mnt/d/data/01.민사법 LLM 사전학습 및 Instruction Tuning 데이터/3.개방데이터/1.데이터/Training/01.원천데이터/TS_01. 민사법_001. 판결문.zip
/mnt/d/data/01.민사법 LLM 사전학습 및 Instruction Tuning 데이터/3.개방데이터/1.데이터/Training/01.원천데이터/TS_01. 민사법_004. 유권해석.zip
/mnt/d/data/01.민사법 LLM 사전학습 및 Instruction Tuning 데이터/3.개방데이터/1.데이터/Training/02.라벨링데이터/TL_01. 민사법_001. 판결문_0001. 질의응답.zip
/mnt/d/data/01.민사법 LLM 사전학습 및 Instruction Tuning 데이터/3.개방데이터/1.데이터/Training/02.라벨링데이터/TL_01. 민사법_001. 판결문_0002. 요약.zip
/mnt/d/data/01.민사법 LLM 사전학습 및 Instruction Tuning 데이터/3.개방데이터/1.데이터/Training/02.라벨링데이터/TL_01. 민사법_002. 법령_0001. 질의응답.zip
/mnt/d/data/01.민사법 LLM 사전학습 및 Instruction Tuning 데이터/3.개방데이터/1.데이터/Training/02.라벨링데이터/TL_01. 민사법_003. 심결례_0001. 질의응답.zip
/mnt/d/data/01.민사법 LLM 사전학습 및 Instruction Tuning 데이터/3.개방데이터/1.

In [7]:
# ZIP 파일 타입별 분류 및 JSON 구조 탐색
import zipfile
import json
from pathlib import Path
from collections import defaultdict

zip_by_type = defaultdict(list)

for zpath in zip_files:
    name = zpath.name
    # 파일명 패턴 파싱: TS/TL/VS/VL + 데이터타입(판결문/법령/심결례/유권해석) + 태스크(질의응답/요약)
    if "판결문" in name:
        data_type = "precedent"
    elif "법령" in name:
        data_type = "statute"
    elif "심결례" in name:
        data_type = "trial_decision"
    elif "유권해석" in name:
        data_type = "authoritative_interpretation"
    else:
        data_type = "unknown"
    
    if "질의응답" in name:
        task_type = "qa"
    elif "요약" in name:
        task_type = "summary"
    else:
        task_type = "source"
    
    split = "train" if ("TL_" in name or "TS_" in name) else "validation"
    
    key = f"{split}_{data_type}_{task_type}"
    zip_by_type[key].append(zpath)

print("ZIP 파일 분류:")
for key, paths in sorted(zip_by_type.items()):
    print(f"\n{key}: {len(paths)} files")
    for p in paths:
        print(f"  - {p.name}")

ZIP 파일 분류:

train_authoritative_interpretation_qa: 1 files
  - TL_01. 민사법_004. 유권해석_0001. 질의응답.zip

train_authoritative_interpretation_source: 1 files
  - TS_01. 민사법_004. 유권해석.zip

train_authoritative_interpretation_summary: 1 files
  - TL_01. 민사법_004. 유권해석_0002. 요약.zip

train_precedent_qa: 1 files
  - TL_01. 민사법_001. 판결문_0001. 질의응답.zip

train_precedent_source: 1 files
  - TS_01. 민사법_001. 판결문.zip

train_precedent_summary: 1 files
  - TL_01. 민사법_001. 판결문_0002. 요약.zip

train_statute_qa: 1 files
  - TL_01. 민사법_002. 법령_0001. 질의응답.zip

train_statute_source: 1 files
  - TS_01. 민사법_002. 법령.zip

train_trial_decision_qa: 1 files
  - TL_01. 민사법_003. 심결례_0001. 질의응답.zip

train_trial_decision_source: 1 files
  - TS_01. 민사법_003. 심결례.zip

train_trial_decision_summary: 1 files
  - TL_01. 민사법_003. 심결례_0002. 요약.zip

validation_authoritative_interpretation_qa: 1 files
  - VL_01. 민사법_004. 유권해석_0001. 질의응답.zip

validation_authoritative_interpretation_source: 1 files
  - VS_01. 민사법_004. 유권해석.zip

validation_

In [None]:
# 각 타입별 샘플 JSON 구조 확인
def examine_zip_structure(zpath, max_items=2):
    """ZIP 파일 내 JSON 구조를 출력"""
    print(f"\n{'='*80}")
    print(f"파일: {zpath.name}")
    print('='*80)
    
    with zipfile.ZipFile(zpath, 'r') as zf:
        json_files = [f for f in zf.namelist() if f.endswith('.json')]
        print(f"JSON 파일 수: {len(json_files)}")
        
        for jf in json_files[:1]:  # 첫 번째 JSON만
            with zf.open(jf) as f:
                data = json.load(f)
            
            print(f"\nJSON 파일: {jf}")
            print(f"타입: {type(data)}")
            
            if isinstance(data, list):
                print(f"리스트 길이: {len(data)}")
                for i, item in enumerate(data[:max_items]):
                    print(f"\n--- 항목 {i} ---")
                    if isinstance(item, dict):
                        print(f"Keys: {list(item.keys())}")
                        # 중요 필드만 출력
                        for key in list(item.keys())[:8]:
                            val = item[key]
                            if isinstance(val, str):
                                print(f"  {key}: {val[:100]}")
                            elif isinstance(val, (list, dict)):
                                print(f"  {key}: {type(val).__name__} (len={len(val) if hasattr(val, '__len__') else '?'})")
                            else:
                                print(f"  {key}: {val}")
            elif isinstance(data, dict):
                print(f"Keys: {list(data.keys())}")

# 대표 샘플 검사
sample_types = [
    ("train_precedent_source", "판결문 원천"),
    ("train_precedent_qa", "판결문 QA"),
    ("train_precedent_summary", "판결문 요약"),
    ("train_statute_source", "법령 원천"),
    ("train_statute_qa", "법령 QA"),
    ("train_trial_decision_source", "심결례 원천"),
    ("train_trial_decision_qa", "심결례 QA"),
    ("train_authoritative_interpretation_source", "유권해석 원천"),
]

for key, desc in sample_types:
    if key in zip_by_type and zip_by_type[key]:
        examine_zip_structure(zip_by_type[key][0], max_items=1)


파일: TS_01. 민사법_001. 판결문.zip
JSON 파일 수: 76291

JSON 파일: /민사법_판결문_10019.json
타입: <class 'dict'>
Keys: ['doc_class', 'doc_id', 'casenames', 'normalized_court', 'casetype', 'sentences', 'announce_date']

파일: TL_01. 민사법_001. 판결문_0001. 질의응답.zip
JSON 파일 수: 73065

JSON 파일: /민사법_판결문_질의응답_10021.json
타입: <class 'dict'>
Keys: ['info', 'taskinfo']

파일: TL_01. 민사법_001. 판결문_0002. 요약.zip
JSON 파일 수: 3228

JSON 파일: /민사법_판결문_요약_91299.json
타입: <class 'dict'>
Keys: ['info', 'taskinfo']

파일: TS_01. 민사법_002. 법령.zip
JSON 파일 수: 12

JSON 파일: /민사법_법령_7.json
타입: <class 'dict'>
Keys: ['statute_name', 'effective_date', 'proclamation_date', 'statute_type', 'statute_abbrv', 'statute_category', 'sentences', 'data_class']

파일: TL_01. 민사법_002. 법령_0001. 질의응답.zip
JSON 파일 수: 12

JSON 파일: /민사법_법령_질의응답_4.json
타입: <class 'dict'>
Keys: ['info', 'taskinfo']

파일: TS_01. 민사법_003. 심결례.zip
JSON 파일 수: 2510

JSON 파일: /민사법_심결례_1002.json
타입: <class 'dict'>
Keys: ['doc_class', 'document_type', 'doc_id', 'decision_date', 'result', 'sent

In [8]:
"""
타입별 데이터셋 빌더
각 문서 타입(판결문, 법령, 심결례, 유권해석)과 태스크(원천/QA/요약)별로 별도 데이터셋 생성
"""

from pathlib import Path
import zipfile
import json
from datasets import Dataset, DatasetDict
from tqdm.auto import tqdm
from typing import Any, Dict, List
from collections import defaultdict


class TypedDatasetBuilder:
    """문서 타입별 데이터셋 빌더"""
    
    def __init__(self, base_dir: str):
        self.base_dir = Path(base_dir)
        self.zip_files = list(self.base_dir.rglob("*.zip"))
        self.datasets_by_type = defaultdict(lambda: {"train": [], "validation": []})
    
    def classify_zip(self, zpath: Path) -> tuple:
        """ZIP 파일명으로 분류"""
        name = zpath.name
        
        # 데이터 타입
        if "판결문" in name:
            data_type = "precedent"
        elif "법령" in name:
            data_type = "statute"
        elif "심결례" in name:
            data_type = "trial_decision"
        elif "유권해석" in name:
            data_type = "interpretation"
        else:
            data_type = "unknown"
        
        # 태스크 타입
        if "질의응답" in name:
            task_type = "qa"
        elif "요약" in name:
            task_type = "summary"
        else:
            task_type = "source"
        
        # Split
        split = "train" if ("TL_" in name or "TS_" in name) else "validation"
        
        return split, data_type, task_type
    
    def process_precedent_source(self, item: Dict) -> Dict:
        """판결문 원천 데이터 처리"""
        sentences = item.get("sentences", [])
        text = " ".join([s.get("text", "") for s in sentences if isinstance(s, dict)])
        
        return {
            "doc_class": item.get("doc_class"),
            "doc_id": item.get("doc_id"),
            "casenames": item.get("casenames"),
            "normalized_court": item.get("normalized_court"),
            "casetype": item.get("casetype"),
            "announce_date": item.get("announce_date"),
            "text": text,
            "sentences": json.dumps(sentences, ensure_ascii=False),
            "char_len": len(text),
            "word_len": len(text.split()),
        }
    
    def process_precedent_qa(self, item: Dict) -> Dict:
        """판결문 QA 데이터 처리"""
        info = item.get("info", {})
        taskinfo = item.get("taskinfo", [])
        
        questions = []
        answers = []
        for task in taskinfo:
            if isinstance(task, dict):
                q = task.get("question")
                a = task.get("answer")
                if q: questions.append(q)
                if a: answers.append(a)
        
        return {
            "doc_id": info.get("doc_id"),
            "casenames": info.get("casenames"),
            "normalized_court": info.get("normalized_court"),
            "casetype": info.get("casetype"),
            "announce_date": info.get("announce_date"),
            "questions": questions,
            "answers": answers,
            "qa_count": len(questions),
            "info_json": json.dumps(info, ensure_ascii=False),
        }
    
    def process_precedent_summary(self, item: Dict) -> Dict:
        """판결문 요약 데이터 처리"""
        info = item.get("info", {})
        taskinfo = item.get("taskinfo", [])
        
        summaries = []
        for task in taskinfo:
            if isinstance(task, dict):
                summ = task.get("summary")
                if summ: summaries.append(summ)
        
        return {
            "doc_id": info.get("doc_id"),
            "casenames": info.get("casenames"),
            "normalized_court": info.get("normalized_court"),
            "casetype": info.get("casetype"),
            "announce_date": info.get("announce_date"),
            "summaries": summaries,
            "summary_count": len(summaries),
            "info_json": json.dumps(info, ensure_ascii=False),
        }
    
    def process_statute_source(self, item: Dict) -> Dict:
        """법령 원천 데이터 처리"""
        sentences = item.get("sentences", [])
        text = " ".join([s.get("text", "") for s in sentences if isinstance(s, dict)])
        
        return {
            "statute_name": item.get("statute_name"),
            "effective_date": item.get("effective_date"),
            "proclamation_date": item.get("proclamation_date"),
            "statute_type": item.get("statute_type"),
            "statute_abbrv": item.get("statute_abbrv"),
            "statute_category": item.get("statute_category"),
            "data_class": item.get("data_class"),
            "text": text,
            "sentences": json.dumps(sentences, ensure_ascii=False),
            "char_len": len(text),
            "word_len": len(text.split()),
        }
    
    def process_statute_qa(self, item: Dict) -> Dict:
        """법령 QA 데이터 처리"""
        info = item.get("info", {})
        taskinfo = item.get("taskinfo", [])
        
        questions = []
        answers = []
        for task in taskinfo:
            if isinstance(task, dict):
                q = task.get("question")
                a = task.get("answer")
                if q: questions.append(q)
                if a: answers.append(a)
        
        return {
            "statute_name": info.get("statute_name"),
            "effective_date": info.get("effective_date"),
            "statute_type": info.get("statute_type"),
            "questions": questions,
            "answers": answers,
            "qa_count": len(questions),
            "info_json": json.dumps(info, ensure_ascii=False),
        }
    
    def process_trial_decision_source(self, item: Dict) -> Dict:
        """심결례 원천 데이터 처리"""
        sentences = item.get("sentences", [])
        text = " ".join([s.get("text", "") for s in sentences if isinstance(s, dict)])
        
        return {
            "doc_class": item.get("doc_class"),
            "document_type": item.get("document_type"),
            "doc_id": item.get("doc_id"),
            "decision_date": item.get("decision_date"),
            "result": item.get("result"),
            "text": text,
            "sentences": json.dumps(sentences, ensure_ascii=False),
            "char_len": len(text),
            "word_len": len(text.split()),
        }
    
    def process_trial_decision_qa(self, item: Dict) -> Dict:
        """심결례 QA 데이터 처리"""
        info = item.get("info", {})
        taskinfo = item.get("taskinfo", [])
        
        questions = []
        answers = []
        for task in taskinfo:
            if isinstance(task, dict):
                q = task.get("question")
                a = task.get("answer")
                if q: questions.append(q)
                if a: answers.append(a)
        
        return {
            "doc_id": info.get("doc_id"),
            "document_type": info.get("document_type"),
            "decision_date": info.get("decision_date"),
            "result": info.get("result"),
            "questions": questions,
            "answers": answers,
            "qa_count": len(questions),
            "info_json": json.dumps(info, ensure_ascii=False),
        }
    
    def process_interpretation_source(self, item: Dict) -> Dict:
        """유권해석 원천 데이터 처리"""
        sentences = item.get("sentences", [])
        text = " ".join([s.get("text", "") for s in sentences if isinstance(s, dict)])
        
        return {
            "doc_class": item.get("doc_class"),
            "doc_id": item.get("doc_id"),
            "response_date": item.get("response_date"),
            "response_institute": item.get("response_institute"),
            "title": item.get("title"),
            "text": text,
            "sentences": json.dumps(sentences, ensure_ascii=False),
            "char_len": len(text),
            "word_len": len(text.split()),
        }
    
    def process_item(self, item: Dict, data_type: str, task_type: str, 
                    source_zip: str, source_json: str) -> Dict:
        """항목 처리 라우터"""
        # 타입별 처리 함수 매핑
        processors = {
            ("precedent", "source"): self.process_precedent_source,
            ("precedent", "qa"): self.process_precedent_qa,
            ("precedent", "summary"): self.process_precedent_summary,
            ("statute", "source"): self.process_statute_source,
            ("statute", "qa"): self.process_statute_qa,
            ("trial_decision", "source"): self.process_trial_decision_source,
            ("trial_decision", "qa"): self.process_trial_decision_qa,
            ("interpretation", "source"): self.process_interpretation_source,
        }
        
        processor = processors.get((data_type, task_type))
        if processor:
            record = processor(item)
        else:
            # 처리 함수가 없으면 원본 JSON 보존
            record = {"raw_json": json.dumps(item, ensure_ascii=False)}
        
        # 메타데이터 추가
        record.update({
            "_source_zip": source_zip,
            "_source_json": source_json,
            "_data_type": data_type,
            "_task_type": task_type,
        })
        
        return record
    
    def load_zip(self, zpath: Path):
        """ZIP 파일 로드"""
        split, data_type, task_type = self.classify_zip(zpath)
        dataset_key = f"{data_type}_{task_type}"
        
        with zipfile.ZipFile(zpath, 'r') as zf:
            json_files = [f for f in zf.namelist() if f.endswith('.json')]
            
            for jf in json_files:
                with zf.open(jf) as f:
                    data = json.load(f)
                
                # 단일 dict나 list 처리
                items = [data] if isinstance(data, dict) else data
                
                for item in items:
                    if isinstance(item, dict):
                        record = self.process_item(
                            item, data_type, task_type,
                            zpath.name, jf
                        )
                        self.datasets_by_type[dataset_key][split].append(record)
    
    def build(self) -> Dict[str, DatasetDict]:
        """전체 빌드"""
        print("ZIP 파일 로딩 중...")
        for zpath in tqdm(self.zip_files, desc="Processing ZIPs"):
            try:
                self.load_zip(zpath)
            except Exception as e:
                print(f"✗ {zpath.name}: {e}")
        
        # 각 타입별 DatasetDict 생성
        result = {}
        
        for dataset_key, splits_data in self.datasets_by_type.items():
            print(f"\n{dataset_key} 데이터셋 생성 중...")
            
            dataset_dict = {}
            for split, records in splits_data.items():
                if not records:
                    continue
                
                ds = Dataset.from_list(records)
                ds = ds.add_column("_row_id", [f"{split}-{i:07d}" for i in range(len(ds))])
                ds = ds.add_column("_split", [split] * len(ds))
                
                dataset_dict[split] = ds
                print(f"  {split}: {len(ds):,} rows, {len(ds.column_names)} columns")
            
            if dataset_dict:
                result[dataset_key] = DatasetDict(dataset_dict)
        
        return result


# 빌드 실행
print("="*80)
print("타입별 데이터셋 빌더")
print("="*80)

builder = TypedDatasetBuilder(BASE_DIR)
typed_datasets = builder.build()

print("\n" + "="*80)
print("빌드 완료!")
print("="*80)

for name, ds_dict in typed_datasets.items():
    print(f"\n{name}:")
    print(ds_dict)

타입별 데이터셋 빌더
ZIP 파일 로딩 중...


Processing ZIPs: 100%|██████████| 22/22 [01:23<00:00,  3.82s/it]




statute_source 데이터셋 생성 중...
  train: 12 rows, 17 columns
  validation: 2 rows, 17 columns

trial_decision_source 데이터셋 생성 중...
  train: 2,510 rows, 15 columns
  validation: 406 rows, 15 columns

precedent_source 데이터셋 생성 중...
  train: 2,510 rows, 15 columns
  validation: 406 rows, 15 columns

precedent_source 데이터셋 생성 중...
  train: 76,291 rows, 16 columns
  train: 76,291 rows, 16 columns
  validation: 9,527 rows, 16 columns

interpretation_source 데이터셋 생성 중...
  train: 410 rows, 15 columns
  validation: 66 rows, 15 columns

precedent_qa 데이터셋 생성 중...
  validation: 9,527 rows, 16 columns

interpretation_source 데이터셋 생성 중...
  train: 410 rows, 15 columns
  validation: 66 rows, 15 columns

precedent_qa 데이터셋 생성 중...
  train: 73,065 rows, 15 columns
  validation: 9,135 rows, 15 columns

precedent_summary 데이터셋 생성 중...
  train: 3,228 rows, 14 columns
  validation: 392 rows, 14 columns

statute_qa 데이터셋 생성 중...
  train: 12 rows, 13 columns
  validation: 2 rows, 13 columns

trial_decision_qa 데이터셋 생성 

In [11]:
from datasets import DatasetDict

# typed_datasets가 이미 딕셔너리 형태라고 가정
# 각 키의 값이 DatasetDict인 경우

# 방법 1: typed_datasets가 일반 딕셔너리인 경우
if isinstance(typed_datasets, dict) and not isinstance(typed_datasets, DatasetDict):
    typed_datasets = DatasetDict(typed_datasets)

# 방법 2: 중첩된 구조인 경우 (각 키마다 train/validation이 있는 경우)
# 이미 DatasetDict 형태이므로 변환 불필요
# typed_datasets의 각 키는 이미 DatasetDict입니다

# 사용 예시:
print(f"총 데이터셋 수: {len(typed_datasets)}")
print(f"데이터셋 키들: {list(typed_datasets.keys())}")

# 각 데이터셋 확인
for key, dataset_dict in typed_datasets.items():
    print(f"\n{key}:")
    print(f"  - train: {len(dataset_dict['train'])} rows")
    print(f"  - validation: {len(dataset_dict['validation'])} rows")

총 데이터셋 수: 11
데이터셋 키들: ['statute_source', 'trial_decision_source', 'precedent_source', 'interpretation_source', 'precedent_qa', 'precedent_summary', 'statute_qa', 'trial_decision_qa', 'trial_decision_summary', 'interpretation_qa', 'interpretation_summary']

statute_source:
  - train: 12 rows
  - validation: 2 rows

trial_decision_source:
  - train: 2510 rows
  - validation: 406 rows

precedent_source:
  - train: 76291 rows
  - validation: 9527 rows

interpretation_source:
  - train: 410 rows
  - validation: 66 rows

precedent_qa:
  - train: 73065 rows
  - validation: 9135 rows

precedent_summary:
  - train: 3228 rows
  - validation: 392 rows

statute_qa:
  - train: 12 rows
  - validation: 2 rows

trial_decision_qa:
  - train: 2289 rows
  - validation: 279 rows

trial_decision_summary:
  - train: 1100 rows
  - validation: 140 rows

interpretation_qa:
  - train: 258 rows
  - validation: 38 rows

interpretation_summary:
  - train: 152 rows
  - validation: 28 rows


In [12]:
typed_datasets

DatasetDict({
    statute_source: DatasetDict({
        train: Dataset({
            features: ['statute_name', 'effective_date', 'proclamation_date', 'statute_type', 'statute_abbrv', 'statute_category', 'data_class', 'text', 'sentences', 'char_len', 'word_len', '_source_zip', '_source_json', '_data_type', '_task_type', '_row_id', '_split'],
            num_rows: 12
        })
        validation: Dataset({
            features: ['statute_name', 'effective_date', 'proclamation_date', 'statute_type', 'statute_abbrv', 'statute_category', 'data_class', 'text', 'sentences', 'char_len', 'word_len', '_source_zip', '_source_json', '_data_type', '_task_type', '_row_id', '_split'],
            num_rows: 2
        })
    })
    trial_decision_source: DatasetDict({
        train: Dataset({
            features: ['doc_class', 'document_type', 'doc_id', 'decision_date', 'result', 'text', 'sentences', 'char_len', 'word_len', '_source_zip', '_source_json', '_data_type', '_task_type', '_row_id', '_spli

In [13]:
# 각 데이터셋 샘플 확인
print("="*80)
print("데이터셋 샘플 확인")
print("="*80)

for dataset_name, ds_dict in typed_datasets.items():
    print(f"\n{'='*80}")
    print(f"데이터셋: {dataset_name}")
    print('='*80)
    
    for split in ["train", "validation"]:
        if split not in ds_dict:
            continue
        
        ds = ds_dict[split]
        print(f"\n[{split}] {len(ds):,} rows")
        print(f"Columns: {', '.join(ds.column_names[:15])}")
        if len(ds.column_names) > 15:
            print(f"         ... and {len(ds.column_names) - 15} more")
        
        # 첫 번째 샘플 출력
        if len(ds) > 0:
            print(f"\n첫 번째 샘플:")
            sample = ds[0]
            for key, value in sorted(sample.items()):
                if key.startswith('_'):
                    print(f"  {key}: {value}")
                elif isinstance(value, str):
                    print(f"  {key}: {value[:100]}..." if len(value) > 100 else f"  {key}: {value}")
                elif isinstance(value, list):
                    print(f"  {key}: list[{len(value)}]")
                    if len(value) > 0:
                        print(f"    └─ 첫 항목: {str(value[0])[:80]}...")
                else:
                    print(f"  {key}: {value}")

데이터셋 샘플 확인

데이터셋: statute_source

[train] 12 rows
Columns: statute_name, effective_date, proclamation_date, statute_type, statute_abbrv, statute_category, data_class, text, sentences, char_len, word_len, _source_zip, _source_json, _data_type, _task_type
         ... and 2 more

첫 번째 샘플:
  _data_type: statute
  _row_id: train-0000000
  _source_json: /민사법_법령_7.json
  _source_zip: TS_01. 민사법_002. 법령.zip
  _split: train
  _task_type: source
  char_len: 0
  data_class: 2
  effective_date: 2024-08-01 00:00:00
  proclamation_date: 2023-08-08 00:00:00
  sentences: ["제1조(목적)\n", "이 법은 민사소송 등에서 전자문서 이용에 대한 기본 원칙과 절차를 규정함으로써 민사소송 등의 정보화를 촉진하고 신속성, 투명성을 높여 국민의 권리 실현에...
  statute_abbrv: 민소전자문서법
  statute_category: 민사법
  statute_name: 민사소송등에서의전자문서이용등에관한법률
  statute_type: 법률
  text: 
  word_len: 0

[validation] 2 rows
Columns: statute_name, effective_date, proclamation_date, statute_type, statute_abbrv, statute_category, data_class, text, sentences, char_len, word_len, _source_zip, _source_json, _da

In [17]:
# 데이터셋 저장
output_dir = str(BASE_DIR) + "/processed_datasets"
print(f"데이터셋 저장 중: {output_dir}")
typed_datasets.save_to_disk(output_dir)
print(f"✓ 저장 완료: {output_dir}")

데이터셋 저장 중: /mnt/d/data/01.민사법 LLM 사전학습 및 Instruction Tuning 데이터/processed_datasets


Saving the dataset (1/1 shards): 100%|██████████| 12/12 [00:00<00:00, 425.77 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 2/2 [00:00<00:00, 91.96 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 2510/2510 [00:00<00:00, 11547.17 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 406/406 [00:00<00:00, 7462.06 examples/s]
Saving the dataset (2/2 shards): 100%|██████████| 76291/76291 [00:05<00:00, 13483.16 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 9527/9527 [00:00<00:00, 13334.71 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 410/410 [00:00<00:00, 12964.12 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 66/66 [00:00<00:00, 2856.71 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 73065/73065 [00:00<00:00, 134185.07 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 9135/9135 [00:00<00:00, 99952.70 examples/s] 
Saving the dataset (1/1 shards): 100%|██████████| 3228/3228 [00:00<00

✓ 저장 완료: /mnt/d/data/01.민사법 LLM 사전학습 및 Instruction Tuning 데이터/processed_datasets





In [18]:
from datasets import DatasetDict

# typed_datasets가 이미 딕셔너리 형태라고 가정
# 각 키의 값이 DatasetDict인 경우

# 방법 1: typed_datasets가 일반 딕셔너리인 경우
if isinstance(typed_datasets, dict) and not isinstance(typed_datasets, DatasetDict):
    typed_datasets = DatasetDict(typed_datasets)

# 방법 2: 중첩된 구조인 경우 (각 키마다 train/validation이 있는 경우)
# 이미 DatasetDict 형태이므로 변환 불필요
# typed_datasets의 각 키는 이미 DatasetDict입니다

# 사용 예시:
print(f"총 데이터셋 수: {len(typed_datasets)}")
print(f"데이터셋 키들: {list(typed_datasets.keys())}")

# 각 데이터셋 확인
for key, dataset_dict in typed_datasets.items():
    print(f"\n{key}:")
    print(f"  - train: {len(dataset_dict['train'])} rows")
    print(f"  - validation: {len(dataset_dict['validation'])} rows")

총 데이터셋 수: 11
데이터셋 키들: ['statute_source', 'trial_decision_source', 'precedent_source', 'interpretation_source', 'precedent_qa', 'precedent_summary', 'statute_qa', 'trial_decision_qa', 'trial_decision_summary', 'interpretation_qa', 'interpretation_summary']

statute_source:
  - train: 12 rows
  - validation: 2 rows

trial_decision_source:
  - train: 2510 rows
  - validation: 406 rows

precedent_source:
  - train: 76291 rows
  - validation: 9527 rows

interpretation_source:
  - train: 410 rows
  - validation: 66 rows

precedent_qa:
  - train: 73065 rows
  - validation: 9135 rows

precedent_summary:
  - train: 3228 rows
  - validation: 392 rows

statute_qa:
  - train: 12 rows
  - validation: 2 rows

trial_decision_qa:
  - train: 2289 rows
  - validation: 279 rows

trial_decision_summary:
  - train: 1100 rows
  - validation: 140 rows

interpretation_qa:
  - train: 258 rows
  - validation: 38 rows

interpretation_summary:
  - train: 152 rows
  - validation: 28 rows


In [20]:
# HuggingFace Hub 업로드 (선택사항)
# 실행하려면 주석 해제하고 토큰/리포지토리명 설정

from huggingface_hub import login
import os

# 로그인 (환경변수 또는 직접 입력)
# login(token="your_hf_token")

# 서브셋별로 config 분리하여 업로드
repo_name = "brainer/civil-law-ko-v2"

# typed_datasets의 모든 키를 configs로 매핑
configs = {}
for dataset_key, dataset_dict in typed_datasets.items():
    configs[dataset_key] = dataset_dict
    print(f"Config 추가: {dataset_key}")

print(f"\n총 {len(configs)}개의 config 준비됨:")
for config_name in configs.keys():
    print(f"  - {config_name}")

# 각 config별로 업로드
for config_name, dataset in configs.items():
    print(f"\n업로드 중: {config_name}")
    
    # 데이터셋 크기에 따라 num_proc 조정
    min_size = min(len(split_ds) for split_ds in dataset.values())
    # num_proc는 최소 split 크기와 CPU 수 중 작은 값으로 설정 (최소 1)
    safe_num_proc = max(1, min(min_size, os.cpu_count() or 1))
    
    print(f"  최소 split 크기: {min_size}, num_proc: {safe_num_proc}")
    
    dataset.push_to_hub(
        repo_name,
        config_name=config_name,
        num_proc=safe_num_proc
    )

print("\n업로드 스크립트 준비 완료")
print(f"✓ 리포지토리: https://huggingface.co/datasets/{repo_name}")


Config 추가: statute_source
Config 추가: trial_decision_source
Config 추가: precedent_source
Config 추가: interpretation_source
Config 추가: precedent_qa
Config 추가: precedent_summary
Config 추가: statute_qa
Config 추가: trial_decision_qa
Config 추가: trial_decision_summary
Config 추가: interpretation_qa
Config 추가: interpretation_summary

총 11개의 config 준비됨:
  - statute_source
  - trial_decision_source
  - precedent_source
  - interpretation_source
  - precedent_qa
  - precedent_summary
  - statute_qa
  - trial_decision_qa
  - trial_decision_summary
  - interpretation_qa
  - interpretation_summary

업로드 중: statute_source
  최소 split 크기: 2, num_proc: 2


Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 67.45ba/s]s/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 48.54ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 67.45ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 48.54ba/s]
Processing Files (1 / 1): 100%|██████████|  110kB /  110kB, 61.3kB/s  
New Data Upload: 100%|██████████|  110kB /  110kB, 61.3kB/s  
Processing Files (1 / 1): 100%|██████████|  110kB /  110kB, 61.3kB/s  <00:03,  3.56s/ shards]
New Data Upload: 100%|██████████|  110kB /  110kB, 61.3kB/s  
Processing Files (1 / 1): 100%|██████████|  243kB /  243kB,  111kB/s  <00:03,  3.56s/ shards]
New Data Upload: 100%|██████████|  243kB /  243kB,  111kB/s  
Processing Files (1 / 1): 100%|██████████|  243kB /  243kB,  111kB/s  <00:00,  1.70s/ shards]
New Data Upload: 100%|██████████|  243kB /  243kB,  111kB/s  
Uploading the dataset shards (num_proc=2): 100%|██████████| 2/2 [00


업로드 중: trial_decision_source
  최소 split 크기: 406, num_proc: 16


Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 80.00ba/s]rds/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 80.00ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 76.76ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 76.76ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 80.49ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 80.49ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 81.29ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 81.29ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 79.73ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 79.73ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 79.86ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 79.86ba/s]
Creating p


업로드 중: precedent_source
  최소 split 크기: 9527, num_proc: 16


Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00,  9.47ba/s]rds/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00,  9.47ba/s]
Processing Files (1 / 1): 100%|██████████| 21.2MB / 21.2MB, 5.58MB/s  
Processing Files (1 / 1): 100%|██████████| 21.2MB / 21.2MB, 5.30MB/s  
New Data Upload: 100%|██████████| 21.2MB / 21.2MB, 5.30MB/s  
Processing Files (1 / 1): 100%|██████████| 21.2MB / 21.2MB, 5.30MB/s   10.47ba/s]
New Data Upload: 100%|██████████| 21.2MB / 21.2MB, 5.30MB/s  
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 10.47ba/s]
Processing Files (1 / 1): 100%|██████████| 19.1MB / 19.1MB, 4.77MB/s  
New Data Upload: 100%|██████████| 19.1MB / 19.1MB, 4.77MB/s  
Processing Files (1 / 1): 100%|██████████| 19.1MB / 19.1MB, 4.77MB/s  
New Data Upload: 100%|██████████| 19.1MB / 19.1MB, 4.77MB/s  
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00,  9.05ba/s]
Creating parquet from Arrow format: 100%|██████████|


업로드 중: interpretation_source
  최소 split 크기: 66, num_proc: 16


Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 156.46ba/s]ds/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 110.93ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 156.46ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 110.93ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 106.78ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 143.66ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 106.78ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 143.66ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 122.91ba/s]

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 122.91ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 71.91ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 153.56ba/s]


업로드 중: precedent_qa
  최소 split 크기: 9135, num_proc: 16


Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 72.97ba/s]rds/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 72.97ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 77.82ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 77.82ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 78.43ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 78.43ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 69.78ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 69.78ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 67.22ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 67.22ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 77.27ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 77.27ba/s]
Creating p


업로드 중: precedent_summary
  최소 split 크기: 392, num_proc: 16


Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 154.87ba/s]ds/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 154.87ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 170.20ba/s]

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 157.73ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 157.73ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 141.61ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 141.61ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 167.50ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 167.50ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 160.66ba/s]

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 160.66ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 75.42ba/s


업로드 중: statute_qa
  최소 split 크기: 2, num_proc: 2


Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 191.98ba/s]/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 191.98ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 189.38ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 189.38ba/s]
Processing Files (1 / 1): 100%|██████████| 8.78kB / 8.78kB, 7.32kB/s  
New Data Upload: 100%|██████████| 8.78kB / 8.78kB, 7.32kB/s  
Processing Files (1 / 1): 100%|██████████| 8.78kB / 8.78kB, 7.32kB/s  <00:02,  2.54s/ shards]
New Data Upload: 100%|██████████| 8.78kB / 8.78kB, 7.32kB/s  
Processing Files (1 / 1): 100%|██████████| 8.23kB / 8.23kB, 5.88kB/s  <00:02,  2.54s/ shards]
New Data Upload: 100%|██████████| 8.23kB / 8.23kB, 5.88kB/s  
Processing Files (1 / 1): 100%|██████████| 8.23kB / 8.23kB, 5.88kB/s  
New Data Upload: 100%|██████████| 8.23kB / 8.23kB, 5.88kB/s  
Uploading the dataset shards (num_proc=2): 100%|██████████| 2/2 [00:02<00:00,  1.37s/ s


업로드 중: trial_decision_qa
  최소 split 크기: 279, num_proc: 16


Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 141.57ba/s]ds/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 141.57ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 175.87ba/s]

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 175.87ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 159.72ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 159.72ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 155.28ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 141.26ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 141.26ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 132.22ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 132.22ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 117.57ba/s


업로드 중: trial_decision_summary
  최소 split 크기: 140, num_proc: 16


Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 126.70ba/s]ds/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 126.70ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 147.38ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 147.38ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 125.00ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 125.00ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 145.02ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 145.02ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 136.94ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 136.94ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 145.88ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 145.88ba/s]


업로드 중: interpretation_qa
  최소 split 크기: 38, num_proc: 16


Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 218.50ba/s]ds/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 218.50ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 119.40ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 136.70ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 119.40ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 136.70ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 146.76ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 146.76ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 137.75ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 254.63ba/s]

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 137.75ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 254.63ba/s


업로드 중: interpretation_summary
  최소 split 크기: 28, num_proc: 16


Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 267.37ba/s]ds/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 280.07ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 267.37ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 280.07ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 271.35ba/s]

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 271.35ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 115.07ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 115.07ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 88.53ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 102.02ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 88.53ba/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 102.02ba/s]



업로드 스크립트 준비 완료
✓ 리포지토리: https://huggingface.co/datasets/brainer/civil-law-ko-v2
