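# Data Extraction and Model Tuning Pipeline

* Dataset resampling
* Dataset extraction
* Mapping datasets to the models to be trained
* Training
* Model export and testing

Test code for feature experiments.

## Feature Tests
Tests for path setup, filename extraction, dataset splitting, label file reading, and dataset creation and deletion.

### Defining the Dataset Structure

Tests for defining the output dataset paths and for mapping each extracted image to its label file.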

In [1]:
!ls ../..

README.md          demo_datasets      mvp_scripts        ui_app_test
da_framework       drift_datasets.zip notebooks          workspace
datasets           kolpr_test         orange3


In [2]:
import os

os.listdir('../../datasets/car_kolp_data')

['README.roboflow.txt',
 'valid',
 'README.dataset.txt',
 'test',
 'data.yaml',
 'train']

Parse the path of each dataset split.

In [5]:
dataset_path = '../../datasets/car_kolp_data'

train_path = os.path.join(dataset_path, 'train')
valid_path = os.path.join(dataset_path, 'valid')
test_path = os.path.join(dataset_path, 'test')

train_images = os.listdir(os.path.join(train_path, 'images'))
valid_images = os.listdir(os.path.join(valid_path, 'images'))
test_images = os.listdir(os.path.join(test_path, 'images'))

Match an image to its label file.

In [12]:
img_name = os.path.splitext(train_images[0])[0]

if os.path.exists(os.path.join(train_path, 'labels', f'{img_name}.txt')):
    label_name = f'{img_name}.txt'
    print(label_name)

IMG_1567_png.rf.bb44fbd0b5f8948712055a9ef72d7a8e.txt
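The single-file check above can be extended to a whole split. The following is a minimal sketch, assuming the `images/` and `labels/` layout shown earlier; `pair_images_with_labels` is a hypothetical helper name that does not appear in the notebook.

```python
import os

def pair_images_with_labels(split_path):
    """Return (pairs, missing) for one split directory.

    pairs   : list of (image_name, label_name) whose label file exists
    missing : list of image names with no matching .txt label
    """
    images_dir = os.path.join(split_path, 'images')
    labels_dir = os.path.join(split_path, 'labels')
    pairs, missing = [], []
    for image_name in sorted(os.listdir(images_dir)):
        # splitext drops only the final extension, so dots inside the
        # file name (as in the Roboflow-style names above) are preserved.
        label_name = os.path.splitext(image_name)[0] + '.txt'
        if os.path.exists(os.path.join(labels_dir, label_name)):
            pairs.append((image_name, label_name))
        else:
            missing.append(image_name)
    return pairs, missing
```

Calling `pair_images_with_labels(train_path)` would then report every training image that lacks a label, rather than only the first one.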


In [35]:
print(len(train_images) + len(valid_images) + len(test_images))

2449


Set the input and export paths.

In [36]:
# Define paths
input_dataset_path = '../../datasets/car_kolp_data'
output_base_path = '../datasets/512yolo'
view_name = 'kolpr_selection_test'
output_dataset_path = os.path.join(output_base_path, view_name)

Create the dataset folder and its split folders under the export path.

In [37]:
out_train_path = os.path.join(output_dataset_path, 'train')
out_valid_path = os.path.join(output_dataset_path, 'valid')
out_test_path = os.path.join(output_dataset_path, 'test')
# Create output directories
os.makedirs(output_dataset_path, exist_ok=True)
os.makedirs(out_train_path, exist_ok=True)
os.makedirs(out_valid_path, exist_ok=True)
os.makedirs(out_test_path, exist_ok=True)

### End-to-End Test of Dataset Extraction

Test loading the dataset and extracting the sampled image-label pairs.

In [15]:
import fiftyone as fo
import fiftyone.brain as fob
import cv2
import os

# Define paths
input_dataset_path = '../../datasets/car_kolp_data'
output_base_path = '../datasets/512yolo'
view_name = 'kolpr_selection_test'
output_dataset_path = os.path.join(output_base_path, view_name)
# data-split dir paths
out_train_path = os.path.join(output_dataset_path, 'train')
out_valid_path = os.path.join(output_dataset_path, 'valid')
out_test_path = os.path.join(output_dataset_path, 'test')
# Create output directories
os.makedirs(output_dataset_path, exist_ok=True)
os.makedirs(out_train_path, exist_ok=True)
os.makedirs(out_valid_path, exist_ok=True)
os.makedirs(out_test_path, exist_ok=True)

dataset = fo.Dataset.from_images_dir(input_dataset_path)

len(dataset)

 100% |███████████████| 2449/2449 [155.4ms elapsed, 0s remaining, 15.9K samples/s]  


2449

Random sampling test.

In [16]:
import random

selected_samples = []
random_samples = random.sample(list(dataset), 100)

for sample in random_samples:
    selected_samples.append(sample.filename)

len(selected_samples)

100

Create the data splits (train, val, test).

In [17]:
# Shuffle and split the images
random.shuffle(selected_samples)
num_samples = len(selected_samples)
train_split = int(0.7 * num_samples)
valid_split = int(0.9 * num_samples)

In [18]:
train_samples = selected_samples[:train_split]
valid_samples = selected_samples[train_split:valid_split]
test_samples = selected_samples[valid_split:]

print("train data : ",len(train_samples))
print("valid data : ",len(valid_samples))
print("test data : ",len(test_samples))


train data :  70
valid data :  20
test data :  10
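The shuffle-and-slice steps above can be wrapped in one reusable function. This is a sketch under stated assumptions: `split_samples` and its parameters are hypothetical names, the cut points mirror the 0.7 and 0.9 fractions used in the notebook, and the optional `seed` is an addition that makes the split reproducible.

```python
import random

def split_samples(names, train_end=0.7, valid_end=0.9, seed=None):
    """Shuffle a copy of `names` and cut it into train/valid/test lists.

    train_end and valid_end are cumulative fractions, as in the notebook:
    [0, train_end) is train, [train_end, valid_end) is valid, the rest is test.
    """
    rng = random.Random(seed)   # local RNG, leaves the global random state alone
    shuffled = list(names)      # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    train_cut = int(train_end * len(shuffled))
    valid_cut = int(valid_end * len(shuffled))
    return shuffled[:train_cut], shuffled[train_cut:valid_cut], shuffled[valid_cut:]

# Placeholder file names, used only to show the resulting sizes
train, valid, test = split_samples([f'img_{i}.jpg' for i in range(100)], seed=0)
print(len(train), len(valid), len(test))  # 70 20 10
```

Passing the same `seed` yields the same split each time, which helps when comparing models trained on different runs.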


Map image-label pairs and extract them.

In [19]:
import shutil

# Find each image and its label in the source dataset and copy both into `subset`
def find_and_copy_images_and_labels(images, subset):
    # Create the images and labels directories for this split
    images_dir = os.path.join(output_dataset_path, subset, 'images')
    labels_dir = os.path.join(output_dataset_path, subset, 'labels')
    os.makedirs(images_dir, exist_ok=True)
    os.makedirs(labels_dir, exist_ok=True)

    for image_name in images:
        label_name = os.path.splitext(image_name)[0] + '.txt'
        found = False
        # The image may live in any split of the source dataset
        for split in ['train', 'valid', 'test']:
            image_src = os.path.join(input_dataset_path, split, 'images', image_name)
            label_src = os.path.join(input_dataset_path, split, 'labels', label_name)

            if os.path.exists(image_src) and os.path.exists(label_src):
                shutil.copy(image_src, os.path.join(images_dir, image_name))
                shutil.copy(label_src, os.path.join(labels_dir, label_name))
                found = True
                break

        if not found:
            print(f"Warning: {image_name} or its label not found in any split.")

After extraction, create a YAML config file that matches the YOLO dataset format.

In [20]:
import shutil
import yaml

# Copy images and labels to respective directories
find_and_copy_images_and_labels(train_samples, 'train')
find_and_copy_images_and_labels(valid_samples, 'valid')
find_and_copy_images_and_labels(test_samples, 'test')

# Create data.yaml file
data_yaml = {
    'train': os.path.join(output_dataset_path, 'train'),
    'val': os.path.join(output_dataset_path, 'valid'),
    'test': os.path.join(output_dataset_path, 'test'),
    'nc': 1,  # Number of classes
    'names': ['License Plate']  # Replace with actual class names
}
with open(os.path.join(output_dataset_path, 'data.yaml'), 'w') as f:
    yaml.dump(data_yaml, f, default_flow_style=False, sort_keys=False)

print("Dataset split and data.yaml file created successfully.")

Dataset split and data.yaml file created successfully.


Confirm that the new dataset was created correctly.

In [21]:
len(os.listdir(os.path.join(output_dataset_path, 'valid', 'images')))

20
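The check above covers only the `valid` images. A slightly broader check, sketched below, counts images and labels for every split; `count_split_files` is a hypothetical helper and assumes the `<split>/images` and `<split>/labels` layout produced by the copy step.

```python
import os

def count_split_files(dataset_root, splits=('train', 'valid', 'test')):
    """Return {split: (num_images, num_labels)} for a YOLO-style export."""
    counts = {}
    for split in splits:
        sizes = []
        for sub in ('images', 'labels'):
            path = os.path.join(dataset_root, split, sub)
            # A missing directory counts as zero files instead of raising.
            sizes.append(len(os.listdir(path)) if os.path.isdir(path) else 0)
        counts[split] = tuple(sizes)
    return counts
```

For the export in this notebook, `count_split_files(output_dataset_path)` would be expected to show matching image and label counts of 70, 20, and 10, provided that no "not found" warnings were printed during the copy.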

### Loading Datasets with FiftyOne

Experiments with FiftyOne's built-in loaders and exporters for each `dataset_type`.

In [34]:
name = "yolo_loader_test"

# Delete any leftover dataset with this name so the cell can be re-run
if fo.dataset_exists(name):
    fo.delete_dataset(name)

Load the data.

In [35]:
dataset_dir = '../../datasets/car_kolp_data'
splits = ['train', 'val', 'test']

dataset = fo.Dataset(name)

for split in splits:
    dataset.add_dir(
        dataset_dir=dataset_dir,
        dataset_type=fo.types.YOLOv5Dataset,
        split=split,
        tags=split,
    )

 100% |███████████████| 2372/2372 [1.2s elapsed, 0s remaining, 2.0K samples/s]         
 100% |███████████████████| 66/66 [30.3ms elapsed, 0s remaining, 2.2K samples/s]     
 100% |███████████████████| 11/11 [7.6ms elapsed, 0s remaining, 1.4K samples/s]      


In [36]:
len(dataset)

2449

In [37]:
fo.launch_app(dataset, port=9999)

Dataset:          yolo_loader_test
Media type:       image
Num samples:      2449
Selected samples: 0
Selected labels:  0
Session URL:      http://localhost:9999/

Export in a chosen dataset format.

In [43]:
saved_views = dataset.list_saved_views()
view = dataset.load_saved_view(saved_views[0])
export_dir = './datasets/512yolo/tmp/'

view.export(
    export_dir=export_dir,
    dataset_type=fo.types.YOLOv5Dataset,
)


 100% |█████████████████████| 5/5 [20.6ms elapsed, 0s remaining, 242.2 samples/s] 


Check the result.

In [44]:
fo.list_datasets()

['2024.11.08.16.30.03',
 '2024.11.08.16.48.59',
 '2024.11.08.17.40.32',
 '2024.11.08.17.56.20',
 '2024.11.08.17.57.50',
 '2024.11.08.17.59.03',
 '2024.11.12.11.21.54',
 '2024.11.12.11.25.53',
 '2024.11.13.12.23.50',
 '2024.11.13.12.28.02',
 '2024.11.13.17.45.58',
 '2024.11.13.17.47.49',
 '2024.11.13.17.48.09',
 '2024.11.13.17.48.47',
 '2024.11.13.17.49.03',
 'lghangul_dataset',
 'yolo_loader_test']

In [45]:
fo.load_dataset('2024.11.08.16.30.03')

Name:        2024.11.08.16.30.03
Media type:  image
Num samples: 2449
Persistent:  False
Tags:        []
Sample fields:
    id:               fiftyone.core.fields.ObjectIdField
    filepath:         fiftyone.core.fields.StringField
    tags:             fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:       fiftyone.core.fields.DateTimeField
    last_modified_at: fiftyone.core.fields.DateTimeField

In [48]:
fo.launch_app(dataset, port=9991)

Dataset:          yolo_loader_test
Media type:       image
Num samples:      2449
Selected samples: 0
Selected labels:  0
Session URL:      http://localhost:9991/

Tag the dataset to identify its format.

In [10]:
dataset = fo.load_dataset('kolp_dataset')

Name:        kolp_dataset
Media type:  image
Num samples: 2449
Persistent:  True
Tags:        []
Sample fields:
    id:               fiftyone.core.fields.ObjectIdField
    filepath:         fiftyone.core.fields.StringField
    tags:             fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:       fiftyone.core.fields.DateTimeField
    last_modified_at: fiftyone.core.fields.DateTimeField
    ground_truth:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    clip_embeddings:  fiftyone.core.fields.ListField(fiftyone.core.fields.FloatField)

In [21]:
dataset.tags.append("YOLOv5Dataset")
dataset.save()

In [23]:
dataset.tags

['YOLOv5Dataset']

In [24]:
if "YOLOv5Dataset" in dataset.tags:
    print("Dataset type is YOLOv5 Dataset")
elif "FiftyOneDataset" in dataset.tags:
    print("Dataset type is FiftyOne Dataset")

Dataset type is YOLOv5 Dataset
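The `if`/`elif` chain above grows with every supported format. One possible alternative, sketched here, scans the tags against a tuple of known format names. `KNOWN_FORMAT_TAGS` and `format_from_tags` are hypothetical names; the two tag strings are the ones used in the notebook.

```python
# Tag strings follow the notebook; the tuple order decides which one wins
# if a dataset carries more than one format tag.
KNOWN_FORMAT_TAGS = ('YOLOv5Dataset', 'FiftyOneDataset')

def format_from_tags(tags):
    """Return the first known format tag present in `tags`, or None."""
    for tag in KNOWN_FORMAT_TAGS:
        if tag in tags:
            return tag
    return None

print(format_from_tags(['YOLOv5Dataset']))  # YOLOv5Dataset
```

Because both tag strings match class names in `fo.types`, the result could be passed to `getattr(fo.types, tag)` to obtain the `dataset_type` for an export, after checking for `None`.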


# Running the Demo App

## Run App
```bash
bash run.sh
```
>args
>- dataset_dir : path to the dataset
>- dataset_name : name of the dataset
>- dataset_type : dataset format (51, yolo)
>- port : port number
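The contents of `run.sh` are not shown here, so the following is only a sketch of how a Python entry point might accept these four arguments. The `--` flag spellings, the defaults, and `build_parser` itself are assumptions, not the app's actual interface.

```python
import argparse

def build_parser():
    """Hypothetical CLI mirroring the four arguments listed above."""
    parser = argparse.ArgumentParser(description='Launch the demo app (sketch).')
    parser.add_argument('--dataset_dir', required=True, help='path to the dataset')
    parser.add_argument('--dataset_name', required=True, help='name of the dataset')
    parser.add_argument('--dataset_type', choices=['51', 'yolo'], default='yolo',
                        help="dataset format: '51' (FiftyOne) or 'yolo'")
    # The default port is an assumption made for this sketch
    parser.add_argument('--port', type=int, default=5151, help='port number')
    return parser

# Example invocation with values taken from the notebook
args = build_parser().parse_args([
    '--dataset_dir', '../../datasets/car_kolp_data',
    '--dataset_name', 'kolp_dataset',
    '--dataset_type', 'yolo',
    '--port', '9999',
])
print(args.dataset_type, args.port)  # yolo 9999
```

Restricting `--dataset_type` with `choices` rejects unsupported formats at parse time, before any dataset is loaded.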

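## Dataset Structure

### YOLOv5Dataset
A YOLO dataset consists of train, valid, and test folders inside the dataset folder, each with images and labels subfolders.<br>
The dataset config file name is fixed to dataset.yaml.

### FiftyOneDataset
A FiftyOne dataset follows the layout defined by its own format.<br>
* The brain/ folder stores the embedding vector data
* The data/ folder stores the image data
* The metadata.json file stores all information about the dataset
* The samples.jsonl file stores the information for each image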
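The two layouts described above differ enough that the format can often be guessed from the directory contents alone. The sketch below is a heuristic, not part of the app: `guess_dataset_format` is a hypothetical helper, and the file and folder names it checks are taken from the descriptions in this section.

```python
import os

def guess_dataset_format(dataset_dir):
    """Guess the on-disk format from the layout described above.

    Returns 'FiftyOneDataset', 'YOLOv5Dataset', or None if neither matches.
    """
    def has(*parts):
        return os.path.exists(os.path.join(dataset_dir, *parts))

    # FiftyOne layout: a metadata.json file next to a data/ folder
    if has('metadata.json') and has('data'):
        return 'FiftyOneDataset'
    # YOLO layout: split folders that contain images/ and labels/
    if has('train', 'images') and has('train', 'labels'):
        return 'YOLOv5Dataset'
    return None
```

A guess of this kind could serve as a fallback when a dataset carries no format tag.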