# 🟣 Lesson 16: Dataset Preparation

Garbage in, garbage out. A LoRA learns only from the images and captions you give it, so a good result starts with good data.

### Requirements:
1.  **Images**: 15-20 high-quality JPEG or PNG files of your subject.
2.  **Captions**: One `.txt` file per image, with the same base name as the image, describing what the image shows.

### The Folder Structure:
```
dataset/
  img1.jpg
  img1.txt  (Contains: "a photo of sks dog running")
  img2.jpg
  img2.txt
```
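
The pairing rule in this layout (every image has a caption with the same base name) is easy to check by script. The following is a minimal sketch, not part of any training tool: `find_unpaired` and `IMAGE_EXTENSIONS` are hypothetical names introduced here, and the extension list is an assumption based on the JPEG/PNG requirement above.

```python
from pathlib import Path

# Assumed set of image extensions, based on the JPEG/PNG requirement.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def find_unpaired(dataset_dir):
    """Return (images without captions, captions without images) as sorted name lists."""
    files = [p for p in Path(dataset_dir).iterdir() if p.is_file()]
    # Map each base name (stem) to its file name, separately for images and captions.
    images = {p.stem: p.name for p in files if p.suffix.lower() in IMAGE_EXTENSIONS}
    captions = {p.stem: p.name for p in files if p.suffix.lower() == ".txt"}
    missing_captions = sorted(name for stem, name in images.items() if stem not in captions)
    orphan_captions = sorted(name for stem, name in captions.items() if stem not in images)
    return missing_captions, orphan_captions
```

Calling `find_unpaired("dataset")` on a folder laid out as above should return two empty lists; any file name that appears in either list points to a missing or misnamed partner.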

### The Trigger Word
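We usually use a rare token (like `sks`) as a trigger word. Because the token carries little meaning of its own, the model learns to associate that specific word with your subject. Include it in the caption of every training image, as in the example above.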
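
A trigger word only helps if it actually appears in the captions, so it is worth scanning for captions that omit it. This is a minimal sketch under stated assumptions: `captions_missing_trigger` is a hypothetical helper, the second caption is invented for illustration, and the whole-word match is a design choice so that a word like "asks" does not count as `sks`.

```python
import re

def captions_missing_trigger(captions, trigger="sks"):
    """Return the names of captions that do not contain the trigger as a whole word."""
    pattern = re.compile(rf"\b{re.escape(trigger)}\b")
    return [name for name, text in captions.items() if not pattern.search(text)]

# Caption text keyed by file name; img2.txt is a made-up example with the trigger forgotten.
captions = {
    "img1.txt": "a photo of sks dog running",
    "img2.txt": "a photo of a dog sleeping",
}
print(captions_missing_trigger(captions))  # prints ['img2.txt']
```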

In [None]:
# 1. Setup
import notebook_utils
project_root, device, dtype = notebook_utils.setup_notebook()

from pathlib import Path
dataset_path = project_root / "datasets" / "demo_dataset"
dataset_path.mkdir(parents=True, exist_ok=True)

## 1. Create Dummy Dataset

We will programmatically create a tiny dataset just to show the structure.

In [None]:
from PIL import Image, ImageDraw

def create_training_pair(index, color):
    # Create image: a white circle on a solid colored background
    img = Image.new("RGB", (512, 512), color=color)
    draw = ImageDraw.Draw(img)
    draw.ellipse((100, 100, 400, 400), fill="white")
    
    # Save Image
    img.save(dataset_path / f"image_{index}.jpg")
    
    # Save Caption
    with open(dataset_path / f"image_{index}.txt", "w") as f:
        f.write(f"a photo of sks circle, {color} background")

create_training_pair(1, "red")
create_training_pair(2, "blue")
create_training_pair(3, "green")

print(f"Created 3 training pairs in: {dataset_path}")

## 2. Verify
Let's list the files to make sure they match the required format.

In [None]:
# Sort the listing so each image appears next to its caption file
for f in sorted(dataset_path.glob("*")):
    print(f.name)
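
Listing file names confirms the naming, but not what the captions contain. As a further check, the sketch below reads each caption next to its image name. `load_pairs` is a hypothetical helper, not part of `notebook_utils`, and it assumes `.jpg` images with UTF-8 captions, as created in this lesson.

```python
from pathlib import Path

def load_pairs(folder):
    """Return a sorted list of (image file name, caption text) for each .jpg with a .txt caption."""
    pairs = []
    for image_path in sorted(Path(folder).glob("*.jpg")):
        caption_path = image_path.with_suffix(".txt")
        if caption_path.exists():
            caption = caption_path.read_text(encoding="utf-8").strip()
            pairs.append((image_path.name, caption))
    return pairs
```

In this notebook, `load_pairs(dataset_path)` should return one entry per training pair, each holding a file name such as `image_1.jpg` and its caption text.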