# Amazing Logos V4 - Step 1: Data Processing

This notebook processes the amazing_logos_v4 dataset from the input folder:
- Reads the dataset from input/amazing_logos_v4/
- Assigns unique IDs with "amazing_logo_v4" prefix
- Saves logos as 256x256 images in output/amazing_logos_v4/images/256x256/
- Saves metadata (ID + prompt) in output/amazing_logos_v4/data/
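
The resize-and-save step in the list above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper name `save_logo_256`, the PNG format, the RGB conversion, the LANCZOS filter, and naming files after the logo ID are all assumptions. Only the 256x256 target size comes from this notebook.

```python
from pathlib import Path

from PIL import Image

TARGET_SIZE = (256, 256)  # target size stated in this notebook


def save_logo_256(image: Image.Image, logo_id: str, out_dir: Path) -> Path:
    """Resize one logo to 256x256 and save it under out_dir (hypothetical helper)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: convert to RGB so every saved file has the same mode.
    resized = image.convert("RGB").resize(TARGET_SIZE, Image.LANCZOS)
    # Assumption: files are named after the logo ID and stored as PNG.
    out_path = out_dir / f"{logo_id}.png"
    resized.save(out_path)
    return out_path
```

For example, `save_logo_256(data[0]['image'], 'amazing_logo_v4000000', output_images)` would write a single file into the 256x256 folder, assuming `data` and `output_images` are defined as in the cells below.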

In [None]:
from pathlib import Path

# Setup paths
input_dataset = Path('../../input/amazing_logos_v4')
output_base = Path('../../output/amazing_logos_v4')
output_images = output_base / 'images' / '256x256'
output_data = output_base / 'data' / 'amazing_logos_v4_cleanup'

# Create output directories
output_images.mkdir(parents=True, exist_ok=True)
output_data.mkdir(parents=True, exist_ok=True)

print(f"Input dataset: {input_dataset}")
print(f"Output images: {output_images}")
print(f"Output data: {output_data}")

Input path: ..\input\amazing_logos_v4
Output images: ..\output\amazing_logos_v4\images\256x256
Output data: ..\output\amazing_logos_v4\data
Input path exists. Contents:
  - dataset_dict.json
  - train
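
The listing above comes from a check that the setup cell, as shown, does not contain. A small sketch of such a check, using only `pathlib`; the helper name `describe_input` is hypothetical:

```python
from pathlib import Path


def describe_input(input_path: Path) -> list[str]:
    """Return the sorted entry names in input_path, or raise if it is missing."""
    if not input_path.exists():
        raise FileNotFoundError(f"Input path not found: {input_path}")
    return sorted(entry.name for entry in input_path.iterdir())


# Hypothetical usage with the path from the setup cell:
# for name in describe_input(Path('../../input/amazing_logos_v4')):
#     print(f"  - {name}")
```

Raising early on a missing folder keeps the failure close to its cause, rather than surfacing later inside the dataset loader.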


In [None]:
# Load the dataset
from datasets import load_from_disk

print("Loading amazing_logos_v4 dataset...")
data = None
try:
    # Try loading as HuggingFace dataset
    dataset = load_from_disk(str(input_dataset))
    print("Dataset loaded successfully!")
    print(f"Dataset info: {dataset}")
    
    # Get the train split if a DatasetDict (a dict subclass) was loaded
    if isinstance(dataset, dict) and 'train' in dataset:
        data = dataset['train']
        print(f"Using train split with {len(data)} entries")
    else:
        data = dataset
        print(f"Using dataset directly with {len(data)} entries")
    
    # Show dataset structure
    print(f"\nDataset columns: {data.column_names}")
    
    # Show first example
    if len(data) > 0:
        first_example = data[0]
        print(f"\nFirst example keys: {first_example.keys()}")
        for key, value in first_example.items():
            if key == 'image':
                print(f"  {key}: PIL Image ({value.size})")
            elif isinstance(value, str) and len(value) > 100:
                print(f"  {key}: {value[:100]}...")
            else:
                print(f"  {key}: {value}")
                
except Exception as e:
    print(f"Error loading dataset: {e}")
    # No alternative loader is attempted; reset data so the check below fails loudly
    data = None

# Check if data was loaded successfully
if data is None:
    print("❌ Failed to load dataset. Please check the input path and dataset format.")
    raise RuntimeError("Dataset loading failed")
else:
    print(f"✅ Dataset loaded successfully with {len(data)} entries")

Loading amazing_logos_v4 dataset...


Loading dataset from disk:   0%|          | 0/29 [00:00<?, ?it/s]

Dataset loaded successfully!
Dataset info: DatasetDict({
    train: Dataset({
        features: ['image', 'text'],
        num_rows: 397251
    })
})
Using train split with 397251 entries

Dataset columns: ['image', 'text']

First example keys: dict_keys(['image', 'text'])
  image: PIL Image ((512, 512))
  text: Simple elegant logo for Mandarin Oriental, Fan Hong kong Lines Paper, Hospitality, successful vibe, ...
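
The next cell writes `df_metadata` to CSV, but the code that builds it does not appear in this notebook as shown. A minimal sketch of that step, assuming IDs are the `amazing_logo_v4` prefix followed by a zero-padded six-digit row index (the format seen in the sample output further down) and that `texts` is the dataset's `text` column. The function name `build_metadata` is hypothetical:

```python
import pandas as pd

ID_PREFIX = "amazing_logo_v4"  # prefix named in this notebook
ID_DIGITS = 6                  # assumption: width inferred from the sample IDs


def build_metadata(texts) -> pd.DataFrame:
    """Pair each prompt with a sequential, zero-padded ID."""
    texts = list(texts)
    ids = [f"{ID_PREFIX}{i:0{ID_DIGITS}d}" for i in range(len(texts))]
    return pd.DataFrame({"id": ids, "text": texts})


# Hypothetical usage with the loaded split:
# df_metadata = build_metadata(data["text"])
```

Reading the whole `text` column at once avoids decoding every image just to collect the prompts.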


In [None]:
# Save metadata to CSV (df_metadata is the DataFrame with 'id' and 'text' columns)
csv_path = output_data / 'metadata.csv'
df_metadata.to_csv(csv_path, index=False)
print(f"Metadata CSV saved to: {csv_path}")

Creating DataFrame with id and text columns...
Extracting text data from dataset...
Extracted 397251 text entries
Creating ID-text pairs...


Creating metadata: 397252it [00:00, 1076474.78it/s]

=== PROCESSING COMPLETE ===
Created DataFrame with 397251 rows
Columns: ['id', 'text']
Metadata CSV saved to: ..\output\amazing_logos_v4\data\amazing_logos_v4_metadata.csv

=== SAMPLE DATA ===
                      id                                               text
0  amazing_logo_v4000000  Simple elegant logo for Mandarin Oriental, Fan...
1  amazing_logo_v4000001  Simple elegant logo for Alfa, Hexagon Poland T...
2  amazing_logo_v4000002  Simple elegant logo for Kuraray, G Japan K Out...
3  amazing_logo_v4000003  Simple elegant logo for Valwood Park, Lines Ro...
4  amazing_logo_v4000004  Simple elegant logo for Cinepaq, C Circle Film...
5  amazing_logo_v4000005  Simple elegant logo for Baumechanik Barleben, ...
6  amazing_logo_v4000006  Simple elegant logo for Werbeagentur Zühlke, ...
7  amazing_logo_v4000007  Simple elegant logo for Josef Grabner, Circles...
8  amazing_logo_v4000008  Simple elegant logo for Danefae, Beard Denmark...
9  amazing_logo_v4000009  Simple elegant logo 