# Amazing Logos V4 - Step 2: Text Analysis and Structured Data Export

This notebook processes the metadata CSV produced in Step 1:
- Loads metadata.csv from the amazing_logos_v4_cleanup output folder
- Parses the text column into structured fields (company, description, category, tags)
- Builds a structured table with columns: id, company, description, category, tags
- Moves boilerplate style tags that ended up in company, description, or category back into tags
- Saves the result as metadata2.csv in the same folder
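
The parse_text helper is imported from a local utils module whose source is not shown in this notebook. Judging only from the example outputs further down, it appears to split the prompt on commas and assign fields by position. A minimal sketch under that assumption follows; the name `parse_text_sketch` is hypothetical and this is not the project's actual implementation.

```python
from typing import List, Tuple


def parse_text_sketch(text: str) -> Tuple[str, str, str, List[str]]:
    """Hypothetical stand-in for utils/text.parse_text (assumed behaviour)."""
    # Split on commas and drop surrounding whitespace and empty pieces
    parts = [p.strip() for p in str(text).split(',') if p.strip()]
    # Pad so that short inputs still fill the three positional fields
    parts += [''] * max(0, 3 - len(parts))
    company, description, category = parts[:3]
    tags = parts[3:]
    return company, description, category, tags


example = ("Simple elegant logo for Alfa, Hexagon Poland Triangles, Chemicals, "
           "successful vibe, minimalist")
print(parse_text_sketch(example))
```

Note that a positional split like this keeps the prompt prefix ("Simple elegant logo for ...") inside the company field, which matches what the outputs in this notebook show.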

In [1]:
import sys
from pathlib import Path

import pandas as pd
from tqdm import tqdm

# Add utils folder to path
utils_path = Path('../../utils')
sys.path.append(str(utils_path))

from text import parse_text

# Paths
input_csv = Path('../../output/amazing_logos_v4/data/amazing_logos_v4_cleanup/metadata.csv')
output_csv = Path('../../output/amazing_logos_v4/data/amazing_logos_v4_cleanup/metadata2.csv')

print(f"Input CSV: {input_csv}")
print(f"Output CSV: {output_csv}")

# Check if input exists; stop here so later cells do not fail with a less obvious error
if not input_csv.exists():
    raise FileNotFoundError(f"Input file {input_csv} does not exist!")
print("Input file exists.")

Input CSV: ..\..\output\amazing_logos_v4\data\amazing_logos_v4_cleanup\metadata.csv
Output CSV: ..\..\output\amazing_logos_v4\data\amazing_logos_v4_cleanup\metadata2.csv
Input file exists.


In [2]:
# Load the metadata CSV
print("Loading metadata CSV...")
df = pd.read_csv(input_csv)

print(f"Loaded {len(df)} rows")
print(f"Columns: {list(df.columns)}")
print(f"\nFirst 5 rows:")
print(df.head())

# Show some sample text values to understand the structure
print("\nSample text values:")
for i, text in enumerate(df['text'].head(5), start=1):
    print(f"{i}. {text}")

Loading metadata CSV...
Loaded 397251 rows
Columns: ['id', 'text']

First 5 rows:
                      id                                               text
0  amazing_logo_v4000000  Simple elegant logo for Mandarin Oriental, Fan...
1  amazing_logo_v4000001  Simple elegant logo for Alfa, Hexagon Poland T...
2  amazing_logo_v4000002  Simple elegant logo for Kuraray, G Japan K Out...
3  amazing_logo_v4000003  Simple elegant logo for Valwood Park, Lines Ro...
4  amazing_logo_v4000004  Simple elegant logo for Cinepaq, C Circle Film...

Sample text values:
1. Simple elegant logo for Mandarin Oriental, Fan Hong kong Lines Paper, Hospitality, successful vibe, minimalist, thought-provoking, abstract, recognizable, relatable, sharp, vector art, even edges, black and white
2. Simple elegant logo for Alfa, Hexagon Poland Triangles, Chemicals, successful vibe, minimalist, thought-provoking, abstract, recognizable, relatable, sharp, vector art, even edges, black and white
3. Simple elegant logo fo

In [3]:
# Parse the text column to extract structured information
print("Parsing text column...")


# Test the parsing function on a few examples
print("Testing parsing function:")
for i in range(min(3, len(df))):
    text = df.iloc[i]['text']
    company, description, category, tags = parse_text(text)
    print(f"\nExample {i+1}:")
    print(f"  Original: {text}")
    print(f"  Company: {company}")
    print(f"  Description: {description}")
    print(f"  Category: {category}")
    print(f"  Tags: {tags}")

Parsing text column...
Testing parsing function:

Example 1:
  Original: Simple elegant logo for Mandarin Oriental, Fan Hong kong Lines Paper, Hospitality, successful vibe, minimalist, thought-provoking, abstract, recognizable, relatable, sharp, vector art, even edges, black and white
  Company: Simple elegant logo for Mandarin Oriental
  Description: Fan Hong kong Lines Paper
  Category: Hospitality
  Tags: ['successful vibe', 'minimalist', 'thought-provoking', 'abstract', 'recognizable', 'relatable', 'sharp', 'vector art', 'even edges', 'black and white']

Example 2:
  Original: Simple elegant logo for Alfa, Hexagon Poland Triangles, Chemicals, successful vibe, minimalist, thought-provoking, abstract, recognizable, relatable, sharp, vector art, even edges, black and white
  Company: Simple elegant logo for Alfa
  Description: Hexagon Poland Triangles
  Category: Chemicals
  Tags: ['successful vibe', 'minimalist', 'thought-provoking', 'abstract', 'recognizable', 'relatable', 'sharp', 

In [4]:
# Process all rows to create structured data
print("Processing all rows to create structured data...")

# Create list to store structured data
structured_data = []

# Process each row
for idx, row in tqdm(df.iterrows(), total=len(df), desc="Processing rows"):
    logo_id = row['id']
    company, description, category, tags = parse_text(row['text'])
    
    # Convert tags list to comma-separated string
    tags_str = ', '.join(tags) if tags else ''
    
    # Ensure all values are strings
    structured_data.append({
        'id': str(logo_id) if logo_id else '',
        'company': str(company) if company else '',
        'description': str(description) if description else '',
        'category': str(category) if category else '',
        'tags': tags_str
    })

print(f"Processed {len(structured_data)} rows")

# Create new DataFrame with structured data
structured_df = pd.DataFrame(structured_data)

print(f"Created structured DataFrame with columns: {list(structured_df.columns)}")
print(f"DataFrame shape before filtering: {structured_df.shape}")

# Filter out rows with empty values in any column
print("\nFiltering out rows with empty values...")
initial_count = len(structured_df)

# Filter out rows where any required column is empty
structured_df = structured_df[
    (structured_df['id'] != '') &
    (structured_df['company'] != '') &
    (structured_df['description'] != '') &
    (structured_df['category'] != '') &
    (structured_df['tags'] != '')
]

final_count = len(structured_df)
removed_count = initial_count - final_count

print(f"Filtered DataFrame shape: {structured_df.shape}")
print(f"Removed {removed_count} rows with empty values ({removed_count/initial_count*100:.1f}%)")
print(f"Kept {final_count} complete rows ({final_count/initial_count*100:.1f}%)")

Processing all rows to create structured data...


Processing rows: 100%|██████████| 397251/397251 [00:14<00:00, 27971.20it/s]


Processed 397251 rows
Created structured DataFrame with columns: ['id', 'company', 'description', 'category', 'tags']
DataFrame shape before filtering: (397251, 5)

Filtering out rows with empty values...
Filtered DataFrame shape: (393298, 5)
Removed 3953 rows with empty values (1.0%)
Kept 393298 complete rows (99.0%)
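
The five chained comparisons in the filter can also be written as a single column-wise check. The sketch below uses made-up toy rows (not data from this dataset) and is only meant to show the equivalent expression.

```python
import pandas as pd

# Toy rows (made up for illustration); the second one has an empty category
demo = pd.DataFrame({
    'id': ['x1', 'x2'],
    'company': ['Acme', 'Globex'],
    'description': ['Rocket Lines', 'Globe'],
    'category': ['Aerospace', ''],
    'tags': ['sharp, vector art', 'sharp'],
})

required = ['id', 'company', 'description', 'category', 'tags']
# True only for rows where every required column is a non-empty string
is_complete = demo[required].ne('').all(axis=1)
complete = demo[is_complete]
print(f"Kept {len(complete)} of {len(demo)} rows")
```

Keeping the column list in one variable makes it harder for the filter and the DataFrame schema to drift apart.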


In [6]:
structured_df[(structured_df.category == 'minimalist') & (structured_df.description == 'successful vibe')]

                           id                                   company      description    category                                               tags
10124   amazing_logo_v4010124  Simple elegant logo for Elka Minimalista  successful vibe  minimalist  thought-provoking, abstract, recognizable, rel...
10135   amazing_logo_v4010135            Simple elegant logo for parrot  successful vibe  minimalist  thought-provoking, abstract, recognizable, rel...
10152   amazing_logo_v4010152          Simple elegant logo for MIXINGLE  successful vibe  minimalist  thought-provoking, abstract, recognizable, rel...
10156   amazing_logo_v4010156              Simple elegant logo for Duck  successful vibe  minimalist  thought-provoking, abstract, recognizable, rel...
10181   amazing_logo_v4010181      Simple elegant logo for Black Knight  successful vibe  minimalist  thought-provoking, abstract, recognizable, rel...
...                       ...                                       ...              ...         ...                                                ...
397153  amazing_logo_v4397153            Simple elegant logo for Unicom  successful vibe  minimalist  thought-provoking, abstract, recognizable, rel...
397159  amazing_logo_v4397159      Simple elegant logo for Padel Brasil  successful vibe  minimalist  thought-provoking, abstract, recognizable, rel...
397162  amazing_logo_v4397162        Simple elegant logo for Anchors Up  successful vibe  minimalist  thought-provoking, abstract, recognizable, rel...
397182  amazing_logo_v4397182         Simple elegant logo for cafe sole  successful vibe  minimalist  thought-provoking, abstract, recognizable, rel...
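
Rows like these are what a purely positional split would produce when a prompt contains only a company name followed by the style tags: with no description or category present, the first two tags slide into those slots. The illustration below assumes comma-positional parsing; the prompt is reconstructed from one of the rows above and shortened, and the inline split is a stand-in, not the project's parse_text.

```python
# Prompt with no description or category (reconstructed and shortened)
prompt = ("Simple elegant logo for Duck, successful vibe, minimalist, "
          "thought-provoking, abstract")

# Positional assignment: the first three pieces become the named fields
parts = [p.strip() for p in prompt.split(',')]
company, description, category, *tags = parts

print(f"company:     {company}")
print(f"description: {description}")   # a style tag, not a description
print(f"category:    {category}")      # a style tag, not a category
print(f"tags:        {tags}")
```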


In [7]:
# Extract a list of all tags
print("Extracting and counting tags...")
all_tags = structured_df['tags'].str.split(', ').explode()

# Clean up any leading/trailing whitespace
all_tags = all_tags.str.strip()

# Remove empty strings that might result from splitting
all_tags = all_tags[all_tags != '']

print(f"Found {len(all_tags)} total tag instances.")

Extracting and counting tags...
Found 3539347 total tag instances.


In [8]:
# Get the value counts of each tag
tag_counts = all_tags.value_counts()

# Print the top 20 most common tags
print("\nTop 20 most common tags:")
print(tag_counts.head(20))


Top 20 most common tags:
tags
abstract             393298
thought-provoking    393298
sharp                393298
even edges           393298
relatable            393298
vector art           393298
recognizable         393298
minimalist           367668
successful vibe      367264
black and white        9936
Entertainment           521
Retail                  492
Food                    492
Education               407
Restaurant              404
Sports                  346
entertainment           329
Design                  306
Hospitality             304
retail                  290
Name: count, dtype: int64
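
The counts above list Entertainment and entertainment, and Retail and retail, as separate tags. If case variants should be treated as one tag (a judgment call this notebook does not make), folding case before counting is one option. The sketch uses a made-up toy Series rather than the real tag data.

```python
import pandas as pd

# Toy tag list (made up) mixing capitalised and lower-case variants
tags = pd.Series(['Entertainment', 'entertainment', 'Retail', 'retail', 'Retail', 'sharp'])

# Fold case first so that variants are counted together
folded_counts = tags.str.casefold().value_counts()
print(folded_counts)
```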


In [9]:
# Get top 10 tags
top_10_tags = tag_counts.head(10).index.tolist()
print(f"Top 10 tags: {top_10_tags}")

# Columns to check for tags
columns_to_check = ['company', 'description', 'category']

# Create a copy to track changes
structured_df_cleaned = structured_df.copy()
rows_changed = 0

# Iterate over the DataFrame and apply the logic
for index, row in tqdm(structured_df_cleaned.iterrows(), total=len(structured_df_cleaned), desc="Cleaning columns"):
    # Get current tags, split into a set for efficient adding and avoiding duplicates
    current_tags = set(tag.strip() for tag in row['tags'].split(',') if tag.strip())
    
    changed_in_row = False
    
    for col in columns_to_check:
        # Check if the column value is one of the top 10 tags
        if row[col] in top_10_tags:
            tag_to_move = row[col]
            
            # Add the tag to our set of tags for this row
            current_tags.add(tag_to_move)
            
            # Set the column value to an empty string
            structured_df_cleaned.at[index, col] = ''
            changed_in_row = True

    # If any changes were made, rebuild the 'tags' string.
    # Sorting makes the result deterministic but discards the original tag order.
    if changed_in_row:
        rows_changed += 1
        structured_df_cleaned.at[index, 'tags'] = ', '.join(sorted(current_tags))

print(f"\nProcessing complete.")
print(f"Number of rows modified: {rows_changed}")

# Overwrite the original dataframe with the cleaned one
structured_df = structured_df_cleaned
print("DataFrame has been updated with the cleaned data.")

Top 10 tags: ['abstract', 'thought-provoking', 'sharp', 'even edges', 'relatable', 'vector art', 'recognizable', 'minimalist', 'successful vibe', 'black and white']


Cleaning columns: 100%|██████████| 393298/393298 [00:08<00:00, 45734.16it/s]


Processing complete.
Number of rows modified: 26042
DataFrame has been updated with the cleaned data.
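
The per-row logic of the cleaning loop can be isolated into a small function, which makes it easy to check on a single record. This is a sketch: the function name and the two-element `BOILERPLATE` set are assumptions for the demo (the notebook uses the full top-10 list), and the input row is modelled on the misparsed rows shown earlier.

```python
BOILERPLATE = {'minimalist', 'successful vibe'}  # demo subset of the top-10 tags


def move_misplaced_tags(row, boilerplate=BOILERPLATE,
                        columns=('company', 'description', 'category')):
    """Return a copy of row with boilerplate values moved from columns into tags."""
    cleaned = dict(row)
    tags = {t.strip() for t in cleaned['tags'].split(',') if t.strip()}
    changed = False
    for col in columns:
        if cleaned[col] in boilerplate:
            tags.add(cleaned[col])   # keep the value, but as a tag
            cleaned[col] = ''        # blank out the misplaced field
            changed = True
    if changed:
        cleaned['tags'] = ', '.join(sorted(tags))
    return cleaned


row = {'id': 'demo', 'company': 'Simple elegant logo for Duck',
       'description': 'successful vibe', 'category': 'minimalist',
       'tags': 'thought-provoking, abstract'}
print(move_misplaced_tags(row))
```

As in the loop above, rows that are modified end up with empty description or category values, so the earlier "no empty columns" guarantee no longer holds after this step.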





In [10]:
structured_df[(structured_df.category == 'minimalist') & (structured_df.description == 'successful vibe')]

Empty DataFrame (0 rows); columns: id, company, description, category, tags


In [13]:
# Save the structured data as CSV
print(f"Saving structured data to {output_csv}...")
structured_df.to_csv(output_csv, index=False)

print(f"Structured data saved successfully!")

# Show statistics
print(f"\n=== FINAL STATISTICS ===")
print(f"Total rows: {len(structured_df)}")
print(f"Columns: {list(structured_df.columns)}")
print(f"Output file: {output_csv}")

# Show sample data
print(f"\n=== SAMPLE DATA ===")
print(structured_df.head(10))

Saving structured data to ..\..\output\amazing_logos_v4\data\amazing_logos_v4_cleanup\metadata2.csv...
Structured data saved successfully!

=== FINAL STATISTICS ===
Total rows: 393298
Columns: ['id', 'company', 'description', 'category', 'tags']
Output file: ..\..\output\amazing_logos_v4\data\amazing_logos_v4_cleanup\metadata2.csv

=== SAMPLE DATA ===
                      id                                       company  \
0  amazing_logo_v4000000     Simple elegant logo for Mandarin Oriental   
1  amazing_logo_v4000001                  Simple elegant logo for Alfa   
2  amazing_logo_v4000002               Simple elegant logo for Kuraray   
3  amazing_logo_v4000003          Simple elegant logo for Valwood Park   
4  amazing_logo_v4000004               Simple elegant logo for Cinepaq   
5  amazing_logo_v4000005  Simple elegant logo for Baumechanik Barleben   
6  amazing_logo_v4000006   Simple elegant logo for Werbeagentur Zuhlke   
7  amazing_logo_v4000007         Simple elegant logo f
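
Because the cleaning step leaves empty strings in description and category, it is worth knowing how they survive a CSV round trip: by default `pd.read_csv` parses empty fields as NaN, while `keep_default_na=False` keeps them as empty strings. A small in-memory sketch with a made-up row:

```python
import io

import pandas as pd

# One toy row with empty description and category, like the cleaned rows
demo = pd.DataFrame({'id': ['x1'], 'company': ['Acme'], 'description': [''],
                     'category': [''], 'tags': ['sharp']})
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default parsing turns empty fields into NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps them as empty strings
buffer.seek(0)
string_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read['description'].isna().iloc[0])
print(repr(string_read['description'].iloc[0]))
```

A later step that compares against `''`, as the quality checks in this notebook do, would need the second form or an explicit `fillna('')` after loading.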

In [12]:
# Show data quality statistics
print("=== DATA QUALITY ANALYSIS ===")
print(f"Total rows: {len(structured_df)}")
print(f"Rows with company: {len(structured_df[structured_df['company'] != ''])}")
print(f"Rows with description: {len(structured_df[structured_df['description'] != ''])}")
print(f"Rows with category: {len(structured_df[structured_df['category'] != ''])}")
print(f"Rows with tags: {len(structured_df[structured_df['tags'] != ''])}")

# Show unique counts
print(f"\nUnique categories: {structured_df['category'].nunique()}")
print(f"Unique companies: {structured_df['company'].nunique()}")

# Show some examples of different data parts
print("\n=== SAMPLE CATEGORIES ===")
# Drop empty values before slicing so the numbering stays contiguous
unique_categories = [c for c in structured_df['category'].unique() if c]
for i, category in enumerate(unique_categories[:10], start=1):
    print(f"{i:2d}. {category}")

print("\n=== SAMPLE COMPANIES ===")
unique_companies = [c for c in structured_df['company'].unique() if c]
for i, company in enumerate(unique_companies[:10], start=1):
    print(f"{i:2d}. {company}")

=== DATA QUALITY ANALYSIS ===
Total rows: 393298
Rows with company: 393298
Rows with description: 367661
Rows with category: 367263
Rows with tags: 393298

Unique categories: 52982
Unique companies: 283790

=== SAMPLE CATEGORIES ===
 1. Hospitality
 2. Chemicals
 3. Safty Glass
 4. Park
 5. Film
 6. Engineering
 7. Advertising
 8. Design
 9. Accessories
10. Leisure

=== SAMPLE COMPANIES ===
 1. Simple elegant logo for Mandarin Oriental
 2. Simple elegant logo for Alfa
 3. Simple elegant logo for Kuraray
 4. Simple elegant logo for Valwood Park
 5. Simple elegant logo for Cinepaq
 6. Simple elegant logo for Baumechanik Barleben
 7. Simple elegant logo for Werbeagentur Zuhlke
 8. Simple elegant logo for Josef Grabner
 9. Simple elegant logo for Danefae
10. Simple elegant logo for IPCT
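
Every sampled company value still carries the prompt prefix "Simple elegant logo for". If only the company name is wanted, the prefix could be removed with an anchored regular expression. This is a sketch on two values taken from the sample above; whether every row in the dataset uses exactly this prefix has not been verified here.

```python
import pandas as pd

companies = pd.Series([
    'Simple elegant logo for Mandarin Oriental',
    'Simple elegant logo for Alfa',
])

# Remove the fixed prompt prefix, anchored at the start of the string
names = companies.str.replace(r'^Simple elegant logo for\s+', '', regex=True)
print(names.tolist())
```

Values that do not start with the prefix are left unchanged, so the operation is safe to apply to the whole column.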