# Analysing the dataset
The `analyze_dataset` function, stored in `functions.py`, performs an initial exploration of the CSV metadata and the image files. It reports the number of CSV entries and image files, the category distribution, a preview of the titles, any missing values, and the average file size and dimensions of a random sample of up to 50 images. This analysis helps us understand the dataset structure and spot potential issues before proceeding with model training.

In [None]:
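import os
from pathlib import Path

import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm


def analyze_dataset(CSV_PATH, IMAGES_DIR):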
    """Analyze both metadata and images"""
    print("\nLoading and analyzing dataset...")

    # Load metadata
    df = pd.read_csv(CSV_PATH)
    print("\n1. BASIC DATASET INFORMATION")
    print("-" * 50)
    print(f"Total number of entries in CSV: {len(df)}")

    # Count images
    image_files = list(Path(IMAGES_DIR).glob("*.jpg"))
    print(f"Total number of images found: {len(image_files)}")

    # Category analysis
    print("\n2. CATEGORY ANALYSIS")
    print("-" * 50)

    # Categories
    categories = df['categories'].value_counts()
    print("\nCategories:")
    print(categories)
    print(f"Total number of categories: {len(categories)}")

    # Title analysis (optional)
    print("\n3. TITLE ANALYSIS")
    print("-" * 50)
    print("Sample of titles (first 5):")
    print(df['title'].head())

    # Check for missing values
    missing_values = df.isnull().sum()
    if missing_values.any():
        print("\n4. MISSING VALUES")
        print("-" * 50)
        print(missing_values[missing_values > 0])

    # Analyze a small sample of images for size information
    print("\n5. IMAGE SAMPLE ANALYSIS")
    print("-" * 50)
    sample_size = 50
    sample_files = np.random.choice(image_files, min(sample_size, len(image_files)), replace=False)

    file_sizes = []
    dimensions = []

    for img_path in tqdm(sample_files, desc="Analyzing sample images"):
        try:
            # Get file size in KB
            file_sizes.append(os.path.getsize(img_path) / 1024)

            # Get dimensions
            with Image.open(img_path) as img:
                width, height = img.size
                dimensions.append((width, height))

        except Exception as e:
            print(f"Error processing {img_path}: {e}")

    if file_sizes and dimensions:
        print(f"\nAverage file size: {np.mean(file_sizes):.2f} KB")
        print(f"Average dimensions: {np.mean([d[0] for d in dimensions]):.0f} x {np.mean([d[1] for d in dimensions]):.0f}")

To run the function above, execute the following cell:

In [None]:
from functions import analyze_dataset

CSV_PATH = "Dataset/styles.csv"
IMAGES_DIR = "Dataset/train_images"

analyze_dataset(CSV_PATH, IMAGES_DIR)
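
Because `analyze_dataset` draws its image sample with `np.random.choice` and no fixed seed, the reported averages can differ slightly from one run to the next. The sketch below shows one way to make such a sample repeatable. The helper name `sample_paths` and its `seed` parameter are assumptions for illustration and are not part of `functions.py`.

```python
import numpy as np


def sample_paths(paths, sample_size=50, seed=0):
    """Return a repeatable random sample of at most `sample_size` paths.

    Hypothetical helper: not part of functions.py.
    """
    if not paths:
        return []
    # A seeded generator yields the same sample on every run
    rng = np.random.default_rng(seed)
    n = min(sample_size, len(paths))
    # Sample indices without replacement, then look up the paths
    indices = rng.choice(len(paths), size=n, replace=False)
    return [paths[i] for i in indices]
```

Calling `sample_paths(image_files, sample_size=50, seed=0)` twice returns the same files, which makes the sample statistics comparable across runs.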

## Console output

1. BASIC DATASET INFORMATION
--------------------------------------------------
Total number of entries in CSV: 46229
Total number of images found: 42000

2. CATEGORY ANALYSIS
--------------------------------------------------

Categories:
categories
Arts, Crafts & Sewing        2225
Beauty                       2202
Grocery & Gourmet Food       2201
Sports & Outdoors            2201
Automotive                   2200
Industrial & Scientific      2200
Musical Instruments          2200
Appliances                   2200
Office Products              2200
All Beauty                   2200
Toys & Games                 2200
Electronics                  2200
All Electronics              2200
Cell Phones & Accessories    2200
Patio, Lawn & Garden         2200
Baby                         2200
Baby Products                2200
Health & Personal Care       2200
Tools & Home Improvement     2200
Clothing, Shoes & Jewelry    2200
Pet Supplies                 2200
Name: count, dtype: int64
Total number of categories: 21

3. TITLE ANALYSIS
--------------------------------------------------
Sample of titles (first 5):
0                     TUNGSTEN SOLDER PICK WITH HANDLE
1    Write Right 98167 Screen Protector for Sony T615C
2    Casio Mens DBC310-1 Databank 300 Digital Watch...
3    Factory-Reconditioned DEWALT DW260KR Heavy-Dut...
4                               Energizer 2 in 1 Light
Name: title, dtype: object

4. MISSING VALUES
--------------------------------------------------
title             1
description    1042
dtype: int64

5. IMAGE SAMPLE ANALYSIS
--------------------------------------------------
Analyzing sample images: 100%|██████████| 50/50 [00:00<00:00, 1784.95it/s]

Average file size: 5.42 KB
Average dimensions: 100 x 100

## Key observations
- __Data completeness:__
  - The CSV has 46,229 entries but only 42,000 images were found, so 4,229 entries have no matching image.
  - Missing values: 1,042 descriptions and 1 title.

- __Category distribution:__
  - 21 distinct categories, closely balanced (2,200 to 2,225 items each).
  - Wide variety of product types, from "Arts, Crafts & Sewing" to "Pet Supplies".
  - More general categories than in our previous dataset.

- __Image characteristics:__
  - Small images, averaging 100 x 100 pixels across the 50 sampled files.
  - Very small file sizes (5.42 KB on average in the sample).
  - Both properties keep processing efficient.
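
The dimensions reported by `analyze_dataset` are averages over a 50-image sample, and an average alone cannot confirm that every image has the same size. A minimal sketch of a fuller check is shown below: it counts how many files share each (width, height) pair. The function name `count_image_dimensions` is an assumption for illustration and is not part of `functions.py`.

```python
from collections import Counter
from pathlib import Path

from PIL import Image


def count_image_dimensions(images_dir, pattern="*.jpg"):
    """Count how many images share each (width, height) pair.

    Hypothetical helper: not part of functions.py.
    """
    counts = Counter()
    for img_path in Path(images_dir).glob(pattern):
        try:
            # Image.open only identifies the file, so pixel data is not decoded
            with Image.open(img_path) as img:
                counts[img.size] += 1
        except OSError as e:
            # Covers unreadable and unidentifiable image files
            print(f"Skipping {img_path}: {e}")
    return counts


if __name__ == "__main__":
    dimension_counts = count_image_dimensions("Dataset/train_images")
    for (width, height), n in dimension_counts.most_common(5):
        print(f"{width} x {height}: {n} images")
```

If the counter holds a single key, all readable images share one size. Several keys would suggest adding a resize step before training.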

A visual inspection of the CSV file shows that all missing values are in the title and description columns. These columns are not relevant to our analysis and will be dropped.
The 4,229 missing images (46,229 CSV entries minus 42,000 images) will not affect our analysis: we still have a large, balanced dataset of 42,000 images, and our code handles missing images automatically by working only with the images that exist.
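
The two cleaning steps described here, dropping the unused text columns and keeping only entries whose image exists, could be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the column name `image_id` is hypothetical, as is the assumption that each image file is named after that value with a `.jpg` extension. The real linking column in `styles.csv` may differ.

```python
from pathlib import Path

import pandas as pd


def keep_rows_with_images(df, images_dir, id_column="image_id"):
    """Drop unused text columns and keep only rows whose image exists.

    ASSUMPTION: `id_column` is a hypothetical column whose value matches
    the image file name without the .jpg extension.
    """
    # Drop the columns that hold the missing values; skip them if absent
    df = df.drop(columns=["title", "description"], errors="ignore")

    # Collect the file stems of the images that are actually on disk
    available = {p.stem for p in Path(images_dir).glob("*.jpg")}

    # Keep only the rows that have a matching image file
    has_image = df[id_column].astype(str).isin(available)
    return df[has_image].reset_index(drop=True)
```

Comparing `len(df)` before and after the call gives the number of entries discarded for lack of an image.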