## Case Study: Enhancing Product Discovery for a Fashion Retailer with AI-Driven Image Classification

### Problem Statement

FashionTrend, a leading online fashion retailer, faces increasing challenges in managing its vast and diverse product catalog. As the product range grows, customers struggle to find items that match their preferences due to inconsistent categorization and limited filtering options. This results in a less-than-optimal shopping experience and potential loss in sales.

To address this, FashionTrend aims to streamline product categorization by developing an AI-based image classification system to automatically categorize each new product image. This solution intends to:

- **Enhance Search and Filtering**: Improve customers' ability to find items quickly and accurately by organizing products into detailed categories such as gender, category (e.g., apparel, accessories), and type (e.g., shirts, jeans, watches).
- **Optimize Inventory Management**: Support the operations team by ensuring each item is categorized correctly, making it easier to track and restock products according to demand.

### Project Approach

#### Objective
The primary objective of this project is to build a robust machine learning model that categorizes products from their images across multiple attributes, including `masterCategory` and `gender`. Automating this labeling process will improve product discoverability and inventory management efficiency.

#### Dataset Overview

1. **styles.csv**: Contains metadata for each product, including attributes like gender, masterCategory, subCategory, articleType, baseColour, season, year, usage, and productDisplayName.
2. **Images**: High-quality product images labeled with IDs that correspond to entries in `styles.csv`.

#### Proposed Steps

1. **Data Exploration and Preprocessing**:
   - **Inspect Metadata**: Analyze `styles.csv` to understand the distribution of categories and prepare labels.
   - **Image Preparation**: Resize and standardize images for model training, matching each image to its corresponding metadata in `styles.csv`.

2. **Building the Model**:
   - **Image Classification Model**: Develop a convolutional neural network (CNN) to classify images by multiple attributes. Begin with `masterCategory` (e.g., Apparel, Accessories) and expand to `gender`.
   - **Multi-Label Classification**: Predict multiple attributes for each image, making the model versatile for complex queries. In the code that follows, this is implemented by joining `masterCategory` and `gender` into one combined class label of the form `<masterCategory>_<gender>` and training a softmax classifier over the combined classes.

3. **Model Evaluation and Optimization**:
   - **Accuracy Metrics**: Evaluate model performance using metrics such as accuracy, precision
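
As a minimal sketch of the metrics named above, accuracy and per-class precision can be computed directly from true and predicted labels. The label values and function names here are hypothetical, chosen only for illustration; in practice a library such as scikit-learn would typically be used.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision_per_class(y_true, y_pred):
    """For each predicted class: correct predictions / all predictions of that class."""
    predicted = Counter(y_pred)
    true_positives = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {label: true_positives[label] / count for label, count in predicted.items()}

# Hypothetical labels, for illustration only
y_true = ["Apparel", "Apparel", "Footwear", "Accessories", "Footwear"]
y_pred = ["Apparel", "Footwear", "Footwear", "Accessories", "Apparel"]

print("Accuracy:", accuracy(y_true, y_pred))
print("Precision per class:", precision_per_class(y_true, y_pred))
```

Per-class precision is worth tracking alongside accuracy when class sizes differ, because a model can reach high accuracy while performing poorly on small classes.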

  



In [None]:
## Set seeds for reproducibility
import random
random.seed(0)

import numpy as np
np.random.seed(0)

import tensorflow as tf
tf.random.set_seed(0)

import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt

## Import the remaining dependencies
import os
import json
from zipfile import ZipFile
from PIL import Image

import matplotlib.image as mpimg
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import layers, models
from tensorflow.keras.models import load_model

### Kaggle key

In [None]:
try:
    # Read the Kaggle API credentials and expose them as environment variables
    with open("kaggle.json") as f:
        kaggle_credentials = json.load(f)
    os.environ['KAGGLE_USERNAME'] = kaggle_credentials["username"]
    os.environ['KAGGLE_KEY'] = kaggle_credentials["key"]
except FileNotFoundError:
    print("kaggle.json file not found. Please place it in the directory.")
except KeyError:
    print("Error: kaggle.json is missing the 'username' or 'key' field.")

In [None]:
import kaggle

# Define the Kaggle dataset slug, the extraction folder, and the archive name.
# The Kaggle API saves the archive as "<dataset-name>.zip" in the download path,
# so the zip file name is derived from the slug.
dataset_slug = "bhavikjikadara/e-commerce-products-images"
dataset_name = 'fashionData'
zip_file_name = dataset_slug.split('/')[-1] + '.zip'

# Check if the dataset folder exists
if not os.path.exists(dataset_name):
    # Download the archive only if it is not already present
    if not os.path.exists(zip_file_name):
        kaggle.api.dataset_download_files(dataset_slug, path='.', unzip=False)

    # Extract the archive into the dataset folder
    with ZipFile(zip_file_name, 'r') as zip_ref:
        zip_ref.extractall(dataset_name)

print(f"Dataset is ready and extracted in '{dataset_name}' directory.")

In [None]:
# Path is relative to the notebook's working directory, where the dataset was extracted
styles = pd.read_csv(os.path.join("fashionData", "styles.csv"))
styles.head()

In [None]:
styles.columns

In [None]:
styles.info()

In [None]:
duplicates = styles.duplicated().sum()
duplicates

In [None]:
# Updated color mapping dictionary based on unique baseColour values
color_mapping = {
    'Navy Blue': 'Blue', 'Blue': 'Blue', 'Turquoise Blue': 'Blue', 'Teal': 'Blue', 'Steel': 'Gray',
    'Silver': 'Gray', 'Grey': 'Gray', 'Charcoal': 'Gray', 'Grey Melange': 'Gray',
    'Green': 'Green', 'Olive': 'Green', 'Lime Green': 'Green', 'Sea Green': 'Green', 'Fluorescent Green': 'Green',
    'Purple': 'Purple', 'Lavender': 'Purple', 'Mauve': 'Purple', 'Magenta': 'Purple',
    'Black': 'Black', 'White': 'White', 'Off White': 'White', 'Cream': 'White',
    'Beige': 'Beige', 'Khaki': 'Beige', 'Skin': 'Beige', 'Taupe': 'Beige',
    'Brown': 'Brown', 'Coffee Brown': 'Brown', 'Mushroom Brown': 'Brown', 'Tan': 'Brown', 'Bronze': 'Brown', 'Copper': 'Brown',
    'Red': 'Red', 'Maroon': 'Red', 'Burgundy': 'Red', 'Rose': 'Pink', 'Pink': 'Pink',
    'Yellow': 'Yellow', 'Mustard': 'Yellow', 'Gold': 'Yellow',
    'Orange': 'Orange', 'Peach': 'Orange', 'Rust': 'Orange',
    'Multi': 'Multi', 'Metallic': 'Multi', 'Nude': 'Other'
}

# Apply the mapping to create a new column for the broader color categories
styles['mainColor'] = styles['baseColour'].map(color_mapping).fillna('Other')

# Check the unique values in the new mainColor column to verify the mapping
print(styles['mainColor'].unique())


In [None]:


def explore_selected_columns(df):
    # List of columns to explore
    columns_to_explore = ['masterCategory', 'gender', 'mainColor', 'usage', 'season']
    
    # Loop through each selected column
    for column in columns_to_explore:
        if column in df.columns:
            # Check if the column is of type object (string/categorical data)
            if df[column].dtype == 'object':
                print(f"Column: {column}")
                print("Unique Values:", df[column].unique())
                print("\n")
                
                # Plot distribution for categorical columns
                plt.figure(figsize=(8, 4))
                sns.countplot(y=column, data=df, order=df[column].value_counts().index)
                plt.title(f"Distribution of {column}")
                plt.show()
            
            # If the column is numerical, plot a histogram
            elif df[column].dtype in ['int64', 'float64']:
                print(f"Column: {column}")
                print("Statistical Summary:")
                print(df[column].describe())
                print("\n")
                
                # Plot distribution for numerical columns
                plt.figure(figsize=(8, 4))
                sns.histplot(df[column], kde=True)
                plt.title(f"Distribution of {column}")
                plt.xlabel(column)
                plt.show()

# Usage
explore_selected_columns(styles)


In [None]:

# Directory path for images
# (relative to the notebook's working directory, where the dataset was extracted)
image_dir = os.path.join("fashionData", "e-commerce", "images")

# Function to display 2 images from each category, with a check for available images
def visualize_two_samples_per_category(styles, image_dir, samples_per_category=2):
    categories = styles['masterCategory'].unique()
    
    # Set up a compact plot grid
    fig, axes = plt.subplots(len(categories), samples_per_category, figsize=(6, len(categories) * 1.8))
    fig.suptitle('Two Samples from Each Category', fontsize=10)

    for i, category in enumerate(categories):
        # Get valid image IDs for the current category, up to the required number
        category_images = styles[styles['masterCategory'] == category]['id'].tolist()
        valid_images = [img_id for img_id in category_images if os.path.exists(os.path.join(image_dir, f"{img_id}.jpg"))][:samples_per_category]

        for j, img_id in enumerate(valid_images):
            img_path = os.path.join(image_dir, f"{img_id}.jpg")
            img = mpimg.imread(img_path)
            axes[i, j].imshow(img)
            axes[i, j].axis('off')
            if j == 0:
                axes[i, j].set_title(category, loc='left', fontsize=10)

        # Turn off axes for any unused spots in the row (if fewer images than samples_per_category)
        for j in range(len(valid_images), samples_per_category):
            axes[i, j].axis('off')

    plt.tight_layout()
    plt.subplots_adjust(top=0.92, hspace=0.4)  # Adjust title position and spacing
    plt.show()

# Call the function to display 2 samples per category
visualize_two_samples_per_category(styles, image_dir)




In [None]:
# Step 1: Create a simplified multi_label column with only masterCategory and gender
styles['multi_label'] = styles[['masterCategory', 'gender']].apply(lambda x: '_'.join(x.astype(str)), axis=1)

# Step 2: Define the categories to merge into "Other_Unisex"
categories_to_merge = ['Sporting Goods_Unisex', 'Personal Care_Unisex', 'Free Items_Unisex', 'Home_Unisex']

# Step 3: Update the multi_label column, replacing specific categories with "Other_Unisex"
styles['multi_label'] = styles['multi_label'].apply(lambda x: 'Other_Unisex' if x in categories_to_merge else x)

# Step 4: Display unique names and their counts to confirm the final class distribution
unique_counts = styles['multi_label'].value_counts()
print("Updated unique names and their counts:")
print(unique_counts)

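# Step 5: Verify the total number of unique classes in multi_label
# (count distinct labels; calling nunique() on the counts would count distinct count values instead)
print("\nTotal unique classes in 'multi_label':", styles['multi_label'].nunique())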


In [None]:
# Step 1: Ensure 'id' column is formatted correctly with '.jpg' extension for image filenames
styles['id'] = styles['id'].astype(str) + ".jpg"  # Append .jpg to match image file names
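
The label construction above can be sketched without pandas as follows. The rows are hypothetical stand-ins for `styles.csv` entries, and `combined_label` is an illustrative helper rather than part of the notebook. The sketch also shows that the number of classes is the number of distinct labels, which can differ from the number of distinct count values.

```python
from collections import Counter

# Hypothetical (masterCategory, gender) pairs standing in for rows of styles.csv
rows = [
    ("Apparel", "Men"),
    ("Apparel", "Women"),
    ("Home", "Unisex"),
    ("Apparel", "Men"),
    ("Free Items", "Unisex"),
]

# Small classes that are folded into a single "Other_Unisex" class
categories_to_merge = {"Home_Unisex", "Free Items_Unisex"}

def combined_label(master_category, gender):
    """Join the two attributes into one class name, merging the listed small classes."""
    label = f"{master_category}_{gender}"
    return "Other_Unisex" if label in categories_to_merge else label

labels = [combined_label(m, g) for m, g in rows]
class_counts = Counter(labels)

num_classes = len(class_counts)                          # distinct labels
num_distinct_counts = len(set(class_counts.values()))    # distinct count values

print(class_counts)
print("Classes:", num_classes, "| distinct count values:", num_distinct_counts)
```

Treating each attribute combination as one class keeps the model a plain softmax classifier, at the cost of a class count that grows with every attribute added.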

### Image setup, Data Processing & Train Test Split

In [None]:


# Step 2: Data Processing and Train-Test Split

# Define image size and batch size for efficient processing
img_size = 128
batch_size = 32

# Step 3: Initialize ImageDataGenerator with augmentations and a validation split
# The augmentations help improve model generalization by applying small variations to the images.
data_gen = ImageDataGenerator(
    rescale=1. / 255,            # Normalize pixel values to the range [0, 1]
    rotation_range=20,           # Rotate randomly by up to 20 degrees
    width_shift_range=0.1,       # Shift horizontally by up to 10% of the width
    height_shift_range=0.1,      # Shift vertically by up to 10% of the height
    zoom_range=0.1,              # Zoom in or out by up to 10%
    shear_range=0.1,             # Add shear transformations
    horizontal_flip=True,        # Flip images horizontally at random
    validation_split=0.2         # Reserve 20% of data for validation
)

# Step 4: Configure the training data generator
# This generator will create batches of augmented images for training.
train_generator = data_gen.flow_from_dataframe(
    dataframe=styles,            # Data source: styles DataFrame containing image IDs and labels
    directory=image_dir,         # Directory where images are stored
    x_col="id",                  # Column name for image filenames in styles DataFrame
    y_col="multi_label",         # Column name for labels in styles DataFrame
    target_size=(img_size, img_size),  # Resize images to a consistent size
    batch_size=batch_size,       # Number of images per batch
    class_mode='categorical',    # Treat each unique label as a separate class
    subset='training',           # Use the training subset
    seed=42                      # Seed for reproducibility
)

# Step 5: Configure the validation data generator
# Note: this subset is drawn from the same data_gen, so the augmentations above are applied to validation images as well.
validation_generator = data_gen.flow_from_dataframe(
    dataframe=styles,            # Data source: styles DataFrame
    directory=image_dir,         # Directory containing the images
    x_col="id",                  # Image filenames column
    y_col="multi_label",         # Label column
    target_size=(img_size, img_size),  # Consistent resizing for validation images
    batch_size=batch_size,       # Batch size
    class_mode='categorical',    # Multi-class setup for categorical labels
    subset='validation',         # Use the validation subset
    seed=42                      # Seed for reproducibility
)

# Confirmation of setup completion
print("Train generator setup complete.")
print("Validation generator setup complete.")


### CNN Setup

In [None]:
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Define a deeper sequential model
model = models.Sequential()

# First Convolutional Block
model.add(layers.Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(img_size, img_size, 3)))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D(2, 2))

# Second Convolutional Block
model.add(layers.Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D(2, 2))

# Third Convolutional Block
model.add(layers.Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D(2, 2))

# Fourth Convolutional Block
model.add(layers.Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D(2, 2))

# Flatten layer to convert 3D feature maps to 1D feature vectors
model.add(layers.Flatten())

# Fully Connected Layers with Dropout for regularization
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dropout(0.5))

# Output Layer: Softmax activation for multi-class classification
num_classes = len(train_generator.class_indices)
model.add(layers.Dense(num_classes, activation='softmax'))

# Displaying the model summary
model.summary()
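
As a rough cross-check of the summary, the feature-map sizes and parameter counts of the layers above can be worked out by hand. The helper names below are illustrative; the arithmetic assumes 3x3 kernels with `'same'` padding, 2x2 max pooling, and four values per channel for each batch normalization layer (two trainable, two moving statistics).

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per output channel for a square-kernel Conv2D layer."""
    return kernel * kernel * in_channels * out_channels + out_channels

def batchnorm_params(channels):
    """gamma and beta (trainable) plus moving mean and variance (non-trainable)."""
    return 4 * channels

img_size = 128
size = img_size        # spatial size of the feature map
in_channels = 3        # RGB input
feature_params = 0

for out_channels in (32, 64, 128, 256):
    feature_params += conv_params(3, in_channels, out_channels)
    feature_params += batchnorm_params(out_channels)
    size //= 2         # 'same' padding keeps the size; 2x2 pooling halves it
    in_channels = out_channels

flat_features = size * size * in_channels    # length of the Flatten output
dense_params = flat_features * 256 + 256     # Dense(256) weights and biases

print("Final feature map:", (size, size, in_channels))
print("Convolutional blocks:", feature_params, "parameters")
print("Dense(256) layer:", dense_params, "parameters")
```

Under these assumptions, most of the parameters sit in the first dense layer after `Flatten`, which is one reason global pooling is often preferred over flattening in larger models.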


### Compile model 

In [None]:
# Compile the model with Adam optimizer and categorical crossentropy loss
model.compile(optimizer=Adam(learning_rate=0.0001),  # Adaptive optimizer with a learning rate of 0.0001
              loss='categorical_crossentropy',       # Loss function for multi-class classification
              metrics=['accuracy'])                  # Metric to evaluate performance during training
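
To make the loss concrete, here is a small, library-free sketch of softmax followed by categorical cross-entropy for one sample. The function names and the epsilon clipping value are illustrative assumptions and may differ in detail from what Keras does internally.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probabilities, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probabilities))

# Hypothetical scores for three classes; the true class is the first one
probabilities = softmax([2.0, 1.0, 0.1])
loss = categorical_crossentropy([1, 0, 0], probabilities)

# With no information (equal scores), the loss equals log(number of classes)
uniform_loss = categorical_crossentropy([1, 0, 0], softmax([0.0, 0.0, 0.0]))

print("Probabilities:", probabilities)
print("Loss:", loss, "| loss for uniform prediction:", uniform_loss)
```

The uniform case gives a useful baseline: a validation loss near the logarithm of the class count suggests the model has learned little.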


### Model Training

In [None]:
# Define callbacks for training
callbacks = [
    EarlyStopping(monitor='val_loss', patience=7, restore_best_weights=True),  # Stop if validation loss doesn't improve for 7 epochs, restoring best weights
    ModelCheckpoint('best_model.keras', save_best_only=True)  # Save only the best model during training
]

# Train the model
history = model.fit(
    train_generator,                                 # Training data generator
    steps_per_epoch=train_generator.samples // batch_size,  # Steps per epoch based on dataset size and batch size
    epochs=10,                                       # Number of epochs to train
    validation_data=validation_generator,            # Validation data generator
    validation_steps=validation_generator.samples // batch_size,  # Validation steps per epoch
    callbacks=callbacks                  # Apply callbacks for early stopping and best model saving
)
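
Both step counts above use floor division, so a final partial batch is not counted. The arithmetic can be sketched as follows, using a hypothetical total of 1,000 images; the real count depends on how many image files the generators find.

```python
import math

def split_sizes(total_images, validation_split):
    """Return (training, validation) image counts for a given split fraction."""
    validation = int(total_images * validation_split)
    return total_images - validation, validation

def steps(samples, batch_size, keep_partial_batch=False):
    """Number of batches per pass, with or without the final partial batch."""
    if keep_partial_batch:
        return math.ceil(samples / batch_size)
    return samples // batch_size

batch_size = 32
train_samples, val_samples = split_sizes(1000, 0.2)   # hypothetical dataset size

train_steps = steps(train_samples, batch_size)
val_steps = steps(val_samples, batch_size)
val_steps_full = steps(val_samples, batch_size, keep_partial_batch=True)

print("Training steps:", train_steps)
print("Validation steps (floor):", val_steps, "covering", val_steps * batch_size, "images")
print("Validation steps (ceil):", val_steps_full)
```

With floor division, a few validation images are left out of each evaluation pass whenever the sample count is not a multiple of the batch size.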



### Evaluation

In [None]:
# Model Evaluation on the validation set

print("Evaluating CNN model...")
val_loss, val_accuracy = model.evaluate(validation_generator, steps=validation_generator.samples // batch_size)  # Evaluate on validation data
print(f"Validation Accuracy: {val_accuracy * 100:.2f}%")  # Print validation accuracy in percentage


### Save model

In [None]:
### Save the model locally

model.save('cnn_fashion_model.h5')
model.save('cnn_fashion_model.keras')

### Accuracy Plot

In [None]:
# Plot training & validation accuracy values across epochs
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

# Plot training & validation loss values across epochs
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

### Class Mapping 

In [None]:
# Mapping class indices to class names
class_indices = {v: k for k, v in train_generator.class_indices.items()}  # Create class mapping

# Save class indices to JSON
with open('class_indices.json', 'w') as f:
    json.dump(class_indices, f)
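
One detail to keep in mind when this file is read back later: JSON object keys are always strings, so the integer indices become `"0"`, `"1"`, and so on. A minimal sketch of the round trip, using a hypothetical two-class mapping:

```python
import json

# Hypothetical index-to-name mapping, for illustration only
class_indices = {0: "Apparel_Men", 1: "Apparel_Women"}

# Serialize and parse again, as happens when the mapping is saved and reloaded
loaded = json.loads(json.dumps(class_indices))

# The keys come back as strings, so convert them before indexing with integers
restored = {int(index): name for index, name in loaded.items()}

print("Keys after loading:", list(loaded.keys()))
print("Keys after conversion:", list(restored.keys()))
```

Without the conversion, looking up an integer index such as the result of `np.argmax` in the reloaded dictionary would raise a `KeyError`.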


### CNN Image Preprocessing pipeline and Prediction Test:

In [None]:
# Define constants
TARGET_SIZE = (128, 128)  

# Load and preprocess a single image
def load_and_preprocess_image(image_path, target_size=TARGET_SIZE):
    """
    Load an image, resize it to target size, convert to numpy array, 
    add batch dimension, and normalize pixel values.
    """
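    # Load the image with Pillow and force three RGB channels
    # (grayscale or RGBA files would otherwise not match the model input shape)
    img = Image.open(image_path).convert('RGB')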
    # Resize to target size
    img = img.resize(target_size)
    # Convert to numpy array
    img_array = np.array(img)
    # Expand dimensions to match model input shape (1, 128, 128, 3)
    img_array = np.expand_dims(img_array, axis=0)
    # Normalize pixel values
    img_array = img_array.astype('float32') / 255.
    return img_array

# Function to predict class for a single image using CNN model
def predict_image_class_cnn(image_path, class_indices):
    """
    Predict the class of a preprocessed image using the trained CNN model.
    """
    # Load the CNN model
    cnn_model = load_model('cnn_fashion_model.keras')
    # Preprocess the input image
    preprocessed_img = load_and_preprocess_image(image_path)
    # Predict with the CNN model
    predictions = cnn_model.predict(preprocessed_img)
    # Get the index of the class with the highest prediction score
    predicted_class_index = np.argmax(predictions, axis=1)[0]
    # Convert the class index back to the class name
    predicted_class_name = class_indices[predicted_class_index]
    return predicted_class_name

# Example usage for CNN model
test_image_path = os.path.join(image_dir, "1536.jpg")
predicted_class_cnn = predict_image_class_cnn(test_image_path, class_indices)
print("Predicted Class using CNN model:", predicted_class_cnn)
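
The prediction function above returns only the single most likely class. When the probabilities themselves are of interest, a top-k view can be derived from the same softmax output. The following sketch uses a hypothetical probability vector and class mapping; `top_k_predictions` is an illustrative helper, not part of the notebook.

```python
def top_k_predictions(probabilities, class_names, k=3):
    """Return the k most probable (class name, probability) pairs, best first."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return [(class_names[index], prob) for index, prob in ranked[:k]]

# Hypothetical softmax output for one image and a hypothetical index-to-name mapping
probabilities = [0.05, 0.70, 0.20, 0.05]
class_names = {0: "Accessories_Men", 1: "Apparel_Men", 2: "Apparel_Women", 3: "Footwear_Men"}

top = top_k_predictions(probabilities, class_names, k=2)
for name, prob in top:
    print(f"{name}: {prob:.2f}")
```

Showing the runner-up classes makes it easier to spot cases where the model is torn between two similar categories.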


## MobileNetV2 Model

In [None]:
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras import layers, models
from tensorflow.keras.models import load_model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
import json

### Model setup

In [None]:
# Set input image size for MobileNetV2 (128 is one of the sizes with ImageNet weights in Keras; the default is 224)
img_size = 128

# Load the MobileNetV2 model with pre-trained weights from ImageNet
# Exclude the top layers, as we'll add custom layers for our classification task
base_model = MobileNetV2(input_shape=(img_size, img_size, 3), include_top=False, weights='imagenet')

# Freeze the base layers to retain pre-trained weights
base_model.trainable = False

# Define the model
model = models.Sequential([
    base_model,  # Use MobileNetV2 as the base
    layers.GlobalAveragePooling2D(),  # Global average pooling to reduce dimensions
    layers.Dense(256, activation='relu'),  # Dense layer with ReLU activation
    layers.Dropout(0.5),  # Dropout layer for regularization
    layers.Dense(len(train_generator.class_indices), activation='softmax')  # Output layer with softmax for classification
])

# Display the model summary
model.summary()
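
A point worth checking with a pre-trained base is input scaling. The Keras MobileNetV2 ImageNet weights were trained on pixel values scaled to the range [-1, 1] (this is what `tensorflow.keras.applications.mobilenet_v2.preprocess_input` does), whereas the generators above rescale to [0, 1]. The sketch below only illustrates the two scalings on single pixel values; the function names are illustrative.

```python
def scale_unit(pixel):
    """Scaling used by rescale=1/255: maps 0..255 to 0..1."""
    return pixel / 255.0

def scale_signed(pixel):
    """Scaling used by MobileNetV2's preprocess_input: maps 0..255 to -1..1."""
    return pixel / 127.5 - 1.0

for pixel in (0, 127.5, 255):
    print(pixel, "->", scale_unit(pixel), "vs", scale_signed(pixel))
```

Training still works with [0, 1] inputs because the new dense layers adapt, but matching the original scaling, for example through the `preprocessing_function` argument of `ImageDataGenerator`, may give the frozen base more useful features.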

### Compile Model 

In [None]:
# Compile the model
model.compile(optimizer=Adam(learning_rate=0.0001), 
              loss='categorical_crossentropy', 
              metrics=['accuracy'])

### Train Model

In [None]:
# Define callbacks for early stopping and model checkpointing
callbacks = [
    EarlyStopping(monitor='val_loss', patience=7, restore_best_weights=True),
    ModelCheckpoint('best_mobilenetv2_model.keras', save_best_only=True)
]

# Train the model
history = model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // batch_size,
    epochs=10,  
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // batch_size,
    callbacks=callbacks
)
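
The interaction between `patience=7` and `epochs=10` can be illustrated with a simplified, library-free model of early stopping. This sketch ignores options of the real Keras callback such as `min_delta`, and the validation losses are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (last_epoch_run, best_epoch), both 0-based, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran to the end without triggering early stopping
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses over 10 epochs; the best value occurs at epoch 1
val_losses = [1.0, 0.8, 0.9, 0.95, 0.9, 0.92, 0.93, 0.94, 0.95, 0.96]

print(early_stopping_epoch(val_losses, patience=7))
print(early_stopping_epoch(val_losses, patience=2))
```

With only 10 epochs, a patience of 7 can trigger only when the best validation loss occurs within the first three epochs; otherwise training simply runs to completion, and the `ModelCheckpoint` file is what preserves the best model.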


#### Evaluate Model

In [None]:


# Model Evaluation
print("Evaluating MobileNetV2 model...")
val_loss, val_accuracy = model.evaluate(validation_generator, steps=validation_generator.samples // batch_size)
print(f"MobileNetV2 Validation Accuracy: {val_accuracy * 100:.2f}%")

#### Save models

In [None]:
### Save the model locally

model.save('mobilenetv2_model.h5')
model.save('mobilenetv2_model.keras')

### Accuracy Plot

In [None]:
# Plot training & validation accuracy values across epochs
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

# Plot training & validation loss values across epochs
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

#### Class Mapping

In [None]:
# Mapping class indices to class names
class_indices = {v: k for k, v in train_generator.class_indices.items()}  # Create class mapping

# Save class indices to JSON
with open('class_indices_mobilenetv2.json', 'w') as f:
    json.dump(class_indices, f)

### Image processing and predicting Test

In [None]:


# Define constants
TARGET_SIZE = (128, 128)  # Ensure this matches the input size used for training

# Load and preprocess a single image
def load_and_preprocess_image(image_path, target_size=TARGET_SIZE):
    """
    Load an image, resize it to target size, convert to numpy array, 
    add batch dimension, and normalize pixel values.
    """
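    # Load the image with Pillow and force three RGB channels
    # (grayscale or RGBA files would otherwise not match the model input shape)
    img = Image.open(image_path).convert('RGB')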
    # Resize to target size
    img = img.resize(target_size)
    # Convert to numpy array
    img_array = np.array(img)
    # Expand dimensions to match model input shape (1, 128, 128, 3)
    img_array = np.expand_dims(img_array, axis=0)
    # Normalize pixel values
    img_array = img_array.astype('float32') / 255.
    return img_array

# Function to predict class for a single image using MobileNetV2 model
def predict_image_class_mobilenetv2(image_path, class_indices):
    """
    Predict the class of a preprocessed image using the trained MobileNetV2 model.
    """
    # Load the MobileNetV2 model
    mobilenetv2_model = load_model('mobilenetv2_model.keras')
    # Preprocess the input image
    preprocessed_img = load_and_preprocess_image(image_path)
    # Predict with the MobileNetV2 model
    predictions = mobilenetv2_model.predict(preprocessed_img)
    # Get the index of the class with the highest prediction score
    predicted_class_index = np.argmax(predictions, axis=1)[0]
    # Convert the class index back to the class name
    predicted_class_name = class_indices[predicted_class_index]
    return predicted_class_name

# Example usage for MobileNetV2 model
test_image_path = os.path.join(image_dir, "1536.jpg")
predicted_class_mobilenet = predict_image_class_mobilenetv2(test_image_path, class_indices)
print("Predicted Class using MobileNetV2 model:", predicted_class_mobilenet)
