## Table of Contents

- [1.0 - Objective](#1.0)
- [2.0 - Packages](#2.0)
- [3.0 - Data Pre-Processing](#3.0)
    - [3.1 - Check for corrupt images](#3-1)
    - [3.2 - Bounding Box Calculations](#3-2)
    - [3.3 - Label Mapping and Data Organization](#3-3)
    

<a name='1.0'></a>
#### 1.0 Objective

The objective of this project is to develop a model to detect and identify fashion objects in images and provide bounding box coordinates for each identified fashion class.

This project utilizes an object detection model trained on fashion categories, specifically the YOLOv5 network from Ultralytics. 

This project is split into three notebooks:
* This notebook: data collection and pre-processing.
* Model Training
* Model Prediction

A separate blog post provides analysis and recommendations covering data collection, model training, and prediction. The link to this post is in the README.md of this repository.

<a name='2.0'></a>
#### 2.0 Packages

In [1]:
import os
import random
from functools import reduce
from glob import glob
from shutil import move
from xml.etree import ElementTree as et

import cv2
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from PIL import Image, ImageDraw

%matplotlib inline

<a name='3.0'></a>
#### 3.0 Data Pre-Processing

In [2]:
with open('../complete-the-look-dataset/datasets/raw_train.tsv', 'r') as file:
    lines = file.readlines()

# extract column names (first line)
raw_column_names = lines[0].split(' ')

# remove empty strings from the list and strip whitespace (including the trailing '\n')
column_names = [name.strip() for name in raw_column_names if name]

data = [line.split('\t') for line in lines[1:]]
df = pd.DataFrame(data, columns=column_names)

# replace the '\n' in the 'label' column with an empty string
df['label'] = df['label'].str.replace('\n', '')
df.head()

Unnamed: 0,image_signature,bounding_x,bounding_y,bounding_width,bounding_height,label
0,04fcde5521c0a404a4552329e5200673,0.14412735,0.8294209,0.32117385,0.15830207,Shoes
1,04fcde5521c0a404a4552329e5200673,0.70579976,0.016558629,0.27449077,0.38741693,Scarves & Shawls
2,04fcde5521c0a404a4552329e5200673,0.0007892251,0.0,0.41146725,0.47764385,Coats & Jackets
3,04fcde5521c0a404a4552329e5200673,0.6575671,0.56249845,0.34243292,0.43750155,Handbags
4,04fd71ac51937e85fe1bfbda62d8ef45,0.73806214,0.025290241,0.19626653,0.2942092,Handbags
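
The manual parsing above can also be done with a delimiter-aware reader. The sketch below uses Python's `csv` module on a small in-memory sample. It assumes the header row is tab-separated like the data rows, which may not hold for the actual `raw_train.tsv`, and the helper name `read_tsv_rows` is only illustrative. With pandas, `pd.read_csv(path, sep='\t')` is the analogous call.

```python
import csv
import io

def read_tsv_rows(handle):
    """Parse a tab-separated file into a list of dicts keyed by column name."""
    reader = csv.DictReader(handle, delimiter='\t')
    return [dict(row) for row in reader]

# Small in-memory sample shaped like the table above (assumed file layout)
sample = (
    "image_signature\tbounding_x\tbounding_y\tbounding_width\tbounding_height\tlabel\n"
    "04fcde5521c0a404a4552329e5200673\t0.14412735\t0.8294209\t0.32117385\t0.15830207\tShoes\n"
    "04fd71ac51937e85fe1bfbda62d8ef45\t0.73806214\t0.025290241\t0.19626653\t0.2942092\tHandbags\n"
)

rows = read_tsv_rows(io.StringIO(sample))
print(rows[0]['label'], rows[1]['label'])
```

The reader strips the line terminator, so no separate clean-up of the `label` column is needed in this variant.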


In [3]:
# Set the number of rows to collect for each label
rows_per_label = 1000

# Keep at most rows_per_label rows for each label;
# head() returns every row when a label has fewer than that
result_df = pd.concat(
    [df[df['label'] == label].head(rows_per_label) for label in df['label'].unique()]
)

result_df.head()

Unnamed: 0,image_signature,bounding_x,bounding_y,bounding_width,bounding_height,label
0,04fcde5521c0a404a4552329e5200673,0.14412735,0.8294209,0.32117385,0.15830207,Shoes
10,04fe26d41dfdd388e78d384a70dee127,0.04124359,0.45141608,0.25114998,0.28614372,Shoes
17,04fed029fe1fc1479d4919f8834c0657,0.36369017,0.47505257,0.25783226,0.304278,Shoes
22,04fef5168cbd48dc0007267cfa7f5e3d,0.30794728,0.5984399,0.1624996,0.3552931,Shoes
25,04ff04c2018aa80f8854614cc9e7345a,0.71438456,0.7589358,0.28561544,0.24106419,Shoes


In [4]:
len(result_df)

15657

In [5]:
label_counts = result_df['label'].value_counts().reset_index()
label_counts.columns = ['label', 'count']
label_counts

Unnamed: 0,label,count
0,Shoes,1000
1,Sunglasses,1000
2,Watches,1000
3,Skirts,1000
4,Scarves & Shawls,1000
5,Shorts,1000
6,Dresses,1000
7,Hats,1000
8,Jewelry,1000
9,Shirts & Tops,1000


In [6]:
images = result_df['image_signature'].unique()
len(images)

9235

In [7]:
img_df = pd.DataFrame(images, columns=['image_signature'])
img_df.head()

Unnamed: 0,image_signature
0,04fcde5521c0a404a4552329e5200673
1,04fe26d41dfdd388e78d384a70dee127
2,04fed029fe1fc1479d4919f8834c0657
3,04fef5168cbd48dc0007267cfa7f5e3d
4,04ff04c2018aa80f8854614cc9e7345a


In [8]:
img_train = tuple(img_df.sample(frac=0.8)['image_signature'])
img_test = tuple(img_df.query(f'image_signature not in {img_train}')['image_signature'])

In [9]:
len(img_train), len(img_test)

(7388, 1847)

In [10]:
train_df = result_df.query(f'image_signature in {img_train}')
test_df = result_df.query(f'image_signature in {img_test}')
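
The split above is made at the image level, so all boxes of one image land in the same set, but it is drawn without a fixed seed and will differ between runs. The following is a minimal sketch of a reproducible image-level split in plain Python. The function name `split_signatures`, the seed value, and the toy signatures are assumptions for illustration, not part of the original notebook.

```python
import random

def split_signatures(signatures, train_frac=0.8, seed=42):
    """Shuffle unique image signatures with a fixed seed and split them."""
    unique = sorted(set(signatures))   # sort so the result does not depend on input order
    rng = random.Random(seed)          # local generator, leaves the global seed untouched
    rng.shuffle(unique)
    n_train = round(len(unique) * train_frac)
    return set(unique[:n_train]), set(unique[n_train:])

# Toy signatures standing in for the real image_signature values
signatures = [f"img{i:02d}" for i in range(10)]
train_ids, test_ids = split_signatures(signatures)
print(len(train_ids), len(test_ids))
```

With the DataFrames used here, the matching rows could then be selected with `result_df[result_df['image_signature'].isin(train_ids)]`. Alternatively, `img_df.sample(frac=0.8, random_state=...)` makes the pandas call itself repeatable.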

In [12]:
def convert_to_url(signature):
    prefix = 'http://i.pinimg.com/400x/%s/%s/%s/%s.jpg'
    return prefix % (signature[0:2], signature[2:4], signature[4:6], signature)
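
As a quick check of the URL scheme, the function can be applied to the first signature in the table above. The first three byte pairs of the signature become directory levels:

```python
def convert_to_url(signature):
    prefix = 'http://i.pinimg.com/400x/%s/%s/%s/%s.jpg'
    return prefix % (signature[0:2], signature[2:4], signature[4:6], signature)

url = convert_to_url('04fcde5521c0a404a4552329e5200673')
print(url)
# http://i.pinimg.com/400x/04/fc/de/04fcde5521c0a404a4552329e5200673.jpg
```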

In [13]:
output_directory = 'test_ds'
os.makedirs(output_directory, exist_ok=True)

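# download each image once; a signature repeats for every bounding box in the image
for sig in test_df['image_signature'].unique():
    try:
        url = convert_to_url(sig)
        image = Image.open(requests.get(url, stream=True).raw)
        filename = f"{sig}.jpg"
        image.save(os.path.join(output_directory, filename))
    except Exception as e:
        print(f"Error processing image {sig}: {str(e)}")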

In [14]:
output_directory = 'train_ds'
os.makedirs(output_directory, exist_ok=True)

# download each image once; a signature repeats for every bounding box in the image
for sig in train_df['image_signature'].unique():
    try:
        url = convert_to_url(sig)
        image = Image.open(requests.get(url, stream=True).raw)
        filename = f"{sig}.jpg"
        image.save(os.path.join(output_directory, filename))
    except Exception as e:
        print(f"Error processing image {sig}: {str(e)}")

<a name='3-1'></a>

#### 3.1 Check for corrupt images

In [19]:
folder_path = './test_ds/'

for file in os.listdir(folder_path):
    if file.endswith(('.jpg', '.jpeg', '.png')):
        file_path = os.path.join(folder_path, file)

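        # a complete JPEG file ends with the end-of-image marker 0xFFD9
        with open(file_path, 'rb') as f:
            check_chars = f.read()[-2:]

        if check_chars != b'\xff\xd9':
            print(f'Not a complete image: {file}')
        else:
            # cv2.imread returns None when the file cannot be decoded
            imrgb = cv2.imread(file_path, 1)
            if imrgb is None:
                print(f'Could not decode image: {file}')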
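
The check above relies on the JPEG end-of-image marker (`0xFFD9`). As a minimal sketch, the same idea can be wrapped in a helper that reads only the first and last two bytes instead of the whole file. The start-of-image check and the name `looks_like_complete_jpeg` are additions for illustration. The check is a heuristic: a file with extra bytes after the end marker would be flagged even if it still decodes.

```python
import os
import tempfile

JPEG_SOI = b'\xff\xd8'  # start-of-image marker
JPEG_EOI = b'\xff\xd9'  # end-of-image marker

def looks_like_complete_jpeg(path):
    """Return True if the file starts with SOI and ends with EOI."""
    if os.path.getsize(path) < 4:
        return False
    with open(path, 'rb') as f:
        head = f.read(2)
        f.seek(-2, os.SEEK_END)   # jump to the last two bytes
        tail = f.read(2)
    return head == JPEG_SOI and tail == JPEG_EOI

# Demonstration on two synthetic files (marker bytes only, not real images)
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, 'good.jpg')
    bad = os.path.join(tmp, 'bad.jpg')
    with open(good, 'wb') as f:
        f.write(JPEG_SOI + b'payload' + JPEG_EOI)
    with open(bad, 'wb') as f:
        f.write(JPEG_SOI + b'payload')   # truncated: no end marker
    results = (looks_like_complete_jpeg(good), looks_like_complete_jpeg(bad))

print(results)
```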

In [None]:
# to fix corrupt images, activate venv, go to image folder, run this script on terminal
# $ mogrify -set comment 'Image rewritten with ImageMagick' *.jpg

In [None]:
train_df['filename'] = train_df['image_signature'] + '.jpg'
test_df['filename'] = test_df['image_signature'] + '.jpg'

In [None]:
train_df['filepath'] = './train_ds/' + train_df['filename']
test_df['filepath'] = './test_ds/' + test_df['filename']

In [32]:
train_df.reset_index(drop=True, inplace=True)
test_df.reset_index(drop=True, inplace=True)

<a name='3-2'></a>

#### 3.2 Bounding Box Calculations

In [34]:
# create the pixel-coordinate columns once, before looping over the rows
for col in ('xmin', 'ymin', 'xmax', 'ymax'):
    if col not in train_df.columns:
        train_df[col] = 0

for i in range(len(train_df)):
    # the image size is needed to scale the normalized box to pixels
    image = Image.open(train_df.loc[i, 'filepath'])

    x = float(train_df.loc[i, 'bounding_x'])
    y = float(train_df.loc[i, 'bounding_y'])
    w = float(train_df.loc[i, 'bounding_width'])
    h = float(train_df.loc[i, 'bounding_height'])

    train_df.loc[i, 'xmin'] = int(x * image.width)
    train_df.loc[i, 'ymin'] = int(y * image.height)
    train_df.loc[i, 'xmax'] = int((x + w) * image.width)
    train_df.loc[i, 'ymax'] = int((y + h) * image.height)


In [None]:
# create the pixel-coordinate columns once, before looping over the rows
for col in ('xmin', 'ymin', 'xmax', 'ymax'):
    if col not in test_df.columns:
        test_df[col] = 0

for i in range(len(test_df)):
    # the image size is needed to scale the normalized box to pixels
    image = Image.open(test_df.loc[i, 'filepath'])

    x = float(test_df.loc[i, 'bounding_x'])
    y = float(test_df.loc[i, 'bounding_y'])
    w = float(test_df.loc[i, 'bounding_width'])
    h = float(test_df.loc[i, 'bounding_height'])

    test_df.loc[i, 'xmin'] = int(x * image.width)
    test_df.loc[i, 'ymin'] = int(y * image.height)
    test_df.loc[i, 'xmax'] = int((x + w) * image.width)
    test_df.loc[i, 'ymax'] = int((y + h) * image.height)


In [None]:
# record each image's pixel dimensions for the normalization below
for i in range(len(train_df)):
    image = Image.open(train_df.loc[i, 'filepath'])

    train_df.at[i, 'width'] = image.width
    train_df.at[i, 'height'] = image.height

In [None]:
# record each image's pixel dimensions for the normalization below
for i in range(len(test_df)):
    image = Image.open(test_df.loc[i, 'filepath'])

    test_df.at[i, 'width'] = image.width
    test_df.at[i, 'height'] = image.height

In [None]:
# center x, center y, w, h train_df
train_df['center_x'] = ((train_df['xmax'] + train_df['xmin']) / 2) / train_df['width']
train_df['center_y'] = ((train_df['ymax'] + train_df['ymin']) / 2) / train_df['height']
train_df['w'] = (train_df['xmax'] - train_df['xmin']) / train_df['width']
train_df['h'] = (train_df['ymax'] - train_df['ymin']) / train_df['height']

In [None]:
# test_df: center_x, center_y, w, h
test_df['center_x'] = ((test_df['xmax'] + test_df['xmin']) / 2) / test_df['width']
test_df['center_y'] = ((test_df['ymax'] + test_df['ymin']) / 2) / test_df['height']
test_df['w'] = (test_df['xmax'] - test_df['xmin']) / test_df['width']
test_df['h'] = (test_df['ymax'] - test_df['ymin']) / test_df['height']
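
The cells above reach the YOLO box format (normalized center x, center y, width, height) by way of integer pixel corners. Because the source boxes are already normalized, the center can also be computed directly as `center_x = x + w / 2` and `center_y = y + h / 2`. The sketch below compares both routes. It assumes `bounding_x` and `bounding_y` are the normalized top-left corner, consistent with how `xmin` and `xmax` are derived above. The function names and the 400x600 image size are made up for illustration.

```python
def to_yolo_box(x, y, w, h):
    """Convert a normalized top-left box (x, y, w, h) to YOLO (center_x, center_y, w, h)."""
    return (x + w / 2, y + h / 2, w, h)

def to_yolo_box_via_pixels(x, y, w, h, img_w, img_h):
    """Same conversion, but through integer pixel corners as in the cells above."""
    xmin, ymin = int(x * img_w), int(y * img_h)
    xmax, ymax = int((x + w) * img_w), int((y + h) * img_h)
    return (
        (xmax + xmin) / 2 / img_w,
        (ymax + ymin) / 2 / img_h,
        (xmax - xmin) / img_w,
        (ymax - ymin) / img_h,
    )

# First "Shoes" row from the table above; the image size is hypothetical
box = (0.14412735, 0.8294209, 0.32117385, 0.15830207)
direct = to_yolo_box(*box)
via_pixels = to_yolo_box_via_pixels(*box, img_w=400, img_h=600)
print(direct)
print(via_pixels)
```

The two results differ only by the truncation introduced by `int()`, which is at most about one pixel per coordinate.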

In [54]:
len(train_df), len(test_df)

(12523, 3134)

<a name='3-3'></a>

#### 3.3 Label Mapping and Data Organization

In [55]:
def label_encoding(x):
    labels = {
        'Pants': 0,
        'Handbags': 1,
        'Shirts & Tops': 2,
        'Shoes': 3,
        'Scarves & Shawls': 4,
        'Jewelry': 5,
        'Skirts': 6,
        'Coats & Jackets': 7,
        'Hats': 8,
        'Dresses': 9,
        'Shorts': 10,
        'Watches': 11,
        'Sunglasses': 12,
        'Jumpsuits & Rompers': 13,
        'Socks': 14,
        'Rings': 15,
        'Belts': 16,
        'Gloves & Mittens': 17,
        'Swimwear': 18,
        'Stockings': 19,
        'Neckties': 20
    }
    return labels[x]

In [None]:
train_df.loc[:, 'id'] = train_df['label'].apply(label_encoding)
test_df.loc[:, 'id'] = test_df['label'].apply(label_encoding)

In [62]:
# Define the mapping of old values to new values
label_mapping = {
    'Shirts & Tops': 'Shirts',
    'Scarves & Shawls': 'Scarves',
    'Coats & Jackets': 'Coats',
    'Jumpsuits & Rompers': 'Jumpsuits',
    'Gloves & Mittens': 'Gloves'
}

In [None]:
test_df['label'] = test_df['label'].replace(label_mapping)
train_df['label'] = train_df['label'].replace(label_mapping)

In [None]:
gloves_df = test_df[test_df['label'] == 'Gloves']

In [None]:
jumpsuits_df = train_df[train_df['label'] == 'Jumpsuits']

In [68]:
train_df.to_csv('train_df_objdetect.csv', index=False)
test_df.to_csv('test_df_objdetect.csv', index=False)

In [72]:
train_folder = './data_images/train_yolo'
test_folder = './data_images/test_yolo'

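# makedirs also creates the parent ./data_images folder and does not fail on re-runs
os.makedirs(train_folder, exist_ok=True)
os.makedirs(test_folder, exist_ok=True)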

In [73]:
cols = ['filename', 'id', 'center_x', 'center_y', 'w', 'h']
groupby_obj_train = train_df[cols].groupby('filename')
groupby_obj_test = test_df[cols].groupby('filename')

In [74]:
# test on a sample file before implementing onto the entire dataset

groupby_obj_train.get_group('04fcde5521c0a404a4552329e5200673.jpg').set_index('filename').to_csv('sample.txt', sep=' ', index=False, header=False)

In [78]:
# save each image in train/test folder and respective labels in .txt
def save_data(filename, src_folder, folder_path, groupby_obj):
    # move the image from its download folder to the destination folder
    src = os.path.join(src_folder, filename)
    dst = os.path.join(folder_path, filename)
    move(src, dst)

    # save the labels: one "id center_x center_y w h" row per bounding box
    text_filename = os.path.join(folder_path, os.path.splitext(filename)[0]+'.txt')
    groupby_obj.get_group(filename).set_index('filename').to_csv(text_filename, sep=' ', index=False, header=False)

In [None]:
filename_series_test = pd.Series(groupby_obj_test.groups.keys())
filename_series_test.apply(save_data, args=('./test_ds/', test_folder, groupby_obj_test))

In [79]:
filename_series = pd.Series(groupby_obj_train.groups.keys())
filename_series.apply(save_data, args=('./train_ds/', train_folder, groupby_obj_train))

0       None
1       None
2       None
3       None
4       None
        ... 
7383    None
7384    None
7385    None
7386    None
7387    None
Length: 7388, dtype: object
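
Each `.txt` file written above holds one line per bounding box: the class id followed by `center_x`, `center_y`, `w`, and `h`, separated by spaces. The sketch below renders that layout in plain Python. The helper name `format_yolo_labels` and the box values are illustrative, and rounding to six decimals is a choice of this sketch, whereas the `to_csv` call above writes the floats at full precision.

```python
def format_yolo_labels(boxes):
    """Render (class_id, center_x, center_y, w, h) tuples as YOLO label-file text."""
    lines = [
        f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
        for class_id, cx, cy, w, h in boxes
    ]
    return "\n".join(lines) + "\n"

# Illustrative values: class 3 is 'Shoes' and class 1 is 'Handbags' in label_encoding
boxes = [
    (3, 0.305, 0.908, 0.32, 0.158),
    (1, 0.829, 0.781, 0.342, 0.437),
]
text = format_yolo_labels(boxes)
print(text, end="")
```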

Final Check for corrupt images

In [81]:
folder_path = './data_images/test_yolo/'

for file in os.listdir(folder_path):
    if file.endswith(('.jpg', '.jpeg', '.png')):
        file_path = os.path.join(folder_path, file)

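        # a complete JPEG file ends with the end-of-image marker 0xFFD9
        with open(file_path, 'rb') as f:
            check_chars = f.read()[-2:]

        if check_chars != b'\xff\xd9':
            print(f'Not a complete image: {file}')
        else:
            # cv2.imread returns None when the file cannot be decoded
            imrgb = cv2.imread(file_path, 1)
            if imrgb is None:
                print(f'Could not decode image: {file}')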

Create YAML file