<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Extract-Categorization-Information-from-the-Outfits-Recommendation-Data" data-toc-modified-id="Extract-Categorization-Information-from-the-Outfits-Recommendation-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Extract Categorization Information from the Outfits Recommendation Data</a></span></li><li><span><a href="#Preprocessing" data-toc-modified-id="Preprocessing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preprocessing</a></span></li><li><span><a href="#Categorization-of-Products" data-toc-modified-id="Categorization-of-Products-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Categorization of Products</a></span></li><li><span><a href="#Recommendation-Function" data-toc-modified-id="Recommendation-Function-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Recommendation Function</a></span></li></ul></div>

**For Part 2 of the project, we first wanted to categorize all products in the product data, and we checked out the outfir combinations data to see the `outfit_item_type` column in order to detect how Behold categorizes the products. Then we looked at what are the common product items included in each categorization so that we could build up our own categorization and applied on the product data.**

In [1]:
import pandas as pd
import numpy as np
import re
import warnings
import spacy
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from keras.preprocessing.text import Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
nltk_stopwords = set(stopwords.words('english'))
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('outfit_combinations USC.csv')

In [3]:
df.head()

Unnamed: 0,outfit_id,product_id,outfit_item_type,brand,product_full_name
0,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2P5H24WK0HTK4R0A1,bottom,Eileen Fisher,Slim Knit Skirt
1,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2PEPWFTT7RMP5AA1T,top,Eileen Fisher,Rib Mock Neck Tank
2,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2S5T9W793F4CY41HE,accessory1,kate spade new york,medium margaux leather satchel
3,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2ZFDYRYY5TRQZJTBD,shoe,Tory Burch,Penelope Mid Cap Toe Pump
4,01DMHCX50CFX5YNG99F3Y65GQW,01DMBRYVA2P5H24WK0HTK4R0A1,bottom,Eileen Fisher,Slim Knit Skirt


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5291 entries, 0 to 5290
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   outfit_id          5291 non-null   object
 1   product_id         5291 non-null   object
 2   outfit_item_type   5291 non-null   object
 3   brand              5291 non-null   object
 4   product_full_name  5291 non-null   object
dtypes: object(5)
memory usage: 206.8+ KB


In [5]:
df.shape

(5291, 5)

## Extract Categorization Information from the Outfits Recommendation Data

In [6]:
df.outfit_id.nunique()

1137

In [7]:
# Behold characterizes their products into shoe, accesory, top, bottom, and onepiece. 

df['outfit_item_type'].value_counts()

shoe          1149
accessory1    1064
accessory2     978
top            950
bottom         928
onepiece       221
accessory3       1
Name: outfit_item_type, dtype: int64

In [8]:
df.head()

Unnamed: 0,outfit_id,product_id,outfit_item_type,brand,product_full_name
0,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2P5H24WK0HTK4R0A1,bottom,Eileen Fisher,Slim Knit Skirt
1,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2PEPWFTT7RMP5AA1T,top,Eileen Fisher,Rib Mock Neck Tank
2,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2S5T9W793F4CY41HE,accessory1,kate spade new york,medium margaux leather satchel
3,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2ZFDYRYY5TRQZJTBD,shoe,Tory Burch,Penelope Mid Cap Toe Pump
4,01DMHCX50CFX5YNG99F3Y65GQW,01DMBRYVA2P5H24WK0HTK4R0A1,bottom,Eileen Fisher,Slim Knit Skirt


In [9]:
df['product_full_name'] = df['product_full_name'].str.lower()

**For this part, we extracted the last word in the `product_full_name` column to get an idea on what kinds of products are categorized in each category.**

In [10]:
# Get commonly appreared products in bottom category

bottom_list = []
bottom = df[df['outfit_item_type'] == 'bottom'].reset_index()
for i in bottom['product_full_name'].str.extract(r'\b(\w+)\W*$')[0]:
    if i not in bottom_list:
        bottom_list.append(i)

In [11]:
bottom = r'\b(skirt|pant|jean|dress|cargo|leg|crop|trouser|tights|skinny|slim|shorts)'

In [12]:
# Get commonly appreared products in top category

top_list = []
top = df[df['outfit_item_type'] == 'top'].reset_index()
for i in top['product_full_name'].str.extract(r'\b(\w+)\W*$')[0]:
    if i not in top_list:
        top_list.append(i)

In [13]:
top = r'\b(tank|blouse|top|sweater|shirt|tee|yarn|popover|camisole|turtleneck|hoodie|sweatshirt|neckline|bra|pullover)'

In [14]:
# Get commonly appreared products in shoe category

shoe_list = []
shoe = df[df['outfit_item_type'] == 'shoe'].reset_index()
for i in shoe['product_full_name'].str.extract(r'\b(\w+)\W*$')[0]:
    if i not in shoe_list:
        shoe_list.append(i)

In [15]:
shoe = r"\b(shoe|boot|flat|heel|sneaker|sandal|loafer|espadrille|oxford|moccasin|monkstrap|mule|slide|slingback|buckle)"

In [16]:
# Get commonly appreared products in onepiece category

onepiece_list = []
onepiece = df[df['outfit_item_type'] == 'onepiece'].reset_index()
for i in onepiece['product_full_name'].str.extract(r'\b(\w+)\W*$')[0]:
    if i not in onepiece_list:
        onepiece_list.append(i)

In [17]:
onepiece = r"\b(one piece|one-piece|all-in-one|onepiece|jumpsuit|playsuit|bodysuit|overall|shirtdress|minidress)"

In [18]:
# Get commonly appreared products in accesory category

acce_list = []
acce = df[(df['outfit_item_type'] == 'accessory1') | (df['outfit_item_type'] == 'accessory2')].reset_index()
for i in acce['product_full_name'].str.extract(r'\b(\w+)\W*$')[0]:
    if i not in acce_list:
        acce_list.append(i)

In [19]:
accesory = r"\b(jacket|cardigan|coat|topcoat|satchel|clutch|bag|trench|tote|scarf|noir|blazer|shawl|briefcase|backpack|camisole|vest|chain|pouch|shopper|sunglasses|glasses)"

## Preprocessing

In [20]:
# Import the Behold product data

data = pd.read_excel('Behold+product+data+04262021.xlsx')

In [21]:
data.head(2)

Unnamed: 0,product_id,brand,brand_category,name,details,created_at,brand_canonical_url,description,brand_description,brand_name,product_active
0,01EX0PN4J9WRNZH5F93YEX6QAF,Two,Unknown,Khadi Stripe Shirt-our signature shirt,,2021-01-27 01:17:19.305 UTC,https://two-nyc.myshopify.com/products/white-k...,Our signature khadi shirt\navailable in black ...,Our signature khadi shirt\n\navailable in blac...,Khadi Stripe Shirt-our signature shirt,True
1,01F0C4SKZV6YXS3265JMC39NXW,Collina Strada,Unknown,RUFFLE MARKET DRESS LOOPY PINK SISTINE TOMATO,,2021-03-09 18:43:10.457 UTC,https://collina-strada-2.myshopify.com/product...,Mid-length dress with ruffles and adjustable s...,Mid-length dress with ruffles and adjustable s...,RUFFLE MARKET DRESS LOOPY PINK SISTINE TOMATO,True


In [22]:
columns = ['brand_category', 'details', 'name', 'description']

# Change text to lower case
for col in columns:
    data[col] = data[col].str.lower()

In [23]:
# Replace the newline characters

data['description'] = data['description'].str.replace("\n", " ")
data['details'] = data['details'].str.replace("\n", " ")
data['brand_category'] = data['details'].str.replace("\n", " ")
data['name'] = data['name'].str.replace("\n", " ")

In [24]:
# Create functions used in preprocessing

def remove_punctuations(text):
    for punctuation in '!%"#$&\'()*+,-./:;<=>?@[\\]^_`{|}~':
        text = text.replace(punctuation, '')
    return text
def remove_stopwords(text):
    words = text.split(" ")
    temp = []
    for word in words:
        if word in nltk_stopwords:
            continue
        temp.append(word)
    cleaned = " ".join(temp)
    return cleaned
def lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    words = text.split(" ")
    temp = []
    for word in words:
        res = lemmatizer.lemmatize(word)
        temp.append(res)
    cleaned = " ".join(temp)
    return cleaned
def more_preprocess(text):
    result = re.sub(r'—', ' ', text)
    result = re.sub(r'•', ' ', result)
    result = re.sub(r'\s+', ' ', result)
    result = re.sub(r'\b[A-Z]?pattern[A-Z]?\b', '', result)
    return result

In [25]:
# Apply the functions created in columns of the product data

data['description'] = data['description'].fillna('Unknown')
data['name'] = data['name'].fillna('Unknown')
data['brand_category'] = data['brand_category'].fillna('Unknown')
data['details'] = data['details'].fillna('Unknown')

In [26]:
data['description'] = data['description'].apply(remove_punctuations)
data['description'] = data['description'].apply(more_preprocess)
data['name'] = data['name'].apply(remove_punctuations)
data['name'] = data['name'].apply(more_preprocess)
data['brand_category'] = data['brand_category'].apply(remove_punctuations)
data['brand_category'] = data['brand_category'].apply(more_preprocess)
data['details'] = data['details'].apply(remove_punctuations)
data['details'] = data['details'].apply(more_preprocess)

## Categorization of Products

In [27]:
# Create the category variable

data['category'] = 'Unknown'

In [28]:
# Define function used in categorization

def get_category(column):
    for i in range(len(data)):
        if data['category'][i] == 'Unknown':
            if len(re.findall(r"(skirt|pant|jean|dress|cargo|leg|crop|trouser|tights|skinny|slim|short|highwaist|denim|wash|saratoga)", data[column][i], flags=re.IGNORECASE)) >= 1:
                data['category'][i] = 'bottom'
            elif len(re.findall(r"(shoe|boot|flat|heel|sneaker|sandal|loafer|espadrille|oxford|moccasin|monkstrap|mule|slide|slingback|buckle|feet|sole)", data[column][i], flags=re.IGNORECASE)) >= 1:
                data['category'][i] = 'shoe'
            elif len(re.findall(r"(accesories|jewelry|accesory|jewelries|jacket|cardigan|coat|topcoat|satchel|clutch|bag|hat|trench|tote|scarf|noir|blazer|shawl|wallet|card|briefcase|backpack|camisole|vest|chain|pouch|shopper|sunglasses|glasses|band|fedora|catchall|cap|beret|mask|face|tunic|earring(?:s)|necklace|bracelet|band)", data[column][i], flags=re.IGNORECASE)) >= 1:
                data['category'][i] = 'accesory'
            elif len(re.findall(r'(tank|blouse|top|sweater|shirt|tee|cardigan|yarn|popover|camisole|turtleneck|hoodie|sweat|neckline|bra|pullover|sleeve|neck|cape)', data[column][i], flags=re.IGNORECASE)) >= 1:
                data['category'][i] = 'top'
            elif len(re.findall(r"(one piece|one-piece|all-in-one|onepiece|jumpsuit|playsuit|bodysuit|overall|shirtdress|minidress|robe|caftan)", data[column][i], flags=re.IGNORECASE)) >= 1:
                data['category'][i] = 'onepiece'

In [29]:
# Apply the fucntion on description, name, details, and brand_category columns to extract information on category

get_category('description')

In [30]:
data['category'].value_counts()

bottom      28734
Unknown     13100
accesory     8532
top          5568
shoe         5244
onepiece      177
Name: category, dtype: int64

In [31]:
get_category('name')

In [32]:
data['category'].value_counts()

bottom      32340
accesory     9352
top          7097
Unknown      6667
shoe         5591
onepiece      308
Name: category, dtype: int64

In [33]:
get_category('details')

In [34]:
data['category'].value_counts()

bottom      33424
accesory     9377
top          7168
shoe         5616
Unknown      5462
onepiece      308
Name: category, dtype: int64

In [35]:
get_category('brand_category')

In [36]:
data['category'].value_counts()

bottom      33424
accesory     9377
top          7168
shoe         5616
Unknown      5462
onepiece      308
Name: category, dtype: int64

In [37]:
def get_other(x):
    try:
        x = re.sub('Unknown', 'Other',x)
        return x
    except:
        return x

In [38]:
data['category'] = data['category'].apply(get_other)

In [39]:
data['category'].value_counts()

bottom      33424
accesory     9377
top          7168
shoe         5616
Other        5462
onepiece      308
Name: category, dtype: int64

In [40]:
# Perform preprocessing and data cleaning

data['description'] = data['description'].apply(remove_stopwords)
data['description'] = data['description'].apply(lemmatize)
data['name'] = data['name'].apply(remove_stopwords)
data['name'] = data['name'].apply(lemmatize)
data['brand_category'] = data['brand_category'].apply(remove_stopwords)
data['brand_category'] = data['brand_category'].apply(lemmatize)
data['details'] = data['details'].apply(remove_stopwords)
data['details'] = data['details'].apply(lemmatize)

## Recommendation Function

In [41]:
#First, we add data from different features to the same "text" feature.
data.brand=[str(a) for a in data.brand]
data['text'] = (data['brand'] + ' ' + data['brand_category']+' ' + data['name']+ ' ' + data['details'] + ' ' + data['category'] + ' ' + data['description']).apply(str)

In [42]:
def recommendation(query):
    """
    query is a string that is provided by the user, and this function returns a dictionary of outfit results.
    The result of each category of outfit contains product full name and corresponding product id.
    Example:
    search("Black Satin Pandora Dress") -> { 
    "top": "Satin Vneck Tie Top (01EHAYBHVC24EHHRV9WYVPY8TK)",
    "bottom": "Pandora Sweater New Arrival (01EZ7H2QFAY6WNHPSCVGFCTEVZ)",
    "shoe": "Willow Iii Mustard Satin (01E96FFWR4MKEDDDBWN3RVW3Z7)",
    "accesory": "Small Satin Clutch (01EP5T1EVBZMVX3RYGF2D5DCG2)",
    "onepiece": "Satin Button Jump Suit (01EMPJK353KB6PYHKNY3982JK0)"
    } 
    
    """
    
    #First, we clean the query
    query = remove_punctuations(query)
    query = remove_stopwords(query)
    query = more_preprocess(query)
    query = lemmatize(query)
    
    #Then compute the tfidf of query and our dataset
    query=[query]
    X = list(data['text'].values)
    vectorizer = TfidfVectorizer(min_df=4, max_df=0.85)
    #remove words which appear less than 4 times or frquency is more than 0.85 
    
    X = vectorizer.fit_transform(X)
    query = vectorizer.transform(query)
    tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
    tfidf_query = pd.DataFrame(query.toarray(), columns=vectorizer.get_feature_names())          
    
    #After that, compute cosine similarity
    data['similarity'] = cosine_similarity(tfidf_df, tfidf_query)
    idx = data.groupby(['category'])['similarity'].transform(max)  == data['similarity']
    similar = data[idx]
    temp = [i for i in similar['product_id']]
    
    #find out the highest similarity score item of each category
    return_list = ['top','bottom','shoe','accesory','onepiece']
    return_dict = {}
    for cat in return_list:
        return_dict[cat] = similar.loc[data['category'] == cat].reset_index().name[0].title()
        return_dict[cat] += ' '+'('+similar.loc[data['category'] == cat].reset_index().product_id[0]+')'
    result = ''
    for item in return_list:
        result += item + ': ' + return_dict[item] + '\n'
    output_type = 0
    
    similar = similar.sort_values(by = 'similarity', ascending = False)
    
    #Check if the item exist in outfit dataset, if yes, output the the corresponding outfit.
    #If no, output the original result.
    for pro_id in similar['product_id']:
        if df['product_id'].str.contains(pro_id).sum() >= 1:
            output_type = 1
            outfit = df[df['product_id'] == pro_id].outfit_id[0]
            outfit_all = df[df['outfit_id'] == outfit]
            new_return_list = [i for i in outfit_all['outfit_item_type']]
            new_return_dict = {}
            for cat in return_list:
                new_return_dict[cat] = outfit_all.loc[outfit_all['outfit_item_type'] == cat, 'product_full_name'].item().title()
                new_return_dict[cat] += ' '+'('+outfit_all.loc[outfit_all['outfit_item_type'] == cat, 'product_id'].item()+')'
            new_result = ''
            for item in new_return_list:
                result += item + ': ' + new_return_dict[item] + '\n'

    if output_type == 0:
        print(result)
    else:
        print(new_result)
    

In [43]:
# Test1

recommendation('Black Satin Pandora Dress')

top: Satin Vneck Tie Top (01EHAYBHVC24EHHRV9WYVPY8TK)
bottom: Pandora Sweater New Arrival (01EZ7H2QFAY6WNHPSCVGFCTEVZ)
shoe: Willow Iii Mustard Satin (01E96FFWR4MKEDDDBWN3RVW3Z7)
accesory: Small Satin Clutch (01EP5T1EVBZMVX3RYGF2D5DCG2)
onepiece: Satin Button Jump Suit (01EMPJK353KB6PYHKNY3982JK0)



In [44]:
# Test2

recommendation('Off-White Linen Openwork Crochet Stitch Cardigan')

top: Dilay Crochet Pullover (01EC39Z55G7PJRSFNMFJFSNTP6)
bottom: Crochet Cardigan Mauve (01EYJZ74HT0AC9JQZ8ZRYGBKGY)
shoe: Perrin Cardigan (01EC3AXR1YBHAHN1MDFCGJHYRR)
accesory: Crochet Cardigan Mustard (01ENTC62495Y55ZB68SR5ZK21C)
onepiece: Priya Romper (01EFDA5EF5NY5PJRS8Z1BESN2D)



In [45]:
# Test3
recommendation('White Leather Church Boots')

top: Tshirt Mil White (01EWHDJG0DKSFE593AH9XZ26TV)
bottom: White Sheer Dhaka White Embroidery (01ETV33P5G00JHHFR9KDJ1B45X)
shoe: White Lambskin (01ESS0TPGAHRGWWVJRNC3CHQFM)
accesory: White Leather Foldover Tote (01EHB2QAN8DR16WFQ2JXXZBQB9)
onepiece: White Jamdani Caftan1 Left (01ETV335QGZKT81KEX4NDB4WAX)

