# Outfit Recommender

Zach Cummings, Austin Martinez, Bowen Wong, Perveen Wong

## Data Import

In [27]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
import nltk

productData = pd.read_excel('Behold+product+data+04262021.xlsx')
outfits = pd.read_csv('outfit_combinations USC.csv')
#additionalTags = pd.read_csv('usc_additional_tags USC.csv')

We begin by importing the two datasets we need for this project. The first, productData, contains information about each product in the catalog, including its name, brand, and description. The second, outfits, contains premade outfits selected by Behold's experts. Both matter for our recommender: we use the product data to determine which products are most similar to each other, and when the product we match to the user's query appears in the premade outfits dataset, we return the rest of that outfit so that the user gets a relevant, expertly designed recommendation.

## Data Cleaning

We start by creating a dataset called pre (pre-processed) that contains only the columns we need for our recommender:

- the product ID (which we'll use to uniquely identify and pull products into the recommender)
- the name, details, brand category, and description (which we'll use to calculate similarity between products and engineer additional features)
- the product active flag (which we'll use to give the user the option to only receive recommendations for active products)

In [28]:
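#take an explicit copy so that later column assignments modify pre itself rather than a slice of productData
pre = productData[['product_id', 'name', 'details', 'brand_category','description', 'product_active']].copy()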

Next, we convert the text fields to strings, remove line breaks, and lowercase all text so that we can match stopwords and reduce the total number of unique tokens. We also create a copy of the pre dataset called cat, which we'll use to create a category feature. Working on a copy keeps pre intact, so we can restore the original text columns from it instead of starting from scratch.

In [29]:
pre['name'] = pre['name'].astype(str)
pre['description'] = pre['description'].astype(str)
pre['details'] = pre['details'].astype(str)
pre['brand_category'] = pre['brand_category'].astype(str)

pre = pre.replace(r'\\n',' ', regex=True) 

pre['description'] = pre['description'].str.lower()
pre['name'] = pre['name'].str.lower()
pre['details'] = pre['details'].str.lower()
pre['brand_category'] = pre['brand_category'].str.lower()

cat = pre.copy()


In [31]:
cat.head()

| | product_id | name | details | brand_category | description | product_active |
|---|---|---|---|---|---|---|
| 0 | 01EX0PN4J9WRNZH5F93YEX6QAF | khadi stripe shirt-our signature shirt | | unknown | our signature khadi shirt\navailable in black ... | True |
| 1 | 01F0C4SKZV6YXS3265JMC39NXW | ruffle market dress loopy pink sistine tomato | | unknown | mid-length dress with ruffles and adjustable s... | True |
| 2 | 01EY4Y1BW8VZW51BWG5VZY82XW | ibi slip on raw red knit sneaker women | | unknown | ibi slip on raw red knit sneaker women | False |
| 3 | 01EY50E27A0P5V6KCW01XPDB43 | ibi slip on black knit sneaker women | | unknown | ibi slip on black knit sneaker women | False |
| 4 | 01EY6DWHC2W5HPNEGXKEJ4A1CX | catiba pro skate black suede and canvas contra... | | unknown | | False |


## Engineering the Category Variable

The category variable is extremely important to our recommender. Thus, we put a lot of work into ensuring that products were thoroughly and accurately classified into their respective categories. We chose to create five category classes, four of which we decided to use in our recommender. A recommended outfit consists of a: 

- bottom
- top
- shoe
- accessory

The final category is unknown, which we assigned to products that weren't classified into one of the four above categories. 


We considered employing a machine learning model to categorize the clothing into these classes, using the pre-made outfit product type field as a class label. However, we wanted to try regex first before spending resources on a complex model, and found that regex did more than a good enough job to be used as the final method for creating the category feature. 

In [32]:
import numpy as np
cat['category'] = np.nan
cat['BOTTOM'] = 0
cat['TOP'] = 0
cat['SHOE'] = 0
cat['ACCESSORY'] = 0

For each category, we replace language representations of that category with a category label. We can then search for that label to count how many times the category's language representations occur for a given product. We search for these representations in all four of our main text columns:

- description
- details
- name
- brand category

By language representation we mean, for example, that a "SHOE" could be represented by tokens such as "sneakers", "high heels", and "slippers". We essentially grouped as many representations as we could think of into their respective categories by replacing the representative tokens with the category itself.

Note that developing the list of tokens that count as a representation of each product category was an iterative process. We inspected which products were classified as "unknown" after each round, and then looked for common clothing items that we weren't including in our regex lists. We continued this process until we were satisfied with the number of products we had classified.

Note also that we perform these replacement searches before lemmatization, which might change the tokens to unrecognizable forms in some instances. 
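As a small illustration of the replace-then-count idea described above, here is a minimal sketch on a single string. The shortened pattern and the helper name `label_and_count` are assumptions made for this example only; the lists actually used in this notebook are much longer.

```python
import re

# Hypothetical, shortened token list for illustration only.
SHOE_PATTERN = re.compile(r'\bsneakers?\b|\bheels?\b|\bslippers?\b')

def label_and_count(text, pattern, label):
    """Replace every match of `pattern` with `label`, then count the labels."""
    labeled = pattern.sub(label, text)
    return labeled, labeled.count(label)

labeled, n = label_and_count('red knit sneaker with low heels', SHOE_PATTERN, 'SHOE')
print(labeled)  # red knit SHOE with low SHOE
print(n)        # 2

# The \b word boundaries keep "heels" from matching inside "wheels".
print(label_and_count('alloy wheels', SHOE_PATTERN, 'SHOE'))  # ('alloy wheels', 0)
```

Because the text has already been lowercased, an uppercase label such as `SHOE` cannot collide with anything in the original text, which is what makes the later count reliable.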

In [33]:
#bottoms
#the same pattern is applied to every text column, so we define it once and loop over the columns
bottom_pattern = r'\bover-?alls?\b|\bjumpsuits?\b|\bjeans?\b|\bpants?\b|\bbottoms?\b|\blegs?\b|\bslacks?\b|\bshorts?\b|\bskirts?\b|\bunder(?:wear|garments?)\b|\bleggings?\b|\btrousers?\b|\bone-pieces?\b|\bone pieces?\b|\brompers?\b|\bdress(?:es)?\b|\bjumpers?\b|\bleotards?\b|\bonesies?\b|\bkhakis\b|\bchinos?\b|\bculottes?\b|\bharems?\b|\bjodhpurs?\b|\bpegged\b|\bsailors?\b|\btoreadors?\b'

for col in ['name', 'details', 'description', 'brand_category']:
    cat[col] = cat[col].str.replace(bottom_pattern, 'BOTTOM', regex = True)

Once we replace the representative tokens with the class token, we search for those tokens in all 4 text columns. We then fill in the cell for the category in question for a given product with the total count of the category token found in all four text columns for that product. 

Our original design simply filled in the category columns with a binary variable detailing whether or not there was at least one category token in any of the 4 columns for a product, but there were too many instances where two or more categories were present. The counting system allows us to assign a product to the category that is most prevalent in the various text columns for that product.

In [34]:
#fill in the new category feature tags for bottom products
#str.count gives the number of BOTTOM labels per row; summing over the four text columns gives the total per product
cat['BOTTOM'] = sum(cat[col].str.count('BOTTOM') for col in ['name', 'description', 'details', 'brand_category'])
    
bottom_count = len(cat[cat['BOTTOM'] > 0])



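After we assign a count to a given category for a product, we reset the text columns from pre before searching for the next category. This way, a token that has already been replaced by one category's label can still be matched by another category's list. Overlaps are rare, but they do occur: "wrap", for example, appears in both the top and accessory lists.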
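An alternative that would avoid the reset step is to count regex matches directly instead of replacing them first. The sketch below is not the method used in this notebook; the shortened patterns, the `PATTERNS` dictionary, and the toy frame are all assumptions made for illustration. `Series.str.count` takes a regular expression and returns the number of non-overlapping matches per row.

```python
import pandas as pd

# Hypothetical, shortened patterns; "wrap" is deliberately listed under both
# categories to mirror the overlap between the top and accessory lists.
PATTERNS = {
    'TOP': r'\bshirts?\b|\bwraps?\b',
    'ACCESSORY': r'\bscar(?:f|ves)\b|\bwrap\b',
}
TEXT_COLS = ['name', 'description']

# Toy data invented for this example.
toy = pd.DataFrame({
    'name': ['silk wrap shirt', 'wool scarf'],
    'description': ['a wrap front shirt', 'soft scarf or wrap'],
})

for label, pattern in PATTERNS.items():
    # Count matches per row in each column and add the columns together.
    # The text itself is never modified, so nothing needs to be reset.
    toy[label] = sum(toy[col].str.count(pattern) for col in TEXT_COLS)

print(toy[['TOP', 'ACCESSORY']])
```

Here "wrap" contributes to both the `TOP` and the `ACCESSORY` count of the same product, which is the outcome the reset step is meant to guarantee.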

In [35]:
cat['name'] = pre['name']
cat['details'] = pre['details']
cat['description'] = pre['description']
cat['brand_category'] = pre['brand_category']

In [36]:
#tops
#the same pattern is applied to every text column, so we define it once and loop over the columns
top_pattern = r'\brobes?\b|\bdojos?\b|\bturtle-?necks?\b|\bgowns?\b|\bcover-?ups?\b|\bkimonos?\b|\bbras?\b|\bbreast(?:ed)?\b|\bvests?\b|\btops?\b|\bshirts?\b|\bhoodies?\b|\bcrewnecks?\b|\bv-?necks?\b|\bsweat(?:ers?|shirts?)\b|\bblouses?\b|\btank(?:top)?\b|\btee\b|\bt-?shirt\b|\bcami(?:sole?)?\b|\bcardigans?\b|\bpull-?overs?\b|\bblazers?\b|\bjackets?\b|\btubes?\b|\bwraps?\b|\bringers?\b|\bsleeves?\b|\bcoats?\b'

for col in ['name', 'details', 'description', 'brand_category']:
    cat[col] = cat[col].str.replace(top_pattern, 'TOP', regex = True)


In [37]:
#fill in the new category feature tags for top products
#str.count gives the number of TOP labels per row; summing over the four text columns gives the total per product
cat['TOP'] = sum(cat[col].str.count('TOP') for col in ['name', 'description', 'details', 'brand_category'])
    
top_count = len(cat[cat['TOP'] > 0])



In [38]:
cat['name'] = pre['name']
cat['details'] = pre['details']
cat['description'] = pre['description']
cat['brand_category'] = pre['brand_category']

In [39]:
#shoes
#the same pattern is applied to every text column, so we define it once and loop over the columns
shoe_pattern = r'\bsneakers?\b|\bheels?\b|\bsandals?\b|\bflip-?(?:flops?)\b|\bwedges?\b|\bpumps?\b|\bboots?\b|\bslippers?\b|\bhi-tops?\b|\bshoes?\b|\bloafers?\b|\bcrocs?\b|\bmoccasins?\b|\bmukluks?\b|\bopen-?toed?\b|\boxfords?\b|\bpenny?\b|\bplatforms?\b|\bslides?\b|\bclogs?\b'

for col in ['name', 'details', 'description', 'brand_category']:
    cat[col] = cat[col].str.replace(shoe_pattern, 'SHOE', regex = True)


In [40]:
#fill in the new category feature tags for shoe products
#str.count gives the number of SHOE labels per row; summing over the four text columns gives the total per product
cat['SHOE'] = sum(cat[col].str.count('SHOE') for col in ['name', 'description', 'details', 'brand_category'])
    
shoe_count = len(cat[cat['SHOE'] > 0])



In [41]:
cat['name'] = pre['name']
cat['details'] = pre['details']
cat['description'] = pre['description']
cat['brand_category'] = pre['brand_category']

In [42]:
#accessories
#the same pattern is applied to every text column, so we define it once and loop over the columns
accessory_pattern = r'\bmasks?\b|\bfacemasks?\b|\bglasses?\b|\bsunglasses?\b|\brims?\b|\bbandanas?\b|\bbelts?\b|\bframes?\b|\bcaps?\b|\bbrims?\b|\bhats?\b|\bclutch\b|\bpurses?\b|\bhandbags?\b|\bcross-body\b|\btotes?\b|\bbags?\b|\bpack\b|\bsatchel\b|\bhobo\b|\bbaguette\b|\bshopper\b|\bwristlet\b|\bbucket\b|\bscar(?:f|ves)\b|\bwrap\b|\binfinty\b|\bcowl\b|\bcircle\b|\bmuffler\b|\btriangle\b|\bwatch(?:es)?\b|\bbracelets?\b|\bchokers?\b|\bnecklaces?\b|\banklets?\b|\bpendants?\b|\bbangles?\b|\bcuffs?\b|\brings?\b|\bbrooch\b|\blockets?\b|\bmedallions?\b|\bpendants?\b|\bearr?ings?\b|\bhairpins?\b|hair\b'

for col in ['name', 'details', 'description', 'brand_category']:
    cat[col] = cat[col].str.replace(accessory_pattern, 'ACCESSORY', regex = True)


In [43]:
#fill in the new category feature tags for accessory products
#str.count gives the number of ACCESSORY labels per row; summing over the four text columns gives the total per product
cat['ACCESSORY'] = sum(cat[col].str.count('ACCESSORY') for col in ['name', 'description', 'details', 'brand_category'])
    
accessory_count = len(cat[cat['ACCESSORY'] > 0])



In [44]:
cat['name'] = pre['name']
cat['details'] = pre['details']
cat['description'] = pre['description']
cat['brand_category'] = pre['brand_category']

Now we'll use a few loops to assign the class with the highest count for each product to the category column. 

In [46]:
#create a temporary subset dataframe including only the category count columns
temp = cat[['BOTTOM', 'TOP', 'SHOE', 'ACCESSORY']]

#store labels as objects so that string categories can be assigned to the column
cat['category'] = cat['category'].astype(object)

#for each row (product) in the dataframe...
for row in range(len(cat)):
    #build a list of categories that have at least 1 occurrence for that product
    true_classes = [col for col in temp.columns if temp[col][row] > 0]
    
    #count the total number of occurrences for all categories for each product
    class_count = temp.iloc[row, :].sum()
    #largest single-category count for this product
    row_max = temp.iloc[row, :].max()
    
    #If no classes have occurrences for the product, assign UNKNOWN
    if class_count == 0:
        cat.loc[row, 'category'] = 'UNKNOWN'
    #If the total occurrence count across all categories is the same as the maximum category's count, then only one
    #category occurs for the product, so we assign that class
    elif class_count == row_max:
        cat.loc[row, 'category'] = true_classes[0]
    #otherwise, there's at least one occurrence for multiple categories for a given product.
    else:
        #we check if each category's count is equal to the max across the row. If it is we assign that category. 
        #In the event of a tie, the priority goes in the order in which the categories appear in these checks. 
        if cat['BOTTOM'][row] == row_max:
            cat.loc[row, 'category'] = 'BOTTOM'
        elif cat['TOP'][row] == row_max:
            cat.loc[row, 'category'] = 'TOP'
        elif cat['SHOE'][row] == row_max:
            cat.loc[row, 'category'] = 'SHOE'
        else:
            cat.loc[row, 'category'] = 'ACCESSORY'

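The assignment rule (largest count wins, ties broken in the order BOTTOM, TOP, SHOE, ACCESSORY, and UNKNOWN when every count is zero) can also be expressed without an explicit loop. The following is a minimal sketch on invented counts, not the code that produced the results in this notebook. It relies on `DataFrame.idxmax(axis=1)` returning the first column that holds the row maximum, so the column order doubles as the tie-break priority.

```python
import pandas as pd

# Toy category counts invented for this example; column order sets tie priority.
counts = pd.DataFrame({
    'BOTTOM':    [0, 2, 1, 0],
    'TOP':       [0, 0, 1, 3],
    'SHOE':      [0, 0, 0, 1],
    'ACCESSORY': [0, 0, 0, 0],
})

# First column with the row maximum for every product.
category = counts.idxmax(axis=1)
# Products with no matches at all are labeled UNKNOWN instead.
category = category.where(counts.sum(axis=1) > 0, 'UNKNOWN')

print(category.tolist())  # ['UNKNOWN', 'BOTTOM', 'BOTTOM', 'TOP']
```

The third row shows the tie case: BOTTOM and TOP both have a count of 1, and BOTTOM is chosen because it comes first.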

Here are the counts for the classes we've assigned. 

In [47]:
cat['category'].value_counts()

TOP          20755
BOTTOM       18495
UNKNOWN      10234
ACCESSORY     6873
SHOE          4998
Name: category, dtype: int64

We assigned 51,121 of the 61,355 records (about 83%) to a category other than unknown.

In [48]:
len(cat) - len(cat[cat['category'] == 'UNKNOWN'])

51121

Here's a look at our iterative process mentioned above for examining the unknown products for trends, or representations for categories we might be missing. 

In [None]:
cat[cat['category'] == 'UNKNOWN']['description'].tolist()

A lot of the remaining descriptions aren't really even clothing, or they're obscure clothing items, so they really should be classified as unknown. We could spend all day picking off the remaining items and categorizing them into the regex streams, but what we have will give us over 51,000 products to work with in our recommender, which we are happy with. 
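One way to make this inspection more systematic is to count the most frequent tokens among the unknown descriptions and review the top of that list for missing clothing terms. The sketch below assumes a hypothetical helper `top_tokens` and uses invented descriptions; it is an illustration of the idea rather than something we ran on the real data.

```python
from collections import Counter
import re

def top_tokens(descriptions, n=3):
    """Return the n most frequent alphabetic tokens across the descriptions."""
    counter = Counter()
    for text in descriptions:
        counter.update(re.findall(r'[a-z]+', text.lower()))
    return counter.most_common(n)

# Hypothetical UNKNOWN descriptions, invented for illustration only.
unknown_descriptions = [
    'scented soy candle',
    'hand poured candle in glass jar',
    'linen beanie',
]

print(top_tokens(unknown_descriptions, n=1))  # [('candle', 2)]
```

On the real data, the list passed in would be `cat[cat['category'] == 'UNKNOWN']['description'].tolist()`, and frequent tokens that name a garment would be candidates for the regex lists.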

In [50]:
cat = cat[['product_id', 'category']]

Finally, we subset the dataframe to only the product id column (which we'll use to merge the categories to the rest of our recommender input data) and the category column. 

## Cleaning the text

In [51]:
#%pip install gensim 
import gensim 
from gensim.parsing.preprocessing import remove_stopwords



In [52]:
from gensim.parsing.preprocessing import STOPWORDS
print(STOPWORDS)

frozenset({'neither', 'thus', 'since', 'five', 'against', 'cry', 'third', 'each', 'anything', 'using', 'may', 'beside', 'ten', 'within', 'sincere', 'should', 'did', 'done', 'various', 'six', 'behind', 'fifteen', 'through', 'somewhere', 'also', 'i', 'nowhere', 'such', 'while', 'ever', 'twenty', 'than', 'during', 'do', 'thereby', 'rather', 'who', 'most', 'others', 'which', 'always', 'becoming', 'me', 'many', 'please', 'can', 'formerly', 'already', 'thick', 'fifty', 'then', 'after', 'several', 'too', 'more', 'among', 'yourself', 'almost', 'has', 'were', 'if', 'say', 'anywhere', 'hence', 'over', 'give', 'towards', 'across', 'made', 'with', 'there', 'computer', 'how', 'every', 'someone', 'take', 'last', 'less', 'name', 'indeed', 'wherein', 'often', 'is', 'further', 'either', 'why', 'cannot', 'enough', 'whereby', 'thence', 'call', 'those', 'him', 'amongst', 'does', 'km', 'an', 'well', 'whole', 'all', 'only', 'top', 'we', 'amoungst', 'and', 'whenever', 'ours', 'two', 'make', 'not', 'it', 'ful

We'll define a function for removing punctuation, and apply it to all of our text columns. 

In [53]:
import re

def punc_remove(text):
    
    return re.sub(r'[^\w\s]','',text)
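To see what this function does to hyphenated text, here is a short sketch. `punc_remove` is repeated so the example is self-contained; `punc_to_space` is a hypothetical variant that is not used in this notebook and is shown only for comparison.

```python
import re

def punc_remove(text):
    # Same behaviour as the function above: punctuation is deleted outright.
    return re.sub(r'[^\w\s]', '', text)

def punc_to_space(text):
    # Hypothetical variant: punctuation becomes a space, then runs of
    # whitespace are collapsed, so words joined by punctuation stay separate.
    return re.sub(r'\s+', ' ', re.sub(r'[^\w\s]', ' ', text)).strip()

name = 'khadi stripe shirt-our signature shirt'
print(punc_remove(name))    # khadi stripe shirtour signature shirt
print(punc_to_space(name))  # khadi stripe shirt our signature shirt
```

Deleting punctuation fuses "shirt-our" into the single token "shirtour", which is the form that appears in the cleaned product names further down. Replacing punctuation with a space would keep "shirt" as its own token.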

We'll also convert all of our text columns to strings, and lowercase them. 

In [54]:
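#take an explicit copy so that later column assignments modify pre itself rather than a slice of productData
pre = productData[['product_id', 'name', 'details', 'brand_category','description', 'product_active']].copy()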

pre['name'] = pre['name'].astype(str)
pre['description'] = pre['description'].astype(str)
pre['details'] = pre['details'].astype(str)
pre['brand_category'] = pre['brand_category'].astype(str)

pre = pre.replace(r'\\n',' ', regex = True) 

pre['name'] = pre['name'].apply(punc_remove)
pre['description'] = pre['description'].apply(punc_remove)
pre['details'] = pre['details'].apply(punc_remove)
pre['brand_category'] = pre['brand_category'].apply(punc_remove)

pre['description'] = pre['description'].str.lower()
pre['name'] = pre['name'].str.lower()
pre['details'] = pre['details'].str.lower()
pre['brand_category'] = pre['brand_category'].str.lower()


Next we remove all stopwords after inspecting the list above for any stopwords we might not want to remove. 

In [55]:
pre['name'] = pre['name'].apply(remove_stopwords)
pre['details'] = pre['details'].apply(remove_stopwords)
pre['description'] = pre['description'].apply(remove_stopwords)
pre['brand_category'] = pre['brand_category'].apply(remove_stopwords)
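The printed stopword list includes words such as "top" that also carry meaning in a fashion catalog. If we wanted to keep such words, one option is to filter against a customized set. The sketch below is an assumption-laden illustration: `remove_custom_stopwords` is a hypothetical helper, and `base_stopwords` is a tiny stand-in for gensim's `STOPWORDS` so that the example runs without gensim installed.

```python
def remove_custom_stopwords(text, stopwords):
    """Drop whitespace-separated tokens that appear in `stopwords`."""
    return ' '.join(tok for tok in text.split() if tok not in stopwords)

# Small stand-in for a stopword list; in this notebook the frozenset
# gensim.parsing.preprocessing.STOPWORDS could be used here instead.
base_stopwords = frozenset({'with', 'and', 'top', 'over', 'name'})

# Hypothetical keep-list of fashion-relevant words we do not want to drop.
keep = {'top'}
custom_stopwords = base_stopwords - keep

text = 'crop top with lace trim and bow'
print(remove_custom_stopwords(text, base_stopwords))    # crop lace trim bow
print(remove_custom_stopwords(text, custom_stopwords))  # crop top lace trim bow
```

Whether keeping such words improves the recommendations would need to be checked on real queries.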

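Next, we lemmatize the text columns. We define two lemmatizers: one based on spaCy (lemmatize_text) and one based on NLTK's WordNetLemmatizer with part-of-speech tags (lemmatize_sentence). The NLTK version is the one we apply below. These are the columns we'll use for the cosine similarity calculation in our recommender, so it's important to lemmatize so that different inflections of the same word are represented by the same token. 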

In [56]:
#%pip install spacy
import spacy

nlp = spacy.load('en_core_web_sm')

import pandas as pd
def lemmatize_text(text):
    sentence = ''
    lemmas = []
    doc = nlp(text)
    for token in doc:
        lemmas.append(token.lemma_)
    for token in lemmas:
        sentence = sentence + token + ' '
    return sentence

In [57]:
# function for lemmatization
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

In [58]:
pre['name'] = pre['name'].apply(lemmatize_sentence)
pre['details'] = pre['details'].apply(lemmatize_sentence)
pre['description'] = pre['description'].apply(lemmatize_sentence)
pre['brand_category'] = pre['brand_category'].apply(lemmatize_sentence)

We merge the cleaned text columns with the product category column we created earlier. We also include the product_active column. 

In [59]:
merged = pre.merge(cat, on = 'product_id')

In [66]:
merged

| | product_id | name | details | brand_category | description | product_active | category |
|---|---|---|---|---|---|---|---|
| 0 | 01EX0PN4J9WRNZH5F93YEX6QAF | khadi stripe shirtour signature shirt | | unknown | signature khadi shirt available black white ea... | True | TOP |
| 1 | 01F0C4SKZV6YXS3265JMC39NXW | ruffle market dress loopy pink sistine tomato | | unknown | midlength dress ruffle adjustable strap bias c... | True | BOTTOM |
| 2 | 01EY4Y1BW8VZW51BWG5VZY82XW | ibi slip raw red knit sneaker woman | | unknown | ibi slip raw red knit sneaker woman | False | SHOE |
| 3 | 01EY50E27A0P5V6KCW01XPDB43 | ibi slip black knit sneaker woman | | unknown | ibi slip black knit sneaker woman | False | SHOE |
| 4 | 01EY6DWHC2W5HPNEGXKEJ4A1CX | catiba pro skate black suede canvas contrast t... | | unknown | | False | SHOE |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 61350 | 01EYB5ERGYPFNGM6C9QK7Q9EV0 | bowvida mule black suede kidskin | feminine flat mule square shape v line gently ... | sandalssales | flat bowvida mule black suede ideal spring sum... | False | SHOE |
| 61351 | 01EHWTBFP368Q035FW95TRJDAY | sandale vida mule tangerine suede kidskin | feminine flat mule square shape v line gently ... | flat sandalsarchives | flat vida mule tangerine suede comfortable fem... | False | SHOE |
| 61352 | 01EHWTCFTPPSCW10D4XBQZF28H | bowvida mule fuschia suede kidskin | feminine flat mule square shape v line gently ... | flat sandalsarchives | flat bowvida mule fuschia suede ideal spring s... | False | SHOE |
| 61353 | 01EYB5B5FH7JESXF82ZEMVXZMS | vida mule silver metalize leather | feminine flat mule square shape v line gently ... | sandalssales | flat vida mule silver metalize leather comfort... | False | SHOE |


## Recommender

In [72]:
outfits = pd.read_csv('outfit_combinations USC.csv')

In [73]:
import re
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [74]:
# Define the cosine query function
def cosineQuery(docs, query):
    """
    docs: iterable of document strings (one per product)
    query: query string

    return: 1-D array of cosine similarities between the query and each doc
    """
    # Fit the vectorizer once, then reuse its vocabulary and IDF weights for the query
    vectorizer = TfidfVectorizer()
    TFIDF = vectorizer.fit_transform(docs)
    qTFIDF = vectorizer.transform([query])
    return cosine_similarity(qTFIDF, TFIDF).flatten()
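As a quick illustration of how this kind of query behaves, here is a minimal sketch on a three-item catalog. The documents are invented for the example; the calls are scikit-learn's `TfidfVectorizer` and `cosine_similarity`, the same ones imported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-catalog, invented for illustration only.
docs = [
    'white cotton tee',
    'black leather boot',
    'white knit sneaker',
]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the catalog.
doc_matrix = vectorizer.fit_transform(docs)
# Project the query onto the same vocabulary; unseen words are ignored.
query_vec = vectorizer.transform(['plain white tee'])
scores = cosine_similarity(query_vec, doc_matrix).flatten()

best = int(scores.argmax())
print(docs[best])  # white cotton tee
```

The query word "plain" is not in the catalog vocabulary, so it is dropped. The second document shares no words with the query and therefore scores exactly zero, while the third scores above zero only because of the shared word "white".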

In [75]:
# Get first outfit from the curated outfit data
def getOutfit(productID, rec):
    """
    productID: product ID
    rec: recommended outfit combination dataframe

    return: dictionary mapping each outfit item type to the product's full name
    """
    # Look up the outfit ID by column name rather than by column position
    outfitID = rec.loc[rec['product_id'] == productID, 'outfit_id'].iloc[0]
    df = rec.loc[rec['outfit_id'] == outfitID, ['outfit_item_type', 'product_full_name']]
    return dict(zip(df['outfit_item_type'], df['product_full_name']))
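To make the shape of the lookup concrete, here is a minimal sketch on a toy outfits table. Only the column names come from the outfits dataset; the rows are invented, and `outfit_for` is a hypothetical variant that returns an empty dictionary when the product is not part of any outfit instead of raising an error.

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from the outfits data.
toy_outfits = pd.DataFrame({
    'outfit_id': ['o1', 'o1', 'o2'],
    'product_id': ['p1', 'p2', 'p3'],
    'outfit_item_type': ['top', 'bottom', 'shoe'],
    'product_full_name': ['white tee', 'blue jeans', 'red sneaker'],
})

def outfit_for(product_id, rec):
    """Return {item type: product name} for the first outfit containing product_id."""
    matches = rec.loc[rec['product_id'] == product_id, 'outfit_id']
    if matches.empty:
        # The product does not appear in any curated outfit.
        return {}
    items = rec[rec['outfit_id'] == matches.iloc[0]]
    return dict(zip(items['outfit_item_type'], items['product_full_name']))

print(outfit_for('p2', toy_outfits))  # {'top': 'white tee', 'bottom': 'blue jeans'}
print(outfit_for('p9', toy_outfits))  # {}
```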

Our final recommender algorithm lets users choose whether to include inactive products in the recommendation set. We offer this as an option rather than recommending only active products by default, since most of the products in the dataset are inactive: 

In [None]:
merged['product_active'].value_counts()

In [76]:
def search(query, limitToActive=False): 
    """
    query is a string that is passed in by the user, and this function returns a 
    dictionary of outfit results keyed by category (or by outfit item type when a 
    curated outfit is returned). 
    Example:
    search("pleated casual skirt") -> { 
        "TOP": "...",
        "BOTTOM": "...",
        "SHOE": "...",
        "ACCESSORY": "..."
    } """   
    # An empty string or None is falsy, so a single check covers both cases
    if not query:
        print('Query Empty...')
        return
    query = query.lower()
    
    # GET the text
    dat = merged.copy()
    rec = outfits
    
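    # Combine columns and create new text column
    text = []
    for index, row in dat.drop(['product_id', 'product_active', 'category'], axis=1).iterrows():
        # Skip placeholder values ('nan' is what astype(str) produces for missing cells)
        parts = [str(i) for i in row if str(i).lower() not in ('nan', 'unknown')]
        # Join with spaces so the last word of one column does not fuse with the first word of the next
        text.append(' '.join(re.findall(r'[a-z]+', ' '.join(parts))))
    dat['text'] = text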
    
    # Run the cosine query
    cos = cosineQuery(dat['text'], query)
    dat['similarities'] = cos
    
    # Only consider active products if `limitToActive` is flagged
    if(limitToActive):
        dat = dat[dat['product_active']]
        
    dat = dat[dat['category'] != 'UNKNOWN']
    
    # Sort and get top result 
    top = dat.sort_values('similarities', ascending=False).reset_index(drop=True).iloc[0]
    
    # If the top result is in the outfit combos, return the outfit
    if (rec[rec['product_id'] == top['product_id']].shape[0] > 0):
        return getOutfit(top['product_id'], rec)
    
    # Remove the top result and its class from the full dataset
    # We can do both in one go
    df = dat[dat['category'] != top['category']].copy()
    
    # Query again with data from the top result, whatever that may be
    cos = cosineQuery(df['text'], top['text'])
    df['similarities'] = cos
    
    # Select the top results for the remaining classes
    returnDict = {top['category']: top['name']}
    classMask = dat['category'].unique().tolist()
    if (top['category'] in classMask):
        classMask.remove(top['category'])
    for i in classMask:
        temp = df[df['category'] == i].sort_values('similarities', ascending=False).reset_index(drop=True).iloc[0]
        returnDict[temp['category']] = temp['name']
    return returnDict

In [84]:
# Run the recommendation function
print(search('plain white tee', limitToActive = False))

{'TOP': 'crystal star', 'BOTTOM': 'starry night dress', 'SHOE': 'ibi slip white knit sneaker woman', 'ACCESSORY': 'talisman energy earring'}
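The final step of the search, picking the most similar product in each remaining category, can also be written with a group-wise `idxmax`. This is a minimal sketch on invented candidates with made-up similarity scores, not the code that produced the output above.

```python
import pandas as pd

# Hypothetical candidates with precomputed similarity scores.
candidates = pd.DataFrame({
    'name': ['blue jeans', 'grey chinos', 'red sneaker', 'gold ring'],
    'category': ['BOTTOM', 'BOTTOM', 'SHOE', 'ACCESSORY'],
    'similarities': [0.2, 0.5, 0.4, 0.1],
})

# Index label of the highest-scoring row within each category.
best_idx = candidates.groupby('category')['similarities'].idxmax()
best = candidates.loc[best_idx]

picks = dict(zip(best['category'], best['name']))
print(picks)
```

For these toy scores, `picks` maps BOTTOM to "grey chinos", SHOE to "red sneaker", and ACCESSORY to "gold ring".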
