# Label Classification: DSO 560 Project

This project applies supervised machine learning classification to product text using NLP techniques. The deliverable is divided into three main areas:
- Data Cleaning: We load the item descriptions together with the provided labels, clean the text with regular expressions, remove stop words with the spaCy library, and finally lemmatize.
- Model Selection: We implement several models, such as logistic regression and Keras networks, with vectorization techniques ranging from count and TF-IDF vectorizers to word embeddings. We evaluate each model on a held-out portion of the labelled data and select the top-performing model for each label.
- Model Prediction: We take input from the user, predict labels with the best model selected in the previous step, and return the predicted labels closest to the input.
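
The model-selection step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the toy texts, the binary `is_casual` label, and the choice of a single TF-IDF plus logistic regression pipeline are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: cleaned product text and one binary label ("casual" or not).
texts = [
    "relaxed cotton tee weekend",
    "soft jersey jogger lounge",
    "tailored wool blazer office",
    "silk blouse pencil skirt office",
]
is_casual = [1, 1, 0, 0]

# Vectorize the text and fit a classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, is_casual)

# Predict the label for two unseen (also hypothetical) descriptions.
predictions = model.predict(["cotton weekend jogger", "wool office blazer"])
print(predictions)
```

In a multi-label setting such as this one, a pipeline of this kind would presumably be fitted once per label column.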

## Data Cleaning

In [1]:
# Importing relevant libraries
import pandas as pd
import numpy as np
import re
from collections import Counter
import nltk
import spacy
import functools
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from gensim.models import Word2Vec
from nltk import word_tokenize
from keras.preprocessing.text import Tokenizer
import tensorflow as tf
from random import randint
from numpy import array, argmax, asarray, zeros
from keras.layers.recurrent import SimpleRNN, LSTM
from keras.layers import Flatten, Masking
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Embedding
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from gensim.test.utils import common_texts, get_tmpfile
import warnings
warnings.filterwarnings("ignore")
import string

Using TensorFlow backend.


In [2]:
import en_core_web_md
nlp = en_core_web_md.load()
from tqdm import tqdm_notebook

In [3]:
from typing import List

In [4]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

## Loading the data

In this step, we load the full data file and the files containing the tag values so that they can be combined later.

In [5]:
# loading full data
df_all = pd.read_csv("Full Data.csv")

In [6]:
df_all.head(5)

Unnamed: 0,product_id,brand,mpn,product_full_name,description,brand_category,created_at,updated_at,deleted_at,brand_canonical_url,details,labels,bc_product_id
0,01DSE9TC2DQXDG6GWKW9NMJ416,Banana Republic,514683.0,Ankle-Strap Pump,"A modern pump, in a rounded silhouette with an...",Unknown,2019-11-11 22:37:15.719107+00,2019-12-19 20:40:30.786144+00,,https://bananarepublic.gap.com/browse/product....,"A modern pump, in a rounded silhouette with an...","{""Needs Review""}",
1,01DSE9SKM19XNA6SJP36JZC065,Banana Republic,526676.0,Petite Tie-Neck Top,Dress it down with jeans and sneakers or dress...,Unknown,2019-11-11 22:36:50.682513+00,2019-12-19 20:40:30.786144+00,,https://bananarepublic.gap.com/browse/product....,Dress it down with jeans and sneakers or dress...,"{""Needs Review""}",
2,01DSJX8GD4DSAP76SPR85HRCMN,Loewe,400100000000.0,52MM Padded Leather Round Sunglasses,Padded leather covers classic round sunglasses.,JewelryAccessories/SunglassesReaders/RoundOval...,2019-11-13 17:33:59.581661+00,2019-12-19 20:40:30.786144+00,,https://www.saksfifthavenue.com/loewe-52mm-pad...,100% UV protection\nCase and cleaning cloth in...,"{""Needs Review""}",
3,01DSJVKJNS6F4KQ1QM6YYK9AW2,Converse,400012000000.0,Baby's & Little Kid's All-Star Two-Tone Mid-To...,The iconic mid-top design gets an added dose o...,"JustKids/Shoes/Baby024Months/BabyGirl,JustKids...",2019-11-13 17:05:05.203733+00,2019-12-19 20:40:30.786144+00,,https://www.saksfifthavenue.com/converse-babys...,Canvas upper\nRound toe\nLace-up vamp\nSmartFO...,"{""Needs Review""}",
4,01DSK15ZD4D5A0QXA8NSD25YXE,Alexander McQueen,400011000000.0,64MM Rimless Sunglasses,Hexagonal shades offer a rimless view with int...,JewelryAccessories/SunglassesReaders/RoundOval,2019-11-13 18:42:30.941321+00,2019-12-19 20:40:30.786144+00,,https://www.saksfifthavenue.com/alexander-mcqu...,100% UV protection\nGradient lenses\nAdjustabl...,"{""Needs Review""}",


In [7]:
# keeping the columns relevant to the analyses
df_all = df_all[['product_id', 'brand', 'product_full_name', 'description', 'brand_category', 'details', 'labels']]

In [8]:
df_all.shape

(48979, 7)

In [9]:
# Checking NA values
df_all.isnull().sum()

product_id              0
brand                   0
product_full_name       0
description          7974
brand_category        238
details              9866
labels                  0
dtype: int64

In [10]:
# some rows do not contain a description or details
# replace the null values with the string "Unknown"
df_all=df_all.fillna("Unknown")

In [11]:
df_all.shape

(48979, 7)

In [12]:
# checking for duplicate product id entries
df_check = df_all.groupby("product_id").size().reset_index(name = 'count').sort_values('count', ascending = False)
#len(df_check[df_check['count']>1])
df_check.head(5)

Unnamed: 0,product_id,count
0,01DMBRYVA2P5H24WK0HTK4R0A1,2
41932,01DT51234VHAHGPTR89SZJ50V0,2
6064,01DPGTXH6QTM161M660N9W7C3S,2
6063,01DPGTXD3HEJ83GAWGBNB0PV92,2
42074,01DTJCE596G5WGANPMXNENAXFJ,2
42075,01DTJCE9ZMH29TWHQC1CC8AWG5,2
42076,01DTJCEEPSEHH29G98KNDG4TFK,2
42077,01DTJCEKTMNHVSB3WHG9M5V1P7,2
42078,01DTJCERF6F4NRZ2WSJFFA1EYS,2
42079,01DTJCEX7H9S5ZQ2MXD019M39N,2


In [13]:
# Dropping duplicate product ids
df_all = df_all.drop_duplicates(subset="product_id")

In [14]:
df_all.shape

(48072, 7)
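
Comparing the shapes before and after, 48,979 - 48,072 = 907 rows were dropped. To inspect which rows are affected before dropping them, `DataFrame.duplicated` can be used. The sketch below runs on a hypothetical four-row frame, not on the project data:

```python
import pandas as pd

# Hypothetical miniature of df_all with one repeated product_id.
df = pd.DataFrame({
    "product_id": ["A1", "B2", "A1", "C3"],
    "brand": ["x", "y", "x", "z"],
})

# keep=False flags every member of a duplicated group, not just the repeats.
dupes = df[df.duplicated(subset="product_id", keep=False)]

# Number of rows that drop_duplicates would remove.
n_removed = len(df) - len(df.drop_duplicates(subset="product_id"))

print(dupes)
print(n_removed)
```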

In [15]:
# loading first file with tags
df_tagged = pd.read_excel("USC+Product+Attribute+Data+03302020.xlsx")

In [16]:
df_tagged.head(5)

Unnamed: 0,product_id,product_color_id,attribute_name,attribute_value
0,01DVBTBPHR8WJTCVEN5AJRHF47,01DVBTBPJ41VVT00JJCG8TTZ2W,gender,Women
1,01DVA7QRXM928ZM0WWR7HFNTC1,01DVA7QRXXR9F0TWVE1HMC5ZQ3,Primary Color,Blacks
2,01DPGV4YRP3Z8J85DASGZ1Y99W,01DPGVGBK6YGNYGNF2S6FSH02T,style,Casual
3,01E1JM43NQ3H17PB22EV3074NX,01E1JM5WFWWCCCH3JTTTCYQCEQ,style,Modern
4,01DSE8Z2ZDAZKZ2SKCS1E3B3HK,01DSE8ZG8Y3FR8KWE2TY1QDWBF,shoe_width,Medium


In [17]:
df_tagged.shape

(21925, 4)

In [18]:
# loading second file with tags
df_tagged2 = pd.read_csv('usc_additional_tags.csv')
df_tagged2.head(5)

Unnamed: 0,product_id,product_color_id,attribute_name,attribute_value
0,01E5ZXP5H0BTEZT9QD2HRZJ47A,01E5ZXP5JCREDC7WJVMWHK5Q40,materialclothing,linenblend
1,01E5ZXP5H0BTEZT9QD2HRZJ47A,01E5ZXP5JCREDC7WJVMWHK5Q40,materialclothing,cottonblend
2,01E5ZXP5H0BTEZT9QD2HRZJ47A,01E5ZXP5JCREDC7WJVMWHK5Q40,style,modern
3,01E5ZXP5H0BTEZT9QD2HRZJ47A,01E5ZXP5JCREDC7WJVMWHK5Q40,style,businesscasual
4,01E5ZXP5H0BTEZT9QD2HRZJ47A,01E5ZXP5JCREDC7WJVMWHK5Q40,style,classic


In [19]:
df_tagged2.shape

(97420, 4)

In [20]:
# combining both tag files in one dataframe
df_tag = pd.concat([df_tagged, df_tagged2])

In [21]:
df_tag.shape

(119345, 4)

In [22]:
# creating a helper column that concatenates the values of all columns
# rows that repeat a value in this column are exact duplicates
df_tag['error'] = df_tag['product_id'] + df_tag['product_color_id'] + df_tag['attribute_name'] + df_tag['attribute_value']

In [23]:
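# dropping any duplicates
# note: drop_duplicates returns a new DataFrame and the result is not assigned here,
# so df_tag is unchanged and the shape below does not show whether duplicates exist
df_tag.drop_duplicates(subset=["error"],keep='first')
df_tag.shape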

(119345, 5)
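
By default, `drop_duplicates` returns a new DataFrame rather than modifying the original, and its `subset` argument accepts a list of columns, so a concatenated helper column is not strictly needed. A minimal sketch on hypothetical data (the `tags` frame and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature of df_tag containing one exact repeat.
tags = pd.DataFrame({
    "product_id": ["P1", "P1", "P2"],
    "product_color_id": ["C1", "C1", "C2"],
    "attribute_name": ["style", "style", "fit"],
    "attribute_value": ["Casual", "Casual", "Relaxed"],
})

key_cols = ["product_id", "product_color_id", "attribute_name", "attribute_value"]

# The result must be assigned; the original frame is left untouched.
deduped = tags.drop_duplicates(subset=key_cols, keep="first")

print(tags.shape, deduped.shape)
```

Comparing the two shapes then shows directly how many duplicate rows were present.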

In [24]:
# dropping the error column
df_tag = df_tag.drop(columns = ['error'])

In [25]:
df_tag.shape

(119345, 4)

In [26]:
# checking for duplicate product id entries
df_tag.groupby("product_id").size().reset_index(name = 'count').sort_values('count', ascending = False).head(10)

Unnamed: 0,product_id,count
61,01DPGSTG4M1RXB26QMMN0MPPB8,572
167,01DPH1GQ33PHX8WG6C0RGSZDQQ,486
71,01DPGV4YRP3Z8J85DASGZ1Y99W,343
421,01DT0DKMM6G7HDJS12QCWK5X4H,330
910,01E1JKV4WQYPMJYVXYNN4NVYK8,327
13,01DPCHNEW5F2RHJQ3NJMVPK6SE,288
431,01DT0DM6NX4ZFPGP2ZCADMKVQW,252
927,01E1JM0023VJ552BMC0266SWNC,210
154,01DPH134V7QPAP0YX0DNH7HCVR,200
2267,01E2M4BTQCZMT0WRCAMT9CX3PY,198


As shown above, a product id appears many times because each product carries multiple labels. Because this is a supervised learning problem, every product id used for training needs its labels. However, merging the labels at this stage would mean pre-processing the same product text repeatedly. We therefore pre-process the text first and add the labels to the dataframe afterwards.

## Pre-Processing the Data

Text inherently contains a lot of noise. Removing that noise is an essential step toward improving the results from the model.

In [27]:
df_all.head(10)

Unnamed: 0,product_id,brand,product_full_name,description,brand_category,details,labels
0,01DSE9TC2DQXDG6GWKW9NMJ416,Banana Republic,Ankle-Strap Pump,"A modern pump, in a rounded silhouette with an...",Unknown,"A modern pump, in a rounded silhouette with an...","{""Needs Review""}"
1,01DSE9SKM19XNA6SJP36JZC065,Banana Republic,Petite Tie-Neck Top,Dress it down with jeans and sneakers or dress...,Unknown,Dress it down with jeans and sneakers or dress...,"{""Needs Review""}"
2,01DSJX8GD4DSAP76SPR85HRCMN,Loewe,52MM Padded Leather Round Sunglasses,Padded leather covers classic round sunglasses.,JewelryAccessories/SunglassesReaders/RoundOval...,100% UV protection\nCase and cleaning cloth in...,"{""Needs Review""}"
3,01DSJVKJNS6F4KQ1QM6YYK9AW2,Converse,Baby's & Little Kid's All-Star Two-Tone Mid-To...,The iconic mid-top design gets an added dose o...,"JustKids/Shoes/Baby024Months/BabyGirl,JustKids...",Canvas upper\nRound toe\nLace-up vamp\nSmartFO...,"{""Needs Review""}"
4,01DSK15ZD4D5A0QXA8NSD25YXE,Alexander McQueen,64MM Rimless Sunglasses,Hexagonal shades offer a rimless view with int...,JewelryAccessories/SunglassesReaders/RoundOval,100% UV protection\nGradient lenses\nAdjustabl...,"{""Needs Review""}"
5,01DSRCS6FWFEXE6T3QYZ7QHN7V,Kissy Kissy,Baby Girl's Long-Sleeve Cotton Footie,Comfortable cotton footie with scalloped trim,JustKids/Baby024months/InfantGirls/FootiesRompers,Crewneck\nLong sleeves\nFront snap closure\nPi...,"{""Needs Review""}"
6,01DSE9J9B1WCDKEZDBXY9JCD1H,Banana Republic,Untucked Slim-Fit Tech-Stretch Cotton Shirt,"Made with our Tech Stretch Cotton fabric, this...",Unknown,"Made with our Tech Stretch Cotton fabric, this...","{""Needs Review""}"
7,01DSEAZS5R8HJBGWY3XHJSDC1W,Banana Republic,Classic-Fit Bi-Stretch Blazer,Our Classic Fit blazer has a single-button fro...,Unknown,Our Classic Fit blazer has a single-button fro...,"{""Needs Review""}"
8,01DSE9HTZN1C0YDB1W421K6TY0,Banana Republic,Print Puff-Sleeve Wrap Dress,Add a little feminine flair to your look with ...,Unknown,Add a little feminine flair to your look with ...,"{""Needs Review""}"
9,01DSJT9E7E6K26W1F0CP4H2E8H,Hunter,Hunter x Peppa Pig Print Umbrella,From the Hunter x Peppa Pig Collaboration. Lig...,"JustKids/Girls214/Accessories,JustKids/Boys220...",Easy-to-hold curved handle\nSafety feature on ...,"{""Needs Review""}"


In [28]:
# dimensions of the final data file
df_all.shape

(48072, 7)

### Using regex to correct punctuations

In this step, we remove punctuation that does not necessarily add value to the corpus. We keep the hyphen (-) and the double quote (") because they may carry meaning. For instance, the double quote is used to denote inches in the corpus.

In [29]:
# Inspiration from https://stackoverflow.com/questions/34860982/replace-the-punctuation-with-whitespace
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [30]:
punc_syn = string.punctuation
punc_syn = punc_syn.replace('"','')
punc_syn = punc_syn.replace('-','')
punc_syn

"!#$%&'()*+,./:;<=>?@[\\]^_`{|}~"

In [31]:
def clean_punct(review_to_be_cleaned):
    '''
    Takes a single text string and returns the cleaned, lowercased string
    '''
    temp = review_to_be_cleaned

    # Normalize tabs and newlines to spaces
    temp = temp.replace('\t', ' ').replace('\n', ' ')
    
    # Replace punctuation (except the double quote and hyphen) with whitespace
    punc_syn = string.punctuation
    punc_syn = punc_syn.replace('"','')
    punc_syn = punc_syn.replace('-','')
    # re.escape keeps characters such as ], \ and ^ literal inside the character class
    temp = re.sub(r'[{}]'.format(re.escape(punc_syn)), ' ', temp)
    
    # Single character removal 
    temp = re.sub(r"\s+[a-zA-Z]\s+", ' ', temp)
    
    # Remove leading and trailing whitespace
    temp = temp.strip()

    # Normalize spaces to 1
    temp = re.sub(" +", " ", temp)

    # Normalize all characters to lowercase
    temp = temp.lower()
    
    return temp
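
To see the effect of these cleaning steps on one string, the sketch below restates them compactly and applies them to a made-up sample. The names `clean_text`, `PUNCT_TO_STRIP` and `PUNCT_PATTERN` and the sample string are hypothetical and used only for this illustration.

```python
import re
import string

# Keep the double quote (inches) and the hyphen; escape the rest for a character class.
PUNCT_TO_STRIP = string.punctuation.replace('"', '').replace('-', '')
PUNCT_PATTERN = re.compile('[{}]'.format(re.escape(PUNCT_TO_STRIP)))

def clean_text(text):
    """Compact restatement of the cleaning steps for a single string."""
    text = text.replace('\t', ' ').replace('\n', ' ')   # tabs/newlines to spaces
    text = PUNCT_PATTERN.sub(' ', text)                  # punctuation to spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)          # drop isolated single letters
    text = re.sub(' +', ' ', text.strip())               # trim and collapse spaces
    return text.lower()

sample = 'Canvas upper\nRound toe, 2" heel & Lace-up vamp!'
cleaned = clean_text(sample)
print(cleaned)  # canvas upper round toe 2" heel lace-up vamp
```

The hyphen in "lace-up" and the inch mark after "2" survive, while the comma, ampersand and exclamation mark are replaced by spaces.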

In [32]:
# Working on a copy of the original dataframe in case a refresh is required
df_pre = df_all.copy()

In [33]:
list_cols = ['brand', 'product_full_name', 'description', 'brand_category', 'details']

for i in list_cols:
    df_pre[i] = df_pre[i].astype(str).apply(clean_punct)

In [34]:
df_pre.head(5)

Unnamed: 0,product_id,brand,product_full_name,description,brand_category,details,labels
0,01DSE9TC2DQXDG6GWKW9NMJ416,banana republic,ankle-strap pump,a modern pump in rounded silhouette with an an...,unknown,a modern pump in rounded silhouette with an an...,"{""Needs Review""}"
1,01DSE9SKM19XNA6SJP36JZC065,banana republic,petite tie-neck top,dress it down with jeans and sneakers or dress...,unknown,dress it down with jeans and sneakers or dress...,"{""Needs Review""}"
2,01DSJX8GD4DSAP76SPR85HRCMN,loewe,52mm padded leather round sunglasses,padded leather covers classic round sunglasses,jewelryaccessories sunglassesreaders roundoval...,100 uv protection case and cleaning cloth incl...,"{""Needs Review""}"
3,01DSJVKJNS6F4KQ1QM6YYK9AW2,converse,baby little kid all-star two-tone mid-top chuc...,the iconic mid-top design gets an added dose o...,justkids shoes baby024months babygirl justkids...,canvas upper round toe lace-up vamp smartfoam ...,"{""Needs Review""}"
4,01DSK15ZD4D5A0QXA8NSD25YXE,alexander mcqueen,64mm rimless sunglasses,hexagonal shades offer rimless view with intri...,jewelryaccessories sunglassesreaders roundoval,100 uv protection gradient lenses adjustable n...,"{""Needs Review""}"


### Removing stop words

In this step, we remove words such as "the" and "is", which are very common in English. These words add noise to the data: they increase the number of features without adding meaning to the text. We use the spaCy library to remove them.

In [36]:
# using spacy library to remove the stop words
for i in list_cols:
    df_pre[i] = list(
    map(lambda doc: " ".join([token.text for token in nlp(doc) if not token.is_stop]), list(df_pre[i])))

In [37]:
# looking at the first 5 observations
df_pre.head(5)

Unnamed: 0,product_id,brand,product_full_name,description,brand_category,details,labels
0,01DSE9TC2DQXDG6GWKW9NMJ416,banana republic,ankle - strap pump,modern pump rounded silhouette ankle strap ext...,unknown,modern pump rounded silhouette ankle strap ext...,"{""Needs Review""}"
1,01DSE9SKM19XNA6SJP36JZC065,banana republic,petite tie - neck,dress jeans sneakers dress tailored trouser he...,unknown,dress jeans sneakers dress tailored trouser he...,"{""Needs Review""}"
2,01DSJX8GD4DSAP76SPR85HRCMN,loewe,52 mm padded leather round sunglasses,padded leather covers classic round sunglasses,jewelryaccessories sunglassesreaders roundoval...,100 uv protection case cleaning cloth included...,"{""Needs Review""}"
3,01DSJVKJNS6F4KQ1QM6YYK9AW2,converse,baby little kid - star - tone mid - chuck tayl...,iconic mid - design gets added dose support pa...,justkids shoes baby024months babygirl justkids...,canvas upper round toe lace - vamp smartfoam i...,"{""Needs Review""}"
4,01DSK15ZD4D5A0QXA8NSD25YXE,alexander mcqueen,64 mm rimless sunglasses,hexagonal shades offer rimless view intricate ...,jewelryaccessories sunglassesreaders roundoval,100 uv protection gradient lenses adjustable n...,"{""Needs Review""}"
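
The output above shows that spaCy's tokenizer splits hyphenated terms ("ankle-strap" becomes "ankle - strap") and that parts of a compound can then be dropped as stop words ("all-star two-tone" becomes "- star - tone"). If hyphenated terms should survive intact, one possible alternative is to split on whitespace only. The sketch below is an assumption, not what the notebook does: it uses a small hypothetical stop list (`STOP_WORDS`) instead of spaCy's, and a hypothetical helper name.

```python
# Hypothetical, deliberately tiny stop list; the notebook relies on spaCy's list instead.
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "or", "the", "with"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words using whitespace tokenization so hyphenated terms stay whole."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)

result = remove_stop_words("a modern pump in rounded silhouette with an ankle-strap")
print(result)  # modern pump rounded silhouette ankle-strap
```

Whether keeping compounds whole helps the downstream models is an empirical question that would need to be checked against the evaluation data.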


### Lemmatization

In English, many words share a meaning and differ only in their suffix (for example, "case" and "cases"). In this step, we use lemmatization to reduce such variants to a common base form.

In [41]:
from nltk.stem import WordNetLemmatizer

def lemm(list_to_process):
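    '''
    Lemmatizes each text in list_to_process and returns the list of lemmatized texts.
    WordNetLemmatizer treats every token as a noun unless a part of speech is passed.
    '''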
    lemmatizer = WordNetLemmatizer()
    sentences = []
    
    for i in list_to_process:
        tokens = nltk.word_tokenize(i)
        words = []
        for word in tokens:
            words.append(lemmatizer.lemmatize(word))
        sentence = " ".join(words)
        sentences.append(sentence)
    return sentences

In [42]:
lemm(['cases'])

['case']

In [43]:
for i in list_cols:
    df_pre[i] = lemm(df_pre[i])

In [44]:
df_pre['combined'] =  df_pre.brand+' '+df_pre.product_full_name+' '+df_pre.description+' '+df_pre.brand_category+' '+df_pre.details
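
An equivalent way to build the combined column, which scales more easily when the list of text columns changes, is to join the columns row by row. This is a sketch on a hypothetical two-row frame; it assumes, as in the notebook, that the columns contain only strings (nulls were replaced with "Unknown" earlier).

```python
import pandas as pd

# Hypothetical two-row stand-in for df_pre.
df = pd.DataFrame({
    "brand": ["loewe", "converse"],
    "product_full_name": ["round sunglass", "mid-top sneaker"],
    "details": ["uv protection", "canvas upper"],
})

text_cols = ["brand", "product_full_name", "details"]

# Join the listed columns row by row with a single space.
df["combined"] = df[text_cols].agg(" ".join, axis=1)

print(df["combined"].tolist())
```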

In [45]:
df_pre.head(5)

Unnamed: 0,product_id,brand,product_full_name,description,brand_category,details,labels,combined
0,01DSE9TC2DQXDG6GWKW9NMJ416,banana republic,ankle - strap pump,modern pump rounded silhouette ankle strap ext...,unknown,modern pump rounded silhouette ankle strap ext...,"{""Needs Review""}",banana republic ankle - strap pump modern pump...
1,01DSE9SKM19XNA6SJP36JZC065,banana republic,petite tie - neck,dress jean sneaker dress tailored trouser heel...,unknown,dress jean sneaker dress tailored trouser heel...,"{""Needs Review""}",banana republic petite tie - neck dress jean s...
2,01DSJX8GD4DSAP76SPR85HRCMN,loewe,52 mm padded leather round sunglass,padded leather cover classic round sunglass,jewelryaccessories sunglassesreaders roundoval...,100 uv protection case cleaning cloth included...,"{""Needs Review""}",loewe 52 mm padded leather round sunglass padd...
3,01DSJVKJNS6F4KQ1QM6YYK9AW2,converse,baby little kid - star - tone mid - chuck tayl...,iconic mid - design get added dose support pad...,justkids shoe baby024months babygirl justkids ...,canvas upper round toe lace - vamp smartfoam i...,"{""Needs Review""}",converse baby little kid - star - tone mid - c...
4,01DSK15ZD4D5A0QXA8NSD25YXE,alexander mcqueen,64 mm rimless sunglass,hexagonal shade offer rimless view intricate n...,jewelryaccessories sunglassesreaders roundoval,100 uv protection gradient lens adjustable nos...,"{""Needs Review""}",alexander mcqueen 64 mm rimless sunglass hexag...


## Create separate dataframes for each category

We selected the following four categories for our analysis:
- style
- occasion
- color
- fit

In [46]:
final_cat = ['style', 'occasion', 'fit', 'Primary Color', 'Additional Color', 'primarycolor', 'additionalcolor', 'color']

In [47]:
#https://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python

one_hot = pd.get_dummies(df_tag['attribute_name'])

In [48]:
one_hot.head(5)

Unnamed: 0,Additional Color,Color,Pattern,Primary Color,Print,additionalcolor,beltbucklematerial,beltbuckleshape,beltclosure,beltmaterial,beltwidth,calf_width,calfwidth,category,class_blazers_coats_and_jackets,class_booties,class_boots,class_dress,class_flats,class_handbags,class_jumpsuit_and_romper,class_mules_and_slides,class_pants_and_leggings,class_pumps_and_heels,class_sandals,class_shorts,class_skirts,class_sneakers_and_athletic,class_wedges,classbelts,classblazerscoatsandjackets,classbooties,classboots,classdress,classflats,classhandbags,classjumpsuitandromper,classmulesandslides,classpantsandleggings,classpumpsandheels,classsandals,classshorts,classskirts,classslippers,classsneakersandathletic,classsunglasses,classwedges,closure_blazers_coats_and_jackets,closure_handbag,closure_top,closureblazerscoatsandjackets,closurehandbag,closureonepiece,closurepantsandleggings,closureshoe,closureshorts,closureskirts,closuresweater,closuretop,color,dry_clean_only,drycleanonly,embellishment,fit,gender,heel_height,heel_shape,heelheight,heelshape,leg_style,legstyle,legstylejeans,length_blazers_coats_and_jackets,length_one_piece,length_pants_and_leggings,length_shorts,length_skirts,length_top,lengthblazers,lengthblazerscoatsandjackets,lengthcoatsandjackets,lengthjeans,lengthonepiece,lengthpantsandleggings,lengthshorts,lengthskirts,lengthtop,material,material_clothing,material_purse,materialclothing,materialpurse,neckline,occasion,pattern,primarycolor,print,rise,risejeans,shaft_height,shaftheight,sheer,shoe_width,shoewidth,sizing,sleeve_length,sleevelength,strap,strap_material,strapmaterial,style,subcategory_accessory,subcategory_blazers_coats_and_jackets,subcategory_bottom,subcategory_one_piece,subcategory_shoe,subcategory_sweater,subcategory_sweatshirt_and_hoodie,subcategory_top,subcategoryaccessory,subcategoryblazerscoatsandjackets,subcategorybottom,subcategoryonepiece,subcategoryshoe,subcategorysweater,subcategorysweatshirtandhoodie,subcategorytop,sunglassframematerial,sweatshirtandhoodieclosure,toe_style,toeexposure,toestyle,trend,upper_material,uppermaterial,wash
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [49]:
one_hot = one_hot[final_cat]
one_hot.head(5)

Unnamed: 0,style,occasion,fit,Primary Color,Additional Color,primarycolor,additionalcolor,color
0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0
2,1,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0


In [50]:
one_hot.shape

(119345, 8)

In [51]:
df_tag_keep = df_tag.copy()

In [52]:
df_tag_keep = pd.concat([df_tag_keep, one_hot], axis=1)

In [53]:
df_tag_keep.head(5)

Unnamed: 0,product_id,product_color_id,attribute_name,attribute_value,style,occasion,fit,Primary Color,Additional Color,primarycolor,additionalcolor,color
0,01DVBTBPHR8WJTCVEN5AJRHF47,01DVBTBPJ41VVT00JJCG8TTZ2W,gender,Women,0,0,0,0,0,0,0,0
1,01DVA7QRXM928ZM0WWR7HFNTC1,01DVA7QRXXR9F0TWVE1HMC5ZQ3,Primary Color,Blacks,0,0,0,1,0,0,0,0
2,01DPGV4YRP3Z8J85DASGZ1Y99W,01DPGVGBK6YGNYGNF2S6FSH02T,style,Casual,1,0,0,0,0,0,0,0
3,01E1JM43NQ3H17PB22EV3074NX,01E1JM5WFWWCCCH3JTTTCYQCEQ,style,Modern,1,0,0,0,0,0,0,0
4,01DSE8Z2ZDAZKZ2SKCS1E3B3HK,01DSE8ZG8Y3FR8KWE2TY1QDWBF,shoe_width,Medium,0,0,0,0,0,0,0,0


In [54]:
df_tag_keep.shape

(119345, 12)

In [55]:
final_cat

['style',
 'occasion',
 'fit',
 'Primary Color',
 'Additional Color',
 'primarycolor',
 'additionalcolor',
 'color']

Now we create a separate dataframe for each category.

In [56]:
df_style_temp = df_tag_keep[df_tag_keep['style']>0].copy()

In [57]:
df_style_temp.shape

(18335, 12)

In [58]:
df_style_temp.attribute_value.unique()

array(['Casual', 'Modern', 'Androgynous', 'Romantic', 'Boho',
       'Business Casual', 'Edgy', 'Glam', 'Classic', 'Athleisure',
       'Retro', 'modern', 'businesscasual', 'classic', 'glam', 'edgy',
       'casual', 'retro', 'boho', 'androgynous', 'romantic', 'athleisure'],
      dtype=object)

Looking at the unique values, we can see that the same label appears in several forms that differ in capitalization and spacing (for example, 'Business Casual' and 'businesscasual').
We map each set of variants to a single label.

In [59]:
def change_label_style(x):
    x = x.lower()
    if x == 'businesscasual':
        return 'business casual'
    else:
        return x

df_style_temp.attribute_value = df_style_temp.attribute_value.apply(change_label_style)

df_style_temp.attribute_value.unique()

array(['casual', 'modern', 'androgynous', 'romantic', 'boho',
       'business casual', 'edgy', 'glam', 'classic', 'athleisure',
       'retro'], dtype=object)
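
The same normalization can be written once for all categories with a lookup table instead of one function per category. The sketch below is one possible way to do it; the names `CANONICAL` and `normalize_label` are hypothetical, and the mapping lists the concatenated variants that appear in the style, occasion and fit values in this notebook.

```python
# Canonical spellings for the concatenated variants seen in the tag files.
CANONICAL = {
    "businesscasual": "business casual",
    "daytonight": "day to night",
    "nightout": "night out",
    "semifitted": "semi-fitted",
    "straightregular": "straight / regular",
    "fittedtailored": "fitted / tailored",
}

def normalize_label(value, mapping=CANONICAL):
    """Lowercase a label and map known concatenated variants to their spaced form."""
    value = value.lower()
    return mapping.get(value, value)

normalized = [normalize_label(v) for v in ["Casual", "businesscasual", "Business Casual"]]
print(normalized)  # ['casual', 'business casual', 'business casual']
```

Labels without an entry in the table, such as 'casual', are simply lowercased.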

In [60]:
# We then one hot encode the labels for each row
df_style_temp = df_style_temp[['product_id', 'attribute_name', 'attribute_value']]

# Remove duplicates
df_style_temp = df_style_temp.drop_duplicates(subset = ['product_id', 'attribute_name', 'attribute_value'], keep='first')

# one hot encoding
one_hot_style = pd.get_dummies(df_style_temp['attribute_value'])

# combining in one dataframe
df_style_temp = pd.concat([df_style_temp, one_hot_style], axis=1)

In [61]:
df_style_temp.shape

(10514, 14)

In [62]:
df_style_temp[df_style_temp['product_id']=='01DPGV4YRP3Z8J85DASGZ1Y99W'].head(10)

Unnamed: 0,product_id,attribute_name,attribute_value,androgynous,athleisure,boho,business casual,casual,classic,edgy,glam,modern,retro,romantic
2,01DPGV4YRP3Z8J85DASGZ1Y99W,style,casual,0,0,0,0,1,0,0,0,0,0,0
129,01DPGV4YRP3Z8J85DASGZ1Y99W,style,business casual,0,0,0,1,0,0,0,0,0,0,0
409,01DPGV4YRP3Z8J85DASGZ1Y99W,style,classic,0,0,0,0,0,1,0,0,0,0,0
555,01DPGV4YRP3Z8J85DASGZ1Y99W,style,edgy,0,0,0,0,0,0,1,0,0,0,0
1474,01DPGV4YRP3Z8J85DASGZ1Y99W,style,modern,0,0,0,0,0,0,0,0,1,0,0
2407,01DPGV4YRP3Z8J85DASGZ1Y99W,style,androgynous,1,0,0,0,0,0,0,0,0,0,0


We can observe that the same product id appears in several rows, one per attribute value. These rows need to be combined into a single row per product.

In [63]:
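# sum only the indicator columns so that the text columns are not aggregated
df_style_sum = df_style_temp.groupby('product_id')[list(one_hot_style.columns)].sum().reset_index()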
df_style_concat = df_style_temp.groupby(['product_id', 'attribute_name'])['attribute_value'].apply(lambda x: ', '.join(x)).reset_index()

In [64]:
df_style = pd.merge(df_style_sum, df_style_concat[['product_id', 'attribute_value']], on = 'product_id')

In [65]:
df_style.shape

(3916, 13)
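
The one-row-per-product indicator table can also be produced in a single step with `pd.crosstab`. This is a sketch on hypothetical data, offered as an alternative and not as the notebook's method; capping the counts at 1 guards against repeated (product, label) pairs.

```python
import pandas as pd

# Hypothetical long-format tags: one row per (product, label) pair, with one repeat.
long_tags = pd.DataFrame({
    "product_id": ["P1", "P1", "P1", "P2"],
    "attribute_value": ["casual", "classic", "casual", "modern"],
})

# Count label occurrences per product, then cap at 1 to get indicator columns.
multi_hot = pd.crosstab(long_tags["product_id"], long_tags["attribute_value"]).clip(upper=1)

print(multi_hot)
```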

In [66]:
df_style = pd.merge(df_style, df_pre[['product_id', 'combined']], on = 'product_id', how = "left")

In [67]:
df_style.shape

(3916, 14)

In [68]:
df_style.combined.isna().sum()

0

In [69]:
df_style.iloc[:,1:-2].max()

androgynous        1
athleisure         1
boho               1
business casual    1
casual             1
classic            1
edgy               1
glam               1
modern             1
retro              1
romantic           1
dtype: uint8

In [70]:
df_style.head(5)

Unnamed: 0,product_id,androgynous,athleisure,boho,business casual,casual,classic,edgy,glam,modern,retro,romantic,attribute_value,combined
0,01DPC9GSTT72KHNN0MNDNKH7RD,0,0,0,1,0,1,0,0,0,0,0,"business casual, classic",j crew devon bonded leather tote new wear - - ...
1,01DPCB2KEAVXXKFVM7FXBNE4VY,0,0,1,0,0,0,0,0,1,0,0,"modern, boho",j crew fiona lace - kitten heel ankle boot bla...
2,01DPCDEF6SYX2E1NT5X7HJBFGY,0,0,0,0,0,1,0,0,0,0,0,classic,j crew ribbed scarf everyday cashmere come qua...
3,01DPCG1C1P0MQAV9NMS3N1TDAA,0,0,0,0,0,0,0,1,0,0,1,"glam, romantic",j crew collection fluted sheath dress ratti ® ...
4,01DPCHNEW5F2RHJQ3NJMVPK6SE,1,0,0,1,1,1,0,0,0,0,0,"business casual, classic, casual, androgynous",j crew long - sleeve everyday cashmere mocknec...


We repeat the steps used for style for the other three categories.
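
Because the same sequence of steps is applied to each category, it could be wrapped in one helper. The function below is a hedged sketch, not the notebook's code: `build_label_frame` is a hypothetical name, the input is assumed to have `product_id`, `attribute_name` and `attribute_value` columns, and duplicates are dropped on product and value only.

```python
import pandas as pd

def build_label_frame(tags, attribute_names, normalize=str.lower):
    """Return one row per product with 0/1 label columns and a joined label string."""
    # Keep only the rows for the requested attribute names and normalize the labels.
    subset = tags[tags["attribute_name"].isin(attribute_names)].copy()
    subset["attribute_value"] = subset["attribute_value"].apply(normalize)
    subset = subset.drop_duplicates(subset=["product_id", "attribute_value"])

    # One indicator column per label, collapsed to one row per product.
    indicators = pd.get_dummies(subset["attribute_value"]).astype(int)
    indicators = indicators.groupby(subset["product_id"]).max()

    # Human-readable list of labels per product.
    joined = subset.groupby("product_id")["attribute_value"].apply(", ".join)

    return indicators.assign(attribute_value=joined).reset_index()

# Hypothetical miniature of the combined tag table.
tags = pd.DataFrame({
    "product_id": ["P1", "P1", "P2", "P2"],
    "attribute_name": ["fit", "fit", "fit", "style"],
    "attribute_value": ["Relaxed", "oversized", "relaxed", "Modern"],
})

df_fit_demo = build_label_frame(tags, ["fit"])
print(df_fit_demo)
```

A category-specific normalizer can be passed through the `normalize` argument in place of the default `str.lower`.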

In [71]:
# occasion

In [72]:
df_occasion_temp = df_tag_keep[df_tag_keep['occasion']>0].copy()

In [73]:
df_occasion_temp.shape

(15694, 12)

In [74]:
df_occasion_temp.attribute_value.unique()

array(['Weekend', 'Day to Night', 'Work', 'Vacation', 'Night Out',
       'Workout', 'weekend', 'workout', 'daytonight', 'coldweather',
       'vacation', 'nightout', 'work'], dtype=object)

As with style, the same label appears in several forms that differ in capitalization and spacing (for example, 'Day to Night' and 'daytonight').
We map each set of variants to a single label.

In [75]:
def change_label_occasion(x):
    x = x.lower()
    if x == 'daytonight':
        return 'day to night'
    if x == 'nightout':
        return 'night out'
    else:
        return x

df_occasion_temp.attribute_value = df_occasion_temp.attribute_value.apply(change_label_occasion)

df_occasion_temp.attribute_value.unique()

array(['weekend', 'day to night', 'work', 'vacation', 'night out',
       'workout', 'coldweather'], dtype=object)

In [76]:
# We then one hot encode the labels for each row
df_occasion_temp = df_occasion_temp[['product_id', 'attribute_name', 'attribute_value']]

# Remove duplicates
df_occasion_temp = df_occasion_temp.drop_duplicates(subset = ['product_id', 'attribute_name', 'attribute_value'], keep='first')

# one hot encoding
one_hot_occasion = pd.get_dummies(df_occasion_temp['attribute_value'])

# combining in one dataframe
df_occasion_temp = pd.concat([df_occasion_temp, one_hot_occasion], axis=1)

In [77]:
df_occasion_temp.shape

(9054, 10)

In [78]:
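# sum only the indicator columns so that the text columns are not aggregated
df_occasion_sum = df_occasion_temp.groupby('product_id')[list(one_hot_occasion.columns)].sum().reset_index()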
df_occasion_concat = df_occasion_temp.groupby(['product_id', 'attribute_name'])['attribute_value'].apply(lambda x: ', '.join(x)).reset_index()

In [79]:
df_occasion = pd.merge(df_occasion_sum, df_occasion_concat[['product_id', 'attribute_value']], on = 'product_id')

In [80]:
df_occasion.shape

(3914, 9)

In [81]:
df_occasion = pd.merge(df_occasion, df_pre[['product_id', 'combined']], on = 'product_id')

In [82]:
df_occasion.shape

(3914, 10)

In [83]:
df_occasion.combined.isna().sum()

0

In [84]:
df_occasion.iloc[:,1:-2].max()

coldweather     1
day to night    1
night out       1
vacation        1
weekend         1
work            1
workout         1
dtype: uint8

In [85]:
df_occasion.head(5)

Unnamed: 0,product_id,coldweather,day to night,night out,vacation,weekend,work,workout,attribute_value,combined
0,01DPC9GSTT72KHNN0MNDNKH7RD,0,1,0,0,0,1,0,"day to night, work",j crew devon bonded leather tote new wear - - ...
1,01DPCB2KEAVXXKFVM7FXBNE4VY,0,1,0,0,1,1,0,"day to night, weekend, work",j crew fiona lace - kitten heel ankle boot bla...
2,01DPCG1C1P0MQAV9NMS3N1TDAA,0,0,1,0,1,0,0,"weekend, night out",j crew collection fluted sheath dress ratti ® ...
3,01DPCHNEW5F2RHJQ3NJMVPK6SE,0,1,0,0,1,1,0,"work, day to night, weekend",j crew long - sleeve everyday cashmere mocknec...
4,01DPCHNQM0PA0SXZZZX85PF2ZJ,0,1,0,0,1,0,0,"day to night, weekend",j crew slim boyfriend jean hydrangea blue wash...


In [86]:
# fit

In [87]:
df_fit_temp = df_tag_keep[df_tag_keep['fit']>0].copy()

In [88]:
df_fit_temp.shape

(4728, 12)

In [89]:
df_fit_temp.attribute_value.unique()

array(['Semi-Fitted', 'Straight / Regular', 'Relaxed',
       'Fitted / Tailored', 'Oversized', 'straightregular', 'semifitted',
       'relaxed', 'oversized', 'fittedtailored'], dtype=object)

Here too, the same label appears in several forms that differ in capitalization, spacing and punctuation (for example, 'Semi-Fitted' and 'semifitted').
We map each set of variants to a single label.

In [90]:
def change_label_fit(x):
    x = x.lower()
    if x == 'semifitted':
        return 'semi-fitted'
    if x == 'straightregular':
        return 'straight / regular'
    if x == 'fittedtailored':
        return 'fitted / tailored'
    else:
        return x

df_fit_temp.attribute_value = df_fit_temp.attribute_value.apply(change_label_fit)

df_fit_temp.attribute_value.unique()

array(['semi-fitted', 'straight / regular', 'relaxed',
       'fitted / tailored', 'oversized'], dtype=object)

In [91]:
# We then one hot encode the labels for each row
df_fit_temp = df_fit_temp[['product_id', 'attribute_name', 'attribute_value']]

# Remove duplicates
df_fit_temp = df_fit_temp.drop_duplicates(subset = ['product_id', 'attribute_name', 'attribute_value'], keep='first')

# one hot encoding
one_hot_fit = pd.get_dummies(df_fit_temp['attribute_value'])

# combining in one dataframe
df_fit_temp = pd.concat([df_fit_temp, one_hot_fit], axis=1)

In [92]:
df_fit_temp.shape

(3040, 8)

In [93]:
df_fit_sum = df_fit_temp.groupby('product_id').sum().reset_index()
df_fit_concat = df_fit_temp.groupby(['product_id', 'attribute_name'])['attribute_value'].apply(lambda x: ', '.join(x)).reset_index()

In [94]:
df_fit = pd.merge(df_fit_sum, df_fit_concat[['product_id', 'attribute_value']], on = 'product_id')

In [95]:
df_fit.shape

(2949, 7)

In [96]:
df_fit = pd.merge(df_fit, df_pre[['product_id', 'combined']], on = 'product_id')

In [97]:
df_fit.shape

(2949, 8)

In [98]:
df_fit.combined.isna().sum()

0

In [99]:
df_fit.iloc[:,1:-2].max()

fitted / tailored     1
oversized             1
relaxed               1
semi-fitted           1
straight / regular    1
dtype: uint8

In [100]:
df_fit.head(5)

Unnamed: 0,product_id,fitted / tailored,oversized,relaxed,semi-fitted,straight / regular,attribute_value,combined
0,01DPCG1C1P0MQAV9NMS3N1TDAA,0,0,0,1,0,semi-fitted,j crew collection fluted sheath dress ratti ® ...
1,01DPCHNEW5F2RHJQ3NJMVPK6SE,0,0,1,0,0,relaxed,j crew long - sleeve everyday cashmere mocknec...
2,01DPCHNQM0PA0SXZZZX85PF2ZJ,0,0,1,0,1,"straight / regular, relaxed",j crew slim boyfriend jean hydrangea blue wash...
3,01DPCQ0CNKKHD3899ZEY9SEDHA,0,0,0,0,1,straight / regular,j crew curvy slim stretch perfect shirt stripe...
4,01DPCXVC45EFA5ZYKAXT412KG1,0,0,0,1,0,semi-fitted,j crew collection cropped moto jacket crackled...


In [101]:
# color

In [102]:
df_color_temp = df_tag_keep[(df_tag_keep['color']>0)|
                      (df_tag_keep['Primary Color']>0)|
                      (df_tag_keep['Additional Color']>0)|
                      (df_tag_keep['primarycolor']>0)|
                      (df_tag_keep['additionalcolor']>0)].copy()

In [103]:
df_color_temp.shape

(9856, 12)

In [104]:
df_color_temp.attribute_value.unique()

array(['Blacks', 'Pinks', 'Browns', 'Golds', 'Whites', 'Reds', 'Navy',
       'Beiges', 'Blues', 'Greens', 'Silvers', 'Neutrals', 'Yellows',
       'Grays', 'Burgundies', 'Purples', 'Multi', 'Oranges', 'Teal',
       'blues', 'blacks', 'yellows', 'oranges', 'whites', 'multi', 'navy',
       'darkbrowns', 'grays', 'greens', 'pinks', 'beiges', 'golds',
       'silvers', 'lightbrowns', 'reds', 'lightbrown', 'burgundies',
       'purples', 'teal'], dtype=object)

As with fit, the color labels appear in both capitalized and lowercase forms, and `lightbrown` also appears alongside `lightbrowns`.
We lowercase every label and merge `lightbrown` into `lightbrowns`.

In [105]:
def change_label_color(x):
    x = x.lower()
    if x == 'lightbrown':
        return 'lightbrowns'
    else:
        return x

df_color_temp.attribute_value = df_color_temp.attribute_value.apply(change_label_color)

df_color_temp.attribute_value.unique()

array(['blacks', 'pinks', 'browns', 'golds', 'whites', 'reds', 'navy',
       'beiges', 'blues', 'greens', 'silvers', 'neutrals', 'yellows',
       'grays', 'burgundies', 'purples', 'multi', 'oranges', 'teal',
       'darkbrowns', 'lightbrowns'], dtype=object)

In [106]:
# We then one hot encode the labels for each row
df_color_temp = df_color_temp[['product_id', 'attribute_name', 'attribute_value']]

# Remove duplicates
df_color_temp = df_color_temp.drop_duplicates(subset = ['product_id', 'attribute_name', 'attribute_value'], keep='first')

# one hot encoding
one_hot_color = pd.get_dummies(df_color_temp['attribute_value'])

# combining in one dataframe
df_color_temp = pd.concat([df_color_temp, one_hot_color], axis=1)

In [107]:
df_color_temp.shape

(9346, 24)

In [108]:
df_color_sum = df_color_temp.groupby('product_id').sum().reset_index()
df_color_concat = df_color_temp.groupby(['product_id', 'attribute_name'])['attribute_value'].apply(lambda x: ', '.join(x)).reset_index()

In [109]:
df_color = pd.merge(df_color_sum, df_color_concat[['product_id', 'attribute_value']], on = 'product_id')

In [110]:
df_color.shape

(6672, 23)

In [111]:
df_color = pd.merge(df_color, df_pre[['product_id', 'combined']], on = 'product_id')

In [112]:
df_color.shape

(6672, 24)

In [113]:
df_color.combined.isna().sum()

0

In [114]:
df_color.iloc[:,1:-2].max()

beiges         4
blacks         4
blues          4
browns         2
burgundies     2
darkbrowns     2
golds          4
grays          4
greens         4
lightbrowns    2
multi          4
navy           2
neutrals       1
oranges        3
pinks          4
purples        4
reds           4
silvers        2
teal           2
whites         4
yellows        3
dtype: uint8

In [115]:
def ret_1(x):
    # Cap summed one-hot counts at 1 so every label column is a 0/1 indicator
    if x>=1:
        return 1
    else:
        return 0

In [116]:
for i in list(df_color.iloc[:,1:-2].columns):
    df_color[i] = df_color[i].apply(ret_1)

In [117]:
df_color.iloc[:,1:-2].max()

beiges         1
blacks         1
blues          1
browns         1
burgundies     1
darkbrowns     1
golds          1
grays          1
greens         1
lightbrowns    1
multi          1
navy           1
neutrals       1
oranges        1
pinks          1
purples        1
reds           1
silvers        1
teal           1
whites         1
yellows        1
dtype: int64
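Counts above 1 presumably arise because the same color value can be attached to one product under several attribute names, and the per-product sum then adds those rows together. As a hedged alternative to the element-wise `apply`, the capping can be done in one vectorized step. The sketch below uses an invented toy table; `tags`, `counts`, and `multi_hot` are illustrative names.

```python
import pandas as pd

# Toy long-format tags: product "A" carries "blacks" under two attribute names.
tags = pd.DataFrame({
    "product_id": ["A", "A", "A", "B"],
    "attribute_name": ["color", "primarycolor", "color", "color"],
    "attribute_value": ["blacks", "blacks", "reds", "blues"],
})

# Count occurrences per product and label, then cap every count at 1.
counts = pd.crosstab(tags["product_id"], tags["attribute_value"])
multi_hot = counts.clip(upper=1)
print(multi_hot)
```

`clip(upper=1)` leaves zeros untouched and turns any positive count into 1, which matches what `ret_1` does for non-negative integers.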

In [118]:
df_color.head(5)

Unnamed: 0,product_id,beiges,blacks,blues,browns,burgundies,darkbrowns,golds,grays,greens,lightbrowns,multi,navy,neutrals,oranges,pinks,purples,reds,silvers,teal,whites,yellows,attribute_value,combined
0,01DMBRYVA2P5H24WK0HTK4R0A1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,beiges,eileen fisher slim knit skirt nice skirt appar...
1,01DPC9GSTT72KHNN0MNDNKH7RD,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,"burgundies, lightbrowns, blacks",j crew devon bonded leather tote new wear - - ...
2,01DPCB2KEAVXXKFVM7FXBNE4VY,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"browns, blacks",j crew fiona lace - kitten heel ankle boot bla...
3,01DPCB2KEAVXXKFVM7FXBNE4VY,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"blacks, darkbrowns",j crew fiona lace - kitten heel ankle boot bla...
4,01DPCDEF6SYX2E1NT5X7HJBFGY,1,1,0,0,1,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,pinks,j crew ribbed scarf everyday cashmere come qua...


In [120]:
df_style.to_csv('01 style.csv', index = False)
df_occasion.to_csv('02 occasion.csv', index = False)
df_fit.to_csv('03 fit.csv', index = False)
df_color.to_csv('04 color.csv', index = False)

## Model Selection

In this part, we define several vectorization methods and models. We split the labelled data into training and test sets in a 75/25 ratio, train one binary classifier per label, and evaluate each on the test set, treating 50% accuracy as the baseline.
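Because each label is scored as its own binary problem and some labels are rare, a fixed 50% baseline can be easy to beat. One hedged alternative is to compare each model against the accuracy of always predicting the most frequent class for that label. The sketch below is an illustration on an invented label vector; `majority_baseline` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def majority_baseline(y) -> float:
    """Accuracy obtained by always predicting the most frequent class in y."""
    y = np.asarray(y)
    positive_rate = y.mean()
    return float(max(positive_rate, 1.0 - positive_rate))

# Toy label column: 1 positive example out of 20 rows.
y_toy = np.array([1] + [0] * 19)
print(round(majority_baseline(y_toy), 2))
```

A model that scores 0.96 on a label whose majority baseline is 0.95 has learned far less than the raw accuracy suggests.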

In [121]:
def count_(docs):
    '''
    This returns count vectorized vectors of the docs
    '''
    
    # drop English stop words, use binary (presence/absence) features, and keep only terms that appear in at least 10 and at most 5000 documents
    vectorizer = CountVectorizer(ngram_range=(1,1), stop_words="english", binary=True, min_df=10, max_df=5000) 
    vectorizer = vectorizer.fit(docs)
    X = vectorizer.transform(docs)

    vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
    
    return vectorized_df, vectorizer

def count_2gram(docs):
    
    vectorizer = CountVectorizer(ngram_range=(2,2), stop_words="english", binary=True, min_df=10, max_df=5000)
    vectorizer = vectorizer.fit(docs)
    X = vectorizer.transform(docs)

    vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
    
    return vectorized_df, vectorizer

def count_3gram(docs):
    
    vectorizer = CountVectorizer(ngram_range=(3,3), stop_words="english", binary=True, min_df=10, max_df=5000)
    vectorizer = vectorizer.fit(docs)
    X = vectorizer.transform(docs)
    
    vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
    
    return vectorized_df, vectorizer

def tfidf(docs):
    
    vectorizer = TfidfVectorizer(ngram_range=(1,1), stop_words="english", max_df=0.75)
    vectorizer = vectorizer.fit(docs)
    X = vectorizer.transform(docs)

    vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
    
    return vectorized_df, vectorizer

def tfidf_2gram(docs):
    
    vectorizer = TfidfVectorizer(ngram_range=(2,2), stop_words="english", max_df=0.75)
    vectorizer = vectorizer.fit(docs)
    X = vectorizer.transform(docs)

    vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
    
    return vectorized_df, vectorizer

def tfidf_3gram(docs):
    
    vectorizer = TfidfVectorizer(ngram_range=(3,3), stop_words="english", max_df=0.75)
    vectorizer = vectorizer.fit(docs)
    X = vectorizer.transform(docs)

    vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
    
    return vectorized_df, vectorizer
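The six functions above differ only in the vectorizer class and the n-gram length. A minimal sketch of a single factory that covers all six cases is shown below; `make_vectorizer` is a hypothetical name, and the parameter values are copied from the functions above. The usage example fits only the TF-IDF variant, since `min_df=10` would prune every term of a three-document toy corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def make_vectorizer(kind: str, n: int):
    """Return an unfitted vectorizer for n-grams of exactly length n."""
    if kind == "count":
        return CountVectorizer(ngram_range=(n, n), stop_words="english",
                               binary=True, min_df=10, max_df=5000)
    if kind == "tfidf":
        return TfidfVectorizer(ngram_range=(n, n), stop_words="english",
                               max_df=0.75)
    raise ValueError(f"unknown vectorizer kind: {kind!r}")

# Toy corpus (invented) to show the bigram TF-IDF variant.
docs = ["blue denim jacket", "red cotton dress", "blue cotton scarf"]
vectorizer = make_vectorizer("tfidf", 2)
X = vectorizer.fit_transform(docs)
print(X.shape)
```

As in the notebook, the vectorizer should be fitted on the training split only and then reused with `transform` on the test split.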

In [122]:
# Model
def log_reg(X_train, Y_train, X_test, Y_test):
    
    lr = LogisticRegression()
    lr.fit(X_train, Y_train)
    
    y_pred = lr.predict(X_test)
    
    return round(np.mean(y_pred == Y_test),2)

In [239]:
category_list = []
label_list = []
vectorizer_list = []
logreg_accuracy = []

cat = ['style', 'occasion', 'fit', 'color']

for i in tqdm_notebook(cat):
    
    if i == 'style':
        df = df_style.copy()
        
    if i == 'occasion':
        df = df_occasion.copy()
    
    if i == 'fit':
        df = df_fit.copy()
    
    if i == 'color':
        df = df_color.copy()
        
        
    labels = list(df.iloc[:,1:-2].columns)

    for j in tqdm_notebook(labels):
        
        for k in ['c_1ng', 'c_2ng', 'c_3ng', 'tfidf_1ng', 'tfidf_2ng', 'tfidf_3ng']:
            category_list.append(i)
            label_list.append(j)
            
            X_train, X_test, y_train, y_test = train_test_split(df['combined'], df[j], test_size=0.25, random_state=42)

            if k=="c_1ng":
                train_df, vectorizer = count_(X_train)
                vectorizer_list.append("count_1gram")

            if k=="c_2ng":
                train_df, vectorizer = count_2gram(X_train)
                vectorizer_list.append("count_2gram")

            if k=="c_3ng":
                train_df, vectorizer = count_3gram(X_train)
                vectorizer_list.append("count_3gram")

            if k=="tfidf_1ng":
                train_df, vectorizer = tfidf(X_train)
                vectorizer_list.append("tfid_1gram")

            if k=="tfidf_2ng":
                train_df, vectorizer = tfidf_2gram(X_train)
                vectorizer_list.append("tfidf_2gram")

            if k=="tfidf_3ng":
                train_df, vectorizer = tfidf_3gram(X_train)
                vectorizer_list.append("tfidf_3gram")

            test_vector = vectorizer.transform(X_test)

            test_df = pd.DataFrame(test_vector.toarray(), columns=vectorizer.get_feature_names())

            accuracy = log_reg(train_df, y_train, test_df, y_test)

            logreg_accuracy.append(accuracy)

In [240]:
results_logreg = pd.DataFrame({"category":category_list, "label":label_list, "vectorizer":vectorizer_list,
                               "LogReg_Accuracy":logreg_accuracy})

In [241]:
results_logreg.head(10)

Unnamed: 0,category,label,vectorizer,LogReg_Accuracy
0,style,androgynous,count_1gram,0.82
1,style,androgynous,count_2gram,0.82
2,style,androgynous,count_3gram,0.81
3,style,androgynous,tfid_1gram,0.84
4,style,androgynous,tfidf_2gram,0.82
5,style,androgynous,tfidf_3gram,0.82
6,style,athleisure,count_1gram,0.97
7,style,athleisure,count_2gram,0.97
8,style,athleisure,count_3gram,0.96
9,style,athleisure,tfid_1gram,0.96


In [242]:
results_logreg.to_csv('01 Results_logreg.csv', index = False)

### Using Word Embeddings with Keras

In [127]:
from typing import List

def get_max_token_length_per_doc(docs: List[str]) -> int:
    return max(len(doc.split()) for doc in docs)

In [128]:
def integer_encode_documents(docs, tokenizer):
    return tokenizer.texts_to_sequences(docs)
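To make the encoding and padding steps concrete, here is a dependency-free sketch of what they do conceptually. It is not the Keras implementation: `build_index`, `encode`, and `pad_post` are illustrative names, and the index layout (0 for padding, 1 for the out-of-vocabulary token, more frequent words first) only approximates the Keras `Tokenizer` convention.

```python
from collections import Counter

def build_index(docs, oov_token="UNKNOWN_TOKEN"):
    """Map each word to an integer; 0 is kept for padding and 1 for unseen words."""
    counts = Counter(word for doc in docs for word in doc.split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def encode(doc, index, oov_token="UNKNOWN_TOKEN"):
    """Replace each word with its integer id, falling back to the OOV id."""
    return [index.get(word, index[oov_token]) for word in doc.split()]

def pad_post(seq, max_length):
    """Cut or right-pad a sequence with zeros to exactly max_length items."""
    return (seq + [0] * max_length)[:max_length]

docs = ["slim blue jean", "blue silk dress blue"]
index = build_index(docs)
max_length = max(len(doc.split()) for doc in docs)
padded = [pad_post(encode(doc, index), max_length) for doc in docs]
print(padded)
```

Every document ends up with the same length, which is what the `Embedding` layer expects. Note that `pad_post` cuts overly long sequences at the end, whereas Keras `pad_sequences` drops tokens from the start by default; the difference does not matter when `max_length` is the longest document, as in this notebook.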

In [129]:
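# layers and padding helper used by the Keras cells in this section
from keras.layers import Dense, Embedding
from keras.preprocessing.sequence import pad_sequences

def keras_m(X_train, y_train, X_test, y_test, vocab_size, EMBEDDING_SIZE, max_length):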
    # define the model
    model = Sequential()
    model.add(Embedding(vocab_size, EMBEDDING_SIZE, input_length=max_length))
    model.add(Flatten())
    
    # since we are doing binary classification, activation function is sigmoid
    model.add(Dense(1, activation='sigmoid')) 

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
    
    model.fit(X_train, y_train, epochs=20, verbose=0)
    
    loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
    
    return round(accuracy,2)

In [130]:
category_list = []
label_list = []
vectorizer_list = []
keras_accuracy = []

cat = ['style', 'occasion', 'fit', 'color']

for i in tqdm_notebook(cat):
    
    if i == 'style':
        df = df_style.copy()
        
    if i == 'occasion':
        df = df_occasion.copy()
    
    if i == 'fit':
        df = df_fit.copy()

    if i == 'color':
        df = df_color.copy()
        
    labels = list(df.iloc[:,1:-2].columns)

    for j in tqdm_notebook(labels):
        
        tokenizer = Tokenizer(num_words=5000, oov_token="UNKNOWN_TOKEN")
        tokenizer.fit_on_texts(df['combined'])
        
        max_length = get_max_token_length_per_doc(df['combined'])
        
        # integer encode the training data
        encoded_docs = integer_encode_documents(df['combined'], tokenizer)
        # pad the documents
        padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
        # get vocab size
        vocab_size = int(len(tokenizer.word_index) + 1)
        
        X_train, X_test, y_train, y_test = train_test_split(padded_docs, df[j], test_size=0.25, random_state=42)
        
        EMBEDDING_SIZE = 50
        
        accuracy = keras_m(X_train, y_train, X_test, y_test, vocab_size, EMBEDDING_SIZE, max_length)
        
        # Storing values in a list
        keras_accuracy.append(accuracy)
        category_list.append(i)
        label_list.append(j)
        vectorizer_list.append("keras-word embed")

In [None]:
results_keras_w = pd.DataFrame({"category":category_list, "label":label_list, "vectorizer":vectorizer_list,
                               "Keras_Accuracy":keras_accuracy})

In [238]:
results_keras_w.head(10)

Unnamed: 0,category,label,vectorizer,Accuracy,model
0,style,androgynous,keras-word embed,0.84,Keras-Word
1,style,athleisure,keras-word embed,0.98,Keras-Word
2,style,boho,keras-word embed,0.9,Keras-Word
3,style,business casual,keras-word embed,0.81,Keras-Word
4,style,casual,keras-word embed,0.79,Keras-Word
5,style,classic,keras-word embed,0.72,Keras-Word
6,style,edgy,keras-word embed,0.82,Keras-Word
7,style,glam,keras-word embed,0.89,Keras-Word
8,style,modern,keras-word embed,0.74,Keras-Word
9,style,retro,keras-word embed,0.95,Keras-Word


In [135]:
results_keras_w.to_csv('01 Results_keras_w.csv', index = False)

### Using Pre-Trained Embeddings with Keras

In [138]:
# Load the pre-trained 100-dimensional GloVe vectors into a word -> vector dict
embeddings_index = dict()
with open('glove.6B.100d.txt', encoding = 'utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
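The GloVe text file stores one token per line, followed by its vector components. The sketch below walks through the parse and the later lookup-matrix build on a tiny in-memory stand-in. The 3-dimensional vectors, the `fake_glove` buffer, and the `word_index` dictionary are invented for illustration; the real file has 100 components per word.

```python
import io
import numpy as np

# Invented 3-dimensional vectors in GloVe's "word v1 v2 v3" text layout.
fake_glove = io.StringIO("blue 0.1 0.2 0.3\njean 0.4 0.5 0.6\n")

embeddings = {}
for line in fake_glove:
    word, *values = line.split()
    embeddings[word] = np.asarray(values, dtype="float32")

# Hypothetical tokenizer index; row 0 is left for padding.
word_index = {"blue": 1, "jean": 2, "hydrangea": 3}
matrix = np.zeros((len(word_index) + 1, 3), dtype="float32")
for word, row in word_index.items():
    vector = embeddings.get(word)
    if vector is not None:  # words missing from the file keep an all-zero row
        matrix[row] = vector

print(matrix)
```

Words without a pre-trained vector stay at zero, so with a frozen embedding layer the model receives no signal from them.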

In [148]:
def keras_gl(X_train, y_train, X_test, y_test, vocab_size, embedding_matrix, max_length, padded_docs):
    # define model
    model = Sequential()
    e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=max_length, trainable=False)
    model.add(e)
    model.add(Flatten())
    model.add(Dense(1, activation='sigmoid'))
    
    # compile the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

    # fit the model
    model.fit(X_train, y_train, epochs=20, verbose=0)
    # evaluate the model
    loss, accuracy = model.evaluate(X_test, y_test, verbose=0)

    return round(accuracy,2)

In [151]:
category_list = []
label_list = []
vectorizer_list = []
keras_glove_accuracy = []

cat = ['style', 'occasion', 'fit', 'color']

for c in tqdm_notebook(cat):
    
    if c == 'style':
        df = df_style.copy()
        
    if c == 'occasion':
        df = df_occasion.copy()
    
    if c == 'fit':
        df = df_fit.copy()

    if c == 'color':
        df = df_color.copy()
        
    labels = list(df.iloc[:,1:-2].columns)

    for j in tqdm_notebook(labels):
        
        tokenizer = Tokenizer(num_words=5000, oov_token="UNKNOWN_TOKEN")
        tokenizer.fit_on_texts(df['combined'])
        
        max_length = get_max_token_length_per_doc(df['combined'])
        
        # integer encode the training data
        encoded_docs = integer_encode_documents(df['combined'], tokenizer)
        # pad the documents
        padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
        # get vocab size
        vocab_size = int(len(tokenizer.word_index) + 1)
        
        # create a weight matrix for words in training docs
        embedding_matrix = zeros((vocab_size, 100))
        for word, i in tokenizer.word_index.items():
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None: # check that it is an actual word that we have embeddings for
                embedding_matrix[i] = embedding_vector
                
        X_train, X_test, y_train, y_test = train_test_split(padded_docs, df[j], test_size=0.25, random_state=42)
        
        EMBEDDING_SIZE = 100
        
        accuracy = keras_gl(X_train, y_train, X_test, y_test, vocab_size, embedding_matrix, max_length, padded_docs)
        
        # Storing values in a list
        keras_glove_accuracy.append(accuracy)
        category_list.append(c)
        label_list.append(j)
        vectorizer_list.append("keras-glove")

In [159]:
results_keras_gl = pd.DataFrame({"category":category_list, "label":label_list, "vectorizer":vectorizer_list,
                               "KerasGlove_Accuracy":keras_glove_accuracy})

In [160]:
results_keras_gl.head(5)

Unnamed: 0,category,label,vectorizer,KerasGlove_Accuracy
0,style,androgynous,keras-glove,0.8
1,style,athleisure,keras-glove,0.96
2,style,boho,keras-glove,0.88
3,style,business casual,keras-glove,0.75
4,style,casual,keras-glove,0.71


In [153]:
results_keras_gl.to_csv('04 Results_keras_gl.csv', index = False)

### Getting the best model for each label

In [156]:
results_logreg['model'] = 'LogReg'
results_logreg = results_logreg.rename(columns={'LogReg_Accuracy':'Accuracy'})

results_logreg.head(5)

Unnamed: 0,category,label,vectorizer,Accuracy,model
0,style,androgynous,count_1gram,0.82,LogReg
1,style,androgynous,count_2gram,0.82,LogReg
2,style,androgynous,count_3gram,0.81,LogReg
3,style,androgynous,tfid_1gram,0.84,LogReg
4,style,androgynous,tfidf_2gram,0.82,LogReg


In [171]:
results_keras_gl['model'] = 'Keras-Glove'
results_keras_gl = results_keras_gl.rename(columns={'KerasGlove_Accuracy':'Accuracy'})

results_keras_gl.head(5)

Unnamed: 0,category,label,vectorizer,Accuracy,model
0,style,androgynous,keras-glove,0.8,Keras-Glove
1,style,athleisure,keras-glove,0.96,Keras-Glove
2,style,boho,keras-glove,0.88,Keras-Glove
3,style,business casual,keras-glove,0.75,Keras-Glove
4,style,casual,keras-glove,0.71,Keras-Glove


In [172]:
results_keras_w['model'] = 'Keras-Word'
results_keras_w = results_keras_w.rename(columns={'Keras_Accuracy':'Accuracy'})

results_keras_w.head(5)

Unnamed: 0,category,label,vectorizer,Accuracy,model
0,style,androgynous,keras-word embed,0.84,Keras-Word
1,style,athleisure,keras-word embed,0.98,Keras-Word
2,style,boho,keras-word embed,0.9,Keras-Word
3,style,business casual,keras-word embed,0.81,Keras-Word
4,style,casual,keras-word embed,0.79,Keras-Word


In [173]:
results_all = pd.concat([results_logreg, results_keras_w, results_keras_gl])
results_all.shape

(352, 5)

In [230]:
results_fin = results_all.copy()

results_fin = results_fin.sort_values('Accuracy', ascending=False).drop_duplicates(['category','label'])
results_fin = results_fin.reset_index()
results_fin = results_fin.drop(columns = ['index'])
results_fin = results_fin.sort_values(['category', 'label'], ascending=False)
results_fin.head(10)

Unnamed: 0,category,label,vectorizer,Accuracy,model
27,style,romantic,tfid_1gram,0.89,LogReg
22,style,retro,keras-word embed,0.95,Keras-Word
42,style,modern,tfid_1gram,0.74,LogReg
26,style,glam,keras-word embed,0.89,Keras-Word
33,style,edgy,keras-word embed,0.82,Keras-Word
43,style,classic,tfid_1gram,0.73,LogReg
34,style,casual,tfid_1gram,0.82,LogReg
35,style,business casual,keras-word embed,0.81,Keras-Word
25,style,boho,keras-word embed,0.9,Keras-Word
9,style,athleisure,keras-word embed,0.98,Keras-Word
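The sort-then-deduplicate step keeps the highest-accuracy row for each (category, label) pair. An equivalent formulation, sketched here under the assumption that the results table has the same five columns, uses `groupby` with `idxmax`. The rows below are invented; note that the index must be unique first, which a frame built with `pd.concat` does not guarantee.

```python
import pandas as pd

# Toy results in the same layout as results_all (values invented).
results = pd.DataFrame({
    "category": ["style", "style", "style", "fit"],
    "label": ["boho", "boho", "retro", "relaxed"],
    "vectorizer": ["count_1gram", "keras-word embed", "tfidf_2gram", "keras-glove"],
    "Accuracy": [0.88, 0.90, 0.95, 0.70],
    "model": ["LogReg", "Keras-Word", "LogReg", "Keras-Glove"],
})

# A unique index is needed so idxmax points at exactly one row.
results = results.reset_index(drop=True)
best_rows = results.groupby(["category", "label"])["Accuracy"].idxmax()
best = results.loc[best_rows].reset_index(drop=True)
print(best)
```

When two configurations tie on accuracy, `idxmax` keeps the first one in row order, so the order in which the result frames are concatenated decides the winner.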


In [231]:
results_fin.shape

(44, 5)

In [316]:
results_fin.vectorizer.value_counts()

keras-word embed    24
tfid_1gram           8
keras-glove          7
count_1gram          4
tfidf_3gram          1
Name: vectorizer, dtype: int64

In [315]:
results_fin.to_csv('01 Final Results.csv', index = False)

In [353]:
def log_reg_fin(X_train, Y_train):
    
    model = LogisticRegression()
    print(X_train.shape)
    print(Y_train.shape)
    model.fit(X_train, Y_train)
    
    return model

In [354]:
def keras_m_fin(X_train, y_train, vocab_size, EMBEDDING_SIZE, max_length):
    
    # define the model
    model = Sequential()
    model.add(Embedding(vocab_size, EMBEDDING_SIZE, input_length=max_length))
    model.add(Flatten())
    
    # since we are doing binary classification, activation function is sigmoid
    model.add(Dense(1, activation='sigmoid')) 

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
    
    model.fit(X_train, y_train, epochs=20, verbose=0)
    
    return model

In [355]:
def keras_gl_fin(X_train, y_train, vocab_size, embedding_matrix, max_length):
    
    # define model
    model = Sequential()
    e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=max_length, trainable=False)
    model.add(e)
    model.add(Flatten())
    model.add(Dense(1, activation='sigmoid'))
    
    # compile the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

    # fit the model
    model.fit(X_train, y_train, epochs=20, verbose=0)

    return model

In [356]:
category_list = []
label_list = []
vectorizer_list = []
models_list = []

for i in tqdm_notebook(cat):
    
    if i == 'style':
        df = df_style.copy()
        
    if i == 'occasion':
        df = df_occasion.copy()
    
    if i == 'fit':
        df = df_fit.copy()
    
    if i == 'color':
        df = df_color.copy()
        
    
    labels = list(df.iloc[:,1:-2].columns)
    
    tokenizer = Tokenizer(num_words=5000, oov_token="UNKNOWN_TOKEN")
    tokenizer.fit_on_texts(df['combined'])

    max_length = get_max_token_length_per_doc(df['combined'])

    # integer encode the training data
    encoded_docs = integer_encode_documents(df['combined'], tokenizer)
    # pad the documents
    padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
    # get vocab size
    vocab_size = int(len(tokenizer.word_index) + 1)

    # create a weight matrix for words in training docs
    embedding_matrix = zeros((vocab_size, 100))
    for word, p in tokenizer.word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: # check that it is an actual word that we have embeddings for
            embedding_matrix[p] = embedding_vector

    for j in tqdm_notebook(labels):
        
        category_list.append(i)
        label_list.append(j)

        X_train = df['combined'].copy()
        y_train = df[j].copy()
        
        vectorizer_fin = results_fin.loc[(results_fin.category==i)&(results_fin.label==j),'vectorizer'].values[0]

        if vectorizer_fin=="count_1gram":
            train_df, vectorizer = count_(X_train)
            vectorizer_list.append(vectorizer)

        if vectorizer_fin=="count_2gram":
            train_df, vectorizer = count_2gram(X_train)
            vectorizer_list.append(vectorizer)

        if vectorizer_fin=="count_3gram":
            train_df, vectorizer = count_3gram(X_train)
            vectorizer_list.append(vectorizer)

        if vectorizer_fin=="tfid_1gram":
            train_df, vectorizer = tfidf(X_train)
            vectorizer_list.append(vectorizer)

        if vectorizer_fin=="tfidf_2gram":
            train_df, vectorizer = tfidf_2gram(X_train)
            vectorizer_list.append(vectorizer)

        if vectorizer_fin=="tfidf_3gram":
            train_df, vectorizer = tfidf_3gram(X_train)
            vectorizer_list.append(vectorizer)
            
        model_fin = results_fin.loc[(results_fin.category==i)&(results_fin.label==j),'model'].values[0]
        
        if model_fin=="LogReg":
            print (train_df.shape)
            model = log_reg_fin(train_df, y_train)
            models_list.append(model)
            
        if model_fin=="Keras-Word":
            vectorizer_list.append(tokenizer)
            EMBEDDING_SIZE = 50
            model = keras_m_fin(padded_docs, y_train, vocab_size, EMBEDDING_SIZE, max_length)
            models_list.append(model)
            
        if model_fin=="Keras-Glove":
            vectorizer_list.append(tokenizer)
            model = keras_gl_fin(padded_docs, y_train, vocab_size, embedding_matrix, max_length)
            models_list.append(model)

(3916, 6924)
(3916, 6924)
(3916,)
(3916, 6924)
(3916, 6924)
(3916,)
(3916, 6924)
(3916, 6924)
(3916,)
(3916, 6924)
(3916, 6924)
(3916,)



(3914, 1445)
(3914, 1445)
(3914,)
(3914, 6913)
(3914, 6913)
(3914,)
(3914, 6913)
(3914, 6913)
(3914,)



(2949, 1117)
(2949, 1117)
(2949,)
(2949, 5533)
(2949, 5533)
(2949,)
(2949, 58781)
(2949, 58781)
(2949,)
(2949, 5533)
(2949, 5533)
(2949,)



(6672, 2074)
(6672, 2074)
(6672,)
(6672, 2074)
(6672, 2074)
(6672,)




In [324]:
mod_vec = pd.DataFrame({"category":category_list, "label":label_list, "vectorizer":vectorizer_list,
                               "model":models_list})

In [325]:

for i in [category_list, label_list, vectorizer_list, models_list]:
    print(len(i))

44
44
44
44


In [326]:
mod_vec.shape

(44, 4)

In [327]:
mod_vec.head(10)

Unnamed: 0,category,label,vectorizer,model
0,style,androgynous,<keras_preprocessing.text.Tokenizer object at ...,<keras.engine.sequential.Sequential object at ...
1,style,athleisure,<keras_preprocessing.text.Tokenizer object at ...,<keras.engine.sequential.Sequential object at ...
2,style,boho,<keras_preprocessing.text.Tokenizer object at ...,<keras.engine.sequential.Sequential object at ...
3,style,business casual,<keras_preprocessing.text.Tokenizer object at ...,<keras.engine.sequential.Sequential object at ...
4,style,casual,"TfidfVectorizer(analyzer='word', binary=False,...","LogisticRegression(C=1.0, class_weight=None, d..."
5,style,classic,"TfidfVectorizer(analyzer='word', binary=False,...","LogisticRegression(C=1.0, class_weight=None, d..."
6,style,edgy,<keras_preprocessing.text.Tokenizer object at ...,<keras.engine.sequential.Sequential object at ...
7,style,glam,<keras_preprocessing.text.Tokenizer object at ...,<keras.engine.sequential.Sequential object at ...
8,style,modern,"TfidfVectorizer(analyzer='word', binary=False,...","LogisticRegression(C=1.0, class_weight=None, d..."
9,style,retro,<keras_preprocessing.text.Tokenizer object at ...,<keras.engine.sequential.Sequential object at ...


In [328]:
# test data to be input here (this is dummy data):
test_brand = "Forever 21"
test_product_full_name = "Jeans size 34 M,"
test_description = "This is a slim jeans"
test_brand_category = "Denim Jeans"
test_details = "Blue color"

test_doc = test_brand +" " + test_product_full_name + " " + test_description + " " + test_brand_category +\
            " " + test_details

test_list = ['temp']
test_list[0] = test_doc
test_df2 = pd.DataFrame(data = test_list, columns = ["testdoc"])

In [333]:
# cleaning the test document
test_df2['testdoc'] = test_df2['testdoc'].astype(str).apply(clean_punct)

test_df2['testdoc'] = list(
    map(lambda doc: " ".join([token.text for token in nlp(doc) if not token.is_stop]), list(test_df2['testdoc'])))

test_df2['testdoc'] = lemm(test_df2['testdoc'])

test_df2

Unnamed: 0,testdoc
0,forever 21 jean size 34 slim jean denim jean b...


In [374]:
category_list = []
label_list = []
vectorizer_list = []
pred_val_list = []

for i in tqdm_notebook(cat):
    
    if i == 'style':
        df = df_style.copy()
        
    if i == 'occasion':
        df = df_occasion.copy()
    
    if i == 'fit':
        df = df_fit.copy()
    
    if i == 'color':
        df = df_color.copy()
        
    
    labels = list(df.iloc[:,1:-2].columns)
    
    # despite its name, y_test holds the cleaned input text to score, not labels
    y_test = test_df2['testdoc']
    
    for j in tqdm_notebook(labels):
        
        vector_toc = mod_vec.loc[(mod_vec.category==i)&(mod_vec.label==j),'vectorizer'].values[0]
        
        model_check = results_fin.loc[(results_fin.category==i)&(results_fin.label==j),'model'].values[0]
        
        if (model_check=="Keras-Word")|(model_check=="Keras-Glove"):
            
            max_length = get_max_token_length_per_doc(df['combined'])

            # integer encode the test document with the fitted tokenizer
            encoded_docs = integer_encode_documents(y_test, vector_toc)
            # pad the documents
            padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
                
            model = mod_vec.loc[(mod_vec.category==i)&(mod_vec.label==j),'model'].values[0]
            
            predi = model.predict(padded_docs)[0]
            y_pred = predi[0]
        else:
            model = mod_vec.loc[(mod_vec.category==i)&(mod_vec.label==j),'model'].values[0]
            
            test_vector = vector_toc.transform(y_test)

            test_df = pd.DataFrame(test_vector.toarray(), columns=vector_toc.get_feature_names())

            predi = model.predict_proba(test_df)[0]
            y_pred = predi[1]
            
        category_list.append(i)
        label_list.append(j)
        pred_val_list.append(y_pred)

In [375]:
pred_labels = pd.DataFrame({"category":category_list, "label":label_list, "pred_val":pred_val_list})

In [376]:
pred_labels

Unnamed: 0,category,label,pred_val
0,style,androgynous,0.757564
1,style,athleisure,8.8e-05
2,style,boho,0.012177
3,style,business casual,0.013776
4,style,casual,0.938546
5,style,classic,0.647485
6,style,edgy,0.02749
7,style,glam,0.000896
8,style,modern,0.344091
9,style,retro,0.050023


In [400]:
pred_label_sel = pred_labels[pred_labels['pred_val']>0.5].copy()

for i in tqdm_notebook(list(pred_label_sel['category'].unique())):
    
    if i != 'color':
        labels = pred_label_sel[pred_label_sel['category']==i]
        list_labels = list(labels['label'])
        for j in list_labels:
            print(f'The predicted labels for {i} is {j}')
    else:
        labels = pred_label_sel[pred_label_sel['category']==i]
        list_labels = list(labels['label'])
        
        # several colors can pass the threshold; report all of them together
        # (the original branch printed a stale loop variable j here)
        print(f"The predicted labels for {i} is {', '.join(list_labels)}")

The predicted labels for style is androgynous
The predicted labels for style is casual
The predicted labels for style is classic
The predicted labels for occasion is day to night
The predicted labels for occasion is weekend
The predicted labels for color is blues
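The selection step amounts to thresholding each label's score and grouping the surviving labels by category. A compact sketch of that idea follows, using an invented score table in the same layout as `pred_labels`; `THRESHOLD` and `labels_by_category` are illustrative names.

```python
import pandas as pd

# Toy scores in the same layout as pred_labels (values invented).
scores = pd.DataFrame({
    "category": ["style", "style", "color", "color", "fit"],
    "label": ["casual", "glam", "blues", "navy", "relaxed"],
    "pred_val": [0.94, 0.01, 0.81, 0.62, 0.30],
})

THRESHOLD = 0.5  # same cut-off as the cell above

# Keep labels scoring above the cut-off and collect them per category.
selected = scores[scores["pred_val"] > THRESHOLD]
labels_by_category = selected.groupby("category")["label"].apply(list).to_dict()
print(labels_by_category)
```

A category with no score above the threshold, such as `fit` in this toy table, is simply absent from the result, which mirrors the notebook output where only three of the four categories are reported.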

