## Overview

This notebook combines the code from each group member's work. It takes a dataframe as input and predicts 5 attributes for each product. All predictions are appended to the original input as 5 new columns. There are 6 sections: the first 5 produce the predictions, and the last one appends and outputs them. For details on a particular section, see the individual folders in our repo, which contain each member's work. For further questions, contact information is available on the home page. 

## Instruction

Two kinds of inputs are allowed: CSV file or user input. 

- If the input is a comma-separated values (CSV) file, each row should represent a product, and the columns **must** include 'product_id','brand','product_full_name','brand_category','brand_canonical_url','description','details'. Column names should **match the forms** above exactly (spelling, underscores, and lowercase). Null values are allowed. Run the "**CSV Input**" cell below and change the file name to that of the input file. It is originally set to full_data, which we got from our client. 

- If the user wants to type in column values, please use the "**User Input**" cell. Enter the values in the way shown in the example. 
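The column requirements above can be checked before any model is run. The following is a minimal sketch in plain Python; `missing_columns` is a hypothetical helper that is not part of the original notebook, and it only compares names, not contents.

```python
# Column names listed in the instructions above.
REQUIRED_COLUMNS = [
    'product_id', 'brand', 'product_full_name', 'brand_category',
    'brand_canonical_url', 'description', 'details',
]

def missing_columns(columns):
    """Return the required column names that are absent from `columns`.

    Hypothetical helper for illustration; the comparison is case-sensitive,
    matching the requirement that names be lowercase with underscores.
    """
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]

# Example header where 'Brand' is capitalized, so 'brand' counts as missing.
header = ['product_id', 'Brand', 'product_full_name', 'brand_category',
          'brand_canonical_url', 'description', 'details']
print(missing_columns(header))
```

With a dataframe, the same check would be `missing_columns(df.columns)`; an empty list means the input has every column the sections below select.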

**Please use the following Dropbox link to download a zip file/folder that contains all the supplementary files for this notebook. Please unzip it and make sure all files are in the same folder as this notebook:**

https://www.dropbox.com/s/1p7098qkp0i2wzk/Group_White_Final_Model_Files.zip?dl=0

In [1]:
import pandas as pd
import numpy as np
import spacy
import re
import pickle
import nltk
import string 
from keras.models import load_model
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from keras.preprocessing.sequence import pad_sequences

import warnings
warnings.filterwarnings('ignore')

nlp = spacy.load('en_core_web_md')

def save_obj(obj, name):
    with open(name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(name):
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)

Using TensorFlow backend.


### CSV Input

In [2]:
df = pd.read_csv('full_data_final version.csv')

In [3]:
df = df.loc[:,['product_id','brand','product_full_name','description','brand_category','brand_canonical_url','details']]
df.drop_duplicates(inplace=True)
df.reset_index(drop=True,inplace=True)

### User Input

In [4]:
## below is an example of how to input data; if there is null value, please use np.nan as entry
## example:

product_id = ['id1','id2']
brand = ['brand1','brand2']
product_full_name = ['fullname1','fullname2']
brand_category = ['cate1','cate2']
brand_canonical_url = ['url1','url2']
description = ['desp1', np.nan]
details = [np.nan, 'detail2']

df = pd.DataFrame(dict(zip(['product_id','brand','product_full_name','brand_category','brand_canonical_url','description','details'],
                      [product_id,brand,product_full_name,brand_category,brand_canonical_url,description,details])))

## I. Style - Nanchun (Aslan) Shi

In [5]:
df1 = df.copy()

### 1.1 Embedding

In [6]:
## select columns to be used for embedding model

emb_df = df1.loc[:,['description','details']]

In [7]:
## import from self-created module; check Preprocessing.py for details

from Preprocessing import embedding_preprocessing
emb_pre = embedding_preprocessing()

In [8]:
## preprocessing

emb_vector_df = pd.DataFrame(emb_pre.preprocess(emb_df))

In [9]:
## load embedding model

emb_model = load_model('style_embedding_model.h5')

In [10]:
## predict

emb_pred_vectors = emb_model.predict(emb_vector_df)

### 1.2 TF-IDF

In [11]:
## select columns to be used for tf-idf model

tfidf_df = df1.loc[:,['brand','product_full_name','brand_category','brand_canonical_url']]

In [12]:
## import from self-created module; check Preprocessing.py for details

from Preprocessing import tfidf_preprocessing
tfidf_pre = tfidf_preprocessing()

In [13]:
## preprocessing

tfidf_vector_df = tfidf_pre.preprocess(tfidf_df)

In [14]:
## load tf-idf model

tfidf_model = load_model('style_tfidf_model.h5')

In [15]:
## predict

tfidf_pred_vectors = tfidf_model.predict(tfidf_vector_df)

### 1.3 Prediction

In [16]:
def get_pred_classes(mat):
    pred = list(map(lambda v: list(np.argsort(v))[-2:], mat))
    return np.array(pred)

label_dict = load_obj('style_label_dict_rev')

In [17]:
final_vectors = 0.4*emb_pred_vectors + 0.6*tfidf_pred_vectors

In [18]:
final_pred_classes = get_pred_classes(final_vectors)

In [19]:
df1['style_prediction'] = list(map(lambda x: [label_dict[x[0]], label_dict[x[1]]], final_pred_classes))
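The blending and top-2 selection above can be illustrated without the model files. This is a minimal sketch in plain Python; the three labels and the probabilities are made-up values, and the real label order comes from `style_label_dict_rev`, which is not shown here.

```python
def blend(emb_probs, tfidf_probs, w_emb=0.4, w_tfidf=0.6):
    """Weighted average of the two models' class probabilities."""
    return [w_emb * e + w_tfidf * t for e, t in zip(emb_probs, tfidf_probs)]

def top_two(scores):
    """Indices of the two highest scores in ascending order of score,
    mirroring list(np.argsort(v))[-2:] in get_pred_classes."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[-2:]

labels = ['casual', 'classic', 'modern']   # illustrative label order (assumption)
emb = [0.2, 0.5, 0.3]                      # made-up embedding-model output
tfidf = [0.1, 0.3, 0.6]                    # made-up TF-IDF-model output

scores = blend(emb, tfidf)                 # approximately [0.14, 0.38, 0.48]
picked = [labels[i] for i in top_two(scores)]
print(picked)                              # ['classic', 'modern']
```

Because the sort is ascending, the second element of each pair is the most probable style and the first element is the runner-up.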

In [20]:
df1.head(3)

Unnamed: 0,product_id,brand,product_full_name,description,brand_category,brand_canonical_url,details,style_prediction
0,01DSE9TC2DQXDG6GWKW9NMJ416,Banana Republic,Ankle-Strap Pump,"A modern pump, in a rounded silhouette with an...",Unknown,https://bananarepublic.gap.com/browse/product....,"A modern pump, in a rounded silhouette with an...","[modern, businesscasual]"
1,01DSE9SKM19XNA6SJP36JZC065,Banana Republic,Petite Tie-Neck Top,Dress it down with jeans and sneakers or dress...,Unknown,https://bananarepublic.gap.com/browse/product....,Dress it down with jeans and sneakers or dress...,"[businesscasual, classic]"
2,01DSJX8GD4DSAP76SPR85HRCMN,Loewe,52MM Padded Leather Round Sunglasses,Padded leather covers classic round sunglasses.,JewelryAccessories/SunglassesReaders/RoundOval...,https://www.saksfifthavenue.com/loewe-52mm-pad...,100% UV protection Case and cleaning cloth inc...,"[casual, classic]"


## II. Fit - Xinyi (Alex) Guo

In [21]:
df2 = df.copy()

### 2.1 Preprocessing Functions

In [22]:
def removePunctuation(text, punctuations=string.punctuation+"``"+"’"+"”"):
    words=nltk.word_tokenize(text)
    newWords = [word for word in words if word.lower() not in punctuations]
    cleanedText = " ".join(newWords)
    return cleanedText

In [23]:
nltk_stopwords = set(stopwords.words("english"))  # corpus file name is lowercase
def removeStopwords(text, stopwords=nltk_stopwords):
    words = nltk.word_tokenize(text)
    newWords = [word for word in words if word.lower() not in stopwords]
    cleanedText = " ".join(newWords)
    return cleanedText

In [24]:
def lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    words = nltk.word_tokenize(text)
    lemmatizedWords = [lemmatizer.lemmatize(word.lower()) for word in words]
    lemmatizedText = " ".join(lemmatizedWords)
    return lemmatizedText

In [25]:
def preprocessing(df, columns = ["brand", "product_full_name", "description", "details"]):
    df['details'] = df['details'].str.replace("\n", "")
    #replace null values with UNKNOWN_TOKEN
    df['brand'] = df['brand'].fillna('UNKNOWN_TOKEN')
    df['description'] = df['description'].fillna('UNKNOWN_TOKEN')
    df['details'] = df['details'].fillna('UNKNOWN_TOKEN')
    df['product_full_name'] = df['product_full_name'].fillna('UNKNOWN_TOKEN')
    #remove punctuation and stopwords then lemmatize
    for col in columns: 
        df[col] = df[col].apply(removePunctuation)
        df[col] = df[col].apply(removeStopwords)
        df[col] = df[col].apply(lemmatize)
    return df

### 2.2 Keras Modeling Functions

In [26]:
def integer_encode_documents(docs, tokenizer):
    return tokenizer.texts_to_sequences(docs)

In [27]:
def get_max_token_length_per_doc(docs):
    return max(list(map(lambda x: len(x.split()), docs)))

In [28]:
def predict(X_test_df, target, max_length):
    #load data
    X_test_df["input_doc"] = X_test_df.brand + " " + X_test_df.product_full_name + " " \
                                + X_test_df.description + " " + X_test_df.details 
    X_test = X_test_df.loc[:, "input_doc"].values
    test_docs = list(X_test)

    #load model
    model = load_model("{}_model.h5".format(target))
    with open('{}_tokenizer.pickle'.format(target), 'rb') as handle:
        tokenizer = pickle.load(handle)
        
    #predict
    encoded_test_docs = integer_encode_documents(test_docs, tokenizer)
    padded_test_docs = pad_sequences(encoded_test_docs, maxlen=max_length, padding='post')
    prediction_proba = model.predict(padded_test_docs, verbose = 0)
    
    return prediction_proba
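The padding step in `predict` fixes every document to `max_length` tokens. The sketch below imitates that behavior in plain Python so it can be run without Keras; `pad_post` is a hypothetical stand-in, and it assumes the Keras default of `truncating='pre'`, which applies whenever `truncating` is not passed.

```python
def pad_post(seq, maxlen, truncating='pre', value=0):
    """Pad `seq` with `value` at the end up to `maxlen`.

    Sequences longer than `maxlen` are cut at the start ('pre') or at the
    end ('post'). Hypothetical stand-in for pad_sequences(padding='post').
    """
    seq = list(seq)
    if len(seq) > maxlen:
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
    return seq + [value] * (maxlen - len(seq))

short = pad_post([5, 8, 2], 5)                    # padded at the end
cut_pre = pad_post([1, 2, 3, 4, 5, 6], 4)         # keeps the last 4 tokens
cut_post = pad_post([1, 2, 3, 4, 5, 6], 4, 'post')  # keeps the first 4 tokens
print(short, cut_pre, cut_post)
```

Under that assumption, a product whose combined text exceeds `max_length` loses tokens from the beginning of the document (the brand and product name), so the truncation side is worth keeping identical to whatever was used in training.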

### 2.3 Main Function

In [29]:
def main(df):
    '''
    This function will predict the fit of the clothing. It takes a dataframe as an input. The CSV file needs to have 
    "brand", "product_full_name", "description", and "details" columns. The function will output a dataframe with an 
    additional fit column. 
    '''
    #load data
#     inputFile = input("What's the name of the csv file? (ex. full_data.csv)")
    fullData = df
    #Preprocess data
    print("Start preprocessing data...")
    testData = fullData.copy()
    testData = testData.loc[:, ["brand", "product_full_name", "description", "details"]]
    testData = preprocessing(testData)
    print("Start predicting fit...")
    #Predict fit
    maxLengthDict = {'straightregular': 185,
                 'semifitted': 185,
                 'relaxed': 202,
                 'oversized': 202,
                 'fittedtailored': 202}
    prob_df = pd.DataFrame()
    fitType = ['straightregular', 'semifitted', 'relaxed', 'oversized', 'fittedtailored']
    for fit in fitType:
        prediction_proba = predict(testData, target = fit, max_length = maxLengthDict[fit])
        prob_df[fit] = prediction_proba.flatten()
        print(fit, "fit prediction done")
    prob_df['predict_fit'] = prob_df.idxmax(axis=1)
    fullData['fit'] = prob_df['predict_fit']
#     fullData.to_csv("full_data with fit prediction.csv")
    return fullData

In [30]:
df2 = main(df2)

Start preprocessing data...
Start predicting fit...
straightregular fit prediction done
semifitted fit prediction done
relaxed fit prediction done
oversized fit prediction done
fittedtailored fit prediction done
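The final step of `main` picks, for each product, the fit whose model returned the highest probability. A minimal sketch of that selection in plain Python follows; the probabilities are made-up, and `pick_fit` is a hypothetical helper standing in for `prob_df.idxmax(axis=1)`.

```python
def pick_fit(probabilities):
    """Return the fit label with the highest probability for one product."""
    return max(probabilities, key=probabilities.get)

# Made-up outputs of the five separate models for a single product.
row = {
    'straightregular': 0.31,
    'semifitted': 0.64,
    'relaxed': 0.12,
    'oversized': 0.05,
    'fittedtailored': 0.40,
}
print(pick_fit(row))   # semifitted
```

Since each fit type appears to have its own model, the five values need not sum to 1; they are only compared against each other.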


In [31]:
df2.head(3)

Unnamed: 0,product_id,brand,product_full_name,description,brand_category,brand_canonical_url,details,fit
0,01DSE9TC2DQXDG6GWKW9NMJ416,Banana Republic,Ankle-Strap Pump,"A modern pump, in a rounded silhouette with an...",Unknown,https://bananarepublic.gap.com/browse/product....,"A modern pump, in a rounded silhouette with an...",straightregular
1,01DSE9SKM19XNA6SJP36JZC065,Banana Republic,Petite Tie-Neck Top,Dress it down with jeans and sneakers or dress...,Unknown,https://bananarepublic.gap.com/browse/product....,Dress it down with jeans and sneakers or dress...,semifitted
2,01DSJX8GD4DSAP76SPR85HRCMN,Loewe,52MM Padded Leather Round Sunglasses,Padded leather covers classic round sunglasses.,JewelryAccessories/SunglassesReaders/RoundOval...,https://www.saksfifthavenue.com/loewe-52mm-pad...,100% UV protection Case and cleaning cloth inc...,semifitted


## III. Occasion - Bingru Xue

In [32]:
df3 = df.copy()

### 3.1 Preprocess

In [33]:
# change text to lower case and replace null values
for i in df3.columns:
    df3[i] = df3[i].str.lower()
df3= df3.replace(np.nan, 'UNKNOWN_TOKEN', regex=True)
df3['details'] = df3['details'].str.replace("\n", "")
df3['text'] = df3['description']+' '+df3['details']

In [34]:
def preprocess_text(sen):
    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)
    
    #remove stopwords and do lemmatization
    doc = nlp(sentence)
    tokens = [token.lemma_ for token in doc if not token.is_stop]
    
    return " ".join(tokens)

df3['text'] = df3['text'].apply(preprocess_text)
df3['product_full_name'] = df3['product_full_name'].apply(preprocess_text)
df3['brand_category'] = df3['brand_category'].apply(preprocess_text)

In [35]:
def preprocess_url(url):
    url = re.sub('https://www.', '', url)
    url = re.sub('.com', '', url)
    url = re.sub('/', ' ', url)
    url = re.sub('-', ' ', url)
    url = re.sub(r'[0-9]+', ' ', url)
    url = re.sub(r"\s+[a-zA-Z]\s+", ' ', url)
    url = re.sub(r'\s+', ' ', url)

    doc = nlp(url)
    tokens = [token.lemma_ for token in doc if not token.is_stop]
    
    return " ".join(tokens)

df3['brand_canonical_url'] = df3['brand_canonical_url'].apply(preprocess_url)

In [36]:
df3['brand_info'] = df3['brand']+' '+df3['product_full_name']+' '+\
                    df3['brand_category']+' '+df3['brand_canonical_url']

### 3.2 Embedding Model: Description & Detail

In [37]:
docs = df3['text']

In [38]:
# Replace out-of-vocabulary words in the new input with a single token
tokenizer = load_obj("occasion_tokeniver")
new_doc = []
def replace_oov(sentence):
    now_sen =[]
    for word in word_tokenize(sentence):
        if word in tokenizer.word_index.keys():
            now_sen.append(word)
        else:
            now_sen.append("UNKNOWN_TOKEN")
    return " ".join(now_sen)
# apply returns a new Series, so assign it back for the replacement to take effect
docs = docs.apply(replace_oov)
docs

0        modern pump rounded silhouette ankle strap ext...
1        dress jean sneaker dress tailor trouser heel t...
2        padded leather cover classic round sunglass uv...
3        iconic mid design get add dose support padded ...
4        UNKNOWN_TOKEN shade offer UNKNOWN_TOKEN view i...
                               ...                        
48085    unknown token cozy double breasted jacket craf...
48086    UNKNOWN_TOKEN hour long wear water resistant U...
48087    ruffled trim sweatshirt lend romance stripe le...
48088    pretty plaid dress velvet collar velvet bow po...
48089    unknown token corduroy dress bow cotton cordur...
Name: text, Length: 48090, dtype: object
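The out-of-vocabulary replacement can be tried on a toy vocabulary. This is a minimal sketch in plain Python: the vocabulary is made-up, whitespace splitting stands in for `word_tokenize`, and the function takes the vocabulary as an argument instead of reading the loaded tokenizer.

```python
def replace_oov(sentence, vocabulary, unknown='UNKNOWN_TOKEN'):
    """Replace every word that is not in `vocabulary` with `unknown`."""
    return ' '.join(w if w in vocabulary else unknown for w in sentence.split())

vocab = {'modern', 'pump', 'ankle', 'strap'}     # toy vocabulary (assumption)
docs = ['modern pump ankle strap', 'hexagonal shade']

# The result is a new list, so it has to be stored to be used downstream.
docs = [replace_oov(d, vocab) for d in docs]
print(docs)   # ['modern pump ankle strap', 'UNKNOWN_TOKEN UNKNOWN_TOKEN']
```

The same holds for `Series.apply`: it returns a new series and leaves the original untouched.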

In [39]:
# Pad documents

token = tokenizer.texts_to_sequences(docs)
pad = pad_sequences(token, padding='post', maxlen=165, truncating='post')

In [40]:
# Load in trained model for word embedding

embedding_model = load_model('occasion_embedding_model.h5')

In [41]:
# Predict the probability of a product belonging to each occasion type

occasion_type = ["cold weather","day to night","night out","vacation","weekend","work","workout"]
pred = embedding_model.predict(pad)
embedding_df = pd.DataFrame(data= pred, columns = occasion_type,index=df3.index)

### 3.3 Vectorization Model: Brand, Name, URL, Brand Category

In [42]:
info = df3['brand_info']

In [43]:
# Load in TF-IDF Vectorizer and transform new input

vectorizer = load_obj("occasion_vectorizer")
vector = vectorizer.transform(info)
tf_idf_df = pd.DataFrame(vector.toarray(), columns=vectorizer.get_feature_names())

In [44]:
# Predict the probability of a product belonging to each occasion type using the corresponding trained model

occasion_type = ["cold weather","day to night","night out","vacation","weekend","work","workout"]
vector_df = pd.DataFrame(columns = occasion_type, index = df3.index)

for label in occasion_type:
    filename = "{}".format(label)+"_vector_model"
    vector_model = load_obj(filename)
    prob = vector_model.predict_proba(vector)[:,1]
    vector_df[label] = prob

### 3.4 Combine embedding model and vector model

In [45]:
# Get weighted average predicted probability from 2 models

occasion_result_df = 0.4*embedding_df + 0.6*vector_df

In [46]:
# Decision rule:
# If a product's probability of belonging to an occasion type is >0.5, assign that occasion
# If no occasion probability is >0.5 for a product, assign the occasion with the highest probability

def decision(probs):
    probs = probs.copy()  # work on a copy so the caller's row is not mutated
    above = probs > 0.5
    if above.any():
        probs[above] = 1
        probs[~above] = 0
    else:
        is_max = probs == probs.max()
        probs[is_max] = 1
        probs[~is_max] = 0
    return probs

# apply returns a new dataframe, so assign it back before displaying it
occasion_result_df = occasion_result_df.apply(decision, axis=1)
occasion_result_df

Unnamed: 0,cold weather,day to night,night out,vacation,weekend,work,workout
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,1.0,1.0,0.0
2,0.0,0.0,0.0,1.0,1.0,0.0,0.0
3,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...
48085,1.0,1.0,0.0,0.0,1.0,0.0,0.0
48086,0.0,0.0,0.0,1.0,1.0,0.0,0.0
48087,0.0,1.0,0.0,0.0,1.0,0.0,0.0
48088,0.0,0.0,0.0,0.0,1.0,0.0,0.0
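The decision rule can be checked on single rows of probabilities. The following is a minimal sketch in plain Python with made-up probabilities; `decide` and `labels_for` are hypothetical helpers that return new lists instead of modifying a dataframe.

```python
occasion_type = ["cold weather", "day to night", "night out", "vacation",
                 "weekend", "work", "workout"]

def decide(probs, threshold=0.5):
    """Return 0/1 flags: every occasion above the threshold, or, if none
    qualifies, only the occasion(s) with the highest probability."""
    if any(p > threshold for p in probs):
        return [1 if p > threshold else 0 for p in probs]
    best = max(probs)
    return [1 if p == best else 0 for p in probs]

def labels_for(probs):
    """Names of the occasions flagged by decide()."""
    return [name for name, flag in zip(occasion_type, decide(probs)) if flag]

several = labels_for([0.1, 0.7, 0.2, 0.3, 0.6, 0.55, 0.0])
fallback = labels_for([0.1, 0.4, 0.2, 0.3, 0.45, 0.05, 0.0])
print(several)    # ['day to night', 'weekend', 'work']
print(fallback)   # ['weekend']
```

The fallback branch guarantees that every product receives at least one occasion, even when both models are uncertain.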


In [47]:
# Output occasion labels for each product

occasion = []
for i in occasion_result_df.index:
    label = []
    for j in occasion_result_df.columns:
        if occasion_result_df.loc[i,j].any():
            label.append(j)
    occasion.append(label)
occasion_result_df['Occasion'] = occasion

In [48]:
occasion_result_df.head(3)

Unnamed: 0,cold weather,day to night,night out,vacation,weekend,work,workout,Occasion
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,[day to night]
1,0.0,1.0,0.0,0.0,1.0,1.0,0.0,"[day to night, weekend, work]"
2,0.0,0.0,0.0,1.0,1.0,0.0,0.0,"[vacation, weekend]"


In [49]:
occasion_final = occasion_result_df.loc[:,"Occasion"]

## IV. Patterns/Prints - Jiayue (Daniel) Chen

In [50]:
df4 = df.copy()

### 4.1 Preprocessing

In [51]:
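#replace null values with UNKNOWN_TOKEN in every column that is cleaned or concatenated below
df4['brand'] = df4['brand'].fillna('UNKNOWN_TOKEN')
df4['product_full_name'] = df4['product_full_name'].fillna('UNKNOWN_TOKEN')
df4['description'] = df4['description'].fillna('UNKNOWN_TOKEN')
df4['details'] = df4['details'].fillna('UNKNOWN_TOKEN')
df4['brand_category'] = df4['brand_category'].fillna('UNKNOWN_TOKEN')
df4['details'] = df4['details'].str.replace("\n", "")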

In [52]:
# define a function to remove punctuation
def removePunctuation(text, punctuations=string.punctuation+"``"+"’"+"”"):
    words=nltk.word_tokenize(text)
    newWords = [word for word in words if word.lower() not in punctuations]
    cleanedText = " ".join(newWords)
    return cleanedText

nltk_stopwords = set(stopwords.words("english"))  # corpus file name is lowercase

# define a function to remove stopwords
def removeStopwords(text, stopwords=nltk_stopwords):
    words = word_tokenize(text)
    newWords = [word for word in words if word.lower() not in stopwords]
    cleanedText = " ".join(newWords)
    return cleanedText

# define a function to lemmatize all texts
def lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    words = word_tokenize(text)
    lemmatizedWords = [lemmatizer.lemmatize(word.lower()) for word in words]
    lemmatizedText = " ".join(lemmatizedWords)
    return lemmatizedText

columns = ["brand", "product_full_name", "description", "details"]
for col in columns: 
    df4[col] = df4[col].apply(removePunctuation)
    df4[col] = df4[col].apply(removeStopwords)
    df4[col] = df4[col].apply(lemmatize)

In [53]:
subset = pd.DataFrame()

In [54]:
subset["input_doc"] = df4.brand+" "+df4.product_full_name+" "+df4.description+" "+df4.details
subset['abstract'] = 0
subset['animal'] = 0
subset['camouflage'] = 0
subset['colorblock'] = 0
subset['dots'] = 0
subset['floral'] = 0
subset['geometric'] = 0
subset['graphic'] = 0
subset['houndstooth'] = 0
subset['logo'] = 0
subset['monogram'] = 0
subset['multiprint'] = 0
subset['paisley'] = 0
subset['pinstripe'] = 0
subset['plaid'] = 0
subset['stripe'] = 0
subset['stripehorizontal'] = 0
subset['stripevertical'] = 0
subset['tiedye'] = 0
subset['tropical'] = 0

### 4.2 Modeling & Prediction

In [55]:
def integer_encode_documents(docs, tokenizer):
    return tokenizer.texts_to_sequences(docs)

def get_max_token_length_per_doc(docs):
    return max(list(map(lambda x: len(x.split()), docs)))

In [56]:
# From the complete pattern's file, we generate a list of max_length
max_length_list = [150, 150, 124, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150]
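The prediction loop relies on `max_length_list` lining up position by position with the pattern columns. A dictionary makes that pairing explicit. This is a minimal sketch in plain Python, assuming the list has one entry per pattern in the column order used above; `max_length_by_pattern` is a hypothetical name.

```python
# Pattern names in the order the columns are created above.
patterns = ['abstract', 'animal', 'camouflage', 'colorblock', 'dots', 'floral',
            'geometric', 'graphic', 'houndstooth', 'logo', 'monogram',
            'multiprint', 'paisley', 'pinstripe', 'plaid', 'stripe',
            'stripehorizontal', 'stripevertical', 'tiedye', 'tropical']

# Same values as max_length_list: 124 in third position, 150 elsewhere.
max_length_list = [150, 150, 124] + [150] * 17

# Fail early if the two lists ever get out of step.
assert len(patterns) == len(max_length_list)

max_length_by_pattern = dict(zip(patterns, max_length_list))
print(max_length_by_pattern['camouflage'])   # 124
```

Looking lengths up by name avoids the `i-1` index arithmetic and keeps the pairing correct if a pattern is added or reordered.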

In [57]:
pattern_values = list(subset.columns.values[1:,])
for i in range(1, subset.shape[1]):
    X_test = subset.iloc[:, 0].values
    test_docs = list(X_test)
    model = load_model("{}_model.h5".format(pattern_values[i-1]))
    with open('{}_tokenizer.pickle'.format(pattern_values[i-1]), 'rb') as handle:
        tokenizer = pickle.load(handle)
        
   #predict
    encoded_test_docs = integer_encode_documents(test_docs, tokenizer)
    padded_test_docs = pad_sequences(encoded_test_docs, maxlen = max_length_list[i-1], padding = 'post')
    prediction = model.predict(padded_test_docs, verbose = 0)
    subset.iloc[:, i] = prediction

In [58]:
subset = subset.drop(['input_doc'], axis = 1)
subset['patterns/prints'] = subset.idxmax(axis = 1)

In [59]:
df4['patterns/prints'] = subset['patterns/prints']
df4.head(3)

Unnamed: 0,product_id,brand,product_full_name,description,brand_category,brand_canonical_url,details,patterns/prints
0,01DSE9TC2DQXDG6GWKW9NMJ416,banana republic,ankle-strap pump,modern pump rounded silhouette ankle strap ext...,Unknown,https://bananarepublic.gap.com/browse/product....,modern pump rounded silhouette ankle strap ext...,animal
1,01DSE9SKM19XNA6SJP36JZC065,banana republic,petite tie-neck top,dress jean sneaker dress tailored trouser heel...,Unknown,https://bananarepublic.gap.com/browse/product....,dress jean sneaker dress tailored trouser heel...,floral
2,01DSJX8GD4DSAP76SPR85HRCMN,loewe,52mm padded leather round sunglass,padded leather cover classic round sunglass,JewelryAccessories/SunglassesReaders/RoundOval...,https://www.saksfifthavenue.com/loewe-52mm-pad...,100 uv protection case cleaning cloth included...,logo


## V. Category - Yuyao Shen 

According to Womens+Attributes.xlsx, the general category has 6 attributes: top, bottom, onepiece, shoe, handbag, and scarf. In the tagged data, the attributes include top, bottom, onepiece, shoe, sweater, accessory, blazer, hoodie, etc. To reconcile these two categorizations, I chose to keep only top, bottom, onepiece, shoe, and accessory, with blazer, sweater, and hoodie included in top.
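The label consolidation described above can be written as a small lookup. This is a minimal sketch in plain Python; `general_category` is a hypothetical helper, and it returns `None` for any tag the description does not cover, rather than guessing a mapping.

```python
KEPT = {'top', 'bottom', 'onepiece', 'shoe', 'accessory'}
MERGED_INTO_TOP = {'blazer', 'sweater', 'hoodie'}

def general_category(tag):
    """Map a tagged attribute to one of the five kept categories.

    Returns None for tags not covered by the description above.
    """
    tag = tag.strip().lower()
    if tag in MERGED_INTO_TOP:
        return 'top'
    if tag in KEPT:
        return tag
    return None

print(general_category('Blazer'))   # top
print(general_category('shoe'))     # shoe
```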

In [60]:
df5 = df.copy()

### 5.1 Preprocess

In [61]:
punctuations = string.punctuation
nlp = spacy.load('en')
stop_words = spacy.lang.en.stop_words.STOP_WORDS
from spacy.lang.en import English
parser = English()

def clean_text(text):
    '''
    use regular expression to clean text 
    replace numbers and units to variables
    '''
    p = re.compile(r'<.*?>')
    text = p.sub('', text)
    text = text.lower()
    text = re.sub('\xa0', '',text)
    text = re.sub(r'\d{1,3}(\.|\’)?\d{1,3}?(\"|\”)',"length_val", text)
    text = re.sub(r'\d{1,3}\s*?%',"percentage_val", text)
    text = text.strip(string.punctuation).replace("\n", " ").replace("\r", " ")
    text = re.sub(r'[^\w\s]','',text)
    text = re.sub(r'\d{1,3}\s*?mm',"mm_val", text)
    text = re.sub(r'\d{1,3}\s*?cm',"cm_val", text)
    text = re.sub(r'\d{1,3}\s*?(inches|inch)',"inches_val", text)
    text = re.sub(r'\d{1,3}\s*?(lbs|kg)',"weight_val", text)
    text = re.sub(r'size\s*?\d{1,3}\s*?',"size_val", text)
    text = re.sub(r'\b\d+\b',' ',text)
    text = re.sub(r'\s+',' ',text) 
    mytokens = parser(text)
    mytokens = [ word.lemma_.lower().strip() for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]
    return " ".join(mytokens)
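The unit substitutions in `clean_text` turn measurements into placeholder tokens so that, for example, every percentage looks the same to the model. The sketch below runs a subset of those patterns on a made-up sentence; `replace_units` is a hypothetical helper that skips the HTML, punctuation, and lemmatization steps.

```python
import re

def replace_units(text):
    """Apply a subset of the clean_text substitutions (illustration only)."""
    text = text.lower()
    text = re.sub(r'\d{1,3}\s*?%', 'percentage_val', text)
    text = re.sub(r'\d{1,3}\s*?mm', 'mm_val', text)
    text = re.sub(r'\d{1,3}\s*?cm', 'cm_val', text)
    text = re.sub(r'\d{1,3}\s*?(inches|inch)', 'inches_val', text)
    return text

cleaned = replace_units('100% UV protection, 52MM lens, 14 cm wide')
print(cleaned)   # percentage_val uv protection, mm_val lens, cm_val wide
```

Note that in `clean_text` itself the percentage and quote-mark patterns run before punctuation is stripped, since they depend on the `%` and `"` characters still being present.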

In [62]:
def clean_cate_test(inputFile):
    '''
    Input testing dataset that is going to be labeled
    Keep relevant columns 
    Basic cleaning
    Output cleaned testing dataset for category in a dataframe
    '''
    #full_test_data = pd.read_csv(inputFile)
    full_test_data = df5  
    ### keep relevant columns only
    test_data = full_test_data[['product_full_name', 'details','description', 'brand_category']]
    ### fill null values with 'Unknown_token' (assign the result instead of modifying a slice in place)
    test_data = test_data.fillna('Unknown_token')
    X_test = test_data['product_full_name'] + ' '+ test_data['details'] + ' '+test_data['description']+ ' '+test_data['brand_category']
    return test_data, X_test

In [63]:
def get_pred_classes(mat):
    pred = list(map(lambda v: list(np.argsort(v))[-1:], mat))
    return pred

In [64]:
full_test_data, X_test = clean_cate_test('full_data_final version.csv')
X_test = X_test.apply(clean_text)

### 5.2 Vectorization

In [65]:
### loading vectorizer
Pkl_Filename = "category_token.pkl"  
with open(Pkl_Filename, 'rb') as file:  
    tk = pickle.load(file)

In [66]:
### vectorize incoming data
from keras.preprocessing.sequence import pad_sequences
vector_text_test = tk.texts_to_sequences(X_test)
padded_token_lists_test = pad_sequences(vector_text_test, maxlen=175, padding='post')
X_test = pd.DataFrame(padded_token_lists_test, index = full_test_data.index)

### 5.3  Modeling 

In [67]:
### loading model
Pkl_Filename = "category_model.pkl"  
with open(Pkl_Filename, 'rb') as file:  
    model = pickle.load(file)

In [68]:
### use trained model to predict incoming data
pred_vectors_test = model.predict(X_test)
test_pred_classes = get_pred_classes(pred_vectors_test)
categories = ['accessory', 'bottom', 'onepiece', 'shoe', 'top']
cate_pred = [categories[i[0]] for i in test_pred_classes]
predicted_test = pd.Series(cate_pred).str.capitalize() 
df5['category']  = list(predicted_test)

In [69]:
df5.head(3)

Unnamed: 0,product_id,brand,product_full_name,description,brand_category,brand_canonical_url,details,category
0,01DSE9TC2DQXDG6GWKW9NMJ416,Banana Republic,Ankle-Strap Pump,"A modern pump, in a rounded silhouette with an...",Unknown,https://bananarepublic.gap.com/browse/product....,"A modern pump, in a rounded silhouette with an...",Shoe
1,01DSE9SKM19XNA6SJP36JZC065,Banana Republic,Petite Tie-Neck Top,Dress it down with jeans and sneakers or dress...,Unknown,https://bananarepublic.gap.com/browse/product....,Dress it down with jeans and sneakers or dress...,Onepiece
2,01DSJX8GD4DSAP76SPR85HRCMN,Loewe,52MM Padded Leather Round Sunglasses,Padded leather covers classic round sunglasses.,JewelryAccessories/SunglassesReaders/RoundOval...,https://www.saksfifthavenue.com/loewe-52mm-pad...,100% UV protection Case and cleaning cloth inc...,Accessory


## VI. Output

In [70]:
df['style'] = df1.style_prediction
df['fit'] = df2.fit
df['occasion'] = occasion_final
df['patterns/prints'] = df4['patterns/prints']
df['category'] = df5.category

- If the input is a CSV file, please change the file name below and run the last 3 cells; **df_full** will then be the final output.
- If user input was used, there is no need to run the cells below; **df** is the final output.

In [71]:
#### Please change the file name accordingly

df_full = pd.read_csv('full_data_final version.csv')

In [72]:
### Append predicted tags to the original dataset

df_full = df_full.merge(df, how = 'left',  on=['product_id', 'brand', 'product_full_name', 'description', 
                                     'brand_category', 'brand_canonical_url', 'details'])

In [73]:
df_full

Unnamed: 0,product_id,brand,mpn,product_full_name,description,brand_category,created_at,updated_at,deleted_at,brand_canonical_url,details,labels,bc_product_id,style,fit,occasion,patterns/prints,category
0,01DSE9TC2DQXDG6GWKW9NMJ416,Banana Republic,514683,Ankle-Strap Pump,"A modern pump, in a rounded silhouette with an...",Unknown,2019-11-11 22:37:15.719107+00,2019-12-19 20:40:30.786144+00,,https://bananarepublic.gap.com/browse/product....,"A modern pump, in a rounded silhouette with an...","{""Needs Review""}",,"[modern, businesscasual]",straightregular,[day to night],animal,Shoe
1,01DSE9SKM19XNA6SJP36JZC065,Banana Republic,526676,Petite Tie-Neck Top,Dress it down with jeans and sneakers or dress...,Unknown,2019-11-11 22:36:50.682513+00,2019-12-19 20:40:30.786144+00,,https://bananarepublic.gap.com/browse/product....,Dress it down with jeans and sneakers or dress...,"{""Needs Review""}",,"[businesscasual, classic]",semifitted,"[day to night, weekend, work]",floral,Onepiece
2,01DSJX8GD4DSAP76SPR85HRCMN,Loewe,4.001E+11,52MM Padded Leather Round Sunglasses,Padded leather covers classic round sunglasses.,JewelryAccessories/SunglassesReaders/RoundOval...,2019-11-13 17:33:59.581661+00,2019-12-19 20:40:30.786144+00,,https://www.saksfifthavenue.com/loewe-52mm-pad...,100% UV protection Case and cleaning cloth inc...,"{""Needs Review""}",,"[casual, classic]",semifitted,"[vacation, weekend]",logo,Accessory
3,01DSJVKJNS6F4KQ1QM6YYK9AW2,Converse,4.00012E+11,Baby's & Little Kid's All-Star Two-Tone Mid-To...,The iconic mid-top design gets an added dose o...,"JustKids/Shoes/Baby024Months/BabyGirl,JustKids...",2019-11-13 17:05:05.203733+00,2019-12-19 20:40:30.786144+00,,https://www.saksfifthavenue.com/converse-babys...,Canvas upper Round toe Lace-up vamp SmartFOAM ...,"{""Needs Review""}",,"[classic, casual]",straightregular,"[day to night, weekend]",logo,Shoe
4,01DSK15ZD4D5A0QXA8NSD25YXE,Alexander McQueen,4.00011E+11,64MM Rimless Sunglasses,Hexagonal shades offer a rimless view with int...,JewelryAccessories/SunglassesReaders/RoundOval,2019-11-13 18:42:30.941321+00,2019-12-19 20:40:30.786144+00,,https://www.saksfifthavenue.com/alexander-mcqu...,100% UV protection Gradient lenses Adjustable ...,"{""Needs Review""}",,"[casual, modern]",relaxed,[weekend],animal,Accessory
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48974,01DSNVXY8EJ9FQAJ3MPDMPASHD,Bonpoint,4.00091E+11,Baby's Hooded Jacket,,JustKids/Baby024months/InfantGirls/Outerwear,2019-11-14 21:08:28.040417+00,2019-12-19 20:40:30.786144+00,,https://www.saksfifthavenue.com/bonpoint-babys...,Cozy double breasted jacket crafted from cotto...,"{""Needs Review""}",,"[classic, casual]",oversized,"[cold weather, day to night, weekend]",logo,Top
48975,01DSGYHA3RMCHENBJVQPBGXM97,Laura Mercier,4.00096E+11,Flawless Fusion Ultra-Longwear Foundation,"WHAT IT ISA 15-hour long wearing, water resist...",SaksBeautyPlace/ForHer/Color/Foundation/Liquid...,2019-11-12 23:17:47.761072+00,2019-12-19 20:40:30.786144+00,,https://www.saksfifthavenue.com/laura-mercier-...,"WHAT IT ISA 15-hour long wearing, water resist...","{""Needs Review""}",,"[casual, androgynous]",relaxed,"[vacation, weekend]",logo,Top
48976,01DSJT8H12CAFQQH07SQSQWJ8C,Splendid,4.001E+11,Baby Girl's 2-Piece Ruffle Sweatshirt & Stripe...,Ruffled-trim sweatshirt lends romance to this ...,"JustKids/Baby024months/InfantGirls/Tops,JustKi...",2019-11-13 16:41:34.491443+00,2019-12-19 20:40:30.786144+00,,https://www.saksfifthavenue.com/splendid-baby-...,"Crewneck Long sleeves Rib-knit neck, cuffs and...","{""Needs Review""}",,"[classic, casual]",relaxed,"[day to night, weekend]",stripevertical,Top
48977,01DSH2PF9J7QZ44D842B3GMCFN,Florence Eiseman,4.00012E+11,Little Girl's Plaid & Velvet Dress,Pretty plaid dress with velvet collar and velv...,"JustKids/Girls214/ToddlerGirls24/Dresses,JustK...",2019-11-13 00:30:31.212215+00,2019-12-19 20:40:30.786144+00,,https://www.saksfifthavenue.com/florence-eisem...,Peter Pan collar Short sleeves Back zipper Two...,"{""Needs Review""}",,"[modern, casual]",relaxed,[weekend],plaid,Onepiece
