# ProductNet: Categorize products using image(s) and text descriptors
## Jen Sheng Wong and Kartik Nanda (Cohort 14)
Based on the following paper: https://arxiv.org/pdf/1904.09037.pdf

## Problem Statement:
The problem concerns products listed on retail/marketplace sites such as Amazon. It has three main aspects:
* Categorizing products into ~5000 categories (using the Google taxonomy: https://github.com/fellowship/platform-demos3/blob/master/ProductNet/taxonomy-with-ids.en-US.xls)
* Each product has one or more images
* Each product has text: title, description, keywords

Possible end problems to solve:
a) Find the category, given product images and a user-provided text description.
b) Find mis-categorized products.

## Dataset: 
Amazon product data (product details from 1996 through 2014), by Prof. McAuley at UCSD.

Citations:
* R. He, J. McAuley. Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016
* J. McAuley, C. Targett, J. Shi, A. van den Hengel. Image-based recommendations on styles and substitutes. SIGIR, 2015

## Dataset Storage:
* Used a Google Cloud Storage bucket located at gs://platform-ai-research/datasets/ProductNet/
* Also used Jen Sheng's Google Drive for intermediate files, images, etc.

## Generating labels for the dataset
The dataset has three text fields: categories, description, and title. Any or all of these can be used to generate labels for the dataset.

The first attempt used the dataset's 'categories' entry itself as the label. This yields ~90,000 unique labels, and the error rate with these labels was around 98%.

The second attempt took the first entry in the categories column as the label. This reduced the number of unique labels to ~40 on a smaller sampled dataset (10k instead of 5.7 million). The error rate was ~70%.

The third attempt generates labels by mapping the categories data to Google taxonomy entries. These new labels are then used for training.

In [1]:
!pip install fastai
!pip install pyarrow

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
import os
import gc
from tqdm import tqdm

from PIL import Image

from fastai.vision import *
from fastai.metrics import error_rate, accuracy



In [0]:
import sys    # needed for sys.exit() below

# Read in the cleaned dataset from the GS bucket
# Set the file name - this is the output file from step1 (workbook_1)
file_name = 'metadata_clean_0513'
gs_path = 'platform-ai-research/datasets/ProductNet/'     # location of the bucket

# set local to False if running on Colab
local = False      # this needs to be automated (how?)

# Reading in the datafile
exists = os.path.isfile(file_name)
if not exists:
    if not local:
        # Login to access the GS bucket
        from google.colab import auth
        auth.authenticate_user()

        # Copy the datafile to the Colab local dir
        remote_file = gs_path + file_name
        !gsutil cp gs://{remote_file} {file_name}

        # A failed shell command does not raise a Python exception,
        # so check for the file instead of wrapping the copy in try/except
        if not os.path.isfile(file_name):
            print('File Does Not Exist')
            sys.exit()
    else:
        print('File Does Not Exist')
        sys.exit()

# Read in the file
df = pd.read_feather(file_name)
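
The `local` flag above is set by hand. One possible way to automate it, offered here as an assumption rather than the authors' method, is to test whether the `google.colab` package can be imported, since that package is normally present only in Colab runtimes. The helper name `running_on_colab` is hypothetical.

```python
def running_on_colab():
    """Return True if the google.colab package is importable (assumed to mean Colab)."""
    try:
        import google.colab  # noqa: F401  (normally present only in Colab runtimes)
    except ImportError:
        return False
    return True


# 'local' mirrors the flag used above: False when running on Colab
local = not running_on_colab()
print('local =', local)
```

This check only tells Colab apart from everything else; other hosted environments would still be treated as local.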

In [0]:
df.info()

In [4]:
df = df[['asin', 'categories']]    # keep only the asin and categories columns
df.head()

|   | asin | categories |
|---|------|------------|
| 0 | 37214 | Clothing, Shoes & Jewelry, Girls, Clothing, Sh... |
| 1 | 32069 | Sports & Outdoors, Other Sports, Dance, Clothi... |
| 2 | 31909 | Sports & Outdoors, Other Sports, Dance |
| 3 | 32034 | Sports & Outdoors, Other Sports, Dance, Clothi... |
| 4 | 31852 | Sports & Outdoors, Other Sports, Dance |


In [6]:
# Using the Universal Sentence Encoder to map categories to Google taxonomy labels
from __future__ import absolute_import, division, print_function    # future imports must come before other statements

from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)
import tensorflow_hub as hub

# TF1-style hub module (tf.logging, tf.Session and hub.Module are TensorFlow 1.x APIs)
hub_embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")

W0515 18:05:40.806525 140166987462528 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14


In [0]:
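# generate the embeddings for various text input
def get_use_matrix(s):
    # s: list/array of strings; returns one 512-dimensional embedding per string
    embeddings = hub_embed(s)

    with tf.Session() as session:
        # initialize the module variables and lookup tables before running
        session.run([tf.global_variables_initializer(), tf.tables_initializer()])
        xtrain_embeddings = session.run(embeddings)

    return xtrain_embeddings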

In [46]:
# Can't run the entire dataset on Colab (not enough RAM), so process it in pieces
pieces = 400
n = int(df.shape[0]/pieces)    # rows per piece (rows left over after pieces*n are skipped)
classes_embeddings = np.empty(shape=[0, 512])

for i in range(pieces):
    df2 = df.loc[i*n:(i+1)*n-1]

    # generate embeddings for the categories entries
    s = df2.categories.values

    # print progress every 20 pieces
    if i % 20 == 0:
        print(i)

    classes_embeddings_piece = get_use_matrix(s)

    # save each piece to disk instead of accumulating it in memory
    #classes_embeddings = np.append(classes_embeddings, classes_embeddings_piece, axis=0)
    fn = 'arr_' + str(i) + '.npy'    # np.save appends '.npy' to any other extension
    np.save(fn, classes_embeddings_piece, allow_pickle=True)

    df2 = None; classes_embeddings_piece = None
    gc.collect()

KeyboardInterrupt: ignored
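
Because the loop above writes each piece to disk rather than appending to `classes_embeddings`, the pieces have to be loaded and stacked before the full matrix can be used. A minimal sketch, assuming each piece was written with `np.save` and that the files follow an `arr_<i>.npy` naming pattern. The helper `load_pieces`, its parameters, and the small demo arrays are illustrative and not from the original notebook.

```python
import os
import tempfile

import numpy as np


def load_pieces(directory, pieces, prefix='arr_', suffix='.npy'):
    """Load arr_0 ... arr_{pieces-1} in numeric order and stack them row-wise."""
    parts = []
    for i in range(pieces):
        path = os.path.join(directory, prefix + str(i) + suffix)
        parts.append(np.load(path))
    return np.concatenate(parts, axis=0)


# Demo with tiny fake "embeddings" (3 pieces of 2 rows x 4 dims) in a temp dir
demo_dir = tempfile.mkdtemp()
for i in range(3):
    np.save(os.path.join(demo_dir, 'arr_' + str(i) + '.npy'),
            np.full((2, 4), float(i)))

demo_embeddings = load_pieces(demo_dir, pieces=3)
print(demo_embeddings.shape)
```

Iterating by index keeps the rows in the same order as `df`, which a lexical sort of file names would not (`arr_10` sorts before `arr_2`).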

In [47]:
# Embeddings for the categories have been generated; classes_embeddings must hold all the saved pieces, stacked in order
classes_embeddings.shape

(173817, 512)

In [0]:
# Now generate the embeddings for the taxonomy - the labels we are trying to map to
# Read in the taxonomy file from the bucket
file_name = 'taxonomy-with-ids.en-US.xls'

remote_file = gs_path + file_name
!gsutil cp gs://{remote_file} .

In [0]:
# Read the taxonomy file
taxo = pd.read_excel('taxonomy-with-ids.en-US.xls')

taxo = taxo.fillna('')

# Get columns. Different columns provide a different depth into the taxonomy.
# For example, column 1 is the top level and has few (31) categories.
# Column 2 is one level deeper and has 129 labels, column 3 is deeper still, and so on
level_2 = taxo.iloc[:, 2].unique()
level_2 = level_2[level_2 != '']    # drop the empty label that fillna('') leaves for rows with no level-2 entry
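
To make the column-per-depth layout concrete, here is a small sketch on a made-up taxonomy table. The toy rows, the column names, and the helper `labels_at_depth` are assumptions for illustration; the real file has many more rows and columns. The helper also drops the empty string that `fillna('')` leaves for rows that stop at a shallower depth.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the taxonomy sheet: an id column followed by one column per depth
toy_taxo = pd.DataFrame({
    'id':      [1, 2, 3, 4, 5],
    'level_1': ['Animals', 'Animals', 'Animals', 'Apparel', 'Apparel'],
    'level_2': [None, 'Live Animals', 'Pet Supplies', None, 'Clothing'],
    'level_3': [None, None, 'Bird Supplies', None, None],
}).fillna('')


def labels_at_depth(taxo, depth):
    """Return the unique, non-empty labels in the column for the given depth."""
    labels = taxo.iloc[:, depth].unique()
    return np.array([label for label in labels if label != ''], dtype=object)


print(labels_at_depth(toy_taxo, 1))   # top-level labels
print(labels_at_depth(toy_taxo, 2))   # one level deeper
```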

In [0]:
taxo_embeddings = get_use_matrix(level_2)
taxo_embeddings.shape

In [0]:
# We now have the embeddings for both the categories data (the source) and the taxonomy (the target).
# Need to map the source (categories embeddings) to the target:
# the label is the taxonomy entry whose embedding is "closest" to the categories embedding

# Calculate the similarity (higher means closer)
# cos_sim = linear_kernel(classes_embeddings, taxo_embeddings)
cos_sim = cosine_similarity(classes_embeddings, taxo_embeddings)

In [0]:
cos_sim.shape

In [0]:
top_idx = []

for cs in cos_sim:
    top_idx.append(np.argmax(cs))
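
The nearest-label step can be checked by hand on a tiny example. The sketch below uses made-up 2-D vectors and hypothetical label names instead of real sentence embeddings, and computes cosine similarity directly with NumPy so that it runs without scikit-learn. It also shows that `np.argmax(..., axis=1)` gives the same indices as a row-by-row argmax.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


# Made-up "taxonomy" embeddings and labels (illustrative only)
toy_labels = np.array(['Clothing', 'Sports'])
toy_taxo_emb = np.array([[1.0, 0.0],
                         [0.0, 1.0]])

# Made-up "categories" embeddings for three products
toy_class_emb = np.array([[0.9, 0.1],
                          [0.2, 0.8],
                          [3.0, 1.0]])

sim = cosine_sim(toy_class_emb, toy_taxo_emb)   # shape (3, 2)

loop_idx = [np.argmax(row) for row in sim]      # row-by-row argmax
vec_idx = np.argmax(sim, axis=1)                # vectorized equivalent

print(toy_labels[vec_idx])
```

Cosine similarity ignores vector length, so the third product, whose vector is long but points mostly along the first axis, still maps to the first label.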

In [0]:
len(top_idx)

In [0]:
# Create the label column in the df
df['label'] = level_2[top_idx]
df.head()

In [0]:
df.to_feather('df_mapped_label_may15')