# ProductNet: Categorize products using image(s) and text descriptors
## Jen Sheng Wong and Kartik Nanda (Cohort 14)
Based on the following paper: https://arxiv.org/pdf/1904.09037.pdf

## Problem Statement:
The problem concerns products listed on retail/marketplace sites such as Amazon. It has three main aspects:
* Categorizing products into ~5000 categories (using the Google taxonomy: https://github.com/fellowship/platform-demos3/blob/master/ProductNet/taxonomy-with-ids.en-US.xls)
* Each product has one or more images
* Each product has text: title, description, keywords

Possible end-problems to solve:
* Find the category, given product images and a user-provided text description.
* Find mis-categorized products.

## Dataset: 
Metadata for products sold on Amazon from 1996 through 2014, compiled by Prof. McAuley at UCSD.

Citations:
* R. He, J. McAuley. Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016
* J. McAuley, C. Targett, J. Shi, A. van den Hengel. Image-based recommendations on styles and substitutes. SIGIR, 2015

## Dataset Storage:
* Used a Google Cloud Storage bucket located at gs://platform-ai-research/datasets/ProductNet/
* Also used Jen Sheng's Google Drive for intermediate files, images, etc.

## Generating labels for the dataset
The dataset has three text fields: categories, description and title. Any combination of these can be used to generate labels for the dataset.

The first attempt used the 'categories' entry in the dataset itself as the label. This results in ~90,000 unique labels. Error rates with these labels were around 98%.

The second attempt took the first entry in the categories column as the label. This reduced the number of unique labels to ~40 on a smaller sampled dataset (10k products instead of 5.7 million). The error rate was ~70%.

The third attempt generates labels by mapping the categories data to Google taxonomy entries. We then use these new labels to train.

In [1]:
!pip install fastai
!pip install pyarrow

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
import os
import gc
from tqdm import tqdm
import warnings

from PIL import Image

from fastai.vision import *
from fastai.metrics import error_rate, accuracy

warnings.simplefilter(action='ignore')



In [0]:
# Read in the cleaned dataset from the GS bucket
# Set the file name - this is the output file from step 1 (workbook_1)
import sys

file_name = 'metadata_clean_0513'
gs_path = 'platform-ai-research/datasets/ProductNet/'     # location of the bucket

# Set local to False if running on Colab
local = False      # this needs to be automated (how?)

# Fetch the data file if it is not already in the working directory
if not os.path.isfile(file_name):
    if local:
        print('File Does Not Exist')
        sys.exit()

    # Log in to access the GS bucket
    from google.colab import auth
    auth.authenticate_user()

    # Copy the data file to the Colab local dir
    remote_file = gs_path + file_name
    !gsutil cp gs://{remote_file} {file_name}

    # A failed shell command does not raise a Python exception,
    # so check that the copy actually produced the file
    if not os.path.isfile(file_name):
        print('File Does Not Exist')
        sys.exit()

# Read in the file
df = pd.read_feather(file_name)
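
The `local` flag above is set by hand. One possible way to set it automatically is sketched below. It assumes that the Colab runtime loads the `google.colab` package at kernel start, so that its presence in `sys.modules` can serve as the signal. The helper name `running_on_colab` is hypothetical and not part of the original notebook.

```python
import sys

def running_on_colab(modules=None):
    """Return True if the google.colab package appears to be loaded.

    Assumption: Colab imports `google.colab` when the kernel starts,
    so the module name is present in sys.modules on Colab only.
    """
    if modules is None:
        modules = sys.modules
    return "google.colab" in modules

# local is True everywhere except on Colab (under the assumption above)
local = not running_on_colab()
```

The optional `modules` argument exists only so the check can be exercised without a Colab runtime.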

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5607160 entries, 0 to 5607159
Data columns (total 7 columns):
index          int64
asin           object
categories     object
description    object
imUrl          object
title          object
label          object
dtypes: int64(1), object(6)
memory usage: 299.5+ MB


In [4]:
df = df[['asin', 'categories']]    # keep only the asin and categories columns
df.head()

Unnamed: 0,asin,categories
0,37214,"Clothing, Shoes & Jewelry, Girls, Clothing, Sh..."
1,32069,"Sports & Outdoors, Other Sports, Dance, Clothi..."
2,31909,"Sports & Outdoors, Other Sports, Dance"
3,32034,"Sports & Outdoors, Other Sports, Dance, Clothi..."
4,31852,"Sports & Outdoors, Other Sports, Dance"


In [5]:
len(df)

5607160

In [6]:
# Using the Universal Sentence Encoder to map categories to Google taxonomy labels
# (__future__ imports belong before any other statement)
from __future__ import absolute_import, division, print_function

from sklearn.metrics.pairwise import cosine_similarity

# TensorFlow 1.x API (tf.logging, tf.Session, hub.Module)
import tensorflow as tf
import tensorflow_hub as hub
tf.logging.set_verbosity(tf.logging.ERROR)

hub_embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")

W0515 20:37:55.286013 139745971382144 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14


## Generate embeddings for the Google taxonomy
Here, we use level 1 and level 2 of the taxonomy as labels.

In [7]:
# Now generate the embeddings for the taxonomy - the labels we are trying to map to
# Read in the taxonomy file from the bucket
file_name = 'taxonomy-with-ids.en-US.xls'

remote_file = gs_path + file_name
!gsutil cp gs://{remote_file} .

Copying gs://platform-ai-research/datasets/ProductNet/taxonomy-with-ids.en-US.xls...
/ [1 files][603.0 KiB/603.0 KiB]
Operation completed over 1 objects/603.0 KiB.                                    


In [0]:
# Read the taxonomy file
taxo = pd.read_excel('taxonomy-with-ids.en-US.xls')

taxo = taxo.fillna('')

# Get columns. Different columns provide a different depth into the taxonomy.
# Column 1 is the top level and has few (21) unique categories.
# Column 2 is one level deeper and has 192 unique values (including the empty
# string for rows with no second level); column 3 is deeper still, and so on.
level_2 = taxo.iloc[:, 2].unique()
level_1 = taxo.iloc[:, 1].unique()

In [9]:
taxo=None; gc.collect()

11104

In [0]:
# generate the embeddings for various text input
def get_use_matrix(s):
    embeddings = hub_embed(s)

    with tf.Session() as session:
        session.run([tf.global_variables_initializer(), tf.tables_initializer()])
        xtrain_embeddings = session.run(embeddings)
        
    return xtrain_embeddings

In [11]:
taxo_embeddings_l1 = get_use_matrix(level_1)
taxo_embeddings_l2 = get_use_matrix(level_2)

taxo_embeddings_l1.shape, taxo_embeddings_l2.shape

((21, 512), (192, 512))

In [0]:
# We now have the embeddings for both the categories data (the source) and the taxonomy (the target)
# Need to map the source (categories embeddings) to the target.
# the label is the embedding in the taxonomy that is "closest" to the categories embedding
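
The mapping described above can be sketched on toy data as follows. This is a minimal illustration with NumPy only, assuming "closest" means highest cosine similarity (the measure used later in the notebook). The helper name `nearest_labels` and the 2-dimensional vectors are made up for the example; real Universal Sentence Encoder embeddings have 512 dimensions.

```python
import numpy as np

def nearest_labels(source_emb, target_emb, labels):
    """For each source row, return the label whose embedding is most similar.

    Similarity is cosine similarity: rows are L2-normalised, then compared
    with a dot product. Assumes no row is the all-zero vector.
    """
    src = source_emb / np.linalg.norm(source_emb, axis=1, keepdims=True)
    tgt = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                      # shape: (n_source, n_target)
    return [labels[j] for j in sim.argmax(axis=1)]

# Toy target labels and embeddings (illustrative values only)
labels = np.array(["Apparel", "Sporting Goods"], dtype=object)
target = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

# Toy source embeddings, one row per product
source = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [3.0, 1.0]])

result = nearest_labels(source, target, labels)
print(result)
```

Because the rows are normalised first, the third product maps to the same label as the first even though its vector is much longer: only direction matters for cosine similarity.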

In [0]:
# Create a local directory for the per-piece label files
os.makedirs('labels', exist_ok=True)

In [0]:
# The full dataset does not fit in Colab RAM, so process it in pieces
pieces = 400
n = int(df.shape[0]/pieces)    # rows per piece
classes_embeddings = np.empty(shape=[0, 512])
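
Integer division by the number of pieces rounds down. With 5,607,160 rows and 400 pieces, each piece has 14,017 rows and the 400 pieces together cover 5,606,800 rows, so the last 360 rows fall outside every piece. A minimal sketch of piece boundaries that include the remainder is shown below. The helper name `chunk_bounds` is hypothetical and not part of the original notebook.

```python
def chunk_bounds(n_rows, pieces):
    """Return (start, stop) pairs that cover all rows in at most `pieces` chunks.

    The chunk size is rounded up so that no trailing rows are dropped;
    `stop` is exclusive, as in Python slicing and DataFrame.iloc.
    """
    size = -(-n_rows // pieces)            # ceiling division
    return [(start, min(start + size, n_rows))
            for start in range(0, n_rows, size)]

bounds = chunk_bounds(5607160, 400)
print(len(bounds), bounds[0], bounds[-1])
```

Each pair could then be used as `df.iloc[start:stop]`. Note that `iloc` excludes the stop position, whereas the `.loc` slicing used in the loop below includes it.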

In [15]:
START = 5 # can change to i if i is not None
END = pieces

for i in range(START, END):
    print(i)
    # .loc slicing includes both end points; copy so that adding columns
    # below does not write into a view of df
    df2 = df.loc[i*n:(i+1)*n-1].copy()
    
    # Category strings for this piece
    s = df2.categories.values

    # Get embeddings for all the entries in the piece
    classes_embeddings_piece = get_use_matrix(s)
    
    # Cosine similarity between each entry and each taxonomy label
    cos_sim_l1 = cosine_similarity(classes_embeddings_piece, taxo_embeddings_l1)
    cos_sim_l2 = cosine_similarity(classes_embeddings_piece, taxo_embeddings_l2)
    
    # Get the closest level-1 label
    top_1_label = [level_1[np.argmax(cs)] for cs in cos_sim_l1]
    
    # Closest top N level-2 labels for multiclass
    N = 3 # Try top 3
    
    # Indices of the N most similar level-2 labels, most similar first
    # (these are indices into level_2, not probabilities)
    top_n_idx = [np.argsort(cs)[::-1][:N] for cs in cos_sim_l2]
    top_3_labels = [level_2[idx] for idx in top_n_idx]
    
    # Assign to df2; the top-N labels are stored as one comma-separated string
    df2['top_1'] = top_1_label
    df2['top_3'] = [', '.join(map(str, labels)) for labels in top_3_labels]
    
    df_idx = len(df2)*(i+1)
    
    fn = './labels/labels_' + str(df_idx)
    
    print(fn)
    
    # Feather requires a default RangeIndex
    df2 = df2.reset_index(drop=True)
    df2.to_feather(fn)
       
    # Release the large intermediates before the next piece
    df2 = None
    classes_embeddings_piece = None
    cos_sim_l1 = None; cos_sim_l2 = None
    top_1_label = None; top_3_labels = None; top_n_idx = None
    
    gc.collect()

5


ResourceExhaustedError: ignored
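
The `ResourceExhaustedError` above suggests that embedding a whole piece in a single call needs more memory than is available. One possible mitigation, sketched here under that assumption, is to embed each piece in smaller batches and stack the results. The helper `embed_in_batches`, the batch size, and the stand-in `toy_embed` function are all hypothetical. In the notebook the embedding function would be `get_use_matrix`, which this sketch does not call.

```python
import numpy as np

def embed_in_batches(texts, embed_fn, batch_size=256):
    """Apply embed_fn to consecutive slices of texts and stack the outputs.

    embed_fn must accept a sequence of strings and return a 2-D array with
    one row per string. Only one batch is passed to embed_fn at a time.
    """
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(embed_fn(batch)))
    if not chunks:
        return np.empty((0, 0))
    return np.vstack(chunks)

def toy_embed(batch):
    """Stand-in for a real encoder: (length, vowel count) per string."""
    return np.array([[len(t), sum(c in "aeiou" for c in t.lower())]
                     for t in batch], dtype=float)

texts = ["Sports & Outdoors", "Dance", "Clothing", "Girls", "Shoes"]
out = embed_in_batches(texts, toy_embed, batch_size=2)
print(out.shape)
```

The stacked result has the same rows, in the same order, as a single call over all the texts would produce, so the downstream cosine-similarity step does not need to change.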

In [0]:
read = pd.read_feather(fn)

read.head()

In [16]:
# PUSH TO GS
gs_path = 'platform-ai-research/datasets/ProductNet/'
file_name = 'labels'

remote_file = gs_path + file_name
!gsutil cp -r {file_name} gs://{remote_file} 

Copying file://labels/labels_28035 [Content-Type=application/octet-stream]...
Copying file://labels/labels_56068 [Content-Type=application/octet-stream]...
Copying file://labels/labels_11214 [Content-Type=application/octet-stream]...
Copying file://labels/labels_70085 [Content-Type=application/octet-stream]...
\ [4 files][ 11.5 MiB/ 11.5 MiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying file://labels/labels_14017 [Content-Type=application/octet-stream]...
Copying file://labels/labels_42051 [Content-Type=application/octet-stream]...
| [6 files][ 16.1 MiB/ 16.1 MiB]                                                
Operation completed over 6 objects/16.1 MiB.                                     


In [0]:
# Embeddings for the categories have been generated
classes_embeddings.shape

(173817, 512)

In [18]:
level_2[:5]

# np.argsort(cs)

array(['Live Animals', 'Pet Supplies', '', 'Clothing', 'Clothing Accessories'], dtype=object)

In [30]:
idxs = [[4, 3, 2, 1], [1, 2, 3, 4]]

[level_2[idx] for idx in np.argsort(idxs)[::-1][:3]]

[array(['Live Animals', 'Pet Supplies', '', 'Clothing'], dtype=object),
 array(['Clothing', '', 'Pet Supplies', 'Live Animals'], dtype=object)]

In [88]:
np.argmax(idxs)

0
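
The two experiments above show how `np.argsort` and `np.argmax` behave on a 2-D input: `[::-1]` reverses the order of the rows rather than the order within each row, and `np.argmax` without an `axis` argument works on the flattened array. A minimal sketch of the row-wise versions, using the same toy values:

```python
import numpy as np

scores = np.array([[4, 3, 2, 1],
                   [1, 2, 3, 4]])

# argsort sorts within each row (last axis), but [::-1] then flips the
# row order, so row 0 of the result describes row 1 of the input
reversed_rows = np.argsort(scores)[::-1]

# Row-wise top 3: negate so that ascending argsort yields descending scores
top_n = np.argsort(-scores, axis=1)[:, :3]

# Row-wise best index versus the index into the flattened array
best = np.argmax(scores, axis=1)
flat_best = np.argmax(scores)

print(top_n)
print(best, flat_best)
```

The labelling loop earlier in the notebook applies `np.argsort(cs)[::-1][:N]` to one row at a time, where `[::-1]` does reverse the sort order. The difference only appears when the whole 2-D array is passed in at once.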

In [0]:
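# Create the label column in the df
# NOTE: top_idx is not defined in the cells shown above; it must hold, for each
# row of df, the index into level_2 of the closest taxonomy label
df['label'] = level_2[top_idx]
df.head()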

In [0]:
df.to_feather('df_mapped_label_may15')