<br/>

# <font color=teal>Ecommerce Shopping Project</font>

This project is based on the Amazon KDD Cup 2023 challenge to classify customer shopping queries as related or not related to product descriptions.

Find the challenge [here](https://www.aicrowd.com/challenges/amazon-kdd-cup-23-multilingual-recommendation-challenge).


<br/>

<br/>

---

<br/>

## <font color=orange>Step 4 - Test the model</font>

<br/>

Load the SBERT model

Load the weights we got from training and saved in step 3

Try to print the top 5 matches to a customer query

<br/>


<br/>


### <font color=orange>Goal</font>

<br/>


Correctly pick the best matches to a customer query 

<br/>

<br/>


### <font color=orange>Input</font>

The trained weights from our model 

<br/>


<br/>


### <font color=orange>Approach</font>

Load the weights we trained

Tokenize the product titles 

When we get a query, score all product titles and sort them in order of relevance

Return the top n products that match the customer's query
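
The scoring and sorting steps above can be sketched with NumPy. This is only an illustration of the idea, assuming cosine similarity between a query embedding and precomputed title embeddings; the project's actual logic lives in `inference.py` and may differ. The function name `top_n_matches` and the toy 3-dimensional embeddings are hypothetical.

```python
import numpy as np

def top_n_matches(query_embedding, title_embeddings, titles, n=5):
    """Return the n titles whose embeddings are most similar to the query."""
    # Accept a query of shape (d,) or (1, d)
    q = np.asarray(query_embedding, dtype=float).ravel()
    t = np.asarray(title_embeddings, dtype=float)
    # Normalize so that the dot product equals cosine similarity
    q = q / np.linalg.norm(q)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    scores = t @ q                   # one similarity score per title
    best = np.argsort(-scores)[:n]   # indices of the highest scores first
    return [(titles[i], float(scores[i])) for i in best]

# Toy embeddings, made up for illustration only (real ones have 384 dimensions)
titles = ["yellow hb pencil", "pencil dress", "garden hose"]
title_embeddings = np.array([[0.9, 0.1, 0.0],
                             [0.5, 0.8, 0.1],
                             [0.0, 0.1, 0.9]])
query_embedding = np.array([1.0, 0.0, 0.0])

for title, score in top_n_matches(query_embedding, title_embeddings, titles, n=2):
    print(f"{score:.3f}  {title}")
```

Because the title embeddings do not depend on the query, they can be computed once and reused for every query, which matches the "Embedding files already exist" message printed later in this notebook.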


<br/>

<br/>

### <font color=orange>Output</font>

The top n products for a customer's query



---


#### <a id="top"></a>

<div style="background-color: teal; padding: 10px;">
    <h3 style="color: white;">Table of contents</h3>
</div>

<div style="background-color: rgba(60, 121, 245, 0.03); padding:30px; font-size:15px; font-family: consolas;">
<ul>
    <li><a href="#0" target="_self" rel=" noreferrer nofollow">0. Imports and housekeeping</a></li>
    <li><a href="#1" target="_self" rel=" noreferrer nofollow">1. Colab setup</a></li>
    <li><a href="#2" target="_self" rel=" noreferrer nofollow">2. AIcrowd setup</a></li>
    <li><a href="#2" target="_self" rel=" noreferrer nofollow">3. Get the model and load our weights</a></li>
    <li><a href="#2" target="_self" rel=" noreferrer nofollow">4. Return the top 5 products for a customer query</a></li>

</ul>
</div>

<a id="0"></a>
<div style="background-color: teal; padding: 10px;">
    <h3 style="color: white;">Housekeeping</h3>
</div>

Environment

In [1]:
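import importlib

def install_if_not_installed(package_name, module_name=None):
    """Install a pip package if its module cannot be imported.

    module_name is the importable name when it differs from the pip
    package name; it defaults to package_name.
    """
    try:
        importlib.import_module(module_name or package_name)  # Try to import the module
    except ImportError:
        print(f"\n=========================\ninstalling {package_name}\n=========================")
        !pip install {package_name}  # If it's not installed, install it

In [2]:
install_if_not_installed('transformers')
install_if_not_installed('datasets')
# 'aicrowd-cli' is not a valid module name; assumed here to be importable as 'aicrowd'
install_if_not_installed('aicrowd-cli', module_name='aicrowd')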


installing aicrowd-cli


Imports

In [3]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import pathlib
import os
import sys
import re
import csv
import string
import time
from numpy import random
# import gensim.downloader as api
from PIL import Image
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Layer
from tensorflow.keras.losses import BinaryCrossentropy,CategoricalCrossentropy, SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

from datasets import load_dataset, load_from_disk, DatasetDict
from transformers import AutoTokenizer,create_optimizer,TFAutoModel

# Local

import json

<a id="0"></a>
<div style="background-color: teal; padding: 10px;">
    <h3 style="color: white;">Colab Setup</h3>
</div>

Mount Google Drive and import local packages

In [4]:

# Check if running in Google Colab
if 'google.colab' in sys.modules:
    from google.colab import drive
    from google.colab import files

    drive.mount('/content/drive')

    from drive.MyDrive.projects.capstone3.source.sentence_transformer import TransformerModel
    from drive.MyDrive.projects.capstone3.source.config import get_config, get_directory
    from drive.MyDrive.projects.capstone3.source.secrets import get_secret

else:
    sys.path.append('../src')  # Add the 'src' directory to the Python path
    sys.path.append('../_secrets')  # Add the '_secrets' directory to the Python path

    from src.sentence_transformer import TransformerModel
    from src.config import get_config, get_directory
    from _secrets.secret_vars import get_secret


Check GPU

In [5]:
if 'google.colab' in sys.modules:

    gpu_info = !nvidia-smi
    gpu_info = '\n'.join(gpu_info)
    if gpu_info.find('failed') >= 0:
        print('Not connected to a GPU')
    else:
        print(gpu_info)

Check RAM

In [6]:
# Check free RAM
if 'google.colab' in sys.modules:
    !free -h

Check that TensorFlow can see the GPU

In [7]:
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


2023-10-07 18:07:51.386662: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Pro
2023-10-07 18:07:51.386689: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 16.00 GB
2023-10-07 18:07:51.386692: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 5.33 GB
2023-10-07 18:07:51.386877: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-10-07 18:07:51.386904: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


In [8]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('!!!!!!!!!!!!  Not using a high-RAM runtime  !!!!!!!!!!!!!!!!!!')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 17.2 gigabytes of available RAM

!!!!!!!!!!!!  Not using a high-RAM runtime  !!!!!!!!!!!!!!!!!!


<a id="0"></a>
<div style="background-color: teal; padding: 10px;">
    <h3 style="color: white;">AIcrowd Setup</h3>
</div>

In [9]:
api_key = get_secret('api_key')
os.environ["AICROWD_API_KEY"] = api_key

In [10]:
! aicrowd login

API Key valid
Gitlab oauth token invalid or absent.
It is highly recommended to simply run `aicrowd login` without passing the API Key.
Saved details successfully!


<a id="0"></a>
<div style="background-color: teal; padding: 10px;">
    <h3 style="color: white;">Get the model from Hugging Face, and load our own weights</h3>
</div>

In [11]:
EPOCHS = 20
BATCH_SIZE=128
LEARNING_RATE = .0001


Instantiate the model

In [12]:
model_id=get_config("model_id")
print("Using model ", model_id)

Using model  sentence-transformers/all-MiniLM-L6-v2


In [13]:
tf_model = TFAutoModel.from_pretrained(model_id)
tf_model.summary()

2023-10-07 18:07:56.882613: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-10-07 18:07:56.882636: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
All model checkpoint layers were used when initializing TFBertModel.

All the layers of TFBertModel were initialized from the model checkpoint at sentence-transformers/all-MiniLM-L6-v2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "tf_bert_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  22713216  
                                                                 
Total params: 22713216 (86.64 MB)
Trainable params: 22713216 (86.64 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Wrap in our custom model

In [14]:
model=TransformerModel(tf_model)
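
The source of `TransformerModel` is not shown in this notebook. Sentence-transformer checkpoints such as `all-MiniLM-L6-v2` are usually turned into one vector per sentence by mean pooling the token embeddings (ignoring padding) and then L2-normalizing. The NumPy sketch below shows that pooling step under the assumption that the wrapper does something similar; `mean_pool` and `l2_normalize` are hypothetical names, not functions from this project.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings per sentence, skipping padded positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)        # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)        # avoid division by zero
    return summed / counts

def l2_normalize(vectors):
    """Scale each row to unit length so dot products become cosine similarities."""
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Tiny made-up example: one sentence, three token positions, the last one is padding
token_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

sentence_embedding = l2_normalize(mean_pool(token_embeddings, attention_mask))
print(sentence_embedding.shape)
```

With the real model, the hidden size is 384, which is consistent with the `(1, 384)` query embedding shape printed at the end of this notebook.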

In [15]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [16]:

model_path=get_config("model_path")
print("Using model path", model_path)

weights_file = os.path.join(model_path, 'model_weights.h5')
tf_model.load_weights(weights_file)


Using model path /Users/christopherlomeli/Source/courses/datascience/Springboard/capstone/capstone3/data/model


<a id="0"></a>
<div style="background-color: teal; padding: 10px;">
    <h3 style="color: white;">Return the top 5 products that match the customer query</h3>
</div>


This model was trained on a small portion of the full tokenized dataset, so we do not expect perfect results. This is just a test.

The model was mostly correct for our customer query for a pencil: four of the top five results are HB pencils. The top hit, however, is an L'VOW "Sexy Ruched Pencil Dress", and none of the pencils returned are blue, so we would need the full model and some tuning to do better.

I'm running the whole dataset on Colab and may adjust these findings.


A next step would be to create a test set for calculating an F1 score. For this kind of retrieval model, that is not as out-of-the-box as it is with tabular data.

We would need to:

- pull a test set of queries,

- pick the top products for each query, to serve as the ground-truth labels,

- run inference for each query, as we do here, and then score the results against those labels.
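
The scoring step could look something like the sketch below. It assumes each query has a labeled set of relevant product ids and that inference returns a ranked list of ids; the function names and the sample ids are hypothetical, and a real evaluation might prefer a ranking metric instead.

```python
def precision_recall_f1(retrieved, relevant):
    """Score one query: `retrieved` is what inference returned, `relevant` is the labeled set."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def mean_f1(results_by_query, relevant_by_query):
    """Average the per-query F1 over every query in the test set."""
    scores = [
        precision_recall_f1(results_by_query.get(query, []), relevant)[2]
        for query, relevant in relevant_by_query.items()
    ]
    return sum(scores) / len(scores)

# Made-up ids for illustration: four relevant products, one wrong hit at the top
relevant_by_query = {"a blue hb pencil": {"p1", "p2", "p3", "p4"}}
results_by_query = {"a blue hb pencil": ["d9", "p1", "p2", "p3", "p4"]}

print(f"mean F1: {mean_f1(results_by_query, relevant_by_query):.3f}")
```

Note that this set-based F1 ignores rank order, so a wrong product in first place costs the same as one in fifth place.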




<font color=teal> Pick the top 5 products for the customer query of "a blue hb pencil"</font>

In [17]:
from src.inference import InferenceEngine

customer_query = "a blue hb pencil"

# the code for the inference engine is in the inference.py source file
inference_engine = InferenceEngine(tf_model=tf_model, tokenizer=tokenizer)

# query the engine for the top 5 matches to the customer query
results = inference_engine.query(customer_query, top=5)

# print out the top 5 matches
for r in results:
    print(r)

Embedding files already exist, no need to remake them....
(1, 384)
(0, array(["L'VOW Sexy Ruched Pencil Dress for Women Spaghetti Strap Bodycon Backless Maxi Formal Dress(Black,X-Large)"],
      dtype='<U400'))
(1, array(['Beginner Primary Size Pencils, Wood-Cased #2 HB Soft Without Eraser, Yellow, 12-Pack - New'],
      dtype='<U400'))
(2, array(['Wood-Cased #2 HB Pencils, Yellow, Pre-sharpened, Class Pack, 1000 pencils'],
      dtype='<U400'))
(3, array(['TICONDEROGA My First Pencils, Wood-Cased #2 HB Soft, Pre-Sharpened with Eraser, Yellow, 12-Pack (33312)'],
      dtype='<U400'))
(4, array(['TICONDEROGA Laddie Pencils, Wood-Cased #2 HB Soft without Eraser, Yellow, 12-Pack (13040)'],
      dtype='<U400'))
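
Each result above prints as a `(rank, array([title]))` tuple, which is hard to read. A small formatting helper could unpack them; this is a sketch that assumes the results keep the shape shown in the output, and `format_results` is a hypothetical name rather than part of `InferenceEngine`.

```python
import numpy as np

# Two rows copied from the output above, kept in the same (rank, array) shape
results = [
    (0, np.array(["L'VOW Sexy Ruched Pencil Dress for Women Spaghetti Strap Bodycon Backless Maxi Formal Dress(Black,X-Large)"])),
    (1, np.array(["Beginner Primary Size Pencils, Wood-Cased #2 HB Soft Without Eraser, Yellow, 12-Pack - New"])),
]

def format_results(results):
    """Turn (rank, one-element array) pairs into numbered, human-readable lines."""
    return [f"{rank + 1}. {title_array[0]}" for rank, title_array in results]

for line in format_results(results):
    print(line)
```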
