<br/>

# <font color=teal>Ecommerce Shopping Project</font>

This project is based on the Amazon KDD Cup 2023 challenge to classify customer shopping queries as related or not related to product descriptions.

Find the challenge [here](https://www.aicrowd.com/challenges/amazon-kdd-cup-23-multilingual-recommendation-challenge).


<br/>

<br/>


---

<br/>

## <font color=orange>Step 2 - Tokenize product and query text</font>

<br/>


### <font color=orange>Goal</font>

<br/>


- Tokenize the prepared data

- Save as a Hugging Face dataset

<br/>

<br/>


### <font color=orange>Input</font>
<font color=purple>Local clean data </font>

<font color=purple>Spanish</font>

- `es_prod_query_0.parquet`
- `es_prod_query_1.parquet`
- `es_prod_query_2.parquet`
- `es_prod_query_3.parquet`
- `es_prod_query_4.parquet`

<font color=purple>Japanese</font>

- `jp_prod_query_0.parquet`
- `jp_prod_query_1.parquet`
- `jp_prod_query_2.parquet`
- `jp_prod_query_3.parquet`
- `jp_prod_query_4.parquet`
- `jp_prod_query_5.parquet`

<font color=purple>English</font>

- `us_prod_query_0.parquet`
- `us_prod_query_1.parquet`
- `us_prod_query_2.parquet`
- `us_prod_query_3.parquet`
- `us_prod_query_4.parquet`
- `us_prod_query_5.parquet`

<font color=purple>Metadata</font>

- `metadata.json`

<br/>


<br/>


### <font color=orange>Approach</font>

Read in the clean dataset created in the previous step.

Tokenize the text columns (`query` and `product`) using the tokenizer of the pre-trained Hugging Face `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` model.

The tokenized data is used in subsequent steps to train the model and make predictions.


<br/>

<br/>

### <font color=orange>Output</font>

<font color=purple>Local tokenized data </font>

Saved under the `nlp` subdirectory of the data directory:

- <font color=purple>dataset_info.json</font>

- <font color=purple>state.json</font>


- <font color=purple>train</font>
    
    - `data-00000-of-00003.arrow`
    - `data-00001-of-00003.arrow`
    - `data-00002-of-00003.arrow`



---


#### <a id="top"></a>

<div style="background-color: teal; padding: 10px;">
    <h3 style="color: white;">Table of contents</h3>
</div>

<div style="background-color: rgba(60, 121, 245, 0.03); padding:30px; font-size:15px; font-family: consolas;">
<ul>
    <li><a href="#0" target="_self" rel=" noreferrer nofollow">0. Imports and housekeeping</a></li>
    <li><a href="#1" target="_self" rel=" noreferrer nofollow">1. AI Crowd Login</a></li>
    <li><a href="#2" target="_self" rel=" noreferrer nofollow">2. Read untokenized data</a></li>
    <li><a href="#3" target="_self" rel=" noreferrer nofollow">3. Add tokenized columns</a></li>
    <li><a href="#4" target="_self" rel=" noreferrer nofollow">4. Save as a Hugging Face dataset</a></li>

</ul>
</div>

<a id="0"></a>
<div style="background-color: teal; padding: 10px;">
    <h3 style="color: white;">Imports and housekeeping</h3>
</div>

In [1]:
import sys

sys.path.append('../src')  # Add the 'src' directory to the Python path
sys.path.append('../_secrets')  # Add the '_secrets' directory to the Python path

In [None]:
import tensorflow as tf  # models
import numpy as np  # math computations
import matplotlib.pyplot as plt  # plotting bar chart
import sklearn  # machine learning library
# import cv2  # image processing
from sklearn.metrics import confusion_matrix, roc_curve  # metrics
import seaborn as sns  # visualizations
import datetime
import pathlib
import io
import pandas as pd
import os
import re
import csv
import string
import time
from numpy import random
# import gensim.downloader as api
from PIL import Image
import tensorflow_datasets as tfds
import tensorflow_probability as tfp
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Layer
from tensorflow.keras.layers import Dense,Flatten,InputLayer,BatchNormalization,Dropout,Input,LayerNormalization
from tensorflow.keras.losses import BinaryCrossentropy,CategoricalCrossentropy, SparseCategoricalCrossentropy
from tensorflow.keras.metrics import Accuracy,TopKCategoricalAccuracy, CategoricalAccuracy, SparseCategoricalAccuracy
from tensorflow.keras.optimizers import Adam
# from google.colab import drive
# from google.colab import files
from datasets import load_dataset
from transformers import AutoTokenizer

# Local
from _secrets.secret_vars import get_secret
from src.config import get_directory, get_config
import json



<a id="1"></a>
<div style="background-color: teal; padding: 10px;">
    <h3 style="color: white;">AI Crowd Login</h3>
</div>

Get the AIcrowd API key from the local `_secrets` module

In [None]:
api_key = get_secret('api_key')
os.environ["AICROWD_API_KEY"] = api_key

Login

In [None]:
! aicrowd login 

<a id="2"></a>

<div style="background-color: teal; padding: 10px;">
    <h3 style="color: white;">Read untokenized data</h3>
</div>


Read our metadata file

In [4]:

meta_file = os.path.join(get_directory('prep_data'), 'metadata.json')

with open(meta_file, 'r') as fp:
    meta = json.load(fp)
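
The cells that follow index `meta` by locale (`meta['us']`) and get back a list of Parquet paths. As a rough sketch of that layout, assuming `metadata.json` is a plain mapping from locale code to a list of file paths: the helper name `load_locale_files`, the `es` key, and the example paths below are hypothetical and only illustrate the shape, not the actual contents of the file.

```python
import json
import os
import tempfile

# Hypothetical metadata layout: locale code -> list of Parquet file paths.
# Only the 'us' key is confirmed by this notebook; the paths are placeholders.
example_meta = {
    "us": [
        "data/prep/us/us_prod_query_0.parquet",
        "data/prep/us/us_prod_query_1.parquet",
    ],
    "es": ["data/prep/es/es_prod_query_0.parquet"],
}


def load_locale_files(meta_path, locale):
    """Return the list of file paths recorded for one locale."""
    with open(meta_path, "r") as fp:
        loaded = json.load(fp)
    if locale not in loaded:
        raise KeyError(f"locale {locale!r} not found; available: {sorted(loaded)}")
    return loaded[locale]


# Round-trip through a temporary file to show the helper in use.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "metadata.json")
    with open(demo_path, "w") as fp:
        json.dump(example_meta, fp)
    demo_us_files = load_locale_files(demo_path, "us")

print(len(demo_us_files))
```

Raising a `KeyError` that lists the available locales makes a mistyped locale code easier to spot than a bare lookup failure.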

In [5]:
us_files = meta['us']
us_files

['/Users/christopherlomeli/Source/courses/datascience/Springboard/capstone/capstone3/data/prep/us/us_prod_query_0.parquet',
 '/Users/christopherlomeli/Source/courses/datascience/Springboard/capstone/capstone3/data/prep/us/us_prod_query_1.parquet',
 '/Users/christopherlomeli/Source/courses/datascience/Springboard/capstone/capstone3/data/prep/us/us_prod_query_2.parquet',
 '/Users/christopherlomeli/Source/courses/datascience/Springboard/capstone/capstone3/data/prep/us/us_prod_query_3.parquet',
 '/Users/christopherlomeli/Source/courses/datascience/Springboard/capstone/capstone3/data/prep/us/us_prod_query_4.parquet',
 '/Users/christopherlomeli/Source/courses/datascience/Springboard/capstone/capstone3/data/prep/us/us_prod_query_5.parquet',
 '/Users/christopherlomeli/Source/courses/datascience/Springboard/capstone/capstone3/data/prep/us/us_prod_query_6.parquet',
 '/Users/christopherlomeli/Source/courses/datascience/Springboard/capstone/capstone3/data/prep/us/us_prod_query_7.parquet',
 '/Users

In [6]:

# Read the Parquet files into a single DataFrame
# (only the first US file is loaded here; extend the list to read more)
df = pd.concat([pd.read_parquet(file) for file in [us_files[0]]], ignore_index=True)


<a id="3"></a>

<div style="background-color: teal; padding: 10px;">
    <h3 style="color: white;">Add tokenized columns</h3>
</div>


In [7]:
BATCH_SIZE=128
MAX_LENGTH=64
os.environ["TOKENIZERS_PARALLELISM"] = "false"
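
`BATCH_SIZE` is set here, but the tokenization cell in this section passes each column to the tokenizer in a single call. If memory becomes a concern on larger files, the texts could be processed in chunks instead. The following is a minimal sketch of that idea, not the notebook's method: `iter_batches`, `tokenize_in_batches`, and the stand-in `toy_tokenize` are hypothetical names introduced for illustration.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def tokenize_in_batches(texts, tokenize_fn, batch_size):
    """Apply `tokenize_fn` to each batch and concatenate the per-key outputs."""
    merged = {}
    for batch in iter_batches(texts, batch_size):
        encoded = tokenize_fn(batch)
        for key, rows in encoded.items():
            merged.setdefault(key, []).extend(rows)
    return merged


# Stand-in tokenizer for illustration only: one fake id per whitespace token.
def toy_tokenize(batch):
    return {"input_ids": [[len(word) for word in text.split()] for text in batch]}


demo_texts = ["red shoes", "usb c cable", "coffee mug", "desk lamp", "yoga mat"]
demo_encoded = tokenize_in_batches(demo_texts, toy_tokenize, batch_size=2)
print(len(demo_encoded["input_ids"]))
```

With the real tokenizer, `tokenize_fn` could be a small wrapper that calls it with the same `max_length`, `padding`, and `truncation` arguments used in this section, assuming it returns plain Python lists per key.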


In [8]:


model_id=get_config("model_id")
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [9]:

# Tokenize the 'query' and 'product' columns with the pre-trained tokenizer loaded above
tokenized_query = tokenizer(df['query'].tolist(), max_length=MAX_LENGTH, padding='max_length', truncation=True)
tokenized_product = tokenizer(df['product'].tolist(), max_length=MAX_LENGTH, padding='max_length', truncation=True)

# Add tokenized outputs to the DataFrame
df['input_ids_query'] = tokenized_query['input_ids']
df['token_type_ids_query'] = tokenized_query['token_type_ids']
df['attention_mask_query'] = tokenized_query['attention_mask']

df['input_ids_product'] = tokenized_product['input_ids']
df['token_type_ids_product'] = tokenized_product['token_type_ids']
df['attention_mask_product'] = tokenized_product['attention_mask']
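
The `max_length`, `padding='max_length'`, and `truncation=True` arguments mean every row ends up with exactly `MAX_LENGTH` ids, and the attention mask marks which positions hold real tokens. The sketch below illustrates that effect on plain lists. It is a simplification, not the library's implementation: the real tokenizer also handles special tokens and uses its own padding id, and `pad_or_truncate` with `pad_id=0` is an assumption made for the example.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    kept = token_ids[:max_length]                    # truncation: drop extra ids
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding: fill to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence is padded; a long one is cut off at max_length.
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Fixed-length rows are what allow the tokenized columns to be stacked into rectangular tensors in the later training step.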


<a id="4"></a>

<div style="background-color: teal; padding: 10px;">
    <h3 style="color: white;">Save as a Hugging Face dataset</h3>
</div>


In [10]:
from datasets import DatasetDict, Dataset

dataset_dict = DatasetDict({
    'train': Dataset.from_pandas(df)
})

In [11]:
stage_dir = os.path.join(get_directory('data'), 'nlp')
os.makedirs(stage_dir, exist_ok=True)  # create the output directory if it is missing
    
nlp_data = os.path.join(stage_dir, 'us_tokenized')

In [12]:
dataset_dict.save_to_disk(nlp_data)

Saving the dataset (0/1 shards):   0%|          | 0/53191 [00:00<?, ? examples/s]

CPU times: user 35.6 ms, sys: 236 ms, total: 272 ms
Wall time: 1.18 s
