# Product Name Search Using Vector Search

This project demonstrates how to use Vertex AI Vector Search with Wayfair's public [WANDS retail dataset](https://github.com/wayfair/WANDS/tree/main).

## Setup
### Install and Import Python Packages

In [4]:
%%capture
!pip3 install seaborn
!pip3 install tensorflow
!pip3 install tensorflow_hub
!pip3 install tensorflow_datasets
#!pip install google-cloud-aiplatform

In [5]:
import os
import csv
import time
import math
import functools
from concurrent.futures import ThreadPoolExecutor
from typing import Generator, List, Optional, Tuple

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.auto import tqdm

# google cloud imports
from google.cloud import aiplatform
import vertexai
from vertexai.language_models import TextEmbeddingModel

2024-03-22 18:32:22.080602: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


### Global Variables

In [40]:
PROJECT = "felipe-sandbox-354619"
DATASET = "https://github.com/wayfair/WANDS/blob/main/dataset/product.csv"
LOCATION = "us-east4"
BUCKET = "gs://felipe-sandbox-354619-bucket-regional"
BUCKET_URI = BUCKET + "/vector-search-embeddings"
N_DIMENSIONS = 500

# Vector Search variables
INDEX_ID = "product_names_ind"
INDEX_ENDPOINT_ID = "product_endpoint"

### Start Clients

In [7]:
aiplatform.init(project=PROJECT, location=LOCATION)

### Remove existing metadata and coefficients

In [12]:
!rm -rf tmp
!rm -rf tmp2 # for gecko embeddings (not currently used)

### Create metadata path 

In [13]:
import os

# create the working directories; exist_ok avoids an error if they already exist
for log_dir in ("tmp", "tmp2"):  # tmp2 is for gecko embeddings (not currently used)
    os.makedirs(log_dir, exist_ok=True)

## Load Product Data Into a Dataframe

### Download Data

In [14]:
!git clone https://github.com/wayfair/WANDS.git

fatal: destination path 'WANDS' already exists and is not an empty directory.


### Load into a pandas DataFrame

In [15]:
# get products
df = pd.read_csv("WANDS/dataset/product.csv", sep='\t')
df.head(3)

Unnamed: 0,product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count
0,0,solid wood platform bed,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,"good , deep sleep can be quite difficult to ha...",overallwidth-sidetoside:64.7|dsprimaryproducts...,15.0,4.5,15.0
1,1,all-clad 7 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,"create delicious slow-cooked meals , from tend...",capacityquarts:7|producttype : slow cooker|pro...,100.0,2.0,98.0
2,2,all-clad electrics 6.5 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,prepare home-cooked meals on any schedule with...,features : keep warm setting|capacityquarts:6....,208.0,3.0,181.0


## Convert Product Names to Embeddings

### N-gram character embeddings

In [16]:
product_name_lst = df['product_name'].values.tolist() # plain Python list of names; passed to the vectorizer and used for lookups by position

In [17]:
vectorizer = TfidfVectorizer(max_features=10000,analyzer='char', ngram_range=(2,5))
vector_matrix = vectorizer.fit_transform(product_name_lst)

# Get feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()

# Convert the sparse matrix to a dense array 
vector_matrix = vector_matrix.toarray()

# Create a DataFrame (optional)
vector_df = pd.DataFrame(vector_matrix, columns=feature_names)
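
To make the `analyzer='char', ngram_range=(2,5)` setting concrete, here is a minimal pure-Python sketch of the character n-grams extracted from a product name. The helper name `char_ngrams` is hypothetical, and the normalization only approximates what scikit-learn does internally. It is meant as an illustration, not a replacement for `TfidfVectorizer`.

```python
def char_ngrams(text, min_n=2, max_n=5):
    """Return all character n-grams of `text` with min_n <= n <= max_n."""
    # Lowercase and collapse whitespace runs. This approximates the
    # vectorizer's preprocessing and is an assumption, not its exact code.
    normalized = " ".join(text.lower().split())
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the string.
        for start in range(len(normalized) - n + 1):
            grams.append(normalized[start:start + n])
    return grams


print(char_ngrams("bed", min_n=2, max_n=3))  # ['be', 'ed', 'bed']
```

Because names such as "montauk bed" and "montauk desk" share many short substrings, their TF-IDF vectors overlap even when the full names differ, which is what lets partial queries match.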

### Test query similarity

In [18]:
def get_similarity(vectorizer, vector_list, query):
    """
    vectorizer: fitted TfidfVectorizer model
    vector_list: TF-IDF vectors for all docs
    query: query string

    return: 1-D array of cosine similarities between the query and every doc
    """
    query_tfidf = vectorizer.transform([query])
    cosineSimilarities = cosine_similarity(query_tfidf, vector_list).flatten()
    return cosineSimilarities

similarities = get_similarity(vectorizer, vector_matrix, "montauk solid")

In [19]:
# argsort gives ascending order, so take the last 5 and reverse for best-first
top_indices = np.argsort(similarities)[-5:][::-1]
for ind in top_indices:
    print(product_name_lst[ind])
    print(similarities[ind])

montauk solid wood bed
0.6580396790893387
montauk desk
0.5074540300764361
montauk pine solid wood dining table
0.44169117269445723
mormont solid coffee table
0.39920548313914345
tyrell solid oak solid wood dining table
0.3972741422655692
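
The ranking step can also be sketched without NumPy or scikit-learn, which helps when checking results by hand on a few vectors. The function names `cosine` and `top_k` are hypothetical. This is a brute-force illustration under the assumption that vectors are plain lists of floats, not the method used by the index later in the notebook.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, doc_vecs, k=5):
    """Return up to k (score, index) pairs, highest score first."""
    scores = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    # heapq.nlargest returns its results sorted in descending order.
    return heapq.nlargest(k, scores)
```

Unlike a partition-based selection, `heapq.nlargest` returns the k best items already ordered, so no extra sort is needed before printing.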


### Dimensionality Reduction (if needed)
Reduce the embeddings to `N_DIMENSIONS` dimensions with PCA.

In [20]:
pca = PCA(n_components=N_DIMENSIONS) 
vector_reduced_df = pca.fit_transform(vector_df)

In [21]:
len(vector_reduced_df[0])

500
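
PCA-reduced vectors are generally not unit length. If an index ranks neighbors by dot product, scaling every vector (and every query) to unit length makes the dot product equal to the cosine similarity. Whether this notebook needs that step depends on the distance measure configured for the index, which is not shown here, so treat the following as an optional sketch. The helper names `l2_normalize` and `dot` are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# After normalization, the dot product of two vectors is their cosine similarity.
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
print(round(dot(a, b), 2))  # 0.96
```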

### Save Embeddings to Cloud Bucket

In [22]:
output_list = []
index = 0
for vector in vector_reduced_df:
    output_list.append(str(index) + "," + ",".join(str(v) for v in vector))
    index+=1

In [23]:
# write file; the `with` block closes it automatically
with open("tmp/vectors.csv", "w") as f:
    for string in output_list:
        f.write(string + "\n")
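
The write step above first builds every row as a string in memory. A streaming alternative with the standard `csv` module is sketched below, keeping the same layout of one record per line with the row index as the ID followed by the vector values. The function name `write_vectors_csv` is hypothetical, and the sketch assumes each vector is an iterable of numbers.

```python
import csv


def write_vectors_csv(vectors, path):
    """Write one CSV row per vector: the row index as ID, then the vector values."""
    with open(path, "w", newline="") as f:
        # Use "\n" line endings to match the file format written by the notebook.
        writer = csv.writer(f, lineterminator="\n")
        for index, vector in enumerate(vectors):
            writer.writerow([index, *vector])
```

In this notebook it could be called as `write_vectors_csv(vector_reduced_df, "tmp/vectors.csv")`, which avoids holding a second copy of the data in `output_list`.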

In [24]:
!gsutil cp tmp/vectors.csv $BUCKET_URI/

Copying file://tmp/vectors.csv [Content-Type=text/csv]...
==> NOTE: You are uploading one or more large file(s), which would run          
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

- [1 files][435.4 MiB/435.4 MiB]                                                
Operation completed over 1 objects/435.4 MiB.                                    


## Create Vector Search Index 

### Create index endpoint (Only run if index has not been initialized)

In [34]:
## create `IndexEndpoint`
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name = INDEX_ENDPOINT_ID,
    public_endpoint_enabled = True
)


Creating MatchingEngineIndexEndpoint
Create MatchingEngineIndexEndpoint backing LRO: projects/185246287903/locations/us-east4/indexEndpoints/3274099336912306176/operations/4893843791858892800
MatchingEngineIndexEndpoint created. Resource name: projects/185246287903/locations/us-east4/indexEndpoints/3274099336912306176
To use this MatchingEngineIndexEndpoint in another session:
index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/185246287903/locations/us-east4/indexEndpoints/3274099336912306176')


In [80]:
# create Index
my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name = INDEX_ID,
    contents_delta_uri = BUCKET_URI,
    dimensions = N_DIMENSIONS,
    approximate_neighbors_count = 10,
)

Creating MatchingEngineIndex
Create MatchingEngineIndex backing LRO: projects/185246287903/locations/us-east4/indexes/7200745630770135040/operations/2115685771724718080
MatchingEngineIndex created. Resource name: projects/185246287903/locations/us-east4/indexes/7200745630770135040
To use this MatchingEngineIndex in another session:
index = aiplatform.MatchingEngineIndex('projects/185246287903/locations/us-east4/indexes/7200745630770135040')


In [35]:
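# deploy the Index to the Index Endpoint
# requires `my_index` from the index-creation cell above to exist in this session
my_index_endpoint.deploy_index(
    index = my_index, deployed_index_id = INDEX_ID
)

NameError: name 'my_index' is not defined

(This error comes from a run, In [35], that preceded the index-creation cell, In [80]. Run the index-creation cell first.)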

### Instantiate index endpoint

In [45]:
# attach to the existing Index Endpoint; the numeric ID is shown in the Cloud console
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint("3274099336912306176")

### Query index

In [61]:
# run query

def query_to_vector(query):
    query_vector = vectorizer.transform([query]).toarray()
    vector_reduced = pca.transform(query_vector).tolist()
    return vector_reduced

def get_results(query):

    # use the function argument, not the global QUERY
    vectorized_query = query_to_vector(query)

    response = my_index_endpoint.find_neighbors(
        deployed_index_id = "product_name_1711133053130",  # must match the ID the index was deployed under
        queries = vectorized_query,
        num_neighbors = 5
    )

    print("Results: \n")
    # show the results; neighbor.id is the row index written to vectors.csv
    for neighbor in response[0]:
        product_nm = df.loc[int(neighbor.id)]["product_name"]
        print(f"{neighbor.distance:.2f} {product_nm}")

In [62]:
QUERY = "solid wood bed"
get_results(QUERY)

Results: 

0.64 abeyta solid wood bed
0.64 montauk solid wood bed
0.63 sumfleth solid wood bed
0.63 anoeska solid wood bed
0.61 ralphio solid wood bed




### Clean Up

In [60]:
!rm -r WANDS

rm: cannot remove 'WANDS/': No such file or directory


In [65]:
# wait for a confirmation
input("Press Enter to undeploy all indexes and delete the Index Endpoint:")

# undeploy any deployed indexes, then delete the Index Endpoint
my_index_endpoint.undeploy_all()
my_index_endpoint.delete(force = True)