# Taxonomy Categorization Use-Case

This notebook demonstrates a taxonomy categorization use-case using text embeddings and Vertex AI Vector Search.

### The steps performed include:
* Parameters, variables, and any helper functions are defined
* Sample pre-processing to deduplicate initial data; Combine and transform relevant data columns
* Apply text embeddings using REST requests
* Create and Deploy Index for Vector Search
* Query Index and get Ranked Results of classes

User parameters are indicated by `@param`.

### Import Libraries & Define Parameters

In [None]:
# # Install the packages
# ! pip3 install --upgrade --quiet google-cloud-aiplatform \
#                                  google-cloud-storage

In [None]:
import pandas as pd
import numpy as np
import json
import google.auth.transport.requests
import google.auth
from google.cloud import storage
import requests
from google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint import \
    Namespace

In [None]:
PROJECT_ID = "sandbox-401718" # @param
REGION = "us-central1" # @param
BUCKET_URI = f"gs://{PROJECT_ID}-category-textembedding-{REGION}"
INPUT_URI = f"{BUCKET_URI}/input-test"
OUTPUT_URI = f"{BUCKET_URI}/output-test"

### Create buckets

In [None]:
# ! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}

## Text Embeddings on GCS

The following data splitting strategy is used for building and querying the index:

![taxonomy-flow_.png](./imgs/taxonomy-flow_.png)

### Pre-processing and Enrichment

Preprocessing involves cleaning, transforming, and standardizing the raw data to reduce noise and inconsistencies. It can help ensure optimal performance and accurate results for downstream matching algorithms.

Sample preprocessing techniques to consider:
* Remove exact duplicate records (included below) 
* Normalize keyword variations in spelling and abbreviations (e.g., "Ave" to "Avenue")
* Ensure consistent formatting across all data points
* Identify and correct invalid or inaccurate domain information


Information from multiple sources is also strategically combined to create richer representations of the product taxonomy classes.

In [None]:
df_data = pd.read_csv(' AAG_test_dataset.xlsx - Descriptions.csv')
df_heirarchy = pd.read_csv('EUHIERARCHY_062024.xlsx - Sheet1.csv')
df_data_holdout = pd.read_csv('inventory-test-source.csv') # used for testing accuracy 

In [None]:
# Split unlabeled and labeled test data set

# DataFrame where 'structure_assignment_euhierarchy' is NOT null
df_data_label = df_data[df_data['structure_assignment_euhierarchy'].notna()].copy() 

# Remove holdout dataset from Labeled dataset 
# Perform the merge with indicator
merged_df = df_data_label.merge(df_data_holdout, how='left', indicator=True)
df_data_label = merged_df[merged_df['_merge'] == 'left_only'].drop('_merge', axis=1)

# DataFrame where 'structure_assignment_euhierarchy' IS null
df_data_unlabel = df_data[df_data['structure_assignment_euhierarchy'].isnull()].copy() # simulate live data


### Combine and transform relevant data columns

Engineer the data so that it is well suited to embedding generation. For example, merging a dataset that contains multiple product examples per category with the hierarchical taxonomy dataset gives the vector index a more diverse and comprehensive set of reference examples for each category. Instead of relying solely on the taxonomy, which offers only one example per class, this method uses labeled data to capture more nuanced representations of each class, which can improve categorization accuracy.

In [None]:
# Merge the dataframes based on the 'structure_assignment_euhierarchy' and 'Tier_5_ID' columns

## This provides additional "ground truth" examples for the vector database instead
## of relying only on the hierarchy, which includes just 1 example per unique class

df_data_merge = pd.merge(
    df_data_label[
        [
            "supplier",
            "short_description_en",
            "supplier_description_en",
            "long_description_en",
            "structure_assignment_euhierarchy",
        ]
    ],
    df_heirarchy[
        [
            "Tier_1_EN",
            "Tier_2_EN",
            "Tier_3_EN",
            "Tier_4_EN",
            "Tier_5_EN",
            "Tier_5_ID",
        ]
    ],
    left_on="structure_assignment_euhierarchy",
    right_on="Tier_5_ID",
    how="right",
)

# Label data with unique IDs to map embeddings back to class
np.random.seed(42)
df_data_merge['unique_id'] = np.random.randint(10000000000000, size=len(df_data_merge))
if df_data_merge['unique_id'].nunique() == len(df_data_merge):
    print("All IDs are unique")
else:
    print("There are duplicate IDs!") 

In [None]:
# Format features for embeddings

# Select and combine features into comma separated string

df_data_merge['combined_text'] = (df_data_merge['supplier'].fillna('') + ', ' + 
                       df_data_merge['short_description_en'].fillna('') + ', ' + 
                       df_data_merge['supplier_description_en'].fillna('') + ', ' + 
                       df_data_merge['long_description_en'].fillna('') + ', ' +
                       df_data_merge['Tier_1_EN'].fillna('') + ', ' +  
                       df_data_merge['Tier_2_EN'].fillna('') + ', ' + 
                       df_data_merge['Tier_3_EN'].fillna('') + ', ' + 
                       df_data_merge['Tier_4_EN'].fillna('') + ', ' + 
                       df_data_merge['Tier_5_EN'].fillna('') + ', ' + 
                       df_data_merge['Tier_5_ID'].fillna(''))

df = df_data_merge

In [None]:
df['combined_text'][13000] # Example

In [None]:
# Create JSON file
# Create a list of dictionaries from the 'combined_text' column
text_data = [{"content": combined_text} for combined_text in df['combined_text']]
# text_data = [{"content": combined_text_sentence} for combined_text_sentence in df['combined_text_sentence']]

# Save to a JSONL file
with open('input.jsonl', 'w') as outfile:
    for entry in text_data:
        json.dump(entry, outfile)
        outfile.write('\n') 

In [None]:
# save to GCS 
storage_client = storage.Client()

BUCKET_NAME = "/".join(INPUT_URI.split("/")[:3])
bucket = storage_client.bucket(BUCKET_NAME[5:])

# Define the blob including any folders from INPUT_URI
blob_name = "/".join(INPUT_URI.split("/")[3:])+"/input.jsonl"
blob = bucket.blob(blob_name)

# Upload the file 
blob.upload_from_filename("input.jsonl")

print(f"File uploaded to cloud storage in {INPUT_URI}")

### REST request for Batch Prediction Job

This section details the code that makes the API request to Vertex AI to generate the text embeddings.
<br> For a list of available embedding models, see: https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings#supported-models
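
As a rough sketch of how the request body for such a job might be assembled, the helper below serializes the payload with `json.dumps`. The field names mirror the ones used in this notebook; the function names and the `gs://example-bucket` paths are hypothetical placeholders. It also illustrates why `str()` on a dict is not a safe substitute: it produces single-quoted Python syntax rather than JSON.

```python
import json


def build_batch_request(model, input_uri, output_prefix, display_name="batch-test"):
    """Serialize a batch prediction request body as a JSON string."""
    body = {
        "displayName": display_name,
        "model": model,
        "inputConfig": {
            "instancesFormat": "jsonl",
            "gcs_source": {"uris": [input_uri]},
        },
        "outputConfig": {
            "predictionsFormat": "jsonl",
            "gcs_destination": {"output_uri_prefix": output_prefix},
        },
    }
    return json.dumps(body)


def is_valid_json(text):
    """Return True when the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Placeholder bucket paths, for illustration only.
payload = build_batch_request(
    model="publishers/google/models/text-embedding-004",
    input_uri="gs://example-bucket/input-test/input.jsonl",
    output_prefix="gs://example-bucket/output-test",
)
print(is_valid_json(payload))
print(is_valid_json(str({"displayName": "batch-test"})))
```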

In [None]:
# Credentials

# Set up Application Default Credentials (ADC)
credentials, project_id = google.auth.default()
auth_req = google.auth.transport.requests.Request()
credentials.refresh(auth_req)
access_token = credentials.token

In [None]:
MODEL = "publishers/google/models/text-embedding-004" # optimized for english content
# MODEL = "publishers/google/models/text-multilingual-embedding-002"
url = f"https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/batchPredictionJobs"

headers = {
        'Authorization': 'Bearer ' + access_token,
        'Content-Type': 'application/json; charset=utf-8'
    }

In [None]:
# Serialize with json.dumps so the body is valid JSON
# (str() on a dict yields single-quoted Python syntax, not JSON)
request_body = json.dumps(
    {
        "name": "batch-test",
        "displayName": "batch-test",
        "model": MODEL,
        "inputConfig": {
            "instancesFormat": "jsonl",
            "gcs_source": {"uris": [f"{INPUT_URI}/input.jsonl"]},
        },
        "outputConfig": {
            "predictionsFormat": "jsonl",
            "gcs_destination": {"output_uri_prefix": OUTPUT_URI},
        },
    }
)


In [None]:
# request_body = '{"name": "test", "displayName": "test", "model": "publishers/google/models/text-embedding-004", "inputConfig": {"instancesFormat": "jsonl", "gcs_source": {"uris": ["gs://sandbox-401718-fuzzymatch-textembedding/input-test/input.jsonl"]}}, "outputConfig": {"predictionsFormat": "jsonl", "gcs_destination": {"output_uri_prefix": "gs://sandbox-401718-fuzzymatch-textembedding/output-test"}}}'
r = requests.post(url, data=request_body, headers=headers)

In [None]:
print(r.status_code)

In [None]:
print(r.content)

### Import JSON From Path and Map

This section outlines the steps for retrieving the generated embedding files from the Batch Prediction Job output and mapping them back to the corresponding records in the original data. <br> **Note:** To keep things simple, the paths to the output embedding files must be provided manually. In the future, this process can be automated by retrieving the file paths directly from the Batch Prediction Job's output information.

* Download embedding files: The embedding results from the batch prediction job are downloaded from GCS.
* Map embeddings to original dataset: The downloaded embeddings are associated with their corresponding `combined_text` values in the original DataFrame.
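
The mapping step can be sketched in isolation as shown below. The record layout (`instance.content` and `predictions[0].embeddings.values`) follows the keys this notebook reads; the sample lines and their shortened vectors are made up for illustration, and skipping records without predictions is an assumption about how incomplete rows might be handled.

```python
import json


def parse_embedding_lines(lines):
    """Map each input text to its embedding, skipping blank or empty records."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        item = json.loads(line)
        predictions = item.get("predictions") or []
        if not predictions:
            continue  # assumed handling for rows without an embedding
        text = item["instance"]["content"]
        mapping[text] = predictions[0]["embeddings"]["values"]
    return mapping


# Illustrative JSON Lines content; real embeddings have many more dimensions.
sample_lines = [
    json.dumps({
        "instance": {"content": "Acme, steel bolt"},
        "predictions": [{"embeddings": {"values": [0.1, 0.2, 0.3]}}],
    }),
    "",
    json.dumps({"instance": {"content": "Acme, copper pipe"}, "predictions": []}),
]
embedding_map = parse_embedding_lines(sample_lines)
print(embedding_map)
```

Keying the map on the input text works here because identical texts produce identical embeddings; rows whose text is missing from the map would simply get no embedding.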

In [None]:
EMBED_FILES = [
    "gs://sandbox-401718-category-textembedding-us-central1/output-test/prediction-model-2024-08-02T18:50:02.710567Z/000000000000.jsonl",  # @param
]

# Write to the local JSON Lines file directly
with open("embeddings.jsonl", "w", encoding="utf-8") as outfile:
    for embed_file in EMBED_FILES:
        bucket_name = embed_file.split("/")[2]
        blob_name = "/".join(embed_file.split("/")[3:])
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(blob_name)

        # Download the entire file and decode it as UTF-8 text
        # (download_as_string is deprecated in favor of download_as_bytes)
        file_content = blob.download_as_bytes().decode("utf-8")
        lines = file_content.splitlines()
        for line in lines:
            outfile.write(line + "\n")  # Write each line as a separate JSON object
print("Combined JSON Lines data saved to `embeddings.jsonl`")


In [None]:
# Load the JSON data from your local .jsonl file
response_json = []
with open("embeddings.jsonl", "r") as f:
    for line in f:
        response_json.append(json.loads(line))  # Parse each line

# Create a dictionary to map each combined text to its embedding
combined_text_embedding_map = {}
for item in response_json:
    combined_text = item['instance']['content']
    embedding = item['predictions'][0]['embeddings']['values']
    combined_text_embedding_map[combined_text] = embedding

# Map embeddings to the DataFrame using the combined text lookup
df['Embeddings'] = df['combined_text'].map(combined_text_embedding_map)

In [None]:
df.head() # Embeddings column now included

In [None]:
# # check unique id's
# combined_text_ = df['unique_id'].unique()
# len(combined_text_) == len(df)

## Create Index for Vector Search

This section describes the process of creating an index for Vertex AI Vector Search. A brute-force (exhaustive) search index is used in this example; it finds the exact nearest neighbors of the query vector. A brute-force index is more computationally intensive than an approximate nearest neighbor (ANN) index, which focuses on performant approximations and retrieval efficiency.

For more information about the methods and their tradeoffs, see: https://cloud.google.com/vertex-ai/docs/vector-search/create-manage-index
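
To make the idea of exhaustive search concrete, here is a small pure-Python sketch. It is not how Vector Search is implemented; it only illustrates that a brute-force search scores the query against every stored vector and keeps the best `k`. The vector IDs and values are toy data, and the vectors are assumed to be unit length so the dot product acts as a similarity score.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_search(query, vectors, k):
    """Score the query against every vector and return the top-k (id, score)."""
    scored = ((dot(query, vec), vec_id) for vec_id, vec in vectors.items())
    top = heapq.nlargest(k, scored)
    return [(vec_id, score) for score, vec_id in top]


# Toy unit-length vectors, for illustration only.
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.6, 0.8]}
results = brute_force_search([0.8, 0.6], vectors, k=2)
print(results)
```

Every query touches every vector, so the cost grows linearly with the index size; ANN indexes avoid this by searching only part of the data, at the price of occasionally missing a true nearest neighbor.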

### Format data
https://cloud.google.com/vertex-ai/docs/vector-search/setup/format-structure

In [None]:
# Create a list of dictionaries, one per row, with the ID and embedding
data = []
for index, row in df.iterrows():
    data.append({
        "id": str(row['unique_id']),
        "embedding": row['Embeddings'],
    })
    
# Export the data as a JSON Lines file
with open("index_input_data.json", "w", encoding="utf-8") as f:
    for entry in data:
        json.dump(entry, f)  # Write each dictionary as JSON
        f.write('\n')        # Add a newline to separate objects 

In [None]:
# save to GCS 
BUCKET_NAME = "/".join(INPUT_URI.split("/")[:3])
bucket = storage_client.bucket(BUCKET_NAME[5:])

# Define the blob including any folders from INPUT_URI
blob_name = "/".join(INPUT_URI.split("/")[3:])+"/initial/index_input_data.json"
blob = bucket.blob(blob_name)

# Upload the file 
blob.upload_from_filename("index_input_data.json")

print(f"File uploaded to cloud storage in {INPUT_URI}/initial/")

### Create Index

For similarity calculations, the documentation strongly recommends using DOT_PRODUCT_DISTANCE with UNIT_L2_NORM instead of the COSINE distance. The search algorithms are more heavily optimized for DOT_PRODUCT_DISTANCE, and when it is combined with UNIT_L2_NORM it is mathematically equivalent to the COSINE distance and produces the same ranking.
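
The equivalence can be checked numerically with a short sketch: the dot product of two L2-normalized vectors equals their cosine similarity, so both produce the same ranking. The vectors below are arbitrary illustrative values, not embeddings from this notebook.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0]
candidates = {"x": [1.0, 0.0], "y": [6.0, 8.0], "z": [0.0, 2.0]}

cosine_scores = {k: cosine_similarity(query, v) for k, v in candidates.items()}
dot_scores = {
    k: dot(l2_normalize(query), l2_normalize(v)) for k, v in candidates.items()
}

cosine_ranking = sorted(candidates, key=cosine_scores.get, reverse=True)
dot_ranking = sorted(candidates, key=dot_scores.get, reverse=True)
print(cosine_ranking, dot_ranking)
```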

In [None]:
import os
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=INPUT_URI)

In [None]:
DIMENSIONS = len(df["Embeddings"][0])
DISPLAY_NAME = "index_category"
DISPLAY_NAME_BRUTE_FORCE = DISPLAY_NAME + "_brute_force"

In [None]:
brute_force_index = aiplatform.MatchingEngineIndex.create_brute_force_index(
    display_name=DISPLAY_NAME_BRUTE_FORCE,
    contents_delta_uri=f"{INPUT_URI}/initial/",
    dimensions=DIMENSIONS,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    feature_norm_type="UNIT_L2_NORM",
    description="Category index (brute force)",
)

In [None]:
INDEX_BRUTE_FORCE_RESOURCE_NAME = brute_force_index.resource_name #'projects/757654702990/locations/us-central1/indexes/9080369554546229248'

brute_force_index = aiplatform.MatchingEngineIndex(
    index_name=INDEX_BRUTE_FORCE_RESOURCE_NAME
)

In [None]:
# tree_ah_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
#     display_name=DISPLAY_NAME,
#     contents_delta_uri=EMBEDDINGS_INITIAL_URI,
#     dimensions=DIMENSIONS,
#     approximate_neighbors_count=150,
#     distance_measure_type="DOT_PRODUCT_DISTANCE",
#     leaf_node_embedding_count=500,
#     leaf_nodes_to_search_percent=7,
#     description="ANN index",
#     labels={"label_name": "label_value"},
# )

### Deploy Index to Endpoint

In [None]:
# Retrieve the project number
PROJECT_NUMBER = !gcloud projects list --filter="PROJECT_ID:'{PROJECT_ID}'" --format='value(PROJECT_NUMBER)'
PROJECT_NUMBER = PROJECT_NUMBER[0]

VPC_NETWORK = "beusebio-network"
VPC_NETWORK_FULL = f"projects/{PROJECT_NUMBER}/global/networks/{VPC_NETWORK}"

In [None]:
# Endpoint
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="index_endpoint_for_category",
    description="category index",
    network=VPC_NETWORK_FULL,
)

INDEX_ENDPOINT_NAME = my_index_endpoint.resource_name

In [None]:
brute_force_index

In [None]:
my_index_endpoint

In [None]:
# Deploy
DEPLOYED_BRUTE_FORCE_INDEX_ID = "brute_force_deploy_comma_text_004_20holdout"
my_index_endpoint = my_index_endpoint.deploy_index(
    index=brute_force_index, deployed_index_id=DEPLOYED_BRUTE_FORCE_INDEX_ID
)

my_index_endpoint.deployed_indexes

### Query and get ranked results
Query the deployed index to find nearest neighbor match.

**Important:** The text embedding model may encounter errors when processing certain non-English or special characters. Remove such characters from the data, or pre-process them appropriately, to prevent issues during embedding generation.
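
One possible cleaning step is sketched below, using only the standard library. The default symbol set matches the characters removed in this notebook's query cell; replacing control and format characters (Unicode categories starting with `C`) with spaces is an additional assumption, and the function name is hypothetical. Accented letters and other ordinary text are left untouched.

```python
import unicodedata

# Symbols removed in this notebook's query cell; extend as needed.
UNWANTED_CHARS = ("™", "®", "©")


def clean_text(text, unwanted=UNWANTED_CHARS):
    """Drop unwanted symbols, replace control characters, collapse whitespace."""
    for char in unwanted:
        text = text.replace(char, "")
    # Control and format characters (e.g. newlines, zero-width spaces) become spaces.
    text = "".join(
        " " if unicodedata.category(ch).startswith("C") else ch for ch in text
    )
    return " ".join(text.split())


cleaned_query = clean_text("Acme™ Bolt®\u200b  M8\n")
print(cleaned_query)
```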

In [None]:
# # GET MatchingEngineIndexEndpoint if exists
# my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint(
#     "projects/####/locations/us-central1/indexEndpoints/####"
# )

# DEPLOYED_BRUTE_FORCE_INDEX_ID = "brute_force_deploy_comma_text_004_20holdout"

In [None]:
df_data_holdout['combined_text'] = (df_data_holdout['supplier'].fillna('') + ', ' + 
                       df_data_holdout['short_description_en'].fillna('') + ', ' + 
                       df_data_holdout['supplier_description_en'].fillna('') + ', ' + 
                       df_data_holdout['long_description_en'].fillna(''))

df_data_holdout = df_data_holdout.reset_index(drop=True)

In [None]:
row = 2  # hold out set 1 - 20  # @param

In [None]:
query = df_data_holdout["combined_text"][row]

unwanted_chars = ["™","®","©"]  # Add more characters to remove as needed
for char in unwanted_chars:
    query = query.replace(char, "")
    
query

In [None]:
# Credentials

# Set up Application Default Credentials (ADC)
credentials, project_id = google.auth.default()
auth_req = google.auth.transport.requests.Request()
credentials.refresh(auth_req)
access_token = credentials.token

In [None]:
# Get online text embeddings for hold out set

url = f"https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/publishers/google/models/text-embedding-004:predict"

headers = {
        'Authorization': 'Bearer ' + access_token,
        'Content-Type': 'application/json; charset=utf-8'
    }

# Serialize with json.dumps so quotes in the query text are escaped correctly
request_body = json.dumps(
    {
        "instances": [
            {"content": query}
        ],
    }
)

r = requests.post(url, data=request_body, headers=headers)

In [None]:
print(r.status_code)

In [None]:
response_data = r.json()

# Access the embeddings
embeddings = response_data['predictions'][0]['embeddings']['values']

In [None]:
# Test query
response = my_index_endpoint.match(
    deployed_index_id=DEPLOYED_BRUTE_FORCE_INDEX_ID,
    queries=[embeddings],
    num_neighbors=5,
)

response

### Map Response back to Class

Retrieve the original category classes from the Vector Search results.

In [None]:
matched_data = []
for neighbor in response[0]:  # Accessing the inner list
    matched_id = neighbor.id
    distance = neighbor.distance
    matched_class = df[df["unique_id"] == int(matched_id)]["Tier_5_ID"].iloc[0]
    matched_data.append(
        {"ID": matched_id, "Tier_5_ID": matched_class, "Distance": distance}
    )
matched_df = pd.DataFrame(matched_data)
[
    (item["Tier_5_ID"], item["Distance"])
    for item in matched_df[["Tier_5_ID", "Distance"]].to_dict(orient="records")
]


In [None]:
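# Predicted Classes

# Keep neighbors whose score meets the similarity threshold. The index uses
# DOT_PRODUCT_DISTANCE with UNIT_L2_NORM, so a higher value means a closer match.
filtered_df = matched_df[matched_df["Distance"] >= 0.7]
predicted_class = filtered_df["Tier_5_ID"].unique().tolist()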

# Ground Truth

ground_truth_class = df_data_holdout.loc[row, "structure_assignment_euhierarchy"]

print(
    f"The prediction classes: {predicted_class} \nThe ground truth class is {ground_truth_class}"
)


### Considerations for Refining and Improving Results
* **Threshold Selection:** Choose an appropriate similarity threshold to define matches, balancing accuracy against the tolerance for errors.
* **Integration:** Integrate with LLMs, for example to review or re-rank the candidate classes.
* **Post-Processing:** Apply additional techniques and rules to refine the matching results further.
* **Fine Tuning:** Fine-tune the text embedding model to increase overall embedding task effectiveness.
* **Input Formatting:** Optimize the input data for embedding generation by strategically selecting and combining features into human-readable sentences, potentially improving the quality of the embeddings.
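
As one illustration of threshold selection combined with post-processing, the sketch below filters neighbors by a similarity threshold and then ranks classes by their summed scores, so that a class supported by several close neighbors outranks a class with a single match. The 0.7 default matches the threshold used earlier in this notebook; the class IDs, scores, and the summing rule itself are hypothetical choices, not a recommended configuration.

```python
from collections import defaultdict


def rank_classes(neighbors, threshold=0.7):
    """Sum scores per class for neighbors at or above the threshold."""
    totals = defaultdict(float)
    for class_id, score in neighbors:
        if score >= threshold:
            totals[class_id] += score
    # Highest total first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical (class ID, similarity score) pairs from a nearest-neighbor query.
neighbors = [
    ("C-100", 0.91),
    ("C-200", 0.88),
    ("C-100", 0.84),
    ("C-300", 0.55),
    ("C-200", 0.72),
]
ranked = rank_classes(neighbors)
print(ranked)
```

Raising the threshold trades recall for precision; evaluating a few values against the holdout set is one way to pick a setting.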