# Metaclass Generation from Category Tree and Language Adaptation
This notebook guides you through the process of generating metaclasses for product categorization and adapting the system for different languages. We'll be preparing the necessary data structures to configure the `TextCleaner` class, which is used both in this notebook and in the production system.

The process is iterative: we'll start with initial data structures, analyze the category tree, and then refine our configurations based on the results.

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.<br>
SPDX-License-Identifier: MIT-0

## Process Overview

This notebook follows a structured process to prepare and analyze category data:

1. Load and analyze the category tree
2. Prepare and refine text cleaning configurations
3. Clean and analyze category names
4. Generate and analyze word embeddings
5. Identify special categories (like media)
6. Adapt the process for different languages (optional)
7. Persist the prepared data for use in production

Each step builds on the previous ones, creating a robust foundation for product categorization.

## Setup
### Prerequisites
The IAM user or role used by this notebook needs access to Amazon S3, AWS Systems Manager, Amazon DynamoDB, and Amazon Bedrock, plus permission to update the word embeddings IAM policy.

In [None]:
import pandas as pd
import numpy as np
import json
from collections import Counter
import boto3
import hashlib
import time

from mypy_boto3_ssm import SSMClient
from mypy_boto3_dynamodb import DynamoDBClient, DynamoDBServiceResource
from mypy_boto3_bedrock_runtime import BedrockRuntimeClient


In [None]:
ssm: SSMClient = boto3.client('ssm')
ddb: DynamoDBClient = boto3.client('dynamodb')
ddbr: DynamoDBServiceResource = boto3.resource('dynamodb')
bedrock: BedrockRuntimeClient = boto3.client('bedrock-runtime')
iam = boto3.client('iam')
ssm_prefix = '/ProductCategorization/'
vector_table_prefix = "SmartProductOnboarding-"

## Load and Analyze the Category Tree

First, let's load our category data and examine its structure.

In [None]:
language = "english"

In [None]:
# Load category data
with open("data/labelcats.json", "r") as f:
    category_tree = json.load(f)


# Function to get leaf categories
def get_leaf_categories(cat_tree):
    leaf_categories = []
    for cat_id in cat_tree:
        if cat_id == 'root':
            continue
        category = cat_tree[cat_id]
        if len(category.get('childs', [])) <= 0:
            leaf_categories.append({
                'name': category['name'],
                'id': category['id'],
                'description': category['description']
            })
    return leaf_categories


leaf_categories = get_leaf_categories(category_tree)
leaf_df = pd.DataFrame(leaf_categories)

print(f"Total leaf categories: {len(leaf_df)}")
leaf_df.head()


### Text Refinement Preparation

Before we start processing our category data, we need to set up some initial data structures. These structures are crucial for effective text cleaning and categorization:

- Singularization exceptions: Words that don't follow standard singularization rules
- Descriptors: Common words that don't help differentiate categories
- Brands: Brand names that should be excluded from category analysis
- Synonyms: Alternative terms for the same concept

These structures help ensure that our text cleaning process is accurate and that we're focusing on the most meaningful words in our category names.
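
To make the role of each structure concrete, here is a minimal sketch of how a cleaning pass might apply them. It is not the `TextCleaner` implementation: the function name `clean_name_sketch`, the naive "strip a trailing s" rule, and the sample inputs are all assumptions made for illustration.

```python
import re


def clean_name_sketch(text, singular_exceptions, synonyms, descriptors, brands):
    """Hypothetical cleaning pass; NOT the real TextCleaner logic."""
    tokens = re.findall(r"[a-z]+", text.lower())
    cleaned = []
    for token in tokens:
        # Brand names carry no category information
        if token in brands:
            continue
        # Exceptions take priority over the naive plural rule
        if token in singular_exceptions:
            token = singular_exceptions[token]
        elif token.endswith("s") and len(token) > 3:
            token = token[:-1]
        # Map alternative terms onto one canonical word
        token = synonyms.get(token, token)
        # Descriptors do not help differentiate categories
        if token in descriptors:
            continue
        if token not in cleaned:
            cleaned.append(token)
    return " ".join(cleaned)


sketch_result = clean_name_sketch(
    "Acme Canvas Sneakers - Replacement Parts",
    singular_exceptions={"canvas": "canvas"},
    synonyms={"sneaker": "shoe"},
    descriptors={"replacement", "part"},
    brands={"acme"},
)
print(sketch_result)  # canvas shoe
```

Without the `canvas` exception, the naive rule would turn "canvas" into "canva", which is the kind of error the exceptions dictionary exists to prevent.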

In [None]:
# Initial singularization exceptions
singularize_exceptions = {
    "clothes": "clothes",
    "canvas": "canvas",
    "fruticosus": "fruticosus",
    "dies": "die",
    "lotus": "lotus",
    "tinctorius": "tinctorius",
    "gymnastics": "gymnastics",
    "mollis": "mollis",
    "myosotis": "myosotis",
    "australis": "australis",
    "gas": "gas",
    "gps": "gps",
    "guatemalensis": "guatemalensis",
    "elegans": "elegans",
    "christmas": "christmas",
    "cosmos": "cosmos",
    "xps": "xps",
    "muralis": "muralis",
    "narcissus": "narcissus",
    "barbatus": "barbatus",
    "cactus": "cactus",
    "hibiscus": "hibiscus",
    "callus": "callus",
    "cycas": "cycas",
    "prunus": "prunus",
    "overalls": "overalls",
    "nitrous": "nitrous",
    "bellis": "bellis",
    "coreopsis": "coreopsis",
    "iris": "iris",
    "erinus": "erinus",
    "plectranthus": "plectranthus",
    "euryops": "euryops",
    "hyacinthus": "hyacinthus",
    "rhipsalidopsis": "rhipsalidopsis",
    "cos": "cos",
    "orientalis": "orientalis",
    "annuus": "annuus",
    "lotononis": "lotononis",
    "sylvestris": "sylvestris",
    "argus": "argus",
    "sinensis": "sinensis",
    "crocus": "crocus",
    "corylus": "corylus",
    "edulis": "edulis",
    "paris": "paris",
    "helianthus": "helianthus",
    "orchis": "orchis",
    "zamioculcas": "zamioculcas",
    "psoriasis": "psoriasis",
    "stylus": "stylus",
    "abies": "abies",
    "cupressus": "cupressus",
    "grandis": "grandis",
    "hupehensis": "hupehensis",
    "pinus": "pinus",
    "cannabis": "cannabis",
    "cucumis": "cucumis",
    "ficus": "ficus",
    "physalis": "physalis",
    "corniculatus": "corniculatus",
    "dypsis": "dypsis",
    "vulgaris": "vulgaris",
    "nephrolepis": "nephrolepis",
    "gracilis": "gracilis",
    "asiaticus": "asiaticus",
    "babacos": "babacos",
    "helleborus": "helleborus",
    "lupinus": "lupinus",
    "rhapis": "rhapis",
    "cyperus": "cyperus",
    "ruscus": "ruscus",
    "opulus": "opulus",
    "lutescens": "lutescens",
    "perennis": "perennis",
    "index": "index",
}

# Initial descriptors (common words that don't help differentiate categories)
descriptors = [
    "accessory",
    "live",
    "part",
    "replacement",
    "equipment",
    "product",
    "cut",
    "ready",
    "prepared",
    "processed",
    "unprepared",
    "unprocessed",
    "sport",
    "baby",
    "garden",
    "non",
    "kit",
    "set",
    "pack",
]

# List the brands in your store. If any brand names are dictionary words, they should not be included here.
brands = []

# Initial synonyms (regional variations, etc.)
# For English, this might be empty or contain British/American variations
synonyms = {
    "sneaker": "shoe",
}

print("Initial data structures prepared.")


### TextCleaner Instantiation

In [None]:
# This cell will be re-run after refinements
def instantiate_text_cleaner():
    from amzn_smart_product_onboarding_metaclasses.text_cleaner import TextCleaner

    return TextCleaner(
        singularize=singularize_exceptions,
        brands=brands,
        synonyms=synonyms,
        descriptors=descriptors,
        language=language  # set in the "Load and Analyze the Category Tree" section
    )


text_cleaner = instantiate_text_cleaner()
print("TextCleaner instantiated with current data structures.")

### Category Name Analysis

In this section, we'll clean our category names and analyze the results. This analysis helps us:

1. Identify common words across categories
2. Spot potential new descriptors (words that appear frequently but don't differentiate categories)
3. Ensure our cleaning process is working as expected

By iterating on this analysis, we can refine our text cleaning process and improve our understanding of the category structure.


In [None]:
# This cell will be re-run after refinements
def analyze_category_names(df):
    # Clean category names using TextCleaner
    cleaned_df = df.copy()
    cleaned_df['clean_name'] = cleaned_df['name'].apply(text_cleaner.clean_text)
    cleaned_df = cleaned_df.dropna()

    # Tokenize and count words in cleaned category names
    all_words = ' '.join(cleaned_df['clean_name']).split()
    word_counts = Counter(all_words)

    print("Most common words in cleaned category names:")
    print(word_counts.most_common(20))

    # Identify potential new descriptors (words that appear in many categories)
    potential_new_descriptors = [word for word, count in word_counts.items()
                                 if count > len(cleaned_df) * 0.05 and word not in descriptors]
    print("\nPotential new descriptors to consider:")
    print(potential_new_descriptors)

    # Display sample of cleaned names
    print("\nSample of cleaned category names:")
    print(cleaned_df[['name', 'clean_name']].head())

    # Return the new dataframe
    return cleaned_df


cleaned_leaf_df = analyze_category_names(leaf_df)
cleaned_leaf_df.head()

In [None]:
mappings_df = cleaned_leaf_df[["clean_name", "id"]].rename(columns={"clean_name": "name"})

#### Refinement Process
If any of the words above appear in many categories, consider adding them to the descriptors list in `Text Refinement Preparation`, then re-run the `TextCleaner Instantiation` and `Category Name Analysis` steps.

### Word Frequency Analysis

Understanding the frequency of words in our cleaned category names provides valuable insights:

1. It helps identify terms that are central to our category structure
2. It can reveal potential issues in our cleaning process (e.g., if very common words aren't being removed as expected)
3. It guides the refinement of our descriptor list and other cleaning parameters

This analysis is crucial for optimizing our categorization process and ensuring we're focusing on the most meaningful terms.


In [None]:
def analyze_word_frequency(df):
    word_freq_df = pd.DataFrame(df['clean_name'].str.split(expand=True).stack().value_counts()).reset_index()
    word_freq_df.columns = ['word', 'frequency']
    word_freq_df['percentage'] = word_freq_df['frequency'] / len(df) * 100
    return word_freq_df


word_freq_df = analyze_word_frequency(cleaned_leaf_df)
print("Word frequency analysis:")
word_freq_df

In [None]:
word_map = {}
n = 0
for _, cat in cleaned_leaf_df.iterrows():
    for word in cat['clean_name'].split():
        if word == '' or word == ' ' or word == 'other':
            n += 1
            continue
        if word not in word_map:
            word_map[word] = {cat['id']}
        else:
            word_map[word].add(cat['id'])

unique_count_df = pd.DataFrame([{'word': k, 'count': len(v)} for k, v in word_map.items()])

print(f'Total of "Other" {n}')
print(f'Unique leaf category words {len(word_map)}')

In [None]:
unique_leaves = {}
unique_leaves_list = []
n = 0
for word in cleaned_leaf_df['clean_name']:
    if word == '' or word == ' ' or word == 'other':
        n += 1
        continue
    if word not in unique_leaves:
        unique_leaves[word] = 1
        unique_leaves_list.append(word)
    else:
        unique_leaves[word] += 1
print(f'Total of "Other" {n}')
print(f'Unique leaf category names {len(unique_leaves)}')

## Vector Embeddings

While exact word matches are our primary method for categorization, vector embeddings provide a powerful complement:

1. They allow us to find similar words, helping with synonyms and related terms
2. They can capture semantic relationships that aren't apparent from exact matches alone
3. They're especially useful for handling nuanced or ambiguous category names

By combining exact matches with vector-based similarity, we create a more robust and flexible categorization system.
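
The similarity search later in this notebook normalizes vectors to unit length and uses an inner-product index. On unit vectors, the inner product equals cosine similarity. The sketch below shows that idea with plain NumPy; the three-dimensional vectors are made-up toy values, not real embeddings.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale each row to unit length so inner product equals cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.where(norms == 0, 1.0, norms)


# Toy vectors for illustration only
toy_vectors = {
    "shirt": [0.9, 0.1, 0.0],
    "jersey": [0.8, 0.3, 0.1],
    "lamp": [0.0, 0.2, 0.9],
}
toy_words = list(toy_vectors)
toy_matrix = normalize_rows(np.array([toy_vectors[w] for w in toy_words], dtype=np.float32))

# Query vector, normalized the same way as the indexed vectors
toy_query = normalize_rows(np.array([[1.0, 0.2, 0.0]], dtype=np.float32))

# Inner product against every row, then sort from most to least similar
toy_scores = (toy_matrix @ toy_query.T).ravel()
toy_ranked = [toy_words[i] for i in np.argsort(-toy_scores)]
print(toy_ranked)
```

A flat inner-product index performs the same comparison exhaustively, which is practical here because only the category words are indexed, not the full vocabulary.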


In [None]:
#Chilean Spanish Embeddings
#EMBEDDINGS_MODEL_URL="https://zenodo.org/records/3255001/files/embeddings-l-model.vec"

#English Embeddings
DEFAULT_EMBEDDINGS_MODEL_URL = ("https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip", "bb43875cfae187e8cef0be558a1851fb1c62daca")

In [None]:
import tempfile
import zipfile
import requests
import os
import io
import gzip
from decimal import Decimal
from amazon.ion import simpleion
from urllib.parse import urlparse

In [None]:
def secure_download(url, destination):
    # Parse the URL to extract the hostname
    parsed_url = urlparse(url)

    # Check if the URL uses HTTPS
    if parsed_url.scheme != 'https':
        raise ValueError("URL must use HTTPS")

    # Perform the request with a timeout and verify SSL certificates
    response = requests.get(url, timeout=30, verify=True, stream=True)

    # Raise an exception for bad status codes
    response.raise_for_status()

    # Write the content to the file
    with open(destination, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)

This next step may take a long time while downloading and loading the vectors file.

In [None]:
def vector_generator(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    count = 0
    for line in fin:
        if count % 100000 == 0:
            print(f"{count}/{n}")
        tokens = line.rstrip().split(' ')
        word: str = tokens[0]
        if word.isalpha():
            yield {
                "Item": {
                    "word": word,
                    "vector": list(map(Decimal, tokens[1:]))
                }
            }
        count += 1
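
For reference, the fastText `.vec` text format starts with a header line holding the vocabulary size and vector dimension, followed by one line per word: the word, then its space-separated components. The sketch below parses a tiny in-memory sample with the same alphabetic filter as the generator above; `parse_vec_sketch` and the sample data are illustrative only.

```python
import io


def parse_vec_sketch(stream):
    """Parse a small .vec-style stream into (n, d, {word: vector})."""
    n, d = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        tokens = line.rstrip().split(" ")
        word, values = tokens[0], tokens[1:]
        # Skip punctuation and numeric tokens
        if not word.isalpha():
            continue
        # Skip malformed lines whose length does not match the header
        if len(values) != d:
            continue
        vectors[word] = [float(v) for v in values]
    return n, d, vectors


sample_vec = io.StringIO("3 2\nshirt 0.1 0.2\nlamp -0.3 0.4\n, 0.0 0.0\n")
sample_n, sample_d, sample_vectors = parse_vec_sketch(sample_vec)
print(sample_n, sample_d, sorted(sample_vectors))
```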


In [None]:
def get_embeddings_models(model_urls):
    files = []
    for model_url in model_urls:
        (url, checksum) = model_url if isinstance(model_url, tuple) else (model_url, "0")
        vecfile_basename = os.path.basename(url)
        should_download = True
        if os.path.exists(os.path.join("data", vecfile_basename)):
            with open(os.path.join("data", vecfile_basename), "rb", buffering=0) as f:
                sha1 = hashlib.file_digest(f, 'sha1').hexdigest()
            if checksum == sha1:
                should_download = False
        if should_download:
            secure_download(url, os.path.join("data", vecfile_basename))
            with open(os.path.join("data", vecfile_basename), "rb", buffering=0) as f:
                sha1 = hashlib.file_digest(f, 'sha1').hexdigest()
            print(url, sha1)
        if vecfile_basename.endswith('.zip'):
            print(f'Extracting embeddings {url}')
            with zipfile.ZipFile(os.path.join("data", vecfile_basename), 'r') as zip_ref:
                zip_ref.extractall("data")
            print(f'Extracted embeddings {url}')
            files.append(os.path.join("data", vecfile_basename.replace('.zip', '')))
        elif vecfile_basename.endswith('.vec'):
            files.append(os.path.join("data", vecfile_basename))
    return files
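
`hashlib.file_digest` is only available in Python 3.11 and later. On an older kernel, a chunked read gives the same SHA-1 digest without loading the whole file into memory. The helper name `sha1_of_file` and the temporary demo file are assumptions for illustration.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute a file's SHA-1 hex digest by reading it in fixed-size chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello embeddings")
    tmp_path = tmp.name
demo_digest = sha1_of_file(tmp_path)
os.remove(tmp_path)
print(demo_digest)
```

The digest is used here only to detect an incomplete or stale download by comparing it against the checksum stored next to the URL.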

In [None]:
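file_idx = 0
# Make sure the output directory exists before opening the first shard
os.makedirs("data/english_vectors_import", exist_ok=True)
vectors_fobj = gzip.open(f"data/english_vectors_import/vectors_{file_idx}.ion.gz", "wb")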
batch = []
count = 0
for model in get_embeddings_models([DEFAULT_EMBEDDINGS_MODEL_URL]):
    for item in vector_generator(model):
        batch.append(item)
        if len(batch) >= 1_000:
            simpleion.dump(batch, vectors_fobj, binary=True, sequence_as_stream=True)
            count += len(batch)
            batch = []
        if count >= 100_000:
            vectors_fobj.close()
            file_idx += 1
            vectors_fobj = gzip.open(f"data/english_vectors_import/vectors_{file_idx}.ion.gz", "wb")
            count = 0
if batch:
    simpleion.dump(batch, vectors_fobj, binary=True, sequence_as_stream=True)
vectors_fobj.close()

In [None]:
!aws s3 sync data/english_vectors_import/ s3://aws-strunkjd-tmp-use1/english_vectors_import/

In [None]:
vector_table_name = vector_table_prefix + "english_vectors"

In [None]:
job = ddb.import_table(
    S3BucketSource={
        "S3Bucket": "aws-strunkjd-tmp-use1",
        "S3KeyPrefix": "english_vectors_import",
    },
    InputFormat="ION",
    InputCompressionType="GZIP",
    TableCreationParameters={
        'TableName': vector_table_name,
        'AttributeDefinitions': [
            {
                'AttributeName': 'word',
                'AttributeType': 'S'
            },
        ],
        'KeySchema': [
            {
                'AttributeName': 'word',
                'KeyType': 'HASH'
            },
        ],
        'BillingMode': 'PAY_PER_REQUEST',
    }
)

In [None]:
job_status = ddb.describe_import(ImportArn=job["ImportTableDescription"]["ImportArn"])
while job_status["ImportTableDescription"]["ImportStatus"] == 'IN_PROGRESS':
    time.sleep(60)
    print(job_status["ImportTableDescription"]["ImportStatus"])
    job_status = ddb.describe_import(ImportArn=job["ImportTableDescription"]["ImportArn"])

job_status

In [None]:
vector_table = ddbr.Table(vector_table_name)

In [None]:
category_vectors = {}
batch = []
for w in word_map.keys():
    batch.append(w)  
    if len(batch) > 100:
        raise Exception("max batch size is 100")
    elif len(batch) == 100:
        result = ddb.batch_get_item(
            RequestItems={
                vector_table_name: {
                    "Keys": [{
                        "word": {
                            "S": word,
                        }
                    } for word in batch]
                }
            },
        )
        for item in result['Responses'][vector_table_name]:
            word = item['word']['S']
            vector = []
            for v in item['vector']['L']:
                vector.append(float(v['N']))
            category_vectors[word] = vector
        batch = []
if batch:
    result = ddb.batch_get_item(
        RequestItems={
            vector_table_name: {
                "Keys": [{
                    "word": {
                        "S": word,
                    }
                } for word in batch]
            }
        },
    )
    for item in result['Responses'][vector_table_name]:
        word = item['word']['S']
        vector = []
        for v in item['vector']['L']:
            vector.append(float(v['N']))
        category_vectors[word] = vector

In [None]:
print(f"wordvectors contains {len(category_vectors)}/{len(word_map)} category words")
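
`BatchGetItem` accepts at most 100 keys per request and may return some of them under `UnprocessedKeys` when throttled, so a low hit count above can mean dropped keys rather than missing words. The sketch below chunks the words and retries unprocessed keys with a short backoff. `batch_get_vectors`, `chunked`, and the `FakeClient` stand-in are hypothetical names used so the example runs without AWS access.

```python
import time


def chunked(items, size=100):
    """Yield successive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_get_vectors(client, table_name, words, max_retries=5, base_delay=0.05):
    """Fetch vectors in batches of 100, retrying any UnprocessedKeys."""
    vectors = {}
    for chunk in chunked(words, 100):
        request = {table_name: {"Keys": [{"word": {"S": w}} for w in chunk]}}
        attempt = 0
        while request and attempt <= max_retries:
            result = client.batch_get_item(RequestItems=request)
            for item in result.get("Responses", {}).get(table_name, []):
                vectors[item["word"]["S"]] = [float(v["N"]) for v in item["vector"]["L"]]
            # Anything left here was not served and must be requested again
            request = result.get("UnprocessedKeys") or {}
            if request:
                attempt += 1
                time.sleep(min(base_delay * 2 ** attempt, 1.0))
    return vectors


class FakeClient:
    """Test stand-in: serves one key per call and leaves the rest unprocessed."""

    def __init__(self, store):
        self.store = store

    def batch_get_item(self, RequestItems):
        (table, spec), = RequestItems.items()
        first, rest = spec["Keys"][:1], spec["Keys"][1:]
        items = [
            {"word": key["word"],
             "vector": {"L": [{"N": str(x)} for x in self.store[key["word"]["S"]]]}}
            for key in first if key["word"]["S"] in self.store
        ]
        response = {"Responses": {table: items}, "UnprocessedKeys": {}}
        if rest:
            response["UnprocessedKeys"] = {table: {"Keys": rest}}
        return response


fake = FakeClient({"shirt": [0.1, 0.2], "lamp": [0.3, 0.4]})
fetched = batch_get_vectors(fake, "demo_table", ["shirt", "lamp", "zzz"], base_delay=0.0)
print(fetched)
```

With a real client, the same function could be called as `batch_get_vectors(ddb, vector_table_name, word_map.keys())`.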

In [None]:
word_index = list(category_vectors.keys())

In [None]:
d = len(next(iter(category_vectors.values())))

In [None]:
import faiss

In [None]:
index = faiss.index_factory(d, "Flat", faiss.METRIC_INNER_PRODUCT)

In [None]:
index_array = np.array([v for v in category_vectors.values()]).astype(np.float32)
faiss.normalize_L2(index_array)
index.add(index_array)


In [None]:
index.ntotal

In [None]:
sv = np.array([np.array(category_vectors["shirt"]).astype(np.float32)])
faiss.normalize_L2(sv)

In [None]:
sv.shape

In [None]:
D, I = index.search(sv, 5)

In [None]:
for distance, idx in zip(D[0],I[0]):
    print(f"{word_index[idx]}: {distance}")

In [None]:
def get_word_embeddings(table: DynamoDBServiceResource.Table, word: str) -> np.ndarray:
    response = table.get_item(Key={"word": word})
    if "Item" not in response:
        raise KeyError(f"No embedding for {word}")
    vector = response["Item"]["vector"]
    sv = np.array([list(map(float, vector))]).astype(np.float32)
    faiss.normalize_L2(sv)
    return sv

In [None]:
sv = get_word_embeddings(vector_table, "jersey")

In [None]:
D, I = index.search(sv, 10)

In [None]:
for distance, idx in zip(D[0],I[0]):
    print(f"{word_index[idx]}: {distance}")

### Grant read access to the table

In [None]:
word_embeddings_policy_arn = ssm.get_parameter(Name=f"{ssm_prefix}WordEmbeddingsPolicyArn")['Parameter']['Value']

In [None]:
iam.create_policy_version(
    PolicyArn=word_embeddings_policy_arn,
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
            "Sid": "wordembeddings",
            "Effect": "Allow",
            "Action": [
                    "dynamodb:GetItem",
                    "dynamodb:BatchGetItem",
                    "dynamodb:Scan",
                    "dynamodb:Query",
                    "dynamodb:ConditionCheckItem"
                ],
            "Resource": vector_table.table_arn,
            }
        ]
    }),
    SetAsDefault=True,
)

## OPTIONAL Experiment: Using aligned vectors for multilingual metaclass identification
We should be able to use aligned word vectors to find metaclass words across languages.

In our experimentation it did not work, but this may work with better word vectors.

In [None]:
# Add aligned vectors for your desired languages from https://fasttext.cc/docs/en/aligned-vectors.html
ALIGNED_EMBEDDINGS_MODEL_URLS = [
    ("https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.en.align.vec", "a3ca1fd0beaf3e99ef1c911cc256306286934860"),
    ("https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.es.align.vec", "98254cb84228ce452f30f1444576fbce756e65d5"),
]

In [None]:
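file_idx = 0
# Make sure the output directory exists before opening the first shard
os.makedirs("data/aligned_vectors_import", exist_ok=True)
vectors_fobj = gzip.open(f"data/aligned_vectors_import/vectors_{file_idx}.ion.gz", "wb")
batch = []
count = 0
# Pass the list of (url, checksum) tuples directly; wrapping it in another list breaks unpacking
for model in get_embeddings_models(ALIGNED_EMBEDDINGS_MODEL_URLS):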
    for item in vector_generator(model):
        batch.append(item)
        if len(batch) >= 1_000:
            simpleion.dump(batch, vectors_fobj, binary=True, sequence_as_stream=True)
            count += len(batch)
            batch = []
        if count >= 100_000:
            vectors_fobj.close()
            file_idx += 1
            vectors_fobj = gzip.open(f"data/aligned_vectors_import/vectors_{file_idx}.ion.gz", "wb")
            count = 0
if batch:
    simpleion.dump(batch, vectors_fobj, binary=True, sequence_as_stream=True)
vectors_fobj.close()

In [None]:
!aws s3 sync data/aligned_vectors_import/ s3://aws-strunkjd-tmp-use1/aligned_vectors_import/

In [None]:
aligned_vector_table_name = "aligned_vectors"

In [None]:
job = ddb.import_table(
    S3BucketSource={
        "S3Bucket": "aws-strunkjd-tmp-use1",
        "S3KeyPrefix": "aligned_vectors_import",
    },
    InputFormat="ION",
    InputCompressionType="GZIP",
    TableCreationParameters={
        'TableName': aligned_vector_table_name,
        'AttributeDefinitions': [
            {
                'AttributeName': 'word',
                'AttributeType': 'S'
            },
        ],
        'KeySchema': [
            {
                'AttributeName': 'word',
                'KeyType': 'HASH'
            },
        ],
        'BillingMode': 'PAY_PER_REQUEST',
    }
)

In [None]:
job_status = ddb.describe_import(ImportArn=job["ImportTableDescription"]["ImportArn"])
while job_status["ImportTableDescription"]["ImportStatus"] == 'IN_PROGRESS':
    time.sleep(60)
    print(job_status["ImportTableDescription"]["ImportStatus"])
    job_status = ddb.describe_import(ImportArn=job["ImportTableDescription"]["ImportArn"])

job_status

In [None]:
aligned_vector_table = ddbr.Table(aligned_vector_table_name)

In [None]:
aligned_category_vectors = {}
batch = []
for w in word_map.keys():
    batch.append(w)  
    if len(batch) > 100:
        raise Exception("max batch size is 100")
    elif len(batch) == 100:
        result = ddb.batch_get_item(
            RequestItems={
                aligned_vector_table_name: {
                    "Keys": [{
                        "word": {
                            "S": word,
                        }
                    } for word in batch]
                }
            },
        )
        for item in result['Responses'][aligned_vector_table_name]:
            word = item['word']['S']
            vector = []
            for v in item['vector']['L']:
                vector.append(float(v['N']))
            aligned_category_vectors[word] = vector
        batch = []
if batch:
    result = ddb.batch_get_item(
        RequestItems={
            aligned_vector_table_name: {
                "Keys": [{
                    "word": {
                        "S": word,
                    }
                } for word in batch]
            }
        },
    )
    for item in result['Responses'][aligned_vector_table_name]:
        word = item['word']['S']
        vector = []
        for v in item['vector']['L']:
            vector.append(float(v['N']))
        aligned_category_vectors[word] = vector

In [None]:
print(f"wordvectors contains {len(aligned_category_vectors)}/{len(word_map)} category words")

In [None]:
aligned_word_index = list(aligned_category_vectors.keys())

In [None]:
d = len(next(iter(aligned_category_vectors.values())))

In [None]:
import faiss

In [None]:
aligned_index = faiss.index_factory(d, "Flat", faiss.METRIC_INNER_PRODUCT)

In [None]:
aligned_index_array = np.array([v for v in aligned_category_vectors.values()]).astype(np.float32)
faiss.normalize_L2(aligned_index_array)
aligned_index.add(aligned_index_array)


Let's try it out. For "camiseta" we hope to see the "shirt" metaclass in the results.

In [None]:
sv = get_word_embeddings(aligned_vector_table, "camiseta")

In [None]:
D, I = aligned_index.search(sv, 20)

In [None]:
for distance, idx in zip(D[0],I[0]):
    print(f"{aligned_word_index[idx]}: {distance}")

### Clean up
Since this experiment didn't work, delete the table.

In [None]:
ddb.delete_table(TableName=aligned_vector_table_name)

## OPTIONAL Experiment: Using Nova Micro for Multilingual Metaclass Identification
Here we show how a small multilingual model can translate products from other languages or even rephrase titles and descriptions so the metaclass identification process can narrow down the categories to search.

We initially asked the LLM to select from the full list of metaclass words. Although the list fits in the model's context window, the model was unreliable at precise word matching against such a large list. A likely explanation is that when it compares a title against hundreds of words at once, it struggles to attend precisely to each word, which leads to inconsistent matching.

In [None]:
metaclass_words = "\n".join(sorted(word_map.keys()))

In [None]:
metaclass_prompt_template = '''You are a multilingual retail catalog manager with expertise in product classification across different languages and markets.

## BEGIN metaclass words ##
{metaclass_words}
## END metaclass words ##

Product title: {title}

Your task is to analyze the product title and identify ALL likely metaclass words, regardless of the title's language. Consider common variants, abbreviations, and multilingual equivalents.

<instructions>
1. ONLY use metaclass words from the provided list
2. Consider regional and linguistic variations in product naming
3. Ignore brands, sizes, colors, and other non-category attributes
</instructions>

Please show your analysis:
<thinking>
Your step-by-step reasoning here
</thinking>

Metaclass words (comma-separated):
'''

In [None]:
title = "Polera estampada de unicornios"
prompt= metaclass_prompt_template.format(title=title, metaclass_words=metaclass_words)
messages=[
        {
            'role': 'user',
            'content': [
                {
                    'text': prompt,
                }
            ]
        }
    ]
response = bedrock.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    inferenceConfig={
        'temperature': 0.0,
    },
)

In [None]:
print(response['output']['message']['content'][0]['text'])

We should see the word "shirt" in the result. However, repeated runs can return different results, and the model does not consistently return "shirt".

Another option is to translate and rephrase the original and follow our normal metaclass identification process looking for exact words and vector matches.

In [None]:
rephrase_prompt_template = '''
You are a retail catalog specialist. Your task is to analyze this product and create a simple normalized title that describes what this product fundamentally is.

Product title: {title}
Product description: {description}

Steps:
1. Identify the core product type from both title and description
2. Remove from consideration:
   - Brand names
   - Marketing terms
   - Decorative elements
   - Colors, sizes, materials
   - Target audience
   - Usage occasions
3. Convert to a basic product term

##Output format##
{{
  "normalized_title": "simple normalized title that describes what this product fundamentally is"
}}
'''


In [None]:
title = "Polera estampada de unicornios"
description = "Esta polera de 100% algodon. Tiene el diseno de un unicornio atravesando un arcoiris."
prompt= rephrase_prompt_template.format(title=title, description=description)
messages=[
        {
            'role': 'user',
            'content': [
                {
                    'text': prompt,
                }
            ]
        }
    ]
response = bedrock.converse(
    modelId="us.amazon.nova-micro-v1:0",
    messages=messages,
    inferenceConfig={
        'temperature': 0.0,
        'maxTokens': 100,
    },
)

In [None]:
print(response['output']['message']['content'][0]['text'])

In [None]:
title = "Unicorn T-shirt"
description = "100% cotton t-shirt. Design: A unicorn jumping through a rainbow."
prompt= rephrase_prompt_template.format(title=title, description=description)
messages=[
        {
            'role': 'user',
            'content': [
                {
                    'text': prompt,
                }
            ]
        }
    ]
response = bedrock.converse(
    modelId="us.amazon.nova-micro-v1:0",
    messages=messages,
    inferenceConfig={
        'temperature': 0.0,
        'maxTokens': 100,
    },
)

In [None]:
print(response['output']['message']['content'][0]['text'])
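
To feed the rephrased title back into the normal metaclass identification process, the JSON in the model response has to be parsed. Models sometimes wrap the JSON in extra text or emit a trailing comma, so a tolerant parser helps. This is a minimal sketch; `extract_normalized_title` is a hypothetical helper, and its trailing-comma cleanup is deliberately simple and could alter a title that itself contains `,}`.

```python
import json
import re


def extract_normalized_title(model_text):
    """Pull "normalized_title" out of a model response, or return None."""
    # Take everything from the first "{" to the last "}"
    match = re.search(r"\{.*\}", model_text, flags=re.DOTALL)
    if not match:
        return None
    # Drop trailing commas before a closing brace or bracket
    candidate = re.sub(r",\s*([}\]])", r"\1", match.group(0))
    try:
        return json.loads(candidate).get("normalized_title")
    except json.JSONDecodeError:
        return None


print(extract_normalized_title('{\n  "normalized_title": "t-shirt",\n}'))
print(extract_normalized_title('Here you go: {"normalized_title": "shirt"}'))
print(extract_normalized_title("no json here"))
```

The extracted title can then be passed through the same text cleaning and exact or vector matching used for English titles.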

## Media Categories

Some product categories, particularly in media (books, movies, games), present unique challenges:

1. Their titles often don't contain words that suggest the product type
2. They require special handling to ensure accurate categorization

By always including these categories in our categorization process, we ensure that media products are correctly identified, even when their titles don't contain typical category keywords.

Review the category tree, and make a list of categories to always include.

In [None]:
def get_child_category_ids(cat_ids: list[str]) -> list[str]:
    categories = []
    for cat_id in cat_ids:
        if not category_tree[cat_id]["childs"]:
            categories.append(cat_id)
        else:
            categories.extend(get_child_category_ids([child["id"] for child in category_tree[cat_id]["childs"]]))
    return categories


In [None]:
always_category_ids = [
    # Classes
    "68040100",  # Pre-Recorded or Digital Content Media
    "68050100",  # Audio Visual/Photography Variety Packs
    "65010400",  # Computer/Video Game Software
    "65010900",  # Computers/Video Games Variety Packs
    "60010200",  # Books
    "60010300",  # Periodicals

    # Bricks
    "10001194",  # GPS Software - Mobile Communications
    "10006237",  # GPS Software - Mobile Communications - Digital
    "10001197",  # Mobile Phone Software
    "10006238",  # Mobile Phone Software - Digital
    "10000624",  # Cross Segment Variety Packs
    "10002103",  # Textual/Printed/Reference Materials Variety Packs
]
always_categories = get_child_category_ids(always_category_ids)


In [None]:
len(always_categories)
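
One way to picture how the always-included list is used: the candidate set for a product is the union of the categories matched by its words and the always-included categories. The sketch below assumes a word-to-category mapping shaped like `word_map`. The function name and toy IDs are hypothetical, and the production logic may differ.

```python
def candidate_category_ids(title_words, word_to_categories, always_ids):
    """Union of categories matched by title words and always-included categories."""
    candidates = set(always_ids)
    for word in title_words:
        candidates |= word_to_categories.get(word, set())
    return sorted(candidates)


# Toy data for illustration only
toy_word_map = {"shirt": {"C1", "C2"}, "lamp": {"C3"}}
toy_always = ["M1", "M2"]

print(candidate_category_ids(["unicorn", "shirt"], toy_word_map, toy_always))
print(candidate_category_ids(["dune"], toy_word_map, toy_always))
```

The second call shows the media case: a title with no matching category words still yields the always-included categories as candidates.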

## Language Adaptation (Optional)

To adapt this process for a different language, we need to modify several components. Let's go through the steps you'd need to take.

### 1. Update Stop Words

For a new language, you'll need to update the stop words used in text cleaning.

In [None]:
import nltk

nltk.download('stopwords')

# Example: Changing to Spanish stop words
from nltk.corpus import stopwords

spanish_stop_words = set(stopwords.words('spanish'))
print("Sample of Spanish stop words:")
print(list(spanish_stop_words)[:20])

# In the TextCleaner class, you'd update the language and stop words like this:
# text_cleaner.language = 'spanish'
# text_cleaner._remove_stopwords_tokenize_text = lambda text: ' '.join(
#     [w for w in nltk.word_tokenize(text) if w.lower() not in spanish_stop_words]
# )

### 2. Update Singularization Rules

The singularization process is highly language-dependent. For languages other than English, you may need to implement custom singularization logic.

In [None]:
# Example: Spanish singularization rules (simplified)
def spanish_singularize(word):
    if word.endswith('es'):
        return word[:-2]
    elif word.endswith('s'):
        return word[:-1]
    return word


print("Example of Spanish singularization:")
print(spanish_singularize("gatos"))  # Should print "gato"
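
The simplified rules above mishandle many words, which is why an exceptions dictionary is useful in any language. The sketch below adds two common Spanish patterns (`-ces` to `-z`, and `-es` after a consonant) and checks an exceptions dictionary first. The function name and the sample exceptions are assumptions, and the rules remain far from complete.

```python
def singularize_es_sketch(word, exceptions=None):
    """Rule-based Spanish singularization with an exceptions override (illustrative)."""
    exceptions = exceptions or {}
    w = word.lower()
    # Exceptions always win over the rules below
    if w in exceptions:
        return exceptions[w]
    # luces -> luz
    if len(w) > 3 and w.endswith("ces"):
        return w[:-3] + "z"
    # flores -> flor (consonant before "es")
    if len(w) > 3 and w.endswith("es") and w[-3] not in "aeiou":
        return w[:-2]
    # gatos -> gato
    if len(w) > 2 and w.endswith("s"):
        return w[:-1]
    return w


spanish_exceptions = {"coches": "coche", "lunes": "lunes"}
for plural in ["gatos", "flores", "luces", "coches", "lunes"]:
    print(plural, "->", singularize_es_sketch(plural, spanish_exceptions))
```

Without the exception entry, "coches" would be reduced to "coch", so the exceptions list needs the same iterative refinement as the English one.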


### 3. Update Synonyms and Descriptors

Synonyms and descriptors need to be updated for the new language:

In [None]:
# Example: Spanish synonyms and descriptors
spanish_synonyms = {
    "ordenador": "computadora",
    "movil": "celular"
}

spanish_descriptors = [
    "nuevo", "grande", "pequeno"
]

print("Example Spanish synonyms:", spanish_synonyms)
print("Example Spanish descriptors:", spanish_descriptors)
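
To make the role of these two structures concrete, here is a minimal sketch of how they might be applied to a piece of text. It assumes that descriptors are words to drop during cleaning and that each synonym key is replaced by its value; `normalize_tokens` is a hypothetical helper, not a method of `TextCleaner`.

```python
# Same illustrative data as the cell above, repeated so this sketch runs on its own.
spanish_synonyms = {"ordenador": "computadora", "movil": "celular"}
spanish_descriptors = ["nuevo", "grande", "pequeno"]


def normalize_tokens(text, synonyms, descriptors):
    """Lowercase, split on whitespace, drop descriptors, map synonyms to one term."""
    descriptor_set = set(descriptors)
    tokens = []
    for token in text.lower().split():
        if token in descriptor_set:
            continue  # assumption: descriptors are removed during cleaning
        tokens.append(synonyms.get(token, token))
    return " ".join(tokens)


print(normalize_tokens("Ordenador nuevo grande", spanish_synonyms, spanish_descriptors))
# prints "computadora"
```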


### 4. Word Embeddings for Different Languages

For the metaclass identification process, you'll need word embeddings in the target language. You can find pre-trained word embeddings for many languages on the [FastText website](https://fasttext.cc/docs/en/crawl-vectors.html).

```python
# Example of loading Spanish word embeddings (you would need to download these first).
# KeyedVectors comes from gensim; the file path below is a placeholder.
# from gensim.models import KeyedVectors
# spanish_embeddings_file = 'path_to_spanish_embeddings.vec'
# spanish_wordvectors = KeyedVectors.load_word2vec_format(spanish_embeddings_file)

# print("Example of Spanish word vector:")
# print(spanish_wordvectors['gato'])
```

### Next Steps

After adapting these components for your target language:

1. Re-run the category name analysis with the new language settings.
2. Generate new metaclasses based on the cleaned category names in the target language.
3. Use the language-specific word embeddings for further processing and analysis.

Remember to thoroughly test the adapted system with a sample of product data in the target language to ensure it's performing as expected.

## Persist Categories and Configuration

The final step is to persist the prepared data and configuration so that the production system can load them.

The cells below write each data structure to a JSON file, upload the files to the configuration S3 bucket created by the CDK deployment, and save the file paths in a single SSM Parameter Store parameter. Keeping only the paths in Parameter Store means the data files can be regenerated and replaced without changing the production code.

In [None]:
df_words = pd.DataFrame(unique_leaves.keys(), columns=['name'])
df_words.to_json('data/metaclasses.json')
df_words.head(15)

In [None]:
mappings_df.to_json('data/mappings.json')

In [None]:
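def encode_set(obj):
    """JSON `default` hook: serialize sets as lists, reject anything else."""
    if isinstance(obj, set):
        return list(obj)
    # Without this, unsupported objects would be silently written as null.
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")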

In [None]:
with open("data/word_map.json", "w") as f:
    json.dump(word_map, f, default=encode_set)
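
`json.dump` calls the `default=` hook for any object it cannot serialize on its own, which is how the sets in the word map end up as JSON arrays. The sketch below illustrates the round trip on made-up data. `encode_set_sorted` is a hypothetical variant that sorts the elements so the output is stable across runs, and the sample assumes the word map has the shape "word to set of category IDs", which this section does not confirm.

```python
import json


def encode_set_sorted(obj):
    """Hypothetical variant of the hook: sorted lists give stable output across runs."""
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Made-up sample in the assumed shape: word -> set of category IDs.
sample_word_map = {"libro": {"60010200", "60010300"}}

encoded = json.dumps(sample_word_map, default=encode_set_sorted)
print(encoded)

# Sets come back as lists, so rebuild them after loading if set semantics matter.
decoded = {word: set(ids) for word, ids in json.loads(encoded).items()}
print(decoded == sample_word_map)  # prints True
```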

In [None]:
with open("data/category_vectors.json", "w") as f:
    json.dump(category_vectors, f)

In [None]:
with open('data/marcas.json', 'w') as f:
    json.dump(brands, f)

In [None]:
with open("data/singularize.json", "w") as f:
    json.dump(singularize_exceptions, f)

In [None]:
with open("data/descriptors.json", "w") as f:
    json.dump(descriptors, f)

In [None]:
with open("data/synonyms.json", "w") as f:
    json.dump(synonyms, f)

In [None]:
with open("data/always.json", "w") as f:
    json.dump(always_categories, f)

In [None]:
config_bucket = ssm.get_parameter(Name=f"{ssm_prefix}ConfigurationBucket")['Parameter']['Value']

In [None]:
!aws s3 cp data/labelcats.json s3://{config_bucket}/data/
!aws s3 cp data/metaclasses.json s3://{config_bucket}/data/
!aws s3 cp data/mappings.json s3://{config_bucket}/data/
!aws s3 cp data/word_map.json s3://{config_bucket}/data/
!aws s3 cp data/marcas.json s3://{config_bucket}/data/
!aws s3 cp data/singularize.json s3://{config_bucket}/data/
!aws s3 cp data/synonyms.json s3://{config_bucket}/data/
!aws s3 cp data/descriptors.json s3://{config_bucket}/data/
!aws s3 cp data/always.json s3://{config_bucket}/data/
!aws s3 cp data/category_vectors.json s3://{config_bucket}/data/
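
If you prefer to stay in Python rather than shell out to the AWS CLI, the same uploads can be sketched with the boto3 S3 client's `upload_file` method. `upload_config_files` is a hypothetical helper; it assumes the files live in a local `data` directory and are stored under the `data/` key prefix, as in the commands above. It checks that every file exists before uploading anything, so a missing file does not leave the bucket partially updated.

```python
from pathlib import Path


def upload_config_files(s3_client, bucket, filenames, local_dir="data", prefix="data/"):
    """Upload each local file to s3://<bucket>/<prefix><filename>; fail early if one is missing."""
    missing = [name for name in filenames if not (Path(local_dir) / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing local files: {missing}")
    uploaded = []
    for name in filenames:
        key = f"{prefix}{name}"
        s3_client.upload_file(str(Path(local_dir) / name), bucket, key)
        uploaded.append(key)
    return uploaded


# Usage in this notebook (requires AWS credentials, so left commented out):
# import boto3
# upload_config_files(boto3.client("s3"), config_bucket, ["labelcats.json", "metaclasses.json"])
```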


In [None]:
ssm.put_parameter(
    Name=f"{ssm_prefix}CategorizationConfig",
    Value=json.dumps({
        "language": language,
        "wordEmbeddingsTable": vector_table_name,
        "categoryTree": "data/labelcats.json",
        "metaclasses": "data/metaclasses.json",
        "mappings": "data/mappings.json",
        "categoryVectors": "data/category_vectors.json",
        "wordMap": "data/word_map.json",
        "brands": "data/marcas.json",
        "singularize": "data/singularize.json",
        "synonyms": "data/synonyms.json",
        "descriptors": "data/descriptors.json",
        "alwaysCategories": "data/always.json",
    }),
    Type="String",
    Overwrite=True,
)
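
After writing the parameter, it can be worth reading it back and checking that the keys you rely on are present. The sketch below uses a hypothetical helper, `check_config`, and runs it on a made-up value so it works without AWS access; the commented lines show how it might be called against the real parameter.

```python
import json


def check_config(raw_value, required_keys):
    """Parse the stored JSON and return the required keys that are missing or empty."""
    config = json.loads(raw_value)
    return [key for key in required_keys if not config.get(key)]


# Usage in this notebook (requires AWS access, so left commented out):
# raw = ssm.get_parameter(Name=f"{ssm_prefix}CategorizationConfig")["Parameter"]["Value"]
# print(check_config(raw, ["language", "wordEmbeddingsTable", "categoryTree", "metaclasses"]))

# Made-up value for illustration only.
example = json.dumps({"language": "spanish", "categoryTree": "data/labelcats.json"})
print(check_config(example, ["language", "categoryTree", "metaclasses"]))
# prints ['metaclasses']
```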

## Conclusion

This notebook has guided you through the process of generating metaclasses from a category tree and adapting the process for different languages. Key steps included:

1. Loading and analyzing the category tree
2. Preparing and refining text cleaning configurations
3. Generating metaclasses from cleaned category names
4. Outlining the process for adapting to different languages

By following these steps and iterating as needed, you can create a robust system for categorizing products in your language and market.