# Category Tree Preparation and Attribute Schema Generation

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.<br>
SPDX-License-Identifier: MIT-0

## Introduction

This notebook demonstrates the process of preparing a category tree and generating attribute schemas for product categorization and attribute extraction. While we use the GS1 Global Product Classification (GPC) as an example, this process can be adapted for your own category tree.

Important Note for Customers:
This accelerator uses the GS1 GPC as an example. When adapting this process for your tree, keep in mind:

1. Navigation Trees vs. Taxonomies: Many retailers have navigation trees, which often include duplicate categories to make products findable in multiple locations. However, for automatic categorization, it's better to use a taxonomy where each product has exactly one correct category.

2. Converting Navigation Trees to Taxonomies: If you're starting with a navigation tree, you'll need to convert it to a taxonomy. This typically involves:
   - Identifying and removing duplicate categories
   - Identifying and removing attribute categories
   - Creating mappings from your taxonomy and attributes back to your navigation tree to preserve findability

3. Category Descriptions: Clear, concise descriptions for each category are very helpful for accurate categorization. If your tree doesn't include these, it's worth the effort to create them.

4. Attribute Schemas: You'll need to define the relevant attributes for each category. These should be specific enough to capture important product details but general enough to apply to all products in the category.
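
As a minimal sketch of point 2, assume a hypothetical navigation tree given as `(parent_path, category_name)` pairs. Duplicate categories can be collapsed into one canonical entry while every navigation path that pointed to them is recorded. The sample data and the choice of the first-seen path as canonical are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical navigation tree: the same category appears under several paths.
nav_paths = [
    ("Electronics/Accessories", "Computer Mice"),
    ("Office/Computer Peripherals", "Computer Mice"),
    ("Electronics/Audio", "Headphones"),
]

def to_taxonomy(paths):
    """Collapse duplicate categories and keep a mapping back to navigation paths."""
    taxonomy = {}                    # category name -> canonical path (first seen)
    nav_mapping = defaultdict(list)  # category name -> all navigation paths
    for parent, name in paths:
        full = f"{parent}/{name}"
        taxonomy.setdefault(name, full)
        nav_mapping[name].append(full)
    return taxonomy, dict(nav_mapping)

taxonomy, nav_mapping = to_taxonomy(nav_paths)
print(taxonomy["Computer Mice"])
print(nav_mapping["Computer Mice"])
```

The `nav_mapping` dictionary is what preserves findability: a product categorized once in the taxonomy can still be listed under every navigation path.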

The following process demonstrates how to structure and optimize your category tree and attribute schemas, regardless of their source. Adapt each step as needed for your specific category structure.


## Setup and Data Loading

In [None]:
import csv
import json
import random

import boto3
from botocore.config import Config
from jinja2 import Template
from pympler import asizeof


Download the GS1 GPC in JSON format from https://gpc-browser.gs1.org/ and store it in the `data/` folder. Update the file name in the cell below to match the file you downloaded.

In [None]:
with open('data/GPC as of November 2024 v20241202 GB.json', 'r') as fp:
    gpc = json.load(fp)

Let's inspect the general format of the GPC data.

In [None]:
print(gpc['Schema'][0])
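
The cells in this notebook rely on a handful of keys on each node: `Level`, `Code`, `Title`, `Definition`, `Active`, and `Childs`. As a quick sanity check, the sketch below counts nodes per level on a small hand-made tree with that assumed shape. The sample values are hypothetical and not taken from the GPC file; the same function could be applied to each entry of `gpc['Schema']`.

```python
from collections import Counter

def count_by_level(node, counts=None):
    """Recursively count nodes per 'Level' in a GPC-shaped tree."""
    if counts is None:
        counts = Counter()
    counts[node["Level"]] += 1
    # 'Childs' may be missing, empty, or None; treat all three as no children.
    for child in node.get("Childs") or []:
        count_by_level(child, counts)
    return counts

sample = {
    "Level": 1, "Code": 1, "Title": "Segment", "Childs": [
        {"Level": 2, "Code": 2, "Title": "Family A", "Childs": []},
        {"Level": 2, "Code": 3, "Title": "Family B", "Childs": None},
    ],
}
print(dict(count_by_level(sample)))  # {1: 1, 2: 2}
```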

## Category Tree Generation

In [None]:
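def iterate_category_tree(tree, path=None):
    # Avoid a mutable default argument; the same list is shared across recursive calls
    if path is None:
        path = []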
    if tree['Level'] <= 4:
        path.append({'id': str(tree['Code']), 'name': tree['Title'], })

        childs = []
        for child in tree.get('Childs', []):
            if child['Level'] <= 4:
                childs.append({
                    'id': str(child['Code']),
                    'name': child['Title'],
                })
                yield from iterate_category_tree(child, path)

        yield {
            'id': str(tree['Code']),
            'name': tree['Title'],
            'full_path': path.copy(),
            'description': tree['Definition'],
            'childs': childs,
        }
        path.pop()
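
To see what the generator produces, here is a self-contained run on a toy tree. The node values are hypothetical; the function body mirrors the cell above in compact form. Note that children are yielded before their parent and that nodes below level 4 are skipped.

```python
def iterate_category_tree(tree, path=None):
    """Yield one record per node down to level 4, children before parents."""
    if path is None:
        path = []
    if tree["Level"] > 4:
        return
    path.append({"id": str(tree["Code"]), "name": tree["Title"]})
    childs = []
    for child in tree.get("Childs", []):
        if child["Level"] <= 4:
            childs.append({"id": str(child["Code"]), "name": child["Title"]})
            yield from iterate_category_tree(child, path)
    yield {
        "id": str(tree["Code"]),
        "name": tree["Title"],
        "full_path": path.copy(),  # snapshot, because path keeps changing
        "description": tree["Definition"],
        "childs": childs,
    }
    path.pop()

toy = {
    "Level": 3, "Code": 100, "Title": "Class", "Definition": "A class", "Childs": [
        {"Level": 4, "Code": 101, "Title": "Brick", "Definition": "A brick", "Childs": [
            {"Level": 5, "Code": 102, "Title": "Attribute", "Definition": "", "Childs": []},
        ]},
    ],
}
records = list(iterate_category_tree(toy))
print([r["id"] for r in records])                    # ['101', '100']
print([p["name"] for p in records[0]["full_path"]])  # ['Class', 'Brick']
```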

In [None]:
cattree = {}
cattree['root'] = {
    'id': 'root',
    'name': 'root',
    'full_path': [],
    'description': 'Top level',
    'childs': [],
}
for schema in gpc['Schema']:
    cattree['root']['childs'].append({
        'id': str(schema['Code']),
        'name': schema['Title'],
    })
    for cat in iterate_category_tree(schema):
        cattree[cat['id']] = cat

In [None]:
len(cattree)

In [None]:
cattree['10001674']

In [None]:
with open('data/labelcats.json', 'w') as f:
    json.dump(cattree, f)
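
Because every record stores its `full_path` and `childs`, downstream code can rebuild a breadcrumb or check the tree for dangling references directly from the flat dictionary. A minimal sketch on a hypothetical two-node tree with the same record shape follows; the helper names `breadcrumb` and `missing_children` are illustrative and not part of the accelerator.

```python
cattree = {
    "root": {"id": "root", "name": "root", "full_path": [],
             "description": "Top level", "childs": [{"id": "1", "name": "Toys"}]},
    "1": {"id": "1", "name": "Toys", "full_path": [{"id": "1", "name": "Toys"}],
          "description": "Toys and games", "childs": []},
}

def breadcrumb(tree, cat_id, sep=" > "):
    """Join the names along a category's full_path."""
    return sep.join(step["name"] for step in tree[cat_id]["full_path"])

def missing_children(tree):
    """Return child ids that are referenced but have no record of their own."""
    return sorted(
        child["id"]
        for record in tree.values()
        for child in record["childs"]
        if child["id"] not in tree
    )

print(breadcrumb(cattree, "1"))   # Toys
print(missing_children(cattree))  # []
```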

## Attribute Schema Generation
To generate the attribute schemas, let's first build a dictionary that maps each level-4 (L4) category code to its attribute definitions, which live at level 5 and deeper.

In [None]:
def build_dict(data, full_path=""):
    """Map each level-4 category code to its title, parent path, and attribute subtree."""
    result = {}

    if data['Level'] == 4:
        # Level-4 nodes are the categories; their children hold the attribute definitions
        result[data['Code']] = {
            'category': data['Title'],
            'subcategory': full_path,
            'attributes_schema': data.get('Childs') or None
        }
    elif data['Level'] < 4:
        # Recursively process each child, extending the path with the current title
        for child in data.get('Childs') or []:
            result.update(build_dict(child, full_path=f"{full_path}/{data['Title']}"))

    return result
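
As an illustration of the resulting shape, the sketch below runs a compact version of the same logic on a toy tree. The titles and codes are hypothetical. Note that `subcategory` is a slash-separated path of ancestor titles and starts with a leading `/`.

```python
def build_dict(data, full_path=""):
    """Map each level-4 code to its title, parent path, and attribute subtree."""
    result = {}
    if data["Level"] == 4:
        result[data["Code"]] = {
            "category": data["Title"],
            "subcategory": full_path,
            "attributes_schema": data.get("Childs") or None,
        }
    elif data["Level"] < 4:
        for child in data.get("Childs") or []:
            result.update(build_dict(child, full_path=f"{full_path}/{data['Title']}"))
    return result

toy = {
    "Level": 3, "Code": 1, "Title": "Computer Accessories", "Childs": [
        {"Level": 4, "Code": 2, "Title": "Pointing Devices", "Childs": [
            {"Level": 5, "Code": 3, "Title": "Connection Type", "Childs": []},
        ]},
    ],
}
result = build_dict(toy)
print(result[2]["subcategory"])                    # /Computer Accessories
print(result[2]["attributes_schema"][0]["Title"])  # Connection Type
```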


Now let's load all schemas in the GPC file.

In [None]:
attrs_dict = {}

for schema in gpc['Schema']:
    attr_dict = build_dict(schema)
    attrs_dict.update(attr_dict)

In [None]:
# let's inspect one of the categories
code, category_schema = random.choice(list(attrs_dict.items()))

print(f"Category code: {code}")
print(f"Category schema: {category_schema}")

In [None]:
# How many categories do we have?
len(attrs_dict.keys())

In [None]:
# Let's inspect the Computer Pointing Devices category, a.k.a. mice!
attrs_dict[10001151]

In [None]:
asizeof.asizeof(attrs_dict) / 1_000_000  # convert bytes to decimal megabytes (MB)

### Optimizing the data structure for semantic content

The memory footprint of the data structure (~93 MB) is a slight problem, especially if we want to load it into memory in our Lambda function. It may also carry information that is not useful for attribute extraction. As an example, here is an excerpt from a JSON schema used in a previous prototype; it is very lean, and every property carries significant meaning for attribute extraction.

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "Estado": {
      "type": "string",
      "enum": ["Novo", "Usado"],
      "description": "Indica a condição do produto, como novo, usado ou recondicionado."
    },
    "Marca": {
      "type": "string",
      "enum": ["Samsung", "Electrolux", "Brastemp", "Apple", "Xiaomi", "Oster"],
      "description": "Nome do fabricante ou da marca do produto."
    },
    "Voltagem": {
      "type": "string",
      "enum": ["110 V", "220 V", "Bivolt"],
      "description": "A voltagem elétrica necessária para o funcionamento do ar condicionado (em volts)."
    },
    ...
```




That said, let's drop the `Level`, `Code`, and `Active` keys along with all None-valued properties, and keep only entries whose `Active` flag is set.

In [None]:
def optimize_dict(target):
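    if isinstance(target, list):
        # Inactive elements come back as None; filter them out of the list
        optimized = (optimize_dict(elm) for elm in target)
        return [elm for elm in optimized if elm is not None]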

    if target is None:
        return

    if not target.get("Active", False):
        return

    new_dict = {k: v for k, v in target.items() if v is not None and k not in ["Code", "Level", "Childs", "Active"]}
    if "Childs" in target and target["Childs"] is not None:
        new_dict["Childs"] = optimize_dict(target["Childs"])

    return new_dict
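
The sketch below applies the same pruning rules to a hand-made node and drops inactive entries from child lists. The function name `prune` and the sample values are hypothetical.

```python
def prune(target):
    """Drop inactive nodes, None values, and the Code/Level/Active keys."""
    if isinstance(target, list):
        pruned = (prune(elm) for elm in target)
        return [elm for elm in pruned if elm is not None]
    if target is None or not target.get("Active", False):
        return None
    result = {k: v for k, v in target.items()
              if v is not None and k not in ("Code", "Level", "Childs", "Active")}
    if target.get("Childs") is not None:
        result["Childs"] = prune(target["Childs"])
    return result

node = {
    "Level": 5, "Code": 1, "Title": "Colour", "Definition": None, "Active": True,
    "Childs": [
        {"Level": 6, "Code": 2, "Title": "Black", "Active": True, "Childs": None},
        {"Level": 6, "Code": 3, "Title": "Beige", "Active": False, "Childs": None},
    ],
}
print(prune(node))  # {'Title': 'Colour', 'Childs': [{'Title': 'Black'}]}
```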

In [None]:
# Keep only the fields needed for attribute extraction, keyed by the category code as a string
linted_attrs_dict = {}
for code, category_schema in attrs_dict.items():
    linted_attrs_dict[str(code)] = {
        "category_name": category_schema["category"],
        "subcategory_name": category_schema["subcategory"],
        "attributes_schema": optimize_dict(category_schema["attributes_schema"])
    }

In [None]:
# Let's sample one of the items again

random.choice(list(linted_attrs_dict.items()))

In [None]:
asizeof.asizeof(linted_attrs_dict) / 1_000_000  # convert bytes to decimal megabytes (MB)

As a side effect, we shaved off ~20 MB, but the structure is still somewhat large. Let's serialize it to JSON and check the resulting size.

In [None]:
with open('data/linted_attrs.json', 'w') as f:
    json.dump(linted_attrs_dict, f)
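
One way to measure the serialized size is `os.path.getsize` on the written file. The sketch below also compares the default JSON separators with compact ones, which can shave a little more off the file. The sample dictionary and the temporary file are stand-ins, not the real schema.

```python
import json
import os
import tempfile

# Small stand-in for linted_attrs_dict
sample = {"10001151": {"category_name": "Example", "attributes_schema": [{"Title": "Colour"}]}}

default_json = json.dumps(sample)
compact_json = json.dumps(sample, separators=(",", ":"))  # no spaces after , and :

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "linted_attrs.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write(compact_json)
    size_on_disk = os.path.getsize(path)

print(f"default: {len(default_json)} bytes, compact: {size_on_disk} bytes")
```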

In [None]:
ssm_prefix = '/ProductCategorization/'
ssm = boto3.client('ssm')

In [None]:
config_bucket = ssm.get_parameter(Name=f"{ssm_prefix}ConfigurationBucket")['Parameter']['Value']

In [None]:
config_bucket

In [None]:
!aws s3 cp data/linted_attrs.json s3://{config_bucket}/data/attributes_schema.json

The serialized schema is ~20 MB. This is manageable for the Lambda function, so we will move forward as is.
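
In a Lambda function, one common pattern is to load a file of this size once per execution environment and reuse it across invocations. The sketch below assumes the handler has a boto3 S3 client available and uses the real `get_object` call; the function name `load_attribute_schema`, the module-level cache, and the `FakeS3` stand-in (used only so the example runs offline) are illustrative and not part of the accelerator.

```python
import io
import json

_SCHEMA_CACHE = {}  # module-level, so it survives between warm invocations

def load_attribute_schema(s3_client, bucket, key="data/attributes_schema.json"):
    """Fetch and parse the schema on first use, then serve it from the cache."""
    cache_key = (bucket, key)
    if cache_key not in _SCHEMA_CACHE:
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        _SCHEMA_CACHE[cache_key] = json.loads(body)
    return _SCHEMA_CACHE[cache_key]

class FakeS3:
    """Offline stand-in for boto3.client('s3'); counts how often it is called."""
    def __init__(self):
        self.calls = 0

    def get_object(self, Bucket, Key):
        self.calls += 1
        return {"Body": io.BytesIO(b'{"10001151": {"category_name": "Example"}}')}

client = FakeS3()
first = load_attribute_schema(client, "example-bucket")
second = load_attribute_schema(client, "example-bucket")
print(first is second, client.calls)  # True 1
```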

## Experimenting with attribute extraction (Optional)

Let's load a couple of products and try to extract attributes from their title/description/image data.

Download the amazon.csv dataset from https://www.kaggle.com/datasets/karkavelrajaj/amazon-sales-dataset and put it in the `data/` directory.

In [None]:
# newline='' lets the csv module handle line endings inside quoted fields
with open('data/amazon.csv', mode='r', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    data = list(reader)

# let's get this one product we know to be a computer mouse
product = next(filter(lambda p: p['product_id'] == "B0819HZPXL", data))

In [None]:
# Inspect product B0819HZPXL
product

Let's experiment with the same prompt used in our recent attribute extraction prototype.

In [None]:
config = Config(
    connect_timeout=120,
    read_timeout=120,
    retries={
        "max_attempts": 10,
        "mode": "adaptive",
    })

bedrock_client = boto3.session.Session().client('bedrock-runtime', config=config)

In [None]:
prompt_template = """You are an AI assistant tasked with extracting product attributes from a given title and
description. You will be provided with a category, subcategory, and a JSON schema with attributes.
Your job is to identify which of these attributes are present in the title and
description, and what their values are.

Here is the information about the product category and attributes:

<category>
{{category}}
</category>

<subcategory>
{{subcategory}}
</subcategory>

<attributes_schema>
{{attributes_schema}}
</attributes_schema>

Now, here is the product information you need to analyze:

<title>
{{product_title}}
</title>

<description>
{{product_description}}
</description>

Your task is to extract the actual attributes and their values from the title and description.
Follow these steps:

1. Carefully read through the title and description.
2. For each attribute listed in the schema, determine if it is mentioned in the title or description.
3. If an attribute is present, identify its specific value based on the information provided.
4. If an attribute is not mentioned or its value cannot be determined, set its value to null.
5. For colors, approximate to the closest one.

Before providing your final answer, use a <scratchpad> to think through your extraction process.
List out each attribute, whether you found it, and what value you assigned to it.

After your analysis, provide your answer as a JSON object following the schema. After the scratchpad, only output valid json.

Important notes:
- Include all attributes listed in the schema, even if their value is null.
- Be as specific and accurate as possible when extracting values.
- Don't assume anything.

Remember, your goal is to extract as much accurate information as possible from the given title and
description, based on the provided category, subcategory, and possible attributes in the schema.
"""

In [None]:
# XML variant of the prompt; running this cell overrides the JSON variant defined above
prompt_template = """You are an AI assistant tasked with extracting product attributes from a given title and
description. You will be provided with a category, subcategory, and an XML schema for attributes.
Your job is to identify which of these attributes are present in the title and
description, and what their values are.

Here is the information about the product category and attributes:

<category>
{{category}}
</category>

<subcategory>
{{subcategory}}
</subcategory>

<attributes_schema>
{{attributes_schema}}
</attributes_schema>

Now, here is the product information you need to analyze:

<title>
{{product_title}}
</title>

<description>
{{product_description}}
</description>

Your task is to extract the actual attributes and their values from the title and description.
Follow these steps:

1. Carefully read through the title and description.
2. For each attribute listed in the schema, determine if it is mentioned in the title or description.
3. If an attribute is present, identify its specific value based on the information provided.
4. If an attribute is not mentioned or its value cannot be determined, set its value to null.
5. For colors, approximate to the closest one.

Before providing your final answer, use a <scratchpad></scratchpad> to think through your extraction process.
List out each attribute, whether you found it, and what value you assigned to it.

After your analysis, provide your answer as an XML object with the following format:

<attributes>
  <attribute>
    <name>attribute name</name>
    <value>value of attribute</value>
  </attribute>
  <attribute>
    <name>other attribute name</name>
    <value>value of other attribute</value>
  </attribute>
</attributes>

After the scratchpad, only output valid XML.

Important notes:
- Include all attributes listed in the schema, even if their value is null.
- Be as specific and accurate as possible when extracting values.
- Don't assume anything.
- Wrap your entire answer in <response></response> XML tags.

Remember, your goal is to extract as much accurate information as possible from the given title and
description, based on the provided category, subcategory, and possible attributes in the schema."""

In [None]:
# Helper function to convert JSON to XML
def json_to_xml(json_obj, line_padding=""):
    """Recursively convert JSON object to XML string."""
    result_list = []

    if isinstance(json_obj, dict):
        for tag_name, sub_obj in json_obj.items():
            result_list.append(f"{line_padding}<{tag_name}>")
            result_list.append(json_to_xml(sub_obj, line_padding + "  "))
            result_list.append(f"{line_padding}</{tag_name}>")
    elif isinstance(json_obj, list):
        for sub_obj in json_obj:
            result_list.append(json_to_xml(sub_obj, line_padding))
    else:
        result_list.append(f"{line_padding}{json_obj}")

    return "\n".join(result_list)

In [None]:
category_schema = linted_attrs_dict["10001151"]
category_schema

In [None]:
prompt = Template(prompt_template).render({
    "category": category_schema['category_name'],
    "subcategory": category_schema['subcategory_name'],
    "attributes_schema": json_to_xml(category_schema['attributes_schema']),
    "product_title": product['product_name'],
    "product_description": product['about_product'],
})



In [None]:
print(prompt)

In [None]:
message = {
    'role': 'user',
    'content': [
        {'text': prompt}
    ]
}

response = bedrock_client.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    # system=[{"text": system_prompt}],
    inferenceConfig={
        "temperature": 0,
    },
    messages=[message])

In [None]:
print(response["output"]["message"]["content"][0]["text"])

In [None]:
response

In [None]:
product["product_name"]

In [None]:
product["about_product"]

In [None]:
another_product = next(filter(lambda p: p["product_id"] == "B08VF8V79P", data))
another_product

In [None]:
prompt = Template(prompt_template).render({
    "category": "Battery Chargers",
    "subcategory": "Chargers",
    "attributes_schema": json_to_xml(linted_attrs_dict["10000548"]),
    "product_title": another_product['product_name'],
    "product_description": another_product['about_product'],
})

In [None]:
print(prompt)

In [None]:
message = {
    'role': 'user',
    'content': [
        {'text': prompt}
    ]
}

response = bedrock_client.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    # system=[{"text": system_prompt}],
    inferenceConfig={
        "temperature": 0,
    },
    messages=[message])

print(response["output"]["message"]["content"][0]["text"])

## Conclusion

This notebook has processed the GPC data to create optimized data structures for product categorization and attribute extraction. The main outputs are:

1. A cleaned and structured category tree (labelcats.json)
2. An optimized attribute schema (linted_attrs.json)

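Both files are saved locally in the `data/` folder, and the attribute schema is also uploaded to the configuration bucket in S3 as `data/attributes_schema.json` for use in the system. The notebook also includes an optional experiment demonstrating attribute extraction using the prepared data structures and Amazon Bedrock.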
