# Tree Creation From Classification Labels

## Method

Trees are created from the classification labels alone. The data to be classified is not used, so this step does not require a labeled dataset. Labeled data is only needed later, for evaluation.

'Categories', 'Conditions', and 'Labels' are used interchangeably here. Categories are created from the labels' texts and new categories will be made in the same style as the original dataset.

### Algorithms

#### Tree Formation

* Create the tree using any hierarchical labels in the output labels if they exist. Otherwise, create a tree with a single root node and one leaf for every classification label.
* Walk down the tree starting at the root and stop when the current node has more children than desired. Then:
    * Place the category texts of the child nodes in a vector store and generate embeddings for each of them.
    * Sample from the vector store to choose categories that are well spread out. This is done by performing KMeans clustering and choosing cluster representatives.
    * Provide the LLM with the representative categories and ask it to create new categories that divide those further.
    * Classify the old child categories into the newly created categories.
    * Place old child categories that cannot be classified into the new categories as children of the current node.
    * Repeat this process until the number of categories under the current node is less than or equal to the desired number.
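
The sampling step in the list above can be sketched as follows. This is a minimal, dependency-free illustration and not the code in `model.py`: it runs a small k-means over toy 2-D vectors that stand in for real embeddings, then returns the index of the vector closest to each centroid. The function names and the farthest-first initialization (chosen here only to keep the example deterministic) are assumptions.

```python
import math


def _dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def pick_representatives(vectors, k, iterations=20):
    """Return indices of up to k spread-out vectors (hypothetical helper)."""
    if k >= len(vectors):
        return list(range(len(vectors)))

    # Farthest-first initialization: deterministic and well spread.
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(
            max(vectors, key=lambda v: min(_dist(v, c) for c in centers))
        )

    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest center.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: _dist(v, centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            _mean(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers

    # One representative per centroid: the closest actual vector.
    return sorted(
        min(range(len(vectors)), key=lambda i: _dist(vectors[i], c))
        for c in centers
    )


# Toy "embeddings": three well-separated groups of three points each.
toy_embeddings = [
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.0],
    [5.0, 5.0], [5.1, 5.2], [4.9, 5.1],
    [10.0, 0.0], [10.2, 0.1], [9.9, 0.2],
]
representatives = pick_representatives(toy_embeddings, k=3)
```

With real embeddings, the texts at the returned indices would be the sample shown to the LLM. Picking one item per cluster keeps the sample from being dominated by a single dense region of similar categories.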

#### Classification

* Start at the root node and work down the tree.
* Prompt the LLM to choose the correct next category from among the children of the current node if a valid one exists.
* If the LLM responds that none of the child categories are suitable for the item to be classified, mask/hide the current node and treat the parent of the current node as the next node.
* If the failure above occurs while the current node is the root, the item will be skipped and receive no classification.
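
The walk described above, including the back-up-and-mask behaviour, might look roughly like this. It is a sketch under stated assumptions: `Node` here is a minimal stand-in and not the real `model.Node`, and `toy_choose` is a hand-written rule that replaces the LLM call so the example can run offline.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)
class Node:
    """Minimal stand-in for a tree node (hypothetical)."""
    condition: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def classify(root: Node, item: str,
             choose: Callable[[str, List[Node]], Optional[Node]]) -> Optional[Node]:
    """Walk down the tree; back up and mask a node when no child fits."""
    masked = set()  # nodes that already led to a dead end
    current = root
    while current.children:
        options = [c for c in current.children if c not in masked]
        choice = choose(item, options) if options else None
        if choice is not None:
            current = choice          # descend into the chosen category
        elif current is root:
            return None               # failure at the root: item is skipped
        else:
            masked.add(current)       # hide the dead end from its parent
            current = current.parent  # and retry one level up
    return current


def toy_choose(item, options):
    """Stand-in for the LLM: pick the first option on the item's allow-list."""
    allowed = {"desk lamp": ["Electronics", "Home", "Lighting"]}.get(item, [])
    for name in allowed:
        for option in options:
            if option.condition == name:
                return option
    return None


root = Node("All Products")
electronics = root.add(Node("Electronics"))
electronics.add(Node("Computers"))
electronics.add(Node("Cameras"))
home = root.add(Node("Home"))
home.add(Node("Lighting"))
home.add(Node("Furniture"))

result = classify(root, "desk lamp", toy_choose)
skipped = classify(root, "unknown item", toy_choose)
```

In this toy run the chooser first picks "Electronics", finds no suitable child there, backs up to the root with "Electronics" masked, and then reaches "Lighting" through "Home". Each back-up masks one more node, so the walk always terminates.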

## Code

### Reading The Dataset

The dataset used here is [UNSPSC Codes](https://data.ok.gov/dataset/unspsc-codes)

Detect the correct charset to use when reading the categories dataset. Then read the dataset as a DataFrame.

In [1]:
import chardet

with open("amazon_categories.csv", 'rb') as f:
    result = chardet.detect(f.read())
    print(result)

{'encoding': 'utf-8', 'confidence': 0.87625, 'language': ''}


In [2]:
import pandas as pd

df = pd.read_csv("amazon_categories.csv", encoding=result['encoding'])

df.head()

Unnamed: 0,id,category_name
0,1,Beading & Jewelry Making
1,2,Fabric Decorating
2,3,Knitting & Crochet Supplies
3,4,Printmaking Supplies
4,5,Scrapbooking & Stamping Supplies


### Dataset Exploration And Evaluation

Test the existing hierarchical structure to determine if it can be used immediately for classification.

From the results above, it can be seen that with the 4 existing hierarchical levels the number of children that would be created for each node would be up to ~100. This is more than our objective of 25 children max.

We will need to divide up these children to make classification using an LLM more reliable.

In [3]:
from model import (
    Node,
    ProgressBars,
    ask_model_category,
    check_tree,
    clean_tree,
    create_tree_from_categories,
    create_vector_store,
    display_lazy_tree,
    display_tree,
    format_node,
    optimize_tree,
)

Create the tree using the breadcrumbs present in the dataset so we can retain the existing hierarchy as a head start.

In [4]:
root = create_tree_from_categories(df, condition_col="category_name", extra_cols=["category_name"])

Validate the new tree representation by performing a similar check to above and comparing. Things look consistent here.

In [5]:
check_tree(root)

sub_branches: 0, avg: 0.0, max: 0

leaves at this level: 248
total leaves: 248


Explore the new tree visually if desired. High numbers of child nodes can be seen on many of the nodes.

In [6]:
node = root

print(format_node(node))

display_lazy_tree(node, max_initial_depth=2)

All Products



Create Ollama model instances to process LLM requests locally. (Change these to your preferred LangChain models.)

In [7]:
from langchain_ollama import ChatOllama, OllamaEmbeddings

cats = [n.condition for n in node.children]

embeddings = OllamaEmbeddings(
    model="mxbai-embed-large",
)

vectorstore = create_vector_store(texts=cats, embeddings=embeddings)


def create_llm():
    # Returns a new ChatOllama instance on every call.
    return ChatOllama(
        model="qwen2.5:14b",
        # temperature=0,
    )

Test out a single iteration of creating new categories at the root level.

In [8]:
cats, tokens = ask_model_category(node=root, embeddings=embeddings, create_llm=create_llm)
cats

PROMPT: 
	Human: Your job is to create a set categories that will serve as nodes in a tree.
    The new nodes will lie between the parent category below and the child categories listed.
    
    Previous categories that describe these items:
    All Products
    
    Direct parent category that should be divided:
    All Products
    
    Sample of items that should fit into the new categories:
    Luggage
 Boys' Clothing
 Home Décor Products
 Video Game Consoles & Accessories
 Arts & Crafts Supplies
 Girls' Accessories
 Personal Care Products
 Industrial Power & Hand Tools
 Car Electronics & Accessories
 Toys & Games
 Smart Home: New Smart Devices
 Kids' Play Trucks
 Computers
 Pumps & Plumbing Equipment
 Lighting & Ceiling Fans
    
    

    The items in the full dataset cover the scope of: Products across all industries

    Provide two or more new categories to lie between the parent and children.
    These categories should not overlap and should serve to divide the existing chil

CategoryAnswer(reasoning='To categorize the given items effectively while ensuring no overlap between categories, we need to identify common attributes or characteristics among the items that can be grouped logically. The following new categories aim at creating a more refined and organized structure that enhances search performance by dividing the existing children into smaller portions.', categories=['Travel & Outdoor', 'Home & Living', 'Electronics & Gaming', 'Arts & Hobbies', 'Personal Care & Beauty', 'Industrial Tools & Equipment', 'Vehicle Accessories', 'Toys & Educational Play'])

Token counts are also returned by every function that makes LLM calls.

In [9]:
tokens

TokenCounts(prompt=337, completion=130, total=467)
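
When many calls are made, these per-call counts can be summed to estimate the cost of a full run. The class below is a hypothetical re-creation that mirrors only the three fields printed above; the real `TokenCounts` type lives in `model.py` and may differ. The second set of figures is invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenCounts:
    """Hypothetical counter with the same fields as the printed output."""
    prompt: int = 0
    completion: int = 0
    total: int = 0

    def __add__(self, other: "TokenCounts") -> "TokenCounts":
        # Field-wise sum, returning a new immutable instance.
        return TokenCounts(
            prompt=self.prompt + other.prompt,
            completion=self.completion + other.completion,
            total=self.total + other.total,
        )


calls = [
    TokenCounts(prompt=337, completion=130, total=467),  # the call above
    TokenCounts(prompt=200, completion=50, total=250),   # made-up second call
]

running = TokenCounts()
for counts in calls:
    running = running + counts
```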

The original dataset includes more categories than needed. We will be classifying consumer products from Amazon later so let's remove the extra segments and families that are irrelevant to save time optimizing the tree.

We will remove all service segments and 'Food Beverage and Tobacco Products', which is very large and complex and not particularly relevant. We will also remove the 'live' and 'fresh' families.

The following segments will be removed:
* Farming and Fishing and Forestry and Wildlife Contracting Services
* Mining and oil and gas services
* Building and Facility Construction and Maintenance Services
* Industrial Production and Manufacturing Services
* Industrial Cleaning Services
* Environmental Services
* Transportation and Storage and Mail Services
* Management and Business Professionals and Administrative Services
* Engineering and Research and Technology Based Services
* Editorial and Design and Graphic and Fine Art Services
* Public Utilities and Public Sector Related Services
* Financial and Insurance Services
* Healthcare Services
* Education and Training Services
* Travel and Food and Lodging and Entertainment Services
* Personal and Domestic Services
* National Defense and Public Order and Security and Safety Services
* Politics and Civic Affairs Services
* Food Beverage and Tobacco Products

The following families will be removed:
* Live animals
* Live rose bushes
* Live plants of high species or variety count flowers
* Live plants of low species or variety count flowers
* Live chrysanthemums
* Live carnations
* Live orchids
* Fresh cut rose
* Fresh cut blooms of high species or variety count flowers
* Fresh cut blooms of low species or variety count flowers
* Fresh cut chrysanthemums
* Fresh cut floral bouquets
* Fresh cut carnations
* Fresh cut orchids
* Fresh cut greenery
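
The removal rules above can be expressed as a small row filter. This is a sketch only: it assumes each row carries hypothetical `segment` and `family` fields, and the "Example" values are placeholders, not real entries. The rules rely on two observations about the lists above: every service segment ends in "Services" (in either case), and every listed family starts with "Live " or "Fresh cut ".

```python
SERVICE_SUFFIX = "services"
EXTRA_SEGMENTS = {"Food Beverage and Tobacco Products"}
FAMILY_PREFIXES = ("Live ", "Fresh cut ")


def is_relevant(row):
    """Return True if the row survives the segment and family filters."""
    segment = row["segment"]
    family = row["family"]
    if segment.lower().endswith(SERVICE_SUFFIX):
        return False  # any service segment
    if segment in EXTRA_SEGMENTS:
        return False  # explicitly excluded product segment
    if family.startswith(FAMILY_PREFIXES):
        return False  # 'live' and 'fresh' families
    return True


rows = [
    {"segment": "Healthcare Services", "family": "Example family"},
    {"segment": "Mining and oil and gas services", "family": "Example family"},
    {"segment": "Food Beverage and Tobacco Products", "family": "Example family"},
    {"segment": "Example segment", "family": "Live animals"},
    {"segment": "Example segment", "family": "Fresh cut greenery"},
    {"segment": "Example segment", "family": "Example family"},
]

kept = [row for row in rows if is_relevant(row)]
```

Prefix and suffix rules are broader than the explicit lists, so an exact-match set of names is the safer choice if other segments or families share the same wording.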

Create a fresh tree as shown above and run the optimizer. This creates new categories as necessary and ensures that no node ends up with more than `max_children` children (15 in the call below). This takes a very long time.

In [10]:
root = create_tree_from_categories(df, condition_col="category_name", extra_cols=["category_name"])
progress_bars = ProgressBars(n_leaves=len(df))
display(progress_bars.ui)
optimize_tree(root=root, max_children=15, progress_bars=progress_bars, embeddings=embeddings, create_llm=create_llm)


Working on node 'All Products' with 248 subcategories.

TokenCounts(prompt=114460, completion=50551, total=165011)
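
One pass of the optimizer over a single node can be pictured roughly as follows. This is an illustrative sketch, not the implementation of `optimize_tree`: `Node` is a minimal stand-in, `split_node` is a hypothetical helper, and the hand-written `assignment` mapping replaces the LLM classification of old children into new categories. Dropping new categories that attract no children is a choice made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """Minimal stand-in for a tree node (hypothetical)."""
    condition: str
    children: List["Node"] = field(default_factory=list)


def split_node(node: Node, new_categories: List[str],
               assign: Callable[[str], Optional[str]]) -> None:
    """Insert new_categories between node and its current children."""
    buckets = {name: Node(name) for name in new_categories}
    unplaced = []
    for child in node.children:
        target = assign(child.condition)
        if target in buckets:
            buckets[target].children.append(child)
        else:
            # Children that fit no new category stay under the node.
            unplaced.append(child)
    # Keep only the new categories that actually received children.
    used = [bucket for bucket in buckets.values() if bucket.children]
    node.children = used + unplaced


root = Node("All Products", [
    Node("Luggage"),
    Node("Boys' Clothing"),
    Node("Computers"),
    Node("Toys & Games"),
    Node("Kids' Play Trucks"),
])

# Hand-written stand-in for the LLM's classification step.
assignment = {
    "Luggage": "Travel & Outdoor",
    "Computers": "Electronics & Gaming",
    "Toys & Games": "Toys & Educational Play",
    "Kids' Play Trucks": "Toys & Educational Play",
}

split_node(
    root,
    ["Travel & Outdoor", "Electronics & Gaming", "Toys & Educational Play"],
    assignment.get,
)
top_level = [child.condition for child in root.children]
```

Here five children become four: three new intermediate categories plus "Boys' Clothing", which no new category claimed. Repeating the step on any node that is still over the limit, and then on the new intermediate nodes, gives the overall procedure.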

In [11]:
clean_tree(root=root)

Save the tree for later.

In [12]:
import pickle

with open("tree_v4.pkl", "wb") as f:
    pickle.dump(root, f)

Test the saved tree by loading it again. (Using a copy of the file)

In [13]:
import pickle

with open("tree_v4.pkl", "rb") as f:
    root = pickle.load(f)

Explore the final tree.

In [14]:
check_tree(root)

sub_branches: 26, avg: 2.6, max: 6

leaves at this level: 2
sub_branches: 65, avg: 2.5, max: 12

leaves at this level: 9
sub_branches: 78, avg: 1.2, max: 13

leaves at this level: 52
sub_branches: 91, avg: 1.1666666666666667, max: 13

leaves at this level: 66
sub_branches: 35, avg: 0.38461538461538464, max: 13

leaves at this level: 84
sub_branches: 0, avg: 0.0, max: 0

leaves at this level: 35
total leaves: 248
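
The leaf total matches the 248 leaves of the flat tree, so no category was lost during optimization. A check for that property, and for the child limit, can be sketched independently of `check_tree`. The `Node` class and `tree_stats` function below are assumptions made for illustration, and the walk is iterative so that deep trees do not hit the recursion limit.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Minimal stand-in for a tree node (hypothetical)."""
    condition: str
    children: List["Node"] = field(default_factory=list)


def tree_stats(root: Node) -> Tuple[int, int, int]:
    """Return (total_leaves, max_children, depth) for the tree."""
    total_leaves = 0
    max_children = 0
    depth = 0
    stack = [(root, 0)]
    while stack:
        node, level = stack.pop()
        depth = max(depth, level)
        if not node.children:
            total_leaves += 1
            continue
        max_children = max(max_children, len(node.children))
        stack.extend((child, level + 1) for child in node.children)
    return total_leaves, max_children, depth


example = Node("All Products", [
    Node("A", [Node("A1"), Node("A2"), Node("A3")]),
    Node("B"),
])

leaves, widest, depth = tree_stats(example)
```

On the optimized tree, one would expect the leaf count to equal the number of rows in the dataset and the widest node to stay within the `max_children` value passed to the optimizer.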


In [15]:
root.children[0].condition

'Education & Research Supplies'

In [None]:
display_lazy_tree(root, max_initial_depth=8)
