## **Pre-Trained Models for Product Categorization**

For rapidly evolving product catalogs, such as those on e-commerce platforms or in inventory management systems, automatic product categorization is essential for efficient organization, searchability, and user experience. Pre-trained text classification models offer a scalable way to do this.

Many pre-trained models, especially Transformer-based ones such as BERT, RoBERTa, or BART that have been fine-tuned on natural language inference data, can be used for zero-shot classification. This means they can assign new products to categories they were never explicitly trained on, given only descriptive label names, so labeled training data is not needed for every category.

## **Natural Language Inference (NLI)-Based Zero-Shot Text Classification**

Model details 🤗: [`facebook/bart-large-mnli`](https://huggingface.co/facebook/bart-large-mnli)

Yin et al. proposed a method for using pre-trained NLI models as ready-made zero-shot sequence classifiers. The method works by posing the sequence to be classified as the NLI premise and constructing a hypothesis from each candidate label. The model's entailment score for each premise-hypothesis pair is then used as the score for that label. This method is effective in many cases, particularly when used with larger pre-trained models like BART and RoBERTa.
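The premise/hypothesis construction described above can be sketched without loading any model. The snippet below is only an illustration under stated assumptions: the template `"This example is {}."` is assumed to be the hypothesis template, `build_nli_pairs` and `softmax` are hypothetical helper names, and the entailment logits are made-up numbers rather than real model output.

```python
import math


def build_nli_pairs(sequence, candidate_labels, hypothesis_template="This example is {}."):
    """Pair the sequence (the NLI premise) with one hypothesis per candidate label."""
    return [(sequence, hypothesis_template.format(label)) for label in candidate_labels]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


pairs = build_nli_pairs("one day I will see the world", ["travel", "cooking", "dancing"])
for premise, hypothesis in pairs:
    print(f"premise={premise!r} | hypothesis={hypothesis!r}")

# Hypothetical entailment logits, one per pair (illustrative values, not model output).
entailment_logits = [4.0, -1.5, -1.2]

# In the single-label setting, the entailment logits are normalized across labels.
scores = softmax(entailment_logits)
print(scores)
```

An NLI model would score each pair, and the label whose hypothesis is most strongly entailed by the premise would be ranked first.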

In [1]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
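The warning above indicates that the pipeline was created without a `device` argument, so the model runs on CPU. A minimal sketch of choosing a device follows, assuming a PyTorch backend. `pick_device` is a hypothetical helper, not part of `transformers`, and the pipeline call is left as a comment so the snippet runs without downloading the model.

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) when PyTorch can see one, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed, fall back to the CPU index.
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print(f"device argument to use: {device}")

# The value would then be passed when building the pipeline, for example:
# classifier = pipeline("zero-shot-classification",
#                       model="facebook/bart-large-mnli",
#                       device=device)
```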


In [8]:
# Using the pipeline to classify sequences into any of the specified class names
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
result = classifier(sequence_to_classify, candidate_labels)
result

{'sequence': 'one day I will see the world',
 'labels': ['travel', 'dancing', 'cooking'],
 'scores': [0.9938650727272034, 0.0032737923320382833, 0.0028610334265977144]}

In [14]:
# Define a threshold
threshold = 0.9

# Filter labels based on the threshold
filtered_labels = [
    (label, score)
    for label, score in zip(result['labels'], result['scores'])
    if score > threshold
]

# Print the labels and scores above the threshold
if filtered_labels:
    for label, score in filtered_labels:
        print(f"Label: '{label}' with a score of {score:.3f}")
else:
    print("No labels exceed the threshold.")

Label: 'travel' with a score of 0.994


In [15]:
# If more than one candidate label can be correct, pass multi_label=True to calculate each class independently
candidate_labels = ['travel', 'cooking', 'dancing', 'exploration']
result = classifier(sequence_to_classify, candidate_labels, multi_label=True)
result

{'sequence': 'one day I will see the world',
 'labels': ['travel', 'exploration', 'dancing', 'cooking'],
 'scores': [0.994511067867279,
  0.9383884072303772,
  0.005706179421395063,
  0.0018192887073382735]}
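The difference between the default call and the `multi_label=True` call above comes down to how the NLI logits are normalized. The sketch below assumes the behaviour of the Hugging Face zero-shot pipeline: in single-label mode the entailment logits are passed through a softmax across all labels, and in multi-label mode each label gets its own softmax over its contradiction and entailment logits. The logits are hypothetical values chosen for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def single_label_scores(entailment_logits):
    """Labels compete with each other: the scores sum to 1."""
    return softmax(entailment_logits)


def multi_label_scores(contradiction_logits, entailment_logits):
    """Each label is scored independently as P(entailment) vs. P(contradiction)."""
    return [softmax([c, e])[1] for c, e in zip(contradiction_logits, entailment_logits)]


# Hypothetical logits for four labels (illustrative values, not model output).
labels = ["travel", "cooking", "dancing", "exploration"]
entailment = [5.0, -3.0, -2.5, 3.0]
contradiction = [-4.0, 3.5, 3.0, -2.0]

single = single_label_scores(entailment)
multi = multi_label_scores(contradiction, entailment)

for label, s, m in zip(labels, single, multi):
    print(f"{label:12s} single={s:.3f} multi={m:.3f}")
```

With these numbers, 'exploration' is squeezed out by 'travel' in single-label mode but scores high on its own in multi-label mode. This is why two related labels can both exceed a threshold only when `multi_label=True`.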

In [16]:
# Define a threshold
threshold = 0.9

# Filter labels based on the threshold
filtered_labels = [
    (label, score)
    for label, score in zip(result['labels'], result['scores'])
    if score > threshold
]

# Print the labels and scores above the threshold
if filtered_labels:
    for label, score in filtered_labels:
        print(f"Label: '{label}' with a score of {score:.3f}")
else:
    print("No labels exceed the threshold.")

Label: 'travel' with a score of 0.995
Label: 'exploration' with a score of 0.938
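Since the threshold filter is applied to more than one result, it could be wrapped in a small reusable function. The following is a minimal sketch: `labels_above_threshold` is a hypothetical helper name, and the `result` dictionary is copied from the multi-label output shown earlier so the snippet runs without the model.

```python
def labels_above_threshold(result, threshold=0.9):
    """Return (label, score) pairs from a pipeline result whose score exceeds the threshold."""
    return [
        (label, score)
        for label, score in zip(result["labels"], result["scores"])
        if score > threshold
    ]


# Result copied from the multi-label classification output above.
result = {
    "sequence": "one day I will see the world",
    "labels": ["travel", "exploration", "dancing", "cooking"],
    "scores": [
        0.994511067867279,
        0.9383884072303772,
        0.005706179421395063,
        0.0018192887073382735,
    ],
}

kept = labels_above_threshold(result, threshold=0.9)
if kept:
    for label, score in kept:
        print(f"Label: '{label}' with a score of {score:.3f}")
else:
    print("No labels exceed the threshold.")
```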


In [5]:
# Another example: classifying a product description
other_candidate_labels = ['electronics', 'food', 'toys', 'books']
classifier('''Latest model of smartphone with 5G connectivity and 128GB internal storage''', 
           candidate_labels=other_candidate_labels)

{'sequence': 'Latest model of smartphone with 5G connectivity and 128GB internal storage',
 'labels': ['electronics', 'toys', 'food', 'books'],
 'scores': [0.9613720774650574,
  0.013380033895373344,
  0.012821927666664124,
  0.012426052242517471]}

## **🤗 Zero-Shot Classification in Brazilian Portuguese (pt-BR)**: [comprehend_it-multilingual-t5-base](https://huggingface.co/knowledgator/comprehend_it-multilingual-t5-base)

`comprehend_it-multilingual-t5-base` is an encoder-decoder model based on mT5-base that was trained on multilingual natural language inference datasets as well as on multiple text classification datasets. The model is designed to capture the context of both the text and the verbalized label more effectively, because the two inputs are encoded by different parts of the model: the text by the encoder and the label by the decoder. The zero-shot classifier supports nearly 100 languages and works cross-lingually, meaning that the labels and the text can be in different languages.

In [2]:
from liqfit.pipeline import ZeroShotClassificationPipeline
from liqfit.models import T5ForZeroShotClassification
from transformers import T5Tokenizer

model = T5ForZeroShotClassification.from_pretrained('knowledgator/comprehend_it-multilingual-t5-base')
tokenizer = T5Tokenizer.from_pretrained('knowledgator/comprehend_it-multilingual-t5-base')
classifier = ZeroShotClassificationPipeline(model=model, tokenizer=tokenizer,
                                            hypothesis_template='{}', encoder_decoder=True)

You are using a model of type T5 to instantiate a model of type t5. This is not supported for all configurations of models and can yield errors.


In [3]:
description = '''Este water gel leve e refrescante, proporciona hidratação imediata que ajuda a aliviar o repuxamento e aspereza da pele sensível.'''
candidate_labels = ['beleza', 'cozinha', 'livros']
result = classifier(description, candidate_labels, multi_label = False)
result

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'sequence': 'Este water gel leve e refrescante, proporciona hidratação imediata que ajuda a aliviar o repuxamento e aspereza da pele sensível.',
 'labels': ['beleza', 'livros', 'cozinha'],
 'scores': [0.9386342167854309, 0.04146720468997955, 0.01989862322807312]}

In [4]:
import pandas as pd

result = pd.DataFrame(result).drop(['sequence'], axis=1)
result

| | labels | scores |
|---|---|---|
| 0 | beleza | 0.938634 |
| 1 | livros | 0.041467 |
| 2 | cozinha | 0.019899 |


In [5]:
description = '''A fritadeira eletrica sem óleo start fry da elgin possui um design único, capacidade para até 3,5 litros,
            potência de 1400w e revestimento antiaderente. Seu sistema de circulação de ar ultra rápido frita e economiza energia.
            Sua grelha de fritura é removível e super fácil de limpar. Ela conta com uma proteção contra super aquecimento.
            Possui controle de temperatura de 80°c a 200°c que permite você programar a temperatura de preparo para cada tipo de alimento,
            timer para até 60 minutos com aviso sonoro e desligamento automático, assim você pode deixar preparando sua refeição
            enquanto realiza outras tarefas.'''

In [6]:
result = classifier(description, candidate_labels, multi_label=False)
result = pd.DataFrame(result).drop(["sequence"], axis=1)
result

| | labels | scores |
|---|---|---|
| 0 | cozinha | 0.709798 |
| 1 | beleza | 0.236621 |
| 2 | livros | 0.053581 |


In [7]:
# Applying the model to some data
df = pd.read_csv('../dados/descricoes_produtos.csv')
df.head(7)

| | Descrição |
|---|---|
| 0 | Liquidificador de alta potência com jarra de v... |
| 1 | Forno Micro-ondas de 20 litros, com menu desco... |
| 2 | Máquina de café espresso com reservatório de á... |
| 3 | Torradeira com capacidade para quatro fatias e... |
| 4 | Panela elétrica multifuncional que cozinha, as... |
| 5 | Smartphone com 128GB de armazenamento e câmera... |
| 6 | Smartwatch monitoramento de passos, frequência... |


In [8]:
candidate_labels = ['eletrodomésticos', 'eletrônicos', 'beleza', 'brinquedos']

In [11]:
def categorize(description: str) -> str:
    """
    Categorizes a given description into the highest scoring label.
    
    Args:
        description (str): The text description to categorize.

    Returns:
        str: The label with the highest score for the given description.
    """

    result = classifier(description, candidate_labels, multi_label=False)
    categoria_max = max(zip(result['labels'],result['scores']), key=lambda x: x[1])[0]
    return categoria_max

In [13]:
# Apply the function
df['Categoria'] = df['Descrição'].apply(categorize)

In [15]:
df.head(7)

| | Descrição | Categoria |
|---|---|---|
| 0 | Liquidificador de alta potência com jarra de v... | eletrodomésticos |
| 1 | Forno Micro-ondas de 20 litros, com menu desco... | eletrodomésticos |
| 2 | Máquina de café espresso com reservatório de á... | eletrodomésticos |
| 3 | Torradeira com capacidade para quatro fatias e... | eletrodomésticos |
| 4 | Panela elétrica multifuncional que cozinha, as... | eletrodomésticos |
| 5 | Smartphone com 128GB de armazenamento e câmera... | eletrônicos |
| 6 | Smartwatch monitoramento de passos, frequência... | eletrônicos |


In [16]:
df.tail(7)

| | Descrição | Categoria |
|---|---|---|
| 15 | Perfume feminino com notas de jasmim e sândalo... | beleza |
| 16 | Kit de barbear com creme, pós-barba e lâminas ... | beleza |
| 17 | Sérum facial anti-idade com vitamina C e ácido... | beleza |
| 18 | Máscara facial de argila purificante, ideal pa... | beleza |
| 19 | Quebra-cabeça de 1000 peças com imagem de pais... | brinquedos |
| 20 | Kit de ciências para crianças com experiências... | brinquedos |
| 21 | Jogo de tabuleiro clássico de estratégia para ... | brinquedos |
