# Product category prediction

### Links

- [Text Classification with BERT in PyTorch](https://towardsdatascience.com/text-classification-with-bert-in-pytorch-887965e5820f)
- [Best models](https://huggingface.co/models)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append('../../src')

from matplotlib import pyplot as plt
import seaborn as sns

import numpy  as np
import pandas as pd
import torch
from   torch import nn

import logging
import random

import data  as dt
import model as ml
import util  as ut
import pytorch_common.util as pu

## Setup

The default logger is configured to write to the console and to log messages at the INFO level and above.

In [3]:
pu.LoggerBuilder().on_console().build()

We use the GPU by default. If that hardware is not available, the fallback is the CPU:

In [4]:
torch.cuda.is_available()

In [5]:
torch.__version__

In [6]:
pu.set_device_name('gpu')

logging.info(pu.get_device())

2022-09-18 17:36:45,753 - INFO - cuda:0
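The same fallback can be written directly against PyTorch. This is a minimal sketch, not the implementation of `pu.set_device_name`; the helper name `pick_device` is hypothetical.

```python
import torch

def pick_device(prefer_gpu=True):
    """Return the first CUDA device when available, otherwise the CPU."""
    if prefer_gpu and torch.cuda.is_available():
        return torch.device('cuda:0')
    return torch.device('cpu')

device = pick_device()

# Tensors and modules have to be moved to the chosen device explicitly.
x = torch.zeros(2, 3).to(device)
print(device, x.device)
```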


In [7]:
torch.cuda.get_arch_list()

Set the random seed:

In [8]:
def set_seed(value):
    random.seed(value)
    np.random.seed(value)
    torch.manual_seed(value)

In [9]:
set_seed(42)
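Seeding makes random draws repeatable, which the short check below illustrates. The cuDNN flags are an optional extra that trades some speed for determinism on GPU; they are not part of the `set_seed` function above, and `set_seed_strict` is a hypothetical name.

```python
import random

import numpy as np
import torch

def set_seed_strict(value):
    """Seed Python, NumPy and PyTorch, and ask cuDNN for deterministic kernels."""
    random.seed(value)
    np.random.seed(value)
    torch.manual_seed(value)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed_strict(42)
first = (random.random(), np.random.rand(), torch.rand(1).item())

set_seed_strict(42)
second = (random.random(), np.random.rand(), torch.rand(1).item())

# Re-seeding replays the same sequence of draws.
print(first == second)
```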

Notebook parameters:

In [10]:
DATASET_PATH       = '../../datasets/fashion-outfits'
WEIGHTS_PATH       = '../../weights'
MODEL_WEIGHTS_PATH = f'{WEIGHTS_PATH}/model_weights.h5'

!mkdir -p {WEIGHTS_PATH}

Pretrained model to use. It acts as one more layer/module of our model:

In [11]:
BERT_MODEL ='bert-base-cased'
# BERT_MODEL ='distilbert-base-uncased'
# BERT_MODEL ='albert-base-v2'
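To illustrate how a pretrained model can sit inside a larger model as just another module, here is a minimal sketch of a classifier that holds an encoder as a submodule and adds dropout and a linear head on top. It is not the code of `ml.BertClassifier`. `TinyEncoder` and `SketchClassifier` are hypothetical; the tiny encoder is a stand-in so the example runs without downloading weights, whereas in practice the encoder would be the pretrained model named by `BERT_MODEL`.

```python
import torch
from torch import nn

class TinyEncoder(nn.Module):
    """Stand-in for a pretrained encoder: maps token ids to one vector per sequence."""
    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask):
        hidden = self.embedding(input_ids)              # (batch, seq, hidden)
        mask = attention_mask.unsqueeze(-1).float()     # (batch, seq, 1)
        # Mean over real (non-padding) positions only.
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

class SketchClassifier(nn.Module):
    """Encoder followed by dropout and a linear classification head."""
    def __init__(self, encoder, hidden_size, output_dim, dropout=0.1):
        super().__init__()
        self.encoder = encoder      # registered as a submodule, so its weights are trainable
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(hidden_size, output_dim)

    def forward(self, input_ids, attention_mask):
        pooled = self.encoder(input_ids, attention_mask)
        return self.head(self.dropout(pooled))          # raw logits, one per class

clf = SketchClassifier(TinyEncoder(), hidden_size=16, output_dim=5)
input_ids = torch.randint(0, 100, (2, 8))
attention_mask = torch.ones(2, 8, dtype=torch.long)
logits = clf(input_ids, attention_mask)
print(logits.shape)
```

Because the encoder is an attribute of the module, `clf.parameters()` includes its weights, so an optimizer built from them fine-tunes the encoder together with the head.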

## Load dataset

In [12]:
train_set = pd.read_csv(f'{DATASET_PATH}/train.csv')
val_set   = pd.read_csv(f'{DATASET_PATH}/val.csv')
test_set  = pd.read_csv(f'{DATASET_PATH}/test.csv')

In [13]:
train_set.head()

|   | id | family | category | sub_category | description | highlights | brand | gender | materials | branch | branch_seq |
|---|----|--------|----------|--------------|-------------|------------|-------|--------|-----------|--------|------------|
| 0 | 17162989 | clothing | dresses | cocktail & party dresses | polka-dot long-sleeve dress | [black/gold-tone, metallic threading, polka do... | zimmermann | women | ['Silk' 'viscose' 'Metallic Fibre'] | `clothing~dresses~cocktail & party dresses` | 140 |
| 1 | 17037913 | clothing | jackets | fitted jackets | zip-up hooded jacket | [white, front zip fastening, classic hood, lon... | k-way r&d | women | ['Polyester'] | `clothing~jackets~fitted jackets` | 100 |
| 2 | 17219240 | clothing | denim | wide-leg jeans | panelled wide-leg jeans | [black, cotton, panelled design, crescent moon... | marine serre | women | ['Cotton'] | `clothing~denim~wide-leg jeans` | 77 |
| 3 | 17404088 | clothing | trousers | cropped trousers | slim-cut tailored trousers | [black, belt loops, hook and zip fly fastening... | merci | women | ['viscose' 'Spandex/Elastane' 'Polyester'] | `clothing~trousers~cropped trousers` | 5 |
| 4 | 17052498 | clothing | tops | sweatshirts | bead-embellished crewneck sweatshirt | [black, cotton, bead embellishment, crew neck,... | alexandre vauthier | women | ['glass' 'Cotton' 'Lambskin'] | `clothing~tops~sweatshirts` | 45 |


In [14]:
train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92151 entries, 0 to 92150
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            92151 non-null  int64 
 1   family        92151 non-null  object
 2   category      92151 non-null  object
 3   sub_category  92151 non-null  object
 4   description   92148 non-null  object
 5   highlights    87976 non-null  object
 6   brand         92151 non-null  object
 7   gender        92151 non-null  object
 8   materials     92115 non-null  object
 9   branch        92151 non-null  object
 10  branch_seq    92151 non-null  int64 
dtypes: int64(2), object(9)
memory usage: 7.7+ MB


In [15]:
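# Cast descriptions to str so the tokenizer always receives text.
# Missing values (NaN) become the literal string 'nan'.
train_set['description'] = train_set['description'].apply(str)
val_set['description']   = val_set['description'].apply(str)
test_set['description']  = test_set['description'].apply(str)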

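Maximum input sequence length. It is measured in characters, which generally over-estimates the number of tokens; the extra 2 presumably leaves room for the special `[CLS]` and `[SEP]` tokens: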

In [16]:
max_length = 2 + train_set['description'].apply(len).max()
max_length
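`bert-base-cased` has 512 position embeddings, so a length derived from character counts may need to be capped before it reaches the tokenizer. A minimal sketch, using the five descriptions shown in `train_set.head()`; the constant and helper names are hypothetical:

```python
MODEL_MAX_POSITIONS = 512   # position embeddings available in bert-base-cased
SPECIAL_TOKENS = 2          # [CLS] at the start and [SEP] at the end

def capped_max_length(descriptions, limit=MODEL_MAX_POSITIONS):
    """Longest description in characters plus special tokens, capped at `limit`."""
    longest = max(len(text) for text in descriptions)
    return min(longest + SPECIAL_TOKENS, limit)

sample = [
    'polka-dot long-sleeve dress',
    'zip-up hooded jacket',
    'panelled wide-leg jeans',
    'slim-cut tailored trousers',
    'bead-embellished crewneck sweatshirt',
]

max_length = capped_max_length(sample)
print(max_length)
```

If the cap ever applies, the tokenizer also has to truncate longer inputs; whether `ml.Tokenizer` does so is not shown in this notebook.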

In [17]:
tokenizer = ml.Tokenizer(
    BERT_MODEL,
    padding    = 'max_length',
    max_length = max_length
)
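With `padding = 'max_length'`, every sequence is padded to the same length and an attention mask marks which positions hold real tokens. The toy function below illustrates the idea on plain lists. It is not the tokenizer's code: the function name is hypothetical, the token ids are made up, and the padding id is assumed to be 0.

```python
def pad_to_max_length(token_ids, max_length, pad_id=0):
    """Right-pad `token_ids` to `max_length` and build the matching attention mask."""
    if len(token_ids) > max_length:
        raise ValueError('sequence is longer than max_length')
    n_pad = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    attention_mask = [1] * len(token_ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Made-up ids; 101 and 102 stand for the [CLS] and [SEP] special tokens.
input_ids, attention_mask = pad_to_max_length([101, 7, 8, 9, 102], max_length=8)
print(input_ids)
print(attention_mask)
```

Padding everything to one fixed length keeps batches rectangular at the cost of extra computation on padding positions.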

In [18]:
train_dataset = dt.BertDataset(
    train_set,
    feature_col = 'description', 
    target_col  = 'branch_seq',
    tokenizer   = tokenizer
)

val_dataset = dt.BertDataset(
    val_set,
    feature_col = 'description', 
    target_col  = 'branch_seq',
    tokenizer   = tokenizer
)

test_dataset = dt.BertDataset(
    test_set,
    feature_col = 'description', 
    target_col  = 'branch_seq',
    tokenizer   = tokenizer
)

Number of classes to predict:

In [19]:
n_classes = train_set['branch_seq'].max() + 1
n_classes
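`max() + 1` gives the class count only if `branch_seq` holds zero-based ids, and `nn.CrossEntropyLoss` expects targets in the range `[0, n_classes - 1]`. A quick sanity check, sketched here on the five labels visible in `train_set.head()`, could look like this (the helper name is hypothetical):

```python
def count_classes(labels):
    """Return max(label) + 1 after checking that no label is negative."""
    if min(labels) < 0:
        raise ValueError('labels must be zero-based, non-negative ids')
    return max(labels) + 1

labels = [140, 100, 77, 5, 45]
n_classes = count_classes(labels)

# Ids that never occur in this sample still get an output unit.
unused = n_classes - len(set(labels))
print(n_classes, unused)
```

Taking the maximum from the training split alone assumes that no larger id appears only in the validation or test split.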

In [20]:
classifier = ml.BertClassifier(
    output_dim = n_classes,
    model      = BERT_MODEL,
    dropout    = 0.1
)

## Training and evaluation

In [21]:
# classifier.load(MODEL_WEIGHTS_PATH)

In [22]:
model = ml.BertModel(
    classifier,
    batch_size = 70,
    criterion  = nn.CrossEntropyLoss(), 
    optimizer  = torch.optim.Adam(classifier.parameters(), lr = 0.0000001)
)
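What `fit` does internally is not shown in this notebook. As a rough sketch under that caveat, one optimization step with this kind of criterion and optimizer typically looks as follows in plain PyTorch. The linear model and random batch are placeholders, and the learning rate is chosen only for the demonstration.

```python
import torch
from torch import nn

torch.manual_seed(0)

net = nn.Linear(4, 3)                     # placeholder for the classifier
criterion = nn.CrossEntropyLoss()         # expects raw logits and integer class ids
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)

features = torch.randn(8, 4)              # placeholder batch of 8 samples
targets = torch.randint(0, 3, (8,))

def train_step(features, targets):
    """Run one forward/backward pass and update the weights."""
    net.train()
    optimizer.zero_grad()                 # clear gradients from the previous step
    loss = criterion(net(features), targets)
    loss.backward()                       # compute gradients
    optimizer.step()                      # apply the Adam update
    return loss.item()

losses = [train_step(features, targets) for _ in range(20)]
print(losses[0], losses[-1])
```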

In [None]:
model.fit(train_dataset, val_dataset, epochs=2)

In [None]:
classifier.save(MODEL_WEIGHTS_PATH)

In [None]:
summary = model.validate(test_dataset)

In [None]:
summary.accuracy, summary.loss

In [None]:
summary.targets

In [None]:
summary.predictions

In [None]:
np.concatenate(summary.predictions).shape
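The call to `np.concatenate` suggests that predictions are stored as one array per batch. Assuming that layout, and assuming targets are stored the same way, accuracy can be recomputed by flattening both. The arrays below are made up for illustration.

```python
import numpy as np

# Made-up per-batch class ids; the last batch is smaller than the others.
batch_predictions = [np.array([3, 1, 0]), np.array([2, 2, 1]), np.array([0])]
batch_targets     = [np.array([3, 1, 1]), np.array([2, 0, 1]), np.array([0])]

predictions = np.concatenate(batch_predictions)   # shape: (n_samples,)
targets     = np.concatenate(batch_targets)

# Fraction of samples whose predicted id equals the target id.
accuracy = float((predictions == targets).mean())
print(predictions.shape, accuracy)
```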

In [None]:
summary.show_sample_metrics(0)

In [None]:
summary.show_sample_metrics(1)

In [None]:
summary.show_metrics()