# Product category prediction

### Links

- [Text Classification with BERT in PyTorch](https://towardsdatascience.com/text-classification-with-bert-in-pytorch-887965e5820f)
- [Best models](https://huggingface.co/models)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append('../src')

from matplotlib import pyplot as plt
import seaborn as sns

import numpy  as np
import pandas as pd
import torch
from   torch import nn

import logging
import random

import data  as dt
import model as ml
import util  as ut
import pytorch_common.util as pu

## Setup

In [3]:
pu.LoggerBuilder().on_console().build()

In [4]:
torch.cuda.is_available()

In [5]:
torch.__version__

In [6]:
pu.set_device_name('gpu')

logging.info(pu.get_device())

2022-08-28 16:27:55,788 - INFO - cuda:0


In [7]:
torch.cuda.get_arch_list()

In [8]:
def set_seed(value):
    random.seed(value)
    np.random.seed(value)
    torch.manual_seed(value)

In [9]:
set_seed(42)
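
A quick way to confirm that seeding has the intended effect is to reseed and check that the same random draws come back. The sketch below uses only Python's `random` module so it runs without a GPU; `set_python_seed` and `draw_sample` are hypothetical helpers for illustration, not part of the notebook's `src` modules.

```python
import random


def set_python_seed(value):
    # Stdlib-only stand-in for the notebook's set_seed (assumption: the
    # NumPy and PyTorch generators repeat their draws in the same way
    # when they are reseeded with the same value).
    random.seed(value)


def draw_sample(n=5):
    # Hypothetical helper: draw n pseudo-random integers.
    return [random.randint(0, 1000) for _ in range(n)]


set_python_seed(42)
first = draw_sample()

set_python_seed(42)
second = draw_sample()

# Reseeding with the same value should reproduce the same sequence.
same_after_reseed = first == second
print(same_after_reseed)
```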

In [1]:
DATASET_PATH       = '../datasets'
WEIGHTS_PATH       = '../weights'
MODEL_WEIGHTS_PATH = f'{WEIGHTS_PATH}/model_weights.h5'

!mkdir -p {WEIGHTS_PATH}

In [11]:
BERT_MODEL = 'bert-base-cased'
# BERT_MODEL = 'distilbert-base-uncased'
# BERT_MODEL = 'albert-base-v2'

## Load dataset

In [12]:
train_set = pd.read_csv(f'{DATASET_PATH}/ebay-train.csv')
val_set   = pd.read_csv(f'{DATASET_PATH}/ebay-val.csv')
test_set  = pd.read_csv(f'{DATASET_PATH}/ebay-test.csv')

In [13]:
train_set.head()

| Unnamed: 0 | title | category | category_seq |
|---|---|---|---|
| 0 | 2009 Chevrolet Aveo 1.2 LS 5dr Engine Control ... | eBay ~ Vehicle Parts & Accessories ~ Car Parts... | 32 |
| 1 | 1pz Front Shock Absorber SX JAPANPARTS Mm-10068 | eBay ~ Vehicle Parts & Accessories ~ Car Parts... | 16 |
| 2 | Mintex MDC2808C Brake Disc | eBay ~ Vehicle Parts & Accessories ~ Car Parts... | 49 |
| 3 | Gates T39145 Tensioner | eBay ~ Vehicle Parts & Accessories ~ Car Parts... | 87 |
| 4 | Gimbal Camera Lens Protector Cover Cap Accesso... | eBay ~ Toys & Games ~ Radio Control & RC Toys ... | 5 |


Maximum input sequence length, taken as the longest title in characters plus 2:

In [14]:
max_length = 2 + train_set['title'].apply(len).max()
max_length
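
The value computed here counts characters, not tokens. Assuming the extra 2 is meant to leave room for BERT's `[CLS]` and `[SEP]` special tokens, the same computation can be sketched on plain Python strings. The titles are the three untruncated ones from the `train_set` preview, and a whitespace split stands in for the real tokenizer purely for illustration.

```python
# Untruncated titles copied from the train_set preview.
titles = [
    "1pz Front Shock Absorber SX JAPANPARTS Mm-10068",
    "Mintex MDC2808C Brake Disc",
    "Gates T39145 Tensioner",
]

SPECIAL_TOKENS = 2  # assumed: one [CLS] and one [SEP] per sequence

# Character-based bound, mirroring the notebook cell.
max_length = SPECIAL_TOKENS + max(len(t) for t in titles)

# Rough word-level count for comparison (whitespace split, not WordPiece).
max_words = SPECIAL_TOKENS + max(len(t.split()) for t in titles)

print(max_length, max_words)
```

A character count is generally a safe upper bound, since each token covers at least one character, but it pads every sequence well beyond what most titles need. Measuring the tokenized lengths directly would give a tighter value.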

In [15]:
tokenizer = ml.Tokenizer(
    BERT_MODEL,
    padding    = 'max_length',
    max_length = max_length
)

In [16]:
train_dataset = dt.BertDataset(
    train_set,
    feature_col = 'title', 
    target_col  = 'category_seq',
    tokenizer   = tokenizer
)

val_dataset = dt.BertDataset(
    val_set,
    feature_col = 'title', 
    target_col  = 'category_seq',
    tokenizer   = tokenizer
)

test_dataset = dt.BertDataset(
    test_set,
    feature_col = 'title', 
    target_col  = 'category_seq',
    tokenizer   = tokenizer
)
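
The `dt.BertDataset` class lives in `../src` and is not shown in this notebook. As a rough illustration only, a map-style dataset of this kind typically exposes `__len__` and `__getitem__` and returns one encoded title with its label per index. Everything in the sketch below (`TitleDataset`, `toy_tokenizer`, the list-of-dicts rows) is hypothetical and is not the notebook's actual implementation.

```python
class TitleDataset:
    """Hypothetical map-style dataset: one (encoded title, label) pair per row."""

    def __init__(self, rows, feature_col, target_col, tokenizer):
        self.rows = rows
        self.feature_col = feature_col
        self.target_col = target_col
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        row = self.rows[index]
        return self.tokenizer(row[self.feature_col]), row[self.target_col]


def toy_tokenizer(text, max_length=6):
    # Toy encoding (assumption): one id per whitespace-separated word,
    # here simply the word length, right-padded with zeros to max_length.
    ids = [len(word) for word in text.split()][:max_length]
    return ids + [0] * (max_length - len(ids))


# Two rows taken from the train_set preview, as plain dicts.
rows = [
    {"title": "Mintex MDC2808C Brake Disc", "category_seq": 49},
    {"title": "Gates T39145 Tensioner", "category_seq": 87},
]

dataset = TitleDataset(rows, "title", "category_seq", toy_tokenizer)
features, target = dataset[0]
print(len(dataset), features, target)
```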

Number of classes to predict:

In [17]:
n_classes = train_set['category'].unique().shape[0]
n_classes
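
`nn.CrossEntropyLoss` expects integer class targets in the range `[0, n_classes)`, so it can be worth checking that the `category_seq` values fit the count derived from `category`. The sketch below uses made-up labels; `check_label_range` is a hypothetical helper, and the notebook's own data may already satisfy this condition.

```python
def check_label_range(labels, n_classes):
    """Return the sorted distinct labels that fall outside [0, n_classes)."""
    return sorted({y for y in labels if not 0 <= y < n_classes})


# Made-up example: 4 classes, with one label (4) out of range.
labels = [0, 3, 1, 2, 4, 1]
out_of_range = check_label_range(labels, n_classes=4)
print(out_of_range)
```

With the DataFrame loaded, the same check could be called as `check_label_range(train_set['category_seq'], n_classes)`; an empty list means every label is a valid class index.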

In [18]:
classifier = ml.BertClassifier(
    output_dim = n_classes,
    model      = BERT_MODEL,
    dropout    = 0.1
)

## Training and evaluation

In [19]:
classifier.load(MODEL_WEIGHTS_PATH)

In [20]:
model = ml.BertModel(
    classifier,
    batch_size = 25,
    criterion  = nn.CrossEntropyLoss(), 
    optimizer  = torch.optim.Adam(classifier.parameters(), lr = 1e-7)
)

In [26]:
model.fit(train_dataset, val_dataset, epochs=2)

2022-08-28 16:37:42,651 - INFO - Time: 0:00:12.30 | Epoch: 1 | Train(loss: 0.003595, acc: 99.94%) | Val(loss: 0.037137, acc: 82.47%) | acc diff: 17.47%
2022-08-28 16:37:54,926 - INFO - Time: 0:00:12.27 | Epoch: 2 | Train(loss: 0.003462, acc: 99.94%) | Val(loss: 0.037409, acc: 83.51%) | acc diff: 16.44%
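
The `acc diff` field appears to be the gap between training and validation accuracy. A gap this large, with training accuracy near 100%, usually points to overfitting. The sketch below recomputes the gap from one log line with a regular expression. The pattern is written against the log format shown here, and `parse_epoch_log` is a hypothetical helper rather than part of `pytorch_common`.

```python
import re

# Pattern written against the log format printed by model.fit above.
LOG_PATTERN = re.compile(
    r"Epoch: (?P<epoch>\d+) \| "
    r"Train\(loss: (?P<train_loss>[\d.]+), acc: (?P<train_acc>[\d.]+)%\) \| "
    r"Val\(loss: (?P<val_loss>[\d.]+), acc: (?P<val_acc>[\d.]+)%\)"
)


def parse_epoch_log(line):
    """Extract the epoch, both accuracies, and their gap from one log line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        raise ValueError("line does not match the expected log format")
    train_acc = float(match["train_acc"])
    val_acc = float(match["val_acc"])
    return {
        "epoch": int(match["epoch"]),
        "train_acc": train_acc,
        "val_acc": val_acc,
        "gap": round(train_acc - val_acc, 2),
    }


line = (
    "2022-08-28 16:37:42,651 - INFO - Time: 0:00:12.30 | Epoch: 1 | "
    "Train(loss: 0.003595, acc: 99.94%) | Val(loss: 0.037137, acc: 82.47%) | "
    "acc diff: 17.47%"
)
result = parse_epoch_log(line)
print(result["gap"])
```

Because the log prints rounded accuracies, a gap recomputed this way can differ from the logged `acc diff` by 0.01 on some epochs.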


In [30]:
classifier.save(MODEL_WEIGHTS_PATH)

In [31]:
model.validate(test_dataset)