# News classifier

## Objective

This notebook uses "microsoft/deberta-v3-small" as the base for training a news classifier that assigns each article to one of four topics: World, Sports, Business, or Sci/Tech. It is inspired by the model "wesleyacheng/news-topic-classification-with-bert".

## Installation


In [17]:
!pip install datasets transformers



## Data Frames

Import the dataset. The one used here is ["ag_news"](https://huggingface.co/datasets/ag_news), from Hugging Face.

In [18]:
from datasets import load_dataset

dataset = load_dataset("ag_news")

dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

## Tokenization

In [19]:
from transformers import AutoModelForSequenceClassification,AutoTokenizer
model_nm = 'microsoft/deberta-v3-small'
tokenizer = AutoTokenizer.from_pretrained(model_nm)

loading configuration file https://huggingface.co/microsoft/deberta-v3-small/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/8e0c12a7672d1d36f647c86e5fc3a911f189d8704e2bc94dde4a1ffe38f648fa.9df96bac06c2c492bc77ad040068f903c93beec14607428f25bf9081644ad0da
Model config DebertaV2Config {
  "_name_or_path": "microsoft/deberta-v3-small",
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "max_relative_positions": -1,
  "model_type": "deberta-v2",
  "norm_rel_ebd": "layer_norm",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 0,
  "pooler_dropout": 0,
  "pooler_hidden_act": "gelu",
  "pooler_hidden_size": 768,
  "pos_att_type": [
    "p2c",
    "c2p"
  ],
  "position_biased_input": false,
  "position_buckets": 256,
  "relative_attention": true,
  "share_att

The 'label' column had to be renamed to 'labels'.

In [20]:
def tokenizerFunc(x):
    return tokenizer(x["text"])

tok_ds = dataset.map(tokenizerFunc, batched=True)
tok_ds = tok_ds.rename_column('label', 'labels')
tok_ds

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7600
    })
})

A smaller portion of the dataset also had to be used to avoid memory problems; in this case, 60% of the original dataset.

In [21]:
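fraction = 0.6  # keep 60% of each split
num_train_examples = int(len(tok_ds['train']) * fraction)
num_test_examples = int(len(tok_ds['test']) * fraction)
# Computed from the full train split, so this equals num_train_examples
num_validation_examples = int(len(tok_ds['train']) * fraction)

tok_ds['train'] = tok_ds['train'].select(range(num_train_examples))
tok_ds['test'] = tok_ds['test'].select(range(num_test_examples))
# Note: 'validation' is selected from the already truncated train split,
# so it contains the same rows as 'train' rather than held-out data
tok_ds['validation'] = tok_ds['train'].select(range(num_validation_examples))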
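
Because `validation` is selected from the already truncated `train` split, both hold the same rows. If a held-out validation set is wanted instead, one option is to compute non-overlapping index ranges first and pass each one to `Dataset.select`. The sketch below only illustrates the index arithmetic: `split_ranges` and the 10% validation share are hypothetical and not part of the original notebook.

```python
def split_ranges(n_rows, train_fraction, val_fraction):
    """Return two non-overlapping index ranges (train, validation).

    Hypothetical helper; the fractions are illustrative assumptions.
    """
    if train_fraction + val_fraction > 1:
        raise ValueError("fractions must sum to at most 1")
    n_train = int(n_rows * train_fraction)
    n_val = int(n_rows * val_fraction)
    train_idx = range(0, n_train)
    # Validation rows start where the training rows end.
    val_idx = range(n_train, n_train + n_val)
    return train_idx, val_idx


# 120000 is the size of the ag_news train split reported by load_dataset.
train_idx, val_idx = split_ranges(120_000, train_fraction=0.6, val_fraction=0.1)
print(len(train_idx), len(val_idx))

# With the tokenized dataset in memory, the ranges could be used as:
# full_train = tok_ds['train']
# tok_ds['train'] = full_train.select(train_idx)
# tok_ds['validation'] = full_train.select(val_idx)
```

Keeping a reference to the full split before reassigning `tok_ds['train']` is what prevents the validation rows from being drawn out of the training subset.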

## Training the model

In [22]:
from transformers import TrainingArguments,Trainer

In [23]:
bs = 50
epochs = 4
lr = 8e-5

In [24]:
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

PyTorch: setting up devices
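
To make the `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` settings concrete, the sketch below computes the learning rate per step in plain Python: a linear warmup followed by a half-cosine decay to zero. It is an illustration of the usual cosine-with-warmup shape, not the library's own code; `lr_at_step` is a hypothetical name, and the step count assumes the 72,000 training rows and single device used in this notebook.

```python
import math


def lr_at_step(step, base_lr, total_steps, warmup_ratio):
    """Learning rate at a given optimizer step (illustrative sketch)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Half-cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


bs, epochs, lr = 50, 4, 8e-5
steps_per_epoch = math.ceil(72_000 / bs)  # 1440
total_steps = steps_per_epoch * epochs    # 5760

# Start, mid-warmup, end of warmup, mid-decay, final step.
for step in (0, 288, 576, 3168, 5760):
    print(step, lr_at_step(step, lr, total_steps, warmup_ratio=0.1))
```

The learning rate therefore peaks at `8e-5` roughly one tenth of the way into training and is close to zero by the last epoch.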


In [25]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=4)
trainer = Trainer(model, args, train_dataset=tok_ds['train'], eval_dataset=tok_ds['test'], tokenizer=tokenizer)
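
Because only `num_labels=4` is passed, the model configuration falls back to generic class names (`LABEL_0` to `LABEL_3`), as the loading log for this cell shows. A small sketch of how readable names could be supplied instead: the class order is the one listed on the ag_news dataset card, which is an assumption worth checking against `dataset['train'].features['label'].names`, and the final call is left commented out because it downloads the model.

```python
# Class names in ag_news label order (assumed from the dataset card).
class_names = ["World", "Sports", "Business", "Sci/Tech"]

# Integer id -> name, and the inverse mapping.
id2label = {i: name for i, name in enumerate(class_names)}
label2id = {name: i for i, name in id2label.items()}

print(id2label)
print(label2id)

# Both dicts can be passed as config overrides when loading the model:
# model = AutoModelForSequenceClassification.from_pretrained(
#     model_nm, num_labels=4, id2label=id2label, label2id=label2id)
```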

loading configuration file https://huggingface.co/microsoft/deberta-v3-small/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/8e0c12a7672d1d36f647c86e5fc3a911f189d8704e2bc94dde4a1ffe38f648fa.9df96bac06c2c492bc77ad040068f903c93beec14607428f25bf9081644ad0da
Model config DebertaV2Config {
  "_name_or_path": "microsoft/deberta-v3-small",
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3
  },
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "max_relative_positions": -1,
  "model_type": "deberta-v2",
  "norm_rel_ebd": "layer_norm",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 0,
  "pooler_dropout": 0,
  "pooler_hidden_

In [26]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: text.
***** Running training *****
  Num examples = 72000
  Num Epochs = 4
  Instantaneous batch size per device = 50
  Total train batch size (w. parallel, distributed & accumulation) = 50
  Gradient Accumulation steps = 1
  Total optimization steps = 5760


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.265         | 0.222356        |
| 2     | 0.1643        | 0.177723        |
| 3     | 0.0981        | 0.19923         |
| 4     | 0.0616        | 0.212139        |


Saving model checkpoint to outputs/checkpoint-500
Configuration saved in outputs/checkpoint-500/config.json
Model weights saved in outputs/checkpoint-500/pytorch_model.bin
tokenizer config file saved in outputs/checkpoint-500/tokenizer_config.json
Special tokens file saved in outputs/checkpoint-500/special_tokens_map.json
added tokens file saved in outputs/checkpoint-500/added_tokens.json
Saving model checkpoint to outputs/checkpoint-1000
Configuration saved in outputs/checkpoint-1000/config.json
Model weights saved in outputs/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in outputs/checkpoint-1000/tokenizer_config.json
Special tokens file saved in outputs/checkpoint-1000/special_tokens_map.json
added tokens file saved in outputs/checkpoint-1000/added_tokens.json
The following columns in the evaluation set  don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: text.
***** Running Evaluation *****

TrainOutput(global_step=5760, training_loss=0.16805264088842603, metrics={'train_runtime': 2659.7917, 'train_samples_per_second': 108.279, 'train_steps_per_second': 2.166, 'total_flos': 9234269808673200.0, 'train_loss': 0.16805264088842603, 'epoch': 4.0})
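
The per-epoch losses reported during training can also be inspected programmatically. The sketch below copies the four rows from the training log into a list and picks the epoch with the lowest validation loss; the variable names are illustrative. Validation loss bottoms out at epoch 2 while training loss keeps falling, which is a common sign of overfitting in the later epochs.

```python
# (epoch, training loss, validation loss), copied from the training log.
history = [
    (1, 0.2650, 0.222356),
    (2, 0.1643, 0.177723),
    (3, 0.0981, 0.199230),
    (4, 0.0616, 0.212139),
]

# Epoch with the smallest validation loss.
best_epoch, _, best_val = min(history, key=lambda row: row[2])
print(best_epoch, best_val)  # 2 0.177723

# The gap between validation and training loss widens every epoch.
for epoch, train_loss, val_loss in history:
    print(epoch, round(val_loss - train_loss, 4))
```

If keeping the best checkpoint matters, `TrainingArguments` has a `load_best_model_at_end` option (used together with a matching save strategy) that may be worth trying instead of always ending on the final epoch.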

## Testing the model's results

In [28]:
preds = trainer.predict(tok_ds['validation']).predictions.astype(float);
preds

The following columns in the test set  don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: text.
***** Running Prediction *****
  Num examples = 72000
  Batch size = 100


array([[-0.82421875, -3.57226562,  5.87890625, -1.23242188],
       [-1.29101562, -3.49804688,  6.02734375, -1.046875  ],
       [-0.17114258, -3.85742188,  5.76953125, -1.45996094],
       ...,
       [ 6.2890625 , -2.91796875, -1.92285156, -2.046875  ],
       [ 5.90234375, -3.265625  , -1.39160156, -1.70214844],
       [ 6.05078125, -3.23046875, -1.60058594, -1.72265625]])

In [29]:
preds[:10]

array([[-0.82421875, -3.57226562,  5.87890625, -1.23242188],
       [-1.29101562, -3.49804688,  6.02734375, -1.046875  ],
       [-0.17114258, -3.85742188,  5.76953125, -1.45996094],
       [ 2.88671875, -4.19140625,  3.26953125, -1.63769531],
       [ 3.38671875, -4.16015625,  2.87890625, -1.89550781],
       [-1.25097656, -3.59765625,  6.10546875, -1.1015625 ],
       [-1.4296875 , -3.33789062,  6.5546875 , -1.59082031],
       [-1.12890625, -3.47460938,  6.30859375, -1.47753906],
       [-1.109375  , -3.421875  ,  6.05078125, -1.25878906],
       [-1.38769531, -3.31640625,  6.3515625 , -1.3984375 ]])
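
The arrays above are raw logits, one column per class. A minimal sketch of turning them into class names and probabilities follows, using the first five rows of the output and only the standard library. The class order is assumed to follow the ag_news label order (World, Sports, Business, Sci/Tech), and the `softmax` and `argmax` helpers are written here for illustration.

```python
import math

class_names = ["World", "Sports", "Business", "Sci/Tech"]  # assumed ag_news order

# First five rows of `preds`, copied from the output above.
logits = [
    [-0.82421875, -3.57226562, 5.87890625, -1.23242188],
    [-1.29101562, -3.49804688, 6.02734375, -1.046875],
    [-0.17114258, -3.85742188, 5.76953125, -1.45996094],
    [2.88671875, -4.19140625, 3.26953125, -1.63769531],
    [3.38671875, -4.16015625, 2.87890625, -1.89550781],
]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)


for row in logits:
    probs = softmax(row)
    k = argmax(row)
    print(class_names[k], round(probs[k], 3))
```

On the full NumPy array, `preds.argmax(axis=1)` gives the same class indices in one call, and comparing them with the `labels` column yields accuracy. Since `tok_ds['validation']` reuses training rows in this notebook, such a score would likely be optimistic; `tok_ds['test']` is the held-out alternative.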

## Zipping the model for upload to Hugging Face

In [52]:
save_directory = "./news"
tokenizer.save_pretrained(save_directory);
trainer.save_model(save_directory);

tokenizer config file saved in ./news/tokenizer_config.json
Special tokens file saved in ./news/special_tokens_map.json
added tokens file saved in ./news/added_tokens.json
Saving model checkpoint to ./news
Configuration saved in ./news/config.json
Model weights saved in ./news/pytorch_model.bin
tokenizer config file saved in ./news/tokenizer_config.json
Special tokens file saved in ./news/special_tokens_map.json
added tokens file saved in ./news/added_tokens.json


In [53]:
import os
import zipfile

# Zip the saved model directory, storing paths relative to save_directory
with zipfile.ZipFile("news.zip", 'w', zipfile.ZIP_DEFLATED) as zipf:
    for root, _, files in os.walk(save_directory):
        for file in files:
            path = os.path.join(root, file)
            zipf.write(path, os.path.relpath(path, save_directory))
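
As an alternative sketch, the standard library's `shutil.make_archive` can produce an equivalent archive in one call. The example below runs against a temporary stand-in directory with two placeholder files (both hypothetical, not the real model output) so that it is self-contained, then lists the archive contents to confirm the paths are stored relative to the saved directory.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the "./news" directory written by save_pretrained.
    save_directory = os.path.join(tmp, "news")
    os.makedirs(save_directory)
    for name in ("config.json", "tokenizer_config.json"):
        with open(os.path.join(save_directory, name), "w") as f:
            f.write("{}")

    # Creates <tmp>/news.zip with paths relative to save_directory.
    archive_path = shutil.make_archive(
        os.path.join(tmp, "news"), "zip", root_dir=save_directory
    )

    # Read the file list back before the temporary directory is removed.
    with zipfile.ZipFile(archive_path) as zf:
        archived = sorted(zf.namelist())

print(archived)  # ['config.json', 'tokenizer_config.json']
```

In the notebook itself the call would reduce to `shutil.make_archive("news", "zip", root_dir=save_directory)`.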