https://huggingface.co/docs/transformers/training
https://huggingface.co/docs/transformers/tasks/sequence_classification

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root

In [3]:
!pip install transformers datasets sentencepiece torch evaluate accelerate



In [4]:
import pandas as pd
import transformers
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer)
from datasets import load_dataset
import sentencepiece as spm
import evaluate
import torch
import numpy as np

In [5]:
dataset = load_dataset(
    "csv",
    data_files= "/content/drive/MyDrive/data/final_review_dataset.csv",
    delimiter=";"
)

In [6]:
dataset['train'][7988]

{'text': 'This hotel occupies an historic building that was once the old high school in the town of Namekagon, located in northern Wisconsin. It has three floors, with 12 rooms in all, some of which are doubles, while others are triples.\nThe rooms are pretty spacious and have all the amenities you need for a short-term stay. Each room is equipped with a full-sized bed, a desk, a coffee maker, a microwave, a TV, and a private bathroom with a bathtub.\nThey also have a kitchen, a living room and a dining room, which is open to the public. The',
 'label': 0}

In [8]:
dataset = dataset['train'].train_test_split(test_size=0.2)

In [9]:
id2label = {0: "FAKE", 1: "REAL"}
label2id = {"FAKE": 0, "REAL": 1}

In [27]:
model_name =  "microsoft/deberta-v3-base"

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2, id2label=id2label, label2id=label2id)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['pooler.dense.bias', 'classifier.weight', 'pooler.dense.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Tokenization and DataCollator

In [11]:
def tokenize_function(examples):

    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [12]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/7998 [00:00<?, ? examples/s]

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

In [13]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Evaluation metrics are loaded

In [14]:
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")
f1_metric = evaluate.load("f1")
accuracy_metric= evaluate.load("accuracy")

In [15]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    results = {}
    results.update(f1_metric.compute(predictions=preds, references = labels, average="macro"))
    results.update(precision_metric.compute(predictions=preds, references = labels, average="macro"))
    results.update(recall_metric.compute(predictions=preds, references = labels, average="macro"))
    results.update(accuracy_metric.compute(predictions=preds, references = labels))
    return results

In [32]:
batch_size = 4
gradient_accumulation_steps=4
logging_steps = 100
new_model_name = "deberta-v3-base-fake-hotel-review-detector"
num_train_epochs = 2
learning_rate = 5e-5
evaluation_strategy = "epoch"
save_strategy = "epoch"
load_best_model_at_end = True

In [33]:
training_args = TrainingArguments(
    output_dir=new_model_name,
    logging_steps=logging_steps,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    num_train_epochs=num_train_epochs,
    learning_rate=learning_rate,
    evaluation_strategy=evaluation_strategy,
    save_strategy=save_strategy,
    load_best_model_at_end=load_best_model_at_end,
    gradient_checkpointing=True
)

Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [28]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [29]:
transformers.logging.set_verbosity_info()

In [30]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 7,998
  Num Epochs = 2
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 4
  Total optimization steps = 1,000
  Number of trainable parameters = 184,423,682


Epoch,Training Loss,Validation Loss,F1,Precision,Recall,Accuracy
1,0.0615,0.070354,0.977,0.977467,0.977314,0.977
2,0.0149,0.044512,0.991499,0.991472,0.991586,0.9915


The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to deberta-v3-base-fake-hotel-review-detector/checkpoint-500
Configuration saved in deberta-v3-base-fake-hotel-review-detector/checkpoint-500/config.json
Model weights saved in deberta-v3-base-fake-hotel-review-detector/checkpoint-500/pytorch_model.bin
tokenizer config file saved in deberta-v3-base-fake-hotel-review-detector/checkpoint-500/tokenizer_config.json
Special tokens file saved in deberta-v3-base-fake-hotel-review-detector/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: text. If tex

TrainOutput(global_step=1000, training_loss=0.06719247192144394, metrics={'train_runtime': 1928.1918, 'train_samples_per_second': 8.296, 'train_steps_per_second': 0.519, 'total_flos': 2294295547550016.0, 'train_loss': 0.06719247192144394, 'epoch': 2.0})

In [31]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4


{'eval_loss': 0.044512223452329636,
 'eval_f1': 0.9914990627716705,
 'eval_precision': 0.9914722868071701,
 'eval_recall': 0.9915858459765701,
 'eval_accuracy': 0.9915,
 'eval_runtime': 58.215,
 'eval_samples_per_second': 34.355,
 'eval_steps_per_second': 8.589,
 'epoch': 2.0}

In [37]:
model.push_to_hub('dbauer1860/deberta-v3-base-fake-hotel-review-detector', create_pr=1)

Configuration saved in deberta-v3-base-fake-hotel-review-detector/config.json
Model weights saved in deberta-v3-base-fake-hotel-review-detector/pytorch_model.bin
Uploading the following files to dbauer1860/deberta-v3-base-fake-hotel-review-detector: pytorch_model.bin,config.json


CommitInfo(commit_url='https://huggingface.co/dbauer1860/deberta-v3-base-fake-hotel-review-detector/commit/4261db6392d0f9dc0046cf1cea10b1f5caba02c5', commit_message='Upload DebertaV2ForSequenceClassification', commit_description='', oid='4261db6392d0f9dc0046cf1cea10b1f5caba02c5', pr_url='https://huggingface.co/dbauer1860/deberta-v3-base-fake-hotel-review-detector/discussions/1', pr_revision='refs/pr/1', pr_num=1)