<a href="https://colab.research.google.com/github/componavt/neural_synset/blob/master/src/dataset/wikt_labels_tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading labels with definitions

Source code: [Loading a custom dataset](https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/videos/load_custom_dataset.ipynb#scrollTo=D2ekPOyykZDq), [video](https://www.youtube.com/watch?v=HyQgpJTkRdE).

Video: [The pipeline function](https://www.youtube.com/watch?v=tiZFewofSLM).

Install the Transformers and Datasets libraries to run this notebook.

In [None]:
# accelerate and torch are required by TrainingArguments
!pip install -U accelerate "transformers[sentencepiece,torch]"
!pip install datasets torch

In [None]:
!wget https://github.com/componavt/neural_synset/raw/master/data/label_meaning.csv

In [None]:
!cat label_meaning.csv

In [None]:
from datasets import load_dataset

ds = load_dataset("csv", data_files="label_meaning.csv", sep="|")
#ds = load_dataset("csv", data_files="label_meaning.csv", sep="|", split='train')
#ds["train"]
ds
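
The `sep="|"` argument tells the CSV loader to split columns on the pipe character, so commas inside a definition stay within one field. A minimal sketch of the same parsing with the standard library is shown here; the two sample rows and the `label` column name are assumptions made for illustration (only the `meaning` column is referenced elsewhere in this notebook).

```python
import csv
import io

# Hypothetical sample in a pipe-delimited layout; the real file may differ.
sample = (
    "label|meaning\n"
    "ironic|one who talks a lot, mostly nonsense; a chatterbox\n"
    "bookish|a made-up second definition\n"
)

# delimiter="|" plays the same role as sep="|" in load_dataset("csv", ...).
rows = list(csv.DictReader(io.StringIO(sample), delimiter="|"))
for row in rows:
    print(row["label"], "->", row["meaning"])
```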

In [5]:
# 80% train, 20% test
#da = ds["train"].train_test_split(test_size=0.2, shuffle=True)
#da
#da["train"]
#da["test"]

In [6]:
#print(da["train"][0])
#print(len(da["train"]))

# Pipeline: zero shot classification with labels

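When more than one label is passed, the pipeline assumes that exactly one label is true and the others are false, so the output probabilities add up to 1. To score each label independently instead, pass `multi_label=True` (this argument was called `multi_class` in the original pull request and was renamed in later releases of Transformers): `pipe(sequence_to_classify, candidate_labels, multi_label=True)`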

Source: huggingface/transformers, [Zero shot classification pipeline #5760](https://github.com/huggingface/transformers/pull/5760).
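
The difference between the two modes described above (one true label versus independently scored labels) comes down to where the softmax is applied. The sketch below illustrates this with made-up NLI logits; the label names and numbers are hypothetical, and the code is a simplified illustration rather than the pipeline's actual implementation.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical (contradiction, entailment) logits per candidate label.
# The neutral logit of the NLI model is ignored in this sketch.
logits = {"A": (-1.0, 2.0), "B": (0.5, 1.5), "C": (2.0, -1.0)}

# One true label: softmax over the entailment logits of all labels.
entailment = [pair[1] for pair in logits.values()]
single = dict(zip(logits, softmax(entailment)))

# Independent labels: softmax over (contradiction, entailment) per label.
multi = {label: softmax(list(pair))[1] for label, pair in logits.items()}

print("single:", single)  # scores sum to 1 across labels
print("multi: ", multi)   # each score lies in (0, 1) on its own
```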

In [None]:
from transformers import pipeline
model_name = "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"
pipe = pipeline("zero-shot-classification", model=model_name)

In [8]:
#def meaning_iterator():
#    for example in da["train"]:
#        yield example["meaning"]

#print(len(da["train"]))
#da["train"][0]["meaning"]
#pipe(da["train"][0]["meaning"], ["positive", "negative"], multi_label=True)
#pipe(meaning_iterator(), ["positive", "negative"], multi_label=True)

In [9]:
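# Example definition (Russian): "one who talks a lot of empty, frivolous things; a chatterbox"
#sequence_to_classify = "тот, кто говорит много пустого и несерьёзного; болтун"
# Wiktionary usage labels (Russian abbreviations): bookish, ironic, religious, coarse
candidate_labels = ["книжн.", "ирон.", "религ.", "груб."]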
#pipe(sequence_to_classify, candidate_labels, multi_label=True)

#pipe(meaning_iterator(), candidate_labels, multi_label=True)
#pipe(da["train"][0]["meaning"], candidate_labels, multi_label=True)
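
Once the pipeline runs on a list of definitions, each result is a dictionary with `sequence`, `labels`, and `scores` keys. One possible way to turn those scores into label assignments is a simple threshold, sketched below. The helper name `labels_above_threshold`, the 0.5 cutoff, and the `fake_classifier` stand-in (used so the example runs without downloading a model) are all assumptions for illustration.

```python
def labels_above_threshold(classifier, texts, candidate_labels, threshold=0.5):
    """Return, for each text, the candidate labels scored above `threshold`."""
    results = classifier(texts, candidate_labels, multi_label=True)
    # Some versions may return a single dict for a one-element input.
    if isinstance(results, dict):
        results = [results]
    return [
        [label for label, score in zip(r["labels"], r["scores"]) if score > threshold]
        for r in results
    ]

def fake_classifier(texts, candidate_labels, multi_label=True):
    # Stand-in for `pipe`: scores the first label 0.9 and every other label 0.1.
    return [
        {
            "sequence": text,
            "labels": list(candidate_labels),
            "scores": [0.9] + [0.1] * (len(candidate_labels) - 1),
        }
        for text in texts
    ]

picked = labels_above_threshold(fake_classifier, ["text one", "text two"], ["книжн.", "ирон."])
print(picked)
```

In the notebook itself, `pipe` would be passed in place of `fake_classifier`.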

# AutoTokenizer and PyTorch optimized training loop
From [quicktour.ipynb#AutoTokenizer](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/quicktour.ipynb#scrollTo=c-mB_1hXw57y&line=1&uniqifier=1)

In [None]:
# setup hyperparameters (learning rate, batch size, and the number of epochs to train for)
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./pt_training",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
)

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
#encoding = tokenizer( da["train"][0]["meaning"] )
#print(encoding)

In [None]:
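def tokenize_dataset(dataset):
    # With batched=True, `dataset` is a batch: a dict mapping column names to lists of values.
    # truncation=True cuts inputs that exceed the model's maximum length.
    return tokenizer(dataset["meaning"], truncation=True)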

In [None]:
#da["train"]
ds

In [None]:
#dataset = ds["train"].map(tokenize_dataset, batched=True)
dataset = ds.map(tokenize_dataset, batched=True)
dataset

In [None]:
# create a batch of examples from dataset
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
data_collator
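
Conceptually, the collator pads every example in a batch to the length of the longest one and marks the padded positions in the attention mask. The pure-Python sketch below illustrates that idea only; the `pad_batch` helper, right-side padding, and `pad_token_id=0` are assumptions, whereas the real collator takes the padding settings from the tokenizer and returns tensors.

```python
def pad_batch(features, pad_token_id=0):
    """Pad each example's input_ids to the longest length in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch

batch = pad_batch([{"input_ids": [5, 6, 7]}, {"input_ids": [8]}])
print(batch)
```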

In [None]:
# 80% train, 20% test
dataset = dataset['train'].train_test_split(test_size=0.2, shuffle=True)
dataset

In [None]:
# gather all these classes in Trainer:
from transformers import Trainer
trainer = Trainer(
    model=pt_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
trainer.train()

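# If training fails with "The model did not return a loss from the inputs",
# the dataset passed to Trainer is probably missing an integer `labels` column. See
# https://discuss.huggingface.co/t/the-model-did-not-return-a-loss-from-the-inputs-only-the-following-keys-logits-for-reference-the-inputs-it-received-are-input-values/25420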
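
If `trainer.train()` stops with "The model did not return a loss", one likely cause is that the tokenized dataset has no integer `labels` column for the model to compute a loss against. The sketch below shows one way to derive such a column from string labels. The `label` column name, the helper names, and the sample batch are assumptions, since the actual CSV header is not shown in this notebook.

```python
def build_label2id(label_names):
    """Map each distinct label string to a stable integer id."""
    return {name: i for i, name in enumerate(sorted(set(label_names)))}

def encode_labels(batch, label2id, label_column="label"):
    """Return a new `labels` column of integer ids for a batch (dict of lists)."""
    return {"labels": [label2id[name] for name in batch[label_column]]}

# Hypothetical batch in the dict-of-lists form used by batched map functions.
batch = {
    "label": ["ирон.", "книжн.", "ирон."],
    "meaning": ["first definition", "second definition", "third definition"],
}
label2id = build_label2id(batch["label"])
encoded = encode_labels(batch, label2id)
print(label2id, encoded)
```

In the notebook, something like `dataset.map(encode_labels, batched=True, fn_kwargs={"label2id": label2id})` could add the column before training. The classification head would then need as many outputs as there are labels, for example by loading the model with `num_labels=len(label2id)` and `ignore_mismatched_sizes=True`. Both steps are suggestions under the assumptions above, not something verified against this dataset.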