<a href="https://colab.research.google.com/github/componavt/neural_synset/blob/master/src/dataset/wikt_labels_loading_a_custom_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading a custom dataset

Source code: [Loading a custom dataset](https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/videos/load_custom_dataset.ipynb#scrollTo=D2ekPOyykZDq), [video](https://www.youtube.com/watch?v=HyQgpJTkRdE).

Video: [The pipeline function](https://www.youtube.com/watch?v=tiZFewofSLM).

Install the Transformers and Datasets libraries to run this notebook.

In [1]:
! pip install datasets transformers[sentencepiece]
! pip install torch               # required by TrainingArguments
! pip install transformers[torch] # required by TrainingArguments

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets

In [2]:
!wget https://github.com/componavt/neural_synset/raw/master/data/label_meaning.csv

--2024-03-26 10:05:09--  https://github.com/componavt/neural_synset/raw/master/data/label_meaning.csv
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/componavt/neural_synset/master/data/label_meaning.csv [following]
--2024-03-26 10:05:09--  https://raw.githubusercontent.com/componavt/neural_synset/master/data/label_meaning.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1227 (1.2K) [text/plain]
Saving to: ‘label_meaning.csv’


2024-03-26 10:05:09 (58.7 MB/s) - ‘label_meaning.csv’ saved [1227/1227]



In [3]:
cat label_meaning.csv

"word"|"meaning"|"книжн."|"ирон."|"религ."|"груб."
подвизаться|осуществлять деятельность, работать, действовать в какой-нибудь области|1|1|0|0
подвизаться|совершать подвиг в чём-либо, часто о ежедневном борении|0|0|1|0
заткнуться|то же, что замолчать; перестать говорить, кричать, плакать; замолкнуть|0|0|0|1
пустобрёх|тот, кто говорит много пустого и несерьёзного; болтун|0|0|0|1
излаять|сильно изругать|0|0|0|1
бизнес-дама|о предпринимательнице|0|1|0|0
агнец божий|кроткий, робкий, безобидный человек|0|1|0|0
всезнайка|человек, который считает себя знающим всё|0|1|0|0
галантерейный|относящийся к галантерее|0|0|0|0
галантерейный|чрезмерно любезный, вежливый до слащавости|0|1|0|0
дитятя|дитя, ребёнок, чадо|0|1|0|0


In [4]:
from datasets import load_dataset

ds = load_dataset("csv", data_files="label_meaning.csv", sep="|")
#ds = load_dataset("csv", data_files="label_meaning.csv", sep="|", split='train')
#ds["train"]
ds

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['word', 'meaning', 'книжн.', 'ирон.', 'религ.', 'груб.'],
        num_rows: 11
    })
})
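
To see how the pipe-delimited file maps onto rows and label columns without the Datasets library, here is a minimal standard-library sketch. The sample text is two rows copied from `label_meaning.csv`; the `multi_hot` helper is a hypothetical name introduced only for illustration and is not part of the notebook.

```python
import csv
import io

# Two rows copied from label_meaning.csv (header plus data), pipe-delimited.
SAMPLE = '''"word"|"meaning"|"книжн."|"ирон."|"религ."|"груб."
подвизаться|осуществлять деятельность, работать, действовать в какой-нибудь области|1|1|0|0
излаять|сильно изругать|0|0|0|1
'''

# The four label columns, in file order.
LABELS = ["книжн.", "ирон.", "религ.", "груб."]


def multi_hot(row):
    """Hypothetical helper: turn the four 0/1 columns into a list of floats."""
    return [float(row[name]) for name in LABELS]


# delimiter="|" plays the same role as sep="|" in load_dataset above.
rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="|"))

for row in rows:
    print(row["word"], multi_hot(row))
```

A row can carry several labels at once (the first sample row is marked with two), which is why the later zero-shot cells pass `multi_label=True`.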

In [5]:
# 80% train, 20% test
#da = ds["train"].train_test_split(test_size=0.2, shuffle=True)
#da
#da["train"]
#da["test"]

In [6]:
#print(da["train"][0])
#print(len(da["train"]))

# Pipeline: zero-shot classification with labels

When more than one label is passed, the pipeline assumes by default that exactly one label is true and the others are false, so the output probabilities add up to 1. To score each label independently instead, pass `multi_label=True` (this argument was called `multi_class` in the original pull request):
`pipe(sequence_to_classify, candidate_labels, multi_label=True)`

Source: huggingface/transformers/[Zero shot classification pipeline #5760](https://github.com/huggingface/transformers/pull/5760).
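
The difference between the two scoring modes can be sketched in plain Python. This is an illustration under assumptions, not the pipeline's source: the logits are invented numbers, and the function names `single_label_scores` and `multi_label_scores` are hypothetical. The idea is that an NLI model produces a contradiction logit and an entailment logit for each candidate label; the single-label mode normalizes the entailment logits across labels, while the multi-label mode normalizes entailment against contradiction for each label separately.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Invented NLI logits per candidate label: (contradiction, entailment).
logits = {
    "ирон.": (-1.0, 2.0),
    "груб.": (0.5, 1.5),
    "религ.": (3.0, -2.0),
}


def single_label_scores(label_logits):
    """Softmax over the entailment logits of all labels: scores sum to 1."""
    names = list(label_logits)
    probs = softmax([label_logits[name][1] for name in names])
    return dict(zip(names, probs))


def multi_label_scores(label_logits):
    """Per-label softmax over (contradiction, entailment): independent scores."""
    return {
        name: softmax([contradiction, entailment])[1]
        for name, (contradiction, entailment) in label_logits.items()
    }


single = single_label_scores(logits)
multi = multi_label_scores(logits)
print("single-label sum:", round(sum(single.values()), 6))
print("multi-label sum: ", round(sum(multi.values()), 6))
```

In the multi-label case nothing forces the scores to sum to 1, which suits data where one meaning can carry several usage labels.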

In [7]:
from transformers import pipeline
model_name = "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"
pipe = pipeline("zero-shot-classification", model=model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.07k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/558M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.26k [00:00<?, ?B/s]

spm.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/16.3M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/23.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/286 [00:00<?, ?B/s]

In [8]:
def meaning_iterator():
    # Yield the "meaning" text of every row in the train split.
    # The original loop read from `da`, which is only defined in commented-out code.
    for meaning in ds["train"]["meaning"]:
        yield meaning

#print(len(ds["train"]))
#ds["train"][0]["meaning"]
#pipe(ds["train"][0]["meaning"], ["positive", "negative"], multi_label=True)
#pipe(meaning_iterator(), ["positive", "negative"], multi_label=True)

In [9]:
#sequence_to_classify = "тот, кто говорит много пустого и несерьёзного; болтун"
# Usage labels (column names in label_meaning.csv): bookish, ironic, religious, vulgar
candidate_labels = ["книжн.", "ирон.", "религ.", "груб."]
#pipe(sequence_to_classify, candidate_labels, multi_label=True)

#pipe(meaning_iterator(), candidate_labels, multi_label=True)
#pipe(ds["train"][0]["meaning"], candidate_labels, multi_label=True)

# AutoTokenizer and PyTorch optimized training loop
From [quicktour.ipynb#AutoTokenizer](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/quicktour.ipynb#scrollTo=c-mB_1hXw57y&line=1&uniqifier=1)

In [None]:
! pip install accelerate -U

# Set up hyperparameters (learning rate, batch size, and number of training epochs)
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./pt_training",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    )

In [11]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [12]:
#encoding = tokenizer(ds["train"][0]["meaning"])
#print(encoding)

In [13]:
def tokenize_dataset(batch):
    # With batched=True, batch["meaning"] is a list of strings.
    return tokenizer(batch["meaning"])

In [14]:
#da["train"]
ds

DatasetDict({
    train: Dataset({
        features: ['word', 'meaning', 'книжн.', 'ирон.', 'религ.', 'груб.'],
        num_rows: 11
    })
})

In [15]:
#dataset = ds["train"].map(tokenize_dataset, batched=True)
dataset = ds.map(tokenize_dataset, batched=True)
dataset

Map:   0%|          | 0/11 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['word', 'meaning', 'книжн.', 'ирон.', 'религ.', 'груб.', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 11
    })
})

In [16]:
# create a batch of examples from dataset
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
data_collator

DataCollatorWithPadding(tokenizer=DebertaV2TokenizerFast(name_or_path='MoritzLaurer/mDeBERTa-v3-base-mnli-xnli', vocab_size=250101, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	250101: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, padding=True, max_length
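
What the collator does to a batch can be sketched without any library. This is a simplified illustration, not the `DataCollatorWithPadding` implementation: it assumes right-side padding and pad id 0, both taken from the tokenizer output above, and it skips tensor conversion. The `pad_batch` name and the token ids are hypothetical.

```python
PAD_ID = 0  # "[PAD]" has id 0 in the tokenizer output shown above


def pad_batch(features, pad_id=PAD_ID):
    """Right-pad every example to the length of the longest one in the batch."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        # Padded positions get attention 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


# Hypothetical token ids; 1 and 2 stand for [CLS] and [SEP].
features = [
    {"input_ids": [1, 10, 11, 2], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [1, 12, 2], "attention_mask": [1, 1, 1]},
]

batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps short batches small, which is the usual reason for leaving padding to the collator instead of the tokenizer call.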

In [17]:
# 80% train, 20% test (no separate validation split is created)
dataset = dataset['train'].train_test_split(test_size=0.2, shuffle=True)
dataset

DatasetDict({
    train: Dataset({
        features: ['word', 'meaning', 'книжн.', 'ирон.', 'религ.', 'груб.', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 8
    })
    test: Dataset({
        features: ['word', 'meaning', 'книжн.', 'ирон.', 'религ.', 'груб.', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3
    })
})
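
With 11 rows, a 20% test share is not a whole number, so the split above ends up as 8 and 3. The sketch below reproduces those sizes under the assumption that the test share is rounded up; that assumption is consistent with the output shown here but is not taken from the library source. The function names are hypothetical.

```python
import math
import random


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test


def train_test_split_indices(n_rows, test_size, seed=0):
    """Shuffle row indices and cut them into train and test parts."""
    n_train, _ = split_sizes(n_rows, test_size)
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


n_train, n_test = split_sizes(11, 0.2)
train_idx, test_idx = train_test_split_indices(11, 0.2)
print(n_train, n_test)
print(sorted(train_idx), sorted(test_idx))
```

On a dataset this small, a three-row test split gives only a very rough signal, so any evaluation numbers from it should be read with caution.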

In [18]:
# Gather all these pieces in a Trainer.
# Trainer expects a model object, not a model name: passing model=model_name
# raised "AttributeError: 'str' object has no attribute 'to'".
from transformers import AutoModelForSequenceClassification, Trainer

model = AutoModelForSequenceClassification.from_pretrained(model_name)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)