<a href="https://colab.research.google.com/github/componavt/neural_synset/blob/master/src/dataset/Ilya_load_dataset_meaning_label.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading a custom dataset

Source code: [Loading a custom dataset](https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/videos/load_custom_dataset.ipynb#scrollTo=D2ekPOyykZDq), [video](https://www.youtube.com/watch?v=HyQgpJTkRdE).

Video: [The pipeline function](https://www.youtube.com/watch?v=tiZFewofSLM).

Install the Transformers and Datasets libraries to run this notebook.

In [1]:
! pip install datasets transformers[sentencepiece]

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)

In [2]:
!wget https://github.com/componavt/neural_synset/raw/master/data/label_meaning.csv

--2024-03-12 14:21:57--  https://github.com/componavt/neural_synset/raw/master/data/label_meaning.csv
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/componavt/neural_synset/master/data/label_meaning.csv [following]
--2024-03-12 14:21:57--  https://raw.githubusercontent.com/componavt/neural_synset/master/data/label_meaning.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1227 (1.2K) [text/plain]
Saving to: ‘label_meaning.csv’


2024-03-12 14:21:57 (76.8 MB/s) - ‘label_meaning.csv’ saved [1227/1227]



In [3]:
cat label_meaning.csv

"word"|"meaning"|"книжн."|"ирон."|"религ."|"груб."
подвизаться|осуществлять деятельность, работать, действовать в какой-нибудь области|1|1|0|0
подвизаться|совершать подвиг в чём-либо, часто о ежедневном борении|0|0|1|0
заткнуться|то же, что замолчать; перестать говорить, кричать, плакать; замолкнуть|0|0|0|1
пустобрёх|тот, кто говорит много пустого и несерьёзного; болтун|0|0|0|1
излаять|сильно изругать|0|0|0|1
бизнес-дама|о предпринимательнице|0|1|0|0
агнец божий|кроткий, робкий, безобидный человек|0|1|0|0
всезнайка|человек, который считает себя знающим всё|0|1|0|0
галантерейный|относящийся к галантерее|0|0|0|0
галантерейный|чрезмерно любезный, вежливый до слащавости|0|1|0|0
дитятя|дитя, ребёнок, чадо|0|1|0|0
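
The file is pipe-delimited with a quoted header row: `word`, `meaning`, and four binary label columns, книжн. (bookish), ирон. (ironic), религ. (religious) and груб. (rude). A row may set several labels or none, and a word appears once per meaning. As a rough sketch of that layout, the standard-library `csv` module can parse a few rows copied from the output above. The `SAMPLE` string and the `active_labels` helper are illustrative names, not part of the notebook.

```python
import csv
import io

# Three rows copied from label_meaning.csv; the full file has 11 data rows.
SAMPLE = (
    '"word"|"meaning"|"книжн."|"ирон."|"религ."|"груб."\n'
    "подвизаться|осуществлять деятельность, работать, действовать в какой-нибудь области|1|1|0|0\n"
    "излаять|сильно изругать|0|0|0|1\n"
    "галантерейный|относящийся к галантерее|0|0|0|0\n"
)

# delimiter="|" plays the same role as sep="|" in the load_dataset call.
reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="|")
label_columns = reader.fieldnames[2:]  # everything after "word" and "meaning"
rows = list(reader)


def active_labels(row):
    """Return the names of the label columns set to 1 in this row (hypothetical helper)."""
    return [name for name in label_columns if row[name] == "1"]


for row in rows:
    print(row["word"], active_labels(row))
```

Because the delimiter is `|`, the commas inside the `meaning` text do not split the field.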


In [4]:
from datasets import load_dataset

ds = load_dataset("csv", data_files="label_meaning.csv", sep="|")
ds["train"]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['word', 'meaning', 'книжн.', 'ирон.', 'религ.', 'груб.'],
    num_rows: 11
})

In [5]:
# Keep 80% for training and hold out 20% (rounded up to 3 of 11 rows) for test + validation.
train_testvalid = ds["train"].train_test_split(test_size=0.2, shuffle=True)
train_testvalid

DatasetDict({
    train: Dataset({
        features: ['word', 'meaning', 'книжн.', 'ирон.', 'религ.', 'груб.'],
        num_rows: 8
    })
    test: Dataset({
        features: ['word', 'meaning', 'книжн.', 'ирон.', 'религ.', 'груб.'],
        num_rows: 3
    })
})

# Pipeline: zero-shot classification with labels: education, business and politics

In [8]:
ds.features
{
  'word': Value(dtype='string', id=None),
  'meaning': Value(dtype='string', id=None),
  'книжн.': Value(dtype='string', id=None),
  'ирон.': Value(dtype='string', id=None),
  'религ.': Value(dtype='string', id=None),
  'груб.': Value(dtype='string', id=None),
}

AttributeError: 'DatasetDict' object has no attribute 'features'