## "Data is the new Oil"

When solving a data science problem, we run into two major challenges:
* Availability of data
* Quality of the data, even when we do find some

Hugging Face hosts 1000+ quality datasets for tasks such as text classification, summarization, computer vision, and many others.

What makes this even more appealing is that many of these high-quality datasets are released under open licenses (check each dataset card for its specific terms).

In this notebook, we shall learn how to:

* Quickly and easily load datasets from the Hugging Face Hub
* Work with datasets (splitting, indexing, slicing)
* Apply operations on datasets (sort, shuffle, select, ...)


References

* https://huggingface.co/datasets
* https://huggingface.co/docs/datasets/v2.15.0/process

### Installing and importing the libraries

In [None]:
!pip install datasets

In [None]:
from datasets import load_dataset

### Working with the data

In [None]:
#loading the dataset of choice


data = load_dataset("dair-ai/emotion")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.



In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

In [None]:
#loading specific split
data_tr = load_dataset("dair-ai/emotion", split="train")
data_tr_sample = load_dataset("dair-ai/emotion", split="train[:1000]+validation[:200]")
print(data_tr)
print(data_tr_sample)

Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})
Dataset({
    features: ['text', 'label'],
    num_rows: 1200
})
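
The split expression follows Python's slicing notation, with `+` concatenating the selected pieces. As a rough mental model only (this is not how the library resolves splits internally), the row count of 1200 can be reproduced with plain lists standing in for the two splits. The names `train_rows` and `validation_rows` are illustrative.

```python
# Stand-ins for the two splits: row indices only, no real data.
train_rows = list(range(16000))
validation_rows = list(range(2000))

# Mirrors split="train[:1000]+validation[:200]".
sample = train_rows[:1000] + validation_rows[:200]

print(len(sample))  # 1200
```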


In [None]:
#Indexing and slicing
data_tr[0]

{'text': 'i didnt feel humiliated', 'label': 0}

In [None]:
data_tr[:5]  #returns a dict that maps each column name to a list of values

{'text': ['i didnt feel humiliated',
  'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
  'im grabbing a minute to post i feel greedy wrong',
  'i am ever feeling nostalgic about the fireplace i will know that it is still on the property',
  'i am feeling grouchy'],
 'label': [0, 0, 3, 2, 3]}
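
A single index returns one row as a dict, whereas a slice returns a dict of columns, each holding a list. When row-by-row access is more convenient, the column-oriented batch can be transposed in plain Python. The sketch below hard-codes the first three rows shown above in place of `data_tr[:3]` so that it runs standalone; the names `batch` and `rows` are illustrative.

```python
# Column-oriented batch, as returned by a slice such as data_tr[:3]
# (values copied from the output above).
batch = {
    "text": [
        "i didnt feel humiliated",
        "i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake",
        "im grabbing a minute to post i feel greedy wrong",
    ],
    "label": [0, 0, 3],
}

# Transpose into a list of row dicts, one per example.
rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]

for row in rows:
    print(row["label"], row["text"])
```

Each element of `rows` has the same shape as the result of `data_tr[0]`.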

In [None]:
#downloading a Hugging Face dataset repository with git
#git and git-lfs must already be installed on the system (git is not a pip package)
!git lfs install

Git LFS initialized.


In [None]:
!git clone https://huggingface.co/datasets/Open-Orca/FLAN

Cloning into 'FLAN'...
remote: Enumerating objects: 2233, done.
remote: Total 2233 (delta 0), reused 0 (delta 0), pack-reused 2233
Receiving objects: 100% (2233/2233), 329.30 KiB | 1.98 MiB/s, done.
Resolving deltas: 100% (26/26), done.
Updating files: 100% (2169/2169), done.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'


Exiting because of "interrupt" signal.
^C


In [None]:
flan = load_dataset("./FLAN")
flan

Resolving data files:   0%|          | 0/97 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

KeyboardInterrupt: ignored

### Dataset operations

In [None]:
print(data_tr["label"][:10], data_tr["text"][:10])

[0, 0, 3, 2, 3, 0, 5, 4, 1, 2] ['i didnt feel humiliated', 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake', 'im grabbing a minute to post i feel greedy wrong', 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property', 'i am feeling grouchy', 'ive been feeling a little burdened lately wasnt sure why that was', 'ive been taking or milligrams or times recommended amount and ive fallen asleep a lot faster but i also feel like so funny', 'i feel as confused about life as a teenager or as jaded as a year old man', 'i have been with petronas for years i feel that petronas has performed well and made a huge profit', 'i feel romantic too']
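
The integer labels are easier to read once they are mapped to names and counted. The sketch below uses the ten labels printed above. The id-to-name mapping is an assumption based on the dataset card; if the `label` column is a `ClassLabel`, it can be confirmed with `data_tr.features["label"].names`.

```python
from collections import Counter

# First ten training labels, copied from the output above.
labels = [0, 0, 3, 2, 3, 0, 5, 4, 1, 2]

# Assumed id-to-name mapping (from the dataset card); verify it with
# data_tr.features["label"].names before relying on it.
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Count how often each emotion appears in the sample.
counts = Counter(label_names[i] for i in labels)

for name, n in counts.most_common():
    print(f"{name:<10}{n}")
```

Passing the full label column to `Counter` in the same way should give the class distribution of the whole split.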


In [None]:
# Sorting
data_sorted = data_tr.sort("label")
data_sorted[:10]

{'text': ['i didnt feel humiliated',
  'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
  'ive been feeling a little burdened lately wasnt sure why that was',
  'i feel like i have to make the suffering i m seeing mean something',
  'i feel low energy i m just thirsty',
  'i didnt really feel that embarrassed',
  'i feel pretty pathetic most of the time',
  'i started feeling sentimental about dolls i had as a child and so began a collection of vintage barbie dolls from the sixties',
  'i still love my so and wish the best for him i can no longer tolerate the effect that bm has on our lives and the fact that is has turned my so into a bitter angry person who is not always particularly kind to the people around him when he is feeling stressed',
  'i feel so inhibited in someone elses kitchen like im painting on someone elses picture'],
 'label': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

In [None]:
data["train"]

Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})

In [None]:
#shuffle
data_tr.shuffle(seed=42)[:10]

{'text': ['while cycling in the country',
  'i had pocket qq and was feeling pretty confident lol',
  'i am in no way complaining or whining or feeling ungrateful',
  'i feel a bit stressed because it feels like im supposed to do something all the time and that i should be reading now',
  'i tell the people closest to me things that i am feeling and its as if they arent surprised because theyd known it all along',
  'when my relatives and i were in a car going slowly on a frozen road',
  'i am trying to work on finding the joy in the simple thing that god is finding joy in my obedience to him even if it doesn t feel very joyful in the way that i am used to',
  'i know intellectually that it s not true but i feel entirely isolated',
  'i suppose he feels badly because he was a bit skeptical of her pain over the last few months shes had a hyperchondria and exaggeration habit in the past though he never openly questioned her about it',
  'i didn t feel like doing much chris and i mostly j

In [None]:
#select with indices
data_tr.select([0,10,340,21])

Dataset({
    features: ['text', 'label'],
    num_rows: 4
})

In [None]:
#filter
data_tr.filter(lambda x: x["label"]==0)[:10]

{'text': ['i didnt feel humiliated',
  'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
  'ive been feeling a little burdened lately wasnt sure why that was',
  'i feel like i have to make the suffering i m seeing mean something',
  'i feel low energy i m just thirsty',
  'i didnt really feel that embarrassed',
  'i feel pretty pathetic most of the time',
  'i started feeling sentimental about dolls i had as a child and so began a collection of vintage barbie dolls from the sixties',
  'i still love my so and wish the best for him i can no longer tolerate the effect that bm has on our lives and the fact that is has turned my so into a bitter angry person who is not always particularly kind to the people around him when he is feeling stressed',
  'i feel so inhibited in someone elses kitchen like im painting on someone elses picture'],
 'label': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

In [None]:
#split
data_tr.train_test_split(test_size=0.1)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 14400
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1600
    })
})