# Datasets

I probably won't use this capability much. Most likely I'll push all the data into a sqlite3 db and then use my sqlite3 `Dataset`. HF has a dataset hub where I can explore the data pretty well without having to download it. In this notebook I'll go over the concepts I expect to use most often. [Here is a good tutorial](https://huggingface.co/learn/nlp-course/chapter5/1?fw=pt) with a lot more detail.

When I call `load_dataset`, it downloads the dataset into the `~/.cache/huggingface/datasets` directory. I can change this if needed by setting the `HF_HOME` environment variable, in which case the datasets land in `$HF_HOME/datasets`.
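
Here is a rough sketch of how I understand the cache location to be resolved, written as plain Python so it runs without the library. The helper name `resolve_datasets_cache` is made up for illustration. The variables `HF_DATASETS_CACHE`, `HF_HOME` and `XDG_CACHE_HOME` are ones the library reads, but I'm assuming this precedence and it may differ between versions.

```python
import os
from pathlib import Path


def resolve_datasets_cache(env=None):
    """Hypothetical helper: mimic the default datasets cache lookup."""
    env = os.environ if env is None else env
    # Assumption: the most specific variable wins if it is set.
    if "HF_DATASETS_CACHE" in env:
        return Path(env["HF_DATASETS_CACHE"]).expanduser()
    # Otherwise the datasets cache lives in a "datasets" folder under HF_HOME.
    xdg_cache = env.get("XDG_CACHE_HOME", "~/.cache")
    hf_home = env.get("HF_HOME", os.path.join(xdg_cache, "huggingface"))
    return Path(hf_home).expanduser() / "datasets"


print(resolve_datasets_cache({}))
print(resolve_datasets_cache({"HF_HOME": "/data/hf"}))
```

For a one-off change, `load_dataset` also accepts a `cache_dir` argument, which avoids touching the environment at all.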



Here I load the MRPC dataset, which is one of the 10 datasets in the GLUE benchmark. Each instance is a pair of sentences with a binary label that indicates whether the two sentences are equivalent or not.

In [1]:
from datasets import load_dataset

In [2]:
mrpc = load_dataset("glue", "mrpc")
mrpc

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [3]:
mrpc["train"].features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [4]:
mrpc["train"][0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

The `Dataset` object has a lot of pandas-like methods such as `select`, `shuffle`, `map`, and `filter`.

In [11]:
# This selects the first 10 instances in the dataset
small_mrpc = mrpc["train"].select(range(10))
small_mrpc

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 10
})

In [14]:
for i in range(10):
    # Single quotes inside the f-string keep this working on Python < 3.12
    print(f"[{i}]: {small_mrpc[i]['sentence1']}\n")

[0]: Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .

[1]: Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .

[2]: They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .

[3]: Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .

[4]: The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .

[5]: Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier .

[6]: The Nasdaq had a weekly gain of 17.27 , or 1.2 percent , closing at 1,520.15 on Friday .

[7]: The DVD-CCA then appealed to the state Supreme Court .

[8]: That compared with $ 35.18 million , or 24 cents per share , in the year-ago period .

[9]: Shares of Genentech , a much larger company with several products on the market , rose 

In [15]:
# This first shuffles the dataset and then selects the first 10 instances
shuffled_small_mrpc = mrpc["train"].shuffle(seed=1).select(range(10))
for i in range(10):
    # Single quotes inside the f-string keep this working on Python < 3.12
    print(f"[{i}]: {shuffled_small_mrpc[i]['sentence1']}\n")

[0]: " Germany is on the right path , " Mr. Schrder said , specifying his government 's announced structural reform plan , known as Agenda 2010 , and cuts in taxes .

[1]: The Saudi newspaper Okaz reported Monday that suspects who escaped Saturday 's raid fled in a car that broke down on the outskirts of Mecca Sunday afternoon .

[2]: She was surrounded by about 50 women who regret having abortions .

[3]: Democratic Lt. Gov. Cruz Bustamante and Republican state Sen. Tom McClintock are in different political worlds .

[4]: The decades-long conflict has killed more than 10,000 people in the resource-rich province , most of them civilians .

[5]: Energy prices dropped by 8.6 percent , the biggest decline since July 1986 .

[6]: But it now has 600 fewer stores - about 1,500 in all - and a $ 2 billion loan to compete against bigger merchants .

[7]: No legislative action is final without concurrence of the House , and it appeared the measure faced a tougher road there .

[8]: Ebert asked F

In [8]:
def dbg(x):
    print("START\n", x, "\nEND\n")
    return {}

small_mrpc.map(dbg)

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

START
 {'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0} 
END

START
 {'sentence1': "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'sentence2': "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", 'label': 0, 'idx': 1} 
END

START
 {'sentence1': 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .', 'sentence2': "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .", 'label': 1, 'idx': 2} 
END

START
 {'sentence1': 'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .', 'sentence2': 'Tab sh

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 10
})

We can see that each instance in the dataset is passed to the function (udf) given to `map`. The udf can return any `dict`, which is merged with the original instance `dict`, so any new keys become additional columns. Here `dbg` returns an empty `dict`, so the columns are unchanged.

In [9]:
def lens(x):
    return {
        "len1": len(x["sentence1"]),
        "len2": len(x["sentence2"])
    }

new_small_mrpc = small_mrpc.map(lens)
print(new_small_mrpc)
print(new_small_mrpc[0])


Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx', 'len1', 'len2'],
    num_rows: 10
})
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0, 'len1': 103, 'len2': 111}


I can also call `map` on a dataset with `batched=True` along with a batch size (defaults to 1000). This processes a batch of instances at a time. The input is still a `dict` with the same keys as a single instance, but the value of each key is now a list holding that column's values for every instance in the batch. E.g., with `batch_size=3` the input still has a key called `sentence1`, but its value is a list of the 3 sentences in this batch.

In [10]:
small_mrpc.map(dbg, batched=True, batch_size=3)

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

START
 {'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .'], 'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale ."], 'label': [1, 0, 1], 'idx': [0, 1, 2]} 
END

START
 {'sentence1': ['Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .', 'The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .', 'Revenue in th

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 10
})
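
To make the batch layout concrete, here is a plain-Python sketch of how a list of row dicts turns into column-oriented batches. `to_batches` is a hypothetical helper I wrote to mimic the shape of the input, not how the library does it internally.

```python
def to_batches(rows, batch_size):
    """Hypothetical helper: group row dicts into column-oriented batches."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # One key per column, one list entry per row in the chunk.
        yield {key: [row[key] for row in chunk] for key in chunk[0]}


rows = [{"idx": i, "label": i % 2} for i in range(10)]
batches = list(to_batches(rows, batch_size=3))

print([len(batch["idx"]) for batch in batches])
print(batches[0])
```

With 10 rows and a batch size of 3, the last batch holds only the one leftover row, so a batched udf should not assume every batch is full.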

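Just like in the single-instance case, the batched udf returns a `dict`, but now each value is a list (str -> list). Here each list has one entry per instance in the batch, so the new columns line up with the existing ones.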

In [18]:
def batched_len(batch):
    return {
        "len1": [len(x) for x in batch["sentence1"]],
        "len2": [len(x) for x in batch["sentence2"]],
    }
    

new_small_rpc = small_mrpc.map(batched_len, batched=True, batch_size=3)
new_small_rpc

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx', 'len1', 'len2'],
    num_rows: 10
})

In [19]:
new_small_rpc[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0,
 'len1': 103,
 'len2': 111}