# NLP Course

Please see the [Hugging Face NLP Course page](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt).

## 5. The 🤗 Datasets library

### What if my dataset isn’t on the Hub?

In [1]:
!ls -la

total 627740
drwxr-xr-x 7 so_olliphant so_olliphant      4096 Feb 18 04:29 .
drwxr-xr-x 3 so_olliphant so_olliphant      4096 Feb 18 02:34 ..
drwxr-xr-x 8 so_olliphant so_olliphant      4096 Feb 18 02:28 .git
drwxr-xr-x 2 so_olliphant so_olliphant      4096 Feb 18 03:06 .ipynb_checkpoints
-rw-r--r-- 1 so_olliphant so_olliphant     14239 Feb 18 02:28 Ch2_p38.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant     26388 Feb 18 02:51 HF_NLP_Ch2.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant     58778 Feb 18 03:05 HF_NLP_Ch3.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant      7373 Feb 18 03:06 HF_NLP_Ch4.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant    299022 Feb 18 03:54 HF_NLP_Ch5.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant    159365 Feb 18 02:28 HF_NLP_Ch6.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant        15 Feb 18 02:28 README.md
-rw-r--r-- 1 so_olliphant so_olliphant   8385528 Feb 18 03:06 SQuAD_it-test.json
-rw-r--r-- 1 so_olliphant so_olliphant   1051245 Feb 18 03:06 SQuAD_it-test.json.gz
-r

----

##### When there are separate datafiles for train, test, etc.

Load the local data file `SQuAD_it-train.json` (after `gunzip`ing it). The examples live in the top-level `data` array of the JSON document, which is why `field="data"` is passed to `load_dataset` below.

In [2]:
!head -n 20 SQuAD_it-train.json

{
    "data": [
        {
            "title": "Terremoto del Sichuan del 2008",
            "paragraphs": [
                {
                    "context": "Il terremoto del Sichuan del 2008 o il terremoto del Gran Sichuan, misurato a 8.0 Ms e 7.9 Mw, e si è verificato alle 02:28:01 PM China Standard Time all' epicentro (06:28:01 UTC) il 12 maggio nella provincia del Sichuan, ha ucciso 69.197 persone e lasciato 18.222 dispersi.",
                    "qas": [
                        {
                            "id": "56cdca7862d2951400fa6826",
                            "answers": [
                                {
                                    "text": "2008",
                                    "answer_start": 29
                                }
                            ],
                            "question": "In quale anno si è verificato il terremoto nel Sichuan?"
                        },
                        {
                            "id": "56cdca7862d295

In [3]:
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})
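
As a rough picture of what `field="data"` selects, the sketch below uses only the standard `json` module on a tiny, made-up document shaped like the SQuAD-it file (the titles and the `version` key are placeholders, not real data). It is not how 🤗 Datasets is implemented; it only illustrates that each element of the top-level `data` array becomes one row.

```python
import json

# Hypothetical miniature of the SQuAD-it layout: one top-level "data" array.
raw = '{"version": "x", "data": [{"title": "A", "paragraphs": []}, {"title": "B", "paragraphs": []}]}'

document = json.loads(raw)
rows = document["data"]          # the part selected by field="data"
column_names = sorted(rows[0])   # keys of one record become the columns

print(len(rows), column_names)
```

The two records play the role of the 442 training rows, and their keys correspond to the `title` and `paragraphs` features.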

----

##### Using a `dict` whose keys are the split names and whose values are the local `gunzip`ped data files

In [4]:
data_files = {
    "train": "SQuAD_it-train.json", 
    "test": "SQuAD_it-test.json"
}

squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

----

##### Using a `dict` whose keys are the split names and whose values are the downloaded `.gz` archive files

In [5]:
data_files = {
    "train": "SQuAD_it-train.json.gz", 
    "test": "SQuAD_it-test.json.gz"
}

squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
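
The cell above shows that `load_dataset` reads the `.gz` archives directly, with no separate `gunzip` step. A minimal standard-library sketch of the equivalent manual round trip follows; the file name and payload are made up, and this is only an illustration of reading JSON through a gzip layer, not the library's own code path.

```python
import gzip
import json
import os
import tempfile

payload = {"data": [{"title": "A", "paragraphs": []}]}  # made-up stand-in

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.json.gz")  # hypothetical file name
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(payload, f)
    # Reading goes through the same gzip layer, so no unpacked copy is written.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        restored = json.load(f)

print(restored == payload)
```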

----

##### Using a `dict` whose keys are the split names and whose values are the URLs of the remote archive files

In [6]:
base_url = "https://github.com/crux82/squad-it/raw/master/"

data_files = {
    "train": f"{base_url}SQuAD_it-train.json.gz",
    "test":  f"{base_url}SQuAD_it-test.json.gz",
}

squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

### Time to slice and dice

In [8]:
from datasets import load_dataset

data_files = {
    "train": "drugsComTrain_raw.tsv", 
    "test": "drugsComTest_raw.tsv"
}

# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

In [9]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
# Peek at the first few examples
drug_sample[:3]

{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than 

> * The `Unnamed: 0` column looks suspiciously like an anonymized ID for each patient.

In [10]:
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

In [11]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)

drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

In [12]:
help(drug_dataset["train"].unique)

Help on method unique in module datasets.arrow_dataset:

unique(column: str) -> List method of datasets.arrow_dataset.Dataset instance
    Return a list of the unique elements in a column.
    
    This is implemented in the low-level backend and as such, very fast.
    
    Args:
        column (`str`):
            Column name (list all the column names with [`~datasets.Dataset.column_names`]).
    
    Returns:
        `list`: List of unique elements in the given column.
    
    Example:
    
    ```py
    >>> from datasets import load_dataset
    >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
    >>> ds.unique('label')
    [1, 0]
    ```



In [13]:
drug_dataset = drug_dataset.filter(lambda r: r['drugName'] is not None and len(r['drugName'].strip()) > 0)
# Check the condition text itself (not drugName again) for emptiness
drug_dataset = drug_dataset.filter(lambda r: r['condition'] is not None and len(r['condition'].strip()) > 0)

print(f"after filtering out rows where drugName or condition are empty:\n{drug_dataset}")

# train
drug_names = [
    n.lower() 
    for n in drug_dataset['train'].unique('drugName')
]
conditions = [
    n.lower() 
    for n in drug_dataset['train'].unique('condition')
]

print(f"train, unique drug names: {len(drug_names)}")
print(f"train, unique conditions: {len(conditions)}")

print()

# test
drug_names = [
    n.lower() 
    for n in drug_dataset['test'].unique('drugName')
]
conditions = [
    n.lower() 
    for n in drug_dataset['test'].unique('condition')
]

print(f"test, unique drug names: {len(drug_names)}")
print(f"test, unique conditions: {len(conditions)}")

after filtering out rows where drugName or condition are empty:
DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 160398
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53471
    })
})
train, unique drug names: 3431
train, unique conditions: 884

test, unique drug names: 2635
test, unique conditions: 708
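
One thing worth noting about the counts above: `unique()` is case-sensitive, and lowercasing its result inside a list comprehension does not merge entries that differ only by case, so the list length equals the case-sensitive count. A small sketch with made-up values (not taken from the dataset) shows the difference a set makes:

```python
# Made-up values standing in for the output of Dataset.unique("condition").
unique_values = ["ADHD", "adhd", "Birth Control", "birth control", "Acne"]

as_list = [v.lower() for v in unique_values]  # duplicates are kept
as_set = {v.lower() for v in unique_values}   # duplicates are merged

print(len(as_list), len(as_set))
```

If a case-insensitive count is what is wanted, the set version is probably the one to use.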


----

> * The `condition` column includes a mix of uppercase and lowercase labels.

In [14]:
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}

drug_dataset = drug_dataset.map(lowercase_condition)

In [15]:
# Check that lowercasing worked
drug_dataset["train"]["condition"][:3]

['left ventricular dysfunction', 'adhd', 'birth control']

...

In [16]:
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}

In [17]:
drug_dataset = drug_dataset.map(compute_review_length)

# Inspect the first training example
drug_dataset["train"][0]

{'patient_id': 206461,
 'drugName': 'Valsartan',
 'condition': 'left ventricular dysfunction',
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'rating': 9.0,
 'date': 'May 20, 2012',
 'usefulCount': 27,
 'review_length': 17}
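
The `review_length` value is a whitespace word count, which can be checked by hand on the review shown above. Note that punctuation and the surrounding quote characters stay attached to their words, so this is only a rough length measure.

```python
# The first training review from the output above, quotes included.
review = '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"'

words = review.split()  # split on runs of whitespace
print(len(words))
```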

In [18]:
drug_dataset["train"].sort("review_length")[:3]

{'patient_id': [111469, 13653, 53602],
 'drugName': ['Ledipasvir / sofosbuvir',
  'Amphetamine / dextroamphetamine',
  'Alesse'],
 'condition': ['hepatitis c', 'adhd', 'birth control'],
 'review': ['"Headache"', '"Great"', '"Awesome"'],
 'rating': [10.0, 10.0, 10.0],
 'date': ['February 3, 2015', 'October 20, 2009', 'November 23, 2015'],
 'usefulCount': [41, 3, 0],
 'review_length': [1, 1, 1]}

In [19]:
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)

print(drug_dataset.num_rows)

{'train': 138514, 'test': 46108}


> * The reviews are of varying length and contain a mix of Python line separators (`\r\n`) as well as HTML character codes like `&#039;`.

In [20]:
import html

In [21]:
%time drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})

CPU times: user 6.49 ms, sys: 4.47 ms, total: 11 ms
Wall time: 9.95 ms


...

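The `map` method's superpowers

With `batched=True`, the function receives a whole batch of examples at once (1,000 by default), so this should be somewhat faster. Note that the millisecond timings recorded in this section are very small for roughly 138k reviews; they most likely reflect results served from the 🤗 Datasets cache rather than a fresh run. Passing `load_from_cache_file=False` to `map` should force recomputation.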

In [22]:
%time _ = drug_dataset.map(lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True)

CPU times: user 0 ns, sys: 11.6 ms, total: 11.6 ms
Wall time: 9.71 ms
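
The two lambdas above differ because of what `map` hands to the function. The sketch below mimics that with plain dictionaries; the function names and sample strings are made up for illustration, and no 🤗 Datasets code is involved.

```python
import html

def unescape_one(example):
    # batched=False: `example` is a single row, so example["review"] is a str.
    return {"review": html.unescape(example["review"])}

def unescape_batch(batch):
    # batched=True: `batch` holds one list per column, so we loop over the list.
    return {"review": [html.unescape(text) for text in batch["review"]]}

single = unescape_one({"review": "I&#039;m fine"})
many = unescape_batch({"review": ["I&#039;m fine", "5 &gt; 3"]})
print(single)
print(many)
```

Fast tokenizers benefit from the batched form for the same reason: they can process the whole list in one call.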


----

In [23]:
from transformers import AutoTokenizer

tokenizer_slow = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
tokenizer_fast = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

def tokenize_function_slow(examples):
    return tokenizer_slow(examples["review"], truncation=True)

def tokenize_function_fast(examples):
    return tokenizer_fast(examples["review"], truncation=True)

In [24]:
%time _ = drug_dataset.map(tokenize_function_slow, batched=True)

CPU times: user 607 ms, sys: 3.43 ms, total: 610 ms
Wall time: 608 ms


In [25]:
%time _ = drug_dataset.map(tokenize_function_slow, batched=False)

CPU times: user 617 ms, sys: 7.41 ms, total: 624 ms
Wall time: 622 ms


In [26]:
%time _ = drug_dataset.map(tokenize_function_fast, batched=True)

CPU times: user 29.3 ms, sys: 91 µs, total: 29.4 ms
Wall time: 27.5 ms


In [27]:
%time _ = drug_dataset.map(tokenize_function_fast, batched=False)

CPU times: user 1min 20s, sys: 368 ms, total: 1min 20s
Wall time: 1min 20s


...

In [28]:
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)

def slow_tokenize_function(examples):
    return slow_tokenizer(examples["review"], truncation=True)

In [29]:
%time _ = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)

CPU times: user 725 ms, sys: 3.95 ms, total: 729 ms
Wall time: 727 ms


...

In [30]:
# `tokenizer` has not been defined in this notebook yet, so load the fast tokenizer here
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

In [31]:
result = tokenize_and_split(drug_dataset["train"][0])
[len(inp) for inp in result["input_ids"]]

[128, 49]

##### WARNING

Mapping `tokenize_and_split` over the dataset with only `batched=True`, as the tutorial does next, raises an error like `ArrowInvalid: Column 8 named input_ids expected length 1000 but got length 1463`.

<hr width=40%/>

##### Uh, wtf? Or, I think we need a bit more clarification...

OK, the HF tutorial is not at all clear about what `return_overflowing_tokens` actually does.

Let's take a quick detour and dig a bit deeper into the tokenizer's behavior when `return_overflowing_tokens=True`...

In [32]:
from transformers import XLMRobertaTokenizerFast

model_id = "xlm-roberta-large-finetuned-conll03-english"
t = XLMRobertaTokenizerFast.from_pretrained(model_id)

sample = "this is an example and context is important to retrieve meaningful contextualized token embeddings from the self attention mechanism of the transformer"
encoded_default = t(sample, truncation=True).input_ids

print()
print(f"Original example string: {sample}")
print()
print(f"Tokenized, this string has {len(t.tokenize(sample))} tokens:\n{t.batch_decode(encoded_default)}")


Original example string: this is an example and context is important to retrieve meaningful contextualized token embeddings from the self attention mechanism of the transformer

Tokenized, this string has 32 tokens:
['<s>', 'this', 'is', 'an', 'example', 'and', 'context', 'is', 'important', 'to', 're', 'tri', 'eve', 'meaning', 'ful', 'context', 'ual', 'ized', 'to', 'ken', '', 'embe', 'dding', 's', 'from', 'the', 'self', 'attention', 'mechanism', 'of', 'the', 'transform', 'er', '</s>']


In [33]:
# now let's limit tokenization to 10 tokens...
encoded_max_length = t(sample, max_length=10, truncation=True).input_ids

print(f"Setting max_length in tokenization to: {len(encoded_max_length)}")
print(f"Now the tokens look like this:\n{t.batch_decode(encoded_max_length)}")

Setting max_length in tokenization to: 10
Now the tokens look like this:
['<s>', 'this', 'is', 'an', 'example', 'and', 'context', 'is', 'important', '</s>']


In [34]:
# and NOW we ask the tokenizer to return_overflowing_tokens
encoded_overflow = t(sample, max_length=10, truncation=True, return_overflowing_tokens=True).input_ids

print([len(x) for x in encoded_overflow])
print(*t.batch_decode(encoded_overflow), sep="\n")

[10, 10, 10, 10]
<s> this is an example and context is important</s>
<s> to retrieve meaningful contextual</s>
<s>ized token embeddings from</s>
<s> the self attention mechanism of the transformer</s>


In [35]:
# but it may be the case that this sequential listing of the overflow 
# results is less-than-ideal input...
# so NOW we ask for striding of the overflow
# with the hope that the resulting strided sequences make more sense
encoded_overflow_stride = t(sample, max_length=10, truncation=True, stride=3, return_overflowing_tokens=True).input_ids

print([len(x) for x in encoded_overflow_stride])
print(*t.batch_decode(encoded_overflow_stride), sep="\n")

[10, 10, 10, 10, 10, 9]
<s> this is an example and context is important</s>
<s> context is important to retrieve meaning</s>
<s>trieve meaningful contextualized to</s>
<s>ualized token embeddings</s>
<s>embeddings from the self attention mechanism</s>
<s> self attention mechanism of the transformer</s>
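
The window lengths printed above can be reproduced with a small pure-Python sliding window. This is a sketch under the assumption that two special tokens (`<s>` and `</s>`) are added to every window and that consecutive windows share `stride` tokens; the function name is made up, and the tokenizer's real implementation may differ in details.

```python
def split_with_stride(tokens, max_length, stride=0, num_special=2):
    """Split `tokens` into (possibly overlapping) windows; illustration only."""
    window = max_length - num_special  # room left after the special tokens
    step = window - stride             # consecutive windows share `stride` tokens
    if step <= 0:
        raise ValueError("stride must be smaller than the usable window")
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += step
    return chunks

tokens = list(range(32))  # stand-in for the 32 tokens of the sample sentence
plain = split_with_stride(tokens, max_length=10)
strided = split_with_stride(tokens, max_length=10, stride=3)

# Add 2 to each length to account for the special tokens.
print([len(c) + 2 for c in plain])
print([len(c) + 2 for c in strided])
```

With `stride=3`, each window starts with the last three tokens of the previous one, which is why "context is important" appears at the end of the first decoded chunk and at the start of the second.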


<hr width=40%/>
...

#### OK, so, where were we...?

> The problem is that we’re trying to mix two different datasets of different sizes: the drug_dataset columns will have a certain number of examples (the 1,000 in our error), but the tokenized_dataset we are building will have more (the 1,463 in the error message; it is more than 1,000 because we are tokenizing long reviews into more than one example by using return_overflowing_tokens=True). That doesn’t work for a Dataset, so we need to either remove the columns from the old dataset or make them the same size as they are in the new dataset....

##### Clarification
* We are trying to build a new dataset of tokenized values that we will use for training.
* Each original row in `drug_dataset` is split into an initial chunk of up to 128 tokens plus, for rows longer than that, one or more additional chunks holding the overflow. There is therefore no one-to-one mapping from the rows of the original `drug_dataset` to the rows of the new tokenized dataset.
* So we need to use the [`remove_columns` argument to `datasets.map`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.map.remove_columns), which _...removes a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept_.
* Remember that for training in this instance, the **tokenized** dataset is expected to have the following fields: `input_ids`, `token_type_ids`, and `attention_mask`.
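
To make the size mismatch concrete, here is a small sketch with hypothetical token counts for three reviews. It assumes 126 content tokens per chunk (128 minus two special tokens) and no stride; the helper name is made up.

```python
def chunk_counts(token_lengths, window=126):
    """Number of chunks each review yields (ceiling division, at least one)."""
    return [max(1, -(-n // window)) for n in token_lengths]

lengths = [50, 175, 300]  # hypothetical token counts for three reviews
counts = chunk_counts(lengths)

# One entry per output chunk, pointing back at the review it came from.
sample_map = [i for i, c in enumerate(counts) for _ in range(c)]

print(counts)
print(sample_map)
```

Three input rows turn into six output rows here, so a column that still has three values cannot sit next to an `input_ids` column that has six. The `sample_map` list plays the role of the tokenizer's `overflow_to_sample_mapping`.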

In [36]:
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)

...

Now compare the resulting `tokenized_dataset` with the original `drug_dataset`...

In [37]:
len(tokenized_dataset["train"]), len(drug_dataset["train"])

(206772, 138514)

In [38]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'],
        num_rows: 68876
    })
})

In [39]:
print(tokenized_dataset["train"][666])

{'input_ids': [101, 107, 1258, 1781, 155, 1548, 3365, 12894, 1673, 1111, 127, 1808, 1106, 16775, 1139, 7011, 117, 2945, 1105, 27447, 3578, 2628, 1114, 1139, 22216, 15203, 117, 146, 8181, 1304, 1376, 3893, 119, 1327, 146, 1225, 4361, 1108, 1103, 160, 27514, 2349, 18784, 119, 146, 3388, 1851, 24119, 1107, 127, 1808, 1114, 1185, 1849, 1107, 5497, 15640, 119, 1258, 191, 8136, 4869, 1142, 4517, 1106, 1139, 3995, 117, 146, 1108, 1678, 1228, 1103, 15683, 1114, 1185, 10602, 3154, 1105, 1107, 1275, 1808, 1159, 146, 1575, 3927, 1233, 4832, 1114, 1185, 6730, 1105, 5497, 170, 2332, 2852, 1822, 176, 1193, 2093, 7257, 7448, 10211, 119, 107, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1,

In [40]:
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 138514
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

In [41]:
print(drug_dataset["train"][0])

{'patient_id': 95260, 'drugName': 'Guanfacine', 'condition': 'adhd', 'review': '"My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \r\nWe have tried many different medications and so far this is the most effective."', 'rating': 8.0, 'date': 'April 27, 2010', 'usefulCount': 192, 'review_length': 141}


<hr width=40%/>
...

In [42]:
def tokenize_and_split(examples):
    result = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result

In [43]:
drug_dataset = drug_dataset.map(tokenize_and_split, batched=True)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 68876
    })
})
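
The loop over `examples.items()` in `tokenize_and_split` repeats each original value once per chunk. A plain-Python sketch with a made-up batch and mapping shows the effect:

```python
# Made-up batch in the dict-of-lists layout that map(batched=True) passes in.
batch = {"patient_id": [11, 22], "rating": [9.0, 4.0]}

# Suppose review 0 produced one chunk and review 1 produced two.
sample_map = [0, 1, 1]

expanded = {key: [values[i] for i in sample_map] for key, values in batch.items()}
print(expanded)
```

Every column now has one value per chunk, so the old columns line up with the new `input_ids`, and a long review simply appears on several consecutive rows.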

...

In [44]:
drug_dataset.set_format("pandas")
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 68876
    })
})

In [45]:
drug_dataset["train"][:3]

|   | patient_id | drugName | condition | review | rating | date | usefulCount | review_length | input_ids | token_type_ids | attention_mask |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 95260 | Guanfacine | adhd | "My son is halfway through his fourth week of ... | 8.0 | April 27, 2010 | 192 | 141 | [101, 107, 1422, 1488, 1110, 9079, 1194, 1117,... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 1 | 95260 | Guanfacine | adhd | "My son is halfway through his fourth week of ... | 8.0 | April 27, 2010 | 192 | 141 | [101, 119, 1124, 1110, 1750, 6438, 113, 170, 1... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 2 | 92703 | Lybrel | birth control | "I used to take another oral contraceptive, wh... | 5.0 | December 14, 2009 | 17 | 134 | [101, 107, 146, 1215, 1106, 1321, 1330, 9619, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |


In [46]:
train_df = drug_dataset["train"][:]

In [47]:
frequencies = (
    train_df["condition"]
    .value_counts()
    .to_frame()
    .reset_index()
    # pandas >= 2.0 names the counts column "count"
    .rename(columns={"count": "frequency"})
)
frequencies.head()

|   | condition | frequency |
|---|---|---|
| 0 | birth control | 43445 |
| 1 | depression | 12322 |
| 2 | acne | 8285 |
| 3 | anxiety | 7523 |
| 4 | pain | 6513 |


In [48]:
from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

Dataset({
    features: ['condition', 'frequency'],
    num_rows: 819
})

In [49]:
drug_avg_rating = (
    train_df.groupby(['drugName'])['rating']
            .mean()
            .to_frame()
            .reset_index()
            .rename(columns={"rating": "avg. rating"})
)
print(drug_avg_rating)

                                  drugName  avg. rating
0                A + D Cracked Skin Relief    10.000000
1                               A / B Otic    10.000000
2     Abacavir / dolutegravir / lamivudine     7.901639
3       Abacavir / lamivudine / zidovudine     9.000000
4                                Abatacept     7.500000
...                                    ...          ...
3047                                 Zyvox     9.210526
3048                               ZzzQuil     4.000000
3049                 depo-subQ provera 104     1.000000
3050                                  ella     7.125000
3051                                femhrt     5.500000

[3052 rows x 2 columns]


In [50]:
drug_dataset.reset_format()

...

In [51]:
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 68876
    })
})

In [52]:
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)

# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")

# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 165417
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 41355
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 68876
    })
})
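
The split sizes above follow from simple arithmetic on the 206,772 training rows. The sketch below shuffles row indices with a fixed seed and cuts them 80/20. It uses Python's `random` module rather than whatever shuffling `train_test_split` performs internally, so the rows selected will differ; only the sizes are meant to match. The helper name is made up.

```python
import random

def split_indices(n_rows, train_size=0.8, seed=42):
    """Shuffle row indices with a fixed seed and cut them into two parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = int(n_rows * train_size)  # round down for the training part
    return indices[:n_train], indices[n_train:]

train_idx, valid_idx = split_indices(206_772)
print(len(train_idx), len(valid_idx))
```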

...

In [53]:
drug_dataset_clean.save_to_disk("drug-reviews")

In [54]:
!ls -la drug-reviews

total 24
drwxr-xr-x 5 so_olliphant so_olliphant 4096 Feb 18 03:26 .
drwxr-xr-x 7 so_olliphant so_olliphant 4096 Feb 18 04:29 ..
-rw-r--r-- 1 so_olliphant so_olliphant   43 Feb 18 04:31 dataset_dict.json
drwxr-xr-x 2 so_olliphant so_olliphant 4096 Feb 18 03:26 test
drwxr-xr-x 2 so_olliphant so_olliphant 4096 Feb 18 03:26 train
drwxr-xr-x 2 so_olliphant so_olliphant 4096 Feb 18 03:26 validation


In [55]:
!ls -la drug-reviews/train

total 191660
drwxr-xr-x 2 so_olliphant so_olliphant      4096 Feb 18 03:26 .
drwxr-xr-x 5 so_olliphant so_olliphant      4096 Feb 18 03:26 ..
-rw-r--r-- 1 so_olliphant so_olliphant 196240960 Feb 18 04:31 data-00000-of-00001.arrow
-rw-r--r-- 1 so_olliphant so_olliphant      1920 Feb 18 04:31 dataset_info.json
-rw-r--r-- 1 so_olliphant so_olliphant       250 Feb 18 04:31 state.json


In [56]:
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("drug-reviews")
drug_dataset_reloaded

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 165417
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 41355
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 68876
    })
})

In [57]:
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")

In [58]:
!ls -la 

total 627536
drwxr-xr-x 7 so_olliphant so_olliphant      4096 Feb 18 04:31 .
drwxr-xr-x 3 so_olliphant so_olliphant      4096 Feb 18 02:34 ..
drwxr-xr-x 8 so_olliphant so_olliphant      4096 Feb 18 02:28 .git
drwxr-xr-x 2 so_olliphant so_olliphant      4096 Feb 18 03:06 .ipynb_checkpoints
-rw-r--r-- 1 so_olliphant so_olliphant     14239 Feb 18 02:28 Ch2_p38.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant     26388 Feb 18 02:51 HF_NLP_Ch2.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant     58778 Feb 18 03:05 HF_NLP_Ch3.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant      7373 Feb 18 03:06 HF_NLP_Ch4.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant     88422 Feb 18 04:31 HF_NLP_Ch5.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant    159365 Feb 18 02:28 HF_NLP_Ch6.ipynb
-rw-r--r-- 1 so_olliphant so_olliphant        15 Feb 18 02:28 README.md
-rw-r--r-- 1 so_olliphant so_olliphant   8385528 Feb 18 03:06 SQuAD_it-test.json
-rw-r--r-- 1 so_olliphant so_olliphant   1051245 Feb 18 03:06 SQuAD_it-test.json.gz
-r

In [59]:
!head -n 1 drug-reviews-train.jsonl

{"patient_id":37933,"drugName":"Adipex-P","condition":"weight loss","review":"\"Been on Adipex-p for 28 days and have lost 20 pounds. Doctor put me on it because I am menopausal and have a hard time losing weight. First week had a hard time sleeping but now no problems, I check my blood pressure daily and no issues there. Definitely get dry mouth with it that has been my only issue, I usually chew gum and keep a drink handy. Glad I was given this drug, doubt I would have lost what I did without it.\"","rating":9.0,"date":"September 5, 2017","usefulCount":35,"review_length":85,"input_ids":[101,107,18511,1113,24930,9717,11708,118,185,1111,1743,1552,1105,1138,1575,1406,6549,119,4157,1508,1143,1113,1122,1272,146,1821,1441,4184,25134,1348,1105,1138,170,1662,1159,3196,2841,119,1752,1989,1125,170,1662,1159,5575,1133,1208,1185,2645,117,146,4031,1139,1892,2997,3828,1105,1185,2492,1175,119,3177,16598,3150,1193,1243,3712,1779,1114,1122,1115,1144,1151,1139,1178,2486,117,146,1932,22572,5773,19956,1
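
The `.jsonl` files use the JSON Lines layout: one complete JSON object per line, one line per row. A minimal standard-library round trip with made-up records illustrates the format (an in-memory buffer stands in for the file):

```python
import io
import json

records = [
    {"patient_id": 1, "rating": 9.0},  # made-up rows
    {"patient_id": 2, "rating": 4.0},
]

buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")  # one JSON object per line

restored = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(restored == records)
```

Because every line parses on its own, such a file can be read row by row without loading the whole thing, and no `field` argument is needed when loading it.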

In [60]:
data_files = {
    "train": "drug-reviews-train.jsonl",
    "validation": "drug-reviews-validation.jsonl",
    "test": "drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)

----

### Big data? 🤗 Datasets to the rescue!

##### What is the Pile?

In [61]:
#from datasets import load_dataset

# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
# The original course URL (commented out below) was replaced with a copy hosted on the Hub
#data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
data_files = "https://huggingface.co/datasets/casinca/PUBMED_title_abstracts_2019_baseline/resolve/main/PUBMED_title_abstracts_2019_baseline.jsonl.zst"

pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
pubmed_dataset

Loading dataset shards:   0%|          | 0/42 [00:00<?, ?it/s]

Dataset({
    features: ['meta', 'text'],
    num_rows: 15518009
})

In [62]:
pubmed_dataset[0]

{'meta': {'pmid': 11409574, 'language': 'eng'},
 'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that i

...

##### The magic of memory mapping

In [63]:
import psutil

# Process.memory_info is expressed in bytes, so convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

RAM used: 2969.16 MB


In [64]:
print(f"Dataset size in bytes : {pubmed_dataset.dataset_size}")
size_gb = pubmed_dataset.dataset_size / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.2f} GB")

Dataset size in bytes : 20978892555
Dataset size (cache file) : 19.54 GB
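The cells above report about 3 GB of RAM in use against a cache file of roughly 19.5 GB, which is the effect this subsection's title refers to. As a loose illustration of the general idea of memory mapping (not of the Arrow-based mechanism the library itself uses), here is a standard-library sketch with `mmap`. The toy file `toy.bin` and its contents are hypothetical.

```python
import mmap
import os
import tempfile

# Hypothetical toy file standing in for a large cache file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.bin")
    with open(path, "wb") as f:
        f.write(b"0123456789" * 1000)  # 10,000 bytes

    with open(path, "rb") as f:
        # Map the file into virtual memory; the OS reads pages on demand
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            total_bytes = len(mm)
            # Only this slice is copied into a Python bytes object
            window = mm[5000:5010]

print(total_bytes, window)
```

The mapped object can be sliced like a byte string, but only the pages that are actually touched need to be resident in RAM.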


...

In [65]:
import timeit

code_snippet = """batch_size = 1000

for idx in range(0, len(pubmed_dataset), batch_size):
    _ = pubmed_dataset[idx:idx + batch_size]
"""

time = timeit.timeit(stmt=code_snippet, number=1, globals=globals())
print(
    f"Iterated over {len(pubmed_dataset)} examples (about {size_gb:.1f} GB) in "
    f"{time:.1f}s, i.e. {size_gb/time:.3f} GB/s"
)

Iterated over 15518009 examples (about 19.5 GB) in 98.7s, i.e. 0.198 GB/s


...

##### Streaming datasets

* You might want to start by reading [Differences between Dataset and IterableDataset](https://huggingface.co/docs/datasets/v3.2.0/en/about_mapstyle_vs_iterable#differences-between-dataset-and-iterabledataset).
* To do: look for real-life examples of `IterableDataset` usage.

In [66]:
pubmed_dataset_streamed = load_dataset(
    "json", data_files=data_files, split="train", streaming=True
)

In [67]:
next(iter(pubmed_dataset_streamed))

{'meta': {'pmid': 11409574, 'language': 'eng'},
 'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that i

In [68]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x["text"]))
tokenized_dataset

IterableDataset({
    features: Unknown,
    num_shards: 1
})

In [69]:
item = next(iter(tokenized_dataset))
print(item)

{'meta': {'pmid': 11409574, 'language': 'eng'}, 'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that in

In [70]:
shuffled_dataset = pubmed_dataset_streamed.shuffle(buffer_size=10_000, seed=42)

print(next(iter(shuffled_dataset)))

{'meta': {'pmid': 11410799, 'language': 'eng'}, 'text': 'Randomized study of dose or schedule modification of granulocyte colony-stimulating factor in platinum-based chemotherapy for elderly patients with lung cancer.\nIt is generally believed that elderly patients are less able to tolerate aggressive cancer chemotherapy than their younger counterparts. Bone marrow cellularity diminishes with age and elderly patients may have decreased tolerance to myelosuppressive agents. Between November 1995 and October 1999, 68 chemotherapy-naive elderly (70 or more years old) patients with histologically or cytologically proven lung cancer who were to receive platinum-based chemotherapy were enrolled in this study. All patients had adequate cardiac, hematological, liver and renal function to receive chemotherapy. Patients were randomized into 3 groups. Patients in groups 1 and 2 received 2 microg/kg and 4 microg/kg granulocyte colony-stimulating factor (G-CSF, lenograstim), respectively, when grad
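A stream cannot be shuffled globally, because its elements are not all available at once; `buffer_size` bounds how many examples are held in memory for sampling. The sketch below shows one common way to do buffer-based shuffling in plain Python. It is an approximation for intuition, not the library's implementation, and `buffered_shuffle` is a hypothetical helper.

```python
import random


def buffered_shuffle(iterable, buffer_size, seed=None):
    """Yield items in approximately random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            # Fill phase: collect the first `buffer_size` items
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit whatever is left, in random order
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(100), buffer_size=10, seed=42))
print(shuffled[:10])
```

With this scheme an item can only move towards the front by at most `buffer_size` positions, so a small buffer gives a weak shuffle and a larger one costs more memory.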

In [71]:
dataset_head = pubmed_dataset_streamed.take(5)
list(dataset_head)

[{'meta': {'pmid': 11409574, 'language': 'eng'},
  'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that

In [72]:
# Skip the first 1,000 examples and include the rest in the training set
train_dataset = shuffled_dataset.skip(1000)

# Take the first 1,000 examples for the validation set
validation_dataset = shuffled_dataset.take(1000)
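The split above relies on `take(n)` and `skip(n)` covering complementary parts of the same stream. A minimal sketch of that idea with `itertools.islice`, assuming both passes see the stream in the same order; the helpers `take` and `skip` and the toy stream here are hypothetical stand-ins, not the library methods.

```python
from itertools import islice


def take(iterable, n):
    """Return an iterator over the first n items."""
    return islice(iterable, n)


def skip(iterable, n):
    """Return an iterator over everything after the first n items."""
    return islice(iterable, n, None)


stream = list(range(10))  # hypothetical toy stream

validation = list(take(stream, 3))
train = list(skip(stream, 3))
print(validation, train)
```

The two parts are disjoint and together cover the whole stream, which only holds if the order is identical for both passes (for a shuffled stream, that means the same seed).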

...

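As with the PubMed URL change above, the raw Pile data has been subject to a take-down order and is no longer available at the URL given in the tutorial. The cell below streams `monology/pile-uncopyrighted` from the Hub instead and keeps only the `FreeLaw` subset.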

In [73]:
law_dataset_streamed = load_dataset(
    "monology/pile-uncopyrighted",  
    split="train", 
    streaming=True
)

law_dataset_streamed = law_dataset_streamed.filter(lambda x: x["meta"]["pile_set_name"] == "FreeLaw")
law_dataset_streamed

Resolving data files:   0%|          | 0/30 [00:00<?, ?it/s]

IterableDataset({
    features: Unknown,
    num_shards: 30
})

In [74]:
print(next(iter(law_dataset_streamed)))

{'text': '     The summaries of the Colorado Court of Appeals published opinions\n  constitute no part of the opinion of the division but have been prepared by\n  the division for the convenience of the reader. The summaries may not be\n    cited or relied upon as they are not the official language of the division.\n  Any discrepancy between the language in the summary and in the opinion\n           should be resolved in favor of the language in the opinion.\n\n\n                                                                  SUMMARY\n                                                            February 8, 2018\n\n                                2018COA12\n\nNo. 14CA0144, People v. Trujillo — Criminal Law — Sentencing\n— Probation — Indeterminate Sentence\n\n     A division of the court of appeals considers whether a\n\nColorado statute authorizes imposition of a sentence to an\n\nindeterminate term of probation and whether the defendant was\n\nentitled to the benefit of amendments to

...

In [75]:
from itertools import islice
from datasets import interleave_datasets

combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
list(islice(combined_dataset, 2))

[{'meta': {'pmid': 11409574, 'language': 'eng'},
  'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that

In [76]:
combined_dataset

IterableDataset({
    features: ['meta', 'text'],
    num_shards: 1
})
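The output of `list(islice(combined_dataset, 2))` starts with a PubMed example, which is consistent with the sources being alternated. The sketch below shows a simple round-robin interleave that stops as soon as one source runs out. It is a conceptual stand-in: `interleave_first_exhausted` is a hypothetical helper, and the library's exact stopping behaviour for sources of unequal length may differ.

```python
def interleave_first_exhausted(*iterables):
    """Alternate between sources; stop when any source is exhausted."""
    iterators = [iter(it) for it in iterables]
    while True:
        for it in iterators:
            try:
                yield next(it)
            except StopIteration:
                return


# Hypothetical toy streams standing in for the PubMed and FreeLaw datasets
pubmed = ["p0", "p1"]
law = ["l0", "l1"]

combined = list(interleave_first_exhausted(pubmed, law))
print(combined)
```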

----

### Creating your own dataset

##### GitHub token

Be sure to paste in your GitHub token. Please see [Managing your personal access tokens](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens) in the GitHub docs.

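... ah, fetching the issues runs into the API rate limit:

> `Reached GitHub rate limit. Sleeping for one hour ...`

So let's not bother continuing here; the next section loads a ready-made copy of the issues dataset from the Hub instead.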

----

### Semantic search with FAISS

In [77]:
from datasets import load_dataset

issues_dataset = load_dataset("lewtun/github-issues", split="train")
issues_dataset

Repo card metadata block was not found. Setting CardData to empty.


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3019
})

In [78]:
issues_dataset = issues_dataset.filter(
    lambda x: (not x["is_pull_request"] and len(x["comments"]) > 0)
)
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 808
})

In [79]:
columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)

issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})
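`symmetric_difference` gives the right answer above because every name in `columns_to_keep` is an existing column. A small sketch of how it compares with a plain set difference, using a hypothetical shortened column list and a deliberately misspelled name:

```python
# Hypothetical, shortened column list for illustration
columns = ["url", "title", "body", "comments", "html_url", "state"]
columns_to_keep = ["title", "body", "html_url", "comments"]

# When columns_to_keep is a subset of columns, both operations agree
sym = set(columns_to_keep).symmetric_difference(columns)
diff = set(columns) - set(columns_to_keep)

# With a misspelled name ("comment"), they no longer agree
columns_to_keep_typo = ["title", "body", "html_url", "comment"]
sym_typo = set(columns_to_keep_typo).symmetric_difference(columns)
diff_typo = set(columns) - set(columns_to_keep_typo)

print(sorted(sym), sorted(sym_typo), sorted(diff_typo))
```

In the misspelled case the symmetric difference contains `"comment"`, which is not a column at all, and both results would drop the real `"comments"` column. Either way, it is worth checking the kept names against `column_names` first.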

In [80]:
print(issues_dataset[0])

{'html_url': 'https://github.com/huggingface/datasets/issues/2945', 'title': 'Protect master branch', 'comments': ['Cool, I think we can do both :)', '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).'], 'body': 'After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:\r\n- 00cc036fea7c7745cfe722360036ed306796a3f2\r\n- 13ae8c98602bbad8197de3b9b425f4c78f582af1\r\n- ...\r\n\r\nI propose to protect our master branch, so that we avoid we can accidentally make this kind of mistak

In [81]:
issues_dataset.set_format("pandas")
df = issues_dataset[:]
df

| | html_url | title | comments | body |
|---|---|---|---|---|
| 0 | https://github.com/huggingface/datasets/issues... | Protect master branch | [Cool, I think we can do both :), @lhoestq now... | After accidental merge commit (91c55355b634d0d... |
| 1 | https://github.com/huggingface/datasets/issues... | Backwards compatibility broken for cached data... | [Hi ! I guess the caching mechanism should hav... | ## Describe the bug\r\nAfter upgrading to data... |
| 2 | https://github.com/huggingface/datasets/issues... | OSCAR unshuffled_original_ko: NonMatchingSplit... | [I tried `unshuffled_original_da` and it is al... | ## Describe the bug\r\n\r\nCannot download OSC... |
| 3 | https://github.com/huggingface/datasets/issues... | load_dataset using default cache on Windows ca... | [Hi @daqieq, thanks for reporting.\r\n\r\nUnfo... | ## Describe the bug\r\nStandard process to dow... |
| 4 | https://github.com/huggingface/datasets/issues... | to_tf_dataset keeps a reference to the open da... | [I did some investigation and, as it seems, th... | To reproduce:\r\n```python\r\nimport datasets ... |
| ... | ... | ... | ... | ... |
| 803 | https://github.com/huggingface/datasets/issues/6 | Error when citation is not given in the Datase... | [Yes looks good to me.\r\nNote that we may ref... | The following error is raised when the `citati... |
| 804 | https://github.com/huggingface/datasets/issues/5 | ValueError when a split is empty | [To fix this I propose to modify only the file... | When a split is empty either TEST, VALIDATION ... |
| 805 | https://github.com/huggingface/datasets/issues/4 | [Feature] Keep the list of labels of a dataset... | [Yes! I see mostly two options for this:\r\n- ... | It would be useful to keep the list of the lab... |
| 806 | https://github.com/huggingface/datasets/issues/3 | [Feature] More dataset outputs | [Yes!\r\n- pandas will be a one-liner in `arro... | Add the following dataset outputs:\r\n\r\n- Sp... |


In [82]:
df["comments"][0].tolist()

['Cool, I think we can do both :)',
 '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).']

##### Using `pandas.DataFrame.explode` to break out the `comments` into separate rows

> Transform each element of a list-like to a row, replicating index values.

See the [`pandas.DataFrame.explode` API documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html#pandas-dataframe-explode).

In [83]:
comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

| | html_url | title | comments | body |
|---|---|---|---|---|
| 0 | https://github.com/huggingface/datasets/issues... | Protect master branch | Cool, I think we can do both :) | After accidental merge commit (91c55355b634d0d... |
| 1 | https://github.com/huggingface/datasets/issues... | Protect master branch | @lhoestq now the 2 are implemented.\r\n\r\nPle... | After accidental merge commit (91c55355b634d0d... |
| 2 | https://github.com/huggingface/datasets/issues... | Backwards compatibility broken for cached data... | Hi ! I guess the caching mechanism should have... | ## Describe the bug\r\nAfter upgrading to data... |
| 3 | https://github.com/huggingface/datasets/issues... | Backwards compatibility broken for cached data... | If it's easy enough to implement, then yes ple... | ## Describe the bug\r\nAfter upgrading to data... |
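To make the row replication in the output above concrete, here is a plain-Python sketch of the same transformation on a list of row dictionaries. The `explode` helper and the toy rows are hypothetical, and the sketch simply drops rows whose list is empty, which may not match how pandas treats empty lists.

```python
def explode(rows, column):
    """Return one row per element of rows[i][column], copying the other fields."""
    out = []
    for row in rows:
        for value in row[column]:
            new_row = dict(row)      # copy the non-list fields
            new_row[column] = value  # replace the list with a single element
            out.append(new_row)
    return out


# Hypothetical toy rows mirroring the issue/comments layout
rows = [
    {"title": "Issue A", "comments": ["first", "second"]},
    {"title": "Issue B", "comments": ["only one"]},
]

exploded = explode(rows, "comments")
for r in exploded:
    print(r)
```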


In [84]:
from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2964
})

In [85]:
comments_dataset[:4]

{'html_url': ['https://github.com/huggingface/datasets/issues/2945',
  'https://github.com/huggingface/datasets/issues/2945',
  'https://github.com/huggingface/datasets/issues/2943',
  'https://github.com/huggingface/datasets/issues/2943'],
 'title': ['Protect master branch',
  'Protect master branch',
  'Backwards compatibility broken for cached datasets that use `.filter()`',
  'Backwards compatibility broken for cached datasets that use `.filter()`'],
 'comments': ['Cool, I think we can do both :)',
  '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).',
  "Hi ! I guess the caching mechanism should have considered the new `filt

----

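Alternatively, taking a hint from [Batch mapping with `Dataset.map`](https://huggingface.co/docs/datasets/about_map_batch#batch-mapping), the `comments` lists can be flattened without pandas: with `batched=True`, the mapped function may return more rows than it receives.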



In [86]:
issues_dataset.reset_format()
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})

In [87]:
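def manually_flatten_list(batch):
    """Flatten a batch so that each comment gets its own row.

    The other columns are repeated once per comment, mirroring
    what `DataFrame.explode` does above.
    """
    retval = {
        'html_url': [],
        'title': [],
        'comments': [],
        'body': []
    }

    # Each column in the batch is a list with one entry per issue
    for html_url, title, comments, body in zip(batch['html_url'], batch['title'], batch['comments'], batch['body']):
        for c in comments:
            retval['html_url'].append(html_url)
            retval['title'].append(title)
            retval['comments'].append(c)
            retval['body'].append(body)

    return retval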

In [88]:
comments_dataset_2 = issues_dataset.map(
    manually_flatten_list,
    remove_columns=issues_dataset.column_names,
    batched=True)
comments_dataset_2

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2964
})

In [89]:
comments_dataset_2[:4]

{'html_url': ['https://github.com/huggingface/datasets/issues/2945',
  'https://github.com/huggingface/datasets/issues/2945',
  'https://github.com/huggingface/datasets/issues/2943',
  'https://github.com/huggingface/datasets/issues/2943'],
 'title': ['Protect master branch',
  'Protect master branch',
  'Backwards compatibility broken for cached datasets that use `.filter()`',
  'Backwards compatibility broken for cached datasets that use `.filter()`'],
 'comments': ['Cool, I think we can do both :)',
  '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).',
  "Hi ! I guess the caching mechanism should have considered the new `filt

----

In [90]:
comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)
comments_dataset

Map:   0%|          | 0/2964 [00:00<?, ? examples/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2964
})

In [91]:
comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
comments_dataset

Filter:   0%|          | 0/2964 [00:00<?, ? examples/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2175
})

In [92]:
def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }

comments_dataset = comments_dataset.map(concatenate_text)
comments_dataset

Map:   0%|          | 0/2175 [00:00<?, ? examples/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text'],
    num_rows: 2175
})

In [93]:
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

In [94]:
import torch

# Fall back to the CPU when no GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

In [95]:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]
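The slice `[:, 0]` keeps, for every sequence in the batch, the hidden vector at position 0 (the first token). A toy sketch of the same indexing with nested Python lists, assuming a `[batch, sequence, hidden]` layout; the numbers and the helper `cls_pooling_lists` are made up for illustration.

```python
# Toy "last_hidden_state": batch of 2 sequences, 3 tokens each, hidden size 2
last_hidden_state = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]],
]


def cls_pooling_lists(hidden):
    """Keep the vector of the first token of every sequence."""
    return [sequence[0] for sequence in hidden]


pooled = cls_pooling_lists(last_hidden_state)
print(pooled)
```

The result has one vector per sequence, which matches the `[1, 768]` shape reported for a single input text further down.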

In [96]:
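def get_embeddings(text_list):
    # Tokenize, padding to the longest text and truncating to the model's maximum length
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    # Move the input tensors to the same device as the model
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)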

In [97]:
embedding = get_embeddings(comments_dataset["text"][0])
embedding.shape

torch.Size([1, 768])

In [98]:
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

Map:   0%|          | 0/2175 [00:00<?, ? examples/s]

In [99]:
embeddings_dataset.add_faiss_index(column="embeddings")

  0%|          | 0/3 [00:00<?, ?it/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text', 'embeddings'],
    num_rows: 2175
})

In [100]:
question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

(1, 768)

In [101]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)
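Conceptually, a nearest-neighbour lookup compares the query vector with every stored vector and returns the `k` closest. The brute-force sketch below assumes squared Euclidean distance, where smaller means closer; which metric the index in this notebook actually uses is not shown here, so treat that choice as an assumption. The helper names and toy vectors are hypothetical.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_examples(vectors, query, k):
    """Return (distances, indices) of the k vectors closest to the query."""
    scored = [(squared_l2(v, query), i) for i, v in enumerate(vectors)]
    best = heapq.nsmallest(k, scored)
    return [s for s, _ in best], [i for _, i in best]


# Hypothetical 2-dimensional "embeddings"
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]

toy_scores, toy_indices = nearest_examples(vectors, [0.9, 0.1], k=2)
print(toy_scores, toy_indices)
```

Under a distance metric the best match has the smallest score, whereas under an inner-product metric it has the largest, so the sort direction applied to the scores should follow the metric in use.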

In [102]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

In [103]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

COMMENT: Requiring online connection is a deal breaker in some cases unfortunately so it'd be great if offline mode is added similar to how `transformers` loads models offline fine.

@mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?
SCORE: 25.505014419555664
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package since #1726 :)
You can now use them offline
```python
datasets = load_dataset('text', data_files=data_files)
```

We'll do a new release soon
SCORE: 24.5555419921875
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: I opened a PR that allows to reload modules that have already been loaded once even if there's no internet