# NLP Course

Please see the [Hugging Face NLP Course page](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt).

## 5. The 🤗 Datasets library

### What if my dataset isn’t on the Hub?

In [1]:
!ls -la

total 627300
drwxr-xr-x 5 se_olliphant se_olliphant      4096 Feb  5 12:42 .
drwxr-xr-x 3 se_olliphant se_olliphant      4096 Jan 31 06:42 ..
drwxr-xr-x 2 se_olliphant se_olliphant      4096 Feb  4 07:27 .ipynb_checkpoints
-rw-r--r-- 1 se_olliphant se_olliphant     14239 Jan 31 07:05 Ch2_p38.ipynb
-rw-r--r-- 1 se_olliphant se_olliphant     22696 Feb  1 04:21 HF_NLP_Ch2.ipynb
-rw-r--r-- 1 se_olliphant se_olliphant     49292 Feb  4 03:25 HF_NLP_Ch3.ipynb
-rw-r--r-- 1 se_olliphant se_olliphant      5472 Feb  4 03:26 HF_NLP_Ch4.ipynb
-rw-r--r-- 1 se_olliphant se_olliphant     29410 Feb  5 12:42 HF_NLP_Ch5.ipynb
-rw-r--r-- 1 se_olliphant se_olliphant   8385528 Feb  4 06:25 SQuAD_it-test.json
-rw-r--r-- 1 se_olliphant se_olliphant   1051245 Feb  4 06:25 SQuAD_it-test.json.gz
-rw-r--r-- 1 se_olliphant se_olliphant  43605829 Feb  4 06:25 SQuAD_it-train.json
-rw-r--r-- 1 se_olliphant se_olliphant   7725286 Feb  4 06:25 SQuAD_it-train.json.gz
-rw-r--r-- 1 se_olliphant se_olliphant      2417 Feb 

----

##### When there are separate datafiles for train, test, etc.

Load the local data file `SQuAD_it-train.json` (after `gunzip`ping it). The examples live in the top-level `data` array of the JSON object, which is why we pass `field="data"`.

In [2]:
!head -n 20 SQuAD_it-train.json

{
    "data": [
        {
            "title": "Terremoto del Sichuan del 2008",
            "paragraphs": [
                {
                    "context": "Il terremoto del Sichuan del 2008 o il terremoto del Gran Sichuan, misurato a 8.0 Ms e 7.9 Mw, e si è verificato alle 02:28:01 PM China Standard Time all' epicentro (06:28:01 UTC) il 12 maggio nella provincia del Sichuan, ha ucciso 69.197 persone e lasciato 18.222 dispersi.",
                    "qas": [
                        {
                            "id": "56cdca7862d2951400fa6826",
                            "answers": [
                                {
                                    "text": "2008",
                                    "answer_start": 29
                                }
                            ],
                            "question": "In quale anno si è verificato il terremoto nel Sichuan?"
                        },
                        {
                            "id": "56cdca7862d295

In [3]:
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})
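
To see what `field="data"` is selecting, here is a rough stdlib-only sketch. It is not how 🤗 Datasets is implemented, and the tiny `raw` document (including its `meta` key) is made up purely to mimic the SQuAD-it layout: the examples sit in a top-level `data` array, and each element of that array becomes one row.

```python
import json

# Made-up miniature document that only mimics the SQuAD-it layout.
# The "meta" key is hypothetical; it stands in for any other top-level key.
raw = json.dumps({
    "meta": "hypothetical extra key",
    "data": [
        {"title": "Article A", "paragraphs": [{"context": "...", "qas": []}]},
        {"title": "Article B", "paragraphs": [{"context": "...", "qas": []}]},
    ],
})

document = json.loads(raw)
rows = document["data"]            # roughly what field="data" points at
columns = sorted(rows[0].keys())   # these become the dataset features

print(len(rows), columns)
```

Each element of `data` is one article, which would explain why the real training split has only 442 rows even though the file is about 43 MB.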

----

##### Using a `dict` whose keys are the dataset split names and whose values are the local `gunzip`ped data files

In [4]:
data_files = {
    "train": "SQuAD_it-train.json", 
    "test": "SQuAD_it-test.json"
}

squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

----

##### Using a `dict` whose keys are the dataset split names and whose values are the downloaded `gz` archive files

In [5]:
data_files = {
    "train": "SQuAD_it-train.json.gz", 
    "test": "SQuAD_it-test.json.gz"
}

squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
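
The nice part of this variant is that we never had to `gunzip` anything ourselves. For comparison, here is a minimal stdlib sketch of the manual route; the file name and the single record are made up, and this is only an illustration of the chore that `load_dataset` spares us, not a description of its internals.

```python
import gzip
import json
import os
import tempfile

# Made-up record standing in for the real SQuAD-it content.
records = {"data": [{"title": "Article A", "paragraphs": []}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.json.gz")

    # Write a gzip-compressed JSON file...
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(records, f)

    # ...and read it back without ever creating an uncompressed copy on disk.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        restored = json.load(f)

print(restored == records)
```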

----

##### Using a `dict` whose keys are the dataset split names and whose values are the URLs of the remote archive files

In [6]:
base_url = "https://github.com/crux82/squad-it/raw/master/"

data_files = {
    "train": f"{base_url}SQuAD_it-train.json.gz",
    "test":  f"{base_url}SQuAD_it-test.json.gz",
}

squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

### Time to slice and dice

In [7]:
!ls -la

total 627300
drwxr-xr-x 5 se_olliphant se_olliphant      4096 Feb  5 12:42 .
drwxr-xr-x 3 se_olliphant se_olliphant      4096 Jan 31 06:42 ..
drwxr-xr-x 2 se_olliphant se_olliphant      4096 Feb  4 07:27 .ipynb_checkpoints
-rw-r--r-- 1 se_olliphant se_olliphant     14239 Jan 31 07:05 Ch2_p38.ipynb
-rw-r--r-- 1 se_olliphant se_olliphant     22696 Feb  1 04:21 HF_NLP_Ch2.ipynb
-rw-r--r-- 1 se_olliphant se_olliphant     49292 Feb  4 03:25 HF_NLP_Ch3.ipynb
-rw-r--r-- 1 se_olliphant se_olliphant      5472 Feb  4 03:26 HF_NLP_Ch4.ipynb
-rw-r--r-- 1 se_olliphant se_olliphant     29410 Feb  5 12:42 HF_NLP_Ch5.ipynb
-rw-r--r-- 1 se_olliphant se_olliphant   8385528 Feb  4 06:25 SQuAD_it-test.json
-rw-r--r-- 1 se_olliphant se_olliphant   1051245 Feb  4 06:25 SQuAD_it-test.json.gz
-rw-r--r-- 1 se_olliphant se_olliphant  43605829 Feb  4 06:25 SQuAD_it-train.json
-rw-r--r-- 1 se_olliphant se_olliphant   7725286 Feb  4 06:25 SQuAD_it-train.json.gz
-rw-r--r-- 1 se_olliphant se_olliphant      2417 Feb 

In [8]:
from datasets import load_dataset

data_files = {
    "train": "drugsComTrain_raw.tsv", 
    "test": "drugsComTest_raw.tsv"
}

# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

In [9]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
# Peek at the first few examples
drug_sample[:3]

{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than 

> * The `Unnamed: 0` column looks suspiciously like an anonymized ID for each patient.

In [10]:
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

In [11]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)

drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

In [12]:
help(drug_dataset["train"].unique)

Help on method unique in module datasets.arrow_dataset:

unique(column: str) -> List method of datasets.arrow_dataset.Dataset instance
    Return a list of the unique elements in a column.
    
    This is implemented in the low-level backend and as such, very fast.
    
    Args:
        column (`str`):
            Column name (list all the column names with [`~datasets.Dataset.column_names`]).
    
    Returns:
        `list`: List of unique elements in the given column.
    
    Example:
    
    ```py
    >>> from datasets import load_dataset
    >>> ds = load_dataset("rotten_tomatoes", split="validation")
    >>> ds.unique('label')
    [1, 0]
    ```



In [13]:
# Drop rows whose drugName is missing or blank
drug_dataset = drug_dataset.filter(lambda r: r['drugName'] is not None and len(r['drugName'].strip()) > 0)
# Drop rows whose condition is missing or blank
drug_dataset = drug_dataset.filter(lambda r: r['condition'] is not None and len(r['condition'].strip()) > 0)

print(f"after filtering out rows where drugName or condition are empty:\n{drug_dataset}")

# train
drug_names = [
    n.lower() 
    for n in drug_dataset['train'].unique('drugName')
]
conditions = [
    n.lower() 
    for n in drug_dataset['train'].unique('condition')
]

print(f"train, unique drug names: {len(drug_names)}")
print(f"train, unique conditions: {len(conditions)}")

print()

# test
drug_names = [
    n.lower() 
    for n in drug_dataset['test'].unique('drugName')
]
conditions = [
    n.lower() 
    for n in drug_dataset['test'].unique('condition')
]

print(f"test, unique drug names: {len(drug_names)}")
print(f"test, unique conditions: {len(conditions)}")

after filtering out rows where drugName or condition are empty:
DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 160398
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53471
    })
})
train, unique drug names: 3431
train, unique conditions: 884

test, unique drug names: 2635
test, unique conditions: 708


----

> * The `condition` column includes a mix of uppercase and lowercase labels.

In [14]:
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}

drug_dataset = drug_dataset.map(lowercase_condition)

In [15]:
# Check that lowercasing worked
drug_dataset["train"]["condition"][:3]

['left ventricular dysfunction', 'adhd', 'birth control']

...

In [16]:
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}

In [17]:
drug_dataset = drug_dataset.map(compute_review_length)

# Inspect the first training example
drug_dataset["train"][0]

{'patient_id': 206461,
 'drugName': 'Valsartan',
 'condition': 'left ventricular dysfunction',
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'rating': 9.0,
 'date': 'May 20, 2012',
 'usefulCount': 27,
 'review_length': 17}

In [18]:
drug_dataset["train"].sort("review_length")[:3]

{'patient_id': [111469, 13653, 53602],
 'drugName': ['Ledipasvir / sofosbuvir',
  'Amphetamine / dextroamphetamine',
  'Alesse'],
 'condition': ['hepatitis c', 'adhd', 'birth control'],
 'review': ['"Headache"', '"Great"', '"Awesome"'],
 'rating': [10.0, 10.0, 10.0],
 'date': ['February 3, 2015', 'October 20, 2009', 'November 23, 2015'],
 'usefulCount': [41, 3, 0],
 'review_length': [1, 1, 1]}

In [19]:
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)

print(drug_dataset.num_rows)

{'train': 138514, 'test': 46108}
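
As a sanity check on what `review_length` actually measures, here is a small plain-Python sketch of the same whitespace word count and the `> 30` filter. The first two reviews are taken from the outputs above; the third is a made-up filler string, and `review_length` / `min_words` are names chosen just for this illustration.

```python
def review_length(text):
    # Same rule as compute_review_length: split on whitespace and count the pieces.
    return len(text.split())

reviews = [
    '"Great"',
    '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
    " ".join(["word"] * 31),  # made-up filler review, 31 words long
]

lengths = [review_length(review) for review in reviews]

min_words = 30
kept = [review for review in reviews if review_length(review) > min_words]

print(lengths, len(kept))
```

Note that punctuation and quote characters stay attached to their words, so this is a rough word count rather than a token count.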


> * The reviews are of varying length and contain a mix of Python line separators (`\r\n`) as well as HTML character codes like `&#039;`.

In [20]:
help(drug_dataset.sort)

Help on method sort in module datasets.dataset_dict:

sort(column_names: Union[str, Sequence[str]], reverse: Union[bool, Sequence[bool]] = False, null_placement: str = 'at_end', keep_in_memory: bool = False, load_from_cache_file: Optional[bool] = None, indices_cache_file_names: Optional[Dict[str, Optional[str]]] = None, writer_batch_size: Optional[int] = 1000) -> 'DatasetDict' method of datasets.dataset_dict.DatasetDict instance
    Create a new dataset sorted according to a single or multiple columns.
    
    Args:
        column_names (`Union[str, Sequence[str]]`):
            Column name(s) to sort by.
        reverse (`Union[bool, Sequence[bool]]`, defaults to `False`):
            If `True`, sort by descending order rather than ascending. If a single bool is provided,
            the value is applied to the sorting of all column names. Otherwise a list of bools with the
            same length and order as column_names must be provided.
        null_placement (`str`, defaults to 

In [21]:
foo = drug_dataset.sort('review_length', reverse=True)
foo['train'][:3]

{'patient_id': [121004, 181160, 216072],
 'drugName': ['Venlafaxine', 'Prozac', 'Copper'],
 'condition': ['migraine', 'obsessive compulsive disorde', 'birth control'],
 'review': ['"Two and a half months ago I was prescribed Venlafaxine to help prevent chronic migraines.\r\nIt did help the migraines (reduced them by almost half), but with it came a host of side effects that were far worse than the problem I was trying to get rid of.\r\nHaving now come off of the stuff, I would not recommend anyone ever use Venlafaxine unless they suffer from extreme / suicidal depression. I mean extreme in the most emphatic sense of the word. \r\nBefore trying Venlafaxine, I was a writer. While on Venlafaxine, I could barely write or speak or communicate at all. More than that, I just didn&#039;t want to. Not normal for a usually outgoing extrovert.\r\nNow, I&#039;m beginning to write again - but my ability to speak and converse with others has deteriorated by about 95%. Writing these words is taking f

In [22]:
import html

In [23]:
%time drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})

CPU times: user 9.12 ms, sys: 165 µs, total: 9.29 ms
Wall time: 8.13 ms
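
For reference, this is all `html.unescape` does to a single string. The snippet is taken from the first sample review shown earlier.

```python
import html

raw = "I&#039;m a strong believer of aleve"
clean = html.unescape(raw)

print(clean)
```

Running it on already-clean text is harmless, so applying the `map` twice does not corrupt anything.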


...

`map` method's superpowers?

In [24]:
%time _ = drug_dataset.map(lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True)

CPU times: user 4.2 ms, sys: 8.1 ms, total: 12.3 ms
Wall time: 9.44 ms
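
The thing to remember about `batched=True` is the shape of the argument: the function receives a dict of lists (one list per column) instead of a single example. Here is a plain-Python sketch of the two shapes, with no 🤗 Datasets involved. The helper names and the two sample rows are ours.

```python
import html

def unescape_one(example):
    # Non-batched: `example` is one row, so example["review"] is a str.
    return {"review": html.unescape(example["review"])}

def unescape_batch(batch):
    # Batched: `batch` holds whole columns, so batch["review"] is a list of str.
    return {"review": [html.unescape(text) for text in batch["review"]]}

rows = [{"review": "I&#039;m fine"}, {"review": "5 &lt; 6"}]
batch = {"review": [row["review"] for row in rows]}

per_example = [unescape_one(row)["review"] for row in rows]
per_batch = unescape_batch(batch)["review"]

print(per_example == per_batch)
```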


----

In [25]:
from transformers import AutoTokenizer

tokenizer_slow = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
tokenizer_fast = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

def tokenize_function_slow(examples):
    return tokenizer_slow(examples["review"], truncation=True)

def tokenize_function_fast(examples):
    return tokenizer_fast(examples["review"], truncation=True)

In [26]:
%time _ = drug_dataset.map(tokenize_function_fast, batched=True)

CPU times: user 12.2 ms, sys: 8.53 ms, total: 20.7 ms
Wall time: 18.7 ms


In [27]:
%time _ = drug_dataset.map(tokenize_function_fast, batched=False)

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

CPU times: user 24.6 s, sys: 106 ms, total: 24.7 s
Wall time: 24.7 s


...

In [28]:
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)

def slow_tokenize_function(examples):
    return slow_tokenizer(examples["review"], truncation=True)

In [29]:
%time _ = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)

CPU times: user 668 ms, sys: 0 ns, total: 668 ms
Wall time: 666 ms


...

In [30]:
# The tutorial does not show this line: define the fast tokenizer used below
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

In [31]:
result = tokenize_and_split(drug_dataset["train"][0])
[len(inp) for inp in result["input_ids"]]

[128, 49]

In [32]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

ArrowInvalid: Column 8 named input_ids expected length 1000 but got length 1463

> The problem is that we’re trying to mix two different datasets of different sizes: the drug_dataset columns will have a certain number of examples (the 1,000 in our error), but the tokenized_dataset we are building will have more (the 1,463 in the error message; it is more than 1,000 because we are tokenizing long reviews into more than one example by using return_overflowing_tokens=True). That doesn’t work for a Dataset, so we need to either remove the columns from the old dataset or make them the same size as they are in the new dataset....

##### Clarification
* We are trying to build a new dataset of tokenized values that we will use for training.
* Each original row in `drug_dataset` yields an initial chunk of up to 128 tokens, plus one or more additional chunks when the review has more than 128 tokens (the overflow). The mapping from the original `drug_dataset` to our new tokenized dataset is therefore not one-to-one.
* So we need to use the [`remove_columns` argument to `Dataset.map`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.map.remove_columns), which _...removes a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept._
* Keep in mind that, for training in this instance, the **tokenized** dataset is expected to have the following fields: `input_ids`, `token_type_ids`, and `attention_mask`.
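
To make the length mismatch concrete, here is a toy plain-Python sketch. It does not call the tokenizer; `split_into_chunks` is a hypothetical helper that cuts made-up "token" lists into pieces of at most `max_length`, and records which input row each piece came from (the same role `overflow_to_sample_mapping` plays).

```python
def split_into_chunks(token_lists, max_length):
    """Cut every token list into consecutive chunks of at most max_length."""
    chunks, sample_map = [], []
    for sample_index, tokens in enumerate(token_lists):
        for start in range(0, len(tokens), max_length):
            chunks.append(tokens[start:start + max_length])
            sample_map.append(sample_index)  # remember the originating row
    return chunks, sample_map

# Three made-up "reviews" with 3, 10 and 5 tokens.
token_lists = [list(range(3)), list(range(10)), list(range(5))]
chunks, sample_map = split_into_chunks(token_lists, max_length=4)

print(len(token_lists), len(chunks), sample_map)
```

Three rows go in and six come out, which is the same kind of mismatch as the 1,000 versus 1,463 in the `ArrowInvalid` error.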

In [33]:
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)

...

Now compare the resulting `tokenized_dataset` with the original `drug_dataset`...

In [34]:
len(tokenized_dataset["train"]), len(drug_dataset["train"])

(206772, 138514)

In [35]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'],
        num_rows: 68876
    })
})

In [36]:
print(tokenized_dataset["train"][66])

{'input_ids': [101, 107, 1130, 19972, 11083, 1225, 1136, 1250, 1111, 1139, 1488, 132, 1119, 1108, 18302, 1228, 1103, 2928, 1229, 1119, 1108, 1781, 1122, 117, 1105, 1515, 1558, 2492, 1107, 1705, 106, 1135, 3093, 1106, 1250, 1103, 3714, 1113, 1140, 106, 107, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'overflow_to_sample_mapping': 43}


In [37]:
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 138514
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

In [38]:
print(drug_dataset["train"][0])

{'patient_id': 95260, 'drugName': 'Guanfacine', 'condition': 'adhd', 'review': '"My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \r\nWe have tried many different medications and so far this is the most effective."', 'rating': 8.0, 'date': 'April 27, 2010', 'usefulCount': 192, 'review_length': 141}


<hr width=40%/>

##### Uh, wtf? Or, I think we need a bit more clarification...

OK, that HF tutorial is not at all clear as to what happens with `return_overflowing_tokens`. 

Let's take a quick detour and dig in just a bit more...

In [39]:
from transformers import XLMRobertaTokenizerFast

model_id = "xlm-roberta-large-finetuned-conll03-english"
t = XLMRobertaTokenizerFast.from_pretrained(model_id)

sample = "this is an example and context is important to retrieve meaningful contextualized token embeddings from the self attention mechanism of the transformer"
encoded_default = t(sample, truncation=True).input_ids

print()
print(f"Original example string: {sample}")
print()
print(f"Tokenized, this string has {len(t.tokenize(sample))} tokens:\n{t.batch_decode(encoded_default)}")


Original example string: this is an example and context is important to retrieve meaningful contextualized token embeddings from the self attention mechanism of the transformer

Tokenized, this string has 32 tokens:
['<s>', 'this', 'is', 'an', 'example', 'and', 'context', 'is', 'important', 'to', 're', 'tri', 'eve', 'meaning', 'ful', 'context', 'ual', 'ized', 'to', 'ken', '', 'embe', 'dding', 's', 'from', 'the', 'self', 'attention', 'mechanism', 'of', 'the', 'transform', 'er', '</s>']


In [40]:
# now let's limit tokenization to 10 tokens...
encoded_max_length = t(sample, max_length=10, truncation=True).input_ids

print(f"Setting max_length in tokenization to: {len(encoded_max_length)}")
print(f"Now the tokens look like this:\n{t.batch_decode(encoded_max_length)}")

Setting max_length in tokenization to: 10
Now the tokens look like this:
['<s>', 'this', 'is', 'an', 'example', 'and', 'context', 'is', 'important', '</s>']


In [41]:
# and NOW we ask the tokenizer to return_overflowing_tokens
encoded_overflow = t(sample, max_length=10, truncation=True, return_overflowing_tokens=True).input_ids

print([len(x) for x in encoded_overflow])
print(*t.batch_decode(encoded_overflow), sep="\n")

[10, 10, 10, 10]
<s> this is an example and context is important</s>
<s> to retrieve meaningful contextual</s>
<s>ized token embeddings from</s>
<s> the self attention mechanism of the transformer</s>


In [42]:
# but it may be the case that this sequential listing of the overflow 
# results is less-than-ideal input...
# so NOW we ask for striding of the overflow
# with the hope that the resulting strided sequences make more sense
encoded_overflow_stride = t(sample, max_length=10, truncation=True, stride=3, return_overflowing_tokens=True).input_ids

print([len(x) for x in encoded_overflow_stride])
print(*t.batch_decode(encoded_overflow_stride), sep="\n")

[10, 10, 10, 10, 10, 9]
<s> this is an example and context is important</s>
<s> context is important to retrieve meaning</s>
<s>trieve meaningful contextualized to</s>
<s>ualized token embeddings</s>
<s>embeddings from the self attention mechanism</s>
<s> self attention mechanism of the transformer</s>


<hr width=40%/>
...
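
Putting the detour into plain Python: the sketch below reproduces the chunk lengths printed above, under the assumption (consistent with those outputs) that each window holds `max_length` minus the two special tokens, and that consecutive windows overlap by `stride` tokens. `chunk_with_stride` is a hypothetical helper, not the tokenizer's actual implementation.

```python
def chunk_with_stride(tokens, window, stride=0):
    """Sliding windows of `window` tokens; neighbours share `stride` tokens."""
    if not 0 <= stride < window:
        raise ValueError("stride must be >= 0 and smaller than window")
    step = window - stride
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reached the end of the sequence
        start += step
    return chunks

tokens = list(range(32))  # stand-in for the 32 tokens of the sample sentence
window = 10 - 2           # max_length=10 minus <s> and </s>

plain = chunk_with_stride(tokens, window)
strided = chunk_with_stride(tokens, window, stride=3)

print([len(chunk) for chunk in plain])
print([len(chunk) for chunk in strided])
```

Adding the two special tokens back to each window gives `[10, 10, 10, 10]` and `[10, 10, 10, 10, 10, 9]`, matching the tokenizer outputs above.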

In [43]:
def tokenize_and_split(examples):
    result = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result

In [44]:
drug_dataset = drug_dataset.map(tokenize_and_split, batched=True)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 68876
    })
})
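
What the `sample_map` loop in `tokenize_and_split` does is repeat each original column value once per chunk, so that every column ends up as long as `input_ids`. Here is a small plain-Python sketch of just that step. `repeat_columns` is a hypothetical helper, and the `sample_map` is made up to say "the first review produced two chunks, the second produced one".

```python
def repeat_columns(columns, sample_map):
    # For every new row, copy the value from the original row it came from.
    return {
        name: [values[i] for i in sample_map]
        for name, values in columns.items()
    }

columns = {
    "patient_id": [95260, 92703],
    "condition": ["adhd", "birth control"],
}
sample_map = [0, 0, 1]  # made-up mapping: review 0 -> 2 chunks, review 1 -> 1 chunk

aligned = repeat_columns(columns, sample_map)
print(aligned)
```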

...

In [45]:
drug_dataset.set_format("pandas")
drug_dataset["train"][:3]

| | patient_id | drugName | condition | review | rating | date | usefulCount | review_length | input_ids | token_type_ids | attention_mask |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 95260 | Guanfacine | adhd | "My son is halfway through his fourth week of ... | 8.0 | April 27, 2010 | 192 | 141 | [101, 107, 1422, 1488, 1110, 9079, 1194, 1117,... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 1 | 95260 | Guanfacine | adhd | "My son is halfway through his fourth week of ... | 8.0 | April 27, 2010 | 192 | 141 | [101, 119, 1124, 1110, 1750, 6438, 113, 170, 1... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 2 | 92703 | Lybrel | birth control | "I used to take another oral contraceptive, wh... | 5.0 | December 14, 2009 | 17 | 134 | [101, 107, 146, 1215, 1106, 1321, 1330, 9619, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |


In [46]:
train_df = drug_dataset["train"][:]

In [47]:
frequencies = (
    train_df["condition"]
    .value_counts()
    .to_frame()
    .reset_index()
    .rename(columns={"count": "frequency"})
)
frequencies.head()

| | condition | frequency |
|---|---|---|
| 0 | birth control | 43445 |
| 1 | depression | 12322 |
| 2 | acne | 8285 |
| 3 | anxiety | 7523 |
| 4 | pain | 6513 |


In [48]:
from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

Dataset({
    features: ['condition', 'frequency'],
    num_rows: 819
})
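
For intuition, the frequency table built above is just a count per distinct value, sorted from most to least common. Here is a stdlib sketch of the same idea with `collections.Counter`; the six conditions are a made-up sample, not real counts from the dataset.

```python
from collections import Counter

# Made-up sample of the `condition` column.
conditions = ["birth control", "adhd", "birth control", "acne", "adhd", "birth control"]

# most_common() returns (value, count) pairs sorted by descending count.
frequencies = Counter(conditions).most_common()

print(frequencies)
```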

In [49]:
drug_avg_rating = (
    train_df.groupby(['drugName'])['rating']
            .mean()
            .to_frame()
            .reset_index()
)
print(drug_avg_rating)

                                  drugName     rating
0                A + D Cracked Skin Relief  10.000000
1                               A / B Otic  10.000000
2     Abacavir / dolutegravir / lamivudine   7.901639
3       Abacavir / lamivudine / zidovudine   9.000000
4                                Abatacept   7.500000
...                                    ...        ...
3047                                 Zyvox   9.210526
3048                               ZzzQuil   4.000000
3049                 depo-subQ provera 104   1.000000
3050                                  ella   7.125000
3051                                femhrt   5.500000

[3052 rows x 2 columns]
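
The `groupby(...).mean()` above boils down to "collect the ratings per drug, then average each group". Here is a stdlib sketch of that logic; the three (drug, rating) pairs are made up for illustration and are not rows from the dataset.

```python
from collections import defaultdict

# Made-up (drugName, rating) pairs.
ratings = [("Zyvox", 9.0), ("Zyvox", 10.0), ("ella", 7.0)]

grouped = defaultdict(list)
for drug_name, rating in ratings:
    grouped[drug_name].append(rating)

# Average rating per drug, sorted by drug name.
averages = {drug: sum(values) / len(values) for drug, values in sorted(grouped.items())}

print(averages)
```

Keep in mind that, after the overflow split, a long review contributes one row per chunk, so averages computed on `train_df` weight long reviews more heavily than short ones.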


In [50]:
drug_dataset.reset_format()

...

In [51]:
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 68876
    })
})

In [52]:
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)

# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")

# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 165417
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 41355
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 68876
    })
})
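
Conceptually, a seeded train/validation split is "shuffle the row indices reproducibly, then cut". The stdlib sketch below shows the idea on ten rows. `split_indices` is a hypothetical helper, and it will not reproduce the exact row assignment that `train_test_split(seed=42)` makes, since the two use different random number generators.

```python
import random

def split_indices(num_rows, train_size, seed):
    """Shuffle row indices with a fixed seed and cut them into two parts."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same order every run
    cut = int(num_rows * train_size)
    return indices[:cut], indices[cut:]

train_idx, valid_idx = split_indices(num_rows=10, train_size=0.8, seed=42)

print(len(train_idx), len(valid_idx))
```

The important properties are that the two parts do not overlap, that together they cover every row, and that rerunning with the same seed gives the same split.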

...

In [53]:
drug_dataset_clean.save_to_disk("drug-reviews")

Saving the dataset (0/1 shards):   0%|          | 0/165417 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/41355 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/68876 [00:00<?, ? examples/s]

In [54]:
!ls -la drug-reviews

total 24
drwxr-xr-x 5 se_olliphant se_olliphant 4096 Feb  5 12:39 .
drwxr-xr-x 5 se_olliphant se_olliphant 4096 Feb  5 12:42 ..
-rw-r--r-- 1 se_olliphant se_olliphant   43 Feb  5 12:43 dataset_dict.json
drwxr-xr-x 2 se_olliphant se_olliphant 4096 Feb  5 12:39 test
drwxr-xr-x 2 se_olliphant se_olliphant 4096 Feb  5 12:39 train
drwxr-xr-x 2 se_olliphant se_olliphant 4096 Feb  5 12:39 validation


In [55]:
!ls -la drug-reviews/train

total 191660
drwxr-xr-x 2 se_olliphant se_olliphant      4096 Feb  5 12:39 .
drwxr-xr-x 5 se_olliphant se_olliphant      4096 Feb  5 12:39 ..
-rw-r--r-- 1 se_olliphant se_olliphant 196240960 Feb  5 12:44 data-00000-of-00001.arrow
-rw-r--r-- 1 se_olliphant se_olliphant      1916 Feb  5 12:44 dataset_info.json
-rw-r--r-- 1 se_olliphant se_olliphant       250 Feb  5 12:44 state.json


In [56]:
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("drug-reviews")
drug_dataset_reloaded

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 165417
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 41355
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 68876
    })
})

In [57]:
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")

Creating json from Arrow format:   0%|          | 0/166 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/42 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/69 [00:00<?, ?ba/s]

In [58]:
!ls -la 

total 627300
drwxr-xr-x 5 se_olliphant se_olliphant      4096 Feb  5 12:42 .
drwxr-xr-x 3 se_olliphant se_olliphant      4096 Jan 31 06:42 ..
drwxr-xr-x 2 se_olliphant se_olliphant      4096 Feb  4 07:27 .ipynb_checkpoints
-rw-r--r-- 1 se_olliphant se_olliphant     14239 Jan 31 07:05 Ch2_p38.ipynb
-rw-r--r-- 1 se_olliphant se_olliphant     22696 Feb  1 04:21 HF_NLP_Ch2.ipynb
-rw-r--r-- 1 se_olliphant se_olliphant     49292 Feb  4 03:25 HF_NLP_Ch3.ipynb
-rw-r--r-- 1 se_olliphant se_olliphant      5472 Feb  4 03:26 HF_NLP_Ch4.ipynb
-rw-r--r-- 1 se_olliphant se_olliphant     29410 Feb  5 12:42 HF_NLP_Ch5.ipynb
-rw-r--r-- 1 se_olliphant se_olliphant   8385528 Feb  4 06:25 SQuAD_it-test.json
-rw-r--r-- 1 se_olliphant se_olliphant   1051245 Feb  4 06:25 SQuAD_it-test.json.gz
-rw-r--r-- 1 se_olliphant se_olliphant  43605829 Feb  4 06:25 SQuAD_it-train.json
-rw-r--r-- 1 se_olliphant se_olliphant   7725286 Feb  4 06:25 SQuAD_it-train.json.gz
-rw-r--r-- 1 se_olliphant se_olliphant      2417 Feb 

In [59]:
!head -n 1 drug-reviews-train.jsonl

{"patient_id":37933,"drugName":"Adipex-P","condition":"weight loss","review":"\"Been on Adipex-p for 28 days and have lost 20 pounds. Doctor put me on it because I am menopausal and have a hard time losing weight. First week had a hard time sleeping but now no problems, I check my blood pressure daily and no issues there. Definitely get dry mouth with it that has been my only issue, I usually chew gum and keep a drink handy. Glad I was given this drug, doubt I would have lost what I did without it.\"","rating":9.0,"date":"September 5, 2017","usefulCount":35,"review_length":85,"input_ids":[101,107,18511,1113,24930,9717,11708,118,185,1111,1743,1552,1105,1138,1575,1406,6549,119,4157,1508,1143,1113,1122,1272,146,1821,1441,4184,25134,1348,1105,1138,170,1662,1159,3196,2841,119,1752,1989,1125,170,1662,1159,5575,1133,1208,1185,2645,117,146,4031,1139,1892,2997,3828,1105,1185,2492,1175,119,3177,16598,3150,1193,1243,3712,1779,1114,1122,1115,1144,1151,1139,1178,2486,117,146,1932,22572,5773,19956,1

In [60]:
data_files = {
    "train": "drug-reviews-train.jsonl",
    "validation": "drug-reviews-validation.jsonl",
    "test": "drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]
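
The `.jsonl` files are JSON Lines: one complete JSON object per line, which is why `head -n 1` above shows a whole example. Here is a stdlib sketch of the round trip; the two records and the file name are made up, with column names borrowed from the dataset.

```python
import json
import os
import tempfile

# Made-up records using a few of the dataset's column names.
records = [
    {"patient_id": 1, "drugName": "Example A", "rating": 9.0},
    {"patient_id": 2, "drugName": "Example B", "rating": 4.0},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.jsonl")

    # Write: one JSON object per line.
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read: parse each non-empty line independently.
    with open(path, encoding="utf-8") as f:
        restored = [json.loads(line) for line in f if line.strip()]

print(restored == records)
```

Because every line stands on its own, no `field` argument is needed when reloading these files with the `json` loader.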

----

### Big data? 🤗 Datasets to the rescue!