# Working With HuggingFace Datasets

- [Loading Data](https://huggingface.co/learn/nlp-course/chapter5/2?fw=pt)
- [Manipulating Data](https://huggingface.co/learn/nlp-course/chapter5/3?fw=pt)

In [1]:
# Built-in library
import re
import json
from typing import Any, Dict, List, Optional, Union
import logging
import warnings

# Standard imports
import numpy as np
from pprint import pprint
import pandas as pd
from rich import print

# Visualization
import matplotlib.pyplot as plt


# Pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 600

warnings.filterwarnings("ignore")

# Black code formatter (Optional)
%load_ext lab_black

# auto reload imports
%load_ext autoreload
%autoreload 2

### [Working With Remote And Local Datasets](https://huggingface.co/learn/nlp-course/chapter5/2?fw=pt)

<br>

[![image.png](https://i.postimg.cc/43xyFsHR/image.png)](https://postimg.cc/9DscD3BL)

### Loading A Local Dataset

- For this example we’ll use the [SQuAD-it dataset](https://github.com/crux82/squad-it/), which is a large-scale dataset for question answering in Italian.

- The training and test splits are hosted on GitHub, so we can download them with a simple wget command:

<br>

```python
# Create the my_data directory if it does not exist.
!mkdir -p my_data

# Download the compressed train and test files into the my_data directory.
!wget -P my_data https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!wget -P my_data https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

# Decompress both files with the Linux gzip command
# (-d decompress, -k keep the .gz originals, -v verbose):
!gzip -dkv my_data/SQuAD_it-*.json.gz
```

In [2]:
# Run this once!
"""
!mkdir -p my_data

# Download data
!wget -P my_data "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"

# Unzip the data
!unzip my_data/drugsCom_raw.zip -d my_data

"""

In [3]:
from datasets import Dataset
from datasets import load_dataset
from datasets.dataset_dict import DatasetDict

# Load the data
data_files: dict[str, Any] = {
    "train": "my_data/drugsComTrain_raw.tsv",
    "test": "my_data/drugsComTest_raw.tsv",
}

# \t is the tab character in Python
drug_dataset: DatasetDict = load_dataset("csv", data_files=data_files, delimiter="\t")

### To Do

```text
- Prepare a dataset using Pandas and save it as a HuggingFace dataset.
```

In [4]:
import pandas as pd

df: pd.DataFrame = pd.read_csv("my_data/drugsComTrain_raw.tsv", sep="\t")
df = df.rename(columns={"Unnamed: 0": "id"})
print(f"Shape of data: {df.shape}\n")

df.head()

Unnamed: 0,id,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil""",9.0,"May 20, 2012",27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembe...",8.0,"April 27, 2010",192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I ...",5.0,"December 14, 2009",17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch""",8.0,"November 3, 2015",10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around. I feel healthier, I&#039;m excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you&#039;re ready to stop, there&#039;s a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I&#039;m actually sleeping better. ...",9.0,"November 27, 2016",37


In [5]:
df_1 = df.copy()
df_2 = df.copy()

df_1 = df_1.iloc[:20]
df_2 = df_2.iloc[20:25]

# Save data
fp_1: str = "my_data/sample_train_data.json"
fp_2: str = "my_data/sample_test_data.json"

df_1.to_json(fp_1, orient="records", indent=4)
df_2.to_json(fp_2, orient="records", indent=4)
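
For reference, `orient="records"` writes the rows as a single JSON array with one object per row. The sketch below reproduces that layout with the standard library only; the two rows use a subset of the columns shown above, and the temporary file name is an assumption made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Two rows with a subset of the columns shown above (illustrative only).
rows = [
    {"id": 206461, "drugName": "Valsartan", "rating": 9.0},
    {"id": 95260, "drugName": "Guanfacine", "rating": 8.0},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    fp = Path(tmp_dir) / "sample_records.json"  # hypothetical file name
    # Records orientation: a single JSON array holding one object per row.
    fp.write_text(json.dumps(rows, indent=4), encoding="utf-8")
    loaded = json.loads(fp.read_text(encoding="utf-8"))

print(type(loaded).__name__, len(loaded))
print(sorted(loaded[0]))
```

Each object in the array becomes one example when the file is read back.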

In [6]:
# Load JSON data as a HuggingFace dataset
sample_dataset = load_dataset("json", data_files=fp_1)

sample_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'condition', 'review', 'rating', 'drugName', 'date', 'usefulCount'],
        num_rows: 20
    })
})

In [7]:
sample_dataset.get("train")[0]

{'id': 206461,
 'condition': 'Left Ventricular Dysfunction',
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'rating': 9.0,
 'drugName': 'Valsartan',
 'date': 'May 20, 2012',
 'usefulCount': 27}

In [8]:
# Load the train and test dataset
data_files = {
    "train": "my_data/sample_train_data.json",
    "test": "my_data/sample_test_data.json",
}
sample_dataset = load_dataset("json", data_files=data_files)
sample_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'condition', 'review', 'rating', 'drugName', 'date', 'usefulCount'],
        num_rows: 20
    })
    test: Dataset({
        features: ['id', 'condition', 'review', 'rating', 'drugName', 'date', 'usefulCount'],
        num_rows: 5
    })
})

### Loading A Remote Dataset

```text

- Loading remote files is just as simple as loading local ones! 
- Instead of providing a path to local files, point the data_files argument of load_dataset() to one or more URLs where the remote files are stored. 

- For example, for the SQuAD-it dataset hosted on GitHub, point data_files to the SQuAD_it-*.json.gz URLs as follows:
```

```python

url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}

squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
```
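
The `field="data"` argument is there because SQuAD-style files keep the examples under a top-level `data` key rather than at the root of the JSON document. A minimal sketch of that nesting using only the standard library; the payload, its titles, and the file name are made up and are not real SQuAD-it content.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Toy payload that mimics the nesting: the examples live under "data".
payload = {
    "version": "toy",
    "data": [{"title": "example-1"}, {"title": "example-2"}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    fp = Path(tmp_dir) / "toy.json.gz"  # hypothetical file name
    with gzip.open(fp, "wt", encoding="utf-8") as f:
        json.dump(payload, f)

    # Read the compressed file directly and keep only the nested field.
    with gzip.open(fp, "rt", encoding="utf-8") as f:
        records = json.load(f)["data"]  # the part that field="data" points at

print(len(records))
```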

## [Manipulating Data](https://huggingface.co/learn/nlp-course/chapter5/3?fw=pt)


```text
Download and Save the data
--------------------------
```

```python
!mkdir -p my_data

# Download the archive
!wget -P my_data "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"

# Unzip it into the my_data directory
!unzip my_data/drugsCom_raw.zip -d my_data
```

In [9]:
from datasets import load_dataset
from datasets.dataset_dict import DatasetDict


data_files: dict[str, Any] = {
    "train": "my_data/drugsComTrain_raw.tsv",
    "test": "my_data/drugsComTest_raw.tsv",
}

# \t is the tab character in Python
drug_dataset: DatasetDict = load_dataset("csv", data_files=data_files, delimiter="\t")
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

In [10]:
# Grab a small random sample to get a quick feel for the type of data you’re working with.
# Create a random sample by chaining the Dataset.shuffle() and Dataset.select() functions together:

SEED: int = 42
NUM: int = 1_000

drug_sample: Dataset = drug_dataset["train"].shuffle(seed=SEED).select(range(NUM))
# Peek at the first few examples
drug_sample[:3]

{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than 

In [11]:
drug_dataset.keys()

dict_keys(['train', 'test'])

In [12]:
# Verify that the number of IDs (`Unnamed` column) matches the number of rows in each split using Dataset.unique():

for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

#### Rename Column(s)

In [13]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

In [14]:
COLUMNS: dict[str, str] = {"patient_id": "id", "drugName": "name"}
drug_dataset.rename_columns(COLUMNS)

DatasetDict({
    train: Dataset({
        features: ['id', 'name', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['id', 'name', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

#### Find The Number Of Unique Items

In [15]:
# Number of unique drugs
result_train: int = len(drug_dataset.get("train").unique("drugName"))
result_test: int = len(drug_dataset.get("test").unique("drugName"))

print((result_train, result_test))

In [16]:
dir(drug_dataset)

['__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__ior__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__or__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__ror__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_check_values_features',
 '_check_values_type',
 'align_labels_with_mapping',
 'cache_files',
 'cast',
 'cast_column',
 'class_encode_column',
 'cleanup_cache_files',
 'clear',
 'column_names',
 'copy',
 'data',
 'filter',
 'flatten',
 'flatten_indices',
 'formatted_as',
 'from_csv',
 'from_json',
 'from_parquet',
 'from_text',
 'fromkeys',
 'get',
 'items',
 'keys',
 'load_from_disk',
 'map',
 'num_columns',
 'num_rows',
 'pop',
 'popitem',
 'prepare_for_task

### Apply A Custom Function

```text
- Dataset.map()
- Dataset.filter()
```
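
As a rough mental model (not the library's actual implementation), `Dataset.map()` calls the function on each example and merges the returned dictionary into it, while `Dataset.filter()` keeps only the examples for which the function returns `True`. The plain-Python sketch below works on a list of dictionaries; the rows are made up, and `toy_map` and `toy_filter` are hypothetical helper names.

```python
from typing import Any, Callable

Example = dict[str, Any]


def toy_map(rows: list[Example], fn: Callable[[Example], Example]) -> list[Example]:
    """Merge the dictionary returned by fn into a copy of each row."""
    return [{**row, **fn(row)} for row in rows]


def toy_filter(rows: list[Example], fn: Callable[[Example], bool]) -> list[Example]:
    """Keep the rows for which fn returns True."""
    return [row for row in rows if fn(row)]


rows = [
    {"drugName": "A", "condition": "ADHD"},
    {"drugName": "B", "condition": None},  # missing condition
    {"drugName": "C", "condition": "Birth Control"},
]

# Filter first, so .lower() is never called on None.
valid = toy_filter(rows, lambda x: x["condition"] is not None)
lowered = toy_map(valid, lambda x: {"condition": x["condition"].lower()})

print([row["condition"] for row in lowered])
```

Running the steps in the opposite order would call `.lower()` on `None` and fail.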

In [17]:
def lowercase_condition(example: dict[str, Any]) -> dict[str, Any]:
    """This converts the value of the condition to lowercase."""
    return {"condition": example.get("condition").lower()}


# Some rows have no condition (None), so calling .lower() raises an AttributeError.
try:
    drug_dataset.map(lowercase_condition)
except AttributeError as err:
    print(err)

In [18]:
# Some rows have an invalid condition (i.e. condition is None). Drop those rows.
# For example, the built-in filter() drops the items that fail a test:
my_list: list[str] = [
    "ML Engineer",
    "Data Scientist",
    "Data Engineer",
    "Research Engineer",
    "Banker",
]
# Drop `Banker`
list(filter(lambda x: x != "Banker", my_list))

['ML Engineer', 'Data Scientist', 'Data Engineer', 'Research Engineer']

In [19]:
# Drop the rows with an invalid condition, i.e. where condition is None
drug_dataset_1 = drug_dataset.filter(lambda x: x.get("condition") is not None)

print(f"Size BEFORE dropping rows: {drug_dataset.num_rows}")
print(f"Size AFTER dropping rows: {drug_dataset_1.num_rows}")

In [20]:
# Convert to lowercase
drug_dataset_1 = drug_dataset_1.map(lowercase_condition)

In [21]:
# Before
drug_dataset.get("train").shuffle(seed=123).select(range(5))[:].get("condition")

['ADHD',
 'ADHD',
 'Birth Control',
 'Post Traumatic Stress Disorde',
 'Allergic Rhinitis']

In [22]:
# After
drug_dataset_1.get("train").shuffle(seed=123).select(range(5))[:].get("condition")

['herpes simplex, mucocutaneous/immunocompetent host',
 '13</span> users found this comment helpful.',
 'ibromyalgia',
 'birth control',
 'adhd']

#### Creating New Columns


In [23]:
drug_dataset_1.get("train").features

{'patient_id': Value(dtype='int64', id=None),
 'drugName': Value(dtype='string', id=None),
 'condition': Value(dtype='string', id=None),
 'review': Value(dtype='string', id=None),
 'rating': Value(dtype='float64', id=None),
 'date': Value(dtype='string', id=None),
 'usefulCount': Value(dtype='int64', id=None)}

In [24]:
def add_review_length(example: dict[str, Any]) -> dict[str, Any]:
    """This is used to add the length of a review (in words) to the dataset."""
    return {"review_length": len(example.get("review").split(" "))}


drug_dataset_1 = drug_dataset_1.map(add_review_length)
drug_dataset_1

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 160398
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 53471
    })
})
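
One caveat with `split(" ")`: it splits on single space characters only, so line breaks such as `\r\n` (which appear in some reviews) do not separate words, and consecutive spaces produce empty strings. Calling `split()` with no argument splits on any run of whitespace instead. A small comparison on made-up strings:

```python
text = "It is great\r\nas a pain reducer"

by_space = text.split(" ")    # splits on each single space only
by_whitespace = text.split()  # splits on any run of whitespace

print(len(by_space), by_space)
print(len(by_whitespace), by_whitespace)

# Consecutive spaces yield an empty string with split(" ").
print("a  b".split(" "), "a  b".split())
```

Which behaviour is preferable depends on how the length is used later; the counts in this notebook come from `split(" ")`.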

In [25]:
drug_dataset_1.get("train")[20]

{'patient_id': 12372,
 'drugName': 'Atripla',
 'condition': 'hiv infection',
 'review': '"Spring of 2008 I was hospitalized with pnuemonia and diagnosed with Lyme diease and full blown AIDS with CD4 count of &quot;11&quot; viral load some number so high in the millions I could never remember. I was taking Combivir and Kaletra with Dapsone for the 1st year then it stopped working. I started Kaletra with the Dapsone my CD4 count is now 209 and rising. For a few weeks I was very aggressive and broke all my dishes in the house LOL. I take vitamin supplements and drink a boost pluz every day. LIfe is good now!"',
 'rating': 8.0,
 'date': 'July 9, 2010',
 'usefulCount': 11,
 'review_length': 97}

#### Sort Values

In [26]:
# Sort by the length of the review
print(drug_dataset_1.get("train").sort(column_names="review_length", reverse=True)[:3])
# drug_dataset_1.sort()

In [27]:
# Drop reviews that are shorter than 30 words
drug_dataset_2: DatasetDict = drug_dataset_1.filter(
    lambda x: x.get("review_length") >= 30
)

drug_dataset_2.num_rows

{'train': 139894, 'test': 46552}

In [28]:
# The last thing we need to deal with is the presence of HTML character codes in our reviews.
# We can use Python’s html module to unescape these characters, like so:

import html


text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

In [29]:
# Unescape the HTML character codes, one example at a time.
# This approach is slower than the batched version.
drug_dataset_2.map(lambda x: {"review": html.unescape(x.get("review"))})

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 139894
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46552
    })
})

#### Note

```text
- The Dataset.map() method can be used with the batched argument set to True to process multiple examples at once, improving processing speed. 
- The function receives a dictionary with fields as lists and returns a similar dictionary with updated or added values.
```
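
To make the batched calling convention concrete, the sketch below calls a batched function by hand on a dictionary of lists, which is the shape that `Dataset.map(..., batched=True)` passes in. The two review strings and the function name `unescape_reviews` are made up for illustration.

```python
import html
from typing import Any


def unescape_reviews(batch: dict[str, list[Any]]) -> dict[str, list[str]]:
    """Unescape HTML character codes for every review in the batch."""
    return {"review": [html.unescape(review) for review in batch["review"]]}


# A batch is a dict of column name -> list of values, one entry per example.
batch = {
    "review": ["I&#039;m feeling better", "CD4 count of &quot;11&quot;"],
    "rating": [9.0, 8.0],
}

result = unescape_reviews(batch)
print(result["review"])
```

The returned list has one entry per input example, so the updated column lines up with the untouched ones.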

In [30]:
# Unescape the HTML character codes in batches.
# This approach is faster (recommended)
drug_dataset_2 = drug_dataset_2.map(
    lambda x: {"review": [html.unescape(obj) for obj in x.get("review")]}, batched=True
)

In [31]:
print(drug_dataset_2.get("train")[10].get("review"))

#### Tokenize The Texts

In [32]:
from transformers import AutoTokenizer


CHECKPOINT: str = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)


def tokenize_function(examples: dict[str, Any]) -> dict[str, Any]:
    """This is used to tokenize the text. It returns a dict containing the
    input_ids, token_type_ids and attention_mask."""
    return tokenizer(examples.get("review"), truncation=True)

In [33]:
text: str = "My name is Chinedu"

# Return the token IDs and attention mask
print(tokenizer(text, truncation=True))

In [34]:
# Return the tokenized output
print(tokenizer.tokenize(text))

In [35]:
# Tokenize the data with batched=True [faster]
tokenized_dataset: DatasetDict = drug_dataset_2.map(tokenize_function, batched=True)

In [36]:
def tokenize_and_split(examples: dict[str, Any]) -> dict[str, Any]:
    """Tokenize and truncate the text to a maximum length of 128 and return
    all the chunks of the texts."""
    return tokenizer(
        examples.get("review"),
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

In [37]:
sample: dict[str, Any] = drug_dataset_2.get("train")[0]
print(sample)

In [38]:
result: dict[str, Any] = tokenize_and_split(sample)

# The review is split into two chunks: the first 128 tokens and the overflowing tokens.
# The list below gives the length of each chunk.
[len(inp) for inp in result.get("input_ids")]

[128, 49]

In [39]:
# tokenize_and_split returns `overflow_to_sample_mapping` as one of the keys
result.get("overflow_to_sample_mapping")

[0, 0]
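
The lengths `[128, 49]` and the mapping `[0, 0]` can be reproduced with a simplified model of the chunking: each chunk holds at most `max_length` tokens including the special tokens added around it, and every chunk records the index of the example it came from. The sketch assumes two special tokens per chunk (as with BERT's `[CLS]` and `[SEP]`) and no overlap between chunks. `chunk_lengths` is a hypothetical helper, not a tokenizer API, and the count of 173 content tokens is simply the value that yields `[128, 49]` under these assumptions.

```python
def chunk_lengths(
    token_counts: list[int], max_length: int = 128, num_special: int = 2
) -> tuple[list[int], list[int]]:
    """Return the chunk lengths and a chunk -> example index mapping."""
    room = max_length - num_special  # content tokens that fit in one chunk
    lengths: list[int] = []
    sample_map: list[int] = []
    for sample_idx, count in enumerate(token_counts):
        remaining = count
        while True:
            take = min(remaining, room)
            lengths.append(take + num_special)
            sample_map.append(sample_idx)
            remaining -= take
            if remaining <= 0:
                break
    return lengths, sample_map


# One example with 173 content tokens, followed by a short one with 20.
lengths, sample_map = chunk_lengths([173, 20])
print(lengths)
print(sample_map)
```

With more than one example in the batch, the mapping is what tells the chunks of different reviews apart.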

In [40]:
# The first example in the training set was tokenized into two features due to exceeding the
# maximum token limit. This process will be applied to all elements of the dataset.
from pyarrow.lib import ArrowInvalid

try:
    tokenized_dataset_1: DatasetDict = tokenized_dataset.map(
        tokenize_and_split, batched=True
    )
except ArrowInvalid as err:
    print(err)

In [41]:
# The error occurred because the columns returned by the tokenizer are longer than the
# existing columns: long reviews are split into several chunks. To resolve this, we can either
# drop the old columns with the remove_columns argument or make them the same length as the
# new ones.
# Approach 1: remove the old columns.

tokenized_dataset_1: DatasetDict = tokenized_dataset.map(
    tokenize_and_split,
    batched=True,
    remove_columns=tokenized_dataset.get("train").column_names,
)

# The old column names have been dropped!
tokenized_dataset_1

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'],
        num_rows: 208152
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'],
        num_rows: 69320
    })
})

In [42]:
# The two datasets have different num of rows
len(tokenized_dataset_1.get("train")), len(drug_dataset_2.get("train"))

(208152, 139894)

In [43]:
# We can solve the mismatched length problem by making the old columns the same size as the
# new ones, using the overflow_to_sample_mapping field.


def tokenize_and_split_2(examples: dict[str, Any]) -> dict[str, Any]:
    """Tokenize and truncate the text to a maximum length of 128 and return
    all the chunks of the texts. It ensures that the old and new columns are the same."""
    result: dict[str, Any] = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result
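
The realignment loop is easier to follow on a toy batch. In the sketch below the tokenizer output is replaced by a hand-written `sample_map` (an assumption standing in for `overflow_to_sample_mapping`), where the first example produced two chunks and the second produced one. `repeat_columns` is a hypothetical helper name.

```python
from typing import Any


def repeat_columns(
    examples: dict[str, list[Any]], sample_map: list[int]
) -> dict[str, list[Any]]:
    """Repeat each old value once per chunk generated from its example."""
    return {
        key: [values[i] for i in sample_map] for key, values in examples.items()
    }


examples = {"patient_id": [95260, 92703], "rating": [8.0, 5.0]}
sample_map = [0, 0, 1]  # chunks 0 and 1 come from example 0, chunk 2 from example 1

aligned = repeat_columns(examples, sample_map)
print(aligned)
```

Every old column now has one value per chunk, so its length matches the new tokenizer columns.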

In [44]:
# Approach 2
tokenized_dataset_2: DatasetDict = tokenized_dataset.map(
    tokenize_and_split_2, batched=True
)

# The old columns have NOT been dropped!
tokenized_dataset_2

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 208152
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 69320
    })
})

### Convert HuggingFace Datasets To Dataframes

```text
- More analyses can be performed once the data has been converted to tabular format.
```

In [45]:
dataset_df = tokenized_dataset_2.get("train").to_pandas()

dataset_df.head()

Unnamed: 0,patient_id,drugName,condition,review,rating,date,usefulCount,review_length,input_ids,token_type_ids,attention_mask
0,95260,Guanfacine,adhd,"""My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembe...",8.0,"April 27, 2010",192,141,"[101, 107, 1422, 1488, 1110, 9079, 1194, 1117, 2223, 1989, 1104, 1130, 19972, 11083, 119, 1284, 1245, 4264, 1165, 1119, 1310, 1142, 1314, 1989, 117, 1165, 1119, 1408, 1781, 1103, 2439, 13753, 1119, 1209, 1129, 1113, 119, 1370, 1160, 1552, 117, 1119, 1180, 6374, 1243, 1149, 1104, 1908, 117, 1108, 1304, 172, 14687, 1183, 117, 1105, 7362, 1111, 2212, 129, 2005, 1113, 170, 2797, 1313, 1121, 1278, 12020, 113, 1304, 5283, 1111, 1140, 119, 114, 146, 1270, 1117, 3995, 1113, 6356, 2106, 1105, 1131, 1163, 1106, 6166, 1122, 1149, 170, 1374, 1552, 119, 3969, 1293, 1119, 1225, 1120, 1278, 117, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"
1,95260,Guanfacine,adhd,"""My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembe...",8.0,"April 27, 2010",192,141,"[101, 107, 1422, 1488, 1110, 9079, 1194, 1117, 2223, 1989, 1104, 1130, 19972, 11083, 119, 1284, 1245, 4264, 1165, 1119, 1310, 1142, 1314, 1989, 117, 1165, 1119, 1408, 1781, 1103, 2439, 13753, 1119, 1209, 1129, 1113, 119, 1370, 1160, 1552, 117, 1119, 1180, 6374, 1243, 1149, 1104, 1908, 117, 1108, 1304, 172, 14687, 1183, 117, 1105, 7362, 1111, 2212, 129, 2005, 1113, 170, 2797, 1313, 1121, 1278, 12020, 113, 1304, 5283, 1111, 1140, 119, 114, 146, 1270, 1117, 3995, 1113, 6356, 2106, 1105, 1131, 1163, 1106, 6166, 1122, 1149, 170, 1374, 1552, 119, 3969, 1293, 1119, 1225, 1120, 1278, 117, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"
2,92703,Lybrel,birth control,"""I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it's the end of the third week- I still...",5.0,"December 14, 2009",17,133,"[101, 107, 146, 1215, 1106, 1321, 1330, 9619, 14255, 4487, 17046, 117, 1134, 1125, 1626, 21822, 5120, 117, 1105, 1108, 1304, 2816, 118, 1304, 1609, 6461, 117, 12477, 1775, 126, 1552, 117, 1185, 1168, 1334, 3154, 119, 1252, 1122, 4049, 21055, 176, 2556, 13040, 1673, 117, 1134, 1110, 1136, 1907, 1107, 1646, 117, 1177, 146, 6759, 1106, 149, 1183, 9730, 1233, 117, 1272, 1103, 13288, 1132, 1861, 119, 1332, 1139, 1168, 17029, 2207, 117, 146, 1408, 149, 1183, 9730, 1233, 2411, 117, 1113, 1139, 1148, 1285, 1104, 1669, 117, 1112, 1103, 7953, 1163, 119, 1262, 1103, 1669, 5695, 1111, 1160, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"
3,92703,Lybrel,birth control,"""I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it's the end of the third week- I still...",5.0,"December 14, 2009",17,133,"[101, 107, 146, 1215, 1106, 1321, 1330, 9619, 14255, 4487, 17046, 117, 1134, 1125, 1626, 21822, 5120, 117, 1105, 1108, 1304, 2816, 118, 1304, 1609, 6461, 117, 12477, 1775, 126, 1552, 117, 1185, 1168, 1334, 3154, 119, 1252, 1122, 4049, 21055, 176, 2556, 13040, 1673, 117, 1134, 1110, 1136, 1907, 1107, 1646, 117, 1177, 146, 6759, 1106, 149, 1183, 9730, 1233, 117, 1272, 1103, 13288, 1132, 1861, 119, 1332, 1139, 1168, 17029, 2207, 117, 146, 1408, 149, 1183, 9730, 1233, 2411, 117, 1113, 1139, 1148, 1285, 1104, 1669, 117, 1112, 1103, 7953, 1163, 119, 1262, 1103, 1669, 5695, 1111, 1160, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"
4,138000,Ortho Evra,birth control,"""This is my first time using any form of birth control. I'm glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch""",8.0,"November 3, 2015",10,89,"[101, 107, 1188, 1110, 1139, 1148, 1159, 1606, 1251, 1532, 1104, 3485, 1654, 119, 146, 112, 182, 5171, 146, 1355, 1114, 1103, 10085, 117, 146, 1138, 1151, 1113, 1122, 1111, 129, 1808, 119, 1335, 1148, 1135, 10558, 1139, 181, 21883, 2572, 1133, 1115, 4841, 25984, 119, 1109, 1178, 1205, 5570, 1110, 1115, 1122, 1189, 1139, 6461, 2039, 113, 126, 118, 127, 1552, 1106, 1129, 6129, 114, 146, 1215, 1106, 1178, 1138, 6461, 1111, 124, 118, 125, 1552, 12477, 1775, 1145, 1189, 1139, 172, 4515, 3491, 5827, 1111, 1103, 1148, 1160, 1552, 1104, 1139, 1669, 117, 146, 1309, 1125, 172, 4515, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"


In [46]:
freq: pd.DataFrame = (
    df["condition"]
    .value_counts()
    .reset_index()
    .rename(columns={"index": "condition", "condition": "frequency"})
)

freq.head()

|   | condition     | frequency |
|---|---------------|-----------|
| 0 | Birth Control | 28788     |
| 1 | Depression    | 9069      |
| 2 | Pain          | 6145      |
| 3 | Anxiety       | 5904      |
| 4 | Acne          | 5588      |


#### Create A Dataset Object Using A Pandas DataFrame

In [47]:
freq_dataset: Dataset = Dataset.from_pandas(df=freq)

freq_dataset

Dataset({
    features: ['condition', 'frequency'],
    num_rows: 884
})

In [48]:
freq_dataset[:3]

{'condition': ['Birth Control', 'Depression', 'Pain'],
 'frequency': [28788, 9069, 6145]}

### Save The Dataset

<br>

[![image.png](https://i.postimg.cc/8zQHjhXW/image.png)](https://postimg.cc/7f97RTkY)

In [49]:
# Save the data in Arrow format
sample_data = df.iloc[:15]
Dataset.from_pandas(sample_data).save_to_disk("./my_data/sample_data")

#### Load Dataset

```python
from datasets import load_from_disk


my_data: DatasetDict = load_from_disk("path-to-data")

```

In [50]:
from datasets import load_from_disk

load_from_disk("./my_data/sample_data")

Dataset({
    features: ['id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
    num_rows: 15
})

<hr>

#### Save To Disk (JSON/CSV)

```text

- For the CSV and JSON formats, each split has to be stored as a separate file. 
- One way to do this is by iterating over the keys and values in the DatasetDict object
```
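
A library-free sketch of that loop, assuming a plain dictionary of splits in place of the `DatasetDict` and writing one JSON object per line (the JSON Lines layout). The rows and the temporary directory are placeholders.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for a DatasetDict: split name -> list of rows.
splits = {
    "train": [{"id": 1, "condition": "adhd"}, {"id": 2, "condition": "pain"}],
    "test": [{"id": 3, "condition": "acne"}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    for split, rows in splits.items():
        fp = Path(tmp_dir) / f"drug-reviews-{split}.jsonl"
        with fp.open("w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")  # one JSON object per line

    written = sorted(p.name for p in Path(tmp_dir).iterdir())
    with (Path(tmp_dir) / "drug-reviews-train.jsonl").open(encoding="utf-8") as f:
        reloaded = [json.loads(line) for line in f]

print(written)
print(reloaded)
```

Because every line is a complete JSON object, a split can be read back one line at a time.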



In [51]:
print(tokenized_dataset_2.items())
print("========")
print(tokenized_dataset_2.keys())

In [52]:
# Save each split as a JSON Lines file
for _split, _dataset in tokenized_dataset_2.items():
    _dataset.to_json(f"./my_data/drug-reviews-{_split}.jsonl")

## [Working With Big Data](https://huggingface.co/learn/nlp-course/chapter5/4?fw=pt)