# Working With Hugging Face Datasets

- [Loading Data](https://huggingface.co/learn/nlp-course/chapter5/2?fw=pt)
- [Manipulating Data](https://huggingface.co/learn/nlp-course/chapter5/3?fw=pt)

In [1]:
# Standard library
import re
import json
from typing import Any, Dict, List, Optional, Union
import logging
import warnings

# Third-party libraries
import numpy as np
from pprint import pprint
import pandas as pd
from rich import print

# Visualization
import matplotlib.pyplot as plt


# Pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 600

warnings.filterwarnings("ignore")

# Black code formatter (Optional)
%load_ext lab_black

# auto reload imports
%load_ext autoreload
%autoreload 2

### [Working With Remote And Local Datasets](https://huggingface.co/learn/nlp-course/chapter5/2?fw=pt)

<br>

[![image.png](https://i.postimg.cc/43xyFsHR/image.png)](https://postimg.cc/9DscD3BL)

### Loading A Local Dataset

- For this example we’ll use the [SQuAD-it dataset](https://github.com/crux82/squad-it/), which is a large-scale dataset for question answering in Italian.

- The training and test splits are hosted on GitHub, so we can download them with a simple wget command:

<br>

```python
# Create the my_data directory if it does not exist.
!mkdir -p my_data

# Download the compressed train and test splits into the my_data directory.
!wget -P my_data https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!wget -P my_data https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

# The two downloaded files, SQuAD_it-train.json.gz and SQuAD_it-test.json.gz,
# can be decompressed with the Linux gzip command (-d decompress, -k keep the archives, -v verbose):
!gzip -dkv my_data/SQuAD_it-*.json.gz
```
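
Where the `gzip` command is not available (for example on Windows), the decompression step can be approximated in pure Python. The following is a minimal sketch using only the standard library; the helper name `gunzip_keep` and the throwaway demo file are hypothetical and only illustrate the idea of `gzip -dk`, which keeps the compressed original.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip_keep(gz_path: Path) -> Path:
    """Decompress `gz_path` next to itself and keep the .gz file (like `gzip -dk`)."""
    out_path = gz_path.with_suffix("")  # "x.json.gz" -> "x.json"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


# Self-contained demo on a throwaway file instead of the real download.
demo_dir = Path(tempfile.mkdtemp())
demo_gz = demo_dir / "demo.json.gz"
with gzip.open(demo_gz, "wt", encoding="utf-8") as f:
    f.write('{"data": []}')

demo_json = gunzip_keep(demo_gz)
print(demo_json.name, demo_json.read_text(encoding="utf-8"))
```

For the real files, the same helper could presumably be called once per archive in `my_data`.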

In [2]:
# Run this once!
"""
!mkdir -p my_data

# Download the archive
!wget -P my_data "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"

# Unzip into my_data, where the later cells expect the .tsv files
!unzip my_data/drugsCom_raw.zip -d my_data

"""

In [3]:
from datasets import load_dataset
from datasets.dataset_dict import DatasetDict


data_files: dict[str, Any] = {
    "train": "my_data/drugsComTrain_raw.tsv",
    "test": "my_data/drugsComTest_raw.tsv",
}

# \t is the tab character in Python
drug_dataset: DatasetDict = load_dataset("csv", data_files=data_files, delimiter="\t")

### To Do

```text
- Prepare a dataset using Pandas and save it as a Hugging Face dataset.
```

In [4]:
import pandas as pd

df: pd.DataFrame = pd.read_csv("my_data/drugsComTrain_raw.tsv", sep="\t")
df = df.rename(columns={"Unnamed: 0": "id"})
print(f"Shape of data: {df.shape}\n")

df.head()

,id,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil""",9.0,"May 20, 2012",27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembe...",8.0,"April 27, 2010",192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I ...",5.0,"December 14, 2009",17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch""",8.0,"November 3, 2015",10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around. I feel healthier, I&#039;m excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you&#039;re ready to stop, there&#039;s a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I&#039;m actually sleeping better. ...",9.0,"November 27, 2016",37


In [5]:
df_1 = df.copy()
df_2 = df.copy()

df_1 = df_1.iloc[:20]
df_2 = df_2.iloc[20:25]

# Save data
fp_1: str = "my_data/sample_train_data.json"
fp_2: str = "my_data/sample_test_data.json"

df_1.to_json(fp_1, orient="records", indent=4)
df_2.to_json(fp_2, orient="records", indent=4)

In [6]:
# Load JSON data as a Hugging Face dataset
sample_dataset = load_dataset("json", data_files=fp_1)

sample_dataset

DatasetDict({
    train: Dataset({
        features: ['date', 'drugName', 'rating', 'condition', 'usefulCount', 'id', 'review'],
        num_rows: 20
    })
})

In [7]:
sample_dataset.get("train")[0]

{'date': 'May 20, 2012',
 'drugName': 'Valsartan',
 'rating': 9.0,
 'condition': 'Left Ventricular Dysfunction',
 'usefulCount': 27,
 'id': 206461,
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"'}

In [8]:
# Load the train and test dataset
data_files = {
    "train": "my_data/sample_train_data.json",
    "test": "my_data/sample_test_data.json",
}
sample_dataset = load_dataset("json", data_files=data_files)
sample_dataset

DatasetDict({
    train: Dataset({
        features: ['date', 'drugName', 'rating', 'condition', 'usefulCount', 'id', 'review'],
        num_rows: 20
    })
    test: Dataset({
        features: ['date', 'drugName', 'rating', 'condition', 'usefulCount', 'id', 'review'],
        num_rows: 5
    })
})

### Loading A Remote Dataset

```text

- Loading remote files is just as simple as loading local ones! 
- Instead of providing a path to local files, point the data_files argument of load_dataset() to one or more URLs where the remote files are stored. 

- For example, for the SQuAD-it dataset hosted on GitHub, point data_files to the SQuAD_it-*.json.gz URLs as follows:
```

```python

url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}

squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
```
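
The `field="data"` argument is needed when the records sit under a top-level key instead of at the root of the JSON file. The sketch below uses only the standard library and a hypothetical miniature document, assuming a SQuAD-style layout, to show roughly which part of the file that argument points the loader at.

```python
import json

# Hypothetical miniature document in a SQuAD-style layout.
raw_text = json.dumps(
    {
        "version": "demo",
        "data": [
            {"title": "Article A", "paragraphs": []},
            {"title": "Article B", "paragraphs": []},
        ],
    }
)

document = json.loads(raw_text)
records = document["data"]  # roughly what field="data" selects
print(len(records), [record["title"] for record in records])
```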

## [Manipulating Data](https://huggingface.co/learn/nlp-course/chapter5/3?fw=pt)


```text
Download and Save the data
--------------------------
```

```python
!mkdir -p my_data

# Download the archive
!wget -P my_data "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"

# Unzip into my_data, where the next cell expects the .tsv files
!unzip my_data/drugsCom_raw.zip -d my_data
```

In [9]:
from datasets import load_dataset
from datasets.dataset_dict import DatasetDict


data_files: dict[str, Any] = {
    "train": "my_data/drugsComTrain_raw.tsv",
    "test": "my_data/drugsComTest_raw.tsv",
}

# \t is the tab character in Python
drug_dataset: DatasetDict = load_dataset("csv", data_files=data_files, delimiter="\t")
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})
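
The `Unnamed: 0` column most likely appears because the header row of the TSV files starts with an empty column name. Pandas labels such a column `Unnamed: 0`, and the `csv` loader appears to follow the same convention. A minimal sketch with a hypothetical two-row TSV:

```python
import io

import pandas as pd

# Hypothetical two-row TSV whose header starts with an empty column name.
tsv_text = "\tdrugName\trating\n206461\tValsartan\t9.0\n95260\tGuanfacine\t8.0\n"

mini_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(mini_df.columns.tolist())
```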

In [10]:
# Grab a small random sample to get a quick feel for the type of data you’re working with.
# Create a random sample by chaining the Dataset.shuffle() and Dataset.select() methods:
from datasets import Dataset

SEED: int = 42
NUM: int = 1_000

# Indexing a DatasetDict by split name returns a Dataset, not a DatasetDict.
drug_sample: Dataset = drug_dataset["train"].shuffle(seed=SEED).select(range(NUM))
# Peek at the first few examples
drug_sample[:3]

{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than 

In [11]:
drug_dataset.keys()

dict_keys(['train', 'test'])

In [12]:
# Verify that the number of IDs (`Unnamed: 0` column) matches the number of rows in each split using Dataset.unique():

for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

#### Rename Column(s)

In [13]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

In [14]:
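COLUMNS: dict[str, str] = {"patient_id": "id", "drugName": "name"}
# rename_columns() returns a new DatasetDict; the result is only displayed here,
# so drug_dataset itself keeps the `patient_id` and `drugName` columns.
drug_dataset.rename_columns(COLUMNS)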

DatasetDict({
    train: Dataset({
        features: ['id', 'name', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['id', 'name', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

#### Find The Number Of Unique Items

In [15]:
# Number of unique drugs
result_train: int = len(drug_dataset.get("train").unique("drugName"))
result_test: int = len(drug_dataset.get("test").unique("drugName"))

print((result_train, result_test))

In [16]:
dir(drug_dataset)

['__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__ior__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__or__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__ror__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_check_values_features',
 '_check_values_type',
 'align_labels_with_mapping',
 'cache_files',
 'cast',
 'cast_column',
 'class_encode_column',
 'cleanup_cache_files',
 'clear',
 'column_names',
 'copy',
 'data',
 'filter',
 'flatten',
 'flatten_indices',
 'formatted_as',
 'from_csv',
 'from_json',
 'from_parquet',
 'from_text',
 'fromkeys',
 'get',
 'items',
 'keys',
 'load_from_disk',
 'map',
 'num_columns',
 'num_rows',
 'pop',
 'popitem',
 'prepare_for_task

### Apply A Custom Function

```text
- Dataset.map()
- Dataset.filter()
```

In [17]:
def lowercase_condition(example: dict[str, Any]) -> dict[str, str]:
    """Return the example's condition converted to lowercase."""
    return {"condition": example.get("condition").lower()}


# Some rows have no condition (None), so calling .lower() on them raises an AttributeError.
try:
    drug_dataset.map(lowercase_condition)
except AttributeError as err:
    print(err)

In [18]:
# Some rows have an invalid condition, i.e. the condition is None. Drop those rows.
# The built-in filter() shows the idea on a plain list, e.g.:
my_list: list[str] = [
    "ML Engineer",
    "Data Scientist",
    "Data Engineer",
    "Research Engineer",
    "Banker",
]
# Drop `Banker`
list(filter(lambda x: x != "Banker", my_list))

['ML Engineer', 'Data Scientist', 'Data Engineer', 'Research Engineer']

In [19]:
# Drop the invalid conditions, i.e. rows where the condition is None
drug_dataset_1 = drug_dataset.filter(lambda x: x.get("condition") is not None)

print(f"Size BEFORE dropping rows: {drug_dataset.num_rows}")
print(f"Size AFTER dropping rows: {drug_dataset_1.num_rows}")

In [20]:
# Convert to lowercase
drug_dataset_1 = drug_dataset_1.map(lowercase_condition)

In [21]:
# Before
drug_dataset.get("train").shuffle(seed=123).select(range(5))[:].get("condition")

['ADHD',
 'ADHD',
 'Birth Control',
 'Post Traumatic Stress Disorde',
 'Allergic Rhinitis']

In [22]:
# After (the filtered split has fewer rows, so the same seed selects different rows than above)
drug_dataset_1.get("train").shuffle(seed=123).select(range(5))[:].get("condition")

['herpes simplex, mucocutaneous/immunocompetent host',
 '13</span> users found this comment helpful.',
 'ibromyalgia',
 'birth control',
 'adhd']