<a href="https://colab.research.google.com/github/educatorsRlearners/hugging_face_course/blob/main/05_the_%F0%9F%A4%97_Datasets_library.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install datasets transformers[sentencepiece]

# Introduction

To review, there are three main steps to fine-tuning a model: 

1. Load a dataset from the Hugging Face Hub
2. Preprocess the data with ```Dataset.map()```
3. Load and compute metrics

But, what if 
- your data isn't on the Hub?
- you ***really*** need to use ```Pandas```?
- the dataset can't fit into your computer's RAM?
- you want to push your own data to the Hub?

Well, that's what this chapter is for 😀

# [What if my dataset isn't on the Hub?](https://huggingface.co/course/chapter5/2?fw=pt) 

## Working with local and remote datasets 

We simply have to pass the file type (e.g., csv, text, json) as well as the location of the file to ```load_dataset``` like so: 

In [12]:
# !wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

In [13]:
from datasets import load_dataset

# local_csv_dataset = load_dataset("csv", 
#                                  data_files= "winequality-white.csv",
#                                  sep=";")

In [14]:
# local_csv_dataset['train'][:5]
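
To see why the ```sep=";"``` argument matters for a file like this one, here is a small sketch that uses only Python's standard library. The three-column sample is made up for illustration and only mimics a semicolon-delimited layout; ```read_rows``` is a hypothetical helper, not part of ```datasets```.

```python
import csv
import io

# Made-up sample in a semicolon-delimited layout (not the real file contents).
sample = "fixed acidity;volatile acidity;quality\n7.0;0.27;6\n6.3;0.30;6\n"


def read_rows(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# With the default comma delimiter, every line collapses into a single column.
comma_rows = read_rows(sample, ",")
# With the right delimiter, each value lands in its own column.
semicolon_rows = read_rows(sample, ";")

print(list(comma_rows[0].keys()))
print(list(semicolon_rows[0].keys()))
```

The same idea applies when loading with ```load_dataset```: without the right separator, the whole header would likely be read as one column name.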

Now imagine you want to load the train and test splits together so that you can use ```Dataset.map()``` to preprocess both at once. 

To do so, we pass a dictionary as the ```data_files``` argument of ```load_dataset``` that maps each split name to its file, like this: 



In [15]:
# data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
# squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
# squad_it_dataset
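
The ```field="data"``` argument in the commented-out cell is needed when the examples sit under a top-level key of the JSON file rather than at the root. The sketch below shows that kind of nesting with the standard library only; the file contents are a made-up, heavily trimmed stand-in, so the keys other than ```data``` are assumptions.

```python
import json

# Hypothetical, trimmed-down file contents with the records nested under "data".
raw = json.dumps(
    {
        "version": "1.1",
        "data": [
            {"title": "Article A", "paragraphs": []},
            {"title": "Article B", "paragraphs": []},
        ],
    }
)

payload = json.loads(raw)
# This list is what field="data" points the JSON loader at.
records = payload["data"]
titles = [record["title"] for record in records]

print(len(records), titles)
```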

In [16]:
data_files = {"train": "sample_data/california_housing_train.csv", 
              "test": "sample_data/california_housing_test.csv"}

housing_dataset = load_dataset("csv", data_files=data_files)

Using custom data configuration default-5a2d12c9cd054b87
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-5a2d12c9cd054b87/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e)


  0%|          | 0/2 [00:00<?, ?it/s]

In [17]:
housing_dataset

DatasetDict({
    train: Dataset({
        features: ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value'],
        num_rows: 17000
    })
    test: Dataset({
        features: ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value'],
        num_rows: 3000
    })
})

## Remote Files

For remote files, we simply pass the URL to ```data_files``` like this: 

In [21]:
url = "https://raw.githubusercontent.com/educatorsRlearners/podrevday/master/data/user_data.csv"
prd = load_dataset('csv', data_files=url)

Using custom data configuration default-e8d5249814fd0fe4


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-e8d5249814fd0fe4/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e...


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading:   0%|          | 0.00/42.0k [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-e8d5249814fd0fe4/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [22]:
prd

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'id', 'username', 'displayname', 'location', 'created', 'followersCount', 'friendsCount', 'url', 'verified', 'geotext', 'city', 'country'],
        num_rows: 640
    })
})

## Raw Text

When working with text files, we simply load them the same way: 

In [28]:
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"

text_file = load_dataset("text", data_files=url)

Using custom data configuration default-599a481abc93904d
Reusing dataset text (/root/.cache/huggingface/datasets/text/default-599a481abc93904d/0.0.0/d86c40dad297bdddf277b406c6a59f0250b5318c400bf23d420a31aff88c84c4)


  0%|          | 0/1 [00:00<?, ?it/s]

In [30]:
text_file['train'][:5]

{'text': ['First Citizen:',
  'Before we proceed any further, hear me speak.',
  '',
  'All:',
  'Speak, speak.']}

# [Time to slice and dice](https://huggingface.co/course/chapter5/3?fw=pt)

## Slicing and dicing our data 

Life is not a Kaggle competition; we will always need to clean and format our data before we do anything else. 

Luckily, the people at 🤗 know that and have built some functionality for it into the ```datasets``` library. 

To explore these features, we'll use the Drug Review Dataset from UC Irvine's Machine Learning Repository. 

In [31]:
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip

--2022-01-25 18:44:00--  https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42989872 (41M) [application/x-httpd-php]
Saving to: ‘drugsCom_raw.zip’


2022-01-25 18:44:01 (66.9 MB/s) - ‘drugsCom_raw.zip’ saved [42989872/42989872]

Archive:  drugsCom_raw.zip
  inflating: drugsComTest_raw.tsv    
  inflating: drugsComTrain_raw.tsv   


As we previously saw, we can pass a dictionary to the ```data_files``` argument as well as specify the delimiter. 

In [32]:
data_files = {'train': "drugsComTrain_raw.tsv",
              "test": "drugsComTest_raw.tsv"}

drug_dataset = load_dataset('csv', 
                            data_files=data_files,
                            delimiter="\t")

Using custom data configuration default-3761173c276c0a9a


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-3761173c276c0a9a/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e...


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/2 [00:00<?, ?it/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-3761173c276c0a9a/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Now we can have a look at a random sample to make sure everything is copacetic. 

To do so, we'll: 
- shuffle the dataset 
- select the first 1000 examples

In [33]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))

drug_sample[:3]

{'Unnamed: 0': [87571, 178045, 80482],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'date': ['September 2, 2015', 'November 7, 2011', 'June 5, 2013'],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'rating': [9.0, 3.0, 10.0],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than t

OK, it looks like ```Unnamed: 0``` is the id column and the others could use some general cleaning (e.g., converting to lowercase, removing HTML character codes as well as carriage returns), so let's do that.

But first, let's verify that each value in ```Unnamed: 0``` is unique. 

In [41]:
for split in drug_dataset.keys():
  assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

Score! 

In that case, let's rename it to something useful.

In [42]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0",
    new_column_name="patient_id"
)

drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

✏️ Try it out! Use the ```Dataset.unique()``` function to find the number of unique drugs and conditions in the training and test sets.

In [48]:
number_unique_drugs = len(drug_dataset['train'].unique('drugName'))

number_unique_conditions = len(drug_dataset['train'].unique('condition'))

print(f'There are {number_unique_drugs} drugs and {number_unique_conditions} \
unique conditions in the dataset.' )

There are 3436 drugs and 885 unique conditions in the dataset.


Now let's fix the case issue.

First, we'll remove all observations where the condition equals ```None```, since lowercasing would fail on those missing values. 

In [50]:
def filter_nones(x):
  return x["condition"] is not None

In [51]:
drug_dataset = drug_dataset.filter(filter_nones)

  0%|          | 0/162 [00:00<?, ?ba/s]

  0%|          | 0/54 [00:00<?, ?ba/s]

Alternatively, we could have just used a lambda function like this: 

In [52]:
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

  0%|          | 0/161 [00:00<?, ?ba/s]

  0%|          | 0/54 [00:00<?, ?ba/s]
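
To make the reason for dropping these rows concrete, here is a plain-Python sketch on two made-up rows. It mirrors the predicate used above and shows the error that calling ```.lower()``` on a missing condition would raise; ```has_condition``` is a hypothetical name.

```python
def has_condition(example):
    """Same predicate as above: keep only rows whose condition is present."""
    return example["condition"] is not None


# Two made-up rows, one with a missing condition.
rows = [{"condition": "Gout, Acute"}, {"condition": None}]
kept = [row for row in rows if has_condition(row)]

# Lowercasing a missing value fails, which is why we filter first.
try:
    rows[1]["condition"].lower()
    failure = None
except AttributeError as error:
    failure = str(error)

print(kept)
print(failure)
```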

Now we can normalize the condition column using the ```map()``` method, which applies a function to every example and updates the columns named in the dictionary that the function returns.

In [54]:
def lowercase_condition(example):
  return {"condition": example['condition'].lower()}

drug_dataset = drug_dataset.map(lowercase_condition)

  0%|          | 0/160398 [00:00<?, ?ex/s]

  0%|          | 0/53471 [00:00<?, ?ex/s]
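
As a rough mental model of what ```map()``` just did, here is a plain-Python sketch on a list of dictionaries. It is not how the library is implemented (the real method works on Arrow-backed tables and caches its results), and ```map_rows``` is a hypothetical helper used only for illustration.

```python
def map_rows(rows, function):
    """Apply `function` to each row and merge the returned dict into a copy of it."""
    return [{**row, **function(row)} for row in rows]


def lowercase_condition(example):
    return {"condition": example["condition"].lower()}


# Two made-up rows that reuse values from the sample shown earlier.
rows = [
    {"drugName": "Naproxen", "condition": "Gout, Acute"},
    {"drugName": "Mobic", "condition": "Inflammatory Conditions"},
]
mapped = map_rows(rows, lowercase_condition)

print(mapped[0])
```

Returning a key that already exists overwrites that column, while returning a new key adds a column. The original rows are left untouched.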

## [Creating New Columns](https://huggingface.co/course/chapter5/3?fw=pt#creating-new-columns)

It's always a good idea to check how long each review is. 

A simple proxy is the number of whitespace-separated words, which we can store in a new column like this: 

In [59]:
def compute_review_length(example):
  return {"review_length": len(example["review"].split())}

drug_dataset = drug_dataset.map(compute_review_length)

  0%|          | 0/160398 [00:00<?, ?ex/s]

  0%|          | 0/53471 [00:00<?, ?ex/s]

In [60]:
drug_dataset['train'][0]

{'condition': 'left ventricular dysfunction',
 'date': 'May 20, 2012',
 'drugName': 'Valsartan',
 'patient_id': 206461,
 'rating': 9.0,
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'review_length': 17,
 'usefulCount': 27}
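
To check where the ```17``` comes from, we can run the same whitespace split on that review by hand. This is only a sanity check of the counting rule, not a proper tokenizer.

```python
review = '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"'

# str.split() with no arguments splits on any run of whitespace.
tokens = review.split()

print(len(tokens))
print(tokens[:3])
```

Note that punctuation and the surrounding quotation marks stay attached to the words, so this is a rough word count rather than an exact one.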

Now we can sort our dataset by review length.

In [61]:
drug_dataset['train'].sort('review_length')[:3]

{'condition': ['birth control', 'muscle spasm', 'pain'],
 'date': ['November 4, 2008', 'March 24, 2017', 'August 20, 2016'],
 'drugName': ['Loestrin 21 1 / 20', 'Chlorzoxazone', 'Nucynta'],
 'patient_id': [103488, 23627, 20558],
 'rating': [10.0, 1.0, 6.0],
 'review': ['"Excellent."', '"useless"', '"ok"'],
 'review_length': [1, 1, 1],
 'usefulCount': [5, 2, 10]}

Let's filter out the short reviews, keeping only those with more than 30 words.

In [62]:
drug_dataset = drug_dataset.filter(lambda x: x['review_length'] > 30)

print(drug_dataset.num_rows)

  0%|          | 0/161 [00:00<?, ?ba/s]

  0%|          | 0/54 [00:00<?, ?ba/s]

{'train': 138514, 'test': 46108}


✏️ Try it out! Use the ```Dataset.sort()``` function to inspect the reviews with the largest numbers of words. See the [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.sort) to see which argument you need to use to sort the reviews by length in descending order.

In [63]:
drug_dataset['train'].sort('review_length', reverse=True)[:3]

{'condition': ['migraine', 'obsessive compulsive disorde', 'birth control'],
 'date': ['June 18, 2017', 'May 26, 2017', 'September 17, 2015'],
 'drugName': ['Venlafaxine', 'Prozac', 'Copper'],
 'patient_id': [121004, 181160, 216072],
 'rating': [2.0, 10.0, 10.0],
 'review': ['"Two and a half months ago I was prescribed Venlafaxine to help prevent chronic migraines.\r\nIt did help the migraines (reduced them by almost half), but with it came a host of side effects that were far worse than the problem I was trying to get rid of.\r\nHaving now come off of the stuff, I would not recommend anyone ever use Venlafaxine unless they suffer from extreme / suicidal depression. I mean extreme in the most emphatic sense of the word. \r\nBefore trying Venlafaxine, I was a writer. While on Venlafaxine, I could barely write or speak or communicate at all. More than that, I just didn&#039;t want to. Not normal for a usually outgoing extrovert.\r\nNow, I&#039;m beginning to write again - but my ability 

And now we can finally get rid of the HTML character codes like this: 

In [66]:
import html

text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

Again, map will be our friend 😀

In [67]:
drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x['review'])})

  0%|          | 0/138514 [00:00<?, ?ex/s]

  0%|          | 0/46108 [00:00<?, ?ex/s]
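
The reviews also contain carriage returns, which were on the cleaning list earlier but are not handled by ```html.unescape```. Below is a minimal sketch of one way to deal with both at once. The ```clean_review``` helper is hypothetical, the sample text is stitched together from the review shown earlier, and collapsing all whitespace into single spaces is an assumption about the cleaning you want.

```python
import html


def clean_review(text):
    """Decode HTML character codes, then collapse whitespace runs into single spaces."""
    unescaped = html.unescape(text)
    # split() with no arguments also treats carriage returns and newlines as whitespace.
    return " ".join(unescaped.split())


# Made-up sample adapted from the review text shown earlier.
raw = '"It is great\r\nas a pain reducer, but I&#039;m being weaned off of it now."'
cleaned = clean_review(raw)

print(cleaned)
```

It could be applied with the same pattern as above, e.g. ```drug_dataset.map(lambda x: {"review": clean_review(x["review"])})```. Because only whitespace is collapsed, the whitespace-based ```review_length``` values would stay the same.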

## [The ```map()``` method's superpowers](https://huggingface.co/course/chapter5/3?fw=pt#the-%3Ccode%3Emap()%3C/code%3E-method%E2%80%99s-superpowers)