# **The Datasets Library**
**Deep Dive:**
- What do you do when your **dataset is not on the Hub**?
- How can you **slice and dice** a dataset? (And what if you really need to use Pandas?)
- What do you do when your dataset is huge and will **melt your laptop’s RAM**?
- What the heck are **“memory mapping” and Apache Arrow**?
- How can you **create your own dataset** and push it to the Hub?

In [2]:
!python --version

Python 3.10.12


In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m527.3/527.3 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
[2K

In [3]:
import torch
import transformers
from transformers import pipeline
from transformers import AutoModel, AutoTokenizer
from datasets import Dataset, load_dataset

import warnings
warnings.filterwarnings('ignore')

In [4]:
print(transformers.__version__)

4.42.4


In [6]:
!pip show datasets

Name: datasets
Version: 2.21.0
Summary: HuggingFace community-driven open-source library of datasets
Home-page: https://github.com/huggingface/datasets
Author: HuggingFace Inc.
Author-email: thomas@huggingface.co
License: Apache 2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: aiohttp, dill, filelock, fsspec, huggingface-hub, multiprocess, numpy, packaging, pandas, pyarrow, pyyaml, requests, tqdm, xxhash
Required-by: 


## **Not in the Hub**
`Datasets` can be used to load datasets that aren’t available on the Hugging Face Hub.

### **Working w/ Local and Remote Datasets**
Loading script by using `load_dataset()`:
`csv`, `text`, `json`, *Pickled DataFrames* `pandas`. Specify the type of script and `data_files` of specific files.

#### **Loading a Local Dataset**
For this example we’ll use the **SQuAD-it** dataset, which is a large-scale dataset for question answering in Italian.

The training and test splits are hosted on GitHub, so we can download them with a simple `wget` command

In [None]:
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

This will download two compressed files called `SQuAD_it-train.json.gz` and `SQuAD_it-test.json.gz`, which we can **decompress** with the Linux `gzip` command

In [None]:
!gzip -dkv SQuAD_it-*.json.gz

Load a `JSON` (nested dictionary) file. `SQuAD-it` uses the nested format, with all the text stored in a data field. This means we can load the dataset by specifying the field argument

In [None]:
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")

By default, loading local files creates a `DatasetDict` object with a train split

In [None]:
squad_it_dataset

In [None]:
squad_it_dataset["train"][0]

Include both the `train` and `test` split to a single `DatasetDict`, so we can apply `dataset.map()` functions accross both splits at once

In [None]:
# maps each split name
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
# load dataset with it data_files
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

**This is exactly what we wanted. Now, we can apply various preprocessing techniques to clean up the data, tokenize the reviews, and so on.**

Skipping the decompression (`gzip`) with `data_files`

In [None]:
data_files = {"train": "SQuAD_it-train.json.gz", "test": "SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

#### **Loading a Remote Dataset**
Fortunately, loading remote files is just as simple as loading local ones! Instead of providing a path to local files, we point the `data_files` argument of `load_dataset()` to one or more **URLs where the remote files are stored**.

In [None]:
from datasets import load_dataset

# Location where the dataset stored
url = "https://github.com/crux82/squad-it/raw/master/"

# Combine the file with the url (location of the dataset)
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}

# Load dataset as before
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

Downloading data:   0%|          | 0.00/7.73M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [None]:
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

In [None]:
squad_it_dataset['train'][0]

{'title': 'Terremoto del Sichuan del 2008',
 'paragraphs': [{'context': "Il terremoto del Sichuan del 2008 o il terremoto del Gran Sichuan, misurato a 8.0 Ms e 7.9 Mw, e si è verificato alle 02:28:01 PM China Standard Time all' epicentro (06:28:01 UTC) il 12 maggio nella provincia del Sichuan, ha ucciso 69.197 persone e lasciato 18.222 dispersi.",
   'qas': [{'answers': [{'answer_start': 29, 'text': '2008'}],
     'id': '56cdca7862d2951400fa6826',
     'question': 'In quale anno si è verificato il terremoto nel Sichuan?'},
    {'answers': [{'answer_start': 232, 'text': '69.197'}],
     'id': '56cdca7862d2951400fa6828',
     'question': 'Quante persone sono state uccise come risultato?'},
    {'answers': [{'answer_start': 29, 'text': '2008'}],
     'id': '56d4f9902ccc5a1400d833c0',
     'question': 'Quale anno ha avuto luogo il terremoto del Sichuan?'},
    {'answers': [{'answer_start': 78, 'text': '8.0 Ms e 7.9 Mw'}],
     'id': '56d4f9902ccc5a1400d833c1',
     'question': 'Che cosa ha

## **Slice and Dice**
**Data Wrangling** ||| Most of the time, the data you work with won’t be perfectly prepared for training models. **Clean up the Datasets**

🎯 Manipulate the contents of `Dataset` and `DatasetDict` object.

### **Slicing and Dicing the Data**
For this example we’ll use the **Drug Review Dataset** that’s hosted on the **UC Irvine Machine Learning Repository**, which contains *patient reviews on various drugs, along with the condition being treated and a 10-star rating of the patient’s satisfaction*.

First we need to **download and extract the data**

In [7]:
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip

--2024-09-02 03:24:54--  https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘drugsCom_raw.zip’

drugsCom_raw.zip        [        <=>         ]  41.00M  27.0MB/s    in 1.5s    

2024-09-02 03:24:56 (27.0 MB/s) - ‘drugsCom_raw.zip’ saved [42989872]

Archive:  drugsCom_raw.zip
  inflating: drugsComTest_raw.tsv    
  inflating: drugsComTrain_raw.tsv   


Load these train and test dataset by using `csv` but with `delimeter`

In [8]:
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t") #\t tab

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [9]:
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

In [10]:
drug_dataset['train'][0]

{'Unnamed: 0': 206461,
 'drugName': 'Valsartan',
 'condition': 'Left Ventricular Dysfunction',
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'rating': 9.0,
 'date': 'May 20, 2012',
 'usefulCount': 27}

A **good practice when doing any sort of data analysis is to grab a small random sample** to get a quick feel for the type of data you’re working with.
- Create a **random sample** by chaining `.shuffle()` and `.select()` functions together

In [11]:
# reproducibility purporses || range(1000) -> grab first 1000 examples from the shuffled dataset
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))

# Peek at the first few examples
drug_sample[:3]

{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than 

From this sample we can already see a few quirks in our dataset:

- The `Unnamed: 0` column looks suspiciously like an anonymized ID for each patient.
- The `condition` column includes a mix of uppercase and lowercase labels.
- The `reviews` are of varying length and contain a mix of Python line separators (`\r\n`) as well as HTML character codes like `&\#039;`.

**1. Test the patient ID hypothesis for the `Unnamed: 0` column** || Verify that the number of IDs matches the number of rows in each split

In [12]:
len(drug_dataset['train'].unique('Unnamed: 0'))

161297

In [13]:
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

## The hypothesis is True, if returns True
## otherwise it's False hten return AssertionError

-------- *rename the `Unnamed: 0` column*

In [14]:
drug_dataset = drug_dataset.rename_column('Unnamed: 0', 'patient_id')

drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

-------- *find the number of unique drug and conditions*

In [15]:
for split in drug_dataset.keys():
    for features in ['drugName', 'condition']:
        print(f"Unique {features} for {split} dataset: {len(drug_dataset[split].unique(features))}")

Unique drugName for train dataset: 3436
Unique condition for train dataset: 885
Unique drugName for test dataset: 2637
Unique condition for test dataset: 709


**2. Normalize all the `conditions` labels**

In [16]:
## Function to lowercase
def lowercase_condition(x):
    return {"condition": x["condition"].lower()}
    # return x['condition'] is not None

## Function to remove the None value
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

Filter:   0%|          | 0/161297 [00:00<?, ? examples/s]

Filter:   0%|          | 0/53766 [00:00<?, ? examples/s]

In [17]:
## Normalize the condition
drug_dataset = drug_dataset.map(lowercase_condition)

Map:   0%|          | 0/160398 [00:00<?, ? examples/s]

Map:   0%|          | 0/53471 [00:00<?, ? examples/s]

In [18]:
# Check that lowercasing worked
drug_dataset["train"]["condition"][:3]

['left ventricular dysfunction', 'adhd', 'birth control']

### **Creating New Columns**
`reviews` || Whenever you’re dealing with *customer reviews*, a good practice is to **check the number of words in each review**.

1. function to **counts the number of words** in each review

In [19]:
## Tons of words
drug_dataset["train"]["review"][:3]

['"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 '"My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \r\nWe have tried many different medications and so far this is the most effective."',
 '"I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormo

In [20]:
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}

In [21]:
drug_dataset = drug_dataset.map(compute_review_length)

# Inspect the first training example
drug_dataset["train"][3]

Map:   0%|          | 0/160398 [00:00<?, ? examples/s]

Map:   0%|          | 0/53471 [00:00<?, ? examples/s]

{'patient_id': 138000,
 'drugName': 'Ortho Evra',
 'condition': 'birth control',
 'review': '"This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch"',
 'rating': 8.0,
 'date': 'November 3, 2015',
 'usefulCount': 10,
 'review_length': 89}

Check the extreme values for the review length by using `.sort()`

In [22]:
drug_dataset["train"].sort("review_length")[:3]

{'patient_id': [111469, 13653, 53602],
 'drugName': ['Ledipasvir / sofosbuvir',
  'Amphetamine / dextroamphetamine',
  'Alesse'],
 'condition': ['hepatitis c', 'adhd', 'birth control'],
 'review': ['"Headache"', '"Great"', '"Awesome"'],
 'rating': [10.0, 10.0, 10.0],
 'date': ['February 3, 2015', 'October 20, 2009', 'November 23, 2015'],
 'usefulCount': [41, 3, 0],
 'review_length': [1, 1, 1]}

some of the reviews contain just a single word, although it may be **okay for sentiment analysis**, *would not be informative if we want to predict the condition*.

2. Remove `reviews` that contain fewer than 30 words. (*filtering out aroung 15% of the reviews from the original*)

In [23]:
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)
print(drug_dataset.num_rows)

Filter:   0%|          | 0/160398 [00:00<?, ? examples/s]

Filter:   0%|          | 0/53471 [00:00<?, ? examples/s]

{'train': 138514, 'test': 46108}


In [None]:
# drug_dataset['train'].sort('review_length', reverse=True)[:5]

3. **Remove HTML code**

In [24]:
import html

text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

In [25]:
# implement the unescape with .map() function
drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

### **The map() method's superpowers**
The `Dataset.map()` method takes a `batched` argument that, if set to `True`, causes it to **send a batch of examples to the map function at once** (the batch size is configurable but defaults to 1,000)

#### **`batched=True`** || *faster `map()`*

In [26]:
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

 this command **executes way faster than the previous one**.

Using `Dataset.map()` with `batched=True` will be essential to unlock the speed of the **“fast”** tokenizers, which can **quickly tokenize big lists of texts**.

In [27]:
# Example -- tokenize all the drug reviews
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

In [28]:
# Check the performance of the function
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)
# %time tokenized_dataset = drug_dataset.map(tokenize_function, batched=False)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

CPU times: user 1min 59s, sys: 1.23 s, total: 2min
Wall time: 1min 25s


To check the different between **fast-tokenize**r and **slow-tokenzer** we can set the `use_fast=False` method.

In [None]:
# slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)

# try the Fast Tokenizer
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Use **multprocessing** for `Dataset.map()` to activate the parallelization by use the `num_proc` argument.

In [None]:
def fast_tokenize_function(examples):
    return fast_tokenizer(examples["review"], truncation=True)

%time fast_tokenized_dataset = drug_dataset.map(fast_tokenize_function, batched=True, num_proc=8)

Map (num_proc=8):   0%|          | 0/138514 [00:00<?, ? examples/s]

Map (num_proc=8):   0%|          | 0/46108 [00:00<?, ? examples/s]

CPU times: user 2.59 s, sys: 857 ms, total: 3.45 s
Wall time: 1min 47s


⚠ ⚠ ⚠ **In general, we don’t recommend using Python multiprocessing for fast tokenizers with `batched=True`**

----

📓 Using `num_proc` to speed up your processing is usually a great idea, as long as the function you are using is not already doing some kind of multiprocessing of its own.

#### **`overflowing`** || *changed the number of elements*
|| Create several training features from one example

💡 In machine learning, an *example* is usually defined as the set of features that we feed to the model.

In some contexts, these features will be the set of columns in a Dataset, but in others (like here and for question answering), multiple features can be extracted from a single *example* and belong to a single column.

In [29]:
# Function to tokenizer the examples
## Truncate them to a maximum length of 128
## return all the chunks of the texts
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

Test with one example

In [30]:
result = tokenize_and_split(drug_dataset["train"][0])
[len(inp) for inp in result["input_ids"]]

[128, 49]

So, our first example in the training set became two features because it was tokenized to more than the maximum number of tokens we specified:
- the first one of length 128 (*max*)
- the second one of length 49 (*overflow*).

In [31]:
drug_dataset['train'].column_names

['patient_id',
 'drugName',
 'condition',
 'review',
 'rating',
 'date',
 'usefulCount',
 'review_length']

**Removing old dataset**

In [32]:
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 138514
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

In [33]:
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

In [34]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'],
        num_rows: 68876
    })
})

In [35]:
len(tokenized_dataset["train"]), len(drug_dataset["train"])

(206772, 138514)

**Making old columns the same size**, but still keep the old columns

In [36]:
def tokenize_and_split(examples):
    result = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping") # mapping new feature index to the index of the original

    # list of values
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]

    return result

In [39]:
# Run without remove the old columns
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
tokenized_dataset

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 68876
    })
})

We get the same number of training features as before, but here we’ve kept all the old fields.

If you need them for some post-processing after applying your model, you might want to use this approach.

### **From Datasets to DataFrame and back**
To enable the conversion between various third-party libraries, 🤗 Datasets provides a `Dataset.set_format(`) function.

📓 **changes the output format of the dataset, so you can easily switch to another format without affecting the underlying *data format***

In [None]:
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 138514
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

**Convert dataset to Pandas**

In [None]:
drug_dataset.set_format("pandas")

In [None]:
drug_dataset["train"][:3]

Unnamed: 0,patient_id,drugName,condition,review,rating,date,usefulCount,review_length
0,95260,Guanfacine,adhd,"""My son is halfway through his fourth week of ...",8.0,"April 27, 2010",192,141
1,92703,Lybrel,birth control,"""I used to take another oral contraceptive, wh...",5.0,"December 14, 2009",17,134
2,138000,Ortho Evra,birth control,"""This is my first time using any form of birth...",8.0,"November 3, 2015",10,89


Create a `pandas.DataFrame` from the whole training set

In [None]:
train_df = drug_dataset["train"][:]

In [None]:
train_df.head()

Unnamed: 0,patient_id,drugName,condition,review,rating,date,usefulCount,review_length
0,95260,Guanfacine,adhd,"""My son is halfway through his fourth week of ...",8.0,"April 27, 2010",192,141
1,92703,Lybrel,birth control,"""I used to take another oral contraceptive, wh...",5.0,"December 14, 2009",17,134
2,138000,Ortho Evra,birth control,"""This is my first time using any form of birth...",8.0,"November 3, 2015",10,89
3,35696,Buprenorphine / naloxone,opiate dependence,"""Suboxone has completely turned my life around...",9.0,"November 27, 2016",37,124
4,155963,Cialis,benign prostatic hyperplasia,"""2nd day on 5mg started to work with rock hard...",2.0,"November 28, 2015",43,68


From here we can use all the Pandas functionality that we want.

For example, we can do fancy chaining to **compute the class distribution among the `condition` entries**.

In [None]:
frequencies = (
    train_df["condition"]
    .value_counts()
    .to_frame()
    .reset_index()
    .rename(columns={"index": "condition", "condition": "frequency"})
)

frequencies.head()

Unnamed: 0,frequency,count
0,birth control,27655
1,depression,8023
2,acne,5209
3,anxiety,4991
4,pain,4744


And once we’re done with our Pandas analysis, we can always create a new `Dataset` object by using the `Dataset.from_pandas()`

In [None]:
from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

Dataset({
    features: ['frequency', 'count'],
    num_rows: 819
})

Next, **compute the average rating per drug and store**.

In [None]:
train_df.columns

Index(['patient_id', 'drugName', 'condition', 'review', 'rating', 'date',
       'usefulCount', 'review_length'],
      dtype='object')

In [None]:
train_df.head(1)

Unnamed: 0,patient_id,drugName,condition,review,rating,date,usefulCount,review_length
0,95260,Guanfacine,adhd,"""My son is halfway through his fourth week of ...",8.0,"April 27, 2010",192,141


In [None]:
average_rating_per_drug = (
    train_df.groupby('drugName')['rating']
    .mean()
    .to_frame()
    .reset_index()
    .rename(columns={'drugName':'drug', 'rating':'average'})
)

average_rating_per_drug.head()

Unnamed: 0,drug,average
0,A + D Cracked Skin Relief,10.0
1,A / B Otic,10.0
2,Abacavir / dolutegravir / lamivudine,7.953488
3,Abacavir / lamivudine / zidovudine,9.0
4,Abatacept,7.3125


In [None]:
avg_drug_rating = Dataset.from_pandas(average_rating_per_drug)
avg_drug_rating

Dataset({
    features: ['drug', 'average'],
    num_rows: 3052
})

In [40]:
drug_dataset.reset_format()

In [41]:
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 138514
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

Next, is to create a validation set to prepare the dataset for training a classifier on.

### **Creating a Validation set**
Although we have a test set we could use for evaluation, it’s a good practice to **leave the test set untouched** and create a **separate validation set during development**.

Once you are **happy with the performance of your models on the validation set**, you can **do a final sanity check on the test set**. This process helps mitigate the risk that you’ll overfit to the test set and deploy a model that fails on real-world data

In [42]:
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 138514
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

Let's use `Dataset.train_test_split()` to split the training data into train and validation

In [43]:
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)

# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")

# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]

drug_dataset_clean

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

#### **Saving the Dataset**
- Arrow ➡ `Dataset.save_to_disk()`
- CSV ➡ `Dataset.to_csv()`
- JSON ➡ `Dataset.to_json()`

-------- **Save to Disk**

In [None]:
drug_dataset_clean.save_to_disk("drug-reviews")

Saving the dataset (0/1 shards):   0%|          | 0/110811 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/27703 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/46108 [00:00<?, ? examples/s]

where we can see that each split is associated with its own `dataset.arrow` table, and some metadata in `dataset_info.json` and `state.json`.

You can think of the `Arrow` format as a fancy table of columns and rows that is optimized for building high-performance applications that process and transport large datasets.

**Load from the disk / local**

In [None]:
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("drug-reviews")
drug_dataset_reloaded

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

-------- **Save to JSON formats**

In [None]:
drug_dataset_clean.items()

dict_items([('train', Dataset({
    features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
    num_rows: 110811
})), ('validation', Dataset({
    features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
    num_rows: 27703
})), ('test', Dataset({
    features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
    num_rows: 46108
}))])

In [None]:
## split --> separate file (train, val, test)
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")

Creating json from Arrow format:   0%|          | 0/111 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/28 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/47 [00:00<?, ?ba/s]

This saves each split in `JSON Lines` format, where each row in the dataset is stored as a single line of `JSON`

In [None]:
!head -n 1 drug-reviews-train.jsonl

{"patient_id":89879,"drugName":"Cyclosporine","condition":"keratoconjunctivitis sicca","review":"\"I have used Restasis for about a year now and have seen almost no progress.  For most of my life I've had red and bothersome eyes. After trying various eye drops, my doctor recommended Restasis.  He said it typically takes 3 to 6 months for it to really kick in but it never did kick in.  When I put the drops in it burns my eyes for the first 30 - 40 minutes.  I've talked with my doctor about this and he said it is normal but should go away after some time, but it hasn't. Every year around spring time my eyes get terrible irritated  and this year has been the same (maybe even worse than other years) even though I've been using Restasis for a year now. The only difference I notice was for the first couple weeks, but now I'm ready to move on.\"","rating":2.0,"date":"April 20, 2013","usefulCount":69,"review_length":147}


Load the `JSON` files using `load_dataset()`

In [None]:
data_files = {
    "train": "drug-reviews-train.jsonl",
    "validation": "drug-reviews-validation.jsonl",
    "test": "drug-reviews-test.jsonl",
}

drug_dataset_reloaded = load_dataset("json", data_files=data_files)

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [None]:
drug_dataset_reloaded

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

## **Big Data??**
`Datasets` has been designed to overcome these limitations (*huge data to load into the laptop's RAM*). It frees you from *memory management* problems by treating datasets as *memory-mapped* files, and from hard drive limits by *streaming* the entries in a corpus.


🧰 [**the Pile**](https://pile.eleuther.ai/) ➡ Datasets with a huge **825 GB corpus**

### **What's the Pile?**
**The Pile** is an English text corpus that was created by EleutherAI for **training large-scale language models**. It includes a *diverse range of datasets*, *spanning scientific articles*, *GitHub code repositories*, and *filtered web text*.

Let’s start by taking a look at the **PubMed Abstracts** dataset, which is *a corpus of abstracts from 15 million biomedical publications on PubMed*. The dataset is in JSON Lines format and is compressed using the `zstandard` library.

In [44]:
!pip install zstandard

Collecting zstandard
  Downloading zstandard-0.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Downloading zstandard-0.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.4/5.4 MB[0m [31m26.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: zstandard
Successfully installed zstandard-0.23.0


**Load data from the *remote file***

In [45]:
from datasets import load_dataset

# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
pubmed_dataset

FileNotFoundError: Unable to find 'https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst'

In [None]:
pubmed_dataset[0]

{'meta': {'pmid': 11409574, 'language': 'eng'},
 'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age ...'}

### **The Magic of Memory Mapping**
A simple way to measure memory usage in Python is with the `psutil` library.

In [None]:
!pip install psutil

It provides a `Process` class that allows us to check the memory usage of the current process

In [None]:
import psutil

# Process.memory_info is expressed in bytes, so convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

Here the `rss` attribute refers to the *resident set size*, which is the fraction of memory that a process occupies in RAM. This measurement also includes the memory used by the Python interpreter and the libraries we’ve loaded, so the **actual amount of memory used to load the dataset is a bit smaller**.

In [None]:
print(f"Number of files in dataset : {pubmed_dataset.dataset_size}")
size_gb = pubmed_dataset.dataset_size / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.2f} GB")

🎯🎯

`Datasets` treats each dataset as a *memory-mapped file*, which *provides a mapping between RAM and filesystem storage that allows the library to access and operate on elements of the dataset **without needing to fully load it into memory**.*

Memory-mapped files can also be shared across multiple processes, which enables methods like `Dataset.map()` to be parallelized without needing to move or copy the dataset.


In [None]:
import timeit

code_snippet = """batch_size = 1000

for idx in range(0, len(pubmed_dataset), batch_size):
    _ = pubmed_dataset[idx:idx + batch_size]
"""

time = timeit.timeit(stmt=code_snippet, number=1, globals=globals())
print(
    f"Iterated over {len(pubmed_dataset)} examples (about {size_gb:.1f} GB) in "
    f"{time:.1f}s, i.e. {size_gb/time:.3f} GB/s"
)

# 'Iterated over 15518009 examples (about 19.5 GB) in 64.2s, i.e. 0.304 GB/s'

`Datasets` provides a ***streaming*** feature that allows us to download and access elements on the fly, without needing to download the whole dataset.

### **Streaming Datasets**
To enable dataset streaming you just need to pass the `streaming=True` argument to the `load_dataset()` function.

In [None]:
pubmed_dataset_streamed = load_dataset(
    "json", data_files=data_files, split="train", streaming=True
)

the object returned with `streaming=True` is an `IterableDataset`. As the name suggests, to access the elements of an `IterableDataset` we need to iterate over it

In [None]:
next(iter(pubmed_dataset_streamed))

The elements from a streamed dataset can be processed on the fly using `IterableDataset.map()`, which is useful during training if you need to tokenize the inputs.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x["text"])) # streaming dataset
next(iter(tokenized_dataset))

# {'input_ids': [101, 4958, 5178, 4328, 6779, ...], 'attention_mask': [1, 1, 1, 1, 1, ...]}

can also shuffle a streamed dataset using `IterableDataset.shuffle()`, but unlike `Dataset.shuffle()` this only shuffles the elements in a predefined `buffer_size`

In [None]:
shuffled_dataset = pubmed_dataset_streamed.shuffle(buffer_size=10_000, seed=42)
next(iter(shuffled_dataset))

In this example, we selected a random example from the first 10,000 examples in the buffer. Once an example is accessed, its spot in the buffer is filled with the next example in the corpus (i.e., the 10,001st example in the case above)

to select the first 5 examples in the PubMed Abstracts dataset by using `IterableDataset.take()`

In [None]:
dataset_head = pubmed_dataset_streamed.take(5)
list(dataset_head)

In [None]:
# Skip the first 1,000 examples and include the rest in the training set
train_dataset = shuffled_dataset.skip(1000)

# Take the first 1,000 examples for the validation set
validation_dataset = shuffled_dataset.take(1000)

Let’s round out our exploration of dataset streaming with a common application: **combining multiple datasets together to create a single corpus**. 🤗 Datasets provides an `interleave_datasets()` function that **converts a list of `IterableDataset` objects into a single `IterableDataset`**, where the elements of the new dataset are obtained by alternating among the source examples.

**Combining** a stream the `FreeLaw` subset of the Pile, which is a 51 GB dataset of legal opinions from US courts

In [None]:
law_dataset_streamed = load_dataset(
    "json",
    data_files="https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst",
    split="train",
    streaming=True,
)

next(iter(law_dataset_streamed))

**with** `PubMed Abstract` dataset

In [None]:
from itertools import islice
from datasets import interleave_datasets

combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
list(islice(combined_dataset, 2))

Here we’ve used the `islice()` function from Python’s `itertools` module to select the first two examples from the combined dataset, and we can see that they match the first examples from each of the two source datasets.

Finally, if you want to stream **the Pile** in its 825 GB entirety, you can grab all the prepared files as follows:

In [None]:
base_url = "https://the-eye.eu/public/AI/pile/"

data_files = {
    "train": [base_url + "train/" + f"{idx:02d}.jsonl.zst" for idx in range(30)],
    "validation": base_url + "val.jsonl.zst",
    "test": base_url + "test.jsonl.zst",
}

pile_dataset = load_dataset("json", data_files=data_files, streaming=True)

-- german, french to swiss

In [53]:
from datasets import load_dataset

mc4_subset_with_five_languages = load_dataset("mc4", languages=["de", "fr"], streaming=True)

next(iter(mc4_subset_with_five_languages))

Downloading builder script:   0%|          | 0.00/9.68k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/16.0k [00:00<?, ?B/s]

The repository for mc4 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/mc4.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


'train'

In [54]:
mc4_subset_with_five_languages

IterableDatasetDict({
    train: IterableDataset({
        features: ['text', 'timestamp', 'url'],
        n_shards: 4096
    })
    validation: IterableDataset({
        features: ['text', 'timestamp', 'url'],
        n_shards: 32
    })
})

In [59]:
from itertools import islice
from datasets import interleave_datasets

german = load_dataset("mc4", "de", streaming=True, split='train')
french = load_dataset("mc4", "fr", streaming=True, split='train')

swiss_language = interleave_datasets([german, french])
list(islice(swiss_language, 2))

[{'text': 'Home - Homepage des Kunstvereins Pro Ars Lausitz e.V.\nKunstverein Pro Ars Lausitz e.V.\nIm November 2011 haben sich kunstinteressierte Bürger unseres Landkreises entschlossen, den Verein Pro Ars Lausitz zu gründen. Zweck des Vereins ist die Förderung der Kunst und Kultur. Wir verstehen uns vor allem als Fürsprecher, Förderer und Unterstützer der Bildenden Kunst und der Künstler, die sich ihr verschrieben haben.\nDie große Bedeutung dieses Genres für das Leben der Menschen in unserem Kreis, für Bildung und Erholung, für Erziehung der Kinder und Jugendlichen aber auch als weicher Standortfaktor ist unbestritten.\nEin Vergleich der Situation der Bildenden Kunst und ihrer Künstler im OSL-Kreis mit anderen Kreisen läßt erkennen, dass die Förderung der Kulturszene und die Zusammenarbeit der Netzwerke noch lange nicht so gut funktioniert und aus unserer Sicht wesentlich zu verbessern ist. Dieser Aufgabe stellen sich die Mitglieder des Vereins.\nWir wollen die Kräfte vor allem auf 

## **Create Own Dataset**
📓 Create **a corpus of** [**Github issues**](https://github.com/features/issues/) which are commonly used to track bugs or features in GitHub repositories. This corpus could be used for various purposes, including:
- Exploring **how long it takes to close open issues or pull requests**.
- **Training a multilabel classifier** that can tag issues with *metadata* based on the issue’s description (e.g., “bug,” “enhancement,” or “question”)
- Creating a **semantic search engine** to *find which issues match a user’s query*.

### **Getting the Data**
You can find all the issues in 🤗 Datasets by navigating to the [**repository’s Issues**](https://github.com/huggingface/datasets/issues) tab. At the time of writing there were 660 open issues and 2213 closed ones.

---
If you click on one of these issues you’ll find it contains a `title`, a `description`, and a `set of labels` that characterize the issue.


To download all the repository’s issues, we’ll use the **GitHub REST API** to poll the **Issues endpoint**. This endpoint *returns a list of JSON objects*, with *each object containing a large number of fields* that *include the title and description as well as metadata* about the status of the issue and so on.

In [60]:
!pip install requests



make **GET** requests to the `Issues` endpoint by invoking the `requests.get()` function.

In [61]:
# Example --- retrieve the first issue on the first page
import requests

url = "https://api.github.com/repos/huggingface/datasets/issues?page=1&per_page=1"
response = requests.get(url)

In [62]:
# print the HTTP status code
response.status_code

200

where a `200` status means the request was **successful**. What we are really interested in, though, is the `payload`, which can be accessed in various formats like *bytes, strings, or JSON.*

In [63]:
response.json()

[{'url': 'https://api.github.com/repos/huggingface/datasets/issues/7134',
  'repository_url': 'https://api.github.com/repos/huggingface/datasets',
  'labels_url': 'https://api.github.com/repos/huggingface/datasets/issues/7134/labels{/name}',
  'comments_url': 'https://api.github.com/repos/huggingface/datasets/issues/7134/comments',
  'events_url': 'https://api.github.com/repos/huggingface/datasets/issues/7134/events',
  'html_url': 'https://github.com/huggingface/datasets/issues/7134',
  'id': 2499484041,
  'node_id': 'I_kwDODunzps6U-xmJ',
  'number': 7134,
  'title': 'Attempting to return a rank 3 grayscale image from dataset.map results in extreme slowdown ',
  'user': {'login': 'navidmafi',
   'id': 46371349,
   'node_id': 'MDQ6VXNlcjQ2MzcxMzQ5',
   'avatar_url': 'https://avatars.githubusercontent.com/u/46371349?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/navidmafi',
   'html_url': 'https://github.com/navidmafi',
   'followers_url': 'https://api.github.com/us

As described in the GitHub [documentation](https://docs.github.com/en/rest/using-the-rest-api/getting-started-with-the-rest-api#rate-limiting), **`unauthenticated` requests are limited to `60 requests per hour`**. Although you can increase the `per_page` query parameter to reduce the number of requests you make, you will still hit the rate limit on any repository that has more than a few thousand issues.

So instead, you should follow GitHub’s [instructions](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens) on creating **a personal access token** so that you can **boost the rate limit to `5,000 requests per hour`**. Once you have your token, you can include it as part of the request header

In [64]:
import os

# GITHUB_TOKEN = xxx  # Copy your GitHub token here
# GITHUB_TOKEN = os.getenv('GITHUB_TOKEN')
GITHUB_TOKEN = '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'  # Copy your GitHub token here
headers = {"Authorization": f"token {GITHUB_TOKEN}"}

Now that we have our access token, let’s **create a function** that can **download all the issues from a GitHub repository**

In [67]:
import time
import math
from pathlib import Path
import pandas as pd
from tqdm.notebook import tqdm


def fetch_issues(
    owner="huggingface",
    repo="datasets",
    num_issues=10_000,
    rate_limit=5_000,
    issues_path=Path("."),
):
    if not issues_path.is_dir():
        issues_path.mkdir(exist_ok=True)

    batch = []
    all_issues = []
    per_page = 100  # Number of issues to return per page
    num_pages = math.ceil(num_issues / per_page)
    base_url = "https://api.github.com/repos"

    for page in tqdm(range(num_pages)):
        # Query with state=all to get both open and closed issues
        query = f"issues?page={page}&per_page={per_page}&state=all"
        issues = requests.get(f"{base_url}/{owner}/{repo}/{query}", headers=headers)
        batch.extend(issues.json())

        if len(batch) > rate_limit and len(all_issues) < num_issues:
            all_issues.extend(batch)
            batch = []  # Flush batch for next time period
            print(f"Reached GitHub rate limit. Sleeping for one hour ...")
            # time.sleep(60 * 60 + 1)
            time.sleep(1 * 5 + 1) # fast running purpose

    all_issues.extend(batch)
    df = pd.DataFrame.from_records(all_issues)
    df.to_json(f"{issues_path}/{repo}-issues.jsonl", orient="records", lines=True)
    print(
        f"Downloaded all the issues for {repo}! Dataset stored at {issues_path}/{repo}-issues.jsonl"
    )
    df.to_csv(f"{issues_path}/{repo}-issues.csv", header=True)
    print(
        f"Downloaded all the issues for {repo}! Dataset stored at {issues_path}/{repo}-issues.csv"
    )

Now when we call **`fetch_issues()`** it will **download all the issues in batches to avoid exceeding GitHub’s limit on the number of requests per hour**; the result will be stored in a `repository_name-issues.jsonl` file, where each line is a JSON object the represents an issue

In [68]:
# Depending on your internet connection, this can take several minutes to run...
fetch_issues()

  0%|          | 0/100 [00:00<?, ?it/s]

Reached GitHub rate limit. Sleeping for one hour ...
Downloaded all the issues for datasets! Dataset stored at ./datasets-issues.jsonl
Downloaded all the issues for datasets! Dataset stored at ./datasets-issues.csv


**load them localy**

In [74]:
issues_dataset = load_dataset("csv", data_files="datasets-issues.csv", split='train')
issues_dataset

Dataset({
    features: ['Unnamed: 0', 'url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'draft', 'pull_request'],
    num_rows: 7089
})

**Great, we’ve created our first dataset from scratch!**

But why are there several thousand issues when the Issues tab of the 🤗 Datasets repository only shows around 2000 issues in total 🤔? As described in the GitHub documentation, that’s because we’ve downloaded all the pull requests as well

In [78]:
issues_dataset[0]

{'Unnamed: 0': 0,
 'url': 'https://api.github.com/repos/huggingface/datasets/issues/7134',
 'repository_url': 'https://api.github.com/repos/huggingface/datasets',
 'labels_url': 'https://api.github.com/repos/huggingface/datasets/issues/7134/labels{/name}',
 'comments_url': 'https://api.github.com/repos/huggingface/datasets/issues/7134/comments',
 'events_url': 'https://api.github.com/repos/huggingface/datasets/issues/7134/events',
 'html_url': 'https://github.com/huggingface/datasets/issues/7134',
 'id': 2499484041,
 'node_id': 'I_kwDODunzps6U-xmJ',
 'number': 7134,
 'title': 'Attempting to return a rank 3 grayscale image from dataset.map results in extreme slowdown ',
 'user': "{'login': 'navidmafi', 'id': 46371349, 'node_id': 'MDQ6VXNlcjQ2MzcxMzQ5', 'avatar_url': 'https://avatars.githubusercontent.com/u/46371349?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/navidmafi', 'html_url': 'https://github.com/navidmafi', 'followers_url': 'https://api.github.com/users/navidmafi

Since the contents of `issues` and `pull requests` are quite different, let’s do some minor preprocessing to enable us to distinguish between them.

In [79]:
issues_dataset['html_url'][:2]

['https://github.com/huggingface/datasets/issues/7134',
 'https://github.com/huggingface/datasets/pull/7133']

In [76]:
issues_dataset['pull_request'][:2]

[None,
 "{'url': 'https://api.github.com/repos/huggingface/datasets/pulls/7133', 'html_url': 'https://github.com/huggingface/datasets/pull/7133', 'diff_url': 'https://github.com/huggingface/datasets/pull/7133.diff', 'patch_url': 'https://github.com/huggingface/datasets/pull/7133.patch', 'merged_at': None}"]

### **Cleaning up the Data**
The above snippet from GitHub’s documentation tells us that the `pull_request` column can be used to differentiate between `issues` and `pull requests`

In [80]:
sample = issues_dataset.shuffle(seed=666).select(range(3))

# Print out the URL and pull request entries
for url, pr in zip(sample["html_url"], sample["pull_request"]):
    print(f">> URL: {url}")
    print(f">> Pull request: {pr}\n")

>> URL: https://github.com/huggingface/datasets/pull/6646
>> Pull request: {'url': 'https://api.github.com/repos/huggingface/datasets/pulls/6646', 'html_url': 'https://github.com/huggingface/datasets/pull/6646', 'diff_url': 'https://github.com/huggingface/datasets/pull/6646.diff', 'patch_url': 'https://github.com/huggingface/datasets/pull/6646.patch', 'merged_at': '2024-02-07T14:59:11Z'}

>> URL: https://github.com/huggingface/datasets/issues/4071
>> Pull request: None

>> URL: https://github.com/huggingface/datasets/issues/543
>> Pull request: None



Here we can see that each pull request is associated with various URLs, while ordinary issues have a `None` entry. We can use this distinction to create a new `is_pull_request` column that checks whether the `pull_request` field is `None` or not

In [81]:
issues_dataset = issues_dataset.map(
    lambda x: {"is_pull_request": False if x["pull_request"] is None else True}
)

Map:   0%|          | 0/7089 [00:00<?, ? examples/s]

In [85]:
issues_dataset['is_pull_request'][:3]

[False, True, True]

**Average Time it Takes to Close Issues**

In [166]:
issues_dataset.set_format('pandas')

In [167]:
train_df = issues_dataset[:]

In [168]:
train_df.columns

Index(['Unnamed: 0', 'url', 'repository_url', 'labels_url', 'comments_url',
       'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user',
       'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone',
       'comments', 'created_at', 'updated_at', 'closed_at',
       'author_association', 'active_lock_reason', 'body', 'reactions',
       'timeline_url', 'performed_via_github_app', 'state_reason', 'draft',
       'pull_request', 'is_pull_request'],
      dtype='object')

In [169]:
from datetime import datetime

train_df['created_at'] = pd.to_datetime(train_df['created_at'])
train_df['closed_at'] = pd.to_datetime(train_df['closed_at'])

In [171]:
train_df['closed_at'] = train_df['closed_at'].fillna(train_df['created_at'])

In [173]:
# cl_1 = train_df['closed_at'][8].strftime("%m/%d/%Y, %H:%M:%S")
train_df['time_diff'] = train_df['closed_at'] - train_df['created_at']

In [174]:
train_df[['created_at', 'closed_at', 'time_diff']][:10]

Unnamed: 0,created_at,closed_at,time_diff
0,2024-09-01 13:55:41+00:00,2024-09-01 13:55:41+00:00,0 days 00:00:00
1,2024-08-30 07:36:56+00:00,2024-08-30 07:36:56+00:00,0 days 00:00:00
2,2024-08-29 13:48:16+00:00,2024-08-29 13:48:16+00:00,0 days 00:00:00
3,2024-08-28 12:27:48+00:00,2024-08-28 12:27:48+00:00,0 days 00:00:00
4,2024-08-27 20:31:09+00:00,2024-08-27 20:31:09+00:00,0 days 00:00:00
5,2024-08-26 10:29:48+00:00,2024-08-26 10:29:48+00:00,0 days 00:00:00
6,2024-08-26 05:29:46+00:00,2024-08-26 05:59:15+00:00,0 days 00:29:29
7,2024-08-26 05:09:35+00:00,2024-08-26 05:27:09+00:00,0 days 00:17:34
8,2024-08-26 04:53:59+00:00,2024-08-26 06:09:42+00:00,0 days 01:15:43
9,2024-08-23 22:56:01+00:00,2024-08-23 22:56:01+00:00,0 days 00:00:00


In [175]:
average_time = train_df[['created_at', 'closed_at', 'time_diff']]

In [176]:
average_time

Unnamed: 0,created_at,closed_at,time_diff
0,2024-09-01 13:55:41+00:00,2024-09-01 13:55:41+00:00,0 days 00:00:00
1,2024-08-30 07:36:56+00:00,2024-08-30 07:36:56+00:00,0 days 00:00:00
2,2024-08-29 13:48:16+00:00,2024-08-29 13:48:16+00:00,0 days 00:00:00
3,2024-08-28 12:27:48+00:00,2024-08-28 12:27:48+00:00,0 days 00:00:00
4,2024-08-27 20:31:09+00:00,2024-08-27 20:31:09+00:00,0 days 00:00:00
...,...,...,...
7084,2020-04-15 13:25:13+00:00,2020-04-29 09:23:05+00:00,13 days 19:57:52
7085,2020-04-15 10:17:10+00:00,2020-05-04 06:11:57+00:00,18 days 19:54:47
7086,2020-04-15 10:08:14+00:00,2020-05-04 06:12:27+00:00,18 days 20:04:13
7087,2020-04-14 18:18:51+00:00,2020-05-11 18:55:22+00:00,27 days 00:36:31


In [177]:
average_time_issues_close = average_time.time_diff.mean()
average_time_issues_close

Timedelta('33 days 17:50:46.496120750')

In [179]:
readable_avg_time = str(average_time_issues_close)
readable_avg_time

'33 days 17:50:46.496120750'

In [180]:
issues_dataset.reset_format()

### **Augmenting the Dataset**
The `comments` associated with an `issue` or `pull request` **provide a rich source of information**, especially if we’re interested in building a search engine to answer user queries about the library.

The GitHub REST API provides a [Comments endpoint](https://docs.github.com/en/rest/issues#list-issue-comments) that returns all the comments associated with an issue number

In [181]:
issue_number = 2792
url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
response = requests.get(url, headers=headers)
response.json()

[{'url': 'https://api.github.com/repos/huggingface/datasets/issues/comments/897594128',
  'html_url': 'https://github.com/huggingface/datasets/pull/2792#issuecomment-897594128',
  'issue_url': 'https://api.github.com/repos/huggingface/datasets/issues/2792',
  'id': 897594128,
  'node_id': 'IC_kwDODunzps41gDMQ',
  'user': {'login': 'bhavitvyamalik',
   'id': 19718818,
   'node_id': 'MDQ6VXNlcjE5NzE4ODE4',
   'avatar_url': 'https://avatars.githubusercontent.com/u/19718818?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/bhavitvyamalik',
   'html_url': 'https://github.com/bhavitvyamalik',
   'followers_url': 'https://api.github.com/users/bhavitvyamalik/followers',
   'following_url': 'https://api.github.com/users/bhavitvyamalik/following{/other_user}',
   'gists_url': 'https://api.github.com/users/bhavitvyamalik/gists{/gist_id}',
   'starred_url': 'https://api.github.com/users/bhavitvyamalik/starred{/owner}{/repo}',
   'subscriptions_url': 'https://api.github.com/users/

We can see that the comment is stored in the `body` field, so let’s write a simple function that returns all the comments associated with an issue by picking out the `body` contents for each element in `response.json()`.

In [182]:
def get_comments(issue_number):
    url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
    response = requests.get(url, headers=headers)
    return [r["body"] for r in response.json()]


# Test our function works as expected
get_comments(2792)

["@albertvillanova my tests are failing here:\r\n```\r\ndataset_name = 'gooaq'\r\n\r\n    def test_load_dataset(self, dataset_name):\r\n        configs = self.dataset_tester.load_all_configs(dataset_name, is_local=True)[:1]\r\n>       self.dataset_tester.check_load_dataset(dataset_name, configs, is_local=True, use_local_dummy_data=True)\r\n\r\ntests/test_dataset_common.py:234: \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\ntests/test_dataset_common.py:187: in check_load_dataset\r\n    self.parent.assertTrue(len(dataset[split]) > 0)\r\nE   AssertionError: False is not true\r\n```\r\nWhen I try loading dataset on local machine it works fine. Any suggestions on how can I avoid this error?",
 'Thanks for the help, @albertvillanova! All tests are passing now.']

This looks good, so let’s use `Dataset.map()` to add a new comments column to each issue in our dataset:

In [183]:
issues_dataset

Dataset({
    features: ['Unnamed: 0', 'url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'draft', 'pull_request', 'is_pull_request'],
    num_rows: 7089
})

In [None]:
# Depending on your internet connection, this can take a few minutes...
issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(x["number"])}
)

The final step is to push our dataset to the Hub

### **Uploading the dataset to the Hugging Face Hub**
Now that we have our augmented dataset, it’s time to push it to the Hub so we can share it with the community!

In [None]:
# Authentication
from huggingface_hub import notebook_login

notebook_login()

Upload the dataset

In [None]:
issues_with_comments_dataset.push_to_hub("github-issues")

From here, anyone can download the dataset by simply providing `load_dataset()` with the repository `ID` as the `path` argument

In [None]:
remote_dataset = load_dataset("lewtun/github-issues", split="train")
remote_dataset

**Cool, we’ve pushed our dataset to the Hub and it’s available for others to use!**

There’s just one important thing left to do: adding a **dataset card** that explains *how the corpus was created and provides other useful information for the community*.

## **Semantic Search with FAISS**
In this section we’ll use this information to **build a search engine** *that can help us find answers to our most pressing questions about the library*!

### **Using Embeddings for semantic search**
Transformer-based language models represent each token in a span of text as an embedding vector. It turns out that one can “pool” the individual embeddings to create a vector representation for whole sentences, paragraphs, or (in some cases) documents.

 These `embeddings` can then be used to find **similar documents** in the corpus by computing the dot-product similarity (or some other similarity metric) between each embedding and returning the documents with the greatest overlap.

In this section we’ll use `embeddings` to **develop a semantic search engine**. These search engines **offer several advantages over conventional approaches** that are based on matching keywords in a query with the documents.

### **Loading and Preparing the dataset**
The first thing we need to do is download our dataset of GitHub issues, so let’s use `load_dataset()` function as usual:

In [185]:
from datasets import load_dataset

# issues_dataset = load_dataset("ditherr/github-issues", split="train")
issues_dataset = load_dataset("lewtun/github-issues", split="train")
issues_dataset

Downloading readme:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/12.2M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3019 [00:00<?, ? examples/s]

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3019
})

Here we’ve specified the default `train` split in `load_dataset()`, so it returns a `Dataset` instead of a `DatasetDict`.

**1. Filter out the pull requests**, as these tend to be rarely *used for answering user queries* and will *introduce noise in our search engine*.

Exclude these rows (`is_pull_request`), also exclude rows with no comments (*there are no answer in it*).

In [203]:
issues_dataset = issues_dataset.filter(
    lambda x: (x["is_pull_request"] == False and len(x["comments"]) > 0)
)

issues_dataset

Filter:   0%|          | 0/3019 [00:00<?, ? examples/s]

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 808
})

We can see that there are a lot of columns in our dataset, most of which we don’t need to build our search engine.

From a search perspective, the most informative columns are `title`, `body`, and `comments`, while `html_url` provides us with a link back to the source issue.

**2. Drop the rest of columns** except [`title`, `body`, `comments`, `html_url`]

In [204]:
columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})

In [205]:
issues_dataset[0]

{'html_url': 'https://github.com/huggingface/datasets/issues/2945',
 'title': 'Protect master branch',
 'comments': ['Cool, I think we can do both :)',
  '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).'],
 'body': 'After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:\r\n- 00cc036fea7c7745cfe722360036ed306796a3f2\r\n- 13ae8c98602bbad8197de3b9b425f4c78f582af1\r\n- ...\r\n\r\nI propose to protect our master branch, so that we avoid we can accidentally make this kind of m

**3. Augmented Data** ➡ *adding column*

To create our embeddings we’ll augment each comment with the issue’s title and body, since these fields often include useful contextual information.

Because our `comments` column is currently a list of `comments` for each issue, we need to **“explode”** the column so that each row consists of an (`html_url`, `title`, `body`, `comment`) tuple. In Pandas we can do this with the **DataFrame.explode()** function, which creates a new row for each element in a list-like column, while replicating all the other column values

In [206]:
issues_dataset.set_format("pandas")
df = issues_dataset[:]

In [209]:
df["comments"][0].tolist()

['Cool, I think we can do both :)',
 '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).']

In [212]:
df.head(4)

Unnamed: 0,html_url,title,comments,body
0,https://github.com/huggingface/datasets/issues...,Protect master branch,"[Cool, I think we can do both :), @lhoestq now...",After accidental merge commit (91c55355b634d0d...
1,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,[Hi ! I guess the caching mechanism should hav...,## Describe the bug\r\nAfter upgrading to data...
2,https://github.com/huggingface/datasets/issues...,OSCAR unshuffled_original_ko: NonMatchingSplit...,[I tried `unshuffled_original_da` and it is al...,## Describe the bug\r\n\r\nCannot download OSC...
3,https://github.com/huggingface/datasets/issues...,load_dataset using default cache on Windows ca...,"[Hi @daqieq, thanks for reporting.\r\n\r\nUnfo...",## Describe the bug\r\nStandard process to dow...


When we explode `df`, we expect to get one row for each of these comments

In [211]:
comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

Unnamed: 0,html_url,title,comments,body
0,https://github.com/huggingface/datasets/issues...,Protect master branch,"Cool, I think we can do both :)",After accidental merge commit (91c55355b634d0d...
1,https://github.com/huggingface/datasets/issues...,Protect master branch,@lhoestq now the 2 are implemented.\r\n\r\nPle...,After accidental merge commit (91c55355b634d0d...
2,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,Hi ! I guess the caching mechanism should have...,## Describe the bug\r\nAfter upgrading to data...
3,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,"If it's easy enough to implement, then yes ple...",## Describe the bug\r\nAfter upgrading to data...


**Great**, we can see the rows have been replicated, with the `comments` column containing the individual comments!

Now that we’re finished with Pandas, we can quickly switch back to a `Dataset` by loading the `DataFrame` in memory

In [213]:
from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2964
})

**4. Creates `Length` of the `Comments`** ➡ `number of words per comment`

In [214]:
comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)

Map:   0%|          | 0/2964 [00:00<?, ? examples/s]

In [216]:
comments_dataset[0]

{'html_url': 'https://github.com/huggingface/datasets/issues/2945',
 'title': 'Protect master branch',
 'comments': 'Cool, I think we can do both :)',
 'body': 'After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:\r\n- 00cc036fea7c7745cfe722360036ed306796a3f2\r\n- 13ae8c98602bbad8197de3b9b425f4c78f582af1\r\n- ...\r\n\r\nI propose to protect our master branch, so that we avoid we can accidentally make this kind of mistakes in the future:\r\n- [x] For Pull Requests using GitHub, allow only squash merging, so that only a single commit per Pull Request is merged into the master branch\r\n  - Currently, simple merge commits are already disabled\r\n  - I propose to disable rebase merging as well\r\n- ~~Protect the master branch from direct pushes (to avoid accidentally pushing of merge commits)~~\r\n  - ~~This protection would rejec

**filter out short comments**, because that are not relevant for our search engine.

In [217]:
comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
comments_dataset

Filter:   0%|          | 0/2964 [00:00<?, ? examples/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2175
})

**5. Concatenate the issue `title`, description or `body`, and `comments` together in a new `text` column.**

In [218]:
def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }


comments_dataset = comments_dataset.map(concatenate_text)
comments_dataset

Map:   0%|          | 0/2175 [00:00<?, ? examples/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text'],
    num_rows: 2175
})

In [220]:
comments_dataset[0]

{'html_url': 'https://github.com/huggingface/datasets/issues/2945',
 'title': 'Protect master branch',
 'comments': '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).',
 'body': 'After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:\r\n- 00cc036fea7c7745cfe722360036ed306796a3f2\r\n- 13ae8c98602bbad8197de3b9b425f4c78f582af1\r\n- ...\r\n\r\nI propose to protect our master branch, so that we avoid we can accidentally make this kind of mistakes in the future:\r\n- [x] For Pul

**We’re finally ready to create some embeddings!** Let’s take a look.

### **Creating Text Embeddings**
Pick a suitable checkpoint to load the model from.

We will use library called `sentence-transformers` that is dedicated to creating embeddings. Our use case is an example of ***asymmetric semantic search*** because we have a short query whose answer we’d like to find in a longer document, like a an issue comment.

the handy model in the documentation indicates that the `multi-qa-mpnet-base-dot-v1` checkpoint has the best performance for semantic search, so we’ll use that for our application.

In [221]:
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

**To speed up the embedding process**, it helps to place the model and inputs on a **GPU device**.

In [222]:
import torch

device = torch.device("cuda")
model.to(device)

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

As we mentioned earlier, we’d like to **represent each entry in our GitHub issues corpus as a single vector**, so we need to **“pool”** or average our token embeddings in some way.

One popular approach is to perform `CLS pooling` on our model’s outputs, where we simply **collect the last hidden state for the special `[CLS]` token**.

In [223]:
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text'],
    num_rows: 2175
})

In [None]:
def cls_pooling(model_output):
    # will collect the last hiddent state
    return model_output.last_hidden_state[:, 0]

Next, we’ll create a helper function that will **tokenize a list of documents**, **place the tensors on the GPU**, **feed them to the model**, and finally **apply `CLS pooling` to the outputs**

In [None]:
def get_embeddings(text_list):
    # tokenize the documents
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    # place tensors on the GPU
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}

    # Feed them to the model
    model_output = model(**encoded_input)

    # apply cls_pooling() on the output
    return cls_pooling(model_output)
    # return model_output

We can test the function works by feeding it the first text entry in our corpus and inspecting the output shape

In [None]:
embedding = get_embeddings(comments_dataset["text"][0])
# embedding
embedding.shape

Great, we’ve converted the first entry in our corpus into a 768-dimensional vector!

We can use `Dataset.map()` to apply our `get_embeddings()` function to each row in our corpus.

In [None]:
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

In [None]:
embeddings_dataset

Notice that we’ve converted the embeddings to NumPy arrays — that’s because 🤗 Datasets requires this format when we try to index them with `FAISS`.

### **Use FAISS for efficient similarity search**
Now that we have a dataset of embeddings, we need some way to search over them.

To do this, we’ll use a special data structure in 🤗 Datasets called a ***FAISS index***. **FAISS** (short for Facebook AI Similarity Search) is a library that **provides efficient algorithms to quickly search and cluster embedding vectors**.

The basic idea behind FAISS is to create a special data structure called an *index* that allows one to find which embeddings are similar to an input embedding.

In [None]:
# Creating a FAISS index
embeddings_dataset.add_faiss_index(column="embeddings")

We can now perform queries on this index by doing a nearest neighbor lookup with the `Dataset.get_nearest_examples()` function

In [None]:
question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

Just like with the documents, we now have a 768-dimensional vector representing the query, which we can compare against the whole corpus to find the most similar embeddings:

In [None]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

The `Dataset.get_nearest_examples()` function returns a tuple of scores that rank the overlap between the query and the document, and a corresponding set of samples (here, the 5 best matches).

In [None]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

Now we can iterate over the first few rows to see how well our query matched the available comments

In [None]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

Not bad! Our second hit seems to match the query.