# Hugging Face - lesson 5

## Loading a Local Dataset

Citations:
```
@InProceedings{10.1007/978-3-030-03840-3_29,
	author="Croce, Danilo and Zelenanska, Alexandra and Basili, Roberto",
	editor="Ghidini, Chiara and Magnini, Bernardo and Passerini, Andrea and Traverso, Paolo",
	title="Neural Learning for Question Answering in Italian",
	booktitle="AI*IA 2018 -- Advances in Artificial Intelligence",
	year="2018",
	publisher="Springer International Publishing",
	address="Cham",
	pages="389--402",
	isbn="978-3-030-03840-3"
}
```

Data downloaded from: https://github.com/crux82/squad-it/

In [1]:
# Loading the dataset
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="data/squad-it-master/SQuAD_it-train.json", field="data")

In [2]:
print("Dataset Meta data: ", squad_it_dataset)
print("Sample data: ", squad_it_dataset["train"][0])

Dataset Meta data:  DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})
Sample data:  {'title': 'Terremoto del Sichuan del 2008', 'paragraphs': [{'context': "Il terremoto del Sichuan del 2008 o il terremoto del Gran Sichuan, misurato a 8.0 Ms e 7.9 Mw, e si è verificato alle 02:28:01 PM China Standard Time all' epicentro (06:28:01 UTC) il 12 maggio nella provincia del Sichuan, ha ucciso 69.197 persone e lasciato 18.222 dispersi.", 'qas': [{'answers': [{'answer_start': 29, 'text': '2008'}], 'id': '56cdca7862d2951400fa6826', 'question': 'In quale anno si è verificato il terremoto nel Sichuan?'}, {'answers': [{'answer_start': 232, 'text': '69.197'}], 'id': '56cdca7862d2951400fa6828', 'question': 'Quante persone sono state uccise come risultato?'}, {'answers': [{'answer_start': 29, 'text': '2008'}], 'id': '56d4f9902ccc5a1400d833c0', 'question': 'Quale anno ha avuto luogo il terremoto del Sichuan?'}, {'answers': [{'answer_start': 78,

In [3]:
# Loading train and test data at the same time
data_files = {"train": "data/squad-it-master/SQuAD_it-train.json", "test": "data/squad-it-master/SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

### Loading Remote Dataset

In [4]:
# Note: load_dataset can read compressed files (such as these .gz archives) directly
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
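
To make the `field="data"` argument and the compressed-file support more concrete, here is a minimal standard-library sketch. It assumes a tiny, made-up file in the SQuAD layout (the file name and records are hypothetical) and only mimics what the loader does conceptually: decompress, parse, and keep the list stored under the `data` key.

```python
import gzip
import json
import os
import tempfile

# Hypothetical miniature file in the SQuAD layout: the records live under "data".
payload = {
    "version": "1.1",
    "data": [
        {"title": "A", "paragraphs": []},
        {"title": "B", "paragraphs": []},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-train.json.gz")
    # Write the JSON document as gzip-compressed text
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(payload, f)
    # Read it back and keep only the list under the "data" key
    with gzip.open(path, "rt", encoding="utf-8") as f:
        records = json.load(f)["data"]

titles = [record["title"] for record in records]
print(len(records), titles)
```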

In [5]:

from datasets import load_dataset

data_files = {"train": "data/drugLibTest/drugLibTrain_raw.tsv", "test": "data/drugLibTest/drugLibTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

In [6]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
# Peek at the first few examples
drug_sample[:3]

{'Unnamed: 0': [63, 1542, 46],
 'urlDrugName': ['flexeril', 'vagifem', 'aralen'],
 'rating': [9, 10, 1],
 'effectiveness': ['Considerably Effective',
  'Highly Effective',
  'Moderately Effective'],
 'sideEffects': ['No Side Effects', 'No Side Effects', 'Severe Side Effects'],
 'condition': ['muscle spasm',
  'chronic urinary and yeast infections due to drynes',
  'malaria'],
 'benefitsReview': ['A muscle in my neck was pinching a nerve such that  if i turned my head or lay on that side my arm would instantly start to tingle and my hand fall asleep.  The Flexeril was effective in relaxing my muscles enough to make this merely a slight sensation within about twenty minutes, and over aperiod of one two two months the problem went away entirely.',
  'Before the use of vagifem tablets, I had to endure a series of urinary infections after sometimes painful sexual intercourse.  I also had painful cracks in mucoal lining of vulva due to aging and dryness.  After beginning the use of this drug

In [7]:
# Check whether "Unnamed: 0" is unique within each split; if so, it most likely represents a review ID
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

In [8]:
# Rename the column to review_id
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="review_id"
)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['review_id', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects', 'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview'],
        num_rows: 3107
    })
    test: Dataset({
        features: ['review_id', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects', 'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview'],
        num_rows: 1036
    })
})

In [9]:
# Count the unique drug names in the training split
len(drug_dataset.unique(column="urlDrugName")["train"])

502

In [10]:
def filter_nones(x):
    return x["condition"] is not None

In [11]:
# Remove records with a null condition, reusing the helper defined above
drug_dataset = drug_dataset.filter(filter_nones)

In [12]:
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}


drug_dataset = drug_dataset.map(lowercase_condition)

In [13]:
drug_dataset["train"]["condition"][:3]

['management of congestive heart failure',
 'birth prevention',
 'menstrual cramps']
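
The filter and map steps above can be pictured with plain Python lists of dicts. This is only a conceptual sketch with made-up rows, not how `datasets` is implemented: `filter` keeps the rows for which the function returns `True`, and `map` merges the returned dict into each row.

```python
# Made-up rows standing in for a dataset split
rows = [
    {"review_id": 1, "condition": "Birth Control"},
    {"review_id": 2, "condition": None},
    {"review_id": 3, "condition": "ACNE"},
]


def has_condition(row):
    return row["condition"] is not None


def lowercase_condition(row):
    return {"condition": row["condition"].lower()}


# "filter": keep rows where the predicate is True
kept = [row for row in rows if has_condition(row)]
# "map": merge the returned dict into each row, overwriting existing keys
lowered = [{**row, **lowercase_condition(row)} for row in kept]
print(lowered)
```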

In [14]:
def compute_review_length(example):
    return {"review_length": len(example["commentsReview"].split())}

# Drop rows with no commentsReview, then add the word count of each review
drug_dataset = drug_dataset.filter(lambda x: x["commentsReview"] is not None)
drug_dataset = drug_dataset.map(compute_review_length)
# Inspect the first training example
drug_dataset["train"][0]

{'review_id': 2202,
 'urlDrugName': 'enalapril',
 'rating': 4,
 'effectiveness': 'Highly Effective',
 'sideEffects': 'Mild Side Effects',
 'condition': 'management of congestive heart failure',
 'benefitsReview': 'slowed the progression of left ventricular dysfunction into overt heart failure \r\r\nalone or with other agents in the managment of hypertension \r\r\nmangagement of congestive heart failur',
 'sideEffectsReview': 'cough, hypotension , proteinuria, impotence , renal failure , angina pectoris , tachycardia , eosinophilic pneumonitis, tastes disturbances , anusease anorecia , weakness fatigue insominca weakness',
 'commentsReview': 'monitor blood pressure , weight and asses for resolution of fluid',
 'review_length': 11}

In [15]:
drug_dataset["train"].sort("review_length")[:3]

{'review_id': [3201, 1843, 3814],
 'urlDrugName': ['omnicef', 'yasmin', 'solodyn'],
 'rating': [1, 10, 1],
 'effectiveness': ['Marginally Effective', 'Highly Effective', 'Ineffective'],
 'sideEffects': ['Severe Side Effects',
  'No Side Effects',
  'Severe Side Effects'],
 'condition': ['sore throat ear infections', 'birth control', 'acne'],
 'benefitsReview': ["don't know yet back to pediatrician today",
  "I've been on yasmin four years now, it works so great! I have no side effects, just nausea if I take two at a time and that is hardly ever.  But my acne completely went away and no pregnancy the whole time I've been on it! Would definently recommend to anybody.",
  "I started this medicine to help my acne, but after a couple of weeks there was no improvement at all.  In fact, my skin was worse. I've taken other antibiotics in the past and they were much more effective at clearing up my skin.  So I stopped taking the Solodyn."],
 'sideEffectsReview': ['My one year old daughter who h

In [16]:
# Keep only reviews with at least 30 words
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] >= 30)
print(drug_dataset.num_rows)

{'train': 1818, 'test': 637}


In [17]:
# Unescape HTML character references
import html

text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

In [18]:
drug_dataset = drug_dataset.map(lambda x: {"commentsReview": html.unescape(x["commentsReview"])})

In [19]:
new_drug_dataset = drug_dataset.map(
    lambda x: {"commentsReview": [html.unescape(o) for o in x["commentsReview"]]}, batched=True
)
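
With `batched=True`, the mapped function receives a dict that maps each column name to a list of values, which is why the lambda above loops over `x["commentsReview"]`. The following standalone sketch uses a made-up batch to show that calling convention; it is an illustration of the input and output shapes, not the library's internals.

```python
import html


def unescape_batch(batch):
    # batch["commentsReview"] is a list of strings, one per row in the batch
    return {"commentsReview": [html.unescape(text) for text in batch["commentsReview"]]}


# Made-up batch in the column -> list-of-values layout
batch = {"commentsReview": ["I&#039;m fine", "5 &lt; 6 &amp; 7 &gt; 6"]}
cleaned = unescape_batch(batch)
print(cleaned["commentsReview"])
```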

In [20]:
# Tokenization in batches
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["commentsReview"], truncation=True)

%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)

CPU times: total: 1.45 s
Wall time: 691 ms


In [21]:
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)

def slow_tokenize_function(examples, slow_tokenizer=slow_tokenizer):
    return slow_tokenizer(examples["commentsReview"], truncation=True)

%time tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)

CPU times: total: 7.92 s
Wall time: 47.1 s


In [22]:
# Tokenize and split texts longer than 128 tokens into several chunks
def tokenize_and_split(examples):
    return tokenizer(
        examples["commentsReview"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

result = tokenize_and_split(drug_dataset["train"][9])
[len(inp) for inp in result["input_ids"]]

[128, 7]
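
The chunk lengths above can be reproduced with a small plain-Python sketch. It assumes each chunk is wrapped in two special tokens (the ids 101 and 102 are used here purely for illustration) and that chunks do not overlap; the helper name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(token_ids, max_length, cls_id=101, sep_id=102):
    # Each chunk reserves two positions for the special tokens
    body = max_length - 2
    return [
        [cls_id] + token_ids[start:start + body] + [sep_id]
        for start in range(0, len(token_ids), body)
    ]


# 131 made-up token ids: 126 fit in the first chunk, 5 overflow into the second
token_ids = list(range(1000, 1131))
chunks = split_into_chunks(token_ids, max_length=128)
chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)
```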

In [23]:
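# A long review can yield several chunks, so the output has more rows than the input;
# dropping the original columns avoids the resulting column-length mismatch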
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)
len(tokenized_dataset["train"]), len(drug_dataset["train"])

(2272, 1818)

In [24]:
# Alternative: keep the original columns by repeating each row's values for every chunk,
# using overflow_to_sample_mapping
def tokenize_and_split(examples):
    result = tokenizer(
        examples["commentsReview"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result
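
The index bookkeeping in the loop above is easier to see on toy data. In this sketch the mapping and column values are made up: entry `j` of `sample_map` says which original row chunk `j` came from, so indexing each column with it repeats the row's values once per chunk.

```python
# Made-up mapping: row 0 produced 2 chunks, row 1 produced 1, row 2 produced 3
sample_map = [0, 0, 1, 2, 2, 2]
batch = {"review_id": [11, 22, 33], "rating": [9, 1, 5]}

# Repeat every original value once for each chunk derived from its row
expanded = {key: [values[i] for i in sample_map] for key, values in batch.items()}
print(expanded)
```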

In [25]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['review_id', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects', 'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2272
    })
    test: Dataset({
        features: ['review_id', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects', 'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 819
    })
})

In [26]:
# Switch the output format of the dataset to pandas
drug_dataset.set_format("pandas")

drug_dataset["train"][:3]

| | review_id | urlDrugName | rating | effectiveness | sideEffects | condition | benefitsReview | sideEffectsReview | commentsReview | review_length |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1146 | ponstel | 10 | Highly Effective | No Side Effects | menstrual cramps | I was used to having cramps so badly that they... | Heavier bleeding and clotting than normal. | I took 2 pills at the onset of my menstrual cr... | 76 |
| 1 | 1043 | vyvanse | 9 | Highly Effective | Mild Side Effects | add | My mood has noticably improved, I have more en... | a few experiences of nausiea, heavy moodswings... | I had began taking 20mg of Vyvanse for three m... | 107 |
| 2 | 1591 | xanax | 10 | Highly Effective | No Side Effects | panic disorder | This simply just works fast and without any of... | I really don't have any side effects other tha... | I first started taking this at 3 times per day... | 231 |


In [27]:
train_df = drug_dataset["train"][:]

In [28]:
# Count how often each condition appears in the training split.
# Naming the index and the value column explicitly keeps the result
# independent of the pandas version.
frequencies = (
    train_df["condition"]
    .value_counts()
    .rename_axis("condition")
    .reset_index(name="frequency")
)
frequencies.head()

| | condition | frequency |
|---|---|---|
| 0 | depression | 143 |
| 1 | acne | 90 |
| 2 | anxiety | 34 |
| 3 | insomnia | 32 |
| 4 | high blood pressure | 24 |


In [29]:
# Create a Dataset object from the frequencies DataFrame
from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

Dataset({
    features: ['condition', 'frequency'],
    num_rows: 925
})

In [30]:
# Reset the drug dataset's output format from pandas back to Arrow
drug_dataset.reset_format()

In [31]:
# Create a validation dataset

drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean

DatasetDict({
    train: Dataset({
        features: ['review_id', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects', 'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview', 'review_length'],
        num_rows: 1454
    })
    validation: Dataset({
        features: ['review_id', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects', 'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview', 'review_length'],
        num_rows: 364
    })
    test: Dataset({
        features: ['review_id', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects', 'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview', 'review_length'],
        num_rows: 637
    })
})
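
The 1454/364 split above matches truncating `0.8 * 1818` to an integer. The sketch below shows a seeded shuffle-and-cut split in plain Python under that assumption; the helper `split_indices` is hypothetical, and the shuffle will not reproduce the exact row assignment that `train_test_split` makes.

```python
import random


def split_indices(num_rows, train_size, seed):
    indices = list(range(num_rows))
    # A seeded generator makes the shuffle reproducible
    random.Random(seed).shuffle(indices)
    cut = int(train_size * num_rows)
    return indices[:cut], indices[cut:]


train_idx, validation_idx = split_indices(num_rows=10, train_size=0.8, seed=42)
print(len(train_idx), len(validation_idx))
```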

### Saving dataset

In [32]:
drug_dataset_clean.save_to_disk("drug-reviews")

### Loading Arrow Dataset

In [33]:
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("drug-reviews")
drug_dataset_reloaded

DatasetDict({
    train: Dataset({
        features: ['review_id', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects', 'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview', 'review_length'],
        num_rows: 1454
    })
    validation: Dataset({
        features: ['review_id', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects', 'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview', 'review_length'],
        num_rows: 364
    })
    test: Dataset({
        features: ['review_id', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects', 'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview', 'review_length'],
        num_rows: 637
    })
})

In [34]:
# For CSV and JSON, each split (train, validation, test) must be saved to its own file
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"data/drug-reviews/drug-reviews-{split}.jsonl")

In [35]:
data_files = {
    "train": "data/drug-reviews/drug-reviews-train.jsonl",
    "validation": "data/drug-reviews/drug-reviews-validation.jsonl",
    "test": "data/drug-reviews/drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)

In [36]:
drug_dataset_reloaded

DatasetDict({
    train: Dataset({
        features: ['review_id', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects', 'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview', 'review_length'],
        num_rows: 1454
    })
    validation: Dataset({
        features: ['review_id', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects', 'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview', 'review_length'],
        num_rows: 364
    })
    test: Dataset({
        features: ['review_id', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects', 'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview', 'review_length'],
        num_rows: 637
    })
})
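
The `.jsonl` extension used above refers to JSON Lines, where every line is one JSON object. A minimal standard-library round trip, with made-up rows and a hypothetical file name, shows the layout:

```python
import json
import os
import tempfile

# Made-up rows standing in for one split
rows = [{"review_id": 1, "rating": 9}, {"review_id": 2, "rating": 4}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-train.jsonl")
    # One JSON object per line
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # Parse each non-empty line back into a dict
    with open(path, encoding="utf-8") as f:
        reloaded = [json.loads(line) for line in f if line.strip()]

print(reloaded)
```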

## Big Data - datasets (skip because data is not available)

In [37]:
# !pip install zstandard

In [38]:
import psutil
psutil.cpu_times()

scputimes(user=20098.4375, system=7983.062500000007, idle=61830.93749999999, interrupt=438.390625, dpc=384.8125)

## Semantic Search with FAISS

In [39]:
# Loading the dataset
from datasets import load_dataset

issues_dataset = load_dataset("lewtun/github-issues", split="train")
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3019
})

In [40]:
# Keep only issues (not pull requests) that have at least one comment
issues_dataset = issues_dataset.filter(
    lambda x: (not x["is_pull_request"] and len(x["comments"]) > 0)
)
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 808
})

In [41]:
# Remove the columns that are not needed

columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
# Everything that is not explicitly kept gets removed
columns_to_remove = list(set(columns) - set(columns_to_keep))
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})

In [42]:
issues_dataset.set_format("pandas")
df = issues_dataset[:]

In [43]:
df["comments"][0].tolist()

['Cool, I think we can do both :)',
 '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).']

In [44]:
# Explode the comments column: each comment in the list becomes its own row, with the other columns repeated

comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

| | html_url | title | comments | body |
|---|---|---|---|---|
| 0 | https://github.com/huggingface/datasets/issues... | Protect master branch | Cool, I think we can do both :) | After accidental merge commit (91c55355b634d0d... |
| 1 | https://github.com/huggingface/datasets/issues... | Protect master branch | @lhoestq now the 2 are implemented.\r\n\r\nPle... | After accidental merge commit (91c55355b634d0d... |
| 2 | https://github.com/huggingface/datasets/issues... | Backwards compatibility broken for cached data... | Hi ! I guess the caching mechanism should have... | ## Describe the bug\r\nAfter upgrading to data... |
| 3 | https://github.com/huggingface/datasets/issues... | Backwards compatibility broken for cached data... | If it's easy enough to implement, then yes ple... | ## Describe the bug\r\nAfter upgrading to data... |
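
What `explode` does to the `comments` column can be written out by hand. This plain-Python sketch uses made-up rows: every element of the list becomes its own row, and the remaining fields are copied unchanged.

```python
# Made-up issues, each with a list of comments
rows = [
    {"title": "Issue A", "comments": ["first", "second"]},
    {"title": "Issue B", "comments": ["third"]},
]

# One output row per comment; all other fields are repeated
exploded = [
    {**row, "comments": comment}
    for row in rows
    for comment in row["comments"]
]
print(exploded)
```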


In [45]:
# Convert Pandas to Dataset object

from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2964
})

In [46]:
# Add a comment_length column with the number of words in each comment
comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)

In [47]:
# Keep only comments with at least 15 words
comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] >= 15)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2239
})

In [48]:
# Create a text column that concatenates the title, body and comment
def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }


comments_dataset = comments_dataset.map(concatenate_text)

In [49]:
# Load tokenizer and model
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

In [50]:
import torch

device = torch.device("cpu")  # change to "cuda" if you have a GPU
model.to(device)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

In [51]:
# CLS pooling - collect the last hidden state for [CLS] token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]
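
The indexing `[:, 0]` selects, for every sequence in the batch, the hidden vector at position 0 (the first token). A nested-list sketch with made-up numbers shows the shape change from (batch, tokens, hidden) to (batch, hidden); the function name `cls_pooling_lists` is hypothetical.

```python
# Made-up hidden states: 2 sequences, 3 tokens each, hidden size 2
last_hidden_state = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],
]


def cls_pooling_lists(hidden_states):
    # Keep only the vector of the first token of each sequence
    return [sequence[0] for sequence in hidden_states]


pooled = cls_pooling_lists(last_hidden_state)
print(pooled)
```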

In [52]:
# Tokenize a list of documents and return their CLS-pooled embeddings
def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

In [53]:
embedding = get_embeddings(comments_dataset["text"][0])
embedding.shape

torch.Size([1, 768])

In [54]:
# Create a column to store our embedding
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

### Using FAISS for Efficient Similarity Search

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search over dense vectors.

In [55]:
# Create FAISS index to perform similarity search
embeddings_dataset.add_faiss_index(column="embeddings")

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text', 'embeddings'],
    num_rows: 2239
})

In [56]:
# Perform querying on the index
# Embedding of the question
question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

(1, 768)

In [57]:
# Get the 5 nearest examples to the question embedding
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)
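
Conceptually, a nearest-neighbour lookup compares the query vector with every stored vector and returns the `k` best matches. The brute-force sketch below uses made-up 2-dimensional vectors and assumes squared Euclidean (L2) distance, where smaller means closer; the metric an actual FAISS index uses depends on how it was built, and the helper names are hypothetical.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_examples(query, vectors, k):
    # Rank all stored vectors by their distance to the query
    ranked = sorted(range(len(vectors)), key=lambda i: squared_l2(query, vectors[i]))
    top = ranked[:k]
    return [squared_l2(query, vectors[i]) for i in top], top


# Made-up "embeddings" and query
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
scores, indices = nearest_examples([0.9, 0.1], vectors, k=2)
print(indices)
```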

In [58]:
import pandas as pd

# Collect the results in a DataFrame
samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)
samples_df.head()

| | html_url | title | comments | body | comment_length | text | embeddings | scores |
|---|---|---|---|---|---|---|---|---|
| 4 | https://github.com/huggingface/datasets/issues... | Discussion using datasets in offline mode | Requiring online connection is a deal breaker ... | `datasets.load_dataset("csv", ...)` breaks if ... | 57 | Discussion using datasets in offline mode \n `... | [-0.47318071126937866, 0.24578335881233215, -0... | 25.505005 |
| 3 | https://github.com/huggingface/datasets/issues... | Discussion using datasets in offline mode | The local dataset builders (csv, text , json a... | `datasets.load_dataset("csv", ...)` breaks if ... | 38 | Discussion using datasets in offline mode \n `... | [-0.4490855634212494, 0.20950505137443542, -0.... | 24.555557 |
| 2 | https://github.com/huggingface/datasets/issues... | Discussion using datasets in offline mode | I opened a PR that allows to reload modules th... | `datasets.load_dataset("csv", ...)` breaks if ... | 179 | Discussion using datasets in offline mode \n `... | [-0.47164809703826904, 0.2902263402938843, -0.... | 24.148975 |
| 1 | https://github.com/huggingface/datasets/issues... | Discussion using datasets in offline mode | > here is my way to load a dataset offline, bu... | `datasets.load_dataset("csv", ...)` breaks if ... | 76 | Discussion using datasets in offline mode \n `... | [-0.49925997853279114, 0.22699648141860962, -0... | 22.894001 |
| 0 | https://github.com/huggingface/datasets/issues... | Discussion using datasets in offline mode | here is my way to load a dataset offline, but ... | `datasets.load_dataset("csv", ...)` breaks if ... | 47 | Discussion using datasets in offline mode \n `... | [-0.4902573823928833, 0.22889509797096252, -0.... | 22.406649 |


In [59]:
# Print the results
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

COMMENT: Requiring online connection is a deal breaker in some cases unfortunately so it'd be great if offline mode is added similar to how `transformers` loads models offline fine.

@mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?
SCORE: 25.5050048828125
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package since #1726 :)
You can now use them offline
```python
datasets = load_dataset('text', data_files=data_files)
```

We'll do a new release soon
SCORE: 24.555557250976562
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: I opened a PR that allows to reload modules that have already been loaded once even if there's no internet