<a href="https://colab.research.google.com/github/chineidu/NLP-Tutorial/blob/main/notebook/06_Transformers/05_semantic_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Semantic Search](https://huggingface.co/learn/nlp-course/chapter5/6?fw=pt)

In [1]:
import IPython

IPython.version_info

(7, 34, 0, '')

In [2]:
!pip install rich
!pip install transformers[torch]
!pip install torch datasets evaluate
!pip install black[jupyter]




### Formatter For Colab

[Run only once, at startup]

```text
- Connect to your drive
```

```python
from google.colab import drive
drive.mount("/content/drive")
```

```sh
# Install black for Jupyter
!pip install black[jupyter]
```

```text
- Restart kernel

[Then]

- Place your .ipynb file somewhere on your drive
```

```python
# Any time you want to format your code, run:
!black /content/drive/MyDrive/YOUR_PATH/YOUR_NOTEBOOK.ipynb
```

```text
- Don't save your notebook, hit F5 to refresh the page
- Now save!
```

In [3]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# Built-in library
import re
import json
from typing import Any, Dict, List, Optional, Union
import logging
import warnings

# Standard imports
import numpy as np
from pprint import pprint
import pandas as pd
from rich import print

# Visualization
import matplotlib.pyplot as plt


# Pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 600

warnings.filterwarnings("ignore")

# Black code formatter (Optional)
# %load_ext lab_black

# auto reload imports
%load_ext autoreload
%autoreload 2

### Load Data

In [5]:
import datasets
from datasets import load_dataset
from datasets.dataset_dict import DatasetDict, Dataset

In [6]:
fp: str = "lewtun/github-issues"

issues_dataset: Dataset = load_dataset(fp, split="train")
issues_dataset

Repo card metadata block was not found. Setting CardData to empty.


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3019
})

In [7]:
print(issues_dataset[10])

In [8]:
print(issues_dataset.column_names)

In [9]:
# The issues_dataset contains both issues and pull requests.
# Keep ONLY the issues that have at least one comment.
issues_dataset_1: Dataset = issues_dataset.filter(
    lambda x: x.get("is_pull_request") == False and len(x.get("comments")) > 0
)

print(f"Size of data BEFORE: {issues_dataset.num_rows}\n")

print(f"Size of data AFTER: {issues_dataset_1.num_rows}\n")
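The filter keeps a row only when it is not a pull request and has at least one comment. A minimal pure-Python sketch of the same predicate, run on made-up records (`toy_issues` and `keep_issue` are hypothetical names, not part of the dataset or of 🤗 Datasets):

```python
# Hypothetical toy records that mimic the two fields used by the filter.
toy_issues = [
    {"is_pull_request": False, "comments": ["first", "second"]},
    {"is_pull_request": True, "comments": ["looks good"]},
    {"is_pull_request": False, "comments": []},
]


def keep_issue(row: dict) -> bool:
    """Keep rows that are issues (not pull requests) with at least one comment."""
    return (not row["is_pull_request"]) and len(row["comments"]) > 0


kept = [row for row in toy_issues if keep_issue(row)]
print(len(kept))
```

Only the first record passes both conditions; the pull request and the issue without comments are dropped.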

In [10]:
all_columns: list[str] = issues_dataset_1.column_names
columns_to_keep: list[str] = ["title", "body", "html_url", "comments"]
# Columns present ONLY in all_columns (i.e. every column we do not keep)
columns_to_remove: list[str] = [
    col for col in all_columns if col not in columns_to_keep
]
print(columns_to_remove)

In [11]:
issues_dataset_1 = issues_dataset_1.remove_columns(column_names=columns_to_remove)
issues_dataset_1

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})

In [12]:
issues_dataset_1.set_format("pandas")
df: pd.DataFrame = issues_dataset_1[:]
# OR df = issues_dataset_1.to_pandas()

df.head()

Unnamed: 0,html_url,title,comments,body
0,https://github.com/huggingface/datasets/issues/2945,Protect master branch,"[Cool, I think we can do both :), @lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).]","After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:\r\n- 00cc036fea7c7745cfe722360036ed306796a3f2\r\n- 13ae8c98602bbad8197de3b9b425f4c78f582af1\r\n- ...\r\n\r\nI propose to protect our master branch, so that we avoid we can accidentally make this kind of mistakes in the future:\r\n- [x] For Pull Requests using GitHub, allow only squash merging, so that only a single commit per Pull Request is merged into the master branch\r\n - ..."
1,https://github.com/huggingface/datasets/issues/2943,Backwards compatibility broken for cached datasets that use `.filter()`,"[Hi ! I guess the caching mechanism should have considered the new `filter` to be different from the old one, and don't use cached results from the old `filter`.\r\nTo avoid other users from having this issue we could make the caching differentiate the two, what do you think ?, If it's easy enough to implement, then yes please 😄 But this issue can be low-priority, since I've only encountered it in a couple of `transformers` CI tests., Well it can cause issue with anyone that updates `datasets` and re-run some code that uses filter, so I'm creating a PR, I just merged a fix, let me know if...","## Describe the bug\r\nAfter upgrading to datasets `1.12.0`, some cached `.filter()` steps from `1.11.0` started failing with \r\n`ValueError: Keys mismatch: between {'indices': Value(dtype='uint64', id=None)} and {'file': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'speaker_id': Value(dtype='int64', id=None), 'chapter_id': Value(dtype='int64', id=None), 'id': Value(dtype='string', id=None)}`\r\n\r\nRelated feature: https://github.com/huggingface/datasets/pull/2836\r\n\r\n:question: This is probably a `wontfix` bug, since it can be solved by simply cleaning the..."
2,https://github.com/huggingface/datasets/issues/2941,OSCAR unshuffled_original_ko: NonMatchingSplitsSizesError,[I tried `unshuffled_original_da` and it is also not working],"## Describe the bug\r\n\r\nCannot download OSCAR `unshuffled_original_ko` due to `NonMatchingSplitsSizesError`.\r\n\r\n## Steps to reproduce the bug\r\n\r\n```python\r\n>>> dataset = datasets.load_dataset('oscar', 'unshuffled_original_ko')\r\nNonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=25292102197, num_examples=7345075, dataset_name='oscar'), 'recorded': SplitInfo(name='train', num_bytes=25284578514, num_examples=7344907, dataset_name='oscar')}]\r\n```\r\n\r\n## Expected results\r\n\r\nLoading is successful.\r\n\r\n## Actual results\r\n\r\nLoading throws ab..."
3,https://github.com/huggingface/datasets/issues/2937,load_dataset using default cache on Windows causes PermissionError: [WinError 5] Access is denied,"[Hi @daqieq, thanks for reporting.\r\n\r\nUnfortunately, I was not able to reproduce this bug:\r\n```ipython\r\nIn [1]: from datasets import load_dataset\r\n ...: ds = load_dataset('wiki_bio')\r\nDownloading: 7.58kB [00:00, 26.3kB/s]\r\nDownloading: 2.71kB [00:00, ?B/s]\r\nUsing custom data configuration default\r\nDownloading and preparing dataset wiki_bio/default (download: 318.53 MiB, generated: 736.94 MiB, post-processed: Unknown size, total: 1.03 GiB) to C:\Users\username\.cache\huggingface\datasets\wiki_bio\default\\r\n1.1.0\5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf8...","## Describe the bug\r\nStandard process to download and load the wiki_bio dataset causes PermissionError in Windows 10 and 11.\r\n\r\n## Steps to reproduce the bug\r\n```python\r\nfrom datasets import load_dataset\r\nds = load_dataset('wiki_bio')\r\n```\r\n\r\n## Expected results\r\nIt is expected that the dataset downloads without any errors.\r\n\r\n## Actual results\r\nPermissionError see trace below:\r\n```\r\nUsing custom data configuration default\r\nDownloading and preparing dataset wiki_bio/default (download: 318.53 MiB, generated: 736.94 MiB, post-processed: Unknown size, total: 1...."
4,https://github.com/huggingface/datasets/issues/2934,"to_tf_dataset keeps a reference to the open data somewhere, causing issues on windows","[I did some investigation and, as it seems, the bug stems from [this line](https://github.com/huggingface/datasets/blob/8004d7c3e1d74b29c3e5b0d1660331cd26758363/src/datasets/arrow_dataset.py#L325). The lifecycle of the dataset from the linked line is bound to one of the returned `tf.data.Dataset`. So my (hacky) solution involves wrapping the linked dataset with `weakref.proxy` and adding a custom `__del__` to `tf.python.data.ops.dataset_ops.TensorSliceDataset` (this is the type of a dataset that is returned by `tf.data.Dataset.from_tensor_slices`; this works for TF 2.x, but I'm not sure `t...","To reproduce:\r\n```python\r\nimport datasets as ds\r\nimport weakref\r\nimport gc\r\n\r\nd = ds.load_dataset(""mnist"", split=""train"")\r\nref = weakref.ref(d._data.table)\r\ntfd = d.to_tf_dataset(""image"", batch_size=1, shuffle=False, label_cols=""label"")\r\ndel tfd, d\r\ngc.collect()\r\nassert ref() is None, ""Error: there is at least one reference left""\r\n```\r\n\r\nThis causes issues because the table holds a reference to an open arrow file that should be closed. So on windows it's not possible to delete or move the arrow file afterwards.\r\n\r\nMoreover the CI test of the `to_tf_dataset` ..."


In [13]:
print(df["comments"].iloc[0])

In [14]:
# The first row holds a list of two comments
print(len(df["comments"].iloc[0]))

In [15]:
# Use the explode method to create a row for each comment in a given index.
# i.e. index 0 with 2 comments creates 2 rows and index 1 with 6 comments creates 6 rows, etc.
comments_df: pd.DataFrame = df.explode("comments", ignore_index=True)
comments_df.head(3)

Unnamed: 0,html_url,title,comments,body
0,https://github.com/huggingface/datasets/issues/2945,Protect master branch,"Cool, I think we can do both :)","After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:\r\n- 00cc036fea7c7745cfe722360036ed306796a3f2\r\n- 13ae8c98602bbad8197de3b9b425f4c78f582af1\r\n- ...\r\n\r\nI propose to protect our master branch, so that we avoid we can accidentally make this kind of mistakes in the future:\r\n- [x] For Pull Requests using GitHub, allow only squash merging, so that only a single commit per Pull Request is merged into the master branch\r\n - ..."
1,https://github.com/huggingface/datasets/issues/2945,Protect master branch,"@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).","After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:\r\n- 00cc036fea7c7745cfe722360036ed306796a3f2\r\n- 13ae8c98602bbad8197de3b9b425f4c78f582af1\r\n- ...\r\n\r\nI propose to protect our master branch, so that we avoid we can accidentally make this kind of mistakes in the future:\r\n- [x] For Pull Requests using GitHub, allow only squash merging, so that only a single commit per Pull Request is merged into the master branch\r\n - ..."
2,https://github.com/huggingface/datasets/issues/2943,Backwards compatibility broken for cached datasets that use `.filter()`,"Hi ! I guess the caching mechanism should have considered the new `filter` to be different from the old one, and don't use cached results from the old `filter`.\r\nTo avoid other users from having this issue we could make the caching differentiate the two, what do you think ?","## Describe the bug\r\nAfter upgrading to datasets `1.12.0`, some cached `.filter()` steps from `1.11.0` started failing with \r\n`ValueError: Keys mismatch: between {'indices': Value(dtype='uint64', id=None)} and {'file': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'speaker_id': Value(dtype='int64', id=None), 'chapter_id': Value(dtype='int64', id=None), 'id': Value(dtype='string', id=None)}`\r\n\r\nRelated feature: https://github.com/huggingface/datasets/pull/2836\r\n\r\n:question: This is probably a `wontfix` bug, since it can be solved by simply cleaning the..."
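To make the effect of `explode` concrete, here is a pure-Python sketch of the same idea on made-up rows. The helper `explode_rows` is hypothetical and only imitates the behaviour for non-empty lists:

```python
def explode_rows(rows: list[dict], column: str) -> list[dict]:
    """Return one output row per element of the list stored in `column`."""
    exploded = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)     # copy the other fields unchanged
            new_row[column] = item  # replace the list with a single element
            exploded.append(new_row)
    return exploded


toy_rows = [
    {"title": "A", "comments": ["c1", "c2"]},
    {"title": "B", "comments": ["c3"]},
]
flat_rows = explode_rows(toy_rows, "comments")
print(flat_rows)
```

One difference worth knowing: pandas keeps a row with a missing value when the list is empty, whereas this sketch drops it. That case does not arise here because issues without comments were filtered out earlier.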


In [16]:
# Convert back to Dataset
comments_dataset: Dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2964
})

In [17]:
# Add a new column holding the number of words in each comment
comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x.get("comments").split())}
)

comments_dataset

Map:   0%|          | 0/2964 [00:00<?, ? examples/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2964
})

In [18]:
# Short comments
comments_dataset.sort("comment_length").select_columns(["comments", "comment_length"])[
    :10
]

{'comments': ['https://github.com/huggingface/datasets/blob/6c766f9115d686182d76b1b937cb27e099c45d68/src/datasets/builder.py#L179-L186',
  'https://github.com/huggingface/datasets/blob/6c766f9115d686182d76b1b937cb27e099c45d68/src/datasets/builder.py#L179-L186',
  '@albertvillanova ',
  'Thanks!',
  '#self-assign',
  '#take',
  '#take',
  '#self-assign',
  'Resolved',
  'Ty!'],
 'comment_length': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [19]:
# Drop short comments.
# i.e comments like: 'Thanks!', '#self-assign', etc.
comments_dataset = comments_dataset.filter(lambda x: x.get("comment_length") > 15)
comments_dataset

Filter:   0%|          | 0/2964 [00:00<?, ? examples/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2175
})
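The length used for this filter is a whitespace-separated word count, so a bare URL or a mention counts as a single word. A short sketch on made-up comments (`word_count` and `toy_comments` are hypothetical names):

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens, the same way `len(text.split())` does."""
    return len(text.split())


toy_comments = [
    "Thanks!",
    "#take",
    "https://example.com/some/very/long/path",  # a URL is still one "word"
    "word " * 16,                               # 16 words, passes the threshold
]

long_enough = [c for c in toy_comments if word_count(c) > 15]
print([word_count(c) for c in toy_comments])
```

Because the comparison is strict (`> 15`), a comment with exactly 15 words is also dropped.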

In [20]:
# Concatenate the issue title, description (body), and comments together in a new text column.
def concat_data(example: dict[str, Any]) -> dict[str, Any]:
    """This is used to concatenate the title, body and comments together."""
    title: str = example.get("title")
    comments: str = example.get("comments")
    body: str = example.get("body")
    result = {"text": f"{title} \n {body} \n {comments}"}

    return result


comments_dataset = comments_dataset.map(concat_data)
comments_dataset

Map:   0%|          | 0/2175 [00:00<?, ? examples/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text'],
    num_rows: 2175
})

In [21]:
comments_dataset[0]

{'html_url': 'https://github.com/huggingface/datasets/issues/2945',
 'title': 'Protect master branch',
 'comments': '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).',
 'body': 'After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:\r\n- 00cc036fea7c7745cfe722360036ed306796a3f2\r\n- 13ae8c98602bbad8197de3b9b425f4c78f582af1\r\n- ...\r\n\r\nI propose to protect our master branch, so that we avoid we can accidentally make this kind of mistakes in the future:\r\n- [x] For Pul

## Create Text Embeddings

- This [table](https://www.sbert.net/docs/pretrained_models.html#model-overview) gives an overview of the open-source sentence-transformers models that can be used for semantic search.

<br>

[![image.png](https://i.postimg.cc/MZLLL8TS/image.png)](https://postimg.cc/c6QTK2w9)

In [22]:
from transformers import AutoModel, AutoTokenizer

repo_name: str = "sentence-transformers"
checkpoint: str = f"{repo_name}/multi-qa-mpnet-base-dot-v1"

# Instantiate tokenizer
tokenizer: AutoTokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Instantiate model
model: AutoModel = AutoModel.from_pretrained(checkpoint)

In [23]:
import torch

# To speed up the processing, push to a GPU
device_str: str = "cuda" if torch.cuda.is_available() else "cpu"
device = torch.device(device=device_str)

device

device(type='cuda')

In [24]:
model.to(device)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

In [25]:
# To represent each GitHub issue entry as a single vector, we need to pool or
# average the token embeddings. One popular approach is CLS pooling, where the last
# hidden state for the special [CLS] token is collected.


def cls_pooling(model_output) -> torch.Tensor:
    """Represent each input as a single vector using CLS pooling, i.e. return the
    last hidden state of the first ([CLS]) token of every sequence."""
    return model_output.last_hidden_state[:, 0]


# Create a helper function that will tokenize a list of documents, place the tensors
# on the GPU (if available), feed them to the model, and finally apply CLS pooling to the outputs:


def get_embeddings(text_list: Union[str, list[str]]) -> torch.Tensor:
    """Tokenize a list of documents, place the tensors on the GPU (if available),
    feed them to the model, and apply CLS pooling to the outputs."""
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    # Push the encoded input to the GPU (if available)
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        model_output = model(**encoded_input)
    return cls_pooling(model_output)
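CLS pooling keeps only the first token's vector, while mean pooling averages all non-padding token vectors. The following sketch contrasts the two on a tiny hand-written "hidden state" built from nested lists. The values, the `[batch][token][hidden]` layout, and the helper names `cls_pool` and `mean_pool` are illustrative assumptions, not the model's real output:

```python
# Toy "last hidden state": 1 sequence, 3 tokens, hidden size 2 (made-up values).
last_hidden_state = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
attention_mask = [[1, 1, 0]]  # the third token is padding


def cls_pool(hidden: list) -> list:
    """Take the vector of the first token of every sequence."""
    return [seq[0] for seq in hidden]


def mean_pool(hidden: list, mask: list) -> list:
    """Average the token vectors of every sequence, skipping padded positions."""
    pooled = []
    for seq, seq_mask in zip(hidden, mask):
        n_real = sum(seq_mask)  # number of non-padding tokens
        dim = len(seq[0])
        pooled.append(
            [
                sum(tok[d] * m for tok, m in zip(seq, seq_mask)) / n_real
                for d in range(dim)
            ]
        )
    return pooled


cls_vec = cls_pool(last_hidden_state)
mean_vec = mean_pool(last_hidden_state, attention_mask)
print(cls_vec, mean_vec)
```

Which pooling to use depends on how the checkpoint was trained, so it is worth checking the model card before swapping one for the other.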

In [26]:
print(comments_dataset["text"][0])

In [27]:
# Test that the function works by feeding it the first text entry in our corpus
# and inspecting the output shape;
embedding = get_embeddings(comments_dataset["text"][0])
embedding.shape

torch.Size([1, 768])

In [28]:
embedding

tensor([[-1.5532e-01, -1.0023e-01, -7.0321e-02, -7.9817e-02, -1.0425e-01,
         -1.8799e-01,  1.0403e-02,  2.7286e-01, -9.8488e-03, -8.4140e-02,
          2.9261e-01, -7.7875e-02, -1.3246e-01,  2.1589e-01, -5.6229e-02,
          1.7055e-01,  2.0032e-01, -4.0798e-02, -9.2333e-02,  2.7923e-02,
          2.2608e-02, -6.8703e-02,  8.3200e-02,  4.3630e-02, -1.8378e-01,
          5.4012e-03, -1.4812e-02,  1.5650e-01, -4.3977e-01, -4.6957e-01,
          1.4525e-01,  2.3224e-01,  2.5106e-02,  5.7053e-01, -1.0436e-04,
         -3.4550e-04,  1.6949e-01, -4.0601e-02, -1.4010e-01, -2.1546e-01,
         -4.9011e-01, -4.4480e-01, -7.0063e-02, -8.3543e-03,  1.2881e-01,
          5.2868e-02, -1.3250e-01,  2.3846e-02,  3.8485e-01,  1.0555e-01,
          2.3861e-01, -1.0285e-01,  1.8124e-01,  1.3769e-02,  3.9201e-01,
          5.4745e-01, -6.6634e-02,  3.0320e-01,  5.4018e-02,  6.7586e-04,
          6.1106e-02,  1.7411e-01, -8.2372e-02, -2.8667e-01,  2.2702e-01,
         -1.7799e-01,  5.1817e-01, -4.

In [29]:
# Get embeddings for the entire dataset

# The embeddings are converted to NumPy arrays because 🤗 Datasets
# requires this format in order to index them with FAISS.
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x.get("text")).detach().cpu().numpy()[0]}
)

embeddings_dataset

Map:   0%|          | 0/2175 [00:00<?, ? examples/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text', 'embeddings'],
    num_rows: 2175
})
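The `map` call above embeds one example per call. If throughput matters, the texts could instead be grouped into batches before being passed to a function such as `get_embeddings`. A minimal sketch of the batching idea only, where `fake_embed` is a hypothetical stand-in that returns a one-dimensional "vector" per text:

```python
from typing import Iterator


def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def fake_embed(texts: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for an embedding function: one vector per text."""
    return [[float(len(t))] for t in texts]


texts = ["a", "bb", "ccc", "dddd", "eeeee"]
vectors = [vec for batch in batched(texts, 2) for vec in fake_embed(batch)]
print(vectors)
```

The output order matches the input order, which is what makes it safe to attach the vectors back to the rows they came from.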

### FAISS For Efficient Similarity Search

```text
FAISS: Facebook AI Similarity Search.

- Dataset of embeddings needs a search mechanism.
- 🤗 Datasets offers a FAISS index for this purpose.
- FAISS is a library providing efficient algorithms for searching and clustering embedding vectors.
- FAISS creates an index data structure to find similar embeddings.
- Creating a FAISS index in 🤗 Datasets is straightforward with the Dataset.add_faiss_index() function.
- Specify the desired column to index in the dataset.
```
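Conceptually, the simplest kind of index compares the query against every stored vector and returns the closest ones. The sketch below does this by brute force in pure Python using squared L2 distance. It illustrates the idea only, not how FAISS is implemented, and the names `squared_l2`, `nearest`, and `toy_vectors` are hypothetical:

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query: list[float], vectors: list[list[float]], k: int) -> list[tuple]:
    """Return (distance, index) pairs for the k closest vectors, nearest first."""
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    return scored[:k]


toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
hits = nearest([0.9, 1.2], toy_vectors, k=2)
print(hits)
```

A brute-force scan like this costs one distance computation per stored vector, which is fine for a few thousand embeddings; dedicated index structures matter when the collection grows much larger.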

In [30]:
!pip install faiss-gpu



In [31]:
embeddings_dataset.add_faiss_index(column="embeddings")

  0%|          | 0/3 [00:00<?, ?it/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text', 'embeddings'],
    num_rows: 2175
})

In [32]:
# - Queries can be run against the FAISS index with the Dataset.get_nearest_examples() function.
# - A nearest-neighbor lookup finds the examples whose embeddings are closest to the query embedding.
# - The first step is to embed a question.

question: str = "How can I load a dataset offline?"
question_embedding: np.ndarray = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

(1, 768)

In [35]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

scores

array([22.40664 , 22.893995, 24.148972, 24.555553, 25.505049],
      dtype=float32)
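The lookup returns one score per hit plus a column-oriented dict of samples. Whether a smaller or a larger score means a better match depends on the metric the index uses; the scores above come back in ascending order, which would be consistent with a distance (smaller is closer), but that is an assumption worth verifying for your index. A sketch with made-up values showing how the two return values line up:

```python
# Hypothetical miniature of the two return values.
toy_scores = [22.4, 22.9, 24.1]
toy_samples = {"title": ["t0", "t1", "t2"], "html_url": ["u0", "u1", "u2"]}

# Turn the column-oriented dict into one record per hit, with its score.
records = [
    {"score": score, **{col: values[i] for col, values in toy_samples.items()}}
    for i, score in enumerate(toy_scores)
]

# ASSUMPTION: scores are distances, so the smallest score is the best match.
closest_first = sorted(records, key=lambda r: r["score"])
print(closest_first[0])
```

If the index were built with an inner-product metric instead, the ordering would flip and the largest score would be the best match.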

In [36]:
samples_df: pd.DataFrame = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

In [38]:
print(f"Shape of data: {samples_df.shape}\n")
samples_df.head()

Unnamed: 0,html_url,title,comments,body,comment_length,text,embeddings,scores
4,https://github.com/huggingface/datasets/issues/824,Discussion using datasets in offline mode,Requiring online connection is a deal breaker in some cases unfortunately so it'd be great if offline mode is added similar to how `transformers` loads models offline fine.\r\n\r\n@mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?,"`datasets.load_dataset(""csv"", ...)` breaks if you have no connection (There is already this issue https://github.com/huggingface/datasets/issues/761 about it). It seems to be the same for metrics too.\r\n\r\nI create this ticket to discuss a bit and gather what you have in mind or other propositions.\r\n\r\nHere are some points to open discussion:\r\n- if you want to prepare your code/datasets on your machine (having internet connexion) but run it on another offline machine (not having internet connexion), it won't work as is, even if you have all files locally on this machine.\r\n- AFAIK,...",57,"Discussion using datasets in offline mode \n `datasets.load_dataset(""csv"", ...)` breaks if you have no connection (There is already this issue https://github.com/huggingface/datasets/issues/761 about it). 
It seems to be the same for metrics too.\r\n\r\nI create this ticket to discuss a bit and gather what you have in mind or other propositions.\r\n\r\nHere are some points to open discussion:\r\n- if you want to prepare your code/datasets on your machine (having internet connexion) but run it on another offline machine (not having internet connexion), it won't work as is, even if you have a...","[-0.47318071126937866, 0.24578367173671722, -0.012630763463675976, 0.14121408760547638, 0.28335559368133545, -0.1470210701227188, 0.6012278199195862, 0.012776073068380356, 0.2687809467315674, 0.13127808272838593, -0.02403687685728073, -0.023985162377357483, -0.024662325158715248, 0.40210670232772827, -0.09124023467302322, -0.10213788598775864, -0.1722334325313568, 0.038877222687006, -0.1540050506591797, 0.13259492814540863, -0.14081068336963654, -0.12811940908432007, -0.3727093040943146, -0.028236284852027893, -0.16281437873840332, -0.1865873783826828, 0.009142007678747177, 0.1420215219259...",25.505049
3,https://github.com/huggingface/datasets/issues/824,Discussion using datasets in offline mode,"The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package since #1726 :)\r\nYou can now use them offline\r\n```python\r\ndatasets = load_dataset('text', data_files=data_files)\r\n```\r\n\r\nWe'll do a new release soon","`datasets.load_dataset(""csv"", ...)` breaks if you have no connection (There is already this issue https://github.com/huggingface/datasets/issues/761 about it). It seems to be the same for metrics too.\r\n\r\nI create this ticket to discuss a bit and gather what you have in mind or other propositions.\r\n\r\nHere are some points to open discussion:\r\n- if you want to prepare your code/datasets on your machine (having internet connexion) but run it on another offline machine (not having internet connexion), it won't work as is, even if you have all files locally on this machine.\r\n- AFAIK,...",38,"Discussion using datasets in offline mode \n `datasets.load_dataset(""csv"", ...)` breaks if you have no connection (There is already this issue https://github.com/huggingface/datasets/issues/761 about it). 
It seems to be the same for metrics too.\r\n\r\nI create this ticket to discuss a bit and gather what you have in mind or other propositions.\r\n\r\nHere are some points to open discussion:\r\n- if you want to prepare your code/datasets on your machine (having internet connexion) but run it on another offline machine (not having internet connexion), it won't work as is, even if you have a...","[-0.44908538460731506, 0.20950642228126526, -0.05981910973787308, 0.12935049831867218, 0.26219233870506287, -0.13128556311130524, 0.5469644069671631, 0.0949263647198677, 0.3154381513595581, 0.22943100333213806, 0.05104105919599533, -0.0031593164894729853, 0.047941938042640686, 0.40367624163627625, -0.1032216027379036, -0.11917857080698013, -0.10968127101659775, 0.08062367141246796, -0.17092695832252502, 0.09250368177890778, -0.18166573345661163, -0.0897308811545372, -0.37507691979408264, -0.025563210248947144, -0.10498439520597458, -0.17466020584106445, -0.10713899880647659, 0.156695634126...",24.555553
2,https://github.com/huggingface/datasets/issues/824,Discussion using datasets in offline mode,"I opened a PR that allows to reload modules that have already been loaded once even if there's no internet.\r\n\r\nLet me know if you know other ways that can make the offline mode experience better. I'd be happy to add them :) \r\n\r\nI already note the ""freeze"" modules option, to prevent local modules updates. It would be a cool feature.\r\n\r\n----------\r\n\r\n> @mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?\r\n\r\nIndeed `load_dataset` allow...","`datasets.load_dataset(""csv"", ...)` breaks if you have no connection (There is already this issue https://github.com/huggingface/datasets/issues/761 about it). It seems to be the same for metrics too.\r\n\r\nI create this ticket to discuss a bit and gather what you have in mind or other propositions.\r\n\r\nHere are some points to open discussion:\r\n- if you want to prepare your code/datasets on your machine (having internet connexion) but run it on another offline machine (not having internet connexion), it won't work as is, even if you have all files locally on this machine.\r\n- AFAIK,...",179,"Discussion using datasets in offline mode \n `datasets.load_dataset(""csv"", ...)` breaks if you have no connection (There is already this issue https://github.com/huggingface/datasets/issues/761 about it). 
It seems to be the same for metrics too.\r\n\r\nI create this ticket to discuss a bit and gather what you have in mind or other propositions.\r\n\r\nHere are some points to open discussion:\r\n- if you want to prepare your code/datasets on your machine (having internet connexion) but run it on another offline machine (not having internet connexion), it won't work as is, even if you have a...","[-0.47164806723594666, 0.2902272641658783, -0.04767205938696861, 0.13344231247901917, 0.2106887549161911, -0.21222515404224396, 0.5858257412910461, 0.05341650918126106, 0.28334110975265503, 0.18411293625831604, 0.03105962462723255, 0.032638560980558395, 0.0011812117882072926, 0.3333257734775543, -0.07214410603046417, -0.05854179710149765, -0.19054260849952698, -0.005105196963995695, -0.16585849225521088, 0.08279171586036682, -0.08703479170799255, -0.17246849834918976, -0.4247152805328369, -0.11953136324882507, -0.06756021082401276, -0.16881543397903442, -0.024211646988987923, 0.21167261898...",24.148972
1,https://github.com/huggingface/datasets/issues/824,Discussion using datasets in offline mode,"> here is my way to load a dataset offline, but it **requires** an online machine\n> \n> 1. (online machine)\n> \n> ```\n> \n> import datasets\n> \n> data = datasets.load_dataset(...)\n> \n> data.save_to_disk(/YOUR/DATASET/DIR)\n> \n> ```\n> \n> 2. copy the dir from online to the offline machine\n> \n> 3. (offline machine)\n> \n> ```\n> \n> import datasets\n> \n> data = datasets.load_from_disk(/SAVED/DATA/DIR)\n> \n> ```\n> \n> \n> \n> HTH.\n\n","`datasets.load_dataset(""csv"", ...)` breaks if you have no connection (There is already this issue https://github.com/huggingface/datasets/issues/761 about it). It seems to be the same for metrics too.\r\n\r\nI create this ticket to discuss a bit and gather what you have in mind or other propositions.\r\n\r\nHere are some points to open discussion:\r\n- if you want to prepare your code/datasets on your machine (having internet connexion) but run it on another offline machine (not having internet connexion), it won't work as is, even if you have all files locally on this machine.\r\n- AFAIK,...",76,"Discussion using datasets in offline mode \n `datasets.load_dataset(""csv"", ...)` breaks if you have no connection (There is already this issue https://github.com/huggingface/datasets/issues/761 about it). 
It seems to be the same for metrics too.\r\n\r\nI create this ticket to discuss a bit and gather what you have in mind or other propositions.\r\n\r\nHere are some points to open discussion:\r\n- if you want to prepare your code/datasets on your machine (having internet connexion) but run it on another offline machine (not having internet connexion), it won't work as is, even if you have a...","[-0.499260276556015, 0.22699761390686035, -0.0324692539870739, 0.14187218248844147, 0.23695068061351776, -0.10291724652051926, 0.5442944169044495, 0.07441112399101257, 0.2753629684448242, 0.24428822100162506, -0.008833910338580608, -0.06653957813978195, 0.02808590605854988, 0.3756229281425476, -0.0988999605178833, -0.04195521026849747, -0.1623331606388092, 0.056354790925979614, -0.18303634226322174, 0.1076541319489479, -0.16337788105010986, -0.08695139735937119, -0.433300256729126, -0.06250766664743423, -0.015938540920615196, -0.1606818586587906, -0.07989782840013504, 0.21157224476337433, ...",22.893995
0,https://github.com/huggingface/datasets/issues/824,Discussion using datasets in offline mode,"here is my way to load a dataset offline, but it **requires** an online machine\r\n1. (online machine)\r\n```\r\nimport datasets\r\ndata = datasets.load_dataset(...)\r\ndata.save_to_disk(/YOUR/DATASET/DIR)\r\n```\r\n2. copy the dir from online to the offline machine\r\n3. (offline machine)\r\n```\r\nimport datasets\r\ndata = datasets.load_from_disk(/SAVED/DATA/DIR)\r\n```\r\n\r\nHTH.","`datasets.load_dataset(""csv"", ...)` breaks if you have no connection (There is already this issue https://github.com/huggingface/datasets/issues/761 about it). It seems to be the same for metrics too.\r\n\r\nI create this ticket to discuss a bit and gather what you have in mind or other propositions.\r\n\r\nHere are some points to open discussion:\r\n- if you want to prepare your code/datasets on your machine (having internet connexion) but run it on another offline machine (not having internet connexion), it won't work as is, even if you have all files locally on this machine.\r\n- AFAIK,...",47,"Discussion using datasets in offline mode \n `datasets.load_dataset(""csv"", ...)` breaks if you have no connection (There is already this issue https://github.com/huggingface/datasets/issues/761 about it). 
It seems to be the same for metrics too.\r\n\r\nI create this ticket to discuss a bit and gather what you have in mind or other propositions.\r\n\r\nHere are some points to open discussion:\r\n- if you want to prepare your code/datasets on your machine (having internet connexion) but run it on another offline machine (not having internet connexion), it won't work as is, even if you have a...","[-0.49025776982307434, 0.2288963347673416, -0.03322095796465874, 0.13887320458889008, 0.23637260496616364, -0.08911527693271637, 0.5481941103935242, 0.06737261265516281, 0.2955958843231201, 0.24505114555358887, -0.01017451286315918, -0.06949253380298615, 0.027625715360045433, 0.3827284574508667, -0.10571841895580292, -0.021846573799848557, -0.15303955972194672, 0.0536612905561924, -0.17722827196121216, 0.08962270617485046, -0.16522905230522156, -0.09213365614414215, -0.43372341990470886, -0.07040619850158691, -0.018133236095309258, -0.1504259705543518, -0.08321218192577362, 0.2194387912750...",22.406639


In [40]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

In [43]:
def get_query_matches(*, query: str, K: int) -> pd.DataFrame:
    """This is used to to display the top K matches to a query."""
    query_embedding: np.ndarray = get_embeddings([query]).cpu().detach().numpy()

    scores, results = embeddings_dataset.get_nearest_examples(
        "embeddings", query_embedding, k=K
    )
    results_df: pd.DataFrame = pd.DataFrame.from_dict(results)
    results_df["scores"] = scores
    results_df.sort_values("scores", ascending=False, inplace=True)

    # Iterate over the matches for this query, not the global `samples_df`
    for _, row in results_df.iterrows():
        print(f"COMMENT: {row.comments}")
        print(f"SCORE: {row.scores}")
        print(f"TITLE: {row.title}")
        print(f"URL: {row.html_url}")
        print("=" * 50)
        print()

    return results_df
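
To make the retrieval step less opaque, here is a minimal sketch of a top-K lookup done by brute force with NumPy. It is only a toy stand-in for the indexed search that `get_nearest_examples` performs. The function `top_k_matches`, the 2-dimensional vectors, and the titles are all hypothetical. Ranking by inner product is an assumption made for illustration, and the index used in this notebook may rely on a different metric.

```python
import numpy as np
import pandas as pd


def top_k_matches(
    query_vec: np.ndarray, corpus_vecs: np.ndarray, titles: list, k: int
) -> pd.DataFrame:
    """Brute-force top-k search by inner product (hypothetical helper)."""
    # One similarity score per corpus row
    scores = corpus_vecs @ query_vec
    # Indices of the k largest scores, best match first
    top_idx = np.argsort(-scores)[:k]
    return pd.DataFrame(
        {"title": [titles[i] for i in top_idx], "scores": scores[top_idx]}
    )


# Toy data (made up for illustration, not taken from the issues dataset)
toy_titles = ["offline mode", "streaming large datasets", "viewer is broken"]
toy_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
toy_query = np.array([1.0, 0.1])

toy_results = top_k_matches(toy_query, toy_vecs, toy_titles, k=2)
```

A brute-force scan like this compares the query against every row, which is fine for a handful of vectors but is the cost an index is meant to avoid on a larger corpus.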

In [53]:
df.iloc[255:260, [1, 2]]

Unnamed: 0,title,comments
255,visualization for cc100 is broken,"[This looks like an issue with the cc100 dataset itself but not sure\r\nDid you try loading cc100 on your machine ?, Hi\nloading works fine, but the viewer only is broken\nthanks\n\nOn Wed, Apr 7, 2021 at 12:17 PM Quentin Lhoest ***@***.***>\nwrote:\n\n> This looks like an issue with the cc100 dataset itself but not sure\n> Did you try loading cc100 on your machine ?\n>\n> —\n> You are receiving this because you authored the thread.\n> Reply to this email directly, view it on GitHub\n> <https://github.com/huggingface/datasets/issues/2162#issuecomment-814793809>,\n> or unsubscribe\n> <https..."
256,any possibility to download part of large datasets only?,"[Not yet but it’s on the short/mid-term roadmap (requested by many indeed)., oh, great, really awesome feature to have, thank you very much for the great, fabulous work, We'll work on dataset streaming soon. This should allow you to only load the examples you need ;), thanks a lot Quentin, this would be really really a great feature to have\n\nOn Wed, Apr 7, 2021 at 12:14 PM Quentin Lhoest ***@***.***>\nwrote:\n\n> We'll work on dataset streaming soon. This should allow you to only load\n> the examples you need ;)\n>\n> —\n> You are receiving this because you authored the thread.\n> Reply ..."
257,data_args.preprocessing_num_workers almost freezes,"[Hi.\r\nI cannot always reproduce this issue, and on later runs I did not see it so far. Sometimes also I set 8 processes but I see less being showed, is this normal, here only 5 are shown for 8 being set, thanks\r\n\r\n```\r\n#3: 11%|███████████████▊ | 172/1583 [00:46<06:21, 3.70ba/s]\r\n#4: 9%|█████████████▏ | 143/1583 [00:46<07..."
258,adding ccnet dataset,"[closing since I think this is cc100, just the name has been changed. thanks ]"
259,"viewer ""fake_news_english"" error",[Thanks for reporting !\r\nThe viewer doesn't have all the dependencies of the datasets. We may add openpyxl to be able to show this dataset properly]


In [None]:
query: str = ""

In [48]:
df.head()

Unnamed: 0,html_url,title,comments,body
0,https://github.com/huggingface/datasets/issues/2945,Protect master branch,"[Cool, I think we can do both :), @lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).]","After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:\r\n- 00cc036fea7c7745cfe722360036ed306796a3f2\r\n- 13ae8c98602bbad8197de3b9b425f4c78f582af1\r\n- ...\r\n\r\nI propose to protect our master branch, so that we avoid we can accidentally make this kind of mistakes in the future:\r\n- [x] For Pull Requests using GitHub, allow only squash merging, so that only a single commit per Pull Request is merged into the master branch\r\n - ..."
1,https://github.com/huggingface/datasets/issues/2943,Backwards compatibility broken for cached datasets that use `.filter()`,"[Hi ! I guess the caching mechanism should have considered the new `filter` to be different from the old one, and don't use cached results from the old `filter`.\r\nTo avoid other users from having this issue we could make the caching differentiate the two, what do you think ?, If it's easy enough to implement, then yes please 😄 But this issue can be low-priority, since I've only encountered it in a couple of `transformers` CI tests., Well it can cause issue with anyone that updates `datasets` and re-run some code that uses filter, so I'm creating a PR, I just merged a fix, let me know if...","## Describe the bug\r\nAfter upgrading to datasets `1.12.0`, some cached `.filter()` steps from `1.11.0` started failing with \r\n`ValueError: Keys mismatch: between {'indices': Value(dtype='uint64', id=None)} and {'file': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'speaker_id': Value(dtype='int64', id=None), 'chapter_id': Value(dtype='int64', id=None), 'id': Value(dtype='string', id=None)}`\r\n\r\nRelated feature: https://github.com/huggingface/datasets/pull/2836\r\n\r\n:question: This is probably a `wontfix` bug, since it can be solved by simply cleaning the..."
2,https://github.com/huggingface/datasets/issues/2941,OSCAR unshuffled_original_ko: NonMatchingSplitsSizesError,[I tried `unshuffled_original_da` and it is also not working],"## Describe the bug\r\n\r\nCannot download OSCAR `unshuffled_original_ko` due to `NonMatchingSplitsSizesError`.\r\n\r\n## Steps to reproduce the bug\r\n\r\n```python\r\n>>> dataset = datasets.load_dataset('oscar', 'unshuffled_original_ko')\r\nNonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=25292102197, num_examples=7345075, dataset_name='oscar'), 'recorded': SplitInfo(name='train', num_bytes=25284578514, num_examples=7344907, dataset_name='oscar')}]\r\n```\r\n\r\n## Expected results\r\n\r\nLoading is successful.\r\n\r\n## Actual results\r\n\r\nLoading throws ab..."
3,https://github.com/huggingface/datasets/issues/2937,load_dataset using default cache on Windows causes PermissionError: [WinError 5] Access is denied,"[Hi @daqieq, thanks for reporting.\r\n\r\nUnfortunately, I was not able to reproduce this bug:\r\n```ipython\r\nIn [1]: from datasets import load_dataset\r\n ...: ds = load_dataset('wiki_bio')\r\nDownloading: 7.58kB [00:00, 26.3kB/s]\r\nDownloading: 2.71kB [00:00, ?B/s]\r\nUsing custom data configuration default\r\nDownloading and preparing dataset wiki_bio/default (download: 318.53 MiB, generated: 736.94 MiB, post-processed: Unknown size, total: 1.03 GiB) to C:\Users\username\.cache\huggingface\datasets\wiki_bio\default\\r\n1.1.0\5293ce565954ba965dada626f1e79684e98172d950371d266bf3caaf8...","## Describe the bug\r\nStandard process to download and load the wiki_bio dataset causes PermissionError in Windows 10 and 11.\r\n\r\n## Steps to reproduce the bug\r\n```python\r\nfrom datasets import load_dataset\r\nds = load_dataset('wiki_bio')\r\n```\r\n\r\n## Expected results\r\nIt is expected that the dataset downloads without any errors.\r\n\r\n## Actual results\r\nPermissionError see trace below:\r\n```\r\nUsing custom data configuration default\r\nDownloading and preparing dataset wiki_bio/default (download: 318.53 MiB, generated: 736.94 MiB, post-processed: Unknown size, total: 1...."
4,https://github.com/huggingface/datasets/issues/2934,"to_tf_dataset keeps a reference to the open data somewhere, causing issues on windows","[I did some investigation and, as it seems, the bug stems from [this line](https://github.com/huggingface/datasets/blob/8004d7c3e1d74b29c3e5b0d1660331cd26758363/src/datasets/arrow_dataset.py#L325). The lifecycle of the dataset from the linked line is bound to one of the returned `tf.data.Dataset`. So my (hacky) solution involves wrapping the linked dataset with `weakref.proxy` and adding a custom `__del__` to `tf.python.data.ops.dataset_ops.TensorSliceDataset` (this is the type of a dataset that is returned by `tf.data.Dataset.from_tensor_slices`; this works for TF 2.x, but I'm not sure `t...","To reproduce:\r\n```python\r\nimport datasets as ds\r\nimport weakref\r\nimport gc\r\n\r\nd = ds.load_dataset(""mnist"", split=""train"")\r\nref = weakref.ref(d._data.table)\r\ntfd = d.to_tf_dataset(""image"", batch_size=1, shuffle=False, label_cols=""label"")\r\ndel tfd, d\r\ngc.collect()\r\nassert ref() is None, ""Error: there is at least one reference left""\r\n```\r\n\r\nThis causes issues because the table holds a reference to an open arrow file that should be closed. So on windows it's not possible to delete or move the arrow file afterwards.\r\n\r\nMoreover the CI test of the `to_tf_dataset` ..."
