<a href="https://colab.research.google.com/github/educatorsRlearners/hugging_face_course/blob/main/05_the_%F0%9F%A4%97_Datasets_library_semantic_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install datasets transformers[sentencepiece]
!pip install faiss-gpu


# Semantic search with FAISS

> _What is semantic search?_

While the phrase sounds complicated, the concept is simple: 
1. Get embeddings for a token, string, or document
2. Use the embeddings to find the result most similar to our query

That's pretty much it! Depending on the use case, we can use cosine similarity or another similarity metric, but the key is creating embeddings and then using them to return the most similar result. 
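
To make the two steps concrete, here is a minimal pure-Python sketch. The three-dimensional "embeddings" and the document titles are made up for illustration; a real model produces vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (step 1 would normally come from a model).
corpus = {
    "how to load a dataset": [0.9, 0.1, 0.0],
    "fixing a tokenizer bug": [0.1, 0.8, 0.3],
    "caching downloaded files": [0.6, 0.1, 0.7],
}
query_embedding = [0.8, 0.2, 0.1]

# Step 2: rank every document by similarity to the query, best match first.
ranked = sorted(
    corpus,
    key=lambda doc: cosine_similarity(query_embedding, corpus[doc]),
    reverse=True,
)
best_match = ranked[0]
print(best_match)
```

Sorting the whole corpus like this is fine for a handful of documents; an index such as FAISS exists to avoid this brute-force scan on larger collections.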

Let's get to it. 

## Loading and preparing the dataset 

Step one is to download our dataset. For this example, we'll use one hosted on the 🤗 Hub: 

In [2]:
from huggingface_hub import hf_hub_url 

data_files = hf_hub_url(
    repo_id="lewtun/github-issues", 
    filename='datasets-issues-with-comments.jsonl',
    repo_type='dataset'
)

data_files

'https://huggingface.co/datasets/lewtun/github-issues/resolve/main/datasets-issues-with-comments.jsonl'

Now that we have the URL stored as a variable, we can pass it to ```load_dataset``` to download our data: 

In [3]:
from datasets import load_dataset

issues_dataset = load_dataset("json",
                              data_files=data_files,
                              split="train")

issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3019
})

Boom! This dataset only has a ```train``` split, so we can pass ```split="train"``` to get a ```Dataset``` back directly instead of a ```DatasetDict```.

Now we'll clean the dataset up a bit by filtering out the pull requests, which are rarely useful for answering user queries, along with any issues that have no comments: 

In [4]:
issues_dataset = issues_dataset.filter(
    lambda x: (x["is_pull_request"] == False and len(x["comments"])>0)
)

issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 808
})
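
The predicate passed to ```filter``` can be tried out on plain dictionaries first. This is a minimal sketch with hypothetical records that only carry the fields the predicate touches:

```python
# Hypothetical records; only "is_pull_request" and "comments" matter here.
records = [
    {"title": "Bug in map", "is_pull_request": False, "comments": ["Can you share a traceback?"]},
    {"title": "Fix typo", "is_pull_request": True, "comments": ["LGTM"]},
    {"title": "Feature request", "is_pull_request": False, "comments": []},
]

def keep(record):
    """Keep issues (not pull requests) that have at least one comment."""
    return (not record["is_pull_request"]) and len(record["comments"]) > 0

kept_titles = [r["title"] for r in records if keep(r)]
print(kept_titles)
```

Only the first record survives: the second is a pull request and the third has no comments.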

Sweet! Now we can drop the columns we don't need for search. 

In [5]:
columns = issues_dataset.column_names

columns_to_keep = ["title", "body", "html_url", "comments"]

# Everything in the dataset that is not on the keep list
columns_to_remove = list(set(columns) - set(columns_to_keep))

issues_dataset = issues_dataset.remove_columns(columns_to_remove)

issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})
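
A quick aside on computing the columns to drop with set operations. The sketch below uses a shortened, hypothetical column list and shows that a plain set difference and a symmetric difference agree only while every name on the keep list really exists:

```python
columns = ["url", "title", "body", "html_url", "comments", "state"]
columns_to_keep = ["title", "body", "html_url", "comments"]

# Plain difference: everything in the dataset that is not on the keep list.
to_remove = set(columns) - set(columns_to_keep)

# The symmetric difference gives the same answer while every kept name exists...
same_here = set(columns_to_keep).symmetric_difference(columns)

# ...but a misspelled (hypothetical) keep name leaks into the removal set.
with_typo = {"title", "boddy"}.symmetric_difference(columns)
safe_with_typo = set(columns) - {"title", "boddy"}

print(sorted(to_remove))
print("boddy" in with_typo, "boddy" in safe_with_typo)
```

With the symmetric difference, the misspelled name ends up among the columns to remove, even though no such column exists.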

Now we can turn this into a ```dataframe``` for some easy manipulation.

In [6]:
issues_dataset.set_format('pandas')

df = issues_dataset[:]

In [7]:
df.head()

| | html_url | title | comments | body |
|---|---|---|---|---|
| 0 | https://github.com/huggingface/datasets/issues... | Protect master branch | [Cool, I think we can do both :), @lhoestq now... | After accidental merge commit (91c55355b634d0d... |
| 1 | https://github.com/huggingface/datasets/issues... | Backwards compatibility broken for cached data... | [Hi ! I guess the caching mechanism should hav... | ## Describe the bug\r\nAfter upgrading to data... |
| 2 | https://github.com/huggingface/datasets/issues... | OSCAR unshuffled_original_ko: NonMatchingSplit... | [I tried `unshuffled_original_da` and it is al... | ## Describe the bug\r\n\r\nCannot download OSC... |
| 3 | https://github.com/huggingface/datasets/issues... | load_dataset using default cache on Windows ca... | [Hi @daqieq, thanks for reporting.\r\n\r\nUnfo... | ## Describe the bug\r\nStandard process to dow... |
| 4 | https://github.com/huggingface/datasets/issues... | to_tf_dataset keeps a reference to the open da... | [I did some investigation and, as it seems, th... | To reproduce:\r\n```python\r\nimport datasets ... |


We need this dataframe to be "tidy", with one comment per row, which it currently is not. For instance, the first row holds a whole list of comments: 

In [8]:
df['comments'][0].tolist()

['Cool, I think we can do both :)',
 '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).']

Each of those comments needs to be its own record, so we can ```explode``` the dataframe to take care of this: 

In [9]:
comments_df = df.explode('comments', ignore_index=True)

comments_df.head(2)

| | html_url | title | comments | body |
|---|---|---|---|---|
| 0 | https://github.com/huggingface/datasets/issues... | Protect master branch | Cool, I think we can do both :) | After accidental merge commit (91c55355b634d0d... |
| 1 | https://github.com/huggingface/datasets/issues... | Protect master branch | @lhoestq now the 2 are implemented.\r\n\r\nPle... | After accidental merge commit (91c55355b634d0d... |
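
Conceptually, exploding a list column produces one row per list element and copies the other fields along. Here is a rough pure-Python equivalent on hypothetical rows (an illustration of the idea, not how pandas implements it):

```python
rows = [
    {
        "title": "Protect master branch",
        "comments": ["Cool, I think we can do both :)", "now the 2 are implemented."],
    },
    {"title": "Hypothetical issue", "comments": ["Single reply"]},
]

# One output row per (issue, comment) pair; the other fields are copied over.
exploded = [
    {**row, "comments": comment}
    for row in rows
    for comment in row["comments"]
]

for row in exploded:
    print(row["title"], "|", row["comments"])
```

Two input rows with three comments in total become three output rows, which is why the row count grows after this step.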


Perfect! Now we can switch back to a ```Dataset``` for further processing. 

In [10]:
from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2964
})

Just for fun, let's create a new feature called ```comment_length``` which, as you might have guessed, contains the number of words per comment: 

In [11]:
comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)

Now did we _really_ create that feature just for fun? 

Of course not; we can use it to filter out extremely short comments like "bump" or "thanks" and the like. 

> _So how many words constitutes a meaningful comment?_ 

Solid question. Unfortunately, I don't have a good answer so let's just start with 15 😃

In [12]:
comments_dataset = comments_dataset.filter(lambda x: x['comment_length']>15)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2175
})
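
To see how the word count and the threshold interact, here is a small sketch on made-up comments. It uses the same whitespace split and the same strict ```> 15``` comparison as above:

```python
# Hypothetical comments for illustration.
comments = [
    "bump",
    "Thanks!",
    "I can reproduce this on version 1.18.3 when the cache directory is on a "
    "network drive and the lock file is stale.",
]
MIN_WORDS = 15  # same arbitrary threshold as above

# Whitespace split, so punctuation stays attached to its word.
lengths = [len(c.split()) for c in comments]
kept = [c for c, n in zip(comments, lengths) if n > MIN_WORDS]

print(lengths)
print(len(kept))
```

Because the comparison is strict, a comment with exactly 15 words would also be dropped.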

Ok! We've removed the short comments, so let's now concatenate the title, body, and comments into one long string for embedding purposes: 

In [13]:
def concatenate_text(examples):
  # Join title, body, and comment into a single string for embedding
  return {
      "text": examples["title"]
      + " \n "
      + examples["body"]
      + " \n "
      + examples["comments"]
  }

comments_dataset = comments_dataset.map(concatenate_text)

comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text'],
    num_rows: 2175
})

And now, just to make sure it worked, we can print each column as well as the concatenated text feature. 

In [14]:
print("Individual")
print(comments_dataset['title'][0])
print("------------")
print(comments_dataset['body'][0])
print("------------")
print(comments_dataset['comments'][0])
print("------------")
print("Concatenated")
print(comments_dataset['text'][0])

Individual
Protect master branch
------------
After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:
- 00cc036fea7c7745cfe722360036ed306796a3f2
- 13ae8c98602bbad8197de3b9b425f4c78f582af1
- ...

I propose to protect our master branch, so that we avoid we can accidentally make this kind of mistakes in the future:
- [x] For Pull Requests using GitHub, allow only squash merging, so that only a single commit per Pull Request is merged into the master branch
  - Currently, simple merge commits are already disabled
  - I propose to disable rebase merging as well
- ~~Protect the master branch from direct pushes (to avoid accidentally pushing of merge commits)~~
  - ~~This protection would reject direct pushes to master branch~~
  - ~~If so, for each release (when we need to commit directly to the master branch), we should pre

## [Creating text embeddings](https://huggingface.co/course/chapter5/6?fw=pt#creating-text-embeddings)

We can easily load a pre-trained embedding model by passing a model checkpoint to ```AutoModel.from_pretrained()```. 

> _But which checkpoint should we choose?_

Good question! 

We need to select one that most closely matches our use case. Since we'd like to find the answer to a short query in a longer string (aka _asymmetric semantic search_), we can use the [model overview table](https://www.sbert.net/docs/pretrained_models.html#model-overview) to identify the best checkpoint, which for our task is ```multi-qa-mpnet-base-dot-v1```. 
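
The ```dot``` in the checkpoint name suggests it is meant to be scored with a dot product rather than cosine similarity. The two metrics can rank documents differently because the dot product also rewards vector length. Here is a minimal sketch with hypothetical two-dimensional vectors:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical vectors chosen so the two metrics disagree.
query = [1.0, 0.0]
docs = {
    "short_aligned": [1.0, 0.0],  # same direction as the query, small norm
    "long_offaxis": [3.0, 2.0],   # slightly different direction, large norm
}

best_by_dot = max(docs, key=lambda name: dot(query, docs[name]))
best_by_cosine = max(docs, key=lambda name: cosine(query, docs[name]))

print(best_by_dot, best_by_cosine)
```

Whichever metric a checkpoint was tuned for is the one to use at search time.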

As always, we'll be sure to use the same checkpoint for our tokenizer. 

In [15]:
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

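Having a GPU makes all the difference in the world when working with embeddings, so let's move the model to "cuda" (falling back to the CPU if no GPU is available).

In [16]:
import torch 

# Use the GPU when one is available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)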

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0): MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_features

Again, we want to represent each entry of our corpus as a single vector, so we need to pool our token embeddings in some way. 

A simple way to do that is to perform what is known as _CLS pooling_ on our model's outputs, where we collect the last hidden state for the CLS token (the first token of each sequence), like this: 

In [17]:
def cls_pooling(model_output):
  return model_output.last_hidden_state[:, 0]
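
To see what the ```[:, 0]``` indexing selects, here is a toy sketch that uses nested lists in place of a tensor. The numbers are made up, and mean pooling is included only for contrast:

```python
# Toy "last hidden state": 2 sequences x 3 tokens x 4 hidden units.
last_hidden_state = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

# CLS pooling: keep only the first token's vector from every sequence,
# which is what `last_hidden_state[:, 0]` does on a tensor.
cls_vectors = [sequence[0] for sequence in last_hidden_state]

# Mean pooling (for contrast): average over the token axis instead.
# A real implementation would also mask out padding tokens; omitted here.
hidden_size = len(last_hidden_state[0][0])
mean_vectors = [
    [sum(token[i] for token in sequence) / len(sequence) for i in range(hidden_size)]
    for sequence in last_hidden_state
]

print(cls_vectors)
```

Either way, each sequence collapses to a single vector of the hidden size, no matter how many tokens it had.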

Now we'll create a helper function which will do four things: 
1. tokenize a list of documents
2. place the tensors on the same device as the model
3. run the model on the tokenized inputs
4. apply CLS pooling to the outputs

In [18]:
def get_embeddings(text_list):
  encoded_input = tokenizer(
      text_list, 
      padding=True,
      truncation=True,
      return_tensors='pt'
      )
  encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
  model_output = model(**encoded_input)
  return cls_pooling(model_output)
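
The ```padding=True``` and ```truncation=True``` arguments are what let documents of different lengths share one rectangular tensor. The sketch below is a simplified stand-in, not the tokenizer's actual implementation: ```pad_and_truncate```, the token ids, and ```max_length=5``` are all hypothetical, and a real tokenizer also keeps the closing special token when it truncates.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy illustration of padding and truncation on lists of token ids."""
    # Truncation: cut every sequence down to at most max_length tokens.
    truncated = [ids[:max_length] for ids in batch]
    # Padding: extend shorter sequences to the longest one in the batch.
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids; a real tokenizer produces these from text.
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]]
encoded = pad_and_truncate(batch, max_length=5)

print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask tells the model which positions are padding so that it can ignore them.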

As always, trust but verify 😏

Let's convert the first entry (see below) to a vector. 

In [21]:
comments_dataset["text"][0]

'Protect master branch \n After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:\r\n- 00cc036fea7c7745cfe722360036ed306796a3f2\r\n- 13ae8c98602bbad8197de3b9b425f4c78f582af1\r\n- ...\r\n\r\nI propose to protect our master branch, so that we avoid we can accidentally make this kind of mistakes in the future:\r\n- [x] For Pull Requests using GitHub, allow only squash merging, so that only a single commit per Pull Request is merged into the master branch\r\n  - Currently, simple merge commits are already disabled\r\n  - I propose to disable rebase merging as well\r\n- ~~Protect the master branch from direct pushes (to avoid accidentally pushing of merge commits)~~\r\n  - ~~This protection would reject direct pushes to master branch~~\r\n  - ~~If so, for each release (when we need to commit directly to the master branch), we should p

In [22]:
get_embeddings(comments_dataset["text"][0]).shape

torch.Size([1, 768])

Excellent! We've just converted that entire string into a 768-dimensional vector! Now we can embed the rest of the corpus using a ```lambda``` and ```Dataset.map()```. Note that we convert each embedding to a NumPy array before storing it.

In [25]:
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

## [Using FAISS for efficient similarity search](https://huggingface.co/course/chapter5/6?fw=pt#using-faiss-for-efficient-similarity-search)