
In [None]:
!pip install datasets transformers[sentencepiece]
!pip install faiss-gpu

# Semantic search with FAISS

> _What is semantic search?_

While the phrase sounds complicated, the concept is incredibly simple: 
1. Get embeddings for a token, string, or document
2. Use those embeddings to find the most similar result to our query

That's pretty much it! We can use cosine similarity or another similarity metric, depending on our use case, but the key is creating embeddings and then using them to return the most similar result.
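As a minimal sketch of those two steps, assume we already have embeddings. The three-dimensional vectors below are invented for illustration rather than produced by a real model, and the ```cosine_similarity``` helper and ```docs``` dictionary are hypothetical names, not part of any library:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings": made-up numbers standing in for real model output.
docs = {
    "how to load a dataset": [0.9, 0.1, 0.0],
    "fixing a cache bug": [0.1, 0.8, 0.3],
    "protecting the master branch": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of a question about loading data

# Rank the documents from most to least similar to the query.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
best_match = ranked[0]
print(best_match)
```

With real data the vectors would have hundreds of dimensions and come from a model, but the ranking step stays the same.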

Let's get to it. 

## Loading and preparing the dataset 

Step one is to download our dataset. For this example, we'll use the ```lewtun/github-issues``` dataset from the 🤗 Hub:

In [4]:
from huggingface_hub import hf_hub_url 

data_files = hf_hub_url(
    repo_id="lewtun/github-issues", 
    filename='datasets-issues-with-comments.jsonl',
    repo_type='dataset'
)

data_files

'https://huggingface.co/datasets/lewtun/github-issues/resolve/main/datasets-issues-with-comments.jsonl'

Now that we have the URL stored as a variable, we can pass it to ```load_dataset``` to download our data: 

In [9]:
from datasets import load_dataset

issues_dataset = load_dataset("json",
                              data_files=data_files,
                              split="train")

issues_dataset

Using custom data configuration default-ece7c1527bad24e5
Reusing dataset json (/root/.cache/huggingface/datasets/json/default-ece7c1527bad24e5/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3019
})

Boom! This dataset only has a ```train``` split, so we can pass ```split="train"``` to get a ```Dataset``` back directly instead of a ```DatasetDict```.

Now we'll clean this dataset up a bit by filtering out the pull requests, because they are not typically used for answering user queries, along with any issues that have no comments, like this:

In [10]:
issues_dataset = issues_dataset.filter(
    lambda x: (x["is_pull_request"] == False and len(x["comments"])>0)
)

issues_dataset

  0%|          | 0/4 [00:00<?, ?ba/s]

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 808
})
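To see what that condition keeps and drops, here is a small sketch that applies the same test to three hand-made rows. The rows and the ```keep``` helper are invented for illustration; only the field names come from the dataset:

```python
# Hand-made rows that mimic the three cases in the dataset (values are invented).
rows = [
    {"title": "issue with replies", "is_pull_request": False, "comments": ["a", "b"]},
    {"title": "issue without replies", "is_pull_request": False, "comments": []},
    {"title": "a pull request", "is_pull_request": True, "comments": ["lgtm"]},
]


def keep(row):
    """Same condition as the lambda above: an issue (not a PR) with at least one comment."""
    return (not row["is_pull_request"]) and len(row["comments"]) > 0


kept_titles = [row["title"] for row in rows if keep(row)]
print(kept_titles)
```

Only the first row survives: the second has no comments to search over and the third is a pull request.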

Sweet! Now we can drop the features which do not add anything to our model. 

In [11]:
columns = issues_dataset.column_names

columns_to_keep = ["title", "body", "html_url", "comments"]

columns_to_remove = set(columns_to_keep).symmetric_difference(columns)

issues_dataset = issues_dataset.remove_columns(columns_to_remove)

issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})
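The ```symmetric_difference``` call above gives the right answer because every name in ```columns_to_keep``` is a real column. A quick check on a shortened, made-up column list (```typo_keep``` and the misspelled ```coments``` are purely illustrative) suggests that a plain set difference is the more forgiving choice:

```python
columns = ["url", "title", "body", "html_url", "comments", "state"]  # shortened list
columns_to_keep = ["title", "body", "html_url", "comments"]

# When every kept column exists, both operations give the same answer.
sym = set(columns_to_keep).symmetric_difference(columns)
diff = set(columns) - set(columns_to_keep)
print(sorted(sym), sorted(diff))

# With a misspelled keep-column, the symmetric difference also returns the
# misspelled name, which is not a real column in the dataset.
typo_keep = ["title", "body", "html_url", "coments"]
sym_typo = set(typo_keep).symmetric_difference(columns)
diff_typo = set(columns) - set(typo_keep)
print(sorted(sym_typo))
print(sorted(diff_typo))
```

The set difference only ever returns names that exist in ```columns```, so a typo shows up as an unwanted column being dropped rather than as a request to remove a column that isn't there.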

Now we can turn this into a pandas ```DataFrame``` for some easy manipulation.

In [13]:
issues_dataset.set_format('pandas')

df = issues_dataset[:]

In [14]:
df.head()

| | html_url | title | comments | body |
|---|---|---|---|---|
| 0 | https://github.com/huggingface/datasets/issues... | Protect master branch | [Cool, I think we can do both :), @lhoestq now... | After accidental merge commit (91c55355b634d0d... |
| 1 | https://github.com/huggingface/datasets/issues... | Backwards compatibility broken for cached data... | [Hi ! I guess the caching mechanism should hav... | ## Describe the bug\r\nAfter upgrading to data... |
| 2 | https://github.com/huggingface/datasets/issues... | OSCAR unshuffled_original_ko: NonMatchingSplit... | [I tried `unshuffled_original_da` and it is al... | ## Describe the bug\r\n\r\nCannot download OSC... |
| 3 | https://github.com/huggingface/datasets/issues... | load_dataset using default cache on Windows ca... | [Hi @daqieq, thanks for reporting.\r\n\r\nUnfo... | ## Describe the bug\r\nStandard process to dow... |
| 4 | https://github.com/huggingface/datasets/issues... | to_tf_dataset keeps a reference to the open da... | [I did some investigation and, as it seems, th... | To reproduce:\r\n```python\r\nimport datasets ... |


We need this DataFrame to be "tidy", which it currently is not. For instance:

In [15]:
df['comments'][0].tolist()

['Cool, I think we can do both :)',
 '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).']

Each of those comments needs to be its own record, so we can ```explode``` the DataFrame to take care of this issue:

In [16]:
comments_df = df.explode('comments', ignore_index=True)

comments_df.head(2)

| | html_url | title | comments | body |
|---|---|---|---|---|
| 0 | https://github.com/huggingface/datasets/issues... | Protect master branch | Cool, I think we can do both :) | After accidental merge commit (91c55355b634d0d... |
| 1 | https://github.com/huggingface/datasets/issues... | Protect master branch | @lhoestq now the 2 are implemented.\r\n\r\nPle... | After accidental merge commit (91c55355b634d0d... |
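Conceptually, exploding a list column means emitting one row per list element and repeating the other fields. The sketch below reproduces that idea in plain Python on invented records. It is not how pandas implements ```explode```, and unlike pandas (which keeps a row with a missing value for an empty list) it simply drops rows whose list is empty, a case that does not arise here because issues without comments were filtered out earlier:

```python
# Toy records shaped like the DataFrame rows: one list of comments per issue.
records = [
    {"title": "Protect master branch", "comments": ["Cool!", "Now implemented."]},
    {"title": "Cache bug", "comments": ["Thanks for reporting."]},
]

# One output row per comment; the other fields are repeated for each one.
exploded = [
    {**record, "comments": comment}
    for record in records
    for comment in record["comments"]
]

for row in exploded:
    print(row)
```

Two issues with three comments in total become three rows, each holding a single comment string.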


Perfect! Now we can switch it back to a ```Dataset``` for further processing.

In [18]:
from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2964
})

Just for fun, let's create a new feature called ```comment_length``` which, as you might have guessed, contains the number of words per comment:

In [19]:
comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)

0ex [00:00, ?ex/s]

Now did we _really_ create that feature just for fun? 

Of course not; we can use it to filter out extremely short comments such as "bump" or "thanks".

> _So how many words constitutes a meaningful comment?_ 

Solid question. Unfortunately, I don't have a good answer so let's just start with 15 😃

In [22]:
comments_dataset = comments_dataset.filter(lambda x: x['comment_length']>15)
comments_dataset

  0%|          | 0/3 [00:00<?, ?ba/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2175
})
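To get a feel for what a whitespace word count and a threshold of 15 actually remove, here is a small sketch on three invented comments (```MIN_WORDS``` is a hypothetical name for the same cut-off):

```python
comments = [
    "bump",
    "Thanks, this fixed it for me!",
    "I can reproduce this on Windows with version 1.8: the cache directory is "
    "created but the lock file is never released, so the second run hangs.",
]

MIN_WORDS = 15  # same threshold as above; an arbitrary starting point

# str.split() with no arguments splits on any run of whitespace.
lengths = [len(c.split()) for c in comments]

# The comparison is strict, so a comment of exactly 15 words is dropped too.
kept = [c for c, n in zip(comments, lengths) if n > MIN_WORDS]
print(lengths)
print(len(kept))
```

Punctuation stays attached to its word under this counting scheme, which is fine for a rough length filter.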

Ok! We've removed the short comments, so let's now concatenate the title, body, and comments to create one long string for embedding purposes:

In [25]:
def concatenate_text(example):
    # Join the issue title, body, and comment into one string to embed.
    return {
        "text": example["title"]
        + " \n "
        + example["body"]
        + " \n "
        + example["comments"]
    }

comments_dataset = comments_dataset.map(concatenate_text)

comments_dataset

0ex [00:00, ?ex/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text'],
    num_rows: 2175
})

And now, just to make sure it worked, we can print each column as well as the concatenated text feature. 

In [35]:
print("Individual")
print(comments_dataset['title'][0])
print("------------")
print(comments_dataset['body'][0])
print("------------")
print(comments_dataset['comments'][0])
print("------------")
print("Concatenated")
print(comments_dataset['text'][0])

Individual
Protect master branch
------------
After accidental merge commit (91c55355b634d0dc73350a7ddee1a6776dbbdd69) into `datasets` master branch, all commits present in the feature branch were permanently added to `datasets` master branch history, as e.g.:
- 00cc036fea7c7745cfe722360036ed306796a3f2
- 13ae8c98602bbad8197de3b9b425f4c78f582af1
- ...

I propose to protect our master branch, so that we avoid we can accidentally make this kind of mistakes in the future:
- [x] For Pull Requests using GitHub, allow only squash merging, so that only a single commit per Pull Request is merged into the master branch
  - Currently, simple merge commits are already disabled
  - I propose to disable rebase merging as well
- ~~Protect the master branch from direct pushes (to avoid accidentally pushing of merge commits)~~
  - ~~This protection would reject direct pushes to master branch~~
  - ~~If so, for each release (when we need to commit directly to the master branch), we should pre

## [Creating text embeddings](https://huggingface.co/course/chapter5/6?fw=pt#creating-text-embeddings)