<a href="https://colab.research.google.com/github/appletreeleaf/Project/blob/main/%EC%8B%A4%EC%8A%B5%EC%9E%90%EB%A3%8C/Semantic_search_with_FAISS_(PyTorch).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic search with FAISS (PyTorch)

```
A transformer-based language model represents text as an embedding vector.
Taking the dot product of these embedding vectors lets us measure
how similar a query is to each entry in the corpus.

In this chapter, we will use embedding vectors to build
a semantic search engine.
```
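
As a minimal sketch of the dot-product idea described above, the following ranks a few toy vectors against a query. The vectors, the `dot` helper, and the `top_k` helper are made up for illustration; the real embeddings in this notebook come from the model loaded later.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query, corpus, k=2):
    """Return the k (score, index) pairs with the highest dot product."""
    scored = [(dot(query, vec), i) for i, vec in enumerate(corpus)]
    return sorted(scored, reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions.
corpus = [
    [1.0, 0.0, 0.0],
    [0.8, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 1.0, 0.0]

results = top_k(query, corpus, k=2)
print(results)
```

The entry pointing in roughly the same direction as the query gets the highest score, which is the behaviour a semantic search engine relies on.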

Install the Transformers, Datasets, and Evaluate libraries to run this notebook. The second command installs FAISS, which is used later to index the embeddings.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install faiss-gpu

## Loading and preparing the dataset

In [None]:
from datasets import load_dataset

issues_dataset = load_dataset("lewtun/github-issues", split="train")
issues_dataset

In [None]:
# Keep only real issues (not pull requests) that have at least one comment
issuse_dataset = issues_dataset.filter(lambda x: (x['is_pull_request'] == False and len(x['comments']) > 0))
issuse_dataset

In [None]:
columns = issuse_dataset.column_names
columns_to_keep = ['title', 'body', 'html_url', 'comments']
# Every column that is not explicitly kept gets removed
columns_to_remove = set(columns) - set(columns_to_keep)
issuse_dataset = issuse_dataset.remove_columns(columns_to_remove)
issuse_dataset

In [None]:
issuse_dataset.set_format("pandas")
df = issuse_dataset[:]

```
Even though the comments were already filtered once at the Dataset level,
some empty values still remain.
For now, we will filter them once more.
```

In [None]:
comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

In [None]:
from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset[1]

```
Each entry in the comments column is a list
that bundles several comment strings together.
The explode function splits these lists into one row per comment,
and every new row gets a copy of the same html_url, title, and body.
```

-------------------------------------------

### ✏️ Try it out!
- See if you can use Dataset.map() to explode the comments column of issues_dataset without resorting to the use of Pandas. This is a little tricky; you might find the “Batch mapping” section of the 🤗 Datasets documentation useful for this task.

In [None]:
# Batched map attempt: this copies `comments` into `new_comments` unchanged, so the lists are not exploded yet
tmp_dataset = issuse_dataset.map(lambda batch: {"new_comments": batch['comments']}, remove_columns=["comments"], batched=True)
tmp_dataset
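
One possible way to approach the exercise is sketched below. It assumes the batch arrives as a dict of equal-length lists, which is what a batched map passes when the dataset is in its default Python format (so `reset_format()` would be needed after `set_format("pandas")`). The `explode_comments` helper and the toy batch are hypothetical and only illustrate the idea.

```python
def explode_comments(batch):
    """Turn one row per issue into one row per comment.

    `batch` is assumed to be a dict mapping column names to
    equal-length lists, with `comments` holding lists of strings.
    """
    counts = [len(comments) for comments in batch["comments"]]
    exploded = {
        # Repeat each value once for every comment on the same issue.
        column: [value for value, n in zip(values, counts) for _ in range(n)]
        for column, values in batch.items()
        if column != "comments"
    }
    # Flatten the lists of comments into a single list of strings.
    exploded["comments"] = [c for comments in batch["comments"] for c in comments]
    return exploded


toy_batch = {
    "title": ["Issue A", "Issue B"],
    "comments": [["first", "second"], ["third"]],
}
print(explode_comments(toy_batch))
```

Because the number of rows changes, a call along the lines of `issuse_dataset.map(explode_comments, batched=True, remove_columns=issuse_dataset.column_names)` would likely be needed so that the old columns do not clash with the new lengths. Unlike the Pandas version, this sketch drops issues whose comment list is empty.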

-------------------------------------------

In [None]:
comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)

```
Add a "comment_length" column that stores
the number of words in each comment.
```
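
To make the length measure concrete, here is a small sketch of how `len(text.split())` counts whitespace-separated words, together with the same `> 15` threshold the next cell applies. The two comments and the `comment_length` function name are invented for illustration.

```python
def comment_length(comment):
    """Number of whitespace-separated tokens in a comment."""
    return len(comment.split())


# Made-up comments: one too short to be informative, one long enough to keep.
comments = [
    "Thanks!",
    "You can pass a local path to load_dataset after downloading the files "
    "on a machine that has internet access first.",
]

lengths = [comment_length(c) for c in comments]
kept = [c for c in comments if comment_length(c) > 15]
print(lengths)
print(len(kept))
```

Short replies such as "Thanks!" carry little information for a search query, which is why they are filtered out by length.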

In [None]:
comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
comments_dataset

In [None]:
def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }


comments_dataset = comments_dataset.map(concatenate_text)
comments_dataset

```
Finally, add a 'text' column that concatenates
title, body, and comments.
```
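
String concatenation with `+` raises a `TypeError` if any of the fields is `None`. Whether that happens depends on the data, so the following is only a defensive sketch; `concatenate_text_safe` is a hypothetical variant, not part of the original notebook, and the example row is made up.

```python
def concatenate_text_safe(example):
    """Join title, body and comment, treating missing fields as empty strings."""
    parts = [example.get(key) or "" for key in ("title", "body", "comments")]
    return {"text": " \n ".join(parts)}


# Made-up row with a missing body.
example = {"title": "Load offline", "body": None, "comments": "Use a local path."}
print(repr(concatenate_text_safe(example)["text"]))
```

For rows where all three fields are strings, this produces the same `text` value as joining them with `" \n "` by hand.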

In [None]:
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

In [None]:
import torch

# Fall back to the CPU when no GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

In [None]:
def cls_pooling(model_output):
    # Use the hidden state of the first token as the embedding of the whole text
    return model_output.last_hidden_state[:, 0]

In [None]:
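def get_embeddings(text_list):
    # Tokenize the batch, padding and truncating to a common length
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    # Move the input tensors to the same device as the model
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    # Inference only, so skip gradient tracking to save memory
    with torch.no_grad():
        model_output = model(**encoded_input)
    return cls_pooling(model_output)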

In [None]:
embedding = get_embeddings(comments_dataset["text"][0])
embedding.shape

In [None]:
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

```
To index the embeddings with FAISS,
they are converted to NumPy arrays.
```

In [None]:
embeddings_dataset.add_faiss_index(column="embeddings")

In [None]:
question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

In [None]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

In [None]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

In [None]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
attached_indexes = embeddings_dataset.list_indexes()

for index in attached_indexes:
    embeddings_dataset.drop_index(index)
embeddings_dataset

In [None]:
type(embeddings_dataset)

In [None]:
embeddings_dataset.push_to_hub("appletreeleaf/refined-github-issues")