[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1pFXVER_Cp_Xg9QZAAvXe_hQg2bMHj-tS)

## **1. Install and import libraries**

In [None]:
!pip install -qq datasets==2.16.1 evaluate==0.4.3

In [None]:
!nvidia-smi

Wed Feb 19 10:14:11 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   57C    P0             27W /   70W |     424MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
!sudo apt-get install libomp-dev
!pip install -qq faiss-gpu-cu12

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libomp-dev is already the newest version (1:14.0-55~exp2).
0 upgraded, 0 newly installed, 0 to remove and 20 not upgraded.


In [None]:
import numpy as np
import collections
import torch
import faiss
import evaluate

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel
from transformers import AutoModelForQuestionAnswering
from transformers import TrainingArguments
from transformers import Trainer
from tqdm.auto import tqdm

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

## **2. Download dataset**

In [None]:
DATASET_NAME = 'squad_v2'
raw_datasets = load_dataset(DATASET_NAME, split='train')
raw_datasets

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 130319
})

## **3. Filter out non-answerable samples**

In [None]:
raw_datasets = raw_datasets.filter(lambda x: len(x['answers']['text']) > 0)
raw_datasets

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 86821
})
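
The filter above keeps only samples whose `answers['text']` list is non-empty; in SQuAD v2, unanswerable questions carry an empty list there. A minimal sketch of the same predicate on two hand-made records (the ids and texts below are invented for illustration and are not taken from the dataset):

```python
# Hand-made records in the SQuAD v2 layout; ids and texts are invented for illustration.
records = [
    {"id": "a1", "question": "Who wrote Hamlet?",
     "answers": {"text": ["William Shakespeare"], "answer_start": [0]}},
    {"id": "a2", "question": "Who wrote the sequel to Hamlet?",
     "answers": {"text": [], "answer_start": []}},  # unanswerable: no answer text
]

def has_answer(example):
    """Same predicate as the lambda passed to filter() above."""
    return len(example["answers"]["text"]) > 0

# Keep only the records that have at least one answer span.
answerable = [r for r in records if has_answer(r)]
print([r["id"] for r in answerable])  # ['a1']
```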

## **4. Initialize pre-trained model**

In [None]:
MODEL_NAME = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME).to(device)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## **5. Create vector embedding functions**

In [None]:
# To build a vector embedding for a question, use the last-layer hidden state of the [CLS] token from the model output
def cls_pooling(model_output):
    return model_output.hidden_states[-1][:, 0]
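
The indexing `[:, 0]` selects, for every sequence in the batch, the vector at token position 0, which is where BERT-style tokenizers place `[CLS]`. A rough illustration with plain nested lists instead of tensors (the helper name `cls_pooling_lists` and all values are hypothetical and only mimic the shape `(batch, tokens, hidden)`):

```python
# Toy "last hidden state": 2 sequences, 3 tokens each, hidden size 4.
# Values are invented; position 0 of each sequence plays the role of [CLS].
last_hidden_state = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

def cls_pooling_lists(hidden):
    """List-based equivalent of hidden[:, 0]: keep the first token vector of each sequence."""
    return [sequence[0] for sequence in hidden]

pooled = cls_pooling_lists(last_hidden_state)
print(pooled)  # [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
```

The result has one vector per input sequence, which matches the `[1, 768]` shape reported for a single question in this notebook.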

In [None]:
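def get_embeddings(text_list):
    # Tokenize a string or a list of strings into padded, truncated tensors
    encoded_input = tokenizer(
        text_list,
        padding=True,
        truncation=True,
        return_tensors='pt'
    )
    # Move the inputs to the same device as the model
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    # Inference only: skip gradient tracking to save GPU memory
    with torch.no_grad():
        model_output = model(**encoded_input, output_hidden_states=True)
    return cls_pooling(model_output)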

In [None]:
# Test functionality
embedding = get_embeddings(raw_datasets['question'][1])
embedding.shape

torch.Size([1, 768])

In [None]:
# Convert to numpy array (required for HF Datasets)
EMBEDDING_COLUMN = 'question_embedding'
embeddings_dataset = raw_datasets.map(
    lambda x: {EMBEDDING_COLUMN: get_embeddings(x['question']).detach().cpu().numpy()[0]}
)
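
The `map` call above embeds one question per forward pass. A possible speed-up is to embed several questions at once. The sketch below shows only the chunking logic, using a stand-in `fake_embed` function in place of `get_embeddings`; both `batches` and `fake_embed` are hypothetical helpers, and the questions are invented.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_embed(texts):
    """Stand-in for get_embeddings: returns one small vector per input text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]

questions = ["Who?", "What is it?", "When did it start?", "Where?", "Why now?"]

all_vectors = []
for chunk in batches(questions, batch_size=2):
    # One call per chunk instead of one call per question.
    all_vectors.extend(fake_embed(chunk))

print(len(all_vectors))  # 5
```

With Hugging Face Datasets, the same idea is usually expressed by passing `batched=True` (and optionally `batch_size`) to `map`, in which case the mapped function receives a dict of lists and should return one embedding per row.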


In [None]:
# Create the FAISS index
embeddings_dataset.add_faiss_index(column=EMBEDDING_COLUMN)


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers', 'question_embedding'],
    num_rows: 86821
})
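
Assuming the index built here is a flat (exhaustive) index with the L2 metric, which is consistent with the exact-match score of 0.0 in the search results of this notebook, a query is answered by computing the squared Euclidean distance to every stored vector and returning the smallest ones. A brute-force sketch of that idea in plain Python (the helper names and the 2-dimensional vectors are invented for illustration; FAISS itself does this far more efficiently):

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query, vectors, k):
    """Return the k (distance, index) pairs with the smallest distance to `query`."""
    scored = [(squared_l2(query, v), i) for i, v in enumerate(vectors)]
    scored.sort()  # ascending: smaller distance means more similar
    return scored[:k]

# Invented 2-dimensional "embeddings".
vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
query = [1.0, 0.0]

result = nearest(query, vectors, k=2)
print(result)  # [(0.0, 1), (1.0, 0)]
```

Under this assumption, lower scores mean closer matches, and a score of 0.0 means the query vector is identical to a stored vector.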

## **6. Search for samples similar to a question**

In [None]:
question = 'When did Beyonce start becoming popular?'

input_quest_embedding = get_embeddings([question]).cpu().detach().numpy()
input_quest_embedding.shape

(1, 768)

In [None]:
TOP_K = 5
scores, samples = embeddings_dataset.get_nearest_examples(
    EMBEDDING_COLUMN, input_quest_embedding, k=TOP_K
)
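
`get_nearest_examples` returns the scores as one array and the matched samples as a dict of columns (one list per feature), so row `i` is spread across `samples[name][i]`. A small sketch of turning that layout into per-row records, using invented toy data and a hypothetical helper `to_rows`:

```python
# Toy stand-ins for the two return values; contents are invented for illustration.
toy_scores = [2.5, 0.0]
toy_samples = {
    "question": ["second question", "first question"],
    "context": ["second context", "first context"],
}

def to_rows(scores, samples):
    """Convert column-oriented results into a list of row dicts, best match first."""
    rows = []
    for idx, score in enumerate(scores):
        row = {name: column[idx] for name, column in samples.items()}
        row["score"] = float(score)
        rows.append(row)
    # Assuming distance-like scores, a smaller score is a better match.
    return sorted(rows, key=lambda r: r["score"])

rows = to_rows(toy_scores, toy_samples)
print(rows[0]["question"], rows[0]["score"])  # first question 0.0
```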

In [None]:
for idx, score in enumerate(scores):
    print(f'Top {idx + 1}\tScore: {score}')
    print(f'Question: {samples["question"][idx]}')
    print(f'Context: {samples["context"][idx]}')
    print(f'Answer: {samples["answers"][idx]}')
    print()

Top 1	Score: 0.0
Question: When did Beyonce start becoming popular?
Context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
Answer: {'text': ['in the late 1990s'], 'answer_start': [269]}

Top 2	Score: 2.6135306358337402
Question: When did Beyoncé rise to fame?
Context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981)