# CPSC 477/577 Spring 2025 - Homework 3
## Part 3 - RAG and Information Retrieval

Yale University  
Spring 2025  
Instructor: Arman Cohan

In this homework, we will implement and evaluate a RAG-based retrieval system using the LitSearch dataset and Snowflake's Arctic Embeddings model.

**Acknowledgement**  The assignment was designed by TA Yilun Zhao with help and guidance from Arman Cohan.

### Submission Instructions

Submit the notebook as a .ipynb file through GradeScope.

Make sure that the notebook runs without errors before submission. Remove any unnecessary outputs and any extra `print` or debugging statements that you added to the code.

### Write your name and NetID below.

**Name:**    Yuan Chang

**NetID:**   yc2238

### Main tasks include:
1. Building a retrieval index using Arctic Embeddings
2. Implementing retrieval functionality

First, let's import the required packages:

In [1]:
!pip install datasets transformers torch faiss-cpu tqdm

from datasets import load_dataset
import torch
from transformers import AutoTokenizer, AutoModel
import faiss
import numpy as np
from tqdm.auto import tqdm


## Data Loading and Preprocessing
Load the LitSearch dataset, which contains queries about academic papers and a corpus of papers to search.
Each corpus entry includes the paper's title, abstract, and other metadata.

In [2]:
query_data = load_dataset("yale-nlp/LitSearch-NLP-Class", "query", split="full")
corpus_data = load_dataset("yale-nlp/LitSearch-NLP-Class", "corpus_new", split="full")

# Print dataset sizes to verify loading
print(f"Query set size: {len(query_data)}")
print(f"Corpus size: {len(corpus_data)}")

# After loading datasets, add structure inspection
print("\nQuery dataset columns:", list(query_data.features.keys()))
print("Sample query:", query_data[0])
print("\nCorpus dataset columns:", list(corpus_data.features.keys()))
print("Sample corpus document:", corpus_data[0])

Query set size: 597
Corpus size: 6809

Query dataset columns: ['query_set', 'query', 'specificity', 'quality', 'corpusids']
Sample query: {'query_set': 'inline_acl', 'query': 'Are there any research papers on methods to compress large-scale language models using task-agnostic knowledge distillation techniques?', 'specificity': 0, 'quality': 2, 'corpusids': [202719327]}

Corpus dataset columns: ['corpusid', 'title', 'abstract', 'citations', 'full_paper']
Sample corpus document: {'corpusid': 253523474, 'title': 'CHARACTERIZING THE SPECTRUM OF THE NTK VIA A POWER SERIES EXPANSION', 'abstract': 'Under mild conditions on the network initialization we derive a power series expansion for the Neural Tangent Kernel (NTK) of arbitrarily deep feedforward networks in the infinite width limit. We provide expressions for the coefficients of this power series which depend on both the Hermite coefficients of the activation function as well as the depth of the network. We observe faster decay of the He

## Task 1: Embed Documents Using the Arctic Embeddings Model (20 points)
Initialize the Arctic Embeddings model (check usage example here: https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5#using-huggingface-transformers) which will be used to encode both queries and documents.

In [3]:
# Task 1: Initialize Arctic Embeddings model
# We will be using a lightweight yet capable embedding model called Arctic
# TODO:
# 1. Load tokenizer using AutoTokenizer.from_pretrained(model_name)
# 2. Load model using AutoModel.from_pretrained(model_name)
# 3. Move model to GPU using .to('cuda')
from torch.nn.functional import normalize

model_name = "Snowflake/snowflake-arctic-embed-m-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

if tokenizer is None or model is None:
    raise NotImplementedError("tokenizer or model not loaded properly.")

print(f"Model '{model_name}' and tokenizer loaded successfully.")

# Use the GPU when one is available and fall back to the CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()  # switch off dropout so inference is deterministic

def get_embeddings(texts, batch_size=128, disable=False):
    """Get embedding vectors for input texts

    TODO Implementation Steps:
    1.  **Add Prefix**: Prepend the recommended prefix
        `"Represent this sentence for searching relevant passages: "` to each text
        in the input list. This is important for Arctic model performance.
    2.  **Batch Processing**: Iterate through the prefixed texts in batches of size `batch_size`.
    3.  **Tokenization**: For each batch:
        - Tokenize the texts using the `tokenizer`. Ensure you add padding (`padding=True`),
          truncate sequences (`truncation=True`), specify a `max_length` (e.g., 512),
          and return PyTorch tensors (`return_tensors="pt"`).
        - Move the tokenized batch to the GPU (`.to('cuda')`).
    4.  **Inference**: Within a `torch.inference_mode()` context:
        - Pass the tokenized batch to the `model`.
        - Extract the [CLS] token's embedding. This is typically the embedding of the
          first token in the `last_hidden_state` (output[0][:, 0]).
    5.  **Normalization**: Apply L2 normalization to the extracted [CLS] embeddings
        Question: Why would you need to normalize?
    6.  **Collection**: Store the normalized embeddings (moved back to CPU using `.cpu()`).
    7.  **Concatenation**: After processing all batches, concatenate the collected
        batch embeddings into a single tensor using `torch.cat()`.
    8.  Return the final tensor.

    Args:
        texts: List of strings to encode
        batch_size: Batch size for processing
        disable: If True, hide the tqdm progress bar

    Returns:
        torch.Tensor of shape (len(texts), embedding_dim)
    """
    all_embeddings = []
    prefix = "Represent this sentence for searching relevant passages: "

    # --- TODO: Implement the embedding generation logic ---
    # Follow the implementation steps described in the docstring above.
    # Approximately 10-15 lines of code are expected.

    for i in tqdm(range(0, len(texts), batch_size), disable=disable):
        # Step 1: prepend the retrieval prefix to every text in the batch
        batch_texts = [prefix + t for t in texts[i : i + batch_size]]

        # Step 3: tokenize, pad to the longest sequence, truncate at 512 tokens
        enc = tokenizer(
            batch_texts,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt"
        )
        input_ids      = enc.input_ids.to(device)
        attention_mask = enc.attention_mask.to(device)

        # Step 4: forward pass without gradient tracking
        with torch.no_grad():
            out = model(input_ids=input_ids, attention_mask=attention_mask)

        # Steps 4-6: take the [CLS] embedding, L2-normalize it, move it to the CPU
        cls_emb = out.last_hidden_state[:, 0, :]  # shape (batch, dim)
        cls_emb = normalize(cls_emb, p=2, dim=1)
        all_embeddings.append(cls_emb.cpu())

    # --- End TODO ---

    # Check before concatenating: torch.cat raises on an empty list
    if not all_embeddings:
        raise NotImplementedError("TODO Embedding generation logic not implemented or returned empty list.")

    # Step 7: concatenate the per-batch embeddings into one (len(texts), dim) tensor
    embeddings = torch.cat(all_embeddings, dim=0)

    if embeddings.size(0) != len(texts):
        raise RuntimeError("Expected embeddings for all texts")

    return embeddings


Some weights of BertModel were not initialized from the model checkpoint at Snowflake/snowflake-arctic-embed-m-v1.5 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model 'Snowflake/snowflake-arctic-embed-m-v1.5' and tokenizer loaded successfully.
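
The docstring above asks why the [CLS] embeddings are L2-normalized. One way to see it, sketched here with toy vectors rather than real model outputs: for unit-length vectors, the squared L2 distance equals `2 - 2 * cosine similarity`, so ranking by L2 distance gives the same order as ranking by cosine similarity. The helper names below are illustrative and are not part of the assignment code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    """Inner product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy vectors (made-up values, not real embeddings).
query = l2_normalize([1.0, 2.0, 2.0])
docs = [l2_normalize(d) for d in ([2.0, 4.0, 4.1], [3.0, 0.0, 1.0], [-1.0, 0.5, 2.0])]

for d in docs:
    # For unit vectors: ||q - d||^2 = 2 - 2 * (q . d)
    assert math.isclose(squared_l2(query, d), 2.0 - 2.0 * dot(query, d), abs_tol=1e-12)

# Ascending L2 distance and descending cosine similarity give the same order.
rank_by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
rank_by_cos = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
print(rank_by_l2, rank_by_cos)
```

Without normalization, vector length would also influence the L2 distance, so two documents pointing in the same direction as the query could be ranked differently.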


## Task 2: Building Retrieval Index (10 points)
Construct a FAISS index for efficient similarity search.

In [6]:
# Task 2: Build FAISS Index
def build_faiss_index(corpus_data):
    """Build FAISS index for similarity search

    TODO Implementation Steps:
    1.  **Prepare Texts**: Create a list of strings (`corpus_texts`), where each string
        is the concatenation of a document's "title" and "abstract".
    2.  **Generate Embeddings**: Use the `get_embeddings` function implemented in Task 1
        to generate embeddings for all `corpus_texts`.
    3.  **Initialize Index**: Create a FAISS index suitable for dense vector similarity
        search. `faiss.IndexFlatL2` is a good choice here. Make sure its dimension
        matches the dimension of your generated embeddings.
    4.  **Add Embeddings**: Add the generated corpus embeddings to the FAISS index.
        Note: FAISS typically requires embeddings as a NumPy array (`.numpy()`).
    5.  Return the populated FAISS index.

    Args:
        corpus_data: Dataset whose entries provide "title" and "abstract" fields
    Returns:
        FAISS index
    """


    # Follow the implementation steps described in the docstring above.
    # Approximately 5 lines of code are expected.

    corpus_texts = [doc["title"] + " " + doc["abstract"] for doc in corpus_data]
    corpus_embeddings = get_embeddings(corpus_texts)
    dim = corpus_embeddings.size(1)
    corpus_index = faiss.IndexFlatL2(dim)
    # FAISS expects a float32 NumPy array rather than a torch.Tensor
    corpus_index.add(corpus_embeddings.numpy())

    # --- End TODO ---

    if corpus_index is None or corpus_index.ntotal == 0:
        raise NotImplementedError("TODO 2.1: FAISS index not built or is empty.")

    return corpus_index

In [7]:
print("Building FAISS index for the corpus...")
corpus_index = build_faiss_index(corpus_data)
print(f"FAISS index built successfully with {corpus_index.ntotal} documents.")

Building FAISS index for the corpus...
FAISS index built successfully with 6809 documents.
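
`faiss.IndexFlatL2` performs an exhaustive search: every stored vector is compared with the query. The following is a minimal pure-Python sketch of that idea on a toy corpus. It is not how FAISS is implemented internally, and `flat_l2_search` is a hypothetical name used only for illustration.

```python
import heapq

def flat_l2_search(corpus_vectors, query_vector, k):
    """Return (squared_distance, index) pairs for the k nearest corpus vectors.

    Every corpus vector is compared with the query, so the cost is
    O(n * d) for the distances plus O(n log k) to keep the k best.
    """
    scored = (
        (sum((c - q) ** 2 for c, q in zip(vec, query_vector)), idx)
        for idx, vec in enumerate(corpus_vectors)
    )
    return heapq.nsmallest(k, scored)

# Toy 2-dimensional corpus (made-up values).
toy_corpus = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
neighbors = flat_l2_search(toy_corpus, [0.9, 0.1], k=2)
nearest_ids = [idx for _, idx in neighbors]
print(nearest_ids)
```

Like `IndexFlatL2`, the sketch ranks by squared L2 distance, with the closest vector first. For a corpus of a few thousand documents, exhaustive search is usually fast enough that an approximate index is unnecessary.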


## Task 3: Implementing Retrieval  (10 points)
Implement the retrieval function that:
1. Encodes the input query
2. Performs similarity search in the index
3. Returns the top-k most relevant documents (default value of k is 10)

In [8]:
## Task 3: Implementing Retrieval
# Implement the retrieval function that uses the built index to find relevant documents for a given query.


def retrieve(corpus_data, corpus_index, query, k=10):
    """
    Retrieve the top-k most relevant documents for a given query using the FAISS index.

    Args:
        corpus_data (datasets.Dataset): The original corpus dataset.
        corpus_index (faiss.Index): The FAISS index built from corpus embeddings.
        query (str): The search query string.
        k (int): The number of top documents to retrieve.

    Returns:
        list: A list containing the top-k relevant document entries from `corpus_data`.

    Implementation Guide/Steps:
    High-level: You need to embed the query, search the index, then look up and return the matching documents.

    Steps:
    1.  **Get Query Embedding**: Generate the embedding for the input `query` using the
        `get_embeddings` function. Remember `get_embeddings` expects a list of texts.
        Set `disable=True` to avoid nested progress bars.
    2.  **Search Index**: Use the `corpus_index.search()` method to find the `k` nearest
        neighbors to the query embedding. This returns distances (D) and indices (I).
        FAISS search expects a NumPy array for the query embedding.
    3.  **Retrieve Documents**: Get the indices of the top-k documents from the search
        results (I[0]). Use these indices to retrieve the corresponding full document
        entries from the original `corpus_data`.
        Note: Ensure indices from FAISS (often numpy.int64) are converted to standard
        Python `int` for indexing into the `datasets.Dataset`.
    4.  Return the list of retrieved documents.
    """

    # --- TODO: Implement the retrieval logic ---
    # Follow the implementation steps described in the docstring above.
    # Approximately 3-4 lines of code are expected.
    query_embedding = get_embeddings([query], disable=True).cpu().numpy()
    D, I = corpus_index.search(query_embedding, k)
    retrieved_docs = [corpus_data[int(idx)] for idx in I[0]]

    # --- End TODO ---

    if retrieved_docs is None:
        raise NotImplementedError("TODO 3.1: Retrieval logic not implemented.")

    return retrieved_docs

## Evaluation Details

We will evaluate retrieval performance using Recall@10.
To receive full credit, Recall@10 should be greater than 0.78.
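
As a reminder of what the metric measures, here is a small sketch of Recall@k for a single query: the fraction of that query's relevant documents that appear among the top-k retrieved IDs. The function name and the IDs are hypothetical. The notebook's own evaluation code then averages this per-query value over all queries.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k retrieved IDs."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return sum(1 for doc_id in relevant_ids if doc_id in top_k) / len(relevant_ids)

# Made-up document IDs for illustration.
retrieved = [11, 42, 7, 99, 5]
two_of_three = recall_at_k(retrieved, relevant_ids=[42, 5, 300], k=5)  # 42 and 5 are found
missed_at_1 = recall_at_k(retrieved, relevant_ids=[42], k=1)           # 42 sits at rank 2
print(two_of_three, missed_at_1)
```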

In [9]:
## Evaluation Section

# This section contains code to evaluate your retrieval implementation using Recall@10.
# **You do not need to modify the code below.** Run these cells after completing Tasks 1, 2, and 3 to check your work.

# The target performance is **Recall@10 > 0.78**.

# Example usage and visualization
sample_query = "Transformer models for natural language processing"
results = retrieve(corpus_data, corpus_index, sample_query)
gold_titles = ["Hierarchical Transformer for Task Oriented Dialog Systems", "Pretrained Transformers for Text Ranking: BERT and Beyond"]

print("Sample Query:", sample_query)
print("\nTop 2 Retrieved Papers:")
for i, doc in enumerate(results[:2], 0):
    assert doc["title"] == gold_titles[i]
    print(f"\n{i}. {doc['title']}")
    print(f"Abstract: {doc['abstract'][:200]}...")

Sample Query: Transformer models for natural language processing

Top 2 Retrieved Papers:

0. Hierarchical Transformer for Task Oriented Dialog Systems
Abstract: Generative models for dialog systems have gained much interest because of the recent success of RNN and Transformer based models in tasks like question answering and summarization. Although the task o...

1. Pretrained Transformers for Text Ranking: BERT and Beyond
Abstract: The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query for a particular task. Although the most common formulation of text ranking is search, i...


In [10]:
def evaluate(corpus_index, queries, relevants, k=10):
    """
    Evaluate retrieval performance using Recall at k.

    Args:
        corpus_index: Your corpus or index structure.
        queries (list): List of query strings.
        relevants (list): List of lists, where each sublist contains
                          the relevant document IDs for the corresponding query.
        k (int): Number of documents to retrieve for each query.

    Returns:
        dict: Dictionary containing average Recall@k and a simple "Passed" flag.
    """
    recall_sum = 0.0
    n_queries = len(queries)

    for query, rel_docs in zip(queries, relevants):
        results = retrieve(corpus_data, corpus_index, query, k)  # Retrieve top-k docs
        retrieved_ids = [doc['corpusid'] for doc in results]  # or 'paper_id' if needed

        # Count how many relevant docs were retrieved
        relevant_retrieved = sum(1 for doc_id in retrieved_ids if doc_id in rel_docs)

        # Compute recall for this query (assumes each query has at least one relevant doc)
        recall_sum += relevant_retrieved / len(rel_docs)

    # Average recall over all queries
    recall = recall_sum / n_queries

    # A sample "passing" criterion
    passed = recall > 0.78  # Threshold can be adjusted as needed

    return {
        "Recall": recall,
        "Passed_Requirement": passed,
    }

In [11]:
# Run evaluation on test set
test_queries = query_data["query"]
test_relevants = query_data["corpusids"]

# Grade implementation
results = evaluate(corpus_index, test_queries, test_relevants)
print(f"Evaluation results: {results}")

Evaluation results: {'Recall': 0.818425460636516, 'Passed_Requirement': True}


## Analysis and Discussion (10 points)

Answer the following questions in a brief report below.

1- What is the time complexity of your implemented `retrieve` function?

2- Analyze failure cases where the gold passage is not found among the top-20 retrieved results. Identify and describe two distinct primary error types that contribute to these failures. For each error type, provide:
- One representative example
- A detailed explanation of the specific issue in that case
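
One possible way to gather failure cases for question 2 is sketched below. It assumes a callable `retrieve_fn(query, k)` that returns a list of dicts with a `"corpusid"` key. Both `find_misses` and `retrieve_fn` are hypothetical names, and the stub retriever uses made-up IDs so the sketch runs on its own.

```python
def find_misses(queries, relevants, retrieve_fn, k=20):
    """Collect queries for which no gold document ID appears in the top-k results.

    `retrieve_fn(query, k)` is a hypothetical callable returning a list of
    dicts that each carry a "corpusid" key.
    """
    misses = []
    for query, gold_ids in zip(queries, relevants):
        retrieved_ids = {doc["corpusid"] for doc in retrieve_fn(query, k)}
        if not retrieved_ids & set(gold_ids):
            misses.append(
                {"query": query, "gold": list(gold_ids), "retrieved": sorted(retrieved_ids)}
            )
    return misses

# Stub retriever with made-up IDs, standing in for the real retrieval function.
def fake_retrieve(query, k):
    table = {"q1": [1, 2, 3], "q2": [4, 5, 6]}
    return [{"corpusid": doc_id} for doc_id in table[query][:k]]

misses = find_misses(["q1", "q2"], [[2], [9]], fake_retrieve, k=20)
print(misses)
```

In this notebook, the stub could be replaced with a thin wrapper such as `lambda q, k: retrieve(corpus_data, corpus_index, q, k)`, applied to `query_data["query"]` and `query_data["corpusids"]`.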

In [None]:
# TODO Your answers to above questions

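your_answer_1 = """With faiss.IndexFlatL2 the search is exhaustive: the query is compared against all n corpus
 vectors of dimension d, so retrieval costs O(n * d) per query, plus O(n log k) to keep the top-k results
 and one encoder forward pass for the query, which does not depend on n."""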

your_answer_2 = """Truncation loss for long documents: The gold passage appears after the first 600
 tokens of a 2,000-token abstract, but our tokenizer truncates inputs to 512 tokens.
 Because of the truncation, the content carrying the gold answer never reached the encoder,
 so the document vector did not reflect its true relevance."""

if your_answer_1 == "":
  raise NotImplementedError()