## Overview of Assignment 4

This assignment focuses on exploring and implementing advanced concepts and techniques in information retrieval. The primary objectives are to build a Retrieval-Augmented Generation (RAG) pipeline and to learn about language models.

## Enter your details below

## Name

Antonia Mugisa

## Banner ID

B00856440

## GitHub Link of your Assignment 4

https://github.com/CSCI4141/assignment-4-antoniamugisa

## Q1 : Setting up the libraries and the environment

In [1]:
! pip3 install datasets
! pip3 install transformers
! pip3 install tqdm 
! pip3 install urllib3==1.26.16
! pip3 install --upgrade jupyter ipywidgets 
! pip3 install faiss-cpu
! pip3 install --upgrade langchain 
! pip3 install langchain_community
! pip3 install langchain_huggingface

Defaulting to user installation because normal site-packages is not writeable


In [7]:
from datasets import load_dataset
from transformers import AutoTokenizer, BertModel, pipeline
import numpy as np
import torch
from torch.utils.data import DataLoader
from langchain_community.vectorstores import FAISS
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores.base import VectorStore
from langchain_huggingface import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import HuggingFaceDatasetLoader
from langchain.docstore import InMemoryDocstore
from langchain_core.documents import Document

## Q2:  Data Preprocessing and Model Selection

In [3]:
# 1. 
# Load dataset 

# Load the training split of the lyrics dataset from the Hugging Face Hub
dataset = load_dataset("danioshi/incubus_taylor_swift_lyrics", split="train")


In [4]:
# 2. 
# Tokenize data

# Load a pre-trained tokenizer (you can choose a different one if needed)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the dataset
def tokenize_function(example):
    return tokenizer(example["lyrics"], padding="max_length", truncation=True)

# Apply the tokenizer to the dataset
tokenizer_ds = dataset.map(tokenize_function, batched=True)




In [5]:
# 3. 

#Split into chunks for indexing 

def chunk_examples(example, chunk_size=128):
    input_ids = example["input_ids"]
    chunks = [input_ids[i:i + chunk_size] for i in range(0, len(input_ids), chunk_size)]
    return {"chunks": chunks}

# Apply the chunking function
chunked_ds = tokenizer_ds.map(chunk_examples, batched=False)
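
To make the chunking step concrete, here is a small self-contained sketch that applies the same slicing to a made-up list of token IDs. The IDs, the pad ID of 0, and the helper name `drop_padding_chunks` are illustrative assumptions, not values taken from the dataset. It also shows a side effect worth keeping in mind: because the tokenizer pads every example to the maximum length, trailing chunks can consist of padding only.

```python
def chunk_ids(input_ids, chunk_size=128):
    """Split a list of token IDs into consecutive chunks of at most chunk_size."""
    return [input_ids[i:i + chunk_size] for i in range(0, len(input_ids), chunk_size)]


def drop_padding_chunks(chunks, pad_id=0):
    """Hypothetical helper: keep only chunks with at least one non-pad token."""
    return [chunk for chunk in chunks if any(tok != pad_id for tok in chunk)]


# Made-up example: 5 "real" token IDs padded with 0 up to length 12
padded = [101, 7592, 2088, 999, 102] + [0] * 7
chunks = chunk_ids(padded, chunk_size=4)

print(chunks)                       # three chunks, the last one is padding only
print(drop_padding_chunks(chunks))  # the padding-only chunk is removed
```

Filtering padding-only chunks before embedding avoids indexing vectors that carry no lyric content.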



In [13]:
# 4.

# Create a vector store
import faiss

# Load the model (the tokenizer was loaded in step 2)
model = BertModel.from_pretrained("bert-base-uncased")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Embed one chunk of token IDs; returns an array of shape (1, hidden_size).
# Each chunk must stay a full sequence of token IDs: flattening the chunks into
# single IDs produces 1-D inputs, which BERT rejects with
# "IndexError: too many indices for tensor of dimension 1".
def get_embedding(chunk):
    input_ids = torch.tensor(chunk, dtype=torch.long, device=device).unsqueeze(0)  # shape (1, chunk_len)
    attention_mask = (input_ids != tokenizer.pad_token_id).long()  # ignore padding tokens
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    # Vector at the first token position (the [CLS] token for the first chunk)
    return outputs.last_hidden_state[:, 0].cpu().numpy()

# Embed every non-empty chunk and keep one Document per embedding,
# so that row i of the FAISS index corresponds to documents[i]
embeddings_list = []
documents = []

for example in chunked_ds:
    for chunk in example["chunks"]:
        text = tokenizer.decode(chunk, skip_special_tokens=True).strip()
        if not text:  # the chunk contains padding only
            continue
        embeddings_list.append(get_embedding(chunk))
        documents.append(Document(page_content=text))

# Convert to a float32 array of shape (num_chunks, hidden_size)
embeddings_array = np.vstack(embeddings_list).astype(np.float32)

# Define the dimension of embeddings (should match the size of individual embeddings)
dimension = embeddings_array.shape[1]

# Initialize FAISS index
index = faiss.IndexFlatL2(dimension)

# Add embeddings to FAISS index
index.add(embeddings_array)

print("FAISS index created and embeddings added.")
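
`faiss.IndexFlatL2` performs an exact, brute-force search by squared Euclidean (L2) distance. The sketch below reproduces that idea in plain Python on made-up 2-dimensional vectors, to show why row `i` of the index has to correspond to document `i`. The function name `l2_search` and the sample data are assumptions for illustration; this is not how FAISS is implemented internally.

```python
def l2_search(vectors, query, k=2):
    """Return (squared distance, row index) pairs for the k vectors closest to query."""
    scored = []
    for row, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, row))
    scored.sort()  # smallest distance first
    return scored[:k]


# Made-up embeddings; row i belongs to docs[i]
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
docs = ["chunk A", "chunk B", "chunk C"]

hits = l2_search(vectors, query=[0.9, 1.2], k=2)
for dist, row in hits:
    print(f"{docs[row]} (squared L2 distance {dist:.2f})")
```

The search only returns row numbers, so the text shown to the generator depends entirely on the embeddings and the document list being built in the same order.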

## Q3: Implementing RAG using LangChain for different queries

1. 

Retrieval-Augmented Generation (RAG) is an approach that enhances language model outputs by integrating external knowledge. It combines retrieval and generation to produce more informed responses. Here's a brief overview:

Components:

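1. Query Encoder: Converts the input query into a vector that captures its meaning, typically with an encoder model such as BERT.

2. Retriever: Searches the knowledge base for the documents most relevant to the query. Dense retrievers compare the query vector against document vectors (for example with a FAISS index), while sparse retrievers such as BM25 rank documents by term overlap.

3. Document Encoder: Converts the documents in the knowledge base into vectors, usually ahead of time, so the retriever can compare them with the query vector.

4. Generator: Takes the query together with the retrieved text and generates a context-rich response. Uses models like GPT or T5.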

5. Knowledge Base: The source of external information, such as Wikipedia.

Workflow:

1. Indexing: Encode the documents in the knowledge base into vectors and store them in an index.
2. Input Handling: Receive and encode the query.
3. Retrieval: Find the documents closest to the query.
4. Generation: Produce a response from the query and the retrieved text.
5. Output: Deliver an informed, context-aware response.

Advantages:

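- Knowledge Integration: Grounds outputs in retrieved documents instead of relying only on what the model learned during training.
- Flexibility: Adapts to a new domain by swapping the knowledge base.
- Scalability: The knowledge base can grow or be updated without retraining the generator.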

Use Cases:

- Question Answering: Provides accurate, contextually enhanced answers.
- Customer Support: Improves automated response quality.
- Content Generation: Creates comprehensive, well-informed content.

RAG effectively combines pre-trained models with external knowledge sources for more accurate text generation.
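
The workflow above can be sketched end to end without any model. The toy example below uses word-count vectors as a stand-in for a neural encoder and prints the assembled prompt instead of calling a generator. The names `embed`, `retrieve`, and `build_prompt`, and the three-sentence knowledge base, are all assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy encoder: a bag-of-words count vector (stand-in for a neural encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, knowledge_base, k=1):
    """Rank passages by similarity to the query and return the top k."""
    q = embed(query)
    ranked = sorted(knowledge_base, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages):
    """Combine the retrieved text with the query; a generator would receive this string."""
    context = "\n".join(passages)
    return f"context: {context}\nquestion: {query}"


knowledge_base = [
    "faiss searches dense vectors by distance",
    "bm25 ranks documents by term overlap",
    "t5 casts every task as text to text",
]
query = "how does bm25 rank documents"
passages = retrieve(query, knowledge_base, k=1)
print(build_prompt(query, passages))
```

In a real pipeline, `embed` would be replaced by a trained encoder and the prompt would be passed to a language model, but the order of the steps stays the same.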

2.

I have chosen to use T5, a text-to-text transformer. Because it frames every task as generating output text from input text, it fits the generation step of RAG, where the query and the retrieved passages are passed in as text. It is pretrained on the C4 dataset, which provides it with broad general knowledge. T5 is also available in various sizes and can be fine-tuned to improve performance on RAG tasks.

In [167]:
# 3

# Embedding model used for both the documents and the queries
hf_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    encode_kwargs={'normalize_embeddings': False}
)

# Build the FAISS vector store from the documents with that same model.
# The index from Q2 holds BERT vectors, which are not comparable with
# all-mpnet-base-v2 query vectors, so it is not reused here.
vs = FAISS.from_documents(documents, hf_embeddings)

# Create retriever from vector store
retriever = vs.as_retriever()

# Initialize the Hugging Face pipeline for text generation
hf_pipeline = pipeline(
    "text2text-generation",
    model="t5-base",
    device=0 if torch.cuda.is_available() else -1,  # GPU if available, otherwise CPU
    max_length=150
)

# Wrap the Hugging Face pipeline using HuggingFacePipeline from LangChain
llm = HuggingFacePipeline(pipeline=hf_pipeline)

# Construct the RAG-like pipeline using RetrievalQA
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="refine",
    retriever=retriever,
    return_source_documents=False
)

# Perform a query
query = "What are the common themes in Taylor Swift's lyrics?"
result = qa.invoke({"query": query})

# Display the result
print("Generated Response:", result["result"])

KeyboardInterrupt: 

## Q4 : Modify and evaluate the different components of RAG

In [None]:
# 1.


2. 

In [None]:
# 3. 

4. 

## Q5: Selecting and implementing a pretrained model for a new task