<a href="https://colab.research.google.com/github/afsarahannan/NLP_RAG_Project-/blob/main/Retrieval_Generation_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ColBERTv2: Indexing & Search Notebook

If you're working in Google Colab, we recommend selecting "GPU" as your hardware accelerator in the runtime settings.

First, we'll clone the ColBERT repository and install its dependencies. Then we'll import the relevant classes; `Indexer` and `Searcher` are the key actors here.

In [None]:
!git -C ColBERT/ pull || git clone https://github.com/stanford-futuredata/ColBERT.git
import sys; sys.path.insert(0, 'ColBERT/')


fatal: cannot change to 'ColBERT/': No such file or directory
Cloning into 'ColBERT'...
remote: Enumerating objects: 2634, done.
remote: Counting objects: 100% (1137/1137), done.
remote: Compressing objects: 100% (336/336), done.
remote: Total 2634 (delta 891), reused 847 (delta 801), pack-reused 1497
Receiving objects: 100% (2634/2634), 2.03 MiB | 10.91 MiB/s, done.
Resolving deltas: 100% (1650/1650), done.


In [None]:
try:  # On Google Colab, install all dependencies with pip.
    import google.colab
    !pip install -U pip
    !pip install -e ColBERT/['faiss-gpu','torch']
except Exception:
    # Outside Colab: make the local ColBERT checkout importable and verify it loads.
    import sys; sys.path.insert(0, 'ColBERT/')
    try:
        from colbert import Indexer, Searcher
    except Exception as err:
        # Fail with an explicit error instead of a bare `assert False`.
        raise ImportError("You're running outside Colab. Please install ColBERT first; conda is recommended.") from err

In [None]:
import colbert

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


In [None]:
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert.data import Queries, Collection

In [None]:
# Load the tab-separated query and answer files (no header row)
import pandas as pd
queries = pd.read_csv("/content/queries.csv", header=None, sep='\t')
answers = pd.read_csv("/content/answers.csv", header=None, sep='\t')

In [None]:
questions = queries[0].tolist()
answer_index = queries[1].tolist()
# Column 1 holds comma-separated integer indices per question; parse them into lists of ints.
answer_index_integers = [[int(num) for num in sub_string.split(',')] for sub_string in answer_index]
answer_text = answers[0].tolist()
print(f"There are {len(questions)} questions and {len(answer_text)} answers in this notebook.")

There are 48 questions and 59 answers in this notebook.
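
The parsing step above relies on column 1 of `queries.csv` holding comma-separated integers. A minimal sketch of that layout, using a hypothetical two-row sample in place of the real file (the sample rows and the `sample_*` names are illustrative assumptions, not data from this notebook):

```python
import io
import pandas as pd

# Hypothetical sample in the assumed layout: question <TAB> comma-separated indices.
sample = "What is ML?\t0,2\nWhat is edge computing?\t1\n"
df = pd.read_csv(io.StringIO(sample), header=None, sep="\t")

sample_questions = df[0].tolist()
# str() guards against pandas inferring an integer column
# when every row happens to contain a single index.
sample_gold = [[int(n) for n in str(cell).split(",")] for cell in df[1]]
print(sample_gold)  # [[0, 2], [1]]
```

If every row listed exactly one index, pandas would read the column as integers and a bare `.split(',')` would raise `AttributeError`, which is why the sketch converts each cell with `str()` first.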


## Indexing


Below, the `Indexer` takes a model checkpoint and writes a (compressed) index to disk. We then prepare a `Searcher` for retrieval from this index.

In [None]:
nbits = 2   # encode each dimension with 2 bits
doc_maxlen = 300 # truncate passages at 300 tokens
# max_id = 10000

index_name = f'ML_Edge.{nbits}bits'

In [None]:
index_name

'ML_Edge.2bits'
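
To give a feel for what "2 bits per dimension" means, here is a simplified sketch of quantile-based bucketing: each value is replaced by the index of one of `2**nbits` buckets. This is an illustration of the general idea only, not ColBERT's actual compression code; the function name `quantize_residuals` and the random toy data are assumptions made for the example.

```python
import numpy as np

def quantize_residuals(residuals, nbits):
    """Map each value to one of 2**nbits buckets using quantile cutoffs."""
    n_buckets = 2 ** nbits
    # Interior quantiles split the values into roughly equally populated buckets.
    cutoffs = np.quantile(residuals, np.linspace(0, 1, n_buckets + 1)[1:-1])
    codes = np.searchsorted(cutoffs, residuals)  # integers in [0, n_buckets)
    return codes.astype(np.uint8), cutoffs

rng = np.random.default_rng(0)
toy_residuals = rng.normal(size=(5, 128)).astype(np.float32)  # toy data, not real embeddings
codes, cutoffs = quantize_residuals(toy_residuals, nbits=2)
print(codes.shape, len(cutoffs))
```

With `nbits = 2` there are four buckets and three cutoffs, so each dimension can be stored in two bits at the cost of some precision.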

Now run the `Indexer` on the answer collection.

In [None]:
checkpoint = 'colbert-ir/colbertv2.0'

with Run().context(RunConfig(nranks=1, experiment='notebook')):  # nranks specifies the number of GPUs to use
    # kmeans_niters specifies the number of iterations of k-means clustering; 4 is a good and fast default.
    # Consider larger numbers for small datasets.
    config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits, kmeans_niters=4)

    indexer = Indexer(checkpoint=checkpoint, config=config)
    indexer.index(name=index_name, collection=answer_text, overwrite=True)

## Search



In [None]:
# Create the searcher
with Run().context(RunConfig(experiment='notebook')):
    searcher = Searcher(index=index_name, collection=answer_text)


In [None]:
# These are the types of questions we can ask the retriever
questions

In [None]:
query = 'Why do we need machine learning ?' # try with an in-range query or supply your own
print(f"Question: {query}")

# Find the top-3 passages for this query
results = searcher.search(query, k=3)

# Print out the top-k retrieved passages
for passage_id, passage_rank, passage_score in zip(*results):
    print(f"\t [{passage_rank}] \t\t {passage_score:.1f} \t\t {searcher.collection[passage_id]}")

Question: Why do we need machine learning ?

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . Why do we need machine learning ?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([ 101,    1, 2339, 2079, 2057, 2342, 3698, 4083, 1029,  102,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])

	 [1] 		 25.7 		 Machine Learning is an important field of data science because there is too much data in the world for humans to process and Classical Machine Learning is dependent on human Intervention which is a sub-field of AI that uses algorithms trained on data to produce adaptable models to perform tasks  
	 [2] 		 25.1 		 Machine learning is used when the task is simple and structured enough for Machine Lear
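
The scores printed above come from ColBERT's late-interaction scoring, in which every query token is matched to its most similar passage token and the per-token maxima are summed. A minimal sketch of that "MaxSim" operation on hand-made toy vectors (the function name `maxsim_score` and the 2-dimensional embeddings are assumptions for illustration; real embeddings come from the checkpoint):

```python
import numpy as np

def maxsim_score(query_emb, doc_emb):
    """Sum over query tokens of the best similarity against any passage token."""
    sim = query_emb @ doc_emb.T  # shape: (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy unit-length token embeddings.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.6, 0.8]])
score = maxsim_score(q, d)
print(round(score, 2))  # 1.8
```

Because the score is a sum over query tokens, its magnitude depends on the number of query tokens, so scores are best compared across passages for the same query rather than across different queries.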



In [None]:
all_data = []
for passage_id, passage_rank, passage_score in zip(*results):
  data = searcher.collection[passage_id]
  all_data.append(data)


retrieved_response = ''.join(all_data)
retrieved_response

'Machine Learning is an important field of data science because there is too much data in the world for humans to process and Classical Machine Learning is dependent on human Intervention which is a sub-field of AI that uses algorithms trained on data to produce adaptable models to perform tasks  Machine learning is used when the task is simple and structured enough for Machine Learning models, When computational resources are minimal and When model interpretation is requiredMachine learning is branch of Computer Science It focuses on the use of data and algorithms to imitate the way humans learn '
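
Note that joining with an empty string fuses adjacent passages, which is visible above as `requiredMachine`. One possible alternative, sketched here with a hypothetical helper `join_passages`, strips each passage and joins with a single space:

```python
def join_passages(passages, sep=" "):
    """Strip each passage and join with a separator so passage boundaries survive."""
    return sep.join(p.strip() for p in passages if p and p.strip())

demo = [
    "When model interpretation is required",
    "Machine learning is branch of Computer Science ",
]
joined = join_passages(demo)
print(joined)
```

In the notebook this would be called as `join_passages(all_data)`; a newline separator is another reasonable choice if the generator should see each passage on its own line.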

## Generation

In [None]:
# Install the transformers library if you haven't already
!pip install transformers

# Import the T5 and GPT-2 model and tokenizer classes
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [None]:
# Concatenate the query and the retrieved passages into a single prompt
input_text = f"{query} {retrieved_response}"

# Load T5 model and tokenizer
model_name = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Tokenize and generate response
input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
generated_ids = model.generate(input_ids, max_length=150, num_return_sequences=1, num_beams=4, early_stopping=True)




In [None]:
# Decode and print response
generated_response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print("Generated response:", generated_response)

Generated response: Machine Learning is a branch of Computer Science Machine Learning is important because there is too much data in the world for humans to process, When computational resources are minimal and When model interpretation is required Machine Learning is used when the task is simple and structured enough for Machine Learning models, When computational resources are minimal and When computational resources are minimal and When model interpretation is requiredMachine learning is branch of Computer Science It focuses on the use of data and algorithms to imitate the way humans learn. Machine Learning is branch of Computer Science It focuses on
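
The output above largely repeats the retrieved text. T5 checkpoints were trained with textual task prefixes, and one such convention for question answering is `question: ... context: ...`. The sketch below builds a prompt in that shape; the helper name `build_t5_qa_prompt` is hypothetical, and whether this format improves the answer for this dataset is untested here.

```python
def build_t5_qa_prompt(question, context):
    """Format a question and retrieved context in a SQuAD-style T5 prompt."""
    return f"question: {question.strip()} context: {context.strip()}"

prompt = build_t5_qa_prompt(
    "Why do we need machine learning ?",
    "Machine learning is branch of Computer Science ",
)
print(prompt)
```

In the notebook, `input_text = build_t5_qa_prompt(query, retrieved_response)` would replace the plain concatenation. The repetition itself can also be reduced with the `no_repeat_ngram_size` argument of `model.generate`.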


In [None]:
# Load GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Tokenize input and generate response
input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
# A single unpadded sequence: attend to every token.
attention_mask = input_ids.new_ones(input_ids.shape)
generated_ids = model.generate(input_ids, attention_mask=attention_mask, max_length=150, num_return_sequences=1, num_beams=4, early_stopping=True)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
# Decode and print response
generated_response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print("Generated response:", generated_response)

Generated response: Why do we need machine learning? Machine Learning is an important field of data science because there is too much data in the world for humans to process and Classical Machine Learning is dependent on human Intervention which is a sub-field of AI that uses algorithms trained on data to produce adaptable models to perform tasks  Machine learning is used when the task is simple and structured enough for Machine Learning models, When computational resources are minimal and When model interpretation is requiredMachine learning is branch of Computer Science It focuses on the use of data and algorithms to imitate the way humans learn  Machine Learning is a branch of Computer Science It focuses on the use of data and algorithms to imitate the way humans learn Machine Learning is a branch of Computer Science It focuses on the use of data
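
The GPT-2 output begins with the full prompt because a decoder-only model returns the prompt tokens followed by the continuation. A small sketch of separating the two, shown on plain Python lists with made-up token IDs (the helper name `continuation_ids` is an assumption for illustration):

```python
def continuation_ids(generated, prompt_len):
    """Return only the tokens produced after the prompt."""
    return generated[prompt_len:]

prompt_tokens = [101, 102, 103]           # made-up prompt token IDs
generated = prompt_tokens + [7, 8, 9]     # prompt followed by new tokens
new_tokens = continuation_ids(generated, len(prompt_tokens))
print(new_tokens)  # [7, 8, 9]
```

The same slicing works on the tensors in this notebook: `tokenizer.decode(generated_ids[0][input_ids.shape[1]:], skip_special_tokens=True)`. Also note that for GPT-2 `max_length=150` counts the prompt tokens as well; `max_new_tokens` bounds only the continuation.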
