<a href="https://colab.research.google.com/github/afsarahannan/NLP_RAG_Project-/blob/main/Retrieval_Generation_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG Project: ColBert and GPT3.5

If you're working in Google Colab, we recommend selecting "T4 GPU" as your hardware accelerator in the runtime settings.



In [19]:
!git -C ColBERT/ pull || git clone https://github.com/stanford-futuredata/ColBERT.git
import sys; sys.path.insert(0, 'ColBERT/')


Already up to date.


In [None]:
try: # When on google Colab, let's install all dependencies with pip.
    import google.colab
    !pip install -U pip
    !pip install -e ColBERT/['faiss-gpu','torch']
except Exception:
  import sys; sys.path.insert(0, 'ColBERT/')
  try:
    from colbert import Indexer, Searcher
  except Exception:
    print("You're running outside Colab, please make sure you install ColBERT in conda. Conda is recommended.")
    assert False

In [30]:
import colbert

In [31]:
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert.data import Queries, Collection

In [23]:
#load the csv dataset here
import pandas as pd
queries = pd.read_csv("/content/queries.csv", header = None ,sep='\t')
answers = pd.read_csv("/content/answers.csv", header = None ,sep='\t')

In [24]:
questions =[x for x in queries[0]]
answer_index = [x for x in queries[1]]
answer_index_integers = [[int(num) for num in sub_string.split(',')] for sub_string in answer_index]
answer_text = [x for x in answers[0]]
print(f"There are {len(questions)} questions and {len(answer_text)} answers in this notebook.")

There are 72 questions and 83 answers in this notebook.


## Indexing


Below, the `Indexer` take a model checkpoint and writes a (compressed) index to disk. We then prepare a `Searcher` for retrieval from this index.

In [25]:
nbits = 2   # encode each dimension with 2 bits
doc_maxlen = 300 # truncate passages at 300 tokens
# max_id = 10000

index_name = f'ML_Edge.{nbits}bits'

In [None]:
index_name

'ML_Edge.2bits'

Now run the `Indexer` on the collection subset.

In [None]:
checkpoint = 'colbert-ir/colbertv2.0'

with Run().context(RunConfig(nranks=1, experiment='notebook')):  # nranks specifies the number of GPUs to use
    config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits, kmeans_niters=4) # kmeans_niters specifies the number of iterations of k-means clustering; 4 is a good and fast default.
                                                                                # Consider larger numbers for small datasets.

    indexer = Indexer(checkpoint=checkpoint, config=config)
    indexer.index(name=index_name, collection=answer_text, overwrite=True)

## Search



In [None]:
# Create the searcher
with Run().context(RunConfig(experiment='notebook')):
  searcher = Searcher(index=index_name, collection=answer_text, )


In [None]:
#These are the type of sample questions we can ask the retriever
questions

In [None]:
query = 'Why do we need machine learning ?' # try with an in-range query or supply your own
print(f"Question: {query}")

# Find the top-3 passages for this query
results = searcher.search(query, k=3)

# Print out the top-k retrieved passages
for passage_id, passage_rank, passage_score in zip(*results):
    print(f"\t [{passage_rank}] \t\t {passage_score:.1f} \t\t {searcher.collection[passage_id]}")

In [None]:
all_data = []
for passage_id, passage_rank, passage_score in zip(*results):
  data = searcher.collection[passage_id]
  all_data.append(data)


retrieved_response = ''.join(all_data)
retrieved_response

In [None]:
all_data

In [None]:
retrieved_response

## Generation

### Generation with pretrained t5-base model

In [None]:
# Install the transformers library
!pip install transformers

In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the fine-tuned model
model_name = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
model_max_length = 300

# Generate text based on a prompt
prompt = f"{query} {retrieved_response}"
inputs = tokenizer.encode(prompt, return_tensors="pt", max_length=model_max_length, truncation=True)

# Generate text with custom parameters
output = model.generate(
    input_ids=inputs,
    max_length=120,
    num_return_sequences=3,
    temperature=0.2,
    top_k=100,
    top_p=0.95,
    repetition_penalty=2.0,
    length_penalty=0.5,
    num_beams=3
)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)



In [53]:
print(generated_text)

Machine Learning is important because there is too much data in the world for humans to process, When computational resources are minimal and When model interpretation is required Machine learning is branch of Computer Science It focuses on the use of data and algorithms to imitate the way humans learn.


### Generation with pretrained GPT2 model

In [50]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the fine-tuned model
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
# Generate text based on a prompt
prompt = f"{query} {retrieved_response}"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Generate text with custom parameters
output = model.generate(
    input_ids,
    max_length=112,             # Maximum length of the generated text
    num_return_sequences=3,    # Number of independent sequences to generate
    temperature=0.2,           #(0 to 1) Controls the randomness of the generated text the higher the number the more creative the answer
    top_k=100,                 #(10 to 100) Controls the diversity of the generated text the greater the number the more tokens will be considered to generate the response
    top_p=0.95,                #(0.1 to 0.95) Controls the nucleus sampling for the generated text the higher the number the greater the probabilty of selecting tokens that relate to a different probability distribution.
    repetition_penalty=2.0,    #(1.0 to 2.0 value must be float) Penalizes repeated sequences in the generated text the greater the number the more restricted the model in repeating the same sequence
    length_penalty=0.5,        #(0.5 to 2.0) Controls the trade-off between length and quality the greater the value the longer the genrated answer which will inturn compromise the quality of the text
    num_beams=3                #(1 to 10) Number of beams for beam search decoding (set to 1 for greedy decoding) the greater the value the higher the diversity in the answer.
)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [51]:
print(generated_text)

Why do we need machine learning? Machine Learning is an important field of data science because there is too much data in the world for humans to process and Classical Machine Learning is dependent on human Intervention which is a sub-field of AI that uses algorithms trained on data to produce adaptable models to perform tasks Machine learning is used when the task is simple and structured enough for Machine Learning models, When computational resources are minimal and When model interpretation is required Machine learning is branch of Computer Science It focuses on the use of data and algorithms to imitate the way humans learn.



### Testing with Langchain GPT-3.5

NOTE: This requires an API key which will be available for 3 months starting from 14th Feb 2024

In [None]:
!pip install langchain
!pip install langchain_community tiktoken langchain-openai langchainhub chromadb langchain
!pip install openai

In [None]:
import os
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = "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"

In [None]:
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

In [None]:
# Prompt
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt

ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template='Answer the question based only on the following context:\n{context}\n\nQuestion: {question}\n'))])

In [None]:
# LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key="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")

In [None]:
# Chain
chain = prompt | llm

In [None]:
query

'Why do we need machine learning ?'

In [None]:
# Run
chain.invoke({"context":all_data,"question":query})

AIMessage(content='We need machine learning because there is too much data in the world for humans to process, and machine learning can help process and analyze this data. Additionally, machine learning models can be adaptable and perform tasks without human intervention. Machine learning is also useful when the task is simple and structured enough for machine learning models, when computational resources are minimal, and when model interpretation is required.')