# Document-Based Question Answering Using RAG

Retrieval-augmented generation (RAG) is an AI framework that retrieves facts from an external knowledge base to ground large language models (LLMs) in accurate, up-to-date information and to give users insight into the models' generative process.

RAG is useful for question-answering tasks where the model needs to provide informative and contextually relevant responses. By combining retrieval and generation techniques, RAG aims to achieve a balance between factual accuracy and creative language generation.

This notebook presents a question-answering system built on large language models (LLMs), using the LangChain framework and Hugging Face libraries. The embedding model is [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), a sentence-transformers model that maps sentences and paragraphs into a 384-dimensional dense vector space. The generator is Flan-T5, an instruction-tuned variant of [T5](https://huggingface.co/docs/transformers/model_doc/t5), which is an encoder-decoder transformer that treats every task as text-to-text.

In [1]:
# Import related libraries and packages

# !pip install unstructured_inference
# !pip install pikepdf
# !pip install pypdf

import nltk
import torch
from langchain.document_loaders import UnstructuredPDFLoader # pip install unstructured
from langchain.text_splitter import CharacterTextSplitter
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline as hfpipeline

# Download the sentence tokenizer data used by nltk.sent_tokenize
# (assumption: newer NLTK releases may require 'punkt_tab' instead of 'punkt')
nltk.download('punkt')

In [None]:
# Split text into chunks
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

# Choose the model to create embeddings
retriever = SentenceTransformer('all-MiniLM-L6-v2') # maps text to vectors

# Read the pdf file
# Paper: Ozek et al. - Analysis of pain research literature through keyword co-occurrence networks
loader = UnstructuredPDFLoader("KCN_PLOSDigitalHealth.pdf")
data = loader.load()

# Preprocess text
docs = text_splitter.split_documents(data)

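# Flatten line breaks into spaces
replace_newline = lambda text: text.replace('\n', ' ')
# Rejoin words that were hyphenated across line breaks ("exam- ple" -> "example")
replace_dash_space = lambda text: text.replace('- ', '')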

modified_texts_temp = [ replace_dash_space(replace_newline(doc.page_content)) for doc in docs]

# Convert documents to sentences
modified_texts = []
for paragraph in modified_texts_temp:
    modified_texts.extend(nltk.sent_tokenize(paragraph))

# Create embeddings
corpus_embeddings = retriever.encode(modified_texts, convert_to_tensor=True)
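
The `chunk_size` and `chunk_overlap` settings above control how much text each chunk holds and how much neighbouring chunks share. The sketch below illustrates that idea with a plain sliding character window. It is a simplified assumption for illustration only, not LangChain's actual algorithm: `CharacterTextSplitter` first splits on the separator and then merges the pieces. The function name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; not part of LangChain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy input with small sizes so the overlap is easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk when the overlap is large enough.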

In [14]:
# Input query
query = "Is Machine learning trending?"

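# Encode the query into the same vector space as the corpus
query_embedding = retriever.encode(query, convert_to_tensor=True)
# Cosine similarity between the query and every corpus sentence
# ([0] selects the row of scores for the single query)
cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]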

# Find relevant top three passages (vectors)
top_results = torch.topk(cos_scores, k=3)

corpus_results = []
for score, idx in zip(top_results[0], top_results[1]):
    print(modified_texts[idx], "(Score: {:.4f})\n".format(score))
    corpus_results.append(modified_texts[idx])


Specifically, machine learning and biomarkers have become one of the leading research topics. (Score: 0.5406)

It shows an  increasing trend with time. (Score: 0.5147)

On the other hand, there is no such upward trend in the time windows 2002–2006 and 2007–2011; the trend is flat, but the average weighted nearest neighbor’s degree fluctuates with the degree. (Score: 0.4684)
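
The retrieval cell above ranks sentences by cosine similarity and keeps the three highest-scoring ones. To make that step concrete, here is a minimal pure-Python sketch of the same idea on toy two-dimensional vectors. The helper names `cosine_similarity` and `top_k` and the toy vectors are assumptions for illustration; the notebook itself relies on `util.cos_sim` and `torch.topk` over 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus_vecs, k=3):
    """Return the k (score, index) pairs with the highest similarity."""
    scored = [(cosine_similarity(query_vec, v), i)
              for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors standing in for sentence embeddings
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.0]
results = top_k(toy_query, toy_corpus, k=2)
for score, idx in results:
    print(idx, round(score, 4))
```

Because cosine similarity depends only on direction, a vector pointing the same way as the query scores 1 regardless of its length, while an opposite vector scores -1.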



In [15]:
# Choose the generator
# Flan-T5, an encoder-decoder transformer, answers the query based on the retrieved context
sentence_generator_model = 'google/flan-t5-base'
task = 'text2text-generation'

# The task's default model is t5-base; here we pass google/flan-t5-base explicitly
generator = hfpipeline(task, model=sentence_generator_model)

In [16]:
# Create the format of the prompt
context = [f"{m} " for m in corpus_results]
context = " ".join(context)
query_formatted = f"Answer the following question based on the context. Question: {query} context: {context}"

print("Formatted query: \n" , query_formatted)

answer = generator(query_formatted)
print("\n\nAnswer:", answer)

Formatted query: 
 Answer the following question based on the context. Question: Is Machine learning trending? context: Specifically, machine learning and biomarkers have become one of the leading research topics.  It shows an  increasing trend with time.  On the other hand, there is no such upward trend in the time windows 2002–2006 and 2007–2011; the trend is flat, but the average weighted nearest neighbor’s degree fluctuates with the degree. 






Answer: [{'generated_text': 'an increasing trend'}]
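
The prompt formatting and answer handling above can be wrapped in two small helpers. This is a sketch under stated assumptions: `build_prompt` and `extract_answer` are hypothetical names, and `sample_output` simply copies the structure printed above (a list holding one dict with a `generated_text` key) so the block runs without loading the model. Unlike the cell above, the sketch strips each passage before joining, so the context has single spaces between sentences.

```python
def build_prompt(question, passages):
    """Combine the question and retrieved passages into one prompt string.

    Hypothetical helper that mirrors the prompt template used in the notebook.
    """
    context = " ".join(p.strip() for p in passages)
    return ("Answer the following question based on the context. "
            f"Question: {question} context: {context}")


def extract_answer(pipeline_output):
    """Pull the generated string out of the pipeline's list-of-dicts output."""
    return pipeline_output[0]["generated_text"]


# Output copied from the run above, used here in place of a real model call
sample_output = [{"generated_text": "an increasing trend"}]

prompt = build_prompt("Is Machine learning trending?",
                      ["It shows an increasing trend with time. "])
print(prompt)
print(extract_answer(sample_output))
```

With the real pipeline, `extract_answer(generator(prompt))` would return just the answer text instead of the raw list.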


References:
- https://research.ibm.com/blog/retrieval-augmented-generation-RAG